Mirror of https://github.com/huggingface/text-generation-inference.git (synced 2025-09-10 20:04:52 +00:00)
Merge pull request #2 from ohmytofu-ai/chris-aeviator-patch-1
Update README.md
commit da1cfea208 · README.md (11 lines changed)

@@ -7,6 +7,7 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f

## Goals

- ☑️ Loads Llama 2 7B in 4-bit on a Pascal GPU (GTX 1080)
- Support model loading from wherever you want (HDFS, S3, HTTPS, …)
- Support adapters (LoRA/PEFT) without merging (possibly huge) checkpoints and uploading them to 🤗
- Support last-gen GPUs (hopefully back to Pascal)

@@ -16,6 +17,16 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f

`</endOfMissionStatement>`

# 🦙 Llama 2 in 4-bit

To use Llama 2 7B on a 1080 (Pascal generation, compute capability 6.1):

1) Install this repository via `make install`
2) Modify the `run-dev` section of `server/Makefile`, changing the `/mnt/TOFU/HF_MODELS/` path to a path where you have downloaded a HF model via `git lfs clone https://huggingface.co/[repo]/[model]`; the model will then be loaded from e.g. `/data/models/Llama-2-7b-chat-hf` (see the sketches after this list)
3) Open two terminals
4) Terminal 1: `make router-dev` (starts the router, which exposes the model at localhost:8080)
5) Terminal 2: `make server-dev` (starts the model server and loads the model onto the GPU)
6) Test the model by calling it with curl: `curl localhost:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90}}' -H 'Content-Type: application/json'` (the full command and expected response appear below)
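
For step 2, here is a minimal sketch of fetching the weights. The `meta-llama/Llama-2-7b-chat-hf` repo id and the `/data/models` target path are illustrative assumptions; substitute your own `[repo]/[model]` (the official Llama 2 weights also require accepting Meta's license on the Hub):

```bash
# Sketch for step 2 (assumed repo id and target path -- substitute your own).
# Requires git-lfs to be installed.
git lfs install
git lfs clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf /data/models/Llama-2-7b-chat-hf
```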
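
And a hedged sketch of what the edited `run-dev` target in `server/Makefile` might look like. The `serve` arguments shown are assumptions modeled on upstream TGI's server Makefile; keep whatever flags your checkout already has and swap only the model path:

```make
# Hypothetical run-dev target after the edit in step 2. The cli.py flags are
# assumptions -- only the model-path substitution is the point here.
run-dev:
	python text_generation_server/cli.py serve /data/models/Llama-2-7b-chat-hf --quantize bitsandbytes
```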
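
The full test call from step 6, with the shape of a successful response (the generated text itself will vary):

```bash
# Step 6: query the router once both terminals are running.
curl localhost:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90}}' \
    -H 'Content-Type: application/json'
# A successful call returns JSON of the form:
# {"generated_text":" Deep Learning is a subset of machine learning ..."}
```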
<div align="center">


</div>