diff --git a/README.md b/README.md
index 3b71f602..a1a1d867 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,7 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f
 ## Goals
 
+- ☑️ Loads Llama 2 in 4-bit on a Pascal GPU (GTX 1080, Llama 2 7B)
 - Support Model loading from wherever you want (HDFS, S3, HTTPS, …)
 - Support Adapters (LORA/PEFT) without merging (possibly huge) Checkpoints and uploading them to 🤗
 - Support last Gen GPUS (back to Pascal hopefully)
@@ -16,6 +17,16 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f
 ``
 
+
+# 🦙 Llama 2 in 4-bit
+
+To use Llama 2 7B on a GTX 1080 (Pascal generation, compute capability 6.1):
+1) Install this repository via `make install`.
+2) In the `run-dev` section of `server/Makefile`, change the `/mnt/TOFU/HF_MODELS/` path to a directory where you have downloaded an HF model via `git lfs clone https://huggingface.co/[repo]/[model]`; the model will then be loaded from e.g. `/data/models/Llama-2-7b-chat-hf`.
+3) Open two terminals.
+4) In terminal 1, run `make router-dev` (starts the router, which exposes the model at localhost:8080).
+5) In terminal 2, run `make server-dev` (starts the model server and loads the model onto the GPU).
+6) Test the model by calling it with curl: `curl localhost:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90}}' -H 'Content-Type: application/json'`
![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)
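For scripted use instead of the curl call in step 6, here is a minimal Python client sketch. It assumes the router from step 4 is listening at localhost:8080 and that the `requests` package is installed (it is not a dependency of this repository); the `generate` helper name, prompt, and timeout are illustrative, not part of the repo.

```python
# Minimal client sketch for the /generate endpoint started in step 4.
# Assumes the router is running at localhost:8080; install the client
# dependency with `pip install requests` (not shipped with this repo).
import requests

def generate(prompt: str, max_new_tokens: int = 90) -> str:
    # Same request shape as the curl example in step 6.
    resp = requests.post(
        "http://localhost:8080/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,  # generation on a GTX 1080 can take a while
    )
    resp.raise_for_status()
    # TGI's non-streaming /generate returns a JSON object with a
    # "generated_text" field containing the completion.
    return resp.json()["generated_text"]

if __name__ == "__main__":
    print(generate("What is Deep Learning?"))
```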