Merge pull request #2 from ohmytofu-ai/chris-aeviator-patch-1

Update README.md
chris-aeviator 2023-08-27 14:55:54 +02:00 committed by GitHub
commit da1cfea208


@@ -7,6 +7,7 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f
## Goals
- ☑️ Loads Llama 2 in 4-bit on a Pascal GPU (GTX 1080, Llama 2 7B)
- Support model loading from wherever you want (HDFS, S3, HTTPS, …)
- Support adapters (LoRA/PEFT) without merging (possibly huge) checkpoints and uploading them to 🤗
- Support last-gen GPUs (back to Pascal, hopefully)
@@ -16,6 +17,16 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f
`</endOfMissionStatement>`
# 🦙 Llama 2 in 4-bit
To use Llama 2 7B on a GTX 1080 (Pascal generation, compute capability 6.1; a quick way to verify this is shown after the list):
1) Install this repository via `make install`
2) Modify the `run-dev` section of `server/Makefile` and change the `/mnt/TOFU/HF_MODELS/` path to a path where you have downloaded an HF model via `git lfs clone https://huggingface.co/[repo]/[model]`, e.g. so that the model is loaded from `/data/models/Llama-2-7b-chat-hf` (a download sketch follows the list)
3) Open two terminals
4) Terminal 1: `make router-dev` (starts the router that exposes the model at localhost:8080)
5) Terminal 2: `make server-dev` (starts the model server and loads the model onto the GPU)
6) Test the model by calling it with curl (further example calls follow below): `curl localhost:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90}}' -H 'Content-Type: application/json'`
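
To confirm your card really is the Pascal/6.1 target, you can query it directly; the `compute_cap` field is supported by `nvidia-smi` on reasonably recent NVIDIA drivers:

```bash
# Print the GPU name and its compute capability (requires a recent driver).
# A GTX 1080 should report compute capability 6.1.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```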
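A minimal sketch of the model download from step 2, using `meta-llama/Llama-2-7b-chat-hf` as an example repository (this particular repo is gated, so your Hugging Face account needs access; any other Llama 2 checkpoint works the same way):

```bash
# Fetch an HF model with git-lfs into the path referenced by run-dev
mkdir -p /data/models && cd /data/models
git lfs install
git lfs clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
# run-dev should now point at /data/models/Llama-2-7b-chat-hf
```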
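Two more example calls once both terminals are up. The `generated_text` response field and the `/generate_stream` SSE route come from upstream TGI's API and are assumed to be unchanged in this fork:

```bash
# Plain generation: the server answers with a JSON object whose
# "generated_text" field holds the completion.
curl localhost:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90,"temperature":0.7}}' \
  -H 'Content-Type: application/json'

# Token-by-token streaming via server-sent events.
curl localhost:8080/generate_stream \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90}}' \
  -H 'Content-Type: application/json'
```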
<div align="center">

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)

</div>