Mirror of https://github.com/huggingface/text-generation-inference.git (synced 2025-09-10 20:04:52 +00:00)
Merge pull request #2 from ohmytofu-ai/chris-aeviator-patch-1
Update README.md
commit da1cfea208 · README.md (11 lines changed)

@@ -7,6 +7,7 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f

## Goals

- ☑️ Loads Llama 2 7B in 4-bit on a Pascal GPU (GTX 1080)
- Support model loading from wherever you want (HDFS, S3, HTTPS, …)
- Support adapters (LoRA/PEFT) without merging (possibly huge) checkpoints and uploading them to 🤗
- Support last-gen GPUs (hopefully back to Pascal)

@@ -16,6 +17,16 @@ TGI is well suited for distributed/ cloud burst/ on-demand workloads, yet HF's f

`</endOfMissionStatement>`

# 🦙 Llama 2 in 4-bit

To use Llama 2 7B on a 1080 (Pascal generation, compute capability 6.1):

1) Install this repository via `make install`
2) Modify the `run-dev` section of `server/Makefile`, changing the `/mnt/TOFU/HF_MODELS/` path to a path where you have downloaded a HF model via `git lfs clone https://huggingface.co/[repo]/[model]`; the model will then be loaded from e.g. `/data/models/Llama-2-7b-chat-hf` (see the sketches after this list)
3) Open two terminals
4) Terminal 1: `make router-dev` (starts the router, which exposes the model at localhost:8080)
5) Terminal 2: `make server-dev` (starts the model server and loads the model onto the GPU)
6) Test the model by calling it with curl: `curl localhost:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90}}' -H 'Content-Type: application/json'` (the full command and expected response appear below)
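
For step 2, here is a minimal sketch of fetching the weights. The `meta-llama/Llama-2-7b-chat-hf` repo id and the `/data/models` target path are illustrative assumptions; substitute your own `[repo]/[model]` (the official Llama 2 weights also require accepting Meta's license on the Hub):

```bash
# Sketch for step 2 (assumed repo id and target path -- substitute your own).
# Requires git-lfs to be installed.
git lfs install
git lfs clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf /data/models/Llama-2-7b-chat-hf
```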
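
And a hedged sketch of what the edited `run-dev` target in `server/Makefile` might look like. The `serve` arguments shown are assumptions modeled on upstream TGI's server Makefile; keep whatever flags your checkout already has and swap only the model path:

```make
# Hypothetical run-dev target after the edit in step 2. The cli.py flags are
# assumptions -- only the model-path substitution is the point here.
run-dev:
	python text_generation_server/cli.py serve /data/models/Llama-2-7b-chat-hf --quantize bitsandbytes
```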
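
The full test call from step 6, with the shape of a successful response (the generated text itself will vary):

```bash
# Step 6: query the router once both terminals are running.
curl localhost:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":90}}' \
    -H 'Content-Type: application/json'
# A successful call returns JSON of the form:
# {"generated_text":" Deep Learning is a subset of machine learning ..."}
```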
<div align="center">


</div>