Update README.md

Omar Sanseviero 2023-09-29 12:20:26 +02:00 committed by GitHub
parent 195008d621
commit 5e68cf1260
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -1,16 +1,23 @@
<div align="center">

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>

A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoint.

</div>

## Table of contents

- [Get Started](#get-started)
- [API Documentation](#api-documentation)
- [Using a private or gated model](#using-a-private-or-gated-model)
@ -24,7 +31,9 @@ to power Hugging Chat, the Inference API and Inference Endpoint.
- [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:

- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
@ -39,75 +48,112 @@ Text Generation Inference (TGI) is a toolkit for deploying and serving Large Lan
- Log probabilities
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
## Get Started

### Docker

For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way to get started is to use the official Docker container:

```shell
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
```
You can then make requests such as:

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
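The same request can be sent from a script. The sketch below mirrors the curl call using only the Python standard library; the function names are illustrative and not part of the project, and it assumes the container from the Docker example is listening on `127.0.0.1:8080`. Building the request separately from sending it makes the payload easy to inspect without a running server; with the server up, `generate("What is Deep Learning?")` returns the decoded JSON response.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, base_url="http://127.0.0.1:8080"):
    """Build the same POST request as the curl example, without sending it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running server and return the decoded JSON body."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))
```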
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Note that CPU is not the intended platform for this project, so performance might be subpar.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI):

```
text-generation-launcher --help
```
### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).
### Using a private or gated model

You can use the `HUGGING_FACE_HUB_TOKEN` environment variable to configure the token used by
`text-generation-inference`. This gives it access to protected resources.

For example, if you want to serve the gated Llama 2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HUGGING_FACE_HUB_TOKEN=<your cli READ token>`

Or with Docker:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
```
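If you launch the container from a script, one option is to keep the token out of the command line entirely. The sketch below is an illustration, not part of the project: the helper names are hypothetical, and it relies on Docker forwarding a variable from the caller's environment when `-e` is given a name without a value. After exporting the token, the returned list can be passed to `subprocess.run(..., check=True)`.

```python
import os

# Image tag taken from the Docker example above.
IMAGE = "ghcr.io/huggingface/text-generation-inference:1.1.0"


def check_token(env=os.environ):
    """Fail early if the token is missing instead of failing at download time."""
    if not env.get("HUGGING_FACE_HUB_TOKEN"):
        raise RuntimeError("HUGGING_FACE_HUB_TOKEN is not set")


def docker_run_args(model, volume, image=IMAGE):
    """Return the argv for the `docker run` command shown above."""
    return [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        # No "=value" here: Docker copies the variable from the caller's environment,
        # so the token never appears in the argument list.
        "-e", "HUGGING_FACE_HUB_TOKEN",
        "-p", "8080:80",
        "-v", f"{volume}:/data",
        image,
        "--model-id", model,
    ]
```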
### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.
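
To show where the volume and its mount sit relative to each other, here is a minimal sketch that builds a Pod manifest as a Python dictionary. The Pod and container names are hypothetical, GPU resource requests are deliberately omitted, and passing `--model-id` through `args` assumes the image's entrypoint is the launcher, as in the `docker run` examples. `kubectl` accepts JSON manifests, so the result of `json.dumps(tgi_pod_manifest("tiiuae/falcon-7b-instruct"))` can be piped to `kubectl apply -f -`.

```python
import json

# Image tag taken from the Docker example above.
IMAGE = "ghcr.io/huggingface/text-generation-inference:1.1.0"


def tgi_pod_manifest(model_id, shm_size="1Gi"):
    """Build a minimal Pod manifest with a memory-backed volume at /dev/shm."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "text-generation-inference"},  # hypothetical name
        "spec": {
            "containers": [{
                "name": "text-generation-inference",  # hypothetical name
                "image": IMAGE,
                "args": ["--model-id", model_id],
                # The mount refers to the volume below by its name.
                "volumeMounts": [{"name": "shm", "mountPath": "/dev/shm"}],
            }],
            "volumes": [{
                "name": "shm",
                "emptyDir": {"medium": "Memory", "sizeLimit": shm_size},
            }],
        },
    }
```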

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.
### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address of an OTLP collector with the `--otlp-endpoint` argument.
### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9
conda activate text-generation-inference
```
You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
@ -115,46 +161,74 @@ sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```
On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-falcon-7b-instruct
```

**Note:** On some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```
### CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can disable
the kernels by using the `DISABLE_CUSTOM_KERNELS=True` environment variable.

Be aware that the official Docker image has them enabled by default.
## Optimized architectures

TGI works out of the box to serve optimized models in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`
## Run Falcon

### Run

```shell
make run-falcon-7b-instruct
```

### Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-falcon-7b-instruct-quantize
```

4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
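
As a rough illustration of why quantization lowers the VRAM requirement, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of 7 billion is an assumption (the nominal size of a "7B" model), and the estimate ignores activations, the key/value cache, and any per-block quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: nominal parameter count of a "7B" model

# Compare 16-bit weights with 8-bit and 4-bit quantized weights.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```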
## Develop

```shell
make server-dev
make router-dev
```
## Testing

```shell
# python
make python-server-tests
@ -165,4 +239,4 @@ make python-tests
make rust-tests
# integration tests
make integration-tests
```