diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 1e8d8ac42..9bebe8af3 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -1,18 +1,16 @@
 - sections:
   - local: index
     title: Text Generation Inference
-  - local: basic_tutorials/install
-    title: Installation
   - local: quicktour
     title: Quick Tour
   - local: supported_models
     title: Supported Models and Hardware
   title: Getting started
 - sections:
-  - local: basic_tutorials/running_locally
-    title: Running Locally
-  - local: basic_tutorials/running_docker
-    title: Running with Docker
+  - local: basic_tutorials/local_launch
+    title: Installing and Launching Locally
+  - local: basic_tutorials/docker_launch
+    title: Launching with Docker
   - local: basic_tutorials/consuming_TGI
     title: Consuming TGI as a backend
   - local: basic_tutorials/consuming_TGI
diff --git a/docs/source/basic_tutorials/docker_launch.md b/docs/source/basic_tutorials/docker_launch.md
new file mode 100644
index 000000000..1a6493703
--- /dev/null
+++ b/docs/source/basic_tutorials/docker_launch.md
@@ -0,0 +1,52 @@
+# Launching with Docker
+
+The easiest way to get started is with the official Docker container:
+
+```shell
+model=tiiuae/falcon-7b-instruct
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.0 --model-id $model
+```
+
+**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.
+
+You can then query the model using either the `/generate` or `/generate_stream` routes:
+
+```shell
+curl 127.0.0.1:8080/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
+```shell
+curl 127.0.0.1:8080/generate_stream \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
+or from Python:
+
+```shell
+pip install text-generation
+```
+
+```python
+from text_generation import Client
+
+client = Client("http://127.0.0.1:8080")
+print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)
+
+text = ""
+for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
+    if not response.token.special:
+        text += response.token.text
+print(text)
+```
+
+To see all the options for serving your models, check the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or use the CLI:
+```
+text-generation-launcher --help
+```
\ No newline at end of file
diff --git a/docs/source/basic_tutorials/installation.md b/docs/source/basic_tutorials/installation.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/source/basic_tutorials/local_launch.md b/docs/source/basic_tutorials/local_launch.md
new file mode 100644
index 000000000..060dc22e3
--- /dev/null
+++ b/docs/source/basic_tutorials/local_launch.md
@@ -0,0 +1,95 @@
+# Installing and Launching Locally
+
+Before you start, you will need to set up your environment and install Text Generation Inference. Text Generation Inference is tested on **Python 3.9+**.
+
+## Local Installation for Text Generation Inference
+
+Text Generation Inference is available on PyPI, conda and GitHub.
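+
+The build steps below are run from the root of the repository, so you will also need a local clone (a minimal sketch; the clone URL matches the repository linked from these docs):
+
+```shell
+# Clone the Text Generation Inference repository and enter it so the
+# Makefile targets used below (e.g. `make install`) are available.
+git clone https://github.com/huggingface/text-generation-inference.git
+cd text-generation-inference
+```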
+
+To install and launch locally, first [install Rust](https://rustup.rs/) and create a Python virtual environment with at least Python 3.9, e.g. using `conda`:
+
+```shell
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+
+conda create -n text-generation-inference python=3.9
+conda activate text-generation-inference
+```
+
+You may also need to install Protoc.
+
+On Linux:
+
+```shell
+PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
+curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
+sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
+sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
+rm -f $PROTOC_ZIP
+```
+
+On MacOS, using Homebrew:
+
+```shell
+brew install protobuf
+```
+
+Then run:
+
+```shell
+BUILD_EXTENSIONS=True make install # Install repository and HF/transformers fork with CUDA kernels
+```
+
+**Note:** On some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:
+
+```shell
+sudo apt-get install libssl-dev gcc -y
+```
+
+Once installation is done, simply run:
+
+```shell
+make run-falcon-7b-instruct
+```
+
+This will serve the Falcon 7B Instruct model on port 8080.
+
+You can then query the model using either the `/generate` or `/generate_stream` routes:
+
+```shell
+curl 127.0.0.1:8080/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
+```shell
+curl 127.0.0.1:8080/generate_stream \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
+or through Python:
+
+```shell
+pip install text-generation
+```
+
+```python
+from text_generation import Client
+
+client = Client("http://127.0.0.1:8080")
+print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)
+
+text = ""
+for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
+    if not response.token.special:
+        text += response.token.text
+print(text)
+```
+
+To see all the options for serving your models, check the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or use the CLI:
+```
+text-generation-launcher --help
+```
\ No newline at end of file
diff --git a/docs/source/index.md b/docs/source/index.md
index 6815f9def..cc5ab9e44 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -2,13 +2,17 @@
 Text-Generation-Inference is an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference implements optimization for all supported model architectures, including:
-- Tensor Parallelism and custom cuda kernels
-- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
-- Quantization with bitsandbytes or gptq
-- Continuous batching of incoming requests for increased total throughput
-- Accelerated weight loading (start-up time) with safetensors
-- Logits warpers (temperature scaling, topk, repetition penalty ...)
-- Watermarking with A Watermark for Large Language Models
-- Stop sequences, Log probabilities
+- Serve the most popular Large Language Models with a simple launcher
+- Tensor Parallelism for faster inference on multiple GPUs
 - Token streaming using Server-Sent Events (SSE)
+- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
+- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
+- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
+- [Safetensors](https://github.com/huggingface/safetensors) weight loading
+- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
+- Logits warper (temperature scaling, top-p, top-k, repetition penalty; see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor) for more details)
+- Stop sequences
+- Log probabilities
+- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
+
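+For example, quantization can be enabled when launching a model by passing the launcher's `--quantize` flag (a sketch reusing the Docker command from the launch guide; check `text-generation-launcher --help` for the authoritative flag list):
+
+```shell
+# Sketch: serve Falcon 7B Instruct with 8-bit bitsandbytes quantization.
+# `--quantize gptq` is the analogous option for GPTQ checkpoints.
+model=tiiuae/falcon-7b-instruct
+volume=$PWD/data
+
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
+    ghcr.io/huggingface/text-generation-inference:1.0.0 \
+    --model-id $model --quantize bitsandbytes
+```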