From afc747337a5beb35c492fbb5fdaa5de1da9d20f1 Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Thu, 16 May 2024 11:43:40 +0000
Subject: [PATCH] documentation
---
docs/source/_toctree.yml | 10 ++++++-
docs/source/installation.md | 8 ++++--
docs/source/installation_amd.md | 38 ++++++++++++++++++++++++++
docs/source/installation_gaudi.md | 3 ++
docs/source/installation_inferentia.md | 3 ++
docs/source/installation_nvidia.md | 18 ++++++++++++
docs/source/quicktour.md | 21 ++++++--------
docs/source/supported_models.md | 14 ----------
8 files changed, 86 insertions(+), 29 deletions(-)
create mode 100644 docs/source/installation_amd.md
create mode 100644 docs/source/installation_gaudi.md
create mode 100644 docs/source/installation_inferentia.md
create mode 100644 docs/source/installation_nvidia.md
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index c815b535..a52dd7f3 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -3,8 +3,16 @@
title: Text Generation Inference
- local: quicktour
title: Quick Tour
+ - local: installation_nvidia
+ title: Using TGI with Nvidia GPUs
+ - local: installation_amd
+ title: Using TGI with AMD GPUs
+ - local: installation_gaudi
+ title: Using TGI with Intel Gaudi
+ - local: installation_inferentia
+ title: Using TGI with AWS Inferentia
- local: installation
- title: Installation
+ title: Installation from source
- local: supported_models
title: Supported Models and Hardware
- local: messages_api
diff --git a/docs/source/installation.md b/docs/source/installation.md
index 3e62102d..b6c24d55 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -1,6 +1,10 @@
-# Installation
+# Installation from source
-This section explains how to install the CLI tool as well as installing TGI from source. **The strongly recommended approach is to use Docker, as it does not require much setup. Check [the Quick Tour](./quicktour) to learn how to run TGI with Docker.**
+
+
+Installing TGI from source is not the recommended approach. We strongly recommend using TGI through Docker; check the [Quick Tour](./quicktour), [Installation for Nvidia GPUs](./installation_nvidia), and [Installation for AMD GPUs](./installation_amd) guides to learn how to run TGI with Docker.
+
+
## Install CLI
diff --git a/docs/source/installation_amd.md b/docs/source/installation_amd.md
new file mode 100644
index 00000000..279b1e6e
--- /dev/null
+++ b/docs/source/installation_amd.md
@@ -0,0 +1,38 @@
+# Using TGI with AMD GPUs
+
+TGI is supported and tested on AMD Instinct MI210, MI250 and MI300 GPUs. Support may be extended in the future. The recommended usage is through Docker; make sure to check the [AMD documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html) on how to use Docker with AMD GPUs.
+
+On a server powered by AMD GPUs, TGI can be launched with the following command:
+
+```bash
+model=teknium/OpenHermes-2.5-Mistral-7B
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+ --device=/dev/kfd --device=/dev/dri --group-add video \
+ --ipc=host --shm-size 256g --net host -v $volume:/data \
+ ghcr.io/huggingface/text-generation-inference:1.4-rocm \
+ --model-id $model
+```
+
+The launched TGI server can then be queried from clients; make sure to check out the [Consuming TGI](./basic_tutorials/consuming_tgi) guide.
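+
+Once the server is ready, it can be queried over HTTP. Below is a minimal sketch assuming the launch command above: with `--net host`, TGI listens on the container's default port (80 in the official image) directly on the host, so adjust the URL if you change the port or networking options.
+
+```bash
+# Query the `generate` endpoint of the TGI server launched above.
+curl 127.0.0.1:80/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```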
+
+## TunableOp
+
+TGI's Docker image for AMD GPUs integrates [PyTorch's TunableOp](https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable), which allows an additional warmup step to select the best-performing matrix multiplication (GEMM) kernel from rocBLAS or hipBLASLt.
+
+Experimentally, on MI300X, we noticed a 6-8% latency improvement when using TunableOp on top of ROCm 6.1 and PyTorch 2.3.
+
+TunableOp is disabled by default as the warmup may take 1-2 minutes. To enable TunableOp, please pass `--env PYTORCH_TUNABLEOP_ENABLED="1"` when launching TGI's Docker container.
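+
+For example, the launch command from the top of this page becomes the following (a sketch; only the `--env` flag is added):
+
+```bash
+model=teknium/OpenHermes-2.5-Mistral-7B
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+# Same launch command as above, with TunableOp enabled through the environment variable.
+docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+    --device=/dev/kfd --device=/dev/dri --group-add video \
+    --ipc=host --shm-size 256g --net host -v $volume:/data \
+    --env PYTORCH_TUNABLEOP_ENABLED="1" \
+    ghcr.io/huggingface/text-generation-inference:1.4-rocm \
+    --model-id $model
+```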
+
+## Flash attention implementation
+
+Two implementations of Flash Attention are available for ROCm: the first is [ROCm/flash-attention](https://github.com/ROCm/flash-attention), based on a [Composable Kernel](https://github.com/ROCm/composable_kernel) (CK) implementation, and the second is a [Triton implementation](https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/flash_attn_triton.py).
+
+The Triton implementation is used by default, as it has experimentally shown better performance. It can be disabled (falling back to the CK implementation) by passing `--env ROCM_USE_FLASH_ATTN_V2_TRITON="0"` when launching TGI's Docker container.
+
+## Unsupported features
+
+The following features are currently not supported in the ROCm version of TGI, and support may be extended in the future:
+* Loading [AWQ](https://huggingface.co/docs/transformers/quantization#awq) checkpoints.
+* Kernel for sliding window attention (Mistral)
\ No newline at end of file
diff --git a/docs/source/installation_gaudi.md b/docs/source/installation_gaudi.md
new file mode 100644
index 00000000..1ddf2b47
--- /dev/null
+++ b/docs/source/installation_gaudi.md
@@ -0,0 +1,3 @@
+# Using TGI with Intel Gaudi
+
+Check out this [repository](https://github.com/huggingface/tgi-gaudi) to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index).
diff --git a/docs/source/installation_inferentia.md b/docs/source/installation_inferentia.md
new file mode 100644
index 00000000..0394e6de
--- /dev/null
+++ b/docs/source/installation_inferentia.md
@@ -0,0 +1,3 @@
+# Using TGI with Inferentia
+
+Check out this [guide](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference) on how to serve models with TGI on Inferentia2.
diff --git a/docs/source/installation_nvidia.md b/docs/source/installation_nvidia.md
new file mode 100644
index 00000000..9eb9fd4d
--- /dev/null
+++ b/docs/source/installation_nvidia.md
@@ -0,0 +1,18 @@
+# Using TGI with Nvidia GPUs
+
+TGI optimized models are supported on NVIDIA [H100](https://www.nvidia.com/en-us/data-center/h100/), [A100](https://www.nvidia.com/en-us/data-center/a100/), [A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) and [T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) GPUs with CUDA 12.2+. Note that you have to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to use TGI's Docker image on NVIDIA GPUs.
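+
+As a quick sanity check (a sketch; the `ubuntu` image is only an example, since the toolkit mounts the host's `nvidia-smi` utility into the container), you can verify that Docker sees your GPUs with:
+
+```bash
+# If the NVIDIA Container Toolkit is set up correctly, this prints the host's GPUs.
+docker run --rm --gpus all ubuntu nvidia-smi
+```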
+
+For other NVIDIA GPUs, continuous batching will still apply, but some operations like flash attention and paged attention will not be executed.
+
+TGI can be used on NVIDIA GPUs through its official Docker image:
+
+```bash
+model=teknium/OpenHermes-2.5-Mistral-7B
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+docker run --gpus all --shm-size 64g -p 8080:80 -v $volume:/data \
+ ghcr.io/huggingface/text-generation-inference:1.4 \
+ --model-id $model
+```
+
+The launched TGI server can then be queried from clients; make sure to check out the [Consuming TGI](./basic_tutorials/consuming_tgi) guide.
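+
+As a minimal sketch assuming the launch command above (which maps the host's port 8080 to the container's port 80), the `generate` endpoint can be queried with `curl`:
+
+```bash
+# Query the `generate` endpoint exposed on the host's port 8080.
+curl 127.0.0.1:8080/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```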
diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
index 70cf575c..e1e8f200 100644
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@@ -2,30 +2,27 @@
The easiest way of getting started is using the official Docker container. Install Docker following [their installation instructions](https://docs.docker.com/get-docker/).
-Let's say you want to deploy [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) model with TGI. Here is an example on how to do that:
+## Launching TGI
+
+Let's say you want to deploy the [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) model with TGI on an Nvidia GPU. Here is an example of how to do that:
```bash
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
-docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
+ ghcr.io/huggingface/text-generation-inference:1.4 \
+ --model-id $model
```
-
+### Supported hardware
-To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher.
+TGI supports various hardware. Make sure to check the [Using TGI with Nvidia GPUs](./installation_nvidia), [Using TGI with AMD GPUs](./installation_amd), [Using TGI with Intel Gaudi](./installation_gaudi), and [Using TGI with AWS Inferentia](./installation_inferentia) guides, depending on which hardware you would like to deploy TGI on.
-
-
-TGI also supports ROCm-enabled AMD GPUs (only MI210 and MI250 are tested), details are available in the [Supported Hardware section](./supported_models#supported-hardware) and [AMD documentation](https://rocm.docs.amd.com/en/latest/deploy/docker.html). To launch TGI on ROCm GPUs, please use instead:
-
-```bash
-docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4-rocm --model-id $model
-```
+## Consuming TGI
Once TGI is running, you can use the `generate` endpoint by sending requests. To learn more about how to query the endpoints, check the [Consuming TGI](./basic_tutorials/consuming_tgi) section, where we show examples with utility libraries and UIs. Below you can see a simple snippet to query the endpoint.
-
diff --git a/docs/source/supported_models.md b/docs/source/supported_models.md
index ceb25cfd..d478085e 100644
--- a/docs/source/supported_models.md
+++ b/docs/source/supported_models.md
@@ -40,17 +40,3 @@ If you wish to serve a supported model that already exists on a local folder, ju
```bash
text-generation-launcher --model-id
```
-
-
-## Supported Hardware
-
-TGI optimized models are supported on NVIDIA [A100](https://www.nvidia.com/en-us/data-center/a100/), [A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) and [T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) GPUs with CUDA 12.2+. Note that you have to install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to use it. For other NVIDIA GPUs, continuous batching will still apply, but some operations like flash attention and paged attention will not be executed.
-
-TGI also has support of ROCm-enabled AMD Instinct MI210 and MI250 GPUs, with paged attention, GPTQ quantization, flash attention v2 support. The following features are currently not supported in the ROCm version of TGI, and the supported may be extended in the future:
-* Loading [AWQ](https://huggingface.co/docs/transformers/quantization#awq) checkpoints.
-* Flash [layer norm kernel](https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm)
-* Kernel for sliding window attention (Mistral)
-
-TGI is also supported on the following AI hardware accelerators:
-- *Habana first-gen Gaudi and Gaudi2:* check out this [repository](https://github.com/huggingface/tgi-gaudi) to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)
-* *AWS Inferentia2:* check out this [guide](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference) on how to serve models with TGI on Inferentia2.