From eb0e93789d62bea0a5d6700a951eb6adc0fa4289 Mon Sep 17 00:00:00 2001
From: Felix Marty
Date: Thu, 16 Nov 2023 17:36:24 +0000
Subject: [PATCH] update doc

---
 docs/source/basic_tutorials/launcher.md | 8 ++++----
 docs/source/supported_models.md         | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/source/basic_tutorials/launcher.md b/docs/source/basic_tutorials/launcher.md
index eb4318cd..98a38c9f 100644
--- a/docs/source/basic_tutorials/launcher.md
+++ b/docs/source/basic_tutorials/launcher.md
@@ -195,7 +195,7 @@ Options:
       --hostname <HOSTNAME>
           The IP address to listen on
 
-          [env: HOSTNAME=]
+          [env: HOSTNAME=hf-amd-mi210-dev]
           [default: 0.0.0.0]
 
 ```
@@ -204,7 +204,7 @@
   -p, --port <PORT>
           The port to listen on
 
-          [env: PORT=]
+          [env: PORT=80]
           [default: 3000]
 
 ```
@@ -240,7 +240,7 @@ Options:
       --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
           The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance
 
-          [env: HUGGINGFACE_HUB_CACHE=]
+          [env: HUGGINGFACE_HUB_CACHE=/data]
 
 ```
 ## WEIGHTS_CACHE_OVERRIDE
@@ -254,7 +254,7 @@
 ## DISABLE_CUSTOM_KERNELS
 ```shell
       --disable-custom-kernels
-          For some models (like bloom), text-generation-inference implemented custom cuda kernels to speed up inference. Those kernels were only tested on Nvidia A100, AMD MI210 and AMD MI250. Use this flag to disable them if you're running on different hardware and encounter issues
+          For some models (like bloom), text-generation-inference implemented custom cuda kernels to speed up inference. Those kernels were only tested on A100. Use this flag to disable them if you're running on different hardware and encounter issues
 
           [env: DISABLE_CUSTOM_KERNELS=]
 
diff --git a/docs/source/supported_models.md b/docs/source/supported_models.md
index 1bcf8cd6..d7d45b70 100644
--- a/docs/source/supported_models.md
+++ b/docs/source/supported_models.md
@@ -41,7 +41,7 @@ text-generation-launcher --model-id
 
 TGI optimized models are supported on NVIDIA [A100](https://www.nvidia.com/en-us/data-center/a100/), [A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) and [T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) GPUs with CUDA 11.8+. Note that you have to install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to use it. For other NVIDIA GPUs, continuous batching will still apply, but some operations like flash attention and paged attention will not be executed.
 
-TGI also has experimental support of RoCm-enabled AMD Instinct MI210 and MI250 GPUs, with paged attention and flash attention v2 support. The following features are missing from the RoCm version of TGI: quantization, flash [rotary embedding kernel](https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary), flash [layer norm kernel](https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm).
+TGI also has support of RoCm-enabled AMD Instinct MI210 and MI250 GPUs, with paged attention and flash attention v2 support. The following features are missing from the RoCm version of TGI: quantization and flash [layer norm kernel](https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm).
 
 TGI is also supported on the following AI hardware accelerators:
 - *Habana first-gen Gaudi and Gaudi2:* check out this [example](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) how to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)
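
For anyone trying out the options documented above, a minimal launch sketch follows. It is illustrative only: the model id, port value, and cache path are placeholders I chose, not values mandated by this patch; the flags and environment variable names themselves come straight from the launcher docs.

```shell
# Illustrative sketch only: model id, port, and cache path are assumptions.
# The env vars mirror the [env: ...] hooks shown in the regenerated docs above.
export HUGGINGFACE_HUB_CACHE=/data   # e.g. a mounted disk to persist model weights
export PORT=8080                     # overrides the documented default of 3000

text-generation-launcher \
    --model-id bigscience/bloom-560m \
    --hostname 0.0.0.0 \
    --disable-custom-kernels         # opt out of the custom kernels on untested hardware
```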