Mirror of https://github.com/huggingface/text-generation-inference.git

update doc

parent ddf0c85836
commit eb0e93789d

@@ -195,7 +195,7 @@ Options:
       --hostname <HOSTNAME>
           The IP address to listen on

-          [env: HOSTNAME=]
+          [env: HOSTNAME=hf-amd-mi210-dev]
           [default: 0.0.0.0]

 ```
@@ -204,7 +204,7 @@ Options:
   -p, --port <PORT>
           The port to listen on

-          [env: PORT=]
+          [env: PORT=80]
           [default: 3000]

 ```
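Note: the `HOSTNAME=hf-amd-mi210-dev` and `PORT=80` values above appear to be picked up from the environment of the machine where this help output was regenerated, rather than being intended defaults. At launch time the listen address and port are normally set explicitly; a minimal sketch (the model id and values are illustrative, not part of this commit):

```shell
# Bind the server to an explicit address and port via flags...
text-generation-launcher --model-id bigscience/bloom-560m --hostname 0.0.0.0 --port 8080

# ...or via the environment variables listed in the help output.
HOSTNAME=0.0.0.0 PORT=8080 text-generation-launcher --model-id bigscience/bloom-560m
```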
@@ -240,7 +240,7 @@ Options:
       --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
           The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

-          [env: HUGGINGFACE_HUB_CACHE=]
+          [env: HUGGINGFACE_HUB_CACHE=/data]

 ```
 ## WEIGHTS_CACHE_OVERRIDE
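Note: `--huggingface-hub-cache` is typically pointed at a mounted volume so downloaded weights persist across container restarts. A rough sketch using the official Docker image (the image tag, paths, and model id are illustrative assumptions, not part of this commit):

```shell
# Mount a host directory at /data and tell TGI to use it as the hub cache.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/tgi-data:/data" \
  -e HUGGINGFACE_HUB_CACHE=/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id bigscience/bloom-560m
```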
@@ -254,7 +254,7 @@ Options:
 ## DISABLE_CUSTOM_KERNELS
 ```shell
       --disable-custom-kernels
-          For some models (like bloom), text-generation-inference implemented custom cuda kernels to speed up inference. Those kernels were only tested on Nvidia A100, AMD MI210 and AMD MI250. Use this flag to disable them if you're running on different hardware and encounter issues
+          For some models (like bloom), text-generation-inference implemented custom cuda kernels to speed up inference. Those kernels were only tested on A100. Use this flag to disable them if you're running on different hardware and encounter issues

           [env: DISABLE_CUSTOM_KERNELS=]

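Note: when the custom kernels cause problems on hardware other than the ones listed, they can be turned off with the flag documented above; a minimal sketch (the model id is illustrative):

```shell
# Disable the custom CUDA kernels and fall back to the default code paths.
text-generation-launcher --model-id bigscience/bloom-560m --disable-custom-kernels
```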
@@ -41,7 +41,7 @@ text-generation-launcher --model-id <PATH-TO-LOCAL-BLOOM>

 TGI optimized models are supported on NVIDIA [A100](https://www.nvidia.com/en-us/data-center/a100/), [A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) and [T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) GPUs with CUDA 11.8+. Note that you have to install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to use it. For other NVIDIA GPUs, continuous batching will still apply, but some operations like flash attention and paged attention will not be executed.

-TGI also has experimental support of RoCm-enabled AMD Instinct MI210 and MI250 GPUs, with paged attention and flash attention v2 support. The following features are missing from the RoCm version of TGI: quantization, flash [rotary embedding kernel](https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary), flash [layer norm kernel](https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm).
+TGI also has support of RoCm-enabled AMD Instinct MI210 and MI250 GPUs, with paged attention and flash attention v2 support. The following features are missing from the RoCm version of TGI: quantization and flash [layer norm kernel](https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm).

 TGI is also supported on the following AI hardware accelerators:
 - *Habana first-gen Gaudi and Gaudi2:* check out this [example](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) how to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)
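Note: on the AMD Instinct GPUs mentioned above, TGI is usually run through Docker with the ROCm devices passed into the container. A rough sketch, assuming a RoCm build of the image is published (the exact image tag and device flags depend on your setup and are not part of this commit):

```shell
# Expose the AMD GPUs to the container; ROCm uses /dev/kfd and /dev/dri rather than --gpus.
# NOTE: the "-rocm" image tag is an assumption; check the published tags for the release you use.
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 1g -p 8080:80 -v "$PWD/tgi-data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest-rocm \
  --model-id bigscience/bloom-560m
```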