Merge branch 'main' into tgi-rocm

Felix Marty 2023-11-14 13:37:06 +00:00
commit 7903ad72a8
3 changed files with 10 additions and 4 deletions


@@ -1,6 +1,8 @@
 <div align="center">
 
-![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)
+<a href="https://www.youtube.com/watch?v=jlMAX2Oaht0">
+<img width=560 height=315 alt="Making TGI deployment optimal" src="https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png">
+</a>
 
 # Text Generation Inference
@@ -138,6 +140,10 @@ this will impact performance.
 `text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
 by setting the address to an OTLP collector with the `--otlp-endpoint` argument.
 
+### Architecture
+
+![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)
+
 ### Local install
 
 You can also opt to install `text-generation-inference` locally.
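As a hedged illustration of the tracing lines in the hunk above: the sketch below starts TGI with `--otlp-endpoint` pointed at a local OTLP collector. Only the `--otlp-endpoint` flag comes from this diff; the launcher binary name, the model id, and the collector's default gRPC port 4317 are assumptions.

```python
import subprocess

# Hedged sketch: launch TGI with OpenTelemetry tracing sent to a local OTLP
# collector. The binary name, model id, and port 4317 are assumptions; only
# --otlp-endpoint itself is taken from the README lines above.
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "bigscience/bloom-560m",
        "--otlp-endpoint", "http://127.0.0.1:4317",
    ],
    check=True,
)
```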


@@ -4,7 +4,7 @@ Text Generation Inference improves the model in several aspects.
 ## Quantization
 
-TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323) and [AWQ](https://arxiv.org/abs/2306.00978) quantization. To speed up inference with quantization, simply set the `quantize` flag to `bitsandbytes`, `gptq` or `awq` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq); when using AWQ quantization, you need to point to one of the models [here](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to the [quantization guide](./../conceptual/quantization.md)
+TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323) and [AWQ](https://arxiv.org/abs/2306.00978) quantization. To speed up inference with quantization, simply set the `quantize` flag to `bitsandbytes`, `gptq` or `awq` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq); when using AWQ quantization, you need to point to one of the models [here](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to the [quantization guide](./../conceptual/quantization)
 
 ## RoPE Scaling
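The quantization paragraph above is likewise a launcher-flag how-to, so the same hedged sketch applies with different arguments. The model id below is illustrative (it should be one of the GPTQ models from the Hub search linked in the paragraph), and `gptq` can be swapped for `awq` or `bitsandbytes`.

```python
# Hypothetical argument list for a GPTQ-quantized deployment; run it with
# subprocess.run(quantized_args, check=True) as in the earlier sketch.
quantized_args = [
    "text-generation-launcher",                      # assumed launcher binary
    "--model-id", "TheBloke/Llama-2-7B-Chat-GPTQ",   # illustrative GPTQ model
    "--quantize", "gptq",                            # or "awq" / "bitsandbytes"
]
```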


@@ -670,7 +670,7 @@ class FlashCausalLM(Model):
                 self.device,
             )
             _, batch = self.generate_token(batch)
-        except Exception as e:
+        except torch.cuda.OutOfMemoryError as e:
             raise RuntimeError(
                 f"Not enough memory to handle {len(batch.input_ids)} prefill tokens. "
                 f"You need to decrease `--max-batch-prefill-tokens`"