Mirror of https://github.com/huggingface/text-generation-inference.git, synced 2025-09-09 11:24:53 +00:00
fix: bump docs

This commit is contained in:
parent 9a44b3e7b9
commit ddb7dcbf38
@@ -59,8 +59,6 @@ Options:

Marlin kernels will be used automatically for GPTQ/AWQ models.

[env: QUANTIZE=]

Possible values:
- awq: 4 bit quantization. Requires a specific AWQ quantized model: <https://hf.co/models?search=awq>. Should replace GPTQ models wherever possible because of the better latency
- compressed-tensors: Compressed tensors, which can be a mixture of different quantization methods
@@ -73,6 +71,8 @@ Options:

- bitsandbytes-fp4: Bitsandbytes 4bit. nf4 should be preferred in most cases but maybe this one has better perplexity performance for your model
- fp8: [FP8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) (e4m3) works on H100 and above. This dtype has native ops and should be the fastest if available. This is currently not the fastest because of local unpacking + padding to satisfy matrix multiplication limitations

[env: QUANTIZE=]

```
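The quantization methods documented above are selected with the launcher's `--quantize` flag or the `QUANTIZE` environment variable shown in the hunk. A minimal sketch of how this is typically invoked; the model ids and port are placeholder assumptions, and `--model-id`/`--port` come from the wider launcher docs rather than this diff:

```shell
# Launch with an AWQ-quantized model (placeholder model id).
text-generation-launcher \
    --model-id TheBloke/Llama-2-7B-AWQ \
    --quantize awq \
    --port 8080

# The same selection via the environment variable documented as [env: QUANTIZE=].
QUANTIZE=fp8 text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct
```

Per the note above, Marlin kernels are picked automatically for GPTQ/AWQ models, so no additional kernel-selection flag is needed.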
## SPECULATE

```shell
@@ -457,14 +457,14 @@ Options:

--usage-stats <USAGE_STATS>
Control if anonymous usage stats are collected. Options are "on", "off" and "no-stack". Default is on.

[env: USAGE_STATS=]
[default: on]

Possible values:
- on: Default option, usage statistics are collected anonymously
- off: Disables all collection of usage statistics
- no-stack: Doesn't send the error stack trace or error type, but allows sending a crash event

[env: USAGE_STATS=]
[default: on]

```
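Collection can be adjusted at launch time through the `--usage-stats` flag or the `USAGE_STATS` environment variable documented above. A brief sketch; the model id is a placeholder assumption:

```shell
# Keep crash events but drop stack traces and error types (placeholder model id).
text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --usage-stats no-stack

# Disable all usage statistics via the environment variable instead.
USAGE_STATS=off text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct
```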
## PAYLOAD_LIMIT

```shell