Mirror of https://github.com/huggingface/text-generation-inference.git, synced 2025-11-18 23:15:59 +00:00.
Mostly straightforward; changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
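A minimal sketch of the two changes, assuming hypothetical names and fields (`QuantizerParams`, `QuantizedLinear`, `warmup`, the GPTQ-style tensor layout); the actual classes in the repository differ:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class QuantizerParams:
    """Typed wrapper for quantizer parameters, instead of an untyped tuple
    that has to be unpacked and repacked as a dict at every call site.
    Field names here are illustrative assumptions."""

    qweight: torch.Tensor
    qzeros: torch.Tensor
    scales: torch.Tensor
    g_idx: torch.Tensor
    bits: int
    groupsize: int


class QuantizedLinear:
    """Holds quantized weights; scratch space is allocated at warmup."""

    def __init__(self, params: QuantizerParams):
        self.params = params
        self.scratch: Optional[torch.Tensor] = None

    def warmup(self, max_input_tokens: int, hidden_size: int, device: torch.device) -> None:
        # Size the scratch buffer from the maximum input sequence length known
        # at warmup, rather than eagerly allocating a worst-case buffer that
        # can OOM on the GPU.
        self.scratch = torch.empty(
            (max_input_tokens, hidden_size),
            dtype=torch.float16,
            device=device,
        )
```

The point of the wrapper is that callers pass and receive one named object instead of repacking positional tuple elements into a dict; the point of the warmup hook is that the scratch buffer is sized from the real maximum input length instead of a pessimistic upper bound.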
Documentation files in this directory:

* flash_attention.md
* guidance.md
* paged_attention.md
* quantization.md
* safetensors.md
* speculation.md
* streaming.md
* tensor_parallelism.md