Mirror of https://github.com/huggingface/text-generation-inference.git, synced 2025-09-10 11:54:52 +00:00

Commit 0c3f3cdb08 ("Added safetensors"), parent e20a5aeac5
@@ -4,12 +4,13 @@ Text Generation Inference improves the model in several aspects.

 ## Quantization

-TGI supports `bits-and-bytes` and `GPT-Q` quantization. To speed up inference with quantization, simply set `quantize` flag to `bitsandbytes` or `gptq` depending on the quantization technique you wish to use.
+TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323) quantization. To speed up inference with quantization, simply set the `quantize` flag to `bitsandbytes` or `gptq`, depending on the quantization technique you wish to use.

 ## RoPE Scaling

 RoPE scaling can be used to increase the sequence length of the model at inference time without necessarily fine-tuning it. To enable RoPE scaling, set the `ROPE_SCALING` and `ROPE_FACTOR` variables. `ROPE_SCALING` can take the values `linear` or `dynamic`. If your model is not fine-tuned to a longer sequence length, use `dynamic`. `ROPE_FACTOR` is the ratio between the intended maximum sequence length and the model's original maximum sequence length.

-## Safetensors Conversion
+## Safetensors

+[Safetensors](https://github.com/huggingface/safetensors) is a fast and safe persistence format for deep learning models. TGI supports `safetensors` model loading under the hood. By default, given a repository with both `safetensors` and `pytorch` weights, TGI will always load the `safetensors` weights. If there are no `safetensors` weights, TGI will convert the `pytorch` weights to `safetensors` format.
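For reference, the `quantize` flag changed in the hunk above is an argument to the TGI launcher. A minimal sketch of enabling quantization with the official Docker image; the model ID is a placeholder:

```shell
# Launch TGI with bits-and-bytes quantization; pass --quantize gptq instead
# for repositories that ship GPT-Q weights.
model=bigscience/bloom-560m  # placeholder model ID
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model --quantize bitsandbytes
```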
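`ROPE_SCALING` and `ROPE_FACTOR` are variables rather than launcher flags, so with Docker they can be passed through the container environment via `-e`. A sketch assuming a model whose original maximum sequence length is 2048 tokens, extended 2x to 4096:

```shell
# Dynamic RoPE scaling: ROPE_FACTOR = intended max length / original max length,
# i.e. 4096 / 2048 = 2 here. Use ROPE_SCALING=linear if the model was
# fine-tuned for the longer context.
model=bigscience/bloom-560m  # placeholder model ID
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    -e ROPE_SCALING=dynamic \
    -e ROPE_FACTOR=2 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model
```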
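The conversion described in the new `## Safetensors` section happens automatically when the server starts. As a sketch, the weights can also be fetched and converted ahead of time with the `text-generation-server download-weights` subcommand; that this CLI is available inside the image is an assumption here, so check `text-generation-server --help` first:

```shell
# Hypothetical pre-conversion step: download the weights into the mounted
# cache and convert any pytorch .bin shards to .safetensors, so the first
# server start does not pay the conversion cost.
docker run --gpus all \
    -v $PWD/data:/data \
    --entrypoint text-generation-server \
    ghcr.io/huggingface/text-generation-inference:latest \
    download-weights bigscience/bloom-560m
```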