mirror of
https://github.com/huggingface/text-generation-inference.git
synced 2025-09-10 20:04:52 +00:00
Merge branch 'main' into paged-attention-docs
This commit is contained in:
commit
2faf396128
@ -23,6 +23,8 @@
|
|||||||
title: Streaming
|
title: Streaming
|
||||||
- local: conceptual/paged_attention
|
- local: conceptual/paged_attention
|
||||||
title: PagedAttention
|
title: PagedAttention
|
||||||
|
- local: conceptual/safetensors
|
||||||
|
title: Safetensors
|
||||||
- local: conceptual/flash_attention
|
- local: conceptual/flash_attention
|
||||||
title: Flash Attention
|
title: Flash Attention
|
||||||
title: Conceptual Guides
|
title: Conceptual Guides
|
||||||
|
7
docs/source/conceptual/safetensors.md
Normal file
7
docs/source/conceptual/safetensors.md
Normal file
@ -0,0 +1,7 @@
|
|||||||
|
# Safetensors
|
||||||
|
|
||||||
|
Safetensors is a model serialization format for deep learning models. It is [faster](https://huggingface.co/docs/safetensors/speed) and safer than other serialization formats, such as pickle (which is used under the hood in many deep learning libraries).
|
||||||
|
|
||||||
|
TGI depends on the safetensors format mainly to enable [tensor parallelism sharding](./tensor_parallelism). When serving a given model repository, TGI looks for safetensors weights. If there are no safetensors weights, TGI converts the PyTorch weights to the safetensors format.
|
||||||
|
|
||||||
|
You can learn more about safetensors by reading the [safetensors documentation](https://huggingface.co/docs/safetensors/index).
|
@ -69,10 +69,11 @@ def create_exllama_buffers():
|
|||||||
TEMP_STATE, TEMP_DQ = temp_state, temp_dq
|
TEMP_STATE, TEMP_DQ = temp_state, temp_dq
|
||||||
|
|
||||||
|
|
||||||
class Ex4bitLinear:
|
class Ex4bitLinear(torch.nn.Module):
|
||||||
"""Linear layer implementation with per-group 4-bit quantization of the weights"""
|
"""Linear layer implementation with per-group 4-bit quantization of the weights"""
|
||||||
|
|
||||||
def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize):
|
def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize):
|
||||||
|
super().__init__()
|
||||||
global MAX_DQ, MAX_INNER, ACT_ORDER, DEVICE
|
global MAX_DQ, MAX_INNER, ACT_ORDER, DEVICE
|
||||||
assert bits == 4
|
assert bits == 4
|
||||||
|
|
||||||
|
Loading…
Reference in New Issue
Block a user