From af1ed38f39dd550610badfd371830607115df7cc Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 7 Sep 2023 17:22:06 +0300
Subject: [PATCH 1/2] Safetensors conceptual guide (#905)

IDK what else to add in this guide. I looked for relevant code in the TGI
codebase and saw that safetensors is used in quantization as well (maybe I
could add that?)
---
 docs/source/_toctree.yml              | 2 ++
 docs/source/conceptual/safetensors.md | 7 +++++++
 2 files changed, 9 insertions(+)
 create mode 100644 docs/source/conceptual/safetensors.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 6a8baaf6..d37446b1 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -21,6 +21,8 @@
 - sections:
   - local: conceptual/streaming
     title: Streaming
+  - local: conceptual/safetensors
+    title: Safetensors
   - local: conceptual/flash_attention
     title: Flash Attention
   title: Conceptual Guides
diff --git a/docs/source/conceptual/safetensors.md b/docs/source/conceptual/safetensors.md
new file mode 100644
index 00000000..fcc31bac
--- /dev/null
+++ b/docs/source/conceptual/safetensors.md
@@ -0,0 +1,7 @@
+# Safetensors
+
+Safetensors is a model serialization format for deep learning models. It is [faster](https://huggingface.co/docs/safetensors/speed) and safer than other serialization formats such as pickle (which is used under the hood in many deep learning libraries).
+
+TGI depends on the safetensors format mainly to enable [tensor parallelism sharding](./tensor_parallelism). When serving a given model repository, TGI looks for safetensors weights. If there are no safetensors weights, TGI converts the PyTorch weights to the safetensors format.
+
+You can learn more about safetensors by reading the [safetensors documentation](https://huggingface.co/docs/safetensors/index).
\ No newline at end of file
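For reference, a minimal sketch of the pickle-to-safetensors conversion the guide above describes — assuming a local `pytorch_model.bin` and using only the public `safetensors.torch` API; TGI's actual conversion logic lives in its server utilities and also handles sharded checkpoints:

```python
import torch
from safetensors.torch import load_file, save_file

# Load the pickle-based PyTorch checkpoint. This step can execute
# arbitrary code embedded in the pickle, which is exactly the risk
# safetensors eliminates.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# safetensors stores each tensor exactly once, so shared/aliased
# tensors must be materialized as independent contiguous copies.
state_dict = {name: t.clone().contiguous() for name, t in state_dict.items()}

# Write the weights in safetensors format ...
save_file(state_dict, "model.safetensors")

# ... and load them back safely: no code execution on load.
weights = load_file("model.safetensors", device="cpu")
```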
From 0a63e9ab688cf715d31574ee5bb31025ff22ceec Mon Sep 17 00:00:00 2001
From: Nicolas Patry
Date: Thu, 7 Sep 2023 17:36:30 +0200
Subject: [PATCH 2/2] Fix __call__ vs forward. (#993)

# What does this PR do?

Fix `__call__` vs `forward`.

To reproduce the error, just launch TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ
with gptq (it fails because the Falcon code uses `__call__` instead of
`forward` calls).

Fixes # (issue)

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other
  checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
  Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
  [forum](https://discuss.huggingface.co/)? Please add a link to it if that's
  the case.
- [ ] Did you make sure to update the documentation with your changes? Here
  are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs),
  and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

## Who can review?

Anyone in the community is free to review the PR once the tests have passed.
Feel free to tag members/contributors who may be interested in your PR.
---
 server/text_generation_server/utils/gptq/exllama.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/server/text_generation_server/utils/gptq/exllama.py b/server/text_generation_server/utils/gptq/exllama.py
index 6a1cf117..7353afb5 100644
--- a/server/text_generation_server/utils/gptq/exllama.py
+++ b/server/text_generation_server/utils/gptq/exllama.py
@@ -69,10 +69,11 @@ def create_exllama_buffers():
     TEMP_STATE, TEMP_DQ = temp_state, temp_dq
 
 
-class Ex4bitLinear:
+class Ex4bitLinear(torch.nn.Module):
     """Linear layer implementation with per-group 4-bit quantization of the weights"""
 
     def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize):
+        super().__init__()
         global MAX_DQ, MAX_INNER, ACT_ORDER, DEVICE
 
         assert bits == 4
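For context on why this one-line change fixes the crash — a minimal sketch with hypothetical classes, not TGI code: modeling code like Falcon's invokes layers as `layer(x)`, which goes through `torch.nn.Module.__call__` and only then dispatches to `forward`, so a drop-in quantized layer must subclass `nn.Module` and call `super().__init__()` before registering any parameters:

```python
import torch


class PlainLinear:
    """Without nn.Module, instances are not callable: layer(x) raises TypeError."""

    def forward(self, x):
        return x * 2


class QuantLinear(torch.nn.Module):
    """Subclassing nn.Module makes layer(x) work: nn.Module.__call__
    runs hooks and then dispatches to forward()."""

    def __init__(self):
        # Must run before any Parameter/buffer registration.
        super().__init__()

    def forward(self, x):
        return x * 2


x = torch.ones(2)
print(QuantLinear()(x))  # tensor([2., 2.]) -- __call__ dispatches to forward
try:
    PlainLinear()(x)     # fails the way the Falcon + GPTQ path did
except TypeError as err:
    print(err)           # 'PlainLinear' object is not callable
```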