text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-09 15:05:24 +00:00

Daniël de Kok 2a6c3caf1d Move quantized weight handling out of the `Weights` class (#2194 )	2024-09-25 05:27:40 +00:00

Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.
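
The delegation described in this commit message can be illustrated with a minimal sketch. This is not the code from `weights.py`: only the names `Weights` and `WeightsLoader` come from the commit message. The method names (`get_tensor`, `get_weights`), the two example loaders, and the use of plain Python lists in place of tensors are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WeightsLoader(ABC):
    """Hypothetical loader interface; the method name is an assumption."""

    @abstractmethod
    def get_weights(self, weights: "Weights", prefix: str) -> List[float]:
        """Load the weights stored under `prefix`, using `weights` for raw access."""


class DefaultWeightsLoader(WeightsLoader):
    """Illustrative loader for unquantized weights: returns the tensor as stored."""

    def get_weights(self, weights: "Weights", prefix: str) -> List[float]:
        return weights.get_tensor(f"{prefix}.weight")


class ScaledWeightsLoader(WeightsLoader):
    """Toy stand-in for a quantizer-specific loader (not a real quantizer).

    It reads integer values plus a single scale and multiplies them out,
    relying only on the low-level `get_tensor` operation of `Weights`.
    """

    def get_weights(self, weights: "Weights", prefix: str) -> List[float]:
        quantized = weights.get_tensor(f"{prefix}.qweight")
        scale = weights.get_tensor(f"{prefix}.scale")[0]
        return [value * scale for value in quantized]


class Weights:
    """Provides low-level loads and delegates format-specific work to a loader."""

    def __init__(self, tensors: Dict[str, List[float]], loader: WeightsLoader):
        self._tensors = tensors  # in-memory stand-in for a checkpoint file
        self.loader = loader

    def get_tensor(self, name: str) -> List[float]:
        # Low-level operation: fetch one stored tensor by name.
        return list(self._tensors[name])

    def get_weights(self, prefix: str) -> List[float]:
        # Higher-level operation: no conditional over quantizers here,
        # the configured loader decides how to assemble the weights.
        return self.loader.get_weights(self, prefix)


store = {"layer.qweight": [1, 2, 3], "layer.scale": [0.5]}
weights = Weights(store, ScaledWeightsLoader())
print(weights.get_weights("layer"))
```

Composition keeps `Weights` free of per-quantizer branches, and because a loader only calls back into the low-level methods, the same loader would work against a mocked weight store in tests, which is the flexibility the commit message says the inheritance approach lacked.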
merges	Enable multiple LoRa adapters (#2010 )	2024-09-24 03:55:04 +00:00
__init__.py	Aligin the source code with main branch 2.0.4	2024-09-24 03:06:55 +00:00
adapter.py	Enable multiple LoRa adapters (#2010 )	2024-09-24 03:55:04 +00:00
chunks.py	server: use chunked inputs	2024-09-24 03:42:29 +00:00
convert.py	Force weights_only (before fully breaking pickle files anyway). (#1710 )	2024-04-25 15:10:53 +03:00
dist.py	Removing IPEX_AVAIL. (#2115 )	2024-09-24 03:52:23 +00:00
hub.py	Enable multiple LoRa adapters (#2010 )	2024-09-24 03:55:04 +00:00
import_utils.py	refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132 )	2024-09-24 03:57:32 +00:00
log.py	v1.3.4	2024-04-22 09:08:34 +03:00
logits_process.py	Aligin the source code with main branch 2.0.4	2024-09-24 03:06:55 +00:00
peft.py	Enable multiple LoRa adapters (#2010 )	2024-09-24 03:55:04 +00:00
quantization.py	Move quantized weight handling out of the `Weights` class (#2194 )	2024-09-25 05:27:40 +00:00
segments.py	Enable multiple LoRa adapters (#2010 )	2024-09-24 03:55:04 +00:00
sgmv.py	Enable multiple LoRa adapters (#2010 )	2024-09-24 03:55:04 +00:00
speculate.py	chore: formatting	2024-04-18 16:26:00 +03:00
tokens.py	Aligin the source code with main branch 2.0.4	2024-09-24 03:06:55 +00:00
watermark.py	Aligin the source code with main branch 2.0.4	2024-09-24 03:06:55 +00:00
weights.py	Move quantized weight handling out of the `Weights` class (#2194 )	2024-09-25 05:27:40 +00:00