text-generation-inference/server/text_generation_server/models/custom_modeling
Daniël de Kok 8511669cb2
Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy: every higher-level method to load weights
was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.
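
The delegation described above can be illustrated with a minimal sketch.
It is not the code from this change: the method names (`get_weights`,
`get_tensor`), the two loader classes, and the list-based tensor stand-in
are all hypothetical and only show the shape of the pattern.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Tensor = List[float]  # stand-in for a real tensor type (assumption)


class WeightsLoader(ABC):
    """Hypothetical interface for quantizer-specific weight processing."""

    @abstractmethod
    def get_weights(self, weights: "Weights", prefix: str) -> Tensor:
        ...


class Weights:
    """Provides low-level loads and delegates processed loads to a loader."""

    def __init__(self, tensors: Dict[str, Tensor], loader: WeightsLoader):
        self._tensors = tensors
        self._loader = loader

    def get_tensor(self, name: str) -> Tensor:
        # Low-level operation: fetch a raw tensor by name.
        return self._tensors[name]

    def get_weights(self, prefix: str) -> Tensor:
        # No conditional over quantizers here: the loader decides.
        return self._loader.get_weights(self, prefix)


class UnquantizedLoader(WeightsLoader):
    """Hypothetical loader that returns the stored weight unchanged."""

    def get_weights(self, weights: Weights, prefix: str) -> Tensor:
        return weights.get_tensor(f"{prefix}.weight")


class ToyScaledLoader(WeightsLoader):
    """Toy 'quantizer': stored values are multiplied by one scale factor."""

    def get_weights(self, weights: Weights, prefix: str) -> Tensor:
        # The loader still uses the low-level functionality of Weights.
        q = weights.get_tensor(f"{prefix}.qweight")
        (scale,) = weights.get_tensor(f"{prefix}.scale")
        return [v * scale for v in q]


store = {
    "fc.weight": [1.0, 2.0],
    "fc.qweight": [2.0, 4.0],
    "fc.scale": [0.5],
}
plain = Weights(store, UnquantizedLoader()).get_weights("fc")
dequantized = Weights(store, ToyScaledLoader()).get_weights("fc")
```

In this sketch, supporting another quantizer means adding another loader
class; `Weights` itself does not change.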

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`, but that was not very flexible (e.g. it
does not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
2024-07-09 20:04:03 +02:00
__init__.py feat(server): flash santacoder (#153) 2023-04-03 19:06:42 +02:00
bloom_modeling.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
clip.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
flash_cohere_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_dbrx_modeling.py Falcon/DBRX: get correct number of key-value heads (#2205) 2024-07-08 13:22:38 +02:00
flash_gemma2_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_gemma_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_gpt2_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_llama_modeling.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
flash_mistral_modeling.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
flash_mixtral_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_neox_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_pali_gemma_modeling.py Enable multiple LoRa adapters (#2010) 2024-06-25 14:46:27 -04:00
flash_phi_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_qwen2_modeling.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
flash_rw_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_santacoder_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
flash_starcoder2_modeling.py Move quantized weight handling out of the Weights class (#2194) 2024-07-09 20:04:03 +02:00
idefics2.py Enable multiple LoRa adapters (#2010) 2024-06-25 14:46:27 -04:00
idefics_config.py chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
idefics_image_processing.py chore: formatting 2023-12-11 14:49:52 +01:00
idefics_modeling.py reenable xpu for tgi (#1939) 2024-05-23 14:11:08 +02:00
idefics_perceiver.py Refactor layers. (#1866) 2024-05-13 12:44:30 +02:00
idefics_processing.py chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
idefics_vision.py Refactor layers. (#1866) 2024-05-13 12:44:30 +02:00
llava_next.py Refactor dead code - Removing all flash_xxx.py files. (#2166) 2024-07-05 10:29:56 +02:00
mamba_modeling.py Refactor layers. (#1866) 2024-05-13 12:44:30 +02:00
mpt_modeling.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
neox_modeling.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
opt_modeling.py fix dbrx & opt model prefix bug (#2201) 2024-07-08 09:01:14 +02:00
phi_modeling.py Consistently take prefix in model constructors (#2191) 2024-07-05 16:07:48 +02:00
siglip.py Removing some unused code. (#1915) 2024-05-17 11:35:49 +02:00
t5_modeling.py Refactor layers. (#1866) 2024-05-13 12:44:30 +02:00
vlm.py Pali gemma modeling (#1895) 2024-05-16 06:58:47 +02:00