text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-09 01:10:17 +00:00

Author	SHA1	Message	Date
Daniël de Kok	84ab88d843	Support flashinfer for Gemma3 prefill (#3167 ) * launcher: ensure correct detection of Gemma 3 head size * Support flashinfer for Gemma3 prefill Gemma3 uses bidirectional attention for images. Flashinfer supports custom masks. Hook up the mask with flashinfer, so that we do not have to use the slower SDPA implementation for prefills with images. * Update Gemma3 test outputs * Fixed unused import	2025-04-17 18:07:41 +02:00
Mohit Sharma	d9bb9bebc9	Add llama4 (#3145 ) * initial changes * Add support for other vlm * cleanup comment * Improve attn_implementation * Add comments for support of models * add model * add model * fixes and improvements * update docker * Add cache position * Add tests * remove redundant changes * remove tr version * Upgrade doc + fix linting. * Fixing the CI. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-06 10:20:22 +02:00
Mohit Sharma	a35fbdb925	Bug Fix: Sliding Window Attention (#3112 ) * (fix) sliding window attention * (fix) flashinfer * (typo) collection link * Add window_size_left param ipex rocm * Update window size rocm flash decoding * fix: bump snapshots and improve exceed window test case * feat: add tests for image types and remove alpha from png * Upgrading `from_env` to get token from file when necessary + fix pali_gemma. * fix: add pillow dependency and bump lock+requirements * fix: bump org name in gemma3 test * Fix qwen2. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-18 10:37:33 +01:00
Wang, Yi	06dfe9abfe	fix qwen2 vl crash in continous batching (#3004 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-20 18:36:45 -05:00
drbh	c1cf36c0dc	Improve qwen vl impl (#2943 ) * feat: refactor model, improve startup and re enable tests * fix: improve multimodal rotary embed caching * fix: limit vision flop calc to qwen2 vl models and update config typing * fix: include clippy lint * feat: refactor position ids in warmup and bump tests * fix: prefer default dtype * fix: enable all cuda graphs and bump snapshots * fix: adjust rotaty init path * fix: simplify get position ids and remove usused vision config * fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit * fix: improve position id init during cuda warmup for mrope and simplfy rotary forward * fix: check existance before accessing rope type in cuda warmup * fix: check key before access * fix: improve mrope check in cuda graph warmup * fix: remove check for default rope type * fix: add more test and improve model generation * fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids * fix: adjust signatures with types	2025-02-04 12:44:18 -05:00
Nicolas Patry	29a0893b67	Tmp tp transformers (#2942 ) * Upgrade the version number. * Remove modifications in Lock. * Tmp branch to test transformers backend with 2.5.1 and TP>1 * Fixing the transformers backend. inference_mode forces the use of `aten.matmul` instead of `aten.mm` the former doesn't have sharding support crashing the transformers TP support. `lm_head.forward` also crashes because it skips the hook that cast/decast the DTensor. Torch 2.5.1 is required for sharding support. * Put back the attention impl. * Revert the flashinfer (this will fails). * Building AOT. * Using 2.5 kernels. * Remove the archlist, it's defined in the docker anyway.	2025-01-23 18:07:30 +01:00
drbh	8f6146f11a	Revert "feat: improve qwen2-vl startup " (#2924 ) Revert "feat: improve qwen2-vl startup (#2802)" This reverts commit `eecca27113`.	2025-01-17 12:09:05 -05:00
drbh	eecca27113	feat: improve qwen2-vl startup (#2802 ) * feat: tokenize each request individually and increase warmup image size * feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller * fix: address image resize and rebase changes * feat: update to run qwen2-vl tests * fix: tweak param types	2025-01-17 11:50:41 -05:00
Wang, Yi	cc8b9650bd	Baichuan2-13B does not have max_position_embeddings in config (#2903 ) * Baichuan2-13B does not have max_position_embeddings in config see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Update server/text_generation_server/models/flash_causal_lm.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * fmt Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2025-01-15 15:56:52 +01:00
Mohit Sharma	880ab9c2f3	Add Flash decoding kernel ROCm (#2855 ) * (vllm) updated vllm rocm kernels * revert silu * update partition size * remove grouped_topk * (nit) remove log * add flash decoding	2025-01-13 11:12:35 +01:00
drbh	da5ab46705	Improve vlm support (add idefics3 support) (#2437 ) * feat: expand vlm support and add image token logic and tests * fix: avoid unused perceiver config * feat: integrate image tokens into inputs embeds * feat: add simple idefics3 test * feat: update docs, image token logic and weight names * fix: improve image processing * feat: improve prefix for idefics3 * fix: bump idefics3 tests and snapshots * fix: improve text model loading * feat: consolidate changes with existing vlms and add support and test for smolvlm * fix: create new idefic3 file, simplify logic and adjust llama weight loading * fix: lint with ruff * fix: clean up idefics 3 and improve prefix handling * fix: improve typing * fix: improve prompt_split_image with ref to original impl * fix: adjust ruff lints and small refactors * fix: adjust FlashLlamaModel prefix logic	2025-01-09 10:35:32 -05:00
Daniël de Kok	a9c7d2e3b6	Basic flashinfer 0.2 support (#2862 ) * Basic flashinfer 0.2 support This change does not use any of the new features yet, but makes some small compatibility changes. * Update to flashinfer 0.2.0.post1 * flashinfer: remove `contiguous` calls * Fix flashinfer install * flashinfer: fixup kv cache dtype * Fix some annoying perturbations * More output changes	2025-01-09 16:25:00 +01:00
Nicolas Patry	82c24f7420	Using both value from config as they might not be correct. (#2817 ) * Using both value from config as they might not be correct. * Fixing max_position_embeddings for falcon. * Simple attempt to fix the healthcheck block allocation. * Much simpler solution. * Default value for Backend start_health	2024-12-10 19:37:09 +01:00
Nicolas Patry	a04356fb8c	Attempt for cleverer auto batch_prefill values (some simplifications). (#2808 ) * Attempt for cleverer auto batch_prefill values (some simplifications). * Less flaky tests. * Fixing typo insertion. * Update launcher/src/main.rs Co-authored-by: Daniël de Kok <me@danieldk.eu> * Adding small comment for source of calculation. * Adding L40. * Adding L40s. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-12-09 19:44:32 +01:00
Nicolas Patry	5df8059037	Auto max prefill (#2797 ) * Attempt at automatic max batch prefill. * Taking into account number of shards. * Adding more cards. * Adding A100 + H100 * Adding a few more cards. * Logprobs cost too much. * h100 better name, and keep factor of 2 * Damn inflated sparse tflops. * Typo in h100. * Updated the flops calculation (checked with fvcore). * chunking by default. * Fix prefix caching for chat completion since we removed logprobs. * More tests. * Dropping all the prefill logprobs. * Add a flag that enables users to get logprobs back. * Repairing prompt token counting. * Fixing a few tests. * Remove some scaffolding. * Attempting to reduces the issues (workarounds for now).	2024-12-06 05:52:00 +01:00
Nicolas Patry	b57f370386	Saving some VRAM. (#2790 ) * Saving some VRAM. - 8B on 4xL4 attention=flashdecoding . Before 4.28GB left, After 4.32GB left, so 400MB saved. - Effect not as visible on attention=flashinfer and n_shard=1. I suspect it's linked to the torch allocator. * Adding assertion.	2024-12-03 04:04:21 +01:00
OlivierDehaene	ab7ccf5bc3	feat: add payload limit (#2726 ) * feat: add payload limit * update launcher	2024-11-21 18:20:15 +00:00
Nicolas Patry	9fde566602	Fixing linting on main. (#2719 )	2024-11-04 15:21:41 +01:00
Travis Addair	aadc9cb485	Fix prefix caching + speculative decoding (#2711 )	2024-11-04 15:08:43 +01:00
Nicolas Patry	a5593ba83e	Hotfixing auto length (warmup max_s was wrong). (#2716 )	2024-11-04 09:55:54 +01:00
drbh	01dacf8e8f	fix cuda graphs for qwen2-vl (#2708 ) * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl * fix: only check model type if config exists * fix: adjust sharding and lm head logic * fix qwen2 failure in intel cpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: return correct shape logits and add streaming test * fix: remove unused import and refactor test --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-01 03:05:34 +01:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673 ) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
OlivierDehaene	6f88bd9390	feat: add triton kernels to decrease latency of large batches (#2687 ) * feat: add triton kernels to decrease latency of large batches * cast to int32 * fix kernel * fix kernel * disable triton on rocm * fix speculation * add slots filtering kernel	2024-10-25 21:10:00 +00:00
Daniël de Kok	eab07f746c	Add support for FP8 KV cache scales (#2628 ) * Add support for FP8 KV cache scales Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration data and stored in the checkpoint. This change adds support for using key-value scales and loading them from checkpoints in the two most common formats: - Separate per-layer `k_scale` and `v_scale` scalars. - Per-layer `kv_scale` scalar (older format). Currently, scales are only used with a `float8_e4m3fn` cache. Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy. * Update FP8 KV cache test to use checkpoint with scales * `can_scale`: check that the attention is flashinfer	2024-10-24 16:36:18 +02:00
Nicolas Patry	153ff3740b	CI job. Gpt awq 4 (#2665 ) * add gptq and awq int4 support in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix ci failure Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * set kv cache dtype Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * refine the code according to the review command Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Simplifying conditionals + reverting integration tests values. * Unused import * Fix redundant import. * Revert change after rebase. * Upgrading the tests (TP>1 fix changes to use different kernels.) * Update server/text_generation_server/layers/gptq/__init__.py --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-18 17:55:53 +02:00
drbh	5f32dea1e2	fix: prefer inplace softmax to avoid copy (#2661 ) * fix: prefer inplace softmax to avoid copy * Update server/text_generation_server/models/flash_causal_lm.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-17 08:49:02 -04:00
OlivierDehaene	a6a0c97ed9	feat: prefill chunking (#2600 ) * wip * rollback * refactor to use prefix/postfix namming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-16 12:49:33 +02:00
Daniël de Kok	2358c2bb54	Add basic FP8 KV cache support (#2603 ) * Add basic FP8 KV cache support This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However support is still limited: * Only the `fp8_e5m2` type is supported. * The KV cache layout is the same as `float16`/`bfloat16` (HND). * The FP8 KV cache is only supported for FlashInfer. * Loading of scales is not yet supported. * Fix Cargo.toml	2024-10-04 17:51:48 +02:00
Mohit Sharma	f9e561eced	Update ROCM libs and improvements (#2579 ) * style * update torch * ix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env vart * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error messag * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile	2024-09-30 10:54:32 +02:00
Daniël de Kok	1028996fb3	flashinfer: pass window size and dtype (#2574 )	2024-09-28 18:41:41 +02:00
Wang, Yi	3ac7df2b6d	hotfix : enable intel ipex cpu and xpu in python3.11 (#2517 ) enable intel ipex cpu and xpu in python3.11 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-12 17:23:49 +02:00
Nicolas Patry	dae3bf1d87	Fix tokenization yi (#2507 ) * Fixing odd tokenization self modifications on the Rust side (load and resave in Python). * Fixing the builds ? * Fix the gh action? * Fixing the location ? * Validation is odd. * Try a faster runner * Upgrade python version. * Remove sccache * No sccache. * Getting libpython maybe ? * List stuff. * Monkey it up. * have no idea at this point * Tmp. * Shot in the dark. * Tmate the hell out of this. * Desperation. * WTF. * -y. * Apparently 3.10 is not available anymore. * Updating the dockerfile to make libpython discoverable at runtime too. * Put back rust tests. * Why do we want mkl on AMD ? * Forcing 3.11 ?	2024-09-11 22:41:56 +02:00
Nicolas Patry	a4e3e8c608	Prefix test - Different kind of load test to trigger prefix test bugs. (#2490 ) * Adding prefix test. * [WIP] tmp dump of integration load tests. * Remove other tensor creation. * Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids. * Fix parsing * Is it really flashinfer version ? * Remove some comments. * Revert the max prefix hit. * Adding numpy to diff. * Upgraded flashinfer. * Upgrading some stuff. * Are we done yet ? * Minor fixup * Remove 1 log and put back the other. * Add comment for why slot 0 is OK. * Mounting on the job. * Get me a debug branch * Debugging CIs is fun. * Attempt #28 * wip * Tmate. * Praying. * Updating VLM causal model with updated context. * Important line got squashed. * Tmate again. * Fingers crossed. * We want only 1 run of integration tests..... --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>	2024-09-11 18:10:40 +02:00
Nicolas Patry	e415b690a6	Lots of improvements (Still 2 allocators) (#2449 ) * Making prefix/flashinfer the default and testing the full release tests. * Include flashinfer in the docker. * Using prebuilt. * Allowing window_left_size (dummy version). * Disabling flashinfer/prefix caching on odd head_dim * Disable prefix caching for lora. * More specific codes. * Update lock * Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere. * Update cargo lock ? * Upgrade to 1.80 because of bitstream... * Everywhere 1.80 * Forgot last default place. * Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com> * Updated flake lock * Tmp * Upgrade resolution system for less errors in resolution. * Remove lambda for cleaner function. * Handling debugger. * Override the env in server tests. * Is this enough to make it work ? * This seems to be working. * Downgrade some logs. * Fixing the default for vlm. * Don't enable prefix caching on VLM just yet. * Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) * Fixing prefix caching for flashdecoding. * Update all models. * Fixed flashinfer version. * add_special_tokens is internal only * Fixing seqlen with the new vlms. * Fixing the issue with `add_special_tokens` not being passed around. * Fixing the test. * Removing encoder_decoder (seq2seq). * Update the chat test. * Fixing the batching tokenization in flash causal lm. * Truncating left for radix purposes. * Oops this doesn't belong here. * Put back default pure shell. * Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room * Only n_heads / process_group.size() are necessary. * Revert the integration tests change (seem linked to head_size modification). * Adding error message when assert is violated. * Fixing the free algorithm to handle times where the common prefix is smaller. * Apply suggestions from code review Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Update server/text_generation_server/layers/attention/common.py Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Fix disabling prefix caching - Fix windowing checks. * Revert the Cohere tokenizer change (for now using a revision instead). * Fmt. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-08-29 16:29:01 +02:00
Nicolas Patry	b70ae0969f	Prefix caching (#2402 ) * Prefix caching WIP * Fixing prefix attention. * Fixing flashinfer import. * Fixing black. * Fixing medusa (still wrong outputs, but functional). * Just medusa values now. * Fixing medusa without prefix caching. * Fixing prefix caching. * Medusa requires reshaping. * Removing the logs. * Remove router.nix * Fixup: - Remove logs - Disable VLMs (they do not work) - Disable prefix caching when user wants prefill logprobs. * Update flake.lock --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-08-20 11:15:30 +02:00
Nicolas Patry	f3b5c69441	Upgrading exl2. (#2415 ) * Upgrading exl2. * Fixing the other pathways. * Fix idefics.	2024-08-14 11:58:08 +02:00
Wang, Yi	59922f9bc1	add numa to improve cpu inference perf (#2330 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-13 15:33:55 +02:00
Nicolas Patry	7a48a84784	Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backens (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
Daniël de Kok	7830de1566	Add FlashInfer support (#2354 ) This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet. The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping: * A wrapper class needs to be constructed (which we just call state). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size. * Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call. When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call: * Set the context variable to the forward call's state. * Call `begin_forward` on the state. * Yield. * Call `end_forward` on the state. * Reset the context variable. We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.	2024-08-09 11:42:00 +02:00
drbh	f7f61876cf	Pr 2290 ci run (#2329 ) * MODEL_ID propagation fix * fix: remove global model id --------- Co-authored-by: root <root@tw031.pit.tensorwave.lan>	2024-07-31 10:27:15 -04:00
drbh	bab02ff2bc	feat: add ruff and resolve issue (#2262 ) * feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check	2024-07-26 10:29:09 -04:00
drbh	5d85a958c9	fix: refactor adapter weight loading and mapping (#2193 ) * fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: impove adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo	2024-07-24 15:32:14 -04:00
Nicolas Patry	abc32537ea	Fixing mistral nemo. (#2276 )	2024-07-23 11:16:03 +02:00
OlivierDehaene	53ec0b790b	feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248 ) * feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout	2024-07-20 19:02:04 +02:00
Daniël de Kok	e52be9bba2	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loads that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads with size 192, needs an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-07-19 17:23:20 +02:00
Daniël de Kok	8511669cb2	Move quantized weight handling out of the `Weights` class (#2194 ) Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.	2024-07-09 20:04:03 +02:00
Daniël de Kok	5c7c9f1390	Falcon/DBRX: get correct number of key-value heads (#2205 )	2024-07-08 13:22:38 +02:00
Daniël de Kok	153fcf7739	Fix incorrect cache allocation with multi-query (#2203 ) We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.	2024-07-08 11:19:48 +02:00
Daniël de Kok	cce475a949	hotfix: Fix number of KV heads (#2202 ) Fix number of KV heads	2024-07-08 09:52:12 +02:00
Daniël de Kok	05c094fcfa	Consistently take `prefix` in model constructors (#2191 ) * Consistently take `prefix` in model constructors * Release test check fix * Misc refactor-related fixes	2024-07-05 16:07:48 +02:00

128 Commits