text-generation-inference/integration-tests/models
Daniël de Kok eab07f746c
Add support for FP8 KV cache scales (#2628)
* Add support for FP8 KV cache scales

Since FP8 only has limited dynamic range, we can scale keys/values
before storing them into the cache (and unscale them in attention). To
avoid rescaling the cache as the absmax values change, good scales are
usually determined per layer using calibration data and stored
in the checkpoint.
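
A minimal sketch of the scheme, assuming a fixed per-layer scale obtained from calibration (names here are illustrative, not TGI's internals):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def store_to_cache(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Divide by the calibrated per-layer scale so values fit the narrow
    # FP8 range; since the scale is fixed, the cache never needs rescaling.
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

def load_from_cache(x_fp8: torch.Tensor, scale: float) -> torch.Tensor:
    # Multiply the scale back in before the attention dot products.
    return x_fp8.to(torch.float32) * scale
```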

This change adds support for using key-value scales and loading them
from checkpoints in the two most common formats (sketched after the list):

- Separate per-layer `k_scale` and `v_scale` scalars.
- Per-layer `kv_scale` scalar (older format).
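
A hypothetical loader illustrating the fallback between the two formats (`weights` stands in for a dict-like checkpoint accessor; the key names follow the list above):

```python
def load_kv_scales(weights: dict, prefix: str) -> tuple[float, float]:
    if f"{prefix}.k_scale" in weights and f"{prefix}.v_scale" in weights:
        # Newer format: separate scalars for keys and values.
        return float(weights[f"{prefix}.k_scale"]), float(weights[f"{prefix}.v_scale"])
    if f"{prefix}.kv_scale" in weights:
        # Older format: a single scalar shared by keys and values.
        kv = float(weights[f"{prefix}.kv_scale"])
        return kv, kv
    # No scales in the checkpoint: identity scaling.
    return 1.0, 1.0
```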

Currently, scales are only used with a `float8_e4m3fn` cache.

Besides adding support for key/value scales, the `fp8_quantize` function
is also extended to support quantization with a kernel vendored from
vLLM. This is slightly faster than the PyTorch implementation and also
computes the scale in FP32, potentially improving accuracy.
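
To illustrate why computing the scale in FP32 matters, here is a pure-PyTorch sketch of dynamic per-tensor FP8 quantization; the vendored vLLM kernel fuses this into one pass, and the body below is an assumption, not the actual `fp8_quantize` implementation:

```python
import torch

def fp8_quantize_sketch(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # Compute the absmax-based scale in FP32 to avoid precision loss when
    # the input is FP16/BF16.
    scale = x.abs().max().to(torch.float32) / fp8_max
    scale = torch.clamp(scale, min=1e-12)  # guard against all-zero inputs
    x_fp8 = (x.to(torch.float32) / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale
```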

* Update FP8 KV cache test to use checkpoint with scales

* `can_scale`: check that the attention is flashinfer
2024-10-24 16:36:18 +02:00
__snapshots__ Add support for FP8 KV cache scales (#2628) 2024-10-24 16:36:18 +02:00
test_bloom_560m_sharded.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_bloom_560m.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_chat_llama.py Lots of improvements (Still 2 allocators) (#2449) 2024-08-29 16:29:01 +02:00
test_completion_prompts.py Stream options. (#2533) 2024-09-19 20:50:37 +02:00
test_flash_awq_sharded.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_awq.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_deepseek_v2.py Add support for Deepseek V2 (#2224) 2024-07-19 17:23:20 +02:00
test_flash_falcon.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_gemma2.py Softcapping for gemma2. (#2273) 2024-07-22 18:27:10 +02:00
test_flash_gemma_gptq.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_gemma.py Intel ci (#2630) 2024-10-10 16:51:57 +02:00
test_flash_gpt2.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_grammar_llama.py fix: correctly index into mask when applying grammar (#1618) 2024-03-01 18:22:01 +01:00
test_flash_llama_exl2.py Fixing exl2 and other quanize tests again. (#2419) 2024-08-15 11:12:51 +02:00
test_flash_llama_fp8_kv_cache.py Add support for FP8 KV cache scales (#2628) 2024-10-24 16:36:18 +02:00
test_flash_llama_fp8.py Further fixes. (#2426) 2024-08-16 13:21:44 +02:00
test_flash_llama_gptq.py GPTQ CI improvements (#2151) 2024-07-05 14:12:16 +02:00
test_flash_llama_marlin_24.py Improve the handling of quantized weights (#2250) 2024-07-19 09:37:39 +02:00
test_flash_llama_marlin.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_llama_prefix_flashdecoding.py Adding a test for FD. (#2516) 2024-09-16 17:00:54 +02:00
test_flash_llama_prefix.py Fix truffle (#2514) 2024-09-11 22:45:19 +02:00
test_flash_llama.py Intel ci (#2630) 2024-10-10 16:51:57 +02:00
test_flash_medusa.py Revamp medusa implementation so that every model can benefit. (#1588) 2024-02-26 19:49:28 +01:00
test_flash_mistral.py fix(router): fix openapi and add jsonschema validation (#1578) 2024-02-21 11:05:32 +01:00
test_flash_mixtral_awq.py Add support for fused MoE Marlin for AWQ (#2616) 2024-10-08 11:56:41 +02:00
test_flash_mixtral_gptq.py Test Marlin MoE with desc_act=true (#2622) 2024-10-21 12:50:35 +02:00
test_flash_mixtral.py Add tests for Mixtral (#2520) 2024-09-16 12:39:18 +02:00
test_flash_neox_sharded.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_neox.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_pali_gemma.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
test_flash_phi35_moe.py Fix Phi 3.5 MoE tests (#2684) 2024-10-24 15:21:50 +02:00
test_flash_phi.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_qwen2.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_santacoder.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_starcoder2.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_flash_starcoder_gptq.py Upgrading the tests to match the current workings. (#2423) 2024-08-15 13:28:42 +02:00
test_flash_starcoder.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_grammar_llama.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_grammar_response_format_llama.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
test_idefics2.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
test_idefics.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
test_llava_next.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
test_lora_mistral.py feat: simple mistral lora integration tests (#2180) 2024-07-15 09:16:15 -04:00
test_mamba.py All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
test_mllama.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
test_mpt.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_mt0_base.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_neox_sharded.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_neox.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_opt.py Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371) 2024-08-07 23:14:02 -04:00
test_t5_sharded.py Add pytest release marker (#2114) 2024-06-25 16:53:20 +02:00
test_tools_llama.py feat: allow tool calling to respond without a tool (#2614) 2024-10-10 09:28:25 -04:00