text-generation-inference/integration-tests/models/__snapshots__
Daniël de Kok eab07f746c
Add support for FP8 KV cache scales (#2628)
* Add support for FP8 KV cache scales

Since FP8 only has limited dynamic range, we can scale keys/values
before storing them into the cache (and unscale them in attention). To
avoid rescaling the cache as the absmax values change, good scales are
usually determined per layer using calibration data and stored
in the checkpoint.
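
A minimal sketch of the scale/unscale idea described above, in pure Python so it runs without a GPU. It is not TGI's implementation: `fake_fp8_e4m3`, `absmax_scale`, `quantize_to_cache` and `dequantize_from_cache` are hypothetical helpers, the divide-on-store / multiply-on-read convention is an assumption, and the absmax-based scale stands in for whatever calibration procedure produced the checkpoint's scales.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def fake_fp8_e4m3(x: float) -> float:
    """Simulate rounding x to float8_e4m3fn (3 mantissa bits, saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)  # saturate instead of overflowing
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so the binary exponent is e - 1.
    exp = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)  # spacing between representable values at this exponent
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def absmax_scale(calibration_values):
    """Hypothetical calibration: map the largest magnitude onto the FP8 maximum."""
    return max(abs(v) for v in calibration_values) / FP8_E4M3_MAX


def quantize_to_cache(values, scale):
    """Scale, then round to (simulated) FP8 before storing in the cache."""
    return [fake_fp8_e4m3(v / scale) for v in values]


def dequantize_from_cache(cached, scale):
    """Unscale cached values before they are used in attention."""
    return [c * scale for c in cached]


keys = [0.5, -1200.0, 3.0, 900.0]
scale = absmax_scale(keys)
restored = dequantize_from_cache(quantize_to_cache(keys, scale), scale)
unscaled = [fake_fp8_e4m3(v) for v in keys]  # what happens without a scale
```

Without a scale, `-1200.0` and `900.0` both saturate at the FP8 limit of 448; with the scale, the largest value survives the round trip and the others are rounded to the coarser FP8 grid.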

This change adds support for using key-value scales and loading them
from checkpoints in the two most common formats:

- Separate per-layer `k_scale` and `v_scale` scalars.
- Per-layer `kv_scale` scalar (older format).
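
As a rough illustration of handling the two formats, the sketch below resolves per-layer scales from a plain mapping of tensor names to scalars. The function name `load_kv_scales`, the `{prefix}.k_scale` naming scheme, and the fallback to `1.0` when no scale is present are all assumptions made for this example, not the loader added by this change.

```python
from typing import Mapping, Tuple


def load_kv_scales(tensors: Mapping[str, float], prefix: str) -> Tuple[float, float]:
    """Return (k_scale, v_scale) for one layer from a name -> scalar mapping."""
    k_name, v_name = f"{prefix}.k_scale", f"{prefix}.v_scale"
    # Newer format: separate key and value scales.
    if k_name in tensors and v_name in tensors:
        return float(tensors[k_name]), float(tensors[v_name])
    # Older format: one shared scale for both keys and values.
    kv_name = f"{prefix}.kv_scale"
    if kv_name in tensors:
        shared = float(tensors[kv_name])
        return shared, shared
    # Assumed default: no scaling when the checkpoint carries no scales.
    return 1.0, 1.0


# Hypothetical checkpoint contents, for illustration only.
checkpoint = {
    "layers.0.attn.k_scale": 0.02,
    "layers.0.attn.v_scale": 0.05,
    "layers.1.attn.kv_scale": 0.03,
}
layer0 = load_kv_scales(checkpoint, "layers.0.attn")
layer1 = load_kv_scales(checkpoint, "layers.1.attn")
layer2 = load_kv_scales(checkpoint, "layers.2.attn")
```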

Currently, scales are only used with a `float8_e4m3fn` cache.

Besides adding support for key/value scales, the `fp8_quantize` function
is also extended to support quantization with a kernel vendored from
vLLM. This kernel is slightly faster than the PyTorch implementation and
also scales in FP32, potentially improving accuracy.

* Update FP8 KV cache test to use checkpoint with scales

* `can_scale`: check that the attention is flashinfer
2024-10-24 16:36:18 +02:00
test_bloom_560m All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
test_bloom_560m_sharded fix: adjust test snapshots and small refactors (#2323) 2024-07-29 11:38:38 -04:00
test_chat_llama Lots of improvements (Still 2 allocators) (#2449) 2024-08-29 16:29:01 +02:00
test_completion_prompts Stream options. (#2533) 2024-09-19 20:50:37 +02:00
test_flash_awq Add AWQ quantization inference support (#1019) (#1054) 2023-09-25 15:31:27 +02:00
test_flash_awq_sharded Add AWQ quantization inference support (#1019) (#1054) 2023-09-25 15:31:27 +02:00
test_flash_deepseek_v2 Lots of improvements (Still 2 allocators) (#2449) 2024-08-29 16:29:01 +02:00
test_flash_falcon feat(server): add retry on download (#384) 2023-05-31 10:57:53 +02:00
test_flash_gemma Intel ci (#2630) 2024-10-10 16:51:57 +02:00
test_flash_gemma2 Softcapping for gemma2. (#2273) 2024-07-22 18:27:10 +02:00
test_flash_gemma_gptq More tensor cores. (#2558) 2024-09-24 23:57:26 +02:00
test_flash_gpt2 Add GPT-2 with flash attention (#1889) 2024-05-15 13:31:22 +02:00
test_flash_grammar_llama fix: correctly index into mask when applying grammar (#1618) 2024-03-01 18:22:01 +01:00
test_flash_llama Intel ci (#2630) 2024-10-10 16:51:57 +02:00
test_flash_llama_exl2 Add support for exl2 quantization 2024-05-30 11:28:05 +02:00
test_flash_llama_fp8 Lots of improvements (Still 2 allocators) (#2449) 2024-08-29 16:29:01 +02:00
test_flash_llama_fp8_kv_cache Add support for FP8 KV cache scales (#2628) 2024-10-24 16:36:18 +02:00
test_flash_llama_gptq GPTQ CI improvements (#2151) 2024-07-05 14:12:16 +02:00
test_flash_llama_marlin Add support for Marlin-quantized models 2024-06-06 13:16:52 +02:00
test_flash_llama_marlin_24 Improve the handling of quantized weights (#2250) 2024-07-19 09:37:39 +02:00
test_flash_llama_prefix Fix truffle (#2514) 2024-09-11 22:45:19 +02:00
test_flash_llama_prefix_flashdecoding Adding a test for FD. (#2516) 2024-09-16 17:00:54 +02:00
test_flash_medusa Speculative (#1308) 2023-12-11 12:46:30 +01:00
test_flash_mistral feat: add mistral model (#1071) 2023-09-28 09:55:47 +02:00
test_flash_mixtral Move to moe-kernels package and switch to common MoE layer (#2511) 2024-09-17 18:08:58 +02:00
test_flash_mixtral_awq Add support for fused MoE Marlin for AWQ (#2616) 2024-10-08 11:56:41 +02:00
test_flash_mixtral_gptq Test Marlin MoE with desc_act=true (#2622) 2024-10-21 12:50:35 +02:00
test_flash_neox fix(server): fix init for flash causal lm (#352) 2023-05-22 15:05:32 +02:00
test_flash_neox_sharded fix(server): fix init for flash causal lm (#352) 2023-05-22 15:05:32 +02:00
test_flash_pali_gemma All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
test_flash_phi All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
test_flash_phi35_moe Fix Phi 3.5 MoE tests (#2684) 2024-10-24 15:21:50 +02:00
test_flash_qwen2 feat: Qwen2 (#1608) 2024-02-28 15:50:31 +01:00
test_flash_santacoder feat(integration-tests): improve comparison and health checks (#336) 2023-05-16 20:22:11 +02:00
test_flash_starcoder fix: adjust test snapshots and small refactors (#2323) 2024-07-29 11:38:38 -04:00
test_flash_starcoder2 Lots of improvements (Still 2 allocators) (#2449) 2024-08-29 16:29:01 +02:00
test_flash_starcoder_gptq CI job. Gpt awq 4 (#2665) 2024-10-18 17:55:53 +02:00
test_grammar_llama fix: correctly index into mask when applying grammar (#1618) 2024-03-01 18:22:01 +01:00
test_grammar_response_format_llama Support chat response format (#2046) 2024-06-11 10:44:56 -04:00
test_idefics Support different image sizes in prefill in VLMs (#2065) 2024-06-17 10:49:41 +02:00
test_idefics2 Lots of improvements (Still 2 allocators) (#2449) 2024-08-29 16:29:01 +02:00
test_llava_next All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
test_lora_mistral feat: simple mistral lora integration tests (#2180) 2024-07-15 09:16:15 -04:00
test_mamba All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
test_mllama Mllama flash version (#2585) 2024-10-02 11:22:13 +02:00
test_mpt feat(server): Add Non flash MPT. (#514) 2023-07-03 13:01:46 +02:00
test_mt0_base Upgrading the tests to match the current workings. (#2423) 2024-08-15 13:28:42 +02:00
test_neox feat(server): Rework model loading (#344) 2023-06-08 14:51:52 +02:00
test_neox_sharded feat(server): Rework model loading (#344) 2023-06-08 14:51:52 +02:00
test_server_gptq_quantized GPTQ CI improvements (#2151) 2024-07-05 14:12:16 +02:00
test_t5_sharded feat(server): support fp16 for t5 (#360) 2023-05-23 18:16:48 +02:00
test_tools_llama feat: allow tool calling to respond without a tool (#2614) 2024-10-10 09:28:25 -04:00