mirror of
https://github.com/huggingface/text-generation-inference.git
synced 2025-11-18 23:15:59 +00:00
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

  Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing.

* Update test snapshots

| File |
|---|
| test_flash_llama_fp8_all_params.json |
| test_flash_llama_fp8_load.json |
| test_flash_llama_fp8.json |