Mirror of https://github.com/huggingface/text-generation-inference.git (synced 2025-04-24 00:12:08 +00:00)
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

  Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing.

* Update test snapshots

test_flash_llama_fp8_all_params.json
test_flash_llama_fp8_load.json
test_flash_llama_fp8.json