text-generation-inference/server

Latest commit: 32d50c2ea7 "Add support for scalar FP8 weight scales (#2550)" by Daniël de Kok, 2024-10-25 09:01:04 +00:00

* Add support for scalar FP8 weight scales

* Support LLM Compressor FP8 checkpoints on H100

  On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
  Previously, FP8 quantization was not picked up for models quantized with
  LLM Compressor. This change adds enough parsing to detect whether a model
  has FP8-quantized weights.

* Remove stray debug print
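The detection described in the commit message can be illustrated with a small sketch: given a checkpoint's `config.json`, decide whether its weights are FP8-quantized. The field names below follow the LLM Compressor / compressed-tensors convention and are assumptions for illustration; the actual parsing in `text_generation_server` may differ.

```python
import json

def has_fp8_weights(config: dict) -> bool:
    """Return True if the model config declares FP8-quantized weights.

    Looks for an 8-bit float weight scheme in the quantization_config,
    following the LLM Compressor / compressed-tensors layout (an assumption;
    see the real config.json of an FP8 checkpoint for the exact schema).
    """
    quant = config.get("quantization_config")
    if quant is None:
        return False
    for group in quant.get("config_groups", {}).values():
        weights = group.get("weights", {})
        if weights.get("type") == "float" and weights.get("num_bits") == 8:
            return True
    return False

# Example config fragment in the shape LLM Compressor emits (illustrative).
example = json.loads("""{
  "model_type": "llama",
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {
      "group_0": {"weights": {"type": "float", "num_bits": 8}}
    }
  }
}""")
print(has_fp8_weights(example))  # prints True
```

With a check like this, the loader can route FP8 checkpoints to the FP8 code path (fbgemm-gpu on H100) even when the quantization method was not passed explicitly on the command line.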
| Name | Last commit | Date |
| --- | --- | --- |
| custom_kernels | All integration tests back everywhere (too many failed CI). (#2428) | 2024-09-25 06:10:59 +00:00 |
| exllama_kernels | MI300 compatibility (#1764) | 2024-07-17 05:36:58 +00:00 |
| exllamav2_kernels | chore: add pre-commit (#1569) | 2024-04-24 15:32:02 +03:00 |
| tests | Fix tokenization yi (#2507) | 2024-09-25 06:15:35 +00:00 |
| text_generation_server | Add support for scalar FP8 weight scales (#2550) | 2024-10-25 09:01:04 +00:00 |
| .gitignore | Impl simple mamba model (#1480) | 2024-04-23 11:45:11 +03:00 |
| dill-0.3.7-patch.sh | Make Gaudi adapt to the tgi 2.3.0 | 2024-09-26 06:04:55 +00:00 |
| dill-0.3.8-patch.sh | Make Gaudi adapt to the tgi 2.3.0 | 2024-09-26 06:04:55 +00:00 |
| Makefile | Make Gaudi adapt to the tgi 2.3.0 | 2024-09-26 06:04:55 +00:00 |
| Makefile-awq | chore: add pre-commit (#1569) | 2024-04-24 15:32:02 +03:00 |
| Makefile-eetq | Upgrade EETQ (Fixes the cuda graphs). (#1729) | 2024-04-25 17:58:27 +03:00 |
| Makefile-exllamav2 | Upgrading exl2. (#2415) | 2024-09-25 06:07:40 +00:00 |
| Makefile-fbgemm | Add Directory Check to Prevent Redundant Cloning in Build Process (#2486) | 2024-09-25 06:14:07 +00:00 |
| Makefile-flash-att | Hotfixing make install. (#2008) | 2024-09-24 03:29:29 +00:00 |
| Makefile-flash-att-v2 | Softcapping for gemma2. (#2273) | 2024-09-25 05:31:08 +00:00 |
| Makefile-flashinfer | Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) | 2024-09-25 06:14:07 +00:00 |
| Makefile-lorax-punica | Enable multiple LoRa adapters (#2010) | 2024-09-24 03:55:04 +00:00 |
| Makefile-selective-scan | chore: add pre-commit (#1569) | 2024-04-24 15:32:02 +03:00 |
| Makefile-vllm | Add support for Deepseek V2 (#2224) | 2024-09-25 05:27:40 +00:00 |
| poetry.lock | Update to moe-kenels 0.3.1 (#2535) | 2024-09-25 06:19:20 +00:00 |
| pyproject.toml | Make Gaudi adapt to the tgi 2.3.0 | 2024-09-26 06:04:55 +00:00 |
| README.md | chore: add pre-commit (#1569) | 2024-04-24 15:32:02 +03:00 |
| requirements_cuda.txt | hotfix: add syrupy to the right subproject (#2499) | 2024-09-25 06:13:36 +00:00 |
| requirements_intel.txt | hotfix: add syrupy to the right subproject (#2499) | 2024-09-25 06:13:36 +00:00 |
| requirements_rocm.txt | hotfix: add syrupy to the right subproject (#2499) | 2024-09-25 06:13:36 +00:00 |
| requirements.txt | Make Gaudi adapt to the tgi 2.3.0 | 2024-09-26 06:04:55 +00:00 |

# Text Generation Inference Python gRPC Server

A Python gRPC server for Text Generation Inference.

## Install

    make install

## Run

    make run-dev