text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-06-30 21:10:16 +00:00

Author	SHA1	Message	Date
Erik Kaunismäki	e5503eba78	configurable termination timeout (#3126) * make shard and webserver termination timeouts configurable * Updating documentation. * Fmt. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-20 14:25:56 +01:00
Nicolas Patry	83fe45c15e	Prepare for patch release. (#3124)	2025-03-18 15:11:55 +01:00
Nicolas Patry	4ac06ddf56	Preparing release 3.2.0 (#3100) * Preparing release 3.2.0 * Forgot the README. * Update doc.	2025-03-12 10:11:33 +01:00
Alvaro Bartolome	55a6618434	Update `--max-batch-total-tokens` description (#3083) * Update `--max-batch-total-tokens` description * Update docstring in `launcher/src/main.rs` instead	2025-03-07 14:24:26 +01:00
Nicolas Patry	08bbfa16a1	Preparing for release. (#3060) * Preparing for release. * Upgrade doc. * Fix docs auto-generated. * Fix update doc along.	2025-03-04 16:47:10 +01:00
Nicolas Patry	c9d68945cc	Prepare for release 3.1.0 (#2972) * Prepare for release 3.1.0 * Back on main flake. * Fixing stuff. * Upgrade to moe-kernels 0.8.2 for Hip support. * Deactivating the flaky test.	2025-01-31 14:19:01 +01:00
Nicolas Patry	29a0893b67	Tmp tp transformers (#2942) * Upgrade the version number. * Remove modifications in Lock. * Tmp branch to test transformers backend with 2.5.1 and TP>1 * Fixing the transformers backend. inference_mode forces the use of `aten.matmul` instead of `aten.mm`; the former doesn't have sharding support, which crashes the transformers TP support. `lm_head.forward` also crashes because it skips the hook that casts/decasts the DTensor. Torch 2.5.1 is required for sharding support. * Put back the attention impl. * Revert the flashinfer (this will fail). * Building AOT. * Using 2.5 kernels. * Remove the archlist, it's defined in the docker anyway.	2025-01-23 18:07:30 +01:00
Nicolas Patry	07b01293c5	Prepare patch release. (#2829)	2024-12-11 21:03:50 +01:00
Nicolas Patry	042791fbd5	Prep new version (#2810) * New version. * Link fixup. * Update docs. * Fixup.	2024-12-09 20:42:42 +01:00
Nicolas Patry	5df8059037	Auto max prefill (#2797) * Attempt at automatic max batch prefill. * Taking into account number of shards. * Adding more cards. * Adding A100 + H100 * Adding a few more cards. * Logprobs cost too much. * h100 better name, and keep factor of 2 * Damn inflated sparse tflops. * Typo in h100. * Updated the flops calculation (checked with fvcore). * chunking by default. * Fix prefix caching for chat completion since we removed logprobs. * More tests. * Dropping all the prefill logprobs. * Add a flag that enables users to get logprobs back. * Repairing prompt token counting. * Fixing a few tests. * Remove some scaffolding. * Attempting to reduce the issues (workarounds for now).	2024-12-06 05:52:00 +01:00
OlivierDehaene	780531ec77	chore: prepare 2.4.1 release (#2773) * chore: prepare 2.4.1 release * fix tests * fmt	2024-11-22 17:26:15 +00:00
OlivierDehaene	ab7ccf5bc3	feat: add payload limit (#2726) * feat: add payload limit * update launcher	2024-11-21 18:20:15 +00:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because: - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
OlivierDehaene	a6b02da971	chore: prepare 2.4.0 release (#2695)	2024-10-25 21:10:49 +00:00
OlivierDehaene	41c2623735	feat: allow any supported payload on /invocations (#2683) * feat: allow any supported payload on /invocations * update openAPI * update doc	2024-10-23 11:26:01 +00:00
Daniël de Kok	5bbe1ce028	Support `e4m3fn` KV cache (#2655) * Support `e4m3fn` KV cache * Make check more obvious	2024-10-17 10:42:16 +02:00
Daniël de Kok	2358c2bb54	Add basic FP8 KV cache support (#2603) * Add basic FP8 KV cache support This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However, support is still limited: * Only the `fp8_e5m2` type is supported. * The KV cache layout is the same as `float16`/`bfloat16` (HND). * The FP8 KV cache is only supported for FlashInfer. * Loading of scales is not yet supported. * Fix Cargo.toml	2024-10-04 17:51:48 +02:00
Daniël de Kok	abd24dd385	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536)	2024-09-19 22:17:15 +02:00
Hugo Larcher	53729b74ac	doc: Add metrics documentation and add a 'Reference' section (#2230) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-16 19:43:30 +02:00

20 Commits