Commit Graph

1176 Commits

Author SHA1 Message Date
Nicolas Patry
8f326c9791
Fixing lockfile. 2024-12-09 21:20:59 +01:00
Nicolas Patry
7b631e21b0
Preparing for v3 release. 2024-12-10 01:40:09 +05:30
Nicolas Patry
a70dd2998b
Hotfixing the link. (#2811) 2024-12-09 20:50:07 +01:00
Nicolas Patry
042791fbd5
Prep new version (#2810)
* New version.

* Link fixup.

* Update docs.

* Fixup.
2024-12-09 20:42:42 +01:00
Nicolas Patry
27fa83ca5b
V3 doc (#2809)
* V3 document.

* Updating asset.
2024-12-09 19:58:07 +01:00
Nicolas Patry
a04356fb8c
Attempt for cleverer auto batch_prefill values (some simplifications). (#2808)
* Attempt for cleverer auto batch_prefill values (some simplifications).

* Less flaky tests.

* Fixing typo insertion.

* Update launcher/src/main.rs

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Adding small comment for source of calculation.

* Adding L40.

* Adding L40s.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-12-09 19:44:32 +01:00
drbh
9f5c9a5e22
Enable paligemma2 (#2807)
* feat: support loading gemma2 as vlm text model

* feat: add test for paligemma2
2024-12-06 14:41:49 -05:00
Nicolas Patry
08f6fa0b59
Removing the experimental label from prefill chunking. 2024-12-06 19:09:40 +01:00
Nicolas Patry
d96dcb1797
Adding A100 compute. (#2806) 2024-12-06 18:19:15 +01:00
Nicolas Patry
5df8059037
Auto max prefill (#2797)
* Attempt at automatic max batch prefill.

* Taking into account number of shards.

* Adding more cards.

* Adding A100 + H100

* Adding a few more cards.

* Logprobs cost too much.

* h100 better name, and keep factor of 2

* Damn inflated sparse tflops.

* Typo in h100.

* Updated the flops calculation (checked with fvcore).

* chunking by default.

* Fix prefix caching for chat completion since we removed logprobs.

* More tests.

* Dropping all the prefill logprobs.

* Add a flag that enables users to get logprobs back.

* Repairing prompt token counting.

* Fixing a few tests.

* Remove some scaffolding.

* Attempting to reduce the issues (workarounds for now).
2024-12-06 05:52:00 +01:00
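Editor's note: the entry above describes a flops-based heuristic for picking the maximum batch prefill token budget automatically per card and shard count; the real logic lives in `launcher/src/main.rs`. The Python sketch below only illustrates the general idea, and the TFLOPS figures, efficiency factor, and latency budget are assumptions, not the launcher's actual constants.

```
# Illustrative sketch only, not the launcher's actual computation: estimate a
# prefill token budget from peak GPU throughput, using the standard
# "forward-pass FLOPs ~= 2 * params * tokens" approximation. The TFLOPS values
# below are approximate dense BF16 figures (spec sheets often quote the
# inflated sparse numbers the commit complains about).
APPROX_DENSE_TFLOPS = {"a100": 312.0, "h100": 990.0, "l4": 121.0, "l40s": 362.0}

def estimate_max_prefill_tokens(
    card: str,
    model_params_b: float,                 # model size in billions of parameters
    num_shards: int = 1,
    target_prefill_seconds: float = 1.0,   # assumed latency budget
    efficiency: float = 0.5,               # assume ~50% of peak is achievable
) -> int:
    flops_budget = (
        APPROX_DENSE_TFLOPS[card] * 1e12 * num_shards * efficiency * target_prefill_seconds
    )
    flops_per_token = 2 * model_params_b * 1e9
    return int(flops_budget / flops_per_token)

# e.g. an 8B model sharded over 4x L4:
# estimate_max_prefill_tokens("l4", model_params_b=8, num_shards=4) -> ~15_125
```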
OlivierDehaene
8c3669b287
feat: auto max_new_tokens (#2803)
* feat: auto max_new_tokens

* update default

* Fixing the tests.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-06 05:50:35 +01:00
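Editor's note: the commit message does not spell out what "auto" means here. One plausible policy, sketched below under that assumption, is to default `max_new_tokens` to whatever room remains in the context window when the request leaves it unset; the names and the exact rule are illustrative, not TGI's API.

```
# Minimal sketch, assuming "auto" means "fill the remaining context window".
from typing import Optional

def resolve_max_new_tokens(
    requested: Optional[int],
    input_length: int,
    max_total_tokens: int,
) -> int:
    if requested is not None:
        return requested
    remaining = max_total_tokens - input_length
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return remaining

# resolve_max_new_tokens(None, input_length=900, max_total_tokens=4096) -> 3196
```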
Wang, Yi
6685e8fcda
use oneapi 2024 docker image directly for xpu (#2793)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-06 09:36:23 +05:30
drbh
e0db633396
fix: avoid setting use_sgmv if no kernels present (#2796) 2024-12-04 15:26:09 -05:00
Nicolas Patry
b57f370386
Saving some VRAM. (#2790)
* Saving some VRAM.

- 8B on 4xL4 with attention=flashdecoding. Before: 4.28GB left, after: 4.32GB
  left, so 40MB saved.

- Effect not as visible on attention=flashinfer and n_shard=1. I suspect
  it's linked to the torch allocator.

* Adding assertion.
2024-12-03 04:04:21 +01:00
Daniël de Kok
2003d8be0c
Sync (most) server dependencies with Nix (#2782)
* Sync (most) server dependencies with Nix

Skipped most grpcio packages, because of protobuf version
incompatibility with the opentelemetry packages.

* Add a primitive script to generate Poetry commands to sync with Nix

This is not fully automated, since the Nix versions may not be
resolvable. However, it does take most of the work out of doing
this manually.

* Upgrade eetq?

* Fmt.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-03 04:04:06 +01:00
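Editor's note: the "primitive script to generate Poetry commands" mentioned above is not shown here. The sketch below is a hypothetical stand-in that assumes a JSON file mapping package names to the versions pinned in Nix and simply prints the corresponding `poetry add` commands; the real script in the repository may work differently.

```
# Hypothetical helper: turn a Nix-derived {package: version} mapping into
# "poetry add" commands. The JSON input format is an assumption.
import json
import sys

def poetry_commands(versions_path: str) -> list[str]:
    with open(versions_path) as f:
        versions = json.load(f)  # e.g. {"grpcio": "1.68.0", "safetensors": "0.4.5"}
    return [f"poetry add {name}=={version}" for name, version in sorted(versions.items())]

if __name__ == "__main__":
    print("\n".join(poetry_commands(sys.argv[1])))
```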
Dmitry Rogozhkin
535149d872
fix: only use eos_token_id as pad_token_id if int (#2774)
Llama 3 has a list of values as eos_token_id:
  "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks the tokenizer since it expects a single value. This
commit uses tokenizer.eos_token_id instead in such a case.

Fixes: #2440

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-12-02 06:26:37 +01:00
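Editor's note: the guard described in that message is simple to illustrate. The sketch below mirrors the described behaviour, only reusing `eos_token_id` as `pad_token_id` when it is a single integer; it is not a verbatim copy of the patch.

```
# Sketch of the described guard: Llama 3 style configs can carry a list of eos
# ids, which breaks code expecting a single value, so fall back to the
# tokenizer's own eos_token_id in that case.
def pick_pad_token_id(config_eos_token_id, tokenizer):
    if isinstance(config_eos_token_id, int):
        return config_eos_token_id
    return tokenizer.eos_token_id
```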
drbh
2c74c55637
fix: add merge-lora arg for model id (#2788) 2024-12-02 05:52:02 +01:00
Torsten Raudssus
a35d1e6fe5
Removing ../ that broke the link (#2789) 2024-12-02 05:48:55 +01:00
Nicolas Patry
1d2cb356b9
Fix doc. (#2792) 2024-12-02 05:28:26 +01:00
drbh
d471805134
Support continue final message (#2733)
* feat: support continue_final_message param in chat request

* feat: add test for continue final message

* fix: bump openapi docs

* fix: remove continue_final_message chat request param

* fix: remove unneeded launcher args in continue test

* fix: bump test output

* fix: remove accidentally included guideline from rebase

* fix: remove guideline tests

* fix: adjust continuation tests expected text

* fix: replace expected output for continue test
2024-11-27 19:13:30 -05:00
jp
caff779dd4
Fix: docs typo (#2777)
Fix: typo in model loading code

Fix typo in model loading code
2024-11-26 14:28:58 +01:00
Wang, Yi
892a26e549
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… (#2778)
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)

Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-11-26 14:28:11 +01:00
Daniël de Kok
72ab60fdd5
Use FP8 KV cache when specified by compressed-tensors (#2761)
The compressed-tensors configuration can specify the configuration of
the KV cache as well. Use an FP8 KV cache when the configuration tells
us to do so (all other options and types are ignored for now).
2024-11-26 08:27:41 +01:00
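Editor's note: as a rough illustration of "use an FP8 KV cache when the configuration tells us to", the sketch below inspects a compressed-tensors style quantization config. The key names (`kv_cache_scheme`, `type`, `num_bits`) are assumptions about the config layout rather than a documented contract.

```
# Illustrative check only; key names are assumed, not a documented API.
import torch

def kv_cache_dtype(quantization_config: dict) -> torch.dtype:
    scheme = quantization_config.get("kv_cache_scheme")
    if scheme and scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return torch.float8_e4m3fn
    # all other options and types are ignored for now, as the commit notes
    return torch.bfloat16

# kv_cache_dtype({"kv_cache_scheme": {"type": "float", "num_bits": 8}})
# -> torch.float8_e4m3fn
```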
Daniël de Kok
289aa48554
Move JSON grammar -> regex grammar conversion to the router (#2772)
* Move JSON grammar -> regex grammar conversion to the router

This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:

simple schema           time:   [5.8293 µs 5.8307 µs 5.8320 µs]
                        change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                        Performance has improved.

complex schema          time:   [14.875 µs 14.881 µs 14.887 µs]
                        change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                        Performance has improved.

Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
2024-11-25 18:47:34 +01:00
drbh
c637d68d74
feat: concat the adapter id to the model id in chat response (#2779)
* feat: concat the adapter id to the model id in chat response

* fix: updated to include only the adapter id in chat response
2024-11-25 12:36:31 -05:00
OlivierDehaene
780531ec77
chore: prepare 2.4.1 release (#2773)
* chore: prepare 2.4.1 release

* fix tests

* fmt
2024-11-22 17:26:15 +00:00
Daniël de Kok
e87893d38e
chore: Update to marlin-kernels 0.3.6 (#2771)
This fixes a bug in 2:4 Marlin:
https://github.com/vllm-project/vllm/pull/10464
2024-11-22 14:44:47 +00:00
OlivierDehaene
ab7ccf5bc3
feat: add payload limit (#2726)
* feat: add payload limit

* update launcher
2024-11-21 18:20:15 +00:00
Hugo Larcher
d5bc6a20bd
feat: Add automatic nightly benchmarks (#2591)
* feat: Add automatic nightly benchmarks

* fix: Update runners group

* fix: add created_at field to results

* fix: Add variable results file location
2024-11-21 17:11:42 +00:00
Lucain
d012f229c6
Remove guideline from API (#2762) 2024-11-21 16:56:38 +00:00
Daniël de Kok
c5b5b3a11c
docs: Add a README section about using Nix (#2767) 2024-11-21 16:53:27 +00:00
drbh
faa10ad0bc
fix: tweak grammar test response (#2769) 2024-11-21 16:46:00 +00:00
OlivierDehaene
8e0c161d0a
fix: incomplete generations w/ single-token generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754)

entries.len() could be > batch.size in prefill, so we need to filter as well.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* entries were wrongly extended for models that did not support chunking

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
2024-11-21 16:37:55 +00:00
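Editor's note: the fix boils down to filtering in-flight entries against the requests actually present in the batch, since during prefill there can be more entries than batch entries. The toy sketch below shows the idea with plain dictionaries; the names are illustrative, not TGI's types.

```
# Toy illustration: keep only entries whose request ids are still in the batch.
def filter_entries(entries: dict, batch_request_ids: set) -> dict:
    return {rid: entry for rid, entry in entries.items() if rid in batch_request_ids}

# entries = {1: "a", 2: "b", 3: "c"}
# filter_entries(entries, {1, 3}) -> {1: "a", 3: "c"}
```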
Daniël de Kok
3c54488638
nix: downgrade to outlines 0.1.3 (#2768) 2024-11-21 13:00:26 +01:00
drbh
6ee8d6dd3b
fix: set outlines version to 0.1.3 to avoid caching serialization issue (#2766)
fix: set outlines version to 0.1.3 to avoid bug
2024-11-20 18:09:39 -05:00
Daniël de Kok
07bed530f7
nix: build and cache impure devshells (#2765)
* nix: build and cache all devshells

* nix: add poetry to the impure shell

This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.

* Fix Nix build, disable pure shell (covered by Nix tests)
2024-11-20 20:56:11 +01:00
Daniël de Kok
46a5a7e73e
Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758)
This change adds support for wNa16 int checkpoints with 2:4 sparsity
using Marlin 2:4 kernels.
2024-11-20 18:25:23 +01:00
Daniël de Kok
2fda8845a7
nix: update for outlines 0.1.4 (#2764) 2024-11-20 18:24:29 +01:00
Daniël de Kok
45013b60a4
Install compressed-tensors in Docker CPU builds 2024-11-20 14:17:47 +00:00
drbh
bd6e8b3c13
fix: adjust llama MLP name from dense to mlp to correctly apply lora (#2760) 2024-11-19 15:10:22 -05:00
drbh
5489406c4a
PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAI's scheme (#2645)
* add OpenAI like tool_choice for named choice

* add tests

* fix: run linter and bump api docs

* fix: consolidate changes and remove old tool type

* feat: improve, simplify and rename the tool choice struct; add required support and refactor

* fix: simplify tool choice logic, improve tests, openapi and rust docs

* fix: refactor away prepare_chat_input and improve tool grammar apply control flow

* feat: update docs and add tool choice configuration section

* fix: simplify naming, tool choice default and improve test

* fix: adjust tool choice none logic, add test and small refactors

* fix: add missing snapshot file

* fix: adjust tool choice type in test

* fix: adjust default when json tool choice is

* fix: remove trailing space lint after rebase

* fix: remove mostly mocked unit test

---------

Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
2024-11-19 13:31:59 -05:00
Daniël de Kok
2007a9473a
Update to moe-kernels 0.7.0 (#2720)
This version syncs with the vLLM kernels and brings some performance
improvements.
2024-11-19 14:55:29 +01:00
Daniël de Kok
b4ec427ad0
Simplify two ipex conditions (#2755) 2024-11-19 08:04:23 +01:00
drbh
38cff84a3e
feat: support flash attention 2 in qwen2 vl vision blocks (#2721)
* feat: support flash attention 2 in qwen2 vl vision blocks

* fix: calc max_seqlen once and small refactors
2024-11-18 12:46:40 -05:00
Daniël de Kok
3c9df21ff8
Add support for compressed-tensors w8a8 int checkpoints (#2745)
* Add support for compressed-tensors w8a8 int checkpoints

This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.

Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

|     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
|               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
|ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
|               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
|               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|

Which is in the same ballpark as vLLM.

As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int

It's far less flaky and gives better output.

* Use marlin-kernels 0.3.5

* Fix a typo

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Small fixes

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-11-18 17:20:31 +01:00
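Editor's note: the "dynamic input quantization" the commit switches to means the activation scales are computed at run time per token rather than calibrated ahead of time. The generic PyTorch sketch below shows that idea; it is not the cutlass/Marlin kernel path used in TGI.

```
# Generic sketch of dynamic (per-token) int8 activation quantization.
import torch

def quantize_per_token_int8(x: torch.Tensor):
    # scale chosen at run time from each row's max magnitude ("dynamic")
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
    return q, scales

x = torch.randn(4, 16)
q, s = quantize_per_token_int8(x)
print((q.float() * s - x).abs().max())  # small round-trip error
```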
Wang, Yi
a5ecd6e586
add ipex moe implementation to support Mixtral and PhiMoe (#2707)
* add ipex moe implementation to support Mixtral and PhiMoe

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update to ipex xpu 2.5

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* torch has xpu support in 2.5

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix oneapi basekit version

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-11-18 17:16:55 +01:00
drbh
fea62e928f
fix: improve find_segments via numpy diff (#2686) 2024-11-18 09:51:06 -05:00
Daniël de Kok
52e48739a5
Remove vLLM dependency for CUDA (#2751)
* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

Tested run (since we don't have paged attention in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning
2024-11-17 17:34:50 +01:00
drbh
6489f85269
feat: return streaming errors as an event formatted for openai's client (#2668)
* feat: return streaming errors as an event formatted for openai's client

* fix: propagate completions error events to stream

* fix: improve stream api error format and add status code

* fix: improve streaming error to include error_type

* Revert "fix: improve streaming error to include error_type"

This reverts commit 2b1a360b15.

* Reworked the implementation.

* Revert "Reworked the implementation."

This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.

* Small lifting.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-11-15 14:49:19 +01:00
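Editor's note: "an event formatted for openai's client" means the error is delivered on the stream as a server-sent event carrying a JSON error object. The sketch below shows one plausible shape; the exact field names TGI emits are assumptions based on the commit description, which mentions an error type and a status code.

```
# Hedged sketch: package an error as an SSE "data:" event with a JSON body.
import json

def error_event(message: str, error_type: str, status_code: int) -> str:
    payload = {"error": {"message": message, "type": error_type, "code": status_code}}
    return f"data: {json.dumps(payload)}\n\n"

# error_event("Request failed during generation", "generation_error", 500)
```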
Nicolas Patry
34a3bdedc3
Upgrading our deps. (#2750)
* Upgrading our deps.

* fixup.

* Fixup.
2024-11-15 14:03:27 +01:00