text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-09 15:05:24 +00:00

Author	SHA1	Message	Date
Daniël de Kok	622c9c367a	nix: build Torch against MKL and various other improvements (#2469 ) Updates tgi-nix input: - Move Torch closer to upstream by building against MKL. - Remove compute capability 8.7 from Torch (Jetson). - Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA). - Use nixpkgs configuration passed through by `tgi-nix`.	2024-09-25 06:11:21 +00:00
drbh	08834e0cfd	fix: improve regex expression (#2468 )	2024-09-25 06:11:21 +00:00
drbh	e80b2c21dc	fix: bump minijinja version and add test for llama 3.1 tools (#2463 ) * fix: support tojson and avoid message indexing issue in template * fix: prefer minijinja native methods and prefer workspace level dependency * fix: adjust comment typo	2024-09-25 06:11:21 +00:00
Nicolas Patry	6793b720ba	Fixing CI. (#2462 )	2024-09-25 06:11:21 +00:00
drbh	73ebbd05f8	Pr 2451 ci branch (#2454 ) * fix[router]: Fix tools not passed in chat template Signed-off-by: GitHub <noreply@github.com> * feat: improve default tool serialization and lints * feat: refactor tool logic to include notify_error in prompt and adjust typing * fix: adjust non tool template apply * fix: simplify tool grammar logic and improve schema * feat: avoid skip tool test and avoid empty tool prompts * fix: increase test client timeout for grammar compilation tests --------- Signed-off-by: GitHub <noreply@github.com> Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-09-25 06:10:59 +00:00
drbh	7aebb953e2	Fix: don't apply post layernorm in SiglipVisionTransformer (#2459 ) * Fix: don't apply post layernorm in SiglipVisionTransformer This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813). This also makes Siglip consistent with the existing Clip implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613 * fix: adjust pali gemma for post layer norm and small refactors --------- Co-authored-by: Travis Addair <tgaddair@gmail.com>	2024-09-25 06:10:59 +00:00
Daniël de Kok	92ac02e4f2	nix: add default package (#2453 ) The default package wraps the launcher and puts the server/router in the path. As a result, TGI can be started using something like: ``` nix run .# -- \ --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \ --port 8080 ```	2024-09-25 06:10:59 +00:00
Daniël de Kok	b7d1adc3e9	nix: add awq-inference-engine as server dependency (#2442 )	2024-09-25 06:10:59 +00:00
Nicolas Patry	6654c2d11b	Adding eetq to flake. (#2438 )	2024-09-25 06:10:59 +00:00
Daniël de Kok	a5af557359	nix: add `text-generation-benchmark` to pure devshell (#2431 ) nix: add text-generation-benchmark to pure devshell	2024-09-25 06:10:59 +00:00
Daniël de Kok	516392d790	nix: add pure server to flake, add both pure and impure devshells (#2430 ) * nix: pure server and support both pure and impure devShells * nix: remove unused poetry2nix input It is not wired up and we now have a pure server. * nix: add ipdb to impure devshell	2024-09-25 06:10:59 +00:00
Nicolas Patry	635dde8af9	Prefix caching (#2402 ) * Prefix caching WIP * Fixing prefix attention. * Fixing flashinfer import. * Fixing black. * Fixing medusa (still wrong outputs, but functional). * Just medusa values now. * Fixing medusa without prefix caching. * Fixing prefix caching. * Medusa requires reshaping. * Removing the logs. * Remove router.nix * Fixup: - Remove logs - Disable VLMs (they do not work) - Disable prefix caching when user wants prefill logprobs. * Update flake.lock --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-09-25 06:10:59 +00:00
Daniël de Kok	ddba272a66	nix: update to CUDA 12.4 (#2429 ) * Update to CUDA 12.4 * poetry2nix: follow tgi-nix nixpkgs	2024-09-25 06:10:59 +00:00
Nicolas Patry	cd208c5043	All integration tests back everywhere (too many failed CI). (#2428 ) * All integration tests back everywhere (too many failed CI). * Upgrade integration tests after 12.4 * Attempt to remove the specified compute cap. * Common arch list. * Punica uses raw ASM which is not valid on 9.0 apparently.	2024-09-25 06:10:59 +00:00
Hugo Larcher	53fdbe617d	doc: Add metrics documentation and add a 'Reference' section (#2230 ) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-25 06:10:13 +00:00
Nicolas Patry	11d25a4bd3	Fixing the CI.	2024-09-25 06:09:22 +00:00
Nicolas Patry	85df9fc2db	Further fixes. (#2426 ) * Further fixes. * Update the conftest to allow NaN (first logprob). * Fix the condition.	2024-09-25 06:09:22 +00:00
Vaibhav Srivastav	df0e650891	Improve the Consuming TGI + Streaming docs. (#2412 ) * Improve the Consuming TGI docs. * Fix erroneous update to . * add info about Open AI client. * More updates. * Apply suggestions from code review Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> * Suggestions from Lucain. * Update Gradio snippet. * Up. * Apply suggestions from code review Co-authored-by: Lucain <lucainp@gmail.com> * Update docs/source/basic_tutorials/consuming_tgi.md Co-authored-by: Lucain <lucainp@gmail.com> * Up. * Apply suggestions from code review Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Up. * Up. * Doc review from Nico. * Doc review from Nico. x2 * Last nit --------- Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>	2024-09-25 06:08:38 +00:00
Daniël de Kok	20ed7b598e	nix: try to reduce the number of Rust rebuilds (#2424 ) Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.	2024-09-25 06:08:38 +00:00
Nicolas Patry	f0181ed2d7	Upgrading the tests to match the current workings. (#2423 )	2024-09-25 06:08:38 +00:00
Nicolas Patry	df6ea89da9	Fixing exl2 and other quantize tests again. (#2419 ) * Fixing exl2 and other quantize tests again. * Mark exl2 as non release (so CI tests them, needs to be removed later). * Fixing exl2 (by disabling cuda graphs) * Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it). * Removing serde override. * Go back to released exl2 and remove log. * Adding warnings for deprecated bitsandbytes + upgrade info to warn.	2024-09-25 06:08:38 +00:00
Daniël de Kok	e5c39a5545	nix: build router incrementally (#2422 )	2024-09-25 06:08:00 +00:00
Funtowicz Morgan	c3401e0b99	More fixes trtllm (#2342 ) * (backend) use parking_lot crate for RwLock fairness * (docker) let's put rust in the TRTLLM folder when building * (docker) build ompi with SLURM support * (launcher) default new server::run parameters to false for now * (chore) fmt ... why?	2024-09-25 06:08:00 +00:00
Nicolas Patry	4baa6ff59f	Upgrading exl2. (#2415 ) * Upgrading exl2. * Fixing the other pathways. * Fix idefics.	2024-09-25 06:07:40 +00:00
Daniël de Kok	bae161ab84	nix: partial incremental build of the router (#2416 ) This is less incremental than crate2nix, but does build all dependencies separately, so avoids full rebuilds.	2024-09-25 06:06:17 +00:00
drbh	ffc8fb0850	fix: adds causal to attention params (#2408 ) fix: adds causal to attention params to check when using flash attn v1	2024-09-25 06:06:17 +00:00
Wang, Yi	7a4d831d17	add numa to improve cpu inference perf (#2330 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:06:17 +00:00
Nicolas Patry	c5e4c1877b	Adding more kernels to flake. (#2411 )	2024-09-25 06:06:17 +00:00
Daniël de Kok	eb561bb715	nix: incremental build of the launcher (#2410 )	2024-09-25 06:06:17 +00:00
drbh	10b2be6536	fix: include create_exllama_buffers and set_device for exllama (#2407 )	2024-09-25 06:06:17 +00:00
drbh	1f8c0f83e3	Pr 2395 ci run (#2406 ) * fix(router): Fix appending to message content * feat: add message and chat template test --------- Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-09-25 06:06:17 +00:00
Nicolas Patry	18d6be6af4	Updating the flake. (#2404 )	2024-09-25 06:06:17 +00:00
drbh	96e8fa37b0	fix: improve completions to send a final chunk with usage details (#2336 ) * fix: improve completions to send a final chunk with usage details * fix: include finish reason string * fix: remove dev debug trait and unneeded mut * fix: update openapi schema	2024-09-25 06:06:17 +00:00
drbh	3079865b60	fix: allocate tmp based on sgmv kernel if available (#2345 ) * fix: allocate tmp based on sgmv kernel if available * fix: re add copy build artifacts step for punica kernels	2024-09-25 06:06:17 +00:00
drbh	8e6bfa2fc5	feat: validate template variables before apply and improve sliding wi… (#2403 ) * feat: validate template variables before apply and improve sliding window check * fix: improve missing template var test	2024-09-25 06:05:43 +00:00
Nicolas Patry	6393cdee63	Keeping the benchmark somewhere (#2401 ) Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-09-25 06:05:43 +00:00
Daniël de Kok	f586cc7f0c	Add support for prefix caching to the v3 router (#2392 ) This change adds support for prefix caching to the v3 router. This is broken up from the backend support to ease reviewing. For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`. In this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen prior. If a new prefill is a prefix of a previously-seen prefill, the router will send a request with `prefix_len>0`, which can be used by the backend to decide to reuse KV blocks from the cache, rather than recomputing them. Even though backend support is not added in this PR, the backend will still work with prefix caching enabled. The prefix lengths are just ignored and not used.	2024-09-25 06:05:08 +00:00
Wang, Yi	b8efd6d00c	Cpu dockerimage (#2367 ) add intel-cpu docker image Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:05:08 +00:00
Nicolas Patry	1daaddd072	Fixing import exl2 (#2399 )	2024-09-25 06:04:51 +00:00
Nicolas Patry	fbe59c6267	Adding launcher to build. (#2397 )	2024-09-25 06:04:51 +00:00
Nicolas Patry	8750dc878e	Upgrade fbgemm (#2398 ) * Upgrade fbgemm * Fix fbgemm version	2024-09-25 06:04:51 +00:00
Daniël de Kok	197dd3af12	nix: add router to the devshell (#2396 )	2024-09-25 06:04:51 +00:00
Daniël de Kok	bb833389e0	Update flake for 9.0a capability in Torch (#2394 )	2024-09-25 06:04:51 +00:00
drbh	959add5e9b	feat: add guideline to chat request and template (#2391 ) * feat: add guideline to chat request and template * fix: add template test and update docs	2024-09-25 06:04:51 +00:00
Nicolas Patry	849bd93dc3	Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backends (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-09-25 06:04:51 +00:00
Daniël de Kok	df719fd527	flake: use rust-overlay (#2390 )	2024-09-25 06:04:51 +00:00
Vaibhav Srivastav	1d4a35a23c	Update documentation for Supported models (#2386 ) * Minor doc fixes * up. * Other minor updates.	2024-09-25 06:04:51 +00:00
Daniël de Kok	e9ba044250	flake: add fmt and clippy (#2389 )	2024-09-25 06:03:56 +00:00
Nicolas Patry	afa14b7595	Using HF_HOME instead of CACHE to get token read in addition to models. (#2288 )	2024-09-25 06:03:56 +00:00
Daniël de Kok	dc0fa60f55	Add experimental flake (#2384 ) Add flake.nix	2024-09-25 06:01:59 +00:00


1109 Commits
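
Note on commit f586cc7f0c: its message describes a router-side allocator that tracks previously seen prefills in a radix trie and reports a `prefix_len` for new requests. The sketch below only illustrates that lookup idea and is not the TGI implementation. `PrefixTrie`, `insert`, and `prefix_len` are hypothetical names, the structure is a plain (uncompressed) trie over token IDs, and block allocation, eviction, and reference counting are all left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class _Node:
    # Maps the next token ID to the child node reached by that token.
    children: dict[int, _Node] = field(default_factory=dict)


class PrefixTrie:
    """Hypothetical, simplified stand-in for a radix-trie prefix cache."""

    def __init__(self) -> None:
        self._root = _Node()

    def insert(self, tokens: list[int]) -> None:
        """Record a prefill (a sequence of token IDs) as seen."""
        node = self._root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def prefix_len(self, tokens: list[int]) -> int:
        """Return how many leading tokens match a previously seen prefill."""
        node = self._root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


if __name__ == "__main__":
    trie = PrefixTrie()
    trie.insert([1, 2, 3, 4])           # first request: nothing cached yet
    print(trie.prefix_len([1, 2, 3]))   # fully covered by the earlier prefill
    print(trie.prefix_len([1, 2, 9]))   # shares only the first two tokens
    print(trie.prefix_len([7, 8]))      # no shared prefix
```

A non-zero result corresponds to the `prefix_len>0` case in the commit message, where a backend could reuse cached KV blocks for the matched tokens instead of recomputing them.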