text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-11 02:10:16 +00:00

Author	SHA1	Message	Date
Daniël de Kok	c1a99e2f15	Update to moe-kernels 0.3.1 (#2535 ) * Update to moe-kernels 0.3.1 * Attempt to fix apt failure	2024-09-25 06:19:20 +00:00
Nicolas Patry	2d470c8282	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-25 06:19:20 +00:00
Nicolas Patry	0110b83aff	Adding a test for FD. (#2516 ) * Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.	2024-09-25 06:17:09 +00:00
Nicolas Patry	f32fa568b6	Fix truffle (#2514 ) * Attempting to discard the trufflehog warning. * Attempt to fix trufflehog.	2024-09-25 06:15:35 +00:00
Nicolas Patry	510d1c76c8	Prefix test - Different kind of load test to trigger prefix test bugs. (#2490 ) * Adding prefix test. * [WIP] tmp dump of integration load tests. * Remove other tensor creation. * Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids. * Fix parsing * Is it really flashinfer version ? * Remove some comments. * Revert the max prefix hit. * Adding numpy to diff. * Upgraded flashinfer. * Upgrading some stuff. * Are we done yet ? * Minor fixup * Remove 1 log and put back the other. * Add comment for why slot 0 is OK. * Mounting on the job. * Get me a debug branch * Debugging CIs is fun. * Attempt #28 * wip * Tmate. * Praying. * Updating VLM causal model with updated context. * Important line got squashed. * Tmate again. * Fingers crossed. * We want only 1 run of integration tests..... --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>	2024-09-25 06:14:07 +00:00
Daniël de Kok	3e17cb7866	nix: add punica-kernels (#2477 ) Enables LoRA support.	2024-09-25 06:13:11 +00:00
Nicolas Patry	4e1ca8d7bd	Lots of improvements (Still 2 allocators) (#2449 ) * Making prefix/flashinfer the default and testing the full release tests. * Include flashinfer in the docker. * Using prebuilt. * Allowing window_left_size (dummy version). * Disabling flashinfer/prefix caching on odd head_dim * Disable prefix caching for lora. * More specific codes. * Update lock * Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere. * Update cargo lock ? * Upgrade to 1.80 because of bitstream... * Everywhere 1.80 * Forgot last default place. * Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com> * Updated flake lock * Tmp * Upgrade resolution system for less errors in resolution. * Remove lambda for cleaner function. * Handling debugger. * Override the env in server tests. * Is this enough to make it work ? * This seems to be working. * Downgrade some logs. * Fixing the default for vlm. * Don't enable prefix caching on VLM just yet. * Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) * Fixing prefix caching for flashdecoding. * Update all models. * Fixed flashinfer version. * add_special_tokens is internal only * Fixing seqlen with the new vlms. * Fixing the issue with `add_special_tokens` not being passed around. * Fixing the test. * Removing encoder_decoder (seq2seq). * Update the chat test. * Fixing the batching tokenization in flash causal lm. * Truncating left for radix purposes. * Oops this doesn't belong here. * Put back default pure shell. * Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room * Only n_heads / process_group.size() are necessary. * Revert the integration tests change (seem linked to head_size modification). * Adding error message when assert is violated. * Fixing the free algorithm to handle times where the common prefix is smaller. * Apply suggestions from code review Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Update server/text_generation_server/layers/attention/common.py Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Fix disabling prefix caching - Fix windowing checks. * Revert the Cohere tokenizer change (for now using a revision instead). * Fmt. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-09-25 06:13:11 +00:00
Daniël de Kok	622c9c367a	nix: build Torch against MKL and various other improvements (#2469 ) Updates tgi-nix input: - Move Torch closer to upstream by building against MKL. - Remove compute capability 8.7 from Torch (Jetson). - Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA). - Use nixpkgs configuration passed through by `tgi-nix`.	2024-09-25 06:11:21 +00:00
Daniël de Kok	b7d1adc3e9	nix: add awq-inference-engine as server dependency (#2442 )	2024-09-25 06:10:59 +00:00
Nicolas Patry	6654c2d11b	Adding eetq to flake. (#2438 )	2024-09-25 06:10:59 +00:00
Daniël de Kok	516392d790	nix: add pure server to flake, add both pure and impure devshells (#2430 ) * nix: pure server and support both pure and impure devShells * nix: remove unused poetry2nix input It is not wired up and we now have a pure server. * nix: add ipdb to impure devshell	2024-09-25 06:10:59 +00:00
Nicolas Patry	635dde8af9	Prefix caching (#2402 ) * Prefix caching WIP * Fixing prefix attention. * Fixing flashinfer import. * Fixing black. * Fixing medusa (still wrong outputs, but functional). * Just medusa values now. * Fixing medusa without prefix caching. * Fixing prefix caching. * Medusa requires reshaping. * Removing the logs. * Remove router.nix * Fixup: - Remove logs - Disable VLMs (they do not work) - Disable prefix caching when user wants prefill logprobs. * Update flake.lock --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-09-25 06:10:59 +00:00
Daniël de Kok	ddba272a66	nix: update to CUDA 12.4 (#2429 ) * Update to CUDA 12.4 * poetry2nix: follow tgi-nix nixpkgs	2024-09-25 06:10:59 +00:00
Daniël de Kok	20ed7b598e	nix: try to reduce the number of Rust rebuilds (#2424 ) Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.	2024-09-25 06:08:38 +00:00
Daniël de Kok	e5c39a5545	nix: build router incrementally (#2422 )	2024-09-25 06:08:00 +00:00
Daniël de Kok	bae161ab84	nix: partial incremental build of the router (#2416 ) This is less incremental than crate2nix, but does build all dependencies separately, so avoids full rebuilds.	2024-09-25 06:06:17 +00:00
Nicolas Patry	c5e4c1877b	Adding more kernels to flake. (#2411 )	2024-09-25 06:06:17 +00:00
Daniël de Kok	eb561bb715	nix: incremental build of the launcher (#2410 )	2024-09-25 06:06:17 +00:00
Nicolas Patry	18d6be6af4	Updating the flake. (#2404 )	2024-09-25 06:06:17 +00:00
Daniël de Kok	bb833389e0	Update flake for 9.0a capability in Torch (#2394 )	2024-09-25 06:04:51 +00:00
Daniël de Kok	df719fd527	flake: use rust-overlay (#2390 )	2024-09-25 06:04:51 +00:00
Daniël de Kok	dc0fa60f55	Add experimental flake (#2384 ) Add flake.nix	2024-09-25 06:01:59 +00:00

22 Commits