Commit Graph

138 Commits

Author SHA1 Message Date
Nicolas Patry
69c20a9d3f
Tmate: let's find it with ldconfig? 2024-09-16 17:03:28 +02:00
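(The run of commits below is a live CI debugging session: tmate drops an SSH shell onto the runner, and each commit tweaks the NVIDIA driver/library setup and re-runs. A minimal sketch of the kind of probe being run in such a session, assuming only that `ldconfig` exists on the runner:)

```rust
use std::process::Command;

fn main() {
    // Ask the dynamic linker cache which shared libraries it knows about,
    // then keep only the NVIDIA entries the surrounding commits are hunting.
    let out = Command::new("ldconfig")
        .arg("-p")
        .output()
        .expect("ldconfig not found on this runner");
    for line in String::from_utf8_lossy(&out.stdout).lines() {
        if line.contains("libnvidia-ml") || line.contains("libcuda") {
            println!("{line}");
        }
    }
}
```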
Nicolas Patry
c784cb401d
Let's try a compat driver? 2024-09-16 17:03:28 +02:00
Nicolas Patry
fe533dc57b
Back to failing version 2024-09-16 17:03:28 +02:00
Nicolas Patry
2f1f082abe
Tmate. 2024-09-16 17:03:28 +02:00
Nicolas Patry
1a6b9926f6
missing lib. 2024-09-16 17:03:27 +02:00
Nicolas Patry
332e42f59a
Attempt. 2024-09-16 17:03:27 +02:00
Nicolas Patry
ec6fe324c6
Link to nix owned lib 2024-09-16 17:03:27 +02:00
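(The commits around here iterate on making a Nix-built binary and the host's NVIDIA driver libraries see each other. One plausible shape of that workaround, sketched below: symlink the host driver library into a directory the binary already searches. Both paths are hypothetical and only illustrate the technique.)

```rust
use std::os::unix::fs::symlink;

fn main() -> std::io::Result<()> {
    // Hypothetical paths: the host's driver library, and a lib directory
    // that is already on the Nix-built binary's library search path.
    let host_lib = "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1";
    let nix_visible = "/some/nix/profile/lib/libnvidia-ml.so.1";
    symlink(host_lib, nix_visible)
}
```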
Nicolas Patry
83ee55a617
Try something. 2024-09-16 17:03:27 +02:00
Nicolas Patry
047530216c
No idea where the shared disk is. 2024-09-16 17:03:27 +02:00
Nicolas Patry
9f548fa82a
Change the home location? 2024-09-16 17:03:27 +02:00
Nicolas Patry
3ff12084b7
Revert "No tmate."
This reverts commit 6b9b6d951897127ae1ce09c8f61f86a64b301fec.
2024-09-16 17:03:26 +02:00
Nicolas Patry
26634f9697
No tmate. 2024-09-16 17:03:26 +02:00
Nicolas Patry
a533d086f0
Tmate to find cache. 2024-09-16 17:03:26 +02:00
Nicolas Patry
a5b81ab457
Home. 2024-09-16 17:03:26 +02:00
Nicolas Patry
98f2241a88
Put back libnvidia-ml 2024-09-16 17:03:26 +02:00
Nicolas Patry
72a805d50d
Remove tmate. 2024-09-16 17:03:26 +02:00
Nicolas Patry
45c0129976
Attempting something. 2024-09-16 17:03:25 +02:00
Nicolas Patry
2b18537f85
More tmate. 2024-09-16 17:03:25 +02:00
Nicolas Patry
12b88204b0
Putting the cuda package in the flake. 2024-09-16 17:03:25 +02:00
Nicolas Patry
d7333830b5
Tmate. 2024-09-16 17:03:25 +02:00
Nicolas Patry
c4bbe06bf1
Simpler command 2024-09-16 17:02:45 +02:00
Nicolas Patry
d0ae24a167
Release tests. 2024-09-16 17:02:25 +02:00
Nicolas Patry
5c4b2eaa30
Seeing the damage on the release tests. 2024-09-16 17:01:51 +02:00
Nicolas Patry
70f910bba6
Remove tmate. 2024-09-16 17:01:51 +02:00
Nicolas Patry
5adece6313
This doesn't seem needed. 2024-09-16 17:01:51 +02:00
Nicolas Patry
b7cb8d5145
Let's figure out the issue... 2024-09-16 17:01:30 +02:00
Nicolas Patry
3d7b81535a
Only link cuda driver libraries. 2024-09-16 17:01:30 +02:00
Nicolas Patry
ce3efc83ed
Remove tmate. 2024-09-16 17:01:30 +02:00
Nicolas Patry
7f58f7dc61
Symlink all the things. 2024-09-16 17:01:29 +02:00
Nicolas Patry
42107de71f
Let's try to find libnvidia-ml 2024-09-16 17:01:29 +02:00
Nicolas Patry
edaa7f847d
Does this work? 2024-09-16 17:01:29 +02:00
Nicolas Patry
d1e79ddae0
Fix override. 2024-09-16 17:01:29 +02:00
Nicolas Patry
db054b95df
Check the paths. 2024-09-16 17:01:29 +02:00
Nicolas Patry
afcd047a58
Yaml yaml. 2024-09-16 17:01:29 +02:00
Nicolas Patry
60db294f9a
Link cuda to nix? 2024-09-16 17:01:28 +02:00
Nicolas Patry
8e7c7c61f1
Let's see what the issue is? 2024-09-16 17:01:28 +02:00
Nicolas Patry
c227345878
Run on actual GPUs. 2024-09-16 17:01:28 +02:00
Nicolas Patry
f47cdc1fe1
Quickly attempting the integration tests. 2024-09-16 17:01:26 +02:00
Nicolas Patry
d95c670ada
Add nix test. (#2513)
* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try this.

* Our runner + pure test (not written)

* Remove server.

* Root user.

* Different user?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.
2024-09-12 14:54:56 +02:00
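(The last two bullets concern where the test models get cached between runs. A sketch of the standard Hugging Face Hub cache resolution the runner has to agree with; the environment variables are the standard Hub ones, and the fallback chain here is an assumption about what the test needs:)

```rust
use std::env;
use std::path::PathBuf;

// Resolve the model cache the way the Hugging Face Hub tooling does:
// HUGGINGFACE_HUB_CACHE first, then HF_HOME/hub, then the default.
fn model_cache_dir() -> PathBuf {
    if let Ok(dir) = env::var("HUGGINGFACE_HUB_CACHE") {
        return PathBuf::from(dir);
    }
    if let Ok(hf_home) = env::var("HF_HOME") {
        return PathBuf::from(hf_home).join("hub");
    }
    let home = env::var("HOME").unwrap_or_else(|_| ".".to_string());
    PathBuf::from(home).join(".cache/huggingface/hub")
}

fn main() {
    println!("models cached in: {}", model_cache_dir().display());
}
```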
Nicolas Patry
dae3bf1d87
Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe?

* List stuff.

* Monkey it up.

* Have no idea at this point.

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD?

* Forcing 3.11?
2024-09-11 22:41:56 +02:00
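(Most of the churn in this PR is the runner failing to locate libpython at runtime, which the Dockerfile change finally fixes. A sketch of the kind of check that settles such a failure, assuming nothing beyond LD_LIBRARY_PATH being the linker's search override:)

```rust
use std::env;
use std::fs;

fn main() {
    // Walk LD_LIBRARY_PATH and report which entries actually contain a
    // libpython*, i.e. whether the runtime environment can discover it.
    let Ok(paths) = env::var("LD_LIBRARY_PATH") else {
        println!("LD_LIBRARY_PATH is not set");
        return;
    };
    for dir in env::split_paths(&paths) {
        let found = fs::read_dir(&dir)
            .map(|entries| {
                entries
                    .flatten()
                    .any(|e| e.file_name().to_string_lossy().starts_with("libpython"))
            })
            .unwrap_or(false);
        println!("{}: libpython {}", dir.display(), if found { "found" } else { "missing" });
    }
}
```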
Nicolas Patry
e415b690a6
Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and use FD everywhere.

* Update cargo lock?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for fewer errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` so chat input gets the correct tokens and nothing extra (super important with the prefixing now).

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to the head_size modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle cases where the common prefix is smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-08-29 16:29:01 +02:00
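(Several bullets above revolve around `add_special_tokens` for chat input: once prefix caching keys on exact token prefixes, accidentally re-adding BOS/EOS breaks every cache hit. A minimal sketch with the `tokenizers` crate; the file path is hypothetical:)

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical local file; any HF-format tokenizer.json works here.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // A rendered chat template already carries its special tokens, so the
    // second encode (add_special_tokens = false) is the one that keeps
    // prefixes byte-identical across requests.
    let with = tokenizer.encode("Hello there", true)?;
    let without = tokenizer.encode("Hello there", false)?;
    println!("{} vs {} ids", with.get_ids().len(), without.get_ids().len());
    Ok(())
}
```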
Nicolas Patry
2788d41a76
Fixing CI. (#2462) 2024-08-27 15:33:02 +02:00
Nicolas Patry
e4201f44cf
All integration tests back everywhere (too many CI failures). (#2428)
* All integration tests back everywhere (too many CI failures).

* Upgrade integration tests after 12.4

* Attempt to remove the specified compute cap.

* Common arch list.

* Punica uses raw ASM which is not valid on 9.0 apparently.
2024-08-16 21:19:46 +02:00
Hugo Larcher
53729b74ac
doc: Add metrics documentation and add a 'Reference' section (#2230)
* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-16 19:43:30 +02:00
Wang, Yi
b6bb1d5160
CPU Docker image (#2367)
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-08-12 14:10:30 +02:00
Daniël de Kok
22fb1be588
Fix cache block size for flash decoding (#2351)
* Fix cache block size for flash decoding

This seems to have been accidentally dropped during the TRT-LLM
PR rebase.

* Also run CI on changes to `backends`
2024-08-01 15:38:57 +02:00
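(For context on the fix: the number of KV-cache blocks is derived by dividing the free memory budget by the per-block footprint, so silently using the paged-attention block size where flash decoding expects its own miscounts the cache. A sketch with purely illustrative numbers — 32 layers, 8 KV heads, head_dim 128, fp16 are all hypothetical:)

```rust
// Illustrative only: how a KV-cache memory budget turns into a block count.
fn num_blocks(
    free_bytes: usize,
    block_size: usize,
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    dtype_bytes: usize,
) -> usize {
    // 2 = one K and one V entry per token slot.
    let bytes_per_block = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes;
    free_bytes / bytes_per_block
}

fn main() {
    // e.g. 20 GiB free; flash decoding and paged attention want different
    // block sizes, which is exactly what the fix restores.
    let free = 20usize * 1024 * 1024 * 1024;
    println!("flashdecoding blocks: {}", num_blocks(free, 256, 32, 8, 128, 2));
    println!("paged blocks:         {}", num_blocks(free, 16, 32, 8, 128, 2));
}
```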
Nicolas Patry
2b19d671b4
Rebase TRT-llm (#2331)
* wip

wip

refactor

refactor

Initial setup for CXX binding to TRTLLM

Working FFI call for TGI and TRTLLM backend

Remove unused parameters and force tokenizer name to be set

Overall build TRTLLM and deps through CMake build system

Enable end to end CMake build

First version loading engines and making it ready for inference

Remembering to check how we can detect support for chunked context

Move to latest TensorRT-LLM version

Specify which default log level to use depending on CMake build type

make leader executor mode work

unconditionally call InitializeBackend on the FFI layer

bind to CUDA::nvml to retrieve compute capabilities at runtime

updated logic and comment to detect cuda compute capabilities

implement the Stream method to send new tokens through a callback

use spdlog release 1.14.1 moving forward

update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c

correctly tell cmake to build the libraries tensorrt-llm depends on

create cmake install target to put everything relevant in installation folder

add auth_token CLI argument to provide hf hub authentication token

allow converting huggingface::tokenizers error to TensorRtLlmBackendError

use correct include for spdlog

include guard to build example in cmakelists

working setup of the ffi layer

remove fmt import

use external fmt lib

end to end ffi flow working

make sure to track include/ffi.h to trigger rebuild from cargo

impl the rust backend which currently cannot move the actual computation to a background thread

expose shutdown function at ffi layer

impl RwLock scenario for TensorRtLlmBackend

oops missing c++ backend definitions

compute the maximum number of new tokens for each request independently

make sure the context is not dropped in the middle of the async decoding.

remove unnecessary log

add all the necessary plumbing to return the generated content

update invalid doc in cpp file

correctly forward back the log probabilities

remove unneeded scope variable for now

refactor Stream impl for Generation to factorise code

expose the internal missing start/queue timestamp

forward tgi parameters rep/freq penalty

add some more validation about grammar not supported

define a shared struct to hold the result of a decoding step

expose information about potential error happening while decoding

remove logging

add logging in case of decoding error

make sure executor_worker is provided

add initial Dockerfile for TRTLLM backend

add some more information in CMakeLists.txt to correctly install executorWorker

add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper

simplify prebuilt trtllm libraries name definition

do the same name definition stuff for tensorrt_llm_executor_static

leverage pkg-config to probe libraries paths and reuse new install structure from cmake

fix bad copy/paste missing nvinfer linkage direction

align all the linker search dependencies

add missing pkgconfig folder for MPI in Dockerfile

correctly setup linking search path for runtime layer

fix missing / before tgi lib path

adding missing ld_library_path for cuda stubs in Dockerfile

update tgi entrypoint

commenting out Python part for TensorRT installation

refactored docker image

move to TensorRT-LLM v0.11.0

make docker linter happy with same capitalization rule

fix typo

refactor the compute capabilities detection along with num gpus

update TensorRT-LLM to latest version

update TensorRT install script to latest

update build.rs to link to cuda 12.5

add missing dependant libraries for linking

clean up a bit

install to decoder_attention target

add some custom stuff for nccl linkage

fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time

use std::env::consts::ARCH

make sure the variable lives long enough...

look for cuda 12.5

add some more basic info in README.md

* Rebase.

* Fix autodocs.

* Let's try to enable trtllm backend.

* Ignore backends/v3 by default.

* Fixing client.

* Fix makefile + autodocs.

* Updating the schema thing + redocly.

* Fix trtllm lint.

* Adding pb files?

* Remove cargo fmt temporarily.

* ?

* Tmp.

* Remove both check + clippy?

* Backporting telemetry.

* Backporting 457fb0a1

* Remove PB from git.

* Fixing PB with default member backends/client

* update TensorRT-LLM to latest version

* provided None for api_key

* link against libtensorrt_llm and not libtensorrt-llm

---------

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
2024-07-31 10:33:10 +02:00
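(One pair of bullets in the wall above — "fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time" and "use std::env::consts::ARCH" — hides a classic build.rs pitfall: `CARGO_CFG_TARGET_ARCH` only exists in the environment Cargo gives build scripts, not when the built binary runs. A small sketch of the distinction:)

```rust
fn main() {
    // Present while Cargo runs build scripts; usually absent at runtime,
    // which is the bug the PR describes.
    match std::env::var("CARGO_CFG_TARGET_ARCH") {
        Ok(arch) => println!("CARGO_CFG_TARGET_ARCH = {arch}"),
        Err(_) => println!("CARGO_CFG_TARGET_ARCH not set (runtime)"),
    }
    // Baked in at compile time and always available: the replacement used.
    println!("std::env::consts::ARCH = {}", std::env::consts::ARCH);
}
```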
Adrien
fd2e06316d
fix: fix buildkit config in ci
Signed-off-by: Adrien <adrien@huggingface.co>
2024-07-29 09:25:56 +02:00
Adrien
3905f854ed
Fix registry name (#2307) 2024-07-25 16:06:00 +02:00
Nicolas Patry
26614057a7
Using g6 instead of g5. (#2281)
* Using g6 instead of g5.

* Update the idefics2 snapshot.
2024-07-25 11:21:17 +02:00