text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-12 08:25:22 +00:00

Author	SHA1	Message	Date
Adrien Gallouët	051ff2d5ce	Rename bindings Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-05 11:21:41 +00:00
Adrien Gallouët	dbee804129	Simplify batching logic Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-05 10:12:39 +00:00
Adrien Gallouët	d3a772a8dd	Update args Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-05 10:10:38 +00:00
Adrien Gallouët	df2a4fbb8a	Update Dockerfile_llamacpp Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	d883109df6	Disable graceful shutdown in debug mode Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	38b33e9698	Add --type-v & --type-k Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	bfb8e03e9f	Add specific args for batch Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	ea28332bb3	Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	104a968d01	Remove warmup Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	8ed362d03a	Clear request cache after completion Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	c8505fb300	Auto-detect n_threads when not provided Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	27534d8ee4	Fix seq iterations Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	96434a1e7e	Fix batching Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	2a51e415ff	Output real logprobs Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	161280f313	Only export the latest logits Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Morgan Funtowicz	960c12bd6e	backend(llama): add CUDA Dockerfile_llamacpp for now	2025-02-04 13:32:58 +00:00
Adrien Gallouët	f38c34aeb7	Fix batch_pos Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	e88a527fcf	Add --offload-kqv Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	ae5bb789c2	Enable flash attention by default Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	3f199134f0	Fix args Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	7a3ed4171e	Add --numa Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	390f0ec061	Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	d6ded897a8	Add a stupid batch mechanism Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	e07835c5b5	Add --defrag-threshold Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	f388747985	Add GPU args Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	8d2dfdf668	Handle ctx args & fix sampling Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	a7b4b04cb5	Add some input validation checks Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	e7facf692f	Handle max_batch_size Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	3eb4823f3e	Use max_batch_total_tokens Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	bd0cc9905c	Get rid of llama_batch_get_one() Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	95e221eece	Add llamacpp backend Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:56 +00:00
Hugo Larcher	73b7cf83f6	Add backend name to telemetry (#2962 ) * feat: Add backend name to telemetry	2025-01-28 16:53:16 +01:00
Funtowicz Morgan	40b00275b2	Attempt to remove AWS S3 flaky cache for sccache (#2953 ) * backend(trtllm): attempt to remove AWS S3 flaky cache for sccache * backend(trtllm): what if we expose ENV instead of inline? * backend(trtllm): and with the right env var for gha sccache * backend(trtllm): relax the way to detect sccache * backend(trtllm): make sccache definition manually * backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present * backend(trtllm): export env variable in run mb? * backend(trtllm): Cache mode max to cache intermediate layers * backend(trtllm): inject ompi_version build arg in dependent step	2025-01-27 11:21:48 +01:00
Funtowicz Morgan	0a89902663	[TRTLLM] Expose finish reason (#2841 ) * feat(trtllm): expose finish reason to Rust * misc(llamacpp): fix typo * misc(backend): update deps	2025-01-23 16:48:26 +01:00
Funtowicz Morgan	cc212154e0	Bump TensorRT-LLM backend dependency to v0.16.0 (#2931 ) * backend(trtllm): update to 0.16.0 * backend(trtllm): do not use shallow clone * backend(trtllm): use tag instead * backend(trtllm): move to nvidia remote instead of hf * backend(trtllm): reenable shallow clone * backend(trtllm): attempt to use ADD instead of RUN for openmpi * backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile * backend(trtllm): add correctly untar it	2025-01-23 13:54:40 +01:00
Alvaro Bartolome	64a33c1f05	Run `pre-commit run --all-files` to fix CI (#2933 )	2025-01-21 17:33:33 +01:00
Funtowicz Morgan	17367438f3	Give TensorRT-LLMa proper CI/CD 😍 (#2886 ) * test(ctest) enable address sanitizer * feat(trtllm): expose finish reason to Rust * feat(trtllm): fix logits retrieval * misc(ci): enabe building tensorrt-llm * misc(ci): update Rust action toolchain * misc(ci): let's try to build the Dockerfile for trtllm # Conflicts: # Dockerfile_trtllm * misc(ci): provide mecanism to cache inside container * misc(ci): export aws creds as output of step * misc(ci): let's try this way * misc(ci): again * misc(ci): again * misc(ci): add debug profile * misc(ci): add debug profile * misc(ci): lets actually use sccache ... * misc(ci): do not build with ssl enabled * misc(ci): WAT * misc(ci): WAT * misc(ci): WAT * misc(ci): WAT * misc(ci): WAT * misc(backend): test with TGI S3 conf * misc(backend): test with TGI S3 conf * misc(backend): once more? * misc(backend): let's try with GHA * misc(backend): missing env directive * misc(backend): make sure to correctly set IS_GHA_BUILD=true in wf * misc(backend): ok let's debug smtg * misc(backend): WWWWWWWWWWWWWAAAAAAAA * misc(backend): kthxbye retry s3 * misc(backend): use session token * misc(backend): add more info * misc(backend): lets try 1h30 * misc(backend): lets try 1h30 * misc(backend): increase to 2h * misc(backend): lets try... * misc(backend): lets try... * misc(backend): let's build for ci-runtime * misc(backend): let's add some more tooling * misc(backend): add some tags * misc(backend): disable Werror for now * misc(backend): added automatic gha detection * misc(backend): remove leak sanitizer which is included in asan * misc(backend): forward env * misc(backend): forward env * misc(backend): let's try * misc(backend): let's try * misc(backend): again * misc(backend): again * misc(backend): again * misc(backend): again * misc(backend): again * misc(backend): fix sscache -> sccache * misc(backend): fix sscache -> sccache * misc(backend): fix sscache -> sccache * misc(backend): let's actually cache things now * misc(backend): let's actually cache things now * misc(backend): attempt to run the testS? * misc(backend): attempt to run the tests? * misc(backend): attempt to run the tests? * change runner size * fix: Correctly tag docker images (#2878) * fix: Correctly tag docker images * fix: Correctly tag docker images * misc(llamacpp): maybe? * misc(llamacpp): maybe? * misc(llamacpp): maybe? * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): go * misc(ci): go * misc(ci): go * misc(ci): use bin folder * misc(ci): make the wf callable for reuse * misc(ci): make the wf callable for reuse (bis) * misc(ci): make the wf callable for reuse (bis) * misc(ci): give the wf a name * Create test-trtllm.yml * Update test-trtllm.yml * Create build-trtllm2 * Rename build-trtllm2 to 1-build-trtllm2 * Rename test-trtllm.yml to 1-test-trtllm2.yml * misc(ci): fw secrets * Update 1-test-trtllm2.yml * Rename 1-build-trtllm2 to 1-build-trtllm2.yml * Update 1-test-trtllm2.yml * misc(ci): use ci-build.yaml as main dispatcher * Delete .github/workflows/1-test-trtllm2.yml * Delete .github/workflows/1-build-trtllm2.yml * misc(ci): rights? * misc(ci): rights? * misc(ci): once more? * misc(ci): once more? * misc(ci): baby more time? * misc(ci): baby more time? * misc(ci): try the permission above again? * misc(ci): try the permission above again? * misc(ci): try the permission scoped again? * misc(ci): install tensorrt_llm_executor_static * misc(ci): attempt to rebuild with sccache? * misc(ci):run the tests on GPU instance * misc(ci): let's actually setup sccache in the build.rs * misc(ci): reintroduce variables * misc(ci): enforce sccache * misc(ci): correct right job name dependency * misc(ci): detect dev profile for debug * misc(ci): detect gha build * misc(ci): detect gha build * misc(ci): ok debug * misc(ci): wtf * misc(ci): wtf2 * misc(ci): wtf3 * misc(ci): use commit HEAD instead of merge commit for image id * misc(ci): wtfinfini * misc(ci): wtfinfini * misc(ci): KAMEHAMEHA * Merge TRTLLM in standard CI * misc(ci): remove input machine * misc(ci): missing id-token for AWS auth * misc(ci): missing id-token for AWS auth * misc(ci): missing id-token for AWS auth * misc(ci): again... * misc(ci): again... * misc(ci): again... * misc(ci): again... * misc(ci): missing benchmark * misc(ci): missing backends * misc(ci): missing launcher * misc(ci): give everything aws needs * misc(ci): give everything aws needs * misc(ci): fix warnings * misc(ci): attempt to fix sccache not building trtllm * misc(ci): attempt to fix sccache not building trtllm again --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com> Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>	2025-01-21 10:19:16 +01:00
drbh	8f6146f11a	Revert "feat: improve qwen2-vl startup " (#2924 ) Revert "feat: improve qwen2-vl startup (#2802)" This reverts commit `eecca27113`.	2025-01-17 12:09:05 -05:00
drbh	eecca27113	feat: improve qwen2-vl startup (#2802 ) * feat: tokenize each request individually and increase warmup image size * feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller * fix: address image resize and rebase changes * feat: update to run qwen2-vl tests * fix: tweak param types	2025-01-17 11:50:41 -05:00
Nicolas Patry	203cade244	Upgrading our rustc version. (#2908 ) * Upgrading our rustc version. * Fixing the rust tests to proper version. * Clippy everything.	2025-01-15 17:04:03 +01:00
Dmitry Dygalo	01067f8ba8	chore: Update jsonschema to 0.28.0 (#2870 ) * chore: Update jsonschema to 0.28.0 Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev> * chore: Enable blocking feature for reqwest Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev> --------- Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>	2025-01-10 15:01:54 +01:00
drbh	a72f339c79	fix: lint backend and doc files (#2850 )	2024-12-16 16:12:34 -05:00
Nicolas Patry	11ab329883	Fixing CI. (#2846 )	2024-12-16 10:58:15 +01:00
Funtowicz Morgan	ea7f4082c4	TensorRT-LLM backend bump to latest version + misc fixes (#2791 ) * misc(cmake) update dependencies * feat(hardware) enable new hardware.hpp and unittests * test(ctest) enable address sanitizer * feat(backend): initial rewrite of the backend for simplicity * feat(backend): remove all the logs from hardware.hpp * feat(backend): added some logging * feat(backend): enable compiler warning if support for RVO not applying * feat(backend): missing return statement * feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder * feat(backend): delete previous backend impl * feat(backend): more impl * feat(backend): use latest trtllm main version to have g++ >= 13 compatibility * feat(backend): allow overriding which Python to use * feat(backend): fix backend_exception_t -> backend_error_t naming * feat(backend): impl missing generation_step_t as return value of pull_tokens * feat(backend): make backend_workspace_t::engines_folder constexpr * feat(backend): fix main.rs retrieving the tokenizer * feat(backend): add guard to multiple header definitions * test(backend): add more unittest * feat(backend): remove constexpr from par * feat(backend): remove constexpig * test(backend): more test coverage * chore(trtllm): update dependency towards 0.15.0 * effectively cancel the request on the executor * feat(backend) fix moving backend when pulling * feat(backend): make sure we can easily cancel request on the executor * feat(backend): fix missing "0" field access * misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut * chore: Add doc and CI for TRTLLM (#2799) * chore: Add doc and CI for TRTLLM * chore: Add doc and CI for TRTLLM * chore: Add doc and CI for TRTLLM * chore: Add doc and CI for TRTLLM * doc: Formatting * misc(backend): indent --------- Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>	2024-12-13 15:50:59 +01:00
Nicolas Patry	82c24f7420	Using both value from config as they might not be correct. (#2817 ) * Using both value from config as they might not be correct. * Fixing max_position_embeddings for falcon. * Simple attempt to fix the healthcheck block allocation. * Much simpler solution. * Default value for Backend start_health	2024-12-10 19:37:09 +01:00
OlivierDehaene	8c3669b287	feat: auto max_new_tokens (#2803 ) * feat: auto max_new_tokens * update default * Fixing the tests. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-12-06 05:50:35 +01:00
OlivierDehaene	ab7ccf5bc3	feat: add payload limit (#2726 ) * feat: add payload limit * update launcher	2024-11-21 18:20:15 +00:00
OlivierDehaene	8e0c161d0a	fix: incomplete generations w/ single tokens generations and models that did not support chunking (#2770 ) * Incomplete generation stream fix (#2754) entries.len() could > batch.size in prefill, so need to filter as well. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * entries was wrongly extended for model that did not support chunking --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi <yi.a.wang@intel.com>	2024-11-21 16:37:55 +00:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673 ) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
Funtowicz Morgan	ba5fc7d922	Add support for stop words in TRTLLM (#2678 ) * feat(trtllm): rewrite health to not account for current state * chore(looper): cleanup a bit more * feat(post_processing): max_new_tokens is const evaluated now * chore(ffi):formatting * feat(trtllm): add stop words handling # Conflicts: # backends/trtllm/lib/backend.cpp * chore(trtllm): create specific parallelconfig factory and logging init methods * chore(trtllm): define a macro for SizeType cast * chore(trtllm): use GetParallelConfig * chore(trtllm): minor refactoring * chore(trtllm): validate there are enough GPus on the system for the desired model * chore(trtllm): ensure max throughput scheduling policy is selected * chore(trtllm): minor fix * chore(router): minor refactorings * feat(docker): build with-slurm ompi * feat(docker): add python3.10 dev to runtime deps * chore(docker): add mpi to ld_library_path * chore(docker): install transformers * feat(trtllm): detect stop_words from generation_config.json	2024-10-25 10:58:34 +02:00

74 Commits