We now manually evaluate the apparent hash of the neuron backend by
combining the hashes of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image SHA.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means fewer neuron models are pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but that package is not necessarily present on the system.
Instead, the export is now done through the neuron TGI image itself,
since it contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind), it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
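A rough sketch of the on-the-fly export idea using docker-py; the base
image tag, entrypoint, and export command below are illustrative
assumptions, not the actual fixture code:

```python
import io

import docker

client = docker.from_env()

# Build a one-off image whose entrypoint performs the export, so everything
# happens inside the container and no shared volume is needed under dind.
dockerfile = io.BytesIO(
    b"FROM ghcr.io/huggingface/text-generation-inference:latest-neuron\n"
    b'ENTRYPOINT ["optimum-cli", "export", "neuron"]\n'
)
image, _ = client.images.build(fileobj=dockerfile, tag="tgi-neuron-export:tmp")

# Run the export; the model id and export arguments are placeholders.
client.containers.run(
    image.id,
    command=[
        "--model", "Qwen/Qwen2.5-0.5B",
        "--batch_size", "1",
        "--sequence_length", "1024",
        "/data/exported_model",
    ],
    environment={"HF_TOKEN": "<token>"},
    remove=True,
)
```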
* feat: parse the HF_HUB_USER_AGENT_ORIGIN environment variable to add information about the environment running TGI; this is useful to track usage, for example in the case of collaborations.
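A minimal sketch of the idea; the exact user-agent format is an
assumption, only the environment variable name comes from the commit:

```python
import os


def hub_user_agent(base: str = "text-generation-inference") -> str:
    # Append the origin declared by the deployment, if any, so that
    # Hub-side usage can be attributed (e.g. in collaborations).
    origin = os.environ.get("HF_HUB_USER_AGENT_ORIGIN")
    return f"{base}; origin/{origin}" if origin else base
```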
* fix: trufflehog
* feat: support qwen2.5 vl model
* fix: bump supported models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
We found that on some machines, downloading the config with hf_hub::api::sync::Api fails, which makes warmup fail since attributes like max_position_embeddings cannot be retrieved. Updating hf-hub to the latest version fixes it.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
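The fix itself is on the Rust side (the hf_hub crate), but the dependency
it protects is easy to see in a Python analogue: warmup needs values such
as max_position_embeddings from the Hub-hosted config, so a failed
download breaks it (the model id is illustrative):

```python
import json

from huggingface_hub import hf_hub_download

# If this download fails, fields such as max_position_embeddings are
# unknown and warmup cannot size its requests correctly.
config_path = hf_hub_download("Qwen/Qwen2.5-0.5B", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(config["max_position_embeddings"])
```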
* Putting back the NCCL forced upgrade.
* Ignoring conda.
* Dropping conda from the build system + torch 2.6
* Cache min.
* Rolling back torch version.
* Reverting the EETQ modification.
* Fix flash attention?
* Actually stay on flash v1.
* Patching flash v1.
* Torch 2.6, fork of rotary, eetq updated.
* Put back nccl latest (override torch).
* Slightly more reproducible build and not as scary.
* fix Qwen VL break on the Intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use the PositionRotaryEmbedding impl so ROCm and IPEX both work
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
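The two commits above apply a simple pattern: resolve the Hub kernel once
at module import and keep only the cached call in the hot path. A minimal
sketch, assuming the hf-kernels get_kernel entry point and the
kernels-community/activation repository (names and call signature are
assumptions):

```python
import torch
from hf_kernels import get_kernel

# Resolved from the Hub once, at import time, not inside the hot loop.
activation = get_kernel("kernels-community/activation")


def fast_gelu(x: torch.Tensor) -> torch.Tensor:
    # Hot path: only the cached kernel call remains here.
    out = torch.empty_like(x)
    activation.gelu_fast(out, x)
    return out
```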
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* Updating mllama after strftime.
* Town instead of village.
* Forgot the integration snapshot.
* Attempt to fix intel CPU.
* Intel extension fix.
* Workaround intel.
* Moving those deps directly into pyproject.
* Revert "Moving those deps directly into pyproject."
This reverts commit 98c1496ea6.
* Non system uv.
* Fixing the docker environment hopefully.
* Missed a step.
* Move workdir up a bit.
* Bailing out of reproducible python env.
* Triton version.
* backend(trtllm): bump TRTLLM to v0.17.0
* backend(trtllm): forgot to bump dockerfile
* backend(trtllm): use arg instead of env
* backend(trtllm): use correct library reference decoder_attention_src
* backend(trtllm): link against decoder_attention_{0|1}
* backend(trtllm): build against gcc-14 with cuda12.8
* backend(trtllm): use the return value optimization flag as an error if available
* backend(trtllm): make sure we escalate all warnings to errors on the backend impl in debug mode
* backend(trtllm): link against CUDA 12.8