Commit Graph

96 Commits

Wang, Yi A
ba049c9d49 improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-13 20:00:27 -07:00
Wang, Yi A
76cc129796 remove block_scales which is not needed anymore
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-11 01:28:14 -07:00
Wang, Yi A
a83e9fe003 work with the latest vllm extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:56:58 -07:00
Wang, Yi A
4de8fb0127 Merge branch 'gaudi_backend_pa' into warmup_gaudi_backend
2025-04-10 19:42:22 -07:00
Wang, Yi A
4cdc34ec4d match the latest vllm_extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:32:32 -07:00
Wang, Yi A
610dd200e5 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:20:28 -07:00
Wang, Yi A
cd900c3b72 pingpong optimization
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:16:05 -07:00
Baptiste Colle
37104acd75
Gaudi: Add Integration Test for Gaudi Backend (#3142)
* feat(gaudi): add integration test

* feat(test): add more models to integration tests

* remove debug comments

* fix typos
2025-04-07 16:55:03 +02:00
Wang, Yi A
29703dbd27 fix warmup issue for mllama
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-04 20:25:01 -07:00
Yuan Wu
3d059f91ab
Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE (#3131)
* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Remove debug modifications

Signed-off-by: yuanwu <yuan.wu@intel.com>

---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-04-03 10:34:53 +02:00
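The commit above swaps a linear batch-size step (BATCH_BUCKET_SIZE) for geometric growth: far fewer shapes have to be warmed up to cover the same maximum batch. A minimal Python sketch of the idea; the function names and the base-2 growth factor are illustrative assumptions, not the PR's actual code:

```python
def exponential_buckets(max_batch_size: int, base: int = 2) -> list[int]:
    """Batch-size buckets that grow geometrically: 1, 2, 4, 8, ...

    Illustrative: covering a max batch of 256 with a linear step of 8
    needs 32 buckets; with base-2 growth it needs 9.
    """
    buckets = []
    size = 1
    while size < max_batch_size:
        buckets.append(size)
        size *= base
    buckets.append(max_batch_size)
    return sorted(set(buckets))


def pick_bucket(batch_size: int, buckets: list[int]) -> int:
    """Round a live batch size up to the nearest warmed-up bucket."""
    return next(b for b in buckets if b >= batch_size)
```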
Wang, Yi A
8591687561 refine logging and fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-03 00:11:22 -07:00
Wang, Yi A
a84da5b698 optimize code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:56:15 -07:00
Wang, Yi A
705cc0b619 multi-modality warmup
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:09:16 -07:00
Wang, Yi A
9d85ac9485 LLM warmup logic
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 23:07:14 -07:00
Wang, Yi A
c55a8caea2 remove torch.where to fix incorrect output in hpu graph mode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 22:51:54 -07:00
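torch.where is one of the ops that can misbehave once a model is captured as an HPU graph; an arithmetic mask blend computes the same result without it. A hedged sketch of the kind of substitution such a fix typically uses (whether the commit uses exactly this blend is an assumption):

```python
import torch

def masked_blend(mask: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Equivalent of torch.where(mask, a, b) using only arithmetic ops.

    Assumes mask is boolean and a/b are broadcastable floating-point
    tensors: where mask is True the result takes a, elsewhere b.
    """
    m = mask.to(a.dtype)
    return m * a + (1.0 - m) * b
```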
Wang, Yi A
f0e5faec1a fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 07:01:06 -07:00
Wang, Yi A
376e0507b7 missing gptj change...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 01:08:40 -07:00
Wang, Yi A
7914e980e2 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:03:49 -07:00
Wang, Yi A
1508ee8de1 remove block_tables and prefill_cache_indices, which would lead to dynamic shapes
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-27 23:57:59 -07:00
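Inputs whose length varies request-by-request force a recompile for every new shape on HPU; padding each tensor up to a bucketed length keeps the compiled graph's shapes static. A minimal illustration of the padding step, with hypothetical helper names:

```python
import torch
import torch.nn.functional as F

def pad_to_bucket(input_ids: torch.Tensor, bucket_len: int, pad_token_id: int) -> torch.Tensor:
    """Right-pad a [batch, seq] tensor to a fixed bucketed length so the
    compiled graph always sees the same shape. Hypothetical helper;
    assumes the current length never exceeds bucket_len."""
    pad = bucket_len - input_ids.shape[-1]
    return F.pad(input_ids, (0, pad), value=pad_token_id)
```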
Wang, Yi A
7900be5ac3 warmup decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04 add warmup_decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Wang, Yi A
fd70ad703e warmup prefill
remove models where pageattn is not used; set block table to None since it's not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
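Warmup here means running one forward pass per (batch, sequence) bucket combination before serving, so every shape the server can encounter already has a compiled graph instead of paying compilation latency on the first real request. A schematic sketch of such a loop; `model`, the bucket lists, and the call signature are assumptions, not the backend's real API:

```python
import itertools
import torch

def warmup(model, batch_buckets, seq_buckets, pad_token_id=0, device="hpu"):
    """Run one dummy forward per shape bucket so graphs are compiled
    ahead of time (schematic; decode warmup is analogous with seq_len=1
    growing KV-cache buckets instead of prompt lengths)."""
    for batch_size, seq_len in itertools.product(batch_buckets, seq_buckets):
        dummy_ids = torch.full((batch_size, seq_len), pad_token_id,
                               dtype=torch.long, device=device)
        with torch.no_grad():
            model(input_ids=dummy_ids)  # prefill-shaped pass
```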
Yuan Wu
f5f14dc660
Gaudi: Fix llava-next and mllama crash issue (#3127)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-25 15:08:15 +01:00
Wang, Yi A
69773767c5 enable fp8
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-25 05:06:55 -07:00
Wang, Yi A
8d221b7b79 fix gptq issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 20:58:50 -07:00
Wang, Yi A
9914ffe1f1 remove unused quantization code and enable awq/gptq int4
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 19:37:20 -07:00
Wang, Yi A
fdf0733f56 fix incorrect output in qwen2/idefics if hpu graph is used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-21 01:01:37 -07:00
Wang, Yi A
36b6612f97 adjust warmup and enable vlm
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-20 23:12:52 -07:00
Wang, Yi A
f95aa42660 multi-modality initial PR
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 23:30:12 -07:00
Wang, Yi A
d5b78ba16f Merge branch 'main' into gaudi_backend_pa
2025-03-19 18:15:08 -07:00
Wang, Yi A
2074d0516b enable dbrx, remove some unused code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 03:16:41 -07:00
Wang, Yi A
2cde30de24 gpt_bigcode can also use pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:59:31 -07:00
Wang, Yi A
073f793976 fix phimoe issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:11:01 -07:00
Baptiste Colle
8c2c348f3c
Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork (#3117)
feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289
2025-03-18 09:45:52 +01:00
Wang, Yi A
5cd1c93cad add moe support, fix qwen/mistral/mixtral crash
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 00:45:15 -07:00
Wang, Yi A
6bbe24d974 use tensor cache in hpu graph to avoid replay issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:36:49 -07:00
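A captured graph replays against the exact tensors it was recorded with, so feeding it a freshly allocated input tensor silently reuses stale data. Caching the input buffers and copying new data into them in place before each replay avoids that. A Python sketch of the pattern under those assumptions (the real HPU graph API differs in detail):

```python
import torch

class GraphRunner:
    """Keep a persistent input tensor per shape and update it in place,
    so graph replays always read from the buffer they were captured
    with (sketch of the pattern, not the backend's implementation)."""

    def __init__(self):
        self.cached_inputs = {}  # shape -> persistent input tensor

    def run(self, model, input_ids: torch.Tensor):
        key = tuple(input_ids.shape)
        if key not in self.cached_inputs:
            # First time this shape is seen: allocate the buffer the
            # captured graph will be bound to.
            self.cached_inputs[key] = input_ids.clone()
        buf = self.cached_inputs[key]
        buf.copy_(input_ids)  # in-place update; replays read buf
        return model(input_ids=buf)
```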
Wang, Yi A
a07e7437b6 enable all the models, not tested yet
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:26:32 -07:00
Wang, Yi A
5d3653943c adjust block table in hpu to improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-16 20:28:01 -07:00
Wang, Yi A
b7fea6fc2f fix TP in pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 18:01:58 -07:00
Wang, Yi A
201dc6294f clean cuda/rocm code in hpu backend, enable flat_hpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 01:25:31 -07:00
Baptiste Colle
27ed848676
Release of Gaudi Backend for TGI (#3091)
* feat(gaudi): release ready (docs, docker image and vlm ready)

* fix(gaudi): add default argument for the dockerfile

* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
David Corvoysier
f01dc9e743
Update neuron backend (#3098)
* feat(neuron): use AWS Neuron SDK 2.21.1

* feat(neuron): bump optimum-neuron version

* feat(neuron): tag latest image for local tests

* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Adrien Gallouët
094975c3a8
Update the llamacpp backend (#3022)
* Build faster

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Make --model-gguf optional

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Enable mmap, offload_kqv & flash_attention by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Better error message

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update installed packages

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Save gguf in models/MODEL_ID/model.gguf

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix build with Mach-O

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Quantize without llama-quantize

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp and switch to ggml-org

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove make-gguf.sh

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update Cargo.lock

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Support HF_HUB_USER_AGENT_ORIGIN

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
Nicolas Patry
8e92942a18
Making tool_calls a vector. (#3075)
* Making `tool_calls` a vector.

* Update doc.

* Fixing the nix overlay with updated version.

* Add openai dependency.

* Updating the old tests.

* Trying to reduce the logs in the case of errors.

* Less spammy logs too.
2025-03-05 22:32:31 +01:00
Hugo Larcher
d8ff7f2623
feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. (#3061)
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.

* fix: Rust version for Neuron

* fix: PR comments, use rust-toolchain.toml
2025-03-04 16:43:50 +01:00
Daniël de Kok
e88f6f6ee9
Add property-based testing for RadixAllocator (#3068)
2025-03-04 15:09:46 +01:00
Daniël de Kok
fa4e9511f8
Fix two edge cases in RadixTrie::find (#3067)
- Always return a node, not its parent.
- Do not recurse when a node does not represent a full prefix of the
  input.
2025-03-04 13:23:27 +01:00
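Both fixes concern where find stops while walking the trie: it must return the deepest node whose accumulated key is a full prefix of the query (the node itself, not its parent), and it must stop rather than recurse when a child's key only partially matches the remaining input. A Python sketch of that behavior, not the Rust implementation:

```python
class Node:
    def __init__(self, key=()):
        self.key = tuple(key)   # token span stored on the edge into this node
        self.children = {}      # first token of child's key -> child Node

def find(node: Node, tokens: tuple) -> Node:
    """Return the deepest node whose accumulated key fully prefixes tokens."""
    if not tokens:
        return node
    child = node.children.get(tokens[0])
    if child is None:
        return node
    k = len(child.key)
    if tokens[:k] != child.key:
        # Child matches only partially: stop here, do not recurse.
        return node
    # Child is a full prefix: descend, and ultimately return the
    # matching node itself rather than its parent.
    return find(child, tokens[k:])
```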
Baptiste Colle
683ff53fa3
Add Gaudi Backend (#3055)
* wip(gaudi): import server and dockerfile from tgi-gaudi fork

* feat(gaudi): new gaudi backend working

* fix: fix style

* fix prehooks issues

* fix(gaudi): refactor server and implement requested changes
2025-02-28 12:14:58 +01:00
drbh
b0069e0485
fix: run linters and fix formatting (#3057)
2025-02-25 16:11:34 -05:00
David Corvoysier
c00add9c03
Add Neuron backend (#3033)
* feat: add neuron backend

* feat(neuron): add server standalone installation

* feat(neuron): add server and integration tests

* fix(neuron): increase ulimit when building image

The base image used to compile the rust components seems to have a low
ulimit for open files, which leads to errors during compilation.

* test(neuron): merge integration tests and fixtures

* test: add --neuron option

* review: do not use latest tag

* review: remove ureq pinned version

* review: --privileged should be the exception

* feat: add neuron case to build ci

* fix(neuron): export models from container in test fixtures

The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.

* refactor: remove sagemaker entry-point

The SageMaker image is built differently anyway.

* fix(neuron): avoid using Levenshtein

* test(neuron): use smaller llama model

* feat(neuron): avoid installing CUDA in image

* test(neuron): no error anymore when requesting too many tokens

* ci: doing a precompilation step (with a different token).

* test(neuron): avoid using image sha when exporting models

We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
  which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
  locally will export the models before the CI uses them.

* test(neuron): added a small script to prune test models

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-24 09:10:05 +01:00