text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-09 01:10:17 +00:00

Author	SHA1	Message	Date
Yuan Wu	fe7594e369	Fix the warmup issue of prefill batch_size (#268) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-01-23 17:26:17 +01:00
Yuan Wu	63c64bb307	Use the default value in globals.py (#265) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-01-21 10:10:23 +01:00
Karol Damaszke	8de110ae9f	Fix warmup with SKIP_TOKENIZER_IN_TGI=true (#266)	2025-01-21 10:09:49 +01:00
Yuan Wu	7d106477d6	Fix router input validation for SKIP_TOKENIZER_IN_TGI=true (#267) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-01-21 10:08:53 +01:00
Yuan Wu	6d6acca5eb	Update the ReadME for 2.3.1 (#260) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-01-03 10:55:14 +01:00
Yuan Wu	46b556805b	Upgrade to SynapseAI 1.19 (#259) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-26 17:33:24 +01:00
regisss	5291f652a1	Merge pull request #225 from yuanwu2017/2.3.0	2024-12-19 11:42:59 -06:00
yuanwu	8e2e5d8e15	Fix benchmark build error Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-17 05:38:10 +00:00
yuanwu	eaeef6e7a4	Remove the useless modifications Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-17 02:08:12 +00:00
yuanwu	15de6c9195	Merge branch 'habana-main' into 2.3.0	2024-12-17 02:06:22 +00:00
Sun Choi	61309b2832	Remove the default max_tokens for /v1/chat/completions (#251)	2024-12-16 09:32:57 +01:00
Sun Choi	cc2ca4ac22	HF_TOKEN replaces HUGGING_FACE_HUB_TOKEN as it is deprecated (#253)	2024-12-15 09:59:58 +01:00
yuanwu	c3b8899f10	Revert "Use optimum-habana v1.15-release branch" This reverts commit `c6f023a06b`.	2024-12-11 08:17:17 +00:00
yuanwu	c922ef9534	Fix the warmup issue of llama2-7B. Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-09 07:20:48 +00:00
yuanwu	c6f023a06b	Use optimum-habana v1.15-release branch Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-08 13:02:31 +00:00
yuanwu	1b659788b5	Add the no-deps in pip install Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-08 12:14:38 +00:00
yuanwu	73e6e3b871	Remove the error log Subsequent updates will remove these codes Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-08 11:55:13 +00:00
yuanwu	9f356ce045	Refine the warmup process Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-07 09:56:16 +00:00
yuanwu	253a992447	Remove the CI workflows we don't currently support Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-02 08:45:36 +00:00
yuanwu	0228bd0260	Doesn't run the prefill warmup when limit_hpu_graph=true Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-01 21:29:41 +00:00
yuanwu	4586325a34	Fix the starCode warmup issue Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-01 06:14:00 +00:00
Yuan Wu	b83419a769	Merge branch 'habana-main' into 2.3.0	2024-11-28 12:38:36 +08:00
yuanwu	636cdb4c43	Fix startcode issue Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-11-26 08:55:42 +00:00
srajabos	d49ce00f40	With this change, bucketing/padding of input is applied to health check. (#245)	2024-11-18 22:38:30 +01:00
yuanwu2017	56c3eb4adb	Remove the torch package in requirements.txt (#246) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-11-07 09:22:24 -08:00
yuanwu2017	c345c734a7	Merge branch 'habana-main' into 2.3.0	2024-11-01 11:24:40 +08:00
yuanwu	fcf2e3a338	Fix the prefill warmup issue Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-11-01 05:08:52 +02:00
Thanaji Rao Thakkalapelli	6ba3d1d6e5	updated release docker image version in readme to 2.0.6 (#242)	2024-10-31 15:44:16 -07:00
yuanwu2017	8d84ffabf2	Upgrade to SynapseAI 1.18 (#227) Signed-off-by: yuanwu <yuan.wu@intel.com> Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>	2024-10-31 20:14:44 +01:00
Thanaji Rao Thakkalapelli	7fb4af9a87	updated supported models list table in readme (#241) * updated supported models list table in readme * updated read me * updated read me	2024-10-29 23:28:45 -07:00
yuanwu	4c9856f9e5	Add missing package Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-28 07:04:56 +00:00
yuanwu2017	c23584f626	Merge branch 'habana-main' into 2.3.0	2024-10-28 04:37:07 +08:00
yuanwu	372e071135	Fix the issues of tgi-gaudi for v.2.3.1 Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-27 20:40:36 +00:00
Nicolas Patry	7e282b4153	V2.3.1	2024-10-27 04:14:35 +00:00
Nicolas Patry	34e98b14ef	New release 2.3.1 (#2604) * New release 2.3.1 * Update doc number	2024-10-27 04:14:35 +00:00
drbh	902f526d69	Unroll notify error into generate response (#2597) * feat: unroll notify_error if no tool is choosen * fix: expect simple message when no tool is selected * fix: improve test to avoid notify_error * fix: improve docs and indicate change in expected response * fix: adjust linting in test file	2024-10-27 04:03:57 +00:00
drbh	7664d2e2b3	CI (2592): Allow LoRA adapter revision in server launcher (#2602) allow revision for lora adapters from launcher Co-authored-by: Sida <sida@kulamind.com> Co-authored-by: teamclouday <teamclouday@gmail.com>	2024-10-27 04:03:57 +00:00
Nicolas Patry	967e67111d	Max token capacity metric (#2595) * adding max_token_capacity_metric * added tgi to name of metric * Adding max capacity metric. * Add description for the metrics --------- Co-authored-by: Edwinhr716 <Edandres249@gmail.com>	2024-10-27 04:03:57 +00:00
Nicolas Patry	51506aa57a	Mllama flash version (#2585 ) * Working loading state. * Preprocessing. * Working state ? (Broke idefics1 temporarily). * Cleaner condition. * Fix idefics. * Updating config, removing TODO * Mllama * Ugrade transformers 4.45 * Flashing mllama. * Starting to get there. * Working state. * Integrations tests for mllama (cutting to 10 tokens because there seems' to be instability after (meaning size of the batch matters. * Updating model link. * Earlier assert. * Fix vlm ? * remove log. * Force ignore all images but last. * Default dtype bfloat16. * Update integration test after switch to bf16. * Remove dead code. * Removed dead code. * Upgrade the flake to latest transformers/tokenizers * Move to hf tgi-nix * Upgrade to 0.5.0	2024-10-27 04:03:57 +00:00
Daniël de Kok	fa964f82d3	nix: experimental support for building a Docker container (#2470 ) * nix: experimental support for building a Docker image Run using something like: ``` docker run \ --device nvidia.com/gpu=all \ -it --rm -p 8080:80 \ -v $PWD/data:/data \ -v $PWD/tmp:/tmp \ tgi-docker:latest \ --model-id <model_id> ``` * Example of building the Docker image using Nix inside Docker * Stream to make the builder image smaller This avoids storing a Docker image tarball in the image. Instead, stream the layers while doing `docker run`. * Don't spam journalctl on Linux * Other dockerfile. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-25 09:12:03 +00:00
Daniël de Kok	775e5f4c64	MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.	2024-10-25 09:12:03 +00:00
Daniël de Kok	692f8ddb69	Move flake back to tgi-nix `main` (#2586)	2024-10-25 09:12:03 +00:00
drbh	bdc47394d2	feat: support phi3.5 moe (#2479 ) * feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-25 09:12:03 +00:00
Daniël de Kok	288bcb0027	Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) This change add support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported: - No `desc_act` with tensor parallelism, unless `group_size=-1`. - No asymmetric quantization. - No AWQ.	2024-10-25 09:07:52 +00:00
Mohit Sharma	ff905aeff3	Update ROCM libs and improvements (#2579 ) * style * update torch * ix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env vart * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error messag * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile	2024-10-25 09:01:04 +00:00
Ikram Ul Haq	6808b2de7e	Update architecture.md (#2577)	2024-10-25 09:01:04 +00:00
Daniël de Kok	55fd2816ea	Remove compute capability lazy cell (#2580) Remove compute capability lock We are only calling the `get_cuda_capability` function once, so avoiding the cost of multiple calls is not really necessary yet.	2024-10-25 09:01:04 +00:00
Daniël de Kok	f82a3f5816	flashinfer: pass window size and dtype (#2574)	2024-10-25 09:01:04 +00:00
Daniël de Kok	653193a942	Improve support for GPUs with capability < 8 (#2575 ) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-10-25 09:01:04 +00:00
Alvaro Bartolome	bc28f86903	Fix build with `--features google` (#2566) * Fix `cargo build --features google` * Add `cargo test --features google`	2024-10-25 09:01:04 +00:00

1229 Commits