text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-01 21:40:16 +00:00

Author	SHA1	Message	Date
Yuan Wu	46b556805b	Upgrade to SynapseAI 1.19 (#259) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-26 17:33:24 +01:00
yuanwu	eaeef6e7a4	Remove the useless modifications Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-17 02:08:12 +00:00
yuanwu	c3b8899f10	Revert "Use optimum-habana v1.15-release branch" This reverts commit `c6f023a06b`.	2024-12-11 08:17:17 +00:00
yuanwu	c922ef9534	Fix the warmup issue of llama2-7B. Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-09 07:20:48 +00:00
yuanwu	c6f023a06b	Use optimum-habana v1.15-release branch Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-08 13:02:31 +00:00
yuanwu	1b659788b5	Add the no-deps in pip install Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-08 12:14:38 +00:00
yuanwu	9f356ce045	Refine the warmup process Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-07 09:56:16 +00:00
yuanwu	0228bd0260	Doesn't run the prefill warmup when limit_hpu_graph=true Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-01 21:29:41 +00:00
yuanwu	4586325a34	Fix the starCode warmup issue Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-12-01 06:14:00 +00:00
Yuan Wu	b83419a769	Merge branch 'habana-main' into 2.3.0	2024-11-28 12:38:36 +08:00
yuanwu	636cdb4c43	Fix startcode issue Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-11-26 08:55:42 +00:00
srajabos	d49ce00f40	With this change, bucketing/padding of input is applied to health check. (#245)	2024-11-18 22:38:30 +01:00
yuanwu2017	56c3eb4adb	Remove the torch package in requirements.txt (#246) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-11-07 09:22:24 -08:00
yuanwu2017	c345c734a7	Merge branch 'habana-main' into 2.3.0	2024-11-01 11:24:40 +08:00
yuanwu	fcf2e3a338	Fix the prefill warmup issue Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-11-01 05:08:52 +02:00
yuanwu2017	8d84ffabf2	Upgrade to SynapseAI 1.18 (#227) Signed-off-by: yuanwu <yuan.wu@intel.com> Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>	2024-10-31 20:14:44 +01:00
yuanwu	4c9856f9e5	Add missing package Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-28 07:04:56 +00:00
yuanwu2017	c23584f626	Merge branch 'habana-main' into 2.3.0	2024-10-28 04:37:07 +08:00
yuanwu	372e071135	Fix the issues of tgi-gaudi for v.2.3.1 Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-27 20:40:36 +00:00
Nicolas Patry	51506aa57a	Mllama flash version (#2585) * Working loading state. * Preprocessing. * Working state ? (Broke idefics1 temporarily). * Cleaner condition. * Fix idefics. * Updating config, removing TODO * Mllama * Ugrade transformers 4.45 * Flashing mllama. * Starting to get there. * Working state. * Integrations tests for mllama (cutting to 10 tokens because there seems' to be instability after (meaning size of the batch matters. * Updating model link. * Earlier assert. * Fix vlm ? * remove log. * Force ignore all images but last. * Default dtype bfloat16. * Update integration test after switch to bf16. * Remove dead code. * Removed dead code. * Upgrade the flake to latest transformers/tokenizers * Move to hf tgi-nix * Upgrade to 0.5.0	2024-10-27 04:03:57 +00:00
Daniël de Kok	775e5f4c64	MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.	2024-10-25 09:12:03 +00:00
drbh	bdc47394d2	feat: support phi3.5 moe (#2479) * feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-25 09:12:03 +00:00
Daniël de Kok	288bcb0027	Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) This change add support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported: - No `desc_act` with tensor parallelism, unless `group_size=-1`. - No asymmetric quantization. - No AWQ.	2024-10-25 09:07:52 +00:00
Mohit Sharma	ff905aeff3	Update ROCM libs and improvements (#2579) * style * update torch * ix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env vart * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error messag * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile	2024-10-25 09:01:04 +00:00
Daniël de Kok	f82a3f5816	flashinfer: pass window size and dtype (#2574)	2024-10-25 09:01:04 +00:00
Daniël de Kok	653193a942	Improve support for GPUs with capability < 8 (#2575) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-10-25 09:01:04 +00:00
Alvaro Bartolome	6976cf8c4c	Add LoRA adapters support for Gemma2 (#2567) * Add LoRA adapters support for Gemma2 * Make `black` formatting happy	2024-10-25 09:01:04 +00:00
Nicolas Patry	a684a81927	More tensor cores. (#2558) * More tensor cores. * Fixing the logic. * Gemma is modified by this.	2024-10-25 09:01:04 +00:00
Daniël de Kok	d4f995e718	Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537) This replaces the custom layers in both models.	2024-10-25 09:01:04 +00:00
Daniël de Kok	32d50c2ea7	Add support for scalar FP8 weight scales (#2550) * Add support for scalar FP8 weight scales * Support LLM compressor FP8 checkpoints on H100 On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype. However, we wouldn't pick up fp8 quantization for models quantized with LLM compressor. This change adds enough parsing to detect if models have FP8-quantized weights. * Remove stray debug print	2024-10-25 09:01:04 +00:00
Nicolas Patry	79ac2b741d	Micro cleanup. (#2555)	2024-10-25 08:53:47 +00:00
yuanwu	b590310255	Add missing import package Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-25 08:52:24 +00:00
yuanwu	8ebe77b3be	Simplify the warmup Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-25 08:38:59 +00:00
Thanaji Rao Thakkalapelli	b126bf4785	Revert pr 235 as flash attention is not really enabled for gemma (#239)	2024-10-23 10:58:57 +02:00
yuanwu2017	8686a0fc6d	Merge branch 'habana-main' into 2.3.0	2024-10-23 16:32:12 +08:00
yuanwu	67ee45a270	Pass the max_batch_total_tokens to causal_lm refine the warmup Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-23 08:28:26 +00:00
Thanaji Rao Thakkalapelli	c5e3881051	Enables Flash Attention in TGI for gemma models (#235)	2024-10-18 09:20:42 -07:00
Thanaji Rao Thakkalapelli	46b14e6b28	Remove all references to habana_quantization_toolkit for 1.18 (#229)	2024-10-18 10:59:59 +02:00
Sun Choi	8ae5d4c7d6	Ignore EOS for benchmark by using TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN (#234)	2024-10-16 11:57:36 +02:00
Thanaji Rao Thakkalapelli	87a1cee32c	Fix sysntax error in PR 232	2024-10-15 13:23:48 -07:00
Thanaji Rao Thakkalapelli	e06320f64e	Enabling Flash Attention support for falcon model (#232)	2024-10-15 19:50:17 +02:00
Sun Choi	0578bd917d	Fix gpt_bigcode/starcoderbase-3b accuracy issue (#228) Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>	2024-10-14 10:01:55 +02:00
yuanwu	bab529c916	Make Gaudi adapt to the tgi 2.3.0 Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-26 06:04:55 +00:00
yuanwu	14fdc4ae5e	Add some missing modification of 2.3.0 because of conflict Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-25 07:49:49 +00:00
Wang, Yi	3519398a14	hotfix: ipex fails since cuda moe kernel is not supported (#2532) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:19:20 +00:00
Daniël de Kok	c1a99e2f15	Update to moe-kenels 0.3.1 (#2535) * Update to moe-kenels 0.3.1 * Attempt to fix apt failure	2024-09-25 06:19:20 +00:00
Daniël de Kok	29a93b78ba	Move to moe-kernels package and switch to common MoE layer (#2511) * Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner	2024-09-25 06:18:05 +00:00
Wang, Yi	cbfe9e5185	hotfix : enable intel ipex cpu and xpu in python3.11 (#2517) enable intel ipex cpu and xpu in python3.11 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:15:35 +00:00
drbh	5fc0e0c589	fix: pass missing revision arg for lora adapter when loading multiple… (#2510) fix: pass missing revision arg for lora adapter when loading multiple adapters	2024-09-25 06:15:35 +00:00
Nicolas Patry	c6b568b892	Fix tokenization yi (#2507) * Fixing odd tokenization self modifications on the Rust side (load and resave in Python). * Fixing the builds ? * Fix the gh action? * Fixing the location ? * Validation is odd. * Try a faster runner * Upgrade python version. * Remove sccache * No sccache. * Getting libpython maybe ? * List stuff. * Monkey it up. * have no idea at this point * Tmp. * Shot in the dark. * Tmate the hell out of this. * Desperation. * WTF. * -y. * Apparently 3.10 is not available anymore. * Updating the dockerfile to make libpython discoverable at runtime too. * Put back rust tests. * Why do we want mkl on AMD ? * Forcing 3.11 ?	2024-09-25 06:15:35 +00:00

709 Commits