text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-04-24 16:32:12 +00:00

Author	SHA1	Message	Date
drbh	bdc47394d2	feat: support phi3.5 moe (#2479 ) * feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-25 09:12:03 +00:00
Daniël de Kok	288bcb0027	Add support for GPTQ-quantized MoE models using MoE Marlin (#2557 ) This change adds support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported: - No `desc_act` with tensor parallelism, unless `group_size=-1`. - No asymmetric quantization. - No AWQ.	2024-10-25 09:07:52 +00:00
Mohit Sharma	ff905aeff3	Update ROCM libs and improvements (#2579 ) * style * update torch * fix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env var * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error message * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile	2024-10-25 09:01:04 +00:00
Ikram Ul Haq	6808b2de7e	Update architecture.md (#2577 )	2024-10-25 09:01:04 +00:00
Daniël de Kok	55fd2816ea	Remove compute capability lazy cell (#2580 ) Remove compute capability lock We are only calling the `get_cuda_capability` function once, so avoiding the cost of multiple calls is not really necessary yet.	2024-10-25 09:01:04 +00:00
Daniël de Kok	f82a3f5816	flashinfer: pass window size and dtype (#2574 )	2024-10-25 09:01:04 +00:00
Daniël de Kok	653193a942	Improve support for GPUs with capability < 8 (#2575 ) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-10-25 09:01:04 +00:00
Alvaro Bartolome	bc28f86903	Fix build with `--features google` (#2566 ) * Fix `cargo build --features google` * Add `cargo test --features google`	2024-10-25 09:01:04 +00:00
Alvaro Bartolome	6976cf8c4c	Add LoRA adapters support for Gemma2 (#2567 ) * Add LoRA adapters support for Gemma2 * Make `black` formatting happy	2024-10-25 09:01:04 +00:00
Nicholas Broad	0817643b58	remove LORA_ADAPTERS_PATH (#2563 ) specify how to call local adapters	2024-10-25 09:01:04 +00:00
Nicolas Patry	a684a81927	More tensor cores. (#2558 ) * More tensor cores. * Fixing the logic. * Gemma is modified by this.	2024-10-25 09:01:04 +00:00
Nicolas Patry	97d4bdd685	Cleanup Vertex + Chat (#2553 ) * Cleanup Vertex + Chat * logprobs defaults to false. * Parameters are optional * Fix docs. * Changing back this logprobs default. * Fixup doc. * Let's debug that. * Not unstable. * Updating Cargo ? * Wat? * Dummy change. * Trying some other install. * Trying something. * Revert everything. * Update Cargo lock. * Fixing the pre-commit after rebase.	2024-10-25 09:01:04 +00:00
Nicolas Patry	25e0edf337	Hotfixing main. (#2562 )	2024-10-25 09:01:04 +00:00
Aritra Roy Gosthipaty	782130df17	Adding note for private models in quick-tour document (#2548 ) * chore: adding note for private models in quicktour doc * Update docs/source/quicktour.md Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> --------- Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> Co-authored-by: vb <vaibhavs10@gmail.com>	2024-10-25 09:01:04 +00:00
Orhun Parmaksız	5247f8938d	Simplify crossterm imports (#2545 )	2024-10-25 09:01:04 +00:00
Orhun Parmaksız	8c6d3e074f	Update the link to the Ratatui organization (#2546 )	2024-10-25 09:01:04 +00:00
Daniël de Kok	d4f995e718	Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537 ) This replaces the custom layers in both models.	2024-10-25 09:01:04 +00:00
Daniël de Kok	32d50c2ea7	Add support for scalar FP8 weight scales (#2550 ) * Add support for scalar FP8 weight scales * Support LLM compressor FP8 checkpoints on H100 On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype. However, we wouldn't pick up fp8 quantization for models quantized with LLM compressor. This change adds enough parsing to detect if models have FP8-quantized weights. * Remove stray debug print	2024-10-25 09:01:04 +00:00
Nicolas Patry	68cfc94f40	Hotfixing main (#2556 )	2024-10-25 08:53:47 +00:00
Nicolas Patry	79ac2b741d	Micro cleanup. (#2555 )	2024-10-25 08:53:47 +00:00
OlivierDehaene	73e6090d53	chore: Add old V2 backend (#2551 ) * wip * added v2	2024-10-25 08:53:36 +00:00
Daniël de Kok	9aed9d5f81	nix: remove unused `_server.nix` file (#2538 )	2024-10-25 08:53:36 +00:00
yuanwu	b590310255	Add missing import package Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-25 08:52:24 +00:00
yuanwu	8ebe77b3be	Simplify the warmup Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-25 08:38:59 +00:00
Thanaji Rao Thakkalapelli	b126bf4785	Revert pr 235 as flash attention is not really enabled for gemma (#239 )	2024-10-23 10:58:57 +02:00
yuanwu2017	8686a0fc6d	Merge branch 'habana-main' into 2.3.0	2024-10-23 16:32:12 +08:00
yuanwu	67ee45a270	Pass the max_batch_total_tokens to causal_lm refine the warmup Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-23 08:28:26 +00:00
Thanaji Rao Thakkalapelli	c5e3881051	Enables Flash Attention in TGI for gemma models (#235 )	2024-10-18 09:20:42 -07:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	9ae5ad5057	requirements name - cabelo@opensuse.org (#237 )	2024-10-18 09:20:05 -07:00
Thanaji Rao Thakkalapelli	46b14e6b28	Remove all references to habana_quantization_toolkit for 1.18 (#229 )	2024-10-18 10:59:59 +02:00
Thanaji Rao Thakkalapelli	21c13ff3a6	Remove References to torch compile mode in readme (#236 )	2024-10-17 14:07:51 -07:00
Sun Choi	8ae5d4c7d6	Ignore EOS for benchmark by using TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN (#234 )	2024-10-16 11:57:36 +02:00
Mandy Li	d07e7f4f62	Merge pull request #233 from huggingface/fix_sysntax Fix sysntax error in PR 232	2024-10-15 14:33:21 -07:00
Thanaji Rao Thakkalapelli	87a1cee32c	Fix sysntax error in PR 232	2024-10-15 13:23:48 -07:00
Thanaji Rao Thakkalapelli	e06320f64e	Enabling Flash Attention support for falcon model (#232 )	2024-10-15 19:50:17 +02:00
Sun Choi	0578bd917d	Fix gpt_bigcode/starcoderbase-3b accuracy issue (#228 ) Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>	2024-10-14 10:01:55 +02:00
Mohit Deopujari	fe8a373831	Enhancements to README (#226 )	2024-10-02 12:22:33 +02:00
yuanwu	bab529c916	Make Gaudi adapt to the tgi 2.3.0 Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-26 06:04:55 +00:00
yuanwu2017	e424752fa3	Enable the AutoGPTQ (#217 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-25 18:55:02 +02:00
yuanwu	14fdc4ae5e	Add some missing modification of 2.3.0 because of conflict Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-25 07:49:49 +00:00
Nicolas Patry	514a5a737d	Preparing for release. (#2540 ) * Preparing for release. * Upgrade version in docs.	2024-09-25 06:20:50 +00:00
OlivierDehaene	bd9675c8c7	fix: wrap python basic logs in debug assertion in launcher (#2539 ) * fix: wrap python basic logs in debug assertion in launcher * use level filters instead	2024-09-25 06:19:20 +00:00
Wang, Yi	3519398a14	hotfix: ipex fails since cuda moe kernel is not supported (#2532 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:19:20 +00:00
Daniël de Kok	b6ef2bfc1b	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536 )	2024-09-25 06:19:20 +00:00
Daniël de Kok	c1a99e2f15	Update to moe-kernels 0.3.1 (#2535 ) * Update to moe-kernels 0.3.1 * Attempt to fix apt failure	2024-09-25 06:19:20 +00:00
Nicolas Patry	2d470c8282	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-25 06:19:20 +00:00
Daniël de Kok	29a93b78ba	Move to moe-kernels package and switch to common MoE layer (#2511 ) * Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner	2024-09-25 06:18:05 +00:00
OlivierDehaene	88b72c8eb3	fix: metrics unbounded memory (#2528 )	2024-09-25 06:17:09 +00:00
Daniël de Kok	0ecbd61099	nix: pure Rust check/fmt/clippy/test (#2525 ) Runs the tests in a Nix build sandbox.	2024-09-25 06:17:09 +00:00
Nicolas Patry	0110b83aff	Adding a test for FD. (#2516 ) * Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.	2024-09-25 06:17:09 +00:00

1237 Commits