Commit Graph

1215 Commits

Author SHA1 Message Date
yuanwu
c6f023a06b Use optimum-habana v1.15-release branch
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 13:02:31 +00:00
yuanwu
1b659788b5 Add --no-deps to the pip install
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 12:14:38 +00:00
yuanwu
73e6e3b871 Remove the error log
Subsequent updates will remove this code

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 11:55:13 +00:00
yuanwu
9f356ce045 Refine the warmup process
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-07 09:56:16 +00:00
yuanwu
253a992447 Remove the CI workflows we don't currently support
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-02 08:45:36 +00:00
yuanwu
0228bd0260 Don't run the prefill warmup when limit_hpu_graph=true
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 21:29:41 +00:00
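
A minimal sketch of the guard this commit describes, assuming tgi-gaudi reads `limit_hpu_graph` from the `LIMIT_HPU_GRAPH` environment variable; the helper name and default value are illustrative, not the repository's actual code:

```python
import os

def should_run_prefill_warmup() -> bool:
    # Hypothetical helper: when limit_hpu_graph is enabled (LIMIT_HPU_GRAPH=true),
    # skip the prefill warmup pass; otherwise run it as before.
    limit_hpu_graph = os.getenv("LIMIT_HPU_GRAPH", "false").lower() in ("true", "1")
    return not limit_hpu_graph
```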
yuanwu
4586325a34 Fix the StarCoder warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 06:14:00 +00:00
Yuan Wu
b83419a769
Merge branch 'habana-main' into 2.3.0 2024-11-28 12:38:36 +08:00
yuanwu
636cdb4c43 Fix StarCoder issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-26 08:55:42 +00:00
srajabos
d49ce00f40
With this change, bucketing/padding of input is applied to health check. (#245) 2024-11-18 22:38:30 +01:00
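
The bucketing/padding applied here rounds the health-check input up to the nearest warmed-up shape so it exercises the same padded graphs as regular requests; a hedged sketch with hypothetical bucket sizes:

```python
def pad_to_bucket(seq_len: int, buckets=(128, 256, 512, 1024)) -> int:
    # Hypothetical bucketing: pick the smallest bucket that fits the input so the
    # padded shape matches one that was already warmed up.
    for bucket in buckets:
        if seq_len <= bucket:
            return bucket
    return buckets[-1]  # inputs longer than the largest bucket are capped here

# In this sketch, a one-token health-check request is padded to 128 tokens.
```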
yuanwu2017
56c3eb4adb
Remove the torch package in requirements.txt (#246)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-07 09:22:24 -08:00
yuanwu2017
c345c734a7
Merge branch 'habana-main' into 2.3.0 2024-11-01 11:24:40 +08:00
yuanwu
fcf2e3a338 Fix the prefill warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-01 05:08:52 +02:00
Thanaji Rao Thakkalapelli
6ba3d1d6e5
updated release docker image version in readme to 2.0.6 (#242) 2024-10-31 15:44:16 -07:00
yuanwu2017
8d84ffabf2
Upgrade to SynapseAI 1.18 (#227)
Signed-off-by: yuanwu <yuan.wu@intel.com>
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
2024-10-31 20:14:44 +01:00
Thanaji Rao Thakkalapelli
7fb4af9a87
updated supported models list table in readme (#241)
* updated supported models list table in readme

* updated readme

* updated readme
2024-10-29 23:28:45 -07:00
yuanwu
4c9856f9e5 Add missing package
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-28 07:04:56 +00:00
yuanwu2017
c23584f626
Merge branch 'habana-main' into 2.3.0 2024-10-28 04:37:07 +08:00
yuanwu
372e071135 Fix tgi-gaudi issues for v2.3.1
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-27 20:40:36 +00:00
Nicolas Patry
7e282b4153 V2.3.1 2024-10-27 04:14:35 +00:00
Nicolas Patry
34e98b14ef New release 2.3.1 (#2604)
* New release 2.3.1

* Update doc number
2024-10-27 04:14:35 +00:00
drbh
902f526d69 Unroll notify error into generate response (#2597)
* feat: unroll notify_error if no tool is chosen

* fix: expect simple message when no tool is selected

* fix: improve test to avoid notify_error

* fix: improve docs and indicate change in expected response

* fix: adjust linting in test file
2024-10-27 04:03:57 +00:00
drbh
7664d2e2b3 CI (2592): Allow LoRA adapter revision in server launcher (#2602)
allow revision for lora adapters from launcher

Co-authored-by: Sida <sida@kulamind.com>
Co-authored-by: teamclouday <teamclouday@gmail.com>
2024-10-27 04:03:57 +00:00
Nicolas Patry
967e67111d Max token capacity metric (#2595)
* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
2024-10-27 04:03:57 +00:00
Nicolas Patry
51506aa57a Mllama flash version (#2585)
* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers to 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems to be instability afterwards, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0
2024-10-27 04:03:57 +00:00
Daniël de Kok
fa964f82d3 nix: experimental support for building a Docker container (#2470)
* nix: experimental support for building a Docker image

Run using something like:

```
docker run \
  --device nvidia.com/gpu=all \
  -it --rm -p 8080:80 \
  -v $PWD/data:/data \
  -v $PWD/tmp:/tmp \
  tgi-docker:latest \
  --model-id <model_id>
```

* Example of building the Docker image using Nix inside Docker

* Stream to make the builder image smaller

This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.

* Don't spam journalctl on Linux

* Other dockerfile.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 09:12:03 +00:00
Daniël de Kok
775e5f4c64 MoE Marlin: support desc_act for groupsize != -1 (#2590)
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
2024-10-25 09:12:03 +00:00
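
For reference, `desc_act` (activation-order) GPTQ stores a per-input-channel group index; a hedged sketch of the reordering idea, not the vLLM Marlin kernel itself:

```python
import torch

def act_order_permutation(g_idx: torch.Tensor) -> torch.Tensor:
    # Hypothetical illustration: sorting the group indices yields the input-channel
    # permutation that makes channels of the same quantization group contiguous,
    # which is what an act-order-aware kernel applies before the matmul.
    return torch.argsort(g_idx)
```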
Daniël de Kok
692f8ddb69 Move flake back to tgi-nix main (#2586) 2024-10-25 09:12:03 +00:00
drbh
bdc47394d2 feat: support phi3.5 moe (#2479)
* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 09:12:03 +00:00
Daniël de Kok
288bcb0027 Add support for GPTQ-quantized MoE models using MoE Marlin (#2557)
This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
2024-10-25 09:07:52 +00:00
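
A hedged Python sketch of the constraints listed above, with hypothetical parameter names taken from typical GPTQ checkpoint metadata:

```python
def gptq_moe_marlin_supported(desc_act: bool, group_size: int, sym: bool,
                              is_awq: bool, tensor_parallel: bool) -> bool:
    # Hypothetical check mirroring the commit's stated limitations.
    if is_awq:
        return False  # AWQ checkpoints are not supported
    if not sym:
        return False  # asymmetric quantization is not supported
    if desc_act and tensor_parallel and group_size != -1:
        return False  # desc_act with tensor parallelism only when group_size == -1
    return True
```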
Mohit Sharma
ff905aeff3 Update ROCM libs and improvements (#2579)
* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-10-25 09:01:04 +00:00
Ikram Ul Haq
6808b2de7e Update architecture.md (#2577) 2024-10-25 09:01:04 +00:00
Daniël de Kok
55fd2816ea Remove compute capability lazy cell (#2580)
Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
2024-10-25 09:01:04 +00:00
Daniël de Kok
f82a3f5816 flashinfer: pass window size and dtype (#2574) 2024-10-25 09:01:04 +00:00
Daniël de Kok
653193a942 Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
2024-10-25 09:01:04 +00:00
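
A hedged sketch of the fallback logic this commit describes; the helper and return shape are illustrative, not TGI's actual dispatch code:

```python
def select_attention_backend(compute_capability: tuple, flashinfer_usable: bool) -> dict:
    major, _minor = compute_capability
    if flashinfer_usable and major >= 8:
        return {"backend": "flashinfer", "prefix_caching": True, "pass_kv_cache": True}
    # Older GPUs fall back to flash-attn v1 + paged attention: v1 cannot read the
    # block-table cache, so the raw key/value tensors are passed instead, and
    # prefix caching is disabled.
    return {"backend": "flash-attn-v1+paged", "prefix_caching": False, "pass_kv_cache": False}
```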
Alvaro Bartolome
bc28f86903 Fix build with --features google (#2566)
* Fix `cargo build --features google`

* Add `cargo test --features google`
2024-10-25 09:01:04 +00:00
Alvaro Bartolome
6976cf8c4c Add LoRA adapters support for Gemma2 (#2567)
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
2024-10-25 09:01:04 +00:00
Nicholas Broad
0817643b58 remove LORA_ADAPTERS_PATH (#2563)
specify how to call local adapters
2024-10-25 09:01:04 +00:00
Nicolas Patry
a684a81927 More tensor cores. (#2558)
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
2024-10-25 09:01:04 +00:00
Nicolas Patry
97d4bdd685 Cleanup Vertex + Chat (#2553)
* Cleanup Vertex + Chat

* logprobs defaults to false.

* Parameters are optional

* Fix docs.

* Changing back this logprobs default.

* Fixup doc.

* Let's debug that.

* Not unstable.

* Updating Cargo ?

* Wat?

* Dummy change.

* Trying some other install.

* Trying something.

* Revert everything.

* Update Cargo lock.

* Fixing the pre-commit after rebase.
2024-10-25 09:01:04 +00:00
Nicolas Patry
25e0edf337 Hotfixing main. (#2562) 2024-10-25 09:01:04 +00:00
Aritra Roy Gosthipaty
782130df17 Adding note for private models in quick-tour document (#2548)
* chore: adding note for private models in quicktour doc

* Update docs/source/quicktour.md

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
2024-10-25 09:01:04 +00:00
Orhun Parmaksız
5247f8938d Simplify crossterm imports (#2545) 2024-10-25 09:01:04 +00:00
Orhun Parmaksız
8c6d3e074f Update the link to the Ratatui organization (#2546) 2024-10-25 09:01:04 +00:00
Daniël de Kok
d4f995e718 Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 (#2537)
This replaces the custom layers in both models.
2024-10-25 09:01:04 +00:00
Daniël de Kok
32d50c2ea7 Add support for scalar FP8 weight scales (#2550)
* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print
2024-10-25 09:01:04 +00:00
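
A hedged sketch of handling a scalar (per-tensor) FP8 weight scale alongside per-channel scales; the function name is illustrative:

```python
import torch

def expand_weight_scale(scale: torch.Tensor, out_features: int) -> torch.Tensor:
    # Illustrative only: broadcast a scalar (per-tensor) scale to a per-channel
    # vector so one dequantization path can handle both checkpoint layouts.
    if scale.numel() == 1:
        return scale.reshape(1).expand(out_features).contiguous()
    return scale
```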
Nicolas Patry
68cfc94f40 Hotfixing main (#2556) 2024-10-25 08:53:47 +00:00
Nicolas Patry
79ac2b741d Micro cleanup. (#2555) 2024-10-25 08:53:47 +00:00
OlivierDehaene
73e6090d53 chore: Add old V2 backend (#2551)
* wip

* added v2
2024-10-25 08:53:36 +00:00
Daniël de Kok
9aed9d5f81 nix: remove unused _server.nix file (#2538) 2024-10-25 08:53:36 +00:00