text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-09-10 03:44:54 +00:00

Author	SHA1	Message	Date
Mohit Sharma	8cc2febdb6	(fix) quantize=fp8	2024-09-30 12:07:38 +00:00
Mohit Sharma	8ee9823d3b	(feat) fp8 fnuz support for rocm	2024-09-30 11:43:45 +00:00
Mohit Sharma	2401fdc889	cleaned dockerfile	2024-09-30 03:40:00 +00:00
Mohit Sharma	3b28cf9067	improve dockerfile	2024-09-28 15:54:45 +00:00
Mohit Sharma	7cb49f6f4f	float16 dep	2024-09-27 15:53:44 +00:00
Mohit Sharma	b2cd1b66ed	fix imports after rebase	2024-09-27 15:52:43 +00:00
Mohit Sharma	473d9a892d	Merge remote-tracking branch 'upstream/main' into rocm_6.2_updates	2024-09-27 15:36:12 +00:00
Daniël de Kok	5b6b74e21d	Improve support for GPUs with capability < 8 (#2575 ) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-09-27 16:19:42 +02:00
Mohit Sharma	346dfe398a	remove import	2024-09-27 12:59:35 +00:00
Mohit Sharma	a24c2cc5e9	updated default value	2024-09-27 12:39:12 +00:00
Mohit Sharma	ac2dccd174	improved error message	2024-09-27 12:34:04 +00:00
Mohit Sharma	816d4b67b2	fix import	2024-09-27 12:32:17 +00:00
Mohit Sharma	47c81d2924	Merge remote-tracking branch 'upstream/main' into fix_rocm_fa	2024-09-27 10:34:16 +00:00
Mohit Sharma	829144d15a	addressed review comments	2024-09-27 10:28:37 +00:00
Alvaro Bartolome	0aa66d693a	Fix build with `--features google` (#2566 ) * Fix `cargo build --features google` * Add `cargo test --features google`	2024-09-26 11:41:38 +02:00
Alvaro Bartolome	0b7df77178	Add LoRA adapters support for Gemma2 (#2567 ) * Add LoRA adapters support for Gemma2 * Make `black` formatting happy	2024-09-26 10:54:08 +02:00
Nicholas Broad	7efcb5e0ed	remove LORA_ADAPTERS_PATH (#2563 ) specify how to call local adapters	2024-09-25 01:20:15 +02:00
Nicolas Patry	dd8691b7c5	More tensor cores. (#2558 ) * More tensor cores. * Fixing the logic. * Gemma is modified by this.	2024-09-24 23:57:26 +02:00
Nicolas Patry	c032280b17	Cleanup Vertex + Chat (#2553 ) * Cleanup Vertex + Chat * logprobs defaults to false. * Parameters are optional * Fix docs. * Changing back this logprobs default. * Fixup doc. * Let's debug that. * Not unstable. * Updating Cargo ? * Wat? * Dummy change. * Trying some other install. * Trying something. * Revert everything. * Update Cargo lock. * Fixing the pre-commit after rebase.	2024-09-24 23:37:17 +02:00
Nicolas Patry	75c8c54ac9	Hotfixing main. (#2562 )	2024-09-24 23:00:43 +02:00
Aritra Roy Gosthipaty	e6d29656b5	Adding note for private models in quick-tour document (#2548 ) * chore: adding note for private models in quicktour doc * Update docs/source/quicktour.md Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> --------- Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> Co-authored-by: vb <vaibhavs10@gmail.com>	2024-09-24 15:06:53 +02:00
Orhun Parmaksız	8024ded58f	Simplify crossterm imports (#2545 )	2024-09-24 14:57:20 +02:00
Orhun Parmaksız	03263f5e88	Update the link to the Ratatui organization (#2546 )	2024-09-24 14:51:48 +02:00
Daniël de Kok	3f14cd1420	Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537 ) This replaces the custom layers in both models.	2024-09-24 14:27:06 +02:00
Daniël de Kok	c29dc89c18	Add support for scalar FP8 weight scales (#2550 ) * Add support for scalar FP8 weight scales * Support LLM compressor FP8 checkpoints on H100 On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype. However, we wouldn't pick up fp8 quantization for models quantized with LLM compressor. This change adds enough parsing to detect if models have FP8-quantized weights. * Remove stray debug print	2024-09-24 13:57:40 +02:00
Mohit Sharma	64e981fdcf	fix issue for sliding window models	2024-09-24 10:53:19 +00:00
Nicolas Patry	0ff6ff60ad	Hotfixing main (#2556 )	2024-09-24 11:51:14 +02:00
Nicolas Patry	74d3ce106e	Micro cleanup. (#2555 )	2024-09-24 11:19:24 +02:00
Alvaro Bartolome	d31a6f75cc	Remove duplicated `RUN` in `Dockerfile` (#2547 )	2024-09-24 10:19:13 +02:00
OlivierDehaene	10e6f29295	chore: Add old V2 backend (#2551 ) * wip * added v2	2024-09-24 08:38:17 +02:00
Daniël de Kok	9263817c71	nix: remove unused `_server.nix` file (#2538 )	2024-09-23 09:43:23 +02:00
Nicolas Patry	169178b937	Preparing for release. (#2540 ) * Preparing for release. * Upgrade version in docs.	2024-09-20 17:42:04 +02:00
OlivierDehaene	7e2d18877e	fix: wrap python basic logs in debug assertion in launcher (#2539 ) * fix: wrap python basic logs in debug assertion in launcher * use level filters instead	2024-09-20 14:59:31 +00:00
Mohit Sharma	21d1b0cd8b	fix conflict	2024-09-20 08:59:17 +00:00
Wang, Yi	f478aa77ad	hotfix: ipex fails since cuda moe kernel is not supported (#2532 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-20 10:02:55 +02:00
Daniël de Kok	abd24dd385	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536 )	2024-09-19 22:17:15 +02:00
Daniël de Kok	c103760172	Update to moe-kernels 0.3.1 (#2535 ) * Update to moe-kernels 0.3.1 * Attempt to fix apt failure	2024-09-19 22:16:32 +02:00
Nicolas Patry	f512021e77	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-19 20:50:37 +02:00
Mohit Sharma	4fb947d2aa	fixed style	2024-09-19 14:28:21 +00:00
Mohit Sharma	e6d07a6d34	euff	2024-09-18 12:03:52 +00:00
Daniël de Kok	ce85efa968	Move to moe-kernels package and switch to common MoE layer (#2511 ) * Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner	2024-09-17 18:08:58 +02:00
OlivierDehaene	86984e3236	fix: metrics unbounded memory (#2528 )	2024-09-17 16:01:28 +00:00
Daniël de Kok	71e4268600	nix: pure Rust check/fmt/clippy/test (#2525 ) Runs the tests in a Nix build sandbox.	2024-09-17 12:14:30 +02:00
Nicolas Patry	38fcafcf96	Adding a test for FD. (#2516 ) * Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.	2024-09-16 17:00:54 +02:00
Daniël de Kok	7774655297	Add tests for Mixtral (#2520 ) Disable by default because CI runners do not have enough GPUs.	2024-09-16 12:39:18 +02:00
Alex Strick van Linschoten	9cca3e0b03	Use `ratatui` not (deprecated) `tui` (#2521 ) * use ratatui not archived tui * bump ratatui all the way with options	2024-09-13 18:45:28 +02:00
Mohit Sharma	4ba9210f91	fix docker	2024-09-12 15:45:06 +00:00
Wang, Yi	3ac7df2b6d	hotfix : enable intel ipex cpu and xpu in python3.11 (#2517 ) enable intel ipex cpu and xpu in python3.11 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-12 17:23:49 +02:00
drbh	628334d336	fix: pass missing revision arg for lora adapter when loading multiple… (#2510 ) fix: pass missing revision arg for lora adapter when loading multiple adapters	2024-09-12 17:04:52 +02:00
Mohit Sharma	59fd0cbdff	add skinny kernel and merge fixes	2024-09-12 13:16:13 +00:00

1067 Commits