text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-10 23:45:23 +00:00

Author	SHA1	Message	Date
yuanwu	67ee45a270	Pass the max_batch_total_tokens to causal_lm refine the warmup Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-10-23 08:28:26 +00:00
Thanaji Rao Thakkalapelli	c5e3881051	Enables Flash Attention in TGI for gemma models (#235)	2024-10-18 09:20:42 -07:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	9ae5ad5057	requirements name - cabelo@opensuse.org (#237)	2024-10-18 09:20:05 -07:00
Thanaji Rao Thakkalapelli	46b14e6b28	Remove all references to habana_quantization_toolkit for 1.18 (#229)	2024-10-18 10:59:59 +02:00
Thanaji Rao Thakkalapelli	21c13ff3a6	Remove References to torch compile mode in readme (#236)	2024-10-17 14:07:51 -07:00
Sun Choi	8ae5d4c7d6	Ignore EOS for benchmark by using TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN (#234)	2024-10-16 11:57:36 +02:00
Mandy Li	d07e7f4f62	Merge pull request #233 from huggingface/fix_sysntax Fix sysntax error in PR 232	2024-10-15 14:33:21 -07:00
Thanaji Rao Thakkalapelli	87a1cee32c	Fix sysntax error in PR 232	2024-10-15 13:23:48 -07:00
Thanaji Rao Thakkalapelli	e06320f64e	Enabling Flash Attention support for falcon model (#232)	2024-10-15 19:50:17 +02:00
Sun Choi	0578bd917d	Fix gpt_bigcode/starcoderbase-3b accuracy issue (#228) Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>	2024-10-14 10:01:55 +02:00
Mohit Deopujari	fe8a373831	Enhancements to README (#226)	2024-10-02 12:22:33 +02:00
yuanwu	bab529c916	Make Gaudi adapt to the tgi 2.3.0 Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-26 06:04:55 +00:00
yuanwu2017	e424752fa3	Enable the AutoGPTQ (#217) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-25 18:55:02 +02:00
yuanwu	14fdc4ae5e	Add some missing modification of 2.3.0 because of conflict Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-09-25 07:49:49 +00:00
Nicolas Patry	514a5a737d	Preparing for release. (#2540) * Preparing for release. * Upgrade version in docs.	2024-09-25 06:20:50 +00:00
OlivierDehaene	bd9675c8c7	fix: wrap python basic logs in debug assertion in launcher (#2539) * fix: wrap python basic logs in debug assertion in launcher * use level filters instead	2024-09-25 06:19:20 +00:00
Wang, Yi	3519398a14	hotfix: ipex fails since cuda moe kernel is not supported (#2532) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:19:20 +00:00
Daniël de Kok	b6ef2bfc1b	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536)	2024-09-25 06:19:20 +00:00
Daniël de Kok	c1a99e2f15	Update to moe-kenels 0.3.1 (#2535) * Update to moe-kenels 0.3.1 * Attempt to fix apt failure	2024-09-25 06:19:20 +00:00
Nicolas Patry	2d470c8282	Stream options. (#2533) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-25 06:19:20 +00:00
Daniël de Kok	29a93b78ba	Move to moe-kernels package and switch to common MoE layer (#2511) * Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner	2024-09-25 06:18:05 +00:00
OlivierDehaene	88b72c8eb3	fix: metrics unbounded memory (#2528)	2024-09-25 06:17:09 +00:00
Daniël de Kok	0ecbd61099	nix: pure Rust check/fmt/clippy/test (#2525) Runs the tests in a Nix build sandbox.	2024-09-25 06:17:09 +00:00
Nicolas Patry	0110b83aff	Adding a test for FD. (#2516) * Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.	2024-09-25 06:17:09 +00:00
Daniël de Kok	e8c329372b	Add tests for Mixtral (#2520) Disable by default because CI runners do not have enough GPUs.	2024-09-25 06:16:08 +00:00
Alex Strick van Linschoten	afe5cae8fc	Use `ratatui` not (deprecated) `tui` (#2521) * use ratatui not archived tui * bump ratatui all the way with options	2024-09-25 06:16:07 +00:00
Wang, Yi	cbfe9e5185	hotfix : enable intel ipex cpu and xpu in python3.11 (#2517) enable intel ipex cpu and xpu in python3.11 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:15:35 +00:00
drbh	5fc0e0c589	fix: pass missing revision arg for lora adapter when loading multiple… (#2510) fix: pass missing revision arg for lora adapter when loading multiple adapters	2024-09-25 06:15:35 +00:00
Nicolas Patry	7d897188d5	Add nix test. (#2513) * Add nix test. * Modifying yourself means you need to rerun. * Fixing the test + adding click (needed for pre-commit hooks). * Try thuis. * Our runner + pure test (not written) * Reemove server. * Root user. * Different user ? * Add the actual test target. * Forgot this modification. * Add a formatter. * Add the secrets. * Fixed the auth token ? * Adding the other tests. * Missing pre-commit. * Test requires cargo for cargo fmt. * Update it a bit. * Up. * Attempting to use a cache location for the models. * Ignore the cache for now.	2024-09-25 06:15:35 +00:00
Daniël de Kok	7be7ab7015	nix: support Python tokenizer conversion in the router (#2515) Ideally we wouldn't have the router wrapper that this change adds, but when I give PyO3 a Python interpreter with packages, it ends up linking libpython from the Python interpreter rather than the constructed environment and cannot pick up the Python modules as a result.	2024-09-25 06:15:35 +00:00
Nicolas Patry	f32fa568b6	Fix truffle (#2514) * Attempting to discard the trufflehog warning. * Attempt to fix trufflehog.	2024-09-25 06:15:35 +00:00
Nicolas Patry	c6b568b892	Fix tokenization yi (#2507) * Fixing odd tokenization self modifications on the Rust side (load and resave in Python). * Fixing the builds ? * Fix the gh action? * Fixing the location ? * Validation is odd. * Try a faster runner * Upgrade python version. * Remove sccache * No sccache. * Getting libpython maybe ? * List stuff. * Monkey it up. * have no idea at this point * Tmp. * Shot in the dark. * Tmate the hell out of this. * Desperation. * WTF. * -y. * Apparently 3.10 is not available anymore. * Updating the dockerfile to make libpython discoverable at runtime too. * Put back rust tests. * Why do we want mkl on AMD ? * Forcing 3.11 ?	2024-09-25 06:15:35 +00:00
Nicolas Patry	510d1c76c8	Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) * Adding prefix test. * [WIP] tmp dump of integration load tests. * Remove other tensor creation. * Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids. * Fix parsing * Is it really flashinfer version ? * Remove some comments. * Revert the max prefix hit. * Adding numpy to diff. * Upgraded flashinfer. * Upgrading some stuff. * Are we done yet ? * Minor fixup * Remove 1 log and put back the other. * Add comment for why slot 0 is OK. * Mounting on the job. * Get me a debug branch * Debugging CIs is fun. * Attempt #28 * wip * Tmate. * Praying. * Updating VLM causal model with updated context. * Important line got squashed. * Tmate again. * Fingers crossed. * We want only 1 run of integration tests..... --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>	2024-09-25 06:14:07 +00:00
Vallepu Vamsi Krishna	b67a0cd37b	Add Directory Check to Prevent Redundant Cloning in Build Process (#2486) Update Makefile-fbgemm Added Directory check for FBGEMM repository cloning.	2024-09-25 06:14:07 +00:00
Nicolas Patry	eb54d956ef	Fixing more correctly the invalid drop of the batch. (#2498)	2024-09-25 06:14:07 +00:00
Martin Iglesias Goyanes	7c2ed55b2e	Add links to Adyen blogpost (#2500) * Add links to Adyen blogpost * Adding to toctree. * Update external.md * Update _toctree.yml --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-25 06:14:07 +00:00
Daniël de Kok	0198db125e	hotfix: add syrupy to the right subproject (#2499)	2024-09-25 06:13:36 +00:00
Daniël de Kok	67f44cce0d	radix trie: add assertions (#2491) These should all be cheap assertions. Also: * Fixup some comments. * Delete a `remove` that was done unnecessarily twice.	2024-09-25 06:13:36 +00:00
Daniël de Kok	8ba790a14e	Fix incompatibility with latest `syrupy` and update in Poetry (#2497)	2024-09-25 06:13:36 +00:00
Daniël de Kok	1e14a94721	nix: add pyright/ruff for proper LSP in the impure devshell (#2496) We need this to ensure that pyright/ruff are part of the same interpreter/venv.	2024-09-25 06:13:36 +00:00
Wang, Yi	938a7f3c3a	hotfix: fix regression of attention api change in intel platform (#2439) fix regression caused by attention api change. ipex.varlen_attention does not support paged-cache format kv input now. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 06:13:36 +00:00
Daniël de Kok	d8610a6219	Add two handy gitignores for Nix environments (#2484)	2024-09-25 06:13:36 +00:00
Nicolas Patry	556a87030b	Adding links to Adyen blogpost. (#2492)	2024-09-25 06:13:36 +00:00
Daniël de Kok	c7b495f97d	hotfix: avoid non-prefilled block use when using prefix caching (#2489) The minimum batch size logic could cause prefix blocks to be deallocated without prefill. The next allocation of the same prefix would then use garbage blocks.	2024-09-25 06:13:11 +00:00
drbh	34a6399a50	feat: support lora revisions and qkv_proj weights (#2482) * feat: support lora revisions and qkv_proj weights * fix: add qkv_proj weights to weight test	2024-09-25 06:13:11 +00:00
drbh	be5cb0cf7f	fix: enable chat requests in vertex endpoint (#2481) * fix: enable chat requests in vertex endpoint * feat: avoid unwrap and pre allocate future vec	2024-09-25 06:13:11 +00:00
Daniël de Kok	3e17cb7866	nix: add punica-kernels (#2477) Enables LoRA support.	2024-09-25 06:13:11 +00:00
Daniël de Kok	07c70e7840	nix: improve impure devshell (#2478) - Add some test dependencies. - Install server in venv. - Install Python client in venv.	2024-09-25 06:13:11 +00:00
Nicolas Patry	a313355d2b	Tied embeddings in MLP speculator. (#2473) * Tied embeddings in MLP speculator. * Fixing the scale_weight when users decide to not use the speculation as much as defined in the config. * Adding scaling support + optimize some ops.	2024-09-25 06:13:11 +00:00
Wang, Yi	61b2f493a8	update doc with intel cpu part (#2420) * update doc with intel cpu part Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release. --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-25 06:13:11 +00:00
1211 Commits