1211 Commits

Author SHA1 Message Date
yuanwu
67ee45a270 Pass the max_batch_total_tokens to causal_lm
refine the warmup

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-23 08:28:26 +00:00
Thanaji Rao Thakkalapelli
c5e3881051 Enables Flash Attention in TGI for gemma models (#235) 2024-10-18 09:20:42 -07:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
9ae5ad5057 requirements name - cabelo@opensuse.org (#237) 2024-10-18 09:20:05 -07:00
Thanaji Rao Thakkalapelli
46b14e6b28 Remove all references to habana_quantization_toolkit for 1.18 (#229) 2024-10-18 10:59:59 +02:00
Thanaji Rao Thakkalapelli
21c13ff3a6 Remove References to torch compile mode in readme (#236) 2024-10-17 14:07:51 -07:00
Sun Choi
8ae5d4c7d6 Ignore EOS for benchmark by using TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN (#234) 2024-10-16 11:57:36 +02:00
Mandy Li
d07e7f4f62 Merge pull request #233 from huggingface/fix_sysntax
Fix sysntax error in PR 232
2024-10-15 14:33:21 -07:00
Thanaji Rao Thakkalapelli
87a1cee32c Fix sysntax error in PR 232 2024-10-15 13:23:48 -07:00
Thanaji Rao Thakkalapelli
e06320f64e Enabling Flash Attention support for falcon model (#232) 2024-10-15 19:50:17 +02:00
Sun Choi
0578bd917d Fix gpt_bigcode/starcoderbase-3b accuracy issue (#228)
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
2024-10-14 10:01:55 +02:00
Mohit Deopujari
fe8a373831 Enhancements to README (#226) 2024-10-02 12:22:33 +02:00
yuanwu
bab529c916 Make Gaudi adapt to the tgi 2.3.0
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-26 06:04:55 +00:00
yuanwu2017
e424752fa3 Enable the AutoGPTQ (#217)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-25 18:55:02 +02:00
yuanwu
14fdc4ae5e Add some missing modification of 2.3.0 because of conflict
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-25 07:49:49 +00:00
Nicolas Patry
514a5a737d Preparing for release. (#2540)
* Preparing for release.

* Upgrade version in docs.
2024-09-25 06:20:50 +00:00
OlivierDehaene
bd9675c8c7 fix: wrap python basic logs in debug assertion in launcher (#2539)
* fix: wrap python basic logs in debug assertion in launcher

* use level filters instead
2024-09-25 06:19:20 +00:00
Wang, Yi
3519398a14 hotfix: ipex fails since cuda moe kernel is not supported (#2532)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:19:20 +00:00
Daniël de Kok
b6ef2bfc1b doc: clarify that --quantize is not needed for pre-quantized models (#2536) 2024-09-25 06:19:20 +00:00
Daniël de Kok
c1a99e2f15 Update to moe-kenels 0.3.1 (#2535)
* Update to moe-kenels 0.3.1

* Attempt to fix apt failure
2024-09-25 06:19:20 +00:00
Nicolas Patry
2d470c8282 Stream options. (#2533)
* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
2024-09-25 06:19:20 +00:00
Daniël de Kok
29a93b78ba Move to moe-kernels package and switch to common MoE layer (#2511)
* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
2024-09-25 06:18:05 +00:00
OlivierDehaene
88b72c8eb3 fix: metrics unbounded memory (#2528) 2024-09-25 06:17:09 +00:00
Daniël de Kok
0ecbd61099 nix: pure Rust check/fmt/clippy/test (#2525)
Runs the tests in a Nix build sandbox.
2024-09-25 06:17:09 +00:00
Nicolas Patry
0110b83aff Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-25 06:17:09 +00:00
Daniël de Kok
e8c329372b Add tests for Mixtral (#2520)
Disable by default because CI runners do not have enough GPUs.
2024-09-25 06:16:08 +00:00
Alex Strick van Linschoten
afe5cae8fc Use ratatui not (deprecated) tui (#2521)
* use ratatui not archived tui

* bump ratatui all the way with options
2024-09-25 06:16:07 +00:00
Wang, Yi
cbfe9e5185 hotfix : enable intel ipex cpu and xpu in python3.11 (#2517)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:15:35 +00:00
drbh
5fc0e0c589 fix: pass missing revision arg for lora adapter when loading multiple… (#2510)
fix: pass missing revision arg for lora adapter when loading multiple adapters
2024-09-25 06:15:35 +00:00
Nicolas Patry
7d897188d5 Add nix test. (#2513)
* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try thuis.

* Our runner + pure test (not written)

* Reemove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.
2024-09-25 06:15:35 +00:00
Daniël de Kok
7be7ab7015 nix: support Python tokenizer conversion in the router (#2515)
Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
2024-09-25 06:15:35 +00:00
Nicolas Patry
f32fa568b6 Fix truffle (#2514)
* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.
2024-09-25 06:15:35 +00:00
Nicolas Patry
c6b568b892 Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-25 06:15:35 +00:00
Nicolas Patry
510d1c76c8 Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-25 06:14:07 +00:00
Vallepu Vamsi Krishna
b67a0cd37b Add Directory Check to Prevent Redundant Cloning in Build Process (#2486)
Update Makefile-fbgemm

Added Directory check for FBGEMM repository cloning.
2024-09-25 06:14:07 +00:00
Nicolas Patry
eb54d956ef Fixing more correctly the invalid drop of the batch. (#2498) 2024-09-25 06:14:07 +00:00
Martin Iglesias Goyanes
7c2ed55b2e Add links to Adyen blogpost (#2500)
* Add links to Adyen blogpost

* Adding to toctree.

* Update external.md

* Update _toctree.yml

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 06:14:07 +00:00
Daniël de Kok
0198db125e hotfix: add syrupy to the right subproject (#2499) 2024-09-25 06:13:36 +00:00
Daniël de Kok
67f44cce0d radix trie: add assertions (#2491)
These should all be cheap assertions.

Also:

* Fixup some comments.
* Delete a `remove` that was done unnecessarily twice.
2024-09-25 06:13:36 +00:00
Daniël de Kok
8ba790a14e Fix incompatibility with latest syrupy and update in Poetry (#2497) 2024-09-25 06:13:36 +00:00
Daniël de Kok
1e14a94721 nix: add pyright/ruff for proper LSP in the impure devshell (#2496)
We need this to ensure that pyright/ruff are part of the same
interpreter/venv.
2024-09-25 06:13:36 +00:00
Wang, Yi
938a7f3c3a hotfix: fix regression of attention api change in intel platform (#2439)
fix regression caused by attention api change. ipex.varlen_attention does not support paged-cache
format kv input now.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:13:36 +00:00
Daniël de Kok
d8610a6219 Add two handy gitignores for Nix environments (#2484) 2024-09-25 06:13:36 +00:00
Nicolas Patry
556a87030b Adding links to Adyen blogpost. (#2492) 2024-09-25 06:13:36 +00:00
Daniël de Kok
c7b495f97d hotfix: avoid non-prefilled block use when using prefix caching (#2489)
The minimum batch size logic could cause prefix blocks to be
deallocated without prefill. The next allocation of the same
prefix would then use garbage blocks.
2024-09-25 06:13:11 +00:00
drbh
34a6399a50 feat: support lora revisions and qkv_proj weights (#2482)
* feat: support lora revisions and qkv_proj weights

* fix: add qkv_proj weights to weight test
2024-09-25 06:13:11 +00:00
drbh
be5cb0cf7f fix: enable chat requests in vertex endpoint (#2481)
* fix: enable chat requests in vertex endpoint

* feat: avoid unwrap and pre allocate future vec
2024-09-25 06:13:11 +00:00
Daniël de Kok
3e17cb7866 nix: add punica-kernels (#2477)
Enables LoRA support.
2024-09-25 06:13:11 +00:00
Daniël de Kok
07c70e7840 nix: improve impure devshell (#2478)
- Add some test dependencies.
- Install server in venv.
- Install Python client in venv.
2024-09-25 06:13:11 +00:00
Nicolas Patry
a313355d2b Tied embeddings in MLP speculator. (#2473)
* Tied embeddings in MLP speculator.

* Fixing the scale_weight when users decide to not use the speculation as
much as defined in the config.

* Adding scaling support + optimize some ops.
2024-09-25 06:13:11 +00:00
Wang, Yi
61b2f493a8 update doc with intel cpu part (#2420)
* update doc with intel cpu part

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release.

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 06:13:11 +00:00