Mohit Sharma
ff905aeff3
Update ROCm libs and improvements ( #2579 )
...
* style
* update torch
* fix issues
* fix clone
* revert mkl
* added custom PA
* style
* fix style
* style
* hide env var
* fix mixtral model
* add skinny kernel and merge fixes
* fixed style
* fix issue for sliding window models
* addressed review comments
* fix import
* improved error message
* updated default value
* remove import
* fix imports after rebase
* float16 dep
* improve dockerfile
* cleaned dockerfile
2024-10-25 09:01:04 +00:00
Ikram Ul Haq
6808b2de7e
Update architecture.md ( #2577 )
2024-10-25 09:01:04 +00:00
Daniël de Kok
55fd2816ea
Remove compute capability lazy cell ( #2580 )
...
Remove compute capability lazy cell
We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
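A minimal sketch of the trade-off, in Python for illustration (the actual change is Rust-side; these names are illustrative):

```python
import functools

import torch

# Cached variant: only worth it if the value is queried many times.
@functools.cache
def get_cuda_capability_cached() -> tuple[int, int]:
    return torch.cuda.get_device_capability()

# Direct variant: fine when the value is read once at startup,
# which is why the lazy cell could be removed.
def get_cuda_capability() -> tuple[int, int]:
    return torch.cuda.get_device_capability()
```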
2024-10-25 09:01:04 +00:00
Daniël de Kok
f82a3f5816
flashinfer: pass window size and dtype ( #2574 )
2024-10-25 09:01:04 +00:00
Daniël de Kok
653193a942
Improve support for GPUs with capability < 8 ( #2575 )
...
* Improve support for GPUs with capability < 8
- For models that cannot use flashinfer on GPUs with a compute capability older than 8, use flash-attn v1 + paged attention (see the selection sketch after this list).
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
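A rough Python sketch of the selection logic described above; the function name and return values are illustrative, not TGI's actual API:

```python
import torch

def select_attention_backend(supports_flashinfer: bool):
    """Pick an attention backend from the GPU compute capability.

    Returns (backend, prefix_caching_enabled). Illustrative only.
    """
    major, _minor = torch.cuda.get_device_capability()
    if major >= 8 and supports_flashinfer:
        # Ampere or newer: flashinfer supports block tables, so
        # prefix caching can stay enabled.
        return "flashinfer", True
    # Older GPUs: flash-attn v1 for prefill plus paged attention for
    # decode. v1 cannot use block tables, so prefill receives the raw
    # key/value tensors rather than the cache, and prefix caching
    # must be disabled.
    return "flash-attn-v1+paged", False
```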
2024-10-25 09:01:04 +00:00
Alvaro Bartolome
bc28f86903
Fix build with --features google ( #2566 )
...
* Fix `cargo build --features google`
* Add `cargo test --features google`
2024-10-25 09:01:04 +00:00
Alvaro Bartolome
6976cf8c4c
Add LoRA adapters support for Gemma2 ( #2567 )
...
* Add LoRA adapters support for Gemma2
* Make `black` formatting happy
2024-10-25 09:01:04 +00:00
Nicholas Broad
0817643b58
remove LORA_ADAPTERS_PATH ( #2563 )
...
Specify how to call local adapters (see the sketch below).
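A hedged sketch of calling a local adapter, assuming the name=path startup syntax for `LORA_ADAPTERS` and the `adapter_id` request parameter from TGI's LoRA docs:

```python
import requests

# Assumed startup configuration (shell), registering a local adapter
# under a name -- the form this commit documents:
#   LORA_ADAPTERS=my-adapter=/data/adapters/my-adapter \
#     text-generation-launcher --model-id <base-model>

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {
            # Select the adapter registered at startup by name.
            "adapter_id": "my-adapter",
            "max_new_tokens": 64,
        },
    },
)
print(resp.json()["generated_text"])
```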
2024-10-25 09:01:04 +00:00
Nicolas Patry
a684a81927
More tensor cores. ( #2558 )
...
* More tensor cores.
* Fixing the logic.
* Gemma is modified by this.
2024-10-25 09:01:04 +00:00
Nicolas Patry
97d4bdd685
Cleanup Vertex + Chat ( #2553 )
...
* Cleanup Vertex + Chat
* logprobs defaults to false.
* Parameters are optional
* Fix docs.
* Changing back this logprobs default.
* Fixup doc.
* Let's debug that.
* Not unstable.
* Updating Cargo ?
* Wat?
* Dummy change.
* Trying some other install.
* Trying something.
* Revert everything.
* Update Cargo lock.
* Fixing the pre-commit after rebase.
2024-10-25 09:01:04 +00:00
Nicolas Patry
25e0edf337
Hotfixing main. ( #2562 )
2024-10-25 09:01:04 +00:00
Aritra Roy Gosthipaty
782130df17
Adding note for private models in quick-tour document ( #2548 )
...
* chore: adding note for private models in quicktour doc
* Update docs/source/quicktour.md
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>
* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>
2024-10-25 09:01:04 +00:00
Orhun Parmaksız
5247f8938d
Simplify crossterm imports ( #2545 )
2024-10-25 09:01:04 +00:00
Orhun Parmaksız
8c6d3e074f
Update the link to the Ratatui organization ( #2546 )
2024-10-25 09:01:04 +00:00
Daniël de Kok
d4f995e718
Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 ( #2537 )
...
This replaces the custom layers in both models.
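For illustration, a minimal dense-MoE layer in PyTorch: every expert runs on every token and the outputs are mixed by router probabilities (TGI's `DenseMoELayer` differs in details):

```python
import torch
from torch import nn

class DenseMoE(nn.Module):
    """Dense mixture of experts: run all experts, weight by router probs."""

    def __init__(self, hidden: int, ffn: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (tokens, n_experts) routing probabilities.
        weights = self.router(x).softmax(dim=-1)
        # Stack expert outputs: (tokens, n_experts, hidden).
        outs = torch.stack([e(x) for e in self.experts], dim=1)
        # Mix expert outputs per token.
        return torch.einsum("te,teh->th", weights, outs)
```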
2024-10-25 09:01:04 +00:00
Daniël de Kok
32d50c2ea7
Add support for scalar FP8 weight scales ( #2550 )
...
* Add support for scalar FP8 weight scales
* Support LLM Compressor FP8 checkpoints on H100
On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up FP8 quantization for models quantized with
LLM Compressor. This change adds enough parsing to detect if models have
FP8-quantized weights (see the sketch after this list).
* Remove stray debug print
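A sketch of what scalar-scale support involves, assuming an FP8 `weight_scale` tensor that is either a single scalar or per-output-channel (names illustrative, not TGI's exact loader):

```python
import torch

def dequantize_fp8(weight: torch.Tensor, weight_scale: torch.Tensor) -> torch.Tensor:
    """Dequantize an FP8 weight, accepting scalar or per-channel scales."""
    if weight_scale.numel() == 1:
        # Scalar scale (e.g. some LLM Compressor checkpoints):
        # broadcast one value across all output channels.
        weight_scale = weight_scale.reshape(1, 1)
    else:
        # Per-channel scale: one value per output row.
        weight_scale = weight_scale.reshape(-1, 1)
    # fbgemm-gpu on H100 wants bfloat16 inputs, hence the upcast target.
    return weight.to(torch.bfloat16) * weight_scale.to(torch.bfloat16)
```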
2024-10-25 09:01:04 +00:00
Nicolas Patry
68cfc94f40
Hotfixing main ( #2556 )
2024-10-25 08:53:47 +00:00
Nicolas Patry
79ac2b741d
Micro cleanup. ( #2555 )
2024-10-25 08:53:47 +00:00
OlivierDehaene
73e6090d53
chore: Add old V2 backend ( #2551 )
...
* wip
* added v2
2024-10-25 08:53:36 +00:00
Daniël de Kok
9aed9d5f81
nix: remove unused _server.nix file ( #2538 )
2024-10-25 08:53:36 +00:00
yuanwu
b590310255
Add a missing package import
...
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-25 08:52:24 +00:00
yuanwu
8ebe77b3be
Simplify the warmup
...
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-25 08:38:59 +00:00
yuanwu2017
8686a0fc6d
Merge branch 'habana-main' into 2.3.0
2024-10-23 16:32:12 +08:00
yuanwu
67ee45a270
Pass the max_batch_total_tokens to causal_lm
...
Refine the warmup
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-23 08:28:26 +00:00
Thanaji Rao Thakkalapelli
c5e3881051
Enables Flash Attention in TGI for Gemma models ( #235 )
2024-10-18 09:20:42 -07:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
9ae5ad5057
requirements name - cabelo@opensuse.org ( #237 )
2024-10-18 09:20:05 -07:00
Thanaji Rao Thakkalapelli
46b14e6b28
Remove all references to habana_quantization_toolkit for 1.18 ( #229 )
2024-10-18 10:59:59 +02:00
Thanaji Rao Thakkalapelli
21c13ff3a6
Remove references to torch compile mode in README ( #236 )
2024-10-17 14:07:51 -07:00
Sun Choi
8ae5d4c7d6
Ignore EOS for benchmark by using TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN ( #234 )
2024-10-16 11:57:36 +02:00
Mandy Li
d07e7f4f62
Merge pull request #233 from huggingface/fix_sysntax
...
Fix syntax error in PR 232
2024-10-15 14:33:21 -07:00
Thanaji Rao Thakkalapelli
87a1cee32c
Fix syntax error in PR 232
2024-10-15 13:23:48 -07:00
Thanaji Rao Thakkalapelli
e06320f64e
Enabling Flash Attention support for the Falcon model ( #232 )
2024-10-15 19:50:17 +02:00
Sun Choi
0578bd917d
Fix gpt_bigcode/starcoderbase-3b accuracy issue ( #228 )
...
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
2024-10-14 10:01:55 +02:00
Mohit Deopujari
fe8a373831
Enhancements to README ( #226 )
2024-10-02 12:22:33 +02:00
yuanwu
bab529c916
Make Gaudi adapt to TGI 2.3.0
...
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-26 06:04:55 +00:00
yuanwu2017
e424752fa3
Enable the AutoGPTQ ( #217 )
...
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-25 18:55:02 +02:00
yuanwu
14fdc4ae5e
Add some missing modifications from 2.3.0 that were lost in a merge conflict
...
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-25 07:49:49 +00:00
Nicolas Patry
514a5a737d
Preparing for release. ( #2540 )
...
* Preparing for release.
* Upgrade version in docs.
2024-09-25 06:20:50 +00:00
OlivierDehaene
bd9675c8c7
fix: wrap python basic logs in debug assertion in launcher ( #2539 )
...
* fix: wrap python basic logs in debug assertion in launcher
* use level filters instead
2024-09-25 06:19:20 +00:00
Wang, Yi
3519398a14
hotfix: ipex fails since cuda moe kernel is not supported ( #2532 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:19:20 +00:00
Daniël de Kok
b6ef2bfc1b
doc: clarify that --quantize is not needed for pre-quantized models ( #2536 )
2024-09-25 06:19:20 +00:00
Daniël de Kok
c1a99e2f15
Update to moe-kernels 0.3.1 ( #2535 )
...
* Update to moe-kernels 0.3.1
* Attempt to fix apt failure
2024-09-25 06:19:20 +00:00
Nicolas Patry
2d470c8282
Stream options. ( #2533 )
...
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for (see the request sketch after this list).
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
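The feature mirrors OpenAI's `stream_options`; a sketch of a streamed request that opts into usage reporting (endpoint shape per the OpenAI-compatible API):

```python
import json

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
        # Usage is only sent when explicitly asked for.
        "stream_options": {"include_usage": True},
    },
    stream=True,
)
for line in resp.iter_lines():
    if line.startswith(b"data:") and line != b"data: [DONE]":
        chunk = json.loads(line[len(b"data:"):])
        # With include_usage, the final chunk carries token counts.
        if chunk.get("usage"):
            print(chunk["usage"])
```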
2024-09-25 06:19:20 +00:00
Daniël de Kok
29a93b78ba
Move to moe-kernels package and switch to common MoE layer ( #2511 )
...
* Move to moe-kernels package and switch to common MoE layer
This change introduces the new `moe-kernels` package:
- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
models.
- Port over Mixtral and Deepseek (a naive routing sketch follows this list).
* Make `cargo check` pass
* Update runner
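A naive top-k routing sketch showing what a sparse MoE layer computes; the real `SparseMoELayer` dispatches to fused `moe-kernels` kernels rather than this Python loop:

```python
import torch
from torch import nn

class NaiveSparseMoE(nn.Module):
    """Top-k MoE: each token is processed by only its k best experts."""

    def __init__(self, hidden: int, ffn: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)            # renormalize over top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```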
2024-09-25 06:18:05 +00:00
OlivierDehaene
88b72c8eb3
fix: metrics unbounded memory ( #2528 )
2024-09-25 06:17:09 +00:00
Daniël de Kok
0ecbd61099
nix: pure Rust check/fmt/clippy/test ( #2525 )
...
Runs the tests in a Nix build sandbox.
2024-09-25 06:17:09 +00:00
Nicolas Patry
0110b83aff
Adding a test for FD. ( #2516 )
...
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1
* Last reference.
* Use an actual hash (see the block-hash sketch after this list).
* Update hash for slice.len() == 1
* Update the locks.
* Increasing docker timeout.
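The radix fixes revolve around hashing token blocks for prefix-cache lookups; a hedged sketch of chained block hashing with block_size > 1 (illustrative, not TGI's exact scheme):

```python
from hashlib import blake2b

def block_hashes(tokens: list[int], block_size: int) -> list[int]:
    """Chain a hash per full block so equal prefixes share cache entries."""
    hashes, prev = [], b""
    # Only full blocks are hashed; a trailing partial block is skipped.
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        block = tokens[i : i + block_size]
        # Each hash folds in the previous one, so a block's key depends
        # on the entire prefix before it.
        h = blake2b(prev + str(block).encode("utf-8"), digest_size=8)
        prev = h.digest()
        hashes.append(int.from_bytes(prev, "little"))
    return hashes
```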
2024-09-25 06:17:09 +00:00
Daniël de Kok
e8c329372b
Add tests for Mixtral ( #2520 )
...
Disable by default because CI runners do not have enough GPUs.
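One way to express that gating in pytest, assuming the tests need multiple GPUs (the repo's actual marker may differ):

```python
import pytest
import torch

requires_multi_gpu = pytest.mark.skipif(
    torch.cuda.device_count() < 2,
    reason="Mixtral tests need more GPUs than CI runners provide",
)

@requires_multi_gpu
def test_mixtral_generation():
    ...
```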
2024-09-25 06:16:08 +00:00
Alex Strick van Linschoten
afe5cae8fc
Use ratatui not (deprecated) tui ( #2521 )
...
* use ratatui not archived tui
* bump ratatui all the way with options
2024-09-25 06:16:07 +00:00
Wang, Yi
cbfe9e5185
hotfix : enable intel ipex cpu and xpu in python3.11 ( #2517 )
...
enable intel ipex cpu and xpu in python3.11
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:15:35 +00:00