Commit Graph

1184 Commits

Author SHA1 Message Date
Mohit Sharma
ff905aeff3 Update ROCm libs and improvements (#2579)
* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-10-25 09:01:04 +00:00
Ikram Ul Haq
6808b2de7e Update architecture.md (#2577) 2024-10-25 09:01:04 +00:00
Daniël de Kok
55fd2816ea Remove compute capability lazy cell (#2580)
Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
2024-10-25 09:01:04 +00:00
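
The rationale above (a cached lookup buys nothing when the value is fetched only once) is easy to see in a small sketch. The change itself lives in the Rust launcher; what follows is only a hedged Python analogue, with `functools.cache` standing in for the removed lazy cell and `torch.cuda.get_device_capability` standing in for `get_cuda_capability`:

```python
# Hypothetical Python analogue of the removed lazy cell: memoizing the
# capability lookup adds indirection without benefit if it is queried once.
from functools import cache

import torch


@cache
def get_cuda_capability() -> tuple[int, int]:
    # Returns (major, minor), e.g. (8, 0) on an A100.
    return torch.cuda.get_device_capability()


# Called a single time at startup, a plain call would do just as well:
major, minor = get_cuda_capability()
```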
Daniël de Kok
f82a3f5816 flashinfer: pass window size and dtype (#2574) 2024-10-25 09:01:04 +00:00
Daniël de Kok
653193a942 Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
2024-10-25 09:01:04 +00:00
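
The backend choice described in the bullets above comes down to a small capability check. The sketch below is illustrative only and not TGI's actual code; the function name and returned labels are assumptions, with `torch.cuda.get_device_capability` used to read the compute capability:

```python
# Illustrative sketch of the selection described above: flashinfer with
# prefix caching on capability >= 8, flash-attn v1 + paged attention
# (and no prefix caching) on older GPUs. Names are hypothetical.
import torch


def select_attention_backend() -> tuple[str, bool]:
    """Return (backend_name, prefix_caching_enabled)."""
    major, _minor = torch.cuda.get_device_capability()
    if major >= 8:
        return "flashinfer", True
    # flash-attn v1 cannot use block tables, so the key/value tensors are
    # passed directly and prefix caching is disabled.
    return "flash-attn-v1+paged-attention", False
```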
Alvaro Bartolome
bc28f86903 Fix build with --features google (#2566)
* Fix `cargo build --features google`

* Add `cargo test --features google`
2024-10-25 09:01:04 +00:00
Alvaro Bartolome
6976cf8c4c Add LoRA adapters support for Gemma2 (#2567)
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
2024-10-25 09:01:04 +00:00
Nicholas Broad
0817643b58 remove LORA_ADAPTERS_PATH (#2563)
specify how to call local adapters
2024-10-25 09:01:04 +00:00
Nicolas Patry
a684a81927 More tensor cores. (#2558)
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
2024-10-25 09:01:04 +00:00
Nicolas Patry
97d4bdd685 Cleanup Vertex + Chat (#2553)
* Cleanup Vertex + Chat

* logprobs defaults to false.

* Parameters are optional

* Fix docs.

* Changing back this logprobs default.

* Fixup doc.

* Let's debug that.

* Not unstable.

* Updating Cargo?

* Wat?

* Dummy change.

* Trying some other install.

* Trying something.

* Revert everything.

* Update Cargo lock.

* Fixing the pre-commit after rebase.
2024-10-25 09:01:04 +00:00
Nicolas Patry
25e0edf337 Hotfixing main. (#2562) 2024-10-25 09:01:04 +00:00
Aritra Roy Gosthipaty
782130df17 Adding note for private models in quick-tour document (#2548)
* chore: adding note for private models in quicktour doc

* Update docs/source/quicktour.md

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
2024-10-25 09:01:04 +00:00
Orhun Parmaksız
5247f8938d Simplify crossterm imports (#2545) 2024-10-25 09:01:04 +00:00
Orhun Parmaksız
8c6d3e074f Update the link to the Ratatui organization (#2546) 2024-10-25 09:01:04 +00:00
Daniël de Kok
d4f995e718 Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 (#2537)
This replaces the custom layers in both models.
2024-10-25 09:01:04 +00:00
Daniël de Kok
32d50c2ea7 Add support for scalar FP8 weight scales (#2550)
* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print
2024-10-25 09:01:04 +00:00
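
The detection mentioned above amounts to inspecting the checkpoint's quantization metadata. The sketch below is an assumption about how such a check could look; the `config.json` keys follow common compressed-tensors conventions and are not taken from the commit itself:

```python
# Hedged sketch: decide whether a checkpoint ships FP8-quantized weights by
# reading its config.json. Key names are assumptions, not TGI's parser.
import json
from pathlib import Path


def looks_like_fp8_checkpoint(model_dir: str) -> bool:
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization_config") or {}
    for group in quant.get("config_groups", {}).values():
        weights = group.get("weights", {})
        if weights.get("type") == "float" and weights.get("num_bits") == 8:
            return True
    return False
```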
Nicolas Patry
68cfc94f40 Hotfixing main (#2556) 2024-10-25 08:53:47 +00:00
Nicolas Patry
79ac2b741d Micro cleanup. (#2555) 2024-10-25 08:53:47 +00:00
OlivierDehaene
73e6090d53 chore: Add old V2 backend (#2551)
* wip

* added v2
2024-10-25 08:53:36 +00:00
Daniël de Kok
9aed9d5f81 nix: remove unused _server.nix file (#2538) 2024-10-25 08:53:36 +00:00
yuanwu
b590310255 Add missing import package
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-25 08:52:24 +00:00
yuanwu
8ebe77b3be Simplify the warmup
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-25 08:38:59 +00:00
yuanwu2017
8686a0fc6d
Merge branch 'habana-main' into 2.3.0 2024-10-23 16:32:12 +08:00
yuanwu
67ee45a270 Pass the max_batch_total_tokens to causal_lm
refine the warmup

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-23 08:28:26 +00:00
Thanaji Rao Thakkalapelli
c5e3881051
Enables Flash Attention in TGI for gemma models (#235) 2024-10-18 09:20:42 -07:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
9ae5ad5057
requirements name - cabelo@opensuse.org (#237) 2024-10-18 09:20:05 -07:00
Thanaji Rao Thakkalapelli
46b14e6b28
Remove all references to habana_quantization_toolkit for 1.18 (#229) 2024-10-18 10:59:59 +02:00
Thanaji Rao Thakkalapelli
21c13ff3a6
Remove References to torch compile mode in readme (#236) 2024-10-17 14:07:51 -07:00
Sun Choi
8ae5d4c7d6
Ignore EOS for benchmark by using TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN (#234) 2024-10-16 11:57:36 +02:00
Mandy Li
d07e7f4f62
Merge pull request #233 from huggingface/fix_sysntax
Fix syntax error in PR 232
2024-10-15 14:33:21 -07:00
Thanaji Rao Thakkalapelli
87a1cee32c
Fix syntax error in PR 232 2024-10-15 13:23:48 -07:00
Thanaji Rao Thakkalapelli
e06320f64e
Enabling Flash Attention support for falcon model (#232) 2024-10-15 19:50:17 +02:00
Sun Choi
0578bd917d
Fix gpt_bigcode/starcoderbase-3b accuracy issue (#228)
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
2024-10-14 10:01:55 +02:00
Mohit Deopujari
fe8a373831
Enhancements to README (#226) 2024-10-02 12:22:33 +02:00
yuanwu
bab529c916 Make Gaudi adapt to TGI 2.3.0
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-26 06:04:55 +00:00
yuanwu2017
e424752fa3
Enable the AutoGPTQ (#217)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-25 18:55:02 +02:00
yuanwu
14fdc4ae5e Add some missing modifications of 2.3.0 due to a conflict
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-25 07:49:49 +00:00
Nicolas Patry
514a5a737d Preparing for release. (#2540)
* Preparing for release.

* Upgrade version in docs.
2024-09-25 06:20:50 +00:00
OlivierDehaene
bd9675c8c7 fix: wrap python basic logs in debug assertion in launcher (#2539)
* fix: wrap python basic logs in debug assertion in launcher

* use level filters instead
2024-09-25 06:19:20 +00:00
Wang, Yi
3519398a14 hotfix: ipex fails since cuda moe kernel is not supported (#2532)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:19:20 +00:00
Daniël de Kok
b6ef2bfc1b doc: clarify that --quantize is not needed for pre-quantized models (#2536) 2024-09-25 06:19:20 +00:00
Daniël de Kok
c1a99e2f15 Update to moe-kernels 0.3.1 (#2535)
* Update to moe-kernels 0.3.1

* Attempt to fix apt failure
2024-09-25 06:19:20 +00:00
Nicolas Patry
2d470c8282 Stream options. (#2533)
* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
2024-09-25 06:19:20 +00:00
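
The "only send the usage when asked for" behaviour maps to the OpenAI-style `stream_options` field on the chat endpoint. A minimal client-side example, assuming a TGI instance on `localhost:8080` (the URL and model name are placeholders):

```python
# Request a streamed chat completion and opt in to the final usage chunk.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
        # Usage statistics are only emitted when the client asks for them.
        "stream_options": {"include_usage": True},
    },
    stream=True,
)
for line in response.iter_lines():
    if line:
        print(line.decode())
```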
Daniël de Kok
29a93b78ba Move to moe-kernels package and switch to common MoE layer (#2511)
* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
2024-09-25 06:18:05 +00:00
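
The shared layer mentioned above centralizes the usual sparse-MoE routing step. The snippet below only illustrates that pattern (top-k gating over experts); it is not the `SparseMoELayer` API from the commit, and all names and shapes are assumptions:

```python
# Illustrative top-k routing, the core of a sparse MoE layer: pick the
# highest-scoring experts per token and normalize their gate weights.
import torch
import torch.nn.functional as F


def route_tokens(hidden: torch.Tensor, gate_weight: torch.Tensor, top_k: int):
    """Return (expert_indices, routing_weights), both [tokens, top_k]."""
    logits = hidden @ gate_weight.t()               # [tokens, n_experts]
    weights, experts = torch.topk(logits, top_k, dim=-1)
    weights = F.softmax(weights, dim=-1, dtype=torch.float32)
    return experts, weights.to(hidden.dtype)
```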
OlivierDehaene
88b72c8eb3 fix: metrics unbounded memory (#2528) 2024-09-25 06:17:09 +00:00
Daniël de Kok
0ecbd61099 nix: pure Rust check/fmt/clippy/test (#2525)
Runs the tests in a Nix build sandbox.
2024-09-25 06:17:09 +00:00
Nicolas Patry
0110b83aff Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-25 06:17:09 +00:00
Daniël de Kok
e8c329372b Add tests for Mixtral (#2520)
Disable by default because CI runners do not have enough GPUs.
2024-09-25 06:16:08 +00:00
Alex Strick van Linschoten
afe5cae8fc Use ratatui not (deprecated) tui (#2521)
* use ratatui not archived tui

* bump ratatui all the way with options
2024-09-25 06:16:07 +00:00
Wang, Yi
cbfe9e5185 hotfix: enable intel ipex cpu and xpu in python3.11 (#2517)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:15:35 +00:00