Commit Graph

1215 Commits

Author SHA1 Message Date
yuanwu
c6f023a06b Use optimum-habana v1.15-release branch
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 13:02:31 +00:00
yuanwu
1b659788b5 Add --no-deps to the pip install
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 12:14:38 +00:00
yuanwu
73e6e3b871 Remove the error log
Subsequent updates will remove this code

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 11:55:13 +00:00
yuanwu
9f356ce045 Refine the warmup process
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-07 09:56:16 +00:00
yuanwu
253a992447 Remove the CI workflows we don't currently support
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-02 08:45:36 +00:00
yuanwu
0228bd0260 Don't run the prefill warmup when limit_hpu_graph=true
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 21:29:41 +00:00
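
A minimal sketch of the guard this commit describes, assuming tgi-gaudi reads `limit_hpu_graph` from the `LIMIT_HPU_GRAPH` environment variable; the helper name and default value are illustrative, not the repository's actual code:

```python
import os

def should_run_prefill_warmup() -> bool:
    # Hypothetical helper: when limit_hpu_graph is enabled (LIMIT_HPU_GRAPH=true),
    # skip the prefill warmup pass; otherwise run it as before.
    limit_hpu_graph = os.getenv("LIMIT_HPU_GRAPH", "false").lower() in ("true", "1")
    return not limit_hpu_graph
```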
yuanwu
4586325a34 Fix the StarCoder warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 06:14:00 +00:00
Yuan Wu
b83419a769
Merge branch 'habana-main' into 2.3.0 2024-11-28 12:38:36 +08:00
yuanwu
636cdb4c43 Fix StarCoder issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-26 08:55:42 +00:00
srajabos
d49ce00f40
With this change, bucketing/padding of input is applied to health check. (#245) 2024-11-18 22:38:30 +01:00
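
The bucketing/padding applied here rounds the health-check input up to the nearest warmed-up shape so it exercises the same padded graphs as regular requests; a hedged sketch with hypothetical bucket sizes:

```python
def pad_to_bucket(seq_len: int, buckets=(128, 256, 512, 1024)) -> int:
    # Hypothetical bucketing: pick the smallest bucket that fits the input so the
    # padded shape matches one that was already warmed up.
    for bucket in buckets:
        if seq_len <= bucket:
            return bucket
    return buckets[-1]  # inputs longer than the largest bucket are capped here

# In this sketch, a one-token health-check request is padded to 128 tokens.
```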
yuanwu2017
56c3eb4adb
Remove the torch package in requirements.txt (#246)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-07 09:22:24 -08:00
yuanwu2017
c345c734a7
Merge branch 'habana-main' into 2.3.0 2024-11-01 11:24:40 +08:00
yuanwu
fcf2e3a338 Fix the prefill warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-01 05:08:52 +02:00
Thanaji Rao Thakkalapelli
6ba3d1d6e5
updated release docker image version in readme to 2.0.6 (#242) 2024-10-31 15:44:16 -07:00
yuanwu2017
8d84ffabf2
Upgrade to SynapseAI 1.18 (#227)
Signed-off-by: yuanwu <yuan.wu@intel.com>
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
2024-10-31 20:14:44 +01:00
Thanaji Rao Thakkalapelli
7fb4af9a87
updated supported models list table in readme (#241)
* updated supported models list table in readme

* updated readme

* updated readme
2024-10-29 23:28:45 -07:00
yuanwu
4c9856f9e5 Add missing package
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-28 07:04:56 +00:00
yuanwu2017
c23584f626
Merge branch 'habana-main' into 2.3.0 2024-10-28 04:37:07 +08:00
yuanwu
372e071135 Fix tgi-gaudi issues for v2.3.1
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-27 20:40:36 +00:00
Nicolas Patry
7e282b4153 V2.3.1 2024-10-27 04:14:35 +00:00
Nicolas Patry
34e98b14ef New release 2.3.1 (#2604)
* New release 2.3.1

* Update doc number
2024-10-27 04:14:35 +00:00
drbh
902f526d69 Unroll notify error into generate response (#2597)
* feat: unroll notify_error if no tool is chosen

* fix: expect simple message when no tool is selected

* fix: improve test to avoid notify_error

* fix: improve docs and indicate change in expected response

* fix: adjust linting in test file
2024-10-27 04:03:57 +00:00
drbh
7664d2e2b3 CI (2592): Allow LoRA adapter revision in server launcher (#2602)
allow revision for lora adapters from launcher

Co-authored-by: Sida <sida@kulamind.com>
Co-authored-by: teamclouday <teamclouday@gmail.com>
2024-10-27 04:03:57 +00:00
Nicolas Patry
967e67111d Max token capacity metric (#2595)
* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
2024-10-27 04:03:57 +00:00
Nicolas Patry
51506aa57a Mllama flash version (#2585)
* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers to 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems to be instability afterwards, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0
2024-10-27 04:03:57 +00:00
Daniël de Kok
fa964f82d3 nix: experimental support for building a Docker container (#2470)
* nix: experimental support for building a Docker image

Run using something like:

```
docker run \
  --device nvidia.com/gpu=all \
  -it --rm -p 8080:80 \
  -v $PWD/data:/data \
  -v $PWD/tmp:/tmp \
  tgi-docker:latest \
  --model-id <model_id>
```

* Example of building the Docker image using Nix inside Docker

* Stream to make the builder image smaller

This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.

* Don't spam journalctl on Linux

* Other dockerfile.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 09:12:03 +00:00
Daniël de Kok
775e5f4c64 MoE Marlin: support desc_act for groupsize != -1 (#2590)
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
2024-10-25 09:12:03 +00:00
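
For reference, `desc_act` (activation-order) GPTQ stores a per-input-channel group index; a hedged sketch of the reordering idea, not the vLLM Marlin kernel itself:

```python
import torch

def act_order_permutation(g_idx: torch.Tensor) -> torch.Tensor:
    # Hypothetical illustration: sorting the group indices yields the input-channel
    # permutation that makes channels of the same quantization group contiguous,
    # which is what an act-order-aware kernel applies before the matmul.
    return torch.argsort(g_idx)
```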
Daniël de Kok
692f8ddb69 Move flake back to tgi-nix main (#2586) 2024-10-25 09:12:03 +00:00
drbh
bdc47394d2 feat: support phi3.5 moe (#2479)
* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 09:12:03 +00:00
Daniël de Kok
288bcb0027 Add support for GPTQ-quantized MoE models using MoE Marlin (#2557)
This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
2024-10-25 09:07:52 +00:00
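
A hedged Python sketch of the constraints listed above, with hypothetical parameter names taken from typical GPTQ checkpoint metadata:

```python
def gptq_moe_marlin_supported(desc_act: bool, group_size: int, sym: bool,
                              is_awq: bool, tensor_parallel: bool) -> bool:
    # Hypothetical check mirroring the commit's stated limitations.
    if is_awq:
        return False  # AWQ checkpoints are not supported
    if not sym:
        return False  # asymmetric quantization is not supported
    if desc_act and tensor_parallel and group_size != -1:
        return False  # desc_act with tensor parallelism only when group_size == -1
    return True
```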
Mohit Sharma
ff905aeff3 Update ROCM libs and improvements (#2579)
* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-10-25 09:01:04 +00:00
Ikram Ul Haq
6808b2de7e Update architecture.md (#2577) 2024-10-25 09:01:04 +00:00
Daniël de Kok
55fd2816ea Remove compute capability lazy cell (#2580)
Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
2024-10-25 09:01:04 +00:00
Daniël de Kok
f82a3f5816 flashinfer: pass window size and dtype (#2574) 2024-10-25 09:01:04 +00:00
Daniël de Kok
653193a942 Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
2024-10-25 09:01:04 +00:00
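
A hedged sketch of the fallback logic this commit describes; the helper and return shape are illustrative, not TGI's actual dispatch code:

```python
def select_attention_backend(compute_capability: tuple, flashinfer_usable: bool) -> dict:
    major, _minor = compute_capability
    if flashinfer_usable and major >= 8:
        return {"backend": "flashinfer", "prefix_caching": True, "pass_kv_cache": True}
    # Older GPUs fall back to flash-attn v1 + paged attention: v1 cannot read the
    # block-table cache, so the raw key/value tensors are passed instead, and
    # prefix caching is disabled.
    return {"backend": "flash-attn-v1+paged", "prefix_caching": False, "pass_kv_cache": False}
```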
Alvaro Bartolome
bc28f86903 Fix build with --features google (#2566)
* Fix `cargo build --features google`

* Add `cargo test --features google`
2024-10-25 09:01:04 +00:00
Alvaro Bartolome
6976cf8c4c Add LoRA adapters support for Gemma2 (#2567)
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
2024-10-25 09:01:04 +00:00
Nicholas Broad
0817643b58 remove LORA_ADAPTERS_PATH (#2563)
specify how to call local adapters
2024-10-25 09:01:04 +00:00
Nicolas Patry
a684a81927 More tensor cores. (#2558)
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
2024-10-25 09:01:04 +00:00
Nicolas Patry
97d4bdd685 Cleanup Vertex + Chat (#2553)
* Cleanup Vertex + Chat

* logprobs defaults to false.

* Parameters are optional

* Fix docs.

* Changing back this logprobs default.

* Fixup doc.

* Let's debug that.

* Not unstable.

* Updating Cargo ?

* Wat?

* Dummy change.

* Trying some other install.

* Trying something.

* Revert everything.

* Update Cargo lock.

* Fixing the pre-commit after rebase.
2024-10-25 09:01:04 +00:00
Nicolas Patry
25e0edf337 Hotfixing main. (#2562) 2024-10-25 09:01:04 +00:00
Aritra Roy Gosthipaty
782130df17 Adding note for private models in quick-tour document (#2548)
* chore: adding note for private models in quicktour doc

* Update docs/source/quicktour.md

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
2024-10-25 09:01:04 +00:00
Orhun Parmaksız
5247f8938d Simplify crossterm imports (#2545) 2024-10-25 09:01:04 +00:00
Orhun Parmaksız
8c6d3e074f Update the link to the Ratatui organization (#2546) 2024-10-25 09:01:04 +00:00
Daniël de Kok
d4f995e718 Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 (#2537)
This replaces the custom layers in both models.
2024-10-25 09:01:04 +00:00
Daniël de Kok
32d50c2ea7 Add support for scalar FP8 weight scales (#2550)
* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print
2024-10-25 09:01:04 +00:00
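
A hedged sketch of handling a scalar (per-tensor) FP8 weight scale alongside per-channel scales; the function name is illustrative:

```python
import torch

def expand_weight_scale(scale: torch.Tensor, out_features: int) -> torch.Tensor:
    # Illustrative only: broadcast a scalar (per-tensor) scale to a per-channel
    # vector so one dequantization path can handle both checkpoint layouts.
    if scale.numel() == 1:
        return scale.reshape(1).expand(out_features).contiguous()
    return scale
```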
Nicolas Patry
68cfc94f40 Hotfixing main (#2556) 2024-10-25 08:53:47 +00:00
Nicolas Patry
79ac2b741d Micro cleanup. (#2555) 2024-10-25 08:53:47 +00:00
OlivierDehaene
73e6090d53 chore: Add old V2 backend (#2551)
* wip

* added v2
2024-10-25 08:53:36 +00:00
Daniël de Kok
9aed9d5f81 nix: remove unused _server.nix file (#2538) 2024-10-25 08:53:36 +00:00