Commit Graph

1229 Commits

Author SHA1 Message Date
Yuan Wu
fe7594e369
Fix the warmup issue of prefill batch_size (#268)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-01-23 17:26:17 +01:00
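The warmup fixed above is driven by the prefill batch-size buckets the server pre-compiles. A hedged launch sketch, assuming the `PREFILL_BATCH_BUCKET_SIZE` and `BATCH_BUCKET_SIZE` environment variables documented in the tgi-gaudi README; the model id and all values are illustrative only:

```
# Illustrative launch: bucket sizes control which prefill/decode batch shapes get warmed up.
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PREFILL_BATCH_BUCKET_SIZE=4 \
  -e BATCH_BUCKET_SIZE=32 \
  -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/tgi-gaudi:2.3.1 \
  --model-id meta-llama/Llama-2-7b-hf
```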
Yuan Wu
63c64bb307
Use the default value in globals.py (#265)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-01-21 10:10:23 +01:00
Karol Damaszke
8de110ae9f
Fix warmup with SKIP_TOKENIZER_IN_TGI=true (#266) 2025-01-21 10:09:49 +01:00
Yuan Wu
7d106477d6
Fix router input validation for SKIP_TOKENIZER_IN_TGI=true (#267)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-01-21 10:08:53 +01:00
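The two fixes above (#266, #267) concern running with router-side tokenization disabled. A minimal sketch of such a launch, assuming the `SKIP_TOKENIZER_IN_TGI` variable named in the commit titles; the model id and other flags are illustrative:

```
# Illustrative launch with router-side tokenization disabled.
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -e HABANA_VISIBLE_DEVICES=all \
  -e SKIP_TOKENIZER_IN_TGI=true \
  -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/tgi-gaudi:2.3.1 \
  --model-id meta-llama/Llama-2-7b-hf
```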
Yuan Wu
6d6acca5eb
Update the README for 2.3.1 (#260)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-01-03 10:55:14 +01:00
Yuan Wu
46b556805b
Upgrade to SynapseAI 1.19 (#259)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-26 17:33:24 +01:00
regisss
5291f652a1
Merge pull request #225 from yuanwu2017/2.3.0 2024-12-19 11:42:59 -06:00
yuanwu
8e2e5d8e15 Fix benchmark build error
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-17 05:38:10 +00:00
yuanwu
eaeef6e7a4 Remove the useless modifications
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-17 02:08:12 +00:00
yuanwu
15de6c9195 Merge branch 'habana-main' into 2.3.0 2024-12-17 02:06:22 +00:00
Sun Choi
61309b2832
Remove the default max_tokens for /v1/chat/completions (#251) 2024-12-16 09:32:57 +01:00
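Since the commit above removes the implicit default, callers of the OpenAI-compatible endpoint may want to set `max_tokens` explicitly. A hedged example request; the port, prompt, and value are illustrative:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is Gaudi?"}],
        "max_tokens": 128
      }'
```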
Sun Choi
cc2ca4ac22
HF_TOKEN replaces HUGGING_FACE_HUB_TOKEN as it is deprecated (#253) 2024-12-15 09:59:58 +01:00
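As the commit above notes, `HUGGING_FACE_HUB_TOKEN` is deprecated in favour of `HF_TOKEN`. A sketch of passing the token at launch; the token value and model id are placeholders:

```
# Prefer HF_TOKEN over the deprecated HUGGING_FACE_HUB_TOKEN when pulling gated models.
export HF_TOKEN=hf_xxx   # placeholder token
docker run --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
  -e HF_TOKEN=$HF_TOKEN \
  -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/tgi-gaudi:2.3.1 \
  --model-id meta-llama/Llama-2-7b-hf
```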
yuanwu
c3b8899f10 Revert "Use optimum-habana v1.15-release branch"
This reverts commit c6f023a06b.
2024-12-11 08:17:17 +00:00
yuanwu
c922ef9534 Fix the warmup issue of llama2-7B.
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-09 07:20:48 +00:00
yuanwu
c6f023a06b Use optimum-habana v1.15-release branch
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 13:02:31 +00:00
yuanwu
1b659788b5 Add --no-deps to pip install
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 12:14:38 +00:00
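`--no-deps` tells pip to install only the named package and skip dependency resolution, which avoids pulling a conflicting dependency build onto the image. A hedged illustration; the package name is an example, not the exact line changed by the commit:

```
# Install only the named package; leave preinstalled dependencies (e.g. the Gaudi torch build) untouched.
pip install --no-deps optimum-habana
```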
yuanwu
73e6e3b871 Remove the error log
Subsequent updates will remove this code

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-08 11:55:13 +00:00
yuanwu
9f356ce045 Refine the warmup process
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-07 09:56:16 +00:00
yuanwu
253a992447 Remove the CI workflows we don't currently support
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-02 08:45:36 +00:00
yuanwu
0228bd0260 Don't run the prefill warmup when limit_hpu_graph=true
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 21:29:41 +00:00
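The variable named in the commit title can be set at launch; as of the commit above, enabling it also skips the prefill warmup. A hedged sketch, with an illustrative model id:

```
# With LIMIT_HPU_GRAPH=true, the prefill warmup is skipped as of the commit above.
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -e HABANA_VISIBLE_DEVICES=all \
  -e LIMIT_HPU_GRAPH=true \
  -p 8080:80 \
  ghcr.io/huggingface/tgi-gaudi:2.3.1 \
  --model-id bigcode/starcoder
```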
yuanwu
4586325a34 Fix the StarCoder warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 06:14:00 +00:00
Yuan Wu
b83419a769
Merge branch 'habana-main' into 2.3.0 2024-11-28 12:38:36 +08:00
yuanwu
636cdb4c43 Fix StarCoder issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-26 08:55:42 +00:00
srajabos
d49ce00f40
With this change, bucketing/padding of input is applied to the health check. (#245) 2024-11-18 22:38:30 +01:00
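After the change above, the health check goes through the same input bucketing/padding path as regular requests. A quick way to exercise it once the server is up; the port is illustrative:

```
# /health now uses the same input bucketing/padding as regular requests.
curl -i http://localhost:8080/health
```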
yuanwu2017
56c3eb4adb
Remove the torch package in requirements.txt (#246)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-07 09:22:24 -08:00
yuanwu2017
c345c734a7
Merge branch 'habana-main' into 2.3.0 2024-11-01 11:24:40 +08:00
yuanwu
fcf2e3a338 Fix the prefill warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-01 05:08:52 +02:00
Thanaji Rao Thakkalapelli
6ba3d1d6e5
updated release docker image version in readme to 2.0.6 (#242) 2024-10-31 15:44:16 -07:00
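For reference, the README change above points at the 2.0.6 image tag. A hedged pull/run example; the model id is illustrative:

```
docker pull ghcr.io/huggingface/tgi-gaudi:2.0.6
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -e HABANA_VISIBLE_DEVICES=all \
  -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/tgi-gaudi:2.0.6 \
  --model-id meta-llama/Llama-2-7b-hf
```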
yuanwu2017
8d84ffabf2
Upgrade to SynapseAI 1.18 (#227)
Signed-off-by: yuanwu <yuan.wu@intel.com>
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
2024-10-31 20:14:44 +01:00
Thanaji Rao Thakkalapelli
7fb4af9a87
updated supported models list table in readme (#241)
* updated supported models list table in readme

* updated read me

* updated read me
2024-10-29 23:28:45 -07:00
yuanwu
4c9856f9e5 Add missing package
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-28 07:04:56 +00:00
yuanwu2017
c23584f626
Merge branch 'habana-main' into 2.3.0 2024-10-28 04:37:07 +08:00
yuanwu
372e071135 Fix the issues of tgi-gaudi for v2.3.1
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-27 20:40:36 +00:00
Nicolas Patry
7e282b4153 V2.3.1 2024-10-27 04:14:35 +00:00
Nicolas Patry
34e98b14ef New release 2.3.1 (#2604)
* New release 2.3.1

* Update doc number
2024-10-27 04:14:35 +00:00
drbh
902f526d69 Unroll notify error into generate response (#2597)
* feat: unroll notify_error if no tool is chosen

* fix: expect simple message when no tool is selected

* fix: improve test to avoid notify_error

* fix: improve docs and indicate change in expected response

* fix: adjust linting in test file
2024-10-27 04:03:57 +00:00
drbh
7664d2e2b3 CI (2592): Allow LoRA adapter revision in server launcher (#2602)
allow revision for lora adapters from launcher

Co-authored-by: Sida <sida@kulamind.com>
Co-authored-by: teamclouday <teamclouday@gmail.com>
2024-10-27 04:03:57 +00:00
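The launcher takes LoRA adapters via the `LORA_ADAPTERS` setting, and the commit above lets a revision be pinned per adapter. The `id@revision` syntax below is an assumption based on the PR title, and the adapter and model ids are only examples:

```
# Assumed id@revision syntax (from the PR title); adapter and model ids are examples.
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
  -e LORA_ADAPTERS=predibase/customer_support@main \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id mistralai/Mistral-7B-v0.1
```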
Nicolas Patry
967e67111d Max token capacity metric (#2595)
* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
2024-10-27 04:03:57 +00:00
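The new capacity metric lands on the Prometheus endpoint the server already exposes. A hedged way to look for it; the exact metric name is not stated here, so grep broadly:

```
# Scrape the Prometheus endpoint and look for the capacity metric added above.
curl -s http://localhost:8080/metrics | grep -i capacity
```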
Nicolas Patry
51506aa57a Mllama flash version (#2585)
* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems
to be instability afterwards, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0
2024-10-27 04:03:57 +00:00
Daniël de Kok
fa964f82d3 nix: experimental support for building a Docker container (#2470)
* nix: experimental support for building a Docker image

Run using something like:

```
docker run \
  --device nvidia.com/gpu=all \
  -it --rm -p 8080:80 \
  -v $PWD/data:/data \
  -v $PWD/tmp:/tmp \
  tgi-docker:latest \
  --model-id <model_id>
```

* Example of building the Docker image using Nix inside Docker

* Stream to make the builder image smaller

This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.

* Don't spam journalctl on Linux

* Other dockerfile.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 09:12:03 +00:00
Daniël de Kok
775e5f4c64 MoE Marlin: support desc_act for groupsize != -1 (#2590)
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
2024-10-25 09:12:03 +00:00
Daniël de Kok
692f8ddb69 Move flake back to tgi-nix main (#2586) 2024-10-25 09:12:03 +00:00
drbh
bdc47394d2 feat: support phi3.5 moe (#2479)
* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 09:12:03 +00:00
Daniël de Kok
288bcb0027 Add support for GPTQ-quantized MoE models using MoE Marlin (#2557)
This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
2024-10-25 09:07:52 +00:00
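With the restrictions listed above in mind, a GPTQ-quantized MoE checkpoint can be served by passing `--quantize gptq`. A hedged sketch; the model id is illustrative and must respect those restrictions:

```
# Model id is illustrative; it must satisfy the constraints listed in the commit.
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantize gptq
```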
Mohit Sharma
ff905aeff3 Update ROCM libs and improvements (#2579)
* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-10-25 09:01:04 +00:00
Ikram Ul Haq
6808b2de7e Update architecture.md (#2577) 2024-10-25 09:01:04 +00:00
Daniël de Kok
55fd2816ea Remove compute capability lazy cell (#2580)
Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
2024-10-25 09:01:04 +00:00
Daniël de Kok
f82a3f5816 flashinfer: pass window size and dtype (#2574) 2024-10-25 09:01:04 +00:00
Daniël de Kok
653193a942 Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
2024-10-25 09:01:04 +00:00
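The fallback above only applies to pre-Ampere GPUs (compute capability below 8). A quick way to check what a given machine reports, assuming a driver recent enough to support the `compute_cap` query field:

```
# Prints e.g. "Tesla T4, 7.5" (flash-attn v1 + paged attention path) or 8.0+ for Ampere and newer.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```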
Alvaro Bartolome
bc28f86903 Fix build with --features google (#2566)
* Fix `cargo build --features google`

* Add `cargo test --features google`
2024-10-25 09:01:04 +00:00
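The two commands from the commit, runnable from the repository root, assuming the `google` feature is defined in the workspace as the PR title implies:

```
cargo build --features google
cargo test --features google
```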