Commit Graph

417 Commits

Author SHA1 Message Date
Yuan Wu
aba419a0cc
Fix crash issue of llava-next fp8 (#286)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-07 10:31:58 +01:00
Yuan Wu
cd57fea11b
Fix Llava next crash issue (#285)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-06 10:12:21 +01:00
Yuan Wu
20ea73c6d4
Fix mistralai/Mistral-7B-Instruct failure (#284)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-05 17:01:23 +01:00
Yuan Wu
c35810d6f0
Fix the loading issue of 90B (#283)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-02-28 11:20:55 +01:00
Yuan Wu
1d3a4ab851
Enable mllama (#272)
Signed-off-by: Yuan Wu <yuan.wu@intel.com>
2025-02-27 16:12:15 +01:00
kaixuanliu
b52164d38a
Complete padding of CausalLMBatch when batch bucketing is used (#261)
Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>
2025-01-30 10:19:13 +01:00
Yuan Wu
fe7594e369
Fix the warmup issue of prefill batch_size (#268)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-01-23 17:26:17 +01:00
Yuan Wu
63c64bb307
Use the default value in globals.py (#265)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-01-21 10:10:23 +01:00
Karol Damaszke
8de110ae9f
Fix warmup with SKIP_TOKENIZER_IN_TGI=true (#266) 2025-01-21 10:09:49 +01:00
Yuan Wu
46b556805b
Upgrade to SynapseAI 1.19 (#259)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-26 17:33:24 +01:00
yuanwu
c922ef9534 Fix the warmup issue of llama2-7B.
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-09 07:20:48 +00:00
yuanwu
9f356ce045 Refine the warmup process
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-07 09:56:16 +00:00
yuanwu
0228bd0260 Don't run the prefill warmup when limit_hpu_graph=true
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 21:29:41 +00:00
yuanwu
4586325a34 Fix the StarCoder warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-12-01 06:14:00 +00:00
Yuan Wu
b83419a769
Merge branch 'habana-main' into 2.3.0 2024-11-28 12:38:36 +08:00
yuanwu
636cdb4c43 Fix StarCoder issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-26 08:55:42 +00:00
srajabos
d49ce00f40
With this change, bucketing/padding of input is applied to health check. (#245) 2024-11-18 22:38:30 +01:00
yuanwu
fcf2e3a338 Fix the prefill warmup issue
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-11-01 05:08:52 +02:00
yuanwu2017
c23584f626
Merge branch 'habana-main' into 2.3.0 2024-10-28 04:37:07 +08:00
yuanwu
372e071135 Fix the issues of tgi-gaudi for v2.3.1
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-27 20:40:36 +00:00
Nicolas Patry
51506aa57a Mllama flash version (#2585)
* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems to be instability afterwards, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0
2024-10-27 04:03:57 +00:00
drbh
bdc47394d2 feat: support phi3.5 moe (#2479)
* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 09:12:03 +00:00
Mohit Sharma
ff905aeff3 Update ROCm libs and improvements (#2579)
* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-10-25 09:01:04 +00:00
Daniël de Kok
f82a3f5816 flashinfer: pass window size and dtype (#2574) 2024-10-25 09:01:04 +00:00
Daniël de Kok
653193a942 Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
2024-10-25 09:01:04 +00:00
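The fallback described in this commit can be sketched roughly as follows; a minimal sketch, assuming a helper that picks the attention path from the CUDA compute capability (the function name and dictionary keys are illustrative, not TGI's actual code):

```python
import torch

def pick_attention_backend() -> dict:
    """Sketch of the capability-based fallback described above (names are illustrative)."""
    major, _minor = torch.cuda.get_device_capability()
    if major >= 8:
        # Ampere or newer: flashinfer works, prefix caching stays on.
        return {"backend": "flashinfer", "prefix_caching": True}
    # Older GPUs: flash-attn v1 for prefill + paged attention for decode.
    # Prefix caching is disabled, and since flash-attn v1 has no block-table
    # support, the raw key/value tensors are passed instead of the paged cache.
    return {"backend": "flash-attn-v1+paged", "prefix_caching": False}
```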
Alvaro Bartolome
6976cf8c4c Add LoRA adapters support for Gemma2 (#2567)
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
2024-10-25 09:01:04 +00:00
Daniël de Kok
d4f995e718 Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 (#2537)
This replaces the custom layers in both models.
2024-10-25 09:01:04 +00:00
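For orientation, a dense MoE layer of this kind runs every expert on every token and mixes the outputs with the router's softmax weights; a minimal sketch (illustrative only, not the module wired into Mixtral/Deepseek V2 here):

```python
import torch
import torch.nn as nn

class DenseMoELayer(nn.Module):
    """Minimal dense-MoE sketch (illustrative, not the TGI module).

    Every expert processes every token; outputs are mixed with the router's
    softmax weights instead of dispatching tokens to a top-k subset.
    """

    def __init__(self, hidden_size: int, n_experts: int, intermediate_size: int):
        super().__init__()
        self.router = nn.Linear(hidden_size, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.router(x).softmax(dim=-1)                   # [tokens, n_experts]
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # [tokens, hidden, n_experts]
        return torch.einsum("te,the->th", weights, outs)
```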
yuanwu
b590310255 Add missing import package
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-25 08:52:24 +00:00
yuanwu
8ebe77b3be Simplify the warmup
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-25 08:38:59 +00:00
Thanaji Rao Thakkalapelli
b126bf4785
Revert PR 235 as flash attention is not really enabled for gemma (#239) 2024-10-23 10:58:57 +02:00
yuanwu2017
8686a0fc6d
Merge branch 'habana-main' into 2.3.0 2024-10-23 16:32:12 +08:00
yuanwu
67ee45a270 Pass the max_batch_total_tokens to causal_lm
Refine the warmup

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-10-23 08:28:26 +00:00
Thanaji Rao Thakkalapelli
c5e3881051
Enables Flash Attention in TGI for gemma models (#235) 2024-10-18 09:20:42 -07:00
Thanaji Rao Thakkalapelli
46b14e6b28
Remove all references to habana_quantization_toolkit for 1.18 (#229) 2024-10-18 10:59:59 +02:00
Thanaji Rao Thakkalapelli
87a1cee32c
Fix sysntax error in PR 232 2024-10-15 13:23:48 -07:00
Thanaji Rao Thakkalapelli
e06320f64e
Enabling Flash Attention support for falcon model (#232) 2024-10-15 19:50:17 +02:00
Sun Choi
0578bd917d
Fix gpt_bigcode/starcoderbase-3b accuracy issue (#228)
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
2024-10-14 10:01:55 +02:00
yuanwu
bab529c916 Make Gaudi adapt to TGI 2.3.0
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-09-26 06:04:55 +00:00
Wang, Yi
3519398a14 hotfix: ipex fails since cuda moe kernel is not supported (#2532)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:19:20 +00:00
Daniël de Kok
29a93b78ba Move to moe-kernels package and switch to common MoE layer (#2511)
* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
2024-09-25 06:18:05 +00:00
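By contrast with a dense layer, the `SparseMoELayer` introduced here dispatches each token only to its top-k experts; a toy sketch of that routing, assuming the real layer delegates the heavy lifting to fused `moe-kernels` (the Python loop is for clarity only):

```python
import torch

def sparse_moe_forward(x, router_logits, experts, top_k: int = 2):
    """Toy top-k routing sketch; the real SparseMoELayer calls fused moe-kernels."""
    probs = router_logits.softmax(dim=-1)
    topk_p, topk_idx = probs.topk(top_k, dim=-1)         # [tokens, top_k]
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize over selected experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e
            if mask.any():
                out[mask] += topk_p[mask, slot, None] * expert(x[mask])
    return out
```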
Wang, Yi
cbfe9e5185 hotfix: enable intel ipex cpu and xpu in python3.11 (#2517)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:15:35 +00:00
Nicolas Patry
c6b568b892 Fix tokenization yi (#2507)
* Fixing odd tokenization self-modifications on the Rust side (load and resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-25 06:15:35 +00:00
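The "load and resave in Python" step presumably amounts to round-tripping the tokenizer through the Python library so the serialized files are rewritten in canonical form before the Rust side loads them; a minimal sketch under that assumption (repo id and output path are placeholders):

```python
from transformers import AutoTokenizer

# Round-trip the tokenizer: loading and re-saving rewrites tokenizer.json and
# the special-token config in canonical form (placeholder repo id and path).
tokenizer = AutoTokenizer.from_pretrained("some-org/yi-model")
tokenizer.save_pretrained("./normalized-tokenizer")
```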
Nicolas Patry
510d1c76c8 Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-25 06:14:07 +00:00
Wang, Yi
938a7f3c3a hotfix: fix regression of attention api change in intel platform (#2439)
Fix the regression caused by the attention API change; ipex.varlen_attention does not currently support paged-cache format KV input.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:13:36 +00:00
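Since the varlen attention path cannot consume the paged-cache KV layout, the natural workaround is to gather the paged cache into contiguous K/V tensors before the varlen call; a rough sketch under that assumption (cache layout and shapes are illustrative, not the actual TGI code):

```python
import torch

def gather_kv_from_paged_cache(kv_cache, block_tables, seq_lens, block_size):
    """Flatten a paged KV cache into contiguous K/V tensors (illustrative shapes).

    kv_cache:     [2, num_blocks, block_size, num_heads, head_dim]
    block_tables: per-sequence lists of block indices
    Returns K and V of shape [total_tokens, num_heads, head_dim], which a
    varlen attention kernel can consume directly.
    """
    keys, values = [], []
    for table, seq_len in zip(block_tables, seq_lens):
        n_blocks = (seq_len + block_size - 1) // block_size
        blocks = torch.as_tensor(table[:n_blocks])
        k = kv_cache[0, blocks].reshape(-1, *kv_cache.shape[-2:])[:seq_len]
        v = kv_cache[1, blocks].reshape(-1, *kv_cache.shape[-2:])[:seq_len]
        keys.append(k)
        values.append(v)
    return torch.cat(keys), torch.cat(values)
```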
drbh
34a6399a50 feat: support lora revisions and qkv_proj weights (#2482)
* feat: support lora revisions and qkv_proj weights

* fix: add qkv_proj weights to weight test
2024-09-25 06:13:11 +00:00
Nicolas Patry
a313355d2b Tied embeddings in MLP speculator. (#2473)
* Tied embeddings in MLP speculator.

* Fixing the scale_weight when users decide not to use as much speculation as defined in the config.

* Adding scaling support + optimize some ops.
2024-09-25 06:13:11 +00:00
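"Tied embeddings" here means the speculator's output head reuses the input embedding matrix rather than keeping its own copy; a generic weight-tying sketch (not the actual MLP speculator code):

```python
import torch.nn as nn

class TiedHead(nn.Module):
    """Generic weight tying: the LM head shares the embedding matrix."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # one parameter tensor, two uses

    def logits(self, hidden_states):
        # Project hidden states back to vocabulary space with the shared matrix.
        return self.lm_head(hidden_states)
```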
Nicolas Patry
4e1ca8d7bd Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and running the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for less errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat input (since it's super important with the prefixing now).

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to the head_size modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle cases where the common prefix is smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-09-25 06:13:11 +00:00
drbh
7aebb953e2 Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
* Fix: don't apply post layernorm in SiglipVisionTransformer

This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

This also makes Siglip consistent with the existing Clip implementation:

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

---------

Co-authored-by: Travis Addair <tgaddair@gmail.com>
2024-09-25 06:10:59 +00:00
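The fix described above boils down to returning the encoder hidden states before the final layer norm so LLaVA Next gets the features it expects; a simplified sketch, not the exact TGI module:

```python
import torch.nn as nn

class SiglipVisionTransformerSketch(nn.Module):
    """Simplified sketch: return encoder outputs *before* post_layernorm."""

    def __init__(self, embeddings: nn.Module, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.embeddings = embeddings
        self.encoder = encoder
        self.post_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, pixel_values):
        hidden_states = self.embeddings(pixel_values)
        hidden_states = self.encoder(hidden_states)
        # The bug was applying post_layernorm here; LLaVA Next expects the
        # pre-layernorm encoder outputs, so it is left to callers that need it.
        return hidden_states
```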
Nicolas Patry
635dde8af9 Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:10:59 +00:00
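Conceptually, prefix caching keys cached KV blocks by the token prefix that produced them, so a new request can skip prefill for its longest previously-seen prefix; a toy lookup sketch (the real router uses a radix tree rather than scanning every length):

```python
def longest_cached_prefix(input_ids, cache):
    """Toy prefix-cache lookup: `cache` maps token-id tuples to cached KV handles.

    Returns how many prompt tokens can be skipped at prefill time and the
    matching cache entry (or None).
    """
    for length in range(len(input_ids), 0, -1):
        entry = cache.get(tuple(input_ids[:length]))
        if entry is not None:
            return length, entry
    return 0, None
```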
Nicolas Patry
df6ea89da9 Fixing exl2 and other quantize tests again. (#2419)
* Fixing exl2 and other quantize tests again.

* Mark exl2 as non-release (so CI tests them; needs to be removed later).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-09-25 06:08:38 +00:00