Wang, Yi A
7900be5ac3
warmup decode
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04
add warmup_decode
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Wang, Yi A
fd70ad703e
warmup prefill
...
remove models where pageattn is not used; set block table to None since it's not used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
Wang, Yi A
69773767c5
enable fp8
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-25 05:06:55 -07:00
Wang, Yi A
8d221b7b79
fix gptq issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 20:58:50 -07:00
Wang, Yi A
9914ffe1f1
remove unused quantization code and enable awq/gptq int4
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 19:37:20 -07:00
Wang, Yi A
fdf0733f56
fix incorrect output in qwen2 idefics if hpu graph is used
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-21 01:01:37 -07:00
Wang, Yi A
36b6612f97
adjust warmup and enable vlm
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-20 23:12:52 -07:00
Wang, Yi A
f95aa42660
multi-modality initial PR
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 23:30:12 -07:00
Wang, Yi A
d5b78ba16f
Merge branch 'main' into gaudi_backend_pa
2025-03-19 18:15:08 -07:00
Wang, Yi A
2074d0516b
enable dbrx, remove some unused code
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 03:16:41 -07:00
Wang, Yi A
2cde30de24
gpt_bigcode can also use pageattn
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:59:31 -07:00
Wang, Yi A
073f793976
fix phimoe issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:11:01 -07:00
Nicolas Patry
e497bc09f6
Minor fixes. ( #3125 )
2025-03-18 15:42:35 +01:00
Nicolas Patry
67ce543e04
Intel docker. ( #3121 )
...
* Intel docker.
* torchaudio ?
* Fixing dockerfile ?
2025-03-18 15:12:11 +01:00
Nicolas Patry
83fe45c15e
Prepare for patch release. ( #3124 )
2025-03-18 15:11:55 +01:00
Nicolas Patry
11f2eec10e
Publish nix docker image. ( #3122 )
...
* Publish nix docker image.
* Run during PR.
* Something else.
* Forgot to push.
* Build zstd.
* Pushing with skopeo
* Testing the PR.
* Running from nix.
* Cleaner tags.
2025-03-18 12:58:21 +01:00
Mohit Sharma
a35fbdb925
Bug Fix: Sliding Window Attention ( #3112 )
...
* (fix) sliding window attention (see the sketch after this entry)
* (fix) flashinfer
* (typo) collection link
* Add window_size_left param ipex rocm
* Update window size rocm flash decoding
* fix: bump snapshots and improve exceed window test case
* feat: add tests for image types and remove alpha from png
* Upgrading `from_env` to get token from file when necessary + fix pali_gemma.
* fix: add pillow dependency and bump lock+requirements
* fix: bump org name in gemma3 test
* Fix qwen2.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-18 10:37:33 +01:00
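The fix above concerns sliding-window attention. As a general illustration of the technique (not the repository's implementation; the helper below is hypothetical), a sliding-window mask limits each query to itself plus at most `window_size_left` preceding tokens:

```python
# Minimal sketch of a causal sliding-window attention mask.
# `window_size_left` is the number of past tokens a query may attend to;
# the helper name is hypothetical, not taken from the TGI codebase.
import torch

def sliding_window_mask(seq_len: int, window_size_left: int) -> torch.Tensor:
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (row)
    causal = k <= q                          # never attend to the future
    in_window = (q - k) <= window_size_left  # stay within the window
    return causal & in_window                # True = position is visible

# With a window of 2, token 5 sees only tokens 3, 4, and itself.
print(sliding_window_mask(seq_len=6, window_size_left=2).int())
```

An "exceed window" test case, like the one mentioned above, would feed a sequence longer than the window and check that tokens beyond it are masked out.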
Baptiste Colle
8c2c348f3c
Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork ( #3117 )
...
feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289
2025-03-18 09:45:52 +01:00
Wang, Yi A
5cd1c93cad
add moe support, fix qwen/mistral/mixtral crash
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 00:45:15 -07:00
Daniël de Kok
095775e05c
launcher: correctly get the head dimension for VLMs ( #3116 )
...
* launcher: correctly get the head dimension for VLMs
For most (?) VLMs, the head dimension is in the `text_config`
configuration section. However, since we only queried the top-level
`head_dim` (which typically doesn't exist in VLMs), we would never use
flashinfer. This change adds a method that gets the head dimension from
the top-level `Config` struct or `text_config` when that fails (see the sketch after this entry).
* fix: bump org name in gemma3 test
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
2025-03-17 18:19:37 +01:00
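The launcher itself is Rust, but the fallback described above is easy to sketch in Python against a parsed `config.json` (the helper name is hypothetical):

```python
# Sketch of the head-dimension lookup described above: prefer the
# top-level `head_dim`, then fall back to the `text_config` section
# that VLM-style configs use. Hypothetical helper, not the launcher code.
from typing import Optional

def get_head_dim(config: dict) -> Optional[int]:
    if "head_dim" in config:
        return config["head_dim"]
    return (config.get("text_config") or {}).get("head_dim")

# A VLM-style config where head_dim only exists under text_config:
vlm_config = {"model_type": "gemma3", "text_config": {"head_dim": 256}}
assert get_head_dim(vlm_config) == 256
```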
Wang, Yi
0b3e3db043
xpu 2.6 update ( #3051 )
...
* xpu 2.6 update
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* install whl
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update get xpu memory api
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* int
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix awq crash if modules_to_not_convert is None
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 13:48:48 +01:00
Wang, Yi A
6bbe24d974
use tensor cache in hpu graph to avoid replay issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:36:49 -07:00
Wang, Yi A
a07e7437b6
enable all the models, not tested yet
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:26:32 -07:00
Wang, Yi A
5d3653943c
adjust block table in hpu to improve performance
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-16 20:28:01 -07:00
Wang, Yi A
b7fea6fc2f
fix TP in pageattn
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 18:01:58 -07:00
Wang, Yi A
201dc6294f
clean cuda/rocm code in hpu backend, enable flat_hpu
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 01:25:31 -07:00
Daniël de Kok
f91434e99b
Make the Nix-based Docker container work on non-NixOS ( #3109 )
...
On NixOS, the CUDA driver shim gets mounted on /run/opengl-driver,
where Nix packages expect the shim to be. However, on other
distributions, some FHS paths are mounted. This is a small change
to make the dynamic loader find the shim.
2025-03-13 14:02:45 +01:00
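A hedged sketch of the idea in Python (only /run/opengl-driver comes from the entry above; the FHS paths and the logic are illustrative): probe the NixOS shim location first, then common FHS locations, and extend the loader search path with whatever exists.

```python
# Illustrative only: let the dynamic loader find the CUDA driver shim.
# /run/opengl-driver is where NixOS mounts it (per the entry above);
# the FHS paths below are typical guesses for other distributions.
import os

CANDIDATE_DIRS = [
    "/run/opengl-driver/lib",     # NixOS driver shim
    "/usr/lib/x86_64-linux-gnu",  # Debian/Ubuntu-style FHS path
    "/usr/lib64",                 # Fedora/RHEL-style FHS path
]

present = [d for d in CANDIDATE_DIRS if os.path.isdir(d)]
if present:
    prior = os.environ.get("LD_LIBRARY_PATH")
    parts = present + ([prior] if prior else [])
    os.environ["LD_LIBRARY_PATH"] = ":".join(parts)  # for child processes
```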
Nicolas Patry
8b91f92978
Fixing the docker build. ( #3108 )
...
* Fixing the docker build.
* Apply suggestions from code review
2025-03-13 11:26:44 +01:00
Baptiste Colle
27ed848676
Release of Gaudi Backend for TGI ( #3091 )
...
* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
Nicolas Patry
83ef364177
We need gcc during runtime to enable triton to compile kernels. ( #3103 )
...
* We need gcc during runtime to enable triton to compile kernels.
* Fixing the docker build.
2025-03-13 10:45:47 +01:00
Daniël de Kok
83b7b7bb92
Router: add gemma3-text model type ( #3107 )
2025-03-13 10:41:33 +01:00
Daniël de Kok
c73ae0bd88
Update to kernels 0.2.1 ( #3084 )
...
* Update to `kernels` 0.2.1
The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.
* Download kernels in `install-cuda` target
2025-03-13 10:36:29 +01:00
Nicolas Patry
d4c6faa67b
Try to fix the CI color on main. ( #3101 )
2025-03-12 10:12:24 +01:00
Nicolas Patry
4ac06ddf56
Preparing release 3.2.0 ( #3100 )
...
* Preparing release 3.2.0
* Forgot the README.
* Update doc.
2025-03-12 10:11:33 +01:00
David Corvoysier
f01dc9e743
Update neuron backend ( #3098 )
...
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Nicolas Patry
5c5528e362
Fix tool call4 ( #3094 )
...
* Removing the no_tool content information.
* Removing a lot of NO_TOOL shenanigans.
* Update the tests.
2025-03-12 09:28:47 +01:00
Mohit Sharma
ed46c2c414
Add gemma3 model ( #3099 )
2025-03-12 09:25:51 +01:00
Nicolas Patry
f74c36fe0d
Fix tool call3 ( #3086 )
...
* Fixing the tool calling convention.
* Update the doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the updated tool_call_id.
* More qwen2.
2025-03-12 09:22:53 +01:00
celsowm
ae4451c3da
Update README.md ( #3095 )
...
space between param and value
2025-03-11 11:05:21 +01:00
Nicolas Patry
b447f7e821
Fix qwen vl ( #3096 )
...
* Fixing qwen2.5 VL.
* Fixing the CI.
2025-03-11 11:00:41 +01:00
Adrien Gallouët
094975c3a8
Update the llamacpp backend ( #3022 )
...
* Build faster
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make --model-gguf optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable mmap, offload_kqv & flash_attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Better error message
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update installed packages
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Save gguf in models/MODEL_ID/model.gguf
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix build with Mach-O
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Quantize without llama-quantize
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp and switch to ggml-org
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove make-gguf.sh
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Support HF_HUB_USER_AGENT_ORIGIN
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
drbh
dc5f05f8e6
Pr 3003 ci branch ( #3007 )
...
* change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API" (see the sketch after this entry)
Moving after tool_calls2
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Add in buffering.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
fix: handle usage outside of stream state and add tests
Simplifying everything quite a bit.
Remove the unused model_dump.
Clippy.
Clippy ?
Ruff.
Upgrade the flake for the latest transformers.
Upgrade after rebase.
Remove potential footgun.
Fix completion test.
* Clippy.
* Tweak for multi prompt.
* Ruff.
* Update the snapshot a bit.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-10 17:56:19 +01:00
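For context on the API shape that PR aligns with: in OpenAI-style streaming, token usage arrives in a final chunk whose `choices` list is empty, outside the per-token stream state. A hand-written sketch of a consumer (the chunk payloads are illustrative):

```python
# Illustrative consumer of OpenAI-style ChatCompletionChunk events.
# The final chunk carries `usage` and an empty `choices` list; earlier
# chunks carry content deltas. Payloads here are hand-written examples.
chunks = [
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "Hel"}}], "usage": None},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "lo"}}], "usage": None},
    {"object": "chat.completion.chunk",
     "choices": [],  # usage-only chunk, outside the normal stream state
     "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7}},
]

text, usage = "", None
for chunk in chunks:
    for choice in chunk["choices"]:
        text += choice["delta"].get("content", "")
    if chunk["usage"] is not None:
        usage = chunk["usage"]

print(text)   # "Hello"
print(usage)  # token accounting from the final chunk
```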
Daniël de Kok
124398fa57
hotfix: qwen2 formatting ( #3093 )
...
* hotfix: qwen2 formatting
* cargo fmt
2025-03-10 16:19:50 +01:00
Daniël de Kok
c5ecc7a4de
Small test and typing fixes ( #3078 )
...
* test_weights: add modules_to_not_convert
* More typing fixes
2025-03-10 15:08:23 +01:00
jiqing-feng
cae0cbe87d
Add modules_to_not_convert in quantized model ( #3053 )
...
* fix modules_to_not_convert
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix tp quant skip
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* revert unquantized changes
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* use DefaultWeightsLoader in skip modules
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-03-10 15:03:51 +01:00
EachSheep
bbe218a4f7
Add qwen2 multi lora layers support ( #3089 )
...
add qwen2 multi lora layers support to solve problems like https://github.com/huggingface/text-generation-inference/issues/2881 ; a similar PR is at https://github.com/huggingface/text-generation-inference/pull/2883
Co-authored-by: hjs <hjs@pku.edu.cn>
2025-03-10 12:42:59 +01:00
Alex Weston
58a65f7914
Add request parameters to OTel span for /v1/chat/completions endpoint ( #3000 )
...
Record request parameters in OTel span for /v1/chat/completions endpoint
2025-03-10 12:26:57 +01:00
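As a general illustration of recording request parameters on a span (the attribute names below are hypothetical, not necessarily the ones this PR records), using the standard `opentelemetry` Python API:

```python
# Illustrative sketch: attach selected request parameters to an OTel span.
from opentelemetry import trace

tracer = trace.get_tracer("tgi.router.example")  # hypothetical tracer name

def handle_chat_completion(request: dict) -> None:
    with tracer.start_as_current_span("/v1/chat/completions") as span:
        # Record request parameters for observability.
        span.set_attribute("request.model", request.get("model", ""))
        span.set_attribute("request.temperature", request.get("temperature", 1.0))
        span.set_attribute("request.stream", request.get("stream", False))
        # ... actual request handling would go here ...

handle_chat_completion({"model": "gemma3", "temperature": 0.7, "stream": True})
```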
Daniël de Kok
976eae216f
Nix: the launcher needs a Python env with Torch for GPU detection ( #3085 )
...
This makes `nix run .` in the repository work again. Should fix #3025 .
2025-03-10 12:11:10 +01:00
Nicolas Patry
622908deab
Fix tool call2 ( #3076 )
...
* Making `tool_calls` a vector.
* Arguments output is a string (see the sketch after this entry).
* Update all the integration tests.
* Add the requirements.
* Upgrade other tests.
* Clippy.
* Update the old test.
2025-03-07 19:45:57 +01:00
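The first two bullets describe the OpenAI-compatible shape: `tool_calls` is a list (vector), and each function call's `arguments` field is a JSON-encoded string rather than a nested object. A hand-written example fragment:

```python
import json

# Hand-written example of the response shape after this change:
# `tool_calls` is a list, and `arguments` is a JSON string to be parsed.
message = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                # a string, not a nested object
                "arguments": "{\"city\": \"Paris\", \"unit\": \"celsius\"}",
            },
        }
    ],
}

for call in message["tool_calls"]:
    args = json.loads(call["function"]["arguments"])  # decode the string
    print(call["function"]["name"], args["city"])     # get_weather Paris
```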