Nicolas Patry
078084286a
Fix qwen2.
2025-03-18 10:36:54 +01:00
drbh
febc488e0e
fix: bump org name in gemma3 test
2025-03-17 15:57:07 +00:00
drbh
2c2fc6544d
fix: add pillow dependency and bump lock+requirements
2025-03-14 18:17:57 +00:00
Nicolas Patry
e5dfd41ed4
Upgrading from_env to get token from file when necessary + fix pali_gemma.
2025-03-14 17:06:36 +01:00
drbh
659ce4f3fc
feat: add tests for image types and remove alpha from png
2025-03-14 15:33:06 +00:00
drbh
e5ec176bf4
fix: bump snapshots and improve exceed window test case
2025-03-14 15:04:38 +00:00
Mohit Sharma
170a12f331
Update window size rocm flash decoding
2025-03-14 07:50:11 +00:00
Mohit Sharma
b30cdabf68
Add window_size_left param ipex rocm
2025-03-14 07:47:45 +00:00
Mohit Sharma
eaf18c1ccb
(typo) collection link
2025-03-14 07:36:38 +00:00
Mohit Sharma
69e0a87dd5
(fix) flashinfer
2025-03-13 21:32:38 +00:00
Mohit Sharma
ff82f0f84c
(fix) sliding window attention
2025-03-13 19:43:00 +00:00
Daniël de Kok
f91434e99b
Make the Nix-based Docker container work on non-NixOS ( #3109 )
...
On NixOS, the CUDA driver shim gets mounted on /run/opengl-driver,
where Nix packages expect the shim to be. However, on other
distributions, some FHS paths are mounted. This is a small change
to make the dynamic loader find the shim.
2025-03-13 14:02:45 +01:00
Nicolas Patry
8b91f92978
Fixing the docker build. ( #3108 )
...
* Fixing the docker build.
* Apply suggestions from code review
2025-03-13 11:26:44 +01:00
Baptiste Colle
27ed848676
Release of Gaudi Backend for TGI ( #3091 )
...
* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
Nicolas Patry
83ef364177
We need gcc during runtime to enable triton to compile kernels. ( #3103 )
...
* We need gcc during runtime to enable triton to compile kernels.
* Fixing the docker build.
2025-03-13 10:45:47 +01:00
Daniël de Kok
83b7b7bb92
Router: add gemma3-text model type ( #3107 )
2025-03-13 10:41:33 +01:00
Daniël de Kok
c73ae0bd88
Update to `kernels` 0.2.1 ( #3084 )
...
* Update to `kernels` 0.2.1
The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.
* Download kernels in `install-cuda` target
2025-03-13 10:36:29 +01:00
Nicolas Patry
d4c6faa67b
Try to fix on main CI color. ( #3101 )
2025-03-12 10:12:24 +01:00
Nicolas Patry
4ac06ddf56
Preparing release 3.2.0 ( #3100 )
...
* Preparing release 3.2.0
* Forgot the README.
* Update doc.
2025-03-12 10:11:33 +01:00
David Corvoysier
f01dc9e743
Update neuron backend ( #3098 )
...
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Nicolas Patry
5c5528e362
Fix tool call4 ( #3094 )
...
* Removing the no_tool content information.
* Removing a lot of NO_TOOL shenanigans.
* Update the tests.
2025-03-12 09:28:47 +01:00
Mohit Sharma
ed46c2c414
Add gemma3 model ( #3099 )
2025-03-12 09:25:51 +01:00
Nicolas Patry
f74c36fe0d
Fix tool call3 ( #3086 )
...
* Fixing the tool calling convention.
* Update the doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the new updated tool_call_id.
* More qwen2.
2025-03-12 09:22:53 +01:00
celsowm
ae4451c3da
Update README.md ( #3095 )
...
space between param and value
2025-03-11 11:05:21 +01:00
Nicolas Patry
b447f7e821
Fix qwen vl ( #3096 )
...
* Fixing qwen2.5 VL.
* Fixing the CI.
2025-03-11 11:00:41 +01:00
Adrien Gallouët
094975c3a8
Update the llamacpp backend ( #3022 )
...
* Build faster
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make --model-gguf optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable mmap, offload_kqv & flash_attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Better error message
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update installed packages
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Save gguf in models/MODEL_ID/model.gguf
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix build with Mach-O
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Quantize without llama-quantize
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp and switch to ggml-org
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove make-gguf.sh
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Support HF_HUB_USER_AGENT_ORIGIN
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
drbh
dc5f05f8e6
Pr 3003 ci branch ( #3007 )
...
* change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API"
Moving after tool_calls2
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
add in Buffering..
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
fix: handle usage outside of stream state and add tests
Simplifying everything quite a bit.
Remove the unused model_dump.
Clippy.
Clippy ?
Ruff.
Upgrade the flake for latest transformers.
Upgrade after rebase.
Remove potential footgun.
Fix completion test.
* Clippy.
* Tweak for multi prompt.
* Ruff.
* Update the snapshot a bit.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-10 17:56:19 +01:00
Daniël de Kok
124398fa57
hotfix: qwen2 formatting ( #3093 )
...
* hotfix: qwen2 formatting
* cargo fmt
2025-03-10 16:19:50 +01:00
Daniël de Kok
c5ecc7a4de
Small test and typing fixes ( #3078 )
...
* test_weights: add modules_to_not_convert
* More typing fixes
2025-03-10 15:08:23 +01:00
jiqing-feng
cae0cbe87d
Add modules_to_not_convert in quantized model ( #3053 )
...
* fix modules_to_not_convert
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix tp quant skip
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* revert unquantized changes
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* use DefaultWeightsLoader in skip modules
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-03-10 15:03:51 +01:00
EachSheep
bbe218a4f7
Add qwen2 multi lora layers support ( #3089 )
...
Add qwen2 multi-LoRA layer support to solve problems like https://github.com/huggingface/text-generation-inference/issues/2881. A similar PR is at https://github.com/huggingface/text-generation-inference/pull/2883
Co-authored-by: hjs <hjs@pku.edu.cn>
2025-03-10 12:42:59 +01:00
Alex Weston
58a65f7914
Add request parameters to OTel span for /v1/chat/completions endpoint ( #3000 )
...
Record request parameters in OTel span for /v1/chat/completions endpoint
2025-03-10 12:26:57 +01:00
Daniël de Kok
976eae216f
Nix: the launcher needs a Python env with Torch for GPU detection ( #3085 )
...
This makes `nix run .` in the repository work again. Should fix #3025 .
2025-03-10 12:11:10 +01:00
Nicolas Patry
622908deab
Fix tool call2 ( #3076 )
...
* Making `tool_calls` a vector.
* Arguments output is a string.
* Update all the integration tests.
* Add the requirements.
* Upgrade other tests.
* Clippy.
* Update the old test.
2025-03-07 19:45:57 +01:00
Alvaro Bartolome
55a6618434
Update `--max-batch-total-tokens` description ( #3083 )
...
* Update `--max-batch-total-tokens` description
* Update docstring in `launcher/src/main.rs` instead
2025-03-07 14:24:26 +01:00
Daniël de Kok
036d802b62
Nix: add openai to impure shell for integration tests ( #3081 )
2025-03-07 13:04:21 +01:00
Nicolas Patry
8e92942a18
Making `tool_calls` a vector. ( #3075 )
...
* Making `tool_calls` a vector.
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
2025-03-05 22:32:31 +01:00
Nicolas Patry
3208d1cd1d
Revert "Trying to reduce the logs in the case of errors."
...
This reverts commit cdf70d6a28.
2025-03-05 20:52:38 +01:00
Nicolas Patry
cdf70d6a28
Trying to reduce the logs in the case of errors.
2025-03-05 20:50:43 +01:00
Nicolas Patry
ab9dafc68f
Making sure Olmo (transformers backend) works. ( #3074 )
2025-03-05 17:46:47 +01:00
Nicolas Patry
31766dad77
Force upgrade transformers version for olmo.
2025-03-05 12:17:09 +01:00
Nicolas Patry
ec35976f82
Only add token when it is defined. ( #3073 )
...
* Only add token when it is defined.
* Update router/src/server.rs
2025-03-05 11:59:52 +01:00
David Corvoysier
cb42b3ad83
fix(neuron): explicitly install toolchain ( #3072 )
...
* fix(neuron): explicitly install toolchain
* ci(neuron): trigger CI when Dockerfile is modified
2025-03-05 11:46:58 +01:00
Nicolas Patry
491ed9e11d
Patch rust release. ( #3069 )
...
* Patch rust release.
* Trying to remove the rust-toolchain hardcoded in action.
* Upgrade rust toolchain.
* Put back the toolchain ?
* Fix neuron dockerfile.
* Move to the proper version of Rust.
* 1.85 since the GH action doesn't respect the override.
* Typo.
* Fixing the github action.
* Fixing docker llamacpp.
* Fixing the github action.
* Update clippy.
2025-03-04 18:07:33 +01:00
Sadra Barikbin
144d99c147
Fix a tiny typo in monitoring.md tutorial ( #3056 )
...
Update monitoring.md
2025-03-04 17:06:26 +01:00
Nicolas Patry
08bbfa16a1
Preparing for release. ( #3060 )
...
* Preparing for release.
* Upgrade doc.
* Fix docs auto-generated.
* Fix update doc along.
2025-03-04 16:47:10 +01:00
Hugo Larcher
d8ff7f2623
feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. ( #3061 )
...
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.
* fix: Rust version for Neuron
* fix: PR comments, use rust-toolchain.toml
2025-03-04 16:43:50 +01:00
Daniël de Kok
e88f6f6ee9
Add property-based testing for RadixAllocator ( #3068 )
2025-03-04 15:09:46 +01:00
Daniël de Kok
fa4e9511f8
Fix two edge cases in RadixTrie::find ( #3067 )
...
- Always return a node, not its parent.
- Do not recurse when a node does not represent a full prefix of the
input.
2025-03-04 13:23:27 +01:00
Nicolas Patry
a914a21899
Revert "Patch rust release."
...
This reverts commit aad9c2b0bd.
2025-03-04 12:16:18 +00:00