Wang, Yi A
a83e9fe003
work with the latest vllm extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:56:58 -07:00
Wang, Yi A
4de8fb0127
Merge branch 'gaudi_backend_pa' into warmup_gaudi_backend
2025-04-10 19:42:22 -07:00
Wang, Yi A
4cdc34ec4d
match the latest vllm_extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:32:32 -07:00
Wang, Yi A
610dd200e5
Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:20:28 -07:00
Wang, Yi A
cd900c3b72
pingpong optimization
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:16:05 -07:00
Baptiste Colle
37104acd75
Gaudi: Add Integration Test for Gaudi Backend (#3142)
* feat(gaudi): add integration test
* feat(test): add more models to integration tests
* remove debug comments
* fix typos
2025-04-07 16:55:03 +02:00
Wang, Yi A
29703dbd27
fix warmup issue for mllama
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-04 20:25:01 -07:00
Yuan Wu
3d059f91ab
Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE (#3131)
* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE
Signed-off-by: yuanwu <yuan.wu@intel.com>
* Remove debug modifications
Signed-off-by: yuanwu <yuan.wu@intel.com>
---------
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-04-03 10:34:53 +02:00
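Note: the idea behind this change is easy to sketch. With a fixed BATCH_BUCKET_SIZE step, warmup has to cover every multiple of the step; with exponential growth, only about log2(max) shapes exist. A minimal, hypothetical Python sketch (names are illustrative, not TGI's actual code):

```python
# Hypothetical sketch: bucket batch sizes by powers of two instead of a
# fixed BATCH_BUCKET_SIZE step, so far fewer shapes need warming up.
def exponential_buckets(max_batch_size: int, base: int = 1) -> list[int]:
    """Return bucket sizes base, 2*base, 4*base, ... capped at max_batch_size."""
    buckets, size = [], base
    while size < max_batch_size:
        buckets.append(size)
        size *= 2
    buckets.append(max_batch_size)
    return buckets

def pick_bucket(batch_size: int, buckets: list[int]) -> int:
    """Pad a runtime batch up to the smallest bucket that fits it."""
    return next(b for b in buckets if b >= batch_size)

print(exponential_buckets(64))                  # [1, 2, 4, 8, 16, 32, 64]
print(pick_bucket(5, exponential_buckets(64)))  # 8
```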
Wang, Yi A
8591687561
refine logging and fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-03 00:11:22 -07:00
Wang, Yi A
a84da5b698
optimize code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:56:15 -07:00
Wang, Yi A
705cc0b619
multi-modality warmup
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:09:16 -07:00
Wang, Yi A
9d85ac9485
LLM warmup logic
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 23:07:14 -07:00
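The warmup commits above all follow one pattern on HPU: run one dummy forward pass per static (batch, sequence) bucket at startup so graph compilation never happens on a live request. A hedged sketch, with an invented model signature:

```python
import itertools
import torch

def warmup(model, batch_buckets, seq_buckets, device="cpu"):  # "hpu" on Gaudi
    """Run one dummy forward per (batch, seq) bucket so every static shape
    is compiled during startup rather than on the first real request."""
    for bs, seq in itertools.product(batch_buckets, seq_buckets):
        input_ids = torch.zeros((bs, seq), dtype=torch.long, device=device)
        position_ids = torch.arange(seq, device=device).expand(bs, -1)
        with torch.no_grad():
            model(input_ids=input_ids, position_ids=position_ids)
```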
Wang, Yi A
c55a8caea2
remove torch.where to fix incorrect output in hpu graph model
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 22:51:54 -07:00
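A plausible reading of this fix: under HPU graph capture, a torch.where select produced stale output, and the usual workaround is to express the same select as pure arithmetic. This is an illustrative sketch of that substitution, not the actual TGI change:

```python
import torch

def select_where(mask, a, b):
    return torch.where(mask, a, b)  # original-style elementwise select

def select_arith(mask, a, b):
    m = mask.to(a.dtype)            # workaround: same select, no torch.where
    return a * m + b * (1 - m)

mask = torch.tensor([True, False, True])
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([9.0, 8.0, 7.0])
assert torch.equal(select_where(mask, a, b), select_arith(mask, a, b))
```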
Wang, Yi A
f0e5faec1a
fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 07:01:06 -07:00
Wang, Yi A
376e0507b7
missing gptj change...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 01:08:40 -07:00
Wang, Yi A
7914e980e2
Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:03:49 -07:00
Wang, Yi A
1508ee8de1
remove block_tables and prefill_cache_indices, which would lead to dynamic shapes
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-27 23:57:59 -07:00
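Dropping per-step metadata is half of keeping shapes static; the other half is padding whatever remains to a bucketed length. A hypothetical helper showing the idea:

```python
import torch
import torch.nn.functional as F

def pad_to_bucket(t: torch.Tensor, bucket_len: int, pad_value: int = 0):
    """Pad the last dimension up to bucket_len so kernels always see the
    same static shape instead of recompiling for every request length."""
    pad = bucket_len - t.shape[-1]
    if pad < 0:
        raise ValueError("tensor longer than bucket")
    return F.pad(t, (0, pad), value=pad_value)

ids = torch.tensor([[101, 2054, 2003]])
print(pad_to_bucket(ids, 8).shape)  # torch.Size([1, 8])
```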
Wang, Yi A
7900be5ac3
warmup decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04
add warmup_decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Wang, Yi A
fd70ad703e
warmup prefill
remove models where pageattn is not used; set block table to None since it is unused
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
Yuan Wu
f5f14dc660
Gaudi: Fix llava-next and mllama crash issue (#3127)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-25 15:08:15 +01:00
Wang, Yi A
69773767c5
enable fp8
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-25 05:06:55 -07:00
Wang, Yi A
8d221b7b79
fix gptq issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 20:58:50 -07:00
Wang, Yi A
9914ffe1f1
remove unused quantization code and enable awq/gptq int4
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 19:37:20 -07:00
Wang, Yi A
fdf0733f56
fix incorrect output in qwen2 idefics if hpu graph is used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-21 01:01:37 -07:00
Wang, Yi A
36b6612f97
adjust warmup and enable vlm
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-20 23:12:52 -07:00
Wang, Yi A
f95aa42660
multi-modality initial PR
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 23:30:12 -07:00
Wang, Yi A
d5b78ba16f
Merge branch 'main' into gaudi_backend_pa
2025-03-19 18:15:08 -07:00
Wang, Yi A
2074d0516b
enable dbrx, remove some unused code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 03:16:41 -07:00
Wang, Yi A
2cde30de24
gpt_bigcode can also use pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:59:31 -07:00
Wang, Yi A
073f793976
fix phimoe issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:11:01 -07:00
Baptiste Colle
8c2c348f3c
Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork (#3117)
feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289
2025-03-18 09:45:52 +01:00
Wang, Yi A
5cd1c93cad
add moe support, fix qwen/mistral/mixtral crash
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 00:45:15 -07:00
Wang, Yi A
6bbe24d974
use tensor cache in hpu graph to avoid replay issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:36:49 -07:00
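The tensor-cache commit matches the standard capture/replay discipline: a captured graph reads from fixed buffers, so new data must be copied into the same memory rather than passed in as fresh tensors. An illustrative sketch (class and names invented):

```python
import torch

class GraphInputCache:
    """Keep one static buffer per input; copy new data in-place so a
    captured graph always reads the same memory on replay."""

    def __init__(self):
        self._buffers: dict[str, torch.Tensor] = {}

    def update(self, name: str, value: torch.Tensor) -> torch.Tensor:
        buf = self._buffers.get(name)
        if buf is None or buf.shape != value.shape:
            buf = value.clone()  # first use, or a new bucket shape
            self._buffers[name] = buf
        else:
            buf.copy_(value)     # reuse the memory the captured graph holds
        return buf
```

The graph is captured once against the buffers `update` returns; subsequent calls only perform the in-place copy, which avoids the replay mismatch.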
Wang, Yi A
a07e7437b6
enable all the models, not tested yet
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:26:32 -07:00
Wang, Yi A
5d3653943c
adjust block table in hpu to improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-16 20:28:01 -07:00
Wang, Yi A
b7fea6fc2f
fix TP in pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 18:01:58 -07:00
Wang, Yi A
201dc6294f
clean cuda/rocm code in hpu backend, enable flat_hpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 01:25:31 -07:00
Baptiste Colle
27ed848676
Release of Gaudi Backend for TGI (#3091)
* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
David Corvoysier
f01dc9e743
Update neuron backend (#3098)
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Adrien Gallouët
094975c3a8
Update the llamacpp backend (#3022)
* Build faster
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make --model-gguf optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable mmap, offload_kqv & flash_attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Better error message
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update installed packages
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Save gguf in models/MODEL_ID/model.gguf
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix build with Mach-O
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Quantize without llama-quantize
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp and switch to ggml-org
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove make-gguf.sh
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Support HF_HUB_USER_AGENT_ORIGIN
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
Nicolas Patry
8e92942a18
Making `tool_calls` a vector. (#3075)
* Making `tool_calls` a vector.
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
2025-03-05 22:32:31 +01:00
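For context, this aligns TGI's chat responses with the OpenAI schema, where `tool_calls` is always an array, even for a single call. A made-up payload showing the shape:

```python
# Illustrative payload only: `tool_calls` is a list, per the OpenAI schema,
# even when the model emits a single call.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }
    ],
}
```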
Hugo Larcher
d8ff7f2623
feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. (#3061)
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.
* fix: Rust version for Neuron
* fix: PR comments, use rust-toolchain.toml
2025-03-04 16:43:50 +01:00
Daniël de Kok
e88f6f6ee9
Add property-based testing for RadixAllocator (#3068)
2025-03-04 15:09:46 +01:00
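The upstream test is Rust; the same idea rendered in Python with hypothesis against a toy allocator (everything here is invented for illustration): generate random allocation sequences and assert an invariant that must hold for all of them.

```python
from hypothesis import given, strategies as st

class ToyBlockAllocator:
    """Toy stand-in for a block allocator: hands out distinct block ids."""

    def __init__(self, n_blocks: int):
        self.free_blocks = list(range(n_blocks))

    def allocate(self, n: int):
        if n > len(self.free_blocks):
            return None
        taken, self.free_blocks = self.free_blocks[:n], self.free_blocks[n:]
        return taken

@given(st.lists(st.integers(min_value=1, max_value=8), max_size=20))
def test_no_block_handed_out_twice(sizes):
    alloc = ToyBlockAllocator(n_blocks=32)
    live = []
    for n in sizes:
        blocks = alloc.allocate(n)
        if blocks is not None:
            live.extend(blocks)
    assert len(live) == len(set(live))  # no id owned by two allocations
```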
Daniël de Kok
fa4e9511f8
Fix two edge cases in RadixTrie::find (#3067)
- Always return a node, not its parent.
- Do not recurse when a node does not represent a full prefix of the
input.
2025-03-04 13:23:27 +01:00
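Both fixed behaviors are visible in a compact rendering of find; this Python version is hypothetical (the real implementation is Rust):

```python
class Node:
    def __init__(self, key=()):
        self.key = tuple(key)  # edge label leading to this node
        self.children = {}     # first token of child's key -> child Node

def find(node, tokens):
    """Return the deepest node whose path is a full prefix of `tokens`."""
    if tokens:
        child = node.children.get(tokens[0])
        if child is not None:
            common = 0
            for a, b in zip(child.key, tokens):
                if a != b:
                    break
                common += 1
            # Fix 2: recurse only when the child's whole key matches, i.e.
            # when the child represents a full prefix of the input.
            if common == len(child.key):
                # Fix 1: the recursion bottoms out by returning the matched
                # node itself, never its parent.
                return find(child, tokens[common:])
    return node
```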
Baptiste Colle
683ff53fa3
Add Gaudi Backend (#3055)
* wip(gaudi): import server and dockerfile from tgi-gaudi fork
* feat(gaudi): new gaudi backend working
* fix: fix style
* fix prehooks issues
* fix(gaudi): refactor server and implement requested changes
2025-02-28 12:14:58 +01:00
drbh
b0069e0485
fix: run linters and fix formatting (#3057)
2025-02-25 16:11:34 -05:00
David Corvoysier
c00add9c03
Add Neuron backend (#3033)
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means fewer neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them (a sketch of the combined hash follows this entry).
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-24 09:10:05 +01:00
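The export-identification scheme from the last fix above can be sketched directly: combine a digest of the neuron backend directory's files with a digest of its Dockerfile (paths and helper name are illustrative):

```python
import hashlib
from pathlib import Path

def backend_hash(backend_dir: str, dockerfile: str) -> str:
    """Digest every file under the backend directory plus the Dockerfile,
    so the resulting id changes only when the backend itself changes."""
    h = hashlib.sha256()
    for path in sorted(Path(backend_dir).rglob("*")):
        if path.is_file():
            h.update(str(path).encode())  # include paths so renames count
            h.update(path.read_bytes())
    h.update(Path(dockerfile).read_bytes())
    return h.hexdigest()[:12]
```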
Cyril Vallez
a7448661f7
Improve Transformers support (#2970)
* Much better support
* add gpt neox
* bump transformers version
* bump version
2025-02-18 19:04:34 +01:00
Adrien Gallouët
cfd4fbb479
[Backend] Add Llamacpp backend (#2975)
* Add llamacpp backend
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Get rid of llama_batch_get_one()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use max_batch_total_tokens
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle max_batch_size
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add some input validation checks
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle ctx args & fix sampling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add GPU args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --defrag-threshold
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add a stupid batch mechanism
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --numa
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable flash attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --offload-kqv
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batch_pos
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA Dockerfile_llamacpp for now
* Only export the latest logits
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Output real logprobs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batching
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix seq iterations
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Auto-detect n_threads when not provided
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Clear request cache after completion
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove warmup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA architectures build argument for Dockerfile
* Add specific args for batch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --type-v & --type-k
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llamacpp to b4623
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Disable graceful shutdown in debug mode
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Dockerfile_llamacpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Simplify batching logic
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Rename bindings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove n_ctx
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make max_batch_total_tokens optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Ensure all samplers are freed on error
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Initialize penalty_last_n with llamacpp default value
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve default settings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks clippy
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks cargo fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Do not use HOSTNAME env
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp & cuda
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix requirements.txt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable KQV offload by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove Ngrok tunneling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove .cargo/config.toml
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add missing cuda prefix
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle custom llama.cpp dir
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add README.md
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add HF transfer
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix bool args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2025-02-14 13:40:57 +01:00