Wang, Yi A
a84da5b698
optimize code
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:56:15 -07:00
Wang, Yi A
705cc0b619
multi-modality warmup
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:09:16 -07:00
Wang, Yi A
9d85ac9485
LLM warmup logic
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 23:07:14 -07:00
Wang, Yi A
c55a8caea2
remove torch.where to fix incorrect output in hpu graph model
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 22:51:54 -07:00
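As an aside on the change above: `torch.where` with data-dependent operands can misbehave once an HPU graph is captured and replayed, so such selections are typically rewritten as arithmetic masking. A minimal sketch of that rewrite, with illustrative function names (not the code touched by this commit):

```python
import torch

def select_with_where(scores, mask, penalized):
    # Data-dependent torch.where: the pattern that proved fragile
    # under HPU graph capture/replay in this branch.
    return torch.where(mask, penalized, scores)

def select_with_masking(scores, mask, penalized):
    # Equivalent arithmetic masking: same result, no torch.where.
    mask = mask.to(scores.dtype)
    return penalized * mask + scores * (1.0 - mask)
```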
Wang, Yi A
f0e5faec1a
fix some issues
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 07:01:06 -07:00
Wang, Yi A
376e0507b7
missing gptj change...
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 01:08:40 -07:00
Wang, Yi A
7914e980e2
Merge branch 'main' into gaudi_backend_pa
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:03:49 -07:00
Wang, Yi A
1508ee8de1
remove block_tables and prefill_cache_indices, which would lead to dynamic shapes
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-27 23:57:59 -07:00
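For context on the shape concern: on HPU, inputs whose size depends on request content (per-sequence block lists, cache index lists) trigger recompilation, so they are either dropped, as here, or padded to a fixed width. A hedged sketch of the padding approach, with hypothetical names:

```python
import torch

def pad_block_tables(block_tables, max_blocks_per_seq, pad_id=0):
    """Pad per-sequence block lists into one fixed-width int tensor so
    the compiled HPU graph always sees the same input shape."""
    padded = torch.full(
        (len(block_tables), max_blocks_per_seq), pad_id, dtype=torch.int32
    )
    for row, blocks in enumerate(block_tables):
        padded[row, : len(blocks)] = torch.tensor(blocks, dtype=torch.int32)
    return padded

# e.g. pad_block_tables([[3, 7], [1]], max_blocks_per_seq=4)
# -> tensor([[3, 7, 0, 0],
#            [1, 0, 0, 0]], dtype=torch.int32)
```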
Wang, Yi A
7900be5ac3
warmup decode
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04
add warmup_decode
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Wang, Yi A
fd70ad703e
warmup prefill
...
remove models where pageattn is not used; set block table to None since it's not used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
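The warmup commits above follow the usual HPU pattern: run a dummy forward pass for every (batch size, sequence length) bucket so each static shape is compiled before serving. A simplified sketch assuming an HF-style callable model and illustrative bucket grids (not the actual TGI warmup code):

```python
import itertools
import torch

# Illustrative bucket grids; a real deployment would derive these from
# its max batch size and max input length.
PREFILL_BATCH_BUCKETS = [1, 2, 4, 8]
PREFILL_SEQ_BUCKETS = [128, 256, 512, 1024]

def warmup_prefill(model, device="hpu"):
    """Run one dummy prefill per (batch, seq_len) bucket so each static
    shape is compiled before real requests arrive."""
    for bs, seq in itertools.product(PREFILL_BATCH_BUCKETS, PREFILL_SEQ_BUCKETS):
        input_ids = torch.zeros((bs, seq), dtype=torch.long, device=device)
        position_ids = torch.arange(seq, device=device).expand(bs, -1)
        with torch.no_grad():
            model(input_ids=input_ids, position_ids=position_ids)
```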
Yuan Wu
f5f14dc660
Gaudi: Fix llava-next and mllama crash issue ( #3127 )
...
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-25 15:08:15 +01:00
Wang, Yi A
69773767c5
enable fp8
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-25 05:06:55 -07:00
Wang, Yi A
8d221b7b79
fix gptq issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 20:58:50 -07:00
Wang, Yi A
9914ffe1f1
remove unused quantization code and enable awq/gptq int4
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 19:37:20 -07:00
Wang, Yi A
fdf0733f56
fix incorrect output in qwen2 idefics if hpu graph is used
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-21 01:01:37 -07:00
Wang, Yi A
36b6612f97
adjust warmup and enable vlm
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-20 23:12:52 -07:00
Wang, Yi A
f95aa42660
multi-modality initial PR
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 23:30:12 -07:00
Wang, Yi A
d5b78ba16f
Merge branch 'main' into gaudi_backend_pa
2025-03-19 18:15:08 -07:00
Wang, Yi A
2074d0516b
enable dbrx, remove some unused code
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 03:16:41 -07:00
Wang, Yi A
2cde30de24
gpt_bigcode could also use pageattn
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:59:31 -07:00
Wang, Yi A
073f793976
fix phimoe issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:11:01 -07:00
Baptiste Colle
8c2c348f3c
Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork ( #3117 )
...
feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289
2025-03-18 09:45:52 +01:00
Wang, Yi A
5cd1c93cad
add moe support, fix qwen/mistral/mixtral crash
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 00:45:15 -07:00
Wang, Yi A
6bbe24d974
use tensor cache in hpu graph to avoid replay issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:36:49 -07:00
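On the tensor-cache commit: a captured HPU graph replays against the tensors recorded at capture time, so fresh inputs must be copied in place into persistent buffers rather than rebound to new tensors. A minimal illustration, with hypothetical class and method names:

```python
import torch

class GraphInputCache:
    """Keep one persistent tensor per input name and copy new data into
    it in place, so a captured graph replays against the same storage."""

    def __init__(self):
        self._buffers = {}

    def update(self, name, value):
        cached = self._buffers.get(name)
        if cached is None or cached.shape != value.shape or cached.dtype != value.dtype:
            self._buffers[name] = value.clone()
        else:
            cached.copy_(value)  # in-place copy keeps the storage the graph captured
        return self._buffers[name]
```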
Wang, Yi A
a07e7437b6
enable all the models, not tested yet
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:26:32 -07:00
Wang, Yi A
5d3653943c
adjust block table in hpu to improve performance
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-16 20:28:01 -07:00
Wang, Yi A
b7fea6fc2f
fix TP in pageattn
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 18:01:58 -07:00
Wang, Yi A
201dc6294f
clean cuda/rocm code in hpu backend, enable flat_hpu
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 01:25:31 -07:00
Baptiste Colle
27ed848676
Release of Gaudi Backend for TGI ( #3091 )
...
* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
David Corvoysier
f01dc9e743
Update neuron backend ( #3098 )
...
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Adrien Gallouët
094975c3a8
Update the llamacpp backend ( #3022 )
...
* Build faster
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make --model-gguf optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable mmap, offload_kqv & flash_attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Better error message
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update installed packages
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Save gguf in models/MODEL_ID/model.gguf
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix build with Mach-O
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Quantize without llama-quantize
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp and switch to ggml-org
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove make-gguf.sh
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Support HF_HUB_USER_AGENT_ORIGIN
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
Nicolas Patry
8e92942a18
Making `tool_calls` a vector. ( #3075 )
...
* Making `tool_calls` a vector.
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
2025-03-05 22:32:31 +01:00
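With `tool_calls` now a vector, OpenAI-compatible clients should iterate over the list even when a single call is returned. A hedged usage sketch against TGI's OpenAI-compatible endpoint (the local URL and tool definition are placeholders):

```python
from openai import OpenAI

# Placeholder endpoint; TGI serves an OpenAI-compatible API under /v1.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

# tool_calls is a list now, so iterate even if only one call comes back.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```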
Hugo Larcher
d8ff7f2623
feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. ( #3061 )
...
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.
* fix: Rust version for Neuron
* fix: PR comments, use rust-toolchain.toml
2025-03-04 16:43:50 +01:00
Daniël de Kok
e88f6f6ee9
Add property-based testing for RadixAllocator ( #3068 )
2025-03-04 15:09:46 +01:00
Daniël de Kok
fa4e9511f8
Fix two edge cases in RadixTrie::find ( #3067 )
...
- Always return a node, not its parent.
- Do not recurse when a node does not represent a full prefix of the
input.
2025-03-04 13:23:27 +01:00
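The two rules in the commit body can be illustrated with a toy radix-trie lookup; the real implementation lives in the Rust router, so this Python sketch only mirrors the described behavior:

```python
class Node:
    def __init__(self, key=()):
        self.key = tuple(key)   # token span this node represents
        self.children = {}      # first token of a child's key -> child node

def find(node, tokens):
    """Return the deepest node whose key chain fully prefixes `tokens`
    (the node itself, never its parent), and stop descending as soon as
    a child only partially matches the remaining input."""
    tokens = tuple(tokens)
    if not tokens:
        return node
    child = node.children.get(tokens[0])
    if child is None or tokens[: len(child.key)] != child.key:
        # Child absent or not a full prefix of the input: do not recurse.
        return node
    return find(child, tokens[len(child.key):])
```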
Baptiste Colle
683ff53fa3
Add Gaudi Backend ( #3055 )
...
* wip(gaudi): import server and dockerfile from tgi-gaudi fork
* feat(gaudi): new gaudi backend working
* fix: fix style
* fix prehooks issues
* fix(gaudi): refactor server and implement requested changes
2025-02-28 12:14:58 +01:00
drbh
b0069e0485
fix: run linters and fix formatting ( #3057 )
2025-02-25 16:11:34 -05:00
David Corvoysier
c00add9c03
Add Neuron backend ( #3033 )
...
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the Rust components seems to have a low
ulimit for open files, which leads to errors during compilation.
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-24 09:10:05 +01:00
Cyril Vallez
a7448661f7
Improve Transformers support ( #2970 )
...
* Much better support
* add gpt neox
* bump transformers version
* bump version
2025-02-18 19:04:34 +01:00
Adrien Gallouët
cfd4fbb479
[Backend] Add Llamacpp backend ( #2975 )
...
* Add llamacpp backend
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Get rid of llama_batch_get_one()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use max_batch_total_tokens
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle max_batch_size
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add some input validation checks
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle ctx args & fix sampling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add GPU args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --defrag-threshold
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add a stupid batch mechanism
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --numa
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable flash attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --offload-kqv
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batch_pos
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA Dockerfile_llamacpp for now
* Only export the latest logits
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Output real logprobs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batching
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix seq iterations
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Auto-detect n_threads when not provided
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Clear request cache after completion
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove warmup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA architectures build argument for Dockerfile
* Add specific args for batch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --type-v & --type-k
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llamacpp to b4623
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Disable graceful shutdown in debug mode
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Dockerfile_llamacpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Simplify batching logic
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Rename bindings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove n_ctx
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make max_batch_total_tokens optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Ensure all samplers are freed on error
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Initialize penalty_last_n with llamacpp default value
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve default settings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks clippy
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks cargo fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Do not use HOSTNAME env
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp & cuda
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix requirements.txt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable KQV offload by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove Ngrok tunneling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove .cargo/config.toml
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add missing cuda prefix
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle custom llama.cpp dir
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add README.md
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add HF transfer
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix bool args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2025-02-14 13:40:57 +01:00
Funtowicz Morgan
856709d5c3
[Backend] Bump TRTLLM to v.0.17.0 ( #2991 )
...
* backend(trtllm): bump TRTLLM to v.0.17.0
* backend(trtllm): forget to bump dockerfile
* backend(trtllm): use arg instead of env
* backend(trtllm): use correct library reference decoder_attention_src
* backend(trtllm): link against decoder_attention_{0|1}
* backend(trtllm): build against gcc-14 with cuda12.8
* backend(trtllm): use return value optimization flag as error if available
* backend(trtllm): make sure we escalate all warnings as errors on the backend impl in debug mode
* backend(trtllm): link against CUDA 12.8
2025-02-06 16:45:03 +01:00
Hugo Larcher
73b7cf83f6
Add backend name to telemetry ( #2962 )
...
* feat: Add backend name to telemetry
2025-01-28 16:53:16 +01:00
Funtowicz Morgan
40b00275b2
Attempt to remove AWS S3 flaky cache for sccache ( #2953 )
...
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
* backend(trtllm): export env variable in run mb?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
2025-01-27 11:21:48 +01:00
Funtowicz Morgan
0a89902663
[TRTLLM] Expose finish reason ( #2841 )
...
* feat(trtllm): expose finish reason to Rust
* misc(llamacpp): fix typo
* misc(backend): update deps
2025-01-23 16:48:26 +01:00
Funtowicz Morgan
cc212154e0
Bump TensorRT-LLM backend dependency to v0.16.0 ( #2931 )
...
* backend(trtllm): update to 0.16.0
* backend(trtllm): do not use shallow clone
* backend(trtllm): use tag instead
* backend(trtllm): move to nvidia remote instead of hf
* backend(trtllm): reenable shallow clone
* backend(trtllm): attempt to use ADD instead of RUN for openmpi
* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile
* backend(trtllm): add correctly untar it
2025-01-23 13:54:40 +01:00
Alvaro Bartolome
64a33c1f05
Run `pre-commit run --all-files` to fix CI ( #2933 )
2025-01-21 17:33:33 +01:00
Funtowicz Morgan
17367438f3
Give TensorRT-LLM a proper CI/CD 😍 ( #2886 )
...
* test(ctest) enable address sanitizer
* feat(trtllm): expose finish reason to Rust
* feat(trtllm): fix logits retrieval
* misc(ci): enable building tensorrt-llm
* misc(ci): update Rust action toolchain
* misc(ci): let's try to build the Dockerfile for trtllm
# Conflicts:
# Dockerfile_trtllm
* misc(ci): provide mechanism to cache inside container
* misc(ci): export aws creds as output of step
* misc(ci): let's try this way
* misc(ci): again
* misc(ci): again
* misc(ci): add debug profile
* misc(ci): add debug profile
* misc(ci): lets actually use sccache ...
* misc(ci): do not build with ssl enabled
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(backend): test with TGI S3 conf
* misc(backend): test with TGI S3 conf
* misc(backend): once more?
* misc(backend): let's try with GHA
* misc(backend): missing env directive
* misc(backend): make sure to correctly set IS_GHA_BUILD=true in wf
* misc(backend): ok let's debug smtg
* misc(backend): WWWWWWWWWWWWWAAAAAAAA
* misc(backend): kthxbye retry s3
* misc(backend): use session token
* misc(backend): add more info
* misc(backend): lets try 1h30
* misc(backend): lets try 1h30
* misc(backend): increase to 2h
* misc(backend): lets try...
* misc(backend): lets try...
* misc(backend): let's build for ci-runtime
* misc(backend): let's add some more tooling
* misc(backend): add some tags
* misc(backend): disable Werror for now
* misc(backend): added automatic gha detection
* misc(backend): remove leak sanitizer which is included in asan
* misc(backend): forward env
* misc(backend): forward env
* misc(backend): let's try
* misc(backend): let's try
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): fix sscache -> sccache
* misc(backend): fix sscache -> sccache
* misc(backend): fix sscache -> sccache
* misc(backend): let's actually cache things now
* misc(backend): let's actually cache things now
* misc(backend): attempt to run the testS?
* misc(backend): attempt to run the tests?
* misc(backend): attempt to run the tests?
* change runner size
* fix: Correctly tag docker images (#2878 )
* fix: Correctly tag docker images
* fix: Correctly tag docker images
* misc(llamacpp): maybe?
* misc(llamacpp): maybe?
* misc(llamacpp): maybe?
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): go
* misc(ci): go
* misc(ci): go
* misc(ci): use bin folder
* misc(ci): make the wf callable for reuse
* misc(ci): make the wf callable for reuse (bis)
* misc(ci): make the wf callable for reuse (bis)
* misc(ci): give the wf a name
* Create test-trtllm.yml
* Update test-trtllm.yml
* Create build-trtllm2
* Rename build-trtllm2 to 1-build-trtllm2
* Rename test-trtllm.yml to 1-test-trtllm2.yml
* misc(ci): fw secrets
* Update 1-test-trtllm2.yml
* Rename 1-build-trtllm2 to 1-build-trtllm2.yml
* Update 1-test-trtllm2.yml
* misc(ci): use ci-build.yaml as main dispatcher
* Delete .github/workflows/1-test-trtllm2.yml
* Delete .github/workflows/1-build-trtllm2.yml
* misc(ci): rights?
* misc(ci): rights?
* misc(ci): once more?
* misc(ci): once more?
* misc(ci): baby more time?
* misc(ci): baby more time?
* misc(ci): try the permission above again?
* misc(ci): try the permission above again?
* misc(ci): try the permission scoped again?
* misc(ci): install tensorrt_llm_executor_static
* misc(ci): attempt to rebuild with sccache?
* misc(ci):run the tests on GPU instance
* misc(ci): let's actually setup sccache in the build.rs
* misc(ci): reintroduce variables
* misc(ci): enforce sccache
* misc(ci): correct right job name dependency
* misc(ci): detect dev profile for debug
* misc(ci): detect gha build
* misc(ci): detect gha build
* misc(ci): ok debug
* misc(ci): wtf
* misc(ci): wtf2
* misc(ci): wtf3
* misc(ci): use commit HEAD instead of merge commit for image id
* misc(ci): wtfinfini
* misc(ci): wtfinfini
* misc(ci): KAMEHAMEHA
* Merge TRTLLM in standard CI
* misc(ci): remove input machine
* misc(ci): missing id-token for AWS auth
* misc(ci): missing id-token for AWS auth
* misc(ci): missing id-token for AWS auth
* misc(ci): again...
* misc(ci): again...
* misc(ci): again...
* misc(ci): again...
* misc(ci): missing benchmark
* misc(ci): missing backends
* misc(ci): missing launcher
* misc(ci): give everything aws needs
* misc(ci): give everything aws needs
* misc(ci): fix warnings
* misc(ci): attempt to fix sccache not building trtllm
* misc(ci): attempt to fix sccache not building trtllm again
---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>
2025-01-21 10:19:16 +01:00
drbh
8f6146f11a
Revert "feat: improve qwen2-vl startup " ( #2924 )
...
Revert "feat: improve qwen2-vl startup (#2802 )"
This reverts commit eecca27113
.
2025-01-17 12:09:05 -05:00
drbh
eecca27113
feat: improve qwen2-vl startup ( #2802 )
...
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
2025-01-17 11:50:41 -05:00