Alvaro Bartolome
8a1cfd6122
Add loop_controls feature to minijinja to handle {% break %} ( #2998 )
...
* Add `loop_controls` feature to `minijinja`
* Add `test_chat_template_loop_controls` to test `break`
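For illustration, a minimal sketch of what the feature enables, assuming minijinja 2.x with the Cargo feature switched on (the template and message shape are illustrative, not TGI's actual chat template):

```rust
use minijinja::{context, Environment};

fn main() {
    // Cargo.toml: minijinja = { version = "2", features = ["loop_controls"] }
    // Without the feature, {% break %} is rejected as a template syntax error.
    let mut env = Environment::new();
    env.add_template(
        "chat",
        "{% for m in messages %}{% if m.role == 'assistant' %}{% break %}{% endif %}{{ m.content }}\n{% endfor %}",
    )
    .unwrap();
    let messages = vec![
        context! { role => "user", content => "Hello" },
        context! { role => "assistant", content => "never rendered" },
    ];
    let out = env
        .get_template("chat")
        .unwrap()
        .render(context! { messages => messages })
        .unwrap();
    assert_eq!(out, "Hello\n");
}
```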
2025-02-18 10:33:22 +01:00
celsowm
794ec58b75
Update README.md ( #3024 )
...
The only way to avoid:
error: experimental Nix feature 'nix-command' is disabled; add '--extra-experimental-features nix-command' to enable it
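The fix alluded to here is the usual one: enable the features once in nix.conf instead of passing the flag on every invocation.

```
# ~/.config/nix/nix.conf (or /etc/nix/nix.conf for a system-wide setting)
experimental-features = nix-command flakes
```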
2025-02-18 10:08:28 +01:00
Daniël de Kok
f0ed76583c
Use eetq kernel from the hub ( #3029 )
...
* Use eetq kernel from the hub
* Fixing the CI.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
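A rough sketch of the pattern, assuming the hf_kernels package of that era (since renamed to `kernels`); the Hub repo id and the kernel's entry points below are assumptions, not verified:

```python
from hf_kernels import get_kernel  # later renamed to the `kernels` package

# Fetch a pre-built kernel from the Hugging Face Hub (cached locally)
# instead of compiling EETQ from source inside the TGI image.
eetq = get_kernel("kernels-community/quantization-eetq")  # hypothetical repo id

# The returned module exposes the kernel's ops, e.g. (hypothetical name):
# weight, scale = eetq.quant_weights(w, torch.int8, False)
```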
2025-02-18 10:03:53 +01:00
Adrien Gallouët
cfd4fbb479
[Backend] Add Llamacpp backend ( #2975 )
...
* Add llamacpp backend
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Get rid of llama_batch_get_one()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use max_batch_total_tokens
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle max_batch_size
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add some input validation checks
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle ctx args & fix sampling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add GPU args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --defrag-threshold
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add a stupid batch mechanism
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --numa
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable flash attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --offload-kqv
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batch_pos
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA Dockerfile_llamacpp for now
* Only export the latest logits
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Output real logprobs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batching
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix seq iterations
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Auto-detect n_threads when not provided
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Clear request cache after completion
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove warmup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA architectures build argument for Dockerfile
* Add specific args for batch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --type-v & --type-k
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llamacpp to b4623
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Disable graceful shutdown in debug mode
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Dockerfile_llamacpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Simplify batching logic
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Rename bindings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove n_ctx
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make max_batch_total_tokens optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Ensure all samplers are freed on error
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Initialize penalty_last_n with llamacpp default value
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve default settings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks clippy
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks cargo fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Do not use HOSTNAME env
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp & cuda
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix requirements.txt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable KQV offload by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove Ngrok tunneling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove .cargo/config.toml
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add missing cuda prefix
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle custom llama.cpp dir
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add README.md
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add HF transfer
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix bool args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
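A hedged sketch of building and running the new backend; the build-arg name and run flags are inferred from the bullets above, not verified against the final Dockerfile_llamacpp:

```shell
# Build the CUDA image (a CUDA-architectures build argument was added
# mid-PR; its exact name here is an assumption)
docker build -t tgi-llamacpp -f Dockerfile_llamacpp \
    --build-arg cuda_arch=86 .

# Serve a model with the llama.cpp backend via TGI's usual --model-id flag
docker run --gpus all -p 8080:80 tgi-llamacpp \
    --model-id Qwen/Qwen2.5-0.5B-Instruct
```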
2025-02-14 13:40:57 +01:00
Daniël de Kok
6df0fc0b55
Support sigmoid scoring function in GPTQ-MoE ( #3017 )
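"Scoring function" here is the router activation over expert logits. A minimal PyTorch sketch of the difference (illustrative; not the GPTQ-MoE kernel itself):

```python
import torch

def route(router_logits: torch.Tensor, top_k: int, scoring: str = "softmax"):
    if scoring == "softmax":
        # Classic MoE routers (e.g. Mixtral): normalized over all experts.
        scores = torch.softmax(router_logits, dim=-1)
    elif scoring == "sigmoid":
        # DeepSeek-V3-style routers: independent per-expert scores, so the
        # selected top-k weights must be renormalized afterwards.
        scores = torch.sigmoid(router_logits)
    else:
        raise ValueError(f"unknown scoring function: {scoring}")
    weights, expert_ids = torch.topk(scores, top_k, dim=-1)
    if scoring == "sigmoid":
        weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, expert_ids
```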
2025-02-14 11:33:49 +01:00
Nicolas Patry
d6881c37ab
Putting back the NCCL forced upgrade. ( #2999 )
...
* Putting back the NCCL forced upgrade.
* .
* ...
* Ignoring conda.
* Dropping conda from the build system + torch 2.6
* Cache min.
* Rolling back torch version.
* Reverting the EETQ modification.
* Fix flash attention ?
* Actually stay on flash v1.
* Patching flash v1.
* Torch 2.6, fork of rotary, eetq updated.
* Put back nccl latest (override torch).
* Slightly more reproducible build and not as scary.
2025-02-14 11:31:59 +01:00
Nicolas Patry
8a211dc7fc
Preventing a single user from hugging the server to death by asking ( #3016 )
...
for way too many tokens.
2025-02-13 11:23:17 +01:00
Nicolas Patry
4cccce4b44
Update the flaky mllama test. ( #3015 )
2025-02-12 12:26:52 +01:00
Wang, Yi
76bcb4948d
fix Qwen VL break on Intel platform ( #3002 )
...
* fix Qwen VL break on Intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use the PositionRotaryEmbedding impl so ROCm and IPEX both work
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-12 11:31:34 +01:00
Nicolas Patry
b86c3947ab
Revert "Update the flaky mllama test."
...
This reverts commit 8a870b31b9.
2025-02-11 17:13:06 +01:00
Nicolas Patry
8a870b31b9
Update the flaky mllama test.
2025-02-11 17:10:36 +01:00
Daniël de Kok
571ac9b507
Use kernels from the kernel hub ( #2988 )
...
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
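The `load_kernel` hoisting bullets are about keeping Hub-kernel resolution off the hot path. A sketch of the resulting shape (the import path, signature, and op name are hypothetical):

```python
# Assumption: load_kernel is TGI's wrapper that loads a locally checked-out
# kernel during development or fetches the locked version from the Hub.
from text_generation_server.utils.kernels import load_kernel  # hypothetical path

# Hoisted: resolved once at import time rather than on every forward call.
quantization = load_kernel(
    module="quantization", repo_id="kernels-community/quantization"
)

def apply_marlin(x, weight, scales):
    # The hot path only dispatches into the already-loaded module.
    return quantization.marlin_gemm(x, weight, scales)  # hypothetical op
```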
2025-02-10 19:19:25 +01:00
Nicolas Patry
4b8cda684b
Updating mllama after strftime. ( #2993 )
...
* Updating mllama after strftime.
* Town instead of village.
* Forgot the integration snapshot.
* Attempt to fix intel CPU.
* Intel extension fix.
* Workaround intel.
* Moving those deps directly into pyproject.
* Revert "Moving those deps directly into pyproject."
This reverts commit 98c1496ea6.
* Non system uv.
* Fixing the docker environment hopefully.
* Missed a step.
* Move workdir up a bit.
* Bailing out of reproducible python env.
* Triton version.
2025-02-07 10:38:13 +01:00
Funtowicz Morgan
856709d5c3
[Backend] Bump TRTLLM to v.0.17.0 ( #2991 )
...
* backend(trtllm): bump TRTLLM to v.0.17.0
* backend(trtllm): forget to bump dockerfile
* backend(trtllm): use arg instead of env
* backend(trtllm): use correct library reference decoder_attention_src
* backend(trtllm): link against decoder_attention_{0|1}
* backend(trtllm): build against gcc-14 with cuda12.8
* backend(trtllm): use return value optimization flag as as error if available
* backend(trtllm): make sure we escalate all warnings as errors on the backend impl in debug mode
* backend(trtllm): link against CUDA 12.8
2025-02-06 16:45:03 +01:00
Wang, Yi
36223f834e
Triton fix ( #2995 )
...
pin triton to 3.1.0 to fix the ipex import issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-06 12:28:41 +01:00
Nicolas Patry
0ef8c8a97a
Using the "lockfile". ( #2992 )
...
* Using the "lockfile".
* Revert dummy modifications.
* Lock on python 3.11
* Another attempt.
* ..
* Bad cache hits.
* The good old monkey.
* How in the world...
* We need the launcher still.
* .
* ..
* Attempt #42
* Don't break all other builds.
* Mode max.
* Applying to other builds.
2025-02-06 12:28:24 +01:00
drbh
c1cf36c0dc
Improve qwen vl impl ( #2943 )
...
* feat: refactor model, improve startup and re-enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotary init path
* fix: simplify get position ids and remove unused vision config
* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplify rotary forward
* fix: check existence before accessing rope type in cuda warmup
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
* fix: add more test and improve model generation
* fix: improve and simplify get_cos_sin, refactor and clean up get_position_ids
* fix: adjust signatures with types
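Much of this revolves around Qwen2-VL's multimodal rotary embeddings (M-RoPE), where each token carries three position coordinates. A sketch of the text-only case in the batch-first layout the PR settles on (illustrative helper, not the model's actual code):

```python
import torch

def text_only_position_ids(batch_size: int, seq_len: int) -> torch.Tensor:
    # M-RoPE splits the rotary dimensions across three axes: temporal,
    # height, and width. For text tokens all three coordinates coincide,
    # so the ids are a broadcast arange of shape (batch, 3, seq_len);
    # image patches would get distinct height/width coordinates instead.
    pos = torch.arange(seq_len)
    return pos.view(1, 1, seq_len).expand(batch_size, 3, seq_len)
```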
2025-02-04 12:44:18 -05:00
Daniël de Kok
dd2bd5fdb3
impureWithCuda: fix gcc version ( #2990 )
...
* impureWithCuda: fix gcc version
* trufflehog: do not fail on unverified results
2025-02-04 17:01:59 +01:00
Alvaro Bartolome
88fd56f549
Add strftime_now callable function for minijinja chat templates ( #2983 )
...
* Add `chrono` and `strftime_now` function callable
* Fix `test_chat_template_valid_with_strftime_now`
* Fix `test_chat_template_valid_with_strftime_now`
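A minimal sketch of registering such a callable with minijinja and chrono (the registration in TGI's router may differ in its details):

```rust
use minijinja::{context, Environment};

fn main() {
    // Cargo.toml: minijinja = "2", chrono = "0.4"
    let mut env = Environment::new();
    // Chat templates can now embed the current date, e.g. for models whose
    // Hub chat template calls strftime_now('%d %B %Y').
    env.add_function("strftime_now", |format: String| {
        chrono::Local::now().format(&format).to_string()
    });
    env.add_template("t", "Today is {{ strftime_now('%Y-%m-%d') }}.")
        .unwrap();
    let rendered = env.get_template("t").unwrap().render(context! {}).unwrap();
    println!("{rendered}");
}
```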
2025-02-03 15:30:48 +01:00
Hugo Larcher
e3f2018cb5
hotfix: fix trtllm CI build on release ( #2981 )
...
* hotfix: fix trtllm CI build on release
* fix: test release.
* fix: test release.
* fix: test release. env not recognized https://github.com/actions/runner/issues/1661
* fix: test release. Works.
2025-02-03 11:11:15 +01:00
Nicolas Patry
bb69c5b199
Back on nix main. ( #2979 )
2025-01-31 14:39:52 +01:00
Nicolas Patry
c9d68945cc
Prepare for release 3.1.0 ( #2972 )
...
* Prepare for release 3.1.0
* Back on main flake.
* Fixing stuff.
* Upgrade to moe-kernels 0.8.2 for Hip support.
* Deactivating the flaky test.
2025-01-31 14:19:01 +01:00
Mohit Sharma
c07a2cc82b
Update moe-kernel to 0.8.2 for rocm ( #2977 )
...
update moe-kernel for amd
2025-01-31 11:40:00 +01:00
Hugo Larcher
065aabb13d
doc: Update TRTLLM deployment doc. ( #2960 )
...
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.
* fix: PR comments
2025-01-30 18:04:42 +01:00
Nicolas Patry
cb747b33da
Add deepseekv3 ( #2968 )
...
* Add fp8 support moe models
add deepseekv3
format code
update dockerfile
update doc
* Small modifications.
* Moe kernels 0.8.1
* Upgrade to 0.8.1
* Fixing moe import.
* Black.
* Apply suggestions from code review
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
* Fixing Mixtral + Nits.
* Put link to ref.
* Fix other call locations.
* Scoring func `softmax` is the only one that works.
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
2025-01-30 16:40:25 +01:00
Nicolas Patry
80e7d98f88
Hotfixing intel-cpu (not sure how it was working before). ( #2967 )
...
* Hotfixing intel-cpu (not sure how it was working before).
* Do not fail on missing moe-kernels (Intel-cpu).
2025-01-29 22:34:41 +01:00
Daniël de Kok
ee0dffcd14
Update to moe-kernels 0.8.0 ( #2966 )
2025-01-29 18:19:55 +01:00
Mohit Sharma
4ef2e045c9
Add fp8 support moe models ( #2928 )
...
* Add fp8 support moe models
* flatten condition
2025-01-29 13:56:32 +01:00
Hugo Larcher
73b7cf83f6
Add backend name to telemetry ( #2962 )
...
* feat: Add backend name to telemetry
2025-01-28 16:53:16 +01:00
Nicolas Patry
eb3df0f46f
Fixing the oom maybe with 2.5.1 change. ( #2958 )
2025-01-28 10:30:28 +01:00
Hugo Larcher
c690da5973
fix: Telemetry ( #2957 )
...
* fix: add regular telemetry pings and fix unhandled errors to avoid missing telemetry stop events.
* fix: simplify error handling
* fix: update ping delay and update doc.
* fix: clippy
* doc: Rephrase properly.
2025-01-28 10:29:18 +01:00
Daniël de Kok
db922eb77e
Update to attention-kernels 0.2.0 ( #2950 )
...
This version removes our patches/custom API, making it simpler to pull in
changes from upstream. One such change is that we can now enable the FP8
KV cache for paged attention as well.
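For users, the visible effect is that paged attention can run with an FP8 KV cache. In current TGI that is exposed through the launcher's --kv-cache-dtype flag (flag name per current docs; whether this exact commit gated it is not verified):

```shell
text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype fp8_e4m3
```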
2025-01-27 11:42:36 +01:00
Funtowicz Morgan
40b00275b2
Attempt to remove AWS S3 flaky cache for sccache ( #2953 )
...
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
* backend(trtllm): export env variable in run mb?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
2025-01-27 11:21:48 +01:00
Nicolas Patry
6cb41a80a1
Revert "Remove AWS credentials?"
...
This reverts commit d2ff68e98d.
2025-01-24 14:34:17 +01:00
Nicolas Patry
d2ff68e98d
Remove AWS credentials?
2025-01-24 12:18:28 +01:00
Nicolas Patry
d9dda11726
Trying to put back the archlist (to fix the oom). ( #2947 )
2025-01-24 09:32:17 +01:00
Nicolas Patry
d937eb64da
Fixing cargo lock.
2025-01-23 18:54:34 +01:00
Cyril Vallez
18c4607d46
Transformers backend TP fix ( #2945 )
...
* init dispatch
* cohere fix
2025-01-23 18:09:57 +01:00
Nicolas Patry
29a0893b67
Tmp tp transformers ( #2942 )
...
* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
inference_mode forces the use of `aten.matmul` instead of `aten.mm`; the
former doesn't have sharding support, crashing the transformers TP support.
`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.
Torch 2.5.1 is required for sharding support.
* Put back the attention impl.
* Revert the flashinfer (this will fails).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
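To make the inference_mode note above concrete: plain no_grad disables autograd without entering inference mode, keeping the `aten.mm` lowering that DTensor can shard. A sketch of the distinction (illustrative; not the actual TGI patch):

```python
import torch

@torch.no_grad()
def forward_tp_safe(model, input_ids):
    # no_grad keeps the aten.mm path, which DTensor knows how to shard.
    return model(input_ids).logits

@torch.inference_mode()
def forward_tp_crashes(model, input_ids):
    # Per the commit message: inference_mode lowers to aten.matmul, which
    # (at the time) had no sharding rule, breaking transformers TP.
    return model(input_ids).logits
```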
2025-01-23 18:07:30 +01:00
Funtowicz Morgan
0a89902663
[TRTLLM] Expose finish reason ( #2841 )
...
* feat(trtllm): expose finish reason to Rust
* misc(llamacpp): fix typo
* misc(backend): update deps
2025-01-23 16:48:26 +01:00
Nikolai Kolodziej
4e172028aa
Add NVIDIA A40 to known cards ( #2941 )
...
feat: add NVIDIA A40 to known cards
2025-01-23 14:19:21 +01:00
Alvaro Bartolome
6ab02931cf
Set alias for max_completion_tokens in ChatRequest ( #2932 )
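In serde terms the change is a deserialization alias, so OpenAI's newer field name maps onto the existing one; a trimmed-down sketch (field set illustrative):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct ChatRequest {
    // Accept both `max_tokens` and the OpenAI-style `max_completion_tokens`.
    #[serde(alias = "max_completion_tokens")]
    max_tokens: Option<u32>,
}
```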
2025-01-23 14:18:47 +01:00
Funtowicz Morgan
cc212154e0
Bump TensorRT-LLM backend dependency to v0.16.0 ( #2931 )
...
* backend(trtllm): update to 0.16.0
* backend(trtllm): do not use shallow clone
* backend(trtllm): use tag instead
* backend(trtllm): move to nvidia remote instead of hf
* backend(trtllm): reenable shallow clone
* backend(trtllm): attempt to use ADD instead of RUN for openmpi
* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile
* backend(trtllm): correctly untar it
2025-01-23 13:54:40 +01:00
Daniël de Kok
1dd346666a
Clarify FP8-Marlin use on capability 8.9 ( #2940 )
...
The log message stated that the GPU does not support FP8 on capability
8.9. However, we use FP8-Marlin on that capability because it is faster.
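Roughly, the dispatch being clarified (hypothetical sketch):

```python
import torch

def prefer_fp8_marlin() -> bool:
    # Ada (compute capability 8.9) does support FP8, but FP8-Marlin is
    # faster there; Hopper (9.0+) uses native FP8 GEMMs instead.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) == (8, 9)
```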
2025-01-22 18:18:11 +01:00
Wang, Yi
1d3c9beba8
fix moe in quantization path ( #2935 )
...
update ipex xpu to support moe for mixtral
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-22 14:36:15 +01:00
Nicolas Patry
2dfe3b3ee6
Upgrading the deps to have transformers==4.48.0 necessary ( #2937 )
2025-01-22 12:20:15 +01:00
Alvaro Bartolome
64a33c1f05
Run `pre-commit run --all-files` to fix CI ( #2933 )
2025-01-21 17:33:33 +01:00
Nicolas Patry
bdb3e488e4
Trying to avoid the random timeout. ( #2929 )
...
* Trying to avoid the random timeout.
* More read timeout ?
* Longer timeout ?
* Remove legacy ENV directive.
* Remove the dummy test, only increase the read timeout.
* Wat?
2025-01-21 11:06:10 +01:00
Funtowicz Morgan
17367438f3
Give TensorRT-LLM a proper CI/CD 😍 ( #2886 )
...
* test(ctest) enable address sanitizer
* feat(trtllm): expose finish reason to Rust
* feat(trtllm): fix logits retrieval
* misc(ci): enable building tensorrt-llm
* misc(ci): update Rust action toolchain
* misc(ci): let's try to build the Dockerfile for trtllm
# Conflicts:
# Dockerfile_trtllm
* misc(ci): provide mechanism to cache inside container
* misc(ci): export aws creds as output of step
* misc(ci): let's try this way
* misc(ci): again
* misc(ci): again
* misc(ci): add debug profile
* misc(ci): add debug profile
* misc(ci): lets actually use sccache ...
* misc(ci): do not build with ssl enabled
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(backend): test with TGI S3 conf
* misc(backend): test with TGI S3 conf
* misc(backend): once more?
* misc(backend): let's try with GHA
* misc(backend): missing env directive
* misc(backend): make sure to correctly set IS_GHA_BUILD=true in wf
* misc(backend): ok let's debug smtg
* misc(backend): WWWWWWWWWWWWWAAAAAAAA
* misc(backend): kthxbye retry s3
* misc(backend): use session token
* misc(backend): add more info
* misc(backend): lets try 1h30
* misc(backend): lets try 1h30
* misc(backend): increase to 2h
* misc(backend): lets try...
* misc(backend): lets try...
* misc(backend): let's build for ci-runtime
* misc(backend): let's add some more tooling
* misc(backend): add some tags
* misc(backend): disable Werror for now
* misc(backend): added automatic gha detection
* misc(backend): remove leak sanitizer which is included in asan
* misc(backend): forward env
* misc(backend): forward env
* misc(backend): let's try
* misc(backend): let's try
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): fix sscache -> sccache
* misc(backend): fix sscache -> sccache
* misc(backend): fix sscache -> sccache
* misc(backend): let's actually cache things now
* misc(backend): let's actually cache things now
* misc(backend): attempt to run the testS?
* misc(backend): attempt to run the tests?
* misc(backend): attempt to run the tests?
* change runner size
* fix: Correctly tag docker images (#2878 )
* fix: Correctly tag docker images
* fix: Correctly tag docker images
* misc(llamacpp): maybe?
* misc(llamacpp): maybe?
* misc(llamacpp): maybe?
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): go
* misc(ci): go
* misc(ci): go
* misc(ci): use bin folder
* misc(ci): make the wf callable for reuse
* misc(ci): make the wf callable for reuse (bis)
* misc(ci): make the wf callable for reuse (bis)
* misc(ci): give the wf a name
* Create test-trtllm.yml
* Update test-trtllm.yml
* Create build-trtllm2
* Rename build-trtllm2 to 1-build-trtllm2
* Rename test-trtllm.yml to 1-test-trtllm2.yml
* misc(ci): fw secrets
* Update 1-test-trtllm2.yml
* Rename 1-build-trtllm2 to 1-build-trtllm2.yml
* Update 1-test-trtllm2.yml
* misc(ci): use ci-build.yaml as main dispatcher
* Delete .github/workflows/1-test-trtllm2.yml
* Delete .github/workflows/1-build-trtllm2.yml
* misc(ci): rights?
* misc(ci): rights?
* misc(ci): once more?
* misc(ci): once more?
* misc(ci): baby more time?
* misc(ci): baby more time?
* misc(ci): try the permission above again?
* misc(ci): try the permission above again?
* misc(ci): try the permission scoped again?
* misc(ci): install tensorrt_llm_executor_static
* misc(ci): attempt to rebuild with sccache?
* misc(ci):run the tests on GPU instance
* misc(ci): let's actually setup sccache in the build.rs
* misc(ci): reintroduce variables
* misc(ci): enforce sccache
* misc(ci): correct right job name dependency
* misc(ci): detect dev profile for debug
* misc(ci): detect gha build
* misc(ci): detect gha build
* misc(ci): ok debug
* misc(ci): wtf
* misc(ci): wtf2
* misc(ci): wtf3
* misc(ci): use commit HEAD instead of merge commit for image id
* misc(ci): wtfinfini
* misc(ci): wtfinfini
* misc(ci): KAMEHAMEHA
* Merge TRTLLM in standard CI
* misc(ci): remove input machine
* misc(ci): missing id-token for AWS auth
* misc(ci): missing id-token for AWS auth
* misc(ci): missing id-token for AWS auth
* misc(ci): again...
* misc(ci): again...
* misc(ci): again...
* misc(ci): again...
* misc(ci): missing benchmark
* misc(ci): missing backends
* misc(ci): missing launcher
* misc(ci): give everything aws needs
* misc(ci): give everything aws needs
* misc(ci): fix warnings
* misc(ci): attempt to fix sccache not building trtllm
* misc(ci): attempt to fix sccache not building trtllm again
---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>
2025-01-21 10:19:16 +01:00
Cyril Vallez
b980848abf
Flash Transformers modeling backend support ( #2913 )
...
* add transformers_flash
* inits
* switch version to make it work
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* runnable version
* working
* push change
* fix high dim
* init
* default
* latest transformers changes
* revert
* simplify check
* remove flag
* improve type hints + required args
* Update based on transformers PR
* small fix
* Remove Warpers for Processor
* fix compatibility version issue
* raise error if needed
* Simplify with monkey patch
* revert + style + minor improvements
* update comment
* device check
* move the import to avoid device issue
* Update __init__.py
* check for non-native models
* oops
---------
Co-authored-by: System administrator <root@ip-10-90-0-159.ec2.internal>
2025-01-21 10:01:51 +01:00