Adrien Gallouët
d6ded897a8
Add a stupid batch mechanism
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
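The title is terse, but a "stupid" batch mechanism can be as simple as draining whatever requests are already queued, up to a cap, and decoding them together. A minimal sketch with hypothetical types, not the actual backend code:

```rust
use std::sync::mpsc::Receiver;

struct Request; // placeholder for the real request type

/// Drain up to `max_batch_size` pending requests without waiting.
fn drain_batch(rx: &Receiver<Request>, max_batch_size: usize) -> Vec<Request> {
    let mut batch = Vec::new();
    while batch.len() < max_batch_size {
        match rx.try_recv() {
            Ok(req) => batch.push(req),
            Err(_) => break, // queue empty: decode with what we have
        }
    }
    batch
}
```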
Adrien Gallouët
e07835c5b5
Add --defrag-threshold
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
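For context, llama.cpp exposes a KV-cache defragmentation threshold (`defrag_thold` in `llama_context_params`). A hedged sketch of how such a flag could be wired up with clap; the names are illustrative, not the actual launcher code:

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// KV-cache defragmentation threshold, forwarded to llama.cpp's
    /// `defrag_thold` (negative values disable defragmentation).
    #[arg(long, default_value_t = -1.0)]
    defrag_threshold: f32,
}

fn main() {
    let args = Args::parse();
    println!("defrag threshold: {}", args.defrag_threshold);
}
```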
Adrien Gallouët
f388747985
Add GPU args
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
8d2dfdf668
Handle ctx args & fix sampling
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
a7b4b04cb5
Add some input validation checks
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
e7facf692f
Handle max_batch_size
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
3eb4823f3e
Use max_batch_total_tokens
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
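In TGI, `max_batch_total_tokens` caps the sum of prompt and to-be-generated tokens across the whole batch, rather than just counting requests. A toy sketch of the admission check, with hypothetical names:

```rust
/// Returns true if a new request still fits in the batch token budget.
/// `batch_tokens` is the budget already committed to admitted requests.
fn fits_in_budget(
    batch_tokens: u32,
    prompt_len: u32,
    max_new_tokens: u32,
    max_batch_total_tokens: u32,
) -> bool {
    batch_tokens + prompt_len + max_new_tokens <= max_batch_total_tokens
}
```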
Adrien Gallouët
bd0cc9905c
Get rid of llama_batch_get_one()
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
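`llama_batch_get_one()` builds a single-sequence batch over one token span and is deprecated upstream; the replacement is to fill a `llama_batch` by hand. A rough unsafe sketch against hypothetical bindgen-generated bindings (`sys`), with the field layout following llama.cpp's C API:

```rust
/// Build a batch for `tokens` starting at position `pos0` in sequence `seq_id`.
/// The caller must pass it to llama_decode() and then llama_batch_free() it.
unsafe fn build_batch(tokens: &[sys::llama_token], pos0: i32, seq_id: i32) -> sys::llama_batch {
    let mut batch = sys::llama_batch_init(tokens.len() as i32, 0, 1);
    for (i, &tok) in tokens.iter().enumerate() {
        *batch.token.add(i) = tok;
        *batch.pos.add(i) = pos0 + i as i32;
        *batch.n_seq_id.add(i) = 1;
        *(*batch.seq_id.add(i)) = seq_id;
        // Only request logits for the last position.
        *batch.logits.add(i) = (i + 1 == tokens.len()) as i8;
    }
    batch.n_tokens = tokens.len() as i32;
    batch
}
```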
Adrien Gallouët
95e221eece
Add llamacpp backend
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:56 +00:00
Alvaro Bartolome
88fd56f549
Add strftime_now callable function for minijinja chat templates ( #2983 )
...
* Add `chrono` and `strftime_now` function callable
* Fix `test_chat_template_valid_with_strftime_now`
* Fix `test_chat_template_valid_with_strftime_now`
2025-02-03 15:30:48 +01:00
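Per the PR body, the function is backed by `chrono`. A minimal sketch of registering such a callable in minijinja (the exact TGI wiring may differ):

```rust
use minijinja::{context, Environment};

fn main() {
    let mut env = Environment::new();
    // Expose `strftime_now` to chat templates; chrono does the formatting.
    env.add_function("strftime_now", |format: String| {
        chrono::Local::now().format(&format).to_string()
    });
    let tmpl = env
        .template_from_str("Today is {{ strftime_now('%Y-%m-%d') }}")
        .unwrap();
    println!("{}", tmpl.render(context! {}).unwrap());
}
```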
Hugo Larcher
e3f2018cb5
hotfix: fix trtllm CI build on release ( #2981 )
...
* hotfix: fix trtllm CI build on release
* fix: test release.
* fix: test release.
* fix: test release. env not recognized https://github.com/actions/runner/issues/1661
* fix: test release. Works.
2025-02-03 11:11:15 +01:00
Nicolas Patry
bb69c5b199
Back on nix main. ( #2979 )
2025-01-31 14:39:52 +01:00
Nicolas Patry
c9d68945cc
Prepare for release 3.1.0 ( #2972 )
...
* Prepare for release 3.1.0
* Back on main flake.
* Fixing stuff.
* Upgrade to moe-kernels 0.8.2 for Hip support.
* Deactivating the flaky test.
2025-01-31 14:19:01 +01:00
Mohit Sharma
c07a2cc82b
Update moe-kernel to 0.8.2 for rocm ( #2977 )
...
update moe-kernel for amd
2025-01-31 11:40:00 +01:00
Hugo Larcher
065aabb13d
doc: Update TRTLLM deployment doc. ( #2960 )
...
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.
* fix: PR comments
2025-01-30 18:04:42 +01:00
Nicolas Patry
cb747b33da
Add deepseekv3 ( #2968 )
...
* Add fp8 support moe models
add deepseekv3
format code
update dockerfile
update doc
* Small modifications.
* Moe kernels 0.8.1
* Upgrade to 0.8.1
* Fixing moe import.
* Black.
* Apply suggestions from code review
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
* Fixing Mixtral + Nits.
* Put link to ref.
* Fix other call locations.
* Scoring func `softmax` is the only one that works.
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
2025-01-30 16:40:25 +01:00
Nicolas Patry
80e7d98f88
Hotfixing intel-cpu (not sure how it was working before). ( #2967 )
...
* Hotfixing intel-cpu (not sure how it was working before).
* Do not fail on missing moe-kernels (Intel-cpu).
2025-01-29 22:34:41 +01:00
Daniël de Kok
ee0dffcd14
Update to moe-kernels 0.8.0 ( #2966 )
2025-01-29 18:19:55 +01:00
Mohit Sharma
4ef2e045c9
Add fp8 support moe models ( #2928 )
...
* Add fp8 support moe models
* flatten condition
2025-01-29 13:56:32 +01:00
Hugo Larcher
73b7cf83f6
Add backend name to telemetry ( #2962 )
...
* feat: Add backend name to telemetry
2025-01-28 16:53:16 +01:00
Nicolas Patry
eb3df0f46f
Fixing the oom maybe with 2.5.1 change. ( #2958 )
2025-01-28 10:30:28 +01:00
Hugo Larcher
c690da5973
fix: Telemetry ( #2957 )
...
* fix: add regular telemetry pings and fix unhandled errors to avoid dropping telemetry stop events.
* fix: simplify error handling
* fix: update ping delay and update doc.
* fix: clippy
* doc: Rephrase properly.
2025-01-28 10:29:18 +01:00
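A minimal sketch of what a regular telemetry ping loop can look like with tokio; the interval and endpoint are illustrative, and errors are deliberately swallowed so telemetry can never take serving down:

```rust
use std::time::Duration;

async fn telemetry_ping_loop(client: reqwest::Client, url: String) {
    let mut interval = tokio::time::interval(Duration::from_secs(900));
    loop {
        interval.tick().await;
        // Ignore failures: telemetry must never affect the server itself.
        let _ = client.post(url.as_str()).send().await;
    }
}
```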
Daniël de Kok
db922eb77e
Update to attention-kernels 0.2.0 ( #2950 )
...
This version removes our patches/custom API, which makes it simpler to
pull in changes from upstream. One such change lets us enable FP8
KV cache for paged attention as well.
2025-01-27 11:42:36 +01:00
Funtowicz Morgan
40b00275b2
Attempt to remove AWS S3 flaky cache for sccache ( #2953 )
...
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
* backend(trtllm): export env variable in run mb?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
2025-01-27 11:21:48 +01:00
Nicolas Patry
6cb41a80a1
Revert "Remove AWS credentials?"
...
This reverts commit d2ff68e98d.
2025-01-24 14:34:17 +01:00
Nicolas Patry
d2ff68e98d
Remove AWS credentials?
2025-01-24 12:18:28 +01:00
Nicolas Patry
d9dda11726
Trying to put back the archlist (to fix the oom). ( #2947 )
2025-01-24 09:32:17 +01:00
Nicolas Patry
d937eb64da
Fixing cargo lock.
2025-01-23 18:54:34 +01:00
Cyril Vallez
18c4607d46
Transformers backend TP fix ( #2945 )
...
* init dispatch
* cohere fix
2025-01-23 18:09:57 +01:00
Nicolas Patry
29a0893b67
Tmp tp transformers ( #2942 )
...
* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
inference_mode forces the use of `aten.matmul` instead of `aten.mm`; the
former doesn't have sharding support, which crashes the transformers TP
support.
`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.
Torch 2.5.1 is required for sharding support.
* Put back the attention impl.
* Revert the flashinfer (this will fails).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
2025-01-23 18:07:30 +01:00
Funtowicz Morgan
0a89902663
[TRTLLM] Expose finish reason ( #2841 )
...
* feat(trtllm): expose finish reason to Rust
* misc(llamacpp): fix typo
* misc(backend): update deps
2025-01-23 16:48:26 +01:00
Nikolai Kolodziej
4e172028aa
Add NVIDIA A40 to known cards ( #2941 )
...
feat: add NVIDIA A40 to known cards
2025-01-23 14:19:21 +01:00
Alvaro Bartolome
6ab02931cf
Set alias for max_completion_tokens in ChatRequest ( #2932 )
2025-01-23 14:18:47 +01:00
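In serde, an alias lets one field accept two wire names, which is what the OpenAI-style `max_completion_tokens` rename needs. A trimmed-down sketch (the real `ChatRequest` has many more fields):

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ChatRequest {
    // Accepts both the legacy `max_tokens` key and the newer
    // OpenAI-style `max_completion_tokens`.
    #[serde(alias = "max_completion_tokens")]
    max_tokens: Option<u32>,
}

fn main() {
    let req: ChatRequest =
        serde_json::from_str(r#"{"max_completion_tokens": 32}"#).unwrap();
    assert_eq!(req.max_tokens, Some(32));
}
```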
Funtowicz Morgan
cc212154e0
Bump TensorRT-LLM backend dependency to v0.16.0 ( #2931 )
...
* backend(trtllm): update to 0.16.0
* backend(trtllm): do not use shallow clone
* backend(trtllm): use tag instead
* backend(trtllm): move to nvidia remote instead of hf
* backend(trtllm): reenable shallow clone
* backend(trtllm): attempt to use ADD instead of RUN for openmpi
* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile
* backend(trtllm): correctly untar it
2025-01-23 13:54:40 +01:00
Daniël de Kok
1dd346666a
Clarify FP8-Marlin use on capability 8.9 ( #2940 )
...
The log message stated that the GPU does not support FP8 on capability
8.9. However, we use FP8-Marlin on that capability because it is faster.
2025-01-22 18:18:11 +01:00
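Illustrative only (the real logic lives in the Python quantization path): the choice the clarified log message describes is roughly a dispatch on compute capability:

```rust
/// Pick an FP8 GEMM implementation by CUDA compute capability (sketch).
fn fp8_kernel(major: u8, minor: u8) -> &'static str {
    match (major, minor) {
        (9, _) => "native-fp8",  // Hopper
        (8, 9) => "fp8-marlin",  // Ada supports FP8, but Marlin is faster here
        _ => "unsupported",
    }
}
```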
Wang, Yi
1d3c9beba8
fix moe in quantization path ( #2935 )
...
update ipex xpu to support moe for mixtral
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-22 14:36:15 +01:00
Nicolas Patry
2dfe3b3ee6
Upgrading the deps to have transformers==4.48.0 necessary ( #2937 )
2025-01-22 12:20:15 +01:00
Alvaro Bartolome
64a33c1f05
Run pre-commit run --all-files to fix CI ( #2933 )
2025-01-21 17:33:33 +01:00
Nicolas Patry
bdb3e488e4
Trying to avoid the random timeout. ( #2929 )
...
* Trying to avoid the random timeout.
* More read timeout ?
* Longer timeout ?
* Remove legacy ENV directive.
* Remove the dummy test, only increase the read timeout.
* Wat?
2025-01-21 11:06:10 +01:00
Funtowicz Morgan
17367438f3
Give TensorRT-LLM a proper CI/CD 😍 ( #2886 )
...
* test(ctest) enable address sanitizer
* feat(trtllm): expose finish reason to Rust
* feat(trtllm): fix logits retrieval
* misc(ci): enable building tensorrt-llm
* misc(ci): update Rust action toolchain
* misc(ci): let's try to build the Dockerfile for trtllm
# Conflicts:
# Dockerfile_trtllm
* misc(ci): provide mechanism to cache inside container
* misc(ci): export aws creds as output of step
* misc(ci): let's try this way
* misc(ci): again
* misc(ci): again
* misc(ci): add debug profile
* misc(ci): add debug profile
* misc(ci): lets actually use sccache ...
* misc(ci): do not build with ssl enabled
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(ci): WAT
* misc(backend): test with TGI S3 conf
* misc(backend): test with TGI S3 conf
* misc(backend): once more?
* misc(backend): let's try with GHA
* misc(backend): missing env directive
* misc(backend): make sure to correctly set IS_GHA_BUILD=true in wf
* misc(backend): ok let's debug smtg
* misc(backend): WWWWWWWWWWWWWAAAAAAAA
* misc(backend): kthxbye retry s3
* misc(backend): use session token
* misc(backend): add more info
* misc(backend): lets try 1h30
* misc(backend): lets try 1h30
* misc(backend): increase to 2h
* misc(backend): lets try...
* misc(backend): lets try...
* misc(backend): let's build for ci-runtime
* misc(backend): let's add some more tooling
* misc(backend): add some tags
* misc(backend): disable Werror for now
* misc(backend): added automatic gha detection
* misc(backend): remove leak sanitizer which is included in asan
* misc(backend): forward env
* misc(backend): forward env
* misc(backend): let's try
* misc(backend): let's try
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): again
* misc(backend): fix sscache -> sccache
* misc(backend): fix sscache -> sccache
* misc(backend): fix sscache -> sccache
* misc(backend): let's actually cache things now
* misc(backend): let's actually cache things now
* misc(backend): attempt to run the tests?
* misc(backend): attempt to run the tests?
* misc(backend): attempt to run the tests?
* change runner size
* fix: Correctly tag docker images (#2878 )
* fix: Correctly tag docker images
* fix: Correctly tag docker images
* misc(llamacpp): maybe?
* misc(llamacpp): maybe?
* misc(llamacpp): maybe?
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): gogogo
* misc(ci): go
* misc(ci): go
* misc(ci): go
* misc(ci): use bin folder
* misc(ci): make the wf callable for reuse
* misc(ci): make the wf callable for reuse (bis)
* misc(ci): make the wf callable for reuse (bis)
* misc(ci): give the wf a name
* Create test-trtllm.yml
* Update test-trtllm.yml
* Create build-trtllm2
* Rename build-trtllm2 to 1-build-trtllm2
* Rename test-trtllm.yml to 1-test-trtllm2.yml
* misc(ci): fw secrets
* Update 1-test-trtllm2.yml
* Rename 1-build-trtllm2 to 1-build-trtllm2.yml
* Update 1-test-trtllm2.yml
* misc(ci): use ci-build.yaml as main dispatcher
* Delete .github/workflows/1-test-trtllm2.yml
* Delete .github/workflows/1-build-trtllm2.yml
* misc(ci): rights?
* misc(ci): rights?
* misc(ci): once more?
* misc(ci): once more?
* misc(ci): baby more time?
* misc(ci): baby more time?
* misc(ci): try the permission above again?
* misc(ci): try the permission above again?
* misc(ci): try the permission scoped again?
* misc(ci): install tensorrt_llm_executor_static
* misc(ci): attempt to rebuild with sccache?
* misc(ci):run the tests on GPU instance
* misc(ci): let's actually setup sccache in the build.rs
* misc(ci): reintroduce variables
* misc(ci): enforce sccache
* misc(ci): correct right job name dependency
* misc(ci): detect dev profile for debug
* misc(ci): detect gha build
* misc(ci): detect gha build
* misc(ci): ok debug
* misc(ci): wtf
* misc(ci): wtf2
* misc(ci): wtf3
* misc(ci): use commit HEAD instead of merge commit for image id
* misc(ci): wtfinfini
* misc(ci): wtfinfini
* misc(ci): KAMEHAMEHA
* Merge TRTLLM in standard CI
* misc(ci): remove input machine
* misc(ci): missing id-token for AWS auth
* misc(ci): missing id-token for AWS auth
* misc(ci): missing id-token for AWS auth
* misc(ci): again...
* misc(ci): again...
* misc(ci): again...
* misc(ci): again...
* misc(ci): missing benchmark
* misc(ci): missing backends
* misc(ci): missing launcher
* misc(ci): give everything aws needs
* misc(ci): give everything aws needs
* misc(ci): fix warnings
* misc(ci): attempt to fix sccache not building trtllm
* misc(ci): attempt to fix sccache not building trtllm again
---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>
2025-01-21 10:19:16 +01:00
Cyril Vallez
b980848abf
Flash Transformers modeling backend support ( #2913 )
...
* add transformers_flash
* inits
* switch version to make it work
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* Update Makefile-flash-att-v2
* runnable version
* working
* push change
* fix high dim
* init
* default
* latest transformers changes
* revert
* simplify check
* remove flag
* improve type hints + required args
* Update based on transformers PR
* small fix
* Remove Warpers for Processor
* fix compatibility version issue
* raise error if needed
* Simplify with monkey patch
* revert + style + minor improvements
* update comment
* device check
* move the import to avoid device issue
* Update __init__.py
* check for non-native models
* oupsi
---------
Co-authored-by: System administrator <root@ip-10-90-0-159.ec2.internal>
2025-01-21 10:01:51 +01:00
Nicolas Patry
447a5b2f87
Fixing TRTLLM dockerfile. ( #2922 )
...
* Fixing TRTLLM dockerfile.
* Fixed.
* Creating a dummy modification to check CI runs.
* Removing the cache directive.
* Modifying this should cache hit.
* Revert "Modifying this should cache hit."
This reverts commit 46a2bde108.
* Modifying this should cache hit.
* Unwanted files.
2025-01-20 11:13:46 +01:00
Daniël de Kok
630f198624
flashinfer: switch to plan API ( #2904 )
...
This change doesn't switch `forward` to `run` yet, since it requires
that we have access to the softmax scale and the logit softcap outside
the model.
2025-01-17 18:18:02 +01:00
drbh
8f6146f11a
Revert "feat: improve qwen2-vl startup " ( #2924 )
...
Revert "feat: improve qwen2-vl startup (#2802 )"
This reverts commit eecca27113.
2025-01-17 12:09:05 -05:00
drbh
eecca27113
feat: improve qwen2-vl startup ( #2802 )
...
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
2025-01-17 11:50:41 -05:00
Wang, Yi
6e982f43a1
fix the crash of meta-llama/Llama-3.2-1B ( #2918 )
...
* fix the crash of meta-llama/Llama-3.2-1B
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Simpler fix (which doesn't break vlms).
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-01-17 15:50:58 +01:00
Mohit Sharma
c20025dbf7
Add fp8 kv cache for ROCm ( #2856 )
...
* add fp8 kv cache for rocm
* improvements
* update log statement
* remove bookkeeping field
2025-01-17 18:43:29 +05:30
Nicolas Patry
de19e7e844
Moving to `uv` instead of `poetry`. ( #2919 )
...
* Moving to `uv` instead of `poetry`.
More standard, faster, and with a seemingly better lockfile.
* Creating venv if not created.
* Create the venv.
* Fix ?
* Fixing the test by activating the environment ?
* Install system ?
* Add the cli entry point.
* docker install on system
* Monkeying this...
* `--system` is redundant.
* Trying to force-include this pb folder.
* TRying to check that pb is imported correctly.
* Editable install necessary ?
* Non editable?
* Editable it is.
2025-01-17 12:32:00 +01:00
Daniël de Kok
d61f14f271
nix: update to PyTorch 2.5.1 ( #2921 )
2025-01-17 12:12:11 +01:00
Wang, Yi
885144166f
Add flash decoding kernel and enable prefill-chunking and prefix caching on Intel CPU/XPU ( #2815 )
...
* flash decoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable xpu flashdecoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set flashdecoding blocksize as 64
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable flashdecoding, prefill chunking and prefix caching
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add flashdecoding-ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-17 12:04:57 +01:00