Adrien Gallouët
c52f08351f
Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-05 10:57:50 +00:00
Adrien Gallouët
dbee804129
Simplify batching logic
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-05 10:12:39 +00:00
Adrien Gallouët
d3a772a8dd
Update args
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-05 10:10:38 +00:00
Adrien Gallouët
e007529590
Update Cargo.lock
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 17:54:53 +00:00
Adrien Gallouët
906c265aef
Cleanup Dockerfile
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 17:53:47 +00:00
Adrien Gallouët
df2a4fbb8a
Update Dockerfile_llamacpp
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
d883109df6
Disable graceful shutdown in debug mode
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
207041a977
Bump llamacpp to b4623
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
38b33e9698
Add --type-v & --type-k
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
bfb8e03e9f
Add specific args for batch
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Morgan Funtowicz
e6a8d33902
backend(llama): add CUDA architectures build argument for Dockerfile
2025-02-04 13:32:59 +00:00
Adrien Gallouët
ea28332bb3
Cleanup
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
104a968d01
Remove warmup
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
8ed362d03a
Clear request cache after completion
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
c8505fb300
Auto-detect n_threads when not provided
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
27534d8ee4
Fix seq iterations
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
96434a1e7e
Fix batching
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
2a51e415ff
Output real logprobs
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
161280f313
Only export the latest logits
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Morgan Funtowicz
960c12bd6e
backend(llama): add CUDA Dockerfile_llamacpp for now
2025-02-04 13:32:58 +00:00
Adrien Gallouët
f38c34aeb7
Fix batch_pos
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
e88a527fcf
Add --offload-kqv
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
ae5bb789c2
Enable flash attention by default
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
3f199134f0
Fix args
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
7a3ed4171e
Add --numa
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
390f0ec061
Cleanup
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
d6ded897a8
Add a stupid batch mechanism
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
e07835c5b5
Add --defrag-threshold
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
f388747985
Add GPU args
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
8d2dfdf668
Handle ctx args & fix sampling
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
a7b4b04cb5
Add some input validation checks
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
e7facf692f
Handle max_batch_size
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
3eb4823f3e
Use max_batch_total_tokens
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
bd0cc9905c
Get rid of llama_batch_get_one()
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
95e221eece
Add llamacpp backend
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:56 +00:00
Alvaro Bartolome
88fd56f549
Add strftime_now callable function for minijinja chat templates ( #2983 )
...
* Add `chrono` and `strftime_now` function callable
* Fix `test_chat_template_valid_with_strftime_now`
* Fix `test_chat_template_valid_with_strftime_now`
2025-02-03 15:30:48 +01:00
Hugo Larcher
e3f2018cb5
hotfix: fix trtllm CI build on release ( #2981 )
...
* hotfix: fix trtllm CI build on release
* fix: test release.
* fix: test release.
* fix: test release. env not recognized https://github.com/actions/runner/issues/1661
* fix: test release. Works.
2025-02-03 11:11:15 +01:00
Nicolas Patry
bb69c5b199
Back on nix main. ( #2979 )
2025-01-31 14:39:52 +01:00
Nicolas Patry
c9d68945cc
Prepare for release 3.1.0 ( #2972 )
...
* Prepare for release 3.1.0
* Back on main flake.
* Fixing stuff.
* Upgrade to moe-kernels 0.8.2 for Hip support.
* Deactivating the flaky test.
2025-01-31 14:19:01 +01:00
Mohit Sharma
c07a2cc82b
Update moe-kernel to 0.8.2 for rocm ( #2977 )
...
update moe-kernel for amd
2025-01-31 11:40:00 +01:00
Hugo Larcher
065aabb13d
doc: Update TRTLLM deployment doc. ( #2960 )
...
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.
* fix: PR comments
2025-01-30 18:04:42 +01:00
Nicolas Patry
cb747b33da
Add deepseekv3 ( #2968 )
...
* Add fp8 support moe models
add deepseekv3
format code
update dockerfile
update doc
* Small modifications.
* Moe kernels 0.8.1
* Upgrade to 0.8.1
* Fixing moe import.
* Black.
* Apply suggestions from code review
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
* Fixing Mixtral + Nits.
* Put link to ref.
* Fix other call locations.
* Scoring func `softmax` is the only one that works.
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
2025-01-30 16:40:25 +01:00
Nicolas Patry
80e7d98f88
Hotfixing intel-cpu (not sure how it was working before). ( #2967 )
...
* Hotfixing intel-cpu (not sure how it was working before).
* Do not fail on missing moe-kernels (Intel-cpu).
2025-01-29 22:34:41 +01:00
Daniël de Kok
ee0dffcd14
Update to moe-kernels 0.8.0 ( #2966 )
2025-01-29 18:19:55 +01:00
Mohit Sharma
4ef2e045c9
Add fp8 support moe models ( #2928 )
...
* Add fp8 support moe models
* flatten condition
2025-01-29 13:56:32 +01:00
Hugo Larcher
73b7cf83f6
Add backend name to telemetry ( #2962 )
...
* feat: Add backend name to telemetry
2025-01-28 16:53:16 +01:00
Nicolas Patry
eb3df0f46f
Fixing the oom maybe with 2.5.1 change. ( #2958 )
2025-01-28 10:30:28 +01:00
Hugo Larcher
c690da5973
fix: Telemetry ( #2957 )
...
* fix: add regular telemetry pings and fix unhandled errors to avoid not sending telemetry stop events.
* fix: simplify error handling
* fix: update ping delay and update doc.
* fix: clippy
* doc: Rephrase properly.
2025-01-28 10:29:18 +01:00
Daniël de Kok
db922eb77e
Update to attention-kernels 0.2.0 ( #2950 )
...
This version removes our patches/custom API. Makes it simpler to
get changes from upstream. One of which is that we can enable FP8
KV cache for paged attention as well.
2025-01-27 11:42:36 +01:00
Funtowicz Morgan
40b00275b2
Attempt to remove AWS S3 flaky cache for sccache ( #2953 )
...
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
* backend(trtllm): export env variable in run mb?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
2025-01-27 11:21:48 +01:00