Adrien Gallouët
|
df723e646b
|
Bump llama.cpp & cuda
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-06 13:24:36 +00:00 |
|
Adrien Gallouët
|
7bff88bba9
|
Do not use HOSTNAME env
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-06 13:17:17 +00:00 |
|
Adrien Gallouët
|
8bc10d37ee
|
Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-06 10:31:05 +00:00 |
|
Adrien Gallouët
|
2b0d99c1cf
|
Thanks cargo fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-06 10:18:08 +00:00 |
|
Adrien Gallouët
|
fb81c0d1c4
|
Thanks clippy
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-06 10:57:56 +01:00 |
|
Adrien Gallouët
|
e4d5fa7eaf
|
Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-06 09:46:24 +00:00 |
|
Adrien Gallouët
|
1641c22af8
|
Add doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 21:14:30 +00:00 |
|
Adrien Gallouët
|
b3e40c4b66
|
Improve default settings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 16:38:52 +00:00 |
|
Adrien Gallouët
|
f22e2fb550
|
Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 16:12:34 +00:00 |
|
Adrien Gallouët
|
0f62401b8e
|
Initialize penalty_last_n with llamacpp default value
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 15:44:46 +00:00 |
|
Adrien Gallouët
|
695b1292e9
|
Ensure all samplers are freed on error
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 15:42:59 +00:00 |
|
Adrien Gallouët
|
5b777877b1
|
Make max_batch_total_tokens optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 11:40:20 +00:00 |
|
Adrien Gallouët
|
09a745f1b8
|
Remove n_ctx
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 11:31:58 +00:00 |
|
Adrien Gallouët
|
051ff2d5ce
|
Rename bindings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 11:21:41 +00:00 |
|
Adrien Gallouët
|
c52f08351f
|
Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 10:57:50 +00:00 |
|
Adrien Gallouët
|
dbee804129
|
Simplify batching logic
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 10:12:39 +00:00 |
|
Adrien Gallouët
|
d3a772a8dd
|
Update args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-05 10:10:38 +00:00 |
|
Adrien Gallouët
|
e007529590
|
Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 17:54:53 +00:00 |
|
Adrien Gallouët
|
906c265aef
|
Cleanup Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 17:53:47 +00:00 |
|
Adrien Gallouët
|
df2a4fbb8a
|
Update Dockerfile_llamacpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
d883109df6
|
Disable graceful shutdown in debug mode
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
207041a977
|
Bump llamacpp to b4623
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
38b33e9698
|
Add --type-v & --type-k
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
bfb8e03e9f
|
Add specific args for batch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Morgan Funtowicz
|
e6a8d33902
|
backend(llama): add CUDA architectures build argument for Dockerfile
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
ea28332bb3
|
Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
104a968d01
|
Remove warmup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
8ed362d03a
|
Clear request cache after completion
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
c8505fb300
|
Auto-detect n_threads when not provided
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
27534d8ee4
|
Fix seq iterations
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
96434a1e7e
|
Fix batching
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:59 +00:00 |
|
Adrien Gallouët
|
2a51e415ff
|
Output real logprobs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
161280f313
|
Only export the latest logits
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Morgan Funtowicz
|
960c12bd6e
|
backend(llama): add CUDA Dockerfile_llamacpp for now
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
f38c34aeb7
|
Fix batch_pos
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
e88a527fcf
|
Add --offload-kqv
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
ae5bb789c2
|
Enable flash attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
3f199134f0
|
Fix args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
7a3ed4171e
|
Add --numa
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
390f0ec061
|
Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
d6ded897a8
|
Add a stupid batch mechanism
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
e07835c5b5
|
Add --defrag-threshold
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
f388747985
|
Add GPU args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
8d2dfdf668
|
Handle ctx args & fix sampling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
a7b4b04cb5
|
Add some input validation checks
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
e7facf692f
|
Handle max_batch_size
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
3eb4823f3e
|
Use max_batch_total_tokens
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
bd0cc9905c
|
Get rid of llama_batch_get_one()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:58 +00:00 |
|
Adrien Gallouët
|
95e221eece
|
Add llamacpp backend
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
|
2025-02-04 13:32:56 +00:00 |
|
Alvaro Bartolome
|
88fd56f549
|
Add strftime_now callable function for minijinja chat templates (#2983)
* Add `chrono` and `strftime_now` function callable
* Fix `test_chat_template_valid_with_strftime_now`
|
2025-02-03 15:30:48 +01:00 |
|