Commit Graph

10 Commits

Author SHA1 Message Date
Mohit Sharma
d9bb9bebc9
Add llama4 (#3145)
* initial changes

* Add support for other vlm

* cleanup comment

* Improve attn_implementation

* Add comments for support of models

* add model

* add model

* fixes and improvements

* update docker

* Add cache position

* Add tests

* remove redundant changes

* remove tr version

* Upgrade doc + fix linting.

* Fixing the CI.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-06 10:20:22 +02:00
Daniël de Kok
c73ae0bd88
Update to kernels 0.2.1 (#3084)
* Update to `kernels` 0.2.1

The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.

* Download kernels in `install-cuda` target
2025-03-13 10:36:29 +01:00
Nicolas Patry
31766dad77
Force upgrade transformers version for olmo. 2025-03-05 12:17:09 +01:00
Hugo Larcher
d8ff7f2623
feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. (#3061)
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.

* fix: Rust version for Neuron

* fix: PR comments, use rust-toolchain.toml
2025-03-04 16:43:50 +01:00
Nicolas Patry
d6881c37ab
Putting back the NCCL forced upgrade. (#2999)
* Putting back the NCCL forced upgrade.

* .

* ...

* Ignoring conda.

* Dropping conda from the buidl system + torch 2.6

* Cache min.

* Rolling back torch version.

* Reverting the EETQ modification.

* Fix flash attention ?

* Actually stay on flash v1.

* Patching flash v1.

* Torch 2.6, fork of rotary, eetq updated.

* Put back nccl latest (override torch).

* Slightly more reproducible build and not as scary.
2025-02-14 11:31:59 +01:00
Daniël de Kok
571ac9b507
Use kernels from the kernel hub (#2988)
* Use Hub kernels for Marlin and cutlass quantization kernels

* Use hub kernels for MoE/GPTQ-Marlin MoE

* Use attention kernels from the Hub

* Cache the kernels in the Docker image

* Update moe kernels

* Support loading local kernels for development

* Support latest moe kernels

* Update to moe 0.1.1

* CI: download locked kernels for server tests

* Fixup some imports

* CI: activate venv

* Fix unused imports

* Nix: add attention/moe/quantization kernels

* Update hf-kernels to 0.1.5

* Update kernels

* Update tgi-nix flake for hf-kernels

* Fix EOF

* Take `load_kernel` out of a frequently-called function

* Hoist another case of kernel loading out of a somewhat hot function

* marlin-kernels -> quantization

* attention -> paged-attention

* EOF fix

* Update hf-kernels, fixup Docker

* ipex fix

* Remove outdated TODO
2025-02-10 19:19:25 +01:00
Nicolas Patry
0ef8c8a97a
Using the "lockfile". (#2992)
* Using the "lockfile".

* Revert dummy modifications.

* Lock on python 3.11

* Another attempt.

* ..

* Bad cache hits.

* The good old monkey.

* How in the world...

* We need the launcher still.

* .

* ..

* Attempt #42

* Don't break all other builds.

* Mode max.

* Applying to other builds.
2025-02-06 12:28:24 +01:00
Daniël de Kok
ee0dffcd14
Update to moe-kernels 0.8.0 (#2966) 2025-01-29 18:19:55 +01:00
Daniël de Kok
db922eb77e
Update to attention-kernels 0.2.0 (#2950)
This version removes our patches/custom API. Makes it simpler to
get changes from upstream. One of which is that we can enable FP8
KV cache for paged attention as well.
2025-01-27 11:42:36 +01:00
Nicolas Patry
de19e7e844
Moving to uv instead of poetry. (#2919)
* Moving to `uv` instead of `poetry`.

More in the standard, faster, seemingly better lockfile.

* Creating venv if not created.

* Create the venv.

* Fix ?

* Fixing the test by activating the environment ?

* Install system  ?

* Add the cli entry point.

* docker install on system

* Monkeying this...

* `--system` is redundant.

* Trying to force-include this pb folder.

* TRying to check that pb is imported correctly.

* Editable install necessary ?

* Non editable?

* Editable it is.
2025-01-17 12:32:00 +01:00