35 Commits

Author SHA1 Message Date
Mohit Sharma
87a0af4ec2
Update transformers to 4.51 (#3148)
* Update transformers

* Upgrading the nix deps too.

* Forcing torchvision to be in there.

* Fixing bug in mllama.

* Those tests cannot be run in CI.

* Lint.

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-07 12:55:43 +02:00
Nicolas Patry
d23b385eee
Preparing for release. (#3147)
* Preparing for release.

* Adding hf-xet dependency.

* Merged tgi-nix update.
2025-04-06 11:36:00 +02:00
Nicolas Patry
54d15462dc
Torch 2.6 (#3134)
* Torch 2.6

* Upgrade the toolchain.

* Don't upgrade just yet.

* Upgrade toolchain.

* Time upgrade.

* TGI-nix main.

* Upgrade to transformers 4.50
2025-03-24 11:55:49 +01:00
Nicolas Patry
11f2eec10e
Publish nix docker image. (#3122)
* Publish nix docker image.

* Run during PR.

* Something else.

* Forgot to push.

* Build zstd.

* Pushing with skopeo

* Testing the PR.

* Running from nix.

* Cleaner tags.
2025-03-18 12:58:21 +01:00
Daniël de Kok
f91434e99b
Make the Nix-based Docker container work on non-NixOS (#3109)
On NixOS, the CUDA driver shim gets mounted on /run/opengl-driver,
where Nix packages expect the shim to be. However, on other
distributions, some FHS paths are mounted. This is a small change
to make the dynamic loader find the shim.
2025-03-13 14:02:45 +01:00
Nicolas Patry
8b91f92978
Fixing the docker build. (#3108)
* Fixing the docker build.

* Apply suggestions from code review
2025-03-13 11:26:44 +01:00
Nicolas Patry
83ef364177
We need gcc during runtime to enable triton to compile kernels. (#3103)
* We need gcc during runtime to enable triton to compile kernels.

* Fixing the docker build.
2025-03-13 10:45:47 +01:00
Daniël de Kok
c73ae0bd88
Update to kernels 0.2.1 (#3084)
* Update to `kernels` 0.2.1

The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.

* Download kernels in `install-cuda` target
2025-03-13 10:36:29 +01:00
Daniël de Kok
036d802b62
Nix: add openai to impure shell for integration tests (#3081) 2025-03-07 13:04:21 +01:00
Nicolas Patry
8e92942a18
Making tool_calls a vector. (#3075)
* Making `tool_calls` a vector.

* Update doc.

* Fixing the nix overlay with updated version.

* Add openai dependency.

* Updating the old tests.

* Trying to reduce the logs in the case of errors.

* Less spammy logs too.
2025-03-05 22:32:31 +01:00
Daniël de Kok
97c5f7e685
Use rotary kernel from the Hub (#3041) 2025-02-21 13:55:31 +01:00
drbh
d6a0c67e2f
feat: add initial qwen2.5-vl model and test (#2971)
* feat: support qwen2.5 vl model

* fix: bump support models doc

* feat: check before rope type adjustment and small refactors

* fix: add transformer overlay for processor support

* fix: vendor processor and config from transformers

* fix: refactor/simplify conditionals
2025-02-19 12:38:20 +01:00
Daniël de Kok
f0ed76583c
Use eetq kernel from the hub (#3029)
* Use eetq kernel from the hub

* Fixing the CI.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-18 10:03:53 +01:00
Daniël de Kok
571ac9b507
Use kernels from the kernel hub (#2988)
* Use Hub kernels for Marlin and cutlass quantization kernels

* Use hub kernels for MoE/GPTQ-Marlin MoE

* Use attention kernels from the Hub

* Cache the kernels in the Docker image

* Update moe kernels

* Support loading local kernels for development

* Support latest moe kernels

* Update to moe 0.1.1

* CI: download locked kernels for server tests

* Fixup some imports

* CI: activate venv

* Fix unused imports

* Nix: add attention/moe/quantization kernels

* Update hf-kernels to 0.1.5

* Update kernels

* Update tgi-nix flake for hf-kernels

* Fix EOF

* Take `load_kernel` out of a frequently-called function

* Hoist another case of kernel loading out of a somewhat hot function

* marlin-kernels -> quantization

* attention -> paged-attention

* EOF fix

* Update hf-kernels, fixup Docker

* ipex fix

* Remove outdated TODO
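
The two items above about taking kernel loading out of frequently called functions describe a common pattern: do the expensive load once, at construction time, and only use the result on the hot path. A minimal sketch of that pattern follows. `load_kernel_stub`, `Layer`, and `LOAD_COUNT` are hypothetical names for illustration only; they are not TGI's `load_kernel` or any Hub API.

```python
LOAD_COUNT = 0  # counts how often the (pretend) expensive load runs


def load_kernel_stub(name: str) -> dict:
    """Hypothetical stand-in for an expensive kernel load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"name": name, "scale": lambda x: 2 * x}


class Layer:
    def __init__(self) -> None:
        # Hoisted: the load happens once, when the layer is built.
        self.kernel = load_kernel_stub("demo")

    def forward(self, x: int) -> int:
        # Hot path: no loading here, only a call into the loaded kernel.
        return self.kernel["scale"](x)


layer = Layer()
results = [layer.forward(i) for i in range(1000)]
```

After 1000 calls to `forward`, the stub loader has still run only once.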
2025-02-10 19:19:25 +01:00
Daniël de Kok
dd2bd5fdb3
impureWithCuda: fix gcc version (#2990)
* impureWithCuda: fix gcc version

* trufflehog: do not fail on unverified results
2025-02-04 17:01:59 +01:00
Daniël de Kok
07bed530f7
nix: build and cache impure devshells (#2765)
* nix: build and cache all devshells

* nix: add poetry to the impure shell

This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.

* Fix Nix build, disable pure shell (covered by Nix tests)
2024-11-20 20:56:11 +01:00
Daniël de Kok
52e48739a5
Remove vLLM dependency for CUDA (#2751)
* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

Tested run (since we don't have paged attention in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning
2024-11-17 17:34:50 +01:00
Daniël de Kok
a785000842
Add initial support for compressed-tensors checkpoints (#2732)
compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
  quantizers.
- Configurable exclusions for quantization.

This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.
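
As a rough illustration of the per-target configuration and exclusions described above, the sketch below resolves a quantization scheme for a layer. The dictionary layout, the `re:` prefix handling, and the `find_scheme` helper are simplified assumptions made for this example; they are not the `compressed-tensors` package's API or its exact schema.

```python
import re
from typing import Optional

# Hypothetical, simplified configuration: groups are checked in order.
CONFIG = {
    "config_groups": {
        "group_0": {
            "targets": ["re:.*down_proj"],
            "weights": {"num_bits": 8, "type": "float"},
        },
        "group_1": {
            "targets": ["Linear"],
            "weights": {"num_bits": 4, "type": "int"},
        },
    },
    "ignore": ["lm_head"],
}


def _matches(pattern: str, name: str, cls: str) -> bool:
    # "re:" marks a regex on the layer name; otherwise compare the
    # pattern with the layer name or its class name.
    if pattern.startswith("re:"):
        return re.match(pattern[3:], name) is not None
    return pattern in (name, cls)


def find_scheme(name: str, cls: str, config: dict = CONFIG) -> Optional[dict]:
    """Return the weight scheme for a layer, or None if it stays unquantized."""
    if any(_matches(p, name, cls) for p in config["ignore"]):
        return None
    for group in config["config_groups"].values():
        if any(_matches(p, name, cls) for p in group["targets"]):
            return group["weights"]
    return None


down = find_scheme("model.layers.0.mlp.down_proj", "Linear")
q = find_scheme("model.layers.0.self_attn.q_proj", "Linear")
head = find_scheme("lm_head", "Linear")
```

Here `down` resolves to the 8-bit float group, `q` to the 4-bit int group, and `head` is excluded.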
2024-11-10 13:54:07 +01:00
Daniël de Kok
0f346a3296
Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688)
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.

* Update test snapshots
2024-10-25 16:40:47 +02:00
OlivierDehaene
03c9388bf7
feat: natively support Granite models (#2682)
* feat: natively support Granite models

* Update doc
2024-10-23 10:04:05 +00:00
Daniël de Kok
9c9ef37c56
Add impureWithCuda dev shell (#2677)
* Add `impureWithCuda` dev shell

This shell is handy when developing some kernels jointly with TGI - it
adds nvcc and a bunch of commonly-used CUDA libraries to the environment.

We don't add this to the normal impure shell to keep the development
environment as clean as possible (avoid accidental dependencies, etc.).

* Add cuDNN
2024-10-22 11:02:55 +02:00
Daniël de Kok
9ed0c85fe1
nix: add black and isort to the closure (#2619)
To make sure that everything is formatted with the same black version
as CI.

I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.
2024-10-09 11:08:02 +02:00
Daniël de Kok
68103079f4
nix: example of local package overrides during development (#2607) 2024-10-04 16:52:42 +02:00
Daniël de Kok
584b4d7a68
nix: experimental support for building a Docker container (#2470)
* nix: experimental support for building a Docker image

Run using something like:

```
docker run \
  --device nvidia.com/gpu=all \
  -it --rm -p 8080:80 \
  -v $PWD/data:/data \
  -v $PWD/tmp:/tmp \
  tgi-docker:latest \
  --model-id <model_id>
```

* Example of building the Docker image using Nix inside Docker

* Stream to make the builder image smaller

This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.

* Don't spam journalctl on Linux

* Other dockerfile.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-01 18:02:06 +02:00
Daniël de Kok
5b6b74e21d
Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention on GPUs with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
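
The rules listed in this change can be read as a small decision function, sketched below for the case of a model that cannot use flashinfer. `AttentionPlan` and `plan_attention` are hypothetical names, and the branch for capability 8 and newer is a placeholder assumption, since the message only describes the older-GPU case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttentionPlan:
    attention: str              # which attention implementation to use
    flash_attn: Optional[str]   # flash-attn version, if any
    prefix_caching: bool        # whether prefix caching stays enabled
    pass_raw_kv: bool           # pass key/value tensors instead of the cache


def plan_attention(capability_major: int) -> AttentionPlan:
    """Plan for a model that cannot use flashinfer (illustrative only)."""
    if capability_major < 8:
        # flash-attn v1 + paged attention; v1 cannot use block tables,
        # so it gets the key/value directly, and prefix caching is off.
        return AttentionPlan("paged", "v1", prefix_caching=False, pass_raw_kv=True)
    # Assumption: newer GPUs keep their existing defaults (not specified here).
    return AttentionPlan("default", None, prefix_caching=True, pass_raw_kv=False)


turing = plan_attention(7)
ampere = plan_attention(8)
```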
2024-09-27 16:19:42 +02:00
Nicolas Patry
f512021e77
Stream options. (#2533)
* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
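
The "Only send the usage when asked for" item above amounts to an opt-in final chunk. A minimal sketch of that behavior is shown below. The `include_usage` name follows the OpenAI streaming API's `stream_options` field, which is assumed here to be what this change mirrors; `stream_chunks` and the chunk layout are hypothetical and do not reproduce the router's code.

```python
from typing import Dict, Iterator, List


def stream_chunks(
    tokens: List[str], prompt_tokens: int, include_usage: bool = False
) -> Iterator[Dict]:
    """Yield one chunk per token, plus a usage chunk only if requested."""
    for tok in tokens:
        yield {"choices": [{"delta": {"content": tok}}]}
    if include_usage:
        completion_tokens = len(tokens)
        yield {
            "choices": [],
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
        }


plain = list(stream_chunks(["Hel", "lo"], prompt_tokens=5))
with_usage = list(stream_chunks(["Hel", "lo"], prompt_tokens=5, include_usage=True))
```

Without the option the stream contains only the two content chunks; with it, one extra chunk carries the token counts.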
2024-09-19 20:50:37 +02:00
Daniël de Kok
ce85efa968
Move to moe-kernels package and switch to common MoE layer (#2511)
* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
2024-09-17 18:08:58 +02:00
Daniël de Kok
94304649f1
nix: support Python tokenizer conversion in the router (#2515)
Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
2024-09-12 10:44:01 +02:00
Daniël de Kok
de2cdeca53
nix: add punica-kernels (#2477)
Enables LoRA support.
2024-09-02 11:31:36 +02:00
Daniël de Kok
4e821c003a
nix: build Torch against MKL and various other improvements (#2469)
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids
  compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
2024-08-29 16:25:25 +02:00
Daniël de Kok
358ceb67dd
nix: add awq-inference-engine as server dependency (#2442) 2024-08-21 22:20:03 +02:00
Nicolas Patry
310778e02a
Adding eetq to flake. (#2438) 2024-08-21 09:06:33 +02:00
Daniël de Kok
9474415095
nix: add text-generation-benchmark to pure devshell (#2431)
2024-08-21 07:48:13 +02:00
Daniël de Kok
f5f11b797e
nix: add pure server to flake, add both pure and impure devshells (#2430)
* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell
2024-08-20 22:07:33 +02:00
Daniël de Kok
1411bfb989
nix: try to reduce the number of Rust rebuilds (#2424)
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
2024-08-16 10:01:01 +02:00