* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* nix: build and cache all devshells
* nix: add poetry to the impure shell
This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.
* Fix Nix build, disable pure shell (covered by Nix tests)
* Add `impureWithCuda` dev shell
This shell is handy when developing some kernels jointly with TGI - it
adds nvcc and a bunch of commonly-used CUDA libraries to the environment.
We don't add this to the normal impure shell to keep the development
environment as clean as possible (avoid accidental dependencies, etc.).
* Add cuDNN
To make sure that everything is formatted with the same black version
as CI.
I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.
* Improve support for GPUs with capability < 8
- For models that cannot use flashinfer, use flash-attn v1 + paged
attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s