text-generation-inference/nix/impure-shell.nix

{
  lib,
  mkShell,
  black,
  cmake,
  isort,
  ninja,
  which,
  cudaPackages,
  openssl,
  ffmpeg,
  llvmPackages,
  gcc,
  gcc-unwrapped,
  stdenv,
  pkg-config,
  poetry,
  protobuf,
  python3,
  pyright,
  redocly,
  ruff,
  rust-bin,
  server,
  # Enable dependencies for building CUDA packages. Useful for e.g.
  # developing marlin/moe-kernels in-place.
  withCuda ? false,
}:
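# A CUDA-enabled variant of this shell is obtained by flipping `withCuda` at
# the call site. Sketch only (assumes this file is instantiated with
# `callPackage` and that `server` is the TGI server package from the flake):
#
#   pkgs.callPackage ./nix/impure-shell.nix { inherit server; withCuda = true; }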
mkShell {
  nativeBuildInputs =
    [
      black
      isort
      pkg-config
      poetry
      (rust-bin.stable.latest.default.override {
        extensions = [
          "rust-analyzer"
          "rust-src"
        ];
      })
      protobuf
      pyright
      redocly
      ruff
    ]
    ++ (lib.optionals withCuda [
      cmake
      ninja
      which
      # For most Torch-based extensions, setting CUDA_HOME is enough, but
      # some custom CMake builds (e.g. vLLM) also need to have nvcc in PATH.
      cudaPackages.cuda_nvcc
    ]);
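  # With cuda_nvcc in nativeBuildInputs, nvcc is on PATH inside the shell;
  # CUDA_HOME (set further down) points at the same toolchain, which is what
  # most setup.py/CMake extension builds look for.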
  buildInputs =
    [
      openssl.dev
      ffmpeg.dev
      llvmPackages.libclang
      gcc.cc
      gcc-unwrapped
      stdenv
    ]
    ++ (with python3.pkgs; [
      venvShellHook
      docker
      pip
      ipdb
      click
      pytest
      pytest-asyncio
      syrupy
    ])
    ++ (lib.optionals withCuda (
      with cudaPackages;
      [
        cuda_cccl
        cuda_cudart
        cuda_nvrtc
        cuda_nvtx
        cuda_profiler_api
        cudnn
        libcublas
        libcusolver
        libcusparse
      ]
    ));
  inputsFrom = [ server ];
  env = {
    LIBCLANG_PATH = "${llvmPackages.libclang.lib}/lib";
    CPATH = "${gcc-unwrapped}/lib/gcc/${stdenv.hostPlatform.config}/${gcc-unwrapped.version}/include";
    BINDGEN_EXTRA_CLANG_ARGS = builtins.concatStringsSep " " [
      "-I${gcc.libc.dev}/include"
      "-I${gcc}/lib/gcc/x86_64-unknown-linux-gnu/${gcc.version}/include"
      "-I${llvmPackages.libclang.lib}/lib/clang/${llvmPackages.libclang.version}/include"
    ];
  } // lib.optionalAttrs withCuda {
    CUDA_HOME = "${lib.getDev cudaPackages.cuda_nvcc}";
    TORCH_CUDA_ARCH_LIST = lib.concatStringsSep ";" python3.pkgs.torch.cudaCapabilities;
  };
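  # Example only (the exact value depends on the nixpkgs torch build): with
  # torch.cudaCapabilities = [ "8.0" "8.6" "9.0" ], TORCH_CUDA_ARCH_LIST becomes
  # "8.0;8.6;9.0", which torch.utils.cpp_extension uses to pick the
  # architectures custom CUDA kernels are compiled for.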
  venvDir = "./.venv";
  postVenvCreation = ''
    unset SOURCE_DATE_EPOCH
    ( cd server ; python -m pip install --no-dependencies -e . )
    ( cd clients/python ; python -m pip install --no-dependencies -e . )
  '';
  postShellHook = ''
    unset SOURCE_DATE_EPOCH
    export PATH=$PATH:~/.cargo/bin
  '';
}
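# Usage sketch (assumption: the repository flake wires this file up as an
# impure dev shell, e.g. via `nix develop .#impure`): on entry, venvShellHook
# creates ./.venv, postVenvCreation installs the server and the Python client
# in editable mode, and postShellHook puts ~/.cargo/bin on PATH for cargo
# tooling.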