text-generation-inference/docs/source/_toctree.yml
David Corvoysier c00add9c03
Add Neuron backend (#3033)
* feat: add neuron backend

* feat(neuron): add server standalone installation

* feat(neuron): add server and integration tests

* fix(neuron): increase ulimit when building image

The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.

* test(neuron): merge integration tests and fixtures

* test: add --neuron option

* review: do not use latest tag

* review: remove ureq pinned version

* review: --privileged should be the exception

* feat: add neuron case to build ci

* fix(neuron): export models from container in test fixtures

The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are ran for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.

* refactor: remove sagemaker entry-point

The SageMaker image is built differently anyway.

* fix(neuron): avoid using Levenshtein

* test(neuron): use smaller llama model

* feat(neuron): avoid installing CUDA in image

* test(neuron): no error anymore when requesting too many tokens

* ci: doing a precompilation step (with a different token).

* test(neuron): avoid using image sha when exporting models

We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only hwen the neuron backend changes),
  which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
  locally will export the models before the CI uses them.

* test(neuron): added a small script to prune test models

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-24 09:10:05 +01:00

94 lines
2.7 KiB
YAML

- sections:
- local: index
title: Text Generation Inference
- local: quicktour
title: Quick Tour
- local: supported_models
title: Supported Models
- local: installation_nvidia
title: Using TGI with Nvidia GPUs
- local: installation_amd
title: Using TGI with AMD GPUs
- local: installation_gaudi
title: Using TGI with Intel Gaudi
- local: installation_inferentia
title: Using TGI with AWS Inferentia
- local: installation_tpu
title: Using TGI with Google TPUs
- local: installation_intel
title: Using TGI with Intel GPUs
- local: installation
title: Installation from source
- local: multi_backend_support
title: Multi-backend support
- local: architecture
title: Internal Architecture
- local: usage_statistics
title: Usage Statistics
title: Getting started
- sections:
- local: basic_tutorials/consuming_tgi
title: Consuming TGI
- local: basic_tutorials/preparing_model
title: Preparing Model for Serving
- local: basic_tutorials/gated_model_access
title: Serving Private & Gated Models
- local: basic_tutorials/using_cli
title: Using TGI CLI
- local: basic_tutorials/non_core_models
title: Non-core Model Serving
- local: basic_tutorials/safety
title: Safety
- local: basic_tutorials/using_guidance
title: Using Guidance, JSON, tools
- local: basic_tutorials/visual_language_models
title: Visual Language Models
- local: basic_tutorials/monitoring
title: Monitoring TGI with Prometheus and Grafana
- local: basic_tutorials/train_medusa
title: Train Medusa
title: Tutorials
- sections:
- local: backends/neuron
title: Neuron
- local: backends/trtllm
title: TensorRT-LLM
- local: backends/llamacpp
title: Llamacpp
title: Backends
- sections:
- local: reference/launcher
title: All TGI CLI options
- local: reference/metrics
title: Exported Metrics
- local: reference/api_reference
title: API Reference
title: Reference
- sections:
- local: conceptual/chunking
title: V3 update, caching and chunking
- local: conceptual/streaming
title: Streaming
- local: conceptual/quantization
title: Quantization
- local: conceptual/tensor_parallelism
title: Tensor Parallelism
- local: conceptual/paged_attention
title: PagedAttention
- local: conceptual/safetensors
title: Safetensors
- local: conceptual/flash_attention
title: Flash Attention
- local: conceptual/speculation
title: Speculation (Medusa, ngram)
- local: conceptual/guidance
title: How Guidance Works (via outlines)
- local: conceptual/lora
title: LoRA (Low-Rank Adaptation)
- local: conceptual/external
title: External Resources
title: Conceptual Guides