Mirror of https://github.com/huggingface/text-generation-inference.git

Commit a19e49b4a1: Merge branch 'main' into nice-snippets
.github/workflows/build_documentation.yml (vendored, 2 changes)

@@ -2,6 +2,8 @@ name: Build documentation

 on:
   push:
+    paths:
+      - "docs/source/**"
     branches:
       - main
       - doc-builder*

.github/workflows/build_pr_documentation.yml (vendored, 2 changes)

@@ -2,6 +2,8 @@ name: Build PR Documentation

 on:
   pull_request:
+    paths:
+      - "docs/source/**"

 concurrency:
   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}

@@ -15,4 +15,6 @@
       title: Preparing Model for Serving
     - local: basic_tutorials/gated_model_access
       title: Serving Private & Gated Models
+    - local: basic_tutorials/using_cli
+      title: Using TGI CLI
   title: Tutorials

@@ -6,7 +6,7 @@ There are many ways you can consume Text Generation Inference server in your app

 After the launch, you can query the model using either the `/generate` or `/generate_stream` routes:

-```shell
+```bash
 curl 127.0.0.1:8080/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
@@ -20,14 +20,13 @@ curl 127.0.0.1:8080/generate \

 You can simply install `huggingface-hub` package with pip.

-```python
+```bash
 pip install huggingface-hub
 ```

 Once you start the TGI server, instantiate `InferenceClient()` with the URL to the endpoint serving the model. You can then call `text_generation()` to hit the endpoint through Python.

 ```python
 from huggingface_hub import InferenceClient

 client = InferenceClient(model=URL_TO_ENDPOINT_SERVING_TGI)

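As a point of reference, here is a minimal sketch of the `text_generation()` call described above, assuming the server is reachable at `http://127.0.0.1:8080`; the prompt and generation parameters are illustrative.

```python
# Sketch: call text_generation() against a running TGI endpoint.
# The endpoint URL and generation parameters below are illustrative.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

generated = client.text_generation("What is Deep Learning?", max_new_tokens=20)
print(generated)
```
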
@@ -2,4 +2,23 @@

 If the model you wish to serve is behind gated access or the model repository on Hugging Face Hub is private, and you have access to the model, you can provide your Hugging Face Hub access token. You can generate and copy a read token from [Hugging Face Hub tokens page](https://huggingface.co/settings/tokens)

-If you're using the CLI, set the `HUGGING_FACE_HUB_TOKEN` environment variable.
+If you're using the CLI, set the `HUGGING_FACE_HUB_TOKEN` environment variable. For example:
+
+```
+export HUGGING_FACE_HUB_TOKEN=<YOUR READ TOKEN>
+```
+
+If you would like to do it through Docker, you can provide your token by specifying `HUGGING_FACE_HUB_TOKEN` as shown below.
+
+```bash
+model=meta-llama/Llama-2-7b-chat-hf
+volume=$PWD/data
+token=<your READ token>
+
+docker run --gpus all \
+    --shm-size 1g \
+    -e HUGGING_FACE_HUB_TOKEN=$token \
+    -p 8080:80 \
+    -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.0 \
+    --model-id $model
+```

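If you prefer to fetch gated weights ahead of time into the shared volume, a hedged sketch using `huggingface_hub.snapshot_download` could look like the following; the repository id mirrors the Docker example above, and the local cache directory mapping is an assumption.

```python
# Sketch: pre-download gated weights with a read token before starting the container.
# The cache directory below is assumed to be the $PWD/data volume mounted at /data.
import os
from huggingface_hub import snapshot_download

token = os.environ["HUGGING_FACE_HUB_TOKEN"]  # your READ token

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    token=token,
    cache_dir="data",
)
```
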
docs/source/basic_tutorials/using_cli.md (new file, 35 lines)

@@ -0,0 +1,35 @@
+# Using TGI CLI
+
+You can use TGI command-line interface (CLI) to download weights, serve and quantize models, or get information on serving parameters. To install the CLI, please refer to [the installation section](./installation#install-cli).
+
+`text-generation-server` lets you download the model with `download-weights` command like below 👇
+
+```bash
+text-generation-server download-weights MODEL_HUB_ID
+```
+
+You can also use it to quantize models like below 👇
+
+```bash
+text-generation-server quantize MODEL_HUB_ID OUTPUT_DIR
+```
+
+You can use `text-generation-launcher` to serve models.
+
+```bash
+text-generation-launcher --model-id MODEL_HUB_ID --port 8080
+```
+
+There are many options and parameters you can pass to `text-generation-launcher`. The documentation for CLI is kept minimal and intended to rely on self-generating documentation, which can be found by running
+
+```bash
+text-generation-launcher --help
+```
+
+You can also find it hosted in this [Swagger UI](https://huggingface.github.io/text-generation-inference/).
+
+Same documentation can be found for `text-generation-server`.
+
+```bash
+text-generation-server --help
+```

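For illustration only, the same CLI workflow could be scripted from Python with `subprocess`; `MODEL_HUB_ID` is a placeholder, as in the snippets above.

```python
# Sketch: drive the TGI CLI commands shown above from a Python script.
import subprocess

model_id = "MODEL_HUB_ID"  # placeholder model id

# Download the weights, then serve the model on port 8080.
subprocess.run(["text-generation-server", "download-weights", model_id], check=True)
server = subprocess.Popen(
    ["text-generation-launcher", "--model-id", model_id, "--port", "8080"]
)

# ... query 127.0.0.1:8080/generate once the launcher reports the server is ready ...
server.terminate()
```
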
@@ -4,8 +4,20 @@ This section explains how to install the CLI tool as well as installing TGI from

 ## Install CLI

-TODO
+You can use TGI command-line interface (CLI) to download weights, serve and quantize models, or get information on serving parameters.
+
+To install the CLI, you need to first clone the TGI repository and then run `make`.
+
+```bash
+git clone https://github.com/huggingface/text-generation-inference.git && cd text-generation-inference
+make install
+```
+
+If you would like to serve models with custom kernels, run
+
+```bash
+BUILD_EXTENSIONS=True make install
+```

 ## Local Installation from Source

@@ -16,7 +28,7 @@ Text Generation Inference is available on pypi, conda and GitHub.
 To install and launch locally, first [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
 Python 3.9, e.g. using conda:

-```shell
+```bash
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

 conda create -n text-generation-inference python=3.9
@@ -27,7 +39,7 @@ You may also need to install Protoc.

 On Linux:

-```shell
+```bash
 PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
 curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
 sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
@@ -37,21 +49,22 @@ rm -f $PROTOC_ZIP

 On MacOS, using Homebrew:

-```shell
+```bash
 brew install protobuf
 ```

 Then run to install Text Generation Inference:

-```shell
-BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
+```bash
+git clone https://github.com/huggingface/text-generation-inference.git && cd text-generation-inference
+BUILD_EXTENSIONS=True make install
 ```

 <Tip warning={true}>

 On some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

-```shell
+```bash
 sudo apt-get install libssl-dev gcc -y
 ```

@@ -59,13 +72,8 @@ sudo apt-get install libssl-dev gcc -y

 Once installation is done, simply run:

-```shell
+```bash
 make run-falcon-7b-instruct
 ```

 This will serve Falcon 7B Instruct model from the port 8080, which we can query.
-
-To see all options to serve your models, check in the [codebase](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or the CLI:
-```
-text-generation-launcher --help
-```

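Once `make run-falcon-7b-instruct` is serving on port 8080, a small sketch of querying it from Python might look like this; the payload mirrors the curl examples in these docs.

```python
# Sketch: query the locally served Falcon 7B Instruct model over the /generate route.
import requests

response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
    headers={"Content-Type": "application/json"},
)
print(response.json()["generated_text"])
```
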
@@ -4,7 +4,7 @@ The easiest way of getting started is using the official Docker container. Insta

 Let's say you want to deploy [Falcon-7B Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) model with TGI. Here is an example on how to do that:

-```shell
+```bash
 model=tiiuae/falcon-7b-instruct
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

@@ -65,7 +65,6 @@ query().then((response) => {
   console.log(JSON.stringify(response));
 });
 /// {"generated_text":"\n\nDeep Learning is a subset of Machine Learning that is concerned with the development of algorithms that can"}
-
 ```

 </js>
@@ -85,8 +84,8 @@ curl 127.0.0.1:8080/generate \

 To see all possible deploy flags and options, you can use the `--help` flag. It's possible to configure the number of shards, quantization, generation parameters, and more.

-```shell
+```bash
 docker run ghcr.io/huggingface/text-generation-inference:1.0.0 --help
 ```

 </Tip>

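For token-by-token output over the `/generate_stream` route mentioned earlier, a hedged sketch with `InferenceClient` and `stream=True` follows; the endpoint URL and parameters are illustrative.

```python
# Sketch: stream tokens from a local TGI deployment instead of waiting for the full text.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

for token in client.text_generation("What is Deep Learning?", max_new_tokens=20, stream=True):
    print(token, end="", flush=True)
```
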
@@ -54,15 +54,13 @@ example = json ! ({"error": "Incomplete generation"})),
 )]
 #[instrument(skip(infer, req))]
 async fn compat_generate(
-    default_return_full_text: Extension<bool>,
+    Extension(default_return_full_text): Extension<bool>,
     infer: Extension<Infer>,
-    req: Json<CompatGenerateRequest>,
+    Json(mut req): Json<CompatGenerateRequest>,
 ) -> Result<Response, (StatusCode, Json<ErrorResponse>)> {
-    let mut req = req.0;
-
     // default return_full_text given the pipeline_tag
     if req.parameters.return_full_text.is_none() {
-        req.parameters.return_full_text = Some(default_return_full_text.0)
+        req.parameters.return_full_text = Some(default_return_full_text)
     }

     // switch on stream
@@ -71,9 +69,9 @@ async fn compat_generate(
             .await
             .into_response())
     } else {
-        let (headers, generation) = generate(infer, Json(req.into())).await?;
+        let (headers, Json(generation)) = generate(infer, Json(req.into())).await?;
         // wrap generation inside a Vec to match api-inference
-        Ok((headers, Json(vec![generation.0])).into_response())
+        Ok((headers, Json(vec![generation])).into_response())
     }
 }

@@ -135,7 +133,7 @@ example = json ! ({"error": "Incomplete generation"})),
 #[instrument(
     skip_all,
     fields(
-        parameters = ? req.0.parameters,
+        parameters = ? req.parameters,
         total_time,
         validation_time,
         queue_time,
@@ -146,29 +144,29 @@ seed,
 )]
 async fn generate(
     infer: Extension<Infer>,
-    req: Json<GenerateRequest>,
+    Json(req): Json<GenerateRequest>,
 ) -> Result<(HeaderMap, Json<GenerateResponse>), (StatusCode, Json<ErrorResponse>)> {
     let span = tracing::Span::current();
     let start_time = Instant::now();
     metrics::increment_counter!("tgi_request_count");

-    tracing::debug!("Input: {}", req.0.inputs);
+    tracing::debug!("Input: {}", req.inputs);

-    let compute_characters = req.0.inputs.chars().count();
+    let compute_characters = req.inputs.chars().count();
     let mut add_prompt = None;
-    if req.0.parameters.return_full_text.unwrap_or(false) {
-        add_prompt = Some(req.0.inputs.clone());
+    if req.parameters.return_full_text.unwrap_or(false) {
+        add_prompt = Some(req.inputs.clone());
     }

-    let details = req.0.parameters.details || req.0.parameters.decoder_input_details;
+    let details = req.parameters.details || req.parameters.decoder_input_details;

     // Inference
-    let (response, best_of_responses) = match req.0.parameters.best_of {
+    let (response, best_of_responses) = match req.parameters.best_of {
         Some(best_of) if best_of > 1 => {
-            let (response, best_of_responses) = infer.generate_best_of(req.0, best_of).await?;
+            let (response, best_of_responses) = infer.generate_best_of(req, best_of).await?;
             (response, Some(best_of_responses))
         }
-        _ => (infer.generate(req.0).await?, None),
+        _ => (infer.generate(req).await?, None),
     };

     // Token details
@@ -321,7 +319,7 @@ content_type = "text/event-stream"),
 #[instrument(
     skip_all,
     fields(
-        parameters = ? req.0.parameters,
+        parameters = ? req.parameters,
         total_time,
         validation_time,
         queue_time,
@@ -331,8 +329,8 @@ seed,
     )
 )]
 async fn generate_stream(
-    infer: Extension<Infer>,
-    req: Json<GenerateRequest>,
+    Extension(infer): Extension<Infer>,
+    Json(req): Json<GenerateRequest>,
 ) -> (
     HeaderMap,
     Sse<impl Stream<Item = Result<Event, Infallible>>>,
@@ -341,9 +339,9 @@ async fn generate_stream(
     let start_time = Instant::now();
     metrics::increment_counter!("tgi_request_count");

-    tracing::debug!("Input: {}", req.0.inputs);
+    tracing::debug!("Input: {}", req.inputs);

-    let compute_characters = req.0.inputs.chars().count();
+    let compute_characters = req.inputs.chars().count();

     let mut headers = HeaderMap::new();
     headers.insert("x-compute-type", "gpu+optimized".parse().unwrap());
@@ -359,24 +357,24 @@ async fn generate_stream(
         let mut error = false;

         let mut add_prompt = None;
-        if req.0.parameters.return_full_text.unwrap_or(false) {
-            add_prompt = Some(req.0.inputs.clone());
+        if req.parameters.return_full_text.unwrap_or(false) {
+            add_prompt = Some(req.inputs.clone());
         }
-        let details = req.0.parameters.details;
+        let details = req.parameters.details;

-        let best_of = req.0.parameters.best_of.unwrap_or(1);
+        let best_of = req.parameters.best_of.unwrap_or(1);
         if best_of != 1 {
             let err = InferError::from(ValidationError::BestOfStream);
             metrics::increment_counter!("tgi_request_failure", "err" => "validation");
             tracing::error!("{err}");
             yield Ok(Event::from(err));
-        } else if req.0.parameters.decoder_input_details {
+        } else if req.parameters.decoder_input_details {
             let err = InferError::from(ValidationError::PrefillDetailsStream);
             metrics::increment_counter!("tgi_request_failure", "err" => "validation");
             tracing::error!("{err}");
             yield Ok(Event::from(err));
         } else {
-            match infer.generate_stream(req.0).instrument(info_span!(parent: &span, "async_stream")).await {
+            match infer.generate_stream(req).instrument(info_span!(parent: &span, "async_stream")).await {
                 // Keep permit as long as generate_stream lives
                 Ok((_permit, mut response_stream)) => {
                     // Server-Sent Event stream