Text Generation Inference

Architecture diagram

A Rust and gRPC server for text generation inference. Used in production at HuggingFace to power Bloom, BloomZ and MT0-XXL api-inference widgets.

Features

- Dynamic batching of incoming requests for increased total throughput
- Quantization with bitsandbytes
- Log probabilities and stop sequences

Officially supported models

- BLOOM
- BLOOM-560m

Other models are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
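
For instance, loading an arbitrary causal LM this way and generating a few tokens looks roughly like the sketch below (the model id is only an example, and accelerate must be installed for device_map="auto"):

# Best-effort fallback: load any causal LM with device_map="auto" and generate.
# The model id below is just an example; accelerate is required for device_map.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Testing API", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))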

Load Tests for BLOOM

See k6/load_test.js

                     avg     min       med     max     p(90)   p(95)   RPS
Original code        8.9s    1s        9.12s   16.69s  13.7s   14.26s  5.9
New batching logic   5.44s   959.53ms  5.28s   13.12s  7.78s   8.92s   9.08
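
The load test itself is written for k6; purely as an illustration, a comparable measurement can be sketched in Python (assuming the server is listening on 127.0.0.1:3000 as in the Test section below, and that the requests package is available):

# Fire concurrent /generate requests and report rough latency numbers.
# Concurrency and request count are arbitrary; this is not the real benchmark.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:3000/generate"
PAYLOAD = {"inputs": "Testing API", "parameters": {"max_new_tokens": 9}}

def timed_request(_):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=60).raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = sorted(pool.map(timed_request, range(64)))

print(f"avg={sum(latencies) / len(latencies):.2f}s "
      f"p90={latencies[int(0.9 * len(latencies))]:.2f}s")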

Install

make install

Run

BLOOM-560m

make run-bloom-560m

BLOOM

First you need to download the weights:

make download-bloom
make run-bloom # Requires 8xA100 80GB

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-bloom-quantize # Requires 8xA100 40GB
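
Outside of this server, the same bitsandbytes 8-bit idea can be reproduced directly with transformers; the snippet below is only an illustration of the mechanism, not the server's actual loading code:

# 8-bit weight loading via bitsandbytes; requires a CUDA GPU and the
# bitsandbytes package. Int8 weights roughly halve VRAM versus fp16.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)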

Test

curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
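
The same request from Python (a sketch using the requests package; the response schema isn't documented here, so the result is printed as raw JSON):

# Python equivalent of the curl call above.
import requests

response = requests.post(
    "http://127.0.0.1:3000/generate",
    json={"inputs": "Testing API", "parameters": {"max_new_tokens": 9}},
)
response.raise_for_status()
print(response.json())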

Develop

make server-dev
make router-dev