Text Generation Inference

Architecture diagram

A Rust and gRPC server for text generation inference. Used in production at HuggingFace to power Bloom, BloomZ and MT0-XXL api-inference widgets.

Features

- Dynamic batching of incoming requests for increased total throughput
- Quantization with bitsandbytes
- Log probabilities and stop sequences

Officially supported models

- BLOOM
- BLOOM-560m

Other models are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
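
For instance, loading an arbitrary causal LM this way and generating a few tokens looks roughly like the sketch below (the model id is only an example, and accelerate must be installed for device_map="auto"):

# Best-effort fallback: load any causal LM with device_map="auto" and generate.
# The model id below is just an example; accelerate is required for device_map.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Testing API", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))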

Load Tests for BLOOM

See k6/load_test.js

                     avg     min       med     max     p(90)   p(95)   RPS
Original code        8.9s    1s        9.12s   16.69s  13.7s   14.26s  5.9
New batching logic   5.44s   959.53ms  5.28s   13.12s  7.78s   8.92s   9.08
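
The load test itself is written for k6; purely as an illustration, a comparable measurement can be sketched in Python (assuming the server is listening on 127.0.0.1:3000 as in the Test section below, and that the requests package is available):

# Fire concurrent /generate requests and report rough latency numbers.
# Concurrency and request count are arbitrary; this is not the real benchmark.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:3000/generate"
PAYLOAD = {"inputs": "Testing API", "parameters": {"max_new_tokens": 9}}

def timed_request(_):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=60).raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = sorted(pool.map(timed_request, range(64)))

print(f"avg={sum(latencies) / len(latencies):.2f}s "
      f"p90={latencies[int(0.9 * len(latencies))]:.2f}s")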

Install

make install

Run

BLOOM-560m

make run-bloom-560m

BLOOM

First you need to download the weights:

make download-bloom
make run-bloom # Requires 8xA100 80GB

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-bloom-quantize # Requires 8xA100 40GB
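
Outside of this server, the same bitsandbytes 8-bit idea can be reproduced directly with transformers; the snippet below is only an illustration of the mechanism, not the server's actual loading code:

# 8-bit weight loading via bitsandbytes; requires a CUDA GPU and the
# bitsandbytes package. Int8 weights roughly halve VRAM versus fp16.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)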

Test

curl 127.0.0.1:3000/generate \
    -v \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
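
The same request from Python (a sketch using the requests package; the response schema isn't documented here, so the result is printed as raw JSON):

# Python equivalent of the curl call above.
import requests

response = requests.post(
    "http://127.0.0.1:3000/generate",
    json={"inputs": "Testing API", "parameters": {"max_new_tokens": 9}},
)
response.raise_for_status()
print(response.json())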

Develop

make server-dev
make router-dev