284 Commits

Author SHA1 Message Date
drbh
e80b2c21dc fix: bump minijinja version and add test for llama 3.1 tools (#2463)
* fix: support tojson and avoid message indexing issue in template

* fix: prefer minijinja native methods and prefer workspace level dependency

* fix: adjust comment typo
2024-09-25 06:11:21 +00:00
drbh
73ebbd05f8 Pr 2451 ci branch (#2454)
* fix[router]: Fix tools not passed in chat template

Signed-off-by: GitHub <noreply@github.com>

* feat: improve default tool serialization and lints

* feat: refactor tool logic to include notify_error in prompt and adjust typing

* fix: adjust non tool template apply

* fix: simplify tool grammar logic and improve schema

* feat: avoid skip tool test and avoid empty tool prompts

* fix: increase test client timeout for grammar compilation tests

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-09-25 06:10:59 +00:00
Hugo Larcher
53fdbe617d doc: Add metrics documentation and add a 'Reference' section (#2230)
* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 06:10:13 +00:00
drbh
1f8c0f83e3 Pr 2395 ci run (#2406)
* fix(router): Fix appending to message content

* feat: add message and chat template test

---------

Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-09-25 06:06:17 +00:00
drbh
96e8fa37b0 fix: improve completions to send a final chunk with usage details (#2336)
* fix: improve completions to send a final chunk with usage details

* fix: include finish reason string

* fix: remove dev debug trait and unneeded mut

* fix: update openapi schema
2024-09-25 06:06:17 +00:00
drbh
8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403)
* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test
2024-09-25 06:05:43 +00:00
Nicolas Patry
6393cdee63 Keeping the benchmark somewhere (#2401)
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:05:43 +00:00
Daniël de Kok
f586cc7f0c Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`.
In this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen previously. If a new prefill is a prefix of a previously seen
prefill, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
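
The lookup described above can be illustrated with a minimal sketch. It is not the `RadixAllocator` from this change: it uses a plain per-token trie instead of a compressed radix trie, it does not allocate or free KV blocks, and the names `PrefixTrie`, `insert` and `longest_prefix_len` are hypothetical. It only shows how a router could derive a `prefix_len` from previously seen prefills.

```rust
use std::collections::HashMap;

/// One trie node; each edge is labelled with a single token id.
/// (Hypothetical type, not part of the router's API.)
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
}

/// Remembers every prefill seen so far, keyed by its token ids.
#[derive(Default)]
struct PrefixTrie {
    root: TrieNode,
}

impl PrefixTrie {
    /// Record a prefill (a sequence of token ids) as seen.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &token in tokens {
            // Walk down the trie, creating missing nodes along the way.
            node = node.children.entry(token).or_default();
        }
    }

    /// Number of leading tokens of `tokens` that match a previously
    /// inserted prefill. A result greater than 0 plays the role of
    /// `prefix_len>0` in the description above.
    fn longest_prefix_len(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut len = 0;
        for token in tokens {
            match node.children.get(token) {
                Some(child) => {
                    node = child;
                    len += 1;
                }
                None => break,
            }
        }
        len
    }
}
```

A caller would query `longest_prefix_len` for each incoming prefill before calling `insert` on it. A backend that ignores the resulting length still behaves correctly, which matches the note above about missing backend support.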
2024-09-25 06:05:08 +00:00
drbh
959add5e9b feat: add guideline to chat request and template (#2391)
* feat: add guideline to chat request and template

* fix: add template test and update docs
2024-09-25 06:04:51 +00:00
Nicolas Patry
849bd93dc3 Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backends (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-09-25 06:04:51 +00:00
drbh
6f2a468a64 Pr 2352 ci branch (#2382)
* Fix unsigned integer underflow

Passing --max-batch-size to the launcher actually had no effect
because after a few requests the max_size passed to State::next_batch
would underflow, becoming a large positive number.

In the scheduler, as soon as the cached batch size reached the
max_batch_size, the max_size passed to next_batch becomes 0.
Since the only check in that function is
```
if Some(batch_requests.len()) == max_size {
    break;
}
```
and it's called after the `batch_requests.len()` has
become 1, it doesn't do anything to prevent more than 0
requests from being batched.

Now we have a cached batch in the server that is larger than
max_batch_size, and `max_size - batch_size as usize`
underflows.
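
One way to avoid both problems is sketched below. This is an illustration under assumptions, not the patch applied in this commit: the helper names `remaining_capacity` and `can_add_request` are hypothetical. The subtraction saturates at 0 instead of wrapping, and the length check uses `<` so that a budget of 0 rejects the first request.

```rust
/// Remaining room for new requests, given the configured cap and the size
/// of the batch already cached on the server. `None` means no cap is set.
/// (Hypothetical helper, not the scheduler's actual function.)
fn remaining_capacity(max_batch_size: Option<usize>, cached_batch_size: usize) -> Option<usize> {
    // `saturating_sub` clamps at 0, so a cached batch that is already larger
    // than the cap cannot turn into a huge positive budget.
    max_batch_size.map(|max| max.saturating_sub(cached_batch_size))
}

/// Whether a batch that currently holds `current_len` requests may take
/// one more under the budget `max_size`.
fn can_add_request(current_len: usize, max_size: Option<usize>) -> bool {
    match max_size {
        // Checked before pushing, with `<` instead of `==`: a budget of 0
        // then admits no request at all.
        Some(max) => current_len < max,
        None => true,
    }
}
```

With a cap of 4 and a cached batch of 6, `remaining_capacity` returns `Some(0)`, whereas a wrapping `4 - 6` on `usize` would yield `usize::MAX - 1`.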

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

* fix: update v3 scheduler and ensure max_batch_size > 0

---------

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
2024-09-25 06:01:59 +00:00
drbh
3ea8e8a2d5 add gptj modeling in TGI #2366 (CI RUN) (#2372)
* add gptj modeling

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
5400c7155d feat: return the generated text when parsing fails (#2353) 2024-09-25 05:55:39 +00:00
drbh
b4562e1369 feat: prefer stop over eos_token to align with openai finish_reason (#2344) 2024-09-25 05:55:39 +00:00
drbh
88e07f12cc feat: implement a templated endpoint for visibility into chat requests (#2333)
* feat: implement a templated endpoint for visibility into chat requests

* feat: improve to tokenize too

* fix: adjust return type

* feat: simplify prepare_chat_input logic and adjust start stop chars
2024-09-25 05:55:39 +00:00
Erik Kaunismäki
3c4f816ae3 refactor usage stats (#2339)
* refactor usage stats

* Update docs/source/usage_statistics.md

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* Update router/src/server.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* changes based on feedback

* run python3 update_doc.py

* fix pre-commit

* Update router/src/server.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* delete option around usage stats arg

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 05:55:39 +00:00
Nicolas Patry
120d5773e8 Rebase TRT-llm (#2331)
* wip

wip

refacto

refacto

Initial setup for CXX binding to TRTLLM

Working FFI call for TGI and TRTLLM backend

Remove unused parameters and force tokenizer name to be set

Overall build TRTLLM and deps through CMake build system

Enable end to end CMake build

First version loading engines and making it ready for inference

Remembering to check how we can detect support for chunked context

Move to latest TensorRT-LLM version

Specify which default log level to use depending on CMake build type

make leader executor mode working

unconditionally call InitializeBackend on the FFI layer

bind to CUDA::nvml to retrieve compute capabilities at runtime

updated logic and comment to detect cuda compute capabilities

implement the Stream method to send new tokens through a callback

use spdlog release 1.14.1 moving forward

update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c

correctly tell cmake to build dependent tensorrt-llm required libraries

create cmake install target to put everything relevant in installation folder

add auth_token CLI argument to provide hf hub authentication token

allow converting huggingface::tokenizers error to TensorRtLlmBackendError

use correct include for spdlog

include guard to build example in cmakelists

working setup of the ffi layer

remove fmt import

use external fmt lib

end to end ffi flow working

make sure to track include/ffi.h to trigger rebuild from cargo

impl the rust backend which currently cannot move the actual computation in background thread

expose shutdown function at ffi layer

impl RwLock scenario for TensorRtLlmBackend

oops missing c++ backend definitions

compute the number of maximum new tokens for each request independently

make sure the context is not dropped in the middle of the async decoding.

remove unnecessary log

add all the necessary plumbing to return the generated content

update invalid doc in cpp file

correctly forward back the log probabilities

remove unneeded scope variable for now

refactor Stream impl for Generation to factorise code

expose the internal missing start/queue timestamp

forward tgi parameters rep/freq penalty

add some more validation about grammar not supported

define a shared struct to hold the result of a decoding step

expose information about potential error happening while decoding

remove logging

add logging in case of decoding error

make sure executor_worker is provided

add initial Dockerfile for TRTLLM backend

add some more information in CMakeLists.txt to correctly install executorWorker

add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper

simplify prebuilt trtllm libraries name definition

do the same name definition stuff for tensorrt_llm_executor_static

leverage pkg-config to probe libraries paths and reuse new install structure from cmake

fix bad copy/paste missing nvinfer linkage direction

align all the linker search dependency

add missing pkgconfig folder for MPI in Dockerfile

correctly setup linking search path for runtime layer

fix missing / before tgi lib path

adding missing ld_library_path for cuda stubs in Dockerfile

update tgi entrypoint

commenting out Python part for TensorRT installation

refactored docker image

move to TensorRT-LLM v0.11.0

make docker linter happy with same capitalization rule

fix typo

refactor the compute capabilities detection along with num gpus

update TensorRT-LLM to latest version

update TensorRT install script to latest

update build.rs to link to cuda 12.5

add missing dependent libraries for linking

clean up a bit

install to decoder_attention target

add some custom stuff for nccl linkage

fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time

use std::env::consts::ARCH

make sure variable live long enough...

look for cuda 12.5

add some more basic info in README.md

* Rebase.

* Fix autodocs.

* Let's try to enable trtllm backend.

* Ignore backends/v3 by default.

* Fixing client.

* Fix makefile + autodocs.

* Updating the schema thing + redocly.

* Fix trtllm lint.

* Adding pb files ?

* Remove cargo fmt temporarily.

* ?

* Tmp.

* Remove both check + clippy  ?

* Backporting telemetry.

* Backporting 457fb0a1

* Remove PB from git.

* Fixing PB with default member backends/client

* update TensorRT-LLM to latest version

* provided None for api_key

* link against libtensorrt_llm and not libtensorrt-llm

---------

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
2024-09-25 05:55:39 +00:00
drbh
a574381cb4 fix: reject grammars without properties (#2309) 2024-09-25 05:50:17 +00:00
Erik Kaunismäki
2c1d280fae Run ci api key (#2315)
* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.

* change name to info routes

* Fix comment

* convert strings to lowercase for case insensitive comparison

* convert header to string

* fixes and update docs

* update docs again

* revert wrong update

---------

Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
2024-09-25 05:46:41 +00:00
drbh
898a892082 fix: adjust default tool choice (#2244)
* fix: adjust default tool choice

* feat: improve tool choice syntax and response parsing/errors

* fix: remove dev tests

* feat: add ToolChoice to docs
2024-09-25 05:27:40 +00:00
Erik Kaunismäki
66f3de583e usage stats and crash reports (#2220)
* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of whole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
2024-09-25 05:27:40 +00:00
Erik Kaunismäki
271ebb7e20 fix custom cache dir (#2226)
* fix to not ignore HUGGINGFACE_HUB_CACHE in cache

* delete printlns

* delete newlines

* maybe fix trailing whitespace
2024-09-25 05:27:40 +00:00
drbh
5029e7215c fix: append DONE message to chat stream (#2221)
* fix: append DONE message to chat stream

* fix: update completions endpoint
2024-09-25 05:27:40 +00:00
Nicolas Patry
cc4fceb21d Updating the self check (#2209)
* Updating the self check

* Fix.

* Revert the CLI .

* cli.

* Space.

* Revert cargo update.
2024-09-25 05:27:40 +00:00
Nicolas Patry
591f9f70eb Adding sanity check to openapi docs. 2024-09-25 05:26:10 +00:00
drbh
48f1196da8 feat: use model name as adapter id in chat endpoints (#2128) 2024-09-25 05:21:34 +00:00
Wang, Yi
74edda9c23 update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190)
update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:21:34 +00:00
Nicolas Patry
e93c830e66 Fixing missing object field for regular completions. (#2175)
* Fixing missing `object` field for regular completions.

* Fixing docs by re-adding missing `Prompt`.
2024-09-24 04:00:11 +00:00
Nicolas Patry
878491cd5b Revert "Fixing missing object field for regular completions."
This reverts commit 2bbb7fa4b2.
2024-09-24 03:59:15 +00:00
Nicolas Patry
b6c8984658 Fixing missing object field for regular completions. 2024-09-24 03:59:15 +00:00
drbh
233e46409a feat: improve update_docs for openapi schema (#2169)
* feat: add pre commit step to force schema update when router changes

* fix: prefer improved update_doc and start server and compare

* fix: adjust typo

* fix: adjust revert typo

* fix: update workflow to use update_doc md command

* feat: improve workflow to check openapi schema too

* fix: adjust timeout for CI

* fix: adjust raise condition and install server in ci

* fix: install protoc before server

* feat: improve update doc and add command to print router schema

* fix: adjust autodoc workflow

* fix: explicitly install protoc and python

* fix: allow trailing space in openapi schema diff
2024-09-24 03:59:15 +00:00
Nicolas Patry
b80bd724e1 Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase..

Less intrusive.

Revert changes in modeling.

Speedup flashdecoding.

Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-09-24 03:58:13 +00:00
drbh
381c5c02a6 fix: prefer serde structs over custom functions (#2127)
* fix: prefer enum for chat object

* fix: adjust typo

* fix: enum CompletionType not ObjectType

* fix: adjust typo

* feat: leverage serde for conditional deser

* fix: adjust HubTokenizerConfig after rebase

* fix: update create_post_processor logic for token type

* fix: adjust unwrap syntax in template

* Fixing the post processor.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-24 03:57:32 +00:00
Nicolas Patry
03691f6d34 Fixing clippy. (#2149) 2024-09-24 03:57:32 +00:00
Wang, Yi
8721b601e3 fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148)
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-24 03:57:32 +00:00
drbh
69514868ee fix: refactor post_processor logic and add test (#2137)
* fix: refactor post_processor logic and add test

* fix: remove dev comment

* fix: adjust when post_processor is overridden and  improve create_post_processor
2024-09-24 03:57:07 +00:00
Nicolas Patry
befe60b566 Fixing malformed rust tokenizers (#2134)
* Fixing malformed rust tokenizers

* Fix for deepseek too.
2024-09-24 03:57:07 +00:00
Daniël de Kok
d731866245 Idefics2: sync added image tokens with transformers (#2080)
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.
2024-09-24 03:56:28 +00:00
Nicolas Patry
7045598b20 Fixing prom leak by upgrading. (#2129) 2024-09-24 03:55:38 +00:00
drbh
399919d715 fix: simplify kserve endpoint and fix imports (#2119) 2024-09-24 03:55:04 +00:00
drbh
8a155b2d5b Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-09-24 03:55:04 +00:00
Nicolas Patry
8980bf43d7 Fix CI . (#2118)
Fix clippy.
2024-09-24 03:53:26 +00:00
sunxichen
a9faabc374 fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089)
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
2024-09-24 03:51:26 +00:00
KevinDuffy94
76c6a5ca2a Add OTLP Service Name Environment Variable (#2076)
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option
2024-09-24 03:51:26 +00:00
Lucain
931ff16c7a Support HF_TOKEN environment variable (#2066)
* Support HF_TOKEN environment variable

* Load test.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-24 03:50:38 +00:00
Ziru Niu
b3dadbde06 fix build.rs watch files (#2072) 2024-09-24 03:43:31 +00:00
drbh
7ce29b1ef2 implement Open Inference Protocol endpoints (#1942)
* feat: add kserve feature and basic routes

* feat: implement infer endpoint wrapper around generate

* fix: refactor and improve types

* fix: improve infer and simplify

* fix: cleanup and improve api docs

* fix: refactor and encapsulate kserve feat in file

* fix: remove typos after rebase
2024-09-24 03:42:29 +00:00
drbh
d0a1d50fd3 PR #2049 CI run (#2054)
* Use minijinja's pycompat mode for python methods

* fix: cargo fmt lint for pre commit

---------

Co-authored-by: Armin Ronacher <armin.ronacher@active-4.com>
2024-09-24 03:42:29 +00:00
drbh
99c947452d Support chat response format (#2046)
* feat: support response_format in chat

* fix: adjust typos

* fix: add trufflehog lint
2024-09-24 03:42:29 +00:00
OlivierDehaene
20df9234a9 feat: move allocation logic to rust (#1835)
Close #2007
2024-09-24 03:34:15 +00:00