Commit Graph

74 Commits

Author SHA1 Message Date
drbh
5836a1cc69 feat: conditionally toggle chat on invocations route (#1454)
This PR adds support for reading the `OAI_ENABLED` env var, which
changes the function called when the `/invocations` route is hit.

If `OAI_ENABLED=true` the `chat_completions` method is used otherwise it
defaults to `compat_generate`.
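
A minimal Python sketch of the toggle described above, for illustration only (the router itself is written in Rust). The two handler bodies and the `select_invocations_handler` helper are hypothetical stand-ins; only the `OAI_ENABLED` variable and the handler names `chat_completions` and `compat_generate` come from this PR. The exact-match comparison against `"true"` is an assumption.

```python
import os


def compat_generate(payload):
    # Stand-in for the default generate-style handler.
    return {"handler": "compat_generate", "payload": payload}


def chat_completions(payload):
    # Stand-in for the OpenAI-style chat handler.
    return {"handler": "chat_completions", "payload": payload}


def select_invocations_handler(env=None):
    """Pick the function that serves `/invocations` (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: only the literal value "true" enables the chat handler.
    if env.get("OAI_ENABLED") == "true":
        return chat_completions
    return compat_generate


if __name__ == "__main__":
    handler = select_invocations_handler({"OAI_ENABLED": "true"})
    print(handler.__name__)
```

Passing the environment in as a mapping keeps the selection logic testable without touching the real process environment.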

example running the router
```bash
OAI_ENABLED=true \
  cargo run -- \
  --tokenizer-name mistralai/Mistral-7B-Instruct-v0.2
```

example request
```bash
curl localhost:3000/invocations \
    -X POST \
    -d '{ "model": "tgi", "messages": [ { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": false, "max_tokens": 20, "logprobs": true, "seed": 0 }' \
    -H 'Content-Type: application/json' | jq 
```

Please let me know if any naming changes are needed or if any other
routes need similar functionality.
2024-04-22 11:54:00 +03:00
drbh
77afb882dc feat: support raise_exception, bos and eos tokens (#1450)
This PR adds support for the custom jinja function
`raise_exception` and passes the `bos` and `eos` tokens into the
template.

Additionally this PR adds 3 tests to validate and show examples of what
can and cannot be parsed currently.

```bash
cargo test --package text-generation-router --lib -- infer::tests --nocapture
#     Finished test [unoptimized + debuginfo] target(s) in 7.82s
#      Running unittests src/lib.rs (target/debug/deps/text_generation_router-18a0bbf99c2ca1b4)

# running 3 tests
# test infer::tests::test_chat_template_valid_with_raise ... ok
# test infer::tests::test_chat_template ... ok
# test infer::tests::test_chat_template_invalid_with_raise ... ok

# test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 15 filtered out; finished in 0.00s
```
2024-04-22 11:52:57 +03:00
drbh
76b226b00c feat: supports openai chat completions API (#1427)
This PR makes TGI a drop-in replacement for OpenAI
clients by exposing the same HTTP interface.

Notes
- TGI inits a single model at startup so the `model` field is unused in
HTTP requests.
- `max_tokens` and `stream` should work as expected, but other params may
be unimplemented or unsupported.

General approach
- fetch the `tokenizer_config` at startup from the hub
- pass `tokenizer_config` into `Infer` so we have it at request time
- use the `chat_template` on the config to format chat request
- parse jinja template and render chat string
- pass inputs into existing generate function
- wrap generation output in expected structure before returning
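
The steps above can be sketched roughly as follows in Python, for illustration only (the router itself is Rust). Everything here is an assumption made for the sketch: `render_prompt` is a hand-written stand-in for rendering the model's jinja `chat_template`, the `<|role|>` markers are invented, `generate` is a stub for the existing generate function, and the response fields follow the general shape of an OpenAI chat completion rather than TGI's exact output.

```python
import time
import uuid


def render_prompt(messages):
    # Hypothetical stand-in for rendering the model's jinja chat template.
    # The "<|role|>" markers are invented for this sketch.
    lines = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(lines) + "\n<|assistant|>\n"


def generate(prompt, max_tokens):
    # Stub for the existing generate function; returns (text, finish_reason).
    return "Deep learning is a subset of machine learning.", "length"


def chat_completion(request):
    # 1. Format the chat messages into a single prompt string.
    prompt = render_prompt(request["messages"])
    # 2. Pass the prompt into the existing generation path.
    text, finish_reason = generate(prompt, request.get("max_tokens", 20))
    # 3. Wrap the output in the structure an OpenAI-style client expects.
    return {
        "id": uuid.uuid4().hex,
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.get("model", "tgi"),
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": finish_reason,
            }
        ],
    }


if __name__ == "__main__":
    request = {
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 20,
    }
    print(chat_completion(request)["choices"][0]["message"]["content"])
```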

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```

It is also possible to use the `openai` Python library and change the
base URL.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

for message in chat_completion:
    print(message)

```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)
```

```bash
cd text-generation-inference/server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2
```

Note: many of the existing `chat_templates` use non-standard `jinja`
(i.e. adding a `raise` to the template), which throws an error when
parsing; hence this example uses `upstage/SOLAR-10.7B-Instruct-v1.0`,
since it has a valid template.
```bash
cd text-generation-inference/router
cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0
```

trigger
```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \
    -H 'Content-Type: application/json'
```

^ supports `stream: true` and `stream: false` requests
2024-04-22 11:51:40 +03:00
Nicolas Patry
12cfc7930b Return prompt vs generated tokens. (#1436)
# What does this PR do?

Fixes #637 
 
2024-04-22 11:46:49 +03:00
OlivierDehaene
9aef902982 feat: mixtral (#1328) 2024-04-18 12:39:52 +00:00
Nicolas Patry
a7f52f3812 Speculative (#1308) 2024-04-18 12:39:39 +00:00
Karol Damaszke
d957e32601
Add Habana copyright header (#122)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-04-08 18:06:21 +02:00
Karol Damaszke
252ccde104
Control prefill and decode batch size separately (#6) 2024-01-02 18:21:01 +01:00
Karol Damaszke
b1897acfd6
Calculate token budget with padding to max_input_length (#2) 2023-12-11 09:24:27 +01:00
OlivierDehaene
3b56d7669b
feat: add mistral model (#1071) 2023-09-28 09:55:47 +02:00
Nicolas Patry
a049864270
Preping 1.1.0 (#1066)
# What does this PR do?

Upgrade all relevant versions and dependencies.

2023-09-27 10:40:18 +02:00
Nicolas Patry
211b54ac41
Rebased #617 (#868)
---------

Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>
2023-08-28 11:43:47 +02:00
ivarflakstad
8bdb16ee9a
Use destructuring in router arguments to avoid '.0' (#798)
# What does this PR do?

This is purely code style - not anything important.
Instead of writing `req.0` all over we can use
[destructuring](https://doc.rust-lang.org/rust-by-example/flow_control/match/destructuring/destructure_structures.html)
to access the contained value that we actually want.

(Destructuring in function parameters
[here](https://doc.rust-lang.org/reference/items/functions.html#function-parameters))

@OlivierDehaene
2023-08-10 10:52:50 +02:00
OlivierDehaene
1da642bd0e feat(server): add local prom and health routes if running w/ ngrok 2023-07-21 16:56:30 +02:00
OlivierDehaene
b66b190403
feat(router): ngrok edge (#642) 2023-07-19 11:59:58 +02:00
OlivierDehaene
b4024edd45
feat: better errors for warmup and TP (#575)
Close #571
2023-07-10 14:47:15 +02:00
OlivierDehaene
e28a809004
v0.9.0 (#525) 2023-07-01 19:25:41 +02:00
OlivierDehaene
e74bd41e0f
feat(server): add paged attention to flash models (#516)
Closes #478
2023-06-30 19:09:59 +02:00
Robert Kimball
70f485bf9f
feat(router): add header option to disable buffering for the generate_stream response (#498)
This PR adds an HTTP header option to disable buffering for the
`generate_stream` endpoint response stream.

Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.

Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
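
A minimal Python sketch of the idea, for illustration only (the router itself is Rust). The helper names are hypothetical. The value `no` is nginx's documented way for an upstream response to turn off proxy buffering; the `Content-Type` and `Cache-Control` entries are the usual server-sent-events headers and are assumptions, not taken from this PR.

```python
def sse_response_headers():
    """Headers for a streamed response behind a buffering proxy (hypothetical helper)."""
    return {
        # Usual SSE headers (assumed here, not taken from the PR).
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        # Asks nginx not to buffer this response, so tokens arrive as sent.
        "X-Accel-Buffering": "no",
    }


def format_sse_event(data):
    # One server-sent event: a "data:" line terminated by a blank line.
    return f"data: {data}\n\n"


if __name__ == "__main__":
    for name, value in sse_response_headers().items():
        print(f"{name}: {value}")
    print(format_sse_event('{"token": "Hello"}'), end="")
```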
2023-06-28 11:50:12 +02:00
OlivierDehaene
f59fb8b630
feat(router): add ngrok integration (#453) 2023-06-16 16:25:11 +02:00
OlivierDehaene
895c5f1562
feat(server): only compute prefill logprobs when asked (#406)
Close #288
2023-06-02 17:12:30 +02:00
OlivierDehaene
942005386a
feat(router): log input/ouput at debug level (#364)
@njhill FYI
2023-05-23 20:47:37 +02:00
OlivierDehaene
e250282213
feat(docker): add benchmarking tool to docker image (#298) 2023-05-09 13:19:31 +02:00
Sai Vinay G
926fd9a010
feat(router): Adding response schema for compat_generate (#292) 2023-05-09 12:38:09 +02:00
Nicolas Patry
411b0d4e1f
chore(github): add templates (#264) 2023-05-02 15:43:19 +02:00
Nicolas Patry
db2b4e0754
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-04-26 20:23:54 +02:00
Nicolas Patry
c4fb09f2ae
feat(router): add tests to validation (#237) 2023-04-26 16:14:40 +02:00
OlivierDehaene
8b182eb986
feat(router): add endpoint info to /info route (#228) 2023-04-25 13:11:18 +02:00
OlivierDehaene
ebc74d5666
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-04-24 17:59:00 +02:00
OlivierDehaene
343437c7b5
feat(router): add device and dtype info (#215) 2023-04-21 15:36:29 +02:00
OlivierDehaene
709d8936f6
feat(router): drop requests when client closes the channel (#202) 2023-04-20 11:07:40 +02:00
OlivierDehaene
2475aede61
feat(router): add info route (#196)
close #125
2023-04-18 16:16:06 +02:00
OlivierDehaene
9987960062
feat(router): make router input validation optional (#164) 2023-04-09 20:22:27 +02:00
OlivierDehaene
7dec65a244
fix(router): use buckets for metrics histograms (#163) 2023-04-09 20:13:28 +02:00
OlivierDehaene
d503e8f09d
feat: aws sagemaker compatible image (#147)
The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
2023-03-29 21:38:30 +02:00
OlivierDehaene
55bd4fed7d
feat(router): add best_of parameter (#117) 2023-03-09 15:30:54 +01:00
OlivierDehaene
e8bfe199ba
feat(router): support left truncation (#115)
closes #111
2023-03-09 13:10:30 +01:00
OlivierDehaene
1a2d68250a
feat: support typical sampling (#114)
closes #112
2023-03-09 11:33:57 +01:00
OlivierDehaene
3fef90d50f
feat(clients): Python client (#103) 2023-03-07 18:52:22 +01:00
OlivierDehaene
9b8ea6a6c7
feat(server): add logits watermark (#90) 2023-03-02 12:30:41 +01:00
OlivierDehaene
f874c47831
feat(router): add api-inference headers (#91) 2023-03-02 11:41:51 +01:00
OlivierDehaene
4e685d907e
feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89) 2023-02-28 10:19:32 +01:00
OlivierDehaene
21340f24ba
feat(router): add legacy route for api-inference support (#88) 2023-02-27 14:56:58 +01:00
OlivierDehaene
0ac184ce77
feat(server): add special token bool (#85) 2023-02-24 15:55:57 +01:00
OlivierDehaene
6796d38c6d
feat(router): add cors allow origin options (#73) 2023-02-17 18:22:00 +01:00
OlivierDehaene
439fcaf810
feat(router): add prometheus metrics scrape endpoint (#71) 2023-02-16 17:18:53 +01:00
OlivierDehaene
5437d49beb
feat(router): add max_total_tokens and empty_input validation (#68)
closes #65
2023-02-15 21:56:59 +01:00
OlivierDehaene
9af454142a
feat: add distributed tracing (#62) 2023-02-13 13:02:45 +01:00
Yannic Kilcher
e520d5b349
fixed SSE naming (#61)
https://en.wikipedia.org/wiki/Server-sent_events
2023-02-08 22:30:11 +01:00
OlivierDehaene
20c3c5940c
feat(router): refactor API and add openAPI schemas (#53) 2023-02-03 12:43:37 +01:00