Commit Graph

39 Commits

Author SHA1 Message Date
OlivierDehaene
0f2daad8b9
feat: add chat template struct to avoid tuple ordering errors (#1570) 2024-02-16 16:37:32 +01:00
Aaron Mihalik
142cdabed3
Bugfix: eos and bos tokens positions are inconsistent (#1567) 2024-02-16 11:44:04 +01:00
Aaron Mihalik
c55abac384
Added name field to OpenAI compatible API Messages (#1563)
# What does this PR do?

Literally just adds the name field to the Message class.

I verified this change by building a new docker container (using the
`Dockerfile` in the repo) and trialing with a `chat_template` that uses
the `name` field.

Here's the previous behavior:

Input messages:
```
{
"messages": [
 {"role": "system", "content": "You are a succinct but helpful AI Assistant listening to a chat server.  Address everyone by @<username>"},
 {"role": "user", "name": "Aaron", "content": "Hello There!"},
 {"role": "assistant", "content": "  Hello @Aaron! How can I assist you today?"},
 {"role": "user", "name": "Sally", "content": "Hiya everyone.  Is @Aaron is this room?"}
],
  "model": "meta-llama/Llama-2-7b-chat-hf"
}
```

Response before the modification:
```
Hello @Aaron! Yes, you are in the chat room. How can I assist you today? 😊

Hiya everyone! *waves* It's great to see you all here. Is there something on your mind that you'd like to talk about or ask? I'm here to listen and help in any way I can. 🤖
```

Response after my modification:
```
Hello @Sally! Yes, @Aaron is currently in the chat room. How may I assist you today?
```

Fixes #1558 


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?
@Narsil

---------

Co-authored-by: Aaron Mihalik <aaron.mihalik@parsons.us>
Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-02-15 13:30:31 -05:00
OlivierDehaene
532146338b
feat(router): add max_batch_size (#1542)
Some hardware require a maximum batch size.
2024-02-09 12:38:41 +01:00
drbh
1734540211
feat: use existing add_generation_prompt variable from config in temp… (#1533)
This PR adds support to read the `add_generation_prompt` from the config
and use it in the chat template. If `add_generation_prompt` does not
exist we default to false
2024-02-07 09:35:53 +01:00
Nicolas Patry
86c8335f1b
Add a new /tokenize route to get the tokenized input (#1471)
# What does this PR do?


Ideally this is done client side, but this is a recurring request,
therefore we implemented it.

- Runs only if rust tokenizer is present (not encumbering the main
inference pipeline is important).
- Returns simple results, ID, text (gotten with offsets from the
original string) and offsets (so users can do things like highlighting
text).

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-01-25 14:19:03 +01:00
drbh
3ccb3bb0b5
feat: support raise_exception, bos and eos tokens (#1450)
This PR adds support to handle the custom jinja function
`raise_exception` and passes the `bos` and `eos` tokens into the
template

Additionally this PR adds 3 tests to validate and show examples of what
can and cannot be parsed currently.

```bash
cargo test --package text-generation-router --lib -- infer::tests --nocapture
#     Finished test [unoptimized + debuginfo] target(s) in 7.82s
#      Running unittests src/lib.rs (target/debug/deps/text_generation_router-18a0bbf99c2ca1b4)

# running 3 tests
# test infer::tests::test_chat_template_valid_with_raise ... ok
# test infer::tests::test_chat_template ... ok
# test infer::tests::test_chat_template_invalid_with_raise ... ok

# test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 15 filtered out; finished in 0.00s
```
2024-01-18 12:31:56 +01:00
drbh
0eabc83541
feat: supports openai chat completions API (#1427)
This PR adds support to make TGI a drop in replacement for OpenAI
clients by exposing the same HTTP interface.

Notes
- TGI inits a single model at startup so the `model` field is unused in
HTTP requests.
- `max_tokens` and `stream` should work as expected but other params may
be (unimplemented or not supported)

General approach
- fetch the `tokenizer_config` at startup from the hub
- pass `tokenizer_config` into `Infer` so we have it at request time
- use the `chat_template` on the config to format chat request
- parse jinja template and render chat string
- pass inputs into existing generate function
- wrap generation output in expected structure before returning

# How to test

### Streaming curl
```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```


It is also possible to use the `openai` python library and change the
base url

###  🌊 STREAMING REQUEST
```python
from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)

# ChatCompletionChunk(id='', choices=[Choice(delta=ChoiceDelta(content=' that', function_call=None, role='assistant', tool_calls=None), finish_reason=None, index=2, logprobs=None)], created=1704486761, model='', object='text_completion', system_fingerprint='')
```

### 🚗 SYNCHRONOUS REQUEST
```python
from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)
# ChatCompletion(id='', choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='\nDeep learning is a new field of research that has been gaining traction in the last ...', role='assistant', function_call=None, tool_calls=None))], created=1704486762, model='', object='text_completion', system_fingerprint='', usage=CompletionUsage(completion_tokens=100, prompt_tokens=76, total_tokens=176))
```


## How to run dev

```bash
cd text-generation-inference/server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2
```

***note many of the existing `chat_templates` use non standard `jinja`
(ie. adding a `raise` to the template) which will throw an error when
parsing; hence using `upstage/SOLAR-10.7B-Instruct-v1.0` since it has a
valid template
```bash
cd text-generation-inference/router
cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0
```

trigger
```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \
    -H 'Content-Type: application/json'
```

^ supports `stream: true` and `stream: false` requests
2024-01-16 11:07:41 +01:00
Nicolas Patry
ac08b4ef9c
Return prompt vs generated tokens. (#1436)
# What does this PR do?

Fixes #637 
 
<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-01-11 13:01:43 -05:00
OlivierDehaene
50b495f3d8
feat: add more latency metrics in forward (#1346) 2023-12-14 15:59:38 +01:00
Nicolas Patry
9ecfa16b12
Speculative (#1308) 2023-12-11 12:46:30 +01:00
OlivierDehaene
3dbc649b11
fix: do not leak inputs on error (#1228)
Close #1225
2023-11-20 10:33:44 +01:00
OlivierDehaene
f9910d13e2
feat: remove flume (#1184) 2023-10-23 15:51:12 +02:00
OlivierDehaene
3b56d7669b
feat: add mistral model (#1071) 2023-09-28 09:55:47 +02:00
Nicolas Patry
211b54ac41
Rebased #617 (#868)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

---------

Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>
2023-08-28 11:43:47 +02:00
OlivierDehaene
fe80f5360c
feat(server): auto max_batch_total_tokens for flash att models (#630) 2023-07-19 09:31:25 +02:00
OlivierDehaene
e74bd41e0f
feat(server): add paged attention to flash models (#516)
Closes #478
2023-06-30 19:09:59 +02:00
OlivierDehaene
bd3a9d8e85
fix(router): add timeout on flume sends (#488) 2023-06-23 14:58:28 +02:00
OlivierDehaene
218c9adaa5
feat: decrease IPC proto size (#367)
Closes #307 #308
2023-05-24 19:19:57 +02:00
Nicolas Patry
db2b4e0754
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-04-26 20:23:54 +02:00
OlivierDehaene
ebc74d5666
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-04-24 17:59:00 +02:00
OlivierDehaene
709d8936f6
feat(router): drop requests when client closes the channel (#202) 2023-04-20 11:07:40 +02:00
OlivierDehaene
9987960062
feat(router): make router input validation optional (#164) 2023-04-09 20:22:27 +02:00
OlivierDehaene
7dec65a244
fix(router): use buckets for metrics histograms (#163) 2023-04-09 20:13:28 +02:00
OlivierDehaene
f000068944
feat(server): clear cache on error (#143) 2023-03-28 11:29:35 +02:00
OlivierDehaene
b49dbf2d88
fix(server): use server tokenizer as gt (#128) 2023-03-16 12:12:26 +01:00
OlivierDehaene
cbd36aa4d1
fix(server): revert gpt-neox optims (#123) 2023-03-13 22:57:08 +01:00
OlivierDehaene
55bd4fed7d
feat(router): add best_of parameter (#117) 2023-03-09 15:30:54 +01:00
OlivierDehaene
3fef90d50f
feat(clients): Python client (#103) 2023-03-07 18:52:22 +01:00
OlivierDehaene
0ac184ce77
feat(server): add special token bool (#85) 2023-02-24 15:55:57 +01:00
OlivierDehaene
439fcaf810
feat(router): add prometheus metrics scrape endpoint (#71) 2023-02-16 17:18:53 +01:00
OlivierDehaene
9af454142a
feat: add distributed tracing (#62) 2023-02-13 13:02:45 +01:00
OlivierDehaene
20c3c5940c
feat(router): refactor API and add openAPI schemas (#53) 2023-02-03 12:43:37 +01:00
OlivierDehaene
7b870e1e18
feat(router): use background task to manage request queue (#52)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-02-02 14:59:27 +01:00
OlivierDehaene
017a2a8c2f
feat: Add token streaming using ServerSideEvents support (#41) 2023-01-31 17:04:00 +01:00
OlivierDehaene
4f9ac67cfa
Revert "feat: Add token streaming using ServerSideEvents support" (#40)
Reverts huggingface/text-generation-inference#36
2023-01-31 14:21:51 +01:00
OlivierDehaene
7fbfbb0dc5
feat: Add token streaming using ServerSideEvents support (#36)
Add token streaming using ServerSideEvents (SSE).

The signature of the SSE events is: 

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```
2023-01-31 11:49:43 +01:00
Olivier Dehaene
fa9a088467 Add load testing 2022-10-11 10:36:51 +02:00
Olivier Dehaene
295831a481 Init 2022-10-08 12:30:12 +02:00