Commit Graph

13 Commits

Author SHA1 Message Date
drbh
ec85883703 fix: avoid frequency and repetition penalty on padding tokens (#1765)
This PR resolves an issue with the penalty processors during batched
generation where extra padding tokens incorrectly impact the penalty
scores.

generation is impacted in the case where at least one item in the batch
includes a `frequency_penalty`

reproduction script below
```python
import requests
from concurrent import futures
import time

headers = {
    "Content-Type": "application/json",
}

json_data = {
    "inputs": "[INST] Whats the capitol of France? [/INST]",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 20,
        "do_sample": False,
    },
}


json_data2 = {
    "inputs": "<s>[INST]Write a mind bending story: I saw a puppy a cat a rat and a raccoon during my bike ride in the park[/INST]",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 2,
        "do_sample": False,
        # OFFENDING LINE
        "frequency_penalty": 1.05,
    },
}

base_url = "http://localhost:3000/generate"


def req():
    response = requests.post(base_url, headers=headers, json=json_data)
    print("[req ]", response.json())


def req2():
    response = requests.post(base_url, headers=headers, json=json_data2)
    print("[req2]", response.json())


n = 1

for i in range(0, 3):
    print(f"- {n} threads -")
    with futures.ThreadPoolExecutor(max_workers=n) as executor:
        executor.submit(req)
        for i in range(3):
            executor.submit(req2)

    n += 1

# - 1 threads -
# [req ] {'generated_text': ' The capital of France is Paris.'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# - 2 threads -
# [req ] {'generated_text': ' The capital city'}
# [req2] {'generated_text': ' As""%\n================'}
# [req2] {'generated_text': ' As""%%$\n================'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}

# output with this PR's changes:
# - 1 threads -
# [req ] {'generated_text': ' The capital of France is Paris.'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# - 2 threads -
# [req ] {'generated_text': ' The capital city'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}

```

**divergence from expected generation is easier to reproduce with
batched grammar requests as they are more sensitive to unexpected
outputs.

this PR resolves the issue by setting the penalty score to 0 where input
ids are padding tokens (0).

---------

Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-06-10 09:29:20 +03:00
Karol Damaszke
32acdd55b4
Add grammar support (#140)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-05-20 11:16:34 +02:00
drbh
56670398f3 fix: handle batches with and without grammars (#1676)
This PR correctly handles batches with a mixture of constrained and non
constrained generations.

Currently if batch contains mixed generations the generation will throw
an error because it will incorrectly attempt to constrain a request with
an empty grammar.

We now handled `None` grammars and only apply the mask if needed

Fixes:
https://github.com/huggingface/text-generation-inference/issues/1643
2024-04-25 14:06:48 +03:00
drbh
ab074c81b7 fix: improve tool type, bump pydantic and outlines (#1650)
This PR resolves a couple

- [X] adjusts the tool response to align with openai's tools response
type
- [X] bumps pydantic to `2.6.4` in all apps (resolves dependency issue
when running tests)
- [X] bump `outlines` version and fix import for new name
2024-04-25 12:34:55 +03:00
drbh
d4aebbd10a fix: correctly index into mask when applying grammar (#1618)
This PR fixes how the grammar mask is index when generating text and
adds a new test to ensure the grammars work with non flash models
2024-04-25 10:16:16 +03:00
OlivierDehaene
2ac1b55c95 v1.4.1 (#1568) 2024-04-24 15:42:59 +03:00
OlivierDehaene
31b5e37f49 chore: add pre-commit (#1569) 2024-04-24 15:32:02 +03:00
drbh
55acb86f42 Outlines guided generation (#1539)
This WIP PR starts to add grammar support via outlines, currently this
PR supports very simple regex grammars and does not optimize for
precompiling or caching grammar fsm's.

todo:
- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all outline support grammar types
- [ ] explore optimizations to avoid recompiling grammars

guided request
```bash
curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}' | jq
```
response
```json
{
  "generated_text": "david@example.com"
}
```

unguided request
```bash
curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6
    }
}' | jq
```
response
```json
{
  "generated_text": "    email = 'david"
}
```
2024-04-24 14:57:37 +03:00
OlivierDehaene
f1d8da3ba6 feat(server): add frequency penalty (#1541) 2024-04-24 08:43:50 +00:00
regisss
cc744ba426 Add changes from Optimum Habana's TGI folder 2023-12-05 11:12:16 +01:00
Nick Hill
e4b26aa10b
fix(server): avoid errors for very small top_p values (#544)
See https://github.com/huggingface/transformers/pull/24111

I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
2023-07-04 20:11:33 +02:00
OlivierDehaene
53aa9194c8
fix(server): fix warpers on CPU (#472)
Closes #471
2023-06-20 11:06:10 +02:00
OlivierDehaene
62f91f78ac
feat(server): support vectorized warpers in flash causal lm (#317)
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
2023-05-26 12:30:27 +02:00