text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-10 15:35:24 +00:00

Author	SHA1	Message	Date
BaihuiJin	15e5df1cc4	BS round up to BUCKET_SIZE to prevent capture graph when graph input not change (#185 )	2024-07-16 09:42:46 +02:00
BaihuiJin	aac547dd82	Clear previous hpu_graphs when graph shape changed to save memory (#176 )	2024-07-11 15:19:17 +02:00
Jacek Czaja	5df20f88ff	Fix to non-LLAMA models (#177 ) Co-authored-by: Jacek Czaja <jczaja@habana.ai>	2024-07-04 13:42:24 +02:00
Jacek Czaja	c64b5b75e2	[TORCH COMPILE] Ignore HPU GRAPHS env var when eager mode is used (#165 ) Co-authored-by: Jacek Czaja <jczaja@habana.ai>	2024-07-03 15:17:27 +02:00
Karol Damaszke	4b4382c6f8	Fix dtype mismatch in HeterogeneousFrequencyPenaltyLogitsProcessor (#163 )	2024-07-03 10:57:41 +02:00
Vidya Galli	ca1b2f4994	Updated kv cache for starcoder (#128 )	2024-06-14 22:36:44 +02:00
Jacek Czaja	ef86232c94	[Torch.compile] Enable llama-2-7b (#157 ) Co-authored-by: Jacek Czaja <jczaja@habana.ai>	2024-06-14 15:56:23 +02:00
Martin Iglesias Goyanes	8c847f2b60	Fixing frequency penalty (#1811 ) Thank you so much for the work you are doing, this is my little contribution to this great thing you have built. I hope it is useful and helpful, please don't hesitate to discuss any matters that are not clear! I am basing my implementation of frequency penalty on OpenAI's implementation: https://platform.openai.com/docs/guides/text-generation/parameter-details The problem I see with TGI's current implementation is that it does not take into account the frequency of tokens that have already been sampled in the current generation stream. Also, the scaling of the adjusted token logits is done differently for positive and negative logits, whereas in OpenAI's implementation token frequency is taken into account and the scaling is always done with a subtraction (if the penalty is positive) or an addition (if the penalty is negative). This leads to corrupt generations, as I mentioned in issue #1810 . Moreover, in my tests other issues are also gone, such as some requests with ``frequency_penalty = 1.0`` overruling other requests (with ``frequency_penalty = 0.0``) in the same batch and therefore corrupting all generations in the batch. Basically, padding does not affect this implementation, so I believe this ``score *= input_ids.ne(0)`` is not needed anymore.

| Frequency penalty | -1.0 | 0.0 | 1.0 |
| -- | -- | -- | -- |
| Before my change | https://paste.mozilla.org/JxqGJkWY | https://paste.mozilla.org/hrztJ56h | https://paste.mozilla.org/pBSEH2zw |
| After my change | https://paste.mozilla.org/7gXCi7zo | https://paste.mozilla.org/ZR9rJ92g | https://paste.mozilla.org/gHaD2YnC |

--------- Co-authored-by: martini <martin.iglesiasgoyanes@adyen.com>	2024-06-10 14:05:08 +03:00
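The subtraction-based penalty described in commit 8c847f2b60 can be illustrated with a minimal sketch. It is not the code from the PR: the function name `apply_frequency_penalty` and the plain-list representation of logits are assumptions made for illustration, and only the frequency term of the OpenAI-style formula (logit minus count times penalty) is shown.

```python
from collections import Counter


def apply_frequency_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with count-scaled penalties subtracted.

    Hypothetical helper: each token already generated has
    `count * penalty` subtracted from its logit, so a positive penalty
    lowers the logit and a negative penalty raises it, regardless of
    the sign of the logit itself.
    """
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * penalty
    return adjusted


# Token 0 was generated twice, token 1 once, tokens 2 and 3 never.
logits = [2.0, -1.0, 0.5, 0.0]
generated = [0, 0, 1]
print(apply_frequency_penalty(logits, generated, 1.0))
# -> [0.0, -2.0, 0.5, 0.0]
```

Because the adjustment is a plain subtraction, positive and negative logits are treated the same way, which is the property the commit message contrasts with the earlier behaviour.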
OlivierDehaene	91352b1b71	fix: use get_speculate to the number of layers (#1737 )	2024-06-10 14:02:23 +03:00
Nicolas Patry	6ca39843b4	Small CI cleanup. (#1801 ) Just unifying some branches and making intentions clearer (no cuda graph when 0 all the way in the launcher)	2024-06-10 14:01:01 +03:00
Nicolas Patry	3f3a1a6a66	Better graceful shutdown. (#1827 )	2024-06-10 14:00:26 +03:00
Nicolas Patry	7641cda775	Dummy CI run. (#1817 )	2024-06-10 13:57:59 +03:00
Nicolas Patry	a788888619	Fixing qwen2. (#1818 )	2024-06-10 13:57:52 +03:00
Nicolas Patry	388af49916	Blunder (#1815 )	2024-06-10 13:57:47 +03:00
Wang, Yi	62a83fd800	add intel xpu support for TGI (#1475 ) --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-06-10 13:16:45 +03:00
Nicolas Patry	2ed6242816	Use the generation config. (#1808 )	2024-06-10 09:53:00 +03:00
drbh	ab59a5e346	feat: improve temperature logic in chat (#1749 ) This PR adds support for `do_sample` to chat to enable greedy sampling --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-06-10 09:50:25 +03:00
drbh	ec85883703	fix: avoid frequency and repetition penalty on padding tokens (#1765 ) This PR resolves an issue with the penalty processors during batched generation where extra padding tokens incorrectly impact the penalty scores. Generation is impacted when at least one item in the batch includes a `frequency_penalty`. Reproduction script below:

```python
import requests
from concurrent import futures
import time

headers = {
    "Content-Type": "application/json",
}

json_data = {
    "inputs": "[INST] Whats the capitol of France? [/INST]",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 20,
        "do_sample": False,
    },
}

json_data2 = {
    "inputs": "<s>[INST]Write a mind bending story: I saw a puppy a cat a rat and a raccoon during my bike ride in the park[/INST]",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 2,
        "do_sample": False,
        # OFFENDING LINE
        "frequency_penalty": 1.05,
    },
}

base_url = "http://localhost:3000/generate"


def req():
    response = requests.post(base_url, headers=headers, json=json_data)
    print("[req ]", response.json())


def req2():
    response = requests.post(base_url, headers=headers, json=json_data2)
    print("[req2]", response.json())


n = 1
for i in range(0, 3):
    print(f"- {n} threads -")
    with futures.ThreadPoolExecutor(max_workers=n) as executor:
        executor.submit(req)
        for i in range(3):
            executor.submit(req2)
    n += 1

# - 1 threads -
# [req ] {'generated_text': ' The capital of France is Paris.'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# - 2 threads -
# [req ] {'generated_text': ' The capital city'}
# [req2] {'generated_text': ' As""%\n================'}
# [req2] {'generated_text': ' As""%%$\n================'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}

# output with this PR's changes:
# - 1 threads -
# [req ] {'generated_text': ' The capital of France is Paris.'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# - 2 threads -
# [req ] {'generated_text': ' The capital city'}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
# [req2] {'generated_text': " As you were riding your bicycle through Central Park, enjoying some fresh air on an otherwise gloomy day. You couldn't help but notice that it was eerily quiet for this time of year - usually there would be hordes"}
```

Divergence from the expected generation is easier to reproduce with batched grammar requests, as they are more sensitive to unexpected outputs. This PR resolves the issue by setting the penalty score to 0 where input ids are padding tokens (0). --------- Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-06-10 09:29:20 +03:00
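The masking idea in commit ec85883703 (no penalty contribution where the input id is the padding token) can be sketched as follows. This is an illustrative assumption, not the PR's code: the helper name `masked_penalty_update`, the constant `PAD_ID`, the subtraction-style update, and the nested-list batch layout are all hypothetical.

```python
PAD_ID = 0  # assumption: padding positions carry token id 0, as the commit message states


def masked_penalty_update(logits, input_ids, penalty):
    """Subtract `penalty` once per occurrence of each token in a batch row,
    skipping positions that hold the padding id.

    `logits` is a batch of per-vocabulary rows and `input_ids` is the
    matching batch of (possibly padded) token id rows.
    """
    out = [row[:] for row in logits]
    for b, ids in enumerate(input_ids):
        for tok in ids:
            # Padding positions contribute a penalty score of 0.
            contribution = penalty if tok != PAD_ID else 0.0
            out[b][tok] -= contribution
    return out


logits = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
input_ids = [[0, 0, 2], [1, 2, 2]]  # first row is left-padded with two pad tokens
print(masked_penalty_update(logits, input_ids, 0.5))
# -> [[1.0, 1.0, 0.5], [1.0, 0.5, 0.0]]
```

One consequence of masking by id is that a genuine token whose id equals the padding id would also be exempt from the penalty.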
Nicolas Patry	57b31f410d	Idefics2. (#1756 )	2024-06-10 09:29:08 +03:00
Nicolas Patry	4f8ca6049e	Phi3 support (#1797 )	2024-06-10 09:27:01 +03:00
fxmarty	5b162c7026	Make `--cuda-graphs` work as expected (bis) (#1768 ) This was ignored up to now, even with `--cuda-graphs 0`. With this fix, `--cuda-graphs` is obeyed.	2024-06-10 09:24:43 +03:00
Nicolas Patry	11c16aa64c	Upgrading all versions. (#1759 )	2024-06-03 15:39:47 +03:00
Karol Damaszke	0e8f8726db	Warmup all decode buckets (#152 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-05-29 22:46:55 +02:00
Karol Damaszke	7b879fd1d8	Pad next token chooser parameters with empty logits processors (#151 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-05-29 22:43:56 +02:00
Jimin Ha	1023de8048	Add flash_attention argument options for Mistral (#145 ) Co-authored-by: Karol Damaszke <karol.damaszke@intel.com> Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-05-27 20:00:42 +02:00
Karol Damaszke	32acdd55b4	Add grammar support (#140 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-05-20 11:16:34 +02:00
Sylwester Fraczek	fe16a465a0	causal_lm server tests rebased (#139 ) Co-authored-by: Sylwester Fraczek <sfraczek@habana.ai> Co-authored-by: Jacek Czaja <jczaja@habana.ai>	2024-05-06 15:55:35 +02:00
Karol Damaszke	bad7fe720a	Fix warmup shapes for corner cases (#136 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-05-06 11:35:27 +02:00
Karol Damaszke	600d033c04	Merge branch 'habana-main' into rebase_tgi_2.0	2024-04-29 09:44:45 +03:00
regisss	37aabf8571	Move call to `adapt_transformers_to_gaudi` earlier in the code (#133 )	2024-04-26 11:07:27 +02:00
OlivierDehaene	c6a31b9e2b	v2.0.0 (#1736 )	2024-04-26 07:42:52 +00:00
OlivierDehaene	f6d5c2edf2	feat: medusa v2 (#1734 )	2024-04-26 07:42:37 +00:00
Nicolas Patry	935d56abfe	Fp8 Support (#1726 ) --------- Co-authored-by: Dong Shin <d0104.shin@gmail.com>	2024-04-25 17:58:11 +03:00
OlivierDehaene	d1d0b3cbd6	hotfix: mixtral	2024-04-25 17:51:46 +03:00
OlivierDehaene	a1b65e5919	fix: fix CohereForAI/c4ai-command-r-plus (#1707 ) @Narsil @drbh this will update flash attention v2 and vllm. You will need to re-install them.	2024-04-25 17:51:35 +03:00
Nicolas Patry	2b2f4dee94	Adding Llava-Next (Llava 1.6) with full support. (#1709 )
- Changed all models to extract `embed_tokens` in order to enable llava to separately call the embeddings and the core model layers.
- Added VlmCausalLM to inherit from FlashMistral in order to be maximally supported. The only added logic sits on top: it parses images into pixel values, preallocates input_ids space for the image embeddings, and passes them to the model.
- Added Clip for the vision tower.
- Didn't add flash for the vision tower since there's no padding anyway.
- Added heuristic (potentially incomplete) to calculate the number of features before calculating the clip patches (allows for easier logic reuse of the LLM under the hood).

Still needs to be done:
- [x] Implement the image parsing on the controller side, to avoid downloading n times per TP shard, to refuse requests that are too large early, and to avoid issues where the truncation actually truncates the image.
- [ ] Make sure it works with quantization properly.
- [x] Make sure it works with TP>1	2024-04-25 14:30:55 +00:00
Nicolas Patry	3417398c9a	Force weights_only (before fully breaking pickle files anyway). (#1710 )	2024-04-25 15:10:53 +03:00
Nicolas Patry	fec3f8f21c	Fixing cohere tokenizer. (#1697 )	2024-04-25 15:10:46 +03:00
Nicolas Patry	fe063b8118	Pickle conversion now requires `--trust-remote-code`. (#1704 )	2024-04-25 15:09:00 +03:00
Nicolas Patry	29c316e5bb	Add cuda graphs sizes and make it default. (#1703 ) # What does this PR do? ``` text-generation-launcher --model-id XXX # Uses cuda graphs by default text-generation-launcher --model-id XXX --cuda-graphs "1,2" #Restrict the number of cuda graphs which saves VRAM text-generation-launcher --model-id XXX --cuda-graphs "0" # Disabling it entirely ```	2024-04-25 15:08:54 +03:00
OlivierDehaene	dc1ab2001d	feat: Add dbrx support (#1685 ) Close #1679	2024-04-25 14:07:28 +03:00
drbh	56670398f3	fix: handle batches with and without grammars (#1676 ) This PR correctly handles batches with a mixture of constrained and unconstrained generations. Currently, if a batch contains mixed generations, the generation will throw an error because it will incorrectly attempt to constrain a request with an empty grammar. We now handle `None` grammars and only apply the mask if needed. Fixes: https://github.com/huggingface/text-generation-inference/issues/1643	2024-04-25 14:06:48 +03:00
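The mixed-batch handling described in #1676 can be sketched as follows. This is a minimal illustration under stated assumptions, not TGI's code: `apply_grammar_masks` and the per-request `allowed` sets are hypothetical names, and plain Python lists stand in for logit tensors.

```python
from typing import List, Optional, Set

NEG_INF = float("-inf")


def apply_grammar_masks(
    logits: List[List[float]],
    allowed: List[Optional[Set[int]]],
) -> List[List[float]]:
    """Mask each row of logits with its allowed-token set.

    A row whose entry in `allowed` is None has no grammar and is
    passed through unchanged instead of being constrained.
    """
    masked = []
    for row, allowed_ids in zip(logits, allowed):
        if allowed_ids is None:
            # Unconstrained request: skip the mask entirely.
            masked.append(list(row))
            continue
        # Constrained request: forbid every token outside the allowed set.
        masked.append(
            [value if idx in allowed_ids else NEG_INF for idx, value in enumerate(row)]
        )
    return masked


# One unconstrained request and one request restricted to token id 1.
batch_logits = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]
batch_allowed = [None, {1}]
print(apply_grammar_masks(batch_logits, batch_allowed))
```

The point of the sketch is only the `None` check: requests without a grammar keep their logits, while constrained requests in the same batch are still masked.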
OlivierDehaene	da4199ed97	feat: cohere (#1660 )	2024-04-25 12:39:14 +03:00
SeongBeomLEE	097e72a672	fix: LlamaTokenizerFast to AutoTokenizer at flash_mistral.py (#1637 ) # What does this PR do? In a few cases a model uses a Mistral or Mixtral structure but not a Llama tokenizer, so this falls back to AutoTokenizer in the exception handling. Similar PR #619 @Narsil	2024-04-25 12:35:44 +03:00
Nicolas Patry	6729783a19	Remove unecessary cuda graph. (#1664 )	2024-04-25 12:35:24 +03:00
drbh	ab074c81b7	fix: improve tool type, bump pydantic and outlines (#1650 ) This PR resolves a couple of issues: - [X] adjusts the tool response to align with openai's tools response type - [X] bumps pydantic to `2.6.4` in all apps (resolves dependency issue when running tests) - [X] bumps `outlines` version and fixes import for new name	2024-04-25 12:34:55 +03:00
drbh	d888bc2828	feat: support force downcast after FastRMSNorm multiply for Gemma (#1658 ) This PR adds `force_downcast_after` to `FastRMSNorm.forward` which is used in the Gemma model. References https://github.com/huggingface/transformers/pull/29402 and https://github.com/huggingface/transformers/pull/29729 Setting `force_downcast_after=True` will perform the `hidden_states * weight` multiplication in f32 and then downcast to half. This differs slightly from the current implementation, which first casts the `hidden_states` to half and then multiplies.	2024-04-25 12:32:42 +03:00
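The difference between the two orderings in #1658 can be seen on a single scalar. This is a rough sketch, not the `FastRMSNorm` code: the helper names are made up for illustration, Python's 64-bit floats stand in for f32, and `struct`'s `"e"` format is used to round to IEEE 754 half precision.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def multiply_then_downcast(hidden: float, weight: float) -> float:
    # Analogue of force_downcast_after=True: multiply in higher
    # precision, then round the product to half once.
    return to_half(hidden * weight)


def downcast_then_multiply(hidden: float, weight: float) -> float:
    # Analogue of the other ordering: round the input to half first,
    # then multiply and round the product to half.
    return to_half(to_half(hidden) * to_half(weight))


hidden, weight = 1.0004, 3.0
print(multiply_then_downcast(hidden, weight))
print(downcast_then_multiply(hidden, weight))
```

With these inputs the early downcast rounds `1.0004` to `1.0` before the multiply, so the two results land on different half-precision values. That is the kind of small discrepancy the flag is meant to remove.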
drbh	d4aebbd10a	fix: correctly index into mask when applying grammar (#1618 ) This PR fixes how the grammar mask is indexed when generating text and adds a new test to ensure the grammars work with non flash models	2024-04-25 10:16:16 +03:00
Nicolas Patry	0390b28b85	Fix idefics default. (#1614 )	2024-04-25 10:12:13 +03:00
drbh	e259625b8b	fix: Handle concurrent grammar requests (#1610 ) This PR fixes parallel grammar requests. Currently, grammar states are not concatenated correctly when a new request is added to the batch, and this results in incorrect generation. This PR updates the `concatenate` function to correctly include the previous states. fixes: #1601	2024-04-25 10:11:40 +03:00
OlivierDehaene	666cdaaf16	feat: Qwen2 (#1608 ) See #1584 --------- Co-authored-by: Cheng Kuan Yong Jason <jasoncky96@gmail.com>	2024-04-25 09:21:22 +03:00
OlivierDehaene	7c6a47bb7a	feat: starcoder2 (#1605 )	2024-04-25 09:18:55 +03:00
Nicolas Patry	21d52c9ca1	Revamp medusa implementation so that every model can benefit. (#1588 )	2024-04-25 09:13:03 +03:00
OlivierDehaene	a461257066	feat: add support for Gemma (#1583 )	2024-04-24 18:08:23 +03:00
OlivierDehaene	3c6e6d8c3f	fix(router): fix openapi and add jsonschema validation (#1578 )	2024-04-24 18:07:44 +03:00
Nicolas Patry	5a54d915ae	Fix mistral with length > window_size for long prefills (rotary doesn't create long enough cos, sin). (#1571 )	2024-04-24 18:05:34 +03:00
OlivierDehaene	2ac1b55c95	v1.4.1 (#1568 )	2024-04-24 15:42:59 +03:00
OlivierDehaene	31b5e37f49	chore: add pre-commit (#1569 )	2024-04-24 15:32:02 +03:00
drbh	55acb86f42	Outlines guided generation (#1539 ) This WIP PR starts to add grammar support via outlines, currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar fsm's. todo: - [X] add simple outlines guidance to `NextTokenChooser` - [X] update protos for grammar - [X] update generation params API - [X] constrain simple grammar - [ ] support parsing more complex grammar into fsm - [ ] support all outline support grammar types - [ ] explore optimizations to avoid recompiling grammars guided request ```bash curl -s 'http://localhost:3000/generate' \ --header 'Content-Type: application/json' \ --data-raw '{ "inputs": "make an email for david: \n", "parameters": { "max_new_tokens": 6, "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+" } }' \| jq ``` response ```json { "generated_text": "david@example.com" } ``` unguided request ```bash curl -s 'http://localhost:3000/generate' \ --header 'Content-Type: application/json' \ --data '{ "inputs": "make an email for david: \n", "parameters": { "max_new_tokens": 6 } }' \| jq ``` response ```json { "generated_text": " email = 'david" } ```	2024-04-24 14:57:37 +03:00
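The regex grammar in the guided request above can be checked against both responses with Python's standard `re` module. This is only a sketch for reading the example: `satisfies_grammar` is a hypothetical helper, and the check runs on finished strings, whereas the PR constrains tokens during generation through outlines.

```python
import re

# Grammar from the guided request, written as a raw string
# (the JSON body escapes each backslash as "\\").
EMAIL_GRAMMAR = r"[\w-]+@([\w-]+\.)+[\w-]+"


def satisfies_grammar(text: str, grammar: str = EMAIL_GRAMMAR) -> bool:
    """Return True when the whole string matches the regex grammar."""
    return re.fullmatch(grammar, text) is not None


guided = "david@example.com"   # response to the guided request
unguided = " email = 'david"   # response to the unguided request

print(satisfies_grammar(guided))
print(satisfies_grammar(unguided))
```

Only the guided response is a full match for the grammar. The unguided one contains no `@` and starts with a space, so it fails.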
Nicolas Patry	686b56a0c0	Small cleanup. (#1560 ) Using a single `os.getenv` statement instead of multiple. Should make truthful values easier to catch. In the end didn't move towards full CLI because modifying globals in Python is error prone (depends on code import order). Added an error when mamba is launched with TP.	2024-04-24 14:42:35 +03:00
Nicolas Patry	e93cc34a22	Improving mamba runtime by using updates (#1552 ) - Move float16 to bfloat16, which is less imprecise (load tests are failing with the update kernels + f16, all working under bf16). Another note is that we are not respecting the layer norm in f32 defined in the configuration (this is OK in my book, but that could impact the f16 precision). - Moved to update kernels. Triton overhead is super high; removing it by switching to cuda graphs works great (update cuda graph is available in TRT-LLM if needed, seems exactly like the regular ssm kernel). - Moved inference_params struct in order to make only 2 tensors, to reduce the overhead of copying back and forth to the cuda graphs. - Left over overhead seems entirely in the tokenization bit. (Still 4 copies are paid before launching the graph)	2024-04-24 13:21:39 +03:00
OlivierDehaene	0c207f71ed	feat: experimental support for cuda graphs (#1428 ) Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-04-24 13:15:45 +03:00
Ilyas Moutawwakil	777e519277	ROCm AWQ support (#1514 ) # What does this PR do? This PR adds the possibility to run AWQ models with Exllama/GPTQ kernels, specifically for ROCm devices that support Exllama kernels but not AWQ's GEMM. This is done by: - un-packing, reordering and re-packing AWQ weights when `--quantize gptq` but the model's `quant_method=awq`. - avoiding overflows when adding 1 to zeros in exllama and triton. Ref: https://github.com/casper-hansen/AutoAWQ/pull/313 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-04-24 09:21:34 +00:00
OlivierDehaene	f1d8da3ba6	feat(server): add frequency penalty (#1541 )	2024-04-24 08:43:50 +00:00
drbh	51a4e62ed4	Impl simple mamba model (#1480 ) This draft PR is a work in progress implementation of the mamba model. This PR currently loads weights, and produces correct logits after a single pass. This PR still needs to correctly integrate this model so it produces tokens as expected, and apply optimization to avoid all copies during runtime/unnecessary operations. [Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752) https://github.com/johnma2006/mamba-minimal https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs https://github.com/huggingface/transformers/pull/28094 Notes: this dev work is currently targeting `state-spaces/mamba-130m`, so if you want to test please use that model. Additionally when starting the router the prefill needs to be limited: `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768` Integration tests have been added and basic functionality such as model loading is supported. ```bash cd integration-tests pytest -vv models/test_fused_kernel_mamba.py ``` - [x] add tests - [x] load model - [x] make simple request - [ ] resolve warmup issue - [ ] resolve output issues fetching models tested during dev ```bash text-generation-server download-weights state-spaces/mamba-130m text-generation-server download-weights state-spaces/mamba-1.4b text-generation-server download-weights state-spaces/mamba-2.8b ``` The server can be run ```bash cd server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b ``` router ```bash cargo run ``` make a request ```bash curl -s localhost:3000/generate \ -X POST \ -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \ -H 'Content-Type: application/json' \| jq ``` response ```json { "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data." 
} ``` --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-04-23 11:45:11 +03:00
Dean Wyatte	27daa511ec	GPTNeoX: Use static rotary embedding (#1498 ) # What does this PR do? `transformers` 4.35 removed rotary embeddings from GPTNeoX's weights ([link to line diff](`253f9a3f97 (diff-0e2a05d86c82e96f516db8c14070ceb36f53ca44c6bc21a9cd92ad2e777b9cf1R298)`)). This applies the same fix as https://github.com/huggingface/text-generation-inference/pull/793 which generates them on-the-fly using the appropriate value from the config file Fixes https://github.com/huggingface/text-generation-inference/issues/1460 ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? @OlivierDehaene OR @Narsil	2024-04-23 09:21:21 +03:00
Nicolas Patry	433934519c	Fixing top_n_tokens. (#1497 ) Supersedes #1459. The fix works as follows. We updated next_token_chooser to return all logprobs, then batch_top_n_tokens now also gets accepted_ids + speculated_length (so it knows how to interpret the flat logprobs). We then update the code to return the lists of `Tokens` that it expects.	2024-04-23 08:49:24 +03:00
OlivierDehaene	efd4b97d15	v1.4.0 (#1494 )	2024-04-22 15:47:42 +03:00
fxmarty	4b376b30f1	GPTQ support on ROCm (#1489 ) Tested with
```
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
EXLLAMA_VERSION=1 CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
CUDA_VISIBLE_DEVICES="0,1" text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
```
all with good and identical results on MI210. --------- Co-authored-by: Felix Marty <felix@hf.co> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>	2024-04-22 15:38:50 +03:00
Nicolas Patry	b064b33e8b	Add sealion mpt support (#1477 ) --------- Co-authored-by: Choon Meng Tan <choonmeng@aisingapore.org> Co-authored-by: David Ong Tat-Wee <13075447+ongtw@users.noreply.github.com>	2024-04-22 15:37:05 +03:00
Nicolas Patry	ea2aa53805	Reinstate exl2 with tp (#1490 )	2024-04-22 15:36:57 +03:00
drbh	b2fc097b2b	feat: adds phi model (#1442 ) This PR adds basic modeling for phi-2.

run
```bash
text-generation-server \
    serve \
    microsoft/phi-2 \
    --revision 834565c23f9b28b96ccbeabe614dd906b6db551a
```

test
```bash
curl -s localhost:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq .
```

notes
- recently (~1 day ago) the Phi weights and model were updated to accommodate adding [GQA/MQA attention to the model.](https://github.com/huggingface/transformers/pull/28163) This impl expects the original model format so a fixed revision is required at the moment.
- this PR only includes a basic implementation of the model and can later be extended to support Flash and Sharded versions as well as make use of better optimization	2024-04-22 13:06:38 +03:00
Nicolas Patry	2a3a9c526b	Fixing non divisible embeddings. (#1476 )	2024-04-22 12:48:59 +03:00
PYNing	e930ad9cec	Fix local load for Medusa (#1420 ) Close #1418 Close #1415	2024-04-22 09:30:41 +03:00
R. P. Ruiz	92ddb41d95	Fix missing make target platform for local install: 'install-flash-attention-v2' (#1414 )	2024-04-22 09:18:00 +03:00
OlivierDehaene	118344b99d	fix: fix local loading for .bin models (#1419 )	2024-04-22 09:17:52 +03:00
OlivierDehaene	62646c2a54	v1.3.4	2024-04-22 09:08:34 +03:00
Nicolas Patry	8cc4306f72	Fix local load for peft (#1373 ) local directory overloaded still needs the directory to locate the weights files correctly.	2024-04-22 09:03:34 +03:00
OlivierDehaene	7eeabb9cda	feat: update exllamav2 kernels (#1370 ) Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-04-22 09:02:53 +03:00
Nicolas Patry	be05972911	Peft safetensors. (#1364 ) Works by removing adapter_model.safetensors from being detected as the core model file (which skips the real peft detection).	2024-04-22 09:02:31 +03:00
OlivierDehaene	b7299e1b7f	fix: fix gpt-q with groupsize = -1 (#1358 )	2024-04-19 15:05:50 +03:00
OlivierDehaene	5ff9e81952	fix: fix offline (#1341 ) (#1347 ) @oOraph --------- Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>	2024-04-19 14:56:25 +03:00
OlivierDehaene	ecb0db45af	fix: fix logic if sliding window key is not present in config (#1352 )	2024-04-19 14:56:10 +03:00
OlivierDehaene	a95e6d603d	feat: relax mistral requirements (#1351 ) Close #1253 Close #1279	2024-04-19 14:50:24 +03:00
OlivierDehaene	bb6200503c	fix: max_past default value must be -1, not 0 (#1348 )	2024-04-19 14:18:05 +03:00
OlivierDehaene	214ec0eb49	fix: only keep stop sequence buffer if we have some	2024-04-19 14:18:00 +03:00
OlivierDehaene	04dbf7a506	fix: slice stopping criteria buffer	2024-04-19 14:17:52 +03:00
OlivierDehaene	b3c2d7291e	fix: fix quant linear autotune	2024-04-19 14:17:39 +03:00
OlivierDehaene	28fcdcca6d	fix: fix triton OutOfResources import	2024-04-19 14:17:32 +03:00
OlivierDehaene	5c9ef069ed	feat: add more latency metrics in forward (#1346 )	2024-04-19 13:41:34 +03:00
OlivierDehaene	c974437ba7	fix: fix gpt-q params loading	2024-04-19 12:12:50 +03:00
OlivierDehaene	f9b58ac7a1	feat: add quant to mixtral (#1337 )	2024-04-18 16:32:50 +03:00
OlivierDehaene	09c556dbd7	v1.3.1	2024-04-18 16:32:07 +03:00
OlivierDehaene	79f268f95a	chore: formatting	2024-04-18 16:26:00 +03:00
OlivierDehaene	9aef902982	feat: mixtral (#1328 )	2024-04-18 12:39:52 +00:00
Nicolas Patry	a7f52f3812	Speculative (#1308 )	2024-04-18 12:39:39 +00:00
Karol Damaszke	30cc78773e	Skip server tests of not enabled models (#125 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-04-09 14:15:41 +02:00
Karol Damaszke	d957e32601	Add Habana copyright header (#122 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-04-08 18:06:21 +02:00
Karol Damaszke	b0de25a285	Don't set rope_scaling for unsupported models (#115 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-04-02 12:12:02 +02:00
Karol Damaszke	7342baa2eb	Add support for rope_scaling and remove is_optimized_for_gaudi (#112 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-29 15:07:32 +01:00
Karol Damaszke	bf5263b88b	Disable watermark with FP8 quantization (#114 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-27 13:32:20 +01:00
jkaniecki	56f00a552b	Adjust warmup to all possible bucket sizes and decode batch size = 1 (#113 )	2024-03-27 11:59:51 +01:00
Karol Damaszke	b45f648483	Add warmup for logits processors (#107 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-18 15:17:47 +01:00
yuanwu2017	a4d5c3f40f	Fix the generate_stream crash in concurrent query (#105 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-03-15 10:54:56 +01:00
Yao Matrix	7149ac30e6	Fix the issue of out of range (#98 ) Signed-off-by: yuanwu <yuan.wu@intel.com> Co-authored-by: yuanwu <yuan.wu@intel.com>	2024-03-13 10:09:53 +01:00
Karol Damaszke	80ae9ead28	Set MAX_TOTAL_TOKENS automatically (#91 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-01 11:25:15 +01:00
Karol Damaszke	a5c788cfe4	Remove redundant fill op (#83 ) (#90 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-03-01 01:32:02 +01:00
Karol Damaszke	03c2123244	Use batched index_copy (#73 ) (#89 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 15:45:16 +01:00
Karol Damaszke	7dbf4bf7a4	Improve tensor slicing performance (#66 ) (#87 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-29 10:48:54 +01:00
Karol Damaszke	3831f1bed5	Add warmup for shift operation (#59 ) (#86 )	2024-02-29 09:19:28 +01:00
Karol Damaszke	022ce1eaaf	Overhead reduction (#58 ) (#85 ) Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>	2024-02-29 09:17:45 +01:00
Karol Damaszke	212136dff8	Log exceptions to debug.log (#52 ) (#84 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 09:14:42 +01:00
Karol Damaszke	c7ccfb87ff	Grouped pad/shift/move operations (#57 ) (#82 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 04:16:44 +01:00
Karol Damaszke	2122acc60f	Add warmup for all possible shapes for prefill #49 (#81 )	2024-02-28 10:40:13 +01:00
Karol Damaszke	31bed905d4	Update habana profiler (#50 ) (#80 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-28 09:57:40 +01:00
Karol Damaszke	d31fb62576	Add more info to high-level profiler events (#46 ) (#79 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-02-28 09:55:50 +01:00
Karol Damaszke	941d36f3fd	Enable deferred token generation (#44 ) (#75 ) Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>	2024-02-27 15:46:40 +01:00
jkaniecki	83b059bd27	Bulk shifting (#40 ) (#70 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-26 17:29:56 +01:00
jkaniecki	c3bd8ef445	Add Fp8 support (#42 ) (#71 ) Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com> Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com> Co-authored-by: Grzegorz Morys <gmorys@habana.ai>	2024-02-23 11:52:28 +01:00
jkaniecki	a490847702	Sequence bucketing for prefill (#39 ) (#67 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-23 01:52:14 +01:00
jkaniecki	9ad6086250	Improve habana profile dev experience (#36 ) (#65 ) Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>	2024-02-22 13:57:45 +01:00
jkaniecki	f7ef414e38	Remove unused pad_token_id for filter (#35 ) (#64 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-22 11:24:09 +01:00
jkaniecki	8f590759e3	Prefill optimization by allocating space only for the first output token (#34 ) (#62 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com> Co-authored-by: Karol Damaszke <karol.damaszke@intel.com>	2024-02-22 04:55:43 +01:00
jkaniecki	80303b469c	Do not limit hpu graphs by default (#32 ) (#61 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-21 15:38:00 +01:00
jkaniecki	6b6dec9ea1	Transparent tokenizer uses explicit int32 (#31 ) (#60 ) Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>	2024-02-21 14:24:41 +01:00
regisss	2060bb58bf	Fix trust remote code (#55 )	2024-02-19 07:53:24 +01:00
Karol Damaszke	2a7a967de3	Revert prefill optimization and fix accuracy issue in shift operation (#29 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com> Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>	2024-01-23 15:19:07 +01:00
jkaniecki	ac3bc0e95e	Removed kv_cache from HPU graph output (#19 )	2024-01-19 15:34:13 +01:00
Karol Damaszke	60f63262db	Prefill optimization by allocating space only for the first token (#17 )	2024-01-19 15:18:35 +01:00
Adam Stachowicz	0b96da89aa	Make tokenizer optional (#12 )	2024-01-19 15:12:04 +01:00
madamczykhabana	381ec38cad	Batch bucketing improvements (#15 )	2024-01-17 10:09:27 +01:00
mrs303	8523f7ef64	Deepspeed terminate (#11 )	2024-01-17 09:57:03 +01:00
Krzysztof Laskowski	c459c86f88	High-level server profiler (#13 )	2024-01-16 09:57:29 +01:00
madamczykhabana	41c4f4fa41	Debugging utils (#14 )	2024-01-15 21:05:27 +01:00
Karol Damaszke	a8c5b69e2c	Set default value of LIMIT_HPU_GRAPH to True (#7 )	2024-01-11 14:51:49 +01:00
Karol Damaszke	252ccde104	Control prefill and decode batch size separately (#6 )	2024-01-02 18:21:01 +01:00
Karol Damaszke	1be2d9a8ec	Batch size bucketing (#5 )	2023-12-22 21:53:01 +01:00
jkaniecki	e3dcd7f2c2	Disable tensor caching in HPU Graph execution (#4 )	2023-12-22 13:51:16 +01:00
Karol Damaszke	6436ae86a1	Fix for continuous batching (#1 )	2023-12-11 09:24:09 +01:00
regisss	e5f124b077	Merge tag 'v1.2.0' into v1.2-release	2023-12-06 18:46:16 +01:00
regisss	c09066aeb1	Merge tag 'v1.1.1' into v1.1-release	2023-12-06 09:50:58 +01:00
regisss	cc744ba426	Add changes from Optimum Habana's TGI folder	2023-12-05 11:12:16 +01:00
Nicolas Patry	ba552e1a82	Let each model resolve their own default dtype. (#1287 )	2023-11-28 17:54:26 +01:00
fxmarty	b2b5df0e94	Add RoCm support (#1243 ) This PR adds support for AMD Instinct MI210 & MI250 GPUs, with paged attention and FAv2 support. Remaining items to discuss, on top of possible others: * Should we have a `ghcr.io/huggingface/text-generation-inference:1.1.0+rocm` hosted image, or is it too early? * Should we set up a CI on MI210/MI250? I don't have access to the runners of TGI though. * Are we comfortable with those changes being directly in TGI, or do we need a fork? --------- Co-authored-by: Felix Marty <felix@hf.co> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: Your Name <you@example.com>	2023-11-27 14:08:12 +01:00
Nicolas Patry	ed2a3f617e	Exllama v2 (#1211 ) See #1165 --------- Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-153.ec2.internal>	2023-11-25 22:38:38 +01:00
Vince Jankovics	c6bb76703f	Fix IDEFICS dtype (#1214 ) # What does this PR do?

This forces the use of `bfloat16` for IDEFICS. The issue is that with `float16` the 80b model gives garbage output. Let me know if this solution is not appropriate and I'll adjust accordingly. For the details see below.

The current behaviour:
```sh
$ curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
{"generated_text":""}
```

On closer inspection with:
```python
import requests

headers = {"Content-Type": "application/json"}

query = "What is Deep Learning?"
data = {
    "inputs": query,
    "parameters": {
        "max_new_tokens": 10,
        "return_full_text": True,
        "decoder_input_details": True,
        "do_sample": False,
    },
}

api_url = "http://127.0.0.1:8080"
response = requests.post(api_url + "/generate", headers=headers, json=data).json()

# Print the prefill tokens and the generated tokens as concatenated text.
for i in ['prefill', 'tokens']:
    print(f'### {i}')
    print(repr(''.join([t['text'] for t in response['details'][i]])))
```

Prints:
```
### prefill
'<s>WhatisDeepLearning?'
### tokens
'<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>'
########
```

With the change in this PR it prints:
```
### prefill
'<s>WhatisDeepLearning?'
### tokens
'\n\nDeep Learning is a subset of machine'
```

Note, using the Transformers implementation (with `IdeficsForVisionText2Text.from_pretrained`) produces the latter (correct) output as well. This only happens with the 80b model; the 9b model is not as sensitive to the dtype (as also mentioned in the code).

The reason for "forcing" this in the IDEFICS init method is that if quantization is used, then the dtype cannot be set explicitly. And since it's left as `None`, it's set to `float16` by default [here](`96a982ad8f/server/text_generation_server/models/__init__.py (L90)`). I.e. there's no other way to manually change the dtype if someone is using quantization:
```sh
$ docker run .... ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --dtype bfloat16 --quantize bitsandbytes-nf4
.....
2023-10-31T12:42:26.710401Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-31T12:42:30.315734Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 80, in serve
    raise RuntimeError(
RuntimeError: Only 1 can be set between `dtype` and `quantize`, as they both decide how goes the final model.
rank=0
Error: ShardCannotStart
2023-10-31T12:42:30.414010Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-31T12:42:30.414044Z INFO text_generation_launcher: Shutting down shards
```

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@Narsil what do you think?

--------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-11-23 15:00:09 +01:00
Traun Leyden	e12c34bd25	Load PEFT weights from local directory (#1260 )

# What does this PR do?

Enables PEFT weights to be loaded from a local directory, as opposed to a hf hub repository. It is a continuation of the work in PR https://github.com/huggingface/text-generation-inference/pull/762

Fixes #1259

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? Yes but I don't know how to run the tests for this repo, and it doesn't look like this code is covered anyway
- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. Yes, @Narsil asked for a PR in [this comment](https://github.com/huggingface/text-generation-inference/pull/762#issuecomment-1728089505)
- [x] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). I didn't see any documentation added to the [original PR](https://github.com/huggingface/text-generation-inference/pull/762), and am not sure where this belongs. Let me know and I can add some
- [x] Did you write any new necessary tests? I didn't see any existing test coverage for this python module

## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@Narsil

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-11-23 12:56:17 +01:00
Diwank Singh Tomer	91111a0dc2	Fix missing `trust_remote_code` flag for AutoTokenizer in utils.peft (#1270 )

The PEFT loading function was missing the `trust_remote_code=trust_remote_code` argument, so the custom tokenizer code was not found.

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@Narsil	2023-11-23 12:41:05 +01:00
OlivierDehaene	96a982ad8f	fix: better warmup error	2023-10-25 10:18:58 +02:00
OlivierDehaene	12590fdcce	feat: paged attention v2 (#1183 )	2023-10-23 12:29:25 +02:00

434 Commits