text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-09 15:05:24 +00:00

Author	SHA1	Message	Date
Karol Damaszke	b0de25a285	Don't set rope_scaling for unsupported models (#115 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-04-02 12:12:02 +02:00
Karol Damaszke	7342baa2eb	Add support for rope_scaling and remove is_optimized_for_gaudi (#112 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-29 15:07:32 +01:00
Karol Damaszke	bf5263b88b	Disable watermark with FP8 quantization (#114 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-27 13:32:20 +01:00
jkaniecki	56f00a552b	Adjust warmup to all possible bucket sizes and decode batch size = 1 (#113 )	2024-03-27 11:59:51 +01:00
Karol Damaszke	b45f648483	Add warmup for logits processors (#107 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-18 15:17:47 +01:00
yuanwu2017	a4d5c3f40f	Fix the generate_stream crash in concurrent query (#105 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-03-15 10:54:56 +01:00
Yao Matrix	7149ac30e6	Fix the issue of out of range (#98 ) Signed-off-by: yuanwu <yuan.wu@intel.com> Co-authored-by: yuanwu <yuan.wu@intel.com>	2024-03-13 10:09:53 +01:00
Karol Damaszke	80ae9ead28	Set MAX_TOTAL_TOKENS automatically (#91 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-01 11:25:15 +01:00
Karol Damaszke	a5c788cfe4	Remove redundant fill op (#83 ) (#90 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-03-01 01:32:02 +01:00
Karol Damaszke	03c2123244	Use batched index_copy (#73 ) (#89 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 15:45:16 +01:00
Karol Damaszke	7dbf4bf7a4	Improve tensor slicing performance (#66 ) (#87 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-29 10:48:54 +01:00
Karol Damaszke	3831f1bed5	Add warmup for shift operation (#59 ) (#86 )	2024-02-29 09:19:28 +01:00
Karol Damaszke	022ce1eaaf	Overhead reduction (#58 ) (#85 ) Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>	2024-02-29 09:17:45 +01:00
Karol Damaszke	212136dff8	Log exceptions to debug.log (#52 ) (#84 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 09:14:42 +01:00
Karol Damaszke	c7ccfb87ff	Grouped pad/shift/move operations (#57 ) (#82 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 04:16:44 +01:00
Karol Damaszke	2122acc60f	Add warmup for all possible shapes for prefill #49 (#81 )	2024-02-28 10:40:13 +01:00
Karol Damaszke	31bed905d4	Update habana profiler (#50 ) (#80 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-28 09:57:40 +01:00
Karol Damaszke	d31fb62576	Add more info to high-level profiler events (#46 ) (#79 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-02-28 09:55:50 +01:00
Karol Damaszke	941d36f3fd	Enable deferred token generation (#44 ) (#75 ) Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>	2024-02-27 15:46:40 +01:00
jkaniecki	83b059bd27	Bulk shifting (#40 ) (#70 ) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-26 17:29:56 +01:00
jkaniecki	c3bd8ef445	Add Fp8 support (#42 ) (#71 ) Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com> Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com> Co-authored-by: Grzegorz Morys <gmorys@habana.ai>	2024-02-23 11:52:28 +01:00
jkaniecki	a490847702	Sequence bucketing for prefill (#39 ) (#67 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-23 01:52:14 +01:00
jkaniecki	9ad6086250	Improve habana profile dev experience (#36 ) (#65 ) Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>	2024-02-22 13:57:45 +01:00
jkaniecki	f7ef414e38	Remove unused pad_token_id for filter (#35 ) (#64 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-22 11:24:09 +01:00
jkaniecki	8f590759e3	Prefill optimization by allocating space only for the first output token (#34 ) (#62 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com> Co-authored-by: Karol Damaszke <karol.damaszke@intel.com>	2024-02-22 04:55:43 +01:00
jkaniecki	80303b469c	Do not limit hpu graphs by default (#32 ) (#61 ) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-21 15:38:00 +01:00
jkaniecki	6b6dec9ea1	Transparent tokenizer uses explicit int32 (#31 ) (#60 ) Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>	2024-02-21 14:24:41 +01:00
regisss	2060bb58bf	Fix trust remote code (#55 )	2024-02-19 07:53:24 +01:00
Karol Damaszke	2a7a967de3	Revert prefill optimization and fix accuracy issue in shift operation (#29 ) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com> Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>	2024-01-23 15:19:07 +01:00
jkaniecki	ac3bc0e95e	Removed kv_cache from HPU graph output (#19 )	2024-01-19 15:34:13 +01:00
Karol Damaszke	60f63262db	Prefill optimization by allocating space only for the first token (#17 )	2024-01-19 15:18:35 +01:00
Adam Stachowicz	0b96da89aa	Make tokenizer optional (#12 )	2024-01-19 15:12:04 +01:00
madamczykhabana	381ec38cad	Batch bucketing improvements (#15 )	2024-01-17 10:09:27 +01:00
mrs303	8523f7ef64	Deepspeed terminate (#11 )	2024-01-17 09:57:03 +01:00
Krzysztof Laskowski	c459c86f88	High-level server profiler (#13 )	2024-01-16 09:57:29 +01:00
madamczykhabana	41c4f4fa41	Debugging utils (#14 )	2024-01-15 21:05:27 +01:00
Karol Damaszke	a8c5b69e2c	Set default value of LIMIT_HPU_GRAPH to True (#7 )	2024-01-11 14:51:49 +01:00
Karol Damaszke	252ccde104	Control prefill and decode batch size separately (#6 )	2024-01-02 18:21:01 +01:00
Karol Damaszke	1be2d9a8ec	Batch size bucketing (#5 )	2023-12-22 21:53:01 +01:00
jkaniecki	e3dcd7f2c2	Disable tensor caching in HPU Graph execution (#4 )	2023-12-22 13:51:16 +01:00
Karol Damaszke	6436ae86a1	Fix for continuous batching (#1 )	2023-12-11 09:24:09 +01:00
regisss	e5f124b077	Merge tag 'v1.2.0' into v1.2-release	2023-12-06 18:46:16 +01:00
regisss	c09066aeb1	Merge tag 'v1.1.1' into v1.1-release	2023-12-06 09:50:58 +01:00
regisss	cc744ba426	Add changes from Optimum Habana's TGI folder	2023-12-05 11:12:16 +01:00
Nicolas Patry	ba552e1a82	Let each model resolve their own default dtype. (#1287 )	2023-11-28 17:54:26 +01:00
fxmarty	b2b5df0e94	Add RoCm support (#1243 ) This PR adds support for AMD Instinct MI210 & MI250 GPUs, with paged attention and FAv2 support. Remaining items to discuss, on top of possible others: * Should we have a `ghcr.io/huggingface/text-generation-inference:1.1.0+rocm` hosted image, or is it too early? * Should we set up a CI on MI210/MI250? I don't have access to the runners of TGI though. * Are we comfortable with those changes being directly in TGI, or do we need a fork? --------- Co-authored-by: Felix Marty <felix@hf.co> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: Your Name <you@example.com>	2023-11-27 14:08:12 +01:00
Nicolas Patry	ed2a3f617e	Exllama v2 (#1211 ) # What does this PR do? See #1165 --------- Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-153.ec2.internal>	2023-11-25 22:38:38 +01:00
Vince Jankovics	c6bb76703f	Fix IDEFICS dtype (#1214 )
# What does this PR do?
This forces the use of `bfloat16` for IDEFICS. The issue is that with `float16` the 80b model gives garbage output. Let me know if this solution is not appropriate and I'll adjust accordingly. For the details see below.
The current behaviour:
```sh
$ curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
{"generated_text":""}
```
On closer inspection with:
```python
import requests

headers = { "Content-Type": "application/json"}

query = "What is Deep Learning?"
data = {
    "inputs": query,
    "parameters": {
        "max_new_tokens": 10,
        "return_full_text": True,
        "decoder_input_details": True,
        "do_sample": False,
    },
}

api_url = "http://127.0.0.1:8080"
response = requests.post(api_url + "/generate", headers=headers, json=data).json()

for i in ['prefill', 'tokens']:
    print(f'### {i}')
    print(repr(''.join([t['text'] for t in response['details'][i]])))
```
Prints:
```
### prefill
'<s>WhatisDeepLearning?'
### tokens
'<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>'
########
```
With the change in this PR it prints:
```
### prefill
'<s>WhatisDeepLearning?'
### tokens
'\n\nDeep Learning is a subset of machine'
```
Note, using the Transformers implementation (with `IdeficsForVisionText2Text.from_pretrained`) produces the latter (correct) output as well. This only happens with the 80b model, the 9b model is not as sensitive to the dtype (as also mentioned in the code).
The reason for "forcing" this in the IDEFICS init method, is because if quantization is used, then the dtype cannot be set explicitly. And since it's left as `None`, it's set to `float16` by default [here](`96a982ad8f/server/text_generation_server/models/__init__.py (L90)`). I.e. there's no other way to manually change the dtype if someone is using quantization:
```sh
$ docker run .... ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --dtype bfloat16 --quantize bitsandbytes-nf4
.....
2023-10-31T12:42:26.710401Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-31T12:42:30.315734Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 80, in serve
    raise RuntimeError(
RuntimeError: Only 1 can be set between `dtype` and `quantize`, as they both decide how goes the final model.
 rank=0
Error: ShardCannotStart
2023-10-31T12:42:30.414010Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-31T12:42:30.414044Z INFO text_generation_launcher: Shutting down shards
```
@Narsil what do you think?
--------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-11-23 15:00:09 +01:00
Traun Leyden	e12c34bd25	Load PEFT weights from local directory (#1260 )
# What does this PR do?
Enables PEFT weights to be loaded from a local directory, as opposed to a hf hub repository. It is a continuation of the work in PR https://github.com/huggingface/text-generation-inference/pull/762
Fixes #1259
## Before submitting
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? Yes but I don't know how to run the tests for this repo, and it doesn't look like this code is covered anyway
- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Yes, @Narsil asked for a PR in [this comment](https://github.com/huggingface/text-generation-inference/pull/762#issuecomment-1728089505)
- [x] Did you make sure to update the documentation with your changes? I didn't see any documentation added to the [original PR](https://github.com/huggingface/text-generation-inference/pull/762), and am not sure where this belongs. Let me know and I can add some
- [x] Did you write any new necessary tests? I didn't see any existing test coverage for this python module
## Who can review?
@Narsil
--------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-11-23 12:56:17 +01:00
Diwank Singh Tomer	91111a0dc2	Fix missing `trust_remote_code` flag for AutoTokenizer in utils.peft (#1270 ) Peft loading function was missing the `trust_remote_code=trust_remote_code` argument causing the custom tokenizer code to be not found. @Narsil	2023-11-23 12:41:05 +01:00

236 Commits