Commit Graph

29 Commits

Author SHA1 Message Date
OlivierDehaene
ab96b9aec3
feat(server): support new falcon config (#712) 2023-07-27 18:38:57 +02:00
OlivierDehaene
3b71c38558
feat(server): flash attention v2 (#624) 2023-07-18 16:21:18 +02:00
Nicolas Patry
1da07e85aa
feat(server): Add Non flash MPT. (#514)
# What does this PR do?


This adds a non-flash version of MPT.
Flash is harder because we need to create a bias-ready CUDA kernel for
flash attention.

Fixes https://github.com/huggingface/text-generation-inference/issues/361
Fixes https://github.com/huggingface/text-generation-inference/issues/491
Fixes https://github.com/huggingface/text-generation-inference/issues/290
2023-07-03 13:01:46 +02:00
Nicolas Patry
ecf6dc3a5a
feat: Add the option to force another dtype than f16. (#513) 2023-06-30 20:30:09 +02:00
Nicolas Patry
aefde28b45
feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)
Let's start discussing implementation.

- Need to expose the quantization scripts (either include them here or add
docs on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa).
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently this means checking for quantization at every place we use
`get_{tensor|sharded}`.

My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic, as sketched below.
This might require some thinking, but ultimately the
`qweight,qzeros,scales,g_idx` tensors should live in a single place,
independent of bias presence.
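
As a rough illustration of that idea (hypothetical helper and attribute names, not the actual `utils/layer.py` API), the four GPTQ tensors could be resolved behind a single loader so call sites never check for quantization themselves:

```python
from typing import Optional

import torch


class GPTQParams:
    """Groups the four GPTQ tensors of one linear layer (illustrative only)."""

    def __init__(self, qweight, qzeros, scales, g_idx, bias: Optional[torch.Tensor] = None):
        self.qweight = qweight
        self.qzeros = qzeros
        self.scales = scales
        self.g_idx = g_idx
        self.bias = bias  # bias handling stays independent of quantization


def load_gptq(weights, prefix: str) -> GPTQParams:
    # `weights` stands in for the object behind get_{tensor|sharded};
    # quantization is resolved here, in one place, not at every call site.
    return GPTQParams(
        qweight=weights.get_tensor(f"{prefix}.qweight"),
        qzeros=weights.get_tensor(f"{prefix}.qzeros"),
        scales=weights.get_tensor(f"{prefix}.scales"),
        g_idx=weights.get_tensor(f"{prefix}.g_idx"),
    )
```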


---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-06-26 12:27:01 +02:00
OlivierDehaene
53aa9194c8
fix(server): fix warpers on CPU (#472)
Closes #471
2023-06-20 11:06:10 +02:00
OlivierDehaene
ece7ffa40a
feat(server): improve flash attention import errors (#465)
@lewtun, is this enough?

Closes #458
Closes #456
2023-06-19 09:53:45 +02:00
Nicolas Patry
abd58ff82c
feat(server): Rework model loading (#344)
# What does this PR do?

Reworked the loading logic. The idea is to use cleaner loading code:

- Remove the need for `no_init_weights`
- Remove all the weird `bnb_linear`, `load_weights`, and
`post_load_weights` helpers.

New code layout:

- New class `Weights` in charge of loading the weights from
multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code that knows what kind of
sharding we need plus an eventual `all_reduce`. They do not inherit from
Linear, but they contain some kind of Linear instead (see the sketch below)
- the contained linear can be FastLinear, BnbLinear, or, next, GPTQ
Linear.
- All modeling code is explicitly made for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)

![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
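
A minimal sketch of the "shell" idea, with assumed class and argument names rather than the exact TGI layers: the wrapper owns the process group and the `all_reduce`, and delegates the matmul to whatever Linear it was given.

```python
import torch
from torch import nn


class TensorParallelRowLinear(nn.Module):
    """Shell layer: not a Linear itself, it wraps one and adds the collective."""

    def __init__(self, linear: nn.Module, process_group=None):
        super().__init__()
        # `linear` can be a FastLinear, a bitsandbytes Linear, a GPTQ Linear, ...
        self.linear = linear
        self.process_group = process_group

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.linear(x)
        # With no process group (or a single rank) this is a no-op, so the same
        # modeling code serves sharded and non-sharded runs.
        if self.process_group is not None and self.process_group.size() > 1:
            torch.distributed.all_reduce(out, group=self.process_group)
        return out
```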

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
2023-06-08 14:51:52 +02:00
OlivierDehaene
e7248fe90e
v0.8.2 2023-06-01 19:49:13 +02:00
OlivierDehaene
c0928e6f26
feat(server): remove trust_remote_code requirement for falcon models (#396) 2023-06-01 12:07:41 +02:00
OlivierDehaene
b8b950b37c
feat(server): support RefinedWeb models (#379) 2023-05-30 18:25:19 +02:00
CL-Shang
5fde8d9991
Fix issue when load AutoModelForSeq2SeqLM model (#370) 2023-05-26 12:31:47 +02:00
OlivierDehaene
e3e487dc71
feat(server): support trust_remote_code (#363) 2023-05-23 20:40:39 +02:00
Nicolas Patry
73d84c6ee5
Hotfixes for santacoder/bigcode. (#294)
# What does this PR do?

Hotfixes:

- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes the linked lm_head vs wte_embedding (safetensors files do not
contain the key when the file is correctly sharded, whereas PyTorch
copies the tensor); see the sketch below.
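
A hedged sketch of that tied-weight fallback (hypothetical `weights` helper, not the exact TGI code): when the sharded safetensors checkpoint carries no `lm_head.weight`, reuse the word-token embedding.

```python
from torch import nn


def load_lm_head(weights) -> nn.Linear:
    """Illustrative only: fall back to the wte embedding when lm_head is absent."""
    try:
        # Correctly sharded safetensors files may simply not contain this key.
        weight = weights.get_tensor("lm_head.weight")
    except KeyError:
        # Tied weights: reuse the embedding matrix, which PyTorch checkpoints
        # effectively duplicate under both names.
        weight = weights.get_tensor("transformer.wte.weight")
    lm_head = nn.Linear(weight.shape[1], weight.shape[0], bias=False)
    lm_head.weight = nn.Parameter(weight)
    return lm_head
```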



---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-05-15 10:35:20 +02:00
Nicolas Patry
76a48cd365
feat(server): GPTQ quantization (step1) (#277)
Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing the type
everywhere), except for the CLI, to get proper validation.
- Updated all models to gracefully handle the new values (error out on an
unknown value, or on gptq since it is not implemented yet); sketched below.
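
A small sketch of the Python side of that change (assumed names, not the actual server code): the flag becomes an optional string that is validated once, instead of a bare bool.

```python
from typing import Optional

SUPPORTED_QUANTIZE = {"bitsandbytes", "gptq"}


def get_model(model_id: str, quantize: Optional[str] = None):
    # Error out on unknown values; gptq is accepted as a value but rejected
    # here because it is not implemented yet at this step.
    if quantize is not None and quantize not in SUPPORTED_QUANTIZE:
        raise ValueError(f"Unknown quantization value: {quantize}")
    if quantize == "gptq":
        raise NotImplementedError("gptq quantization is not implemented yet")
    # ... model loading would continue here ...
```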

2023-05-12 14:46:41 +02:00
Ehsan M. Kermani
f092ba9b22
feat(server): add watermarking tests (#248) 2023-04-27 19:16:35 +02:00
OlivierDehaene
b6ee0ec7b0
feat(router): add git sha to info route (#208) 2023-04-19 21:36:59 +02:00
OlivierDehaene
a88c54bb4c
feat(server): check cuda capability when importing flash models (#201)
close #198
2023-04-19 12:52:37 +02:00
OlivierDehaene
e14ae3b5e9
feat(server): support quantization for flash models (#200)
closes #197
2023-04-19 12:51:11 +02:00
OlivierDehaene
7a1ba58557
fix(docker): fix docker image dependencies (#187) 2023-04-17 00:26:47 +02:00
OlivierDehaene
880a76eed5
feat(server): support sharded santacoder (#167) 2023-04-12 17:18:08 +02:00
OlivierDehaene
f26dfd0dc1
feat(server): support OPT models (#55)
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
2023-04-11 19:16:41 +02:00
OlivierDehaene
299217c95c
feat(server): add flash attention llama (#144) 2023-04-11 16:38:22 +02:00
OlivierDehaene
c0aeb32583
feat(server): flash santacoder (#153) 2023-04-03 19:06:42 +02:00
Nick Hill
462530c2b0
fix(server): Avoid using try/except to determine kind of AutoModel (#142) 2023-03-27 09:23:22 +02:00
OlivierDehaene
d6a93fe992
fix(server): fix flash-neox scores warping (#137) 2023-03-24 18:21:41 +01:00
OlivierDehaene
05e9a796cc
feat(server): flash neoX (#133) 2023-03-24 14:02:14 +01:00
OlivierDehaene
8ad60b752f
fix(server): add position ids to neox (#126) 2023-03-15 13:12:49 +01:00
OlivierDehaene
3fef90d50f
feat(clients): Python client (#103) 2023-03-07 18:52:22 +01:00