Commit Graph

340 Commits

Author SHA1 Message Date
jkaniecki
56f00a552b
Adjust warmup to all possible bucket sizes and decode batch size = 1 (#113) 2024-03-27 11:59:51 +01:00
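Note on the warmup change above: on graph-compiled backends the usual pattern is one dummy forward pass per supported shape, so compilation cost is paid before serving rather than on the first real request. A minimal sketch, with `model` and `make_dummy_batch` as hypothetical stand-ins, not the repository's actual code:

```python
# Hedged sketch of bucketed warmup: run one dummy forward per
# (batch_size, seq_len) bucket so every serving shape is compiled up front.
# `model` and `make_dummy_batch` are illustrative stand-ins.
import itertools

def warmup(model, batch_buckets, seq_buckets, make_dummy_batch):
    # Prefill warmup: every combination of batch and sequence bucket.
    for bs, sl in itertools.product(batch_buckets, seq_buckets):
        model.forward(make_dummy_batch(batch_size=bs, seq_len=sl))
    # Decode warmup with batch size = 1, as the commit title describes.
    model.forward(make_dummy_batch(batch_size=1, seq_len=1))
```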
Karol Damaszke
b45f648483
Add warmup for logits processors (#107)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-18 15:17:47 +01:00
yuanwu2017
a4d5c3f40f
Fix the generate_stream crash in concurrent query (#105)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-03-15 10:54:56 +01:00
Yao Matrix
7149ac30e6
Fix the issue of out of range (#98)
Signed-off-by: yuanwu <yuan.wu@intel.com>
Co-authored-by: yuanwu <yuan.wu@intel.com>
2024-03-13 10:09:53 +01:00
Karol Damaszke
80ae9ead28
Set MAX_TOTAL_TOKENS automatically (#91)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-01 11:25:15 +01:00
Karol Damaszke
a5c788cfe4
Remove redundant fill op (#83) (#90)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-03-01 01:32:02 +01:00
Karol Damaszke
03c2123244
Use batched index_copy (#73) (#89)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 15:45:16 +01:00
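For context on the batched `index_copy` change above: PyTorch's `index_copy_` applies many scattered row updates in a single call instead of a per-row Python loop. A small illustrative sketch; the shapes are invented, not the real cache layout:

```python
# Sketch: one batched index_copy_ call versus a loop of row assignments.
import torch

cache = torch.zeros(8, 4)        # destination buffer (e.g. a cache slab)
slots = torch.tensor([1, 3, 6])  # rows to update
values = torch.ones(3, 4)        # one new row per slot

# Per-row loop: one small copy (and kernel launch) per element.
for i, s in enumerate(slots):
    cache[s] = values[i]

# Batched: a single call moves all rows at once.
cache.index_copy_(0, slots, values)
```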
Karol Damaszke
7dbf4bf7a4
Improve tensor slicing performance (#66) (#87)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-29 10:48:54 +01:00
Karol Damaszke
3831f1bed5
Add warmup for shift operation (#59) (#86) 2024-02-29 09:19:28 +01:00
Karol Damaszke
022ce1eaaf
Overhead reduction (#58) (#85)
Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
2024-02-29 09:17:45 +01:00
Karol Damaszke
212136dff8
Log exceptions to debug.log (#52) (#84)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 09:14:42 +01:00
Karol Damaszke
c7ccfb87ff
Grouped pad/shift/move operations (#57) (#82)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 04:16:44 +01:00
Karol Damaszke
2122acc60f
Add warmup for all possible shapes for prefill #49 (#81) 2024-02-28 10:40:13 +01:00
Karol Damaszke
31bed905d4
Update habana profiler (#50) (#80)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-28 09:57:40 +01:00
Karol Damaszke
d31fb62576
Add more info to high-level profiler events (#46) (#79)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-02-28 09:55:50 +01:00
Karol Damaszke
941d36f3fd
Enable deferred token generation (#44) (#75)
Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>
2024-02-27 15:46:40 +01:00
jkaniecki
83b059bd27
Bulk shifting (#40) (#70)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-26 17:29:56 +01:00
regisss
8f4aba6ad3
Update dependencies (#69) 2024-02-25 13:07:47 +01:00
jkaniecki
c3bd8ef445
Add Fp8 support (#42) (#71)
Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
Co-authored-by: Grzegorz Morys <gmorys@habana.ai>
2024-02-23 11:52:28 +01:00
jkaniecki
a490847702
Sequence bucketing for prefill (#39) (#67)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-23 01:52:14 +01:00
jkaniecki
9ad6086250
Improve habana profile dev experience (#36) (#65)
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
2024-02-22 13:57:45 +01:00
jkaniecki
f7ef414e38
Remove unused pad_token_id for filter (#35) (#64)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-22 11:24:09 +01:00
jkaniecki
8f590759e3
Prefill optimization by allocating space only for the first output token (#34) (#62)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
Co-authored-by: Karol Damaszke <karol.damaszke@intel.com>
2024-02-22 04:55:43 +01:00
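The optimization named above rests on a simple observation: a prefill pass emits exactly one new token, so buffers sized for the whole generation budget waste memory at that stage. A toy sketch of the sizing difference, with all dimensions invented:

```python
# Toy illustration only; dimensions are invented.
import torch

batch, hidden, max_new_tokens = 4, 64, 128

# Naive: reserve output space for every future token at prefill time.
naive = torch.empty(batch, max_new_tokens, hidden)

# Optimized: prefill produces exactly one token, so reserve a single slot
# and let decode steps allocate (or grow) as generation proceeds.
prefill_only = torch.empty(batch, 1, hidden)

print(naive.numel() // prefill_only.numel())  # 128x less at prefill
```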
jkaniecki
80303b469c
Do not limit hpu graphs by default (#32) (#61)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-21 15:38:00 +01:00
jkaniecki
6b6dec9ea1
Transparent tokenizer uses explicit int32 (#31) (#60)
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
2024-02-21 14:24:41 +01:00
regisss
a4d3a00d98
Fix dependencies (#56) 2024-02-19 10:19:23 +01:00

regisss
dca9ac6508
Revert "Solve dependency issue"
This reverts commit ea2b93dd75.
2024-02-19 07:28:04 +00:00
regisss
ea2b93dd75
Solve dependency issue 2024-02-19 07:26:37 +00:00
regisss
2060bb58bf
Fix trust remote code (#55) 2024-02-19 07:53:24 +01:00
Karol Damaszke
2a7a967de3
Revert prefill optimization and fix accuracy issue in shift operation (#29)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>
2024-01-23 15:19:07 +01:00
jkaniecki
ac3bc0e95e
Removed kv_cache from HPU graph output (#19) 2024-01-19 15:34:13 +01:00
Karol Damaszke
60f63262db
Prefill optimization by allocating space only for the first token (#17) 2024-01-19 15:18:35 +01:00
Adam Stachowicz
0b96da89aa
Make tokenizer optional (#12) 2024-01-19 15:12:04 +01:00
madamczykhabana
381ec38cad
Batch bucketing improvements (#15) 2024-01-17 10:09:27 +01:00
mrs303
8523f7ef64
Deepspeed terminate (#11) 2024-01-17 09:57:03 +01:00
Krzysztof Laskowski
c459c86f88
High-level server profiler (#13) 2024-01-16 09:57:29 +01:00
madamczykhabana
41c4f4fa41
Debugging utils (#14) 2024-01-15 21:05:27 +01:00
Karol Damaszke
a8c5b69e2c
Set default value of LIMIT_HPU_GRAPH to True (#7) 2024-01-11 14:51:49 +01:00
Karol Damaszke
252ccde104
Control prefill and decode batch size separately (#6) 2024-01-02 18:21:01 +01:00
Karol Damaszke
1be2d9a8ec
Batch size bucketing (#5) 2023-12-22 21:53:01 +01:00
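Batch size bucketing, as in the commit above, typically means rounding the live batch size up to the nearest configured bucket (padding with dummy requests) so only a handful of distinct graph shapes ever get compiled. A hedged sketch with an invented bucket list:

```python
# Hedged sketch of batch-size bucketing; BUCKETS is an invented example list.
BUCKETS = [1, 2, 4, 8, 16, 32]

def bucket_batch_size(n: int) -> int:
    """Round a live batch size up to the nearest configured bucket."""
    for b in BUCKETS:
        if n <= b:
            return b
    return BUCKETS[-1]  # clamp to the largest bucket

assert bucket_batch_size(3) == 4
assert bucket_batch_size(20) == 32
```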
jkaniecki
e3dcd7f2c2
Disable tensor caching in HPU Graph execution (#4) 2023-12-22 13:51:16 +01:00
Karol Damaszke
6436ae86a1
Fix for continuous batching (#1) 2023-12-11 09:24:09 +01:00
regisss
e5f124b077
Merge tag 'v1.2.0' into v1.2-release 2023-12-06 18:46:16 +01:00
regisss
c09066aeb1
Merge tag 'v1.1.1' into v1.1-release 2023-12-06 09:50:58 +01:00
regisss
cc744ba426
Add changes from Optimum Habana's TGI folder 2023-12-05 11:12:16 +01:00
OlivierDehaene
ccd5725a0c
v1.2.0 2023-11-30 15:18:15 +01:00
Nicolas Patry
ba552e1a82
Let each model resolve their own default dtype. (#1287) 2023-11-28 17:54:26 +01:00
Nicolas Patry
3c71c656c7
make install-flash-attn-v2-cuda should work like make install-flash-attn-v2 used to work. (#1294) 2023-11-28 16:28:40 +01:00
fxmarty
b2b5df0e94
Add RoCm support (#1243)
This PR adds support for AMD Instinct MI210 & MI250 GPUs, with paged
attention and FAv2 support.

Remaining items to discuss, among possible others:
* Should we have a
`ghcr.io/huggingface/text-generation-inference:1.1.0+rocm` hosted image,
or is it too early?
* Should we set up a CI on MI210/MI250? I don't have access to the
runners of TGI though.
* Are we comfortable with those changes being directly in TGI, or do we
need a fork?

Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-11-27 14:08:12 +01:00
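Since the PR above brings paged attention to these GPUs, a one-paragraph refresher: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks. A conceptual sketch with invented numbers:

```python
# Conceptual sketch of a paged-attention block table; numbers are invented.
BLOCK_SIZE = 16
block_table = [7, 2, 9]  # physical cache blocks owned by one sequence

def physical_slot(pos: int) -> tuple[int, int]:
    # Map a logical token position to (physical block id, offset in block).
    return block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

assert physical_slot(0) == (7, 0)    # token 0: block 7, offset 0
assert physical_slot(17) == (2, 1)   # token 17: block 2, offset 1
```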
Nicolas Patry
ed2a3f617e
Exllama v2 (#1211)
See #1165
Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-153.ec2.internal>
2023-11-25 22:38:38 +01:00