Commit Graph

205 Commits

Author SHA1 Message Date
Nicolas Patry
686b56a0c0 Small cleanup. (#1560)
Using a single `os.getenv` statement instead of multiple.
Should make truthful values easier to catch

In the end didn't move towards full CLI because modifying globals in
Python is error prone (depends on code import order).

Added an error when mamba is launched with TP.


# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-04-24 14:42:35 +03:00
Nicolas Patry
e93cc34a22 Improving mamba runtime by using updates (#1552)
- Move float16 to bfloat16, which has less imprecisions (load test are
  failing with the update kernels + f16, all working under bf16).

  Another note, is that we are not respecting the layer norm in f32
  defined in the configuration (this is OK in my book, but that could
  impact the f16 precision)

- Moved to update kernels. Triton overhead is super high, removed by
  switching to cuda graphs works great (update cuda graph is available
  in TRT-LLM if needed, seems *exactly* like the regular ssm kernel.

- Moved inference_params struct in order to make only 2 tensors, to
  reduce the overhead of copying back and forth to the cuda graphs.

- Left over overhead seems entirely in the tokenization bit. (Still 4
  copies are paid before launching the graph)


# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-04-24 13:21:39 +03:00
OlivierDehaene
0c207f71ed feat: experimental support for cuda graphs (#1428)
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-04-24 13:15:45 +03:00
Ilyas Moutawwakil
777e519277 ROCm AWQ support (#1514)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

This PR adds the possibility to run AWQ models with Exllama/GPTQ
kernels, specifically for ROCm devices that support Exllama kernels but
not AWQ's GEMM.

This is done by :
- un-packing, reordering and re-packing AWQ weights when `--quantize
gptq` but the model's `quant_method=awq`.
- avoiding overflows when adding 1 to zeros in exllama and triton.

Ref: https://github.com/casper-hansen/AutoAWQ/pull/313

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-04-24 09:21:34 +00:00
OlivierDehaene
f1d8da3ba6 feat(server): add frequency penalty (#1541) 2024-04-24 08:43:50 +00:00
drbh
51a4e62ed4 Impl simple mamba model (#1480)
This draft PR is a work in progress implementation of the mamba model.
This PR currently loads weights, and produces correct logits after a
single pass.

This PR still needs to correctly integrate this model so it produces
tokens as expected, and apply optimization to avoid all copies during
runtime/unnecessary operations.

[Mamba: Linear-Time Sequence Modeling with Selective State Spaces
(Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752)
https://github.com/johnma2006/mamba-minimal

https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs
https://github.com/huggingface/transformers/pull/28094

Notes: this dev work is currently targeting `state-spaces/mamba-130m`,
so if you want to test please use that model. Additionally when starting
the router the prefill needs to be limited: `cargo run --
--max-batch-prefill-tokens 768 --max-input-length 768`

Integration tests have been added and basic functionality such as model
loading is supported.

```bash
cd integration-tests
pytest -vv models/test_fused_kernel_mamba.py
```
- [x] add tests
- [x] load model
- [x] make simple request
- [ ] resolve warmup issue
- [ ] resolve output issues

fetching models tested during dev
```bash
text-generation-server download-weights state-spaces/mamba-130m
text-generation-server download-weights state-spaces/mamba-1.4b
text-generation-server download-weights state-spaces/mamba-2.8b
```

The server can be run
```bash
cd server
 MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b
```

router
```bash
cargo run
```

make a request
```bash
curl -s localhost:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq
```

response
```json
{
  "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data."
}
```

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-04-23 11:45:11 +03:00
Dean Wyatte
27daa511ec GPTNeoX: Use static rotary embedding (#1498)
# What does this PR do?

`transformers` 4.35 removed rotary embeddings from GPTNeoX's weights
([link to line
diff](253f9a3f97 (diff-0e2a05d86c82e96f516db8c14070ceb36f53ca44c6bc21a9cd92ad2e777b9cf1R298))).
This applies the same fix as
https://github.com/huggingface/text-generation-inference/pull/793 which
generates them on-the-fly using the appropriate value from the config
file

Fixes
https://github.com/huggingface/text-generation-inference/issues/1460

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [x] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

@OlivierDehaene OR @Narsil
2024-04-23 09:21:21 +03:00
Nicolas Patry
433934519c Fixing top_n_tokens. (#1497)
Superseeds #1459

The fix works as follows.
We updated next_token_chooser to return all logprbs, then
batch_top_n_tokens, now also gets accepted_ids + speculated_length (so
it knows how to interpret the flat logprobs).

We then update the code to return lists ot `Tokens` that it expects.
<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @

@OlivierDehaene OR @Narsil

 -->
2024-04-23 08:49:24 +03:00
OlivierDehaene
efd4b97d15 v1.4.0 (#1494) 2024-04-22 15:47:42 +03:00
Nicolas Patry
b064b33e8b Add sealion mpt support (#1477)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

---------

Co-authored-by: Choon Meng Tan <choonmeng@aisingapore.org>
Co-authored-by: David Ong Tat-Wee <13075447+ongtw@users.noreply.github.com>
2024-04-22 15:37:05 +03:00
drbh
b2fc097b2b feat: adds phi model (#1442)
This PR adds basic modeling for phi-2

run
```bash
text-generation-server \
    serve \
    microsoft/phi-2 \
    --revision 834565c23f9b28b96ccbeabe614dd906b6db551a
```

test
```bash
curl -s localhost:3000/generate \
   -X POST \
   -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
   -H 'Content-Type: application/json' | jq .
```

notes
- recently (~1 day ago) the Phi weights and model were updated to
accommodate adding [GQA/MQA attention to the
model.](https://github.com/huggingface/transformers/pull/28163) This
impl expects the original model format so a fixed revision is required
at the moment.
- this PR only includes a basic implementation of the model and can
later be extended for support Flash and Sharded versions as well as make
use of better optimization
2024-04-22 13:06:38 +03:00
PYNing
e930ad9cec Fix local load for Medusa (#1420)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Close #1418 
Close #1415

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-04-22 09:30:41 +03:00
OlivierDehaene
7eeabb9cda feat: update exllamav2 kernels (#1370)
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-04-22 09:02:53 +03:00
OlivierDehaene
ecb0db45af fix: fix logic if sliding window key is not present in config (#1352) 2024-04-19 14:56:10 +03:00
OlivierDehaene
a95e6d603d feat: relax mistral requirements (#1351)
Close #1253
Close #1279
2024-04-19 14:50:24 +03:00
OlivierDehaene
bb6200503c fix: max_past default value must be -1, not 0 (#1348) 2024-04-19 14:18:05 +03:00
OlivierDehaene
5c9ef069ed feat: add more latency metrics in forward (#1346) 2024-04-19 13:41:34 +03:00
OlivierDehaene
c974437ba7 fix: fix gpt-q params loading 2024-04-19 12:12:50 +03:00
OlivierDehaene
f9b58ac7a1 feat: add quant to mixtral (#1337) 2024-04-18 16:32:50 +03:00
OlivierDehaene
09c556dbd7 v1.3.1 2024-04-18 16:32:07 +03:00
OlivierDehaene
79f268f95a chore: formatting 2024-04-18 16:26:00 +03:00
OlivierDehaene
9aef902982 feat: mixtral (#1328) 2024-04-18 12:39:52 +00:00
Nicolas Patry
a7f52f3812 Speculative (#1308) 2024-04-18 12:39:39 +00:00
Karol Damaszke
30cc78773e
Skip server tests of not enabled models (#125)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-04-09 14:15:41 +02:00
Karol Damaszke
d957e32601
Add Habana copyright header (#122)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-04-08 18:06:21 +02:00
Karol Damaszke
b0de25a285
Don't set rope_scaling for unsupported models (#115)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-04-02 12:12:02 +02:00
Karol Damaszke
7342baa2eb
Add support for rope_scaling and remove is_optimized_for_gaudi (#112)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-29 15:07:32 +01:00
Karol Damaszke
bf5263b88b
Disable watermark with FP8 quantization (#114)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-27 13:32:20 +01:00
jkaniecki
56f00a552b
Adjust warmup to all possible bucket sizes and decode batch size = 1 (#113) 2024-03-27 11:59:51 +01:00
Karol Damaszke
b45f648483
Add warmup for logits processors (#107)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-18 15:17:47 +01:00
yuanwu2017
a4d5c3f40f
Fix the generate_stream crash in concurrent query (#105)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-03-15 10:54:56 +01:00
Karol Damaszke
80ae9ead28
Set MAX_TOTAL_TOKENS automatically (#91)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-01 11:25:15 +01:00
Karol Damaszke
a5c788cfe4
Remove redundant fill op (#83) (#90)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-03-01 01:32:02 +01:00
Karol Damaszke
03c2123244
Use batched index_copy (#73) (#89)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 15:45:16 +01:00
Karol Damaszke
7dbf4bf7a4
Improve tensor slicing performance (#66) (#87)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-29 10:48:54 +01:00
Karol Damaszke
3831f1bed5
Add warmup for shift operation (#59) (#86) 2024-02-29 09:19:28 +01:00
Karol Damaszke
022ce1eaaf
Overhead reduction (#58) (#85)
Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
2024-02-29 09:17:45 +01:00
Karol Damaszke
212136dff8
Log exceptions to debug.log (#52) (#84)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 09:14:42 +01:00
Karol Damaszke
c7ccfb87ff
Grouped pad/shift/move operations (#57) (#82)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 04:16:44 +01:00
Karol Damaszke
2122acc60f
Add warmup for all possible shapes for prefill #49 (#81) 2024-02-28 10:40:13 +01:00
Karol Damaszke
31bed905d4
Update habana profiler (#50) (#80)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-28 09:57:40 +01:00
Karol Damaszke
941d36f3fd
Enable deferred token generation (#44) (#75)
Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>
2024-02-27 15:46:40 +01:00
jkaniecki
83b059bd27
Bulk shifting (#40) (#70)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-26 17:29:56 +01:00
jkaniecki
c3bd8ef445
Add Fp8 support (#42) (#71)
Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
Co-authored-by: Grzegorz Morys <gmorys@habana.ai>
2024-02-23 11:52:28 +01:00
jkaniecki
a490847702
Sequence bucketing for prefill (#39) (#67)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-23 01:52:14 +01:00
jkaniecki
9ad6086250
Improve habana profile dev experience (#36) (#65)
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
2024-02-22 13:57:45 +01:00
jkaniecki
8f590759e3
Prefill optimization by allocating space only for the first output token (#34) (#62)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
Co-authored-by: Karol Damaszke <karol.damaszke@intel.com>
2024-02-22 04:55:43 +01:00
jkaniecki
80303b469c
Do not limit hpu graphs by default (#32) (#61)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-21 15:38:00 +01:00
Karol Damaszke
2a7a967de3
Revert prefill optimization and fix accuracy issue in shift operation (#29)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>
2024-01-23 15:19:07 +01:00
jkaniecki
ac3bc0e95e
Removed kv_cache from HPU graph output (#19) 2024-01-19 15:34:13 +01:00