Commit Graph

350 Commits

Author SHA1 Message Date
drbh
a7556ba800 fix: refactors and helpful comments 2024-06-24 13:39:56 +00:00
drbh
4f1543d3c7 fix: refactors and adjust flash llama lora logic 2024-06-19 16:13:42 +00:00
drbh
224455f389 Merge branch 'main' into lora-internal 2024-06-18 09:50:41 -04:00
Daniël de Kok
c8c7ccd31e
Set maximum grpc message receive size to 2GiB (#2075)
* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which doesn't really work well for
multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
2024-06-17 16:40:44 +02:00
Daniël de Kok
e903770897
Support different image sizes in prefill in VLMs (#2065)
When a batch contained images if different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-06-17 10:49:41 +02:00
drbh
1104885f00
Merge branch 'main' into lora-internal 2024-06-14 10:06:15 -04:00
drbh
0e1c28cafd fix: merge 'main' into lora-internal to resolve conflicts 2024-06-14 14:02:33 +00:00
Tiezhen WANG
96b7b40ca3
Update the link for qwen2 (#2068)
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-06-14 11:59:33 +02:00
Daniël de Kok
093a27c528
Add support for GPTQ Marlin (#2052)
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.
2024-06-14 09:45:42 +02:00
drbh
aa88c4fd3a fix: add lora kernel to dockerfile, support running without kernels and refactors 2024-06-14 00:35:07 +00:00
OlivierDehaene
90184df79c
fix(layers): fix SuRotaryEmbedding (#2060)
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
2024-06-12 18:24:47 +02:00
OlivierDehaene
521de6cacd
fix(server): fix OPT implementation (#2061) 2024-06-12 18:22:20 +02:00
drbh
ce40ad26fd fix: add model_id to IdeficsCausalLM 2024-06-10 10:24:21 -04:00
drbh
101b95adc4 fix: update all models forwards to include adapter_data 2024-06-10 10:24:21 -04:00
drbh
1deb372564 fix: add adapter_data param to phi and neox 2024-06-10 10:24:21 -04:00
drbh
b1169273fd fix: add adapter_data param and avoid missing layers 2024-06-10 10:24:21 -04:00
drbh
91f407226d feat: support if vlm models 2024-06-10 10:24:21 -04:00
drbh
611225f017 feat: support base model generation and refactors 2024-06-10 10:24:21 -04:00
drbh
68399c1ae3 feat: prefer model id in request 2024-06-10 10:23:52 -04:00
drbh
de56a81c5c feat: add lora support to mistral and refactors 2024-06-10 10:23:52 -04:00
drbh
dc0f76553c fix: pass model_id for all causal and seq2seq lms 2024-06-10 10:23:52 -04:00
drbh
88bd5c2c92 fix: pass model_id for all flash causal lms 2024-06-10 10:23:52 -04:00
drbh
73eb2ae255 fix: refactor and move changes to v3 proto 2024-06-10 10:23:52 -04:00
drbh
c927376725 fix: adjust adapter_segments logic when in batch 2024-06-10 10:23:52 -04:00
drbh
ad088d51fa fix: adjust batch for bgmv 2024-06-10 10:23:52 -04:00
drbh
8984ce6c69 feat: perfer loraxs custom punica kernels and add mlp loras 2024-06-10 10:23:52 -04:00
drbh
d5f21d57d1 fix: prefer adapter_data and refactors 2024-06-10 10:23:52 -04:00
drbh
8b50f4b779 feat: prefer lorax implementation and port loading logic 2024-06-10 10:23:52 -04:00
drbh
c661631225 feat: baseline impl single request multi lora support 2024-06-10 10:23:52 -04:00
drbh
a046c303f7 fix: refactor and reduce lora math 2024-06-10 10:23:52 -04:00
drbh
0a6ea7fb57 feat: load weights within layer and refactor lora pass 2024-06-10 10:23:52 -04:00
drbh
db3d8e6518 feat: first draft load multiple lora 2024-06-10 10:23:52 -04:00
Daniël de Kok
85dfc39222
Add Phi-3 medium support (#2039)
Add support for Phi-3-medium

The main difference between the medium and mini models is that medium
uses grouped query attention with a packed QKV matrix. This change adds
support for GQA with packed matrixes to `Weights.get_weights_col_packed`
and uses it for Phi-3. This also allows us to remove the custom
implementation of GQA from dbrx attention loading.
2024-06-10 09:22:29 +02:00
fxmarty
9b3674d903
ROCm and sliding windows fixes (#2033)
* update vllm commit & fix models using sliding window

* update

* update commit

* fix bug where tunableop is bound to cuda graph even when cuda graph are disabled

* enable tunableop by default

* fix sliding window

* address review

* dead code

* precise comment

* is it flaky?
2024-06-10 15:09:50 +08:00
Daniël de Kok
bf3c813782 server: use chunked inputs
The router will now send the input as chunks besides as a single
string. This change modifies the server to process chunked input
rather than strings. This also allows us to remove the image
extraction code from the server.
2024-06-07 08:09:04 +02:00
Daniël de Kok
0d96468ebb marlin: support tp>1 when group_size==-1 2024-06-06 17:19:28 +02:00
Daniël de Kok
4594e6faba Add support for Marlin-quantized models
This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel, which provides good speedups decoding batches
of 16-32 tokens. It supports quantized models with symmetric
quantization, groupsize -1 or 128, and 4-bit.

Tested with:

- Llama 2
- Llama 3
- Phi 3
2024-06-06 13:16:52 +02:00
Daniël de Kok
3f4bcf978c
Fix GPTQWeight import (#2020)
# What does this PR do?

Fix stray import.

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-05 14:49:15 +02:00
Nicolas Patry
0a94fad79f
Fixing rocm. (#2021)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-05 14:41:34 +02:00
OlivierDehaene
8aece3bd68
feat: move allocation logic to rust (#1835)
Close #2007
2024-06-05 12:18:38 +02:00
Daniël de Kok
9ffe1f1e67
Do not initialize scratch space when there are no ExLlamaV2 layers (#2015)
# What does this PR do?

Do not attempt to allocate ExLlamaV2 scratch buffers when there are no
ExLlama2 layers. Avoids a crash in warmup for models that cannot use
exllama when ExLlamaV2 is installed.

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-05 10:45:47 +02:00
Daniël de Kok
d14eaacaca
Support GPTQ models with column-packed up/gate tensor (#2006)
# What does this PR do?

The GPTQ code path for column-packed packed tensors assumed that this is
always a QKV matrix. However, models (e.g. Phi-3) can also have
column-packed MLP up/gate matrices.

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-04 19:37:49 +02:00
Daniël de Kok
9b52f0e2dc
Fix Phi-2 with tp>1 (#2003)
# What does this PR do?

We were using the wrong parallelism in the up-projection.

<!-- Remove if not applicable -->

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-04 14:26:07 +02:00
Nicolas Patry
9a59ebcec3 Hotfix GPTQ. 2024-06-03 09:32:12 +00:00
Nicolas Patry
9add5d0af5
Fixing GPTQ imports. (#1994)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-03 10:36:29 +02:00
Nicolas Patry
799a193b10 Fixing Phi3. 2024-06-01 08:47:00 +00:00
Nicolas Patry
5ab4cef67e
Fixing exl2 scratch buffer. (#1990)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-05-31 18:01:43 +02:00
Nicolas Patry
06edde9491
Purely refactors paged/attention into layers/attention and make hardware differences more obvious with 1 file per hardware. (#1986)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-05-31 17:57:01 +02:00
Daniël de Kok
36dd16017c Add support for exl2 quantization
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing
  around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the
  maximum input sequence length to avoid allocating huge
  scratch buffers that OOM.
2024-05-30 11:28:05 +02:00
drbh
cbced7f0f9
feat: adjust attn weight loading logic (#1975)
This PR updates `load_attention` to prefer loading specific attention
based on the model type. Additionally there were two cases where
`TensorParallelColumnLinear.load_multi` was called and this reduces it
to a single path
2024-05-29 12:42:11 -04:00