* Working loading state.
* Preprocessing.
* Working state ? (Broke idefics1 temporarily).
* Cleaner condition.
* Fix idefics.
* Updating config, removing TODO
* Mllama
* Ugrade transformers 4.45
* Flashing mllama.
* Starting to get there.
* Working state.
* Integrations tests for mllama (cutting to 10 tokens because there seems'
to be instability after (meaning size of the batch matters.
* Updating model link.
* Earlier assert.
* Fix vlm ?
* remove log.
* Force ignore all images but last.
* Default dtype bfloat16.
* Update integration test after switch to bf16.
* Remove dead code.
* Removed dead code.
* Upgrade the flake to latest transformers/tokenizers
* Move to hf tgi-nix
* Upgrade to 0.5.0
* Tied embeddings in MLP speculator.
* Fixing the scale_weight when users decide to not use the speculation as
much as defined in the config.
* Adding scaling support + optimize some ops.
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD.
Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* OVerride the env in server tests.
* Is this enough to make it work ?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests
- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integrationt tests change (seem linked to head_size
modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is
smaller.
* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update __init__.py
Fix issue with NoneType comparison for max_input_tokens and sliding_window
- Add default values for max_input_tokens and sliding_window to handle None cases.
- Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError.
- This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'.
* Update __init__.py
Handle NoneType in sliding_window comparison to fix TypeError in __init__.py by ensuring the comparison logic accounts for NoneType values, preventing errors and improving code robustness.
* fix: syntax/style tweak
---------
Co-authored-by: Praz <prazanth2006@gmail.com>
* add gptj modeling
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: update docs for model addition
* fix: adjust syntax typo
* fix: adjust syntax typo again
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: attempt forward on flash attn2 to check hardware support
* fix: warn window_size_left when using flash attn 1
* fix: prefer version check over test op and avoid window_size_left if not flash attn2
* fix: improve condtional and error message
* fix: update sliding window conditional
* fix: simplify changes and revert model changes
* fix: avoid changing conditional
* fix: typo tweak
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:
- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
So, we need weight loads that supports quantized weights. To this
end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
so we need to pad during attention.
- Heads with size 192, needs an extension to our paged attention
fork and we need to ensure that the KV cache is allocated with the
correct size.
- Shared experts.
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal.lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
* Using flash decoding
Conditional flashdecoding.
Fix max_q.
Working kvcache
Working version with flash decoding.
Make it work for mistral.
Fix after rebase..
Less intrusive.
REvert changes in modeling.
Speedup flashdecoding.
HHachweew
Hack to make other models work.
Fixing non flash decoding llama path.
Router logic knows about page size.
Missing 2 models.
Missing cohere.
Fixing cohere flash decoding.
Revamped all this architecture.
Fix cohere.
Fixing falcon.
Enabling custom block size schedule.
Update router/src/infer.rs
Not sending preallocated output.
* Making it work on non flash decoding.
* Fix Cohere.
* Fix non decoding paths.
* Rebased.
* No need for cache_manager anymore.
* Update?
* "ipex" -> "cpu"
* These do not belong.
* Factoring cu_seqlen_qk for better abstracting over every model.
* Fixing non flash tests/imports.
* Changing return everywhere.
* Update mistral past.
* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
* Fixup mistral clamping (had issues with cuda graphs).
* No need to recreate anything actually.
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: perfer loraxs custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md
Fixed a typo
* Update lora.md
Fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data
---------
Co-authored-by: Derek <datavistics@gmail.com>
* update vllm commit & fix models using sliding window
* update
* update commit
* fix bug where tunableop is bound to cuda graph even when cuda graph are disabled
* enable tunableop by default
* fix sliding window
* address review
* dead code
* precise comment
* is it flaky?
This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel, which provides good speedups decoding batches
of 16-32 tokens. It supports quantized models with symmetric
quantization, groupsize -1 or 128, and 4-bit.
Tested with:
- Llama 2
- Llama 3
- Phi 3
# What does this PR do?
<!--
Congratulations! You've made it this far! You're not quite done yet
though.
Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.
Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.
Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @
@OlivierDehaene OR @Narsil
-->
Mostly straightforward, changes to existing code:
* Wrap quantizer parameters in a small wrapper to avoid passing
around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the
maximum input sequence length to avoid allocating huge
scratch buffers that OOM.
# What does this PR do?
Fix GPTQ for models which do not have float16 at the default dtype
Before this change GPTQ models would not work if the model's default
data type is not `float16`. For example, Gemma GPTQ models would fail
because the default dtype of Gemma is `bfloat16`. There are two issues:
If the default `dtype` is not `float16`, the quantizer's `float16`
parameters get converted to that dtype. The kernels cannot deal
with non-`float16` types. The same applies to inputs of quantized ops.
This is resolved by setting the dtype of gptq/awq-quantized models to
`float16`.
Simpler version of #1951.
**Draft:** just testing...
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @
@OlivierDehaene OR @Narsil
-->
<!--
Congratulations! You've made it this far! You're not quite done yet
though.
Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.
Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.
Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?
Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @
@OlivierDehaene OR @Narsil
-->
---------
Co-authored-by: Joshua Rosenkranz <joshua.rosenkranz@gmail.com>
<!--
Congratulations! You've made it this far! You're not quite done yet
though.
Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.
Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.
Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?
Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @
@OlivierDehaene OR @Narsil
-->
# Add AWQ quantization inference support
Fixes
https://github.com/huggingface/text-generation-inference/issues/781
This PR (partially) adds support for AWQ quantization for inference.
More information on AWQ [here](https://arxiv.org/abs/2306.00978). In
general, AWQ is faster and more accurate than GPTQ, which is currently
supported by TGI.
This PR installs 4-bit GEMM custom CUDA kernels released by AWQ authors
(in `requirements.txt`, just one line change).
Quick way to test this PR would be bring up TGI as follows:
```
text-generation-server download-weights abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq
text-generation-launcher \
--huggingface-hub-cache ~/.cache/huggingface/hub/ \
--model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \
--trust-remote-code --port 8080 \
--max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
--quantize awq
```
Please note:
* This PR was tested with FlashAttention v2 and vLLM.
* This PR adds support for AWQ inference, not quantizing the models.
That needs to be done outside of TGI, instructions
[here](f084f40bd9).
* This PR only adds support for `FlashLlama` models for now.
* Multi-GPU setup has not been tested.
* No integration tests have been added so far, will add later if
maintainers are interested in this change.
* This PR can be tested on any of the models released
[here](https://huggingface.co/abhinavkulkarni?sort_models=downloads#models).
Please refer to the linked issue for benchmarks for
[abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq](https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq)
vs
[TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ).
Please note, AWQ has released faster (and in case of Llama, fused)
kernels for 4-bit GEMM, currently at the top of the `main` branch at
https://github.com/mit-han-lab/llm-awq, but this PR uses an older commit
that has been tested to work. We can switch to latest commit later on.
## Who can review?
@OlivierDehaene OR @Narsil
---------
# What does this PR do?
<!--
Congratulations! You've made it this far! You're not quite done yet
though.
Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.
Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.
Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @
@OlivierDehaene OR @Narsil
-->
---------
Co-authored-by: Abhinav M Kulkarni <abhinavkulkarni@gmail.com>
Co-authored-by: Abhinav Kulkarni <abhinav@concentric.ai>