From 0595bf3e9a15807d038e41f09c0f7e1c595d2417 Mon Sep 17 00:00:00 2001
From: dtlzhuangz <139844877+dtlzhuangz@users.noreply.github.com>
Date: Wed, 31 Jan 2024 19:05:49 +0800
Subject: [PATCH] feat: eetq gemv optimization when batch_size <= 4 (#1502)

# What does this PR do?

Add TensorRT-LLM weight-only GEMV kernel support. We extract the GEMV kernel from
[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv)
to accelerate the decode speed of EETQ when batch_size is less than or equal to 4
(an illustrative sketch of the batch-size dispatch follows the diff below).

- Features
  1. There is almost no loss of quantization accuracy.
  2. Decoding is 13% - 27% faster than the original EETQ, which uses the GEMM kernel.

- Test
  Below are our test results on an RTX 3090.
  Environment: torch=2.0.1, cuda=11.8, nvidia driver: 525.78.01
  prompt=1024, max_new_tokens=50
  ![image](https://github.com/huggingface/text-generation-inference/assets/139844877/98e63b23-23cd-452f-91bd-55ccdc9b7021)
  ![image](https://github.com/huggingface/text-generation-inference/assets/139844877/5c3132ff-fc1c-4b20-a83f-59b3d5f586b7)

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
---
 server/Makefile-eetq | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/server/Makefile-eetq b/server/Makefile-eetq
index 5e8e9830..8c060987 100644
--- a/server/Makefile-eetq
+++ b/server/Makefile-eetq
@@ -1,4 +1,4 @@
-eetq_commit := 323827dd471458a84e9c840f614e4592b157a4b1
+eetq_commit := 71adb5e191bb8290069a580abff0355d7b2dd5c9
 
 eetq:
 	# Clone eetq
@@ -6,7 +6,7 @@ eetq:
 	git clone https://github.com/NetEase-FuXi/EETQ.git eetq
 
 build-eetq: eetq
-	cd eetq && git fetch && git checkout $(eetq_commit)
+	cd eetq && git fetch && git checkout $(eetq_commit) && git submodule update --init --recursive
 	cd eetq && python setup.py build
 
 install-eetq: build-eetq
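
For illustration only, here is a minimal, hypothetical Python sketch of the dispatch described in the PR text above. The function name, argument layout, and the `gemv_kernel` / `gemm_kernel` callables are placeholders rather than the actual EETQ or TensorRT-LLM API; the only detail taken from the PR is the batch_size <= 4 threshold for choosing the GEMV path.

```python
# Hypothetical sketch (not the EETQ API): route a weight-only quantized
# matmul to a GEMV-style kernel for small batches and to the GEMM kernel
# otherwise, mirroring the batch_size <= 4 threshold from the PR.
import torch

GEMV_MAX_BATCH = 4  # threshold described in the PR

def w8a16_matmul(x: torch.Tensor, qweight: torch.Tensor, scales: torch.Tensor,
                 gemv_kernel, gemm_kernel) -> torch.Tensor:
    """Dispatch on the effective batch size (rows of the flattened input).

    `gemv_kernel` and `gemm_kernel` stand in for the compiled CUDA entry
    points; their real names and signatures live in EETQ / TensorRT-LLM
    and are not reproduced here.
    """
    batch_size = x.reshape(-1, x.shape[-1]).shape[0]
    if batch_size <= GEMV_MAX_BATCH:
        return gemv_kernel(x, qweight, scales)
    return gemm_kernel(x, qweight, scales)
```

The intuition behind the cutover is that a batched GEMV kernel is tuned for a handful of activation rows (the typical decode case), while the GEMM kernel remains the better choice at larger batch sizes.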