text-generation-inference/integration-tests/models/test_compressed_tensors_w8a8_int.py
Daniël de Kok 3c9df21ff8
Add support for compressed-tensors w8a8 int checkpoints (#2745)
* Add support for compressed-tensors w8a8 int checkpoints

This change adds a loader for w8a8 int checkpoints. A large benefit of
int8 support is that the corresponding CUTLASS matmul kernels also work on
compute capability 7.5.
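
For context, a w8a8 int GEMM multiplies int8 weights by int8 activations, accumulates in int32, and rescales the result with the weight and activation scales. A rough numpy sketch of that computation (illustrative only; the real kernels come from CUTLASS via marlin-kernels, and the function below is hypothetical):

```python
import numpy as np


def w8a8_int_matmul(a_q, a_scale, w_q, w_scale):
    """Illustrative w8a8 int GEMM.

    a_q: int8 activations [m, k], a_scale: per-token scales [m, 1],
    w_q: int8 weights [n, k], w_scale: per-output-channel scales [n].
    """
    # Accumulate in int32 to avoid overflowing the int8 inputs.
    acc = a_q.astype(np.int32) @ w_q.T.astype(np.int32)
    # Dequantize the int32 accumulator back to float.
    return acc.astype(np.float32) * a_scale * w_scale[None, :]
```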

Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

|     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
|               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
|ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
|               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
|               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|

This is in the same ballpark as vLLM.

As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int

It's far less flaky and gives better output.
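
As a rough sketch of what dynamic (per-token) input quantization means here (illustrative only, not the actual kernel code): each token's scale is derived from its absolute maximum at runtime instead of a calibrated static input scale.

```python
import numpy as np


def quantize_input_dynamic(x):
    """Hypothetical per-token symmetric int8 quantization of activations x [m, k]."""
    # One scale per token (row), computed from the activation itself at runtime.
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / 127.0
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, scale  # the scale is used later to dequantize the GEMM output
```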

* Use marlin-kernels 0.3.5

* Fix a typo

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Small fixes

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-11-18 17:20:31 +01:00


import pytest


@pytest.fixture(scope="module")
def compressed_tensors_w8a8_int_handle(launcher):
    with launcher(
        "neuralmagic/Llama-3.2-3B-Instruct-quantized.w8a8",
        num_shard=2,
        quantize="compressed-tensors",
    ) as handle:
        yield handle


@pytest.fixture(scope="module")
async def compressed_tensors_w8a8_int(compressed_tensors_w8a8_int_handle):
    await compressed_tensors_w8a8_int_handle.health(300)
    return compressed_tensors_w8a8_int_handle.client


@pytest.mark.release
@pytest.mark.asyncio
@pytest.mark.private
async def test_compressed_tensors_w8a8_int(
    compressed_tensors_w8a8_int, response_snapshot
):
    response = await compressed_tensors_w8a8_int.generate(
        "What is deep learning?",
        max_new_tokens=10,
        decoder_input_details=True,
    )

    assert (
        response.generated_text
        == " and how does it differ from traditional machine learning?\n"
    )
    assert response.details.generated_tokens == 10
    assert response == response_snapshot


@pytest.mark.release
@pytest.mark.asyncio
@pytest.mark.private
async def test_compressed_tensors_w8a8_int_all_params(
    compressed_tensors_w8a8_int, response_snapshot
):
    response = await compressed_tensors_w8a8_int.generate(
        "What is deep learning",
        max_new_tokens=10,
        repetition_penalty=1.2,
        return_full_text=True,
        stop_sequences=["test"],
        temperature=0.5,
        top_p=0.9,
        top_k=10,
        truncate=5,
        typical_p=0.9,
        watermark=True,
        decoder_input_details=True,
        seed=0,
    )

    assert response.details.generated_tokens == 10
    assert (
        response.generated_text
        == "What is deep learning?\nDeep learning, also known as neural network or"
    )
    assert response == response_snapshot


@pytest.mark.release
@pytest.mark.asyncio
@pytest.mark.private
async def test_compressed_tensors_w8a8_int_load(
    compressed_tensors_w8a8_int, generate_load, response_snapshot
):
    responses = await generate_load(
        compressed_tensors_w8a8_int,
        "What is deep learning?",
        max_new_tokens=10,
        n=4,
    )

    assert (
        responses[0].generated_text
        == " and how does it differ from traditional machine learning?\n"
    )
    assert len(responses) == 4
    assert all([r.generated_text == responses[0].generated_text for r in responses])
    assert responses == response_snapshot