import os
from dataclasses import dataclass
from typing import List, Optional, Union

import torch
from loguru import logger
from text_generation_server.utils.import_utils import SYSTEM
from text_generation_server.utils.log import log_once
from text_generation_server.utils.weights import Weight, Weights, WeightsLoader

if SYSTEM == "ipex":
    from .ipex import QuantLinear
elif SYSTEM in {"cuda", "rocm"}:
    from .triton import QuantLinear
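# Note: the `QuantLinear` imported here is the backend-specific (IPEX or Triton)
# fallback kernel; `GPTQWeight.get_linear` below uses it whenever neither the
# AWQ GEMM kernel nor the exllama kernels are selected.
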
@dataclass
class GPTQWeight(Weight):
    qweight: torch.Tensor
    qzeros: torch.Tensor
    scales: torch.Tensor
    g_idx: Optional[torch.Tensor]
    bits: int
    groupsize: int
    use_awq_kernel: bool
    use_exllama: bool

    def __post_init__(self):
        if self.scales.dtype == torch.float:
            self.scales = self.scales.half()

    @property
    def device(self) -> torch.device:
        return self.qweight.device

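    # `get_linear` selects the kernel wrapper for this weight: the AWQ GEMM
    # kernel when `use_awq_kernel` is set, the exllama kernels when
    # `use_exllama` is set, and the backend `QuantLinear` imported at the top
    # of this module otherwise.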
    def get_linear(self, bias: torch.Tensor):
        if self.use_awq_kernel:
            if SYSTEM == "rocm":
                raise NotImplementedError(
                    "AWQ GEMM kernel can't be used on ROCm systems, please use `--quantize gptq` instead "
                    "to use Exllama/GPTQ kernels for AWQ inference."
                )
            try:
                from text_generation_server.layers.awq.quantize import WQLinear

                return WQLinear(
                    w_bit=self.bits,
                    group_size=self.groupsize,
                    qweight=self.qweight,
                    qzeros=self.qzeros,
                    scales=self.scales,
                    bias=bias,
                )
            except ImportError:
                raise NotImplementedError(
                    "You do not seem to have AWQ installed; either install it (cd server && make install-awq), "
                    "or try GPTQ (`--quantize gptq`): an AWQ->GPTQ conversion will happen on the fly"
                )
        elif self.use_exllama:
            try:
                from text_generation_server.layers.gptq import ExllamaQuantLinear
            except ImportError:
                raise NotImplementedError(
                    "Exllama GPTQ kernels are not installed. Install them with `cd server/exllama_kernels && python setup.py install && cd ../exllamav2_kernels && python setup.py install`"
                )

            return ExllamaQuantLinear(self, bias)
        else:
            return QuantLinear(
                self.qweight,
                self.qzeros,
                self.scales,
                self.g_idx,
                bias,
                self.bits,
                self.groupsize,
            )

class GPTQWeightsLoader(WeightsLoader):
    """
    Loader for GPTQ- and AWQ-quantized weights.
    """

    def __init__(
        self,
        *,
        bits: int,
        desc_act: bool,
        groupsize: int,
        quant_method: str,
        quantize: str,
        sym: bool,
    ):
        self.bits = bits
        self.desc_act = desc_act
        self.groupsize = groupsize
        self.quant_method = quant_method
        self.quantize = quantize
        self.sym = sym

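    # Illustrative sketch (not taken from this file): a loader for a 4-bit GPTQ
    # checkpoint with group size 128 might be wired up roughly as follows, with
    # the exact values coming from the checkpoint's quantization config:
    #
    #     loader = GPTQWeightsLoader(
    #         bits=4,
    #         desc_act=False,
    #         groupsize=128,
    #         quant_method="gptq",
    #         quantize="gptq",
    #         sym=True,
    #     )
    #     weight = loader.get_weights_row(weights, "model.layers.0.mlp.down_proj")
    #     linear = weight.get_linear(bias=None)
    #
    # where `weights` is a `Weights` instance backed by the checkpoint's
    # safetensors files; the prefix shown is only an example.
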
    def get_weights(self, weights: Weights, prefix: str):
        self._get_gptq_params(weights)

        use_exllama = True
        if self.bits != 4:
            use_exllama = False

        if self.desc_act:
            log_once(logger.warning, "Disabling exllama because desc_act=True")
            use_exllama = False

        try:
            qweight = weights.get_tensor(f"{prefix}.qweight")
        except RuntimeError:
            raise RuntimeError(
                "Cannot load `gptq` weight, make sure the model is already quantized, or quantize it with `text-generation-server quantize ORIGINAL_MODEL_ID NEW_MODEL_ID`"
            )
        if self.quantize == "gptq" and self.quant_method == "gptq":
            g_idx = weights.get_tensor(f"{prefix}.g_idx")
        else:
            g_idx = None

        from text_generation_server.layers.gptq import (
            HAS_EXLLAMA,
            CAN_EXLLAMA,
            GPTQWeight,
        )

        if use_exllama:
            if not HAS_EXLLAMA:
                if CAN_EXLLAMA:
                    log_once(
                        logger.warning,
                        "Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True",
                    )
                use_exllama = False
            else:
                log_once(logger.info, f"Using exllama kernels v{HAS_EXLLAMA}")

        qzeros = weights.get_tensor(f"{prefix}.qzeros")
        scales = weights.get_tensor(f"{prefix}.scales")

        if use_exllama and g_idx is not None:
            g_idx = g_idx - g_idx[0]
        if self.quantize == "gptq" and self.quant_method == "awq":
            log_once(
                logger.info, "Converting AWQ model to Exllama/GPTQ packing format."
            )
            from text_generation_server.layers.awq.conversion_utils import (
                fast_awq_to_gptq,
            )

            qweight, qzeros = fast_awq_to_gptq(qweight, qzeros)
            if use_exllama:
                g_idx = None
            else:
                # Each packed int32 row holds 32 // bits quantized values, so
                # qweight.shape[0] * (32 // bits) is the number of input
                # features; the synthetic g_idx maps each input feature to its
                # quantization group.
                g_idx = (
                    torch.arange(
                        qweight.shape[0] * (32 // self.bits),
                        device=qweight.device,
                    )
                    // self.groupsize
                ).to(dtype=torch.int32)
        return GPTQWeight(
            qweight=qweight,
            qzeros=qzeros,
            scales=scales,
            g_idx=g_idx,
            bits=self.bits,
            groupsize=self.groupsize,
            use_awq_kernel=self.quantize == "awq",
            use_exllama=use_exllama,
        )

    def get_weights_col_packed(
        self,
        weights: Weights,
        prefix: str,
        block_sizes: Union[int, List[int]],
    ):
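        # Note (descriptive, not from the original file): `block_sizes` describes
        # how the packed output dimension is split into contiguous blocks, e.g.
        # an int such as 3 for a fused QKV projection with equally sized heads,
        # or an explicit list of block sizes when the splits are unequal.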
        try:
            qweight = weights.get_packed_sharded(
                f"{prefix}.qweight", dim=1, block_sizes=block_sizes
            )
        except RuntimeError:
            raise RuntimeError(
                f"Cannot load `{self.quantize}` weight, make sure the model is already quantized."
            )

        scales = weights.get_packed_sharded(
            f"{prefix}.scales", dim=1, block_sizes=block_sizes
        )
        scales = scales.to(dtype=weights.dtype)

        self._get_gptq_params(weights)

        qzeros = weights.get_packed_sharded(
            f"{prefix}.qzeros", dim=1, block_sizes=block_sizes
        )
        if self.quantize == "gptq" and self.quant_method == "gptq":
            g_idx = weights.get_tensor(f"{prefix}.g_idx")
        elif self.quantize == "gptq" and self.quant_method == "awq":
            log_once(
                logger.info, "Converting AWQ model to Exllama/GPTQ packing format."
            )
            from text_generation_server.layers.awq.conversion_utils import (
                fast_awq_to_gptq,
            )

            qweight, qzeros = fast_awq_to_gptq(qweight, qzeros)
            g_idx = (
                torch.arange(
                    qweight.shape[0] * (32 // self.bits),
                    device=qweight.device,
                )
                // self.groupsize
            ).to(dtype=torch.int32)
        else:
            g_idx = None

        return GPTQWeight(
            qweight=qweight,
            qzeros=qzeros,
            scales=scales,
            g_idx=g_idx,
            bits=self.bits,
            groupsize=self.groupsize,
            use_awq_kernel=self.quantize == "awq",
            use_exllama=False,
        )

    def get_multi_weights_col(self, weights: Weights, prefixes: List[str], dim: int):
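        # Note (descriptive, not from the original file): this loads one weight
        # per prefix (e.g. the q/k/v projections of an attention block), shards
        # each along the output dimension, and concatenates them into a single
        # fused weight; g_idx must match across prefixes for this to be valid.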
        try:
            qweight = torch.cat(
                [weights.get_sharded(f"{p}.qweight", dim=1) for p in prefixes], dim=1
            )
        except RuntimeError:
            raise RuntimeError(
                f"Cannot load `{self.quantize}` weight, make sure the model is already quantized"
            )

        scales = torch.cat(
            [weights.get_sharded(f"{p}.scales", dim=1) for p in prefixes], dim=1
        )

        self._get_gptq_params(weights)

        qzeros = torch.cat(
            [weights.get_sharded(f"{p}.qzeros", dim=1) for p in prefixes], dim=1
        )

        from text_generation_server.layers.gptq import HAS_EXLLAMA

        use_exllama = (
            self.bits == 4
            and HAS_EXLLAMA
            and self.quantize == "gptq"
            and not self.desc_act
        )

        if self.quantize == "gptq" and self.quant_method == "gptq":
            w = [weights.get_tensor(f"{p}.g_idx") for p in prefixes]
            for w2 in w[1:]:
                torch.testing.assert_close(w2, w[0])
            g_idx = w[0]
        elif self.quantize == "gptq" and self.quant_method == "awq":
            log_once(
                logger.info, "Converting AWQ model to Exllama/GPTQ packing format."
            )
            from text_generation_server.layers.awq.conversion_utils import (
                fast_awq_to_gptq,
            )

            qweight, qzeros = fast_awq_to_gptq(qweight, qzeros)
            if use_exllama:
                g_idx = None
            else:
                g_idx = (
                    torch.arange(
                        qweight.shape[0] * (32 // self.bits),
                        device=qweight.device,
                    )
                    // self.groupsize
                ).to(dtype=torch.int32)
        else:
            g_idx = None

        return GPTQWeight(
            qweight=qweight,
            qzeros=qzeros,
            scales=scales,
            g_idx=g_idx,
            bits=self.bits,
            groupsize=self.groupsize,
            use_awq_kernel=self.quantize == "awq",
            use_exllama=use_exllama,
        )

    def get_weights_row(self, weights: Weights, prefix: str):
        self._get_gptq_params(weights)

        use_exllama = True
        desc_act = self.desc_act
        if self.bits != 4:
            use_exllama = False

        if self.desc_act:
            log_once(logger.warning, "Disabling exllama because desc_act=True")
            use_exllama = False

        try:
            qweight = weights.get_sharded(f"{prefix}.qweight", dim=0)
        except RuntimeError:
            raise RuntimeError(
                "Cannot load `gptq` weight, make sure the model is already quantized, or quantize it with `text-generation-server quantize ORIGINAL_MODEL_ID NEW_MODEL_ID`"
            )

        if self.quantize == "gptq" and self.quant_method == "gptq":
            g_idx = weights.get_sharded(f"{prefix}.g_idx", dim=0)
        else:
            g_idx = None

        if weights.process_group.size() > 1:
            if g_idx is not None:
                if (
                    not torch.equal(
                        # Remove g_idx[0] to adapt the check with TP > 1.
                        (g_idx - g_idx[0]).cpu(),
                        torch.tensor(
                            [i // self.groupsize for i in range(g_idx.shape[0])],
                            dtype=torch.int32,
                        ),
                    )
                    and not (g_idx == 0).all()
                ):
                    # The exllama implementation does not support row tensor parallelism
                    # with act-order, as it would require reordering input activations
                    # that are split onto several GPUs.
                    use_exllama = False
                    desc_act = True
        from text_generation_server.layers.gptq import (
            CAN_EXLLAMA,
            HAS_EXLLAMA,
            GPTQWeight,
        )

        if use_exllama:
            if not HAS_EXLLAMA:
                if CAN_EXLLAMA:
                    log_once(
                        logger.warning,
                        "Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True",
                    )
                use_exllama = False
            else:
                log_once(logger.info, f"Using exllama kernels v{HAS_EXLLAMA}")
        if not desc_act and self.groupsize != -1:
            qzeros = weights.get_sharded(f"{prefix}.qzeros", dim=0)
            scales = weights.get_sharded(f"{prefix}.scales", dim=0)
            if g_idx is not None:
                # qzeros and scales are sharded, so g_idx must be adjusted accordingly.
                g_idx = g_idx - g_idx[0]
        else:
            qzeros = weights.get_tensor(f"{prefix}.qzeros")
            scales = weights.get_tensor(f"{prefix}.scales")
        if self.quantize == "gptq" and self.quant_method == "awq":
            log_once(
                logger.info, "Converting AWQ model to Exllama/GPTQ packing format."
            )
            from text_generation_server.layers.awq.conversion_utils import (
                fast_awq_to_gptq,
            )

            qweight, qzeros = fast_awq_to_gptq(qweight, qzeros)
            if use_exllama:
                g_idx = None
            else:
                g_idx = (
                    torch.arange(
                        qweight.shape[0] * (32 // self.bits),
                        device=qweight.device,
                    )
                    // self.groupsize
                ).to(dtype=torch.int32)
        return GPTQWeight(
            qweight=qweight,
            qzeros=qzeros,
            scales=scales,
            g_idx=g_idx,
            bits=self.bits,
            groupsize=self.groupsize,
            use_awq_kernel=self.quantize == "awq",
            use_exllama=use_exllama,
        )

    def _get_gptq_params(self, weights: Weights):
        if weights.has_tensor("gptq_bits") and weights.has_tensor("gptq_groupsize"):
            self.bits = weights.get_tensor("gptq_bits").item()
            self.groupsize = weights.get_tensor("gptq_groupsize").item()
            self.desc_act = False
            # `server quantize` used asymmetric quantization unconditionally
            # before the `gptq_sym` setting tensor was added.
            self.sym = (
                weights.get_tensor("gptq_sym").item()
                if weights.has_tensor("gptq_sym")
                else False
            )
            self.quant_method = "gptq"
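

# Note (descriptive, not from the original file): the block below probes the
# environment once at import time. `CAN_EXLLAMA` records whether the hardware
# could run the exllama kernels (CUDA compute capability >= 8.0, or ROCm), and
# `HAS_EXLLAMA` ends up as False, "1", or "2" depending on which kernel
# generation actually imported; `DISABLE_EXLLAMA=True` forces the fallback
# path and `EXLLAMA_VERSION` selects between the v1 and v2 kernels.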
# Needs to be at the end because circular import.
try:
    major, _minor = torch.cuda.get_device_capability()
except Exception:
    major = 1

HAS_EXLLAMA = False
CAN_EXLLAMA = major >= 8 or SYSTEM == "rocm"
V2 = os.getenv("EXLLAMA_VERSION", "2") == "2"
if os.getenv("DISABLE_EXLLAMA") == "True":
    HAS_EXLLAMA = False
elif CAN_EXLLAMA:
    try:
        if V2:
            from text_generation_server.layers.gptq.exllamav2 import (
                QuantLinear as ExllamaQuantLinear,  # noqa: F401
                create_exllama_buffers,  # noqa: F401
                set_device,  # noqa: F401
            )

            HAS_EXLLAMA = "2"
        else:
            from text_generation_server.layers.gptq.exllama import (
                Ex4bitLinear as ExllamaQuantLinear,  # noqa: F401
                create_exllama_buffers,  # noqa: F401
                set_device,  # noqa: F401
            )

            HAS_EXLLAMA = "1"
    except ImportError:
        pass