diff --git a/README.md b/README.md
index fe55d7b5..a1c9ffab 100644
--- a/README.md
+++ b/README.md
@@ -239,6 +239,8 @@ You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
 make run-bloom-quantize # Requires 8xA100 40GB
 ```
 
+4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command-line argument to `text-generation-launcher`.
+
 ## Develop
 
 ```shell
diff --git a/launcher/src/main.rs b/launcher/src/main.rs
index 36add771..98c5a0aa 100644
--- a/launcher/src/main.rs
+++ b/launcher/src/main.rs
@@ -104,7 +104,8 @@ struct Args {
     num_shard: Option<usize>,
 
     /// Whether you want the model to be quantized. This will use `bitsandbytes` for
-    /// quantization on the fly, or `gptq`.
+    /// quantization on the fly, or `gptq`. 4-bit quantization is available through
+    /// `bitsandbytes` by providing the `bitsandbytes-fp4` or `bitsandbytes-nf4` options.
     #[clap(long, env, value_enum)]
     quantize: Option<Quantization>,
 
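
For reference, a minimal usage sketch of the flag this change documents. The model id below is only a placeholder example; `--model-id` is the launcher's standard argument for selecting a model, and the `--quantize` values come from the diff above:

```shell
# Sketch: launch the server with 4-bit NF4 quantization enabled
# (swap bitsandbytes-nf4 for bitsandbytes-fp4 to use the FP4 data type)
text-generation-launcher --model-id bigscience/bloom-560m --quantize bitsandbytes-nf4
```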