# LoRA (Low-Rank Adaptation)
## What is LoRA?
LoRA is a technique that allows for efficient fine-tuning of a model while only updating a small portion of the model's weights. This is useful when you have a large model that has been pre-trained on a large dataset, but you want to fine-tune it on a smaller dataset or for a specific task.

LoRA works by adding a small number of additional weights to the model, which are used to adapt the model to the new dataset or task. These additional weights are learned during the fine-tuning process, while the rest of the model's weights are kept fixed.
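The idea can be sketched with plain matrix math: instead of updating a full weight matrix `W`, LoRA learns two small matrices `A` and `B` whose scaled product forms the update. A minimal NumPy sketch (dimensions and hyperparameter values are illustrative, not taken from TGI):

```python
import numpy as np

d, k = 1024, 1024   # base weight dimensions (illustrative)
r = 8               # LoRA rank, much smaller than d and k
alpha = 16          # LoRA scaling hyperparameter

W = np.random.randn(d, k)          # frozen pre-trained weight
A = np.random.randn(r, k) * 0.01   # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection (starts at zero)

# Effective weight after adaptation: W' = W + (alpha / r) * B @ A
W_adapted = W + (alpha / r) * B @ A

# Only A and B are trained: far fewer parameters than the full matrix.
full_params = d * k
lora_params = r * k + d * r
print(full_params, lora_params)  # → 1048576 16384
```

Because `B` is initialized to zero, the adapted weight starts out identical to the base weight, and training only has to learn the low-rank update.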
## How is it used?
LoRA can be used in many ways, and the community is always finding new ways to use it. Technically, LoRA is used to fine-tune a large language model on a small dataset, but the use cases span a wide range of applications, such as:
- fine-tuning a language model on a small dataset
- fine-tuning a language model on a domain-specific dataset
- fine-tuning a language model on a dataset with limited labels
## Optimizing Inference with LoRA
LoRA adapters can be used during inference by multiplying the adapter weights with the model weights at each specified layer. This process can be computationally expensive, but thanks to great work by [punica-ai](https://github.com/punica-ai/punica) and the [lorax](https://github.com/predibase/lorax) team, optimized kernels and frameworks have been developed to make this process more efficient. TGI leverages these optimizations in order to provide fast and efficient inference with multiple LoRA models.
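Conceptually, the low-rank product can also be applied to the activations on the fly instead of being merged into the base weights, which is what makes serving many adapters from one frozen base model feasible; the optimized kernels above batch this computation efficiently on GPU. A hedged NumPy sketch of the equivalence (names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4
scale = 2.0  # alpha / r

W = rng.standard_normal((d, k))  # frozen base weight
A = rng.standard_normal((r, k))  # adapter down-projection
B = rng.standard_normal((d, r))  # adapter up-projection
x = rng.standard_normal((3, k))  # a batch of activations

# Option 1: merge the adapter into the weights (serves one adapter only)
y_merged = x @ (W + scale * B @ A).T

# Option 2: keep the base weights frozen and add the low-rank path
# separately, so different requests can route through different adapters
y_unmerged = x @ W.T + scale * (x @ A.T) @ B.T

assert np.allclose(y_merged, y_unmerged)
```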
## Serving multiple LoRA adapters with TGI
Once a LoRA model has been trained, it can be used to generate text or perform other tasks just like a regular language model. However, because the model has been fine-tuned on a specific dataset, it may perform better on that dataset than a model that has not been fine-tuned.

In practice, it's often useful to have multiple LoRA models, each fine-tuned on a different dataset or for a different task. This allows you to use the model that is best suited for a particular task or dataset.

Text Generation Inference (TGI) now supports loading multiple LoRA models at startup that can be used in generation requests. This feature is available starting from version `~2.0.6` and is compatible with LoRA models trained using the `peft` library.
### Specifying LoRA models
To use LoRA in TGI, specify the list of LoRA models to load when starting the server using the `LORA_ADAPTERS` environment variable. For example:
```bash
LORA_ADAPTERS=predibase/customer_support,predibase/dbpedia
```
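The value is a comma-separated list of adapter IDs (Hugging Face Hub repositories). As a small illustration of the format only (TGI does its own parsing internally), such a value splits into individual adapter IDs like this:

```python
import os

# Hypothetical environment, matching the format shown above
os.environ["LORA_ADAPTERS"] = "predibase/customer_support,predibase/dbpedia"

adapters = [a.strip() for a in os.environ["LORA_ADAPTERS"].split(",") if a.strip()]
print(adapters)  # → ['predibase/customer_support', 'predibase/dbpedia']
```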
In the server logs, you will see the following messages:
```txt
Loading adapter weights into model: predibase/customer_support
Loading adapter weights into model: predibase/dbpedia
```
## Generate text
You can then use these adapters in generation requests by specifying the `adapter_id` parameter in the request payload. For example:
```bash
curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "Hello who are you?",
        "parameters": {
            "max_new_tokens": 40,
            "adapter_id": "predibase/customer_support"
        }
    }'
```
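The same request can be made from Python. A minimal sketch using only the standard library (the server address and port match the curl example above; the response shape is assumed, not verified here):

```python
import json
import urllib.request

payload = {
    "inputs": "Hello who are you?",
    "parameters": {
        "max_new_tokens": 40,
        "adapter_id": "predibase/customer_support",  # must match a loaded adapter
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:3000/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment when a TGI server is running locally:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["generated_text"])
```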
> **Note:** The LoRA feature is new and still being improved. If you encounter any issues or have any feedback, please let us know by opening an issue on the [GitHub repository](https://github.com/huggingface/text-generation-inference/issues/new/choose). Additionally, documentation and an improved client library will be published soon.
An updated tutorial with detailed examples will be published soon. Stay tuned!