Initial commit

Merve Noyan 2023-08-22 23:26:08 +03:00 committed by GitHub
parent c4422e5678
commit 7dcd953969
2 changed files with 23 additions and 0 deletions


@@ -17,6 +17,8 @@
     title: Serving Private & Gated Models
   - local: basic_tutorials/using_cli
     title: Using TGI CLI
+  - local: basic_tutorials/custom_models
+    title: Custom Model Serving
   title: Tutorials
 - sections:
   - local: conceptual/streaming


@@ -0,0 +1,21 @@
# Custom Model Serving
TGI supports various LLM architectures (see the full list [here](https://github.com/huggingface/text-generation-inference#optimized-architectures)). If you wish to serve a model that is not one of the supported architectures, TGI will fall back to the `transformers` implementation of that model. These models can be loaded with:
```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

# for decoder-only (causal) models
AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")
# or, for encoder-decoder models
AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
```
This means you will not be able to use some of the features introduced by TGI, such as tensor-parallel sharding or flash attention. However, you still get many of TGI's benefits, such as continuous batching and streaming outputs.
You can serve these models using Docker as shown below 👇 (here `$volume` is a local directory mounted into the container and used as the model cache):
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
```
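
Once the container is running, you can send requests to it over HTTP. The snippet below is a minimal sketch using Python's `requests` library; it assumes the server started above is reachable on `localhost:8080` and uses TGI's standard `/generate` route with an example prompt:

```python
import requests

# Assumes the container launched above is listening on localhost:8080 (mapped from port 80)
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is deep learning?",    # example prompt (illustrative)
        "parameters": {"max_new_tokens": 20},  # cap the number of generated tokens
    },
)
response.raise_for_status()
print(response.json()["generated_text"])
```

If you want to consume tokens as they are produced instead of waiting for the full response, the same payload can be sent to the `/generate_stream` endpoint, which returns the output as server-sent events (see the streaming conceptual guide).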