Initial commit

parent c4422e5678
commit 7dcd953969
docs/source/_toctree.yml

@@ -17,6 +17,8 @@
     title: Serving Private & Gated Models
   - local: basic_tutorials/using_cli
     title: Using TGI CLI
+  - local: basic_tutorials/custom_models
+    title: Custom Model Serving
   title: Tutorials
 - sections:
   - local: conceptual/streaming
docs/source/basic_tutorials/custom_models.md (new file, 21 lines)

@@ -0,0 +1,21 @@
# Custom Model Serving
TGI supports various LLM architectures (see the full list [here](https://github.com/huggingface/text-generation-inference#optimized-architectures)). If you wish to serve a model that is not one of the supported ones, TGI will fall back to the `transformers` implementation of that model, which can be loaded as follows:
```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Replace "<model>" with the model ID or a local path.
AutoModelForCausalLM.from_pretrained("<model>", device_map="auto")

# or

AutoModelForSeq2SeqLM.from_pretrained("<model>", device_map="auto")
```
This means you will not be able to use some of the features introduced by TGI, such as tensor-parallel sharding or flash attention. However, you still keep many of TGI's benefits, such as continuous batching and streaming outputs.
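
For instance, token streaming keeps working through the regular endpoints. Below is a minimal sketch using `huggingface_hub`'s `InferenceClient`, assuming a TGI server is already running locally on port 8080 (as in the Docker example below):

```python
from huggingface_hub import InferenceClient

# Assumes a TGI server is reachable at this address
# (see the Docker example below).
client = InferenceClient("http://127.0.0.1:8080")

# stream=True yields generated tokens as they are produced,
# instead of waiting for the full completion.
for token in client.text_generation(
    "What is Deep Learning?", max_new_tokens=20, stream=True
):
    print(token, end="")
```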
You can serve these models using Docker, as shown below 👇
```bash
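# $volume should point at a host directory used to cache model weights
# (assumed setup line, following the convention in the TGI README).
volume=$PWD/data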
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
```
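
Once the container is up, you can sanity-check the server by posting to TGI's `/generate` endpoint. Here is a minimal sketch using `requests`; the address assumes the port mapping from the command above:

```python
import requests

# The port mapping above exposes the server on localhost:8080.
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
)
print(response.json()["generated_text"])
```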