diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 5ba470bd..7555b327 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -17,6 +17,8 @@
     title: Serving Private & Gated Models
   - local: basic_tutorials/using_cli
     title: Using TGI CLI
+  - local: basic_tutorials/custom_models
+    title: Custom Model Serving
   title: Tutorials
 - sections:
   - local: conceptual/streaming
diff --git a/docs/source/basic_tutorials/custom_models.md b/docs/source/basic_tutorials/custom_models.md
new file mode 100644
index 00000000..ec852e36
--- /dev/null
+++ b/docs/source/basic_tutorials/custom_models.md
@@ -0,0 +1,21 @@
+# Custom Model Serving
+
+TGI supports various LLM architectures (see the full list [here](https://github.com/huggingface/text-generation-inference#optimized-architectures)). If you wish to serve a model that is not one of the supported architectures, TGI falls back to the `transformers` implementation of that model. Such models are loaded with:
+
+```python
+from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM
+
+# replace "<model-id>" with the identifier of the model you want to serve
+AutoModelForCausalLM.from_pretrained("<model-id>", device_map="auto")
+
+# or
+
+AutoModelForSeq2SeqLM.from_pretrained("<model-id>", device_map="auto")
+```
+
+This means you will not be able to use some of the features introduced by TGI, such as tensor-parallel sharding or flash attention. However, you can still get many of TGI's benefits, such as continuous batching and streaming outputs.
+
+You can serve these models using Docker as shown below 👇
+
+```bash
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
+```
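+
+Once the container is up, you can send requests to it. Below is a minimal sketch of a request against TGI's `/generate` endpoint, assuming the server started by the command above is reachable at `127.0.0.1:8080`:
+
+```bash
+# ask the served model for a short completion
+curl 127.0.0.1:8080/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```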