Initial commit

parent c4422e5678
commit 7dcd953969
docs/source/_toctree.yml

@@ -17,6 +17,8 @@
     title: Serving Private & Gated Models
   - local: basic_tutorials/using_cli
     title: Using TGI CLI
+  - local: basic_tutorials/custom_models
+    title: Custom Model Serving
   title: Tutorials
 - sections:
   - local: conceptual/streaming
docs/source/basic_tutorials/custom_models.md (new file, 21 lines)

@@ -0,0 +1,21 @@
# Custom Model Serving
TGI supports various LLM architectures (see the full list [here](https://github.com/huggingface/text-generation-inference#optimized-architectures)). If you wish to serve a model that is not one of the supported ones, TGI will fall back to the `transformers` implementation of that model, which can be loaded as follows:
```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Replace "<model>" with the model ID or a local path.
AutoModelForCausalLM.from_pretrained("<model>", device_map="auto")

# or

AutoModelForSeq2SeqLM.from_pretrained("<model>", device_map="auto")
```
This means you will not be able to use some of the features introduced by TGI, such as tensor-parallel sharding or flash attention. However, you still keep many of TGI's benefits, such as continuous batching and streaming outputs.
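
For instance, token streaming keeps working through the regular endpoints. Below is a minimal sketch using `huggingface_hub`'s `InferenceClient`, assuming a TGI server is already running locally on port 8080 (as in the Docker example below):

```python
from huggingface_hub import InferenceClient

# Assumes a TGI server is reachable at this address
# (see the Docker example below).
client = InferenceClient("http://127.0.0.1:8080")

# stream=True yields generated tokens as they are produced,
# instead of waiting for the full completion.
for token in client.text_generation(
    "What is Deep Learning?", max_new_tokens=20, stream=True
):
    print(token, end="")
```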
You can serve these models using Docker, as shown below 👇
```bash
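# $volume should point at a host directory used to cache model weights
# (assumed setup line, following the convention in the TGI README).
volume=$PWD/data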
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
```
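
Once the container is up, you can sanity-check the server by posting to TGI's `/generate` endpoint. Here is a minimal sketch using `requests`; the address assumes the port mapping from the command above:

```python
import requests

# The port mapping above exposes the server on localhost:8080.
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
)
print(response.json()["generated_text"])
```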