From 21ca70e0eb8b9389d8dde15c12a2bc2866e2e8f6 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 1 Aug 2023 14:02:14 +0300
Subject: [PATCH] Added supported models and hardware

---
 docs/source/supported_models.md | 39 +++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/docs/source/supported_models.md b/docs/source/supported_models.md
index e69de29b..146eb178 100644
--- a/docs/source/supported_models.md
+++ b/docs/source/supported_models.md
@@ -0,0 +1,39 @@
+# Supported Models and Hardware
+
+## Supported Models
+
+The following optimized models are supported:
+
+- [BLOOM](https://huggingface.co/bigscience/bloom)
+- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
+- [Galactica](https://huggingface.co/facebook/galactica-120b)
+- [GPT-Neox](https://huggingface.co/EleutherAI/gpt-neox-20b)
+- [Llama](https://github.com/facebookresearch/llama)
+- [OPT](https://huggingface.co/facebook/opt-66b)
+- [SantaCoder](https://huggingface.co/bigcode/santacoder)
+- [Starcoder](https://huggingface.co/bigcode/starcoder)
+- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
+- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
+- [MPT](https://huggingface.co/mosaicml/mpt-30b)
+- [Llama V2](https://huggingface.co/meta-llama)
+
+If the model you would like to serve is not listed above, you can try to initialize and serve it on a best-effort basis (a runnable sketch is given at the end of this page). Depending on the model's pipeline type, use:
+
+`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`
+
+or
+
+`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`
+
+For the optimized models above, TGI uses custom CUDA kernels for faster inference. You can disable them by adding the `--disable-custom-kernels` flag at the end of the `docker run` command (see the example at the end of this page).
+
+
+## Supported Hardware
+
+Text Generation Inference's optimized models are supported on NVIDIA [A100](https://www.nvidia.com/en-us/data-center/a100/), [A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) and [T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) GPUs with CUDA 11.8+. Note that you have to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to run TGI on GPUs with Docker. For other hardware, continuous batching will still apply, but some operations (e.g. flash attention, paged attention) will not be executed, so you might observe degraded performance.
+
+TGI is also supported on the following AI hardware accelerators:
+- *Habana first-gen Gaudi and Gaudi2:* check out [this guide](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) on how to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)
+
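+Below is a minimal, runnable sketch of the best-effort path described above. The checkpoint `bigscience/bloom-560m` and the prompt are illustrative placeholders, not recommendations; `device_map="auto"` additionally requires the `accelerate` package to be installed.
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Placeholder checkpoint: substitute the model you actually want to serve.
+# For encoder-decoder models, use AutoModelForSeq2SeqLM instead.
+model_id = "bigscience/bloom-560m"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# device_map="auto" spreads the weights across the available GPUs/CPU
+# (requires the `accelerate` package).
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+# Quick smoke test that the loaded model generates text.
+inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=20)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```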
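+
+And a hedged example of a `docker run` invocation with the custom kernels disabled. The image tag, port mapping, volume path, and model id below are assumptions; adapt them to your setup.
+
+```shell
+# --gpus all requires the NVIDIA Container Toolkit mentioned above.
+# Flags placed after the image name are passed to the TGI launcher;
+# --disable-custom-kernels turns off the custom CUDA kernels.
+docker run --gpus all --shm-size 1g -p 8080:80 \
+    -v $PWD/data:/data \
+    ghcr.io/huggingface/text-generation-inference:latest \
+    --model-id bigscience/bloom-560m \
+    --disable-custom-kernels
+```
+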