mirror of
https://github.com/huggingface/text-generation-inference.git
synced 2025-04-21 14:52:20 +00:00
Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`), so we need weight loading that supports quantized weights. To this end, `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value, so we need to pad during attention.
- Heads of size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size.
- Shared experts.
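The grouped top-K expert selection named in the commit message can be illustrated with a minimal pure-Python sketch. It is not the code from this repository: the function name `grouped_top_k` and the parameters `n_group`, `topk_group`, and `top_k` are chosen here for illustration, and the sketch assumes that each group is ranked by its best-scoring expert, the usual description of group-limited routing.

```python
def grouped_top_k(scores, n_group, topk_group, top_k):
    """Pick experts for one token using grouped top-K routing.

    scores      -- router scores, one per expert
    n_group     -- number of equally sized expert groups (assumed name)
    topk_group  -- how many groups stay eligible (assumed name)
    top_k       -- how many experts are finally selected
    """
    n_experts = len(scores)
    if n_experts % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")
    group_size = n_experts // n_group

    # Rank each group by the score of its best expert.
    group_scores = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_group)
    ]
    kept_groups = sorted(
        range(n_group), key=lambda g: group_scores[g], reverse=True
    )[:topk_group]

    # Only experts inside the kept groups may be selected.
    candidates = [
        e
        for g in kept_groups
        for e in range(g * group_size, (g + 1) * group_size)
    ]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(chosen)


# 8 experts in 4 groups of 2; only the 2 best groups stay eligible.
scores = [0.9, 0.1, 0.2, 0.8, 0.7, 0.6, 0.05, 0.3]
print(grouped_top_k(scores, n_group=4, topk_group=2, top_k=3))  # → [0, 2, 3]
```

In this example a plain top-3 over all experts would pick experts 0, 3, and 4. The grouped variant excludes expert 4 because its group is not among the two best, so expert 2 is chosen instead.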
2.6 KiB
Supported Models and Hardware
Text Generation Inference enables serving optimized models on specific hardware for the highest performance. The following sections list which models and hardware are supported.
Supported Models
- Deepseek V2
- Idefics 2 (Multimodal)
- Llava Next (1.6) (Multimodal)
- Llama
- Phi 3
- Gemma
- PaliGemma
- Gemma2
- Cohere
- Dbrx
- Mamba
- Mistral
- Mixtral
- Gpt Bigcode
- Phi
- Baichuan
- Falcon
- StarCoder 2
- Qwen 2
- Opt
- T5
- Galactica
- SantaCoder
- Bloom
- Mpt
- Gpt2
- Gpt Neox
- Idefics (Multimodal)
If the model you want to serve is not in the list above, you can still try to initialize and serve it, depending on its pipeline type, to see how well it performs. Performance is not guaranteed for non-optimized models:
# for causal LMs/text-generation models
AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")
# or, for text-to-text generation models
AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
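The choice between the two loaders can be written as a small helper. This is a sketch under stated assumptions, not TGI's internal dispatch code: the helper names `auto_class_name` and `load_fallback_model` are hypothetical, the pipeline types are assumed to be the Hub tags `text-generation` and `text2text-generation`, and `device_map="auto"` assumes the `accelerate` package is installed alongside `transformers`.

```python
# Map a pipeline type to the transformers auto class used for the fallback.
AUTO_CLASS_BY_PIPELINE = {
    "text-generation": "AutoModelForCausalLM",        # causal LMs
    "text2text-generation": "AutoModelForSeq2SeqLM",  # text-to-text models
}


def auto_class_name(pipeline_type):
    """Return the auto class name for a pipeline type (hypothetical helper)."""
    try:
        return AUTO_CLASS_BY_PIPELINE[pipeline_type]
    except KeyError:
        raise ValueError(f"no fallback loader for pipeline type {pipeline_type!r}")


def load_fallback_model(model_id, pipeline_type):
    """Load a non-optimized model with the matching auto class."""
    import transformers  # imported lazily so the mapping is usable without it

    auto_class = getattr(transformers, auto_class_name(pipeline_type))
    return auto_class.from_pretrained(model_id, device_map="auto")
```

For example, `load_fallback_model("<model>", "text-generation")` performs the first call shown above.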
If you wish to serve a supported model that already exists in a local folder, point `--model-id` to that folder.
text-generation-launcher --model-id <PATH-TO-LOCAL-BLOOM>
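When the launcher is started from a script, it can help to check the folder before building the command. The following is a minimal sketch: `build_launcher_command` is a hypothetical helper, and the only launcher option it uses is `--model-id`, as in the command above.

```python
from pathlib import Path


def build_launcher_command(model_dir):
    """Return the argument list for serving a model from a local folder."""
    path = Path(model_dir)
    if not path.is_dir():
        # Fail early instead of letting the launcher report a missing folder.
        raise FileNotFoundError(f"model folder not found: {path}")
    return ["text-generation-launcher", "--model-id", str(path)]


# The list can be passed to subprocess.run(...) to start the server.
```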