Mirror of https://github.com/huggingface/text-generation-inference.git, synced 2025-04-21 23:02:13 +00:00
Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`), so we need weight loading that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value, so we need to pad during attention.
- Heads of size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size.
- Shared experts.

basic_tutorials
conceptual
_toctree.yml
architecture.md
index.md
installation_amd.md
installation_gaudi.md
installation_inferentia.md
installation_intel.md
installation_nvidia.md
installation.md
messages_api.md
quicktour.md
supported_models.md
usage_statistics.md
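
The "grouped top-K in expert selection" item in the commit message above can be illustrated with a small sketch. This is not the code from the repository: the function name `grouped_top_k`, its parameters (`n_group`, `topk_group`, `top_k`), and the choice to score each group by its best expert are assumptions made for illustration, and the sketch handles a single token in pure Python rather than batched tensors.

```python
from typing import List, Tuple


def grouped_top_k(
    scores: List[float], n_group: int, topk_group: int, top_k: int
) -> List[Tuple[int, float]]:
    """Pick top_k experts for one token, restricted to the best topk_group groups.

    Hypothetical helper: names and the max-per-group scoring rule are assumptions.
    """
    if len(scores) % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")
    group_size = len(scores) // n_group

    # Score each group by its best expert (assumed rule).
    group_scores = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_group)
    ]

    # Keep only the topk_group highest-scoring groups.
    kept = set(
        sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    )

    # Candidates are the experts that belong to a kept group.
    candidates = [i for i in range(len(scores)) if i // group_size in kept]

    # Ordinary top-k over the remaining candidates.
    best = sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]
    return [(i, scores[i]) for i in best]


if __name__ == "__main__":
    # 8 experts in 4 groups of 2; router scores for a single token.
    router_scores = [0.05, 0.30, 0.10, 0.02, 0.25, 0.08, 0.15, 0.05]
    print(grouped_top_k(router_scores, n_group=4, topk_group=2, top_k=3))
```

In this example, expert 6 has the third-highest score overall, but its group is not among the two best groups, so it is skipped in favor of expert 5. That restriction is what distinguishes grouped top-K from a plain top-K over all experts.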