text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-15 09:55:23 +00:00

Author	SHA1	Message	Date
Nicolas Patry	cd208c5043	All integration tests back everywhere (too many failed CI). (#2428) * All integration tests back everywhere (too many failed CI). * Upgrade integration tests after 12.4 * Attempt to remove the specified compute cap. * Common arch list. * Punica uses raw ASM which is not valid on 9.0 apparently.	2024-09-25 06:10:59 +00:00
Nicolas Patry	f0181ed2d7	Upgrading the tests to match the current workings. (#2423)	2024-09-25 06:08:38 +00:00
drbh	bafab73f76	fix: adjust test snapshots and small refactors (#2323) * fix: adjust test snapshots and small refactors * fix: revert non snapshot changes	2024-09-25 05:50:17 +00:00
Daniël de Kok	c1638a56f1	Add support for Deepseek V2 (#2224) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permutation of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`), so we need weight loading that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads of size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-09-25 05:27:40 +00:00

4 Commits
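
The Deepseek V2 commit (c1638a56f1) mentions "grouped top-K in expert selection" without showing what that means. The sketch below is one plausible reading, not the code from that commit: experts are split into equal-sized groups, each group is scored by its best expert, only the highest-scoring groups stay eligible, and the final top-K is taken among the experts in those groups. The function name `grouped_top_k` and its parameters (`n_groups`, `top_k_groups`, `top_k`) are hypothetical and chosen for illustration only; scoring a group by its maximum is likewise an assumption.

```python
def grouped_top_k(scores, n_groups, top_k_groups, top_k):
    """Return indices of the top_k experts, restricted to the best groups.

    Illustrative sketch only: names and the max-per-group scoring rule are
    assumptions, not taken from the text-generation-inference code base.
    """
    n_experts = len(scores)
    if n_experts % n_groups != 0:
        raise ValueError("number of experts must be divisible by n_groups")
    group_size = n_experts // n_groups

    # Score each group by its best expert (assumed scoring rule).
    group_scores = [
        max(scores[g * group_size:(g + 1) * group_size])
        for g in range(n_groups)
    ]

    # Keep only the top_k_groups highest-scoring groups.
    best_groups = sorted(
        range(n_groups), key=lambda g: group_scores[g], reverse=True
    )[:top_k_groups]

    # Experts in the surviving groups, in ascending index order.
    allowed = [
        e
        for g in sorted(best_groups)
        for e in range(g * group_size, (g + 1) * group_size)
    ]

    # Ordinary top-k among the allowed experts only.
    return sorted(allowed, key=lambda e: scores[e], reverse=True)[:top_k]


# Router scores for 8 experts in 4 groups of 2 (made-up numbers).
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.7, 0.05, 0.6]
print(grouped_top_k(scores, n_groups=4, top_k_groups=2, top_k=4))
```

With these made-up scores, only groups 0 and 2 survive, so expert 7 (score 0.6) is excluded even though a plain top-4 over all experts would have picked it ahead of expert 0 (score 0.1).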