From b6922d48de601aaf90bb5f230c5fa88ad450798d Mon Sep 17 00:00:00 2001
From: Nicolas Patry
Date: Tue, 27 Feb 2024 15:49:58 +0100
Subject: [PATCH] Add the speculation docs.

---
 docs/source/conceptual/speculation    |  1 -
 docs/source/conceptual/speculation.md | 65 +++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 1 deletion(-)
 delete mode 100644 docs/source/conceptual/speculation

diff --git a/docs/source/conceptual/speculation b/docs/source/conceptual/speculation
deleted file mode 100644
index f08b41c5..00000000
--- a/docs/source/conceptual/speculation
+++ /dev/null
@@ -1 +0,0 @@
-## Speculation
diff --git a/docs/source/conceptual/speculation.md b/docs/source/conceptual/speculation.md
index f08b41c5..071b7b68 100644
--- a/docs/source/conceptual/speculation.md
+++ b/docs/source/conceptual/speculation.md
@@ -1 +1,66 @@
 ## Speculation
+
+Speculative decoding, assisted generation, Medusa, and others are different names for the same idea.
+The idea is to generate tokens *before* the large model actually runs, and only *check* whether those tokens were valid.
+
+So you are doing *more* computation on your LLM, but if your guesses are correct you produce 1, 2, 3, etc. tokens in a single LLM pass. Since LLMs are usually memory bound (and not compute bound), provided your guesses are correct often enough, this gives 2-3x faster inference (it can be much more for code-oriented tasks, for instance).

+
+You can read a more [detailed explanation](https://huggingface.co/blog/assisted-generation).
+
+Text Generation Inference supports 2 main speculative methods:
+
+- Medusa
+- N-gram
+
+
+### Medusa
+
+
+Medusa is a [simple method](https://arxiv.org/abs/2401.10774) to create many tokens in a single pass, using fine-tuned LM heads on top of your existing model.
+
+
+You can check out a few existing fine-tunes for popular models:
+
+- [text-generation-inference/gemma-7b-it-medusa](https://huggingface.co/text-generation-inference/gemma-7b-it-medusa)
+- [text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa](https://huggingface.co/text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa)
+- [text-generation-inference/Mistral-7B-Instruct-v0.2-medusa](https://huggingface.co/text-generation-inference/Mistral-7B-Instruct-v0.2-medusa)
+
+
+In order to create Medusa heads for your own fine-tune, you should check out the original Medusa repo: [https://github.com/FasterDecoding/Medusa](https://github.com/FasterDecoding/Medusa)
+
+
+In order to use Medusa models in TGI, simply point to a Medusa-enabled model and everything will load automatically (see the example launch command at the end of this page).
+
+
+### N-gram
+
+
+If you don't have a Medusa model, or don't have the resources to fine-tune one, you can try `n-gram` speculation.
+N-gram works by looking back through the existing sequence for matching tokens and using the tokens that followed them as the speculation.
+
+This is an extremely simple method, which works best for code or highly repetitive text. It might not be beneficial if the speculation misses too often.
+
+
+In order to enable n-gram speculation, simply pass
+
+`--speculate 2` in your flags.
+
+[Details about the flag](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#speculate)
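+
+For example, assuming you run the `text-generation-launcher` binary directly (adapt the flags to your own deployment, e.g. Docker), a minimal sketch of 2-token n-gram speculation looks like this; the model id is only an example:
+
+```bash
+# Sketch: serve a model with 2-token n-gram speculation enabled.
+# The model id below is only an example; substitute your own.
+text-generation-launcher \
+    --model-id mistralai/Mistral-7B-Instruct-v0.2 \
+    --speculate 2
+```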
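+
+The Medusa setup mentioned earlier is just as simple: point `--model-id` at one of the fine-tunes listed above and the heads are loaded automatically:
+
+```bash
+# Sketch: serve a Medusa fine-tune; the speculative heads load automatically.
+text-generation-launcher \
+    --model-id text-generation-inference/Mistral-7B-Instruct-v0.2-medusa
+```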