Having trouble to generate correct output? Try prefixes!

Smorty [she/her]@lemmy.blahaj.zone · 4 months ago

The Hobbyist@lemmy.zip · 4 months ago

This is interesting. Need to check if this is implemented in Open-WebUI.

But the thing I’m hoping for most in Open-WebUI is support for draft models for speculative decoding. That would be really nice!

Smorty [she/her]@lemmy.blahaj.zone · 4 months ago

This prefix feature is already in Open WebUI! There is the “Playground”, which lets you define any kind of conversation and also lets the model continue a message you started writing for it. The playground is really useful.

What exactly do you mean by “draft models”? I have never heard of that speculative decoding thing…

The Hobbyist@lemmy.zip · edit-2 4 months ago

As you probably know, an LLM works iteratively: you give it instructions and it “auto-completes” them, one token at a time. Every time you want to generate the next token, you have to run the whole inference task again, which is expensive.

However, verifying whether a given next token is the correct one can be cheap, because you can do it in parallel. For instance, take the sentence "The answer to your query is that the sky is blue due to some physical concept". If you wanted to check whether your model would output each of those tokens, you would split the sentence after every token, batch-verify the next token for every split, and see whether each prediction matches the sentence.
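To make the split-and-check idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: `toy_model` is a hypothetical stand-in for an LLM that picks the next token from a small lookup table, and `verify` is a helper name I chose. A real transformer would score all the splits in one forward pass; the toy version just calls the model once per split.

```python
from typing import Callable, Dict, List

# Hypothetical toy "model": the next token depends only on the last token.
TABLE: Dict[str, str] = {
    "the": "sky", "sky": "is", "is": "blue",
    "blue": "due", "due": "to", "to": "physics",
}

def toy_model(prefix: List[str]) -> str:
    """Return the toy model's greedy next token for a prefix."""
    return TABLE.get(prefix[-1], "<eos>")

def verify(model: Callable[[List[str]], str],
           prompt: List[str],
           candidate: List[str]) -> List[bool]:
    """Check, position by position, whether the model would emit each candidate token."""
    # One split per candidate position: prompt, prompt + c[:1], prompt + c[:2], ...
    prefixes = [prompt + candidate[:i] for i in range(len(candidate))]
    # Each split is independent of the others, which is what makes batching possible.
    predictions = [model(p) for p in prefixes]
    return [pred == tok for pred, tok in zip(predictions, candidate)]

checks = verify(toy_model, ["the", "sky"], ["is", "blue", "because", "of"])
print(checks)  # [True, True, False, False]
```

The first two candidate tokens match what the toy model would have produced, and the check fails from "because" onward, so only "is blue" would be kept.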

Speculative decoding is the process where a cheap and efficient draft model generates a tentative output, which is then verified in parallel by the expensive model. Because the draft model is many times quicker, you get a candidate output very fast and can batch-verify it with the expensive model. Any drafted token the expensive model disagrees with is discarded along with everything after it, and generation continues from the expensive model's own token. This saves a lot of compute time because all the parallel verifications require a single forward pass of the expensive model. And the best part is that it has zero effect on the output quality of the expensive model. The cost is that you now have to run two models, but the smaller one may be a tenth of the size, so it possibly runs 10x faster. The more closely the draft model's output matches the expensive model's output, the higher the potential inference speed gain.
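Here is a rough sketch of that loop in Python, under heavy assumptions: both "models" are hypothetical lookup tables, decoding is greedy, and the names (`bigram_model`, `greedy_decode`, `speculative_decode`, `k`) are mine, not from any library. Real implementations that sample tokens use a probabilistic accept/reject rule instead of the exact-match check below, and they verify all drafted positions in one batched forward pass rather than a Python loop.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[List[str]], str]

def bigram_model(table: Dict[str, str]) -> Model:
    """Hypothetical toy model: the next token depends only on the last token."""
    return lambda prefix: table.get(prefix[-1], "<eos>")

def greedy_decode(model: Model, prompt: List[str], max_new: int) -> List[str]:
    """Plain decoding: one model call per generated token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new and out[-1] != "<eos>":
        out.append(model(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[str],
                       max_new: int, k: int = 4) -> Tuple[List[str], int]:
    """Greedy speculative decoding; returns the output and the number of verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new and out[-1] != "<eos>":
        # 1. The cheap draft model proposes up to k tokens, one at a time.
        proposal: List[str] = []
        while len(proposal) < k:
            tok = draft(out + proposal)
            proposal.append(tok)
            if tok == "<eos>":
                break
        # 2. The target model predicts the next token at every split.
        #    (One batched forward pass in a real transformer.)
        rounds += 1
        expected = [target(out + proposal[:i]) for i in range(len(proposal) + 1)]
        # 3. Keep drafted tokens for as long as they match the target's choice.
        n = 0
        while n < len(proposal) and proposal[n] == expected[n]:
            n += 1
        # expected[n] is the target's own token: the correction at the first
        # mismatch, or an extra token if every drafted token was accepted.
        for tok in proposal[:n] + [expected[n]]:
            if len(out) - len(prompt) >= max_new or out[-1] == "<eos>":
                break
            out.append(tok)
    return out, rounds

target = bigram_model({"the": "sky", "sky": "is", "is": "blue",
                       "blue": "due", "due": "to", "to": "physics"})
# The draft model disagrees with the target in exactly one place.
draft = bigram_model({"the": "sky", "sky": "is", "is": "green",
                      "blue": "due", "due": "to", "to": "physics"})

baseline = greedy_decode(target, ["the"], max_new=10)
spec_out, rounds = speculative_decode(target, draft, ["the"], max_new=10)
print(spec_out == baseline, rounds)  # True 2
```

In this toy run the speculative output is identical to what the target model produces on its own, but it takes 2 verification rounds instead of 7 sequential target calls. The single mismatch ("green" vs "blue") costs one extra round, which is why a draft model that agrees more often gives a bigger speedup.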
