Title: Max Buckley on LinkedIn: What is Speculative Decoding?

Max Buckley
Senior Software Engineer at Google
23h

What is Speculative Decoding?

Speculative Decoding is a very clever optimization that increases inference speed for LLMs. The idea is to use a smaller "draft" model to generate a completion to a prompt. This completion is then processed in parallel by the larger model, and if the draft model's predictions align with the larger model's preferences, we can potentially generate multiple tokens in a single pass, all while preserving the original output distribution.

This leverages the concept of speculative execution, an optimization technique usually seen in processors, where a task can be performed in parallel while checking whether it will actually be required. The classic example is branch prediction. For speculative execution to be effective, we need an efficient mechanism to suggest which speculative tasks to execute, specifically ones likely to be needed.

This optimization is particularly valuable for LLM inference because memory bandwidth is the bottleneck while compute resources are often available. With speculative decoding, your next-token predictions are batched together: the draft model's k-token completion becomes k+1 parallel prediction tasks for the main model, letting us potentially accept between 1 and k+1 tokens in a single forward pass.

In the paper, the authors demonstrated an out-of-the-box latency improvement of 2x-3x without any change to the outputs. Intriguingly, they showed that the draft model doesn't even need to be a small LLM: a simple bigram model achieved a 1.25x speedup in their experiments. Theoretically, even random completions should lead to minor speedups.
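To make the draft-then-verify mechanics concrete, here is a minimal sketch in Python. The toy distribution functions (target_dist, draft_dist) and the name speculative_step are hypothetical stand-ins, not taken from the paper's code; the acceptance rule itself (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q)) is the standard one that preserves the target model's output distribution exactly.

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def target_dist(prefix):
    # Hypothetical stand-in for the large model's next-token distribution.
    logits = np.sin(np.arange(VOCAB) + len(prefix))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_dist(prefix):
    # Hypothetical stand-in for the cheap draft model: a slightly
    # different approximation of the target distribution.
    logits = np.sin(np.arange(VOCAB) + len(prefix)) + 0.3 * np.cos(np.arange(VOCAB))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(prefix, k=4):
    """One speculative-decoding step: draft k tokens, verify them with the
    target model, and return the accepted tokens (between 1 and k+1)."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    drafted, q, ctx = [], [], list(prefix)
    for _ in range(k):
        dist = draft_dist(ctx)
        tok = rng.choice(VOCAB, p=dist)
        drafted.append(tok)
        q.append(dist)
        ctx.append(tok)

    # 2) Target model scores all k+1 positions. Simulated with a loop here;
    #    in a real system this is a single batched forward pass, which is
    #    why the method pays off when memory bandwidth is the bottleneck.
    p = [target_dist(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3) Accept each drafted token x with probability min(1, p(x)/q(x)).
    accepted = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)
        else:
            # Rejected: resample from the normalized residual max(0, p - q),
            # which keeps the overall output distribution equal to the target's.
            residual = np.maximum(p[i] - q[i], 0)
            accepted.append(rng.choice(VOCAB, p=residual / residual.sum()))
            return accepted

    # All k drafts accepted: the target's (k+1)-th distribution yields one
    # bonus token for free.
    accepted.append(rng.choice(VOCAB, p=p[k]))
    return accepted

print(speculative_step([1, 2, 3]))

Note that every token the step emits costs only one forward pass of the large model, so even in the worst case (immediate rejection) you do no worse than ordinary decoding, and in the best case you emit k+1 tokens for the price of one.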
Speculative Decoding paper: https://lnkd.in/eUBpwp5S

Influential previous work:
- Blockwise Parallel Decoding for Deep Autoregressive Models: https://lnkd.in/eFyT7CTt
- Shallow Aggressive Decoding (SAD): https://lnkd.in/edHYU8Bg

Comments

João Gante (ML (generation) @ Hugging Face 🤗, 2h):
I wrote a blog post about the subject, also explaining why we get speedups at the hardware level: https://huggingface.co/blog/assisted-generation. Meanwhile, other teams have improved the speculation technique (https://huggingface.co/blog/dynamic_speculation_lookahead) and worked around the "must have the same tokenizer" limitation (https://huggingface.co/blog/universal_assisted_generation) :D

John (JC) Cosgrove (Partner @ Cloudwerx, 18h):
A little hot take: fully applying everything we know about branch prediction is an incredibly fertile space for much more than just inference speed gains, and it's on the pile with a bunch of other things that no one has time to take seriously while we keep reinventing SOTA every 6 months 🤣 There is SO MUCH MORE VALUE waiting for businesses and consumers within the latent space that is missed after logit conversion and positional collapse. Everything we know about processing branches could unlock a lot of value quickly, not just with direct hardware optimisation.

Ramesh Bhashyam (Consultant, Data Management and Analysis, 4h):
Speculative execution aided by a probability distribution (probabilistic selection of which path to pursue) goes beyond LLMs, I guess, and appears in quantum computing too, in the sense of which computational answer to pursue. Cool.

Sharad Sisodiya (Jr. Data Scientist @ Celebal Technologies, 4h):
Yes, it is really impressive. I implemented the Speculative RAG paper and the results were pretty good.

Vaibhav Jade (MSc Machine Learning at Université de Montréal and Mila, 15h):
Hey Max, thanks for the awesome summary! I have also dived deep into the paper and written an in-depth blog; let me know your thoughts:
https://www.linkedin.com/posts/vaibhav-jade-72119810a_decoding-llm-decoding-how-llms-turn-probabilities-activity-7281406075537428482-nURB?utm_source=share&utm_medium=member_ios
https://sugared-comic-3c2.notion.site/Decoding-LLM-decoding-Can-LLMs-guide-themselves-163cf6711aad80c99d9ec0104a6c0eed

Hiren Patel, PhD (Data Science and Model-Based Systems Engineering Instructor, 1h):
Nice explanation, especially comparing it to speculative execution in CPUs. Thanks.

Rory James Zauner (AI/ML Research Engineer, 8h):
This looks really interesting. Thanks for sharing!