WeSearch

Results for "speculative decoding".

6 stories match your query across our 700+ source catalog, ranked by relevance and recency.


REDDIT

Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch

I’ve been working on an educational implementation repo for speculative decoding. The goal is not to wrap existing libraries, but to implement several speculative decoding methods from scratch behind …

· 6 views
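The methods named in the post above (EAGLE-3, Medusa-1, PARD, draft models, n-gram and suffix decoding) all share the same draft-and-verify core. As a rough illustration only (this is not code from the repo), the greedy variant of that loop can be sketched with toy stand-in models, where `target_next` and `draft_next` are hypothetical callables returning the next token id for a context:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_tokens=16):
    """Greedy speculative decoding sketch: a cheap draft model proposes k
    tokens, the target model verifies them, and the longest prefix matching
    the target's own greedy choices is accepted."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], seq[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: check each drafted token against the target's greedy
        #    choice at that position (a single batched pass in practice).
        accepted = 0
        for i, t in enumerate(draft):
            if target_next(seq + draft[:i]) == t:
                accepted += 1
            else:
                break
        seq += draft[:accepted]
        # 3. Emit one token from the target: the correction on a mismatch,
        #    or a bonus token when the whole draft was accepted.
        seq.append(target_next(seq))
    return seq[len(prompt):]
```

Because only target-approved tokens survive, the output is identical to plain greedy decoding with the target model; the speedup comes from verifying `k` draft tokens in one target pass instead of `k` sequential ones.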
LOCALLLAMA

Simple to use vLLM Docker Container for Qwen3.6 27b with Lorbus AutoRound INT4 quant and MTP speculative decoding - 118 tokens/second on 2x 3090s

· 3 views
REDDIT

Speculative decoding with Gemma-4-31B + Gemma-4-E2B enables 120 - 200 tok/s output speed for specific tasks

So for my project I was using, up until now, either Gemini 3 / 2.5 Flash or Flash-lite. None of my use cases are agentic; they are simple LLM workflows for atomic tasks like extracting references from the law, c…

· 8 views
REDDIT

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

Hey fellow Llamas, your time is precious, so I'll keep it short. We built a GGUF port of DFlash speculative decoding. Standalone C++/CUDA stack on top of ggml, runs on a single 24 GB RTX 3090, hosts t…

· 42 views
REDDIT

Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip).

TL;DR - Remembered FPGA PCI boards being a big thing from my crypto days. Wondered if an AMD Alveo V80 FPGA card could be used to approximate the performance of a Taalas HC1 (LLM-on-a-chip). Ran the idea…

· 12 views