Switching to Secondary Is Faster
The article draws an analogy between switching to a secondary firearm (faster than reloading) and handing the initial tasks in an LLM workflow to a smaller language model. Small models can quickly generate boilerplate, drafts, and plans, which a larger, more accurate model then reviews and refines. The approach mirrors speculative decoding and can save significant time on token generation.
- Switching to a secondary, smaller language model for initial tasks is faster than relying solely on a large model.
- A small model generating at 200 tokens per second can produce 16k tokens in 80 seconds, versus 320 seconds for a large model at 50 tokens per second.
- The workflow: plan with a small model, review with a large model, generate code, then review again.
- The author uses Qwen 3.6 35b MoE as the small model for local, fast execution and boilerplate generation.
- This method has not been tested on novel codebases, where the author prefers to write code manually and use small models only for repetitive tasks.
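The timing arithmetic above can be sketched as a quick back-of-envelope calculation. The 16k-token draft and the 200/50 tokens-per-second figures come from the summary; the 2,000-token review pass is an assumed size for the large model's output, added only to illustrate that the two-step workflow still wins:

```python
def generation_time(tokens: float, tokens_per_second: float) -> float:
    """Seconds needed to generate `tokens` at a given throughput."""
    return tokens / tokens_per_second

DRAFT_TOKENS = 16_000            # draft size from the article's example
SMALL_TPS, LARGE_TPS = 200, 50   # tokens/s figures from the article's example
REVIEW_TOKENS = 2_000            # assumed length of the large model's review

large_only = generation_time(DRAFT_TOKENS, LARGE_TPS)         # 320 s
small_draft = generation_time(DRAFT_TOKENS, SMALL_TPS)        # 80 s
draft_plus_review = small_draft + generation_time(REVIEW_TOKENS, LARGE_TPS)

print(f"large model drafts everything: {large_only:.0f} s")
print(f"small model drafts:            {small_draft:.0f} s")
print(f"small draft + large review:    {draft_plus_review:.0f} s")
```

Even with the assumed review pass included, the draft-then-review pipeline stays well under the large-model-only baseline, which is the whole argument of the post.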
Opening excerpt (first ~120 words)
Wayne · Posted on May 2 · Originally published at wheynelau.dev · #llm #agenticcoding #workflow

Remember, switching to your pistol is always faster than reloading. The same idea applies to LLM workflows. Most of the time, you don't need a flagship model to scaffold a project. Boilerplate, spec drafts, and initial plans are all tasks where a smaller model can do the heavy lifting. Then you pass the result to a larger model for review.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is on DEV.to.