I over-engineered my simple AI backend: distillation, router, embedding etc.
The author of Wu Wei Cards built a cost-constrained AI backend using Cloudflare Workers to provide a free AI companion for facilitators, requiring careful optimization to stay within a $2 per customer budget. They detail their journey of refining the system through techniques like distillation, routing, and embeddings to reduce token usage and improve efficiency. The article shares lessons learned from over-engineering parts of the AI infrastructure while balancing performance and cost.
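The article itself is not reproduced here, but as context for the "routing" technique the summary mentions, below is a minimal, hypothetical sketch of what a cost-aware model router on a Cloudflare Worker could look like. The model names, the shape of the `AI` binding, and the length-based heuristic are illustrative assumptions, not the author's actual implementation.

```typescript
// Hypothetical sketch of a cost-aware model router on a Cloudflare Worker.
// The Env shape, model names, and routing heuristic are assumptions for
// illustration; the article's real routing logic is not shown on this page.

interface Env {
  AI: {
    run(
      model: string,
      input: { messages: { role: string; content: string }[] }
    ): Promise<{ response?: string }>;
  };
}

// Simple heuristic: short, routine prompts go to a small, cheap model;
// longer or explicitly complex prompts go to a larger one.
function pickModel(prompt: string): string {
  const looksComplex = prompt.length > 400 || /why|compare|design|plan/i.test(prompt);
  return looksComplex
    ? "@cf/meta/llama-3.1-8b-instruct"   // larger, more expensive
    : "@cf/meta/llama-3.2-1b-instruct";  // small, cheap default
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();
    const model = pickModel(prompt);
    const result = await env.AI.run(model, {
      messages: [{ role: "user", content: prompt }],
    });
    return Response.json({ model, answer: result.response ?? "" });
  },
};
```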
Opening excerpt (first ~120 words)
Scaling LLMs at the Edge: A journey through distillation, routers, and embeddings

I have extensively edited this article after an LLM agent combed through my codebase and prepared the initial draft. At Sisyphus Consulting, we recently launched a unique product: physical facilitation cards plus digital tools for virtual facilitation. They're named Wu Wei Cards. But this write-up is not about the product. I want to share the behind-the-scenes story of how I navigated tinkering with LLMs, embeddings, and the whole trial-and-error process. If you're building something in the AI space, I hope this will be helpful to you. First, let me give the background so that you know the WHATs and WHYs.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Sisyphus Consulting.