I asked my local LLM to add 23 numbers and got seven wrong answers
The author tested a local large language model (LLM) by asking it to sum 23 stock transaction amounts, receiving seven different incorrect results across attempts. The errors revealed limitations in how LLMs handle arithmetic, input parsing, and tool integration, even with powerful hardware and multiple setups. Success was only achieved when using a proper harness with code execution and clear prompting. The experience illustrated the importance of the full AI stack—model, inference engine, orchestrator, and harness—for reliable results.
I Asked My Local LLM to Add 23 Numbers. I Got Seven Different Wrong Answers.
April 25, 2026 - 6 minutes read - 1178 words

Seven attempts, seven different wrong answers — lessons from setting up a local LLM.

It’s tax season, which means I’ve been staring at a notes file full of stock sales — 23 transactions across the year that I needed to total up. The kind of data I’d rather not paste into a chat window I don’t control. I’d been meaning to set up a local LLM anyway, and this seemed like the perfect low-stakes test.

M3 Max, 64GB of unified memory, plenty of headroom for a real model. I installed Ollama, pulled Qwen 2.5 Coder, pasted the list, asked the question. “947 shares sold so far,” it told me. Confidently. The actual answer was 1,884.

Over the next five hours and seven attempts, my local LLM gave me 2,333. Then 1,994. Then 2,364. Then 859. Twice it produced no number at all. Eventually, finally, 1,884.

(For the actual filing I used Python. This is a post about exploring local LLMs, not doing taxes with one. Don’t do taxes with one.)

What I thought would be a five-minute test became the cleanest tour through the modern AI stack I’ve ever stumbled into. By the end I understood what every layer — model, inference engine, orchestrator, harness — actually does, because I’d watched each one fail in turn.

The data

these are the stocks of Chegg i have sold.
250 shares of Chegg at 35
100 shares of Chegg at 42
88 shares of Chegg at 112
50 shares of $CHGG@78
40 shares of $CHGG @42. Cost basis $112. Sold in a loss.
80 shares at 22
145 shares at 8
... (23 transactions total, prices spanning $35 down to $0.80)
how many I have sold so far?

Some lines say “Chegg,” some say “$CHGG,” some are bare numbers. Real answer: 1,884.

Attempt 1: Ollama desktop, 7B → 947

Pasted into the chat. Got “947 shares so far.” The model had silently dropped half the input — listed only 12 of 23 transactions. Worse, even those 12 don’t sum to 947; they total 1,179.
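For contrast, the parsing the model fumbled takes one regex in Python. A minimal sketch, using only the transaction lines quoted above (the remaining lines are elided in this excerpt), that pulls the share count out of every format variant:

```python
import re

# Only the lines quoted in the excerpt; the full 23-transaction list is elided.
lines = [
    "250 shares of Chegg at 35",
    "100 shares of Chegg at 42",
    "88 shares of Chegg at 112",
    "50 shares of $CHGG@78",
    "40 shares of $CHGG @42. Cost basis $112. Sold in a loss.",
    "80 shares at 22",
    "145 shares at 8",
]

# Every variant still begins with "<count> shares", so one pattern
# covers the "Chegg", "$CHGG", and bare-number styles alike.
shares = [int(m.group(1)) for line in lines
          if (m := re.match(r"(\d+)\s+shares", line))]

print(sum(shares))  # 753 for these seven excerpted lines
```

Nothing here is dropped silently: a line that doesn't match the pattern is simply excluded, and the count of parsed lines can be checked against the expected 23.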
Two compounding failures, one confident answer. Lesson: small models will produce list-shaped output that omits items, then total the omitted version without acknowledging the omission.

Attempt 2: ollama run, 7B → 2,333

Same model, raw CLI. This time it identified all 23 transactions and wrote out the expression:

250 + 100 + 100 + 101 + 80 + 88 + 60 + 80 + 50 + 70 + 29 + 40 + 51 + 80 + 80 + 50 + 60 + 50 + 145 + 70 + 68 + 82 + 100 = 2,333

The expression is correct. The answer isn’t. Lesson: transformers don’t compute arithmetic. After the =, the model is pattern-matching what number looks plausible, not running addition. Sometimes right, often not, never reliable past a few terms.

Attempt 3: Open Interpreter → never executed

This is what should fix it. Open Interpreter is a CLI harness — the model writes code, the harness runs it in a Python sandbox. Pointed it at Ollama:

interpreter --model ollama/qwen2.5-coder:7b

The model produced:

{"name": "execute", "arguments": {"language": "python", "code": "..."}}

…and Open Interpreter just printed it as text. No “Run this? (y/n)” prompt. The model emitted JSON-shaped text that looked like a tool call but wasn’t a structured tool call the harness recognized. Same result with the 32B.

Lesson: tool-calling has two skills — knowing you should call a tool, and emitting the exact tokens that signal one. Smaller open-weight models do the first reliably and fumble the second. Frontier models are heavily post-trained on structured output; smaller models aren’t. The…
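The missing harness step is not mysterious. A minimal sketch (an illustration, not Open Interpreter's actual implementation) of what a harness must do with tool-call-shaped text: spot the JSON in the model's reply, parse it, and execute the payload — here, the attempt-2 sum the model itself got wrong:

```python
import json
import re

# The model's reply as plain text. The code payload is the attempt-2
# expression; executed, it prints 1884, not the model's 2,333.
reply = (
    '{"name": "execute", "arguments": {"language": "python", "code": '
    '"print(250+100+100+101+80+88+60+80+50+70+29+40+51+80+80+50+60+50'
    '+145+70+68+82+100)"}}'
)

# A harness has to recognize the tool call inside free-form text.
# (Sketch only: real harnesses match structured tool-call tokens,
# not a regex over the reply.)
match = re.search(r"\{.*\}", reply, re.DOTALL)
call = json.loads(match.group(0))

if call.get("name") == "execute":
    code = call["arguments"]["code"]
    exec(code)  # anything real would run this in a sandbox
```

When the model emits JSON that the harness's recognizer doesn't trigger on, none of this runs, and the "tool call" is just printed back as prose — exactly what attempt 3 showed.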
This excerpt is published under fair use for community discussion. Read the full article at Viggy28.