Benchmarking LLM Structured Outputs
The article examines how hard it is to get strict schema adherence from the structured-output features of large language models (LLMs) offered by OpenAI, Anthropic, and Google Gemini. It reports a benchmarking study that ran a set of synthetic schemas against models from these providers and found a distinct pattern of behavior for each. Some models accept complex schemas yet return structures that do not match them, which is why responses still need robust validation.
- The benchmarking study tested eight synthetic schemas against six models from OpenAI, Anthropic, and Google Gemini.
- OpenAI generally rejects most schemas at submission but conforms perfectly to those it accepts.
- Anthropic's models accept complex schemas but may silently return incorrect structures, particularly with deeply nested objects.
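The silent-failure case in the last point is the one a caller has to guard against: the reply parses as JSON, but a field deep in the nesting is missing or has the wrong type. The sketch below is one minimal way to catch that after the fact, assuming the reply arrives as a raw string. It is not the benchmark's own harness. The names `find_violations`, `parse_and_check`, `SCHEMA`, and `REPLY` are hypothetical, and the checker covers only a small subset of JSON Schema (`type`, `properties`, `required`, `items`). In practice a full validator library such as `jsonschema` would likely be a better fit.

```python
import json

# Maps JSON Schema type names to Python types (subset only).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def find_violations(value, schema, path="$"):
    """Return a list of readable violations; an empty list means conforming.

    Handles only `type`, `properties`, `required`, and `items`.
    """
    errors = []
    expected = schema.get("type")
    if expected in _TYPES:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("number", "integer")
        if not isinstance(value, _TYPES[expected]) or bool_as_number:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required property")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(find_violations(value[key], sub, f"{path}.{key}"))
    elif isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(find_violations(item, schema["items"], f"{path}[{i}]"))
    return errors


def parse_and_check(raw_text, schema):
    """Parse a model's raw text reply, then check it against the schema."""
    try:
        value = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"$: not valid JSON ({exc.msg})"]
    return value, find_violations(value, schema)


# Hypothetical nested schema, used only for illustration.
SCHEMA = {
    "type": "object",
    "required": ["order"],
    "properties": {
        "order": {
            "type": "object",
            "required": ["id", "items"],
            "properties": {
                "id": {"type": "string"},
                "items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "required": ["sku", "qty"],
                        "properties": {
                            "sku": {"type": "string"},
                            "qty": {"type": "integer"},
                        },
                    },
                },
            },
        }
    },
}

# Valid JSON, but wrong two levels down: one qty is a string, one is missing.
REPLY = '{"order": {"id": "A1", "items": [{"sku": "X", "qty": "2"}, {"sku": "Y"}]}}'

value, violations = parse_and_check(REPLY, SCHEMA)
for line in violations:
    print(line)
```

Returning a list of path-tagged violations, rather than raising on the first one, makes it easier to log exactly where a nested reply diverged from the schema before deciding whether to retry or reject it.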
Opening excerpt (first ~120 words)
David Moores. Posted on May 25. Originally published at carrick.tools.
#ai #llm #productivity #devops

Cross-posted from carrick.tools. When you read the API documentation for OpenAI, Anthropic, or Google Gemini, the feature called "structured outputs" looks like a solved problem: pass a JSON schema, get back JSON that conforms to it. In production, it is not a contract. It is a well-typed, best-effort suggestion.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.