11 results for "regression"
Claude Code Regression: How to Diagnose and Fix the Recent Quality Drop
Anthropic's postmortem reveals three regressions in Claude Code: reasoning effort, context retention, and verbosity changes. Here's how to diagnose an…
A Post-Regression World
Whether by active strategy or passive habit, commodification is being woven into every level of the modern structure. And it might be the best thing that happens to industry. That feels counterintu…
Former Eagles coach reveals the two things that contributed to the team's major regression last season
Former Philadelphia Eagles coach Jeff Stoutland blamed bad play-calling and execution for the Eagles' offensive decline a year after winning the Super Bowl.…
How Missing Data Analysis Lab uses Flask, Bayesian optimization, and MongoDB in one regression workflow
Missing Data Analysis Lab is a Flask-served Python application for studying missing-value behavior,…
Your Reviews Replicate You: LLM-Based Agents as Customer Digital Twins for Conjoint Analysis
Conjoint analysis is a cornerstone of market research for estimating consumer preferences; however, traditional methods face persistent challenges regarding time, cost, and respondent fatigue. To addr…
Claude system prompt bug wastes user money and bricks managed agents
Regression summary: Issue #47027 was closed by @bcherny in February saying "This was fixed in v2.1.92." I'm running v2.1.111 (19 versions past the fix) and the exact same behavior reproduces reliabl…
Pgrust update: at 67% Postgres compatibility, and accelerating
It’s been a week since I published the original pgrust post. pgrust is my attempt to rewrite Postgres in Rust. My ultimate goal is to build a database that is safer to work with so that I can work on …
AgentCheck – Pytest for AI Agents
Pytest-style behavioral regression testing for AI agents.…
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missi…
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static in…
Explanation Quality Assessment as Ranking with Listwise Rewards
We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single "best" explanation token-by-token, we train reward…