60 stories tagged with #benchmarking, in publish-time order across the WeSearch catalog. Tag pages update as new stories ingest.
Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking
Today marks 22 years since I started Phoronix.com to focus on Linux hardware reviews…
Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive
Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…
Cross Cloud A2A Agent Benchmarking
Building a Benchmarking Agent with A2A and MCP. This tutorial aims to build and test…
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…
DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents …
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmark…
The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data member…
Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordin…
We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper
by VEKTOR Memory — 10 min read…
Benchmarking time-series databases for ecommerce infrastructure monitoring
Time-series database performance under ecommerce load: real benchmark results. Your…
From Benchmarketing to Benchmaxxing
AI benchmarks repeat 40 years of database benchmarketing mistakes. Learn why standard evals fall short and how to build your own.…
CVE-Bench: testing LLM agents on real-world vulnerability patches
Benchmarking LLMs on real-world CVE patching…
BenchBench
TL;DR: presenting the ultimate benchmark, getting models to create benchmarks for each other, and GPT 5.2 is the current (only) winner…
CostBench: an open benchmark for data warehouse cost-performance
Introducing CostBench, an open benchmark that turns cloud data warehouse runtime and billing models into comparable performance-per-dollar results.…
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
A Blog post by IBM Research on Hugging Face…
Benchmarking LLMs for Web Tasks
Comparisons of how LLMs perform for a bunch of web tasks…
Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests
It's running very close to AMD's EPYC, which is incredible for a first-generation custom server core from NVIDIA.…
noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]
Constraint acquisition needs better benchmarks
Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by …
Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism…
OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, wh…
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based …
Revisiting Benchmarking- Building a Rust A2A Agent
This is a submission for the GitHub Finish-Up-A-Thon Challenge. Cross Language A2A Agent…
[std-proposals] Benchmarking using the standard library as a module
Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes
Look, I’m a backend engineer. I don’t have time to read through 40 pages of model cards before…
How I Slashed My AI API Bill by 92% in 2026 — A Cost Optimizer's Speed Benchmark Guide
Look, let me spill the beans right up front: I'm obsessed with saving money. Not in a cheap-skate…
EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basi…
SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusabl…
Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork
In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with un…
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactio…
SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking
Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluation. Existing benchmarks prioritize reprodu…
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization
Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a…
AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems
While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when ranking…
Show HN: AgentToolBench-Code – security benchmark for AI coding agents
GitHub Gist: instantly share code, notes, and snippets.…
Benchmarking LLM Structured Outputs
A benchmark of how OpenAI, Anthropic, and Google Gemini break under realistic JSON schemas, and how to engineer around it.…
Rust Concepts: Serde, Error Handling, Benchmarking & Workspaces (Part 6)
This is Part 6 — the final part of the Core Rust Concepts series. Part 1 — Ownership, Borrowing,...…
Design and Report Benchmarks for Knowledge Work
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and ben…
FastKernels: Benchmarking GPU Kernel Generation in Production
LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are p…
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community tr…
MetalBench – Benchmark for Apple Silicon's Metal Shading Lang
A collection & harness for the development of Apple Silicon Kernels - Lazarus-931/MetalBench…
Evaluating SPEC CPU2026
SPEC’s CPU benchmark suite has been a long established industry standard, and is almost impossible to miss when reading through various publications.…
Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard
The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges th…
HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution
Real-world timing benchmarks for HiDream-O1-Image Full — tuning steps, guidance scale, and resolution to speed up iteration without killing quality.…
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more effi…
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. F…
RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation
LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Forma…
GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval
Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resourc…
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene und…
MLX Vulkan Back End
Home for the Development of MLX Vulkan backend. Contribute to goniz/mlx-vulkan development by creating an account on GitHub.…
InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents
InferenceBench is a benchmark for open-ended LLM inference optimization by AI agents.…
Benchmarking AWS Nova on Log Data: How It Compares to ChatGPT-3.5
We re-ran a 2023 ChatGPT log analysis benchmark using AWS Nova Micro. Results show strong parsing and summarization — at 14x lower cost per token.…
What did gemma see? - Thinking in comments...
This is a submission for the Gemma 4 Challenge: Write About Gemma 4. While running a simple harness…
Benchmarking AI coding agents for distributed SQL: 350 runs, 17 models
Discover what moved the score, what regressed, and the one structural finding that changes how every team should deliver context to an AI coding tool.…
Honest Perf Benchmarks for a Paid-API Compiler
Four PRs, three releases, and a benchmark suite that won't lie to you: seeded-RNG corpora, double-gated Claude scenarios, and skipped-but-recorded records.…
Show HN: A fast, thread-safe C hashmap with lazy sorting
Contribute to RaphaelPrevost/hashmap-benchmark development by creating an account on GitHub.…
Benchmarking AI agents across five TypeScript backend frameworks
A three-run benchmark of Claude Code building the same backend across Encore, Express, Fastify, Hono, and NestJS, measuring not just whether tests pass but whether the code an AI a…
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a …