WeSearch
TAG · #BENCHMARKING

Benchmarking coverage.

Every story in the WeSearch catalog tagged #benchmarking, listed chronologically with view counts. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

60 stories are tagged #benchmarking, shown in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.


RELATED TAGS
#ai (48) · #technology (12) · #ml (8) · #programming (5) · #api (5) · #typescript (3) · #security (3) · #hardware (3) · #gpu (2) · #python (2) · #performance (2) · #rust (2)
PHORONIX

Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking

Today marks 22 years since I started Phoronix.com to focus on Linux hardware reviews…

24 views ·
#linux#hardware
THEHIVERYIQ

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…

18 views ·
#ai#technology
DEV.TO (TOP)

Cross Cloud A2A Agent Benchmarking

Building a Benchmarking Agent with A2A and MCP. This tutorial aims to build and test…

20 views ·
#cloud computing#programming
ARXIV.ORG

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…

26 views ·
#machine learning#language models#evaluation
ARXIV CS.AI

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents …

14 views ·
#artificial intelligence#desktop agents#human collaboration
ARXIV CS.AI

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmark…

15 views ·
#artificial intelligence#healthcare
ARXIV CS.AI

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data member…

17 views ·
#artificial intelligence#data provenance
ARXIV CS.AI

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordin…

14 views ·
#artificial intelligence#machine learning#api
MEDIUM

We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper

by VEKTOR Memory — 10 min read…

19 views ·
#ai#memory
DEV.TO (TOP)

Benchmarking time-series databases for ecommerce infrastructure monitoring

Time-series database performance under ecommerce load: real benchmark results. Your…

14 views ·
#database#monitoring#ecommerce
TYPEDEF

From Benchmarketing to Benchmaxxing

AI benchmarks repeat 40 years of database benchmarketing mistakes. Learn why standard evals fall short and how to build your own.…

12 views ·
#ai#data
GITHUB

CVE-Bench: testing LLM agents on real-world vulnerability patches

Benchmarking LLMs on real-world CVE patching…

20 views ·
#ai#security#vulnerabilities
HACKER NEWS (NEWEST)

BenchBench

TL;DR: presenting the ultimate benchmark, getting models to create benchmarks for each other, and GPT 5.2 is the current (only) winner…

15 views ·
#artificial intelligence#machine learning
CLICKHOUSE

CostBench: an open benchmark for data warehouse cost-performance

Introducing CostBench, an open benchmark that turns cloud data warehouse runtime and billing models into comparable performance-per-dollar results.…

18 views ·
#data warehousing#cloud computing
HUGGING FACE BLOG

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

A Blog post by IBM Research on Hugging Face…

24 views ·
#artificial intelligence#enterprise it
100X

Benchmarking LLMs for Web Tasks

Comparisons of how LLMs perform for a bunch of web tasks…

16 views ·
TOM'S HARDWARE

Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests

It's running very close to AMD's EPYC, which is incredible for a first-generation custom server core from NVIDIA.…

21 views ·
#nvidia#cpu
R/MACHINELEARNING

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

16 views ·
ARXIV CS.AI

Constraint acquisition needs better benchmarks

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by …

17 views ·
#artificial intelligence#mathematical programming
ARXIV CS.AI

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism…

22 views ·
#artificial intelligence#task generation
ARXIV CS.AI

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, wh…

20 views ·
#artificial intelligence#language models#theory of mind
ARXIV CS.AI

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based …

14 views ·
#artificial intelligence#machine learning#mobile applications
DEV.TO (TOP)

Revisiting Benchmarking: Building a Rust A2A Agent

This is a submission for the GitHub Finish-Up-A-Thon Challenge. Cross Language A2A Agent…

11 views ·
#programming#ai
R/CPP

[std-proposals] Benchmarking using the standard library as a module

19 views ·
DEV.TO (TOP)

Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes

Look, I’m a backend engineer. I don’t have time to read through 40 pages of model cards before…

15 views ·
#api#ai#python
DEV.TO (TOP)

How I Slashed My AI API Bill by 92% in 2026 — A Cost Optimizer's Speed Benchmark Guide

Look, let me spill the beans right up front: I'm obsessed with saving money. Not in a cheap-skate…

12 views ·
#ai#api#cost-saving
ARXIV CS.AI

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basi…

19 views ·
#artificial intelligence#coding agents
ARXIV CS.AI

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusabl…

18 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with un…

16 views ·
#artificial intelligence#reinforcement learning#teamwork
ARXIV CS.AI

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactio…

16 views ·
#artificial intelligence#audio-video#evaluation
ARXIV CS.AI

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluation. Existing benchmarks prioritize reprodu…

14 views ·
#artificial intelligence#mobile apps
ARXIV CS.AI

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a…

18 views ·
#artificial intelligence#optimization
ARXIV CS.AI

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when ranking…

16 views ·
#artificial intelligence#data analysis
GIST

Show HN: AgentToolBench-Code – security benchmark for AI coding agents

24 views ·
#ai#security
DEV.TO (TOP)

Benchmarking LLM Structured Outputs

A benchmark of how OpenAI, Anthropic, and Google Gemini break under realistic JSON schemas, and how to engineer around it.…

32 views ·
#ai#llm
DEV.TO (TOP)

Rust Concepts: Serde, Error Handling, Benchmarking & Workspaces (Part 6)

This is Part 6 — the final part of the Core Rust Concepts series. Part 1 — Ownership, Borrowing,…

16 views ·
#rust#programming#serialization
ARXIV CS.AI

Design and Report Benchmarks for Knowledge Work

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and ben…

11 views ·
#artificial intelligence#knowledge work
ARXIV CS.AI

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are p…

14 views ·
#machine learning#gpu#artificial intelligence
ARXIV CS.AI

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community tr…

12 views ·
#video generation#evaluation#artificial intelligence
GITHUB

MetalBench – Benchmark for Apple Silicon's Metal Shading Lang

A collection & harness for the development of Apple Silicon Kernels - Lazarus-931/MetalBench…

11 views ·
#technology#apple
DEV.TO (TOP)

Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes

Look, I’m a backend engineer. I don’t have time to read through 40 pages of model cards before…

15 views ·
#api#multimodal
HACKER NEWS (NEWEST)

Evaluating SPEC CPU2026

SPEC’s CPU benchmark suite has been a long established industry standard, and is almost impossible to miss when reading through various publications.…

12 views ·
#technology#hardware
ARXIV.ORG

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges th…

13 views ·
#cryptography#security#artificial intelligence
DEV.TO (TOP)

HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution

Real-world timing benchmarks for HiDream-O1-Image Full — tuning steps, guidance scale, and resolution to speed up iteration without killing quality.…

15 views ·
#ai#machinelearning#imagegeneration
ARXIV CS.AI

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more effi…

18 views ·
#education#artificial intelligence
ARXIV CS.AI

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. F…

16 views ·
#artificial intelligence#language models
ARXIV CS.AI

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Forma…

11 views ·
#human-computer interaction#artificial intelligence#user simulation
ARXIV CS.AI

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resourc…

17 views ·
#healthcare#artificial intelligence#machine learning
ARXIV CS.AI

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene und…

20 views ·
#computer vision#artificial intelligence
DEV.TO (TOP)

How I Slashed My AI API Bill by 92% in 2026 — A Cost Optimizer's Speed Benchmark Guide

Look, let me spill the beans right up front: I'm obsessed with saving money. Not in a cheap-skate…

18 views ·
#ai#api#cost-saving
GITHUB

MLX Vulkan Back End

Home for the Development of MLX Vulkan backend. Contribute to goniz/mlx-vulkan development by creating an account on GitHub.…

14 views ·
#technology#vulkan
INFERENCEBENCH

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents

InferenceBench is a benchmark for open-ended LLM inference optimization by AI agents.…

19 views ·
#ai#optimization
DEV.TO (TOP)

Benchmarking AWS Nova on Log Data: How It Compares to ChatGPT-3.5

We re-ran a 2023 ChatGPT log analysis benchmark using AWS Nova Micro. Results show strong parsing and summarization — at 14x lower cost per token.…

12 views ·
#ai#logging#machinelearning
DEV.TO (TOP)

What did gemma see? - Thinking in comments...

This is a submission for the Gemma 4 Challenge: Write About Gemma 4. While running a simple harness…

12 views ·
#ai#machine learning
YUGABYTE

Benchmarking AI coding agents for distributed SQL: 350 runs, 17 models

Discover what moved the score, what regressed, and the one structural finding that changes how every team should deliver context to an AI coding tool.…

13 views ·
#ai#database
DEV.TO (TOP)

Honest Perf Benchmarks for a Paid-API Compiler

Four PRs, three releases, and a benchmark suite that won't lie to you: seeded-RNG corpora, double-gated Claude scenarios, and skipped-but-recorded records.…

12 views ·
#typescript#api
GITHUB

Show HN: A fast, thread-safe C hashmap with lazy sorting

Contribute to RaphaelPrevost/hashmap-benchmark development by creating an account on GitHub.…

12 views ·
#programming#software
R/TYPESCRIPT

Benchmarking AI agents across five TypeScript backend frameworks

19 views ·
ENCORE — OPEN SOURCE BACKEND F

Benchmarking AI agents across five TypeScript backend frameworks

A three-run benchmark of Claude Code building the same backend across Encore, Express, Fastify, Hono, and NestJS, measuring not just whether tests pass but whether the code an AI a…

16 views ·
#ai#programming
ARXIV CS.AI

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a …

17 views ·
#artificial intelligence#multiagent systems