#benchmarking — Tagged Stories

All 60 stories in the WeSearch catalog tagged with #benchmarking, listed in publish-time order with view counts. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice. Tag pages update as new stories are ingested.


RELATED TAGS

#ai (48) #technology (12) #ml (8) #programming (5) #api (5) #typescript (3) #security (3) #hardware (3) #gpu (2) #python (2) #performance (2) #rust (2)

PHORONIX

Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking

Today marks 22 years since I started Phoronix.com to focus on Linux hardware reviews…

24 views · Fri, 05 Jun 2026 04:10:56 GMT

#linux #hardware

THEHIVERYIQ

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…

18 views · Wed, 03 Jun 2026 17:27:51 GMT

#ai #technology

DEV.TO (TOP)

Cross Cloud A2A Agent Benchmarking

Building a Benchmarking Agent with A2A and MCP. This tutorial aims to build and test…

20 views · Wed, 03 Jun 2026 15:42:10 GMT

#cloud computing #programming

ARXIV.ORG

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…

Benchmarking coverage.

Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Cross Cloud A2A Agent Benchmarking

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper

Benchmarking time-series databases for ecommerce infrastructure monitoring

From Benchmarketing to Benchmaxxing

CVE-Bench: testing LLM agents on real-world vulnerability patches

BenchBench

CostBench: an open benchmark for data warehouse cost-performance

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Benchmarking LLMs for Web Tasks

Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

Constraint acquisition needs better benchmarks

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Revisiting Benchmarking- Building a Rust A2A Agent

[std-proposals] Benchmarking using the standard library as a module

Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes

How I Slashed My AI API Bill by 92% in 2026 — A Cost Optimizer's Speed Benchmark Guide

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

Show HN: AgentToolBench-Code – security benchmark for AI coding agents

Benchmarking LLM Structured Outputs

Rust Concepts: Serde, Error Handling, Benchmarking & Workspaces (Part 6)

Design and Report Benchmarks for Knowledge Work

FastKernels: Benchmarking GPU Kernel Generation in Production

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

MetalBench – Benchmark for Apple Silicon's Metal Shading Lang

Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes

Evaluating Spec CPU2026

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

How I Slashed My AI API Bill by 92% in 2026 — A Cost Optimizer's Speed Benchmark Guide

MLX Vulkan Back End

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents

Benchmarking AWS Nova on Log Data: How It Compares to ChatGPT-3.5

What did gemma see? - Thinking in comments...

Benchmarking AI coding agents for distributed SQL: 350 runs, 17 models

Honest Perf Benchmarks for a Paid-API Compiler

Show HN: A fast, thread-safe C hashmap with lazy sorting

Benchmarking AI agents across five TypeScript backend frameworks

Benchmarking AI agents across five TypeScript back end frameworks

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
