Benchmark coverage.

10 views · Mon, 15 Jun 2026 01:32:35 GMT

BEAVER: Enterprise benchmark for LLM Text-to-SQL from private data warehouses

#database #enterprise #technology

TECHMEME

Xiaomi releases MiMo Code V0.1.0, an open-source AI coding assistant that it says outperforms Claude Code on agentic coding and software engineering benchmarks (Carl Franzen/VentureBeat)

17 views · Fri, 12 Jun 2026 01:05:52 GMT

TECHCRUNCH

Waymo says it built a better benchmark for comparing robotaxis to humans

Waymo created a new computer model to help it better understand how humans behave in crash scenarios that its robotaxis encounter.…

23 views · Wed, 10 Jun 2026 09:04:48 GMT

ARXIV.ORG

Benchmarks in Leipzig

Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. Most of the work was done during the 3…

23 views · Sat, 06 Jun 2026 14:35:58 GMT

#mathematics #artificial intelligence #research

GOOGLE NEWS

OpenAI research and product leads detail GPT-Rosalind capabilities and benchmarks - R&D World

21 views · Sat, 06 Jun 2026 00:47:30 GMT

9TO5MAC

Chrome for Mac breaks benchmark records on the latest MacBook Pro

Google has shared the results of the latest Chrome performance benchmarks, including record scores on tests running on an M5 MacBook Pro.…

22 views · Sat, 06 Jun 2026 00:43:00 GMT

#technology #browsers #performance

PHORONIX

Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking

Today marks 22 years since I started Phoronix.com to focus on Linux hardware reviews…

24 views · Fri, 05 Jun 2026 04:10:56 GMT

#linux #hardware #benchmarking

TECHCRUNCH

Benchmark raises its first-ever growth fund as part of $2B capital raise

The legendary firm abandons its more-than-20-year tradition of keeping its funds to about $425 million.…

28 views · Thu, 04 Jun 2026 04:55:34 GMT

#venture capital #artificial intelligence #investment

THEHIVERYIQ

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…

18 views · Wed, 03 Jun 2026 17:27:51 GMT

#ai #technology #benchmarking

PHORONIX

AMD EPYC 8635P "Sorano" Benchmarks: Significant Upgrade Opportunity For EPYC 8004 Servers

17 views · Wed, 03 Jun 2026 15:42:13 GMT

21 views · Wed, 03 Jun 2026 15:42:10 GMT

Cross Cloud A2A Agent Benchmarking

Building a Benchmarking Agent with A2A and MCP. This tutorial aims to build and test…

#cloud computing #benchmarking #programming

TOM'S HARDWARE

Trump signs AI executive order seeking 30-day government access to frontier models before release — voluntary framework will include classified benchmark to determine which models qualify

The voluntary framework avoids mandatory licensing but gives the government a say in which firms get early access.…

23 views · Wed, 03 Jun 2026 11:57:04 GMT

#ai #cybersecurity #government

ARXIV.ORG

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…

26 views · Wed, 03 Jun 2026 04:51:55 GMT

#machine learning #language models #evaluation

19 views · Wed, 03 Jun 2026 04:11:55 GMT

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…

#artificial intelligence #autonomous agents #evaluation

16 views · Wed, 03 Jun 2026 04:11:55 GMT

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents …

#artificial intelligence #desktop agents #human collaboration

15 views · Wed, 03 Jun 2026 04:11:55 GMT

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly unde…

#artificial intelligence #graph theory #education

17 views · Wed, 03 Jun 2026 04:11:55 GMT

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchma…

#healthcare #artificial intelligence #clinical decision-making

16 views · Wed, 03 Jun 2026 04:11:55 GMT

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmark…

#artificial intelligence #healthcare #benchmarking

18 views · Wed, 03 Jun 2026 04:11:55 GMT

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data member…

#artificial intelligence #benchmarking #data provenance

14 views · Wed, 03 Jun 2026 04:11:55 GMT

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordin…

#artificial intelligence #machine learning #api

R/LOCALLLAMA

Why do we benchmark quants on perplexity and prose but never on tool call validity?

13 views · Wed, 03 Jun 2026 02:11:50 GMT

R/OPENAI

Someone benchmarked on how accurate different AI are on excel documents

19 views · Sat, 30 May 2026 22:27:47 GMT

MEDIUM

We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper

by VEKTOR Memory — 10 min read…

19 views · Sat, 30 May 2026 22:27:44 GMT

#ai #memory #benchmarking

15 views · Sat, 30 May 2026 07:42:08 GMT

Benchmarking time-series databases for ecommerce infrastructure monitoring

Time-series database performance under ecommerce load: real benchmark results. Your…

#database #monitoring #ecommerce

12 views · Sat, 30 May 2026 04:41:57 GMT

PROMPTPurify: 14 MB CPU-only prompt-injection guard (benchmarked vs. OSS guard)

Prompt-injection guardrail for LLM applications. Compact model that outperforms larger open-source guards. No regex, no signatures. Demo: anton.securelayer7.net - securelayer7/PROM…

#technology #security #ai

TYPEDEF

From Benchmarketing to Benchmaxxing

AI benchmarks repeat 40 years of database benchmarketing mistakes. Learn why standard evals fall short and how to build your own.…

13 views · Sat, 30 May 2026 04:11:57 GMT

#ai #benchmarking #data

EQBENCH

Eqbench: Emotional Intelligence Benchmarks for LLMs

18 views · Fri, 29 May 2026 22:20:35 GMT

#technology #artificial intelligence #emotional intelligence

20 views · Fri, 29 May 2026 19:45:02 GMT

CVE-Bench: testing LLM agents on real-world vulnerability patches

Benchmarking LLMs on real-world CVE patching…

#ai #security #vulnerabilities

YCOMBINATOR

Ask HN: How would you benchmark your engineering team's AI adoption?

12 views · Fri, 29 May 2026 16:20:05 GMT

#ai #engineering #productivity

HACKER NEWS (NEWEST)

Test yourself against local open-source LLMs benchmark questions

10 views · Fri, 29 May 2026 15:20:07 GMT

ARM NEWSROOM

Arm Metis with GPT5.5 Cyber scores 98% on firmware vulnerability benchmark

Arm Metis is an open-source agentic AI security framework that helps detect software vulnerabilities earlier.…

14 views · Fri, 29 May 2026 14:50:01 GMT

#technology #security #software

THE HINDU — TOP

Mahanadu has set new benchmark in political mobilisation, says TDP State president Palla Srinivasa Rao

TDP's Mahanadu-2026 sets a new political mobilisation benchmark, advocating 33% women’s representation and fostering democratic participation.…

24 views · Fri, 29 May 2026 14:05:00 GMT

#politics #women's empowerment #technology

17 views · Fri, 29 May 2026 12:50:00 GMT

Research repository for the Americas – benchmarks, models, governance

Open research repository for federated, regionally-grounded AI development across the Western Hemisphere. Maintained by GENIA Americas / RaceFor.AI. - GENIA-Americas/multimodal-ai-…

#ai #research #governance

HACKER NEWS (NEWEST)

BenchBench

TL;DR: presenting the ultimate benchmark, getting models to create benchmarks for each other, and GPT 5.2 is the current (only) winner…

17 views · Fri, 29 May 2026 12:20:00 GMT

#artificial intelligence #machine learning #benchmarking

12 views · Fri, 29 May 2026 11:50:00 GMT

Claude Opus 4.8 Is Here: Benchmarks, Dynamic Workflows, and Whether to Upgrade From 4.7

Anthropic shipped Claude Opus 4.8 yesterday. It catches 4x more of its own code mistakes, runs hundreds of parallel subagents through Dynamic Workflows, and keeps the same price as…

#ai #coding #productivity

R/WEBDEV

Updated imagor 1.9.1 benchmark results for dynamic image processing

16 views · Fri, 29 May 2026 10:20:02 GMT

R/LOCALLLAMA

StepFun 3.7 Flash - Speed Benchmark in M5 Max

25 views · Fri, 29 May 2026 04:29:44 GMT

15 views · Fri, 29 May 2026 03:59:41 GMT

LLM Benchmarks, Agent Frameworks, and the Tools That Matter in 2026 [03:37:09]

An in-depth look at the AI agent revolution reshaping software development and business automation in 2026.…

#ai #automation #technology

R/SELFHOSTED

I benchmarked 6 self-hosted book server apps up to 150K books (ingestion time + RAM/CPU)

21 views · Thu, 28 May 2026 22:29:40 GMT

R/CLAUDEAI

ChatGPT-5.5 Beats Opus in Realistic Benchmark (DeepSWE)

15 views · Thu, 28 May 2026 01:28:09 GMT

R/SINGULARITY

DeepSWE finally a proper coding benchmark

15 views · Thu, 28 May 2026 00:58:10 GMT

R/SINGULARITY

Buyout Game Benchmark: 8 models play a social strategy game with public balances, private transfers, messaging, eliminations, deals, defections, and a final buyout phase. 804 games. GPT-5.5 is the champion. Opus 4.7 performs well.

15 views · Wed, 27 May 2026 23:08:09 GMT

R/MACHINELEARNING

BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison [R]

16 views · Wed, 27 May 2026 22:08:09 GMT

CLICKHOUSE

CostBench: an open benchmark for data warehouse cost-performance

Introducing CostBench, an open benchmark that turns cloud data warehouse runtime and billing models into comparable performance-per-dollar results.…

19 views · Wed, 27 May 2026 21:38:05 GMT

#data warehousing #benchmarking #cloud computing

HUGGING FACE BLOG

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

A Blog post by IBM Research on Hugging Face…

24 views · Wed, 27 May 2026 17:23:02 GMT

#artificial intelligence #enterprise it #benchmarking

100X

Benchmarking LLMs for Web Tasks

Comparisons of how LLMs perform for a bunch of web tasks…

16 views · Wed, 27 May 2026 16:38:02 GMT

R/NETSEC

MalShark: MCP-Powered Malware Traffic Analysis — Benchmarked Against Real Malware

15 views · Wed, 27 May 2026 14:38:04 GMT

R/CYBERSECURITY

MalShark: MCP-Powered Malware Traffic Analysis — Benchmarked Against Real Malware

19 views · Wed, 27 May 2026 14:08:05 GMT

TOM'S HARDWARE

Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests

It's running very close to AMD's EPYC, which is incredible for a first-generation custom server core from NVIDIA.…

25 views · Wed, 27 May 2026 13:58:00 GMT

#nvidia #cpu #benchmarking

R/MACHINELEARNING

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

16 views · Wed, 27 May 2026 13:08:04 GMT

TECHMEME

Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, and says GPT-5.5 is the leader at 70% (Michael Nuñez/VentureBeat)

18 views · Wed, 27 May 2026 10:08:02 GMT

R/LOCALLLAMA

New DeepSWE benchmark finds Claude Opus cheats

18 views · Wed, 27 May 2026 07:38:00 GMT

19 views · Wed, 27 May 2026 05:37:56 GMT

AI 3D tools need product evals, not benchmark faith

If you’re building AI-assisted 3D or CAD-like workflows, benchmark scores only get you so far. The real work is designing evals around your product contract and catching geometry f…

#ai #3d #evaluation

15 views · Wed, 27 May 2026 05:07:56 GMT

Chronos vs Toto: Zero-Shot Forecasting Benchmark Results

Good forecasts help with capacity planning and quieter alerts. But one…

#forecasting #dataengineering #observability

17 views · Wed, 27 May 2026 04:07:56 GMT

Constraint acquisition needs better benchmarks

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by …

#artificial intelligence #benchmarking #mathematical programming

24 views · Wed, 27 May 2026 04:07:56 GMT

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism…

#artificial intelligence #benchmarking #task generation

20 views · Wed, 27 May 2026 04:07:56 GMT

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, wh…

#artificial intelligence #language models #theory of mind

14 views · Wed, 27 May 2026 04:07:56 GMT

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based …

#artificial intelligence #machine learning #mobile applications