WeSearch
TAG · #BENCHMARK

Benchmark coverage.

All 60 stories in the WeSearch catalog tagged with #benchmark, listed in publish-time order with view counts. The page updates as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

RELATED TAGS
#benchmarking (61) · #ai (54) · #technology (14) · #ml (9) · #security (5) · #programming (5) · #api (5) · #hardware (4) · #benchmarks (4) · #performance (4) · #typescript (3) · #language-models (3)
GITHUB

BEAVER: Enterprise benchmark for LLM Text-to-SQL from private data warehouses

10 views ·
#database#enterprise#technology
TECHMEME

Xiaomi releases MiMo Code V0.1.0, an open-source AI coding assistant that it says outperforms Claude Code on agentic coding and software engineering benchmarks (Carl Franzen/VentureBeat)

17 views ·
TECHCRUNCH

Waymo says it built a better benchmark for comparing robotaxis to humans

Waymo created a new computer model to help it better understand how humans behave in crash scenarios that its robotaxis encounter.…

23 views ·
ARXIV.ORG

Benchmarks in Leipzig

Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. Most of the work was done during the 3…

23 views ·
#mathematics#artificial intelligence#research
GOOGLE NEWS

OpenAI research and product leads detail GPT-Rosalind capabilities and benchmarks - R&D World

21 views ·
9TO5MAC

Chrome for Mac breaks benchmark records on the latest MacBook Pro

Google has shared the results of the latest Chrome performance benchmarks, including record scores on tests running on an M5 MacBook Pro.…

22 views ·
#technology#browsers#performance
PHORONIX

Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking

Today marks 22 years since I started Phoronix.com to focus on Linux hardware reviews…

24 views ·
#linux#hardware#benchmarking
TECHCRUNCH

Benchmark raises its first-ever growth fund as part of $2B capital raise

The legendary firm abandons its more-than-20-year tradition of keeping its funds to about $425 million.…

28 views ·
#venture capital#artificial intelligence#investment
THEHIVERYIQ

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…

18 views ·
#ai#technology#benchmarking
PHORONIX

AMD EPYC 8635P "Sorano" Benchmarks: Significant Upgrade Opportunity For EPYC 8004 Servers

17 views ·
DEV.TO (TOP)

Cross Cloud A2A Agent Benchmarking

Building a Benchmarking Agent with A2A and MCP: This tutorial aims to build and test…

21 views ·
#cloud computing#benchmarking#programming
TOM'S HARDWARE

Trump signs AI executive order seeking 30-day government access to frontier models before release — voluntary framework will include classified benchmark to determine which models qualify

The voluntary framework avoids mandatory licensing but gives the government a say in which firms get early access.…

23 views ·
#ai#cybersecurity#government
ARXIV.ORG

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…

26 views ·
#machine learning#language models#evaluation
ARXIV CS.AI

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…

19 views ·
#artificial intelligence#autonomous agents#evaluation
ARXIV CS.AI

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents …

16 views ·
#artificial intelligence#desktop agents#human collaboration
ARXIV CS.AI

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly unde…

15 views ·
#artificial intelligence#graph theory#education
ARXIV CS.AI

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchma…

17 views ·
#healthcare#artificial intelligence#clinical decision-making
ARXIV CS.AI

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmark…

16 views ·
#artificial intelligence#healthcare#benchmarking
ARXIV CS.AI

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data member…

18 views ·
#artificial intelligence#benchmarking#data provenance
ARXIV CS.AI

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordin…

14 views ·
#artificial intelligence#machine learning#api
R/LOCALLLAMA

Why do we benchmark quants on perplexity and prose but never on tool call validity?

13 views ·
R/OPENAI

Someone benchmarked on how accurate different AI are on excel documents

19 views ·
MEDIUM

We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper

by VEKTOR Memory — 10 min read…

19 views ·
#ai#memory#benchmarking
DEV.TO (TOP)

Benchmarking time-series databases for ecommerce infrastructure monitoring

Time-series database performance under ecommerce load: real benchmark results. Your…

15 views ·
#database#monitoring#ecommerce
GITHUB

PROMPTPurify: 14 MB CPU-only prompt-injection guard (benchmarked vs. OSS guard)

Prompt-injection guardrail for LLM applications. Compact model that outperforms larger open-source guards. No regex, no signatures. Demo: anton.securelayer7.net - securelayer7/PROM…

12 views ·
#technology#security#ai
TYPEDEF

From Benchmarketing to Benchmaxxing

AI benchmarks repeat 40 years of database benchmarketing mistakes. Learn why standard evals fall short and how to build your own.…

13 views ·
#ai#benchmarking#data
EQBENCH

Eqbench: Emotional Intelligence Benchmarks for LLMs

18 views ·
#technology#artificial intelligence#emotional intelligence
GITHUB

CVE-Bench: testing LLM agents on real-world vulnerability patches

Benchmarking LLMs on real-world CVE patching…

20 views ·
#ai#security#vulnerabilities
YCOMBINATOR

Ask HN: How would you benchmark your engineering team's AI adoption?

12 views ·
#ai#engineering#productivity
HACKER NEWS (NEWEST)

Test yourself against local open-source LLMs benchmark questions

10 views ·
ARM NEWSROOM

Arm Metis with GPT5.5 Cyber scores 98% on firmware vulnerability benchmark

Arm Metis is an open-source agentic AI security framework that helps detect software vulnerabilities earlier.…

14 views ·
#technology#security#software
THE HINDU — TOP

Mahanadu has set new benchmark in political mobilisation, says TDP State president Palla Srinivasa Rao

TDP's Mahanadu-2026 sets a new political mobilisation benchmark, advocating 33% women’s representation and fostering democratic participation.…

24 views ·
#politics#women's empowerment#technology
GITHUB

Research repository for the Americas – benchmarks, models, governance

Open research repository for federated, regionally-grounded AI development across the Western Hemisphere. Maintained by GENIA Americas / RaceFor.AI. - GENIA-Americas/multimodal-ai-…

17 views ·
#ai#research#governance
HACKER NEWS (NEWEST)

BenchBench

TL;DR: presenting the ultimate benchmark, getting models to create benchmarks for each other, and GPT 5.2 is the current (only) winner…

17 views ·
#artificial intelligence#machine learning#benchmarking
DEV.TO (TOP)

Claude Opus 4.8 Is Here: Benchmarks, Dynamic Workflows, and Whether to Upgrade From 4.7

Anthropic shipped Claude Opus 4.8 yesterday. It catches 4x more of its own code mistakes, runs hundreds of parallel subagents through Dynamic Workflows, and keeps the same price as…

12 views ·
#ai#coding#productivity
R/WEBDEV

Updated imagor 1.9.1 benchmark results for dynamic image processing

16 views ·
R/LOCALLLAMA

StepFun 3.7 Flash - Speed Benchmark in M5 Max

25 views ·
DEV.TO (TOP)

LLM Benchmarks, Agent Frameworks, and the Tools That Matter in 2026 [03:37:09]

An in-depth look at the AI agent revolution reshaping software development and business automation in 2026.…

15 views ·
#ai#automation#technology
R/SELFHOSTED

I benchmarked 6 self-hosted book server apps up to 150K books (ingestion time + RAM/CPU)

21 views ·
R/CLAUDEAI

ChatGPT-5.5 Beats Opus in Realistic Benchmark (DeepSWE)

15 views ·
R/SINGULARITY

DeepSWE finally a proper coding benchmark

15 views ·
R/SINGULARITY

Buyout Game Benchmark: 8 models play a social strategy game with public balances, private transfers, messaging, eliminations, deals, defections, and a final buyout phase. 804 games. GPT-5.5 is the champion. Opus 4.7 performs well.

15 views ·
R/MACHINELEARNING

BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison [R]

16 views ·
CLICKHOUSE

CostBench: an open benchmark for data warehouse cost-performance

Introducing CostBench, an open benchmark that turns cloud data warehouse runtime and billing models into comparable performance-per-dollar results.…

19 views ·
#data warehousing#benchmarking#cloud computing
HUGGING FACE BLOG

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

A Blog post by IBM Research on Hugging Face…

24 views ·
#artificial intelligence#enterprise it#benchmarking
100X

Benchmarking LLMs for Web Tasks

Comparisons of how LLMs perform for a bunch of web tasks…

16 views ·
R/NETSEC

MalShark: MCP-Powered Malware Traffic Analysis — Benchmarked Against Real Malware

15 views ·
R/CYBERSECURITY

MalShark: MCP-Powered Malware Traffic Analysis — Benchmarked Against Real Malware

19 views ·
TOM'S HARDWARE

Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests

It's running very close to AMD's EPYC, which is incredible for a first-generation custom server core from NVIDIA.…

25 views ·
#nvidia#cpu#benchmarking
R/MACHINELEARNING

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

16 views ·
TECHMEME

Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, and says GPT-5.5 is the leader at 70% (Michael Nuñez/VentureBeat)

18 views ·
R/LOCALLLAMA

New DeepSWE benchmark finds Claude Opus cheats

18 views ·
DEV.TO (TOP)

AI 3D tools need product evals, not benchmark faith

If you’re building AI-assisted 3D or CAD-like workflows, benchmark scores only get you so far. The real work is designing evals around your product contract and catching geometry f…

19 views ·
#ai#3d#evaluation
DEV.TO (TOP)

Chronos vs Toto: Zero-Shot Forecasting Benchmark Results

Introduction: Good forecasts help with capacity planning and quieter alerts. But one…

15 views ·
#forecasting#dataengineering#observability
ARXIV CS.AI

Constraint acquisition needs better benchmarks

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by …

17 views ·
#artificial intelligence#benchmarking#mathematical programming
ARXIV CS.AI

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism…

24 views ·
#artificial intelligence#benchmarking#task generation
ARXIV CS.AI

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, wh…

20 views ·
#artificial intelligence#language models#theory of mind
ARXIV CS.AI

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based …

14 views ·
#artificial intelligence#machine learning#mobile applications
ARXIV CS.AI

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchma…

14 views ·
#software engineering#artificial intelligence#computer vision
TECHMEME

Initial benchmarks show Nvidia's Vera CPU, which features 88 in-house-designed Olympus cores, packs a heavy-hitting punch, beating Intel's and AMD's x86_64 CPUs (Michael Larabel/Phoronix)

20 views ·