Bench coverage.

26 views · Mon, 13 Jul 2026 06:15:33 GMT

LongMedBench: Benchmarking Medical Agents for Long-Horizon Clinical Decision-Making

arXiv:2607.09322v1 Announce Type: new Abstract: In this work, we introduce LongMedBench, a real-world EHR-based benchmark for long-horizon clinical decision-making. Prior evaluatio…

#longmedbench #benchmarking #medical

20 views · Mon, 13 Jul 2026 06:15:33 GMT

MedRealMM: A Real-World Multimodal Benchmark for Chinese Online Medical Consultation

arXiv:2607.09142v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in online medical consultation, yet existing benchmarks remain poorly aligned…

#medrealmm #real-world #multimodal

22 views · Mon, 13 Jul 2026 06:15:33 GMT

Long-Horizon-Terminal-Bench: Testing the Limits of Agents on Long-Horizon Terminal Tasks with Dense Reward-Based Grading

arXiv:2607.08964v1 Announce Type: new Abstract: AI agents have become capable of autonomously completing short, well-specified tasks. However, existing terminal benchmarks largely …

#long-horizon-terminal-bench #testing #limits

THE HINDU — TOP

CJI-led Bench to hear pleas on Ram Temple donation embezzlement case | LIVE

Follow The Hindu Live for updates on the Supreme Court hearing on RJD MP Sudhakar Singh's plea seeking a CBI probe, protection of electronic evidence, and disclosure of the Shri Ra…

21 views · Mon, 13 Jul 2026 04:40:36 GMT

#cji-led #hear

14 views · Mon, 13 Jul 2026 04:20:37 GMT

REFORGE: A Method for Benchmarking LLMs' Reverse Engineering Capabilities in Decompiled Binary Function Naming

Large language models (LLMs) are increasingly applied to reverse-engineering tasks, and recent threat-intelligence reporting shows them operating inside live offensive-security wor…

#reforge #method #benchmarking

16 views · Mon, 13 Jul 2026 04:20:37 GMT

HERO: A Heterogeneity-Aware Benchmark Library for Federated Continual Learning

Federated continual learning (FCL) evaluates how distributed clients learn from changing data streams while retaining previously learned knowledge. Existing evaluations are difficu…

#hero #heterogeneity-aware #benchmark

13 views · Mon, 13 Jul 2026 04:20:37 GMT

MultiView-Bench: A Diagnostic Benchmark for World-Centric Multi-View Integration in VLMs

Recent benchmarks for VLMs largely assess single- or limited-view perception, leaving untested the core cognitive ability to integrate observations across viewpoints into a coheren…

#multiview-bench #diagnostic #benchmark

19 views · Mon, 13 Jul 2026 04:20:37 GMT

OmniMapBench: Benchmarking Visual-Centric Reasoning on Diverse Map Documents

Recent advancements in LVLMs necessitate robust benchmarks for complex, visually grounded reasoning. A critical limitation is identified in many document understanding benchmarks: …

#omnimapbench #benchmarking #visual-centric

21 views · Mon, 13 Jul 2026 04:20:37 GMT

Event Stream based Multi-Modal Video Anomaly Detection: A Benchmark Dataset and Algorithms

Video anomaly detection (VAD) is critical for automated surveillance but remains fragile under challenging conditions such as illumination variations, fast motion, and complex back…

#event #stream #based

22 views · Mon, 13 Jul 2026 03:24:25 GMT

Benches clear in Orioles-Royals game after infielder exits with ‘gut-punch’ injury

Alexander was visibly upset and turned to jaw at Erceg, and the Orioles staff quickly came out to restrain him from escalating the situation further.…

#benches #clear #orioles-royals

THE GUARDIAN

Liberal frontbencher Dan Tehan says coalition with One Nation ‘not even being talked about’

Anthony Albanese mocks ructions on conservative side of politics and says three rightwing parties are caught up in ‘race to the bottom’…

26 views · Sun, 05 Jul 2026 05:25:12 GMT

#liberal #frontbencher #tehan

DANLUU

Agentic test processes, LLM benchmarks

98 views · Sat, 04 Jul 2026 19:00:17 GMT

#ai #testing #softwaredevelopment

37 views · Fri, 03 Jul 2026 01:21:31 GMT

Betnijah Laney-Hamilton benched in latest Liberty challenge since injury return

Betnijah Laney-Hamilton watched the Liberty win Tuesday’s Commissioner’s Cup championship from the bench.…

#betnijah #laney-hamilton #benched

SENIOR SWE-BENCH

Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

Evaluating agents as senior engineers on the work we actually give them…

32 views · Thu, 02 Jul 2026 03:50:14 GMT

#senior #swe-bench #open-source

OUTKICK

Willson Contreras at the center of benches-clearing scuffle, throws helmet at Nationals pitcher

Willson Contreras stormed the mound and hurled his batting helmet at Cade Cavalli after a strikeout, sparking a massive brawl at Fenway Park.…

39 views · Wed, 01 Jul 2026 03:06:13 GMT

#willson #contreras #center

THE HINDU

Roca Introduces Touch-T: A New Benchmark in Thermostatic Shower Systems

26 views · Tue, 30 Jun 2026 08:09:23 GMT

#roca #introduces #touch-t

THE GLOBE AND MAIL

Poilievre shuffling his front bench, placing MP Michael Chong in key finance role

The shuffle is the first major shakeup since the naming of Poilievre’s shadow cabinet after last year’s election…

32 views · Tue, 30 Jun 2026 08:04:23 GMT

#poilievre #shuffling #front

41 views · Mon, 29 Jun 2026 07:20:58 GMT

NormAct: A Benchmark for Hidden Social Norm Compliance in Embodied Planning

arXiv:2606.27826v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) are increasingly deployed as embodied planners in egocentric environments, where task succe…

#normact #benchmark #hidden

29 views · Mon, 29 Jun 2026 01:40:59 GMT

David Ross would ‘love’ chance at Mets job after his ex-bench coach Andy Green is done as interim

If the Mets managerial job does open following the season, David Ross said he’d “love” to get a shot at it.…

#david #ross #would

GOOGLE NEWS

Messi on the bench as Argentina makes nine changes for Jordan clash - Reuters

28 views · Sun, 28 Jun 2026 05:25:12 GMT

TECHMEME

GPT-5.6 Sol matches Mythos Preview on ExploitBench, adds Ultra mode with subagents for complex workflows, and max reasoning for deep problem-solving (OpenAI)

OpenAI : GPT-5.6 Sol matches Mythos Preview on ExploitBench, adds Ultra mode with subagents for complex workflows, and max reasoning for deep problem-solving — We're beginning a li…

45 views · Fri, 26 Jun 2026 18:07:35 GMT

CURSOR

Reward hacking is swamping model intelligence gains

On SWE-bench Pro, 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. Stricter eval harnesses show how benchmark scores can conflate coding ability…

27 views · Fri, 26 Jun 2026 08:07:31 GMT

#ai #machinelearning #codingbenchmarks

SOUTH CHINA MORNING POST

CK Asset sells penthouse in Hong Kong’s Mid-Levels for US$48.5m, sets pricing benchmark

The record per square foot pricing for new homes this year highlights the growing momentum in the city’s luxury property market.…

23 views · Fri, 26 Jun 2026 06:12:58 GMT

#realestate #hongkong #luxury

31 views · Fri, 26 Jun 2026 05:20:40 GMT

OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents

arXiv:2606.26350v1 Announce Type: new Abstract: Although large language model agents are increasingly applied to quantitative-finance workflows, their evaluation remains fragmented…

#artificialintelligence #machinelearning #quantitativefinance

26 views · Fri, 26 Jun 2026 05:20:40 GMT

Life After Benchmark Saturation: A Case Study of CORE-Bench

arXiv:2606.26158v1 Announce Type: new Abstract: When a benchmark's accuracy saturates, it is often retired and replaced with a more challenging version. We show that this approach …

#life #benchmark #saturation

ESPN — SOCCER

Pulisic starts on bench for U.S. with Turner in go...

24 views · Fri, 26 Jun 2026 02:19:29 GMT

INVESTING.COM — NEWS

Soccer-Pulisic on the bench as U.S., Turkey make wholesale changes

20 views · Fri, 26 Jun 2026 01:33:44 GMT

CHANNEL NEWSASIA

Pulisic on the bench as US, Turkey make wholesale changes

INGLEWOOD, California, June 25: U.S. winger Christian Pulisic starts on the bench for a Group D match in which the U.S. and Turkey make wholesale changes. • The Americans make nin…

21 views · Fri, 26 Jun 2026 01:33:41 GMT

#pulisic #turkey

I'VE DONE SOME THINGS

Will It Mythos?

OK, so Mythos finds really challenging security bugs, right? That’s why it’s cordoned off from the hoi polloi, to protect the world from such a powerful finder of exploits. I am sk…

72 views · Tue, 23 Jun 2026 04:44:35 GMT

#security #ai #benchmarking

GOOGLE NEWS

China keeps lending benchmark LPRs unchanged for 13th month in June - Reuters

32 views · Mon, 22 Jun 2026 04:18:17 GMT

GITHUB

BEAVER: Enterprise benchmark for LLM Text-to-SQL from private data warehouses

40 views · Mon, 15 Jun 2026 01:32:35 GMT

#database #enterprise #technology

AL JAZEERA ENGLISH

Lamine Yamal fit to start on bench for Spain vs Cape Verde at World Cup

Victor Munoz is also fit to play as coach Luis de la Fuente says he has no concerns over his players' fitness.…

84 views · Sun, 14 Jun 2026 22:42:33 GMT

#worldcup #soccer #sports

ESPN — SOCCER

Yamal on bench for Cape Verde clash - Spain boss

29 views · Sun, 14 Jun 2026 22:37:33 GMT

CHANNEL NEWSASIA

Yamal fit to start on bench for Spain against Cape Verde

ATLANTA, June 14: Spain winger Lamine Yamal is fit and ready to come off the bench in his side's opening World Cup game, but manager Luis de la Fuente said on Sunday that his entr…

30 views · Sun, 14 Jun 2026 22:13:02 GMT

#yamal #start

27 views · Sun, 14 Jun 2026 07:29:46 GMT

Knicks’ ice-cold bench can’t derail the title train as 53-year drought ends

It wasn’t so long ago, Landry Shamet couldn’t miss a shot.…

#nba #basketball #newyorkknicks

WORLD NEWS | THE GUARDIAN

Liberal frontbencher Jonno Duniam to quit politics, says leadership spill ‘started to really wear on me’

Tasmanian senator says he feels he is ‘letting down the team’ but needs to prioritise family…

39 views · Sun, 14 Jun 2026 05:19:38 GMT

#liberal #frontbencher #jonno

NET

Liberal frontbencher Jonno Duniam to retire

Senator Duniam says it is time to prioritise his family after 25 years in politics, and he will leave his position in the Senate by the end of this year.…

36 views · Sat, 13 Jun 2026 23:12:00 GMT

#liberal #frontbencher #jonno

29 views · Sat, 13 Jun 2026 05:42:54 GMT

Christian Pulisic reveals what led to surprise second-half benching in USMNT opener

After coming off at halftime of the U.S. men’s national team’s 4-1 win over Paraguay to open the World Cup, superstar attacker Christian Pulisic said he expects to be fine.…

#christian #pulisic #reveals

CALIFORNIA POST

Matt Turner’s wife Ash stands behind USMNT despite World Cup bench role

Most of the USA men's national team's starting 11 was set in stone heading into the team's June 12 World Cup debut match against Paraguay.…

46 views · Sat, 13 Jun 2026 03:05:38 GMT

#matt #turner #wife

TECHMEME

Xiaomi releases MiMo Code V0.1.0, an open-source AI coding assistant that it says outperforms Claude Code on agentic coding and software engineering benchmarks (Carl Franzen/VentureBeat)

36 views · Fri, 12 Jun 2026 01:05:52 GMT

TECHCRUNCH

Waymo says it built a better benchmark for comparing robotaxis to humans

Waymo created a new computer model to help it better understand how humans behave in crash scenarios that its robotaxis encounter.…

43 views · Wed, 10 Jun 2026 09:04:48 GMT

#waymo #says #built

45 views · Sun, 07 Jun 2026 05:55:57 GMT

Knicks in prime NBA Finals position thanks to their impressive bench

This year, Brunson is the only starter averaging more than 34 minutes in the playoffs.…

#basketball #nba #new york knicks

42 views · Sat, 06 Jun 2026 14:35:58 GMT

Benchmarks in Leipzig

Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. Most of the work was done during the 3…

#mathematics #artificial intelligence #research

GOOGLE NEWS

OpenAI research and product leads detail GPT-Rosalind capabilities and benchmarks - R&D World

30 views · Sat, 06 Jun 2026 00:47:30 GMT

9TO5MAC

Chrome for Mac breaks benchmark records on the latest MacBook Pro

Google has shared the results of the latest Chrome performance benchmarks, including record scores on tests running on an M5 MacBook Pro.…

42 views · Sat, 06 Jun 2026 00:43:00 GMT

#technology #browsers #performance

PHORONIX

Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking

Today marks 22 years since I started Phoronix.com to focus on Linux hardware reviews…

44 views · Fri, 05 Jun 2026 04:10:56 GMT

#linux #hardware #benchmarking

35 views · Fri, 05 Jun 2026 01:05:49 GMT

Jose Alvarado fearless in big NBA Finals moment for Knicks bench

Jose Alvarado’s first thought when Jalen Brunson hobbled off the court and went to the locker room echoed what every Knicks fan was thinking.…

#nba #basketball #new york knicks

43 views · Thu, 04 Jun 2026 08:40:57 GMT

Mitchell Robinson solid off bench in Knicks’ Game 1 win but hand injury mystery remains

A week of mystery ended with Mitchell Robinson on the court. But no more clarity on how he actually sustained his injury.…

#nba #knicks #sports

TECHCRUNCH

Benchmark raises its first-ever growth fund as part of $2B capital raise

The legendary firm abandons its more-than-20-year tradition of keeping its funds to about $425 million.…

42 views · Thu, 04 Jun 2026 04:55:34 GMT

#venture capital #artificial intelligence #investment

THEHIVERYIQ

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…

33 views · Wed, 03 Jun 2026 17:27:51 GMT

#ai #technology #benchmarking

GITHUB

Show HN: Rscrypto, pure-Rust crypto with industry leading public benches

Rust crypto w/ zero default deps: BLAKE3, Ed25519/X25519, hashes, MACs, KDFs, AEADs, and checksums w/ full SIMD/ASM acceleration - loadingalias/rscrypto…

31 views · Wed, 03 Jun 2026 16:52:50 GMT

#cryptography #rust #software

PHORONIX

AMD EPYC 8635P "Sorano" Benchmarks: Significant Upgrade Opportunity For EPYC 8004 Servers

After announcing the AMD EPYC 8005 'Sorano' series back in February, AMD recently began shipping these Zen 5 successors to the EPYC 8004 'Siena' line-up.…

39 views · Wed, 03 Jun 2026 15:42:13 GMT

#epyc #sorano #benchmarks

DEV.TO (TOP)

Cross Cloud A2A Agent Benchmarking

Building a Benchmarking Agent with A2A and MCP This tutorial aims to build and test...…

34 views · Wed, 03 Jun 2026 15:42:10 GMT

#cloud computing #benchmarking #programming

TOM'S HARDWARE

Trump signs AI executive order seeking 30-day government access to frontier models before release — voluntary framework will include classified benchmark to determine which models qualify

The voluntary framework avoids mandatory licensing but gives the government a say in which firms get early access.…

42 views · Wed, 03 Jun 2026 11:57:04 GMT

#ai #cybersecurity #government

41 views · Wed, 03 Jun 2026 04:51:55 GMT

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…

#machine learning #language models #evaluation

38 views · Wed, 03 Jun 2026 04:11:55 GMT

BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often…

#artificial intelligence #user behavior #decision modeling

37 views · Wed, 03 Jun 2026 04:11:55 GMT

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…

#artificial intelligence #autonomous agents #evaluation

33 views · Wed, 03 Jun 2026 04:11:55 GMT

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents …

#artificial intelligence #desktop agents #human collaboration