Search: "ai evaluation" — WeSearch Press

ARXIV CS.AI

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per sc…

Wed, 29 Apr 2026 04:04:25 GMT · 5 views

ARXIV.ORG

Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static in…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the prese…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness

Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to cha…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV CS.AI

ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under …

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool condi…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expre…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

Applied AI-Enhanced RF Interference Rejection

AI-enhanced interference rejection in radio frequency (RF) transmissions has recently attracted interest because deep learning approaches trained on both the signal of interest (SOI) and the signal mi…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging

Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model mergin…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV.ORG

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (…

Wed, 29 Apr 2026 01:52:36 GMT · 9 views

ARXIV.ORG

The Controllability Trap: A Governance Framework for Military AI Agents

Agentic AI systems - capable of goal interpretation, world modeling, planning, tool use, long-horizon operation, and autonomous coordination - introduce distinct control failures not addressed by exis…

Tue, 28 Apr 2026 21:33:22 GMT · 5 views

ARXIV.ORG

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empir…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

AI Identity: Standards, Gaps, and Research Directions for AI Agents

AI agents are now running real transactions, workflows, and sub-agent chains across organizational boundaries without continuous human supervision. This creates a problem no current infrastructure is …

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic++ Graph

Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels -- transactions or actor addresses -- yet compliance action is conducted …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automate…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct …

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framew…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Evaluating whether AI models would sabotage AI safety research

We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluati…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents

Autonomous AI agents can remain fully authorized and still become unsafe as behavior drifts, adversaries adapt, and decision patterns shift without any code change. We propose the Informationa…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

FORTUNE

Bloomberg, the OG of financial data firms, has a potent new AI agent. How it built it holds lessons for other companies

Bloomberg's CTO Shawn Edwards says data, evaluations, and cost discipline were all key to making "AskB" work.…

Tue, 28 Apr 2026 21:21:24 GMT · 12 views

ARXIV.ORG

Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities r…

Tue, 28 Apr 2026 04:13:21 GMT · 7 views

ARXIV.ORG

OpenGame: Open Agentic Coding for Games

Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across ma…

Wed, 29 Apr 2026 05:34:25 GMT · 6 views

ARXIV CS.AI

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is signif…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

Object level hallucination remains a central reliability challenge for vision language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accu…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

WeatherSeg: Weather-Robust Image Segmentation using Teacher-Student Dual Learning and Classifier-Updating Attention

WeatherSeg, an advanced semi-supervised segmentation framework, addresses autonomous driving's environmental perception challenges in adverse weather while reducing annotation costs. This framework in…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views
