WeSearch
SEARCH · EVALUATION METRIC

Results for "evaluation metric".

14 stories match your query across our 700+ source catalog. Ranked by relevance and recency.


ARXIV CS.AI

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expre…

· 4 views
ARXIV.ORG

Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic++ Graph

Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels -- transactions or actor addresses -- yet compliance action is conducted …

· 5 views
ARXIV.ORG

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the prese…

· 6 views
ARXIV.ORG

Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable…

· 5 views
ARXIV.ORG

Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (…

· 6 views
ARXIV.ORG

An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system r…

· 4 views
ARXIV CS.AI

Applied AI-Enhanced RF Interference Rejection

AI-enhanced interference rejection in radio frequency (RF) transmissions has recently attracted interest because deep learning approaches trained on both the signal of interest (SOI) and the signal mi…

· 4 views
ARXIV CS.AI

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

Object-level hallucination remains a central reliability challenge for vision-language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accu…

· 4 views
ARXIV CS.AI

Structure Guided Retrieval-Augmented Generation for Factual Queries

Retrieval-Augmented Generation (RAG) has been proposed to mitigate hallucinations in large language models (LLMs), where generated outputs may be factually incorrect. However, existing RAG approaches …

· 4 views
ARXIV.ORG

The Controllability Trap: A Governance Framework for Military AI Agents

Agentic AI systems - capable of goal interpretation, world modeling, planning, tool use, long-horizon operation, and autonomous coordination - introduce distinct control failures not addressed by exis…

· 4 views
ARXIV.ORG

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is groun…

· 4 views
ARXIV.ORG

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automate…

· 4 views
ARXIV.ORG

FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct …

· 4 views
ARXIV.ORG

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such…

· 4 views