19 results for "model evaluation"
An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness
Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to cha…
A systematic evaluation of vision-language models for observational astronomical reasoning tasks
Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities r…
A Systematic Approach for Large Language Models Debugging
Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains…
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empir…
Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic++ Graph
Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels -- transactions or actor addresses -- yet compliance action is conducted …
MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation
The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automate…
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static in…
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system r…
Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop
Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable…
Evaluating whether AI models would sabotage AI safety research
We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluati…
Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch
I’ve been working on an educational implementation repo for speculative decoding: The goal is not to wrap existing libraries, but to implement several speculative decoding methods from scratch behind …
Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs
Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (…
Towards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction
We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on h…
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear …
LEGO: An LLM Skill-Based Front-End Design Generation Platform
Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a un…
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is groun…
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such…