19 results for "model evaluation"
An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness
Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to cha…
A systematic evaluation of vision-language models for observational astronomical reasoning tasks
Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities r…
A Systematic Approach for Large Language Models Debugging
Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains…
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empir…
Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic++ Graph
Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels -- transactions or actor addresses -- yet compliance action is conducted …
MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation
The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automate…
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static in…
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system r…
Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop
Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable…
Evaluating whether AI models would sabotage AI safety research
We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluati…
Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch
I’ve been working on an educational implementation repo for speculative decoding: The goal is not to wrap existing libraries, but to implement several speculative decoding methods from scratch behind …
Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs
Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (…
Towards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction
We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on h…
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear …
LEGO: An LLM Skill-Based Front-End Design Generation Platform
Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a un…
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is groun…
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such…