Search: "ai accuracy" — WeSearch Press

DEV.TO (TOP)

I built a CLI that hashes your ML accuracy claims before the experiment runs

Last month, a...…

Wed, 29 Apr 2026 08:00:56 GMT · 5 views

HACKER NEWS (AI / LLM)

Perplexity Builds Accuracy into Frontier AI

Wed, 29 Apr 2026 05:09:24 GMT · 4 views

MIT TECHNOLOGY REVIEW

Rebuilding the Data Stack for AI

Enterprise AI hinges on high-accuracy outputs, requiring better data context, unified architectures, and rigorous measurement frameworks, says Bavesh Patel, senior vice president at Databricks, and Ra…

Wed, 29 Apr 2026 07:14:26 GMT · 4 views

ARXIV CS.AI

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questio…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV.ORG

CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on lon…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance

Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language mode…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations present in their processing unstructured elec…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Modeling Induced Pleasure through Cognitive Appraisal Prediction via Multimodal Fusion

Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpreta…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment

In recent years, the integration of multimodal machine learning in wellbeing assessment has offered transformative potential for monitoring mental health. However, with the rapid advancement of Vision…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

ZenBrain: A Neuroscience-Inspired 7-Layer Memory Architecture for Autonomous AI Systems

Despite a century of empirical memory research, existing AI agent memory systems rely on system-engineering metaphors (virtual-memory paging, flat LLM storage, Zettelkasten notes), none integrating pr…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Agentic AI platforms for autonomous training and rule induction of human-human and virus-human protein-protein interactions

We instruct an AI agent to construct two separate agentic AI platforms: one for autonomous training of predictive ML models for human-human and virus-human PPI, and the other for inducing explicit gen…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness

Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to cha…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the prese…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV CS.AI

ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under …

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

Reliable decision support in nuclear engineering requires traceable, domain-grounded knowledge retrieval, yet safety and risk analysis workflows remain hampered by fragmented documentation and halluci…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

Your Reviews Replicate You: LLM-Based Agents as Customer Digital Twins for Conjoint Analysis

Conjoint analysis is a cornerstone of market research for estimating consumer preferences; however, traditional methods face persistent challenges regarding time, cost, and respondent fatigue. To addr…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

RedParrot: Accelerating NL-to-DSL for Business Analytics via Query Semantic Caching

Recently, at Xiaohongshu, the rapid expansion of e-commerce and advertising demands real-time business analytics with high accuracy and low latency. To meet this demand, systems typically rely on conv…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning

Enabling large language models (LLMs) to appropriately abstain from answering questions beyond their knowledge is crucial for mitigating hallucinations. While existing reinforcement learning methods f…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

Parameter-Efficient Fine-Tuning (PEFT) has become the standard for adapting large language models (LLMs). In this work we challenge the widespread assumption that parameter efficiency equates memory …

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

See No Evil: Semantic Context-Aware Privacy Risk Detection for AR

Augmented reality (AR) systems pose unique privacy risks due to their continuous capture of visual data. Existing AR privacy frameworks lack semantic understanding of visual content, limiting their ef…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

Object level hallucination remains a central reliability challenge for vision language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accu…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

WeatherSeg: Weather-Robust Image Segmentation using Teacher-Student Dual Learning and Classifier-Updating Attention

WeatherSeg, an advanced semi-supervised segmentation framework, addresses autonomous driving's environmental perception challenges in adverse weather while reducing annotation costs. This framework in…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review

Scientific research relies on accurate information retrieval from literature to support analytical decisions. In this work, we introduce a new task, INformation reTRieval through literAture reVIEW (In…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

AMERICAN AFFAIRS JOURNAL

Understanding the LLM Bubble

If there is no path to superintelligence by 2028, and there is little prospect of the dramatic product improvements needed to drive major short-term revenue growth (including solutions to inaccuracy a…

Tue, 28 Apr 2026 17:04:13 GMT · 9 views

ARTIFICIAL INTELLIGENCE (AI)

open models keep catching up and the frontier keeps moving. at some point one of those has to stop

a year ago there was a clear tier gap. now i'm less sure, but not in the way i expected. the tasks where open-weight models have genuinely caught up are real: coding assistance, summarization, instruc…

Tue, 28 Apr 2026 07:52:12 GMT · 5 views

ARXIV.ORG

Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis

Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instabili…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empir…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI-assis…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models

Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the m…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines

Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views
