60 stories tagged with #language-model, in publish-time order across the WeSearch catalog. Tag pages update as new stories ingest.
⌘ RSS feed for this tag → or search "Language Model"
GPU Forecasters: Language Models as Selective Surrogates for Kernel Optimization
GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. Whil…
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…
Visual Graph Scaffolds for Structural Reasoning in Large Language Models
Graphs have been used to enhance large language models (LLMs) for structured reasoning, mostly as external knowledge sources are provided to models at test time. In this paper, we …
ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning
Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively model structured longitudinal electronic…
SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale
As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specializ…
The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs
Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. …
Decomposing how prompting steers behavior
Prompting steers large language models (LLMs) and vision-language models (VLMs) without weight updates, but it remains unclear how instruction changes reshape internal representati…
Uncertainty-Aware Clarification in LLM Agents with Information Gain
Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this c…
ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchma…
From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may o…
Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs
Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frame…
Why Are Large Language Models So Terrible at Video Games?
LLMs can code your retro shooter but still fail at playing Halo; see what this gap reveals about AI’s real limits in 2026…
Parallax: Parameterized Local Linear Attention for Language Modeling
Scaling Laws for Agent Harnesses via Effective Feedback Compute
Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and …
Heuristic Parasites: A Behavioral Taxonomy of Recurrent Distortion Patterns in Large Language Models (Full System) V2
AI Propaganda factories with language models
AI-powered influence operations can now be executed end-to-end on commodity hardware. We show that small language models produce coherent, persona-driven political messaging and ca…
✨📊 🧠 The Ultimate Visual Guide to Large Language Models (LLMs)
Generative AI is a type of artificial intelligence that can produce new content including text,...…
📄Paper: RORA-VLM: Robust Retrieval Augmentation for Vision Language Models
Public At International Conference on Learning Representations (ICLR) 2025 💡 Why I read...…
LLMs believe false statements even after explicit warnings that they're false
Fine-tuning tests show "bias ... toward confidently representing the claims as true."…
Why does AI love writing about lighthouse keepers?
Asked to 'write a story', ChatGPT and other leading language models appear to be avoiding copyright infringement by obsessive recourse to the same small and strange cast of lightho…
How sure is the activation oracle?
Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty…
Can LLMs Introspect? A Reality Check
Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from huma…
Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requir…
OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, wh…
Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like diffe…
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The…
MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. Howe…
AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents
The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two indepe…
Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation
Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. Howev…
The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context
Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually govern…
What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has larg…
Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation
Multi-stakeholder tasks require one output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation and utility aggregation, yielding unstable…
Generating Robust Portfolios of Optimization Models using Large Language Models
Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to rea…
Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation
Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Dis…
Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness whe…
Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, ma…
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining…
PitchBench: Measuring Pitch Hearing in Audio-Language Models
Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation …
Microsoft Research: LLMs Corrupt your files during delegated work
Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust…
Sparse Autoencoders Reveal Cortical Brain-LLM Semantic Mapping
A preprint submitted to arXiv (arXiv:2605.23035) by Dongxin Guo and colleagues presents a mechanistic interpretability approach connecting large language model representations to h…
Prompt Politeness Affects LLM Accuracy
The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored. In t…
You don't need all the LLM benchmarks
In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models
We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Hist…
Confidence Calibration in Large Language Models
We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, lik…
How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning
Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces r…
Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform
Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state t…
LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition
The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, m…
Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models
Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation…
HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models
Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but a…
Inference Time Context Sparsity: Illusion or Opportunity?
Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interacti…
Distilling Game Code World Model Generation into Lightweight Large Language Models
Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI…
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving rea…
Hypothesis Generation and Inductive Inference in Children and Language Models
Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which comp…
PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models
Efficiently updating Large Language Models (LLMs) with new or evolving factual knowledge remains a central challenge, as even parameter-efficient adaptation can erode previously ac…
Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models
Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown…
Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models
Backtesting large language models (LLMs) on historical financial data is unreliable because pre-training cuts off after the events happened. An LLM trained in 2024 already "knows" …
Learning to Reason Efficiently with A* Post-Training
Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language infer…
When Mean CE Fails: Median CE Can Better Track Language Model Quality
Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in…
Emotional intelligence in large language models is fragmented across perception, cognition, and interaction
As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical fro…
Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models
Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs)…