26 results for "llm performance"
Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking
Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We…
LLMs Corrupt Your Documents When You Delegate
Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation t…
Don't Make the LLM Read the Graph: Make the Graph Think
We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanab…
Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis
Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instabili…
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear …
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …
Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs
Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (…
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain …
Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip).
TL;DR: Remembered FPGA PCI boards being a big thing from my crypto days. Wondered if an AMD Alveo V80 FPGA card could be used to approximate the performance of a Taalas HC1 (LLM-on-a-chip). Ran the idea…
HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents
Long-term memory is a critical challenge for Large Language Model agents, as fixed context windows cannot preserve coherence across extended interactions. Existing memory systems represent conversatio…
Is there a way to mitigate performance loss as context grows?
In my local LLM setup I get 30 to 80 t/s generation at the beginning, but it drops quite a lot as context grows. I use llama.cpp/Vulkan with an MI50 and a V100; are there some command-line flags t…
Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning
Most digital language learning curricula rely on discrete-item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can adv…
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (…
Does Point Cloud Boost Spatial Reasoning of Large Language Models?
3D Large Language Models (LLMs) that leverage spatial information in point clouds for 3D spatial reasoning have attracted great attention. Despite some promising results, the role of point clouds in 3D spatial …
An Intelligent Fault Diagnosis Method for General Aviation Aircraft Based on Multi-Fidelity Digital Twin and FMEA Knowledge Enhancement
Fault diagnosis of general aviation aircraft faces challenges including scarce real fault data, diverse fault types, and weak fault signatures. This paper proposes an intelligent fault diagnosis frame…
PExA: Parallel Exploration Agent for Complex Text-to-SQL
LLM-based agents for text-to-SQL often struggle with the latency-performance trade-off, where performance improvements come at the cost of latency or vice versa. We reformulate text-to-SQL generation with…
Discovering Agentic Safety Specifications from 1-Bit Danger Signals
Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iterative…
Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate
The application of large language models (LLMs) in clinical decision support faces the significant challenges of "tunnel vision" and diagnostic hallucinations in their processing of unstructured elec…
Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning
Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final ans…
MarketBench: Evaluating AI Agents as Market Participants
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have infor…
QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proo…
Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer
Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong …
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system r…
Ubuntu 26.04 vs 24.04 speed improvements for inference?
I'm curious if any brave soul has upgraded their computer (especially if it's Strix Halo) from Ubuntu 24.04 -> 26.04 and seen a significant performance improvement for inference with VLLM, llama-serve…
To 16GB VRAM users, plug in your old GPU
For those who want to run the latest dense ~30B models and only have 16GB of VRAM: if you have an old card with 6GB of VRAM or more, plug it in. It matters that everything fits in VRAM, even across 2 cards. Even…
Eden AI – European Alternative to OpenRouter
Access 500+ LLMs and expert AI models through one unified API. Route requests by cost, performance, and region with built-in smart routing and fallbacks.…