Search: "bench performance"

R/LINUX

atomic_queue benchmarks SMT vs no-SMT performance

Wed, 29 Apr 2026 08:00:56 GMT · 3 views

R/CPP

atomic_queue benchmarks SMT vs no-SMT performance

Wed, 29 Apr 2026 08:00:56 GMT · 2 views

ARXIV.ORG

MarketBench: Evaluating AI Agents as Market Participants

Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have infor…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big.

UPDATE: Vulkan benches are now included. And yes, I used AI to help me write this post. As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if an…

Sun, 26 Apr 2026 22:44:09 GMT · 9 views

DEV.TO (TOP)

Performance Test: AWS Graviton4 Reduces EC2 Costs 40% vs. Intel Xeon 5th Gen

In a 12-week production benchmark across 14 workload types, AWS Graviton4-based EC2 instances…

Wed, 29 Apr 2026 04:34:24 GMT · 4 views

FIRETHERING

Xiaomi releases MiMo-v2.5 Family weights with strong coding and agent benchmarks

Peking University gives its computer science students a compiler project every semester. Build a complete SysY compiler in Rust including lexer, parser, abstract syntax tree, IR code generation, assem…

Tue, 28 Apr 2026 12:24:59 GMT · 4 views

ARXIV CS.AI

StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool condi…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning

Most digital language learning curricula rely on discrete-item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can adv…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

ParkingScenes: A Structured Dataset for End-to-End Autonomous Parking in Simulation Scenes

Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is …

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV.ORG

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (…

Wed, 29 Apr 2026 01:52:36 GMT · 9 views

ARXIV.ORG

Does Point Cloud Boost Spatial Reasoning of Large Language Models?

3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the role of point clouds in 3D spatial …

Tue, 28 Apr 2026 14:55:00 GMT · 4 views

ARXIV.ORG

PExA: Parallel Exploration Agent for Complex Text-to-SQL

LLM-based agents for text-to-SQL often struggle with latency-performance trade-off, where performance improvements come at the cost of latency or vice versa. We reformulate text-to-SQL generation with…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ende…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations present in their processing unstructured elec…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning

Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final ans…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proo…

Tue, 28 Apr 2026 04:13:21 GMT · 7 views

ARXIV.ORG

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framew…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system r…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability

This study introduces the Kerimov-Alekberli model, a novel information-geometric framework that redefines AI safety by formally linking non-equilibrium thermodynamics to stochastic control for the eth…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal …

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities r…

Tue, 28 Apr 2026 04:13:21 GMT · 7 views

ARXIV.ORG

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain …

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B

Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice spe…

Mon, 27 Apr 2026 11:28:02 GMT · 9 views

SIMON WILLISON'S WEBLOG

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Big claims from Qwen about their latest open weight model: Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previo…

Sun, 26 Apr 2026 22:44:22 GMT · 10 views
