WeSearch

Results for "swe bench".

12 stories match your query across our 700+ source catalog. Ranked by relevance and recency.


OPENAI

SWE-bench Verified no longer measures frontier coding capabilities

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro…

· 9 views
REDDIT

Confirmed: SWE Bench is now a benchmaxxed benchmark

· 9 views
LOCALLLAMA

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max

Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already publi…

· 2 views
ARXIV.ORG

IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance

Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language mode…

· 3 views
ARXIV.ORG

MarketBench: Evaluating AI Agents as Market Participants

Markets are a promising way to coordinate AI agent activity, for many of the same reasons that justify markets more broadly. In order to effectively participate in markets, agents need to have infor…

· 3 views
ARXIV.ORG

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the prese…

· 4 views
LOCALLLAMA

We benchmarked gpt-oss-120b across 6 inference providers and found a 10x throughput spread

We ran a benchmark across 10+ LLM routers, providers, and inference backends to answer the questions that come up every time someone picks a provider. Key findings: Do LLM routers add latency? No, Ope…

· 5 views
ARXIV.ORG

Does Point Cloud Boost Spatial Reasoning of Large Language Models?

3D Large Language Models (LLMs) that leverage spatial information in point clouds for 3D spatial reasoning have attracted great attention. Despite some promising results, the role of point clouds in 3D spatial …

· 3 views
ARXIV.ORG

CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on lon…

· 3 views
ARXIV.ORG

Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning

Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final ans…

· 3 views
ARXIV.ORG

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framew…

· 3 views
ARXIV.ORG

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain …

· 3 views