WeSearch
TAG · #BENCHMARKING

Benchmarking coverage.

Every story in the WeSearch catalog tagged with #benchmarking, listed chronologically with view counts. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

Nine stories are tagged with #benchmarking, shown in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.

RELATED TAGS
#swe-bench (1) · #ai-benchmarking (1) · #code-generation (1) · #model-evaluation (1) · #contamination (1) · #amd-mi300x (1) · #nvidia-h100 (1) · #gpu-benchmarking (1) · #ai-training (1) · #rocm-vs-cuda (1) · #ct-report-generation (1) · #factual-consistency (1)

DETAIL

Benchmarking a Bug Scanner

We ran a tournament pitting Detail's findings against thousands of comments from code review bots.…

4 views · #bug scanner · #code review · #software quality

NEURALNOISE.COM

Benchmarking Local LLM/Harness Combinations

I’ve been running a small benchmark, harness-bench, that pairs local LLMs (served via llama.cpp’s llama-server) with agent harnesses (Aider, Claude Code, Ope...…

7 views

GOOGLE NEWS

KROMATID to Present Breakthrough Genomic Integrity Benchmarking at ASGCT 2026, Powering the World's First Genomic Intelligence Platform - Morningstar

8 views

DEV.TO (TOP)

I corrected my own benchmark claim from 91.5% to 88%. Here's what changed.

A week after shipping a flattering tokens-saved number for my AI context tool, I noticed it was apples-to-oranges. Here's the workload-matched redo, the smaller honest number, and …

4 views · #ai · #opensource

ARXIV CS.AI

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

Object level hallucination remains a central reliability challenge for vision language models (VLMs), particularly in binary object existence verification. Existing benchmarks emph…

6 views · #computer vision · #vision-language models · #object hallucination

APPLIEDCOMPUTE

Benchmarking Inference Engines on Agentic Workloads

5 views · #inference engines · #agentic workloads

ARXIV.ORG

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of fin…

8 views · #ct report generation · #factual consistency

SEMIANALYSIS

Why isn't AMD's MI300X competitive?

Training Performance, User Experience, Usability, Nvidia, AMD, GEMM, Attention, Networking, InfiniBand, Spectrum-X Ethernet, RoCEv2 Ethernet, SHARP, Total Cost of Ownership…

5 views · #amd mi300x · #nvidia h100 · #gpu benchmarking

OPENAI

SWE-bench Verified no longer measures frontier coding capabilities

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.…

12 views · #swe-bench · #ai benchmarking · #code generation