CAISI Evaluation of DeepSeek V4 Pro
The Center for AI Standards and Innovation (CAISI) evaluated DeepSeek V4 Pro in April 2026, finding its capabilities lag behind the current frontier by approximately 8 months. While DeepSeek V4 is the most capable Chinese model assessed by CAISI, it underperformed relative to U.S. models like GPT-5.5 and Opus 4.6 in independent testing. Despite this, DeepSeek V4 demonstrated superior cost efficiency compared to similarly capable U.S. models across several benchmarks.
- DeepSeek V4 Pro is the most capable PRC-developed AI model evaluated by CAISI to date.
- CAISI's evaluations show DeepSeek V4's performance aligns with GPT-5, released about 8 months prior, rather than more recent models like GPT-5.4 or Opus 4.6.
- DeepSeek V4 was more cost efficient than GPT-5.4 mini on 5 out of 7 benchmarks, with cost differences ranging from 53% less expensive to 41% more expensive.
- CAISI used 16 benchmarks across 35 models, including non-public evaluations like PortBench and ARC-AGI-2 semi-private, to assess model capabilities.
- Performance was measured using an Item Response Theory-inspired methodology, with trend lines fitted via least squares regression on frontier models.
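The report's scale, described in the excerpt below, sets every 200-point gap equal to a 3x change in the odds of solving a task. A minimal sketch of how such an IRT-inspired (Rasch-style) scale maps score differences to solve probabilities — the function name and score values here are illustrative, not from the report:

```python
import math

# Per the report's scale: a 200-point difference corresponds to 3x odds,
# so one point equals ln(3)/200 on the log-odds (logit) scale.
LN3_PER_200 = math.log(3) / 200

def solve_probability(model_score: float, task_difficulty: float) -> float:
    """Rasch-style (1-parameter IRT) probability that a model solves a task,
    with both quantities on the 200-points-per-3x-odds scale."""
    logit = (model_score - task_difficulty) * LN3_PER_200
    return 1.0 / (1.0 + math.exp(-logit))

# A model 200 points above a task's difficulty has 3:1 odds, i.e. p = 0.75.
p_ahead = solve_probability(1200, 1000)   # -> 0.75
p_even = solve_probability(1000, 1000)    # -> 0.5
```

This mirrors how Elo-like scales work: the units are arbitrary, and only the ratio of points to log-odds (here, 200 points per factor of 3) fixes the interpretation.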
Opening excerpt (first ~120 words):
In April 2026, the Center for AI Standards and Innovation (CAISI) evaluated the open-weight AI model DeepSeek V4 Pro (“DeepSeek V4”). CAISI evaluations indicate that DeepSeek V4’s capabilities lag behind the frontier by about 8 months (Figure 1).

Figure 1: Comparison of aggregate capabilities over time of the most capable publicly released U.S. and PRC models according to a suite of benchmarks covering five domains. Every 200-point increase on the y-axis equates to a 3x increase in the odds of solving a given task. Model capability was fitted using an approach inspired by Item Response Theory (IRT), as detailed in the Appendix. 16 benchmarks across 35 models were used to produce this figure. Trend lines were fit with least squares regression on frontier models. Error bars denote 95% CIs.
…
Excerpt limited to ~120 words for fair-use compliance.