
CAISI Evaluation of DeepSeek V4 Pro

TL;DR

The Center for AI Standards and Innovation (CAISI) evaluated DeepSeek V4 Pro in April 2026, finding its capabilities lag behind the current frontier by approximately 8 months. While DeepSeek V4 is the most capable Chinese model assessed by CAISI, it underperformed relative to U.S. models like GPT-5.5 and Opus 4.6 in independent testing. Despite this, DeepSeek V4 demonstrated superior cost efficiency compared to similarly capable U.S. models across several benchmarks.

Original article: Hacker News: Newest

Opening excerpt (first ~120 words)

In April 2026, the Center for AI Standards and Innovation (CAISI) evaluated the open-weight AI model DeepSeek V4 Pro (“DeepSeek V4”). CAISI evaluations indicate that DeepSeek V4’s capabilities lag behind the frontier by about 8 months (Figure 1).

Figure 1: Comparison of aggregate capabilities over time of the most capable publicly released U.S. and PRC models according to a suite of benchmarks covering five domains. Every 200-point increase on the y-axis equates to a 3x increase in the odds of solving a given task. Model capability was fitted using an approach inspired by Item Response Theory (IRT), as detailed in the Appendix. 16 benchmarks across 35 models were used to produce this figure. Trend lines were fit with least squares regression on frontier models. Error bars denote 95% CIs.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Hacker News: Newest.
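The excerpt describes an IRT-inspired scale on which every 200-point increase triples the odds of solving a task. As a rough illustration of what that scale implies (the function names and the logistic link are assumptions for this sketch; only the 3x-per-200-points relationship comes from the article), the score gap between a model and a task can be converted into a solve probability:

```python
def odds_multiplier(delta_points: float) -> float:
    """Odds ratio implied by a score gap on the article's scale:
    every 200-point increase triples the odds of solving a task."""
    return 3.0 ** (delta_points / 200.0)

def solve_probability(model_score: float, task_difficulty: float) -> float:
    """Probability of solving a task whose difficulty equals the score
    at which the odds are even (a simple IRT-style logistic link)."""
    odds = odds_multiplier(model_score - task_difficulty)
    return odds / (1.0 + odds)

# A 200-point advantage turns 1:1 odds into 3:1 odds, i.e. P = 0.75
print(odds_multiplier(200.0))            # 3.0
print(solve_probability(1200.0, 1000.0)) # 0.75
```

Under this reading, an 8-month capability lag matters only insofar as it maps to a point gap on the y-axis; the sketch shows how such a gap would translate into per-task solve odds.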
