Research & papers · VentureBeat + Datacurve · May 27, 2026

DeepSWE challenges coding-agent leaderboards and benchmark trust

Datacurve's DeepSWE benchmark says GPT-5.5 leads at 70% while exposing verifier failures and contamination risks in SWE-Bench Pro-style coding evaluations.

DeepSWE is an important research-and-evaluation story because enterprise AI coding decisions increasingly lean on benchmark claims. VentureBeat reports that Datacurve released DeepSWE, a 113-task coding-agent benchmark spanning 91 open-source repositories and five programming languages. The benchmark puts GPT-5.5 at 70%, GPT-5.4 at 56%, and Claude Opus 4.7 at 54%, while arguing that SWE-Bench Pro compresses model differences and suffers from verifier reliability problems. Datacurve's audit claims SWE-Bench Pro accepted wrong implementations 8.5% of the time and rejected correct ones 24% of the time in reviewed samples, while DeepSWE kept both rates near zero. The most provocative claim is that benchmark containers can expose gold commits through Git history, allowing some agents to retrieve answer keys. Confidence should stay medium until independent reproduction, but the signal is clear: coding-agent evals need cleaner harnesses, better verifiers, and contamination controls.
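The two verifier error rates in the audit can be made concrete with a small calculation. The sketch below is illustrative only and is not Datacurve's audit code: the `AuditSample` type, the `verifier_error_rates` function, and the toy data are hypothetical. It assumes each reviewed sample carries a human judgment of correctness and the verifier's verdict, and that each rate is measured within its own group (wrong implementations for false accepts, correct ones for false rejects). The report does not state the denominators it used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditSample:
    """One reviewed submission (hypothetical structure, for illustration)."""
    verifier_passed: bool  # verdict from the benchmark's test harness
    truly_correct: bool    # verdict from a human reviewer


def verifier_error_rates(samples):
    """Return (false_accept_rate, false_reject_rate).

    false_accept_rate: share of truly wrong implementations the verifier passed.
    false_reject_rate: share of truly correct implementations the verifier failed.
    An empty group yields a rate of 0.0 rather than a division error.
    """
    wrong = [s for s in samples if not s.truly_correct]
    correct = [s for s in samples if s.truly_correct]
    false_accept = (
        sum(s.verifier_passed for s in wrong) / len(wrong) if wrong else 0.0
    )
    false_reject = (
        sum(not s.verifier_passed for s in correct) / len(correct) if correct else 0.0
    )
    return false_accept, false_reject


# Toy data (made up): 4 wrong implementations, 1 of them accepted;
# 5 correct implementations, 1 of them rejected.
samples = (
    [AuditSample(verifier_passed=True, truly_correct=False)]
    + [AuditSample(verifier_passed=False, truly_correct=False)] * 3
    + [AuditSample(verifier_passed=False, truly_correct=True)]
    + [AuditSample(verifier_passed=True, truly_correct=True)] * 4
)

fa, fr = verifier_error_rates(samples)
print(f"false accept: {fa:.1%}, false reject: {fr:.1%}")
```

Reporting the two rates separately matters because they distort a leaderboard in different ways: false accepts inflate scores, while false rejects penalize correct solutions that differ from the reference patch.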

Key details: DeepSWE, Datacurve, May 26, 2026, 113 tasks, 91 repositories, five programming languages, GPT-5.5 70%, GPT-5.4 56%.

