Beyond contamination, DeepSWE takes aim at task complexity. SWE-bench Pro tasks average 120 lines of code to resolve. By comparison, DeepSWE tasks use shorter prompts (roughly half the length) but require solutions averaging 5.5 times more code and approximately twice as many output tokens. The benchmark spans 91 repositories across five programming languages. Its verifiers are hand-written to test software behavior rather than to check for specific implementation patterns, which reduces how often correct solutions are wrongly penalized.
The gap between DeepSWE scores and public benchmark scores is striking. On SWE-bench Pro, frontier models tend to cluster closely together. On DeepSWE, the spread is much wider. GPT-5.5 leads the current leaderboard at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. Claude Sonnet 4.6 scores 32% and Gemini 3.5 Flash 28%, while GPT-5.4-mini and Kimi-k2.6 each score 24%. Further down, DeepSeek-v4-pro scores just 8% and Gemini 3 Flash 5%. These gaps are considerably larger than those the same models show on existing public benchmarks.
DataCurve also audited SWE-bench Pro's verifier and found that it misgrades agent outputs, with 8% false positives and 24% false negatives. In other words, roughly one in four correct solutions is marked as wrong. Those error rates, the company argues, make it difficult to trust comparisons between closely ranked models on that benchmark.
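To see why verifier error of that size blurs close rankings, consider a rough sketch of the arithmetic. It rests on assumptions that are not part of DataCurve's audit: that the 8% figure is the share of incorrect solutions marked as passing, that the 24% figure is the share of correct solutions marked as failing, and that both rates apply equally to every model. The function names and the two example pass rates are hypothetical.

```python
# Illustrative sketch only: how verifier error could distort reported scores.
# Assumptions (not from the audit itself): fpr is the share of incorrect
# solutions graded as passing, fnr is the share of correct solutions graded
# as failing, and both rates are the same for every model.

def observed_pass_rate(true_rate: float, fpr: float = 0.08, fnr: float = 0.24) -> float:
    """Expected graded pass rate given a model's true pass rate."""
    correct_and_passed = true_rate * (1.0 - fnr)      # correct work that survives grading
    incorrect_but_passed = (1.0 - true_rate) * fpr    # wrong work that slips through
    return correct_and_passed + incorrect_but_passed


def observed_gap(true_a: float, true_b: float, fpr: float = 0.08, fnr: float = 0.24) -> float:
    """Expected graded gap between two models with the given true pass rates."""
    return observed_pass_rate(true_a, fpr, fnr) - observed_pass_rate(true_b, fpr, fnr)


if __name__ == "__main__":
    # Hypothetical models whose true pass rates differ by three points.
    print(observed_pass_rate(0.50))
    print(observed_pass_rate(0.47))
    print(observed_gap(0.50, 0.47))
```

Under these assumptions, any true gap between two models shrinks by a factor of 1 - 0.08 - 0.24 = 0.68 before sampling noise is even considered, so a narrow lead on the leaderboard becomes harder to distinguish from grading error.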
The release adds a new reference point to a growing debate in the AI research community over whether published benchmark scores reflect real-world agent capability or have become an optimization target that labs are quietly tuning toward. DeepSWE's contamination-free design and hand-written verifiers are a direct attempt to make that kind of gaming harder.
This analysis is based on reporting from DataCurve.