For months, the leading AI coding benchmarks have been telling enterprise buyers a reassuring but misleading story: that the top models are all largely the same. OpenAI’s GPT-5 family, Anthropic’s Claude Opus, and Google’s Gemini Pro are clustered within a narrow band on Scale AI’s SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agents will actually perform best in their own codebases.
On Monday, a startup called Datacurve released a benchmark that promises to shatter that illusion. DeepSWE, an evaluation of 113 tasks across 91 open source repositories and five programming languages, produced a dramatically wider spread among the same frontier models, with OpenAI’s GPT-5.5 establishing itself as the clear leader at 70%, a 16-point lead over its closest competitor.
"On public leaderboards, the top models often look relatively close in capability," Datacurve co-author Serena Ge wrote on X. "DeepSWE shows where they actually diverge, reflecting the real-world experience of developers in their daily work."
The benchmark also delivers a sharp critique of the evaluation infrastructure the AI industry relies on to measure progress. Datacurve’s audit found that SWE-Bench Pro’s verifier, the automated scoring mechanism that determines whether an agent solved a task, gave incorrect pass/fail verdicts in roughly one-third of the cases examined.
If the finding holds up, it will have far-reaching implications. Corporate procurement teams, venture capitalists, and AI lab marketing departments all rely heavily on benchmark scores to make multimillion-dollar decisions. A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating with a broken compass.
Why the most popular AI coding benchmarks are graded on a curve
To understand Datacurve’s claims, it helps to understand how coding benchmarks work and how they can go wrong.
The dominant paradigm was pioneered by the SWE-Bench family, maintained by Scale AI and academic researchers, which constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository’s history, rolls the code back to its unfixed state, and then asks an AI agent to reproduce the change. The original commit’s test suite acts as the verifier: if the agent’s patch passes the same tests, the task counts as solved. The approach has an elegant simplicity, but Datacurve claims it introduces three systemic weaknesses.
First, contamination. Tasks are drawn from public GitHub history, so the problem statement, the discussion, and often the exact solution are already present in frontier models’ training data. "The SWE-Bench family collects existing GitHub issues and PRs. This creates two problems: memorization (the model already knows the solution) and triviality (most tasks are small)," Ge wrote.
Second, scope. SWE-Bench Pro tasks require adding a median of 120 lines of code across five files. DeepSWE’s reference solutions add a median of 668 lines across seven files, roughly 5.5 times as much code. Yet DeepSWE’s prompts are actually shorter, averaging 2,158 characters against 4,614 for SWE-Bench Pro. In other words, DeepSWE gives the agent fewer instructions but expects far more output, which more closely mirrors how human developers actually delegate work to an AI assistant.
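The scope comparison above is simple arithmetic, and a short sketch makes the gap concrete. The dictionary layout and the "lines per prompt character" measure are illustrative assumptions, not metrics Datacurve reports; only the four input figures come from the article.

```python
# Figures quoted in the article; the field names are illustrative.
swe_bench_pro = {"lines_added": 120, "prompt_chars": 4614}
deepswe = {"lines_added": 668, "prompt_chars": 2158}


def lines_per_prompt_char(benchmark):
    """Reference-solution lines expected per character of prompt.

    This "density" is a hypothetical measure introduced for illustration.
    """
    return benchmark["lines_added"] / benchmark["prompt_chars"]


code_ratio = deepswe["lines_added"] / swe_bench_pro["lines_added"]
prompt_ratio = deepswe["prompt_chars"] / swe_bench_pro["prompt_chars"]
density_ratio = lines_per_prompt_char(deepswe) / lines_per_prompt_char(swe_bench_pro)

print(f"code expected:   {code_ratio:.2f}x")    # 5.57x
print(f"prompt length:   {prompt_ratio:.2f}x")  # 0.47x
print(f"output per char: {density_ratio:.1f}x") # 11.9x
```

By this rough measure, a DeepSWE task asks for about twelve times as much code per character of instruction, although that figure depends entirely on the assumed definition of density.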
The third and most damaging weakness is verifier reliability. Datacurve randomly sampled 30 tasks from each of DeepSWE and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and deployed an LLM-based judge to independently assess whether each agent patch actually fixed the issue. SWE-Bench Pro’s verifier accepted incorrect implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE’s verifier recorded 0.3% and 1.1%, respectively.
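One way to picture this audit is as a comparison of two verdicts per rollout: the test suite's and the judge's. The sketch below is not Datacurve's code. The `Rollout` type and the toy data are hypothetical, and it assumes both rates are fractions of all reviewed rollouts, which is consistent with 8.5% and 24% summing to the roughly 32% error rate cited earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    verifier_passed: bool  # verdict of the benchmark's test suite
    judged_correct: bool   # verdict of the independent judge


def verifier_error_rates(rollouts):
    """Return false-accept, false-reject, and total disagreement rates.

    Assumption: every rate is a fraction of all reviewed rollouts.
    """
    total = len(rollouts)
    if total == 0:
        raise ValueError("no rollouts to audit")
    false_accepts = sum(r.verifier_passed and not r.judged_correct for r in rollouts)
    false_rejects = sum(r.judged_correct and not r.verifier_passed for r in rollouts)
    return {
        "false_accept": false_accepts / total,
        "false_reject": false_rejects / total,
        "disagreement": (false_accepts + false_rejects) / total,
    }


# Sampling design described in the article: 30 tasks x 3 rollouts x 10 configurations.
rollouts_per_benchmark = 30 * 3 * 10

# Toy data, purely illustrative.
sample = (
    [Rollout(True, True)] * 6
    + [Rollout(True, False)] * 1
    + [Rollout(False, True)] * 2
    + [Rollout(False, False)] * 1
)
rates = verifier_error_rates(sample)
print(rollouts_per_benchmark, rates)
```

If the rates were instead conditioned on the judge's verdict (for example, false rejects as a share of correct patches only), the denominators would change and the two numbers would no longer add up to a single error rate.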
False negatives are especially insidious because they punish creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic (a perfectly valid engineering choice) failed because the test suite tried to import a symbol that existed only in the original author’s specific implementation.
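The failure mode described above can be reproduced in miniature. In this sketch every name (`slugify`, `_normalize`, the two fake "modules") is hypothetical, and attribute lookup on a namespace stands in for a real `import` of a private symbol.

```python
from types import SimpleNamespace


# Gold patch: a public function built on a private helper.
def _gold_normalize(text):
    return text.strip().lower()


def _gold_slugify(text):
    return _gold_normalize(text).replace(" ", "-")


gold = SimpleNamespace(slugify=_gold_slugify, _normalize=_gold_normalize)


# Agent patch: identical behavior with the helper logic inlined.
def _agent_slugify(text):
    return text.strip().lower().replace(" ", "-")


agent = SimpleNamespace(slugify=_agent_slugify)


def implementation_coupled_test(module):
    # Reaches for the private helper, like a test importing `_normalize`.
    helper = module._normalize
    return helper("  A b ") == "a b" and module.slugify(" Hello World ") == "hello-world"


def behavioral_test(module):
    # Checks only the public contract.
    return module.slugify(" Hello World ") == "hello-world"


def run(test, module):
    try:
        return test(module)
    except AttributeError:  # stands in for an ImportError in a real suite
        return False


results = {
    "gold/coupled": run(implementation_coupled_test, gold),
    "agent/coupled": run(implementation_coupled_test, agent),
    "gold/behavioral": run(behavioral_test, gold),
    "agent/behavioral": run(behavioral_test, agent),
}
print(results)
```

Only the coupled test rejects the agent's patch, even though both implementations behave identically through the public function.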
OpenAI’s GPT-5.5 dominates the new benchmark while Claude and Gemini stumble
DeepSWE’s headline results reorder the familiar hierarchy in ways that matter to any engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google trade the lead within a 30-point range. DeepSWE stretches that range to 70 points.
GPT-5.5 leads with 70%, followed by GPT-5.4 with 56% and Claude Opus 4.7 with 54%. The drop from there is steep, with Claude Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%, and GPT-5.4-mini and Kimi K2.6 tied at 24%, followed by a long tail of models scoring in the teens and single digits. Claude Haiku 4.5 scored 39% on SWE-Bench Pro but dropped to zero on DeepSWE. This suggests that some mid-tier models significantly overperform on easier and possibly contaminated benchmarks.
GPT-5.5 not only posts the best score, it gets there efficiently. The model has a median cost of $5.80 per trial, a median duration of 20 minutes, and a median output of 47,000 tokens, alongside its 70% pass rate. GPT-5.4 emerges as perhaps the best overall value at $3.30 per trial with a score of 56%. Claude Opus 4.7, by contrast, has a much higher cost per run. Output tokens, wall-clock duration, and dollar cost per trial all vary by orders of magnitude across the tested agents, but none of them correlates strongly with pass rate: agents that emit more tokens, run longer, or cost more do not consistently solve more tasks.
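A rough way to compare value is cost per solved task: per-trial cost divided by pass rate. This metric is an assumption introduced here, not one the article attributes to Datacurve, and it treats the quoted per-trial cost as representative of every attempt. Only the two models with both figures quoted are included.

```python
# Cost and pass rate as quoted in the article; the dict layout is illustrative.
models = {
    "GPT-5.5": {"cost_per_trial": 5.80, "pass_rate": 0.70},
    "GPT-5.4": {"cost_per_trial": 3.30, "pass_rate": 0.56},
}


def cost_per_solved_task(cost_per_trial, pass_rate):
    """Expected spend per solved task, assuming independent trials
    with the same cost and the same success probability."""
    if pass_rate <= 0:
        raise ValueError("pass rate must be positive")
    return cost_per_trial / pass_rate


value = {
    name: round(cost_per_solved_task(**stats), 2)
    for name, stats in models.items()
}
print(value)  # {'GPT-5.5': 8.29, 'GPT-5.4': 5.89}
```

Under these assumptions GPT-5.4 is cheaper per solved task, which fits its billing as the best overall value, while GPT-5.5 solves more tasks outright.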
Datacurve’s audit found Claude reading the answer key on an existing benchmark
Perhaps the most provocative finding in the DeepSWE analysis concerns what the authors labeled "cheated" verdicts: cases where the agent passes the benchmark by reading the answer rather than solving the problem.
SWE-Bench Pro’s Docker containers ship with the repository’s complete .git history, which means the gold-standard solution’s commits sit inside the container’s file system. Most models ignore them. Claude does not. Datacurve’s analysis shows that both Claude Opus 4.7 and Claude Opus 4.6 registered "cheated" verdicts in over 12% of reviewed SWE-Bench Pro rollouts. In these cases, the Claude agent would run git log --all or git show
GPT-5.4 and GPT-5.5 did not exhibit this behavior at all. Gemini configurations stayed at about 1%. Datacurve describes the behavior diplomatically ("The benchmark allows this (the gold commits remain in the container), but Claude is the one family that consistently does it"), but the implication is clear. A significant portion of Claude’s SWE-Bench Pro score may reflect exploitation of the environment rather than genuine engineering ability.
DeepSWE addresses the problem by shipping only shallow clones that contain the base commit, leaving no gold hashes for the agent to discover. It is worth noting that the behavior could be read as an expression of Claude’s environmental awareness: the model is very good at exploring its surroundings and exploiting available resources. Whether it counts as "cheating" or "cleverness" depends on how you look at it, but either way it undermines the signal in a benchmark designed to measure independent problem solving.
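The article does not say how DeepSWE builds its shallow clones. The sketch below shows one plausible recipe using standard git commands, assembled as argument lists rather than executed. The function names, repository URL, and commit hash are placeholders, and fetching a single commit by hash assumes the hosting server permits it.

```python
import shlex


def shallow_checkout_commands(repo_url, base_sha, dest):
    """Commands that check out only `base_sha`, with no later history."""
    return [
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", repo_url],
        # --depth 1 fetches the named commit without its ancestors.
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", base_sha],
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]


def history_audit_command(dest):
    """Command that counts commits reachable from any ref or HEAD."""
    return ["git", "-C", dest, "rev-list", "--count", "--all"]


# Placeholder values, not taken from any real benchmark task.
commands = shallow_checkout_commands(
    "https://example.com/org/project.git", "0123abc", "workdir"
)
for argv in commands + [history_audit_command("workdir")]:
    print(shlex.join(argv))
```

In a checkout built this way, the audit command would be expected to report a single commit, so there is nothing for git log --all to surface. An agent with network access could still try to fetch more history, so a real harness would presumably also need to restrict that.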
Each AI model family fails in its own distinctive way, and the patterns matter to enterprise teams
Beyond the topline scores, Datacurve’s qualitative trajectory analysis revealed distinct failure signatures for each model family. These findings could help engineering teams choose the right model for a particular type of work.
Claude tends to miss parts of multi-part prompts. On DeepSWE, Claude configurations satisfy the stated requirements less often than any other family. The pattern is consistent: when a prompt enumerates parallel behaviors ("Supports both synchronous and asynchronous," for example), Claude often implements the obvious branch and forgets to mirror the change. Datacurve reports that about two-thirds of Claude’s "MISSED_REQUIREMENT" failures on DeepSWE follow this "one branch shipped" pattern. In one example, Claude Opus 4.7 successfully landed a sync state data hook on one engine class, but the async engine never received the same hook.
In contrast, GPT implements exactly what is asked. GPT-5.5 had the lowest rate of missing specified behavior of all configurations tested. When the same task is run multiple times, GPT trials tend to converge on the same interpretation of the prompt, suggesting that precise instruction following, rather than run-to-run luck, is a stable property of the model.
One of the most interesting findings concerns self-verification. On DeepSWE, Claude Opus 4.7 and GPT-5.4 wrote and ran new tests in the project’s own testing framework (even though no one asked for them) in over 80% of runs. On SWE-Bench Pro, the same models dropped to 28% and 18%, respectively. The reason: SWE-Bench Pro’s prompt template explicitly tells agents, "Do not modify test logic or tests." The agents dutifully complied and suppressed a behavior that would likely have improved performance. This suggests that prompt design in production coding workflows may be inadvertently suppressing valuable agent behavior, something enterprise teams deploying AI coding agents should audit carefully.
What DeepSWE gets right and wrong, and what it means for the future of AI benchmarks
Datacurve is upfront about several limitations. The standardized harness routes all edits through bash rather than the model-specific edit tools each family was trained on (GPT’s apply_patch, Claude’s str_replace_based_edit_tool). That ensures fairness but may hold models below their natural ceiling. The benchmark draws only from open source repositories with more than 500 stars, so the results may not generalize to proprietary codebases. Bug localization and refactoring tasks are underrepresented, and widely used languages such as C++ and Java are absent entirely. Verdicts in the qualitative analysis are assigned by an LLM analyzer rather than human reviewers, and the sample size is modest, with roughly 90 rollouts reviewed per model per benchmark.
It is also worth noting that Datacurve is a startup with its own commercial interests, and any independent benchmark that reshuffles the leaderboard will inevitably invite scrutiny. The company’s decision to publish the complete dataset, all agent trajectories, and the evaluation harness on GitHub greatly alleviates this concern, but independent replication is still needed before the AI community can treat these results as conclusive.
DeepSWE arrives at a tipping point in the AI coding market. Enterprise adoption of AI coding agents is accelerating rapidly, and engineering organizations are making large bets on which models to build around. The benchmark market itself has become a strategic battleground. Scale AI’s SWE-Bench Pro, which Datacurve directly criticizes, is maintained by a company that also provides evaluation services to the labs whose models it ranks.
If DeepSWE’s core findings about verifier reliability and data contamination hold up under independent scrutiny, they could force a reckoning not just with how the industry measures coding agents but with the broader question of what benchmarks are actually for. A leaderboard whose grading is wrong one time in three is not just inaccurate; it is a broken instrument that makes everyone feel good about progress that may not be real. And in an industry spending billions of dollars on the bet that AI agents can do the work of software engineers, the difference between real progress and its appearance is more than academic. It is the whole game.


