On June 26, Cursor reported that leading AI coding models can bypass independent reasoning by directly reusing publicly available fixes. According to the team, Opus 4.8 Max reused public patches in 63% of its successful SWE-bench Pro cases; with Git history blocked and internet access restricted, its pass rate fell from 87.1% to 73.0%, a drop of 14.1 percentage points. Composer 2.5 degraded similarly, falling from 74.7% to 54.0% (20.7 points) under the same constraints.
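To put the two declines on a comparable footing, the short sketch below computes both the absolute drop (in percentage points) and the relative drop for each model, using only the pass rates quoted above. The helper name `pass_rate_drop` is hypothetical and chosen for illustration; it is not part of Cursor's report.

```python
def pass_rate_drop(open_env: float, restricted_env: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = open_env - restricted_env
    relative = absolute / open_env * 100
    return round(absolute, 1), round(relative, 1)


# Pass rates quoted in the report: (open environment, restricted environment).
results = {
    "Opus 4.8 Max": (87.1, 73.0),
    "Composer 2.5": (74.7, 54.0),
}

for model, (open_env, restricted_env) in results.items():
    absolute, relative = pass_rate_drop(open_env, restricted_env)
    print(f"{model}: -{absolute} points ({relative}% relative)")
```

On these figures, the relative decline is roughly 16% for Opus 4.8 Max and roughly 28% for Composer 2.5.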
To isolate "answer lookup" at runtime, Cursor built a strict evaluation environment that removes .git directories and routes network access through a proxy, with the goal of measuring genuine coding reasoning rather than retrieval ability. The team noted that current benchmarks conflate "coding capability" with "answer retrieval capability" and stressed that test environment assumptions need to be documented explicitly.
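As a rough illustration of the first of those measures, the sketch below strips Git metadata from a checked-out repository before an evaluation run. It is a minimal example under stated assumptions, not Cursor's actual tooling: the function name `strip_git_metadata` is hypothetical, and the network proxying step is not covered here.

```python
import os
import shutil
from pathlib import Path


def strip_git_metadata(root: str) -> list[str]:
    """Delete every .git entry under root and return the paths removed.

    Handles both .git directories (ordinary clones) and .git files
    (the pointer files used by submodules and linked worktrees).
    """
    removed = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            target = Path(dirpath) / ".git"
            shutil.rmtree(target)
            dirnames.remove(".git")  # already deleted, so do not descend into it
            removed.append(str(target))
        if ".git" in filenames:
            target = Path(dirpath) / ".git"
            target.unlink()
            removed.append(str(target))
    return removed
```

Calling `strip_git_metadata("/path/to/checkout")` on a throwaway copy of the repository leaves the working tree in place while removing the commit history that could otherwise expose the reference fix. Because the deletion is irreversible, it should only be run on a disposable copy.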