GPT-5.5 Returns to Cutting Edge in Coding, But OpenAI Switches Benchmarks After Losing to Opus 4.7

Gate News message, April 27: SemiAnalysis, a semiconductor and AI analysis firm, released a comparative benchmark of coding assistants including GPT-5.5, Claude Opus 4.7, and DeepSeek V4. The key finding: GPT-5.5 puts OpenAI back at the cutting edge of coding models for the first time in six months, and SemiAnalysis engineers now alternate between Codex and Claude Code after previously relying almost exclusively on Claude. GPT-5.5 is based on a new pretraining approach codenamed “Spud” and represents OpenAI’s first expansion of pretraining scale since GPT-4.5.

In practical testing, a clear division of labor emerged. Claude handles new project planning and initial setup, while Codex excels at reasoning-intensive bug fixes. Codex demonstrates stronger data structure comprehension and logical reasoning but struggles with inferring ambiguous user intent. On a single dashboard task, Claude automatically replicated the reference page layout but fabricated large amounts of data, while Codex skipped the layout but delivered significantly more accurate data.

The analysis also highlights a case of benchmark switching: OpenAI’s February blog post urged the industry to adopt SWE-bench Pro as the new standard for coding benchmarks, yet GPT-5.5’s announcement reported results on a new benchmark called “Expert-SWE” instead. The reason, buried in the fine print, is that GPT-5.5 was surpassed by Opus 4.7 on SWE-bench Pro and fell significantly short of Anthropic’s unreleased Mythos (77.8%).

Regarding Opus 4.7, Anthropic published a postmortem analysis one week after release, acknowledging three bugs in Claude Code that persisted for several weeks from March to April and affected nearly all users. Multiple engineers had previously reported performance degradation in version 4.6, but their reports were dismissed as subjective observations. Additionally, Opus 4.7’s new tokenizer increases token usage by up to 35%, which Anthropic openly admitted; at unchanged per-token prices, this effectively constitutes a hidden price increase.

DeepSeek V4 was assessed as “keeping pace with the frontier but not leading,” positioning itself as the lowest-cost alternative to closed-source models. The analysis also noted that “Claude continues to outperform DeepSeek V4 Pro on high-difficulty Chinese writing tasks,” commenting that “Claude won against the Chinese model in its own language.”

The article introduces a key concept: model pricing should be evaluated by “cost per task” rather than “cost per token.” GPT-5.5 is priced at double the rate of GPT-5.4 ($5 input and $30 output per million tokens), but it completes the same tasks using fewer tokens, so the actual cost per task is not necessarily higher. Initial SemiAnalysis data shows Codex’s input-to-output token ratio at 80:1, lower than Claude Code’s 100:1.
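
The "cost per task" idea can be illustrated with a little arithmetic. The Python sketch below is only an illustration, not SemiAnalysis's methodology: the GPT-5.5 prices come from the article, the GPT-5.4 prices are assumed to be exactly half (inferred from "double"), and the token counts per task are hypothetical numbers chosen to match the reported 80:1 input-to-output ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def cost_per_task(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task given the tokens it consumed."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Prices stated in the article for GPT-5.5.
GPT_5_5 = Pricing(input_per_million=5.0, output_per_million=30.0)
# Assumption: "double" means GPT-5.4 costs exactly half on both input and output.
GPT_5_4 = Pricing(input_per_million=2.5, output_per_million=15.0)

# Hypothetical token counts for the same task, both at an 80:1 input-to-output ratio.
old_cost = cost_per_task(GPT_5_4, input_tokens=800_000, output_tokens=10_000)
new_cost = cost_per_task(GPT_5_5, input_tokens=400_000, output_tokens=5_000)

print(f"GPT-5.4: ${old_cost:.2f} per task")
print(f"GPT-5.5: ${new_cost:.2f} per task")
```

Under these assumptions the break-even point is a 50% reduction in tokens: if GPT-5.5 finishes the task with half the tokens, the per-task cost matches GPT-5.4 despite the doubled per-token price, and any further reduction makes it cheaper.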
