Top global large models can't pass "Pokémon": These games are AI's nightmare

Despite excelling in professional fields such as medical exams and coding, the world's top AI models repeatedly falter in a children's game like Pokémon, exposing core shortcomings in long-term reasoning, memory, and planning. This article originates from the Tencent Technology public account and was written by Guo Jingxiao.
(Previous summary: I play war games with AI: GPT o3 is a mastermind, DeepSeek is a war fanatic, Claude is like a naive sweetheart)
(Additional background: Google “Gemini 2.0” is here! Launching three AI agents: complex tasks, games, programming)
Table of Contents
Does the toolset gap determine success or failure?
Turn-based gameplay exposes AI's "long-term memory" shortcoming
Behind capability evolution: the unbridged “instinct” gap
The “Digital Long March” that AI finds hard to cross—far more than Pokémon
NetHack: The Abyss of Rules
Minecraft: The Disappearing Sense of Goal
StarCraft II: The Gap Between Generality and Specialization
Passenger Tycoon: The Imbalance of Micro and Macro
Elden Ring and Sekiro: The Gap in Physical Feedback
Why has Pokémon become an AI touchstone?
Top AI models worldwide can pass medical licensing exams, write complex code, and even beat human experts in math competitions, yet they repeatedly stumble in a children’s game like Pokémon.
This eye-catching experiment began in February 2025, when an Anthropic researcher launched a Twitch stream of Claude playing Pokémon Red, coinciding with the release of Claude 3.7 Sonnet.
20,000 viewers flooded the stream. In the public chat, viewers brainstormed and cheered for Claude, gradually transforming the broadcast into a public observation of AI capabilities.
Claude 3.7 Sonnet is merely "capable of playing" Pokémon, and "playing" does not equal "winning." It often gets stuck for hours at critical points and makes basic mistakes that even a child wouldn't.
This is not Claude’s first attempt.
Earlier versions performed even worse: they wandered the map aimlessly, fell into infinite loops, or never managed to leave the starting town.
Even the significantly improved Claude Opus 4.5 still makes baffling errors. Once, it circled outside the gym for four days without entering, simply because it didn’t realize it needed to cut down a tree blocking the path.
Why did a children’s game become AI’s Waterloo?
Because Pokémon demands the very abilities that current AI lacks: continuous reasoning in an open world without explicit instructions, recalling decisions made hours earlier, understanding implicit causal relationships, and making long-term plans among hundreds of possible actions.
These tasks are trivial for an 8-year-old child but form an insurmountable gap for AI models claiming to “surpass humans.”
Does the toolset gap determine success or failure?
By contrast, Google's Gemini 2.5 Pro successfully completed a full Pokémon playthrough in May 2025, and Google CEO Sundar Pichai joked publicly that the company had taken a step toward "artificial Pokémon intelligence."
However, this result cannot be simply attributed to the Gemini model being “smarter.”
The key difference lies in the toolset each model is given. Joel Zhang, the independent developer who runs Gemini's Pokémon streams, likens the toolset to an "Iron Man suit": the AI does not enter the game bare-handed but operates inside a system that can call on multiple external capabilities.
Gemini's toolset offers more support, such as converting game visuals into text to compensate for weak visual understanding, plus purpose-built puzzle-solving and path-planning tools. Claude's toolset is more minimalist, so its runs more directly reflect the model's own perception, reasoning, and execution abilities.
In everyday tasks, these differences are not obvious.
When a user asks a chatbot to search the web, the model automatically invokes a search tool and the toolset hardly matters. But in a long-horizon task like Pokémon, differences between toolsets are amplified to the point of deciding success or failure.
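To make the "Iron Man suit" idea concrete, here is a minimal sketch of what such a harness loop might look like. Everything in it is hypothetical: the tool names, the query_model stand-in, and the prompt format are illustrative assumptions, not Anthropic's or Google's actual systems. The point is only the architecture: the model sits inside a loop that can route observations through deterministic helpers before deciding on a button press.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """A hypothetical "Iron Man suit" around a language model.

    The model never touches the game directly: each turn, the harness
    gathers observations, runs whatever helper tools it has, and executes
    the button press the model decides on.
    """
    query_model: Callable[[str], str]                    # stand-in for an LLM API call
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def step(self, screen_text: str, goal: str) -> str:
        # Richer harnesses (Gemini-style) pre-digest the screen with tools;
        # minimalist ones (Claude-style) hand the raw observation to the model.
        notes = [f"[{name}] {tool(screen_text)}" for name, tool in self.tools.items()]
        prompt = (
            f"Goal: {goal}\nScreen: {screen_text}\n" + "\n".join(notes) +
            "\nReply with one button: A, B, UP, DOWN, LEFT or RIGHT."
        )
        return self.query_model(prompt)

# A Gemini-style harness gets extra deterministic helpers...
rich = Harness(query_model=lambda p: "A",
               tools={"pathfinder": lambda s: "nearest exit is 3 tiles north"})
# ...while a Claude-style harness runs nearly bare.
minimal = Harness(query_model=lambda p: "A")
print(rich.step("player standing in Pewter City", "reach the gym"))
```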
Turn-based gameplay exposes AI's "long-term memory" shortcoming
Because Pokémon is strictly turn-based, with no real-time pressure, it makes an excellent testing ground for AI. At each step the model only has to reason over the current screen, its goal prompts, and the available actions, then output an unambiguous command such as "Press A."
This seems to be an interaction form that large language models excel at.
The core issue is discontinuity in time. Although Claude Opus 4.5 has logged more than 500 hours of play and roughly 170,000 steps, it is effectively re-initialized at every step and confined to a very narrow context window. The mechanism makes it like a forgetful person navigating life on sticky notes: cycling through fragments of information, never making the leap from accumulated experience to real mastery the way a human player does.
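The sticky-note analogy can be shown in a few lines of toy code. This is an illustration of the failure mode, not Anthropic's actual memory mechanism: a fixed note budget stands in for the context window, and anything older simply falls off.

```python
MAX_NOTES = 3  # stands in for a small, fixed context window

def next_step(notes: list[str], observation: str) -> list[str]:
    """One re-initialized step: only the newest notes survive."""
    return (notes + [observation])[-MAX_NOTES:]  # older notes are forgotten

notes: list[str] = []
for obs in ["caught a starter", "reached Viridian Forest",
            "a tree blocks the gym path", "bought potions",
            "healed at the Pokémon Center", "why is the gym blocked?"]:
    notes = next_step(notes, obs)

print(notes)
# ['bought potions', 'healed at the Pokémon Center', 'why is the gym blocked?']
# The crucial note about the tree has scrolled away, so the agent may now
# circle the gym for days, much as Claude Opus 4.5 reportedly did.
```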
In chess and Go, AI systems have long surpassed humans, but these are highly customized for specific tasks. In contrast, general models like Gemini, Claude, and GPT, despite frequently beating humans in exams and programming contests, stumble repeatedly in children’s games.
This contrast itself is highly instructive.
Joel Zhang believes the core challenge for AI is the inability to sustain execution of a clear goal over long time spans. “If you want an intelligent agent to do real work, it can’t forget what it did five minutes ago,” he points out.
And this capability is an essential prerequisite for automating cognitive labor.
Independent researcher Peter Whidden, who open-sourced a reinforcement-learning agent that plays Pokémon, offers a more intuitive description. "AI knows almost everything about Pokémon," he says. "It has trained on vast amounts of human data and knows the correct answers. But in execution, it looks clumsy."
In the game, this gap between knowing and doing is magnified over and over: the model may know it needs to find an item yet cannot reliably locate it on a 2D map; it knows it should talk to an NPC yet repeatedly botches the simple movements needed to reach one.
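This is also why a harness-level path-planning tool helps so much: navigation that a model fumbles is trivial for a deterministic search. The sketch below is illustrative, not the actual Gemini tool; it assumes a toy tile map where "." is walkable and "#" is blocked, and returns the button presses needed to reach a goal.

```python
from collections import deque

def find_path(grid: list[str], start: tuple[int, int],
              goal: tuple[int, int]) -> list[str] | None:
    """Breadth-first search over a tile map.

    A deterministic helper like this never loses track of where it is
    going, which is exactly where pure LLM navigation tends to fail.
    """
    moves = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for button, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [button]))
    return None  # no route: e.g. a cuttable tree still blocks the way

town = ["....#",
        ".##.#",
        "....."]
print(find_path(town, (0, 0), (2, 4)))  # a list of button presses
```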
Behind capability evolution: the unbridged “instinct” gap
Nevertheless, AI's progress is plainly visible. Claude Opus 4.5 outperforms earlier versions at keeping records of its own play and at visual understanding, which carries it further into the game. Gemini 3 Pro completed Pokémon Blue and then finished the harder Pokémon Crystal without losing a single battle, something Gemini 2.5 Pro never achieved.
Meanwhile, Anthropic's Claude Code toolset lets the model write and run its own code. Applied to retro titles such as Passenger Tycoon, it has reportedly managed a virtual theme park with some success.
These cases point to an unintuitive reality: given the right toolset, AI may prove highly effective at knowledge work such as software development, accounting, and legal analysis, even while it still struggles with tasks that demand immediate reactions.
The Pokémon experiment also reveals another intriguing phenomenon: models trained on human data tend to exhibit behaviors similar to humans.
Google's technical report on Gemini 2.5 Pro notes that when the model slips into a simulated "panic state," for example when its Pokémon is about to faint, its reasoning quality drops sharply.
When Gemini 3 Pro finally completed Pokémon Blue, it left a note that served no mission purpose: "To end poetically, I want to return to my original home, have a final conversation with my mother, and retire the character."
Joel Zhang finds this behavior surprising and somewhat human-like in emotional projection.
The “Digital Long March” that AI finds hard to cross—far more than Pokémon
Pokémon is not an isolated case. In the pursuit of artificial general intelligence (AGI), developers have found that even an AI that aces legal exams still meets its Waterloo in a number of complex games.
NetHack: The Abyss of Rules
This 1980s dungeon crawler is a nightmare for AI research: its heavy randomness and permadeath mechanic make it brutally unforgiving. Facebook AI Research found that even models that can write code perform far below novice humans in NetHack, which demands commonsense logic and long-term planning.
Minecraft: The Disappearing Sense of Goal
Although AI can craft wooden pickaxes and even mine diamonds, “defeating the Ender Dragon” remains a fantasy. In open worlds, AI often forgets its original purpose during hours-long resource gathering or gets completely lost in complex navigation.
StarCraft II: The Gap Between Generality and Specialization
Purpose-built models have beaten professional players, but if a general model like Claude or Gemini plays directly from visual input, it collapses almost instantly. Handling the uncertainty of the "fog of war" while balancing micro-management against macro-level building remains beyond general models.
Passenger Tycoon: The Imbalance of Micro and Macro
Managing a theme park means tracking thousands of visitors. Even Claude Code, which has shown basic management ability, is easily overwhelmed by large-scale financial trouble or sudden emergencies; a single lapse in reasoning can bankrupt the park.
Elden Ring and Sekiro: The Gap in Physical Feedback
These action-heavy games are extremely unfriendly to AI. With current visual-analysis latency, by the time the model is still "thinking," the boss's attack has already killed the character. Millisecond-level reaction requirements put a hard ceiling on how such models can interact.
Why has Pokémon become an AI touchstone?
Today, Pokémon is gradually becoming an informal yet highly convincing benchmark for AI evaluation.
Models from Anthropic, OpenAI, and Google have attracted hundreds of thousands of comments on Twitch streams. Google detailed Gemini’s game progress in technical reports, and Pichai publicly mentioned this achievement at the I/O developer conference. Anthropic even set up a “Claude Playing Pokémon” demo area at industry conferences.
“We are a group of super tech enthusiasts,” admits David Hershey, head of AI at Anthropic. But he emphasizes that this is not just entertainment.
Unlike traditional one-shot Q&A benchmarks, Pokémon continuously tracks a model's reasoning, decision-making, and progress toward goals over long stretches of time, which far more closely resembles the complex tasks humans want AI to perform in the real world.
For now, AI's struggles in Pokémon continue. But these recurring difficulties trace, quite clearly, the boundary of capabilities that artificial general intelligence has yet to cross.