Top global large models can't pass "Pokémon": These games are AI's nightmare
Author: Guo Xiaojing, Tencent Technology
Editor: Xu Qingyang
The world’s top AI models can pass medical licensing exams, write complex code, and even beat human experts in math competitions, but they repeatedly stumble in a children’s game—Pokémon.
This eye-catching experiment began in February 2025, when an Anthropic researcher launched a Twitch live stream of the model playing Pokémon Red, titled “Claude Plays Pokémon,” to coincide with the release of Claude 3.7 Sonnet.
Some 2,000 viewers flooded into the stream. In the public chat they brainstormed and cheered for Claude, gradually turning the broadcast into a live public examination of AI capabilities.
Claude 3.7 Sonnet could be said to “know how to play” Pokémon, but knowing how to play doesn’t mean winning: it stalls for dozens of hours at critical points and makes basic mistakes that even a young child wouldn’t.
This isn’t Claude’s first attempt.
Earlier versions performed even worse: some wandered aimlessly on the map, some fell into infinite loops, and many couldn’t even leave the starting town.
Even after the significant upgrade to Claude Opus 4.5, perplexing errors persisted. At one point, it circled outside a Gym for four days without entering, simply because it never realized it needed to cut down the tree blocking the path.
Why did a children’s game become AI’s Waterloo?
Because Pokémon demands exactly the abilities that current AI models lack most: sustained reasoning in open worlds without explicit instructions, recalling decisions made hours earlier, understanding implicit causal relationships, and making long-term plans among hundreds of possible actions.
These tasks are trivial for an 8-year-old child, yet they form a gap that AI models claiming to “surpass humans” have so far been unable to cross.
01 Does the Toolset Decide Success or Failure?
By comparison, Google’s Gemini 2.5 Pro successfully completed a similarly difficult Pokémon title in May 2025. Google CEO Sundar Pichai even joked in public that the company had taken a step toward building “artificial Pokémon intelligence.”
However, this result can’t be simply attributed to Gemini being “smarter.”
The key difference lies in the toolsets the models use. Joel Zhang, the independent developer who ran the Gemini Pokémon live stream, likened the toolset to an “Iron Man suit”: the AI doesn’t enter the game empty-handed but is placed inside a system that can call on a variety of external abilities.
Gemini’s toolset offers more support, such as converting game visuals into text to compensate for weak visual understanding, along with customized puzzle-solving and path-planning tools. Claude’s toolset, by contrast, is more minimalist, so its attempts more directly reflect the model’s raw perception, reasoning, and execution capabilities.
In everyday tasks, these differences are not obvious.
When a user asks a chatbot to search the web, the model automatically invokes a search tool. But in a long-horizon task like Pokémon, differences between toolsets are amplified to the point of deciding success or failure.
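To make the “Iron Man suit” image concrete, here is a minimal Python sketch of such a harness. Every name in it (StubEmulator, read_screen_as_text, find_path) is hypothetical, invented for illustration rather than taken from Google’s or Anthropic’s actual scaffolding:

```python
from typing import Callable

class StubEmulator:
    """Stand-in for a Game Boy emulator; a real harness wraps one."""
    def screenshot(self) -> str:
        return "<raw pixels>"
    def screen_to_text(self) -> str:
        return "Player at (4, 7); tree blocking the exit to the north."
    def plan_route(self, target: str) -> str:
        return f"UP, UP, LEFT  # route toward {target}"

def make_toolset(emu: StubEmulator, rich: bool) -> dict[str, Callable[[str], str]]:
    """Register tools; a richer harness adds vision and path planning."""
    tools: dict[str, Callable[[str], str]] = {"screenshot": lambda _: emu.screenshot()}
    if rich:  # "Gemini-style": compensate for weak vision, add path planning
        tools["read_screen_as_text"] = lambda _: emu.screen_to_text()
        tools["find_path"] = lambda target: emu.plan_route(target)
    return tools

def agent_step(tools: dict[str, Callable[[str], str]], name: str, arg: str = "") -> str:
    # In a real harness the model chooses which tool to call; here it is hardcoded.
    return tools[name](arg)

emu = StubEmulator()
print(agent_step(make_toolset(emu, rich=True), "find_path", "Pewter Gym"))
print(agent_step(make_toolset(emu, rich=False), "screenshot"))  # pixels only
```

The model behind the loop is identical in both cases; only the registered tools differ, which is exactly why a richer harness can make the same model look far more capable.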
02 Turn-Based Play Exposes AI’s “Long-Term Memory” Shortcomings
Because Pokémon is strictly turn-based and demands no real-time reactions, it makes an excellent “training ground” for testing AI. At each step, the model only needs to reason over the current scene, its goal prompts, and the available actions, then output a clear command such as “Press A.”
This seems to be the interaction form most suited to large language models.
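A minimal sketch of that round trip is shown below. The prompt format is invented, and call_model is a placeholder for whatever chat-completion API a harness actually uses:

```python
VALID_BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START"}

def call_model(prompt: str) -> str:
    # Placeholder: a real harness would call an LLM API here.
    return "Press A"

def play_one_turn(scene: str, goal: str) -> str:
    """One turn: describe the state, ask for a command, validate it."""
    prompt = (
        f"Scene: {scene}\n"
        f"Goal: {goal}\n"
        f"Available buttons: {sorted(VALID_BUTTONS)}\n"
        "Reply with exactly one command, e.g. 'Press A'."
    )
    reply = call_model(prompt).strip()
    button = reply.removeprefix("Press ").upper()
    if button not in VALID_BUTTONS:
        button = "B"  # fall back to a harmless press rather than crash
    return button

print(play_one_turn("A trainer blocks the path and wants to battle.",
                    "Reach Cerulean City"))  # -> "A"
```

Validating the reply and falling back to a harmless button is the kind of guardrail real harnesses also need, since models sometimes return malformed commands.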
The core problem is “disconnection” across time. Claude Opus 4.5 accumulated over 500 hours of play and executed roughly 170,000 steps, yet because its context is reinitialized step after step, the model must reconstruct its situation from clues inside a very narrow context window. That makes it less like a player gaining experience and more like a forgetful person living off sticky notes: it cycles through fragmented information without ever making the leap from quantitative change to qualitative change that human players do.
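That “sticky notes” dynamic is easy to simulate. In the illustrative sketch below, the crude write-it-down-or-lose-it memory scheme is an assumption for demonstration, not any vendor’s actual memory system:

```python
from collections import deque

MAX_RECENT_STEPS = 5          # stands in for the narrow context window
notes: list[str] = []         # durable "sticky notes" the agent writes itself
recent: deque = deque(maxlen=MAX_RECENT_STEPS)  # old entries are pushed out

def record_step(observation: str, important: bool = False) -> None:
    recent.append(observation)     # recent detail, soon forgotten
    if important:
        notes.append(observation)  # only what gets written down survives

def build_context() -> str:
    """Everything the model 'remembers' at the next step."""
    return "NOTES:\n" + "\n".join(notes) + "\nRECENT:\n" + "\n".join(recent)

for step in range(1, 171):         # simulate a long session
    record_step(f"step {step}: wandered Viridian Forest",
                important=(step == 3))  # e.g. step 3 picked up a key item
print(build_context())
# Step 3 survives as a note; every unnoted step before 166 is simply gone.
```

Whatever the agent never flags as important simply falls out of the window, no matter how many thousands of steps it has played.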
AI systems surpassed humans in chess and Go long ago, but those systems are highly specialized for a single task. General models like Gemini, Claude, and GPT, by contrast, routinely beat humans in exams and programming competitions yet stumble repeatedly in a children’s game.
This contrast itself is highly revealing.
Joel Zhang believes the core challenge for AI is the inability to sustain execution of a single clear goal over long periods. “If you want an agent to do real work, it can’t forget what it did five minutes ago,” he points out.
And this capability is an essential prerequisite for automating cognitive labor.
Independent researcher Peter Whidden, who open-sourced a reinforcement-learning Pokémon agent, offers a more intuitive description. “AI knows almost everything about Pokémon,” he said. “It trains on vast amounts of human data and knows the correct answers. But once it comes to execution, it’s clumsy and inept.”
In the game, this gap between knowing and doing is magnified constantly: the model may know it needs to find a certain item but cannot reliably locate it on a 2D map; it knows it should talk to an NPC but repeatedly fumbles the basic movements to reach one.
03 Behind the Progress: An Uncrossed “Instinct” Gap
Nevertheless, AI progress is clearly visible. Claude Opus 4.5 significantly outperforms earlier versions at self-recording and visual understanding, letting it get further in the game. Gemini 3 Pro, after beating Pokémon Blue, went on to complete the harder Pokémon Crystal without losing a single battle, something Gemini 2.5 Pro never achieved.
Meanwhile, Anthropic’s release of the Claude Code tool lets the model write and run its own code; it has been applied to retro games like RollerCoaster Tycoon, where it reportedly managed virtual theme parks successfully.
These cases reveal a counterintuitive reality: an AI equipped with the right toolset may prove highly efficient at software development, accounting, legal analysis, and other knowledge work, even while it still struggles with tasks requiring real-time responses.
The Pokémon experiments also reveal another intriguing phenomenon: models trained on human data tend to exhibit behaviors similar to humans.
Google’s technical report on Gemini 2.5 Pro noted that when the model slips into something like a “panic state,” for example when its Pokémon is about to faint, its reasoning quality drops significantly.
And when Gemini 3 Pro finally completed Pokémon Blue, it left an unprompted note: “To end poetically, I want to return to my original home, have a final conversation with my mother, and let the character retire.”
Joel Zhang found the behavior surprising: an almost human-like act of emotional projection.
04 Beyond Pokémon: AI’s Unfinished “Long March”
Pokémon is not an isolated case. On the road to artificial general intelligence (AGI), developers have found that even an AI that excels at legal exams still meets its “Waterloo” in a number of complex games.
NetHack: The Abyss of Rules
This 1980s dungeon crawler is a “nightmare” for AI research: its heavy randomness and “permadeath” mechanic make it brutally difficult. Facebook AI Research found that even a model that can write code performs far worse in NetHack, which demands commonsense logic and long-term planning, than a novice human player.
Minecraft: A Disappearing Sense of Purpose
AI can craft wooden pickaxes and even mine diamonds, but defeating the Ender Dragon remains a fantasy. In an open world, AI often forgets its original goal during resource grinds that can last dozens of hours, or gets hopelessly lost in complex navigation.
StarCraft II: The Gap Between Generality and Specialization
Customized models have defeated professional players, but if Claude or Gemini control the game directly from visual input, they collapse almost instantly. In handling the uncertainty of the “fog of war” and balancing micromanagement against macro strategy, general models still fall far short.
RollerCoaster Tycoon: Imbalance Between Micro and Macro
Managing a theme park means tracking the states of thousands of visitors. Even Claude Code, with its basic management capabilities, is easily overwhelmed by large-scale financial crises or emergencies, and a single lapse in reasoning can bankrupt the park.
Elden Ring and Sekiro: The Chasm of Physical Feedback
These fast-paced action games are extremely unfriendly to AI. Given current visual-parsing delays, by the time the AI is still “thinking” about a boss’s move, the character is often already dead. Millisecond-level reaction requirements sit well beyond the natural limits of the model’s interaction loop.
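A back-of-envelope calculation makes the scale of the problem plain. The three-second round-trip figure in this sketch is an assumed, illustrative number, not a measured benchmark:

```python
FRAME_MS = 1000 / 60        # one frame at 60 fps is about 16.7 ms
MODEL_ROUND_TRIP_MS = 3000  # assumed: capture + upload + inference + parse

frames_behind = MODEL_ROUND_TRIP_MS / FRAME_MS
print(f"The model reacts ~{frames_behind:.0f} frames late.")  # ~180 frames
```

Even cutting the round trip to half a second would still leave the model roughly thirty frames behind the action.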
05 Why Has Pokémon Become an AI Benchmark?
Today, Pokémon is gradually becoming an informal yet highly convincing benchmark for AI evaluation.
Models from Anthropic, OpenAI, and Google have attracted hundreds of thousands of comments on Twitch streams. Google’s technical report details Gemini’s progress in the game, and Pichai publicly mentioned this achievement at the I/O developer conference. Anthropic even set up a “Claude Playing Pokémon” showcase at industry events.
“We are a group of super tech enthusiasts,” admits David Hershey, head of AI applications at Anthropic. But he emphasizes that this is more than just entertainment.
Unlike traditional one-shot question-and-answer benchmarks, Pokémon continuously tracks a model’s reasoning, decision-making, and goal pursuit over long stretches, which is far closer to the complex tasks humans expect AI to perform in the real world.
So far, AI challenges in Pokémon persist. But these recurring difficulties clearly outline the boundaries of capabilities that general artificial intelligence has yet to cross.
Special Contributor Wu Ji also contributed to this article