A research team from the University of California, Berkeley has proposed a new AI training method, GEPA, which has been accepted as an Oral paper at ICLR 2026. GEPA updates no model weights and requires no GPU training; instead, it repeatedly rewrites the prompts of an AI system using a single LLM that "reads training logs." It outperforms the mainstream reinforcement learning method GRPO by an average of 6% across six tasks, with a peak margin of 20%, while requiring 35 times fewer training attempts (rollouts). After the AI engineering community organized and circulated the research, it sparked discussion on X, and GEPA has now been integrated into DSPy as a first-class optimizer.
What GEPA does: Treat training logs as textbooks, not just scores
The workflow of traditional reinforcement learning methods such as GRPO is: have the AI run a task once, give a "+1 or -1" score based on the result, and then repeatedly adjust the model weights using that score. The problem is that a single run of a task usually involves reasoning steps spanning thousands of tokens, tool calls, and error messages, and all of that rich detail is compressed into a single score and effectively discarded. That is why RL needs tens of thousands of runs to converge.
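The information loss described above can be made concrete with a minimal sketch. The rollout data and the reward function here are hypothetical stand-ins, not GRPO's actual implementation; the point is only that a scalar reward discards everything else in the trace:

```python
# Hypothetical rollout from an agent run: reasoning, tool calls, and an
# error log are all present in the trace.
rollout = {
    "reasoning": "Step 1: parse the query. Step 2: call the search tool.",
    "tool_calls": [
        {"tool": "search", "ok": True},
        {"tool": "fetch", "ok": False},
    ],
    "error_log": "fetch failed: HTTP 403",
    "final_answer": "GEPA is a prompt optimizer.",
}

def grpo_style_reward(rollout: dict) -> float:
    """Scalar reward based on the final answer alone.

    The failed tool call and its HTTP 403 error never reach the learner:
    the entire trace collapses into one number.
    """
    return 1.0 if "prompt optimizer" in rollout["final_answer"] else -1.0

reward = grpo_style_reward(rollout)
```

Everything the learner sees is `reward`; recovering *why* a run failed from many such scalars is what forces RL's huge sample counts.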
GEPA does the opposite: after each run of a task, it hands the entire process (reasoning, tool calls, and error logs) to a separate "reflective LLM" to read in full. The reflective LLM works like a senior engineer reading a code log: it identifies which step failed, why it failed, and how the prompt should be modified, then directly rewrites the prompt for that specific module. Each round runs the task the same number of times, but GEPA extracts far more signal from every run than RL's single score.
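The loop above can be sketched in a few lines. This is a toy illustration, not the paper's algorithm: `run_task` and `reflect` are stub functions standing in for the task executor and the reflective LLM, and the seed prompt is invented. A real system would replace `reflect` with an LLM call that reads the full traces:

```python
def run_task(prompt: str, example: str) -> dict:
    # Stub executor: in this toy world, answers are accepted only if the
    # prompt instructs the agent to cite sources.
    ok = "cite sources" in prompt
    return {
        "score": 1.0 if ok else 0.0,
        "error": None if ok else "answer rejected: no citations",
    }

def reflect(prompt: str, traces: list) -> str:
    # Stub reflective LLM: reads full traces (not just scores) and proposes
    # a targeted prompt rewrite addressing the observed failure mode.
    if any(t["error"] and "no citations" in t["error"] for t in traces):
        return prompt + " Always cite sources."
    return prompt

def gepa_step(prompt: str, trainset: list) -> tuple:
    """One round: run, reflect, rewrite, and keep the rewrite if it scores higher."""
    traces = [run_task(prompt, ex) for ex in trainset]
    candidate = reflect(prompt, traces)
    old = sum(t["score"] for t in traces) / len(traces)
    new = sum(run_task(candidate, ex)["score"] for ex in trainset) / len(trainset)
    return (candidate, new) if new > old else (prompt, old)

prompt, score = gepa_step("Answer the question concisely.", ["q1", "q2"])
```

Note that no weights change anywhere in the loop: the only state being optimized is the prompt string itself.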
Why it can win: Change “scoring” into “reading the whole process”
GEPA beats GRPO by an average of 6% across six tasks, with a peak margin of 20%. Compared with MIPROv2, another mainstream prompt optimizer, it also comes out ahead by more than 10% (a 12% improvement on the AIME-2025 math benchmark). The most critical factor is training cost: to reach the same level of performance, GEPA needs 35 times fewer rollouts (a rollout is one full run of a task).
Another data point: after GEPA's integration with DSPy, the "Full Program Adapter" can optimize an entire DSPy program (including signatures, modules, and control flow). On the MATH benchmark it reaches 93% accuracy, far surpassing the 67% of a standard DSPy ChainOfThought program. GEPA also shines in multi-module workflows (AI agents that chain multiple modules together): it can pinpoint exactly which module is failing and rewrite that module's prompt, rather than adjusting the entire system at once.
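The module-level targeting just described can be sketched as follows. The module names, prompts, and failure counts are all hypothetical; GEPA's actual mechanism attributes failures by reading execution traces with an LLM, whereas this toy version just tallies them:

```python
# Hypothetical three-module pipeline with one prompt per module.
pipeline_prompts = {
    "retrieve": "Find passages relevant to the question.",
    "reason":   "Think step by step using the passages.",
    "answer":   "State the final answer in one sentence.",
}

# Per-module failure counts attributed from execution traces (made-up numbers).
module_failures = {"retrieve": 1, "reason": 9, "answer": 0}

def rewrite_failing_module(prompts: dict, failures: dict) -> dict:
    """Rewrite only the prompt of the most failure-prone module.

    The other modules' prompts are left untouched, so a fix for one stage
    cannot regress the stages that were already working.
    """
    worst = max(failures, key=failures.get)
    updated = dict(prompts)
    updated[worst] = prompts[worst] + " Check each step against the retrieved passages."
    return updated

new_prompts = rewrite_failing_module(pipeline_prompts, module_failures)
```

Here only the "reason" module's prompt changes, which is the whole point: whole-system tuning spreads credit across every module, while targeted rewriting concentrates it where the traces show the breakage.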
Who will use it first: A first-class citizen in DSPy, with code already open-sourced on GitHub
GEPA’s code has been open-sourced on GitHub, integrated into the DSPy framework in the form of dspy.GEPA, and also released separately as a Python library. The research team spans institutions including UC Berkeley, Stanford, Notre Dame, and Anthropic. The paper’s authors include Matei Zaharia (co-founder of Databricks and a main author of DSPy) and Omar Khattab (a main author of DSPy).
For the developer community, GEPA offers a new answer to the problem of "having lots of rollouts but not knowing how to use them." Most teams have already accumulated tens of thousands of agent task logs, but beyond skimming a few runs when something breaks, they lack a systematic way to turn those logs into model improvements. The next things to watch are GEPA's real-world adoption in enterprise agentic workflows (such as customer service automation and automated program repair), and whether independent GEPA implementations will appear outside the DSPy framework.
This article, "Berkeley's GEPA deep dive: how to make AI learn new tasks without updating weights, beating RL with 35 times less training cost," was first published on ChainNews ABMedia.