BioMysteryBench: Mythos solves 29.6% of expert-unsolved questions

Anthropic published BioMysteryBench, a new benchmark for evaluating AI bioinformatics-analysis capabilities, in an official research announcement on April 29. The benchmark consists of open-ended questions drawn from real research scenarios. The headline result: on questions that a panel of human experts could not solve even after attempting them, Anthropic’s flagship model Mythos solved 29.6%, while Opus 4.7 solved 27.0%.

Evaluation design: two tracks—solvable vs. unsolved by experts

BioMysteryBench comprises two types of questions. The first category is “solvable”: analysis tasks designed by bioinformatics researchers, each with a reference answer for comparison. The second is “unsolved by experts”: questions for which human expert panels could not find credible answers even after trying, used to test whether models can push past the current boundary of domain knowledge.

In the solvable portion, Anthropic’s model generations show a clear capability gradient: Claude Haiku 4.5 solved 36.8%, Claude Sonnet 4.6 reached 71.8%, and the latest flagship Claude Mythos reached 82.6%. This gradient largely matches Anthropic’s public claims about model capability differences—Haiku is the lightweight model, Sonnet is the workhorse, and Mythos is the top-tier research model.

What truly deserves attention is the unsolved-by-experts portion. These questions were reviewed by a panel of bioinformatics experts and labeled “unsolvable or no consensus”; Mythos solved 29.6% of them, while Opus 4.7 solved 27.0%. This result is not, by itself, proof that “the model is stronger than humans.” A more precise framing: on problems that experts cannot crack due to constraints of approach, time, or resources, AI can propose solution paths that can then be verified. They may not be the final answers, but they carry the quality of “an angle humans have not tried.”

Rolled out in parallel with Claude for Life Sciences

BioMysteryBench dovetails with Anthropic’s “Claude for Life Sciences” initiative, launched in the second half of 2025. The latter targets concrete application scenarios such as drug development, genomics, and clinical trial design; the former uses evaluation to quantify the progress of AI’s research-grade capabilities in the life sciences. The combined signal: Anthropic is positioning biomedicine as one of Claude’s long-term primary application battlegrounds, entering the field from a different angle than DeepMind’s AlphaFold.

If Mythos’s nearly 30% figure on expert-unsolved questions can be reproduced under independent third-party verification, it would stand as early empirical validation of AI models’ practical value in scientific research. Points to watch next include whether BioMysteryBench is adopted as a standard benchmark by other research institutions, how the human expert panel validated the solutions the models produced, and whether Mythos can reproduce these results in actual research projects.

This article, BioMysteryBench: Mythos solves 29.6% of expert-unsolved questions, first appeared on Chain News ABMedia.
