According to the latest benchmark from Datadog and Carnegie Mellon, GPT-5 achieved 62.7% accuracy on ARFBench, falling short of human domain experts at 72.7%. ARFBench is the first AI benchmark built from real production incidents, 63 in total. It contains 750 multiple-choice questions covering 142 monitoring metrics and 5.38 million data points, with no synthetic data.
AI models struggle most on cross-metric reasoning (Tier III questions), where GPT-5 scored just 47.5% F1. A theoretical model-expert oracle combining AI and human judgment reaches 87.2% accuracy, illustrating how collaboration could exceed either alone. Datadog’s hybrid model, Toto-1.0-QA-Experimental, topped the leaderboard at 63.9% accuracy, outperforming GPT-5 on anomaly identification.
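To make the oracle figure concrete, here is a minimal sketch of how such an upper bound is commonly computed. It assumes the oracle counts a question as correct whenever either the model or the expert answers it correctly; the summary above does not spell out the benchmark's exact definition. The function name `oracle_accuracy` and the five-question result lists are hypothetical toy values, not ARFBench data.

```python
def oracle_accuracy(model_correct, expert_correct):
    """Fraction of questions answered correctly by the model OR the expert.

    Assumption: the "model-expert oracle" always picks whichever of the two
    answers is right, so it only fails when both are wrong.
    """
    if len(model_correct) != len(expert_correct):
        raise ValueError("both result lists must cover the same questions")
    if not model_correct:
        raise ValueError("need at least one question")
    hits = sum(m or e for m, e in zip(model_correct, expert_correct))
    return hits / len(model_correct)


# Hypothetical per-question results (True = answered correctly).
model_correct = [True, False, True, False, True]
expert_correct = [True, True, False, False, True]

model_acc = sum(model_correct) / len(model_correct)
expert_acc = sum(expert_correct) / len(expert_correct)
combined = oracle_accuracy(model_correct, expert_correct)

print(f"model={model_acc:.1%} expert={expert_acc:.1%} oracle={combined:.1%}")
```

In this toy case the model and the expert each miss two questions, but they only miss one in common, so the oracle scores higher than either. The same effect would explain how 62.7% and 72.7% could combine into 87.2%: the two groups tend to fail on different questions.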