According to researcher Kosta Jordanov at Lenz Research, five frontier AI models failed to reach a unanimous verdict on 67% of 1,000 real-world fact-check claims tested this month. The models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro) were asked to classify each claim as true, mostly true, misleading, or false. In 34% of cases the disagreement was severe, with one model calling a claim true while another labeled it false.
The study measured agreement using Krippendorff's alpha, a chance-corrected statistic on which 1.0 indicates perfect agreement and 0 indicates agreement no better than chance. The models came in at 0.639. Krippendorff's own guidance treats 0.8 as the threshold for reliable agreement and 0.667 as the lowest level acceptable for tentative conclusions, so the result falls short of both. All five models agreed on only 328 of the 1,000 claims, and notably, no claim received a unanimous "mostly true" verdict. The researchers used claims submitted by real users to Lenz's fact-checking platform rather than standard benchmarks, reducing the likelihood that models pattern-matched against training data.
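
For readers who want to see how figures like these are derived, the following is a minimal Python sketch, not the study's actual code. It tallies unanimous and severe (true versus false) splits and computes Krippendorff's alpha from a list of per-claim verdicts. Two assumptions are worth flagging: the function names and the toy data are hypothetical, and the sketch uses the nominal version of alpha, which treats the four labels as unordered. The study may have used an ordinal distance instead, which would give a different value.

```python
from collections import Counter
from itertools import permutations


def summarize_agreement(units):
    """Count unanimous claims and claims with a true-vs-false split.

    `units` is a list of tuples, one tuple of verdict labels per claim.
    """
    unanimous = sum(1 for verdicts in units if len(set(verdicts)) == 1)
    severe = sum(1 for verdicts in units
                 if "true" in verdicts and "false" in verdicts)
    return unanimous, severe


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha with the nominal metric (assumed, see lead-in)."""
    # Coincidence matrix: every ordered pair of verdicts given by two
    # different raters to the same claim adds 1 / (m - 1), where m is the
    # number of verdicts on that claim.
    coincidences = Counter()
    for verdicts in units:
        m = len(verdicts)
        if m < 2:
            continue  # a claim with a single verdict is not pairable
        for a, b in permutations(verdicts, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable verdicts.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so nothing to disagree on
    return 1.0 - observed / expected


# Hypothetical toy data: four claims, five model verdicts each.
toy_claims = [
    ("true", "true", "true", "true", "true"),
    ("true", "mostly true", "true", "true", "true"),
    ("false", "false", "misleading", "false", "false"),
    ("true", "false", "misleading", "mostly true", "true"),
]

unanimous, severe = summarize_agreement(toy_claims)
print(f"unanimous: {unanimous}, severe: {severe}")
print(f"alpha (nominal): {krippendorff_alpha_nominal(toy_claims):.3f}")
```

Because alpha corrects for the agreement expected by chance, it can be low even when raw agreement looks respectable, which is why the study reports it alongside the count of unanimous claims.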