A study published this month by researcher Kosta Jordanov at Lenz Research found that five frontier AI models disagreed on 67% of 1,000 real-world fact-check claims, with unanimous agreement occurring on only 328 claims. The research tested GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro on claims submitted by actual users to a fact-checking platform. The models achieved a Krippendorff's alpha score of 0.639, falling below the 0.8 threshold researchers generally consider reliable. The disagreements occurred despite all models evaluating identical claims using the same four-label system: true, mostly true, misleading, or false. The findings highlight reliability concerns as people increasingly turn to AI systems for fact-checking.
The study drew its 1,000 claims from submissions by real users of Lenz's fact-checking platform rather than from standard test sets. Each of the five models had to assign every claim one of four labels: true, mostly true, misleading, or false. "Most of these claims are unlikely to appear in any training corpus with a gold label attached—there's no canonical answer key to pattern-match against, no benchmark leaderboard to anchor to," the paper notes.
On 672 out of 1,000 claims, at least one model broke from the majority. In 34% of cases, the disagreement was severe: one model called a claim true while another called it false. "These aren't benchmark items with public answer keys—they're claims real users submitted for verification to a fact-checking platform," the study reads. "Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric."
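The counts above can be reproduced with a simple tally. The following is a minimal sketch, not the study's code: the function names and the data layout (one list of verdicts per claim) are assumptions made for illustration. It treats any non-unanimous claim as a disagreement, which matches the article's figures (672 + 328 = 1,000), and flags a split as severe when the panel contains both a "true" and a "false" verdict.

```python
from collections import Counter

def classify_panel(verdicts):
    """Classify one claim's verdicts (hypothetical helper, not from the study)."""
    counts = Counter(verdicts)
    return {
        # Unanimous when every model chose the same label.
        "unanimous": len(counts) == 1,
        # Severe when at least one model said "true" and another said "false".
        "severe_split": counts["true"] > 0 and counts["false"] > 0,
    }

def tally(panel_verdicts):
    """Aggregate per-claim classifications over a list of claims."""
    results = [classify_panel(v) for v in panel_verdicts]
    return {
        "claims": len(results),
        "unanimous": sum(r["unanimous"] for r in results),
        "any_disagreement": sum(not r["unanimous"] for r in results),
        "severe_split": sum(r["severe_split"] for r in results),
    }

sample = [
    # The four verdicts the article reports for the Trump/Iran claim.
    ["false", "mostly true", "false", "true"],
    # A made-up unanimous claim, included only for contrast.
    ["true", "true", "true", "true", "true"],
]
print(tally(sample))
```

On the two sample claims, the first counts as both a disagreement and a severe split, and the second counts as unanimous.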
The statistical measure of agreement, Krippendorff's alpha, came in at 0.639 on a scale where 1.0 means perfect agreement and 0 means agreement no better than chance. Researchers generally treat 0.8 as the threshold for reliable agreement, so the panel falls short of it. The study says the result indicates "nontrivial but limited agreement." "The models' verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge," researchers note.
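For readers unfamiliar with the statistic, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed as one minus the ratio of observed to expected disagreement. The article does not say which variant the study used (nominal, ordinal, or another distance metric), so the nominal form is an assumption, and the function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: one list of labels per claim (one label per model that rated it).
    This is an illustrative sketch, not the study's implementation.
    """
    # Coincidence matrix: each ordered pair of ratings within a claim
    # contributes 1 / (m - 1), where m is the number of ratings on that claim.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and the overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1 - observed / expected

# Perfect agreement on every claim gives alpha = 1.0.
print(krippendorff_alpha_nominal([["true"] * 5, ["false"] * 5]))
```

Because alpha corrects for how often labels would coincide by chance, it can sit well above zero even when most claims are not unanimous, which is consistent with the study describing its result as structured but limited agreement.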
The researchers provided example claims where the AI models showed the most divergence, including "The World Bank's active portfolio in Nigeria stands an over $16.4 billion as of 2025." GPT-5.4 said it was "mostly true," while Gemini 3 Pro called it "false" and its sister model Gemini 3 Pro + Search rated it "misleading."
In another example, the models were provided with the claim: "Donald Trump said that an attack on Iran was postponed at the request of Gulf Allies." GPT-5.4 said it was false, Claude Opus 4.7 called it mostly true, Gemini 3 Pro said false, and Gemini 3 Pro + Search rated it true.
All five models agreed on only 328 of the 1,000 claims, and when they did, they almost never agreed that something was misleading or mostly true. Just four claims received a unanimous "misleading" verdict. Zero received unanimous "mostly true." "The panel converges on definitive verdicts; the middle of the rubric is where it fractures," the researchers found. Unanimity happened almost entirely at the extremes: either the claim was definitely true or definitely false.
The paper cautions that the majority verdict is not a measure of correctness: "A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness."
What did the Lenz Research study find about AI model agreement on fact-checking? The study found that five frontier AI models disagreed on 67% of 1,000 real-world fact-check claims submitted by actual users. Unanimous agreement occurred on only 328 claims, and the models achieved a Krippendorff's alpha score of 0.639, below the 0.8 reliability threshold researchers generally consider acceptable.
How did the AI models perform on the example claim about Nigeria's World Bank portfolio? GPT-5.4 rated the claim "The World Bank's active portfolio in Nigeria stands an over $16.4 billion as of 2025" as mostly true, while Gemini 3 Pro called it false and Gemini 3 Pro + Search rated it misleading, showing how widely the models can diverge on the same factual claim.
Why did the study use real user-submitted claims instead of standard test sets? The researchers used claims submitted by real people to Lenz's fact-checking platform because most of these claims are unlikely to appear in any training corpus with a gold label attached. That makes it less likely that models are pattern-matching against benchmark answer keys and gives a more realistic test of fact-checking reliability.