Blog Posts · 2026-05-29 (Friday, May 29, 2026)

Disagreement among frontier LLMs on real-world fact-checks

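The study tests five frontier LLMs on 1,000 real-world fact-check claims, with each model assigning one of four verdict buckets, and finds that agreement is limited: for 67% of claims at least one model disagrees with the majority, 34% show substantive disagreement with verdicts two or more buckets apart, and Krippendorff's ordinal alpha is 0.639. Its value lies in not relying on benchmark gold labels; instead, it measures instability between models on real requests.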
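To make the three statistics concrete, here is a minimal Python sketch of how they could be computed. It is not the study's code: the function names, the 0 to 3 integer encoding of the four verdict buckets, and the toy ratings are all assumptions made for illustration. "Disagrees with the majority" is read here as "the verdicts are not unanimous", and the alpha follows the standard coincidence-matrix formulation of Krippendorff's alpha with the ordinal distance metric.

```python
from itertools import permutations


def dissent_rate(ratings):
    """Share of claims whose verdicts are not unanimous (assumed reading of
    'at least one model disagrees with the majority')."""
    return sum(len(set(row)) > 1 for row in ratings) / len(ratings)


def wide_gap_rate(ratings, min_gap=2):
    """Share of claims where the most extreme verdicts sit at least
    `min_gap` buckets apart."""
    return sum(max(row) - min(row) >= min_gap for row in ratings) / len(ratings)


def krippendorff_alpha_ordinal(ratings, n_levels):
    """Krippendorff's alpha with the ordinal distance metric.

    `ratings` holds one row per claim and one integer verdict in
    0..n_levels-1 per model.
    """
    # Coincidence matrix: every ordered pair of verdicts within a claim
    # contributes 1 / (m - 1), where m is the number of verdicts on that claim.
    o = [[0.0] * n_levels for _ in range(n_levels)]
    for row in ratings:
        m = len(row)
        if m < 2:
            continue  # a single verdict carries no agreement information
        for i, j in permutations(range(m), 2):
            o[row[i]][row[j]] += 1.0 / (m - 1)

    n_c = [sum(r) for r in o]  # marginal total per verdict level
    n = sum(n_c)               # total number of pairable verdicts

    def delta2(c, k):
        # Ordinal distance: squared sum of marginals between the two levels,
        # minus half of each endpoint's marginal.
        lo, hi = min(c, k), max(c, k)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    levels = range(n_levels)
    observed = sum(o[c][k] * delta2(c, k) for c in levels for k in levels)
    expected = sum(
        n_c[c] * n_c[k] * delta2(c, k) for c in levels for k in levels
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when every verdict is identical")
    return 1.0 - observed / expected


# Invented toy data: 4 claims x 5 models, verdict buckets encoded 0..3.
toy_ratings = [
    [0, 0, 0, 0, 0],  # unanimous
    [0, 0, 1, 0, 0],  # one dissenter, adjacent bucket
    [0, 1, 3, 1, 1],  # verdicts span three buckets
    [2, 2, 2, 2, 3],  # one dissenter, adjacent bucket
]

print(dissent_rate(toy_ratings))
print(wide_gap_rate(toy_ratings))
print(krippendorff_alpha_ordinal(toy_ratings, n_levels=4))
```

Because the ordinal metric weights a disagreement by how far apart the two buckets are, a claim rated 0 and 3 lowers alpha more than a claim rated 0 and 1, which is why the study can report the 34% wide-gap figure separately from the overall 67% dissent figure.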


