Papers · 2026-06-02 · Tuesday, June 2, 2026

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

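SoundnessBench evaluates how well LLMs can judge the methodological feasibility of a research proposal before the research is carried out. The dataset consists of 1,099 machine learning research proposals reconstructed from ICLR submissions, each paired with reviewer soundness sub-scores. The 12 frontier LLMs evaluated generally show an optimism bias, often rating low-soundness ideas as feasible. The paper locates the bottleneck for AI Scientist systems at the "first gate of rigor" rather than only at automated experiment execution.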
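One way to make the optimism-bias finding concrete is to count how often a model rates a proposal as feasible when reviewers gave it a low soundness score. The snippet below is a minimal sketch under stated assumptions, not the paper's actual metric: the `Judgment` record, the `optimism_rate` function, the 1-4 score scale, the `low_threshold` cutoff, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    # Hypothetical record; field names are illustrative, not from the paper.
    reviewer_soundness: float   # mean reviewer soundness sub-score (assumed 1-4 scale)
    model_says_feasible: bool   # model's verdict before any experiment is run


def optimism_rate(judgments, low_threshold=2.0):
    """Fraction of low-soundness proposals that the model still rated feasible.

    low_threshold is an assumed cutoff, not a value taken from the paper.
    """
    low = [j for j in judgments if j.reviewer_soundness <= low_threshold]
    if not low:
        return 0.0
    return sum(j.model_says_feasible for j in low) / len(low)


# Toy data, invented purely to exercise the function.
sample = [
    Judgment(1.5, True),
    Judgment(2.0, True),
    Judgment(1.0, False),
    Judgment(3.5, True),
]
rate = optimism_rate(sample)
print(f"optimism rate on low-soundness proposals: {rate:.2f}")
```

A rate near 1.0 on such a measure would mean the model almost never rejects ideas that reviewers considered unsound, which is the pattern the summary describes.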


