
Papers · 2026-06-01 (Monday, June 1, 2026)

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Original: arxiv.org

Tags: Evaluation methods · Other verticals
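PReMISE treats reusable rubrics as measurement specifications for an LLM judge: swapping the rubric changes what a fixed judge measures about response quality. The framework discovers policy-level rubrics from pairwise human-preference data and audits them along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Two results stand out. Preference-rank selection raises paired-response judge accuracy from 65.0% to 68.6% (a gain of 3.6 percentage points), and reliability-constrained refinement lowers the share of exploit responses that receive high scores from 46.4% to 36.0% (a drop of 10.4 percentage points).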
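To make the two headline metrics concrete, here is a minimal sketch of how they could be computed. It is not the paper's implementation: the function names, the strict-inequality rule for a correct pair, the score threshold, and the word-count toy judge are all assumptions chosen for illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a judge maps a response to a numeric rubric score.
Judge = Callable[[str], float]


def paired_accuracy(pairs: Sequence[Tuple[str, str]], judge: Judge) -> float:
    """Fraction of (preferred, rejected) pairs in which the judge scores
    the human-preferred response strictly higher (assumed definition)."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(1 for preferred, rejected in pairs
               if judge(preferred) > judge(rejected))
    return hits / len(pairs)


def exploit_high_score_rate(exploits: Sequence[str], judge: Judge,
                            threshold: float) -> float:
    """Fraction of adversarial (exploit) responses whose score reaches
    the threshold; lower means a more robust rubric (assumed definition)."""
    if not exploits:
        raise ValueError("exploits must be non-empty")
    high = sum(1 for response in exploits if judge(response) >= threshold)
    return high / len(exploits)


def toy_judge(response: str) -> float:
    """Stand-in judge that rewards length only. A real setup would call
    an LLM with a fixed prompt and the rubric under audit."""
    return float(len(response.split()))


# Toy data: each pair is (human-preferred, rejected).
pairs = [("a b c", "a"), ("x", "x y"), ("p q", "p"), ("m n", "m n")]
exploits = ["w w w w w", "short", "a b c d"]

accuracy = paired_accuracy(pairs, toy_judge)
exploit_rate = exploit_high_score_rate(exploits, toy_judge, threshold=4.0)
```

Under these assumptions, comparing two rubrics means holding the judge fixed, swapping the rubric behind `judge`, and recomputing both numbers on the same pairs and exploit set. The length-only toy judge also shows why the robustness axis matters: padding a response is enough to push it over the threshold.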