Papers · 2026-06-05 · Friday, June 5, 2026

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

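RAMP's core claim is that static benchmarks are not enough to characterize the production capability of software-engineering agents, so it performs runtime assessment using compiler-construction tasks on YatCC, serial dependencies between stages, and a stage recovery mechanism. The authors evaluate 15 mainstream models and report that the serial-workflow completion rate falls from 100% at the first stage to 20% at the final stage, and that no model completes the full pipeline. This result separates "can solve a single problem" from "can sustain long-chain execution."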
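To make the distinction between per-stage completion and full-pipeline completion concrete, here is a minimal Python sketch. It assumes that stage recovery lets each stage be attempted from a known-good state, so a model can pass a later stage after failing an earlier one; the summary above does not spell out RAMP's exact scoring, so this is one plausible reading rather than the authors' method. The function names and the toy data are hypothetical and are not results from the paper.

```python
from typing import Dict, List


def per_stage_pass_rates(runs: Dict[str, List[bool]]) -> List[float]:
    """Fraction of models that pass each stage, scored stage by stage.

    Assumption: a recovery step restores a known-good state before every
    stage, so a failure in stage i does not block an attempt at stage i+1.
    """
    n_models = len(runs)
    n_stages = len(next(iter(runs.values())))
    return [
        sum(stages[i] for stages in runs.values()) / n_models
        for i in range(n_stages)
    ]


def full_pipeline_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of models that pass every stage of the serial pipeline."""
    return sum(all(stages) for stages in runs.values()) / len(runs)


# Toy, made-up outcomes (NOT data from the paper): one bool per stage.
toy_runs = {
    "model_a": [True, True, False, False],
    "model_b": [True, False, True, False],
    "model_c": [True, True, False, True],
}

stage_rates = per_stage_pass_rates(toy_runs)
pipeline_rate = full_pipeline_rate(toy_runs)
print(stage_rates, pipeline_rate)
```

In this toy data every model passes the first stage and one model passes the last, yet none passes all four, so the per-stage rates stay above zero while the full-pipeline rate is zero. That is the same shape as the reported result: a nonzero final-stage rate alongside no complete pipeline run.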


