Papers · 2026-06-04 · Thursday, June 4, 2026

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

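DeskCraft is a benchmark for professional desktop GUI agents. It covers long-horizon tasks in design, video, audio, and 3D creation, and it builds human-agent collaboration protocols into the evaluation. Its long tasks require more than 50 execution steps, and the benchmark also models mid-turn clarification, user interruption, and post-turn feedback. The authors evaluate 18 closed-source and open-source agents on 538 tasks; GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. The benchmark moves desktop-agent evaluation from short-instruction click tasks toward real professional software workflows.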

