2.1 Agent RL / Verifiable Rewards

34 entries in this topic · earliest 2026-06-01 · latest 2026-07-21

July 2026 · 10 entries

ToolVerse: Unlocking Massive Environments and Long-Horizon Tasks for Agentic Reinforcement Learning
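ToolVerse targets Tool-Integrated Reasoning: it automatically converts nearly 400 real-world MCPs into executable training environments covering about 4,500 tools. Task generation relies on a tool dependency graph and Dynamic Unlocking Sampling and yields the GUST dataset; the training algorithm uses Turn-Aware Relative Advantage to refine long-horizon credit ass…
Paper · 2026-07-21 · arxiv.org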
–
DSWorld: A Data Science World Model for Efficient Autonomous Agents
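DSWorld predicts the state transition of a data science workflow after a candidate operation, using structured states, cost-aware routing, lightweight real execution, and an LLM simulator to avoid expensive trial-and-error. The paper builds an 8K-scale transition trajectory dataset and proposes Reflective World Model Optimization for error-awa…
Paper · 2026-07-21 · arxiv.org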
–
Agentic-DPO: From Imitation to Agentic Policy Optimization on Expert Trajectories
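The paper extends DPO to multi-turn agent trajectories instead of only having the model imitate the expert's next action. Agentic-DPO brings action selection, tool-call paths, and final outcomes into preference optimization, with the goal of learning which trajectory is better overall. It belongs in the agent-training line of work because behavior cloning tends to copy surface steps, whereas trajectory-level preferences track actual execution quality more closely.
Paper · 2026-07-15 · arxiv.org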
–
SWE-1.7 Reach Near GPT 5.5 and Opus Intelligence
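Cognition released the SWE-1.7 model/system update; the title stresses that it approaches the level of GPT 5.5 and Opus Intelligence on software engineering tasks. The HN discussion page records 205 points and 112 comments, which suggests the developer community remains highly sensitive to the coding-agent race. What can be confirmed is that the focus is an iteration of a software engineering agent system, not a general-purpose chat model release.
News · 2026-07-09 · cognition.com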
–
You Only Need 1 Layer for RLVR?
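This issue of The AI Timeline covers AI research/news from July 1 to July 7; its title picks up the discussion of whether RLVR needs only one layer. It gathers the week's papers, releases, and community activity into a single email, which makes it useful for watching how research topics cluster within a few days. The RLVR topic itself points to a reassessment of reasoning training, verifiers, and reinforcement learning cost.
Blog · 2026-07-08 · mail.bycloud.ai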
–
Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows
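The paper applies RLVR to tool-calling agents in Atlassian workflows. Rewards come from verifiable API call order, parameter completeness, and task state, not only from the natural-language answer. The tasks center on SaaS operations over issues, workspaces, and projects, and stress that the agent must complete state changes according to the business process. Its point of interest is moving tool-call training from "does it look like the answer" to "was the external system actually operated correctly".
Paper · 2026-07-04 · arxiv.org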
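To make the three reward signals named in this entry concrete, here is a minimal sketch of a rule-based verifier. It is not the paper's reward function: the tool names, the equal one-third weighting, and the `ToolCall` and `verifiable_reward` names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def is_subsequence(expected: list[str], actual: list[str]) -> bool:
    """True if the expected names occur in `actual` in the same relative order."""
    it = iter(actual)
    return all(name in it for name in expected)


def verifiable_reward(calls, expected_order, required_args, final_state, goal_state):
    """Score one rollout on three checkable signals, each worth one third (assumed weighting)."""
    names = [c.name for c in calls]
    order_ok = is_subsequence(expected_order, names)

    # Fraction of checked calls that carry all of their required arguments.
    checked = [c for c in calls if c.name in required_args]
    if checked:
        complete = sum(
            all(k in c.args for k in required_args[c.name]) for c in checked
        ) / len(checked)
    else:
        complete = 0.0

    # The final environment state must contain every goal key/value pair.
    state_ok = all(final_state.get(k) == v for k, v in goal_state.items())

    return (float(order_ok) + complete + float(state_ok)) / 3.0


# Hypothetical tool names, used only for this example.
rollout = [
    ToolCall("create_project", {"key": "OPS"}),
    ToolCall("create_issue", {"project": "OPS", "summary": "Rotate keys"}),
]
reward = verifiable_reward(
    rollout,
    expected_order=["create_project", "create_issue"],
    required_args={"create_project": ["key"], "create_issue": ["project", "summary"]},
    final_state={"issue_count": 1},
    goal_state={"issue_count": 1},
)
```

None of the three checks reads the agent's natural-language answer, which is the property the entry highlights.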
–
Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents
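LuckyStar 111B from Cohere and LG CNS is a very engineering-oriented report on adapting a multilingual tool agent: it continues training from the already fully post-trained Command A instead of pre-training from scratch. It combines multilingual SFT, multi-step tool-use verifiable rewards, Korean user-facing respon…
Paper · 2026-07-02 · arxiv.org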
–
TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning
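TRIAGE points out that standard GRPO assigns the final verifier outcome uniformly to all action tokens, which rewards redundant or regressive actions in successful trajectories and penalizes useful exploration in failed ones. It has a structured judge label each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, …
Paper · 2026-07-02 · arxiv.org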
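The behavior TRIAGE criticizes can be seen in a minimal sketch of the standard GRPO baseline. This shows the uniform assignment only, not TRIAGE's role-typed scheme, which the truncated summary does not specify. The function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, token_counts, eps=1e-6):
    """Group-relative advantage per rollout, copied unchanged to every action token."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_rollout = [(r - mu) / (sigma + eps) for r in rewards]
    return [[a] * n for a, n in zip(per_rollout, token_counts)]


# Four rollouts of one prompt: verifier outcome and number of action tokens.
advantages = grpo_token_advantages([1.0, 0.0, 1.0, 0.0], [3, 2, 4, 1])
```

Every token in a successful rollout receives the same positive value, including tokens from redundant or regressive actions, and every token in a failed rollout receives the same negative value.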
–
ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents
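ReGRPO focuses on the ability of tool-calling VLM agents to recover after failure, instead of only doing SFT on successful trajectories. The method first executes near-miss actions to collect grounded failure observations, then constructs Reflection-of-Thought triplets (ErrorType, Evidence, and FixPlan) and pairs them with corrective actions for a warm start; the RL stage then…
Paper · 2026-07-02 · arxiv.org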
–
ECHO: Prune to act, trace to learn with selective turn memory in agentic RL
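ECHO handles context pruning and RL credit tracing within one design. Each completed environment turn is compressed into a memory record carrying a source index, and the policy context is rebuilt from these records on demand; positive credit from a successful final outcome then flows back along the source index to the evidence and selection actions that support the answer. On BrowseComp-Plus, ECHO held-out accura…
Paper · 2026-07-02 · arxiv.org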
–

June 2026 · 24 entries

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
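This paper extracts an agent process reward "for free" from RL post-training: the log-probability ratio between the RL policy and the reference policy is interpreted as a progress advantage and used for step-level scoring. Across five benchmarks and four model families, the authors apply it to test-time scaling, …
Paper · 2026-06-27 · arxiv.org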
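A minimal sketch of step-level scoring with a log-probability ratio follows. It assumes the per-token log-probabilities of a step under both policies are already available, and that the step score is the sum of the per-token differences. The aggregation choice and the function names are assumptions, not details taken from the paper.

```python
def progress_advantage(policy_logprobs, reference_logprobs):
    """Sum over a step's tokens of log pi_RL(token) - log pi_ref(token)."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both policies must score the same tokens")
    return sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))


def best_candidate(candidates):
    """Index of the candidate step with the highest progress advantage."""
    scores = [progress_advantage(p, q) for p, q in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Two candidate next steps, each as (RL-policy logprobs, reference logprobs).
candidates = [
    ([-0.1, -0.2], [-0.5, -0.9]),  # RL policy is much more confident than the reference
    ([-0.6, -0.7], [-0.5, -0.8]),  # almost no change relative to the reference
]
chosen = best_candidate(candidates)
```

Picking the highest-scoring candidate is one way such a score could drive test-time scaling; it needs no separately trained reward model, only two forward passes.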
–
Diagnosing Task Insensitivity in Language Agents
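The paper analyzes task insensitivity, where a language agent applies trained patterns to tasks that are similar but different. The authors find that when the task description is semantically corrupted or replaced, the model may still output the original task's actions, accompanied by attention drifting during training from task tokens toward local observations. The value of Task-Perturbed NLL Optimization lies in using a lightweight contrastive regularizer…
Paper · 2026-06-27 · arxiv.org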
–
Tmax: A simple recipe for terminal agents
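Tmax targets terminal-using agents and provides an open RL training recipe, aiming to address the lack of public data, stable baselines, and academic evaluation for terminal agents. The abstract stresses that terminal agents have become one of the most common downstream applications of language models, yet RL training research is still limited by benchmarks, data, and simple baseline recipes. Its contribution reads more like a reproducible experimental foundation than yet another closed terminal-assistant result…
Paper · 2026-06-24 · arxiv.org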
–
slime
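slime is engineering infrastructure for RL post-training of large models: Megatron handles training, SGLang handles rollout, and a Data Buffer connects prompts, generation, reward/verifier, and environment interaction. The README is specific about "production validation", stating that slime has supported post-training from GLM-4.5 through GLM-5.2, and lists model paths for Qwen, DeepSeek V3/R1, Llama 3, and others. Worth noting…
Project · 2026-06-22 · github.com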
–
REVES: REvision and VErification--Augmented Training for Test-Time Scaling
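REVES augments training with revision and verification so that the model comes closer to the sequential revision reasoning process used in test-time scaling. The abstract notes that standard post-training mostly optimizes single-shot objectives, which is misaligned with multi-step inference dynamics, and that recent approaches treating it as multi-turn RL have their own limitations. …
Paper · 2026-06-20 · arxiv.org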
–
Context-Aware RL for Agentic and Multimodal LLMs
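This paper proposes ContextRL, which uses context-aware reinforcement learning to improve long-horizon reasoning and multimodal performance. The failure examples in the abstract are concrete: the answer may depend on a single line in a tool trace or on a subtle detail in an image; the method uses an indirect auxiliary objective to train the model to identify such decisive evidence. For agentic LLMs, this kind of training objective corresponds directly to tool use, evidence lookup, and multi-turn tasks…
Paper · 2026-06-20 · arxiv.org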
–
STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
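STARE analyzes the token-level entropy dynamics of GRPO and points out a four-quadrant structure and a near-critical phenomenon between trajectory-level advantage and token surprisal. The method uses within-batch surprisal quantiles to select entropy-critical tokens, reweights their advantages, and applies a target-entropy gate for closed-loop control. Experiments span 1.5B to 32B, short CoT, long…
Paper · 2026-06-19 · arxiv.org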
–
EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
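The paper argues that the bottleneck in RL rollout is not the fixed-model inference of ordinary serving, but a continuously changing policy, high-temperature long generation, and a gradually shrinking batch size. EfficientRollout derives a quantized drafter from the target model and uses a system-aware switch to enable self-speculative decoding only during memory-bound phases. It reduces rollout latency by up to 19.6% and 12.7%…
Paper · 2026-06-19 · arxiv.org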
–
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
Nemotron 3 Ultra is an MoE hybrid Mamba-Attention model with 550B total and 55B active parameters. NVIDIA reports pre-training on 20T text tokens, followed by extension to 1M context and post-training with SFT, RL, and MOPD. LatentMoE, MTP, NVFP4 pre-training, multi-environ…
Paper · 2026-06-17 · arxiv.org
–
Frontier post-training recipe review with Finbarr Timbers
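This Interconnects article discusses frontier model post-training recipes, including SFT, RL, and evaluation practice. Its focus is how the post-training stage turns a base model into a usable assistant, reasoner, or agent, not a retelling of pre-training scaling laws. The information density of such a recipe review comes from placing public papers, experimental experience, and eval feedback in one engineering framework, which makes it easier to…
Blog · 2026-06-17 · interconnects.ai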
–
ExpRL: Exploratory RL for LLM Mid-Training
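ExpRL uses RL-based mid-training to replace part of the human-curated reasoning traces. It hides the human reference solutions and uses them only to generate problem-specific grading rubrics; the policy samples reasoning trajectories from the original prompt, and an LLM judge then gives outcome-level or process-level dense rewards. The paper reports…
Paper · 2026-06-17 · arxiv.org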
–
TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
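TRACE addresses the problem in RLVR that rollouts are costly yet reward contrast is insufficient, especially in multi-turn ReAct rollouts where a single terminal reward is shared by all intermediate decisions. The method treats each thought-action-observation turn as a semantic node in a tree and, across prompt roots and intermediate prefixes, adaptively allocates con…
Paper · 2026-06-12 · arxiv.org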
–
Open Reproduction of DeepSeek-R1
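Hugging Face open-r1 is an open reproduction repository for DeepSeek-R1. It contains scripts for SFT, GRPO, and synthetic data generation, and uses a Makefile to chain training, generation, and evaluation. The project plans three steps: reproduce R1-Distill, reproduce the pure-RL pipeline of R1-Zero, and demonstrate multi-stage training from base model to RL-tuned model. The README records S…
Project · 2026-06-12 · github.com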
–
AsyncWebRL
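AsyncWebRL tackles system inefficiency and trajectory inefficiency in multi-step RL for visual web agents: it asynchronously overlaps rollout, gradient updates, and policy refresh, and uses an everlasting rollout pool and lightweight screenshot processing to reduce waiting. On the system side, end-to-end training throughput is up to 2.9x faster than WebGym; on the algorithm side, it replaces GRPO's normalization by trajectory length with a constant 1/k, which reduces the distortion that long failed trajectories impose on the learning signal. On WebG…
Paper · 2026-06-10 · arxiv.org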
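The normalization change can be illustrated with per-token loss weights. This is a sketch under assumptions: `k` is treated as a fixed constant supplied by the caller, since the summary does not say how the paper sets it, and `per_token_weights` is a hypothetical helper name.

```python
def per_token_weights(lengths, k=None):
    """Loss weight applied to each token of each trajectory.

    k=None      -> normalize by trajectory length (weight 1/len).
    k=<number>  -> normalize by a constant (weight 1/k for every token).
    """
    if k is None:
        return [1.0 / n for n in lengths]
    return [1.0 / k for _ in lengths]


# A short successful trajectory (10 tokens) and a long failed one (200 tokens).
lengths = [10, 200]
length_normalized = per_token_weights(lengths)
constant_normalized = per_token_weights(lengths, k=100)
```

Under length normalization a token in the 200-token trajectory carries one twentieth of the weight of a token in the 10-token one; with a constant divisor every token carries the same weight regardless of how long its trajectory ran.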
–
Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces
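OPT constructs a family of optimization-style reasoning tasks of the form "find a high-value solution among multiple feasible answers". Each task provides a feasibility checker and an evaluator, and uses a complexity parameter to expand the search space without new human labels; on the training side, it studies solver-guided online policy optimization and, when no solver is available, search-based offline RL. Its technical interest lies in extending verifiable rewards from math/code…
Paper · 2026-06-07 · arxiv.org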
–
InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain
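The paper proposes InfoMem, which trains chunk-wise long-context memory agents with answer-conditioned information gain. The core reward measures how much the final memory improves the per-token log-likelihood of the ground-truth answer, instead of looking only at the sparse final answer or lexical overlap. Under the same GRPO framework and training budget, it outperforms comparable RL m…
Paper · 2026-06-04 · arxiv.org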
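A minimal sketch of a reward in this spirit is shown below. It assumes the baseline is the same scorer evaluated without the memory, which the summary does not state, and the function names are hypothetical.

```python
def mean_logprob(token_logprobs):
    """Per-token average log-likelihood of the answer tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def information_gain_reward(with_memory, without_memory):
    """Improvement in per-token log-likelihood of the ground-truth answer."""
    return mean_logprob(with_memory) - mean_logprob(without_memory)


# Log-probabilities of the same two ground-truth answer tokens under each condition.
reward = information_gain_reward(
    with_memory=[-0.2, -0.4],
    without_memory=[-1.0, -1.6],
)
```

Unlike an exact-match check, this signal is dense: a memory that makes the correct answer more likely earns credit even when the sampled answer is still wrong.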
–
Policy and World Modeling Co-Training for Language Agents
The paper jointly trains the agent policy and a text world model so that RL rollouts learn action selection and environment dynamics at the same time.
Paper · 2026-06-03 · arxiv.org
–
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
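OpenWebRL studies online multi-turn reinforcement learning for visual web agents, with the emphasis on letting the agent interact and learn by trial and error in dynamic web environments instead of only imitating static supervised trajectories. The paper discusses system issues such as the browser environment, visual observations, action space, rewards, and long-horizon credit assignment. Worth noting: web agent training is shifting from "imitation on screenshots" to "continuous exploration and policy correction inside web pages".
Paper · 2026-06-03 · arxiv.org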
–
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
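Harness-1 externalizes a search agent's evidence, constraints, candidate answers, and check state into a harness, instead of requiring the model to maintain all state itself in an ever-longer transcript. The contribution is changing the object of RL training from a pure dialogue policy to a model plus an external state machine, so that retrieval, citation, and verification steps can be explicitly recorded, checked, and rewarded. Worth noting: the bottleneck for search agents often lies in cross-turn evidence management and self-checking, and this paper turns state management into a trainable interface.
Paper · 2026-06-03 · arxiv.org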
–
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
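SAAS focuses on over-search in agentic search: the model keeps calling search even when its internal knowledge is sufficient, adding cost and noise. The paper uses self-aware reinforcement learning to teach the agent to judge when to retrieve, when to rely on internal knowledge, and when to stop. It is worth reading because it recasts retrieval strategy from "more search is better" into a trainable cost-reliability decision.
Paper · 2026-06-02 · arxiv.org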
–
LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
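LongTraceRL uses search agent trajectories to construct harder long-context training samples: documents that were read but not cited serve as high-confusion distractors, and documents that appeared in search results but were not opened serve as low-confusion distractors. The reward design uses an entity-level rubric reward over gold entities in the reasoning chain, applied only to responses whose final answer is correct, to reduce reward hacking. 4B-3…
Paper · 2026-06-02 · arxiv.org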
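The gating idea can be sketched as follows. The case-insensitive substring match and the fraction-of-entities score are assumptions for illustration; the summary does not say how the paper matches entities or combines this term with the outcome reward.

```python
def rubric_reward(reasoning, gold_entities, answer_correct):
    """Fraction of gold entities mentioned in the reasoning, gated on a correct answer."""
    if not answer_correct or not gold_entities:
        return 0.0
    text = reasoning.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


chain = "Marie Curie moved to Paris, where she later won two Nobel Prizes."
gold = ["Marie Curie", "Warsaw"]

earned = rubric_reward(chain, gold, answer_correct=True)
gated = rubric_reward(chain, gold, answer_correct=False)
```

Because the rubric term is zero whenever the final answer is wrong, a response cannot collect reward by listing entity names without solving the task.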
–
GrepSeek: Training Search Agents for Direct Corpus Interaction
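GrepSeek lets the search agent treat the corpus directly as its environment, using shell commands to find, filter, and combine evidence instead of only calling a prebuilt retrieval index. Training has two stages: an answer-aware Tutor and an answer-blind Planner generate cold-start trajectories, which are then optimized with GRPO; sharded-parallel execution speeds up shell retrieval by up to 7.6x while remaining byte-equivalent. Seven open-domain…
Paper · 2026-06-02 · arxiv.org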
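The "sharded-parallel yet byte-equivalent" property can be illustrated with an in-memory stand-in for a grep over a corpus. This is not GrepSeek's implementation: the function names are hypothetical, and Python threads are used only to show the order-preserving merge, not to reproduce the reported speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def grep_lines(lines, needle):
    """Sequential reference: keep lines containing `needle`, in corpus order."""
    return [line for line in lines if needle in line]


def sharded_grep(lines, needle, num_shards=4):
    """Search contiguous shards in parallel, then concatenate in shard order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in input order, so shard order is preserved.
        parts = list(pool.map(lambda shard: grep_lines(shard, needle), shards))
    return [line for part in parts for line in part]


corpus = [f"doc{i} {'agent' if i % 3 == 0 else 'other'}" for i in range(20)]
same_output = sharded_grep(corpus, "agent") == grep_lines(corpus, "agent")
```

Splitting into contiguous shards and merging in shard order is what keeps the parallel output identical to the sequential one, so the agent sees the same observation either way.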
–
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
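DRIFT targets multi-turn interaction optimization and tries to avoid the dilemma between the expensive rollouts of online RL and the distribution shift of offline SFT. The method recasts KL-regularized RL as equivalent importance-weighted supervised learning: sample offline trajectories from a fixed reference policy, generate weights from returns, then run weighted SFT. Experiments reportedly match or exceed multi-turn RL baselin…
Paper · 2026-06-02 (since 2026-06-01) · arxiv.org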
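One standard way to turn returns into weights follows from the closed-form optimum of KL-regularized RL, where the optimal policy is proportional to the reference policy times exp(return / beta). The sketch below uses that form with self-normalization. Whether DRIFT uses exactly these weights is an assumption, and the function names are hypothetical.

```python
import math


def return_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(return / beta)."""
    top = max(returns)  # subtract the max for numerical stability
    unnormalized = [math.exp((r - top) / beta) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_sft_loss(trajectory_nlls, weights):
    """Weighted sum of per-trajectory negative log-likelihoods."""
    return sum(w * nll for w, nll in zip(weights, trajectory_nlls))


# Offline trajectories sampled once from the fixed reference policy.
returns = [1.0, 0.0]
nlls = [2.0, 3.0]
weights = return_weights(returns, beta=1.0)
loss = weighted_sft_loss(nlls, weights)
```

A smaller `beta` concentrates weight on the highest-return trajectories, while a larger `beta` approaches plain unweighted SFT; no new rollouts are needed during training.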
–
It's Not Just X. It's Y
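The article discusses the role of post-training in the AI training stack and argues against attributing capability gains simply to "data". Its core claim is that post-training has become the engineering layer that turns data into usable behavior, including preference optimization, RL, synthetic tasks, evaluation loops, and product constraints. Worth noting: it separates the "data narrative" from "post-training behavior shaping", which avoids telling the origin of model capability as a single-variable story.
Blog · 2026-06-01 · mail.cyberneticforests.com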
–