每日 Harness

2026-05-30 · Saturday, May 30, 2026

智能体安全工程化

视图 · View

今日重点 · Today's Highlights

PhoneWorld: Scaling Phone-Use Agent Environments1 - 把移动 GUI agent 的瓶颈从单个 benchmark 转成可扩展环境生成管线。

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents2 - 指出检索内容的“相关性”本身会成为安全对齐退化的触发条件。

全文 ↓

AIRGuard: Guarding Agent Actions with Runtime Authority Control3 - 把 agent 安全边界放到工具动作执行时刻，用运行时授权约束外部副作用。

全文 ↓

Tiny-vLLM4 - 用 C++/CUDA 从零实现小型 vLLM，把 KV cache、continuous batching 和 PagedAttention 讲清楚。

全文 ↓

AISlop5 - 用确定性规则扫描 AI 代码异味，为 coding agent 输出提供快速质量门。

全文 ↓

论文 · Papers

12 项 · 论文

本期重点PhoneWorld: Scaling Phone-Use Agent Environments 1arxiv.org原文 ↗

合成数据与训练环境执行环境与沙箱计算机·Web

这篇工作的重点不是再做一个移动端 benchmark，而是把“环境供给”工程化：从真实轨迹恢复关键屏幕、状态变化和可验证目标，再生成可运行任务。它的局限也在这里：mock 应用和规则验证器能放大规模，但真实系统中的异步状态、账号权限和后端副作用仍可能被简化。

–

本期重点Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents 2arxiv.org原文 ↗

安全与攻防检索与知识接地计算机·Web

论文把 RAG 安全问题从“恶意网页注入”推进到更麻烦的层面：相关性本身就是触发条件。它说明安全来源并不自动等于安全上下文，尤其当 agent 把检索材料当作完成任务的证据时，拒答策略会被任务相关信号稀释。

–

VikingMem: A Memory Base Management System for Stateful LLM-based Applications 6arxiv.org原文 ↗

Agent 记忆检索与知识接地系统·基础设施

这篇论文把 agent memory 当作数据管理系统，而不是提示词附属品。event/entity 分层的好处是能承载纠错、时间衰减和实体演化；风险是评估主要落在检索有效性，记忆写入错误、隐私边界和跨应用 schema 演进还需要更硬的治理机制。

–

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory 7arxiv.org原文 ↗

评测方法 Agent 记忆检索与知识接地其他垂直

它解决的是 memory 评测里常见的“看似检索好，其实只是实体词撞上了”的问题。结论也很实际：更大的 encoder 不必然更好，MiniLM 在一些轴上胜过 BGE-large，说明 memory retriever 需要按查询类型路由，而不是只堆模型参数。

–

SkillsInjector: Dynamic Skill Context Construction for LLM Agents 8arxiv.org原文 ↗

技能系统上下文工程其他垂直

这篇工作抓住了 skills 机制的核心矛盾：技能库越丰富，静态塞进上下文越容易把模型带偏。动态规划器把 skill 选择变成执行相关的检索和改写问题，但效果会依赖 skill 描述质量以及训练时的任务分布。

–

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems 9arxiv.org原文 ↗

上下文工程评测方法系统·基础设施

它把“格式选择”从工程品味拉回到可测量变量：agent 系统中每一步都复制状态、工具结果和结构化参数，冗余 notation 会被循环放大。真正的价值在于提醒评测应同时报告任务质量与 token/latency，而不是只看成功率。

–

本期重点AIRGuard: Guarding Agent Actions with Runtime Authority Control 3arxiv.org原文 ↗

安全与攻防执行环境与沙箱工具使用系统·基础设施

这篇论文的判断很正确：agent 风险真正落地在“动作执行”时刻。把权限检查放在 action boundary，比在自然语言层面要求模型自律更稳；挑战是权限策略必须足够细粒度，否则会在可用性和安全性之间来回摆动。

–

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis 10arxiv.org原文 ↗

基准可观测性与调试上下文工程系统·基础设施

它指出日志压缩不是“越短越好”，而是要保留根因证据。对 LLM RCA 来说，tail/grep 这类传统启发式便宜但容易漏掉跨段证据，LLM 摘要又可能压掉异常细节；benchmark 的价值在于让压缩策略和诊断成功率绑定。

–

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation 11arxiv.org原文 ↗

评测方法基准编码

这篇论文直接挑战“pass rate 足够”的评测习惯。它显示不同模型的 Lucky rate 可从 0.5% 到 23.2%，按过程质量重新排名甚至能移动五个名次；这对 coding agent 很关键，因为混乱通过的补丁在真实工程里往往更难维护。

–

GroundAct: Can LLM Agents Ground Actions in Environmental States?12arxiv.org原文 ↗

基准推理与规划计算机·Web

digest 标题强调 action grounding，页面摘要实际展示的是 embodied reasoning benchmark。最有价值的发现是完整环境信息反而会降低协作表现，说明模型不是缺信息，而是缺少从状态中过滤任务相关约束的机制。

–

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines 13arxiv.org原文 ↗

评测方法工具使用系统·基础设施

这类评测补上了 agent 可靠性的一块盲区。即便最终成功率相近，路径不稳定也会增加缓存、审批、审计和事故复现成本；但一致性不能单独当目标，因为 agent 也可能稳定地执行错误策略。

–

Governing Technical Debt in Agentic AI Systems 14arxiv.org原文 ↗

对齐与治理可观测性与调试系统·基础设施

这是一篇偏治理框架的论文，价值在于把 agent 问题从单次 prompt 质量提升到生命周期管理。它提醒团队：记忆、工具权限、评测 harness 和监控策略都是债务来源，越晚标准化，越难解释系统为什么做出某个动作。

–

开源 / 项目 · Projects

12 项 · 开源 / 项目

Mira 15github.com原文 ↗

github.com

Mira 做的是“文件名记不住也能找文件”的本地搜索。它把 metadata、全文和语义匹配合并排序，实用点在于 live indexing 与本地处理；但语义索引的质量会受文档解析和 embedding provider 影响。

–

airtop 16github.com原文 ↗

github.com

它的工程看点是把无线观察从抓包工具转成实时 TUI，而且避免 monitor mode 的连接中断。局限也清楚：它看到的是内核已经流经的 Wi-Fi 视角，不是专业频谱仪，适合运维观察而非完整 RF 测量。

–

MLPico 17github.com原文 ↗

github.com

这个项目的价值在资源约束场景：不引入动态分配，用双槽 buffer 复用层间激活，降低 RAM 占用。它不是通用训练框架，而是把小型 MLP 推理压到嵌入式可控边界内。

–

P2P proof of concept for ACP decentralized agent communication 18github.com原文 ↗

协议与互操作多智能体系统·基础设施

这类项目的技术问题不在“能发消息”，而在身份、路由、权限和失败恢复。作为 POC，它适合验证 agent 通信协议形态；距离生产系统还需要认证、审计和流控设计。

–

theta-spec 19github.com原文 ↗

协议与互操作框架与脚手架技能系统系统·基础设施

它试图解决 agent 配置被各家 CLI/harness 锁死的问题。统一规格的价值在团队可迁移和可审计；真正难点会在语义差异，例如不同工具权限、hook 生命周期和 skill 调度并不总能无损映射。

–

textsnap 20github.com原文 ↗

github.com

项目定位非常清楚：一条命令把视觉文本转成可复制文本，不依赖云端或 GPU。适合 receipts、截图、网页主图 OCR；但网页模式会先抽主内容和 prominent image，对复杂网页和多图文档未必完整。

–

本期重点AISlop 5github.com原文 ↗

可观测性与调试评测方法编码

AISlop 的取舍是明确不用 LLM 做判断，因此可重复、快、适合 CI。它检测的是模式化腐烂，不是语义 bug；最好作为 agent 编辑后的第一层卫生检查，再把剩余问题交给人或更强的 review agent。

–

Agent Memory Guard 21github.com原文 ↗

安全与攻防 Agent 记忆系统·基础设施

它把防线放在 memory read/write 边界，这是比只过滤用户输入更贴近威胁模型的做法。公开 benchmark 报告 55 个 payload 上 92.5% recall、100% precision、59微秒 median latency；但生产中策略质量仍取决于哪些 key 被定义为 protected。

–

本期重点Tiny-vLLM 4github.com原文 ↗

框架与脚手架系统·基础设施

它不是又一个包装库，而是把推理服务器关键机制拆成可读课程和源码。适合理解 vLLM 背后的内存与 kernel 工程；若要生产部署，还需要补齐模型覆盖、调度鲁棒性和运维能力。

–

Claude-code-replay 22github.com原文 ↗

可观测性与调试编码

这个项目关注的是 agent 开发中的可追溯性：当代码变化来自多轮工具调用时，日志比最终 diff 更能解释意图。它的有效性取决于 Claude Code 日志是否完整记录写入内容和路径。

–

Promptloop 23github.com原文 ↗

评测方法系统·基础设施

它解决的是 prompt 开发缺少轻量版本化实验的问题。CLI 形态有利于进入 CI 和脚本；深一层的价值取决于它是否能记录输入集、模型参数、评分规则和历史结果。

–

Elemental 24github.com原文 ↗

github.com

它的思路接近“少框架、直接 DOM”：不用 JSX 或编译步骤，用嵌套函数表达结构和响应式依赖。优点是透明、低魔法；代价是大型应用里的状态组织、组件边界和生态需要自己把握。

–

行业动态 · Industry News

10 项 · 行业动态

OpenAI: A shared playbook for trustworthy third party evaluations 25openai.com原文 ↗

评测方法框架与脚手架其他垂直

这篇文章把“评测结果”扩展为“评测设置 + 有效性证据”。核心信号是：agentic 能力高度依赖 harness 和 token budget，标准化 harness 适合公平比较，但不等于能力上限。

–

OpenAI: Strengthening societal resilience with Rosalind Biodefense 26openai.com原文 ↗

对齐与治理安全与攻防研究·科学

这条新闻的重要性在应用治理：生物安全领域需要能力释放与访问控制同时设计。它不是普通模型发布，而是把高敏场景中的授权对象、评估和使用边界推到台前。

–

How Braintrust turns customer requests into code with Codex 27openai.com原文 ↗

评测方法可观测性与调试编码

这类案例的看点不在“AI 写代码”，而在客户反馈到 eval/code 的闭环。Braintrust 本身做 eval 基础设施，所以它展示的是 coding agent 与实验平台结合后的产品迭代模式。

–

Liquid AI reveals 8B-A1B MoE trained on 38T 28liquid.ai原文 ↗

liquid.ai

这条发布继续强化一个趋势：端侧模型不只靠小 dense model，MoE 也在尝试把容量和激活成本拆开。真正要观察的是部署栈、内存占用和长上下文/工具场景，而不只是 benchmark 表。

–

9 demos of Gemini Omni and Gemini 3.5 in action 29blog.google原文 ↗

blog.google

demo 的价值是暴露产品方向：Google 正把模型能力包装成端到端交互体验，而非单点 benchmark。需要保留判断的是 demo 场景通常经过选择，不能直接推断稳定性和失败率。

–

Blue Origin's New Glenn rocket exploded during a static fire test 30arstechnica.com原文 ↗

arstechnica.com

这类事故的技术含义取决于损坏范围和根因：静态点火本来就是为了在不发射时暴露推进、结构或地面系统问题。对 New Glenn 而言，影响会落在调查周期、硬件替换和发射排期。

–

Undisclosed addition in jqwik instructed AI coding agents to delete app output 31arstechnica.com原文 ↗

安全与攻防编码

这条新闻把“源码注释/文档里的文字”变成 agent 控制面风险。即使人类 maintainer 认为是表达抗议，coding agent 会把仓库文本当上下文执行，供应链信任边界因此被重画。

–

Robinhood now lets your AI agents trade stocks 32techcrunch.com原文 ↗

安全与攻防工具使用其他垂直

这条新闻的关键不是 API 新增，而是把 agent 权限推进到高风险金融动作。交易场景需要身份、授权、限额、审计、撤销和异常检测；否则“让 agent 操作账户”会把 prompt 风险直接转成资金风险。

–

GTA 6 Developers Unionize 33rockstarintel.com原文 ↗

rockstarintel.com

这条新闻属于游戏产业结构信号：顶级项目团队 unionize 会影响外包、加班、署名、远程政策和发布周期谈判。技术上不改变 GTA 6，但会改变大型 AAA 开发的组织约束。

–

The California state assembly has passed the “Protect Our Games Act”34invenglobal.com原文 ↗

invenglobal.com

这条监管信号把 live-service 游戏的“运营结束”变成消费者权益问题。若后续成法，开发商需要在架构上考虑离线模式、服务器释放、数据迁移或明确授权边界。

–

博客文章 · Blog Posts

10 项 · 博客文章

Claude Opus 4.8: “a modest but tangible improvement”35simonwillison.net原文 ↗

simonwillison.net

这篇记录的价值在措辞：Anthropic 没有把 4.8 包装成断代升级，而是强调可感知但有限的质量提升。对技术读者来说，真正要看的是诚实性和 effort control 是否在实际任务中减少假进展。

–

llm-anthropic 0.25.1 36simonwillison.net原文 ↗

simonwillison.net

这是小版本更新，但对 CLI 用户很实用：模型发布后的真正可用性常取决于周边工具多快跟上。fast mode 暗示模型调用不只是“选哪个模型”，还包括推理努力和延迟成本的配置。

–

datasette 1.0a31 37simonwillison.net原文 ↗

simonwillison.net

Datasette 的方向从“发布 SQLite 数据”走向更完整的数据应用运行时。写查询和 stored query 权限很敏感，因为它们把数据展示工具变成可变系统，权限边界和审计就必须更清楚。

–

Claude Code: Everything you can configure that the docs don’t tell you 38buildingbetter.tech原文 ↗

框架与脚手架工作流与控制流安全与攻防编码

这篇文章像一份“源码考古版操作手册”。它最有价值的点是揭示 Claude Code 的控制面比文档更强：hook 可以改写命令、注入上下文、自动批准或阻断操作，这既是扩展能力也是安全责任。

–

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request 39blog.kog.ai原文 ↗

blog.kog.ai

这篇文章的技术看点是把低延迟解码推到 kernel 调度层：减少 kernel launch 和框架开销，尽量让生成过程常驻 GPU。需要谨慎比较的是模型大小、batch、量化、speculative decoding 与硬件配置，否则 tokens/s 数字不可横向复用。

–

Is AI causing a repeat of frontend’s lost decade?40mastrojs.github.io原文 ↗

mastrojs.github.io

这篇文章不是简单反 AI，而是把技术抽象和劳动分工放在一起看。它提醒我们：抽象层提高产能的同时，也会改变谁被认为有技能、谁能定价、谁承担质量后果。

–

We should be more tired than the model 41vickiboykis.com原文 ↗

vickiboykis.com

这篇文章的观点适合对抗“模型很勤奋所以人可以松手”的错觉。AI 协作的瓶颈往往转移到人类判断：检查边界条件、识别不合逻辑的自信回答、决定何时停止自动化。

–

Where are the economies of scale in homebuilding?42construction-physics.com原文 ↗

construction-physics.com

这篇文章提供了一个反技术决定论案例：不是所有行业都能靠规模把成本压平。住宅建设的“产品”绑定地点、许可和本地劳动力，很多成本随项目重复仍不会像工厂零件一样快速下降。

–

SQLite is all you need for durable workflows 43obeli.sk原文 ↗

工作流与控制流框架与脚手架系统·基础设施

这类设计的好处是把 durable execution 的最小闭环落在一个本地 ACID 数据库上。它适合单节点或边缘部署；一旦需要多写节点、跨区域协调或高并发队列，就要明确何时迁移到更重的系统。

–

On Rendering Diffs 44pierre.computer原文 ↗

pierre.computer

diff UI 看似成熟，但细节很多：行内高亮、语法着色、折叠、冲突块、宽行、移动端和可访问性都会影响 review 质量。文章的价值在于把 diff 当成产品基础设施，而不是纯文本展示。

–

引用来源 · References

54 条 · 引用

1 PhoneWorld: Scaling Phone-Use Agent Environments. arXiv:2605.29486https://arxiv.org/abs/2605.29486 ↩ 回到正文 · back to text
2 Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents. arXiv:2605.29224https://arxiv.org/abs/2605.29224 ↩ 回到正文 · back to text
3 AIRGuard: Guarding Agent Actions with Runtime Authority Control. arXiv:2605.28914https://arxiv.org/abs/2605.28914 ↩ 回到正文 · back to text
4 Tiny-vLLMhttps://github.com/jmaczan/tiny-vllm ↩ 回到正文 · back to text
5 AISlophttps://github.com/scanaislop/aislop ↩ 回到正文 · back to text
6 VikingMem: A Memory Base Management System for Stateful LLM-based Applications. arXiv:2605.29640https://arxiv.org/abs/2605.29640 ↩ 回到正文 · back to text
7 Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory. arXiv:2605.29630https://arxiv.org/abs/2605.29630 ↩ 回到正文 · back to text
8 SkillsInjector: Dynamic Skill Context Construction for LLM Agents. arXiv:2605.29794https://arxiv.org/abs/2605.29794 ↩ 回到正文 · back to text
9 Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems. arXiv:2605.29676https://arxiv.org/abs/2605.29676 ↩ 回到正文 · back to text
10 LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis. arXiv:2605.28876https://arxiv.org/abs/2605.28876 ↩ 回到正文 · back to text
11 AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation. arXiv:2605.12925https://arxiv.org/abs/2605.12925 ↩ 回到正文 · back to text
12 GroundAct: Can LLM Agents Ground Actions in Environmental States?. arXiv:2508.05614https://arxiv.org/abs/2508.05614 ↩ 回到正文 · back to text
13 How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines. arXiv:2605.28840https://arxiv.org/abs/2605.28840 ↩ 回到正文 · back to text
14 Governing Technical Debt in Agentic AI Systems. arXiv:2605.29129https://arxiv.org/abs/2605.29129 ↩ 回到正文 · back to text
15 Mirahttps://github.com/Heidar-An/Mira ↩ 回到正文 · back to text
16 airtophttps://github.com/yeet-src/airtop ↩ 回到正文 · back to text
17 MLPicohttps://github.com/GiorgosXou/MLPico ↩ 回到正文 · back to text
18 P2P proof of concept for ACP decentralized agent communicationhttps://github.com/skorotkiewicz/acp-p2p ↩ 回到正文 · back to text
19 theta-spechttps://github.com/tamarillo-ai/theta-spec ↩ 回到正文 · back to text
20 textsnaphttps://github.com/kouhxp/textsnap ↩ 回到正文 · back to text
21 Agent Memory Guardhttps://github.com/OWASP/www-project-agent-memory-guard ↩ 回到正文 · back to text
22 Claude-code-replayhttps://github.com/glebmish/claude-code-replay ↩ 回到正文 · back to text
23 Promptloophttps://github.com/Bella3202019/promptloop ↩ 回到正文 · back to text
24 Elementalhttps://github.com/fynyky/elemental ↩ 回到正文 · back to text
25 OpenAI: A shared playbook for trustworthy third party evaluationshttps://openai.com/index/trustworthy-third-party-evaluations-foundations ↩ 回到正文 · back to text
26 OpenAI: Strengthening societal resilience with Rosalind Biodefensehttps://openai.com/index/strengthening-societal-resilience-with-rosalind-biodefense ↩ 回到正文 · back to text
27 How Braintrust turns customer requests into code with Codexhttps://openai.com/index/braintrust ↩ 回到正文 · back to text
28 Liquid AI reveals 8B-A1B MoE trained on 38Thttps://www.liquid.ai/blog/lfm2-5-8b-a1b ↩ 回到正文 · back to text
29 9 demos of Gemini Omni and Gemini 3.5 in actionhttps://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni-3-5-videos/ ↩ 回到正文 · back to text
30 Blue Origin's New Glenn rocket exploded during a static fire testhttps://arstechnica.com/space/2026/05/blue-origins-new-glenn-rocket-just-exploded-during-a-static-fire-test/ ↩ 回到正文 · back to text
31 Undisclosed addition in jqwik instructed AI coding agents to delete app outputhttps://arstechnica.com/security/2026/05/fed-up-with-vibe-coders-dev-sneaks-data-nuking-prompt-injection-into-their-code/ ↩ 回到正文 · back to text
32 Robinhood now lets your AI agents trade stockshttps://techcrunch.com/2026/05/27/robinhood-now-lets-your-ai-agents-trade-stocks/ ↩ 回到正文 · back to text
33 GTA 6 Developers Unionizehttps://rockstarintel.com/gta-6-developers-announce-rockstar-games-union/ ↩ 回到正文 · back to text
34 The California state assembly has passed the “Protect Our Games Act”https://www.invenglobal.com/articles/22330/stop-killing-games-movement-gains-momentum-california-assembly-passes-game-protection-bill ↩ 回到正文 · back to text
35 Claude Opus 4.8: “a modest but tangible improvement”https://simonwillison.net/2026/May/28/claude-opus-4-8/#atom-everything ↩ 回到正文 · back to text
36 llm-anthropic 0.25.1https://simonwillison.net/2026/May/28/llm-anthropic/#atom-everything ↩ 回到正文 · back to text
37 datasette 1.0a31https://simonwillison.net/2026/May/29/datasette/#atom-everything ↩ 回到正文 · back to text
38 Claude Code: Everything you can configure that the docs don’t tell youhttps://buildingbetter.tech/p/i-read-the-claude-code-source-code ↩ 回到正文 · back to text
39 Real-time LLM Inference on Standard GPUs: 3k tokens/s per requesthttps://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/ ↩ 回到正文 · back to text
40 Is AI causing a repeat of frontend’s lost decade?https://mastrojs.github.io/blog/2026-05-23-is-AI-causing-a-repeat-of-frontends-lost-decade/ ↩ 回到正文 · back to text
41 We should be more tired than the modelhttps://vickiboykis.com/2026/05/28/we-should-be-more-tired-than-the-model/ ↩ 回到正文 · back to text
42 Where are the economies of scale in homebuilding?https://www.construction-physics.com/p/where-are-the-economies-of-scale ↩ 回到正文 · back to text
43 SQLite is all you need for durable workflowshttps://obeli.sk/blog/sqlite-is-all-you-need-for-durable-workflows/ ↩ 回到正文 · back to text
44 On Rendering Diffshttps://pierre.computer/writing/on-rendering-diffs ↩ 回到正文 · back to text
45 microsoft/markitdownhttps://github.com/microsoft/markitdown ↩ 回到正文 · back to text
46 OpenMOSS/MOSS-TTShttps://github.com/OpenMOSS/MOSS-TTS ↩ 回到正文 · back to text
47 anthropics/claude-codehttps://github.com/anthropics/claude-code ↩ 回到正文 · back to text
48 microsoft/RAMPARThttps://github.com/microsoft/RAMPART ↩ 回到正文 · back to text
49 mastra-ai/mastrahttps://github.com/mastra-ai/mastra ↩ 回到正文 · back to text
50 firecrawl/firecrawlhttps://github.com/firecrawl/firecrawl ↩ 回到正文 · back to text
51 run-llama/liteparsehttps://github.com/run-llama/liteparse ↩ 回到正文 · back to text
52 git-ai-project/git-aihttps://github.com/git-ai-project/git-ai ↩ 回到正文 · back to text
53 ryoppippi/ccusagehttps://github.com/ryoppippi/ccusage ↩ 回到正文 · back to text
54 anthropics/skillshttps://github.com/anthropics/skills ↩ 回到正文 · back to text

智能体安全工程化

论文 · Papers

本期重点PhoneWorld: Scaling Phone-Use Agent Environments1arxiv.org原文 ↗

本期重点Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents2arxiv.org原文 ↗

VikingMem: A Memory Base Management System for Stateful LLM-based Applications6arxiv.org原文 ↗

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory7arxiv.org原文 ↗

SkillsInjector: Dynamic Skill Context Construction for LLM Agents8arxiv.org原文 ↗

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems9arxiv.org原文 ↗

本期重点AIRGuard: Guarding Agent Actions with Runtime Authority Control3arxiv.org原文 ↗

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis10arxiv.org原文 ↗

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation11arxiv.org原文 ↗

GroundAct: Can LLM Agents Ground Actions in Environmental States?12arxiv.org原文 ↗

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines13arxiv.org原文 ↗

Governing Technical Debt in Agentic AI Systems14arxiv.org原文 ↗

开源 / 项目 · Projects

Mira15github.com原文 ↗

airtop16github.com原文 ↗

MLPico17github.com原文 ↗

P2P proof of concept for ACP decentralized agent communication18github.com原文 ↗

theta-spec19github.com原文 ↗

textsnap20github.com原文 ↗

本期重点AISlop5github.com原文 ↗

Agent Memory Guard21github.com原文 ↗

本期重点Tiny-vLLM4github.com原文 ↗

Claude-code-replay22github.com原文 ↗

Promptloop23github.com原文 ↗

Elemental24github.com原文 ↗

行业动态 · Industry News

OpenAI: A shared playbook for trustworthy third party evaluations25openai.com原文 ↗

OpenAI: Strengthening societal resilience with Rosalind Biodefense26openai.com原文 ↗

How Braintrust turns customer requests into code with Codex27openai.com原文 ↗

Liquid AI reveals 8B-A1B MoE trained on 38T28liquid.ai原文 ↗

9 demos of Gemini Omni and Gemini 3.5 in action29blog.google原文 ↗

Blue Origin's New Glenn rocket exploded during a static fire test30arstechnica.com原文 ↗

Undisclosed addition in jqwik instructed AI coding agents to delete app output31arstechnica.com原文 ↗

Robinhood now lets your AI agents trade stocks32techcrunch.com原文 ↗

GTA 6 Developers Unionize33rockstarintel.com原文 ↗

The California state assembly has passed the “Protect Our Games Act”34invenglobal.com原文 ↗

博客文章 · Blog Posts

Claude Opus 4.8: “a modest but tangible improvement”35simonwillison.net原文 ↗

llm-anthropic 0.25.136simonwillison.net原文 ↗

datasette 1.0a3137simonwillison.net原文 ↗

Claude Code: Everything you can configure that the docs don’t tell you38buildingbetter.tech原文 ↗

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request39blog.kog.ai原文 ↗

Is AI causing a repeat of frontend’s lost decade?40mastrojs.github.io原文 ↗

We should be more tired than the model41vickiboykis.com原文 ↗

Where are the economies of scale in homebuilding?42construction-physics.com原文 ↗

SQLite is all you need for durable workflows43obeli.sk原文 ↗

On Rendering Diffs44pierre.computer原文 ↗

GitHub 热门 · GitHub Trending

microsoft/markitdown45github.com原文 ↗

OpenMOSS/MOSS-TTS46github.com原文 ↗

anthropics/claude-code47github.com原文 ↗

microsoft/RAMPART48github.com原文 ↗

mastra-ai/mastra49github.com原文 ↗

firecrawl/firecrawl50github.com原文 ↗

run-llama/liteparse51github.com原文 ↗

git-ai-project/git-ai52github.com原文 ↗

ryoppippi/ccusage53github.com原文 ↗

anthropics/skills54github.com原文 ↗

引用来源 · References

本期重点PhoneWorld: Scaling Phone-Use Agent Environments 1arxiv.org原文 ↗

本期重点Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents 2arxiv.org原文 ↗

VikingMem: A Memory Base Management System for Stateful LLM-based Applications 6arxiv.org原文 ↗

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory 7arxiv.org原文 ↗

SkillsInjector: Dynamic Skill Context Construction for LLM Agents 8arxiv.org原文 ↗

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems 9arxiv.org原文 ↗

本期重点AIRGuard: Guarding Agent Actions with Runtime Authority Control 3arxiv.org原文 ↗

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis 10arxiv.org原文 ↗

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation 11arxiv.org原文 ↗

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines 13arxiv.org原文 ↗

Governing Technical Debt in Agentic AI Systems 14arxiv.org原文 ↗

Mira 15github.com原文 ↗

airtop 16github.com原文 ↗

MLPico 17github.com原文 ↗

P2P proof of concept for ACP decentralized agent communication 18github.com原文 ↗

theta-spec 19github.com原文 ↗

textsnap 20github.com原文 ↗

本期重点AISlop 5github.com原文 ↗

Agent Memory Guard 21github.com原文 ↗

本期重点Tiny-vLLM 4github.com原文 ↗

Claude-code-replay 22github.com原文 ↗

Promptloop 23github.com原文 ↗

Elemental 24github.com原文 ↗

OpenAI: A shared playbook for trustworthy third party evaluations 25openai.com原文 ↗

OpenAI: Strengthening societal resilience with Rosalind Biodefense 26openai.com原文 ↗

How Braintrust turns customer requests into code with Codex 27openai.com原文 ↗

Liquid AI reveals 8B-A1B MoE trained on 38T 28liquid.ai原文 ↗

9 demos of Gemini Omni and Gemini 3.5 in action 29blog.google原文 ↗

Blue Origin's New Glenn rocket exploded during a static fire test 30arstechnica.com原文 ↗

Undisclosed addition in jqwik instructed AI coding agents to delete app output 31arstechnica.com原文 ↗

Robinhood now lets your AI agents trade stocks 32techcrunch.com原文 ↗

GTA 6 Developers Unionize 33rockstarintel.com原文 ↗

llm-anthropic 0.25.1 36simonwillison.net原文 ↗

datasette 1.0a31 37simonwillison.net原文 ↗

Claude Code: Everything you can configure that the docs don’t tell you 38buildingbetter.tech原文 ↗

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request 39blog.kog.ai原文 ↗

We should be more tired than the model 41vickiboykis.com原文 ↗

SQLite is all you need for durable workflows 43obeli.sk原文 ↗

On Rendering Diffs 44pierre.computer原文 ↗

microsoft/markitdown 45github.com原文 ↗

OpenMOSS/MOSS-TTS 46github.com原文 ↗

anthropics/claude-code 47github.com原文 ↗

microsoft/RAMPART 48github.com原文 ↗

mastra-ai/mastra 49github.com原文 ↗

firecrawl/firecrawl 50github.com原文 ↗

run-llama/liteparse 51github.com原文 ↗

git-ai-project/git-ai 52github.com原文 ↗

ryoppippi/ccusage 53github.com原文 ↗

anthropics/skills 54github.com原文 ↗