Daily Harness

2026-05-29 · Friday, May 29, 2026

Runtime evaluation, harness effects, and the agent skill supply chain


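Today's Highlights

Benchmarks are Not Enough: RAMP - Moves agent evaluation from per-task accuracy to runtime observability: failure propagation, recovery behavior, and resource waste become first-class metrics.

Harness-Bench - Treats "model capability" as a joint property of model × execution harness, challenging the habit of reporting only base-model scores.

Agent Skill Ecosystem Threat Report - Analyzes nearly 4,000 agent skills; the supply-chain threats are drawn from real marketplaces rather than hypothetical attacks.

microsoft/agent-governance-toolkit - Extends agent security from single-point guardrails to a combination of identity, policy, sandboxing, reliability, and fuzzing.

jj-vcs/jj - Jujutsu, a Git-compatible version control system built around a simple operating model and powerful history editing.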

Papers

12 items

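Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution [1] (arxiv.org)

Tags: tool use, protocols and interoperability, observability and debugging, systems and infrastructure

Tool Forge's value is not that it is yet another tool registry, but that it combines tool generation, validation, lifecycle, and routing into one auditable artifact chain. Its limitation is also stated plainly in the abstract: the current numbers are an initial system benchmark and have not yet been shown to hold under adversarial routing, real API grounding, and cross-system evaluation.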

–

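SynthTools: A Framework for Scaling Synthetic Tools for Agent Development [2] (arxiv.org)

Tags: synthetic data and training environments, tool use, benchmarks, systems and infrastructure

SynthTools turns the "tool environment" from a scarce external resource into a controllable synthetic object, which suits training and regression evaluation. Whether synthetic APIs cover the permissions, rate limits, error cases, and business semantics of real interfaces is what it still has to demonstrate before moving from benchmark to production.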

–

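[Highlight] Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems [3] (arxiv.org)

Tags: observability and debugging, evaluation methods, systems and infrastructure

RAMP's main move is to shift agent evaluation from per-task accuracy to runtime observability: failure propagation, recovery behavior, and resource waste become first-class metrics. It also suggests that high scores on static benchmarks can mask capability collapse in serial workflows.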
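As a rough illustration of what runtime-level metrics could look like, the sketch below computes a failure rate, a recovery rate, and a wasted-token share from a step trace. The `Step` fields, the `runtime_metrics` function, and the metric definitions are assumptions made for this example, not RAMP's actual schema or formulas, and failure propagation is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    name: str                       # unique step label (hypothetical schema)
    ok: bool                        # whether the step succeeded
    tokens: int                     # tokens spent on the step
    retry_of: Optional[str] = None  # name of the failed step this one retries


def runtime_metrics(trace: list[Step]) -> dict[str, float]:
    """Summarize recovery and waste for one run; definitions are illustrative."""
    failed = [s for s in trace if not s.ok]
    # A failure counts as recovered when some successful step names it in retry_of.
    recovered = {s.retry_of for s in trace if s.ok and s.retry_of is not None}
    total_tokens = sum(s.tokens for s in trace)
    wasted_tokens = sum(s.tokens for s in failed)
    return {
        "failure_rate": len(failed) / len(trace) if trace else 0.0,
        "recovery_rate": (
            sum(s.name in recovered for s in failed) / len(failed) if failed else 1.0
        ),
        "wasted_token_share": wasted_tokens / total_tokens if total_tokens else 0.0,
    }


# Made-up trace: "parse" fails and is retried successfully, "write" fails for good.
trace = [
    Step("fetch", True, 100),
    Step("parse", False, 300),
    Step("parse_retry", True, 200, retry_of="parse"),
    Step("write", False, 400),
]
metrics = runtime_metrics(trace)
```

A run like this would count as a single failure under per-task accuracy, while the trace-level view separates the recovered failure from the unrecovered one and shows how many tokens went to failed steps.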

–

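[Highlight] Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows [4] (arxiv.org)

Tags: benchmarks, frameworks and scaffolding, systems and infrastructure

This paper treats "model capability" as a joint property of the model and its execution harness, directly challenging the habit of reporting only base-model scores. Its contribution is diagnostic: it brings context management, tool feedback, permissions, recovery, and artifact contracts into a comparable space.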
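One simple way to make the model × harness idea concrete is to score every (model, harness) pair and look at how far a single model's score moves across harnesses. The sketch below does that; the model names, harness names, scores, and the `harness_spread` function are all invented for illustration and are not taken from Harness-Bench.

```python
from collections import defaultdict


def harness_spread(scores: dict[tuple[str, str], float]) -> dict[str, float]:
    """Return, per model, the gap between its best and worst harness score."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for (model, _harness), score in scores.items():
        by_model[model].append(score)
    return {model: max(vals) - min(vals) for model, vals in by_model.items()}


# Hypothetical task success rates keyed by (model, harness).
scores = {
    ("model-a", "harness-1"): 0.62,
    ("model-a", "harness-2"): 0.41,
    ("model-b", "harness-1"): 0.55,
    ("model-b", "harness-2"): 0.52,
}
spread = harness_spread(scores)
```

In this made-up grid, model-a wins under harness-1 and loses under harness-2, which is the kind of reversal a single base-model score would hide.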

–

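Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents [5] (arxiv.org)

Tags: evaluation methods, tool use, systems and infrastructure

It focuses on the ability to stop early rather than to call tools harder. The direction is practical: when a tool is missing or permissions are insufficient, the agent's main failure is not a wrong answer but continued spending of tokens, time, and side-effect budget.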
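A minimal sketch of the "stop early" behavior, assuming the task's tool and permission requirements are known up front: check them against what is available before spending any budget. The `Task` type, the `check_feasibility` function, and the tool and scope names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    required_tools: set[str]
    required_scopes: set[str] = field(default_factory=set)


def check_feasibility(
    task: Task, available_tools: set[str], granted_scopes: set[str]
) -> tuple[bool, list[str]]:
    """Return (feasible, missing) without calling any tool."""
    missing = [f"tool:{t}" for t in sorted(task.required_tools - available_tools)]
    missing += [f"scope:{s}" for s in sorted(task.required_scopes - granted_scopes)]
    return (not missing, missing)


# Hypothetical task that needs an email tool the agent does not have.
task = Task(required_tools={"search", "send_email"}, required_scopes={"mail.send"})
feasible, missing = check_feasibility(
    task, available_tools={"search"}, granted_scopes=set()
)
```

Here `feasible` is false and `missing` names the blockers, so an agent loop could report the gap and stop instead of retrying tool calls that cannot succeed.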

–

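MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents [6] (arxiv.org)

Tags: agent memory, retrieval and knowledge grounding, reasoning and planning, other verticals

It changes reflective retrieval from "the LLM works out its own retrieval path" to retrieval constrained by the structure of historical memory, which reduces instability. The limitation is that evaluation centers on LoCoMo and the 14B Qwen series; whether cross-domain memory structures are equally reliable needs more data.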

–

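Periodic RoPE for Infinite Context LLMs [7] (arxiv.org)

Tags: context engineering, systems and infrastructure

This is not simple stretched interpolation; it handles local position and global interaction in separate layers, and so in theory avoids unbounded extrapolation. The abstract is still short, and real-world effectiveness depends on whether the task needs exact global order, since the NoPE global layers may give up some positional distinguishability.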

–

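Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval [8] (arxiv.org)

Tags: agent memory, benchmarks, retrieval and knowledge grounding, other verticals

This paper states the evaluation loophole sharply: stuffing the entire memory store back into context can still produce a correct answer. The conclusion leans toward engineering: structured belief state and hard scope isolation may address precision better than larger embeddings, but the external validity of a single-author, 89-case benchmark needs replication.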
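The loophole is easy to see with standard set-based precision and recall. The sketch below is an illustration with made-up memory IDs, not the paper's benchmark or scoring code; `precision_recall` is a name chosen for this example.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall over retrieved memory IDs."""
    unique = set(retrieved)
    hits = len(unique & relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical store of ten memories, only one of which answers the query.
memory_ids = [f"m{i}" for i in range(1, 11)]
relevant = {"m3"}

dump_all = precision_recall(memory_ids, relevant)  # return the whole store
targeted = precision_recall(["m3"], relevant)      # return only the scoped hit
```

Both strategies reach full recall, so a recall-only or answer-only metric cannot tell them apart; precision is what separates returning one item from returning ten.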

–

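LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? [9] (arxiv.org)

Tags: benchmarks, retrieval and knowledge grounding, tool use, computer and Web

It points out that search benchmarks may reward "memory verification" rather than "evidence discovery". LiveBrowseComp uses recent, low-salience facts to cut off parametric memory, which makes it more diagnostic of a search agent's retrieval chain, query generation, and evidence dependence.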

–

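MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content [10] (arxiv.org)

Tags: security and attack/defense, tool use, computer and Web

The paper's key insight is that mobile GUI agents look at pixels and cannot reliably distinguish trusted UI from user-generated content. Worse, realism and attack success are uncorrelated, so filtering on visual quality alone is not a defense.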

–

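[Highlight] Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem [11] (arxiv.org)

Tags: security and attack/defense, skill systems, systems and infrastructure

Skills are becoming the package ecosystem for agents, so supply-chain threats extend from library code to combinations of instructions, scripts, and permissions. The report's value is that its samples come from real marketplaces rather than only hypothetical attacks.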

–

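A Unified Framework for the Evaluation of LLM Agentic Capabilities [12] (arxiv.org)

Tags: evaluation methods, frameworks and scaffolding, systems and infrastructure

It echoes Harness-Bench: benchmark scores mix in scaffold and environment variance. The value of a unified framework is decoupling framework effects, environment effects, and model capability; the risk is that a fixed ReAct architecture itself becomes a new measurement bias.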

–

Open Source / Projects

12 items

Hallucinate - Massively Multiplayer Online Rave [13] (hallucinate.site)

A multiplayer online music and visual interaction experiment; the HN description also links an open-source repository.

–

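Bttf is a command line datetime Swiss army knife [14] (github.com)

bttf is not designed as a POSIX date-compatible replacement; it is a time-handling tool aimed at modern time zones, RFC formats, and pipeline composition. The examples range from the current time to a timeline of git file times, which makes it look more like a composable datetime toolkit.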

–

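Creusot helps you prove your Rust code is correct [15] (github.com)

Creusot's engineering approach uses the Rust type system and the Why3 proof ecosystem to connect real code with formal verification. For ordinary projects the barrier lies in annotations, the opam/Why3 toolchain, and proof maintenance, but for high-reliability Rust algorithm libraries it has real practical value.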

–

The Anatomy of an LLM [16] (royvanrijn.com)

An interactive explainer that visually breaks down the basic structure and inference flow of an LLM.

–

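Hermes Desktop [17] (github.com)

Tags: frameworks and scaffolding, computer and Web

Its engineering value lies in packaging rather than rewriting the agent: it bundles the Python agent, Vue/Koa UI, Electron updates, and platform installers into a single download. The risk also comes from bundling: upstream hermes-agent, web-ui, the Python version, and the patch chain all need ongoing maintenance.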

–

Stripeek [18] (github.com)

A local Stripe API proxy and TUI for observing requests and responses between the SDK and Stripe.

–

Open Agent Tools Coder [19] (github.com)

Tags: tool use, distillation and compression, coding

A local coding agent that experiments with delegating tool calls to smaller models.

–

Roar [20] (github.com)

A macOS command-line notification tool for scripts and long-running background tasks.

–

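LiteParse [21] (github.com)

Tags: tool use, retrieval and knowledge grounding, data and analytics

LiteParse's positioning is clear: give agents parsing that is fast enough, local enough, and structured enough, rather than using a cloud LLM for heavyweight document understanding. Complex tables, handwriting, and scanned PDFs are still explicitly left to LlamaParse.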

–

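AG2B [22] (ag2b.ai)

Tags: frameworks and scaffolding, protocols and interoperability, tool use, computer and Web

A browser-side agent runtime that exposes tools via WebMCP and runs the agent loop in the frontend.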

–

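Ktx [23] (github.com)

Tags: retrieval and knowledge grounding, workflow and control flow, data and analytics

ktx reframes the data-agent problem from "let the model guess table names and write SQL" to "first build a reviewable semantic layer with join and metric constraints". This file-based, git-reviewed design helps governance, but accuracy still depends on the team continuing to approve context changes.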

–

Py-SQL-cleaner [24] (github.com)

A CLI tool for formatting SQL embedded in Python strings.

–

Industry News

9 items

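Claude Opus 4.8 [25] (anthropic.com)

Technically, this release still lacks detailed, citable metrics. What can be said is that Anthropic continues iterating on enterprise and coding capability in the Opus 4.x line; without system card details, marketing claims should not be treated as capability conclusions.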

–

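Dynamic Workflows in Claude Code [26] (claude.com)

Tags: workflow and control flow, coding

This is not just a prompt template; it turns recurring Claude Code tasks into callable workflows. The key questions will be workflow auditability, parameter boundaries, and failure recovery, not whether it can generate a script.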

–

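Anthropic raises $65B in Series H funding at $965B post-money valuation [27] (anthropic.com)

The size of the round pushes frontier-lab capital requirements higher still: training, inference, enterprise go-to-market, and safety compliance are all turning into very large fixed costs. On the technical side, it will accelerate Anthropic's investment in enterprise agents, Claude Code/Cowork, and infrastructure.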

–

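OpenAI’s Frontier Governance Framework [28] (openai.com)

Tags: alignment and governance, systems and infrastructure

The framework takes the parts of the Preparedness Framework that relate to regulatory obligations and puts them into public governance form. It is not a statement of new model capability; it turns frontier risk processes into a document aimed at regulation, audit, and external communication.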

–

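Catch up on 12 major I/O 2026 moments [29] (blog.google)

Google's narrative extends Gemini from a chat product to the underlying model for Search, developer tools, creative tools, and an enterprise agent platform. Notably, Flash is positioned as the scalable workhorse, while Omni covers multimodal creation and the video/world-model direction.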

–

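Data Formulator 0.7: AI-powered data analytics for enterprise data [30] (microsoft.com)

Tags: workflow and control flow, tool use, data and analytics

Data Formulator 0.7 puts enterprise data connections, agent-guided exploration, and visualization refinement into one workspace. The key design choice is not to have a chat box answer with numbers directly, but to let the agent access data sources, loaded tables, past charts, and goals, and generate reproducible code and editable charts.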

–

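AMD pulls a bait-and-switch on Linux users with Vivado licensing changes [31] (itsfoss.com)

The core change reported by ItsFOSS is that from Vivado 2026.1 the free Basic tier supports only Windows, and Linux support moves into the Core tier at roughly $1,200-1,800 per year. An AMD forum reply advises non-paying users to stay on 2025.2, but that version will later lose official support. The change is a real barrier for students, hobbyists, and Linux-native FPGA flows.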

–

EU fines Temu €200M for allowing sale of illegal products [32] (bbc.co.uk)

The BBC reports that the EU has fined Temu €200 million under the Digital Services Act.

–

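W3C Leadership Transition [33] (w3.org)

The W3C announcement confirms that the standards body is entering a leadership transition. For Web standards, the effect of such a change is usually not an immediate shift in any single technical direction, but in how agenda priorities, member coordination, and cross-browser consensus mechanisms carry on.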

–

Blog Posts

10 items

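sqlite AGENTS.md [34] (simonwillison.net)

Tags: security and attack/defense, coding

This short post captures a new reality of open-source maintenance: projects are not rejecting AI assistance, they are rejecting unauditable agent code flowing into the mainline. SQLite's boundary is specific: bug reports may be agentic, but code contributions are still rewritten by human maintainers.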

–

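I analysed 20 years of my chats [35] (drobinin.com)

The author turns 1.2 million messages across 20 years of chat history into a relationship map, comparing message volume, average length, vocabulary overlap, session count, conversation-days, and emotional diversity. The most interesting finding is not the shrinking network itself, but that after 75% of the network dropped away, conversation-days per year stayed at about 360, just spread across fewer people.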

–

Can we have the day off? [36] (mlsu.io)

A personal essay about workdays, time off, and organizational rhythm that drew extensive discussion on HN.

–

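Disagreement among frontier LLMs on real-world fact-checks [37] (lenz.io)

Tags: evaluation methods, research and science

The study tests the four-level verdicts of five frontier LLMs on 1,000 real fact-check claims, and agreement is not high: on 67% of claims at least one model disagrees with the majority, 34% show substantive disagreement of two or more buckets, and Krippendorff's ordinal alpha is 0.639. Its value is that it does not rely on benchmark gold labels and instead measures inter-model instability on real requests.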
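The two headline percentages can be computed from a claims × models table of ordinal verdicts. The sketch below assumes verdicts are encoded as integers 0 to 3 and uses a tiny invented table; the function name and the exact definitions (dissent from the modal verdict, a max-min gap of at least two buckets) are this example's assumptions, and Krippendorff's alpha is not computed here.

```python
from collections import Counter


def disagreement_stats(verdicts: list[list[int]]) -> tuple[float, float]:
    """Return (share of claims with any dissent, share with a gap of >= 2 buckets).

    Each row holds one claim's verdicts, one integer bucket per model.
    """
    any_dissent = 0
    wide_gap = 0
    for row in verdicts:
        modal_count = Counter(row).most_common(1)[0][1]
        if modal_count < len(row):      # someone differs from the modal verdict
            any_dissent += 1
        if max(row) - min(row) >= 2:    # verdicts at least two buckets apart
            wide_gap += 1
    n = len(verdicts)
    return any_dissent / n, wide_gap / n


# Invented table: four claims judged by five models on a 0-3 scale.
table = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 2, 1],
    [0, 0, 2, 0, 0],
    [3, 3, 3, 3, 3],
]
dissent_share, wide_gap_share = disagreement_stats(table)
```

On this toy table, half of the claims have some dissent and a quarter have a wide gap, which mirrors the distinction the study draws between any disagreement and substantive disagreement.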

–

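How long until AI automates all cognitive labor? [38] (futuresearch.ai)

FutureSearch aggregates 2023-2026 timeline updates from multiple researchers on when most purely cognitive labor can be automated by AI at better quality, speed, and cost. The author observes that most forecasts moved earlier during 2023-2025, moved later for a while in 2025-2026, and that everyone who updated between January and April 2026 pulled their timeline closer again.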

–

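Just Use Postgres for Durable Workflows [39] (dbos.dev)

The article's core claim is that durable execution can be a pattern of application library plus database, rather than necessarily an external orchestration service. This view fits workflows that stay within an existing Postgres boundary, but side effects that cross systems still need idempotency keys and compensation design.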
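The library-plus-database idea can be sketched as step checkpointing: record each step's output in a table keyed by workflow and step, and return the recorded output on replay instead of running the step again. The example below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for Postgres; the `steps` table, the `run_step` helper, and the workflow ID are hypothetical and are not the DBOS API.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
db.execute(
    "CREATE TABLE IF NOT EXISTS steps ("
    " workflow_id TEXT, step TEXT, output TEXT,"
    " PRIMARY KEY (workflow_id, step))"
)


def run_step(workflow_id: str, step: str, fn) -> str:
    """Run fn once per (workflow_id, step); replay the stored output afterwards."""
    row = db.execute(
        "SELECT output FROM steps WHERE workflow_id = ? AND step = ?",
        (workflow_id, step),
    ).fetchone()
    if row is not None:
        return row[0]  # step already completed: skip re-execution
    output = fn()
    db.execute(
        "INSERT INTO steps (workflow_id, step, output) VALUES (?, ?, ?)",
        (workflow_id, step, output),
    )
    db.commit()
    return output


calls = []


def charge_card() -> str:
    calls.append("charge")  # pretend this is an external side effect
    return "charged"


first = run_step("order-42", "charge", charge_card)
second = run_step("order-42", "charge", charge_card)  # simulated replay after restart
```

The second call returns the stored result without invoking `charge_card` again. A crash between the side effect and the insert would still re-run the step, which is why calls that leave the database boundary need their own idempotency key on the remote side.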

–

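Various LLM Smells [40] (shvbsle.in)

Tags: observability and debugging, evaluation methods, systems and infrastructure

A personal technical post cataloguing common engineering smells in LLM applications.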

–

About LLMs at Zig Days [41] (kristoff.it)

A Zig community post discussing arrangements around LLM use, code, and community norms at the events.

–

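The Sequence Opinion #868: Recursion Is the New Scaling Law [42] (thesequence.substack.com)

Tags: test-time compute, workflow and control flow, other verticals

TheSequence discusses whether recursive model calls and system composition are becoming a new path for scaling AI.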

–

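Protestware for coding agents [43] (nesbitt.io)

Tags: security and attack/defense, coding

The post discusses software dependencies, automated execution, and protestware risk in the era of coding agents.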

–

References

55 entries

[1] Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution. arXiv:2605.28000. https://arxiv.org/abs/2605.28000
[2] SynthTools: A Framework for Scaling Synthetic Tools for Agent Development. arXiv:2511.09572. https://arxiv.org/abs/2511.09572
[3] Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems. arXiv:2605.27492. https://arxiv.org/abs/2605.27492
[4] Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows. arXiv:2605.27922. https://arxiv.org/abs/2605.27922
[5] Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents. arXiv:2605.28532. https://arxiv.org/abs/2605.28532
[6] MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents. arXiv:2605.27437. https://arxiv.org/abs/2605.27437
[7] Periodic RoPE for Infinite Context LLMs. arXiv:2605.27980. https://arxiv.org/abs/2605.27980
[8] Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval. arXiv:2605.11325. https://arxiv.org/abs/2605.11325
[9] LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? arXiv:2605.28721. https://arxiv.org/abs/2605.28721
[10] MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content. arXiv:2605.28116. https://arxiv.org/abs/2605.28116
[11] Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem. arXiv:2605.28588. https://arxiv.org/abs/2605.28588
[12] A Unified Framework for the Evaluation of LLM Agentic Capabilities. arXiv:2605.27898. https://arxiv.org/abs/2605.27898
[13] Hallucinate - Massively Multiplayer Online Rave. https://hallucinate.site
[14] Bttf is a command line datetime Swiss army knife. GitHub: BurntSushi/bttf. https://github.com/BurntSushi/bttf
[15] Creusot helps you prove your Rust code is correct. GitHub: creusot-rs/creusot. https://github.com/creusot-rs/creusot/tree/master
[16] The Anatomy of an LLM. https://www.royvanrijn.com/anatomy-of-an-llm/
[17] Hermes Desktop. GitHub: sir1st/hermes-desktop. https://github.com/sir1st/hermes-desktop
[18] Stripeek. GitHub: progapandist/stripeek. https://github.com/progapandist/stripeek
[19] Open Agent Tools Coder. GitHub: district-solutions/open-agent-tools-coder. https://github.com/district-solutions/open-agent-tools-coder
[20] Roar. GitHub: dalemyers/Roar. https://github.com/dalemyers/Roar
[21] LiteParse. GitHub: run-llama/liteparse. https://github.com/run-llama/liteparse/
[22] AG2B. https://ag2b.ai/docs
[23] Ktx. GitHub: Kaelio/ktx. https://github.com/Kaelio/ktx
[24] Py-SQL-cleaner. GitHub: enumura1/py-sql-cleaner. https://github.com/enumura1/py-sql-cleaner
[25] Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8
[26] Dynamic Workflows in Claude Code. https://claude.com/blog/introducing-dynamic-workflows-in-claude-code
[27] Anthropic raises $65B in Series H funding at $965B post-money valuation. https://www.anthropic.com/news/series-h
[28] OpenAI’s Frontier Governance Framework. https://openai.com/index/openai-frontier-governance-framework
[29] Catch up on 12 major I/O 2026 moments. https://blog.google/innovation-and-ai/technology/ai/io-2026-keynote-moment-videos/
[30] Data Formulator 0.7: AI-powered data analytics for enterprise data. https://www.microsoft.com/en-us/research/blog/data-formulator-0-7-ai-powered-data-analytics-for-enterprise-data/
[31] AMD pulls a bait-and-switch on Linux users with Vivado licensing changes. https://itsfoss.com/news/amd-vivado-bait-and-switch-on-linux-users/
[32] EU fines Temu €200M for allowing sale of illegal products. https://www.bbc.co.uk/news/articles/c1k2ydn1rz8o
[33] W3C Leadership Transition. https://www.w3.org/press-releases/2026/w3c-leadership-transition/
[34] sqlite AGENTS.md. https://simonwillison.net/2026/May/27/sqlite-agents/#atom-everything
[35] I analysed 20 years of my chats. https://drobinin.com/posts/am-i-a-bad-friend/
[36] Can we have the day off? https://mlsu.io/posts/day-off/
[37] Disagreement among frontier LLMs on real-world fact-checks. https://lenz.io/research/llm-disagreement
[38] How long until AI automates all cognitive labor? https://futuresearch.ai/blog/agi-timeline-tracker/
[39] Just Use Postgres for Durable Workflows. https://www.dbos.dev/blog/postgres-is-all-you-need-for-durable-execution
[40] Various LLM Smells. https://shvbsle.in/various-llm-smells/
[41] About LLMs at Zig Days. https://kristoff.it/blog/llms-at-zig-days/
[42] The Sequence Opinion #868: Recursion Is the New Scaling Law. https://thesequence.substack.com/p/the-sequence-opinion-868-recursion
[43] Protestware for coding agents. https://nesbitt.io/2026/05/28/protestware-for-coding-agents.html
[44] Lum1104/Understand-Anything. GitHub: Lum1104/Understand-Anything. https://github.com/Lum1104/Understand-Anything
[45] anthropics/knowledge-work-plugins. GitHub: anthropics/knowledge-work-plugins. https://github.com/anthropics/knowledge-work-plugins
[46] hardikpandya/stop-slop. GitHub: hardikpandya/stop-slop. https://github.com/hardikpandya/stop-slop
[47] affaan-m/ECC. GitHub: affaan-m/ECC. https://github.com/affaan-m/ECC
[48] Leonxlnx/taste-skill. GitHub: Leonxlnx/taste-skill. https://github.com/Leonxlnx/taste-skill
[49] twentyhq/twenty. GitHub: twentyhq/twenty. https://github.com/twentyhq/twenty
[50] obra/superpowers. GitHub: obra/superpowers. https://github.com/obra/superpowers
[51] langfuse/langfuse. GitHub: langfuse/langfuse. https://github.com/langfuse/langfuse
[52] NangoHQ/nango. GitHub: NangoHQ/nango. https://github.com/NangoHQ/nango
[53] vllm-project/vllm. GitHub: vllm-project/vllm. https://github.com/vllm-project/vllm
[54] microsoft/agent-governance-toolkit. GitHub: microsoft/agent-governance-toolkit. https://github.com/microsoft/agent-governance-toolkit
[55] jj-vcs/jj. GitHub: jj-vcs/jj. https://github.com/jj-vcs/jj

GitHub Trending

Lum1104/Understand-Anything [44] (github.com)

anthropics/knowledge-work-plugins [45] (github.com)

hardikpandya/stop-slop [46] (github.com)

affaan-m/ECC [47] (github.com)

Leonxlnx/taste-skill [48] (github.com)

twentyhq/twenty [49] (github.com)

obra/superpowers [50] (github.com)

langfuse/langfuse [51] (github.com)

NangoHQ/nango [52] (github.com)

vllm-project/vllm [53] (github.com)

[Highlight] microsoft/agent-governance-toolkit [54] (github.com)

[Highlight] jj-vcs/jj [55] (github.com)
