Daily Harness

2026-06-02 · Tuesday, June 2, 2026

Agent harness attack and defense escalate


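Today's Highlights

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents - Separates harness-updating ability from real task benefit, giving a measurable definition of the capability boundary of self-evolving agents.

Learning Agent-Compatible Context Management for Long-Horizon Tasks - Turns context compression from a fixed policy into an external trainable module, aimed squarely at usability with closed-source agents.

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors - Pushes prompt injection to persistent cross-session control and provides a defense benchmark for local harnesses.

GrepSeek: Training Search Agents for Direct Corpus Interaction - Lets the search agent operate directly on the corpus with shell commands, challenging the default architecture in which retrieval must build an index first.

run-llama/liteparse - Combines local PDF parsing, bounding boxes, OCR, and multi-language bindings into a lightweight document ingestion layer.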

Papers

15 items

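Featured · Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents [1] · arxiv.org

Self-evolution · Frameworks and scaffolding · Systems/Infrastructure

The paper evaluates an agent's harness-updating and harness-benefit separately: the former is whether it can turn execution evidence into useful prompt, skill, memory, and tool updates; the latter is whether the task agent can actually use those updates. The core finding is that update quality is not monotonic in base-model capability: updates generated by Qwen3.5-9B come close to those of Claude Opus 4.6. Benefit is also non-monotonic, with mid-tier models gaining the most. The limitation is clear as well: weak models often fail to activate or follow the relevant harness artifacts.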

–

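From Model Scaling to System Scaling: Scaling the Harness in Agentic AI [2] · arxiv.org

Frameworks and scaffolding · Context engineering · Tool use · Systems/Infrastructure

The paper argues that the harness around the foundation model should be a first-class scaling target for agentic AI, rather than comparing model weights alone. It focuses on system-layer capabilities such as context construction, tool calling, execution orchestration, verification, and memory, and stresses that long-task performance is often determined by the harness. Its value lies in moving agent evaluation from "model scores" toward reproducible experiments on "model plus execution layer".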

–

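Featured · Learning Agent-Compatible Context Management for Long-Horizon Tasks [3] · arxiv.org

Context engineering · Synthetic data and training environments · Research/Science

AdaCoM trains an external LLM to manage the context of a frozen agent, using learnable edit actions to delete stale content while preserving constraints, progress, and evidence. The paper tests on web search and deep research benchmarks and proposes a Fidelity-Reliability Trade-off: strong agents need higher-fidelity context, while weak agents need more aggressive compression. The engineering implication is that context management can be a transferable module, with no need to modify a closed-source agent.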

–

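TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories [4] · arxiv.org

Observability and debugging · Evaluation methods · Coding

TraceGraph pools multi-model agent rollouts into a task-level action-observation graph, marks productive cores and trap regions, and describes trajectories with Access, Trap exposure, and Repair. Across five benchmark splits it shows that a single pass rate hides how models enter traps and how they recover. On SWE-bench, trap-aware recovery raises the resolved rate on the fired subset from 40.4% to 43.5%, which suggests the trajectory graph can be turned directly into a recovery strategy.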

–

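LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis [5] · arxiv.org

Benchmarks · Agent memory · Data/Analytics

LongDS builds long-horizon, multi-turn data analysis tasks from real Kaggle notebooks, requiring the agent to maintain, roll back, compose, and restore analysis state. The benchmark contains 68 tasks and 2,225 turns across six domains, with an average dependency span of 11.3 turns. The best of five SOTA models reaches an average accuracy of only 48.45%, and performance drops by nearly 47 percentage points from early to late turns. The conclusion points to state maintenance, not simply adding agent steps.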

–

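Featured · GrepSeek: Training Search Agents for Direct Corpus Interaction [6] · arxiv.org

Agent RL / verifiable rewards · Tool use · Retrieval and knowledge grounding · Research/Science

GrepSeek has the search agent treat the corpus itself as the environment, using shell commands to find, filter, and combine evidence instead of only calling a prebuilt retrieval index. Training has two stages: an answer-aware Tutor and an answer-blind Planner generate cold-start trajectories, which are then optimized with GRPO. Sharded-parallel execution speeds up shell retrieval by up to 7.6x while staying byte-equivalent. On seven open-domain QA benchmarks it is the strongest overall on token F1 and Exact Match.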
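A minimal sketch of what "sharded-parallel but byte-equivalent" retrieval can mean, assuming a grep-like substring filter over an in-memory list of byte lines. The function names and the thread-pool approach are illustrative assumptions, not GrepSeek's implementation: the corpus is cut into contiguous shards, each shard is searched in parallel, and the results are merged in shard order so the output matches a sequential scan byte for byte.

```python
from concurrent.futures import ThreadPoolExecutor


def grep_lines(lines, needle):
    """Sequential baseline: keep lines containing needle, in corpus order."""
    return [ln for ln in lines if needle in ln]


def sharded_grep(lines, needle, shards=4):
    """Search contiguous shards in parallel and merge them in shard order."""
    size = max(1, -(-len(lines) // shards))  # ceiling division, at least 1
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=shards) as pool:
        # Executor.map yields results in input order, which preserves corpus order.
        parts = list(pool.map(lambda chunk: grep_lines(chunk, needle), chunks))
    return [ln for part in parts for ln in part]


# Hypothetical toy corpus: every third line mentions "harness".
corpus = [("doc %d %s\n" % (i, "harness" if i % 3 == 0 else "other")).encode()
          for i in range(50)]
sequential = b"".join(grep_lines(corpus, b"harness"))
parallel = b"".join(sharded_grep(corpus, b"harness", shards=7))
```

The equivalence depends on contiguous shards and an order-preserving merge; a filter whose result depends on neighbouring lines would need overlap handling at shard boundaries.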

–

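SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search [7] · arxiv.org

Agent RL / verifiable rewards · Retrieval and knowledge grounding · Tool use · Research/Science

SAAS targets over-search in agentic search: the model keeps calling search even when its internal knowledge is already sufficient, adding cost and noise. The paper uses self-aware reinforcement learning so the agent learns to judge when to retrieve, when to rely on internal knowledge, and when to stop. It is worth reading because it changes the retrieval policy from "more search is better" into a trainable cost-reliability decision.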

–

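SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones? [8] · arxiv.org

Benchmarks · Evaluation methods · Research/Science

SoundnessBench evaluates whether LLMs can judge the methodological soundness of a research proposal before the research is carried out. The data consists of 1,099 machine learning research proposals reconstructed from ICLR submissions, with reviewer soundness sub-scores. The 12 frontier LLMs tested broadly show optimism bias, often rating low-soundness ideas as feasible. The paper locates the AI Scientist bottleneck at the "first rigor gate" rather than only at automated experiment execution.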

–

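OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents [9] · arxiv.org

Benchmarks · Skill systems · Other verticals

OpenSkillEval automatically constructs realistic task instances to evaluate both skill-augmented agents and the skills themselves, covering five application types: presentations, frontend design, posters, data visualization, and reports. Experiments use more than 600 dynamically generated tasks and 30 open-source skills. They find that a skill being available does not mean it is used effectively, that effectiveness depends strongly on the model and agent framework, and that popular skills do not consistently beat the no-skill baseline. It turns quality in the open skill ecosystem into something auditable.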

–

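Featured · From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors [10] · arxiv.org

Security and attack/defense · Frameworks and scaffolding · Systems/Infrastructure

The paper introduces ClawTrojan, a study of multi-step trojan backdoors in local agent harnesses that are triggered by files or tool outputs, get persisted, and remain effective across sessions. In an OpenClaw-style workspace the attack success rate against GPT-5.4 reaches 95.5%, while traditional single-turn prompt injection is close to zero on the same model. DASGuard defends by scanning sensitive files for control-like text, tracing provenance, and sanitizing untrusted control content.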

–

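Mellum2 Technical Report [11] · arxiv.org

JetBrains releases Mellum 2, an MoE software engineering model with 12B total parameters and 2.5B active per token, covering code generation, editing, debugging, tool calling, agentic coding, and conversational programming. The architecture uses 64 experts with 8 active, GQA, sliding-window attention, and a Multi-Token Prediction head that can serve as a speculative decoding draft model. Pretraining used about 10.6T tokens, with context extended to 128K. The report also releases base, instruct, and thinking checkpoints under Apache 2.0.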

–

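Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode [12] · arxiv.org

The paper measures batch-1 autoregressive decode, common in robots, in-vehicle systems, and edge copilots, and shows that although it is memory-dominated, it does not simply speed up linearly with peak HBM bandwidth. Three 7-8B GQA models are tested on H100, A100, L40S, and L4. For Qwen-2.5-7B at ctx=2048, the L4 reaches about 81% of the analytic memory floor while the H100 reaches only about 27%. CUDA Graphs give a 1.259x latency improvement on the H100, indicating that launch-side overhead has become the bottleneck on fast cards.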
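To make the "fraction of the analytic memory floor" idea concrete, here is a rough sketch under a simple assumption: the floor is the bytes that must be read per generated token divided by peak memory bandwidth, and the reported percentage is floor latency over measured latency. The paper's exact definition may differ, and every number below is a hypothetical input, not a measurement from the paper.

```python
def memory_floor_s(bytes_read_per_token, peak_bandwidth_bytes_per_s):
    """Lower bound on per-token latency if decode were limited only by memory traffic."""
    return bytes_read_per_token / peak_bandwidth_bytes_per_s


def fraction_of_floor(floor_s, measured_s):
    """1.0 means the measured latency sits exactly on the memory floor."""
    return floor_s / measured_s


# Hypothetical inputs: 14e9 bytes read per token, 300e9 B/s peak bandwidth.
floor = memory_floor_s(14e9, 300e9)
# Hypothetical measured latency of 60 ms per token on the same device.
achieved = fraction_of_floor(floor, 0.060)
```

Under this reading, a faster card shrinks the floor, so any fixed per-token overhead such as kernel launches takes up a larger share of the measured latency.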

–

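DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization [13] · arxiv.org

Agent RL / verifiable rewards · Distillation and compression · Other verticals

DRIFT targets multi-turn interaction optimization and tries to escape the dilemma between expensive rollouts in online RL and distribution shift in offline SFT. The method recasts KL-regularized RL as importance-weighted supervised learning: sample offline trajectories from a fixed reference policy, derive weights from returns, then run weighted SFT. The experiments reportedly match or exceed multi-turn RL baselines while keeping the training efficiency and implementation simplicity of standard SFT.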
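The general idea can be sketched with the textbook form of this equivalence, which is an assumption here and not necessarily DRIFT's exact weighting: the optimum of KL-regularized RL is proportional to the reference policy times exp(return / beta), so trajectories sampled from the reference policy can be fit by maximum likelihood with weights proportional to exp(return / beta). The names `return_weights`, `weighted_sft_loss`, and `beta` are illustrative.

```python
import math


def return_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(return / beta)."""
    top = max(returns)  # subtract the max for numerical stability
    unnorm = [math.exp((r - top) / beta) for r in returns]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_sft_loss(traj_logprobs, weights):
    """Weighted negative log-likelihood over whole trajectories."""
    return -sum(w * lp for w, lp in zip(weights, traj_logprobs))


# Hypothetical batch: two offline trajectories with returns 1.0 and 0.0.
weights = return_weights([1.0, 0.0], beta=1.0)
loss = weighted_sft_loss([-2.0, -3.0], weights)
```

A smaller `beta` concentrates the weight on the highest-return trajectories, while a large `beta` approaches plain SFT on all sampled data.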

–

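LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards [14] · arxiv.org

Agent RL / verifiable rewards · Context engineering · Retrieval and knowledge grounding · Research/Science

LongTraceRL uses search agent trajectories to build harder long-context training samples: documents that were read but not cited serve as high-confusion distractors, and documents that appeared in search results but were never opened serve as low-confusion distractors. The reward is an entity-level rubric reward over gold entities in the reasoning chain, applied only to responses whose final answer is correct in order to reduce reward hacking. Three reasoning LLMs from 4B to 30B consistently outperform strong baselines on five long-context benchmarks.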
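The gating described above can be sketched as follows. This is a simplified illustration, assuming exact-match answer checking and case-insensitive substring matching for entities; the function name and matching rules are assumptions, not the paper's implementation.

```python
def gated_rubric_reward(reasoning, final_answer, gold_answer, gold_entities):
    """Fraction of gold entities mentioned in the reasoning, paid only when the answer is right."""
    if final_answer.strip().lower() != gold_answer.strip().lower():
        return 0.0  # wrong answer: no rubric credit, so name-dropping entities cannot pay off
    if not gold_entities:
        return 0.0
    text = reasoning.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


# Hypothetical example with three gold entities, two of which appear in the reasoning.
entities = ["Marie Curie", "Pierre Curie", "Henri Becquerel"]
chain = "The 1903 prize went to Marie Curie together with Pierre Curie."
right = gated_rubric_reward(chain, "1903", "1903", entities)
wrong = gated_rubric_reward(chain, "1911", "1903", entities)
```

The gate is what separates this from a plain entity-coverage score: a response that lists every gold entity but answers incorrectly earns nothing.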

–

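Skill Availability and Presentation Granularity in Large-Language-Model Agents [15] · arxiv.org

Skill systems · Evaluation methods · Other verticals

SkillsBench is a controlled experiment on whether skill availability and presentation granularity affect agent success rates. It uses 30 domain-balanced tasks, two reasoning models, and six skill conditions, with five trials per task-condition-model cell, for 1,800 rows in total. Skill availability gives the strongest lift: GPT-5.5 gains 26.7-36.0 percentage points over no skill, and DeepSeek V4-Flash gains 18.0-26.0 percentage points. Differences from granularity and examples are small and uncertain.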
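As a quick check of the design size, the row count follows from enumerating every task-condition-model cell and multiplying by the trials per cell. The variable names are illustrative; only the four counts come from the summary.

```python
from itertools import product

TASKS, MODELS, CONDITIONS, TRIALS = 30, 2, 6, 5

# One cell per (task, condition, model) combination.
cells = list(product(range(TASKS), range(CONDITIONS), range(MODELS)))
rows = len(cells) * TRIALS
rows_per_model_condition = TASKS * TRIALS  # trials behind each per-condition success rate
```

With 150 trials behind each model-condition success rate, small differences between granularity conditions are hard to separate from noise, which fits the "small and uncertain" reading.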

–

Open Source / Projects

15 items

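Textile [16] · gettextile.app

Textile is a desktop text-assembly tool that aims to combine commands, clipboard contents, and fixed strings step by step into reusable text flows. It emphasizes local desktop workflows and reusable snippets rather than online collaborative documents. The interesting part is that it abstracts repetitive text production into chainable input steps, which suits templated replies, prompt fragments, and everyday editing automation.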

–

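Dataroom [17] · github.com

Frameworks and scaffolding · Workflows and control flow · Research/Science

Dataroom is a self-hosted research harness for low-budget GPUs and the Raspberry Pi. It puts research runs, resource constraints, and self-hosted deployment in one project; the goal is not large-scale cloud training but letting small devices carry reproducible experiment workflows. Its value is in moving the "research harness" from high-end servers to edge and personal-budget environments.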

–

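Glq LLM quantization using E8 lattice [18] · github.com

GLQ is a post-training LLM weight quantization library that uses an E8 lattice codebook to encode every 8 weights as a 16-bit index, combined with RHT, LDLQ error feedback, and fused CUDA/Triton kernels that run matmul directly on the compressed indices. The README claims support for 2-8 bpw, that SmolLM3-3B at 4.5 bpw beats GPTQ on 10 of 12 metrics, and that 3.5 bpw reaches about 94% of bf16 throughput in vLLM. It also offers E8 compression for the KV cache, which can cut the fp16 footprint to about 25%.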
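For intuition about lattice quantization, the sketch below finds the nearest E8 lattice point to an 8-dimensional vector using the standard construction of E8 as D8 together with D8 shifted by one half in every coordinate. This is the generic textbook decoder, not GLQ's code: how GLQ builds its finite 16-bit codebook, scales weights, or reaches bit rates above the base rate is not covered here.

```python
def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = [round(v) for v in x]
    if sum(r) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding error the other way.
        i = max(range(len(x)), key=lambda k: abs(x[k] - r[k]))
        r[i] += 1 if x[i] > r[i] else -1
    return [float(v) for v in r]


def nearest_e8(x):
    """E8 is D8 plus D8 shifted by (1/2, ..., 1/2); decode in both cosets and keep the closer."""
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([xi - 0.5 for xi in x])]

    def dist(p):
        return sum((pi - xi) ** 2 for pi, xi in zip(p, x))

    return a if dist(a) <= dist(b) else b


# One 16-bit index per group of 8 weights gives the base rate.
BASE_BITS_PER_WEIGHT = 16 / 8
```

The base rate works out to 2 bits per weight, which matches the low end of the 2-8 bpw range claimed in the README.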

–

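DepsGuard [19] · github.com

DepsGuard is a single static Rust binary that scans npm, pnpm, yarn, bun, uv, and Renovate/Dependabot configuration and interactively applies supply-chain hardening settings. It never runs a package install; it only edits configuration files the user approves and writes a backup first. It provides a TUI, a read-only scan, and restore, and supports Linux, macOS, and Windows. The project's approach is to turn common package-manager hardening into a one-shot local configuration audit.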

–

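Valdr [20] · github.com

Valdr is a single-node Valkey/Redis-compatible cache/store rewritten in mostly memory-safe Rust, aimed at being a drop-in replacement for non-clustered single-node deployments. The README marks it as alpha, reports the single-node core test suite as green, and says it can run real workloads over TLS. In its benchmarks, GET, SET, PING, MGET, and several other commands beat Valkey 9.1.0 under the specified configuration. It combines memory safety with a rewrite of core infrastructure, making it a useful case for watching how agentic coding affects systems software ports.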

–

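2-command CLI to give AI agents structured data retrieval on PostgreSQL [21] · github.com

Retrieval and knowledge grounding · Protocols and interoperability · Data/Analytics

Lithium is a structured agent storage engine that runs on PostgreSQL, uses ltree for hierarchical path indexing, and provides a TypeScript API, versioning, and scoped retrieval. The quick path is `npx @lithium-ai/kit init` plus an MCP server, which lets agents such as Claude Code query exact tree-shaped data under a path like `engineering.auth`. It explicitly argues against stuffing all structured state into a vector store, emphasizing determinism, index support, and existing Postgres infrastructure.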
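Scoped retrieval over dot-separated paths can be illustrated in a few lines. This is an in-memory stand-in that mimics a subtree match on label boundaries, similar in spirit to the descendant test that the PostgreSQL ltree `<@` operator performs; it does not use Lithium's API, and the store contents are hypothetical.

```python
from typing import Any, Dict


def scoped(records: Dict[str, Any], scope: str) -> Dict[str, Any]:
    """Return records at `scope` or anywhere beneath it, matching whole labels only."""
    prefix = scope + "."
    return {path: value for path, value in records.items()
            if path == scope or path.startswith(prefix)}


# Hypothetical store keyed by hierarchical path.
store = {
    "engineering.auth": {"owner": "platform"},
    "engineering.auth.tokens": {"ttl_minutes": 15},
    "engineering.authz": {"owner": "security"},
    "engineering.billing": {"owner": "finance"},
}
auth_subtree = scoped(store, "engineering.auth")
```

Matching on the label boundary is the deterministic part: `engineering.authz` shares a string prefix with `engineering.auth` but is a sibling, so it is excluded, which a similarity search over embeddings would not guarantee.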

–

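A CSS 3D Engine [22] · github.com

PolyCSS is a DOM/CSS 3D engine that uses CSS `matrix3d(...)` to render polygon meshes such as OBJ/MTL, GLB, and VOX as real HTML elements. It supports color, textures, lighting, shadows, shapes, and animation, ships vanilla, React, and Vue packages, and can export a standalone HTML snapshot. The interesting part is that it bypasses WebGL and pushes 3D into the CSS/DOM pipeline, which suits cases that need an inspectable DOM or lightweight embedding.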

–

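Cloud CI and agentic workflows for embedded hardware development [23] · github.com

Workflows and control flow · Frameworks and scaffolding · Systems/Infrastructure

Jumpstarter is an open-source test automation framework that connects real and virtual embedded devices through a unified API and plugs into CI/CD. The README highlights hardware abstractions for UART, CAN, SPI, GPIO, power, and USB; integration with the Python/PyTest ecosystem; secure leasing of test hardware across teams; and a shared interface for people, scripts, CI, and AI agents. It moves embedded hardware testing from manual bench work toward programmable, shareable pipelines.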

–

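Vmette [24] · github.com

Execution environments and sandboxes · Security and attack/defense · Protocols and interoperability · Coding

vmette is a local Linux microVM sandbox for macOS, built on Apple's Virtualization.framework, that gives untrusted local AI agents a hardware isolation boundary. By default it has no host filesystem and no network; access requires explicitly shared directories and enabled egress. Every run gets a fresh guest that boots in about 1 second, and it provides a CLI, a Rust/C ABI, a daemon, and an MCP server. It targets the desktop risks of coding and computer-use agents that run unknown code, install dependencies, and encounter prompt injection.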

–

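Poolnarc [25] · github.com

Poolnarc is a Linux tool for detecting hidden cryptocurrency miners, built on two eBPF hooks. It narrows the detection surface to a small number of kernel hooks and identifies hidden miners through behavioral signals rather than relying on static process names or file paths. It works well as a lightweight security experiment for observing the minimal viable form of eBPF in anti-cryptojacking.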

–

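UQLM [26] · github.com

Evaluation methods · Other verticals

UQLM is a Python library open-sourced by CVS Health that uses uncertainty quantification to detect LLM hallucinations. It provides response-level confidence scores, covering black-box consistency, multi-generation semantic entropy, white-box token probability, LLM-as-judge panels, ensembles, and long-text scorers. BlackBoxUQ can generate 5 responses to the same prompt and compute semantic_negentropy. The project is compatible with LangChain Chat Models and suits treating hallucination risk as a quantifiable signal on model output.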
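The intuition behind an entropy-based consistency score can be shown without the library. The sketch below is a toy stand-in, not the UQLM API: it groups sampled responses by normalized text (a real scorer would group by meaning, for example with an entailment model), computes the entropy of the group frequencies, and reports one minus the entropy normalized by log(n). Whether this matches UQLM's exact formula is an assumption.

```python
import math
from collections import Counter


def normalize(text):
    """Crude stand-in for semantic equivalence: lowercase, collapse spaces, drop a final period."""
    return " ".join(text.lower().split()).rstrip(".")


def toy_negentropy(responses):
    """1.0 when all sampled responses agree, 0.0 when every response is different."""
    n = len(responses)
    if n < 2:
        return 1.0
    counts = Counter(normalize(r) for r in responses)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(n)


# Hypothetical samples for one prompt, five responses each.
agree = toy_negentropy(["Paris", "paris.", "Paris", " Paris ", "PARIS"])
mixed = toy_negentropy(["Paris", "Paris", "Paris", "Paris", "Lyon"])
split = toy_negentropy(["Paris", "Lyon", "Nice", "Lille", "Metz"])
```

A low score signals that the model gives inconsistent answers to the same prompt, which is the black-box evidence of hallucination risk that this family of scorers relies on.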

–

AgentThreatBench [27] · github.com

Security and attack/defense · Agent memory · Benchmarks · Systems/Infrastructure

OWASP Agent Memory Guard is an OWASP Incubator project and the reference implementation for ASI06 Memory Poisoning. It acts as a runtime defense layer between the agent and the memory store, screening every read and write and blocking prompt injection, secret leakage, and integrity tampering. The README reports a benchmark of 55 real payloads across 4 threat classes, with 92.5% recall, 100% precision, and a median latency of 59 microseconds.

–

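Ministry of Everything [28] · github.com

Frameworks and scaffolding · Workflows and control flow · Context engineering · Coding

Ministry of Everything is a CLI-first agent harness that lets a single operator drive Claude Code or Codex with durable markdown documents. Each stage produces a canvas that later stages can read without replaying the whole chat, and every turn is committed to a personal Git journal that can be resumed, reverted, audited, and reused. It has no background scheduling and stresses that the human remains the strategist and reviewer, pushing agent coordination cost into Git and markdown.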

–

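Stria [29] · github.com

Context engineering · Retrieval and knowledge grounding · Protocols and interoperability · Coding

Stria is a grammar-free structural codebase indexer and MCP server for LLM agents. It uses neither tree-sitter nor language parsers, generating its structural index through phrase extraction instead. The README claims roughly 0.16 seconds to build and sub-ms queries for a standard repository, and about 80 seconds to index the 3.1GB, 72,000-file Linux kernel from scratch. It gives the agent a 50-token file structure map and then expands function bodies on demand, aiming to reduce wasted context and path guessing.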

–

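RedFlag [30] · github.com

RedFlag is a self-hosted update manager for operators who own their infrastructure. Every command the server issues carries an Ed25519 signature; the agent verifies the signature, nonce, and timestamp and rejects replays. It supports Linux, Windows, and Docker container updates, and installs only after human approval. The project is also testing a supply-chain gate: it issues a capability token for the resolved package closure, and a network-less privileged executor verifies the signature and artifact hash.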
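The verify-then-reject-replay flow can be sketched with the standard library. One loud caveat: this sketch uses HMAC-SHA256 with a shared key as a stand-in for the Ed25519 signatures RedFlag actually uses, because Ed25519 is not in the Python standard library. The key, the five-minute window, and all names are hypothetical; only the three checks (signature, timestamp freshness, unseen nonce) come from the summary.

```python
import hashlib
import hmac
import json
import time

DEMO_KEY = b"demo-shared-key"  # hypothetical; a real system would verify a public-key signature
MAX_SKEW_S = 300  # hypothetical freshness window


def sign(command, nonce, ts, key=DEMO_KEY):
    """Sign a canonical encoding of the command together with its nonce and timestamp."""
    message = json.dumps({"cmd": command, "nonce": nonce, "ts": ts}, sort_keys=True).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class CommandVerifier:
    def __init__(self, key=DEMO_KEY):
        self.key = key
        self.seen_nonces = set()  # a real agent would expire old entries

    def accept(self, command, nonce, ts, signature, now=None):
        now = time.time() if now is None else now
        if not hmac.compare_digest(signature, sign(command, nonce, ts, self.key)):
            return False  # forged or tampered command
        if abs(now - ts) > MAX_SKEW_S:
            return False  # stale or future-dated command
        if nonce in self.seen_nonces:
            return False  # replay of a command already accepted
        self.seen_nonces.add(nonce)
        return True
```

Because the nonce and timestamp are covered by the signature, an attacker who captures a valid command cannot refresh either field without invalidating it.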

–

Industry News

13 items

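OpenAI frontier models and Codex are now available on AWS [31] · openai.com

OpenAI announces that its frontier models and Codex are available on AWS, which means enterprises can adopt OpenAI models and coding-agent capabilities within their existing AWS procurement, governance, and infrastructure. This is a partnership at the distribution-channel and cloud-ecosystem level, not an update to any single model's capability. It reflects frontier-model competition expanding from the API itself to availability on hyperscalers.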

–

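Alphabet announces $80B equity capital raise to expand AI infra and compute [32] · abc.xyz

Alphabet announces a proposed $80 billion equity capital raise to expand AI infrastructure and compute. The core of the event is capital structure serving compute expansion, not a product launch. It is notable because AI infrastructure buildout has reached balance-sheet scale, with financing, data centers, and chip supply becoming preconditions for model competition.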

–

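Anthropic confidentially submits draft S-1 to the SEC [33] · anthropic.com

Anthropic announces that it has confidentially submitted a draft S-1 to the US SEC, an early regulatory step in a potential IPO process. This is not an immediate listing; it is the confidential submission stage. Its industry significance is that the financing path for frontier model companies continues to extend from private markets toward the public markets.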

–

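Building the infrastructure for the Intelligence Age in Michigan [34] · openai.com

OpenAI announces it is advancing a Stargate-related 1GW data center project in Michigan, framed as infrastructure for the Intelligence Age. The focus is large-scale power and data center capacity, not a single model release. It shows AI industry news moving ever closer to questions of energy, local investment, and long-term compute siting.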

–

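DuckDuckGo makes its 'no-AI' search engine easier to access as its traffic booms [35] · techcrunch.com

TechCrunch reports that DuckDuckGo has adjusted its product entry points to make no-AI search easier to access, at a time when its traffic is growing. The core of the story is that search experiences are diverging: some users explicitly want AI answers, while others want traditional link-based search. It reflects that, as AI search becomes widespread, a non-AI mode has itself become a product positioning and traffic strategy.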

–

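Florida sues OpenAI and Sam Altman over AI risks [36] · politico.com

Politico reports that Florida has sued OpenAI and Sam Altman over AI risks. This is regulatory and legal-risk news, and can be read as a US state-level push on AI companies' safety responsibilities. It is worth tracking because compliance pressure on frontier AI comes not only from federal rules but also enters the courts through state lawsuits.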

–

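Nvidia RTX Spark [37] · nvidia.com

Nvidia RTX Spark is an RTX AI hardware product in a personal-computer form factor. It is positioned as a local AI compute device, not a data center GPU. Its industry significance is that Nvidia keeps pushing AI compute down to desktops, small form-factor machines, and developer endpoints to support local inference, agents, and creative workflows.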

–

Nvidia Cosmos 3 [38] · developer.nvidia.com

Nvidia Cosmos 3 targets physical AI reasoning, world models, and action models. Its focus is the robotics and physical world model stack, not general-purpose chat models. It is notable because Nvidia is packaging GPUs, simulation, world models, and action models into a physical AI platform narrative.

–

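Malicious npm packages detected across Red Hat Cloud Services [39] · github.com

A Red Hat-related GitHub issue discloses that malicious npm releases were detected in the `@redhat-cloud-services/` scope. This is a supply-chain security incident affecting the release pipeline of the JavaScript clients. It is a reminder that an enterprise package scope is itself an attack surface, and that dependency consumers need to audit versions, lockfiles, and release provenance.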

–

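Meta launches Instagram, Facebook, and WhatsApp subscriptions [40] · techcrunch.com

TechCrunch reports that Meta has launched Instagram, Facebook, and WhatsApp subscriptions and plans more subscription products, including AI plans. This is an expansion of paid tiers across Meta's social products. The industry implication is that large platforms keep using subscriptions to productize privacy, enhanced features, or AI capabilities rather than relying on advertising revenue alone.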

–

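OpenRouter raises $113M Series B [41] · news.ycombinator.com

The Hacker News item points to OpenRouter closing a $113 million Series B. This is funding news for the model routing and API aggregation layer. It is notable because multi-model access, pricing arbitrage, and fallback routing have formed an independent infrastructure layer that now attracts large-scale capital.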

–

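The AV2 Video Standard Has Released [42] · news.ycombinator.com

The Hacker News item says the final v1.0 specification of the AV2 video standard has been released. This is a milestone at the video coding standard level. Its technical significance is that the open video codec ecosystem enters a new generation, with later effects on browsers, hardware decoding, streaming transcoding costs, and the patent landscape.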

–

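GrapheneOS Speech Services version 2 released [43] · discuss.grapheneos.org

The GrapheneOS forum announces the release of Speech Services v2. The release is a speech service update for a privacy-focused mobile OS. It is notable because basic capabilities such as speech-to-text and text-to-speech are being reimplemented and moved on-device by privacy-oriented operating systems.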

–

Blog Posts

12 items

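Hackers Simply Asked Meta AI to Give Them Access to High-Profile Instagram Accounts. It Worked [44] · simonwillison.net

Simon Willison documents and comments on an incident in which the Meta AI support bot was used to take over high-profile Instagram accounts. The key point is that attackers gained account access by talking to the support bot, not through a traditional exploit. It shows that when AI support automation is wired directly into high-privilege account recovery flows, social engineering and prompt manipulation merge into a new operational risk.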

–

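May 2026 newsletter [45] · simonwillison.net

Simon Willison's May 2026 newsletter rounds up recent model releases, AI costs, Datasette Agent, and more. It is a monthly technical survey, not a single release. Its value is in placing model capability, falling prices, personal toolchains, and open-source project progress on one timeline.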

–

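datasette 1.0a32 [46] · simonwillison.net

Datasette 1.0a32 is released, fixing issues related to `/db/-/execute-write` and `base_url`. It is a minor maintenance release in the 1.0 alpha series. It is worth a look because Datasette is approaching 1.0, and edge-case bug fixes directly affect its stability as a local data publishing tool and an agent data interface.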

–

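pydantic-monty investigation [47] · simonwillison.net

Simon Willison investigates Monty, a Rust implementation of a sandboxed Python subset, and its resource limit settings. The post looks at how pydantic-monty restricts Python expressiveness, runtime behavior, and resource consumption. It is worth reading because LLM code execution environments often need a compromise between "enough Python" and "controllable resource boundaries", and Monty offers a small, auditable case.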

–

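Open and closed models are on different exponentials [48] · interconnects.ai

Nathan Lambert analyzes how the open and closed model ecosystems sit on different performance and economic curves. The core is not the gap on any single leaderboard but the diverging exponentials driven by training capital, product revenue, release cadence, and the externalities of open weights. It offers a more structured lens for judging open-source catch-up and closed-source lead.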

–

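Import AI 459 [49] · importai.substack.com

Import AI 459 covers research on AI oversight, scaling laws for protein folding, and AI risk pricing. This issue is a research newsletter focused on the difficulty of oversight, scaling in biological science, and the marketization of risk. Its value is in setting AI safety, science, and economic signals side by side, showing that the reach of frontier AI now extends beyond model benchmarks.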

–

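Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic [50] · huggingface.co

Workflows and control flow · Frameworks and scaffolding · Tool use · Systems/Infrastructure

A Hugging Face / IBM Research blog post discusses agent logic in enterprise AI adoption. Its central claim is that deploying at enterprise scale cannot rely on LLM prompts alone and needs composable, governable agent logic as a system layer. It is worth reading because it moves the hard part of enterprise AI from model integration to processes, tools, state, and control logic.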

–

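Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains [51] · huggingface.co

JetBrains introduces Mellum2, a 12B MoE software engineering model, on the Hugging Face blog. The technical report and the launch post point to the same emphasis: the model targets code, software engineering, and agentic coding, and stresses open weights and a low active parameter count. It is JetBrains productizing its IDE experience into a dedicated model stack.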

–

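Welcome NVIDIA Cosmos 3 [52] · huggingface.co

A Hugging Face blog post introduces the physical AI reasoning and action model capabilities of Nvidia Cosmos 3. It targets world models, robotics, and decision-making in physical environments. It is worth reading because the HF ecosystem is now hosting not just text models but also video, simulation, and embodied AI model releases.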

–

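Backpressure is all you need [53] · lucasfcosta.com

Lucas F. Costa explains backpressure from a system design perspective: when downstream cannot absorb input, the system must signal this explicitly and throttle the upstream rate. Starting from high-concurrency services and queue pressure, the post stresses that a system without backpressure spreads latency, memory growth, and failures globally. Its practical value is in elevating "rate limiting" from an edge protection to an end-to-end stability mechanism.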
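A bounded queue is one of the simplest ways to make that feedback explicit. The sketch below is an illustration rather than code from the post: the consumer side is modelled by a queue with a hard capacity, and the producer gets an immediate yes or no instead of silently growing a backlog. The `offer` helper and the capacity of 2 are illustrative choices.

```python
import queue


def offer(q, item):
    """Try to enqueue without blocking; False tells the producer to slow down or shed load."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


inbox = queue.Queue(maxsize=2)  # the bound is what turns overload into a visible signal

# The producer is faster than the consumer: the third and fourth items are refused.
results = [offer(inbox, i) for i in range(4)]

# Once the consumer drains one item, capacity is available again.
inbox.get_nowait()
accepted_after_drain = offer(inbox, 99)
```

With an unbounded queue every `offer` would succeed and the overload would only surface later as memory growth and latency, which is the failure mode the post warns about.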

–

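Should you normalize RGB values by 255 or 256? [54] · 30fps.net

The article discusses the difference between dividing by 255 and by 256 when normalizing RGB. It focuses on the boundaries of the numeric mapping: 8-bit RGB values 0-255 form a discrete integer range; dividing by 255 maps the maximum value to 1.0, while dividing by 256 behaves more like bucket-width normalization. It helps clarify a common off-by-one intuition in graphics, image processing, and shader code.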
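The difference is easy to see numerically. In this sketch the first two functions are the two divisions discussed above; `bucket_center` is an added illustration of the "bucket" reading, where each 8-bit value stands for an interval of width 1/256 and is represented by its midpoint. That third function is an assumption about how to interpret the bucket view, not something taken from the article.

```python
def norm_255(v):
    """Endpoint mapping: 0 maps to 0.0 and 255 maps to exactly 1.0."""
    return v / 255


def norm_256(v):
    """Bucket-width mapping: 255 maps just below 1.0."""
    return v / 256


def bucket_center(v):
    """Midpoint of the v-th of 256 equal buckets on [0, 1)."""
    return (v + 0.5) / 256


top_255 = norm_255(255)
top_256 = norm_256(255)

# Quantizing a bucket midpoint back with floor(x * 256) recovers every 8-bit value.
round_trip_ok = all(int(bucket_center(v) * 256) == v for v in range(256))
```

Which mapping is right depends on the paired inverse: dividing by 255 goes with rounding `x * 255`, while the bucket view goes with flooring `x * 256`, and mixing the two is where the off-by-one errors come from.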

–

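You Don't Love systemd Timers Enough [55] · blog.tjll.net

The article introduces how to use systemd timers and how they compare with cron. The point is that systemd timers can be combined with units, dependencies, logging, randomized delays, and persistence for missed runs. It is worth reading because many cron jobs really need service management semantics, not just running a command on schedule.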

–

References

67 references

[1] Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents. arXiv:2605.30621. https://arxiv.org/abs/2605.30621
[2] From Model Scaling to System Scaling: Scaling the Harness in Agentic AI. arXiv:2605.26112. https://arxiv.org/abs/2605.26112
[3] Learning Agent-Compatible Context Management for Long-Horizon Tasks. arXiv:2605.30785. https://arxiv.org/abs/2605.30785
[4] TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories. arXiv:2605.31308. https://arxiv.org/abs/2605.31308
[5] LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis. arXiv:2605.30434. https://arxiv.org/abs/2605.30434
[6] GrepSeek: Training Search Agents for Direct Corpus Interaction. arXiv:2605.29307. https://arxiv.org/abs/2605.29307
[7] SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search. arXiv:2605.29796. https://arxiv.org/abs/2605.29796
[8] SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones? arXiv:2605.30329. https://arxiv.org/abs/2605.30329
[9] OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents. arXiv:2605.23657. https://arxiv.org/abs/2605.23657
[10] From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors. arXiv:2605.31042. https://arxiv.org/abs/2605.31042
[11] Mellum2 Technical Report. arXiv:2605.31268. https://arxiv.org/abs/2605.31268
[12] Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode. arXiv:2605.30571. https://arxiv.org/abs/2605.30571
[13] DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization. arXiv:2605.31455. https://arxiv.org/abs/2605.31455
[14] LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards. arXiv:2605.31584. https://arxiv.org/abs/2605.31584
[15] Skill Availability and Presentation Granularity in Large-Language-Model Agents. arXiv:2605.31408. https://arxiv.org/abs/2605.31408
[16] Textile. https://www.gettextile.app
[17] Dataroom. https://github.com/hanxiao/dataroom
[18] Glq LLM quantization using E8 lattice. https://github.com/cnygaard/glq
[19] DepsGuard. https://github.com/arnica/depsguard
[20] Valdr. https://github.com/ianm199/valdr
[21] 2-command CLI to give AI agents structured data retrieval on PostgreSQL. https://github.com/0xJaksun/lithium-core
[22] A CSS 3D Engine. https://github.com/LayoutitStudio/polycss
[23] Cloud CI and agentic workflows for embedded hardware development. https://github.com/jumpstarter-dev/jumpstarter
[24] Vmette. https://github.com/chamuka-inc/vmette
[25] Poolnarc. https://github.com/yeet-src/poolnarc
[26] UQLM. https://github.com/cvs-health/uqlm
[27] AgentThreatBench. https://github.com/OWASP/www-project-agent-memory-guard
[28] Ministry of Everything. https://github.com/modulecollective/moe
[29] Stria. https://github.com/Reliary/stria
[30] RedFlag. https://github.com/Fimeg/RedFlag
[31] OpenAI frontier models and Codex are now available on AWS. https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws
[32] Alphabet announces $80B equity capital raise to expand AI infra and compute. https://abc.xyz/investor/news/news-details/2026/Alphabet-Announces-Proposed-80-Billion-Equity-Capital-Raise-to-Expand-AI-Infrastructure-and-Compute-2026-b0myAMewCa/default.aspx
[33] Anthropic confidentially submits draft S-1 to the SEC. https://www.anthropic.com/news/confidential-draft-s1-sec
[34] Building the infrastructure for the Intelligence Age in Michigan. https://openai.com/index/stargate-michigan-data-center
[35] DuckDuckGo makes its 'no-AI' search engine easier to access as its traffic booms. https://techcrunch.com/2026/06/01/duckduckgo-makes-its-no-ai-search-engine-easier-to-access-as-its-traffic-booms/
[36] Florida sues OpenAI and Sam Altman over AI risks. https://www.politico.com/news/2026/06/01/openai-hit-with-florida-lawsuit-00944215
[37] Nvidia RTX Spark. https://www.nvidia.com/en-us/products/rtx-spark/
[38] Nvidia Cosmos 3. https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/
[39] Malicious npm packages detected across Red Hat Cloud Services. https://github.com/RedHatInsights/javascript-clients/issues/492
[40] Meta launches Instagram, Facebook, and WhatsApp subscriptions. https://techcrunch.com/2026/05/27/meta-officially-launches-instagram-facebook-and-whatsapp-subscriptions-with-more-to-come-including-ai-plans/
[41] OpenRouter raises $113M Series B. https://news.ycombinator.com/item?id=48338660
[42] The AV2 Video Standard Has Released. https://news.ycombinator.com/item?id=48340910
[43] GrapheneOS Speech Services version 2 released. https://discuss.grapheneos.org/d/36001-grapheneos-speech-services-version-2-released
[44] Hackers Simply Asked Meta AI to Give Them Access to High-Profile Instagram Accounts. It Worked. https://simonwillison.net/2026/Jun/1/hackers-simply-asked-meta-ai/#atom-everything
[45] May 2026 newsletter. https://simonwillison.net/2026/Jun/1/may-newsletter/#atom-everything
[46] datasette 1.0a32. https://simonwillison.net/2026/May/31/datasette/#atom-everything
[47] pydantic-monty investigation. https://simonwillison.net/2026/May/22/monty-investigation/#atom-everything
[48] Open and closed models are on different exponentials. https://www.interconnects.ai/p/open-and-closed-models-are-on-different
[49] Import AI 459. https://importai.substack.com/p/import-ai-459-ai-oversight-is-difficult
[50] Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic. https://huggingface.co/blog/ibm-research/agent-logic-and-scalable-ai-adoption
[51] Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains. https://huggingface.co/blog/JetBrains/mellum2-launch
[52] Welcome NVIDIA Cosmos 3. https://huggingface.co/blog/nvidia/cosmos-3-for-physical-ai
[53] Backpressure is all you need. https://www.lucasfcosta.com/blog/backpressure-is-all-you-need
[54] Should you normalize RGB values by 255 or 256? https://30fps.net/pages/255-vs-256-division/
[55] You Don't Love systemd Timers Enough. https://blog.tjll.net/you-dont-love-systemd-timers-enough/
[56] D4Vinci/Scrapling. https://github.com/D4Vinci/Scrapling
[57] nesquena/hermes-webui. https://github.com/nesquena/hermes-webui
[58] github/docs. https://github.com/github/docs
[59] supermemoryai/supermemory. https://github.com/supermemoryai/supermemory
[60] nicobailon/pi-subagents. https://github.com/nicobailon/pi-subagents
[61] run-llama/liteparse. https://github.com/run-llama/liteparse
[62] iii-hq/iii. https://github.com/iii-hq/iii
[63] BloopAI/vibe-kanban. https://github.com/BloopAI/vibe-kanban
[64] golemcloud/golem. https://github.com/golemcloud/golem
[65] AnInsomniacy/motrix-next. https://github.com/AnInsomniacy/motrix-next
[66] tmoroney/auto-subs. https://github.com/tmoroney/auto-subs
[67] mattpocock/sandcastle. https://github.com/mattpocock/sandcastle

GitHub Trending

D4Vinci/Scrapling [56] · github.com

nesquena/hermes-webui [57] · github.com

github/docs [58] · github.com

supermemoryai/supermemory [59] · github.com

nicobailon/pi-subagents [60] · github.com

Featured · run-llama/liteparse [61] · github.com

iii-hq/iii [62] · github.com

BloopAI/vibe-kanban [63] · github.com

golemcloud/golem [64] · github.com

AnInsomniacy/motrix-next [65] · github.com

tmoroney/auto-subs [66] · github.com

mattpocock/sandcastle [67] · github.com
