Topics · Daily Harness

1 Reasoning (23 items)

1.1 Reasoning & Planning (12 items)

PhoenixRepair: Rethinking Repair Strategy Exploration in Software Agents · 2026-07-24 · arxiv.org
Do Coding Agents Need Executable World Models, Simplification, and Verification to Solve ARC-AGI-3? · 2026-07-21 · arxiv.org
Can We Understand How Large Language Models Reason? · 2026-07-13 · cacm.acm.org
FirstResearch: Auditable Question Formation for LLM Scientific Discovery Agents · 2026-07-09 · arxiv.org
Understanding is the new bottleneck · 2026-07-04 · geoffreylitt.com

1.2 Test-time Compute (11 items)

Effort Router: Intelligent /effort selection per Claude turn · 2026-07-21 · github.com
Controlling Reasoning Effort in LLMs · 2026-07-21 · magazine.sebastianraschka.com
Doomed from the Start: Early Abort of LLM Agent Episodes via a Recall-Controlled Probe Cascade · 2026-07-09 · arxiv.org
SPORK: Self-Speculative Forking to Accelerate Agentic LLM Inference · 2026-07-08 · arxiv.org
Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling · 2026-07-04 · arxiv.org

2 Learning & Self-Evolution (72 items)

3 Memory & Context (217 items)

3.1 Agent Memory (78 items)

Profile-Graph Memory for LLM Agents: Implicit Cross-Entity Traversal through Narrative Profiles · 2026-07-25 · arxiv.org
Marrow · 2026-07-25 · github.com
Fidelity Before Structure: Verbatim Chunks Beat Lossy Artifact Extraction in Long-Conversation LLM Memory · 2026-07-25 · arxiv.org
Supra Cognitive Modes: A Routed Architecture for Agent Memory · 2026-07-24 · arxiv.org
akitaonrails/ai-memory · 2026-07-23 · github.com

3.2 Context Engineering (80 items)

Why Software Factories Fail · 2026-07-25 · github.com
State Compression in Two-Agent LLM Relays · 2026-07-24 · arxiv.org
Prompt Caching in Agents · 2026-07-24 · earendil.com
yvgude/lean-ctx · 2026-07-23 · github.com
rtk-ai/rtk · 2026-07-23 · github.com

3.3 Retrieval / RAG (59 items)

Amdb · 2026-07-25 · github.com
Grounded Forge · 2026-07-24 · github.com
HKUDS/LightRAG · 2026-07-23 · github.com
tirth8205/code-review-graph · 2026-07-21 · github.com
Local-first CLI to make Obsidian vaults searchable for AI agents · 2026-07-21 · github.com

4 Tools & Skills (182 items)

4.1 Tool Use (79 items)

oraios/serena · 2026-07-25 · github.com
CockroachCrawler · 2026-07-25 · github.com
Browser Bridge · 2026-07-25 · github.com
Decode-Time Grammars · 2026-07-24 · arxiv.org
Browser Tools SDK · 2026-07-23 · libretto.sh

4.2 Skills (56 items)

ibelick/ui-skills · 2026-07-20 · github.com
Nutlope/hallmark · 2026-07-16 · github.com
Dynamic Agent Skills: A Lifecycle Survey and Taxonomy of Evolving Skill Libraries · 2026-07-15 · arxiv.org
google-labs-code/stitch-skills · 2026-07-13 · github.com
dotnet/skills · 2026-07-09 · github.com

4.3 Protocols & Interop (47 items)

Palmier Pro · 2026-07-24 · github.com
Claude Code Proxy · 2026-07-24 · github.com
PrefectHQ/fastmcp · 2026-07-22 · github.com
memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations · 2026-07-21 · arxiv.org
Scalable LLM Agent Tool Access in the Cloud · 2026-07-21 · arxiv.org

5 Orchestration & Multi-Agent (130 items)

5.1 Multi-Agent (42 items)

block/buzz · 2026-07-25 · github.com
THU-MAIC/OpenMAIC · 2026-07-25 · github.com
Catalyst · 2026-07-25 · github.com
Syndicate · 2026-07-24 · usesyndicate.org
BDFL · 2026-07-24 · usebdfl.com

5.2 Workflows & Control (88 items)

alibaba/open-code-review · 2026-07-25 · github.com
Sigil · 2026-07-25 · github.com
Why Software Factories Fail · 2026-07-24 · github.com
CodeRescue: Budget-Calibrated Recovery Routing for Coding Agents · 2026-07-24 · arxiv.org
BatchDAG: LLM-Planned Execution Graphs for Scalable Ad-Hoc Analysis Over Enterprise Data · 2026-07-24 · arxiv.org

6 Harness Runtime & Infra (259 items)

6.1 Frameworks & Scaffolds (147 items)

agegr/pi-web · 2026-07-25 · github.com
Rendi · 2026-07-24 · github.com
LLM Budget Cap · 2026-07-24 · github.com
Fleet · 2026-07-24 · github.com
MoonshotAI/kimi-code · 2026-07-23 · github.com

6.2 Execution & Sandboxing (59 items)

citrolabs/ego-lite · 2026-07-25 · github.com
AgentCgroup: Understanding and Controlling OS Resources of AI Agents · 2026-07-25 · arxiv.org
Pullrun · 2026-07-24 · github.com
Superserve · 2026-07-23 · superserve.ai
Serve-avd · 2026-07-22 · github.com

6.3 Observability & Debugging (53 items)

AgentTrails: Towards Trust and Reuse for Agentic Tasks · 2026-07-24 · arxiv.org
AgentPulse · 2026-07-24 · prove-ai.github.io
AgentDebugX · 2026-07-24 · arxiv.org
Deterministic Replay for AI Agent Systems · 2026-07-23 · arxiv.org
microsoft/AI-Engineering-Coach · 2026-07-22 · github.com

7 Evaluation & Safety (263 items)

7.1 Benchmarks (60 items)

DocOps: A Verifiable Benchmark for Autonomous Agents in Complex Document Operations · 2026-07-25 · arxiv.org
Alipay-PIBench: A Realistic Payment Integration Benchmark for Coding Agents · 2026-07-25 · arxiv.org
RECON: Benchmarking Agent Memory for Compositional Reasoning over Long Contexts · 2026-07-23 · arxiv.org
Memory Bench · 2026-07-23 · github.com
Lomekwi: Resource-Bounded Tool Discovery in LLM Agents · 2026-07-23 · arxiv.org

7.2 Eval Methodology (77 items)

The First Known Runaway AI Agent - or a Very Bad Marketing Stunt? · 2026-07-25 · simonwillison.net
Silent Failures in Multimodal Agentic Search: A Diagnostic Taxonomy and Cross-Judge Evaluation · 2026-07-25 · arxiv.org
Guardrails as Scapegoats: Auditing Unfaithful Safety Refusals in Tool-Augmented LLM Agents · 2026-07-25 · arxiv.org
When JSON Is Not Enough · 2026-07-24 · arxiv.org
SAAG: Structured Agent Assessment and Grounding · 2026-07-24 · arxiv.org

7.3 Security (113 items)

Stateful Guardrails for Multi-Turn LLM Systems: A Conversational Risk Accumulation Framework · 2026-07-25 · arxiv.org
NEXUS: Structured Runtime Safety for Tool-Using LLM Agents · 2026-07-25 · arxiv.org
JANUS: Foreseeing Latent Risk for Long-Horizon Agent Safety · 2026-07-25 · arxiv.org
ChannelGuard: Safe Models Do Not Compose into Safe Multi-Agent Systems · 2026-07-25 · arxiv.org
ChainWatch: A Kill Chain-Aligned Sequential Detection Framework for Multi-Step Attacks in MCP-Based AI Agent Systems · 2026-07-25 · arxiv.org

7.4 Alignment & Governance (13 items)

Safety and alignment in an era of long-horizon models · 2026-07-21 · openai.com
ANCHOR: Automated Alignment Auditing for CLI Agents on Real-World Harm · 2026-07-15 · arxiv.org
Understand to participate · 2026-07-03 · simonwillison.net
AgentBound: Verifiable Behavioral Governance for Autonomous AI Agents · 2026-07-02 · arxiv.org
AgentWatch · 2026-06-29 · agent-watch.dev

1 Reasoning

1.1 Reasoning & Planning (12 items)

1.2 Test-time Compute (11 items)

2 Learning & Self-Evolution

2.1 Agent RL / Verifiable Rewards (34 items)

2.2 Distillation & Compression (9 items)

2.3 Self-Evolution (14 items)

2.4 Synthetic Data & Environments (15 items)

3 Memory & Context

3.1 Agent Memory (78 items)

3.2 Context Engineering (80 items)

3.3 Retrieval / RAG (59 items)

4 Tools & Skills

4.1 Tool Use (79 items)

4.2 Skills (56 items)

4.3 Protocols & Interop (47 items)

5 Orchestration & Multi-Agent

5.1 Multi-Agent (42 items)

5.2 Workflows & Control (88 items)

6 Harness Runtime & Infra

6.1 Frameworks & Scaffolds (147 items)

6.2 Execution & Sandboxing (59 items)

6.3 Observability & Debugging (53 items)

7 Evaluation & Safety

7.1 Benchmarks (60 items)

7.2 Eval Methodology (77 items)

7.3 Security (113 items)

7.4 Alignment & Governance (13 items)