Papers · 2026-06-02 · Tuesday, June 2, 2026

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

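The paper measures batch-1 autoregressive decode, the regime typical of robots, in-vehicle systems, and edge copilots, and finds that although this workload is memory-dominated, it does not simply speed up in proportion to peak HBM bandwidth. Three 7-8B GQA models were tested on H100, A100, L40S, and L4. For Qwen-2.5-7B at ctx=2048, the L4 reaches about 81% of the analytic memory floor, while the H100 reaches only about 27%. CUDA Graphs yield a 1.259x latency improvement on the H100, indicating that launch-side overhead has become the bottleneck on fast cards.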
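To make the "analytic memory floor" concrete, here is a minimal sketch of one common way to estimate it: the bytes that must be read per decoded token (model weights plus the KV cache) divided by peak memory bandwidth. This is an illustration under stated assumptions, not the paper's own formula. The function names, the reading of "fraction of floor" as floor time divided by measured time, the Qwen-2.5-7B-like shape (about 7.6B parameters, 28 layers, 4 KV heads, head dimension 128, 16-bit weights and cache), and the datasheet-style bandwidth figures are all assumptions introduced for this example.

```python
# Hypothetical helpers for a roofline-style lower bound on batch-1 decode latency.
# Nothing here is taken from the paper; all names and numbers are illustrative.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Bytes of K and V read per decoded token at a given context length (GQA)."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def memory_floor_seconds(n_params, kv_bytes, peak_bw_bytes_per_s, bytes_per_param=2):
    """Minimum time per token if every weight and KV byte is read once at peak bandwidth."""
    return (n_params * bytes_per_param + kv_bytes) / peak_bw_bytes_per_s


def floor_fraction(floor_s, measured_s):
    """Share of the analytic floor achieved: 1.0 means running at the bandwidth limit."""
    return floor_s / measured_s


# Assumed model shape, roughly Qwen-2.5-7B-like (not confirmed by the source).
N_PARAMS = 7.6e9
KV_BYTES = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=2048)

# Assumed peak bandwidths in bytes/s; the SKUs used in the paper may differ.
ASSUMED_PEAK_BW = {
    "H100": 3.35e12,
    "A100": 2.039e12,
    "L40S": 864e9,
    "L4": 300e9,
}

for gpu, bw in ASSUMED_PEAK_BW.items():
    floor_ms = 1e3 * memory_floor_seconds(N_PARAMS, KV_BYTES, bw)
    print(f"{gpu}: analytic floor ~ {floor_ms:.2f} ms/token")
```

Under these assumptions the weights dominate the byte count at ctx=2048, so the floor shrinks almost in proportion to bandwidth. The reported gap (about 81% on L4 versus about 27% on H100) then means the measured latency on the faster card stays well above its floor, which is consistent with the summary's point about launch-side overhead.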


