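arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.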

2605.16257 2026-05-18 cs.RO

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

Hanwen Wang, Weizhi Zhao, Xiangyu Wang, Siyuan Huang, He Lin, Boyuan Zheng, Rongtao Xu, Gang Wang, Yao Mu, He Wang, Lue Fan, Hongsheng Li, Zhaoxiang Zhang, Tieniu Tan

AI summary: This paper presents DexJoCo, a benchmark and toolkit with 11 functional tasks that evaluate a dexterous hand's tool use, bimanual coordination, long-horizon execution, and reasoning. It uses a low-cost data collection system and domain randomization to assess robustness, and reveals limitations of current policies.

Comments: 8 pages, 6 figures, project page is available at: https://dexjoco.github.io

Abstract

Achieving human-level manipulation requires dexterous robotic hands capable of complex object interactions. Advancing such capabilities further demands standardized benchmarks for systematic evaluation. However, existing dexterous benchmarks lack tasks that reflect the unique manipulation capabilities of dexterous hands over parallel grippers, as well as comprehensive evaluation pipelines. In this paper, we present DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, comprising 11 functionally grounded tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning. We develop a low-cost data collection system and collect 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. We benchmark modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation. Through extensive empirical analysis, we identify several important insights and common limitations of current policies in dexterous manipulation, highlighting key challenges for future research in dexterous hand robot learning. Project page available at: https://dexjoco.github.io

2605.16250 2026-05-18 cs.CL cs.AI cs.DB cs.LG

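A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation

Pavan Manjunath, Thomas Pruefer

AI summary: This paper proposes a generative AI framework that integrates four production-grade capabilities to deliver natural-language bill generation, consumption forecasting, and carbon-emission optimization.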

2605.16241 2026-05-18 cs.CV cs.AI

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Jin Shi, Brady Zhang, Yishun Lu

AI summary: This paper proposes VLA-AD, a framework that uses a vision-language model as an offline semantic supervisor to distill a large VLA teacher into a lightweight student policy, improving efficiency and robustness through high-level semantic guidance.

Abstract

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a $44\times$ reduction in model size while matching the teacher with only a $0.27\%$ average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $π_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within $0.53\%$ on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
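The headline ratios in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes the teacher has a nominal 7e9 parameters (read off the name OpenVLA-7B, not a reported count) and derives the teacher's implied control rate from the stated 12.5 Hz and 3.28x speedup. The variable names are illustrative only.

```python
# Assumption: "7B" is treated as a nominal 7e9 parameters.
teacher_params = 7e9
student_params = 158e6  # 158M-parameter student, as stated in the abstract

# Model-size reduction implied by the two parameter counts (roughly 44x).
size_reduction = teacher_params / student_params

# The student runs at 12.5 Hz and is 3.28x faster than the teacher,
# so the teacher's implied rate is roughly 3.8 Hz.
student_hz = 12.5
speedup = 3.28
implied_teacher_hz = student_hz / speedup

print(f"size reduction: {size_reduction:.1f}x")
print(f"implied teacher rate: {implied_teacher_hz:.2f} Hz")
```

Under these assumptions the nominal sizes are consistent with the reported 44x reduction.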

2605.16239 2026-05-18 cs.LG

Dynamics-Level Watermarking of Flow Matching Models with Random Codes

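Shuchan Wang

AI summary: This paper proposes a new method for watermarking flow matching models by embedding random codes in the continuous dynamics, achieving reliable message recovery while preserving generation quality.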

2605.16238 2026-05-18 cs.AI

Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi

AI summary: This paper presents an autonomous system that uses LLM-guided tree search to generate, evaluate, and optimize executable forecasting software. During the 2025-2026 US respiratory season it produced methodologically diverse models for influenza, COVID-19, and RSV, and the resulting ensemble matched or outperformed the CDC hub ensembles out-of-sample.

Abstract

Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
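The search loop described above can be illustrated with a toy best-first tree search scored by a log-scale error. This is a minimal sketch under stated assumptions, not the authors' system: `log_scale_error` is an assumed stand-in for their log-scale distance metric, `propose_children` is a hypothetical stub replacing the LLM that edits forecasting code, and a "model" here is just one growth factor.

```python
import heapq
import math

def log_scale_error(pred, obs):
    """Mean absolute error on log(1 + x); an assumed stand-in metric."""
    return sum(abs(math.log1p(p) - math.log1p(o)) for p, o in zip(pred, obs)) / len(obs)

def tree_search(root, propose_children, score, budget):
    """Best-first search: repeatedly expand the lowest-scoring candidate."""
    best = (score(root), root)
    frontier = [best]
    for _ in range(budget):
        if not frontier:
            break
        _, node = heapq.heappop(frontier)
        for child in propose_children(node):
            entry = (score(child), child)
            heapq.heappush(frontier, entry)
            if entry[0] < best[0]:
                best = entry
    return best

# Toy data: weekly counts; a candidate predicts next week as growth * this week.
history = [100, 120, 150, 180]
observed = history[1:]

def score(growth):
    predictions = [growth * x for x in history[:-1]]
    return log_scale_error(predictions, observed)

def propose_children(growth):
    # Hypothetical stub for an LLM proposing variations of a candidate.
    return [round(growth - 0.05, 2), round(growth + 0.05, 2)]

best_score, best_growth = tree_search(1.0, propose_children, score, budget=20)
print(best_growth, round(best_score, 4))
```

Scoring on a log scale keeps large counts from dominating the error, which is one plausible reading of why such a metric resists reward hacking. The abstract does not give the exact form of the metric.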

2605.16233 2026-05-18 cs.AI cs.CL cs.LG cs.MA cs.SY eess.SY

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

AI summary: FORGE uses population broadcast to evolve self-generated memory without gradient updates, improving the decision-making of hierarchical ReAct agents. On CybORG CAGE-2 it substantially raises returns and lowers failure rates.

DOI: 10.1145/3786335.3813155

Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
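The outer loop described above (stage, broadcast, graduate) can be sketched in a few lines. This is an illustrative toy, not the FORGE implementation: `run_stage` is a hypothetical stub standing in for an agent episode plus reflection, and the fixed return threshold is an assumed graduation criterion, since the abstract does not specify the real one.

```python
import random

def run_stage(memory, rng):
    """Hypothetical stub: evaluate an agent and reflect on its failures.
    Returns (mean_return, updated_memory)."""
    new_memory = memory + [f"rule-{rng.randint(0, 999)}"]
    mean_return = -100.0 / (1 + len(new_memory)) + rng.uniform(-5, 0)
    return mean_return, new_memory

def forge_outer_loop(population_size, stages, graduation_threshold, seed=0):
    rng = random.Random(seed)
    memories = [[] for _ in range(population_size)]
    frozen = [False] * population_size
    returns = [float("-inf")] * population_size
    for _ in range(stages):
        for i in range(population_size):
            if frozen[i]:
                continue  # graduated instances spend no further compute
            returns[i], memories[i] = run_stage(memories[i], rng)
            if returns[i] >= graduation_threshold:
                frozen[i] = True  # assumed criterion: return above a threshold
        best = max(range(population_size), key=lambda i: returns[i])
        # Broadcast: every non-frozen instance adopts a copy of the best memory.
        for i in range(population_size):
            if not frozen[i]:
                memories[i] = list(memories[best])
    return returns, memories, frozen

returns, memories, frozen = forge_outer_loop(
    population_size=4, stages=3, graduation_threshold=-20.0
)
print(returns, frozen)
```

In this toy the memory is a plain list of strings. In the paper's setting it would be prompt-injected Rules, Examples, or both.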

2605.16232 2026-05-18 cs.CL cs.AI cs.ET cs.LG cs.SY eess.SY

A Unified Generative-AI Framework for Smart Energy Infrastructure: Intelligent Gas Distribution, Utility Billing, Carbon Analytics, and Quantum-Inspired Optimisation

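Pavan Manjunath, Thomas Pruefer

AI summary: This paper proposes a unified generative AI framework that integrates intelligent gas distribution, billing, carbon analytics, and quantum-inspired optimization to improve energy-management efficiency and environmental accountability.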

2605.16222 2026-05-18 cs.CL cs.LG

Artificial Aphasias in Lesioned Language Models

Nathan Roll, Jill Kries, Laura Gwilliams, Cory Shain

AI summary: By lesioning language model parameters to simulate aphasia, the study characterizes the models' functional organization. It finds that symptom distributions differ markedly between models and human aphasia, which points to the influence of learning and processing details on language processing.

Comments: 49 pages, 13 figures

Abstract

Aphasias, selective language impairments which can arise from brain damage, reveal the functional organization of human language by providing causal links between affected brain regions and specific symptom profiles. Drawing on this literature, we introduce an aphasia-inspired technique to characterize the emergent functional organization of language models (LMs). We ``lesion'' (zero-out) model parameters and measure the effects of this intervention against clinical aphasia symptoms, as diagnosed by the Text Aphasia Battery (TAB). When applied to 112,426 outputs from five 1B-scale LMs, the full range of evaluated symptoms surface, but in distributions largely distinct from those of humans. Our method uncovers broad symptom-profile differences between attention components (query, key, value, output) and feed-forward components (up, gate, down), with weaker evidence for differences among components within the same mechanism. We also find an effect of depth, where lesions in early layers disproportionately cause syntactic and semantic symptoms while late-middle layers yield higher rates of phonological and fluency deficits. Although some LM lesions induce quantitatively more similar profiles to some human aphasia types than others, qualitative differences in symptom patterns between LMs and humans suggest that aphasia syndromes are heavily influenced by the details of learning and processing rather than being a domain-invariant consequence of disrupted language processing.
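The lesioning intervention, zeroing out the parameters of one component, can be shown on a toy parameter store. This sketch only illustrates the idea under stated assumptions: the nested-list "model", the `lesion` helper, and the component names (taken from the groups listed in the abstract) are hypothetical and do not come from any specific model library.

```python
import copy

ATTENTION = ("query", "key", "value", "output")
FEED_FORWARD = ("up", "gate", "down")

def lesion(params, layer, component):
    """Return a copy of `params` with one component of one layer zeroed out."""
    lesioned = copy.deepcopy(params)
    matrix = lesioned[layer][component]
    lesioned[layer][component] = [[0.0] * len(row) for row in matrix]
    return lesioned

# Toy parameters: 2 layers, each component is a 2x2 matrix of ones.
params = [
    {name: [[1.0, 1.0], [1.0, 1.0]] for name in ATTENTION + FEED_FORWARD}
    for _ in range(2)
]

damaged = lesion(params, layer=0, component="gate")
print(damaged[0]["gate"], params[0]["gate"])
```

Working on a copy keeps the intact model available, so outputs before and after the lesion can be compared on the same prompts.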

2605.16219 2026-05-18 cs.LG stat.ML

The Privacy Price of Tail-Risk Learning: Effective Tail Sample Size in Differentially Private CVaR Optimization

El Mustapha Mansouri

AI summary: The study shows how differential privacy changes the effective sample size of CVaR learning, decomposes the excess risk into a statistical error and a privacy price, derives rates for scalar estimation and finite classes, and identifies ordinary private learning on the informative tail records as the core hard subproblem.

Comments: 34 pages, 3 figures, 2 tables

Abstract

Differential privacy changes the effective sample size governing CVaR learning. For tail mass $τ$, the privacy-relevant sample size is not $n$, but $nτ$; equivalently, the effective private tail sample size is $εnτ$. Private CVaR excess risk decomposes into ordinary tail-risk statistical error and a privacy price. This decomposition is complete for scalar estimation and finite classes: scalar estimation has rate $Θ(B \min\{1,(nτ)^{-1/2}+(εnτ)^{-1}\})$, and finite classes of size $M$ have rate $Θ(B \min\{1,\sqrt{\log(2M)/(nτ)}+\log(2M)/(εnτ)\})$. These complete rates hold under pure DP, and their lower bounds extend to approximate DP in the stated small-$δ$ regimes. For convex Lipschitz learning, modular upper and lower reductions show that the CVaR-specific privacy term necessarily scales as $1/(εnτ)$, with dimension dependence inherited from private stochastic convex optimization. Together, these results identify ordinary private learning on $Θ(nτ)$ informative tail records as the canonical hard subproblem inside private CVaR learning.
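To get a feel for the two rates, the expressions inside the $Θ(\cdot)$ can be evaluated numerically. The sketch below ignores the constants hidden by $Θ$, so the numbers indicate scaling only. The function names and the sample values of $n$, $τ$, and $ε$ are illustrative assumptions.

```python
import math

def scalar_rate(n, tau, eps, B=1.0):
    """B * min{1, (n*tau)^(-1/2) + (eps*n*tau)^(-1)}, constants omitted."""
    m = n * tau  # number of informative tail records, in expectation
    return B * min(1.0, m ** -0.5 + 1.0 / (eps * m))

def finite_class_rate(n, tau, eps, M, B=1.0):
    """B * min{1, sqrt(log(2M)/(n*tau)) + log(2M)/(eps*n*tau)}, constants omitted."""
    m = n * tau
    log_term = math.log(2 * M)
    return B * min(1.0, math.sqrt(log_term / m) + log_term / (eps * m))

# Illustrative values: n*tau is about 5,000 and eps*n*tau is about 500.
n, tau, eps = 100_000, 0.05, 0.1
statistical_term = (n * tau) ** -0.5
privacy_price = 1.0 / (eps * n * tau)

print(f"statistical term: {statistical_term:.4f}")
print(f"privacy price:    {privacy_price:.4f}")
print(f"scalar rate:      {scalar_rate(n, tau, eps):.4f}")
print(f"finite class (M=1000): {finite_class_rate(n, tau, eps, 1000):.4f}")
```

Because the privacy term scales with $1/(εnτ)$ rather than $1/(εn)$, shrinking the tail mass $τ$ inflates the privacy price faster than the statistical term, which only grows as $(nτ)^{-1/2}$.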

2605.16211 2026-05-18 cs.LG math.DS

Hypothesis-driven construction of mesoscopic dynamics

Zhuoyuan Li, Aiqing Zhu, Qianxiao Li

AI summary: This paper proposes learning mesoscopic dynamics within a mathematically constrained hypothesis class, builds a unified framework on a generalized Onsager principle, provides theoretical guarantees, and validates the approach on continuum PDE models and microscopic chain models.

Comments: 38 pages, 10 figures

Abstract

Traditional scientific modeling typically begins with fixed, instance-wise effective equations and then carries out equation-specific analysis and computation, a procedure that becomes exceptionally challenging in complex applications such as multiscale systems. We propose an alternative paradigm by learning mesoscopic dynamics within a mathematically constrained hypothesis class. Building upon a generalized Onsager principle, we introduce a unified framework encompassing both dissipative and conservative mesoscopic dynamics. We establish uniform and a priori theoretical guarantees, including global well-posedness, asymptotic stability, unique factorization identifiability, and discrete energy dissipation, applicable to all spatio-temporal evolution equations within this hypothesis class prior to all learning stages. Data from each problem instance is then used to guide the identification of members within our hypothesis class, giving rise to accurate, robust and interpretable dynamical models. We empirically validate this framework on both data from continuum PDE models as a check, and on data arising from microscopic chain models for which exact meso-scale models are unknown. The proposed approach not only acts as an effective dynamics learner, but also offers vital interpretable diagnostics of the underlying physics.

2605.16207 2026-05-18 cs.AI cs.CL

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

AI summary: This paper studies the tutoring performance of LLMs in logical reasoning and finds systematic errors in distinguishing optimal, suboptimal, and incorrect solutions, which undermines adaptive instruction.

Comments: 22 pages, 20 figures

Abstract

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.

2605.16205 2026-05-18 cs.AI cs.CL cs.LG cs.MA cs.SY eess.SY

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

AI summary: The study examines how context, deliberation, and hierarchical decomposition in compound LLM agent design affect performance and cost in adversarial, partially observable sequential environments. Programmatic state abstraction is the most cost-efficient choice, and hierarchical decomposition without deliberation achieves the best absolute performance for most models.

DOI: 10.1145/3786335.3813149

Abstract

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

2605.16198 2026-05-18 cs.AI cs.CY cs.LG cs.LO

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

Parand A. Alamdari, Toryn Q. Klassen, Sheila A. McIlraith

AI summary: This paper proposes auditing and monitoring techniques that combine formal methods with machine learning to detect violations of temporally extended behavioral constraints in AI systems. In experiments they outperform LLM baseline methods at detecting violations and reduce the violation rates of LLM-based agents.

Abstract

We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs' temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.
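A temporally extended constraint of the kind mentioned above can be checked at runtime with a very small monitor. The sketch below is an illustration, not the authors' tooling: it hand-codes finite-trace monitors for two common LTL patterns, response `G(trigger -> F response)` and safety `G(not forbidden)`, and it assumes each step of the trace has already been labeled with a set of propositions. The proposition names are hypothetical.

```python
class ResponseMonitor:
    """Monitors G(trigger -> F response) over a finite trace of label sets."""

    def __init__(self, trigger, response):
        self.trigger = trigger
        self.response = response
        self.pending = False  # True while some trigger still awaits a response

    def step(self, labels):
        if self.response in labels:
            self.pending = False  # F response holds from here, even for a same-step trigger
        elif self.trigger in labels:
            self.pending = True
        return self.pending

    def violated_at_end(self):
        """On a finished trace, an unanswered trigger is a violation."""
        return self.pending


def first_violation_never(trace, forbidden):
    """Monitors G(not forbidden): index of the first violating step, or None."""
    for index, labels in enumerate(trace):
        if forbidden in labels:
            return index
    return None


def audit(trace, trigger, response):
    monitor = ResponseMonitor(trigger, response)
    for labels in trace:
        monitor.step(labels)
    return monitor.violated_at_end()


compliant = [{"collect_data"}, set(), {"notify_user"}]
non_compliant = [{"collect_data"}, set()]

print(audit(compliant, "collect_data", "notify_user"))
print(audit(non_compliant, "collect_data", "notify_user"))
print(first_violation_never([set(), {"share_secret"}], "share_secret"))
```

The `pending` flag is also the hook for predictive or intervening monitoring: while it is set, a runtime system knows that one more obligation must be met before the episode ends.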

2605.16193 2026-05-18 cs.CL cs.CY

Improving Cross-Cultural Survey Simulation with Calibrated Value Personas

Axel Abels, Elias Fernandez Domingos, Apurva Shah, Tom Lenaerts

AI summary: This paper proposes a value-based persona construction method with calibration that improves the accuracy of cross-cultural survey simulation and reduces prediction error, with the largest gains in underrepresented populations.

Comments: Submitted to the Fourth International Workshop on Value Engineering in AI (VALE 2026), held at IJCAI-ECAI 2026

Abstract

Large language models (LLMs) are increasingly used to simulate human opinions and survey responses, but their ability to reproduce population responses across cultures remains limited. Existing persona-based prompting methods typically rely on sociodemographic or personality traits, which are only indirect proxies for the values that shape human responses. We propose a value-based persona construction method that derives textual descriptors from survey responses capturing core cultural dimensions. By sampling value profiles from target populations and aggregating LLM responses across personas, we obtain population-level predictions grounded in observed value distributions. We further introduce a calibration procedure that improves response diversity while preserving estimated opinions. We show that our approach reduces prediction error across countries, with the largest improvements observed in underrepresented populations. This substantially narrows the performance gap between countries aligned with dominant LLM priors and those that are less represented in training data, while also yielding response distributions that closely match human diversity.
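The sample-then-aggregate step described above can be sketched as follows. This is a toy under stated assumptions: `fake_llm` is a hypothetical stub replacing a real model call, the single `tradition` score stands in for a full value profile, and the calibration procedure is not modeled because the abstract does not describe it.

```python
import random
from collections import Counter

def simulate_population(value_profiles, ask_llm, question, n_personas, seed=0):
    """Sample value profiles, query one persona per sample, and aggregate."""
    rng = random.Random(seed)
    sampled = [rng.choice(value_profiles) for _ in range(n_personas)]
    answers = Counter(ask_llm(profile, question) for profile in sampled)
    return {option: count / n_personas for option, count in answers.items()}

def fake_llm(profile, question):
    # Hypothetical stub: a real system would prompt an LLM with a textual
    # persona description derived from the value profile.
    return "agree" if profile["tradition"] >= 0.5 else "disagree"

profiles = [{"tradition": 0.9}, {"tradition": 0.2}, {"tradition": 0.7}]
distribution = simulate_population(profiles, fake_llm, "Q1", n_personas=200)
print(distribution)
```

The output is a population-level response distribution rather than a single answer, which is the quantity compared against observed survey results.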

2605.16191 2026-05-18 cs.CL cond-mat.other physics.comp-ph

Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search

Michael P. Brenner, Lizzie Dorfman, John C. Platt

AI summary: This paper uses AI coding systems to generate novel scientific hypotheses, optimizing three-dimensional photovoltaic structures with an LLM-driven tree search to overcome the losses that limit flat solar panels at mid-latitudes.

Comments: 10 pages, 7 figures

Abstract

We present a case study for how AI coding systems can be used to generate novel scientific hypotheses. We combine a generic coding agent (Google's AntiGravity) with an LLM-driven tree search algorithm (Empirical Research Assistance / ERA) to autonomously generate high-efficiency three-dimensional photovoltaic (3DPV) structures that overcome losses limiting flat solar panels at mid-latitudes. These structures operate by presenting favorable angles to the sun throughout the day, and for illustrative purposes we focus on optimizing performance for a single solar day. Our workflow begins by using AntiGravity to reproduce calculations \cite{bernardi2012solar} showing that 3DPV can have energy densities much higher than stationary flat PV panels. We use these initial designs as the starting point for large scale tree search, where we seek improved solutions and score them for their diurnal yield. The initial tree search leads to nominally more efficient solutions, yet they are caused by algorithmic reward hacking, arising from non-physical design features such as structurally levitating disconnected tiers and exploitations of the discretizations in the optics solver. To counteract this, we develop a workflow where the coding agent iteratively patches the physics engine with constraints to eliminate reward hacking. With reward-hacking eliminated, ERA discovers a series of designs with various constraints and improved performance, including optimal designs with different fixed collector areas, optimizing zenith tracking and avoiding self shadowing. Combining coding agents with tree search (ERA) provides a powerful platform for scientific discovery, for problems whose solutions can be empirically evaluated with a score function.

2605.16181 2026-05-18 cs.SD

ARIA: A Diagnostic Framework for Music Training Data Attribution

Changheon Han, Ashkan Panahi, Kıvanç Tatar

AI summary: ARIA decomposes attribution along musical aspects and pairs it with reliability diagnostics, revealing differences in attribution behavior across methods for music generation and producing per-aspect evidence for copyright analysis.

Comments: Working Paper

Abstract

Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, without revealing which musical aspects are dominant in that influence. We propose ARIA, a framework that decomposes attribution along musical aspects (five for symbolic music, three for audio) and pairs the decomposition with reliability diagnostics computed from the segment-level score matrix. It measures within-group similarity among the top-K attributed tracks against random reference groups drawn from the training pool, and diagnoses the score matrix through its singular value decomposition and column statistics. On a symbolic-music model where attribution ground truth is available through counterfactual retraining, the reliability diagnostics rank four attribution methods identically to that ground truth. On an audio music generation model, ARIA reveals attribution behaviors that vary substantially across TDA methods, flags score matrices whose retrieved tracks are nearly identical across queries rather than reflecting per-query attribution, and characterizes embedding-similarity retrieval baselines by the musical aspect each encoder surfaces. Together, ARIA produces per-aspect attribution evidence aligned with the musical aspects considered under the idea-expression distinction in copyright analysis.
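The within-group similarity check against random reference groups can be illustrated with plain cosine similarity. This is a minimal sketch under stated assumptions: the per-aspect feature vectors, the track identifiers, and the use of mean pairwise cosine similarity are illustrative choices, since the abstract does not give ARIA's exact similarity measure.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def within_group_similarity(vectors):
    """Mean pairwise cosine similarity inside one group of tracks."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def similarity_vs_random(features, top_k_ids, n_reference_groups=100, seed=0):
    """Compare the top-K attributed group with random groups of the same size."""
    rng = random.Random(seed)
    all_ids = list(features)
    attributed = within_group_similarity([features[i] for i in top_k_ids])
    reference = [
        within_group_similarity(
            [features[i] for i in rng.sample(all_ids, len(top_k_ids))]
        )
        for _ in range(n_reference_groups)
    ]
    return attributed, sum(reference) / len(reference)

# Toy features for one musical aspect (hypothetical 2-D embeddings).
features = {
    "a": [1.0, 0.1], "b": [1.0, 0.2], "c": [1.0, 0.0],
    "d": [0.0, 1.0], "e": [-1.0, 0.5], "f": [0.2, -1.0],
}
attributed, reference_mean = similarity_vs_random(features, ["a", "b", "c"])
print(round(attributed, 3), round(reference_mean, 3))
```

A top-K group that is no more coherent than random groups on any aspect would be a warning sign about the attribution scores, which is the role the reliability diagnostics play in the abstract.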

2605.16179 2026-05-18 cs.CV

MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models

Piyush Tiwary, Utkarsh Ahuja, Depanshu Sani, Aishwarya Jayagopal, Sagar Gubbi, Subhashini Venugopalan, Alok Talekar, Vaibhav Rajan

AI summary: This paper proposes MAgSeg, an MLLM segmentation method that needs no vision decoder. It addresses fragmented plots, high intra-class variance, and scarce labeled data in agricultural landscapes of the Global South, enabling scalable mapping of smallholder agricultural environments.

Abstract

Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.

2605.16175 2026-05-18 cs.LG

Imitation learning for clinical decision support in pediatric ECMO

Fateme Golivand, Michael Skinner, Saurabh Mathur, Ameet Soni, Phillip Reeder, Kristian Kersting, Lakshmi Raman, Sriraam Natarajan

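AI summary: Using imitation learning, this paper learns action models from pediatric ECMO data with models such as TabPFN; these outperform traditional baselines and provide effective support for clinical decision-making.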

2605.16171 2026-05-18 cs.CV

Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment

Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang

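AI summary: Res²CLIP uses residual-to-residual alignment to address category generalization in the few-shot setting, eliminating fine-grained feature differences and class-specific biases to improve cross-category generalization.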

Abstract

Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address both, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, we design Res$^2$CLIP, the first residual-to-residual alignment framework that symmetrically bridges visual and text modalities within CLIP's residual space. The framework is developed from a residual perspective and comprises three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimizations are constrained within the residual domain, and the residual alignment optimization objectives are designed to force the model to focus on relative anomaly deviations rather than optimizing class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at https://github.com/hito2448/Res2CLIP.

2605.16165 2026-05-18 cs.CV cs.AI

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Yishun Lu, Wes Armour

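AI summary: This paper proposes ML-FOP-SOAP, a second-order optimization framework whose multi-level variance correction stabilizes multimodal alignment. Experiments on Janus and Emu3 show improved sample efficiency and faster training, making the method suitable for large-scale multimodal foundation models.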

Abstract

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.

2605.16164 2026-05-18 cs.LG

Entropic Auto-Encoding via Implicit Free-Energy Minimization

Hazhir Aliahmadi, Irina Babayan, Greg van Anders

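AI summary: This paper proposes Entropic Autoencoders, a framework in which free-energy minimization implicitly generates the prior over latent variables. This mitigates the posterior collapse of variational autoencoders and allows non-Gaussian, multimodal latent distributions to be learned.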

Comments: 22 pages, 5 figures

Abstract

Despite their ubiquity, variational autoencoders (VAEs) inherently suffer from posterior collapse, a failure mode in which latent variables are effectively ignored. This failure arises because explicit prior imposition drives optimization toward loss landscape regions corresponding to uninformative latent representations. Here, we introduce Entropic Autoencoders (EAEs), a framework in which reconstruction loss is the only explicit objective, and entropy generates the latent variables' prior implicitly through a free energy-minimizing ensemble of encoders. This ensemble biases learning toward high-volume regions of near-optimal solutions, while decoder updates direct the search trajectories toward informative latent representations. We demonstrate that EAEs mitigate posterior collapse by learning non-Gaussian, multimodal latent distributions that yield diverse, data-consistent generations and preserve different forms of underlying structure in the data. As a proof-of-concept, we show that an EAE captures a superposition of the known low-dimensional dynamics of a reaction-diffusion process. Then, we show that an EAE identifies implicit categorical distinctions in MNIST latent representations, and displays a hierarchical understanding of facial structure on the CelebA dataset, from an "all-human" face to individual-dependent features.

2605.16154 2026-05-18 cs.LG cs.RO

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

Vaidehi Bagaria, Nikshep Grampurohit, Pulkit Verma

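AI summary: This paper proposes Probabilistic Chunk Masking (PCM), which allocates gradient computation selectively to make GRPO-based VLA RL more efficient, yielding faster training and lower memory consumption.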

Abstract

Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed up rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38x wall-clock speedup, 4.8x faster gradient updates, and 60% lower peak activation memory, and it backpropagates through fewer than 20% of trajectory chunks.
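
The chunk-selection step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the example phase scores, and the weighted sampling without replacement are all hypothetical, and the online update of keep probabilities is omitted. The abstract states only that phases are scored by success-failure action variance and that a fixed chunk budget is sampled per trajectory.

```python
import random

def group_advantages(rewards):
    """GRPO-style group-normalized advantage: one scalar per rollout,
    shared by every chunk of that rollout."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + 1e-8) for r in rewards]

def sample_chunk_mask(phase_of_chunk, phase_scores, budget, rng):
    """Pick `budget` chunks for backpropagation, favoring high-scoring phases.

    phase_of_chunk: phase id of each chunk in one trajectory.
    phase_scores: hypothetical per-phase score (a stand-in for
        success-failure action variance).
    Returns a boolean mask; True means the chunk receives gradient.
    """
    weights = [phase_scores[p] + 1e-6 for p in phase_of_chunk]
    candidates = list(range(len(phase_of_chunk)))
    chosen = set()
    # Weighted sampling without replacement (an assumed scheme).
    for _ in range(min(budget, len(candidates))):
        pick = rng.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]
        chosen.add(pick)
        candidates.remove(pick)
    return [i in chosen for i in range(len(phase_of_chunk))]

rng = random.Random(0)
phases = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]   # 10 chunks across 3 phases
scores = {0: 0.01, 1: 0.90, 2: 0.05}      # phase 1 is where outcomes diverge
mask = sample_chunk_mask(phases, scores, budget=2, rng=rng)
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Only the chunks flagged `True` in `mask` would enter the policy-gradient loss, each weighted by its rollout's shared advantage; the remaining chunks would be skipped in the backward pass.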

2605.16153 2026-05-18 cs.AI

An Algebraic Exposition of the Theory of Dyadic Morality

Kush R. Varshney

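AI summary: This paper gives an algebraic exposition of the theory of dyadic morality. It identifies three psychological operators that extend structural causal models, addresses the scalability problems arising from the dyadic limitation by showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing, and applies the framework to AI policy design.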

Abstract

This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment grounded in a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (typecasting operator, completion operator, and valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address scalability challenges arising from TDM's dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Drawing on this algebraic framework, we demonstrate concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies to preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception over universal averaging to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is both mathematically rigorous and faithful to human moral cognition.

2605.16147 2026-05-18 cs.CV

Registers Matter for Pixel-Space Diffusion Transformers

Nikita Starodubcev, Ilia Sudakov, Ilya Drobyshevskiy, Artem Babenko, Dmitry Baranchuk

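AI summary: The study finds that diffusion transformers differ from vision transformers in that they do not exhibit patch-token outliers, yet register tokens still significantly improve pixel-space generation quality. Analysis of intermediate representations shows that register tokens yield cleaner feature maps at high noise levels, which motivates an efficient dual-stream architecture that further improves generation.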

Abstract

Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.

2605.16143 2026-05-18 cs.AI cs.CL

Look Before You Leap: Autonomous Exploration for LLM Agents

Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

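AI summary: This paper identifies autonomous exploration as a key capability and introduces the Exploration Checkpoint Coverage metric. It improves the adaptability of LLM agents in unfamiliar environments with a training strategy that interleaves exploration and task-execution rollouts, improving generalization in task execution.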

Abstract

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
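
A coverage-style metric of the kind this abstract describes could be computed along the following lines. This is a sketch under assumptions: the abstract does not give the exact definition of Exploration Checkpoint Coverage, so the set-based formula, the function name, and the example checkpoint identifiers below are all hypothetical.

```python
def checkpoint_coverage(visited_events, checkpoints):
    """Fraction of predefined checkpoints that appear in an agent's exploration trace."""
    unique_checkpoints = set(checkpoints)
    if not unique_checkpoints:
        raise ValueError("checkpoints must be non-empty")
    discovered = set(visited_events) & unique_checkpoints
    return len(discovered) / len(unique_checkpoints)

# Hypothetical environment with four checkpoints (key states, objects, affordances).
checkpoints = {"open_drawer", "find_key", "read_manual", "unlock_door"}
narrow_trace = ["open_drawer", "open_drawer", "open_drawer"]              # repetitive behavior
broad_trace = ["open_drawer", "read_manual", "find_key", "look_around"]   # wider exploration

narrow = checkpoint_coverage(narrow_trace, checkpoints)   # 0.25
broad = checkpoint_coverage(broad_trace, checkpoints)     # 0.75
```

Because the score depends only on which checkpoints were reached, repeating the same action does not raise it, which matches the abstract's concern with narrow and repetitive behavior.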

2605.16134 2026-05-18 cs.LG cs.AI

Navigating Potholes with Geometry-Aware Sharpness Minimization

Simon Dufort-Labbé, Mehrab Hamidi, Razvan Pascanu, Ioannis Mitliagkas, Damien Scieur, Aristide Baratin

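AI summary: This paper proposes LLQR+SAM, which combines a learned preconditioner with sharpness-aware minimization in a two-timescale structure. Experiments show consistent gains over SAM and LLQR alone on vision and sequence modeling benchmarks.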

Abstract

Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but treats all parameter directions uniformly, ignoring the underlying loss geometry. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation then operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is not merely a computational convenience: theoretically, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Empirically, LLQR+SAM gives consistent gains over both SAM and LLQR alone across standard vision and sequence modeling benchmarks, supporting the view that slow learned geometry and fast sharpness correction are genuinely complementary.
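
The mechanics of a SAM step applied on top of a preconditioner can be sketched on a toy problem. Everything here is an assumption made for illustration: the fixed diagonal preconditioner is a stand-in and is not the LLQR preconditioner (which the abstract describes as learned and maintained as a slow exponential moving average), and the two-dimensional quadratic loss, step size, and radius `rho` are arbitrary choices.

```python
import math

def loss(w):
    """Toy ill-conditioned quadratic: L(w) = 0.5 * (10 * w0^2 + 0.1 * w1^2)."""
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def grad(w):
    """Gradient of the toy loss."""
    return [10.0 * w[0], 0.1 * w[1]]

def sam_step(w, precond, lr=0.05, rho=0.05):
    """One SAM step with a fixed diagonal preconditioner (stand-in, not LLQR)."""
    g = grad(w)
    pg = [p * gi for p, gi in zip(precond, g)]
    norm = math.sqrt(sum(x * x for x in pg)) + 1e-12
    # Ascent step: perturb the weights toward higher loss within radius rho.
    w_adv = [wi + rho * x / norm for wi, x in zip(w, pg)]
    g_adv = grad(w_adv)
    # Descent step: apply the gradient from the perturbed point to the original weights.
    return [wi - lr * p * gi for wi, p, gi in zip(w, precond, g_adv)]

w_plain = [1.0, 1.0]
w_pre = [1.0, 1.0]
for _ in range(50):
    w_plain = sam_step(w_plain, precond=[1.0, 1.0])   # vanilla SAM
    w_pre = sam_step(w_pre, precond=[0.1, 10.0])      # inverse-curvature scaling
```

On this toy quadratic the preconditioned run ends at a lower loss after 50 steps, which reflects the conditioning of the toy problem and says nothing about the paper's results.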

2605.16127 2026-05-18 cs.CV

WeatherOcc3D: VLM-Assisted Adverse Weather Aware 3D Semantic Occupancy Prediction

A. Enes Doruk, Abdelaziz Hussein, Hasan F. Ates

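AI summary: This paper proposes a framework that uses the pre-trained CLIP latent space to guide multi-sensor fusion with linguistic environmental cues, addressing sensor unreliability in adverse weather and improving the robustness of 3D semantic occupancy prediction.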

Abstract

While multi-modal 3D semantic occupancy prediction typically enhances robustness by fusing camera and LiDAR inputs, its effectiveness is fundamentally constrained by environmental variability. Specifically, camera sensors suffer from severe low-light degradation, while LiDAR sensors encounter significant backscatter noise during heavy precipitation. These adverse conditions create a modality trust problem, as static fusion strategies fail to adaptively re-weight inputs when a specific sensor becomes unreliable. To address this, we propose a VLM-assisted framework leveraging the pre-trained CLIP latent space to guide multi-sensor integration via linguistic environmental cues. We utilize a parameter-efficient adapter to align weather-specific text embeddings with sensor features, coupled with a gating strategy that decomposes environmental uncertainty into two factors: visibility and illumination. This enables the model to dynamically modulate the fusion ratio, prioritizing semantic camera features in clear daylight and shifting to geometric LiDAR priors during rainy nights. Evaluations on the nuScenes dataset demonstrate the versatility of our approach: implementing our proposed framework on the OccMamba and M-CONet architectures achieves mIoU scores of 26.3 and 21.1, respectively, significantly outperforming their traditional baselines.
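
The gating idea in this abstract, a fusion ratio driven by visibility and illumination, can be illustrated with a small sketch. The assumptions are significant: in the paper the two factors come from CLIP text embeddings aligned with sensor features, whereas here they are plain scalars in [0, 1], and the sigmoid gate with its coefficients is entirely hypothetical.

```python
import math

def camera_weight(visibility, illumination, w_vis=3.0, w_ill=3.0, bias=-3.0):
    """Hypothetical gate: sigmoid of a linear score over the two factors, each in [0, 1]."""
    score = w_vis * visibility + w_ill * illumination + bias
    return 1.0 / (1.0 + math.exp(-score))

def fuse(camera_feat, lidar_feat, visibility, illumination):
    """Convex combination of camera and LiDAR features for one voxel."""
    a = camera_weight(visibility, illumination)
    return [a * c + (1.0 - a) * l for c, l in zip(camera_feat, lidar_feat)]

clear_day = camera_weight(visibility=1.0, illumination=1.0)
rainy_night = camera_weight(visibility=0.2, illumination=0.1)
fused = fuse([1.0, 0.0], [0.0, 1.0], visibility=0.2, illumination=0.1)
```

With these made-up coefficients the camera weight is high in clear daylight and low on a rainy night, so the fused feature leans toward LiDAR in the second case; a static fusion strategy would correspond to a constant weight.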

2605.16126 2026-05-18 cs.LG cs.AI cs.IT math.IT math.ST stat.OT stat.TH

Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schrödinger Samplers

Bruno Trentini, Dejan Stancevic, Michael M. Bronstein, Alexander Tong, Luca Ambrogioni

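AI summary: This paper proposes an entropy-rate objective for bridge-aware discretization that separates endpoint-conditioned bridge geometry from marginal flow evolution, improving low-budget sampling for high-dimensional bridge and flow samplers.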

Abstract

For a fixed flow-based generative model under a small inference budget, sample quality can depend strongly on where the sampler spends its few function evaluations. Flow matching and Schrödinger bridges define probability paths, yet their inference grids are usually heuristic or inherited from one-endpoint diffusion. We derive a conditional-marginal entropy-rate objective for bridge-aware discretization, separating endpoint-conditioned bridge geometry from marginal flow evolution, and use it to build a training-free entropic inference-time scheduler from first principles. For Gaussian Brownian bridges this rate is closed-form and U-shaped, motivating boundary-heavy nonuniform grids. On trained two-dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10-step ODE-Heun MMD over linear by 18.1%, with a paired 22.7% SDE-Heun improvement in the same low-NFE sweep. On EDM/CIFAR-10, the entropic time-discretization gives the best tested five-step FID ($186.3 \pm 4.0$ versus $200.5 \pm 2.9$ for linear and $238.0 \pm 5.3$ for cosine). On AlphaFlow protein generation, entropic conditional-marginal (cond-marg) scheduling shows an advantage in low-NFE regimes on both CAMEO22 and ATLAS benchmarks. These results support entropy-rate scheduling as a practical low-budget allocation signal for high-dimensional bridge and flow samplers.
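
One way to turn a U-shaped rate profile into a boundary-heavy time grid is to place grid points so that each interval carries the same integrated rate. The sketch below does this numerically. It is an illustration under assumptions: the abstract does not give the closed-form rate, so `u_shaped_rate` is a stand-in with the right qualitative shape, and the equal-mass rule is one plausible allocation scheme, not necessarily the paper's scheduler.

```python
def equal_mass_grid(rate, n_steps, resolution=10_000):
    """Time grid on [0, 1] whose intervals carry equal integrated `rate`."""
    ts = [i / resolution for i in range(resolution + 1)]
    # Cumulative integral of the rate by the trapezoidal rule.
    cdf = [0.0]
    for a, b in zip(ts[:-1], ts[1:]):
        cdf.append(cdf[-1] + 0.5 * (rate(a) + rate(b)) * (b - a))
    total = cdf[-1]
    grid, j = [0.0], 0
    for k in range(1, n_steps):
        target = total * k / n_steps
        while cdf[j + 1] < target:
            j += 1
        # Linear interpolation inside the bracketing cell.
        frac = (target - cdf[j]) / (cdf[j + 1] - cdf[j])
        grid.append(ts[j] + frac * (ts[j + 1] - ts[j]))
    grid.append(1.0)
    return grid

def u_shaped_rate(t, eps=0.05):
    """Stand-in U-shaped profile, large near both endpoints (not the paper's formula)."""
    return 1.0 / (t * (1.0 - t) + eps)

grid = equal_mass_grid(u_shaped_rate, n_steps=10)
steps = [b - a for a, b in zip(grid[:-1], grid[1:])]
```

The resulting `steps` are short near t = 0 and t = 1 and longer in the middle, so a sampler with ten function evaluations would spend more of them near the endpoints than a linear grid does.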

2605.16122 2026-05-18 cs.CV cs.AI

GenShield: Unified Detection and Artifact Correction for AI-Generated Images

Zhipei Xu, Xuanyu Zhang, Youmin Xu, Qing Huang, Shen Chen, Taiping Yao, Shouhong Ding, Jian Zhang

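AI summary: This paper proposes GenShield, a framework that performs explainable AI-generated image detection and controllable artifact correction in a closed loop from diagnosis to restoration. Combined with a Visual Chain-of-Thought curriculum learning strategy, it improves correction quality and generalization.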

Abstract

Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore a realistic appearance remains largely underexplored. Moreover, few existing works have established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step "diagnose-then-repair" correction with an explicit stopping criterion. A high-quality dataset with large-scale "artifact-restored" pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method. The code is available at https://github.com/zhipeixu/GenShield.

2605.16118 2026-05-18 cs.LG

Multi-Fidelity Flow Matching: Cascaded Refinement of PDE Solutions

Sipeng Chen, Junliang Liu, Hewei Tang, Shibo Li

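AI summary: This paper proposes Multi-Fidelity Flow Matching, a cascaded refinement method for PDE solutions in which the low-fidelity solution conditions the generation of the high-fidelity solution, and a residual-calibrated source distribution improves the flow-matching training geometry.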

Comments: 27 pages, 2 figures, 7 tables. Preprint

Abstract

The source distribution in conditional flow matching is a design parameter that can be calibrated to data, not a default isotropic prior. We exploit this in Multi-Fidelity Flow Matching (MFFM), a cascade refinement framework for parametric PDE solutions: the source is calibrated to the empirical low-to-high-fidelity residual scale with local Gaussian-blur correlation, and the velocity network is conditioned on the low-fidelity solution. Conditioning makes the residual refinement problem substantially easier than unconditional field generation, while residual-calibrated source noise improves the flow-matching training geometry. A multi-resolution cascade applies the same construction independently between adjacent fidelities. After level-wise flow-matching pretraining, we fine-tune the composed cascade end-to-end with a deterministic one-step rollout, which makes one velocity evaluation per cascade level the optimized operating point at inference. The result is a learned analog of multigrid refinement that reaches the finest grid in $L$ deterministic network evaluations per query. We validate MFFM on eight benchmarks: two super-resolution problems and six spatiotemporal forecasting tasks from PDEBench, The Well, and the FNO Navier-Stokes dataset.
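
The construction of a residual-calibrated source and its flow-matching target can be sketched for a single cascade level. This is a minimal illustration under assumptions, not the authors' method: it uses the common linear interpolation path with a constant target velocity, draws independent noise (the Gaussian-blur correlation mentioned in the abstract is omitted), and all function names and the tiny example fields are hypothetical.

```python
import random

def residual_scale(low_batch, high_batch):
    """Empirical RMS of the low-to-high-fidelity residual over a small batch."""
    sq = [(b - a) ** 2 for lo, hi in zip(low_batch, high_batch) for a, b in zip(lo, hi)]
    return (sum(sq) / len(sq)) ** 0.5

def flow_matching_pair(low, high, sigma, t, rng):
    """Build one training example: the interpolant x_t and its target velocity."""
    x0 = [a + sigma * rng.gauss(0.0, 1.0) for a in low]     # calibrated source sample
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, high)]  # linear probability path
    v_target = [b - a for a, b in zip(x0, high)]            # constant velocity along the path
    return x0, xt, v_target

rng = random.Random(0)
low_batch = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]    # low-fidelity fields on the fine grid
high_batch = [[0.1, 1.2, 1.9], [1.1, 0.8, 1.0]]   # matching high-fidelity fields
sigma = residual_scale(low_batch, high_batch)
x0, xt, v = flow_matching_pair(low_batch[0], high_batch[0], sigma, t=0.5, rng=rng)
# One Euler step with the exact target velocity lands on the high-fidelity field.
one_step = [a + b for a, b in zip(x0, v)]
```

In training, a velocity network conditioned on the low-fidelity field would regress `v` from `xt` and `t`; the last line shows why a well-fit network on this path could be run with a single evaluation per level.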
