arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.04222 2026-06-08 eess.SY cs.RO cs.SY Updated version

Safety by Invariance, Liveness through Refinement: Heterogeneous Contract Framework for Co-Design of Layered Control

Yoshinari Takayama, Alessio Iovine, Bart Besselink, Guillaume Sandou, Adnane Saoud

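AI summary: To address the lack of a uniform specification language, of interconnection guarantees across time scales, and of compositional separation between layers in layered control architectures, the paper imports the safety-liveness decomposition into a heterogeneous assume-guarantee contract framework: safety is enforced by invariance at the continuous-time layer, liveness is achieved through refinement at the discrete-time layer, and the inter-layer coordination conditions are formalized.

Details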
Comments
21 pages
Abstract

Real-world control systems must achieve long-horizon objectives (liveness) while respecting continuous-time safety constraints, a combination that motivates hierarchical layered control architectures (LCAs). Existing LCA research, however, lacks (i) a uniform specification language across discrete planning and continuous execution, (ii) formal guarantees that specifications are preserved when interconnecting subsystems at heterogeneous time scales, and (iii) compositional separation between layers, owing to reliance on naive input-filtering laws. This paper addresses all three gaps by importing the safety-liveness decomposition into a heterogeneous assume-guarantee framework: safety is enforced by invariance at the continuous-time layer, while liveness is achieved through refinement at the discrete-time layer, with inter-layer coordination formalized via vertical refinement and timing-compatibility conditions. We instantiate this contract with a novel LCA combining an MPC planner, an input-to-state stabilizing (ISS) low-level controller, and a reference-governor bridge, and validate it on a Hybrid Energy Storage System (HESS) comprising a battery and a supercapacitor.

2605.04130 2026-06-08 cs.LG Updated version

Constrained Extreme Gradient Boosting for Adapting Reduced-Order Models

Melika Baghi, Xiao Liu, Kamran Paynabar

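AI summary: Proposes the Constrained Extreme Gradient Boosting (cXGBoost) framework, which uses a geometric representation on the Grassmann manifold together with a norm constraint to predict parameter-dependent POD bases, enabling efficient, adaptive reduced-order modeling.

Details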
Comments
Preprint. Under review. 4 numerical examples
Abstract

High-fidelity simulations, such as computational fluid dynamics and finite element analysis, are essential for modeling complex engineering systems but are often prohibitively expensive for tasks including parametric studies, optimization, and real-time control. Projection-based reduced-order models (ROMs) alleviate this cost by projecting the governing dynamics onto low-dimensional subspaces. However, their performance can deteriorate under parameter variation, motivating the need for adaptive basis construction. In this work, we propose a constrained ensemble learning framework, termed Constrained Extreme Gradient Boosting (cXGBoost), for predicting Proper Orthogonal Decomposition (POD) bases as functions of system parameters. The approach leverages a geometric representation of subspaces on the Grassmann manifold, which are mapped to a Euclidean space to enable efficient regression using gradient boosting trees. A norm constraint is imposed during training to ensure the validity of the inverse mapping and preserve the geometric structure of the predicted subspaces. The proposed method is evaluated on four numerical examples, including fluid dynamics and wave propagation problems, demonstrating its ability to accurately predict parameter-dependent bases while maintaining robustness across nonlinear regimes. These results highlight the potential of combining geometric learning with constrained ensemble methods for scalable and reliable reduced-order modeling of high-dimensional parametric systems.

2606.06397 2026-06-08 cs.LG Updated version

The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning

Shuo Wang, Xiangyu Wang, Quanxin Wang, Bailin Wu, Bokui Wang, Shunyang Huang, Boyan Deng, Haonan Liu, Ruiyi Fang, Zhenxiang Xu, Boyu Wang, Zhao Kang

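AI summary: To address uniform benchmarks masking geometry-dependent performance in relational learning, proposes a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes, shows that model performance is fundamentally geometry-dependent, and offers a more reliable evaluation protocol.

Details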
Comments
Suggestions and comments are welcomed
Abstract

Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces systematic bias: it obscures geometry-dependent performance variations and can lead to misleading conclusions about model generalization. In this work, we identify intrinsic geometry as a key latent factor governing model effectiveness. We demonstrate that conventional aggregated metrics mask critical performance trade-offs that only become visible when datasets are stratified by their geometric properties. To address this issue, we introduce a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes. Our benchmark evaluates 18 representative models including Graph Convolutional Networks (GCNs), Graph Foundation Models (GFMs), and tabular learning methods across 14 datasets. We find that model rankings are highly stable within each curvature regime but shift significantly across regimes, indicating that performance is fundamentally geometry-dependent rather than universally transferable. Notably, we identify regimes where GFMs offer diminishing returns compared to geometry-aligned GNNs. Based on these findings, we propose a geometry-aware evaluation protocol that yields more reliable and interpretable comparisons than standard aggregated benchmarks. We release all code, curvature-stratified dataset splits, and evaluation tools to support reproducible and rigorous assessment of future relational learning methods. Code and datasets are provided in our project homepage: https://sirbabbage.github.io/CurvBench_HOME/.

2606.06224 2026-06-08 cs.CV cs.LG Updated version

Symb-xMIL: Symbolic Explanations for Multiple Instance Learning in Digital Pathology

Yanqing Luo, Julius Hense, Niklas Prenißl, Andreas Mock, Klaus-Robert Müller, Thomas Schnake, Mina Jamshidi Idaji

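AI summary: Proposes the Symb-xMIL framework, which provides structured symbolic explanations for multiple instance learning by quantifying how closely model behavior aligns with human-readable decision rules (logical relationships), and validates it on synthetic and real-world pathology data.

Details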
Comments
23 pages, 18 figures
Abstract

Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods primarily rely on heatmaps that highlight influential regions but do not explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We introduce Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how a MIL model's behavior aligns with human-readable decision rules, expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal semantic patterns underlying the model's predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC, a cohort of head and neck cancer, our framework refines patient survival stratification beyond HPV status with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability beyond visual attribution toward structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.

2606.06048 2026-06-08 cs.CV Updated version

LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations

Mritula Chandrasekaran, Sanket Kachole, Jarek Francik, Dimitrios Makris

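AI summary: Proposes a multimodal LLM-guided framework that synthesizes 3D pathological gait data from structured textual descriptions, combining motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation to improve downstream classification.

Details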
Comments
Accepted at CVPR MOMA Workshop 2026 and selected for spotlight presentation at the workshop
Abstract

Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The proposed method generates fixed-length synthetic skeleton-based gait sequences for pathological gait classification tasks. The framework combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation. A key contribution is the proposed pathological tokeniser, which is designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that the proposed synthetic sequences improve downstream classification for recurrent classifiers when combined with real data. The best result is obtained using a GRU classifier trained with real and synthetic samples, achieving 92.77% accuracy under a leave-one-subject-out protocol.

2606.06042 2026-06-08 cs.CV Updated version

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong, Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Pipei Huang, Hao Jiang

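AI summary: Proposes LoomVideo, an efficient 5B-parameter unified architecture that performs video generation and editing using a multimodal large language model and a zero-overhead Scale-and-Add conditioning mechanism, substantially reducing computational complexity and speeding up inference.

Details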
Abstract

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
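The cost argument in this abstract can be illustrated with a back-of-the-envelope sketch. Everything named here is an assumption for illustration: the helper functions, the token count, the toy latents, and the scale value of 0.5 are hypothetical, and the abstract does not say how LoomVideo chooses or learns its scale.

```python
def attention_pair_count(seq_len: int) -> int:
    """Self-attention scores every token against every token: seq_len^2 pairs."""
    return seq_len * seq_len


def scale_and_add(noised_target, clean_source, scale=0.5):
    """Element-wise conditioning: target latent + scale * source latent.

    The result has the same length as the target, so the attention sequence
    does not grow. The scale value is a hypothetical placeholder.
    """
    return [t + scale * s for t, s in zip(noised_target, clean_source)]


n = 1024  # hypothetical number of target-video tokens

concat_cost = attention_pair_count(2 * n)  # source tokens concatenated to target
added_cost = attention_pair_count(n)       # source latent added in place
cost_ratio = concat_cost // added_cost     # doubling the length quadruples the pairs

conditioned = scale_and_add([1.0, 2.0], [4.0, 6.0], scale=0.5)
```

The sketch only counts attention pairs; it says nothing about the reported 5.41x end-to-end speedup, which depends on the full model and hardware.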

2606.06002 2026-06-08 cs.CV Updated version

Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma

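AI summary: Proposes a global-local Monte Carlo tree search method that uses a hierarchical scene representation and PRM-guided MCTS to address error propagation in text-to-3D indoor scene generation, and builds a new benchmark dataset, 3DTindo-bench.

Details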
Abstract

Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. This representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art methods.

2606.05967 2026-06-08 stat.ML cs.LG Updated version

Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation, Universal Learning Steps and I.I.D. Samples

Ziad Kobeissi, Éloïse Berthier

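AI summary: For TD(0) with linear function approximation under i.i.d. samples and a constant learning step, establishes a mean-square-error convergence rate that is fast (of order 1/k), robust (independent of the smallest eigenvalue), and sharp (up to a multiplicative constant below 11), and introduces the PCTD(0) variant, which converges better under a strong-mixing assumption.

Details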
Journal ref
AISTATS 2026, May 2026, Tanger, Morocco
Comments
This is an extended version of a paper accepted at AISTATS 2026
Abstract

In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. We establish a new convergence rate, for the Mean-Square Error (MSE) on the approximated function, that is (i) fast in the sense that it admits an optimal dependency in the number of iterations k (i.e., of order 1/k), (ii) robust to ill-conditioning: it only depends on an initial error and model-independent constants, and (iii) sharp up to a multiplicative constant lower than 11. In particular, it does not depend on the smallest eigenvalue of the uncentered covariance matrix of the linear parametrization, unlike all pre-existing O(1/k) rates in the TD(0) literature. We also introduce PCTD(0), a variant of TD(0), which benefits from better convergence properties under an additional assumption of strong mixing on the Markov Chain.
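The setting studied here (TD(0), linear features, i.i.d. samples, constant step, iterate averaging) can be written down in a few lines. The following is a minimal sketch of the standard algorithm, not the authors' code and not the PCTD(0) variant; the two-state chain, the one-hot features, the step size 0.1, and all function names are assumptions chosen for illustration.

```python
import random


def averaged_td0(sample, features, gamma, step, num_iters, dim):
    """TD(0) with linear function approximation, a constant step, and
    Polyak-Juditsky (running-average) iterates. Returns the averaged weights."""
    theta = [0.0] * dim      # current weights
    theta_bar = [0.0] * dim  # running average of the weights
    for k in range(1, num_iters + 1):
        s, r, s_next = sample()  # one i.i.d. transition
        phi, phi_next = features(s), features(s_next)
        v = sum(t * p for t, p in zip(theta, phi))
        v_next = sum(t * p for t, p in zip(theta, phi_next))
        td_error = r + gamma * v_next - v
        theta = [t + step * td_error * p for t, p in zip(theta, phi)]
        theta_bar = [b + (t - b) / k for b, t in zip(theta_bar, theta)]
    return theta_bar


# Toy two-state chain (hypothetical): current and next state are both uniform,
# and the reward is 1 in state 0 and 0 in state 1.
rng = random.Random(0)


def sample():
    s = rng.randrange(2)
    s_next = rng.randrange(2)
    return s, (1.0 if s == 0 else 0.0), s_next


def one_hot(s):
    return [1.0 if i == s else 0.0 for i in range(2)]


theta_bar = averaged_td0(sample, one_hot, gamma=0.5, step=0.1,
                         num_iters=20000, dim=2)
# With gamma = 0.5 the exact values solve V = r + gamma * P V,
# which gives V(0) = 1.5 and V(1) = 0.5.
```

With one-hot features the TD fixed point coincides with the exact value function, so the averaged weights should land close to [1.5, 0.5]; the sketch does not attempt to reproduce the paper's constants or rate.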

2606.05949 2026-06-08 cs.CV Updated version

Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

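AI summary: Proposes the FEPBench benchmark, which uses fine-grained atom set annotations and a three-dimensional evaluation (instruction faithfulness, reasoning enrichment, semantic precision) to systematically assess T2I models on natural-science illustration generation; finds that even state-of-the-art closed-source models still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision.

Details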
Abstract

Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released by us.

2606.05919 2026-06-08 stat.ML cs.LG econ.EM stat.CO Updated version

Finding Most Influential Sets

Lucas D. Konrad, Nikolas Kuschnig

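AI summary: For estimands with linear-fractional leave-set-out effects, proposes an efficient algorithm based on Dinkelbach's method that reduces selection of the most influential set to a one-parameter sequence of top-k problems and attains a globally optimal solution.

Details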
Comments
Published as a conference paper at ICML 2026, fixed ref
Abstract

Identifying most influential sets (MIS) - size-$k$ subsets whose removal maximally changes a target estimand - is typically infeasible because it requires searching over $\binom{n}{k}$ subsets. For estimands with linear-fractional leave-set-out effects, we show that MIS selection reduces to a one-parameter sequence of top-$k$ problems. Dinkelbach's method yields an algorithm with $\mathcal{O}(n)$ cost per iteration and finite termination. For fixed residualized inputs, the algorithm returns a globally optimal set for the univariate ratio objective, including the oracle-residualized partial linear model. With estimated nuisance functions, uniform denominator and generated-score stability imply approximation to the first-order oracle orthogonal-score objective; exact set recovery follows under a separation condition. Simulations and applications show that the method recovers exact MIS that were previously computationally inaccessible.
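The reduction to a sequence of top-$k$ problems can be sketched on a generic ratio objective. This is an illustration under stated assumptions, not the paper's estimator: it maximizes $\sum_{i \in S} a_i / \sum_{i \in S} b_i$ over subsets of size $k$ with all $b_i > 0$, and the arrays `a` and `b`, the function name, and the tolerance are hypothetical. It also uses `heapq.nlargest`, which costs $\mathcal{O}(n \log k)$ per iteration; the $\mathcal{O}(n)$ cost stated in the abstract would need a linear-time selection routine instead.

```python
import heapq


def top_k_ratio_dinkelbach(a, b, k, tol=1e-12, max_iter=100):
    """Maximize sum(a[i]) / sum(b[i]) over size-k index sets, assuming b[i] > 0.

    Dinkelbach's method: for the current ratio lam, the best set for the
    parametric objective sum(a[i] - lam * b[i]) is simply the top-k scores.
    Stop once that parametric optimum is no longer positive.
    """
    n = len(a)
    best = list(range(k))  # arbitrary starting set
    lam = sum(a[i] for i in best) / sum(b[i] for i in best)
    for _ in range(max_iter):
        cand = heapq.nlargest(k, range(n), key=lambda i: a[i] - lam * b[i])
        gain = sum(a[i] - lam * b[i] for i in cand)
        if gain <= tol:  # no size-k set beats the current ratio
            break
        best = cand
        lam = sum(a[i] for i in best) / sum(b[i] for i in best)
    return sorted(best), lam


# Hypothetical per-observation numerator and denominator contributions.
a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
b = [2.0, 1.0, 3.0, 4.0, 1.0, 6.0]
best_set, best_ratio = top_k_ratio_dinkelbach(a, b, k=2)
```

Because the ratio strictly increases at every update and there are finitely many size-$k$ sets, the loop terminates after finitely many iterations, which mirrors the finite-termination claim in the abstract.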

2606.05763 2026-06-08 eess.AS cs.SD Updated version

M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

Fei Su, Cancan Li, Ming Li, Juan Liu

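AI summary: Proposes a modality-aware multi-view self-supervised representation framework (M2S-AVSR) that learns view-invariant visual speech representations through multi-view encoding and performs fine-grained fusion with a modality-aware module to handle viewpoint variation, audio distortion, and visual occlusion, achieving the best performance on multiple benchmarks.

Details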
Comments
submitted to IEEE Transactions on Audio, Speech, and Language Processing
Abstract

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we release AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.

2606.05761 2026-06-08 cs.AI cs.CL Updated version

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang, Yu Cheng, Yang Yang

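AI summary: Proposes the SubtleMemory benchmark, which constructs relation-controlled latent semantic artifacts, embeds them into user-agent interaction histories, and evaluates how well long-horizon AI agents recover distributed relational structures during later queries.

Details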
Comments
48 pages
Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.

2606.05739 2026-06-08 cs.SD eess.AS Updated version

Do speech foundation models perceive speaker similarity as humans do?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

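AI summary: Compares the speaker embeddings of more than 40 speech foundation models with human subjective similarity ratings to examine whether model-derived distances agree with human perception, and identifies the key configuration factors that affect this agreement.

Details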
Comments
Accepted by INTERSPEECH 2026
Abstract

This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.

2606.05711 2026-06-08 cs.CL Updated version

Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

Yingzhuo Liu

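AI summary: Proposes a unified three-axis framework (what is communicated, sender-receiver alignment, and how information is fused), systematically categorises 18 latent communication methods proposed between 2024 and 2026, identifies five design patterns, and surfaces open challenges.

Details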
Abstract

Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

2606.05682 2026-06-08 cs.AI cs.LG Updated version

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

Fangbo Tu, Junhua Zhao, Chi Liu, Xin Chen, Haifeng Wu, Jian Wan, Srinivasan Manoharan

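AI summary: To address the degradation of internal representations caused by output-only matching in NVFP4 low-precision quantization-aware distillation, proposes CKA-QAD, which preserves internal geometry through CKA-guided alignment of layerwise Gram matrices and improves accuracy on reasoning and coding tasks.

Details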
Comments
13 pages, 1 figure
Abstract

Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency- and cost-constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low-bit quantization by training a quantized student to match the output distribution of a frozen higher-precision teacher via a KL-divergence loss. In this work, we first provide a representation-level diagnosis of QAD: output matching alone can mask internal degradation, because many intermediate activation geometries can yield similar teacher-aligned logits. Using CKA, we show that KL-only QAD can reduce layerwise representational similarity relative to the BF16 teacher, with especially severe drift in RL-post-trained models. This drift correlates with downstream bottlenecks on reasoning and coding tasks, suggesting that low-bit recovery requires preserving internal geometry rather than matching outputs alone. Motivated by this finding, we propose CKA-QAD, a CKA-guided representational alignment method for NVFP4 QAD and low-bit LLM accuracy recovery. The method adds a lightweight regularizer that preserves internal representational geometry during distillation by aligning layerwise Gram matrices through CKA. Across Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD substantially improves representational alignment and raises downstream reasoning and coding accuracy with modest training overhead. Our findings position CKA-guided representational alignment as a practical complement to output matching for quantized LLM recovery.
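
The abstract above describes a regularizer that aligns layerwise Gram matrices through CKA. As a rough illustration only, linear CKA between two sets of activations can be computed from centered Gram matrices as sketched below. The function names, the toy activations, and the use of `1 - CKA` as a penalty are assumptions made for this sketch; the abstract does not state the exact kernel, loss form, or layer weighting used by CKA-QAD.

```python
import math

def gram(x):
    """Linear Gram matrix K = X X^T for activations x (n samples x d features)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]

def center(k):
    """Double-center a Gram matrix: K_c = H K H with H = I - (1/n) 1 1^T."""
    n = len(k)
    row = [sum(r) / n for r in k]
    col = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[k[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def frob(a, b):
    """Frobenius inner product of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    kc, lc = center(gram(x)), center(gram(y))
    return frob(kc, lc) / math.sqrt(frob(kc, kc) * frob(lc, lc))

# Toy activations (hypothetical): 4 samples, 2 features each.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
# A rotated and rescaled copy keeps the same geometry, so CKA stays at 1
# up to floating-point error.
student_rotated = [[-3.0 * b, 3.0 * a] for a, b in teacher]
# A nonlinear distortion changes the geometry, so CKA drops below 1.
student_drifted = [[a, a * a - b] for a, b in teacher]

# Assumed penalty form: zero when the geometry is preserved.
penalty = 1.0 - linear_cka(teacher, student_drifted)
```

Because CKA is invariant to rotation and isotropic scaling of the features, a penalty of this kind only reacts to changes in the relative arrangement of samples, which is the sense of "internal geometry" used in the abstract.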

2606.05654 2026-06-08 cs.SE cs.AI cs.LG 版本更新

When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability

当表面形式改变审核决策:代码混合工作流不稳定性的配对研究

Suraj Babu Thimma Krishnaram, Yibo Hu, Karthikeyan Saravanan

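AI summary: Using a paired evaluation setting, the study examines how a hate-moderation workflow changes between clean English and Tamil-English code-mixed inputs, finding that code-mixing produces a decision flip rate of 0.265 and increases review burden and false flags.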

详情
AI中文摘要

仇恨审核通常被评估为对清洁英语输入的分类,但部署的系统必须将内容路由到诸如ALLOW、FLAG或REVIEW等操作。我们通过配对评估设置研究这种工作流在代码混合输入下的变化,其中相同的基础内容以清洁英语和泰米尔语-英语代码混合形式表达。在基于清洁英语开发数据调整的阈值下,代码混合输入产生显著的动作不稳定性,配对清洁到代码混合决策翻转率为0.265。主要工作流影响是增加的审核负担和增加的非仇恨内容误报:审核率从0.138上升到0.297,非仇恨误报率从0.069上升到0.104。仅泰米尔语输入整体表现出更强的退化,表明存在更广泛的语言覆盖限制,而非相同的代码混合不稳定性模式。一个简单的基于分歧的延迟规则减少了压力输入上的自动错误,但只能通过增加审核负载。这些结果表明,工作流级别的评估揭示了标准分类摘要可能遗漏的审核失败。

英文摘要

Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern. A simple disagreement-based deferral rule reduces automatic errors on stressed inputs, but only by increasing review load. These results show that workflow-level evaluation reveals moderation failures that standard classification summaries can miss.
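
To make the workflow metrics above concrete, the sketch below routes classifier scores to ALLOW, REVIEW, or FLAG and computes a paired decision flip rate and a review rate. The threshold values, the function names, and the scores are invented for illustration; they are not the study's thresholds or data.

```python
def route(score, review_at=0.4, flag_at=0.7):
    """Map a hate score in [0, 1] to an action. Thresholds are hypothetical."""
    if score >= flag_at:
        return "FLAG"
    if score >= review_at:
        return "REVIEW"
    return "ALLOW"

def flip_rate(clean_actions, mixed_actions):
    """Fraction of paired items whose action differs between the two forms."""
    pairs = list(zip(clean_actions, mixed_actions))
    return sum(a != b for a, b in pairs) / len(pairs)

def review_rate(actions):
    """Fraction of items routed to human review."""
    return sum(a == "REVIEW" for a in actions) / len(actions)

# Toy paired scores: index i is the same content in clean English
# and in its code-mixed rendering.
clean_scores = [0.10, 0.85, 0.30, 0.75, 0.20]
mixed_scores = [0.15, 0.55, 0.45, 0.80, 0.10]

clean = [route(s) for s in clean_scores]
mixed = [route(s) for s in mixed_scores]

flips = flip_rate(clean, mixed)
reviews_clean = review_rate(clean)
reviews_mixed = review_rate(mixed)
```

Pairing by content is what separates this from an ordinary accuracy comparison: a flip is counted even when both actions are defensible, because the change in surface form alone moved the item to a different part of the workflow.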

2606.05342 2026-06-08 cs.AI 版本更新

SentinelBench: A Benchmark for Long-Running Monitoring Agents

SentinelBench: 一个用于长时间运行监控代理的基准测试

Matheus Kunzler Maldaner, Adam Fourney, Amanda Swearngin, Hussein Mozannar, Gagan Bansal, Maya Murad, Rafah Hosn, Saleema Amershi

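AI summary: Introduces SentinelBench, a benchmark of 100 tasks across 10 synthetic web environments for evaluating AI agents on long-running monitoring tasks, measuring task completion, reaction time, and resource use.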

详情
Comments
18 pages, 16 figures
AI中文摘要

AI代理越来越多地被要求执行持续数分钟、数小时或更长时间的工作。然而,代理行为的默认模式是连续动作:发出工具调用、刷新页面、搜索替代方案或以其他方式试图强制推进。这对于许多长时间运行的任务来说是一种错误的方法,这些任务更适合采用持续关注的策略。相反,代理应该监控环境,注意到外部事件何时使进展成为可能,然后迅速响应,同时在等待时不浪费资源。为了衡量这类任务的进展,我们引入了SentinelBench,一个用于时间演化监控任务的开源基准测试。SentinelBench包含10个合成网络环境中的100个任务,包括电子邮件、日历、金融、专业网络和娱乐。每个环境都提供一个实时网络界面,并重放一个脚本化的事件序列,要求代理导航和推理状态不断变化的网页。SentinelBench衡量任务完成度、反应时间和资源使用,揭示了响应性与成本之间的权衡。我们报告了三种模型和两种浏览器代理框架的结果,为未来比较建立了性能基线,并展示了代理设计选择如何显著影响关键指标。总之,这些结果表明SentinelBench能够区分代理行为中的有意义差异。

英文摘要

AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continuous action: issuing tool calls, refreshing pages, searching for alternatives, or otherwise trying to force progress. This is the wrong approach for many long-running tasks, which are better served by a strategy of sustained attention. Instead, agents should monitor an environment, notice when an external event makes progress possible, then respond promptly without wasting resources while waiting. To measure progress on this class of tasks, we introduce SentinelBench, an open-source benchmark for time-evolving monitoring tasks. SentinelBench contains 100 tasks across 10 synthetic web environments, including email, calendars, finance, professional networking, and entertainment. Each environment exposes a live web interface and replays a scripted sequence of events, requiring agents to navigate and reason about web pages whose state shifts underfoot. SentinelBench measures task completion, reaction time, and resource use, exposing the tradeoff between responsiveness and cost. We report results across three models and two browser-agent harnesses, establishing performance baselines for future comparison and demonstrating how agent design choices can dramatically impact key metrics. Together, these results show that SentinelBench distinguishes meaningful differences in agent behavior.

2606.05152 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Reinforcement Learning from Rich Feedback with Distributional DAgger

利用丰富反馈的强化学习与分布式DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

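AI summary: Proposes DistIL, which uses a distributional variant of DAgger to exploit rich feedback (such as execution traces and tool outputs) through forward cross-entropy optimization, achieving monotonic policy improvement and better Pass@N performance.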

详情
AI中文摘要

推理模型发展迅速,但主流的基于可验证奖励的强化学习(RLVR)方法仍然非常狭窄:采样多个响应,并用单个比特奖励每个响应,指示最终答案是否正确。然而,许多设置提供了丰富的反馈,包括执行轨迹、工具输出、专家修正和模型自我评估。我们研究如何通过经典模仿学习算法DAgger的分布式变体来使用这种反馈,其中学习器可以局部访问当前策略所访问状态上的专家分布。这产生了一个简单的前向交叉熵目标,该目标接受黑盒专家,并且其序列级梯度通过将未来的专家-学生分歧传播回早期决策来进行丰富的信用分配。我们表明,基于反向KL或Jensen-Shannon的先前具有自蒸馏目标的强化学习无法保证单调策略改进:即使专家具有更高的奖励,它们的更新也可能增加更差动作的概率。相比之下,我们证明前向交叉熵允许单调策略改进并享有遗憾保证。我们进一步表明,我们的目标优化了教师加权的成功可能性的下界,从而改进了Pass@N。实验上,我们的方法DistIL在科学推理、编程和解决困难数学问题等多个领域优于RLVR和基于自蒸馏的强化学习基线。

英文摘要

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.

2606.04874 2026-06-08 cs.CL 版本更新

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Agent规划基准:LLM Agent规划能力的诊断框架

Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng

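AI summary: Introduces the Agent Planning Benchmark (APB), which uses 4,209 multimodal cases across five settings to diagnose systematic weaknesses of LLM agents in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement.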

详情
AI中文摘要

规划是LLM Agent的核心:在行动之前,Agent必须分解目标、选择工具、推理约束并决定任务何时不可行。然而,现有的Agent评估通常只报告端到端的成功率,使得难以判断失败源于规划还是执行。我们引入了Agent规划基准(APB),一个针对规划的诊断基准,包含22个领域和五个设置下的4209个多模态案例,涵盖整体规划、反馈条件逐步规划以及在外来工具、损坏工具和不可解任务下的鲁棒性。在12个MLLM上,APB揭示了长程规划、工具噪声鲁棒性、校准拒绝和推理时改进方面的系统弱点。我们进一步在200个ToolSandbox任务和200个$τ^2$-bench任务上验证了APB,其中APB引导的改进在三个代表性模型上一致提高了计划正确性、计划等级和下游执行指标。因此,APB作为执行基准的上游诊断补充。

英文摘要

Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $τ^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks. The APB benchmark and code are available at https://github.com/Mikivishy/AgentPlanningBenchmark.

2606.04812 2026-06-08 cs.LG cs.AI 版本更新

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

面向风险感知强化学习的情景生成与近似安全保证

Mohit Prashant, Arvind Easwaran

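AI summary: To address unsafe behaviour caused by the sensitivity of reinforcement learning policies to transition perturbations, the authors approximate the state-space distribution with a variational autoencoder, construct upper- and lower-bound barrier certificates, and sample states from the non-robust region to tighten the probabilistic safety guarantees.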

详情
Comments
8 pages, preprint
AI中文摘要

保证安全性对于强化学习(RL)智能体在现实世界中的部署至关重要,尤其是使用深度RL学习的策略可能表现出对转移扰动的敏感性,从而导致未知或不安全的行为。一种策略验证方法是通过采样相对于安全约束的策略轨迹来构造概率屏障证书,从而将已知的安全行为与未知行为区分开来。如果策略容易受到转移不确定性或扰动的影响,使智能体处于未充分探索的状态,则难以获得这些约束违反概率的严格上下界。为了解决这个问题,我们使用变分自编码器(VAE)近似遇到的状态空间的分布,并利用状态的潜在特征构造上下界屏障证书,以高置信度优化已知安全行为的区域。我们在工作中将其表述为一个对偶优化问题,其中下界屏障证书比上界屏障证书提供更保守的安全区域估计。在训练期间采样位于两者集合差(即非鲁棒区域)内的状态,使我们能够收紧上下界,从而提供更尖锐的概率安全保证。在我们的研究中,我们描述了所放置的保证,并通过实验证明了我们边界的紧致性。

英文摘要

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. One method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper- and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this as a dual optimization problem in which the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e., the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.

2606.04373 2026-06-08 cs.CV cs.AI 版本更新

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

解耦信息区域的选择性耦合:用于视觉Transformer无数据量化的掩码注意力对齐

Biao Qian, Yang Wang, Yong Wu, Jungong Han

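AI summary: Proposes MaskAQ, which decouples informative regions in synthetic samples and aligns the full-precision and quantized models through masked attention, addressing the distribution mismatch in data-free quantization.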

详情
Comments
Accepted to appear at ICML 2026, Seoul, Korea
AI中文摘要

无数据量化(DFQ)通过合成样本解决数据安全问题,无需访问真实数据。由于自注意力机制相比经典卷积运算的优势,DFQ在视觉Transformer(ViT)中日益受到关注。然而,先前的ViT DFQ方法常遭受合成样本与量化模型Q期望输入分布之间的分布不匹配,导致性能次优。本文提出一种新颖的掩码注意力对齐方法用于ViT的无数据量化,称为MaskAQ,揭示了:1)自注意力机制中的语义主要局限于稀疏的补丁子集,称为信息区域;2)信息区域主导了合成样本与Q输出之间的互信息。为此,我们利用合成样本补丁相似性的微分熵最大化,从噪声背景中解耦信息区域。为了与不同的Q耦合,通过掩码注意力对齐目标选择信息区域以对齐全精度模型与Q,从而产生高质量的合成样本。此外,提出周期性样本刷新策略,使MaskAQ能够在训练过程中持续适应Q的演化状态,以保持与合成样本的理想互信息。大量实验验证了MaskAQ在多个骨干网络和下游任务上优于最先进方法。我们的代码可在https://github.com/hfutqian/MaskAQ获取。

英文摘要

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism over classical convolutional operations. However, previous DFQ methods for ViTs often suffer from a distribution mismatch between synthetic samples and the input distribution expected by quantized models Q, resulting in suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism are predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To this end, we incorporate differential entropy maximization over the patch similarity of synthetic samples to decouple informative regions from the noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy is introduced so that MaskAQ can continually adapt to the evolving state of Q throughout training and preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

2606.04349 2026-06-08 cs.CV cs.AI 版本更新

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

MorphoQuant: 面向全模态大语言模型的模态感知量化

Yue Wu, Changyuan Wang, Zixuan Wang, Shilin Ma, Yansong Tang

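AI summary: Proposes the MorphoQuant framework, which uses distribution-aware bias compensation and morphology-directed quantization function optimization to address distribution heterogeneity and outliers in 4-bit post-training quantization of omni-modal large language models, achieving a strong accuracy-efficiency trade-off.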

详情
AI中文摘要

传统的后训练量化方法在处理4比特全模态大语言模型时,由于跨模态的极端分布异质性和不同的异常值模式而面临困难。为了解决这一问题,我们提出了MorphoQuant,一种模态感知的PTQ框架,旨在保留跨模态形态并减轻异常值损失。具体来说,我们引入了分布感知偏差补偿,它选择性地将长尾异常值吸收到通道偏差中。该机制在保持异常值幅度的同时,为密集内点维持高精度离散化,从而在多样的模态分布中保持精确的离散化。作为补充,我们提出了形态导向量化函数优化,以协同优化量化网格与偏差掩码,确保跨模态的细粒度对齐。在Qwen2.5-Omni上对MMMU和Video-MME等基准的广泛评估证明了我们方法的优越性。值得注意的是,我们的W4A4模型在ScienceQA上达到了76.63%,显著优于最先进的W4A4方法,并意外地超越了W4A16基线,这充分展示了我们框架在精度-效率权衡方面的卓越表现。

英文摘要

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distributions. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks such as MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which demonstrates the accuracy-efficiency trade-off of our framework.

2606.04101 2026-06-08 cs.DC cs.LG 版本更新

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

UltraEP:在机架级节点上以近最优负载均衡释放MoE训练与推理

Xinming Wei, Chao Jin, Tuo Dai, Yinmin Zhong, Shan Yu, Chengxu Yang, Bingyang Wu, Zili Zhang, Jing Mai, Qianchao Zhu, Zhouyang Li, Yuliang Liu, Guojie Luo

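AI summary: Presents UltraEP, the first exact-load, real-time balancer, which co-designs plan solving and expert replication communication to rebalance every microbatch and layer for MoE training and inference on rack-scale nodes, reaching 94.3% of the force-balanced ideal throughput.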

详情
Comments
The authors have identified issues related to information disclosure in the current version of the manuscript and therefore request its withdrawal. A revised version may be prepared at a later date
AI中文摘要

大规模专家并行(EP)正成为训练和服务前沿MoE模型的关键,但它也加剧了设备级专家负载不均衡,导致计算掉队者、令牌全对全瓶颈和激活内存峰值。现有的均衡器基于历史负载定期重新分配专家,这对于具有非平稳负载模式的生产部署变得不可靠。我们提出UltraEP,首个用于大规模EP MoE训练和在机架级节点(RSN)上服务预填充的精确负载实时均衡器。基于RSN扩展的纵向扩展连接性,UltraEP在关键路径上对每个微批次和层进行重均衡,这需要规划求解和专家复制通信的非平凡协同设计,以最小化暴露的开销。为此,UltraEP通过高效的配额驱动规划对门控后负载做出积极反应,并利用RSN原生的持久tile流和基于中继的扇出缓解来执行由此产生的不规则专家状态传输。在训练和预填充中,平均涵盖106B到671B参数的MoE模型,UltraEP实现了力均衡理想吞吐量的94.3%,相比无均衡提升了1.49倍,同时将最终跨秩不均衡从1.30-4.01降低到1.01-1.04。此外,我们在2560个GPU的生产MoE训练中验证了UltraEP的可扩展性和鲁棒性。

英文摘要

Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering 1.49$\times$ improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30$-$4.01 to 1.01$-$1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.

2606.04058 2026-06-08 cs.LG cs.AI 版本更新

Spectral Scaling Laws of Muon

Muon的谱缩放定律

Gagik Magakyan, Pablo Parrilo, Asuman Ozdaglar

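AI summary: A systematic study of how the singular value spectrum of the momentum matrix in the Muon optimizer scales with model size, finding that it follows power laws and deriving a layer-aware method for choosing Newton-Schulz iteration configurations that reduces compute overhead.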

详情
AI中文摘要

正交归一化更新规则已迅速成为训练大型语言模型的主流优化器选择,最近的开源最先进模型采用了Muon。为了保持这些更新的可处理性,Muon使用牛顿-舒尔茨(NS)迭代执行正交归一化。由于NS只是近似,小奇异值的方向无法被正交归一化。在Muon中,NS每一步都应用于动量矩阵,然而关于这些动量矩阵的奇异值谱在训练过程中如何行为,以及该行为如何随模型大小变化,我们知之甚少。我们首次系统研究了这一问题。通过追踪从77M到2.8B参数模型中各层动量缓冲区的奇异值分位数,我们观察到一致的现象:在短暂的预热后,分位数稳定在一个由层类型和模型大小决定的值上。这些稳定值随模型大小呈现出非常清晰的幂律,且指数依赖于层。中后深度的层随模型大小$M$的缩放非常温和(约$M^{-0.25}$),因此学术规模下使用的标准5步NS配置将在更大规模下继续对它们进行正交归一化。然而,某些后期层的缩放更为激进(高达$M^{-0.96}$),在前沿规模下将落入NS失效区域,除非使用更多NS迭代或更好调整的系数。NS迭代在规模上计算成本高昂;我们的定律为从业者提供了一种有原则的、层感知的配方,用于选择最小的NS配置,该配置仍能正交归一化重要的方向——在不牺牲更新质量的情况下避免不必要的计算。

英文摘要

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton--Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter -- avoiding unnecessary computation without sacrificing update quality.
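
The abstract's point that directions with small singular values fail to be orthonormalized can be seen in a scalar sketch. For a matrix X = U diag(s) V^T, the classical cubic Newton-Schulz update X <- 1.5 X - 0.5 X X^T X acts on each singular value independently through s <- 1.5 s - 0.5 s^3. The code below uses that textbook cubic map, which is an assumption for illustration: Muon's implementation uses its own tuned polynomial coefficients, and the reference quantile passed to `extrapolate` is a hypothetical placeholder, not a value from the paper.

```python
def ns_cubic(s, steps=5):
    """Apply the classical cubic Newton-Schulz map to one singular value."""
    for _ in range(steps):
        s = 1.5 * s - 0.5 * s ** 3
    return s

def extrapolate(q_ref, m_ref, m_new, exponent):
    """Power-law extrapolation q(M) = q_ref * (M / m_ref) ** exponent."""
    return q_ref * (m_new / m_ref) ** exponent

# A moderately sized singular value is pushed close to 1 within 5 steps.
moderate = ns_cubic(0.5, steps=5)

# A tiny singular value grows by roughly 1.5x per step and stays far from 1.
tiny = ns_cubic(1e-3, steps=5)

# Relative shrinkage of a singular-value quantile when the model grows 100x,
# using the two exponents quoted in the abstract.
mild_shrink = extrapolate(1.0, 1.0, 100.0, -0.25)
steep_shrink = extrapolate(1.0, 1.0, 100.0, -0.96)
```

Under these assumptions, a layer whose quantiles shrink like M^{-0.25} loses about a factor of three over a 100x increase in size, while a layer scaling like M^{-0.96} loses nearly two orders of magnitude, which is where a fixed number of iterations stops being enough.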

2606.03889 2026-06-08 cs.CL 版本更新

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

RealClawBench: 来自真实开发者-智能体会话的实时OpenClaw基准测试

Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao

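AI summary: To address the lack of realism in existing benchmarks, the authors propose the RealClawBench framework, which uses reconstructed execution environments and deterministic verifiable scorers to convert real OpenClaw sessions into reproducible, automatically scored tasks; across 14 evaluated models, the best solves only 65.8% of tasks, revealing substantial headroom on developer-agent workloads.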

详情
Comments
19 pages, 5 figures, 8 tables
AI中文摘要

智能体基准测试应反映用户实际要求部署的智能体执行的任务,然而现有基准往往缺失真实开发者-智能体会话的关键真实性属性。我们引入RealClawBench,一个基于真实OpenClaw会话构建的实时基准框架,以捕获已部署智能体使用的分布、多样性和实际难度。真实用户请求难以基准测试,因为它们通常依赖本地执行环境,涉及隐含或未明确指定的意图,并且需要非平凡的验证。RealClawBench通过两个核心机制解决这些挑战:重构的执行环境和确定性可验证评分器,共同将真实会话转化为可复现、自动评分的任务。最终发布的版本包含从更大真实会话池中采样的281个可执行任务,同时保留源分布,最大最终与源分布的Jensen-Shannon散度为0.0448。评估14个当代模型显示,最佳系统仅解决65.8%的任务,揭示了在真实开发者-智能体工作负载上存在巨大的提升空间。通过将真实部署会话转化为受控评估实例,RealClawBench提供了一条实际路径,以构建能更好衡量智能体在实际使用中能力的基准测试。代码见:this https URL。

英文摘要

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
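
The distribution-preservation check mentioned above relies on the Jensen-Shannon divergence between the source pool and the sampled release. A minimal sketch for two categorical distributions follows. The category shares are hypothetical, and the choice of log base 2 (which bounds the divergence by 1) is an assumption, since the abstract states neither the base nor the attributes being compared.

```python
import math

def kl(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two categorical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical task-category shares in the source pool and in the sample.
source_shares = [0.50, 0.30, 0.20]
sample_shares = [0.48, 0.32, 0.20]

divergence = jsd(source_shares, sample_shares)
```

A value near zero indicates that the sample keeps the category mix of the pool, while a value of 1 (in bits) would mean the two distributions share no categories at all.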

2606.03559 2026-06-08 cs.LG math.OC stat.ML 版本更新

Analytical Evaluation of DCA Convergence Properties for Minimizing Prediction Functions of Gaussian RBF Support Vector Regression

高斯RBF支持向量回归预测函数最小化的DCA收敛性分析评估

Yohei Kakimoto, Yuto Omae, Hirotaka Takahashi

AI总结 针对以训练好的高斯RBF核支持向量回归(RBF-SVR)预测函数为目标函数的非凸优化问题,利用RBF核的解析结构构造显式DC分解,推导出DC分量强凸参数下界μ和子问题梯度Lipschitz常数上界L的闭式表达式,并通过数值实验表明特征量Cαρ主导DCA的收敛性和初始点依赖性。

详情
Comments
29 pages, 5 figures, 2 tables
AI中文摘要

对于目标函数为训练好的高斯径向基函数(RBF)核支持向量回归(SVR)模型(RBF-SVR)预测函数的非凸优化问题,我们提出一个框架,通过利用RBF核的解析结构构造显式的凸函数差(DC)分解,应用DC算法(DCA)。具体地,我们闭式推导了DC分量的强凸参数下界μ和子问题梯度Lipschitz常数上界L。μ和L完全由训练后的对偶系数和Cα、RBF核参数γ以及DC分解参数ρ决定,且共享共同主导项Cαρ。通过在六个基准函数上的数值实验,我们表明Cαρ是表征DCA收敛性质和初始点依赖性的主要单一量,并进一步证明它分解为两个独立路径C→Cα和γ→ρ,其主要变化由SVR超参数(C,γ)控制。这些结果使得RBF-SVR上DCA的收敛性质可以通过单一标量Cαρ预先评估:训练前近似从(C,γ)得到,训练后精确闭式得到。

英文摘要

For nonconvex optimization problems whose objective is the prediction function of a trained Support Vector Regression (SVR) model with the Gaussian radial basis function (RBF) kernel (RBF-SVR), we present a framework that applies the difference of convex functions (DC) algorithm (DCA) by exploiting the analytical structure of the RBF kernel to construct an explicit DC decomposition. Specifically, we derive in closed form both the lower bound $μ$ of the strong convexity parameter of the DC components and the upper bound $L$ of the gradient Lipschitz constant of the subproblem. Both $μ$ and $L$ are determined solely by the post-training dual-coefficient sum $C_α$ and the RBF kernel parameter $γ$, together with the DC decomposition parameter $ρ$, and they share a common leading term $C_αρ$. Through numerical experiments on six benchmark functions, we show that $C_αρ$ is the primary single quantity characterizing both the convergence properties and the initial-point dependence of DCA, and further demonstrate that it decomposes into two independent pathways, $C \to C_α$ and $γ\to ρ$, with its primary variation governed by the SVR hyperparameters $(C, γ)$. Together, these results allow the convergence properties of DCA on RBF-SVR to be assessed in advance through the single scalar quantity $C_αρ$: approximately from $(C, γ)$ before training, and exactly in closed form after training.

2606.03382 2026-06-08 cs.LG cs.AI 版本更新

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

局部引导,全局影响:高斯重塑信任区域解锁行为转变

Bingxu Liu, Jiashun Liu, Johan Obando-Ceron, Hao Wang, Runze Liu, Pablo Samuel Castro, Aaron Courville, Ling Pan

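AI summary: To address PPO's optimization failures in non-stationary environments, the authors propose Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region with a Gaussian kernel to obtain a non-monotonic constraint that preserves local stability while permitting the large policy updates that adaptation requires, achieving strong performance across a range of tasks.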

详情
Comments
21 pages
AI中文摘要

虽然近端策略优化(PPO)在静态环境中表现出色,但我们表明其标准优化范式在连续和非平稳环境中存在困难。失败并非源于模型容量不足或裁剪过于严格。相反,PPO执行持续、方向低效的局部更新,这表明缺乏几何感知引导来积累有意义的行为变化,最终阻碍向新行为模式的转变。尽管基于散度的正则化引入了部分几何感知,但其单调递增的惩罚隐式地阻止了大的策略偏差,即使这种转变对于有效适应是必要的。为了解决这一局限性,我们提出了高斯信任区域策略优化(GTR),它使用高斯核重塑信任区域。由此产生的约束是有界且非单调的,在提供强局部稳定性的同时,在持续的高优势更新下逐渐放松。为了进一步提高鲁棒性,我们引入了一个混合高斯锚点,它适应最近的策略轨迹,减少了由陈旧参考引起的方差。GTR与架构无关,在游戏、模拟机器人控制、开放世界探索和语言模型后训练中均取得了强性能。这些结果表明,几何感知的信任区域设计可以成为复杂非平稳环境中鲁棒强化学习的一个有前景的方向。我们的代码可在该 https URL 获取。

英文摘要

While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful behavioral change and ultimately hinders transitions toward new behavior patterns. Although divergence-based regularization introduces partial geometric awareness, its monotonically increasing penalties implicitly discourage large policy deviations, even when such shifts are necessary for effective adaptation. To address this limitation, we propose Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region using a Gaussian kernel. The resulting constraint is bounded and non-monotonic, providing strong local stability while progressively relaxing under sustained high-advantage updates. To further improve robustness, we introduce a Mixture Gaussian Anchor that adapts to recent policy trajectories, reducing variance induced by stale references. GTR is architecture-agnostic and achieves strong performance across games, simulated robotic control, open-world exploration, and language model post-training. These results demonstrate that geometry-aware trust-region design can be a promising direction for robust reinforcement learning in complex non-stationary environments. Our code is available at https://anonymous.4open.science/r/GTR_demo/README.md.

2606.03280 2026-06-08 cs.AI 版本更新

A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

Pythia多跳设置中跨模型激活迁移的负面结果

Peiyan Zhang, Jason Xin

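AI summary: Tests whether passing hidden states through a linear translation layer can improve downstream answering in a Pythia-160M to Pythia-410M multi-hop reasoning setting, and finds that offline representational alignment is not sufficient for useful causal communication.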

详情
Comments
16 pages, 6 figures
AI中文摘要

最近的研究表明,语言模型可以通过训练过程中生成数据中的隐藏信号传递行为特征。我们提出一个更直接和严格的渠道是否也可行:一个语言模型能否在推理时通过翻译和注入隐藏激活,而不是通过传递自然语言文本,将有用的中间推理状态传递给另一个模型?我们在一个受控的Pythia-160M到Pythia-410M多跳推理设置中测试了这个问题。线性翻译层学习了发送者和接收者隐藏状态之间的强归一化空间映射,归一化余弦相似度在不同种子下接近0.97。然而,当翻译后的激活在推理时注入接收者时,并没有改善下游回答。低强度加法注入仍接近无注入基线,置信区间跨越零。替换式注入始终具有破坏性,将翻译向量重新缩放到接收者隐藏状态范数也无法挽救性能。因此,这是一个有范围的负面结果:在这种设置中,离线表示对齐不足以在接收者内部实现有用的因果通信。

英文摘要

Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whether a different activation-mediated channel is viable: can one language model communicate a useful intermediate reasoning state to another at inference time through a post-hoc linear activation bridge, rather than through a textual or structured-token relay? We test this question in a controlled Pythia-160M to Pythia-410M multi-hop reasoning setting. A linear translation layer learns a strong normalized-space map between sender and receiver hidden states, with normalized cosine similarity near 0.97 across seeds. However, when the translated activations are injected into the receiver at inference time, they do not improve downstream answering. Low-strength additive injection remains near the no-injection baseline, with confidence intervals that cross zero. Replacement-style injection is consistently destructive, and rescaling translated vectors to the receiver hidden-state norm does not rescue performance. The result is therefore a scoped negative result: in this setting, offline representational alignment is not sufficient for useful causal communication inside the receiver.
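
The injection variants compared in the abstract (low-strength additive injection, replacement, and norm rescaling) can be written down in a few lines. The sketch below uses plain Python lists as stand-ins for hidden states; the weights `W`, bias `b`, vector sizes, and the strength `alpha` are hypothetical placeholders, whereas in the study the translation layer is learned and applied to real model activations.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def translate(w, b, h_sender):
    """Linear bridge: map a sender hidden state into the receiver's space."""
    return [sum(wij * hj for wij, hj in zip(row, h_sender)) + bi
            for row, bi in zip(w, b)]

def additive_injection(h_receiver, t, alpha):
    """Add a scaled translated vector to the receiver state."""
    return [hi + alpha * ti for hi, ti in zip(h_receiver, t)]

def rescale_to(t, h_receiver):
    """Rescale the translated vector to the receiver state's norm."""
    scale = norm(h_receiver) / norm(t)
    return [scale * ti for ti in t]

# Hypothetical bridge from a 2-dimensional sender to a 3-dimensional receiver.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.1]
h_sender = [2.0, 4.0]
h_receiver = [0.3, 0.4, 1.2]

t = translate(W, b, h_sender)
alignment = cosine(t, h_receiver)                   # offline alignment score
h_added = additive_injection(h_receiver, t, 0.1)    # low-strength additive variant
h_replaced = rescale_to(t, h_receiver)              # replacement with norm matching
```

The abstract's negative result is that a high offline `alignment` score of this kind did not translate into better answers once the injected state was actually used by the receiver.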

2606.03163 2026-06-08 cs.MA cs.AI cs.DC 版本更新

OpenAgenet / OAN Yellow Paper: Technical Architecture for Trust-Governed Resource Identity and Discovery

OpenAgenet/OAN:信任治理的智能体身份与发现技术架构

Jinliang Xu

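AI summary: Describes the technical architecture of OpenAgenet / OAN, a protocol-neutral trust layer that uses a role architecture, identity objects, a registration workflow, a Root governance lifecycle, a Root-verified package model, authorization-aware Discovery, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations to make identities in heterogeneous Agent frameworks (including MCP, A2A, ANP-like systems, and domain-specific protocols) admissible, discoverable, verifiable, and safe to interact with.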

详情
AI中文摘要

本文描述了OpenAgenet / OAN的技术架构。OAN是一个协议中立的信任层,用于开放的智能体互连。它规定了角色架构、身份对象、注册工作流、根治理生命周期、根验证包模型、授权感知发现、签名可信调用、验证要求、状态转换、安全属性、实现边界和部署考虑。该设计旨在支持异构智能体框架和交互协议,包括MCP、A2A、ANP类系统以及领域特定的智能体协议。OAN不定义智能体之间的完整业务对话;它定义了在特定协议交互开始之前,智能体身份如何变得可接纳、可发现、可验证且安全可接近。

英文摘要

This yellow paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection and discoverable AI resource products. It specifies the role architecture, \texttt{did:oan} identity objects, registration workflow, governance-backed Root lifecycle enforcement, Root-verified package model, authorization-aware Discovery, Root-issued infrastructure authorization VCs, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, domain-specific Agent protocols, Skills, MCP Servers, and Tool/API resources. OAN does not define the entire business conversation among Agents or the native protocol of every resource; it defines how resource identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.

2606.03161 2026-06-08 cs.MA cs.AI 版本更新

OpenAgenet / OAN White Paper: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet/OAN:可信智能体互连的开放基础设施

Jinliang Xu

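AI summary: To address the need to verify identity provenance, governance state, discovery authorization, freshness, and trust evidence as Agents move from isolated applications into open, multi-operator networks, the authors propose OAN, a protocol-neutral trust layer that provides Root-governed identity admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, and signed trusted invocation.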

详情
AI中文摘要

OpenAgenet,简称OAN,是一个用于可信智能体互连的开放基础设施项目。它解决了一个在智能体从孤立应用转向开放的多运营商网络时变得明显的问题:在智能体能够安全地发现、选择和调用另一个智能体之前,它需要一种方法来验证身份来源、治理状态、发现授权、新鲜度和连接前的信任证据。OAN被设计为一个协议无关的信任层。它不取代智能体交互协议、工具协议、模型编排框架或应用级工作流。相反,它提供了根治理的身份准入、注册商辅助的注册、根验证的包发布、授权感知的发现以及签名的可信调用。本文介绍了OAN的动机、架构、角色、治理模型、与MCP、A2A和ANP的关系、部署模式、合作模型、区块链支持的授权公告、原型状态、性能概况和路线图。

英文摘要

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides \texttt{did:oan}-based resource identity, governance-backed admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, Root-issued infrastructure authorization VCs, and signed trusted invocation. The architectural center of OAN is the combination of federated governance, resource identity, and trusted Discovery, rather than a single directory or naming service. This white paper explains the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, on-chain governance layer, prototype status, performance profile, and roadmap of OAN.