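arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.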

2605.24649 2026-06-08 cs.LG cs.AI cs.LO cs.SY eess.SY cross-list

On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

Sai Sandeep Damera, Ryan Matheu, Aniruddh G. Puranic, John S. Baras, Calin Belta

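Affiliation: University of Maryland, College Park, USA

AI summary: Proposes the R-DTLGN architecture, which trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit. By combining numerically monotone and information-monotone gates, it achieves stable recurrence and principled abstention in STL monitoring, with the network size determined by the STL formula.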

Comments: 9 pages, 3 figures. This work has been submitted to the IEEE for possible publication

Abstract

Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy. Standard RNN architectures offer no structural guarantee that outputs degrade gracefully under sensor degradation; a dropped input can silently flip a verdict from safe to unsafe. We introduce the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a recurrent architecture that operates over Kleene's three-valued logic $\{-1, 0, +1\}$, where $0$ explicitly represents unknown. The R-DTLGN trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit at inference. We analyze the hardened circuit through two gate vocabularies derived from two orderings on the ternary domain: numerically monotone gates ensure stable recurrent dynamics, while information-monotone gates, when present, guarantee principled abstention (unknown inputs never produce wrong outputs) and monotonicity in input certainty (more information can only improve the verdict). We show that the recurrent connections required by bounded STL operators use exclusively AND and OR, which belong to both vocabularies, linking the monitoring task to the architecture's guarantees. A realizability bound derived from the STL formula's temporal operators directly sizes the network's hidden state, replacing hyperparameter search with a formula-driven specification. We evaluate on STL specifications over D4RL PointMaze navigation data, testing prediction accuracy, degradation under predicate dropout, and the accuracy-versus-safety tradeoff between two label construction pipelines. The R-DTLGN is, to our knowledge, the first recurrent architecture that couples learned temporal prediction with formal degradation guarantees rooted in three-valued logic.
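The two orderings mentioned in the abstract can be illustrated with a small sketch. It assumes the usual encoding of Kleene's strong three-valued logic on $\{-1, 0, +1\}$, in which AND is the minimum and OR is the maximum. The helper names, the `optimistic_or` counterexample, and the windowed fold in `eventually_within` are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

TERNARY = (-1, 0, 1)  # -1 = false, 0 = unknown, +1 = true


def k_and(a, b):
    """Kleene AND: the minimum under the numeric order -1 < 0 < +1."""
    return min(a, b)


def k_or(a, b):
    """Kleene OR: the maximum under the numeric order."""
    return max(a, b)


def optimistic_or(a, b):
    """Hypothetical gate that treats unknown as true (used as a counterexample)."""
    return 1 if max(a, b) >= 0 else -1


def info_leq(a, b):
    """Information order: unknown (0) lies below both definite values."""
    return a == b or a == 0


def is_numerically_monotone(gate):
    """Raising any input numerically never lowers the output."""
    pairs = list(product(TERNARY, repeat=2))
    return all(
        gate(a, b) <= gate(c, d)
        for (a, b) in pairs
        for (c, d) in pairs
        if a <= c and b <= d
    )


def is_information_monotone(gate):
    """Resolving an unknown input never contradicts an output already given."""
    pairs = list(product(TERNARY, repeat=2))
    return all(
        info_leq(gate(a, b), gate(c, d))
        for (a, b) in pairs
        for (c, d) in pairs
        if info_leq(a, c) and info_leq(b, d)
    )


def eventually_within(window):
    """Bounded 'eventually' as a recurrent fold of OR over a window of predicate values."""
    state = -1
    for value in window:
        state = k_or(state, value)
    return state
```

Under these assumptions, AND and OR pass both checks, while `optimistic_or` is numerically monotone but not information-monotone. Dropping the only true sample from the window `[-1, 1, -1]` (replacing it with `0`) moves the verdict of `eventually_within` from `1` to `0`, an abstention rather than a flip to `-1`.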

2605.04130 2026-06-08 cs.LG version update

Constrained Extreme Gradient Boosting for Adapting Reduced-Order Models

Melika Baghi, Xiao Liu, Kamran Paynabar

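Affiliation: H. Milton Stewart School of Industrial and Systems Engineering

AI summary: Proposes the constrained extreme gradient boosting (cXGBoost) framework, which uses a geometric representation on the Grassmann manifold and a norm constraint to predict parameter-dependent POD bases, enabling efficient, adaptive reduced-order modeling.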

Comments: Preprint. Under review. 4 numerical examples

Abstract

High-fidelity simulations, such as computational fluid dynamics and finite element analysis, are essential for modeling complex engineering systems but are often prohibitively expensive for tasks including parametric studies, optimization, and real-time control. Projection-based reduced-order models (ROMs) alleviate this cost by projecting the governing dynamics onto low-dimensional subspaces. However, their performance can deteriorate under parameter variation, motivating the need for adaptive basis construction. In this work, we propose a constrained ensemble learning framework, termed Constrained Extreme Gradient Boosting (cXGBoost), for predicting Proper Orthogonal Decomposition (POD) bases as functions of system parameters. The approach leverages a geometric representation of subspaces on the Grassmann manifold, which are mapped to a Euclidean space to enable efficient regression using gradient boosting trees. A norm constraint is imposed during training to ensure the validity of the inverse mapping and preserve the geometric structure of the predicted subspaces. The proposed method is evaluated on four numerical examples, including fluid dynamics and wave propagation problems, demonstrating its ability to accurately predict parameter-dependent bases while maintaining robustness across nonlinear regimes. These results highlight the potential of combining geometric learning with constrained ensemble methods for scalable and reliable reduced-order modeling of high-dimensional parametric systems.
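The mapping between subspaces and a Euclidean space can be sketched with the standard Grassmann logarithm and exponential maps at a reference basis. This is a minimal NumPy illustration, assuming orthonormal bases and the textbook SVD formulas. The function names are hypothetical, and the bound used in `clip_spectral_norm` only stands in for whatever norm constraint the paper actually imposes.

```python
import numpy as np


def grassmann_log(Y0, Y):
    """Tangent vector at span(Y0) pointing toward span(Y).

    Y0 and Y are n x k matrices with orthonormal columns, and Y0.T @ Y
    is assumed to be invertible.
    """
    M = Y0.T @ Y
    L = (Y - Y0 @ M) @ np.linalg.inv(M)
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return (U * np.arctan(s)) @ Vt


def grassmann_exp(Y0, G):
    """Map a tangent vector G at span(Y0) back to an orthonormal basis."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return ((Y0 @ Vt.T) * np.cos(s)) @ Vt + (U * np.sin(s)) @ Vt


def clip_spectral_norm(G, max_norm):
    """Rescale G so that its largest singular value is at most max_norm."""
    top = np.linalg.svd(G, compute_uv=False)[0]
    return G if top <= max_norm else G * (max_norm / top)
```

In this sketch a regressor would predict the entries of `G` from the system parameters, and `grassmann_exp` would recover a basis from the prediction. The logarithm map returns tangent vectors whose singular values lie below pi/2, so keeping predictions inside such a bound is one plausible reading of why a norm constraint protects the inverse mapping.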

2606.06397 2026-06-08 cs.LG version update

The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning

Shuo Wang, Xiangyu Wang, Quanxin Wang, Bailin Wu, Bokui Wang, Shunyang Huang, Boyan Deng, Haonan Liu, Ruiyi Fang, Zhenxiang Xu, Boyu Wang, Zhao Kang

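Affiliations: University of Electronic Science and Technology of China; Tsinghua University; Western University; Zhejiang University

AI summary: Addresses the problem that uniform benchmarks in relational learning mask geometry-dependent performance. Proposes a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes, shows that model performance is fundamentally geometry-dependent, and gives a more reliable evaluation protocol.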

Comments: Suggestions and comments are welcome

Abstract

Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces systematic bias: it obscures geometry-dependent performance variations and can lead to misleading conclusions about model generalization. In this work, we identify intrinsic geometry as a key latent factor governing model effectiveness. We demonstrate that conventional aggregated metrics mask critical performance trade-offs that only become visible when datasets are stratified by their geometric properties. To address this issue, we introduce a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes. Our benchmark evaluates 18 representative models including Graph Convolutional Networks (GCNs), Graph Foundation Models (GFMs), and tabular learning methods across 14 datasets. We find that model rankings are highly stable within each curvature regime but shift significantly across regimes, indicating that performance is fundamentally geometry-dependent rather than universally transferable. Notably, we identify regimes where GFMs offer diminishing returns compared to geometry-aligned GNNs. Based on these findings, we propose a geometry-aware evaluation protocol that yields more reliable and interpretable comparisons than standard aggregated benchmarks. We release all code, curvature-stratified dataset splits, and evaluation tools to support reproducible and rigorous assessment of future relational learning methods. Code and datasets are provided in our project homepage: https://sirbabbage.github.io/CurvBench_HOME/.
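The contrast between a flat leaderboard and a stratified one can be sketched in a few lines. The curvature values, the tolerance `tol`, the model names, and the scores below are made up for illustration. The abstract does not say which curvature measure or threshold the benchmark uses.

```python
from collections import defaultdict


def regime(curvature, tol=0.1):
    """Assign a dataset to a curvature regime (tol is an assumed threshold)."""
    if curvature > tol:
        return "positive"
    if curvature < -tol:
        return "negative"
    return "near-zero"


def flat_ranking(scores):
    """Rank models by their mean score over all datasets, best first."""
    return sorted(scores, key=lambda m: -sum(scores[m].values()) / len(scores[m]))


def rank_by_regime(scores, curvatures, tol=0.1):
    """Rank models separately inside each curvature regime, best first.

    scores: {model: {dataset: score}}; curvatures: {dataset: curvature}.
    """
    per_regime = defaultdict(lambda: defaultdict(list))
    for model, by_dataset in scores.items():
        for dataset, score in by_dataset.items():
            per_regime[regime(curvatures[dataset], tol)][model].append(score)
    return {
        name: sorted(models, key=lambda m: -sum(models[m]) / len(models[m]))
        for name, models in per_regime.items()
    }


# Toy inputs, invented for this example only.
curvatures = {"tree_like": -0.8, "grid_like": 0.0, "clique_like": 0.6}
scores = {
    "model_a": {"tree_like": 0.90, "grid_like": 0.70, "clique_like": 0.60},
    "model_b": {"tree_like": 0.60, "grid_like": 0.72, "clique_like": 0.85},
}
```

With these toy numbers the flat ranking puts `model_a` first, yet `model_b` leads in the positive and near-zero regimes, which is the kind of reversal an averaged leaderboard hides.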

2606.06224 2026-06-08 cs.CV cs.LG version update

Symb-xMIL: Symbolic Explanations for Multiple Instance Learning in Digital Pathology

Yanqing Luo, Julius Hense, Niklas Prenißl, Andreas Mock, Klaus-Robert Müller, Thomas Schnake, Mina Jamshidi Idaji

Affiliations: Berlin Institute for the Foundations of Learning and Data; Machine Learning Group, Technische Universität Berlin; Institute of Pathology, Charité Universitätsmedizin; Berlin Institute of Health at Charité – Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, BIH Charité Digital Clinician Scientist Program; Institute of Pathology, Ludwig Maximilian University of Munich; Division of Translational Medical Oncology, DKFZ; German Cancer Consortium (DKTK), partner site Munich, a partnership between DKFZ and Ludwig-Maximilians-Universität München (LMU); Department of Artificial Intelligence, Korea University; Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Chemistry, Chemical Physics Theory Group, University of Toronto; Vector Institute for Artificial Intelligence, Toronto, Canada; Acceleration Consortium, University of Toronto

AI summary: Proposes the Symb-xMIL framework, which quantifies how well a model's behavior aligns with human-readable decision rules (logical relations) to provide structured symbolic explanations for multiple instance learning, validated on synthetic and real pathology data.

Comments: 23 pages, 18 figures

Abstract

Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods primarily rely on heatmaps that highlight influential regions but do not explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We introduce Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how a MIL model's behavior aligns with human-readable decision rules, expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal semantic patterns underlying the model's predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC, a cohort of head and neck cancer, our framework refines patient survival stratification beyond HPV status with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability beyond visual attribution toward structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.

2606.06048 2026-06-08 cs.CV version update

LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations

Mritula Chandrasekaran, Sanket Kachole, Jarek Francik, Dimitrios Makris

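Affiliations: University of California, Berkeley; Stanford University; University of Toronto; MIT Media Lab

AI summary: Proposes a multimodal LLM-guided framework that synthesizes 3D pathological gait data from structured textual descriptions, combining motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation to improve downstream classification.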

Comments: Accepted at CVPR MOMA Workshop 2026 and selected for spotlight presentation at the workshop

Abstract

Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The proposed method generates fixed-length synthetic skeleton-based gait sequences for pathological gait classification tasks. The framework combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation. A key contribution is the proposed pathological tokeniser, which is designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that the proposed synthetic sequences improve downstream classification for recurrent classifiers when combined with real data. The best result is obtained using a GRU classifier trained with real and synthetic samples, achieving 92.77% accuracy under a leave-one-subject-out protocol.

2606.06042 2026-06-08 cs.CV version update

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong, Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Pipei Huang, Hao Jiang

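Affiliations: Peking University; Alibaba Group

AI summary: Proposes LoomVideo, an efficient 5B-parameter unified architecture that uses a multimodal large language model and a zero-overhead Scale-and-Add conditioning mechanism for video generation and editing, substantially lowering computational complexity and speeding up inference.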

Abstract

Developing unified video generation and editing models that can interpret interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling the clean source video latent and adding it directly to the noised target latent, this design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particularly strong results in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
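The cost argument in the abstract is simple arithmetic: self-attention compares every token with every other token, so doubling the sequence length multiplies the number of pairwise scores by four. The sketch below shows that count next to a toy version of additive conditioning. The function names and the `scale` value are illustrative assumptions, and the abstract does not say how the scale is chosen.

```python
def attention_pairs(seq_len):
    """Number of query-key scores one self-attention layer computes."""
    return seq_len * seq_len


def concat_condition(noised_target, clean_source):
    """Token concatenation: the model sees both sequences, so length doubles."""
    return list(noised_target) + list(clean_source)


def scale_and_add(noised_target, clean_source, scale):
    """Additive conditioning: the sequence length stays unchanged."""
    return [t + scale * s for t, s in zip(noised_target, clean_source)]


target = [1.0, 2.0, 3.0, 4.0]      # stand-in for noised target latents
source = [10.0, 20.0, 30.0, 40.0]  # stand-in for clean source latents

concatenated = concat_condition(target, source)
added = scale_and_add(target, source, scale=0.5)
```

Here `attention_pairs(len(concatenated))` is four times `attention_pairs(len(added))`, while the additive variant keeps the original four tokens.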

2606.06002 2026-06-08 cs.CV version update

Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma

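Affiliation: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China

AI summary: Proposes a global-local Monte Carlo tree search method that uses a hierarchical scene representation and PRM-guided MCTS to address error propagation in text-to-3D indoor scene generation, and builds a new benchmark dataset, 3DTindo-bench.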

Abstract

Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. This representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art methods.
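The selection step of such a search can be sketched with the standard UCT rule, preceded by a pruning pass. The abstract does not expand "PRM", so the sketch assumes it is a scoring model that rates partial placements between 0 and 1. The threshold, the exploration constant `c`, and the child records are hypothetical.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, prm_threshold=0.2, c=1.4):
    """Prune low-scoring branches, then pick the child with the best UCT score.

    children: list of dicts with keys 'name', 'prm', 'value', 'visits'.
    Returns None when every child has been pruned.
    """
    kept = [child for child in children if child["prm"] >= prm_threshold]
    if not kept:
        return None
    return max(
        kept,
        key=lambda child: uct_score(child["value"], child["visits"], parent_visits, c),
    )


# Invented candidate placements for one object.
children = [
    {"name": "sofa_by_wall", "prm": 0.90, "value": 3.0, "visits": 5},
    {"name": "sofa_in_doorway", "prm": 0.05, "value": 0.0, "visits": 0},
    {"name": "sofa_center", "prm": 0.60, "value": 0.5, "visits": 1},
]
```

With `c=1.4` the rarely visited `sofa_center` is chosen (exploration), with `c=0` the higher-mean `sofa_by_wall` is chosen (exploitation), and `sofa_in_doorway` is never expanded because it is pruned before its unvisited bonus can apply.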

2606.05949 2026-06-08 cs.CV version update

Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

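Affiliations: Shanghai Innovation Institute; Shanghai AI Laboratory; University of Science and Technology of China; Wuhan University; Nankai University; Shanghai Jiao Tong University; Fudan University; ZODA; Alaya Studio

AI summary: Proposes the FEPBench benchmark, which uses fine-grained atom set annotations and a three-dimension evaluation (instruction faithfulness, reasoning enrichment, semantic precision) to systematically assess T2I models on natural-science illustration generation. Even state-of-the-art closed-source models still show text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision.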

Abstract

Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released by us.

2606.05761 2026-06-08 cs.AI cs.CL version update

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang, Yu Cheng, Yang Yang

Affiliations: Harbin Institute of Technology; Shanghai AI Laboratory; Tongji University; Xiamen University; Fudan University; Shanghai Jiao Tong University; The Chinese University of Hong Kong

AI summary: Proposes the SubtleMemory benchmark, which constructs relation-controlled latent semantic artifacts, embeds them in user-agent interaction histories, and evaluates whether long-horizon AI agents can recover distributed relational structures in later queries.

Comments: 48 pages

Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.

2606.05739 2026-06-08 cs.SD eess.AS version update

Do speech foundation models perceive speaker similarity as humans do?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

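Affiliations: Keio University, Japan; The University of Tokyo, Japan

AI summary: Compares speaker embeddings from more than 40 speech foundation models against human subjective similarity ratings to examine whether model distances agree with human perception, and identifies the configuration factors that most affect that agreement.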

Comments: Accepted by INTERSPEECH 2026

Abstract

This study presents a comparative analysis of the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners can judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
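One common way to compare model distances with listener ratings is to compute a distance for each voice pair and then rank-correlate the distances with the human scores. The sketch below uses cosine distance and Spearman correlation. Both choices are assumptions made for illustration, since the abstract does not name the distance or the agreement statistic used in the study.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

A strongly negative `spearman(distances, human_similarity)` would indicate that pairs the model places far apart are also the pairs listeners rate as dissimilar.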

2606.05711 2026-06-08 cs.CL version update

Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

Yingzhuo Liu

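Affiliation: Beijing University of Posts and Telecommunications

AI summary: Proposes a unified three-axis framework (what is communicated, sender-receiver alignment, and how information is fused), systematically categorises 18 latent communication methods from 2024 to 2026, identifies five design patterns, and surfaces open challenges.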

Abstract

Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

2606.05682 2026-06-08 cs.AI cs.LG version update

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

Fangbo Tu, Junhua Zhao, Chi Liu, Xin Chen, Haifeng Wu, Jian Wan, Srinivasan Manoharan

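AI summary: Addresses the degradation of internal representations caused by output-only matching in NVFP4 low-precision quantization-aware distillation. Proposes CKA-QAD, which preserves internal geometry through CKA-guided alignment of layerwise Gram matrices and improves accuracy on reasoning and coding tasks.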

Comments: 13 pages, 1 figure

Abstract

Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency- and cost-constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low-bit quantization by training a quantized student to match the output distribution of a frozen higher-precision teacher via a KL-divergence loss. In this work, we first provide a representation-level diagnosis of QAD: output matching alone can mask internal degradation, because many intermediate activation geometries can yield similar teacher-aligned logits. Using CKA, we show that KL-only QAD can reduce layerwise representational similarity relative to the BF16 teacher, with especially severe drift in RL-post-trained models. This drift correlates with downstream bottlenecks on reasoning and coding tasks, suggesting that low-bit recovery requires preserving internal geometry rather than matching outputs alone. Motivated by this finding, we propose CKA-QAD, a CKA-guided representational alignment method for NVFP4 QAD and low-bit LLM accuracy recovery. The method adds a lightweight regularizer that preserves internal representational geometry during distillation by aligning layerwise Gram matrices through CKA. Across Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD substantially improves representational alignment and improves downstream reasoning and coding accuracy with modest training overhead. Our findings position CKA-guided representational alignment as a practical complement to output matching for quantized LLM recovery.
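The regularizer described in the abstract compares Gram matrices of teacher and student activations. A minimal NumPy sketch of linear CKA (centered kernel alignment) is given below, together with one hypothetical way to add it to a KL term. The penalty form `1 - CKA`, the `weight` value, and the function names are assumptions, not the paper's exact loss.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between activations X (n, d1) and Y (n, d2) for the same n inputs."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K = H @ (X @ X.T) @ H                # centered Gram matrix of X
    L = H @ (Y @ Y.T) @ H                # centered Gram matrix of Y
    return float(np.sum(K * L) / (np.linalg.norm(K) * np.linalg.norm(L)))


def distillation_loss(kl_term, teacher_acts, student_acts, weight=0.1):
    """KL term plus an assumed penalty of (1 - CKA), averaged over layers."""
    penalty = sum(1.0 - linear_cka(t, s) for t, s in zip(teacher_acts, student_acts))
    return kl_term + weight * penalty / len(teacher_acts)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so it rewards a student whose layers preserve the teacher's similarity structure across inputs rather than its exact activation values.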

2606.05342 2026-06-08 cs.AI version update

SentinelBench: A Benchmark for Long-Running Monitoring Agents

Matheus Kunzler Maldaner, Adam Fourney, Amanda Swearngin, Hussein Mozannar, Gagan Bansal, Maya Murad, Rafah Hosn, Saleema Amershi

发表机构 * University of Florida（佛罗里达大学）； Microsoft Research, AI Frontiers（微软研究院，人工智能前沿）

AI总结提出SentinelBench，一个包含10个合成网络环境中100个任务的基准测试，用于评估AI代理在长时间监控任务中的表现，衡量任务完成度、反应时间和资源使用。

详情

Comments: 18 pages, 16 figures

AI中文摘要

AI代理越来越多地被要求执行持续数分钟、数小时或更长时间的工作。然而，代理行为的默认模式是连续动作：发出工具调用、刷新页面、搜索替代方案或以其他方式试图强制推进。这对于许多长时间运行的任务来说是一种错误的方法，这些任务更适合采用持续关注的策略。相反，代理应该监控环境，注意到外部事件何时使进展成为可能，然后迅速响应，同时在等待时不浪费资源。为了衡量这类任务的进展，我们引入了SentinelBench，一个用于时间演化监控任务的开源基准测试。SentinelBench包含10个合成网络环境中的100个任务，包括电子邮件、日历、金融、专业网络和娱乐。每个环境都提供一个实时网络界面，并重放一个脚本化的事件序列，要求代理导航和推理状态不断变化的网页。SentinelBench衡量任务完成度、反应时间和资源使用，揭示了响应性与成本之间的权衡。我们报告了三种模型和两种浏览器代理框架的结果，为未来比较建立了性能基线，并展示了代理设计选择如何显著影响关键指标。总之，这些结果表明SentinelBench能够区分代理行为中的有意义差异。

英文摘要

AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continuous action: issuing tool calls, refreshing pages, searching for alternatives, or otherwise trying to force progress. This is the wrong approach for many long-running tasks, which are better served by a strategy of sustained attention. Instead, agents should monitor an environment, notice when an external event makes progress possible, then respond promptly without wasting resources while waiting. To measure progress on this class of tasks, we introduce SentinelBench, an open-source benchmark for time-evolving monitoring tasks. SentinelBench contains 100 tasks across 10 synthetic web environments, including email, calendars, finance, professional networking, and entertainment. Each environment exposes a live web interface and replays a scripted sequence of events, requiring agents to navigate and reason about web pages whose state shifts underfoot. SentinelBench measures task completion, reaction time, and resource use, exposing the tradeoff between responsiveness and cost. We report results across three models and two browser-agent harnesses, establishing performance baselines for future comparison and demonstrating how agent design choices can dramatically impact key metrics. Together, these results show that SentinelBench distinguishes meaningful differences in agent behavior.


2606.04874 2026-06-08 cs.CL 版本更新

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Agent规划基准：LLM Agent规划能力的诊断框架

Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng

发表机构 * Tongji University（同济大学）； Shanghai AI Laboratory（上海人工智能实验室）； Harbin Institute of Technology（哈尔滨工业大学）； Fudan University（复旦大学）； Skywork AI ； University of California, Santa Cruz（加州大学圣克鲁兹分校）； Shanghai Jiao Tong University（上海交通大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结提出Agent规划基准(APB)，通过4209个多模态案例和五个设置，诊断LLM Agent在长程规划、工具噪声鲁棒性、校准拒绝和推理时改进方面的系统弱点。

详情

AI中文摘要

规划是LLM Agent的核心：在行动之前，Agent必须分解目标、选择工具、推理约束并决定任务何时不可行。然而，现有的Agent评估通常只报告端到端的成功率，使得难以判断失败源于规划还是执行。我们引入了Agent规划基准（APB），一个针对规划的诊断基准，包含22个领域和五个设置下的4209个多模态案例，涵盖整体规划、反馈条件逐步规划以及在无关工具、损坏工具和不可解任务下的鲁棒性。在12个MLLM上，APB揭示了长程规划、工具噪声鲁棒性、校准拒绝和推理时改进方面的系统弱点。我们进一步在200个ToolSandbox任务和200个$τ^2$-bench任务上验证了APB，其中APB引导的改进在三个代表性模型上一致提高了计划正确性、计划等级和下游执行指标。因此，APB可作为执行基准的上游诊断补充。APB基准和代码见 https://github.com/Mikivishy/AgentPlanningBenchmark。

英文摘要

Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $τ^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks. The APB benchmark and code are available at https://github.com/Mikivishy/AgentPlanningBenchmark.


2606.04812 2026-06-08 cs.LG cs.AI 版本更新

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

面向风险感知强化学习的情景生成：具有概率近似安全保证

Mohit Prashant, Arvind Easwaran

发表机构 * Nanyang Technological University（南洋理工大学）

AI总结针对强化学习策略对转移扰动敏感导致不安全行为的问题，提出使用变分自编码器近似状态空间分布，通过构造上下界屏障证书并采样非鲁棒区域状态来收紧概率安全保证。

详情

Comments: 8 pages, preprint

AI中文摘要

保证安全性对于强化学习（RL）智能体在现实世界中的部署至关重要，尤其是使用深度RL学习的策略可能表现出对转移扰动的敏感性，从而导致未知或不安全的行为。一种策略验证方法是通过采样相对于安全约束的策略轨迹来构造概率屏障证书，从而将已知的安全行为与未知行为区分开来。如果策略容易受到转移不确定性或扰动的影响，使智能体处于未充分探索的状态，则难以获得这些约束违反概率的严格上下界。为了解决这个问题，我们使用变分自编码器（VAE）近似遇到的状态空间的分布，并利用状态的潜在特征构造上下界屏障证书，以高置信度优化已知安全行为的区域。我们在工作中将其表述为一个对偶优化问题，其中下界屏障证书比上界屏障证书提供更保守的安全区域估计。在训练期间采样位于两者集合差（即非鲁棒区域）内的状态，使我们能够收紧上下界，从而提供更尖锐的概率安全保证。在我们的研究中，我们描述了所放置的保证，并通过实验证明了我们边界的紧致性。

英文摘要

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.


2606.04373 2026-06-08 cs.CV cs.AI 版本更新

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

解耦信息区域的选择性耦合：用于视觉Transformer无数据量化的掩码注意力对齐

Biao Qian, Yang Wang, Yong Wu, Jungong Han

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出MaskAQ方法，通过解耦合成样本中的信息区域并利用掩码注意力对齐全精度模型与量化模型，解决无数据量化中分布不匹配问题。

详情

Comments: Accepted to appear at ICML 2026, Seoul, Korea

AI中文摘要

无数据量化（DFQ）通过合成样本解决数据安全问题，无需访问真实数据。由于自注意力机制相比经典卷积运算的优势，DFQ在视觉Transformer（ViT）中日益受到关注。然而，先前的ViT DFQ方法常遭受合成样本与量化模型Q期望输入分布之间的分布不匹配，导致性能次优。本文提出一种新颖的掩码注意力对齐方法用于ViT的无数据量化，称为MaskAQ，揭示了：1）自注意力机制中的语义主要局限于稀疏的补丁子集，称为信息区域；2）信息区域主导了合成样本与Q输出之间的互信息。为此，我们利用合成样本补丁相似性的微分熵最大化，从噪声背景中解耦信息区域。为了与不同的Q耦合，通过掩码注意力对齐目标选择信息区域以对齐全精度模型与Q，从而产生高质量的合成样本。此外，提出周期性样本刷新策略，使MaskAQ能够在训练过程中持续适应Q的演化状态，以保持与合成样本的理想互信息。大量实验验证了MaskAQ在多个骨干网络和下游任务上优于最先进方法。我们的代码可在https://github.com/hfutqian/MaskAQ获取。

英文摘要

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting in the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.


2606.04349 2026-06-08 cs.CV cs.AI 版本更新

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

MorphoQuant: 面向全模态大语言模型的模态感知量化

Yue Wu, Changyuan Wang, Zixuan Wang, Shilin Ma, Yansong Tang


AI总结提出MorphoQuant框架，通过分布感知偏差补偿和形态导向量化函数优化，解决全模态大语言模型在4比特后训练量化中的分布异质性和异常值问题，实现精度与效率的优异平衡。

详情

AI中文摘要

传统的后训练量化方法在处理4比特全模态大语言模型时，由于跨模态的极端分布异质性和不同的异常值模式而面临困难。为了解决这一问题，我们提出了MorphoQuant，一种模态感知的PTQ框架，旨在保留跨模态形态并减轻异常值损失。具体来说，我们引入了分布感知偏差补偿，它选择性地将长尾异常值吸收到通道偏差中。该机制在保持异常值幅度的同时，为密集内点维持高精度离散化，从而在多样的模态分布中保持精确的离散化。作为补充，我们提出了形态导向量化函数优化，以协同优化量化网格与偏差掩码，确保跨模态的细粒度对齐。在Qwen2.5-Omni上对MMMU和Video-MME等基准的广泛评估证明了我们方法的优越性。值得注意的是，我们的W4A4模型在ScienceQA上达到了76.63%，显著优于最先进的W4A4方法，并意外地超越了W4A16基线，这充分展示了我们框架在精度-效率权衡方面的卓越表现。

英文摘要

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distribution. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.


2606.04058 2026-06-08 cs.LG cs.AI 版本更新

Spectral Scaling Laws of Muon

Muon的谱缩放定律

Gagik Magakyan, Pablo Parrilo, Asuman Ozdaglar

发表机构 * MIT（麻省理工学院）

AI总结本文系统研究了Muon优化器中动量矩阵奇异值谱随模型大小的缩放行为，发现其遵循幂律，并据此提出层感知的牛顿-舒尔茨迭代配置选择方法以减少计算开销。

详情

AI中文摘要

正交归一化更新规则已迅速成为训练大型语言模型的主流优化器选择，最近的开源最先进模型采用了Muon。为了保持这些更新的可处理性，Muon使用牛顿-舒尔茨（NS）迭代执行正交归一化。由于NS只是近似，小奇异值的方向无法被正交归一化。在Muon中，NS每一步都应用于动量矩阵，然而关于这些动量矩阵的奇异值谱在训练过程中如何行为，以及该行为如何随模型大小变化，我们知之甚少。我们首次系统研究了这一问题。通过追踪从77M到2.8B参数模型中各层动量缓冲区的奇异值分位数，我们观察到一致的现象：在短暂的预热后，分位数稳定在一个由层类型和模型大小决定的值上。这些稳定值随模型大小呈现出非常清晰的幂律，且指数依赖于层。中后深度的层随模型大小$M$的缩放非常温和（约$M^{-0.25}$），因此学术规模下使用的标准5步NS配置将在更大规模下继续对它们进行正交归一化。然而，某些后期层的缩放更为激进（高达$M^{-0.96}$），在前沿规模下将落入NS失效区域，除非使用更多NS迭代或更好调整的系数。NS迭代在规模上计算成本高昂；我们的定律为从业者提供了一种有原则的、层感知的配方，用于选择最小的NS配置，该配置仍能正交归一化重要的方向——在不牺牲更新质量的情况下避免不必要的计算。

英文摘要

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton--Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter -- avoiding unnecessary computation without sacrificing update quality.
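
The abstract above notes that Newton-Schulz (NS) only approximately orthonormalizes, so directions with small singular values are left behind. A minimal sketch of that effect follows, assuming the textbook cubic iteration $X \leftarrow 1.5X - 0.5XX^\top X$ after Frobenius normalization. This is not Muon's tuned polynomial, whose coefficients are not given here, and `newton_schulz` is a hypothetical helper name.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(X, steps=5):
    """Approximate the orthogonal polar factor of X with the cubic NS iteration.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which pushes
    values in (0, sqrt(3)) toward 1. Tiny singular values grow by only about
    1.5x per step, so a few steps cannot bring them up to 1.
    """
    # Frobenius normalization puts all singular values in (0, 1].
    norm = math.sqrt(sum(v * v for row in X for v in row))
    X = [[v / norm for v in row] for row in X]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```

For example, `newton_schulz([[1.0, 0.0], [0.0, 0.001]], steps=5)` leaves the second diagonal entry well below 1, while many more steps eventually lift it. This mirrors the abstract's point that fast-shrinking spectra may need more iterations or better-tuned coefficients.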


2606.03889 2026-06-08 cs.CL 版本更新

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

RealClawBench: 来自真实开发者-智能体会话的实时OpenClaw基准测试

Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao

发表机构 * Peking University（北京大学）； Qiyuan Tech（启元科技）

AI总结针对现有基准缺乏真实性的问题，提出RealClawBench框架，通过重构执行环境和确定性可验证评分器，将真实OpenClaw会话转化为可复现、自动评分的任务，评估14个模型后最佳仅解决65.8%任务，揭示了开发者-智能体工作负载上的巨大提升空间。

详情

Comments: 19 pages, 5 figures, 8 tables

AI中文摘要

智能体基准测试应反映用户实际要求已部署智能体执行的任务，然而现有基准往往缺失真实开发者-智能体会话的关键真实性属性。我们引入RealClawBench，一个基于真实OpenClaw会话构建的实时基准框架，以捕获已部署智能体使用的分布、多样性和实际难度。真实用户请求难以基准测试，因为它们通常依赖本地执行环境，涉及隐含或未明确指定的意图，并且需要非平凡的验证。RealClawBench通过两个核心机制解决这些挑战：重构的执行环境和确定性可验证评分器，共同将真实会话转化为可复现、自动评分的任务。最终发布的版本包含从更大真实会话池中采样的281个可执行任务，同时保留源分布，最终集与源分布之间的最大Jensen-Shannon散度为0.0448。评估14个当代模型显示，最佳系统仅解决65.8%的任务，揭示了在真实开发者-智能体工作负载上存在巨大的提升空间。通过将真实部署会话转化为受控评估实例，RealClawBench提供了一条实际路径，以构建能更好衡量智能体在实际使用中能力的基准测试。代码见：https://anonymous.4open.science/r/real-claw-bench-582B。

英文摘要

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
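
The abstract above reports a Jensen-Shannon divergence between the sampled task set and its source pool, but does not say which attributes are compared or which logarithm base is used. As a hedged sketch, assuming discrete category frequencies and base-2 logarithms (which bound the value in [0, 1]), the divergence could be computed as below. `js_divergence` is a hypothetical helper name.

```python
import math

def kl(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    p and q are probability vectors over the same categories. The mixture m
    is positive wherever p or q is positive, so the KL terms stay finite.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0 and disjoint ones give 1 under this convention, so a value such as 0.0448 would indicate that the sampled set stays close to the source along the measured attribute.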


2606.03559 2026-06-08 cs.LG math.OC stat.ML 版本更新

Analytical Evaluation of DCA Convergence Properties for Minimizing Prediction Functions of Gaussian RBF Support Vector Regression

高斯RBF支持向量回归预测函数最小化的DCA收敛性分析评估

Yohei Kakimoto, Yuto Omae, Hirotaka Takahashi

发表机构 * Nihon University（日本大学）； Tokyo City University（东京都市大学）

AI总结针对以训练好的高斯RBF核支持向量回归（RBF-SVR）预测函数为目标函数的非凸优化问题，利用RBF核的解析结构构造显式DC分解，推导出DC分量强凸参数下界μ和子问题梯度Lipschitz常数上界L的闭式表达式，并通过数值实验表明特征量Cαρ主导DCA的收敛性和初始点依赖性。

详情

Comments: 29 pages, 5 figures, 2 tables

AI中文摘要

对于目标函数为训练好的高斯径向基函数（RBF）核支持向量回归（SVR）模型（RBF-SVR）预测函数的非凸优化问题，我们提出一个框架，通过利用RBF核的解析结构构造显式的凸函数差（DC）分解，从而应用DC算法（DCA）。具体地，我们以闭式推导了DC分量强凸参数的下界$μ$和子问题梯度Lipschitz常数的上界$L$。$μ$和$L$完全由训练后的对偶系数之和$C_α$、RBF核参数$γ$以及DC分解参数$ρ$决定，且共享共同的主导项$C_αρ$。通过在六个基准函数上的数值实验，我们表明$C_αρ$是表征DCA收敛性质和初始点依赖性的主要单一量，并进一步证明它分解为两条独立路径$C \to C_α$和$γ \to ρ$，其主要变化由SVR超参数$(C, γ)$控制。这些结果使得RBF-SVR上DCA的收敛性质可以通过单一标量$C_αρ$预先评估：训练前可由$(C, γ)$近似得到，训练后可以闭式精确得到。

英文摘要

For nonconvex optimization problems whose objective is the prediction function of a trained Support Vector Regression (SVR) model with the Gaussian radial basis function (RBF) kernel (RBF-SVR), we present a framework that applies the difference of convex functions (DC) algorithm (DCA) by exploiting the analytical structure of the RBF kernel to construct an explicit DC decomposition. Specifically, we derive in closed form both the lower bound $μ$ of the strong convexity parameter of the DC components and the upper bound $L$ of the gradient Lipschitz constant of the subproblem. Both $μ$ and $L$ are determined solely by the post-training dual-coefficient sum $C_α$ and the RBF kernel parameter $γ$, together with the DC decomposition parameter $ρ$, and they share a common leading term $C_αρ$. Through numerical experiments on six benchmark functions, we show that $C_αρ$ is the primary single quantity characterizing both the convergence properties and the initial-point dependence of DCA, and further demonstrate that it decomposes into two independent pathways, $C \to C_α$ and $γ\to ρ$, with its primary variation governed by the SVR hyperparameters $(C, γ)$. Together, these results allow the convergence properties of DCA on RBF-SVR to be assessed in advance through the single scalar quantity $C_αρ$: approximately from $(C, γ)$ before training, and exactly in closed form after training.


2606.03382 2026-06-08 cs.LG cs.AI 版本更新

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

局部引导，全局影响：高斯重塑信任区域解锁行为转变

Bingxu Liu, Jiashun Liu, Johan Obando-Ceron, Hao Wang, Runze Liu, Pablo Samuel Castro, Aaron Courville, Ling Pan

发表机构 * Hong Kong University of Science and Technology（香港科技大学）； Mila - Québec AI Institute（魁北克AI研究院）； Université de Montréal（蒙特利尔大学）； Fudan University（复旦大学）； City University of Hong Kong（香港城市大学）

AI总结针对PPO在非平稳环境中优化失效的问题，提出高斯信任区域策略优化（GTR），通过高斯核重塑信任区域实现非单调约束，在保持局部稳定性的同时允许必要的大幅策略更新，并在多种任务中取得强性能。

详情

Comments: 21 pages

AI中文摘要

虽然近端策略优化（PPO）在平稳环境中表现出色，但我们表明其标准优化范式在持续学习和非平稳环境中存在困难。失败并非源于模型容量不足或裁剪过于严格。相反，PPO执行持续、方向低效的局部更新，这表明缺乏几何感知引导来积累有意义的行为变化，最终阻碍向新行为模式的转变。尽管基于散度的正则化引入了部分几何感知，但其单调递增的惩罚隐式地阻止了大的策略偏差，即使这种转变对于有效适应是必要的。为了解决这一局限性，我们提出了高斯信任区域策略优化（GTR），它使用高斯核重塑信任区域。由此产生的约束是有界且非单调的，在提供强局部稳定性的同时，在持续的高优势更新下逐渐放松。为了进一步提高鲁棒性，我们引入了一个混合高斯锚点，它适应最近的策略轨迹，减少了由陈旧参考引起的方差。GTR与架构无关，在游戏、模拟机器人控制、开放世界探索和语言模型后训练中均取得了强性能。这些结果表明，几何感知的信任区域设计可以成为复杂非平稳环境中鲁棒强化学习的一个有前景的方向。我们的代码可在 https://anonymous.4open.science/r/GTR_demo/README.md 获取。

英文摘要

While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful behavioral change and ultimately hindering transitions toward new behavior patterns. Although divergence-based regularization introduces partial geometric awareness, its monotonically increasing penalties implicitly discourage large policy deviations, even when such shifts are necessary for effective adaptation. To address this limitation, we propose Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region using a Gaussian kernel. The resulting constraint is bounded and non-monotonic, providing strong local stability while progressively relaxing under sustained high-advantage updates. To further improve robustness, we introduce a Mixture Gaussian Anchor that adapts to recent policy trajectories, reducing variance induced by stale references. GTR is architecture-agnostic and achieves strong performance across games, simulated robotic control, open-world exploration, and language model post-training. These results demonstrate that geometry-aware trust-region design can be a promising direction for robust reinforcement learning in complex non-stationary environments. Our code is available at https://anonymous.4open.science/r/GTR_demo/README.md.


2606.03280 2026-06-08 cs.AI 版本更新

A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

Pythia多跳设置中跨模型激活迁移的负面结果

Peiyan Zhang, Jason Xin

发表机构 * Independent Researcher（独立研究者）

AI总结研究在Pythia-160M到Pythia-410M的多跳推理设置中，通过线性翻译层传递隐藏状态是否能够改善下游回答，结果发现离线表示对齐不足以实现有用的因果通信。

详情

Comments: 16 pages, 6 figures

AI中文摘要

最近的研究表明，语言模型可以在训练过程中通过生成数据中的隐藏信号传递行为特征。我们探究另一种由激活介导的渠道是否可行：一个语言模型能否在推理时通过事后线性激活桥，而不是通过文本或结构化令牌中继，将有用的中间推理状态传递给另一个模型？我们在一个受控的Pythia-160M到Pythia-410M多跳推理设置中测试了这个问题。线性翻译层学习了发送者和接收者隐藏状态之间的强归一化空间映射，归一化余弦相似度在不同种子下接近0.97。然而，当翻译后的激活在推理时注入接收者时，并没有改善下游回答。低强度加法注入仍接近无注入基线，置信区间跨越零。替换式注入始终具有破坏性，将翻译向量重新缩放到接收者隐藏状态范数也无法挽救性能。因此，这是一个有范围限定的负面结果：在这种设置中，离线表示对齐不足以在接收者内部实现有用的因果通信。

英文摘要

Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whether a different activation-mediated channel is viable: can one language model communicate a useful intermediate reasoning state to another at inference time through a post-hoc linear activation bridge, rather than through a textual or structured-token relay? We test this question in a controlled Pythia-160M to Pythia-410M multi-hop reasoning setting. A linear translation layer learns a strong normalized-space map between sender and receiver hidden states, with normalized cosine similarity near 0.97 across seeds. However, when the translated activations are injected into the receiver at inference time, they do not improve downstream answering. Low-strength additive injection remains near the no-injection baseline, with confidence intervals that cross zero. Replacement-style injection is consistently destructive, and rescaling translated vectors to the receiver hidden-state norm does not rescue performance. The result is therefore a scoped negative result: in this setting, offline representational alignment is not sufficient for useful causal communication inside the receiver.


2606.03002 2026-06-08 cs.LG cs.AI 版本更新

Perplexity Can Miss SAE Feature Damage Under Quantization

困惑度可能遗漏量化下的SAE特征损伤

Evan Duan

发表机构 * University of Michigan（密歇根大学）

AI总结通过稀疏自编码器分析，发现量化导致语言模型中的可解释特征逐渐退化，且任务指标无法完全反映这种损伤，量化与幅度剪枝共享相似的损伤模式。

详情

Comments: 12 Pages of Content, Submitted to TMLR

AI中文摘要

量化是部署大型语言模型的标准途径，通常当量化模型的困惑度或下游精度接近全精度原始模型时，即认为其可接受。但行为上的一致并不意味着特征保真：用于解释全精度模型的稀疏自编码器（SAE）特征在权重舍入后可能发生变化。我们直接检验这一点：使用冻结的SAE作为固定测量基，在相同令牌上编码全精度和最近舍入（RTN）量化激活，并通过皮尔逊相关系数衡量每个特征的存活情况，在Pythia-70M和Gemma-2-2B上扫描从INT8到INT4的位宽。我们的核心发现是困惑度可能遗漏特征损伤：在Gemma-2-2B上，INT7改善了困惑度，却使18.7%的活跃SAE特征退化；在滑动窗口评估下，INT6同样改善了困惑度，而仅有51.3%的活跃特征存活。特征存活是分级的而非断崖式的：在INT6下，Pythia有62.4%的活跃特征存活，Gemma有51.3%存活；大多数未存活的特征是被模糊而非被完全破坏。存活情况也可仅由全精度特征统计量预测，交叉验证AUC为0.92至0.97，峰值激活是最强的边际预测因子。最后，RTN量化与匹配困惑度的幅度剪枝所损伤的特征集高度重叠，Jaccard重叠为0.79至0.86，损伤分数的斯皮尔曼相关性为0.98。这些结果表明，仅凭行为指标不足以证明全精度模型上的可解释性发现可迁移到量化模型，从而促使对压缩进行特征级审计。

英文摘要

Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding. We test this directly by using a frozen SAE as a fixed measurement basis, encoding full-precision and round-to-nearest (RTN) quantized activations on identical tokens, and measuring per-feature survival by Pearson correlation across bit-widths from INT8 to INT4 on Pythia-70M and Gemma-2-2B. Our central finding is that perplexity can miss feature damage: on Gemma-2-2B, INT7 improves perplexity while degrading 18.7% of active SAE features, and under sliding-window evaluation INT6 also improves perplexity while only 51.3% of active features survive. Feature survival is graded rather than cliff-like, with 62.4% of active Pythia features and 51.3% of active Gemma features surviving at INT6; most non-surviving features are blurred rather than fully damaged. Survival is also predictable from full-precision feature statistics alone, with cross-validated AUC 0.92--0.97 and peak activation as the strongest marginal predictor. Finally, RTN quantization and matched-perplexity magnitude pruning damage strongly overlapping feature sets, with Jaccard overlap 0.79--0.86 and damage-score Spearman correlation 0.98. These results show that behavioral metrics alone are insufficient evidence that full-precision interpretability findings transfer to quantized models, motivating feature-level audits of compression.
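
The abstract above measures per-feature survival by Pearson correlation between full-precision and quantized SAE activations on identical tokens. A minimal sketch of that bookkeeping follows. The survival cutoff is not stated in the abstract, so the `threshold=0.9` default is purely an assumption, and `pearson` and `survival_rate` are hypothetical helper names.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def survival_rate(fp_acts, q_acts, threshold=0.9):
    """Fraction of features whose activations stay correlated after quantization.

    fp_acts[f] and q_acts[f] hold feature f's activations over the same tokens,
    from the full-precision and quantized model respectively.
    The threshold is an assumed cutoff, not a value from the paper.
    """
    survivors = sum(1 for fp, q in zip(fp_acts, q_acts)
                    if pearson(fp, q) >= threshold)
    return survivors / len(fp_acts)
```

Because Pearson correlation ignores scale, a feature that is merely rescaled by quantization still counts as surviving here. Whether the paper treats such cases the same way is not specified.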


2606.02919 2026-06-08 cs.CV 版本更新

Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction

Pixel Cube: 基于扩散的肖像视频重光照通过真实感光照再现

Yufan Zhang, Yu Ji, Ayo Ajiboye, Rundi Wu, Yu Guo, Changxi Zheng, Jinwei Ye

发表机构 * George Mason University（乔治梅森大学）； LightThought LLC ； Columbia University（哥伦比亚大学）

AI总结提出一种基于扩散的方法，利用混合训练数据集和HDR环境图控制，实现动态肖像视频的真实感重光照，保持时间一致性和身份特征。

详情

DOI: 10.1145/3811400
Journal ref: ACM Trans. Graph. 45, 4, Article 119 (July 2026), 17 pages
Comments: ACM SIGGRAPH 2026 Journal Track / ACM Transactions on Graphics, 17 pages. Project page: https://yufanzhang82.github.io/PixelCube/

AI中文摘要

CORE: 对比反思实现推理能力的快速提升

Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan

发表机构 * Stanford University（斯坦福大学）

AI总结提出对比反思（CORE）非参数学习算法，通过对比成功与失败的推理轨迹生成自然语言洞察，在少量样本和 rollout 下实现比参数方法（GRPO）和非参数方法（GEPA、情景RAG、MemRL）更快的推理性能提升。

详情

AI中文摘要

语言模型可以利用可验证奖励在多种推理任务上提升性能。然而，无论是参数方法（如RLVR）还是非参数方法（如提示优化），通常都需要数百个训练样本和数千次模型rollout，这在最佳情况下成本高昂，最坏情况下则难以处理。为解决这一挑战，我们引入了对比反思（CORE），一种非参数学习算法，通过比较过去的推理轨迹来生成洞察：即捕捉成功与不成功问题尝试之间差异的推理策略和约束的简短自然语言描述。在四个推理任务上，我们证明CORE比参数方法（GRPO）和非参数方法（GEPA、情景RAG和MemRL）实现更快的改进，同时使用更少的rollout。在固定rollout预算下，使用少至五个训练样本时，CORE在大多数任务-数据设定下取得了最强的性能。最后，我们强调CORE在上下文效率上也显著优于非参数基线，需要更少的提示令牌，同时将学到的知识存储为紧凑、可解释的自然语言洞察。因此，我们的结果表明，将成功与不成功推理轨迹之间的对比提炼为抽象且有用的洞察，可以为模型自我改进提供一条比权重更新、提示优化或直接重用存储的推理轨迹更高效且更可解释的途径。

英文摘要

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, CORE achieves the strongest performance in most task-data regimes. Finally, we highlight how CORE is substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
