arXivDaily: arXiv daily academic digest, updated Monday through Friday
2509.22550 2026-06-01 cs.RO

An Intention-driven Lane Change Framework Considering Heterogeneous Dynamic Cooperation in Mixed-traffic Environment

考虑混合交通中异构动态协作的意图驱动换道框架

Xiaoyun Qiu, Haichao Liu, Yue Pan, Jun Ma, Xinhu Zheng

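AI summary: Proposes an intention-driven lane change framework that combines driving-style recognition, cooperation-aware decision-making, and motion planning, using deep learning and inverse reinforcement learning to achieve safe and efficient lane changes in mixed traffic.

Journal ref
IEEE Transactions on Intelligent Transportation Systems, May, 2026
AI Chinese abstract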

在混合交通环境中,自动驾驶车辆(AV)必须与异构的人类驾驶车辆(HV)交互,这些车辆的意图和驾驶风格因个体和场景而异。这种变异性给换道交互带来了不确定性,其中安全性和效率关键取决于准确预测周围驾驶员的协作反应。现有方法通常通过假设统一或固定的行为模式来过度简化这些交互。为了解决这一限制,我们提出了一种意图驱动的换道框架,该框架将驾驶风格识别与协作感知决策和运动规划相结合。一个基于深度学习的分类器实时识别不同的人类驾驶风格。然后,我们引入了一个双视角协作分数,由内在的基于风格的倾向和交互动态组件组成,从而实现可解释和自适应的意图预测及定量推断。一个决策模块结合了行为克隆(BC)和逆强化学习(IRL)来确定换道的可行性。随后,建立了一个协调的运动规划架构,将基于IRL的意图推断与模型预测控制(MPC)相结合,以生成无碰撞且符合社会规范的轨迹。在NGSIM数据集上的实验表明,所提出的决策模型优于代表性的基于规则和基于学习的基线,在换道分类中达到了96.98%的准确率。运动规划评估进一步证明了在混合交通环境中机动成功率和执行稳定性的提高。这些结果验证了结构化协作建模对于意图驱动的自主换道的有效性。

English abstract

In mixed-traffic environments, autonomous vehicles (AVs) must interact with heterogeneous human-driven vehicles (HVs) whose intentions and driving styles vary across individuals and scenarios. Such variability introduces uncertainty into lane change interactions, where safety and efficiency critically depend on accurately anticipating surrounding drivers' cooperative responses. Existing methods often oversimplify these interactions by assuming uniform or fixed behavioral patterns. To address this limitation, we propose an intention-driven lane change framework that integrates driving-style recognition with cooperation-aware decision-making and motion planning. A deep learning-based classifier identifies distinct human driving styles in real time. We then introduce a dual-perspective cooperation score composed of intrinsic style-dependent tendencies and interactive dynamic components, enabling interpretable and adaptive intention prediction and quantitative inference. A decision-making module combines behavior cloning (BC) and inverse reinforcement learning (IRL) to determine lane change feasibility. Subsequently, a coordinated motion-planning architecture integrating IRL-based intention inference with model predictive control (MPC) is established to generate collision-free and socially compliant trajectories. Experiments on the NGSIM dataset show that the proposed decision-making model outperforms representative rule-based and learning-based baselines, achieving 96.98% accuracy in lane change classification. Motion-planning evaluations further demonstrate improved maneuver success and execution stability in mixed-traffic environments. These results validate the effectiveness of structured cooperation modeling for intention-driven autonomous lane changes.

2603.12916 2026-06-01 cs.LG cs.AI

Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection

Surprised by Attention: 面向时间序列异常检测的可预测查询动态

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler

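AI summary: Proposes AxonAD, an unsupervised detector that predicts the evolution of multi-head attention query vectors and combines reconstruction error with a query mismatch score to detect structural dependency-shift anomalies in multivariate time series.

Comments This manuscript has been accepted for publication at ECML-PKDD 2026. The final version will be published in the conference proceedings. Main: 17 Pages, 7 Figures, 3 Tables; Appendix: 3 Pages, 4 Tables

AI Chinese abstract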

多变量时间序列异常通常表现为跨通道依赖的偏移,而非简单的幅度异常。例如,在自动驾驶中,转向指令可能内部一致,但与产生的横向加速度解耦。当灵活的序列模型尽管协调性改变仍能合理重构信号时,基于残差的检测器可能遗漏此类异常。我们提出 AxonAD,一种无监督检测器,将多头注意力查询演化视为短视界可预测过程。梯度更新重构路径与仅基于历史上下文的预测器耦合,该预测器通过掩码预测器-目标目标针对指数移动平均(EMA)目标编码器进行训练。推理时,重构误差与尾部聚合的查询不匹配分数结合,该分数衡量最近时间步上预测查询与目标查询之间的余弦偏差。这种双重方法在保留幅度级检测的同时,对结构依赖偏移敏感。在带有区间标注的专有车载遥测数据以及 TSB-AD 多变量套件(17 个数据集,180 个序列)上,使用无阈值和范围感知指标,AxonAD 在排名质量和时间定位上优于强基线。消融实验证实查询预测和组合评分是观察到的改进的主要驱动因素。代码可在 https://github.com/iis-esslingen/AxonAD 获取。

English abstract

Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration. Residual-based detectors can miss such anomalies when flexible sequence models still reconstruct signals plausibly despite altered coordination. We introduce AxonAD, an unsupervised detector that treats multi-head attention query evolution as a short-horizon predictable process. A gradient-updated reconstruction pathway is coupled with a history-only predictor that forecasts future query vectors from past context. This predictor is trained via a masked predictor-target objective against an exponential moving average (EMA) target encoder. At inference, reconstruction error is combined with a tail-aggregated query mismatch score, which measures cosine deviation between predicted and target queries on recent timesteps. This dual approach provides sensitivity to structural dependency shifts while retaining amplitude-level detection. On proprietary in-vehicle telemetry with interval annotations and on the TSB-AD multivariate suite (17 datasets, 180 series) with threshold-free and range-aware metrics, AxonAD improves ranking quality and temporal localization over strong baselines. Ablations confirm that query prediction and combined scoring are the primary drivers of the observed gains. Code is available at https://github.com/iis-esslingen/AxonAD.
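
The scoring idea in this abstract (cosine deviation between predicted and target queries, aggregated over recent timesteps, plus an EMA-updated target) can be illustrated with a small sketch. This is not the AxonAD implementation: the function names, the mean aggregation over the tail, and the weighted-sum combination with reconstruction error are all assumptions made for illustration.

```python
import math


def cosine_deviation(u, v):
    """Return 1 - cos(u, v); 0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat an undefined cosine as "no agreement"
    return 1.0 - dot / (norm_u * norm_v)


def tail_mismatch(predicted, target, tail=3):
    """Mean cosine deviation over the last `tail` timesteps (assumed aggregation)."""
    pairs = list(zip(predicted, target))[-tail:]
    return sum(cosine_deviation(p, t) for p, t in pairs) / len(pairs)


def ema_update(target_weights, online_weights, decay=0.99):
    """Exponential moving average step for a target encoder's weights."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_weights, online_weights)]


def anomaly_score(reconstruction_error, mismatch, weight=0.5):
    """Hypothetical combination rule: a convex mix of the two signals."""
    return (1.0 - weight) * reconstruction_error + weight * mismatch
```

A window whose queries are forecast well contributes a mismatch near 0, so the combined score falls back to the reconstruction error; a window whose final query rotates away from the forecast raises the score even when reconstruction stays accurate.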

2603.09453 2026-06-01 cs.LG cs.AI stat.ML

Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers

变分路由:用于校准混合专家Transformer的可扩展贝叶斯框架

Albus Yizhuo Li, Matthew Wicker

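AI summary: Proposes Variational Mixture-of-Experts Routing (VMoER), which confines Bayesian inference to the expert-selection stage to obtain calibrated uncertainty at scale; on fine-tuned foundation models it markedly improves routing stability, reduces calibration error, and raises out-of-distribution detection AUROC with minimal extra compute.

Comments 8 pages, 7 figures for main text; 16 pages for Appendix; Accepted by ICML 2026

AI Chinese abstract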

基础模型越来越多地部署在需要理解其输出不确定性的场景中,这对于确保负责任部署至关重要。虽然贝叶斯方法为不确定性量化提供了原则性方法,但其计算开销使得在基础模型规模下进行训练或推理不切实际。最先进的模型通过精心设计的稀疏性(包括混合专家(MoE)层)实现了数万亿的参数数量。在这项工作中,我们通过引入变分混合专家路由(VMoER)展示了大规模下的校准不确定性,这是一种用于建模MoE层不确定性的结构化贝叶斯方法。VMoER将贝叶斯推断限制在通常由确定性路由网络完成的专家选择阶段。我们使用两种推断策略实例化VMoER:对路由logits的摊销变分推断和推断用于随机专家选择的温度参数。在微调测试的基础模型上,VMoER在噪声下将路由稳定性提高了38%,校准误差降低了94%,分布外AUROC提高了12%,同时额外FLOPs增加不到1%。这些结果表明,VMoER为构建鲁棒且具有不确定性意识的基础模型提供了一条可扩展的路径。

English abstract

Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. State-of-the-art models achieve parameter counts in the trillions through carefully engineered sparsity, including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage, which is typically handled by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational inference over routing logits and inferring a temperature parameter for stochastic expert selection. Across the fine-tuned foundation models tested, VMoER improves routing stability under noise by 38%, reduces calibration error by 94%, and increases out-of-distribution AUROC by 12%, while incurring less than 1% additional FLOPs. These results suggest VMoER offers a scalable path toward robust and uncertainty-aware foundation models.
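
The second inference strategy mentioned above, a temperature parameter for stochastic expert selection, can be pictured with a toy router. The sketch below is an illustration under stated assumptions, not VMoER itself: the temperature is passed in as a fixed number rather than inferred, and `softmax` and `sample_experts` are hypothetical helper names.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits scaled by 1 / temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_experts(logits, k, temperature, rng):
    """Draw k distinct expert indices, one at a time, without replacement."""
    remaining = list(range(len(logits)))
    chosen = []
    for _ in range(k):
        probs = softmax([logits[i] for i in remaining], temperature)
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Low temperature approaches deterministic top-k routing;
# high temperature spreads probability mass across experts.
router_logits = [0.1, 2.0, -1.0, 0.5]
experts = sample_experts(router_logits, k=2, temperature=1.0, rng=random.Random(0))
```

Repeating the draw for the same token and measuring how often the selected set changes gives a rough picture of routing uncertainty, which is the quantity a calibrated router is meant to expose.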

2603.13875 2026-06-01 cs.CL cs.LG

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

GradMem: 通过测试时梯度下降将上下文写入记忆

Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev

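AI summary: Proposes GradMem, which uses test-time gradient descent to write context into a compact memory state, optimizing memory tokens with a self-supervised reconstruction loss and outperforming forward-only memory-writing methods on key-value retrieval and natural language tasks.

Comments International Conference on Machine Learning (ICML) 2026

AI Chinese abstract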

许多大型语言模型应用需要基于长上下文进行条件生成。Transformer通常通过存储每层过去激活的KV缓存来支持这一点,这会产生大量内存开销。一种理想的替代方案是压缩记忆:一次性读取上下文,将其存储在紧凑状态中,并从该状态回答许多查询。我们在上下文移除设置中研究这一点,其中模型在推理时无法访问原始上下文的情况下必须生成答案。我们引入了GradMem,它通过每个样本的测试时优化将上下文写入记忆。给定一个上下文,GradMem在保持模型权重冻结的情况下,对一小部分前缀记忆令牌执行几步梯度下降。GradMem显式优化模型级的自监督上下文重构损失,从而产生带有迭代纠错的损失驱动写入操作,这与仅前向方法不同。在关联键值检索中,GradMem在相同记忆大小下优于仅前向记忆写入器,并且额外的梯度步长比重复的前向写入更有效地扩展容量。我们进一步表明,GradMem可以迁移到合成基准之外:使用预训练语言模型,它在自然语言任务(包括bAbI和SQuAD变体)上取得了有竞争力的结果,仅依赖于记忆中的编码信息。

English abstract

Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
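
The write operation described above (a few gradient steps on memory while the model stays frozen) can be shown on a deliberately tiny stand-in. In this sketch the "model" is a fixed linear map `W`, the memory is a short vector `m`, and the reconstruction loss is squared error; all of these are simplifying assumptions for illustration and bear no relation to the actual GradMem architecture or code.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def reconstruction_loss(weights, memory, context):
    """Squared error between the frozen model's readout of memory and the context."""
    return sum((p - t) ** 2 for p, t in zip(matvec(weights, memory), context))


def write_memory(weights, context, steps=50, lr=0.1):
    """Gradient descent on the memory only; `weights` is never modified."""
    memory = [0.0] * len(weights[0])
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(weights, memory), context)]
        grad = [2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
                for j in range(len(memory))]
        memory = [m - lr * g for m, g in zip(memory, grad)]
    return memory


frozen_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy frozen "model"
context = [1.0, 2.0, 3.0]                              # toy "context" to store
memory = write_memory(frozen_weights, context)
```

Each extra step lowers the reconstruction loss further, which mirrors the abstract's point that additional gradient steps act as iterative error correction on what has been written.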

2603.13727 2026-06-01 cs.LG physics.data-an

Data-driven Progressive Discovery of Physical Laws

数据驱动的物理定律渐进发现

Mingkun Xia, Weiwei Zhang

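AI summary: Proposes the Chain of Symbolic Regression (CoSR) framework, which progressively discovers physical laws from data by combining, step by step, knowledge units with clear physical meaning, and validates it on several physics problems.

Comments This paper needs to be retracted due to methodological flaws found in RBC case

AI Chinese abstract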

符号回归是知识发现的有力工具,能够直接从数据中提取可解释的数学表达式。然而,传统的符号发现通常采用端到端的“一步式”过程,在处理真实物理系统时往往生成冗长且物理意义不明的表达式,导致模型泛化能力差。这一局限性根本上源于其偏离了科学发现的基本路径:物理定律并非以单一形式存在,而是遵循从简单到复杂、层次化且渐进式的模式。受此原理启发,我们提出了链式符号回归(CoSR),一种将物理定律发现建模为符号知识链的新框架。该知识链通过沿特定逻辑逐步组合多个具有明确物理意义的知识单元而形成,最终能够从数据中精确发现潜在的物理定律。CoSR完整复现了从开普勒第三定律到万有引力定律的经典力学渐进发现路径,并应用于三类问题:湍流瑞利-贝纳德对流、圆管粘性流以及激光-金属相互作用,展示了其改进经典标度理论的能力。最后,CoSR在复杂工程问题——不同飞行器气动系数标度中展示了发现新知识的能力。

English abstract

Symbolic regression is a powerful tool for knowledge discovery, enabling the extraction of interpretable mathematical expressions directly from data. However, conventional symbolic discovery typically follows an end-to-end, "one-step" process, which often generates lengthy and physically meaningless expressions when dealing with real physical systems, leading to poor model generalization. This limitation fundamentally stems from its deviation from the basic path of scientific discovery: physical laws do not exist in a single form but follow a hierarchical and progressive pattern from simplicity to complexity. Motivated by this principle, we propose Chain of Symbolic Regression (CoSR), a novel framework that models the discovery of physical laws as a chain of symbolic knowledge. This knowledge chain is formed by progressively combining multiple knowledge units with clear physical meanings along a specific logic, ultimately enabling the precise discovery of the underlying physical laws from data. CoSR fully recapitulates the progressive discovery path from Kepler's third law to the law of universal gravitation in classical mechanics, and is applied to three types of problems: turbulent Rayleigh-Benard convection, viscous flows in a circular pipe, and laser-metal interaction, demonstrating its ability to improve classical scaling theories. Finally, CoSR showcases its capability to discover new knowledge in the complex engineering problem of aerodynamic coefficients scaling for different aircraft.
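
As a concrete picture of a single "knowledge unit", the sketch below recovers the exponent in Kepler's third law from rounded planetary data with a log-log least-squares fit. It illustrates only the idea of extracting one simple, physically meaningful relation from data; it is not the CoSR algorithm, and `fit_power_law` is a hypothetical helper.

```python
import math

# Semi-major axis (AU) and orbital period (years), rounded textbook values.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(xs, ys):
    """Fit y = c * x**p by ordinary least squares in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    exponent = covariance / variance
    coefficient = math.exp(mean_y - exponent * mean_x)
    return exponent, coefficient


axes = [a for a, _ in PLANETS.values()]
periods = [t for _, t in PLANETS.values()]
exponent, coefficient = fit_power_law(axes, periods)  # exponent is close to 3/2
```

In a chained setting, a relation such as T proportional to a^(3/2) would become an input to the next discovery step rather than the final answer.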

2603.11586 2026-06-01 cs.RO

Unsupervised LiDAR-Based Multi-UAV Detection and Tracking Under Extreme Sparsity

基于激光雷达的极端稀疏条件下多无人机无监督检测与跟踪

Nivand Khosravi, Rodrigo Ventura, Meysam Basiri

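AI summary: Addressing the extremely sparse point clouds produced by non-repetitive solid-state LiDAR scanning, proposes an unsupervised detection and tracking pipeline that achieves high-precision detection through adaptive DBSCAN clustering and temporal consistency checks, and compares deterministic assignment with probabilistic data association for tracking.

Comments Presented at the International Conference on Mechatronics and Robotics Engineering (ICMRE2026). To appear in IEEE conference proceedings

Journal ref
Proc. 2026 12th International Conference on Mechatronics and Robotics Engineering (ICMRE), Oldenburg, Germany, 2026
AI Chinese abstract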

非重复固态激光雷达扫描导致对空中无人机检测的极端稀疏测量:一个10-25米的小型四旋翼通常每次扫描仅产生1-2个回波,远低于大多数现有检测方法假设的点密度,且不足以进行稳健的多目标数据关联。我们提出了一种无监督、仅依赖激光雷达的流水线,无需标注训练数据即可处理检测和跟踪。检测器将距离自适应DBSCAN聚类与三阶段时序一致性检验相结合,并在真实空对空飞行数据上以八种不同参数配置进行基准测试。最佳设置达到0.891精度、0.804召回率和0.63米均方根误差,系统性的minPts扫描验证了大多数扫描最多包含1-2个目标点,直接量化了稀疏程度。对于多目标跟踪,我们在四种具有递增模糊程度的模拟场景中,比较了确定性匈牙利分配与联合概率数据关联(JPDA),每种均与交互多模型滤波耦合。JPDA将身份切换减少了64%,而对MOTA影响可忽略,表明当无人机轨迹彼此接近时概率关联具有优势。结合真实世界检测与RTK-GPS真值以及基于模拟的跟踪与身份标注真值的双环境评估策略,克服了在无人机间距低于2米时仅依赖GNSS评估的局限性。

English abstract

Non-repetitive solid-state LiDAR scanning leads to an extremely sparse measurement regime for detecting airborne UAVs: a small quadrotor at 10-25 m typically produces only 1-2 returns per scan, which is far below the point densities assumed by most existing detection approaches and inadequate for robust multi-target data association. We introduce an unsupervised, LiDAR-only pipeline that addresses both detection and tracking without the need for labeled training data. The detector integrates range-adaptive DBSCAN clustering with a three-stage temporal consistency check and is benchmarked on real-world air-to-air flight data under eight different parameter configurations. The best setup attains 0.891 precision, 0.804 recall, and 0.63 m RMSE, and a systematic minPts sweep verifies that most scans contain at most 1-2 target points, directly quantifying the sparsity regime. For multi-target tracking, we compare deterministic Hungarian assignment with joint probabilistic data association (JPDA), each coupled with Interacting Multiple Model filtering, in four simulated scenarios with increasing levels of ambiguity. JPDA cuts identity switches by 64% with negligible impact on MOTA, demonstrating that probabilistic association is advantageous when UAV trajectories approach one another closely. A two-environment evaluation strategy, combining real-world detection with RTK-GPS ground truth and simulation-based tracking with identity-annotated ground truth, overcomes the limitations of GNSS-only evaluation at inter-UAV distances below 2 m.
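
Two ingredients named in this abstract, a range-adaptive clustering radius and a temporal consistency check, can be sketched as follows. The abstract does not give the actual rules, so the linear growth of the radius with range, the gate distance, and the hit count below are all placeholder assumptions, and the function names are hypothetical.

```python
import math


def adaptive_eps(range_m, base_eps=0.3, growth=0.02):
    """Assumed rule: widen the DBSCAN radius linearly as the target gets farther away."""
    return base_eps + growth * range_m


def is_temporally_consistent(candidate, previous_scans, gate_m=1.5, min_hits=2):
    """Keep a detection only if enough earlier scans hold a detection within the gate.

    candidate: (x, y, z) position in metres.
    previous_scans: list of scans, each a list of (x, y, z) detections.
    """
    hits = 0
    for detections in previous_scans:
        if any(math.dist(candidate, d) <= gate_m for d in detections):
            hits += 1
    return hits >= min_hits


history = [[(10.2, 0.1, 5.0)], [(10.5, 0.0, 5.1)], []]
keep = is_temporally_consistent((10.0, 0.0, 5.0), history)    # two nearby hits
drop = is_temporally_consistent((40.0, 9.0, 2.0), history)    # isolated return
```

With only 1-2 returns per scan, a single scan cannot separate a UAV from noise, so persistence across scans carries most of the evidence in a check of this kind.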

2603.10422 2026-06-01 cs.CV

World2Act: Latent Action Post-Training from World Model Dynamics

World2Act:基于世界模型动力学的潜在动作后训练

An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

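AI summary: Proposes World2Act, a framework that aligns world-model dynamics with action embeddings in latent space, avoiding pixel-level supervision and improving the success rate of VLA policies in simulation and on real robots.

Comments Updated version. Project page: https://wm2act.github.io/

AI Chinese abstract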

世界模型(WMs)通过提供动力学先验,为后训练视觉-语言-动作(VLA)策略提供了一种有前景的机制,可改善任务和场景变化下的泛化能力。然而,大多数基于WM的后训练方法依赖像素空间监督,使得策略对不完美的WM rollout引入的视觉伪影敏感。我们提出World2Act,一种潜在空间后训练框架,无需像素空间监督即可将WM动力学迁移到VLA策略。World2Act分两个阶段运行:1)通过对比对齐WM动力学潜在变量与动作嵌入,诱导共享的视频-动作潜在空间;2)通过引导策略动作表示朝向WM想象的动力学而非解码像素,对VLA进行后训练。基于GR00T-N1.6,World2Act在仿真基准(RoboCasa、LIBERO、Bridge-SIMPLER)上实现了高达+2.5%的绝对成功率提升,在真实机器人上比微调VLA基线提升了+6.7%。值得注意的是,它比像素空间WM监督高出高达+6.0%,包括在LIBERO上像素监督导致基线退化的情况下,这表明潜在WM动力学为像素空间迁移提供了一种更稳定的基于WM的后训练替代方案。

English abstract

World Models (WMs) offer a promising mechanism for post-training Vision-Language-Action (VLA) policies by providing dynamics priors that improve generalization under task and scene variation. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to visual artifacts introduced by imperfect WM rollouts. We present World2Act, a latent-space post-training framework that transfers WM dynamics to the VLA policy without pixel-space supervision. World2Act operates in two stages: 1) it induces a shared video-action latent space by contrastively aligning WM-dynamics latents with action embeddings, and 2) it post-trains the VLA by guiding policy action representations toward WM-imagined dynamics rather than decoded pixels. Built on GR00T-N1.6, World2Act delivers absolute success-rate gains of up to +2.5% on simulation benchmarks (RoboCasa, LIBERO, Bridge-SIMPLER) and +6.7% on a real robot over finetuned VLA baselines. Notably, it outperforms pixel-space WM supervision by up to +6.0%, including on LIBERO where pixel supervision degrades the baseline, suggesting that latent WM dynamics offer a more stable WM-based post-training alternative to pixel-space transfer.
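
Stage 1 above contrastively aligns WM-dynamics latents with action embeddings. A common way to express such an objective is an InfoNCE-style loss, sketched below for small lists of unit-length vectors. Whether World2Act uses this exact loss is not stated in the abstract, so treat the form, the temperature value, and the function names as assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(dynamics_latents, action_embeddings, temperature=0.1):
    """Mean cross-entropy of matching each latent i to its paired embedding i.

    Inputs are assumed to be L2-normalised, with row i of each list forming
    a positive pair and all other rows serving as negatives.
    """
    total = 0.0
    for i, latent in enumerate(dynamics_latents):
        logits = [dot(latent, action) / temperature for action in action_embeddings]
        peak = max(logits)
        log_partition = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_partition - logits[i]
    return total / len(dynamics_latents)


latents = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(latents, [[1.0, 0.0], [0.0, 1.0]])   # pairs agree
shuffled_loss = info_nce(latents, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
```

Minimising a loss of this shape pulls each dynamics latent toward its own action embedding and away from the others, which is one way to obtain a shared video-action latent space.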

2603.09936 2026-06-01 cs.LG

Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective

生成性漂移实际上是分数匹配:一个谱与变分视角

Erkan Turan, Nicolas Dufour, Maks Ovsjanikov

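AI summary: By showing that under a Gaussian kernel the drift operator is equivalent to a score difference on smoothed distributions, this paper places generative drifting within the score-matching framework, resolves three questions left open in the original work using spectral analysis and variational methods, and proposes an exponential bandwidth annealing strategy along with a JKO-scheme-based theoretical justification for the stop-gradient operator.

AI Chinese abstract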

基于漂移的生成建模~\citep{deng2026drifting} 最近通过核驱动的漂移算子实现了最先进的一步图像生成,但其成功很大程度上是经验性的,其理论基础仍不明确。我们观察到,\emph{在高斯核下,漂移算子恰好是平滑分布上的分数差}。这回答了原始工作中遗留的三个问题:(1) 消失的漂移是否保证分布相等 ($V_{p,q}=0\Rightarrow p=q$),(2) 如何在核之间选择,以及 (3) 为什么停止梯度算子对于稳定训练不可或缺。我们的观察将漂移定位在分数匹配家族中。通过线性化McKean-Vlasov动力学并在傅里叶空间中探测,我们揭示了与等离子体动力学理论中的\emph{朗道阻尼}相当的频率依赖收敛时间尺度:高斯核遭受指数高频瓶颈,这可能解释了经验上对拉普拉斯核的偏好。这提出了一种修复方法:指数带宽退火调度 $σ(t)=σ_0 e^{-rt}$,将收敛时间从 $\exp(O(K_{\max}^2))$ 减少到 $O(\log K_{\max})$。最后,通过将漂移形式化为平滑KL散度的Wasserstein梯度流,我们证明了停止梯度算子不是启发式的,而是源于Jordan-Kinderlehrer-Otto (JKO) 方案所要求的冻结场离散化,移除它会切断训练与任何梯度流保证的联系。这种变分视角进一步为构建新颖的漂移算子提供了通用模板,我们通过Sinkhorn散度漂移进行了演示。我们在玩具数据集上验证了分析,并将其扩展到ImageNet。

English abstract

Generative Modeling via Drifting [deng2026drifting] has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet its success is largely empirical and its theoretical foundations remain poorly understood. We observe that under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions. This answers three questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the score-matching family. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to Landau damping in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, potentially explaining the empirical preference for the Laplacian kernel. This suggests a fix: an exponential bandwidth annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is not a heuristic but is derived from the frozen-field discretization mandated by the Jordan-Kinderlehrer-Otto (JKO) scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, which we demonstrate with a Sinkhorn divergence drift. We validate our analysis on toy datasets and scale it up to ImageNet.
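
Two objects from this abstract are easy to write down in one dimension: the annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ and the score of a Gaussian-smoothed empirical distribution. The sketch below uses the standard kernel-density identity for the smoothed score and takes the difference between the data-side and model-side scores. The paper's exact drift operator, including any scaling constants, is not given in the abstract, so the difference here should be read as an illustration of the "score difference" idea only.

```python
import math


def bandwidth(t, sigma0=1.0, rate=0.5):
    """Exponential annealing schedule sigma(t) = sigma0 * exp(-rate * t)."""
    return sigma0 * math.exp(-rate * t)


def smoothed_score(x, samples, sigma):
    """d/dx log p_sigma(x), where p_sigma is the samples convolved with N(0, sigma^2)."""
    logits = [-((x - s) ** 2) / (2.0 * sigma ** 2) for s in samples]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return sum(w * (s - x) for w, s in zip(weights, samples)) / (total * sigma ** 2)


def score_difference(x, data_samples, model_samples, sigma):
    """Smoothed data score minus smoothed model score at the point x."""
    return (smoothed_score(x, data_samples, sigma)
            - smoothed_score(x, model_samples, sigma))


# A point between data at +2 and model samples at -2 is pushed toward the data.
push = score_difference(0.0, [2.0], [-2.0], sigma=bandwidth(0.0))
```

When the two sample sets coincide the difference vanishes at every point, which is the easy direction of question (1) in the abstract; the converse is the part the paper addresses.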

2603.09787 2026-06-01 cs.CV cs.LG

What is Missing? Explaining Neurons Activated by Absent Concepts

缺失的是什么?解释被缺失概念激活的神经元

Robin Hesse, Simone Schaub-Meyer, Janina Hesse, Bernt Schiele, Stefan Roth

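AI summary: Targets encoded absences in deep neural networks (neurons activated by the absence of a concept), an overlooked type of causal relationship; proposes two extensions to attribution and feature visualization methods that reveal and explain such absences, and shows experimentally that ImageNet models exploit them and that accounting for them improves debiasing.

Comments ICML 2025 | Code: https://github.com/visinf/what-is-missing

AI Chinese abstract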

可解释人工智能(XAI)旨在通过估计模型的简化因果结构,提供对深度神经网络(DNN)行为的人类可解释洞察。在现有工作中,这种因果结构通常包括概念的存在与神经元强激活之间的关系。例如,归因方法主要识别对预测贡献最大的输入像素,而特征可视化方法揭示导致目标神经元高激活的输入——前者隐含假设相关信息存在于输入中,后者假设神经元编码概念的存在。然而,一种很大程度上被忽视的因果关系是编码缺失,即概念的缺失会增加神经元的激活。在这项工作中,我们展示了这种缺失但相关的概念是常见的,并且主流XAI方法在标准形式下难以揭示它们。为了解决这个问题,我们提出了两种简单的扩展,分别应用于归因和特征可视化技术,以揭示编码缺失。通过实验,我们展示了如何使用主流XAI方法揭示和解释编码缺失,ImageNet模型如何利用它们,以及考虑它们时如何改进去偏。

English abstract

Explainable artificial intelligence (XAI) aims to provide human-interpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron; the former implicitly assumes that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.

2603.09221 2026-06-01 cs.LG

Beyond Test-Time Memory: State-Space Optimal Control for LLM Reasoning

超越测试时记忆:用于LLM推理的状态空间最优控制

Peihao Wang, Shan Yang, Xijun Wang, Tesi Xiao, Xin Liu, Changlong Yu, Yu Lou, Pan Li, Zhangyang Wang, Ming Lin, René Vidal

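AI summary: Proposes the Test-Time Control (TTC) layer, which performs reasoning via finite-horizon LQR planning and is integrated into pretrained LLMs as an adapter, improving accuracy on mathematical reasoning tasks by up to 27.8%.

Comments ICML 2026

AI Chinese abstract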

联想记忆长期以来支撑着序列模型的设计。除了回忆之外,人类通过预测未来状态和选择目标导向行动来进行推理,这是现代语言模型日益需要但并未原生编码的能力。虽然先前的工作使用强化学习或测试时训练,但规划仍然独立于模型架构。我们将推理形式化为最优控制,并引入测试时控制(TTC)层,该层在推理时对潜在状态执行有限时域LQR规划,在神经架构内表示价值函数,并利用它作为嵌套目标以实现预测前的规划。为了确保可扩展性,我们基于辛公式推导出一个硬件高效的LQR求解器,并将其实现为融合CUDA内核,从而以最小开销实现并行执行。作为适配器集成到预训练LLM中,TTC层在MATH-500上将数学推理性能提升高达27.8%,在AMC和AIME上实现2-3倍的Pass@8改进,证明将最优控制嵌入为架构组件为超越测试时训练的推理提供了有效且可扩展的机制。

English abstract

Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning as optimal control and introduce the Test-Time Control (TTC) layer, which performs finite-horizon LQR planning over latent states at inference time, represents a value function within neural architectures, and leverages it as the nested objective to enable planning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as an adapter into pretrained LLMs, TTC layers improve mathematical reasoning performance by up to +27.8% on MATH-500 and yield 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.
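
For readers unfamiliar with finite-horizon LQR, the sketch below solves the scalar case with the textbook backward Riccati recursion. It is not the symplectic solver or the CUDA kernel described in the abstract; the scalar dynamics $x_{t+1} = a x_t + b u_t$, the cost weights, and the function names are assumptions chosen to keep the example small.

```python
def lqr_backward(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for x[t+1] = a*x[t] + b*u[t].

    Cost: sum over t of (q*x[t]**2 + r*u[t]**2) plus q_final*x[T]**2.
    Returns the feedback gains k[0..T-1] (so that u[t] = -k[t]*x[t])
    and the value-function coefficient at t = 0.
    """
    p = q_final
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()  # computed from t = T-1 down to t = 0
    return gains, p


def rollout(x0, a, b, gains):
    """Apply the time-varying feedback law and return the state trajectory."""
    states = [x0]
    for k in gains:
        u = -k * states[-1]
        states.append(a * states[-1] + b * u)
    return states


gains, value_coeff = lqr_backward(a=1.2, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=10)
trajectory = rollout(1.0, 1.2, 1.0, gains)  # unstable open loop, stabilised by planning
```

The quantity `value_coeff * x0**2` is the optimal cost-to-go from the initial state, which is the scalar analogue of the value function the abstract says the layer represents.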

2602.12192 2026-06-01 cs.CL

Query-focused and Memory-aware Reranker for Long Context Processing

面向长上下文处理的查询聚焦与记忆感知重排序器

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Yanyu Chen, Zheng Lin, Wei Zhang, Jie Zhou

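AI summary: Proposes a listwise reranking framework based on the attention scores of retrieval heads in large language models, which uses holistic information from the whole candidate list to estimate passage-query relevance without Likert-scale supervision and achieves state-of-the-art performance across multiple domains.

Comments Add new experiments and compare more baselines

AI Chinese abstract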

基于现有对大语言模型中检索头的分析,我们提出了一种替代的重排序框架,该框架训练模型使用选定头的注意力分数来估计段落-查询相关性。这种方法提供了一种列表式解决方案,在排序过程中利用整个候选短列表中的整体信息。同时,它自然地产生连续的相关性分数,使得能够在任意检索数据集上进行训练,而无需Likert量表监督。我们的框架轻量且有效,仅需小规模模型(如3B参数)即可实现强性能。大量实验表明,我们的方法在多个领域(包括维基百科和长叙事数据集)中优于现有的最先进点式和列表式重排序器。它进一步在评估对话理解和记忆使用的LoCoMo基准上建立了新的最先进水平。我们还证明我们的框架支持灵活的扩展。例如,用上下文信息增强候选段落可进一步提高排序准确性,而训练中间层的注意力头可在不牺牲性能的情况下提高效率。

English abstract

Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages the holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models, such as 3B parameters, to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state of the art on the LoCoMo benchmark, which assesses dialogue understanding and memory usage. We also demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
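
One plausible reading of "relevance from the attention scores of selected heads" is to sum, for each candidate passage, the attention mass that query tokens place on that passage's tokens, averaged over the chosen heads. The sketch below implements that reading on nested lists. The aggregation rule, the data layout, and the function names are assumptions; the paper may score passages differently.

```python
def passage_scores(attention, passage_spans, selected_heads):
    """Average attention mass from query tokens onto each passage.

    attention[h][i][j]: weight from query token i to context token j in head h.
    passage_spans: list of (start, end) token ranges, end exclusive.
    selected_heads: indices of the heads treated as retrieval heads.
    """
    scores = []
    for start, end in passage_spans:
        total = 0.0
        rows = 0
        for head in selected_heads:
            for query_row in attention[head]:
                total += sum(query_row[start:end])
                rows += 1
        scores.append(total / rows)
    return scores


def rerank(scores):
    """Return passage indices ordered from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Two heads, one query token, four context tokens forming two passages.
attention = [
    [[0.1, 0.1, 0.6, 0.2]],  # head 0 favours the second passage
    [[0.7, 0.1, 0.1, 0.1]],  # head 1 favours the first passage
]
spans = [(0, 2), (2, 4)]
order = rerank(passage_scores(attention, spans, selected_heads=[0]))
```

Because all candidates sit in one context and compete for the same attention mass, scores produced this way are continuous and inherently listwise, which matches the two properties the abstract emphasises.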

2408.16457 2026-06-01 cs.LG cs.DM

HYGENE: A Diffusion-based Hypergraph Generation Method

HYGENE: 一种基于扩散的超图生成方法

Dorian Gailhard, Enzo Tartaglione, Lirida Naviner, Jhony H. Giraldo

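AI summary: Proposes HYGENE, a diffusion-based hypergraph generation method that builds the target hypergraph step by step from a single pair of connected nodes through progressive local expansion and a denoising diffusion process; it is the first application of deep learning to hypergraph generation.

Comments arXiv admin note: text overlap with arXiv:2312.11529 by other authors

AI Chinese abstract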

超图是强大的数学结构,可以模拟社交网络、生物信息学和推荐系统等各个领域中的复杂高阶关系。然而,由于其固有的复杂性和缺乏有效的生成模型,生成真实且多样化的超图仍然具有挑战性。在本文中,我们介绍了一种基于扩散的超图生成(HYGENE)方法,通过渐进局部扩展方法解决了这些挑战。HYGENE 作用于超图的二分表示,从单对连接节点开始,迭代扩展以形成目标超图。在每一步中,使用去噪扩散过程以局部方式添加节点和超边,这允许在细化局部细节之前构建全局结构。我们的实验证明了 HYGENE 的有效性,证明了它能够紧密模仿超图中的各种属性。据我们所知,这是首次尝试使用深度学习模型进行超图生成,我们的工作旨在为该领域的未来研究奠定基础。

English abstract

Hypergraphs are powerful mathematical structures that can model complex, high-order relationships in various domains, including social networks, bioinformatics, and recommender systems. However, generating realistic and diverse hypergraphs remains challenging due to their inherent complexity and the lack of effective generative models. In this paper, we introduce a diffusion-based Hypergraph Generation (HYGENE) method that addresses these challenges through a progressive local expansion approach. HYGENE works on the bipartite representation of hypergraphs, starting with a single pair of connected nodes and iteratively expanding it to form the target hypergraph. At each step, nodes and hyperedges are added in a localized manner using a denoising diffusion process, which allows for the construction of the global structure before refining local details. Our experiments demonstrate the effectiveness of HYGENE, showing that it closely mimics a variety of properties in hypergraphs. To the best of our knowledge, this is the first attempt to employ deep learning models for hypergraph generation, and our work aims to lay the groundwork for future research in this area.
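
The bipartite representation mentioned above treats every hyperedge as its own vertex and links it to each node it contains. The sketch below converts between the two forms; it covers only the data structure, not the diffusion-based expansion, and the function names are hypothetical.

```python
def to_bipartite(hyperedges):
    """Return sorted (node, hyperedge_id) pairs, one per membership."""
    return sorted(
        (node, edge_id)
        for edge_id, members in enumerate(hyperedges)
        for node in members
    )


def from_bipartite(pairs, num_hyperedges):
    """Rebuild the list of hyperedges (as sets of nodes) from membership pairs."""
    hyperedges = [set() for _ in range(num_hyperedges)]
    for node, edge_id in pairs:
        hyperedges[edge_id].add(node)
    return hyperedges


hypergraph = [{0, 1, 2}, {2, 3}]     # two hyperedges over four nodes
pairs = to_bipartite(hypergraph)
seed = to_bipartite([{0}])           # one node joined to one hyperedge
```

The `seed` case is the smallest bipartite graph of this kind, a single node connected to a single hyperedge vertex, which corresponds to the "single pair of connected nodes" the abstract names as the starting point.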

2603.08651 2026-06-01 cs.LG hep-th math-ph math.MP

Group Entropies and Mirror Duality: A Class of Flexible Mirror Descent Updates for Machine Learning

群熵与镜像对偶:一类灵活的机器学习镜像下降更新

Andrzej Cichocki, Piergiulio Tempesta

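AI summary: Presents a theoretical and algorithmic framework connecting formal group theory and group entropies with modern machine learning, obtaining flexible, tunable mirror descent updates via group-theoretical mirror maps, introducing the notion of mirror duality for switching link functions, and validating effectiveness on simplex-constrained quadratic programming problems.

Comments 36 pages, 5 figures

AI Chinese abstract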

我们引入了一个全面的理论和算法框架,将形式群论和群熵与现代机器学习联系起来,为无限、灵活的镜像下降(MD)优化算法族铺平了道路。我们的方法利用了群熵的丰富结构,这些熵是由群合成法则控制的广义熵泛函,涵盖并显著扩展了所有迹形式熵,如Shannon、Tsallis和Kaniadakis族。通过在MD中利用群论镜像映射(或链接函数),通过多参数广义对数及其逆(群指数)表达,我们实现了高度灵活和自适应的MD更新,可以针对不同的数据几何和统计分布进行定制。为此,我们引入了“镜像对偶”的概念,允许我们在特定的学习率约束下,无缝地切换或互换群论链接函数及其逆。通过调整或学习群对数的超参数,使我们能够使模型适应训练分布的统计特性,同时通过微调确保理想的收敛特性。这种通用性不仅提供了更大的灵活性和改进的收敛特性,而且通过扩展正则化器和自然梯度算法的设计,为机器学习和深度学习中的应用开辟了新的视角。我们在大规模、单纯形约束的二次规划问题上广泛评估了所提出更新的有效性、鲁棒性和性能。

English abstract

We introduce a comprehensive theoretical and algorithmic framework that bridges formal group theory and group entropies with modern machine learning, paving the way for an infinite, flexible family of Mirror Descent (MD) optimization algorithms. Our approach exploits the rich structure of group entropies, which are generalized entropic functionals governed by group composition laws, encompassing and significantly extending all trace-form entropies such as the Shannon, Tsallis, and Kaniadakis families. By leveraging group-theoretical mirror maps (or link functions) in MD, expressed via multi-parametric generalized logarithms and their inverses (group exponentials), we achieve highly flexible and adaptable MD updates that can be tailored to diverse data geometries and statistical distributions. To this end, we introduce the notion of mirror duality, which allows us to seamlessly switch or interchange group-theoretical link functions with their inverses, subject to specific learning rate constraints. Tuning or learning the hyperparameters of the group logarithms enables us to adapt the model to the statistical properties of the training distribution, while simultaneously ensuring desirable convergence characteristics via fine-tuning. This generality not only provides greater flexibility and improved convergence properties, but also opens new perspectives for applications in machine learning and deep learning by expanding the design of regularizers and natural gradient algorithms. We extensively evaluate the validity, robustness, and performance of the proposed updates on large-scale, simplex-constrained quadratic programming problems.
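
As a small instance of a mirror descent update with a generalized logarithm as the link function, the sketch below uses the Tsallis q-logarithm and its inverse, followed by renormalization onto the simplex. It covers a single one-parameter family with 0 < q <= 1 (q = 1 recovers the classical exponentiated-gradient update) and is not the multi-parametric group-entropy construction of the paper; the renormalization step and the function names are assumptions of this sketch.

```python
import math


def q_log(x, q):
    """Tsallis q-logarithm; reduces to the natural log at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def q_exp(y, q):
    """Inverse of q_log for 0 < q <= 1, clipped at zero."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(y)
    base = max(1.0 + (1.0 - q) * y, 0.0)
    return base ** (1.0 / (1.0 - q))


def mirror_descent_step(weights, gradient, lr, q):
    """Map to the dual space, take a gradient step, map back, renormalise."""
    updated = [q_exp(q_log(w, q) - lr * g, q) for w, g in zip(weights, gradient)]
    total = sum(updated)
    return [u / total for u in updated]


# Minimise f(w) = w[0] on the simplex: mass should move to the second coordinate.
weights = [0.5, 0.5]
for _ in range(20):
    weights = mirror_descent_step(weights, gradient=[1.0, 0.0], lr=0.1, q=0.5)
```

Changing `q` changes the geometry of the step without touching the rest of the loop, which is the kind of tunability the abstract attributes to learning the hyperparameters of the link function.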

2603.07751 2026-06-01 cs.CV cs.CL

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

3ViewSense: 视觉-语言模型中基于正交视图的空间与心理视角推理

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng

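AI summary: Proposes 3ViewSense, a framework whose "Simulate-and-Reason" mechanism over orthographic views addresses view consistency in the spatial reasoning of vision-language models, significantly improving occlusion-heavy counting and spatial reasoning performance.

Comments Accepted to ICML 2026

AI Chinese abstract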

当前大型语言模型已达到奥林匹克级别的逻辑推理能力,然而视觉-语言模型在诸如积木计数等基础空间任务上却出人意料地表现不佳。这种能力不匹配揭示了一个关键的“空间智能鸿沟”,即模型无法从2D观测中构建连贯的3D心理表征。我们通过诊断分析发现,这一瓶颈在于缺乏视角一致的空间接口,而非视觉特征不足或推理能力薄弱。为弥合这一鸿沟,我们引入了3ViewSense框架,该框架将空间推理建立在正交视图之上。借鉴工程认知,我们提出了一种“模拟-推理”机制,将复杂场景分解为规范的正交投影以解决几何歧义。通过将自我中心感知与这些异中心参考对齐,我们的方法促进了显式的心理旋转与重建。在空间推理基准上的实验结果表明,我们的方法显著优于现有基线,在遮挡密集计数和视角一致空间推理上取得了一致的提升。该框架还提高了空间描述的稳定性和一致性,为多模态系统中更强的空间智能提供了一条可扩展的路径。(https://github.com/Jasaxion/3ViewSense)

English abstract

Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in orthographic views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems. (https://github.com/Jasaxion/3ViewSense)
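
Why orthographic views help with block counting can be seen on a four-block scene: any single view hides at least one block, while the views together pin down the count. The sketch below projects a set of unit voxels onto front, side, and top views and counts blocks from column heights. The coordinate convention, the gravity assumption (no floating blocks), and the function names are choices made for this illustration and are not taken from 3ViewSense.

```python
def orthographic_views(voxels):
    """Project (x, y, z) voxels onto three axis-aligned views."""
    return {
        "front": {(x, z) for x, y, z in voxels},  # looking along the y axis
        "side": {(y, z) for x, y, z in voxels},   # looking along the x axis
        "top": {(x, y) for x, y, z in voxels},    # looking along the z axis
    }


def column_heights(voxels):
    """Height of each (x, y) column, assuming blocks are stacked without gaps."""
    heights = {}
    for x, y, z in voxels:
        heights[(x, y)] = max(heights.get((x, y), 0), z + 1)
    return heights


# An L-shaped base of three blocks with one block stacked on the corner.
scene = {(0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)}
views = orthographic_views(scene)
block_count = sum(column_heights(scene).values())
```

Each view of this scene shows only three cells, so counting from any one image undercounts; combining the top-view footprint with heights read from the front and side views recovers all four blocks.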

2603.07551 2026-06-01 cs.SD cs.AI

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

零样本文本转语音中的目标说话人投毒框架

Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan

AI总结 针对零样本TTS语音克隆的隐私风险,提出语音生成说话人投毒(SGSP)任务,通过修改训练模型阻止特定身份生成,并评估了推理时过滤和参数修改基线在1、15和100个遗忘说话人上的隐私-效用权衡。

Comments Submitted to Interspeech2026

详情
AI中文摘要

零样本文本转语音(TTS)语音克隆带来了严重的隐私风险,需要从训练好的TTS模型中移除特定说话人身份。传统的机器遗忘在此情境下不足,因为零样本TTS可以从仅参考提示动态重建声音。我们将此任务形式化为语音生成说话人投毒(SGSP),其中我们修改训练模型以防止生成特定身份,同时保留其他说话人的效用。我们评估了推理时过滤和参数修改基线在1、15和100个遗忘说话人上的表现。通过效用(WER)和隐私之间的权衡来评估性能,隐私使用AUC和遗忘说话人相似度(FSSIM)量化。我们在最多15个说话人上实现了强隐私,但由于身份重叠增加,在100个说话人时揭示了可扩展性限制。因此,我们的研究引入了一个新颖的问题和评估框架,以推动生成式语音隐私的进一步进展。

英文摘要

Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.
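
The inference-time filtering baseline mentioned above could look roughly like the sketch below, which blocks a reference prompt whose speaker embedding is too close to any forgotten speaker. The embedding vectors, the cosine-similarity criterion, the `threshold` value, and the function names are all assumptions made for illustration; the abstract does not specify how the filter is built.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def should_block(prompt_emb, forget_embs, threshold=0.8):
    """Hypothetical filter: block synthesis when the reference prompt
    resembles any speaker in the forget set (threshold is assumed)."""
    return any(cosine(prompt_emb, f) >= threshold for f in forget_embs)

# Toy 3-dimensional "speaker embeddings" (illustrative values only).
forget_set = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(should_block([0.9, 0.1, 0.0], forget_set))  # near the first forgotten speaker
print(should_block([0.0, 0.0, 1.0], forget_set))  # orthogonal to both
```

A filter of this kind leaves the model weights untouched, which is why the abstract contrasts it with parameter-modification baselines.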

2510.15614 2026-06-01 cs.CL

HypoSpace: A Diagnostic Benchmark for Set-Valued Hypothesis Generation under Underdetermination and Sublinear Coverage Bounds

HypoSpace: 欠定性与次线性覆盖边界下集合值假设生成的诊断基准

Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Anirudh Goyal, Yew-Soon Ong, Dianbo Liu

AI总结 提出HypoSpace基准,通过三个结构化领域评估大语言模型在欠定假设空间中的采样能力,发现模型在高有效性下存在覆盖失败,并展示分层解码可部分缓解该问题。

详情
AI中文摘要

许多科学问题是欠定的:多个不同的假设与相同的观察结果一致。在这种情况下,有效的推理不仅需要产生有效的解释,还需要系统地探索和覆盖可接受的假设集。我们引入了HypoSpace,这是一个将大语言模型(LLMs)视为有限假设空间上的采样器,并在三个指标上评估它们的基准:有效性、唯一性和恢复率。HypoSpace涵盖三个结构化领域(因果图推理、重力约束的3D体素重建和布尔遗传相互作用建模),具有确定性验证器和精确可枚举的解空间,以及基于真实世界的案例研究。实验上,HypoSpace揭示了一种依赖于能力和规模的覆盖失败:随着可接受假设空间变得更大或更具组合性,模型可以保持高有效性,同时表现出降低的唯一性和恢复率。我们进一步表明,对分层解码的分析部分缓解了这种崩溃,证明了HypoSpace作为集合值推理的诊断基准的实用性。代码可在 https://github.com/CTT-Pavilion/_HypoSpace 获取。

英文摘要

Many scientific problems are underdetermined: multiple distinct hypotheses are equally consistent with the same observations. In such settings, effective inference requires not only producing valid explanations, but also systematically exploring and covering the admissible hypothesis set. We introduce HypoSpace, a benchmark that treats large language models (LLMs) as samplers over finite hypothesis spaces and evaluates them on three metrics: Validity, Uniqueness, and Recovery. HypoSpace spans three structured domains (causal graph inference, gravity-constrained 3D voxel reconstruction, and Boolean genetic interaction modeling) with deterministic validators and exactly enumerable solution spaces, plus real-world anchored case studies. Empirically, HypoSpace reveals a capability- and scale-dependent coverage failure: models can maintain high Validity while exhibiting reduced Uniqueness and Recovery as admissible hypothesis spaces become larger or more combinatorial. We further show that the analysis on stratified decoding partially mitigates this collapse, demonstrating HypoSpace's utility as a diagnostic benchmark for set-valued inference. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
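
To make the three metrics concrete, here is a minimal sketch that scores a list of sampled hypotheses against an exactly enumerated admissible set. The formulas used for Validity, Uniqueness, and Recovery are plausible readings of the abstract, not definitions confirmed by the paper, and the string encoding of causal graphs is purely illustrative.

```python
def hypospace_metrics(samples, admissible):
    """Score sampled hypotheses against an enumerated admissible set.

    Assumed definitions (not taken from the paper):
      validity   = valid samples / all samples
      uniqueness = distinct valid samples / valid samples
      recovery   = distinct valid samples / size of the admissible set
    """
    valid = [s for s in samples if s in admissible]
    distinct = set(valid)
    validity = len(valid) / len(samples) if samples else 0.0
    uniqueness = len(distinct) / len(valid) if valid else 0.0
    recovery = len(distinct) / len(admissible) if admissible else 0.0
    return validity, uniqueness, recovery

# Four admissible hypotheses; the sampler keeps repeating one of them.
admissible = {"A->B", "B->A", "A<-C->B", "A->C->B"}
samples = ["A->B", "A->B", "A->B", "B->A", "A->A"]
print(hypospace_metrics(samples, admissible))  # (0.8, 0.5, 0.5)
```

The toy run shows the failure mode the abstract describes: most samples are valid, yet half of them are repeats and only half of the admissible set is ever reached.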

2603.06738 2026-06-01 cs.LG cs.AI

Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention

秩分解隐式神经偏置:使用FlashAttention扩展超分辨率Transformer

Dongheon Lee, Seokju Yun, Jaegyun Im, Youngmin Ro

AI总结 提出秩分解隐式神经偏置(RIB)替代相对位置偏置(RPB),通过低秩隐式神经表示和通道级拼接实现FlashAttention兼容,并引入卷积局部注意力和循环窗口策略,在Urban100×2上达到35.63 dB PSNR,训练和推理时间分别减少2.1倍和2.9倍。

详情
AI中文摘要

最近的超分辨率(SR)方法主要采用Transformer,因其强大的长程建模能力和卓越的表征能力。然而,大多数SR Transformer严重依赖相对位置偏置(RPB),这阻碍了它们利用硬件高效的注意力内核,如FlashAttention。这一限制在训练和推理过程中带来了巨大的计算负担,严重限制了通过扩大训练块大小或自注意力窗口来扩展SR Transformer的尝试。因此,与其他积极利用Transformer固有可扩展性的领域不同,SR Transformer仍然主要关注有效利用有限的感受野。在本文中,我们提出了秩分解隐式神经偏置(RIB),作为RPB的替代方案,使SR Transformer能够使用FlashAttention。具体来说,RIB使用低秩隐式神经表示来近似位置偏置,并以通道方式将它们与像素内容标记连接起来,将注意力分数计算中的逐元素偏置加法转化为点积运算。此外,我们引入了卷积局部注意力和循环窗口策略,以充分利用RIB和FlashAttention带来的长程交互优势。我们将窗口大小扩大到96×96,同时联合扩大训练块大小和数据集大小,最大化Transformer在SR任务中的优势。因此,我们的网络在Urban100×2上达到了35.63 dB PSNR,同时与基于RPB的SR Transformer(PFT)相比,训练和推理时间分别减少了2.1倍和2.9倍。

英文摘要

Recent Super-Resolution (SR) methods mainly adopt Transformers for their strong long-range modeling capability and exceptional representational capacity. However, most SR Transformers rely heavily on relative positional bias (RPB), which prevents them from leveraging hardware-efficient attention kernels such as FlashAttention. This limitation imposes a prohibitive computational burden during both training and inference, severely restricting attempts to scale SR Transformers by enlarging the training patch size or the self-attention window. Consequently, unlike other domains that actively exploit the inherent scalability of Transformers, SR Transformers remain heavily focused on effectively utilizing limited receptive fields. In this paper, we propose Rank-factorized Implicit Neural Bias (RIB), an alternative to RPB that enables FlashAttention in SR Transformers. Specifically, RIB approximates positional bias using low-rank implicit neural representations and concatenates them with pixel content tokens in a channel-wise manner, turning the element-wise bias addition in attention score computation into a dot-product operation. Further, we introduce a convolutional local attention and a cyclic window strategy to fully leverage the advantages of long-range interactions enabled by RIB and FlashAttention. We enlarge the window size up to 96×96 while jointly scaling the training patch size and the dataset size, maximizing the benefits of Transformers in the SR task. As a result, our network achieves 35.63 dB PSNR on Urban100×2, while reducing training and inference time by 2.1× and 2.9×, respectively, compared to the RPB-based SR Transformer (PFT).
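
The step from bias addition to a plain dot product rests on a simple identity: if the bias between positions i and j factorizes as u_i · v_j, then concatenating u_i to the query and v_j to the key gives the same attention logit. The sketch below checks this numerically. The vectors are made-up values, the names `u` and `v` are hypothetical stand-ins for the low-rank positional factors, and the usual attention scaling is omitted for brevity.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Content vectors for one query token and one key token (illustrative).
q = [0.5, -1.0, 2.0]
k = [1.0, 0.25, -0.5]

# Hypothetical rank-2 positional factors: bias_ij is modeled as u_i . v_j.
u = [0.3, -0.2]
v = [1.5, 0.5]

# Conventional form: content score plus an element-wise additive bias.
score_additive = dot(q, k) + dot(u, v)

# Concatenated form: one dot product over the channel-wise concatenation.
score_concat = dot(q + u, k + v)

print(math.isclose(score_additive, score_concat))  # True
```

Because the concatenated form is an ordinary query-key product, it needs no separate bias tensor, which is the property that makes a fused attention kernel applicable.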

2408.10441 2026-06-01 cs.CL

Goldfish: Monolingual Language Models for 350 Languages

Goldfish: 面向350种语言的单语语言模型

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

AI总结 针对低资源语言,发现大型多语言模型在基本语法生成上不如双元模型,而小规模单语模型在困惑度和语法性基准上表现更优,并发布了覆盖350种语言的1000多个单语模型套件Goldfish。

Comments LREC 2026

详情
AI中文摘要

对于许多低资源语言,唯一可用的语言模型是在多种语言上同时训练的大型多语言模型。尽管在推理任务上表现最先进,我们发现这些模型在许多语言的基本语法文本生成上仍然存在困难。首先,使用FLORES困惑度作为评估指标,大型多语言模型在许多语言上的表现甚至不如双元模型(例如,XGLM 4.5B中24%的语言;BLOOM 7.1B中43%的语言)。其次,当我们为350种语言训练仅有1.25亿参数、数据量不超过1GB的小型单语模型时,这些小型模型在困惑度和大规模多语言语法性基准上都优于大型多语言模型。为了促进未来低资源语言建模的工作,我们发布了Goldfish,一个包含超过1000个小型单语语言模型的套件,这些模型在350种语言上进行了可比训练。这些模型代表了其中215种语言首次公开可用的单语语言模型。

英文摘要

For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1 GB or less of data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages. These models represent the first publicly available monolingual language models for 215 of the languages included.

2603.04946 2026-06-01 cs.CL

LocalSUG: City-Preference-Enhanced LLM for Query Suggestion in Local-Life Services

LocalSUG:面向本地生活服务的城市偏好增强大语言模型查询建议

Jinwen Chen, Shiwen Zhang, Shuai Gong, Zheng Zhang, Yachao Zhao, Lingxiang Wang, Haibo Zhou, Wei Lin, Hainan Zhang

AI总结 提出LocalSUG框架,通过挖掘城市偏好候选词注入提示、束搜索GRPO算法和加速解码,解决大语言模型在本地生活服务查询建议中城市偏好不足、偏好暴露偏差和延迟约束问题。

详情
AI中文摘要

在本地生活服务平台中,查询建议通过从输入前缀生成候选查询来减少用户操作。传统的多阶段系统严重依赖历史热门查询,限制了其捕捉长尾和新兴需求的能力。尽管大语言模型提供了强大的语义泛化能力,但它们在本地生活服务中的部署面临三个挑战:城市偏好感知不足、偏好优化中的暴露偏差以及严格的在线延迟约束。我们提出了LocalSUG,一个基于大语言模型的本地生活服务查询建议框架。LocalSUG从术语共现中挖掘城市偏好增强的候选词,并将其作为动态参考注入提示中,而非融合到模型参数中。这使得模型能够适应变化的城市偏好(如商家开业或关闭),同时减少过时或本地无效的建议。我们进一步引入了束搜索驱动的GRPO算法,使训练与推理时解码对齐,并优化相关性以及业务导向的奖励。最后,质量感知的束加速和词汇剪枝在保持生成质量的同时降低了在线延迟。离线评估和大规模在线A/B测试表明,LocalSUG将点击率提高了0.35%,并将低/无结果率降低了3.98%,证明了其在实际部署中的有效性。

英文摘要

In local-life service platforms, query suggestion reduces user effort by generating candidate queries from input prefixes. Traditional multi-stage systems rely heavily on historical popular queries, limiting their ability to capture long-tail and emerging demand. Although LLMs provide strong semantic generalization, their deployment in local-life services faces three challenges: insufficient city-preference awareness, exposure bias in preference optimization, and strict online latency constraints. We propose LocalSUG, an LLM-based query suggestion framework for local-life services. LocalSUG mines city-preference-enhanced candidates from term co-occurrence and injects them into prompts as dynamic references rather than fusing them into model parameters. This allows the model to adapt to changing city preferences, such as merchant openings or closures, while reducing stale or locally invalid suggestions. We further introduce a beam-search-driven GRPO algorithm to align training with inference-time decoding and optimize relevance together with business-oriented rewards. Finally, quality-aware beam acceleration and vocabulary pruning reduce online latency while preserving generation quality. Offline evaluations and large-scale online A/B testing show that LocalSUG improves CTR by +0.35% and reduces the low/no-result rate by 3.98%, demonstrating its effectiveness in real-world deployment.

2603.02630 2026-06-01 cs.LG cs.AI

MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

MASPOB:基于赌博机与图神经网络的多智能体系统提示优化

Zhi Hong, Qian Zhang, Jiahang Sun, Zhiwei Shang, Mingze Kong, Xiangyi Wang, Yao Shu, Zhongxiang Dai

AI总结 提出基于赌博机的样本高效框架MASPOB,利用UCB平衡探索与利用、GNN捕获拓扑先验、坐标上升分解优化,解决多智能体系统提示优化中的样本效率、拓扑耦合和组合爆炸问题。

Comments ICML 2026 Spotlight

详情
AI中文摘要

大型语言模型(LLMs)在许多实际应用中取得了巨大成功,尤其是作为多智能体系统(MAS)的认知骨干来编排复杂工作流。由于许多部署场景排除了MAS工作流修改,且其性能对输入提示高度敏感,提示优化成为提高性能的更自然方法。然而,实际中的MAS提示优化面临三个关键挑战:(1)由于评估成本高昂,需要样本效率;(2)提示之间的拓扑诱导耦合;(3)搜索空间的组合爆炸。为了解决这些挑战,我们引入了MASPOB(基于赌博机的多智能体系统提示优化),一种基于赌博机的新型样本高效框架。通过利用上置信界(UCB)量化不确定性,赌博机框架平衡了探索与利用,在严格有限的预算内最大化收益。为了处理拓扑诱导耦合,MASPOB集成了图神经网络(GNN)以捕获结构先验,学习提示语义的拓扑感知表示。此外,它采用坐标上升将优化分解为单变量子问题,将搜索复杂度从指数级降低到线性级。跨不同基准的大量实验表明,MASPOB实现了最先进的性能,持续优于现有基线。

英文摘要

Large Language Models (LLMs) have achieved great success in many real-world applications, especially as the cognitive backbone of Multi-Agent Systems (MAS) that orchestrate complex workflows. Since many deployment scenarios preclude modifying the MAS workflow and MAS performance is highly sensitive to the input prompts, prompt optimization emerges as a more natural way to improve performance. However, real-world prompt optimization for MAS is impeded by three key challenges: (1) the need for sample efficiency due to prohibitive evaluation costs, (2) topology-induced coupling among prompts, and (3) the combinatorial explosion of the search space. To address these challenges, we introduce MASPOB (Multi-Agent System Prompt Optimization via Bandits), a novel sample-efficient framework based on bandits. By leveraging Upper Confidence Bound (UCB) to quantify uncertainty, the bandit framework balances exploration and exploitation, maximizing gains within a strictly limited budget. To handle topology-induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology-aware representations of prompt semantics. Furthermore, it employs coordinate ascent to decompose the optimization into univariate sub-problems, reducing search complexity from exponential to linear. Extensive experiments across diverse benchmarks demonstrate that MASPOB achieves state-of-the-art performance, consistently outperforming existing baselines.
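
The exponential-to-linear claim can be illustrated with a toy coordinate-ascent loop that tunes one agent's prompt at a time while holding the others fixed. This is only a sketch of the decomposition idea: the integer "prompts", the `toy_score` function, and the single sweep are assumptions, and the UCB acquisition and GNN surrogate described in the abstract are not reproduced here.

```python
def coordinate_ascent(candidates, score, sweeps=1):
    """Optimize one coordinate (one agent's prompt) at a time.

    candidates: list of option lists, one per agent.
    score: callable mapping a full assignment to a number (higher is better).
    Returns the final assignment and the number of score evaluations.
    """
    current = [options[0] for options in candidates]
    evals = 0
    for _ in range(sweeps):
        for i, options in enumerate(candidates):
            best_opt, best_val = current[i], None
            for opt in options:
                trial = current[:i] + [opt] + current[i + 1:]
                val = score(trial)
                evals += 1
                if best_val is None or val > best_val:
                    best_opt, best_val = opt, val
            current[i] = best_opt
    return current, evals

# Toy setup: 3 agents, 4 candidate "prompts" each, encoded as integers.
candidates = [[0, 1, 2, 3] for _ in range(3)]
targets = [2, 0, 3]

def toy_score(assignment):
    """Stand-in for an expensive MAS evaluation (separable by design)."""
    return -sum((a - t) ** 2 for a, t in zip(assignment, targets))

best, evals = coordinate_ascent(candidates, toy_score)
print(best, evals)       # [2, 0, 3] 12
print(4 ** 3)            # 64 evaluations for exhaustive search
```

With n agents and k candidates each, one sweep costs n·k evaluations instead of k^n. When prompts are strongly coupled, a single sweep may stop at a local optimum, which is presumably where the topology-aware surrogate matters.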

2602.22968 2026-06-01 cs.AI cs.CV cs.CY

Certified Circuits: Stability Guarantees for Mechanistic Circuits

认证电路:机械论电路的稳定性保证

Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer

AI总结 提出Certified Circuits框架,通过随机数据子采样认证电路组件(神经元或边)对概念数据集编辑距离扰动的稳定性,生成更紧凑、更准确的电路。

Comments Accepted at ICML 2026

详情
AI中文摘要

理解神经网络如何得出其预测对于调试、审计和部署至关重要。机械论可解释性通过识别电路——负责特定行为的最小子网络——来追求这一目标。然而,现有的电路发现方法脆弱:电路强烈依赖于所选的概念数据集,并且常常无法迁移到分布外,引发对其是否捕捉概念或仅仅是数据集特定伪影的怀疑。我们引入了Certified Circuits,它为电路发现提供了可证明的稳定性保证。我们的框架用随机数据子采样包装任何黑盒发现算法,以认证电路组件——根据基础算法,模型图的神经元或边——的包含决策对概念数据集的有界编辑距离扰动是不变的。不稳定的组件被弃用,从而产生更紧凑、更准确的电路。我们在三个架构(ResNet、ViT、GPT-2)上,针对视觉(ImageNet和四个OOD数据集)和语言(IOI、IOI-Hard、Greater-Than)任务进行了验证。认证电路实现了高达56%的更高准确率和高达80%的更少组件,并且在基线退化时保持可靠。Certified Circuits通过产生可证明稳定且与目标概念更好对齐的机械论解释,将电路发现置于形式化的基础上。代码:https://github.com/AlaaAnani/certified-circuits。

英文摘要

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits--minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the concept or merely dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that inclusion decisions over circuit components--neurons or edges of the model graph, depending on the base algorithm--are invariant to bounded edit-distance perturbations of the concept dataset. The framework abstains from unstable components, yielding circuits that are more compact and more accurate. We validate across three architectures (ResNet, ViT, GPT-2) on vision (ImageNet and four OOD datasets) and language (IOI, IOI-Hard, Greater-Than) tasks. Certified circuits achieve up to 56% higher accuracy and up to 80% fewer components, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code: https://github.com/AlaaAnani/certified-circuits.
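
The wrap-and-abstain pattern described above can be sketched as follows: run a discovery routine on many random subsamples of the concept dataset and keep only the components that are selected almost every time. Everything here is an assumption made for illustration, including `toy_discovery`, the inclusion-frequency threshold `keep_at`, and the dataset encoding. The sketch does not compute the paper's edit-distance certificate; it only shows the subsampling wrapper.

```python
import random

def toy_discovery(dataset):
    """Stand-in for a black-box discovery algorithm: keep components
    that are active in more than half of the given examples."""
    counts = {}
    for example in dataset:
        for comp in example:
            counts[comp] = counts.get(comp, 0) + 1
    return {c for c, n in counts.items() if n > len(dataset) / 2}

def stable_components(dataset, discover, runs=50, sample_size=10,
                      keep_at=0.9, seed=0):
    """Keep components selected in at least `keep_at` of the subsampled
    runs; abstain from the rest (threshold is an assumed heuristic)."""
    rng = random.Random(seed)
    hits = {}
    for _ in range(runs):
        subset = rng.sample(dataset, sample_size)
        for comp in discover(subset):
            hits[comp] = hits.get(comp, 0) + 1
    return {c for c, n in hits.items() if n / runs >= keep_at}

# "n1" is active in every example; "n2" is active in exactly half of them,
# so its inclusion flips depending on which examples are drawn.
dataset = [{"n1", "n2"} if i % 2 == 0 else {"n1"} for i in range(20)]
print(stable_components(dataset, toy_discovery))
```

The borderline component is dropped because its inclusion depends on the particular subsample, which mirrors the abstract's point that abstaining from unstable components yields more compact circuits.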

2602.24210 2026-06-01 cs.CL cs.AI

From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves

从泄露思维到私有推理:控制LRM对自己说的话

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

AI总结 针对大型推理模型(LRM)推理过程中隐私泄露问题,提出通过指令跟随(IF)训练和分阶段解码策略(Staged Decoding)增强隐私保护,在IF和隐私基准上分别提升高达20.9和51.9个百分点。

详情
AI中文摘要

大型推理模型(LRM)产生的推理轨迹(RT)通常包含敏感信息。这些泄露的思维难以控制,且经常违反明确的隐私指令。由于RT可能通过提示注入攻击暴露,这对用户构成了直接的隐私风险。我们将此视为一个可控性问题:由于隐私指令本身就是指令,在RT内改进指令跟随(IF)为减少隐私泄露提供了直接途径。为此,我们引入了一个SFT数据集,教会模型在其推理过程中遵循通用指令,并提出了分阶段解码(Staged Decoding),一种简单的解码策略,通过使用独立的LoRA适配器解耦RT和答案生成,以最大化每个组件的IF。我们在两个系列(1.7B-14B参数)的六个模型上,在两个IF基准和两个隐私基准上评估了我们的方法。我们的方法带来了显著的改进,在IF上提升高达20.9分,在隐私基准上提升51.9个百分点,尽管由于推理性能与IF之间的权衡,这些改进可能以牺牲任务效用为代价。我们的结果表明,改进LRM中的IF可以显著增强隐私,为未来隐私感知的LRM提供了一个有前景的方向。我们的代码可在https://github.com/UKPLab/arxiv2026-controllable-reasoning-models获取。

英文摘要

Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult to control and frequently violate explicit privacy directives. Because RTs can be exposed through prompt injection attacks, this becomes a direct privacy risk to the user. We approach this as a controllability problem: since privacy directives are themselves instructions, improving instruction-following (IF) within the RT provides a direct path to reducing privacy leaks. To this end, we introduce an SFT dataset that teaches models to follow general instructions throughout their reasoning process, and propose Staged Decoding, a simple decoding strategy that decouples RT and answer generation using separate LoRA adapters to maximize IF of each component. We evaluate our approach on six models from two families (1.7B-14B parameters), across two IF benchmarks and two privacy benchmarks. Our method yields substantial improvements, with gains of up to 20.9 points in IF and 51.9 percentage points on privacy benchmarks, though these can come at the cost of task utility due to the trade-off between reasoning performance and IF. Our results show that improving IF in LRMs can significantly enhance privacy, suggesting a promising direction for future privacy-aware LRMs. Our code is available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models.

2602.10117 2026-06-01 cs.LG cs.AI

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

盲点中的偏见:检测大语言模型未能提及的内容

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

AI总结 提出全自动黑盒流水线,通过统计测试和思维链分析,自动检测大语言模型在任务中未明确表述的偏见。

Comments Published at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

大语言模型(LLMs)通常提供看似合理的思维链(CoT)推理痕迹,但可能隐藏内部偏见。我们称这些为未表述的偏见。因此,通过模型陈述的推理来监控模型是不可靠的,现有的偏见评估通常需要预定义类别和手工制作的数据集。在这项工作中,我们引入了一个全自动的黑盒流水线,用于检测特定任务的未表述偏见。给定一个任务数据集,该流水线使用LLM自动评分器生成候选偏见概念。然后,通过生成正面和负面变体,在逐渐增大的输入样本上测试每个概念,并应用统计技术进行多重测试和早期停止。如果一个概念在模型的CoT中未被引用为理由,但产生了统计上显著的性能差异,则将其标记为未表述的偏见。我们在三个决策任务(招聘、贷款审批和大学录取)上对七个LLM评估了我们的流水线。我们的技术自动发现了这些模型中以前未知的偏见(例如,西班牙语流利度、英语熟练度、写作正式度)。在同一运行中,该流水线还验证了先前工作手动识别的偏见(性别、种族、宗教、民族)。更广泛地说,我们提出的方法为自动、更高效和更广泛的特定任务未表述偏见发现提供了一条实用、可扩展的路径。

英文摘要

Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across seven LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic, more efficient, and broader task-specific unverbalized bias discovery.

2602.23280 2026-06-01 cs.LG cs.RO

Mollified Value Learning

Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Mihir Chauhan, Damon Conover, Ziran Wang, Aniket Bera

AI总结 针对离线目标条件强化学习中值函数估计困难的问题,提出一种通过空间测度聚合约束(而非逐点微分约束)来诱导距离类值几何的方法,称为Mollified Value Learning(MVL),在导航和高维机器人操作任务中提升了目标达成性能。

详情
AI中文摘要

离线目标条件强化学习(GCRL)从静态数据集中学习达到目标的行为,但在有限的状态-动作覆盖下,准确的值估计仍然具有挑战性。现有的物理信息方法通过施加由Hamilton-Jacobi-Bellman(HJB)最优性原理导出的逐点距离类几何约束(通常通过一阶偏微分方程如Eikonal方程)来解决这一问题。然而,通过显式微分结构强制局部一致性在复杂高维环境中可能变得不稳定。我们的关键洞察是,将距离类约束重新解释为局部空间测度上的期望。通过在该测度上聚合约束而非逐点评估,目标函数充当空间平滑器(mollifier),在无需昂贵微分算子的情况下诱导出距离类值几何。我们称之为Mollified Value Learning(MVL)。在导航和高维机器人操作任务上的实验表明,当与隐式值表示学习方法结合使用时,MVL学习到结构化的值表示,提高了目标达成性能。开源代码可在https://github.com/HrishikeshVish/MVL获取。

英文摘要

Offline goal-conditioned reinforcement learning (GCRL) learns goal-reaching behaviors from static datasets, but accurate value estimation remains challenging under limited state-action coverage. Existing physics-informed approaches address this by imposing pointwise distance-like geometric constraints derived from Hamilton--Jacobi--Bellman (HJB) optimality principles, often through first-order partial differential equations such as the Eikonal equation. However, enforcing local consistency through explicit differential structure can become unstable in complex, high-dimensional environments. Our key insight is to instead reinterpret distance-like constraints as an expectation over a local spatial measure. By aggregating constraints over this measure rather than evaluating them pointwise, the objective acts as a spatial mollifier, inducing distance-like value geometry without requiring expensive differential operators. We refer to this as Mollified Value Learning (MVL). Experiments across navigation and high-dimensional robotic manipulation tasks show that MVL learns structured value representations and improves goal-reaching performance when used with implicit value representation learning methods. Open-source code is available at https://github.com/HrishikeshVish/MVL.

2602.22971 2026-06-01 cs.AI

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

SPM-Bench:面向扫描探针显微镜的大型语言模型基准测试

Peiyao Xiao, Xiaogang Li, Xinyi Gao, Chengliang Xu, Ben Wang, Zichao Chen, Zeyu Wang, Lin Qu, Bing Zhao, Hu Wei

AI总结 提出SPM-Bench,一个全自动数据合成管道和严格评估指标(SIP-F1),用于测试LLMs在扫描探针显微镜领域的推理能力,并首次量化模型“个性”。

详情
AI中文摘要

随着LLMs在通用推理方面取得突破,它们在特定科学领域的熟练程度因数据污染、复杂性不足和过高的人力成本而在现有基准测试中暴露出明显差距。在此,我们提出SPM-Bench,一个专为扫描探针显微镜(SPM)设计的原创、博士级多模态基准测试。我们提出一个全自动数据合成管道,确保高权威性和低成本。通过采用锚点门控筛(AGS)技术,我们从2023年至2025年间发表的arXiv和期刊论文中高效提取高价值图像-文本对。通过混合云-本地架构(其中VLM仅返回空间坐标“llbox”以进行本地高保真裁剪),我们的管道在保持高数据集纯度的同时实现了极致的token节省。为了准确客观地评估LLMs的性能,我们引入了严格不完美惩罚F1(SIP-F1)分数。该指标不仅建立了严格的能力层级,而且首次量化了模型“个性”(保守型、激进型、赌徒型或明智型)。通过将这些结果与模型报告的置信度和感知难度相关联,我们揭示了当前AI在复杂物理场景中的真实推理边界。这些见解使SPM-Bench成为自动化科学数据合成的可推广范式。

英文摘要

As LLMs have achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture where VLMs return only spatial coordinates "llbox" for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To accurately and objectively evaluate the performance of LLMs, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model "personalities" (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.

2602.21340 2026-06-01 cs.LG

HiPPO Zoo: Explicit Memory Mechanisms for Interpretable State Space Models

HiPPO动物园:可解释状态空间模型的显式记忆机制

Jack Goffinet, Casey Hanks, David E. Carlson

AI总结 本文通过扩展HiPPO框架,提出五种显式、可解释的记忆机制(统称“HiPPO动物园”),使状态空间模型具备自适应记忆分配和联想记忆等能力,并在合成序列建模任务中验证其有效性。

Comments 24 pages, 7 figures; to be published in ICML 2026; additional experimental results included

详情
AI中文摘要

以压缩、高效且信息丰富的方式表示过去是处理序列数据系统的核心问题。Gu & Dao等人最初提出的HiPPO框架通过结构化线性常微分方程将信号投影到正交多项式(OP)基上,为序列压缩提供了一种原则性方法。后续工作将这些动态嵌入状态空间模型(SSM)中,其中HiPPO结构用作初始化。这些SSM方法的非线性后继(如Mamba)在许多具有长程依赖的任务中达到最先进水平,但它们表示和优先处理历史的机制在很大程度上仍是隐式的。在这项工作中,我们重新审视HiPPO框架,目标是使这些机制显式化。我们展示了如何扩展历史的多项式表示以支持现代SSM的能力(如自适应记忆分配和联想记忆),同时保留在OP基上的直接可解释性。我们引入一个统一的框架,包含五种这样的扩展,统称为“HiPPO动物园”。每种扩展通过对HiPPO框架进行显式、可解释的修改,暴露特定的建模能力。所得模型在线调整其记忆,并在流式设置中以高效更新进行训练。我们通过一系列合成序列建模任务展示了这些扩展的行为和建模优势,证明通常与现代SSM相关的能力可以通过显式、可解释的多项式记忆结构实现。

英文摘要

Representing the past in a compressed, efficient, and informative manner is a central problem for systems trained on sequential data. The HiPPO framework, originally proposed by Gu & Dao et al., provides a principled approach to sequential compression by projecting signals onto orthogonal polynomial (OP) bases via structured linear ordinary differential equations. Subsequent works have embedded these dynamics in state space models (SSMs), where HiPPO structure serves as an initialization. Nonlinear successors of these SSM methods such as Mamba are state-of-the-art for many tasks with long-range dependencies, but the mechanisms by which they represent and prioritize history remain largely implicit. In this work, we revisit the HiPPO framework with the goal of making these mechanisms explicit. We show how polynomial representations of history can be extended to support capabilities of modern SSMs such as adaptive memory allocation and associative memory, while retaining direct interpretability in the OP basis. We introduce a unified framework comprising five such extensions, which we collectively refer to as a "HiPPO zoo." Each extension exposes a specific modeling capability through an explicit, interpretable modification of the HiPPO framework. The resulting models adapt their memory online and train in streaming settings with efficient updates. We illustrate the behaviors and modeling advantages of these extensions through a range of synthetic sequence modeling tasks, demonstrating that capabilities typically associated with modern SSMs can be realized through explicit, interpretable polynomial memory structures.

2602.21013 2026-06-01 cs.RO

Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

给自己的笔记:面向记忆依赖操作任务的草稿本增强VLA

Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

AI总结 本文通过在视觉-语言-动作模型中加入语言草稿本来赋予其空间和时间记忆,从而提升其在依赖记忆的长时域操作任务上的泛化能力。

Comments To appear at ICRA 2026

详情
AI中文摘要

许多灵巧操作任务本质上是非马尔可夫的,但在最近视觉-语言-动作(VLA)范式的热潮中,这一点很少受到关注。尽管VLA成功地将互联网规模的语义理解引入机器人领域,但现有的VLA主要是“无状态的”,并且在依赖记忆的长时域任务中表现不佳。在这项工作中,我们探索了一种通过引入语言草稿本来赋予VLA空间和时间记忆的方法。草稿本使得记忆任务特定信息(如物体位置)成为可能,并且允许模型跟踪计划以及在该计划中朝着子目标的进展。我们在ClevrSkills环境中的一组依赖记忆的任务、MemoryBench以及一个具有挑战性的真实世界拾取和放置任务上评估了这种方法。我们表明,对于非递归和递归模型,引入语言草稿本显著提高了这些任务的泛化能力。

英文摘要

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.

2411.00759 2026-06-01 cs.LG stat.ML

Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching

离散流匹配中的小批量最优传输与困惑度界估计

Etrit Haxholli, Yeti Z. Gurbuz, Ogul Can, Eli Waxman

AI总结 针对离散流匹配中状态转移过多和概率估计困难的问题,提出基于小批量最优传输的动态优化目标以减少转移次数,并给出两个困惑度上界以支持训练与评估。

详情
AI中文摘要

离散流匹配是一种用于建模分类数据的最新框架,在性能上与自回归模型相当。然而,与连续流匹配不同,由于离散路径的随机性,整流策略无法应用,因此需要替代方法来最小化状态转移。我们提出了一种动态最优传输类的最小化目标,并推导了其用于具有凸插值的离散流的Kantorovich形式,其中传输成本仅取决于状态间的不相似性,并可通过小批量策略进行优化。我们表明,此类方法可以将转移次数减少多达32倍(从1024到32),以达到相同的生成困惑度,同时不损害多样性。此外,离散流中的路径非确定性排除了瞬时变量变换的类似物,从而无法进行连续流可用的精确概率估计。因此,我们提出了两个困惑度上界,实现了有原则的训练、评估和模型比较。最后,我们引入了多掩码流,其在生成困惑度上优于掩码流且不损害多样性,特别是在使用小批量最优传输时。

英文摘要

Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. However, unlike continuous flow matching, the rectification strategy cannot be applied due to the stochasticity of discrete paths, necessitating alternative methods to minimize state transitions. We propose a dynamic-optimal-transport-like minimization objective and derive its Kantorovich formulation for discrete flows with convex interpolants, where transport cost depends solely on inter-state dissimilarity and can be optimized via minibatch strategies. We show that such methods can reduce the number of transitions by up to 32 times (from 1024 to 32) to reach the same generative perplexity without compromising diversity. Additionally, path nondeterminism in discrete flows precludes an instantaneous change-of-variables analogue, preventing the precise probability estimation available to continuous flows. We therefore propose two upper bounds on perplexity, enabling principled training, evaluation and model comparison. Finally, we introduce Multimask Flows, which outperform masked flows in generative perplexity without compromising diversity, particularly when utilizing minibatch Optimal Transport.
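
The minibatch strategy amounts to re-pairing source and target samples within each batch so that the total dissimilarity is as small as possible before training on the pairs. The sketch below does this by brute force over permutations. Hamming distance as the inter-state dissimilarity, the string-valued samples, and the function names are assumptions made for illustration, and exhaustive search is only viable for very small batches.

```python
from itertools import permutations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def minibatch_ot_pairing(sources, targets, cost=hamming):
    """Find the one-to-one pairing with minimal total cost by brute force.

    Returns the list of (source, target) pairs and the total cost.
    """
    n = len(sources)
    best_perm, best_cost = None, None
    for perm in permutations(range(n)):
        total = sum(cost(sources[i], targets[perm[i]]) for i in range(n))
        if best_cost is None or total < best_cost:
            best_perm, best_cost = perm, total
    pairs = [(sources[i], targets[best_perm[i]]) for i in range(n)]
    return pairs, best_cost

sources = ["aaaa", "bbbb", "abab"]
targets = ["bbbb", "abab", "aaab"]

pairs, total = minibatch_ot_pairing(sources, targets)
print(pairs)   # [('aaaa', 'aaab'), ('bbbb', 'bbbb'), ('abab', 'abab')]
print(total)   # 1
```

Pairing the samples in their given order would cost 7 token changes, while the optimal pairing costs 1, which gives a feel for how the coupling can shrink the number of state transitions a path must make.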

2602.19049 2026-06-01 cs.CL cs.LG

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

IAPO:面向令牌高效推理的信息感知策略优化

Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang, Qi Guo, Jundong Li

AI总结 提出信息感知策略优化框架IAPO,通过基于条件互信息的令牌级优势塑造,在提升推理准确率的同时将推理长度减少高达36%。

详情
AI中文摘要

大型语言模型越来越依赖长思维链来提高准确性,但这种提升伴随着巨大的推理时间成本。我们重新审视令牌高效的后训练,并认为现有的序列级奖励塑造方法对推理努力在令牌间的分配控制有限。为弥补这一差距,我们提出IAPO,一个信息论后训练框架,根据每个令牌与最终答案的条件互信息(MI)分配令牌级优势。这提供了一种明确、有原则的机制来识别信息丰富的推理步骤并抑制低效探索。我们的理论分析表明,IAPO可以在不损害正确性的情况下诱导推理冗长性的单调减少。实验上,IAPO在保持推理准确率的同时,将推理长度减少高达36%,在各种推理数据集上优于现有的令牌高效强化学习方法。广泛的实证评估表明,信息感知优势塑造是令牌高效后训练的一个强大且通用的方向。代码可在 https://github.com/YinhanHe123/IAPO 获取。

英文摘要

Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.

2602.18837 2026-06-01 cs.LG

L2G-Net: Local to Global Spectral Graph Neural Networks via Cauchy Factorizations

L2G-Net:通过柯西分解的局部到全局谱图神经网络

Samuel Fernández-Menduiña, Eduardo Pavez, Antonio Ortega

AI总结 提出L2G-Net,通过将图傅里叶变换精确分解为作用于子图的算子并利用柯西矩阵组合,实现局部到全局的谱图神经网络,避免全特征分解,在长程依赖任务上以极少的可学习参数达到竞争性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

尽管具有理论优势,基于图傅里叶变换(GFT)的谱方法由于计算特征基的成本以及所得表示缺乏顶点域局部性,很少用于图神经网络(GNN)。因此,大多数GNN依赖局部近似,如多项式拉普拉斯滤波器或消息传递,这限制了它们建模长程依赖的能力。在本文中,我们引入了一种将GFT精确分解为作用于子图的算子的方法,然后通过一系列柯西矩阵进行组合。基于这种分解,我们提出了一类新的谱GNN,称为L2G-Net(局部到全局网络)。与现有的谱方法(使用GFT时完全全局,或使用多项式滤波器时局部)不同,L2G-Net通过处理子图的谱表示,然后通过结构化矩阵组合它们来运作。我们的算法避免了完全特征分解,利用图拓扑结构以节点数的二次复杂度(按子图间最大割大小缩放)构建分解。在强调长程依赖的大图上的实验表明,L2G-Net可扩展到标准GFT无法企及的范围,并以数量级更少的可学习参数与最先进方法竞争。

英文摘要

Despite their theoretical advantages, spectral methods based on the graph Fourier transform (GFT) are seldom used in graph neural networks (GNNs) due to the cost of computing the eigenbasis and the lack of vertex-domain locality in the resulting representations. As a result, most GNNs rely on local approximations such as polynomial Laplacian filters or message passing, which limit their ability to model long-range dependencies. In this paper, we introduce an exact factorization of the GFT into operators acting on subgraphs, which are then combined via a sequence of Cauchy matrices. Building on this factorization, we propose a new class of spectral GNNs, termed L2G-Net (Local to Global Net). Unlike existing spectral methods, which are either fully global (when using the GFT) or local (when using polynomial filters), L2G-Net operates by processing the spectral representations of subgraphs and then combining them via structured matrices. Our algorithm avoids full eigendecompositions, exploiting graph topology to construct the factorization with quadratic complexity in the number of nodes, scaled by the maximum cut size between subgraphs. Experiments stressing long-range dependencies on large graphs show that L2G-Net scales to regimes out of reach for the standard GFT, and is competitive with state-of-the-art methods with orders of magnitude fewer learnable parameters.