arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.07285 2026-06-01 cs.LG

Fair Decisions from Calibrated Scores: Achieving Optimal Classification While Satisfying Sufficiency

Etam Benger, Katrina Ligett

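AI summary For binary classification under the sufficiency fairness constraint, the paper proposes a post-processing algorithm based on group-calibrated scores that attains the optimal randomized classifier, and gives a geometric characterization of the feasible pairs of positive predictive value and false omission rate.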

Comments Accepted to ICML 2026

Abstract

Binary classification based on predicted probabilities (scores) is a fundamental task in supervised machine learning. While thresholding scores is Bayes-optimal in the unconstrained setting, using a single threshold generally violates statistical group fairness constraints. Under independence (statistical parity) and separation (equalized odds), such thresholding suffices when the scores already satisfy the corresponding criterion. However, this does not extend to sufficiency: even perfectly group-calibrated scores -- including true class probabilities -- violate predictive parity after thresholding. In this work, we present an exact solution for optimal binary (randomized) classification under sufficiency, assuming finite sets of group-calibrated scores. We provide a geometric characterization of the feasible pairs of positive predictive value (PPV) and false omission rate (FOR) achievable by such classifiers, and use it to derive a simple post-processing algorithm that attains the optimal classifier using only group-calibrated scores and group membership. Finally, since sufficiency and separation are generally incompatible, we identify the classifier that minimizes deviation from separation subject to sufficiency, and show that it can also be obtained by our algorithm, often achieving performance comparable to the optimum.
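
The claim about thresholding can be checked on a toy case. The sketch below is not the paper's algorithm; the two groups, their score values, and the helper `ppv_and_for` are hypothetical. It assumes each score is perfectly group-calibrated, so the fraction of positives among individuals with score s is exactly s, and it shows that one shared threshold can leave the false omission rate equal across groups while the positive predictive value differs.

```python
def ppv_and_for(score_mass, threshold):
    """Return (PPV, FOR) for a deterministic threshold rule.

    score_mass maps each calibrated score s to its probability mass within
    the group. Calibration is assumed: P(Y = 1 | score = s, group) = s.
    """
    predicted_pos = {s: m for s, m in score_mass.items() if s >= threshold}
    predicted_neg = {s: m for s, m in score_mass.items() if s < threshold}
    # PPV: share of true positives among predicted positives.
    ppv = sum(s * m for s, m in predicted_pos.items()) / sum(predicted_pos.values())
    # FOR: share of true positives among predicted negatives.
    false_omission = sum(s * m for s, m in predicted_neg.items()) / sum(predicted_neg.values())
    return ppv, false_omission


# Hypothetical groups: both are calibrated, but their score distributions differ.
group_a = {0.2: 0.5, 0.9: 0.5}
group_b = {0.2: 0.5, 0.6: 0.5}

ppv_a, for_a = ppv_and_for(group_a, threshold=0.5)
ppv_b, for_b = ppv_and_for(group_b, threshold=0.5)
print(f"group A: PPV={ppv_a:.2f}, FOR={for_a:.2f}")
print(f"group B: PPV={ppv_b:.2f}, FOR={for_b:.2f}")
```

With these made-up numbers the PPV gap comes entirely from the scores above the threshold, which is why calibration alone does not deliver predictive parity after thresholding.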

2510.10544 2026-06-01 cs.LG cs.AI stat.ML

PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata

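AI summary Proposes a new PAC-Bayesian generalization bound that explicitly accounts for Markov dependencies in the data through the chain's mixing time, and builds on it to design PB-SAC, an algorithm that optimizes the bound to guide exploration; on continuous control tasks it provides meaningful confidence certificates while maintaining competitive performance.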

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

Abstract

We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. The new bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the practical utility of the bound through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across several continuous control tasks show that the proposed approach provides meaningful confidence certificates while maintaining competitive performance.

2602.06902 2026-06-01 cs.LG stat.ML

Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory

Hao Qiu, Andrew Jacobsen, Emmanuel Esposito, Mengxiao Zhang

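AI summary Proposes a new algorithm that achieves the first comparator-adaptive dynamic regret bound for online convex optimization with time-varying movement costs, and applies it to delayed feedback and time-varying memory.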

Comments 28 pages; v2: ICML 2026

Abstract

In this paper, we study dynamic regret in unconstrained online convex optimization (OCO) with movement costs. Specifically, we generalize the standard setting by allowing the movement cost coefficients $λ_t$ to vary arbitrarily over time. Our main contribution is a novel algorithm that establishes the first comparator-adaptive dynamic regret bound for this setting, guaranteeing $\widetilde{\mathcal{O}}(\sqrt{(M^2+MP_T)(T+\sum_t λ_t)})$ regret, where $P_T$ is the path length of the comparator sequence over $T$ rounds and $M$ is the maximal comparator norm. Our result recovers the optimal adaptive rates for both static and dynamic regret in OCO as the special case where $λ_t=0$ for all rounds. To demonstrate the versatility of our results, we consider two applications: OCO with delayed feedback and OCO with time-varying memory. We show that both problems can be translated into time-varying movement costs, establishing a novel reduction specifically for the delayed feedback setting that is of independent interest. A crucial observation is that the first-order dependence on movement costs in our regret bound plays a key role in enabling optimal comparator-adaptive dynamic regret guarantees in both settings.
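
The special case mentioned in the abstract follows by direct substitution into the stated bound. The second line below additionally assumes a fixed comparator, for which the path length $P_T$ is zero; that reading of the static case is an assumption made here for illustration, not a statement taken from the paper.

```latex
% No movement costs: \lambda_t = 0 for all t
\widetilde{\mathcal{O}}\!\left(\sqrt{(M^2 + M P_T)\Big(T + \sum_t \lambda_t\Big)}\right)
  \;=\; \widetilde{\mathcal{O}}\!\left(\sqrt{(M^2 + M P_T)\,T}\right)

% Fixed comparator in addition (assumed): P_T = 0
\widetilde{\mathcal{O}}\!\left(\sqrt{M^2\,T}\right)
  \;=\; \widetilde{\mathcal{O}}\!\left(M\sqrt{T}\right)
```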

2602.00942 2026-06-01 cs.LG

SALAAD: Sparse And Low-Rank Adaptation via ADMM for Large Language Model Inference

Hao Ma, Melis Ilayda Bal, Liang Zhang, Bingcong Li, Niao He, Melanie Zeilinger, Michael Muehlebach

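AI summary Proposes SALAAD, a framework that uses an augmented Lagrangian method to induce sparse and low-rank structure during training, giving flexible control over model capacity and reducing deployment memory without retraining.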

Abstract

Modern large language models are increasingly deployed under compute and memory constraints, making flexible control of model capacity a central challenge. While sparse and low-rank structures naturally trade off capacity and performance, existing approaches often rely on heuristic designs that ignore layer and matrix heterogeneity or require model-specific architectural modifications. We propose SALAAD, a plug-and-play framework applicable to different model architectures that induces sparse and low-rank structures during training. By formulating structured weight learning under an augmented Lagrangian framework and introducing an adaptive controller that dynamically balances the training loss and structural constraints, SALAAD preserves the stability of standard training dynamics while enabling explicit control over the evolution of effective model capacity during training. Experiments across model scales show that SALAAD substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Moreover, a single training run yields a continuous spectrum of model capacities, enabling smooth and elastic deployment across diverse memory budgets without the need for retraining.

2601.01754 2026-06-01 cs.LG cs.CC cs.CL cs.FL

Context-Free Recognition with Transformers

Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

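AI summary Proves that looped transformers with O(log N) looping layers and O(N^6) padding symbols can recognize all context-free languages, and reduces the padding requirement to O(N^3) for the unambiguous subclass.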

Abstract

Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work has shown that $\mathcal{O}(\log(N))$ looping layers (w.r.t. input length $N$) allow transformers to recognize regular languages, but the question of context-free recognition with looped transformers remained open. In this work, we show that looped transformers with $\mathcal{O}(\log(N))$ looping layers and $\mathcal{O}(N^6)$ padding symbols can recognize all CFLs. However, training and inference with $\mathcal{O}(N^6)$ padding symbols is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(N^3)$ padding. Empirically, looped and padded transformers perform better than fixed-depth transformers in recognizing CFLs. Overall, our results shed light on the intricacy of CFL recognition by transformers: while general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

2405.07836 2026-06-01 cs.LG stat.ME

Forecasting with Hyper-Trees

Alexander März, Kashif Rasul

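AI summary Proposes the Hyper-Trees framework, which uses gradient boosted trees to learn the parameters of a target time series model (such as ARIMA or Exponential Smoothing), combining decision trees with classical forecasting models, and introduces a hybrid architecture to address scaling limitations in high-dimensional parameter estimation.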

Comments Gradient Boosted Trees, Hyper Models, Hybrid Models, Time Series Forecasting, Time-Varying Parameters

Abstract

We introduce Hyper-Trees as a novel framework for modeling time series data using gradient boosted trees. Unlike conventional tree-based approaches that forecast time series directly, Hyper-Trees learn the parameters of a target time series model, such as ARIMA or Exponential Smoothing, as functions of features. These parameters are then used by the target model to generate the final forecasts. Our framework combines the effectiveness of decision trees on tabular data with classical forecasting models, thereby inducing a time series inductive bias into tree-based models. To resolve the scaling limitations of boosted trees when estimating a high-dimensional set of target model parameters, we combine decision trees and neural networks within a unified framework. In this hybrid approach, the trees generate informative representations from the input features, which a shallow network then uses as input to learn the parameters of a time series model. With our research, we explore the effectiveness of Hyper-Trees across a range of forecasting tasks and extend tree-based modeling beyond its conventional use in time series analysis.

2602.06161 2026-06-01 cs.CL cs.AI

Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

Yanzheng Xiang, Lan Wei, Yizhen Yao, Qinglin Zhu, Hanqi Yan, Chen Jin, Philip Alexander Teare, Dandan Zhang, Lin Gui, Amrutha Saseendran, Yulan He

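AI summary To address the flip-flop oscillations caused by aggressive parallelism in parallel diffusion decoding, proposes COVER, which uses KV cache override and a stability-aware score to perform leave-one-out verification and stable drafting in a single forward pass, reducing unnecessary revisions and speeding up decoding.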

Abstract

Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.

2602.06055 2026-06-01 cs.CL

Are we chasing ghosts? Quantifying unattributable polarization, and attributing the rest to annotator groups

Dimitris Tsirmpas, John Pavlopoulos

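AI summary To address the difficulty of capturing systematic differences in opinion between annotator groups, proposes a new polarization attribution metric that avoids the problems of inherent polarization and cancellation of group effects, and finds that gender and race explain polarization patterns.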

Comments 19 pages, 7 tables, 9 figures

Abstract

Standard agreement metrics often fail to capture systematic differences in opinion between minority and majority-group annotators, jeopardizing tasks such as hate speech and toxicity detection. Polarization has recently been proposed as a more robust way of distinguishing minor disagreements from systematic differences in opinion, but existing approaches do not provide practical tools for attributing it to specific annotator groups. We evaluate current methods and identify two major limitations in realistic settings: (1) the presence of "inherent" polarization that cannot be attributed to any known or latent groups, and (2) opposing polarization effects canceling each other out in aggregated annotations. To address these issues, we introduce a new metric that measures and tests the statistical significance of polarization attribution for annotator groups while avoiding these limitations, as well as an open-source Python library implementation, finding that no more than 20 annotators are needed per comment for reliable estimation. We apply our method to four subjective NLP datasets and find that gender and race consistently explain polarization patterns, while differences between annotator groups become stronger as the groups are further apart.

2601.19791 2026-06-01 cs.LG stat.ML

To Grok Grokking: Provable Grokking in Ridge Regression

Mingyue Xu, Gal Vardi, Itay Safran

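AI summary Studies grokking in a classical ridge regression setting, proving that learning over-parameterized linear regression models by gradient descent with weight decay passes through three stages: overfitting, delayed generalization, and eventually arbitrarily small generalization error. It gives the first rigorous quantitative bounds on the generalization delay (the grokking time) and shows empirically that the bounds also capture grokking in non-linear neural networks.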

Abstract

We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
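
For orientation, the training procedure named in the abstract (gradient descent with weight decay on an over-parameterized linear regression) can be sketched in a few lines. This is a minimal illustration and not the authors' experiment: the data, step size, and weight-decay value are made up, and the run does not reproduce the three stages or the grokking-time bounds. It only shows that the iterates settle at a stationary point of the ridge objective (1/2n)||Xw - y||^2 + (lam/2)||w||^2, which is its unique minimizer because the objective is strongly convex.

```python
def ridge_gradient(X, y, w, lam):
    """Gradient of (1/2n)||Xw - y||^2 + (lam/2)||w||^2 with respect to w."""
    n = len(X)
    residuals = [sum(xj * wj for xj, wj in zip(row, w)) - yi
                 for row, yi in zip(X, y)]
    return [sum(r * row[j] for r, row in zip(residuals, X)) / n + lam * w[j]
            for j in range(len(w))]


def gd_weight_decay(X, y, lam, lr, steps):
    """Plain gradient descent from zero initialization; lam is the weight decay."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = ridge_gradient(X, y, w, lam)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical over-parameterized problem: 2 samples, 3 features.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
y = [1.0, -1.0]
lam = 0.1

w = gd_weight_decay(X, y, lam, lr=0.1, steps=2000)
grad_norm = max(abs(g) for g in ridge_gradient(X, y, w, lam))
print("weights:", w)
print("largest gradient entry at the end:", grad_norm)
```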

2602.05649 2026-06-01 cs.LG

End-to-End Compression for Tabular Foundation Models

Guri Zabërgja, Rafiq Kamel, Arlind Kadra, Christian M. M. Frey, Josif Grabocka

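AI summary Proposes TACO, an end-to-end tabular compression model that compresses the training data in a latent space to address the quadratic complexity of tabular transformers in inference time and memory; on the TabArena benchmark it achieves up to 94x speedup and up to 97% memory savings without significant performance degradation.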

Comments Accepted as Spotlight at ICML 2026

Abstract

The long-standing dominance of gradient-boosted decision trees for tabular data has recently been challenged by in-context learning tabular foundation models. In-context learning methods fit and predict in one forward pass without parameter updates by leveraging the training data as context for predicting on query test points. While recent tabular foundation models achieve state-of-the-art performance, their transformer architecture based on the attention mechanism has quadratic complexity regarding dataset size, which in turn increases the overhead on training and inference time, and limits the capacity of the models to handle large-scale datasets. In this work, we propose TACO, an end-to-end tabular compression model that compresses the training dataset in a latent space. We test our method on the TabArena benchmark, where our proposed method is up to 94x faster in inference time, while consuming up to 97% less memory compared to the state-of-the-art tabular transformer architecture, all while retaining performance without significant degradation. Lastly, our method not only scales better with increased dataset sizes, but it also achieves better performance compared to other baselines.

2512.14980 2026-06-01 cs.LG

Softly Constrained Denoisers for Diffusion Models Applied to Partial Differential Equations

Victor M. Yeom-Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun

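AI summary Proposes introducing PDE-based soft inductive biases into the denoiser of diffusion models, improving constraint compliance while retaining the flexibility to adapt under model misspecification.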

Comments 22 pages including appendix, 8 figures including appendix, preprint

Abstract

Diffusion models have become a powerful generative prior for solutions of partial differential equations (PDEs). Existing approaches enforce physical constraints either by adding the PDE residuals as loss regularizers or through inference-time adjustments. These methods bias the model away from the true data distribution, which is especially problematic when the governing PDE is misspecified. To circumvent these issues while making the most out of the PDE constraint, we introduce soft inductive biases into the denoiser architecture derived from the PDEs. We show that these softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, while maintaining enough flexibility to deviate from it in case of misspecification with respect to observed data.

2602.04737 2026-06-01 cs.LG

Rationality Measurement and Theory for Reinforcement Learning Agents

Kejiang Qian, Amos Storkey, Fengxiang He

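AI summary Proposes a suite of rationality measures and associated theory for evaluating how rationally reinforcement learning agents act in deployment, and decomposes the rational risk gap into two parts: environment shifts and the algorithm's generalisability.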

Abstract

This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy's actions against their rational counterparts, culminating over the trajectory in deployment, is defined to be expected rational risk; an empirical average version in training is also defined. Their difference, termed as rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm's generalisability in a dynamic environment. They are upper bounded by, respectively, (1) the $1$-Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits from regularisers (including layer normalisation, $\ell_2$ regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses. The code is available at https://github.com/EVIEHub/Rationality.

2506.05994 2026-06-01 cs.LG cs.AR cs.ET

RETENTION: Resource-Efficient Tree-Based Ensemble Model Acceleration with Content-Addressable Memory

Yi-Chun Liao, Chieh-Lin Tsai, Yuan-Hao Chang, Camélia Slimani, Jalil Boukhobza, Tei-Wei Kuo

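AI summary Proposes the RETENTION framework, which uses an iterative pruning algorithm and a tree mapping scheme to significantly reduce content-addressable memory capacity requirements, enabling resource-efficient acceleration of tree-based ensemble models.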

Comments Under review by IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems

Abstract

Although deep learning has demonstrated remarkable capability in learning from unstructured data, modern tree-based ensemble models remain superior in extracting relevant information and learning from structured datasets. While several efforts have been made to accelerate tree-based models, the inherent characteristics of the models pose significant challenges for conventional accelerators. Recent research leveraging content-addressable memory (CAM) offers a promising solution for accelerating tree-based models, yet existing designs suffer from excessive memory consumption and low utilization. This work addresses these challenges by introducing RETENTION, an end-to-end framework that significantly reduces CAM capacity requirement for tree-based model inference. We propose an iterative pruning algorithm with a novel pruning criterion tailored for bagging-based models (e.g., Random Forest), which minimizes model complexity while ensuring controlled accuracy degradation. Additionally, we present a tree mapping scheme that incorporates two innovative data placement strategies to alleviate the memory redundancy caused by the widespread use of don't care states in CAM. Experimental results show that implementing the tree mapping scheme alone reduces CAM capacity requirement by $1.46\times$ to $21.30\times$, while the full RETENTION framework achieves $4.35\times$ to $207.12\times$ reduction with less than 3% accuracy loss. These results demonstrate that RETENTION is highly effective in minimizing CAM resource demand, providing a resource-efficient direction for tree-based model acceleration.

2602.04107 2026-06-01 cs.LG cs.IT math.IT

Supervised Learning as Lossy Compression: Characterizing Generalization and Sample Complexity via Finite Blocklength Analysis

Kosuke Sugiyama, Masato Uchida

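AI summary By framing the learning problem as lossy compression and applying finite blocklength analysis, derives information-theoretic lower bounds on sample complexity and generalization error for a fixed randomized learning algorithm and its optimal sampling strategy, explicitly separating the degree of overfitting from the mismatch between inductive bias and task.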

Comments 40 pages, 1 figure

Abstract

This paper presents a novel information-theoretic perspective on generalization in machine learning by framing the learning problem within the context of lossy compression and applying finite blocklength analysis. In our approach, the sampling of training data formally corresponds to an encoding process, and the model construction to a decoding process. By leveraging finite blocklength analysis, we derive lower bounds on sample complexity and generalization error for a fixed randomized learning algorithm and its associated optimal sampling strategy. Our bounds explicitly characterize the degree of overfitting of the learning algorithm and the mismatch between its inductive bias and the task as distinct terms. This separation provides a significant advantage over existing frameworks. Additionally, we decompose the overfitting term to show its theoretical connection to existing metrics found in information-theoretic bounds and stability theory, unifying these perspectives under our proposed framework.

2602.04031 2026-06-01 cs.LG

The Illusion of Generalization in Tabular Language Models

Aditya Gorla, Ratish Puduppully

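AI summary A systematic evaluation of Tabula-8B on 165 datasets finds that its claimed generalization stems mainly from evaluation artifacts (such as data contamination and format familiarity) rather than genuine tabular reasoning.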

Journal ref
In Proc. 43rd International Conference on Machine Learning (ICML 2026)
Abstract

Tabular Language Models (TLMs) have been claimed to achieve strong generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, utilizing 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction-tuning without tabular exposure recovers 92.2% of standard classification performance and on quartile classification, format familiarity closes 71.3% of the gap with the residual attributable to contaminated datasets. These findings suggest claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.
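
The first finding is stated in terms of lift over a majority-class baseline. The sketch below shows one way to compute such a quantity, assuming lift means model accuracy minus the accuracy of always predicting the most frequent label; the paper may define it differently (for example as a ratio), and the function name and toy labels are hypothetical.

```python
from collections import Counter


def majority_class_lift(y_true, y_pred):
    """Accuracy of y_pred minus the accuracy of always predicting the majority label."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Count of the most frequent true label, divided by n, is the baseline accuracy.
    baseline = Counter(y_true).most_common(1)[0][1] / n
    return accuracy - baseline


# Toy labels: three positives and one negative.
y_true = [1, 1, 1, 0]
lift_constant = majority_class_lift(y_true, [1, 1, 1, 1])  # model that always says 1
lift_perfect = majority_class_lift(y_true, [1, 1, 1, 0])   # model that is always right
print(lift_constant, lift_perfect)
```

A model can therefore look accurate on an imbalanced dataset while its lift is zero, which is the kind of gap a near-zero median lift points to.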

2511.16940 2026-06-01 cs.CV cs.CR

MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu, Ruxin Feng, Wen Jun Tan, Wei Yang Bryan Lim

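AI summary To address the privacy risk of vision-language models identifying individuals by linking multimodal data through hierarchical chain-of-thought reasoning, proposes MultiPriv, the first benchmark to systematically evaluate individual-level privacy reasoning; it comprises the Privacy Perception and Reasoning framework, a bilingual multimodal dataset, and nine challenging tasks. An evaluation of over 50 open-source and commercial models finds that 60% of widely used VLMs can perform individual-level privacy reasoning with up to 80% accuracy.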

Abstract

Modern Vision-Language Models (VLMs) pose significant individual-level privacy risks by linking fragmented multimodal data to identifiable individuals through hierarchical chain-of-thought reasoning. However, existing privacy benchmarks remain structurally insufficient for this threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM's ability to infer and link distributed information to construct individual profiles. To address this gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a bilingual multimodal dataset with synthetic individual profiles, where identifiers, such as faces and names, are linked to sensitive attributes. This design enables nine challenging tasks spanning attribute detection, cross-image re-identification, and chained inference. We conduct a large-scale evaluation of over 50 open-source and commercial VLMs. In our controlled benchmark, 60% of widely used VLMs can perform individual-level privacy reasoning with up to 80% accuracy, suggesting a significant potential threat to personal privacy. The benchmark is available at https://github.com/CyberChangAn/MultiPriv-PII.

2602.03655 2026-06-01 cs.LG

Sequential Group Composition: A Window into the Mechanics of Deep Learning

Giovanni Luca Marchetti, Daniel Kunin, Adele Myers, Francisco Acosta, Nina Miolane

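AI summary Uses the sequential group composition task to study how neural networks learn structured operations, revealing how group structure, encoding statistics, and sequence length affect learning, and shows that deeper architectures substantially reduce the width requirement.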

Comments Accepted at ICML 2026

Abstract

How do neural networks trained over sequences acquire the ability to perform structured operations, such as arithmetic, geometric, and algorithmic computation? To gain insight into this question, we introduce the sequential group composition task. In this task, networks receive a sequence of elements from a finite group encoded in a real vector space and must predict their cumulative product. This task can be order-sensitive and cannot be solved by a linear model. Our analysis isolates the roles of the group structure, encoding statistics, and sequence length in shaping learning. We prove that two-layer networks from vanishing initialization learn this task one irreducible representation of the group at a time in an order determined by the Fourier statistics of the encoding. To perfectly learn the task, these networks require a hidden width exponential in the sequence length $k$. In contrast, we construct deeper architectures that exploit associativity to dramatically improve this scaling: recurrent neural networks can compose elements sequentially in $k$ steps, while multilayer networks can compose adjacent pairs in parallel in $\log k$ layers. Overall, the sequential group composition task offers a tractable window into the mechanics of deep learning.
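
The two composition schedules contrasted in the abstract can be illustrated without any neural network. The sketch below uses plain Python loops over permutations of three items (the group S3, chosen here only as a small non-commutative example); the function names are hypothetical and nothing here reflects the authors' architectures. It shows that composing adjacent pairs in rounds gives the same cumulative product as a left-to-right scan, thanks to associativity, while using about log2(k) rounds instead of k - 1 steps.

```python
def compose(p, q):
    """Group operation: apply permutation p first, then q (tuples of images)."""
    return tuple(q[i] for i in p)


def sequential_product(elems):
    """Left-to-right scan, k - 1 compositions in sequence (recurrent-style)."""
    acc = elems[0]
    for g in elems[1:]:
        acc = compose(acc, g)
    return acc


def parallel_product(elems):
    """Compose adjacent pairs round by round; returns (product, number of rounds)."""
    layer, rounds = list(elems), 0
    while len(layer) > 1:
        nxt = [compose(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:          # an unpaired last element carries over unchanged
            nxt.append(layer[-1])
        layer, rounds = nxt, rounds + 1
    return layer[0], rounds


a, b = (1, 0, 2), (0, 2, 1)
order_sensitive = compose(a, b) != compose(b, a)   # S3 is not commutative

seq = [(1, 0, 2), (0, 2, 1), (1, 2, 0), (2, 1, 0),
       (1, 0, 2), (0, 2, 1), (2, 0, 1), (1, 0, 2)]  # k = 8 elements
product_scan = sequential_product(seq)
product_tree, rounds = parallel_product(seq)
print(product_scan, product_tree, rounds)
```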

2602.03639 2026-06-01 cs.RO

Variance-Reduced Model Predictive Path Integral via Quadratic Model Approximation

Fabian Schramm, Franki Nguimatsia Tiofack, Nicolas Perrin-Gilbert, Marc Toussaint, Justin Carpentier

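AI summary Proposes a hybrid variance-reduced MPPI framework that decomposes the objective into a known approximate model and a residual term and uses a quadratic approximation to derive a closed-form prior, reducing variance and improving sample efficiency, with faster convergence and better performance across several tasks.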

Comments Accepted to Robotics: Science and Systems (RSS) 2026, Sydney, Australia

Abstract

Sampling-based controllers, such as Model Predictive Path Integral (MPPI) methods, offer substantial flexibility but often suffer from high variance and low sample efficiency. To address these challenges, we introduce a hybrid variance-reduced MPPI framework that integrates a prior model into the sampling process. Our key insight is to decompose the objective function into a known approximate model and a residual term. Since the residual captures only the discrepancy between the model and the objective, it typically exhibits a smaller magnitude and lower variance than the original objective. Although this principle applies to general modeling choices, we demonstrate that adopting a quadratic approximation enables the derivation of a closed-form, model-guided prior that effectively concentrates samples in informative regions. Crucially, the framework is agnostic to the source of geometric information, allowing the quadratic model to be constructed from exact derivatives, structural approximations (e.g., Gauss- or Quasi-Newton), or gradient-free randomized smoothing. We validate the approach on standard optimization benchmarks, a nonlinear, underactuated cart-pole control task, and a contact-rich manipulation problem with non-smooth dynamics. Across these domains, we achieve faster convergence and superior performance in low-sample regimes compared to standard MPPI. These results suggest that the method can make sample-based control strategies more practical in scenarios where obtaining samples is expensive or limited.
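
The decomposition idea can be illustrated with a small Python sketch. This is not MPPI and not the authors' method: it is a plain Monte Carlo expectation with a control-variate style split, under the assumption of a one-dimensional toy objective whose quadratic part has a known mean. The names `objective`, `quadratic_model`, and `MODEL_MEAN` are hypothetical.

```python
import math
import random
import statistics

def objective(x):
    """Toy objective: a quadratic plus a small bounded perturbation."""
    return x * x + 0.1 * math.sin(5.0 * x)

def quadratic_model(x):
    """Known approximate model of the objective."""
    return x * x

MODEL_MEAN = 1.0  # E[x^2] for x ~ N(0, 1), available in closed form

def plain_estimate(samples):
    """Estimate E[objective] directly from samples."""
    return statistics.fmean(objective(x) for x in samples)

def residual_estimate(samples):
    """Closed-form model mean plus a sampled estimate of the residual only."""
    return MODEL_MEAN + statistics.fmean(
        objective(x) - quadratic_model(x) for x in samples)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
raw = [objective(x) for x in samples]
res = [objective(x) - quadratic_model(x) for x in samples]

# The residual is bounded by 0.1 in magnitude, so its variance is at most 0.01,
# far below the variance of the raw objective values.
assert statistics.pvariance(res) <= 0.01
assert statistics.pvariance(res) < statistics.pvariance(raw)
print(plain_estimate(samples), residual_estimate(samples))
```

Because only the residual is sampled, the estimator's noise comes from the small discrepancy term rather than from the full objective, which is the intuition the abstract gives for the lower variance.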

2601.20789 2026-06-01 cs.CL cs.LG cs.SE

SERA: Soft-Verified Efficient Repository Agents

Ethan Shen, Daniel Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers

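AI summary Proposes SERA, which uses Soft Verified Generation (SVG) to train coding agents efficiently so that they adapt quickly to private codebases, achieving leading performance among open-source models at very low cost.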

Comments 21 main pages, 6 pages appendix

Abstract

Open-weight coding agents should hold a fundamental advantage over closed-source systems because they can specialize to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training have kept this advantage theoretical until now. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using Soft Verified Generation (SVG), we generate thousands of trajectories from any code repository, without requiring unit tests. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating 200,000+ synthetic trajectories. Using only supervised finetuning (SFT), SERA achieves leading results among fully open-source (open data, method, code) models while matching the performance of open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. We use our dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can adapt to private codebases. We release SERA as the first model in Ai2's Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.

2510.00845 2026-06-01 cs.LG cs.AI cs.CL

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Maxime Méloux, François Portet, Maxime Peyrard

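AI summary Examines circuit discovery in mechanistic interpretability as a statistical estimation problem, shows that the intrinsic variance of single-input scores in causal mediation analysis makes circuits unstable, systematically decomposes the sources of variance, and advocates more rigorous practice.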

Abstract

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

2602.02459 2026-06-01 cs.RO

TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments

Zhiyu Huang, Yun Zhang, Johnson Liu, Rui Song, Chen Tang, Jiaqi Ma

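AI summary Proposes TIC-VLA, which explicitly models reasoning latency, introduces a delayed semantic-control interface, and uses an asynchronous training pipeline to address the mismatch between reasoning and real-time control in vision-language-action models for dynamic environments; it outperforms prior models in simulation and on a real robot.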

Comments International Conference on Machine Learning (ICML) 2026

Abstract

Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/

2602.02220 2026-06-01 cs.CV cs.RO

LangMap: A Human-Verified Benchmark for Hierarchical Open-Vocabulary Goal Navigation

Bo Miao, Weijia Liu, Jun Luo, Lachlan Shinnick, Jian Liu, Thomas Hamilton-Smith, Yuhe Yang, Zijie Wu, Vanja Videnovic, Feras Dayoub, Anton van den Hengel

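AI summary Addresses the shortcomings of existing benchmarks for hierarchical semantic goal navigation by proposing the LangMap benchmark, which uses human-verified semantic annotations and a contrastive annotation protocol to support goal navigation tasks at four levels (scene, room, region, and instance), and introduces the PlaNaVid baseline.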

Abstract

Language-conditioned goal navigation (LGN) requires agents to locate user-specified targets without step-by-step guidance. However, existing benchmarks largely focus on category-level goals or rely on instance descriptions generated by vision-language models (VLMs), which often contain ambiguities and semantic errors, limiting systematic and reliable evaluation. We introduce HieraNav, an open-vocabulary LGN task with goals specified at four hierarchical semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), to our knowledge the first real-world 3D indoor navigation benchmark with human-verified semantic annotations to support tasks across all four goal levels. LangMap provides region labels and discriminative region and instance descriptions covering 414 object categories, produced through a rigorous contrastive annotation protocol comparing same-scene regions and instances, and contains over 18K tasks. Each target is paired with concise and detailed descriptions, enabling evaluation across instruction styles. Quantitative and qualitative analyses validate our annotation quality; notably, our instance descriptions outperform GOAT-Bench annotations by 23 percentage points in text-to-view matching. We further introduce PlaNaVid, a strong RGB-only baseline that combines Bounded Diverse Memory (BDM) with high-level planning to prime a reactive policy for multi-goal navigation. PlaNaVid achieves top-tier success rates without depth, 3D scene representations, or object masks. Further analysis shows that memory and richer context boost performance, while long-tailed categories, small objects, distant targets, and multi-goal completion remain open challenges. The benchmark is available at https://bo-miao.github.io/LangMap

2602.01914 2026-06-01 cs.LG

Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

Wenbo Pan, Zhichao Liu, Xianlong Wang, Haining Yu, Xiaohua Jia

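AI summary Proposes FlashTrace, which uses span-wise aggregation and a recursive attribution mechanism to achieve efficient and faithful multi-token attribution in long-context reasoning, with a speedup of over 130x.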

Comments Accepted as an Oral paper at ICML 2026. Code available at https://github.com/wbopan/flashtrace

Abstract

Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a target span of M tokens within a context of length N requires O(M*N) operations, making long-context attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi-token attribution method that employs span-wise aggregation to compute attribution over multi-token targets in a single pass, while maintaining faithfulness. Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long-context retrieval (RULER) and multi-step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over 130x speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop improves faithfulness by tracing importance through the reasoning chain.

2602.01553 2026-06-01 cs.LG cs.AI

Plain Transformers are Surprisingly Powerful Link Predictors

Quang Truong, Yu Song, Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang

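AI summary Proposes PENCIL, an encoder-only plain Transformer that replaces hand-crafted priors with attention over sampled local subgraphs; it retains the scalability of standard Transformers while implicitly generalizing a range of heuristics, yielding efficient and parameter-economical link prediction.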

Comments ICML'26

Abstract

Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, state-of-the-art pipelines often rely on explicit structural heuristics or memory-intensive node embeddings -- approaches that struggle to generalize or scale to massive graphs. Emerging Graph Transformers (GTs) offer a potential alternative but often incur significant overhead due to complex structural encodings, hindering their applications to large-scale link prediction. We challenge these sophisticated paradigms with PENCIL, an encoder-only plain Transformer that replaces hand-crafted priors with attention over sampled local subgraphs, retaining the scalability and hardware efficiency of standard Transformers. Through experimental and theoretical analysis, we show that PENCIL extracts richer structural signals than GNNs, implicitly generalizing a broad class of heuristics and subgraph-based expressivity. Empirically, PENCIL outperforms heuristic-informed GNNs and is far more parameter-efficient than ID-embedding--based alternatives, while remaining competitive across diverse benchmarks -- even without node features. Our results challenge the prevailing reliance on complex engineering techniques, demonstrating that simple design choices are potentially sufficient to achieve the same capabilities. Our code is publicly available at https://github.com/quang-truong/pencil.

2512.19673 2026-06-01 cs.LG cs.AI cs.CL

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

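AI summary Decomposes the LLM policy into internal layer policies and internal modular policies via the Transformer's residual stream and proposes Bottom-up Policy Optimization (BuPO), which rebuilds the LLM's reasoning foundation by optimizing internal layers in early stages; its effectiveness is validated on complex reasoning benchmarks.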

Comments Preprint. Our code is available at https://github.com/Trae1ounG/BuPO

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's residual stream. Our entropy analysis of internal policy reveals distinct patterns: (1) universally, internal policies evolve from high-entropy exploration in early layers to deterministic refinement in the top layers; and (2) Qwen exhibits an explicit progressive reasoning structure, contrasting with the abrupt convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's reasoning foundation from the bottom up by optimizing internal layers in early stages. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of BuPO.

2602.01399 2026-06-01 cs.LG cs.AI stat.ML

An Odd Estimator for Shapley Values

Fabian Fumagalli, Landon Butler, Justin Singh Kang, Kannan Ramchandran, R. Teal Witter

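AI summary Proves that the Shapley value depends only on the odd component of the set function and, building on this, proposes the OddSHAP estimator, which approximates efficiently via polynomial regression on the odd subspace and reaches state-of-the-art accuracy at larger sampling budgets.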

Comments Accepted to ICML 2026

Abstract

The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causal inference. However, its exact computation is generally intractable, necessitating efficient approximation methods. While the most effective and popular estimators leverage the paired sampling heuristic to reduce estimation error, the theoretical mechanism driving this improvement has remained opaque. In this work, we provide an elegant and fundamental justification for paired sampling: we prove that the Shapley value depends exclusively on the odd component of the set function, and that paired sampling orthogonalizes the regression objective to filter out the irrelevant even component. Leveraging this insight, we propose OddSHAP, a novel consistent estimator that performs polynomial regression solely on the odd subspace. By utilizing the Fourier basis to isolate this subspace and employing a proxy model to identify high-impact interactions, OddSHAP overcomes the combinatorial explosion of higher-order approximations. Through an extensive benchmark, we find that OddSHAP achieves state-of-the-art estimation accuracy at larger sampling budgets.
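
The central claim can be checked numerically on a small game. The sketch below is not OddSHAP; it only computes exact Shapley values by enumeration. It assumes that "odd" and "even" refer to the antisymmetric and symmetric parts of the set function under set complement, f_odd(S) = (f(S) - f(N \ S)) / 2 and f_even(S) = (f(S) + f(N \ S)) / 2. The abstract does not spell this definition out, so treat it as an assumption. The game and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def shapley(f, n):
    """Exact Shapley values of set function f over players 0..n-1."""
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = Fraction(0)
        for size in range(n):
            w = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
            for subset in combinations(others, size):
                S = frozenset(subset)
                total += w * (f(S | {i}) - f(S))  # weighted marginal contribution
        phi.append(total)
    return phi

def odd_part(f, n):
    """Part of f that flips sign when S is replaced by its complement."""
    full = frozenset(range(n))
    return lambda S: Fraction(f(S) - f(full - S), 2)

def even_part(f, n):
    """Part of f that is unchanged when S is replaced by its complement."""
    full = frozenset(range(n))
    return lambda S: Fraction(f(S) + f(full - S), 2)

weights = [1, 2, 3, 4]

def game(S):
    """Toy 4-player game with a nonlinear term and one pairwise interaction."""
    return sum(weights[j] for j in S) ** 2 + (3 if {0, 2} <= S else 0)

phi = shapley(game, 4)
assert phi == shapley(odd_part(game, 4), 4)            # odd part carries everything
assert shapley(even_part(game, 4), 4) == [0, 0, 0, 0]  # even part contributes nothing
print(phi)
```

Exact rational arithmetic via `Fraction` lets the equalities be asserted without a tolerance. Enumeration costs on the order of 2^n evaluations per player, so this is a check for tiny games, not an estimator.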

2602.01267 2026-06-01 cs.LG

Diving into Kronecker Adapters: Component Design Matters

Jiayu Bai, Danchen Yu, Zhenyu Liao, TianQi Hou, Feng Zhou, Robert C. Qiu, Zenan Ling

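AI summary Analyzes the dimensions and number of components in Kronecker adapters, proposes Component Designed Kronecker Adapters (CDKA), and provides parameter-budget-aware configuration guidelines and a training stabilization strategy; experiments demonstrate its effectiveness.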

Abstract

Kronecker adapters have emerged as a promising approach for fine-tuning large-scale models, enabling high-rank updates through tunable component structures. However, existing work largely treats the component structure as a fixed or heuristic design choice, leaving the dimensions and number of Kronecker components underexplored. In this paper, we identify component structure as a key factor governing the capacity of Kronecker adapters. We perform a fine-grained analysis of both the dimensions and number of Kronecker components. In particular, we show that the alignment between Kronecker adapters and full fine-tuning depends on component configurations. Guided by these insights, we propose Component Designed Kronecker Adapters (CDKA). We further provide parameter-budget-aware configuration guidelines and a tailored training stabilization strategy for practical deployment. Experiments across various architectures and modalities demonstrate the effectiveness of CDKA. Code is available at https://github.com/rainstonee/CDKA.
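
Why a Kronecker-structured update can be high-rank at a small parameter cost follows from the identity rank(A ⊗ B) = rank(A) · rank(B). The sketch below illustrates that identity only; it is not the CDKA parameterization, which the abstract does not specify. It assumes a single-component update of the form ΔW = A ⊗ B and compares it with a rank-1 outer product of the same parameter count. The helpers `kron` and `rank` are hypothetical and written in pure Python.

```python
from fractions import Fraction

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def rank(M):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Two full-rank 4x4 components: 32 parameters in total.
A = [[1, 2, 0, 0], [0, 1, 2, 0], [0, 0, 1, 2], [0, 0, 0, 1]]
B = [[2, 1, 0, 0], [0, 2, 1, 0], [0, 0, 2, 1], [0, 0, 0, 2]]
delta_kron = kron(A, B)                      # 16x16 update
assert rank(delta_kron) == rank(A) * rank(B) == 16

# A low-rank update u v^T on a 16x16 matrix also uses 32 parameters.
u = list(range(1, 17))
v = list(range(16, 0, -1))
delta_lowrank = [[ui * vj for vj in v] for ui in u]
assert rank(delta_lowrank) == 1
```

With the same 32 parameters, the Kronecker form reaches full rank on a 16x16 matrix while the outer-product form is rank 1. How component dimensions and the number of components should be chosen is the question the paper studies.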

2602.01186 2026-06-01 cs.LG cs.AI

The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics

Fabio Turazza, Marco Picone, Marco Mamei

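AI summary Proposes the Gaussian-Head OFL family: clients transmit only per-class counts and first- and second-order moments, and the server builds the model from three components (closed-form Gaussian heads, FisherMix, and Proto-Hyper), giving strictly data-free one-shot federated learning with state-of-the-art robustness and accuracy under strong non-IID skew.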

Comments Accepted at the International Conference on Learning Representations (ICLR) 2026 - Final Version

Abstract

Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with high communication costs and privacy risks from repeated model transmissions. In contrast, one-shot federated learning (OFL) alleviates these limitations by reducing communication to a single round, thereby lowering overhead and enhancing practical deployability. Nevertheless, most existing one-shot approaches remain either impractical or constrained: for example, they often depend on the availability of a public dataset, assume homogeneous client models, or require uploading additional data or model information. To overcome these issues, we introduce the Gaussian-Head OFL (GH-OFL) family, a suite of one-shot federated methods that assume class-conditional Gaussianity of pretrained embeddings. Clients transmit only sufficient statistics (per-class counts and first/second-order moments), and the server builds heads via three components: (i) closed-form Gaussian heads (NB/LDA/QDA) computed directly from the received statistics; (ii) FisherMix, a linear head with cosine margin trained on synthetic samples drawn in an estimated Fisher subspace; and (iii) Proto-Hyper, a lightweight low-rank residual head that refines Gaussian logits via knowledge distillation on those synthetic samples. In our experiments, GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.
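
The one-round protocol for the simplest closed-form head can be sketched as follows. This is a hedged illustration, not the GH-OFL implementation: it covers only a diagonal-covariance (naive Bayes style) Gaussian head, so the second-order moment reduces to per-dimension sums of squares, and it omits FisherMix and Proto-Hyper entirely. All function names and the toy embeddings are hypothetical.

```python
import math

def client_stats(features, labels):
    """Per-class count, sum, and sum of squares: all a client transmits."""
    stats = {}
    for x, y in zip(features, labels):
        n, s1, s2 = stats.get(y, (0, [0.0] * len(x), [0.0] * len(x)))
        stats[y] = (n + 1,
                    [a + v for a, v in zip(s1, x)],
                    [a + v * v for a, v in zip(s2, x)])
    return stats

def aggregate(all_stats):
    """Server side: sufficient statistics simply add across clients."""
    total = {}
    for stats in all_stats:
        for y, (n, s1, s2) in stats.items():
            if y in total:
                tn, t1, t2 = total[y]
                total[y] = (tn + n,
                            [a + b for a, b in zip(t1, s1)],
                            [a + b for a, b in zip(t2, s2)])
            else:
                total[y] = (n, list(s1), list(s2))
    return total

def gaussian_nb_head(total, eps=1e-6):
    """Closed-form head: log prior, mean, and variance for each class."""
    n_all = sum(n for n, _, _ in total.values())
    head = {}
    for y, (n, s1, s2) in total.items():
        mean = [s / n for s in s1]
        var = [max(q / n - m * m, 0.0) + eps for q, m in zip(s2, mean)]
        head[y] = (math.log(n / n_all), mean, var)
    return head

def predict(head, x):
    """Return the class with the highest Gaussian log posterior (up to a constant)."""
    def logit(params):
        log_prior, mean, var = params
        return log_prior - 0.5 * sum(
            math.log(2 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mean, var))
    return max(head, key=lambda y: logit(head[y]))

# Extreme non-IID split: each client holds a single class.
client_a = client_stats([(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)], [0, 0, 0])
client_b = client_stats([(5.0, 5.1), (4.9, 4.8), (5.2, 5.0)], [1, 1, 1])
head = gaussian_nb_head(aggregate([client_a, client_b]))
assert predict(head, (0.1, 0.0)) == 0
assert predict(head, (5.0, 4.9)) == 1
```

Because counts and moment sums are additive, the server recovers the same class means and variances it would compute on the pooled data, regardless of how classes are distributed across clients. No raw samples leave the clients in this sketch.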

2602.00521 2026-06-01 cs.AI

Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim

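AI summary Proposes a two-phase diagnostic framework grounded in Item Response Theory (IRT) that assesses the reliability of LLM-as-a-Judge along two dimensions, intrinsic consistency and human alignment, and provides interpretable diagnostic signals.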

Comments Accepted ICML 2026

Abstract

While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) from IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for systematically diagnosing judgments. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
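
For readers unfamiliar with the Graded Response Model, the standard (Samejima) form for one item with K ordered score categories can be sketched as below. This shows only the textbook model; how the paper maps judges, prompts, and evaluated samples onto latent traits and items is not stated in the abstract, so the function name and example parameters are hypothetical.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Samejima's graded response model for a single item.

    theta: latent trait value; a: discrimination (> 0);
    thresholds: increasing b_1 < ... < b_{K-1} for K ordered categories.
    Returns [P(Y = 0), ..., P(Y = K-1)].
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    # Cumulative probabilities P(Y >= k), with P(Y >= 0) = 1 and P(Y >= K) = 0.
    cum = [1.0] + [sigmoid(a * (theta - b)) for b in thresholds] + [0.0]
    # Category probabilities are differences of adjacent cumulative terms.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

low = grm_category_probs(-1.0, 1.7, [-1.0, 0.0, 1.2])
high = grm_category_probs(2.0, 1.7, [-1.0, 0.0, 1.2])
assert abs(sum(low) - 1.0) < 1e-12 and abs(sum(high) - 1.0) < 1e-12
assert high[-1] > low[-1]  # a higher trait shifts mass to the top category
```

The discrimination parameter controls how sharply the category probabilities change with the latent trait, and the thresholds locate the boundaries between adjacent score levels.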

2601.13433 2026-06-01 cs.CL cs.LG

Who Endorsed It? Measuring Authority Bias Across Expertise Levels in Language Models

Priyanka Mary Mammen, Emil Joswin, Shankar Venkitachalam

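AI summary Studies whether language models show systematic bias on reasoning tasks depending on the expertise level of the endorsement source, finding that models are more susceptible to incorrect endorsements from high-authority sources, with lower accuracy and higher confidence in wrong answers, and that the bias can be reduced through mechanistic intervention.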

Abstract

Prior research demonstrates that the performance of language models on reasoning tasks can be influenced by suggestions, hints, and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect/misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and that a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.