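arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other areas.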

2602.07285 2026-06-01 cs.LG

Fair Decisions from Calibrated Scores: Achieving Optimal Classification While Satisfying Sufficiency

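Etam Benger, Katrina Ligett

AI summary: For binary classification under the sufficiency fairness constraint, this paper proposes a post-processing algorithm based on group-calibrated scores that attains the optimal randomized classifier, and gives a geometric characterization of the feasible pairs of positive predictive value and false omission rate.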

Comments Accepted to ICML 2026

Abstract

Binary classification based on predicted probabilities (scores) is a fundamental task in supervised machine learning. While thresholding scores is Bayes-optimal in the unconstrained setting, using a single threshold generally violates statistical group fairness constraints. Under independence (statistical parity) and separation (equalized odds), such thresholding suffices when the scores already satisfy the corresponding criterion. However, this does not extend to sufficiency: even perfectly group-calibrated scores -- including true class probabilities -- violate predictive parity after thresholding. In this work, we present an exact solution for optimal binary (randomized) classification under sufficiency, assuming finite sets of group-calibrated scores. We provide a geometric characterization of the feasible pairs of positive predictive value (PPV) and false omission rate (FOR) achievable by such classifiers, and use it to derive a simple post-processing algorithm that attains the optimal classifier using only group-calibrated scores and group membership. Finally, since sufficiency and separation are generally incompatible, we identify the classifier that minimizes deviation from separation subject to sufficiency, and show that it can also be obtained by our algorithm, often achieving performance comparable to the optimum.
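
The claim that thresholding perfectly group-calibrated scores can still violate predictive parity can be checked on a toy case. The sketch below is only an illustration under assumed inputs: the two score distributions and the threshold are hypothetical and not taken from the paper, and `ppv_and_for` is a helper introduced here, not the authors' algorithm.

```python
from fractions import Fraction as F

def ppv_and_for(score_dist, threshold):
    """Return (PPV, FOR) for the rule 'predict 1 iff score >= threshold'.

    score_dist maps each calibrated score s to its probability mass within
    one group; calibration means P(Y=1 | score=s, group) = s.
    """
    accepted = {s: p for s, p in score_dist.items() if s >= threshold}
    rejected = {s: p for s, p in score_dist.items() if s < threshold}
    ppv = sum(s * p for s, p in accepted.items()) / sum(accepted.values())
    f_o_r = sum(s * p for s, p in rejected.items()) / sum(rejected.values())
    return ppv, f_o_r

# Hypothetical score distributions, both perfectly calibrated by construction.
group_a = {F(1, 5): F(1, 2), F(4, 5): F(1, 2)}
group_b = {F(2, 5): F(1, 2), F(3, 5): F(1, 2)}

threshold = F(1, 2)
print(ppv_and_for(group_a, threshold))  # (Fraction(4, 5), Fraction(1, 5))
print(ppv_and_for(group_b, threshold))  # (Fraction(3, 5), Fraction(2, 5))
```

With the same threshold, the two groups end up with different PPV and FOR, so sufficiency fails even though each score equals the true class probability within its group.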

2510.10544 2026-06-01 cs.LG cs.AI stat.ML

PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

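Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata

AI summary: Proposes a new PAC-Bayesian generalization bound that explicitly accounts for Markovian dependence in the data through the chain's mixing time, and builds on it to design PB-SAC, an algorithm that optimizes the bound to guide exploration, providing meaningful confidence certificates on continuous control tasks while maintaining competitive performance.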

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

2602.06902 2026-06-01 cs.LG stat.ML

Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory

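Hao Qiu, Andrew Jacobsen, Emmanuel Esposito, Mengxiao Zhang

AI summary: Proposes a new algorithm that achieves the first comparator-adaptive dynamic regret bound for online convex optimization with time-varying movement costs, and applies it to problems with delayed feedback and time-varying memory.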

Comments 28 pages; v2: ICML 2026

Abstract

In this paper, we study dynamic regret in unconstrained online convex optimization (OCO) with movement costs. Specifically, we generalize the standard setting by allowing the movement cost coefficients $λ_t$ to vary arbitrarily over time. Our main contribution is a novel algorithm that establishes the first comparator-adaptive dynamic regret bound for this setting, guaranteeing $\widetilde{\mathcal{O}}(\sqrt{(M^2+MP_T)(T+\sum_t λ_t)})$ regret, where $P_T$ is the path length of the comparator sequence over $T$ rounds and $M$ is the maximal comparator norm. Our result recovers the optimal adaptive rates for both static and dynamic regret in OCO as the special case where $λ_t=0$ for all rounds. To demonstrate the versatility of our results, we consider two applications: OCO with delayed feedback and OCO with time-varying memory. We show that both problems can be translated into time-varying movement costs, establishing a novel reduction specifically for the delayed feedback setting that is of independent interest. A crucial observation is that the first-order dependence on movement costs in our regret bound plays a key role in enabling optimal comparator-adaptive dynamic regret guarantees in both settings.
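
To make the quantities in the stated rate concrete, here is a small sketch that evaluates $M$, $P_T$, and $\sqrt{(M^2+MP_T)(T+\sum_t λ_t)}$ for a given comparator sequence. It assumes the Euclidean norm and the usual path-length convention $P_T=\sum_{t\ge 2}\|u_t-u_{t-1}\|$, which the abstract does not spell out, and it ignores the logarithmic factors hidden by $\widetilde{\mathcal{O}}$. The function name and inputs are hypothetical.

```python
import math

def bound_quantities(comparators, movement_costs):
    """Return (M, P_T, rate) for a comparator sequence u_1..u_T.

    comparators: list of equal-length tuples; movement_costs: list of lambda_t.
    """
    M = max(math.hypot(*u) for u in comparators)  # largest comparator norm
    # Path length: total Euclidean movement between consecutive comparators.
    P_T = sum(math.dist(u, v) for u, v in zip(comparators, comparators[1:]))
    T = len(comparators)
    rate = math.sqrt((M**2 + M * P_T) * (T + sum(movement_costs)))
    return M, P_T, rate

# Static comparator with no movement cost: the rate reduces to M * sqrt(T).
static = bound_quantities([(3.0, 4.0)] * 4, [0.0] * 4)
# A comparator that jumps once to the origin, again with lambda_t = 0.
moving = bound_quantities([(3.0, 4.0), (3.0, 4.0), (0.0, 0.0), (0.0, 0.0)], [0.0] * 4)
```

In the static case the path length vanishes and the expression collapses to $M\sqrt{T}$, consistent with the abstract's remark that $λ_t=0$ recovers the static and dynamic rates.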

2602.00942 2026-06-01 cs.LG

SALAAD: Sparse And Low-Rank Adaptation via ADMM for Large Language Model Inference

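Hao Ma, Melis Ilayda Bal, Liang Zhang, Bingcong Li, Niao He, Melanie Zeilinger, Michael Muehlebach

AI summary: Proposes SALAAD, a framework that induces sparse and low-rank structure during training via an augmented Lagrangian method, enabling flexible control of model capacity and reducing deployment memory without retraining.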

Abstract

Modern large language models are increasingly deployed under compute and memory constraints, making flexible control of model capacity a central challenge. While sparse and low-rank structures naturally trade off capacity and performance, existing approaches often rely on heuristic designs that ignore layer and matrix heterogeneity or require model-specific architectural modifications. We propose SALAAD, a plug-and-play framework applicable to different model architectures that induces sparse and low-rank structures during training. By formulating structured weight learning under an augmented Lagrangian framework and introducing an adaptive controller that dynamically balances the training loss and structural constraints, SALAAD preserves the stability of standard training dynamics while enabling explicit control over the evolution of effective model capacity during training. Experiments across model scales show that SALAAD substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Moreover, a single training run yields a continuous spectrum of model capacities, enabling smooth and elastic deployment across diverse memory budgets without the need for retraining.

2601.01754 2026-06-01 cs.LG cs.CC cs.CL cs.FL

Context-Free Recognition with Transformers

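Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

AI summary: Proves that looped transformers with O(log N) looping layers and O(N^6) padding symbols can recognize all context-free languages, and reduces the padding requirement to O(N^3) for the unambiguous subclass.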

Abstract

Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work has shown that $\mathcal{O}(\log(N))$ looping layers (w.r.t. input length $N$) allow transformers to recognize regular languages, but the question of context-free recognition with looped transformers remained open. In this work, we show that looped transformers with $\mathcal{O}(\log(N))$ looping layers and $\mathcal{O}(N^6)$ padding symbols can recognize all CFLs. However, training and inference with $\mathcal{O}(N^6)$ padding symbols is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(N^3)$ padding. Empirically, looped and padded transformers perform better than fixed-depth transformers in recognizing CFLs. Overall, our results shed light on the intricacy of CFL recognition by transformers: while general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.
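
For readers unfamiliar with what "recognizing a CFL" involves, the classical sequential baseline is the CYK dynamic program, which runs in cubic time for a grammar in Chomsky normal form. The sketch below is that textbook algorithm, not the paper's transformer construction; the grammar for $\{a^n b^n : n \ge 1\}$ and all names are chosen here for illustration.

```python
def cyk_recognize(word, unary, binary, start="S"):
    """CYK recognition for a grammar in Chomsky normal form.

    unary: list of (lhs, terminal); binary: list of (lhs, (left, right)).
    """
    n = len(word)
    if n == 0:
        return False  # this sketch does not handle the empty word
    # table[i][l - 1] holds the nonterminals deriving word[i:i + l].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {lhs for lhs, term in unary if term == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in binary:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

# CNF grammar for a^n b^n (n >= 1): S -> A B | A X, X -> S B, A -> a, B -> b.
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "X")), ("X", ("S", "B"))]

print(cyk_recognize("aabb", UNARY, BINARY))  # True
print(cyk_recognize("aab", UNARY, BINARY))   # False
```

The three nested loops over span length, start position, and split point are the source of the cubic cost in the input length.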

2405.07836 2026-06-01 cs.LG stat.ME

Forecasting with Hyper-Trees

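Alexander März, Kashif Rasul

AI summary: Proposes the Hyper-Trees framework, which uses gradient boosted trees to learn the parameters of a target time series model (such as ARIMA or exponential smoothing), combining decision trees with classical forecasting models, and introduces a hybrid architecture to address the scaling limitations of high-dimensional parameter estimation.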

Comments Gradient Boosted Trees, Hyper Models, Hybrid Models, Time Series Forecasting, Time-Varying Parameters

Abstract

We introduce Hyper-Trees as a novel framework for modeling time series data using gradient boosted trees. Unlike conventional tree-based approaches that forecast time series directly, Hyper-Trees learn the parameters of a target time series model, such as ARIMA or Exponential Smoothing, as functions of features. These parameters are then used by the target model to generate the final forecasts. Our framework combines the effectiveness of decision trees on tabular data with classical forecasting models, thereby inducing a time series inductive bias into tree-based models. To resolve the scaling limitations of boosted trees when estimating a high-dimensional set of target model parameters, we combine decision trees and neural networks within a unified framework. In this hybrid approach, the trees generate informative representations from the input features, which a shallow network then uses as input to learn the parameters of a time series model. With our research, we explore the effectiveness of Hyper-Trees across a range of forecasting tasks and extend tree-based modeling beyond its conventional use in time series analysis.
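
The core idea, a tree that outputs model parameters rather than forecasts, can be sketched in a few lines. This is a deliberately simplified illustration: `stump_phi` is a hypothetical single-split stand-in for a boosted tree ensemble, the target model is a plain AR(1) recursion, and none of the names or values come from the paper.

```python
def stump_phi(feature):
    """Hypothetical one-split 'tree' mapping a feature to an AR(1) coefficient."""
    return 0.9 if feature > 0.5 else 0.2

def ar1_forecast(last_value, phi, horizon):
    """Iterate y_{t+1} = phi * y_t to produce `horizon` point forecasts."""
    forecasts = []
    value = last_value
    for _ in range(horizon):
        value = phi * value
        forecasts.append(value)
    return forecasts

# The feature decides the parameter; the time series model makes the forecast.
persistent = ar1_forecast(10.0, stump_phi(0.7), horizon=2)  # phi = 0.9
fast_decay = ar1_forecast(10.0, stump_phi(0.1), horizon=2)  # phi = 0.2
```

Because the forecast always passes through the target model's recursion, the output inherits its time series structure regardless of how the parameter was predicted.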

2602.06161 2026-06-01 cs.CL cs.AI

Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

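Yanzheng Xiang, Lan Wei, Yizhen Yao, Qinglin Zhu, Hanqi Yan, Chen Jin, Philip Alexander Teare, Dandan Zhang, Lin Gui, Amrutha Saseendran, Yulan He

AI summary: To address the flip-flop oscillations caused by aggressive parallelism in parallel diffusion decoding, proposes COVER, which uses KV cache override and stability-aware scoring to perform leave-one-out verification and stable drafting in a single forward pass, reducing unnecessary revisions and accelerating decoding.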

Abstract

Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.

2602.06055 2026-06-01 cs.CL

Are we chasing ghosts? Quantifying unattributable polarization, and attributing the rest to annotator groups

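Dimitris Tsirmpas, John Pavlopoulos

AI summary: To address the difficulty of capturing systematic differences in opinion between annotator groups, proposes a new metric for polarization attribution that avoids the problems of inherent polarization and cancellation of group effects, and confirms that gender and race explain polarization patterns.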

Comments 19 pages, 7 tables, 9 figures

Abstract

Standard agreement metrics often fail to capture systematic differences in opinion between minority and majority-group annotators, jeopardizing tasks such as hate speech and toxicity detection. Polarization has recently been proposed as a more robust way of distinguishing minor disagreements from systematic differences in opinion, but existing approaches do not provide practical tools for attributing it to specific annotator groups. We evaluate current methods and identify two major limitations in realistic settings: (1) the presence of ``inherent'' polarization that cannot be attributed to any known or latent groups, and (2) opposing polarization effects canceling each other out in aggregated annotations. To address these issues, we introduce a new metric that measures and tests the statistical significance of polarization attribution for annotator groups while avoiding these limitations, as well as an open-source Python library implementation, finding that no more than 20 annotators are needed per comment for reliable estimation. We apply our method to four subjective NLP datasets and find that gender and race consistently explain polarization patterns, while differences between annotator groups become stronger as the groups are further apart.

2601.19791 2026-06-01 cs.LG stat.ML

To Grok Grokking: Provable Grokking in Ridge Regression

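Mingyue Xu, Gal Vardi, Itay Safran

AI summary: Studies grokking in a classical ridge regression setting, proving that learning an over-parameterized linear regression model by gradient descent with weight decay passes through three stages (overfitting, delayed generalization, and eventually arbitrarily small generalization error), gives the first rigorous quantitative bounds on the generalization delay (the grokking time), and shows empirically that the bounds also capture grokking in non-linear neural networks.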

Abstract

We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
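
The link between "gradient descent with weight decay" and "ridge regression" can be seen numerically: on the squared loss, the weight-decay iterates converge to the ridge solution. The sketch below checks only that link on synthetic data and does not reproduce the paper's grokking-time bounds; the loss scaling $\frac{1}{2n}\|Xw-y\|^2+\frac{λ}{2}\|w\|^2$, the problem sizes, and the step size are assumptions made here.

```python
import numpy as np

def gd_with_weight_decay(X, y, lam, lr, steps):
    """Gradient descent on (1/2n)||Xw - y||^2 + (lam/2)||w||^2 from w = 0."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + lam * w  # data gradient + weight decay
        w -= lr * grad
    return w

def ridge_closed_form(X, y, lam):
    """Minimizer of the same objective, obtained by solving the normal equations."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.1            # over-parameterized: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
hessian = X.T @ X / n + lam * np.eye(d)
lr = 1.0 / np.linalg.eigvalsh(hessian).max()  # step size that guarantees convergence
w_gd = gd_with_weight_decay(X, y, lam, lr, steps=5000)
w_ridge = ridge_closed_form(X, y, lam)
```

Shrinking `lam` slows convergence along weakly regularized directions, which is the kind of hyperparameter dependence the abstract's "grokking time" bounds quantify.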

2602.05649 2026-06-01 cs.LG

End-to-End Compression for Tabular Foundation Models

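Guri Zabërgja, Rafiq Kamel, Arlind Kadra, Christian M. M. Frey, Josif Grabocka

AI summary: Proposes TACO, an end-to-end tabular compression model that compresses the training data in a latent space to address the quadratic complexity of tabular transformers in inference time and memory, achieving up to 94x speedup and up to 97% memory savings on the TabArena benchmark without significant performance degradation.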

Comments Accepted as Spotlight at ICML 2026

Abstract

The long-standing dominance of gradient-boosted decision trees for tabular data has recently been challenged by in-context learning tabular foundation models. In-context learning methods fit and predict in one forward pass without parameter updates by leveraging the training data as context for predicting on query test points. While recent tabular foundation models achieve state-of-the-art performance, their transformer architecture based on the attention mechanism has quadratic complexity regarding dataset size, which in turn increases the overhead on training and inference time, and limits the capacity of the models to handle large-scale datasets. In this work, we propose TACO, an end-to-end tabular compression model that compresses the training dataset in a latent space. We test our method on the TabArena benchmark, where our proposed method is up to 94x faster in inference time, while consuming up to 97\% less memory compared to the state-of-the-art tabular transformer architecture, all while retaining performance without significant degradation. Lastly, our method not only scales better with increased dataset sizes, but it also achieves better performance compared to other baselines.

2512.14980 2026-06-01 cs.LG

Softly Constrained Denoisers for Diffusion Models Applied to Partial Differential Equations

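Victor M. Yeom-Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun

AI summary: Proposes introducing a PDE-based soft inductive bias into the denoiser of a diffusion model, improving constraint adherence while retaining adaptability to model misspecification.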

Comments 22 pages including appendix, 8 figures including appendix, preprint

2602.04737 2026-06-01 cs.LG

Rationality Measurement and Theory for Reinforcement Learning Agents

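Kejiang Qian, Amos Storkey, Fengxiang He

AI summary: Proposes a suite of rationality measures and associated theory for evaluating how rationally reinforcement learning agents behave in deployment, and decomposes the rational risk gap into two parts: environment shift and the algorithm's generalizability.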

Abstract

This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy's actions against their rational counterparts, culminating over the trajectory in deployment, is defined to be expected rational risk; an empirical average version in training is also defined. Their difference, termed as rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm's generalisability in a dynamic environment. They are upper bounded by, respectively, (1) the $1$-Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits from regularisers (including layer normalisation, $\ell_2$ regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses. The code is available at https://github.com/EVIEHub/Rationality.

2506.05994 2026-06-01 cs.LG cs.AR cs.ET

RETENTION: Resource-Efficient Tree-Based Ensemble Model Acceleration with Content-Addressable Memory

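Yi-Chun Liao, Chieh-Lin Tsai, Yuan-Hao Chang, Camélia Slimani, Jalil Boukhobza, Tei-Wei Kuo

AI summary: Proposes the RETENTION framework, which uses an iterative pruning algorithm and a tree mapping scheme to significantly reduce content-addressable memory capacity requirements, enabling resource-efficient acceleration of tree-based ensemble models.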

Comments Under review by IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems

DOI: 10.1109/TCAD.2026.3691288

Abstract

Although deep learning has demonstrated remarkable capability in learning from unstructured data, modern tree-based ensemble models remain superior in extracting relevant information and learning from structured datasets. While several efforts have been made to accelerate tree-based models, the inherent characteristics of the models pose significant challenges for conventional accelerators. Recent research leveraging content-addressable memory (CAM) offers a promising solution for accelerating tree-based models, yet existing designs suffer from excessive memory consumption and low utilization. This work addresses these challenges by introducing RETENTION, an end-to-end framework that significantly reduces CAM capacity requirement for tree-based model inference. We propose an iterative pruning algorithm with a novel pruning criterion tailored for bagging-based models (e.g., Random Forest), which minimizes model complexity while ensuring controlled accuracy degradation. Additionally, we present a tree mapping scheme that incorporates two innovative data placement strategies to alleviate the memory redundancy caused by the widespread use of don't care states in CAM. Experimental results show that implementing the tree mapping scheme alone reduces CAM capacity requirement by $1.46\times$ to $21.30 \times$, while the full RETENTION framework achieves $4.35\times$ to $207.12\times$ reduction with less than 3\% accuracy loss. These results demonstrate that RETENTION is highly effective in minimizing CAM resource demand, providing a resource-efficient direction for tree-based model acceleration.
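
The role of don't care states is easier to see with a toy ternary CAM. In the sketch below each root-to-leaf path of a small decision tree is stored as one ternary row, and features that a path never tests become `x` cells. This is a generic software model written for illustration, with a hypothetical three-feature tree; it is not the RETENTION mapping scheme or any real CAM interface.

```python
def tcam_match(rows, key):
    """Return the label of the first stored row matching `key`.

    Each row is (pattern, label); an 'x' cell matches either bit of the key.
    """
    for pattern, label in rows:
        if all(p in ("x", k) for p, k in zip(pattern, key)):
            return label
    return None

def dont_care_fraction(rows):
    """Share of stored cells that are don't-care, a rough proxy for redundancy."""
    cells = "".join(pattern for pattern, _ in rows)
    return cells.count("x") / len(cells)

# Hypothetical depth-2 tree over binary features (f0, f1, f2):
# if f0 == 0, the tree tests f1; if f0 == 1, it tests f2.
ROWS = [("00x", "A"), ("01x", "B"), ("1x0", "C"), ("1x1", "D")]

print(tcam_match(ROWS, "010"))   # B
print(tcam_match(ROWS, "101"))   # D
print(dont_care_fraction(ROWS))  # 0.3333333333333333
```

Even in this four-leaf example a third of the stored cells carry no information, which hints at why placement strategies that reduce such cells can shrink the required capacity.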

2602.04107 2026-06-01 cs.LG cs.IT math.IT

Supervised Learning as Lossy Compression: Characterizing Generalization and Sample Complexity via Finite Blocklength Analysis

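Kosuke Sugiyama, Masato Uchida

AI summary: By framing the learning problem as lossy compression and applying finite blocklength analysis, derives information-theoretic lower bounds on sample complexity and generalization error for a fixed randomized learning algorithm and its optimal sampling strategy, explicitly separating the degree of overfitting from the mismatch between inductive bias and task.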

Comments 40 pages, 1 figure

Abstract

This paper presents a novel information-theoretic perspective on generalization in machine learning by framing the learning problem within the context of lossy compression and applying finite blocklength analysis. In our approach, the sampling of training data formally corresponds to an encoding process, and the model construction to a decoding process. By leveraging finite blocklength analysis, we derive lower bounds on sample complexity and generalization error for a fixed randomized learning algorithm and its associated optimal sampling strategy. Our bounds explicitly characterize the degree of overfitting of the learning algorithm and the mismatch between its inductive bias and the task as distinct terms. This separation provides a significant advantage over existing frameworks. Additionally, we decompose the overfitting term to show its theoretical connection to existing metrics found in information-theoretic bounds and stability theory, unifying these perspectives under our proposed framework.

2602.04031 2026-06-01 cs.LG

The Illusion of Generalization in Tabular Language Models

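Aditya Gorla, Ratish Puduppully

AI summary: A systematic evaluation of Tabula-8B on 165 datasets finds that its claimed generalization stems mainly from evaluation artifacts (such as data contamination and format familiarity) rather than genuine tabular reasoning.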

Journal ref: In Proc. 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Tabular Language Models (TLMs) have been claimed to achieve strong generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, utilizing 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction-tuning without tabular exposure recovers 92.2% of standard classification performance and on quartile classification, format familiarity closes 71.3% of the gap with the residual attributable to contaminated datasets. These findings suggest claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.
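
The first finding rests on comparing a model against the majority-class baseline. A minimal sketch of that comparison follows, assuming "lift" means accuracy minus majority-class accuracy; the paper may define it differently, and the tiny label lists are made up for illustration.

```python
from collections import Counter
from statistics import median

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def lift_over_majority(y_true, y_pred):
    """Model accuracy minus the majority-class baseline accuracy."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy - majority_baseline_accuracy(y_true)

# Hypothetical per-dataset results: (true labels, model predictions).
datasets = [
    ([1, 1, 1, 0], [1, 1, 1, 1]),  # model just predicts the majority class
    ([0, 1, 0, 1], [0, 1, 0, 0]),  # model beats the baseline
    ([1, 1, 0, 1], [1, 1, 1, 1]),  # majority prediction again
]
lifts = [lift_over_majority(t, p) for t, p in datasets]
print(lifts)          # [0.0, 0.25, 0.0]
print(median(lifts))  # 0.0
```

A model can post respectable raw accuracy on imbalanced datasets while its median lift stays at zero, which is why the baseline comparison matters.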

2511.16940 2026-06-01 cs.CV cs.CR

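Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Maxime Méloux, François Portet, Maxime Peyrard

AI summary: Examines circuit discovery in mechanistic interpretability from a statistical estimation perspective, shows that the intrinsic variance of single-input scores in causal mediation analysis makes circuits unstable, systematically decomposes the sources of variance, and advocates more rigorous practice.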

Abstract

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

2602.02459 2026-06-01 cs.RO

TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments

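Zhiyu Huang, Yun Zhang, Johnson Liu, Rui Song, Chen Tang, Jiaqi Ma

AI summary: Proposes TIC-VLA, which explicitly models reasoning latency and introduces a delayed semantic-control interface, combined with an asynchronous training pipeline, to address the asynchrony between reasoning and real-time control in vision-language-action models in dynamic environments, outperforming prior models in simulation and on a real robot.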

Comments International Conference on Machine Learning (ICML) 2026

Abstract

Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/

2602.02220 2026-06-01 cs.CV cs.RO

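Delving into Kronecker Adapters: Component Design Matters

Jiayu Bai, Danchen Yu, Zhenyu Liao, TianQi Hou, Feng Zhou, Robert C. Qiu, Zenan Ling

AI summary: Analyzes the dimensions and number of components in Kronecker adapters, proposes Component Designed Kronecker Adapters (CDKA), and provides parameter-budget-aware configuration guidelines and a training stabilization strategy; experiments demonstrate its effectiveness.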

Abstract

Kronecker adapters have emerged as a promising approach for fine-tuning large-scale models, enabling high-rank updates through tunable component structures. However, existing work largely treats the component structure as a fixed or heuristic design choice, leaving the dimensions and number of Kronecker components underexplored. In this paper, we identify component structure as a key factor governing the capacity of Kronecker adapters. We perform a fine-grained analysis of both the dimensions and number of Kronecker components. In particular, we show that the alignment between Kronecker adapters and full fine-tuning depends on component configurations. Guided by these insights, we propose Component Designed Kronecker Adapters (CDKA). We further provide parameter-budget-aware configuration guidelines and a tailored training stabilization strategy for practical deployment. Experiments across various architectures and modalities demonstrate the effectiveness of CDKA. Code is available at https://github.com/rainstonee/CDKA.
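
The "high-rank updates" property follows from the identity $\operatorname{rank}(A\otimes B)=\operatorname{rank}(A)\operatorname{rank}(B)$. The sketch below checks it numerically for a single Kronecker component; the factor sizes are arbitrary choices made here, and this is not the CDKA configuration or training procedure.

```python
import numpy as np

def kron_update(a_shape, b_shape, seed=0):
    """Build a weight update as one Kronecker product A (x) B.

    Returns (delta_w, number of trainable parameters, numerical rank).
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=a_shape)  # random Gaussian factors are full rank
    B = rng.normal(size=b_shape)  # with probability 1
    delta_w = np.kron(A, B)
    n_params = A.size + B.size
    return delta_w, n_params, np.linalg.matrix_rank(delta_w)

delta_w, n_params, rank = kron_update((4, 4), (8, 8))
print(delta_w.shape, n_params, rank)  # (32, 32) 80 32
```

Here 80 trainable parameters yield a full-rank 32 x 32 update, whereas a low-rank factorization of the same matrix spends 64 parameters per unit of rank. Changing the two factor shapes changes both the parameter count and the structure of the update, which is the design axis the abstract refers to.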

2602.01186 2026-06-01 cs.LG cs.AI

The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics

Fabio Turazza, Marco Picone, Marco Mamei

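AI summary: Proposes the Gaussian-Head OFL family of methods, in which clients transmit only per-class counts and first- and second-order moments, and the server builds the model from three components (closed-form Gaussian heads, FisherMix, and Proto-Hyper), achieving strictly data-free one-shot federated learning with state-of-the-art robustness and accuracy under strong non-IID skew.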

Comments Accepted at the International Conference on Learning Representations (ICLR) 2026 - Final Version

Abstract

Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with high communication costs and privacy risks from repeated model transmissions. In contrast, one-shot federated learning (OFL) alleviates these limitations by reducing communication to a single round, thereby lowering overhead and enhancing practical deployability. Nevertheless, most existing one-shot approaches remain either impractical or constrained: for example, they often depend on the availability of a public dataset, assume homogeneous client models, or require uploading additional data or model information. To overcome these issues, we introduce the Gaussian-Head OFL (GH-OFL) family, a suite of one-shot federated methods that assume class-conditional Gaussianity of pretrained embeddings. Clients transmit only sufficient statistics (per-class counts and first/second-order moments) and the server builds heads via three components: (i) closed-form Gaussian heads (NB/LDA/QDA) computed directly from the received statistics; (ii) FisherMix, a linear head with cosine margin trained on synthetic samples drawn in an estimated Fisher subspace; and (iii) Proto-Hyper, a lightweight low-rank residual head that refines Gaussian logits via knowledge distillation on those synthetic samples. In our experiments, GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.
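
The statistics-only protocol described above can be sketched for the simplest of the closed-form heads, a diagonal Gaussian naive Bayes (NB) classifier. This is a minimal illustration rather than the authors' implementation: the function names, the use of raw sums as the transmitted moments, and the `var_floor` regularizer are assumptions, and FisherMix and Proto-Hyper are not covered.

```python
import math


def client_statistics(features, labels):
    """Per-class (count, sum, sum of squares) computed locally on one client."""
    stats = {}
    for x, y in zip(features, labels):
        n, s1, s2 = stats.get(y, (0, [0.0] * len(x), [0.0] * len(x)))
        stats[y] = (
            n + 1,
            [a + v for a, v in zip(s1, x)],
            [a + v * v for a, v in zip(s2, x)],
        )
    return stats


def aggregate(all_client_stats):
    """Server side: add up the statistics received in the single round."""
    total = {}
    for stats in all_client_stats:
        for y, (n, s1, s2) in stats.items():
            tn, t1, t2 = total.get(y, (0, [0.0] * len(s1), [0.0] * len(s2)))
            total[y] = (
                tn + n,
                [a + b for a, b in zip(t1, s1)],
                [a + b for a, b in zip(t2, s2)],
            )
    return total


def fit_gaussian_nb(total, var_floor=1e-6):
    """Closed-form NB head: log prior, mean, and diagonal variance per class."""
    n_all = sum(n for n, _, _ in total.values())
    model = {}
    for y, (n, s1, s2) in total.items():
        mean = [a / n for a in s1]
        # var_floor is a hypothetical safeguard against zero variance.
        var = [max(b / n - m * m, var_floor) for b, m in zip(s2, mean)]
        model[y] = (math.log(n / n_all), mean, var)
    return model


def predict(model, x):
    """Return the class with the highest Gaussian log posterior (up to a constant)."""
    def score(params):
        log_prior, mean, var = params
        return log_prior - 0.5 * sum(
            math.log(2 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mean, var)
        )
    return max(model, key=lambda y: score(model[y]))


# Extreme non-IID split: each client holds a single class.
client_a = client_statistics([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0, 0, 0])
client_b = client_statistics([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]], [1, 1, 1])
head = fit_gaussian_nb(aggregate([client_a, client_b]))
print(predict(head, [0.1, 0.2]), predict(head, [4.8, 5.1]))  # 0 1
```

Because counts and sums are additive, the aggregated statistics equal those of the pooled data, so the head does not depend on how samples are split across clients.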

2602.00521 2026-06-01 cs.AI

Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim

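AI summary: Proposes a two-phase diagnostic framework based on Item Response Theory (IRT) that assesses the reliability of LLM-as-a-Judge along two dimensions, intrinsic consistency and human alignment, and provides interpretable diagnostic signals.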

Comments Accepted ICML 2026

Abstract

While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) from IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework, and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
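
For readers unfamiliar with the Graded Response Model, the sketch below computes category probabilities in its standard (Samejima) form: the probability of scoring at or above category k is a logistic function of a(θ − b_k), and each category probability is the difference between adjacent cumulative probabilities. How the paper maps judges, prompts, and evaluated samples onto the latent trait θ and the item parameters is not stated in the abstract, so the function and argument names here are hypothetical.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Probabilities of each ordered score category under the standard GRM.

    theta: latent trait value; discrimination: slope a > 0;
    thresholds: strictly increasing b_1 < ... < b_{K-1} for K categories.
    """
    if discrimination <= 0:
        raise ValueError("discrimination must be positive")
    if any(b1 >= b2 for b1, b2 in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative probabilities P(score >= k), padded with 1 and 0 at the ends.
    cumulative = [1.0]
    cumulative += [
        1.0 / (1.0 + math.exp(-discrimination * (theta - b))) for b in thresholds
    ]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Three score categories with thresholds placed symmetrically around theta = 0.
probs = grm_category_probs(theta=0.0, discrimination=1.0, thresholds=[-1.0, 1.0])
print([round(p, 3) for p in probs])  # [0.269, 0.462, 0.269]
```

A larger discrimination value makes the category boundaries sharper, which is one reason fitted GRM parameters can be read as interpretable signals about a rating instrument.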

2601.13433 2026-06-01 cs.CL cs.LG

Who Endorsed It? Measuring Authority Bias Across Expertise Levels in Language Models

Priyanka Mary Mammen, Emil Joswin, Shankar Venkitachalam

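AI summary: Studies whether language models show systematic bias in reasoning tasks depending on the expertise level of the endorsement source, finding that models are more susceptible to incorrect endorsements from higher-authority sources, which lowers accuracy and raises confidence in wrong answers, and that the bias can be mitigated through mechanistic intervention.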

Abstract

Prior research demonstrates that the performance of language models on reasoning tasks can be influenced by suggestions, hints, and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect or misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and that a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.
