arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31250 2026-06-01 stat.ML cs.AI cs.LG

Entropic Projection Alignment: Estimating, Explaining, and Improving Model Performance Under Distribution Shift

熵投影对齐:估计、解释和改进分布偏移下的模型性能

Salim I. Amoukou, Emanuele Albini, Tom Bewley, Saumitra Mishra, Manuela Veloso

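AI summary: Proposes Entropic Projection Alignment (EPA), which aligns the source distribution to the target by matching selected moments while minimizing KL divergence, giving a unified treatment of performance estimation, explanation, and improvement under distribution shift.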

详情
Comments
Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)
AI中文摘要

我们提出了一个统一框架,用于解决分布偏移的三个关键挑战:(1)估计模型在未标记目标域上的性能,(2)通过识别导致偏移的特征来解释偏移,以及(3)提高目标域性能。我们的方法,熵投影对齐(EPA),通过匹配精心选择的矩同时最小化与源分布的KL散度,将源分布与目标分布对齐。该公式为重要性权重提供了唯一的闭式解,通过隐式方差控制实现鲁棒性。借鉴领域适应理论,我们证明矩匹配足以实现可靠的估计和适应,避免了完全密度比恢复的需要。大量实验以及强有力的理论保证表明,EPA在提供显著计算效率的同时,始终优于最先进的基线方法。

英文摘要

We propose a unified framework for addressing three key challenges of distribution shift: (1) estimating a model's performance on an unlabeled target domain, (2) explaining the shift by identifying the features responsible, and (3) improving the target domain performance. Our method, Entropic Projection Alignment (EPA), aligns the source distribution to the target by matching carefully selected moments while simultaneously minimising the KL divergence from the source. This formulation yields a unique closed-form solution for importance weights, achieving robustness through implicit variance control. Drawing on domain adaptation theory, we establish that moment matching is sufficient for reliable estimation and adaptation, avoiding the need for full density ratio recovery. Extensive experiments, together with strong theoretical guarantees, demonstrate that EPA consistently outperforms state-of-the-art baselines while offering substantial computational efficiency.
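The entropic projection described in the abstract can be illustrated with a minimal sketch. It assumes a single scalar feature, a uniform source distribution, and one matched moment (the mean); the function names and the Newton solver are illustrative assumptions, not the paper's implementation. Minimizing KL divergence from the source subject to a moment constraint gives weights of exponential-tilting form, w_i ∝ exp(λ f_i), with λ chosen so the weighted mean hits the target.

```python
import math

def tilted_weights(feats, lam):
    """Normalized weights w_i proportional to exp(lam * f_i)."""
    shift = max(lam * f for f in feats)  # subtract the max for numerical stability
    unnorm = [math.exp(lam * f - shift) for f in feats]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def entropic_projection(feats, target_mean, steps=100, tol=1e-10):
    """Find lam so the reweighted source mean equals target_mean (hypothetical helper).

    Newton's method on the convex dual: the gradient is (weighted mean - target)
    and the curvature is the weighted variance.
    """
    lam = 0.0
    for _ in range(steps):
        w = tilted_weights(feats, lam)
        mean = sum(wi * f for wi, f in zip(w, feats))
        var = sum(wi * (f - mean) ** 2 for wi, f in zip(w, feats))
        if abs(mean - target_mean) < tol or var < 1e-15:
            break
        lam += (target_mean - mean) / var
    return tilted_weights(feats, lam), lam

# Toy source sample with mean 2.0, reweighted to match a target mean of 3.0.
source = [0.0, 1.0, 2.0, 3.0, 4.0]
weights, lam = entropic_projection(source, target_mean=3.0)
```

The resulting weights could then be used to reweight per-sample source losses as an estimate of target performance; which moments to match is the part the paper addresses and this sketch does not.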

2605.31249 2026-06-01 cs.LG cs.AI

Learning Cardiac Latent Representations in Vectorcardiogram Space

在向量心电图空间中学习心脏潜在表示

Bosong Huang, Panzhen Zhao, Zengxiang Li, Patricia Lee, Wei Jin, Alan Wee-Chung Liew, Ming Jin, Shirui Pan

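AI summary: To address redundancy and overfitting in representation learning on the standard twelve-lead ECG, proposes LVCG, a framework based on the Frank vectorcardiogram model that learns view-invariant representations of cardiac electrical activity in a physically grounded latent space, improving robustness and generalization.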

详情
AI中文摘要

心电图(ECG)是心脏评估的基石,学习信息丰富的ECG表示对于从疾病诊断到临床报告生成等任务至关重要。然而,现有方法几乎完全在可观测的ECG信号空间中操作。实际上,标准十二导联ECG代表了同一心脏电活动在不同空间方向上的多个投影。因此,在ECG空间中进行表示学习不可避免地引入了大量冗余,可能导致虚假相关性和过拟合风险增加。为了解决这个问题,受Frank向量心电图(VCG)模型启发,我们提出直接在VCG空间中学习心脏电活动的统一潜在表示。我们引入了LVCG,这是第一个设计用于在此物理基础潜在空间中运行的通用自监督表示学习框架。通过学习视图不变的潜在VCG表示而非导联特定伪影,LVCG最小化了冗余并提高了泛化能力。LVCG在各项任务中普遍优于ECG空间基线,展现出增强的鲁棒性和泛化能力,尤其在领域偏移设置中。

英文摘要

Electrocardiography (ECG) is a cornerstone of cardiac assessment, making the learning of informative ECG representations fundamental to tasks ranging from disease diagnosis to clinical report generation. However, existing methods operate almost exclusively in the observable ECG signal space. In practice, the standard twelve-lead ECG represents multiple projections of the same underlying cardiac electrical activity from different spatial orientations. Therefore, representation learning in the ECG space inevitably introduces substantial redundancy, which may lead to spurious correlations and increased risk of overfitting. To address this, and motivated by the Frank vectorcardiogram (VCG) model, we propose learning a unified latent representation of cardiac electrical activity directly in the VCG space. We introduce LVCG, the first general self-supervised representation learning framework designed to operate in this physically grounded latent space. By learning view-invariant latent VCG representations rather than lead-specific artifacts, LVCG minimizes redundancy and improves generalization. LVCG generally outperforms ECG-space baselines across tasks, demonstrating enhanced robustness and generalization, especially in domain shift settings.

2605.31246 2026-06-01 cs.CR cs.CV

BadBone: Backdoor Attacks Against Backbone Models in Visual Prompt Learning

BadBone:视觉提示学习中针对骨干模型的后门攻击

Ziqing Yang, Rui Wen, Xinlei He, Yun Shen, Michael Backes, Yang Zhang

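AI summary: Proposes BadBone, a stealthy and adaptive backdoor attack using bi-level optimization that compromises the backbone model so that downstream prompt-learning tasks inherit the backdoor; experiments show existing defenses are largely ineffective.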

详情
Comments
Accepted by IEEE Transactions on Information Forensics & Security
AI中文摘要

提示学习是一种新的机器学习范式,因其简单性和有效性而受到广泛关注。尽管其应用日益增多,但该范式的安全漏洞仍未被充分探索。在这项工作中,我们率先提出BadBone,一种利用双层优化的隐蔽自适应后门攻击,针对提示学习。我们的目标不是对提示学习过程植入后门,而是破坏骨干模型,使得只有采用提示学习的目标下游任务继承后门漏洞。在三个不同模型和来自不同领域的三个数据集上的大量实验表明,我们的定向/非定向后门模型在保持预训练和下游任务实用性的同时,实现了高攻击性能。此外,我们针对六种最先进的模型级防御(包括Neural Cleanse、ABS、MNTD、NAD、CLP和D-BR)评估了我们的方法。结果表明,这些防御对我们的后门模型基本无效,因此有效的防御仍是未来工作的重要方向。

英文摘要

Prompt learning is a new machine learning paradigm that has attracted ample attention due to its simplicity and proven efficacy. Despite its growing adoption, the security vulnerabilities associated with this paradigm remain underexplored. In this work, we take the first step by proposing BadBone, a stealthy and adaptive backdoor attack against prompt learning using bi-level optimization. Instead of backdooring the prompt learning process, we aim to compromise a backbone model such that only target downstream tasks employing prompt learning inherit the backdoor vulnerability. Extensive experiments on three different models and three datasets from various domains show that our targeted/untargeted backdoored models achieve high attack performance while maintaining utility on both pre-training and downstream tasks. Moreover, we evaluate our approach against six state-of-the-art model-level defenses, including Neural Cleanse, ABS, MNTD, NAD, CLP, and D-BR. The results demonstrate that these defenses are largely ineffective against our backdoored models, leaving effective defense as an important direction for future work.

2605.31245 2026-06-01 cs.LG

Toward Identifiable Sparse Autoencoders

走向可识别的稀疏自编码器

Walter Nelson, Theofanis Karaletsos, Francesco Locatello

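AI summary: To address the training instability of sparse autoencoders, analyzes the responsible model properties theoretically and modifies the architecture and training procedure, yielding the iSAE variant with lower reconstruction error and higher stability.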

详情
Comments
International Conference on Machine Learning (ICML) 2026
AI中文摘要

最近,稀疏自编码器(SAE)已成为解释和交互实际神经网络表示的有吸引力的工具。虽然常见的经验共识如此,但我们也在理论上表明SAE高度不稳定:不同的训练运行可能产生不同的概念字典和稀疏编码。我们刻画了阻碍实际SAE稳定性的模型属性,并通过架构和训练过程的最小改动解决每个问题。这些改动共同产生了两个版本的可识别SAE(iSAE),这是标准TopK SAE的变体,具有更低的重构误差和更高的稳定性。我们通过将SAE与传统字典学习方法联系起来,从理论上解释了这一改进,并表明实践中学习的字典满足近似受限等距条件,从而使这些模型中的相应稀疏编码接近可识别。

英文摘要

Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are likely to produce different concept dictionaries and sparse codes. We characterize the model properties that hinder the stability of real-world SAEs, and address each of these problems through minimal changes to the architecture and training procedure. Together, these changes yield two versions of an identifiable SAE (iSAE), a variant of the standard TopK SAE with lower reconstruction error and improved stability. We explain this improvement theoretically by connecting SAEs with traditional dictionary learning approaches, and show that the dictionaries learned in practice satisfy an approximate restricted isometry condition, rendering the corresponding sparse codes in those models near-identifiable.
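For readers unfamiliar with the TopK SAE that iSAE builds on, the following is a minimal sketch of the TopK forward pass only. It omits biases and training, uses plain Python lists, and all names are illustrative assumptions; it does not include any of the paper's iSAE modifications.

```python
def top_k(values, k):
    """Keep the k largest entries of values and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def sae_forward(x, w_enc, w_dec, k):
    """Encode x, sparsify the code with TopK, then decode (biases omitted)."""
    code = top_k(matvec(w_enc, x), k)
    recon = matvec(w_dec, code)
    return code, recon

# Toy example: identity encoder and decoder, so the code is x with its smallest entry dropped.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
code, recon = sae_forward([3.0, 1.0, 2.0], identity, identity, k=2)
```

The instability the abstract discusses concerns which dictionary (here `w_dec`) and which sparse codes different training runs converge to, which a forward pass alone cannot show.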

2605.31244 2026-06-01 cs.LG physics.comp-ph

Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail

谱范围:理解神经缩放作为进入谱尾的进展

Konstantin Nikolaou, Jonas Scheunemann, Sven Krippendorf, Samuel Tovey, Christian Holm

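AI summary: Proposes the "spectral position" measure, which analyzes neural scaling laws through the eigenvalues of the empirical neural tangent kernel; finds that larger models lower the loss by reaching deeper into the spectral tail ("spectral reach"), and identifies feature learning as the key mechanism.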

详情
AI中文摘要

神经缩放定律描述了模型大小、数据集大小、计算量与性能之间可预测的幂律关系。尽管这些定律指导了现代基础模型的发展,但其背后的机制仍知之甚少,部分原因是缺乏可扩展的分析工具。为弥补这一差距,我们引入了“谱位置”:一种可扩展的度量,用于衡量经验神经正切核(eNTK)的哪些特征值当前驱动损失降低。将该度量应用于缩放实验,我们发现谱位置在整个训练过程中下降:学习从主导特征模式转移到谱尾。大模型比小模型更深入地进入谱尾,揭示了一种我们称之为“谱范围”的与大小相关的能力。这解释了为什么大模型能达到更低的损失:它们能持续学习小模型无法访问的弱谱信号。我们进一步确定特征学习是谱范围的关键促成因素。它自适应地放大梯度幅度,使学习在冻结表示停滞的地方持续进展。这指向了通过架构和优化器设计的具体干预措施。

英文摘要

Neural scaling laws describe predictable power-law relationships between model size, dataset size, compute, and performance. While these laws guide the development of modern foundation models, the mechanisms underpinning them remain poorly understood, in part due to the absence of scalable analysis tools. To close this gap, we introduce "spectral position": a scalable measure of which eigenvalues of the empirical neural tangent kernel (eNTK) currently drive loss reduction. Applying this measure to scaling experiments, we find that spectral position decreases throughout training: learning shifts from dominant eigenmodes into the spectral tail. Larger models reach further into the tail than smaller models, revealing a size-dependent capacity we call "spectral reach". This suggests why larger models achieve lower losses: they sustain learning on weak spectral signals inaccessible to smaller models. We further identify feature learning as a key enabler of spectral reach. It adaptively amplifies gradient magnitudes as learning advances, sustaining progress where frozen representations stall. This points to concrete interventions through architecture and optimizer design.

2605.31241 2026-06-01 cs.LG

Bifurcated Remaining Useful Life Prediction: A Hybrid Approach for Realistic Uncertainty Characterization

分支剩余使用寿命预测:一种用于现实不确定性表征的混合方法

Xabier Belaunzaran, Antonio Nappa, Arkaitz Artetxe, Basilio Sierra

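AI summary: Proposes a hybrid prognostic framework that splits turbofan engine life into healthy and degraded regimes and combines an LSTM autoencoder, conditional Weibull survival analysis, and a probabilistic neural network for uncertainty-aware remaining useful life prediction.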

详情
Comments
Submitted to 9th European Conference of the Prognostics and Health Management Society 2026
AI中文摘要

本研究提出了一种新颖的混合预测框架,用于使用NASA C-MAPSS数据集对涡扇发动机进行不确定性感知的剩余使用寿命(RUL)估计。该框架采用状态感知策略,将发动机的运行寿命分为“健康”和“退化”两个阶段。一个基于LSTM的自编码器,仅在标称数据(RUL > 150个循环)上训练,通过监测重构误差作为鲁棒的状态分类器。对于健康阶段,使用条件威布尔生存分析进行平均剩余寿命估计。对于退化阶段,使用带有蒙特卡洛丢弃法的概率神经网络捕获偶然和认知不确定性。不使用严格的二元标签,而是使用校准的sigmoid函数将自编码器的输出转换为连续状态概率,动态加权最终集成预测。该框架的主要优势在于生成物理一致的不确定性带,在寿命末期提供高置信度预测,同时准确反映早期运行的内在方差,为风险知情维护提供鲁棒工具。

英文摘要

This study presents a novel hybrid prognostic framework for uncertainty-aware Remaining Useful Life (RUL) estimation in turbofan engines using the NASA C-MAPSS dataset. The framework employs a state-aware strategy that bifurcates the engine's operational lifespan into "healthy" and "degraded" regimes. An LSTM-based autoencoder, trained strictly on nominal data (RUL > 150 cycles), monitors reconstruction error to act as a robust state classifier. For the healthy regime, a Conditional Weibull Survival Analysis is used for Mean Residual Life estimation. For the degraded regime, a Probabilistic Neural Network with Monte Carlo Dropout captures both aleatoric and epistemic uncertainties. Rather than using rigid binary labels, a calibrated sigmoid function converts the autoencoder's output into continuous state probabilities, dynamically weighting the final ensemble prediction. The primary strength of this framework is its generation of physically consistent uncertainty bands, yielding high-confidence predictions near end-of-life while accurately reflecting the inherent variance of early operation, providing a robust tool for risk-informed maintenance.
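The sigmoid-weighted blending of the two regime predictors can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold and steepness values are hypothetical placeholders for whatever calibration the authors use, and the two RUL inputs stand in for the outputs of the Weibull and neural-network models.

```python
import math

def state_probability(recon_error, threshold, steepness):
    """Sigmoid mapping from autoencoder reconstruction error to P(degraded)."""
    return 1.0 / (1.0 + math.exp(-steepness * (recon_error - threshold)))

def blended_rul(recon_error, rul_healthy, rul_degraded, threshold=0.5, steepness=10.0):
    """Weight the healthy-regime and degraded-regime RUL estimates by state probability.

    threshold and steepness are assumed calibration parameters, not values from the paper.
    """
    p_degraded = state_probability(recon_error, threshold, steepness)
    return (1.0 - p_degraded) * rul_healthy + p_degraded * rul_degraded

# At the threshold the two estimates are averaged; far from it, one regime dominates.
mid_estimate = blended_rul(0.5, rul_healthy=180.0, rul_degraded=40.0)
early_estimate = blended_rul(0.0, rul_healthy=180.0, rul_degraded=40.0)
late_estimate = blended_rul(1.0, rul_healthy=180.0, rul_degraded=40.0)
```

A soft weight of this kind avoids a discontinuity in the predicted RUL at the moment the classifier would otherwise flip from one regime to the other.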

2605.31239 2026-06-01 stat.ML cs.AI cs.LG

Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference

通过随时有效推断纠正在线决策树中的分裂选择

Salim I. Amoukou, Saumitra Mishra, Manuela Veloso

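AI summary: To address the lack of valid statistical guarantees for split selection in online decision trees, proposes a method based on anytime-valid inference that provides anytime-valid control of false splits under arbitrary data streams, finite commitment time under a predictive advantage, and, under stationary i.i.d. data, monotonically decreasing risk with strict improvement at every split.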

详情
Comments
Accepted as a Spotlight at the Forty-Third International Conference on Machine Learning (ICML 2026)
AI中文摘要

基于装袋的集成方法,尤其是自适应随机森林,是数据流学习中最强的表现者之一。这些方法的共同点是依赖霍夫丁树作为基学习器,通过使用浓度不等式测试候选分裂是否显著优于其替代方案来增量式地构建决策树。尽管经验成功,现有变体缺乏有效的统计保证。当前分析依赖于固定样本浓度界,而分裂决策使用数据依赖的停止规则,这使其保证无效,并可能将错误分裂的概率推向1。我们引入了一种基于随时有效推断的原则性替代方案。我们的方法提供:(i) 在任意数据流(包括非平稳设置)下对错误分裂的随时有效控制;(ii) 在预测优势下的有限承诺时间;(iii) 在平稳独立同分布数据下,风险单调递减且每次分裂严格改善。在经验上,我们评估了独立树及其在非平稳流中在自适应随机森林中的使用。我们的方法提高了性能,同时生成了更小的树。

英文摘要

Bagging-based ensembles, most notably Adaptive Random Forests, are among the strongest performers for learning from data streams. A common denominator across these methods is their reliance on Hoeffding Trees as base learners, which grow decision trees incrementally by testing whether a candidate split is significantly better than its alternatives using concentration inequalities. Despite their empirical success, existing variants lack valid statistical guarantees. Current analyses rely on fixed-sample concentration bounds, while split decisions are made using data-dependent stopping rules, which invalidates their guarantees and can drive the probability of incorrect splits to one. We introduce a principled alternative based on anytime-valid inference. Our method provides: (i) anytime-valid control of false splits under arbitrary data streams, including non-stationary settings; (ii) finite commitment time under a predictive advantage; and (iii) under stationary i.i.d. data, risk is monotone decreasing and strictly improves at every split. Empirically, we evaluate both standalone trees and their use within Adaptive Random Forests on non-stationary streams. Our method improves performance while producing substantially smaller trees.
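The fixed-sample test that the abstract criticizes can be sketched as below, alongside a textbook union-bound correction that stays valid when the test is checked after every sample. The union-bound variant is only an illustration of what "anytime-valid" asks for; it is an assumption of this sketch and not the paper's method, and the function names are hypothetical.

```python
import math

def hoeffding_eps(value_range, delta, n):
    """Fixed-sample Hoeffding radius: valid for one n chosen before seeing data."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def union_bound_eps(value_range, delta, n):
    """Radius that holds simultaneously for all n >= 1.

    Spends delta / (n * (n + 1)) at step n; these budgets sum to delta over all n.
    """
    return hoeffding_eps(value_range, delta / (n * (n + 1)), n)

def should_split(gain_best, gain_second, eps):
    """Classical Hoeffding-tree rule: split once the observed gap exceeds the radius."""
    return (gain_best - gain_second) > eps

eps_fixed = hoeffding_eps(1.0, 0.05, 200)
eps_anytime = union_bound_eps(1.0, 0.05, 200)
```

Because a streaming tree re-runs the test each time new samples arrive and stops at the first success, only a radius of the second kind keeps the overall false-split probability below delta; the price is a wider radius and hence later splits.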

2605.31238 2026-06-01 cs.CL cs.LG

Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

通过图约束路径选择扩展多跳训练数据

Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo

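AI summary: For compositional reasoning over specialized documents, proposes graph-constrained path selection, which decouples path discovery from verbalization and uses graph constraints to filter out invalid paths, substantially expanding the usable corpus and improving model performance.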

详情
Comments
21 pages, 5 figures
AI中文摘要

赋予大型语言模型对专业文档的组成推理能力需要大规模的多跳训练数据,而此类数据除了基于结构化来源的精心策划基准外很少存在。为了直接从纯文本、无标注文本中构建此类数据,现有方法要求单个教师模型联合发现文档中的证据路径并将其表述为问答对。然而,当文档围绕重复模板和密集交叉引用子句(这是大多数真实世界专业语料库的特征)构建时,这些方法会严重退化。在这项工作中,我们将这两个操作解耦:推理路径在上下文关键词质心的图上离线枚举,教师模型仅用于将预验证的路径语言化。该图强制执行五个几何可接受性约束,我们提供了Gram矩阵论证,表明仅局部相似性边界允许端点漂移高达约91°,并且需要上相似性边界才能退出由模板文本形成的密集嵌入团。一项匹配规模的消融实验揭示了其机制:在相等的训练规模下,约束链和无约束链产生无法区分的下游性能,而全规模下的增益来自可用语料库的4.4倍扩展,而非更高的每条链质量——这重新定义了图约束在此设置中的作用:提高教师可合成性而非改进链内容。在从CUAD法律合同语料库构建的80K示例上微调Qwen3-32B,将闭卷Token F1从21.66%提高到38.58%。我们已在https://github.com/hkgai-official/GCSCS发布代码。

英文摘要

Endowing large language models with compositional reasoning over specialized documents requires multi-hop training data at scale, yet such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question-answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross-referencing clauses, conditions that characterize most real-world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre-validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram-matrix arguments establishing that local similarity bounds alone admit endpoint drift up to ~91°, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched-size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4× expansion of the usable corpus rather than from higher per-chain quality, reframing the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%. We have released our code at https://github.com/hkgai-official/GCSCS.

2605.31234 2026-06-01 cs.RO

HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model

HARP-VLA:面向视觉-语言-动作模型的人机对齐表示学习

Xiang Zhu, Puzhen Yuan, Yichen Liu, Jianyu Chen

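AI summary: Proposes the HARP framework, which learns aligned human-robot visual and latent action representations from limited paired human-robot demonstrations and unpaired videos, improving VLA pretraining and yielding gains on CALVIN and real-world tasks.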

详情
AI中文摘要

从大规模人类视频中学习可泛化的视觉-语言-动作(VLA)模型具有前景但也充满挑战,原因在于视觉观察和可执行动作方面存在跨实体差异。虽然潜在动作模型通过学习动作抽象减少了动作执行差距,但它们仍然依赖视觉特征。因此,未对齐的人机视觉表示可能导致策略输入不一致,并引发领域相关的潜在动作,阻碍人类视频的有效协同训练。为解决这一问题,我们提出HARP,一种人机对齐的表示学习框架,用于从人类视频中进行更有效的VLA预训练。具体而言,HARP使用有限的配对人机演示作为跨实体桥梁,并利用大量未配对的人机视频作为可扩展的动态监督数据源。它训练一个机器人适应的视觉编码器和一个潜在动作模型,采用以操作为中心的辅助线索和源相对对判别对齐损失,将机器人表示向人类语义对齐,同时保留对级判别性。学习到的对齐视觉编码器和潜在动作模型为VLA式策略学习提供了统一的视觉和动作表示,其中人类和机器人视频提供视觉-语言到潜在动作的监督,轻量级机器人动作头将潜在动作转化为可执行命令。在特征可视化、仿真和真实世界操作上的实验表明,人机对齐和下游策略性能得到提升,在CALVIN ABC→D上达到4.481的平均长度,真实世界成功率比最强基线提升7.1%。

英文摘要

Learning generalizable vision-language-action (VLA) models from large-scale human videos is promising but challenging due to cross-embodiment discrepancies in both visual observations and executable actions. While latent action models reduce the action execution gap by learning action abstractions, they still rely on visual features. Thus, misaligned human and robot visual representations can lead to inconsistencies in policy inputs and induce domain-dependent latent actions, hindering effective co-training with human videos. To address this, we propose HARP, a human-robot aligned representation learning framework for more effective VLA pretraining from human videos. Specifically, HARP uses limited paired human-robot demonstrations as cross-embodiment bridges and abundant unpaired human and robot videos as a scalable dynamics supervision data source. It trains a robot-adapted visual encoder and a latent action model with manipulation-centric auxiliary cues and a source-relative pair-discriminative alignment loss, which adapts robot representations toward human semantics while preserving pair-level discrimination. The learned aligned vision encoder and latent action model provide a unified vision and action representation for VLA-style policy learning, where human and robot videos provide vision-language-to-latent-action supervision and a lightweight robot action head grounds latent actions into executable commands. Experiments on feature visualization, simulation, and real-world manipulation show improved human-robot alignment and downstream policy performance, achieving 4.481 average length on CALVIN ABC→D and a 7.1% real-world success rate gain over the strongest baseline.

2605.31231 2026-06-01 math.NA cs.LG cs.NA

A holomorphic neural network framework for 3D boundary value problems governed by harmonic potentials

基于全纯神经网络的调和势控制的三维边值问题框架

Enrico Ballini, Allan Peter Engsig-Karup, Tito Andriollo

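AI summary: Proposes a framework based on the Whittaker integral formula and holomorphic neural networks that solves 3D boundary value problems for harmonic potentials with networks satisfying the PDE exactly by construction; training needs only boundary collocation points, and accuracy is validated on Laplace and linear elasticity problems.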

详情
AI中文摘要

我们提出了一种基于神经网络的框架,用于求解解可表示为调和势的三维边值问题。该方法利用Whittaker积分公式,通过关于合适复变量的全纯函数来表示解。这些函数随后使用全纯神经网络进行逼近,从而保证全纯性要求。该公式的一个关键特征是,控制偏微分方程(PDE)通过构造精确满足。因此,与标准的物理信息神经网络相比,在域内部不需要PDE的残差最小化,训练完全基于边界配点。该方法针对三维拉普拉斯和线弹性问题进行了验证,在后一种情况下,位移和应力场通过Papkovich-Neuber势表示。数值结果表明,标量和矢量场均得到精确逼近,误差在整个域内保持可控。总体而言,该工作表明,将解析结构融入神经网络架构为三维边值问题的无网格逼近提供了一种自然且有效的框架,同时保留了控制方程的基本性质。

英文摘要

We present a neural-network-based framework for the solution of three-dimensional boundary value problems where the solution is expressible in terms of harmonic potentials. The approach leverages the Whittaker integral formula, which allows representing the solution through functions that are holomorphic with respect to a suitable complex variable. These functions are subsequently approximated using holomorphic neural networks, which guarantee fulfillment of the holomorphicity requirement. A key feature of the proposed formulation is that the governing partial differential equations (PDEs) are satisfied exactly by construction. Therefore, in contrast to standard physics-informed neural networks, no residual minimization of PDEs is required in the interior of the domain, and training is based exclusively on boundary collocation points. The method is validated against three-dimensional Laplace and linear elasticity problems, where, in the latter case, displacement and stress fields are expressed via the Papkovich-Neuber potentials. The numerical results show an accurate approximation of both scalar and vector fields, with errors remaining controlled throughout the domain. Overall, the work demonstrates that the incorporation of analytical structures into neural network architectures provides a natural and effective framework for the meshless approximation of three-dimensional boundary value problems while preserving the underlying properties of the governing equations.

2605.31229 2026-06-01 cs.CV cs.AI

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

超越分类:面向持续多模态检索的动态适配器路由

Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski

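AI summary: For continual multimodal retrieval (CMR), proposes Dynamic Adapter Routing (DAR), based on prototype-based routing and model merging, which outperforms existing baselines in cross-domain evaluation.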

详情
AI中文摘要

虽然检索是视觉-语言模型的核心功能,但持续更新这些模型用于检索任务仍未被充分探索。现有工作通常通过类增量学习(CIL)的视角处理持续检索,在可能无法完全捕捉检索特定动态的设置中评估标准CIL方法和面向检索的适应方法。为了解决这一问题,我们引入了一个新的、原则性的持续多模态检索(CMR)评估框架,涵盖多样化的视觉领域,并在此设置中系统评估常见方法。我们的实证分析表明,标准CIL方法在我们更具挑战性的场景中未能产生有意义的增益。因此,我们提出了动态适配器路由(DAR),一种基于通过原型路由选择适配器并通过模型合并组合的新方法。DAR在先前基线上取得了优越性能,并在分布外评估中展现出强大的泛化能力。我们的结果凸显了CMR的独特挑战,并鼓励在该方向进行进一步研究。

英文摘要

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture the retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model merging. DAR achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. Our results highlight the unique challenges of CMR and encourage further research in this direction.

2605.31228 2026-06-01 cs.LG cs.AI

EchoRL: Reinforcement Learning via Rollout Echoing

EchoRL:通过回滚回响进行强化学习

Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, Volker Tresp, Yunpu Ma

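AI summary: To address advantage degeneration in RLVR training, proposes the EchoRL module, which extracts an EchoClip from successful rollouts as an auxiliary supervision signal and consistently improves training performance.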

详情
Comments
ICML 2026
AI中文摘要

基于可验证奖励的强化学习是增强大语言模型推理能力的有效后训练方法。然而,随着训练进行,学习信号可能崩溃,导致训练收益变得微弱且无效。具体而言,越来越多的提示回滚出现优势退化:所有自生成回滚均显示验证成功,使得其奖励的标准差为零;相应地,每个回滚的优势也退化为零。由于这些回滚的优势为零,用于模型优化的策略梯度最终消失,限制了训练性能。我们认为,其中一些回滚仍然包含有价值的学习信号,但不幸被现有RLVR方法忽略。本文受外部专家模型生成的金色轨迹的熵模式分析启发,提出EchoRL以更好地利用优势退化的回滚来进一步提升训练性能。EchoRL是一个轻量级模块,首先根据逐步熵值从验证成功的回滚中识别出EchoClip,然后将该片段作为辅助监督信号反馈到RL目标中。在10个基准、5个LLM骨干网络和4种流行RLVR后训练方法上的大量实验表明,EchoRL能够以最小开销持续改进RLVR后训练。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, making the training gain marginal. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts are verified successes, so the standard deviation of their rewards is zero and each rollout's advantage degenerates to zero as well. With such advantages, the policy gradient for model optimization eventually vanishes, capping the training performance. We argue that some of these rollouts still contain valuable learning signals that are unfortunately ignored by existing RLVR methods. In this paper, inspired by an analysis of the entropy pattern behind golden trajectories produced by external expert models, we propose EchoRL to better exploit the advantage-degenerated rollouts and further improve the training performance. EchoRL is a lightweight module that first identifies an EchoClip from verified-success rollouts based on their step-level entropy values, and then feeds this clip back as an auxiliary supervision signal in the RL objective. Extensive experiments across 10 benchmarks, 5 LLM backbones, and 4 popular RLVR post-training methods demonstrate that EchoRL consistently improves RLVR post-training with minimal overhead.
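The advantage degeneration described in the abstract is easy to reproduce with a small sketch. It assumes a group-normalized advantage (reward minus group mean, divided by group standard deviation), as used in GRPO-style methods; the abstract does not say which normalization each of the four RLVR methods applies, so treat the formula and the `eps` stabilizer as assumptions.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the rollouts of one prompt.

    eps avoids division by zero when every rollout receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes: successes get positive advantages, failures negative ones.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# All rollouts verified successful: every advantage is exactly zero,
# so this prompt contributes nothing to the policy gradient.
degenerate = group_advantages([1.0, 1.0, 1.0, 1.0])
```

EchoRL's auxiliary supervision signal targets exactly the second case, where the standard objective has no gradient left to offer.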

2605.31227 2026-06-01 cs.CV

HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding

HiERO-StepG @ Ego4D Step Grounding Challenge: 层次化活动理解实现零样本步骤定位

Andrea Zenotto, Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta

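AI summary: Proposes HiERO-StepG, which uses weakly supervised hierarchical representation learning and clustering to perform zero-shot step grounding without task-specific fine-tuning, reaching 56.27% R@1 (IoU=0.3) on the Ego4D challenge.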

详情
Comments
Technical report for the Ego4D Goal Step - Step Grounding challenge at CVPR 2026, derived from arXiv:2505.12911
AI中文摘要

程序性活动遵循明确的结构:无论是考虑烹饪食谱还是机械师修理汽车,这些活动自然分解为步骤和子步骤的层次结构。传统的步骤定位方法需要大量标注且扩展性差。相反,我们认为这种层次结构可以通过共同发生的动作和活动的重复模式,从非策划的人类活动视频中自然涌现。我们的方法基于HiERO,一种弱监督表示学习方法,它利用细粒度的动作级叙述,将功能相关的动作在特征空间中映射得接近。在这个特征空间中,程序步骤可以通过简单的聚类检测到,无需额外的任务特定微调。对于Ego4D步骤定位挑战,我们通过确保步骤分配的细粒度和粗粒度一致性、强制定位步骤的严格时间单调性以及后处理检测步骤以减少噪声预测的影响来增强这种方法。我们将这种方法称为HiERO-StepG,在提交时,它在全局排行榜上以完全零样本且不需要程序特定注释的情况下,在R@1 (IoU = 0.3)指标上达到56.27%,排名第二。项目页面:https://github.com/andreazenotto/HiERO-StepG。

英文摘要

Procedural activities follow well-defined structures: whether we consider a cooking recipe or a mechanic repairing a car, these activities naturally decompose into a hierarchy of steps and sub-steps. Traditional approaches for step grounding require extensive annotations and scale poorly. Instead, we argue that such hierarchical structure can emerge naturally from uncurated videos of human activities through recurring patterns of co-occurring actions and activities. Our approach builds on HiERO, a weakly-supervised representation learning approach that maps functionally related actions close together in the feature space, leveraging only fine-grained action-level narrations. In this feature space, procedure steps can be detected by simple clustering, with no additional task-specific fine-tuning. For the Ego4D Step Grounding challenge, we augment this approach by ensuring fine- and coarse-level agreement in step assignments, enforcing strict temporal monotonicity of the grounded steps, and post-processing the detected steps to reduce the impact of noisy predictions. We call this approach HiERO-StepG. It achieves 56.27% on the R@1 (IoU = 0.3) metric on the global leaderboard at submission time, ranking second while being completely zero-shot and not requiring procedure-specific annotations. Project page: https://github.com/andreazenotto/HiERO-StepG.

2605.31226 2026-06-01 cs.LG cs.AI

What changes after deployment? A survey on On-device Learning in TinyML

部署后发生了什么变化?TinyML中设备端学习综述

Massimo Pavan, Luca Pezzarossa, Fabrizio Pittorino, Manuel Roveri, Xenofon Fafoutis

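AI summary: Focusing on machine learning models on microcontroller-class devices, systematically surveys about 70 on-device learning (ODL) works, analyzes by type of distribution change the effects on applications, hardware, and solutions, and points out the gap between methodological benchmarks and real-world deployment.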

详情
AI中文摘要

微控制器级设备上的机器学习模型(TinyML)面临一个根本性挑战:部署后的分布变化会破坏静态模型。设备端学习(ODL)通过直接在设备上运行学习过程来解决这一问题。现有文献尚未描述分布变化如何发生,以及不同类型的变化需要不同的解决方案。本文基于分布变化类型这一原则,综述了约70篇ODL工作。调查分析了不同类型的分布变化如何影响可寻址的设备端应用、所使用的硬件以及解决方案的结构。还指出了方法论基准与现实部署场景之间持续存在的差距。

英文摘要

Machine learning models on microcontroller-class devices (TinyML) face a fundamental challenge: post-deployment distribution change undermines static models. On-device learning (ODL) addresses this by running the learning process directly on the device. The existing literature has not characterized how distribution change occurs or how different change types require different solutions. Approximately 70 ODL works are surveyed under one principle: the distribution change regime. The survey analyzes how different types of distribution change influence the applications addressable on-device, the hardware employed, and the structure of the solutions. A persistent gap between methodological benchmarks and real-world deployment scenarios is also identified.

2605.31224 2026-06-01 cs.CY cs.AI cs.HC

Comparing LLM-Based Conversational and Graphical Interfaces for Industrial Decision Tasks: An Exploratory Mixed-Methods Study

基于LLM的对话式与图形化界面在工业决策任务中的比较:一项探索性混合方法研究

Roberto Figliè, Simone Caputo, Alan Serrano, Tommaso Turchi, Daniele Mazzei

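AI summary: A mixed-methods study comparing an LLM-based conversational interface with a graphical dashboard on industrial decision tasks; finds that the conversational interface can reduce interactional effort, while the dashboard remains valuable for overview and verification.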

详情
AI中文摘要

生成式AI对话用户界面(CUI)作为访问和分析数据的新方式,在各个领域(包括工业领域)的应用正在增长。在工业领域,物联网设备产生的大量数据流经用户界面,可能需要对决策者新的分析需求进行适应。基于LLM的CUI通过自然语言的直接性,无需学习每个GUI设计的成本,有望提供一种与这些数据直接交互的新方式。此外,LLM的能力及其代理性为自动化某些任务并在决策活动中辅助推理提供了可能性。但这些承诺是否可靠?我们通过一项混合方法研究来探讨这一普遍问题,比较了最先进的仪表盘与对话代理。共有20名参与者使用两种界面完成四项复杂度不同的模拟工业决策任务。我们结合了心理工作量、完成时间和决策准确性的测量,以及通过主题分析进行的事后问卷和半结构化访谈。研究结果表明,对话代理可以通过支持更直接的信息访问来减少交互努力,而仪表盘在概览和验证方面仍然有价值。然而,这些好处可能因任务而异,需要通过更大规模的研究进行验证。

英文摘要

The use of Generative AI Conversational User Interfaces (CUIs) as a new way to access and analyze data is growing in all sectors, and the industrial one is no exception. There, large amounts of data produced by IoT devices flow through user interfaces, which may need to adapt to the new analysis needs of decision-makers. LLM-based CUIs promise a new way to interact directly with those data through natural language, without the learning costs that every GUI design carries. Moreover, the capabilities of LLMs and their agency open up the possibility of automating some tasks and helping with reasoning during decision-making activities. But are these promises well founded? We try to scope this general question with a mixed-methods study comparing a state-of-the-art dashboard with a conversational agent. A total of 20 participants used both interfaces to complete four simulated industrial decision tasks of varying complexity. We combined measures of mental workload, completion time, and decision accuracy with a post-study questionnaire and semi-structured interviews analyzed through thematic analysis. The findings suggest that the conversational agent can reduce interactional effort by supporting more direct access to information, while the dashboard remains valuable for overview and verification. However, these benefits may vary across tasks and require validation through larger-scale studies.

2605.31222 2026-06-01 cs.LG

Multivariate Distributional Reinforcement Learning Using Sliced Divergences

使用切片散度的多变量分布强化学习

Baptiste Debes, Tinne Tuytelaars

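AI summary: Proposes SDRL, which extends one-dimensional divergences to multivariate return distributions via projections, proves Bellman contraction under scalar discounting and general matrix discounting, supports several divergences, and is compatible with the standard single-sample Bellman update.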

详情
AI中文摘要

分布强化学习(DRL)建模完整的回报分布而非期望,但将其扩展到多变量设置仍然具有挑战性。许多常见度量不能自然地推广到一维以上,或者失去计算可行性,并且多变量情况引入了额外的困难,例如一般矩阵折扣,对此没有可用的收缩结果。我们引入了切片分布强化学习(SDRL),它通过投影将可处理的一维散度提升到多变量回报分布。我们证明了在共享标量折扣下均匀切片的贝尔曼收缩,并引入了一种在一般密集折扣矩阵下具有收缩性的最大切片变体。SDRL支持广泛的基散度;我们分析了Wasserstein、Cramér和最大均值差异(MMD),并表征了哪些SDRL变体适用于分布强化学习中使用的标准单样本贝尔曼更新。我们在一个玩具链问题、一个基于网格世界的图像环境以及一组Atari游戏上评估了SDRL。

英文摘要

Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dimension or lose computational tractability, and the multivariate case introduces additional difficulties such as general matrix discounting, for which no contraction results are available. We introduce Sliced Distributional Reinforcement Learning (SDRL), which lifts tractable one-dimensional divergences to multivariate return distributions via projections. We prove Bellman contraction for uniform slicing under shared scalar discounting, and introduce a maximum-slicing variant with contraction under general dense discount matrices. SDRL supports a broad class of base divergences; we analyze Wasserstein, Cramér, and Maximum Mean Discrepancy (MMD), and characterize which SDRL variants suit the standard single-sample Bellman update used in distributional RL. We evaluate SDRL on a toy chain problem and a gridworld image-based environment as well as a subset of Atari games.
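The idea of lifting a one-dimensional divergence to multivariate samples through projections can be sketched for two-dimensional returns and the Wasserstein-1 base divergence. The sketch assumes equal-size sample sets and uses evenly spaced directions as a deterministic stand-in for random slices; function names are illustrative, and this is not the paper's estimator or training loss.

```python
import math

def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def directions(n):
    """n evenly spaced unit vectors on the half-circle (assumed slicing scheme)."""
    return [(math.cos(math.pi * i / n), math.sin(math.pi * i / n)) for i in range(n)]

def project(points, d):
    """Project 2-D points onto direction d."""
    return [p[0] * d[0] + p[1] * d[1] for p in points]

def sliced_w1(p, q, n_dirs=16):
    """Return (average over slices, maximum over slices) of the 1-D W1 distances."""
    vals = [wasserstein_1d(project(p, d), project(q, d)) for d in directions(n_dirs)]
    return sum(vals) / len(vals), max(vals)

# Two-dimensional return samples; the second set is the first shifted by 3 along x.
returns_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
returns_b = [(x + 3.0, y) for x, y in returns_a]
uniform_sliced, max_sliced = sliced_w1(returns_a, returns_b)
```

The first returned value corresponds to uniform slicing and the second to maximum slicing, the variant the abstract associates with contraction under general discount matrices.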

2605.31220 2026-06-01 cs.CL cs.AI cs.LG

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

共享疑虑:语言模型的零样本跨语言置信度估计

Athina Kyriakou, Dennis Ulmer, Ivan Titov

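AI summary: Investigates whether multilingual LLMs encode shared, language-transferable confidence features; a lightweight linear probe predicts answer correctness directly from intermediate representations, generalizes zero-shot across languages, and shows that confidence features concentrate in middle layers.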

详情
AI中文摘要

置信度估计(CE),即量化模型预测的可靠性,在大语言模型(LLM)背景下引起了极大兴趣。然而,大多数研究集中在英语上,忽视了LLM使用的多语言现实,而许多CE方法会退化或需要跨语言重新训练。为了解决这一差距,我们研究了多语言LLM是否编码共享的、可跨语言迁移的置信度特征。我们使用一个轻量级线性探针,直接从中间表示预测答案正确性。经过单语言训练后,该探针在零样本情况下泛化到未见过的、类型多样的语言,无需目标语言监督。学习到的层权重和多次消融实验表明,置信度特征集中在各语言的中间层,表明存在共享的置信度子空间。虽然零样本跨语言性能取决于与源语言的相似性,但该探针无需任何重新训练即可提供强基线,并且与其他流行的置信度估计方法相比具有优势。

英文摘要

Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.

2605.31217 2026-06-01 cs.CV

TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation

TALON: 用于六自由度航天器姿态估计的令牌对齐轻量适配器

Abid Ali, Arunkumar Rathinam, Djamila Aouada

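AI summary: Proposes TALON, which injects spatiotemporal 3D adapters before the attention layers of a frozen ViT and combines them with a token alignment loss for lightweight 6-DoF spacecraft pose estimation, reducing pose error on the SPADES dataset and improving ADD-0.1d accuracy on the SwissCube dataset.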

详情
Comments
13 pages paper with 3 figures in total
AI中文摘要

单目六自由度航天器姿态估计方法主要处理单帧图像,忽略了航天器机动过程中获取的图像序列中的时间信息。少数时间方法需要完全骨干微调或辅助光流网络,分别存在灾难性遗忘或增加计算成本的风险。我们提出TALON(轨道导航的令牌对齐轻量适配器):在冻结的ViT视觉变换器的自注意力层之前注入时空3D适配器,结合补丁-令牌对齐损失,通过原型条件KL散度目标将适配特征几何地锚定到关键点结构。注意力前放置允许冻结注意力对时间增强的令牌进行推理,每个块使用单个适配器即可获得比注意力后替代方案更强的性能。对齐损失塑造中间表示,使得每个关键点在令牌场中引发空间精确的激活,而该框架向冻结骨干添加的参数少于5%。在SPADES数据集上,TALON将姿态误差比先前最先进方法降低50%;在SwissCube数据集上,其在ADD-0.1d准确率上超越先前最佳方法21.8%。在SPARK真实数据上的从仿真到真实的零样本跨域评估将姿态误差降低4.7倍,消融实验表征了适配器深度在域内和跨域设置中的作用。

英文摘要

Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in image sequences acquired during spacecraft manoeuvres. The few existing temporal approaches require full backbone fine-tuning or auxiliary optical flow networks, risking catastrophic forgetting or increasing computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen Vision Transformer (ViT), combined with a patch-token alignment loss that geometrically grounds the adapted features to keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds less than 5% additional parameters to the frozen backbone. On the SPADES dataset, TALON reduces the pose error by 50% over the prior state-of-the-art, and on the SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. Zero-shot cross-domain evaluation from sim-to-real on SPARK real data reduces pose error by 4.7x, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.

2605.31215 2026-06-01 cs.LG cs.CV

Fixed-Point Masked Generative Modeling

不动点掩码生成建模

Andrea Miele, Yiming Qin, Alba Carballo-Castro, Justin Deschenaux, Pascal Frossard

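AI summary: Proposes Fixed-Point Masked Generative Models (FP-MGMs), which achieve adaptive depth through a fixed-point solver over shared attention layers, and introduces a cross-step consistency loss and a three-state reuse (3SR) strategy, improving low-budget masked generation quality while reducing parameters and training cost.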

详情
AI中文摘要

掩码生成模型(MGM)支持并行解码并在多种模态上取得强性能,但每一步都需要全序列双向变换器,导致训练成本高且在低采样预算下质量下降。现有工作通过更好的采样器或更便宜的固定深度去噪器提升效率,但仍为每个精炼步骤分配固定量的去噪器计算。我们提出不动点掩码生成模型(FP-MGM),用共享注意力层上的不动点求解器替换部分去噪器,实现自适应深度且参数更少。为使其更有效地用于掩码生成,我们首先引入跨步一致性损失,对齐相邻去噪步骤的隐藏表示;其次,三态重用(3SR)通过分别处理未改变、仍掩码和新揭示的令牌,利用先前解热启动求解器。这些组件共同定义了我们的不动点掩码生成的完整训练到推理框架CoFRe。我们还表明,预训练的MGM可以通过短微调转换为FP-MGM,避免完全重新训练。跨模态,CoFRe改善了质量与成本的权衡。在OpenWebText上,与MDLM相比,CoFRe参数减少38.8%,训练时间减少11.5%,VRAM减少16.9%,同时在96个变换器块前向传播的预算下,生成困惑度从830.8提升到101.8。在ImageNette上,CoFRe训练时间减少48.6%,VRAM减少50.7%,并在所有测试的样本预算下改善FID。总体而言,CoFRe为更便宜的训练和更强的低预算掩码生成提供了一个实用框架。

英文摘要

Masked Generative Models (MGMs) enable parallel decoding and achieve strong performance across modalities, but require full-sequence bidirectional transformers at every step, making training costly and degrading quality under low sampling budgets. Existing work improves efficiency via better samplers or cheaper fixed-depth denoisers, but still allocates a fixed amount of denoiser computation to each refinement step. We introduce Fixed-Point Masked Generative Models (FP-MGMs), which replace part of the denoiser with a fixed-point solver over shared attention layers to enable adaptive depth with fewer parameters. To make this more effective for masked generation, we introduce two components: first, a cross-step consistency loss, which aligns hidden representations at neighboring denoising steps; and second, three-state reuse (3SR), which warm-starts the solver from the previous solution by treating unchanged, still-masked, and newly revealed tokens differently. Together, these components define CoFRe, our complete training-to-inference framework for fixed-point masked generation. We also show that pre-trained MGMs can be converted into FP-MGMs with short fine-tuning, avoiding full retraining. Across modalities, CoFRe improves the quality-cost trade-off. On OpenWebText, compared to MDLM, CoFRe reduces parameters by 38.8%, training time by 11.5%, and VRAM by 16.9%, while improving generative perplexity from 830.8 to 101.8 at a budget of 96 transformer-block forward passes. On ImageNette, CoFRe reduces training time by 48.6% and VRAM by 50.7%, while improving FID at all sampling budgets tested. Overall, CoFRe offers a practical framework for cheaper training and stronger low-budget masked generation.

2605.31212 2026-06-01 cs.CV cs.AI cs.CL

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

基准测试与增强文本到图像模型以生成早期算术教育中的视觉表示

Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan

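AI summary: For equation-to-visual generation in early arithmetic education, builds the E2V-Bench benchmark and evaluates existing T2I models, finds frequent errors in object counts and relational structure, and explores benchmark-guided enhancement strategies.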

详情
AI中文摘要

AI系统越来越多地用于支持教育内容创作,但尚不清楚它们能否生成忠实代表其旨在教授的教学概念的输出。因此,我们引入了方程到视觉生成任务,与传统的图像生成不同,该任务要求从算术方程中生成具有教学意义的视觉内容,同时精确保留其数值和关系结构。根据对教师的访谈和教育材料的分析,我们构建了E2V-Bench基准,涵盖四种基于教学法的视觉类型,以及用于评估视觉正确性的自动指标。我们的评估显示,最近的文本到图像(T2I)模型在此任务上频繁失败,错误主要表现为对象计数不正确和关系结构破坏。在此基础上,我们探索了基准引导的增强策略。这些策略改进了代表性模型,但剩余的差距要求未来的T2I模型具备更强的数值和关系基础。

英文摘要

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.

2605.31210 2026-06-01 cs.RO cs.AI

Simulation of collision avoidance behavior in crowd movement by data-driven approach

基于数据驱动方法的群体运动碰撞规避行为模拟

Xuanwen Liang, Eric Wai Ming Lee

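AI summary: To address the high collision rates of data-driven crowd simulation, proposes a collision-penalized generative adversarial network (CPGAN) that uses a lateral-acceleration-based collision loss and Voronoi-based feature extraction to reduce opposite-direction collision rates in bidirectional flow.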

详情
AI中文摘要

人群运动模拟对于行人安全管理和设施布局优化至关重要。数据驱动模型提高了欧几里得度量下的轨迹预测精度,但存在碰撞率过高的问题,尤其是在双向和多向流中。本文建立了一种新颖的数据驱动人群模拟模型,将行人碰撞机制纳入损失函数以减少碰撞。提出了基于侧向加速度的碰撞损失函数和基于Voronoi的运动特征提取方法。该模型基于生成对抗网络(GAN)架构,称为CPGAN(碰撞惩罚GAN)。我们在涉及频繁碰撞规避行为的双向流场景中评估了CPGAN。结果表明,所提出的基于侧向加速度的碰撞损失显著降低了相反方向行人的碰撞率,达到与受控实验相当的水平。CPGAN有效模拟了双向流,再现了通道形成和N-t曲线。研究成果可为将行人动力学机制融入数据驱动人群模拟的损失函数提供启发。

英文摘要

Crowd movement simulation is essential for pedestrian safety management and facility layout optimization. Data-driven models enhance trajectory prediction accuracy under Euclidean metrics, yet they suffer from excessively high collision rates, especially in bidirectional and multidirectional flows. In this paper, we establish a novel data-driven crowd simulation model that incorporates the pedestrian collision mechanism into the loss function to reduce collisions. A new lateral-acceleration-based collision loss function and a Voronoi-based motion feature extraction approach are proposed. The model is based on a Generative Adversarial Network (GAN) architecture and is termed CPGAN (Collision-Penalized GAN). We evaluate CPGAN in bidirectional flow scenarios, which involve frequent collision avoidance behaviors. Results show that the proposed lateral-acceleration-based collision loss significantly reduces opposite-direction pedestrian collision rates to levels comparable with controlled experiments. CPGAN effectively simulates bidirectional flow, reproducing lane formation and N-t curves. The research outcomes can provide inspiration for integrating pedestrian dynamics mechanisms into loss functions in data-driven crowd simulation.

2605.31204 2026-06-01 cs.CV

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers

基于整流流Transformer的概率降水临近预报

Johannes Schusterbauer, Jannik Wiese, Nick Stracke, Timy Phan, Björn Ommer

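AI summary: Proposes FREUD, which combines a frame-wise encoder and a unified decoder based on rectified flow transformers to compress spatio-temporal weather data efficiently while preserving uncertainty, achieving state-of-the-art precipitation nowcasting on the SEVIR benchmark.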

详情
Comments
CVPR 2026, Project Page: https://compvis.github.io/weather-rf/
AI中文摘要

准确的天气预报在各个领域都至关重要,在极端天气条件下更是安全关键。与基于模拟的预报相比,数据驱动方法显示出更高的效率,能够实现短期、高分辨率的临近预报。特别是,扩散模型因其强大的概率基础在天气临近预报中被证明有效。然而,现有方法依赖于确定性压缩来降低高维天气数据的复杂性,限制了它们在解码过程中捕捉不确定性的能力。在这项工作中,我们引入了$\textbf{FREUD}$,一个基于整流流Transformer的$\textbf{Fr}$ame-wise $\textbf{E}$ncoder和$\textbf{U}$nited $\textbf{D}$ecoder模型,用于高效压缩时空天气数据。帧级编码支持连续预报更新,而统一视频解码器确保时间一致性。我们保留不确定性的第一阶段允许通过集成捕捉偶然不确定性,这对于解码变异性高的极端天气事件特别有利。我们在SEVIR基准上使用紧凑的潜在空间整流流Transformer实现了降水临近预报的最新性能,并通过模型和测试时缩放进一步展示了性能提升。代码见:https://github.com/CompVis/weather-rf

英文摘要

Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions. Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting. In particular, diffusion models proved effective in weather nowcasting due to their strong probabilistic foundation. However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process. In this work, we introduce $\textbf{FREUD}$, a $\textbf{Fr}$ame-wise $\textbf{E}$ncoder and $\textbf{U}$nited $\textbf{D}$ecoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty via ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by model and test-time scaling. Code available here: https://github.com/CompVis/weather-rf

2605.31201 2026-06-01 cs.CL

Learning Whom to Trust: Market-Feedback Adaptive Retrieval for Frozen LLMs in Event-Driven Financial RAG

学习信任谁:事件驱动金融RAG中冻结大语言模型的市场反馈自适应检索

Zijie Zhao, Roy E. Welsch

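AI summary: For event-driven financial RAG, adapts the retrieval layer through an external Bayesian source memory updated from market feedback, improving prediction and portfolio performance while the LLM stays frozen.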

详情
AI中文摘要

金融检索增强生成(RAG)系统通常按文本相关性对证据排序,但在金融市场中,有用的证据来源取决于事件类型、预测期限和市场背景。我们将新闻触发的事件影响预测作为一个时间点金融RAG问题进行研究。对于每个公司-新闻锚点,系统检索相关的金融新闻和SEC文件段落,附加决策前市场背景卡片,并预测多期限残差收益信号。我们的方法保持大语言模型(LLM)阅读器冻结,通过外部贝叶斯源记忆(根据已成熟的残差收益反馈更新)自适应检索层。在源自FinRL-DeepSeek/FNSPID任务的固定89只纳斯达克股票池上,使用原始FNSPID新闻和时间点EDGAR文件段落,与无记忆的冻结阅读器相比,带源记忆的冻结阅读器将留出宏F1从0.438提升至0.471,下游投资组合夏普比率从0.52提升至0.84。有监督的LoRA阅读器对静态RAG有适度改进,但未超过冻结源记忆阅读器。这些结果表明,对于金融RAG,学习从何处检索与学习如何阅读同等重要,提供了一种简单、模块化的市场反馈适应途径。

英文摘要

Financial retrieval-augmented generation (RAG) systems typically rank evidence by textual relevance, but in financial markets the useful evidence source depends on event type, forecast horizon, and market context. We study news-triggered event-impact prediction as a point-in-time financial RAG problem. For each company-news anchor, the system retrieves related financial news and SEC filing passages, appends a pre-decision market-context card, and predicts multi-horizon residual-return signals. Our method keeps the large language model (LLM) reader frozen and adapts the retrieval layer through an external Bayesian source memory updated from matured residual-return feedback. On a fixed 89-stock Nasdaq-oriented universe derived from the FinRL-DeepSeek/FNSPID task, using original FNSPID news and point-in-time EDGAR filing passages, Frozen Reader with Source Memory improves held-out macro-F1 from 0.438 to 0.471 and downstream portfolio Sharpe from 0.52 to 0.84 relative to Frozen Reader with No Memory. A supervised LoRA reader improves static RAG modestly, but does not improve over the frozen source-memory reader. These results suggest that, for financial RAG, learning where to retrieve from can be as important as learning how to read, offering a simple, modular route to market-feedback adaptation.
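
The abstract does not specify how the Bayesian source memory is parameterised, so the following is only a minimal sketch of one way such a memory could work, assuming a Beta-Bernoulli belief per evidence source and a binary "was this source useful" signal derived from matured feedback. The class name `SourceMemory`, its methods, and the rule of multiplying textual relevance by posterior trust are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SourceMemory:
    """Hypothetical Beta-Bernoulli belief about how often each source proved useful."""

    prior_alpha: float = 1.0
    prior_beta: float = 1.0
    counts: dict = field(default_factory=dict)

    def _params(self, source):
        # Unseen sources fall back to the prior.
        return self.counts.get(source, (self.prior_alpha, self.prior_beta))

    def update(self, source, was_useful):
        """Fold one matured feedback signal into the posterior for a source."""
        alpha, beta = self._params(source)
        if was_useful:
            alpha += 1.0
        else:
            beta += 1.0
        self.counts[source] = (alpha, beta)

    def trust(self, source):
        """Posterior mean probability that the source is useful."""
        alpha, beta = self._params(source)
        return alpha / (alpha + beta)

    def rerank(self, candidates):
        """Sort (passage_id, source, relevance) tuples by relevance times trust."""
        return sorted(candidates, key=lambda c: c[2] * self.trust(c[1]), reverse=True)
```

Because the memory sits outside the reader, updating it needs no gradient step on the LLM, which matches the frozen-reader setting the abstract describes.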

2605.31199 2026-06-01 cs.CR cs.AI

MAECO-Lite: Modular Ontology for Dynamic Malware Analysis

MAECO-Lite:动态恶意软件分析的模块化本体

Zekeri Adams, Peter Švec, Ján Kľuka, Roderik Ploszek, Monday Onoja, Štefan Balogh, Martin Homola

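AI summary: To address the conflation of artifacts and events in MAEC and STIX for dynamic malware analysis, conducts an ontological analysis grounded in the Unified Foundational Ontology (UFO) and proposes MAECO-Lite, a lightweight ontology whose modular structure separates enduring entities from runtime events, improving semantic clarity and computational usability.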

详情
AI中文摘要

以实用且语义精确的方式捕获动态恶意软件行为仍然是网络威胁情报中的一个重大挑战。尽管MAEC和STIX等标准提供了广泛采用的词汇表来描述恶意软件工件和观测结果,但它们以相当复杂的结构表示数据,往往掩盖了重要的本体论区分。特别是,它们倾向于将持久的恶意软件工件与执行期间生成的事件混为一谈,从而模糊了本体设计基础标准中的核心区分。在本文中,我们以统一基础本体(UFO)为理论视角,对与动态恶意软件分析相关的核心MAEC和STIX构造进行了基础本体分析。我们的分析揭示了由于MAEC和STIX中工件、倾向和运行时事件的混淆而产生的一些本体论不匹配,这些不匹配使动态恶意软件行为的一致表示复杂化,并从实践角度限制了推理执行轨迹的能力。基于这些见解,我们提出了MAECO-Lite,一种轻量级本体,旨在表示数据并实现其处理以用于动态恶意软件分析。该本体采用模块化结构,以样本、进程、动作、系统工件和MITRE ATT&CK技术为中心,同时保持持久实体和运行时事件之间的清晰分离。使用描述逻辑概念学习算法的初步评估表明,简化的本体显著提高了学习性能,证明了基于本体的建模可以增强语义清晰度和计算可用性。

英文摘要

Capturing dynamic malware behavior in a practical but still semantically precise manner remains a significant challenge in cyber threat intelligence. While standards such as MAEC and STIX provide widely adopted vocabularies for describing malware artifacts and observations, they represent data with considerable complexity in structures that often obscure important ontological distinctions. In particular, they tend to conflate enduring malware artifacts with the events generated during execution, thereby flattening distinctions that are central in foundational standards for ontology design. In this paper, we conduct a foundational ontological analysis of core MAEC and STIX constructs relevant to dynamic malware analysis relying on Unified Foundational Ontology (UFO) as a theoretical lens. Our analysis reveals some ontological mismatches arising from the conflation of artifacts, dispositions, and runtime events in MAEC and STIX that complicate coherent representation of dynamic malware behavior and, from a practical perspective, limit the ability to reason about execution traces. Based on these insights, we propose MAECO-Lite, a lightweight ontology designed to represent data and operationalize their processing for dynamic malware analysis. The ontology adopts a modular structure centered on samples, processes, actions, system artifacts, and MITRE ATT&CK Techniques, while maintaining a clear separation between enduring entities and runtime events. An initial evaluation using description logic concept learning algorithms shows that the simplified ontology significantly improves learning performance, demonstrating that ontologically grounded modelling can enhance both semantic clarity and computational usability.

2605.31196 2026-06-01 cs.CV cs.AI cs.CL cs.RO

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

探索视觉-语言模型中的碰撞接地以实现安全的人机协作

Jun Wang, Xiaohao Xu, Xiaonan Huang

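AI summary: For safe human-robot collaboration, introduces the notion of collision grounding and the physics-grounded benchmark TouchSafeBench, evaluating vision-language models on classifying the current safety state and warning of imminent collisions; current models prove unreliable, and visual fluency does not imply physical accountability.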

详情
Comments
31 pages, 9 figures
AI中文摘要

安全的人机协作需要的不仅仅是视觉描述:监控器必须确定机器人身体是否安全分离、已经与场景或人发生碰撞,或即将碰撞。我们将这种能力称为碰撞接地:将视觉观察与机器人身体几何、相机视角、场景布局、人体接近度和时间运动相结合,以推断当前和即将发生的接触。我们引入了TouchSafeBench,一个基于物理的基准,用于评估视觉-语言模型(VLM)中的碰撞接地能力。TouchSafeBench基于Habitat 3.0构建,包含2,940个模拟室内共现场景,涵盖社交导航和社交重排,具有同步的多视角RGB-D观测、自上而下的轨迹地图、校准的相机元数据和模拟器导出的接触标签。我们研究了两个面向部署的任务:分类当前安全状态和在接触前预警即将发生的碰撞。在三个前沿或面向机器人的VLM和九种视觉表示中,当前模型远未达到可靠:最佳平均Macro-F1仍低于50%,显式深度不会自动转化为机器人身体碰撞证据,且机器人与场景的接触始终比人与人的接触风险更难。TouchSafeBench揭示了具身VLM的一个核心限制:视觉流畅性并不意味着物理责任性。可靠的机器人安全监控器需要能够显式绑定视角、机器人形态、度量几何和未来碰撞的表示。我们将在论文被接收后发布该基准。

英文摘要

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.

2605.31193 2026-06-01 cs.LG

Geometry-based Schrödinger Bridges for Trustworthy Multimodal Fusion

基于几何的薛定谔桥用于可信多模态融合

Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue

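AI summary: Proposes GMF, a geometry-based multimodal fusion method that uses the squared initial velocity of a Diffusion Schrödinger Bridge as a prediction-independent reliability signal to improve robustness to low-quality data.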

详情
Comments
ICML 2026 accepted paper
AI中文摘要

现实世界的多模态系统必须对低质量数据具有鲁棒性,例如传感器噪声、不完整的多模态数据和冲突输入。然而,现有的可信融合方法依赖模型自身的预测置信度来判断数据质量,这造成了循环依赖:当模型自信但错误时,这些方法无法检测到错误。为了打破这一循环,我们提出了基于几何的多模态融合(GMF)。我们不依赖预测,而是通过测量输入在潜在空间中所需的传输校正量来评估可靠性。我们实现了带有整流流的扩散薛定谔桥传输,其中初始速度的平方提供了一个高效的学习校正分数。有效数据具有低的平方速度幅度,而噪声、不完整数据或冲突数据需要更强的传输校正。这种基于几何的可靠性信号充当独立判断,即使在分类器被欺骗时也能有效标记不可靠输入。大量实验表明,与基于置信度的基线相比,GMF显著提高了对严重传感器噪声和语义冲突的鲁棒性。

英文摘要

Real-world multimodal systems must be robust against low-quality data, such as sensor noise, incomplete multimodal data, and conflicting inputs. However, existing trustworthy fusion methods rely on the model's own prediction confidence to judge data quality. This creates a circular dependency: when a model is confident but wrong, these methods fail to detect the error. To break this loop, we propose Geometry-based Multimodal Fusion (GMF). Instead of relying on predictions, we evaluate reliability by measuring how much transport correction the input needs in latent space. We implement Diffusion Schrödinger Bridge transport with Rectified Flow, where the squared initial velocity gives an efficient learned correction score. Valid data has low squared velocity magnitude, while noisy, incomplete, or conflicting data requires stronger transport correction. This geometry-based reliability signal acts as an independent judge, effectively flagging unreliable inputs even when the classifier is fooled. Extensive experiments demonstrate that GMF significantly improves robustness against severe sensor noise and semantic conflicts compared to confidence-based baselines.

2605.31192 2026-06-01 cs.CV

The Regularizing Power of Language-Training Deepfake Detectors

语言训练深度伪造检测器的正则化能力

Benedikt Hopf, Zongwei Wu, Radu Timofte

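AI summary: Proposes a dual-encoder architecture built on a multimodal LLM with two-stage training, using language as a regularizer to mitigate overfitting and improve the generalization and interpretability of deepfake detection.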

详情
AI中文摘要

最近,得益于多模态大语言模型的出现,深度伪造检测器不仅追求泛化性,还追求可解释性。我们提出这两个挑战可以有效地联合解决,因为可描述的伪影通常泛化性更好,从而开辟了使用语言作为正则化机制的可能性。由于深度伪造检测通常过拟合于低层次的领域特定伪影,我们的直觉是,经过语言预训练的LLM会更偏好于可更好描述的高层次伪影。这样,我们可以在可能的情况下使用高层次特征,同时训练模型在必要时使用低层次特征。我们利用双编码器架构,将冻结的专家检测器与LoRA调优的MLLM编码器配对,并采用两阶段训练课程:首先,二元对齐阶段表明,MLLM的内在能力可以有效地组合特征,以减轻对数据集特定伪影的过拟合。为了进一步增强泛化性并实现可解释性,我们采用强化学习阶段,鼓励模型在分类前生成描述性推理,仅使用二元标签。通过奖励这种“先解释后分类”的行为,我们明确激励模型优先考虑高层次、鲁棒的特征。关键在于,这一过程既产生了可解释的描述,又进一步提升了跨数据集性能,即使在推理时省略推理链也是如此。在基准数据集上的大量实验验证了我们的方法,以较大优势超越了最先进的方法。

英文摘要

Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.

2605.31191 2026-06-01 cs.LG cs.CV

Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10

学生容量调节知识蒸馏有效性:基于CIFAR-10上ResNet教师-学生对的系统研究

Umut Onur Yasar

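AI summary: Through image classification experiments with ResNet teacher-student pairs on CIFAR-10, systematically studies how student capacity moderates knowledge distillation (KD) effectiveness, finding that student capacity is a key moderator of distillation gain and highlighting the importance of implementation correctness and input-resolution-aware architecture.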

详情
Comments
9 pages, 2 figures, 5 tables. Code available at https://github.com/umutonuryasar/kd-capacity-gap
AI中文摘要

我们研究了教师-学生容量关系如何调节基于ResNet的CIFAR-10图像分类中知识蒸馏(KD)的有效性。在三个教师-学生对(R50->R18、R34->R18和R50->R34)中,我们在受控、可重复的条件下(3个种子,全程报告均值±标准差)比较了Logit-KD和Feature-KD。我们报告三个主要发现。首先,学生容量是蒸馏增益的关键调节因素:即使教师-学生准确率差距相当,R34学生从KD中获得的收益也远大于R18学生,R50->R34 Feature-KD的最大增益为+0.30个百分点,而R34->R18 Feature-KD为+0.18个百分点,R34->R18 Logit-KD为+0.00个百分点。其次,实现的正确性对Feature-KD至关重要:一个排除了投影层的梯度裁剪错误抑制了Feature-KD的性能,并产生了与Logit-KD的误导性比较。修正后,Feature-KD在三个对中的两个上匹配或优于Logit-KD,在R50->R34上达到95.55%,基线为95.25%。第三,输入分辨率感知架构是有效蒸馏的先决条件:将ResNet主干修正为32x32输入使教师准确率提高超过5个百分点——比任何KD增益高出一个数量级。所有代码和结果可在github.com/umutonuryasar/kd-capacity-gap获取。

英文摘要

We investigate how teacher-student capacity relationships modulate knowledge distillation (KD) effectiveness in ResNet-based image classification on CIFAR-10. Across three teacher-student pairs -- R50->R18, R34->R18, and R50->R34 -- we compare Logit-KD and Feature-KD under controlled, reproducible conditions (3 seeds, mean+/-std reported throughout). We report three main findings. First, student capacity is a key moderating factor in distillation gain: R34 students benefit substantially more from KD than R18 students even when teacher-student accuracy gaps are comparable, with the strongest gain of +0.30pp observed for R50->R34 Feature-KD versus +0.18pp for R34->R18 Feature-KD and +0.00pp for R34->R18 Logit-KD. Second, implementation correctness critically affects Feature-KD: a gradient clipping bug that excluded projection layers suppressed Feature-KD performance and produced misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of three pairs, reaching 95.55% on R50->R34 against a baseline of 95.25%. Third, input-resolution-aware architecture is a prerequisite for effective distillation: correcting the ResNet stem for 32x32 inputs raises teacher accuracy by over 5pp -- an order of magnitude larger than any KD gain. All code and results are available at github.com/umutonuryasar/kd-capacity-gap.
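
For readers unfamiliar with Logit-KD, the sketch below shows the widely used formulation: a KL divergence between temperature-softened teacher and student outputs, scaled by the squared temperature. It is a generic illustration rather than code from the linked repository; the function names and the default temperature of 4.0 are assumptions, and the abstract does not state the hyperparameters actually used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, multiplied by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

In practice this term is usually mixed with the ordinary cross-entropy on the true labels. Feature-KD instead matches intermediate activations, typically through a projection layer of the kind the abstract mentions.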

2605.31189 2026-06-01 cs.LG

FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction

FlagGAM:基于规则的可解释表格预测广义加性模型

Zijie Zhao, Roy E. Welsch

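AI summary: Proposes FlagGAM, a framework of rule-defined basis functions that separates feature-level rule construction from prediction, improving robustness to imperfect inputs while preserving interpretability.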

详情
AI中文摘要

在高风险领域的表格预测中,需要准确、透明且对不完美输入鲁棒的模型。我们提出FlagGAM,一个规则定义的基函数框架,将特征级规则构建与预测分离。Flag核心模块将数值和分类变量转换为稀疏、可读的单变量基函数,包括阈值标志、类别级标志、尾部偏差基和分类阶跃函数;默认的加性头部随后将这些基函数组合为受限的GAM风格预测器。FlagGAM不是将触发的规则简化为紧凑的计数摘要,而是保留稀疏的规则基矩阵,支持混合类型分类和回归、特征特定权重以及可选的灵活预测头部。在表格基准测试中,默认FlagGAM在透明加性模式下接近EBM,在混合类型回归上显著优于岭回归,并在缺失和噪声扰动下显示出比常见基线更小的AUROC下降。灵活头部进一步提高了准确性,接近强树基线,但需要注意,所得模型应解释为规则基表示后接非线性预测器,而非完全加性GAM。总体而言,FlagGAM为需要竞争性准确性、可传达规则和对不完美输入鲁棒性的表格设置提供了实用的中间地带。

英文摘要

Tabular prediction in high-stakes domains requires models that are accurate, transparent, and robust to imperfect inputs. We propose FlagGAM, a rule-defined basis framework that separates feature-level rule construction from prediction. A Flag Core Module converts numerical and categorical variables into sparse, human-readable univariate bases, including threshold flags, category-level flags, tail-deviation bases, and categorical step functions; a default additive head then combines these bases as a restricted GAM-style predictor. Rather than reducing triggered rules to compact count summaries, FlagGAM retains a sparse rule-basis matrix that supports mixed-type classification and regression, feature-specific weighting, and optional flexible prediction heads. Across tabular benchmarks, default FlagGAM remains close to EBM in transparent additive mode, improves substantially over ridge regression on mixed-type regression, and shows smaller AUROC degradation than common baselines under missing and noisy perturbations. Flexible heads further improve accuracy and approach strong tree-based baselines, with the caveat that the resulting model should be interpreted as a rule-basis representation followed by a nonlinear predictor rather than as a fully additive GAM. Overall, FlagGAM provides a practical middle ground for tabular settings that require competitive accuracy, communicable rules, and robustness to imperfect inputs.
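
To make the idea of rule-defined bases concrete, here is a small sketch of threshold flags and category-level flags feeding an additive score. It is an illustration under assumptions, not the FlagGAM implementation: the function names, the strict "greater than" rule, and the example thresholds and weights are invented for the example, and the tail-deviation bases mentioned in the abstract are omitted.

```python
def threshold_flags(value, thresholds):
    """One binary flag per threshold: 1 if the value exceeds it, else 0."""
    return [1 if value > t else 0 for t in thresholds]


def category_flags(value, levels):
    """One binary flag per category level: 1 for the matching level, else 0."""
    return [1 if value == level else 0 for level in levels]


def additive_score(flags, weights, intercept=0.0):
    """Additive head: intercept plus the weights of all triggered flags."""
    if len(flags) != len(weights):
        raise ValueError("each flag needs exactly one weight")
    return intercept + sum(w * f for w, f in zip(weights, flags))


def encode_row(age, segment):
    """Build the sparse rule-basis row for one record (hypothetical features)."""
    return threshold_flags(age, [40, 60, 80]) + category_flags(segment, ["A", "B", "C"])
```

Each weight attaches to a single readable rule such as "age above 60", so a prediction can be explained by listing the flags that fired and their contributions.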

2605.31187 2026-06-01 cs.CV cs.LG

From Local Geometry to Global Pseudo Labeling for Robust Positive Unlabeled Learning under Covariate Shift

从局部几何到全局伪标注:协变量偏移下鲁棒的正无标记学习

Firas Gabetni, Alexandre Rocchi Henry, Nacim Belkhir, Ziyi Liu, Gianni Franchi

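AI summary: Proposes SPUNA, a framework that progressively discovers shifted data using local manifold structure, enabling positive-unlabeled learning under covariate shift with performance matching fully supervised methods.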

详情
AI中文摘要

检测协变量偏移对于构建可靠的视觉系统至关重要。虽然大多数先前工作专注于提高对偏移的鲁棒性,但显式检测协变量偏移仍未被充分探索。现有方法通常依赖于全监督训练,需要来自原始分布和偏移分布的有标签样本,这往往不切实际。在本文中,我们表明协变量偏移检测可以通过使用正无标记(PU)学习的弱监督有效解决。然而,在协变量偏移下,分布内数据和偏移数据显著重叠,使得经典PU方法不稳定且对噪声敏感。为克服这一挑战,我们引入了谱PU邻域标注(SPUNA),这是一种几何感知框架,通过利用视觉特征的局部流形结构逐步发现偏移数据。大量实验表明,SPUNA在PU设置中实现了最先进的性能,并且显著匹配了全监督方法的性能。此外,我们的方法在不同类型的偏移之间鲁棒地迁移,展示了强大的泛化能力。

英文摘要

Detecting covariate shift is critical for building reliable vision systems. While most prior work focuses on improving robustness to shift, explicitly detecting covariate shift remains underexplored. Existing approaches typically rely on fully supervised training, requiring labeled examples from both original and shifted distributions, which is often impractical. In this paper, we show that covariate shift detection can be effectively addressed with weaker supervision using Positive Unlabeled (PU) learning. However, under covariate shift, in-distribution and shifted data overlap significantly, making classical PU methods unstable and sensitive to noise. To overcome this challenge, we introduce Spectral PU Neighborhood Annotation (SPUNA), a geometry-aware framework that progressively discovers shifted data by leveraging the local manifold structure of visual features. Extensive experiments show that SPUNA achieves state-of-the-art performance in PU settings and matches the performance of fully supervised methods. Moreover, our approach transfers robustly across different types of shifts, demonstrating strong generalization capabilities.