arXivDaily: arXiv daily academic digest, updated Monday to Friday
2504.01250 2026-06-03 cs.LG cs.SY eess.SY

R2DN: Scalable Parameterization of Contracting and Lipschitz Recurrent Deep Networks

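Nicholas H. Barbara, Ruigang Wang, Ian R. Manchester

Affiliations: Australian Centre for Robotics; School of Aerospace, Mechanical and Mechatronic Engineering; The University of Sydney

AI summary: This paper proposes the Robust Recurrent Deep Network (R2DN), built as the feedback interconnection of a linear time-invariant system and a 1-Lipschitz deep feedforward network, with weights parameterized directly so that the model is guaranteed to be stable (contracting) and robust to small input perturbations (Lipschitz). Unlike the recurrent equilibrium network (REN), it does not need to iteratively solve an equilibrium layer, which markedly speeds up inference and backpropagation on GPUs. On nonlinear system identification, observer design, and learning-based feedback control, training and inference are up to an order of magnitude faster at comparable performance.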

Abstract

This paper presents the Robust Recurrent Deep Network (R2DN), a scalable parameterization of robust recurrent neural networks for machine learning and data-driven control. We construct R2DNs as the feedback interconnection of a linear time-invariant system and a 1-Lipschitz deep feedforward network, and directly parameterize the weights so that our models are stable (contracting) and robust to small input perturbations (Lipschitz) by design. Our parameterization uses a structure similar to the previously-proposed recurrent equilibrium network (REN), but without the requirement to iteratively solve an equilibrium layer at each time-step. This speeds up both model inference and backpropagation on GPUs, and makes it computationally feasible to scale up the network size, batch size, and input sequence length in comparison to RENs. We compare R2DNs to RENs on three representative problems in nonlinear system identification, observer design, and learning-based feedback control. We find that training and inference are both up to an order of magnitude faster with similar test set performance, and that they scale more favorably with respect to model expressivity.

2502.02260 2026-06-03 cs.LG cs.CR

Position: Adversarial ML for LLMs Is Not Making Any Progress

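Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr

Affiliations: GitHub; University of California, Berkeley

AI summary: This position paper argues that, in the era of large language models, the problems studied by adversarial machine learning are less clearly defined, harder to solve, and harder to evaluate, so another decade of work may still fail to produce meaningful progress.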

Comments Accepted at ICML 2026 Position Paper Track

Abstract

In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose language models. In this position paper, we argue that the situation is now even worse: in the era of LLMs, the field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate. As a result, we caution that yet another decade of work on adversarial ML may be failing to produce meaningful progress.

2412.01282 2026-06-03 cs.CV cs.AI

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

Affiliations: State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China; Huawei Noah’s Ark Lab, China

AI summary: Proposes Align-KD, which distills the shallow-layer cross-modal alignment knowledge of a teacher model to guide a 1.7B student model in learning vision-text matching, improving the average score across 6 benchmarks by 2.0.

Comments CVPR 2025 Paper

Abstract

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the need for capable artificial intelligence on mobile devices, such as AI assistant software, is also growing. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most existing large-model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods takes into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieves an average score improvement of 2.0 across 6 benchmarks under each of two training subsets. Code is available at: https://github.com/fqhank/Align-KD.

2409.08958 2026-06-03 cs.LG cs.AI physics.comp-ph physics.flu-dyn

PINNfluence: Interpreting PINNs through Influence Functions

Aleksander Krasowski, Jonas R. Naujoks, Moritz Weckbecker, Galip Ü. Yolcu, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, René P. Klausen

Affiliations: Technical University of Munich; Max Planck Institute for Intelligent Systems; University of Tübingen; ETH Zurich

AI summary: Proposes PINNfluence, an influence-function-based training data attribution framework for physics-informed neural networks that enables fine-grained attribution between predictions, loss components, and training data points. Benchmark experiments show that influence patterns distinguish the structural characteristics of well-trained and poorly trained PINNs.

Comments Accepted at ICML 2026

Abstract

Physics-informed neural networks (PINNs) have emerged as a powerful deep learning approach for solving partial differential equations (PDEs) in the physical sciences, yet their behavior remains largely opaque and is typically understood through failure mode analyses rather than explicit interpretability. To address this issue, we introduce PINNfluence, a training data attribution framework for interpreting PINNs based on influence functions. By extending influence functions to composite physics-informed training objectives, we enable fine-grained attribution between predictions, loss components, and training data points. Through benchmark experiments across various PDEs, we demonstrate that influence patterns provide granular diagnostics that distinguish structural characteristics across well-trained and poorly-trained PINNs. PINNfluence thus opens a new avenue for understanding and improving the reliability of PINNs through the lens of their data.
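
For orientation, the classical influence-function estimate that attribution frameworks of this kind typically start from can be written as below. This is the standard single-loss form, given here as an assumption about the starting point; the abstract does not state PINNfluence's exact formulation for composite physics-informed objectives. The notation is illustrative: $\hat\theta$ denotes the trained parameters, $L$ a per-point loss, and $H_{\hat\theta}$ the Hessian of the average training loss over $n$ points.

```latex
\mathcal{I}(z_{\text{train}}, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
    H_{\hat\theta}^{-1}\,
    \nabla_\theta L(z_{\text{train}}, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

Under this convention, a negative value suggests that upweighting the training point would lower the loss at the test point, and a positive value suggests the opposite.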

2411.15851 2026-06-03 cs.CV

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

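Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan

Affiliations: University of Electronic Science and Technology of China

AI summary: Proposes a Residual Cross-correlation Self-attention module and a Semantic Feedback Refinement module, which use cross-correlation attention from intermediate layers to reorganize spatial information and improve CLIP's performance on dense prediction tasks.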

Journal ref: Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 29968-29978

Abstract

While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at https://github.com/yvhangyang/ResCLIP.
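
To make the terminology in this abstract concrete, the following minimal sketch contrasts query-key (cross-correlation) attention weights with query-query and key-key (self-correlation) weights on toy data. It is not the RCS module: the abstract does not say how the intermediate-layer attention is combined with the final block, so that step is left out. The helper names and all numbers are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(a, b):
    """Row-wise softmax of scaled dot products between token vectors a[i] and b[j]."""
    scale = 1.0 / math.sqrt(len(a[0]))
    return [softmax([scale * sum(x * y for x, y in zip(ai, bj)) for bj in b]) for ai in a]

# Toy queries and keys for 3 tokens with 2 channels (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

qk = attention_weights(Q, K)  # cross-correlation: the original query-key attention
qq = attention_weights(Q, Q)  # self-correlation: query-query
kk = attention_weights(K, K)  # self-correlation: key-key
```

On this toy input the three weight matrices differ: under query-key attention token 0 puts most weight on token 1, while under query-query attention it puts more weight on itself than on token 1.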

2410.14573 2026-06-03 cs.LG cs.AI

Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability

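Nazanin Nezami, Hadis Anahideh

Affiliations: University of Illinois Chicago

AI summary: Proposes IEMSO, a set of model-agnostic metrics in four categories (sampling core, batch properties, optimization process, and feature importance) that improve the transparency and explainability of surrogate optimization methods.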

Abstract

Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world applications. Surrogate Optimization (SO) is a common resolution, yet its proprietary nature introduced by the complexity of surrogate models and the sampling core (e.g., acquisition functions) often leads to a lack of explainability and transparency. While existing literature has primarily concentrated on enhancing convergence to global optima, the practical interpretation of newly proposed strategies remains underexplored, especially in batch evaluation settings. In this paper, we propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO), a comprehensive set of model-agnostic metrics designed to enhance the transparency, trustworthiness, and explainability of SO approaches. Through these metrics, we provide both intermediate and post-hoc explanations to practitioners before and after performing expensive evaluations to gain trust. We consider four primary categories of metrics, each targeting a specific aspect of the SO process: Sampling Core Metrics, Batch Properties Metrics, Optimization Process Metrics, and Feature Importance. Our experimental evaluations demonstrate the significant potential of the proposed metrics across different benchmarks.

2310.10322 2026-06-03 cs.CL

Evaluating the Reversal Curse in Model Editing

Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Quan Liu, Cong Liu, Jia-Chen Gu

Affiliations: National Engineering Research Center of Speech and Language Information Processing; University of Science and Technology of China; iFLYTEK Research; University of California, Los Angeles

AI summary: This paper studies bidirectional language model editing, proposes a reverse generalization metric, and builds the BAKE benchmark. It finds that most editing methods show systematic deficiencies under reverse-direction evaluation, and it analyzes the causes of the reversal curse and strategies to mitigate it.

Comments Accepted by TMLR

Abstract

Large language models (LLMs) are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource-intensive, there has been a growing interest in model editing. Despite the emergence of benchmarks and approaches, existing unidirectional editing and evaluation paradigms have failed to explore the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation to assess if edited LLMs can recall the editing knowledge bidirectionally. A metric of reverse generalization is introduced and a benchmark dubbed Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate if post-edited models can recall the edited knowledge in the reverse direction of editing. We conduct extensive experiments using a variety of editing methods and LLMs. The results show that while most editing methods are able to accurately recall editing facts along the modification direction, they exhibit substantial systematic deficiencies when evaluated in the reverse direction. To further investigate the underlying causes of the reversal curse and to explore potential strategies for mitigation, a detailed analysis is conducted from three perspectives. Our findings reveal that although In-Context Learning (ICL) can mitigate the reversal curse to a certain extent, it lacks continuity, is limited by the input length, and may introduce hallucinations. Therefore, combining the advantages of ICL and other editing methods is a promising direction for developing new editing paradigms.

2407.18428 2026-06-03 cs.LG cs.AI cs.CV

Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift

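Gina Wong, Joshua Gleason, Rama Chellappa, Yoav Wald, Anqi Liu

Affiliations: Johns Hopkins University; University of Maryland, College Park; New York University; Center for Data Science

AI summary: To address the poor performance of existing invariant learning methods under invariant covariate shift, proposes the Weighted Risk Invariance (WRI) framework, which imposes invariance of the loss across environments with reweighted training examples. WRI provably learns invariant models in linear-Gaussian settings and outperforms prior methods in experiments.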

Journal ref: TMLR 2024

Abstract

Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an invariant covariate shift. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models (even for data generated from simple and well-studied linear-Gaussian models) or having poor finite-sample performance. To alleviate these problems, we propose weighted risk invariance (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e., discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.

2407.11821 2026-06-03 cs.AI

Approximating Probabilistic Inference in Statistical EL with Knowledge Graph Embeddings

Yuqicheng Zhu, Nico Potyka, Bo Xiong, Trung-Kien Tran, Mojtaba Nayyeri, Evgeny Kharlamov, Steffen Staab

Affiliations: Bosch Center for AI; University of Stuttgart; Cardiff University; Stanford University; University of Oslo; University of Southampton

AI summary: This paper proposes using knowledge graph embeddings to efficiently approximate probabilistic inference in Statistical EL, and provides proofs of runtime and soundness guarantees along with an experimental evaluation.

Comments Accepted at UAI 2026

Abstract

Statistical information is ubiquitous but drawing valid conclusions from it is prohibitively hard. We explain how knowledge graph embeddings can be used to approximate probabilistic inference efficiently using the example of Statistical EL (SEL), a statistical extension of the lightweight Description Logic EL. We provide proofs for runtime and soundness guarantees, and empirically evaluate the runtime and approximation quality of our approach.

2407.05312 2026-06-03 cs.CV

An Improved Method for Personalizing Diffusion Models

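Yan Zeng, Masanori Suganuma, Takayuki Okatani

Affiliations: Graduate School of Information Sciences, Tohoku University; RIKEN Center for AIP

AI summary: Proposes a diffusion model personalization method that retains the model's original knowledge while integrating new information, giving better results than Dreambooth and textual inversion with less training time.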

Abstract

Diffusion models have demonstrated impressive image generation capabilities. Personalized approaches, such as textual inversion and Dreambooth, enhance model individualization using specific images. These methods enable generating images of specific objects based on diverse textual contexts. Our proposed approach aims to retain the model's original knowledge during new information integration, resulting in superior outcomes while necessitating less training time compared to Dreambooth and textual inversion.

2405.03386 2026-06-03 cs.LG

Annot-Mix: Learning with Noisy Class Labels from Multiple Annotators via a Mixup Extension

Marek Herde, Lukas Lührs, Denis Huseljic, Bernhard Sick

Affiliations: University of Kassel; European Conference on Artificial Intelligence; Conference on Prestigious Applications of Intelligent Systems

AI summary: Proposes the Annot-Mix framework, which extends mixup to handle class labels provided by multiple annotators and outperforms 11 existing methods on 11 datasets.

Comments 9 pages, 8 figures, 4 tables; post-publication arXiv version with minor editorial corrections; methodology, results, and conclusions unchanged

Journal ref: ECAI 2024: 27th European Conference on Artificial Intelligence, IOS Press, pp. 2910-2918, 2024

Abstract

Training with noisy class labels impairs neural networks' generalization performance. In this context, mixup is a popular regularization technique to improve training robustness by making memorizing false class labels more difficult. However, mixup neglects that multiple annotators, e.g., crowdworkers, typically provide class labels. Therefore, we propose an extension of mixup, which handles multiple class labels per instance while considering which class label originates from which annotator. Integrated into our multi-annotator classification framework annot-mix, it outperforms eleven (mostly state-of-the-art) approaches in an evaluation study with eleven datasets comprising noisy class labels from either human or simulated annotators. Our code is publicly available through our GitHub repository at https://github.com/ies-research/multi-annotator-machine-learning/tree/annot-mix
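
As background for this abstract, the sketch below shows plain mixup only: a convex combination of two inputs and of their one-hot labels. It does not implement the annotator-aware extension that annot-mix adds, which the abstract does not specify. The function name and the toy data are hypothetical.

```python
def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two inputs and their one-hot labels (standard mixup)."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Two toy instances with 3 features and 2 classes (made-up numbers).
x_a, y_a = [1.0, 2.0, 3.0], [1.0, 0.0]
x_b, y_b = [3.0, 2.0, 1.0], [0.0, 1.0]

# In practice lam is usually drawn from a Beta(alpha, alpha) distribution,
# e.g. random.betavariate(alpha, alpha); it is fixed here for reproducibility.
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam=0.75)
```

Because the mixed label is soft, a single wrong label contributes only part of the training target, which is one intuition for why mixup makes memorizing false labels harder.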

2403.19883 2026-06-03 cs.AI

Planning with Uncertainty: Symmetries, Policy Inference, and Solution Compression

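Frederico Messa, André Grahl Pereira

Affiliations: INF/UFRGS

AI summary: This paper presents techniques for FOND planning based on explicit best-first policy-space search: it defines equivalence relations between policies, uses group theory to compute state symmetries, infers solution policies in polynomial time, and uses integer programming to compress solutions into partial-state policies, making the approach competitive with the state of the art.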

Abstract

Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. In this work, we present a collection of techniques that establish explicit best-first policy-space search as a method competitive with the state of the art for solving FOND planning tasks. We study how to define equivalence relations between policies, allowing part of the search space to be pruned. We show it is possible to use group theory techniques to effectively compute canonical symmetries between states. We also present two contributions that go beyond just policy-space search: we present a procedure that infers in polynomial time a solution policy function given just the specification of its domain set, and an integer-programming formulation procedure that, given a solution policy defined over complete states, yields a set of resource-efficient models that are capable of finding a partial-state policy that represents it unambiguously with the fewest partial states possible.

2303.15619 2026-06-03 cs.CL cs.AI

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

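Muhammed Shahir Abdurrahman, Hashem Elezabi, Bruce Changlong Xu

Affiliations: Department of Computer Science, Stanford University

AI summary: This paper proposes Typhoon, an adaptive masking strategy based on task-loss gradients, compares it with random masking and whole-word masking on GLUE tasks, and finds no significant advantage under rigorous evaluation.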

Abstract

The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We study masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget, under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks, MRPC and CoLA, across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 F1, across all twelve Typhoon comparisons no paired test reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result (gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale), and we describe a clean modern reimplementation to support follow-up work.
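
The two bookkeeping steps named in this abstract, an exponential moving average of per-token-type saliency and a calibration to a target masking rate, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the saliency values are made up instead of being computed from gradients, and the calibration rule (scale by a constant, clip at 1, solve for the constant by bisection) is only one plausible choice. All function names are hypothetical.

```python
def ema_update(saliency, observed, decay=0.9):
    """Exponential moving average of per-token-type saliency scores."""
    for token, value in observed.items():
        previous = saliency.get(token, value)  # first observation initializes the average
        saliency[token] = decay * previous + (1.0 - decay) * value
    return saliency

def calibrate(scores, target_rate, iters=60):
    """Scale scores by c so that mean(min(1, c * s)) is close to target_rate (bisection on c)."""
    positive = [s for s in scores if s > 0.0]
    if not positive or target_rate > len(positive) / len(scores):
        raise ValueError("target rate not reachable from these scores")
    lo, hi = 0.0, 1.0 / min(positive)  # at c = hi every positive score is clipped to 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        rate = sum(min(1.0, mid * s) for s in scores) / len(scores)
        if rate < target_rate:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    return [min(1.0, c * s) for s in scores]

# Toy saliency values for four token types (made-up numbers, not real gradients).
saliency = ema_update({}, {"the": 0.1, "not": 0.9, "film": 0.5, "good": 0.7})
saliency = ema_update(saliency, {"not": 1.1, "good": 0.3})

sequence = ["the", "film", "not", "good"]
probs = calibrate([saliency[t] for t in sequence], target_rate=0.15)
```

The expected fraction of masked tokens in the sequence is the mean of the per-token probabilities, so matching that mean to the budget keeps the masking rate comparable to a uniform random baseline while shifting mass toward high-saliency tokens.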

2606.03940 2026-06-03 eess.IV cs.CV cs.LG cs.RO

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

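Dan Jacobellis, Neeraja J. Yadwadkar

Affiliations: Department of Electrical and Computer Engineering; The University of Texas at Austin

AI summary: Proposes the SEAOTTER framework, which combines a sensor-embedded autoencoder with a learnable JPEG transcode. At a 200:1 compression ratio it encodes 7 times faster and decodes 3.5 times faster than AVIF and improves ImageNet top-1 accuracy by 8%, while remaining compatible with JPEG infrastructure.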

Abstract

In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter.

2606.03455 2026-06-03 eess.AS cs.SD

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

Wenxi Chen, Dongya Jia, Yushen Chen, Zhikang Niu, Yuzhe Liang, Xiquan Li, Ruiqi Yan, Ziyang Ma, Guanrou Yang, Sanyuan Chen, Yue Wang, Zhuo Chen, Kai Yu, Xie Chen

Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; ByteDance Seed

AI summary: Proposes WavTTS, the first raw-waveform generative TTS model, built on flow matching and a Diffusion Transformer. It models waveforms directly through a simple patchification strategy with multi-scale mel-spectrogram supervision, and approaches the performance of latent-space generative models in zero-shot TTS.

Abstract

Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon flow matching with a Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
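
The patchification idea mentioned in this abstract can be pictured as reshaping a long 1-D sample sequence into a much shorter sequence of fixed-size patches. The sketch below is a guess at the general shape of such a step, not the WavTTS code: the 24 kHz sample rate, the patch size of 256, the zero-padding, and the function names are all assumptions.

```python
def patchify(waveform, patch_size):
    """Split a 1-D waveform into consecutive non-overlapping patches, zero-padding the tail."""
    remainder = len(waveform) % patch_size
    padded = list(waveform) + [0.0] * ((patch_size - remainder) % patch_size)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]

def unpatchify(patches, length):
    """Concatenate patches back into a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]

# One second of audio at an assumed 24 kHz sample rate (a ramp standing in for speech).
sample_rate = 24000
waveform = [i / sample_rate for i in range(sample_rate)]

patches = patchify(waveform, patch_size=256)  # 256 is a hypothetical patch size
restored = unpatchify(patches, len(waveform))
```

With these assumed numbers, 24,000 samples become 94 patch tokens, which is the kind of sequence-length reduction that makes a Transformer over raw audio tractable.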

2606.03116 2026-06-03 eess.AS cs.AI cs.SD

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

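Haitao Li, Tian Tan, Yuguang Yang, Shan Yang, Xie Chen

Affiliations: Zhejiang University; Shanghai Innovation Institute; Shanghai Jiao Tong University; Tencent Hunyuan

AI summary: Evaluation of instruction-guided audio generation struggles to decouple complex instructions, lacks interpretability, and misses fine-grained attribute mismatches. This paper proposes a dynamic rubric-based evaluation paradigm that adaptively decomposes audio captions into verifiable binary rubric items, builds a bilingual benchmark of 7,920 samples and a 105K-sample training corpus, and trains a dedicated evaluator with SFT and GRPO, yielding significant gains in zero-shot alignment detection and in instruction alignment for downstream reinforcement learning.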

Abstract

The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.

2606.02906 2026-06-03 eess.IV cs.CV

Depth from Dual Differential Defocus and Stereo Consensus

基于双差分散焦与立体一致性的深度估计

Junjie Luo, Wei Xu, Dylan Chu, Emma Alexander, Qi Guo

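Affiliations: Purdue University; Northwestern University

AI summary: Proposes the D^3S Consensus algorithm, which fuses depth from defocus with stereo vision to estimate depth accurately beyond the depth of field, selects reliable predictions by consensus between physically independent cues, and reaches a comparable working range with a smaller baseline.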

AI中文摘要

我们提出了D^3S Consensus,一种基于物理的闭式算法,它统一了散焦深度(DfD)和立体视觉,在超出相机景深(DoF)的扩展工作范围内实现高精度深度估计。给定一对双散焦立体图像,该方法通过一种新颖的DfD理论——双差分散焦(D^3)和(S)立体耦合方式,估计一组超定深度。然后,通过在这些物理独立线索之间强制执行一致性,从该组中选择最可信的深度预测,以拒绝不可靠的估计。分析表明,在相同误差容限下,D^3S与先前的基于三角测量的深度估计系统相比,以10倍小的基线实现了可比的工作范围。这使得紧凑的无源双目测距仪具有比传统立体和DfD设计小得多的外形尺寸。我们展示了第一个D^3S原型,其基线仅为4毫米,EFL为12毫米。它通过单次采集生成高达900×1800像素的深度图,在0.3-1.64米范围内平均绝对误差为1厘米。这已经超过了某些具有更大外形尺寸的商用立体相机的报告精度。

英文摘要

We introduce D^3S Consensus, a physics-based, closed-form algorithm that unifies depth-from-defocus (DfD) and stereo to achieve highly accurate depth estimation throughout an extended working range beyond the depth-of-field (DoF) of cameras. Given a pair of dual-defocus stereo images, the method estimates an overdetermined set of depths using a novel DfD theory, Dual Differential Defocus (D^3), and (S)tereo in a coupled fashion. It then picks the most confident depth prediction from the set by enforcing consensus between these physically independent cues to reject unreliable estimates. Analysis shows that D^3S achieves a comparable working range under the same error tolerance with a 10x smaller baseline than previous triangulation-based depth estimation systems. This enables compact passive binocular rangefinders with substantially smaller form factors than conventional stereo and DfD designs. We demonstrate the first D^3S prototype with only a 4 mm baseline and a 12 mm EFL. It generates up to 900 x 1800-pixel depth maps with 1-cm mean absolute error over 0.3-1.64 m from a snapshot acquisition. This surpasses the reported accuracy of certain commercially available stereo cameras with much larger form factors.

2606.02661 2026-06-03 eess.IV cs.AI cs.LG

Learning to Refine: Spectral-Decoupled Iterative Refinement Framework for Precipitation Nowcasting

学习细化:用于降水临近预报的频谱解耦迭代细化框架

Yunlong Zhou, Chen Zhao, Danyang Peng, Fanfan Ji, Xiao-Tong Yuan

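Affiliations: National University of Singapore

AI summary: Proposes the Spectral-Decoupled Iterative Refinement (SDIR) framework, which uses a dual-path design (SFG-Former and FR-Refiner) and a Physically Consistent Power Spectral Density loss to perform progressive frequency-decoupled refinement for precipitation nowcasting within a deterministic framework, eliminating blurring and hallucinations and outperforming existing methods in spatial accuracy while achieving spectral fidelity competitive with diffusion-based methods.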

Comments 21 pages, 10 figures, accepted at ICML 2026

AI中文摘要

准确的降水临近预报对减灾至关重要,但深度学习方法面临关键权衡:回归模型产生过度平滑、频谱衰减的预测,模糊对流细节并违反湍流幂律;扩散模型生成逼真但无锚定的幻觉,缺乏物理基础。我们提出频谱解耦迭代细化(SDIR),一个确定性框架,将临近预报重新表述为渐进频率解耦细化。SDIR首先提取稳定的低频天气尺度骨架,然后在物理约束下迭代细化高频纹理,消除模糊和幻觉。它采用双路径设计:天气尺度频率引导前馈网络(SFG-Former)使用尺度自适应Transformer处理全局结构,傅里叶残差细化器(FR-Refiner)使用尺度条件傅里叶神经算子处理精细残差。具有动态掩蔽的物理一致功率谱密度(PCPSD)损失强制执行湍流一致的频谱分布。在三个基准上的实验表明,SDIR在空间精度上显著优于最先进方法,同时实现了与基于扩散方法竞争的频谱保真度,实现了可靠的高分辨率业务化临近预报。代码链接:https://github.com/RuntimeWarning/SDIR。

英文摘要

Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade-off: regression models produce over-smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral-Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency-decoupled refinement. SDIR first extracts a stable low-frequency synoptic skeleton, then iteratively refines high-frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual-path design: the Synoptic Frequency-Guided Former (SFG-Former) with Scale-Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR-Refiner) with Scale-Conditioned Fourier Neural Operators for fine residuals. A Physically Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces a turbulence-consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA methods in spatial accuracy while achieving spectral fidelity competitive with diffusion-based methods, enabling reliable high-resolution operational nowcasting. Code link: https://github.com/RuntimeWarning/SDIR.

2606.02642 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.MM cs.SD

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

SVHalluc: 音频-视觉大语言模型中的语音-视觉幻觉基准测试

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

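Affiliations: KAIST

AI summary: To study speech-vision hallucination in audio-visual large language models, the paper proposes the SVHalluc benchmark, which evaluates along semantic and temporal dimensions how well models align speech content with visual signals, and finds that existing models are limited in cross-modal understanding.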

Comments Accepted at CVPR 2026

AI中文摘要

尽管音频-视觉大语言模型(LLMs)取得了成功,但它们可能产生看似合理但缺乏依据的输出,即幻觉。现有基准侧重于环境声音(例如狗叫)来指示事件发生。相比之下,人类语音承载着根本不同的、丰富的语义和时间结构,但当前模型能否准确地将语音内容与相应的视觉信号对齐仍未得到探索。在这项工作中,我们表明语音内容可以引发音频-视觉LLMs中的幻觉。为了系统研究这一点,我们引入了SVHalluc,这是第一个用于评估音频-视觉LLMs中语音-视觉幻觉的综合基准。我们的基准从两个关键且互补的方面诊断语音-视觉幻觉:语义和时间。实验结果表明,最先进的开源音频-视觉LLMs难以将语音内容与相应的视觉信号对齐,在多个任务上的准确率接近随机。相比之下,Gemini 2.5 Pro显著优于开源模型。我们的分析表明,它们的失败源于跨模态理解能力有限,尽管在单模态感知方面表现强劲。我们的工作揭示了当前音频-视觉LLMs的一个新的根本性局限,并强调了基于语音的视频理解的需求。项目页面:https://chenshuang-zhang.github.io/projects/svhalluc/。

英文摘要

Despite their success, audio-visual large language models (LLMs) can produce plausible but ungrounded outputs, termed hallucinations. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle to align speech content with corresponding visual signals, with near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.

2606.02639 2026-06-03 eess.IV cs.AI cs.CV

Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

通过AReT:解剖正则化TensoRF从数字重建放射图像进行稀疏视图肺结节体积测量

Spoorthi M, Suja Palaniswamy

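Affiliations: Amrita University

AI summary: Identifies and resolves a problem with the default density shift of TensoRF in X-ray attenuation fields, and proposes AReT, an anatomy-regularized tensorial radiance field framework that achieves stable volumetric reconstruction of lung nodules from only three orthogonal X-ray projections, with high accuracy on the LIDC-IDRI dataset.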

AI中文摘要

我们识别并解决了TensoRF应用于X射线衰减场时一个先前未报告的失败模式:默认密度偏移-10(最初为RGB场景重建引入)抑制了密度梯度,并阻止了稀疏视图医学重建,无论学习率或正则化策略如何。将密度偏移设置为零可恢复梯度流,并仅从三个正交X射线投影实现肺结节的稳定体积重建。在此基础上,我们提出AReT,一个解剖正则化的张量辐射场框架,用于使用LIDC-IDRI数据集(19名患者,放射科医生注释的结节)的冠状、矢状和轴向投影进行肺结节重建。与需要密集多视图采集的现有NeRF方法不同,AReT专为稀疏视图胸部成像设计,并整合了结合L1稀疏性和总变分平滑性的胸部解剖感知正则化。对11种重建策略的系统比较表明,解剖感知正则化始终优于生成先验引导的方法。与放射科医生共识分割相比,AReT在临床可操作的结节(>=10 mm,n=14)上实现了Pearson r=0.983(p<0.0001),中位绝对体积误差为11.4%,接近零的系统偏差为-77.3 mm^3,并且比球形体积近似提高了8.4倍。

英文摘要

We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
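
A minimal sketch of why a large negative shift can starve the density field of gradient. It assumes the density activation is a softplus applied to the raw feature plus the shift, as in the public TensoRF code; the abstract itself does not state the activation, and the function names below are illustrative only.

```python
import math

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Derivative of softplus with respect to its input.
    return 1.0 / (1.0 + math.exp(-x))

def density_and_grad(raw, shift):
    # Assumed activation: density = softplus(raw + shift).
    # Returns the density and d(density)/d(raw).
    return softplus(raw + shift), sigmoid(raw + shift)

for shift in (-10.0, 0.0):
    density, grad = density_and_grad(0.0, shift)
    print(f"shift={shift:+.0f}: density={density:.6f}, gradient={grad:.6f}")
```

At a raw feature of zero, the gradient is roughly 4.5e-5 with a shift of -10 and exactly 0.5 with a shift of zero, which is consistent with the abstract's observation that removing the shift restores gradient flow.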

2606.02631 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.SD

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

小波作为分词器:自然信号共享小波分词方案的初步结果

Shenghao Ding

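Affiliations: Yet Another AI

AI summary: Studies whether audio, images, and video can share a unified wavelet token schema; using a continuous-token model based on Haar DWT/IDWT, it validates the feasibility of a unified token schema on several datasets and analyzes the effects of latent capacity and metadata.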

Comments 12 pages, 3 figures

AI中文摘要

本文研究音频、图像和视频是否可以共享一个共同的小波令牌模式,而不是依赖于各自模态特定的潜在网格。它介绍了一个初步的连续令牌模型,该模型围绕一级Haar DWT/IDWT前端、共享系数令牌布局、可选结构元数据、轻量级模态值适配器和共享的令牌级编码器-解码器主干构建。在Speech Commands、EuroSAT RGB和DAVIS 2017数据上,密集共享模型达到了39.92 dB音频、29.37 dB图像和23.93 dB视频的PSNR。在连续潜在标量预算下的匹配速率扫描表明,视觉增益不能仅由潜在容量解释,同时也表明加性元数据嵌入并非普遍改进来源。最后,固定速率能量选择提供了一个强大的非参数基线:在压缩保留比率下,energy_global相比均匀选择将音频的平均PSNR提高了16.73 dB,图像提高了16.90 dB,视频提高了15.86 dB。掩蔽稀疏训练在50%的密集令牌下达到了34.45 dB的视频PSNR。结果支持统一的 wavelet 令牌模式和稀疏令牌接口,但尚未建立通用的离散词汇表。

英文摘要

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
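
For readers unfamiliar with the frontend, the following is a minimal one-dimensional sketch of a one-level orthonormal Haar DWT/IDWT and of PSNR. The sample signal and the choice to zero all detail coefficients are illustrative assumptions, not the paper's token layout or selection rule.

```python
import math

def haar_dwt(signal):
    # One-level orthonormal Haar transform of an even-length sequence.
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    # Exact inverse of haar_dwt.
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def psnr(reference, estimate, peak=1.0):
    # Peak signal-to-noise ratio in dB for signals with the given peak value.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

x = [0.1, 0.3, 0.8, 0.6, 0.2, 0.2, 0.9, 0.5]  # hypothetical signal in [0, 1]
approx, detail = haar_dwt(x)
lossless = haar_idwt(approx, detail)            # keeps every coefficient
coarse = haar_idwt(approx, [0.0] * len(detail)) # drops all detail coefficients
print(max(abs(a - b) for a, b in zip(x, lossless)))
print(psnr(x, coarse))
```

Keeping all coefficients reconstructs the signal up to floating-point error, while dropping coefficients trades PSNR for fewer tokens, which is the kind of trade-off the keep-ratio experiments measure.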

2606.02615 2026-06-03 eess.AS cs.AI cs.SD

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

FSA-GRPO:训练听觉大语言模型使用少样本示例

Haolong Zheng, Siyin Wang, Xulin Fan, Zengrui Jin, Mark Hasegawa-Johnson

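Affiliations: University of Illinois Urbana-Champaign; Tsinghua University

AI summary: Proposes FSA-GRPO, a reinforcement-learning-based post-training method whose specially designed reward encourages the model to use few-shot demonstrations, strengthening few-shot adaptation and improving children's speech recognition, speech translation, and audio understanding.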

AI中文摘要

少样本提示为将听觉大语言模型适应低资源任务(如儿童语音识别)提供了一种有效方式。然而,大多数听觉大语言模型并未被明确训练以在这种示例条件格式下进行推理,限制了它们从少样本提示中获益的程度。为解决这一局限,我们引入了少样本感知GRPO(FSA-GRPO),一种基于强化学习的后训练方法,使用专门设计的奖励来鼓励模型利用少样本示例,从而增强其少样本适应能力。值得注意的是,仅使用高资源成人ASR数据进行训练即可提升模型的通用少样本适应能力,不仅在儿童语音识别中带来收益,在语音翻译和音频理解中也是如此。我们进一步研究了数据选择和辅助奖励加权,以确定有效的训练方案。实验表明,当域内数据不可用或无法用于训练时,FSA-GRPO比直接对相关域外数据进行微调更有效。

英文摘要

Few-shot prompting provides an effective way to adapt auditory large language models to low-resource tasks such as children's speech recognition. However, most auditory large language models are not explicitly trained to perform inference in this demonstration-conditioned format, limiting the extent to which they can benefit from few-shot prompting. To address this limitation, we introduce Few-Shot Aware GRPO (FSA-GRPO), an RL-based post-training recipe that uses a specially designed reward to encourage the model to leverage few-shot demonstrations, thereby strengthening its few-shot adaptation ability. Notably, training with only high-resource adult ASR data improves the model's general few-shot adaptation ability, yielding gains not only in children's speech recognition but also in speech translation and audio understanding. We further study data selection and auxiliary reward weighting to identify an effective training recipe. Our experiments show that when in-domain data are unavailable or cannot be used for training, FSA-GRPO is more effective than direct tuning on related out-of-domain data.
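
As background on the GRPO part of the recipe, the sketch below shows only the standard group-relative advantage normalization. The few-shot-aware reward is not specified in the abstract, so the reward values here are hypothetical placeholders, and the use of the population standard deviation is an assumption (implementations differ).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each sampled response's reward against its own group:
    # advantage_i = (r_i - mean(group)) / (std(group) + eps).
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive positive advantages and are reinforced; no separate value network is needed for the baseline.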

2606.03878 2026-06-03 stat.ML cs.LG

Privacy-Robust Incrementality Measurement for Advertising Systems under Signal Loss

信号损失下广告系统的隐私鲁棒增量测量

Prashant Shekhar, Caroline Howard

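Affiliations: Department of Mathematics, Embry-Riddle Aeronautical University

AI summary: To handle the signal loss caused by privacy-preserving reporting systems, the paper proposes a robust causal decision framework that projects the observation-compatible experimental worlds onto the incrementality functional and gives a sharp decision frontier, yielding certified, rejected, or unresolved incrementality decisions.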

AI中文摘要

广告平台使用随机提升测试来测量增量,但隐私保护报告系统通过匹配率损失、可链接性损失、归因窗口损失、聚合阈值抑制、随机报告噪声和分段异质信号损失降低观测信号。本文将隐私约束下的广告测量形式化为一个鲁棒因果决策问题,考虑上述信号损失。给定随机实验和隐私引起的退化的模糊集,该框架将观测兼容的干净/未过滤实验世界的纤维投影到增量泛函上,并返回认证、拒绝和未决的决策。主要结果给出了尖锐的决策边界。边界外的报告支持一致有效的认证或拒绝,而边界内的报告包含的信息太少,任何方法都无法一致区分高于阈值的增量与非增量。支持结果给出了有限样本认证、样本复杂度保证、表明信号损失减少有效信息的极小极大下界,以及报告粒度权衡。在200万条Criteo提升数据和6.4万条Hillstrom电子邮件实验中,两个数据集的干净转化提升均为正,分别为0.00112和0.00495。在Criteo中,总体认证在轻度退化下幸存,在Hillstrom中在严重退化下幸存,而两个数据集中所有考虑的有限样本压力设置在同时包含不确定性和报告噪声后仍然未决。总体而言,本研究为隐私感知的增量测量贡献了一个决策理论层,其输出是由退化广告信号证明的最强因果主张。

英文摘要

Advertising platforms use randomized lift tests to measure incrementality, but privacy-preserving reporting systems degrade the observed signal through match-rate loss, linkability loss, attribution-window loss, aggregation-threshold suppression, randomized reporting noise, and segment-heterogeneous signal loss. This paper formulates privacy-constrained advertising measurement as a robust causal decision problem under these forms of signal loss. Given a randomized experiment and an ambiguity set for privacy-induced degradation, the framework projects the observation-compatible fiber of clean/unfiltered experimental worlds onto the incrementality functional and returns certified, rejected, and unresolved decisions. The main result gives a sharp decision frontier. Reports outside the frontier support uniformly valid certification or rejection, whereas reports inside it contain too little information for any method to uniformly distinguish above-threshold incrementality from non-incrementality. Supporting results give finite-sample certification, sample-complexity guarantees, a minimax lower bound showing that signal loss reduces effective information, and a reporting-granularity tradeoff. On 2.0M Criteo Uplift rows and the 64K-row Hillstrom email experiment, clean conversion lift is positive in both datasets, with lifts of 0.00112 and 0.00495, respectively. Population certification survives mild degradation in Criteo and severe degradation in Hillstrom, while all considered finite-sample stress settings in both datasets remain unresolved after simultaneous uncertainty and reporting noise are included. Overall, the research contributes a decision-theoretic layer for privacy-aware incrementality measurement whose output is the strongest causal claim justified by degraded ads signals.

2606.03820 2026-06-03 stat.ML cs.LG

A Quantitative Approximation Framework for Flow Distillation in Diffusion Models

扩散模型中流蒸馏的定量近似框架

Weiguo Gao, Ming Li, Lei Shi, Hanfei Zhou

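Affiliations: School of Mathematical Sciences, Fudan University; Shanghai Key Laboratory of Contemporary Applied Mathematics, Fudan University

AI summary: For flow distillation in diffusion models, proposes a quantitative approximation framework that treats few-step sampling as error propagation under compositions of learned flow maps; theory and experiments show that a stability-balanced non-uniform time grid substantially reduces end-to-end relative MSE.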

AI中文摘要

我们为扩散蒸馏开发了一个定量近似框架,将少步采样视为学习流映射组合下的误差传播。聚焦于概率流ODE的轨迹蒸馏,我们表明局部近似误差在低噪声多模态区域可能被强烈放大,其中底层动力学变得刚性。在一个解析可处理的高斯混合Ornstein--Uhlenbeck设定中,我们分离了两个核心困难:近似时间依赖的分数场和控制由概率流ODE的时间积分Jacobian界决定的动力学放大。在近似方面,我们证明了构造性的L^p(p_t)保证,表明ReLU--ReQU网络随时间一致地近似高斯混合分数,其深度和宽度在目标精度上呈多对数缩放,并显式依赖于混合几何。在稳定性方面,我们推导了概率流速度的空间Lipschitz常数的一个显式界L(t),并将其转化为由∫_s^t L(u)du控制的流映射稳定性估计,使得刚性区域中的后期放大可计算。基于这些估计,我们证明深度残差组合有效近似长时程传输,全局误差由稳定性放大因子控制,并识别出一个Lipschitz不匹配区域,其中一步蒸馏在结构上不利。由此产生的理论通过累积稳定性坐标的均匀划分得到一个稳定性平衡的非均匀时间网格。实验支持该预测,并在8个分段下与均匀网格相比将端到端相对MSE降低了高达51.9%。

英文摘要

We develop a quantitative approximation framework for diffusion distillation, viewing few-step sampling as error propagation under compositions of learned flow maps. Focusing on trajectory distillation for the probability-flow ODE, we show that local approximation errors can be strongly amplified in low-noise multimodal regimes, where the underlying dynamics become stiff. In an analytically tractable Gaussian-mixture Ornstein--Uhlenbeck setting, we separate two core difficulties: approximating the time-dependent score field and controlling the dynamical amplification governed by the time-integrated Jacobian bound of the probability-flow ODE. On the approximation side, we prove constructive $L^p(p_t)$ guarantees showing that ReLU--ReQU networks approximate the Gaussian-mixture score uniformly over time, with depth and width scaling polylogarithmically in the target accuracy and explicitly with the mixture geometry. On the stability side, we derive an explicit bound $L(t)$ for the spatial Lipschitz constant of the probability-flow velocity and convert it into a flow map stability estimate governed by $\int_s^t L(u)\,du$, making late-time amplification in stiff regimes computable. Building on these estimates, we prove that deep residual compositions efficiently approximate the long-horizon transport, with global error controlled by the stability amplification factor, and identify a Lipschitz-mismatch regime in which one-step distillation is structurally unfavorable. The resulting theory yields a stability-balanced non-uniform time grid obtained by uniform partitioning in the cumulative stability coordinate. Experiments support the prediction and reduce end-to-end relative MSE by up to 51.9% with 8 segments compared with uniform grids.
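
The grid construction described in the abstract can be sketched as follows: integrate a Lipschitz bound to get a cumulative stability coordinate, split that coordinate into equal parts, and map the breakpoints back to time. The bound used here, 1 / (t + 0.05), is a hypothetical stand-in that is stiff near t = 0; the paper's explicit $L(t)$ and its time convention are not given in the abstract.

```python
def stability_balanced_grid(lipschitz, t0, t1, segments, resolution=10000):
    # Fine uniform grid used for trapezoidal integration of the bound.
    ts = [t0 + (t1 - t0) * i / resolution for i in range(resolution + 1)]
    cum = [0.0]  # cum[i] approximates the integral of the bound from t0 to ts[i]
    for a, b in zip(ts, ts[1:]):
        cum.append(cum[-1] + 0.5 * (lipschitz(a) + lipschitz(b)) * (b - a))
    grid, j = [t0], 0
    for k in range(1, segments):
        target = cum[-1] * k / segments  # equal share of the total coordinate
        while cum[j + 1] < target:
            j += 1
        # Linear interpolation inside the fine cell [ts[j], ts[j + 1]].
        frac = (target - cum[j]) / (cum[j + 1] - cum[j])
        grid.append(ts[j] + frac * (ts[j + 1] - ts[j]))
    grid.append(t1)
    return grid

# Hypothetical Lipschitz bound, largest near t = 0.
bound = lambda t: 1.0 / (t + 0.05)
grid = stability_balanced_grid(bound, 0.0, 1.0, segments=8)
print([round(t, 4) for t in grid])
```

Under this assumed bound the segments are shortest where the bound is largest, so each segment carries the same integrated amplification.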

2606.03736 2026-06-03 stat.ML cs.LG

Resource-Constrained Adaptive Inference for Sequential Pricing

资源约束下的自适应推断用于顺序定价

Ruicheng Ao, Jiashuo Jiang, David Simchi-Levi

Affiliations: Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139; Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong; Department of Civil and Environmental Engineering and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

AI summary: Because resource constraints can make fixed-price inference infeasible, the paper proposes a target-aware pricing controller that certifies feasible target bands and logs continuous local densities, obtains studentized intervals through localized debiasing, and analyzes the regret-information accounting.

AI中文摘要

资源约束的定价控制器可能使得固定价格推断变得不可能:即使每个已实现的动作具有已知的正密度,控制器的资源状态也可能从可行集中移除目标价格邻域。我们通过局部不可识别结果和已实现的信息时钟形式化了这种支持排除失败。然后,我们设计了一种目标感知定价控制器,该控制器认证可行的目标带并记录连续的局部密度。局部去偏产生了学生化区间,其宽度由该时钟控制。由此产生的遗憾-信息核算(直到初始求解误差)表明,廉价的探索可能不足以进行推断:多项式目标质量给出多项式速率,而纯$1/t$目标分支在没有额外局部移动的情况下不会产生收缩的固定目标区间。实验显示了在认证带中的校准以及当资源状态崩溃目标支持时的诊断性弃权。

英文摘要

Resource-constrained pricing controllers can make fixed-price inference impossible: the controller's resource state may remove the target price neighborhood from the feasible set, even when every realized action has a known positive density. We formalize this support-exclusion failure through a local non-identification result and a realized information clock. We then design a target-aware pricing controller that certifies feasible target bands and logs continuous local densities. Localized debiasing gives studentized intervals whose width is governed by this clock. The resulting regret--information accounting, stated up to pilot re-solving error, shows that cheap exploration can be insufficient for inference: polynomial target mass gives polynomial rates, while a pure $1/t$ target branch does not yield shrinking fixed-target intervals without additional local movement. Experiments show calibration in certified bands and diagnostic abstention when the resource state collapses target support.

2606.03600 2026-06-03 stat.ML cs.LG

Set-Preserving Calibration from Conformal P-Values to E-Values

从共形p值到e值的集合保持校准

Nabil Alami, Jad Zakharia, Souhaib Ben Taieb

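Affiliations: ETH Zurich

AI summary: To address the limitations of p-to-e conversion in conformal prediction, proposes a set-preserving P2E calibrator that converts conformal p-values into e-values without changing the prediction set, attaining the desired coverage and improved efficiency in cross-conformal prediction and conformal aggregation.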

AI中文摘要

标准的共形预测(CP)过程通常用p值表述,但仅依赖p值限制了灵活性,例如在跨模型或数据分割组合依赖证据时。最近的工作探索了共形推断的e值表述,然而CP中p值和e值表述之间的直接联系仍然缺失,特别是在统计效率方面。我们首先指出了CP设置中经典p到e校准器的局限性,表明它们不是集合保持的,可能导致过于保守的预测集。为解决这一问题,我们提出了一种新颖的P2E校准器,它将共形p值转换为e值,而不改变原始共形p值诱导的预测集。我们在理论和实证上证明,我们的校准器相比现有的p到e校准器可以带来显著的效率提升。这种e值表述使得能够原则性地使用e值合并和随机化的最新进展,我们在两个应用中展示了其影响:交叉共形预测(CCP),其变体通常仅提供近似的$1-2\alpha$覆盖率,以及共形聚合(CA)。在这两种情况下,我们基于e值的方法满足所需的$1-\alpha$覆盖率保证,同时相比标准基线提高了效率。更广泛地说,我们的方法扩展了CP的灵活性,并为高效、无分布的量化不确定性开辟了新方向。

英文摘要

Standard conformal prediction (CP) procedures are typically formulated in terms of p-values, but reliance on p-values alone limits flexibility, for example, when combining dependent evidence across models or data splits. Recent work has explored e-value formulations for conformal inference, yet a direct connection between p- and e-value formulations in CP has been missing, especially regarding their statistical efficiency. We first identify limitations of classical p-to-e calibrators in the CP setting, showing that they are not set-preserving and can lead to overly conservative prediction sets. To address this, we propose a novel P2E calibrator that converts conformal p-values into e-values without altering the prediction set induced by the original conformal p-value. We establish both theoretically and empirically that our calibrator can yield significant efficiency gains over existing p-to-e calibrators. This e-value formulation enables principled use of recent advances in e-value merging and randomization, and we demonstrate its impact in two applications: cross-conformal prediction (CCP), whose variants typically provide only approximate $1-2\alpha$ coverage, and conformal aggregation (CA). In both cases, our e-value-based methods satisfy the desired $1-\alpha$ coverage guarantee while improving efficiency over standard baselines. More broadly, our approach expands the flexibility of CP and opens new directions for efficient, distribution-free uncertainty quantification.
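
To illustrate the limitation the abstract attributes to classical calibrators, the sketch below compares a prediction set built from conformal p-values with one built after a standard power calibrator $e = \kappa p^{\kappa-1}$. The scores, labels, and $\kappa = 0.5$ are hypothetical, and the paper's own P2E calibrator is not shown because the abstract does not define it.

```python
def conformal_p_value(calibration_scores, test_score):
    # Share of scores (counting the test point itself) that are at least
    # as nonconforming as the test score.
    n = len(calibration_scores)
    return (1 + sum(s >= test_score for s in calibration_scores)) / (n + 1)

def power_calibrator(p, kappa=0.5):
    # Classical p-to-e calibrator e = kappa * p**(kappa - 1), 0 < kappa < 1.
    return kappa * p ** (kappa - 1.0)

alpha = 0.1
calibration_scores = [0.05 * i for i in range(1, 20)]  # hypothetical, n = 19
candidate_scores = {"a": 0.32, "b": 0.88, "c": 0.97}   # hypothetical labels

p_set, e_set = [], []
for label, score in candidate_scores.items():
    p = conformal_p_value(calibration_scores, score)
    if p > alpha:                          # p-value rule: keep if p > alpha
        p_set.append(label)
    if power_calibrator(p) < 1.0 / alpha:  # e-value rule: keep if e < 1/alpha
        e_set.append(label)

print(p_set, e_set)
```

In this toy case label "c" has p = 0.05 and is excluded by the p-value rule, yet its calibrated e-value stays below 1/alpha, so the e-value set retains it. The calibrated set is therefore larger, which is the non-set-preserving, conservative behavior the abstract describes.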

2606.03574 2026-06-03 stat.ML cs.LG

Few-Shot Prediction for Pulsar Noise with Long Short-Term Memory Network

基于长短期记忆网络的脉冲星噪声少样本预测

Qingye Tang, Dechao An, Haoran Peng, Yuqi Ouyang

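Affiliations: Sichuan University, College of Computer Science; Sichuan University, College of Physics

AI summary: To address the scarcity of pulsar timing data, proposes an LSTM network optimized with model-agnostic meta-learning that adapts quickly to new frequency domains from a small number of true timing residuals, tunes hyperparameters automatically with particle swarm optimization, and achieves accurate predictions on the IPTA dataset using 10% of the data.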

AI中文摘要

本文提出了一种新颖的解决方案,用于在有限数据下预测脉冲星计时残差,解决了PTA数据集中毫秒脉冲星自旋频率子组数据稀缺的关键挑战。该方案应用了长短期记忆(LSTM)网络,并通过模型无关元学习算法进行优化,使得仅需少量真实计时残差即可通过微调LSTM网络快速适应新的频域。同时,采用粒子群优化算法进行自动超参数优化,提高了预测精度。我们的解决方案在国际脉冲星计时阵列(IPTA)第二次数据发布上进行了评估,在高频测试频域的三个指标上均展现出鲁棒的泛化能力和准确预测,且仅需这些域中10%的计时残差进行模型微调。此外,我们的轻量级结构仅需16.86 MB CPU内存和18毫秒即可完成单步残差预测。所有这些特性使得我们的解决方案非常适合实际应用,在这些应用中,有效且实时的脉冲星计时残差预测至关重要——尤其是在计算能力、内存或能源有限的资源受限环境中。

英文摘要

This work proposes a novel solution for predicting pulsar timing residuals from limited data, addressing the critical challenge of data scarcity across spin-frequency subgroups of millisecond pulsars in pulsar timing array (PTA) datasets. The proposed solution applies a Long Short-Term Memory (LSTM) network optimized with the model-agnostic meta-learning algorithm, enabling rapid adaptation to a new frequency domain by fine-tuning the LSTM network on only a few ground-truth timing residuals. A particle swarm optimization algorithm is also used for automatic hyperparameter optimization, leading to improved prediction accuracy. Our solution, evaluated on the second data release of the International Pulsar Timing Array (IPTA), demonstrates robust generalization with accurate predictions on three metrics across high-frequency test domains, while requiring only 10% of the timing residuals from these domains for model fine-tuning. Furthermore, our lightweight structure requires only 16.86 MB of CPU memory and 18 milliseconds for single-step residual prediction. These characteristics make our solution well suited to real-world applications where effective, real-time prediction of pulsar timing residuals is essential, particularly in resource-constrained environments with limited computational power, memory, or energy.

2606.03553 2026-06-03 stat.ML cs.LG math.OC

A Robust Optimization Approach to Sparse Principal Component Analysis

稀疏主成分分析的鲁棒优化方法

David Vävinggren, Francis Bach, André M. H. Teixeira, Dave Zachariah, Antônio H. Ribeiro

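Affiliations: Uppsala University, Sweden; PSL Research University / INRIA, France; Science for Life Laboratory, Sweden

AI summary: Proposes AdvPCA, which achieves sparsity through robust optimization by introducing bounded worst-case latent-space perturbations into the reconstruction objective, and derives a closed-form reduction and a practical iterative algorithm.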

AI中文摘要

虽然主成分分析(PCA)是降维的基本工具,但其稠密表示使其不适用于高维数据。现有方法通过显式的$\ell_1$惩罚来促进稀疏性,但由于任务的无监督性质,这些惩罚不易调整。相比之下,我们提出了对抗性PCA(AdvPCA),它利用鲁棒优化,通过优化针对有界、最坏情况潜在空间扰动的重建目标来实现稀疏性。我们表明,该公式允许闭式约简,从而产生一种实用的迭代算法,该算法交替进行稀疏编码器的对抗性线性回归式更新和解码器的正交更新。通过对解进行理论刻画,我们推导出一种数据自适应参数化,使算法能够开箱即用地有效执行。我们通过在合成和真实世界基因组学数据上的数值实验验证了这些主张。

英文摘要

While principal component analysis (PCA) is a fundamental tool for dimensionality reduction, its dense representations make it ill-suited for high-dimensional data. Existing methods address this by promoting sparsity through explicit $\ell_1$-penalties, but these penalties are not straightforward to tune because the task is unsupervised. In contrast, we propose Adversarial PCA (AdvPCA), which leverages robust optimization to achieve sparsity by optimizing the reconstruction objective against bounded, worst-case latent space perturbations. We show that this formulation admits a closed-form reduction, leading to a practical iterative algorithm that alternates between adversarial linear regression-style updates for the sparse encoder and orthogonal updates for the decoder. By theoretically characterizing the solution, we derive a data-adaptive parameterization that allows the algorithm to perform effectively out of the box. We validate these claims through numerical experiments on synthetic and real-world genomics data.

2606.03292 2026-06-03 stat.ML cs.LG

Combining Statistical Features and Deep Encodings for Rehearsal-Based Class-Incremental Time Series Classification

结合统计特征与深度编码的基于排练的类增量时间序列分类

Pablo García-Santaclara, Bruno Fernández-Castro, Rebeca Pilar Díaz-Redondo

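Affiliations: atlanTTic – ICLAB, Universidade de Vigo; Centro Tecnolóxico de Telecomunicacións de Galicia (GRADIANT); Universidade de Vigo

AI summary: Proposes a dual-stream feature extraction pipeline (combining deep temporal embedding features from a pre-trained frozen foundation model with statistical features) for class-incremental continual learning on multivariate time series, achieving competitive average accuracy and low forgetting rates on five benchmark datasets.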

AI中文摘要

现实环境中使用的许多系统需要在不遗忘分类模型先前学习内容的情况下添加新类别并整合新信息。这被称为类增量持续学习,而对于多变量时间序列,数据的时间结构进一步增加了复杂性。本文提出了一种基于双流特征提取管道(使用通过预训练冻结基础模型生成的深度时间嵌入特征以及应用统计特征)的多变量时间序列分类类增量持续学习的新方法。在五个基准数据集上的评估表明,所提出的系统在所有数据集上实现了有竞争力的平均准确率,同时在所有实验配置中保持了较低的遗忘率。

英文摘要

Many systems used in real-world environments need to add new categories and incorporate new information without forgetting what the classification model has previously learnt. This is known as class-incremental continual learning, and in the case of multivariate time series it is further complicated by the temporal structure of the data. In this paper, we present a novel approach to class-incremental continual learning for multivariate time series classification, based on a dual-stream feature extraction pipeline that combines deep temporal embedding features generated by a pre-trained frozen foundation model with statistical features. Evaluated on five benchmark datasets, the proposed system achieves competitive average accuracy across all datasets while maintaining low forgetting rates across all experimental configurations.

2606.03245 2026-06-03 stat.ML cs.LG

Hierarchies of Calibration: Classification meets Regression

校准的层次结构:分类与回归的融合

Johannes Resin, Lu Yang, Tilmann Gneiting

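Affiliations: Goethe University Frankfurt; University of Minnesota; Heidelberg Institute for Theoretical Studies; Karlsruhe Institute of Technology

AI summary: Reviews, extends, and bridges notions of calibration for classification and regression tasks, with emphasis on the hierarchical relations among them; it introduces modal calibration for nominal outcomes and distinguishes full, partial, and average calibration in that setting.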

AI中文摘要

校准概念形式化了概率预测与相应结果之间的兼容性。简而言之,结果应与从预测分布中随机抽取的样本无法区分。本文回顾、扩展并桥接了针对分类和回归任务提出的校准概念。特别强调了各种概念之间的层次关系,因为它们适用于一般实值数据、连续结果、计数数据、名义类别和二元结果。为了突出若干贡献,我们引入了名义结果的模态校准概念,在此背景下区分了全校准、部分校准和平均校准,并证明了双概率积分变换(PIT)校准在逻辑上独立于先前针对离散结果提出的校准概念。此外,我们推广了关于校准概念的现有结果,这些概念以预测分布的性质或泛函(如均值、分位数或事件概率)表示。在整篇论文中,我们通过实例说明这些概念及其层次关系,并提供支持构建指导性示例和反例的算法工具。

英文摘要

Concepts of calibration formalize the compatibility between probabilistic predictions and the respective outcomes. In a nutshell, the outcomes ought to be indistinguishable from random draws from the predictive distributions. In this paper, we review, extend, and bridge notions of calibration that have been proposed for classification and regression tasks. Particular emphasis is given to hierarchical relations between the various notions, as they apply to general real-valued data, continuous outcomes, count data, nominal classes, and binary outcomes. To highlight a number of contributions, we introduce the notion of modal calibration for nominal outcomes, we distinguish full, partial, and average calibration in this setting, and we show that double probability integral transform (PIT) calibration is logically independent of previously proposed concepts of calibration for discrete outcomes. Furthermore, we generalize extant results on concepts of calibration that are expressed in terms of properties or functionals of the predictive distributions, such as means, quantiles, or event probabilities. Throughout the paper, we illustrate the concepts and their hierarchical relations in worked examples, and we provide algorithmic tools that support the construction of instructive examples and counterexamples.
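
The idea that outcomes should look like random draws from the predictive distributions can be checked with probability integral transform (PIT) values. The sketch below is a simplified illustration with hypothetical Gaussian forecasts; it covers only standard PIT calibration for continuous outcomes, not the double PIT or modal notions discussed in the paper.

```python
import math
import random

def normal_cdf(y, mean, std):
    # CDF of a Gaussian predictive distribution evaluated at the outcome.
    return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))

def pit_values(outcomes, means, stds):
    # PIT value = predictive CDF at the realized outcome.
    return [normal_cdf(y, m, s) for y, m, s in zip(outcomes, means, stds)]

def variance(values):
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

rng = random.Random(0)
means = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
outcomes = [rng.gauss(m, 1.0) for m in means]  # true spread is 1.0

calibrated = pit_values(outcomes, means, [1.0] * len(means))
overconfident = pit_values(outcomes, means, [0.5] * len(means))  # spread too small
print(variance(calibrated), variance(overconfident))
```

For a calibrated forecaster the PIT values are uniform on [0, 1], so their variance is close to 1/12. The overconfident forecaster produces a U-shaped PIT histogram with a visibly larger variance.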