arXivDaily: arXiv daily academic digest, updated Monday through Friday
2406.05670 2026-06-08 cs.LG cs.CR cs.CV version update

Certified Robustness to Data Poisoning in Gradient-Based Training

Philip Sosnin, Mark N. Müller, Maximilian Baader, Calvin Tsay, Matthew Wicker

Affiliations: Department of Computing, Imperial College London, United Kingdom; Department of Computer Science, ETH Zurich, Switzerland; LogicStar.ai, Switzerland; The Alan Turing Institute, United Kingdom

AI summary: Proposes the first framework that over-approximates the set of parameter updates with convex relaxations, giving provable robustness guarantees against untargeted poisoning, targeted poisoning, and backdoor attacks for models trained by gradient descent.

Details
Comments
21 pages, 8 figures

English abstract

Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
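To make the idea of bounding all reachable parameters concrete, here is a minimal sketch. It is not the authors' method: it assumes a one-parameter linear model, a threat model in which every label may be shifted by at most `eps`, and plain interval arithmetic as the simplest possible convex relaxation. The names `certified_weight_bounds` and `train` are hypothetical.

```python
def certified_weight_bounds(xs, ys, eps, lr=0.05, steps=10, w0=0.0):
    """Interval [w_lo, w_hi] containing every weight that gradient descent on
    0.5 * (w*x - y)^2 can reach when each label is shifted by at most eps."""
    n = len(xs)
    w_lo = w_hi = w0
    for _ in range(steps):
        # Per-sample gradient is x*x*w - x*y; x*x >= 0, so it is monotone in w,
        # and the label shift changes it by at most |x| * eps.
        g_lo = sum(x * x * w_lo - x * y - abs(x) * eps for x, y in zip(xs, ys)) / n
        g_hi = sum(x * x * w_hi - x * y + abs(x) * eps for x, y in zip(xs, ys)) / n
        # w - lr * g is smallest when g is largest, and vice versa.
        w_lo, w_hi = w_lo - lr * g_hi, w_hi - lr * g_lo
    return w_lo, w_hi


def train(xs, ys, lr=0.05, steps=10, w0=0.0):
    """Plain gradient descent on the same loss, used to check the bounds."""
    n = len(xs)
    w = w0
    for _ in range(steps):
        w -= lr * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.1, 2.9, 4.2]
lo, hi = certified_weight_bounds(xs, ys, eps=0.1)
w_clean = train(xs, ys)                              # unpoisoned run
w_poisoned = train(xs, [y + 0.1 for y in ys])        # one admissible attack
```

Both `w_clean` and `w_poisoned` fall inside `[lo, hi]`. The interval widens at every step, which is why tighter relaxations matter in practice.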

2408.15344 2026-06-08 cs.LG math.DS version update

Conformal Disentanglement and Latent-Space Curation: A Neural Framework for Perspective Synthesis, Differentiation and Targeted Generation

George A. Kevrekidis, Eleni D. Koronaki, Dimitris G. Giovanis, Yannis G. Kevrekidis

Affiliations: Department of Applied Mathematics and Statistics, Johns Hopkins University; Los Alamos National Laboratory; Faculty of Science, Technology and Medicine, University of Luxembourg; Department of Civil and Systems Engineering, Johns Hopkins University; Department of Chemical and Biomolecular Engineering, Johns Hopkins University

AI summary: Proposes a neural autoencoder framework that uses structural constraints and orthogonality regularization to separate shared and sensor-specific latent variables in multi-sensor data, and uses the disentangled latent subspaces for targeted generation and cross-sensor inference.

English abstract

Many scientific and engineering problems involve observing a common phenomenon through multiple heterogeneous sensors or measurement modalities. Such observations typically contain both information shared across sensors, reflecting the underlying system, and sensor-specific or extraneous components arising from measurement processes or environmental effects. Disentangling these contributions is essential when sensor-independent observations are unavailable. We propose a neural autoencoder framework that explicitly separates shared and sensor-specific latent variables from multi-sensor data. The architecture enforces geometric independence between latent components through structural constraints and orthogonality-based regularization, yielding interpretable and disentangled representations. Building on this representation, we then introduce a latent-space generative methodology in which generative models are tuned/"restricted" on selected disentangled latent subspaces; we then constructively combine disentangled observed latent variables to conditionally synthesize new samples via trained decoders. This enables consistent data generation with prescribed shared (or sensor-specific) characteristics. It also supports cross-sensor inference by consistently sampling distributions over plausible measurements in unobserved modalities. We demonstrate the approach on several computational examples, showing effective disentanglement, targeted data generation, and modality imputation in heterogeneous sensing settings.

2406.00636 2026-06-08 cs.CV version update

T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

Taeryung Lee, Fabien Baradel, Thomas Lucas, Kyoung Mu Lee, Gregory Rogez

Affiliations: IPAI & ASRI; Dept. of ECE, Seoul National University; NAVER LABS Europe

AI summary: Proposes T2LM, a framework that uses a 1D-convolutional VQVAE and a Transformer-based text encoder to generate continuous long-term 3D human motion from multiple sentences without sequential training data; it outperforms prior long-term methods and is competitive with SOTA single-action models.

Details
Comments
CVPR 2024 HuMoGen Workshop

English abstract

In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) it yields unrealistic gaps between the motions generated at each step. To address these issues, we introduce the simple yet effective T2LM, a continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQVAE, trained to compress motion to sequences of latent vectors, and a Transformer-based Text Encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors. This is then decoded into a motion by the VQVAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data; it is also competitive with SOTA single-action generation models.
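The role of the local temporal receptive field can be illustrated with a toy example. The sketch below is not the T2LM decoder: it assumes a single hypothetical width-3 filter and scalar latents, and only shows that decoding one concatenated latent stream differs from decoding each sentence separately at the seam alone.

```python
def conv1d(seq, kernel):
    """Same-length 1D convolution with zero padding (odd kernel width); each
    output depends only on a window of len(kernel) neighbouring inputs."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(seq) + [0.0] * half
    return [
        sum(k * padded[t + j] for j, k in enumerate(kernel))
        for t in range(len(seq))
    ]


kernel = [0.25, 0.5, 0.25]           # hypothetical decoder filter, width 3
latents_a = [1.0, 1.0, 1.0, 1.0]     # toy latent stream for sentence A
latents_b = [3.0, 3.0, 3.0, 3.0]     # toy latent stream for sentence B

# Decode the concatenated stream in one pass ...
joint = conv1d(latents_a + latents_b, kernel)
# ... versus decoding each sentence on its own and gluing the outputs.
separate = conv1d(latents_a, kernel) + conv1d(latents_b, kernel)
```

Away from the boundary the two outputs coincide. At the seam, `joint` moves from 1.5 to 2.5, while `separate` dips to 0.75 and then jumps to 2.25, a small analogue of the gaps between independently generated chunks.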

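2403.10318 2026-06-08 cs.LG version update

pTNAS: Progressive Neural Architecture Search for Tabular Data

Naili Xing, Shaofeng Cai, Lingze Zeng, Jiaqi Zhu, Peng Lu, Jian Pei, Beng Chin Ooi

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes pTNAS, the first progressive neural architecture search method for tabular data. It uses a filter-and-refine optimization strategy that combines a zero-cost proxy with a fixed-budget scheduling algorithm to identify architectures quickly and keep improving as the budget grows, with speedups of up to 82.75× over other NAS methods.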

English abstract

Recent advances have shifted the paradigm of tabular learning toward tabular foundation models, yet their accuracy relies on a heavy inference cost that scales poorly with context size. Deep neural networks remain a highly competitive and more efficient modeling paradigm when equipped with well-designed architectures; however, identifying such architectures in a data-adaptive and budget-aware manner remains challenging. We propose pTNAS, the first progressive neural architecture search (NAS) approach tailored for tabular data, which enables fast identification of a viable architecture and continuously improves its search performance as more budget becomes available. pTNAS adopts a filter-and-refine optimization strategy that combines efficient training-free and effective training-based architecture evaluation. In the filtering phase, we introduce pTProxy, a novel zero-cost proxy specifically designed for tabular networks that jointly captures architectural trainability and expressivity, enabling fast filtering of large architecture search spaces. In the refinement phase, pTNAS employs a fixed-budget scheduling algorithm to accurately identify the best-performing architecture from a small set of promising candidates. We further propose a budget-aware coordinator to optimize budget allocation holistically. Experiments show that pTNAS reduces the time to reach the globally best architecture by up to 82.75× compared with other NAS approaches, achieves the best average predictive rank, and improves end-to-end efficiency by up to 4.78× compared with TabPFN.
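The filter-and-refine pattern can be sketched in a few lines. This is an illustration under stated assumptions, not pTNAS itself: successive halving stands in for the paper's fixed-budget scheduler, and `toy_proxy` and `toy_eval` are hypothetical placeholders for pTProxy and for real training runs.

```python
def filter_and_refine(candidates, proxy_score, train_and_eval, keep=4, first_budget=1):
    """Filter with a cheap training-free score, then refine the survivors with
    successive halving (an assumed stand-in for a fixed-budget scheduler)."""
    # Filtering phase: rank every candidate by the zero-cost proxy.
    survivors = sorted(candidates, key=proxy_score, reverse=True)[:keep]
    epochs = first_budget
    # Refinement phase: evaluate survivors, drop the worse half, double the budget.
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda a: train_and_eval(a, epochs), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        epochs *= 2
    return survivors[0]


def toy_proxy(width):
    """Hypothetical training-free score: a rough guess that prefers width ~100."""
    return -abs(width - 100)


def toy_eval(width, epochs):
    """Hypothetical validation score after `epochs` of training; best at 64."""
    return -abs(width - 64) + 0.01 * epochs


# Toy search space: an "architecture" is just a hidden width.
widths = [8, 16, 32, 64, 128, 256, 512]
best = filter_and_refine(widths, toy_proxy, toy_eval)
```

The proxy is deliberately imperfect here (it prefers 128), yet the refinement phase still recovers width 64 because the true optimum survives the filter.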

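2403.05532 2026-06-08 cs.LG cs.CV version update

Twin: Tuning Learning Rate and Weight Decay of Deep Homogeneous Classifiers without Validation

Lorenzo Brigato, Stavroula Mougiakakou

Affiliations: ARTORG Center, University of Bern

AI summary: Proposes Twin, which uses the margin-maximization dynamics of homogeneous networks and an empirical scaling law between training and test losses to tune learning rate and weight decay without a validation set, reaching a mean absolute error of 1.28% relative to an Oracle baseline across 37 image-classification configurations.

Details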
Comments
Accepted at TMLR

English abstract

We introduce Tune without Validation (Twin), a simple and effective pipeline for tuning learning rate and weight decay of homogeneous classifiers without validation sets, eliminating the need to hold out data and avoiding the two-step process. Twin leverages the margin-maximization dynamics of homogeneous networks and an empirical scaling law that links training and test losses across hyper-parameter configurations. This mathematical modeling yields a regime-dependent, validation-free selection rule: in the non-separable regime, training loss is monotonic in test loss and therefore predictive of generalization, whereas in the separable regime, the parameters' norm becomes a reliable indicator of generalization due to margin maximization. Across 37 dataset-architecture configurations for image classification, we demonstrate that Twin achieves a mean absolute error of 1.28% compared to an Oracle baseline that selects HPs using test accuracy. We demonstrate Twin's benefits in scenarios where validation data is scarce, such as small-data regimes, or difficult and costly to collect, as in medical imaging. Code available at https://github.com/lorenzobrigato/twin.

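2206.08598 2026-06-08 cs.LG stat.ML version update

Characterizing Learning Dynamics under Relative Reparameterization of Singular Models

Pascal Mattia Esser, Frank Nielsen

Affiliations: Ludwig-Maximilians-Universität München; Sony Computer Science Laboratories Inc.

AI summary: Addresses the slow convergence caused by the lack of a one-to-one mapping between parameter space and model space in singular models. Proposes a relative reparameterization that extracts regular sub-models, and analyzes gradient-descent convergence rates on Gaussian mixture models and neural networks.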

English abstract

A common way to analyze the learning of statistical models is to consider operations in the model's parameter space; however, this becomes challenging when there is no one-to-one mapping between the parameter space and the underlying statistical model space. Such "singular models" occur frequently and exhibit a characteristic decrease in the convergence speed of learning trajectories due to attractor behaviors. In this work, we consider a relative reparameterization technique of the parameter space, which yields a general method for extracting regular sub-models from singular models. Using Gaussian Mixture Models and Neural Networks as examples, we theoretically and numerically analyze the convergence rate of Gradient Descent under both parameterizations. By analyzing second-order methods and explicit properties of the Fisher Information Matrix, we distinguish between differences in convergence behavior arising from algorithmic and intrinsic information-geometric aspects.
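A small numerical example may help show what "singular" means here. The sketch assumes a one-neuron regression model f(x) = a * tanh(b * x) with unit-variance Gaussian noise, which is an illustrative choice and not a model taken from the paper. When a = 0, every value of b gives the same function, so the Fisher information matrix loses rank.

```python
import math


def fisher_matrix(a, b, xs):
    """Empirical Fisher information (unit noise variance) of the regression
    model f(x) = a * tanh(b * x) with respect to the parameters (a, b)."""
    f_aa = f_ab = f_bb = 0.0
    for x in xs:
        da = math.tanh(b * x)                   # df/da
        db = a * x / math.cosh(b * x) ** 2      # df/db
        f_aa += da * da
        f_ab += da * db
        f_bb += db * db
    n = len(xs)
    return [[f_aa / n, f_ab / n], [f_ab / n, f_bb / n]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
regular = det2(fisher_matrix(1.0, 1.0, xs))     # identifiable point
singular = det2(fisher_matrix(0.0, 1.0, xs))    # a = 0: b is unidentifiable
```

`regular` is strictly positive, while `singular` is exactly zero because the gradient with respect to b vanishes at a = 0.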

2510.21122 2026-06-08 cs.CV

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

Affiliations: ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging; Lingang Laboratory

AI summary: NoisyGRPO injects controllable noise to enhance exploration and models advantage estimation within a Bayesian framework, improving the generalization and robustness of multimodal large language models, especially small-scale ones.

Details
Journal ref
Advances in Neural Information Processing Systems 38 (2026) 124239-124267
Comments
Accepted by NeurIPS 2025. Project page: https://artanic30.github.io/project_pages/NoisyGRPO/

English abstract

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
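The prior-times-likelihood fusion described in the abstract can be illustrated with the simplest conjugate case. Everything specific below is an assumption made for the sketch, not the paper's formulation: a Gaussian prior whose mean is `1 - noise_level`, a Gaussian likelihood centred on the observed reward, fixed variances, and group-wise normalisation of the resulting posterior means.

```python
import statistics


def posterior_quality(noise_level, reward, prior_var=0.25, reward_var=0.25):
    """Precision-weighted fusion of a noise-based prior with an observed reward
    (Gaussian prior and Gaussian likelihood, both assumed for illustration)."""
    prior_mean = 1.0 - noise_level            # assumed: less noise, higher prior
    precision = 1.0 / prior_var + 1.0 / reward_var
    return (prior_mean / prior_var + reward / reward_var) / precision


def group_advantages(noise_levels, rewards):
    """Normalise the posterior estimates within one group of rollouts."""
    post = [posterior_quality(n, r) for n, r in zip(noise_levels, rewards)]
    mean = statistics.fmean(post)
    std = statistics.pstdev(post) or 1.0      # guard against a constant group
    return [(p - mean) / std for p in post]


# Three rollouts: clean input, moderately noisy input, heavily noisy input.
advantages = group_advantages([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
```

The first two rollouts earn the same reward, but the one produced from the cleaner input receives the larger advantage, which mirrors the stated goal of preferring visually grounded trajectories over noisy ones.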

2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

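James Brock, Ce Zhang, Nantheera Anantrasirichai

Affiliations: arXiv.org cs.CV

AI summary: Proposes Forest-Chat, an LLM-based agent for forest change analysis that answers natural-language queries across multiple tasks, improving the accuracy and interpretability of forest change detection and semantic interpretation.

Details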
Comments
28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

English abstract

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.
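For readers unfamiliar with the mIoU figures quoted above, the following sketch shows how mean intersection-over-union is commonly computed for a change mask. It is a generic definition applied to toy flattened masks, not the evaluation code used for Forest-Change; the function name and the choice to skip classes absent from both masks are assumptions.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean intersection-over-union across classes for flat label masks."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:                       # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Flattened toy masks: 0 = no change, 1 = change.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target)
```

Each class here has an intersection of 2 pixels and a union of 4, so both per-class IoUs are 0.5 and `score` is 0.5.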

2505.19888 2026-06-08 cs.LG

Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations

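Eun Gyung Kong, Je Won Yeom, Yonghoon Jeon, Taesup Kim

Affiliations: Seoul National University; Mobilint, Inc.; Kakao Healthcare Corp.

AI summary: Proposes FedOT, a framework that uses orthogonal transformations to achieve robust generalization and effective personalization in federated learning, improving performance in heterogeneous environments and outperforming baseline methods.

Details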
Journal ref
Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24567-24576, 2026
Comments
31 pages, 5 figures

English abstract

Federated Learning (FL) facilitates decentralized model training while preserving data privacy. However, achieving both robust generalization and effective personalization simultaneously in heterogeneous (non-IID) environments remains a formidable challenge. Furthermore, the widespread adoption of proprietary Foundation Models (FMs) introduces a critical requirement for dual privacy: (a) protecting sensitive client data and (b) securing the server's valuable intellectual property. This mandates strictly black-box access to the FM. To address these multifaceted challenges, we introduce FedOT, a novel FL framework optimized for black-box FMs. FedOT employs a shared global task-dependent classifier while facilitating local adaptation through client-specific orthogonal transformations applied externally to the FM embeddings. This architecture inherently guarantees that the FM's internal parameters remain inaccessible and unmodified. By enforcing orthogonality, FedOT effectively mitigates gradient conflicts across diverse clients, which is theoretically bounded, preserves the semantic integrity of the FM representations, and achieves robust performance under significant data heterogeneity. The synergy of global and local parameters optimally balances generalization and personalization, markedly outperforming baseline FL methods across diverse benchmarks. Extensive empirical analysis, including rigorous multi-seed validation and scalability assessments, substantiates the robustness, efficiency, and superior performance of FedOT.
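The claim that an orthogonal transformation preserves the semantic structure of the embeddings rests on a basic fact: orthogonal maps keep norms and inner products unchanged. The sketch below checks this with a Householder reflection, which is just one convenient way to build an orthogonal matrix. How FedOT actually parameterises its transformations is not stated in the abstract, and all names and values here are hypothetical.

```python
def householder(v):
    """Orthogonal matrix Q = I - 2 v v^T / (v^T v) for a non-zero vector v."""
    norm_sq = sum(x * x for x in v)
    d = len(v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(d)]
        for i in range(d)
    ]


def apply(q, x):
    """Matrix-vector product q @ x."""
    return [sum(q_ij * x_j for q_ij, x_j in zip(row, x)) for row in q]


def dot(x, y):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(x, y))


q_client = householder([1.0, 2.0, -1.0])   # hypothetical client-specific parameters
emb_a = [0.3, -1.2, 0.8]                   # stand-ins for frozen black-box FM embeddings
emb_b = [1.0, 0.5, -0.4]
out_a, out_b = apply(q_client, emb_a), apply(q_client, emb_b)
```

`dot(out_a, out_b)` matches `dot(emb_a, emb_b)` up to floating-point rounding, so distances and angles between embeddings survive the client-side adaptation while the foundation model itself is never touched.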

2502.21123 2026-06-08 cs.LG cs.AI

Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

Affiliations: CISPA Helmholtz Center for Information Security; Max Planck Institute for Intelligent Systems, Tübingen; Google Research; ETH Zürich; University of Toronto

AI summary: Argues for integrating causal methods into machine learning to balance the trade-offs among trustworthiness principles such as fairness, privacy, robustness, accuracy, and explainability, and discusses practical applications in foundation models.

English abstract

Ensuring trustworthiness in machine learning (ML) systems is crucial as they become increasingly embedded in high-stakes domains. This paper advocates for integrating causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML, including fairness, privacy, robustness, accuracy, and explainability. While these objectives should ideally be satisfied simultaneously, they are often addressed in isolation, leading to conflicts and suboptimal solutions. Drawing on existing applications of causality in ML that successfully align goals such as fairness and accuracy or privacy and robustness, this paper argues that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models. Beyond highlighting these trade-offs, we examine how causality can be practically integrated into ML and foundation models, offering solutions to enhance their reliability and interpretability. Finally, we discuss the challenges, limitations, and opportunities in adopting causal frameworks, paving the way for more accountable and ethically sound AI systems.

2505.13140 2026-06-08 cs.CV

CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani

Affiliations: Toyota Technological Institute; Robotics Institute, Carnegie Mellon University

AI summary: CacheFlow caches the computations of a normalizing-flow generative model to achieve fast 3D human motion prediction, running substantially faster than earlier methods while retaining prediction accuracy and model expressiveness.

Details
Journal ref
Transactions on Machine Learning Research, 2026
Comments
Accepted at Transactions on Machine Learning Research (TMLR). See https://openreview.net/forum?id=icq5659pQt

English abstract

Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models are available at https://github.com/meaten/CacheFlow.

2411.05729 2026-06-08 cs.LG stat.ML

Graph-Dictionary Signal Model for Sparse Representations of Multivariate Data

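William Cappelletti, Pascal Frossard

Affiliations: LTS4, EPFL, Lausanne, Switzerland

AI summary: Proposes a graph-dictionary signal model that describes relationships in multivariate data through graphs, represents them as sparse combinations of graph atoms, and outperforms popular baselines at reconstructing graphs from signals.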

English abstract

Representing and exploiting multivariate signals requires capturing relations between variables, which we can represent by graphs. Graph dictionaries allow complex relational information to be described as a sparse sum of simpler structures, but no prior model exists to infer such underlying structure elements from data. We define a novel Graph-Dictionary signal model, where a finite set of graphs characterizes relationships in the data distribution as filters on the weighted sum of their Laplacians. We propose a framework to infer the graph dictionary representation from observed node signals, which allows a priori knowledge about signal properties, and about the underlying graphs and their coefficients, to be included. We introduce a bilinear generalization of the primal-dual splitting algorithm to solve the learning problem. We show the capability of our method to reconstruct graphs from signals in multiple synthetic settings, where our model outperforms popular baselines. Then, we exploit graph-dictionary representations in an illustrative motor imagery decoding task on brain activity data, where we classify imagined motion better than standard methods relying on many more features. Our graph-dictionary model bridges a gap between sparse representations of multivariate data and a structured decomposition of sample-varying relationships into a sparse combination of elementary graph atoms.
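The phrase "weighted sum of their Laplacians" can be made concrete with two tiny graph atoms. The sketch below is only an illustration of that building block: it uses the quadratic form x^T L x as a simple smoothness measure, which is an assumption made for the example and not the filter or the learning algorithm from the paper.

```python
def laplacian(adj):
    """Combinatorial Laplacian L = D - A of a symmetric adjacency matrix."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def combine(atoms, coeffs):
    """Weighted sum of the atoms' Laplacians."""
    n = len(atoms[0])
    laps = [laplacian(a) for a in atoms]
    return [
        [sum(c * lap[i][j] for c, lap in zip(coeffs, laps)) for j in range(n)]
        for i in range(n)
    ]


def smoothness(lap, x):
    """Quadratic form x^T L x; small values mean x varies little along edges."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))


# Two hypothetical atoms on three nodes: edge 0-1 and edge 1-2.
atom_01 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
atom_12 = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
signal = [1.0, 1.0, 5.0]
only_01 = smoothness(combine([atom_01, atom_12], [1.0, 0.0]), signal)
only_12 = smoothness(combine([atom_01, atom_12], [0.0, 1.0]), signal)
```

The signal is constant across edge 0-1 and jumps across edge 1-2, so `only_01` is 0 and `only_12` is 16. A sparse coefficient vector that activates only the first atom therefore explains this sample best.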

2403.09110 2026-06-08 cs.LG cs.SY eess.SY math.DS math.OC

SINDy-RL: Interpretable and Efficient Model-Based Reinforcement Learning

Nicholas Zolman, Christian Lagemann, Urban Fasel, J. Nathan Kutz, Steven L. Brunton

Affiliations: Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA; Data Science and Artificial Intelligence Department, The Aerospace Corporation, El Segundo, CA 90245; Department of Aeronautics, Imperial College, London SW7 2AZ, United Kingdom; Department of Applied Mathematics, University of Washington, Seattle, WA 98195; Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195

AI summary: Proposes SINDy-RL, a framework combining SINDy and DRL to obtain efficient, interpretable dynamics models and control policies in the low-data regime, validated on benchmark environments and flow-control experiments.

Details
Journal ref
Nat. Commun. 16, 10714 (2025)
Comments
For code, see https://github.com/nzolman/sindy-rl. v2 Update: Included Pinball and 3D Airfoil examples. Christian Lagemann added as an author for contributions with the 3D Airfoil code. To appear in Nature Communications

English abstract

Deep reinforcement learning (DRL) has shown significant promise for uncovering sophisticated control policies that interact in complex environments, such as stabilizing a tokamak fusion reactor or minimizing the drag force on an object in a fluid flow. However, DRL requires an abundance of training examples and may become prohibitively expensive for many applications. In addition, the reliance on deep neural networks often results in an uninterpretable, black-box policy that may be too computationally expensive to use with certain embedded systems. Recent advances in sparse dictionary learning, such as the sparse identification of nonlinear dynamics (SINDy), have shown promise for creating efficient and interpretable data-driven models in the low-data regime. In this work we introduce SINDy-RL, a unifying framework for combining SINDy and DRL to create efficient, interpretable, and trustworthy representations of the dynamics model, reward function, and control policy. We demonstrate the effectiveness of our approaches on benchmark control environments and flow control problems, including gust mitigation on a 3D NACA 0012 airfoil at $Re=1000$. SINDy-RL achieves comparable performance to modern DRL algorithms using significantly fewer interactions in the environment and results in an interpretable control policy orders of magnitude smaller than a DRL policy.
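The sparse regression at the heart of SINDy can be sketched without any libraries. The example below implements sequentially thresholded least squares on noise-free synthetic data for dx/dt = -2x with a hypothetical candidate library [1, x, x^2]. It illustrates the general SINDy idea only; the SINDy-RL code linked in the comments may differ in optimizer, library, and every other detail.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def stlsq(theta, target, threshold=0.1, iters=5):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coef = [0.0] * n_terms
    for _ in range(iters):
        if not active:
            break
        # Normal equations restricted to the currently active library columns.
        gram = [[sum(row[i] * row[j] for row in theta) for j in active] for i in active]
        rhs = [sum(row[i] * t for row, t in zip(theta, target)) for i in active]
        fit = solve(gram, rhs)
        coef = [0.0] * n_terms
        for i, c in zip(active, fit):
            if abs(c) >= threshold:
                coef[i] = c
        active = [i for i in range(n_terms) if coef[i] != 0.0]
    return coef


states = [-1.5, -1.0, -0.5, 0.25, 0.5, 1.0, 1.5, 2.0]
derivatives = [-2.0 * x for x in states]         # synthetic data: dx/dt = -2x
library = [[1.0, x, x * x] for x in states]      # candidate terms 1, x, x^2
coefficients = stlsq(library, derivatives)
```

Only the coefficient of x survives thresholding, with a value close to -2, so the recovered model is a single readable term rather than a black-box network.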

2504.21614 2026-06-08 cs.CV

Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu

Affiliations: University of Michigan Transportation Research Institute; Karlsruhe Institute of Technology; Texas A&M University

AI summary: Proposes the Mcity Data Engine, which uses open-vocabulary data selection to address the difficulty of detecting long-tail classes in large amounts of unlabeled data, and provides a complete data development cycle from data acquisition to model deployment.

Details
Comments
Accepted for publication at ITSC 2025

English abstract

With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine

2311.00212 2026-06-08 cs.LG cs.NA math.DG math.NA

A Unified Framework to Enforce, Discover, and Promote Symmetry in Machine Learning

一种统一的框架用于在机器学习中强制、发现和促进对称性

Samuel E. Otto, Nicholas Zolman, J. Nathan Kutz, Steven L. Brunton

发表机构 * AI Institute in Dynamic Systems University of Washington(动态系统人工智能研究所华盛顿大学) Sibley School of Mechanical and Aerospace Engineering, Cornell University(机械与航空航天工程学院,康奈尔大学)

AI总结 本文提出统一框架,通过强制已知对称性、发现未知对称性和促进对称性三种方式,将对称性纳入机器学习模型中,基于李导数的数学框架统一了现有结果。

详情
Journal ref
J. Mach. Learn. Res. 26(248):1-83 (2025)
AI中文摘要

对称性在自然界中普遍存在,并在物理和机器学习中扮演越来越重要的角色。基本对称性,如庞加莱不变性,使在地球实验室发现的物理定律能够扩展到宇宙的最远区域。对称性对于在机器学习应用中实现这种扩展能力至关重要。例如,图像分类中的平移不变性使具有较少参数的模型,如卷积神经网络,能够用较小的数据集进行训练并达到最先进的性能。本文提供了一个统一的理论和方法框架,用于在三种方式中将对称性纳入机器学习模型:1. 在训练模型时强制已知对称性;2. 发现给定模型或数据集的未知对称性;3. 通过学习一个模型来促进对称性,该模型在用户指定的候选群中学习时,当数据中有足够证据时会打破对称性。我们证明这些任务可以被一个共同的数学框架所涵盖,其核心对象是与向量丛上的纤维线性李群作用相关的李导数。我们通过展示强制和发现对称性是线性代数任务,并且在李导数的双线性结构下是互为对偶的,扩展并统一了现有的结果。我们还提出了一种新的促进对称性的方式,通过引入基于李导数和核范数松弛的一类凸正则化函数,以在训练机器学习模型时惩罚对称性破坏。我们解释了这些想法如何应用于广泛范围的机器学习模型,包括基函数回归、动态系统发现、神经网络和作用于场的神经算子。

英文摘要

Symmetry is present throughout nature and continues to play an increasingly central role in physics and machine learning. Fundamental symmetries, such as Poincaré invariance, allow physical laws discovered in laboratories on Earth to be extrapolated to the farthest reaches of the universe. Symmetry is essential to achieving this extrapolatory power in machine learning applications. For example, translation invariance in image classification allows models with fewer parameters, such as convolutional neural networks, to be trained on smaller data sets and achieve state-of-the-art performance. In this paper, we provide a unifying theoretical and methodological framework for incorporating symmetry into machine learning models in three ways: 1. enforcing known symmetry when training a model; 2. discovering unknown symmetries of a given model or data set; and 3. promoting symmetry during training by learning a model that breaks symmetries within a user-specified group of candidates when there is sufficient evidence in the data. We show that these tasks can be cast within a common mathematical framework whose central object is the Lie derivative associated with fiber-linear Lie group actions on vector bundles. We extend and unify several existing results by showing that enforcing and discovering symmetry are linear-algebraic tasks that are dual with respect to the bilinear structure of the Lie derivative. We also propose a novel way to promote symmetry by introducing a class of convex regularization functions based on the Lie derivative and nuclear norm relaxation to penalize symmetry breaking during training of machine learning models. We explain how these ideas can be applied to a wide range of machine learning models including basis function regression, dynamical systems discovery, neural networks, and neural operators acting on fields.

2502.08903 2026-06-08 cs.RO cs.AI

3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

面向机器人任务规划的3D定位视觉-语言框架:自动化提示合成与监督推理

Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出融合2D提示合成模块和小语言模型的框架,提升机器人3D场景理解与任务执行能力,实验显示任务成功率高达96.0%。

详情
Journal ref
Engineering Applications of Artificial Intelligence, vol. 164, p. 113268, 2026
AI中文摘要

视觉-语言模型(VLMs)在场景理解和感知任务中取得了显著成功,使机器人能够在动态环境中自适应地规划和执行动作。然而,大多数多模态大语言模型缺乏稳健的3D场景定位能力,限制了其在精细机器人操作中的有效性。此外,低识别精度、低效、差的迁移性和可靠性等挑战阻碍了其在精密任务中的应用。为解决这些限制,我们提出了一种新的框架,该框架整合了一个2D提示合成模块,通过将2D图像映射到点云,以及一个小型语言模型(SLM)来监督VLM的输出。2D提示合成模块使训练于2D图像和文本的VLM能够自主提取精确的3D空间信息,无需人工干预,显著增强了3D场景理解。同时,SLM监督VLM的输出,缓解幻觉并确保可靠的可执行机器人控制代码生成。我们的框架消除了在新环境中重新训练的需要,从而提高了成本效率和操作鲁棒性。实验结果表明,所提出的框架实现了96.0%的任务成功率(TSR),优于其他方法。消融研究证明了2D提示合成模块和输出监督模块的关键作用(当移除时,TSR下降67%)。这些发现验证了框架在提升3D识别、任务规划和机器人任务执行方面的有效性。

英文摘要

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and limited reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.

2501.11592 2026-06-08 cs.LG cs.AI cs.CL

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

无需训练的超小模型用于压缩感知中的通用稀疏重建

Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

发表机构 * School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China(华中科技大学人工智能与自动化学院) China Belt and Road Joint Lab on Measurement and Control Technology, Wuhan, China(中国一带一路测量与控制技术联合实验室) School of Electric and Electrical Engineering, Chongqing University of Technology, Chongqing, China(重庆理工大学电气工程学院) Optics Valley Laboratory, Wuhan, China(光谷实验室) School of Water Conservancy and Transportation, Zhengzhou University, Zhengzhou, China(郑州大学水利与交通学院) School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China(华中科技大学软件工程学院)

AI总结 本文提出无需训练的超小神经模型CL,实现快速稀疏重建,继承传统迭代方法的通用性和可解释性,提升效率和精度。

详情
AI中文摘要

预训练大模型近年来受到广泛关注,但在需要高可解释性或资源有限的应用中面临挑战,如物理传感、医学成像和生物信息学。压缩感知(CS)是已证明的理论,推动了这些应用的许多突破。然而,作为典型的欠定线性系统,CS在使用传统迭代方法时,对大规模数据的稀疏重建时间过长。当前的AI方法如深度展开失败于替代它们,因为预训练模型在超出训练条件和数据分布时泛化性差或缺乏可解释性。本文提出名为系数学习(CL)的超小人工神经模型,实现无需训练的快速稀疏重建,同时完美继承传统迭代方法的泛化性和可解释性,带来融合先验知识的新特性。在CL中,长度为n的信号仅需最少n个可训练参数。一个案例研究模型称为CLOMP用于评估。实验在合成和真实的一维和二维信号上进行,显示了显著的效率和精度提升。与代表性的迭代方法相比,CLOMP在大规模数据上提高了100到1000倍的效率。在八个不同的图像数据集上的测试结果表明,CLOMP在采样率为0.1、0.3、0.5时分别提高了结构相似性指数292%、98%、45%。我们相信这种方法可以真正将CS重建带入AI时代,造福无数依赖稀疏解的欠定线性系统。

英文摘要

Pre-trained large models have attracted widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-proven theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to replace them because pre-trained models exhibit poor generality beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods, and adding the new ability to incorporate prior knowledge. In CL, a signal of length $n$ needs a minimum of only $n$ trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100- to 1000-fold for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves the structural similarity index by 292%, 98%, and 45% for sampling rates of 0.1, 0.3, and 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solutions.

2606.07463 2026-06-08 eess.SP cs.CE cs.LG 新提交

Amortized Neural Optimization for Pre-Layout Signal Integrity Design Space Exploration using Differentiable Surrogates

基于可微代理的布局前信号完整性设计空间探索的摊销神经优化

Julian Withöft, Werner John, Emre Ecik, Ralf Brüning, Jürgen Götze

发表机构 * Information Processing Lab, Faculty for Electrical Engineering and Information Technology, TU Dortmund(多特蒙德工业大学电气工程与信息技术学院信息处理实验室) Pyramide2525, Paderborn, Germany(Pyramide2525,帕德博恩,德国) EMC Technology Center Paderborn, Zuken GmbH, Paderborn, Germany(Zuken GmbH帕德博恩EMC技术中心,帕德博恩,德国)

AI总结 提出摊销神经优化(ANO)框架,利用可微神经网络代理模型替代迭代黑盒优化,实现单次前向传播获取近最优设计参数,在DDR5 DFE、SerDes均衡等场景中加速三到四个数量级。

详情
Comments
16 pages, 20 figures, 8 tables
AI中文摘要

高速信号完整性(SI)分析的布局前设计空间探索(DSE)通常受限于现代电子设计自动化(EDA)工作流程中仿真和迭代优化算法的计算成本。虽然机器学习代理模型加速了仿真步骤,但优化设计仍需利用迭代黑盒搜索方法。这种迭代性质扩展性差,使得多角点扫描计算成本高昂。作为解决方案,本文提出了用于布局前SI设计的摊销神经优化(ANO)。ANO通过利用完全可微的神经网络代理模型,完全消除了迭代黑盒推理。ANO从代理中提取解析梯度,以训练全局优化策略。推理时不再重复求解优化问题,而是离线学习优化过程,从而实现摊销。一旦ANO策略训练完成,它就能在单个确定性前向传播中直接将不同的通道上下文映射到近最优设计参数。基于三个复杂的SI设计场景展示了ANO框架的效率和准确性,包括DDR5决策反馈均衡(DFE)、9维SerDes Tx/Rx联合均衡以及DDR3 DQS差分对布线(在内部对偏斜约束下优化眼图指标)。与实例特定的黑盒算法相比,在牺牲约10%最优性的代价下,实现了三到四个数量级的加速。对于大规模32万实例多角点SerDes扫描优化,ANO将原本需要数天计算时间的迭代搜索算法压缩为一次批量前向传播,毫秒级完成。这将计算昂贵的SI优化转变为实时、交互式的布局前DSE。

英文摘要

Pre-layout design space exploration (DSE) for high-speed signal integrity (SI) analysis is often limited by the computational cost of simulations and iterative optimization algorithms within modern electronic design automation (EDA) workflows. While machine learning surrogate models accelerate the simulation step, optimizing designs still requires utilizing iterative black-box search methods. This iterative nature scales poorly, making multi-corner sweeps computationally expensive. As a solution, this paper proposes amortized neural optimization (ANO) for pre-layout SI design. ANO entirely eliminates iterative black-box inference by utilizing fully differentiable neural network surrogate models. ANO extracts analytical gradients from the surrogate to train a global optimization policy. Instead of solving the optimization problem repeatedly at inference, the optimization process is learned offline and therefore amortized. Once the ANO policy is trained, it maps different channel contexts directly to near-optimal design parameters in a single deterministic forward pass. The efficiency and accuracy of the ANO framework are demonstrated based on three complex SI design scenarios, including DDR5 decision feedback equalization (DFE), 9-dimensional SerDes Tx/Rx co-equalization, and DDR3 DQS differential pair routing to optimize eye diagram metrics under intra-pair skew constraints. By trading roughly 10% in optimality compared to instance-specific black-box algorithms, it realizes speedups of three to four orders of magnitude. For a large-scale 320,000-instance multi-corner SerDes sweep optimization, ANO collapses what would have taken days of computation using iterative search algorithms into a single batched forward pass that completes in milliseconds. This transforms computationally expensive SI optimization into real-time and interactive pre-layout DSE.
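
The amortization idea in this abstract (train a policy once with gradients from a differentiable surrogate, then obtain designs in a single forward pass) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the quadratic `surrogate_cost`, the hidden matrix `A`, and the linear policy `W` are hypothetical stand-ins, not the paper's surrogate models or policy network.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))  # toy ground truth: optimal design is A @ context


def surrogate_cost(x, c):
    """Toy differentiable surrogate: squared distance of design x to A @ c."""
    return np.sum((x - c @ A.T) ** 2, axis=1)


def surrogate_grad(x, c):
    """Analytic gradient of surrogate_cost with respect to the design x."""
    return 2.0 * (x - c @ A.T)


# Offline phase: train a linear policy (design = W @ context) by gradient
# descent, using only gradients obtained from the surrogate.
W = np.zeros((2, 3))
for _ in range(500):
    c = rng.standard_normal((256, 3))              # batch of sampled contexts
    x = c @ W.T                                    # designs proposed by the policy
    grad_W = surrogate_grad(x, c).T @ c / len(c)   # chain rule through the policy
    W -= 0.1 * grad_W

# Online phase: one forward pass per new context, no per-instance search.
new_context = rng.standard_normal((1, 3))
design = new_context @ W.T
```

In this toy setting the trained policy recovers the hidden optimum exactly; with a neural surrogate and a nonlinear policy, the abstract reports an optimality gap of roughly 10% in exchange for the speedup.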

2606.07374 2026-06-08 eess.SP cs.CV 新提交

Beyond Backscatter: InSAR coherence from detected SAR images

超越后向散射:来自检测SAR图像的InSAR相干性

Francescopaolo Sica, Andrea Pulella, Michael Schmitt

发表机构 * Department of Aerospace Engineering, University of the Bundeswehr Munich(慕尼黑联邦国防军大学航空航天工程系) Microwaves and Radar Institute, German Aerospace Center (DLR)(德国航空航天中心 (DLR) 微波与雷达研究所)

AI总结 提出一种深度学习框架,直接从检测SAR图像回归相干性,无需精确配准,使用Residual U-Net学习后向散射幅度与相干性的关系,在多种数据集上验证了高分辨率相干性回归的准确性提升和泛化能力。

详情
Comments
27 pages, 20 figures
AI中文摘要

在这项工作中,我们提出了一个深度学习框架,用于直接从检测SAR图像进行相干性回归,无需精确配准。使用从精确配准的Sentinel-1 SLC数据导出的相干性图训练Residual U-Net,以学习后向散射幅度与相干性之间的关系。模型在12天SLC对上训练,并在不同数据集上进行评估,包括配准的SLC产品和开放存取的分析就绪数据,覆盖不同的辐射特性、几何形状和位置。实验结果表明,与现有的基于强度的方法相比,所提出的方法实现了高分辨率相干性回归,且准确性更高。该网络在多样化的地理位置以及训练时从未见过的不同时间基线之间都能很好地泛化。此外,能够在全球可用的分析就绪数据(例如通过Google Earth Engine分发的地距检测数据)上运行,使其在任务设计、变化监测和多种制图任务中能够大规模应用。

英文摘要

In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A Residual U-Net is trained using coherence maps derived from precisely coregistered Sentinel-1 SLC data to learn the relationship between backscatter magnitudes and coherence. The model is trained on 12-day SLC pairs and evaluated across different datasets, including coregistered SLC products and open-access analysis-ready data, covering diverse radiometric properties, geometries, and locations. Experimental results demonstrate that the proposed method achieves high-resolution coherence regression with improved accuracy compared to existing intensity-based approaches. The network generalizes well across diverse geographical locations and even across temporal baselines that were never seen at training time. Additionally, the ability to operate on globally available analysis-ready data, such as the ground range detected data distributed through Google Earth Engine, enables its large-scale application in mission design, change monitoring, and diverse mapping tasks.
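
For context, the coherence maps used as training targets are conventionally computed from two coregistered complex (SLC) patches with the standard sample coherence estimator. The sketch below shows that textbook estimator on a single window; the function name `coherence`, the 5x5 window, and the synthetic patches are assumptions for illustration, and the abstract does not specify the authors' exact estimation settings.

```python
import numpy as np


def coherence(s1: np.ndarray, s2: np.ndarray) -> float:
    """Sample coherence magnitude of two coregistered complex patches."""
    num = np.abs(np.sum(s1 * np.conj(s2)))
    den = np.sqrt(np.sum(np.abs(s1) ** 2) * np.sum(np.abs(s2) ** 2))
    return float(num / den) if den > 0 else 0.0


rng = np.random.default_rng(0)
patch = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
shifted = patch * np.exp(1j * 0.3)  # same scene with a constant phase offset
noise = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))

high = coherence(patch, shifted)  # close to 1: only the phase differs
low = coherence(patch, noise)     # lower: statistically independent speckle
```

The estimator needs the complex phase of both acquisitions, which detected (magnitude-only) products no longer carry; that is why the paper regresses coherence from backscatter magnitudes instead of computing it directly.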

2606.07259 2026-06-08 eess.AS cs.SD 新提交

Assessing True Generalisability of Audio-Visual Speech Recognisers

评估音视频语音识别器的真正泛化能力

Zhaofeng Lin, Stavros Petridis, Maja Pantic, Naomi Harte

发表机构 * Trinity College Dublin(都柏林三一学院) Imperial College London(伦敦帝国理工学院)

AI总结 通过构建与LRS3测试集严格匹配的评估集,发现当前最先进的音视频语音识别模型在未见数据上性能全面崩溃,揭示了其泛化能力不足,并分析了退化原因、词汇偏差和错误模式。

详情
Comments
Accepted to Interspeech 2026 Long paper track. 9 pages, 4 figures
AI中文摘要

当前的音视频语音识别(AVSR)模型在标准LRS3基准上实现了近乎完美的性能,引发了对自适应过拟合的担忧。为了系统评估真正的泛化能力,我们从大规模MultiVSR数据集中构建了一个高度可控、未见过的评估子集。与标准的分布外基准不同,我们的子集在声学、视觉和人口统计分布上与LRS3测试集严格匹配。评估五种最先进的架构揭示了普遍的性能崩溃,证明当前系统即使在严格对齐的条件下也无法泛化。通过跨七个因素的细粒度属性分析,我们隔离了这种退化的具体驱动因素。此外,我们发现了深刻的词汇偏差,揭示了不同的错误模式,并令人惊讶地发现音视频性能甚至落后于纯音频设置。我们发布了匹配的测试集,用于未来的基准测试。

英文摘要

Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.

2606.07063 2026-06-08 eess.IV cs.CV 新提交

Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition

超越普遍性:GCC-FER数据集及面向动态面部表情识别的文化感知适应

Sonalika Singh, Jyotirindra Dandapat, Avishi Razdan, Kshipra V. Moghe, Puneet Gupta, Lalan Kumar

发表机构 * Department of Electrical Engineering, Indian Institute of Technology Delhi, India(印度理工学院德里分校电气工程系) Department of Computer Science and Engineering, Indian Institute of Technology Indore, India(印度理工学院印多尔分校计算机科学与工程系) Department of Psychology, COEP Technological University, India(COEP技术大学心理学系)

AI总结 针对动态面部表情识别中文化差异被忽视的问题,提出首个大规模全球跨文化数据集GCC-FER,并设计文化感知适应系统CA-FER,通过自适应校准面部表示减轻文化偏差,实验证明其有效性。

详情
AI中文摘要

动态面部表情识别(DFER)是情感计算、人机交互和智能多媒体系统中的关键使能技术。尽管文化细微差别对FER性能有显著影响,但大多数现有FER系统假设情感表达在人群中普遍一致。这种差异可归因于不同文化中面部肌肉激活模式的系统性差异。推进跨文化FER的主要挑战在于缺乏文化多样性的基准数据集。为解决这一问题,本文引入了一个名为全球跨文化面部表情识别(GCC-FER)的新型混合多元文化视频数据集。GCC-FER包含跨越四种文化群体(非洲、高加索、东亚和南亚)的23,934个视频样本,涵盖七种基本表情,结合了对代表性不足人群的心理学家监督内部数据收集以及对现有来源的严格种族过滤。据我们所知,GCC-FER是首个旨在解决这些人口统计差距的大规模全球跨文化DFER数据集。利用该数据集,为每个文化群体推导出基于行为的文化先验,并为实际部署推导出全局先验。提出了一种文化感知FER(CA-FER)系统,通过自适应重新校准潜在面部表示来减轻文化偏差。在GCC-FER和DFEW上的大量实验表明,所提系统在多文化环境下持续提高了FER性能。

英文摘要

Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. This variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group and a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.

2606.06907 2026-06-08 eess.AS cs.AI cs.SD 新提交

SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

SpectCount: 通过合成信号进行频谱时间计数改进大型音频语言模型

Seonuk Kim, Yonghyeon Jun, Ju Yeon Kang, Jimin Hong, Yoonhyeong Lee, Nam Soo Kim

发表机构 * Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea(电气与计算机工程系和INMC,首尔国立大学,首尔,韩国)

AI总结 针对大型音频语言模型在频谱时间感知上的弱点,提出SpectCount方法,利用动态生成的完全合成音频信号进行数据高效微调,无需真实音频或标注,显著提升多种听觉基准性能。

详情
Comments
5 pages, 5 figures
AI中文摘要

大型音频语言模型(LALMs)通过音频编码器和大规模音频数据扩展了大型语言模型。然而,高质量标注音频数据的稀缺性仍然是扩展的根本瓶颈。通过探测信号可检测性分析,我们识别出基础LALM在细粒度频谱时间感知上的弱点。为了解决这些挑战,我们提出频谱时间计数(SpectCount),一种基于动态生成的完全合成音频信号的数据高效微调方法,无需依赖真实世界音频、标注或预训练生成模型。SpectCount不仅解决了观察到的弱点,还在微调期间未见的声音、音乐和语音等多种听觉基准上提升了性能。这些结果表明,针对弱点的合成信号为LALMs增强听觉理解能力提供了一条数据高效的途径。

英文摘要

Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.

2606.06847 2026-06-08 eess.IV cs.CV 新提交

Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images

SAR图像中飞机目标的物理驱动语义散射结构理解

Yifei Yin, Xiaogang Yu, Hao Shi, Liang Chen, Wei Li

发表机构 * School of Information and Electronics, Beijing Institute of Technology(信息与电子学院,北京理工大学) National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing(空间智能信息处理国家级重点实验室) Beijing Institute of Remote Sensing Information(遥感信息北京市研究院)

AI总结 针对SAR图像中飞机目标散射中心表示不稳定、弱散射部件缺失的问题,提出物理驱动框架S3U-SAR,通过定义语义散射关键点并利用多维物理先验约束,实现完整拓扑结构重建,在基准数据集上取得最优性能。

详情
AI中文摘要

合成孔径雷达(SAR)因其全天时、全天候观测能力,已成为目标解译不可或缺的手段。在SAR目标解译中,电磁散射信息提供了超越视觉纹理的物理基础线索,并被广泛用于目标解译。然而,现有方法仍以局部散射中心表示为主。这种无序且与部件无关的表示对飞机目标极不稳定。因此,物理存在的弱散射响应部件常被遗漏,导致重建的拓扑结构不完整。为解决这一局限,我们建立了语义散射结构理解作为SAR飞机解译的新范式。定义语义散射关键点以将局部电磁响应与物理上有意义的飞机部件关联,同时引入可见性感知属性以保留弱可观测但物理存在的部件。关键点进一步组织为稳定的语义散射结构。基于此,我们提出S3U-SAR,一个物理驱动框架,用于定位语义散射关键点并构建由多维物理先验(包括散射异质性、刚体拓扑、散斑不确定性)约束的完整表示。进一步引入置信门控联合监督策略以缓解优化冲突。我们构建了KP-SAR-Aircraft-1.0,首个用于语义散射结构理解的细粒度基准。大量实验表明,S3U-SAR相比基线取得了最佳性能。跨类别和跨数据集评估进一步验证了其鲁棒性和可迁移性。

英文摘要

Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically present components with weak scattering responses are often missed, leaving the reconstructed topology incomplete. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically present components. The keypoints are further organized into a stable semantic scattering structure. Building on this, we propose S3U-SAR, a physics-driven framework that localizes semantic scattering keypoints and constructs a complete representation constrained by multi-dimensional physical priors, including scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR achieves the best performance compared with baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.

2606.06837 2026-06-08 eess.AS cs.LG 新提交

SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails

SEAM:面向面试防护栏的脚本化与自发语音的快捷方式感知实时检测

Vsevolod Kovalev, Pranay Manocha

发表机构 * Symbal AI Princeton University(普林斯顿大学)

AI总结 提出SEAM框架,通过统一预处理、接缝感知采样、非语音增强和紧凑DistilHuBERT骨干,在8秒窗口下实现0.971 ROC-AUC,并揭示快捷方式学习问题。

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

脚本化与自发语音检测对面试防护栏具有吸引力,但基准性能可能因与语料库身份、信道条件和录音伪影相关的快捷方式(而非说话风格本身)而膨胀。我们提出SEAM,一个用于实时脚本化检测的快捷方式感知框架,结合了统一预处理、接缝感知采样、非语音增强和紧凑的DistilHuBERT骨干。使用8秒窗口,该模型在外部面试领域评估集上达到0.971 ± 0.004的ROC-AUC。移除快捷方式预防组件可改善内部留出指标,但急剧降低外部性能,表明存在快捷方式学习。训练后量化将模型占用减少至41.8MB,且外部性能损失很小。结果表明,鲁棒的实时脚本化检测不仅依赖于骨干网络,还依赖于快捷方式感知的数据设计和评估。我们发布代码和模型检查点。

英文摘要

Scripted vs. spontaneous speech detection is appealing for interview guardrails, but benchmark performance can be inflated by shortcuts tied to corpus identity, channel conditions, and recording artifacts rather than speaking style itself. We present SEAM, a shortcut-aware framework for real-time scriptedness detection that combines uniform preprocessing, seam-aware sampling, non-speech augmentation, and a compact DistilHuBERT backbone. With 8 s windows, the model achieves 0.971 ± 0.004 ROC-AUC on an external interview-domain evaluation set. Removing the shortcut-prevention components improves internal held-out metrics but sharply reduces external performance, indicating shortcut learning. Post-training quantization reduces the model footprint to 41.8 MB with little loss in external performance. The results demonstrate that robust real-time scriptedness detection depends not only on the backbone, but on shortcut-aware data design and evaluation. We release code and model checkpoints.

2606.06795 2026-06-08 eess.AS cs.SD 新提交

BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation

BiEAR: 一种受人类听觉启发的自适应双耳前端,用于多说话人定位和距离估计

Hanyu Meng, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Qiquan Zhang, Haizhou Li

发表机构 * The University of New South Wales(新南威尔士大学) Tongyi Speech Lab, Alibaba Group(通义语音实验室,阿里巴巴集团) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(人工智能学院,香港中文大学(深圳))

AI总结 提出受人类听觉启发的自适应双耳前端BiEAR,通过神经控制器动态调整滤波器组频率选择性,提升多说话人定位和距离估计的准确性与鲁棒性。

详情
Comments
Accepted to INTERSPEECH 2026
AI中文摘要

我们提出BiEAR,一种受人类听觉启发的自适应双耳前端,用于多说话人定位和距离估计。受人类听觉中内侧橄榄耳蜗(MOC)反馈的启发,BiEAR使用神经控制器在推理过程中自适应调整双耳听觉滤波器组的频率选择性。这为双耳产生时频自适应表示,使模型能够响应变化的声学条件。我们在消声和真实房间环境中评估了BiEAR在多说话人定位和距离估计上的性能。结果表明,与常用的固定双耳前端相比,自适应前端提高了定位准确性以及对未见说话人和房间的鲁棒性。对学习到的滤波器自适应的可视化和分析表明,BiEAR随时间强调信息丰富的频带。这些发现表明,自适应的、受生物启发的双耳前端可以改善机器在复杂声学场景中的听觉鲁棒性。

英文摘要

We present BiEAR, a human auditory-inspired adaptive binaural front-end for multi-speaker localisation and distance estimation. Inspired by medial olivocochlear (MOC) feedback in human hearing, BiEAR uses a neural controller to adaptively adjust the frequency selectivity of a binaural auditory filterbank during inference. This yields time-frequency adaptive representations for both ears, enabling the model to respond to changing acoustic conditions. We evaluate BiEAR on multi-speaker localisation and distance estimation in anechoic and real-room environments. Results show that the adaptive front-end improves localisation accuracy and robustness to unseen speakers and rooms compared with commonly used fixed binaural front-ends. Visualisation and analysis of learned filter adaptations show that BiEAR emphasises informative frequency bands over time. These findings suggest that adaptive, biologically inspired binaural front-ends can improve machine hearing robustness in complex acoustic scenes.

2606.06725 2026-06-08 eess.IV cs.CV 新提交

Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws

基于神经缩放定律的超声心动图心肌分割与灌注量化的计算最优网络设计

Clara Rodrigo González, Matthieu Toulemonde, Lasha Gvinianidze, Cameron A. B. Smith, Oscar Bates, Roxy Senior, Fu Siong Ng, Meng-Xing Tang

发表机构 * Department of Bioengineering, Imperial College London(伦敦帝国理工学院生物工程系) National Heart and Lung Institute, Imperial College London(伦敦帝国理工学院国家心肺研究所) Guy’s and St. Thomas’ NHS Foundation Trust(盖伊和圣托马斯NHS信托基金会)

AI总结 应用神经缩放定律预测心肌分割性能,在CAMUS和CEUS数据集上确定最优网络大小,实现参数减少240倍且性能达最优,自动分割在心肌灌注量化中与资深心脏病专家等效。

详情
Comments
15 pages, 4 figures, 5 tables, journal
AI中文摘要

使用对比增强超声进行心肌灌注量化提供了一种床旁非电离替代核成像模态的方法。然而,其临床采用受到耗时的手动标注的限制。由于域内训练数据匮乏,自动分割已被证明具有挑战性。我们应用当前用于优化大数据集上大型语言模型的策略,将神经缩放定律应用于预测心肌分割的网络性能。我们在数据子集上外推性能,以确定CAMUS超声心动图数据集和25名患者的对比增强超声(CEUS)数据集上的最优网络大小。最后,通过将最终心肌灌注参数与资深心脏病专家获得的参数进行比较,验证了我们模型的临床实用性。基于缩放定律的外推能够预测完整数据集大小下的测试损失,使我们能够选择两个网络,在CAMUS上以240倍的参数减少获得最先进性能。我们观察到缩放定律的梯度从CAMUS迁移到CEUS数据集,但预测损失存在偏差。自动分割的掩膜在心肌灌注量化中与资深心脏病专家表现相当。这些结果确立了神经缩放定律作为小成像数据集上数据驱动计算最优模型设计的实用工具。

英文摘要

Myocardial perfusion quantification using contrast-enhanced ultrasound offers a bedside non-ionizing alternative to nuclear imaging modalities. However, its clinical adoption is hindered by time-consuming manual labelling. Automated segmentation has proved challenging due to a paucity of in-domain training data. Adapting strategies currently used to optimise large language models for large datasets, we apply neural scaling laws to predict network performance for myocardial segmentation. We extrapolate performance on subsets of the data to determine optimal network size on the CAMUS echocardiography dataset and a 25-patient contrast-enhanced ultrasound (CEUS) dataset. Finally, we validate the clinical utility of our models by comparing the final myocardial perfusion parameters with those obtained by a senior cardiologist. Extrapolation based on the scaling law is predictive of test loss at the full dataset size, allowing us to select two networks that obtained state-of-the-art performance on CAMUS with a 240-fold reduction in parameter count. We observe the gradient of the scaling law transfers from CAMUS to the CEUS dataset with a bias in the predicted losses. The automatically segmented masks perform equivalently to a senior cardiologist in myocardial perfusion quantification. These results establish neural scaling laws as a practical tool for data-driven compute-optimal model design for small imaging datasets.
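
The extrapolation step described in this abstract can be sketched as a power-law fit in log-log space. This is a minimal illustration under stated assumptions: the pure power-law form (no irreducible-loss term), the function names, and the subset sizes and losses are all hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-b) by least squares on log-transformed data."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(a, b, size):
    """Extrapolate the fitted power law to a new dataset size."""
    return a * size ** (-b)


# Hypothetical subset sizes and test losses, generated for illustration only.
sizes = np.array([50.0, 100.0, 200.0, 400.0])
losses = 2.0 * sizes ** (-0.5)

a, b = fit_power_law(sizes, losses)
full_size_loss = predict_loss(a, b, 1600.0)  # predicted loss at the full size
```

Repeating such a fit for several candidate network sizes and comparing the extrapolated losses is one way to choose a network before training on the full dataset. The abstract's remark that the gradient transfers between datasets while the predicted losses are biased would correspond, in this notation, to a shared exponent `b` with a different prefactor `a`.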

2606.06540 2026-06-08 eess.IV cs.CV 新提交

ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring

ErA:用于单图像散焦去模糊的误差感知深度展开网络

Tu Vo, Chan Y. Park

发表机构 * KC Machine Learning Lab(KC机器学习实验室)

AI总结 提出ErA网络,通过联合学习紧凑核基和逐像素权重,并利用增广拉格朗日展开中的误差感知项交替更新和ResUNet去噪器校正核估计误差,在多个数据集上达到最优性能。

详情
AI中文摘要

我们提出了ErA(误差感知深度展开网络),一个用于单图像散焦去模糊的端到端框架。ErA联合学习一个紧凑的核基和逐像素权重,同时增广拉格朗日展开中的一个误差感知项通过交替更新和ResUNet去噪器校正核估计误差。它在DPDD、RealDOF和RTF上达到了最先进的PSNR/SSIM,并在没有真实数据的CUHK上显示出强大的泛化能力。

英文摘要

We introduce ErA (Error-Aware Deep Unrolling Network), an end-to-end framework for single-image defocus deblurring. ErA jointly learns a compact kernel basis and per-pixel weights, while an error-aware term in Augmented Lagrangian unrolling corrects kernel estimation errors via alternating updates and ResUNet denoisers. It achieves state-of-the-art PSNR/SSIM on DPDD, RealDOF, and RTF, and shows strong generalization on CUHK without ground truth.

2606.06534 2026-06-08 eess.IV cs.AI 新提交

Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models

基于视觉基础模型的注意力一致纵向医学视觉问答

Jialin Wu, Qianru Zhang, Georges El Fakhri, Xiaofeng Liu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Yale Biomedical Imaging Institute(耶鲁大学生物医学成像研究所)

AI总结 提出一种注意力引导的编码器-解码器框架,通过轻量级配准和自适应掩码生成,结合辅助损失函数,实现胸部X光片的纵向医学视觉问答,在Medical-Diff-VQA基准上取得优异性能。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 6448-6458
Comments
Accepted to CVPR 2026 Workshop PHAROS-AIF-MIH
AI中文摘要

纵向医学视觉问答(VQA)需要推理当前时间点图像与参考时间点图像之间的解剖差异。我们针对胸部X光片提出了一种注意力引导的编码器-解码器。与传统的直接对比不同,我们引入了一个轻量级仿射配准模块,通过小配准正则化将当前图像与参考图像进行共配准,以减少无关运动。配准后的图像对输入图像编码器,随后通过冻结的DINO掩码生成器和可训练的自适应掩码生成器生成应用于原始图像对的掩码。掩码图像对再次输入图像编码器,并与文本特征拼接,作为基于多模态Transformer的解码器的输入以生成最终答案。为了促进学习稳定并澄清变化信号,受DINO-v3启发,我们加入了额外的辅助目标,包括掩码重建损失、成对Gram风格一致性损失和KoLeo均匀性损失,以增强表示的几何结构。在Medical-Diff-VQA基准上,该模型获得了强大的BLEU、ROUGE-L、CIDEr和METEOR分数,同时通过共享显著性掩码提供了内在的可解释性。这些结果支持将显著性条件生成与轻度预对齐作为医学VQA中纵向推理的原则性框架。我们的训练策略也展示了在生物医学中利用图像基础模型的范式潜力:同时优化监督和无监督学习目标。

英文摘要

Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image from the current time point and an image from a reference time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we include a lightweight affine registration module that reduces nuisance motion by co-registering the current image to the reference image with a small registration regularizer. The registered image pair is fed into the image encoder, followed by a frozen DINO-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer-based decoder to generate final answers. To stabilize learning and clarify the change signal, inspired by DINO-v3, we include additional auxiliary objectives, namely a mask rebuilding loss, a pairwise Gram-style consistency loss, and a KoLeo uniformity loss, which enhance the geometry of the representation. On the Medical-Diff-VQA benchmark, the model delivers strong BLEU, ROUGE-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA. Our training strategy also illustrates the potential of a paradigm for utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously.

2606.06524 2026-06-08 eess.IV cs.CV cs.LG 新提交

Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery

基于物理引导深度学习的先进洪水预测:结合UNet、FNO与SAR/光学影像

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

发表机构 * National Center for Atmospheric Research (NCAR)(国家大气研究中心)

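AI summary: Proposes a physics-guided deep learning framework that fuses multi-modal remote sensing with shallow water equation constraints; a hybrid UNet-FNO architecture achieves high-accuracy flood prediction, with an IoU of 0.82 and an F1 score of 0.90.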

详情
Comments
This paper has been accepted for publication in the Proceedings of the IEEE Radar Conference (RadarConf 2026). The final authenticated version will be available through IEEE Xplore
AI中文摘要

由于地面观测有限、地形条件异质以及数据驱动模型中难以强制执行水动力学一致性,准确且可扩展的洪水测绘仍然具有挑战性。本文介绍了一种物理引导的深度学习框架,该框架集成了多模态遥感(Sentinel-1 SAR、Sentinel-2光学影像和DEM衍生的地形特征)与深度平均浅水方程(SWE)的约束。所提出的混合架构结合了用于捕捉精细尺度空间细节的UNet和用于模拟流域尺度水力相互作用的傅里叶神经算子(FNO),而物理信息残差损失确保了质量和动量一致性。在多种洪泛区环境下评估,混合模型在洪水范围预测中实现了0.82的交并比和0.90的F1分数,优于仅使用UNet和仅使用FNO的基线模型。以水动力学模拟作为参考数据,该模型在水深方面实现了0.21米的均方根误差,在流速方面实现了0.15米/秒的均方根误差。物理一致性得以保持,残差低且质量不平衡低于2.1%。消融研究证实,去除基于物理的正则化会显著降低性能,突显了物理约束对稳定性和泛化能力的价值。这些结果表明,将水动力学原理嵌入深度学习可产生更准确、可靠且物理一致的洪水预测,为业务监测和大规模部署提供了巨大潜力。

英文摘要

Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.
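The two flood-extent metrics quoted above can be illustrated with a minimal sketch, assuming binary flood masks flattened to 0/1 sequences. The function names and the toy masks are hypothetical; the paper's actual evaluation code and averaging scheme are not described in the abstract.

```python
def flood_extent_scores(pred, truth):
    """Return (IoU, F1) for two equally sized binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # flooded in both
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # predicted only
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # reference only
    union = tp + fp + fn
    if union == 0:
        # Both masks empty: treated here as perfect agreement (an assumed convention).
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


def f1_from_iou(iou):
    """For a single binary mask pair, F1 (Dice) equals 2 * IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)


# Hypothetical 8-pixel masks: 3 true positives, 1 false positive, 1 false negative.
pred_mask = [1, 1, 1, 0, 0, 1, 0, 0]
true_mask = [1, 1, 0, 0, 0, 1, 1, 0]
iou, f1 = flood_extent_scores(pred_mask, true_mask)
```

When both metrics are computed over the same pixels, an IoU of 0.82 corresponds to an F1 of about 0.90, which agrees with the reported pair. If the scores were instead averaged per scene, this identity would hold only approximately.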

2606.06509 2026-06-08 eess.IV cs.AI cs.LG q-bio.TO 新提交

Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

在有限标签下哪些解剖结构重要?用于心脏病理预测的数据高效解剖感知基准

Himanshu Singh

发表机构 * Himanshu Singh(希曼斯·辛格)

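AI summary: For medical imaging problems under limited labels and constrained compute, proposes an anatomy-aware benchmark that compares representations of different anatomical structures across classifiers, and finds that representation quality matters more than model complexity.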

详情
Comments
ACCEPTED at ICML 2026 Workshop GlobalSouthML (Seoul, South Korea; PMLR 306, 2026)
AI中文摘要

许多医学影像问题必须在有限标签和受限计算条件下解决,然而性能提升主要来自更具表达力的模型还是对临床有意义解剖结构的更好表示,目前尚不清楚。我们通过一个低数据解剖感知基准来研究这个问题,该基准用于在公共ACDC MRI数据集上进行5类心脏病理预测。利用来自右心室、心肌和左心室的分割衍生患者描述符,我们在线性、核和基于树的分类器上比较了特定解剖结构和多结构表示。我们发现,在有限标签设置下,表示主导复杂度。这些结果表明,在资源受限的医疗环境中,识别和表示最具信息量的解剖结构可能比单纯增加模型复杂度更重要。

英文摘要

Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically meaningful anatomy. We study this question through a low-data anatomy-aware benchmark for 5-class cardiac pathology prediction on the public ACDC MRI dataset. Using segmentation-derived patient descriptors from the right ventricle, myocardium, and left ventricle, we compare anatomy-specific and multi-structure representations across linear, kernel, and tree-based classifiers. We find that under limited label settings, representation dominates complexity. These results suggest that in resource-constrained healthcare settings, identifying and representing the most informative anatomy may matter more than the increasing complexity of the model alone.