arXivDaily: arXiv daily academic digest, updated Monday through Friday
2409.13477 2026-06-08 eess.IV cs.CV physics.med-ph

A Plug-and-Play Method for Guided Multi-contrast MRI Reconstruction based on Content/Style Modeling

Chinmay Rao, Matthias van Osch, Nicola Pezzotti, Jeroen de Bresser, Mark van Buchem, Laurens Beljaards, Jakob Meineke, Elwin de Weerdt, Huangling Lu, Mariya Doneva, Marius Staring

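Affiliations: University of Amsterdam; Erasmus University Rotterdam; Erasmus University Medical Center; University of Utrecht

AI summary: Proposes PnP-CoSMo, a modular plug-and-play method that needs no k-space training data. It uses content/style disentanglement so that a reference scan can guide the reconstruction of an undersampled contrast, matches or exceeds end-to-end methods on public and in-house datasets, and enables higher acceleration.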

Abstract

Since the various MR contrasts of a given anatomy contain redundant information, one contrast can be used to guide the reconstruction of another undersampled contrast acquired subsequently in the same session. To solve this reconstruction problem leveraging multi-contrast side information, several end-to-end learning-based methods have been proposed. However, a key challenge is the requirement for large paired training datasets comprising raw k-space data and aligned reference images. We propose a modular plug-and-play method, which requires no k-space training data and relies solely on partially paired image-domain datasets. A content/style model of two-contrast MR image data is first learned and subsequently applied as a plug-and-play operator in iterative reconstruction. The disentanglement of content and style allows explicit representation of contrast-independent and contrast-specific factors. Consequently, incorporating prior information into the reconstruction reduces to a simple replacement operation on the aliased content of the estimated image using high-quality content derived from the reference scan. Combining this operation with an MR data consistency step, followed by a corrective procedure for the content estimate, yields an iterative scheme. We name this novel approach PnP-CoSMo. It offers, by design, cross-contrast generalizability and provides an explanatory framework based on the shared and non-shared generative factors underlying the two given contrasts. We explore various aspects, including interpretability and convergence, via simulations. Furthermore, its practicality is demonstrated on the public NYU fastMRI DICOM dataset, showing equivalent or superior quality and greater generalizability compared to end-to-end methods. On two in-house multi-coil datasets, PnP-CoSMo enabled up to 32.6% greater acceleration over non-guided reconstruction at given SSIM.
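
The data consistency step mentioned in the abstract can be illustrated with a toy sketch. It assumes a single-coil, 1D Cartesian acquisition and a naive DFT; the function names, the sampling mask, and the `prior_guided_estimate` values are hypothetical, and the learned content/style operator of PnP-CoSMo is not reproduced here.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a 1D sequence (O(n^2), fine for a toy)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def data_consistency(estimate, measured_kspace, mask):
    """Overwrite the sampled k-space entries of the estimate with the measurements."""
    k_est = dft(estimate)
    k_dc = [m if sampled else e
            for e, m, sampled in zip(k_est, measured_kspace, mask)]
    return idft(k_dc)

# Toy data: a "true" signal, undersampled by keeping every other k-space line.
truth = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 1.5, 2.5]
mask = [k % 2 == 0 for k in range(len(truth))]
measured = [v if s else 0j for v, s in zip(dft(truth), mask)]

# Hypothetical output of a prior-based step (assumption, not from the paper).
prior_guided_estimate = [0.1, 0.9, 2.8, 2.1, 0.4, 0.1, 1.4, 2.6]
result = data_consistency(prior_guided_estimate, measured, mask)
```

After this step the estimate agrees with the acquired data wherever k-space was sampled, while the unsampled entries keep whatever the prior step proposed.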

2507.12878 2026-06-08 eess.SP cs.LG stat.ML

Bayesian Modeling and Estimation of Linear Time-Varying Systems using Neural Networks and Gaussian Processes

Yaniv Shulman

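Affiliations: Shulman.info

AI summary: Proposes a unified Bayesian framework that models the system impulse response as a stochastic process and uses variational inference with Gaussian processes for robust estimation of linear time-varying systems.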

Abstract

The identification of Linear Time-Varying (LTV) systems from input-output data is a fundamental yet challenging ill-posed inverse problem. This work introduces a unified Bayesian framework that models the system's impulse response, $h(t, τ)$, as a stochastic process. We decompose the response into a posterior mean and a random fluctuation term, a formulation that provides a principled approach for quantifying uncertainty, unifies intrinsic channel variability and epistemic uncertainty through a common posterior representation, and naturally defines a new, useful system class we term Linear Time-Invariant in Expectation (LTIE). To perform inference, we leverage modern machine learning techniques, including Bayesian neural networks and Gaussian Processes, using scalable variational inference. We demonstrate through a series of experiments that our framework can infer the properties of an LTI system from a single noisy input-output pair, including under deliberate additive-noise misspecification, achieve a lower overall error floor than the classical CCF stacking baseline in a simulated ambient noise tomography setting, and track a continuously varying LTV impulse response by using a structured Gaussian Process prior. This work provides a flexible and robust methodology for uncertainty-aware system identification in dynamic environments.
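
To make the object $h(t, τ)$ concrete, here is a minimal discrete-time sketch of how a time-varying impulse response maps an input to an output. It assumes the convention that `h[n][k]` is the response at time `n` to an impulse applied `k` samples earlier; the kernels and signal values are made up for illustration, and no Bayesian inference is performed.

```python
def ltv_output(h, x):
    """y[n] = sum_k h[n][k] * x[n - k] for a causal, time-varying kernel h."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h_nk in enumerate(h[n]):
            if n - k >= 0:
                acc += h_nk * x[n - k]
        y.append(acc)
    return y

x = [1.0, 0.0, 2.0, -1.0]

# LTI special case: every time step uses the same kernel.
kernel = [0.5, 0.25]
h_lti = [kernel for _ in x]
y_lti = ltv_output(h_lti, x)

# LTV case: the leading tap drifts over time (hypothetical drift).
h_ltv = [[0.5 * (1 + 0.1 * n), 0.25] for n in range(len(x))]
y_ltv = ltv_output(h_ltv, x)
```

When all rows of `h` are identical the computation reduces to an ordinary convolution, which is the time-invariant case the abstract refers to.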

2603.14573 2026-06-08 cond-mat.dis-nn cs.LG math.PR

Rigorous Asymptotics for First-Order Algorithms Through the Dynamical Cavity Method

Yatin Dandi, David Gamarnik, Francisco Pernice, Lenka Zdeborová

Affiliations: Statistical Physics of Computation Laboratory, École polytechnique fédérale de Lausanne (EPFL); Sloan School of Management, Operations Research Center and Institute of Data, Systems and Society (IDSS), MIT; CSAIL and LIDS, MIT

AI summary: Rigorously formalizes the dynamical cavity method and uses it to derive the dynamical equations of first-order algorithms such as gradient descent and approximate message passing, giving a mathematical foundation to a traditionally non-rigorous technique.

Journal ref
COLT 2026
Abstract

Dynamical Mean Field Theory (DMFT) provides an asymptotic description of the dynamics of macroscopic observables in certain disordered systems. Originally pioneered in the context of spin glasses by Sompolinsky and Zippelius (1982), it has since been used to derive asymptotic dynamical equations for a wide range of models in physics, high-dimensional statistics and machine learning. One of the main tools used by physicists to obtain these equations is the dynamical cavity method, which has remained largely non-rigorous. In contrast, existing mathematical formalizations have relied on alternative approaches, including Gaussian conditioning, large deviations over paths, or Fourier analysis. In this work, we formalize the dynamical cavity method and use it to give a new proof of the DMFT equations for General First Order Methods, a broad class of dynamics encompassing algorithms such as Gradient Descent and Approximate Message Passing.

2602.10680 2026-06-08 stat.ML cond-mat.dis-nn cs.LG

A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization

Vicente Conde Mendes, Lorenzo Bardone, Cédric Koller, Jorge Medina Moreira, Vittorio Erba, Emanuele Troiani, Lenka Zdeborová

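Affiliations: Statistical Physics of Computation Laboratory, École polytechnique fédérale de Lausanne (EPFL)

AI summary: Introduces a high-dimensional model showing that nonlinear autoencoders can learn structure that linear methods such as PCA cannot capture, even though their test loss is misaligned with generalization.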

Journal ref
ICML 2026
Abstract

Many real-world datasets contain hidden structure that cannot be detected by simple linear correlations between input features. For example, latent factors may influence the data in a coordinated way, even though their effect is invisible to covariance-based methods such as PCA. In practice, nonlinear neural networks often succeed in extracting such hidden structure in unsupervised and self-supervised learning. However, constructing a minimal high-dimensional model where this advantage can be rigorously analyzed has remained an open theoretical challenge. We introduce a tractable high-dimensional spiked model with two latent factors: one visible to covariance, and one statistically dependent yet uncorrelated, appearing only in higher-order moments. PCA and linear autoencoders fail to recover the latter, while a minimal nonlinear autoencoder provably extracts both. We analyze both the population risk, and empirical risk minimization. Our model also provides a tractable example where self-supervised test loss is poorly aligned with representation quality: nonlinear autoencoders recover latent structure that linear methods miss, even though their reconstruction loss is higher.
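
The phrase "statistically dependent yet uncorrelated, appearing only in higher-order moments" can be illustrated with a toy example. This is not the paper's spiked model: it simply assumes a standard Gaussian `z` and the derived variable `u = z^2 - 1`, which is fully determined by `z` but has zero covariance with it.

```python
import random

random.seed(0)
n = 100_000
z = [random.gauss(0.0, 1.0) for _ in range(n)]
u = [v * v - 1.0 for v in z]  # deterministic function of z, mean zero

def mean(values):
    return sum(values) / len(values)

# Second-order statistic: Cov(z, u) = E[z^3] - E[z] = 0 in theory.
cov_zu = mean([a * b for a, b in zip(z, u)]) - mean(z) * mean(u)

# Higher-order statistic: E[z^2 * u] = E[z^4] - E[z^2] = 3 - 1 = 2 in theory.
higher_moment = mean([a * a * b for a, b in zip(z, u)])
```

A covariance-based method only sees `cov_zu`, which is close to zero, so it cannot detect the relationship that the higher-order moment reveals.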

2509.24914 2026-06-08 stat.ML cond-mat.dis-nn cs.IT cs.LG math.IT

Single-Head Attention in High Dimensions: A Theory of Generalization, Weights Spectra, and Scaling Laws

Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Yizhou Xu, Florent Krzakala, Lenka Zdeborová

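Affiliations: Statistical Physics of Computation Laboratory, École polytechnique fédérale de Lausanne (EPFL); Information, Learning and Physics Laboratory, École polytechnique fédérale de Lausanne (EPFL)

AI summary: Studies the spectral structure of the weights of attention layers trained on high-dimensional sequence tasks. Using random matrix theory and related tools, it characterizes the training error, the interpolation threshold, and the spectrum of the key and query matrices in high dimensions, and predicts the emergence of power-law scaling laws.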

Journal ref
ICML 2026
Abstract

Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.

2511.09568 2026-06-08 physics.chem-ph cs.AI cs.CV

VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing

Peining Zhang, Jinbo Bi, Minghu Song

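Affiliations: University of Science and Technology of China

AI summary: VEDA combines variance-exploding diffusion with annealing and an SE(3)-equivariant architecture to efficiently generate accurate 3D molecular structures, achieving both high chemical accuracy and computational efficiency.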

Abstract

Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective, and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA's generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
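
For readers unfamiliar with variance-exploding (VE) diffusion, the sketch below shows the generic forward perturbation and a standard geometric noise schedule. It is an assumption-laden illustration: the `sigma_min` and `sigma_max` values and the toy coordinates are hypothetical, and VEDA's arcsin-based scheduler and preconditioning are not reproduced because the abstract does not specify them.

```python
import random

def ve_sigmas(sigma_min, sigma_max, num_steps):
    """Geometric noise levels from sigma_max down to sigma_min (sampling order)."""
    ratio = sigma_min / sigma_max
    return [sigma_max * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

def perturb(coords, sigma, rng):
    """VE forward perturbation: x_sigma = x_0 + sigma * eps, eps ~ N(0, I)."""
    return [[c + sigma * rng.gauss(0.0, 1.0) for c in atom] for atom in coords]

# 100 steps, matching the step count quoted in the abstract.
sigmas = ve_sigmas(sigma_min=0.01, sigma_max=10.0, num_steps=100)

rng = random.Random(0)
toy_molecule = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]  # toy 3-atom geometry
noisy = perturb(toy_molecule, sigmas[0], rng)
```

In a VE process the signal is left unscaled and only the noise level grows, so sampling starts from very noisy coordinates and gradually lowers `sigma`, which is the sense in which the abstract compares it to annealing.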

2511.03898 2026-06-08 cs.CR cs.AI cs.CE cs.SE

Secure Code Generation at Scale with Reflexion

Arup Datta, Ahmed Aljohani, Hyunsook Do

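Affiliations: University of California, Berkeley

AI summary: Evaluates Instruct Prime and reflexion prompting for improving code security, finding that reflexion prompting substantially improves security, with the largest gain in the first round.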

Comments
Accepted for publication at the 2nd IEEE International Conference on AI-powered Software (AIware 2025)
Abstract

Large language models (LLMs) are now widely used to draft and refactor code, but code that works is not necessarily secure. We evaluate secure code generation using the Instruct Prime, which eliminated compliance-required prompts and cue contamination, and evaluate five instruction-tuned code LLMs using a zero-shot baseline and a three-round reflexion prompting approach. Security is measured using the Insecure Code Detector (ICD), and results are reported by measuring Repair, Regression, and NetGain metrics, considering the programming language and CWE family. Our findings show that insecurity remains common at the first round: roughly 25-33% of programs are insecure at a zero-shot baseline (t0). Weak cryptography/config-dependent bugs are the hardest to avoid, while templated ones like XSS, code injection, and hard-coded secrets are handled more reliably. Python yields the highest secure rates; C and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion prompting improves security for all models, raising average accuracy from 70.74% at t0 to 79.43% at t3, with the largest gains in the first round followed by diminishing returns. The trends with Repair, Regression, and NetGain metrics show that applying one to two rounds produces most of the benefits. A replication package is available at https://doi.org/10.5281/zenodo.17065846.
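
The abstract names the Repair, Regression, and NetGain metrics without defining them. The sketch below shows one plausible reading, stated here as an assumption: Repair is the share of previously insecure programs that become secure, Regression is the share of previously secure programs that become insecure, and NetGain is the resulting change in the overall secure rate. The function name and the sample flags are hypothetical.

```python
def round_metrics(before, after):
    """before/after: lists of booleans, True meaning the detector flagged no issue.

    Definitions are assumptions for illustration, not taken from the paper.
    """
    repaired = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    insecure_before = sum(1 for b in before if not b)
    secure_before = len(before) - insecure_before
    return {
        "repair": repaired / insecure_before if insecure_before else 0.0,
        "regression": regressed / secure_before if secure_before else 0.0,
        "net_gain": (repaired - regressed) / len(before),
    }

# Made-up detector verdicts for ten programs at t0 and after one reflexion round.
t0 = [True, True, False, False, True, False, True, True, False, True]
t1 = [True, True, True, False, True, True, False, True, False, True]
metrics = round_metrics(t0, t1)
```

Reporting repairs and regressions separately shows whether a round fixes insecure programs or merely trades them for newly broken ones, which a single secure rate would hide.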

2507.17799 2026-06-08 eess.AS cs.LG cs.SD

A Concept-based approach to Voice Disorder Detection

Davide Ghia, Gabriele Ciravegna, Alkis Koudounas, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli

Affiliations: Politecnico di Torino; CENTAI Institute; San Feliciano Hospital; SCDU Otorinolaringoiatria, Head Neck Cancer Unit, Ospedale San Giovanni Bosco; Dipartimento di Oncologia, Università degli Studi di Torino

AI summary: Proposes a concept-based approach to voice disorder detection that uses explainable AI to improve model transparency, offering a clearer decision framework than traditional deep learning methods.

Abstract

Voice disorders affect a significant portion of the population, and the ability to diagnose them using automated, non-invasive techniques would represent a substantial advancement in healthcare, improving the quality of life of patients. Recent studies have demonstrated that artificial intelligence models, particularly Deep Neural Networks (DNNs), can effectively address this task. However, due to their complexity, the decision-making process of such models often remains opaque, limiting their trustworthiness in clinical contexts. This paper investigates an alternative approach based on Explainable AI (XAI), a field that aims to improve the interpretability of DNNs by providing different forms of explanations. Specifically, this work focuses on concept-based models such as the Concept Bottleneck Model (CBM) and the Concept Embedding Model (CEM) and how they can achieve performance comparable to traditional deep learning methods, while offering a more transparent and interpretable decision framework.

2506.12454 2026-06-08 stat.ML cond-mat.dis-nn cs.CR cs.LG

On the existence of consistent adversarial attacks in high-dimensional linear classification

Matteo Vilucchio, Lenka Zdeborová, Bruno Loureiro

Affiliations: Information Learning and Physics Laboratory, École Polytechnique Fédérale de Lausanne (EPFL); Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL); Département d’Informatique, École Normale Supérieure - PSL & CNRS, France

AI summary: Studies, in high-dimensional binary classification, the difference between adversarial attacks and misclassifications caused by limited model expressivity. It proposes a new error metric that exposes model vulnerability to perturbations preserving the ground-truth label, and shows theoretically that more overparameterized models are more sensitive to label-preserving perturbations.

Journal ref
ICML 2026
Abstract

What fundamentally distinguishes an adversarial attack from a misclassification due to limited model expressivity or finite data? In this work, we investigate this question in the setting of high-dimensional binary classification, where statistical effects due to limited data availability play a central role. We introduce new error metrics that precisely capture this distinction, quantifying model vulnerability to consistent adversarial attacks -- perturbations that preserve the ground-truth labels. Our main technical contribution is an exact and rigorous asymptotic characterization of these metrics in both well-specified models and latent space models, revealing different vulnerability patterns compared to standard robust error measures. The theoretical results demonstrate that as models become more overparameterized, their vulnerability to label-preserving perturbations grows, offering theoretical insight into the mechanisms underlying model sensitivity to adversarial attacks.
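
The notion of a consistent adversarial attack can be sketched for a linear classifier. The example assumes a noiseless linear "teacher" that defines the ground-truth label and a "student" with different weights; the weight vectors, the input, and the function names are hypothetical, and the paper's error metrics and asymptotic analysis are not reproduced.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(v):
    return 1 if v >= 0 else -1

def worst_case_l2_attack(w, x, y, eps):
    """Move x by eps along -y * w / ||w||, the direction that lowers the margin fastest."""
    norm = math.sqrt(dot(w, w))
    return [xi - eps * y * wi / norm for xi, wi in zip(x, w)]

def classify_attack(w_student, w_teacher, x, eps):
    y = sign(dot(w_teacher, x))                    # ground-truth label
    x_adv = worst_case_l2_attack(w_student, x, y, eps)
    fooled = sign(dot(w_student, x_adv)) != y      # student now disagrees with y
    consistent = sign(dot(w_teacher, x_adv)) == y  # true label is unchanged
    return fooled, consistent

w_teacher = [1.0, 0.0]
w_student = [0.6, 0.8]  # imperfectly aligned with the teacher
x = [2.0, -1.0]

large_budget = classify_attack(w_student, w_teacher, x, eps=0.5)
small_budget = classify_attack(w_student, w_teacher, x, eps=0.2)
```

With the larger budget the student is fooled while the teacher's label is unchanged, which is the consistent case; with the smaller budget the perturbation does not cross the student's decision boundary.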

2407.15555 2026-06-08 eess.SP cs.LG

The Rlign Algorithm for Enhanced Electrocardiogram Analysis through R-Peak Alignment for Explainable Classification and Clustering

Lucas Plagwitz, Lucas Bickmann, Michael Fujarski, Alexander Brenner, Warnes Gobalakrishnan, Lars Eckardt, Antonius Büscher, Julian Varghese

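Affiliations: IMI Medical Systems GmbH; University of Freiburg

AI summary: Proposes the Rlign algorithm, which restructures ECG signals through R-peak alignment to improve classification, clustering, and explainability, outperforming traditional methods and CNNs.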

Journal ref
European Heart Journal - Digital Health, Volume 7, Issue 5, June 2026, ztag067
Abstract

Electrocardiogram (ECG) recordings have long been vital in diagnosing different cardiac conditions. Recently, research in the field of automatic ECG processing using machine learning methods has gained importance, mainly by utilizing deep learning methods on raw ECG signals. A major advantage of models like convolutional neural networks (CNNs) is their ability to effectively process biomedical imaging or signal data. However, this strength is tempered by challenges related to their lack of explainability, the need for a large amount of training data, and the complexities involved in adapting them for unsupervised clustering tasks. In addressing these tasks, we aim to reintroduce shallow learning techniques, including support vector machines and principal components analysis, into ECG signal processing by leveraging their semi-structured, cyclic form. To this end, we developed and evaluated a transformation that effectively restructures ECG signals into a fully structured format, facilitating their subsequent analysis using shallow learning algorithms. In this study, we present this adaptive transformative approach that aligns R-peaks across all signals in a dataset and resamples the segments between R-peaks, both with and without heart rate dependencies. We illustrate the substantial benefit of this transformation for traditional analysis techniques in the areas of classification, clustering, and explainability, outperforming commercial software for median beat transformation and CNN approaches. Our approach demonstrates a significant advantage for shallow machine learning methods over CNNs, especially when dealing with limited training data. Additionally, we release a fully tested and publicly accessible code framework, providing a robust alignment pipeline to support future research, available at https://github.com/imi-ms/rlign.
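
The core idea of aligning R-peaks and resampling the segments between them can be sketched in a few lines. This is not the API of the `rlign` package: the function names are hypothetical, R-peak positions are assumed to be known already, and only the heart-rate-independent variant (a fixed number of samples per beat) is shown.

```python
def resample_linear(segment, length):
    """Linearly interpolate a list of samples onto `length` evenly spaced points."""
    if length == 1:
        return [segment[0]]
    out = []
    last = len(segment) - 1
    for i in range(length):
        pos = i * last / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(segment[lo] * (1 - frac) + segment[hi] * frac)
    return out

def align_beats(signal, r_peaks, samples_per_beat):
    """Resample every R-to-R interval to the same number of samples."""
    aligned = []
    for start, end in zip(r_peaks[:-1], r_peaks[1:]):
        beat = signal[start:end + 1]  # include both R-peaks
        # Drop the final point: it is the next R-peak and opens the next beat.
        aligned.extend(resample_linear(beat, samples_per_beat + 1)[:-1])
    return aligned

# Toy signal with irregular R-to-R intervals (peaks at indices 2, 7, and 15).
signal = [0, 0, 5, 1, 0, 0, 1, 5, 1, 0, 0, 0, 0, 0, 1, 5, 0]
r_peaks = [2, 7, 15]
aligned = align_beats(signal, r_peaks, samples_per_beat=4)
```

After alignment every beat occupies the same index range, so R-peaks fall at multiples of `samples_per_beat` and the signal becomes a fixed-length feature vector that shallow models such as support vector machines can consume.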

2606.07512 2026-06-08 cs.CV cs.AI cs.CL (new submission)

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

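AI summary: Proposes the MemDreamer framework, which decouples perception and reasoning through a hierarchical graph memory and an agentic retrieval mechanism, turning long-video understanding into an agentic exploration process. It reaches SOTA on four benchmarks while using a reasoning context window of only 2% of the full context and improving accuracy by 12.5 points.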

Abstract

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

2606.07400 2026-06-08 cs.LG (new submission)

Generative Modeling of Discrete Latent Structures via Dynamic Policy Gradients

Stefan Ivanovic, Ge Liu, Mohammed El-Kebir

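AI summary: Proposes the GReinSS framework, which uses dynamically rescaled rewards to learn latent state distributions that maximize the observed data likelihood. It outperforms baselines on simulated latent set and latent graph reconstruction and reconstructs isoforms from RNA sequencing data more accurately than RSEM.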

Comments
ICML 2026
Abstract

Many scientific problems require inferring unobserved mechanistic latent states from indirect observations. While classical approaches, including expectation maximization, do not scale to combinatorially large spaces, deep learning approaches such as variational autoencoders typically form artificial latent states rather than reconstructing the mechanistic ground-truth states. Here, we introduce GReinSS, a policy learning framework that uses dynamically rescaled rewards to learn latent state distributions that maximize the observed data likelihood. We show that GReinSS accurately reconstructs simulated latent sets and latent graphs, outperforming alternative policy learning and generative modeling baselines. Additionally, GReinSS reconstructs isoforms from real short-read RNA sequencing data that better match isoforms detected by orthogonal long-read sequencing than the standard RSEM algorithm. Overall, GReinSS is a principled and practically effective approach for generative modeling and inference of combinatorial latent states from indirect observations.

2606.07387 2026-06-08 cs.LG (new submission)

Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

Yun-Chen Cheng, Tzu-Hung Huang, Chih-Pin Tan

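AI summary: Proposes score-aware training, which uses a CLAP-conditioned Beta noise timestep schedule to route low-scoring audio segments to high-noise training and combines it with segment-level filtering and a two-stage caption strategy for efficient text-to-music generation with limited data. The system placed 2nd in the objective evaluation of the ICME 2026 ATTM challenge.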

Abstract

State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose score-aware training, which treats audio-caption alignment score as a direct supervision signal throughout the pipeline. Rather than discarding low-scoring segments, we repurpose them via a CLAP-conditioned Beta noise timestep schedule that routes them to high-noise training regimes, acting as an effective implicit regularizer. Complementarily, segment-level filtering removes the most misaligned examples, and a two-stage caption procedure bridges the distribution gap between verbose training captions and concise inference prompts. A REPA auxiliary loss further transfers structured semantic knowledge from pretrained CLAP and MuQ encoders without additional data. Our 450M-parameter FluxAudio-based system, submitted to the ICME 2026 ATTM Grand Challenge Efficiency Track, ranked 2nd across both tracks in the objective evaluation and 3rd in the Efficiency Track in the final MOS evaluation.
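
The idea of routing low-scoring segments to high-noise timesteps can be sketched with a score-dependent Beta distribution. Everything specific here is an assumption: the abstract does not give the parameterization, so the linear mapping from alignment score to Beta mean, the `concentration` value, and the function name are hypothetical.

```python
import random

def sample_timestep(alignment_score, rng, concentration=4.0):
    """Draw t in (0, 1); lower alignment scores shift the Beta mean toward t = 1 (more noise).

    The score-to-mean mapping is a made-up illustration, not the paper's schedule.
    """
    s = min(max(alignment_score, 0.0), 1.0)
    mean_t = 0.9 - 0.8 * s  # s = 0 gives mean 0.9, s = 1 gives mean 0.1
    alpha = concentration * mean_t
    beta = concentration * (1.0 - mean_t)
    return rng.betavariate(alpha, beta)

rng = random.Random(0)
t_for_poor_caption = sample_timestep(0.1, rng)
t_for_good_caption = sample_timestep(0.9, rng)
```

Under such a scheme a poorly captioned segment is mostly seen at noise levels where the text condition matters less, so it can still contribute to learning coarse audio structure.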

2606.07367 2026-06-08 cs.LG (new submission)

Self-evolving LLM agents with in-distribution Optimization

Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy

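AI summary: Proposes the Q-Evolve framework, which unifies process-reward labeling and policy learning through in-distribution reinforcement learning and stabilizes Bellman updates with weighted Implicit Q-Learning, enabling agent self-evolution and outperforming baselines on AlfWorld and other tasks.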

Comments
ICML 2026
Abstract

Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy, both grounded within a shared in-distribution learning loop.
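
Two ingredients named in the abstract, the Implicit Q-Learning objective and advantage-based process rewards, can be sketched as follows. The code shows the standard IQL expectile loss and the plain advantage Q - V; the particular weighting Q-Evolve applies is not described in the abstract and is not reproduced, and the function names and `tau` value are illustrative assumptions.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2 from standard IQL."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff

def value_loss(q_values, v_values, tau=0.7):
    """Mean expectile loss over a batch, with diff = Q(s, a) - V(s)."""
    losses = [expectile_loss(q - v, tau) for q, v in zip(q_values, v_values)]
    return sum(losses) / len(losses)

def process_rewards(q_values, v_values):
    """Step-wise advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t) as a dense reward signal."""
    return [q - v for q, v in zip(q_values, v_values)]

# Made-up critic outputs for a two-step trajectory.
q_batch = [1.0, 0.2]
v_batch = [0.5, 0.4]
loss = value_loss(q_batch, v_batch)
rewards = process_rewards(q_batch, v_batch)
```

Because the expectile loss only evaluates actions present in the dataset, the value estimate stays within the data distribution, and the sign of each advantage indicates whether a step was better or worse than the critic expected.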

2606.07366 2026-06-08 cs.CV cs.LG cs.RO (new submission)

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla, Manmohan Chandraker, Srinivasa Narasimhan

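AI summary: Proposes the Dash2Sim framework, which converts monocular dashcam videos into metric, geo-referenced 4D driving logs for closed-loop simulation, builds the ROADWork4D benchmark dataset, and confirms that work zone scenarios are challenging for planners.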

Abstract

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

2606.07351 2026-06-08 cs.LG cs.AI 新提交

SleepExplain: Explainable Non-Rapid Eye Movement and Rapid Eye Movement Sleep Stage Classification from EEG Signal

SleepExplain: 基于EEG信号的可解释非快速眼动和快速眼动睡眠阶段分类

Rafsan Jany, Md. Hamjajul Ashmafee, Iqram Hussain, Md Azam Hossain

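AI summary: Proposes the SleepExplain model, which classifies NREM and REM sleep stages with ensemble learning (Random Forest, XGBoost, Gradient Boosting), reaches 94.30% accuracy, and uses SHAP for explainability.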

详情
Journal ref
2022 25th International Conference on Computer and Information Technology (ICCIT), pp. 248-253, 2022
Comments
6 pages, 7 figures, 2022 25th International Conference on Computer and Information Technology (ICCIT)
AI中文摘要

睡眠阶段分类是多种睡眠相关疾病最重要的诊断方法之一。脑电图(EEG)被认为是检查神经效应与睡眠阶段之间关联的有力工具,因为它能正确识别与睡眠相关的神经变化。在非快速眼动(NREM)和快速眼动(REM)睡眠阶段,许多神经和身体功能受到影响,因此在其功能中扮演重要角色。本研究旨在从睡眠EEG数据中分类NREM和REM睡眠阶段,并提出一个新颖的SleepExplain模型,一种可解释的NREM和REM睡眠阶段分类,以解释其预测。在这项工作中,使用随机森林、XGBoost和梯度提升集成分类模型对睡眠阶段进行分类。总体而言,我们获得了92.54%(随机森林)、94.25%(梯度提升)和94.30%(XGBoost)的准确率。对于可解释分类模型,我们采用博弈论方法SHAP(SHapley Additive exPlanations)为预测提供令人信服的解释。

英文摘要

Classification of sleep stages is one of the most important diagnostic approaches for a variety of sleep-related disorders. Electroencephalography (EEG) is regarded as a powerful tool for examining the association between neurological effects and sleep phases since it correctly identifies sleep-related neurological alterations. A number of nerve and bodily functions are affected during the Non-Rapid Eye Movement (NREM) and Rapid Eye Movement (REM) sleep phases, so both phases play an important role in how those functions operate. This work aims to classify NREM and REM sleep stages from sleep EEG data and presents a novel SleepExplain model, an explainable NREM and REM sleep stage classifier that explains its predictions. In this work, sleep stages were classified using Random Forest, XGBoost, and Gradient Boosting ensemble classification models. Overall, we obtained an accuracy of 92.54% (Random Forest), 94.25% (Gradient Boosting), and 94.30% (XGBoost). To make the classification model explainable, we used SHAP (SHapley Additive exPlanations), a game-theoretic approach, to offer a convincing explanation for each prediction.

2606.07345 2026-06-08 cs.LG 新提交

TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention

TabSwift: 一种高效的基于行注意力的表格基础模型

Si-Yang Liu, Han-Jia Ye

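AI summary: Proposes TabSwift, which strengthens a lightweight row-wise attention backbone with gated attention stabilization and learnable register tokens, enabling efficient tabular in-context learning that stays competitive while lowering inference cost.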

详情
Comments
Accepted to ICML 2026, spotlight
AI中文摘要

以TabPFN为代表的表格基础模型通过上下文学习进行预测,直接从带标签的训练样本推断测试标签。它们已展现出有竞争力的性能,尤其是在中小型数据集上。然而,最近的表格基础模型通常通过日益复杂的架构来提高准确性,导致更高的推理成本并限制了实际部署。在这项工作中,我们重新审视了原始TabPFN设计,并表明一个轻量级的仅行注意力骨干可以通过两个简单的增强保持高度竞争力:门控注意力稳定机制和一组可学习的注册令牌,提供全局上下文并改善预训练质量。由此产生的模型TabSwift支持分类和回归,与更强的表格基础模型(如TabPFN v2和TabICL)竞争,同时推理效率更高。对于延迟敏感的服务,我们进一步引入了一种自适应逐层早期退出机制,动态调整每个样本的推理深度。总体而言,TabSwift为实际部署实现了高效且随时可用的表格上下文学习。

英文摘要

Tabular foundation models, exemplified by TabPFN, perform prediction via in-context learning, inferring test labels directly from labeled training examples. They have demonstrated competitive performance, particularly on small-to-medium datasets. However, recent tabular foundation models often improve accuracy with increasingly complex architectures, incurring higher inference cost and limiting practical deployment. In this work, we revisit the original TabPFN design and show that a lightweight row-wise attention-only backbone can remain highly competitive with two simple enhancements: a gated attention stabilization mechanism and a small set of learnable register tokens that provide global context and improve pretraining quality. The resulting model, TabSwift, supports both classification and regression, and is competitive with stronger tabular foundation models (e.g., TabPFN v2 and TabICL) while being more efficient at inference. For latency-sensitive serving, we further introduce an adaptive layer-wise early-exit mechanism that dynamically adjusts inference depth per sample. Overall, TabSwift enables efficient and anytime tabular in-context learning for practical deployments.

2606.07333 2026-06-08 cs.CV 新提交

Varifold Moment Invariants for Sustainable and Explainable Contour Feature Extraction

Varifold矩不变量:可持续且可解释的轮廓特征提取

G. Longari, J. -C. Alvarez Paiva, A. B. Tumpach

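AI summary: Proposes Varifold Moment Invariants (VMI), a unifying framework that combines region, boundary, and tangent-line geometry to produce highly discriminative geometric features; paired with lightweight classifiers, it outperforms existing contour-based methods at lower computational cost.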

详情
Comments
29 pages, 12 figures
AI中文摘要

我们引入Varifold矩不变量(VMI)作为许多先前提出的矩不变量的统一框架。这些不变量与其他在平移和旋转下不变的轮廓特征(如扩展高斯图像、椭圆傅里叶描述符或形状分布)密切相关。Varifold矩方法的优势在于能够结合区域的几何、其边界以及与之相切的直线族,从而创建大量具有高判别力和清晰几何意义的不变特征。通过将我们的VMI特征提取与轻量特征分类器随机森林或多层感知器相结合,我们在基于轮廓的方法中超越了现有技术水平,同时大幅降低了计算成本,使我们的算法能够在轻量设备上运行。我们在大量广泛使用的不同类型数据集(叶子、物体、细胞)上测试了我们的分类任务,并以少量几何可解释的特征实现了高精度。

英文摘要

We introduce Varifold Moment Invariants (VMI) as a unifying framework for many previously introduced moment invariants. These invariants are deeply related to other contour features that are invariant under translations and rotations, such as the Extended Gaussian Image, Elliptic Fourier Descriptors, or Shape Distributions. The advantage of the varifold approach to moments is that it combines the geometry of the region, its boundary, and the family of lines tangent to it, creating a substantial number of invariant features with high discriminating power and clear geometric meaning. By coupling our VMI feature extraction with lightweight feature classifiers (Random Forest or Multi-Layer Perceptron), we outperform state-of-the-art contour-based approaches while drastically decreasing the computational cost, to the point of allowing our algorithm to run on lightweight devices. We tested our approach on classification tasks on a large number of widely used datasets of various types (leaves, objects, cells) and achieved high accuracy with a small number of geometrically interpretable features.

2606.07304 2026-06-08 cs.RO 新提交

CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning

CAPE: 用于具身规划的条件对比动作并行编码

Cong Chen, Haowen Wang, Zhixiang Zhang, Pei Ren, Zhengping Che

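AI summary: Proposes the CAPE framework, which uses contrastive learning to distinguish the future outcomes of different action sequences for efficient visual dynamics modeling, substantially improving planning performance and reducing inference cost on real-world and zero-shot transfer tasks.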

详情
Comments
19 pages, 7 figures
AI中文摘要

具身智能体需要在执行前预测候选动作的未来后果,以便有效规划。现有的视觉动力学模型通过重建未来视觉状态或展开密集潜在表示来学习,这会将学习能力分散到视觉显著但与规划无关的内容上,而不是驱动操作结果的动作条件变化。我们提出CAPE,一种对比动作条件并行编码框架,通过区分不同动作序列诱导的未来结果来学习视觉动力学。给定初始观察和候选动作序列,CAPE在单次前向传播中解码完整的未来潜在轨迹,并使用目标收敛对比目标进行训练,该目标对齐对应相同未来结果的预测,同时分离对应不同结果的预测。在真实世界DROID和零样本迁移到RoboCasa上,CAPE在状态检索、离线动作匹配和闭环规划方面显著优于先前基线,同时在长预测范围内显著降低了规划时的推理成本。

英文摘要

Embodied agents need to predict the future consequences of candidate actions in order to plan effectively before execution. Existing visual dynamics models learn by reconstructing future visual states or rolling out dense latent representations, which spreads learning capacity across visually salient but planning-irrelevant content rather than the action-conditioned changes that drive manipulation outcomes. We propose CAPE, a Contrastive Action-conditioned Parallel Encoding framework that learns visual dynamics by distinguishing the future outcomes induced by different action sequences. Given an initial observation and a candidate action sequence, CAPE decodes the full future latent trajectory in a single forward pass and is trained with a Goal-Convergent Contrastive Objective that aligns predictions corresponding to the same future outcome while separating those corresponding to different outcomes. On real-world DROID and zero-shot transfer to RoboCasa, CAPE substantially outperforms prior baselines on future-state retrieval, offline action matching, and closed-loop planning, while notably reducing planning-time inference cost at long prediction horizons.

2606.07300 2026-06-08 cs.CL 新提交

Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

Phun-Bench:评估大语言模型的中文语音理解能力

Xing Yue, Yongliang Shen, Weiming Lu

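AI summary: Proposes the Phun-Bench benchmark, which systematically evaluates the phonological understanding of large language models along three dimensions (homophony, rhyme, and phonetic similarity) and finds that models fall short in using phonological knowledge flexibly.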

详情
Comments
Accepted to ACL 2026 Main Conference
AI中文摘要

语言是思想的载体,与声音、符号和意义紧密相连。然而,大多数大语言模型(LLM)研究关注意义(语义)和符号(拼写),而很大程度上忽略了声音。现有的LLM语音能力基准要么可以通过死记硬背解决,要么与其他能力交织在一起,不足以衡量LLM在语音理解方面的真实能力。在这里,我们提出Phun-Bench,一个专门构建的中文基准,包含跨三个维度(同音、押韵和语音相似性)的多样化任务和设置,旨在系统评估LLM的语音理解能力。我们的结果表明,虽然LLM在回忆正确发音方面表现出色,但它们通常难以像人类说话者那样灵活直观地利用语音知识。此外,通过详细分析,我们提出了关于LLM语音理解和“感知”潜在机制的假设,突出了未来研究的一个未充分探索的前沿。

英文摘要

Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs' genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.

2606.07291 2026-06-08 cs.LG 新提交

Trio: Learning Time-Series Forecasting with Temporal-Spatial-Sample Attention and Structural Causal Priors

Trio: 基于时间-空间-样本注意力与结构因果先验的时间序列预测学习

Tao Chen, Yexu Zhou, Zhi Gong, Hengwei He, Hongda Li, Zhewei Chen, Dongjing Wang, Xin Zhang, Decheng Liu, Chunlei Peng, Zheng Chen, Wenyue Ding

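AI summary: Proposes the Trio architecture, whose temporal, spatial, and sample attention mechanisms capture temporal dynamics, inter-variable dependencies, and historical sample correspondences respectively, and introduces a time-series structural causal model that generates synthetic tasks as structural priors to improve multivariate time-series forecasting.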

详情
AI中文摘要

多变量时间序列预测要求模型对时间动态、跨变量依赖以及历史输入-输出对应关系进行推理。最近的先验数据拟合网络(PFNs)表明,合成任务可用于学习可迁移的推理行为。然而,直接将这一范式迁移到时间序列预测仍然困难,因为时间顺序、动态滞后和重复的历史模式无法被普通的表格先验自然捕获。受此观察启发,我们提出了Trio,一种基于时间-空间-样本注意力的样本感知时间序列预测架构。时间注意力捕获窗口内动态,空间注意力建模变量间依赖,样本注意力检索相关的历史回溯-未来对以指导当前预测。我们的目标并非声称一个完全通用的PFN风格预测器,而是研究如何在预测模型中显式组织和重用历史输入-输出示例。我们进一步引入了一个时间序列结构因果模型(TS-SCM)生成器,以创建具有动态滞后、跨变量交互、噪声、反馈和分布漂移的结构化合成预测任务。在合成、工业和公共基准上的实验表明,所提出的架构提高了预测性能。探索性的零样本实验进一步表明,TS-SCM生成的任务可能提供有用的结构先验,而完全通用的PFN风格时间序列预测仍是一个开放问题。

英文摘要

Multivariate time-series forecasting requires models to reason over temporal dynamics, cross-variable dependencies, and historical input-output correspondences. Recent Prior-Data Fitted Networks (PFNs) suggest that synthetic tasks can be useful for learning transferable inference behavior. However, directly transferring this paradigm to time-series forecasting remains difficult, since temporal order, dynamic lags, and recurring historical patterns are not naturally captured by ordinary tabular priors. Motivated by this observation, we propose Trio, a sample-aware time-series forecasting architecture based on Temporal-Spatial-Sample attention. Temporal attention captures within-window dynamics, spatial attention models inter-variable dependencies, and sample attention retrieves relevant historical lookback-future pairs to guide the current prediction. Rather than claiming a fully general PFN-style forecaster, our goal is to study how historical input-output examples can be explicitly organized and reused within a forecasting model. We further introduce a Time-Series Structural Causal Model (TS-SCM) generator to create structured synthetic forecasting tasks with dynamic lags, cross-variable interactions, noise, feedback, and distributional drift. Experiments on synthetic, industrial, and public benchmarks show that the proposed architecture improves forecasting performance. Exploratory zero-shot experiments further suggest that TS-SCM-generated tasks may provide useful structural priors, while fully general PFN-style time-series forecasting remains an open problem.

2606.07271 2026-06-08 cs.LG cs.AI cs.SD 新提交

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

整流流泄漏之处:沿插值路径表征成员信号

Thomas Sesmat, Gabriel Meseguer-Brocal, Geoffroy Peeters

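AI summary: Analyses training-data membership signals of Rectified Flows along the interpolation path, finds that the reconstruction gap between train and test data follows a bell-shaped curve, derives the peak location under Gaussian assumptions, validates that the structure is universal, and exploits it for a membership inference attack.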

详情
Journal ref
43rd International Conference on Machine Learning, Seoul, South Korea, 2026
Comments
ICML 2026 article, 9 main pages and 25 with annexes, 11 figures
AI中文摘要

理解生成模型从训练数据中保留了什么仍然具有挑战性,这对版权和隐私有影响。除了逐字复制外,模型可以编码训练数据中更微妙的痕迹,这些痕迹从未出现在输出中,但仍可利用。我们针对整流流(Rectified Flows)研究了这一机制,整流流越来越多地用于部署的生成系统。我们分析了定义整流流训练的插值路径 $X_\lambda = (1-\lambda)X_0 + \lambda X_1$。我们展示了训练数据和测试数据的重建之间存在一个差距,该差距在 $\lambda$ 上呈钟形曲线,并在训练过程中累积,而验证指标保持稳定。该信号有一个最大值,我们在高斯假设下推导出其位置的闭式解。我们在音频和图像上验证了这些预测,并表明钟形结构是普遍的,而峰值预测在我们的假设满足时成立。作为概念验证,我们利用这种特定的 $\lambda$ 解析结构进行成员推断攻击,区分训练集的成员和非成员。

英文摘要

Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path $X_\lambda = (1-\lambda)X_0 + \lambda X_1$ that defines Rectified Flow training. We show that a gap exists between the reconstruction of train and test data; this gap follows a bell-shaped curve over $\lambda$ and accumulates during training while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific $\lambda$-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.
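The interpolation path and the per-$\lambda$ gap described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the roles of the endpoints (`x0` as noise, `x1` as data), the definition of the gap as "mean non-member loss minus mean member loss", and the names `gap_curve` and `peak_lambda` are all assumptions made for illustration.

```python
def interpolate(x0, x1, lam):
    """Point on the straight path X_lam = (1 - lam) * X0 + lam * X1."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Constant velocity along the straight path: X1 - X0."""
    return [b - a for a, b in zip(x0, x1)]


def gap_curve(member_losses, non_member_losses):
    """Assumed gap definition: mean non-member loss minus mean member loss.

    Both arguments map a lambda value to a list of per-sample losses
    measured at that lambda (hypothetical inputs, e.g. squared errors).
    """
    curve = {}
    for lam in sorted(member_losses):
        mean_in = sum(member_losses[lam]) / len(member_losses[lam])
        mean_out = sum(non_member_losses[lam]) / len(non_member_losses[lam])
        curve[lam] = mean_out - mean_in
    return curve


def peak_lambda(curve):
    """Lambda at which the measured gap is largest."""
    return max(curve, key=curve.get)
```

Under these assumptions, a membership score for a new sample would be its loss measured near `peak_lambda`, where the separation between members and non-members is largest.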

2606.07253 2026-06-08 cs.AI econ.EM 新提交

TOPSIS-RAD: Ranking According to Desires

TOPSIS-RAD:根据期望排序

Leonardo Fernandes Costa, Helder Gomes Costa, Diogo Lima, Brunno Rodrigues

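AI summary: Proposes TOPSIS-RAD, which introduces decision-maker-defined vetoed and desired performance levels to address the misalignment of traditional TOPSIS rankings with decision-maker requirements, their sensitivity to outliers, and rank reversal.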

详情
Comments
21 pages, 15 tables and 6 figures. The numerical computation of the data that appear in the toy examples was supported by the Visual TOPSIS RAD app, available at https://topsis-ranking.vercel.app/. The data of the toy examples are also available at this URL and can be loaded in the app as the template "Article".
AI中文摘要

传统TOPSIS从观测到的备选方案集中推导其参考点——正理想解(PIS)和负理想解(NIS),这使得排序容易与决策者(DM)需求不一致,对异常值表现敏感,并导致排名反转。本文提出TOPSIS-RAD,通过引入两组DM定义的参考水平来解决这些问题。否决绩效水平(VPL)在归一化之前排除不可行的备选方案,防止它们扭曲排序边界。期望绩效水平(DPL)在归一化之前将表现上限设定在DM期望的水平,将PIS锚定在明确的期望而非数据集极端值上。三个简单示例展示了每种机制:VPL通过移除不可行备选方案重塑归一化边界;固定的DPL边界通过限制远高于期望水平的表现的影响来稳定排序。该方法保留了TOPSIS熟悉的基于距离的结构,同时将排序建立在稳定的、DM指定的边界上。还讨论了局限性和未来研究方向。

英文摘要

Traditional TOPSIS derives its reference points -- the Positive Ideal Solution ($PIS$) and Negative Ideal Solution ($NIS$) -- from the observed alternative set, making rankings susceptible to misalignment with decision-maker (DM) requirements, sensitivity to outlier performances, and rank reversal. This paper proposes TOPSIS-RAD, which addresses these issues by incorporating two arrays of DM-defined reference levels. Vetoed Performance Levels ($VPL$) exclude non-viable alternatives before normalisation, preventing them from distorting the ranking frontiers. Desired Performance Levels ($DPL$) cap performances at the DM's desired level before normalisation, anchoring the $PIS$ in explicit aspirations rather than dataset extremes. Three toy examples demonstrate each mechanism: $VPL$ reshapes normalisation boundaries by removing a non-viable alternative; fixed $DPL$ frontiers stabilise rankings by limiting the influence of performances well above the desired level. The method preserves the familiar distance-based structure of TOPSIS while grounding the ranking in stable, DM-specified boundaries. Limitations and future research directions are also discussed.
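The veto-then-cap mechanism described in this abstract can be sketched in a few lines of Python. This is a hedged illustration, not the paper's formulation: it assumes that all criteria are benefit criteria, that an alternative is vetoed when it falls below the VPL on any criterion, that each criterion is normalised linearly so that the VPL maps to 0 and the DPL maps to 1, and that weights are equal by default. The function name `topsis_rad_sketch` is hypothetical.

```python
import math


def topsis_rad_sketch(alternatives, vpl, dpl, weights=None):
    """Rank alternatives by closeness to fixed, DM-specified frontiers.

    alternatives: dict mapping a name to a list of benefit-criterion scores.
    vpl, dpl: per-criterion vetoed and desired levels, with vpl < dpl.
    """
    n = len(vpl)
    weights = weights or [1.0 / n] * n
    # Veto step: drop alternatives below the vetoed level on any criterion.
    viable = {
        name: scores
        for name, scores in alternatives.items()
        if all(v >= lo for v, lo in zip(scores, vpl))
    }
    closeness = {}
    for name, scores in viable.items():
        # Cap step: performance above the desired level counts as desired.
        capped = [min(v, hi) for v, hi in zip(scores, dpl)]
        # Assumed normalisation: VPL -> 0 and DPL -> 1, so PIS is all ones
        # and NIS is all zeros regardless of the observed data.
        norm = [(c - lo) / (hi - lo) for c, lo, hi in zip(capped, vpl, dpl)]
        d_pos = math.sqrt(sum(w * (1.0 - x) ** 2 for w, x in zip(weights, norm)))
        d_neg = math.sqrt(sum(w * x ** 2 for w, x in zip(weights, norm)))
        total = d_pos + d_neg
        closeness[name] = d_neg / total if total > 0 else 0.0
    return sorted(closeness.items(), key=lambda item: item[1], reverse=True)
```

Because the reference points in this sketch depend only on `vpl` and `dpl`, adding or removing an alternative cannot change the scores of the others, which is the stability property the abstract attributes to fixed frontiers.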

2606.07229 2026-06-08 cs.SD cs.CL cs.MM 新提交

MMAE: A Massive Multitask Audio Editing Benchmark

MMAE:大规模多任务音频编辑基准

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

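AI summary: Proposes MMAE, the first comprehensive evaluation benchmark for general-purpose instruction-based audio editing, covering 7 audio modalities, 6 levels of task complexity, and 8 operation types; its 2,000 samples and rubric-based evaluation framework expose severe shortcomings of current models in precise execution and structural robustness.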

详情
Comments
Open-Source at https://github.com/ddlBoJack/MMAE
AI中文摘要

我们引入了MMAE,一个大规模多任务音频编辑基准,作为首个专为通用指令式音频编辑设计的综合评估测试平台。受智能创作趋势的推动,交互式编辑已从视觉领域(如图像领域的Nano-banana 2和视频领域的Gemini-Omni)迅速扩展到音频领域。然而,当前的评估基础设施严重滞后,仍然高度碎片化且局限于特定子领域或基本操作。与现有范围有限的基准不同,MMAE扩展到广泛的实际场景,涵盖7种不同的音频模态,包括声音、语音、音乐及其混合。此外,我们建立了一个全面的分类体系,涵盖6级任务复杂度(从基本修改到多跳推理和多轮编辑)、2级粒度以及8种不同的操作类型。通过人机协作精心策划,MMAE包含2000个高保真样本,并配以开创性的基于评分标准的评估框架。通过将自由形式任务分解为17,741个可验证的标准,这种稳健的基于评分标准的范式能够对指令遵循和上下文一致性进行精确的多维评估。我们对领先模型的广泛评估表明,当前系统远未实现可靠的编辑。令人惊讶的是,精确匹配率(EMR)始终低于5%,在复杂的混合模态任务中更是骤降至绝对的0%,暴露了精确执行和结构鲁棒性方面的关键瓶颈。我们希望MMAE能够成为智能创作社区未来进步的催化剂,提供清晰的诊断路线图,并为下一代音频编辑系统建立标准化、持久的评估范式。

英文摘要

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

2606.07207 2026-06-08 cs.SD cs.LG eess.AS 新提交

Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

熵作为结构先验:DiT信念空间上的对数障碍如何驱动音乐多样性与发展

Zixi Li, Youzhen Li

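AI summary: Proposes the Eisbach log-barrier, which weights the loss by the entropy of the DiT output's spatial energy distribution; by scaling the gradient step size in supervised diffusion training, it promotes thematic development, acoustic differentiation, and textural diversity in music while avoiding mode collapse.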

详情
AI中文摘要

基于置信度的损失加权通常在生成模型中被避免,因为当模型自信地错误时会加速误差,但这种直觉在监督扩散训练中不成立。我们引入了Eisbach对数障碍,一种无参数权重,源自DiT输出空间能量分布的熵:高熵抑制梯度,低熵保留梯度。将其应用于Stable Audio 3 Medium在MusicCaps上的LoRA微调,意外地产生了比未加权训练更强的主题发展、更清晰的声学区分和更高的纹理多样性,这与模式崩溃相反。这是因为在监督扩散中,梯度方向锁定于真实值,因此置信度仅缩放步长,并且因为时间熵对平坦样本降权而保留高对比度样本。结果是一个在线、自引用的数据课程,完全从前向传播中涌现,并分析了噪声级动态和可测试的预测。

英文摘要

Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output's spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, the opposite of mode collapse. This works because in supervised diffusion the gradient direction is locked to ground truth, so confidence only scales the step size, and because temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass, with analyzed noise-level dynamics and testable predictions.

2606.07190 2026-06-08 cs.CL 新提交

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

从正确性到效用:基于增益的LLM推理前缀评估

Yuhang Zhou, Yixin Cao, Guangnan Ye

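AI summary: Proposes the notion of prefix gain and trains a Prefix Utility Model (PUM) with a pairwise ranking objective to score how much a reasoning prefix raises the success rate, outperforming conventional correctness-based evaluation on mathematical reasoning tasks.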

详情
AI中文摘要

推理前缀塑造了LLM问题求解的未来轨迹,然而现有的过程奖励模型通常通过局部步骤正确性来评估它们。我们认为正确性是最终关心效果的有用但间接的代理:即前缀是否增加了成功完成的概率。我们将此效果定义为前缀增益,即通过在一个前缀上条件化轻量级学生模型组所导致的求解率提升,并使用简单的成对排序目标训练前缀效用模型(PUM)。PUM学习基于结果的前缀效用,并能对完整轨迹和部分推理前缀进行评分。在数学推理的Best-of-$N$选择、束搜索和强化学习中,PUM提供了强大的前缀级监督信号,尤其是在候选池大、搜索预算增加或基于规则的奖励稀疏时。我们在该https URL发布所有数据、模型和代码。

英文摘要

Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at https://zhiqix.github.io/pum-project-page.
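As a rough illustration of the two quantities named in this abstract, the Python sketch below computes a prefix gain from binary rollout outcomes and a pairwise ranking loss on two utility scores. The abstract does not give the exact formulas, so both are assumptions: the gain is taken as the difference of solve rates with and without the prefix, and the ranking loss is the common logistic form $-\log \sigma(s_{\text{better}} - s_{\text{worse}})$.

```python
import math


def solve_rate(outcomes):
    """Fraction of successful rollouts; outcomes is a list of 0/1 values."""
    return sum(outcomes) / len(outcomes)


def prefix_gain(with_prefix, without_prefix):
    """Assumed definition: solve rate with the prefix minus solve rate without.

    Each list pools 0/1 outcomes of student-model rollouts on one problem.
    """
    return solve_rate(with_prefix) - solve_rate(without_prefix)


def pairwise_ranking_loss(score_better, score_worse):
    """Logistic pairwise loss, -log(sigmoid(score_better - score_worse)).

    Written in a numerically stable form for large negative margins.
    """
    margin = score_better - score_worse
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this reading, pairs of prefixes for the same problem would be ordered by their measured gain, and the utility model would be trained so that the higher-gain prefix receives the higher score.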

2606.07186 2026-06-08 cs.RO cs.SE 新提交

A Causal Probabilistic Framework for Perception-Informed Closed-Loop Simulation of Autonomous Driving

面向感知信息闭环仿真的自动驾驶因果概率框架

Zhennan Fei, Rickard Johansson, Mikael Andersson, Matthias Eng, Mattias Eriksson, Kaveh Kianfar, Sadegh Rahrovani, Chris van der Ploeg, Michael Borth, Maren Buermann, Michiel Braat, Henk Goossens, Zijian Han, Majid Khorsand Vakilzadeh, Gabriel Rodrigues de Campos

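AI summary: Proposes a causal probabilistic modeling framework that injects perception errors into a standard simulation environment, revealing latent risks that ideal SIL cannot capture and offering a scalable pathway for SOTIF validation.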

详情
AI中文摘要

软件在环(SIL)仿真是现代汽车安全功能验证的基石。然而,许多当前框架采用理想感知,绕过了感知算法的功能不足,导致过于乐观的安全评估。本文提出一种感知信息SIL测试方法,弥合了地面实况仿真与真实世界感知行为之间的差距。我们提出了一个将因果概率模型纳入标准化、基于场景的仿真工具链的框架,适用于高级驾驶辅助系统(ADAS)和自动驾驶系统(ADS)。我们的方法能够系统性地注入由物理触发条件(如雾、雨和物体合并场景)导出的真实感知误差,例如检测丢失、尺寸不准确和定位偏移。通过在标准化仿真环境中评估这些“故障”,我们证明了感知信息测试揭示了理想SIL环境无法捕获的潜在操作风险,为SOTIF(ISO 21448)验证提供了可扩展的途径。

英文摘要

Software-in-the-loop (SIL) simulation is a cornerstone for the validation of modern automotive safety functions. However, many current frameworks utilize ideal sensing, which bypasses the functional insufficiencies of perception algorithms, leading to over-optimistic safety assessments. This paper proposes a perception-informed SIL testing methodology that bridges the gap between ground-truth simulation and real-world perception behavior. We present a framework for incorporating causal probabilistic models into standardized, scenario-based simulation toolchains, applicable to both Advanced Driver Assistance Systems (ADAS) and Autonomous Driving Systems (ADS). Our approach enables the systematic injection of realistic perception errors, such as loss of detection, sizing inaccuracies, and positioning offsets, derived from physical triggering conditions like fog, rain, and object-merging scenarios. By evaluating these "faults" within a standardized simulation environment, we demonstrate that perception-informed testing reveals latent operational risks that ideal SIL environments fail to capture, providing a scalable pathway for SOTIF (ISO 21448) validation.

2606.07181 2026-06-08 cs.LG cs.AI q-bio.MN 新提交

RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking

RETROSPECT: 通过序列预测和化学变换排序的逆合成

Raja Sekhar Pappala, Shreyas Vinaya Sathyanarayana, Ronit Kumar Choudhary, Arjun Verma, Deepak Warrier

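AI summary: Proposes the RETROSPECT system, which splits single-step retrosynthesis into candidate generation and reranking by combining the ChemAlign Transformer generator with a LambdaMART reranker, reaching 55.00% top-1 accuracy on USPTO-50K.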

详情
Comments
Accepted at the AI for Science workshop (ICML 2026)
AI中文摘要

单步逆合成既需要准确的首位建议,也需要足够丰富的候选列表以供下游选择。我们将其研究为提议-选择分解。我们的系统RETROSPECT结合了一个单一的Transformer提议模型(我们称之为ChemAlign Transformer)和一个基于结构、反应模板、上游分数以及可选的DFT衍生描述符的LambdaMART重排序器。生成器使用混合根对齐和随机SMILES增强、预层归一化、绑定嵌入、指数移动平均权重以及可微的原子平衡辅助损失进行训练。在包含5,007个反应的完整USPTO-50K测试集上,生成器达到55.00%的top-1和86.18%的top-10精确匹配准确率,top-1有效率为99.86%。在用于重排序的合并候选池基准上(包含5,007个测试产物,每个产物约111个候选),基于结构特征集训练的LambdaMART模型达到59.4%的top-1和0.7171的平均倒数排名。特征消融实验表明,上游提议分数和模板频率统计提供了大部分重排序信号,而DFT和反应中心DFT特征提供的增益较小且不一致。这些结果支持逆合成的模块化观点:更强的单模型提议和学习候选选择是互补的,并且提议模型可以作为集成系统(如RetroChimera (Maziarz et al., 2024))的即插即用组件。

英文摘要

Single-step retrosynthesis needs both accurate first-ranked suggestions and candidate lists that are rich enough for downstream selection. We study this as a proposal-selection decomposition. Our system, RETROSPECT, combines a single Transformer proposal model, which we call the ChemAlign Transformer, with a LambdaMART reranker over structural, reaction-template, upstream-score, and optional DFT-derived descriptors. The generator is trained with hybrid root-aligned and random SMILES augmentation, Pre-LayerNorm, tied embeddings, exponential moving average weights, and a differentiable atom-balance auxiliary loss. On the full USPTO-50K test set of 5,007 reactions, the generator reaches 55.00% top-1 and 86.18% top-10 exact-match accuracy with 99.86% top-1 validity. On the merged candidate-pool benchmark used for reranking, which contains 5,007 test products and about 111 candidates per product, a LambdaMART model trained on the structural feature set reaches 59.4% top-1 with 0.7171 mean reciprocal rank. Feature ablations show that upstream proposal score and template-frequency statistics provide most of the reranking signal, while DFT and reaction-center DFT features provide smaller and less consistent gains. These results support a modular view of retrosynthesis: stronger single-model proposal and learned candidate selection are complementary, and the proposal model can serve as a drop-in component for ensemble systems such as RetroChimera (Maziarz et al., 2024).
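The two headline metrics in this abstract, top-k exact-match accuracy and mean reciprocal rank, can be computed as in the following Python sketch. It assumes that each candidate and reference is a string that has already been canonicalised (for example a canonical SMILES), so that exact match reduces to string equality; the function names are illustrative and not taken from the paper.

```python
def top_k_accuracy(ranked_candidates, references, k):
    """Share of products whose reference appears among the first k candidates.

    ranked_candidates: one best-first candidate list per product.
    references: the ground-truth string for each product.
    """
    hits = sum(
        1
        for candidates, reference in zip(ranked_candidates, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


def mean_reciprocal_rank(ranked_candidates, references):
    """Average of 1 / rank of the reference; a missing reference scores 0."""
    total = 0.0
    for candidates, reference in zip(ranked_candidates, references):
        if reference in candidates:
            total += 1.0 / (candidates.index(reference) + 1)
    return total / len(references)
```

A reranker that only reorders a fixed candidate pool can raise top-1 accuracy and MRR but cannot raise top-k accuracy beyond the pool's coverage, which is one reason to evaluate the proposal and selection stages separately.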

2606.07172 2026-06-08 cs.CV cs.AI cs.CL cs.LG 新提交

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

文本监督增强视觉-语言模型中的地理空间表示

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

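AI summary: Studies the geospatial representations of vision-only, vision-language, and multimodal models and finds that textual supervision effectively improves spatial encoding, advancing geospatial AI.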

详情
Comments
Accepted at ICML 2026
AI中文摘要

地理空间理解是机器学习系统在图像地理定位和空间推理等任务中一个关键但尚未充分探索的维度。在这项工作中,我们分析了三种模型家族获得的地理空间表示:纯视觉架构(如ViT)、视觉-语言模型(如CLIP)和大规模多模态基础模型(如LLaVA、Qwen和Gemma)。通过评估包括人物、地标和日常物体在内的图像聚类(根据可定位程度分组),我们揭示了空间准确性的系统性差距,并表明文本监督增强了地理空间表示的学习。我们的发现表明语言作为编码空间上下文的有效补充模态,以及多模态学习作为推进地理空间AI的关键方向。

英文摘要

Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters (people, landmarks, and everyday objects) grouped by their degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings point to language as an effective complementary modality for encoding spatial context, and to multimodal learning as a key direction for advancing geospatial AI.

2606.07151 2026-06-08 cs.LG 新提交

Geodesics of Dynamic Graphs for Regime Change Detection

动态图的测地线用于状态转换检测

William Cappelletti, Étienne Voutaz, Pascal Frossard

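AI summary: Defines regimes in dynamic networks as trajectories of temporal graphs along geodesics, uses graph regression to measure the cumulative distance between observed graphs and the geodesic, and combines this with change point detection algorithms to identify regime changes, outperforming existing methods on synthetic and real data.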

详情
AI中文摘要

传统动态网络中的变点检测假设平稳状态之间的突然转换,忽略了大多数实际应用(如社交网络或物理系统)中出现的连续演化场景。我们通过将状态正式定义为时间图中连贯动态的时期来弥补这一空白,并将其表征为在适当定义的图空间中沿测地线的轨迹。这一原创视角使我们能够将状态转换定义为动态中的显著漂移,要么朝向新轨迹,要么速度变化。我们利用图回归方法测量观测图序列与相关图空间中其端点之间估计测地线的累积距离,并可将其与变点检测算法结合。我们在具有变化轨迹和不同速度的动态网络上进行实验,结果优于最先进的变点检测模型。然后,我们分析了新冠疫情期间的流动性数据,并表明我们对规则网络演化的假设导致变点与外部事件相比基线方法的结果更一致。我们的工作是首次在图空间中建模和检测演化状态之间的变化,为分析复杂时间图数据提供了现实且强大的工具。

英文摘要

Traditional change point detection in dynamic networks assumes abrupt transitions between stationary states, overlooking scenarios of continuous evolution, which arise in most real-world applications, such as social networks or physical systems. We address this gap by formally defining regimes as periods of coherent dynamics in temporal graphs, which we characterize as trajectories along geodesics in a suitably defined graph space. This original perspective allows us to define regime changes as significant drifts in dynamics, either toward new trajectories or with pace changes. We leverage graph regression methods to measure the cumulative distance of sequences of observed graphs from the estimated geodesics between their endpoints in the relevant graph space, a measure that can be combined with change point detection algorithms. We present experiments on dynamic networks with changing trajectories and varying speeds, in which we outperform state-of-the-art change point detection models. We then analyse mobility data from the Covid-19 pandemic and show that our assumptions on regular network evolution lead to change points that are better aligned with external events than those of baseline methods. Our work is the first to model and detect changes between evolving regimes in graph space, providing a realistic and powerful tool for analyzing complex temporal graph data.
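The idea of measuring how far a window of observed graphs strays from the geodesic between its endpoints can be sketched in Python under strong simplifying assumptions. Here graphs are weighted adjacency matrices on a fixed node set, the graph space is taken to be Euclidean with the Frobenius distance (so geodesics are straight lines), and snapshots are assumed evenly spaced in time. The paper's actual graph space and graph regression estimator may differ; the names below are hypothetical.

```python
import math


def frobenius_distance(a, b):
    """Frobenius distance between two equally sized adjacency matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def geodesic_point(start, end, t):
    """Point at time t in [0, 1] on the straight line from start to end."""
    return [
        [(1.0 - t) * x + t * y for x, y in zip(row_s, row_e)]
        for row_s, row_e in zip(start, end)
    ]


def cumulative_deviation(graphs):
    """Sum of distances from each snapshot to the endpoint-to-endpoint geodesic.

    graphs: list of at least two adjacency matrices, evenly spaced in time.
    """
    last = len(graphs) - 1
    start, end = graphs[0], graphs[-1]
    return sum(
        frobenius_distance(graph, geodesic_point(start, end, index / last))
        for index, graph in enumerate(graphs)
    )
```

A window that stays within one regime would give a deviation near zero under these assumptions, while a window that straddles a change of direction or pace would give a larger value, which a change point detector could then threshold.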