arXivDaily: daily arXiv academic digest, updated Monday to Friday
2503.04929 2026-05-25 cs.RO cs.LG cs.SY eess.SY

Neural Configuration-Space Barriers for Manipulation Planning and Control

Kehan Long, Ki Myung Brian Lee, Nikola Raicevic, Niyas Attasseri, Melvin Leok, Nikolay Atanasov

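Affiliations: Contextual Robotics Institute, University of California San Diego

AI summary: This paper studies how to plan and control the motion of high-dimensional manipulators efficiently and safely in cluttered dynamic environments. The authors propose a unified approach based on neural configuration-space distance functions (CDFs) that formulates safety constraints as CDF barriers, reducing the number of collision checks during path planning. To handle the uncertainty introduced by model error and sensor noise, they also propose a distributionally robust CDF barrier control formulation that does not assume a known noise distribution. Experiments show that the method achieves efficient and safe manipulator control while relying only on onboard point-cloud observations.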

Abstract

Planning and control for high-dimensional robot manipulators in cluttered dynamic environments require computational efficiency and robust safety guarantees. Inspired by recent advances in learning configuration-space distance functions (CDFs) as representations of robot bodies, we propose a unified approach for motion planning and control that formulates safety constraints as CDF barriers. A CDF barrier approximates the local free configuration space, substantially reducing the number of collision-checking operations during motion planning. However, learning a CDF barrier with a neural network and relying on online sensor observations introduces uncertainties that must be considered during control synthesis. To address this, we develop a distributionally robust CDF barrier formulation for control that accounts for modeling errors and sensor noise without assuming a known underlying distribution. Simulations and hardware experiments on a UFactory xArm6 manipulator show that our neural CDF barrier formulation enables efficient planning and robust safe control in cluttered and dynamic environments, relying only on onboard point-cloud observations.
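
The barrier idea in the abstract can be illustrated with a minimal sketch. It assumes single-integrator joint dynamics (the command is a joint velocity), one static obstacle, and a hypothetical hand-written distance function `toy_cdf` standing in for a learned neural CDF. It enforces the standard barrier condition grad_h · u >= -alpha * h with a closed-form projection, and it is not the distributionally robust formulation developed in the paper.

```python
import numpy as np

def toy_cdf(q, center, radius):
    """Hypothetical stand-in for a learned CDF: distance from configuration q
    to a ball-shaped unsafe set, plus its gradient with respect to q."""
    diff = q - center
    dist = np.linalg.norm(diff)
    return dist - radius, diff / dist

def safety_filter(u_nom, h, grad_h, alpha=1.0):
    """Return the command closest to u_nom that satisfies
    grad_h @ u >= -alpha * h (one affine constraint, closed-form projection)."""
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom  # nominal command already satisfies the barrier condition
    return u_nom - slack * grad_h / (grad_h @ grad_h)

q = np.array([1.5, 0.0])             # current configuration (illustrative)
center = np.array([0.0, 0.0])        # centre of the unsafe set (illustrative)
h, grad_h = toy_cdf(q, center, radius=1.0)
u_nom = np.array([-2.0, 0.0])        # nominal command heading straight at the obstacle
u_safe = safety_filter(u_nom, h, grad_h, alpha=1.0)
```

With several obstacles or input limits the projection no longer has this closed form and a small quadratic program would be solved instead.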

2502.17119 2026-05-25 cs.LG cs.AI

Diffusion and Flow Matching Models for Tabular Data: A Survey

Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen

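Affiliations: Great Bay University; Vrije Universiteit Amsterdam; LIACS, Leiden University

AI summary: This survey covers diffusion and flow matching models for tabular data generation, discussing the strengths and methods of these models when dealing with mixed numerical and categorical attributes, missing values, sensitive fields, and complex dependencies. It reviews work from 2015 to 2026, organized around data-engineering challenges, tasks, design choices, and evaluation dimensions, and identifies open problems in scalability, feature dependency modeling, privacy, fairness, and constraint-aware generation.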

Comments We substantially updated the previous version "Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions" by including flow matching models for tabular data

Abstract

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain numerical and categorical attributes, missing values, sensitive fields, imbalanced categories, complex feature dependencies, and domain constraints. Earlier tabular data modeling methods based on GANs or VAEs have achieved useful results, but they can suffer from unstable training, mode collapse, weak modeling of multimodal distributions, and fragile handling of mixed-type features. Diffusion models have therefore attracted growing interest because their noising-and-denoising formulation provides a flexible and stable way to model complex data distributions, and has been adapted to tabular synthesis, missing-value imputation, trustworthy data generation, and anomaly detection. Flow matching offers a closely related route by learning transport vector fields along probability paths, often with more direct control over path design and sampling efficiency. Despite this progress, the literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions. To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. We maintain updates in a GitHub repository.
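
As a rough illustration of "learning transport vector fields along probability paths", the sketch below builds one conditional flow matching training batch on a straight-line path from Gaussian noise to data. It assumes purely numerical, standardized columns (categorical columns need an extra encoding step that is not shown), and the two-column table and the `zero_field` placeholder model are hypothetical.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Sample (x_t, t, target velocity) on the straight path x_t = (1 - t) * x0 + t * x1,
    where x0 is Gaussian noise and x1 is a batch of data rows."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=(x1.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0                      # velocity of the straight path
    return x_t, t, target

def cfm_loss(velocity_fn, x_t, t, target):
    """Mean squared error between the predicted and the target velocity."""
    pred = velocity_fn(x_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

rng = np.random.default_rng(0)
# Hypothetical table with two numerical columns, standardized before training.
x1 = rng.normal(loc=[50.0, 3.0], scale=[10.0, 1.0], size=(256, 2))
x1 = (x1 - x1.mean(axis=0)) / x1.std(axis=0)

x_t, t, target = cfm_batch(x1, rng)
zero_field = lambda x, t: np.zeros_like(x)   # placeholder for a trainable network
loss = cfm_loss(zero_field, x_t, t, target)
```

In practice `velocity_fn` would be a neural network trained to minimize this loss, and new rows would be generated by integrating the learned field from noise at t = 0 to t = 1.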

2502.07646 2026-05-25 cs.LG stat.ME stat.ML

Causal Additive Models with Unobserved Causal Paths and Backdoor Paths

Thong Pham, Takashi Nicholas Maeda, Shohei Shimizu

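Affiliations: Shiga University; RIKEN AIP; Gakushuin University; The University of Osaka

AI summary: This paper studies how to identify the causal direction between variables when unobserved causal paths and backdoor paths are present. The authors give new characterizations of regression sets that determine independence among regression residuals and conditional independence among observed variables, and use them to establish sufficient conditions for identifying causal directions. Building on this, they propose a search algorithm and prove its soundness and completeness; experiments show competitive performance.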

Comments 23 pages

Journal ref Proceedings of AISTATS 2026

Abstract

Causal additive models provide a tractable yet expressive framework for causal discovery in the presence of hidden variables. When unobserved backdoor or causal paths exist between two variables, their causal relationship is often unidentifiable under existing theories. We establish sufficient conditions under which causal directions can be identified in many such cases. These conditions rely on new characterizations of regression sets to determine independence among regression residuals and conditional independencies among observed variables. Building on these results, we introduce a search algorithm that incorporates these innovations and prove its soundness and completeness. Empirical evaluations demonstrate its competitive performance against state-of-the-art methods.

2502.07489 2026-05-25 cs.LG

Physiome-ODE: A Benchmark for Irregularly Sampled Multivariate Time Series Forecasting Based on Biological ODEs

Christian Klötergens, Vijaya Krishna Yalavarthi, Randolf Scholz, Maximilian Stubbemann, Stefan Born, Lars Schmidt-Thieme

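Affiliations: ISMLL & VWFS DARC, University of Hildesheim; University of Hildesheim; Institute of Mathematics, TU Berlin; DARC, University of Hildesheim

AI summary: Current methods for forecasting irregularly sampled multivariate time series are evaluated on only a handful of datasets, on which ODE-based models perform poorly, which has held back further research on them. This paper proposes a method for generating irregularly sampled multivariate time series from real biological ODE models and uses rejection sampling to build Physiome-ODE, a large benchmark of 50 datasets. The benchmark differs markedly from existing datasets and distinguishes meaningfully between forecasting models on irregular time series, giving new impetus to research on ODE-based models.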

Abstract

State-of-the-art methods for forecasting irregularly sampled time series with missing values predominantly rely on just four datasets and a few small toy examples for evaluation. While ordinary differential equations (ODEs) are the prevalent models in science and engineering, a baseline model that forecasts a constant value outperforms ODE-based models from the last five years on three of these existing datasets. This unintuitive finding hampers further research on ODE-based models, a more plausible model family. In this paper, we develop a methodology to generate irregularly sampled multivariate time series (IMTS) datasets from ordinary differential equations and to select challenging instances via rejection sampling. Using this methodology, we create Physiome-ODE, a large and sophisticated benchmark of IMTS datasets consisting of 50 individual datasets, derived from real-world ordinary differential equations from research in biology. Physiome-ODE is the first benchmark for IMTS forecasting that we are aware of and an order of magnitude larger than the current evaluation setting of four datasets. Using our benchmark Physiome-ODE, we show qualitatively different results from those derived from the current four datasets: on Physiome-ODE, ODE-based models can play to their strengths, and our benchmark can differentiate in a meaningful way between different IMTS forecasting models. In this way, we expect to give a new impulse to research on ODE-based time series modeling.
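
To make "generating an IMTS dataset from an ODE" concrete, here is a minimal sketch under simplifying assumptions: a toy damped oscillator (hypothetical, not one of the Physiome models), fixed-step RK4 integration, uniformly random observation times, and independent per-channel missingness. The rejection-sampling step that selects challenging instances is not shown.

```python
import numpy as np

def simulate(f, y0, t_grid):
    """Fixed-step RK4 integration of dy/dt = f(t, y) on a dense time grid."""
    ys = [np.asarray(y0, dtype=float)]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h, y = t1 - t0, ys[-1]
        k1 = f(t0, y)
        k2 = f(t0 + h / 2, y + h / 2 * k1)
        k3 = f(t0 + h / 2, y + h / 2 * k2)
        k4 = f(t1, y + h * k3)
        ys.append(y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.stack(ys)

def make_imts(ys, t_grid, n_obs, keep_prob, rng):
    """Keep n_obs random time points; drop each channel value with prob 1 - keep_prob."""
    idx = np.sort(rng.choice(len(t_grid), size=n_obs, replace=False))
    values = ys[idx].copy()
    mask = rng.uniform(size=values.shape) < keep_prob   # True where a value is observed
    values[~mask] = np.nan
    return t_grid[idx], values, mask

def damped_oscillator(t, y):
    """Toy two-channel ODE used only for illustration."""
    return np.array([y[1], -y[0] - 0.1 * y[1]])

rng = np.random.default_rng(0)
t_grid = np.linspace(0.0, 10.0, 1001)
ys = simulate(damped_oscillator, [1.0, 0.0], t_grid)
times, values, mask = make_imts(ys, t_grid, n_obs=40, keep_prob=0.7, rng=rng)
```

The result is a set of irregular time stamps with partially observed channels, which is the input format an IMTS forecasting model would consume.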

2502.07295 2026-05-25 cs.LG

Targeted Regularization for Causal Effect Estimation with Exponential Dispersion Family Outcomes

Jiahong Li, Zeqin Yang, Jixing Xu, Enzheng Hua, Zhichao Zou, Peng Zhen, Jiecheng Guo

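Affiliations: Didi Chuxing

AI summary: This paper studies targeted regularization for causal effect estimation with exponential dispersion family (EDF) outcomes, aiming to give neural estimators desirable statistical properties such as double robustness and fast convergence. The authors propose a unified targeted regularization framework that covers both discrete and continuous treatments and corrects first-order bias at the distributional level. The objective is integrated into a neural network architecture that jointly estimates the outcome model, the propensity score model, and the fluctuation parameter end-to-end; experiments support its effectiveness.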

Abstract

Neural Networks (NNs) for causal effect estimation have shown strong empirical performance, yet endowing them with desirable semiparametric properties -- double robustness and fast convergence rates -- remains challenging. A common approach to address this is targeted regularization, which modifies the objective function of NNs. However, existing work on neural causal effect estimation is largely limited to continuous outcomes, restricting its applicability to settings involving binary, count, or other skewed outcomes commonly encountered in practice. We propose a unified targeted regularization framework for the Exponential Dispersion Family (EDF) to address this limitation. Specifically, we first derive the von Mises expansion of the average dose function of canonical functions (ADCF) for discrete treatments and of the sieve-projected ADCF for continuous treatments. Second, we use this expansion to construct a unified targeted regularization that corrects first-order bias at the distributional level. We integrate this objective into a NN architecture that jointly estimates the outcome model, propensity score model, and fluctuation parameter end-to-end. Experimental results demonstrate the effectiveness of our method.

2502.04415 2026-05-25 cs.CV cs.AI

TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives

Sergios-Anestis Kefalidis, Konstantinos Plas, Manolis Koubarakis

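Affiliations: Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens; Archimedes/Athena RC

AI summary: TerraQ is a spatiotemporal question-answering system for satellite image archives that retrieves the satellite images matching a natural-language request. It combines natural language processing with a specialized knowledge base and supports complex queries over image metadata and geographic entities. Its main contribution is making Earth Observation data more accessible.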

Abstract

TerraQ is a spatiotemporal question-answering engine for satellite image archives. It is a natural language processing system that is built to process requests for satellite images satisfying certain criteria. The requests can refer to image metadata and entities from a specialized knowledge base (e.g., the Emilia-Romagna region). With it, users can make requests like "Give me a hundred images of rivers near ports in France, with less than 20% snow coverage and more than 10% cloud coverage", thus making Earth Observation data more easily accessible, in line with the current landscape of digital assistants.

2502.04230 2026-05-25 cs.SD cs.AI cs.CR cs.LG eess.AS

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

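Affiliations: Department of Computer Science, Lehigh University, Bethlehem, PA, USA; Dolby Laboratories Inc., San Francisco, CA, USA

AI summary: With the rapid development of generative audio synthesis and editing, concerns about copyright protection, data provenance, and the spread of deepfake audio are growing. This paper proposes XAttnMark, a robust audio watermarking method based on cross-attention that jointly optimizes watermark detection and attribution through partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module. It also introduces a psychoacoustic-aligned time-frequency masking loss that improves watermark imperceptibility. Experiments show superior robustness under a wide range of audio transformations.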

Comments Accepted at ICML'25

Abstract

The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

2411.12173 2026-05-25 cs.LG cs.AI

SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks

Yongyan Wen, Siyuan Li, Rongchang Zuo, Lei Yuan, Hangyu Mao, Peng Liu

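Affiliations: Faculty of Computing, Harbin Institute of Technology; National Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Polixir Technologies; SenseTime Research

AI summary: This paper proposes SkillTree, an explainable skill-based deep reinforcement learning framework for long-horizon control tasks with complex continuous action spaces. It reduces the continuous action space to a discrete skill space and uses a differentiable decision tree in the high-level policy to generate skill embeddings that guide the low-level policy, giving skill-level explainability. Experiments show performance comparable to skill-based neural network methods on complex robotic arm control tasks, with a more transparent decision process.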

Abstract

Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces into discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.
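
For readers unfamiliar with differentiable decision trees, the sketch below shows a generic depth-2 soft tree whose leaves index a small table of skill embeddings. It is an illustration under assumed shapes (4-dimensional state, 4 skills, 2-dimensional embeddings, random parameters), not SkillTree's actual architecture or training procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_leaf_probs(x, weights, biases):
    """Depth-2 soft decision tree. Node 0 is the root; nodes 1 and 2 are its
    left and right children. Each node routes right with probability sigmoid(w @ x + b),
    so every leaf probability is a differentiable function of the parameters."""
    p = sigmoid(weights @ x + biases)
    return np.array([
        (1 - p[0]) * (1 - p[1]),   # left, left
        (1 - p[0]) * p[1],         # left, right
        p[0] * (1 - p[2]),         # right, left
        p[0] * p[2],               # right, right
    ])

rng = np.random.default_rng(0)
weights = rng.standard_normal((3, 4))           # one linear split per inner node
biases = rng.standard_normal(3)
skill_embeddings = rng.standard_normal((4, 2))  # one embedding per leaf (skill)
state = rng.standard_normal(4)

leaf_probs = soft_tree_leaf_probs(state, weights, biases)
soft_skill = leaf_probs @ skill_embeddings               # smooth mixture, usable during training
hard_skill = skill_embeddings[np.argmax(leaf_probs)]     # single readable decision path
```

The soft mixture keeps the tree trainable by gradient descent, while the arg-max path gives a human-readable sequence of split decisions for the chosen skill.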

2411.01088 2026-05-25 cs.LG math.OC

CRONOS: Enhancing Deep Learning with Scalable GPU Accelerated Convex Neural Networks

Miria Feng, Zachary Frangella, Mert Pilanci

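Affiliations: Stanford University

AI summary: This paper introduces CRONOS, an algorithm for convex optimization of two-layer neural networks that is the first to scale to high-dimensional datasets such as ImageNet, whereas prior work was limited to downsampled versions of MNIST and CIFAR-10. Building on CRONOS, the authors develop CRONOS-AM, which adds alternating minimization to train multi-layer networks with arbitrary architectures. The theory shows that CRONOS converges to the global minimum of the convex reformulation under mild assumptions, and experiments on vision and language tasks show validation accuracy comparable to or better than mainstream tuned deep learning optimizers.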

Journal ref Advances in Neural Information Processing Systems 37 (NeurIPS 2024)

Abstract

We introduce the CRONOS algorithm for convex optimization of two-layer neural networks. CRONOS is the first algorithm capable of scaling to high-dimensional datasets such as ImageNet, which are ubiquitous in modern deep learning. This significantly improves upon prior work, which has been restricted to downsampled versions of MNIST and CIFAR-10. Taking CRONOS as a primitive, we then develop a new algorithm called CRONOS-AM, which combines CRONOS with alternating minimization, to obtain an algorithm capable of training multi-layer networks with arbitrary architectures. Our theoretical analysis proves that CRONOS converges to the global minimum of the convex reformulation under mild assumptions. In addition, we validate the efficacy of CRONOS and CRONOS-AM through extensive large-scale numerical experiments with GPU acceleration in JAX. Our results show that CRONOS-AM can obtain comparable or better validation accuracy than predominant tuned deep learning optimizers on vision and language tasks with benchmark datasets such as ImageNet and IMDb. To the best of our knowledge, CRONOS is the first algorithm which utilizes the convex reformulation to enhance performance on large-scale learning tasks.

2409.08036 2026-05-25 cs.LG

Heterogeneous Sheaf Neural Networks

Luke Braithwaite, Alessio Borgi, Gabriele Onorato, Kristjan Tarantelli, Francesco Restuccia, Fabrizio Silvestri, Pietro Liò

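Affiliations: Department of Computer Science and Technology, University of Cambridge; Department of Electrical and Computer Engineering, Northeastern University; Department of Computer, Control and Management Engineering, Sapienza University of Rome

AI summary: This work proposes HetSheaf, a framework for learning on heterogeneous graphs whose nodes and edges have different types and feature spaces. Rather than handling heterogeneity through elaborate architectures, HetSheaf represents it directly in the data structure via cellular sheaves and learns restriction maps conditioned on node features, node types, and edge types. It also introduces the SheafPool readout for well-defined graph-level prediction, and across several benchmarks it outperforms a range of existing methods while using up to 10 times fewer parameters.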

Comments 48 pages, 2 figures

Abstract

Heterogeneous graphs, whose nodes and edges can belong to different types and feature spaces, arise in many real-world domains, including biology, recommendation, social networks, and computer systems. Existing heterogeneous graph neural networks typically handle this heterogeneity at the architectural level through relation-specific modules, meta-path machinery or type-aware attention, which often leads to increasingly specialised parameter-heavy designs. In this work, we propose HetSheaf, a framework for learning heterogeneous graphs through cellular sheaves. Instead of encoding heterogeneity solely in the architecture, HetSheaf represents it directly in the underlying data structure by assigning type-aware local feature spaces and learning restriction maps conditioned on node features, node types, and edge types. To support graph-level prediction, we further introduce SheafPool, a universal stalk-space readout that aggregates node representations while being invariant to local changes of basis, thereby making graph classification with sheaf networks well-defined and achieving an F1 Score up to 42 percentage points higher than mean pooling. Across a diverse suite of benchmarks (node classification, link prediction and graph classification), HetSheaf consistently achieves up to 2 percentage points higher performance (up to 94.97% Macro F1 Score on node classification and up to 99.62% on link prediction) on the Heterogeneous Graph Benchmark (HGB) framework against homogeneous (GCN, GAT, GIN, GraphSAGE), heterogeneous (R-GCN, HAT, HGT) and type-agnostic sheaf baselines, while reducing the number of parameters by up to 10$\times$.
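
The notions of "type-aware local feature spaces" and "restriction maps" can be made concrete with the standard sheaf Laplacian of a cellular sheaf on a graph. The sketch below assumes a three-node path graph whose nodes carry feature spaces (stalks) of different dimensions and uses random matrices as restriction maps; in HetSheaf these maps are learned, which is not shown here.

```python
import numpy as np

def sheaf_coboundary(node_dims, edges, edge_dim):
    """Build the coboundary matrix of a cellular sheaf on a graph.
    edges: list of (u, v, F_u, F_v), where F_u has shape (edge_dim, node_dims[u])
    and maps node u's stalk into the shared edge stalk."""
    offsets = np.concatenate([[0], np.cumsum(node_dims)])
    delta = np.zeros((len(edges) * edge_dim, offsets[-1]))
    for k, (u, v, F_u, F_v) in enumerate(edges):
        rows = slice(k * edge_dim, (k + 1) * edge_dim)
        delta[rows, offsets[u]:offsets[u + 1]] = -F_u
        delta[rows, offsets[v]:offsets[v + 1]] = F_v
    return delta

rng = np.random.default_rng(0)
node_dims = [2, 3, 2]   # e.g. two node types with different feature sizes (illustrative)
edge_dim = 2
edges = [
    (0, 1, rng.standard_normal((2, 2)), rng.standard_normal((2, 3))),
    (1, 2, rng.standard_normal((2, 3)), rng.standard_normal((2, 2))),
]

delta = sheaf_coboundary(node_dims, edges, edge_dim)
laplacian = delta.T @ delta                  # sheaf Laplacian, positive semidefinite
x = rng.standard_normal(sum(node_dims))      # stacked node features
energy = x @ laplacian @ x                   # total disagreement across edges
```

The quadratic form measures how much neighbouring nodes disagree after their features are mapped into the shared edge space, which is the quantity that sheaf diffusion layers reduce.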

2407.03535 2026-05-25 cs.CV

BVI-RLV: A Fully Registered Dataset for Low-Light Video Enhancement

Ruirui Lin, Guoxi Huang, Joanne Lin, Qi Sun, Alexandra Malyugina, David R Bull, Nantheera Anantrasirichai

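Affiliations: Visual Information Laboratory, Bristol Vision Institute (BVI), University of Bristol

AI summary: Low-light videos often contain spatiotemporally incoherent noise that harms visibility and the performance of computer vision tasks. To address the lack of high-quality aligned training data for deep-learning-based enhancement, this paper presents BVI-RLV, a dataset of over 30k paired low-light and normal-light frames from 40 scenes with sub-pixel registration. The dataset covers dynamic motion scenarios and comes with baseline implementations of several models; experiments show that registration matters for supervised learning and that models trained on BVI-RLV outperform those trained on existing datasets in cross-dataset evaluation.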

Comments arXiv admin note: text overlap with arXiv:2402.01970

Abstract

Low-light videos often exhibit spatiotemporally incoherent noise, compromising visibility and degrading performance in computer vision applications. A major challenge for enhancing such content using deep learning lies in the scarcity of pixel-aligned, high-quality training data. We introduce BVI-RLV, a fully registered low-light video dataset comprising over 30k paired frames from 40 diverse scenes under two low-light conditions, each aligned with normal-light ground truth. Unlike existing datasets that rely on neutral density (ND) filters or suffer from misalignment issues, BVI-RLV achieves sub-pixel registration for 99.24% of data at full HD resolution across dynamic motion scenarios using a motorized dolly and image-based refinement. The dataset covers a wide range of motion types and realistic temporal noise. We also provide baseline implementations using four representative architectures: Convolutional Neural Network (CNN), Transformer, State Space Model (Mamba), and Diffusion Model (DM). Experiments demonstrate that registration is crucial for supervised learning, yielding up to 5.85 dB PSNR improvement compared to unregistered training. Models trained on BVI-RLV outperform those trained on existing datasets in cross-dataset evaluations, achieving superior performance even in real-world outdoor scenes. Our dataset is publicly available at https://doi.org/10.21227/mzny-8c77.
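
The registration results above are reported in PSNR. As a reminder of how that metric behaves, the sketch below computes PSNR for images scaled to [0, 1] and compares a one-pixel misalignment with mild additive noise on a synthetic random image. The image, shift, and noise level are illustrative assumptions and say nothing about the dataset's actual numbers.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

rng = np.random.default_rng(0)
reference = rng.uniform(size=(64, 64))                       # synthetic "ground truth"
noisy = reference + 0.01 * rng.standard_normal((64, 64))     # aligned, slightly noisy
shifted = np.roll(reference, shift=1, axis=1)                # clean, but misaligned by one pixel

psnr_noisy = psnr(reference, noisy)
psnr_shifted = psnr(reference, shifted)
```

On this textured test image the misaligned copy scores far lower than the noisy but aligned copy, which is one reason pixel-level registration matters when PSNR-style losses supervise training.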

2402.17888 2026-05-25 cs.LG cs.AI

ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection

Bo Peng, Yadan Luo, Yonggang Zhang, Yixuan Li, Zhen Fang

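Affiliations: University of Technology Sydney; The University of Queensland; Hong Kong Baptist University; University of Wisconsin-Madison

AI summary: This paper proposes ConjNorm, a density estimation method for improving out-of-distribution (OOD) detection. It builds a theoretical framework on Bregman divergence that extends the distributions considered to an exponential family and, through a conjugation constraint, reframes density function design as a search for the optimal norm coefficient. To handle the difficulty of computing the normalizer, the authors design an unbiased and analytically tractable partition function estimator based on importance sampling. Experiments show state-of-the-art results on several OOD detection benchmarks.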

Comments ICLR24 poster

Abstract

Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimated scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a ConjNorm method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed ConjNorm has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25% and 28.19% (FPR95) on CIFAR-100 and ImageNet-1K, respectively.
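
The reported gains are in FPR95, the false positive rate on OOD samples at the threshold where 95% of in-distribution samples are accepted (lower is better). The sketch below computes it under the assumption that higher scores mean "more in-distribution", matching the abstract's description of low-scoring OOD samples. The scores are made-up illustrative numbers, and the ConjNorm score itself is not implemented here.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted at the threshold that keeps 95% of
    in-distribution samples (scores at or above the threshold are accepted)."""
    threshold = np.percentile(id_scores, 5)   # 5th percentile of in-distribution scores
    return float(np.mean(np.asarray(ood_scores) >= threshold))

id_scores = np.arange(1.0, 101.0)                    # illustrative in-distribution scores 1..100
ood_scores = np.array([0.0, 1.0, 2.0, 10.0, 50.0])   # illustrative OOD scores
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

Because the metric depends only on the ranking of scores around one threshold, any density-based score can be plugged in without rescaling.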

2402.14212 2026-05-25 cs.LG cs.AI

Moonwalk: Inverse-Forward Differentiation

Dmitrii Krylov, Armin Karamzade, Roy Fox

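Affiliations: University of California, Irvine

AI summary: Moonwalk addresses backpropagation's need to store intermediate activations and proposes a way to compute gradients without storing them. Using a vector-inverse-Jacobian product (vijp) operator together with submersive networks and fragmental gradient checkpointing, it reconstructs gradients exactly in a forward sweep. Experiments show that Moonwalk matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.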

Journal ref The 29th International Conference on Artificial Intelligence and Statistics, 2026

Abstract

Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian. We define submersive networks -- networks whose layer Jacobians have trivial cokernels -- in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector-inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient reverse pass, then reconstructs parameter gradients in a forward sweep using the vijp, eliminating the need to store activations. We implement this method in Moonwalk and show that it matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.
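
One way to read the vijp for a single linear layer y = W x is sketched below. When W has full row rank, its Jacobian is surjective (trivial cokernel), so the vector-Jacobian product g_in = W^T g_out is injective and g_out can be recovered from g_in. The function names and the least-squares style inverse are illustrative assumptions, not the Moonwalk implementation.

```python
import numpy as np

def vjp_linear(W, g_out):
    """Reverse mode: pull the output cotangent back through y = W @ x."""
    return W.T @ g_out

def vijp_linear(W, g_in):
    """Recover the output cotangent from the input cotangent.
    Valid when W has full row rank, so that W @ W.T is invertible."""
    return np.linalg.solve(W @ W.T, W @ g_in)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))      # 3 outputs, 5 inputs: full row rank almost surely
x = rng.standard_normal(5)           # layer input, recomputed during the forward sweep
g_out = rng.standard_normal(3)       # cotangent at the layer output

g_in = vjp_linear(W, g_out)          # what a reverse pass would deliver at the input
g_out_rec = vijp_linear(W, g_in)     # reconstructed while moving forward again
grad_W = np.outer(g_out_rec, x)      # parameter gradient dL/dW = g_out x^T
```

Starting from the input gradient, a forward sweep can then rebuild each layer's output cotangent and parameter gradient in turn, with the activation `x` recomputed on the fly rather than stored.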

2103.14995 2026-05-25 cs.LG cs.AI eess.SP

Thermal transmittance prediction based on the application of artificial neural networks on heat flux method results

Sanjin Gumbarević, Bojan Milovanović, Mergim Gaši, Marina Bagarić

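Affiliations: Center for Theoretical Physics, Sloane Physics Laboratory, Yale University; University of Zagreb, Faculty of Civil Engineering, Department of Materials

AI summary: This paper studies how artificial neural networks (ANNs) can speed up in-situ measurement of the thermal transmittance (U-value) of building envelope elements. The idea is to parallelize heat flux method (HFM) measurements by predicting the unknown heat flux from interior and exterior air temperatures, which shortens the measurement time. The study compares several ANN models on a multi-layer wall; the heat flux predictions are promising, and an additional analysis on another wall points to possible limitations.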

Comments Submitted to International Building Physics Conference 2021

Journal ref J. Phys.: Conf. Ser. 2069 (2021) 012152

Abstract

Deep energy renovation of the building stock has come more into focus in the European Union due to energy-efficiency-related directives. Many buildings that must undergo deep energy renovation are old and may lack design or renovation documentation, or the materials in their building elements may have degraded over time. Thermal transmittance (i.e. U-value) is one of the most important parameters for determining the transmission heat losses through building envelope elements. It depends on the thickness and thermal properties of all the materials that form a building element. The in-situ U-value can be determined with the ISO 9869-1 standard (Heat Flux Method, HFM). Still, measurement duration is one of the reasons why HFM is not widely used in field testing before the renovation design process commences. This paper analyzes the possibility of reducing the measurement time by conducting parallel measurements with one heat-flux sensor. This parallelization could be achieved by applying a specific class of Artificial Neural Network (ANN) to HFM results to predict the unknown heat flux based on collected interior and exterior air temperatures. After a satisfactory prediction is achieved, the HFM sensor can be relocated to another measuring location. The paper compares four ANN cases applied to HFM results for a measurement on one multi-layer wall: a multilayer perceptron with three neurons in one hidden layer, long short-term memory with 100 units, a gated recurrent unit with 100 units, and a combination of 50 long short-term memory units and 50 gated recurrent units. The analysis gave promising results in terms of predicting the heat flux rate based on the two input temperatures. Additional analysis on another wall showed possible limitations of the method, which serve as a direction for further research on this topic.
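
For context, the sketch below shows how a U-value is typically obtained from heat flux and air temperature series with the averaging approach associated with ISO 9869-1: the summed heat flux divided by the summed interior-exterior temperature difference. The readings are invented for illustration, and the standard's data-validity and convergence criteria are not checked here.

```python
import numpy as np

def u_value_average_method(heat_flux, t_interior, t_exterior):
    """Estimate U in W/(m^2 K) as sum(q) / sum(T_i - T_e) over all samples.
    heat_flux is in W/m^2; temperatures are in degrees Celsius or kelvin."""
    q = np.asarray(heat_flux, dtype=float)
    delta_t = np.asarray(t_interior, dtype=float) - np.asarray(t_exterior, dtype=float)
    return float(q.sum() / delta_t.sum())

# Illustrative readings only; in the paper's setting the heat flux series could
# come from an ANN prediction once the sensor has been moved elsewhere.
heat_flux = [10.0, 12.0, 8.0]
t_interior = [20.0, 20.0, 20.0]
t_exterior = [0.0, -4.0, 4.0]
u_value = u_value_average_method(heat_flux, t_interior, t_exterior)
```

Summing before dividing, rather than averaging per-sample ratios, damps the effect of short-term thermal storage in the wall.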

2605.22954 2026-05-25 cs.LG q-bio.QM

FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data

FederatedRSF:面向部分重叠医学数据的联邦随机生存森林

Maryam Moradpour, Jonas Harriehausen, Amirreza Aleyasin, Lion Philipp Wolf, Youngjun Park, Anne-Christin Hauschild

发表机构 * Institute for Predictive Deep Learning in Medicine and Healthcare(医学与健康预测性深度学习研究所) Justus Liebig University Gießen(吉森尤斯图斯·李比希大学) Hessian Center for Artificial Intelligence (hessian.AI)(黑森人工智能中心 (hessian.AI)) Department of Medical Informatics(医学信息学系) University Medical Center Göttingen(哥廷根大学医学中心) Max Planck Institute for Biology of Ageing(马克斯·普朗克衰老生物学研究所)

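AI summary This paper presents FederatedRSF, a federated learning method for survival analysis on multi-center medical data, aimed at settings where sites share only part of their feature sets. Each institution trains random survival forest trees locally, and only feature-compatible trees are redistributed to each site, so models can be aggregated and used for inference without exposing raw data. On the GBSG2 breast cancer dataset, the federated model performs comparably to a centrally trained one.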

Comments 4 pages, 2 figures. Maryam Moradpour, Jonas Harriehausen, and Amirreza Aleyasin contributed equally to this work. Includes supplementary material

详情
AI中文摘要

多中心生存预测可以提高鲁棒性和泛化性,但隐私法规和机构治理通常阻止跨机构汇集患者水平的临床和基因组数据。在实践中,部署因特征空间异质性而进一步复杂化,其中不同站点收集不同的协变量或使用不同的测序面板,导致特征集仅部分重叠。我们提出了FederatedRSF,一个实现联邦随机生存森林的Python包,它聚合本地训练的生存树,并仅将特征兼容的树重新分发到每个站点,从而在无需共享原始数据的情况下实现部分重叠的推理。我们在scikit-survival包中分发的GBSG2乳腺癌队列上评估了FederatedRSF,通过保留特征子集模拟客户端之间的特征异质性,并使用Harrell一致性指数(C-Index)在重复交叉验证和站点分割下评估区分能力。结果表明,联邦模型可以达到与集中式训练设置相当的性能。

英文摘要

Multi-center survival prediction can improve robustness and generalizability, yet privacy regulations and institutional governance often prevent pooling patient-level clinical and genomic data across institutions. In practice, deployment is further complicated by feature-space heterogeneity, in which sites collect different covariates or use different sequencing panels, resulting in only partially overlapping feature sets. We present FederatedRSF, a Python package that implements federated random survival forests, aggregating locally trained survival trees and redistributing only feature-compatible trees to each site, enabling inference with partial overlap without sharing raw data. We evaluate FederatedRSF on the GBSG2 breast cancer cohort distributed with the scikit-survival package, simulating feature heterogeneity across clients by withholding subsets of features, and assessing discrimination using Harrell's concordance index (C-Index) under repeated cross-validation and site-splits. The results demonstrated that the federated model can achieve performance comparable to that of the centralized training setting.
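The redistribution step described above can be pictured as a subset test on feature names. The following is a minimal sketch, not the FederatedRSF package API: the `TreeSummary` class and `compatible_trees` function are hypothetical, and it assumes a tree is "feature-compatible" with a site when every feature it splits on is collected at that site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSummary:
    """Hypothetical record of which features a locally trained tree splits on."""
    tree_id: str
    features_used: frozenset


def compatible_trees(pool, site_features):
    """Keep only trees whose split features are all available at the site."""
    available = frozenset(site_features)
    return [tree for tree in pool if tree.features_used <= available]


# Illustrative pool; the feature names follow GBSG2 column names.
pool = [
    TreeSummary("site_a_tree_0", frozenset({"age", "tsize"})),
    TreeSummary("site_a_tree_1", frozenset({"age", "pnodes", "progrec"})),
    TreeSummary("site_b_tree_0", frozenset({"tsize", "estrec"})),
]
site_c_features = {"age", "tsize", "estrec"}
selected = [tree.tree_id for tree in compatible_trees(pool, site_c_features)]
```

Here `site_a_tree_1` is withheld from site C because it splits on `pnodes` and `progrec`, which site C does not record.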

2605.22942 2026-05-25 cs.CV

Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection

改进的视觉到海图浮标关联:基于学习的世界到图像投影

Borja Carrillo-Perez

发表机构 * Arquimea Research Center(阿基米德研究中心)

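AI summary For the MaCVi 2026 Vision-to-Chart data association challenge, this report proposes a lightweight modification to the DETR-based fusion transformer baseline. A dedicated multilayer perceptron, QueryMLP, explicitly predicts each buoy's waterline contact point in the image from chart measurements and IMU orientation data, giving every buoy a direct spatial prior and easing the geometric reasoning burden on the transformer decoder. On the held-out test set the method reaches an Overall score of 0.7386 (F1 = 0.8055, mIoU = 0.6718), placing second among challenge submissions.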

Comments 5 pages, 3 figures. Technical report for the MaCVi 2026 Vision-to-Chart Data Association Challenge at the CVPR 2026 Workshop; 2nd place submission. Code: https://github.com/bcarrpe/macvi26-visionmap-querymlp

详情
AI中文摘要

本报告提出了对基于DETR的融合Transformer基线的一种轻量级修改,用于MaCVi 2026视觉到海图(Vision-to-Chart)数据关联挑战。挑战基线的解码器接收每个浮标的查询,这些查询编码了世界空间中的距离和方位,迫使Transformer隐式学习从世界坐标到图像像素的复杂几何投影。相反,本工作额外训练了一个专用MLP(QueryMLP),根据海图测量和IMU姿态数据显式预测浮标水线接触点的图像坐标。预测的像素坐标被附加到基线解码器的查询向量中,为每个浮标提供直接的空间先验,并减轻Transformer解码器的几何推理负担。在挑战排行榜上,所提出的方法在保留测试集上取得了Overall 0.7386、F1 0.8055、mIoU 0.6718的成绩,在所有提交中排名第二。

英文摘要

This report presents a lightweight modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge. The challenge baseline decoder receives per-buoy queries encoding world-space distance and bearing, forcing the transformer to implicitly learn the complex geometric projection from world coordinates to image pixels. Instead, this work trains an additional dedicated MLP, QueryMLP, to explicitly predict the buoy's waterline contact point in the image from chart measurements and IMU orientation data. The predicted pixel coordinates are appended to the baseline decoder query vector, providing a direct spatial prior per buoy and reducing the geometric reasoning burden on the transformer decoder. On the challenge leaderboard, the presented approach achieves an Overall score of 0.7386, with F1 = 0.8055 and mIoU = 0.6718, on the held-out test set, placing second among all submissions.

2605.22940 2026-05-25 cs.LG cs.AI stat.ML

Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning

以人为中心的学习力学:熵正则化表示学习的动力学框架

Kim Phuc Tran

发表机构 * Univ. Lille, ENSAIT, ULR 2461 – GEMTEX – Génie et Matériaux Textiles(里尔大学,ENSAIT,ULR 2461 – GEMTEX – 纺织工程与材料) International Chair in DS & XAI, International Research Institute for Artificial Intelligence and Data Science, Dong A University(数据科学与可解释人工智能国际讲席,人工智能与数据科学国际研究院,东亚大学)

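AI summary This paper proposes Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. It argues that entropy regularization can yield weak, unstable, or misaligned gradients when the entropy surrogate is poorly chosen, and therefore introduces the notion of effective entropy together with tractable geometric entropy surrogates such as variance-based and log-determinant covariance proxies. The main contributions are a formalization of entropy regularization through effective information force, convergence and generalization results under explicit assumptions, and a conditional dynamical interpretation of scaling-law-like behavior. Controlled experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.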

Comments Submitted to JMLR

详情
AI中文摘要

深度学习越来越被视为参数空间中的动力学过程,然而许多现有理论仍将训练视为封闭的优化系统。这种观点对于现实世界的人工智能是有限的,因为模型在不确定性、资源约束、分布偏移、下游决策风险和人类反馈下运行。我们提出了以人为中心的学习力学(HCLM),一个用于开放和受控学习系统的动力学和信息论框架。核心思想是,只有当所选的熵代理沿着优化轨迹产生非简并的信息力时,熵正则化才是有用的。否则,熵项可能产生弱、不稳定或不对齐的梯度,导致动力学坍缩为普通的损失最小化。我们引入了有效熵的概念,并研究了可处理的几何熵代理,包括基于方差和对数行列式协方差代理。本文做出三项贡献。首先,它通过有效信息力形式化了熵正则化,并刻画了简并熵区域。其次,它在显式假设下推导了收敛性、熵流、Wasserstein梯度流和噪声表示泛化结果。第三,它提供了缩放律行为的条件动力学解释,作为信息注入、熵耗散和残差风险之间的平衡,而不声称对经验神经缩放律的无条件推导。受控的表示学习实验支持几何熵代理(尤其是对数行列式协方差熵)比softmax归一化熵产生更强更稳定的信息力的假设。

英文摘要

Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed optimization system. This view is limited for real-world AI, where models operate under uncertainty, resource constraints, distribution shift, downstream decision risks, and human feedback. We propose Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. The central idea is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory. Otherwise, entropy terms may produce weak, unstable, or misaligned gradients, causing the dynamics to collapse toward ordinary loss minimization. We introduce the notion of effective entropy and study tractable geometric entropy surrogates, including variance-based and log-determinant covariance proxies. The paper makes three contributions. First, it formalizes entropy regularization through effective information force and characterizes degenerate entropy regimes. Second, it derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions. Third, it offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk, without claiming an unconditional derivation of empirical neural scaling laws. Controlled representation-learning experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.
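To make the two geometric surrogates named above concrete, here is a small sketch under stated assumptions: the variance-based proxy is taken to be the trace of the representation covariance, and the log-determinant proxy is 0.5 * log det(cov + eps * I), which matches Gaussian differential entropy up to an additive constant. The abstract does not give the paper's exact definitions, so the function names, the `eps` regularizer, and the toy data are illustrative only.

```python
import math


def covariance(rows):
    """Unbiased sample covariance of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def logdet_spd(matrix):
    """log det of a symmetric positive-definite matrix via Cholesky."""
    d = len(matrix)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(d))


def variance_entropy_proxy(rows):
    """Variance-based proxy: trace of the covariance."""
    cov = covariance(rows)
    return sum(cov[i][i] for i in range(len(cov)))


def logdet_entropy_proxy(rows, eps=1e-3):
    """Log-determinant proxy: 0.5 * log det(cov + eps * I)."""
    cov = covariance(rows)
    for i in range(len(cov)):
        cov[i][i] += eps
    return 0.5 * logdet_spd(cov)


spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # fills both axes
collapsed = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # lies on one line
```

On this toy data the two proxies disagree: the collapsed representation has the larger total variance, but its log-determinant is much lower because one direction carries no variance. This is one way a variance-only surrogate can miss dimensional collapse that a log-determinant surrogate penalizes.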

2605.22939 2026-05-25 cs.CL cs.LG

Learnability-Informed Fine-Tuning of Diffusion Language Models

扩散语言模型的可学习性感知微调

Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji

发表机构 * Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA(计算机科学与工程系,德克萨斯A&M大学,College Station, TX, USA) Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA(电气与计算机工程系,德克萨斯A&M大学,College Station, TX, USA)

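AI summary This paper aims to improve the reasoning capabilities of diffusion language models (DLMs). It finds that vanilla supervised fine-tuning (SFT) overlooks learnability, that is, which tokens are learned and when, and can therefore hurt DLM performance. The authors propose LIFT, an SFT-based post-training method that learns easy tokens when most of the input is masked and hard tokens when more context is available, matching training to the information available at each diffusion time step. LIFT outperforms existing SFT baselines on six reasoning benchmarks, with up to a 3x relative gain on AIME'24 and AIME'25.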

详情
AI中文摘要

我们旨在提升扩散语言模型(DLM)的推理能力。虽然SFT是自回归模型常用的后训练方法,但其在DLM中的应用面临挑战,甚至可能损害性能,而根本原因尚未得到充分研究。我们的分析揭示,普通SFT忽略了可学习性,即学习什么以及何时学习。具体而言,当大部分输入被掩码时,稀有标记难以学习;而当大部分输入未被掩码时,学习常见标记则较为简单且价值不大。基于我们的分析,我们提出LIFT,一种高效的基于SFT的DLM后训练算法。LIFT在大部分输入被掩码时学习容易标记,在更多上下文可用时学习困难标记,从而使训练与不同扩散时间步的信息可用性对齐。我们的结果表明,LIFT在六个推理基准上优于现有SFT基线,在AIME'24和AIME'25上实现了高达3倍的相对增益。我们的代码已在https://github.com/divelab/LIFT公开。

英文摘要

We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.

2605.22937 2026-05-25 cs.CL

RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation

RAS: 基于上下文学习的反射增强缩放用于可执行Cypher查询生成

Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni

发表机构 * Cloudera

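AI summary This paper studies how to allocate inference-time compute to reduce errors when generating executable Cypher queries. The authors propose Reflection-Augmented Scaling (RAS), which feeds the error messages returned by the database back to the model through in-context learning so that each new attempt is conditioned on prior execution feedback. Compared with Independent Scaling, which resamples without memory, RAS reduces the query execution error rate more across three Neo4j datasets and five code-specialized language models (41-50% versus 32-38% at n = 5).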

详情
AI中文摘要

推理时缩放可以减少结构化查询生成中的错误,但为查询代码生成分配计算资源的方法仍未充分探索。我们研究Text2Cypher,其中语言模型生成针对属性图数据库执行的Cypher查询。不可执行查询构成了与语义不准确性不同的独特语法失败:语法错误会触发数据库生成的系统错误消息。这些错误消息通常在推理时被丢弃,而不是通过上下文学习(ICL)加以利用。我们比较了两种推理方法:独立缩放(IS),执行无记忆重采样;以及反射增强缩放(RAS),通过ICL使每次新尝试基于先前的执行反馈。在三个Neo4j数据集和五个代码专用语言模型上,RAS在n=5时将查询执行错误率降低了41-50%,优于IS的32-38%。执行错误不仅仅是需要丢弃的失败,而是可操作的反馈,围绕它们组织推理时计算是实现可执行性比缩放独立样本更高效的途径。

英文摘要

Inference-time scaling can reduce errors in structured query generation, but methods for allocating compute to query code generation remain underexplored. We study Text2Cypher, where language models generate Cypher queries that execute against property graph databases. Non-executable queries constitute a distinct syntactic failure separate from semantic inaccuracy: a syntax error triggers a system-generated error message from the database. These error messages are typically discarded at inference time rather than leveraged through in-context learning (ICL). We compare two inference methods: Independent Scaling (IS), which performs memoryless resampling, and Reflection-Augmented Scaling (RAS), which conditions each new attempt on prior execution feedback via ICL. Across three Neo4j datasets and five code-specialized language models, RAS reduces the Query Execution Error Rate by 41-50% at n = 5, outperforming IS at 32-38%. Execution errors are not merely failures to discard but actionable feedback, and structuring inference-time compute around them is a more efficient path to executability than scaling independent samples.
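The difference between the two inference methods comes down to whether the error message is carried into the next attempt. The sketch below contrasts the two loops with toy stand-ins: `toy_generate` and `toy_execute` are hypothetical placeholders for a language model call and a database call, and the error string is invented for illustration rather than taken from Neo4j.

```python
def independent_scaling(generate, execute, question, n=5):
    """IS: resample up to n times; every attempt sees no earlier feedback."""
    for _ in range(n):
        query = generate(question, feedback=[])
        ok, _error = execute(query)
        if ok:
            return query
    return None


def reflection_augmented_scaling(generate, execute, question, n=5):
    """RAS: each new attempt is conditioned on prior (query, error) pairs."""
    feedback = []
    for _ in range(n):
        query = generate(question, feedback=feedback)
        ok, error = execute(query)
        if ok:
            return query
        feedback.append((query, error))  # kept in context for the next attempt
    return None


def toy_execute(query):
    """Stand-in for the database: rejects queries without a RETURN clause."""
    if "RETURN" not in query:
        return False, "SyntaxError: query must end with a RETURN clause"
    return True, ""


def toy_generate(question, feedback):
    """Stand-in for the model: repairs its query only when shown the error."""
    base = "MATCH (p:Person)"
    if any("RETURN" in error for _query, error in feedback):
        return base + " RETURN p.name"
    return base


is_result = independent_scaling(toy_generate, toy_execute, "List all names")
ras_result = reflection_augmented_scaling(toy_generate, toy_execute, "List all names")
```

With this deterministic toy model, independent resampling repeats the same mistake five times, while the reflective loop succeeds on its second attempt. A real model is stochastic, so IS does recover some failures, as the reported 32-38% reduction shows.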

2605.22907 2026-05-25 cs.CV

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

VideoOdyssey: 超长上下文与全模态视频理解基准

Haichen He, Jiayi Zhou, Sifeng Shang, Yihan Hu, Yuanhan Zhang, Kaiyang Zhou

发表机构 * Hong Kong Baptist University(香港浸会大学) S-Lab, Nanyang Technological University(S-Lab,南洋理工大学) GVC Lab, Great Bay University(GVC 实验室,大湾区大学)

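AI summary VideoOdyssey is a new benchmark for ultra-long-context and omni-modal video understanding, designed to test whether models can sustain continuous tracking, information integration, and memory retention over long videos. It combines very long videos (109 minutes on average), diverse domains, and multi-level continuous certificates to measure model performance under different cognitive loads. The evaluation shows that current multimodal large language models still struggle with ultra-long-context reasoning, fine-grained perception, and non-verbal omni-modal understanding.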

详情
AI中文摘要

现实世界中的长视频理解要求模型在极端视频时长内对大量时间跨度进行连续跟踪、信息整合和记忆保持。掌握这种高强度的认知负荷构成了长视频理解的基本瓶颈。虽然现有基准通过增加视频时长推动了进展,但其评估任务通常仅需理解短且孤立的视频片段,未能捕捉超长上下文推理的挑战。为衡量这种认知负荷,我们强调连续证书长度,即人类必须连续观看以明确回答给定问题的视频长度。受此指标驱动,我们引入了VideoOdyssey,一个专门为超长上下文和全模态视频理解设计的基准。VideoOdyssey具有三个关键特征:1)极端的视频时长和多样性:涵盖11个领域和54个子类别,平均视频时长为109分钟;2)全面的评估场景:提供两个子集以应对不同的研究重点,即VideoOdyssey-V用于探测MLLMs的视觉理解极限,以及VideoOdyssey-AV用于评估全模态模型的同步音视频理解;3)超长且多级别的连续证书:将VideoOdyssey-V的平均连续证书延长至16分钟,VideoOdyssey-AV延长至12.8分钟。关键的是,我们设计了从秒到小时的5个粒度级别,提供了一个全面的诊断工具,用于评估模型在不同上下文长度和认知负荷下的表现。广泛评估表明,当前MLLMs的瓶颈不仅限于简单的检索,还包括在不同上下文长度下的连续推理、细粒度感知和非语言全模态理解方面的困难。

英文摘要

Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load constitutes the fundamental bottleneck in long video understanding. While existing benchmarks have driven progress by scaling up video duration, their evaluation tasks often require comprehending only short and isolated video segments, falling short of capturing the challenge of ultra-long-context reasoning. To measure this cognitive load, we emphasize continuous certificate length, defined as the video length a human must continuously watch to definitively answer a given question. Driven by this metric, we introduce VideoOdyssey, a benchmark specifically designed for ultra-long-context and omni-modal video understanding. VideoOdyssey is characterized by three key features: 1) Extreme video duration and diversity: spanning 11 domains and 54 subcategories with an average video duration of 109 minutes; 2) Comprehensive evaluation scenarios: offering two subsets to address different research focuses, i.e., VideoOdyssey-V for probing the limits of visual understanding in MLLMs, and VideoOdyssey-AV for evaluating synchronized audio-visual understanding for omni-modal models; 3) Ultra-long and multi-level continuous certificates: extending the average continuous certificate to 16 minutes for VideoOdyssey-V and 12.8 minutes for VideoOdyssey-AV. Crucially, we design 5 granular levels from seconds to hours, providing a comprehensive diagnostic tool to evaluate models across varying context lengths and cognitive loads. Extensive evaluations show that bottlenecks of current MLLMs extend beyond simple retrieval to include struggles with continuous reasoning across varying context lengths, fine-grained perception, and non-verbal omni-modal understanding.

2605.22905 2026-05-25 cs.AI cs.CL

EVE-Agent: Evidence-Verifiable Self-Evolving Agents

EVE-Agent: 证据可验证的自我进化智能体

Yamato Arai, Yuma Ichikawa

发表机构 * Fujitsu Limited(富士通株式会社) The University of Tokyo(东京大学) RIKEN center for AIP(理化学研究所AIP研究中心)

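AI summary This paper proposes EVE-Agent, an evidence-verifiable self-evolving agent, to address the risk that self-evolving search agents without verifiable evidence reward fluent but unsupported training examples. It modifies the proposer-solver framework so that each generated instance contains not only an answer but also a verbatim, source-grounded evidence span, which an evidence verifier rewards according to the marginal accuracy gain it provides. Experiments show that EVE-Agent substantially improves evidence-grounded correctness, and every training example it generates is auditable through its inspectable source span.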

Comments 23 pages, 2 figures

详情
AI中文摘要

自我进化智能体不应在其无法证明的示例上进行训练。无数据的自我进化搜索智能体提供了一种可扩展的途径,使系统能够生成自己的问题、回答问题,并从自身反馈中改进,而无需人工标注。然而,没有可验证的证据,这种循环可能会奖励流畅但无依据的示例,使自我生成的课程变成不透明且可能不可靠的训练信号。我们认为,证据可验证性是搜索智能体可信自我进化的先决条件:每个生成的实例不仅应包含答案,还应包含一个基于来源的文本片段,其对该答案的贡献可以被衡量。我们引入了EVE-Agent,一种证据可验证的自我进化智能体,通过对提议者-求解者框架的修改来实现这一原则。提议者生成一个问题、一个答案和一个逐字证据片段。然后,证据验证器根据提供证据时的边际准确率增益来奖励该片段。这产生了一个训练信号,倾向于真正有助于回答问题的证据,而不需要标准答案、人工标签或外部标注。EVE-Agent保持骨干模型、检索器、搜索工具和优化框架不变。实验表明,EVE-Agent在证据基础的准确性上显著优于先前的自我进化搜索智能体。由此产生的课程不仅是自我生成的,而且从结构上是可审计的:每个训练实例都带有一个可检查的来源片段,解释其为何值得信任。

英文摘要

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer-solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
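The verifier's reward can be read as a difference of two accuracies: how often the solver reaches the proposer's answer with the evidence span, minus how often it does so without it. The sketch below is one plausible reading, not the paper's formula: `evidence_reward`, the sample count, and the toy solver are all hypothetical.

```python
def evidence_reward(solve, question, answer, evidence, n_samples=8):
    """Marginal accuracy gain: accuracy with evidence minus accuracy without."""
    def accuracy(context):
        hits = sum(solve(question, context) == answer for _ in range(n_samples))
        return hits / n_samples

    return accuracy(evidence) - accuracy(None)


def toy_solve(question, context):
    """Stand-in solver: answers correctly only if the context names the answer."""
    if context and "Canberra" in context:
        return "Canberra"
    return "Sydney"


question = "What is the capital of Australia?"
useful = evidence_reward(toy_solve, question, "Canberra",
                         "Canberra has been the capital since 1913.")
irrelevant = evidence_reward(toy_solve, question, "Canberra",
                             "Australia is a large country.")
```

A span that does not change the solver's accuracy earns zero reward, which is how this kind of signal can discourage fluent but unsupported evidence. The proposer's own answer serves as the reference, so no oracle label is needed.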

2605.22903 2026-05-25 cs.CV cs.AI cs.CL

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

不看而见:视觉-语言基准测试真的测试视觉吗?

Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou

发表机构 * University of Chicago(芝加哥大学) Stony Brook University(石溪大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

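AI summary This study questions whether current vision-language model (VLM) benchmarks really measure how much models rely on visual evidence. By systematically analyzing the behavior of several open-source VLMs, it finds that although the models do use visual input, their predictions are less sensitive to the loss of fine-grained visual information than standard accuracy would suggest. A representation-level analysis shows that visual tokens become increasingly similar in deeper layers, which offers a possible explanation and suggests that existing benchmarks may not reliably evaluate fine-grained visual grounding.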

Comments Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. Accepted version.

详情
AI中文摘要

基准测试的准确性通常被隐含地视为反映了视觉-语言模型(VLM)中的基础视觉理解,但尚不清楚这些分数在多大程度上真正反映了对视觉证据的依赖。受一个令人惊讶的观察结果——在广泛使用的幻觉基准测试中,移除大量图像令牌仅轻微降低模型性能——的启发,我们在一组开源VLM中系统地研究了这种不匹配。我们的分析涵盖多个粒度级别,包括全局视觉退化、局部遮挡、问题重述、答案空间扩展以及超出标准准确率的决策级分析。我们进一步用视觉令牌几何的逐层分析补充这些行为结果。在整个实验中,我们发现尽管VLM确实整合了视觉输入,但其预测对细粒度视觉证据丢失的敏感性低于标准准确率所暗示的程度。即使最终预测保持不变,模型对正确答案的内部支持可能已经减弱。我们还补充了表示级分析,显示深层中视觉令牌之间的相似性增加,这为我们的发现提供了一个可能的解释。总之,这些结果表明,当前的基准测试不足以可靠地评估VLM中的细粒度视觉基础。

英文摘要

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by the surprising observation that removing a substantial fraction of image tokens degrades model performance only slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, including global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. We complement these behavioral results with a layer-wise, representation-level analysis of vision-token geometry, which shows increasing similarity among visual tokens in deeper layers and provides a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.

2605.22902 2026-05-25 cs.LG cs.AI cs.CL

Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

Transcoders 追踪视觉语言模型中的视觉基础与幻觉

Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos

发表机构 * Institute of Language and Speech Processing(语言与语音处理研究所) Athena Research Center(雅典研究中心)

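AI summary This study examines how visual inputs are transformed into text in generative vision-language models (VLMs). It adopts a function-centric interpretability framework based on Transcoders that decomposes the model into computational pathways linking image patches to token generation. Compared with attributions from Sparse Autoencoders (SAEs), Transcoder attributions produce stronger and more stable effects under patch ablation and align better with semantically relevant image regions. A structural analysis of hallucinated generations yields graph features from which a logistic classifier predicts hallucinations at an AUC of 0.68.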

详情
AI中文摘要

生成式视觉语言模型(VLM)在多模态推理上表现良好,但视觉输入如何转化为文本仍知之甚少。现有的VLM可解释性工作使用稀疏自编码器(SAE),其分解静态残差表示,忽略了驱动跨模态交互的功能更新。我们采用基于Transcoders的功能中心框架,Transcoders是MLP子层的稀疏近似,作为逐层计算的因果代理。应用于Gemma 3-4B-IT,该框架将模型分解为可解释的计算路径,连接图像块到文本生成中的方向。在补丁消融下,Transcoder归因对视觉基础标记产生比SAE归因更强且更稳定的效果,并与语义相关的图像区域更好对齐。假视觉基础反事实分析证实恢复的路径是视觉-语言交互特有的。最后,我们对幻觉生成进行结构分析,从Transcoder产生的电路痕迹中提取基于图的指标。基于这些机制图特征的逻辑分类器以AUC 0.68预测幻觉。这些结果表明,功能中心的电路分解为VLM中的多模态计算提供了可解释且可预测的描述。

英文摘要

Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed into text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language interaction. Finally, we perform a structural analysis of hallucinated generations by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations at an AUC of 0.68. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.

2605.22900 2026-05-25 cs.AI cs.LO quant-ph

Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions

中介模糊逻辑:从类型-1基础到类型-2、类型-3和量子扩展

Oscar Montiel Ross

发表机构 * Instituto Politécnico Nacional - CITEDI(墨西哥国家理工学院- CITEDI)

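AI summary This article develops a unified logical account of Mediative Fuzzy Logic, a scheme for reconciling hesitant or conflicting assessments in fuzzy control and decision-making. Starting from the type-1 core, it covers interval type-2, granular type-3, and quantum extensions, characterizing the mediative operator and modelling mediative truth values as truth-falsity pairs in a continuous bilattice-like structure. Beyond establishing the semantic foundations and metalogical properties of the system, it illustrates the framework on an autonomous-braking sensor-fusion example.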

Comments 30 pages, 1 figure

详情
AI中文摘要

中介模糊逻辑最初被设想为一种实用方案,用于协调模糊控制和决策中的犹豫或冲突评估。然而,其逻辑和语义基础仍然不完善,尤其是在操作性的类型-1设置之外。本文发展了类型-1核心以及区间类型-2、粒状类型-3和量子扩展的统一描述。我们将中介算子刻画为由犹豫和矛盾控制的凸聚合,将中介真值建模为连续双格结构中的独立真-假对,并引入一个命题系统,通过中介连接词扩展标准的t-范数模糊逻辑。我们证明了对于无中介的公式,该系统相对于底层模糊基的可靠性、次协调性和保守性,并制定了区间类型-2真值、粒索引局部评估以及希尔伯特空间上的效应和密度算子的一致语义扩展。一个自动制动传感器融合示例说明了该框架如何支持在信息不完全、异构和轻度矛盾的证据下做出透明、保守且安全优先的决策。在适当假设下,高级公式简化为类型-1情况,澄清了各层级间的一致性,并为智能决策系统的未来工作提供了可靠支持。

英文摘要

Mediative Fuzzy Logic was conceived as a practical scheme for reconciling hesitant or conflicting assessments in fuzzy control and decision-making. However, its logical and semantic foundations remain underdeveloped, especially beyond operational type-1 settings. This article develops a unified account of the type-1 core together with interval type-2, granular type-3, and quantum extensions. We characterize the mediative operator as a convex aggregation controlled by hesitation and contradiction, model mediative truth values as independent truth-falsity pairs in a continuous bilattice-like structure, and introduce a propositional system extending a standard t-norm-based fuzzy logic with a mediative connective. We establish soundness, paraconsistency, and conservativity over the underlying fuzzy base for formulas without mediation, and formulate coherent semantic extensions to interval type-2 truth values, granule-indexed local evaluations, and effects and density operators on Hilbert spaces. An autonomous-braking sensor-fusion example illustrates how the framework supports transparent, conservative, and safety-first decisions under incomplete, heterogeneous, and mildly contradictory evidence. Under suitable assumptions, the higher-level formulations reduce to the type-1 case, clarifying coherence across levels and reliably supporting future work in intelligent decision systems.

2605.22898 2026-05-25 cs.LG

FIRMA: FIbonacci Ring Model Aggregation for Privacy-preserving Federated Learning

FIRMA: 斐波那契环模型聚合用于隐私保护联邦学习

Rachid Hedjam

发表机构 * Bishop’s University(比什大学)

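AI summary This paper proposes FIRMA, a family of privacy-preserving federated learning protocols that targets the tension between server-free operation, privacy, and effective model aggregation. FIRMA aggregates over a ring topology with Fibonacci-weighted asymmetric neighbour blending and permanently private classification heads, and its enhanced variants add accuracy-gated neighbour suppression and an optimized ring permutation. Experiments show that the full system surpasses federated averaging in all 12 label-skew configurations and achieves the highest accuracy in 17 of 28 configurations.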

详情
AI中文摘要

联邦学习协议面临结构性三难困境:规范的基于服务器的聚合~\cite{mcmahan2017} 产生单点故障和梯度反演风险;去中心化的环-八卦替代方案~\cite{hu2019segmented} 通过无信息的均匀权重将分类头暴露给半诚实的对等节点;个性化方法~\cite{collins2021exploiting} 重新引入中心聚合。现有协议无法同时实现无服务器操作、永久私有分类头、环拓扑和原则性的非对称邻居加权。我们提出FIRMA(\textbf{FI}bonacci \textbf{R}ing \textbf{M}odel \textbf{A}ggregation),一个包含三种逐步增强的联邦学习协议系列:1) \fibfl\ 建立基础:无服务器环聚合,采用斐波那契加权的邻居混合和永久私有的分类头。2) \fibflp\ 在此基础上增加精度门控邻居抑制,选择性降低收敛不良的对等节点权重,同时保留斐波那契方向偏差。3) \fibflpp,完整系统,通过2-opt环置换最大化相邻客户端的类别多样性,通过$K_g{=}\lceil N/2\rceil$次八卦传递实现全局环覆盖,以及余弦退火自保留校准,完成该系列。我们建立了一个收敛速率界和三个支持命题,涉及归一化、覆盖、保留和多样性最优性。在28种配置(四个基准与七种异构性制度交叉)上的系统实验表明,\fibflpp\ 在所有12种标签偏斜配置中均优于 \fedavg\,在CIFAR-10上$K{=}1$时峰值优势达$+20.7$个百分点。在Dirichlet异构性下,\fibflpp\ 是所有无服务器协议中的帕累托主导方法,在28种配置中的17种中实现了最高精度。

英文摘要

Federated learning protocols face a structural trilemma: canonical server-based aggregation~\cite{mcmahan2017} creates a single point of failure and gradient inversion risk; decentralised ring-gossip alternatives~\cite{hu2019segmented} expose classification heads to semi-honest peers via uninformed uniform weights; and personalised methods~\cite{collins2021exploiting} reintroduce central aggregation. No existing protocol simultaneously achieves server-free operation, permanently private heads, ring topology, and principled asymmetric neighbour weighting. We propose FIRMA (\textbf{FI}bonacci \textbf{R}ing \textbf{M}odel \textbf{A}ggregation), a family of three progressively enhanced federated learning protocols: 1) \fibfl\ establishes the foundation: server-free ring aggregation with Fibonacci-weighted neighbour blending and permanently private classification heads. 2) \fibflp\ augments this with accuracy-gated neighbour suppression, selectively down-weighting poorly-converged peers while preserving the Fibonacci directional bias. 3) \fibflpp, the full system, completes the family with a 2-opt ring permutation that maximises adjacent-client class diversity, global ring coverage via $K_g{=}\lceil N/2\rceil$ gossip passes, and cosine-annealed self-retention calibration. We establish a convergence rate bound and three supporting propositions governing normalisation, coverage, retention, and diversity optimality. Systematic experiments across 28 configurations -- four benchmarks crossed with seven heterogeneity regimes -- demonstrate that \fibflpp\ surpasses \fedavg\ in all 12 label-skew configurations, with a peak advantage of $+20.7$\,pp on CIFAR-10 at $K{=}1$. Under Dirichlet heterogeneity, \fibflpp\ is the Pareto-dominant method among all server-free protocols, achieving the highest accuracy in 17 of 28 configurations.
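The abstract does not spell out the weighting rule, so the following is only a guess at what "Fibonacci-weighted neighbour blending" on a ring could look like. It assumes each client mixes its own backbone parameters with those of its predecessors on the ring, with normalised Fibonacci weights that favour the nearest neighbour. The function names, the `hops` and `self_weight` parameters, and the direction of the bias are all hypothetical. Classification heads are simply never passed in, mirroring the "permanently private heads" idea.

```python
def fibonacci(k):
    """First k Fibonacci numbers: 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    out = []
    for _ in range(k):
        out.append(a)
        a, b = b, a + b
    return out


def ring_blend(backbones, client, hops=3, self_weight=0.5):
    """Blend one client's backbone with its ring predecessors.

    Assumption: the nearest predecessor gets the largest Fibonacci weight,
    and neighbour weights are normalised to sum to (1 - self_weight).
    """
    n = len(backbones)
    fib = fibonacci(hops)[::-1]  # e.g. [2, 1, 1] for hops=3
    total = sum(fib)
    blended = [self_weight * x for x in backbones[client]]
    for hop, f in enumerate(fib, start=1):
        neighbour = backbones[(client - hop) % n]  # wrap around the ring
        weight = (1.0 - self_weight) * f / total
        for j, value in enumerate(neighbour):
            blended[j] += weight * value
    return blended


# Five clients, each holding a one-parameter "backbone" for illustration.
backbones = [[0.0], [1.0], [2.0], [3.0], [4.0]]
mixed = ring_blend(backbones, client=0)
```

Because the weights sum to one, the blend is a convex combination. A ring of identical backbones is therefore left unchanged, which corresponds to the normalisation property the abstract mentions.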

2605.22897 2026-05-25 cs.LG

From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

从残差到原因:基于LLM的表格数据机制推断

Mohammad R. Rezaei, Rahul G. Krishnan

发表机构 * Department of Computer Science(计算机科学系) University of Toronto(多伦多大学) Vector Institute(向量研究所)

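AI summary This study addresses the tension between prediction and explanation in machine learning for scientific applications and proposes MARICL, an LLM-based mechanism-inference framework. LLM agents analyze the residuals of a base model, hypothesize the structure the model is missing, and produce explicit correction terms refined through multi-turn textual gradient optimization. MARICL improves over its base model on all nine benchmarks, which span scientific, biomedical, socioeconomic, and synthetic settings, and formulas frozen on one experimental batch transfer to held-out batches within the same reagent protocol, which the authors read as evidence of mechanistic generalization.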

详情
AI中文摘要

机器学习在科学应用中的一个持续挑战是同时实现预测和理解。统计模型在结构化数据上表现出色,但作为黑箱运行,而现有的可解释性方法主要是审视性的:它们回答“哪些特征重要?”,但不阐明特征如何交互或随着人类理解迭代地细化解释。要求LLM直接预测目标会迫使其搜索整个输出空间;我们转而用基础模型锚定预测,并让LLM回答该模型遗漏了什么这一更窄的问题。我们引入了多智能体残差上下文学习(MARICL),这是一个智能体框架,其中LLM智能体分析基础模型失败的地方,从上下文中提供的高残差示例中假设缺失的结构,并产生通过多轮文本梯度优化精炼的显式修正项。在涵盖科学、生物医学、社会经济和合成设置的九个基准测试中,MARICL在所有数据集上一致优于其基础模型。为了测试这些修正是反映真实结构还是批次特定噪声,我们冻结了在无细胞蛋白质数据集的一个实验批次上学习的公式,并将其应用于(无需重新训练且无需进一步LLM调用)保留批次。在同一试剂协议内,冻结公式在超过92%的情况下改善了预测;在不同协议下,它们系统性地失败。成功边界与生物化学一致,而非批次数量;这是机制泛化的直接证据。

英文摘要

A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count, which is direct evidence of mechanistic generalization.

2605.22896 2026-05-25 cs.RO cs.AI cs.LG

Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

Agentic-VLA:视觉-语言-动作模型的高效在线自适应

Ruofan Jin, Zaixi Zhang

发表机构 * Ruofan Jin(金鲁凡) Zaixi Zhang(张在西)

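AI summary This paper proposes Agentic-VLA, a training framework for making online adaptation of vision-language-action (VLA) models more efficient in robotic manipulation. Its three core components are adaptive reward synthesis, language-guided exploration, and experience memory, which target the poor generalization to novel environments and low training efficiency of existing VLA training methods. On the LIBERO and RoboTwin 2.0 benchmarks, the authors report higher task success and faster convergence than existing online adaptation methods.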

Comments Total 15 pages

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过利用预训练的视觉-语言表示,已成为机器人操作领域的一种有前景的范式。然而,当前的VLA训练方法存在两个关键局限性:对新环境的泛化能力差,以及需要大量演示数据导致的训练效率低下。我们提出Agentic-VLA,一种智能训练框架,通过三项关键创新使VLA能够在线高效自适应:(1)自适应奖励合成,根据VLA当前能力和任务复杂度动态生成并调整奖励函数,将复杂任务分解为可学习的子目标以进行课程学习;(2)语言引导探索,其中评论模型提供结构化指导以实现系统化探索,而非随机采样;(3)经验记忆,存储和检索与任务相关的策略权重,用于相似任务的预热启动自适应。我们在LIBERO基准上评估Agentic-VLA,取得了显著改进:长时域任务提升12.3%,单样本学习提升28.5%,并在无需任务特定演示的情况下实现从0%到31.2%的跨任务迁移。与现有在线自适应方法相比,我们的框架还实现了2.4倍的收敛速度提升。除LIBERO外,Agentic-VLA在双臂RoboTwin 2.0基准(包括其随机困难设置)上仍保持优势。这些结果使Agentic-VLA成为迈向真正自适应、可在部署中持续学习的VLA系统的重要一步。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language representations. However, current VLA training methods suffer from two critical limitations: poor generalization to novel environments and low training efficiency requiring extensive demonstrations. We introduce Agentic-VLA, an agentic training framework that enables VLAs to efficiently adapt online through three key innovations: (1) Adaptive Reward Synthesis, which dynamically generates and adjusts reward functions based on the VLA's current capabilities and task complexity, decomposing complex tasks into learnable sub-goals for curriculum learning; (2) Language-Guided Exploration, where a critic model provides structured guidance for systematic exploration rather than random sampling; and (3) Experience Memory, which stores and retrieves task-relevant policy weights for warm-starting adaptation to similar tasks. We evaluate Agentic-VLA on the LIBERO benchmark, achieving substantial improvements: +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and enabling cross-task transfer from 0% to 31.2% without task-specific demonstrations. Our framework also demonstrates 2.4x faster convergence compared to existing online adaptation methods. Beyond LIBERO, Agentic-VLA retains its advantage on the dual-arm RoboTwin 2.0 benchmark, including under its randomized Hard setting. These results establish Agentic-VLA as a significant step toward truly adaptive VLA systems capable of continuous learning in deployment.

2605.22891 2026-05-25 cs.LG hep-ex

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

逐点度量误导:多模态逆问题的评估协议

Mads H. Baattrup, Jörn Bach, Laurids Jeppe, Finn Labe, Alexander Grohsjean, Christian Schwanenberger, Peer Stelldinger

Affiliations: Deutsches Elektronen-Synchrotron; CERN; University of Hamburg; HAW Hamburg

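AI summary This paper argues that, for inverse problems with multimodal posteriors, conventional pointwise metrics such as RMSE and MAE mislead the evaluation of scientific reconstructions because they do not reflect the structure of the posterior distribution. It proposes a three-part evaluation protocol covering distributional accuracy, spectrum fidelity, and uncertainty calibration. Experiments on a synthetic benchmark and a realistic particle-physics problem show that the protocol separates models effectively and exposes scientifically important features that pointwise metrics miss.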

Comments 29 pages, 9 figures, and 8 tables (including appendix)

Details
AI Chinese abstract

科学重建中的评估以逐点度量为主——RMSE、MAE、每事件分辨率——隐含假设误差越小重建越好。我们表明,对于具有多模态后验的逆问题,这一假设在结构上失败。根据总方差定律,当后验具有非零宽度时,训练以最小化MSE或MAE的点估计器产生的边际谱严格窄于真实值。由此产生的偏差独立于架构、训练和数据集大小,并且精确压缩了下游科学测量所依赖的谱特征——尾部、模态、形状。我们提出一个三部分评估协议,其中每一步针对其他步骤遗漏的失败模式:通过CRPS的每事件分布准确性、通过谱保真度诊断的总体边际准确性、以及通过基于覆盖的校准的不确定性可信度。在具有解析后验的合成基准和来自粒子物理的现实多对一逆问题上,模型排名在逐点度量和分布度量之间发生逆转,而校准进一步区分了在CRPS下无法区分的架构。决定科学结论的是评估协议,而非模型。

English abstract

Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails structurally for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. The resulting bias is independent of architecture, training, and dataset size, and it compresses precisely the spectral features - tails, modes, shapes - that downstream scientific measurements rely on. We propose a three-part evaluation protocol where each step targets a failure mode the others miss: per-event distributional accuracy via CRPS, population-level marginal accuracy via a spectrum-fidelity diagnostic, and uncertainty trustworthiness via coverage-based calibration. On a synthetic benchmark with an analytic posterior and on a realistic many-to-one inverse problem from particle physics, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures indistinguishable under CRPS. The evaluation protocol, not the model, determines the scientific conclusion.
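
The variance argument can be checked numerically. The sketch below is an illustrative toy, not the authors' benchmark: it assumes a hypothetical two-mode posterior in which, given an observation y, the truth is x = y - d or x = y + d with equal probability, so the MSE-optimal point estimate is E[x | y] = y. By the law of total variance, Var(x) = Var(E[x | y]) + E[Var(x | y)], so the point estimates should have a marginal variance smaller than the truth by d². The function names (`simulate`, `rmse`, `crps_ensemble`) and all parameter values are assumptions made for illustration.

```python
import numpy as np


def simulate(n=100_000, d=2.0, seed=0):
    """Toy inverse problem with a two-mode posterior (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=n)                            # observation
    x = y + d * rng.choice([-1.0, 1.0], size=n)       # truth: mode y - d or y + d
    x_point = y                                       # posterior mean E[x | y], the MSE minimizer
    x_draw = y + d * rng.choice([-1.0, 1.0], size=n)  # one random draw from p(x | y)
    modes = np.stack([y - d, y + d], axis=1)          # full posterior as a 2-member ensemble
    return x, x_point, x_draw, modes


def rmse(estimate, truth):
    """Root-mean-square error between two 1-D arrays."""
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))


def crps_ensemble(members, truth):
    """Mean CRPS of equally weighted ensembles: E|X - x| - 0.5 * E|X - X'|.

    members has shape (n, m): m ensemble members per event; truth has shape (n,).
    """
    accuracy = np.abs(members - truth[:, None]).mean(axis=1)
    spread = np.abs(members[:, :, None] - members[:, None, :]).mean(axis=(1, 2))
    return float(np.mean(accuracy - 0.5 * spread))


x, x_point, x_draw, modes = simulate()
print("Var(truth)          :", np.var(x))
print("Var(point estimate) :", np.var(x_point))
print("RMSE point / draw   :", rmse(x_point, x), rmse(x_draw, x))
print("CRPS point / modes  :", crps_ensemble(x_point[:, None], x), crps_ensemble(modes, x))
```

Under these assumptions the point estimate has the lower RMSE (d, versus roughly √2·d for a posterior draw) but the higher CRPS (d, versus d/2 for the two-mode ensemble), and its marginal spectrum is narrower than the truth. This is the kind of ranking reversal the abstract describes, reproduced here only for a toy case.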

2605.22890 2026-05-25 cs.RO cs.CV

Extending Deep Event Visual Odometry with Sparse Point-Cloud Export

基于稀疏点云导出的深度事件视觉里程计扩展

Alireza Safdari, Sajad Ashraf

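AI summary This work addresses event-camera visual odometry under high-speed motion and challenging lighting by extending Deep Event Visual Odometry (DEVO) with a sparse point-cloud export module. The module extracts the 3D structure that DEVO already estimates internally and converts it into an explicit point cloud for visualization and further processing, leaving the original odometry pipeline unchanged. Experiments show that the exported sparse cloud is locally consistent with EMVS reconstructions and reaches high precision at a 5 cm threshold, while remaining limited in density and completeness and sensitive to accumulated odometry noise.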

Comments 9 pages, 4 figures, 5 tables

Details
AI Chinese abstract

事件相机因其低延迟、高时间分辨率和高动态范围,非常适合高速运动和挑战性光照条件下的视觉里程计。深度事件视觉里程计(DEVO)通过结合稀疏块跟踪、学习块选择、循环对应精化和可微束调整,证明了单目纯事件里程计能够实现强性能。在本项目中,我们通过稀疏点云导出管线扩展了DEVO。我们的方法不修改核心里程计公式,而是暴露DEVO已估计的内部3D结构,并将其转换为显式点云表示,用于可视化和进一步处理。此外,我们实现了一个实用的工作流程,用于数据导出、格式转换和点云清理。最终系统保留了原始视觉里程计管线,同时支持稀疏几何场景输出。在BOARD SLOW序列上的实验表明,导出的稀疏点云与EMVS重建在局部一致,在5厘米阈值下达到高精度,同时也突出了在密度、完整性和对累积里程计噪声敏感性方面的预期局限性。

English abstract

Event cameras are well suited for visual odometry under high-speed motion and challenging lighting conditions due to their low latency, high temporal resolution, and high dynamic range. Deep Event Visual Odometry (DEVO) demonstrated that monocular event-only odometry can achieve strong performance by combining sparse patch tracking, learned patch selection, recurrent correspondence refinement, and differentiable bundle adjustment. In this project, we extend DEVO with a sparse point-cloud export pipeline. Rather than modifying the core odometry formulation, our approach exposes the internal 3D structure already estimated by DEVO and converts it into an explicit point-cloud representation for visualization and further processing. In addition, we implement a practical workflow for data export, format conversion, and point-cloud cleanup. The resulting system preserves the original visual odometry pipeline while enabling sparse geometric scene output. Experiments on the BOARD SLOW sequence show that the exported sparse cloud is locally consistent with EMVS reconstructions, achieving high precision at a 5 cm threshold, while also highlighting the expected limitations in density, completeness, and sensitivity to accumulated odometry noise.
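
The abstract does not spell out how precision is computed, so the following is only a sketch under a common convention in reconstruction evaluation: precision is the fraction of exported points lying within a distance threshold of their nearest reference point, and completeness is the same measure in the opposite direction. The function names, the brute-force nearest-neighbor search, and the toy coordinates are assumptions for illustration; units are taken to be meters, so 5 cm is `tau=0.05`.

```python
import numpy as np


def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]  # (N, M, 3) pairwise offsets
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def precision(exported, reference, tau=0.05):
    """Fraction of exported points within tau of the reference cloud."""
    return float(np.mean(nearest_distances(exported, reference) <= tau))


def completeness(exported, reference, tau=0.05):
    """Fraction of reference points within tau of the exported cloud."""
    return float(np.mean(nearest_distances(reference, exported) <= tau))


# Toy clouds (hypothetical coordinates, in meters).
exported = np.array([[0.00, 0.0, 0.0], [0.03, 0.0, 0.0], [1.00, 0.0, 0.0]])
reference = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
print(precision(exported, reference))     # 2 of 3 exported points are within 5 cm
print(completeness(exported, reference))  # 1 of 2 reference points is covered
```

A sparse cloud can score well on precision while scoring poorly on completeness, which matches the limitations in density and completeness that the abstract mentions. The brute-force search needs memory proportional to N·M; for real clouds a k-d tree would be the usual choice.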

2605.22889 2026-05-25 cs.RO

Remote Teleoperation of Endovascular Intervention Robots: A Systematic Review

血管内介入机器人的远程遥操作:系统综述

Xingyu Chen, Yinchao Yang, Nikola Fischer, Harry Robertshaw, Benjamin Jackson, Mohammad Shikh-Bahaei, Christos Bergeles, Thomas C Booth

Affiliations: School of Biomedical Engineering & Imaging Sciences, King’s College London; Department of Engineering, King’s College London; Department of Neuroradiology, King’s College Hospital

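AI summary This systematic review examines teleoperated robotic systems for endovascular intervention, assessing technical feasibility, communication infrastructure, and clinical outcomes. It finds that teleoperated catheters and guidewires driven by mechanical or electromagnetic systems can be navigated over distances of up to 7000 km, with network latency staying within clinically acceptable limits (30-163 ms) when the communication infrastructure is robust. Small-scale human trials reported 100% procedural success, but most evidence comes from animal or phantom studies; research in low- and middle-income countries and multi-center clinical trials are still needed to establish safety and broad applicability.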

Comments The manuscript has been submitted to IEEE Transactions on Medical Robotics and Bionics

Details
AI Chinese abstract

远程机器人辅助血管内介入提供了一种有前景的方法,可减少临床医生的辐射暴露和身体劳损,同时将专业血管护理扩展到地理偏远地区。尽管取得了进展,但遥操作血管内介入仍未被充分探索,特别是对于急性卒中的机械取栓等时间敏感型介入。本综述旨在确定关于遥操作血管内介入机器人的证据,涵盖技术可行性、通信基础设施和临床结局。综述进一步确定了研究空白和未来方向。遵循PRISMA指南,从2501条初始搜索结果中纳入了16项符合纳入标准的研究。我们发现,由机械或电磁系统驱动的遥操作导管和导丝可在长达7000公里的距离内导航。凭借稳健的通信基础设施,网络延迟保持在临床可接受的范围内(30-163毫秒)。尽管初步结果在小规模人体试验中显示了100%的手术成功率,但大多数证据来自动物或体模模型。总体而言,研究结果表明,遥操作血管内介入可以减少职业危害,扩大患者获得紧急手术的机会,并优化资源配置。未来应在低收入和中等收入国家进行研究,以展示更广泛的地理可及性。最终,需要多中心临床试验来验证其在多样化临床环境中的安全性、有效性和泛化性。

English abstract

Remote robotic-assisted endovascular intervention offers a promising approach to reduce clinician radiation exposure and physical strain, while extending specialized vascular care to geographically distant regions. Despite advancements, teleoperated endovascular intervention remains underexplored, especially for time-sensitive interventions like mechanical thrombectomy for acute stroke. The aim of the current review was to determine the evidence regarding teleoperated endovascular robotic systems, covering technical feasibility, communication infrastructure, and clinical outcomes. The review further identified research gaps and future directions. Following PRISMA guidelines, 16 studies were included that met the inclusion criteria out of 2501 initial search results. We found that teleoperated catheters and guidewires, driven by mechanical or electromagnetic systems, can be navigated across distances up to 7000 km. With robust communication infrastructure, network latency remained within clinically acceptable limits (30-163 ms). Although initial outcomes highlighted 100% procedural success in small-scale human trials, most evidence stemmed from animal or phantom models. Overall, the findings suggest that teleoperated endovascular intervention can reduce occupational hazards, expand patient access to urgent procedures, and optimize resource allocation. Future research should be conducted in low and middle income countries to demonstrate broader geographical access. Ultimately, multi-center clinical trials are required to validate the safety, efficacy, and generalization in diverse clinical settings.