arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.18583 2026-06-18 cs.CV cs.RO New submission

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

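Affiliations: University of Calgary; Nanchang University; Nanyang Technological University; Wuhan University

AI summary: Proposes an aerial-ground LiDAR place recognition framework that narrows the domain gap through multi-scale patch-level self-supervised learning and reduces false positives with an expanded reciprocal re-ranking algorithm, substantially improving retrieval accuracy on multiple datasets.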

English abstract

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm that exploits neighborhood information maximally and refines each feature based on its neighbor features; the refined features are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on the CS-Urban-Scenes dataset, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.
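The Recall@1 and Recall@1% figures quoted in this abstract are retrieval metrics. As a minimal sketch of how such metrics are commonly computed (not the authors' evaluation code), the following assumes each query comes with a ranked list of database indices and a set of true matches; the function names and the rounding rule for the 1% cutoff are illustrative assumptions.

```python
def recall_at_k(ranked_lists, positives, k):
    """Fraction of queries with at least one true match among the top-k results."""
    hits = 0
    for ranked, truth in zip(ranked_lists, positives):
        if any(item in truth for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_lists)


def one_percent_cutoff(database_size):
    """Assumed convention: k is 1% of the database size, and never below 1."""
    return max(1, round(database_size / 100))


# Toy data: three queries, each with a ranked list of database indices.
ranked_lists = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
positives = [{3}, {1}, {0, 1}]

recall_1 = recall_at_k(ranked_lists, positives, k=1)
recall_2 = recall_at_k(ranked_lists, positives, k=2)
```

Only the first query has a true match at rank 1, while the third gains one at rank 2, so the recall rises as k grows.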

2606.18582 2026-06-18 cs.CV cs.RO eess.IV New submission

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

Jaeil Park, Hyobin Choi, Sangjin Lee, Hyungtae Lim, Sung-Hoon Yoon

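Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST); Massachusetts Institute of Technology (MIT)

AI summary: Proposes a network design that combines a self-supervised DINOv3 backbone, a ViT-Adapter, and a Mask2Former decoder with an inference strategy of multi-scale test-time augmentation and model ensembling, taking first place in the 64-class fine-grained off-road semantic segmentation challenge with a composite score of 76.57%.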

Comments 5 pages, 4 figures

English abstract

The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.
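Horizontal-flip test-time augmentation, mentioned in this abstract, can be illustrated with a small framework-free sketch: predict on the image and on its mirror, mirror the second prediction back, and average. The `toy_predict` function is a hypothetical stand-in for a segmentation model and is not part of the submission. The last line checks an inference from the reported numbers, namely that the composite score is consistent with the arithmetic mean of the two mIoU values; the abstract does not state the formula.

```python
def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def flip_tta(predict, image):
    """Average the direct prediction with the un-mirrored prediction of the mirrored image."""
    direct = predict(image)
    mirrored_back = hflip(predict(hflip(image)))
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(direct, mirrored_back)
    ]


def toy_predict(image):
    """Hypothetical model with a position bias: the score grows with the column index."""
    return [[pixel + 0.1 * col for col, pixel in enumerate(row)] for row in image]


averaged = flip_tta(toy_predict, [[1, 2], [3, 4]])

# Assumption: composite = mean of fine-class mIoU and category-level mIoU.
composite = (69.32 + 83.81) / 2
```

In the toy case the averaging spreads the model's left-right bias evenly across both columns, which is the kind of asymmetry flip augmentation is meant to cancel.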

2606.18571 2026-06-18 cs.LG cs.CL cs.SD eess.AS New submission

Fair Cognitive Impairment Detection Through Unlearning

William Nguyen, Jiali Cheng, Hadi Amiri

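Affiliations: University of Massachusetts Lowell, USA

AI summary: Proposes a multimodal framework that combines cross-modal fusion with gradient-reversal unlearning to reduce demographic bias in mild cognitive impairment detection, narrowing performance gaps on cross-lingual datasets.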

Comments Interspeech 2026

English abstract

Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.
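Gradient reversal, the unlearning mechanism named in this abstract, is usually described as a layer that acts as the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass. The sketch below shows only that behavior, without an autodiff framework; the class name, the `strength` parameter, and the toy vectors are assumptions for illustration and are not taken from the paper.

```python
class GradientReversal:
    """Identity on the way forward; flips and scales gradients on the way back."""

    def __init__(self, strength=1.0):
        self.strength = strength  # hypothetical trade-off weight

    def forward(self, features):
        # The demographic classifier head sees the shared embedding unchanged.
        return list(features)

    def backward(self, grad_from_demographic_head):
        # The encoder receives the opposite gradient, so it is pushed to make
        # the demographic attribute harder to predict from the embedding.
        return [-self.strength * g for g in grad_from_demographic_head]


layer = GradientReversal(strength=0.5)
passed = layer.forward([0.2, -1.0, 3.0])
grad_to_encoder = layer.backward([1.0, -2.0, 0.0])
```

The task head for MCI classification would bypass this layer, so only the demographic branch sends a reversed signal to the shared encoder.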

2606.18566 2026-06-18 cs.CV cs.AI cs.GR New submission

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

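Affiliations: School of Computer Science and Technology, Soochow University

AI summary: To address crowd counting in low-light environments, builds three new benchmark datasets and proposes a multi-modal hyper-graph fusion module and a deformable rectangular sparse attention module, which together form the low-light counting network LCNet and achieve the best performance on all three benchmarks.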

English abstract

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

2606.18565 2026-06-18 cs.CV eess.SP New submission

Experimental Analysis of Neural Network-Based Image Classification on the CIFAR-10 Dataset

Necati Kagan Erkek, Emre Balci, Berkin Halay

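Affiliations: Department of Electronics and Communication Engineering, Istanbul Technical University

AI summary: Runs fully connected and convolutional networks on CIFAR-10 to analyze the complete learning pipeline; a six-layer convolutional network reaches about 74.77% validation accuracy after 10 epochs, illustrating the difference between representation learning and memorization.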

Comments 7 pages

English abstract

An experimental investigation of neural image classification on the CIFAR-10 benchmark is presented through fully connected and convolutional network formulations. The analysis emphasizes the complete learning pipeline: image vectorization, normalization, one-hot class encoding, supervised loss minimization, learning-rate selection, mini-batch training, convolutional feature extraction, max-pooling, and validation-based generalization assessment. A convolutional architecture with six convolutional layers and three max-pooling stages is evaluated for ten training epochs using a batch size of 128 and an Adam optimizer with a learning rate of 0.001. The validation accuracy reaches approximately 74.77%, while the validation loss begins to increase after the middle of training despite continued reduction in training loss. The resulting behavior illustrates the practical difference between representation learning and memorization, and it provides a compact experimental baseline for future studies on regularization, data augmentation, deeper architectures, and reproducible image-classification education.
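A few of the pipeline steps listed in this abstract reduce to simple arithmetic, sketched below. CIFAR-10 images are 32x32 pixels with 3 color channels and 10 classes; the pooling window of 2 with stride 2 and size-preserving convolutions are assumptions, since the abstract does not give those details.

```python
IMAGE_SIDE, CHANNELS, NUM_CLASSES = 32, 3, 10  # CIFAR-10 image and label dimensions


def one_hot(label, num_classes=NUM_CLASSES):
    """Encode an integer class label as a one-hot vector."""
    return [1 if i == label else 0 for i in range(num_classes)]


def side_after_pooling(side, stages, window=2):
    """Spatial side length after repeated max-pooling (assumed window = stride)."""
    for _ in range(stages):
        side //= window
    return side


# Image vectorization for a fully connected network: one long input vector.
vector_length = IMAGE_SIDE * IMAGE_SIDE * CHANNELS

# Three max-pooling stages, assuming the convolutions keep the spatial size.
final_side = side_after_pooling(IMAGE_SIDE, stages=3)

encoded = one_hot(3)
```

Under these assumptions the fully connected input has 3072 values, and the convolutional feature maps shrink from 32x32 to 4x4 before classification.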

2606.18564 2026-06-18 cs.SD eess.SP New submission

Reference-Based Recursive Least-Squares Mitigation of Real Interference in Stereo Audio Recordings

Necati Kagan Erkek, Y. Ugur Ozcan

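Affiliations: Telecommunications Engineering, Department of Electronics, Information

AI summary: For stereo audio contaminated by real train noise and environmental background, applies a multi-reference recursive least-squares (RLS) estimator for adaptive interference cancellation: the interference component is estimated from reference signals and subtracted, followed by a low-pass postfilter, reducing the reference correlation by 30.6 to 34.1 dB.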

Comments 7 pages

English abstract

Reference-based adaptive interference cancellation is evaluated for stereo audio recordings corrupted by real train noise and environmental background. The observed signal is modeled as a clean stereo program contaminated by an additive disturbance generated by an external acoustic source through unknown propagation paths. A second stereo recording, representing another filtered observation of the same physical noise source, is used as the reference input of a multi-reference recursive least-squares (RLS) estimator. The estimated train-interference component is subtracted from the noisy audio and followed by a finite-impulse-response low-pass postfilter. Three 74.01 s real audio sequences sampled at 11.025 kHz are processed under identical algorithmic parameters. Since clean ground truth is not available, performance is assessed with no-reference indicators: waveform behavior, Welch spectral estimates, RMS change, and residual normalized correlation with the reference. With 30 taps per reference channel, 15 anti-causal taps, and forgetting factor 0.999, the maximum reference correlation is reduced from 0.386–0.832 before processing to 0.011–0.016 after processing. The corresponding correlation-ratio reduction is approximately 30.6–34.1 dB, while the output RMS decreases by 1.8–4.8 dB depending on section and stereo channel. The results demonstrate that real train interference, including environmental acoustic effects, can be substantially attenuated when a correlated reference recording is available.
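The core of a reference-based RLS canceller can be sketched in a few lines. The version below is a simplified illustration and not the paper's configuration: it uses a single reference channel, causal taps only, no postfilter, and synthetic signals. The function name, the `delta` initialization, and the toy interference path (0.8 and -0.3) are assumptions made for the example.

```python
import math
import random


def rls_cancel(primary, reference, taps=2, forgetting=0.999, delta=100.0):
    """Remove the part of `primary` that is linearly predictable from `reference`."""
    w = [0.0] * taps                                   # adaptive filter weights
    P = [[delta if i == j else 0.0 for j in range(taps)] for i in range(taps)]
    u = [0.0] * taps                                   # most recent reference samples
    residual = []
    for d, x in zip(primary, reference):
        u = [x] + u[:-1]
        Pu = [sum(P[i][j] * u[j] for j in range(taps)) for i in range(taps)]
        denom = forgetting + sum(u[i] * Pu[i] for i in range(taps))
        gain = [v / denom for v in Pu]
        error = d - sum(w[i] * u[i] for i in range(taps))  # cleaned output sample
        w = [w[i] + gain[i] * error for i in range(taps)]
        # P stays symmetric, so the row vector u^T P equals Pu transposed.
        P = [[(P[i][j] - gain[i] * Pu[j]) / forgetting for j in range(taps)]
             for i in range(taps)]
        residual.append(error)
    return residual, w


rng = random.Random(0)
n_samples = 2000
reference = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
clean = [0.5 * math.sin(0.05 * n) for n in range(n_samples)]
# Synthetic interference: the reference passed through a short unknown path.
primary = [
    clean[n] + 0.8 * reference[n] - 0.3 * (reference[n - 1] if n > 0 else 0.0)
    for n in range(n_samples)
]

residual, weights = rls_cancel(primary, reference)
```

In this toy setup the learned weights should approach the simulated path coefficients, and the residual should approach the clean signal once the filter has converged.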

2606.18561 2026-06-18 cs.LG cs.AI New submission

Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

Saraa Ali, Vladimir Bocharnikov, Fedor Ratnikov, Mikhail Hushchyn, Artem Ryzhikov, Denis Derkach

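Affiliations: Laboratory of Methods for Big Data Analysis, HSE University

AI summary: Proposes a WGAN method that uses a learnable calibration transformation to map a changed detector response distribution back to the reference distribution, and validates its ability to recover aging coefficients and improve energy-distribution agreement on a detector toy model and simulated calorimeter data.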

Comments This is a preprint sent to Nuclear Science and Techniques journal

English abstract

The quality of recorded data depends on the stability of the sensor system that acquires it. Sensor motion and aging can degrade the performance and stability of downstream data-driven methods. We present a Wasserstein-GAN-inspired approach for unsupervised inference of physically interpretable transformation parameters that map a changed detector response distribution back to a nominal reference distribution. In contrast to standard generative modeling, the generator is used as a learnable calibration transformation whose trainable weights represent the sought parameters, while the critic provides a distributional distance signal via the Wasserstein objective. We validate the approach on a tracking-detector toy model with controlled layer shifts and demonstrate its application on high-granularity Geant4-simulated calorimeter data with cell-wise aging effects. The method recovers aging coefficients for individual cells with correlation to ground truth and improves agreement between calibrated and reference energy-sum distributions, while exhibiting the expected degradation at increasing channel-to-channel noise levels. These results indicate that adversarial distribution matching can serve as a data-driven component of calibration strategies in settings where direct labels for degradation parameters are unavailable.

2606.18560 2026-06-18 cs.SD New submission

Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models

Jaehyuk Jang, Kangwook Ko, Wonjun Lee, Changick Kim

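Affiliations: KAIST

AI summary: To address the base-to-new trade-off caused by few-shot tuning of audio-language models, proposes Subspace Tuning (SubT), which constrains text-embedding drift through structured subspace parameterization and residual anchoring and suppresses negative transfer with subspace-aware gating, achieving efficient and strong generalization on 11 audio benchmarks.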

English abstract

Few-shot adaptation of pretrained Audio-Language Models (ALMs) often improves seen-class performance at the cost of unseen-class generalization, leading to the base-to-new trade-off. We attribute this failure to zero-shot drift in the text embedding space: few-shot tuning can distort inter-class structure and move adapted embeddings far from their pretrained anchors. We therefore propose Subspace Tuning (SubT), a geometry-constrained adaptation framework with two complementary controls on drift. Structured Subspace Parameterization limits structural deformation, and Residual Anchoring stabilizes adaptation around the zero-shot prior. At inference time, Subspace-aware Gating further suppresses negative transfer for weakly aligned unseen classes. Across 11 audio benchmarks, SubT delivers strong few-shot generalization while remaining efficient, operating directly on precomputed text embeddings without text-encoder backpropagation.

2606.18558 2026-06-18 cs.CV New submission

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

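Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

AI summary: Proposes a language-instructed 3D point motion forecasting method that, by building a large-scale dataset and benchmark, achieves class-agnostic, view-stable trajectory prediction, and validates its effectiveness in robot manipulation and video generation.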

English abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18557 2026-06-18 cs.AI cs.LG cs.LO New submission

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

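Affiliations: University of Colorado Boulder

AI summary: Proposes the DeFAb benchmark, which converts knowledge bases into verifiable abduction instances to evaluate the creativity and theoretical reasoning of foundation models in defeasible reasoning, and finds that frontier models fall far short of a symbolic solver in accuracy.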

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

English abstract

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

2606.18555 2026-06-18 cs.CV New submission

Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

Trong-Vu Hoang, Quang-Binh Nguyen, Dinh-Khoi Vo, Hoai-Danh Vo, Minh-Triet Tran, Trung-Nghia Le

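Affiliations: University of Science, VNU-HCM, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary: To address the shortage of indoor image data, proposes using Stable Diffusion to generate synthetic images for data augmentation and guards against misuse with diffusion reconstruction error, validating the approach on the MIT Indoor Scene dataset.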

Comments MAPR 2024

English abstract

In the realm of computer vision, indoor image recognition presents challenges due to the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lack of indoor training images, we introduce a novel approach that leverages Stable Diffusion (SD) to generate synthetic images, which serve as a powerful data augmentation tool. SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a countermeasure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE representation enables training robust classifiers using only lightweight deep models. Experiments show that our approach can perfectly recognize SD-generated images, with 100% accuracy using MobilenetV3.

2606.18554 2026-06-18 cs.CV New submission

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

Duc-Manh Phan, Quoc-Duy Tran, Duy-Khang Do, Anh-Tuan Vo, Hai-Dang Nguyen, Trong Le Do, Mai-Khiem Tran, Vinh-Tiep Nguyen, Tam V. Nguyen, Isao Echizen, Minh-Triet Tran, Trung-Nghia Le

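Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh; University of Information Technology, VNU-HCM; University of Dayton; National Institute of Informatics

AI summary: To address the difficulty of detecting realistic disaster images generated by diffusion models, proposes a benchmark dataset of 30,000 images (6,000 real, 24,000 synthetic); experiments show that fine-tuned detectors lose up to 50% accuracy on unseen generators and that zero-shot detectors are also unstable, underscoring the urgent need for cross-domain detection.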

Comments SOICT 2025

English abstract

The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

2606.18553 2026-06-18 cs.CV New submission

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

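Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary: Proposes an image captioning framework augmented with hierarchical multi-modal article retrieval; using structure-aware retrieval and contextual refinement, it combines a VLM and an LLM to generate captions rich in contextual detail, placing 5th in the EVENTA 2025 challenge.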

Comments SOICT 2025

English abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.18543 2026-06-18 cs.AI cs.CL cs.SE New submission

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

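Affiliations: Princeton University

AI summary: Proposes CEO-Bench, which evaluates the integrated decision-making of language model agents in long-horizon, uncertain, dynamic environments by simulating the task of operating a startup for 500 days.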

English abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.18539 2026-06-18 cs.LG stat.ML New submission

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

TS-Fault: 针对结构性故障的时间序列预测器基准测试

Yuyang Zhao, Lian Xu, Hao Miao, Chenxi Liu, Hao Xue

发表机构 * Ray-zyy

AI总结 提出TS-Fault基准,通过参数化故障场景(沿观测/机制、单变量/多变量两轴)评估时间序列预测模型鲁棒性,发现干净数据准确性与鲁棒性负相关、机制级故障重排排名、基础模型最脆弱。

详情
AI中文摘要

时间序列预测(TSF)支撑着能源、交通、金融和医疗等领域的关键决策,然而TSF模型几乎普遍通过在干净保留数据上的单一数字(如平均误差)进行排名,隐含假设该数字能预测部署可靠性。但实际故障并非独立同分布噪声,而是具有时间形状的结构化事件、断裂的跨变量依赖、伴随缺失的机制变化以及跨传感管道的因果传播。将TSF鲁棒性视为数据质量问题,我们提出TS-Fault,一个在显式、参数化且具有可控语义难度的故障场景下评估预测模型的基准。TS-Fault将重复出现的故障沿两个正交轴(观测级 vs 机制级;单变量 vs 多变量)组织为四种模式,并通过统一重要性评分将每种故障注入对预测最关键的窗口。该设计使得鲁棒性能够针对模型实际依赖的结构进行测试,而非简化为通用噪声敏感性。我们在6个数据集、4种模式和5个难度级别上,采用配对干净/损坏协议评估了21个模型。结果揭示了三个与常见排行榜直觉相悖的发现:(i)干净数据准确性与鲁棒性负相关;(ii)干净排名在观测级故障下保持不变,但在机制级故障下重新洗牌;(iii)所有灾难性故障均发生在机制级故障下,基础模型在干净数据上准确率最高但表现出最大的脆弱性。代码已公开于 https://github.com/Ray-zyy/TS-Fault。

英文摘要

Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d. noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs. mechanism-level; univariate vs. multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.
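
A paired clean/corrupt evaluation of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions: the seasonal-naive forecaster, the level-shift fault, the hand-picked fault window, and the names `seasonal_naive`, `inject_level_shift`, and `degradation` are all hypothetical and are not taken from the TS-Fault code base, which selects the window with an importance score.

```python
import math

def seasonal_naive(context, horizon, period=12):
    """Forecast by repeating the last observed season (hypothetical stand-in model)."""
    return [context[len(context) - period + (t % period)] for t in range(horizon)]

def inject_level_shift(context, start, length, magnitude):
    """Return a copy of context with a constant offset added on [start, start + length)."""
    corrupted = list(context)
    for i in range(start, start + length):
        corrupted[i] += magnitude
    return corrupted

def mae(pred, truth):
    """Mean absolute error between a forecast and the ground truth."""
    return sum(abs(p - y) for p, y in zip(pred, truth)) / len(truth)

# Clean series with period 12; the first 48 points are the input, the last 12 the target.
series = [math.sin(2 * math.pi * t / 12) for t in range(60)]
context, target = series[:48], series[48:]

# Paired protocol: same model, same target, clean input versus corrupted input.
mae_clean = mae(seasonal_naive(context, 12), target)
degradation = {}
for level in range(1, 6):  # five difficulty levels, here mapped to fault magnitude
    faulty = inject_level_shift(context, start=42, length=6, magnitude=0.5 * level)
    degradation[level] = mae(seasonal_naive(faulty, 12), target) - mae_clean
print(degradation)
```

Reporting the difference between corrupted and clean error, rather than the corrupted error alone, is what lets robustness be separated from clean-data accuracy in a comparison across models.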

2606.18538 2026-06-18 cs.LG stat.ML 新提交

Effects of sparsity and superposition on loss in simple autoencoders

稀疏性与叠加对简单自编码器损失的影响

Mriganka Basu Roy Chowdhury, Eric McLaughlin Weiner

发表机构 * Department of Statistics, UC Berkeley(伯克利大学统计学系) Department of Materials Science, UC Berkeley(伯克利大学材料科学系)

AI总结 研究神经网络中多语义性源于叠加现象,通过数学分析稀疏输入下自编码器的L2重构损失上下界,验证并扩展了Elhage等人的实证结果。

Comments 16 pages, 3 figures

详情
AI中文摘要

神经网络机械可解释性的主要困难之一是出现多语义性,即每个神经元通常负责多个不同任务,阻碍了对其功能的清晰解释。Elhage等人(2022)的开创性论文认为,这是由于叠加现象,即神经网络将不同特征表示为低维空间中的非正交方向,这种策略可以在不牺牲保真度的情况下实现更大的数据压缩,因为输入向量具有特征稀疏性。Elhage等人(2022)在一个相当自然且简单的具有稀疏输入的自编码器中实证验证了这些假设。本文的贡献在于分析叠加现象发生和最优性的数学基础,同时严格证实了他们的一些发现。特别地,我们为幂激活函数提供了L2重构损失的上界和下界,在非常稀疏的情况下是紧的。文末还包含一个简短的开放问题列表。

英文摘要

One of the major difficulties in the mechanistic interpretability of neural networks is the occurrence of polysemanticity, which suggests that each neuron is typically responsible for multiple different tasks, impeding a clean interpretation of its function. The seminal paper of Elhage et al. (2022) argues that this occurs due to superposition, a phenomenon where the neural network represents distinct features as non-orthogonal directions in a lower-dimensional space, a strategy that allows much greater compression of the data without sacrificing fidelity due to the feature sparsity of input vectors. Elhage et al. (2022) empirically validate these hypotheses in a rather natural and simple autoencoder with sparse inputs. The contribution of the present work is to analyze the mathematical basis for the occurrence and optimality of superposition, while rigorously corroborating some of their findings. In particular, we provide upper and lower bounds for the L2 reconstruction loss, tight in the very sparse regime, for power activation functions. A short list of interesting open problems is also included at the end.
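
The effect of sparsity on reconstruction loss can be illustrated with a small numerical experiment. The sketch below assumes the toy model $\hat{x} = \mathrm{ReLU}(W^\top W x)$ from Elhage et al. (2022) with the bias omitted, five features, two hidden dimensions, and feature values drawn uniformly from $[0, 1]$; it uses ReLU rather than the power activations analyzed in the paper, and the two weight matrices are hand-picked for illustration, not learned.

```python
import math
import random

def reconstruct(W, x):
    """x_hat = ReLU(W^T W x) for an m x n weight matrix W (bias omitted)."""
    m, n = len(W), len(W[0])
    h = [sum(W[r][i] * x[i] for i in range(n)) for r in range(m)]
    return [max(0.0, sum(W[r][j] * h[r] for r in range(m))) for j in range(n)]

def mean_l2_loss(W, p_active, n_samples=20000, seed=0):
    """Monte Carlo estimate of the L2 reconstruction loss for sparse inputs."""
    rng = random.Random(seed)
    n = len(W[0])
    total = 0.0
    for _ in range(n_samples):
        # Each feature is active with probability p_active, else exactly zero.
        x = [rng.random() if rng.random() < p_active else 0.0 for _ in range(n)]
        x_hat = reconstruct(W, x)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / n_samples

n_features = 5
# Superposition: five unit vectors spread evenly in the plane (non-orthogonal).
pentagon = [[math.cos(2 * math.pi * k / n_features) for k in range(n_features)],
            [math.sin(2 * math.pi * k / n_features) for k in range(n_features)]]
# No superposition: two features get orthogonal directions, the other three are dropped.
orthogonal = [[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0, 0.0]]

sparse_super = mean_l2_loss(pentagon, p_active=0.05)
sparse_ortho = mean_l2_loss(orthogonal, p_active=0.05)
print(sparse_super, sparse_ortho)
```

With only 5% of features active, the non-orthogonal arrangement gives the lower loss in this toy setting, because interference between features is rare while the dropped features in the orthogonal arrangement are always lost. Changing `p_active` shows how the comparison depends on sparsity.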

2606.18537 2026-06-18 cs.LG 新提交

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

入乡随俗:从异构智能体学习通用行为

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

发表机构 * University of Washington(华盛顿大学) NVIDIA(英伟达)

AI总结 提出GRID方法,从追求不同目标的异构示范者中提取通用奖励,训练通用智能体以学习环境通用能力,避免模式平均偏差,提升下游任务微调效率。

详情
AI中文摘要

人类通常通过观察他人来获取新技能,因为观察到的行为隐含地揭示了如何在环境中行动。然而,从异构群体中获得的观察会引入冲突的行为信号,使得难以确定哪些行为值得模仿。我们通过通用奖励推断与解耦(GRID)来解决这一挑战,这是一种从追求不同目标的异构示范者群体中提取普遍有用行为的社会学习方法。GRID将每个智能体的奖励函数分解为通用奖励(捕捉所有智能体共享的行为)和特定奖励(捕捉个体偏好和目标)。仅基于通用奖励进行训练提供了一种通用预训练的新范式。它产生了一个通用智能体,该智能体内化了通用的环境能力,如安全性和基本任务熟练度,而不会出现困扰标准从示范学习技术的模式平均偏差。这个通用智能体作为微调到下游任务(包括训练中未见过的偏好)的优越先验。在合成基函数分解、多智能体Craftax和连续自动驾驶模拟器(Highway-Env)上的实验证实,GRID以语义上有意义的方式成功解耦了奖励结构,优于标准的从示范学习基线,并实现了更高效和稳定的特化。

英文摘要

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.

2606.18528 2026-06-18 cs.CV 新提交

A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

一种面向书写者无关离线签名验证的原型签名方法

Kecia G. de Moura, Robert Sabourin, Rafael M. O. Cruz

发表机构 * École de technologie supérieure – Université du Québec Montreal(魁北克大学高等技术学院)

AI总结 提出基于原型签名的数据驱动策略,生成多样且信息丰富的负样本,提升对熟练伪造签名的检测能力,并提高可扩展性和计算效率。

Comments Accepted for oral presentation at the International Conference on Pattern Recognition (ICPR) 2026

详情
AI中文摘要

离线手写签名验证旨在使用静态图像区分真实签名和伪造签名。由于真实伪造样本很少,通常从其他用户的真实签名中随机抽取负样本来创建训练数据。然而,这种随机选择往往缺乏多样性,增加冗余,并提高计算成本,导致训练效率低下。我们提出了一种数据驱动策略,使用原型签名生成多样且信息丰富的负样本,原型签名是真实签名特征的紧凑、不可识别的摘要。基于实验结果,我们得出结论:(i)原型签名产生更具信息量的负样本,改进了对熟练伪造的检测;(ii)所提出的方法与骨干网络无关,在不同架构上表现出鲁棒性;(iii)当与原始形式的线性SVM结合时,它可作为基于RBF模型的替代方案,同时显著提高可扩展性和计算效率。该方法的实现可在以下网址获取:https://github.com/kdmoura/proto_hsv。

英文摘要

Offline handwritten signature verification aims to distinguish genuine from forged signatures using static images. Since real forgeries are rarely available, negative samples are usually randomly drawn from genuine signatures of other users to create training data. However, this random selection often lacks diversity, increases redundancy, and escalates computational cost, leading to inefficient training. We propose a data-driven strategy to generate diverse, informative negative samples using prototypical signatures, which are compact, non-identifiable summaries of genuine signature features. Based on the experimental results, we conclude that (i) prototypical signatures yield more informative negative samples, improving the detection of skilled forgeries; (ii) the proposed approach is backbone-agnostic, showing robustness across architectures; and (iii) when combined with a primal-form linear SVM, it serves as an alternative to RBF-based models while significantly improving scalability and computational efficiency. Implementation of the method is available at https://github.com/kdmoura/proto_hsv.

2606.18525 2026-06-18 cs.LG 新提交

Hierarchical Attention via Domain Decomposition

基于区域分解的层次注意力机制

Stephan Köhler, Oliver Rheinbach

发表机构 * Faculty of Mathematics and Computer Science(数学与计算机科学系)

AI总结 提出一种基于两水平重叠Schwarz区域分解的层次注意力机制,通过局部低秩注意力块与粗网格注意力块结合,在少参数下实现更快训练和更高精度。

Comments 20 pages, 10 figures

详情
AI中文摘要

我们提出了一种基于两水平重叠Schwarz区域分解的层次注意力机制。该方法的动机源于观察到两水平Schwarz区域分解方法将局部子域校正与一个传达全局、长程信息的粗水平相结合。我们在一个具有齐次Dirichlet边界条件的一维扩散问题背景下,测试了其在有限维算子学习中的实用性。尽管该问题简单,但它提供了一个受控的序列到序列设置,其中精确的非局部解算子已知。离散化后,学习解算子相当于逼近一个对称正定矩阵的逆。作为基线,我们使用一个全局无softmax的低秩注意力算子,形式为$QK^T$。所提出的构造将这个密集的全局分解替换为一个两水平加性结构:重叠子域上的局部低秩注意力块与一个粗注意力块相结合。得到的算子形式为$$M_{\theta}^{-1} = \Phi Q_0 K_0^T \Phi^T + \sum_{i=1}^{N} R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i.$$ 这里$R_i$限制到重叠子域,$D_i$是单位划分权重,$\Phi$是粗插值(或延拓)矩阵。针对合成Fourier右端项的数值实验表明,区域分解注意力算子能够比全局低秩注意力基线训练更快,并在使用显著更少参数的情况下提供更精确的逼近。

英文摘要

We propose a hierarchical attention mechanism based on two-level overlapping Schwarz domain decomposition. The method is motivated by the observation that two-level Schwarz domain decomposition methods combine local subdomain corrections with a coarse level that communicates global, long-range information. We test its usefulness in the context of finite-dimensional operator learning using a simple, one-dimensional diffusion problem with homogeneous Dirichlet boundary conditions. Although elementary, this problem provides a controlled sequence-to-sequence setting in which the exact nonlocal solution operator is known. After discretization, learning the solution operator amounts to approximating the inverse of a symmetric positive definite matrix. As a baseline, we use a global softmax-free low-rank attention operator of the form $QK^T$. The proposed construction replaces this dense global factorization by a two-level additive structure: local low-rank attention blocks on overlapping subdomains are combined with a coarse attention block. The resulting operator has the form $$M_{\theta}^{-1} = \Phi Q_0 K_0^T \Phi^T + \sum_{i=1}^{N} R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i.$$ Here $R_i$ restricts to an overlapping subdomain, $D_i$ is a partition-of-unity weight, and $\Phi$ is a coarse interpolation (or prolongation) matrix. Numerical experiments for synthetic Fourier right-hand sides indicate that the domain-decomposition attention operator is able to train faster and can give more accurate approximations than a global low-rank attention baseline while using significantly fewer parameters.
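
The two-level additive operator can be applied to a vector without ever forming a dense matrix. The following is a minimal sketch, assuming six grid points, two overlapping subdomains, rank-1 factors with random (untrained) entries, and a piecewise-linear coarse basis; the helper names and all sizes are hypothetical choices for illustration, not the authors' implementation.

```python
import random

def matvec(A, x):
    """Return A x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mat_t_vec(A, x):
    """Return A^T x."""
    return [sum(A[r][c] * x[r] for r in range(len(A))) for c in range(len(A[0]))]

def low_rank_apply(Q, K, x):
    """Apply the softmax-free attention block Q K^T to x."""
    return matvec(Q, mat_t_vec(K, x))

def apply_two_level(x, Phi, Q0, K0, subdomains, weights, local_factors):
    """Apply the coarse term plus the sum of weighted local low-rank terms."""
    # Coarse term: Phi Q_0 K_0^T Phi^T x
    y = matvec(Phi, low_rank_apply(Q0, K0, mat_t_vec(Phi, x)))
    # Local terms: R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i x
    for idx, d, (Qi, Ki) in zip(subdomains, weights, local_factors):
        xi = [d[k] ** 0.5 * x[j] for k, j in enumerate(idx)]  # D_i^{1/2} R_i x
        zi = low_rank_apply(Qi, Ki, xi)
        for k, j in enumerate(idx):
            y[j] += d[k] ** 0.5 * zi[k]                       # R_i^T D_i^{1/2} z_i
    return y

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random factor matrix standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n = 6
subdomains = [[0, 1, 2, 3], [2, 3, 4, 5]]               # index sets defining R_i
weights = [[1.0, 1.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]]  # diagonals of D_i
Phi = [[1.0 - j / (n - 1), j / (n - 1)] for j in range(n)]  # two coarse hat functions
Q0, K0 = rand_matrix(2, 1), rand_matrix(2, 1)
local_factors = [(rand_matrix(4, 1), rand_matrix(4, 1)) for _ in subdomains]

x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = apply_two_level(x, Phi, Q0, K0, subdomains, weights, local_factors)
print(y)
```

The weights are chosen so that they sum to one at every grid point, which is the partition-of-unity property that $D_i$ is meant to satisfy. In this example the input is supported only in the first subdomain, so any response at the far end of the grid comes from the coarse term alone.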

2606.18524 2026-06-18 cs.LG 新提交

On the Residual Scaling of Looped Transformers: Stability and Transferability

关于循环Transformer的残差缩放:稳定性和可迁移性

Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li

发表机构 * Tsinghua University(清华大学)

AI总结 针对循环Transformer,提出残差缩放因子应为1/N而非1/√L,并推导出多层的分解参数化,实现超参数从少循环到多循环的迁移。

Comments 19 pages, 9 figures

详情
AI中文摘要

循环(权重共享)Transformer 将共享残差块应用 N 次(h ← h + ε f(h),每一步使用相同的 f),在不增加参数的情况下增加有效深度。先前的深度缩放分析建议深度为 L 的残差网络使用 ε = 1/√L。我们证明这对于循环架构是不够的:权重共享使得残差更新在迭代间相关,需要更强的缩放 ε = 1/N。对于多层块(L 个独特层循环 N 次),我们推导出一个分解参数化 ε = λ/(N√L),将两种增长源分开:1/N 控制层内循环相关性,1/√L 控制层间方差。一个关键结果是,最优学习率仅取决于独特层数 L,而非循环次数 N,从而实现了从小的 N 到大的 N 的直接超参数迁移,无需重新调整。在循环 Transformer 上的实验证实,1/N 缩放相比 1/√N 缩放提高了可训练性,并在不同循环次数下获得更优的损失。

英文摘要

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
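
Why correlated updates call for $1/N$ scaling can be seen in a deliberately extreme toy case. The sketch below assumes the shared block is the identity map, $f(h) = h$, so every iteration pushes in exactly the same direction and the hidden state is multiplied by $(1+\varepsilon)^N$. This is an illustration of the scaling argument only, not the paper's derivation, and `looped_gain` is a hypothetical helper.

```python
import math

def looped_gain(n_loops: int, eps: float) -> float:
    """Growth of a scalar state after n_loops steps of h <- h + eps * f(h), with f(h) = h."""
    h = 1.0
    for _ in range(n_loops):
        h = h + eps * h  # same f at every step, so the updates are fully correlated
    return h

for n in (4, 16, 64, 256):
    inv_n = looped_gain(n, 1.0 / n)
    inv_sqrt_n = looped_gain(n, 1.0 / math.sqrt(n))
    print(f"N={n:4d}  eps=1/N -> {inv_n:.3f}   eps=1/sqrt(N) -> {inv_sqrt_n:.3e}")
```

With $\varepsilon = 1/N$ the gain stays below $e$ for every loop count, whereas with $\varepsilon = 1/\sqrt{N}$ it grows roughly like $e^{\sqrt{N}}$, which is the kind of instability the stronger scaling is meant to remove.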

2606.18521 2026-06-18 cs.LG cs.AI 新提交

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

稀疏性诅咒:从模型合并理解RLVR模型参数空间

Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang

发表机构 * Zhejiang University(浙江大学) Simon Fraser University(西蒙菲莎大学) The Chinese University of Hong Kong(香港中文大学) Zhejiang Key Lab of Accessible Perception and Intelligent Systems(浙江省可感知智能系统重点实验室)

AI总结 本文发现RLVR模型的稀疏更新在参数空间中分散更远,形成近正交捷径导致合并脆弱,并提出SAR-Merging方法解决该问题。

Comments Accepted by KDD 2026

详情
AI中文摘要

可验证奖励强化学习(RLVR)已成为一种强大的后训练范式,在激发推理智能和抵抗灾难性遗忘方面超越了监督微调(SFT)。最近的研究进一步揭示,与SFT相比,RLVR会引发高度稀疏且偏离主成分的参数更新。这自然引出一个问题:这种稀疏性是否使RLVR模型更易于模型合并?如果是,模型合并将提供一种可扩展的、无需训练的方法,来聚合来自独立训练的RLVR模型的多样化推理能力。令人惊讶的是,我们发现相反的情况,揭示了一种稀疏性诅咒:稀疏的RLVR更新在参数空间中分散得更远,形成近正交的捷径,使得聚合本质上是脆弱的。这很可能源于RL优化的随机性和涌现推理模式的多样性。与SFT模型收敛到共享的平坦盆地并自然合并不同,RLVR模型在标准合并方法下遭受严重退化。通过对更新几何的系统性实证分析,我们描述了这种失败背后的机制,并提出了敏感性感知解析合并(SAR-Merging),这是一种针对RLVR参数空间独特结构定制的合并方案。SAR-Merging通过基于Fisher信息的敏感性仲裁解决重叠更新区域中的冲突,然后通过幅度感知稀疏化和重新缩放来保留脆弱的推理路径。在数学和编程基准上的实验表明,SAR-Merging在RLVR模型上显著优于现有合并方法,实现了单任务增强和多能力融合。

英文摘要

Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.

2606.18519 2026-06-18 cs.RO cs.AI 新提交

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

如您所愿:利用LLM在精准农业中进行形式化验证的任务规划

Marcos Abel Zuzuárregui, Stefano Carpin

发表机构 * University of California, Merced(加州大学默塞德分校)

AI总结 针对自然语言歧义性,提出基于线性时序逻辑(LTL)反馈循环的LLM任务规划系统,通过双LLM分工实现规范生成与验证,提升精准农业任务规划的可靠性。

详情
Journal ref
Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)
AI中文摘要

尽管机器人系统现已商业化并部署于各行各业,但许多系统高度专业化,通常需要高级技能才能操作并确保其按指令执行。为缓解这一问题,我们近期引入了一个任务规划器,利用大语言模型(LLM)根据自然语言描述的任务描述合成精准农业中的任务计划。虽然该系统表现出色,但也存在自然语言固有的歧义性。本文通过引入多个基于线性时序逻辑(LTL)的反馈循环来扩展我们的系统,以确保任务规划系统满足用户制定的规范,同时仍使用自然语言。为减轻潜在偏差,我们使用两个不同的商业LLM分别负责规范生成和验证子任务。通过大量实验,我们强调了将任务验证集成到全自主流水线中的优势与局限,特别是关于LLM生成有效LTL公式的能力,并展示了我们的实现如何应对和解决这些挑战。

英文摘要

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

2606.18518 2026-06-18 cs.LG cs.AI 新提交

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

PSyGenTAB:通过约束优化生成合成临床表格数据的隐私保护框架

Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni

发表机构 * San Diego State University(圣地亚哥州立大学) University of California, Irvine(加利福尼亚大学尔湾分校)

AI总结 提出PSyGenTAB框架,将合成医疗数据生成建模为约束优化问题,通过增强拉格朗日方法嵌入可配置隐私约束,在保证隐私阈值的同时最大化临床数据效用,实验表明合成数据训练的模型性能与真实数据相当。

Comments 20 pages

详情
AI中文摘要

由于机构壁垒和严格的隐私法规(如HIPAA和GDPR),医疗AI的发展受到高质量临床数据获取限制。合成数据生成提供了一种潜在解决方案,但现有方法缺乏明确管理隐私-效用权衡的原则性机制,常常退化临床有意义的模式或面临患者重识别风险。我们提出PSyGenTAB,一个隐私保护生成框架,将合成医疗数据生成建模为使用增强拉格朗日方法求解的约束优化问题。通过将可配置的隐私约束直接嵌入模型训练,PSyGenTAB在最大化临床数据效用的同时强制执行最低隐私阈值。在多个临床驱动的基准测试中,PSyGenTAB保留了可靠健康AI所需的特征间临床关系和少数类诊断模式。使用“合成训练、真实测试”和“真实训练、合成测试”协议的下游评估表明,在合成数据上训练的模型达到了与真实患者记录训练模型相当的性能。隐私审计进一步证明了精确记录复制的减少和对成员推理攻击的强大抵抗力。这些结果确立了PSyGenTAB作为平衡合成医疗数据中隐私保护和临床效用的原则性框架,支持安全的跨机构AI开发。

英文摘要

The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.
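
The Augmented Lagrangian Method mentioned above can be illustrated on a one-dimensional toy problem. The sketch assumes a hypothetical utility loss $(x-1)^2$ and a hypothetical privacy score $1 - x$ that must stay at or above a threshold of 0.4, i.e. the constraint $g(x) = x - 0.6 \le 0$. None of these functions or constants come from PSyGenTAB; they only show how the multiplier and penalty interact.

```python
def solve_alm(rho=10.0, outer_iters=20, inner_iters=200, lr=0.05):
    """Minimize (x - 1)^2 subject to x - 0.6 <= 0 with an augmented Lagrangian."""
    x, lam = 0.0, 0.0
    for _ in range(outer_iters):
        # Inner loop: gradient descent on the augmented Lagrangian for fixed lam.
        for _ in range(inner_iters):
            g = x - 0.6                     # g(x) <= 0  <=>  privacy(x) = 1 - x >= 0.4
            mult = max(0.0, lam + rho * g)  # effective multiplier for an inequality constraint
            grad = 2.0 * (x - 1.0) + mult   # utility gradient plus constraint term (g'(x) = 1)
            x -= lr * grad
        # Outer loop: multiplier update, clipped at zero.
        lam = max(0.0, lam + rho * (x - 0.6))
    return x, lam

x_opt, lam_opt = solve_alm()
print(x_opt, lam_opt)
```

The unconstrained optimum $x = 1$ violates the privacy threshold, so the iterates settle on the constraint boundary $x = 0.6$, and the multiplier converges to the value that balances the utility gradient there. The same pattern, with a generative model's loss in place of the toy objective, is what "embedding privacy constraints into training" amounts to in general terms.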

2606.18516 2026-06-18 cs.RO 新提交

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

动态杂乱环境下的任务分配与运动规划:基于CBBA与凸集图

Matthew D. Osburn, Cameron K. Peterson, John L. Salmon

发表机构 * Electrical and Computer Engineering(电气与计算机工程系) Mechanical Engineering(机械工程系)

AI总结 针对动态杂乱环境中的多智能体任务规划,提出结合凸集图(GCS)进行轨迹优化与共识捆绑算法(CBBA)进行分布式任务分配的方法,实现安全高效的轨迹规划和任务协调。

Comments 15 pages single column, 10 figures, AIAA-Scitech 2027 Submission

详情
AI中文摘要

在杂乱、动态环境中的多智能体任务规划需要在分配任务给智能体的同时,确定通过环境的安全、时间高效的轨迹。当任务是动态的(例如会合目标)时,分配决策不仅取决于哪个智能体最适合某项任务,还取决于该任务何时何地可以到达。本文提出了一个解决该问题的方法,该方法将凸集图(GCS)用于轨迹优化,与共识捆绑算法(CBBA)用于分布式任务分配相结合。在我们的方法中,GCS通过使用时间扩展(3D+时间)配置空间找到通过动态环境的最优轨迹。同时,CBBA协调跨智能体的任务分配,使得在移动环境中能够做出明智的决策。然后,我们连接分配和规划,使智能体能够在3D+时间配置空间中避免碰撞,并提供准确的任务完成时间估计。我们在具有静态和动态任务的模拟杂乱环境中展示了我们方法的有效性。

英文摘要

Multi-agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time-efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time-extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision-making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.

2606.18514 2026-06-18 cs.RO cs.LG 新提交

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

N(CO)$^2$: 基于机会约束的神经组合优化求解随机定向问题

Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin

发表机构 * Department of Computer Science and Engineering, University of California, Merced(加州大学默塞德分校计算机科学与工程系)

AI总结 提出N(CO)$^2$框架,结合强化学习求解随机定向问题,无需手工启发式,在不确定环境下优化路径选择,性能媲美MILP。

详情
Journal ref
In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2025
AI中文摘要

神经组合优化(NCO)通过学习启发式,为求解复杂图优化问题提供了一种有前景的替代传统启发式方法的方法。这类问题在自动化领域频繁出现,可用于建模多种应用。虽然NCO在确定性组合优化问题上已被广泛研究,但只有少数工作旨在解决随机组合优化问题。本文提出N(CO)$^2$:基于机会约束的神经组合优化,用于求解随机定向问题(SOP),无需手工设计的启发式。通过集成强化学习(RL)框架,模型在不确定性下优化路径选择,有效平衡探索与利用。实验结果表明,我们的方法在多种SOP实例上具有良好的泛化能力,与最先进的混合整数线性规划(MILP)相比性能具有竞争力。所提方法减少了启发式设计的人力投入,同时在不确定环境中实现自适应和高效的决策。

英文摘要

Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
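
A chance constraint in stochastic orienteering bounds the probability that a tour overruns its budget. The sketch below estimates that probability by Monte Carlo sampling for one fixed candidate path, assuming each edge's travel cost is its length times a uniform factor in $[1, 1.5]$. The cost model, the path, the budget, and the 10% failure bound are all hypothetical and are not taken from the paper.

```python
import random

def failure_probability(edge_lengths, budget, n_samples=20000, seed=0):
    """Estimate P(total travel cost > budget) for a fixed path by sampling."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        # Assumed noise model: each edge costs between 1.0x and 1.5x its length.
        total = sum(d * rng.uniform(1.0, 1.5) for d in edge_lengths)
        if total > budget:
            failures += 1
    return failures / n_samples

path = [2.0, 3.0, 1.5]  # hypothetical edge lengths along a candidate tour
budget = 9.0
p_fail = failure_probability(path, budget)
satisfies_constraint = p_fail <= 0.1  # chance constraint with a 10% failure bound
print(p_fail, satisfies_constraint)
```

A learned policy would have to respect this constraint while maximizing collected reward; a check of this form is one simple way to audit a proposed path after the fact.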

2606.18510 2026-06-18 cs.CV cs.CR 新提交

Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

人脸呈现攻击检测中的架构偏差:视觉Transformer与卷积神经网络的比较研究

Ngela Landon Ntung, Floride Tuyisenge, Jema David Ndibwile

发表机构 * College of Engineering, Carnegie Mellon University(卡内基梅隆大学工程学院)

AI总结 通过比较ViT和CNN在人脸呈现攻击检测中的表现,发现预训练ViT(DeiT-S)在准确率、公平性和跨种族泛化上优于CNN,将种族间ACER差距降低83%。

Comments 8 Pages, 4 Figures, 5 Tables

详情
AI中文摘要

人脸呈现攻击检测(PAD)系统构成生物特征认证中的关键安全层;然而,现有方法在不同人口群体间表现出系统性性能差异,对深肤色个体影响尤为严重。本文通过实证比较研究,探究视觉Transformer架构相对于卷积基线是否能够减少人脸PAD系统中的人口统计偏差。实验在CASIA-SURF跨种族人脸反欺骗(CeFA)数据集上进行。评估了三种架构:从头训练的多模态ViT-Tiny、ResNet18 CNN基线,以及在CeFA上微调的预训练DeiT-S,覆盖非洲、东亚和零样本中亚人口群体。DeiT-S实现了最高总体准确率97.27%和最低等错误率0.86%,优于准确率90.15%的ResNet18。在公平性方面,DeiT-S将非洲与东亚受试者之间的种族间ACER差距降至0.13%,而基于LBP的工作[6]报告为0.75%,降低了83%。最值得注意的是,ResNet18在零样本中亚受试者上的BPCER为10.44%,而DeiT-S在相同未见群体上保持2.89%,展现出3.6倍的泛化优势。这些结果表明,预训练视觉Transformer在PAD中实现了更高的准确率,产生了更小的人口统计性能差距,并在未见人口群体上更公平地泛化,表明PAD中的跨人口公平性可能部分受架构设计影响。

英文摘要

Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.
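
The two headline ratios in the abstract follow directly from the reported percentages, as the short check below shows. The `acer` helper uses the common definition of ACER as the mean of APCER and BPCER, which is an assumption here since the abstract does not spell it out.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

def acer(apcer: float, bpcer: float) -> float:
    """Average classification error rate, assumed to be the mean of APCER and BPCER."""
    return (apcer + bpcer) / 2.0

# Inter-ethnic ACER gap: 0.75 (LBP-based work) versus 0.13 (DeiT-S), in percentage points.
gap_reduction = relative_reduction(0.75, 0.13)
# BPCER on zero-shot Central Asian subjects: ResNet18 versus DeiT-S.
bpcer_ratio = 10.44 / 2.89
print(f"{gap_reduction:.1%}", f"{bpcer_ratio:.2f}x")
```

Note that the 83% figure is a relative reduction of the gap, not a difference in error rates: the absolute change is 0.62 percentage points.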

2606.18509 2026-06-18 cs.LG stat.ML 新提交

Concept Modulation Models: A Unified Framework for Identifiability and Extrapolation

概念调制模型:可识别性与外推的统一框架

Soheun Yi, Yizhou Lu, Chandler Squires, Pradeep Ravikumar

发表机构 * Department of Statistics and Data Science, Carnegie Mellon University(卡内基梅隆大学统计与数据科学系) Machine Learning Department, Carnegie Mellon University(卡内基梅隆大学机器学习系)

AI总结 提出概念调制模型(CMMs),通过属性势统一条件潜变量模型的可识别性与外推分析,将基于转移的可识别性提升至条件设置,并导出代数外推准则。

详情
AI中文摘要

条件潜变量模型中的可靠泛化需要理解可识别性和外推:观测属性间的变化如何决定潜在结构,以及该结构如何决定未见属性上的分布。然而,现有的可识别性和外推保证大多是模型特定的,在非线性ICA、因果表示学习、扰动建模及相关条件潜变量模型中分别进行分析。我们引入概念调制模型(CMMs),这是一类属性索引的条件生成模型,其结构为$A\to \Lambda \to C\to X$,其中属性选择调制器,调制器诱导潜在概念法则,概念生成观测特征。CMMs通过展示观测属性上的特征一致性诱导受CMM类约束的潜在概念转移,将基于转移的可识别性提升至条件设置。我们通过属性势(属性条件概念法则之间的对数密度比)表达这些约束,将通用提升步骤与模型特定的刚性论证分离。相同的势控制外推:当且仅当传输的属性势恒等式扩展到这些属性时,未见属性上的一致性成立。这导出了代数外推准则,识别出几个现有可识别性和外推结果背后的共同基于势的证明对象,并且当与这些工作中的模型特定刚性论证结合时,恢复了它们所述的结论。

英文摘要

Reliable generalization in conditional latent variable models requires understanding both identifiability and extrapolation: how observed variation across attributes determines latent structure, and how that structure determines distributions at unseen attributes. However, existing identifiability and extrapolation guarantees are largely model-specific, with separate analyses in nonlinear ICA, causal representation learning, perturbation modeling, and related conditional latent variable models. We introduce concept modulation models (CMMs), an attribute-indexed class of conditional generative models with structure $A\to \Lambda\to C\to X$, where attributes select modulators, modulators induce latent concept laws, and concepts generate observed features. CMMs lift transition-based identifiability to conditional settings by showing that feature agreement on observed attributes induces a latent concept transition constrained by the CMM class. We express these constraints through attribute potentials, log-density ratios between attribute-conditioned concept laws, separating the generic lifting step from model-specific rigidity arguments. The same potentials control extrapolation: agreement at unseen attributes holds exactly when the transported attribute-potential identities extend to those attributes. This yields algebraic extrapolation criteria, identifies the common potential-based proof objects behind several existing identifiability and extrapolation results, and, when combined with the model-specific rigidity arguments in those works, recovers their stated conclusions.

2606.18508 2026-06-18 cs.CL cs.IR 新提交

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG:主题元数据作为段落级检索的语义指南针

Amirhossein Abaskohi, Raymond Li, Gaetano Cimino, Peter West, Giuseppe Carenini, Issam H. Laradji

发表机构 * University of British Columbia(不列颠哥伦比亚大学) University of Salerno(萨莱诺大学) ServiceNow Research(ServiceNow研究院)

AI总结 提出MCompassRAG框架,通过主题元数据增强段落表示,利用LLM蒸馏训练轻量检索器,实现主题感知检索,在六个基准上平均信息效率提升8.24%,延迟降低5倍以上。

详情
AI中文摘要

检索增强生成(RAG)系统关键依赖于文档的分块和搜索方式。细粒度块可以提高检索精度,但会扩大搜索空间,增加延迟和成本;较大的块减少了候选数量,但使密集相似性变得不可靠,因为每个块的表示混合了多个主题并引入了更多语义噪声。这种权衡在深度研究任务中尤其受限,因为检索必须在大型异构语料库中既快速又精确。我们引入了MCompassRAG,一种元数据引导的检索框架,它使用主题级信号作为语义指南针来选择相关证据。MCompassRAG不仅依赖于查询与噪声块嵌入之间的余弦相似度,还在同一嵌入空间中用主题元数据丰富块表示,并通过LLM教师蒸馏训练轻量级检索器。在推理时,MCompassRAG无需额外的LLM调用即可执行主题感知检索,提高了效率和证据质量。在六个复杂检索基准上,MCompassRAG平均信息效率(IE)提高了8.24%,延迟比最强的高效RAG基线低5倍以上。代码可从 https://github.com/AmirAbaskohi/MCompassRAG 获取。

英文摘要

Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on https://github.com/AmirAbaskohi/MCompassRAG.
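
How topic metadata can steer retrieval when chunk embeddings are noisy is easy to show on a toy example. The sketch below assumes three-dimensional embeddings and a simple additive enrichment (chunk vector plus a scaled topic vector in the same space). The abstract does not specify how MCompassRAG combines the two, so `enrich`, the `alpha` weight, and all vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def enrich(chunk_vec, topic_vec, alpha=1.0):
    """Hypothetical enrichment: add a scaled topic vector to the chunk embedding."""
    return [c + alpha * t for c, t in zip(chunk_vec, topic_vec)]

# Topic vectors live in the same space as chunk and query embeddings.
topics = {"finance": [1.0, 0.0, 0.0], "sports": [0.0, 1.0, 0.0]}
# Both chunks mix several topics, so their raw embeddings are nearly identical.
chunks = {
    "chunk_a": ([0.55, 0.60, 0.50], "finance"),
    "chunk_b": ([0.60, 0.55, 0.50], "sports"),
}
query = [1.0, 0.1, 0.0]  # a query that is mostly about finance

plain = {name: cosine(query, vec) for name, (vec, _) in chunks.items()}
enriched = {name: cosine(query, enrich(vec, topics[topic]))
            for name, (vec, topic) in chunks.items()}
best_plain = max(plain, key=plain.get)
best_enriched = max(enriched, key=enriched.get)
print(best_plain, best_enriched)
```

Here plain cosine similarity ranks `chunk_b` first because of a small accident in the noisy embeddings, while the topic-enriched representation ranks the finance chunk `chunk_a` first. Since the enrichment is applied to stored vectors, it adds no LLM call at query time in this sketch.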

2606.18506 2026-06-18 cs.LG eess.SP stat.AP 新提交

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

超越AHI:一种可解释的因果发现引导的睡眠恢复框架在互联健康中的应用

Saba A. Farahani, Elahe Khatibi, Manoj Vishwanath, Amir M. Rahmani, Hung Cao

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 提出一种可解释的因果发现引导框架,从多模态PSG中推导层次化睡眠恢复评分(SRS),在两大队列中SRS与感知恢复的关联强度是AHI的2.5倍。

Comments 6 pages, 2 figures, 2 tables. Accepted at the 2nd Workshop on Sensing and Computing for Smart and Connected Health (SCH), co-located with IEEE/ACM CHASE 2026

详情
Abstract

Objective sleep assessment relies on polysomnography (PSG), yet clinical impact is often better reflected in patient-reported outcomes (PROs) such as sleepiness and fatigue. Existing summary indices, including the Apnea-Hypopnea Index (AHI), provide limited insight into the multidomain physiology underlying functional recovery. We propose an interpretable, causal-discovery-guided framework for deriving a hierarchical Sleep Recovery Score (SRS) from multimodal PSG. Using two large population cohorts (MESA: n=1540; MrOS: n=825), we apply directed acyclic graph (DAG) learning to identify candidate physiological drivers spanning respiratory burden, hypoxic burden, sleep fragmentation, sleep architecture, and autonomic regulation. Although derived from clinical PSG, these domains map naturally to sensing streams increasingly available in connected health technologies, including wearable ECG, oximetry, and sleep-stage estimation devices. To preserve mechanistic plausibility, we introduce a two-stage screening process that combines physiology-based constraints with constrained LLM-assisted auditing to identify and remove structural confounders and construct-overlapping variables. Across cohorts, these five domains emerge as recurrent physiological domains associated with recovery, and the resulting SRS shows up to 2.5× stronger alignment with perceived recovery than AHI. By linking multimodal sleep physiology to patient-centered outcomes through an interpretable, bias-aware, and domain-structured framework, this work provides a practical foundation for recovery modeling across both clinical sleep studies and emerging smart and connected health settings.

2606.18503 2026-06-18 cs.LG stat.ML New submission

Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

Manoranjan Gandhudi, Arunkumar V., G. R. Anil, Gangadharan G. R

Affiliations Central University of Karnataka; University College of Engineering, Anna University; AIONOS India Pvt Ltd; National Institute of Technology Tiruchirappalli

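AI summary Proposes a quantum-annealing-enhanced Q-learning framework that encodes each Q-value update as a QUBO problem and uses quantum-annealing sampling for stochastic action selection, addressing premature convergence in high-dimensional, non-convex search spaces; it outperforms the baselines on C-MAPSS and a device-fleet predictive maintenance dataset with statistically significant improvements.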

Comments 29 pages, 6 figures, 12 tables

Abstract

Remaining useful life (RUL) estimation is central to predictive maintenance, where an unplanned failure can cost far more than the asset itself. Statistical degradation models miss the strong nonlinearity of real systems, and data-driven models often converge to suboptimal solutions in high-dimensional, non-convex search spaces. We propose a Quantum Annealing enhanced Q-Learning (QAQL) framework that couples the sampling behaviour of quantum annealing with the sequential decision making of Q-learning. Each Q-value update is encoded as a small quadratic unconstrained binary optimization (QUBO) whose ground state is the greedy action; rather than acting as a deterministic optimizer, the annealer returns a distribution over near-optimal actions across many reads, and this stochastic action selection supplies the exploration that curbs premature convergence on nonlinear degradation trajectories. The QUBO is solved on the D-Wave Advantage system using minor embedding, with the annealer woven into the reinforcement-learning loop rather than bolted on after training. We validate QAQL on two public benchmarks: the NASA C-MAPSS turbofan engine datasets and a device-fleet predictive maintenance dataset. Averaged over many independent runs and across six error metrics, QAQL outperforms the classical and quantum baselines considered in this study, with statistically significant improvements. The results indicate that quantum annealing is a usable, not merely theoretical, optimizer inside a reinforcement-learning loop for industrial predictive-maintenance applications.
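The statement that a QUBO's ground state is the greedy action, while repeated reads give a distribution over near-optimal actions, can be made concrete with a small classical sketch. The one-hot encoding, the penalty weight, and the Boltzmann sampler below are assumptions chosen for illustration: the abstract does not give the authors' encoding, and the sampler is a classical stand-in for annealer reads, not the D-Wave API.

```python
import itertools
import math
import random


def build_qubo(q_values, penalty=None):
    """One-hot QUBO (hypothetical encoding): bit a = 1 means "take action a".

    Energy = -sum_a Q[a] x_a + P (sum_a x_a - 1)^2 with the constant P dropped,
    which gives diagonal terms -Q[a] - P and off-diagonal terms 2P.
    """
    n = len(q_values)
    if penalty is None:
        # Assumed choice: any P larger than max |Q| makes one-hot states optimal.
        penalty = 2.0 * max(abs(q) for q in q_values) + 1.0
    qubo = {}
    for a in range(n):
        qubo[(a, a)] = -q_values[a] - penalty
        for b in range(a + 1, n):
            qubo[(a, b)] = 2.0 * penalty
    return qubo


def energy(qubo, bits):
    """QUBO energy of a binary assignment."""
    return sum(w * bits[i] * bits[j] for (i, j), w in qubo.items())


def ground_state(qubo, n):
    """Exhaustive search over all 2^n assignments; only feasible for small n."""
    states = itertools.product((0, 1), repeat=n)
    return min(states, key=lambda bits: energy(qubo, bits))


def sample_actions(qubo, n, num_reads=100, temperature=0.5, seed=0):
    """Classical stand-in for annealer reads: Boltzmann sampling over states.

    Returns the decoded action index of every read that is a valid one-hot state.
    """
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=n))
    energies = [energy(qubo, s) for s in states]
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    reads = rng.choices(states, weights=weights, k=num_reads)
    return [s.index(1) for s in reads if sum(s) == 1]


if __name__ == "__main__":
    q_row = [0.2, 1.5, -0.3, 0.9]  # toy Q-values for one state
    qubo = build_qubo(q_row)
    print("ground state:", ground_state(qubo, len(q_row)))
    actions = sample_actions(qubo, len(q_row), num_reads=200)
    print("read counts:", {a: actions.count(a) for a in sorted(set(actions))})
```

In this sketch the lowest-energy assignment selects the action with the largest Q-value, while the sampled reads occasionally return the runner-up actions. That spread plays the role that epsilon-greedy exploration plays in standard Q-learning.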