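arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.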

2505.08222 2026-06-03 cs.RO cs.AI cs.DC cs.PF

Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

Matteo Gallici, Ivan Masmitja, Mario Martín

Affiliations: KEMLG Research Group, Universitat Politècnica de Catalunya, Barcelona, Spain; Instituto de Ciencias del Mar, Consejo Superior de Investigaciones Científicas, Barcelona, Spain; KEMLG Research Group, Universitat Politècnica de Catalunya (UPC), and the HPAI group at Barcelona Supercomputing Center (BSC), Barcelona, Spain

AI summary: Proposes a GPU-accelerated environment (up to 30,000x speedup) and a Transformer-based MARL architecture (TransfMAPPO) for underwater tracking of multiple fast-moving targets, keeping tracking error below 5 m.

Abstract

Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essential for multi-target tracking or rapidly moving targets) is challenging. Multi-Agent RL (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo's LRAUV provide up to 100x faster-than-real-time single-robot simulations, they offer little speedup in multi-vehicle scenarios, making MARL training impractical. Yet, high-fidelity simulation is crucial to test complex policies and close the sim-to-real gap. To address these limitations, we develop a GPU-accelerated environment that achieves up to 30,000x speedup over Gazebo while preserving its dynamics. This enables fast, end-to-end GPU training and seamless transfer to Gazebo for evaluation. We also introduce a Transformer-based architecture (TransfMAPPO) that learns policies invariant to fleet size and number of targets, enabling curriculum learning to train larger fleets on increasingly complex scenarios. After large-scale GPU training, we perform extensive evaluations in Gazebo, showing our method maintains tracking errors below 5m even with multiple fast-moving targets.

2510.13565 2026-06-03 cs.CV

XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation

Huawei Sun, Zixu Wang, Xiangyuan Peng, Julius Ott, Georg Stettinger, Lorenzo Servadei, Robert Wille

Affiliations: Technical University of Munich; Infineon Technologies AG

AI summary: Proposes XD-RCDepth, a lightweight radar-camera depth estimation architecture that uses explainability-aligned distillation and depth-distribution distillation to cut parameters by 29.7% while maintaining accuracy, achieving real-time performance on the nuScenes and ZJU-4DRadarCam datasets.

2506.09398 2026-06-03 cs.LG physics.comp-ph

Efficient Prediction of SO(3)-Equivariant Hamiltonian Matrices via SO(2) Local Frames

Haiyang Yu, Yuchao Lin, Xuan Zhang, Xiaofeng Qian, Shuiwang Ji

Affiliations: National University of Singapore

AI summary: Proposes QHNetV2, a network that uses SO(2) local frames and SO(2)-equivariant operations to achieve global SO(3) equivariance, avoiding costly SO(3) tensor products and predicting Hamiltonian matrices efficiently.

Comments: Code available at: https://github.com/divelab/AIRS

Abstract

We consider the task of predicting Hamiltonian matrices to accelerate electronic structure calculations, which plays an important role in physics, chemistry, and materials science. Motivated by the inherent relationship between the off-diagonal blocks of the Hamiltonian matrix and the SO(2) local frame, we propose a novel and efficient network, called QHNetV2, that achieves global SO(3) equivariance without the costly SO(3) Clebsch-Gordan tensor products. This is achieved by introducing a set of new efficient and powerful SO(2)-equivariant operations and performing all off-diagonal feature updates and message passing within SO(2) local frames, thereby eliminating the need for SO(3) tensor products. Moreover, a continuous SO(2) tensor product is performed within the SO(2) local frame at each node to fuse node features, mimicking the symmetric contraction operation. Extensive experiments on the large QH9 and MD17 datasets demonstrate that our model achieves superior performance across a wide range of molecular structures and trajectories, highlighting its strong generalization capability. The proposed SO(2) operations on SO(2) local frames offer a promising direction for scalable and symmetry-aware learning of electronic structures. Our code will be released as part of the AIRS library at https://github.com/divelab/AIRS.

2510.09711 2026-06-03 cs.CL cs.AI

ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li, Zirui Chen

Affiliations: Tianjin University; The University of Manchester

AI summary: Proposes the ReaLM framework, which uses residual vector quantization to discretize knowledge graph embeddings into learnable tokens added to the LLM vocabulary and combines them with ontology constraints to semantically align structured knowledge with language models, achieving state-of-the-art performance on knowledge graph completion.

Abstract

Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
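
As a rough illustration of residual vector quantization in general (not ReaLM's actual implementation), the sketch below turns a continuous vector into a short sequence of code indices: each stage picks the nearest codeword and passes the leftover residual to the next stage. The toy codebooks and the names `rvq_encode` and `rvq_decode` are hypothetical; in a real system the codebooks would be learned.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(v: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize v stage by stage; each stage encodes the previous stage's residual."""
    residual = list(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct an approximation by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a fine stage.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)],
]

codes = rvq_encode((1.2, 0.9), CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
print(codes, approx)
```

The resulting index sequence is the kind of compact discrete code that could, in principle, be mapped to new vocabulary entries; how ReaLM does that mapping is not specified here.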

2510.08977 2026-06-03 cs.LG cs.CL

Breaking the Self-Confirming Loop: Diagnosing and Mitigating Systemic Reward Bias in Self-Rewarding RL

Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

Affiliations: University of Science and Technology of China

AI summary: Quantifies feedback-loop bias and proposes reinforcement learning with ensembled rewards (RLER) to diagnose and mitigate the systemic reward bias caused by confidence coupling in self-rewarding RL, improving performance and stability.

Abstract

Reinforcement learning with verifiable rewards (RLVR) efficiently scales the reasoning ability of large language models (LLMs) but is bottlenecked by scarce labeled data. Reinforcement learning with intrinsic rewards (RLIR) offers a scalable alternative via self-rewarding, yet often suffers from instability and inferior performance. We trace this gap to a systemic bias in confidence-coupled self-rewarding: the model tends to over-reward high-confidence mistakes, forming a self-confirming loop. We quantify this feedback-loop bias with three metrics: reward noise magnitude (rho_noise), policy-reward coupling (rho_selfbias), and over-/under-reward skew (rho_symbias). Our analyses show a compounding effect where strong coupling amplifies confidence-conditioned errors and drives a drift toward over-reward, leading to instability and a lower performance ceiling. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models with adaptive reward interpolation and disagreement-aware rollout selection to reduce coupling and suppress over-reward drift. Extensive experiments show that RLER improves by 6.2% over the best RLIR baseline and is within 3.6% of RLVR, while exhibiting stable scaling on unlabeled samples.

2510.03316 2026-06-03 cs.CV cs.AI cs.LG

The View From Space: Navigating Instrumentation Differences with EOFMs

Ryan P. Demilt, Nicholas LaHaye, Karis Tenneson

Affiliations: Spatial Informatics Group

AI summary: Analyzes the sensitivity of Earth Observation Foundation Models (EOFMs) to sensor architecture, exposing pitfalls in current model design and pointing to ways forward for model developers, users, and the remote-sensing science community.

Journal ref: https://neurips.cc/virtual/2025/loc/san-diego/122891

Abstract

Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' which summarize high dimensional data to be used for generic tasks such as similarity search and content-specific queries. However, most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals for how to move forward as model developers, users, and a community guided by robust remote-sensing science.

2509.26169 2026-06-03 cs.LG

Alignment-Aware Decoding

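Frédéric Berdoz, Luca A. Lanzendörfer, René Caky, Roger Wattenhofer

Affiliations: EPFL, Switzerland

AI summary: Proposes Alignment-Aware Decoding (AAD), an inference-time method for improving model alignment that can be interpreted as implicit reward optimization and requires no additional training. It outperforms strong baselines across benchmarks and model scales and can generate synthetic data to improve alignment in data-limited settings.

Comments: Accepted at ICML 2026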

2509.22854 2026-06-03 cs.CL

Train Once, Reuse Everywhere: Generalizable Implicit In-Context Learning by Routing Attention

Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang

Affiliations: University of Science and Technology of China

AI summary: Proposes In-Context Routing (ICR), which captures generalizable in-context learning patterns at the attention-logit level and modulates attention logits with a learnable input-conditioned router, yielding an efficient train-once-and-reuse framework that outperforms existing implicit ICL methods on 12 datasets and generalizes strongly.

Comments: ICML 2026 Camera-ready

Abstract

Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of large language models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that captures and utilizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling an efficient train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms existing implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where they struggle. These findings position ICR to push the boundary of the practical value of ICL. The code is available at https://github.com/Lijiaqian1/In-Context-Routing.git.

2509.22468 2026-06-03 cs.LG cs.AI

Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining

Boshra Ariguib, Mathias Niepert, Andrei Manolache

Affiliations: University of Tübingen

AI summary: Proposes C-FREE, a framework that predicts subgraph embeddings from their complementary neighborhoods, fusing 2D topology with 3D conformer information for contrast-free, negative-free multimodal self-supervised molecular graph pretraining, with state-of-the-art results on MoleculeNet.

Comments: Accepted at ICML 2026

Abstract

High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.

2505.17659 2026-06-03 cs.RO cs.CV

Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

Xiaolong Tang, Meina Kan, Shiguang Shan, Xilin Chen

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning and combines rule-based rewards with Variance-Decoupled GRPO, significantly improving the safety and feasibility of autonomous driving planning.

Comments: Accepted by ICLR 2026

Abstract

Safe and feasible trajectory planning is critical for real-world autonomous driving systems. However, existing learning-based planners rely heavily on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting undesirable behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. In the first stage, a general trajectory predictor is pre-trained on expert data to capture diverse, human-like driving behaviors. In the second stage, the model is fine-tuned with rule-based rewards using Group Relative Policy Optimization (GRPO), explicitly aligning ego planning with principles such as safety, comfort, and traffic rule compliance. This two-stage paradigm retains human-like behaviors while enhancing safety awareness and discarding undesirable patterns from demonstrations. Furthermore, we identify a key limitation of directly applying GRPO to planning: group-wise normalization erases cross-group scale differences, causing rare, high-variance safety-violation groups to have similar advantages as abundant low-variance safe groups, thereby suppressing optimization for safety-critical objectives. To address this, we propose Variance-Decoupled GRPO (VD-GRPO), which replaces normalization with centering and fixed scaling to preserve absolute reward magnitudes, ensuring that safety-critical objectives remain dominant throughout training. Experiments on the nuPlan benchmark demonstrate that Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance, particularly in realistic reactive settings. Our code is available at https://github.com/XiaolongTang23/Plan-R1.
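
The contrast the abstract draws between group-wise normalization and centering with a fixed scale can be sketched numerically. This is a minimal illustration under stated assumptions, not the paper's code: the reward values, the `scale` constant, the use of the population standard deviation, and both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-wise normalization: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def centered_advantages(rewards: List[float], scale: float = 1.0) -> List[float]:
    """Centering with a fixed scale: the absolute reward magnitude is preserved."""
    mu = mean(rewards)
    return [(r - mu) / scale for r in rewards]


# Hypothetical groups: small comfort differences vs. a large safety penalty.
safe_group = [1.0, 0.9, 1.0, 0.9]
violation_group = [1.0, -9.0, 1.0, -9.0]

norm_safe = normalized_advantages(safe_group)
norm_viol = normalized_advantages(violation_group)
cent_safe = centered_advantages(safe_group)
cent_viol = centered_advantages(violation_group)

print("normalized:", norm_safe, norm_viol)
print("centered:  ", cent_safe, cent_viol)
```

With normalization, both groups end up with advantages of roughly plus or minus one, so the safety violation weighs no more than a minor comfort difference. With centering and a fixed scale, the violation group's advantages stay about a hundred times larger in this toy setup.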

2502.02748 2026-06-03 cs.LG cond-mat.mtrl-sci

ReciNet: Reciprocal Space-Aware Long-Range Modeling for Crystalline Property Prediction

Jianan Nie, Peiyao Xiao, Kaiyi Ji, Peng Gao

Affiliations: Department of Computer Science, Virginia Tech; Department of Computer Science and Engineering, University at Buffalo

AI summary: Proposes ReciNet, a reciprocal space-based geometry network that combines geometric GNNs with reciprocal blocks, using a Fourier series representation and learnable filters to model short-range and long-range interactions in crystals, with excellent predictive accuracy on multiple benchmarks.

Abstract

Predicting properties of crystals from their structures is a fundamental yet challenging task in materials science. Unlike molecules, crystal structures exhibit infinite periodic arrangements of atoms, requiring methods capable of capturing both local and global information effectively. However, current works fall short of capturing long-range interactions within periodic structures. To address this, we leverage reciprocal space, the natural domain for periodic crystals, and construct a Fourier series representation from fractional coordinates and reciprocal lattice vectors with learnable filters. Building on this, we introduce the reciprocal space-based geometry network (ReciNet), a novel architecture that integrates geometric GNNs and reciprocal blocks to model short-range and long-range interactions. Experiments on comprehensive benchmarks JARVIS, Materials Project, and MatBench demonstrate that ReciNet achieves outstanding predictive accuracy across a range of crystal property prediction tasks. Additionally, we explore a model extension for multi-property prediction with the mixture-of-experts, which demonstrates high computational efficiency and reveals positive transfer between correlated properties. These findings highlight the potential of our model as a scalable and accurate solution for crystal property prediction.
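
To make the idea of a Fourier representation over fractional coordinates concrete, the sketch below computes structure-factor-style coefficients for integer reciprocal-lattice indices and scales each one by a per-frequency weight. It is only an assumption-laden illustration: the weights stand in for learnable filters, the function names are hypothetical, and none of this is ReciNet's actual architecture.

```python
import cmath
import math
from typing import Dict, Sequence, Tuple

Frac = Tuple[float, float, float]  # fractional coordinates of one atom
Index = Tuple[int, int, int]       # integer reciprocal-lattice indices


def fourier_coefficient(frac_coords: Sequence[Frac], k: Index) -> complex:
    """Sum exp(-2*pi*i * k.f) over atoms; unchanged by whole-lattice shifts."""
    return sum(
        cmath.exp(-2j * math.pi * sum(ki * fi for ki, fi in zip(k, f)))
        for f in frac_coords
    )


def filtered_features(
    frac_coords: Sequence[Frac], weights: Dict[Index, float]
) -> Dict[Index, complex]:
    """Scale each coefficient by a per-frequency weight (a stand-in for a learned filter)."""
    return {k: w * fourier_coefficient(frac_coords, k) for k, w in weights.items()}


# Two-atom body-centred cell, plus the same cell with one atom image shifted
# by a full lattice vector along the first axis.
bcc = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
shifted = [(f[0] + 1.0, f[1], f[2]) for f in bcc]

WEIGHTS = {(1, 0, 0): 0.5, (1, 1, 0): 1.0}  # hypothetical filter values
features = filtered_features(bcc, WEIGHTS)
features_shifted = filtered_features(shifted, WEIGHTS)
print(features)
```

Because the exponent only depends on the fractional coordinate modulo one, the features are identical for the shifted cell, which is the periodicity property that makes reciprocal space a natural fit for crystals.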

2509.20623 2026-06-03 cs.RO

Latent Activation Editing: Inference-Time Refinement of Learned Policies for Safer Multirobot Navigation

Satyajeet Das, Darren Chiu, Zhehui Huang, Lars Lindemann, Gaurav S. Sukhatme

Affiliations: Department of Computer Science, University of Southern California; Automatic Control Laboratory, ETH Zürich

AI summary: Proposes the Latent Activation Editing (LAE) framework, which detects and edits intermediate activations online at inference time to lower the collision rate of pre-trained policies without modifying weights or architecture, achieving nearly 90% fewer collisions in quadrotor navigation.

Abstract

Reinforcement learning has enabled significant progress in complex domains such as coordinating and navigating multiple quadrotors. However, even well-trained policies remain vulnerable to collisions in obstacle-rich environments. Addressing these infrequent but critical safety failures through retraining or fine-tuning is costly and risks degrading previously learned skills. Inspired by activation steering in large language models and latent editing in computer vision, we introduce a framework for inference-time Latent Activation Editing (LAE) that refines the behavior of pre-trained policies without modifying their weights or architecture. The framework operates in two stages: (i) an online classifier monitors intermediate activations to detect states associated with undesired behaviors, and (ii) an activation editing module selectively modifies flagged activations to shift the policy towards safer regimes. In this work, we focus on improving safety in multi-quadrotor navigation. We hypothesize that amplifying a policy's internal perception of risk can induce safer behaviors. We instantiate this idea through a latent collision world model trained to predict future pre-collision activations, thereby prompting earlier and more cautious avoidance responses. Extensive simulations and real-world Crazyflie experiments demonstrate that LAE achieves a statistically significant reduction in collisions (nearly 90% fewer cumulative collisions compared to the unedited baseline) and substantially increases the fraction of collision-free trajectories, while preserving task completion. More broadly, our results establish LAE as a lightweight paradigm, feasible on resource-constrained hardware, for post-deployment refinement of learned robot policies.

2509.18068 2026-06-03 cs.RO eess.SP

RadarSFD: Single-Frame Diffusion with Pretrained Priors for Radar Point Clouds

Bin Zhao, Nakul Garg

Affiliations: Rice University

AI summary: Proposes RadarSFD, a conditional latent diffusion framework that uses geometric priors from a pretrained monocular depth estimator to reconstruct dense LiDAR-like point clouds from a single radar frame, without synthetic aperture or multi-frame aggregation.

Comments: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026). Project page: https://phi-lab-rice.github.io/RadarSFD/

Abstract

Millimeter-wave radar provides robust perception in fog, smoke, dust, and low light, making it attractive for size-, weight-, and power-constrained robotic platforms. Existing radar imaging methods typically rely on synthetic aperture or multi-frame aggregation to improve resolution, which is impractical for small aerial, inspection, or wearable systems. We present RadarSFD, a conditional latent diffusion framework that reconstructs dense LiDAR-like point clouds from a single radar frame without motion or SAR. Our approach transfers geometric priors from a pretrained monocular depth estimator into the diffusion backbone, anchors them to radar inputs via channel-wise latent concatenation, and regularizes outputs with a dual-space objective combining latent and pixel-space losses. On the RadarHD benchmark, RadarSFD achieves state-of-the-art performance against baseline models. Qualitative results show recovery of fine walls and narrow gaps, and experiments across new environments confirm strong generalization. Ablation studies highlight the importance of pretrained initialization, radar BEV conditioning, and the dual-space loss. Together, these results establish a practical single-frame, no-SAR mmWave radar pipeline for dense point cloud perception in compact robotic systems.

2509.14636 2026-06-03 cs.RO

BEV-ODOM2: Enhanced BEV-based Monocular Visual Odometry with PV-BEV Fusion and Dense Flow Supervision for Ground Robots

Yufei Wei, Chenxiao Hu, Wangtao Lu, Sha Lu, Yuxiang Cui, Fuzhang Han, Rong Xiong, Yue Wang

Affiliations: Tsinghua University

AI summary: Targets the sparse supervision of pose-only training and the information loss of perspective projection in existing BEV methods. Proposes BEV-ODOM2, which uses dense BEV optical flow supervision and PV-BEV fusion to achieve a 40% RTE improvement across four datasets and supports real-time edge deployment.

Abstract

Scale-consistent ego-motion estimation is fundamental for autonomous ground robots. Bird's-Eye-View (BEV) representation naturally addresses the scale drift problem of monocular visual odometry (MVO) by providing a metric-scaled planar workspace, enabling the simplification of 6-DoF ego-motion to a more robust 3-DoF model. However, existing BEV-based methods suffer from two key limitations: sparse supervision signals from pose-only training, and information loss during perspective-to-BEV projection. We present BEV-ODOM2, an enhanced framework that addresses both limitations without requiring additional annotations. Our approach introduces (1) dense BEV optical flow supervision constructed directly from 3-DoF pose ground truth for pixel-level guidance, and (2) Perspective View (PV)-BEV fusion that computes correlation volumes before projection to preserve 6-DoF motion cues. An enhanced rotation sampling strategy further balances diverse motion patterns during training. We evaluate on four datasets with varied spatial scales: KITTI, Oxford, NCLT, and our newly collected ZJH-VO benchmark. BEV-ODOM2 achieves a 40% RTE improvement over prior BEV-based methods, with real-time inference on an NVIDIA Jetson AGX Orin confirming edge deployment feasibility. The source code and the ZJH-VO dataset are publicly released to facilitate future research.
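
One plausible way to build a dense BEV flow field from a 3-DoF pose change is sketched below, assuming a static ground plane: every grid cell is re-expressed in the next ego frame and the displacement is taken as its flow. The frame conventions (pose of the next frame given in the current frame, coordinates in metres) and the function name `bev_flow` are assumptions for illustration, not the paper's implementation.

```python
import math
from typing import List, Sequence, Tuple

Flow = Tuple[float, float]


def bev_flow(
    dx: float, dy: float, dyaw: float, xs: Sequence[float], ys: Sequence[float]
) -> List[List[Flow]]:
    """Per-cell flow of static points when the ego frame moves by (dx, dy, dyaw).

    (dx, dy, dyaw) is the pose of the next ego frame expressed in the current
    one; a static point p then appears at R(dyaw)^T (p - t) in the next frame.
    """
    c, s = math.cos(dyaw), math.sin(dyaw)
    field = []
    for y in ys:
        row = []
        for x in xs:
            px, py = x - dx, y - dy      # remove the ego translation
            nx = c * px + s * py         # rotate by -dyaw (apply R^T)
            ny = -s * px + c * py
            row.append((nx - x, ny - y))
        field.append(row)
    return field


# Driving 1 m straight ahead: every static cell appears to move 1 m backwards.
forward = bev_flow(1.0, 0.0, 0.0, xs=[0.0, 2.0], ys=[0.0])
# Turning 90 degrees on the spot: the cell at (1, 0) ends up at (0, -1).
turn = bev_flow(0.0, 0.0, math.pi / 2, xs=[1.0], ys=[0.0])
print(forward, turn)
```

A field like this gives one supervision target per BEV cell from a single pose label, which is the sense in which the supervision becomes dense without extra annotation.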

2507.09105 2026-06-03 cs.CV

Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production

Maoxiao Ye, Xinfeng Ye, Mano Manoharan

Affiliations: University of Auckland

AI summary: Proposes HybridSign, a hybrid autoregressive-diffusion model that combines causal frame generation with flow-based diffusion refinement for low-latency, high-quality sign language production, achieving the best quality-efficiency trade-off on PHOENIX14T and How2Sign.

Comments: Accepted at ACL 2026

Abstract

Earlier Sign Language Production (SLP) models typically relied on autoregressive decoding, which naturally preserves temporal causality but suffers from error accumulation at inference time. More recent diffusion-based approaches improve generation quality through iterative denoising, yet their sequence-level refinement process introduces substantial latency. To address this trade-off, we propose HybridSign, a hybrid autoregressive-diffusion model for low-latency sign language production that combines causal frame generation with flow-based diffusion refinement. A Multi-Scale Pose Representation module captures fine-grained articulator features, while a Confidence-Aware Causal Attention mechanism leverages joint-level confidence scores to improve robustness under noisy 2D pose observations. Experiments on PHOENIX14T and How2Sign show that HybridSign consistently achieves the best quality-efficiency trade-off among the compared baselines. On the How2Sign test split, it reaches BLEU-1/4 scores of 30.12/6.48 and DTW of 3.89, while reducing time-to-first-frame to 5.90s and increasing throughput to 10.17 FPS under a 60-frame evaluation protocol.

2507.23035 2026-06-03 cs.LG cs.AR

OASIS: Outlier-Aware LUT-Based GEMM with Dual-Side Quantization for LLM Inference Acceleration

Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Li

Affiliations: National University of Singapore

AI summary: Proposes the OASIS architecture, which uses pre-computed Cartesian-product lookup tables for efficient GEMM between non-uniformly quantized weights and activations and, with an outlier-aware quantization scheme and the real-time outlier detection engine Orizuru, significantly improves inference speed and energy efficiency while preserving accuracy.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across a wide range of applications, but demand substantial memory and compute resources during inference. Existing quantization methods expose a trade-off between efficiency and accuracy: weight-only quantization (WOQ) incurs costly dequantization overheads, while integer weight-and-activation quantization (INT-WAQ) reduces precision and degrades model quality. Non-uniform weight-and-activation quantization (NU-WAQ) can better capture the non-uniform distributions of LLM weights and activations, yet remains incompatible with conventional low-precision compute units. This paper presents OASIS, a lookup table (LUT)-based architecture that enables efficient general matrix multiplication (GEMM) between non-uniformly quantized weights and activations without requiring dequantization. OASIS employs pre-computed Cartesian Product LUTs, achieving a 64x reduction in LUT size and enabling a 1024x higher computational parallelism over existing LUT-based GEMM methods. To preserve accuracy under aggressive activation quantization, OASIS introduces an outlier-aware quantization scheme with concurrent LUT-based GEMM and error compensation for outliers. Furthermore, we design Orizuru, an efficient top-k detection engine for real-time activation outlier identification. According to extensive evaluations, OASIS incurs an average accuracy drop of only 1.98% compared to the FP16 baseline, which is 5.18% lower than Atom. On the hardware side, OASIS achieves an average 3.00x speedup and a 1.44x energy efficiency improvement compared to the FIGLUT accelerator.
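
The basic idea of a lookup-table GEMM over non-uniformly quantized operands can be sketched in a few lines: precompute every product of a weight level and an activation level once, then accumulate table entries selected by the quantization indices, so nothing is dequantized in the inner loop. The level values and function names below are hypothetical, and the sketch does not attempt to reproduce OASIS's table compression, parallelism, or outlier handling.

```python
from typing import List, Sequence

# Hypothetical non-uniform 2-bit codebooks for weights and activations.
W_LEVELS = [-1.0, -0.25, 0.25, 1.0]
A_LEVELS = [0.0, 0.5, 1.0, 4.0]


def build_product_lut(
    w_levels: Sequence[float], a_levels: Sequence[float]
) -> List[List[float]]:
    """Cartesian-product table: lut[i][j] = w_levels[i] * a_levels[j]."""
    return [[w * a for a in a_levels] for w in w_levels]


def lut_gemm(
    w_idx: Sequence[Sequence[int]], a_idx: Sequence[Sequence[int]], lut
) -> List[List[float]]:
    """Multiply an M x K index matrix by a K x N index matrix using table lookups only."""
    k, n = len(a_idx), len(a_idx[0])
    return [
        [sum(lut[row[t]][a_idx[t][j]] for t in range(k)) for j in range(n)]
        for row in w_idx
    ]


LUT = build_product_lut(W_LEVELS, A_LEVELS)
W_IDX = [[0, 3], [2, 1]]  # quantization indices of a 2 x 2 weight matrix
A_IDX = [[3, 0], [1, 2]]  # quantization indices of a 2 x 2 activation matrix
result = lut_gemm(W_IDX, A_IDX, LUT)

# Reference: dequantize first, then multiply in floating point.
reference = [
    [sum(W_LEVELS[row[t]] * A_LEVELS[A_IDX[t][j]] for t in range(2)) for j in range(2)]
    for row in W_IDX
]
print(result)
```

For b-bit weights and b-bit activations the full table has 2^b times 2^b entries, which is why table size becomes a central design concern as bit widths grow.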


2509.03376 2026-06-03 cs.CV

Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing

Transformer引导的内容自适应图学习用于高光谱解混

Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu

发表机构 * School of Automation Engineering, Shanghai University of Electric Power（上海电力大学自动化工程学院）； School of Mechatronic Engineering and Automation, Shanghai University（上海大学机电工程与自动化学院）； School of Intelligent Systems Engineering, Sun Yat-sen University（中山大学智能系统工程学院）

AI总结提出T-CAGU框架，结合Transformer捕获全局依赖和内容自适应图神经网络增强局部关系，通过多阶传播动态学习图结构并引入图残差机制，实现高光谱图像的高效解混。

详情

AI中文摘要

高光谱解混（HU）旨在将遥感图像中的每个混合像素分解为一组端元及其对应的丰度。尽管深度学习在该领域取得了显著进展，但大多数方法无法同时表征全局依赖和局部一致性，难以保持长程交互和边界细节。本文提出了一种新颖的Transformer引导的内容自适应图解混框架（T-CAGU），通过采用Transformer捕获全局依赖并引入内容自适应图神经网络增强局部关系，克服了这些挑战。与以往工作不同，T-CAGU集成多个传播阶次以动态学习图结构，确保对噪声的鲁棒性。此外，T-CAGU利用图残差机制保留全局信息并稳定训练。实验结果表明其优于最先进的方法。我们的代码可在https://github.com/xianchaoxiu/T-CAGU获取。

英文摘要

Hyperspectral unmixing (HU) aims to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over state-of-the-art methods. Our code is available at https://github.com/xianchaoxiu/T-CAGU.


2508.13174 2026-06-03 cs.AI cs.LG q-fin.CP stat.ML

AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining

AlphaEval：一个全面高效的公式化Alpha挖掘评估框架

Hongjun Ding, Binqi Chen, Jinsheng Huang, Taian Guo, Zhengyang Mao, Guoyi Shao, Lutong Zou, Luchen Liu, Ming Zhang

发表机构 * CUNY Baruch College（纽约市立大学巴鲁克学院）； Peking University（北京大学）； Harvard University（哈佛大学）； Zhengren Research（正人研究所）； Zhengren Quant（正人量化）

AI总结提出AlphaEval框架，通过五个维度（预测能力、稳定性、鲁棒性、金融逻辑、多样性）对自动Alpha挖掘模型进行统一、可并行化且无需回测的评估，实现与回测相当的评估一致性并提高效率。

Comments Accepted by KDD2026

详情

DOI: 10.1145/3770855.3817727

AI中文摘要

公式化Alpha挖掘从金融数据中生成预测信号，对量化投资至关重要。尽管遗传编程、强化学习和大语言模型等多种算法方法显著扩展了Alpha发现的能力，但系统评估仍是一个关键挑战。现有评估指标主要包括回测和基于相关性的度量。回测计算密集、本质上是顺序的，并且对特定策略参数敏感。基于相关性的度量虽然高效，但仅评估预测能力，忽略了时间稳定性、鲁棒性、多样性和可解释性等其他关键属性。此外，大多数现有Alpha挖掘模型的闭源性质阻碍了可重复性并减缓了该领域的进展。为解决这些问题，我们提出了AlphaEval，一个统一、可并行化且无需回测的自动Alpha挖掘模型评估框架。AlphaEval沿五个互补维度评估生成Alpha的整体质量：预测能力、稳定性、对市场扰动的鲁棒性、金融逻辑和多样性。跨代表性Alpha挖掘算法的广泛实验表明，AlphaEval实现了与全面回测相当的评估一致性，同时提供更全面的洞察和更高的效率。此外，与传统的单一指标筛选方法相比，AlphaEval能有效识别更优的Alpha。所有实现和评估工具均已开源，以促进可重复性和社区参与。

英文摘要

Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various algorithmic approaches, such as genetic programming, reinforcement learning, and large language models, have significantly expanded the capacity for alpha discovery, systematic evaluation remains a key challenge. Existing evaluation metrics predominantly include backtesting and correlation-based measures. Backtesting is computationally intensive, inherently sequential, and sensitive to specific strategy parameters. Correlation-based metrics, though efficient, assess only predictive ability and overlook other crucial properties such as temporal stability, robustness, diversity, and interpretability. Additionally, the closed-source nature of most existing alpha mining models hinders reproducibility and slows progress in this field. To address these issues, we propose AlphaEval, a unified, parallelizable, and backtest-free evaluation framework for automated alpha mining models. AlphaEval assesses the overall quality of generated alphas along five complementary dimensions: predictive power, stability, robustness to market perturbations, financial logic, and diversity. Extensive experiments across representative alpha mining algorithms demonstrate that AlphaEval achieves evaluation consistency comparable to comprehensive backtesting, while providing more comprehensive insights and higher efficiency. Furthermore, AlphaEval effectively identifies superior alphas compared to traditional single-metric screening approaches. All implementations and evaluation tools are open-sourced to promote reproducibility and community engagement.
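
The correlation-based measures mentioned in the abstract can be illustrated with a rank information coefficient (rank IC), a common way to score predictive power without a backtest. The sketch below is an assumption about what such a metric looks like, not AlphaEval's implementation; the function names are hypothetical and ties in the inputs are not handled.

```python
def ranks(values):
    """Return the 0-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_ic(alpha_values, forward_returns):
    """Spearman-style rank correlation between alpha scores and next-period returns."""
    return pearson(ranks(alpha_values), ranks(forward_returns))
```

Each cross-section (one date) can be scored independently of the others, which is what makes a metric of this kind easy to parallelize compared with a sequential backtest.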


2508.09606 2026-06-03 cs.RO cs.SY eess.SY

BEAVR: Bimanual, multi-Embodiment, Accessible, Virtual Reality Teleoperation System for Robots

BEAVR：用于机器人的双手、多形态、可访问的虚拟现实遥操作系统

Alejandro Posadas-Nava, Alejandro Carrasco, Richard Linares

发表机构 * Department of Aeronautics and Astronautics, Massachusetts Institute of Technology（航空与航天系，麻省理工学院）

AI总结提出BEAVR，一个开源的双手多形态VR遥操作系统，通过零拷贝流式架构和异步“思考-行动”控制循环，实现低延迟、多机器人实时控制与数据记录，并兼容多种视觉运动策略。

Comments Accepted for presentation at ICCR Kyoto 2025

详情

DOI: 10.1109/ICCR67607.2025.11372114

AI中文摘要

BEAVR是一个用于机器人的开源、双手、多形态虚拟现实（VR）遥操作系统，旨在统一异构机器人平台上的实时控制、数据记录和策略学习。BEAVR使用商用VR硬件实现实时、灵巧的遥操作，支持从7自由度机械臂到全身人形机器人的模块化集成，并直接以LeRobot数据集模式记录同步的多模态演示。我们的系统具有零拷贝流式架构，实现≤35毫秒延迟，一个用于可扩展推理的异步“思考-行动”控制循环，以及一个针对实时多机器人操作优化的灵活网络API。我们在多种操作任务上对BEAVR进行基准测试，并展示其与领先的视觉运动策略（如ACT、DiffusionPolicy和SmolVLA）的兼容性。所有代码公开可用，数据集发布在Hugging Face上。代码、数据集和VR应用可在https://github.com/ARCLab-MIT/BEAVR-Bot获取。

英文摘要

BEAVR is an open-source, bimanual, multi-embodiment Virtual Reality (VR) teleoperation system for robots, designed to unify real-time control, data recording, and policy learning across heterogeneous robotic platforms. BEAVR enables real-time, dexterous teleoperation using commodity VR hardware, supports modular integration with robots ranging from 7-DoF manipulators to full-body humanoids, and records synchronized multi-modal demonstrations directly in the LeRobot dataset schema. Our system features a zero-copy streaming architecture achieving ≤35 ms latency, an asynchronous "think-act" control loop for scalable inference, and a flexible network API optimized for real-time, multi-robot operation. We benchmark BEAVR across diverse manipulation tasks and demonstrate its compatibility with leading visuomotor policies such as ACT, DiffusionPolicy, and SmolVLA. All code is publicly available, and datasets are released on Hugging Face. Code, datasets, and VR app are available at https://github.com/ARCLab-MIT/BEAVR-Bot.


2508.05852 2026-06-03 cs.CV

Interpretable Modeling of Driver Attention Shifts with a Vision-Language Model

基于视觉-语言模型的驾驶员注意力转移可解释建模

Kaiser Hamid, Khandakar Ashrafi Akbar, Peihang Li, Nade Liang

发表机构 * Texas Tech University（德克萨斯理工大学）； Towson University（托森大学）

AI总结本研究通过少量人工监督微调视觉-语言模型，生成可解释的驾驶员注意力转移描述，以补充传统注视热图，提升人因分析、监控和态势感知支持。

详情

AI中文摘要

驾驶员注视通常被建模为空间热图，但热图本身难以解释，因为它们不说明正在监控哪个道路对象或区域，也不说明注意力转移为何重要。本研究探讨了最小的人工监督是否能够引导视觉-语言模型生成驾驶员注意力转移的可解释描述。利用Berkeley DeepDrive-Attention数据集中选定的高变化注视时刻，我们比较了零样本、单样本和LoRA微调VLM条件与人工精炼参考描述和专家评分。结果表明，使用80个专家精炼的注意力示例进行微调，相对于未引导的VLM输出，提高了ROUGE-L、METEOR、实体对齐F1和人类对齐分数。研究结果表明，基于语言的描述可以通过使驾驶员注意力更易于人因分析、驾驶员监控审查和态势感知支持来补充注视热图。

英文摘要

Driver gaze is commonly modeled as a spatial heatmap, but heatmaps alone are difficult for humans to interpret because they do not explain which road object or region is being monitored or why an attention shift may matter. This study examines whether minimal human-grounded supervision can steer a vision-language model toward interpretable descriptions of driver attention shifts. Using selected high-change gaze moments from the Berkeley DeepDrive-Attention dataset, we compare zero-shot, one-shot, and LoRA fine-tuned VLM conditions against human-refined reference descriptions and expert ratings. Results show that fine-tuning with 80 expert-refined attention examples improves ROUGE-L, METEOR, Entity Alignment F1, and Human Alignment Score relative to unsteered VLM outputs. The findings suggest that language-based descriptions can complement gaze heatmaps by making driver attention more accessible for human-factors analysis, driver-monitoring review, and situation-awareness support.


2508.03098 2026-06-03 cs.CL

Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

隐私感知解码：缓解检索增强生成中大语言模型的隐私泄露

Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu

发表机构 * Emory University（埃默里大学）； Illinois Institute of Technology（伊利诺伊理工学院）

AI总结提出一种轻量级推理时防御方法PAD，通过在解码过程中注入校准高斯噪声，结合置信度筛选、敏感度估计和上下文感知噪声校准，在RAG系统中平衡隐私保护与生成质量，并利用Rényi差分隐私跟踪累积隐私损失。

详情

AI中文摘要

检索增强生成（RAG）通过将输出条件化于外部知识源来增强大语言模型（LLM）的事实准确性。然而，当检索涉及私人或敏感数据时，RAG系统容易受到提取攻击，从而通过生成的响应泄露机密信息。我们提出隐私感知解码（PAD），一种轻量级的推理时防御方法，在生成过程中自适应地将校准的高斯噪声注入到token logits中。PAD集成了基于置信度的筛选以选择性保护高风险token、高效的敏感度估计以最小化不必要的噪声，以及上下文感知的噪声校准以平衡隐私与生成质量。Rényi差分隐私（RDP）会计机制严格跟踪累积隐私损失，从而为敏感输出提供明确的每响应$(\varepsilon, \delta)$-DP保证。与需要重新训练或语料库级过滤的先前方法不同，PAD是模型无关的，并且完全在解码时运行，计算开销极小。在三个真实世界数据集上的实验表明，PAD在保持响应实用性的同时显著减少了私有信息泄露，优于现有的基于检索和后处理的防御方法。我们的工作通过解码策略在缓解RAG中的隐私风险方面迈出了重要一步，为敏感领域中的通用和可扩展隐私解决方案铺平了道路。我们的代码可在https://github.com/wang2226/PAD获取。

英文摘要

Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A Rényi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available at https://github.com/wang2226/PAD.
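
The decoding-time mechanism described above can be sketched in a few lines: screen each step by confidence, add Gaussian noise to the logits of screened steps, and account for the cumulative privacy loss with the standard Rényi DP bound for the Gaussian mechanism. This is a simplified illustration, not the PAD code from the repository. In particular, treating a highly peaked distribution as "high risk", the fixed `sensitivity` parameter, and every function name here are assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_logits(logits, sigma, confidence_threshold=0.9, rng=None):
    """Add N(0, sigma^2) noise to every logit, but only on high-confidence steps.

    Assumption for this sketch: a very peaked distribution marks a high-risk token.
    """
    rng = rng or random.Random()
    if max(softmax(logits)) < confidence_threshold:
        return list(logits)  # screened out: step left untouched
    return [x + rng.gauss(0.0, sigma) for x in logits]


def rdp_gaussian(alpha, sensitivity, sigma):
    """RDP of order alpha for one Gaussian mechanism: alpha * sensitivity^2 / (2 * sigma^2)."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


def per_response_epsilon(num_noisy_steps, alpha, sensitivity, sigma, delta):
    """Compose RDP additively over the noisy steps, then convert to (epsilon, delta)-DP."""
    total_rdp = num_noisy_steps * rdp_gaussian(alpha, sensitivity, sigma)
    return total_rdp + math.log(1.0 / delta) / (alpha - 1.0)
```

In practice an accountant would evaluate several orders `alpha` and report the smallest resulting epsilon; a single order is used here to keep the sketch short.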


2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

CoMPAS3D: 一个用于交互动作的数据集和基准

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

发表机构 * School of Computing Science, Simon Fraser University（计算科学学院，西蒙弗雷泽大学）

AI总结提出CoMPAS3D数据集和评估框架，通过动作可读性和熟练度适当性等客观指标，解决交互式动作生成中缺乏社交上下文评估的问题。

Comments https://rosielab.github.io/compas3d

详情

AI中文摘要

社交互动型人形机器人必须通过身体与人类互动，实时适应伙伴的动作、意图和能力。这需要模型不仅理解身体如何移动，还要理解在共享社交背景下动作的含义。然而，交互式动作生成的评估框架并未衡量生成的动作是否在共享动作词汇中可读，也不评估其是否适合伙伴的熟练水平。这一差距有两个原因：现有框架依赖运动学指标（如FID和节拍对齐），无法衡量上述特性；现有数据集缺乏动作标注和熟练度变化。萨尔萨舞作为评估领域很合适：即兴、双人、由动作词汇和评判标准（涵盖时机、音乐性、技巧、难度、配合和原创性）指导。我们提出CoMPAS3D，一个即兴双人萨尔萨舞的动作捕捉数据集，附带评估框架，涵盖运动学质量、两个客观指标（动作可读性和熟练度适当性）以及六个基于竞赛的主观维度。数据集包含18名舞者（涵盖初级、中级和高级水平）的3小时即兴表演，超过2800个专家标注片段，涵盖动作类型、错误和风格元素。我们定义了三个基准：动作分类（类似于转录）、熟练度估计（流利度评估）和跟随者生成（对话响应）。微调的视觉语言模型在应用于真实动作序列的客观指标上表现强劲。应用于Duolando和InterGen时，这些指标揭示了运动学指标遗漏的失败。人工评估确认了生成动作与真实动作之间的差距。CoMPAS3D、标注、基准代码和基线结果公开可用。

英文摘要

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.


2504.01531 2026-06-03 cs.LG

DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

DRAN：一种面向时空预测的分布与关系自适应网络

Xiaobei Zou, Luolin Xiong, Kexuan Zhang, Cesare Alippi, Yang Tang

发表机构 * Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai（能源化工过程智能制造教育部重点实验室，华东理工大学，上海）； Faculty of Informatics, Università della Svizzera italiana（瑞士意大利语区大学信息学院）； Department of Electronics, Information and Bioengineering, Politecnico di Milano（米兰理工大学电子、信息与生物工程系）

AI总结针对非平稳时空系统的预测挑战，提出分布与关系自适应网络（DRAN），通过空间因子学习器（SFL）和动态-静态融合学习器（DSFL）分别适应分布偏移和关系变化，在天气和交通预测任务上超越现有方法。

Comments 15 pages, 10 figures

详情

AI中文摘要

准确的时空系统预测对于系统管理、控制和危机预防等任务至关重要。然而，许多时空系统固有的时变性给在非平稳条件下实现准确预测带来了挑战。为了解决非平稳性问题，我们提出了一种分布与关系自适应网络（DRAN），能够动态适应随时间变化的关系和分布。虽然时间归一化和反归一化是适应分布偏移的常用技术，但这种操作不适用于时空上下文，因为时间归一化会缩放节点的时间序列，可能破坏节点间的空间关系。为了解决这个问题，我们开发了一个空间因子学习器（SFL）模块，使得归一化和反归一化过程得以实现。为了适应传感器间空间关系的动态变化，我们提出了一种动态-静态融合学习器（DSFL）模块，通过自适应融合比例机制有效整合从动态和静态关系中学习到的特征。此外，我们引入了一个随机学习器来捕获时空表示中的噪声成分。我们的方法在天气预测和交通流预测任务上优于现有最先进方法。实验结果表明，我们的SFL在各种时间归一化操作下有效保持了空间关系。对学习到的动态和静态关系的可视化表明，DSFL能够捕获节点间的局部和远程关系。

英文摘要

Accurate predictions of spatio-temporal systems are crucial for tasks such as system management, control, and crisis prevention. However, the inherent time variance of many spatio-temporal systems poses challenges to achieving accurate predictions whenever stationarity cannot be assumed. To address non-stationarity, we propose a Distribution and Relation Adaptive Network (DRAN) capable of dynamically adapting to relation and distribution changes over time. While temporal normalization and de-normalization are frequently used techniques to adapt to distribution shifts, this operation is not suitable for the spatio-temporal context as temporal normalization scales the time series of nodes and possibly disrupts the spatial relations among nodes. To address this problem, a Spatial Factor Learner (SFL) module is developed that enables the normalization and de-normalization process. To adapt to dynamic changes in spatial relationships among sensors, we propose a Dynamic-Static Fusion Learner (DSFL) module that effectively integrates features learned from both dynamic and static relations through an adaptive fusion ratio mechanism. Furthermore, we introduce a Stochastic Learner to capture the noisy components of spatio-temporal representations. Our approach outperforms state-of-the-art methods on weather prediction and traffic flow forecasting tasks. Experimental results show that our SFL efficiently preserves spatial relationships across various temporal normalization operations. Visualizations of the learned dynamic and static relations demonstrate that DSFL can capture both local and distant relationships between nodes.
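
The concern about temporal normalization can be seen in a tiny example. The sketch below normalizes each node's series by its own mean and standard deviation and then inverts the operation. It only illustrates the generic normalize/de-normalize step the abstract refers to, not the SFL module; the function names and the two toy series are made up for the illustration.

```python
import statistics


def temporal_normalize(series, eps=1e-8):
    """Scale one node's time series to zero mean and unit variance over the window."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    normalized = [(x - mean) / (std + eps) for x in series]
    return normalized, (mean, std)


def temporal_denormalize(normalized, stats, eps=1e-8):
    """Invert temporal_normalize using the stored per-node statistics."""
    mean, std = stats
    return [x * (std + eps) + mean for x in normalized]


# Two hypothetical sensors whose readings differ by a factor of ten.
node_a = [10.0, 12.0, 14.0]
node_b = [100.0, 120.0, 140.0]
norm_a, stats_a = temporal_normalize(node_a)
norm_b, stats_b = temporal_normalize(node_b)
```

After normalization `norm_a` and `norm_b` are numerically almost identical, so the tenfold difference between the two nodes survives only in `stats_a` and `stats_b`. That is the kind of cross-node information a spatially aware component would need to retain.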


2506.21129 2026-06-03 cs.LG cs.AI

Curriculum-Adapted Robust Reinforcement Learning for UAV Deconfliction in Adversarial Environments

对抗环境中无人机冲突消解的课程自适应鲁棒强化学习

Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo

发表机构 * Faculty of Engineering and Applied Sciences, Cranfield University（工程与应用科学学院，克兰菲尔德大学）

AI总结提出一种课程引导的适应框架，通过渐进暴露于梯度对抗观测扰动并对齐时序差分误差分布，提升无人机在GNSS欺骗攻击下的鲁棒性和泛化能力。

详情

AI中文摘要

自主无人机（UAV）越来越依赖强化学习（RL）进行导航。然而，全球导航卫星系统（GNSS）欺骗攻击可能导致分布外观测偏移，破坏价值估计并降低任务性能。现有的鲁棒RL方法通常能提高对特定攻击模型的抵抗力，但往往无法泛化到训练中未遇到的攻击。为解决这一局限，我们提出一种课程引导的适应框架，该框架逐步将鲁棒策略暴露于强度递增的基于梯度的对抗观测扰动，同时对齐课程阶段间的时序差分（TD）误差分布。所提出的方法不是适应特定的攻击模型，而是保持TD误差一致性以促进跨攻击条件的可迁移性。我们进一步推导了一个TD空间泛化保证，表明如果测试时攻击引起的TD误差分布与最终课程阶段的分布足够接近，则由此产生的性能退化是有界的。该框架在具有动态3D障碍物的无人机冲突消解环境中进行评估，面对之前未见过的固定和动态GNSS欺骗攻击。在固定欺骗条件下，课程适应策略实现了近乎完美的任务成功率，而标准和鲁棒RL基线为20-56%。在动态障碍物引诱欺骗攻击下，它获得了最高的情节奖励，同时随着空中交通密度的增加，任务完成步骤最多减少了45%。

英文摘要

Autonomous unmanned aerial vehicles (UAVs) increasingly rely on reinforcement learning (RL) for navigation. However, global navigation satellite system (GNSS) spoofing attacks can induce out-of-distribution observation shifts that corrupt value estimation and degrade mission performance. Existing robust RL approaches typically improve resilience against specific attack models but often fail to generalize to attacks not encountered during training. To address this limitation, we propose a curriculum-guided adaptation framework that progressively exposes a robust policy to gradient-based adversarial observation perturbations of increasing intensity while aligning temporal-difference (TD) error distributions across curriculum stages. Rather than adapting to a particular attack model, the proposed approach preserves TD-error consistency to promote transferability across attack conditions. We further derive a TD-space generalization certificate showing that if the TD-error distribution induced by a test-time attack remains sufficiently close to that of the final curriculum stage, the resulting performance degradation is bounded. The framework is evaluated in a UAV deconfliction environment with dynamic 3D obstacles under previously unseen fixed and dynamic GNSS spoofing attacks. Under fixed spoofing conditions, the curriculum-adapted policy achieved near-perfect mission success rates, compared with 20-56% for standard and robust RL baselines. Under dynamic obstacle-luring spoofing attacks, it achieved the highest episodic rewards while reducing mission completion steps by up to 45% across increasing aerial traffic densities.
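
A gradient-based observation perturbation with a staged intensity schedule can be sketched as follows. This uses the well-known fast gradient sign method (FGSM) on a toy linear value function. It is a stand-in chosen for illustration, not the attack or curriculum used in the paper, and the weights, observation, and epsilon schedule are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fgsm_perturb(obs, grad, epsilon):
    """Shift each observation component by epsilon against the gradient sign."""
    return [o - epsilon * sign(g) for o, g in zip(obs, grad)]


def linear_value(weights, obs):
    """Toy value function V(o) = w . o."""
    return sum(w * o for w, o in zip(weights, obs))


def curriculum_values(weights, obs, epsilons):
    """Value of the perturbed observation at each curriculum stage.

    For a linear value function the gradient with respect to the
    observation is simply the weight vector.
    """
    return [linear_value(weights, fgsm_perturb(obs, weights, eps)) for eps in epsilons]
```

With weights `[1.0, -2.0]` each stage lowers the value estimate by `epsilon` times the L1 norm of the weights, so a schedule such as `[0.0, 0.1, 0.2]` exposes the policy to steadily stronger corruption.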


2506.04367 2026-06-03 cs.CV

Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

微调视频变换器用于词级孟加拉手语：分类任务的比较分析

Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan

发表机构 * Systems and Software Lab (SSL), Department of CSE, Islamic University of Technology (IUT)（系统与软件实验室（SSL），计算机科学与工程系，伊斯兰科技大学（IUT））

AI总结本研究通过微调VideoMAE、ViViT和TimeSformer三种视频变换器模型，在BdSLW60和BdSLW401数据集上实现了高精度孟加拉手语识别，其中VideoMAE在帧率校正后的BdSLW60上达到95.5%准确率。

Comments 16 pages, 8 figures, 6 tables

详情

DOI: 10.1371/journal.pone.0341909
Journal ref: PLOS ONE, Vol. 21, No. 5, e0341909, 2026

AI中文摘要

手语识别（SLR）涉及从图像或视频中自动识别和分类手势，将其转换为文本或语音，以改善听障社区的可访问性。在孟加拉国，孟加拉手语（BdSL）是许多听障人士的主要交流方式。本研究在BdSLW60（arXiv:2402.08635）上微调了最先进的视频变换器架构——VideoMAE、ViViT和TimeSformer，BdSLW60是一个包含60个频繁手势的小规模BdSL数据集。我们将视频标准化为30 FPS，得到9,307个用户试用片段。为了评估可扩展性和鲁棒性，模型还在BdSLW401（arXiv:2503.02360）上进行了微调，这是一个包含401个手势类别的大规模数据集。此外，我们还在公开数据集（包括LSA64和WLASL）上进行了基准测试。应用了随机裁剪、水平翻转和短边缩放等数据增强技术以提高模型鲁棒性。为了在模型选择期间确保跨折的平衡评估，我们在训练集上采用了10折分层交叉验证，同时使用来自未见用户U4和U8的留出测试数据进行了独立于手语者的评估。结果表明，视频变换器模型显著优于传统的机器学习和深度学习方法。性能受数据集大小、视频质量、帧分布、帧率和模型架构等因素影响。在这些模型中，VideoMAE变体（MCG-NJU/videomae-base-finetuned-kinetics）在帧率校正后的BdSLW60数据集上达到了95.5%的最高准确率，在BdSLW401的正面手势上达到了81.04%——展示了可扩展且准确的BdSL识别的强大潜力。

英文摘要

Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures (VideoMAE, ViViT, and TimeSformer) on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame-rate-corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401, demonstrating strong potential for scalable and accurate BdSL recognition.
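
Stratified k-fold splitting, as used for model selection above, keeps the class proportions roughly equal in every fold. A minimal dependency-free sketch is shown below; it is not the authors' pipeline, the function name is hypothetical, and it omits the shuffling that a real experiment would normally apply before assignment.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds
```

For signer-independent evaluation, the held-out signers would be removed from `labels` before calling this function, so that no clip from a test signer leaks into any training fold.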


2506.03087 2026-06-03 cs.LG cs.AI

Do Explanations Increase the Risk of Decision Logic Leakage? Explanation-Guided Stealing of Graph Models

解释是否会增加决策逻辑泄露的风险？解释引导的图模型窃取

Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Xiamen University（厦门大学）； The Pennsylvania State University（宾夕法尼亚州立大学）

AI总结研究解释机制可能泄露图神经网络决策逻辑的风险，提出一种结合解释对齐与数据增强的模型窃取框架，实验证明其优于传统方法。

详情

AI中文摘要

图神经网络（GNNs）已成为药物发现和金融分析等领域中分析图结构数据的重要工具，导致对模型透明度的需求日益增长。可解释GNNs的最新进展通过揭示影响预测的重要子图满足了这一需求，但这些解释机制可能无意中使这些模型面临安全风险。本文研究了此类解释如何潜在泄露可被利用进行模型窃取的关键决策逻辑。我们提出了一种新颖的窃取框架，它将用于捕获决策逻辑的解释对齐与用于在有限查询下高效训练的引导数据增强相结合，从而能够有效复制目标模型的预测行为和底层推理模式。在分子图数据集上的实验表明，我们的方法在模型窃取方面优于传统方法。这项工作突出了在敏感领域部署可解释GNNs时的重要安全考虑，并表明需要针对基于解释的攻击采取保护措施。我们的代码可在https://github.com/beanmah/EGSteal获取。

英文摘要

Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to a growing demand for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose these models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at https://github.com/beanmah/EGSteal.


2506.00431 2026-06-03 cs.LG

TIDFormer: Exploiting Temporal and Interactive Dynamics Makes A Great Dynamic Graph Transformer

TIDFormer: 利用时间和交互动态打造卓越的动态图Transformer

Jie Peng, Zhewei Wei, Yuhang Ye

发表机构 * Renmin University of China（中国人民大学）； Huawei Shenzhen, Guangdong China（华为深圳，广东中国）

AI总结提出TIDFormer，通过高效利用时间和交互动态，并设计可解释的自注意力机制，在多个动态图数据集上超越现有模型。

Comments KDD2025

详情

DOI: 10.1145/3711896.3737155

AI中文摘要

由于自注意力机制（SAMs）在序列建模中捕捉依赖关系的能力，一些现有的动态图神经网络（DGNNs）利用具有各种编码设计的Transformer架构来捕捉动态图的序列演化。然而，这些基于Transformer的DGNNs的有效性和效率差异很大，凸显了在动态图上正确定义SAM以及在不增加额外复杂模块的情况下全面编码时间和交互动态的重要性。在这项工作中，我们提出了TIDFormer，一种以高效方式充分利用时间和交互动态的动态图Transformer。我们阐明并验证了我们提出的SAM的可解释性，解决了先前工作中在动态图上其定义不可解释的开放问题。为了分别建模时间和交互动态，我们利用基于日历的时间划分信息，并仅使用采样的一阶邻居为二分图和非二分图提取信息丰富的交互嵌入。此外，我们通过简单的分解捕捉历史交互模式的潜在变化，联合建模时间和交互特征。我们在多个动态图数据集上进行了大量实验，以验证TIDFormer的有效性和效率。实验结果表明，TIDFormer表现出色，在大多数数据集和实验设置中超越了最先进的模型。此外，与之前基于Transformer的方法相比，TIDFormer展现出显著的效率优势。

英文摘要

Due to the proficiency of self-attention mechanisms (SAMs) in capturing dependencies in sequence modeling, several existing dynamic graph neural networks (DGNNs) utilize Transformer architectures with various encoding designs to capture sequential evolutions of dynamic graphs. However, the effectiveness and efficiency of these Transformer-based DGNNs vary significantly, highlighting the importance of properly defining the SAM on dynamic graphs and comprehensively encoding temporal and interactive dynamics without extra complex modules. In this work, we propose TIDFormer, a dynamic graph TransFormer that fully exploits Temporal and Interactive Dynamics in an efficient manner. We clarify and verify the interpretability of our proposed SAM, addressing the open problem of its uninterpretable definitions on dynamic graphs in previous works. To model the temporal and interactive dynamics, respectively, we utilize the calendar-based time partitioning information and extract informative interaction embeddings for both bipartite and non-bipartite graphs using merely the sampled first-order neighbors. In addition, we jointly model temporal and interactive features by capturing potential changes in historical interaction patterns through a simple decomposition. We conduct extensive experiments on several dynamic graph datasets to verify the effectiveness and efficiency of TIDFormer. The experimental results demonstrate that TIDFormer excels, outperforming state-of-the-art models across most datasets and experimental settings. Furthermore, TIDFormer exhibits significant efficiency advantages compared to previous Transformer-based methods.


2505.20853 2026-06-03 cs.LG cs.AI

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

专家合作：大间隔融合异构信息

Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang

发表机构 * Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang（未知）

AI总结提出专家合作框架，通过大间隔机制融合异构信息，在统一异构多路网络中编码多类型数据，实现鲁棒且互补的知识提取。

Comments Accepted at the 42nd International Conference on Machine Learning (ICML 2025)

详情

Journal ref: Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:63169-63185, 2025

AI中文摘要

融合异构信息仍然是现代数据分析中的一个持续挑战。尽管已取得显著进展，但现有方法往往未能考虑对象模式在不同语义空间中的固有异质性。为解决这一局限性，我们提出了专家合作（CoE）框架，该框架将多类型信息编码到统一的异构多路网络中。通过克服模态和连接差异，CoE为捕捉现实世界复杂数据的复杂结构提供了一个强大且灵活的模型。在我们的框架中，专用编码器充当领域特定专家，每个专家专门学习特定语义空间中的不同关系模式。为了增强鲁棒性并提取互补知识，这些专家通过一种新颖的大间隔机制进行协作，该机制由定制的优化策略支持。严格的理论分析保证了框架的可行性和稳定性，而跨多种基准的广泛实验证明了其优越的性能和广泛的适用性。我们的代码可在 https://github.com/strangeAlan/CoE 获取。

英文摘要

Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By overcoming modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework's feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability. Our code is available at https://github.com/strangeAlan/CoE.


2505.20142 2026-06-03 cs.LG

Grounding Functional Similarity by Invariance-Aware Model Stitching

通过不变性感知模型拼接实现功能相似性评估

Ioannis Athanasiadis, Anmar Karmush, Michael Felsberg

发表机构 * Ioannis Athanasiadis ； Anmar Karmush ； Michael Felsberg

AI总结针对标准模型拼接忽略不变性导致功能相似性误判的问题，提出前向-后向兼容性要求下的不变性感知模型拼接方法，揭示隐藏的功能差异。

详情

AI中文摘要

在深度学习中，功能相似性评估量化了独立训练的模型学习相似输入-输出关系的程度。在模型拼接中，功能相似性被表述为表示前向兼容性，即两个模型的表示能否对齐以解决给定任务。然而，最近的研究强调了一个关键限制：依赖不同信息线索的模型仍可能产生兼容的表示，使其看起来具有误导性的相似性（Smith et al., 2025）。我们将此失败归因于标准模型拼接本质上对拼接模型的不变性特性视而不见。为解决这一限制，我们引入了前向-后向兼容性要求，并据此制定了不变性感知模型拼接。通过分析关键拼接配置，我们研究了前向和后向兼容性之间的相互作用，表明不变性感知模型拼接为功能相似性评估提供了更原则性的方法，同时揭示了先前被掩盖的功能差异。

英文摘要

In deep learning, functional similarity evaluation quantifies the extent to which independently trained models learn similar input-output relationships. In model stitching, functional similarity is framed as representation forward compatibility, i.e., whether the representations of two models can be aligned to solve a given task. Recent studies, however, highlight a critical limitation: models relying on different information cues can still produce compatible representations, making them appear misleadingly similar (Smith et al., 2025). We attribute this failure to standard model stitching being inherently blind to the invariance properties of the stitched models. To address this limitation, we introduce the forward-backward compatibility requirement, under which we formulate invariance-aware model stitching. Through analyzing key stitching configurations, we study the interplay between forward and backward compatibility, showing that invariance-aware model stitching provides a more principled approach to functional similarity evaluation while revealing functional discrepancies previously obscured.


2406.18544 2026-06-03 cs.CV cs.GR

GS-ROR$^2$: Bidirectional-guided 3DGS and SDF for Reflective Object Relighting and Reconstruction

GS-ROR$^2$: 双向引导的3DGS和SDF用于反射物体重光照与重建

Zuo-Liang Zhu, Beibei Wang, Jian Yang

发表机构 * VCIP, College of Computer Science, Nankai University（VCIP，计算机学院，南开大学）； School of Intelligence Science and Technology, Nanjing University（智能科学与技术学院，南京大学）

AI总结提出一种双向引导框架，通过SDF辅助的高斯溅射优化重光照模型，并利用GS引导的SDF增强实现高质量几何重建，解决反射物体重光照与重建中的几何约束和细节捕捉问题。

Comments Accepted by ACM TOG

详情

DOI: 10.1145/3759248

AI中文摘要

3D高斯溅射(3DGS)因其细致的表达能力和高效的渲染速度，在新视角合成方面展现出强大能力。然而，使用3DGS创建可重光照的3D资产并重建忠实几何仍然存在问题，特别是对于反射物体，其不连续表示给几何约束带来困难。体积符号距离场(SDF)方法提供了鲁棒的几何重建，但昂贵的射线步进阻碍了其实时应用并减慢了训练速度。此外，这些方法难以捕捉尖锐的几何细节。为此，我们提出以互补方式双向引导3DGS和SDF，包括SDF辅助的高斯溅射用于重光照模型的高效优化，以及GS引导的SDF增强用于高质量几何重建。SDF辅助高斯溅射的核心是混合高斯与SDF之间的深度和法线相互监督，避免了SDF昂贵的体积渲染。得益于这种相互监督，学习到的混合高斯以最小的时间成本得到良好约束。由于高斯以延迟着色模式渲染，alpha混合的高斯是平滑的，但单个高斯可能仍然是异常值，产生漂浮伪影。因此，我们引入SDF感知的剪枝策略，移除位于SDF定义表面远处的高斯异常值，避免漂浮问题。这样，我们的GS框架提供了合理的法线并实现了逼真的重光照，但来自深度的网格仍然存在问题。因此，我们设计了GS引导的SDF细化，利用来自高斯的混合法线微调SDF。通过这种增强，我们的方法可以以额外17%的训练时间为代价，为反射物体提供高质量的网格。

英文摘要

3D Gaussian Splatting (3DGS) has shown a powerful capability for novel view synthesis due to its detailed expressive ability and highly efficient rendering speed. Unfortunately, creating relightable 3D assets and reconstructing faithful geometry with 3DGS is still problematic, particularly for reflective objects, as its discontinuous representation raises difficulties in constraining geometries. Volumetric signed distance field (SDF) methods provide robust geometry reconstruction, but their expensive ray marching hinders real-time application and slows training. Besides, these methods struggle to capture sharp geometric details. To this end, we propose to guide 3DGS and SDF bidirectionally in a complementary manner, including an SDF-aided Gaussian splatting for efficient optimization of the relighting model and a GS-guided SDF enhancement for high-quality geometry reconstruction. At the core of our SDF-aided Gaussian splatting is the mutual supervision of the depth and normal between blended Gaussians and SDF, which avoids the expensive volume rendering of SDF. Thanks to this mutual supervision, the learned blended Gaussians are well-constrained with a minimal time cost. As the Gaussians are rendered in a deferred shading mode, the alpha-blended Gaussians are smooth, while individual Gaussians may still be outliers, yielding floater artifacts. Therefore, we introduce an SDF-aware pruning strategy to remove Gaussian outliers located far from the surface defined by the SDF, avoiding the floater issue. This way, our GS framework provides reasonable normals and achieves realistic relighting, while the mesh from depth is still problematic. Therefore, we design a GS-guided SDF refinement, which utilizes the blended normals from Gaussians to fine-tune the SDF. With this enhancement, our method can further provide high-quality meshes for reflective objects at the cost of 17% extra training time.
