arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28548 2026-05-28 cs.CV

GEM: Generative Supervision Helps Embodied Intelligence

Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu

AI summary: Proposes GEM, which adds a depth map generation task to vision-language model pre-training and trains it jointly with the main model to improve both semantic understanding and physical operation in embodied intelligence; also releases the large-scale GEM-4M dataset and reports state-of-the-art results on multiple benchmarks.

Details
Comments
Project Page: https://zhaorw02.github.io/GEM/

Abstract

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/

2605.28544 2026-05-28 cs.CV

DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang

AI summary: Proposes DriveWAM, which adapts a pretrained video diffusion transformer into an autoregressive video-action policy and introduces scene-evolving driving guidance and selective KV memory for scalable world-action modeling, achieving strong planning performance on the NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks.

Details

Abstract

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

2605.28543 2026-05-28 cs.AI cs.CL cs.LG

Cultural Binding Heads in Language Models

Avrile Floro, Luca Benedetto

AI summary: Using mechanistic interpretability and a factorial design, identifies 2-3 mid-layer attention heads per model, across eight language models, that contribute causally to cultural binding; the binding appears to form during pre-training, and knowledge probing shows that models know far more than their behavior reflects.

Details

Abstract

LLMs often default to equal treatment across cultural groups, even when context warrants differentiation: this is a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. An $\alpha$-scaling experiment shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2$-$3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing and not in knowledge.

2605.28534 2026-05-28 cs.CL

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang

AI summary: Proposes GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through causal internalization and density-aware exemplar reselection, improving agents' understanding of GUI operations and their task success rates.

Details

Abstract

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.

2605.28533 2026-05-28 cs.LG

Semi-Supervised Hypothesis Testing by Betting on Predictions

Yaniv Tenzer, Elad Tolochinsky, Yaniv Romano

AI summary: Proposes a testing-by-betting framework that uses predictions on unlabeled data to increase the power of sequential hypothesis testing, introducing an e-statistic that yields an anytime-valid test under label shift or concept shift.

Details

Abstract

We introduce a testing-by-betting framework that leverages predictions on unlabeled data to enhance the power of sequential hypothesis testing. Given limited samples from the joint distribution of $(X,Y)$, and additional unlabeled samples from the marginal of $X$, we ask how unlabeled data can be used to test hypotheses about the distribution of $Y$ and the conditional distribution of $Y\mid X$. We introduce an e-statistic and use it to construct a sequential test. Under standard distributional assumptions (label shift or concept shift) we establish that the test is anytime valid. Furthermore, we show that for binary data, the e-statistic has non-trivial power. Crucially, our approach retains these properties even when the underlying predictions are inaccurate. Through simulations and applications to large language model evaluation, we demonstrate power gains over baseline approaches, including prediction-powered inference. These gains persist even with relatively limited unlabeled data and when predictions have low accuracy due to weak correlation between $X$ and $Y$.
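
To make the testing-by-betting idea concrete, here is a minimal sketch of a generic betting test for a binary mean. It is not the e-statistic from the paper (which also uses predictions on unlabeled data and is not specified in the abstract); the function name `betting_test`, the fixed bet size `lam`, and the null value `p0` are all illustrative assumptions. The anytime validity comes from Ville's inequality applied to the wealth process.

```python
def betting_test(xs, p0, alpha=0.05, lam=0.5):
    """Sequential test of H0: E[X] = p0 for binary observations X in {0, 1}.

    Hypothetical illustration: wealth starts at 1 and is multiplied by
    1 + lam * (x - p0) after each observation. Under H0 the wealth is a
    nonnegative martingale, so by Ville's inequality it reaches 1/alpha
    with probability at most alpha, no matter when we stop.
    """
    # This range of lam keeps every multiplicative factor strictly positive.
    assert 0 < p0 < 1 and -1 / (1 - p0) < lam < 1 / p0
    wealth = 1.0
    for t, x in enumerate(xs, start=1):
        wealth *= 1.0 + lam * (x - p0)
        if wealth >= 1.0 / alpha:
            return t, wealth  # reject H0 at time t
    return None, wealth  # H0 was never rejected


# A stream of ones is strong evidence against p0 = 0.5.
stop_time, final_wealth = betting_test([1] * 50, p0=0.5)
```

A fixed `lam` is the simplest choice; practical betting tests usually adapt the bet size over time, which is outside the scope of this sketch.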

2605.28532 2026-05-28 cs.AI

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Liang Cheng, Mingsheng Cai, Jiuming Jiang, Luo Mai

AI summary: Proposes FeasiGen, an automatic pipeline for constructing infeasible tasks that turns solvable tasks into unsolvable ones by masking critical tools; evaluation finds that most models lack feasibility detection ability, with false continue rates as high as 73.9%.

Details
Comments
14 pages

Abstract

Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios, many tasks become infeasible under constrained tool environments, where the capabilities required for successful task completion are unavailable. Detecting infeasible tasks and stopping execution early can significantly reduce unnecessary execution cost. In this work, we propose FeasiGen, an automatic pipeline for constructing infeasible agent tasks by identifying the critical tools required for successful task completion. Our approach extracts tool-calling traces from successful executions across multiple agent systems, identifies critical tools consistently shared across diverse execution strategies, and masks these tools to automatically transform solvable tasks into infeasible ones. Human verification confirms that the infeasibility annotations for our constructed tasks achieve over 94% accuracy. We further introduce feasibility-aware evaluation metrics for measuring whether agents can recognize infeasible tasks and stop execution appropriately. Extensive evaluations across nine models reveal weak infeasibility detection, with the false continue rate reaching up to 73.9%. We further observe that multi-agent architectures significantly reduce erroneous execution under infeasible conditions.
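
The pipeline described above can be sketched in a few lines. This is a simplified reading of the abstract, not the authors' implementation: it assumes that "critical tools" are those appearing in every successful trace, and that the false continue rate is the fraction of infeasible tasks on which the agent did not stop. The function names and the example tool names are hypothetical.

```python
def critical_tools(traces):
    """Tools that appear in every successful trace (assumed definition).

    Each trace is a list of tool names called during one successful run.
    """
    tool_sets = [set(trace) for trace in traces]
    return set.intersection(*tool_sets) if tool_sets else set()


def mask_tools(available, critical):
    """Remove the critical tools from the tool list, making the task infeasible."""
    return [tool for tool in available if tool not in critical]


def false_continue_rate(stopped_flags):
    """Fraction of infeasible tasks on which the agent kept going (assumed definition)."""
    continued = sum(1 for stopped in stopped_flags if not stopped)
    return continued / len(stopped_flags)


# Three successful runs of the same task by different (hypothetical) agents.
traces = [["search", "calc", "write"], ["calc", "search"], ["lookup", "calc"]]
critical = critical_tools(traces)
masked_env = mask_tools(["search", "calc", "write", "lookup"], critical)
```

Taking the intersection across diverse runs is what guards against masking a tool that only one strategy happens to use.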

2605.28531 2026-05-28 cs.LG

Stabilizing distribution-free probabilistic forecasts

Jente Van Belle, Honglin Wen, Wouter Verbeke, Pierre Pinson

AI summary: Proposes a method based on regression splines parameterized by a neural network that jointly optimizes the quality and stability of distribution-free probabilistic time-series forecasts to control the variability caused by forecast updates, validated on two datasets.

Details

Abstract

Multi-step-ahead forecasts are often updated as new observations become available, since shorter forecast horizons typically improve forecast quality. However, such improvements come at the cost of forecast instability, i.e., variability in forecasts for the same target period. This instability can trigger costly changes to plans formulated based on the forecasts and may erode trust in the forecasting system. In this work, we integrate forecast stability alongside forecast quality into the training of distribution-free probabilistic time-series forecasting models, allowing us to control this trade-off. We propose a method for generating stabilized forecasted conditional quantile functions using regression splines parameterized by a neural network. This approach enables joint optimization of quality and stability, as it allows us to directly penalize dissimilarities arising from forecast updates. Furthermore, it allows assigning varying importance to stabilizing different parts of the forecast distributions (e.g., central parts vs. tails) to focus on the parts most relevant for the intended downstream use (e.g., the upper tail for inventory management). We empirically evaluate the proposed method on two datasets with different statistical properties and show that it can effectively reduce forecast instability without a substantial loss in forecast quality, and that it can target stabilization effort toward specific parts of the forecast distributions.
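
The trade-off between quality and stability can be illustrated with a small loss function. This sketch is an assumption-laden simplification: it scores a handful of discrete quantile levels with the pinball loss and penalizes the weighted absolute change between the previous and the updated forecast for the same target period. The paper instead works with full quantile functions built from regression splines, and its exact penalty is not given in the abstract; `stabilized_loss`, `weights`, and `lam` are hypothetical names.

```python
def pinball(y, q_pred, tau):
    """Pinball (quantile) loss of predicting q_pred for quantile level tau."""
    diff = y - q_pred
    return max(tau * diff, (tau - 1.0) * diff)


def stabilized_loss(y, new_q, old_q, taus, weights, lam):
    """Forecast quality plus lam times a weighted instability penalty.

    new_q and old_q hold the updated and the previous quantile forecasts for
    the same target period; weights select which parts of the distribution
    (for example the upper tail) should be stabilized.
    """
    quality = sum(pinball(y, q, t) for q, t in zip(new_q, taus)) / len(taus)
    change = sum(w * abs(a - b) for w, a, b in zip(weights, new_q, old_q))
    return quality + lam * change / sum(weights)


# Stabilize only the upper tail (tau = 0.9), as in an inventory setting.
loss = stabilized_loss(
    y=10.0,
    new_q=[8.0, 10.0, 12.0],
    old_q=[8.0, 10.0, 15.0],
    taus=[0.1, 0.5, 0.9],
    weights=[0.0, 0.0, 1.0],
    lam=0.5,
)
```

Setting `lam` to zero recovers a pure quality objective, and larger values trade accuracy for smaller revisions.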

2605.28527 2026-05-28 cs.RO

What Frozen VLAs Already Know About Success: A Probing Study of Value-Like Structure in Foundation Robot Policies

Jiachen Zhang, Junnan Nie, Junyi Lao, Wei Cheng, Chenghao Liu, Jiaxin Jiang, Songfang Huang

AI summary: Uses linear probes on frozen VLA features to predict Monte-Carlo outcome targets, finding that the features encode success information that can drive test-time action selection and raise success rates.

Details
Comments
14 pages, 1 figure, 11 tables. Equal contribution: Jiachen Zhang, Junnan Nie, and Junyi Lao. Corresponding author: Songfang Huang. Preprint

Abstract

Vision-language-action (VLA) policies are trained to imitate actions; their loss never asks them to estimate reward, progress, or future success. Their frozen representations nevertheless carry such information, and it can be read out and used to guide action choice without retraining the policy. From mixed successful and failed manipulation trajectories on LIBERO-Goal, we recover Monte-Carlo outcome targets using lightweight linear probes on frozen features. The targets are consistently predictable from OpenVLA, Pi0.5, DINOv2, and CLIP features, and substantially less so from baselines built on progress, time-to-go, task identity, or proprioception. To rule out task and temporal shortcuts, we evaluate the probes under same-task, same-timestep matched comparisons: Pi0.5 probes still reach roughly 92% pairwise ordering accuracy, while label-shuffled controls stay at chance. Used as a test-time selector over sampled Pi0.5 action prefixes, the same probe turns this offline finding into behavior: on push-plate, success rises from 26.7% under greedy decoding to 44.3%, with a second positive case on wine-rack. The gains are not universal and require additional inference compute, but the underlying finding is clean: frozen VLAs already encode information about success that their imitation objective never explicitly demands.
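
Two of the ingredients mentioned above, pairwise ordering accuracy and probe-based selection among sampled candidates, can be sketched as follows. This is a generic illustration under stated assumptions: the probe is a plain linear scorer with hypothetical weights, and the accuracy is computed over all success/failure pairs rather than with the same-task, same-timestep matching used in the paper.

```python
def probe_score(features, weights, bias=0.0):
    """Linear probe: a weighted sum of frozen features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def pairwise_ordering_accuracy(scores, outcomes):
    """Fraction of (success, failure) pairs in which the success scores higher."""
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(1 for p, n in pairs if p > n) / len(pairs)


def select_prefix(candidate_features, weights):
    """Index of the sampled candidate with the highest probe score."""
    scores = [probe_score(f, weights) for f in candidate_features]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical probe scores for two successful and two failed rollouts.
accuracy = pairwise_ordering_accuracy([0.9, 0.7, 0.6, 0.4], [1, 0, 1, 0])
chosen = select_prefix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.5, -1.0])
```

Selection of this kind needs one policy sample and one probe evaluation per candidate, which is the extra inference compute the abstract refers to.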

2605.28526 2026-05-28 cs.AI cs.CL

Entropy-aware Masking for Masked Language Modeling

Gokul Srinivasagan, Kai Hartung, Munir Georges

AI summary: Proposes a masking strategy based on entropy distribution that uses the model's predictive entropy to pick informative tokens for masking, plus a self-masking approach that improves training efficiency, with an average 5% improvement on GLUE.

Details
Comments
Accepted at the StarSEM 2026 conference

Abstract

Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model's entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
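
A minimal sketch of entropy-guided mask selection follows. It assumes that each position comes with a predictive distribution over the vocabulary and that the highest-entropy positions are masked deterministically; the paper may instead sample positions according to the entropy distribution, and the name `entropy_mask` and the 15% default ratio are illustrative choices.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_mask(token_probs, mask_ratio=0.15):
    """Sorted indices of the highest-entropy positions (assumed top-k rule).

    token_probs[i] is the model's predictive distribution at position i.
    """
    k = max(1, round(mask_ratio * len(token_probs)))
    ranked = sorted(
        range(len(token_probs)),
        key=lambda i: entropy(token_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Four positions over a toy three-word vocabulary.
token_probs = [
    [1.0, 0.0, 0.0],        # certain prediction, entropy 0
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
]
masked_positions = entropy_mask(token_probs, mask_ratio=0.5)
```

In a self-masking setup the distributions would come from the model being trained rather than from an external reference model.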

2605.28524 2026-05-28 cs.AI

Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection

Zhixing Zuo, Huilin He, Jiasheng Wu, Dawei Cheng

AI summary: Proposes the LGSPF framework, which bridges graph structure and semantic space with soft prompts and introduces a parallel GNN encoder that converts multi-relational topologies into graph tokens, enabling end-to-end optimization and state-of-the-art fraud detection performance.

Details
Comments
14 pages, 3 figures

Abstract

In recent years, Large Language Models (LLMs) have shown great capability in processing graph tasks such as fraud detection. However, most existing methods rely heavily on rich text attributes, which poses difficulties for this domain due to the lack of textual data. Although some pioneering methods attempt to overcome this limitation, their textualization of graph structures via hard prompts easily leads to feature distortion. Additionally, fraud detection often exhibits multi-relational complexity, and current methods struggle to capture this deep semantic information. To address these challenges, we propose the LLM-GNN Soft Prompt Framework (LGSPF). Specifically, LGSPF bridges the graph structure and semantic space using soft prompts to eliminate reliance on text. We further introduce a parallel Graph Neural Network (GNN) encoder to translate multi-relational topologies into graph tokens for fine-grained LLM fraud comprehension. Through end-to-end optimization, LGSPF enhances deep semantic alignment between the LLM and the GNN. Experiments across diverse fraud detection benchmarks demonstrate that our method achieves state-of-the-art performance. We also validate the contribution of LGSPF to the semantic interpretability of fraud behaviors.

2605.28521 2026-05-28 cs.CL

ClinicalEncoder26AM: A Multilingual Diagnosable ColBERT Model; Evidences from the MultiClinNER Shared Task

François Remy

AI summary: Presents ClinicalEncoder26AM, a multilingual Diagnosable ColBERT model built on BGE-M3 and clinically post-trained with multi-adapter distillation and a ColBERT-style retrieval objective; fine-tuned as a BIO tagger for the MultiClinNER task, it achieves state-of-the-art multilingual entity recall and a top-five character-weighted F1 score.

Details

Abstract

ClinicalEncoder26AM is a multilingual Diagnosable ColBERT for clinical and biomedical texts, which aligns its token-level semantics at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. The post-training recipe builds upon BGE-M3 and combines synthetic clinical notes, patient-doctor conversations, and annotated resources such as MedMentions, while considering both named-entity-level and sentence-level representations in a multi-adapter distillation, along with a ColBERT-style retrieval objective. In this system demonstration paper, we evaluate the model in the MultiClinNER shared task by finetuning it as a BIO tagger for patient symptoms, disorders, and procedure spans, using a lightweight two-layer CNN head to improve local boundary detection. The resulting system remains simple, processes most documents in a single 8192-token window, and achieves state-of-the-art multilingual entity recall, while ranking in the top five overall across all entity types and languages in character-weighted F1 scores. Training curves further show that ClinicalEncoder26AM is markedly more data-efficient than the base M3 model, supporting the usefulness of its clinical post-training for downstream information extraction. The model can be downloaded from https://huggingface.co/Parallia/ClinicalEncoder26AM-Diagnosable-Colbert-L2-for-multilingual-medical-texts

2605.28520 2026-05-28 cs.AI

GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting

Yang Zhang, En Chun, Ziyun Mao, Yulu Wu, Jun Wang

AI summary: Proposes the GS-Fuse framework, which uses a Granger-causality-supervised gated fusion module and a multi-granularity alignment mechanism to selectively exploit event text and price signals, improving forecasts of the market impact of financial events.

Details

Abstract

Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing multimodal time-series models typically fuse text and prices symmetrically, without an explicit way to decide when event text is truly predictive, and thus struggle to exploit the directional event-to-price structure and the heterogeneous roles of textual and price signals. In this work, we propose GS-Fuse, a multimodal event-based forecasting framework that employs (i) a Granger-supervised, causal-aware gated fusion module, which learns to open toward event text only when it provides incremental predictive value beyond historical prices, and (ii) a multi-granularity alignment mechanism that jointly aligns high-level event representations and fine-grained textual cues with future market trajectories. Built as a flexible, plug-and-play adapter on top of off-the-shelf large language models and time-series foundation models, GS-Fuse can be instantiated across diverse backbones and market settings. Extensive experiments on real-world financial datasets show that GS-Fuse consistently outperforms state-of-the-art time-series and multimodal baselines across multiple assets and forecasting horizons.

2605.28517 2026-05-28 cs.LG cs.AI

Stochastic Gradient Descent with Momentum is Algorithmically Stable

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

AI summary: Through an algorithmic stability analysis, proves generalization guarantees for stochastic gradient descent with momentum (SGDM) on smooth convex problems and establishes optimal excess population risk bounds.

Details

Abstract

Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insufficiently understood whether and when SGDM can generalize well to unseen data. In particular, it has been conjectured that while momentum accelerates training, it may degrade generalization. In this paper, we close this gap by developing a comprehensive generalization analysis of SGDM through the lens of algorithmic stability. More specifically, we introduce a generalized SGDM framework that encompasses both Polyak's and Nesterov's momentum schemes, and establish tight on-average model stability bounds for smooth and convex problems. Notably, the obtained bounds exploit small optimization error bounds along the trajectory, apply to any momentum parameter in the interval $[0, 1)$, and do not require the commonly assumed Lipschitzness of loss functions. We further derive optimization error bounds for the generalized SGDM, and combine them with our generalization analyses to obtain optimal excess population risk bounds for SGDM with both Polyak's and Nesterov's momentum.
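
For readers who want the two momentum schemes side by side, here is a sketch of their textbook update rules in a single function. The generalized SGDM framework of the paper is not specified in the abstract, so this parameterization (a velocity `v`, step size `lr`, momentum `beta` in $[0, 1)$, and a `nesterov` flag) is an assumption, not the authors' formulation.

```python
def momentum_step(w, v, grad, lr, beta, nesterov=False):
    """One momentum update on a scalar parameter w with velocity v.

    Polyak (heavy ball): the gradient is evaluated at the current iterate w.
    Nesterov: the gradient is evaluated at the look-ahead point w + beta * v.
    grad is a callable returning a (possibly stochastic) gradient estimate.
    """
    point = w + beta * v if nesterov else w
    v_new = beta * v - lr * grad(point)
    return w + v_new, v_new


def quadratic_grad(w):
    """Gradient of the toy objective 0.5 * w ** 2."""
    return w


# Two Polyak steps and two Nesterov steps from the same starting point.
w_polyak, v_polyak = 1.0, 0.0
w_nesterov, v_nesterov = 1.0, 0.0
for _ in range(2):
    w_polyak, v_polyak = momentum_step(w_polyak, v_polyak, quadratic_grad, 0.1, 0.9)
    w_nesterov, v_nesterov = momentum_step(
        w_nesterov, v_nesterov, quadratic_grad, 0.1, 0.9, nesterov=True
    )
```

With `beta = 0` both variants reduce to plain (stochastic) gradient descent, which is the lower end of the momentum range covered by the bounds.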

2605.28516 2026-05-28 stat.ML cs.LG

Conservative neural posterior estimation via distributionally robust training

William Laplante, Yuga Hikida, Charita Dellaporta, François-Xavier Briol, Ayush Bharti

AI summary: Proposes DRO-NPE, which replaces the standard NPE objective with a worst-case loss over a Wasserstein ambiguity set to control overfitting and reduce posterior overconfidence, improving coverage and calibration under low simulation budgets.

Details

Abstract

Simulation-based inference with neural posterior estimation (NPE) often yields overconfident and unreliable posteriors under limited simulation budgets. To address this, we propose DRO-NPE, a distributionally robust approach that replaces the standard NPE objective with a worst-case loss over a Wasserstein ambiguity set. We introduce KL-based metrics for miscoverage and miscalibration, and use these to show that the DRO-NPE objective controls overfitting and reduces posterior overconfidence. Our method is tractable, parallelisable, and readily integrates with standard normalising flows. Across benchmark SBI tasks, DRO-NPE consistently improves coverage and calibration, while narrowing the gap between empirical and population NPE loss, leading to more reliable inference in low-simulation regimes.

2605.28515 2026-05-28 cs.SE cs.AI

Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

Melih Catal, Alex Wolf, Tiago Ferreiro Matos, Pooja Rani, Harald Gall

AI summary: Introduces the VIBench benchmark, which measures vertical integration bias in direct and agentic code generation by frontier LLMs across 20 provider-selectable software-integration scenarios; six of ten affiliated models show significant bias, and agentic workflows amplify it to +39.2 percentage points.

Details

Abstract

Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet, many frontier LLMs are affiliated with specific providers. This raises the question of whether generated code favors the provider's own ecosystem over comparable alternatives, potentially constraining developers' choices and increasing dependence on a single provider. We define this behavior as Vertical Integration Bias (VIB) and introduce VIBench, a benchmark for measuring VIB in direct and agentic code generation across 20 provider-selectable software-integration scenarios. Evaluating 10 frontier provider-affiliated models against 3 non-affiliated controls, we find positive VIB in direct generation, with six of ten affiliated models showing statistically significant effects up to +18.8 percentage points (pp). Agentic workflows further amplify VIB, reaching +39.2 pp. Moreover, early affiliated-ecosystem choices in agentic workflows can persist into conceptually decoupled downstream files, with persistence as high as 90.3%. These findings underscore the need to measure and account for VIB in code generation, especially as agentic capabilities become more prevalent.

2605.28513 2026-05-28 cs.LG cs.AI

Learning Theory of the SVRG: Generalization and Convergence Analysis

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

AI summary: Through an algorithmic stability analysis, establishes the first non-vacuous generalization bounds for SVRG in the convex and strongly convex settings, clarifying the interplay between optimization and generalization and yielding optimal excess risk bounds.

Details

Abstract

Variance reduction (VR) methods employ stochastic gradients with decreasing variance, and they have been widely applied to solve large-scale optimization problems in machine learning because of their efficiency. Existing theoretical studies of VR methods are mainly focused on the convergence analysis, leaving the generalization behavior largely unexplored. In this paper, we bridge this gap by developing the first non-vacuous generalization analysis of the representative VR method: Stochastic Variance Reduced Gradient (SVRG), through the lens of algorithmic stability. In particular, we establish sharp stability bounds of the SVRG in both convex and strongly convex settings by exploiting its algorithmic structure. The obtained bounds are data-dependent, because the training errors are incorporated along the trajectory. Our analysis clarifies the interplay between optimization and generalization, leading to optimal excess population risk bounds in both settings. Our approach differs substantially from existing analyses of stochastic algorithms in the sense that we decompose the SVRG update as an SGD-like step plus a zero-mean correction term and then introduce novel Lyapunov functions to absorb the additional gradient terms induced by the reference points. Our analytical framework can be generalized to other VR methods, and we demonstrate the generalization by the well-known Stochastic Average Gradient Accelerated (SAGA) method.
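
The decomposition mentioned above can be seen directly in code. The sketch below uses the standard SVRG direction on a one-dimensional least-squares problem and splits it into a plain stochastic gradient plus a correction that averages to zero over the data. The toy loss, the data, and the exact form of the split are assumptions for illustration; the paper's Lyapunov analysis is not reproduced here.

```python
import random


def grad_i(w, x, y):
    """Gradient in w of the per-sample loss 0.5 * (w * x - y) ** 2."""
    return (w * x - y) * x


def full_grad(w, data):
    """Average gradient over the whole dataset."""
    return sum(grad_i(w, x, y) for x, y in data) / len(data)


def svrg_direction(w, w_ref, mu_ref, sample):
    """SVRG direction, returned together with its zero-mean correction term."""
    x, y = sample
    sgd_part = grad_i(w, x, y)                  # SGD-like step
    correction = mu_ref - grad_i(w_ref, x, y)   # averages to zero over samples
    return sgd_part + correction, correction


def svrg_epoch(w_ref, data, lr, inner_steps, rng):
    """One outer iteration: compute the reference gradient, then run inner steps."""
    mu_ref = full_grad(w_ref, data)
    w = w_ref
    for _ in range(inner_steps):
        direction, _ = svrg_direction(w, w_ref, mu_ref, rng.choice(data))
        w -= lr * direction
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2 * x
rng = random.Random(0)
w = 0.0
for _ in range(30):
    w = svrg_epoch(w, data, lr=0.05, inner_steps=20, rng=rng)
```

The correction depends only on the reference point, which is why it can be handled separately from the SGD-like part in an analysis.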

2605.28512 2026-05-28 cs.CL

On Compositional Learning Behaviours in Formal Mathematics

论形式数学中的组合学习行为

Kevin Yandoka Denamganaï

AI总结 本文提出 S2B-LM 基准,通过去除数值处理混淆并添加思维链框架来评估组合学习行为(CLB),发现 CLB 能力对于形式数学验证的困难部分必要但不充分。

详情
Comments
work in progress, under review
AI中文摘要

能够征服形式数学困难尾部的自我进化科学智能体需要组合学习行为(CLBs)——在上下文中基础化和重组新颖符号结构的能力,而不仅仅是预学习原子的重组。我们提出了 \textbf{S2B-LM},这是符号行为基准的一个改编,它移除了数值处理作为混淆因素,并添加了思维链框架以引发而非仅仅探测潜在的 CLB 能力。在 CLB 能力(adj-ZSCT)和 miniF2F 整体证明性能上交叉评估十个 Lean~4 定理证明器,精确置换检验建立了一个层次必要性结构:搜索密集型模型覆盖了可处理的绝大部分而没有可检测的 CLB,然而每个进入奥林匹克级别(miniF2F $>75\%$)的模型都是五个最高 CLB 得分者之一($p=0.004$)。在排除模型规模作为混淆因素后,我们的结果表明 CLB 能力对于形式数学验证的困难尾部是 \emph{必要但不充分的}。

英文摘要

Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs) -- the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose \textbf{S2B-LM}, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. Cross-evaluating ten Lean~4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F $>75\%$) is among the five highest CLB scorers ($p=0.004$). After ruling out model scale as a confound, our results show that CLB competency is \emph{necessary but not sufficient} for the hard tail of formal mathematical verification.

2605.28501 2026-05-28 cs.LG

Fitting Unknown Number of Hyperplanes with Manifold Optimization

基于流形优化的未知数量超平面拟合

Zhiqin Cheng, Yu Zhan, Mingjin Zhang, Lingbo Liu, Liang Lin

AI总结 针对未知数量超平面拟合的非凸、非可微及模型阶数未知问题,提出基于流形优化的两阶段算法,通过黎曼期望最大化与投影密度估计实现高精度鲁棒拟合。

详情
AI中文摘要

将未知数量的超平面拟合到数据是机器学习中一个基本但具有挑战性的问题,其特点是非凸性、非可微性和未知模型阶数。现有方法常陷入局部最优或缺乏几何一致性。为解决这些局限,我们提出一种基于流形优化的新框架。我们将问题重新表述为单位球面流形 $\mathcal{S}^{\textbf{dim}-1}$ 上的无监督学习任务。该公式有效处理了非凸约束并线性化了距离度量,使得梯度下降易于处理。我们提出了一种两阶段流形优化算法。在第一阶段,我们采用带有重尾核的黎曼期望最大化过程来鲁棒地估计后验概率,有效解决了相交超平面间点分布的歧义。在第二阶段,当软估计收敛后,概率权重退化为硬匹配,产生严格满足几何定义的精确局部最优解。此外,我们引入了一种投影密度估计策略用于初始化,通过显著降低特征描述空间和搜索复杂度来促进全局收敛。大量实验表明,我们的方法在几何精度和鲁棒性方面均优于最先进的基线方法。

英文摘要

Fitting an unknown number of hyperplanes to data is a fundamental yet challenging problem in machine learning, characterized by its non-convexity, non-differentiability, and unknown model order. Existing approaches often struggle with local optima or lack geometric consistency. To address these limitations, we propose a novel framework based on Manifold Optimization. We reformulate the problem as an unsupervised learning task on the unit sphere manifold $\mathcal{S}^{\textbf{dim}-1}$. This formulation effectively handles the non-convex constraints and linearizes the distance measurement, rendering the gradient descent tractable. We propose a Two-Stage Manifold Optimization algorithm. In Phase I, we employ a Riemannian Expectation-Maximization process with a heavy-tailed kernel to robustly estimate posterior probabilities, effectively resolving the ambiguities of point distribution between intersecting hyperplanes. In Phase II, upon convergence of the soft estimates, the probabilistic weights degenerate into hard matching, generating a precise local optimum that strictly satisfies the geometric definition. Furthermore, we introduce a projected density estimation strategy for initialization to facilitate global convergence by significantly reducing the feature description space and search complexity. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in both geometric accuracy and robustness.
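To make the unit-sphere formulation concrete, here is a minimal sketch of Riemannian gradient descent for a single hyperplane through the origin. This is only the generic building block (tangent-space projection followed by renormalisation); the paper's two-stage algorithm, heavy-tailed kernel, and initialization strategy are not reproduced. The objective, step size, helper names, and toy points are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def fit_normal_on_sphere(points, n0, step=0.1, iters=200):
    """Minimise f(n) = mean_i (n . x_i)^2 subject to ||n|| = 1.

    For a unit normal n, |n . x| is the distance from x to the hyperplane
    {x : n . x = 0}, so the distance is linear in n on the sphere.
    """
    n = normalize(n0)
    dim = len(n)
    for _ in range(iters):
        # Euclidean gradient of f at n.
        grad = [0.0] * dim
        for x in points:
            r = dot(n, x)
            for k in range(dim):
                grad[k] += 2.0 * r * x[k] / len(points)
        # Riemannian gradient: remove the component along n (tangent projection).
        radial = dot(grad, n)
        tangent = [g - radial * nk for g, nk in zip(grad, n)]
        # Retraction: take the step, then map back onto the unit sphere.
        n = normalize([nk - step * t for nk, t in zip(n, tangent)])
    return n


# Hypothetical toy points lying in the plane z = 0, so the normal is +/-(0, 0, 1).
points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (-1.0, 2.0, 0.0)]
normal = fit_normal_on_sphere(points, n0=[0.3, 0.2, 0.9])
```

Projecting the gradient and renormalising keeps every iterate on the sphere, so the unit-norm constraint never has to be enforced by a penalty term.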

2605.28500 2026-05-28 cs.CL cs.AI cs.LG

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

功能熵:通过不确定性量化预测LLM生成代码的功能正确性

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

AI总结 针对LLM生成代码功能不正确的问题,提出基于功能等价性的不确定性量化方法(功能熵),在多个编程语言和模型上优于现有方法。

详情
AI中文摘要

大型语言模型在代码生成方面表现出令人印象深刻的能力,但它们经常生成功能不正确的代码。不确定性量化(UQ)方法已成为检测自然语言生成中幻觉的有前途的方法,但它们在代码生成任务中的有效性仍未得到充分探索。我们系统地评估了UQ技术如何跨三种编程语言、五个LLM和超过1700个问题迁移到代码生成。我们发现,一些基于令牌概率的方法无需修改即可有效泛化,而依赖自然语言推理(NLI)的基于采样的方法失败,因为NLI模型无法区分功能不同的代码,导致大多数响应崩溃为单个语义簇。为了解决这个问题,我们引入了功能等价性方法,这是一类特定于代码的方法,用基于LLM的功能等价性评估取代基于NLI的语义等价性,包括功能熵,即语义熵的代码特定模拟。功能等价性方法在15个模型-基准组合中的11个中实现了最高的AUROC,并在大多数设置中实现了最佳校准,始终优于基于NLI的对应方法以及所有其他评估方法。

英文摘要

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
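As a rough illustration of an entropy over functional-equivalence clusters, the sketch below groups sampled programs into clusters and computes the entropy of the cluster frequencies, in the spirit of a discrete semantic entropy. One important substitution: the abstract describes an LLM-based equivalence assessment, whereas this sketch uses a hypothetical stand-in that compares outputs on a few probe inputs. All names, probes, and sample programs are illustrative assumptions.

```python
import math


def cluster_by_equivalence(items, equivalent):
    """Greedy clustering: put each item in the first cluster whose representative matches."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if equivalent(cluster[0], item):
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def cluster_entropy(clusters):
    """Shannon entropy (in nats) of the empirical distribution over clusters."""
    total = sum(len(c) for c in clusters)
    return sum((len(c) / total) * math.log(total / len(c)) for c in clusters)


def same_behaviour(f, g, probes=(0, 1, 2, -3, 10)):
    # Stand-in equivalence check (NOT the LLM-based judge from the abstract):
    # two candidates count as equivalent if they agree on every probe input.
    return all(f(x) == g(x) for x in probes)


# Hypothetical sampled "generations" for the task "double the input".
samples = [
    lambda x: x * 2,
    lambda x: x + x,
    lambda x: x << 1,
    lambda x: x ** 2,   # functionally different
]
clusters = cluster_by_equivalence(samples, same_behaviour)
entropy = cluster_entropy(clusters)
```

Three of the four samples fall into one cluster, so the entropy is low but non-zero. If an equivalence check merged everything into a single cluster, the entropy would be zero whatever the samples did, which is the collapse the abstract attributes to NLI-based clustering.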

2605.28498 2026-05-28 cs.HC cs.AI

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

验证决策:温暖度和用户特征如何塑造对信息搜索中对话代理的依赖

Mert Yazan, Frederik Bungaran Ishak Situmeang, Suzan Verberne

AI总结 研究通过混合实验发现,即使提供事实核查工具,用户仍过度依赖对话AI,验证行为主要由用户特征(如先验信任)驱动,而温暖对话风格通过增加对错误答案的认同间接影响依赖。

详情
Comments
Under review for Computers in Human Behavior
AI中文摘要

对话式人工智能(AI)提供了高效便捷的信息访问途径。然而,当用户盲目信任AI并在不进行事实核查的情况下接受其答案时,可能会导致过度依赖。信息搜索日益遵循一种结合对话AI与网络搜索的混合交互范式,使得事实核查更加容易。本文考察了这种交互范式是否能有效抑制依赖。我们进一步探究了驱动用户验证AI答案的潜在因素(例如数字素养和对话温暖度)。我们进行了一项混合被试问答实验,参与者与温暖或中性的聊天机器人互动。我们的发现表明,尽管用户同时拥有对话和网络搜索的访问权限,依赖仍然存在。验证决策主要由现有的用户感知(例如对聊天机器人的先验信任)驱动,而非答案属性,一些用户无论上下文如何都会进行事实核查,而另一些用户则默认信任聊天机器人。温暖的对话风格通过增加对错误聊天机器人的认同,对依赖产生了间接但关键的影响。咨询额外的AI来源可预测更高的准确性,而传统网络搜索则不然。我们的研究通过以下方式扩展了过度依赖研究:(a)证明了即使在可进行事实核查的情况下,过度依赖仍然存在;(b)将验证行为识别为用户依赖性;(c)揭示了对话温暖度对过度依赖的间接影响,这对设计可信赖的对话搜索系统具有启示意义。

英文摘要

Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overreliance when users blindly trust AI and accept its answers without fact-checking. Information search increasingly follows a hybrid interaction paradigm that combines conversational AI with web search, making fact-checking easier. In this paper, we examine whether this interaction paradigm is effective in curbing reliance. We further investigate the underlying factors (e.g., digital literacy and conversation warmth) that drive users to verify AI answers. We conduct a mixed-subjects question-answering experiment where participants interact with either a warm or a neutral chatbot. Our findings reveal that reliance persists despite users having access to both conversational and web search. The decision to verify is driven primarily by existing user perceptions (e.g., prior trust in chatbots) rather than answer properties, with some users fact-checking regardless of the context and others trusting chatbots by default. Warm conversational style has an indirect yet critical influence on reliance by increasing agreement with the chatbot when it is incorrect. Consulting additional AI sources predicts higher accuracy, while traditional web search does not. Our study extends overreliance research by: (a) demonstrating its persistence despite access to fact-checking, (b) identifying verification behavior as user-dependent, and (c) revealing conversational warmth's indirect effect on overreliance with implications for designing trustworthy conversational search systems.

2605.28495 2026-05-28 cs.CV

Janus-LoRA: A Balanced Low-Rank Adaptation for Continual Learning

Janus-LoRA:面向持续学习的平衡低秩适配

Cheng Chen, Pengpeng Zeng, Yuyu Guo, Lianli Gao, Hengtao Shen, Jingkuan Song

AI总结 提出Janus-LoRA框架,通过梯度修正实现参数级正交性以克服灾难性遗忘,并利用解耦边际损失增强特征级分离,从而在持续学习中平衡稳定性与可塑性。

详情
Comments
9 pages, International Conference on Machine Learning
AI中文摘要

低秩适配(LoRA)已成为持续学习的一种有前景的范式。它独立更新其低秩因子($A$和$B$),通过它们的相互作用对完整权重矩阵产生复合更新。为了防止灾难性遗忘,该更新应保持与包含先前学习知识的任务特定子空间正交。然而,我们发现这种复合更新系统性地违反了这种正交性,重新引入了干扰并破坏了稳定性。此外,天真地强制执行这种正交性会损害可塑性,破坏微妙的稳定性-可塑性权衡。为了解决这些问题,我们提出了 \textbf{Janus-LoRA}框架,通过两个新颖的组件恢复这种平衡。具体来说,我们首先引入梯度修正,这是一种闭式解,数学上解耦LoRA的因子更新,针对通过高效在线估计识别的历史知识子空间强制执行正交性。接下来,为了增强可塑性,我们引入解耦边际损失,通过将新特征表示推离旧特征表示来促进特征级分离,从而为新学习创建独特、低干扰的区域。在具有挑战性的基准上的全面实验表明,通过协调参数级正交性与特征级分离,Janus-LoRA实现了优越的平衡,并建立了新的最先进性能。

英文摘要

Low-Rank Adaptation (LoRA) has emerged as a promising paradigm for Continual Learning. It independently updates its low-rank factors ($A$ and $B$), creating a composite update to the full weight matrix through their interaction. To prevent catastrophic forgetting, this update should remain orthogonal to the task-specific subspace that contains previously learned knowledge. However, we identify that this composite update systematically violates this orthogonality, reintroducing interference and undermining stability. Furthermore, naively enforcing this orthogonality compromises plasticity, disrupting the delicate stability-plasticity trade-off. To resolve these issues, we propose \textbf{Janus-LoRA}, a framework that restores this balance through two novel components. Specifically, we first introduce Gradient Rectification, a closed-form solution that mathematically decouples LoRA's factor updates, enforcing orthogonality against the historical knowledge subspace identified by an efficient Online Estimation. Next, to enhance plasticity, we introduce a Decoupled Margin Loss that promotes feature-level separation by pushing new feature representations away from old ones, thus creating distinct, low-interference regions for new learning. Comprehensive experiments on challenging benchmarks demonstrate that by harmonizing parameter-level orthogonality with feature-level separation, Janus-LoRA achieves a superior balance and establishes new state-of-the-art performance.
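The "composite update" can be written out under the standard LoRA parameterization $W = W_0 + BA$, which the abstract does not spell out and is assumed here. The symbols $\Delta A$, $\Delta B$, and the projector $P$ are illustrative notation, not the paper's.

```latex
% Assumed parameterization: W = W_0 + B A, with W_0 frozen.
% Independent factor updates: A -> A + \Delta A,  B -> B + \Delta B.
\begin{aligned}
\Delta W &= (B + \Delta B)(A + \Delta A) - BA \\
         &= \Delta B\, A \;+\; B\, \Delta A \;+\; \Delta B\, \Delta A .
\end{aligned}
% One common way to formalize "orthogonal to the old-task subspace"
% (an assumption, not necessarily the paper's definition):
% with P the orthogonal projector onto that subspace, require \Delta W\, P = 0.
```

Under this reading, orthogonality is a condition on the whole sum, so constraining $\Delta A$ and $\Delta B$ separately does not in general guarantee it, because the terms $\Delta B\, A$ and $\Delta B\, \Delta A$ couple the two factors.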

2605.28494 2026-05-28 cs.CL

A new semantically annotated corpus with syntactic-semantic and cross-lingual senses

一个带有句法语义和跨语言义项的新语义标注语料库

Myriam Rakho, Eric Laporte, Matthieu Constant

AI总结 本文构建了一个包含20个法语多义动词实例的新语义标注语料库,每个实例标注了三种义项:平行语料中的英语翻译、法语计算词典(Lexicon-Grammar表)条目以及两者的组合细粒度义项。

详情
Journal ref
Language Resources and Evaluation (LREC), 2012, Istanbul, Turkey, pp.597-600
AI中文摘要

我们描述了一个用于词义消歧的新义项标注语料库。该语料库由20个法语多义动词的实例组成。每个动词实例都标注了三种义项标签:(1) 该实例在平行语料库英语版本中的实际翻译,(2) 法语计算词典(Lexicon-Grammar表)中的动词条目,以及(3) 由翻译和Lexicon-Grammar条目拼接而成的细粒度义项标签。

英文摘要

We describe a new sense-tagged corpus for word sense disambiguation. The corpus consists of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the English version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables) and (3) a fine-grained sense label resulting from the concatenation of the translation and the Lexicon-Grammar entry.

2605.28491 2026-05-28 cs.CV

DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

DiscoForcing:基于扩散强制的实时音频驱动角色控制统一框架

Kaiyang Ji, Bingsheng Qian, Binghuan Wu, Kangyi Chen, Ye Shi, Jingya Wang

AI总结 针对实时音频响应角色控制问题,提出DiscoForcing框架,结合因果音乐编码器和扩散强制序列模型,在严格因果、有限延迟的流式生成中实现音频与全身运动的稳定对齐。

详情
Comments
accepted by ICML 2026
AI中文摘要

我们研究实时音频响应角色控制作为一个部署忠实性问题:严格因果、有限延迟的流式生成,必须在交互帧率下生成连贯的全身运动,同时音频条件可能突然变化,包括节奏变化、音频丢失或用户编辑。先前的音乐到运动系统主要针对具有全局上下文的离线生成进行优化,在流式部署中,当条件历史变得过时或不可靠时,性能会下降。我们引入了DiscoForcing,一个流式音频驱动扩散框架,它将捕获节奏结构和相位动态的因果音乐编码器与在时间范围内以异构噪声水平训练的扩散强制序列模型相结合。在此基础上,我们设计了一个混合时间调度和一个历史引导的流式采样器,以明确权衡响应性与非平稳音频下的长期一致性。在端到端实时交互系统中实现,包括在线虚拟角色回放和人形部署工作流,DiscoForcing在匹配因果性和延迟约束下,比先前基线提供更稳定的长期展开和更清晰的音频-运动对齐,同时保持实时吞吐量。

英文摘要

We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly, including tempo shifts, drops, or user edits. Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.

2605.28490 2026-05-28 cs.CV cs.AI

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

SSR3D-LLM: 通过潜在步骤实现结构化空间推理以实现统一3D-LLM中的细粒度定位

Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

AI总结 针对统一3D-LLM中细粒度查询的脆弱性,提出SSR3D-LLM,通过潜在空间推理步骤和几何感知评分器逐步精炼候选排名,在多个基准上取得最优结果。

详情
AI中文摘要

3D物体定位从自然语言中定位3D场景中的所指对象。统一的以实例为中心的3D-LLM旨在同时解决定位、对话、问答和描述任务,但许多方法依赖于单一的指针式定位决策,将关系指令压缩为一个选择。这对于需要根据上下文对象和空间关系排除多个同类候选的细粒度查询来说是脆弱的。我们提出结构化空间推理3D-LLM(SSR3D-LLM),一种用于统一3D-LLM的结构化定位接口。给定固定的Mask3D物体提议,LLM从查询中写出一系列潜在的空间推理步骤和记忆令牌,然后一个几何感知评分器读取这些潜在步骤,通过逐步长度掩码逐步精炼候选排名。潜在步骤从标准基准目标监督和训练期间的辅助指代线索监督中学习,而推理仅使用输入查询和Mask3D提议。在ReferIt3D、ScanRefer和Multi3DRef上,SSR3D-LLM在统一3D-LLM基线中取得了最强结果,在细粒度定位上相比单指针QPG基线有显著提升,并相比先前的统一3D-LLM有一致改进,同时保留了默认的语言任务路径。

英文摘要

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.

2605.28487 2026-05-28 cs.AI cs.LG

ProvMind: Provenance-grounded reasoning for materials synthesis

ProvMind:基于来源的材料合成推理

Yiming Zhang, Ryo Tamura, Koji Tsuda

AI总结 提出MatProcBench基准和ProvMind框架,通过来源图推理实现材料合成中的路线、条件和因果依赖优化,在双OOD分割上达到52.84%准确率。

详情
AI中文摘要

材料工艺优化需要对路线、条件、工具和因果依赖进行推理,然而大多数计算方法将合成过程扁平化为文本或有序步骤。我们引入了MatProcBench,一个基于文献挖掘的MatPROV图构建的来源基准,用于评估七个过程推理任务,涵盖路线连续性、步骤级变量推断和全局因果一致性,在相同分割和偏移感知评估下,包括结合时间与材料类别偏移的严格双OOD分割。我们进一步引入了ProvMind,一个过程记忆推理框架,检索类似训练过程,将其转换为来源感知的选项级兼容性分数,并使用语言模型进行约束最终决策。ProvMind在双OOD分割上达到52.84%的准确率,优于提示、检索增强和监督微调基线。

英文摘要

Materials process optimization requires reasoning over routes, conditions, tools and causal dependencies, yet most computational formulations flatten synthesis procedures into text or ordered steps. We introduce MatProcBench, a provenance-grounded benchmark constructed from literature-mined MatPROV graphs, to evaluate seven process-reasoning tasks spanning route continuity, step-level variable inference and global causal consistency under both same-split and shift-aware evaluation, including a strict dual-OOD split that combines temporal and material-class shift. We further introduce ProvMind, a process-memory reasoning framework that retrieves analogous training processes, converts them into provenance-aware option-level compatibility scores, and uses a language model for constrained final decision making. ProvMind achieves 52.84% accuracy on the dual-OOD split, outperforming prompting, retrieval-augmented and supervised fine-tuning baselines.

2605.28486 2026-05-28 cs.RO

Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation

Mag-VLA:用于双臂磁驱动微机器人操作的视觉-语言-动作模型

Yongchen Wang, Kangyi Lu, Lan Wei, Dandan Zhang

AI总结 提出Mag-VLA模型,利用双臂磁驱动微机器人实现灵巧操作,通过视觉-语言-动作框架和动作分块Transformer解码器,在真实机器人实验中达到90%接近成功率和最高80%运输成功率。

详情
Comments
Accepted by 2026 MARSS
AI中文摘要

磁驱动微机器人已被用作微尺度下的无线、非接触操作工具,使其在微创应用中具有前景。然而,由于间接驱动、有限的传感和非线性磁相互作用,其控制仍然具有挑战性。在这项工作中,我们提出了Mag-VLA,一种用于灵巧磁微机器人操作的视觉-语言-动作(VLA)模型,该模型使用两个装有磁铁的机械臂来构建动态磁场。双臂协调实现了诸如微机器人重新定向等单臂难以或无法完成的功能,但也引入了耦合控制挑战,因为策略必须在共享工作空间内为两个执行器生成协调轨迹。我们的框架采用Qwen2.5-VL-7B骨干网络,使用低秩适配(LoRA)处理视觉观察和语言指令以进行动作预测。为了捕捉任务进展,我们引入了一个运动感知阶段分类器和一个阶段条件的动作分块Transformer(ACT)解码器,用于时间上连贯的多步控制。我们进一步构建了一个遥操作磁微机器人操作数据集,涵盖三种任务配置。消融研究表明,基于ACT的解码器显著优于其他生成式动作头。在真实机器人实验中,Mag-VLA在所有任务中实现了90%的接近成功率,并且随着任务难度增加,运输成功率分别为80%、70%和50%。这些结果表明,层次化VLA建模为磁微机器人操作提供了一个有前景的框架。

英文摘要

Magnetically actuated microrobots have been used as wireless, non-contact manipulation tools at microscales, making them promising for minimally invasive applications. However, their control remains challenging due to indirect actuation, limited sensing, and nonlinear magnetic interactions. In this work, we propose Mag-VLA, a vision-language-action (VLA) model for dexterous magnetic microrobot manipulation using two robotic arms with mounted magnets for dynamic magnetic-field construction. Bimanual coordination enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm, but it also introduces coupled control challenges, as the policy must generate coordinated trajectories for both actuators within a shared workspace. Our framework adapts a Qwen2.5-VL-7B backbone using Low-Rank Adaptation (LoRA) to process visual observations and language instructions for action prediction. To capture task progression, we introduce a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder for temporally coherent multi-step control. We further construct a teleoperated magnetic microrobot manipulation dataset covering three task configurations. Ablation studies show that the ACT-based decoder substantially outperforms alternative generative action heads. In real-robot experiments, Mag-VLA achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. These results demonstrate that hierarchical VLA modeling provides a promising framework for magnetic microrobot manipulation.

2605.28484 2026-05-28 cs.CL

Comonadic Morphophonology: A Compositional Framework for Context-Dependent Morphological Rules in Finnish

共单子形态音系学:芬兰语上下文相关形态规则的组合框架

Yongseok Jang

AI总结 提出一个基于共单子的组合框架,将每个形态音系规则表示为局部上下文到输出音段的函数,并通过Writer共单子实现长度变化规则的严格组合,显著减少规则表示规模并支持双向形态分析。

详情
Comments
13 pages. Accepted at the Society for Computation in Linguistics (SCiL) 2026
AI中文摘要

组合用于上下文相关形态音系规则(辅音渐变、元音和谐、所有格后缀同化)的有限状态转录机(FST)会导致乘法状态爆炸;神经模型规避了该问题,但未提供规则本身的形式化描述。我们提出了第一个框架,其中每个形态音系规则是从聚焦的局部上下文到单个输出音段的函数——类似于元胞自动机的局部规则类型——并且长度变化规则作为共单子的coKleisli箭头进行组合。我们的核心贡献是Writer共单子(DeletionSet x Zipper),一种新的代数构造,恢复了此类规则的严格coKleisli组合性:每个规则是一个coKleisli箭头,extend将其提升为全局变换,删除操作作为幺半群作用累积,而不需要中间物化。作为支持证据,十三个coKleisli箭头提供了一种替代形式化,表达了Omorfi通过874个延续类编码的相同形态音系行为(规则表示层面67:1的缩减),并且相同的抽象支持双向形态学——MorphGenerator重用分析箭头进行生成。在UD Finnish-TDT上,该系统仅使用规则消歧达到83.92%的UPOS准确率(使用外部后缀标注器达到94.66%),验证了该框架作为实用形态引擎的有效性。

英文摘要

Composing finite-state transducers (FSTs) for context-dependent morphophonological rules -- consonant gradation, vowel harmony, possessive suffix assimilation -- leads to multiplicative state explosion; neural models sidestep the problem but provide no formal account of the rules themselves. We present the first framework where each morphophonological rule is a function from a focused local context to a single output segment -- the type of a local rule familiar from cellular automata -- and where length-changing rules compose as coKleisli arrows of a comonad. Our central contribution is the Writer comonad (DeletionSet x Zipper), a new algebraic construction that restores strict coKleisli compositionality for such rules: each rule is a coKleisli arrow, extend lifts it to a global transformation, and deletions accumulate as a monoid action rather than requiring intermediate materialization. As supporting evidence, thirteen coKleisli arrows provide an alternative formulation expressing the same morphophonological behaviors that Omorfi encodes via 874 continuation classes (67:1 reduction at the rule-representation level), and the same abstraction enables bidirectional morphology -- a MorphGenerator reuses the analysis arrows for generation. On UD Finnish-TDT, the system achieves 83.92% UPOS accuracy with rule-only disambiguation (94.66% with an external suffix tagger), validating the framework as a practical morphological engine.

2605.28483 2026-05-28 cs.AI cs.IR

From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints

从学习资源到能力:基于证据和图约束的LLM标签方法

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

AI总结 提出一种端到端对齐流程,利用大语言模型作为受约束的、能产生证据的标签器,将学习资源链接到结构化能力框架,在计算机科学数据集上取得优于基线方法的性能。

详情
AI中文摘要

将学习资源链接到结构化能力框架是实现学习管理系统中基于能力的搜索和课程分析的关键。然而,手动标注劳动密集,全自动方法往往缺乏透明度。在本文中,我们提出了一种端到端对齐流程,使用大语言模型作为受约束的、能产生证据的标签器。LMS资源——包括教学内容和评估——首先被分割成有意义的教学片段。对于每个片段,从基于图上下文增强的结构化能力档案中检索一小部分候选能力。然后,LLM从该集合中选择最相关的能力,并从片段文本中提供支持证据片段。这些预测利用能力图的结构进行细化,并在资源级别聚合。我们在从计算机科学系的能力参考体系(UTC)构建的数据集上评估了我们的方法,该数据集涵盖22个能力,涉及多个课程材料。我们的LLM+BM25+Graph(LBG)流程取得了强劲的结果:片段级微F1为0.57,宏F1为0.50;资源级宏F1为0.51;MRR为0.82——优于零样本和少样本LLM变体、检索/相似性基线以及监督分类器——同时产生更多机械可追踪的证据片段,以支持人工审计和教育分析。

英文摘要

Linking learning resources to a structured competency framework is key to enabling competency-based search and curriculum analytics in Learning Management Systems (LMS). However, manual tagging is labor-intensive, and fully automatic methods often lack transparency. In this paper, we present an end-to-end alignment pipeline that uses a large language model (LLM) as a constrained, evidence-producing tagger. LMS resources (both instructional content and assessments) are first segmented into meaningful pedagogical fragments. For each fragment, a small set of candidate competencies is retrieved from structured competency profiles enriched with graph-based context. The LLM then selects the most relevant competencies from this set and provides supporting evidence spans from the fragment text. These predictions are refined using the structure of the competency graph and aggregated at the resource level. We evaluate our approach on a dataset built from the Computer Science department's competency referential at the Université de Technologie de Compiègne (UTC), covering 22 competencies across multiple course materials. Our LLM+BM25+Graph (LBG) pipeline achieves strong results, with a micro-F1 of 0.57 and macro-F1 of 0.50 at the fragment level, 0.51 macro-F1 at the resource level, and an MRR of 0.82, outperforming zero-shot and few-shot LLM variants, retrieval/similarity baselines, and supervised classifiers, while also producing more mechanically traceable evidence spans to support human auditing and educational analysis.
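For readers unfamiliar with the reported metrics, the following sketch shows how micro-F1, macro-F1, and MRR are typically computed for multi-label tagging. It uses the standard textbook definitions; the competency labels (`C1`, `C2`, `C3`) and the toy predictions are hypothetical and unrelated to the UTC dataset or the paper's evaluation code.

```python
def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels):
    """gold / pred: lists of label sets, one per fragment."""
    counts = {label: [0, 0, 0] for label in labels}   # per-label tp, fp, fn
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Micro: pool the counts over all labels, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1_from_counts(tp, fp, fn)
    # Macro: compute F1 per label, then take the unweighted average.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(labels)
    return micro, macro


def mean_reciprocal_rank(gold, ranked):
    """ranked: one ranked candidate list per fragment; uses the first correct hit."""
    total = 0.0
    for g, candidates in zip(gold, ranked):
        for rank, label in enumerate(candidates, start=1):
            if label in g:
                total += 1.0 / rank
                break
    return total / len(gold)


# Hypothetical toy example with three fragments and three competencies.
labels = ["C1", "C2", "C3"]
gold = [{"C1"}, {"C2", "C3"}, {"C1"}]
pred = [{"C1"}, {"C2"}, {"C3"}]
ranked = [["C1", "C2"], ["C1", "C2", "C3"], ["C3", "C1"]]

micro, macro = micro_macro_f1(gold, pred, labels)
mrr = mean_reciprocal_rank(gold, ranked)
```

Micro-F1 weights every individual decision equally, while macro-F1 weights every competency equally. A gap between the two, as in the reported 0.57 versus 0.50, usually indicates weaker performance on rarer labels.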

2605.28480 2026-05-28 eess.AS cs.SD

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

Audio-Mind: 一种可审计的音频理解智能体框架

Yucheng Wang, Jing Peng, Hanqi Li, Chenghao Wang, Wenming Tu, Yu Xi, Zhaokai Sun, Kai Yu, Shuai Wang

AI总结 提出Audio-Mind框架,通过条件性证据获取动态结合强前端与规划器引导的工具使用,解决音频理解中智能体证据获取的时机问题,在MMAR和MSU-Bench上分别达到80.4%和82.8%的准确率,并生成可审计的推理轨迹。

详情
AI中文摘要

音频智能体通过将音频问题分解为工具调用、中间证据和迭代推理步骤来扩展大型音频语言模型(LALM)。然而,随着LALM变得更强,关键挑战从启用工具使用转变为确定智能体证据获取何时真正有益于音频理解。我们提出Audio-Mind,一个用于音频理解中条件性证据获取的可审计且可插拔框架。Audio-Mind动态结合强前端与规划器引导的工具使用,在初始证据足够时保留前端判断,同时为存在未解决证据差距的问题获取有界的外部证据。在MMAR和MSU-Bench上的实验表明,Audio-Mind优于先前的音频智能体基线,在MMAR上达到80.4%的准确率,在MSU-Bench上达到82.8%的准确率。匹配骨干网络的比较突显了这种设计的重要性:在强音频前端下,如果工作流不保留前端的整体音频基础判断,智能体分解可能成为编排瓶颈。除了准确性,Audio-Mind还产生更高质量、可审计的推理轨迹,暴露不确定性、工具证据和答案理由,为更可靠的音频问答标注和错误分析提供潜在基础。

英文摘要

Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.

2605.28477 2026-05-28 cs.CV

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

SA4Depth: 自监督单目深度估计中一致的姿态-深度尺度对齐

Changxuan Li, Nadine Berner, Nassir Navab, Federico Tombari, Stefano Gasperini

AI总结 提出SA4Depth方法,通过可微的视觉特征重投影和姿态细化,对齐自监督深度估计中深度网络和姿态网络估计的场景尺度,提升深度预测精度且不增加推理时间。

详情
Comments
Accepted by IEEE RA-L 2026
AI中文摘要

从单目序列进行自监督深度估计依赖于深度网络和姿态网络的联合学习。尽管已有大量研究致力于改进深度网络,但对姿态的努力仍然有限。在此背景下,即使深度估计达到尺度级别,我们强调了姿态网络和深度网络估计的场景尺度之间对齐的重要性。然后,我们引入了SA4Depth,一种改善这种对齐并提升深度预测的方法,同时保持推理时间不变。我们提出的方法利用训练期间估计的深度,跨连续帧重投影可学习的视觉特征,并通过减少特征对齐残差来细化姿态估计。通过我们的方法,由独立的深度网络和姿态网络估计的场景尺度得以对齐,并且不同序列之间的预测尺度一致性得到改善。我们的可微细化无缝集成到现有的自监督流程中,并显著改善了它们的深度估计。我们在KITTI、Cityscapes和NYUv2上进行了广泛的室外和室内实验,证明了这一点。此外,KITTI Odometry上的结果证实了我们姿态细化的有效性。我们的代码可在https://github.com/Runningchauncey/SA4Depth获取。

英文摘要

Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth and a pose network. Despite abundant research done to improve the depth network, efforts on the pose remain limited. In this context, even when depth is estimated up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth nets. Then, we introduce SA4Depth, an approach to improve this alignment and boost the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refine the pose estimates by reducing feature alignment residuals. With our method, the estimated scene scales by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth .