arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.18211 2026-05-19 cs.CL cs.AI

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

Luu Huu Phuc, Ratan Bahadur Thapa, Mojtaba Nayyeri, Jingcheng Wu, Evgeny Kharlamov, Steffen Staab

Affiliations: Analytic Computing, KI, University of Stuttgart, Stuttgart, Germany; Bosch Center for Artificial Intelligence, Stuttgart, Germany; WAIS, University of Southampton, United Kingdom

AI summary: This paper proposes GA-S2S, a sequence-to-sequence model that incorporates graph structure by integrating a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve knowledge graph link prediction.

Comments 9 pages, 1 figure, 2 tables. Preprint of a paper accepted at the 5th Workshop on LLM-Integrated Knowledge Graph Generation from Text (TEXT2KG), co-located with ESWC 2026, May 10–14, 2026, Dubrovnik, Croatia

Abstract

We introduce Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve link prediction in knowledge graphs. Existing Seq2Seq models rely solely on surface-level textual descriptions of entities and relations and, at best, flatten the neighborhood of a query entity into a single linear sequence, thereby discarding the inherent graph structure. In contrast, GA-S2S jointly encodes both textual features and the full $k$-hop subgraph topology surrounding the query entity. By integrating raw encoder outputs with RGAT's relation-aware embeddings, our model captures and leverages richer multi-hop relational patterns and textual information. Our preliminary experiments on the CoDEx dataset show that GA-S2S outperforms competitive Seq2Seq-based baseline models, achieving up to a 19% relative gain in link prediction accuracy.
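The abstract refers to the k-hop subgraph around a query entity. As a rough illustration only, the sketch below collects the triples reached by expanding at most k hops from an entity with breadth-first search. The function name, the triple layout `(head, relation, tail)`, and the choice to traverse edges in both directions are assumptions for illustration; this is not the authors' extraction code and it does not include the RGAT encoder.

```python
from collections import deque


def k_hop_subgraph(triples, query_entity, k):
    """Collect triples reached by expanding at most k hops from query_entity.

    Assumption: edges are traversed in both directions, so incoming and
    outgoing relations both count as neighbors.
    """
    # Index every triple under both of its endpoints.
    incident = {}
    for head, relation, tail in triples:
        incident.setdefault(head, []).append((head, relation, tail))
        incident.setdefault(tail, []).append((head, relation, tail))

    depth = {query_entity: 0}
    queue = deque([query_entity])
    kept = set()
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for triple in incident.get(node, []):
            kept.add(triple)
            head, _, tail = triple
            other = tail if head == node else head
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return kept


toy_triples = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]
print(sorted(k_hop_subgraph(toy_triples, "a", 2)))
```

A flattened-sequence baseline would serialize these triples as text, whereas a graph encoder could consume them as edges, which is the contrast the abstract draws.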

2605.18209 2026-05-19 cs.CV cs.AI

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

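Affiliations: National Taiwan University; Delta Robotics Innovation Center

AI summary: This paper proposes SpatioRoute, a dynamic prompt generation method that routes each question to a semantically tailored prompt template. It improves zero-shot spatial reasoning without additional training or 3D sensor input, and reports that Chain-of-Thought prompting degrades performance on Qwen series models in this setting.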

Comments 10 pages, 2 figures, 2nd Workshop on 3D-LLM/VLA, CVPR 2026

Abstract

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template -- without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
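The rule-based mode described in the abstract (SpatioRoute-R) maps question types such as What, Is, How, Can, and Which to prompt templates. A minimal sketch of that idea follows, assuming the question type is simply the first word of the question. The template wording, the fallback template, and the function name `route_prompt` are hypothetical and are not taken from the paper.

```python
# Hypothetical templates keyed by the leading question word.
PROMPT_TEMPLATES = {
    "what": "Situation: {situation}\nName the object or attribute asked about. Question: {question}",
    "is": "Situation: {situation}\nAnswer yes or no. Question: {question}",
    "how": "Situation: {situation}\nAnswer with a count or a short description. Question: {question}",
    "can": "Situation: {situation}\nDecide whether the action is possible and answer yes or no. Question: {question}",
    "which": "Situation: {situation}\nAnswer with a direction or one of the named options. Question: {question}",
}
# Hypothetical fallback for question types without a dedicated template.
DEFAULT_TEMPLATE = "Situation: {situation}\nAnswer briefly. Question: {question}"


def route_prompt(question, situation=""):
    """Deterministically pick a template from the first word of the question."""
    words = question.strip().split()
    first_word = words[0].lower() if words else ""
    template = PROMPT_TEMPLATES.get(first_word, DEFAULT_TEMPLATE)
    return template.format(question=question.strip(), situation=situation)


print(route_prompt("Is the door on my left open?", "I am facing the desk."))
```

Because the mapping is a dictionary lookup, routing needs no video input and no model call, which matches the abstract's description of the rule-based mode as deterministic.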

2605.18202 2026-05-19 cs.LG cs.AI

Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models

Samuele Bortolotti, Emanuele Marconato, Andrea Pugnana, Andrea Passerini, Stefano Teso

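Affiliations: Department of Information Engineering and Computer Science, University of Trento, Italy; CIMeC, University of Trento, Rovereto, Italy

AI summary: This paper proposes COCOCO, a framework that applies conformal prediction to address overconfident label and concept predictions in neuro-symbolic concept-based models, satisfying three desiderata (consistency, coverage, and conciseness) and improving model reliability.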

Abstract

Neuro-Symbolic Concept-based Models (NeSy-CBMs) are a family of architectures that integrate neural networks with symbolic reasoning for enhanced reliability in high-stakes applications. They work by first extracting high-level concepts from the input and then inferring a task label from these concepts in a way that is compatible with given logical constraints. Yet their label and concept predictions can be overconfident, making it difficult for stakeholders to gauge when the model's decisions can be trusted. We address this issue by integrating ideas from Conformal Prediction (CP), a framework providing rigorous, distribution-free coverage guarantees. We formalize three desiderata (consistency, coverage, and conciseness) that any conformal method for NeSy-CBMs should satisfy, and show that existing approaches fall short of at least one. We then introduce COCOCO, a post-hoc framework that conformalizes concepts and labels jointly and reconciles them via a single deduction-abduction revision step. COCOCO satisfies all three desiderata, retains distribution-free coverage, is robust to imperfect knowledge, and supports user-specified size budgets. Our experiments on 8 datasets highlight how COCOCO compares favorably against competitors and natural baselines in terms of performance and set size.
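For readers unfamiliar with conformal prediction, the sketch below shows generic split conformal prediction for a classifier, which is the standard building block the abstract refers to, not COCOCO itself. It assumes exchangeable calibration and test data and uses `1 - probability` as the nonconformity score. The function names and the toy numbers are illustrative assumptions.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the threshold is infinite and every label is kept.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


def prediction_set(label_probs, threshold):
    """Keep every label whose nonconformity score (1 - probability) is within the threshold."""
    return {label for label, prob in label_probs.items() if 1.0 - prob <= threshold}


# Toy calibration scores: 1 - probability assigned to the true label.
threshold = conformal_threshold([0.4, 0.1, 0.3, 0.2], alpha=0.5)
print(threshold)
print(prediction_set({"cat": 0.8, "dog": 0.15, "bird": 0.05}, threshold))
```

The returned set can contain several labels when the model is unsure, which is how set size becomes a quantity worth minimizing alongside coverage.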

2605.18197 2026-05-19 cs.RO cs.AI cs.CV

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

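Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper proposes an active 3D scene graph generation method that uses RGB input only. It unifies perception and planning around a shared structured representation, removes the dependence of earlier methods on dedicated depth sensors, and is validated on the Replica dataset.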

Abstract

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

2605.18194 2026-05-19 cs.AI cs.CV

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

Yajing Zhou, Xiangyu Kong

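Affiliations: College of Computer Science, Beijing Information Science and Technology University

AI summary: This paper studies the two-stage spatial reasoning of multimodal large language models under perceptual bottlenecks and proposes an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought to handle spatial symmetry and viewpoint ambiguity, improving multimodal Theory of Mind performance.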

Comments 17 pages, 3 figures

Abstract

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.
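The abstract asks the model to reason about where A lies in B's local frame and whether A falls inside B's visual frustum. The sketch below computes that geometric relation explicitly in 2D, purely to make the quantity concrete. It is not the paper's module, which is described as avoiding rule-based coordinate transformations. The 90-degree field of view, the hard visual/auditory switch, and all function names are assumptions.

```python
import math


def relative_bearing_deg(observer_xy, observer_heading_deg, target_xy):
    """Signed angle from the observer's facing direction to the target, in [-180, 180).

    Headings are measured counterclockwise from the +x axis, so a positive
    result means the target is to the observer's left.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    absolute = math.degrees(math.atan2(dy, dx))
    return (absolute - observer_heading_deg + 180.0) % 360.0 - 180.0


def dominant_modality(bearing_deg, fov_deg=90.0):
    """Assumed rule: rely on vision inside the field of view, on audio outside it."""
    return "visual" if abs(bearing_deg) <= fov_deg / 2 else "auditory"


# B stands at the origin facing +x; A stands directly behind B.
bearing = relative_bearing_deg((0.0, 0.0), 0.0, (-1.0, 0.0))
print(bearing, dominant_modality(bearing))
```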

2605.18193 2026-05-19 cs.CV cs.GR

Best Segmentation Buddies for Image-Shape Correspondence

Itai Lang, Dongwei Lyu, Dale Decatur, Rana Hanocka

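Affiliations: University of Chicago

AI summary: This paper studies segmentation-to-segmentation correspondence between in-the-wild images and untextured 3D shapes. It links image pixels to 3D shape vertices and computes feature similarity from deep visual features to discover cross-modal correspondences.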

Comments CVPR 2026. Project page: https://threedle.github.io/bsb/

Abstract

Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at https://threedle.github.io/bsb/.
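The abstract defines Best Segmentation Buddies as vertices whose most similar image pixel lies within the image segmentation region. The sketch below implements that selection rule on toy feature vectors. It assumes cosine similarity as the feature similarity and assumes features are given as plain lists; the function names are illustrative and the feature distillation step is out of scope.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_segmentation_buddies(vertex_features, pixel_features, pixel_in_segment):
    """Return indices of vertices whose most similar pixel falls inside the segment mask."""
    buddies = []
    for vertex_index, vertex_feature in enumerate(vertex_features):
        best_pixel = max(
            range(len(pixel_features)),
            key=lambda p: cosine(vertex_feature, pixel_features[p]),
        )
        if pixel_in_segment[best_pixel]:
            buddies.append(vertex_index)
    return buddies


vertices = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[0.9, 0.1], [0.1, 0.9]]
mask = [True, False]  # only the first pixel is inside the image segment
print(best_segmentation_buddies(vertices, pixels, mask))
```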

2605.18192 2026-05-19 cs.CV

View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

Quan Zhang, Zeqiang Cai, Peiming Zhao, Jingze Wu, Cailun Wu, Hongbo Chen, Jianhuang Lai

Affiliations: Sun Yat-sen University, China; Pazhou Lab (HuangPu), Guangdong, China; Guangdong Province Key Laboratory of Information Security Technology, Guangzhou, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China

AI summary: This paper proposes ViSA, a view-aware semantic alignment framework that addresses viewpoint differences in aerial-ground person re-identification. An Expert-driven Token Generation Module and a Dual-branch Local Fusion Module provide cross-view semantic consistency, and experiments report a 10.06% mAP improvement on the CARGO benchmark.

Comments CVPR 2026 POSTER

Abstract

Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, view invariance inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency and comprises an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts. Extensive experiments on three AGPReID benchmarks, AG-ReID.v2, CARGO, and LAGPeR, demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code is available at https://github.com/Cat-Zero/ViSA.

2605.18191 2026-05-19 cs.AI

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

Guining Cao, Jiaxin Peng, Chu Zeng, Yu Zhao, Shuangyong Song, Yongxiang

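Affiliations: Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; School of Software and Microelectronics, Peking University; Tsinghua University

AI summary: This paper proposes PPR-GDE, a reinforcement learning method for open-ended generation. It uses a pairwise preference reward and group-based diversity enhancement to address the verification difficulty, high computational cost, and diversity collapse that conventional methods face on open-ended generation tasks.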

Comments Work in progress

Abstract

Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning with verifiable rewards (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, outcomes that are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), an RL method that is better suited to open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal. It preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group. All of these reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on a role-playing task, and experiments show that it achieves better alignment quality and expressive diversity than strong RL baselines. Further analysis shows that pairwise preference is critical for preference alignment from a subjective perspective, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.
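The abstract mentions mitigating judge position bias through repeated comparisons with swapped response order. One way this could look is sketched below: each pair is judged twice, once in each order, and a response's reward is its average debiased win rate within the group. The judge interface (returning 0 when the first-listed response wins and 1 otherwise), the averaging scheme, and the function names are assumptions, not the paper's exact reward; the diversity reward is not shown.

```python
def debiased_preference(judge, prompt, response_a, response_b):
    """Score for response_a in [0, 1], averaged over both presentation orders.

    `judge(prompt, first, second)` is a hypothetical callable that returns 0
    if the first-listed response is preferred and 1 otherwise.
    """
    verdict_ab = judge(prompt, response_a, response_b)
    verdict_ba = judge(prompt, response_b, response_a)
    wins_for_a = (1 if verdict_ab == 0 else 0) + (1 if verdict_ba == 1 else 0)
    return wins_for_a / 2


def pairwise_rewards(judge, prompt, responses):
    """Average debiased win rate of each response against the rest of its group."""
    n = len(responses)
    rewards = [0.0] * n
    if n < 2:
        return rewards
    for i in range(n):
        for j in range(i + 1, n):
            score = debiased_preference(judge, prompt, responses[i], responses[j])
            rewards[i] += score
            rewards[j] += 1.0 - score
    return [total / (n - 1) for total in rewards]


def toy_judge(prompt, first, second):
    """Toy judge that prefers the longer response; a real judge would be a model."""
    return 0 if len(first) >= len(second) else 1


print(pairwise_rewards(toy_judge, "Stay in character.", ["a", "bbb", "cc"]))
```

A judge that always favors the first-listed response ends up with a score of 0.5 for every pair under this scheme, so pure position bias contributes no preference signal.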

2605.18190 2026-05-19 cs.LG cs.CV

Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network

Grigory Bartosh, David Ruhe, Emiel Hoogeboom, Jonathan Heek, Thomas Mensink, Tim Salimans

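Affiliations: Google DeepMind, Amsterdam; University of Amsterdam

AI summary: This paper proposes Dual-Rate Diffusion, which interleaves a high-capacity context encoder with a lightweight denoising model to accelerate diffusion model inference while preserving sample quality, balancing performance and computational cost on ImageNet benchmarks.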

Abstract

Diffusion models achieve state-of-the-art generative performance but suffer from high computational costs during inference due to the repeated evaluation of a heavy neural network. In this work, we propose Dual-Rate Diffusion, a method to accelerate sampling by interleaving the execution of a heavy high-capacity context encoder and a light efficient denoising model. The context encoder is evaluated sparsely to extract high-dimensional features, which are effectively reused by the light denoising model at every step to refine the sample efficiently. This approach significantly accelerates inference without compromising sample quality. On ImageNet benchmarks, Dual-Rate Diffusion matches the performance of standard baselines while reducing computational cost by a factor of 2 to 4. Furthermore, we demonstrate that our method is compatible with distillation techniques, such as Moment Matching Distillation, enabling further efficiency gains in few-step generation.
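The interleaving described in the abstract can be pictured as a sampling loop in which the heavy context encoder runs only occasionally and the light denoiser runs at every step using the cached context. The sketch below uses toy scalar functions in place of networks. The fixed refresh period `heavy_every`, the callable signatures, and the function name are assumptions; the paper's actual schedule and architecture may differ.

```python
def dual_rate_sample(x, num_steps, heavy_encoder, light_denoiser, heavy_every):
    """Run num_steps denoising steps, refreshing the context every heavy_every steps.

    Returns the final sample and the number of heavy encoder calls.
    """
    context = None
    heavy_calls = 0
    for step in range(num_steps):
        if step % heavy_every == 0:
            # Sparse, expensive call: recompute the context features.
            context = heavy_encoder(x, step)
            heavy_calls += 1
        # Cheap call at every step: refine the sample with the cached context.
        x = light_denoiser(x, step, context)
    return x, heavy_calls


def toy_encoder(x, step):
    """Stand-in for the heavy network: always predicts a clean target of 0.0."""
    return 0.0


def toy_denoiser(x, step, context):
    """Stand-in for the light network: move the sample halfway toward the context."""
    return x + 0.5 * (context - x)


sample, heavy_calls = dual_rate_sample(8.0, 8, toy_encoder, toy_denoiser, heavy_every=4)
print(sample, heavy_calls)
```

With 8 steps and a refresh period of 4, the heavy function runs twice instead of eight times, which is where the compute saving in this toy setup comes from.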

2605.18188 2026-05-19 cs.LG

UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction

Robson W. S. Pessoa, Julien Amblard, Alessandra Russo, Idelfonso B. R. Nogueira

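Affiliations: Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU); Department of Computing, Imperial College London

AI summary: This paper proposes UTOPYA, a framework that fuses eight data modalities through FiLM-conditioned cross-modal attention and gated fusion to jointly address anomaly detection, time-series prediction, and phase classification in batch distillation, combined with a physics-informed regularisation scheme and curriculum learning.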

Abstract

Anomaly detection in batch processes is hindered by transient dynamics, scarce fault labels, and reliance on single-modality sensor data. This work introduces UTOPYA (Unified Temporal Observation for Physics-Informed Anomaly Detection and Time-Series Prediction), a 15.2M-parameter multimodal framework that jointly addresses anomaly detection, time-series prediction, and phase classification in batch distillation by fusing eight data modalities through Feature-wise Linear Modulation (FiLM) conditioned cross-modal attention and gated fusion. A physics-informed regularisation scheme introduced in this work enforces temporal smoothness and thermodynamic monotonicity, while curriculum learning introduces training samples in order of physical difficulty. On the 119-experiment multimodal batch distillation dataset of Arweiler et al. (2026), UTOPYA achieves a window-level test AUROC of 0.832 and an AUROC of 0.874 under multi-signal experiment-level scoring, substantially outperforming four external baselines (PCA, autoencoder, Isolation Forest, and LSTM autoencoder) evaluated under identical conditions (+0.147 window-level AUROC over the best baseline). A multimodal ablation over 15 architectural configurations shows that static context via FiLM conditioning is the key enabler, lifting experiment-level multi-signal AUROC by +0.145 over the unimodal baseline (0.729 to 0.874). Separately, a training ablation across 14 design choices reveals that several widely adopted techniques, including instance normalisation, Mixup, ensembling, test-time augmentation, and stochastic weight averaging, fail to improve or actively degrade generalisation in this data-scarce setting. These negative results expose a fundamental tension between smoothing-based regularisation and anomaly detection, providing practical guidance for multimodal process monitoring deployment.
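Feature-wise Linear Modulation (FiLM), named in the abstract, scales and shifts each feature channel using parameters computed from a conditioning input: output = gamma(context) * feature + beta(context). The sketch below shows that operation on plain lists. The linear maps that produce gamma and beta, the weight values, and the function name are illustrative assumptions and say nothing about how UTOPYA wires FiLM into its attention layers.

```python
def film(features, context, gamma_weights, beta_weights):
    """Apply FiLM: per-channel scale and shift predicted from a context vector.

    gamma_weights and beta_weights hold one row per feature channel, each row
    with one weight per context entry (a bias-free linear map, by assumption).
    """
    gammas = [sum(w * c for w, c in zip(row, context)) for row in gamma_weights]
    betas = [sum(w * c for w, c in zip(row, context)) for row in beta_weights]
    return [g * f + b for g, f, b in zip(gammas, features, betas)]


# Two feature channels conditioned on a one-dimensional static context.
modulated = film(
    features=[1.0, 2.0],
    context=[1.0],
    gamma_weights=[[2.0], [0.5]],
    beta_weights=[[0.0], [1.0]],
)
print(modulated)
```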

2605.18184 2026-05-19 cs.RO cs.AI cs.CV

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

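Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper proposes using fixed external RGB cameras as Common Prior Maps for active, incremental 3D scene graph generation, fusing data from onboard robot cameras and fixed external cameras to improve the efficiency and accuracy of scene understanding.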

Abstract

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras - onboard or external - identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

2605.18181 2026-05-19 cs.AI cs.CL

Scalable Environments Drive Generalizable Agents

Jiayi Zhang, Fanqi Kong, Guibin Zhang, Maojia Song, Zhaoyang Yu, Jianhao Ruan, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo

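Affiliations: HKUST(GZ); DeepWisdom; PKU; NUS; SUTD; UdeM & Mila

AI summary: This paper argues that generalizable agents need scalable environments to adapt to diverse tasks and unseen environments. It identifies the core challenge of environment scaling and proposes a unified taxonomy and construction paradigms for scalable environments.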

Abstract

Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

2605.18177 2026-05-19 cs.CV

Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

Calvin Galagain, Martyna Poreba, François Goulette

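Affiliations: Université Paris-Saclay, CEA List; U2IS, ENSTA Paris, Institut Polytechnique de Paris

AI summary: This paper proposes TokenMask, which computes mask logits directly from query-token affinities and interpolates in logit space, reducing computation and memory requirements while maintaining accuracy and improving segmentation efficiency.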

Comments CVPR, EVW 2026

Abstract

Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
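One way to see why interpolating in logit space can preserve a linear scoring mechanism: a dot product is linear, so scoring tokens first and then linearly interpolating the scalar logits gives the same result as interpolating the feature vectors first and then scoring. The toy 1D sketch below checks this numerically. It assumes plain linear interpolation and dot-product scoring; the real model works on 2D token grids and the helper names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def token_logits(query, tokens):
    """One mask logit per token: the affinity between the query and that token."""
    return [dot(query, token) for token in tokens]


def upsample_linear(values, factor):
    """Linearly interpolate a 1D sequence, placing factor - 1 points between neighbors."""
    out = []
    for left, right in zip(values, values[1:]):
        for step in range(factor):
            weight = step / factor
            out.append((1 - weight) * left + weight * right)
    out.append(values[-1])
    return out


def upsample_features(tokens, factor):
    """Interpolate every feature channel separately (the costlier feature-space route)."""
    channels = list(zip(*tokens))
    upsampled = [upsample_linear(list(channel), factor) for channel in channels]
    return [list(vector) for vector in zip(*upsampled)]


query = [1.0, 2.0]
tokens = [[0.0, 1.0], [2.0, 3.0], [4.0, 1.0]]
logit_space = upsample_linear(token_logits(query, tokens), 2)
feature_space = token_logits(query, upsample_features(tokens, 2))
print(logit_space, feature_space)
```

The logit-space route interpolates one scalar per position for each query, while the feature-space route interpolates every channel, which suggests where compute and memory savings could come from.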

2605.18176 2026-05-19 cs.CV cs.AI

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie

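Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This paper presents MARS, a system for the CASTLE Challenge at EgoVis 2026 that uses multimodal agentic reasoning to answer complex questions requiring multiple information sources. Its core method is multimodal evidence selection, and its main contribution is effective reasoning over multi-source data.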

Comments The Runner-up Solution for CASTLE Challenge @ EgoVis 2026

Abstract

This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heart rate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources (videos and transcripts) and four auxiliary sources (gaze, heart rate, photos, and thermal imagery). Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our code is available at https://github.com/Hyu-Zhang/MARS.
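The inference-time behavior described in the abstract (continue reasoning, request a missing modality, answer, or fall back to a random option) can be sketched as a small control loop. In the sketch below, `decide` stands in for the decision agent and `load_modality` for evidence retrieval; both interfaces, the action names, and the step limit are hypothetical and are not taken from the MARS code.

```python
import random


def answer_question(question, options, decide, load_modality, max_steps=5, rng=None):
    """Loop over agent decisions until an answer is produced or the budget runs out.

    `decide(question, options, evidence)` returns an (action, payload) pair with
    action in {"continue", "request", "answer", "fallback"} (assumed interface).
    """
    rng = rng or random.Random(0)
    evidence = {}
    for _ in range(max_steps):
        action, payload = decide(question, options, evidence)
        if action == "answer":
            return payload
        if action == "request":
            # Fetch the requested modality and add it to the evidence memory.
            evidence[payload] = load_modality(payload)
        elif action == "fallback":
            break
        # "continue" keeps the same evidence and asks the agent again.
    # Evidence stayed insufficient: pick a random option.
    return rng.choice(options)


def toy_decide(question, options, evidence):
    """Toy agent: ask for transcripts once, then answer with option B."""
    if "transcripts" not in evidence:
        return ("request", "transcripts")
    return ("answer", "B")


print(answer_question("Who cooked dinner?", ["A", "B", "C"], toy_decide, lambda name: name + " evidence"))
```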

2605.18175 2026-05-19 cs.SD

Sonalyzer-Moz: A Framework for Analyzing the Structure of Mozart's Sonata Form

Sonalyzer-Moz: 一个用于分析莫扎特奏鸣曲形式结构的框架

Jing Zhao, KokSheik Wong, Vishnu Monn Baskaran, Kiki Adhinugraha, David Taniar

发表机构 * School of Information Technology, Monash University Malaysia(Monash大学马来西亚分校信息科技学院) Department of Computer Science and Information Technology, La Trobe University(La Trobe大学计算机科学与信息技术系) Faculty of Information Technology, Monash University(Monash大学信息科技学院)

AI总结 本文提出Sonalyzer-Moz框架,通过整合特征聚合与序列建模,实现了对奏鸣曲形式结构的自动分析,并建立了首个大规模标注数据集SoSA-Moz,为系统研究奏鸣曲形式提供了基础。

Comments 6 pages, 2 figures

详情
AI中文摘要

奏鸣曲形式是一种音乐丰富且层级结构复杂的形式,对自动分析提出了重大挑战。尽管近年来音乐结构分析取得了进展,但奏鸣曲形式分析仍处于早期阶段。这主要是由于标注古典音乐结构耗时且需要较高的音乐背景知识。为推动该领域研究,我们编制了SoSA-Moz,这是首个大规模数据集,包含全面的层级结构标注。本工作为系统奏鸣曲形式分析奠定了基础。利用这一新贡献的资源,我们进一步提出了Sonalyzer-Moz,一个专门用于研究复杂奏鸣曲结构的基线模型。该框架整合了特征聚合与序列建模,使其能够捕捉局部特征和高层结构依赖性。实验结果表明,Sonalyzer-Moz能够识别对理解奏鸣曲形式至关重要的上层结构组件边界。因此,该方法首次展示了自动上层分析奏鸣曲形式的有效性,并为未来自动理解奏鸣曲形式的研究提供了稳健的基线,同时推进了古典音乐结构分析的研究。

英文摘要

The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has made considerable progress in recent years, sonata form analysis remains in its early stages. This is largely because annotating classical music structures is time-consuming and requires substantial musical background. To advance research in this area, we curated SoSA-Moz, the first large-scale dataset featuring comprehensive hierarchical structure annotations. This work establishes a foundation for systematic sonata form analysis. Leveraging this newly contributed resource, we further propose Sonalyzer-Moz, a baseline model specifically designed for investigating complex sonata structures. This framework integrates feature aggregation with sequential modeling, enabling it to capture both local features and upper-level structural dependencies. Experimental results show that Sonalyzer-Moz can identify the boundaries of the upper-level structural components that are critical to understanding sonata form. This method therefore demonstrates, for the first time, the effectiveness of automatic upper-level analysis of sonata form, and provides a robust baseline for future research in the automatic understanding of sonata form while advancing the study of classical music structure analysis.

2605.18173 2026-05-19 cs.CV

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

你需要文本校正吗?用于无校正场景文本识别的软注意力掩码嵌入

Antonio Colombo, Giovanni Bianchi

发表机构 * School of Information, Polytechnic University of Turin(都灵理工大学信息学院)

AI总结 本文提出了一种新的软注意力掩码嵌入模块(SAME),通过Transformer编码器的全局感受野编码高级特征并计算软注意力权重,然后与预测的掩码进行分层嵌入,生成精细的文本边界感知掩码,从而有效抑制背景噪声。基于该模块,本文提出了一个鲁棒的端到端文本识别框架SAME-Net,无需字符级标注或辅助文本校正模块。

详情
AI中文摘要

端到端场景文本识别,即在一个框架内统一文本检测和识别,已因深度学习的进步而取得显著进展。然而,大多数现有方法仍然受到多尺度变化、任意文本形状和复杂背景干扰导致的不完整掩码提案的影响,从而降低识别准确性。在本文中,我们提出了一种新的软注意力掩码嵌入模块(SAME),该模块利用Transformer编码器的全局感受野来编码高级特征并计算软注意力权重,然后与预测的掩码进行分层嵌入,生成精细的文本边界感知掩码,从而有效抑制背景噪声。基于该模块,我们提出了SAME-Net,一个鲁棒的端到端文本识别框架,无需字符级标注或辅助文本校正模块。由于软注意力机制是完全可微分的,识别损失梯度可以反向传播通过SAME模块到检测分支,从而实现检测和识别目标的联合优化。在具有挑战性的基准测试中进行了广泛的实验,证明了我们方法的有效性:SAME-Net在任意形状的Total-Text数据集上实现了84.02%的端到端H-mean,比之前的最先进方法GLASS在全词典准确率上高出1.02%,且无需额外训练数据;在多方向ICDAR 2015数据集上获得了具有竞争力的83.4%强词典结果。

英文摘要

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.

2605.18165 2026-05-19 cs.LG

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

Elastic-dLLM: 扩散大语言模型的位置保持上下文压缩与增强

Junyi Wu, Tianchen Zhao, Shaoqiu Zhang, Linfeng Zhang, Guohao Dai, Yu Wang

发表机构 * Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学) Infinigence AI

AI总结 本文针对扩散大语言模型中上下文压缩和增强问题,提出了一种位置保持的上下文压缩和终端感知增强方法,以提高解码效率并实现长上下文扩展。

详情
AI中文摘要

与每次只生成一个token的自回归模型不同,dLLMs对一整块[MASK] token进行联合去噪,并在每一步采样一个或多个token;尽管这使并行解码成为可能,但由于被掩码token的块很大,这一过程会带来可观的计算成本。我们观察到,大部分成本花在重复处理前面的上下文以及许多特征表示相同的[MASK] token上,表明存在相当大的计算冗余。在本工作中,我们从[MASK] token的角度重新审视dLLM的冗余性。通过系统分析,我们验证了[MASK] token的冗余性,同时揭示了它们在提供结构信息方面的关键作用。基于这些发现,我们提出了位置保持的[MASK] token压缩和终端感知增强。通过压缩冗余的[MASK]计算,该方法加速了解码,并进一步为受输入长度限制的全序列dLLMs(如LLaDA-8B-Instruct和LLaDA-1.5)提供了一种类似上下文折叠的长上下文扩展的自然途径。此外,对于块式dLLMs(如LLaDA2.0-mini),它通过为上下文增加一个受保护的终端[MASK] token来提升生成质量,且开销可忽略不计。

英文摘要

Unlike autoregressive models, which generate one token at a time, dLLMs denoise a chunk of [MASK] tokens jointly and sample one or more tokens per step; despite enabling parallel decoding, this process incurs substantial computational cost due to the large chunk size of masked tokens. We observe that much of this cost is spent on repeatedly processing the preceding context and many [MASK] tokens with the same feature representations, indicating considerable computational redundancy. In this work, we revisit dLLM's redundancy from the perspective of [MASK] tokens. Through systematic analysis, we verify the redundancy of [MASK] tokens while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. By compressing redundant [MASK] computation, this approach accelerates decoding and further provides a natural extension toward context-folding-like long-context scaling under limited input-length constraints for full-sequence dLLMs such as LLaDA-8B-Instruct and LLaDA-1.5. Moreover, for block dLLMs such as LLaDA2.0-mini, it augments the context with a protected terminal [MASK] token to enhance generation quality with negligible overhead.

2605.18163 2026-05-19 cs.AI cs.CL

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

TRACE: 通过跨层证据进行轨迹修正以减少幻觉

Tej Sanibh Ranade

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出TRACE算法,通过跨层证据在推理时修正LLM中的幻觉,无需训练或标注,通过内部证据选择修正策略,提升多个基准测试的性能。

Comments 25 pages, 8 figures, 4 tables

详情
AI中文摘要

幻觉修正并非单向问题。我们表明,中间层既不总是比最终层更真实,也不总是更不可信。然而,幻觉减少通常通过一种固定的干预形式来实现:将一层与另一层对比、沿真实性方向进行引导,或诉诸外部证据。这种框架在结构上是不完整的。跨层事实证据的演变并不一致:在某些失败案例中,真实的支持在内部存在但随后被压制;而在另一些案例中,候选之间的竞争在各层深度上始终是多方向的,因此任何单一的带符号标量族通常都不够用。我们引入TRACE(Trajectory Correction from Cross-layer Evidence for Hallucination Reduction),一种确定性、免训练的算法,它在LLM自身的前向传递中,根据每个输入的跨层候选轨迹推导出修正层和合适的修正算子,从而在推理时修正幻觉。在一组冻结的超参数设置下,TRACE仅使用模型内部证据,在标量反转、早期状态恢复和候选空间修正之间进行选择。作为单一通用算法,TRACE在15个模型、8个模型家族和3个事实性基准上接受评估,在每个评估单元上都有提升,平均增益为+12.26个MC1点和+8.65个MC2风格点,且无任何退化,最高增益达到+47.20个MC1点和+43.38个MC2风格点。该方法不使用标签、检索、预训练、微调或逐模型校准。

英文摘要

Hallucination correction is not a one-direction problem. We show that intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy. Yet hallucination reduction is usually instantiated through one fixed intervention form: contrast one layer against another, steer along a truthfulness direction, or defer to external evidence. This framing is structurally incomplete. Cross-layer factual evidence does not evolve uniformly: in some failures truthful support is present internally and later suppressed, whereas in others candidate competition remains genuinely multi-directional across depth, so no single signed scalar family is generally sufficient. We introduce Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (TRACE), a deterministic, training-free algorithm which corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input's cross-layer candidate trajectory inside the LLM's own forward pass. Under one frozen hyperparameter setting, TRACE selects among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence. Evaluated as a single universal algorithm across 15 models, 8 model families, and 3 factuality benchmarks, TRACE improves every evaluation cell, yielding mean gains of +12.26 MC1 points and +8.65 MC2-style points with no regressions, with gains reaching +47.20 MC1 and +43.38 MC2-style points. The method uses no labels, retrieval, pretraining, finetuning, or per-model calibration.

2605.18162 2026-05-19 cs.CV cs.AI

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

通过几何逻辑一致性实现视觉语言模型中的自演化空间推理

Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, Piotr Koniusz, Yirong Chen, Ding Wang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The City University of New York(纽约城市大学) The Chinese University of Hong Kong(香港中文大学) Data61 CSIRO(Data61澳大利亚国家科学研究院) University of New South Wales(新南威尔士大学) Australian National University(澳大利亚国立大学)

AI总结 本文提出SAGE框架,通过几何和语言二元操作在视觉语言模型中实现自演化空间推理,提升模型在空间推理任务中的鲁棒性和泛化能力。

Comments 23 pages, 7 figures, 3 tables

详情
AI中文摘要

视觉语言模型(VLMs)在视觉和语言任务上取得了显著进展,但其空间推理能力仍然脆弱:能够正确回答原始输入的模型在面对具有可预测答案映射的配对变换时仍可能失败,揭示了实例级正确性与鲁棒空间推理之间的差距。为此,我们提出空间对齐通过几何演化(SAGE),一种自演化框架,通过几何和语言二元操作在VLMs中强制逻辑一致性。SAGE将二元一致性作为辅助奖励纳入GRPO训练,鼓励模型在原始和变换输入之间产生逻辑一致的答案。一个动态操作池持续探测不一致,促进具有挑战性的操作并淘汰已掌握的操作,使训练聚焦于最有信息量的信号。SAGE具有模型无关性,比先前的GRPO方法更数据高效,并可作为轻量级的后训练阶段应用于任何现有的VLM。在视频和空间推理基准上的实验表明,SAGE在强基线模型上表现一致提升,并增强了对未见数据的泛化能力。

英文摘要

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
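
To make the idea of a duality-consistency reward more concrete, here is a minimal sketch under strong assumptions: the transformation is a horizontal image flip, the answer mapping `FLIP_MAP` and the weight `0.5` are invented for illustration, and the additive combination with a correctness reward is a guess rather than the SAGE formulation.

```python
# Hypothetical answer mapping under a horizontal flip: left and right swap,
# front and behind are unchanged.
FLIP_MAP = {"left": "right", "right": "left", "front": "front", "behind": "behind"}


def duality_reward(ans_orig, ans_flipped, gold, weight=0.5, mapping=FLIP_MAP):
    """Correctness reward plus a weighted bonus for logically consistent answers."""
    correct = 1.0 if ans_orig == gold else 0.0
    expected = mapping.get(ans_orig)  # answer the flipped input should receive
    consistent = 1.0 if expected is not None and ans_flipped == expected else 0.0
    return correct + weight * consistent


# A model that answers "left" on the original and "right" on the flipped image
# is both correct and consistent; answering "left" twice is only correct.
print(duality_reward("left", "right", gold="left"))
print(duality_reward("left", "left", gold="left"))
```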

2605.18156 2026-05-19 cs.CV

Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares

Semi-LAR: 基于线性注意力的半监督对比学习用于夜间光斑去除

Xiyu Zhu, Wei Wang, Kui Jiang, Zhengguo Li

发表机构 * School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China(武汉科技大学计算机科学与技术学院) Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China(哈尔滨工业大学计算机学院) SRO department, Institute for Infocomm Research, A*STAR, Singapore(新加坡资讯通信研究院SRO部门)

AI总结 本文提出了一种半监督对比学习框架,通过联合处理伪标签可靠性与表征判别性,有效缓解了夜间光斑去除中的误差累积问题,并通过实验验证了该框架的模型无关性和性能提升。

详情
AI中文摘要

由于光斑伪影的空间范围大且与场景结构相互纠缠,镜头光斑去除具有挑战性,而现有方法严重依赖大规模配对数据。我们提出了一种半监督光斑去除框架,通过联合处理伪标签可靠性与表征判别性,实现了从未标记图像中稳定学习。我们提出了一种自适应伪标签存储库,通过无参考质量评估、动量更新和无效标签过滤逐步细化伪监督,有效缓解了误差累积。此外,我们提出了一种光斑感知的对比损失,明确将受光斑污染的输入视为负样本,并进行图像块级的对比学习,鼓励表征在区分光斑模式的同时保持与可靠伪目标的一致性。在多个光斑基准上的广泛实验表明,所提出的框架具有模型无关性,并且在性能和鲁棒性方面均表现出一致的提升。

英文摘要

Lens flare removal is challenging due to the large spatial extent of flare artifacts and their entanglement with scene structures, while existing methods heavily rely on large-scale paired data. We propose a semi-supervised flare removal framework that enables stable learning from unlabeled images by jointly addressing pseudo-label reliability and representation discrimination. We propose an adaptive pseudo-label repository that progressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid label filtering, effectively mitigating error accumulation. Moreover, we propose a flare-aware contrastive loss that explicitly treats flare-contaminated inputs as negatives and performs patch-level contrastive learning, encouraging representations that are discriminative against flare patterns while remaining consistent with reliable pseudo targets. Extensive experiments on multiple flare benchmarks demonstrate that the proposed framework is model-agnostic and consistently improves performance and robustness.

2605.18155 2026-05-19 cs.CL

FOL2NS: Generating Natural Sentences from First-Order Logic

FOL2NS:从一阶逻辑生成自然句子

Mei Jia

发表机构 * University of Manchester(曼彻斯特大学)

AI总结 本文提出FOL2NS框架,结合规则模块和微调语言模型,生成合成一阶逻辑公式并转换为自然语言表达,提升了生成样本的多样性和覆盖范围,但面临结构复杂度增加时语义精确性和自然生成的挑战。

Comments 11 pages, 8 figures

详情
AI中文摘要

将形式语言翻译成自然语言是自然语言处理中的基础挑战,推动了语义解析、定理验证和问答等下游应用的发展。在本研究中,我们引入了First-Order Logic to Natural Sentence (FOL2NS),一种神经符号框架,旨在生成合成的一阶逻辑公式并将它们转换为自然人类表达。该框架能够处理深度嵌套结构,具有变化的量词深度(QD),这些结构很少被现有语料库捕捉。通过结合规则驱动模块和微调的语言模型,FOL2NS增强了生成样本的多样性和覆盖范围。在我们的实验中,我们通过字符级分析和整体性能指标系统地评估了该框架的能力。实验结果表明,FOL2NS能够可靠地生成正确格式的模板和流畅的陈述,但在结构复杂度增加时面临实现精确语义表示和自然生成的挑战。

英文摘要

Translating formal language into natural language is a foundational challenge in NLP, driving various downstream applications in semantic parsing, theorem validation, and question answering. In this study, we introduce First-Order Logic to Natural Sentence (FOL2NS), a neurosymbolic framework designed to generate synthetic FOL formulas and convert them into natural human expressions. It handles deeply nested structures with varying quantifier depths (QD), which are rarely captured by existing corpora. By combining rule-driven modules with fine-tuned language models, FOL2NS enhances the diversity and coverage of the generated samples. In our experiments, we systematically evaluate the framework's capabilities through both character-level analysis and overall performance metrics. Experimental results show that FOL2NS can reliably produce well-formed templates and fluent statements, but it faces challenges in achieving precise semantic representations and natural generation as structural complexity increases.

2605.16142 2026-05-19 cs.AI cs.LG

Property-Guided LLM Program Synthesis for Planning

基于属性的LLM程序合成用于规划

André G. Pereira, Augusto B. Corrêa, Jendrik Seipp

发表机构 * Federal University of Rio Grande do Sul(南里奥格兰德联邦大学) University of Oxford(牛津大学) Linköping University(林雪平大学)

AI总结 本文研究了一种基于属性的LLM程序合成方法,通过检查候选程序是否满足形式定义的属性来指导LLM生成更高质量的程序,从而减少生成和评估成本。

详情
AI中文摘要

LLMs在程序合成中取得了令人瞩目的成功,能够发现超越已有方案的程序。然而,这些方法依赖简单的数值评分来反映程序质量,例如解的值或通过的测试数量。由于评分无法说明程序为何失败,系统必须生成并评估大量候选程序,寄希望于其中一些能够成功,从而增加了LLM推理和评估成本。我们研究了一种不同的方法:属性引导的LLM程序合成。我们不在评估之后给程序打分,而是检查候选程序是否满足一个形式化定义的属性。当属性被违反时,我们提前停止评估,并向LLM提供一个具体的反例,准确展示程序是如何失败的。这种反馈大幅减少了程序生成次数和评估成本,并能引导LLM生成更强的程序。我们在PDDL规划领域上评估该方法,要求LLM合成直接(direct)启发函数:每个通过严格改进的转移可达的状态都有一个严格改进的后继。具有这一属性的启发函数可引导爬山算法直接到达目标状态。反例引导的修复循环每次生成一个候选程序,在训练集上检查该属性,并返回第一个违反属性的案例。我们在十个规划领域上使用分布外测试集评估了该方法。合成的启发函数在几乎所有测试任务上实际上都是直接的;与此前最佳的生成方法相比,我们的方法平均每个领域生成的程序数量减少到七分之一,无需搜索即可解决更多任务,并且评估候选程序所需的计算量少了几个数量级。只要问题存在可验证的属性,属性引导的LLM合成就能降低成本并提高程序质量。

英文摘要

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads a hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
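
The property check with early stopping described above might look roughly like the sketch below on an explicit state graph. It is not the authors' implementation: the function name `find_counterexample`, the toy line-graph domain, and the choice to exempt goal states from the check are all assumptions made for illustration.

```python
from collections import deque


def find_counterexample(init, successors, h, is_goal):
    """Return the first non-goal state, reachable via strictly improving
    transitions, that has no strictly improving successor; None if the
    heuristic is direct on this task."""
    seen = {init}
    queue = deque([init])
    while queue:
        s = queue.popleft()
        if is_goal(s):
            continue  # assumption: goal states need no improving successor
        improving = [t for t in successors(s) if h(t) < h(s)]
        if not improving:
            return s  # stop early and report the violating state
        for t in improving:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return None


# Toy task: states 0..5 on a line, goal is state 5.
def successors(s):
    return [t for t in (s - 1, s + 1) if 0 <= t <= 5]


def is_goal(s):
    return s == 5


good = lambda s: 5 - s                                   # distance to goal
bad = lambda s: {0: 3, 1: 2, 2: 1, 3: 2, 4: 1, 5: 0}[s]  # local minimum at 2

print(find_counterexample(0, successors, good, is_goal))  # None
print(find_counterexample(0, successors, bad, is_goal))   # 2
```

In a repair loop, the returned state would be handed back to the LLM as the concrete counterexample for the next candidate program.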

2605.16015 2026-05-19 cs.RO cs.LG

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

通过强化学习实现四旋翼机的自适应外环控制

Vishnu Saj, Sushil Vemuri, Dileep Kalathil, Moble Benedict

发表机构 * Texas A&M University(德克萨斯农工大学)

AI总结 本文提出了一种新颖的自适应控制架构,通过强化学习和残差动力学预测器来提高四旋翼飞行器在动态扰动下的控制性能,实验证明其在现实环境中具有更高的轨迹跟踪精度。

详情
AI中文摘要

深度强化学习(DRL)在四旋翼飞行器控制中通常依赖于领域随机化(DR)进行仿真到现实的转移,导致过于保守的策略难以应对动态扰动。为了解决这个问题,我们提出了一种新的自适应控制架构,能够主动感知并响应即时扰动。首先,我们训练了一个最优的外环策略,然后用残差动力学预测器(RDP)替代其对地面真实扰动数据的依赖。RDP通过仅使用状态和控制动作的历史数据在线估计飞行器所受的外部力和力矩。为了实现无缝的硬件转移,我们引入了数据高效的线性校准桥和在线推力校正机制,利用仅几秒的飞行数据将模拟的潜在空间与现实对齐。在真实世界中对Crazyflie微型四旋翼的验证表明,我们的自适应控制器在严重不确定性下,包括质量变化、不对称载荷和动态悬挂载荷,均显著优于基线方法,保持了精确的轨迹跟踪性能。

英文摘要

Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.

2605.15960 2026-05-19 cs.AI cs.LG

Imperfect World Models are Exploitable

不完美的世界模型是可利用的

Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy

发表机构 * University of Edinburgh(爱丁堡大学) Stanford University(斯坦福大学)

AI总结 本文提出了一种新的强化学习中模型利用的定义,指出世界模型如果暗示某种策略应严格优于另一种策略,而真实环境转移模型却暗示相反,那么该模型就是可利用的。研究通过发展奖励黑客和模型利用的一般理论,证明在大规模策略集上利用本质上是不可避免的,并揭示了安全规划在世界模型中的局限性。

Comments 17 pages, 3 figures, 2 tables; modified (fixed metadata)

详情
AI中文摘要

我们提出了一种新的强化学习中模型利用的定义。非正式地说,如果世界模型暗示一种策略应严格优于另一种策略,而环境的真实转移模型却暗示相反,则该世界模型是可利用的。我们通过类比先前对奖励黑客的描述,但发现相关的不可避免性证明无法转移到利用上。为克服这一障碍,我们发展了一种奖励黑客和模型利用的一般理论,证明在大规模策略集上利用本质上是不可避免的,并得出黑客作为特殊情况的相应结论。不幸的是,我们还发现保证在有限策略集上不可黑客的条件没有对应的防止利用的条件。因此,我们引入了一种放松的利用概念,并推导出一个安全的视野,在其中可以避免利用。总的来说,我们的结果建立了奖励黑客和模型利用之间的正式桥梁,并阐明了世界模型中安全规划的局限性。

英文摘要

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

2605.15239 2026-05-19 cs.LG

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

通过在线策略自我蒸馏减少大语言模型安全对齐的安全部署税

Yu Fu, Longxuan Yu, Haz Sameen Shahgir, Zhipeng Wei, Hui Liu, N. Benjamin Erichson, Yue Dong

发表机构 * International Computer Science Institute(国际计算机科学研究所) Microsoft(微软) Berkeley Lab(伯克利实验室)

AI总结 本文提出了一种名为OPSA的在线策略自我蒸馏方法,通过引入教师翻转率指标,有效减少大语言模型在安全对齐过程中因分布不匹配导致的安全税,实验显示其在不同模型规模上均优于传统方法。

Comments 20 pages, 5 figures

详情
AI中文摘要

安全对齐通常以牺牲推理能力为代价来提高对有害查询的鲁棒性,这种权衡被称为安全税。常见原因是分布不匹配:监督微调在由人类、外部模型或固定的自生成轨迹提供的安全演示上训练目标模型,而不是在从其自身策略采样的轨迹上训练。我们识别出非策略训练不匹配是这一税的第二个来源,并研究了用于安全对齐的在线策略自我蒸馏(OPSA)。模型生成自己的轨迹,并从自身的一个冻结教师副本处接收密集的逐token KL监督,该教师副本以特权安全上下文为条件。由于该教师必须比采样得到的学生轨迹更安全,我们引入了教师翻转率:一个衡量特权上下文将不安全响应转换为安全响应的频率的指标。我们使用此信号来寻找能够激活潜在安全推理、而非仅仅引出看似安全的演示的上下文。在两个推理模型家族和五个模型规模上,OPSA在匹配数据和全参数微调条件下,比非策略自我蒸馏和外部教师蒸馏实现了更优的安全-推理权衡,其中较小模型上的收益最大(R1-Distill-1.5B提升8.85分,Qwen3-0.6B提升5.49分)。这些收益在不同训练集大小和自适应越狱评估中持续存在。token级分析进一步显示,OPSA将更新集中在早期合规决策token附近,提供了一种在保持通用推理的同时改进安全性的机制。

英文摘要

Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce the teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B). The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.
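
One plausible reading of the teacher flip rate is sketched below, assuming each prompt has a binary safety judgment for the response without the privileged context and for the response with it. The function name, the binary labels, and the choice of denominator (only prompts that were unsafe without the context) are assumptions; the abstract does not give the exact formula.

```python
def teacher_flip_rate(base_safe, ctx_safe):
    """Fraction of initially unsafe responses that become safe once the
    privileged safety context is added (hypothetical definition).

    base_safe[i]: True if the response to prompt i WITHOUT the context is safe.
    ctx_safe[i]:  True if the response to prompt i WITH the context is safe.
    """
    if len(base_safe) != len(ctx_safe):
        raise ValueError("both lists must cover the same prompts")
    unsafe = [i for i, safe in enumerate(base_safe) if not safe]
    if not unsafe:
        return 0.0  # nothing to flip
    flipped = sum(1 for i in unsafe if ctx_safe[i])
    return flipped / len(unsafe)


# Three of four prompts were unsafe without the context; two of those flip.
rate = teacher_flip_rate([False, False, True, False], [True, False, True, True])
print(round(rate, 3))  # 0.667
```

Under this reading, candidate contexts could be ranked by their flip rate on a held-out set of harmful prompts.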

2605.13339 2026-05-19 cs.CL cs.AI

Probing Persona-Dependent Preferences in Language Models

探测语言模型中依赖人格的偏好

Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin

发表机构 * MATS EPFL(瑞士联邦理工学院) ETH Zürich(苏黎世联邦理工学院) Eleos AI Research(Eleos AI研究)

AI总结 本文通过训练线性探针来探测语言模型中不同人格下的偏好表示,发现偏好向量在不同人格间具有共享特性,并能通过调整偏好向量来影响模型的输出选择。

Comments 41 pages, 45 figures. Code: https://github.com/oscar-gilg/Preferences. Earlier write-up on LessWrong: https://www.lesswrong.com/posts/pxC2RAeoBrvK8ivMf/models-have-linear-representations-of-what-tasks-they-like-1

详情
AI中文摘要

大型语言模型(LLMs)可以说具有偏好:它们会稳定地优先选择某些任务和输出,而由后训练和系统提示塑造的偏好似乎决定了它们的大部分行为。但模型也可以采用不同的人格,而这些人格的偏好可能截然不同。这在内部是如何实现的?每个人格是运行在各自的偏好机制上,还是底层存在某种共享的东西?我们在Gemma-3-27B和Qwen-3.5-122B的残差流激活上训练线性探针,以预测显示性的成对任务选择,并识别出一个真实的偏好向量:它能跟踪模型偏好在一系列提示和情境下的变化,并且在Gemma-3-27B上,沿该向量进行引导能够因果地控制成对选择。这种偏好表示在不同人格之间基本是共享的:在乐于助人的助手人格上训练的探针能够预测并引导性质不同的人格的选择,其中包括一个偏好与助手偏好负相关的邪恶人格。

英文摘要

Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.

2605.12970 2026-05-19 cs.CL

Leveraging Speech to Identify Signatures of Insight and Transfer in Problem Solving

利用语音识别问题解决中洞察与迁移的特征

Linas Nasvytis, Judith E. Fan

发表机构 * Department of Psychology, Stanford University(斯坦福大学心理学系) Graduate School of Education, Stanford University(斯坦福大学教育研究生院) Department of Computer Science, Stanford University(斯坦福大学计算机科学系)

AI总结 研究通过语音分析探讨解决问题过程中洞察力的表现形式及其对后续问题解决的影响,发现可迁移的洞察力更容易被口头描述。

详情
AI中文摘要

许多问题似乎需要瞬间的灵感来解决。这些突如其来的洞察以什么形式出现,它们又如何影响人们未来处理类似问题的方式?在本研究中,我们要求189名参与者在尝试解决一系列五个“火柴算术”问题时进行出声思考(think aloud)。这些问题要么都依赖同一种非显而易见的解法(Same组),要么每次依赖不同的解法(Different组)。我们的第一个观察是,Same组参与者比Different组参与者进步更快。随后我们利用自然语言处理技术分析参与者的语音,发现Same组参与者进步加快的同时,他们的说话量和说话内容也发生了变化。特别是,他们更可能自发地说出自己正在解决的问题属于哪种类型。综合来看,这些发现表明,可迁移的洞察的一个标志是它可以被口头报告,即使洞察背后的前因仍难以言表。

英文摘要

Many problems seem to require a flash of insight to solve. What form do these sudden insights take, and what impact do they have on how people approach similar problems in the future? In this work, we prompted participants (N = 189) to think aloud as they attempted to solve a sequence of five "matchstick-arithmetic" problems. These problems either all relied on the same kind of non-obvious solution (Same group) or a different kind each time (Different group). Our first observation was that Same participants improved more rapidly than Different participants. We then leveraged techniques from natural language processing to analyze participants' speech, and found that this accelerated improvement for Same participants was accompanied by changes in both how much they spoke and what they said. In particular, they were more likely to spontaneously label the kind of problem they were working on. Taken together, these findings suggest that a hallmark of transferable insights is their accessibility for verbal report, even if the underlying precursors of insight remain difficult to articulate.

2605.12765 2026-05-19 cs.LG

Inference-Time Machine Unlearning via Gated Activation Redirection

通过门控激活重定向实现推理时机器去学习

Vinícius Conte Turani, Otávio Parraga, João Vitor Boer Abitante, Kristen K. Arguello, Joana Pasquali, Ramiro N. Barros, Flavio du Pin Calmon, Christian Mattjie, Rodrigo C. Barros, Lucas S. Kupssinskü

发表机构 * MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil(MALTA机器学习理论与应用实验室,PUCRS,阿雷格里港,巴西) Harvard University(哈佛大学) Kunumi Institute, Brazil(库努米研究所,巴西)

AI总结 本文提出了一种无需训练和梯度的机器去学习方法GUARD-IT,通过在推理时依赖输入的激活引导来消除特定数据集的影响,同时保持模型性能,且在量化部署下仍有效。

详情
AI中文摘要

大型语言模型会记住大量训练数据,这引发了隐私、版权侵犯和安全方面的担忧。机器去学习旨在移除特定遗忘集的影响,同时保持模型性能,理想情况下应近似于在不包含遗忘集的数据上从头重新训练得到的模型。现有方法通过基于梯度的方法更新模型参数来实现这一目标。然而,这些更新计算成本高,会造成不可逆的权重变化,并且在模型为部署而量化时效果下降。一种最近出现的替代方案是激活工程,即在推理期间修改激活以引导模型行为。尽管避开了权重编辑,朴素的激活引导也有自身的失效模式:单一的全局引导向量对每个输入施加相同的干预,导致模型行为出现意外变化。我们引入了通过门控激活重定向实现的推理时去学习方法(GUARD-IT),这是一种无需训练和梯度的方法,通过在推理时进行依赖输入的激活引导来实现去学习。所得到的干预以残差流中保持范数的旋转形式施加,不改变模型权重。在TOFU和MUSE上的实验表明,GUARD-IT在三个模型规模上匹配或超过了12种基于梯度的基线方法,并且是唯一在所有设置中同时保持效用、抑制记忆并避免灾难性崩溃的方法。GUARD-IT还支持无需重新训练的持续去学习,并在参数编辑方法会退化的量化场景下仍然有效。

英文摘要

Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.

2605.12451 2026-05-19 cs.CV

FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation

FuTCR: 未来目标对比与排斥用于持续全景分割

Nicholas Ikechukwu, Keanu Nichols, Deepti Ghadiyaram, Bryan A. Plummer

发表机构 * Boston University(波士顿大学)

AI总结 本文提出FuTCR框架,通过重构表示来解决持续全景分割中区分新增背景类别的挑战,通过像素到区域对比和排斥机制提升新类别性能,实验显示在六个CPS设置和多种数据集规模下,FuTCR在相对新类别全景质量上比最先进的方法提升高达28%,同时保持或提升基础类性能。

Comments Revised author affiliation

详情
AI中文摘要

持续全景分割(CPS)需要能够随时间快速适应新类别的方法。由于该密集预测任务的性质,训练图像可能同时包含已标记和未标记的对象。由于事先对这些未标记对象一无所知,现有方法通常在训练时把所有未标记像素简单地归为单一的“背景”类别。实际上,在训练过程中,它们反复告诉模型所有不同的背景类别都是相同的(即使事实并非如此)。这使得在新类别加入时学习区分不同的背景类别变得困难,因为这些新类别可能需要用到模型此前被告知不重要并因而忽略的信息。因此,我们提出了一种面向未来的对比与排斥(FuTCR)框架,通过在新类别引入之前重构表示来解决这一限制。FuTCR首先对模型预测的掩码进行分组,找出其像素始终被分类为背景但表现出非背景logits的掩码,从而发现高置信度的类未来区域。接着,FuTCR通过像素到区域的对比,从这些未标记区域构建连贯的原型,同时将背景特征推离已知类别的原型,以显式地为未来类别保留表示空间。在六个CPS设置和多种数据集规模上的实验表明,FuTCR在新类别全景质量上相对最先进方法提升高达28%,同时保持或提升基础类别性能,提升幅度高达4%。

英文摘要

Continual Panoptic Segmentation (CPS) requires methods that can quickly adapt to new categories over time. The nature of this dense prediction task means that training images may contain a mix of labeled and unlabeled objects. As nothing is known about these unlabeled objects a priori, existing methods often simply group any unlabeled pixel into a single "background" class during training. In effect, during training, they repeatedly tell the model that all the different background categories are the same (even when they aren't). This makes learning to identify different background categories as they are added challenging since these new categories may require using information the model was previously told was unimportant and ignored. Thus, we propose a Future-Targeted Contrastive and Repulsive (FuTCR) framework that addresses this limitation by restructuring representations before new classes are introduced. FuTCR first discovers confident future-like regions by grouping model-predicted masks whose pixels are consistently classified as background but exhibit non-background logits. Next, FuTCR applies pixel-to-region contrast to build coherent prototypes from these unlabeled regions, while simultaneously repelling background features away from known-class prototypes to explicitly reserve representational space for future categories. Experiments across six CPS settings and a range of dataset sizes show FuTCR improves relative new-class panoptic quality over the state-of-the-art by up to 28%, while preserving or improving base-class performance with gains up to 4%.

2605.12413 2026-05-19 cs.CV

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Panoramic Images

Yuangong Chen, Wai Keung Wong, Jiaxing Li, Ioannis Patras, Xu Zheng

Affiliations The Hong Kong Polytechnic University; Guangzhou University; Queen Mary University of London; HKUST (Guangzhou)

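AI Summary This paper studies the spatial reasoning ability of multimodal large language models (MLLMs) under viewpoint changes. It proposes the PCSR-Bench benchmark, evaluates 14 representative MLLMs across its tasks, reveals a substantial gap in spatial reasoning ability, and explores whether reinforcement learning can narrow that gap.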

Comments 10 pages, 4 figures

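AI Abstract (translated from Chinese)

Multimodal large language models (MLLMs) perform well at visual perception, but their spatial reasoning under viewpoint changes is limited. This paper frames the challenge as Perspective-Conditioned Spatial Reasoning (PCSR) and studies it in 360-degree panoramic images, where broad scene coverage reduces the ambiguity of partial observations but does not remove the need for viewpoint-dependent inference. To evaluate this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs drawn from 2,600 panoramic images across 26 indoor environments. PCSR-Bench comprises eight tasks covering foundational perception (such as object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited field-of-view visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on the foundational relative direction task but falls to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To explore how malleable this gap is, we conduct a reinforcement-learning-based diagnostic study on a 7B-scale model. Reward shaping raises a matched 7B baseline from 31.10% to 60.06%, indicating that PCSR is partially plastic rather than fully immutable. However, the gains are task-selective, sensitive to reward design (including weight allocation and reward formulation), and depend to some extent on the evaluation protocol. These results position PCSR as a key bottleneck for current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.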

English Abstract

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design (including both weight allocation and reward formulation), and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
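
To make the size of the reported gap concrete, the short sketch below recomputes the drops and gains from the figures quoted in the abstract. The accuracy values are taken from the abstract; the helper names `gap_to_reference` and `gains` are assumptions introduced here for illustration.

```python
# Accuracies (in percent) as reported in the abstract.
reported = {
    "relative direction": 57.59,
    "egocentric rotation": 13.49,
    "egocentric distortion": 7.13,
    "open-ended compositional reasoning": 0.64,
}


def gap_to_reference(scores, reference_task):
    """Drop, in percentage points, from the reference task to every other task."""
    ref = scores[reference_task]
    return {
        task: round(ref - acc, 2)
        for task, acc in scores.items()
        if task != reference_task
    }


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent of `before`)."""
    absolute = after - before
    return round(absolute, 2), round(100.0 * absolute / before, 2)


# Perception-reasoning gap relative to the foundational task.
gaps = gap_to_reference(reported, "relative direction")

# Effect of reward shaping on the matched 7B baseline (31.10% -> 60.06%).
rl_abs, rl_rel = gains(31.10, 60.06)
```

On these figures the drop from relative direction is 44.10 points for egocentric rotation, 50.46 for egocentric distortion, and 56.95 for open-ended compositional reasoning, while reward shaping adds 28.96 points, a relative gain of roughly 93% over the baseline.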