arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03976 2026-06-03 cs.CV cs.AI cs.LG q-bio.NC

Formalizing the Binding Problem

Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

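AI summary: This paper formalizes the binding problem with an information-theoretic approach, proposes a probing method that measures binding information in model representations, and runs experiments on Vision Transformers, demonstrating that binding is a key ingredient of strong visual recognition and reasoning.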

Comments Accepted to ICML 2026

Abstract

Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information for features. We may believe that there is not much binding information; after all, misattributing features to the wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

2606.03965 2026-06-03 cs.CL cs.AI

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

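AI summary: Proposes Agentic Chain-of-Thought Steering (ACTS), which uses reinforcement learning to train a controller agent that adaptively selects reasoning strategies and steering phrases during inference, enabling budget-aware strategy control that saves substantial tokens while preserving reasoning quality and supporting a controllable accuracy-efficiency trade-off.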

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
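The control loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `SteeringAction`, `steer`, the strategy labels, and the toy controller and reasoner are all hypothetical stand-ins, and a real controller would be a trained policy rather than a threshold rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SteeringAction:
    strategy: str  # hypothetical labels, e.g. "explore" or "conclude"
    phrase: str    # text that opens the reasoner's next step


def steer(
    controller: Callable[[List[str], int], SteeringAction],
    reasoner: Callable[[str, List[str], str, int], Tuple[str, int]],
    question: str,
    budget: int,
) -> Tuple[List[str], int]:
    """Run the frozen reasoner step by step under a token budget."""
    trace: List[str] = []
    remaining = budget
    while remaining > 0:
        # The controller sees the trace so far and the remaining budget.
        action = controller(trace, remaining)
        # The reasoner continues from the steering phrase (stubbed below).
        step, used = reasoner(question, trace, action.phrase, remaining)
        if used <= 0:
            break  # guard against a reasoner that makes no progress
        trace.append(action.phrase + step)
        remaining -= used
        if action.strategy == "conclude":
            break
    return trace, budget - remaining


def toy_controller(trace: List[str], remaining: int) -> SteeringAction:
    # Assumed rule: wrap up once the budget runs low.
    if remaining <= 20:
        return SteeringAction("conclude", "Therefore, the answer is")
    return SteeringAction("explore", "Let me consider")


def toy_reasoner(question: str, trace: List[str], phrase: str, remaining: int) -> Tuple[str, int]:
    # Stand-in for a frozen LLM: emits a placeholder and spends up to 10 tokens.
    return " ...", min(10, remaining)


trace, used = steer(toy_controller, toy_reasoner, "What is 2 + 2?", budget=35)
```

In this toy run the controller explores while the budget is comfortable and switches to a concluding phrase once 20 or fewer tokens remain, so the loop stays within the budget.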

2606.03883 2026-06-03 cs.AI cs.LG

Reasoning Structure of Large Language Models

Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer

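AI summary: To address the fact that evaluations of large reasoning models can hide different reasoning structures, proposes a logic-puzzle benchmark and a method that converts unstructured traces into verifiable reasoning graphs, and defines a reasoning efficiency metric for quantitatively analyzing reasoning topology.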

Comments Accepted at ICML 2026 and presented at the ICLR 2026 workshop on LLM reasoning

Abstract

Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.

2606.03804 2026-06-03 cs.LG

Easy-to-Use Shielding for Reinforcement Learning

Stefan Pranger, Bettina Könighofer

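AI summary: Proposes the tempestpy library, which integrates formal shield synthesis into the Gymnasium API, lowering the barrier to safe exploration in reinforcement learning, and extends the shielding algorithms to stochastic multiplayer games.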

Abstract

Safe exploration is a key challenge in Reinforcement Learning (RL) that aims to prevent agents from making harmful decisions while exploring their environment. Shielding is one such technique that assumes domain knowledge in the form of an environment model to decide upon action safety. Although well-established, shielding has seen limited adoption in RL due to the lack of accessible end-to-end infrastructure connecting formal shield synthesis with standard RL frameworks. Applying shielding typically requires expertise in formal methods and substantial engineering effort, keeping it outside the typical RL workflow. We address this by extending our shield synthesis tool Tempest into a practical backend for safe RL. Our core contribution is tempestpy, a Python library that integrates Tempest-based shield synthesis directly into the Gymnasium API, allowing shields to be synthesized and deployed within existing RL pipelines. This lowers the barrier to entry for shielding and turns formal safe-exploration methods into a usable component for RL practitioners. We also extend Tempest's algorithmic support to compute sound shields for stochastic multiplayer games, preserving formal safety guarantees. We demonstrate the resulting workflow end to end and evaluate shielded and unshielded RL across multiple environments. To facilitate modeling, we provide symbolic models for MiniGrid and introduce MiniGridSafe, a collection of playground environments designed to make shielding easily accessible and experimentally transparent. MiniGridSafe extends MiniGrid with safety-oriented scenarios featuring probabilistic transitions and additional agents, enabling the study of challenging safety aspects in a simple and intuitive setting.
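The core idea of a shield, using an environment model to decide which actions are safe, can be illustrated with a toy example. The sketch below does not use tempestpy or Tempest, whose interfaces are not described in the abstract; the corridor world, `successors`, `safe_actions`, and `shield` are hypothetical, and the check is only a one-step lookahead rather than formal shield synthesis.

```python
# Toy corridor with cells 0..4; cell 4 is a hazard.
UNSAFE = {4}
ACTIONS = (-1, +1)  # move left or right


def clamp(cell):
    """Keep a cell index inside the corridor."""
    return max(0, min(4, cell))


def successors(state, action):
    """Environment model: the agent moves one cell, or slips and moves two."""
    return {clamp(state + action), clamp(state + 2 * action)}


def safe_actions(state):
    """Actions whose every possible successor avoids the unsafe cells."""
    return [a for a in ACTIONS if successors(state, a).isdisjoint(UNSAFE)]


def shield(state, proposed):
    """Pass a safe action through; otherwise substitute a safe one."""
    allowed = safe_actions(state)
    if proposed in allowed or not allowed:
        # If no action is safe, this toy shield lets the proposal through.
        return proposed
    return allowed[0]


corrected = shield(2, +1)    # moving right from cell 2 could slip into cell 4
unchanged = shield(1, +1)    # moving right from cell 1 cannot reach cell 4
```

In an RL pipeline, a check of this kind would sit between the agent's policy and the environment's step function, so the agent only ever executes actions the model deems safe.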

2606.03798 2026-06-03 cs.RO

Optimal Design and Analytical Modeling of a Soft Fin-Ray Effect Gripper Finger Using the Finite Rigid Elements Method

Sara Adeli, Hassan Sayyaadi

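AI summary: Proposes using the Finite Rigid Elements Method (FREM) to model and optimize a soft Fin Ray Effect (FRE) gripper finger, enabling precise force control for gently grasping delicate agricultural products.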

Abstract

Fin Ray-inspired soft grippers offer a promising solution for gently handling delicate, irregular objects, especially in agriculture. The objective of this research is to design, fabricate, and model a Fin Ray Effect (FRE) soft gripper finger to enable precise force control in future applications. This design aims to gently grasp delicate agricultural products, such as tomatoes, that require both adaptability and accurate force application. To address the inherent challenges of soft robotics, including nonlinear behavior, infinite degrees of freedom, and variable material properties, the Finite Rigid Elements Method (FREM) was employed for modeling. This method preserves analytical accuracy while providing a reliable foundation for the development of a force controller in later stages. A detailed Finite Element Model (FEM) was created using ANSYS, and the analytical results were validated through simulation and experimental testing. The gripper's fingers were optimized based on four key criteria: tip displacement, total deflection, stress distribution, and contact force. The optimal finger configuration includes a length of 30 mm, rib spacing of 10 mm, seven ribs angled at -15 deg, and a rib thickness of 1 mm. Theoretical modeling using the FREM predicted finger deformation with a 3% error, while the ANSYS numerical model achieved 2% error.

2606.03777 2026-06-03 cs.AI cs.CR q-fin.RM

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

Alex Leung, Rex Zhang, Kentaroh Toyoda, SiewMei Loh

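AI summary: This paper proposes the CER framework (control boundary, evidence reconstruction, insurance response) for diagnosing and reconstructing losses caused by generative or agentic AI systems in order to support insurance claims.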

Abstract

AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning. Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery. The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case. Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.

2606.03723 2026-06-03 cs.LG

Compress then Merge: From Multiple LoRAs into One Low-Rank Adapter

Zhengbao He, Ruiqi Ding, Zhehao Huang, Ruikai Yang, Tao Li, Xiaolin Huang

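AI summary: To address the problem that full-parameter merging destroys low-rank structure when merging multiple LoRAs, proposes Compress-then-Merge (CtM), which uses shared-subspace projection to guarantee a strictly rank-$r$ output and outperforms existing single-LoRA baselines.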

Comments Accepted to ICML 2026. Code: https://github.com/ZhengbaoHe/compress-then-merge

Abstract

Low-rank adaptation (LoRA) enables parameter-efficient specialization of foundation models, but the proliferation of task-specific adapters fragments capabilities across many adapters, complicating reuse and deployment. We study the problem of merging $T$ LoRAs into a single rank-$r$ LoRA, thereby preserving the benefits of low-rank structure. Existing Merge-then-Compress pipelines treat the rank constraint as an afterthought: they merge adapters in the full parameter space, then compress the merged result to rank $r$ via truncated SVD. However, full-parameter merging may destroy the low-rank structure, making it difficult for subsequent compression to recover an effective rank-$r$ LoRA. We propose Compress-then-Merge (CtM), a reversed pipeline that enforces the rank-$r$ bottleneck before merging: CtM computes shared $r$-dimensional subspaces using only the LoRA weights to capture cross-adapter common structure, projects each adapter into the shared subspaces to obtain $r\times r$ coordinates, and then applies standard merging rules in this reduced space. CtM guarantees a rank-$r$ LoRA by construction, avoiding post-hoc truncation, and enables efficient computation in the core space spanned by concatenated LoRA factors. Experiments across multiple models and tasks show that CtM consistently outperforms existing single-LoRA-output baselines while narrowing the performance gap to full-parameter merging methods.
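The pipeline in this abstract can be written out in a few lines of algebra. The notation below is an assumption made for illustration and is not taken from the paper: adapter $t$ has factors $B_t$ and $A_t$ of rank $r_t$, the shared subspaces are represented by orthonormal bases $U$ and $V$, and the "standard merging rule" is taken to be a weighted sum with weights $\lambda_t$. How the paper actually computes $U$ and $V$ is not stated in the abstract.

```latex
\begin{aligned}
\Delta W_t &= B_t A_t, \qquad B_t \in \mathbb{R}^{d \times r_t},\; A_t \in \mathbb{R}^{r_t \times k}, \quad t = 1, \dots, T \\
M_t &= (U^\top B_t)(A_t V) \in \mathbb{R}^{r \times r}, \qquad U \in \mathbb{R}^{d \times r},\; V \in \mathbb{R}^{k \times r},\; U^\top U = V^\top V = I_r \\
M &= \sum_{t=1}^{T} \lambda_t M_t \\
\Delta W_{\text{merged}} &= \underbrace{(U M)}_{B}\,\underbrace{V^\top}_{A}, \qquad \operatorname{rank}(\Delta W_{\text{merged}}) \le r
\end{aligned}
```

Under these assumptions the merged update is already factored as a LoRA with inner dimension $r$, so no truncation is needed, and each $M_t$ is computed from small matrix products without forming a $d \times k$ matrix. A Merge-then-Compress pipeline would instead truncate $\sum_t \lambda_t B_t A_t$, whose rank can be as large as $\min(d, k, \sum_t r_t)$.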

2606.03719 2026-06-03 cs.AI

Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier

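AI summary: This paper introduces derivation graphs to represent how do-calculus rules are applied and combined, characterizes the full space of observational and interventional probabilities that are equivalent under the do-calculus, shows that equivalence transformations can be carried out with at most four rule applications, and uses equivalent causal queries to obtain more efficient estimators.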

Comments Accepted at ICML 2026

Abstract

The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, eventually yielding more efficient estimators.
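For readers unfamiliar with the rules this abstract refers to, the three rules of the do-calculus in their standard textbook form (due to Pearl) are reproduced below. The notation is the conventional one and is not taken from this paper: $X, Y, Z, W$ are disjoint sets of variables in a causal DAG $G$, $G_{\overline{X}}$ is $G$ with all edges into $X$ removed, and $G_{\underline{Z}}$ is $G$ with all edges out of $Z$ removed.

```latex
\begin{aligned}
\textbf{Rule 1 (insert/delete observation):}\quad
& P(y \mid do(x), z, w) = P(y \mid do(x), w)
&& \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\textbf{Rule 2 (exchange action/observation):}\quad
& P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
&& \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\textbf{Rule 3 (insert/delete action):}\quad
& P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
&& \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{aligned}
```

Here $Z(W)$ denotes the nodes of $Z$ that are not ancestors of any node of $W$ in $G_{\overline{X}}$. A derivation in the do-calculus is a sequence of such rule applications, which is the object the derivation graphs in this abstract are said to represent.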

2606.03686 2026-06-03 cs.AI

The DeepSpeak-Agentic Dataset

Sarah Barrington, Maty Bohacek, Hany Farid

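AI summary: This paper presents DeepSpeak-Agentic, a dataset of 37 hours of semi-structured human-agent conversation videos, for evaluating automatic forensic identification of AI agents, studying the characteristics of human-agent interaction, and serving as a benchmark for large language models and AI-generated voice and face technology.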

Abstract

We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.

2606.03682 2026-06-03 cs.RO

GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language Navigation

Xinhai Li, Xiaotao Zhang, Yuehao Huang, Jiankun Dong, Tianhang Wang, Sunyao Zhou, Yunzi Wu, Chengnuo Sun, Yunfei Ge, Qizhen Weng, Chi Zhang, Chenjia Bai, Xuelong Li

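AI summary: Proposes the unified GN0 framework, which combines the automatically generated large-scale navigation dataset GN-Matrix, a high-fidelity 3DGS-based simulation platform, the BEV benchmark GN-Bench, and the RL-driven navigation foundation model BAE to surpass existing methods on VLN tasks.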

Abstract

Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained Vision-and-Language Navigation (VLN) systems' generalization and long-horizon capabilities. To address this, we curate diverse 3D scenes and develop an automated pipeline for large-scale navigation data, resulting in the GN-Matrix dataset. Building on a 3D Gaussian Splatting (3DGS) engine, we introduce a high-fidelity simulation platform supporting interactive roaming and collision-aware navigation. We further propose GN-Bench, the first BEV-based benchmark incorporating dynamic 3DGS avatars for human-robot interaction evaluation. To leverage the simulator, we develop an RL-driven navigation foundation model, Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states, breaking narrow expert-centric distributions and enabling downstream RL exploration. This unified VLN paradigm integrates map-based and map-free tasks, including instruction following, human following, and goal navigation. GN-BAE formalizes high-fidelity 3DGS-rendered Bird's Eye View representations as compact memory, unlocking latent spatial reasoning in VLMs. Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods. Overall, GN-Matrix offers a unified framework spanning data, simulation, and learning, advancing embodied navigation in research and industrial applications.

2606.03678 2026-06-03 cs.AI

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Deng, Jian Sun, Wei Ma

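AI summary: Proposes EvoDrive, the first automated LLM-based agentic evolution framework, which uses a simulator-grounded actor-critic architecture and a Pareto archive to perform multi-objective optimization of adversariality and realism in safety-critical scenario generation.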

Abstract

Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.
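A Pareto archive of the kind mentioned in this abstract keeps only candidates that no other candidate beats on every objective. The sketch below is a generic illustration under assumed conventions, not EvoDrive's code: each candidate is a hypothetical `(adversariality, realism)` score pair, both objectives are maximized, and the function names are made up for the example.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Return a new archive that stays mutually non-dominated."""
    # Reject the candidate if an archived point dominates or duplicates it.
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive
    # Otherwise evict everything the candidate dominates and add it.
    return [kept for kept in archive if not dominates(candidate, kept)] + [candidate]


# Hypothetical (adversariality, realism) scores for five generated scenarios.
scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.6, 0.6)]

archive = []
for score in scores:
    archive = update_archive(archive, score)
```

After this run the archive holds three trade-off points: a highly adversarial but unrealistic scenario, a realistic but mild one, and a balanced one. Collapsing the two objectives into a single scalar would typically keep only one of these, which is the failure mode the abstract attributes to unconstrained agents.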

2606.03566 2026-06-03 cs.CV cs.AI

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

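AI summary: Proposes a method based on SwinUNETR and localized patch sampling for automatic segmentation of the lateral ventricle choroid plexus in multiple sclerosis, achieving a higher Dice coefficient than existing models while cutting computation by 99%.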

Abstract

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
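The two headline numbers in this abstract, the Dice Similarity Coefficient and the reduction in GFLOPs, follow from simple formulas. The sketch below uses the standard definition DSC = 2|P ∩ T| / (|P| + |T|); the tiny voxel sets are made-up toy data, and representing a mask as a set of voxel coordinates is a choice made here for brevity.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


# Toy masks: three predicted voxels, three ground-truth voxels, two shared.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
truth = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
score = dice(pred, truth)  # 2 * 2 / (3 + 3)

# Relative reduction in compute, using the GFLOPs figures from the abstract.
reduction = 1 - 91.8 / 22080

print(f"DSC = {score:.3f}, GFLOPs reduction = {reduction:.1%}")
```

The computed reduction is consistent with the roughly 99% figure reported in the abstract.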

2606.03540 2026-06-03 cs.CV

Attend to Anything: Foundation Model for Unified Human Attention Modeling

Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

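AI summary: Proposes the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across image, video, and audio-visual tasks through hierarchical language prompts and hyperbolic-space embeddings, improving on 16 benchmarks by an average of 6% and speeding up video inference by about 4x.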

Comments Accepted to ICML 2026

Abstract

Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models remain predominantly scene-dependent and task-specific, failing to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.
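For reference, the Fokker-Planck equation in one common form, with drift field $\mu$ and a constant scalar diffusion coefficient $D$, is given below. Reading $p(x, t)$ as an attention (saliency) density over image location $x$ at frame time $t$ is an interpretation assumed here; the abstract does not state the exact form or parameterization the paper uses.

```latex
\frac{\partial p(x, t)}{\partial t}
  = -\nabla \cdot \bigl( \mu(x, t)\, p(x, t) \bigr) + D\, \nabla^{2} p(x, t)
```

The first term transports the density along the drift and the second spreads it out. Under no-flux boundary conditions the total mass $\int p(x, t)\, dx$ is conserved, so a normalized map stays normalized as it evolves from frame to frame.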

2606.03509 2026-06-03 cs.CV

EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang

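AI summary: Proposes the EvoMemNav framework, which builds a visual-semantic memory graph, applies a budgeted coarse-to-fine policy, and adds a reflection-driven write-back mechanism to provide efficient, self-evolving, fine-grained memory for zero-shot embodied navigation, improving multi-instance disambiguation and stop verification.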

Comments Preprint

Abstract

Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.

2606.03503 2026-06-03 cs.AI

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen

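AI summary: Proposes the ThoughtFold framework, which uses fine-grained preference learning to penalize redundant exploration and encourage directly connecting essential reasoning segments, folding reasoning chains into more concise paths and sharply reducing token usage while maintaining accuracy.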

Abstract

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and error and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

2606.03471 2026-06-03 cs.AI cs.MA q-bio.NC

A formal definition and meta-model for a machine theory of mind

机器心智理论的正式定义与元模型

Fabio Cuzzolin

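AI summary Drawing on evidence from cognitive psychology, neuroscience, and artificial intelligence, this paper proposes the first rigorous formal definition of a machine theory of mind and builds a holistic meta-model to examine existing research and drive future breakthroughs.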

Comments 48 pages, 2 figures

详情
AI中文摘要

本文首次提出了机器心智理论概念的严格形式化定义,该定义基于认知心理学、神经科学和人工智能证据支持的原则,并以此作为视角审视该领域的最新进展和当前努力,推动进一步研究以“破解”该问题的潜在议程。本文还提出了一个通用的整体机器心智理论元模型,并考察了在经验基准测试此类模型方面的最新进展。

英文摘要

This paper proposes, for the first time, a rigorous formal definition of the concept of Machine Theory of Mind, based on principles supported by evidence from cognitive psychology, neuroscience and artificial intelligence. It uses this definition as a lens to examine the state of the art and current efforts in the field, and to outline a potential agenda for further research that could "crack" the problem. It also advances a general holistic meta-model for Machine Theory of Mind, and examines the state of the art in empirically benchmarking such models.

2606.03460 2026-06-03 cs.CV

From 3D Perception to Safety Reasoning: A Graph-Based Framework for Real-Time Underground Mine Monitoring

从3D感知到安全推理:基于图的实时地下矿井监控框架

Pasindu Ranasinghe, Simit Raval, Dibyayan Patra, Bikram Banerjee, Ismet Canbulat

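AI summary Proposes a continuous monitoring framework combining 3D semantic perception, uncertainty-based anomaly detection, rule-based checks, on-device LLM reasoning, and GraphRAG-based memory analysis, using scene and temporal graphs for structured safety reasoning; it reaches 93% coverage across 115 hazard scenarios and 92.7% perception accuracy.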

详情
AI中文摘要

地下煤矿开采要求人员和重型设备在共享、受限且照明不良的空间中作业,其中设备接近违规、结构不稳定和遮挡盲区等危险难以预测。传统监控系统(包括固定摄像头和基于规则的接近警报)可以检测预定义事件,但缺乏识别复杂或演变危险所需的3D场景理解和上下文记忆。本文提出一个连续监控框架,将彩色3D点云转换为结构化和可追溯的安全推理输出。该框架结合了3D语义感知、基于不确定性的异常检测、基于规则的危险检查、设备端LLM推理和基于GraphRAG的记忆分析,以识别即时危险并解释长期安全模式。场景图和时序图作为显式知识结构,连接推理阶段的感知输出。为克服标记地下数据的稀缺性,结合真实巷道扫描、受控物体放置和高保真长壁模拟生成多样化的危险场景,同时自监督预训练从有限标注中改进分割。感知模型在30 FPS下达到92.7%的准确率,内存使用低。在115个危险场景中,基于规则的检查覆盖率为57%,结合上下文LLM推理提高到76%,使用基于历史记录的记忆推理达到93%。定性结果表明,不确定性衍生的异常信号支持对超出预定义类别的分布外危险进行解释。总体而言,基于图的知识表示结合3D感知和分层安全推理,为地下矿井监控中的智能决策支持提供了实用基础。

英文摘要

Underground coal mining requires personnel and heavy equipment to operate within shared, confined, and poorly illuminated spaces where hazards such as equipment proximity violations, structural instabilities, and occluded blind spots are difficult to anticipate. Conventional monitoring systems, including fixed cameras and rule-based proximity alerts, can detect predefined events but lack the 3D scene understanding and contextual memory needed to identify complex or evolving hazards. This paper presents a continuous monitoring framework that converts colourised 3D point clouds into structured and traceable safety reasoning outputs. The framework combines 3D semantic perception, uncertainty-based anomaly detection, rule-based hazard checks, on-device LLM reasoning, and GraphRAG-based memory analysis to identify immediate hazards and interpret longer-term safety patterns. Scene and temporal graphs serve as the explicit knowledge structure, linking perception outputs across reasoning stages. To overcome the scarcity of labeled underground data, real roadway scans, controlled object placement, and high-fidelity longwall simulation were combined to generate diverse hazard scenarios, while self-supervised pretraining improved segmentation from limited annotations. The perception model achieved 92.7% accuracy at 30 FPS with low memory usage. Across 115 hazard scenarios, rule-based checks achieved 57% coverage, increasing to 76% with contextual LLM reasoning and 93% with memory-based reasoning using historical records. Qualitative results show uncertainty-derived anomaly signals support the interpretation of out-of-distribution hazards beyond predefined classes. Overall, graph-based knowledge representation combined with 3D perception and layered safety reasoning provides a practical foundation for intelligent decision support in underground mine monitoring.

2606.03421 2026-06-03 cs.RO

Reliability-Guided Depth Fusion for Glare-Resilient Navigation Costmaps

基于可靠性引导的深度融合用于抗眩光导航代价地图

Shang-En Tsai

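AI summary To address depth-measurement noise caused by surfaces such as reflective floors and glass boundaries, proposes a costmap construction method based on explicit depth-reliability modeling: DRM-Net predicts per-pixel reliability and a weighted-and-gated fusion mechanism suppresses erroneous occupancy updates; experiments show it reduces false obstacles while maintaining real-time performance.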

详情
AI中文摘要

反光地面、玻璃边界和光滑室内表面上的镜面眩光经常破坏主动立体RGB-D深度测量,产生空洞和尖峰,这些空洞和尖峰在占据栅格代价地图中累积为持久的幻影障碍物。本文提出一种基于显式深度可靠性建模的抗眩光代价地图构建方法。轻量级深度可靠性地图网络(DRM-Net)预测镜面干扰下的逐像素测量可信度,可靠性引导的加权门控融合(RGF)机制在损坏的测量值累积到地图之前调节占据更新。为了支持鲁棒的训练和评估,该方法使用姿态对齐的多视图参考深度构建来减少循环监督偏差,并通过融合变体消融、参数敏感性分析、跨条件测试、配对导航比较、可靠性地图指标和嵌入式运行时分析进行评估。在配备Intel RealSense D435和Jetson Orin Nano的真实移动机器人平台上的实验表明,所提方法减少了虚假障碍物插入,改善了自由空间保留,并在反光地板、玻璃墙和自然光眩光条件下保持实时吞吐量。这些结果支持将眩光视为测量可靠性问题,而不是密集深度补全问题,用于安全关键的室内导航。

英文摘要

Specular glare on reflective floors, glass boundaries, and glossy indoor surfaces frequently corrupts active-stereo RGB-D depth measurements, producing holes and spikes that accumulate as persistent phantom obstacles in occupancy-grid costmaps. This paper presents a glare-resilient costmap construction method based on explicit depth-reliability modeling. A lightweight Depth Reliability Map network (DRM-Net) predicts per-pixel measurement trustworthiness under specular interference, and a reliability-guided weighted-and-gated fusion (RGF) mechanism modulates occupancy updates before corrupted measurements are accumulated into the map. To support robust training and evaluation, the method uses pose-aligned multi-view reference-depth construction to reduce circular-supervision bias and is evaluated through fusion-variant ablations, parameter-sensitivity analysis, cross-condition tests, paired navigation comparisons, reliability-map metrics, and embedded runtime profiling. Experiments on a real mobile robotic platform equipped with an Intel RealSense D435 and a Jetson Orin Nano show that the proposed method reduces false obstacle insertion, improves free-space preservation, and maintains real-time throughput under reflective-floor, glass-wall, and natural-light glare conditions. These results support treating glare as a measurement-reliability problem rather than as a dense depth-completion problem for safety-critical indoor navigation.
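The weighted-and-gated idea can be illustrated with a standard log-odds occupancy update. This is a minimal sketch under stated assumptions: the gate threshold, the log-odds increments, the clamping bounds, and the function name `fuse_measurement` are hypothetical and do not come from the paper, which does not specify the RGF update rule in the abstract.

```python
def fuse_measurement(log_odds, hit, reliability,
                     gate=0.3, l_occ=0.85, l_free=-0.4,
                     l_min=-4.0, l_max=4.0):
    """Update one occupancy cell with a reliability-weighted, gated step.

    log_odds:    current log-odds occupancy of the cell
    hit:         True if the depth ray ends in this cell, False if it passes through
    reliability: predicted trustworthiness of the measurement, in [0, 1]
    All numeric defaults are illustrative assumptions.
    """
    # Gate: discard measurements whose reliability is below the threshold,
    # so glare-corrupted depth never reaches the map.
    if reliability < gate:
        return log_odds
    # Weight: scale the usual log-odds increment by the reliability.
    delta = l_occ if hit else l_free
    updated = log_odds + reliability * delta
    # Clamp so that cells stay responsive to later evidence.
    return max(l_min, min(l_max, updated))
```

With these defaults, a hit with reliability 0.1 leaves the cell unchanged, while a fully reliable hit adds the whole occupied increment.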

2606.03418 2026-06-03 cs.CV

IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection

IDO: 面向多模态假新闻检测的不一致性感知分布优化

Hengyang Zhou, Rongman Hong, Yuxuan Zhou, Jing Wang, Zhaoyan Pan

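AI summary Proposes Incongruity-aware Distribution Optimization (IDO), which models factual incongruity and modality incongruity to improve multimodal fake news detection.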

Comments Accepted by GlobalSouthML@ICML 2026

详情
AI中文摘要

多模态假新闻检测旨在识别新闻的真实性。现有的多模态假新闻检测方法主要关注跨模态一致性,但往往未能明确建模欺骗性多模态内容中存在的语义不一致性。然而,虚假信息通常包含与事实不符的语义信息。为了解决这些挑战,我们提出了不一致性感知分布优化(IDO),从事实不一致性和模态不一致性的角度提高假新闻检测的性能。对于事实不一致性,我们引入通道级重加权策略以获得语义判别性嵌入,并利用高斯分布建模由事实不一致性引起的不确定性相关性。对于模态不一致性,我们利用不一致性对比学习来学习跨模态语义信息。实验表明,IDO达到了最先进的性能。

英文摘要

Multimodal fake news detection aims to determine the authenticity of news. Existing multimodal fake news detection methods mainly focus on cross-modal consistency, but often fail to explicitly model the semantic incongruity that characterizes deceptive multimodal content. Moreover, misinformation often contains semantic information that is incongruous with the facts. To address these challenges, we propose Incongruity-aware Distribution Optimization (IDO) to improve the performance of fake news detection from the perspectives of factual incongruity and modality incongruity. For factual incongruity, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings and use a Gaussian distribution to model the uncertain correlation caused by factual incongruity. For modality incongruity, we use incongruity contrastive learning to learn cross-modal semantic information. Experiments demonstrate that IDO achieves state-of-the-art performance.

2606.03412 2026-06-03 cs.CL

Lexicons and grammars for language processing: industrial or handcrafted products?

语言处理的词典和语法:工业产品还是手工制品?

Eric Laporte

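AI summary Analyzes two trends in building lexicons and grammars, handcrafting and automated industrial production, and discusses which approach, or a combination of both, gives the best results.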

详情
Journal ref
Léxico e gramática: dos sentidos à construção da significação, Cultura acadêmica, 2009, Trilhas Lingüísticas, 16, pp.51-84
AI中文摘要

近年来,语言数据在语言处理中的应用逐渐增加。这些数据现在通常被称为语言资源。用于此目的的大多数语言资源是文本集合,如布朗语料库和宾州树库,但电子词典(WordNet、FrameNet、VerbNet、ComLex、词典-语法...)和形式语法(TAG...)也在最近得到发展。词典和语法的大多数构建过程是手动的,而语料库的构建则一直高度自动化。然而,越来越多的语言处理专家认识到,词典和语法的信息内容比语料库更丰富,因此前者可以实现更精细的处理。构建时间的差异可能与信息内容的差异有关:语言学家手工制作词典和语法可能使其比自动生成的数据更具信息性。这种情况可能向两个方向发展:要么语言技术专家逐渐习惯于处理手工构建的资源,这些资源更具信息性且更复杂;要么词典和语法的构建过程被自动化和工业化,这是主流观点。两种演变都在进行中,并且它们之间存在紧张关系。语言学家和计算机科学家之间的关系取决于这些演变的未来,因为前者需要培训和雇佣大量语言学家,而后者主要依赖于计算机工程师提出的解决方案。本文旨在分析所讨论的语言资源的实际例子,并讨论手工制作或工业生成,或两者结合,哪种趋势能产生最佳结果或最现实。

英文摘要

In recent years, the use of linguistic data for language processing has increased progressively. Such data are now commonly called language resources. Most of the language resources used for this purpose are collections of texts such as the Brown Corpus and the Penn Treebank, but electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) have also been developed recently. Most processes of construction of lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more specialists in language processing realize that the information content of lexicons and grammars is richer than that of corpora, and hence the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve in two directions: either specialists in language technology get progressively used to handling manually constructed resources, which are more informative and more complex, or the process of construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the other depends essentially on solutions elaborated by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss which of the two trends, handcrafting or generating industrially, or a combination of both, can give the best results or is the most realistic.

2606.03406 2026-06-03 cs.CV

SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

SAMatcher: 基于Segment Anything的共视性建模用于鲁棒特征匹配

Xu Pan, Qiyuan Ma, Mingyue Dong, He Chen, Wei Ji, Xianwei Zheng

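AI summary Proposes the SAMatcher framework, which uses co-visibility modeling to predict co-visible region masks and bounding boxes as structured priors; built on the Segment Anything Model with a symmetric cross-view interaction mechanism and a unified supervision scheme, it substantially improves feature matching under large viewpoint and scale changes.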

Comments 14 pages

详情
AI中文摘要

可靠的对应估计是图像处理中的一个基本问题,支撑着运动恢复结构、视觉定位和图像配准等应用。现有的基于学习的方法显著改进了局部特征表示,但大多数仍在像素或块级别操作,缺乏对跨视图共同可见区域的显式建模。我们提出了SAMatcher,一个通过共视性建模进行对应估计的特征匹配框架。SAMatcher不直接匹配局部特征,而是首先预测共视区域掩码和边界框作为对应估计的结构先验。基于Segment Anything Model (SAM),它引入了一种对称的交叉视图交互机制,实现了双向特征交换和跨视图语义对齐。我们进一步开发了一个统一的监督方案,通过掩码学习、边界框回归和掩码-边界框一致性约束联合优化掩码预测和边界框定位。在具有挑战性的基准上的大量实验表明,与现有的匹配流程相比,特别是在大视角和尺度变化下,性能有显著提升。我们的结果表明,最初为单目分割设计的基础模型可以通过显式的共视性建模有效地扩展到多视图对应推理,为图像匹配的结构化表示学习提供了新的视角。代码和项目页面:https://xupan.top/Projects/samatcher

英文摘要

Reliable correspondence estimation is a fundamental problem in image processing, underpinning applications such as Structure from Motion, visual localization, and image registration. Existing learning-based methods have significantly improved local feature representations, yet most still operate at the pixel or patch level and lack explicit modeling of regions that are jointly visible across views. We propose SAMatcher, a feature matching framework that formulates correspondence estimation through co-visibility modeling. Instead of directly matching local features, SAMatcher first predicts co-visible region masks and bounding boxes as structured priors for correspondence estimation. Built upon the Segment Anything Model (SAM), it introduces a symmetric cross-view interaction mechanism that enables bidirectional feature exchange and cross-view semantic alignment. We further develop a unified supervision scheme that jointly optimizes mask prediction and box localization through mask learning, box regression, and mask-box consistency constraints. Extensive experiments on challenging benchmarks demonstrate substantial improvements over existing matching pipelines, particularly under large viewpoint and scale variations. Our results show that foundation models originally designed for monocular segmentation can be effectively extended to multi-view correspondence reasoning through explicit co-visibility modeling, offering a new perspective on structured representation learning for image matching. Code and project page: https://xupan.top/Projects/samatcher

2606.03374 2026-06-03 cs.RO

eMEM: A Hybrid Spatio-Temporal Memory System For Embodied Agents

eMEM:一种面向具身智能体的混合时空记忆系统

A. Haroon Rasheed, Maria Kabtoul

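AI summary Proposes eMEM, a hybrid graph-based memory system whose multi-index architecture and tiered consolidation pipeline give embodied agents efficient memory retrieval across space, time, and semantics; it reaches a weighted mean score of 80.8 on a benchmark built over ProcTHOR-10K scenes.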

详情
AI中文摘要

我们提出eMEM(具身记忆),一种基于混合图的记忆系统,用于在物理环境中运行的具身智能体。当前的智能体记忆架构,如Generative Agents、MemGPT和A-MEM,将记忆视为文本流或知识图谱,但具身智能体需要同时能够按意义、空间和时间进行搜索的记忆。eMEM通过一个统一在单一图模型背后的多索引架构(用于结构化存储的SQLite、用于近似最近邻语义搜索的hnswlib以及用于空间查询的R-tree)填补了这一空白。一个分层整合管道将原始感知观察转化为压缩摘要,模仿生物系统中海马体-新皮层的整合。十个面向智能体的回忆工具暴露了记忆检索原语,包括概念到位置的解析和跨层回忆,作为LLM工具调用的第一类操作。该系统完全嵌入式,与智能体在同一进程中运行。此外,我们引入了eMEM-Bench v1,这是一个我们在ProcTHOR-10K场景上构建的用于具身记忆评估的基准。该基准明确围绕八个认知心理学范式(DRM诱饵、模式分离、模式完成、源监控、上下文依赖检索、长时程干扰、序列位置和增强保留曲线)组织,每个范式都经过选择,使得结果能够对照人类和先前智能体记忆系统的更广泛记忆系统文献进行解释;这是像LoCoMo或OpenEQA这样的表面任务基准无法提供的诊断水平。eMEM在988个探针上获得80.8加权平均分,在模拟延迟从1小时到1年的房间独特项目上保持平稳的保留曲线。我们表明,纯RAG基线(flat_rag消融)在上下文依赖检索上损失30分,在DRM诱饵拒绝上损失29分,分别隔离了多层存储和整合的贡献。我们发布了系统和基准代码。

英文摘要

We present eMEM (Embodied Memory), a hybrid graph-based memory system for embodied agents operating in physical environments. Current agent memory architectures, such as Generative Agents, MemGPT, and A-MEM, treat memory as text streams or knowledge graphs, but embodied agents require memory that is simultaneously searchable by meaning, space, and time. eMEM fills this gap with a multi-index architecture (SQLite for structured storage, hnswlib for approximate nearest neighbour semantic search, and an R-tree for spatial queries) unified behind a single graph model. A tiered consolidation pipeline transforms raw perceptual observations into compressed summaries, mirroring hippocampal-neocortical consolidation in biological systems. Ten agent-facing recall tools expose memory retrieval primitives, including concept-to-location resolution and cross-layer recall, as first-class operations for LLM tool calling. The system is fully embedded and runs in-process alongside the agent. In addition, we introduce eMEM-Bench v1, a benchmark we construct over ProcTHOR-10K scenes for embodied memory evaluation. The benchmark is organised explicitly around eight cognitive-psychology paradigms (DRM lures, pattern separation, pattern completion, source monitoring, context-dependent retrieval, long-horizon interference, serial position, and a foil-augmented retention curve), each chosen so that the result is interpretable against the broader memory-systems literature in humans and prior agent-memory systems; a level of diagnostic detail that surface-task benchmarks like LoCoMo or OpenEQA cannot provide. eMEM scores 80.8 weighted mean over 988 probes, with a flat retention curve at ceiling from 1 h to 1 yr of simulated delay on room-unique items. We show that a pure RAG baseline (the flat_rag ablation) loses 30 pt on context-dependent retrieval and 29 pt on DRM lure rejection, isolating the contribution of multi-layer storage and consolidation respectively. We release both the system and the benchmark code.
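To make the idea of memory that is searchable by meaning, space, and time concrete, here is a minimal sketch using only the Python standard library. It is not eMEM's implementation: SQLite holds the structured records as in the abstract, but a bounding-box `WHERE` clause stands in for the R-tree and a brute-force cosine ranking stands in for hnswlib. The schema and the function names (`build_store`, `add_memory`, `recall`) are hypothetical.

```python
import math
import sqlite3


def build_store():
    """Create an in-memory SQLite table plus a dict of embeddings keyed by row id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memory (id INTEGER PRIMARY KEY, text TEXT, x REAL, y REAL, t REAL)"
    )
    return conn, {}


def add_memory(conn, vectors, text, x, y, t, embedding):
    """Store one observation with its position, timestamp, and embedding."""
    cur = conn.execute(
        "INSERT INTO memory (text, x, y, t) VALUES (?, ?, ?, ?)", (text, x, y, t)
    )
    vectors[cur.lastrowid] = embedding
    return cur.lastrowid


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def recall(conn, vectors, query_vec, box, t_range, k=3):
    """Filter by space and time in SQL, then rank the survivors by similarity."""
    x0, y0, x1, y1 = box
    rows = conn.execute(
        "SELECT id, text FROM memory "
        "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ? AND t BETWEEN ? AND ?",
        (x0, x1, y0, y1, t_range[0], t_range[1]),
    ).fetchall()
    ranked = sorted(rows, key=lambda r: cosine(vectors[r[0]], query_vec), reverse=True)
    return [text for _, text in ranked[:k]]
```

A query then combines all three axes in one call, for example `recall(conn, vectors, [1.0, 0.0], box=(0, 0, 2, 2), t_range=(0, 100))`. A real system would replace the linear scans with proper spatial and nearest-neighbour indexes.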

2606.03287 2026-06-03 cs.CV

BA-T: An Iterative Transformer for Two-View Bundle Adjustment

BA-T: 一种用于双视图束调整的迭代Transformer

Ganlin Zhang, Weirong Chen, Daniel Cremers, Xi Wang

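AI summary Inspired by classical bundle adjustment, proposes BA-T, a lightweight iterative Transformer that performs structured updates in implicit token space to improve the accuracy and multi-view consistency of two-view 3D reconstruction.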

详情
AI中文摘要

前馈三维重建模型通过深度跨视图注意力在图像间交换信息取得了强性能。然而,这些方法通常依赖沉重的解码器堆栈,缺乏几何精化的结构化机制,导致多视图一致性差。我们通过借鉴经典束调整(BA)来解决这个问题,BA可被视为位姿与局部几何之间的迭代信息传播过程。受BA启发,我们提出BA-T,一种迭代Transformer,将BA风格的结构化更新作为可重复层在隐式token空间中实现。BA-T不依赖深度注意力堆栈,而是通过单个轻量层基于潜在残差精化预测。实验表明,BA-T在迭代中逐步提升位姿和重建精度,比传统解码器实现更强的跨视图一致性,在使用仅16%解码器参数的情况下匹配或超越更大的模型。BA-T为深度注意力提供了一种紧凑、高效且结构化的替代方案,在轻量架构内实现精确的三维重建。代码将在以下网址公开:https://github.com/zhangganlin/BA-T。

英文摘要

Feed-forward models for 3D reconstruction have achieved strong performance using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, resulting in poor multi-view consistency. We address this by drawing inspiration from classical bundle adjustment (BA), which can be viewed as an iterative information propagation process between poses and local geometry. Inspired by BA, we propose BA-T, an iterative Transformer that implements BA-style structured updates as a repeatable layer in implicit token space. Instead of relying on deep attention stacks, BA-T refines predictions based on latent residuals using a single lightweight layer. Experiments demonstrate that BA-T progressively improves pose and reconstruction accuracy across iterations, achieves stronger cross-view consistency than conventional decoders, and matches or surpasses substantially larger models while using only 16% of their decoder parameters. BA-T provides a compact, efficient, and structured alternative to depth-heavy attention, enabling accurate 3D reconstruction within a lightweight architecture. The code will be made publicly available at https://github.com/zhangganlin/BA-T.

2606.03284 2026-06-03 cs.CL

SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

SEA-NLI:将自然语言推理作为理解东南亚文化的透镜

Peerawat Chomphooyod, Jian Gang Ngui, Yosephine Susanto, Attapol T. Rutherford, Alham Fikri Aji, Sarana Nutanong, Can Udomcharoenchaikit, Peerat Limkonchotiwat

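AI summary Proposes the SEA-NLI benchmark, which uses natural language inference to evaluate models' understanding of Southeast Asian cultures; existing models perform poorly, and cultural adaptation and culture-aware prompting improve performance.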

详情
AI中文摘要

前沿LLM在西方语境中表现良好,但在东南亚等代表性不足的文化中测试不足。现有的NLI基准大多以西方为中心、源自翻译或单语,限制了其衡量文化基础推理的能力。我们引入了SEA-NLI,一个原生的、基于文化的NLI基准,涵盖八个东南亚国家的英语和本地区域语言,并由母语者验证。在17个编码器和解码器模型中,我们观察到所有模型表现较低,尤其是在语言和科技等知识密集型类别中。我们的分析表明,失败案例主要源于缺乏东南亚文化知识:适应东南亚的模型和文化感知提示提升了性能,而思维链提示带来的提升有限。

英文摘要

Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe low performance from all models, especially for knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.

2606.03269 2026-06-03 cs.AI

Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering

从LLM中蒸馏答案集编程规则用于神经符号视觉问答

Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch

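AI summary Proposes a method for distilling answer-set programming rules from large language models to extend the reasoning capabilities of visual question answering systems in an interpretable way; only a few examples are needed to produce correct rules.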

Comments Under consideration in Theory and Practice of Logic Programming (TPLP)

详情
AI中文摘要

视觉问答(VQA)是关于图像回答问题的任务,需要整合多模态输入和推理。将基于逻辑的表示纳入推理组件的模块化方法,相比端到端训练系统具有明显优势,尤其是在可解释性方面。然而,当任务需求变化时,调整或扩展这些表示可能会给开发者带来沉重负担。为了解决这一挑战,我们提出了一种从大语言模型(LLM)中蒸馏规则的方法。我们的方法提示LLM扩展一个初始的VQA推理理论(表示为答案集程序),以满足任务的新要求。VQA数据集中的示例指导LLM,验证结果,并通过利用ASP求解器的反馈帮助纠正错误规则。我们证明了该方法在多种VQA数据集上的有效性。值得注意的是,仅需少量示例即可从LLM中引出正确规则。我们的实验表明,从LLM中蒸馏规则是传统数据驱动规则学习方法的一种有前景的替代方案。正在考虑发表于《逻辑编程理论与实践》(TPLP)。

英文摘要

Visual Question Answering (VQA) is the task of answering questions about images, requiring the integration of multimodal input and reasoning. Modular approaches that incorporate logic-based representations into the reasoning component offer clear advantages over end-to-end trained systems, particularly in terms of interpretability. However, adapting or extending these representations when task requirements change can place a significant burden on developers. To address this challenge, we present an approach for distilling rules from Large Language Models (LLMs). Our method prompts an LLM to extend an initial VQA reasoning theory, expressed as an answer-set program, to meet new requirements of the task. Examples from VQA datasets guide the LLM, validate the results, and help correct erroneous rules by leveraging feedback from the ASP solver. We demonstrate that our approach is effective across diverse VQA datasets. Notably, only a few examples are needed to elicit correct rules from LLMs. Our experiments suggest that rule distillation from LLMs is a promising alternative to traditional data-driven rule learning approaches. Under consideration in Theory and Practice of Logic Programming (TPLP).

2606.03254 2026-06-03 cs.CV

FreeStreamGS: Online Feed-forward 3D Gaussian Splatting from Unposed Streaming Inputs

FreeStreamGS: 来自无位姿流式输入的在线前馈3D高斯泼溅

Ruiyang Chen, Feiran Li, Chu Zhou, Zonglin Li, Zhanyu Ma, Heng Guo

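AI summary Proposes FreeStreamGS, an online feed-forward framework that uses a Decoupled Intrinsic Recovery Head and a Dynamic Point Refinement Offset strategy to achieve efficient, high-quality novel view synthesis from unposed streaming inputs.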

详情
AI中文摘要

前馈3D高斯泼溅(3DGS)允许从离线录制的图像序列进行高效高保真的新视角合成(NVS)。然而,从流式和无位姿图像输入实现在线NVS仍然具有挑战性。尽管已经提出了用于流式深度和点云恢复的在线前馈几何估计方法,但由于严重的渲染伪影,它们无法适应NVS。这是因为NVS对高斯尺度和位姿-几何对齐要求更严格的多视图一致性;即使微小的偏差也会随时间累积并明显降低渲染质量。为此,我们提出了FreeStreamGS,一个鲁棒的在线前馈框架,用于高效高质量的NVS。我们引入了两个关键机制:解耦内参恢复头,消除累积的相机内参偏差并防止长期流式中的场景尺度抖动;以及动态点精炼偏移策略,放松刚性反投影以校正耦合的位姿-深度漂移。大量实验表明,尽管FreeStreamGS无法访问未来帧,但其渲染质量与最先进的离线前馈3DGS方法相当。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) allows efficient and high-fidelity novel view synthesis (NVS) from an offline recorded image sequence. However, achieving online NVS from streaming and unposed image inputs remains challenging. Although online feed-forward geometric estimation methods have been proposed for streaming depth and point cloud recovery, they cannot be adapted to NVS due to severe rendering artifacts. This is because NVS demands stricter multi-view consistency in Gaussian scales and pose-geometry alignment; even minor deviations would accumulate over time and visibly degrade rendering quality. To this end, we propose FreeStreamGS, a robust online feed-forward framework for efficient and high-quality NVS. We introduce two key mechanisms: a Decoupled Intrinsic Recovery Head that removes cumulative camera intrinsic bias and prevents scene scale jitter during long-term streaming, and a Dynamic Point Refinement Offset strategy that relaxes rigid unprojection to correct coupled pose-depth drift. Extensive experiments show that FreeStreamGS achieves rendering quality competitive with state-of-the-art offline feed-forward 3DGS methods, despite operating without access to future frames.

2606.03251 2026-06-03 cs.AI cs.CV cs.LG eess.IV stat.ML

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

现实世界数据集是否包含自然实验?基于因果特征选择的实证研究

Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke

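AI summary Uses causal discovery and feature selection to detect natural experiments in real-world datasets, and improves model performance by treating the data as interventional.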

详情
AI中文摘要

在自然界中,影响某些个体或群体但不影响其他个体或群体的事件构成隐式干预,被称为自然实验。例如,COVID-19大流行是冠状病毒对感染COVID的亚群的一次干预。我们问:现有的现实世界数据集中是否存在自然实验?如果存在,我们应该如何处理它们?为了检测数据中的自然实验,我们使用因果发现恢复潜在因果图,并基于因果链接进行特征选择。如果通过将数据视为干预性而非观测性来提升下游性能,我们认为这表明数据集包含自然实验。我们首先通过使用合成图模拟包含和不包含自然实验的数据集来验证这一假设。然后,我们在大量现实世界数据集上进行系统的实证评估。我们的结果表明,现实世界数据集确实包含自然实验,我们可以利用这些自然实验通过因果推断来提升模型性能。我们的工作代表了该领域的初步探索,在有限范围内进行了初步研究。

英文摘要

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

2606.03247 2026-06-03 cs.CL cs.IR

Structures Facilitate Retrieve, Rerank, and Generate

结构促进检索、重排序和生成

Yeqin Zhang, Haomin Fu, Xujie Zhang, Cam-Tu Nguyen

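AI summary Proposes SF-Re2G, which exploits document structure to improve passage representations, build a structure-enhanced reranker, and incorporate subgraph context, improving retrieval, reranking, and generation in document-grounded dialogue systems.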

详情
AI中文摘要

文档对话系统(DGDS)利用外部文档中的知识来回答特定领域的用户问题。现有解决方案通常将文档划分为独立的段落进行检索和响应生成。然而,这种方法既没有充分利用文档内的结构信息,也没有为知识选择和响应提供足够的(文档)上下文。本文提出SF-Re2G来系统地解决这些问题。首先,我们通过将段落与同一章节的其他段落进行对比来改进段落表示,从而提高检索性能。其次,构建了一个结构增强的重排序器,利用同一对话轮次的多个基础段落往往位于同一邻近区域的事实。具体来说,来自检索的候选者根据文档结构被分组为子图。重排序器将结合其组信息对候选者重新评分。最后,选中的段落用于生成响应,同时考虑子图上下文以改进生成。在两个DGDS数据集上的实验结果验证了我们的方法在中文和英文上的有效性。

英文摘要

Document-grounded dialogue systems (DGDS) utilize knowledge from external documents to answer domain-specific user questions. Existing solutions typically divide documents into independent passages for retrieval and response generation. This approach, however, neither makes good use of structural information within documents nor provides enough (document) context for knowledge selection and response generation. This paper proposes SF-Re2G to address these issues systematically. First, we improve each passage representation by contrasting it with other passages in the same section, which improves retrieval performance. Second, we build a structure-enhanced reranker that leverages the fact that multiple grounding passages of one dialog turn tend to be in the same neighborhood. Specifically, candidates from retrieval are grouped into subgraphs according to the document structure, and the reranker rescores each candidate by integrating its group information. Finally, the chosen passages are used to generate responses, taking the subgraph context into account for better generation. Experimental results on two DGDS datasets validate our method for both Chinese and English.
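The grouping step of the reranker can be sketched as follows. This is a deliberately simplified illustration, not the SF-Re2G model: it treats a section identifier as the group, and it adds a fixed fraction `alpha` of the scores of a candidate's section neighbours to its own retrieval score. The function name, the tuple layout, and the additive scoring rule are all assumptions.

```python
from collections import defaultdict


def rerank_with_groups(candidates, alpha=0.5):
    """Rescore retrieved passages using support from same-section neighbours.

    candidates: list of (passage_id, section_id, retrieval_score) tuples.
    Returns (passage_id, new_score) pairs sorted by new_score, highest first.
    """
    # Total retrieval score per section, used to compute neighbour support.
    section_total = defaultdict(float)
    for _, section_id, score in candidates:
        section_total[section_id] += score

    rescored = []
    for passage_id, section_id, score in candidates:
        # Support is the summed score of the other candidates in the same section.
        neighbour_support = section_total[section_id] - score
        rescored.append((passage_id, score + alpha * neighbour_support))
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

Under this rule, two moderately scored passages from the same section can outrank an isolated passage with a higher individual score, which reflects the observation that grounding passages for one dialog turn tend to be neighbours.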

2606.03180 2026-06-03 cs.CV cs.CL cs.LG

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

GLINT:面向细粒度放射学表征的稀疏门控视觉-语言对齐

Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

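AI summary To address the scale mismatch between global image-report alignment and local findings in radiology, proposes the GLINT framework, which uses sparsely gated alignment and dense feature regularization to enable zero-shot classification, grounding, and segmentation.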

详情
AI中文摘要

放射学中的视觉-语言模型(VLM)通过利用临床工作流程中自然产生的图像-报告对,已成为一种可扩展的范式。然而,这种配对揭示了尺度上的不匹配:每个病灶仅占据图像的一小部分区域,但监督仅在全局图像-报告级别提供。这带来了一个核心挑战:先前的方法将权重密集地分布到所有补丁上,而不是集中在与给定查询相关的稀疏子集上。为了解决这个问题,我们提出了GLINT(门控语言-图像对齐)框架,该框架显式建模这种稀疏对应关系。在对齐方面,我们引入了稀疏门控对齐,这是一种新颖的架构,其中在单独的门控嵌入空间上的sigmoid门仅激活与每个文本查询相关的补丁,强制执行显式稀疏性。在表征方面,我们添加了密集特征正则化,将可训练编码器的中间特征锚定到冻结的自监督学习(SSL)教师模型上,从而保留门控所依赖的细粒度补丁特征。相同的方案适用于2D胸部X光片(CXR)和3D胸部计算机断层扫描(CT),分别基于DINOv3和V-JEPA 2.1构建。GLINT支持从自由文本查询进行零样本分类、定位和分割,据我们所知,这是首次在没有掩码监督的情况下在3D CT体积上展示零样本分割。值得注意的是,最显著的增益出现在零样本定位和分割上,这些任务需要稀疏的、特定于查询的定位,这与我们的设计意图一致。在下游评估中,GLINT在分类、报告生成和分割方面均优于SSL编码器和医学VLM。

英文摘要

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
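The gating mechanism described above can be sketched in a few lines of plain Python. This is an illustrative reading only: the abstract states that a sigmoid gate over a separate gate embedding space activates the patches relevant to a query, but the pooling rule, the bias term, and every name below (`gated_similarity`, `patch_gate_feats`, and so on) are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def gated_similarity(patch_feats, patch_gate_feats, query_feat, query_gate_feat,
                     bias=0.0):
    """Pool patch-query similarities using per-patch sigmoid gates.

    patch_feats / query_feat:           alignment-space embeddings
    patch_gate_feats / query_gate_feat: embeddings in a separate gate space
    Returns (pooled_similarity, gates).
    """
    # Each gate is an independent sigmoid, so any number of patches may open.
    gates = [sigmoid(dot(g, query_gate_feat) + bias) for g in patch_gate_feats]
    sims = [dot(p, query_feat) for p in patch_feats]
    # Gate-weighted average: patches with closed gates contribute almost nothing.
    pooled = sum(g * s for g, s in zip(gates, sims)) / sum(gates)
    return pooled, gates
```

Because the gates are independent sigmoids rather than a softmax over patches, nothing forces weight to be spread across the whole image; the returned gate values can also be read as a coarse localization map for the query.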

2606.03159 2026-06-03 cs.CV cs.AI cs.RO

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

NVIDIA OmniDreams:用于闭环自动驾驶仿真的实时生成式世界模型

NVIDIA: Aarti Basant, Amlan Kar, Despoina Paschalidou, Fangyin Wei, Francesco Ferroni, Guillermo Garcia Cobo, Haithem Turki, Huan Ling, Jaewoo Seo, James Lucas, Jay Zhangjie Wu, Jialiang Wang, Jonathan Lorraine, Jun Gao, Kai He, Katarina Tothova, Kevin Xie, Michał Tyszkiewicz, Qi Wu, Riccardo de Lutio, Ruilong Li, Sanja Fidler, Seung Wook Kim, Tianchang Shen, Tianshi Cao, Tobias Pfaff, William Lew, Xindi Wu, Xuanchi Ren, Yifan Lu, Yuxuan Zhang, Zan Gojcic, Zian Wang

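AI summary Proposes OmniDreams, a foundation generative world model trained from the Cosmos diffusion model that autoregressively generates action-conditioned video, enabling real-time synthesis of complex long-tail scenarios in closed-loop simulation, and validates its usefulness for training policy models.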

详情
AI中文摘要

随着自动驾驶能力的提升,在长尾场景中安全评估驾驶策略仍是一个关键瓶颈。在闭环仿真中,驾驶策略模型与环境主动交互,其动作动态更新模拟器状态并直接影响下一组生成的传感器观测。尽管近期基于重建的神经模拟器提供了逼真效果,但它们从根本上受限于初始捕获数据,难以泛化到高度动态或新颖场景。为克服这些限制,我们引入了OmniDreams,一个从Cosmos扩散模型进行中期和后训练的基础生成式世界模型,能够自回归地实时生成动作条件视频。通过利用Cosmos丰富的视觉先验以及在21k小时驾驶场景上的中期和后训练,OmniDreams合成了传统模拟器难以捕获的复杂未观测现象,例如极端天气和不可预测的动态智能体行为。关键在于,它自回归地根据过去帧、当前模拟器状态和即时驾驶动作来调节其逼真的传感器生成。在结合Alpamayo 1策略模型和AlpaSim编排器的闭环系统中部署时,OmniDreams充当一个高度响应、反应灵敏的环境,为训练和评估下一代自动驾驶策略提供了可扩展且全面的解决方案。我们还展示了初步结果,表明从OmniDreams后训练的世界-动作模型(WAM)在Physical AI自动驾驶NuRec数据集上取得了强劲性能,超越了基于VLA的Alpamayo 1.5研究策略模型,同时仅使用其1/5的总参数量。这些结果凸显了像OmniDreams这样的实时世界模型也有潜力作为策略架构的骨干网络。

英文摘要

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.