arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.00153 2026-06-02 cs.CV cs.AI

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion

Zhiyang Lu, Ming Cheng

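Affiliations: University of Science and Technology of China

AI summary: To address the domain gap in 2D-3D cross-modal gait recognition, the paper proposes DiffCrossGait, which achieves continuous modality alignment through trajectory-level alignment in a latent diffusion space and introduces a Tri-Phase Alignment Strategy to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, reaching state-of-the-art performance on the SUSTech1K and FreeGait benchmarks.

Details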
Comments
Accepted by ICML 2026
Abstract

Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representations. While prior methods align only final embeddings, we propose DiffCrossGait, which reformulates cross-modal matching as trajectory-level alignment in an identity-relevant latent diffusion space, rather than assuming full equivalence between 2D and 3D observations. By driving both modalities with shared Gaussian noise within a latent space, we enable continuous alignment throughout the generative evolution. We introduce a Tri-Phase Alignment Strategy that exploits varying noise intensities to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, thereby constraining both modalities to share denoising dynamics and bottleneck structure, which promotes modality-invariant gait features. Crucially, our framework decouples generative alignment from the discriminative backbone; the diffusion mechanism serves exclusively as a training objective, ensuring high inference efficiency by eliminating the computational overhead of iterative denoising. Extensive experiments on the SUSTech1K and FreeGait benchmarks demonstrate that DiffCrossGait achieves state-of-the-art performance.
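
The shared-noise idea in the abstract can be illustrated with a minimal sketch. It assumes a standard variance-preserving forward process, z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps, which the abstract does not specify; the latent values, `add_noise`, and `shared_noise_pair` are hypothetical names chosen for illustration only.

```python
import math
import random

def add_noise(z0, eps, alpha_bar):
    """Variance-preserving forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]

def shared_noise_pair(z2d, z3d, alpha_bar, rng):
    """Noise a 2D latent and a 3D latent with the SAME Gaussian sample."""
    eps = [rng.gauss(0.0, 1.0) for _ in z2d]
    return add_noise(z2d, eps, alpha_bar), add_noise(z3d, eps, alpha_bar)

rng = random.Random(0)
z2d = [0.5, -1.0, 2.0]   # hypothetical silhouette latent (assumption)
z3d = [0.4, -0.8, 2.5]   # hypothetical LiDAR latent (assumption)
zt2d, zt3d = shared_noise_pair(z2d, z3d, alpha_bar=0.25, rng=rng)

# Difference between the two noised latents at this noise level.
gap = [u - v for u, v in zip(zt2d, zt3d)]
```

Under this assumption the shared noise cancels in the difference, so `gap` equals sqrt(alpha_bar) times the clean-latent gap at every noise level. That is one way to see why a shared noise sample makes the two trajectories comparable step by step.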

2606.00151 2026-06-02 cs.LG cs.AI

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo

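Affiliations: University of Tokyo; Aalto University

AI summary: Proposes the ReMax objective, which maximizes the expected maximum return over M samples so that exploration emerges naturally, derives a policy-gradient formulation and the RePPO algorithm, and promotes exploration on the MinAtar and Craftax benchmarks without explicit exploration bonuses.

Details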
Abstract

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.
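
The "expected maximum return over $M$ samples" can be made concrete with a small sketch for a discrete return distribution. It uses the order-statistic identity P(max <= v) = F(v)^M. Substituting a real exponent m > 0 is only one possible continuous relaxation and is an assumption here, not necessarily the paper's construction; the sketch also ignores the return-uncertainty term mentioned in the abstract. The function name and the two toy policies are hypothetical.

```python
def expected_max(values, probs, m):
    """E[max of m i.i.d. draws] for a discrete return distribution.

    Uses P(max <= v) = F(v)**m, where F is the CDF. For integer m this is the
    exact order-statistic identity; a real m > 0 is an assumed relaxation.
    """
    pairs = sorted(zip(values, probs))
    total, cdf_prev = 0.0, 0.0
    for v, p in pairs:
        cdf = cdf_prev + p
        total += v * (cdf ** m - cdf_prev ** m)
        cdf_prev = cdf
    return total

# Two hypothetical policies on a one-step task (illustrative numbers only):
safe = ([1.0], [1.0])                 # always returns 1
risky = ([0.0, 3.0], [0.75, 0.25])    # mean 0.75, occasionally returns 3

single_try = (expected_max(*safe, 1), expected_max(*risky, 1))
four_tries = (expected_max(*safe, 4), expected_max(*risky, 4))
```

With a single try the safe policy scores higher (1.0 against 0.75), while with four retries the risky policy scores higher (about 2.05 against 1.0). This matches the abstract's intuition that exploration only pays off when retries are available.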

2606.00148 2026-06-02 cs.CV cs.AI

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun, Xuanyu Ji, Qingyong Hu

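Affiliations: University of Science and Technology of China

AI summary: Proposes StemBind, a diagnostic benchmark that uses three aligned questions on a shared stem (Perception, Rule, Full) to locate where MLLMs fail in abstract visual reasoning, and finds that rule-to-instance binding is the main bottleneck.

Details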
Comments
Project page: https://hexixiang.github.io/StemBind
Abstract

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.
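
The binding gap in finding (ii) is a conditional error rate, which the following sketch computes from per-stem outcomes. The record layout and the function name `binding_gap` are assumptions made for illustration; the demo data is invented and does not come from the benchmark.

```python
def binding_gap(records):
    """Share of stems with Perception and Rule correct but Full wrong.

    `records` is an iterable of (p_ok, r_ok, f_ok) booleans, one per stem.
    Returns None when no stem has both P and R correct.
    """
    eligible = [f_ok for p_ok, r_ok, f_ok in records if p_ok and r_ok]
    if not eligible:
        return None
    return sum(not f for f in eligible) / len(eligible)

demo = [
    (True, True, False),    # rule known, wrong final answer: counts as a gap
    (True, True, True),     # everything correct
    (True, False, False),   # excluded: rule was wrong
    (False, True, True),    # excluded: perception was wrong
]
gap = binding_gap(demo)
```

Conditioning on P and R being correct is what separates a binding failure from an upstream perception or rule-induction failure on the same stem.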

2606.00147 2026-06-02 cs.LG cs.AI

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

Yuduo Li, Xiaofeng Shi, Qian Kou, Longbin Yu, Hua Zhou

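Affiliations: Beijing Academy of Artificial Intelligence (BAAI); Beijing Jiaotong University (BJTU)

AI summary: Proposes the RAFT framework, which addresses the supervision-compatibility gap and the trajectory-preservation gap in domain fine-tuning through data refinement (self-conditioned rewriting, semantic filtering, answer fusion) and answer-conditioned on-policy distillation (top-K temperature distillation, EMA-based adaptive loss balancing), improving domain performance while mitigating the degradation of general capabilities.

Details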
Comments
preprint
Abstract

Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model's general capabilities. We view this degradation through two practical gaps in domain SFT: a supervision-compatibility gap, where domain targets differ in style and reasoning format from the original model's natural responses, and a trajectory-preservation gap, where teacher-forced SFT optimizes fixed target tokens without constraining the model's behavior on its own generated prefixes. This process fails to preserve the model's original behavior. We propose RAFT (Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting), a two-stage framework that addresses both factors. First, RAFT constructs model-compatible supervision through self-conditioned rewriting, semantic filtering, and answer fusion. Second, RAFT performs Answer-Conditioned On-Policy Distillation, where the original instruction-tuned model provides soft targets on student-generated trajectories while being conditioned on the fused answer as helpful context. We further introduce top-K temperature distillation and EMA-based adaptive loss balancing to stabilize the domain-general trade-off. Across three instruction-tuned backbones and five domains, RAFT improves average domain accuracy by 23.2% over standard SFT, while recovering part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2% and 10.2%, respectively. These results show that coupling data refinement with trajectory-level preservation provides an effective recipe for domain fine-tuning with alleviated forgetting.
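
The abstract names top-K temperature distillation without defining it. One plausible reading, sketched below under that assumption, restricts a temperature-softened KL divergence to the teacher's K highest-scoring tokens and renormalizes both distributions over that subset. The function names and logits are hypothetical, and the paper's actual loss may differ.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def topk_distill_loss(teacher_logits, student_logits, k, temperature):
    """KL(teacher || student) restricted to the teacher's top-k tokens (assumed form)."""
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    top = order[:k]
    p = softmax([teacher_logits[i] for i in top], temperature)
    q = softmax([student_logits[i] for i in top], temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.0, -1.0]      # hypothetical next-token logits
student = [0.0, 1.0, 2.0, -1.0]
loss_same = topk_distill_loss(teacher, teacher, k=2, temperature=2.0)
loss_diff = topk_distill_loss(teacher, student, k=2, temperature=2.0)
```

Limiting the loss to the top-K tokens keeps the soft target focused on the continuations the teacher considers likely, and a higher temperature flattens that target. How RAFT weights this term against the domain loss (the EMA-based balancing) is not specified in the abstract and is not modeled here.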

2606.00145 2026-06-02 cs.RO cs.AI

Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration

Yusuke Sano, Takeshi Itoga

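Affiliations: Intelligent Systems Laboratory, SECOM Co., Ltd.

AI summary: Proposes Completion at the Boundary (CaB), which retains two-sided evidence through Boundary-Phase Tokens (Before/Hit/After) to give VLA agents completion-aware switching under limited calibration, improving composite-instruction execution and handoff quality.

Details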
Abstract

Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: deciding when the instruction is complete. This gap is acute in short composites ("do A, then B"), where mistimed handoffs cascade into downstream failures. Completion is inherently closed-loop because switching is an intervention that changes the instruction context and thus future actions and observations. We study completion under a deployable low-calibration regime motivated by open-ended instruction spaces, enforcing no test-time relearning and a single globally calibrated switching rule selected once on the development set and reused unchanged on the test set. Under this constraint, collapsing asymmetric boundary evidence into a single scalar can be brittle under polarity shifts across tasks. We propose Completion at the Boundary (CaB), which predicts an event-local completion object in the form of Boundary-Phase Tokens (Before/Hit/After), retaining two-sided boundary evidence under this discipline. CaB-When converts this completion object into a minimal, auditable switching decision (when), while CaB-How reuses the same completion object to condition action generation for boundary-stable control through handoffs (how). Using an intervention-aware E1/E2 protocol, we show that CaB improves composite execution and handoff quality on a first-person Minecraft VLA benchmark under matched capacity and deployability constraints.

2606.00144 2026-06-02 cs.LG cs.AI

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen, Kangning Cui, Qizhen Lan, Xilu Wang

Affiliations: Shanghai Institute of Optics and Fine Mechanics; The University of Sydney; Marquette University; Johns Hopkins University; Wake Forest University; University of Texas Health Science Center at Houston; University of Surrey

AI summary: To address the drop in acceptance rate caused by the sparse/full cache mismatch in mid-to-long context inference, the paper proposes BudgetDraft, a multi-view sparse training method that trains a single robust drafter with an acceptance-aware loss and a multi-view loss, recovering acceptance under a fixed KV budget and achieving up to 6.55x speedup.

Details
Abstract

Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K--16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup over autoregressive (AR) decoding at 4K, 8K, and 16K context lengths, respectively, while keeping the inference pipeline memory-friendly.
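
To see why a falling acceptance rate erodes the benefit of speculative decoding, the sketch below uses the standard idealization in which each drafted token is accepted independently with the same probability. Under that assumption the expected number of tokens emitted per verifier pass is (1 - a^(g+1)) / (1 - a) for acceptance rate a and draft length g. This is a textbook approximation, not a measurement from the paper, and the function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verifier pass under i.i.d. acceptance."""
    a = acceptance_rate
    if a >= 1.0:
        # Every drafted token is accepted, plus one token from the verifier.
        return float(draft_len + 1)
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)

# Illustrative comparison at draft length 4 (numbers are assumptions):
high_acceptance = expected_tokens_per_step(0.8, 4)
low_acceptance = expected_tokens_per_step(0.5, 4)
```

With four drafted tokens, dropping the acceptance rate from 0.8 to 0.5 reduces the expected yield from roughly 3.36 to roughly 1.94 tokens per verifier pass, which is why recovering acceptance under sparse caches matters for end-to-end speedup.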

2606.00139 2026-06-02 cs.CV cs.AI

Geodesics with Unified Tangent-constrained Priors and Curvature Regularization

Chong Di, Li Liu, Jinglin Zhang, Zhenjiang Li, Da Chen, Laurent D. Cohen

Affiliations: Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences); Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine; School of Control Science and Engineering, Shandong University; Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences; CEREMADE, Université Paris Dauphine, Université-PSL, CNRS, UMR 7534

AI summary: Proposes a geodesic framework that combines tangent-constrained priors with curvature penalization in the orientation-lifted space, solves the resulting HJB PDE efficiently with the fast marching method, and improves the robustness of image segmentation for complex shapes.

Details
Abstract

Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunately, these models remain susceptible to shortcuts when delineating objects with complex shapes and image intensity distributions, as they lack mechanisms to enforce shape-aware tangent constraints. To address this limitation, we propose a unified geodesic framework that integrates tangent-constrained priors with curvature penalization. The key idea is to formulate tangent admissibility directly within the orientation-lifted space, where path tangents are restricted to spatially varying angular sectors derived from intrinsic shape representatives (ISR) such as skeletons or interior landmarks. This formulation gives rise to a family of tangent-constrained Finslerian metrics, extending the classical curvature-penalized geodesic models while enforcing mandatory tangent constraints. The resulting Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) admit efficient numerical solutions via variants of the fast marching method, preserving the single-pass computational complexity. Experiments on synthetic, natural, and medical images demonstrate that the proposed geodesic framework indeed improves robustness against weak boundaries and topological shortcuts, yielding segmentation results with enhanced shape fidelity compared to existing geodesic models.

2606.00137 2026-06-02 cs.CV cs.GR

Advances in Neural 3D Mesh Texturing: A Survey

Sai Raj Kishore Perla, Hao Zhang, Ali Mahdavi-Amiri

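Affiliations: Simon Fraser University

AI summary: Surveys recent advances in neural 3D mesh texturing, covering texture synthesis, transfer, and completion methods, and proposes a unified taxonomy.

Details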
Journal ref
Eurographics STAR (State of The Art Report), Computer Graphics Forum, Volume 45, Number 2, 2026
Comments
Eurographics STAR (Computer Graphics Forum), 2026. Project Page: https://sairajk.github.io/neural-mesh-texturing/
Abstract

Texturing 3D meshes plays a vital role in determining the visual realism of digital objects and scenes. Although recent generative 3D approaches based on Neural Radiance Fields and Gaussian Splatting can produce textured assets directly, polygonal meshes remain the core representation across modeling, animation, visual effects, and gaming pipelines. Neural 3D mesh texturing therefore continues to be an essential and active area of research. In this survey, we present a comprehensive review of recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion. We first summarize key foundations in mesh geometry, texture mapping, differentiable rendering, and neural generative models, and then organize the literature into a unified taxonomy spanning early GAN-based methods to modern diffusion-based pipelines. We further analyze common architectures and supervision strategies, review datasets and evaluation protocols, and discuss emerging applications, practical/commercial systems, and open challenges. Together, these insights provide a structured perspective on the current landscape and help guide future developments in learning-based 3D mesh texturing.

2606.00136 2026-06-02 cs.LG cs.AI cs.CL cs.CR cs.SI

Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey

Jonghyun Chung, Rishabh Chaddha, Sanket Badhe, Debanshu Das, Nathan Huang, Amanpreet Kaur

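Affiliations: Google LLC

AI summary: Using the lifecycle-based C5 Interaction Model and combining machine learning with social science methods, this paper systematically surveys proactive detection techniques for GenAI-driven adversarial synthetic content, including Coordinated Inauthentic Behavior analysis, epidemiological modeling, and Hawkes processes, with the aim of building a more resilient information ecosystem.

Details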
Journal ref
IEEE Access (2026)
Comments
14 pages, 3 figures, 3 tables. Accepted for publication in IEEE Access (May 2026)
Abstract

The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI), is rendering traditional reactive detection methods ineffective. This survey synthesizes emerging research to demonstrate a paradigm shift toward the proactive detection of emerging inauthentic narratives. In this survey, we adopt a unified, lifecycle-based taxonomy to combine socio-technical lifecycle models of adversarial campaigns with advanced computational methodologies for emerging inauthentic narrative detection. By structuring the analysis around the C5 Interaction Model (Context, Causes, Content, Cycle of Amplification, Consequences), we integrate different research streams from machine learning and social science. To differentiate spread patterns of synthetic amplification from authentic baseline traffic, this paper surveys state-of-the-art techniques for modeling the creation, seeding, and propagation of fresh narratives, including the analysis of Coordinated Inauthentic Behavior (CIB), epidemiological modeling, and Hawkes processes. This survey also provides a systematic review of proactive detection methods for adversarial threats at different stages in the C5 Interaction Model, specifically, anomaly detection in high-dimensional embedding spaces, unsupervised coordination detection on multi-layer graphs, and agentic AI systems. Finally, this survey addresses challenges posed by GenAI, including the difficulty of tracking rapidly changing threats and multi-level distributional drift, and it outlines a future research agenda focused on detecting anomalous clusters and building anticipatory and resilient systems. This survey provides a comprehensive, lifecycle-based review of methods for the proactive detection of emerging synthetic threats for more resilient information ecosystems.
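
Of the propagation models the abstract lists, the Hawkes process is the easiest to show compactly. The sketch below evaluates the conditional intensity of a univariate Hawkes process with an exponential kernel, lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)). The kernel choice, the parameter values, and the event times are assumptions for illustration; the survey does not commit to a specific form in the abstract.

```python
import math

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Intensity at time t: baseline mu plus decaying excitation from past events."""
    excitation = sum(
        alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )
    return mu + excitation

# Hypothetical share timestamps (in hours) for one narrative:
shares = [0.0, 0.2, 0.3]
quiet = hawkes_intensity(5.0, [], mu=0.2, alpha=0.8, beta=1.0)
burst = hawkes_intensity(0.4, shares, mu=0.2, alpha=0.8, beta=1.0)
```

Each past event temporarily raises the rate of future events, so a tight cluster of shares produces a much higher intensity than the baseline. Self-exciting dynamics of this kind are what make the model a natural fit for distinguishing amplification bursts from steady background traffic.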

2606.00135 2026-06-02 cs.LG cs.AI

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai

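Affiliations: University of California, Berkeley; University of Cambridge; University of Toronto

AI summary: Systematically analyzes how sensitive tool-calling evaluation results are to implementation choices, and proposes two acceleration techniques that target computational waste in RL training.

Details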
Comments
ICML 2026
Abstract

Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices, including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings, where, without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.
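
The abstract does not say which RL algorithm is used or why some prompts yield no learning signal. One common cause, assumed in the sketch below, arises with group-relative baselines: if every rollout for a prompt receives the same reward, all advantages are zero and the prompt contributes no gradient. The function names and reward values are hypothetical, and this is not presented as the paper's technique.

```python
def group_advantages(rewards):
    """Advantage of each rollout relative to the mean reward of its prompt group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def has_learning_signal(rewards):
    """True when at least one rollout differs from the group mean."""
    return any(a != 0.0 for a in group_advantages(rewards))

# Hypothetical rewards for four rollouts per prompt:
batch = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "mixed": [0.0, 1.0, 1.0, 0.0],
}
useful = [name for name, rewards in batch.items() if has_learning_signal(rewards)]
```

In this toy batch only the mixed prompt carries signal, so two thirds of the rollout compute would be spent on prompts that cannot change the policy under this assumed setup.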

2606.00130 2026-06-02 cs.LG cs.AI

Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) for Exponential Compression of Deep Neural Networks

Andrzej Cichocki, Michal Wietczak

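Affiliations: Institute of Computing Intelligence, Polish Academy of Sciences

AI summary: Proposes Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) as structured weight generators whose compact core tensors are trained end-to-end by reverse-mode automatic differentiation, compressing deep neural networks with per-layer ratios of roughly 2000x to 77000x on AlexNet and VGG-16, with accuracy comparable to or better than the dense baseline.

Details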
Comments
6 figures, 28 pages; to be submitted to a journal and conference
Abstract

We study Automatically Differentiable Nonlinear Tensor Networks (ADNTNs), a family of structured weight generators whose compact core tensors are trained end-to-end by reverse-mode automatic differentiation (AD). The approach can be viewed as a natural extension of low-rank adaptation and tensor factorisation: instead of using one low-rank matrix update, an ADNTN builds a large weight tensor through a hierarchy of small cores, nonlinear activations, and optional lateral mixing tensors. The paper focuses on three architectures: Tree Tensor Networks (TTNs), augmented TTNs (aTTNs) with boundary disentanglers, and Multi-scale Entanglement Renormalisation Ansatze (MERA). The formulation supports nonlinear activations, task-aware objectives, batching, and hardware-aware execution schedules. At the same time, the paper keeps a clear distinction between differentiating a contraction program and making contraction free: AD does not remove the cost of large intermediates, poor contraction orders, or exact contraction of general loopy tensor networks. Extensive simulations on AlexNet and VGG-16 layers show per-layer compression ratios from roughly $2000\times$ to $77000\times$ in the studied settings, with accuracy often matching the dense baseline and, in several VGG-16 cases, improving it. These results are encouraging rather than final: they suggest that ADNTNs are a promising, mathematically structured, and hardware-aware route toward much smaller neural networks, provided that optimisation, contraction schedules, and deployment kernels are designed together.

2606.00129 2026-06-02 cs.LG cs.AI

A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity

Yousef A. Radwan, Xuhui Liu, Kilichbek Haydarov, Yuqian Fu, Mohamed Elhoseiny

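Affiliations: King Abdullah University of Science and Technology

AI summary: Builds a one-dimensional valence direction (the V-axis) extracted from large language models (LLMs) and finds that it aligns with human EEG activity, yet further alignment strategies fail to improve decoding; the paper formalizes this as the "saturation regularity" and argues that improvements should come from the residual subspace that supervision cannot reach.

Details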
Abstract

Large language models (LLMs) have emerged as powerful representation learners whose internal features increasingly align with human cognition. We study whether modern LLMs can serve as a lens for understanding neural representations in the human brain, focusing on emotional valence in EEG. We first build a one-dimensional valence direction, the V-axis, from modern LLMs using only nine emotion-evocative sentences. We validate it through zero-shot transfer to sentiment benchmarks and cross-model consistency across fourteen LLMs. We then show that this LLM-derived direction maps onto human neural activity. On a public EEG cohort of 123 subjects watching affective videos, a single linear projection on EEG features tracks the V-axis position of each stimulus. Moreover, 36 EEG emotion classifiers trained without exposure to the V-axis spontaneously rediscover the same direction in their internal representations, suggesting that the same valence structure emerges in both language models and human electrophysiology. Yet this convergence does not provide an effective training signal. We test twenty-five alignment strategies, including knowledge distillation, representational similarity, contrastive, and topographic losses; none improve decoding, and sixteen significantly reduce accuracy. We formalize this result as the saturation regularity: once task labels alone drive a brain-decoding network onto the target direction, additional supervision mainly distorts an already-saturated basin, while the load-bearing within-class residual receives little useful gradient. This regularity also indicates where improvement should come from: the residual subspace unreachable by supervision. Motivated by this insight, we ensemble across residual diversity rather than supervising the basin, improving balanced accuracy by 10.5% over the prior best on FACED, with the same effect replicated on SEED-V.

2606.00124 2026-06-02 cs.CV cs.LG

Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness

Mahmoud Mannes

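Affiliations: ESSTHS

AI summary: Introduces the Spatial Similarity Distance Correlation (SSDC) metric to study how different positional encodings affect the geometry of internal spatial representations in Vision Transformers, and finds that positional encodings improve robustness under content-disrupting distribution shifts by establishing an index-anchored spatial organization.

Details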
Comments
16 pages (9 main text, 7 appendix). 5 figures (3 main text, 2 appendix) with 8 graphics total. 5 tables (1 main text, 4 appendix). Submitted to NeurIPS 2026 main conference and the ICML 2026 mechanistic interpretability workshop
Abstract

Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts. We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization. Representations in these models remain stable under perturbations that disrupt content, and exhibit substantially improved robustness to such distributional shifts. We further show that while different PEs produce distinct depth-wise trajectories of spatial structure, their robustness properties are largely similar (with secondary variation across encoding schemes), suggesting that robustness depends more on the presence of a stable positional reference frame than on the specific encoding mechanism. These results offer a geometric account of how positional encodings shape internal representations, with implications for the principled design of future encoding schemes.
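
The abstract does not define SSDC. Going only by the name, one plausible reading is a correlation between pairwise token similarity and pairwise spatial distance on the patch grid, which the sketch below computes with cosine similarity and Pearson correlation. The formula, the sign convention, and all names and toy values are assumptions, and the paper's definition may differ.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def similarity_distance_correlation(tokens, positions):
    """Correlate token-pair similarity with grid distance (assumed SSDC-like score)."""
    sims, dists = [], []
    for i, j in combinations(range(len(tokens)), 2):
        sims.append(cosine(tokens[i], tokens[j]))
        dists.append(math.dist(positions[i], positions[j]))
    return pearson(sims, dists)

# Toy tokens whose embedding angle grows with position along one row:
positions = [(0, 0), (0, 1), (0, 2), (0, 3)]
tokens = [(math.cos(0.3 * k), math.sin(0.3 * k)) for k in range(4)]
score = similarity_distance_correlation(tokens, positions)
```

In this toy case nearby tokens are more similar, so the score is strongly negative. Recomputing it after shuffling the tokens while keeping the positions fixed would be one way to probe the permutation sensitivity that the abstract describes.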

2606.00123 2026-06-02 cs.CV cs.AI cs.LG

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu, Jingwei Guo, Nan Zhang, Hui Wang, Zhen Zhou, Kairui Bo, Yan Chen, Yue Ren, Shuai Li, Lei Xu, Henggui Zhang

发表机构 * Beijing Academy of Artificial Intelligence（北京人工智能研究院） Beijing Anzhen Hospital（北京安贞医院） Beihang University（北京航空航天大学） King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）

AI总结 提出CardioLens测试平台,通过多序列心脏磁共振成像评估24个多模态大语言模型,发现其在临床工作流中表现不佳,存在类别崩溃失败模式,且输入选择和推理提示改进效果有限。

详情
AI中文摘要

多模态大语言模型在公共医学基准上表现出色,但现有评估通常依赖于孤立输入和简化识别任务,难以作为临床使用的有效代理。我们提出了CardioLens,一个针对多序列心血管磁共振的无泄漏评估测试平台,通过严格的报告到QA构建和验证流程,从私有医院档案中构建。CardioLens包含473,896张切片和13,494个经过验证的QA对,涵盖4D Cine、LGE、灌注和T2加权成像,并评估CMR解读的三个阶段:图像理解、报告生成和疾病诊断。在24个最先进的MLLM上,CardioLens揭示了显著的临床现实差距:模型整体表现不佳,性能沿真实CMR工作流下降。混淆分析进一步显示一种类别崩溃失败模式,模型倾向于默认频繁出现的异常类别,而不是区分临床不同的发现。为了排除MLLM兼容输入构造是主要原因,我们在不同切片预算下比较了随机、临床动机和数据驱动的切片选择协议;性能变化很小,通常约为1%。显式推理提示也无法挽救性能,往往使模型更加保守,而不是改善视觉证据的使用。这些结果表明,当前MLLM远未达到可靠的CMR解读,临床决策需要跨序列、视图和时间相位整合分布式证据。CardioLens为开发面向真实临床部署的下一代MLLM提供了一个临床基础的测试平台。

英文摘要

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.

2606.00121 2026-06-02 cs.CV cs.AI

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

基于语义和结构引导的大脑活动图像重建通用框架

Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang, Huiguang He

发表机构 * State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology(脑认知与脑启发智能技术国家重点实验室) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Future Technology, University of Chinese Academy of Sciences(中国科学院大学未来技术学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出MindDiffuser两阶段框架,结合CLIP文本嵌入和视觉特征,通过Stable Diffusion生成语义图像并迭代优化结构信息,在fMRI、EEG、MEG三种模态上显著提升图像重建性能。

详情
AI中文摘要

从大脑记录中重建视觉刺激一直是脑解码中一项有意义且具有挑战性的任务。特别是,实现精确且可控的图像重建对于推动脑机接口的进步和应用具有重要意义。最近的方法利用文本到图像生成模型的能力,在语义(如概念和对象)方面重建了接近复杂自然刺激的图像。然而,它们在保持与原始刺激在细粒度结构信息(如位置、方向和大小)上的一致性方面存在困难,这削弱了模型的可控性和可解释性。为了解决上述问题,我们提出了一个两阶段图像重建框架,称为MindDiffuser。在第一阶段,从大脑反应解码的对比语言-图像预训练(CLIP)文本嵌入被输入到Stable Diffusion中,生成包含语义信息的初步图像。在第二阶段,我们使用解码的浅层CLIP视觉特征作为监督信号,通过反向传播迭代优化来自第一阶段的特征向量,以对齐结构信息。我们在由视觉刺激引发的三种模态(fMRI、EEG、MEG)的大脑反应数据集上进行了大量实验,结果表明我们的框架显著提升了先前最先进模型的性能,凸显了我们方法的有效性和通用性。空间和时间可视化结果进一步支持了我们框架的神经生物学合理性,为未来跨不同大脑信号模态的神经解码工作提供了指导。

英文摘要

Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. In particular, achieving precise and controllable image reconstruction is important for advancing brain-computer interfaces and their applications. Recent methods, leveraging advances in text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation and size), which undermines both the controllability and interpretability of the models. To address these issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information. We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly improves on the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.

2606.00119 2026-06-02 cs.RO cs.AI

V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising

基于位姿条件的UWB测距去噪的V2I工作区几何重建

Jiaxi Liu, Hangyu Li, Yang Cheng, Rui Gana, Junwei You, Weizhe Tang, Peng Zhang, Steven T. Parker, Xiaopeng Li, Bin Ran

发表机构 * Department of Civil & Environmental Engineering, University of Wisconsin-Madison(威斯康星大学麦迪逊分校土木与环境工程系)

AI总结 针对V2I工作区几何重建中UWB测距受突发异常、非视距误差和位姿不确定性的影响,提出一种位姿条件、排列等变的预测去噪器,通过共享锚点时间预测、对称集聚合和位姿条件残差解码,显著提升测距精度和几何重建质量。

详情
AI中文摘要

可靠的工作区映射对于网联自动驾驶车辆(CAV)安全平稳地通过工作区至关重要。安装在锥形路标上的超宽带(UWB)路侧单元(RSU)提供了一种经济高效的工作区布局推断方式,因为路侧锚点和车载标签为工作区几何重建提供了直接的车对基础设施(V2I)距离约束。然而,在实际现场部署中,UWB测距估计受到突发异常、非视距(NLOS)误差、任意锚点排序问题以及车辆位姿不确定性的影响。为解决这些挑战,本研究提出了一种位姿条件、排列等变的预测去噪器,用于多锚点UWB测距。该模型采用共享锚点时间预测来捕捉距离动态,对称集聚合来处理无序和缺失的锚点,以及位姿条件残差解码来将车辆运动作为几何先验。两阶段训练策略首先从观测距离学习预测,然后通过NLOS加权监督微调去噪器。该方法在CAV收集的罕见真实世界V2I UWB现场数据以及受控大规模仿真基准上进行了评估,以获得消融见解。结果表明,所提出的方法在具有挑战性的NLOS主导场景中显著提高了测距精度、锥形标定位和工作区几何重建,对锚点重新索引和适度锚点丢失保持鲁棒,并将测量加权的现场均方误差相对于原始输入降低了66.9%。

英文摘要

Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone areas. Cone-mounted ultra-wideband (UWB) roadside units (RSU) offer a cost-effective way for work zone layout inference, as roadside anchors and vehicle tags provide direct vehicle-to-infrastructure (V2I) range constraints for work zone geometry reconstruction. However, UWB range estimation is degraded by bursty outliers, non-line-of-sight (NLOS) errors, arbitrary anchor-ordering issues, and vehicle pose uncertainties in practical field deployments. To address these challenges, this study proposes a pose-conditioned, permutation-equivariant predictive denoiser for multi-anchor UWB ranging. The model employs shared anchor-wise temporal prediction to capture range dynamics, symmetric set aggregation to handle unordered and missing anchors, and pose-conditioned residual decoding to incorporate vehicle motion as a geometric prior. A two-stage training strategy first learns prediction from observed ranges, and then fine-tunes the denoiser with NLOS-weighted supervision. The method is evaluated on rare real-world V2I UWB field data collected with a CAV, as well as on controlled large-scale simulation benchmarks for ablative insights. Results show that the proposed method substantially improves range accuracy, cone localization, and work zone geometry reconstruction in challenging NLOS-dominated regimes, remains robust to anchor re-indexing and moderate anchor dropout, and reduces measurement-weighted field MSE by 66.9% relative to the raw input.
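To make the idea of a shared per-anchor predictor combined with symmetric set aggregation concrete, here is a toy sketch. It is not the paper's model: the constant-velocity predictor, the mean-absolute-residual pooling, and the clipping rule are all assumptions chosen so that the permutation-equivariance property is easy to see, and every function name is hypothetical.

```python
from typing import List, Optional


def predict_next(history: List[float]) -> float:
    """Shared per-anchor predictor (toy): constant-velocity extrapolation.

    Assumes `history` is non-empty.
    """
    if len(history) < 2:
        return history[-1]
    return 2.0 * history[-1] - history[-2]


def denoise_ranges(
    histories: List[List[float]],
    observed: List[Optional[float]],
) -> List[float]:
    """Return one denoised range per anchor, in the same order as the input.

    `observed[i]` is None when anchor i dropped out at this time step.
    """
    # The same function is applied to every anchor, so anchor order carries no meaning.
    preds = [predict_next(h) for h in histories]

    # Symmetric aggregation: a mean over the set of valid residuals is
    # unchanged by any reordering of the anchors.
    residuals = [o - p for o, p in zip(observed, preds) if o is not None]
    scale = sum(abs(r) for r in residuals) / len(residuals) if residuals else 0.0

    out: List[float] = []
    for o, p in zip(observed, preds):
        if o is None:
            out.append(p)  # missing anchor: fall back to the prediction
            continue
        r = o - p
        if abs(r) <= scale:
            out.append(o)  # residual is typical for this set: keep the measurement
        else:
            # outlier: clip the residual to the set-level scale
            out.append(p + (scale if r > 0 else -scale))
    return out
```

Because each output depends only on that anchor's own data and an order-independent pooled statistic, re-indexing the anchors simply re-indexes the outputs, which is the robustness property the abstract reports.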

2606.00117 2026-06-02 cs.RO

Ontology-Guided Reasoning for Affordance-Based Explanations of Robot Navigation

基于本体引导的机器人导航可供性解释推理

Amar Halilovic, Vahidin Hasic, Senka Krivic

发表机构 * Institute of Artificial Intelligence, Ulm University(乌尔姆大学人工智能研究所) Faculty of Electrical Engineering, University of Sarajevo(萨拉热窝大学电气工程学院)

AI总结 提出本体引导推理方法,通过局部可供性本体表示实体、可供性状态和空间关系,评估假设的对象-可供性状态变化作为解释因素,生成语义可理解且可操作的解释,并在机器人图书管理员场景中验证其准确性和鲁棒性。

详情
AI中文摘要

本文提出基于本体引导的推理方法,用于机器人导航的可供性解释。在人类环境中,机器人仅检测到其路径被阻塞是不够的。它还必须推理附近物体的可供性、可能的状态变化以及哪些变化能使其安全继续。我们通过将附近实体、其可供性、可供性状态和定性空间关系表示在局部可供性本体中,并评估假设的对象-可供性状态变化作为候选解释因素来解决这一问题。这产生了不仅语义上可理解而且可操作的解释。我们在以机器人图书管理员场景为中心的轻量级基准中实例化该方法,并在程序生成的导航案例上进行评估。结果表明,与仅语义基线相比,本体引导推理更准确地识别相关解释因素,并且随着语义杂波增加仍保持鲁棒性。总体而言,本文论证了可供性本体不仅可以作为环境的语义描述,还可以作为可解释性和可靠机器人自主性的推理基础。

英文摘要

This paper proposes ontology-guided reasoning for affordance-based explanations of robot navigation. In human environments, it is not sufficient for a robot to detect that its route is blocked. It must also reason about what nearby objects afford, which state changes are possible, and which of these changes would allow it to continue safely. We address this problem by representing nearby entities, their affordances, affordance states, and qualitative spatial relations in a local affordance ontology and by evaluating hypothetical object-affordance state changes as candidate explanation factors. This yields explanations that are not only semantically grounded but also actionable. We instantiate the approach in a lightweight benchmark centered on a robot librarian scenario and evaluate it on procedurally generated navigation cases. The results show that ontology-guided reasoning identifies relevant explanation factors more accurately than a semantic-only baseline and remains robust as semantic clutter increases. Overall, the paper argues that affordance ontologies can serve not merely as semantic descriptions of the environment, but as reasoning foundations for explainability and reliable robot autonomy.

2606.00116 2026-06-02 cs.CL cs.AI cs.LG

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

用KAN模块增强BiGRU以实现法律文档分类与摘要

Ahmed Faizul Haque Dhrubo, Souvik Pramanik, Most. Aysha Siddika Sumona, Shahnewaz Siddique, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

发表机构 * Dept. of ECE, North South University（北南大学电气与计算机工程系）

AI总结 提出一种基于KAN的BiGRU模型,用于低资源多语言法律文档的分类与摘要,通过KAN模块提升分类准确率至67.96%。

详情
Comments
10 pages, 10 figures, 4 tables; version 2, revised after review at ACL 2026
AI中文摘要

本研究引入了一种基于KAN的BiGRU模型的新架构,用于低资源多语言环境下的法律文档分类与摘要任务。为了解决领域语言、不同语言使用、上下文长依赖和类别不平衡等问题,我们使用了由孟加拉国法律文档组成的数据集,这些文档来自Manupatra,包括孟加拉语、英语和音译孟加拉语。我们的分类任务采用BiGRU模型以及Kolmogorov-Arnold网络(KAN)模块,而摘要部分则利用基于注意力的GRU结合KAN模型头部。分类模型达到了67.96%的准确率和0.65的F1分数;摘要的ROUGE-1、ROUGE-2和ROUGE-L指标分别对应0.38、0.23和0.31的F1分数。消融研究表明,使用KAN将分类准确率从57.34%提升至67.96%。此外,我们将所提出的技术与多个基线进行了比较,包括经典机器学习算法和预训练语言模型。

英文摘要

This study introduces a novel KAN-based BiGRU architecture for classifying and summarizing legal documents in a low-resource multilingual setting. To tackle domain-specific language, mixed-language usage, long-range contextual dependencies, and class imbalance, we use a dataset of Bangladeshi legal documents taken from Manupatra, written in Bengali, English, and transliterated Bengali. The classifier combines a BiGRU with a Kolmogorov-Arnold Network (KAN) module, while the summarizer uses an attention-based GRU with a KAN head. The classification model achieves 67.96% accuracy and an F1 score of 0.65; for summarization, the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores are 0.38, 0.23, and 0.31, respectively. An ablation study shows that the KAN module raises classification accuracy from 57.34% to 67.96%. We also compare the proposed technique against several baselines, including classical ML algorithms and pretrained language models.

2606.00115 2026-06-02 cs.CV cs.LG stat.ML

Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions

来自视频的物理:最小轨迹条件下时不变二阶ODE的可辨识性

Yuanyuan Wang, Wenjie Wang, Kun Zhang, Mingming Gong

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 研究从原始像素中辨识连续时间物理定律的结构可辨识性，证明在最小轨迹条件下，仅编码器管道可唯一恢复二阶线性ODE参数，并引入方差下限正则化器稳定无解码器目标。

详情
Comments
Accepted at ICML 2026
AI中文摘要

弥合视觉真实感与物理理解之间的差距是基于视频的世界模型的核心挑战。我们研究从原始像素中辨识连续时间物理定律的结构可辨识性，重点关注仅编码器管道能否唯一恢复二阶线性ODE的参数。我们证明，一个水平集斜率覆盖条件确保学习到的潜在空间与真实物理状态局部仿射，从而实现精确的参数恢复。我们的理论首次给出了不同阻尼机制下最小数据需求的刻画，表明欠阻尼系统可从单个视频片段辨识，而其他机制需要三条不同的轨迹。我们进一步引入方差下限正则化器以稳定无解码器目标并防止潜在坍缩。在合成和真实数据上的验证表明，无需计算密集的像素重建，即可从视频中可靠估计可解释的物理常数，确保物理正确性和透明性。代码可在 https://github.com/wenjiewang3/PhysicsFromVideo 获取。

英文摘要

Bridging the gap between visual realism and physical understanding is a core challenge for video-based world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.
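The damping regimes the abstract refers to can be illustrated with the textbook classification of a homogeneous second-order linear ODE. The sketch below assumes the unit-mass form x'' + c·x' + k·x = 0 with c ≥ 0 and k > 0; the symbols `c` and `k` and the function names are illustrative assumptions, and the code says nothing about the paper's identifiability proof.

```python
import math


def damping_regime(c: float, k: float) -> str:
    """Classify x'' + c*x' + k*x = 0 by the discriminant of r^2 + c*r + k = 0."""
    disc = c * c - 4.0 * k
    if disc < 0.0:
        return "underdamped"        # complex roots: decaying oscillation
    if disc == 0.0:
        return "critically damped"  # repeated real root
    return "overdamped"             # two distinct real roots


def damped_angular_frequency(c: float, k: float) -> float:
    """Oscillation frequency sqrt(4k - c^2) / 2, defined only when underdamped."""
    disc = c * c - 4.0 * k
    if disc >= 0.0:
        raise ValueError("system does not oscillate (not underdamped)")
    return math.sqrt(-disc) / 2.0


regime = damping_regime(0.5, 4.0)             # light damping
omega = damped_angular_frequency(0.5, 4.0)    # frequency of the decaying oscillation
```

An underdamped trajectory keeps oscillating while it decays, which gives an intuition for why that regime might need less data than the non-oscillatory ones, although the precise argument is the paper's.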

2606.00114 2026-06-02 cs.CV cs.IT math.IT

Recursive Vision Transformer with Dynamic Depth and Width Adjustment for Resource-Efficient Image Semantic Communication

面向资源高效图像语义通信的动态深度与宽度可调递归视觉Transformer

Zhilong Zhang, Xinhui Zhang, Gongyu Jin, Sihua Wang, Danpu Liu, Changchuan Yin

发表机构 * Beijing Laboratory of Advanced Information Network(北京先进信息网络实验室) Beijing Key Laboratory of Network System Architecture and Convergence(北京网络系统架构与融合重点实验室) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出一种递归视觉Transformer图像语义通信系统,通过动态深度和宽度调整策略降低参数和计算复杂度,在资源受限设备上实现高效通信。

详情
AI中文摘要

图像语义通信是下一代无线通信系统中的关键组成部分。然而,此类系统通常具有较大的内存占用和较高的计算复杂度,使其难以部署在资源受限的设备上。为了解决这些挑战,我们提出了一种基于视觉Transformer(ViT)的图像语义通信系统。在该系统中,引入递归结构以迭代细化语义特征并减少参数数量。此外,设计了三种动态调整策略以自适应降低计算复杂度:动态深度调整、动态宽度调整以及宽度-深度联合优化。动态深度调整根据图像内容和信道条件自适应确定递归模块的数量,而动态宽度调整则选择性保留重要的神经元和注意力头。宽度-深度联合优化进一步实现了灵活的计算配置。仿真结果验证了所提出的基于递归ViT的系统,结合三种动态调整策略,在相当的计算复杂度下,参数数量减少了48.7%,并且实现了比现有基线更高的重建质量。

英文摘要

Image semantic communication is a critical component in next-generation wireless communication systems. However, such systems typically suffer from large memory footprints and high computational complexity, making them difficult to deploy on resource-constrained devices. To address these challenges, we propose a vision transformer (ViT)-enabled image semantic communication system. In this system, a recursive structure is introduced to iteratively refine semantic features and reduce the parameter count. In addition, three dynamic adjustment strategies are designed to adaptively reduce computational complexity: dynamic depth adjustment, dynamic width adjustment, and joint width-depth optimization. Dynamic depth adjustment adaptively determines the number of recursive modules according to image content and channel conditions, while dynamic width adjustment selectively preserves important neurons and attention heads. The joint width-depth optimization further enables flexible computation configurations. Simulation results verify that the proposed recursive ViT-based system, combined with the three dynamic adjustment strategies, reduces the parameter count by 48.7% and achieves higher reconstruction quality than existing baselines under comparable computational complexity.

2606.00113 2026-06-02 cs.RO

World Models for Robotic Manipulation: A Survey

机器人操作的世界模型:综述

Fangyuan Wang, Ziyuan Wang, Guorui Pei, Mengshi Zhang, Canxi Liang, Jun Hu, Zhongxuan Li, Jinsong Wu, Ning Han, Zeqing Zhang, Jiaming Qi, Hongmin Wu, Shiyao Zhang, Pai Zheng, Jia Pan, David Navarro-Alarcon, Sichao Liu, Peng Zhou

发表机构 * Department of Mechanical Engineering, The Hong Kong Polytechnic University（香港理工大学机械工程系） Department of Mechanical Engineering and Automation, Harbin Institute of Technology（哈尔滨工业大学机械工程与自动化系） School of Advanced Engineering, Great Bay University（大湾区大学先进工程学院） College of Robotics Science and Engineering, Taiyuan University of Technology（太原理工大学机器人科学与工程学院） School of Data Science, City University of Hong Kong (Dongguan)（香港城市大学（东莞）数据科学学院） Department of Mechatronic Engineering, Guangdong Polytechnic Normal University（广东技术师范大学机电工程系） School of Computing and Data Science, The University of Hong Kong（香港大学计算与数据科学学院） School of Electrical and Electronic Engineering, Nanyang Technological University（南洋理工大学电气与电子工程学院） College of Mechanical and Electrical Engineering, Northeast Forestry University（东北林业大学机械与电气工程学院） Greater Bay Area National Center of Technology Innovation（粤港澳大湾区国家技术创新中心） Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University（香港理工大学工业与系统工程系）

AI总结 本文通过三个问题(预测什么未来表示、预测如何与动作连接、何时在机器人学习流程中使用预测)系统综述了机器人操作中的世界模型,将其定义为动作条件预测系统,并分类为五种表示族,提出了功能分类法,总结了基础设施角色、数据集和评估协议,揭示了从任务特定动力学预测器向预测基础设施的演变及开放挑战。

详情
AI中文摘要

机器人操作依赖于在执行前预测动作如何重塑物体、接触和场景几何的能力。学习的世界模型通过预测在机器人干预下任务相关的未来演化提供这种能力,然而该术语现在涵盖潜在动力学模型、动作条件视频生成器、三维和四维场景预测器、物理信息模拟器以及视觉-语言-动作系统中的预测模块。这种广度使文献碎片化,并模糊了对操作重要的设计选择。我们通过三个问题调查机器人操作的世界模型:预测什么未来表示、预测如何与动作连接、以及何时在机器人学习流程中使用预测。我们将世界模型操作性地定义为动作条件预测系统,并将其与感知模块、逆模型、策略、奖励和值函数区分开来。然后,我们将现有工作组织成五种表示族,开发了一个功能分类法,将集成预测-动作模型与显式预测规划器分开,并描述了基础设施角色,包括合成经验生成、候选过滤、基于搜索的评估、学习环境和结果验证。我们进一步将这些角色映射到预训练、后训练和推理适应中,回顾了34个操作数据集,并综合了预测保真度、任务性能和模拟器可靠性的评估协议。本综述表明,世界模型正在从任务特定的动力学预测器演变为机器人学习的预测基础设施,同时揭示了接触建模、幻觉控制、动作对齐和闭环使用下基准测试方面的开放挑战。

英文摘要

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

2606.00110 2026-06-02 cs.CV cs.RO

General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling

广义协变动作建模:通过时空解耦构建广义流形

Huaihai Lyu, Chaofan Chen, Mingyu Cao, Yuheng Ji, Changsheng Xu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出广义动作流形框架,通过时间不变性和几何不变性解耦实现广义协变,提升从稀疏演示中泛化的鲁棒性。

详情
AI中文摘要

从有限数据中实现鲁棒泛化是具身智能的核心挑战。现有方法通过回归绝对坐标失败,这违反了广义协变原理。根本上,这混淆了内在任务几何与刚性执行模式,将策略绑定到特定运动风格和固定速度。为解决此问题,我们提出广义动作流形(GAM)框架,通过结构解耦强制执行广义协变。具体地,GAM通过强制两个正交维度的不变性来实现流形:(1)时间不变性,利用弧长参数化将空间路径几何与时间动力学正交化,确保对速度变化的鲁棒性;(2)几何不变性,其中模式-仿射-分解机制将轨迹映射到姿态归一化坐标框架中的规范“世界线”。这区分了不变几何模式与仿射调制,确保空间泛化性。通过将GAM集成到结构化视觉-语言-动作(VLA)架构中,我们使稀疏演示能够密集填充连续有效的动作流形。实验结果表明,GAM实现了优越的迁移和鲁棒性,优于几何无关基线。

英文摘要

Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Fundamentally, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM realizes the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an Arc-Length Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical "world lines" in a pose-normalized coordinate frame. This distinguishes invariant geometric schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we enable sparse demonstrations to densely populate a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, outperforming geometry-agnostic baselines.
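Arc-length parameterization, which the abstract uses to separate path geometry from timing, can be sketched generically as resampling a polyline at equal steps of travelled distance. This is a standard construction shown under simplifying assumptions (2D points, linear interpolation); it is not the paper's Arc-Length Parameterizer, and the function name is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def resample_by_arc_length(path: List[Point], n: int) -> List[Point]:
    """Return n points spaced uniformly in arc length along a non-empty polyline."""
    # Cumulative distance travelled at each input waypoint.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0.0 or n < 2:
        return [path[0]] * n  # degenerate path or too few samples requested

    out: List[Point] = []
    seg = 0
    for i in range(n):
        s = total * i / (n - 1)  # target arc length for sample i
        # Advance to the segment that contains arc length s.
        while seg < len(cum) - 2 and cum[seg + 1] < s:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0.0 else (s - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = path[seg], path[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# The same straight path recorded at two different speeds (different waypoint spacing).
evenly_paced = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
slow_then_fast = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0)]
a = resample_by_arc_length(evenly_paced, 5)
b = resample_by_arc_length(slow_then_fast, 5)
```

Both recordings resample to the same five points (up to floating-point error), so a policy trained on the resampled path no longer sees the speed at which the demonstration was executed.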

2606.00109 2026-06-02 cs.CV cs.AI cs.LG

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

VDSB-GWSyn: 用于冠状动脉造影中可控且解剖学可行的导丝合成的扩散薛定谔桥

Haoyuan Tang, Zhuo Zhang, Jialin Li, Shuai Xiao, Jiachen Yang

发表机构 * Tianjin University(天津大学)

AI总结 提出基于扩散薛定谔桥的VDSB-GWSyn框架,通过形状先验和血管分割约束生成可控、高保真导丝样本,显著提升下游导丝端点定位精度。

详情
Comments
Early accept to MICCAI 2026
AI中文摘要

冠状动脉导丝端点定位是计算机辅助PCI的基本能力,随着机器人辅助PCI逐渐普及以减少操作者辐射暴露,其重要性日益增加。然而,带有导丝的标注CAG图像稀缺以及现有导丝合成模型的适应性有限,仍是导丝端点定位的关键瓶颈。为解决此问题,我们提出VDSB-GWSyn,一个基于扩散薛定谔桥(DSB)模型的框架,能够在复杂解剖背景下合成可控、高保真的导丝样本。VDSB-GWSyn首先使用我们的形状先验算法学习基本导丝几何形状,然后在血管分割掩码的约束下生成导丝掩码并输出对应的端点坐标,最后通过SPADE条件化的DSB在真实CAG图像上合成逼真的导丝样本。实验结果表明,VDSB-GWSyn合成的导丝样本取得了良好的ROI-FID和ROI-KID,以及高IPR分数。此外,将我们的合成数据用于合成预训练后接真实微调,显著改进了下游导丝端点定位,将MPE从16.01像素降低到7.71像素,PCK@3像素从52.63%提高到86.27%,从而实现了更临床可靠的机器人辅助导丝输送系统部署。此外,具有严格背景保留和解剖可行性约束的可控设备合成的核心设计理念,有可能迁移到其他标注数据稀缺的介入设备感知任务中。

英文摘要

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assisted PCI is progressively adopted to reduce operator radiation exposure. However, the scarcity of annotated CAG images with guidewires and the limited adaptability of existing guidewire synthesis models remain key bottlenecks for guidewire endpoint localization. To address this issue, we propose VDSB-GWSyn, a Diffusion Schrödinger Bridge (DSB) model-based framework, enabling synthesis of controllable, high-fidelity guidewire samples under complex anatomical backgrounds. VDSB-GWSyn first uses our shape prior algorithm to learn the basic guidewire geometry. It then generates guidewire masks under constraints imposed by the vessel segmentation masks and outputs the corresponding endpoint coordinates. Finally, it synthesizes realistic guidewire samples on real CAG images using DSB conditioned with SPADE. Experimental results show that the guidewire samples synthesized by VDSB-GWSyn achieve favorable ROI-FID and ROI-KID, as well as high IPR scores. In addition, incorporating our synthesized data for synthetic pre-training followed by real fine-tuning substantially improves downstream guidewire endpoint localization, reducing MPE from 16.01 px to 7.71 px and increasing PCK at 3 px from 52.63% to 86.27%, leading to more clinically reliable deployment of robot-assisted guidewire delivery systems. Moreover, the core design philosophy of controllable device synthesis with strict background preservation and anatomical feasibility constraints has the potential to transfer to other interventional device perception tasks where annotated data are scarce.
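The two localization metrics quoted above can be computed as in the following sketch. The abstract does not expand the acronyms, so this assumes MPE means the mean Euclidean pixel error and PCK the percentage of predictions that fall within a pixel threshold of the ground truth; the function name is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def endpoint_metrics(
    predicted: List[Point],
    ground_truth: List[Point],
    threshold_px: float = 3.0,
) -> Tuple[float, float]:
    """Return (mean pixel error, PCK in percent) for paired, non-empty endpoint lists."""
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    mean_error = sum(errors) / len(errors)
    within = sum(1 for e in errors if e <= threshold_px)
    pck = 100.0 * within / len(errors)
    return mean_error, pck


# One exact prediction and one that is 5 px off (a 3-4-5 triangle).
mpe, pck_at_3 = endpoint_metrics([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

The two numbers are complementary: the mean error is sensitive to a few large misses, while PCK at a tight threshold reports how often the prediction is accurate enough to act on.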

2606.00105 2026-06-02 cs.CV cs.AI

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

视觉噪声引导的上下文蒸馏用于多模态大语言模型遗忘

Junkai Chen, Yuhao He, Junxiang You, Ruiqi Liu, Chenyu Wang, Shu Wu

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所） School of Advanced Interdisciplinary Sciences, UCAS（中国科学院大学前沿交叉科学学院）

AI总结 提出视觉噪声引导的上下文蒸馏(VGID)框架,通过双模态干预构建教师分布进行蒸馏,实现多模态大语言模型参数级遗忘,平衡遗忘效果与模型效用。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务上取得了显著进展,但它们也可能记忆和暴露敏感或受限知识,引发隐私和更广泛的安全风险。机器遗忘(MU)提供了一种有前景的方法,可以从训练好的模型中移除目标不良知识,而无需从头重新训练,同时保持通用模型效用。然而,在MLLMs中实现有效遗忘仍然特别具有挑战性。现有的基于训练的方法通常难以平衡遗忘效果和模型效用。相比之下,无训练方法如上下文遗忘通过避免参数更新来保持模型效用,但它们不会在参数级别移除记忆的知识,可能仍然容易受到逆向工程攻击。更重要的是,上下文遗忘在多模态设置中不足,其中视觉输入可以提供强条件信号并诱导不良输出。为了解决这些挑战,我们提出了视觉噪声引导的上下文蒸馏(VGID),一种基于蒸馏的MLLM遗忘框架。VGID通过结合视觉扰动与文本上下文遗忘的双模态干预,从冻结的基础模型动态构建面向遗忘的教师分布。由此产生的干预诱导分布作为蒸馏的教师信号,引导学生模型实现参数级遗忘,而无需外部教师模型或显式的不良响应注释。实验结果表明,VGID在保持竞争性模型效用的同时实现了强遗忘效果,在代表性设置中,遗忘集ROUGE-L降低了0.371,而保留集ROUGE-L仅下降0.055。

英文摘要

Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose sensitive or restricted knowledge, raising concerns about privacy and broader safety risks. Machine Unlearning (MU) provides a promising way to remove targeted undesirable knowledge from trained models without retraining from scratch while preserving general model utility. Nevertheless, effective unlearning in MLLMs remains particularly challenging. Existing training-based methods often struggle to balance unlearning effectiveness and model utility. In contrast, training-free methods such as in-context unlearning preserve model utility by avoiding parameter updates, but they do not remove memorized knowledge at the parameter level and may remain vulnerable to reverse-engineering attacks. More importantly, in-context unlearning is insufficient in multimodal settings, where visual inputs can provide strong conditioning signals and induce undesirable outputs. To address these challenges, we propose Visual-Noise Guided In-Context Distillation (VGID), a distillation-based framework for MLLM unlearning. VGID dynamically constructs an unlearning-oriented teacher distribution from the frozen base model through dual-modal intervention that combines visual perturbation with textual in-context unlearning. The resulting intervention-induced distribution serves as a teacher signal for distillation, guiding the student model toward parameter-level unlearning without requiring external teacher models or explicit undesirable response annotations. Experimental results show that VGID achieves strong unlearning effectiveness while preserving competitive model utility, reducing forget set ROUGE-L by 0.371 with only a 0.055 drop in retain set ROUGE-L in a representative setting.

2606.00104 2026-06-02 cs.RO cs.AI

PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

PEACE: 一种用于无人机的带约束执行的规划-执行智能体

Erdem Uysal, Timo Kehrer, Sebastiano Panichella

发表机构 * Institute of Computer Science, University of Bern(伯尔尼大学计算机科学研究所) AI4I - The Italian Institute of Artificial Intelligence(意大利人工智能研究所)

AI总结 提出一种基于大语言模型的规划-执行智能体架构,通过解耦高层任务规划与低层控制,并引入约束执行层和有限重规划,实现无人机可解释、可约束的自主飞行。

详情
Comments
Accepted to ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction
AI中文摘要

基础模型越来越多地被用于驱动自主系统,然而现有方法要么将模型保持在紧密的控制循环中,增加延迟和幻觉风险,要么将自然语言编译成不透明的端到端策略,难以解释、约束,且需要特定领域的数据集和微调。我们提出一种用于基于PX4的无人机的规划-执行智能体,将高层任务规划与低层控制解耦。大语言模型执行单次任务规划,而执行通过结构化的ROS 2工具调用接口(桥接到MAVLink)处理。该系统通过将模块化2D检测器(如YOLO或视觉语言模型)与用于3D物体定位的针孔深度投影模块相结合,构建世界模型。约束执行层强制执行高度限制和水平地理围栏,有限重规划能够从执行时的动作失败中恢复。我们将我们的方法定位在基于基础模型的机器人系统的三种常见设计模式中,并在Gazebo中的PX4软件在环仿真中展示其可行性。结果突出了与紧密耦合的LLM控制相比,改进的可解释性、约束执行和减少的LLM调用。代码、数据集、视频和其他材料可在以下链接找到:https://github.com/erdemuysalx/PEACE

英文摘要

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain and constrain and that require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE
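Two of the components mentioned above, pinhole depth projection and a constraint layer, can be sketched with standard geometry. This is a minimal illustration and not the PEACE implementation: the camera intrinsics (`fx`, `fy`, `cx`, `cy`), the circular geofence centred on the home position, the choice to clamp rather than reject a setpoint, and all function names are assumptions.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def backproject(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Vec3:
    """Pinhole model: pixel (u, v) at the given depth -> 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def enforce_constraints(
    setpoint: Vec3,
    max_altitude: float,
    fence_radius: float,
    min_altitude: float = 0.0,
) -> Vec3:
    """Clamp a local setpoint (x, y, altitude) to an altitude band and a circular geofence."""
    x, y, z = setpoint
    z = min(max(z, min_altitude), max_altitude)
    horizontal = math.hypot(x, y)
    if horizontal > fence_radius:
        # Pull the setpoint back onto the fence boundary along the same bearing.
        scale = fence_radius / horizontal
        x, y = x * scale, y * scale
    return (x, y, z)


# A detection centre 100 px right of the principal point, 2 m away.
target_cam = backproject(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# A planner request that violates both the altitude limit and the fence.
safe = enforce_constraints((30.0, 40.0, 100.0), max_altitude=50.0, fence_radius=25.0)
```

Keeping the check outside the language model means every planned action passes through the same deterministic filter; rejecting the action and triggering replanning would be an equally plausible policy to the clamping shown here.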

2606.00103 2026-06-02 cs.AI

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

评估大语言模型中的交互式推理:一个带有可执行游戏的分层基准

Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou

发表机构 * East China Normal University（华东师范大学） Ant Group（蚂蚁集团）

AI总结 提出一个多轮交互式推理评估框架,将推理视为主动证据获取和信念更新,通过474个可执行游戏基准测试大语言模型在成功率、交互效率、上下文鲁棒性和元认知适应方面的表现。

详情
Comments
preprint version, under review
AI中文摘要

我们引入了一个用于推理评估的多轮交互式框架,将推理视为主动证据获取和信念更新。其中,LLMs仅接收任务规则,必须向隐藏环境发出有针对性的查询,随时间整合部分观察结果,并决定何时提交最终答案。除了标准的成功率和交互效率,我们还在受控上下文扰动下评估上下文鲁棒性,并通过反事实修正和必要性判断评估元认知适应。我们将该框架实例化为一个包含474个可执行游戏的基准,每个游戏在五个固定配置搜索空间(对应五个难度级别)下进行评估,并评估了一系列前沿LLMs。结果表明,该基准具有高度区分性,不仅在成功率上,而且在交互效率上也暴露了巨大差异。此外,我们实证表明,上下文扰动导致适度但持续的下降,而反事实修正和必要性判断导致更大的下降。

英文摘要

We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. In this framework, LLMs receive only the task rules and must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.

2606.00102 2026-06-02 cs.AI math.PR

On the evolution of the concept of probability as a mirror of the evolution of reason

论概率概念的演化作为理性演化的镜像

Jean-Louis Le Mouël, Vincent Courtillot, Dominique Gibert, Vladimir Kossobokov, Jean-Baptiste Boulé, Pierpaolo Zuddas, Fernando Lopes, Païkan Marccagi, Alexis Maineult

Affiliations * Académie des Sciences, Institut de France, Paris, France; DeepField Sensing, France; Institute of Earthquake Prediction Theory and Mathematical Geophysics, Russian Academy of Sciences, Moscow, Russia; Accademia Nazionale delle Scienze detta dei XL, Roma, Italia; Muséum National d’Histoire Naturelle, CNRS UMR7196, INSERM U1154, Paris, France; Sorbonne Université, CNRS, METIS, UMR7619, Paris, France; Laboratoire de Géologie de l’ENS, UMR 8538, Paris, France

AI summary: Takes a historical and epistemological perspective, reading the development of probability theory as a transformation of rationality itself, and examines the roles and limits of probability, fuzzy logic, and deep learning in scientific rationality.

Details
Comments
44 pages, 7 figures
AI Chinese abstract

几个世纪以来,概率论已从博弈演算发展成为不确定性推理的核心框架。本文将其演化不仅解释为数学史,更视为理性本身的转变。从帕斯卡和费马的组合对称性,到贝叶斯和拉普拉斯的归纳逻辑,从泊松的事件统计到柯尔莫哥洛夫的公理化形式化,概率逐步将不确定性、时间和一致性纳入科学判断。这一轨迹在现代贝叶斯推断中达到成熟的认识论形式,尤其是在Tarantola将概率视为信息逻辑的观点中,先验知识与数据被一致地结合。然而,这一框架也暴露了一个局限:概率量化了关于明确定义命题的不确定性,但本身并未形式化用于描述这些概念的概念模糊性。因此,本文考察理性如何超越概率。模糊逻辑被呈现为一种用于分级意义和定性判断的严谨语言,而深度学习则被分析为一种基于几何插值和优化而非显式推理的独特、强大的预测模式。通过将概率、模糊逻辑和深度学习置于共同的历史和认识论视角,本文阐明了它们的角色与局限。它认为当代科学理性不能仅归结为数据驱动的性能,而需要明确阐述不确定性、模糊性和推理。

English abstract

Over the centuries, probability theory has grown from the calculus of games of chance into a central framework for reasoning under uncertainty. This article interprets that evolution not merely as a mathematical history, but as a transformation of rationality itself. From Pascal and Fermat's combinatorial symmetry to the inductive logic of Bayes and Laplace, from Poisson's statistics of events to Kolmogorov's axiomatic formalization, probability progressively incorporated uncertainty, time, and coherence into scientific judgment. This trajectory reaches a mature epistemological form in modern Bayesian inference, especially in Tarantola's view of probability as a logic of information, where prior knowledge and data are combined coherently. Yet this framework also exposes a limit: probability quantifies uncertainty about well-defined propositions, but does not by itself formalize the vagueness of the concepts used to describe them. The article therefore examines how rationality extends beyond probability. Fuzzy logic is presented as a rigorous language for graded meaning and qualitative judgment, while deep learning is analyzed as a distinct, powerful mode of prediction based on geometric interpolation and optimization rather than explicit inference. By situating probability, fuzzy logic, and deep learning in a common historical and epistemological perspective, the article clarifies their roles and limits. It argues that contemporary scientific rationality cannot be reduced to data-driven performance alone, but requires the explicit articulation of uncertainty, vagueness, and inference.

2606.00101 2026-06-02 cs.CV cs.AI

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

CoCoVideo: 基于商业模型的高质量对比基准用于AI生成视频检测

Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma, Yinglin Zheng, Yuxin Lin, Ming Zeng

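Affiliations * School of Informatics, Xiamen University; China Academy of Information and Communications Technology; AI Transcend Pte. Ltd.

AI summary: To address existing datasets' reliance on low-quality open-source models and on watermarked commercial samples, introduces CoCoVideo-26K, a contrastive dataset covering 13 commercial generators, and designs CoCoDetect, a detection framework that combines contrastive learning with a confidence-gated multimodal large language model for robust detection of high-fidelity AI-generated video.

Details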
Comments
Accepted by CVPR 2026
AI Chinese abstract

随着人工智能生成内容(AIGC)技术的快速发展,视频伪造日益普遍,给公共讨论和社会安全带来新挑战。尽管现有深度伪造检测方法取得了显著进展,但AIGC伪造检测仍然具有挑战性,因为现有数据集主要依赖开源视频生成模型,其质量远低于商业AIGC系统。即使包含少量商业样本的数据集也常常保留可见水印,损害真实性并阻碍模型泛化到高保真AIGC视频。为解决这些问题,我们引入了CoCoVideo-26K,一个基于对比学习的商业模型AIGC视频数据集,涵盖13个主流商业生成器,并提供语义对齐的真实-伪造视频对。该数据集能够深入探索真实视频与高质量合成视频之间的差异,并为高逼真视频伪造检测建立新基准。基于该数据集,我们提出了CoCoDetect,一个集成对比学习与置信门控多模态大语言模型(MLLM)推理的检测框架。R3D-18骨干网络提取时空表示,而置信门将不确定案例路由到MLLM进行物理合理性和场景一致性的推理。在CoCoVideo-26K和公共基准上的大量实验证明了最先进的性能,验证了该框架的鲁棒性和泛化能力。我们的代码和数据集可在https://github.com/DonoToT/CoCoVideo获取。

English abstract

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.
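
The confidence-gated routing described above can be sketched as follows. This is an assumption-laden illustration, not CoCoDetect itself: the abstract does not say how confidence is computed, so the sketch assumes it is the larger of the two class probabilities, and `confidence_gate` and its `fallback` callable (standing in for the MLLM call) are hypothetical.

```python
def confidence_gate(p_fake, threshold, fallback):
    """Route one clip given the backbone's probability that it is fake.

    Confident predictions are returned directly; uncertain ones are
    deferred to `fallback`, a callable standing in for MLLM reasoning.
    Returns (label, source).
    """
    confidence = max(p_fake, 1.0 - p_fake)  # assumed confidence measure
    if confidence >= threshold:
        label = "fake" if p_fake >= 0.5 else "real"
        return (label, "backbone")
    return (fallback(), "mllm")
```

The design point is cost: the expensive model is called only for the fraction of clips near the decision boundary, and raising `threshold` sends more clips to it.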

2606.00098 2026-06-02 cs.CV eess.IV

Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

分割引导的空间索引用于可泛化和可解释的深度伪造检测

Izaldein Al-Zyoud, Abdulmotaleb El Saddik

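Affiliations * University of Central Florida

AI summary: Proposes segmentation-guided spatial indexing: a frozen FaRL parser assigns semantic labels to the patch tokens of DINOv3 ViT-L/16, and only semantically relevant regions are selected for classification, giving generalizable and explainable deepfake detection.

Details
AI Chinese abstract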

我们引入了分割引导的空间索引,用于可泛化和可解释的深度伪造检测。关键思想颠倒了标准设计顺序:不是先汇集所有人脸token再分类,而是先选择语义上有意义的patch token,然后仅汇集这些token。一个冻结的FaRL解析器为每个DINOv3 ViT-L/16 patch token分配一个语义标签;丢弃非目标token;一个线性探针对保留的区域进行分类。这种空间索引利用了DINOv3的patch级空间一致性(即产生涌现分割的相同属性),向探针呈现一个更纯净的区域子空间,其中与操作相关的证据较少被全脸线索稀释。区域归因是结构性的:当嘴部模型预测为假时,决策仅使用了嘴部token,而不是叠加的显著性图。在Celeb-DF v2上,嘴部索引探针的AUC达到0.905,优于LipForensics(+8.1个百分点)和Xception(+16.9个百分点),且无需对DINOv3或FaRL进行微调,也无需目标域数据。消融实验隔离了机制:用DINOv3的CLS token替换区域选择,Celeb-DF v2 AUC下降26.4个百分点;用FaRL特征替换DINOv3,AUC下降20.9个百分点。DINOv3表示和空间索引都是独立必要的;单独任何一个都无法达到完整系统的性能。

English abstract

We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
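
The select-then-pool order described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: tokens are plain Python lists rather than DINOv3 features, labels are strings rather than FaRL parser output, pooling is assumed to be a mean, the probe is assumed to be logistic, and `pool_region` and `linear_probe` are hypothetical names.

```python
import math


def pool_region(tokens, labels, target):
    """Mean-pool only the patch tokens whose parser label equals `target`."""
    kept = [tok for tok, lab in zip(tokens, labels) if lab == target]
    if not kept:
        raise ValueError(f"no patch tokens labelled {target!r}")
    dim = len(kept[0])
    return [sum(tok[i] for tok in kept) / len(kept) for i in range(dim)]


def linear_probe(feature, weights, bias):
    """Logistic linear probe: probability that the pooled region is fake."""
    logit = sum(w * x for w, x in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Because tokens outside the target region never reach the probe, the region attribution follows from the data flow itself, which is the sense in which the abstract calls it structural.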

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

弥合2D-3D鸿沟:面向视觉语言导航的分层语义几何地图

Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

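Affiliations * School of Computer Science and Technology, East China Normal University; Bosch Corporate Research; King Abdullah University of Science and Technology

AI summary: Proposes a Hierarchical Semantic-Geometric Map (HSGM) that converts 3D geometric information into a structured representation VLMs can interpret, combining high-level semantic planning by a VLM with classical path planning for zero-shot vision-language navigation, and reaches state-of-the-art performance on the R2R-CE and RxR-CE benchmarks.

Details
AI Chinese abstract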

视觉语言导航(VLN)使具身智能体能够通过遵循语言指令在未知环境中到达目标位置。尽管近期视觉语言模型(VLM)取得了进展,但仍存在关键的语义-几何鸿沟:VLM擅长语言和2D视觉理解,但在3D空间推理方面表现不佳,且无法捕捉动作与空间转换之间的因果动态,导致导航不可靠,尤其在零样本设置中。为弥合这一鸿沟,我们提出分层语义几何地图(HSGM),将3D几何信息转化为与VLM兼容的结构化表示,有效将其与物理世界连接。具体而言,HSGM表示为多通道俯视图,组织为三个层次:(1)几何层,记录可导航区域和障碍物;(2)语义层,表示物体及其关系;(3)决策层,支持高层任务推理和目标选择。导航过程中,VLM作为高层语义规划器,解释HSGM编码的空间布局以选择几何有效航点,而航点间的低层无碰撞运动由经典路径规划算法执行,从而将语义推理与动作执行完全解耦。此外,复杂指令被分解为子任务,以缓解长程导航中的进度遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明,我们的零样本框架达到了最先进性能,甚至优于若干监督方法。代码见 https://github.com/Teacher-Tom/HSGM_public。

English abstract

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) a geometric level that records navigable regions and obstacles, (2) a semantic level that represents objects and their relations, and (3) a decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate progress forgetting and hallucination in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.
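
The low-level half of this split, collision-free movement between waypoints on the geometric level, can be sketched with a grid planner. The abstract does not name the classical algorithm used, so breadth-first search on a 4-connected occupancy grid is an assumption made here for illustration, and `plan_path` is a hypothetical name. Waypoints would come from the VLM; here they are simply given as grid cells.

```python
from collections import deque


def plan_path(grid, start, goal):
    """Breadth-first search on a 4-connected occupancy grid.

    grid[r][c] == 0 marks a navigable cell, any other value an obstacle.
    Returns the shortest list of (row, col) cells from start to goal,
    or None if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

A `None` result is useful to the high-level planner as well: it signals that a chosen waypoint is not reachable on the current map and another one should be selected.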