arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01620 2026-06-02 cs.CV

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo

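Affiliations: Microsoft Research; Microsoft AI

AI summary: Proposes a streaming talking-portrait video generation framework that combines a causal video VAE with an autoregressive latent denoising model, using reference-image guidance to achieve real-time, high-quality generation.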

Comments CVPR 2026 (Highlight) Camera ready

Abstract

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.
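The abstract describes a rectified-flow generator that produces latents block by block. As a rough illustration only, the sketch below integrates a velocity field with plain Euler steps and generates blocks sequentially, each conditioned on the blocks before it. The function names, the `velocity_for_block` conditioning hook, the step count, and the Euler solver are all assumptions made for this example, not details taken from the paper.

```python
import random


def euler_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate_blockwise(velocity_for_block, num_blocks, block_dim,
                       rng: random.Random, num_steps=8):
    """Generate latent blocks one at a time, each conditioned on earlier blocks.

    velocity_for_block is a hypothetical hook: given the list of blocks
    generated so far, it returns a velocity function for the next block.
    """
    history = []
    for _ in range(num_blocks):
        # Every block starts from fresh Gaussian noise.
        noise = [rng.gauss(0.0, 1.0) for _ in range(block_dim)]
        velocity = velocity_for_block(history)
        history.append(euler_sample(velocity, noise, num_steps))
    return history
```

In a real system the velocity function would be a trained network that also receives audio features; here any callable with the signature `velocity(x, t)` works.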

2606.01617 2026-06-02 cs.CL cs.AI

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang

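Affiliations: Oregon State University; University of Wisconsin–Madison

AI summary: Proposes EvoPool, an evolutionary multi-agent framework that iteratively evolves programmatic annotators and aggregates their votes, substantially improving supervision in specialized domains at low labeling cost.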

Comments 39 pages, 7 figures. Code: https://github.com/tianyi0216/EvoPool

Abstract

Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: https://github.com/tianyi0216/EvoPool

2606.01615 2026-06-02 cs.CV cs.MM

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua

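Affiliations: Nanyang Technological University; National University of Singapore

AI summary: Proposes RDMF, a multimodal fusion framework based on reaction-diffusion processes that dynamically aligns video and text by mimicking biological pattern formation, for video moment retrieval and highlight detection.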

Comments Published in ACM MM 2025. Addresses some typos.

Abstract

Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.
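For readers unfamiliar with the Gray-Scott model named in the abstract, the following is a minimal sketch of the classic two-species system on a 1D ring, stepped with explicit Euler. It shows only the textbook equations; how RDMF maps video and text features onto the two species is not specified here, and the parameter values (`du`, `dv`, `feed`, `kill`) are common demonstration defaults chosen for this example.

```python
def laplacian_1d(x):
    """Discrete Laplacian on a ring (periodic boundary), unit grid spacing."""
    n = len(x)
    return [x[(i - 1) % n] - 2.0 * x[i] + x[(i + 1) % n] for i in range(n)]


def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.065, dt=1.0):
    """One explicit Euler step of the Gray-Scott system.

    du/dt = Du * lap(u) - u*v^2 + feed * (1 - u)
    dv/dt = Dv * lap(v) + u*v^2 - (feed + kill) * v
    """
    lap_u, lap_v = laplacian_1d(u), laplacian_1d(v)
    new_u, new_v = [], []
    for i in range(len(u)):
        reaction = u[i] * v[i] * v[i]  # non-linear coupling term
        new_u.append(u[i] + dt * (du * lap_u[i] - reaction + feed * (1.0 - u[i])))
        new_v.append(v[i] + dt * (dv * lap_v[i] + reaction - (feed + kill) * v[i]))
    return new_u, new_v


def simulate(n=64, steps=200):
    """Seed a small perturbation in the middle of the ring and evolve it."""
    u = [1.0] * n
    v = [0.0] * n
    for i in range(n // 2 - 3, n // 2 + 3):
        u[i], v[i] = 0.5, 0.25
    for _ in range(steps):
        u, v = gray_scott_step(u, v)
    return u, v
```

The diffusion terms spread each species to its neighbours, while the `u*v^2` term is the non-linear reaction that amplifies `v` where both species are present.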

2606.01612 2026-06-02 cs.CV cs.LG

Self-Improving Small Object Grounding in LVLMs

Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu, Jin Sun

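Affiliations: University of Georgia

AI summary: Uses the internal attention patterns of LVLMs to pick the best box from multiple candidates, via either a lightweight IoU regressor or a training-free attention-entropy selector, yielding self-improving small-object grounding.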

Comments 29 Pages, 15 Figures

Abstract

Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, indicating that useful attention structure improves both localization reliability and interpretability in LVLMs.
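Ranking candidates by attention entropy, as ACS-Free is described as doing, can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the data layout (one list of attention weights per selected head), the averaging over heads, and the default preference for low entropy are all choices made for this example; the abstract does not say which direction of entropy is preferred or how heads are combined.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of an attention map after normalisation."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive mass")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def rank_candidates(candidates, prefer_low_entropy=True):
    """Order candidate boxes by mean attention entropy over the chosen heads.

    candidates: dict mapping a box id to a list of per-head attention
    weight lists (hypothetical layout for this sketch).
    """
    def score(box_id):
        heads = candidates[box_id]
        return sum(attention_entropy(h) for h in heads) / len(heads)

    return sorted(candidates, key=score, reverse=not prefer_low_entropy)
```

A sharply peaked attention map has entropy near zero, while a uniform map over `n` positions has entropy `log(n)`, so the score separates focused from diffuse attention without any trained component.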

2606.01610 2026-06-02 cs.AI

Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

Haoben Huang, Shuxin Liu, Ou Wu, Di Gao

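Affiliations: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences

AI summary: To address the ripple effects triggered by single edits in large language models, proposes a Joint Neighborhood Optimization framework that uses pressure-aware coordination and a semantic pre-execution gate to jointly handle the coupled editable-side and preserved-side pressures, improving propagation and preservation metrics on RippleEdits by at least 7.0%.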

Abstract

Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to related facts and unintended perturbation of preserved ones. Existing methods address these two effects separately, without explicitly modeling their coupling. We challenge this separation through an analysis of ripple responses across typical baselines, identifying two coupled design pressures: editable-side coordination and preserved-side leakage. We propose Joint Neighborhood Optimization (JNO), a new knowledge-editing framework to formalize and jointly address both pressures at the target-planning stage. JNO instantiates this principle through Pressure-Aware Coordination (PAC), which jointly optimizes neighborhood target representations under coupled constraints, and a semantic pre-execution gate that rejects high-risk target plans before parameter execution. Experiments on RippleEdits show JNO improves propagation and preservation metrics by at least 7.0% while preserving cross-backbone editing stability.

2606.01608 2026-06-02 cs.CV

Exploiting Semantic and Pixel Representations for Ultra-Low Bitrate Image Compression

Hao Wei, Yanhui Zhou, Chenyang Ge, Saeed Anwar, Ajmal Mian

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; School of Information and Telecommunication, Xi’an Jiaotong University; Department of Computer Science and Software Engineering, The University of Western Australia

AI summary: Proposes SPRDiff, a diffusion-based compression method whose triple-encoder architecture and distortion-aware reconstruction module preserve both semantic consistency and pixel-level fidelity at ultra-low bitrates, achieving a state-of-the-art rate-distortion-perception trade-off.

Abstract

Most existing extreme compression methods fail to achieve an optimal rate-distortion-perception trade-off, as they typically prioritize perceptual fidelity and visual realism over pixel-level accuracy. Consequently, the resulting reconstructions often deviate noticeably from the originals. Ultra-low bitrate image compression is therefore crucial, not only for producing extremely compact representations but also for ensuring that reconstructed images remain semantically coherent and faithful to the source at the pixel level. To this end, we propose SPRDiff, a diffusion-based compression method that fully leverages both semantic and pixel representations, thereby enhancing reconstruction fidelity under ultra-low bitrate constraints. Specifically, we develop a triple-encoder architecture that utilizes high-fidelity features from pretrained distortion-oriented and semantic-oriented encoders to compensate for the limited representations extracted by the frozen VAE encoder, thereby improving latent compression and entropy modeling. To further enhance the reconstruction fidelity of diffusion models, we introduce a distortion-aware reconstruction module with dual feature extraction. This module not only generates a coarse reconstruction that preserves the main structures, but also provides practical and accurate semantic- and pixel-level conditional signals to guide the diffusion model. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in the rate-distortion-perception trade-off at extremely low bitrates (below 0.03 bpp), effectively preserving both perceptual quality and pixel-wise fidelity in the reconstructed images. We will release the source code and trained models at https://github.com/cshw2021/SPRDiff.

2606.01607 2026-06-02 cs.LG cs.AI

FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment

Nazmus Shakib Shadin, Aaron Cummings, Xinyue Zhang, Bobin Deng

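Affiliations: Department of Computer Science, Kennesaw State University, Marietta, GA 30060, USA

AI summary: Proposes the FedMTFI architecture, which combines multi-teacher knowledge distillation with Shapley-value feature importance to improve model accuracy and interpretability in heterogeneous federated learning.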

Comments Accepted by IJCNN 2026

Abstract

Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transferring sensitive data, it allows devices to share only model weights, keeping personal data local and secure. However, in real-world settings, the data held by devices is often unevenly distributed, and devices typically differ in computing power and memory capacity. These differences make it harder for FL to maintain consistent performance across the system. To address these issues, we propose FedMTFI, a novel architecture that combines multi-teacher knowledge distillation (MTKD) with feature importance to improve the FL process in heterogeneous environments. In FedMTFI, clients are clustered based on similar hardware and model types. Each cluster trains a specific model on non-independent and identically distributed (non-IID) data. Within a cluster, every client updates that model using only its own local private data. The server then aggregates the locally trained models in each cluster using FedAvg to form multiple prototype models. These prototypes then serve as teacher models to train a global generalized student model using MTKD. What distinguishes FedMTFI is the integration of Shapley values (SHAP) to emphasize important features during distillation, which enhances both accuracy and interpretability. Experimental results show that FedMTFI achieves higher accuracy than traditional FL algorithms and performs more effectively under non-IID data conditions.
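The per-cluster aggregation step can be sketched with standard FedAvg, which averages client parameters weighted by local sample counts. The code below is a minimal illustration assuming each model is a flat list of floats; the `build_prototypes` helper and the cluster dictionary layout are hypothetical names introduced for this example, and the distillation and SHAP stages are not shown.

```python
def fedavg(client_weights, client_sizes):
    """Average parameter vectors, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client model")
    total = float(sum(client_sizes))
    num_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(num_params)
    ]


def build_prototypes(clusters):
    """Run FedAvg separately inside each cluster to get one prototype each.

    clusters: dict mapping a cluster name to a list of
    (weights, num_samples) pairs, one pair per client in that cluster.
    """
    prototypes = {}
    for name, members in clusters.items():
        weights = [w for w, _ in members]
        sizes = [n for _, n in members]
        prototypes[name] = fedavg(weights, sizes)
    return prototypes
```

Aggregating within clusters rather than globally sidesteps the problem that clients with different hardware may run models whose parameter vectors are not the same shape.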

2606.01604 2026-06-02 cs.CV

Paving the Way for Point Cloud Video Representation Learning Using A PDE Model

Zhuoxu Huang, Zhenkun Fan, Jungong Han, Josef Kittler

Affiliations: Department of Computer Science, Aberystwyth University; Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University; Department of Electrical Engineering, Surrey University

AI summary: Proposes MotionPDE, which models spatial-temporal correlation learning as a solvable partial differential equation (PDE) refined by a contrastive learning structure, and serves as a plug-and-play module that improves point cloud video representation learning.

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) in 2026

Abstract

Investigating spatial-temporal correlations, specifically how spatial points vary over time, is crucial for understanding point cloud videos. Traditional methods, particularly flow-based techniques, struggle with these correlations due to the unordered spatial arrangement of sequential point cloud data. To address this challenge, we propose a novel approach that regularizes spatial-temporal correlation learning by formulating the problem as a solvable Partial Differential Equation (PDE). While PDEs have long been effective in the physical domain, their application to novel sequential data like point cloud video remains underexplored. Inspired by fluid analysis, we construct a simplified PDE, and the process of solving the PDE is guided and refined by a contrastive learning structure between the temporal embeddings and the spatial embeddings. With this extra supervision, our method, named MotionPDE, serves as an effective, plug-and-play enhancement module for existing backbone models, adding minimal computational overhead and parameters. Capitalizing on the contrastive learning process, we delve deeper into the self-supervised capabilities of MotionPDE, yielding promising results that underscore its utility and adaptability in point cloud video data interpretation. The code repo with trained checkpoints will be available at https://github.com/zhh6425/motionpde.git to facilitate future research.

2606.01601 2026-06-02 cs.CV

EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers

Jianlin Xiang, Yanshan Li, Linhui Dai

Affiliations: Institute of Intelligent Information Processing, Shenzhen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University; Shenzhen Key Laboratory of Modern Communications and Information Processing, Shenzhen University

AI summary: Proposes the EIVE framework, which reformulates decoder cross-attention as an instance-level feature attribution pathway to generate instance-level saliency maps directly, without gradient computation or input perturbation, for efficient explanation of DETR-like detectors.

Comments 17 pages, 11 figures

Abstract

Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.

2606.01600 2026-06-02 cs.CV cs.CL cs.RO

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu

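Affiliations: Singapore Management University; Fudan University; Princeton University

AI summary: To assess the trustworthiness of video world models in robotic manipulation, proposes the RoboTrustBench benchmark covering four scenarios (normal, constraint-sensitive, counterfactual, and adversarial); using expert-validated instruction-image pairs and a six-dimensional evaluation protocol, it finds that current models fall short in constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression.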

Comments Project: https://huiqiongli.github.io/RoboTrustBench/

Abstract

Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.

2606.01599 2026-06-02 cs.AI

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, Jin Sun

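Affiliations: University of Georgia

AI summary: Proposes TRON, an online environment framework in which controllable generator-verifier programs produce unlimited training instances for visual-reasoning reinforcement learning, improving performance on multiple multimodal benchmarks.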

Comments 27 pages, 8 figures

Abstract

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with TRON consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.
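A generator-verifier program of the kind described can be illustrated with a toy counting environment. The sketch below is an assumption-laden example, not code from the TRON suite: the shape vocabulary, the mapping from difficulty level to object count, and the `Instance` fields are invented for illustration, and the image-rendering step is omitted entirely.

```python
import random
from dataclasses import dataclass

SHAPES = ("circle", "square", "triangle")


@dataclass(frozen=True)
class Instance:
    shapes: tuple   # latent visual state: the shapes that would be drawn
    target: str     # shape the question asks about
    question: str


def generate_instance(level, rng):
    """Sample a fresh latent state; higher levels place more shapes."""
    num_shapes = rng.randint(level + 1, 2 * level + 2)
    shapes = tuple(rng.choice(SHAPES) for _ in range(num_shapes))
    target = rng.choice(SHAPES)
    return Instance(shapes, target, f"How many {target}s are in the image?")


def verify(instance, response):
    """Exact rule-based check: recount the target in the latent state."""
    try:
        predicted = int(response.strip())
    except ValueError:
        return False
    return predicted == instance.shapes.count(instance.target)
```

Because the answer is recomputed from the latent state rather than looked up in a dataset, every call with a new random seed yields a new, exactly verifiable training instance.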

2606.01597 2026-06-02 cs.RO cs.MA

Physics-Informed Modeling and Control of Emergent Behaviors in Robot Swarms

Zixuan Jin, Wenzhuo Zhang, Shuxian Quan, Zirui Dong, Fangwen Ye, Yuchen Shi, Cheng Xu

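Affiliations: School of Computer and Communication Engineering, University of Science and Technology Beijing; Shunde Innovation School, University of Science and Technology Beijing

AI summary: Proposes the PhySwarm framework, which combines a multi-phase advection-diffusion-reaction macroscopic model and an equivalent deterministic motion microscopic model with a neural-physics controller to model and control multi-stage emergent behaviors in robot swarms.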

Abstract

Robot swarms can exhibit coherent collective behaviors through local perception, limited communication and decentralized decision-making, yet modeling and controlling such emergence remains challenging when behaviors unfold over multiple phases. Here we introduce PhySwarm, a physics-informed micro-macro framework that represents multi-stage swarm emergence as physically constrained density-field evolution coupled to executable robot motion. At the macroscopic level, a multi-phase advection-diffusion-reaction model (Macro-ADR) describes phase-dependent swarm-density evolution through directed transport, diffusion-based spatial regulation and behavioral phase transitions. At the microscopic level, an equivalent deterministic motion model (Micro-EDM) realizes these mechanisms through potential-field advection, density-gradient compensation and rate- or event-gated phase switching. A neural-physics controller (NPC) maps local observations and temporal memory to bounded physical parameters, and is trained with a reinforcement learning-PINN objective that combines task rewards with macro-scale density residuals and micro-scale motion-consistency constraints. In several proof-of-concept swarm missions (including trail-guided foraging, formation-reconfigurable navigation and role-adaptive search and rescue), we demonstrate that PhySwarm can generate distinct multi-stage emergent behaviors within a unified physics-informed modeling framework. The learned density fields and physical parameters provide interpretable evidence of how advection, diffusion and reaction jointly regulate multi-stage swarm organization. These results establish a physics-informed route for learning, interpreting and controlling emergent behaviors in robot swarms.
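As background for the Macro-ADR description, a generic multi-phase advection-diffusion-reaction equation can be written as below. This is the textbook form, offered as a sketch; the exact equation used by PhySwarm is not given in the abstract, and every symbol here is an assumption: $\rho_k$ is the swarm density in behavioral phase $k$, $\mathbf{v}_k$ its advection velocity, $D_k$ its diffusion coefficient, and $r_{jk}$ the rate of switching from phase $j$ to phase $k$.

```latex
\frac{\partial \rho_k}{\partial t}
  + \nabla \cdot \left( \rho_k \mathbf{v}_k \right)
  = D_k \nabla^2 \rho_k
  + \sum_{j \neq k} \left( r_{jk}\,\rho_j - r_{kj}\,\rho_k \right)
```

The advection term carries density along a directed flow, the diffusion term smooths it spatially, and the reaction sum moves density between phases while conserving the total.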

2606.01595 2026-06-02 cs.LG

Uncertainty-Calibrated Diffusion for Reliable 3D Molecular Graph Generation

Fang Wan, Jingxiang Qu, Yi Liu

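Affiliations: State University of New York at Stony Brook

AI summary: To address the degradation in sampling quality that epistemic uncertainty causes in diffusion models for 3D molecular graph generation, proposes Uncertainty-Calibrated Diffusion (UCD), which calibrates the reverse diffusion process to compensate for epistemic uncertainty and achieves the best performance on multiple benchmarks.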

Abstract

Bayesian inference provides a principled framework for modeling epistemic uncertainty in neural networks by treating predictions as distributions rather than deterministic values. Meanwhile, diffusion-based models for 3D molecular graph generation operate on fragile geometric structures governed by strict chemical constraints, making inference highly sensitive to uncertainty miscalibration. A largely overlooked issue is that epistemic uncertainty arising from the learned denoiser interacts with the aleatoric uncertainty intentionally injected during reverse diffusion, leading to systematic variance inflation and a mismatch between the true distribution and the simulated distribution. This effect is particularly detrimental for high-precision molecular generation, where even small deviations can violate chemical validity. In this work, we provide a theoretical and empirical analysis of how epistemic uncertainty propagates through diffusion inference and degrades sampling quality. Building on this investigation, we propose UCD (Uncertainty-Calibrated Diffusion), a simple yet effective method that calibrates the reverse diffusion process to account for epistemic uncertainty. Extensive experiments on standard 3D molecular benchmarks demonstrate that UCD consistently improves sampling quality across diverse baseline methods, establishing new state-of-the-art performance for 3D molecular diffusion. The code is available at https://github.com/jiuguaiwf/UCD.

2606.01591 2026-06-02 cs.CV cs.LG

TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

Ali Alavi

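Affiliations: The Ohio State University

AI summary: Proposes TLG, a three-tier system that reconstructs action timelines, parses questions into temporal-logic programs, and executes them deterministically, backed by a strong vision-language model and a frontier reasoning model, raising video question answering accuracy from 46.9% to 71.37%.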

Abstract

The TimeLogic Challenge evaluates formal temporal-logic reasoning over video: 16 operators (before, after, until, since, always, co-occur, ordering, ...) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, a +24.5 absolute gain, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations, not larger models, drive accuracy.
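Deterministic execution of a temporal-logic program over an action timeline, the first tier described above, can be sketched as follows. The operator semantics are assumptions chosen for this example (for instance, `before` holds if any occurrence of the first action ends no later than some occurrence of the second begins); the benchmark's actual definitions may differ, and only three of the 16 operators are shown. The `Segment` type and the `(operator, a, b)` program format are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    action: str
    start: float
    end: float


def occurrences(timeline: List[Segment], action: str) -> List[Segment]:
    return [s for s in timeline if s.action == action]


def before(timeline, a, b):
    """True if some occurrence of a ends no later than some b starts."""
    return any(x.end <= y.start
               for x in occurrences(timeline, a)
               for y in occurrences(timeline, b))


def after(timeline, a, b):
    return before(timeline, b, a)


def co_occur(timeline, a, b):
    """True if some occurrence of a overlaps in time with some b."""
    return any(x.start < y.end and y.start < x.end
               for x in occurrences(timeline, a)
               for y in occurrences(timeline, b))


OPERATORS = {"before": before, "after": after, "co-occur": co_occur}


def execute(program: Tuple[str, str, str], timeline: List[Segment]) -> bool:
    """Run a parsed (operator, action_a, action_b) program on a timeline."""
    op, a, b = program
    return OPERATORS[op](timeline, a, b)
```

Once the timeline is known, each question reduces to interval comparisons, which is why the answer no longer depends on a model's ability to localize actions in the video.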

2606.01590 2026-06-02 cs.CV cs.GR

Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis

面向街景新视角合成的有效多传感器条件控制

Zhengfei Kuang, Adam Sun, Liyuan Zhu, Tong Wu, Shengqu Cai, Jonathan Tremblay, Iro Armeni, Ehsan Adeli, Lior Yariv, Gordon Wetzstein

发表机构 * Stanford University(斯坦福大学) NVIDIA

AI总结 提出StreetNVS视频扩散框架,通过参考增强相机注意力模块和相对射线级位置编码联合利用LiDAR、环视图像和相机位姿,实现稀疏LiDAR条件下的高质量街景新视角合成。

详情
AI中文摘要

现代车辆平台配备了丰富的传感器套件,包括LiDAR、标定多相机系统和精确的自车运动,这原则上为从新视角重新渲染驾驶场景提供了强信号。最近一系列工作利用视频扩散模型完成此任务,通过其生成先验从稀疏车辆观测中合成合理的新视角。然而在实践中,现有方法仅利用了该信号的一部分,且其质量往往随着目标轨迹偏离记录驾驶路径而下降。我们认为这本质上是一个多传感器融合问题:稀疏LiDAR重投影提供准确但不完整的度量几何,环视参考图像提供密集外观但不提供度量深度,而相机位姿将两者跨视图连接起来。我们引入StreetNVS,一种视频扩散框架,通过基于相对射线级位置编码的参考增强相机注意力模块,联合对所有三种信号进行条件控制。我们开发了一种两阶段课程训练策略,逐步使模型适应越来越稀疏的LiDAR。在Waymo Open数据集上,StreetNVS在稀疏LiDAR条件下显著优于最先进的基线,与依赖密集10-100倍点云的方法性能相当。我们进一步展示了沿极端轨迹外路径(如高程、车道偏移、拉回和旋转)合成连贯视频的能力。我们的网站:https://streetnvs.github.io

英文摘要

Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers a strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10-100 times denser point clouds. We further show that it can synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Our website: https://streetnvs.github.io

2606.01584 2026-06-02 cs.CL cs.AI

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

识别LLM中高置信度的社会偏见以构建可信的对话辅导代理

Aitor Arronte Alvarez, Naiyi Xie Fincham

发表机构 * University of Hawaii at Manoa(夏威夷大学马诺亚分校)

AI总结 本研究通过生成对话数据集,评估大型语言模型在辅导场景中检测社会偏见的能力,发现模型在对话上下文中比基准测试更难检测偏见,且对错误判断过度自信,影响推理和反馈。

Comments Accepted for AIED 2026

详情
AI中文摘要

对话辅导代理已被证明能提高学习参与度和学生成绩,大型语言模型(LLM)越来越多地被用于这些系统以提供可扩展的个性化反馈。然而,LLM可能会延续或放大刻板的社会偏见,在教育环境中带来特殊风险。在本研究中,我们评估了LLM在对话辅导场景中的表现,以识别高置信度的社会偏见,即模型在无法识别辅导对话中的偏见判断时仍保持高度自信,可能影响其推理和向学习者提供的反馈。我们提出了一种新的数据集生成方法,通过重新生成学生-AI辅导教师互动并引入来自基准数据集的受控偏见轮次,实现在自然教学条件下的偏见评估。利用这些数据,我们评估了多个LLM检测刻板偏见的能力,并通过计算和人工评估分析了其响应背后的置信度和推理。我们发现,在对话辅导上下文中,偏见检测比基于基准的评估更具挑战性,且最先进的LLM对其刻板偏见陈述的错误评估过于自信。此外,模型置信度强烈影响推理和反馈,突显了基于LLM的辅导代理中过度自信和偏见行为的风险。最后,我们讨论了影响、缓解考虑和未来研究方向。

英文摘要

Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particular risks in educational settings. In this study, we evaluate LLMs in conversational tutoring scenarios to identify high-confidence social biases, instances where models are unable to identify biased judgments in tutoring conversations while maintaining strong confidence in their assessments, potentially affecting their reasoning and the feedback they provide to learners. We present a new dataset generation method that enables bias evaluation under naturalistic instructional conditions by regenerating student-AI tutor interactions and introducing turns with controlled bias derived from a benchmark dataset. Using this data, we assess multiple LLMs' ability to detect stereotypical biases and analyze the confidence and reasoning underlying their responses through computational and human evaluations. We find that bias detection is substantially more challenging in conversational tutoring contexts than in benchmark-based evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Moreover, model confidence strongly influences reasoning and feedback, highlighting the risks of overconfident, biased behavior in LLM-based tutoring agents. We conclude by discussing implications, mitigation considerations, and directions for future research.

2606.01577 2026-06-02 cs.CV

FLAME: Physics-Guided Neural Operators for Onboard Satellite Methane Detection in Hyperspectral Imagery

FLAME:物理引导的神经算子用于高光谱图像中星载甲烷检测

Junhyuk Heo, Junhwan Park, Sancheol Sim, Beomkyu Choi, Woojin Cho

发表机构 * KAIST(韩国科学技术院)

AI总结 提出FLAME,一种将甲烷吸收物理直接嵌入架构的物理引导神经算子,在星载甲烷检测中实现最高精度,像素级假阳性率降低近3倍,参数最少且满足星载硬件延迟预算。

详情
AI中文摘要

甲烷是近期气候变化的主要驱动因素,快速识别其排放源是一项关键的气候干预措施。星载高光谱成像是完成此任务的主要工具,但每个传感器产生的数据量使得地面检测不切实际,因此需要星载检测。经典方法在星载硬件上产生过高的计算成本,而深度学习模型速度快但检测质量不足。我们提出FLAME,一种物理引导的神经算子,将甲烷吸收的物理直接构建到其架构中。在甲烷检测基准上,FLAME在所有评估方法中实现了最高的检测精度,将像素级假阳性率相比最强神经基线降低了近3倍,在学习基线中使用参数最少,并且在星载卫星硬件的延迟预算内运行。

英文摘要

Methane is a major driver of near-term climate change, and rapidly identifying its emission sources is a critical climate intervention. Spaceborne hyperspectral imagery is the primary tool for this task, but the volume of data produced by each sensor makes ground-based detection impractical and necessitates onboard detection. Classical methods incur prohibitive computational cost on onboard hardware, while deep learning models are fast but fall short on detection quality. We propose FLAME, a physics-guided neural operator that builds the physics of methane absorption directly into its architecture. On the methane detection benchmark, FLAME achieves the highest detection accuracy among all evaluated methods, reduces the pixel-level false positive rate by nearly 3× over the strongest neural baseline, uses the fewest parameters among learned baselines, and runs within the latency budget of onboard satellite hardware.

2606.01576 2026-06-02 cs.CV

Deformable Wiener Filter for Future Video Coding

可变形维纳滤波器用于未来视频编码

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

发表机构 * National Engineering Research Center of Visual Technology, School of Computer Science, Peking University(视觉技术国家工程研究中心,北京大学计算机科学学院) Core Media Technology, Disney Streaming(核心媒体技术,迪士尼流媒体) Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所) Information Technology R&D Innovation Center of Peking University(北京大学信息技术研发创新中心) Peng Cheng Laboratory, Shenzhen(鹏城实验室,深圳)

AI总结 提出一种结合局部与非局部特征的可变形维纳滤波器(DWF),通过监督训练和自适应融合实现高效环路滤波,在VVC标准上平均节省1.16%~2.67%的码率。

Comments This paper has been published in IEEE Transactions on Image Processing

Journal ref IEEE Transactions on Image Processing, vol. 31, pp. 7222-7236, 2022

详情
AI中文摘要

环路滤波器由于在混合视频编码框架中显著的降噪能力而受到越来越多的关注。然而,现有通用视频编码(VVC)中的环路滤波器主要利用图像局部相似性。尽管一些基于非局部的环路滤波器可以弥补这一不足,但非局部滤波器广泛使用的无监督参数估计方法限制了性能。鉴于此,我们提出了一种可变形维纳滤波器(DWF)。它结合了局部和非局部特性,并基于维纳滤波器理论监督地训练滤波器系数。在滤波过程中,首先为每个感兴趣样本导出局部相邻样本和非局部相似样本。然后,基于块级噪声和样本级特征将待滤波样本分类到特定组中。每组样本共享相同的滤波器系数。之后,根据分类结果自适应融合局部和非局部参考样本。最后,对每个待滤波样本进行带有异常值数据约束的滤波操作。此外,详细分析了所提出的DWF在不同参考样本导出方案下的性能。仿真结果表明,与VTM-11.0相比,所提方法在全内、随机访问和低延迟B配置下平均分别节省1.16%、1.92%和2.67%的码率。

英文摘要

In-loop filters have attracted increasing attention due to their remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly take advantage of local image similarity. Although some non-local in-loop filters can make up for this shortcoming, the unsupervised parameter estimation method widely used by non-local filters limits their performance. In view of this, we propose a Deformable Wiener Filter (DWF). It combines local and non-local characteristics and trains the filter coefficients in a supervised manner based on Wiener filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. Then the to-be-filtered samples are classified into specific groups based on patch-level noise and sample-level characteristics. Samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, the filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed in detail with different reference sample derivation schemes. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to VTM-11.0 for the All Intra, Random Access, and Low-Delay B configurations, respectively.

2606.01566 2026-06-02 cs.LG

RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning

RobustModelMaker: 将Bootstrap稳定性选择与防泄漏嵌套交叉验证相结合的科学机器学习

Amanda S Barnard

发表机构 * School of Computing, Australian National University(计算学院,澳大利亚国立大学)

AI总结 针对小到中等规模科学数据集,提出RobustModelMaker框架,通过结合bootstrap稳定性选择与严格嵌套交叉验证,在防止数据泄漏的同时提供稳定性测试的特征子集和性能估计,在预测得分和选择稳定性上优于多种替代方法。

Comments 19 pages, 2 figure plates, 8 tables

详情
AI中文摘要

小到中等规模的科学数据集使机器学习流程面临两种叠加压力。单次特征选择产生的特征集在训练数据微小扰动下会发生显著变化,而任何使用相同数据进行选择、调参和评估的程序都会产生乐观偏差的性能估计。这两种失效模式通常被视为可分离的,但在科学数据所处的场景中,它们相互影响:不稳定的选择会放大本已乐观的得分的方差,而针对其中一种的标准补救措施很少能解决另一种。RobustModelMaker是一个Python框架,它将bootstrap稳定性选择与严格的嵌套交叉验证相结合,在每个折叠内执行所有预处理和选择,并生成一个经过稳定性测试的特征子集以及一个防泄漏的性能估计。该框架支持二分类、多分类和回归中的九种算法。行为通过确定性测试套件进行验证,该套件涵盖单元测试、性能测试和可重复性检查,在三个真实科学数据集上,与三种替代选择器(ANOVA F检验、带交叉验证的递归特征消除和Boruta)在预测得分和选择稳定性的Jaccard度量上进行比较。RobustModelMaker在每个数据集上的得分与最佳替代选择器相当,并且在所有三种任务类型中,在联合得分-稳定性前沿上占据了一个任何替代方法都无法匹敌的位置。两个示例应用——来自PLCO试验的卵巢癌生物标志物发现和UCI超导数据上的临界温度回归——说明了该框架在实际中的使用方式,以及当稳定性被视为首要交付成果而非涌现属性时,哪些权衡变得可见。

英文摘要

Small-to-medium scientific datasets place machine learning pipelines under two compounding pressures. Single-run feature selection produces feature sets that change substantially under small perturbations of the training data, and any procedure that uses the same data for selection, tuning, and evaluation produces optimistically biased performance estimates. The two failure modes are routinely treated as separable, but in the regimes where scientific data live, they interact: an unstable selection inflates the variance of an already-optimistic score, and standard remedies for one rarely address the other. RobustModelMaker is a Python framework that couples bootstrap stability selection with strict nested cross-validation, performs all preprocessing and selection inside each fold, and produces a stability-tested feature subset together with a leakage-safe performance estimate. The framework supports nine algorithms across binary classification, multiclass classification, and regression. Behaviour is verified by a deterministic test suite spanning unit, performance, and reproducibility checks on three real scientific datasets comparing to three alternative selectors (ANOVA F-test, recursive feature elimination with cross-validation, and Boruta) on both predictive score and a Jaccard measure of selection stability. RobustModelMaker is competitive in score with the best alternative selector on each dataset, and occupies a position on the joint score-stability frontier that none of the alternatives match across all three task types. Two example applications, ovarian cancer biomarker discovery from the PLCO Trial and critical-temperature regression on the UCI Superconductivity Data, illustrate how the framework is used in practice and what trade-offs become visible when stability is treated as a first-class deliverable rather than an emergent property.
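
The stability side of this abstract (feature sets selected on bootstrap resamples, compared with a Jaccard measure) can be illustrated with a small sketch. This is not RobustModelMaker's API: the function names, the 0.6 selection-frequency threshold, and the toy selections are assumptions made for illustration, and the nested cross-validation part is not shown.

```python
from itertools import combinations
from typing import Iterable, List, Set


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections: List[Set[str]]) -> float:
    """Average Jaccard similarity over all pairs of selections;
    values near 1 mean the selector is stable across resamples."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def stable_features(selections: List[Set[str]], threshold: float = 0.6) -> List[str]:
    """Keep features chosen in at least `threshold` of the resamples
    (the threshold value is a hypothetical default)."""
    counts = {}
    for sel in selections:
        for feature in set(sel):
            counts[feature] = counts.get(feature, 0) + 1
    n = len(selections)
    return sorted(f for f, c in counts.items() if c / n >= threshold)


# Toy example: feature sets selected on four bootstrap resamples.
selections = [
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f1", "f2", "f4"},
    {"f1", "f3"},
]
```

In a leakage-safe setup, these selections would be recomputed inside each outer training fold, so that the held-out fold never influences which features are kept.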

2606.01565 2026-06-02 cs.RO cs.CV

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

层级语义增强导航:面向视觉语言导航的最优传输与图驱动推理

Xiang Fang, Wanlong Fang, Changshuo Wang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore(新加坡南洋理工大学交叉学科研究生项目) University College London(伦敦大学学院)

AI总结 提出层级语义增强导航框架,通过动态层级语义场景图、基于最优传输的拓扑规划器与图感知强化学习策略,解决连续环境中的视觉语言导航难题,实现最优性能。

Comments Published in NeurIPS 2025, address some typos

详情
AI中文摘要

连续环境中的视觉语言导航(VLN-CE)对自主智能体构成严峻挑战,要求无缝整合自然语言指令与视觉观察以在复杂3D室内空间导航。现有方法在长程任务中常因场景理解有限、规划效率低下及缺乏稳健决策框架而表现不佳。我们引入层级语义增强导航(HSAN)框架,这是一种开创性方法,通过三项协同创新重新定义VLN-CE。首先,HSAN构建动态层级语义场景图,利用视觉语言模型捕捉从物体到区域再到分区的多级环境表示,实现细粒度空间推理。其次,它采用基于最优传输的拓扑规划器,以Kantorovich对偶为基础,通过平衡语义相关性与空间可达性来选择长期目标,并具有理论最优性保证。第三,图感知强化学习策略确保精确的低层控制,在稳健避障的同时导航子目标。通过整合谱图理论、最优传输和先进的多模态学习,HSAN解决了先前工作中静态地图和启发式规划器的缺陷。在多个具有挑战性的VLN-CE数据集上的大量实验表明,HSAN实现了最先进的性能,在导航成功率和泛化到未见环境方面均有显著提升。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.

2606.01563 2026-06-02 cs.LG

MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

MomentKV:消除长上下文推理中KV缓存驱逐的方向差距

Yu Li, Binxu Li, Tian Lan

发表机构 * George Washington University(乔治·华盛顿大学) Princeton University(普林斯顿大学)

AI总结 针对长上下文推理中KV缓存驱逐导致输出退化的问题,提出MomentKV方法,通过维护驱逐令牌集的矩统计量(计数、键均值、值均值和值-键协方差)来识别与累积摘要对齐的令牌,并在推理时提供驱逐注意力输出的一阶近似,实现选择性驱逐与精确校正的相互增强。

详情
AI中文摘要

基于Transformer的语言模型中的自回归解码依赖于KV缓存,其内存占用随序列长度线性增长,成为长上下文推理的主要瓶颈。KV缓存驱逐通过保留固定大小的键值对子集并丢弃其余部分来解决这一问题。我们发现输出退化的一个主要来源并非驱逐令牌上的残余注意力质量(现有方法已最小化),而是保留令牌集与驱逐令牌集之间的方向不匹配。具体而言,实际中被驱逐的令牌通常与保留的令牌接近正交。因此,即使少量的驱逐质量也可能对最终的方向分布产生过大影响,并放大为显著的输出误差。这揭示了现有策略的根本局限性。为解决此问题,我们提出MomentKV,它在驱逐令牌集上维护紧凑的小规模矩统计量,包括计数、键均值、值均值和值-键协方差。在驱逐过程中,利用矩统计量识别已经与累积摘要良好对齐并被其捕获的令牌,保持驱逐集的几何规则性。在推理过程中,它们产生驱逐注意力输出的闭式一阶近似,在选择性驱逐与精确校正之间形成相互增强的循环。在LongBench和RULER上使用LLaMA-3.1-8B-Instruct和Qwen3-4B-Instruct进行的实验表明,MomentKV在每个缓存预算下均优于所有基线,在激进压缩下增益最大。

英文摘要

Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction addresses this by retaining a fixed-size subset of key-value pairs and discarding the rest. We identify that a primary source of output degradation is not the residual attention mass on evicted tokens, which existing methods already minimize, but a directional mismatch between the retained and evicted token sets. Specifically, the evicted tokens in practice are often near-orthogonal to the retained ones. Thus, even a small evicted mass can have an outsized impact on the resulting direction distribution and amplify into substantial output error. This reveals a fundamental limit in existing strategies. To address this, we propose MomentKV, which maintains compact moment statistics over the evicted token set, including a count, key mean, value mean, and value-key covariance. During eviction, the moment statistics are leveraged to identify tokens already well aligned with and captured by the accumulated summary, keeping the evicted set geometrically regular. During inference, they yield a closed-form first-order approximation of the evicted attention output, forming a mutually reinforcing loop between selective eviction and accurate correction. On LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines at every cache budget, with the largest gains under aggressive compression.
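
The derivation below sketches one way the four statistics named above (count, key mean, value mean, value-key covariance) can give a closed-form first-order approximation of the evicted attention terms. The notation and the expansion point are assumptions made for illustration; the paper's exact formulation may differ. Here $q$ is the query (with any $1/\sqrt{d}$ scaling absorbed into it), $E$ is the evicted set, and $n = |E|$.

```latex
% Assumed summary statistics over the evicted set E (n = |E|):
\begin{align}
\bar{k} &= \frac{1}{n}\sum_{i \in E} k_i, \qquad
\bar{v} = \frac{1}{n}\sum_{i \in E} v_i, \qquad
C = \frac{1}{n}\sum_{i \in E} (v_i - \bar{v})(k_i - \bar{k})^{\top} \\
% First-order expansion of each evicted logit around the key mean:
e^{q^{\top} k_i} &= e^{q^{\top}\bar{k}}\, e^{q^{\top}(k_i - \bar{k})}
  \approx e^{q^{\top}\bar{k}} \bigl(1 + q^{\top}(k_i - \bar{k})\bigr) \\
% Evicted part of the softmax denominator (the linear term sums to zero):
\sum_{i \in E} e^{q^{\top} k_i} &\approx n\, e^{q^{\top}\bar{k}} \\
% Evicted part of the softmax numerator:
\sum_{i \in E} e^{q^{\top} k_i}\, v_i &\approx n\, e^{q^{\top}\bar{k}} \bigl(\bar{v} + C\,q\bigr)
\end{align}
```

Under these assumptions, the two approximated sums are added to the exact numerator and denominator computed over the retained tokens. The expansion is accurate when the evicted keys lie close to their mean along the query direction, which is consistent with the abstract's emphasis on keeping the evicted set geometrically regular.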

2606.01560 2026-06-02 cs.LG cs.AI

GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks

GJDNet: 通过联合解缠学习实现鲁棒图神经网络对抗攻击

Canyixing Cui, Tao Wu, Xingping Xian, Xiao-Ke Xu, Mao Wang, Weina Niu

发表机构 * School of Computer Science and Technology, Chongqing University of Posts and Telecommunications(重庆邮电大学计算机科学与技术学院) School of Cyber Security and Information Law, Chongqing University of Posts and Telecommunications(重庆邮电大学网络安全与信息法学院) Computational Communication Research Center, Beijing Normal University(北京师范大学计算通信研究中心) School of Journalism and Communication, Beijing Normal University(北京师范大学新闻传播学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 提出GJDNet框架,通过联合解缠节点表示和决策空间,并采用球形决策边界,增强图神经网络在不同图同配性下的鲁棒性。

详情
AI中文摘要

图神经网络(GNN)易受对抗攻击,这类攻击通过在同配图中引入异配边、在异配图中引入同配边,从根本上反转连接模式。这种结构反转造成结构-特征不匹配,扰乱不同图类型上的邻域聚合。然而,我们发现现有防御措施存在局限性,它们要么在固定的同配性假设下将邻域视为整体,要么依赖无法应对扰动引起的表示偏移的标准softmax分类器。为进一步利用这一观察,我们采用鲁棒性视角,联合解缠节点表示和决策空间,在隔离扰动影响的同时强制实现分离良好的决策区域。基于此原则,我们提出图联合解缠网络(GJDNet),这是一个统一的框架,用于在不同图同配性机制下进行鲁棒节点分类。GJDNet在表示和决策两个层面增强鲁棒性:它采用特征驱动的软结构解缠,结合偏度感知的邻居过滤,抑制扰动引起的结构-特征不匹配;并引入球形决策边界(SDB),促进嵌入空间中的类内紧凑性和类间分离,从而在扰动下稳定决策边界。理论分析揭示了所提出的解缠表示和决策机制的有效性,而大量实验表明,GJDNet在不同连接模式的图上始终展现出强鲁棒性。

英文摘要

Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, which inherently invert connectivity patterns by introducing disassortative edges in assortative graphs and assortative edges in disassortative graphs. This structural inversion creates structure-feature mismatches that disrupt neighborhood aggregation across different graph types. However, we find that existing defenses are limited, as they either treat neighborhoods as monolithic under fixed assortativity assumptions or rely on standard softmax classifiers that fail to account for perturbation-induced representation shifts. To further exploit this observation, we adopt a robustness perspective that jointly disentangles node representations and decision spaces, isolating perturbation effects while enforcing well-separated decision regions. Based on this principle, we propose Graph Joint Disentanglement Network (GJDNet), a unified framework for robust node classification across diverse graph assortativity regimes. GJDNet enhances robustness at both representation and decision levels: it employs feature-driven soft structural disentanglement with skewness-aware neighbor filtering to suppress perturbation-induced structure-feature mismatches, and introduces a Spherical Decision Boundary (SDB) to promote intra-class compactness and inter-class separation in the embedding space, thereby stabilizing decision boundaries under perturbations. Theoretical analysis provides insights into the effectiveness of the proposed disentangled representation and decision mechanisms, while extensive experiments demonstrate that GJDNet consistently achieves strong robustness across graphs with different connectivity regimes.

2606.01558 2026-06-02 cs.CV

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

注意力引导的多模态大语言模型微调提升思维链推理能力

Sanchit Sinha, Guangzhi Xiong, Bohan Liu, Zhenghao He, Aidong Zhang

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 针对多模态大语言模型中思维链推理效果不佳的问题,提出注意力引导的微调目标Attentive-CoT,通过延迟答案承诺和维持视觉令牌访问来提升推理性能。

详情
AI中文摘要

思维链提示在多模态大语言模型中的有效性仍不确定:在多个视觉推理基准上,与直接提示相比,思维链提示常常降低性能。在本文中,我们对三个现代多模态大语言模型系列在不同模型规模下,针对需要逐步视觉证据的数据集进行了思维链行为的系统分析。我们的分析识别出两种反复出现的失败模式:过早的答案承诺和推理生成过程中有限的直接视觉令牌访问。我们进一步发现,标准的思维链式监督微调只能部分缓解这些问题,同时往往增加对文本先验的依赖并减少反事实视觉依赖。受这些发现的启发,我们提出了Attentive-CoT,一种注意力引导的微调目标,它鼓励思维链轨迹延迟答案承诺,同时维持持续的视觉令牌访问。Attentive-CoT可以插入任何思维链式监督微调训练中,无需架构更改。在六个多模态大语言模型上的三个视觉推理基准实验表明,Attentive-CoT相比标准微调提升了思维链性能。

英文摘要

The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.

2606.01557 2026-06-02 cs.LG eess.SP

Everywhere Learning: Artificial Intelligence with Pointwise Constraints

处处学习:具有逐点约束的人工智能

Ignacio Boero, Ignacio Hounie, Luiz Chamon, Alejandro Ribeiro

发表机构 * Department of Electrical and Systems Engineering, University of Pennsylvania(宾夕法尼亚大学电气与系统工程系) École polytechnique, Institut Polytechnique de Paris(巴黎理工学院)

AI总结 提出“处处学习”新范式,通过近似对偶理论分析泛化性能,并用稀疏L1惩罚控制泛化,在语言模型任务中验证其优势。

详情
AI中文摘要

处处学习是一种新范式,其中人工智能系统被训练以满足数据分布上概率为1的损失约束。这与训练人工智能系统最小化平均损失的标准范式形成对比。我们发展了一种近似对偶理论,以支持泛化分析,该分析建立了经验与统计处处学习问题解之间的接近性。我们的结果表明,对偶变量将数据分布重新加权到损失约束更难满足的点,并且泛化由数据分布质量集中与约束更难满足点上的质量集中之间的不匹配控制。我们进一步表明,我们可以通过约束松弛上的稀疏L1惩罚来控制泛化。我们通过语言模型任务中的智能体分类实验说明了处处学习的优点。

英文摘要

Everywhere learning is a new paradigm whereby Artificial Intelligence (AI) systems are trained to satisfy loss constraints with probability one over the data distribution. This is in contrast to the standard paradigm of training AI systems to minimize average losses. We develop an approximate duality theory to substantiate a generalization analysis that establishes the proximity between solutions of empirical and statistical everywhere learning problems. Our results show that dual variables reweigh the data distribution towards points in which loss constraints are more difficult to satisfy and that generalization is controlled by the mismatch between the concentration of mass of the data distribution and the concentration of mass on points where constraints are more difficult to satisfy. We further show that we can control generalization with a sparse L1 penalty on constraint relaxations. We illustrate the merits of everywhere learning with an experiment in agentic classification for language model tasks.

2606.01552 2026-06-02 cs.AI

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

RoleCDE:角色扮演代理中的角色-对齐权衡的基准测试与缓解

Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Zhouxing Wang, Zhiqiang Yin, Xun Liang

发表机构 * School of Information, Renmin University of China(中国人民大学信息学院)

AI总结 针对角色扮演代理在角色特定价值与对齐约束冲突时的决策问题,提出首个基准RoleCDE,通过认知困境场景评估角色-场景基础、价值冲突解决和决策倾向,发现“角色价值解耦”现象,并基于RoleCDE的微调有效缓解该问题。

Comments 23pages

详情
AI中文摘要

角色扮演代理(RPAs)被广泛用于引导大语言模型(LLMs)表现出角色一致的行为,然而现有基准主要评估表面保真度,对角色-对齐价值冲突下的决策提供有限洞察。为解决这一差距,我们引入RoleCDE,这是首个旨在评估RPAs在角色特定价值与对齐导向约束之间结构化冲突下的基准。RoleCDE将角色感知决策制定为认知困境场景,联合评估角色-场景基础、价值冲突解决和决策倾向。该基准大规模构建,涵盖约8000个多样化的角色档案和场景,以及近24000个困境实例,跨越三个难度级别和八个角色类别。对几个主流LLMs的评估揭示了一种“角色价值解耦”现象,即当两者冲突时,代理系统性地默认选择对齐和道德一致的决策,而非角色特定价值,即使在明确的角色条件下也是如此。这种行为在很大程度上不受困境难度影响,但在不同角色类别间差异显著。我们进一步表明,基于RoleCDE的微调通过改善价值权衡推理有效缓解了这种解耦,同时保持了通用角色扮演保真度和通用推理性能。代码可在 https://github.com/rabbitrose/RoleCDE 获取。

英文摘要

Role-playing agents (RPAs) are widely used to steer large language models (LLMs) toward role-consistent behavior, yet existing benchmarks mainly evaluate surface-level fidelity and offer limited insight into decision making under role-alignment value conflicts. To address this gap, we introduce RoleCDE, the first benchmark designed to evaluate RPAs under structured conflicts between role-specific values and alignment-oriented constraints. RoleCDE formulates role-aware decision making as cognitive dilemma scenarios, jointly evaluating role-scenario grounding, value conflict resolution, and decision tendencies. The benchmark is constructed at scale, covering approximately 8k diverse role profiles and scenarios and nearly 24k dilemma instances across three difficulty levels and eight role categories. Evaluation of several mainstream LLMs reveals a "Role Value Decoupling" phenomenon, where agents systematically default to alignment- and morality-consistent decisions rather than role-specific values when the two conflict, even under explicit role conditioning. This behavior is largely invariant to dilemma difficulty but varies substantially across role categories. We further show that RoleCDE-based fine-tuning effectively mitigates this decoupling by improving value trade-off reasoning, while preserving general role-playing fidelity and general reasoning performance. Code is available at: https://github.com/rabbitrose/RoleCDE.

2606.01549 2026-06-02 cs.CV

ForestMamba: Sparse Mamba with Geometry-guided Queries for 3D Forest Point Cloud Segmentation

ForestMamba: 基于几何引导查询的稀疏Mamba用于3D森林点云分割

Trung Thanh Nguyen, Tuan-Anh Vu, Duc Viet Le, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide, Teja Kattenborn

发表机构 * Nagoya University(名古屋大学) RIKEN Seika(日本理化学研究所Seika研究中心) University of California, Los Angeles(加州大学洛杉矶分校) University of Twente(特文特大学) Ritsumeikan University(立命馆大学)

AI总结 提出ForestMamba方法,通过稀疏编码器、几何引导查询初始化和Mamba查询解码器,实现高效且结构感知的森林点云分割,在七个森林区域上优于现有方法,推理速度提升3倍,GPU内存降低2.3倍。

详情
AI中文摘要

基于AI的地面和无人机LiDAR点云语义和实例分割正成为一种变革性方法,将森林的复杂3D结构转化为可操作的信息,用于森林监测和生物多样性评估。然而,由于数据量大、采样密度不规则、冠层结构复杂重叠以及地理变异性,森林LiDAR场景仍然极具挑战性。基于稀疏卷积或Transformer的现有方法取得了有希望的结果,但存在两个关键限制:注意力的二次复杂度难以扩展到大型森林场景,以及通用上下文建模未利用森林结构先验,限制了复杂区域中的树木分离。为了解决这些挑战,我们提出了ForestMamba,一种结构感知方法,将森林特定先验融入特征编码、查询生成和查询细化中,同时用线性时间状态空间建模替代二次注意力。首先,我们引入了一个具有垂直优先 slab 序列化的稀疏编码器,将稀疏体素组织成垂直连贯的序列,以实现高效的长程上下文建模。其次,我们提出了一种基于实时多尺度冠层高度模型(CHM)的几何引导查询初始化策略,其中冠层最大值提供了生态学上有意义的查询种子,并通过最远点采样(FPS)补充以覆盖林下树木。第三,我们设计了一个基于Mamba的查询解码器,将局部kNN体素聚合与空间双路径Mamba相结合,以线性计算复杂度进行查询细化。在七个森林区域上的大量实验表明,ForestMamba在分割任务中始终优于现有基线,同时实现比基于Transformer的方法快3倍的推理速度和低2.3倍的GPU内存。

英文摘要

AI-based semantic and instance segmentation of terrestrial and drone LiDAR point clouds is emerging as a transformative approach for converting the complex 3D structure of forests into actionable information for forest monitoring and biodiversity assessment. However, forest LiDAR scenes remain highly challenging due to their large data volumes, irregular sampling density, overlapping and complex canopy structure, and geographic variability. Existing methods based on sparse convolutions or Transformers achieve promising results, but suffer from two key limitations: the quadratic complexity of attention scales poorly to large forest scenes, and generic context modeling does not exploit forest structural priors, limiting tree separation in complex regions. To address these challenges, we propose ForestMamba, a structure-aware method that incorporates forest-specific priors into feature encoding, query generation, and query refinement, while replacing quadratic attention with linear-time state-space modeling. First, we introduce a sparse encoder with vertical-priority slab serialization that organizes sparse voxels into vertically coherent sequences for efficient long-range context modeling. Second, we propose a geometry-guided query initialization strategy based on an on-the-fly multi-scale Canopy Height Model (CHM), where canopy maxima provide ecologically meaningful query seeds, supplemented by Farthest Point Sampling (FPS) to cover understory trees. Third, we design a Mamba-based query decoder that combines local kNN voxel aggregation with a spatial dual-path Mamba for query refinement with linear computational complexity. Extensive experiments across seven forest regions demonstrate that ForestMamba consistently outperforms existing baselines in both segmentation tasks, while achieving 3 times faster inference and 2.3 times lower GPU memory than Transformer-based methods.
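
Farthest Point Sampling, which the abstract uses to supplement canopy-maxima seeds, is a standard greedy algorithm and can be sketched in a few lines. This is a generic textbook version under stated assumptions, not ForestMamba's implementation: the function name, the choice of starting index, and the toy points are hypothetical, and seeding from a Canopy Height Model is not shown.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def farthest_point_sampling(points: Sequence[Point], k: int, start: int = 0) -> List[int]:
    """Greedy FPS: repeatedly pick the point farthest from all points
    chosen so far. Returns the indices of the selected points."""
    if k <= 0 or not points:
        return []
    k = min(k, len(points))
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        # Update nearest-chosen distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


# Toy cloud: two tight clusters plus one isolated point.
cloud = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (10.0, 0.0, 0.0),
    (10.0, 1.0, 0.0),
    (5.0, 5.0, 0.0),
]
seeds = farthest_point_sampling(cloud, k=3)
```

Each iteration costs O(N), so selecting k seeds costs O(kN). FPS spreads seeds evenly in space, which is why it is a natural complement to seeds placed only at canopy maxima.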

2606.01545 2026-06-02 cs.RO

Hierarchical Object Representation for Spatial Robot Perception: Points, Meshes, and Superquadrics

用于空间机器人感知的层次化物体表示:点云、网格和超二次曲面

Ceng Zhang, Wan Su, Mohamed Samshad, Gregory S. Chirikjian, Rajat Talak

发表机构 * National University of Singapore (NUS)(新加坡国立大学) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出一种层次化物体表示方法,从原始传感器数据逐步抽象为稠密网格和超二次曲面,用于高保真重建、鲁棒重定位和高效碰撞检测,并在室内外场景中验证其有效性。

Comments 18 pages, 5 figures, 4 tables

详情
AI中文摘要

层次化3D场景图(3DSG)已成为一种可操作且可扩展的表示方法,用于融合度量、语义和拓扑信息的长期自主导航。然而,3DSG中物体的几何表示问题一直被忽视,大多数方法使用简化几何模型,如部分点云或3D边界框。本文提出一种层次化物体表示,可用于高保真物体级重建、基于物体的鲁棒重定位或地图对齐,以及密集杂乱环境中安全机器人导航规划的高效解析碰撞检测。该表示结构上分为四个不同层次,从原始传感器数据逐步抽象为稠密3D网格,再到解析基元(如超二次曲面),从而提供物体几何的稀疏解析表示。我们开发了一个流程,从机器人捕获的RGB-D图像流构建层次化物体表示,并在室内外真实开放集物体场景中验证其效果。在包括HOPE、ReplicaCAD、Kimera-Multi以及使用Unitree B2机器人收集的NUS校园数据集等多个数据集上的大量实验,验证了该流程在室内外环境中的有效性。我们展示了基于超二次曲面的地图对齐方法优于当前最先进的基于物体的地图对齐方法ROMAN。代码见https://github.com/perceptica-robotics/Hickory。

英文摘要

Hierarchical 3D Scene Graphs (3DSG) have emerged as an actionable and scalable representation for long-term autonomy, incorporating metric, semantic, and topological information in the scene. However, the question of geometric representation of objects in 3DSG has been overlooked, as most methods use simplified geometric models such as partial point clouds or 3D bounding boxes. In this work, we introduce a hierarchical object representation that can be leveraged for high-fidelity object-level reconstruction, object-based robust re-localization or map alignment, and efficient and analytical collision checking for safe robot navigation planning in dense and cluttered environments. The representation is structurally organized into four distinct layers, progressively abstracting the scene from raw sensor data to dense 3D meshes to analytical primitives such as superquadrics, which provide a sparse and analytical representation for object geometry. We develop a pipeline that builds the hierarchical object representation from an RGB-D image stream captured by a robot, and demonstrate it in real-world open-set object scenes in both indoor and outdoor environments. Extensive experiments across diverse datasets, including HOPE, ReplicaCAD, Kimera-Multi, and an NUS Campus Dataset collected using a Unitree B2 robot, validate our pipeline in both indoor and outdoor environments. We show that our superquadric-based map alignment method outperforms ROMAN, the current state-of-the-art object-based map alignment method. Our code can be found at https://github.com/perceptica-robotics/Hickory.
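
The analytical collision checking mentioned above relies on the fact that a superquadric has a closed-form inside-outside function. The sketch below uses the standard textbook form of that function in the object's canonical frame. It is an illustration under stated assumptions, not code from the linked repository: the function names are hypothetical and the pose transform into the canonical frame is left out.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def superquadric_inside_outside(point: Vec3, scale: Vec3, eps: Tuple[float, float]) -> float:
    """Standard superquadric inside-outside function F in the canonical frame:
    F < 1 inside, F == 1 on the surface, F > 1 outside.
    scale = (a1, a2, a3) are the semi-axes; eps = (e1, e2) control the shape."""
    x, y, z = point
    a1, a2, a3 = scale
    e1, e2 = eps
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)


def point_in_collision(point: Vec3, scale: Vec3, eps: Tuple[float, float]) -> bool:
    """A point collides with the primitive if it lies inside or on the surface."""
    return superquadric_inside_outside(point, scale, eps) <= 1.0


# The same query point tested against two shapes with unit semi-axes:
corner = (0.9, 0.9, 0.9)
unit = (1.0, 1.0, 1.0)
in_sphere = point_in_collision(corner, unit, (1.0, 1.0))  # eps = 1 gives an ellipsoid
in_box = point_in_collision(corner, unit, (0.1, 0.1))     # small eps gives a box-like shape
```

The point near the corner is outside the sphere but inside the box-like shape, which shows how two shape exponents let one primitive family cover both rounded and boxy objects with a single cheap test.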

2606.01544 2026-06-02 cs.LG

CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search

CRePE: 后训练剪枝中基于卷积感知的相对重要性及高效搜索

Cheonjun Park

发表机构 * Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出CRePE方法,通过引入二维局部邻域上下文和自适应系数改进相对重要性评分,结合PHO代理优化实现高效后训练剪枝,在多种模型和稀疏度下取得最优性能。

Comments 10 pages

详情
AI中文摘要

在实际部署大型语言模型(LLM)时,会带来大量的内存和计算成本。后训练剪枝(PTP)是一种通过移除权重来降低这些成本的有效方法,无需额外训练。在现有方法中,RIA引入了按行和与列和归一化的相对重要性分数,实现了最先进的精度。然而,RIA仅考虑一维十字形(行/列)方向信息,并对行和列贡献赋予相同权重。在本文中,我们提出CRePE,它将二维局部邻域上下文和自适应系数纳入相对重要性评分。CRePE在各种模型和稀疏度设置下始终优于现有的PTP方法。然而,通过基于困惑度(PPL)的爬山法确定最优自适应系数需要大量PPL评估和约11小时的搜索时间。为了解决这个问题,我们提出PHO(基于代理的超参数优化),它消除了重复PPL测量的需要,并将搜索时间减少到约20分钟。此外,PHO在一个模型上找到的最优超参数配置可以很好地迁移到其他模型,展现出强大的泛化能力。最后,我们验证了CRePE可以与现有技术(包括通道置换、非均匀稀疏分配和重新剪枝方法)正交结合。

英文摘要

Deploying Large Language Models (LLMs) in practice incurs substantial memory and computational costs. Post-training pruning (PTP) is an effective approach to reducing these costs by removing weights without additional training. Among existing methods, RIA introduces relative importance scores normalized by row and column sums, achieving state-of-the-art accuracy. However, RIA considers only 1D cross-shaped (row/column) directional information and assigns equal weight to row and column contributions. In this paper, we propose CRePE, which incorporates 2D local neighborhood context and adaptive coefficients into Relative Importance scoring. CRePE consistently outperforms existing PTP methods across diverse models and sparsity settings. However, identifying optimal adaptive coefficients via perplexity (PPL)-based hill climbing requires numerous PPL evaluations and approximately 11 hours of search time. To address this, we propose PHO (Proxy-based Hyperparameter Optimization), which eliminates the need for repeated PPL measurements and reduces the search time to approximately 20 minutes. Furthermore, the optimal hyperparameter configuration found by PHO on one model transfers well to other models, demonstrating strong generalization. Finally, we verify that CRePE can be orthogonally combined with existing techniques including Channel Permutation, non-uniform sparsity allocation, and re-pruning methods.
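
To make the idea of a score "normalized by row and column sums" concrete, here is a minimal sketch of that kind of scoring on a plain weight matrix. It covers only the row/column normalization named in the abstract and omits anything else the published methods use. The coefficients `alpha` and `beta` are hypothetical stand-ins for unequal row and column weighting; the abstract does not give CRePE's actual formula, and its 2D neighborhood term is not shown.

```python
def relative_importance(weights, alpha=1.0, beta=1.0):
    """Score each weight by its magnitude relative to its row and column sums.

    alpha and beta are illustrative coefficients (an assumption of this sketch);
    alpha == beta corresponds to weighting rows and columns equally.
    """
    abs_w = [[abs(w) for w in row] for row in weights]
    row_sums = [sum(row) for row in abs_w]
    col_sums = [sum(col) for col in zip(*abs_w)]
    scores = []
    for i, row in enumerate(abs_w):
        score_row = []
        for j, a in enumerate(row):
            # A zero weight gets score 0; this also avoids dividing by a zero sum.
            s = alpha * a / row_sums[i] + beta * a / col_sums[j] if a else 0.0
            score_row.append(s)
        scores.append(score_row)
    return scores


def prune_by_score(weights, sparsity, alpha=1.0, beta=1.0):
    """Zero out the lowest-scoring fraction of weights (unstructured pruning)."""
    scores = relative_importance(weights, alpha, beta)
    ranked = sorted(
        (scores[i][j], i, j)
        for i in range(len(weights))
        for j in range(len(weights[0]))
    )
    pruned = [list(row) for row in weights]
    for _, i, j in ranked[: int(len(ranked) * sparsity)]:
        pruned[i][j] = 0.0
    return pruned


if __name__ == "__main__":
    w = [[4.0, 1.0], [1.0, 1.0]]
    print(relative_importance(w))
    print(prune_by_score(w, sparsity=0.5))
```

In the toy matrix, three weights share the same magnitude, so plain magnitude pruning cannot rank them. The relative score separates them because the bottom-right weight dominates its own row and column, while the other two sit next to the large entry.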

2606.01543 2026-06-02 cs.CV

PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images

PathAR: 结构优先的多模态病理图像自回归合成

Yuan Zhang, Jiahao Xia, Junzhang Huang, Meng Wang, Feng Chen, Guanyu Yang, Huazhu Fu

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education(新一代人工智能技术及其交叉应用重点实验室(东南大学),教育部) Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore(创新与精准眼健康中心,新加坡国立大学 Yong Loo Lin 医学院) Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore(眼科学系,新加坡国立大学 Yong Loo Lin 医学院) Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University(生物统计学系,全球健康中心,公共卫生学院,南京医科大学) Institute of High-Performance Computing, Agency for Science, Technology and Research(高性能计算研究所,科技研究局)

AI总结 提出PathAR,一种结构优先的自回归合成框架,通过显式分解结构与外观并使用交错自回归Transformer,实现模态标签条件下的病理图像生成,改善结构一致性和模态保真度。

Comments 12 pages, 7 figures

详情
AI中文摘要

多模态病理学中的数据稀缺推动了统一生成模型的发展,这些模型在保持解剖学一致结构的同时合成模态特定的外观。尽管模态在外观统计上存在差异,但细胞拓扑和组织边界等形态结构在不同采集协议中基本保持不变。然而,现有方法通常将这些因素建模在同质的token流中,隐式地将结构与外观耦合,削弱了模态变化下的结构可控性。为解决这一问题,我们提出病理自回归建模(PathAR),一种结构优先的自回归合成框架,显式分解结构和外观,用于模态标签条件下的病理生成。PathAR采用双向量量化(Dual-VQ)分词器将样本分解为掩码引导的结构和外观token,以及一个具有非对称注意力可见性的交错自回归(IAR)Transformer,以强制执行结构到外观的依赖关系。PathAR在异质模态特定外观下稳定形态,并支持空间对齐的图像-掩码对生成。大量实验表明,PathAR在结构一致性和模态保真度上优于基线,保持样本多样性,支持数据稀缺情况下的下游分割,并展现出对更细粒度的模态内器官标签变化的可扩展性。

英文摘要

Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, implicitly coupling structure with appearance and weakening structural controllability under modality shifts. To address this, we propose pathology Autoregressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology generation. PathAR employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image-mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.

2606.01540 2026-06-02 cs.LG cs.AI

TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions

TN-SHAP-G:用于Shapley值和交互的图结构张量网络代理

Farzaneh Heidari, Guillaume Rabusseau

发表机构 * University of Washington(华盛顿大学) CNRS(法国国家科学研究中心)

AI总结 提出TN-SHAP-G框架,利用图结构输入通过张量网络代理高效计算Shapley值和高阶交互指数。

详情
AI中文摘要

Shapley值是一种广泛使用的工具,用于归因黑盒模型中输入变量的重要性和交互,但其计算涉及定义在指数级子集空间上的函数。我们提出TN-SHAP-G,一个利用图结构输入中的结构高效计算Shapley值和高阶交互指数的框架。给定一个预测器和一个固定的掩码方案,TN-SHAP-G学习一个紧凑的、与图对齐的多线性代理,该代理近似掩码输入行为,表示为拓扑结构反映输入图的张量网络。一旦从少量oracle查询中训练完成,该代理通过多线性扩展实现一阶和高阶Shapley指数的确定性恢复,无需额外模型查询或蒙特卡洛方差。分子基准实验表明,学习到的分解在小图上紧密匹配精确Shapley值,并能高效扩展到基于采样的方法不可行的更大图。

英文摘要

Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their computation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.
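
The cost problem mentioned at the start of the abstract is easy to see in code. The sketch below computes exact Shapley values by brute-force enumeration of subsets; it is the textbook definition, not TN-SHAP-G, and the toy value function `toy_value` is an assumption invented for this example. Each player requires 2^(n-1) evaluations of the value function, which is why exact computation is only feasible for small inputs.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every subset of the other players.

    value: function mapping a frozenset of players to a number.
    Cost: 2 ** (n - 1) marginal contributions per player.
    """
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Hypothetical game: additive payoffs plus a bonus when 0 and 1 cooperate."""
    base = {0: 1.0, 1: 2.0, 2: 3.0}
    bonus = 4.0 if {0, 1} <= coalition else 0.0
    return sum(base[p] for p in coalition) + bonus


if __name__ == "__main__":
    print(exact_shapley([0, 1, 2], toy_value))
```

In this toy game the interaction bonus is split evenly between players 0 and 1, and the attributions sum to the value of the full coalition, which is the efficiency property of Shapley values.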