2605.16080 2026-05-18 cs.CV (version update)

ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

Affiliations: School of Electronic and Computer Engineering, Peking University; School of Future Technology, South China University of Technology; School of Electronic and Information Engineering, South China University of Technology; Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University

AI summary: This paper proposes ReAlign, a framework that uses contrastive learning to distill high-quality LLM-generated reasoning texts into a lightweight AIGI detector, improving detection accuracy and generalization.

Comments: Accepted by CVPR 2026

Abstract

The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, treating them as a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning-text representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates a contrastive loss for image-text alignment and a classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
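
The joint optimization mentioned in the abstract (a contrastive loss for image-text alignment plus a classification loss) can be illustrated with a minimal sketch. The abstract does not give the exact losses, temperature, or weighting, so everything below is an assumption: an image-to-text InfoNCE term, a binary cross-entropy term, and a hypothetical weight `lam`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE: image i should match reasoning text i."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(image_embs)

def classification_loss(fake_probs, labels, eps=1e-7):
    """Binary cross-entropy; label 1 means 'AI-generated' (assumed convention)."""
    total = 0.0
    for p, y in zip(fake_probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(fake_probs)

def joint_loss(image_embs, text_embs, fake_probs, labels, lam=1.0):
    """Hypothetical combination: alignment term plus weighted classification term."""
    return contrastive_loss(image_embs, text_embs) + lam * classification_loss(fake_probs, labels)
```

In this sketch, image embeddings that line up with their own reasoning-text embeddings yield a lower contrastive term than mismatched pairs, which is the alignment pressure the abstract describes.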

2605.16079 2026-05-18 cs.CV cs.AI cs.HC (version update)

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

Affiliations: University of Science and Technology of China; Xiaohongshu Inc.; East China Normal University; Xi’an Jiaotong University

AI summary: VideoSeeker integrates agentic reasoning with instance-level video understanding tasks to improve video understanding accuracy. Experiments show an average improvement of 13.7% over baseline models on instance-level tasks, surpassing GPT-4o and Gemini-2.5-Pro.

Comments: Project page: https://gaotiexinqu.github.io/VideoSeeker/

Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

2605.16076 2026-05-18 cs.CV cs.AI (version update)

Deep Learning-Based End-to-End Plaque Counting and Virus Titration from Laboratory Plate Images

Eugenia Moris, Alicia Costábile, Sebastián Rey, Irene Ferreiro, Joaquín Hurtado, Lizandra Lissette Luciano, Matías Villagrán, Aisha Espino Vázquez, Jomari Ramos, Isadora Monteiro, María Victoria de Santiago, Pilar Moreno, Gonzalo Moratorio, José Ignacio Orlando

Affiliations: Arionkoder LLC; Laboratory of Experimental Virus Evolution, Pasteur Institute of Montevideo; Laboratory of Molecular Virology, Faculty of Sciences, University of the Republic; Center for Innovation in Epidemiological Surveillance, Pasteur Institute of Montevideo; Biochemistry Section, Faculty of Sciences, University of the Republic

AI summary: This paper proposes an end-to-end deep learning method that uses segmentation models to automatically count plaques and compute titers from laboratory plate images, improving the efficiency and accuracy of virus infectivity assays.

Abstract

Plaque assays remain the gold standard readout of virus infectivity; however, plaque counting from plate images is labor-intensive and prone to inter-operator variability. We present an end-to-end, computer-aided workflow for cytopathic effect-based virus titration directly from laboratory plaque assay images. The proposed approach combines two models derived from the Segment Anything Model (SAM): a SAM2-based well-segmentation module that localizes assay wells across heterogeneous imaging conditions, and a SAM-based plaque-segmentation model that detects and enumerates plaques within each well. The method was evaluated on a mixed dataset comprising private plaque assay images of Mayaro virus and Coxsackievirus B3, together with public Vaccinia virus images from the VACVPlaque dataset. The pipeline outputs per-well plaque counts, automatically computes plaque-forming units per milliliter (PFU/mL), and is integrated into a web-based platform that allows users to review results and organize experiments. On held-out plates (17 from MAYV/CVB3 and 22 from VACV), the workflow generalized across two plate formats (6-well and 12-well) and showed strong agreement with manual annotations (Pearson correlation coefficients of 0.92 for MAYV/CVB3 and 0.88 for VACV). Automated plaque counts were further compared with annotations from four independent experts, demonstrating high concordance. The proposed system will be open sourced and publicly released upon acceptance of this manuscript to enable reproducible, scalable, and audit-ready plaque assay analysis while substantially reducing manual annotation effort.
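
The step from per-well plaque counts to PFU/mL can be sketched as follows. The abstract does not state the formula the pipeline uses; this assumes the standard titer calculation (plaque count divided by the product of dilution and inoculum volume), and the function and parameter names are hypothetical.

```python
def pfu_per_ml(plaque_count, dilution, inoculum_ml):
    """Titer in PFU/mL from one well.

    plaque_count: plaques counted in the well
    dilution:     dilution of the plated sample, e.g. 1e-5 for a 10^-5 dilution
    inoculum_ml:  volume of diluted virus added to the well, in mL
    """
    if dilution <= 0 or inoculum_ml <= 0:
        raise ValueError("dilution and inoculum volume must be positive")
    return plaque_count / (dilution * inoculum_ml)

def mean_titer(plaque_counts, dilution, inoculum_ml):
    """Average titer over replicate wells plated at the same dilution."""
    titers = [pfu_per_ml(c, dilution, inoculum_ml) for c in plaque_counts]
    return sum(titers) / len(titers)

# 42 plaques at a 10^-5 dilution with 0.1 mL inoculum: about 4.2e7 PFU/mL
titer = pfu_per_ml(42, 1e-5, 0.1)
```

Automating the count feeds directly into this arithmetic, so any counting error scales linearly into the reported titer.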

2605.16003 2026-05-18 cs.CV (version update)

WorldVLN: An Autoregressive World Action Model for Aerial Vision-Language Navigation

Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

Affiliations: Tsinghua University; Shandong University; Manifold AI; Beijing Institute of Technology; Northeastern University

AI summary: WorldVLN proposes an autoregressive world action model that predicts latent world evolution and generates executable waypoint actions, improving aerial vision-language navigation performance over existing baselines.

Abstract

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.

2605.15961 2026-05-18 cs.CV (version update)

Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

Fabian Morelli, Arnas Uselis, Ankit Sonthalia, Seong Joon Oh

Affiliations: University of Tübingen; KAIST

AI summary: This paper proposes SAE-FT, which uses sparse autoencoders to constrain changes in visual representations, enabling robust and interpretable fine-tuning of CLIP models that improves downstream task performance while preserving robustness.

2605.15951 2026-05-18 cs.CV (version update)

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu

Affiliations: Department of Engineering Science, University of Oxford; Australian Institute for Machine Learning, Adelaide University; Technical University of Munich; University of Science and Technology of China; Cornell University

AI summary: This paper proposes a group-revision optimization method that improves learning on hard cases by generating revised candidate responses, and refines the reward and advantage to amplify the influence of high-quality revisions, outperforming existing GRPO-based methods.

Comments: 8 pages, 5 figures, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse rewards, often criterion-induced, yield minimal learning signal when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used both to refine the reward and to modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.
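
The sparse-signal problem described above can be made concrete with a small sketch. It uses the usual group-normalized advantage of GRPO and a hypothetical additive shaping term, `beta * (candidate_score - initial_score)`, standing in for the paper's consolidation process, whose exact form the abstract does not specify.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def shaped_rewards(rewards, candidate_scores, initial_score, beta=1.0):
    """Add each revision's improvement over the initial attempt (assumed shaping rule)."""
    return [r + beta * (s - initial_score) for r, s in zip(rewards, candidate_scores)]

# Hard case: every revised candidate misses the success criterion, so the
# response-level rewards are identical and the advantages vanish.
rewards = [0.0, 0.0, 0.0, 0.0]
flat = group_advantages(rewards)

# Hypothetical IoU of each revision versus an initial attempt with IoU 0.2.
shaped = shaped_rewards(rewards, [0.1, 0.3, 0.2, 0.4], initial_score=0.2)
informative = group_advantages(shaped)
```

With identical rewards every advantage is zero, whereas the shaped rewards rank the revisions by how much they improve on the initial attempt.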

2605.15942 2026-05-18 cs.CV cs.AI (version update)

Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

Chenhao Wang, Yingrui Ji, Yu Meng, Yao Zhu

Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhejiang University

AI summary: This paper proposes a decomposed vision-language alignment framework that factorizes text prompts into a concept token and multiple attribute tokens, improving generalization to unseen attribute-category combinations in fine-grained open-vocabulary segmentation.

Abstract

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
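
The log-space aggregation of per-token similarities can be sketched as below. The abstract does not give the scoring function, so this assumes similarities already mapped into (0, 1] and an optional, hypothetical per-token weight. Summing logs equals taking a product, which is what makes the match compositional: one poorly matched attribute pulls the whole score down.

```python
import math

def log_space_score(similarities, weights=None, eps=1e-8):
    """Weighted sum of log-similarities for one concept token plus its attribute tokens."""
    if weights is None:
        weights = [1.0] * len(similarities)
    # Clamp at eps so a similarity of exactly 0 does not produce -inf.
    return sum(w * math.log(max(s, eps)) for s, w in zip(similarities, weights))

# Concept "car" with attributes "red" and "convertible" (illustrative values).
all_match = log_space_score([0.9, 0.9, 0.9])
one_attribute_fails = log_space_score([0.9, 0.9, 0.1])
```

Working with sums of logs rather than the raw product also avoids underflow when many tokens are combined.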

2605.15923 2026-05-18 cs.CV (version update)

CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning

Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang

Affiliations: Department of Automation, University of Science and Technology of China; Institute for AI Industry Research, Tsinghua University; School of Electronic Information Engineering, Beihang University; National College for Excellent Engineers, Beihang University

AI summary: CLOVER uses a closed-loop value estimation and ranking framework to address the training-evaluation mismatch in end-to-end autonomous driving planning. Its lightweight generator-scorer architecture improves planner performance and ranks candidate trajectories more accurately.

Abstract

End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training-evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator-scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.
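
The two target-selection ideas named in the abstract, top-k and vector-Pareto, can be illustrated on a toy set of candidates. This is only a sketch: the sub-score meanings are hypothetical, higher is assumed to be better, and the plain mean used for top-k is a stand-in, not the PDMS or EPDMS aggregation.

```python
def dominates(a, b):
    """True if a is at least as good as b on every sub-score and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(sub_scores):
    """Indices of candidates that no other candidate dominates."""
    return [
        i for i, s in enumerate(sub_scores)
        if not any(dominates(t, s) for j, t in enumerate(sub_scores) if j != i)
    ]

def top_k(sub_scores, k):
    """Indices of the k candidates with the highest mean sub-score (assumed aggregate)."""
    order = sorted(
        range(len(sub_scores)),
        key=lambda i: sum(sub_scores[i]) / len(sub_scores[i]),
        reverse=True,
    )
    return order[:k]

# Each tuple: (safety-like score, progress-like score) for one candidate trajectory.
candidates = [(0.9, 0.3), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
front = pareto_front(candidates)
best_two = top_k(candidates, 2)
```

The Pareto front keeps the balanced candidate that a scalar top-2 cut drops, which shows how the two kinds of target can differ.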

2605.13169 2026-05-18 cs.CV cs.AI (version update)

PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, Xi Chen

Affiliations: Zhejiang University; University of California, San Diego; University of California, Irvine; The University of Hong Kong

AI summary: This paper proposes PanoWorld, which builds pano-native understanding to address the spatial perception limits of conventional multimodal large models, improves 3D spatial reasoning through a panoramic spatial cross-attention mechanism, and establishes the PanoSpace-Bench benchmark, validating the effectiveness of pano-native supervision.

Comments: Project page: https://wcpcp.github.io/PanoWorld

Abstract

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.
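
The spherical structure that an ERP image encodes can be shown with a short sketch of the pixel-to-direction mapping. The axis convention here is an assumption (longitude 0 at the image centre, y up, z forward); the paper may use a different one.

```python
import math

def erp_to_lonlat(u, v, width, height):
    """Map ERP pixel coordinates to (longitude, latitude) in radians."""
    lon = (u / width - 0.5) * 2.0 * math.pi   # -pi at the left edge, +pi at the right
    lat = (0.5 - v / height) * math.pi        # +pi/2 at the top, -pi/2 at the bottom
    return lon, lat

def lonlat_to_unit_vector(lon, lat):
    """Observer-centered unit direction (x right, y up, z forward)."""
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )

width, height = 1024, 512
forward = lonlat_to_unit_vector(*erp_to_lonlat(512, 256, width, height))
left_edge = lonlat_to_unit_vector(*erp_to_lonlat(0, 256, width, height))
right_edge = lonlat_to_unit_vector(*erp_to_lonlat(width, 256, width, height))
```

The left and right image edges map to the same direction behind the observer, which is the continuity that is lost when a panorama is treated as an ordinary flat image.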

2605.12309 2026-05-18 cs.CV (version update)

G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

Junxian Li, Kai Liu, Zizhong Ding, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

Affiliations: Shanghai Jiao Tong University; Huawei Technologies Ltd

AI summary: This paper proposes G$^2$TR, which uses signals from the generation branch to reduce the visual tokens of multimodal models, improving efficiency while preserving performance. Experiments show strong results on image understanding and editing tasks.

Abstract

The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on signals such as attention scores and text-image similarity, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with VAE latents, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks. Code is at: https://github.com/lijunxian111/G2TR.
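
The select-then-merge step can be sketched in a few lines. This is not the G$^2$TR implementation: importance scores are taken as given (the paper derives them from consistency with VAE latents), plain top-k replaces the paper's balanced selection, and merging by averaging into the most similar kept token is an assumed rule.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def reduce_tokens(tokens, importance, keep):
    """Keep the `keep` most important tokens and fold the rest into them."""
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept = sorted(order[:keep])          # preserve the original token order
    groups = {i: [tokens[i]] for i in kept}
    for j in order[keep:]:
        # Assign each dropped token to its most similar retained representative.
        nearest = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[nearest].append(tokens[j])
    dim = len(tokens[0])
    return [
        [sum(member[d] for member in groups[i]) / len(groups[i]) for d in range(dim)]
        for i in kept
    ]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
importance = [0.9, 0.2, 0.8, 0.1]        # hypothetical scores
reduced = reduce_tokens(tokens, importance, keep=2)
```

Merging instead of discarding keeps some of the information carried by the removed tokens, which is the motivation the abstract gives for this step.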

2605.10867 2026-05-18 cs.CR cs.AI cs.CV cs.LG cs.NI (version update)

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal, Guramrit Singh, Gurjot Singh, Maninder Singh

AI summary: The BEACON dataset draws on high-precision motor skills and cognitive load to provide a rigorous stress test for the robustness of behavioral biometrics, supporting continuous authentication, behavioral modeling, and multimodal learning.

Abstract

Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. However, current benchmarks are often limited by small scale, unimodal sensing, or a lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high-precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift, and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.

2605.10100 2026-05-18 cs.CV cs.AI (version update)

HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation

Vinduja Thekkath, Ashish Musale, Ajay Waghumbare, Upasna Singh

AI summary: HYPERPOSE proposes a 3D human pose estimation framework that performs spatio-temporal reasoning in hyperbolic space. Its Hyperbolic Kinematic Phase-Space Attention mechanism preserves the tree structure of the human skeleton, improving geometric fidelity and temporal dynamics modeling.

Abstract

We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
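
For readers unfamiliar with the Lorentz model, the geodesic distance it uses can be sketched as below. This shows only the standard geometry for curvature -1 and is not HYPERPOSE's attention mechanism; the helper names are hypothetical.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the hyperboloid: x = (sqrt(1 + |v|^2), v)."""
    time = math.sqrt(1.0 + sum(c * c for c in v))
    return [time] + list(v)

def lorentz_inner(x, y):
    """Minkowski inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance d(x, y) = arccosh(-<x, y>_L)."""
    # Clamp to 1 so rounding error never pushes acosh outside its domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.5), 0.0])  # lies at geodesic distance 1.5
distance = lorentz_distance(origin, point)
```

The spatial coordinate of `point` is sinh(1.5), about 2.13, yet its hyperbolic distance from the origin is only 1.5. Space grows exponentially with radius, which is why tree-like structures such as a skeleton embed with low distortion.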

2605.09231 2026-05-18 cs.CV stat.ML (version update)

An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories

Arafat Rahman, Shashwat Kumar, Laura E. Barnes, Anuj Srivastava

Affiliations: Systems and Information Engineering, University of Virginia; Biomedical Engineering, Johns Hopkins University; Dept. of Applied Mathematics and Statistics, Johns Hopkins University

AI summary: This paper proposes ES-VAE, which learns a generative model of skeletal trajectories on Kendall's shape manifold using the transported square-root velocity field representation, effectively isolating shape dynamics and outperforming standard VAEs and sequence modeling baselines on gait analysis and action recognition.

Comments: 9 pages

Abstract

Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and texts. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors (such as camera orientation, subject scale, viewpoint, and execution speed) rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, and temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.
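
Part of the invariance described above can be shown with a sketch of the first step toward Kendall's shape space: centering a pose and scaling it to unit norm, which removes translation and global scale. Rotation alignment and the TSRVF construction are omitted, and the function name is hypothetical.

```python
import math

def to_preshape(landmarks):
    """Center a skeleton (list of joint coordinates) and scale it to unit norm."""
    n, dim = len(landmarks), len(landmarks[0])
    centroid = [sum(p[d] for p in landmarks) / n for d in range(dim)]
    centered = [[p[d] - centroid[d] for d in range(dim)] for p in landmarks]
    norm = math.sqrt(sum(c * c for p in centered for c in p))
    if norm == 0:
        raise ValueError("all joints coincide; shape is undefined")
    return [[c / norm for c in p] for p in centered]

pose = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
# Same pose seen at 3x scale and shifted by 5 units along every axis.
moved = [[3.0 * c + 5.0 for c in joint] for joint in pose]

a = to_preshape(pose)
b = to_preshape(moved)
```

Both inputs map to the same pre-shape, so a model trained on this representation has no capacity to spend on where the subject stands or how large they appear.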

2605.06475 2026-05-18 cs.AI cs.CV (version update)

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China; ARC Lab, Tencent PCG, Shenzhen, China; School of Artificial Intelligence, Beijing University of Technology, Beijing, China

AI summary: This paper proposes the Chain-of-Glimpse framework, which uses search-guided progressive reasoning to handle object variation in videos, improving the accuracy and interpretability of multi-step decisions.

Abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

2604.10210 2026-05-18 cs.CV cs.AI cs.LG (updated version)

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang

Affiliations: Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms; Henan University; Faculty of Computer Science and Control Engineering; Shenzhen University of Advanced Technology; Department of Electrical and Electronic Engineering

AI summary: This paper proposes A3-FPN, which enhances multi-scale feature representation through an asymptotically disentangled framework and content-aware attention modules, improving the recognition of small objects in dense prediction tasks.

Journal ref: Pattern Recognition, 2026, 113793

DOI: 10.1016/j.patcog.2026.113793

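Abstract (translated from the AI-generated Chinese version)

Learning multi-scale representations is a common strategy for handling object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects limit their ability to capture discriminative features and recognize small objects. This paper proposes the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), which enhances multi-scale feature representation through an asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally spread column network that enables asymptotically global feature interaction and disentangles each level from the representations of all other levels. In feature fusion, it collects supplementary content from adjacent levels to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens discriminative feature learning within each scale and reassembles redundant features based on the information content and spatial variation of the feature maps. Extensive experiments on MS COCO, VisDrone2019-DET, and Cityscapes show that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer architectures, yielding notable performance gains. In particular, when paired with OneFormer and a Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Code is available at https://github.com/mason-ching/A3-FPN.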

CG-MLLM: Captioning and Generating 3D Content via Multi-modal Large Language Models

Junming Huang, Chi Wang, Letian Li, Guangkai Xu, Donglin Huang, Hao Chen, Qiang Dai, Weiwei Xu

Affiliations: Zhejiang University, China

AI summary: This paper proposes CG-MLLM, a multimodal large language model capable of 3D captioning and high-resolution 3D generation. A Mixture-of-Transformer architecture separates the different modeling needs, and a pre-trained vision-language model is combined with a dedicated 3D VAE latent space to improve 3D generation quality and perception.

Comments: ICML 2026

Abstract

Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. Beyond generation, we further observe that learning to produce 3D content transfers back to perception, strengthening the model's image-based 3D understanding.

2601.12894 2026-05-18 cs.RO cs.CV (updated version)

Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning

Kangye Ji, Jianbo Zhou, Yuan Meng, Ye Li, Hanyun Cui, Zhi Wang

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Department of Computer Science and Technology, Tsinghua University; Department of Computer Science

AI summary: This paper proposes SAG, which achieves sparse action generation through adaptive pruning and reuse mechanisms, improving the efficiency of real-time visuomotor control. Experiments show a 4x speedup in generation.

Overlap-Aware Segmentation for Topological Reconstruction of Occluded Objects

J. Schueler, H. M. Araújo, S. N. Balashov, J. E. Borg, C. Brew, F. M. Brunbauer, C. Cazzaniga, A. Cottle, D. Edgeman, C. D. Frost, F. Garcia, D. Hunt, M. Kastriotou, P. Knights, H. Kraus, A. Lindote, M. Lisowska, D. Loomba, E. Lopez Asamar, P. A. Majewski, T. Marley, C. McCabe, L. Millins, R. Nandakumar, T. Neep, F. Neves, K. Nikolopoulos, E. Oliveri, A. Roy, T. J. Sumner, E. Tilly, W. Thompson, M. A. Vogiatzi

Affiliations: Department of Physics and Astronomy, University of New Mexico; Department of Physics, Blackett Laboratory, Imperial College London; Particle Physics Department, STFC Rutherford Appleton Laboratory; Luleå University of Technology; CERN; ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory; University College London (UCL), Department of Physics and Astronomy; Department of Physics, Keble Road, University of Oxford; Helsinki Institute of Physics, University of Helsinki; School of Physics and Astronomy, University of Birmingham; LIP – Laboratório de Instrumentação e Física Experimental de Partículas, University of Coimbra; Departamento de Fisica Teorica, Universidad Autonoma de Madrid; Department of Physics, King’s College London; University of Hamburg

AI summary: This paper proposes the OASIS framework, which uses a weighted loss function to prioritize overlapping regions, improving the reconstruction of pixel intensities and topological features of occluded objects. In the MIGDAL experiment, OASIS markedly improves the reconstruction of low-energy electron tracks.

DOI: 10.1088/2632-2153/ae6e18

Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos

Affiliations: Information Technologies Institute - Centre for Research and Technology Hellas

AI summary: This study examines the challenges of detecting AI-generated images in the wild, analyzes how design choices affect detection performance, and proposes optimizations that raise AUC by an average of 26.87%.

Comments: ACM International Workshop on Multimedia AI against Disinformation 2026 (MAD 2026)

DOI: 10.1145/3810988.3812665

Abstract

As generative Artificial Intelligence (AI) advances, the realism of AI-generated imagery has reached a threshold capable of deceiving even vigilant human observers. Yet, while current AI-generated Image Detection (AID) approaches perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world cases. To study this behavior, we introduce the ITW-SM dataset, a curated collection of real and AI-generated images originating from major social media platforms. We employ it to analyze the effects of key design choices typically considered when building a detector: its architecture, pre-trained latent spaces, training data, and pre-processing approaches. We show that naively scaling the pre-training stage or opting for more training data does not always lead to better detection performance. Instead, our work reveals that it is crucial to optimize each design choice so that the processing pipeline can propagate and effectively analyze both low-level traces and high-level image semantics. Building on our findings, we achieve a substantial average improvement of 26.87% in AUC across multiple state-of-the-art detection approaches under real-world conditions, providing a roadmap for developing more resilient detectors. Our assets are available at https://mever-team.github.io/itw-sm.

2506.16129 2026-05-18 cs.CV (updated version)

FM-G-CAM: A Holistic Approach for Explainable AI in Computer Vision

Ravidu Suien Rammuni Silva, Jordan J. Bird

Affiliations: Department of Computer Science, Nottingham Trent University

AI summary: This paper proposes FM-G-CAM, which considers multiple top-predicted classes to give a holistic explanation of a CNN's decisions, addressing the limitations of conventional Grad-CAM.

Abstract

Explainability is a vital aspect of modern AI for real-world impact and usability. The main objective of this paper is to emphasise the need to understand the predictions of Computer Vision models, specifically Convolutional Neural Network (CNN) models. Existing methods for explaining CNN predictions are largely based on Gradient-weighted Class Activation Maps (Grad-CAM) and focus solely on a single target class; this assumption about the target class selection neglects a large portion of the predictor CNN's prediction process. In this paper, we present an exhaustive methodology, called Fused Multi-class Gradient-weighted Class Activation Map (FM-G-CAM), that considers multiple top-predicted classes and provides a holistic explanation of the predictor CNN's rationale. We also provide a detailed mathematical and algorithmic description of our method. Furthermore, alongside a concise comparison of existing methods, we compare FM-G-CAM with Grad-CAM, quantitatively and qualitatively highlighting its benefits through real-world practical use cases. Finally, we present an open-source Python library with an FM-G-CAM implementation to conveniently generate saliency maps for CNN-based model predictions.
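To make the idea of fusing several class-specific maps concrete, here is a minimal sketch. `grad_cam` follows the standard Grad-CAM recipe (channel weights from global-average-pooled gradients, then ReLU). `fuse_top_classes` uses one plausible fusion rule, assumed here for illustration and not taken from the paper: each pixel keeps only the class whose map is strongest there. The function names and the nested-list layout are hypothetical, and this is not the authors' library.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map for one class.

    activations, gradients: lists of K channel maps, each an H x W nested list.
    """
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for act, grad in zip(activations, gradients):
        # Channel weight: global average pooling of the class gradient.
        weight = sum(map(sum, grad)) / (h * w)
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * act[i][j]
    return [[max(v, 0.0) for v in row] for row in cam]  # ReLU

def fuse_top_classes(cams):
    """Per pixel, keep only the class with the largest Grad-CAM value.

    Returns (winner, strength); winner is -1 where every map is zero.
    Assumed fusion rule, for illustration only.
    """
    h, w = len(cams[0]), len(cams[0][0])
    winner = [[-1] * w for _ in range(h)]
    strength = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for c, cam in enumerate(cams):
                if cam[i][j] > strength[i][j]:
                    winner[i][j], strength[i][j] = c, cam[i][j]
    return winner, strength
```

The `winner` map could then be rendered with one colour per class, so a single saliency image shows which of the top predictions each region supports.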

2605.15764 2026-05-18 cs.CV cs.AI (updated version)

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

Affiliations: University of Illinois Urbana-Champaign; Georgia Institute of Technology; Amazon AGI; Korea University

AI summary: GRASP connects high-level social QA with fine-grained gaze and deictic gesture events to improve social reasoning over multi-person non-verbal interactions. It contains 290K question-answer pairs and introduces a Social Grounding Reward that improves model performance.

Comments: Project page: https://social-reaoning.github.io/grasp/

Abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

2605.15760 2026-05-18 cs.CV (updated version)

Learn2Splat: Extending the Horizon of Learned 3DGS Optimization

Naama Pearl, Stefano Esposito, Haofei Xu, Amit Peleg, Patricia Gschossmann, Lorenzo Porzi, Peter Kontschieder, Gerard Pons-Moll, Andreas Geiger

Affiliations: University of Tübingen, Tübingen AI Center; ETH Zurich; Meta Reality Labs

AI summary: This paper proposes a learned optimizer that extends the optimization horizon through a meta-learning scheme, improving reconstruction quality and stability in both sparse-view and dense-view settings, with zero-shot generalization.

Abstract

3D Gaussian Splatting (3DGS) optimization is most commonly performed using standard optimizers (Adam, SGD). While stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that do not capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent works introduced learned optimizers that predict correlated updates informed by inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of optimization iterations and rely on manually scheduled learning rates to avoid degradation. In this paper, we introduce a learned optimizer for 3DGS that avoids degradation over extended optimization horizons without auxiliary mechanisms. To enable this, we propose a meta-learning scheme that extends the optimization horizon via a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information in its latent states. Results show improved early novel view synthesis quality while remaining stable over long horizons, with zero-shot generalization to unseen reconstruction settings. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers across sparse and dense view settings. Code and models will be released publicly. Our project page is available at https://naamapearl.github.io/learn2splat.

2605.15755 2026-05-18 cs.CV (updated version)

Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models

Cheng Zhang, Yuer Liu, Zhiyu Zhou, Hongxia Xie, Wen-Huang Cheng

Affiliations: Department of Computer Science and Technology, Jilin University; Department of Computer Science, National Taiwan University

AI summary: This paper proposes attribute-grounded selective reasoning for artwork emotion understanding with multimodal large language models, introducing an attribute-bottleneck-guided framework that improves emotion prediction accuracy and yields more concise explanations.

Abstract

Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/
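The Dice and Tversky metrics mentioned in this abstract compare a predicted set of salient attributes with a human-marked set. The sketch below uses the standard set-based definitions; the attribute names, the default weights, and the convention of returning 1.0 when both sets are empty are assumptions for illustration, not details taken from the paper.

```python
def tversky(predicted, reference, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha * FP + beta * FN) over two sets."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # attributes both sides marked salient
    fp = len(predicted - reference)   # predicted salient, not human-marked
    fn = len(reference - predicted)   # human-marked, but missed
    denom = tp + alpha * fp + beta * fn
    # Assumed convention: two empty sets count as perfect agreement.
    return 1.0 if denom == 0 else tp / denom

def dice(predicted, reference):
    """Dice coefficient, the Tversky index with alpha = beta = 0.5."""
    return tversky(predicted, reference, alpha=0.5, beta=0.5)

# Hypothetical attribute names, for illustration only.
predicted = {"color", "line", "texture", "shape"}
human = {"color", "line", "light"}
score_dice = dice(predicted, human)
score_tversky = tversky(predicted, human, alpha=0.3, beta=0.7)
```

Setting `beta` above `alpha` penalizes missed salient attributes more heavily than extra ones, which is one way such a metric can be tuned against attribute flooding or under-selection.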

2605.15753 2026-05-18 cs.RO cs.CV (updated version)

Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys, Francis Engelmann, Xiangyang Ji

Affiliations: Tsinghua University; ETH Zürich; MPI for Informatics; Dalian University of Technology; Microsoft; Stanford University; University of Lugano

AI summary: This paper proposes an open-vocabulary pipeline that combines 2D visual grounding with 3D graph optimization to address scene graph inference over small, dense, and similar instances, using temporal graph optimization and global hierarchy shaping to improve the generation of functional 3D scene graphs for indoor spaces.

Abstract

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture and lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges: small-scale, dense, and similar instances; a lack of visual anchoring in relational reasoning; instance confusion during cross-frame fusion; and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.

2605.15737 2026-05-18 cs.CV (updated version)

BARRIER: Bounded Activation Regions for Robust Information Erasure

Jan Miksa, Patryk Krukowski, Przemysław Spurek, Dawid Damian Rymarczyk, Marcin Sendera

Affiliations: Jagiellonian University; IDEAS Research Institute; National Research Institute

AI summary: BARRIER operates on the dynamic geometry of hidden-layer activations and uses interval arithmetic to protect neutral concepts, achieving stable information erasure while preserving the integrity of other representations.

Abstract

Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a paradigm-shifting framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding hypercube. By driving unlearning updates exclusively within this forget interval and mathematically bounding the model response on the complement, we ensure rigorous protection of the retain distribution. This geometric construction transforms the preservation of knowledge from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits highly aggressive unlearning updates within the forget region. Empirical evaluations demonstrate that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while safeguarding the integrity of all other representations. Our code is available at https://github.com/OneAndZero24/BARRIER.
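The two geometric ingredients named in this abstract, a bounding hypercube in a projected activation space and interval arithmetic, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released BARRIER code: the projection `basis` is passed in directly (the abstract derives it from an SVD), and all function names are hypothetical.

```python
def project(x, basis):
    """Coordinates of activation x along each basis direction."""
    return [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]

def bounding_box(activations, basis):
    """Axis-aligned box (lower, upper) enclosing the projected activations."""
    coords = [project(a, basis) for a in activations]
    dims = range(len(basis))
    lower = [min(c[k] for c in coords) for k in dims]
    upper = [max(c[k] for c in coords) for k in dims]
    return lower, upper

def in_box(x, basis, box):
    """True if x projects inside the box, i.e. into the forget region."""
    lower, upper = box
    return all(l <= c <= u for c, l, u in zip(project(x, basis), lower, upper))

def linear_interval(lower, upper, weights, bias):
    """Propagate a box through y = W x + b with interval arithmetic.

    A positive weight maps lower to lower; a negative weight swaps the ends.
    """
    lo, hi = [], []
    for row, b in zip(weights, bias):
        lo.append(b + sum(w * (l if w >= 0 else u)
                          for w, l, u in zip(row, lower, upper)))
        hi.append(b + sum(w * (u if w >= 0 else l)
                          for w, l, u in zip(row, lower, upper)))
    return lo, hi
```

The output interval is guaranteed to contain the layer's response for every input inside the box, which is the kind of bound that lets a method reason about a whole region at once instead of individual samples.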

2605.15736 2026-05-18 cs.CV cs.AI (updated version)

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

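Affiliations: Wenzhou University; Wenzhou Business College

AI summary: BiomedAP uses gated cross-modal fusion and a dual-anchor constraint mechanism to improve the robustness of medical vision-language models under prompt variation. Experiments show that it outperforms baseline methods on multiple benchmarks.

Comments: CVPR 2026 Workshop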

2605.15733 2026-05-18 cs.NE cs.AI cs.CV (updated version)

Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

Tianqiu Zhang, Muyang Lyu, Xiao Liu, Si Wu

Affiliations: Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Center of Quantitative Biology, School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University

AI summary: This paper proposes a brain-inspired hierarchical model that extracts latent transitions with an inverse model and builds a predictive visual world model, demonstrating the ability to extract abstract structure from continuous, high-dimensional dynamics and achieving structural generalization.

Comments: Project page: https://hpc-mec-worldmodel.github.io/

Abstract

Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

2605.15728 2026-05-18 cs.CV cs.AI (updated version)

DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation

Yifan Gao, Lu Zou, Zhangjin Huang, Guoping Wang

Affiliations: Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, Hubei, China; University of Science; Peking University, Beijing, China

AI summary: This paper proposes DecomPose, a framework that uses a data-driven difficulty proxy and an asymmetric branching strategy to disentangle cross-category optimization contention, improving category-level 6D pose estimation.

Abstract

Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on these diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.
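A common way to quantify the kind of gradient conflict this abstract describes is the cosine similarity between per-category gradients on a shared module, where a negative value means the two categories pull the parameters in opposing directions. The sketch below illustrates that generic diagnostic; it is an assumption about what such a measurement could look like, not the paper's actual procedure, and the category names and function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)

def contention(grads_by_category):
    """Pairwise gradient cosines and the fraction of conflicting pairs.

    grads_by_category maps a category name to its gradient on one shared
    module. A pair conflicts when its cosine similarity is negative.
    """
    names = sorted(grads_by_category)
    if len(names) < 2:
        raise ValueError("need at least two categories to compare")
    sims = {
        (a, b): cosine(grads_by_category[a], grads_by_category[b])
        for a, b in combinations(names, 2)
    }
    conflict_rate = sum(s < 0 for s in sims.values()) / len(sims)
    return sims, conflict_rate
```

Running this per module would show where conflicts concentrate, which is the information a method needs before deciding which parts of a network to split into category-group branches.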

2605.15725 2026-05-18 cs.CV cs.AI cs.RO (updated version)

DiLA: Disentangled Latent Action World Models

Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu

Affiliations: Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Peking University; Center of Quantitative Biology, Peking University; School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University

AI summary: DiLA uses content-structure disentanglement to resolve the trade-off between action abstraction and generation fidelity, enabling high-quality video generation and action transfer.

Comments: Project Page: http://disentangled-latent-action-world-models.github.io

Abstract

Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

2605.15723 2026-05-18 cs.LG cs.CV (updated version)

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

Xu Wang, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang

Affiliations: School of Airspace Science and Engineering, Shandong University; Department of Computer Science, Beijing Institute of Technology; School of Computer Science and Engineering, Sun Yat-sen University

AI summary: GOMA uses a unified design to address topological barriers, smoothing control, and information preservation in multimodal alignment, achieving the best retrieval performance on seven multimodal graph benchmarks while remaining stable.

Abstract

Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
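The graph-signal view in this abstract can be illustrated with a minimal NumPy sketch: embeddings are smoothed for a finite number of steps over a normalized adjacency, and each node reads out a weighted mix of its own smoothing trajectory. This is not GOMA's operator. The symmetric normalization, the mixing coefficient `alpha`, the function names, and the hand-set readout weights (which the paper learns adaptively) are all illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetrically normalize an adjacency matrix after removing self-loops."""
    adj = adj.astype(float).copy()
    np.fill_diagonal(adj, 0.0)  # drop diagonal self-pair edges
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5  # isolated nodes keep weight 0
    return inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def smoothing_trajectory(x: np.ndarray, adj: np.ndarray,
                         steps: int = 3, alpha: float = 0.5) -> np.ndarray:
    """Return the stack [x_0, ..., x_steps], x_{k+1} = (1-alpha) x_k + alpha A_hat x_k."""
    a_hat = normalized_adjacency(adj)
    traj = [x]
    for _ in range(steps):
        x = (1.0 - alpha) * x + alpha * (a_hat @ x)
        traj.append(x)
    return np.stack(traj)  # shape: (steps + 1, n_nodes, dim)

def readout(traj: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Mix each node's trajectory with node-specific weights of shape (n_nodes, steps + 1)."""
    return np.einsum("nk,knd->nd", weights, traj)
```

Putting all of a node's weight on step 0 returns the frozen embedding unchanged, while shifting weight toward later steps trades specificity for neighborhood context. That per-node choice of smoothing depth is the knob the abstract describes.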


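2605.15722 2026-05-18 cs.LG cs.AI cs.CV eess.SP (version update)

Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation

Jeonghwa Lim, Minje Park, Sunghoon Joo

Affiliations: VUNO Inc.

AI summary: This paper proposes CardioMix, a framework that improves ECG segmentation through a bidirectional CutMix strategy guided by cardiac patterns; experiments show it outperforms existing methods across multiple datasets and labeled ratios.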

Comments 11 pages, 6 figures, 6 tables

Abstract

Accurate electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix, used as a plug-and-play module with various SemiSeg algorithms, consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios.
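As a rough illustration of bidirectional mixing on 1D signals, the sketch below swaps one beat-aligned segment in both directions between a labeled and an unlabeled ECG. The abstract does not specify how cardiac patterns guide the cut. The precomputed `beat_bounds` indices, the use of pseudo-labels for the unlabeled mask, and all function names are assumptions for illustration only.

```python
import numpy as np

def cutmix_1d(sig_a, mask_a, sig_b, mask_b, start, end):
    """Paste samples [start, end) of signal B and its labels into a copy of signal A."""
    mixed_sig = sig_a.copy()
    mixed_mask = mask_a.copy()
    mixed_sig[start:end] = sig_b[start:end]
    mixed_mask[start:end] = mask_b[start:end]
    return mixed_sig, mixed_mask

def bidirectional_mix(labeled, unlabeled, beat_bounds, rng):
    """Mix one beat-aligned segment in both directions.

    labeled, unlabeled: (signal, mask) pairs; the unlabeled mask is a pseudo-label.
    beat_bounds: sorted sample indices of beat boundaries (assumed given).
    Returns (unlabeled-into-labeled, labeled-into-unlabeled) mixed pairs.
    """
    i = rng.integers(0, len(beat_bounds) - 1)  # pick one beat interval
    start, end = beat_bounds[i], beat_bounds[i + 1]
    u2l = cutmix_1d(*labeled, *unlabeled, start, end)
    l2u = cutmix_1d(*unlabeled, *labeled, start, end)
    return u2l, l2u
```

Cutting only at beat boundaries is one simple way to keep mixed signals physiologically plausible; an arbitrary cut point could splice half a QRS complex onto a T wave.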


2605.15720 2026-05-18 cs.CV cs.LG (version update)

Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment

Yuchen Li, Zhen Zhao, Yi Liu, Luping Zhou

Affiliations: The University of Sydney; Shanghai Artificial Intelligence Laboratory; Changzhou University

AI summary: This paper proposes Semi-MedRef, a framework that maintains consistency between medical images and positional language through three components; experiments show it outperforms other methods in low-label settings.

Abstract

Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.


2605.15711 2026-05-18 cs.CV (version update)

EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

Xuanyu Ge, Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

Affiliations: China University of Geosciences; University of the Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences

AI summary: This paper proposes EntropyScan, a lightweight, trigger-independent model-level backdoor detection method that identifies backdoored models by quantifying structural distortion in the visual attention distribution; experiments across two LVLM architectures and three advanced attack scenarios show a 98.5% F1 score and 96.6% AUC.

Comments 20 pages, 6 figures, 8 tables

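2605.15708 2026-05-18 cs.CV (version update)

3D Segmentation Using Viewpoint-Dependent Spatial Relationships

Ayaka Nanri, Klara Reichard, Mert Kiray, Federico Tombari, Benjamin Busam, Asako Kanezaki

AI summary: This paper presents a 3D referring segmentation dataset of 220K samples, expanded to tens of millions of samples through dense viewpoint sampling, studies how viewpoint-dependent spatial relationships affect large 3D models, and improves segmentation accuracy, raising mIoU to 0.47.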

2605.15707 2026-05-18 eess.IV cs.CV (version update)

Evaluation of Anatomical Shape Priors in Deep Learning-Based Cardiac Multi-Compartment Segmentation

Michael Hudler, Franz Thaler, Martin Urschler

Affiliations: Institute for Medical Informatics, Statistics and Documentation

AI summary: This paper evaluates lightweight explicit shape priors for cardiac multi-compartment CT segmentation, finding that a standard 3D U-Net remains a strong baseline, that handcrafted priors offer limited benefit, and that more expressive learned priors are needed in future work.

Comments Published in the Proceedings of the Third Austrian Symposium on AI, Robotics, and Vision (AIRoV 2026), pp. 23-27

2605.15689 2026-05-18 cs.CV (version update)

How to Choose Your Teacher for Fine Grained Image Recognition

Oswin Gosal, Edwin Arkel Rios, Augusto Christian Surya, Fernando Mikael, Bo-Cheng Lai, Min-Chun Hu

Affiliations: National Tsing Hua University, Taiwan; National Yang Ming Chiao Tung University, Taiwan

AI summary: This paper proposes the Ratio 1-2 metric, which improves teacher selection based on an analysis of experimental data and gives small models a 17% accuracy gain in fine-grained image recognition.

Comments Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 3 figures, 4 tables

2605.15684 2026-05-18 cs.CV (version update)

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen

Affiliations: Huawei Technologies

AI summary: This paper proposes ElasticDiT, which uses an elastic architecture and sparse attention to run diffusion transformers efficiently on mobile devices, balancing image quality against computational cost while reducing memory footprint.

Abstract

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on the fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16% average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction at only 1/8 the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.


2605.15682 2026-05-18 cs.CV (version update)

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

Qingji Dong, Hang Dong, Mingqin Chen, Rui Zhang, Yitong Wang

Affiliations: ByteDance Inc.

AI summary: DreamSR uses a dual-branch MM-ControlNet and a receptive-field enhancement strategy to address local over-generation and fine-detail synthesis in super-resolution, achieving high-quality detail restoration.

Abstract

Large-scale pre-trained diffusion models have been widely adopted for real-world image super-resolution (SR) because of the powerful generative priors they provide through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, caused by the misalignment between the global prompt derived from the LR image and the incomplete semantic information of local patches at each inference step. In addition, most existing methods fail to generate detailed textures in local patches because their network designs and training strategies overemphasize global generation capability. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features from patch-level prompts while the pre-trained DiT provides global textual features from global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, which strengthens the model's ability to capture patch information and restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.


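2605.15681 2026-05-18 cs.GR cs.CV (version update)

VLMs Trace Without Tracing: Diagnosing Failures in Visual Path Following

Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

Affiliations: Yonsei University

AI summary: This study examines VLMs on visual path-following tasks, finds that they readily switch paths when faced with locally similar distractors, and attributes these failures to local competition.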

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.


2605.15671 2026-05-18 eess.IV cs.CV (version update)

Degradation-Aware Blur-Segmentation of Brain Tumor

Yuchun Wang, Xiaosong Li, Gefei Liang, Yang Liu

Affiliations: School of Physics and Optoelectronic Engineering, Foshan University, China

AI summary: This paper proposes DABSeg, a network that performs deblurring and accurate segmentation jointly, improving the robustness and clinical utility of multimodal 3D brain tumor segmentation under degraded conditions.

Abstract

Multimodal 3D MRI brain tumor segmentation is a pivotal step in radiotherapy target delineation, surgical planning, and post-treatment assessment. Existing methods often assume artifact-free MRI images. However, inevitable patient motion during scanning introduces artifacts and blur that degrade boundary and texture features, leading to poor segmentation performance. To bridge this gap, we introduce Degradation-Aware Blur-Segmentation Net (DABSeg), a 3D multimodal MRI segmentation network with synchronous deblurring that unifies blur removal and accurate segmentation. Specifically, we propose a feature-domain motion-deblurring stem to compensate for blur and rebalance intensity. Concurrently, the backbone network embeds a blur-aware cross-modal cross-attention module and multi-scale residual aggregation to exploit modality complementarity. Notably, we optimize a joint loss that combines weighted Dice with a clear-reference reconstruction term, where imbalanced weights are applied to small targets to boost learning intensity and predictive stability for small lesions and border regions. Systematic comparisons and ablation experiments on the BraTS2020 dataset under both clear and degraded conditions consistently demonstrate that DABSeg surpasses state-of-the-art methods in tumor Dice score and boundary precision. These results validate the effectiveness of degradation-aware cross-task collaborative learning in improving the robustness and clinical utility of multimodal 3D brain tumor segmentation under realistic degraded conditions. The source code is available at https://github.com/YuchunWang24/DABSeg_ICPR.
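The joint objective described here (a class-weighted Dice term plus a reconstruction term against a clear reference) can be sketched in NumPy as follows. The abstract gives no formulas, so the L1 form of the reconstruction term, the trade-off weight `lam`, the example class weights, and the function names are assumptions rather than the authors' definitions.

```python
import numpy as np

def weighted_dice_loss(probs, target, class_weights, eps=1e-6):
    """Class-weighted soft Dice loss.

    probs, target: arrays of shape (C, ...) with per-class probabilities / one-hot labels.
    class_weights: length-C weights; larger values emphasize small structures.
    """
    c = probs.shape[0]
    p = probs.reshape(c, -1)
    t = target.reshape(c, -1)
    inter = (p * t).sum(axis=1)
    denom = p.sum(axis=1) + t.sum(axis=1)
    dice = (2.0 * inter + eps) / (denom + eps)  # per-class Dice in (0, 1]
    w = np.asarray(class_weights, dtype=float)
    return float((w * (1.0 - dice)).sum() / w.sum())

def joint_loss(probs, target, restored, clear_ref, class_weights, lam=0.1):
    """Weighted Dice plus an L1 reconstruction term (the L1 choice is an assumption)."""
    recon = float(np.mean(np.abs(restored - clear_ref)))
    return weighted_dice_loss(probs, target, class_weights) + lam * recon
```

Giving a small class a weight of 3 against 1 for a large class, for example, makes a missed small lesion cost three times as much in the Dice term as an equally poor result on the large region.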


2605.15666 2026-05-18 cs.CV (version update)

ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

Affiliations: Department of Electrical and Computer Engineering; Siebel School of Computing and Data Science; University of Illinois Urbana-Champaign

AI summary: This paper presents ChronoEarth-492K, a large-scale, temporally calibrated spatiotemporal hyperspectral dataset built from NASA's EO-1 Hyperion mission, which supports short- and long-horizon analysis and provides a unified evaluation platform to advance hyperspectral spatiotemporal representation learning.

Abstract

Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences (at least 3 observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.


2605.15661 2026-05-18 cs.CV cs.AI (version update)

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

Affiliations: Harvard AI and Robotics Lab; Harvard University; School of Computing and Data Science; The University of Hong Kong; Kempner Institute for the Study of Natural and Artificial Intelligence

AI summary: VAGS uses an adaptive guidance scale to improve structural fidelity and generation quality in image editing and generation, without fine-tuning or extra computation.

Abstract

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
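A small sketch may help picture a guidance scale that is multiplied by a bounded, alignment-dependent factor. The abstract does not give the actual formula, so the factor below, `1 + strength * t * cos`, is a stand-in chosen only because it is bounded and reduces to fixed CFG when `strength = 0`. The time convention (`t = 0` for pure noise, `t = 1` for the final sample) and all names are likewise assumptions.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two flattened velocity fields."""
    u, v = np.ravel(u), np.ravel(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def adaptive_scale(base_scale, t, v_a, v_b, strength=0.5):
    """Scale the nominal guidance by a factor bounded in [1 - strength, 1 + strength].

    t: assumed signal level in [0, 1] (0 = noise-dominated, 1 = final sample).
    v_a, v_b: the two velocity fields whose alignment is measured, for example
    unconditional and conditional velocities in the generation setting.
    """
    align = cosine_similarity(v_a, v_b)      # in [-1, 1]
    factor = 1.0 + strength * t * align      # hypothetical bounded factor
    return base_scale * factor

def guided_velocity(v_uncond, v_cond, scale):
    """Standard classifier-free guidance combination of two velocities."""
    return v_uncond + scale * (v_cond - v_uncond)
```

Because the factor only reuses velocities that the sampler already computes at each step, a rule of this shape adds no forward passes, which matches the training-free claim in the abstract.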


2605.15660 2026-05-18 cs.CV (version update)

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

Nisha Huang, Henglin Liu, Yizhou Lin, Kaer Huang, Chubin Chen, Jie Guo, Tong-Yee Lee, Xiu Li

Affiliations: Tsinghua University; PengCheng Laboratory; Lenovo Research; National Cheng-Kung University

AI summary: MaTe performs material transfer through a multimodal attention mechanism, without text guidance or auxiliary networks, improving generation quality and efficiency.

2605.15640 2026-05-18 cs.CV (version update)

Learning Disentangled Representations for Generalized Multi-view Clustering

Xin Zou, Ruimeng Liu, Chang Tang, Zhenglai Li, Xinwang Liu, Kunlun He, Wanqing Li

Affiliations: AI Thrust, The Hong Kong University of Science and Technology (Guangzhou); School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Engineering, Huazhong University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Computer, National University of Defense Technology; Medical Big Data Research Center, Medical Engineering Laboratory of Chinese PLA General Hospital; School of Computing and Information Technology, University of Wollongong

AI summary: This paper proposes GMAE, a framework that preserves multi-view complementarity through disentangled representation learning to improve clustering. Experiments show that it outperforms existing methods on both complete and incomplete multi-view clustering tasks.

Comments accepted by IEEE TPAMI 2026 (IEEE Transactions on Pattern Analysis and Machine Intelligence)

DOI: 10.1109/TPAMI.2026.3687339

Abstract

Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal clustering results. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: https://github.com/obananas/GMAE.


2605.15621 2026-05-18 cs.CV (version update)

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

Affiliations: Xiaohongshu; Harbin Institute of Technology; Fudan University

AI summary: This paper proposes LRCP, which uses low-rank compressibility to guide visual token pruning and reduce the inference cost of vision-language models, retaining 94.7% of image understanding performance with an 88.9% reduction in tokens.

Comments The paper includes 11 figures, multiple tables, comprehensive experimental results on 11 image understanding benchmarks and 3 video benchmarks, with extensive ablation studies and qualitative visualizations

AGC: Adaptive Geodesic Correction for Adversarial Robustness of Vision-Language Models

Zhiwei Li, Jiacheng Xue, Weining Wang, Ajian Liu, Xingyu Gao, Zhenan Sun, Qi Li

Affiliations: NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; School of Computer Science and Engineering, Central South University; University of Chinese Academy of Sciences

AI summary: This paper proposes AGC, a training-free defense that corrects input features with adaptive step sizes to improve the adversarial robustness of vision-language models; on eight fine-grained datasets it improves robust accuracy by 44.4% while reducing inference latency tenfold.

Abstract

A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to use the linear complexity and long-sequence modeling strengths of SSMs to capture global interactions between sparse and distant points, while Transformer modules with local attention encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, effectively alleviating the tension between receptive field enlargement and geometric preservation in long-range detection. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure in occluded and distant areas. Extensive experiments on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at https://github.com/QiuBingwen/3DTMDet.


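2605.15536 2026-05-18 cs.RO cs.AI cs.CV (version update)

RIDE: Retinex-Based Disentanglement for Revealing Concealed Objects

Chunming He, Rihan Zhang, Dingming Zhang, Chengyu Fang, Longxiang Tang, Jingjia Feng, Fengyang Xiao, Sina Farsiu

Affiliations: Duke University; Tsinghua University; Harvard University

AI summary: RIDE proposes a homogeneous (same-domain) image decomposition based on Retinex theory for concealed object segmentation, using the Discriminability Gap Theorem to improve foreground-background discriminability.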

Abstract

Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
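For readers unfamiliar with Retinex, the classical model writes an image as the pixel-wise product of illumination and reflectance in the same spatial domain. The sketch below is a hand-crafted approximation only: it estimates illumination as a box-blurred per-pixel channel maximum, which is a common heuristic and not RIDE's learned, task-driven decomposition. The function names, the kernel size `k` (assumed odd), and the `eps` floor are illustrative assumptions.

```python
import numpy as np

def box_blur(img2d, k=5):
    """Mean filter with edge padding; k is assumed odd."""
    pad = k // 2
    padded = np.pad(img2d, pad, mode="edge")
    h, w = img2d.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def retinex_decompose(img, k=5, eps=1e-4):
    """Split an (H, W, 3) float image into illumination (H, W) and reflectance (H, W, 3).

    The product illumination[..., None] * reflectance reproduces the input, so both
    components stay pixel-aligned with the original image.
    """
    illum = box_blur(img.max(axis=2), k)  # smooth brightness estimate
    illum = np.maximum(illum, eps)        # avoid division by zero
    refl = img / illum[..., None]
    return illum, refl
```

Both outputs live on the same pixel grid as the input, unlike Fourier or wavelet coefficients, which is the "homogeneous" property the abstract emphasizes.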


2605.15430 2026-05-18 cs.RO cs.CV (version update)

Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones

Alex Dunnett, Leonie Bottomley, Mirko Kovac, Basaran Bahadir Kocer

Affiliations: Department of Civil, Aerospace and Design Engineering, University of Bristol; Laboratory of Sustainability Robotics at Swiss Federal Laboratories for Materials Science and Technology (EMPA); Ecole Polytechnique Fédérale de Lausanne (EPFL)

AI summary: This paper proposes a vision-guided method for finding an ideal perch location on a tree, using image processing algorithms to assess the tree's shape and structure and selecting suitable branches based on branch width, slope, and curvature.

Comments Work in progress version accepted to the Recent Advances in Robotic Perception for Forestry

Abstract

This study demonstrates a method for locating an ideal perch site on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree where the branch diameters are sufficiently thick and where the available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.

2605.15424 2026-05-18 cs.CV (updated version)

Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models

Po-Chien Luan, Wuyang Li, Yang Gao, Alexandre Alahi

Affiliations: EPFL, Switzerland

AI summary: This paper proposes Social-Mamba, which treats social interactions as structured sequential processes and combines a Cycle Mamba block with social triplet factorization for efficient, accurate trajectory forecasting. Experiments show strong results on multiple benchmarks.

Abstract

Human trajectory forecasting is crucial for safe navigation in crowded environments, requiring models that balance accuracy with computational efficiency. Efficiently modeling social interactions is key to performance in dense crowds. Yet, most recent methods rely on attention mechanisms, which are effective at capturing complex dependencies, but incur quadratic computational costs that scale poorly with the growing number of neighbors. Recently, Selective State-Space Models have provided a linear-time alternative; however, their inherently sequential design is misaligned with the unstructured and dynamic nature of social interactions. To address this challenge, we propose Social-Mamba, a forecasting architecture that reformulates social interactions as structured sequential processes. At its core is the Cycle Mamba block, a novel module that enables continuous bidirectional information flow. Social-Mamba organizes agents on an egocentric grid and introduces social triplet factorization, which decomposes interactions into temporal, egocentric, and goal-centric scans. These are dynamically integrated through a learnable social gate and global scan to generate accurate and efficient trajectory predictions. Extensive experiments on five trajectory forecasting benchmarks show that Social-Mamba achieves state-of-the-art accuracy while offering superior parameter efficiency and computational scalability. Furthermore, embedding Social-Mamba into a flow-matching framework further enhances both accuracy and efficiency, establishing it as a flexible and robust foundation for future trajectory forecasting research. The code is publicly available: https://github.com/vita-epfl/Social-Mamba

2605.15423 2026-05-18 cs.CV cs.AI eess.IV (updated version)

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

Affiliations: Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy; Department of Electrical Engineering (ESAT), KU Leuven, Belgium; Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI, Switzerland

AI summary: This paper proposes MR2-ByteTrack, a video object detection method for embedded vision nodes that alternates between full- and low-resolution inference and combines ByteTrack with the Rescore algorithm to improve efficiency, enabling accurate real-time detection on embedded devices.

Abstract

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, infeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
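The abstract says the Rescore algorithm aggregates confidence scores across frames with probability union rules. A minimal sketch of that idea, assuming per-frame detections of one tracked object are treated as independent events, is shown below. The function name `union_rescore` is hypothetical, and the actual algorithm in the linked repository may weight or gate scores differently.

```python
def union_rescore(scores):
    """Combine per-frame confidences of one track as independent events.

    Uses P(A1 or ... or An) = 1 - prod(1 - p_i), the n-way form of
    P(A or B) = P(A) + P(B) - P(A) * P(B).
    """
    miss = 1.0  # probability that every frame "missed" the class
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        miss *= 1.0 - p
    return 1.0 - miss


print(union_rescore([0.5, 0.5]))  # prints 0.75
```

Under this rule the aggregate never falls below the largest single-frame score, so repeated weak evidence for the same class along a track accumulates rather than averaging out.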

2605.15421 2026-05-18 cs.CV (updated version)

U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration

Michael Smith, Frank P. Ferrie

Affiliations: Centre for Intelligent Machines, McGill University

AI summary: This paper systematically examines key questions at the intersection of uncertainty estimation and segmentation, analyzing how different variables affect segmentation performance, and finds that task difficulty and sample diversity play important roles.

Comments Accepted to CVPR Findings Track 2026

Abstract

In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.
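One of the questions above concerns ensemble-based uncertainty estimates. A minimal sketch of one common such estimate, the entropy of the ensemble-mean class distribution at a single pixel, follows. The function name and input layout are illustrative assumptions, and the study may rely on other uncertainty measures.

```python
import math


def predictive_entropy(member_probs):
    """Entropy (in nats) of the ensemble-mean class distribution for one pixel.

    member_probs: one probability vector per ensemble member,
    each of the same length and each summing to 1.
    """
    n = len(member_probs)
    k = len(member_probs[0])
    # Average the members' class probabilities.
    mean = [sum(m[c] for m in member_probs) / n for c in range(k)]
    # Zero-probability classes contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in mean if p > 0.0)


agree = predictive_entropy([[1.0, 0.0], [1.0, 0.0]])     # members agree
disagree = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])  # members disagree
```

A deterministic network corresponds to a single member, in which case the value reduces to the entropy of that network's softmax output.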

2605.15398 2026-05-18 cs.GR cs.CV (updated version)

Discretizing Group-Convolutional Neural Networks in Feature Space for Processing 3D Geometry

Daniel Franzen, Jean Philip Filling, Michael Wand

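Affiliations: Johannes Gutenberg University Mainz

AI summary: This paper proposes sampling in feature space, selecting representative samples by feature similarity, which decouples geometric resolution from memory and processing costs and offers a trade-off between computational effort and accuracy. Experiments show that coarse feature-space sampling preserves classification accuracy well and accelerates the training of equivariant 3D classifiers.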

Comments 11 pages, 7 figures, 2 tables

Abstract

Group-convolutional neural networks (GCNNs) are among the most important methods for introducing symmetry as an inductive bias in deep learning: In each linear layer, GCNNs sample a transformation group $G$ densely and correlate data and filters in different poses (with suitable anti-aliasing for steerable GCNNs) to maintain equivariance with respect to $G$. Unfortunately, applying filters to many data items resulting from this sampling is expensive (even for translations alone, i.e., in ordinary CNNs), and costs grow exponentially with increasing degrees of freedom (such as translations and rotations in 3D), which often hinders practical applications. In this paper, we propose sampling in feature space, i.e., replacing geometrically dense samples with representative samples selected by feature similarity. This decouples geometric resolution from memory and processing costs during training and inference, providing a novel way to trade off computational effort and accuracy. Our main empirical finding is that a coarse feature-space sampling already preserves classification accuracy remarkably well, which permits precomputation based on geometric similarity, accelerating the training of equivariant 3D classifiers substantially.

2605.15342 2026-05-18 cs.CV cs.LG (updated version)

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Arsha Nagrani, Jasper Uijilings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid

Affiliations: Google; DeepMind

AI summary: This paper introduces the Minerva-Ego benchmark, which evaluates egocentric video understanding models with multi-step multimodal questions and densely annotated spatiotemporal reasoning traces, and finds that hints about 'when' and 'where' to look substantially improve performance.

Abstract

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.

2605.15326 2026-05-18 cs.CV (updated version)

Multimodal Object Detection Under Sparse Forest-Canopy Occlusion

Nitik Jain, Mangal Kothari

Affiliations: Robotics & AI, Johns Hopkins University, USA; Department of Aerospace Engineering, IIT Kanpur; Senior Principal Flight Control Engineer, ADASI, EDGE Group, Abu Dhabi, UAE

AI summary: This paper proposes a multimodal pipeline that combines LiDAR, visible-thermal fusion, and synthetic-aperture imaging to make human detection under forest canopy more reliable, and reports the performance of a fine-tuned YOLOv5 detector on thermal and fused imagery.

Abstract

Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible--thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible--thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of $\sim$0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.

2605.15325 2026-05-18 cs.CV (updated version)

COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection

Darryl Cherian Jacob, Xinyu Liu, Kai Wang, Pan He

Affiliations: Auburn University; Tencent Hunyuan

AI summary: COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM, improving adaptability and generalization in video anomaly detection, and extends to tasks such as multiple-choice video question answering and dense annotation.

Comments Manuscript currently under review for publication

2605.15320 2026-05-18 cs.GR cs.CV cs.LG (updated version)

FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian, Jian Wang

Affiliations: Snap Inc.; University of California, Santa Cruz; MBZUAI

AI summary: FFAvatar fuses information from multiple source images with a multi-view Query-Former to reconstruct high-fidelity 3D Gaussian avatars, supporting real-time deployment and high-quality animation.

Comments Project Page: https://ffavatar.github.io

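2605.15312 2026-05-18 cs.CY cs.CV (updated version)

Beyond Performance Disparities: A Three-Level Audit of Representational Harm in CelebA

Sieun Park, Yuanmo He

AI summary: Through a three-level audit, this paper shows how gendered standards of age and beauty in the CelebA dataset are reproduced in the data and in models, and finds that representational harm subjects women to hyper-scrutiny while excluding older men.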

Comments 15 pages, 8 figures

Abstract

Large-scale facial datasets like CelebA are widely used in computer vision, yet the cultural biases embedded in their labels remain underexplored. Fairness research has distinguished representational from allocational harms, but audits of computer vision datasets have mostly examined categorical labels, leaving open how such harms appear in learned features and model attention. This paper examines CelebA at three levels: dataset structure, learned feature weights, and spatial attention, focusing on how gendered double standards of ageing and beauty are encoded in the data and reproduced in model behaviour. First, hierarchical clustering of 202,599 images shows that the 39 attributes organise into latent trait bundles aligned with cultural archetypes: performative femininity (youth, makeup, adornment) and professional masculinity (ageing, facial hair, formal attire). Female faces, though more often rated attractive overall, incur steep penalties when assigned to ageing or masculine-coded clusters. Second, XGBoost with SHAP analysis reveals gender-specific effects, such as adiposity reducing attractiveness only for females. Third, Grad-CAM finds that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older males drift toward peripheral cues such as hair and clothing. Older males attain the highest accuracy but the lowest average precision, indicating categorical exclusion of groups outside the dataset's evaluative templates. Cultural double standards thus pass from media representation into dataset labels, feature weights, and model attention, producing two representational harms: hyper-scrutiny of women under a narrow evaluative template, and exclusion of older men from the scheme entirely. Fairness metrics focused on performance disparities mask both, underscoring the need to address representational harm in fairness research.

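2605.15309 2026-05-18 cs.CV (updated version)

Zeqing Wang, Danze Chen, Zhaohu Xing, Zizhao Tong, Yinhan Zhang, Xingyi Yang, Yeying Jin

Affiliations: Tencent; National University of Singapore; The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology (Guangzhou); University of Chinese Academy of Sciences

AI summary: Current game world models mostly take the player's viewpoint and treat non-player characters (NPCs) as mere background pixels, which makes it hard to capture interactions between the player and NPCs. This paper proposes ReactiveGWM, a reactive game world model that simulates dynamic player-NPC interaction. By decoupling player control from NPC behavior and introducing lightweight bias injection and cross-attention modules, the model responds flexibly to high-level NPC strategies (such as attack and defense), requires no game-specific retraining, and supports zero-shot strategy transfer across games.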

Comments The code is available at https://inv-wzq.github.io/ReactiveGWM/

2605.15241 2026-05-18 eess.IV cs.CV cs.LG (updated version)

From Full and Partial Intraoral Scans to Crown Proposal: A Classification-Guided Restoration Assistance Pipeline

Rabin Kunwar, Dikshya Parajuli, Rujal Acharya, Romik Gosai, Prince Panta, Kundan Siwakoti, Shuvangi Adhikari, Saugat Kafley, Louis Digiorgio, Amit Regmi, Akio Tanaka, Masahiko Inada, Yuriko Komagamine, Kennta Kashiwazaki, Manabu Kanazawa

Affiliations: Accelerated Komputing Pvt. Ltd.; University of Pittsburgh; Institute of Science Tokyo; Emium Co. Ltd.; GodelBlock Inc.; Carnegie Mellon University

AI summary: This study proposes an end-to-end crown proposal pipeline that generates a personalized initial crown design from full-arch or partial-arch intraoral scans, giving clinicians a starting point for subsequent adjustment. The method combines a classification-guided segmentation strategy with context-based retrieval and fitting, addressing low segmentation accuracy on partial scans and loss of detail in generated crowns. Experiments report strong results on several evaluation metrics, with high segmentation accuracy and practical value.

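2605.15093 2026-05-18 cs.CV (updated version)

CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites

Jess Jones, Leonardo Bertini, Kenneth Johnson, Erica Hendy, Tilo Burghardt

Affiliations: University of Bristol; University of Liverpool; Natural History Museum

AI summary: This study proposes CoralLite, a method for reconstructing the skeletal structure of individual corallites from micro-CT scans of coral skeletons. A hybrid V-Trans-UNet, pre-trained on weakly annotated data and fine-tuned on fully annotated slices, segments and models whole colony skeletons in 3D. The method shows good segmentation performance on the same colony and on a different biological specimen, and provides the first deep learning baseline and a complete dataset for micro-CT-based modelling of individual corallites.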

Comments 15 pages, 10 figures, 2 tables

Abstract

The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive $\textit{Porites}$ sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated $μ$CT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled $μ$CT virtual slabs of $\textit{Porites}$ sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from $μ$CT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 $μ$CT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.
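The segmentation results above are reported as mean Dice scores. For readers unfamiliar with the metric, a minimal sketch of the Dice coefficient on flat binary masks follows. The function name and the convention of returning 1.0 for two empty masks are assumptions; the paper's evaluation code may average over regions or slices differently.

```python
def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks are treated as a perfect match (an assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total


print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # prints 0.5
```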

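2605.15010 2026-05-18 cs.CV (updated version)

3D Skew-Normal Splatting

Xiangru Wu, Ke Fan, Yanwei Fu

Affiliations: Fudan University

AI summary: This paper proposes Skew-Normal Splatting (SNS), a method that improves the representational capacity of 3D Gaussian Splatting (3DGS) for real-time novel view synthesis. Using the Azzalini skew-normal distribution as its primitive, SNS models both symmetric and asymmetric structures and is more expressive at object boundaries and one-sided surfaces. SNS remains analytically tractable, and a decoupled parameterization with a block-wise optimization strategy improves training stability. Experiments show it outperforms standard Gaussian and other non-Gaussian kernels on several benchmarks.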
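The Azzalini skew-normal distribution named in the summary above has a closed-form density, which is part of why it stays analytically tractable. The sketch below evaluates the one-dimensional density only; the paper works with 3D primitives, so this is an illustration of the distribution and not the SNS kernel itself. Parameter names `loc`, `scale` and `alpha` are assumptions chosen for readability.

```python
import math


def skew_normal_pdf(x, loc=0.0, scale=1.0, alpha=0.0):
    """Azzalini skew-normal density: (2 / scale) * phi(z) * Phi(alpha * z).

    z = (x - loc) / scale, phi is the standard normal pdf and Phi its cdf.
    alpha = 0 recovers the ordinary Gaussian; alpha > 0 skews mass to the right.
    """
    z = (x - loc) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 / scale * phi * big_phi


left = skew_normal_pdf(-1.0, alpha=3.0)
right = skew_normal_pdf(1.0, alpha=3.0)  # larger than `left` for alpha > 0
```

A single shape parameter is enough to make the density one-sided, which matches the summary's point about object boundaries and one-sided surfaces.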

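2605.14876 2026-05-18 cs.CV cs.AI (updated version)

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

Affiliations: University of Science and Technology of China

AI summary: Although text-to-image models have advanced rapidly, most rely on a single-step generation paradigm, struggle with complex semantic content, and gain little from parameter scaling. To address the hallucination, unstable optimization, and inference latency of multi-step reasoning methods, this paper proposes CLVR, a closed-loop visual reasoning framework that tightly couples vision-language logical planning with pixel-level diffusion generation and introduces proxy-prompt-based reinforcement learning and Δ-space weight merging. Experiments show it improves generation quality and inference efficiency, outperforming existing open-source models on several benchmarks and approaching commercial models.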

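2605.14716 2026-05-18 cs.GR cs.CV cs.LG (updated version)

AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Control

Pengcheng Fang, Tengjiao Sun, Dongjie Fu, Xiaoyu Zhan, Yanwen Guo, Hansung Kim, Xiaohao Cai

Affiliations: University of Southampton; Mogo AI Ltd.; Nanjing University

AI summary: AnchorRoute is a sparse-anchor framework for human motion synthesis that generates complete motion from a small number of user-specified root positions, planar trajectories, or body-point targets. During generation, anchors produce conditioning features that are injected into a pretrained diffusion model, preserving generation quality while learning sparse spatial control. After generation, anchor residuals define correction intervals, which are refined with soft token updates, so generation and optimization share one anchor-based framework. Experiments show AnchorRoute outperforms existing methods across control modes and produces motion that better satisfies the anchor constraints.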

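2605.14309 2026-05-18 cs.CV cs.AI cs.LG (updated version)

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu

Affiliations: Fujian Normal University; Nanyang Technological University; University of New South Wales; Data61 CSIRO

AI summary: This paper proposes ICED, a concept-level machine unlearning method for vision-language models (VLMs) based on interpretable concept decomposition, addressing the difficulty of removing target knowledge at the image or instance level without affecting unrelated semantics. The method uses a multimodal large language model to build a task-relevant concept vocabulary and decomposes visual representations into sparse, non-negative combinations of semantic concepts, which allows target concepts in an image to be suppressed precisely while non-target semantics and cross-modal knowledge are retained. Experiments show that it forgets target knowledge more thoroughly and better preserves non-target information while maintaining model performance.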

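2605.13073 2026-05-18 cs.CV (updated version)

HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization

Yulei Kang, Tianze Zhu, Jian-Fang Hu, Jianhuang Lai, Wei-Shi Zheng

Affiliations: Sun Yat-sen University; Northeastern University; Guangdong Province Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China

AI summary: This paper addresses dynamic distractors and illumination-induced appearance inconsistency across views in in-the-wild 3D Gaussian Splatting (3DGS) reconstruction with a conflict-aware optimization framework. Mask generation guided by semantic consistency and a dual-view gradient harmonization strategy suppress unreliable supervision and reduce gradient conflicts between views, improving reconstruction quality and stability. Experiments report state-of-the-art rendering quality in complex real-world scenes.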

2605.09869 2026-05-18 cs.RO cs.CV (updated version)

ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

Haosen Wang, Zhenyang Li, Yinqiang Zhang, Zongqi He, Lutao Jiang, Kai Li, Yizhou Zhao, Liaoyuan Fan, Wenjian Hou, Tingbang Liang, Yibin Wen, Defeng Gu

Affiliations: Sun Yat-sen University; The University of Hong Kong; Hong Kong University of Science and Technology (Guangzhou); City University of Hong Kong; Carnegie Mellon University

AI summary: This paper studies action consistency in zero-shot object navigation, where an agent repeatedly reinterprets semantic information and fails to keep pursuing its target. The authors propose ConsistNav, a training-free zero-shot object navigation framework with three modules (a semantic executive controller, persistent candidate memory, and stability-aware action control) that improve sustained target pursuit and action consistency. Experiments show ConsistNav outperforms existing methods on several benchmark datasets, with higher success rate and path success rate.

Comments 13 pages, 5 figures

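2605.08245 2026-05-18 cs.CV cs.AI (updated version)

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Harshvardhan Saini, Samyak Jha, Yiming Tang, Dianbo Liu

Affiliations: Indian Institute of Technology Dhanbad; National University of Singapore

AI summary: This paper studies hallucination in vision-language models (VLMs) caused by over-alignment between the language and vision modalities. It traces the root cause to the decoder architecture, which pulls visual embeddings too far onto the text manifold, introducing linguistic statistical bias that masks fine-grained visual information. The authors provide the first quantitative analysis of this phenomenon and propose two complementary remedies, a training-free inference strategy and a bias-aware fine-tuning method, both of which remove language bias from visual representations. Experiments show the methods substantially reduce hallucination on several benchmarks and improve long-form generation quality.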

2605.07074 2026-05-18 cs.CV (updated version)

Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection

Zhiyuan Wang, Yanxiang Chen, Pengcheng Zhao, Yunfeng Diao, Xin Liao

Affiliations: Hefei University of Technology; Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education; School of Computer Science and Information Engineering, Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology); School of Computer Science, Nanjing Audit University; College of Cyber Science and Technology, Hunan University

AI summary: This paper studies the detection of AI-generated images produced by diverse, unseen architectures and observes that existing methods over-rely on generator-specific fingerprints and semantic content, which limits generalization. Identifying feature entanglement as the main cause, it proposes the Orthogonal Decomposition and Purification Network (ODP-Net), which structurally separates universal forgery traces, generator fingerprints, and semantic content, improving detection on unseen generative models.

Comments ~10 pages (IEEEtran two-column), 6 figures, 6 tables, 1 algorithm

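2605.01852 2026-05-18 cs.CV (updated version)

DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity

Lilika Makabe, Kohei Ashida, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita

Affiliations: Graduate School of Information Science and Technology, The University of Osaka

AI summary: This paper proposes DP-SfM, a method for multi-view 3D reconstruction from images captured by dual-pixel (DP) sensors that resolves scale ambiguity automatically, without reference objects or prior calibration. By combining depth maps with the defocus blur in dual-pixel images, it estimates absolute scale with a simple and effective linear method, then refines the result with intensity-based optimization that aligns the left and right images. Experiments show good results on diverse scenes captured with different cameras and lenses.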

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2604.17669 2026-05-18 cs.CV (updated version)

Low Light Image Enhancement Challenge at NTIRE 2026

George Ciubotariu, Sharif S M A, Abdur Rehman, Fayaz Ali Dharejo, Rizwan Ali Naqvi, Marcos V. Conde, Radu Timofte, Zhi Jin, Hongjun Wu, Wenjian Zhang, Chang Ye, Xunpeng Yi, Qinglong Yan, Yibing Zhang, Zaynab Ali, Saiprasad Meesiyawar, Varda I Pattanshetty, Varsha I Pattanshetty, Nikhil Akalwadi, Padmashree Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Hao Yang, Ruikun Zhang, Liyuan Pan, Furkan Kınlı, Donghun Ryou, Inju Ha, Junoh Kang, Bohyung Han, Wei Zhou, Yuval Haitman, Ariel Lapid, Reuven Peretz, Idit Diamant, Leilei Cao, Shuo Zhang, Praful Hambarde, Prateek Shaily, Jayant Kumar, Hardik Sharma, Aashish Negi, Sachin Chaudhary, Akshay Dudhane, Amit Shukla, MoHao Wu, Lin Wang, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi, Raul Balmez, Alexandru Brateanu, Ciprian Orhei, Cosmin Ancuti, Codruta O. Ancuti, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Kaifan Qiao, Bofei Chen, Jingyi Xu, Duo Zhang, Xin Deng, Mai Xu, Shengxi Li, Lai Jiang, Harini A, Ananya N, Lakshanya K, Ying Xu, Xinyi Zhu, Shijun Shi, Jiangning Zhang, Yong Liu, Kai Hu, Jing Xu, Xianfang Zeng, Jinao Song, Guangsheng Tang, Cheng Li, Yuqiang Yang, Ziyi Wang, Yan Chen, Long Bao, Heng Sun, Mohab Kishawy, Jun Chen, Wan-Chi Siu, Yihao Cheng, Hon Man Hammond Lee, Chun-Chuen Hui

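Affiliations: NTIRE 2026

AI summary: This paper reviews the NTIRE 2026 Low Light Image Enhancement Challenge, describing the solutions proposed by participants and the final results. The challenge seeks networks that improve the clarity and visual appeal of low-contrast, noisy images. A total of 22 teams submitted valid entries. The paper evaluates current state-of-the-art methods for (joint denoising and) low-light image enhancement, documents progress in the field, and analyzes results on a new dataset.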

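2604.16925 2026-05-18 cs.CV (updated version)

Rethinking Cross-Dose PET Denoising: Mitigating Averaging Effects via Residual Noise Learning

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

Affiliations: IWR, Heidelberg University; Silicon Austria Labs; College of Medicine and Biological Information Engineering, Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University; Department of Epidemiology & Global Health, Umeå University

AI summary: This paper studies cross-dose denoising of low-dose positron emission tomography (LDPET) images and notes that conventional models generalize poorly across dose levels, mainly because noise levels and statistics differ. The authors find that existing methods implicitly optimize an expectation over heterogeneous noise distributions during training, so the network learns an averaged cross-dose denoising mapping and cannot model dose-specific noise accurately. They propose a unified residual noise learning framework that estimates the noise directly from the low-dose image instead of predicting the full-dose image. Experiments on large multi-center datasets show it outperforms existing methods and substantially improves cross-dose denoising.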

2604.15221 2026-05-18 cs.RO cs.CV (updated version)

Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone

Affiliations: Department of Aeronautics and Astronautics, Stanford University; Chair of Imaging and Computer Vision, RWTH Aachen University; School of Artificial Intelligence, Shanghai Jiao Tong University; Department of Computer Engineering, Technical University of Munich

AI summary: This paper proposes a vision-based framework for human pose estimation and motion prediction that provides verifiable uncertainty guarantees for safe collaboration. The method combines noise uncertainty estimation with out-of-distribution detection to raise confidence in predictions, and introduces conformal prediction sets to make predictions highly reliable in real human-robot collaboration. Experiments on real human motion data and in real human-robot collaboration scenarios validate the approach.
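The summary above mentions conformal prediction sets as the source of the uncertainty guarantee. A minimal sketch of standard split conformal prediction is given below: it turns held-out prediction errors into a radius that covers a new error with probability at least 1 - alpha under exchangeability. The function name and the use of scalar errors (for example, joint position error in metres) are assumptions, and the paper's construction may differ.

```python
import math


def conformal_radius(calibration_errors, alpha=0.1):
    """Split conformal radius from held-out nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration error,
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # not enough calibration data for this coverage level
    return sorted(calibration_errors)[rank - 1]


radius = conformal_radius([3.0, 1.0, 2.0], alpha=0.5)
```

In a safety setting the radius would inflate the predicted human pose into a set that the robot must avoid, so a smaller alpha yields a larger, more conservative set.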

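2603.14764 2026-05-18 cs.CV cs.AI cs.LG Updated version

Topology-Preserving Polygon Augmentation for Segmentation in Structured Visual Domains

Sudip Laudari, Sang Hun Baek

Affiliations: Independent Researcher

AI summary: This paper studies image augmentation that preserves the topology of polygon annotations in structured visual domains such as architectural floor plan analysis. Because conventional geometric augmentation can split polygon regions and break semantic connectivity, the authors propose a lightweight topology-preserving augmentation strategy that repairs adjacency relations in index space without changing vertex order. Experiments show near-perfect cyclic adjacency preservation (CAP) under common geometric transformations and improved consistency of polygon-based segmentation annotations.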

Comments 10 pages, 6 figures

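2603.13864 2026-05-18 cs.CR cs.CV Updated version

Inevitable Encounters: Backdoor Attacks Involving Lossy Compression

Qian Li, Yunuo Chen, Yuntian Chen

Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology

AI summary: This paper studies how lossy compression, which is unavoidable in real-world data storage and transmission, weakens backdoor attacks. Because trigger information embedded in an image can be lost during compression, the authors propose two poisoning strategies designed for lossy compression that keep the trigger effectively recoverable after compression. Experiments show that both methods remain effective across a range of compression schemes, offering a new route to making backdoor attacks practical.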

2603.08063 2026-05-18 cs.CV Updated version

SkyLink: A Large Vision-Language Model Driven Re-ranking Framework for Cross-View UAV geolocalization

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao

Affiliations: Department of Data Science, City University of Hong Kong, Hong Kong; Information Systems, City University of Hong Kong, Hong Kong; College of Computer Science and Technology, Zhejiang University of Technology, Zhejiang

AI summary: SkyLink is a re-ranking framework for cross-view UAV geolocalization driven by a large vision-language model (LVLM), aimed at improving matching accuracy between UAV images and satellite images. It models visual-semantic relations across views for more effective cross-view matching and introduces a relation-aware loss to strengthen discriminative ability and training stability. Experiments show that SkyLink substantially improves the re-ranking performance of existing models on several benchmark datasets, especially in complex scenes.

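2603.07514 2026-05-18 cs.LG cs.AI cs.CV Updated version

A Unified View of Score-Based and Drifting Models

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

Affiliations: Sony AI; Sony Group Corporation; Stanford University; Georgia Tech

AI summary: This paper examines the intrinsic connection between drifting models and score-based generative models, showing that the drifting approach is essentially equivalent to a score-matching objective on smoothed distributions. With a Gaussian kernel, the mean-shift field corresponds exactly to the score difference between the data and model distributions, a result that follows from Tweedie's formula. For the Laplace kernel commonly used in practice, theory and experiments both indicate that the residual term is negligible in high dimensions, so drifting methods in practice approximate score-based generation. The work offers a unified view of generative models and identifies structural similarities and differences between drifting models and diffusion models in transport direction.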
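
The Gaussian-kernel claim in the summary can be checked numerically in one dimension. The sketch below is not the paper's code; the sample values, the bandwidth `sigma`, and all function names are hypothetical. It relies on the standard identity that, for a Gaussian kernel density estimate p_sigma, the mean-shift vector at x equals sigma^2 times the score of p_sigma at x. The difference of two mean-shift vectors (toward data samples and toward model samples) should therefore equal sigma^2 times the difference of the two smoothed scores.

```python
import math


def kernel_weights(x, samples, sigma):
    """Normalized Gaussian kernel weights of each sample as seen from x."""
    logits = [-((x - s) ** 2) / (2 * sigma ** 2) for s in samples]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_shift(x, samples, sigma):
    """Kernel-weighted mean of the samples minus the query point."""
    weights = kernel_weights(x, samples, sigma)
    return sum(w * s for w, s in zip(weights, samples)) - x


def log_density(x, samples, sigma):
    """Log of the Gaussian kernel density estimate at x."""
    logits = [-((x - s) ** 2) / (2 * sigma ** 2) for s in samples]
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - math.log(len(samples) * sigma * math.sqrt(2 * math.pi))


def score(x, samples, sigma, h=1e-5):
    """Central finite-difference estimate of d/dx log p_sigma(x)."""
    upper = log_density(x + h, samples, sigma)
    lower = log_density(x - h, samples, sigma)
    return (upper - lower) / (2 * h)


# Hypothetical 1-D samples standing in for the data and model distributions.
data_samples = [-1.2, -0.4, 0.3, 0.9, 1.7]
model_samples = [-0.5, 0.1, 0.2, 2.4]
sigma, x = 0.8, 0.6

drift = mean_shift(x, data_samples, sigma) - mean_shift(x, model_samples, sigma)
score_gap = score(x, data_samples, sigma) - score(x, model_samples, sigma)
```

Here `drift` and `sigma ** 2 * score_gap` agree up to finite-difference error. Note that the scores involved are those of the kernel-smoothed distributions, which matches the summary's statement about score matching on smoothed distributions.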

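2602.20630 2026-05-18 cs.CV Updated version

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu

Affiliations: School of Computer Science, Wuhan University; Xiaomi EV

AI summary: This paper reformulates keypoint detection as a sequential decision process and proposes TraqPoint, an end-to-end reinforcement learning framework that directly optimizes the long-term trackability of keypoints across image sequences. Its core contribution is a reward mechanism focused on track quality, which uses policy gradients to improve both the consistency and the distinctiveness of keypoints across views. Experiments show that TraqPoint clearly outperforms current state-of-the-art keypoint detection and description methods on sparse matching tasks.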

Comments Accepted by CVPR 2026 (Oral)

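2602.10687 2026-05-18 cs.CV cs.AI Updated version

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Wuhan University, Wuhan, China; Lab for Intelligence and visiON (LION)

AI summary: Existing forgery detection methods are mostly limited to unimodal or bimodal settings and struggle with real-world multimodal misinformation. This paper proposes OmniVL-Guard, a unified vision-language forgery detection and grounding framework based on balanced reinforcement learning, which targets bias in multimodal interaction and multi-task optimization. Its two core designs are self-evolving reasoning-path generation and policy optimization with adaptive reward scaling, which together improve detection and grounding performance and show strong zero-shot generalization on several datasets.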

Comments Accepted by ICML 2026

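2602.05414 2026-05-18 cs.CV Updated version

TSBOW -- Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions

Ngoc Doan-Minh Huynh, Duong Nguyen-Ngoc Tran, Long Hoang Pham, Tai Huu-Phuong Tran, Hyung-Joon Jeon, Huy-Hung Nguyen, Duong Khac Vu, Hyung-Min Jeon, Son Hong Phan, Quoc Pham-Nam Ho, Chi Dai Tran, Trinh Le Ba Khanh, Jae Wook Jeon

Affiliations: Automation Lab, Department of Electrical and Computer Engineering

AI summary: As global warming increases the frequency and intensity of extreme weather events, existing traffic surveillance datasets fall short for detecting occluded vehicles under complex weather conditions. This work presents the TSBOW dataset, which contains more than 32 hours of real urban traffic video covering varied weather conditions and occlusion scenarios, with more than 48,000 annotated bounding boxes, and aims to improve detection of traffic participants in adverse weather. TSBOW provides a resource for intelligent transportation systems research and supports the development of CCTV-based traffic monitoring.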

Comments This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence. 40(2026). 5239-5247

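2602.00841 2026-05-18 cs.CV Updated version

Beyond First-Order: Learning Riemannian Geometries for Invariant Visual Place Recognition

Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang

Affiliations: The Hong Kong University of Science and Technology, Hong Kong, China; University of Macau, Macau, China; University of Science and Technology Beijing, Beijing, China

AI summary: This paper studies how to build feature representations for visual place recognition (VPR) that are robust to drastic changes in environment and viewpoint. To address the loss of structural associations under extreme changes and the high adaptation cost of existing methods, it proposes RIA, an invariant aggregation framework based on Riemannian geometry that models second-order scene structure on the symmetric positive definite manifold, retaining invariant structural information while suppressing noise. Experiments show that RIA matches supervised methods without extensive supervised training and achieves state-of-the-art recognition accuracy in unstructured environments.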

Comments 14 pages, 5 figures

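2601.00678 2026-05-18 cs.CV Updated version

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson

Affiliations: University of Glasgow

AI summary: This paper proposes a method for generating dynamic video from a single image that produces high-quality, temporally consistent video along a given camera trajectory. The core idea is to build a dynamic 3D Gaussian scene representation and generate plausible object motion in a single forward pass, enabling fast camera-controlled video generation. The method performs well on several datasets, achieving leading video quality and inference efficiency.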

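2512.14671 2026-05-18 cs.CV Updated version

ART: Articulated Reconstruction Transformer

Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Geng, Jiajun Wu, Richard Newcombe, Jakob Engel, Zhao Dong

Affiliations: Reality Labs Research, Meta; Stanford University

AI summary: This paper proposes ART, a model that reconstructs complete 3D articulated objects from sparse multi-state RGB images without relying on a specific object category or a complex optimization procedure. ART treats an articulated object as a set of rigid parts and uses a transformer architecture to map images to learnable part slots, jointly decoding each part's 3D geometry, texture, and motion parameters to give physically interpretable, simulation-ready reconstructions. Experiments show that ART performs strongly on several benchmarks, clearly surpassing existing methods and setting a new state of the art.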

Comments Project Page: https://kyleleey.github.io/ART/

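2511.18719 2026-05-18 cs.CV Updated version

Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibin Huang, Chi Zhang, Xuelong Li

Affiliations: Southeast University; Institute of Artificial Intelligence (TeleAI), China Telecom; University of Science and Technology of China

AI summary: This paper proposes ViPO, a visual preference policy optimization method for better aligning visual generative models with human preferences. Unlike existing methods that rely on a single scalar reward, ViPO introduces a perceptual structuring module that converts feedback into structured pixel-level advantage maps, guiding the model more precisely toward the key regions of the visual content. The method performs well on both image and video generation, improving alignment with in-domain human-preference rewards and generalization to out-of-domain tasks, and it is lightweight, general, and easy to integrate into existing training pipelines.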

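2511.18127 2026-05-18 cs.CV Updated version

SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting

Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato

Affiliations: The University of Tokyo

AI summary: SFHand is a framework for language-guided, real-time 3D hand state forecasting, aimed at improving human-machine interaction in settings such as augmented reality and assistive robotics. From a continuous video stream and language instructions, it autoregressively predicts multiple future hand states, including hand type, 2D bounding box, 3D pose, and trajectory, and uses a region-of-interest-enhanced memory layer to capture temporal context and key hand regions. The work also introduces the EgoHaFL dataset. Experiments show that SFHand clearly outperforms existing methods on 3D hand forecasting and raises task success rates on downstream manipulation tasks.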

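2511.17426 2026-05-18 cs.LG cs.CV stat.ML Updated version

Self-Supervised Learning by Curvature Alignment

Benyamin Ghojogh, M. Hadi Sepanj, Paul Fieguth

Affiliations: Vision and Image Processing Group, Systems Design Engineering, University of Waterloo, Ontario, Canada

AI summary: This paper proposes CurvSSL, a self-supervised learning method based on curvature alignment, and its kernel-space extension, kernel CurvSSL, which aim to improve representation learning by explicitly modeling the local geometry of the data manifold. The method adds a curvature regularization term to a conventional non-contrastive framework: it computes the local curvature of the embedded features, then aligns and decorrelates it across augmented views to strengthen invariance and geometric consistency. Experiments on MNIST and CIFAR-10 show better linear evaluation performance than existing methods.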

Comments A shorter version of this paper has been published in: Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, Special Issue: Proceedings of CVIS 2025

Journal ref Shorter version of this paper is published in Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, Special Issue: Proceedings of CVIS 2025

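2511.03260 2026-05-18 cs.CV Updated version

Enhancing Medical Image Segmentation via Heat Conduction Equation

Rong Wu, Yim-Sang Yu

Affiliations: Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA

AI summary: To address the difficulty of efficient global context modeling and long-range dependency reasoning in medical image segmentation under limited compute, this paper proposes a hybrid architecture that combines a U-Mamba structure with the heat conduction equation. A heat conduction operator at the bottleneck layer simulates heat diffusion in the frequency domain to improve semantic abstraction. Experiments on an abdominal CT dataset report a Dice coefficient of 0.8719, supporting the method's effectiveness for medical image segmentation.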

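2510.22665 2026-05-18 cs.CV cs.AI Updated version

SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li

Affiliations: School of Artificial Intelligence and Robotics, Hunan University; Yuelushan Center for Industrial Innovation; School of Medical Information Engineering, Jining Medical University

AI summary: This paper presents SARVLM, the first vision-language foundation model designed specifically for synthetic aperture radar (SAR) imagery, aimed at improving semantic understanding of SAR images. To address the scarcity of multimodal SAR data and weak cross-modal representations, the authors build SARVLM-1M, a million-scale dataset of image-text pairs, and design a two-stage domain transfer training strategy that uses optical remote sensing data as a bridge to improve performance in the SAR domain. Experiments show that SARVLM outperforms existing models on several benchmark tasks and advances semantic understanding of SAR imagery.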

Comments 13 pages, 13 figures

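2510.02307 2026-05-18 cs.CV cs.AI Updated version

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

Affiliations: Rice University

AI summary: Text-to-image diffusion models often degrade when generating images at resolutions outside their training settings. Targeting low-resolution generation, this paper proposes NoiseShift, a training-free noise recalibration method that adjusts the denoiser's noise-conditioning index to restore consistency between the forward and reverse processes, reducing the mismatch between training and testing. Experiments show that NoiseShift substantially improves low-resolution image quality on several mainstream diffusion models, is simple to implement, and adds minimal inference overhead.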

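2509.24798 2026-05-18 cs.CV cs.AI Updated version

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

Affiliations: Centre for AI, DS&AI, AstraZeneca, UK; Institute for Imaging, Data and Communications (IDCOM), School of Engineering, University of Edinburgh, Edinburgh, UK; Shanghai Artificial Intelligence Laboratory

AI summary: This paper proposes Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation. The method applies causal interventions to target attributes and propagates their effects consistently to causally dependent attributes while preserving the core identity of the image. Unlike approaches that rely on prompt engineering, Causal-Adapter incorporates a structural causal model and an attribute regularization strategy, giving more accurate semantic control and high-fidelity generation, with strong results on several datasets.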

Comments Project Page: https://leitong02.github.io/causaladapter/

Journal ref ICML 2026

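2509.16223 2026-05-18 eess.SP cs.CV Updated version

mRadNet: A Compact Radar Object Detector with MetaFormer

Huaiyu Chen, Fahed Hassanat, Robert Laganiere, Martin Bouchard

Affiliations: School of Electrical Engineering and Computer Science, University of Ottawa, Canada; Sensor Cortek Inc., Canada

AI summary: This paper proposes mRadNet, a compact radar object detection model designed to meet the lightweight and efficiency requirements of in-vehicle embedded systems. The model is based on a U-Net structure with MetaFormer blocks, uses separable convolution and attention to extract local and global features, and adopts more efficient feature embedding and fusion strategies to further reduce computational complexity. Experiments show that mRadNet achieves better detection performance than existing methods on the CRUW dataset with the fewest parameters and the lowest computational cost.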

Comments 5 pages, 2 figures, to appear in Proc. of 34th European Signal Processing Conference (EUSIPCO 2026), Bruges, Belgium, Aug. 31 - Sept. 4, 2026. Code available at https://github.com/huaiyu-chen/mRadNet

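2509.05030 2026-05-18 cs.CV Updated version

LUIVITON: Learned Universal Interoperable VIrtual Try-ON

Cong Cao, Xianhang Cheng, Jingyuan Liu, Yujian Zheng, Zhenhui Lin, Ren Li, Meriem Chkir, Hao Li

Affiliations: The University of Tokyo

AI summary: This paper proposes LUIVITON, a fully automatic virtual try-on system that addresses the inconsistent skeletal structures, templates, and dense correspondences between real-world garments and body models, so that complex multi-layer clothing can be draped automatically on humanoid characters of varied pose and shape. Using SMPL as an intermediate proxy, the method decomposes the garment-to-body mapping into two correspondence tasks, handled respectively by a geometry-driven model and by diffusion-based matching of multi-view appearance features, and produces physically plausible draping on the target character. The system handles complex garment topologies, applies to a wide range of humanoid characters, is computationally efficient, and requires no manual intervention.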

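2508.17034 2026-05-18 cs.RO cs.CV Updated version

DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration

Jiayi Li, Yuxin Yao, Qiuhang Lu, Juyong Zhang

Affiliations: University of Science and Technology of China; City University of Hong Kong; University of Chinese Academy of Sciences

AI summary: To address noisy data, partial overlap, and real-time processing in rigid registration, this paper proposes DualReg, a method that combines dual-space filtering with reinforcement. It draws on the strengths of both feature-based matching and local-geometry-based matching: an efficient filtering mechanism removes unreliable feature correspondences, and geometric proxies are used to build the objective function for estimating the transformation parameters. Experiments show a 32x CPU-time speedup over MAC on the KITTI dataset while maintaining accuracy.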

Comments Accepted to CVPR 2026, Project page: https://ustc3dv.github.io/DualReg/

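2508.01014 2026-05-18 cs.RO cs.CV Updated version

Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Cheng-You Lu, Zhuoli Zhuang, Nguyen Thanh Trung Le, Da Xiao, Yu-Cheng Chang, Thomas Do, Srinath Sridhar, Chin-teng Lin

Affiliations: University of Technology Sydney; Brown University

AI summary: Hestia is a view planning method for efficient 3D reconstruction, designed to remove the dependence of conventional reconstruction on manual image capture or fixed trajectories. It introduces a voxel-face-aware hierarchical structure and combines diverse datasets, a greedy strategy, and geometry-aware design to improve the robustness of view planning and the quality of reconstruction. Experiments show that Hestia outperforms existing methods in coverage, reconstruction accuracy, and real-time performance, and is promising for practical use.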

Comments Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

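2507.01201 2026-05-18 cs.LG cs.CV Updated version

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Lauren Hyoseo Yoon, Yisong Yue, Been Kim

Affiliations: Computation and Neural Systems; California Institute of Technology; Computation and Mathematical Sciences; Google DeepMind

AI summary: This paper studies how to align independently trained vision and language models and proposes JAM, a method that achieves cross-modal alignment by jointly training modality-specific autoencoders. JAM introduces a multimodal spread loss that improves alignment, and the paper systematically analyzes how the alignment objective, network depth, and foundation model scale affect representational consistency. The study offers theoretical insight into shared semantic structure and practical guidance for building specialized multimodal models.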

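2506.23552 2026-05-18 cs.CV cs.SD eess.AS Updated version

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

Affiliations: Yonsei University; CineLingo; Seoul National University

AI summary: This paper proposes JAM-Flow, a unified framework that generates facial motion and speech together, addressing the conventional treatment of face generation and speech synthesis as separate tasks. The method combines flow matching with a new multimodal diffusion transformer (MM-DiT) architecture, in which selective joint attention layers enable cross-modal interaction while preserving the characteristics of each modality. A single JAM-Flow model supports multiple conditioning inputs, such as text, reference audio, and reference motion, and so covers tasks including text-driven synchronized talking-face generation and audio-driven animation.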

Comments Project page: https://joonghyuk.com/jamflow-web (under review; preprint published on arXiv)

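2505.21698 2026-05-18 cs.CV Updated version

Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven Expert Bridging

Yitong Li, Morteza Ghahremani, Christian Wachinger

Affiliations: Lab for AI in Medical Imaging, Technical University of Munich (TUM); Munich Center for Machine Learning (MCML)

AI summary: Addressing the difficulty of applying foundation vision-language models to medical image diagnosis, this work proposes MedBridge, a lightweight adaptation framework that combines domain alignment, resolution preservation, and multi-label reasoning to reduce the domain gap between medical and general images. MedBridge uses pretrained vision-language models as multi-view query encoders, introduces learnable query tokens for non-destructive domain adaptation, and dynamically integrates heterogeneous models through a mixture-of-experts architecture for multi-label diagnosis, improving both cross-domain and in-domain performance. Experiments show that the method outperforms existing approaches on several chest X-ray diagnosis benchmarks, and that it is model-agnostic and scales well.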

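2505.21535 2026-05-18 cs.CV cs.AI cs.LG Updated version

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

Affiliations: University of Arizona; TetraMem, Inc.

AI summary: This paper proposes FAR, a function-preserving attention replacement framework that targets the low inference efficiency of transformer models on ReRAM-based in-memory computing (IMC) devices. FAR replaces the attention mechanism in pretrained DeiT models with a multi-head bidirectional LSTM structure compatible with IMC dataflow and combines it with block-wise knowledge distillation and structured pruning, achieving functional equivalence while substantially reducing latency and parameter count. Experiments show that FAR keeps accuracy comparable to the original model on ImageNet and several downstream tasks, indicating its potential for efficient transformer deployment on edge devices.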

Comments 7 pages main paper, 6 figures; accepted by GLSVLSI 2026

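2505.18134 2026-05-18 cs.AI cs.CL cs.CV Updated version

VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

Affiliations: Princeton University

AI summary: VideoGameBench is a benchmark for evaluating whether vision-language models (VLMs) can complete popular video games. It contains 10 classic games from the 1990s, and models interact in real time using only raw visual input and a description of the objectives. The study finds that current frontier VLMs perform poorly on real-time game tasks and cannot complete full games, with inference latency among the main limiting factors. The authors also introduce VideoGameBench Lite to ease the real-time constraint and report that even state-of-the-art models achieve very low completion rates on the benchmark.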

Comments 10 pages, 38 pages including supplementary

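2505.07322 2026-05-18 cs.CV Updated version

RealRep: Generalized SDR-to-HDR Conversion via Attribute-Disentangled Representation Learning

Li Xu, Siqi Wang, Kepeng Xu, Gang He, Lin Zhang, Weiran Wang, Yu-Wing Tai

Affiliations: Xidian University; Dartmouth College

AI summary: This paper proposes RealRep, a general SDR-to-HDR conversion framework that improves robustness to diverse real-world SDR content by disentangling the learning of luminance and chrominance attributes. Its core components are disentangled representation learning, a degradation-aware negative sample generation strategy, and DDACMNet, a lightweight two-stage mapping network that adapts the mapping dynamically to the degradation condition. Experiments show that RealRep outperforms existing methods in generalization and in the perceptual fidelity of HDR color reconstruction.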

Comments Published at AAAI'26 (Oral): The Annual AAAI Conference on Artificial Intelligence

2505.06982 2026-05-18 cs.CV Updated version

Decentralized LoRA augmented transformer with multi-scale feature learning for secured eye diagnosis

Md. Naimur Asif Borno, Md Sakib Hossain Shovon, MD Hanif Sikder, Iffat Firozy Rimi, Tahani Jaser Alahmadi, Mohammad Ali Moni

Affiliations: Research Assistant, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Mechatronics Engineering, Rajshahi University of Engineering & Technology, Rajshahi 6204, Bangladesh; Researcher, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Department of Computer Science, American International University Bangladesh, Dhaka 1216, Bangladesh; Department of Computer Science, University of South Asia-Bangladesh, Dhaka 1216, Bangladesh; Department of Computer Science & Engineering, Daffodil International University, Dhaka, Bangladesh; Department of Information Systems, College of Computer & Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia; Faculty of Health, Medicine & Behavioural Sciences, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Cyber Futures Institute, Charles Sturt University, 308 Queen St, Bathurst NSW, Australia

AI summary: This paper proposes a decentralized eye disease diagnosis framework based on an improved image transformer (DeiT), aimed at challenges in ophthalmic diagnosis from medical images: data imbalance, privacy protection, diverse spatial features, and clinical interpretability. The method combines multi-scale feature learning, low-rank adaptation (LoRA), knowledge distillation, and federated learning to improve computational efficiency, data privacy, and diagnostic performance. Experiments show that the framework outperforms conventional convolutional neural networks and existing transformer models on several benchmark datasets and provides interpretable diagnostic evidence through Grad-CAM++, laying groundwork for secure, scalable ophthalmic AI diagnosis systems.

Comments Published at Knowledge-Based Systems

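2504.21850 2026-05-18 cs.CV Updated version

Visual Compositional Tuning

Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Esin Tureci, Olga Russakovsky

Affiliations: Princeton University; Meta AI

AI summary: This paper studies how sample complexity affects informativeness in visual instruction tuning (VIT) datasets and proposes COMPACT, a synthetic data generation method that combines several basic visual capabilities within a single training sample, substantially improving data efficiency. Experiments show that with 90% less training data, COMPACT maintains performance comparable to, or better than, training on the full data, with strong results on several vision-language benchmarks. The method offers a scalable way to improve training efficiency for vision-language tasks.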

Comments See the project website at this [URL](https://princetonvisualai.github.io/compact/)

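2504.09544 2026-05-18 cs.LG cs.CE cs.CV Updated version

Integrating chemical structures as treatments improves representations of microscopy images for morphological profiling

Yemin Yu, Emre Hayir, Neil Tenenholtz, Lester Mackey, Ying Wei, David Alvarez-Melis, Ava P. Amini, Alex X. Lu

Affiliations: Department of Computer Science, City University of Hong Kong; Microsoft Research; Department of Computer Science, Zhejiang University

AI summary: This work proposes MICON, a framework that integrates chemical structure information into self-supervised pretraining to improve representations of high-throughput microscopy images for more accurate morphological profiling. The authors argue that modeling compound structures as "treatments" that induce changes in cell phenotype clearly outperforms conventional hand-crafted features and existing deep learning methods. Experiments show that representation learning with chemical information performs better at identifying drug effects across experimental replicates and data sources, suggesting a new direction for representation learning on multimodal microscopy screening data.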

Comments 24 pages

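2504.05451 2026-05-18 cs.CV Updated version

ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

Affiliations: UT Austin; Meta AI; Stanford University; Northeastern University

AI summary: ViewBridge is a framework for learning view-invariant activity representations, designed for the extreme viewpoint changes found in in-the-wild video. It preserves action semantics through knowledge distillation and uses a curriculum learning strategy that gradually increases viewpoint difficulty for smooth adaptation. Experiments show that ViewBridge outperforms existing methods on two tasks and applies across multiple datasets.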

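2503.02597 2026-05-18 cs.CV cs.AI Updated version

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

Affiliations: Sony Group Corporation, Tokyo, Japan

AI summary: Multimodal large language models (MLLMs) have recently made notable progress in understanding and reasoning over multimodal information, but alignment between the vision and language modalities remains a key challenge. Working at the architecture level, this paper proposes modality-mutual attention (MMA), which extends causal attention into cross-modal mutual attention so that image tokens can attend to text tokens, improving the model's understanding of its input. The method achieves strong results on several multimodal understanding benchmarks without adding parameters, and is general and scalable.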

Comments ICML 2026. Code is available at https://github.com/sony/aki

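2406.18944 2026-05-18 cs.CV cs.AI cs.CR Updated version

Rethinking and Red-Teaming Protective Perturbation in Personalized Diffusion Models

Yixin Liu, Ruoxi Chen, Xun Chen, Lichao Sun

Affiliations: Lehigh University; Lehigh University, Computer Science & Engineering, Bethlehem, PA, USA; Independent Researcher; Independent Researcher, Fremont, California, USA

AI summary: Personalized diffusion models (PDMs) are effective at generating images of a specific person from a small amount of data, but they are highly sensitive to small adversarial perturbations, and their performance drops sharply when fine-tuned on contaminated data. This paper analyzes PDM fine-tuning through the lens of shortcut learning, shows that adversarial perturbations cause latent semantic misalignment in the CLIP embedding space, and on that basis proposes a systematic countermeasure framework, comprising image purification and contrastive decoupling learning, that improves robustness and generalization.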

Comments Code is available at https://github.com/liuyixin-louis/DiffShortcut

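2403.13805 2026-05-18 cs.CV cs.AI cs.LG Updated version

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; The Chinese University of Hong Kong; MThreads, Inc.; Nanyang Technological University

AI summary: This paper proposes RAR, a method for improving multimodal large language models (MLLMs) on fine-grained and few-shot visual recognition. RAR combines CLIP's multimodal retrieval capability with the broad knowledge of MLLMs: a multimodal retriever extends the model's context window, and at inference time relevant category information is retrieved for the MLLM to rank and predict. The method addresses the performance drop of MLLMs when faced with a large number of categories and delivers clear gains on several fine-grained and zero-shot recognition benchmarks.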

Comments Project: https://github.com/Liuziyu77/RAR

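2212.12130 2026-05-18 cs.CV Updated version

Learning to Detect and Segment for Open Vocabulary Object Detection

Tao Wang, Nan Li

Affiliations: Sichuan University; University of California San Diego

AI summary: This work addresses detection and segmentation in open-vocabulary object detection and proposes CondHead, a dynamic network design that improves generalization to novel object categories. The core idea is to parameterize the network heads conditionally, using semantic embeddings to guide the model toward category-specific knowledge, which yields more accurate bounding box regression and segmentation prediction. The method substantially improves existing open-vocabulary detection methods at very small computational overhead.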

Comments We apologize that author Nan Li was not on the published version due to the CVPR 2023 policy that authors cannot be added after the abstract deadline