arXivDaily: arXiv daily academic digest, updated Monday through Friday
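2605.15199 2026-05-15 cs.CV cs.AI Version update

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

Affiliations: ByteDance; ByteDance Seed; Rice University

AI summary: EntityBench is a benchmark for evaluating entity consistency in multi-shot video generation. It contains 140 episodes (2,491 shots in total) drawn from real narrative media, covers scenarios at several difficulty levels, and explicitly tracks the continuity of characters, objects, and locations across shots. The benchmark introduces a three-part evaluation suite that separately assesses intra-shot quality, prompt alignment, and cross-shot consistency, with a "fidelity gate" ensuring that only accurate entity appearances count toward cross-shot scoring. The work also proposes EntityMem, a memory-augmented generation method that stores a visual reference for each entity before generation begins and achieves the highest character fidelity and presence among the methods evaluated.

Comments: Project page: https://catherine-r-he.github.io/EntityBench/

Abstract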

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
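The effect size quoted above (Cohen's d = +2.33) can be read as a difference in means of roughly 2.33 pooled standard deviations. The following is a minimal sketch of that statistic, assuming the common two-sample form with a pooled sample standard deviation; the abstract does not say which variant the authors use, and the score lists below are hypothetical values chosen only for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d using the pooled sample standard deviation (an assumed variant)."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical per-episode fidelity scores, not data from the paper.
with_memory = [2, 4, 6]
without_memory = [1, 3, 5]
effect = cohens_d(with_memory, without_memory)
print(effect)
```

A positive value means the first group scores higher on average; swapping the arguments flips the sign.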

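2605.15198 2026-05-15 cs.CV cs.AI cs.CL Version update

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

Affiliations: Meta AI; The Chinese University of Hong Kong

AI summary: This work proposes ATLAS, a visual reasoning framework that targets the computational overhead and limited task generalization of existing approaches. ATLAS uses a single discrete "functional token" to serve as both an agentic operation and a latent visual reasoning unit; it requires no visual supervision and is compatible with standard training pipelines. The work also introduces LA-GRPO to stabilize training. Experiments show that ATLAS performs strongly on multiple benchmarks while remaining efficient and interpretable.

Comments: Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS

Abstract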

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with vanilla, scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
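For readers unfamiliar with the GRPO family that LA-GRPO builds on, the sketch below shows only the standard group-relative advantage: each sampled response's reward is normalized against the other responses drawn for the same prompt. It does not implement the latent-anchored auxiliary objective, whose exact form the abstract does not give. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    eps (an assumed small constant) keeps the division finite when every
    reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to a single prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Because advantages are centered within the group, responses that beat their siblings are reinforced and the rest are discouraged, without a separate value network.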

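2605.15196 2026-05-15 cs.CV cs.LG Version update

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

Affiliations: University of Washington; University of North Carolina at Chapel Hill

AI summary: This paper proposes RefDecoder, a reference-conditioned video decoder that aims to improve detail fidelity and structural consistency in visual generation. It injects a high-fidelity reference image signal into decoding: reference attention encodes the reference image into high-dimensional features, which are processed jointly with the denoised video latents. Experiments show PSNR gains on several reconstruction benchmarks. RefDecoder can be swapped into existing video generation systems without additional fine-tuning, where it improves subject consistency, background consistency, and overall quality.

Abstract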

Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
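To put the reported PSNR gain in perspective, the sketch below computes PSNR from mean squared error using the standard definition, and converts a dB gain into the equivalent MSE ratio. It assumes 8-bit pixel values flattened into plain lists; the helper names are illustrative and are not taken from the RefDecoder code.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """MSE of the improved decoder divided by the baseline MSE for a given PSNR gain."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(2.1)
print(round(ratio, 3))
```

Under this definition, a +2.1 dB gain corresponds to cutting the mean squared reconstruction error to roughly 62% of the baseline value.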

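2605.15195 2026-05-15 cs.CV Version update

VGGT-$Ω$

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht

Affiliations: Visual Geometry Group, University of Oxford; Meta AI

AI summary: This paper proposes VGGT-$Ω$, an improved feed-forward reconstruction model that aims to raise reconstruction accuracy and efficiency for both static and dynamic scenes. By simplifying the architecture, introducing registers, and adopting a self-supervised learning protocol, VGGT-$Ω$ greatly reduces GPU memory usage during training while improving performance, and it achieves strong results on multiple benchmarks, for example improving on the previous best camera estimation accuracy on Sintel by 77%. The study also shows that the learned registers can support spatial understanding in vision-language-action models.

Comments: CVPR 2026 (Oral)

Abstract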

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
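One way to picture register attention is as a boolean attention mask. The sketch below is an assumption-laden illustration, not the VGGT-$Ω$ implementation: it supposes each frame contributes a few register tokens followed by patch tokens, lets every token attend within its own frame, and allows cross-frame attention only between register tokens. The token layout and function name are hypothetical.

```python
def register_attention_mask(num_frames, patches_per_frame, registers_per_frame):
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumed layout: for each frame, registers come first, then patch tokens.
    """
    tokens_per_frame = registers_per_frame + patches_per_frame
    # Each token is described by (frame index, is_register flag).
    tokens = [
        (frame, slot < registers_per_frame)
        for frame in range(num_frames)
        for slot in range(tokens_per_frame)
    ]
    return [
        [(fq == fk) or (rq and rk) for (fk, rk) in tokens]
        for (fq, rq) in tokens
    ]


# Two frames, each with one register (index 0 and 3) and two patch tokens.
mask = register_attention_mask(num_frames=2, patches_per_frame=2, registers_per_frame=1)
print(mask[0][3], mask[1][3])
```

With this mask, the number of cross-frame query-key pairs grows with the number of registers rather than with the number of patch tokens, which is consistent with the memory savings the abstract describes.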

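2605.15193 2026-05-15 cs.CV Version update

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Tuna Han Salih Meral, Kaan Oktay, Hidir Yesiltepe, Adil Kaan Akan, Pinar Yanardag

Affiliations: Virginia Tech; fal

AI summary: This work addresses latent flow matching for image generation and proposes improving generation quality by aligning the geometry of the latent space. The authors observe that conventional methods transport Gaussian noise to variational autoencoder latents along Euclidean paths, which do not stay on the thin spherical shells where both distributions concentrate. Decomposing latents into radial and angular components, they find that perceptual and semantic content is determined mainly by direction. They therefore project data latents onto a sphere of fixed radius and replace linear interpolation with spherical linear interpolation, so that generation paths remain on the sphere, which improves ImageNet-256 FID.

Abstract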

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
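The geometric point in the abstract, that a Euclidean chord between two points on a sphere dips inside it while spherical linear interpolation stays on it, can be checked numerically. The sketch below uses plain Python lists as stand-ins for latent tokens; the radius of 2.0 and the two-dimensional vectors are arbitrary illustrative choices, and the helper names are not from the paper.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_to_sphere(v, radius):
    """Rescale v to the given radius while keeping its direction."""
    n = norm(v)
    return [radius * x / n for x in v]


def slerp(x0, x1, t):
    """Spherical linear interpolation between two vectors of equal norm."""
    r = norm(x0)
    cos_omega = sum(a * b for a, b in zip(x0, x1)) / (r * r)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if omega < 1e-8:
        # Nearly identical directions: linear interpolation is accurate enough.
        return [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    if math.pi - omega < 1e-8:
        raise ValueError("slerp is undefined for antipodal points")
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


noise = project_to_sphere([3.0, 4.0], radius=2.0)    # stand-in for projected Gaussian noise
latent = project_to_sphere([-4.0, 3.0], radius=2.0)  # stand-in for a projected data latent
on_sphere = slerp(noise, latent, 0.5)
chord = [(a + b) / 2 for a, b in zip(noise, latent)]
print(norm(on_sphere), norm(chord))
```

For these two orthogonal endpoints, the slerp midpoint keeps norm 2, while the chord midpoint shrinks to about 1.41, which is the shell-leaving behavior the abstract describes.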

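2605.15190 2026-05-15 cs.CV Version update

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Yanzuo Lu, Ronglai Zuo, Jiankang Deng

Affiliations: Imperial College London

AI summary: This paper proposes RAVEN, a real-time autoregressive video extrapolation network that generates future video chunks from previously generated content. To address the degradation in long-horizon quality caused by the mismatch between the history distributions seen in training and at inference, RAVEN repacks each self rollout during training into an interleaved sequence of clean historical endpoints and noisy denoising states, aligning training attention with inference-time extrapolation. The paper also introduces CM-GRPO, which treats a consistency sampling step as a conditional Gaussian transition and optimizes it with online reinforcement learning, yielding further gains. Experiments show that RAVEN outperforms recent causal video distillation baselines on several evaluation metrics.

Comments: Project Page: https://yanzuo.lu/raven

Abstract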

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.

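2605.15187 2026-05-15 cs.CV cs.GR cs.RO Version update

Articraft: An Agentic System for Scalable Articulated 3D Asset Generation

Matt Zhou, Ruining Li, Xiaoyang Lyu, Zhaomou Song, Zhening Huang, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi, Shangzhe Wu

Affiliations: University of Cambridge; University of Oxford; Nanyang Technological University

AI summary: This paper proposes Articraft, an agentic system for generating articulated 3D assets at scale. It recasts asset generation as writing a program that builds the asset and uses large language models to write that code automatically, addressing the lack of large, diverse datasets. Articraft provides a dedicated programmatic interface and validation harness so that the language model can efficiently write code that defines parts, composes geometry, specifies joints, and tests the result, producing high-quality articulated assets. Experiments show higher asset quality than state-of-the-art articulated-asset generators and general-purpose coding agents. The authors also build a curated dataset of over 10K assets spanning 245 categories, useful for training models and for applications such as robotics simulation and virtual reality.

Comments: Project page: https://articraft3d.github.io/

Abstract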

A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.

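2605.15185 2026-05-15 cs.CV cs.AI Version update

Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou

Affiliations: Tsinghua University - IEI Lab; UW-Madison; Adobe Research

AI summary: This paper proposes PDI-Bench, a quantitative evaluation framework for detecting geometric consistency problems in generated videos. The method obtains object-centric observations through segmentation and point tracking, lifts them to 3D space via monocular reconstruction, and computes projective-geometry residuals that capture three failure dimensions: scale-depth alignment, 3D motion consistency, and structural rigidity. The authors also build PDI-Dataset for systematic evaluation of the geometric properties of generated videos, revealing shortcomings of existing generative models in physical plausibility.

Comments: 12 pages, 5 figures. Project page: https://pdi-bench.github.io/

Abstract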

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
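As intuition for what a scale-depth residual might measure, recall that under a pinhole camera the apparent size of a rigid object is inversely proportional to its depth, so the product of pixel size and depth should stay constant over time. The sketch below scores deviation from that invariant with a coefficient of variation. This is an illustrative stand-in built on that textbook relation, not the residual defined by PDI-Bench, and all inputs are hypothetical.

```python
from statistics import mean, pstdev


def scale_depth_residual(pixel_sizes, depths):
    """Coefficient of variation of (apparent size x depth) across frames.

    For a rigid object seen by a pinhole camera with fixed focal length,
    the product is constant, so the residual is 0.
    """
    products = [s * z for s, z in zip(pixel_sizes, depths)]
    return pstdev(products) / mean(products)


# Hypothetical track: the object halves in apparent size each time its depth doubles.
consistent = scale_depth_residual([100.0, 50.0, 25.0], [2.0, 4.0, 8.0])
# Hypothetical failure: the object recedes but its apparent size never changes.
distorted = scale_depth_residual([100.0, 100.0, 100.0], [2.0, 4.0, 8.0])
print(consistent, distorted)
```

A generator that keeps an object the same size while it moves away from the camera would score well above zero here, which is the kind of failure perceptual metrics tend to miss.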

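2605.15182 2026-05-15 cs.CV Version update

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Yifan Wang, Tong He

Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Shanghai Innovation Institute

AI summary: This paper proposes Warp-as-History, a method for camera-controlled video generation that works without any training and improves further with finetuning on a single training video. It converts camera-induced image warps into pseudo-history, aligns the positional encoding with the target frames, selects only visible tokens, and feeds the result through the video generation model's visual-history pathway, guiding the model to follow the specified camera trajectory. Experiments show good zero-shot trajectory following, and lightweight finetuning further improves visual quality and motion dynamics.

Comments: Project page: https://yyfz.github.io/warp-as-history/

Abstract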

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

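2605.15181 2026-05-15 cs.CV Version update

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee

Affiliations: University of Wisconsin–Madison; Adobe Research

AI summary: This work addresses abstract, multi-step instructions in open-ended image editing and proposes a framework that tightly couples planning with execution. Its core components are a planner that generates atomic decomposition steps, an orchestrator that selects editing tools and regions, and a reward mechanism based on a vision-language judge that guides the editing process. The orchestrator is optimized through reward-driven execution, and successful trajectories are fed back to refine the planner, yielding more coherent and reliable edits.

Abstract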

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.

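2605.15178 2026-05-15 cs.CV Version update

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie

Affiliations: NVIDIA

AI summary: This paper proposes SANA-WM, an efficient world model that generates high-fidelity, one-minute 720p videos with precise camera control. Through four core designs (hybrid linear attention, dual-branch camera control, a two-stage generation pipeline, and a robust annotation pipeline), the model maintains visual quality while markedly improving training and inference efficiency. Experiments show that SANA-WM is efficient in data usage, training time, and hardware requirements, and that on a one-minute world-modeling benchmark it achieves stronger action-following accuracy than prior open-source baselines along with higher generation throughput.

Comments: https://nvlabs.github.io/Sana/WM/

Abstract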

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.
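The efficiency figures in the abstract are easier to compare after a little arithmetic. The sketch below derives the training budget in GPU-hours and the ratio of clip length to denoising time, using only numbers stated above; it assumes the 15 days are wall-clock days with all 64 GPUs in use, and that the 34s figure covers denoising only, as the abstract words it.

```python
# Training budget: 64 H100 GPUs for 15 days (assumed to be wall-clock days).
gpus = 64
training_days = 15
gpu_hours = gpus * training_days * 24

# Distilled variant: a 60 s clip denoised in 34 s on one RTX 5090.
clip_seconds = 60
denoise_seconds = 34
realtime_factor = clip_seconds / denoise_seconds

print(gpu_hours, round(realtime_factor, 2))
```

That works out to 23,040 H100 GPU-hours of training, and denoising that runs about 1.76 times faster than the clip plays back.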

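2605.15171 2026-05-15 cs.CV cs.AI cs.LG Version update

Evidential Reasoning Advances Interpretable Real-World Disease Screening

Chenyu Lian, Hong-Yu Zhou, Jing Qin

Affiliations: The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China; Research Institute for Smart Ageing, the Hong Kong Polytechnic University, Hong Kong, China; School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University, Beijing, China

AI summary: This paper proposes EviScreen, an interpretable disease screening framework based on evidential reasoning, to address the limited interpretability and suboptimal performance of current medical image screening models. The method retrieves region-level evidence from historical cases and uses dual knowledge banks to provide retrospective interpretability, improving both transparency and screening accuracy. It also uses abnormality maps derived from contrastive retrieval to improve localization interpretability. Experiments show strong performance on real-world disease screening benchmarks, with notably higher specificity at clinical-level recall.

Comments: ICML 2026

Abstract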

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
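The headline metric above, specificity at a clinical-level recall, fixes how many true cases must be caught and then asks how many healthy cases are correctly cleared. A minimal sketch of one common way to compute it follows: pick the highest score threshold whose recall meets the target, then measure specificity at that threshold. The function name, scores, and labels are hypothetical, and the paper's exact protocol may differ.

```python
def specificity_at_recall(scores, labels, target_recall):
    """Specificity at the highest threshold whose recall reaches target_recall.

    scores: model outputs, higher means more likely diseased.
    labels: 1 for diseased, 0 for healthy.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        if true_pos / positives >= target_recall:
            false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
            return 1.0 - false_pos / negatives
    return 0.0  # only reached if target_recall exceeds 1


# Hypothetical screening scores for three diseased and three healthy cases.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
strict = specificity_at_recall(scores, labels, target_recall=1.0)
relaxed = specificity_at_recall(scores, labels, target_recall=0.6)
print(strict, relaxed)
```

In this toy example, demanding that every diseased case be caught forces the threshold down to 0.4 and lets one healthy case through, which is the trade-off the metric is designed to expose.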

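2605.15167 2026-05-15 cs.CV Version update

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen

Affiliations: HKUST; Webank

AI summary: This paper studies whether purely synthetic layered data can improve graphic design decomposition. Building on the state-of-the-art CLD framework, the authors construct a synthetic dataset, SynLayers, and use vision-language models to generate textual supervision and to automate inference inputs. They find that purely synthetic data can outperform existing non-scalable datasets, that performance keeps improving as data volume grows before saturating at around 50K samples, and that synthetic data allows balanced control over layer-count distributions. The study offers a scalable synthetic-data foundation for layered design editing systems.

Comments: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers

Abstract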

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether purely synthetic layered data can improve graphic design decomposition. We assume that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on the CLD baseline, a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision-language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

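2605.15128 2026-05-15 cs.CV cs.CL cs.IR Version update

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei Che, Wujiang Xu, Shilong Liu, Zirui Liu, Mubbasir Kapadia, Vladimir Pavlovic, Jiang Liu, Mengdi Wang, Yiyu Shi, Dimitris N. Metaxas, Ruixiang Tang

Affiliations: Rutgers; Notre Dame; Princeton; UMN

AI summary: MemEye is a visual-centric evaluation framework for multimodal agent memory, designed to address the failure of existing evaluations to test whether long-term memory preserves visual evidence and supports reasoning over it. The framework evaluates along two dimensions, the granularity of visual evidence and how that evidence must be used, and builds a new benchmark of 8 life-scenario tasks with ablation-driven validation gates that assess answerability, shortcut resistance, visual necessity, and reasoning structure. Experiments show that current methods still struggle to preserve fine-grained visual details and to reason about state changes over time, highlighting the importance of evidence routing, temporal tracking, and detail extraction for long-term multimodal memory.

Comments: 46 pages, 15 figures

Abstract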

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

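2605.15116 2026-05-15 cs.CV Version update

DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

Haonan Zhao, Yiting Wang, Jingkun Chen, Valentina Donzella, Thomas Bashford-Rogers, Kurt Debattista

Affiliations: University of Warwick; Northwestern Polytechnical University; Queen Mary University of London

AI summary: Training autonomous driving systems requires large amounts of labelled driving video, but the domain gap between simulated and real scenes limits the practical use of simulation data. This paper proposes DriveCtrl, a depth-conditioned, controllable sim-to-real driving video generation framework whose structure-aware adapter preserves scene layout and motion patterns while producing visually realistic, temporally coherent driving videos. The work also introduces a data generation pipeline that supports multiple conditioning signals, and a dedicated evaluation metric, DVRS. Experiments show that it outperforms existing methods in realism, temporal quality, and perception task performance, narrowing the gap between simulated and real driving video.

Abstract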

Large-scale labelled driving video data is essential for training autonomous driving systems. Although simulation offers scalable and fully annotated data, the domain gap between synthetic and real-world driving videos significantly limits its utility for downstream deployment. Existing video generation methods are not well-suited for this task, as they fail to simultaneously preserve scene structure, object dynamics, temporal consistency, and visual realism, all of which are critical for maintaining annotation validity in generated data. In this paper, we present DriveCtrl, a depth-conditioned controllable sim-to-real video generation framework for realistic driving video synthesis. Built upon a pretrained video foundation model, DriveCtrl introduces a structure-aware adapter that enables depth-guided generation while preserving the scene layout and motion patterns of the source simulation, producing temporally coherent driving videos that remain aligned with the original simulated sequences. We further introduce a scalable data generation pipeline that transforms simulator videos into realistic driving footage matching the visual style of a target real-world dataset. The pipeline supports three conditioning signals: structural depth, reference-dataset style, and text prompts, while preserving frame-level annotations for downstream perception tasks. To better assess this task, we propose a driving-domain-specific knowledge-informed evaluation metric called Driving Video Realism Score (DVRS) that assesses the realism of generated videos. Experiments demonstrate that DriveCtrl consistently outperforms the base model and competing alternatives in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation.

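2605.15088 2026-05-15 cs.CV Version update

SAGE3D: Soft-guided attention and graph excitation for 3D point cloud corner detection

Batuhan Arda Bekar, Can Sarı, Hüseyin Can Gülkan, Barış Özcan

AI summary: This paper proposes SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. It uses a hierarchical encoder-decoder architecture that progressively downsamples the point cloud through Set Abstraction layers and recovers per-point predictions via Feature Propagation. The work introduces Soft-Guided Attention, which injects ground-truth corner labels as a prior into the attention computation during training to improve precision, and an Excitatory Graph Neural Network, which uses positive-only message passing at key scales so that high-confidence corners reinforce predictions, improving recall.

Comments: 5 pages, 4 figures

Abstract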

We present SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. We propose a multi-stage solution built on a hierarchical encoder-decoder architecture that progressively downsamples point clouds through Set Abstraction layers and recovers per-point predictions via Feature Propagation. We introduce two innovations: Soft-Guided Attention, which injects ground-truth corner labels as a log-prior into attention logits during training to improve precision, and an Excitatory Graph Neural Network positioned at strategic resolutions in the hierarchy, employing positive-only message passing where high-confidence corners reinforce predictions through learned boosting, optimizing for recall. The hierarchical design enables multi-scale feature extraction while our guided attention and excitatory modules ensure corner signals are amplified rather than diluted across scales.
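Adding a log-prior to attention logits is equivalent to multiplying each key's unnormalized attention weight by that prior before the softmax. The sketch below shows this for a single query, assuming a prior of 1 for keys labelled as corners and a smaller hypothetical `floor` value for all other keys. How SAGE3D actually constructs its prior is not stated in the abstract, so the prior values and function names here are assumptions.

```python
import math


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_guided_attention(logits, corner_labels, floor=0.5):
    """Attention weights for one query with a label-derived log-prior on the keys.

    floor is an assumed prior for non-corner keys and must be > 0, so those
    keys are down-weighted rather than masked out entirely.
    """
    priors = [1.0 if is_corner else floor for is_corner in corner_labels]
    return softmax([l + math.log(p) for l, p in zip(logits, priors)])


# Two keys with equal logits; only the first is labelled as a corner.
weights = soft_guided_attention([0.0, 0.0], corner_labels=[1, 0], floor=0.5)
print(weights)
```

With equal logits and a floor of 0.5, the corner key receives two thirds of the attention instead of half, a soft nudge rather than a hard mask. Since the labels are only available during training, inference would use the plain logits.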

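2605.15071 2026-05-15 cs.CV cs.AI cs.CL Version update

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

Affiliations: MBZUAI Inception

AI summary: This study identifies "cultural anachronism" in vision-language models applied to cultural heritage materials: a tendency to misinterpret historical artifacts using concepts, materials, or cultural frameworks that do not fit their period. To measure it, the authors build the TAB-VLM benchmark, which contains 600 questions over 1,600 Indian cultural artifacts from different periods and evaluates temporal reasoning. Experiments show that even state-of-the-art models perform poorly on the benchmark, revealing significant gaps in how current vision-language models understand historical materials from non-Western cultures.

Comments: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/

Abstract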

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.

2605.15062 2026-05-15 cs.CV Version update

Computational Imaging Priors for Wireless Capsule Endoscopy: Monte Carlo-Guided Hemoglobin Mapping for Rare-Anomaly Detection

Chengshuai Yang, Lei Xing, Gregory Entin, Roopa Vemulapalli, Lisa Casey, Raiyan Tripti Zaman

Affiliations * Department of Biomedical Engineering, University of Texas Southwestern Medical Center; Department of Internal Medicine, University of Texas Southwestern Medical Center; Department of Radiation Oncology, Stanford University; VELVETECH, LLC; Division of Digestive and Liver Diseases, Clements University Hospital; Internal Medicine, Division of Digestive and Liver Diseases, Parkland Hospital

AI summary This study addresses the drop in classification performance on capsule-endoscopy images caused by hemoglobin contrast being confounded with bile and illumination falloff. It proposes a Monte Carlo-inspired analytic prior that computes a hemoglobin map from the RGB signal, with the aim of improving detection of rare vascular anomalies. In experiments on the Kvasir-Capsule dataset, the method yields a direction-consistent AUC improvement across multiple seeds, most clearly on classes such as Lymphangiectasia. The study also shows that the method can produce interpretable heatmaps and can run on plain three-channel RGB input, which gives it practical value.

Comments 24 pages, 6 figures, 3 tables. Code and trained-model checkpoints at https://github.com/integritynoble/GI_Multi_Task . 6-seed (seeds 41, 42, 43, 44, 45, 47) mean +/- SD ablation as the headline; per-class single-seed=42 analyses in Appendix A

Details
English abstract

Background. RGB-trained capsule-endoscopy classifiers underperform on small-vessel vascular findings by conflating hemoglobin contrast with bile and illumination falloff. Thus, here we test whether a Monte Carlo-inspired analytic model can compute hemoglobin from RGB signal built upon extracted classifier. Methods. On Kvasir-Capsule (47,238 frames, video-level 70/15/15 split, 11 evaluable classes) we evaluate two software-only configurations against RGB-only EfficientNet-B0 across 6 seeds: (i) a prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) fused as 2 zero-init auxiliary channels; (ii) a distillation head training a 3-channel RGB backbone to predict P_blood. Significance: paired DeLong, McNemar, bootstrap CIs with Bonferroni correction. Results. Across 6 seeds (n=6,423), the analytic prior provides a small but direction-consistent macro-AUC improvement: RGB-only 0.760 +/- 0.027, input-fusion 0.783 +/- 0.024 (paired Delta = +0.023, sign-positive on 5/6 seeds), distillation 0.773 +/- 0.028. The largest robust per-class lift is on Lymphangiectasia, where AUC rises from RGB 0.238 +/- 0.057 to input-fusion 0.337 +/- 0.019, sign-consistent across all 6 seeds. On rare focal-vascular classes (Angiectasia, Blood - fresh) the prior's per-seed effects are bimodal: seed=42 reaches Angiectasia AUC 0.528 -> 0.916, but the cross-seed mean is 0.646 -> 0.608 with sigma_PI = 0.23 - reported as a high-variance per-seed exemplar. Conclusion. A Monte Carlo-inspired analytic prior provides a small, direction-consistent macro-AUC improvement on Kvasir-Capsule across 6 seeds with the largest robust per-class lift on Lymphangiectasia; the distillation variant runs on plain 3-channel RGB and yields a free interpretability heatmap.
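
The prior formula quoted in the Methods, P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r), can be evaluated per pixel as in the sketch below. Several things here are assumptions, since the abstract does not define them: H_norm is taken to be a hemoglobin index normalized to [0, 1], Phi(r) is passed in as a plain number rather than computed, and the default `alpha=10.0` is an arbitrary illustrative value. The function name `p_blood` is likewise hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def p_blood(h_norm, phi_r, alpha=10.0):
    """Evaluate P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) for one pixel.

    h_norm: normalized hemoglobin index, assumed here to lie in [0, 1]
    phi_r:  value of Phi(r) at this pixel; its definition is not given in the
            abstract, so it is supplied by the caller
    alpha:  sigmoid steepness (10.0 is an illustrative assumption)
    """
    return sigmoid(alpha * (h_norm - 0.5)) * phi_r

midpoint = p_blood(0.5, 1.0)  # sigmoid is centred at H_norm = 0.5
high = p_blood(0.9, 1.0)      # strong hemoglobin signal
low = p_blood(0.1, 1.0)       # weak hemoglobin signal
```

The sigmoid term acts as a soft threshold on the hemoglobin index, and the Phi(r) factor scales it down wherever that weighting is small.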

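2605.15055 2026-05-15 cs.LG cs.CV Version update

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing, Pandeng Li, Ruihang Chu, Shiwei Zhang, Yu Liu, Zuxuan Wu

Affiliations * Fudan University; Wan Team, Alibaba Group

AI summary This paper proposes DiffusionOPD, a multi-task training framework for improving image generation with diffusion models. Based on online policy distillation (OPD), the method trains task-specific teacher models independently and then distills their knowledge into a unified student model along the student's own generation trajectories. This decouples single-task exploration from multi-task integration and avoids the interference and imbalance of joint optimization. The theoretical analysis extends the OPD framework from discrete tokens to continuous-state Markov processes and derives a unified KL-divergence objective; the method improves training efficiency and generation quality and achieves strong results on multiple benchmarks.

Details
English abstract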

Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
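
A standard identity gives some intuition for why a per-step KL objective can reduce to mean-matching. If the student and teacher transition distributions at a step are both Gaussian with the same isotropic variance sigma^2 (an assumption of this sketch, not a statement of the paper's derivation), their KL divergence is the squared distance between the means divided by 2 * sigma^2. The function name below is hypothetical.

```python
def gaussian_kl_same_variance(mu_student, mu_teacher, sigma):
    """KL( N(mu_student, sigma^2 I) || N(mu_teacher, sigma^2 I) ).

    With equal isotropic variances the trace and log-determinant terms of the
    general Gaussian KL cancel, leaving ||mu_s - mu_t||^2 / (2 * sigma^2).
    """
    sq_dist = sum((s - t) ** 2 for s, t in zip(mu_student, mu_teacher))
    return sq_dist / (2.0 * sigma ** 2)

# Means that differ by the vector (3, 4), i.e. Euclidean distance 5.
kl_unit = gaussian_kl_same_variance([0.0, 0.0], [3.0, 4.0], sigma=1.0)
kl_wide = gaussian_kl_same_variance([0.0, 0.0], [3.0, 4.0], sigma=2.0)
```

Under this assumption the gradient with respect to the student mean is (mu_s - mu_t) / sigma^2, so its direction is simply the mean difference and the noise level only rescales it.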

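2605.15054 2026-05-15 cs.CV Version update

LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

Mitchell Piehl, Muchao Ye

Affiliations * The University of Iowa

AI summary This paper proposes LATERN, a context-aware explainable video anomaly detection framework that addresses the lack of structured temporal context in existing vision-language-model pipelines for video anomaly detection. By introducing a context-aware anomaly scoring module and a recursive evidence aggregation module, the method models video anomaly detection as a temporal evidence aggregation process, producing more accurate and semantically coherent event-level explanations. Experiments show that LATERN improves the test-time detection accuracy and explanation consistency of frozen models on several challenging benchmark datasets.

Details
English abstract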

Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we address a key limitation of such pipelines: they perform segment-level inference independently owing to token constraints and reason without structured temporal context. Addressing it allows VLMs to interpret anomalies as deviations from evolving video dynamics rather than producing fragmented predictions and explanations. Specifically, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.

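2605.15042 2026-05-15 cs.CV cs.AI Version update

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan, Po-Chien Luan, Alexandre Alahi

Affiliations * EPFL

AI summary EverAnimate is an efficient post-training method for generating high-quality long-horizon animated videos while preserving visual quality and character identity. It introduces two mechanisms, Persistent Latent Propagation and Restorative Flow Matching, to address the detail degradation and semantic inconsistency caused by chunk-based generation in long videos. Experiments show that, with only lightweight LoRA tuning, EverAnimate outperforms existing methods on both short- and long-horizon animation generation, with clear gains in image fidelity and visual quality.

Comments Project Page: https://everanimate.github.io/homepage/

Details
English abstract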

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.

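2605.15024 2026-05-15 cs.CV Version update

HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning

Man Wang, Chenyang Liu, Wenjun Li, Feng Ni, Bing Jia, Baoqi Huang, Riting Xia, Zhenwei Shi

Affiliations * College of Computer Science, Inner Mongolia University; Research Center for Spatiotemporal Intelligence, Inner Mongolia University; Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; China Mobile Communications Group Inner Mongolia Co., Ltd.

AI summary This paper proposes HiSem, a hierarchical semantic disentangling network that addresses semantic entanglement in remote sensing image change captioning. The method introduces a Bidirectional Differential Attention Modulation module and a Hierarchical Adaptive Semantic Disentanglement module, which respectively enhance cross-temporal interaction and separate semantic representations of different granularities, so that changed and unchanged image pairs are distinguished more accurately and fine-grained change semantics are modeled. Experiments show that HiSem outperforms existing methods on two benchmark datasets, improving BLEU-4 by 7.52% on the WHU-CDC dataset, and offers a structured modeling perspective for remote sensing image change captioning.

Details
English abstract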

Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic understanding. To address the above limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperforms previous methods, achieving a significant improvement of +7.52% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at https://github.com/Man-Wang-star/HiSem

2605.14991 2026-05-15 cs.CV cs.AI Version update

Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning

Francesco Pastori, Francesca Fati, Marina Rosanu, Luigi De Vitis, Lucia Ribero, Gabriella Schivardi, Giovanni Damiano Aletti, Nicoletta Colombo, Jvan Casarin, Francesco Multinu, Elena De Momi

Affiliations * Department of Gynecologic Oncology, European Institute of Oncology, IEO, IRCCS, Milan, Italy; Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy; Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, USA; Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy; Department of Medicine and Innovative Technology, Università degli Studi dell'Insubria, Varese, Italy

AI summary This study aims to predict the response of ovarian cancer patients to neoadjuvant chemotherapy from pre-treatment contrast-enhanced CT, to help identify early the patients for whom the therapy would be ineffective. It proposes a non-invasive multi-loss deep learning framework that uses automatically derived 3D lesion masks together with a partially fine-tuned image encoder and an attention mechanism for feature aggregation and classification. Validated on a retrospective cohort of 280 patients, the model achieves a ROC-AUC of 0.73 and an F1-score of 0.70 on the test set, indicating some clinical predictive ability and providing a foundation for imaging-driven patient stratification.

Details
English abstract

Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responder, 133 non-responder). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.

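2605.14990 2026-05-15 cs.CV Version update

Characterizing the visual representation of objects from the child's view

Jane Yang, Tarun Sepuri, Alvin Wei Ming Tan, Khai Loong Aw, Michael C. Frank, Bria Long

Affiliations * Department of Psychology, University of California San Diego; Department of Psychology, Stanford University; Department of Computer Science, Stanford University

AI summary This study examines how children learn object category representations from everyday visual experience, analyzing a large body of first-person video from the BabyView dataset. Using a supervised detection model to extract common object categories from millions of frames, it finds that the distribution of object categories children encounter is highly skewed and that objects appear in highly variable ways, such as from unusual angles, in cluttered scenes, or partially occluded. Even so, the detected categories still show strong clustering within superordinate categories (e.g., animals, food), a pattern also confirmed in high-dimensional embeddings from self-supervised models, suggesting that children's visual learning is highly robust and efficient.

Comments 19 pages, 6 figures

Details
English abstract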

Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children's visual experience at home from the BabyView dataset (N = 31 participants, 868 hours, ages 5-36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children's object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children's visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.

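2605.14988 2026-05-15 cs.CV Version update

Compositional Video Generation via Inference-Time Guidance

Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

Affiliations * Tel-Aviv University; Bar Ilan University; NVIDIA Research

AI summary Text-to-video diffusion models can generate realistic videos but perform poorly on prompts that require fine-grained compositional understanding, such as entity relations, attributes, actions, and motion directions. This paper proposes CVG, an inference-time guidance method that uses the model's internal cross-attention maps to capture how prompt concepts are distributed across space and time, trains a lightweight compositional classifier on them, and uses its gradients during early denoising to steer the latent trajectory, improving the compositional faithfulness of generated videos. The method requires no changes to the model architecture and no fine-tuning of the generator; built on a frozen vision-language model backbone, the classifier transfers across semantically related composition labels. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while the visual quality of the underlying generator is preserved.

Details
English abstract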

Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.

2605.14984 2026-05-15 cs.CV cs.AI Version update

Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, Zeran Ke, Bin Tan, Hang Zhang, Gui-Song Xia

Affiliations * LIESMARS & School of Artificial Intelligence, Wuhan University; EPFL; HKUST; Northeastern University; Zhejiang University; Ant Group; Amap, Alibaba Group

AI summary This paper studies how to generate a street-level 3D scene from a single satellite image, a challenging problem. Existing methods show a clear trade-off between geometric accuracy and semantic diversity. The proposed Sat3DGen takes a geometry-first approach that combines new geometric constraints with a perspective-view training strategy, improving both the geometric accuracy and the visual realism of generated scenes. Experiments show that it outperforms the leading prior method in geometric error and image quality and is useful across several downstream applications.

Comments ICLR 2026; code: https://github.com/qianmingduowan/Sat3DGen demo: https://huggingface.co/spaces/qian43/Sat3DGen project page: https://qianmingduowan.github.io/Sat3DGen_project_page/

Details
English abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from ~40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released at https://github.com/qianmingduowan/Sat3DGen.

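2605.14980 2026-05-15 cs.CV cs.AI Version update

MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Shuohong Wang, Jeff W. Lichtman, Jun Liu

Affiliations * School of Computing and Communications, Lancaster University; Department of Cell Biology, Harvard Medical School; Department of Molecular and Cellular Biology, Harvard University

AI summary This paper presents MicroscopyMatching, a general-purpose microscopy image analysis framework that aims to automate analysis tasks such as segmentation, tracking, and counting across different experimental conditions. The framework recasts these diverse tasks as a single matching problem and exploits the strong matching capability of pre-trained latent diffusion models, giving reliable analysis across many biological samples and imaging conditions without additional adaptation. The work offers biomedical research a practical, broadly applicable solution that reduces reliance on manual analysis.

Details
English abstract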

Analyzing microscopy images to extract biological object properties (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to various biomedical research. Yet conducting this manually is costly and time-consuming. Though deep learning-based approaches have been explored to automate this process, the substantial diversity of microscopy analysis settings in practice (including variations of biological object types, sample processing protocols, imaging equipment, and analysis tasks, etc.) often renders them ineffective. As a result, these approaches typically require extensive adaptation for different settings, which, however, can impose burdens that are often practically unsustainable for laboratories, forcing biomedical researchers to still commonly rely on manual analysis, thereby severely bottlenecking the pace of biomedical research progress. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, that can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. From a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem, effectively handling this problem by exploiting the robust matching capability from pre-trained latent diffusion models.

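2605.12625 2026-05-15 cs.RO cs.CV Version update

Driving Intents Amplify Planning-Oriented Reinforcement Learning

Hengtong Lu, Victor Shea-Jay Huang, Chengmin Yang, Pengfei Jing, Jifeng Dai, Yan Xie, Benjin Zhu

Affiliations * Li Auto; Tsinghua University; CUHK MMLab

AI summary This study addresses mode collapse in continuous-action policies trained on a single demonstrated trajectory per scene and proposes DIAL, a two-stage framework for preference-aligned driving policies. The first stage uses classifier-free guidance (CFG) to expand the action sampling distribution and break the mode collapse caused by single demonstrations; the second stage introduces multi-intent GRPO, which preserves distributional diversity during preference reinforcement learning and keeps fine-tuning from collapsing again. Experiments on a driving benchmark show rater feedback scores above the level of the human-driven demonstration, supporting the importance of expanding the sampling distribution for continuous-action policies.

Comments Project page: https://mind-omni.github.io/

Details
English abstract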

Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but to expand and preserve the sampling distribution being optimized.
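
Classifier-free guidance on an intent-conditioned velocity can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the common convention v_uncond + w * (v_cond - v_uncond), which may differ from the paper's exact form; the three intent labels and all velocity values are made up (the abstract mentions eight rule-derived intents without naming them); and `cfg_velocity` is a hypothetical helper name.

```python
def cfg_velocity(v_uncond, v_cond, guidance_scale):
    """Classifier-free guidance: v_uncond + w * (v_cond - v_uncond), per coordinate.

    guidance_scale = 0 gives the unconditional velocity, 1 gives the
    conditional one, and values above 1 push further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]

# Hypothetical 2-D velocities (forward, lateral) for one scene.
v_uncond = [1.0, 0.0]
v_by_intent = {
    "keep_lane": [1.0, 0.0],
    "turn_left": [0.8, 0.5],
    "turn_right": [0.8, -0.5],
}

# One guided velocity per intent class: the samples spread across maneuvers
# instead of clustering around the unconditional (demonstrated) mode.
guided = {
    intent: cfg_velocity(v_uncond, v_cond, guidance_scale=2.0)
    for intent, v_cond in v_by_intent.items()
}
```

In this toy setup the guided lateral components for the left and right intents move in opposite directions, which is the kind of separation between maneuver modes the abstract describes.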

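2605.12624 2026-05-15 cs.RO cs.CV Version update

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, Hongsheng Li

Affiliations * CUHK MMLab; Li Auto; Tsinghua University

AI summary This paper presents MindVLA-U1, a unified streaming VLA architecture for autonomous driving that targets the gap in planning quality between existing VLA models and VA models. A single unified Vision-Language-Action backbone jointly generates language tokens and continuous action trajectories, and a streaming design processes the driving video frame by frame. While keeping a natural language interface, MindVLA-U1 improves planning performance, surpasses human drivers on the WOD-E2E benchmark for the first time, reaches state-of-the-art planning accuracy, and matches VA latency.

Comments Work in progress. Project page: https://mind-omni.github.io/

Details
English abstract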

Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A full streaming design processes the driving video framewise rather than as fixed video-action chunks under costly temporal VLM modeling. Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates. The unified architecture enables fast/slow systems on dense & sparse MoT backbones via flexible self-attention context management, and exposes a measurable language-control path for action: language-predicted driving intents steer the action diffusion via classifier-free guidance (CFG), turning language-side intent into control signals for continuous action planning. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA by large margins, and matches VA latency (16 FPS vs. RAP's 18 FPS at 1B scale) while preserving natural language interfaces for human-vehicle interaction.

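2605.12622 2026-05-15 cs.RO cs.CV Version update

Action Emergence from Streaming Intent

Pengfei Jing, Victor Shea-Jay Huang, Hengtong Lu, Jifeng Dai, Yan Xie, Benjin Zhu

Affiliations * Li Auto; Tsinghua University; CUHK

AI summary This paper studies "action emergence" in end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in complex traffic scenes. The authors propose SI (Streaming Intent), a vision-language-action model that streams driving intent both semantically, through a continuous causal chain of reasoning, and temporally, and uses this intent to guide action generation for controllable, high-quality trajectory planning. Experiments show competitive results on the Waymo End-to-End benchmark and, according to the authors, the first demonstration of intent-based controllability in a fully end-to-end VLA model.

Comments Project page: https://mind-omni.github.io/

Details
English abstract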

We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.
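
To make "two denoising steps on a flow-matching action head" concrete, the sketch below integrates an intent-conditioned velocity field with two Euler steps. Everything in it is a toy assumption rather than the SI model: the velocity field `(target - x) / (1 - t)`, the intent names, and the target waypoints are invented for illustration, and a learned network would take the place of `toy_velocity`. The field is chosen because it transports any starting point in a straight line to the target at t = 1, so even a coarse Euler discretization lands on it.

```python
def toy_velocity(x, t, target):
    """Hypothetical conditional velocity field pointing straight at `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(x0, target, num_steps=2):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical intents mapped to made-up end waypoints (forward, lateral) in metres.
intent_targets = {"keep_lane": [10.0, 0.0], "turn_left": [8.0, 4.0]}

plans = {
    intent: euler_sample([0.0, 0.0], target, num_steps=2)
    for intent, target in intent_targets.items()
}
```

Changing the intent changes the conditioning and therefore the endpoint, while the sampler itself stays the same; that separation is what allows an intent token to steer the trajectory at inference time.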

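2605.07785 2026-05-15 cs.CV Version update

Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

Affiliations * University of Edinburgh; NHS Lothian

AI summary This study proposes XpertCausal, a radiologist-guided causal concept bottleneck model for chest X-ray interpretation. The model represents the causal relationships between pathologies and radiographic findings with a probabilistic noisy-OR framework and uses Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-defined concept-pathology associations constrain the model structure to clinically plausible reasoning pathways, so that the model outperforms conventional concept bottleneck models in diagnostic performance, calibration, and explanation quality and aligns more closely with expert knowledge.

Details
English abstract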

Concept Bottleneck Models (CBMs) in medical imaging aim to improve model interpretability by predicting intermediate clinical concepts before final diagnoses. However, most existing CBMs treat concepts as discriminative predictors of pathology labels, without explicitly modelling the underlying clinical generative process where diseases produce observable radiographic findings. We propose XpertCausal, a radiologist-guided causal CBM for chest X-ray interpretation which models pathology-to-concept relationships using a probabilistic noisy-OR framework. This generative model is then inverted via Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-curated concept-pathology associations are used to constrain model structure to radiologist-defined clinically plausible reasoning pathways. We evaluate XpertCausal on MIMIC-CXR across pathology classification performance, calibration, explanation quality, and alignment with radiologist-defined reasoning pathways. Compared with both a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal achieves improved AUROC, calibration, and clinically relevant explanation quality, while learning concept-pathology relationships that more closely align with expert knowledge. These results demonstrate that incorporating clinically motivated causal structure and expert domain knowledge into CBMs can lead to more accurate, interpretable, and clinically aligned models for CXR interpretation.
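The generative-then-invert idea in the abstract above can be sketched with a textbook noisy-OR model and exact Bayesian inversion by enumeration. This is an illustrative sketch only, not XpertCausal itself: the priors, weights, and leak values are made up, concepts are treated as hard binary observations, and a zero weight is used here as a stand-in for a concept-pathology link that an expert has ruled out.

```python
from itertools import product


def p_concept_on(weights_j, leak_j, diseases):
    """Noisy-OR: the concept is absent only if the leak and every active cause fail."""
    p_off = 1.0 - leak_j
    for w, d in zip(weights_j, diseases):
        if d:
            p_off *= 1.0 - w
    return 1.0 - p_off


def posterior(priors, weights, leaks, concepts):
    """Marginal P(disease_i = 1 | concepts), enumerating all disease configurations.

    weights[j][i] is the strength with which disease i causes concept j.
    """
    n = len(priors)
    marginals = [0.0] * n
    total = 0.0
    for config in product([0, 1], repeat=n):
        p = 1.0
        for prior, d in zip(priors, config):
            p *= prior if d else 1.0 - prior
        for w_j, leak_j, c in zip(weights, leaks, concepts):
            p_on = p_concept_on(w_j, leak_j, config)
            p *= p_on if c else 1.0 - p_on
        total += p
        for i, d in enumerate(config):
            if d:
                marginals[i] += p
    return [m / total for m in marginals]


if __name__ == "__main__":
    # Two hypothetical diseases, one observed concept linked only to the first.
    print(posterior([0.5, 0.3], [[0.8, 0.0]], [0.2], [1]))
```

Enumeration costs 2^n configurations, so it only suits a handful of pathologies; the disease with a zero weight keeps its prior because the observed concept carries no evidence about it.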

2604.25855 2026-05-15 cs.CV cs.AI Updated version

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Hector G. Rodriguez, Marcus Rohrbach

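Affiliations * TU Darmstadt

AI summary This paper proposes SIEVES, a new selective prediction method that aims to improve the reliability and coverage of visual question answering (VQA) systems in real-world and out-of-distribution (OOD) scenarios. The reasoner produces localized visual evidence while answering, and a selector explicitly assesses answer quality from that evidence, yielding more accurate confidence estimates without relying on model-internal signals such as logits or hidden states. Experiments show that SIEVES substantially improves coverage on several challenging OOD benchmarks and applies to multiple frontier closed-source models without access to their weights or logits.

Details
English abstract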

Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Specifically, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. Existing selective prediction methods estimate implicit confidence scores, relying on model-internal signals like logits or hidden representations, which are not available for frontier closed-source models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner using only model inputs and outputs. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all tested OOD benchmarks and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation. Code is publicly available at https://github.com/hector-gr/SIEVES.
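The coverage-at-risk quantity defined in the abstract above (answer only when confidence clears a threshold, subject to a user-defined risk level) can be computed as follows. This is a generic sketch of the metric, not code from the SIEVES repository; the function name and the example scores are hypothetical, and ties in confidence are broken by input order.

```python
def coverage_at_risk(confidences, correct, max_risk):
    """Largest share of inputs that can be answered while keeping the error rate
    among answered inputs at or below `max_risk`.

    Inputs are answered in order of decreasing confidence, which corresponds to
    sweeping an abstention threshold from high to low.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    best = 0
    errors = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        if errors / k <= max_risk:
            best = k
    return best / len(confidences)


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.7, 0.6]
    ok = [True, True, False, True]
    print(coverage_at_risk(conf, ok, max_risk=0.0))
    print(coverage_at_risk(conf, ok, max_risk=0.25))
```

A selector that ranks correct answers above incorrect ones raises this number at a fixed risk level, which is why coverage can improve even when the reasoner's accuracy is unchanged.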

2604.21909 2026-05-15 cs.CV cs.IT math.IT q-bio.NC Updated version

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin

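Affiliations * Cranberry-Lemon University; Department of Computer Science; Icahn School of Medicine at Mount Sinai; Department of AI and Human Health; Imperial College London; Department of Computing; Department of Psychiatry

AI summary This study examines how human and machine vision systems differ in the direction of their confusions on a classification task, revealing differences in their inductive biases. By analyzing human and deep neural network responses under 12 perturbation types, the study quantifies asymmetry in confusion matrices and links it to the geometry of the information-error trade-off. The results show that humans exhibit broad but weak asymmetries across class pairs, whereas deep models show more concentrated, stronger directional confusions, and that this difference reflects distinct generalization strategies even at matched accuracy.

Details
English abstract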

To humans, a robin seems more like a bird than a bird seems like a robin, but does this asymmetry also hold for machine vision? Humans and modern vision models can match each other in accuracy while making systematically different kinds of errors, differing not in how often they fail, but in who gets mistaken for whom. We show these directional confusions reveal distinct inductive biases invisible to accuracy alone. Using matched human and deep neural network responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link its organization to the geometry of the information--error trade-off - how efficiently, and how gracefully, a system generalizes under distortion. We find that humans exhibit broad but weak asymmetries across many class pairs, whereas deep vision models show sparser, stronger directional collapses into a few dominant categories. Robustness training reduces overall asymmetry magnitude but fails to recover this human-like distributed structure. Generative simulations further show that these two asymmetry organizations shift the trade-off geometry in opposite directions even at matched accuracy, explaining why the same scalar asymmetry score can reflect fundamentally different generalization strategies. Together, these results establish directional confusion structure as a sensitive, interpretable signature of inductive bias that accuracy-based evaluation cannot recover.
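One simple way to put a number on directional confusion is to compare each off-diagonal entry of a row-normalized confusion matrix with its transpose counterpart. The sketch below is an assumption for illustration, not the paper's metric: `asymmetry_score` is a hypothetical normalized score that is 0 for perfectly symmetric confusions and 1 when every confusion runs in one direction only.

```python
def row_normalize(counts):
    """Turn raw confusion counts into P(predicted = j | true = i); rows must be non-empty."""
    return [[c / sum(row) for c in row] for row in counts]


def asymmetry_score(conf):
    """Sum of |C[i][j] - C[j][i]| over class pairs, divided by the total pairwise confusion."""
    num = 0.0
    den = 0.0
    n = len(conf)
    for i in range(n):
        for j in range(i + 1, n):
            num += abs(conf[i][j] - conf[j][i])
            den += conf[i][j] + conf[j][i]
    return num / den if den else 0.0


if __name__ == "__main__":
    symmetric = row_normalize([[8, 2], [2, 8]])
    one_way = row_normalize([[8, 2], [0, 10]])
    print(asymmetry_score(symmetric), asymmetry_score(one_way))
```

A single scalar like this cannot tell many weak asymmetries from a few strong ones, which is the limitation the abstract points to when it contrasts human and model confusion structure.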

2601.16981 2026-05-15 cs.CV cs.GR Updated version

SyncLight: Single-Edit Multi-View Relighting

David Serrano-Lozano, Anand Bhattad, Luis Herranz, Jean-François Lalonde, Javier Vazquez-Corral

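Affiliations * Computer Vision Center, Universitat Autònoma de Barcelona; Johns Hopkins University; Universidad Politécnica de Madrid; Université Laval

AI summary SyncLight is a method for consistent multi-view scene relighting driven by a single-view edit, targeting multi-camera broadcasts, stereoscopic cinema, and virtual production, where lighting consistency requirements are strict. Using a multi-view diffusion transformer trained with latent bridge matching, it relights the multi-view image set with high fidelity in a single inference step and generalizes to an arbitrary number of views without camera pose information. Its core contributions are parametric lighting control and a large-scale hybrid dataset of synthetic and real multi-view data to support training.

Comments Project page: http://sync-light.github.io

Details
English abstract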

We present SyncLight, a method to enable consistent, parametric control over light sources across multiple uncalibrated views of a static scene conditioned on a single view. While single-view relighting has advanced significantly, existing generative approaches struggle to maintain the rigorous lighting consistency essential for multi-camera broadcasts, stereoscopic cinema, and virtual production. SyncLight addresses this by enabling precise control over light intensity and color across a multi-view capture of a scene, conditioned on a single reference edit. Our method leverages a multi-view diffusion transformer trained using a latent bridge matching formulation, achieving high-fidelity relighting of the entire image set in a single inference step. To facilitate training, we introduce a large-scale hybrid dataset comprising diverse synthetic environments -- curated from existing sources and newly designed scenes -- alongside high-fidelity, real-world multi-view captures under calibrated illumination. Though trained only on image pairs, SyncLight generalizes zero-shot to an arbitrary number of viewpoints, effectively propagating lighting changes across all views, without requiring camera pose information. SyncLight enables practical relighting workflows for multi-view capture systems.

2512.13609 2026-05-15 cs.CV cs.LG Updated version

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli

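Affiliations * York University; Vector Institute for AI; Qualcomm AI Research

AI summary This paper proposes a new task and benchmark, Do-Undo Bench, to address a key shortcoming of vision-language models in understanding and generating scene transformations driven by real-world actions. The task requires a model not only to simulate the effect of a real-world action on a scene but also to restore the scene to its original state, testing the model's grasp of cause and effect. The study finds that current models perform poorly on action reversibility, underscoring the need to evaluate action understanding; the benchmark offers an intuitive testbed for evaluating and advancing action-aware generation tied to real-world dynamics in multimodal systems.

Comments Project page: https://s-mahajan.github.io/Do-Undo-Bench/

Details
English abstract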

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.

2511.14751 2026-05-15 cs.CV cs.RO Updated version

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, Sebastian Scherer

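Affiliations * Carnegie Mellon University; Field AI

AI summary This paper proposes Co-Me, a confidence-guided token merging method that accelerates visual geometric transformers without retraining or fine-tuning the base model. A lightweight confidence predictor is trained to rank tokens by uncertainty, and low-confidence tokens are selectively merged, reducing computation while maintaining spatial coverage. Experiments show substantial acceleration across a range of multi-view and streaming visual geometric transformers, with speedups of up to 21.5x on VGGT and 20.4x on Pi3, making real-time 3D perception and reconstruction feasible.

Details
English abstract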

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, respectively, making visual geometric transformers practical for real-time 3D perception and reconstruction.
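The general idea of merging low-confidence tokens while keeping high-confidence ones intact can be sketched as below. The abstract does not specify the merging rule, so everything here is an assumption for illustration: tokens are grouped into consecutive windows of size `group`, windows are ranked by mean confidence, and the lowest-ranked fraction `merge_ratio` is collapsed to one averaged token each.

```python
def merge_low_confidence(tokens, confidence, group=2, merge_ratio=0.5):
    """Average the least-confident windows of tokens and keep the rest untouched.

    tokens: list of equal-length feature vectors; confidence: one score per token.
    Output order follows input order, so every region stays represented.
    """
    groups = [list(range(s, min(s + group, len(tokens))))
              for s in range(0, len(tokens), group)]
    mean_conf = [sum(confidence[i] for i in g) / len(g) for g in groups]
    n_merge = int(len(groups) * merge_ratio)
    ranked = sorted(range(len(groups)), key=lambda gi: mean_conf[gi])
    to_merge = set(ranked[:n_merge])

    out = []
    for gi, g in enumerate(groups):
        if gi in to_merge:
            dim = len(tokens[g[0]])
            out.append([sum(tokens[i][d] for i in g) / len(g) for d in range(dim)])
        else:
            out.extend(list(tokens[i]) for i in g)
    return out


if __name__ == "__main__":
    toks = [[1.0], [3.0], [5.0], [7.0]]
    conf = [0.9, 0.8, 0.1, 0.2]
    print(merge_low_confidence(toks, conf))
```

Because attention cost grows with the number of tokens, shortening the sequence this way saves more on long inputs, which is consistent with the abstract's remark that speedups scale with sequence length.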

2507.01909 2026-05-15 cs.CV Updated version

Modality-agnostic, patient-specific digital twins modeling temporally varying digestive motion

Jorge Tapias Gomez, Nishant Nadkarni, Lando S. Bosma, Jue Jiang, Ergys D. Subashi, William P. Segars, James M. Balter, Mert R Sabuncu, Neelam Tyagi, Harini Veeraraghavan

Affiliations * Computer and Information Science, Cornell University; Department of Medical Physics, Memorial Sloan Kettering Cancer Center; University Medical Center Utrecht; Department of Radiation Physics, University of Texas MD Anderson Cancer Center; Carl E. Ravin Advanced Imaging Laboratories and Center for Virtual Imaging Trials, Duke University Medical Center; Department of Radiation Oncology, University of Michigan

AI summary This study addresses the difficulty of accurately evaluating deformable image registration (DIR) in highly mobile gastrointestinal organs, proposing a modality-agnostic approach based on patient-specific digital twins (DT) to simulate temporally varying motion and assess DIR. A semi-automated pipeline generates 4D sequences of 21 motion phases from published gastrointestinal motion models and real patient scans; the study assesses the registration accuracy of six DIR methods and validates dose mapping accuracy. The approach provides detailed spatial and dosimetric evaluation for dynamic, anatomically complex regions and has significant clinical value.

Comments This work is still under review; it contains 7 pages, 6 figures, and 4 tables

Journal ref Phys. Med. Biol. 71 (2026) 015029

Details
English abstract

Objective: Clinical implementation of deformable image registration (DIR) requires voxel-based spatial accuracy metrics such as manually identified landmarks, which are challenging to implement for highly mobile gastrointestinal (GI) organs. To address this, patient-specific digital twins (DT) modeling temporally varying motion were created to assess the accuracy of DIR methods. Approach: 21 motion phases simulating digestive GI motion as 4D sequences were generated from static 3D patient scans using published analytical GI motion models through a semi-automated pipeline. Eleven datasets were used, including six T2w FSE MRI (T2w MRI), two T1w 4D golden-angle stack-of-stars, and three contrast-enhanced CT scans. The motion amplitudes of the DTs were assessed against real patient stomach motion amplitudes extracted from independent 4D MRI datasets. The generated DTs were then used to assess six different DIR methods using target registration error, Dice similarity coefficient, and the 95th percentile Hausdorff distance, reported as summary metrics and voxel-level granular visualizations. Finally, for a subset of T2w MRI scans from patients treated with MR-guided radiation therapy, dose distributions were warped and accumulated to assess dose warping errors, including evaluations of DIR performance in both low- and high-dose regions for patient-specific error estimation. Main results: Our proposed pipeline synthesized DTs modeling realistic GI motion, achieving mean and maximum motion amplitudes and a mean log Jacobian determinant within 0.8 mm and 0.01, respectively, similar to published real-patient gastric motion data. It also enables the extraction of detailed quantitative DIR performance metrics and rigorous validation of dose mapping accuracy. Significance: The pipeline enables rigorous testing of DIR tools in dynamic, anatomically complex regions, with granular assessment of spatial and dosimetric accuracy.
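Two of the registration metrics named in the abstract above have short standard definitions, shown here as a minimal sketch. The function names, the set-of-voxel-indices mask encoding, and the example coordinates are assumptions for illustration; the paper's own evaluation code is not reproduced.

```python
import math


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two masks given as collections of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as a perfect match
    return 2.0 * len(a & b) / (len(a) + len(b))


def target_registration_error(reference_points, registered_points):
    """Mean Euclidean distance between corresponding landmarks (same units as the inputs)."""
    dists = [math.dist(p, q) for p, q in zip(reference_points, registered_points)]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    print(dice({1, 2, 3}, {2, 3, 4}))
    print(target_registration_error([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)]))
```

With a digital twin the true displacement of every voxel is known, so the landmark pairs needed for target registration error can be taken from the simulated motion instead of being identified by hand.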

2506.01015 2026-05-15 cs.CV Updated version

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro

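Affiliations * Department of Engineering Science, University of Oxford; Australian Institute for Machine Learning, Adelaide University; Stanford University; University of Central Florida; University of Surrey

AI summary This paper proposes AuralSAM2, which integrates audio into the SAM2 model to improve multimodal interaction in video segmentation. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts and propagates auditory cues along SAM2's feature pyramid, strengthening cross-modal influence. An audio-guided contrastive loss further strengthens modality alignment. Experiments show notable performance gains on public benchmarks with only a small impact on interactive efficiency.

Comments Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026

Details
English abstract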

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.

2501.12202 2026-05-15 cs.CV Updated version

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, Song Zhang, Yang Liu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo

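Affiliations * Tencent

AI summary This paper presents Hunyuan3D 2.0, an advanced large-scale synthesis system for generating high-resolution textured 3D assets. The system comprises two foundation components: Hunyuan3D-DiT, a shape generation model built on a scalable flow-based diffusion transformer, and Hunyuan3D-Paint, which uses geometric and diffusion priors to produce high-quality textures. The authors also built Hunyuan3D-Studio, a user-friendly platform that lets professional and amateur users generate and manipulate 3D assets efficiently. Experiments show that Hunyuan3D 2.0 outperforms existing state-of-the-art models in geometry detail, condition alignment, and texture quality, and it has been open-sourced to fill the gap in large-scale 3D generative models in the open-source community.

Comments GitHub link: https://github.com/Tencent/Hunyuan3D-2

Details
English abstract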

We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open-source and closed-source, in geometry detail, condition alignment, texture quality, and other aspects. Hunyuan3D 2.0 is publicly released to fill the gap in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2

2412.17155 2026-05-15 cs.CV cs.LG Updated version

The Potential of Convolutional Neural Networks for Cancer Detection

Hossein Molaeian, Kaveh Karamjani, Sina Teimouri, Saeed Roshani, Sobhan Roshani

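Affiliations * Department of Computer Engineering, Islamic Azad University – Kermanshah Branch

AI summary This paper explores the potential of convolutional neural networks (CNNs) for cancer detection, aiming to improve the accuracy of early cancer diagnosis with deep learning. The study analyzes how various CNN architectures perform on different cancer datasets, compares the strengths and weaknesses of each approach, and identifies the architectures that perform best on cancer classification. The work offers a reference for integrating CNNs into clinical diagnostic workflows and can help strengthen diagnostic capabilities in healthcare.

Details
English abstract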

Early detection is crucial for successful cancer treatment and for increasing survival rates, particularly in the most common forms. Across recent work that applies Convolutional Neural Networks (CNNs) effectively to classification, ten different cancers are covered. The distinct architectures of CNNs used in each study concentrate on pattern recognition for different types of cancer across various datasets. The advantages and disadvantages of each approach are identified by comparing these architectures. This study explores the potential of integrating CNNs into clinical practice to complement traditional diagnostic methods. It also identifies the top-performing CNN architectures, highlighting their role in enhancing diagnostic capabilities in healthcare.

2605.14966 2026-05-15 cs.CV cs.AI Updated version

MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

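Affiliations * Tsinghua University; Tsinghua University, Tencent; Tencent; University of Science and Technology Beijing; University of Macau

AI summary This paper proposes MHSA, a lightweight framework that mitigates hallucinations in large vision-language models (LVLMs) by steering attention. MHSA learns to correct cross-modal attention patterns: a simple three-layer MLP generator, trained with supervisory signals from the LVLM itself and the DHCP discriminator, produces corrected attention weights. At inference, the method reduces both generative and discriminative hallucinations by simply replacing the original cross-modal attention, without modifying any LVLM parameters, offering a new perspective on hallucination research in LVLMs.

Comments 19 pages, 17 figures

Details
English abstract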

Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.

2605.14960 2026-05-15 cs.GR cs.CG cs.CV Updated version

Meschers: Geometry Processing of Impossible Objects

Ana Dodik, Isabella Yu, Kartik Chandra, Jonathan Ragan-Kelley, Joshua Tenenbaum, Vincent Sitzmann, Justin Solomon

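Affiliations * MIT CSAIL; MIT

AI summary This paper studies how to represent "impossible objects" accurately on a computer: geometric constructions that cannot exist in reality but that humans can perceive. Conventional approaches cut the object or bend it along the depth axis, which changes local geometry or makes relighting difficult and hampers downstream graphics processing. The authors propose Meschers, a mesh representation grounded in discrete exterior calculus that supports applications such as smoothing, relighting, and distance computation, enables inverse rendering of impossible objects, and is compared against cut and bend representations.

Journal ref ACM Trans. Graph. 44, 4, Article 70 (August 2025)

Details
English abstract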

Impossible objects, geometric constructions that humans can perceive but that cannot exist in real life, have been a topic of intrigue in visual arts, perception, and graphics, yet no satisfying computer representation of such objects exists. Previous work embeds impossible objects in 3D, cutting them or twisting/bending them in the depth axis. Cutting an impossible object changes its local geometry at the cut, which can hamper downstream graphics applications, such as smoothing, while bending makes it difficult to relight the object. Both of these can invalidate geometry operations, such as distance computation. As an alternative, we introduce Meschers, meshes capable of representing impossible constructions akin to those found in M.C. Escher's woodcuts. Our representation has a theoretical foundation in discrete exterior calculus and supports the use-cases above, as we demonstrate in a number of example applications. Moreover, because we can do discrete geometry processing on our representation, we can inverse-render impossible objects. We also compare our representation to cut and bend representations of impossible objects.

2605.14950 2026-05-15 cs.CV cs.RO Updated version

Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

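Affiliations * School of AI, Shanghai Jiao Tong University; King Abdullah University of Science and Technology; Nanyang Technological University; SJTU-Quic Robot Joint Lab

AI summary Evo-Depth is a lightweight depth-enhanced vision-language-action model that aims to improve spatial understanding in robotic manipulation. A lightweight Implicit Depth Encoding Module extracts compact depth features from multi-view RGB images, and a Spatial Enhancement Module incorporates them into vision-language representations for efficient spatial-semantic enhancement. Evo-Depth also introduces a Progressive Alignment Training strategy to better align the depth-enhanced representations with action learning, and it achieves strong performance and efficiency across multiple simulation and real-world settings.

Details
English abstract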

Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.

2605.14949 2026-05-15 cs.CV eess.IV eess.SP Updated version

A CUBS-Compatible Ultrasound Morphology and Uncertainty-Aware Baseline for Carotid Intima-Media Segmentation and Preliminary Risk Prediction

Aueaphum Aueawatthanaphisut

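Affiliations * School of Information, Computer, and Communication Technology; Sirindhorn International Institute of Technology, Thammasat University

AI summary This study presents AtheroFlow-XNet, a reproducible ultrasound-based baseline for carotid intima-media segmentation and preliminary risk prediction, aiming at a more complete assessment of vascular risk in carotid atherosclerosis. The model performs supervised segmentation with manually annotated intima-media masks, adds clinical variables to assist risk prediction, and uses Monte Carlo dropout for uncertainty-aware inference. It reaches a Dice coefficient of 0.7930 for segmentation and an AUROC of 0.6910 for preliminary risk prediction, offering a morphology-driven baseline for automated vascular wall analysis from ultrasound.

Comments 13 pages, 5 figures, 2 tables, 20 equations, 3 appendices

Details
English abstract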

Carotid atherosclerosis is a major contributor to ischemic stroke and transient ischemic attack. Conventional ultrasound assessment is commonly based on intima-media thickness, plaque appearance, stenosis degree, and peak systolic velocity, but these morphology- and velocity-based indicators may not fully capture patient-specific vascular risk. This study presents AtheroFlow-XNet, a CUBS-compatible ultrasound morphology and uncertainty-aware learning baseline for carotid intima-media segmentation and preliminary risk prediction. Using the Carotid Ultrasound Boundary Study dataset, manual lumen-intima and media-adventitia boundary annotations were converted into dense intima-media masks for supervised segmentation. Clinical variables were incorporated into an auxiliary risk-prediction branch, and Monte Carlo dropout was used for uncertainty-aware inference. The model was evaluated using a patient-level train-validation-test split with 1,522 training images, 326 validation images, and 328 testing images. The proposed model achieved a Dice coefficient of 0.7930 for LI-MA mask segmentation, a segmentation loss of 0.2359, and an area under the receiver operating characteristic curve of 0.6910 for preliminary risk prediction. Qualitative results showed that predicted masks were generally aligned with manual annotations, while uncertainty maps highlighted ambiguous wall-boundary regions. These results suggest that ultrasound-derived carotid morphology can support automated wall analysis and uncertainty-aware interpretation. Since CUBS does not provide Doppler waveforms or CFD-derived hemodynamic biomarkers, this work should be interpreted as a reproducible morphology-driven baseline. Future work will incorporate Doppler-derived flow profiles, patient-specific vascular reconstruction, and CFD-based wall shear biomarkers.

2605.14948 2026-05-15 cs.CV Updated version

ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing

Yuehao Liu, Weijia Zhang, Xuanming Shang, Zhizhou Chen, Yanhao Ge, Shanyan Guan, Chao Ma

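Affiliations * Shanghai Jiao Tong University; Nanjing University; VIVO

AI summary This paper proposes ACE-LoRA, a dynamic regularization framework for continual image editing that addresses forgetting of previously learned knowledge as new tasks are learned. The method uses Adaptive Orthogonal Decoupling to identify and remove interference between tasks and introduces a Rank-Invariant Historical Information Compression strategy to make continual updates more scalable. The study also builds CIE-Bench, the first comprehensive benchmark for continual image editing, as a standardized evaluation platform; experiments show that the method outperforms existing methods in instruction fidelity, visual realism, and robustness to forgetting.

Details
English abstract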

State-of-the-art diffusion models often rely on parameter-efficient fine-tuning to perform specialized image editing tasks. However, real-world applications require continual adaptation to new tasks while preserving previously learned knowledge. Despite the practical necessity, continual learning for image editing remains largely underexplored. We propose ACE-LoRA, a dynamic regularization framework for continual image editing that effectively mitigates catastrophic forgetting. ACE-LoRA leverages Adaptive Orthogonal Decoupling to identify and orthogonalize task interference, and introduces a Rank-Invariant Historical Information Compression strategy to address scalability issues in continual updates. To facilitate continual learning in image editing and provide a standardized evaluation protocol, we introduce CIE-Bench, the first comprehensive benchmark in this domain. CIE-Bench encompasses diverse and practically relevant image editing scenarios with a balanced level of difficulty to effectively expose limitations of existing models while remaining compatible with parameter-efficient fine-tuning. Extensive experiments demonstrate that our method consistently outperforms existing baselines in terms of instruction fidelity, visual realism, and robustness to forgetting, establishing a strong foundation for continual learning in image editing.

2605.14938 2026-05-15 cs.LG cs.CV Updated version

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Yuehao Liu, Shanyan Guan, Weijia Zhang, Xuanming Shang, Yanhao Ge, Wei Li, Chao Ma

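Affiliations * MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; vivo Mobile Communication Co., Ltd.

AI summary This paper proposes Octopus, a continual learning framework that addresses the tendency of multimodal large language models to forget earlier knowledge as they learn new tasks sequentially. The method is based on History-Free Gradient Orthogonalization (HiFGO), which enforces orthogonality at the gradient level to reduce parameter interference without storing historical task data, avoiding privacy and storage concerns. Experiments on UCIT show that Octopus surpasses the prior state of the art by 2.14% and 6.82% on the Avg and Last metrics, respectively.

Details
English abstract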

Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.
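Gradient-level orthogonality, as mentioned in the abstract above, rests on a basic projection: remove from the new task's gradient its component along a direction that should be left undisturbed. The sketch below shows only that projection. How HiFGO obtains the protected direction without historical data is not described in the abstract, so `direction` is simply a hypothetical input here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad, direction, eps=1e-12):
    """Return grad minus its component along `direction` (orthogonal projection)."""
    norm_sq = dot(direction, direction)
    if norm_sq < eps:
        return list(grad)  # nothing to protect against
    coeff = dot(grad, direction) / norm_sq
    return [g - coeff * d for g, d in zip(grad, direction)]


if __name__ == "__main__":
    g = [3.0, 4.0]
    protected = [1.0, 2.0]
    g_orth = project_out(g, protected)
    print(g_orth, dot(g_orth, protected))
```

An update along the projected gradient has no first-order effect in the protected direction, which is the sense in which orthogonalization limits interference between tasks.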

2605.14935 2026-05-15 cs.CV Updated version

Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

Nhat Le, Daochang Liu, Anh Nguyen, Ajmal Mian

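Affiliations * The University of Western Australia; The University of Liverpool

AI summary This paper proposes MSCoT, a multi-scale coarse-to-fine model for test-time human motion synthesis and control. It decomposes motion into a multi-scale hierarchical representation and predicts the full token sequence at each temporal scale in a coarse-to-fine manner, enabling efficient and flexible control. A multi-scale token guidance strategy and a lightweight token refiner overcome the challenges of discrete sampling and improve control precision and generation quality; experiments show gains over existing methods in motion quality, control accuracy, and inference speed.

Details
English abstract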

We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and 10× faster inference speed on HumanML3D.

2605.14925 2026-05-15 cs.CV cs.LG Updated version

Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse

Yunsong Fang, Tingyu Wang, Zhedong Zheng

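Affiliations * University of Macau; Hangzhou Dianzi University

AI summary This paper studies geo-localization of drone images under adverse weather, where weather-degraded drone images must be matched against geo-tagged satellite images. To address weather-induced degradation and the cross-view domain gap, the authors propose GeoFuse, a cross-modal fusion framework that combines precisely aligned road maps with satellite imagery to produce representations that are more discriminative and more robust to weather changes. Experiments show that GeoFuse clearly outperforms existing methods on multiple benchmarks and improves geo-localization accuracy.

Comments 18 pages, 4 figures

Details
English abstract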

Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.
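Illustration (not from the paper): the "lightweight dynamic gating mechanism that adaptively weights modality contributions per instance" can be pictured as a learned scalar gate that blends the two feature vectors. The sketch below is a minimal version under that assumption; the single-scalar gate, `gate_weights`, and `gate_bias` are hypothetical, and GeoFuse's actual module also involves token-level and channel-level interactions that are not modeled here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(sat_feat, map_feat, gate_weights, gate_bias=0.0):
    """Blend satellite and road-map features with one per-instance gate in (0, 1)."""
    concat = sat_feat + map_feat  # list concatenation: the gate sees both modalities
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * s + (1.0 - g) * m for s, m in zip(sat_feat, map_feat)]
    return fused, g

# With all-zero gate parameters the gate is 0.5, i.e. a plain average.
fused, gate = gated_fuse([1.0, 3.0], [3.0, 5.0], gate_weights=[0.0] * 4)
```

Because the gate is computed from the instance's own features, trained gate parameters could lean on one modality for some inputs and on the other for different inputs.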

2605.14923 2026-05-15 cs.CV Updated version

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen

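Affiliations * Harbin Institute of Technology, Shenzhen; Zhongguancun Academy; Huazhong University of Science and Technology; Beijing University of Posts and Telecommunications; South China University of Technology; University of Oxford; Peking University

AI summary This work proposes Hierarchical Scene Parsing, an interaction-oriented parsing task that uses an explicit scene-object-part-affordance hierarchy to capture structured dependencies in a scene and improve visual semantic understanding. It introduces SceneParser, a vision-language-model-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning, and builds the large-scale annotated benchmark SceneParser-Bench. Experiments show that the method outperforms existing models on hierarchical parsing and remains compatible with, and useful for, conventional tasks and a downstream planning task.

Comments Preprint. Code, models, and dataset are provided in the manuscript

Details
English abstract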

General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
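Illustration (not from the paper): the scene -> object -> part -> affordance hierarchy described above maps naturally onto nested records. The sketch below is one possible representation; all class names, field names, and the box convention are hypothetical and do not reflect the actual SceneParser-Bench annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) convention

@dataclass
class Affordance:
    name: str
    point: Tuple[float, float]  # assumed interaction point in image coordinates

@dataclass
class Part:
    name: str
    box: Box
    affordances: List[Affordance] = field(default_factory=list)

@dataclass
class SceneObject:
    name: str
    box: Box
    parts: List[Part] = field(default_factory=list)

@dataclass
class Scene:
    label: str
    objects: List[SceneObject] = field(default_factory=list)

def chains(scene):
    """List every complete object-part-affordance chain in the scene."""
    return [(o.name, p.name, a.name)
            for o in scene.objects for p in o.parts for a in p.affordances]

handle = Part("handle", (60, 40, 80, 90), [Affordance("grasp", (70.0, 65.0))])
mug = SceneObject("mug", (20, 30, 80, 100), [handle])
lamp = SceneObject("lamp", (100, 10, 140, 120))  # no parts, so no complete chain
kitchen = Scene("kitchen", [mug, lamp])
```

Here `chains(kitchen)` yields a single chain for the mug; the lamp contributes nothing because its hierarchy is incomplete, which is the kind of gap a completeness metric would need to account for.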

2605.14913 2026-05-15 cs.CV Updated version

Representative Attention For Vision Transformers

Yuntong Li, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

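Affiliations * College of Intelligence and Computing, Tianjin University

AI summary This paper proposes Representative Attention (RPAttention), a linear global attention mechanism that targets two problems of attention in Vision Transformers: the high computational cost of conventional self-attention and the dependence of existing proxy tokens on image coordinates. Instead of building intermediate tokens from fixed spatial partitions, it dynamically forms semantically related representative tokens in representation space, enabling semantic communication across spatial regions. The method keeps a global receptive field while reducing token-interaction complexity from quadratic to linear, and experiments report strong performance on image classification, object detection, and semantic segmentation.

Details
English abstract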

Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather-Interact-Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. By replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each input. RPAttention reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
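Illustration (not from the paper): the Gather-Interact-Distribute steps can be sketched in a few lines to show where the linear cost comes from. This is a simplified reading of the abstract, assuming softmax routing over representatives and omitting learned query/key/value projections, multiple heads, and normalization; `seeds` stands in for the learned representative queries and is a hypothetical name.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def attend(queries, keys_values):
    """Plain softmax attention where keys and values are the same vectors."""
    return [weighted_sum(softmax([dot(q, kv) for kv in keys_values]), keys_values)
            for q in queries]

def gather_interact_distribute(tokens, seeds):
    # Gather: each of the N tokens splits its mass across the M representatives
    # (softmax over representatives); each representative then averages the
    # tokens routed to it. Cost is proportional to N * M.
    routing = [softmax([dot(t, s) for s in seeds]) for t in tokens]
    reps = []
    for j in range(len(seeds)):
        col = [r[j] for r in routing]
        total = sum(col)
        reps.append(weighted_sum([c / total for c in col], tokens))
    # Interact: representatives attend to each other. Cost M * M, independent of N.
    reps = attend(reps, reps)
    # Distribute: every token queries the representatives. Cost N * M.
    return attend(tokens, reps)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seeds = [[1.0, 0.0], [0.0, 1.0]]
out = gather_interact_distribute(tokens, seeds)
```

With M fixed, every stage scales linearly in the number of spatial tokens N, and no stage refers to token positions, so routing depends only on feature similarity.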

2605.14908 2026-05-15 cs.CV Updated version

SteerSeg: Attention Steering for Reasoning Video Segmentation

Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An, Lars Petersson

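Affiliations * Macquarie University; York University; CSIRO Data61; UTS

AI summary Video reasoning segmentation requires localizing target objects in video frames from natural language descriptions, often involving spatial reasoning and implicit references. Existing methods extract attention maps from frozen large vision-language models (LVLMs) as segmentation priors for training-free grounding, but these maps are optimized for text generation, which leaves the grounding signal diffuse. This paper proposes SteerSeg, a lightweight framework that identifies attention misalignment and steers the attention distribution through input-level conditioning, combining learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. It substantially improves the spatial grounding ability of LVLMs and generalizes well across multiple benchmarks.

Comments Project page: https://steerseg.github.io

Details
English abstract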

Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs.
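Illustration (not from the paper): one simple way to turn an attention map into point prompts is to take its strongest peaks while suppressing near-duplicates. The abstract does not specify SteerSeg's conversion rule, so the greedy peak picking below, along with `k` and `min_dist`, is an assumption made for the example.

```python
def top_k_points(attn, k=3, min_dist=1):
    """Pick up to k peak (row, col) cells, greedily skipping cells too close to earlier picks."""
    cells = sorted(
        ((v, r, c) for r, row in enumerate(attn) for c, v in enumerate(row)),
        reverse=True,  # strongest attention first
    )
    picked = []
    for _, r, c in cells:
        # Chebyshev distance keeps prompts from clustering on a single blob.
        if all(max(abs(r - pr), abs(c - pc)) > min_dist for pr, pc in picked):
            picked.append((r, c))
        if len(picked) == k:
            break
    return picked

# Toy 4x4 attention map with two separate hot spots.
attn = [
    [0.1, 0.2, 0.0, 0.0],
    [0.3, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.7],
    [0.0, 0.0, 0.6, 0.8],
]
points = top_k_points(attn, k=2)
```

A diffuse map spreads its mass over many cells and produces unreliable peaks, which is why concentrating the attention before this step matters.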

2605.14906 2026-05-15 cs.CV Updated version

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See

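Affiliations * CSE Department, HKUST; CUHK; OmniMemory (Shenzhen) Intelligent Technology Co., Ltd.; NVIDIA AI Technology Center (NVAITC), NVIDIA, Santa Clara, USA

AI summary MemLens is a comprehensive benchmark for evaluating multimodal long-term memory in large vision-language models (LVLMs). It covers five memory abilities, including information extraction, multi-session reasoning, and temporal reasoning, and tests models at several context lengths. The study finds that long-context models do well on short conversations but degrade as conversations grow, whereas memory-augmented agents are more stable across lengths but lose visual detail under storage-time compression. Neither approach alone solves multi-session multimodal tasks, which motivates hybrid architectures that combine long-context attention with structured multimodal retrieval.

Comments Work in progress

Details
English abstract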

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.

2605.14894 2026-05-15 cs.CV Updated version

SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer

Zheng Hui, Yunlong Bai

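Affiliations * Baidu Inc.

AI summary This paper proposes SEDiT, a video subtitle erasure method that removes subtitles directly, without a pre-computed mask. Built on a one-step diffusion transformer, its single-stage framework avoids the sub-optimality of conventional two-stage pipelines, and the paper gives a theoretical justification for the feasibility of one-step denoising. For temporal consistency it uses a hybrid training strategy, and it supports efficient processing of native high-definition video.

Comments Project page: http://zheng222.github.io/SEDiT_project

Details
English abstract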

Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.
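Illustration (not from the paper): the link between a small distribution shift and one-step sampling can be seen in a toy rectified flow. If the transport path from input to output is a straight line, the velocity is constant and a single Euler step lands exactly where many small steps would. The toy `SHIFT` vector and function names are hypothetical; a real velocity field is a learned network and is only approximately straight.

```python
def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "localized edit": only the last two coordinates change, the rest stay put.
SHIFT = [0.0, 0.0, 0.0, -2.0, 1.5]

def straight_velocity(x, t):
    # For a perfectly straight path x_t = x_0 + t * SHIFT the velocity is constant.
    return SHIFT

x0 = [1.0, 2.0, 3.0, 4.0, 5.0]
one_step = euler_sample(straight_velocity, x0, steps=1)
many_steps = euler_sample(straight_velocity, x0, steps=8)
```

In this idealized case `one_step` and `many_steps` coincide; the abstract's Lipschitz argument concerns how close a learned field stays to this regime.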

2605.14893 2026-05-15 cs.CV cs.AI cs.LG Updated version

Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek

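Affiliations * Warsaw University of Technology; Centre for Credible Artificial Intelligence; University of Warsaw

AI summary This paper studies the structure of the latent space of contrastively pre-trained vision-language models (VLMs) and finds that the shared latent space contains a large amount of non-semantic, multi-modal noise. Using spectral decomposition of covariance matrices, the authors split the latent space into a semantic signal component and a shared noise subspace, and observe that the noise structure shows strong subgroup invariance across data subsets. Experiments show that removing these noise dimensions has little effect on downstream performance and can even improve it, indicating that much of the latent geometry of modern VLMs is governed by architecture-level noise rather than task-relevant semantics alone.

Details
English abstract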

Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.

2605.14891 2026-05-15 cs.CV Updated version

Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

Isma Hadji, Enrique Sanchez, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

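Affiliations * Samsung AI Center Cambridge, UK; Technical University of Iasi, Romania; Queen Mary University of London, UK

AI summary This paper proposes a multi-scale image super-resolution method based on Visual Auto-Regressive (VAR) modeling. It introduces Hierarchical Image Tokenization (HIT) and a Direct Preference Optimization (DPO) regularization term to address the limitations of earlier methods in scale mapping and model complexity. HIT represents images progressively at different scales while enforcing token overlap across scales, which makes the model more flexible; the DPO term relies only on low-resolution/high-resolution image pairs and steers the model toward higher-quality outputs. Without external training data and with a smaller model, the method achieves state-of-the-art multi-scale super-resolution results.

Comments Accepted for publication at ICML 2026. *Joint first authorship (alphabetical order). arXiv admin note: substantial text overlap with arXiv:2506.04990

Details
English abstract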

We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. They also rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR, HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.
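Illustration (not from the paper): a DPO-style term that prefers the HR output over the LR one can be written with the standard DPO loss, treating the HR token sequence as the preferred sample and the LR sequence as the rejected one. Whether the paper's regularizer uses exactly this form, a reference model, or this `beta` is not stated in the abstract; all argument names are hypothetical.

```python
import math

def dpo_loss(logp_hr, logp_lr, ref_logp_hr, ref_logp_lr, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_hr - ref_logp_hr) - (logp_lr - ref_logp_lr)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: zero margin, loss = log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy assigns more probability to HR and less to LR than the reference: lower loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss only needs log-probabilities of the two sequences under the model and a reference, which matches the abstract's point that the (LR, HR) pair alone suffices as supervision.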

2605.14885 2026-05-15 cs.CV Updated version

Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

Zhuohao Chen, Zeng Li, Yifei Zhang, Chang Liu, Yu Zhou

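Affiliations * Nankai University; Institute of Information Engineering, Chinese Academy of Sciences; Department of Automation and BNRist, Tsinghua University

AI summary Scene text recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes, yet existing methods depend on large amounts of annotated data. This paper proposes Masked Next-Scale Prediction (MNSP), a unified self-supervised framework that jointly learns cross-scale prediction and masked image reconstruction to explicitly model the hierarchical structural evolution of scene text. Its Next-Scale Prediction (NSP) component predicts higher-resolution features from lower-resolution context, and a multi-scale linguistic alignment module maintains semantic consistency; experiments report state-of-the-art performance on several benchmarks.

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Findings Track. 10 pages, 4 figures

Details
English abstract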

Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2% average accuracy on the challenging Union14M benchmark and 96.7% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP

2605.14880 2026-05-15 cs.CV cs.GR cs.LG Updated version

Denoising-GS: Gaussian Splatting with Spatial-aware Denoising

Qingyuan Zhou, Xinyi Liu, Weidong Yang, Ning Wang, Shuquan Ye, Ben Fei, Ying He, Wanli Ouyang

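Affiliations * School of Computer Science and Artificial Intelligence; Fudan University; Software of Engineering; Dalian University of Technology; Department of Information Engineering; The Chinese University of Hong Kong; College of Computing and Data Science; Nanyang Technological University

AI summary This paper proposes Denoising-GS, a method for high-fidelity novel view synthesis. It targets the noise that 3D Gaussian Splatting (3DGS) introduces during optimization because the initial point cloud is sparse and incomplete, and proposes a spatial-aware denoising framework. By considering both the positions and the spatial structure of Gaussian primitives, it designs an optimizer that preserves the spatial optimization flow and a spatial gradient-based denoising strategy that make denoising more coherent and consistent, and it adds uncertainty estimation and spatial coherence refinement for further gains. Experiments show state-of-the-art results on several benchmark datasets.

Details
English abstract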

Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.

2605.14874 2026-05-15 cs.CV Updated version

LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover

Yixin Liu, Baihong Qian, Jinglin Jiang, Jeffery Wu, Yan Chen, Wei Wang, Yida Wang, Lanqing Yang, Guangtao Xue

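Affiliations * Shanghai Jiao Tong University

AI summary Virtual try-on (VTON) aims to generate photorealistic garment images precisely aligned with a person's body and pose. Current diffusion-based methods face a fundamental trade-off between structural integrity and texture fidelity. This paper proposes LPH-VTON, a framework that decouples structure and texture generation within a single continuous denoising process so that the two are optimized together, resolving this tension and achieving a favorable balance between structural alignment and perceptual realism on the standard VITON-HD dataset.

Details
English abstract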

Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person's body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases inherent in prevailing architectures: models heavily reliant on spatial constraints naturally favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors excel at vibrant detail rendering but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a new synergistic framework that resolves this tension within a single, continuous denoising process. LPH-VTON strategically decomposes the generation, leveraging a structure-biased model to establish a geometrically consistent latent scaffold in the early stages, before handing over control to a texture-biased model for high-fidelity detail rendering. Extensive experiments validate our approach. Our model achieves a superior Pareto-optimal balance, establishing new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment across the standard dataset VITON-HD, proving the efficacy of temporal architectural decoupling.
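Illustration (not from the paper): the handover idea, one denoising trajectory with control passing from a structure-biased model to a texture-biased model, can be sketched as a single loop with a switch point. The toy denoisers, the fixed `handover_step`, and the shared latent format are all assumptions; the abstract does not say how or when the actual handover is decided.

```python
def denoise_with_handover(latent, steps, handover_step, structure_model, texture_model):
    """Run one denoising loop, switching models at `handover_step`."""
    used = []
    for step in range(steps):
        # Early steps fix the layout; later steps render detail on that scaffold.
        model = structure_model if step < handover_step else texture_model
        latent = model(latent, step)
        used.append(model.__name__)
    return latent, used

# Toy stand-ins for the two denoisers (hypothetical; real ones are diffusion networks).
def structure_model(latent, step):
    return [0.5 * x for x in latent]

def texture_model(latent, step):
    return [x + 0.1 for x in latent]

final, schedule = denoise_with_handover([8.0], steps=4, handover_step=2,
                                        structure_model=structure_model,
                                        texture_model=texture_model)
```

The key property is that the second model continues from the first model's latent rather than restarting, so the trajectory stays continuous.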

2605.14847 2026-05-15 cs.CV Updated version

SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually-Weighted Super-Resolution Artifact Evaluation

Ivan Molodetskikh, Kirill Malyshev, Mark Mirgaleev, Nikita Zagainov, Evgeney Bogatyrev, Dmitriy Vatolin

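Affiliations * AI Center, Lomonosov Moscow State University; Lomonosov Moscow State University; Innopolis University

AI summary Modern image super-resolution methods produce detailed, visually appealing results but often introduce visual artifacts that degrade perceived quality. This paper proposes artifact prominence as an evaluation target, defined as the fraction of viewers who judge a region to contain a noticeable artifact, and builds the SR-Prominence dataset suite of 3,935 prominence-annotated artifact masks covering several realistic settings. The study finds that classical full-reference quality metrics such as SSIM provide strong localized prominence signals, whereas no-reference methods and specialized artifact detectors generalize poorly; the suite offers a perception-oriented benchmark for super-resolution artifact evaluation.

Details
English abstract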

Modern image super-resolution methods generate detailed, visually appealing results, but they often introduce visual artifacts: unnatural patterns and texture distortions that degrade perceived quality. These defects vary widely in perceptual impact--some are barely noticeable, while others are highly disturbing--yet existing detection methods treat them equally. We propose artifact prominence as an evaluative target, defined as the fraction of viewers who judge a highlighted region to contain a noticeable artifact. We design a crowdsourced annotation protocol and construct SR-Prominence, a dataset suite containing 3,935 artifact masks from DeSRA, Open Images, Urban100, and a realistic no-ground-truth Urban100-HR setting, annotated with prominence. Re-annotating DeSRA reveals that 48.2% of its in-lab binary artifacts are not noticed by a majority of viewers. Across the suite, we audit SR artifact detectors, image-quality metrics, and SR methods. We find that classical full-reference metrics, especially SSIM and DISTS, provide surprisingly strong localized prominence signals, whereas no-reference IQA methods and specialized artifact detectors often fail to generalize across datasets and reference settings. SR-Prominence is released with an objective scoring protocol that allows new metrics to be benchmarked on our suite without further crowdsourcing. Together, the data and protocols enable SR artifact evaluation to move from binary defect presence toward perceptual impact. SR-Prominence is available at https://huggingface.co/datasets/imolodetskikh/sr-artifact-prominence.
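Illustration (not from the paper): the prominence definition above is a simple ratio, and the "not noticed by a majority" statistic follows from thresholding it at one half. The vote encoding and function names below are hypothetical; the actual crowdsourcing protocol may include quality controls not shown here.

```python
def prominence(votes):
    """Fraction of viewers who flagged the highlighted region as a noticeable artifact."""
    if not votes:
        raise ValueError("need at least one vote")
    return sum(1 for v in votes if v) / len(votes)

def noticed_by_majority(votes):
    # A tie is not a majority.
    return prominence(votes) > 0.5

region_votes = [True, False, True, True]  # made-up votes for one artifact mask
score = prominence(region_votes)
```

Keeping the raw fraction instead of a binary label is what lets a metric be scored on how disturbing an artifact is, not only on whether one exists.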

2605.14845 2026-05-15 cs.CV Updated version

Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study

Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia

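Affiliations * BiometricsAI, School of Engineering, Universidad Autonoma de Madrid, Spain

AI summary This paper studies the zero-shot capability of vision-language models (VLMs) for online signature verification, evaluating GPT-5.2 and Gemini 2.5 Pro on the Signature Verification Challenge (SVC) benchmark. Raw kinematic time series are converted into static images, and biometric scores are computed from the models' latent token probabilities. The models perform very well on random forgeries, with GPT-5.2 reaching an Equal Error Rate as low as 0.32% on mobile tasks, but performance drops markedly on the harder skilled-forgery scenario, where chain-of-thought reasoning leads the model to produce kinematic hallucinations.

Comments Accepted at the 14th International Workshop on Biometrics and Forensics

Details
English abstract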

Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical "Rationalization Trap" emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
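Illustration (not from the paper): one plausible way to turn answer-token probabilities into a continuous biometric score is to renormalize the probabilities of the two candidate answers. The abstract does not give the scoring formula, so the two-token renormalization and the argument names below are assumptions.

```python
import math

def biometric_score(logprob_genuine, logprob_forgery):
    """Renormalize two answer-token log-probabilities into P(genuine) in [0, 1]."""
    # Subtract the max before exponentiating to avoid underflow on very negative values.
    m = max(logprob_genuine, logprob_forgery)
    pg = math.exp(logprob_genuine - m)
    pf = math.exp(logprob_forgery - m)
    return pg / (pg + pf)

# Made-up log-probabilities for the two answer tokens.
confident = biometric_score(math.log(0.9), math.log(0.1))
undecided = biometric_score(-5.0, -5.0)
```

A continuous score of this kind, unlike a hard yes/no answer, can be swept over thresholds to compute an Equal Error Rate.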

2605.14843 2026-05-15 cs.CV Updated version

MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

Rahul Jain, Mayank Patel, Asim Unmesh, Karthik Ramani

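Affiliations * School of Electrical and Computer Engineering, Purdue University; School of Mechanical Engineering, Purdue University

AI summary This paper presents MechVerse, a new benchmark for evaluating physical motion consistency in video generation models. It focuses on the failure of current models to satisfy kinematic and geometric constraints when generating videos of mechanical assemblies, for example deforming parts or transmitting motion inconsistently. MechVerse contains a large set of synthetic video clips with structured prompts for evaluating generation under mechanical constraints; experiments show that existing models do well on appearance and smoothness but still fall clearly short of generating mechanically valid motion.

Comments Under Review

Details
English abstract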

Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.

2605.14842 2026-05-15 cs.CV Version update

Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart

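Affiliations * Technion – Israel Institute of Technology; Google Research

AI summary This paper studies how to understand and evaluate abstract intent in image editing. It proposes Entity-Rubrics, an evaluation framework based on atomic entity analysis, and builds AbstractEdit, the first benchmark dataset focused on abstract image editing. The work gives the first formal definition and taxonomy of abstract image editing, and by decomposing editing tasks into entity-level assessments it achieves high correlation with human judgment. Experiments show that existing models face significant challenges in understanding abstract instructions, while combining advanced language-model encoders with iterative reasoning effectively improves performance, pointing to a new direction for more natural multimodal interaction.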

Details
English abstract

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.

2605.14838 2026-05-15 cs.CV cs.MM Version update

Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

Bolin Zhang, Chao Yang, Bin Jiang, Takahiro Komamizu, Ichiro Ide

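Affiliations * College of Computer Science and Electronic Engineering; Faculty of Electronic Engineering and Computer Science; Mathematical and Data Science Center; Graduate School of Informatics

AI summary This paper studies weakly-supervised video moment retrieval (VMR): finding the moment in an untrimmed video that is semantically similar to a given query using only video-level correspondences, without temporal annotations. To address the weaknesses of existing methods in generating high-quality temporal proposals, distinguishing misaligned moments within the same video, and model stability, it proposes MCMT, a new weakly-supervised method based on multi-proposal collaboration and multi-task training. MCMT generates multiple proposals and combines their learnable Gaussian masks into a high-quality positive sample mask, and it adds forward and inverse masked query reconstruction tasks to improve robustness and retrieval performance. Experiments show strong results on two standard datasets.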

Comments 26 pages, 4 figures. Preprint version of the article published in International Journal of Machine Learning and Cybernetics

Journal ref International Journal of Machine Learning and Cybernetics 16, 4509-4524 (2025)

Details
English abstract

This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
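
The mask construction described above can be sketched in a few lines. This is a minimal illustration, not the MCMT implementation: in the paper the Gaussian parameters are learnable and come from the network, whereas here `center` and `width` are plain floats, and combining proposals with an element-wise maximum is an assumed choice (the abstract only says the masks are "combined").

```python
import math


def gaussian_mask(num_clips, center, width):
    """Soft mask over clip indices; center and width are fractions of video length."""
    mask = []
    for i in range(num_clips):
        pos = (i + 0.5) / num_clips   # normalized clip midpoint in (0, 1)
        mask.append(math.exp(-0.5 * ((pos - center) / width) ** 2))
    return mask


def combine_masks(masks):
    """Merge per-proposal masks into one positive mask (element-wise max, assumed)."""
    return [max(values) for values in zip(*masks)]


def negative_masks(positive):
    """Easy negative: clips outside the positive mask. Hard negative: the whole video."""
    easy = [1.0 - p for p in positive]
    hard = [1.0] * len(positive)
    return easy, hard
```

The easy/hard split follows the abstract's wording: other clips in the same video form the easy negative, and the entire video forms the hard negative.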

2605.14832 2026-05-15 cs.RO cs.CV Version update

Learning Direct Control Policies with Flow Matching for Autonomous Driving

Marcello Ceresini, Federico Pirazzoli, Andrea Bertogalli, Lorenzo Cipelli, Filippo D'Addeo, Anthony Dell'Eva, Alessandro Paolo Capasso, Alberto Broggi

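Affiliations * Università degli Studi di Parma; Alma Mater Studiorum - Università di Bologna; VisLab (an Ambarella Inc. company)

AI summary This paper proposes a flow-matching planner for autonomous driving that directly generates executable control trajectories made up of acceleration and curvature. The method is conditioned on a bird's-eye-view (BEV) input and produces control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency, real-time closed-loop re-planning. The model is trained on urban scenarios modeled on real streets of Parma, Italy, collected in a 2D traffic simulator, and tested in closed loop in both in-distribution and markedly out-of-distribution environments. Results show that it maintains stable control and completes tasks in unseen scenarios, which the authors attribute mainly to the robustness of the BEV representation and the flow-matching formulation under distribution shift.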

Comments 16 pages, 6 figures, 2 tables. Accepted at IEEE ITSC 2026

Details
English abstract

We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of Ordinary Differential Equations (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real urban city streets, intersections and roundabouts of the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed-loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at https://marcelloceresini.github.io/DirectControlFlowMatching.
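
Two pieces of the pipeline above lend themselves to a short sketch: few-step ODE integration of a velocity field, and rolling acceleration and curvature controls through a kinematic model. The code is an illustration under stated assumptions: `velocity_fn` stands in for the learned, BEV-conditioned network, the explicit Euler scheme and the unicycle-style rollout are choices made here, and none of the function names come from the paper.

```python
import math


def integrate_flow(velocity_fn, x0, num_steps):
    """Explicit Euler integration of dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def rollout(accels, curvatures, v0=0.0, dt=0.1):
    """Roll a unicycle-style model forward from (acceleration, curvature) controls."""
    x, y, heading, speed = 0.0, 0.0, 0.0, v0
    path = [(x, y, heading, speed)]
    for a, kappa in zip(accels, curvatures):
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        heading += speed * kappa * dt   # yaw rate = speed * curvature
        speed += a * dt
        path.append((x, y, heading, speed))
    return path
```

Fewer integration steps mean fewer network evaluations per plan, which is where the low-latency claim in the abstract comes from.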

2605.14821 2026-05-15 cs.CV Version update

HDRFace: Rethinking Face Restoration with High-Dimensional Representation

Zirui Wang, Xianhui Lin, Yi Dong, Bo Wei, Gangjian Zhang, Siteng Ma, Zebiao Zheng, Xing Liu, Hong Gu, Minjing Dong

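Affiliations * City University of Hong Kong; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd

AI summary Face restoration under complex degradations remains an ill-posed inverse problem with severe information loss. This paper proposes HDRFace, a face restoration framework based on high-dimensional representations that introduces semantically rich priors to improve restoration quality without changing the generative backbone. The method first uses an off-the-shelf restorer to obtain a structurally reliable intermediate result, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from the input and the intermediate result, and injects them into generation as additional conditions. It also proposes a structure-detail aware adaptive fusion mechanism (SDFM) that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity.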

Details
English abstract

Face restoration under complex degradations still remains an ill-posed inverse problem due to severe information loss. Although diffusion models benefit from strong generative priors, most methods still condition only on low-quality inputs, making it difficult to recover identity-critical details under heavy degradations. In this work, we propose HDRFace, a High-Dimensional Representation conditioned Face restoration framework that injects semantically rich priors into the conditional flow without modifying the generative backbone. Our pipeline first obtains a structurally reliable intermediate restoration with an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from both the low-quality input and the intermediate result, and injects them as additional conditions for generation. We further introduce SDFM, a Structure-Detail aware adaptive Fusion Mechanism that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity. To validate the generalization ability of our method, we implement the proposed framework on two generative models, SD V2.1-base and Qwen-Image, and consistently observe stable and coherent performance gains across different architectures.

2605.14819 2026-05-15 cs.CV Version update

The Velocity Deficit: Initial Energy Injection for Flow Matching

Linze Li, Zong-Wei Hong, Shen Zhang, Bo Lin, Jinglun Li, Yao Tang, Jiajun Liang

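Affiliations * Jiiov Technology, Beijing, China

AI summary This paper identifies a phenomenon it calls the Velocity Deficit: in high-dimensional flow matching, the mean squared error (MSE) objective systematically underestimates velocity magnitude, so generated samples fail to reach the data manifold, producing an integration lag. To address this, the authors propose Initial Energy Injection, realized through the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC), and they reveal that velocity contraction has asymmetric effects at the start and end of the trajectory. Experiments show that SSC significantly improves generation quality and speeds up generation on tasks such as ImageNet-1k, and that the methods also apply to text-to-image and high-resolution image generation.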

Comments Accepted by ICML2026

Details
English abstract

While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
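
A toy sketch may help make the two ideas above concrete. It rests on one standard fact, which may or may not match the paper's own analysis: the minimizer of an MSE objective is a conditional mean, and the norm of a mean of vectors never exceeds the mean of their norms, so averaging over disagreeing targets shrinks magnitude. The sampler then multiplies the velocity by a time-dependent scale. The name `scale_fn` and any schedule passed to it are hypothetical; the abstract does not give the form of SSC.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def mean_vector(vectors):
    """Component-wise mean, i.e. what an MSE regressor would predict for these targets."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def euler_sample(velocity_fn, x0, num_steps, scale_fn=lambda t: 1.0):
    """Euler sampler whose velocity is multiplied by scale_fn(t) (hypothetical corrector)."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * scale_fn(t) * vi for xi, vi in zip(x, v)]
    return x
```

With a velocity field that is uniformly 10% too short, the plain sampler stops at 0.9 of the way to the target, and a constant scale of 1/0.9 closes the gap. The reported FID gain can be checked the same way: (13.68 - 7.58) / 13.68 is about 0.446, and 250 / 50 steps gives the 5x figure.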

2605.14815 2026-05-15 cs.CV Version update

Probing into Camera Control of Video Models

Chen Hou, Christian Rupprecht

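Affiliations * Visual Geometry Group, University of Oxford

AI summary This paper studies camera control in video generation models, with the goal of producing geometrically meaningful content. Unlike earlier methods that rely on additional modules and paired data, the authors treat camera control as a form of geometric guidance, implemented by differentiable resampling of latent features during denoising. The method needs no additional training, applies to most video diffusion models, and can be used to probe the camera-control capabilities of base models, revealing shared biases and performance differences among existing models in multi-view generation.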

Details
English abstract

Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.
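
Applying a displacement field by resampling can be sketched as a backward warp with bilinear interpolation. This is an illustration only: it operates on a single 2D grid of floats rather than a multi-channel latent, the border clamping is an assumed choice, and how the paper derives displacement fields from camera motion is not shown. In practice the same operation would be done with an autograd framework so gradients flow through the sampling coordinates.

```python
import math


def bilinear_sample(grid, x, y):
    """Bilinearly sample a 2D grid (list of rows) at continuous (x, y), clamping at borders."""
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(grid, displacement):
    """Backward warp: output[y][x] is the grid sampled at (x + dx, y + dy)."""
    return [
        [
            bilinear_sample(grid, x + displacement[y][x][0], y + displacement[y][x][1])
            for x in range(len(grid[0]))
        ]
        for y in range(len(grid))
    ]
```

A zero displacement field leaves the grid unchanged, and a uniform field of (1, 0) shifts content one cell to the left, repeating the last column at the border.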

2605.14808 2026-05-15 cs.CV Version update

SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track

Lukas Roming, Felix Lehnerer, Jonas V. Funk, Andreas Michel, Georg Maier, Thomas Längle, Jürgen Beyerer

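Affiliations * Fraunhofer IOSB

AI summary This paper proposes SuperADD, a training-free, class-agnostic method for industrial anomaly segmentation that addresses data distribution shifts caused by changing acquisition conditions in production environments. Building on SuperAD, it introduces a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank sampling, and iterative morphological closing, which improve robustness and generalization under varying illumination. Experiments show that SuperADD achieves better segmentation performance than existing methods on the MVTec AD 2 dataset and suits industrial settings that must handle product variants and appearance changes efficiently.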

Comments Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track

Details
English abstract

Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of $62.61\%$, $57.42\%$, and $54.35\%$ on test public, private, and private mixed of MVTec AD 2 respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at https://github.com/LukasRoom/SuperADD.
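
Of the listed modifications, morphological closing is the easiest to show in isolation. The sketch below works on a binary mask with a square window and zero padding. Because closing with a fixed structuring element is idempotent, repeating it unchanged would do nothing after the first pass, so this sketch grows the window on each pass; that schedule is an assumption, and the abstract does not say how SuperADD iterates.

```python
def _pad(mask, radius):
    """Surround the mask with `radius` rows and columns of zeros."""
    width = len(mask[0]) + 2 * radius
    rows = [[0] * radius + list(row) + [0] * radius for row in mask]
    border = [[0] * width for _ in range(radius)]
    return border + rows + [list(r) for r in border]


def _filter(mask, radius, reducer):
    """Square-window filter: reducer=max is dilation, reducer=min is erosion."""
    h, w = len(mask), len(mask[0])
    return [
        [
            reducer(
                mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
                for rr in range(r - radius, r + radius + 1)
                for cc in range(c - radius, c + radius + 1)
            )
            for c in range(w)
        ]
        for r in range(h)
    ]


def closing(mask, radius):
    """Binary closing (dilation then erosion) on a zero-padded copy, cropped back."""
    padded = _pad(mask, radius)
    closed = _filter(_filter(padded, radius, max), radius, min)
    return [row[radius:-radius] for row in closed[radius:-radius]]


def iterative_closing(mask, max_radius):
    """Apply closing with window radius 1, 2, ..., max_radius in turn."""
    for radius in range(1, max_radius + 1):
        mask = closing(mask, radius)
    return mask
```

Small radii fill pinholes inside a detected region, and larger radii bridge wider gaps between nearby fragments, which is one way to get the cleaner, more spatially consistent maps the abstract mentions.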

2605.14795 2026-05-15 cs.CV Version update

COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

Shukun Jia, Shiyu Hu, Yipei Wang, Ximeng Cheng, Yichao Cao, Xiaobo Lu

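Affiliations * School of Automation, Southeast University, Nanjing, China; Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing, China; School of Physical & Mathematical Sciences, Nanyang Technological University, Singapore; Big Data Institute, Central South University, Changsha, China

AI summary This paper studies how to improve the discriminative ability of referring multi-object tracking (RMOT) under sparse semantic supervision. It proposes the COAL framework, which introduces explicit semantic injection and a counterfactual learning strategy to strengthen recognition of complex semantic structures. COAL combines a vision-language model and a large language model in a hierarchical multi-stream integration architecture, mitigating the overfitting and semantic collapse caused by sparse supervision. Experiments show significant gains on multiple benchmark datasets, and on the challenging Refer-KITTI-V2 dataset in particular it surpasses the previous best methods.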

Details
English abstract

Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.

2605.14785 2026-05-15 cs.LG cs.CV Version update

Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning

Alberto Tamajo, Srinandan Dasmahapatra, Rahman Attar

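Affiliations * School of Electronics and Computer Science, University of Southampton

AI summary In class-incremental learning (CIL), neural networks are prone to catastrophic forgetting. Rehearsal-based strategies mitigate it, but studies find that classes are not forgotten to the same degree. This paper systematically analyzes this imbalanced forgetting, proposes three last-layer coefficients that capture different gradient-level sources of interference affecting how much each class is forgotten during an incremental step, and verifies that these coefficients predict per-class forgetting. It also finds that the self-induced interference coefficient is the strongest predictor, offering new ideas and directions for mitigating imbalanced forgetting.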

Comments 37 pages; 24 tables; 7 figures; submitted to a journal

Details
English abstract

Neural networks suffer from catastrophic forgetting in class-incremental learning (CIL) settings. Rehearsal (replaying a subset of past samples) is a well-established mitigation strategy. However, recent results suggest that, despite balanced rehearsal allocation, some classes are forgotten substantially more than others. Despite its relevance, this imbalanced forgetting phenomenon remains underexplored. This work shows that imbalanced forgetting arises systematically and severely in rehearsal-based CIL and investigates it extensively. Specifically, we construct, from a principled analysis, three last-layer coefficients that capture different gradient-level sources of interference affecting each past class during an incremental step. We then demonstrate that, together, they reliably predict how past classes will rank in terms of forgetting at the end of that step. While predictive performance alone does not establish causality, these results support the interpretation of the coefficients as a plausible mechanistic account linking last-layer gradient-level interactions during training to class-level forgetting outcomes. Notably, one coefficient, capturing self-induced interference, emerges as the strongest predictor, with controlled experiments providing evidence consistent with this coefficient being influenced by the new-class interference coefficient. Overall, our findings provide valuable insights and suggest promising directions for mitigating imbalanced forgetting by reducing class-wise disparities in the identified sources of interference.

2605.14781 2026-05-15 cs.CV Version update

MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection

Leon Davies, Qinggang Meng, Mohamad Saada, Baihua Li, Simon Sølvsten

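Affiliations * Department of Computer Science, Loughborough University, Epinal Way, Loughborough LE11 3TU, Leicestershire, United Kingdom; European Center for Risk & Resilience Studies, University of Southern Denmark, Degnevej 14, 6705 Esbjerg, Denmark

AI summary Monocular 3D object detection is challenged by occlusion, truncation, and projection-induced scale-depth ambiguity, especially in unified multi-class settings, where class variability and partial visibility make size estimation less stable. This paper proposes MonoPRIO, which improves unified monocular 3D detection through adaptive prior conditioning in the size pathway. The method builds class-aware size prototypes, routes decoder queries to a soft mixture prior, and introduces uncertainty-aware log-space conditioning and Cluster-Aligned Prior regularization, improving detection accuracy and robustness. Experiments show that MonoPRIO achieves the strongest fully reported unified multi-class result on the KITTI test server and also the strongest 3D bounding-box AP in the car-only setting among compared methods without extra data, with substantially less compute than MonoCLUE.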

Comments 12 pages, 4 figures, 8 tables. Submitted to Pattern Recognition. Code and reproducibility material available at https://github.com/bigggs/MonoPRIO

Details
English abstract

Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at https://github.com/bigggs/MonoPRIO.
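
The routing step above (each decoder query mapped to a soft mixture over size prototypes, handled in log space) can be sketched as follows. Everything here is illustrative: the function names are hypothetical, the prototype numbers in the usage note are made up rather than taken from the paper, and the uncertainty-aware part of the conditioning and the CAP regulariser are left out because the abstract does not specify them.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_prior(routing_logits, prototypes):
    """Blend (h, w, l) size prototypes in log space using softmax routing weights."""
    weights = softmax(routing_logits)
    log_mix = [
        sum(wt * math.log(proto[d]) for wt, proto in zip(weights, prototypes))
        for d in range(3)
    ]
    return weights, log_mix


def decode_size(log_prior, log_residual):
    """Predicted size = exp(prior + residual), so the residual acts multiplicatively."""
    return [math.exp(p + r) for p, r in zip(log_prior, log_residual)]
```

For example, with placeholder prototypes such as `(1.5, 1.6, 3.9)`, `(1.7, 0.6, 0.8)` and `(1.7, 0.6, 1.8)`, confident routing reproduces one prototype, while uniform routing yields their geometric mean per dimension. Working in log space keeps every decoded size positive.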

2605.14772 2026-05-15 cs.CV cs.GR cs.LG Version update

BioHuman: Learning Biomechanical Human Representations from Video

Yujun Huo, He Zhang, Chentao Song, Honglin Song, Zongyu Zuo, Tao Yu

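Affiliations * Beihang University; Tsinghua University

AI summary This work aims to learn biomechanical representations of the human body from video, going beyond conventional kinematic analysis to capture internal biomechanical states such as muscle activity. The authors propose a simulation-based framework that estimates muscle activations from existing motion capture data and use it to build BioHuman10M, a large-scale dataset with synchronized video, motion, and activation information. On top of this dataset they design BioHuman, an end-to-end model that jointly predicts human motion and muscle activations from monocular video. Experiments show strong performance in motion reconstruction and muscle activity prediction, along with good generalization, providing a new benchmark for video-based biomechanical understanding.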

Details
English abstract

Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.

2605.14747 2026-05-15 cs.CL cs.AI cs.CV cs.LG Version update

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian

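Affiliations * National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; The University of Hong Kong; Renmin University of China

AI summary This paper proposes Video2GUI, a fully automated framework that extracts structured GUI interaction trajectories from unlabeled Internet videos, addressing the small scale and narrow domains of current pretraining data for GUI agents. The method uses a coarse-to-fine filtering strategy to select high-quality GUI tutorial videos and converts them into trainable interaction trajectories, producing WildGUI, a large-scale dataset of 12 million trajectories covering more than 1,500 applications and websites. Models pretrained on this dataset improve by 5-20% on multiple GUI grounding and action benchmarks, matching or surpassing the state of the art.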

Comments Accepted at ICML 2026

Details
English abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

2605.14742 2026-05-15 cs.CV cs.RO Version update

EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao, Lap-Pui Chau, Yi Wang

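Affiliations * Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR; Division of Emerging Interdisciplinary Areas (EMIA), The Hong Kong University of Science and Technology, Hong Kong SAR

AI summary This paper proposes EARL, an egocentric analysis-guided reinforcement learning framework that aims to improve reasoning about human-environment interactions and pixel-level grounding accuracy. EARL uses a two-stage parsing structure: it first generates a structured textual description, then produces an answer and a pixel mask in response to the user query, with an analysis-guided feature synthesizer integrating the semantic prior. Experiments show significant gains over existing reinforcement-learning-based methods on pixel-level grounding, along with good generalization.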

Comments Accepted at ICML 2026. Project page: https://github.com/yuggiehk/EARL

Details
English abstract

Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
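
The training signal and the headline metric above can be sketched briefly. GRPO standardizes rewards within a group of responses sampled for the same input, and cIoU is commonly defined as total intersection over total union across the evaluation set. The reward terms and equal weights below are placeholders, not EARL's reward design, and the use of the population standard deviation is an assumed detail that varies between implementations.

```python
import statistics


def combined_reward(answer_score, box_iou, mask_iou, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-output rewards; the terms and weights are placeholders."""
    return weights[0] * answer_score + weights[1] * box_iou + weights[2] * mask_iou


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def cumulative_iou(intersections, unions):
    """cIoU as commonly defined: summed intersection area over summed union area."""
    return sum(intersections) / sum(unions)
```

Because advantages are relative to the group mean, a response is rewarded only for being better than its siblings, which removes the need for a separate value model.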

2605.14733 2026-05-15 cs.CV Version update

Video-Zero: Self-Evolution Video Understanding

Ruixu Zhang, Deyi Ji, Lanyun Zhu, Xuanyi Liu, Yuxin Meng, Ruihang Chu, Yujiu Yang

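Affiliations * Tsinghua University; Tencent; Tongji University; Peking University

AI summary This paper proposes Video-Zero, a self-evolving video understanding framework that aims to improve the reasoning ability of video understanding models without human annotation. Through a Questioner-Solver co-evolution system, the method focuses on temporally localized key evidence in the video, generating evidence-grounded questions and learning to align predictions with that evidence, which yields more effective supervision and training. Experiments show that Video-Zero significantly improves base models across multiple video understanding tasks, confirming its effectiveness and generalization.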

Details
English abstract

Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.

2605.14731 2026-05-15 cs.GR cs.CV cs.SD Version update

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

Xiaoyu Zhan, Xinyu Fu, Chenghao Yang, Xiaohong Zhang, Dongjie Fu, Pengcheng Fang, Tengjiao Sun, Xiaohao Cai, Hansung Kim, Yuanqi Li, Jie Guo, Yanwen Guo

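Affiliations * Nanjing University; Mogo AI Ltd.; University of Southampton

AI summary This paper proposes UMo, a unified sparse motion modeling method for high-fidelity, real-time co-speech avatar animation. UMo processes text, audio, and motion in a unified way, and combines a spatially sparse Mixture-of-Experts framework with a temporally sparse keyframe design to perform efficient real-time dense reconstruction, improving generation quality while maintaining temporal coherence and high fidelity. UMo also adopts a multi-stage training strategy with targeted audio augmentation, improving the precision of speech-motion alignment and semantic consistency and offering a practical solution for real-time co-speech animation.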

Details
English abstract

Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

2605.14727 2026-05-15 cs.CV Version update

CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators

Pengcheng Fang, Hongli Chen, Yuxia Chen, Tengjiao Sun, Jiaxin Liu, Xiaohao Cai

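Affiliations * University of Southampton; University of Queensland; Chengdu University of Technology; University of Illinois Urbana-Champaign

AI summary This paper proposes CHASM, a Cross-frequency Harmonized Axis-Separable Mixer that improves Fourier-transform-based spectral token operators. CHASM shares one learned channel eigenbasis across frequencies and keeps separate positive spectral gains for each frequency, combining cross-frequency alignment of channel directions with local frequency adaptivity. The method performs well on several vision tasks, and experiments indicate that its structured design improves model performance and support cross-frequency harmonization as a useful inductive bias for spectral operators.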

Details
English abstract

Spectral token mixers based on Fourier transforms provide an efficient way to model global interactions in visual feature maps. Existing designs often either apply filter-wise spectral responses along fixed channel axes, or learn adaptive frequency-indexed channel mixing without explicitly aligning the channel directions used across frequencies. We propose CHASM, a Cross-frequency Harmonized Axis-Separable Mixer, as a structured middle ground. CHASM separates what should be shared from what should remain frequency-specific: all frequencies share a learned channel eigenbasis, while each frequency retains its own positive spectral gains. The shared basis makes channel directions comparable across the spectrum, whereas the positive gains preserve local spectral adaptivity. CHASM applies this structured operator separably along the height and width axes and is used as a drop-in replacement mixer inside existing backbones. We provide a structural characterization of the shared-basis operator family and evaluate CHASM through controlled same-backbone comparisons. Across accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, CHASM consistently improves over same-backbone spectral-mixer baselines. Ablations show that removing the shared-basis constraint weakens performance, and randomizing coherent sampling geometry substantially reduces the gain, supporting cross-frequency harmonization as a useful inductive bias for spectral token operators.
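
One way to read the shared-basis structure above is as a per-frequency channel operator of the form U diag(g_k) U^T, where the orthogonal matrix U is shared by all frequencies and the gains g_k are positive and specific to frequency k. The sketch below illustrates that reading on two channels. It is an interpretation, not the CHASM code: the 2x2 rotation standing in for the learned eigenbasis and the softplus used to keep gains positive are both assumptions, and the separable application along height and width is omitted.

```python
import math


def softplus(x):
    """Smooth map from any real number to a positive gain."""
    return math.log1p(math.exp(x))


def rotation_basis(theta):
    """A 2x2 orthogonal matrix standing in for the learned shared channel eigenbasis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def mix_channels(basis, raw_gains, x):
    """Apply U diag(softplus(raw_gains)) U^T to the channel vector x at one frequency."""
    n = len(x)
    coeffs = [sum(basis[i][j] * x[i] for i in range(n)) for j in range(n)]      # U^T x
    scaled = [softplus(g) * c for g, c in zip(raw_gains, coeffs)]               # positive gains
    return [sum(basis[i][j] * scaled[j] for j in range(n)) for i in range(n)]   # U (...)
```

Because U is real and shared, a vector along one of its columns is only rescaled at every frequency, never rotated, which is what makes channel directions comparable across the spectrum. The same function accepts complex Fourier coefficients as `x`.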

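2605.14717 2026-05-15 cs.CV cs.AI Version update

Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

Saqib Nazir, Ardhendu Behera

Affiliations * Department of Computer Science, Edge Hill University, UK

AI summary This study addresses the difficulty of inferring molecular phenotypes directly from bright-field images in label-free single-cell imaging. It proposes a multi-task deep learning framework that performs white blood cell classification and regression of protein-expression levels at the same time. The model uses a hybrid architecture that combines a convolutional neural network with a Transformer, fusing local texture features and global representations through a learnable cross-branch gating module to enable robust joint morpho-molecular inference from differential phase contrast images. Experiments show strong performance on multiple benchmark datasets, opening a route to low-cost hematological analysis without fluorescent staining.

Comments Accepted at the 28th International Conference on Pattern Recognition (ICPR) 2026

Details
Abstract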

Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single-Cell-Phenotyping.

2605.14712 2026-05-15 cs.RO cs.AI cs.CL cs.CV Version update

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen

Affiliations * HUST; ZGCA; ZGCI; HIT; HKUST(GZ); BUAA; ZZU; ECNU; USTC; DeepCybo

AI summary This work targets action conflicts in robot imitation learning that arise from differences in short-horizon intent. It proposes IntentVLA, a history-conditioned vision-language-action (VLA) framework that encodes recent visual observations into a compact short-horizon intent representation used to guide action generation. The authors also build the AliasBench benchmark to evaluate policies under short-horizon observation ambiguity. Experiments show that IntentVLA improves execution stability across multiple tasks and outperforms existing VLA methods.

Comments Code is available at https://github.com/ZGC-EmbodyAI/IntentVLA

Details
Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

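2605.14710 2026-05-15 cs.CV cs.AI Version update

Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu, Guanjie Wang, Yiqing Xia, Ting Xiao

Affiliations * School of Information Science and Engineering, East China University of Science and Technology

AI summary This study addresses insufficient multi-modal fusion in ischemic stroke prognosis prediction by proposing a tri-modal fusion model that integrates medical images, structured clinical data, and unstructured text. The core method uses a large language model to generate semi-structured diagnostic text automatically, easing the scarcity of expert annotations, and designs an alignment-fusion module conditioned on visual features to achieve deep cross-modal interaction and mitigate heterogeneity. Experiments show that the model achieves state-of-the-art predictive performance on real clinical data.

Comments Corresponding author: Ting Xiao

Details
Abstract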

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

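2605.14708 2026-05-15 cs.CV Version update

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

Zeyu Chen, Fangmin Zhao, Yan Shu, Yichao Liu, Liu Yu, Yu Zhou

Affiliations * Nankai University; University of Trento; Institute of Information Engineering, Chinese Academy of Sciences

AI summary StyleTextGen is a style-conditioned framework for multilingual scene text generation. It targets two challenges: accurately extracting text style from complex backgrounds and maintaining fine-grained style consistency across characters. The method introduces a dual-branch style encoder, a text style consistency loss, and a mask-guided generation strategy, improving the perception and replication of multilingual text styles. The authors also build StyleText-CE, the first bilingual scene text style benchmark, and report state-of-the-art results on multiple metrics.

Comments This paper has been accepted to CVPR 2026

Details
Abstract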

Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

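2605.14705 2026-05-15 cs.CV Version update

Towards Continuous Sign Language Conversation from Isolated Signs

Youngmin Kim, Kyobin Choo, Jiwoo Park, Minseo Kim, Chanyoung Kim, Junhyeok Kim, Seong Jae Hwang

Affiliations * Yonsei University; LG Electronics; Emory University

AI summary This work aims to model sign language conversation directly, to better support Deaf and Hard-of-Hearing people who communicate in sign language. Because existing sign language datasets have limited vocabulary and generalize weakly, the authors build SignaVox-W, a large-scale dataset of isolated signs, and use it to generate SignaVox-U, a dataset of continuous sign conversations. A retrieval-guided spoken-to-gloss translation model and the diffusion Transformer BRAID turn isolated motions into continuous conversation. The resulting model, SignaVox, is a direct sign-to-sign conversational model that does not depend on spoken or written language and markedly improves sign generation quality and semantic alignment.

Details
Abstract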

Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.

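2605.14704 2026-05-15 cs.CV cs.AI cs.RO Version update

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

Affiliations * National Taiwan University; Delta Robotics Innovation Center

AI summary In real-world scenes, target objects may lie in regions that are not visible, and current vision-language models (VLMs) still struggle to reason about the locations of such occluded objects. The authors propose the SceneFunRI benchmark, which builds on the SceneFun3D dataset to define a 2D spatial reasoning task with 855 instances; models must locate invisible functional objects from task instructions and commonsense reasoning. Experiments show that even the strongest baseline performs poorly on the task, exposing the weakness of current models in invisible-region reasoning and the need for models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

Details
Abstract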

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

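2605.14703 2026-05-15 cs.CV Version update

Generating HDR Video from SDR Video

SaiKiran Tedla, Francesco Banterle, Trevor Canham, Karanpreet Raja, David B. Lindell, Kiriakos N. Kutulakos, Jiacheng Li, Feiran Li, Daisuke Iso

Affiliations * Sony Research; York University; Vector Institute; University of Toronto

AI summary This paper studies how to generate high dynamic range (HDR) video from standard dynamic range (SDR) video and proposes a solution based on large-scale generative video models. The method introduces a Multi-Exposure Video Model (MEVM) and a learnable Video Merging Model (VMM): from a single nonlinear SDR video input it generates a multi-exposure SDR sequence and merges it into a high-quality HDR video that preserves detail in both shadows and highlights. Experiments show robust HDR conversion on in-the-wild consumer videos and classic films, and the method can be combined with existing SDR generative models to build HDR synthesis pipelines.

Details
Abstract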

The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: sdr2hdrvideo.github.io

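2605.14696 2026-05-15 cs.CV Version update

EponaV2: Driving World Model with Comprehensive Future Reasoning

Jiawei Xu, Zhizhou Zhong, Zhijian Shu, Mingkai Jia, Mingxiao Li, Jia-Wang Bian, Qian Zhang, Kaicheng Zhang, Jin Xie, Jian Yang, Wei Yin

Affiliations * PCA Lab, VCIP, College of Computer Science, Nankai University; Horizon Robotics; HKUST; NJUPT; NTU; Anyverse; School of Intelligence Science and Technology, Nanjing University

AI summary This paper proposes EponaV2, a new driving world model that addresses the reliance of existing autonomous driving systems on large amounts of manually annotated data for trajectory planning. By introducing comprehensive future reasoning, the model predicts future geometry and semantics, improving its understanding of the environment and its planning ability. Inspired by the training of large language models, EponaV2 also introduces a flow matching group relative policy optimization mechanism that further improves planning accuracy, outperforming existing methods on several benchmarks.

Details
Abstract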

Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.

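2605.14689 2026-05-15 cs.CV Version update

Are Candidate Models Really Needed for Active Learning?

Harshini Mridula Mohan, Maanya Manjunath, Vipul Arya, S. H. Shabbeer Basha, Nitin Cheekatla

Affiliations * SoCSE, RV University, Bengaluru, India; School of Engineering and Technology, Vidyashilp University, Bengaluru, India; Dataplex Inc., USA

AI summary This paper asks whether candidate models are really needed in active learning and proposes an active learning approach that requires no initial candidate model. The study uses randomly initialized convolutional neural networks and transformers with confidence-based sampling strategies, and finds that the reduction in labeling burden is comparable to that of conventional methods. Experiments show that the low-confidence sampling strategy performs best in most cases, suggesting a new direction for efficient, flexible active learning.

Comments Accepted for publication in Computer Vision and Image Understanding (CVIU)

Details
Abstract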

Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly when training Convolutional Neural Networks (CNNs) and transformers, which have large numbers of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, current active learning frameworks are time-intensive because they select samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, LC performed best in most of our experiments, showing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
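
As a rough illustration of the three sampling strategies named in the abstract above, the sketch below ranks unlabeled samples by confidence. It assumes that confidence means the largest predicted class probability, which the abstract does not state; the function names and the `switch_round` parameter are hypothetical.

```python
def select_by_confidence(probabilities, budget, strategy="LC"):
    """Return indices of the samples to send for annotation.

    probabilities: one list of class probabilities per unlabeled sample.
    budget:        how many samples to select in this round.
    strategy:      "LC" picks the least confident samples,
                   "HC" picks the most confident ones.
    """
    # Assumption: confidence is the maximum class probability.
    confidence = [max(p) for p in probabilities]
    order = sorted(range(len(confidence)),
                   key=lambda i: confidence[i],
                   reverse=(strategy == "HC"))
    return order[:budget]


def hclc_strategy(round_index, switch_round):
    """HCLC: high confidence in early rounds, low confidence afterwards."""
    return "HC" if round_index < switch_round else "LC"
```

A training loop would call `hclc_strategy` once per round and pass the result to `select_by_confidence`; where to place the switch point is a design choice the abstract leaves open.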

2605.14654 2026-05-15 cs.CV Version update

Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

Tan Pan, Shuhao Mei, Yixuan Sun, Kaiyu Guo, Chen Jiang, Zhaorui Tan, Mengzhu Li, Limei Han, Xiang Zou, Yuan Cheng, Mahsa Baktashmotlagh

Affiliations * Fudan University, China; University of Queensland, Australia; Shanghai Academy of AI for Science, China; Huashan Hospital, National Center for Neurological Disorders, Fudan University, China; Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore

AI summary This study targets multi-modal 3D data in medical imaging and proposes a method that goes beyond instance-level self-supervision, using the topological consistency that anatomical structures maintain across individuals as a supervisory signal. Two alignment strategies, an intra-instance cross-modal triplet objective and inter-instance pseudo-correspondence generation, improve the model's learning of local and global topological structure. Experiments show clear gains on multiple downstream tasks and greater robustness when modalities are missing at test time.

Comments ICML 2026

Details
Abstract

Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.

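2605.14651 2026-05-15 cs.CV Version update

TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

Omkar Oak, Rukmini Nazre, Rujuta Budke, Suraj Sawant

Affiliations * COEP Technological University, Pune, India; University of Massachusetts, Amherst, USA; North Carolina State University, USA

AI summary This paper presents TERRA-CD, a multi-temporal framework for change detection in remote-sensing imagery, covering multi-class and semantic change detection. The study builds a benchmark dataset of 5,221 Sentinel-2 image pairs covering 232 cities in the USA and Europe, with three annotation schemes spanning land cover classification, vegetation change, and semantic change. Several deep learning methods are used to evaluate the dataset's effectiveness for multi-class and semantic change detection, providing a resource for urban vegetation monitoring and environmental change analysis.

Comments Paper presented at 11th International Congress on Information and Communication Technology (ICICT) 2026, London

Details
Abstract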

Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset's effectiveness for both vegetation Multi-class Change Detection as well as Semantic Change Detection. The proposed dataset and methods are available at https://github.com/omkarsoak/TERRA-CD.
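
The class counts in the abstract above fit together arithmetically: with 4 land cover classes there are 4 × 3 = 12 ordered transitions between different classes, and one extra "no change" label gives 13. The sketch below shows one way to number such labels. The specific numbering is an assumption made for illustration and is not taken from the dataset.

```python
def semantic_change_label(before, after, num_classes=4):
    """Map a (before, after) land cover pair to a semantic change label.

    Label 0 means "no change". Labels 1..num_classes*(num_classes-1)
    enumerate ordered transitions between different classes.
    The numbering is hypothetical, not the dataset's own encoding.
    """
    if before == after:
        return 0
    # Skip the diagonal entry so each row has num_classes - 1 slots.
    offset = after if after < before else after - 1
    return 1 + before * (num_classes - 1) + offset


def count_semantic_classes(num_classes=4):
    """Number of distinct labels produced over all class pairs."""
    labels = {semantic_change_label(b, a, num_classes)
              for b in range(num_classes) for a in range(num_classes)}
    return len(labels)
```

Under this reading, `count_semantic_classes(4)` reproduces the 13 classes quoted for the semantic change masks.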

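2605.14645 2026-05-15 cs.CV cs.AI Version update

Vision-Based Water Level and Flow Estimation

ZhiXin Sun

Affiliations * PowerChina Zhongnan Engineering Corporation Limited

AI summary This study proposes an integrated framework that combines advanced vision models with statistical modeling to improve the accuracy of water level detection and flow estimation. By introducing physical priors and robust filtering strategies, it addresses environmental sensitivity, limited precision, and complex on-site calibration. The method retains the automation and interpretability advantages of vision-based approaches while improving their reliability in hydrological monitoring.

Details
Abstract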

With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git

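2605.14641 2026-05-15 cs.CV cs.AI Version update

How to Evaluate and Refine your CAM

Luca Domeniconi, Alessandra Stramiglio, Michele Lombardi, Samuele Salti

Affiliations * University of Bologna

AI summary This study addresses the evaluation and refinement of class attribution maps (CAMs) for convolutional neural networks. It proposes a synthetic dataset that provides ground-truth attributions, enabling a more rigorous comparison of existing evaluation metrics, and a new composite metric, ARCC, that more reliably identifies faithful explanations. To address the low resolution of CAMs, it also introduces RefineCAM, which aggregates CAMs from multiple network layers to produce high-resolution attribution maps; experiments show that it outperforms existing methods under the new metric.

Comments Accepted at ICPR 2026

Details
Abstract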

Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

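2605.14635 2026-05-15 cs.CV cs.AI Version update

MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

Tianwei Chen, Takuya Furusawa, Yuki Hirakawa, Ryotaro Shimizu, Mo Fan, Takashi Wada

Affiliations * ZOZO NEXT Inc.

AI summary This paper presents MultiEmo-Bench, a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating how well multimodal large language models (MLLMs) predict the emotions evoked by images. Existing datasets use single-label annotation, which cannot capture the multiple emotions of varying intensity that an image may evoke. The authors therefore collect annotations from multiple annotators per image, producing a dataset of 10,344 images and 236,998 valid emotion votes. On this dataset they evaluate several mainstream models on dominant emotion prediction and emotion distribution prediction, revealing both the progress and the shortcomings of current MLLMs in emotion understanding.

Details
Abstract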

This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis aimed at MLLM evaluation. We hire 20 annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains 10,344 images with 236,998 valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
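
The vote aggregation described in the abstract above can be sketched as follows. This is a minimal illustration that assumes each annotator returns a list of selected emotions and that the distribution is the share of votes per emotion; the emotion names in the usage note are placeholders, since the abstract does not list the eight categories.

```python
from collections import Counter


def emotion_distribution(annotator_selections, emotions):
    """Turn per-annotator multi-label selections into a vote distribution.

    annotator_selections: one list of selected emotion names per annotator.
    emotions:             the full label set, so unselected emotions get 0.
    """
    counts = Counter(e for selection in annotator_selections for e in selection)
    total = sum(counts[e] for e in emotions)
    if total == 0:
        # No valid votes for this image.
        return {e: 0.0 for e in emotions}
    return {e: counts[e] / total for e in emotions}


def dominant_emotion(distribution):
    """Emotion with the largest share of votes (first one wins on ties)."""
    return max(distribution, key=distribution.get)
```

For three annotators selecting `["joy", "awe"]`, `["joy"]`, and `["sadness"]`, the four votes split so that `joy` receives half and is returned as the dominant emotion.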

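2605.14631 2026-05-15 cs.LG cs.AI cs.CV Version update

Action-Inspired Generative Models

Eshwar R. A., Debnath Pal

Affiliations * Department of Computer Science Engineering, PES University (EC Campus), Bengaluru; Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru

AI summary This paper proposes Action-Inspired Generative Models (AGMs), which address the fact that existing bridge-matching methods give every stochastic transition the same regression weight. The method introduces a lightweight learnable scalar potential $V_\phi$ that scores bridge samples online and modulates the drift objective, selectively penalizing uninformative transport paths and improving generation quality. The design is simple: the potential has only about 1.4% of the drift network's parameter count, adds no inference overhead, and can be dropped into any bridge-matching training pipeline.

Comments 11 pages, 5 figures, and 4 tables

Details
Abstract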

We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_\phi$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier, which prevents adversarial feedback between the two networks whilst preserving $V_\phi$'s guiding signal. Crucially, $V_\phi$ comprises only about 1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_\phi$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.

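2605.14629 2026-05-15 eess.IV cs.CV Version update

Efficient Dense Matching for Enhanced Gaussian Splatting Using AV1 Motion Vectors

Julien Zouein, Vibhoothi Vibhoothi, François Pitié, Anil Kokaram

Affiliations * SigMedia

AI summary This paper proposes an efficient dense matching method based on AV1 motion vectors to improve the initial point cloud for 3D Gaussian Splatting (3DGS). The method uses the motion vectors of the AV1 video codec to avoid the time-consuming exhaustive matching of conventional SfM, reducing computational overhead and increasing point cloud density. Experiments show point clouds with up to eight times as many points as conventional SfM, improving 3DGS reconstruction accuracy and training efficiency.

Details
Abstract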

3D Gaussian Splatting (3DGS) has emerged as a prominent framework for real-time, photorealistic scene reconstruction, offering significant speed-ups over Neural Radiance Fields (NeRF). However, the fidelity of 3DGS representations remains heavily dependent on the quality of the initial point cloud. While standard Structure-from-Motion (SfM) pipelines using COLMAP provide adequate initialisation, they often suffer from high computational costs and sparsity in textureless regions, which degrades subsequent reconstruction accuracy and convergence speed. In this work, we introduce an AV1-based feature detection and matching pipeline that significantly reduces SfM processing overhead. By leveraging motion vectors inherent to the AV1 video codec, we bypass computationally expensive exhaustive matching while maintaining geometric robustness. Our pipeline produces substantially denser point clouds, with up to eight times as many points as classical SfM. We demonstrate that this enhanced initialisation directly improves 3DGS performance, yielding a 9-point increase in VMAF and a 63% average reduction in training time required to reach baseline quality. The project page: https://sigmedia.tv/AV1-3DGS.github.io/

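2605.14626 2026-05-15 cs.CV Version update

UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

Ping Zhou, Haoyu Wang, Mengmeng Zheng, Lei Zhang, Wei Wei, Chen Ding, Fei Zhou

Affiliations * School of Computer Science, Northwestern Polytechnical University; School of Computer Science & Technology, Xi’an University of Posts & Telecommunications; MMLab, The Chinese University of Hong Kong

AI summary RGB-T semantic segmentation requires strictly aligned visible-infrared-label triplets, but such data are often scarce in real-world scenarios. To address this, the paper proposes UniTriGen, a unified triplet generation framework that, guided by text prompts, directly generates visible-infrared-label triplets that are spatially aligned, semantically consistent, and modality-complementary. The method enforces cross-modal consistency through joint encoding in a shared latent space and diffusion modeling, adds lightweight modality-specific adapters to accommodate the imaging characteristics of each modality, and uses a scene-balanced, class-aware few-shot sampling strategy to improve the diversity and quality of generated triplets, yielding performance gains across several RGB-T semantic segmentation models.

Details
Abstract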

RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality-complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.

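2605.14621 2026-05-15 cs.CV cs.AI cs.CL Updated version

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen

Affiliations * Tsinghua University; The University of Sydney; Stanford University; Baichuan AI

AI summary Large vision-language models (LVLMs) tend to hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this by comparing predictions from the original image with those from externally perturbed inputs, but relying on external references can introduce bias and add computational cost. The paper proposes SIRA, a training-free internal contrastive decoding framework that exploits the staged information flow of multimodal transformers to build a counterfactual reference inside the model, suppressing hallucinations while preserving descriptive coverage; it applies to open-weight models.

Details
English abstract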

Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
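
The token-level contrast described in the abstract can be illustrated with a small sketch. The abstract does not state SIRA's scoring rule, so the code below assumes a commonly used contrastive decoding form, (1 + alpha) * log p_full - alpha * log p_reference; the function names and the `alpha` parameter are hypothetical, and the two logit lists stand in for the full pathway and the late-layer image-masked branch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def contrastive_scores(full_logits, reference_logits, alpha=1.0):
    # Assumed scoring rule (not taken from the paper):
    # (1 + alpha) * log p_full - alpha * log p_reference.
    full = log_softmax(full_logits)
    reference = log_softmax(reference_logits)
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, reference)]

def pick_token(full_logits, reference_logits, alpha=1.0):
    """Return the index of the highest-scoring token after contrast."""
    scores = contrastive_scores(full_logits, reference_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)

# Token 0 is strong in both branches (language prior);
# token 1 is strong only when the visual pathway is available.
full_branch = [2.0, 1.9, 0.0]
reference_branch = [2.0, 0.0, 0.0]
```

With `alpha=0` the rule reduces to ordinary greedy decoding on the full branch; a positive `alpha` shifts preference toward tokens whose advantage disappears in the reference branch.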

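2605.14615 2026-05-15 cs.CV Updated version

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

Boying Li, Cheng Zhang, Weirong Chen, Daniel Cremers, Ian Reid, Hamid Rezatofighi

Affiliations * Monash University; Technical University of Munich; Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper presents CalibAnyView, a camera calibration method that achieves robust, geometrically consistent calibration from any number of views, including a single view. The method builds a large-scale multi-view video dataset and designs a multi-view transformer that predicts dense perspective fields, which are combined with a geometric optimization framework to jointly estimate camera intrinsics and gravity direction, outperforming existing methods in real-world scenes. The work provides a reliable foundation for tasks such as 3D reconstruction and robotic perception in the wild.

Comments 44 pages, 25 figures

Details
English abstract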

Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

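2605.14609 2026-05-15 cs.CV cs.LG Updated version

Deep Image Segmentation via Discriminant Feature Learning

Adam Dawid Sztamborski, Raül Pérez-Gonzalo, Antonio Agudo

Affiliations * Institut de Robòtica i Informàtica Industrial, CSIC-UPC; Politechnika Łódzka (Lodz University of Technology)

AI summary This paper addresses unclear boundaries in image segmentation and proposes Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that maximizes between-class variance and minimizes within-class variance to make feature distributions more compact and separable. Experiments show that DDA improves segmentation accuracy, boundary sharpness, and model confidence across multiple architectures, offering a simple and effective way to build more robust segmentation models.

Comments Accepted to ICIP 2026

Details
English abstract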

Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class variance, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
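
The discriminant principle named in the abstract (large between-class variance, small within-class variance) can be sketched as a scatter ratio. This is a minimal illustration under assumptions: the abstract does not give DDA's exact formulation, and the function name, the trace-based ratio, and the `eps` term are hypothetical choices made here.

```python
def discriminant_loss(features, labels, eps=1e-8):
    """Within-class scatter divided by between-class scatter (lower is better).

    Assumed form for illustration only; not the paper's definition of DDA.
    `features` is a list of equal-length vectors, `labels` a list of class ids.
    """
    dim = len(features[0])
    count = len(features)
    global_mean = [sum(f[d] for f in features) / count for d in range(dim)]

    # Group feature vectors by class label.
    groups = {}
    for vector, label in zip(features, labels):
        groups.setdefault(label, []).append(vector)

    within = 0.0
    between = 0.0
    for members in groups.values():
        mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        # Squared distance of each member to its own class mean.
        within += sum(sum((f[d] - mean[d]) ** 2 for d in range(dim)) for f in members)
        # Squared distance of the class mean to the global mean, weighted by class size.
        between += len(members) * sum((mean[d] - global_mean[d]) ** 2 for d in range(dim))
    return within / (between + eps)

separated = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
mixed = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]
class_ids = [0, 0, 1, 1]
```

Compact, well-separated classes give a small ratio, while classes whose members are scattered across both clusters give a large one, which is the behaviour a discriminant loss is meant to penalise.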

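2605.14607 2026-05-15 cs.CV cs.CY Updated version

ViMU: Benchmarking Video Metaphorical Understanding

Qi Li, Xinchao Wang

Affiliations * National University of Singapore

AI summary This paper introduces ViMU, the first benchmark for evaluating video metaphorical understanding, addressing the fact that existing video understanding models focus mainly on literal content and overlook metaphor, irony, and social meaning. Through open-ended and multiple-choice questions, ViMU requires models to infer the implicit meaning of videos from multimodal evidence; the questions are designed to be hint-free so that models must rely on their own understanding. The work adds a new evaluation direction to video understanding and pushes models toward deeper semantic comprehension.

Details
English abstract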

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, namely the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

2605.14606 2026-05-15 cs.CV Updated version

MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting

Chunlei Shi, Cui Wu, Xiang Xu, Hao Li, Ni Fan, Xue Han, Yongchao Feng, Yufeng Zhu, Boyu Liu, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

Affiliations * School of Automation, Southeast University; Nanjing XinDa Institute of Meteorological Science and Technology; Beijing Leninainfo Technology Co., Ltd.; China CEC Engineering Corporation; School of Mathematical Sciences, Tongji University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; College of Meteorology and Oceanography, National University of Defense Technology; Key Laboratory of Transportation Meteorology of China Meteorological Administration, Nanjing Innovation Institute for Atmospheric Sciences

AI summary This paper proposes MambaRain, a multi-scale encoder-decoder framework for 0-3 hour precipitation nowcasting. It combines Mamba's linear-complexity long-range temporal modeling with self-attention's explicit capture of spatial correlations, addressing the performance degradation of existing methods at long horizons. With its hybrid architecture and a spectral loss, MambaRain improves forecast accuracy while remaining computationally efficient, with the largest gains in the difficult 2-3 hour range.

Comments 9 pages, 7 figures

Details
English abstract

Accurate precipitation nowcasting over extended horizons (0-3 hours) is essential for disaster mitigation and operational decision-making, yet remains a critical challenge in the field. Existing deterministic approaches are predominantly constrained to shorter prediction windows (0-2 hours), exhibiting severe performance degradation beyond 90 minutes owing to their inherent difficulty in capturing long-range spatiotemporal dependencies from radar-derived observations. To address these fundamental limitations, we propose MambaRain, a novel multi-scale encoder-decoder architecture that synergistically integrates Mamba's linear-complexity long-range temporal modeling with self-attention mechanisms for explicit spatial correlation capture. The core innovation lies in a hybrid design paradigm wherein Mamba blocks leverage selective state space mechanisms to model global temporal dynamics across extended sequences with computational efficiency, while self-attention modules explicitly characterize spatial correlations within precipitation fields - a capability inherently absent in Mamba's sequential processing paradigm. This complementary synergy enables comprehensive spatiotemporal representation learning, effectively extending the viable forecasting horizon to 2-3 hours with substantial accuracy improvements. Furthermore, we introduce a spectral loss formulation to mitigate blurring artifacts characteristic of chaotic precipitation systems, thereby preserving fine-scale motion details critical for nowcasting accuracy. Experimental validation demonstrates that MambaRain substantially outperforms existing deterministic methodologies in 0-3 hour nowcasting tasks, with particularly pronounced performance gains in the challenging 2-3 hour prediction range.
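
To illustrate why a spectral loss can penalise blurring, the sketch below compares magnitude spectra of a prediction and a target. It is only a toy under assumptions: the abstract does not specify MambaRain's spectral loss, so the 1D signal, the naive DFT, and the mean squared difference of magnitudes are all choices made here, and the function names are hypothetical.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

def spectral_loss(prediction, target):
    # Assumed form: mean squared difference between magnitude spectra.
    pred_mag = dft_magnitudes(prediction)
    target_mag = dft_magnitudes(target)
    return sum((p - q) ** 2 for p, q in zip(pred_mag, target_mag)) / len(pred_mag)

# A sharp alternating pattern and its fully blurred version (same mean value).
sharp_target = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
blurred_prediction = [0.5] * 8
```

The blurred prediction matches the target's mean exactly but has no energy in the highest frequency bin, so the spectral term is large even though the low-frequency content agrees.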

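2605.14601 2026-05-15 cs.CV Updated version

Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach

Kanglin Ning, Yiran Zhao, Wenrui Li, Shaoru Sun, Xingtao Wang, Xiaopeng Fan

Affiliations * Harbin Institute of Technology; The Suzhou Research Institute of HIT; The PengChengLab

AI summary This paper proposes PanoGSDet, a monocular panoramic 3D object detection framework based on continuous semantic Gaussian representations, aimed at the inaccurate mapping of 2D features into 3D space in panoramic images. Using a panoramic depth estimation component and a semantic Gaussian component, the method lifts semantic and depth information from the panorama into 3D semantic Gaussians, then refines them and predicts accurate 3D bounding boxes. Experiments show that it significantly outperforms existing methods on the Structured3D dataset.

Comments Accepted by ICME 2026

Details
English abstract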

Three-dimensional object detection in panoramic imagery is crucial for comprehensive scene understanding, yet accurately mapping 2D features to 3D remains a significant challenge. Prevailing methods often project 2D features onto discrete 3D grids, which break geometric continuity and limit representation efficiency. To overcome this limitation, this paper proposes PanoGSDet, a monocular panoramic 3D detection framework built upon continuous semantic 3D Gaussian representations. The proposed framework comprises a panoramic depth estimation component and a semantic Gaussian component. The panoramic depth estimation component extracts the equirectangular semantic and depth features from the monocular panorama input. The semantic Gaussian component includes a semantic Gaussian lifting module that projects spherical features into 3D semantic Gaussians, a semantic Gaussian optimization module that refines these semantic Gaussians, and a Gaussian guided prediction head that generates 3D bounding boxes from optimized Gaussian representations. Extensive experiments on the Structured3D dataset demonstrate that our method significantly outperforms existing methods.

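2605.14597 2026-05-15 cs.CV cs.CE cs.MM Updated version

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

Chunlei Shi, Hao Li, Yufeng Zhu, Boyu Liu, Yongchao Feng, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

Affiliations * Department of Automation, Southeast University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Key Laboratory of Transportation Meteorology, China Meteorological Administration

AI summary Precipitation nowcasting is an important spatio-temporal prediction task in meteorology, but the chaotic nature of precipitation systems makes it difficult. Existing methods mostly extrapolate from single-source radar data with deterministic or probabilistic models, which suffer from blurring or low computational efficiency. This paper proposes a coarse-to-fine multi-source data fusion framework based on a Vision Mamba Unet and a residual diffusion model (VMU-Diff). It forecasts in two stages: the first fuses radar and multi-band satellite data to predict global motion trends, and the second uses a conditional diffusion model to generate fine details. Experiments show that it outperforms existing state-of-the-art methods in short-term forecasts.

Comments 5 pages, 2 figures

Details
English abstract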

Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, a single deterministic model suffers from blurring due to MSE convergence. A single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from computational inefficiency and from spurious artifacts that compromise accuracy. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision Mamba state-space blocks realize multi-source data fusion and predict the global dynamics of future echoes. The fine-grained stage is realized by a spatio-temporal refinement generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on the coarse prediction and ground truth, and further reconstructs the residual via a conditional Mamba state-space module. Experiments on the Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.

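2605.14594 2026-05-15 cs.CV cs.GR Updated version

TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

Bojun Xiong, Zoubin Bi, Xinghui Peng, Yunmu Wang, Junchen Deng, Jun Liang, Jing Li, Bowen Cai, Huan Fu

Affiliations * HUJING Digital Media & Entertainment Group

AI summary This paper presents TOPOS, a framework for generating high-fidelity 3D head models conditioned on a single image, designed to meet the film, animation, and game industries' need for a unified topology. TOPOS introduces a new variational autoencoder (TOPOS-VAE) and a rectified flow transformer (TOPOS-DiT) to jointly generate geometry and appearance under a fixed industry-standard topology, giving vertex-level correspondence across generated heads. In addition, the TOPOS-Texture module generates relightable UV texture maps from the same portrait image while preserving high-frequency detail. Experiments show that TOPOS achieves state-of-the-art results on 3D head generation.

Comments Technical Report

Details
English abstract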

High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single-image-conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models, which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we propose a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multimodal large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input point clouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE's structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image via fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.

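2605.14590 2026-05-15 cs.CV Updated version

FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology

Fengyi Zhang, Junya Zhang, Wenzhuo Sun

Affiliations * School of Electronic Science and Technology, Hainan University, Haikou, China, 570228; School of Computer Science and Technology, Xidian University, Xi'an, China, 710126; Xiangjiang College of Elite Engineers, Hunan University, Changsha, China, 410082

AI summary In computational pathology, robust whole-slide image analysis remains challenging because staining varies substantially across institutions. Most existing federated domain generalization methods rely on low-order statistics and struggle to capture the non-Gaussian characteristics of real staining processes. This paper proposes FedStain, a federated domain generalization framework that uses higher-order statistics such as skewness and kurtosis as compact stain descriptors, modeling stain variation while preserving privacy and communication efficiency. Experiments show that it clearly outperforms existing methods on multiple benchmark datasets.

Details
English abstract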

Robust whole-slide image (WSI) analysis under strict data-governance remains challenging due to substantial cross-institutional stain heterogeneity. Domain generalization (DG) mitigates these shifts but typically requires centralized data, conflicting with privacy regulations. Federated learning (FedL) provides a decentralized alternative; however, existing FedL and federated DG (FedDG) approaches rely almost exclusively on low-order statistics, assuming Gaussian-like stain distributions. In contrast, real-world staining processes often produce asymmetric, heavy-tailed color distributions due to biochemical diffusion and scanner nonlinearity. Consequently, current methods fail to model the higher-order, non-Gaussian characteristics dominating real-world stain variability. To address this, we propose FedStain, a stain-aware FedDG framework explicitly incorporating higher-order stain moments--skewness and kurtosis--as compact statistical descriptors exchanged during federated optimization. These descriptors require no pixel-level data transmission, preserving strict privacy and communication efficiency, while enabling the global model to capture stain variability missed by low-order statistics. FedStain also employs a contrastive, cross-site parameter aggregation strategy to promote stain-invariant representations without relaxing data constraints. Extensive experiments on Camelyon17 and our new MvMidog-Fed benchmark show FedStain yields consistent improvements, outperforming state-of-the-art FedL, DG, and FedDG baselines by up to +3.9% absolute accuracy. To our knowledge, FedStain is the first FedDG approach to explicitly model higher-order stain statistics, enabling robust cross-institutional deployment in computational pathology.
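
The higher-order moments named in the abstract can be computed from a list of intensity values with a few lines of code. This is a sketch under assumptions: the abstract does not say how FedStain extracts or normalises its descriptors, so the per-channel input, the population (biased) estimators, and the function name are choices made here for illustration.

```python
import math

def stain_moments(values):
    """Return (mean, std, skewness, excess kurtosis) of a list of numbers.

    Uses population estimators; a hypothetical descriptor, not FedStain's code.
    """
    count = len(values)
    mean = sum(values) / count
    variance = sum((v - mean) ** 2 for v in values) / count
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant channel has no shape information.
        return mean, 0.0, 0.0, 0.0
    skewness = sum(((v - mean) / std) ** 3 for v in values) / count
    # Subtract 3 so that a Gaussian distribution has excess kurtosis 0.
    kurtosis = sum(((v - mean) / std) ** 4 for v in values) / count - 3.0
    return mean, std, skewness, kurtosis

symmetric_channel = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed_channel = [1.0, 1.0, 1.0, 1.0, 10.0]
```

Four numbers per channel describe asymmetry and tail weight that mean and standard deviation alone cannot, and no pixel data need to leave the site that computes them.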

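2605.14581 2026-05-15 cs.CV cs.AI cs.IR Updated version

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Ho Hung Lim, Yi Yang

Affiliations * The Hong Kong University of Science and Technology

AI summary This study examines the information loss that can occur when document images are aggregated into a single vector for visual financial document retrieval. Using a diagnostic benchmark built from financial documents, the experiments find that single-vector aggregation makes the vectors of different documents almost identical, hiding key semantic details. The study identifies global texture dominance as the root cause and shows that the effect persists across model scales and optimization strategies, highlighting the risks of single-vector approaches in financial applications.

Comments Accepted to Findings of ACL 2026

Details
English abstract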

Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents into almost identical vectors. Metrics show that the semantic changes are detectable at the patch level, confirming that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
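
The collapse described in the abstract can be reproduced with a toy example. The sketch assumes mean pooling as the aggregation step and two-dimensional patch embeddings, neither of which is stated in the abstract; the function names and the synthetic documents are hypothetical.

```python
import math

def mean_pool(patches):
    """Average a list of patch vectors into a single document vector."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two synthetic documents: 99 identical "layout" patches plus one patch
# that differs (standing in for a single changed digit).
shared_patches = [[1.0, 0.0] for _ in range(99)]
document_a = shared_patches + [[0.0, 1.0]]
document_b = shared_patches + [[0.0, -1.0]]

pooled_similarity = cosine(mean_pool(document_a), mean_pool(document_b))
lowest_patch_similarity = min(cosine(p, q) for p, q in zip(document_a, document_b))
```

The pooled vectors are nearly indistinguishable because the shared patches dominate the average, while the patch-by-patch comparison still exposes the one position that changed.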

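2605.14579 2026-05-15 cs.CV Updated version

Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

Affiliations * School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing

AI summary Medical image segmentation is a foundation of precision medicine, but existing methods still struggle to segment stably and accurately under large variations in tissue appearance, blurred boundaries, and variable anatomy. This paper proposes the Med-DisSeg framework, which introduces a lightweight Dispersive Loss and an adaptive attention mechanism to improve representation learning and anatomical boundary delineation for fine-grained structures. The method widens the margins between sample embeddings, makes the encoder more sensitive to structural features, and uses a multi-scale decoder to preserve local texture and global shape. Experiments show leading segmentation performance on multiple medical imaging datasets.

Details
English abstract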

Accurate medical image segmentation is fundamental to precision medicine, yet robust delineation remains challenging under heterogeneous appearances, ambiguous boundaries, and large anatomical variability. Similar intensity and texture patterns between targets and surrounding tissues often lead to blurred activations and unreliable separation. We attribute these failures to representation collapse during encoding and insufficient fine-grained multi-scale decoding. To address these issues, we propose Med-DisSeg, a dispersion-driven medical image segmentation framework that jointly improves representation learning and anatomical delineation. Med-DisSeg combines a lightweight Dispersive Loss with adaptive attention for fine-grained structure segmentation. The Dispersive Loss enlarges inter-sample margins by treating in-batch hidden representations as negative pairs, producing well-dispersed and boundary-aware embeddings with negligible overhead. Based on these enhanced representations, the encoder strengthens structure-sensitive responses, while the decoder performs adaptive multi-scale calibration to preserve complementary local texture and global shape information. Extensive experiments on five datasets spanning three imaging modalities demonstrate consistent state-of-the-art performance. Moreover, Med-DisSeg achieves competitive results on multi-organ CT segmentation, supporting its robustness and cross-task applicability.
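
Treating in-batch representations as negative pairs can be sketched as a repulsion term over pairwise distances. The abstract does not give the loss formula, so the form below (log of the mean of exp(-squared distance / tau) over distinct pairs) is an assumption, as are the function name and the temperature parameter `tau`.

```python
import math

def dispersive_loss(batch, tau=1.0):
    """Repulsion-only loss over a batch of embeddings (assumed form).

    Lower values mean the embeddings are spread further apart.
    """
    terms = []
    for i, first in enumerate(batch):
        for j, second in enumerate(batch):
            if i == j:
                continue  # Skip self-pairs; every other sample is a negative.
            squared_distance = sum((a - b) ** 2 for a, b in zip(first, second))
            terms.append(math.exp(-squared_distance / tau))
    return math.log(sum(terms) / len(terms))

collapsed_batch = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
dispersed_batch = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
```

A fully collapsed batch gives the maximum value of zero, and the value decreases as samples move apart, so minimising this term discourages representation collapse without needing positive pairs.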

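2605.14569 2026-05-15 cs.CV Updated version

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

Yujie Wei, Chenglong Ma, Jianxiong Gao, Chenhui Wang, Shiwei Zhang, Biao Gong, Shuai Tan, Hangjie Yuan, Hongming Shan

Affiliations * Fudan University; Alibaba Group; Ant Group

AI summary This paper proposes CineNeuron, a hierarchical framework that addresses the semantic gap in reconstructing dynamic video from functional magnetic resonance imaging (fMRI) signals. Inspired by the dual-pathway processing mechanism of the human brain, the method uses a bottom-up semantic enrichment stage to map fMRI signals into a rich semantic space and a top-down memory integration stage to dynamically fuse relevant memories from previously seen data, improving reconstruction quality. Experiments show that CineNeuron outperforms existing state-of-the-art methods on two fMRI-to-video benchmarks.

Comments Accepted to CVPR 2026

Details
English abstract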

Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in the human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.

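2605.14566 2026-05-15 cs.CV Updated version

SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation

Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

Affiliations * School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing

AI summary Medical image segmentation remains challenging when data are scarce: with too few annotations, conventional methods often generalize poorly and produce blurred boundaries. This paper proposes the SpectraFlow framework, which combines structure-aware pretraining with boundary-oriented decoding to improve segmentation accuracy. The method has two stages: the first learns structure-related representations through Mixed-Domain MeanFlow Pretraining, and the second introduces a lightweight decoder that combines attentional fusion with frequency-directional convolution to sharpen boundary detail and improve robustness. Experiments show that it outperforms existing methods on multiple medical datasets, especially in low-data settings.

Details
English abstract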

Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.

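2605.14548 2026-05-15 cs.CV Updated version

Local Spatiotemporal Convolutional Network for Robust Gait Recognition

Xiaoyun Wang, Cunrong Li, Wu Wang

Affiliations * School of Mechanical and Electrical Engineering, Osh State University

AI summary This paper studies how to robustly recognize gait features from video sequences to improve the accuracy and stability of gait recognition. To address the high computational cost of existing methods and their difficulty in capturing the intrinsic motion patterns in consecutive frames, the authors propose the Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet effective network that uses a Global Bidirectional Spatial Pooling mechanism and Local Spatiotemporal Convolutional layers to let standard 2D convolutional networks extract spatiotemporal gait features. The method lowers computational complexity while improving robustness to covariates such as viewpoint changes and clothing variations.

Details
English abstract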

Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly either rely on static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
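
The strip-based decomposition attributed to GBSP can be pictured as pooling each frame along one spatial axis at a time, which turns a T x H x W sequence into two 2D maps (T x H and T x W) that an ordinary 2D convolution can process. The sketch assumes max pooling and single-channel frames stored as nested lists; the abstract specifies neither, and the function name is hypothetical.

```python
def strip_pool(frames):
    """Pool each frame into horizontal and vertical strip descriptors.

    `frames` is a list of T frames, each a list of H rows with W values.
    Returns (horizontal, vertical): a T x H map and a T x W map.
    Max pooling is an assumption made for this illustration.
    """
    # One value per row: pool across the width of each horizontal strip.
    horizontal = [[max(row) for row in frame] for frame in frames]
    # One value per column: pool across the height of each vertical strip.
    vertical = [[max(column) for column in zip(*frame)] for frame in frames]
    return horizontal, vertical

# A single 3 x 2 silhouette frame (1 = foreground, 0 = background).
silhouette_sequence = [[[0, 1], [0, 0], [1, 1]]]
```

After pooling, time becomes one of the two axes of each map, which is what lets a standard 2D kernel mix temporal and spatial context.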

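2605.14534 2026-05-15 cs.CV cs.AI cs.MM Updated version

PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

Fuhao Li, Shaofeng You, Jiagao Hu, Yu Liu, Yuxuan Chen, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan

Affiliations * MiLM Plus, Xiaomi Inc.

AI summary Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many and existing metrics often disagree with human perception. To address this, the paper proposes the RC (Removal Coherence) metrics, RC-S and RC-T, which measure the perceptual coherence of the removed region in the spatial and temporal dimensions respectively, and builds the PROVE-Bench benchmark to support community evaluation. Experiments show that the RC metrics align more closely with human perception than existing methods across a range of image and video benchmarks.

Comments Project Page: https://xiaomi-research.github.io/prove/

Details
English abstract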

Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for the RC metrics and PROVE-Bench is publicly available at: https://github.com/xiaomi-research/prove/.

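2605.14525 2026-05-15 cs.CV Updated version

From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

Ling Li, Changjie Chen, Yuyan Wang, Jiaqing Lyu, Kenglun Chang, Yiyun Chen, Zhidong Deng

Affiliations * Department of Computer Science, THUAI, BNRist, Tsinghua University, Beijing, China; Dalian University of Technology, Dalian, China; Apple, Beijing, China; Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; University of Manchester, Manchester, UK

AI summary In multi-view 3D human pose estimation, conventional methods usually predict the pose at a given moment from images captured at that same moment from different views, overlooking the rich temporal dependencies between adjacent frames. This paper proposes a new input scheme, sparse interleaved input, which captures images from different views at different time points so that the model can exploit rich spatio-temporal information and improve performance. With multiple cameras, the approach can raise the frame rate of the output poses beyond the single-view frame rate and also reduce data redundancy. The study introduces the DenseWarper model, which uses epipolar geometry for efficient spatio-temporal heatmap exchange and achieves state-of-the-art performance on multiple datasets, outperforming conventional dense-input methods.

Details
English abstract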

In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict a pose at a specific moment. While providing accurate spatial information, this traditional approach often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel input method for 3D human pose estimation: the sparse interleaved input. This method leverages images captured from different camera views at various time points (e.g., View 1 at time $t$ and View 2 at time $t+\delta$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, this approach offers two key advantages. First, it can theoretically increase the output pose frame rate by a factor of $N$ with $N$ cameras, thereby breaking through single-view frame rate limitations and enhancing the temporal resolution of the output. Second, by using a sparse subset of available frames, our method can reduce data redundancy while achieving better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, utilizing only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: https://github.com/lingli1724/DenseWarper-ICLR2026
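
The frame-rate claim in this abstract can be illustrated with a small scheduling sketch. It assumes, purely for illustration, that the cameras are staggered evenly within one frame period; the function name `interleaved_schedule` and the even stagger are hypothetical and are not taken from the DenseWarper code.

```python
from fractions import Fraction


def interleaved_schedule(num_cameras, fps, num_frames):
    """Return (time_in_seconds, camera_index) pairs for evenly staggered cameras.

    Hypothetical helper: camera k is delayed by k / (num_cameras * fps) seconds,
    so exactly one view is captured at each sampled instant.
    """
    period = Fraction(1, fps)        # frame period of a single camera
    offset = period / num_cameras    # assumed stagger between neighbouring cameras
    events = []
    for frame in range(num_frames):
        for cam in range(num_cameras):
            events.append((frame * period + cam * offset, cam))
    return events


schedule = interleaved_schedule(num_cameras=4, fps=25, num_frames=2)
# Set of gaps between consecutive capture instants across all cameras.
gaps = {later[0] - earlier[0] for earlier, later in zip(schedule, schedule[1:])}
```

With four 25 fps cameras the only gap is 1/100 s, so a pose could in principle be estimated at 100 Hz, which is the N-fold increase the abstract describes.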

2605.14518 2026-05-15 cs.CV cs.LG Version update

ArcGate: Adaptive Arctangent Gated Activation

Avik Bhattacharya, Siddhant Dnyanesh Gole, Subhasis Chaudhuri, Alejandro C. Frery, Biplab Banerjee

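Affiliations * Microwave Remote Sensing Lab, Center of Studies in Resources Engineering; Centre of Machine Intelligence and Data Science; Department of Electrical Engineering; School of Mathematics and Statistics; Center of Studies in Resources Engineering

AI summary This paper proposes ArcGate, a new adaptive arctangent gated activation function that generates diverse activation shapes through a three-stage non-linear transformation. Unlike conventional fixed-shape activations (such as ReLU and GELU), it has seven learnable parameters per network layer and can adapt its non-linearity to the feature hierarchy and data distribution. Experiments on several remote sensing datasets show ArcGate's advantage, in particular stronger robustness under noise, and reveal how its parameters evolve with network depth, suggesting that ArcGate is a general, adaptive activation function suited to high-resolution earth observation tasks.

Details
English abstract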

Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.

2605.14513 2026-05-15 cs.CV cs.AI Version update

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Xuzhe Zheng, Yuexiao Ma, Jing Xu, Xiawu Zheng, Rongrong Ji, Fei Chao

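Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

AI summary This paper proposes HASTE, a training-free method for accelerating video diffusion, aimed at the efficiency-quality trade-off that existing sparse attention mechanisms face in video generation because of quadratic complexity and fixed thresholds. It introduces a head-wise adaptive framework with two modules, temporal mask reuse and error-guided budget calibration, which reduce mask-prediction overhead and optimize how sparsity is allocated across attention heads. Experiments show that HASTE substantially speeds up inference while preserving video quality.

Details
English abstract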

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
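
To make the top-$p$ idea concrete, here is a minimal sketch of top-$p$ key selection for a single query, and of why one shared threshold fits heterogeneous heads poorly. The helper `top_p_keep`, the two example distributions, and the per-head threshold values are all assumptions for illustration; the paper's calibration, which picks thresholds by minimizing measured output error under a global sparsity budget, is not reproduced here.

```python
def top_p_keep(scores, p):
    """Indices of the smallest set of keys whose attention mass reaches p.

    `scores` is one query's normalised attention distribution for one head.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i]
        if mass >= p:
            break
    return sorted(kept)


# Two hypothetical heads: one sharply peaked, one diffuse.
peaked = [0.70, 0.15, 0.10, 0.05]
diffuse = [0.25, 0.25, 0.25, 0.25]

# A shared threshold applies the same mass target to both heads.
shared = [top_p_keep(head, 0.9) for head in (peaked, diffuse)]
# Per-head thresholds (values made up here) let each head be pruned differently.
per_head = [top_p_keep(peaked, 0.6), top_p_keep(diffuse, 0.5)]
```

Under the shared threshold the diffuse head keeps every key, so it gains nothing from sparsity, while the per-head setting prunes both heads.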

2605.14487 2026-05-15 cs.CV cs.AI Version update

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

Jiahao Tian, Yiwei Wang, Gang Yu, Chi Zhang

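Affiliations * AGI Lab, Westlake University; University of California at Merced; StepFun

AI summary This paper studies error accumulation and context loss in long-horizon autoregressive video generation and proposes Head Forcing, a training-free framework. The method identifies and distinguishes the different functions of attention heads in the diffusion transformer, and assigns tailored key-value cache strategies to the heads responsible for local detail refinement, structural stabilization, and long-range context aggregation, improving generation quality and efficiency. Experiments show that, without added training cost, the method substantially extends video generation length, supports multi-prompt interactive synthesis, and outperforms existing baselines.

Details
English abstract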

Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.
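
A head-wise cache policy of the kind described above might look like the following sketch. The concrete rules are assumptions, since the abstract does not specify them: local heads keep a recent window, anchor heads additionally keep the first few tokens, and memory heads keep a strided subset of the past as a stand-in for the hierarchical memory. The function `kept_positions` and its parameters are hypothetical.

```python
def kept_positions(head_type, seq_len, window=4, num_anchor=2, stride=8):
    """Token positions a head keeps in its KV cache (illustrative policy only)."""
    recent = set(range(max(0, seq_len - window), seq_len))
    if head_type == "local":
        # Assumed: local heads only need the most recent tokens.
        return sorted(recent)
    if head_type == "anchor":
        # Assumed: anchor heads also keep the first few tokens of the sequence.
        return sorted(set(range(min(num_anchor, seq_len))) | recent)
    if head_type == "memory":
        # Assumed: a coarse long-range memory of every `stride`-th past token.
        return sorted(set(range(0, seq_len, stride)) | recent)
    raise ValueError(f"unknown head type: {head_type}")


for head_type in ("local", "anchor", "memory"):
    print(head_type, kept_positions(head_type, seq_len=32))
```

The point of the sketch is only that the cache size, and which tokens survive, can differ per head type instead of being uniform across heads.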

2605.14486 2026-05-15 cs.CV Version update

Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

Yiheng Li, Yang Yang, Zichang Tan, Gao Li, Zhen Lei, Wenhao Wang

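Affiliations * School of Artificial Intelligence, University of Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences; Sangfor Technologies Inc.; China Mobile Financial Technology Co., Ltd.; CAIR, HKSIS, Chinese Academy of Sciences; SCSE, FIE, M.U.S.T, Macau, China; Vast Intelligence Lab, Sydney, Australia

AI summary As misuse of AI-generated images grows, broadly applicable image detection techniques are urgently needed. This paper proposes a GAN-based upsampling method that generates fake images aligned with reconstruction-based ones but with more diverse artifact patterns, compensating for the limited diversity of existing methods. To address the domain shift between the different generation methods, the work introduces the Separate Expert Fusion (SEF) framework, which fuses complementary features through domain-specific expert models and a gating network, markedly improving detection performance and generalization across a range of generative methods.

Comments preprint

Details
English abstract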

As the misuse of AI-generated images grows, generalizable image detection techniques are urgently needed. Recent state-of-the-art (SOTA) methods adopt aligned training datasets to reduce content, size, and format biases, empowering models to capture robust forgery cues. A common strategy is to employ reconstruction techniques, e.g., VAE and DDIM, which show remarkable results in diffusion-based methods. However, such reconstruction-based approaches typically introduce limited and homogeneous artifacts, which cannot fully capture diverse generative patterns, such as those of GAN-based methods. To complement reconstruction-based fake images with aligned yet diverse artifact patterns, we propose a GAN-based upsampling approach that mimics GAN-generated fake patterns while preserving content, size, and format alignment. This naturally results in two aligned but distinct types of fake images. However, due to the domain shift between reconstruction-based and upsampling-based fake images, direct mixed training causes suboptimal results, where one domain disrupts feature learning of the other. Accordingly, we propose a Separate Expert Fusion (SEF) framework to extract complementary artifact information and reduce inter-domain interference. We first train domain-specific experts via LoRA adaptation on a frozen foundational model, then conduct decoupled fusion with a gating network to adaptively combine expert features while retaining their specialized knowledge. Rather than merely benefiting GAN-generated image detection, this design introduces diverse and complementary artifact patterns that enable SEF to learn a more robust decision boundary and improve generalization across broader generative methods. Extensive experiments demonstrate that our method yields strong results across 13 diverse benchmarks. Code is released at: https://github.com/liyih/SEF_AIGC_detection.
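
The gated fusion step can be sketched generically as a softmax-weighted combination of per-expert feature vectors. This is a plain mixture sketch under assumptions: in SEF the gate is a learned network and the experts are LoRA-adapted models, whereas here the gate logits are simply passed in, and the names `softmax` and `fuse` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_features, gate_logits):
    """Convex combination of per-expert feature vectors, weighted by the gate."""
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, expert_features))
        for d in range(dim)
    ]


# Hypothetical features from a reconstruction expert and an upsampling expert.
reconstruction_expert = [1.0, 0.0]
upsampling_expert = [0.0, 1.0]
balanced = fuse([reconstruction_expert, upsampling_expert], [0.0, 0.0])
```

Equal gate logits average the two experts; a strongly skewed gate recovers one expert's features almost unchanged, which is one way specialised knowledge can be retained after fusion.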

2605.14475 2026-05-15 cs.CV Version update

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Jiashun Zhu, Ronghao Fu, Jiasen Hu, Nachuan Xing, Xu Na, Xiao Yang, Zhiwen Lin, Weipeng Zhang, Lang Sun, Zhiheng Xue, Haoran Liu, Weijie Zhang, Bo Yang

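Affiliations * College of Computer Science and Technology, Jilin University, China

AI summary GeoVista is a visually grounded active perception framework for understanding ultra-high-resolution remote sensing images. It targets the tendency of existing methods, when exploring large scenes, to lose global context, revisit regions, or miss key regions. The method builds a global exploration plan, verifies candidate regions across multiple branches, and maintains an explicit evidence state to aggregate and de-duplicate information across regions. GeoVista introduces the APEX-GRO trajectory corpus and an Observe-Plan-Track mechanism, improving semantic understanding and question answering on remote sensing images and achieving state-of-the-art results on several benchmarks.

Details
English abstract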

Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista

2605.14462 2026-05-15 cs.CV Version update

Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

Yubo Zhao, Yujin Chai, Yunao Dong, Chengfeng Zhao, Zijiao Zeng, Yuan Liu, Chi-Keung Tang

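Affiliations * The Hong Kong University of Science and Technology; Tencent IEG

AI summary This paper studies how to reconstruct physically plausible 4D human-object interaction (HOI) animation from monocular video, in support of applications such as 3D content generation and simulation-based learning. To address the shortcomings of existing methods in interaction consistency, contact stability, and physical plausibility, the authors propose the HA-HOI framework, which follows a "human-first, object-follow" strategy: human motion serves as the interaction anchor, the object's trajectory is reconstructed and refined relative to it, and the result is mapped into a physics simulation for validation. The method markedly improves human-object alignment, contact consistency, and simulation readiness on several benchmarks and real videos, moving interaction animation from visually plausible toward physically plausible.

Details
English abstract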

Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce $\textbf{HA-HOI}$, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a $\textit{human-first, object-follow}$ formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, $\textbf{HA-HOI}$ improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/

2605.14461 2026-05-15 cs.CV Version update

ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

Ledun Zhang, Yatu Ji, Xufei Zhuang, Xinying Yao

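Affiliations * Inner Mongolia University of Technology

AI summary ClickRemoval is an open-source interactive tool built on pretrained Stable Diffusion models that tackles object removal in diffusion models. Users only need to click to localize the target object and restore the background; no hand-drawn masks or text descriptions are required. By modulating self-attention during denoising, ClickRemoval achieves efficient, natural-looking removal in complex scenes, and experiments show that it performs well in both quantitative metrics and user studies.

Comments 5 pages, 4 figures. Open-source software paper

Details
English abstract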

Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at https://github.com/zld-make/ClickRemoval under the Apache-2.0 license.

2605.14448 2026-05-15 cs.CV cs.CL cs.IR Version update

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

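Affiliations * Alibaba Group

AI summary This work proposes Think When Needed (TWN), a unified multimodal embedding framework that uses adaptive reasoning to improve the quality and efficiency of multimodal embeddings. TWN adopts a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, reducing parameter overhead and avoiding gradient conflicts. A self-supervised routing gate lets the model decide, per input, whether to generate chain-of-thought (CoT) reasoning, avoiding the performance loss caused by redundant reasoning and substantially lowering inference cost. Experiments show that TWN achieves state-of-the-art embedding quality on the 78 tasks of MMEB-V2 while being more parameter- and inference-efficient than existing generative methods.

Comments 30 pages, preprint

Details
English abstract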

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ a separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

2605.14406 2026-05-15 cs.LG cs.CV Version update

GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

Yuhao Liu, Sadeer Al-Kindi, Ashok Veeraraghavan, Guha Balakrishnan

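Affiliations * Department of Electrical and Computer Engineering, Rice University; Center for Cardiovascular Computational and Precision Health, Department of Cardiology, DeBakey Heart and Vascular Center, Houston Methodist

AI summary GeoViSTA is a multimodal model that combines remote sensing imagery with tabular data to learn unified geospatial representations. It uses bilateral cross-attention to exchange spatial and semantic information between images and tables, and a geography-aware attention mechanism to align image patches with irregular census tracts. Trained with a self-supervised joint masked-reconstruction objective, GeoViSTA markedly improves prediction on key tasks such as disease mortality and fire hazard, demonstrating strong capability for holistic geospatial inference.

Details
English abstract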

Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.

2605.14403 2026-05-15 cs.CV Version update

DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

Yize Liu, Siyuan Yan, Ming Hu, Lie Ju, Xieji Li, Feilong Tang, Wei Feng, Zongyuan Ge

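Affiliations * AIM for Health Lab, Faculty of Information Technology, Monash University, Melbourne, Australia; Faculty of Information Technology, Monash University, Melbourne, Australia; University College London, Institute of Ophthalmology, London, United Kingdom

AI summary DermAgent is a self-reflective agentic system for dermatological image analysis, designed to address the insufficient domain knowledge and hallucinations of existing multimodal large language models in skin disease diagnosis. It integrates seven specialized vision and language modules within a Plan-Execute-Reflect framework to provide traceable diagnostic reasoning, and combines multi-tool collaborative reasoning with external evidence retrieval to improve diagnostic accuracy and reliability. Experiments show that DermAgent performs strongly on several dermatology benchmarks, clearly outperforming existing state-of-the-art models.

Comments MICCAI2026 early acceptance

Details
English abstract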

Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. Third, to further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.
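
A deterministic critic with confidence, coverage, and conflict gates could be structured roughly as below. Everything specific in this sketch is an assumption: the abstract does not define the gates, so the data layout, the thresholds `min_conf` and `min_coverage`, and the function name `critic` are hypothetical and are not DermAgent's implementation.

```python
def critic(predictions, retrieved_labels, min_conf=0.6, min_coverage=0.5):
    """Return the names of the gates that fail; an empty list means the draft passes.

    predictions: {tool_name: (label, confidence)}  (hypothetical structure)
    retrieved_labels: diagnosis labels of the retrieved reference cases
    """
    failed = []
    top_label, top_conf = max(predictions.values(), key=lambda pair: pair[1])

    # Confidence gate: the strongest prediction must be confident enough.
    if top_conf < min_conf:
        failed.append("confidence")

    # Coverage gate: enough retrieved evidence must support the top label.
    support = sum(1 for label in retrieved_labels if label == top_label)
    if not retrieved_labels or support / len(retrieved_labels) < min_coverage:
        failed.append("coverage")

    # Conflict gate: the tools must not disagree with one another.
    if len({label for label, _ in predictions.values()}) > 1:
        failed.append("conflict")

    return failed


agreeing = {"classifier": ("melanoma", 0.9), "captioner": ("melanoma", 0.7)}
print(critic(agreeing, ["melanoma", "melanoma", "nevus"]))
```

Because the checks are rule-based, the same inputs always yield the same verdict, and a non-empty result names which gate should trigger a targeted re-run.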

2605.14399 2026-05-15 cs.CV cs.GR Version update

SceneForge: Structured World Supervision from 3D Interventions

Jizhizi Li, Jiayang Ao, Danny Wicks, Petru-Daniel Tudosiu

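Affiliations * Canva Research

AI summary SceneForge is an intervention-driven framework built on editable 3D world states that generates structured supervision signals which stay consistent under scene edits, viewpoint changes, and scene-level interventions. It applies explicit interventions (such as object removal or camera changes) and propagates their effects on scene structure and physical properties, producing aligned outputs that include counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections. Experiments show that SceneForge improves object removal and scene removal performance across multiple benchmarks, providing a scalable supervision foundation for intervention-consistent multimodal learning.

Details
English abstract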

Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.

2605.14396 2026-05-15 cs.CV cs.CR cs.LG cs.RO Version update

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Jonathan Petit, Z. Berkay Celik, Ryan Gerdes, Ming F. Li

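Affiliations * University of Arizona; Purdue University; Lawrence Livermore National Laboratory

AI summary Autonomous vehicles rely on online high-definition map construction to perceive key road elements such as lane boundaries, dividers, and pedestrian crossings, which directly affect the safety of motion planning. This paper proposes the MIRAGE framework, which uses a conditional diffusion model to systematically discover semantic attacks that bypass adversarial defenses and degrade map predictions, for example by producing plausible environmental variations such as shadows or wet road surfaces. Experiments show that MIRAGE's attacks remain potent under several defenses and that the generated scenes are judged realistic 80-84% of the time, far more often than conventional pixel-level attacks.

Details
English abstract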

Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings -- safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes that neighbor the ground truth and share its road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, where MIRAGE is the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism: MIRAGE passes as realistic 80--84% of the time (vs. 97--99% for clean nuScenes), whereas AdvPatch passes only 0--9% of the time. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

2605.14393 2026-05-15 cs.CV Version update

Analogical Trajectory Transfer

Junho Kim, Eun Sun Lee, Gwangtak Bae, Seunggu Kang, Young Min Kim

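Affiliations * Dept. of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in Artificial Intelligence and INMC, Seoul National University

AI summary This paper studies analogical trajectory transfer, which aims to translate motion trajectories from one 3D environment into another that is semantically similar but differently laid out, enabling machines to perform analogical spatial reasoning. To handle the collisions and geometric distortions caused by differences in object placement, scale, and layout between scenes, the authors propose a method based on scene clustering and hierarchical map prediction that decomposes the problem and combines the sub-problem solutions into semantically consistent and spatially coherent transfers. The method requires no training, runs fast, and outperforms baselines based on large language models and scene graph matching across several application scenarios.

Details
English abstract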

We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can still differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to has a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method does not require training, runs in around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, indicating the broad applicability of our work to AR/VR and robotics.

2605.14391 2026-05-15 cs.CV Version update

Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

Qi Mao, Zijian Wang, Zhengxue Cheng, Lingyu Zhu, Siwei Ma

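Affiliations * School of Information and Communication Engineering and the State Key Laboratory of Media Convergence and Communication, Communication University of China; School of Information Science and Electronic Engineering, Shanghai Jiao Tong University; Department of Computer Science, City University of Hong Kong; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

AI summary This paper studies how to balance reconstruction fidelity and perceptual quality in image compression. Existing methods usually rely on a single latent representation to handle structural details, semantic information, and perceptual priors at once, which creates conflicts between these roles. The authors propose MoDE, a dual-latent collaborative decoding framework that treats the scalar-quantized and vector-quantized latents as a fidelity expert and a perception expert respectively, and coordinates them through Expert-Specific Enhancement and Cross-Expert Modulation modules. Experiments show that the method achieves a better fidelity-perception balance across a wide bitrate range.

Details
English abstract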

Learned image compression (LIC) increasingly requires reconstructions that balance distortion fidelity and perceptual realism across a wide range of bitrates. However, most existing methods still rely on a single compressed latent representation to simultaneously carry structural details, semantic cues, and perceptual priors, requiring the same latent representation to serve multiple, potentially conflicting roles. This tension becomes evident across different latent paradigms: scalar-quantized (SQ) continuous latents provide rate-scalable fidelity but tend to lose perceptual details at low rates, while vector-quantized (VQ) discrete tokens preserve compact semantic cues but suffer from limited structural fidelity and bitrate scalability. To address this issue, we propose Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework that decomposes reconstruction responsibilities across complementary latent paradigms. Specifically, MoDE treats the SQ branch as a fidelity-oriented expert and the VQ branch as a perception-oriented expert, and coordinates them through two decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific expert references, and Cross-Expert Modulation (CEM), which enables selective complementary transfer during reconstruction. The resulting framework supports selective cross-latent collaboration under a shared dual-stream bitstream and enables both fidelity-anchored and perception-anchored decoding. Extensive experiments demonstrate that MoDE achieves a more favorable fidelity-perception balance than representative distortion-oriented, perception-oriented, generative, and dual-latent baselines across a wide bitrate range, highlighting decoder-side expert collaboration as an effective design for wide-range fidelity-perception balanced LIC.

2605.14346 2026-05-15 cs.CV Version update

Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

Yuanhang Yao, Ping Qian, Zhu Liu, Long Ma, Weimin Wang

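Affiliations * School of Software Technology, Dalian University of Technology

AI summary This paper studies how to stabilize infrared small target detection under point supervision. To counter the pseudo-label noise and unstable training caused by the limited semantics of lightweight detectors, it proposes a hierarchical knowledge distillation framework driven by a Vision Foundation Model (VFM). Through a bilevel optimization process combined with Semantic-Conditioned Affine Modulation (SCAM) and a dynamic collaborative learning strategy, the method improves detection accuracy and training stability. Experiments show clear gains across several infrared small target detection models.

Details
English abstract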

Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at https://github.com/yuanhang-yao/semantic-prior.

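2605.14341 2026-05-15 cs.CV Updated version

AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors

Zuopeng Zhao, Ying Liu, Xiaoyu Li, Su Luo, Lu Li, Wenwen Liu

Affiliations * School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology; Mine Digitization Engineering Research Center of the Ministry of Education; Jiangsu Provincial Industrial Technology Engineering Center for Intelligent Sensing

AI summary This paper proposes AnyBand-Diff, a unified framework for remote sensing image generation and band repair. It targets the spectral distortion and radiometric inconsistency that arise when existing diffusion models ignore physical laws in remote sensing imagery. The method introduces a diffusion architecture based on spectral priors that combines a dual stochastic masking strategy with a physics-guided sampling mechanism, recovering complete spectral information from arbitrary band subsets while keeping the generated images radiometrically consistent. Experiments show that AnyBand-Diff performs well at generating reliable remote sensing imagery and at high-accuracy spectral reconstruction, offering a new direction for physics-aware generative models in Earth observation.

Details
English abstract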
Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.

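2605.14337 2026-05-15 cs.CV Updated version

IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

Yifan Chen, Fei Yin, Chunle Guo, Chongyi Li, Yujiu Yang

Affiliations * Tsinghua University; Nankai University

AI summary In complex night scenes, insufficient illumination and multiple coexisting degradations make image restoration difficult. This paper proposes an illumination-guided diffusion model (IG-Diff) whose illumination-guidance module improves restoration in low-light scenes where several degradations coexist. The authors also build complex nighttime scene datasets covering multiple degradation types, providing a resource for related research.

Comments Accepted by CGI-2025

Details
English abstract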
In nighttime circumstances, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle singular forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of weather and low-light conditions. Compounding this challenge, the lack of paired data that encapsulates the coexistence of low-light situations and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we propose an illumination-guided module embedded in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while contending with the adversities posed by various degradations in low-light scenarios.

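2605.14333 2026-05-15 cs.CV Updated version

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Yang Yue, Fangyun Wei, Tianyu He, Jinjing Zhao, Zanlin Ni, Zeyu Liu, Jiayi Guo, Lei Shi, Yue Dong, Li Chen, Ji Li, Gao Huang, Dong Chen

Affiliations * Tsinghua University; Microsoft Research

AI summary This paper studies how to improve text and face quality in autoregressive image generation built on discrete tokenization. The authors observe that conventional tokenizers lose fine-grained structure through aggressive downsampling and quantization, making it hard to preserve readable text and clear facial features. They propose InsightTok, which introduces localized, content-aware perceptual losses to raise text and face fidelity and significantly outperforms existing tokenizers without sacrificing overall reconstruction quality. The gains carry over to the autoregressive image generation model InsightAR, which produces images with clearer text and more realistic facial details.

Comments Code and checkpoints are available at https://github.com/LeapLabTHU/InsightTok

Details
English abstract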
Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.

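2605.14326 2026-05-15 cs.CV Updated version

D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog

Zuopeng Zhao, Ying Liu, Kanyaphakphachsorn Pharksuwan, Su Luo, Xiaoyu Li, Maocai Ning

Affiliations * China University of Mining and Technology

AI summary This paper proposes D2-CDIG, a controllable diffusion framework for remote sensing image generation, aimed at the limited accuracy and naturalness of existing methods under complex terrain and atmospheric conditions. By fusing a Digital Elevation Model (DEM) and cloud-fog information as dual priors, the method gives precise control over ground features and atmospheric phenomena, and it adds an adjustable cloud-fog slider for flexible control of cloud thickness and distribution. Experiments show that D2-CDIG significantly improves image quality, detail richness, and realism over traditional methods, supplying high-quality data for training large remote sensing models and for downstream tasks.

Details
English abstract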
Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.

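2605.14315 2026-05-15 cs.CV Updated version

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

David Huang, Guile Wu, Chengjie Huang, Bingbing Liu, Dongfeng Bai

Affiliations * Huawei Noah’s Ark Lab; University of Toronto; Foundation Model Department, Huawei

AI summary This paper proposes TurboVGGT, a new method for fast multi-view 3D reconstruction. It uses a visual geometry transformer with adaptive alternating attention to substantially improve computational efficiency while preserving reconstruction quality. By combining adaptive sparse global attention with frame attention, TurboVGGT captures global relationships across frames and local details within each frame. Experiments show strong results on multiple 3D reconstruction benchmarks, balancing speed and accuracy.

Comments Technical Report

Details
English abstract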
Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to achieve a balance between reconstruction quality and computational efficiency, which limits their scalability. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an end-to-end trainable framework with adaptive sparse global attention guided by adaptive sparsity selection to capture global relationships across frames and frame attention to aggregate local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, considering that token importance varies across frames, attention layers operate on tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods. Project page: https://turbovggt.github.io/.

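2605.14310 2026-05-15 cs.CV Updated version

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

Ailar Mahdizadeh, Puria Azadi, Muchen Li, Xiangteng He, Leonid Sigal

Affiliations * University of British Columbia; Vector Institute

AI summary In streaming video understanding, efficiently compressing the key-value (KV) cache of vision-language models to support long-horizon reasoning is an important problem. This paper treats KV-cache compression as a coreset selection problem and proposes a method based on geometric coverage and diversity optimization that jointly optimizes the key-space and value-space representations, preserving both retrieval structure and output-relevant information. The method introduces an orthogonality-driven diversity criterion to increase the diversity of the retained cache subset, and experiments show that it outperforms conventional heuristic compression methods across multiple open-source models and video benchmarks.

Details
English abstract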
Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
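
The coreset view in this abstract can be illustrated with a minimal sketch. It assumes plain greedy farthest-point (k-center) selection over per-token vectors, which is a standard coverage heuristic and not necessarily the bicriteria objective or the orthogonality criterion used in CoRDS. The function name `farthest_point_coreset` and the idea of passing one vector per cached token (for example a concatenated key-value vector) are illustrative assumptions.

```python
import math

def farthest_point_coreset(vectors, budget):
    """Greedy k-center selection (hypothetical helper, not the CoRDS algorithm).

    Repeatedly keeps the vector that is farthest from everything kept so far,
    so the retained subset covers the geometry of the whole cache.
    """
    if budget <= 0 or not vectors:
        return []
    selected = {0}  # seed with the first token (an arbitrary choice)
    # Distance from every vector to its nearest selected vector.
    nearest = [math.dist(v, vectors[0]) for v in vectors]
    while len(selected) < min(budget, len(vectors)):
        candidates = [i for i in range(len(vectors)) if i not in selected]
        idx = max(candidates, key=lambda i: nearest[i])
        selected.add(idx)
        for i, v in enumerate(vectors):
            nearest[i] = min(nearest[i], math.dist(v, vectors[idx]))
    return sorted(selected)

# Toy cache: two near-duplicate pairs and one isolated token.
tokens = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(farthest_point_coreset(tokens, 3))  # [0, 2, 4]
```

With a budget of three, the sketch keeps one token from each region of the toy input instead of scoring tokens independently, which is the contrast with token-wise heuristics that the abstract draws.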

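2605.14291 2026-05-15 cs.CR cs.AI cs.CL cs.CV cs.LG Updated version

To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu

Affiliations * School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA; Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA

AI summary With the rapid development of Large Vision-Language Models (LVLMs), unauthorized data scraping and fine-tuning pose serious copyright and privacy risks. This paper proposes MMGuard, which injects human-imperceptible perturbations to create "unlearnable" examples and so proactively defends data against unauthorized LVLM fine-tuning. The method exploits the model's learning dynamics to create an optimization shortcut that makes the model overfit to the noise during training, degrading its performance at inference. MMGuard also introduces a cross-modal binding disruption strategy that strengthens the defense, and it delivers effective, stealthy, and robust protection under multiple threat models.

Details
English abstract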
The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherently post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.

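2605.14278 2026-05-15 cs.CV Updated version

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Ruicheng Zhang, Kaixi Cong, Jun Zhou, Zhizhou Zhong, Zunnan Xu, Shuiyang Mao, Wei Liu, Xiu Li

Affiliations * Tsinghua University; HKUST; Video Rebirth Project

AI summary This paper proposes KVPO, an ODE-native online Group Relative Policy Optimization framework that aligns streaming autoregressive video generators through key-value (KV) semantic exploration. The method moves the source of exploration diversity from random noise to the historical KV cache, constructing generation branches that are semantically diverse yet stay on the data manifold, which improves long-horizon coherence. KVPO also introduces a surrogate policy based on Trajectory Velocity Energy, yielding a reward-weighted contrastive objective fully consistent with the native ODE formulation, and it significantly improves visual quality, motion quality, and text-video alignment across multiple experimental settings.

Details
English abstract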
Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
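
The abstract does not spell out the KVPO objective, but the group-relative step that GRPO-style methods generally share can be sketched as follows. The sketch assumes the common formulation in which each sampled branch's reward is centered and scaled by the statistics of its own group; the function name, the use of the population standard deviation, and the `eps` stabilizer are illustrative assumptions, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of branches sampled for the same prompt.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps). Whether KVPO uses
    exactly this normalization is not stated in the abstract.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; sample std is another common choice
    return [(r - mean) / (std + eps) for r in rewards]

# Three branches generated for one prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Branches that score above the group mean receive positive advantages and the rest receive negative ones, so no separate value network is needed. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.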

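2605.14274 2026-05-15 cs.CV Updated version

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

Zhenyang Ni, Yijiang Li, Ruochen Jiao, Simon Sinong Zhan, Sipeng Chen, Zhenfei Yin, Minshuo Chen, Philip Torr, Zhaoran Wang, Qi Zhu

Affiliations * Northwestern University; University of California, San Diego; University of Oxford

AI summary This paper proposes CreFlow, an online reinforcement learning framework for improving embodied video generation models under sparse rewards. To address the failure of existing video-RL rewards to reflect task logic, it introduces a reward model based on compositional logic constraints that converts task requirements into Linear Temporal Logic constraints, giving more accurate reward signals and localized error information. Through two key designs, a credit-aware NFT loss and a corrective reflow loss, CreFlow improves the efficiency and stability of training for high-dimensional video generation. Experiments show a 23.8 percentage point gain in execution success on bimanual manipulation tasks.

Details
English abstract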
Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

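2605.14269 2026-05-15 cs.CV cs.AI Updated version

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal

Affiliations * UNC Chapel Hill; FieldAI; NTU Singapore; AI2; Johns Hopkins University

AI summary Generating realistic human motion is one of the core challenges in video generation. Because existing reward signals cannot accurately assess motion realism, this paper proposes PhyMotion, a structured motion reward grounded in physics simulation that scores the human motion in generated videos at a fine-grained level along several dimensions: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Experiments show that PhyMotion tracks human judgments more accurately than existing methods and significantly improves motion realism and generation quality in RL-based post-training.

Comments First two authors contributed equally, website: https://phy-motion.github.io/

Details
English abstract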
Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

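2605.14267 2026-05-15 cs.CV cs.AI Updated version

Image Restoration via Diffusion Models with Dynamic Resolution

Yang Zheng, Wen Li, Zhaoqiang Liu

Affiliations * School of Computer Science and Engineering, University of Electronic Science and Technology of China

AI summary This study addresses the high computational cost of diffusion models in image restoration and proposes a restoration approach based on dynamic-resolution diffusion models. Projecting data into lower-dimensional subspaces reduces the computational burden. Adapting existing pixel-space methods yields two new methods, SubDPS and SubDAPS, and an enhanced variant, SubDAPS++, further improves restoration efficiency and quality. Experiments show that the approach outperforms existing diffusion-based restoration methods across multiple datasets and tasks.

Comments Accepted by ICML 2026

Details
English abstract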
Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.

2605.14253 2026-05-15 cs.CV cs.LG Updated version

Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

Harry Robertshaw, Yanghe Hao, Weiyuan Deng, Benjamin Jackson, S. M. Hadi Sadati, Nikola Fischer, Tom Vercauteren, Alejandro Granados, Thomas C. Booth

Affiliations * Surgical & Interventional Engineering, School of Biomedical Engineering & Imaging Sciences, King's College London; School of Engineering & Materials Science, Queen Mary University of London

AI summary This paper aims to develop a real-time catheter tip tracking system for fluoroscopic images to support reinforcement-learning-based autonomous navigation in mechanical thrombectomy. It proposes a multi-threaded processing framework that combines deep learning segmentation models with post-processing algorithms to handle low image contrast, heavy noise, and device occlusion. Experiments show that the method exceeds existing methods in segmentation accuracy, providing a reliable and efficient basis for future autonomous navigation systems.

Comments Harry Robertshaw and Yanghe Hao contributed equally to this work. Published in the International Journal of Computer Assisted Radiology and Surgery

Journal ref Int J CARS (2026)

Details
English abstract
Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually labeled moderate-complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-class segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.

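2605.14251 2026-05-15 cs.CV Updated version

Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images

Aarushi Kulkarni, Alarice Lowe, Pratik Shah

Affiliations * Department of Computer Science, University of California, Irvine, CA, USA; Department of Pathology, Stanford University, Stanford, CA, USA

AI summary This study examines conditional generative adversarial network (cGAN) methods for destaining and restaining digital pathology images and evaluates them on unregistered whole-slide images (WSIs) from a different institution. To reduce the effect of domain shift, it proposes a preprocessing pipeline consisting of histogram-based stain normalization and channel-wise intensity calibration. The results show that the method still restores staining reasonably well without image registration and outperforms direct staining on multiple metrics, confirming the strong influence of preprocessing on model performance.

Details
English abstract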
Conditional generative adversarial networks (cGANs) have enabled high-fidelity computational staining and destaining of hematoxylin and eosin (H&E) in digital pathology whole-slide images (WSI). However, their ability to generalize to out-of-distribution WSI across institutions without retraining remains insufficiently characterized. Previously developed cGAN models trained on 102 registered prostate core biopsy WSIs from Brigham and Women's Hospital were evaluated on 82 spatially unregistered WSIs acquired at Stanford University. To mitigate domain shift without retraining, a preprocessing pipeline consisting of histogram-based stain normalization for H&E-stained WSIs and channel-wise intensity calibration for unstained WSIs was developed. Because image registration was intentionally omitted for real-world deployment conditions, the reported quantitative results are conservative lower bounds reflecting both model performance and limited spatial alignment. Under these conditions, virtual destaining achieved a Pearson correlation coefficient (PCC) of 0.854, structural similarity index measure (SSIM) of 0.699, and peak signal-to-noise ratio (PSNR) of 18.41 dB. H&E restaining from computationally destained outputs outperformed direct staining from ground-truth unstained inputs across all metrics (PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB), suggesting that preprocessing quality may be more limiting than model capacity. Qualitative pathological review indicated preservation of benign glandular structures while showing that malignant glands were often rendered with vessel-like morphologies. These findings support the feasibility of applying cGAN-based computational H&E staining and destaining generative models to external WSI datasets using preprocessing-based adaptation alone while defining specific morphological targets for future domain adaptation.
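
Two of the metrics quoted in this abstract, PCC and PSNR, follow standard definitions that can be sketched in a few lines. The sketch assumes images flattened to equal-length lists of 8-bit intensities; the function names and the `max_value=255.0` default are illustrative assumptions, and the abstract does not say which implementation the authors used. SSIM is omitted because it depends on windowing choices the abstract does not specify.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient between two flattened images."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant image")
    return cov / math.sqrt(var_x * var_y)

def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value assumes 8-bit intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Toy example: the prediction is the reference shifted by a constant offset.
reference = [0.0, 50.0, 100.0, 150.0]
predicted = [10.0, 60.0, 110.0, 160.0]
pcc_value = pearson_cc(reference, predicted)
psnr_value = psnr(reference, predicted)
```

In the toy example the prediction differs from the reference only by a constant brightness offset, so PCC stays at its maximum while PSNR is penalized. Both metrics compare pixels at matching positions, which is why the abstract treats scores on unregistered slides as conservative lower bounds.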

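2605.14239 2026-05-15 cs.CV Updated version

Implicit Spatial-Frequency Fusion of Hyperspectral and LiDAR Data via Kolmogorov-Arnold Networks

Zekun Long, Judy X. Yang, Jing Wang, Ali Zia, Guanyiman Fu, Jun Zhou

Affiliations * School of Information and Communication Technology; School of Computing, Engineering and Mathematical Sciences

AI summary This paper studies the fusion of hyperspectral image (HSI) and LiDAR data to improve classification in complex scenes. Because existing methods fall short in modeling structural discontinuities and spectral features, the authors propose the Implicit Frequency-Geometry Fusion Network (IFGNet), built on Kolmogorov-Arnold Networks (KANs). IFGNet uses learnable spline functions to adaptively capture the highly nonlinear relationships between hyperspectral and LiDAR features and introduces a LiDAR-guided implicit aggregation module in both the spatial and frequency domains to strengthen geometry-aware representations. Experiments show that IFGNet significantly outperforms existing methods on multiple benchmark datasets, with higher classification accuracy and efficiency.

Comments 6 pages, 1 figure, conference

Details
English abstract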
Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen's Kappa, while maintaining an efficient architecture.

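2605.14221 2026-05-15 cs.CV Updated version

Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

Ahmed Rekik, R. Jarrett Rushmore, Sylvain Bouix, Linda Marrakchi-Kacem

Affiliations * École de technologie supérieure (ÉTS); Boston University School of Medicine; Signal and Smart Systems Lab (L3S); National School of Engineering of Tunis, University of Tunis El Manar

AI summary This paper addresses precise segmentation of human subcortical brain structures in magnetic resonance imaging (MRI) and proposes a landmark-guided 3D brain segmentation method. Mimicking the manual segmentation protocol of the Harvard-Oxford Atlas, it uses a Global-to-Local network to detect 16 key landmarks automatically, then combines a semantic segmentation model with a landmark-driven post-processing step to split 12 coarse anatomical labels into 26 distinct structures, significantly improving the consistency and accuracy of segmentation boundaries.

Comments 7 pages, 5 figures. Accepted for presentation at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

Details
English abstract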
Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel-wise deep models often yield anatomically inconsistent results that diverge from expert-defined boundaries. In this research, we propose a landmark-guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard-Oxford Atlas. A Global-to-Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark-driven post-processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.

2605.14191 2026-05-15 cs.CV Version update

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

Zhuojin Li, Hsin-Pai Cheng, Hong Cai, Shizhong Han, Fatih Porikli

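Affiliations * Qualcomm AI Research

AI summary This paper proposes CoReDiT, a structured token pruning framework that improves the computational efficiency of Diffusion Transformers (DiTs) for image and video generation. It uses a linear-time spatial coherence score to assess local redundancy in the latent token lattice and skips highly coherent, redundant tokens in self-attention, while reconstructing the skipped attention outputs by aggregating neighboring retained tokens to keep the representation dense and visually continuous. Experiments show that CoReDiT reduces self-attention computation by up to 55% on several state-of-the-art diffusion models and speeds up inference by 1.33x in the cloud and 1.72x on mobile devices, while maintaining high generation quality and improving on-device memory efficiency.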

Comments 8 pages, 8 figures, CVPR workshop

Journal ref 2026 CVPR Workshop of EDGE

Abstract

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory head-room, enabling higher-resolution generation.
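
The abstract does not define the coherence score, so the following is only a minimal sketch under an assumed definition: a token's coherence is the mean cosine similarity to its 4-connected neighbours on the latent grid, which takes time linear in the number of tokens. The names `coherence_scores` and `select_skipped` and the fixed pruning ratio are hypothetical and are not taken from CoReDiT.

```python
import numpy as np

def coherence_scores(tokens):
    """tokens: (H, W, C) latent grid. Returns (H, W) mean cosine similarity
    to the 4-connected neighbours (an assumed stand-in for the paper's score)."""
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    H, W, _ = unit.shape
    total = np.zeros((H, W))
    count = np.zeros((H, W))
    # Similarity between vertically adjacent tokens, credited to both tokens.
    sim_v = (unit[:-1] * unit[1:]).sum(axis=-1)
    total[:-1] += sim_v
    count[:-1] += 1
    total[1:] += sim_v
    count[1:] += 1
    # Similarity between horizontally adjacent tokens, credited to both tokens.
    sim_h = (unit[:, :-1] * unit[:, 1:]).sum(axis=-1)
    total[:, :-1] += sim_h
    count[:, :-1] += 1
    total[:, 1:] += sim_h
    count[:, 1:] += 1
    return total / np.maximum(count, 1)

def select_skipped(scores, ratio):
    """Boolean mask marking the `ratio` fraction of tokens with the highest coherence."""
    k = int(ratio * scores.size)
    mask = np.zeros(scores.size, dtype=bool)
    if k > 0:
        mask[np.argsort(scores.ravel())[-k:]] = True
    return mask.reshape(scores.shape)
```

Tokens flagged by the mask would be left out of self-attention; how their outputs are rebuilt from retained neighbours is described only at a high level in the abstract and is not sketched here.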

2605.13838 2026-05-15 cs.CV cs.GR cs.LG Version update

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

Zijie Wu, Lixin Xu, Puhua Jiang, Sicong Liu, Chunchao Guo, Xiang Bai

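Affiliations * Huazhong University of Science and Technology; Tencent Hunyuan

AI summary R-DMesh is a video-guided 3D animation method that targets the animation distortion caused by a mismatch between a static mesh and the initial pose of the reference video. Using a conditional variational autoencoder and a Triflow Attention mechanism, it decomposes the input mesh into a base shape, relative motion trajectories, and a pose rectification offset, and automatically aligns the initial pose before animation to produce high-fidelity 4D meshes. The authors also build a large-scale dataset, Video-RDMesh, and experiments show strong results on tasks such as pose retargeting and 4D generation.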

Comments Accepted by SIGGRAPH 2026, Project Page: https://r-dmesh.github.io/ Code URL: https://github.com/Tencent-Hunyuan/R-DMesh

Abstract

Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R-DMesh), a unified framework designed to generate high-fidelity 4D meshes that are "rectified" to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video's initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex-wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow-based Diffusion Transformer conditioned on pre-trained video latents, effectively transferring rich spatio-temporal priors to the 3D domain. To support this task, we construct Video-RDMesh, a large-scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R-DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.

2605.11459 2026-05-15 cs.RO cs.AI cs.CV cs.LG Version update

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

Yanyan Zhang, Chaoda Song, Vikash Singh, Xinpeng Li, Kai Ye, Zhe Hu, Zhongzhu Pu, Yu Yin, Vipin Chaudhary

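Affiliations * Case Western Reserve University; The Hong Kong Polytechnic University; Tsinghua University; InspireOmni AI

AI summary Vision-Language-Action (VLA) models offer strong flexibility and generalization, but most existing models use a single-frame observation paradigm and cannot perceive temporal dynamics, so their performance drops sharply in non-stationary environments. This paper proposes a training-free Pace-and-Path Correction method that applies a closed-form correction to chunked-action VLA models at inference time to compensate for dynamic changes. Starting from a single quadratic cost function, joint optimization yields two orthogonally decomposed channels, one compressing the execution pace and one adjusting the spatial path, which significantly improves task success rates in dynamic environments.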

Abstract

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
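
The orthogonal split into a pace channel and a path channel can be illustrated with a plain vector projection. This is not the paper's closed-form operator, since the abstract does not state the cost function; the sketch only shows how an offset separates into a component along the planned direction (the role of the pace channel) and a component perpendicular to it (the role of the path channel). The function name and its inputs are hypothetical.

```python
import numpy as np

def split_along_and_across(offset, planned_dir):
    """Decompose `offset` into a part parallel to `planned_dir` and a part orthogonal to it."""
    d = np.asarray(planned_dir, dtype=float)
    d = d / np.linalg.norm(d)            # unit vector along the planned motion
    offset = np.asarray(offset, dtype=float)
    along = np.dot(offset, d) * d        # component along the plan ("pace"-like)
    across = offset - along              # perpendicular remainder ("path"-like)
    return along, across
```

By construction the two parts sum to the original offset and have zero dot product, which is the sense in which the two channels do not interfere with each other.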

2605.10496 2026-05-15 cs.CV Version update

M$^2$E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection

Weiqi Yan, Lixin Chen, Xiangrui Hou, Zhipeng Cai, Youbiao Wang, Yangyang Shi, Yu Zang, Cheng Wang

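Affiliations * Fujian Key Laboratory of Urban Intelligent Sensing and Computing, School of Informatics, Xiamen University, Xiamen, China; Meta, Menlo Park, USA

AI summary This paper presents M$^2$E-UAV, the first dataset and benchmark for tiny UAV detection with a moving event camera, addressing the heavy background-event interference and sparse targets that arise when observer and target move simultaneously. The dataset contains synchronized event streams and IMU data and provides UAV foreground labels based on temporal propagation, supporting the evaluation of models built on several representations. Experiments show that existing methods remain substantially limited when faced with sparse targets and dense background events.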

Abstract

Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. Unlike static- or ground-observer event-based UAV detection, onboard UAV-view detection breaks the clean-background assumption because sensor ego-motion can activate dense background events over the entire field of view. To explore this practical problem, we present M$^2$E-UAV, to the best of our knowledge, the first onboard UAV-view motion-on-motion event-based dataset and benchmark for tiny UAV detection, where both the sensing platform and the target UAV are moving. M$^2$E-UAV provides synchronized event streams and IMU measurements collected from an onboard sensing platform, together with event-level UAV foreground labels derived from temporally propagated 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We define a train/validation split and an evaluation protocol for comparing representative existing baselines across event-frame, voxel-grid, and point-set representations, with optional IMU input. The benchmark results show that existing baselines remain limited under sparse tiny-target evidence and dense ego-motion-induced background events. Code and benchmark files will be released at https://github.com/Wickyan/M2E-UAV.

2605.08888 2026-05-15 cs.CL cs.CV Version update

DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

Xiang Feng, Jiawei Zhou, Zhangfeng Huang, Kewei Wang, Shanshan Ye, Jinxin Hu, Zulong Chen, Yong Luo, Jing Zhang

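Affiliations * School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; Alibaba Group, Hangzhou, China; Independent Researcher; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates

AI summary DocScope is a benchmark for evaluating the verifiable reasoning ability of multimodal large language models over long, visually rich documents. It recasts long-document question answering as a structured reasoning trajectory prediction task in which the model outputs evidence pages, supporting regions, relevant factual statements, and a final answer, and it audits the reasoning process with a four-stage evaluation protocol. Experiments show that answer accuracy alone cannot fully assess model reliability, that rates of complete evidence chains are generally low, and that region grounding and cross-document evidence integration are the main current challenges.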

Comments 50 pages, 25 figures, 14 tables

Abstract

Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol (Page Localization, Region Grounding, Fact Extraction, and Answer Verification) that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.

2605.08851 2026-05-15 cs.CV cs.AI cs.LG Version update

Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

Jialin Li, Zhuo Zhang, Yue Cao, Guipeng Lan, Jiabao Wen, Shuai Xiao, Jiachen Yang

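Affiliations * School of Electrical and Information Engineering, Tianjin University, Tianjin, China

AI summary This study addresses the shortage of data for stenosis detection in coronary angiography and proposes a geometrically constrained stenosis editing method based on entropic optimal transport. By modeling localized editing as an entropic optimal transport problem guided by geometric information, the method achieves more precise structural control and image generation. Experiments show that the generated images substantially improve stenosis detection, with relative gains of 27.8% on a public dataset and 23.0% on a multi-center dataset.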

Comments Accepted to ICML 2026

Abstract

The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.

2605.08825 2026-05-15 cs.CV Version update

Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning

Meisen Wang, Hao Deng, Wei Bao, Ma Yuanxiao, Chengjie Wang, Zhiqiang Tian, Shaoyi Du, Siqi Li

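Affiliations * Xi’an Jiaotong University; Tsinghua University; China Mobile System Integration; Inner Mongolia Agricultural University

AI summary For event-based object detection (EOD) with event cameras, this paper proposes a unified detection framework, Ev-DTAD, to address the shortcomings of existing methods at the representation and model levels. It introduces Hierarchical Temporal Aggregation (HTA), which explicitly encodes temporal information at the representation level, and Frequency-aware Hypergraph Temporal Fusion (FHTF), which performs high-order relational reasoning at the model level, to integrate fragmented event responses more effectively. Experiments show that Ev-DTAD achieves higher detection accuracy and efficiency on several datasets.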

Abstract

Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7× faster), 1Mpx/Gen4 (+0.5 mAP and 1.6× faster), and eTraM (+3.0 mAP and 2.0× faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.

2605.08698 2026-05-15 cs.CV cs.LG Version update

Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods

Md Abu Obaida Zishan, Jannatun Noor, Annajiat Alim Rasel

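Affiliations * School of Data and Sciences, BRAC University, Dhaka; Computing for Sustainability and Social Good (C2SG) Research Group, Department of Computer Science and Engineering, United International University, Dhaka

AI summary This paper proposes a training-free method that improves the ability of diffusion models such as Stable Diffusion to generate high-resolution images, using interpolation to scale convolution kernels and thereby resolving the object-duplication artifacts that conventional approaches exhibit at higher resolutions. The authors show mathematically that interpolation correctly scales convolution kernels when multiplied by a constant coefficient, and they report experimental results comparable to existing methods when generating images beyond the training resolution. The method also shows potential for fully connected layers and can reduce the memory footprint of neural network training.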

Comments Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3

Abstract

Stable Diffusion (SD) has evolved DDPM (Denoising Diffusion Probabilistic Model) based image generation significantly by denoising in latent space instead of feature space. This popularized DDPM-based image generation, as the cost and compute barrier was significantly lowered. However, these models can only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images consistently show object duplication artifacts. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved a great level of success. But dilated kernels are harder to fine-tune due to being zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient, and we achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of 2.6% in accuracy and F1-Score relative to the baseline. This shows that our method applies generally: we interpolate fully-connected layers, going beyond convolution layers. We also discuss how our method can reduce the memory footprint of training neural networks by up to at least 4×.
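
As a rough illustration of scaling a kernel by interpolation with a constant coefficient, the sketch below enlarges a 2D kernel by nearest-neighbour interpolation and divides by the square of the scale factor. The choice of nearest-neighbour interpolation and of the coefficient `1 / factor**2` are assumptions made here so that the kernel's sum, and hence its response to a constant input, is unchanged; the abstract does not state which interpolation method or coefficient the paper derives. The function name `upscale_kernel` is hypothetical.

```python
import numpy as np

def upscale_kernel(kernel, factor):
    """Enlarge a 2D kernel by an integer `factor` using nearest-neighbour
    interpolation, then rescale so the sum of the weights is preserved."""
    kernel = np.asarray(kernel, dtype=float)
    # np.kron with a block of ones repeats every weight factor x factor times.
    enlarged = np.kron(kernel, np.ones((factor, factor)))
    # Assumed constant coefficient: each weight now appears factor**2 times.
    return enlarged / factor**2
```

For example, a 3x3 kernel scaled with `factor=2` becomes a dense 6x6 kernel with the same total weight, in contrast to a dilated kernel, which keeps nine nonzero weights separated by zeros.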

2605.04554 2026-05-15 cs.CV Version update

InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

Kaili Zheng, Kaiwen Wang, Xun Zhu, Chenyi Guo, Ji Wu

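Affiliations * Department of Electronic Engineering, Tsinghua University; College of AI, Tsinghua University; Beijing National Research Center for Information Science and Technology

AI summary This paper proposes InterMesh, an end-to-end multi-person human mesh recovery framework that models interactions between humans, their environment, and each other more accurately. Unlike existing DETR-based methods, InterMesh uses a human-object interaction detector to explicitly inject interaction semantics into human mesh recovery, improving pose and shape estimation. The authors design lightweight modules to integrate interaction information efficiently and validate the method on several datasets, with marked gains in scenes with complex interactions.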

Comments 13 pages, 10 figures

Abstract

Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into the human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design two lightweight modules, the Contextual Interaction Encoder and the Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on the 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions. Code and models are released at https://github.com/Kelly510/InterMesh.

2605.02438 2026-05-15 cs.CV cs.LG Version update

Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

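Affiliations * Nanjing University of Science and Technology; Beijing Normal University, Beijing, China; China Academy of Space Technology, Beijing, China

AI summary This paper studies open-set supervised anomaly detection (OSAD), which aims to identify unseen anomalies using limited anomaly supervision. To address the blurred decision boundaries that arise when existing prototype-based methods ignore the multi-modal nature of normal data, it proposes the Mixture Prototype Flow Matching (MPFM) framework, which maps the normal feature distribution to a structured Gaussian mixture prototype space through a continuous transformation. The method models the velocity field with a Gaussian mixture prior and adds a mutual information maximization regularizer to improve prototype separability; experiments show leading performance on a range of benchmarks.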

Comments Accepted by ICML 2026

Abstract

Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.

2605.01725 2026-05-15 cs.CV cs.AI Version update

Motion-Aware Caching for Efficient Autoregressive Video Generation

Jing Xu, Yuexiao Ma, Xuzhe Zheng, Xing Wang, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji, Fei Chao, Songwei Liu

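Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Max Planck Institute for Intelligent Systems; ELLIS Institute Tübingen; Tübingen AI Center

AI summary This paper studies how a motion-aware caching mechanism can make autoregressive video generation more efficient. Existing methods rely on coarse-grained chunk-level cache skipping, which fails to capture pixel-level dynamics and degrades generation quality. The authors propose MotionCache, which uses inter-frame differences as a lightweight proxy for pixel motion and, combined with a coarse-to-fine strategy, substantially speeds up generation while preserving quality. Experiments show speedups of up to 6.28x on several state-of-the-art models with high-quality generation maintained.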

Comments 20 pages

Abstract

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of 6.28× and 1.64×, respectively, while effectively preserving generation quality (VBench: 1%↓ and 0.01%↓, respectively). The code is available at https://github.com/ywlq/MotionCache.
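
The idea of using inter-frame differences as a motion proxy can be sketched as follows. This is an illustrative simplification, not the MotionCache implementation: it averages the absolute difference between two consecutive frames over non-overlapping patches (standing in for tokens) and flags patches whose motion exceeds a threshold as needing a fresh update rather than cache reuse. The function names, the single-channel frames, the patch size, and the threshold are all assumptions.

```python
import numpy as np

def motion_per_token(prev_frame, curr_frame, patch):
    """Mean absolute inter-frame difference over non-overlapping patch x patch blocks.
    prev_frame, curr_frame: (H, W) arrays. Returns an (H // patch, W // patch) array."""
    diff = np.abs(curr_frame.astype(float) - prev_frame.astype(float))
    h, w = diff.shape[0] // patch, diff.shape[1] // patch
    # Crop to a multiple of the patch size, then average inside each block.
    blocks = diff[:h * patch, :w * patch].reshape(h, patch, w, patch)
    return blocks.mean(axis=(1, 3))

def needs_update(motion, threshold):
    """True where a token moved enough that its cached result should not be reused."""
    return motion > threshold
```

A per-token update frequency, as described in the abstract, could then be derived from the motion map instead of a single binary threshold.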

2604.28130 2026-05-15 cs.CV Version update

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Kehong Gong, Zhengyu Wen, Dao Thien Phong, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Guanli Hou, Dongze Lian, Xiaoyu He, Mingyuan Zhang, Hanwang Zhang

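Affiliations * Huawei Technologies Co., Ltd; Central Research Institute

AI summary This paper proposes MoCapAnything V2, an end-to-end motion capture framework for arbitrary skeletons that resolves the ambiguity of the mapping from joint positions to rotations in conventional staged pipelines. By introducing a reference pose-rotation pair from the target asset, it pins down the rotation coordinate system, making rotation prediction more accurate and easier to learn. The method predicts joint positions directly from video without mesh intermediates, improving robustness and efficiency; it substantially reduces rotation error on several datasets and runs about 20x faster at inference than mesh-based methods.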

Comments Project page: https://animotionlab.github.io/MoCapAnythingV2/

Abstract

Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
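
The bone-axis twist ambiguity can be checked numerically: rotating a joint about its own bone axis leaves the child joint's position unchanged, so joint positions alone cannot recover that part of the rotation. The sketch below uses Rodrigues' rotation formula; the bone vector and the angle are arbitrary illustrative values, not data from the paper.

```python
import numpy as np

def axis_angle_matrix(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about `axis` (Rodrigues' formula)."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

bone = np.array([0.0, 1.0, 0.0])       # child offset from its parent joint
twist = axis_angle_matrix(bone, 0.7)   # rotation purely about the bone axis
child_before = bone
child_after = twist @ bone             # child position after applying the twist
```

Here `child_after` equals `child_before` even though `twist` is not the identity, which is why extra information such as a reference pose-rotation pair is needed to fix the rotation.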

2604.09304 2026-05-15 cs.CV Version update

GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

Jiayuan Lu, Rengan Xie, Xuancheng Jin, Zhizhen Wu, Qi Ye, Tian Xie, Hujun Bao, Rui Wang, Yuchi Huo

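Affiliations * State Key Lab of CAD&CG, Zhejiang University; Zhejiang Lab; State Key Laboratory of Industrial Control Technology, Zhejiang University; Zhejiang University

AI summary This paper proposes GeRM, a generative rendering model that aims to bridge the gap between physically-based rendering (PBR) and photorealistic rendering (PRR). By learning a distribution transfer vector (DTV) field and combining a multi-condition ControlNet with a residual perceptual transfer mechanism, the model enables controllable image generation from physically realistic to photorealistic. The work also introduces a multi-agent vision-language framework to build P2P-50K, an expert-guided dataset for supervising the transfer process; experiments show that GeRM outperforms existing state-of-the-art methods across a range of applications.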

Abstract

While physically-based rendering (PBR) simulates light transport that guarantees physical realism, achieving true photorealistic rendering (PRR) demands prohibitive time and labor, and still struggles to capture the intractable richness of the real world. We propose GeRM, the first multimodal generative rendering model to bridge the gap from PBR to PRR (P2P). We formulate this P2P transition by learning a distribution transfer vector (DTV) field to direct the generative process. To achieve this, we introduce a multi-condition ControlNet that synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions. To improve the model's grasp of the image distribution shift driven by text prompts, we propose a residual perceptual transfer mechanism to associate text prompts with corresponding targeted modification regions, which more clearly defines the incremental component updates. To supervise this transfer process, we introduce a multi-agent visual language model framework to construct an expert-guided pairwise transfer dataset, named P2P-50K, where each paired sample corresponds to a specific transfer vector in the DTV field. Extensive experiments demonstrate that GeRM synthesizes high-quality controllable images and outperforms state-of-the-art baselines across diverse applications, including PBR and PRR image synthesis and editing.

2604.08991 2026-05-15 cs.CV cs.AI Version update

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang, Hongxia Xie, Wen-Huang Cheng

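Affiliations * Jilin University; National Taiwan University

AI summary This paper presents PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos, designed to evaluate whether a model can precisely localize a target object in video and describe its position. Built from ScanNet++ and ScanNet200, the dataset contains 1,024 scenes and 10,094 QA pairs covering four tasks of increasing difficulty. Experiments show that mainstream multimodal large language models still have a clear performance gap on this benchmark, and that fine-tuning on PinpointQA substantially improves model performance.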

Abstract

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.

2604.06757 2026-05-15 cs.CV Version update

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang

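Affiliations * University of Electronic Science and Technology of China; Central South University; National University of Singapore; University of Science and Technology of China; Microsoft

AI summary FlowInOne proposes a unified multimodal generation framework that converts information from different modalities, such as textual descriptions, spatial layouts, and editing instructions, into a single visual representation, yielding an image-in, image-out generation pipeline. Using a single flow matching model, the method removes the constraints of cross-modal alignment and task-specific structures, bringing text-to-image generation, layout-guided editing, and visual instruction following under one paradigm. The work also builds the large-scale visual prompt dataset VisPrompt-5M and the evaluation benchmark VP-Bench; experiments show that FlowInOne achieves state-of-the-art performance on multiple tasks, laying a new foundation for fully vision-centric generative modeling.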

Abstract

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space. Our code and models are released on https://csu-jpg.github.io/FlowInOne.github.io/

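2603.18943 2026-05-15 cs.CV Updated version

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao

Affiliations Singapore University of Technology and Design; Nanyang Technological University

AI summary This paper proposes VGGT-360, a training-free framework for zero-shot, geometry-consistent panoramic depth estimation. By exploiting the intrinsic 3D consistency of VGGT-like foundation models, it recasts the task as panoramic reprojection over a 3D model reconstructed from multiple views, unifying fragmented per-view reasoning into a coherent panoramic understanding. VGGT-360 integrates three plug-and-play modules into a unified panorama-to-3D-to-depth framework and outperforms existing trained and training-free methods on multiple indoor and outdoor datasets.

Details
English abstract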

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
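
As a rough illustration of the panorama-to-perspective step, the sketch below maps an equirectangular pixel to a unit viewing ray, which is the basic operation needed before a panorama can be resampled into perspective views. The axis convention (y up, z forward at the image center) and the function name `equirect_pixel_to_ray` are assumptions made for this example. The abstract does not describe VGGT-360's actual projection code or how its uncertainty-guided view allocation is implemented.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction.

    Assumed convention (not taken from the paper): longitude spans
    [-pi, pi) left to right, latitude spans [pi/2, -pi/2] top to bottom,
    y points up and z points forward at the image center.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


# The center pixel of a 1024 x 512 panorama looks straight ahead.
center_ray = equirect_pixel_to_ray(512, 256, 1024, 512)
```

Under this convention the image center maps to the forward axis. A perspective view could then be obtained by sampling the panorama along the rays of a virtual pinhole camera.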

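2603.14851 2026-05-15 cs.CV cs.RO Updated version

AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv

Affiliations Nanyang Technological University, Singapore; Harvard University, US

AI summary This paper proposes AutoMoT, an end-to-end autonomous driving framework that unifies scene understanding and action generation in a single vision-language-action (VLA) model to improve overall driving performance. Its core is an asynchronous mixture-of-transformers (MoT) architecture whose shared attention preserves the reasoning ability of pretrained vision-language models while generating action policies efficiently. Experiments show that AutoMoT performs well on multiple benchmarks and reveal where pretrained models stop being sufficient for autonomous driving tasks.

Details
English abstract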

Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. See https://automot-website.github.io/ for demonstration videos and qualitative results.

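2603.11042 2026-05-15 cs.CV cs.AI cs.LG cs.MM cs.SD Updated version

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

Affiliations UNC Chapel Hill; Adobe Research

AI summary This paper proposes V2M-Zero, a video-to-music generation method that produces music time-aligned with video events without any paired video-music data. It extracts event curves separately from music and from video, capturing how temporal structure changes within each modality, and uses them to synchronize across modalities. Experiments show that V2M-Zero outperforms existing methods on multiple benchmarks, particularly in temporal synchronization and semantic alignment, and allows timing and music style to be controlled independently.

Comments Project page: https://genjib.github.io/v2m_zero/

Details
English abstract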

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-sourced subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
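
The event-curve idea can be illustrated with a minimal sketch. The abstract does not say how intra-modal similarity is turned into a curve, so the definition below is an assumption: given per-frame embeddings from any pretrained encoder, the change at step t is taken as one minus the cosine similarity between consecutive embeddings. The names `cosine_similarity` and `event_curve` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def event_curve(embeddings):
    """Assumed definition: change[t] = 1 - cos(e[t], e[t + 1]).

    The curve depends only on how much consecutive embeddings differ,
    not on what they encode, so the same function applies to music
    embeddings and to video embeddings.
    """
    return [
        1.0 - cosine_similarity(embeddings[t], embeddings[t + 1])
        for t in range(len(embeddings) - 1)
    ]


# Toy embeddings: the content is static, changes abruptly once, then is static.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_curve = event_curve(toy_embeddings)
```

In the toy input the curve is flat except for a single peak at the abrupt change. Because such a curve records when and how much change occurs rather than what changes, a curve computed from video could in principle stand in for one computed from music at inference time, which is the substitution the abstract describes.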

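2603.09921 2026-05-15 cs.CV Updated version

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He

Affiliations ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging; Lingang Laboratory

AI summary This paper proposes WikiCLIP, an efficient contrastive learning framework for open-domain visual entity recognition (VER). It uses large language model embeddings as knowledge-rich entity representations, aligns textual semantics with visual cues at the patch level through a Vision-Guided Knowledge Adaptor (VGKA), and adds a hard negative synthesis mechanism to sharpen fine-grained discrimination. Experiments show that WikiCLIP significantly outperforms existing methods on multiple benchmarks, with a 16% improvement on the OVEN unseen set and inference latency nearly 100 times lower than the leading generative model.

Comments Accepted by CVPR26, codes and weights are publicly available

Details
English abstract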

Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/

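2603.00574 2026-05-15 cs.CV cs.AI Updated version

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Yongbo He, Zirun Guo, Tao Jin

Affiliations Zhejiang University

AI summary Multi-modal test-time adaptation aims to adapt pretrained models to data distributions that keep shifting at test time, but existing methods often suffer negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. This paper proposes DASP, a diagnose-then-mitigate framework. It identifies the biased modality by analyzing differences in interdimensional redundancy between modalities in the unified latent space, then applies an asymmetric adaptation strategy in which each modality's adapter is split into a stable part and a plastic part, serving each modality's distinct need for stability or plasticity. The model can thus adapt flexibly to new domains while preserving generalizable knowledge. Experiments show that DASP significantly outperforms existing methods on multiple multi-modal benchmarks.

Comments Accepted to CVPR 2026

Details
English abstract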

Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
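
The abstract characterizes interdimensional redundancy as strong correlations across feature dimensions but gives no formula. One plausible proxy, assumed here purely for illustration, is the mean absolute Pearson correlation over all pairs of feature dimensions. The function names are hypothetical and this is not presented as DASP's actual diagnostic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def interdimensional_redundancy(features):
    """Mean |correlation| over all distinct pairs of feature dimensions.

    `features` is a list of samples, each a list of D feature values.
    This score is an assumed proxy, not the paper's definition.
    """
    dims = list(zip(*features))  # transpose: one tuple per dimension
    pairs = [(i, j) for i in range(len(dims)) for j in range(i + 1, len(dims))]
    if not pairs:
        return 0.0
    return sum(abs(pearson(dims[i], dims[j])) for i, j in pairs) / len(pairs)


# Second dimension is an exact multiple of the first: fully redundant.
redundant_features = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
# Dimensions vary independently: no linear redundancy.
decorrelated_features = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

redundant_score = interdimensional_redundancy(redundant_features)
decorrelated_score = interdimensional_redundancy(decorrelated_features)
```

Under this proxy, a modality whose score is clearly higher than the other's would be flagged as the biased one. How DASP actually thresholds or compares the two modalities is not stated in the abstract.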

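2602.14068 2026-05-15 cs.CV Updated version

CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi, Lei Zhang

Affiliations The Hong Kong Polytechnic University, Hong Kong; OPPO Research Institute, Shenzhen, China

AI summary CoCoEdit is a content-consistent image editing framework based on region-regularized reinforcement learning. It targets a weakness of existing models: editing the target region often causes unwanted changes in non-target regions. By introducing a pixel-level similarity reward and a region-based regularizer, the method improves both editing quality and content consistency. Experiments show that CoCoEdit achieves editing scores on par with state-of-the-art models on multiple benchmarks while delivering markedly better content consistency.

Comments Accepted by ICML 2026

Details
English abstract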

Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse, high-quality samples are curated as the training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
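
To make the mask-based consistency measurement concrete, the sketch below computes PSNR only over pixels that an editing mask marks as non-edited. PSNR itself is standard. Restricting it to the complement of the mask, along with the name `masked_psnr` and the grayscale nested-list image format, is an assumption about how such a metric could be set up, since the abstract does not give the exact evaluation protocol.

```python
import math


def masked_psnr(source, edited, edit_mask, max_value=255.0):
    """PSNR over pixels where edit_mask == 0 (assumed to mean "not edited").

    Images are nested lists of grayscale values with identical shapes.
    Returns math.inf when the non-edited region is reproduced exactly.
    """
    squared_error = 0.0
    count = 0
    for src_row, out_row, mask_row in zip(source, edited, edit_mask):
        for src, out, masked in zip(src_row, out_row, mask_row):
            if masked == 0:
                squared_error += (src - out) ** 2
                count += 1
    if count == 0:
        raise ValueError("edit_mask marks every pixel as edited")
    mse = squared_error / count
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


source_image = [[10, 10], [10, 10]]
edited_image = [[10, 10], [10, 200]]   # only the bottom-right pixel changed
edit_mask = [[0, 0], [0, 1]]           # ...and the mask says it was meant to
no_mask = [[0, 0], [0, 0]]

consistency_psnr = masked_psnr(source_image, edited_image, edit_mask)
whole_image_psnr = masked_psnr(source_image, edited_image, no_mask)
```

With the mask applied, the intended edit does not count against the score, so an edit that leaves the rest of the image untouched scores as perfectly consistent. Without the mask, the same edit lowers PSNR even though nothing unintended changed.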

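2602.07045 2026-05-15 cs.CV cs.AI Updated version

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, Bo Du

Affiliations School of Computer Science, Wuhan University

AI summary To advance multimodal large language models in remote sensing, the authors propose VLRS-Bench, the first vision-language reasoning benchmark dedicated to complex remote sensing reasoning. Built around three core dimensions (cognition, decision, and prediction), it contains 2,000 question-answer pairs covering 14 tasks and up to eight temporal phases, and is designed to evaluate complex reasoning in remote sensing scenarios. By incorporating remote-sensing-specific priors and expert knowledge, VLRS-Bench strengthens the geospatial realism and reasoning difficulty of its tasks and exposes significant bottlenecks in current state-of-the-art models, providing a useful reference for future research.

Details
English abstract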

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average question length of 130.19 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community. The project repository is available at https://github.com/MiliLab/VLRS-Bench.

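2602.04657 2026-05-15 cs.CV Updated version

TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models

Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao

Affiliations School of Cyberspace Security, Northwestern Polytechnical University; School of Computer Science, Northwestern Polytechnical University; Intellifusion

AI summary TRIO is a visual token compression method that uses inference-objective guidance to make vision-language model inference efficient. Starting from the inference objective, it recasts visual token compression as preserving the invariance of the output, and uses a purpose-designed layer-local proxy loss to produce token-level gradient saliency that guides token reordering and selection. TRIO is training-free and compatible with FlashAttention, which suits practical deployment. It retains 97.2% of the original performance while significantly speeding up inference and lowering compute cost.

Details
English abstract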

Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed TRIO is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, TRIO retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.75× prefill speedup, 2.14× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV Cache overhead. Our code is available at https://github.com/ocy1/TRIO.
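
The "select by saliency, then apply the NMS principle" step can be sketched as a greedy loop: visit tokens in descending saliency, keep a token only if it is not too similar to any token already kept, and stop once the budget is reached. The suppression criterion used here (cosine similarity between token features above a threshold) and all names and default values are assumptions for illustration. The abstract does not specify what TRIO suppresses on, and the saliency scores below are made-up numbers rather than gradients of a proxy loss.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nms_select_tokens(tokens, saliency, keep, sim_threshold=0.9):
    """Greedy NMS-style selection of at most `keep` token indices.

    Tokens are visited from most to least salient. A token is suppressed
    if its feature is too similar (assumed criterion) to a kept token.
    """
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == keep:
            break
        if all(cosine_similarity(tokens[i], tokens[j]) < sim_threshold
               for j in selected):
            selected.append(i)
    return selected


# Token 1 is nearly a duplicate of token 0, so it is suppressed even though
# its saliency ranks second.
toy_tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
toy_saliency = [0.9, 0.8, 0.5, 0.3]
kept_indices = nms_select_tokens(toy_tokens, toy_saliency, keep=2)
```

Compared with plain top-k by saliency, which would keep tokens 0 and 1 here, the suppression step spends the token budget on distinct content and keeps tokens 0 and 2.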

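2602.04473 2026-05-15 cs.CV Updated version

CC-Pan: Channel-wise Compression based Diffusion for Efficient Pan-Sharpening

Junjie Li, Congyang Ou, Haokui Zhang, Guoting Wei, Shengqin Jiang, Ying Li

Affiliations School of Cyberspace Security, Northwestern Polytechnical University; Nanjing University of Science and Technology; School of Computer Science, Nanjing University of Information Science and Technology

AI summary This paper proposes CC-Pan, a diffusion model based on channel-wise compression for efficient fusion of multispectral and panchromatic images (pan-sharpening). It trains a channel-independent variational autoencoder that encodes high-resolution multispectral images into compact latent representations, which supports multispectral images from different sensors and speeds up inference. Spectral physical properties and the panchromatic image are injected through purpose-designed unidirectional and bidirectional interactive control structures, and a lightweight cross-band attention module further improves fusion precision and spectral consistency. Experiments show that CC-Pan outperforms existing diffusion models on multiple datasets, achieves a 2-3× speedup, and generalizes well across sensors.

Details
English abstract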

Recently, diffusion models have brought novel insights to pan-sharpening and notably boosted fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) sensors, suffering from high inference latency and sensor-specific limitations. In this paper, we present CC-Pan, a cross-sensor latent diffusion framework for efficient pan-sharpening. Specifically, CC-Pan trains a band-wise single-channel variational autoencoder (VAE) to encode high-resolution multispectral (HRMS) images into compact latent representations, naturally supporting MS images with varying band counts across different sensors and establishing a basis for inference acceleration. Spectral physical properties, along with PAN and MS images, are then injected into the diffusion backbone through carefully designed unidirectional and bidirectional interactive control structures, achieving high-precision spatial-spectral fusion in the latent diffusion process. Furthermore, a lightweight region-based cross-band attention (RCBA) module is incorporated at the central layer of the diffusion model, reinforcing inter-band spectral connections to boost spectral consistency and further elevate fusion precision. Extensive experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that CC-Pan outperforms state-of-the-art diffusion-based methods across all three benchmarks, attains a 2-3× inference speedup, and exhibits robust cross-sensor generalization capability on the held-out WorldView-2 sensor without any sensor-specific retraining.

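2602.00807 2026-05-15 cs.CV cs.RO Updated version

Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds

Xianzhe Fan, Shengliang Deng, Xiaoyang Wu, Yuxiang Lu, Zhuoling Li, Mi Yan, Yujia Zhang, Zhizheng Zhang, He Wang, Hengshuang Zhao

Affiliations School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China; School of Computing; Peking University, Beijing, China

AI summary Existing vision-language-action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. To improve VLA performance, this paper proposes Any3D-VLA, which strengthens 3D perception by introducing diverse point cloud data: during training it combines simulator, sensor, and model-estimated point clouds to learn domain-agnostic 3D representations. Experiments show that the method improves performance and mitigates the domain gap.

Comments ICML 2026

Details
English abstract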

Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap. Our project homepage is available at https://xianzhefan.github.io/Any3D-VLA.github.io.

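2512.22331 2026-05-15 cs.CV cs.AI Updated version

The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

Mariya Miteva, Maria Nisheva-Pavlova

Affiliations Faculty of Mathematics and Informatics – Sofia University St. Kliment Ohridski

AI summary This study aims to predict MGMT promoter methylation status in glioblastoma (GBM) non-invasively from multimodal magnetic resonance imaging (MRI) data, a marker with important prognostic and therapeutic significance. To address the feature redundancy and weak modality-specific modeling of conventional unimodal and early-fusion approaches, the authors propose a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves each modality's radiomic features in a compact probabilistic latent space and enables late fusion. In experiments, the method combined with a random forest classifier reaches a test AUC of 0.77, substantially higher than both the baseline and the hyperparameter-tuned models, supporting multi-view probabilistic encoding as an effective way to integrate complementary MRI information and improve predictive performance.

Comments 17 pages, 4 figures

Details
English abstract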

Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves modality-specific radiomic structure while enabling late fusion in a compact probabilistic latent space. The approach is evaluated on radiomic features extracted from the necrotic tumor core in post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) Magnetic Resonance Imaging (MRI). Experimental results demonstrate that the proposed multi-view VAE combined with a random forest classifier achieves a test Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.77 (95% confidence interval: 0.71-0.83), substantially outperforming both a baseline radiomics model (AUC = 0.54) and a hyperparameter-tuned model (AUC = 0.64). These findings indicate that multi-view probabilistic encoding enables more effective integration of complementary MRI information and significantly improves predictive performance for MGMT promoter methylation status.

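2512.22317 2026-05-15 cs.LG cs.AI cs.CV Updated version

LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Xudong Ling, Chaorong Li, Tianxi Huang, Qian Dong, Guiduo Duan

Affiliations Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China; School of Computer Science and Technology (School of Artificial Intelligence), Yibin University; College of Humanities and General Education, Chengdu Textile College

AI summary Short-term precipitation nowcasting is a highly uncertain, under-constrained spatiotemporal forecasting problem, especially for rapidly evolving extreme weather events. This paper proposes LangPrecip, a language-aware multimodal nowcasting framework that treats meteorological text as a semantic motion constraint on precipitation evolution and, under the Rectified Flow paradigm, fuses textual and radar information efficiently in latent space. The work also builds LangPrecip-160k, a large-scale multimodal dataset of 160k paired radar sequences and motion descriptions, and validates the method on the Swedish and MRMS datasets, where it markedly improves prediction under heavy rainfall.

Details
English abstract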

Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework (LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space. We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at an 80-minute lead time.
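
For readers unfamiliar with the CSI figure quoted above: the critical success index is a standard forecast verification score, defined as hits / (hits + misses + false alarms) for events that exceed a chosen threshold. The sketch below computes it on small grids. The threshold value and the toy rain rates are arbitrary illustrations, since the abstract does not state which heavy-rainfall threshold the paper uses.

```python
def critical_success_index(forecast, observed, threshold):
    """CSI = hits / (hits + misses + false_alarms) at a given threshold.

    `forecast` and `observed` are same-shaped nested lists of rain rates.
    Cells where neither field reaches the threshold are ignored.
    """
    hits = misses = false_alarms = 0
    for forecast_row, observed_row in zip(forecast, observed):
        for predicted, actual in zip(forecast_row, observed_row):
            predicted_event = predicted >= threshold
            actual_event = actual >= threshold
            if predicted_event and actual_event:
                hits += 1
            elif actual_event:
                misses += 1
            elif predicted_event:
                false_alarms += 1
    total = hits + misses + false_alarms
    return hits / total if total else float("nan")


# Toy 2 x 2 grids: one hit, one miss, one false alarm, one correct negative.
toy_forecast = [[12.0, 3.0], [9.0, 0.0]]
toy_observed = [[15.0, 11.0], [2.0, 0.0]]
toy_csi = critical_success_index(toy_forecast, toy_observed, threshold=8.0)
```

Because correct negatives are excluded, CSI is not inflated by the large dry areas typical of radar fields, which is why it is commonly reported for rare heavy-rainfall events.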

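2512.12772 2026-05-15 cs.MM cs.CV Updated version

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, Liyun Ru

Affiliations Gaoling School of Artificial Intelligence, Renmin University of China; Baichuan Inc

AI summary To evaluate omni large language models (Omni-LLMs), which process multi-modal information, this paper proposes the JointAVBench benchmark, covering three key aspects: multi-modal dependency, diverse audio information types, and varying scene spans. An automated pipeline generates questions and answers that strictly require joint audio-visual understanding, filling gaps in how existing datasets evaluate multi-modal ability. Experiments show that even the best-performing Omni-LLM reaches an average accuracy of only 65.3% on the benchmark, leaving substantial room for improvement in areas such as cross-scene reasoning.

Details
English abstract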

Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 65.3%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.

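2512.12083 2026-05-15 cs.CV Updated version

RePack then Refine: Efficient Diffusion Transformer with Vision Foundation Model

Guanfang Dong, Luke Schultz, Negar Hassanpour, Chao Gao

Affiliations Huawei Technologies Canada Ltd.

AI summary This work proposes RePack then Refine, a three-stage framework for efficiently using the semantic-rich features of vision foundation models (VFMs) to improve diffusion transformers (DiTs). The RePack module compresses high-dimensional VFM features onto a low-dimensional manifold, removing redundancy while preserving structural information. A standard DiT is then trained on the compressed latent space, and a Latent-Guided Refiner finally restores the high-frequency details lost in compression. On ImageNet-1K, RePack-DiT-XL/1 reaches an FID of 1.82 in only 64 training epochs, improving to 1.65 with the Refiner, and surpasses recent latent diffusion models in convergence efficiency.

Details
English abstract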

Semantic-rich features from Vision Foundation Models (VFMs) have been leveraged to enhance Latent Diffusion Models (LDMs). However, raw VFM features are typically high-dimensional and redundant, increasing the difficulty of learning and reducing training efficiency for Diffusion Transformers (DiTs). In this paper, we propose RePack then Refine, a three-stage framework that brings the semantic-rich VFM features to DiT while further accelerating learning efficiency. Specifically, the RePack module projects the high-dimensional features onto a compact, low-dimensional manifold. This filters out the redundancy while preserving essential structural information. A standard DiT is then trained for generative modeling on this highly compressed latent space. Finally, to restore the high-frequency details lost due to the compression in RePack, we propose a Latent-Guided Refiner, which is trained last to enhance image details. On ImageNet-1K, RePack-DiT-XL/1 achieves an FID of 1.82 in only 64 training epochs. With the Refiner module, performance further improves to an FID of 1.65, significantly surpassing the latest LDMs in terms of convergence efficiency. Our results demonstrate that packing VFM features, followed by targeted refinement, is a highly effective strategy for balancing generative fidelity with training efficiency. Source code is publicly available at https://github.com/guanfangdong/RePack-then-Refine.

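2512.03532 2026-05-15 cs.CV Updated version

OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu

Affiliations PICO, ByteDance, Beijing

AI summary OpenTrack3D is an open-vocabulary 3D instance segmentation framework designed to improve accuracy and generalization in complex, unstructured, mesh-free environments. A visual-spatial tracker builds cross-view consistent object proposals online, and instance features are extracted by combining depth information with DINO feature maps, enabling mesh-free segmentation. OpenTrack3D also replaces CLIP with a multi-modal large language model, significantly improving semantic understanding of complex user queries. Experiments show state-of-the-art performance on multiple benchmarks.

Details
English abstract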

Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.

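2512.02482 2026-05-15 cs.CV Updated version

G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline

Vishwesh Nath, Javier G. Tejero, Aravind S. Kumar, Ruilong Li, Filippo Filicori, Mahdi Azizian, Sean D. Huver

Affiliations NVIDIA; Northwell Health

AI summary This paper proposes G-SHARP, a real-time surgical scene reconstruction framework for minimally invasive procedures that need fast, accurate 3D modeling of deformable tissue. Built on the open-source GSplat (Apache-2.0) differentiable Gaussian rasterizer, it provides principled deformation modeling, robust occlusion handling, and high-fidelity reconstruction, with leading reconstruction quality on the EndoNeRF dataset. The authors also provide a Holoscan SDK application deployable on NVIDIA IGX Orin and Thor edge devices, supporting real-time surgical visualization in real operating-room settings.

Details
English abstract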

We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.

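2511.14823 2026-05-15 cs.LG cs.CV Updated version

Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Affiliations Institute of Technology, University of Tartu; 3S Holding OÜ

AI summary Current machine learning models excel at static tasks but struggle with continual adaptation and lifelong learning in non-stationary environments because their architectures are rigid. This paper proposes dynamic nested hierarchies, which let a model autonomously adjust the number of optimization levels, their nesting structure, and their update frequencies during training or inference, enabling self-evolution without predefined constraints. Supported by mathematical derivations and experiments, the method shows superior performance on language modeling, continual learning, and long-context reasoning, and is presented as a foundation for adaptive, general-purpose intelligence.

Comments 12 pages, 1 figure

Journal ref Frontiers in Artificial Intelligence, 2026

Details
English abstract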

Contemporary machine learning models, including large language models, exhibit remarkable capabilities in static tasks yet falter in non-stationary environments due to rigid architectures that hinder continual adaptation and lifelong learning. Building upon the nested learning paradigm, which decomposes models into multi-level optimization problems with fixed update frequencies, this work proposes dynamic nested hierarchies as the next evolutionary step in advancing artificial intelligence and machine learning. Dynamic nested hierarchies empower models to autonomously adjust the number of optimization levels, their nesting structures, and update frequencies during training or inference, inspired by neuroplasticity to enable self-evolution without predefined constraints. This innovation addresses the anterograde amnesia in existing models, facilitating true lifelong learning by dynamically compressing context flows and adapting to distribution shifts. Through rigorous mathematical formulations, theoretical proofs of convergence, expressivity bounds, and sublinear regret in varying regimes, alongside empirical demonstrations of superior performance in language modeling, continual learning, and long-context reasoning, dynamic nested hierarchies establish a foundational advancement toward adaptive, general-purpose intelligence.

2511.13397 2026-05-15 cs.CV cs.AI Version update

Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

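Affiliations * Department of Electronic and Computer Engineering, University of Limerick; Data Driven Computer Engineering Research Centre, University of Limerick; Lero, The Irish Software Research Centre, University of Limerick; Valeo Vision Systems

AI summary This paper presents Distance-Annotated Traffic Perception Question Answering (DTPQA), a visual question answering benchmark for evaluating the perception capabilities of vision-language models in traffic scenes. The benchmark comprises a synthetic dataset and a real-world dataset, and annotates each question with the distance between the object in question and the camera, making it possible to analyze perception performance at different distances. It offers a new, targeted tool for assessing model perception in automated driving.

Journal ref IEEE Data Descriptions, 2026

Details
English abstract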

The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
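
The distance-based analysis described above can be sketched as a per-bin accuracy computation. This is a minimal illustration only: the field names (`answer`, `prediction`, `distance_m`) and the bin edges are hypothetical and are not taken from the DTPQA release.

```python
from collections import defaultdict


def accuracy_by_distance(samples, bin_edges=(0, 20, 30, float("inf"))):
    """Return {(lo, hi): accuracy} over samples whose distance lies in [lo, hi)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in samples:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= s["distance_m"] < hi:
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(s["prediction"] == s["answer"])
                break
    return {b: hits[b] / totals[b] for b in totals}


# Toy samples (hypothetical): ground truth, model prediction, object distance in meters.
samples = [
    {"answer": "yes", "prediction": "yes", "distance_m": 5.0},
    {"answer": "no", "prediction": "no", "distance_m": 15.0},
    {"answer": "yes", "prediction": "no", "distance_m": 25.0},
    {"answer": "no", "prediction": "no", "distance_m": 40.0},
    {"answer": "yes", "prediction": "no", "distance_m": 45.0},
]
result = accuracy_by_distance(samples)
```

Plotting these per-bin accuracies against distance would give one simple view of how performance changes with range.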

2511.13026 2026-05-15 cs.CV Version update

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

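Affiliations * MiLM Plus, Xiaomi Inc.; Renmin University of China

AI summary This paper proposes REVISOR, a new framework for improving the reasoning ability of multimodal large language models on long-form video understanding. To address the shortcomings of purely text-based reflection on long videos, REVISOR introduces a multimodal reflection mechanism that revisits visual information, and designs a Dual Attribution Decoupled Reward (DADR) mechanism to strengthen the model's ability to identify and use key video segments. The method requires no additional supervised fine-tuning or external models and significantly improves results on several long-form video understanding benchmarks.

Details
English abstract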

Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.

2511.02271 2026-05-15 cs.CV Version update

Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Yucheng Song, Yifan Ge, Junhao Li, Zhining Liao, Zhifang Liao

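AI summary This paper proposes HTSC-CIF, a cross-modal causal intervention framework based on a hierarchical task structure, to address three core challenges in medical report generation: insufficient understanding of domain knowledge, poor alignment between text and visual entity embeddings, and spurious correlations caused by cross-modal biases. The method decomposes the task into low, mid, and high levels, handled respectively by spatial feature alignment, mutually guided language and image modeling, and a causal intervention module, with the aim of improving the accuracy and interpretability of generated reports. The authors report that HTSC-CIF outperforms state-of-the-art methods on multiple benchmark datasets.

Comments Due to issues with the training epochs and training strategy in our paper, there are numerical errors in the result comparison table presented in the preprint. Therefore, we have decided to withdraw the manuscript for further revision

Details
English abstract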

Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.

2510.20206 2026-05-15 cs.CV Version update

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu

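Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary RAPO++ is a cross-stage prompt optimization framework for text-to-video generation that addresses the mismatch between user prompts and training data. It combines Retrieval-Augmented Prompt Optimization (RAPO) and Sample-Specific Prompt Optimization (SSPO), using multi-source feedback on semantic alignment, spatial fidelity, and temporal coherence to progressively improve the generated video, and then fine-tunes a language model for efficient prompt generation. Experiments show that RAPO++ significantly improves semantic alignment, compositional reasoning, and temporal stability across several state-of-the-art models and benchmarks, making it a model-agnostic, efficient, and scalable solution.

Comments arXiv admin note: text overlap with arXiv:2504.11739

Details
English abstract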

Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

2510.17434 2026-05-15 cs.CV Version update

Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram

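Affiliations * Sigmedia Group; Department of Electronic and Electrical Engineering

AI summary This work repurposes the motion vectors in AV1-encoded video to produce dense sub-pixel correspondences and filters short tracks by cosine consistency. On short videos the method runs efficiently, uses little CPU, and yields denser matches with competitive geometric consistency. A small structure-from-motion demo indicates that compressed-domain correspondences are a practical front end with a path to larger-scale use.

Comments Accepted ICIR 2025, camera-ready version

Details
English abstract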

We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
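
One plausible reading of "filtered by cosine consistency" is to keep a track only when consecutive displacement vectors point in similar directions. The sketch below illustrates that reading; the threshold `min_cos`, the handling of zero vectors, and the function names are assumptions for illustration, since the abstract does not give the exact criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two 2-D displacement vectors."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 1.0  # assumption: a zero-motion step never breaks a track
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def is_consistent(track, min_cos=0.9):
    """track: list of per-frame (dx, dy) motion vectors along one point track."""
    return all(cosine(a, b) >= min_cos for a, b in zip(track, track[1:]))


smooth = [(1.0, 0.0), (1.0, 0.1), (0.9, 0.1)]    # direction barely changes
erratic = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0)]  # direction flips every frame
kept = [t for t in (smooth, erratic) if is_consistent(t)]
```

Here only `smooth` survives the filter; a real pipeline would apply the same test to sub-pixel vectors decoded from the bitstream.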

2510.15849 2026-05-15 cs.CV Version update

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin

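Affiliations * Tsinghua University, China; The Fifth Affiliated Hospital of Wenzhou Medical University, China; Shenzhen Traditional Chinese Medicine Hospital, China; Wenzhou Medical University, China; The First Hospital of Hebei Medical University, China; Chinese Medicine Guangdong Laboratory/Hengqin Laboratory, China

AI summary This paper proposes Memory-SAM, a tongue segmentation method that needs neither human prompts nor training; it retrieves features from prior cases and generates effective prompts to guide SAM2. Using dense DINOv3 features and FAISS retrieval, it automatically derives foreground and background prompts from a small set of prior cases to achieve high-accuracy segmentation. On a dataset of 600 expert-annotated images, Memory-SAM outperforms the compared methods, with the clearest gains under real-world conditions.

Details
English abstract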

Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
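
The retrieval-to-prompt idea can be sketched in a few lines: each query location is matched to its most similar exemplar feature, and inherits that feature's foreground or background label as a point prompt. This is a toy stand-in, not the authors' pipeline: brute-force cosine similarity replaces DINOv3 features and FAISS retrieval, and all names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def prompts_from_exemplar(query_feats, exemplar_feats, exemplar_mask):
    """Split query locations into foreground/background point prompts.

    query_feats: {(x, y): feature vector} for the query image.
    exemplar_feats: feature vectors from the retrieved exemplar.
    exemplar_mask: True where the matching exemplar feature lies inside the mask.
    """
    fg, bg = [], []
    for xy, feat in query_feats.items():
        best = max(range(len(exemplar_feats)), key=lambda i: cosine(feat, exemplar_feats[i]))
        (fg if exemplar_mask[best] else bg).append(xy)
    return fg, bg


exemplar_feats = [(1.0, 0.0), (0.0, 1.0)]  # toy 2-D features
exemplar_mask = [True, False]              # first feature is tongue, second is background
query_feats = {(10, 10): (0.9, 0.1), (50, 5): (0.2, 0.8)}
fg_points, bg_points = prompts_from_exemplar(query_feats, exemplar_feats, exemplar_mask)
```

The resulting point lists are what a promptable segmenter would consume in place of manual clicks.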

2510.13016 2026-05-15 cs.CV Version update

SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

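Affiliations * LMU Munich; MCML Technical University of Munich; University of Zurich; University of Oxford; Amazon; NVIDIA

AI summary This paper presents SVAG-Bench, a large-scale benchmark for evaluating multi-instance spatio-temporal video action grounding. The task requires a model to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query, giving a unified view of multiple actions in complex scenes. SVAG-Bench contains 688 videos with dense, fine-grained annotations, supports detailed evaluation of multi-actor disambiguation, temporal overlap, and action compositionality, and comes with a standardized evaluation toolkit and a modular baseline model, SVAGFormer.

Details
English abstract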

A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.

2509.22746 2026-05-15 cs.AI cs.CV Version update

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei

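Affiliations * Fudan University; Alibaba Group Holding Limited; Future Living Lab of Alibaba; University of Southern California; Shanghai Innovation Institute

AI summary Current visual reasoning methods mainly explore specific reasoning modes; they improve results in particular domains but struggle to develop general reasoning ability. This paper proposes a new adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes in a single model and selects the appropriate mode based on context. It introduces AdaVaR, a two-stage adaptive visual reasoning framework that uses supervised learning for the cold-start stage and reinforcement learning with a purpose-built algorithm to induce context-adaptive mode selection. Experiments show consistent improvements in visual reasoning across scenarios.

Comments 27 pages, 11 figures, 5 tables, accepted by ICLR 2026

Details
English abstract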

Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

2509.21261 2026-05-15 cs.CV Version update

Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization

Feng-Qi Cui, Jinyang Huang, Anyang Tong, Ziyu Jia, Jie Zhang, Zhi Liu, Dan Guo, Jianwei Lu, Meng Wang

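AI summary This paper studies inter-person variability in fine-grained micro-action recognition and proposes a framework based on distributionally robust optimization to improve generalization across individuals. The framework contains two plug-and-play modules operating at the feature and loss levels: at the feature level, a Temporal-Frequency Alignment Module normalizes person-specific motion characteristics; at the loss level, a Group-Invariant Regularized Loss improves robustness to rare and difficult samples. Experiments on a large-scale dataset show that the method outperforms existing methods in accuracy and generalization stability.

Comments Withdrawn by the authors due to accidental submissions of non-final manuscript versions. Both v1 and v2 contain an outdated framework figure, in which several module names are inconsistent with the finalized terminology used in the manuscript. This inconsistency may confuse readers about the structure and naming of the proposed method

Details
English abstract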

Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.

2509.14232 2026-05-15 cs.CV Version update

GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo

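Affiliations * Shanghai Jiao Tong University; Tsinghua University; Shanghai AI Laboratory; The Chinese University of Hong Kong

AI summary GenExam is the first exam-style benchmark for multidisciplinary text-to-image generation, designed to assess a model's combined ability in understanding, reasoning, and image generation. It contains 1,000 problems across 10 subjects, each with a ground-truth image and fine-grained scoring points for precise evaluation of semantic correctness and visual plausibility. Experiments show that GenExam is highly challenging for existing models and that open-source models lag well behind closed-source ones, highlighting the limitations of current generative models on complex tasks.

Comments Accepted by ICML 2026

Details
English abstract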

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and a large gap, with open-source models consistently lagging behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.

2509.01299 2026-05-15 cs.CV Version update

Cross-Domain Few-Shot Segmentation via Ordinary Differential Equations over Time Intervals

Huan Ni, Qingshan Liu, Xiaonan Niu, Danfeng Hong, Lingli Zhao, Haiyan Guan

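Affiliations * School of Remote Sensing & Geomatics Engineering, Nanjing University of Information Science & Technology; Tiandu-Nuist Deep Space Exploration Laboratory; School of Computer Science, Nanjing University of Posts and Telecommunications; Nanjing Center, China Geological Survey; School of Automation, Southeast University; School of Remote Sensing and Information Engineering, Wuhan University

AI summary This paper studies cross-domain few-shot segmentation (CD-FSS), which aims to segment unseen categories from very few samples under domain shift between the source and target domains. To address the restricted knowledge flow caused by independent modules in existing methods, the authors propose FSS-TIs, an all-in-one module based on ordinary differential equations (ODEs) and the Fourier transform, which uses feature evolution over time intervals to explore domain-agnostic features and learn efficiently from limited samples. Experiments show that the method outperforms existing methods in both cross-domain adaptability and segmentation performance.

Details
English abstract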

Cross-domain few-shot segmentation (CD-FSS) aims to segment unseen categories with very limited samples while alleviating the negative effects of domain shift between the source and target domains. At present, existing CD-FSS studies typically rely on multiple independent modules to enhance cross-domain adaptability. However, the independence among these modules hinders the effective flow of knowledge, making it difficult to fully leverage their collective potential. In contrast, this paper proposes an all-in-one module based on ordinary differential equations (ODEs) and the Fourier transform, resulting in a structurally concise method, Few-Shot Segmentation over Time Intervals (FSS-TIs). FSS-TIs not only explores a domain-agnostic feature space, but also achieves significant performance improvement through target-domain fine-tuning with extremely limited support samples. Specifically, the ODE modeling process incorporates nonlinear transformations and random perturbations of the amplitude and phase spectra, effectively simulating potential target-domain data distributions. Meanwhile, the analytical solution of the ODE is transformed into a theoretically infinitely iterable feature refinement process, thereby enhancing the learning capability under limited support samples. In this way, both the exploration of domain-agnostic features and the few-shot learning problem can be addressed through the optimization of the intrinsic parameters of the ODE. Moreover, during target-domain fine-tuning, we strictly constrain the support samples to match the settings of real-world CD-FSS tasks, without incurring additional annotation costs. Experimental results demonstrate the superiority of FSS-TIs over existing CD-FSS methods, and in-depth ablation studies further validate the cross-domain adaptability of FSS-TIs.

2508.17588 2026-05-15 cs.CV Version update

HERO: Hierarchical Extrapolation and Refresh for Efficient World Models

Quanjian Song, Xinyu Wang, Donghao Zhou, Jingyu Lin, Cunjian Chen, Yue Ma

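AI summary HERO is a training-free hierarchical acceleration framework designed for world models, targeting the slow inference of diffusion-based world models. Exploiting the difference between shallow-layer and deep-layer feature representations that arises from the multi-modal nature of world models, it applies a patch-wise refresh mechanism in shallow layers and a linear extrapolation scheme in deeper layers to speed up inference. Experiments show that HERO achieves a 1.73× speedup with minimal quality degradation, outperforming existing diffusion acceleration methods.

Comments 12 pages in total

Details
English abstract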

Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
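
To make the deep-layer idea concrete, the sketch below shows generic first-order linear extrapolation: the next feature vector is predicted from the two most recent ones instead of being recomputed. The function name, the `step` parameter, and the toy values are assumptions; the abstract does not specify the exact extrapolation formula HERO uses.

```python
def extrapolate(prev, curr, step=1.0):
    """Predict the next feature vector by continuing the line from prev to curr."""
    return [c + step * (c - p) for p, c in zip(prev, curr)]


feat_t0 = [0.0, 1.0, 2.0]  # features cached at denoising step t-1 (toy values)
feat_t1 = [0.5, 1.5, 2.0]  # features cached at denoising step t
feat_t2 = extrapolate(feat_t0, feat_t1)  # estimate for step t+1, no attention/FFN call
```

The estimate costs one subtraction and one addition per element, which is why skipping attention and feed-forward computation in stable layers can pay off.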

2508.06202 2026-05-15 cs.CV cs.AI Version update

LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

Chang Che, Ziqi Wang, Pengwan Yang, Qi Wang, Hui Ma, Zenglin Shi

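Affiliations * Hefei University of Technology; University of Amsterdam; Tsinghua University

AI summary Continual Visual Instruction Tuning (CVIT) lets multimodal large language models learn new tasks incrementally but suffers from catastrophic forgetting. To address this, the paper proposes LiLoRA, an efficient architecture expansion method that shares the LoRA matrix A across tasks and applies a further low-rank decomposition to matrix B, substantially reducing parameter overhead, and adds a cosine-regularized stability loss to keep shared representations consistent. Experiments show that LiLoRA achieves better performance on a diverse CVIT benchmark while improving parameter efficiency.

Comments AAAI 2026 Oral Presentation. 9 pages

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):19978--19986, 2026

Details
English abstract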

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches. The code is available at https://github.com/chanceche/LiLoRA.
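
A rough parameter count shows why sharing A and factoring B helps. The sketch assumes the standard LoRA shapes (A is rank × d_in, B is d_out × rank) and, as a purely illustrative assumption, that each task's B is factored as U @ V with a smaller inner rank; the abstract does not state the exact decomposition, and the dimensions below are made up.

```python
def lora_params(d_in, d_out, rank, num_tasks):
    """Independent LoRA per task: every task stores its own A and B."""
    return num_tasks * (rank * d_in + d_out * rank)


def shared_a_factored_b_params(d_in, d_out, rank, inner_rank, num_tasks):
    """One shared A plus, per task, B factored as U (d_out x inner_rank) @ V (inner_rank x rank).

    The U @ V factorisation is an assumption made for this sketch.
    """
    shared = rank * d_in
    per_task = inner_rank * (d_out + rank)
    return shared + num_tasks * per_task


# Hypothetical sizes: one 4096 x 4096 projection, rank 16, inner rank 4, 8 tasks.
baseline = lora_params(4096, 4096, 16, 8)
compact = shared_a_factored_b_params(4096, 4096, 16, 4, 8)
```

With these made-up sizes the count drops from 1,048,576 to 197,120 adapter parameters for the layer, and the gap widens as more tasks are added.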

2508.05008 2026-05-15 cs.CV Version update

Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

Xusheng Liang, Lihua Zhou, Nianxin Li, Miao Xu, Ziyang Song, Dong Yi, Jinlin Wu, Jiawei Ma, Hongbin Liu, Zhen Lei, Jiebo Luo

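Affiliations * City University of Hong Kong; Shenzhen Loop Area Institute; CAIR, HKISI, Chinese Academy of Sciences; UESTC; MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary This study targets domain shift in medical image segmentation caused by equipment differences, imaging modes, and similar factors, and proposes MCDRL, a multimodal causal-driven representation learning framework. The method combines a vision-language model with causal inference: it builds a dictionary of domain-specific confounders and trains a causal intervention network that removes domain bias while preserving anatomical structure information. Experiments show that MCDRL delivers higher segmentation accuracy and stronger cross-domain generalization on multiple medical image segmentation tasks.

Comments Accepted by CVPR 2026

Details
English abstract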

Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.

2507.13941 2026-05-15 q-bio.NC cs.AI cs.CV eess.IV Version update

Shared representations in brains and models reveal a two-route cortical organization during scene perception

Pablo Marcos-Manchón, Lluís Fuentemilla

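Affiliations * Department of Cognition, Development and Education Psychology, Faculty of Psychology, University of Barcelona; Institute of Neurosciences, University of Barcelona; Bellvitge Institute for Biomedical Research

AI summary This study analyzes 7T fMRI data to examine how information is organized and routed in the human brain during scene perception. Using representational similarity analysis, it compares representational structure shared across individuals with hierarchical features from vision and language neural networks, and finds two separate processing routes: one for scene layout and environmental context, and another specialized for animate content. The finding refines classical models of visual processing, characterizing scene perception as a distributed cortical network with separable representational routes.

Comments For associated code, see https://github.com/memory-formation/convergent-transformations

Details
English abstract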

The brain transforms visual inputs into high-dimensional cortical representations that support diverse cognitive and behavioral goals. Characterizing how this information is organized and routed across the human brain is essential for understanding how we process complex visual scenes. Here, we applied representational similarity analysis to 7T fMRI data collected during natural scene viewing. We quantified representational geometry shared across individuals and compared it to hierarchical features from vision and language neural networks. This analysis revealed two distinct processing routes: a ventromedial pathway specialized for scene layout and environmental context, and a lateral occipitotemporal pathway selective for animate content. Vision models aligned with shared structure in both routes, whereas language models corresponded primarily with the lateral pathway. These findings refine classical visual-stream models by characterizing scene perception as a distributed cortical network with separable representational routes for context and animate content.
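
For readers unfamiliar with representational similarity analysis, the sketch below shows its usual core: build a dissimilarity vector (1 minus Pearson correlation) over all stimulus pairs for each system, then compare the two vectors with a rank correlation. This is a generic textbook version with toy data, not the authors' code; the specific distance and comparison measures are assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rdm(patterns):
    """Upper triangle of the dissimilarity matrix: 1 - r for each stimulus pair."""
    return [1.0 - pearson(a, b) for a, b in combinations(patterns, 2)]


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def rsa_score(patterns_a, patterns_b):
    """Spearman correlation between the two systems' dissimilarity vectors."""
    return pearson(ranks(rdm(patterns_a)), ranks(rdm(patterns_b)))


# Toy response patterns: one row per stimulus, one column per voxel or model unit.
brain = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 3, 2, 1]]
model = [[2, 4, 6, 8], [2, 4, 6, 9], [8, 6, 4, 2]]
score = rsa_score(brain, model)
```

A score near 1 means the two systems order stimulus pairs by dissimilarity in the same way, even if their raw response scales differ.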

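2507.07776 2026-05-15 cs.CV version update

SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

Dren Fazlija, Monty-Maximilian Zühlke, Johanna Schrader, Arkadij Orlov, Clara Stein, Iyiola E. Olatunji, Daniel Kudenko

Affiliations * University of Luxembourg; CAIMed – Lower Saxony Center for AI & Causal Methods in Medicine

AI summary This paper presents SCOOTER, an open-source framework for evaluating how authentic unrestricted adversarial examples look. Unrestricted adversarial attacks bypass traditional defense strategies by, for example, changing an object's color, but their imperceptibility has to be verified through human evaluation. SCOOTER provides a standardized human evaluation procedure, a large-scale comparison study, and open-source tools and data. It shows that several current adversarial attacks perform poorly under human perception and highlights the gap between human perception and automated vision systems.

Comments 42 pages, 16 figures, 11 tables, Under Review, Code: https://github.com/DrenFazlija/Scooter, Data: https://doi.org/10.5281/zenodo.15771501

Details
English abstract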

Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.

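2507.05193 2026-05-15 eess.IV cs.CV version update

RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis

Songxiao Yang, Haolin Wang, Yao Fu, Ye Tian, Tamotsu Kamishima, Masayuki Ikebe, Yafei Ou, Masatoshi Okutomi

Affiliations * Institute of Science Tokyo; Hokkaido University; The University of Tokyo

AI summary This study presents RAM-W600, a multi-task wrist radiograph dataset for computer-aided diagnosis and disease monitoring of rheumatoid arthritis (RA). It contains 1048 conventional wrist radiographs of 388 patients from six medical centers, with pixel-level wrist bone instance segmentation annotations and SvdH bone erosion scores, and is the first public resource for wrist bone instance segmentation. The dataset supports RA-related research such as joint space narrowing quantification, bone erosion detection, and bone deformity evaluation, and may also be applied to tasks such as wrist fracture localization. It is expected to lower the barrier to wrist RA research and advance computer-aided diagnosis.

Comments Published in NeurIPS 2025

Details
English abstract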

Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer-aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high-quality instance-level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophyte, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology. This work presents a multi-task dataset for wrist bone in CR, including two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring, which is the first public resource for wrist bone instance segmentation. This dataset comprises 1048 wrist conventional radiographs of 388 patients from six medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH BE scores for 800 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist-related tasks, such as carpal bone fracture localization. We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA-related domain.

2507.04049 2026-05-15 cs.CV cs.RO version update

DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo

Affiliations * School of Artificial Intelligence (School of Software), Yanshan University; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University; Horizon Robotics; School of Mechanical and Aerospace Engineering, Nanyang Technological University; University of Macau; School of Electrical Engineering and Computer Science, The University of Queensland

AI summary Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, which leads to conservative, homogeneous behavior that adapts poorly to complex real-world scenarios. This paper proposes DIVER, a framework that combines reinforcement learning with diffusion-based generation to produce diverse and feasible driving trajectories. DIVER uses reinforcement learning to guide the diffusion process, with rewards that enforce trajectory safety and diversity, and proposes a new diversity metric. Experiments on several benchmarks show that it significantly improves trajectory diversity and mitigates the mode collapse problem in imitation learning.

Comments 17 pages, 10 figures

Details
English abstract

Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

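2506.04499 2026-05-15 cs.CV version update

FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli

Affiliations * Qualcomm AI Research

AI summary This paper proposes FALO, an efficient and accurate LiDAR 3D object detection method designed for resource-constrained edge devices. It arranges sparse voxels into a 1D sequence according to their coordinates and proximity and processes the sequence with the proposed ConvDotMix blocks, which provide sufficient feature mixing in both spatial and embedding dimensions as well as higher-order nonlinear interaction. Experiments show that FALO maintains state-of-the-art-level detection accuracy while running 1.6x to 9.8x faster than the latest methods on mobile GPUs and NPUs, which makes it suitable for compact embedded devices.

Details
English abstract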

Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6x to 9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

2505.22394 2026-05-15 cs.CV version update

PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Fan Fei, Jiajun Tang, Fei-Peng Tian, Boxin Shi, Ping Tan

Affiliations * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; National Engineering Research Center of Visual Technology, School of Computer Science, Peking University; The Hong Kong University of Science and Technology; Light Illusions; PKU-AI 2 Robotics Joint Lab of Embodied AI

AI summary This paper proposes PacTure, a new framework that generates physically based rendering (PBR) material textures for untextured 3D meshes from text descriptions. To address the limited generation efficiency and texture consistency of existing methods, it introduces view packing, which effectively increases the resolution of multi-view generation while keeping the generative model efficient and compatible. Combining fine-grained control with an autoregressive prediction framework, PacTure outperforms existing state-of-the-art methods in both generation quality and efficiency.

Comments Accepted by Computational Visual Media Journal (CVMJ) in Feb. 2026. 19 pages, 7 figures

Details
English abstract

We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures for an untextured 3D mesh from a text description. Existing 2D generation-based texturing approaches either generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures, or adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation, without imposing additional inference cost. Unlike UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inferencing cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework, creating an efficient multi-view PBR generation backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality and efficiency.

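2505.11809 2026-05-15 cs.CV version update

From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models

Zicheng Fan, Kunihiko Fujiwara, Pengyuan Liu, Fan Zhang, Filip Biljecki

Affiliations * Department of Architecture, National University of Singapore, Singapore; Research & Development Institute, Takenaka Corporation, Japan; Urban Analytics Subject Group, Urban Studies & Social Policy Division, University of Glasgow, United Kingdom; Institute of Remote Sensing GIS, Peking University, China; Department of Real Estate, National University of Singapore, Singapore

AI summary This paper proposes a method based on a vision-language model (VLM) that uses street view imagery to assess the visibility of urban landmarks in real street environments, in place of traditional line-of-sight simulation based on geometric occlusion. By detecting target landmarks in direction- and zoom-controlled street view images, it builds a heterogeneous visibility graph that represents visual connections among landmarks and reveals how multiple landmarks are linked through shared visual corridors. Experiments report a detection accuracy of 87% across several internationally known landmarks, and a case study along the River Thames in London identifies key mediating locations, offering a new analytical perspective for urban planning and heritage conservation.

Details
English abstract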

Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.

2504.09549 2026-05-15 cs.CV version update

SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Yuhao Wang, Xiang Hu, Lixin Wang, Pingping Zhang, Huchuan Lu

Affiliations * School of Future Technology, Dalian University of Technology; School of Information and Communication Engineering, Dalian University of Technology

AI summary This paper proposes SD-ReID, a generative framework for aerial-ground person re-identification (AG-ReID). It combines a generative model with controllable conditions to learn the feature distributions of different views and thereby extract more robust identity representations, and it introduces a View-Refined Decoder to strengthen feature alignment. Experiments show strong performance on multiple AG-ReID datasets.

Comments This work is accepted by IEEE TIP 2026. More modifications may be performed.

Details
English abstract

Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code and pre-trained models are available at https://github.com/924973292/SD-ReID.

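2504.01571 2026-05-15 cs.GR cs.AI cs.CV cs.LG version update

Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation

Aleksander Plocharski, Jan Swidzinski, Przemyslaw Musialski

Affiliations * Warsaw University of Technology; Akces NCBR; Imperial College London; New Jersey Institute of Technology

AI summary This paper proposes Procedural Diffusion Guidance (Pro-DG), a method for architectural facade generation that uses hierarchical procedural rules to generate control maps within the Stable Diffusion framework and thereby produce photorealistic facade images. Starting from a single input image and its segmentation, it applies an inverse procedural module to identify the facade's hierarchical layout and, using structural features, designs a new ControlNet pipeline that generates facade images guided by procedural transformations. The method gives precise control over local appearance together with large-scale structural edits, and experiments show that it outperforms existing methods at preserving architectural style and achieving controllable edits.

Comments 17 pages, 15 figures, Computer Graphics Forum 2026 Journal Paper

Details
English abstract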

We use hierarchical procedural rules for the generation of control maps within the stable diffusion framework to produce photo-realistic architectural facade images. Starting from a single input image and its segmentation, we apply an inverse procedural module to identify the facade's hierarchical layout. Leveraging this hierarchy and structural features, we introduce a novel ControlNet pipeline that generates new facade imagery guided by procedural transformations. Our method enables various structural edits, including floor duplication and window rearrangement, by integrating hierarchical alignment directly into control maps. This precisely guides the diffusion-based generative process, ensuring local appearance fidelity alongside extensive structural modifications. Comprehensive evaluations, including comparisons with inpainting-based approaches and synthetic benchmarks, confirm our approach's superior capability in preserving architectural identity and achieving accurate, controllable edits. Quantitative results and user feedback validate our method's effectiveness.

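2405.07459 2026-05-15 cs.CV version update

DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search

Yuchuan Deng, Zhanpeng Hu, Zijie Xin, Chuang Deng, Qijun Zhao

Affiliations * Sichuan University; Renmin University of China

AI summary This paper studies how to integrate positive and negative descriptions effectively in text-based person search (TBPS). Existing methods focus mainly on positive attributes and neglect negative descriptions, which can cause false positives. The authors propose the DAPL framework, which combines positive and negative descriptions, introduces dual image-attribute contrastive learning and sensitive image-attribute matching learning to improve recognition of unseen attributes, and designs a Dynamic Token-wise Similarity loss to refine the alignment of visual and textual embeddings, improving the accuracy and robustness of TBPS.

Journal ref 2025 IEEE International Conference on Multimedia and Expo (ICME)

Details
English abstract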

Text-based person search (TBPS) aims to retrieve specific images of individuals from large datasets using textual descriptions. Existing TBPS methods focus primarily on identifying explicit positive attributes, often neglecting the critical role of negative descriptions. This oversight can lead to false positives, where images that should be excluded based on negative descriptions are incorrectly included, due to partial alignment with the positive criteria. To address this limitation, we propose the Dual Attribute Prompt Learning (DAPL) framework, which incorporates both positive and negative descriptions to improve the interpretative accuracy of vision-language models in TBPS tasks. DAPL combines Dual Image-Attribute Contrastive (DIAC) learning with Sensitive Image-Attribute Matching (SIAM) learning to enhance the detection of previously unseen attributes. Furthermore, to achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we introduce the Dynamic Token-wise Similarity (DTS) loss. This loss function refines the representation of both matching and non-matching descriptions at the token level, providing more precise and adaptable similarity assessments, and ultimately improving the accuracy of the matching process. Empirical results demonstrate that DAPL outperforms state-of-the-art methods, enhancing both precision and robustness in TBPS tasks.

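2605.14145 2026-05-15 cs.CV version update

Rethinking the Good Enough Embedding for Easy Few-Shot Learning

Michael Karnes, Alper Yilmaz

Affiliations * The Ohio State University

AI summary This paper examines whether different deep vision models trained on large-scale data converge to an "ideal" latent representation space, and argues that a good embedding is enough. By freezing DINOv2-L features and pairing them with a k-nearest-neighbor classifier, it builds a non-parametric few-shot learning pipeline that needs no backpropagation, identifies the optimal feature extraction layer, and applies principal component analysis and independent component analysis for manifold refinement. Experiments show that the method outperforms complex meta-learning algorithms on several mainstream benchmarks and reaches state-of-the-art performance.

Details
English abstract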

The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, "ideal" latent space. This again raises a critical question: is a good embedding all you need? In this paper, we leverage this convergence to demonstrate that off-the-shelf embeddings are inherently "good enough" for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. By utilizing a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify the optimal layer for feature extraction. We further demonstrate that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Our results across four major benchmarks demonstrate that our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.
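The non-parametric pipeline described above can be sketched as a k-nearest-neighbor vote over precomputed embeddings. This is an illustration under assumptions, not the authors' code: the 2-D vectors stand in for frozen DINOv2-L features, cosine similarity is an assumed metric, the PCA/ICA refinement step is omitted, and every name is hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_predict(query, support, k=1):
    """Label a query embedding by majority vote among its k most similar
    support embeddings. `support` is a list of (embedding, label) pairs.
    No gradient step is involved: the embeddings are used as-is."""
    nearest = sorted(support, key=lambda s: cosine(query, s[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, the label that appears first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


# Toy 2-way 2-shot episode with stand-in embeddings.
support_set = [
    ([1.0, 0.1], "cat"), ([0.9, 0.2], "cat"),
    ([0.1, 1.0], "dog"), ([0.2, 0.9], "dog"),
]

print(knn_predict([0.8, 0.3], support_set, k=3))
print(knn_predict([0.1, 0.7], support_set, k=1))
```

The only "training" is storing the support embeddings, which is why the approach can bypass backpropagation entirely.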

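2605.14136 2026-05-15 cs.CV version update

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

Nurislam Tursynbek, Zhiqiang Lao, Heather Yu, Gedas Bertasius, Marc Niethammer

Affiliations * UNC Chapel Hill; Futurewei Technologies Inc; UCSD

AI summary Recent text-to-video diffusion models generate visually appealing frames but still fall short on temporal consistency, often showing flicker, drift, or unstable motion. This paper proposes TeDiO, a training-free, inference-only method that improves temporal consistency by regularizing the temporal diagonal patterns in the model's internal attention maps. It estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates, markedly improving motion smoothness across several video diffusion models while preserving per-frame visual quality, without modifying model weights or relying on external motion supervision.

Comments CVPR'26 Workshop on Agentic AI for Visual Media

Details
English abstract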

Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

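2605.14135 2026-05-15 cs.CV version update

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

Adil Qureshi, Dongki Jung, Jaehoon Choi, Dinesh Manocha

Affiliations * University of Maryland, College Park

AI summary This paper proposes PanoPlane, a method for high-fidelity indoor novel view synthesis from sparse views, built on reconstructing closed room geometry through panoramic scene completion. It introduces a training-free, layout-anchored attention steering mechanism that, at inference time, steers the diffusion model toward the planar surfaces detected in the scene, replacing unconstrained hallucination with geometrically consistent content completion. Experiments on Replica, ScanNet++, and Matterport3D show better novel view synthesis than existing methods, with PSNR improvements of up to 17.8%.

Details
English abstract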

We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward the scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

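2605.14123 2026-05-15 eess.IV cs.CV version update

Keyed Nonlinear Transform: Lightweight Privacy-Enhancing Feature Sharing for Medical Image Analysis

Haebom Lee, Gyeongjung Kim

Affiliations * OOLU Soft Co., Ltd.

AI summary This paper proposes Keyed Nonlinear Transform (KNT), a lightweight feature transformation that strengthens privacy protection in medical image analysis by addressing the leakage of patient identity information during feature sharing. It obfuscates intermediate features with a key-conditioned nonlinear transform, which reduces their re-identifiability while preserving classification performance and computational efficiency. Experiments show that KNT significantly improves privacy protection without retraining the model and applies to multiple medical imaging tasks.

Details
English abstract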

Feature sharing via split inference offers a lightweight alternative to federated learning for resource-constrained hospitals, but transmitted features still leak patient identity information and lack practical mechanisms for controlled feature sharing. We propose Keyed Nonlinear Transform (KNT), a drop-in feature transformation that applies key-conditioned obfuscation to intermediate representations. KNT reduces re-identification AUC from 0.635 to 0.586, corresponding to a 36% reduction in above-chance identity signal, while introducing only 0.15 ms CPU overhead, without backbone retraining, and preserving classification performance within 1.0 pp. Our analysis shows that KNT's nonlinear transform prevents closed-form inversion and shifts recovery to iterative gradient-based optimization under full key compromise, substantially increasing inversion difficulty. The same transform generalizes to dense prediction tasks, incurring only a 4.4 pp Dice reduction on skin-lesion segmentation without retraining. These results position KNT as a practical and efficient privacy layer for split inference deployments.
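The abstract does not spell out how the 36% figure is derived. One reading that reproduces it, assuming chance-level re-identification corresponds to an AUC of 0.5, is the relative drop in AUC above chance. The function name below is hypothetical.

```python
def above_chance_reduction(auc_before, auc_after, chance=0.5):
    """Relative reduction of the AUC margin above chance level.

    Assumption: the 'above-chance identity signal' is measured as AUC - 0.5.
    """
    return (auc_before - auc_after) / (auc_before - chance)


# (0.635 - 0.586) / (0.635 - 0.5) = 0.049 / 0.135, roughly 0.363
reduction = above_chance_reduction(0.635, 0.586)
print(f"{reduction:.1%}")
```

Read this way, the raw AUC drop of 0.049 looks small, but it removes a little over a third of the margin an attacker had over random guessing.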

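2605.14110 2026-05-15 cs.CV cs.RO version update

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

Sandro Papais, Lezhou Feng, Charles Cossette, Lingting Ge

Affiliations * University of Toronto; Zoox Inc

AI summary This paper proposes SToRe3D, a sparsity framework for efficient multi-view 3D object detection that targets the heavy computation and high inference latency of Vision Transformers (ViTs) when processing multiple views and large 3D regions. It jointly selects 2D image tokens and 3D object queries and combines this with a feature storage and reactivation mechanism to allocate compute to critical content. Experiments show that SToRe3D significantly speeds up inference while maintaining detection accuracy, offering a feasible route to real-time large-scale 3D detection.

Comments Accepted to CVPR 2026

Details
English abstract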

Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

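2605.14108 2026-05-15 cs.CV cs.AI cs.LG version update

Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

Nishi Doshi, Shrey Shah

Affiliations * University of Southern California

AI summary This study addresses the shortage of diabetic retinopathy (DR) screening resources in rural areas and proposes a cascaded edge-cloud architecture that improves screening efficiency and reduces the cloud compute load. The architecture has two tiers: Tier 1 runs a lightweight MobileNetV3-small model on a local device for binary triage, deciding whether a case needs referral; Tier 2 runs a RETFound-DINOv2 model in the cloud for fine-grained severity grading of the referable images. Experiments on the APTOS dataset show that the method substantially reduces the number of cloud calls while maintaining high screening accuracy.

Details
English abstract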

Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance.

Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology
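The two-tier routing can be sketched as below. This is a sketch under assumptions rather than the authors' implementation: `tier1_score` and `tier2_grade` are hypothetical callables standing in for the edge and cloud models, the threshold value is arbitrary, and mapping a Tier 2 grade of 0 or 1 back to the merged "0-1" class is a guess at how the 4-class output space is filled.

```python
def to_output_class(grade):
    """Map an ordinal grade 0..4 to the deployed 4-class space (assumed mapping)."""
    return "0-1" if grade <= 1 else str(grade)


def cascade_predict(image, tier1_score, tier2_grade, threshold):
    """Return (output_class, used_cloud).

    tier1_score: hypothetical edge model giving a referable-DR score in [0, 1].
    tier2_grade: hypothetical cloud model giving an ordinal grade 0..4.
    """
    if tier1_score(image) < threshold:
        return "0-1", False          # resolved on the clinic device, no upload
    return to_output_class(tier2_grade(image)), True


def cloud_call_reduction(forwarded_fraction):
    """Share of cloud calls saved relative to sending every image to the cloud."""
    return 1.0 - forwarded_fraction


# Stub models for illustration only.
edge_only = cascade_predict("img_a", lambda img: 0.1, lambda img: 3, threshold=0.3)
forwarded = cascade_predict("img_b", lambda img: 0.9, lambda img: 3, threshold=0.3)
print(edge_only, forwarded)

# With 49.52% of images forwarded, 50.48% of cloud calls are avoided.
print(round(cloud_call_reduction(0.4952), 4))
```

Lowering the threshold trades more cloud calls for higher Tier 1 sensitivity, which is the knob behind the validation-tuned high-sensitivity operating point quoted in the abstract.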

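2605.14104 2026-05-15 cs.CV version update

DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu, Juming Xiong, Zhengyi Lu, Yanfan Zhu, Marilyn Lionts, Yuechen Yang, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo

Affiliations * Vanderbilt University, Tennessee, USA; Weill Cornell Medicine, New York, USA; Vanderbilt University Medical Center, Tennessee, USA

AI summary This study proposes DUET, a new dual-paradigm framework for predicting spatial transcriptomics data from histology images. DUET combines parametric prediction with memory-based retrieval and uses cellular inductive priors to infer gene expression more accurately. By bringing in large-scale single-cell data as molecular constraints and designing a lightweight adapter that dynamically adjusts branch preference across spatial regions, DUET achieves state-of-the-art prediction performance on several public datasets.

Details
English abstract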

Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at https://github.com/Junchao-Zhu/DUET

2605.14047 2026-05-15 cs.CV cs.AR 版本更新

Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

Kieran Carrigg, Sigur de Vries, Amirhossein Sadough, Marcel van Gerven

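Affiliations * Department of Machine Learning and Neural Computing; Donders Institute for Brain, Cognition, and Behaviour

AI summary This paper studies how to deploy Vision Transformers (ViTs) efficiently on edge devices. To address the computational complexity and global reduction bottleneck caused by layer normalization, it proposes a hardware-aware framework based on genetic programming. The method evolves layer-specific scalar functions to replace conventional normalization layers, achieving high-performance adaptation without training the model from scratch. Experiments show that the method maintains image classification accuracy while significantly reducing compute and memory overhead, offering an effective route to deploying ViTs on edge accelerators.

Comments 18 pages, 7 figures

Details
English abstract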

Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not optimally fit to all layers' behaviour and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach eliminates the need to retrain models from scratch entirely. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.

2605.14045 2026-05-15 cs.CV Updated version

PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow

Wei Dong, Han Zhou, Terry Ji, Guanhua Zhao, Shahab Asoodeh, Yulun Zhang, Guangtao Zhai, Jun Chen, Xiaohong Liu

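Affiliations * McMaster University; Shanghai Jiao Tong University

AI summary The paper proposes PVRF, a unified framework for removing complex and varied adverse weather in real-world scenes. The method combines a soft weather perception module built on frozen vision-language models with velocity-constrained rectified-flow refinement. It produces an initial restoration estimate through attribute-modulated normalization and weather-weighted adapters, then uses a terminal-consistent residual rectified flow to improve restoration quality and stability. Experiments show that PVRF outperforms existing methods in fidelity and perceptual quality, with good cross-dataset generalization.

Comments 10 pages, 9 figures, and 4 tables

Details
English abstract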

Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision-language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at https://github.com/dongw22/PVRF.

2605.14031 2026-05-15 cs.SD cs.CV cs.LG Updated version

Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Wuao Liu, Mustafa Chasmai, Subhransu Maji, Grant Van Horn

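Affiliations * University of Massachusetts Amherst

AI summary This paper studies the effectiveness of masked autoencoders (MAE) for fine-grained species classification in bioacoustics under limited data. Through systematic experiments on the iNatSounds dataset, it analyzes the effects of pretraining data scale, domain specificity, data curation, and transfer strategies. Models pretrained on diverse general audio data perform best on the bioacoustic task, while additional domain-specific pretraining and data filtering offer limited benefit at small data scales and may even degrade performance. The results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining data scale matters more for model performance than objective design.

Comments Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR 2026. 8 pages, 6 figures

Details
English abstract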

Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

2605.13994 2026-05-15 cs.CV cs.AI Updated version

CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li

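Affiliations * Department of Biomedical Engineering, National University of Singapore, Singapore; School of Automation, Southeast University, Nanjing, China; Department of Medicine, National University of Singapore, Singapore; Department of Cardiology, National University Heart Centre Singapore, Singapore

AI summary This paper proposes CineMesh4D, an end-to-end 4D (3D+time) reconstruction method that generates personalized whole-heart meshes from sparse cine MRI. The method reconstructs whole-heart structure directly from multi-view 2D cine MRI via cross-domain mapping, introduces a differentiable rendering loss that uses multi-view sparse contours for supervision, and designs a dual-context temporal block that fuses global and local temporal information to improve reconstruction quality and motion consistency. Experiments show that CineMesh4D outperforms existing methods in reconstruction accuracy and motion coherence, offering a practical route to personalized real-time cardiac assessment.

Details
English abstract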

Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

2605.13974 2026-05-15 cs.CV cs.AI cs.MM Updated version

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Evelyn Turri, Davide Bucciarelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia

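Affiliations * University of Modena and Reggio Emilia; University of Pisa

AI summary This paper studies a phenomenon in Diffusion Transformers (DiTs) called "massive activations," in which a small subset of hidden channels responds far more strongly than the rest. It finds that these few channels are functionally critical and dominate image generation quality; spatially organized, reflecting the main subject and salient regions of the image; and transferable, enabling semantic interpolation across prompts and subject-driven generation. These findings reveal a sparse semantic control mechanism hidden in DiT models and offer a new perspective for understanding and exploiting diffusion models.

Comments Project page: https://aimagelab.github.io/MAs-DiT/

Details
English abstract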

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
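
The disruption probe described in this abstract can be mimicked on a toy activation matrix. The selection rule used here (rank channels by mean absolute activation and take the top k) is an assumption for illustration only; the abstract does not state the paper's criterion, and `find_massive_channels` and `zero_channels` are hypothetical names.

```python
def find_massive_channels(hidden, k):
    """Return indices of the k channels with the largest mean absolute activation."""
    n_channels = len(hidden[0])
    mean_abs = [sum(abs(tok[c]) for tok in hidden) / len(hidden) for c in range(n_channels)]
    return sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)[:k]

def zero_channels(hidden, channels):
    """Copy the hidden states with the given channels set to zero for every token."""
    drop = set(channels)
    return [[0.0 if c in drop else v for c, v in enumerate(tok)] for tok in hidden]

# Toy hidden states: 3 tokens x 4 channels, with channel 2 far larger than the rest.
hidden = [[0.1, -0.2, 50.0, 0.3],
          [0.0, 0.4, -48.0, 0.1],
          [-0.3, 0.1, 52.0, -0.2]]
massive = find_massive_channels(hidden, k=1)
probed = zero_channels(hidden, massive)
```

In a real model the same masking would be applied to hidden states inside the network during sampling, and the control condition would zero an equally sized set of low-magnitude channels instead.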

2605.13923 2026-05-15 cs.LG cs.CV cs.RO cs.SY eess.SY Updated version

Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations

Bardh Hoxha, Oliver Schön, Hideki Okamoto, Lars Lindemann, Georgios Fainekos

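Affiliations * Toyota NA R&D; ETH Zürich

AI summary This paper studies certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. It proposes a method based on semantic latent representations that trains a reusable monitoring interface, providing finite-sample guarantees without per-formula retraining. At long horizons the method achieves higher certification precision than existing methods, and it is validated on a real-world driving dataset.

Details
English abstract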

We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample guarantees, while being reusable: once trained and calibrated, it should certify any formula in a target fragment without per-formula retraining. For fragments induced by a finite dictionary of temporal atoms, we prove that the semantic basis, the vector of atom robustness scores, is the minimum prediction target within the class of monotone, 1-Lipschitz reusable interfaces: any formula is evaluated by a deterministic decoder derived from the parse tree, and a single conformal calibration pass certifies the entire fragment with no union bound. We also introduce a rolling prediction monitor that predicts only current predicate values and reconstructs temporal history online; this is easier to learn but grows conservative at long horizons. On a pedestrian-crossroad benchmark, rolling achieves tighter certified bounds at short horizons while the semantic-basis monitor is up to 4 times tighter at long horizons. We validate the presented monitors on real-world Waymo driving data, where both monitors satisfy the conformal coverage guarantee empirically.
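
The reuse argument in this abstract can be illustrated with a small sketch. It assumes details the abstract does not give: formulas are built from atoms with conjunction and disjunction evaluated as min and max, and the calibration score is the largest absolute error over the predicted atom robustness vector. Because min and max are 1-Lipschitz in the sup norm, one conformal quantile `q` over that score bounds the error of every decoded formula at once. The names `conformal_quantile`, `decode`, `certified_lower_bound`, and the tuple encoding of formulas are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def decode(formula, atoms):
    """Evaluate a parse tree over atom robustness scores.

    formula is ("atom", i), ("and", f, g), or ("or", f, g) (hypothetical encoding).
    """
    op = formula[0]
    if op == "atom":
        return atoms[formula[1]]
    left, right = decode(formula[1], atoms), decode(formula[2], atoms)
    return min(left, right) if op == "and" else max(left, right)

def certified_lower_bound(formula, predicted_atoms, q):
    """Lower bound on true robustness, valid whenever every atom error is <= q."""
    return decode(formula, predicted_atoms) - q

# Calibration: sup-norm errors between predicted and true atom vectors (toy values).
calibration_scores = [0.10, 0.30, 0.20, 0.05, 0.25, 0.15, 0.12, 0.08, 0.22]
q = conformal_quantile(calibration_scores, alpha=0.25)

phi = ("and", ("atom", 0), ("or", ("atom", 1), ("atom", 2)))
bound = certified_lower_bound(phi, [0.9, 0.2, 0.6], q)
```

The same `q` can be reused for any other formula over the same atoms, which is the sense in which no per-formula calibration or union bound is needed in this simplified setting. Temporal operators and negation are left out of the sketch.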

2605.13910 2026-05-15 stat.ML cs.CV cs.LG Updated version

Covariance-aware sampling for Diffusion Models

Andrea Schioppa, Tim Salimans

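Affiliations * GDM - Amsterdam

AI summary This paper proposes a covariance-aware sampler that aims to improve the pixel-space generation quality of diffusion models with few sampling steps. By explicitly modeling the covariance of the reverse process, and combining Tweedie's formula with a Fourier-space decomposition, the method improves on conventional samplers that rely on mean prediction alone. Experiments show that, at the same number of function evaluations, it produces higher-quality samples for pixel-space diffusion models than state-of-the-art second-order samplers and the recent aDDIM sampler.

Details
English abstract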

We present a covariance-aware sampler that improves the quality of pixel-space Diffusion Model (DM) sampling in the few-step regime. We hypothesize that in the few-step regime samplers fail because they rely solely on the predicted mean of the reverse distribution, while our solution explicitly models the reverse-process covariance. Our method combines Tweedie's formula to estimate the covariance with an efficient, structured Fourier-space decomposition of the covariance matrix. Implemented as an extension of DDIM, our method requires only a minimal overhead: one extra Jacobian-Vector Product (JVP) per step. We demonstrate that for pixel-based DMs, our method consistently produces superior samples compared to state-of-the-art second order samplers (Heun, DPM-Solver++) and the recent aDDIM sampler, at an identical number of function evaluations (NFE).
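
For reference, the standard second-order Tweedie identities behind this idea can be written out. The notation is an assumption and is not taken from the abstract: a forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and marginal density $p_t$.

```latex
\begin{align}
\hat{x}_0(x_t) &:= \mathbb{E}[x_0 \mid x_t]
  = \frac{x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)}{\alpha_t}, \\
\operatorname{Cov}[x_0 \mid x_t]
  &= \frac{\sigma_t^2}{\alpha_t^2}\left(I + \sigma_t^2 \nabla_{x_t}^2 \log p_t(x_t)\right)
  = \frac{\sigma_t^2}{\alpha_t}\,\frac{\partial \hat{x}_0(x_t)}{\partial x_t}.
\end{align}
```

Under these assumptions, applying the posterior covariance to a vector $v$ is a scaled Jacobian-vector product of the denoiser, which is consistent with the one extra JVP per step mentioned in the abstract. How the paper structures the covariance in Fourier space is not specified here.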

2605.13889 2026-05-15 eess.IV cs.CV cs.LG Updated version

Physics-Grounded Adversarial Stain Augmentation with Calibrated Coverage Guarantees

Mingi Hong

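AI summary Stain variation across hospitals degrades the deployment performance of pathology models, and existing stain augmentation methods lack principled constraints on their parameters and coverage guarantees for unseen centers. This paper proposes Calibrated Adversarial Stain Augmentation (CASA), a physics-grounded method that calibrates the augmentation budget from multi-center statistics via the DKW inequality and performs adversarial augmentation in the Macenko stain parameter space. Experiments show that CASA achieves higher slide-level accuracy and worst-group accuracy on Camelyon17-WILDS, significantly outperforming the compared methods.

Details
English abstract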

Stain variation across hospitals degrades histopathology models at deployment. Existing augmentation methods perturb color spaces with arbitrary hyperparameters, lacking both a principled budget and coverage guarantees for unseen centers. We propose Calibrated Adversarial Stain Augmentation (CASA), which performs adversarial augmentation in the Macenko stain parameter space with a budget calibrated from multi-center statistics via the DKW inequality. On Camelyon17-WILDS (5 seeds), CASA achieves $93.9\% \pm 1.6\%$ slide-level accuracy, outperforming HED-strong ($88.4\% \pm 7.3\%$), RandStainNA ($85.2\% \pm 6.7\%$), and ERM ($63.9\% \pm 11.3\%$), with the highest worst-group accuracy ($84.9\% \pm 0.9\%$) among all 10 compared methods.
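
As a rough illustration of how the DKW inequality can turn observed statistics into a budget, the sketch below computes the DKW band half-width and a conservative upper quantile of a stain parameter. This is a generic construction, not the paper's procedure: the function names, the sample values, the i.i.d. assumption on the samples, and the choice of an upper quantile as the budget are all assumptions.

```python
import math

def dkw_epsilon(n, delta):
    """Half-width of the DKW confidence band for an empirical CDF of n samples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def dkw_upper_quantile(samples, q, delta):
    """Value that upper-bounds the true q-quantile with probability >= 1 - delta."""
    n = len(samples)
    level = q + dkw_epsilon(n, delta)
    if level > 1.0:
        return math.inf  # not enough samples to certify this quantile
    k = math.ceil(level * n)  # smallest k with k / n >= level
    return sorted(samples)[k - 1]

# Hypothetical per-slide deviations of one stain parameter across training centers.
deviations = [0.01 * i for i in range(1, 201)]  # 0.01, 0.02, ..., 2.00
eps = dkw_epsilon(len(deviations), delta=0.05)
budget = dkw_upper_quantile(deviations, q=0.8, delta=0.05)
```

The inflation from `q` to `q + eps` is what makes the bound hold for the unknown population distribution and not only for the observed sample; with more samples `eps` shrinks and the budget approaches the plain empirical quantile.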

2605.13869 2026-05-15 cs.NE cs.AI cs.CV Updated version

Elastic Spiking Transformers for Efficient Gesture Understanding

Alberto Ancilotto, Gianluca Amprimo, Stefano Di Carlo, Elisabetta Farella

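Affiliations * Fondazione Bruno Kessler; Politecnico di Torino

AI summary This paper proposes the Elastic Spiking Transformer for efficient gesture understanding. By introducing nested elasticity in the feature extraction, self-attention, and feed-forward modules, the model can be adjusted at runtime, changing network width and the number of attention heads to match hardware resources without retraining. This improves adaptability to different hardware memory limits and, by reducing the number of active neurons, lowers spike firing rates, which significantly reduces energy consumption and suits real-time gesture recognition on edge devices.

Details
English abstract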

Spiking Neural Networks (SNNs), particularly Spiking Transformers, offer energy-efficient processing of event-based sensor data for healthcare applications. Yet current architectures are rigid: they are trained and deployed as static networks with fixed parameter counts and computational graphs. This limits deployment on neuromorphic hardware such as Loihi and SpiNNaker, where on-chip constraints often require smaller models that trade accuracy for feasibility. We introduce the Elastic Spiking Transformer, a runtime-adaptive architecture that brings elasticity into the spiking paradigm. Inspired by Matryoshka-style representation learning, it embeds nested elasticity in the Feature Extractor, Spiking Self-Attention, and Feed-Forward blocks. Through granularity-aware weight sharing, a single universal model can dynamically slice network width and attention heads at inference time without retraining. This design provides two key advantages for SNNs. First, it allows the model to adjust its parameter footprint to different hardware memory budgets. Second, reducing active neurons also lowers spike firing rates, yielding proportional reductions in synaptic operations, an energy benefit not directly available in standard artificial neural networks. We evaluate the approach on CIFAR10/100, CIFAR10-DVS, and the EHWGesture clinical gesture understanding dataset. Results show that one Elastic Spiking Transformer spans a broad range of complexity-accuracy trade-offs, matching or surpassing independently trained baselines while supporting adaptive, real-time gesture recognition on resource-constrained edge devices.
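
The width-slicing idea can be sketched on a single linear layer. This is a generic Matryoshka-style illustration, not the paper's architecture: it assumes nested sub-networks reuse the leading rows and columns of one shared weight matrix, `slice_linear` is a hypothetical helper, and spiking dynamics are omitted.

```python
def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def slice_linear(weights, bias, in_width, out_width):
    """Keep the first out_width units and first in_width inputs of a shared layer."""
    return [row[:in_width] for row in weights[:out_width]], bias[:out_width]

# One shared 4x4 layer; smaller sub-networks are prefixes of it.
W = [[1.0, 0.5, 0.0, 2.0],
     [0.0, 1.0, 1.0, 0.0],
     [3.0, 0.0, 1.0, 1.0],
     [0.5, 0.5, 0.5, 0.5]]
b = [0.0, 1.0, 0.0, -1.0]

full = linear(W, b, [1.0, 2.0, 3.0, 4.0])
W_half, b_half = slice_linear(W, b, in_width=2, out_width=2)
half = linear(W_half, b_half, [1.0, 2.0])
```

Because the half-width layer is a view onto the same parameters, no second set of weights has to be stored; the smaller configuration simply computes fewer multiply-accumulates, which in a spiking network would also mean fewer neurons able to fire.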

2605.13862 2026-05-15 cs.GR cs.CV eess.IV Updated version

Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

Diandian Gu, Jing Lin, Gaohong Liu, Jiahang Liu, Su Ma, Guang Shi, Jun Wang, Qinlong Wang, Qianyi Wu, Zhongcong Xu, Xuanyu Yi, Zihao Yu, Jianfeng Zhang, Zhuolin Zheng, Yifan Zhu, Rui Chen, Hengkai Guo, Xiaoyang Guo, Mingcong Han, Xu Han, Xiu Li, Yixun Liang, Weiqiang Lou, Junzhe Lu, Guan Luo, Minghan Qin, Shuguang Wang, Yuang Wang

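Affiliations * ByteDance

AI summary This paper presents Seed3D 2.0, a 3D content generation system with substantial improvements in generation fidelity, simulation readiness, and application coverage. Its core methods include staged geometry generation, a locality-aware VAE, and improved texture and material generation, with a unified PBR model and semantic conditioning to raise generation quality and detail. The system also supports scene layout planning and part-level interaction generation, enabling highly consistent scene construction across physics and graphics engines; experiments show that it outperforms several commercial models in textured 3D asset generation.

Comments Seed3D 2.0 Technical Report; Official Page on https://seed.bytedance.com/seed3d_2_0

Details
English abstract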

We present Seed3D 2.0, an advanced 3D content generation system built on Seed3D 1.0, with substantial improvements across generation fidelity, simulation-ready capabilities, and application coverage. For geometry, a coarse-to-fine two-stage pipeline decouples global structure learning from high-frequency detail recovery, while a locality-aware VAE achieves higher spatial compression and more efficient decoding. For texture and material generation, we replace the cascaded pipeline of Seed3D 1.0 with a unified PBR model that directly generates multi-view albedo and metallic-roughness maps, enhanced by Mixture-of-Experts scaling and VLM-based semantic conditioning for improved material precision and visual fidelity. Beyond single-object generation, Seed3D 2.0 introduces a simulation-ready model suite comprising scene layout planning, part-aware decomposition, and training-free articulation generation, enabling coherent scene construction and part-level physical interaction across physics and graphics engines. A large-scale human preference study against five recent commercial models shows that Seed3D 2.0 achieves consistent win rates of 69.0% to 89.9% in textured 3D asset generation. Seed3D 2.0 is available on https://exp.volcengine.com/ark/vision?_vtm_=0.0.c70961.d701978.0&mode=vision&modelId=doubao-seed3d-2-0-260328&tab=Gen3D

2605.13857 2026-05-15 cs.GR cs.CV cs.LG Updated version

MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation

Dongxia Liu, Jie Ma, Xiaochen Yang, Jiancheng Zhang, Bin Xia, Zhehan Kan, Nisha Huang, Jun Liang, Wenming Yang, Jin Li

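Affiliations * Tsinghua University; University of Glasgow; The Chinese University of Hong Kong; HUIJING Digital Media & Entertainment Group

AI summary This paper proposes MoZoo, a method based on generative diffusion models for simulating animal fur and muscle dynamics, aimed at efficiently producing high-quality animal video effects. Using role-aware RoPE and asymmetric decoupled attention, it generates high-fidelity video from coarse meshes, and it introduces the MoZoo-Data dataset and the MoZooBench benchmark to support training and evaluation. Experiments show that MoZoo maintains strong spatiotemporal consistency and fur simulation quality across a variety of animal skeletons and layouts.

Comments GitHub page: https://dongxialiu15.github.io/MoZoo/

Details
English abstract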

The creation of cinematic-quality animal effects necessitates the precise modeling of muscle and fur dynamics, a process that remains both labor-intensive and computationally expensive within traditional production workflows. While generative diffusion models have shown promise in diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexploited. We present MoZoo, a generative dynamics solver that bypasses conventional refinement to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE) which employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, effectively preventing feature interference and improving computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark with 120 mesh-video pairs. Experimental results demonstrate that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts, preserving superior temporal and structural consistency.

2605.13855 2026-05-15 cs.GR cs.AI cs.CV Updated version

SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method

Wentao Yang, Fanzhen Kong, Zejian Kang, Xiangru Huang

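Affiliations * Zhejiang University; Westlake University

AI summary This paper proposes SparseOIT, a sparse 3D Gaussian Splatting (3DGS) reconstruction method based on Order-Independent Transparency (OIT), aimed at the shortcomings of conventional 3DGS on objects with non-Lambertian or transparent materials. Analyzing how OIT modifies the rendering equation, it finds that the dependencies among Gaussian splats are greatly reduced, so optimization techniques such as the active set method can be used to improve computational efficiency. SparseOIT combines the OIT rendering equation, the reconstruction algorithm, and geometric regularization to achieve efficient, high-quality 3D reconstruction; in experiments it outperforms other OIT methods and reaches the performance level of state-of-the-art 3DGS methods based on volumetric rendering.

Details
English abstract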

3D Gaussian Splatting (3DGS) has gained tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-Lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods proposes to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based methods is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the interdependence among individual Gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as the active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of Gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering. Project page:

2605.13854 2026-05-15 cs.CV cs.GR cs.MM eess.IV Updated version

Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

Minghao Sun, Chongyang Xu, Yitao Xie, Buzhen Huang, Kun Li

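Affiliations * Tianjin University; Nanyang Technological University; Sichuan University

AI summary This paper studies multi-person 3D reconstruction under severe occlusion and depth ambiguity, and proposes a contrastive multi-modal hypergraph reasoning method that fuses semantic, geometric, and pose information for crowd mesh recovery. The method initializes node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses, introduces a pelvis depth indicator as a global spatial anchor, and builds a shared-topology hypergraph to model higher-order crowd dynamics. A hypergraph-based contrastive learning scheme enhances intra-modal discriminability and cross-modal orthogonality and propagates global context effectively, giving more accurate reconstruction under severe occlusion. Experiments show that the method achieves new state-of-the-art performance on multiple benchmark datasets.

Comments ICME 2026

Details
English abstract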

Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at https://github.com/SunMH-try/CoMHR.

2605.13853 2026-05-15 cs.GR cs.AI cs.CV Updated version

FaceParts: Segmentation and Editing of Gaussian Splatting

Tymoteusz Zapała, Julia Farganus, Dominik Galus, Mikołaj Czachorowski, Piotr Syga, Przemysław Spurek

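Affiliations * Wrocław University of Science and Technology; Jagiellonian University

AI summary This paper proposes FaceParts, a framework for unsupervised facial segmentation and editing of Gaussian Splatting avatars. The method operates directly in the Gaussian domain and decomposes faces into semantically coherent parts without supervision; combining feature disentanglement, density-based clustering, and FLAME-anchored part transfer, it enables precise editing and cross-avatar part swapping. Experiments show good segmentation across multiple facial features, while maintaining identity consistency and natural adaptation to expression and pose.

Details
English abstract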

Facial editing is an important task with applications in entertainment, virtual reality, and digital avatars. Most existing approaches rely on generative models in the 2D image domain, while in 3D the task is typically performed through labor-intensive manual editing. We propose FaceParts, a framework for unsupervised segmentation and editing of Gaussian Splatting avatars. Unlike existing 2D or mesh-assisted methods, our approach operates directly in the Gaussian domain, decomposing avatars into semantically coherent facial parts without supervision. The method integrates feature disentanglement, density-based clustering, and FLAME-anchored part transfer, enabling precise editing and cross-avatar part swapping. Experiments on the NeRSemble dataset with 11 subjects demonstrate robust isolation of features such as beards, eyebrows, eyes and mustaches. Quantitative evaluation confirms that transferred segments adapt to pose and expression, while maintaining identity consistency (ID = 0.943), low Average Expression Distance (AED = 0.021) and low Average Pose Distance (APD = 0.004).

2605.13852 2026-05-15 cs.GR cs.CV cs.LG Updated version

Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, Or Litany

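Affiliations * Technion; Meta AI

AI summary Realiz3D is a framework for high-quality 3D generation through domain-aware learning, aimed at producing images that follow precise geometry and material controls while remaining photorealistic. By separating control signals from the visual domain and introducing residual adapters driven by a domain covariate, it lets a diffusion model learn controllability without depending on a specific visual domain, which mitigates the domain gap between real images and synthetic renders. Experiments show that Realiz3D performs well on multi-view generation and 3D texturing, producing images that are both 3D-consistent and highly realistic.

Comments Accepted to CVPR 2026. Project page: https://idosobol.github.io/realiz3d/

Details
English abstract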

We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to the domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models that decouples controls from the visual domain. The key idea is to explicitly learn the visual domain, real or synthetic, separately from other control signals by introducing a covariate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about the roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.

2605.12034 2026-05-15 cs.MM cs.AI cs.CV Updated version

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, Fei Tian

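Affiliations * StepFun-Audio Team

AI summary This paper studies how omni-modal language models perform when visual information has an outsized influence, and proposes a staged post-training method built on visually debiased evaluation. By removing visual shortcuts from existing benchmarks it constructs OmniClean, and on that basis designs OmniBoost, a three-stage post-training recipe comprising bi-modal SFT, mixed-modality reinforcement learning, and self-distillation. Experiments show that the method brings a small omni-modal model close to, or even above, a much larger model without a stronger teacher, demonstrating the effectiveness of staged post-training for omni-modal models.

Comments Project page: https://cheliu-computation.github.io/omni/

Details
English abstract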

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/

2605.09639 2026-05-15 eess.IV cs.CV Updated version

XTinyU-Net: Training-Free U-Net Scaling via Initialization-Time Sensitivity

Alvin Kimbowa, Moein Heidari, David Liu, Ilker Hacihaliloglu

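Affiliations * School of Biomedical Engineering, The University of British Columbia, Vancouver, Canada; Department of Radiology, The University of British Columbia, Vancouver, Canada; Department of Medicine, The University of British Columbia, Vancouver, Canada

AI summary U-Net architectures remain the mainstream choice for medical image segmentation, but resource-constrained settings require aggressive model compression. This paper proposes XTinyU-Net, a training-free model selection framework that uses a Jacobian-based sensitivity metric at initialization to automatically identify ultralightweight, dataset-specific U-Net configurations. Experiments show that XTinyU-Net retains segmentation accuracy comparable to the original nnU-Net with 400x to 1600x fewer parameters, and outperforms existing lightweight models.

Comments Early accepted to MICCAI 2026

Details
English abstract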

While U-Net architectures remain the gold standard for medical image segmentation, their deployment in resource-constrained environments demands aggressive model compression. However, finding an optimally efficient configuration is computationally prohibitive, typically requiring exhaustive train-and-evaluate cycles to find the smallest model that maintains peak performance. In this paper, we introduce a training-free selection framework to automatically identify ultralightweight, dataset-specific U-Net configurations directly at initialization. We observe that systematically scaling down U-Net channel width induces a sharp transition from a stable performance plateau to representational capacity collapse. To pinpoint this boundary without training, we propose a Jacobian-based sensitivity metric that scores discrete, width-capped U-Net variants using a small set of unlabeled images. By analyzing the total variation of this sensitivity curve, we isolate the smallest stable configuration, which we denote as XTinyU-Net. Evaluated across six diverse medical datasets within the nnU-Net framework, XTinyU-Net achieves segmentation accuracy comparable to the heavy nnU-Net baseline with 400x-1600x fewer parameters, and outperforms contemporary lightweight architectures while utilizing 5x-72x fewer parameters. Code is publicly accessible on https://github.com/alvinkimbowa/nntinyunet.git.

2605.07931 2026-05-15 cs.CV cs.AI Version update

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jing, De Ma, Gang Pan, Bin Liu

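Affiliations * Zhejiang University; Central South University; Harbin Institute of Technology; Embodied Intelligence General Platform Laboratory, Chery Auto; E-surfing Digital Life Technology Co., Ltd., China Telecom

AI Summary This paper studies how the world-model module in vision-language-action (VLA) models should be parameterized and proposes OneWM-VLA, which uses Adaptive Attention Pooling to compress each frame's visual information into a single semantic token, sharply reducing visual bandwidth. The method produces the latent visual stream and the action trajectory under a single flow-matching objective, without an additional decoder. Experiments show that it maintains long-horizon performance while substantially improving success rates on several complex tasks.

Details
English abstract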

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $π_0$).
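
To make the idea of compressing a frame into one token concrete, here is a minimal sketch of generic single-query attention pooling in plain Python. It is not the paper's Adaptive Attention Pooling, whose details the abstract does not give: the function name `attention_pool`, the scaled dot-product scoring, and the use of a single query vector are all assumptions for illustration.

```python
import math

def attention_pool(patch_tokens, query):
    """Pool a list of patch feature vectors into one vector.

    Hypothetical illustration: each patch is scored against a single
    query vector (assumed to be learned elsewhere), the scores are
    softmax-normalized, and the patches are averaged with those weights.
    """
    d = len(query)
    # Scaled dot-product score of every patch token against the query.
    scores = [
        sum(q * x for q, x in zip(query, tok)) / math.sqrt(d)
        for tok in patch_tokens
    ]
    # Numerically stable softmax over the patch scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of patch tokens gives the single pooled token.
    pooled = [
        sum(w * tok[j] for w, tok in zip(weights, patch_tokens))
        for j in range(d)
    ]
    return pooled, weights

# Usage: two 2-dimensional patch tokens pooled with a query that favors the first.
token, attn = attention_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

With an all-zero query the weights are uniform and the pooled token reduces to the mean of the patches, which is a quick sanity check on the sketch.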

2603.11045 2026-05-15 cs.LG cond-mat.mtrl-sci cs.AI cs.CV physics.ins-det Version update

Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette

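Affiliations * Princeton University

AI Summary This paper proposes Neural Field Thermal Tomography (NeFTY), a method for label-free three-dimensional inverse heat conduction. It represents the diffusivity as a continuous coordinate-based neural network and applies a differentiable implicit-Euler heat solver at every optimization step, so that the governing equation holds exactly on the discretization rather than as a soft constraint. Experiments show that NeFTY substantially outperforms soft-constrained physics-informed neural networks and a voxel-grid baseline on synthetic 3D benchmarks, and that on real thermography data it surpasses classical signal-processing baselines in defect segmentation and depth estimation.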

Comments 37 pages, 19 figures

Details
English abstract

Inverse problems for stiff parabolic partial differential equations (PDEs), such as the inverse heat conduction problem (IHCP), are severely ill-posed: the forward map rapidly damps high-frequency interior structure before it reaches the boundary. Soft-constrained physics-informed neural networks (PINNs), which embed the PDE as a residual penalty, suffer from gradient pathology in this regime and tend to fit boundary measurements while leaving the interior field essentially untouched. We propose Neural Field Thermal Tomography (NeFTY), a hard-constrained neural field framework for label-free three-dimensional inverse heat conduction. NeFTY represents the unknown diffusivity as a continuous coordinate-based neural network, and at every optimization step passes the candidate field through a differentiable implicit-Euler heat solver with harmonic-mean interface flux, so that the governing PDE holds exactly on the discretization rather than as a soft penalty. Adjoint gradients propagate the surface reconstruction error back to the network weights at solver-level memory cost, making test-time inversion tractable on a single GPU. Across synthetic 3D benchmarks, NeFTY substantially outperforms soft-constrained PINN variants and a voxel-grid baseline on label-free volumetric recovery, and it transfers to real thermography data, surpassing classical signal-processing baselines in both defect segmentation and depth estimation. Additional details at https://cab-lab-princeton.github.io/nefty/
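
As a rough illustration of what an implicit-Euler heat step with harmonic-mean interface flux looks like, the sketch below advances a 1D temperature profile by one time step. It is a simplified stand-in, not the authors' solver: it is one-dimensional, not differentiable, uses insulated (zero-flux) boundaries, and the function name `implicit_euler_step` and all parameters are assumptions.

```python
def implicit_euler_step(u, alpha, dt, dx):
    """One implicit-Euler step of u_t = d/dx(alpha * du/dx) on a 1D grid.

    Hypothetical sketch: alpha holds strictly positive cell diffusivities,
    interface values use the harmonic mean of the two neighboring cells,
    and both ends are insulated. Requires len(u) >= 2.
    """
    n = len(u)
    r = dt / dx ** 2
    # Harmonic-mean diffusivity at the interface between cells i and i+1.
    face = [2.0 * alpha[i] * alpha[i + 1] / (alpha[i] + alpha[i + 1])
            for i in range(n - 1)]
    lower = [0.0] * n
    diag = [1.0] * n
    upper = [0.0] * n
    for i in range(n):
        if i > 0:
            lower[i] = -r * face[i - 1]
            diag[i] += r * face[i - 1]
        if i < n - 1:
            upper[i] = -r * face[i]
            diag[i] += r * face[i]
    # Thomas algorithm: forward elimination, then back substitution.
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = u[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (u[i] - lower[i] * d[i - 1]) / denom
    out = [0.0] * n
    out[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        out[i] = d[i] - c[i] * out[i + 1]
    return out

# Usage: a hot spot next to a low-diffusivity region on the right.
u_next = implicit_euler_step([0.0, 0.0, 1.0, 0.0, 0.0],
                             [1.0, 1.0, 1.0, 0.1, 0.1], dt=0.1, dx=1.0)
```

Because the update solves the discrete equation exactly at each step, total heat is conserved under the insulated boundaries, and the harmonic mean lets a low-diffusivity cell throttle the flux across its interface.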

2603.03577 2026-05-15 cs.CV cs.RO Version update

From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes

Qifan Zhang, Sai Haneesh Allu, Jikai Wang, Yangxiao Lu, Yu Xiang

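Affiliations * IRVLUTD

AI Summary This paper studies how to detect and segment novel object instances in open-world scenes from a small set of template images. It proposes L2G-Det, a local-to-global detection framework that generates candidate points through dense patch-level matching between the templates and the query image, then prompts an augmented segmentation model to produce precise instance masks. The method avoids dependence on explicit object proposals and improves detection and segmentation under occlusion and background clutter.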

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project page: https://irvlutd.github.io/L2G/

Details
English abstract

Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.

2602.12105 2026-05-15 cs.GR cs.CV cs.LG Version update

Iskra: A System for Inverse Geometry Processing

Ana Dodik, Ahmed H. Mahmoud, Justin Solomon

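Affiliations * MIT CSAIL

AI Summary This paper proposes Iskra, a system that differentiates through solutions to geometry processing problems, supporting inverse geometry processing tasks. The system exploits existing efficient geometric algorithms such as local-global and ADMM solvers, and combines tensor-based workflows with the adjoint method to generate an efficient backward pass for user-specified code. It differentiates existing geometric algorithms without requiring them to be reformulated, offering low implementation effort, fast runtimes, and low memory use, and it is demonstrated on several geometry processing applications.

Details
English abstract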

We propose a system for differentiating through solutions to geometry processing problems. Our system differentiates a broad class of geometric algorithms, exploiting existing fast problem-specific schemes common to geometry processing, including local-global and ADMM solvers. It is compatible with machine learning frameworks, opening doors to new classes of inverse geometry processing applications. We marry the scatter-gather approach to mesh processing with tensor-based workflows and rely on the adjoint method applied to user-specified imperative code to generate an efficient backward pass behind the scenes. We demonstrate our approach by differentiating through mean curvature flow, spectral conformal parameterization, geodesic distance computation, and as-rigid-as-possible deformation, examining usability and performance on these applications. Our system allows practitioners to differentiate through existing geometry processing algorithms without needing to reformulate them, resulting in low implementation effort, fast runtimes, and lower memory requirements than differentiable optimization tools not tailored to geometry processing.

2602.04585 2026-05-15 cs.CV Version update

ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry

Dawid Uchal, Marcin Możejko, Krzysztof Gogolewski, Piotr Kupidura, Szymon Łukasik, Jakub Giezgała, Tomasz Nocoń, Kacper Pietrzyk, Robert Pieniuta, Mateusz Sulimowicz, Michal Orzyłowski, Tomasz Siłkowski, Karol Zagródka, Eike Staub, Ewa Szczurek

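Affiliations * Faculty of Mathematics, Informatics and Mechanics, University of Warsaw; Merck Healthcare KGaA; Institute of AI for Health, Helmholtz Munich

AI Summary This paper presents ImmuVis, a family of efficient foundation models for imaging mass cytometry (IMC) data. By introducing marker-adaptive hyperconvolutions, the model handles the lack of a fixed channel space in IMC data and can operate on the varying marker sets used across studies. ImmuVis is pretrained on the large-scale IMC17M dataset, has lower compute cost than transformer-based methods, performs well on virtual staining and classification tasks, and provides calibrated uncertainty estimates, offering a practical framework for real-world IMC modeling.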

Comments 38 pages, 19 figures

Details
English abstract

We present ImmuVis, a family of efficient foundation models for imaging mass cytometry (IMC), a high-throughput multiplex imaging technology that handles molecular marker measurements as image channels and enables large-scale spatial tissue profiling. Unlike natural images, multiplex imaging lacks a fixed channel space, as real-world marker sets vary across studies, violating a core assumption of standard vision backbones. To address this, ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings, enabling a single model to operate on arbitrary measured marker subsets without retraining. We pretrain ImmuVis on the largest dataset to date, IMC17M (28 cohorts, 24,405 images, 265 markers, over 17M patches), using self-supervised masked reconstruction. ImmuVis outperforms state-of-the-art baselines and ablations in virtual staining and downstream classification tasks at substantially lower compute cost than transformer-based alternatives, and is the sole model that provides calibrated uncertainty via a heteroscedastic likelihood objective. These results position ImmuVis as a practical framework for real-world IMC modeling.

2512.09115 2026-05-15 cs.CV Version update

SuperF: Neural Implicit Fields for Multi-Image Super-Resolution

Sander Riisøen Jyhne, Christian Igel, Morten Goodwin, Per-Arne Andersen, Serge Belongie, Nico Lang

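Affiliations * University of Agder; University of Copenhagen

AI Summary This paper proposes SuperF, a multi-image super-resolution method that improves optical resolution using multiple low-resolution images taken with sub-pixel shifts. The method is based on coordinate-based neural networks (neural fields): it shares one implicit neural representation (INR) across frames and jointly optimizes frame alignment and reconstruction, avoiding the "hallucinated" structures common in single-image super-resolution. SuperF does not rely on high-resolution training data, and experiments show high-quality results on satellite imagery and ground-level images from handheld cameras at upsampling factors of up to 8.

Comments Published at ICLR 2026, Project website: https://sjyhne.github.io/superf/, 23 pages, 13 figures, 8 tables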

Details
English abstract

High-resolution imagery is often hindered by limitations in sensor technology, atmospheric conditions, and costs. Such challenges occur in satellite remote sensing, but also with handheld cameras, such as our smartphones. Hence, super-resolution aims to enhance the image resolution algorithmically. Since single-image super-resolution requires solving an inverse problem, such methods must exploit strong priors, e.g. learned from high-resolution training data, or be constrained by auxiliary data, e.g. by a high-resolution guide from another modality. While qualitatively pleasing, such approaches often lead to "hallucinated" structures that do not match reality. In contrast, multi-image super-resolution (MISR) aims to improve the (optical) resolution by constraining the super-resolution process with multiple views taken with sub-pixel shifts. Here, we propose SuperF, a test-time optimization approach for MISR that leverages coordinate-based neural networks, also called neural fields. Their ability to represent continuous signals with an implicit neural representation (INR) makes them an ideal fit for the MISR task. The key characteristic of our approach is to share an INR for multiple shifted low-resolution frames and to jointly optimize the frame alignment with the INR. Our approach advances related INR baselines, adopted from burst fusion for layer separation, by directly parameterizing the sub-pixel alignment as optimizable affine transformation parameters and by optimizing via a super-sampled coordinate grid that corresponds to the output resolution. Our experiments yield compelling results on simulated bursts of satellite imagery and ground-level images from handheld cameras, with upsampling factors of up to 8. A key advantage of SuperF is that this approach does not rely on any high-resolution training data.

2512.02920 2026-05-15 cs.LG cs.CV cs.SI Version update

Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation

Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang

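Affiliations * Northeastern University

AI Summary This paper studies traffic accident prediction and causal analysis using road network data and satellite imagery. The authors construct a multimodal dataset spanning six U.S. states, with nine million accident records and one million high-resolution satellite images, annotated with weather statistics, road type, and traffic volume, and evaluate multimodal learning methods that integrate visual and network embeddings. Experiments show that combining the two modalities improves prediction, reaching an average AUROC of 90.1%, and that precipitation, road type, and seasonal patterns significantly affect accident rates.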

Comments 17 pages. Appeared in KDD 2026

Details
English abstract

We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset spanning six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of 90.1%, a 3.7% gain over graph neural network models that use only graph structures. With the improved embeddings, we conduct a causal analysis using a matching estimator to identify the key factors influencing traffic accidents. We find that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.

2510.18326 2026-05-15 cs.CV Version update

Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ABHFA-Net

Gao Yu Lee, Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, Vu Duong

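Affiliations * School of Mechanical and Aerospace Engineering (MAE), NTU; Department of Computer Science, The University of New Orleans

AI Summary With natural and human-induced disasters on the rise, there is a pressing need for visual recognition systems that remain robust when labeled data is limited. This paper proposes the Attention Bhattacharyya Distance-based Feature Aggregation Network (ABHFA-Net) to improve few-shot classification on benchmark and disaster imagery. The method models class prototypes as probability distributions and classifies via Bhattacharyya distance, and adds a spatial channel attention mechanism and a Bhattacharyya-based contrastive softmax loss to improve feature discrimination and class separability. Experiments show strong performance on several benchmark and real-world disaster datasets, with consistent gains on disaster image classification.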

Comments Revised and Submitted to SN Computer journal

Details
English abstract

The rising incidence of natural and human-induced disasters necessitates robust visual recognition systems capable of operating under limited labeled data conditions. However, disaster-related image classification remains challenging due to data scarcity, high intra-class variability, and domain-specific complexities in remote sensing imagery. To address these challenges, we propose the Attention Bhattacharyya Distance-based Feature Aggregation Network (ABHFA-Net), a novel few-shot learning (FSL) framework that models class prototypes as probability distributions and performs classification via Bhattacharyya distance-based comparison. Our approach integrates a spatial channel attention mechanism to enhance discriminative feature learning in the few-shot context and introduces a Bhattacharyya-based contrastive softmax loss for improved class separability. Extensive experiments on both benchmark datasets (CIFAR-FS, FC-100, miniImageNet, tieredImageNet) and real-world disaster datasets (AIDER, CDD, MEDIC) demonstrate the effectiveness of the proposed method. In particular, ABHFA-Net achieves 80.7% and 92.3% accuracy on CIFAR-FS under 5-way 1-shot and 5-shot settings, respectively, outperforming existing state-of-the-art methods. On disaster datasets, the model consistently improves classification performance, achieving up to 68.2% (1-shot) and 78.3% (5-shot) accuracy on AIDER, highlighting its robustness in real-world scenarios. These results establish ABHFA-Net as a strong and practical solution for few-shot disaster image classification, particularly in data-scarce and time-critical remote sensing applications. The code repository for our work is available at https://github.com/GreedYLearner1146/ABHFA-Net.
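
For readers unfamiliar with classifying by Bhattacharyya distance, the sketch below computes the standard closed-form distance between two Gaussians with diagonal covariance and picks the nearest class prototype. Treating prototypes as diagonal Gaussians is an assumption made here for illustration, since the abstract does not state how the distributions are parameterized; the function names and the class labels are hypothetical.

```python
import math

def bhattacharyya_diag_gaussians(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Per dimension: (1/8) * (mu1 - mu2)^2 / v_avg
                 + (1/2) * ln(v_avg / sqrt(var1 * var2)),
    where v_avg is the mean of the two variances (all variances > 0).
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v_avg = 0.5 * (v1 + v2)
        dist += 0.125 * (m1 - m2) ** 2 / v_avg
        dist += 0.5 * math.log(v_avg / math.sqrt(v1 * v2))
    return dist

def classify(query_mu, query_var, prototypes):
    """Return the label whose (mean, variance) prototype is closest.

    `prototypes` is a hypothetical dict: label -> (mean list, variance list).
    """
    return min(
        prototypes,
        key=lambda label: bhattacharyya_diag_gaussians(
            query_mu, query_var, *prototypes[label]
        ),
    )

# Usage with two made-up class prototypes in a 2-dimensional feature space.
prototypes = {
    "class_a": ([0.0, 0.0], [1.0, 1.0]),
    "class_b": ([5.0, 5.0], [1.0, 1.0]),
}
predicted = classify([0.5, 0.2], [1.0, 1.0], prototypes)
```

Unlike a plain Euclidean distance between prototype means, this distance also grows when the two variances differ, so spread is part of the comparison.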

2510.16196 2026-05-15 cs.CV cs.AI Version update

Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

Zheng Huang, Enpei Zhang, Weikang Qiu, Yinghao Cai, Carl Yang, Elynn Chen, Xiang Zhang, Rex Ying, Dawei Zhou, Yujun Yan

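Affiliations * Dartmouth College; Yale University; Emory University; New York University; UNC Charlotte; Virginia Tech

AI Summary This paper studies how to reconstruct visual stimuli from functional magnetic resonance imaging (fMRI) signals in order to understand how the brain encodes visual information. It finds that fMRI signals are more similar to the text space of a language model than to a vision-based or joint text-image space, and argues that a structured text space better captures the compositional nature of visual stimuli. Building on these findings, the authors propose PRISM, which projects fMRI signals into a structured text space and combines an object-centric diffusion module with an attribute-relationship search module, achieving up to an 8% reduction in perceptual loss on real-world datasets.

Details
English abstract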

Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

2505.17353 2026-05-15 cs.CV cs.AI cs.LG eess.IV Version update

Dual Ascent Diffusion for Inverse Problems

Minseo Kim, Axel Levy, Gordon Wetzstein

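Affiliations * Stanford University

AI Summary This paper studies how to use diffusion models to address ill-posed inverse problems and proposes a new method based on a dual ascent optimization framework. On image restoration tasks the method yields better image quality, greater robustness to measurement noise, and faster computation, while estimating solutions that are more faithful to the observations.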

Comments Project page: https://soniaminseokim.github.io/ddiff/

Details
English abstract

Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.
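
For context on the optimization scheme named in the abstract, the sketch below runs textbook dual ascent on a toy problem: minimize 0.5 * ||x - c||^2 subject to a single linear constraint a . x = b. It shows only the alternating primal minimization and dual gradient step; it does not involve a diffusion prior and is not the paper's algorithm. The function name, problem, and step size are assumptions.

```python
def dual_ascent(c, a, b, step=0.25, iters=200):
    """Textbook dual ascent for: min 0.5*||x - c||^2  s.t.  a . x = b.

    Hypothetical toy setup. Each iteration:
      1. minimizes the Lagrangian over x for the current multiplier y,
         which here has the closed form x = c - y * a;
      2. moves y along the constraint residual (gradient ascent on the dual).
    Converges when step < 2 / ||a||^2.
    """
    y = 0.0
    x = list(c)
    for _ in range(iters):
        # Primal step: argmin_x of 0.5*||x - c||^2 + y * (a . x - b).
        x = [ci - y * ai for ci, ai in zip(c, a)]
        # Dual step: ascend along the constraint violation.
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        y += step * residual
    return x, y

# Usage: project the point (3, 1) onto the line x0 + x1 = 2.
x_opt, y_opt = dual_ascent([3.0, 1.0], [1.0, 1.0], 2.0)
```

In this toy case the iterates approach the orthogonal projection of c onto the constraint set, which gives a closed-form answer to check against.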