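arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.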

2603.17413 2026-03-19 cs.CV

Towards Motion-aware Referring Image Segmentation

Chaeyun Kim, Seunghoon Yi, Yejin Kim, Yohan Jo, Joonseok Lee

Comments Accepted at AISTATS 2026. * Equal contribution

Abstract

Referring Image Segmentation (RIS) requires identifying objects in images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, along with a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show that our method substantially improves performance on motion-centric queries across multiple RIS models while maintaining competitive results on appearance-based descriptions. Code is available at https://github.com/snuviplab/MRaCL

2603.17412 2026-03-19 cs.CV cs.LG

Mutually Causal Semantic Distillation Network for Zero-Shot Learning

Shiming Chen, Shuhuang Chen, Guo-Sen Xie, Xinge You

Comments Accepted to IJCV. arXiv admin note: text overlap with arXiv:2203.03137

2603.17408 2026-03-19 cs.CV cs.AI

Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression

Xinning Chai, Zhengxue Cheng, Xin Li, Rong Xie, Li Song

Comments Accepted by IEEE Transactions on Broadcasting

Abstract

Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training a separate diffusion model for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. In ultra-low bitrate regimes, however, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit a compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high-fidelity, high-realism restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design. Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction.

2603.17405 2026-03-19 cs.LG

Causal Representation Learning on High-Dimensional Data: Benchmarks, Reproducibility, and Evaluation Metrics

Alireza Sadeghi, Wael AbdAlmageed

2603.17403 2026-03-19 cs.LG

Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching

Yaozhong Shi, Grigorios Lavrentiadis, Konstantinos Tsalouchidis, Zachary E. Ross, David McCallen, Caifeng Zou, Kamyar Azizzadenesheli, Domniki Asimaki

2603.17398 2026-03-19 cs.CV

Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion

Rui Hong, Shuxue Quan

Comments 6 pages, 3 figures, 4 tables. Published at IS&T Electronic Imaging 2026, GENAI Track

2603.17390 2026-03-19 cs.CV

Harnessing the Power of Foundation Models for Accurate Material Classification

Qingran Lin, Fengwei Yang, Chaolun Zhu

2603.17385 2026-03-19 cs.LG

The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions

Rui Wu, Hong Xie, Yongjun Li

Comments 33 pages, 6 figures. Submitted to the Journal of Machine Learning Research (JMLR)

2603.17384 2026-03-19 cs.LG

Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models

Rui Wu, Hong Xie, Yongjun Li

Comments 34 pages, 5 figures. Submitted to JMLR

2603.17382 2026-03-19 cs.CV

VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm

Hongbo Lu, Liang Yao, Chenghao He, Fan Liu, Wenlong Liao, Tao He, Pai Peng

2603.17378 2026-03-19 cs.LG cs.AI

Efficient Exploration at Scale

Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy

2603.17375 2026-03-19 cs.CV

Stereo World Model: Camera-Guided Stereo Video Generation

Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi

Comments Project Page: https://sunyangtian.github.io/StereoWorld-web/

2603.17374 2026-03-19 cs.CV

Shot-Aware Frame Sampling for Video Understanding

Mengyu Zhao, Di Fu, Yongyu Xie, Jiaxing Zhang, Zhigang Yuan, Shirin Jalali, Yong Cao

2603.17373 2026-03-19 cs.CL

SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems

Rima Hazra, Bikram Ghuku, Ilona Marchenko, Yaroslava Tokarieva, Sayan Layek, Somnath Banerjee, Julia Stoyanovich, Mykola Pechenizkiy

2603.17372 2026-03-19 cs.CV cs.AI

Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen

2603.17370 2026-03-19 cs.CV

Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes

Umangi Jain, Vladimir Kim, Matheus Gadelha, Igor Gilitschenski, Zhiqin Chen

Comments Project Page: https://umangi-jain.github.io/material-magic-wand

2603.17368 2026-03-19 cs.AI

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Minling Zhang

2603.17365 2026-03-19 cs.LG math.PR

Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning

Ziran Liu

Comments 37 pages

2603.17360 2026-03-19 cs.CV

MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval

Xuri Ge, Chunhao Wang, Xindi Wang, Zheyun Qin, Zhumin Chen, Xin Xin

Comments Accepted by The Web Conference 2026 (WWW2026)

2603.17358 2026-03-19 cs.CV eess.IV

A 3D Reconstruction Benchmark for Asset Inspection

James L. Gray, Nikolai Goncharov, Alexandre Cardaillac, Ryan Griffiths, Jack Naylor, Donald G. Dansereau

Comments 29 pages, 15 figures, 8 tables

2603.17355 2026-03-19 cs.CV

OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery

Yiwen Zhao, Ce Zheng, Yufu Wang, Hsueh-Han Daniel Yang, Liting Wen, Laszlo A. Jeni

Comments Accepted by CVPR 2026

2603.17354 2026-03-19 cs.LG cs.CL

Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity

Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang, Ruobing Xie, Lei Jiang, Hayden Kwok-Hay So, Ngai Wong

2603.17351 2026-03-19 cs.RO

OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms

Zhongyuang Liu, Min He, Shaonan Yu, Xinhang Xu, Muqing Cao, Jianping Li, Jianfei Yang, Lihua Xie

Abstract

Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. Existing systems remain limited in real indoor environments because narrow field-of-view sensing exposes only a partial local scene at each step, often forcing repeated rotations, delaying target discovery, and producing fragmented spatial understanding; meanwhile, directly prompting LLMs with dense 3D maps or exhaustive object lists quickly exceeds the context budget. We present OmniVLN, a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. OmniVLN fuses a rotating LiDAR and panoramic vision into a hardware-agnostic mapping stack, incrementally constructs a five-layer Dynamic Scene Graph (DSG) from mesh geometry to room- and building-level structure, and stabilizes high-level topology through persistent-homology-based room partitioning and hybrid geometric/VLM relation verification. For navigation, the global DSG is transformed into an agent-centric 3D octant representation with multi-resolution spatial attention prompting, enabling the LLM to progressively filter candidate rooms, infer egocentric orientation, localize target objects, and emit executable navigation primitives while preserving fine local detail and compact long-range memory. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline. We will release the code and an omnidirectional multimodal dataset to support reproducible research.

2603.17343 2026-03-19 cs.CV

EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection

Chenyang Zhu, Maorong Wang, Jun Liu, Ching-Chun Chang, Isao Echizen

2603.17333 2026-03-19 cs.CL

Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures

Risham Sidhu, Julia Hockenmaier

Comments preprint

2603.17328 2026-03-19 cs.AI cs.LG

A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

Weiming Wu, Zi-Jian Cheng, Jie Meng, Peng Zhen, Shan Huang, Qun Li, Guobin Wu, Lan-Zhe Guo

2603.17326 2026-03-19 cs.CV

FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

Peisen Zhao, Xiaopeng Zhang, Mingxing Xu, Ruoyu Sun, Zewei Du, Dunzheng Wang, Guanghao Zheng, Haohang Xu, Zhibo Zhang, Yuhang Zhang, Yi Ai, Lin Liu, Qi Tian

2603.17325 2026-03-19 cs.CV

MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation

Thuy Truong Tran, Minh Kha Do, Phuc Nguyen Duy, Min Hun Lee

2603.17324 2026-03-19 cs.AI cs.LG

ShuttleEnv: An Interactive Data-Driven RL Environment for Badminton Strategy Modeling

Ang Li, Xinyang Gong, Bozhou Chen, Yunlong Lu, Jiaming Ji, Yongyi Wang, Yaodong Yang, Wenxin Li

2603.17323 2026-03-19 cs.RO

DexEXO: A Wearability-First Dexterous Exoskeleton for Operator-Agnostic Demonstration and Learning

Alvin Zhu, Mingzhang Zhu, Beom Jun Kim, Jose Victor S. H. Ramos, Yike Shi, Yufeng Wu, Raayan Dhar, Fuyi Yang, Ruochen Hou, Hanzhang Fang, Quanyou Wang, Yuchen Cui, Dennis W. Hong

Comments https://dexexo-research.github.io/