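arXivDaily daily academic digest: synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.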

2604.09231 2026-04-13 cs.CV

Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

Huiang He, Shengchu Zhao, Jianwen Huang, Jie Li, Jiaqi Wu, Hu Zhang, Pei Tang, Heliang Zheng, Yukun Li, Rongfei Jia

Comments 13 pages

Abstract

Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.

2604.09222 2026-04-13 cs.SD cs.AI

GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

Yunqiang Wang, Hengyuan Na, Di Wu, Miao Hu, Guocong Quan

Comments Under Review

2604.09220 2026-04-13 cs.CV

TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference

Muhammad Hannan Akhtar, Ihab Amer, Tamer Shanableh

Comments Submitted to "Computers and Electrical Engineering", Elsevier

Abstract

Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant-time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate- or high-capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of low-precision inference is examined through both post-training quantization and quantization-aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality-efficiency trade-offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV-style models in resource-constrained and real-time environments. The official implementation is available at https://github.com/HannanAkhtar/TinyNeRV-Implementation.

2604.09213 2026-04-13 cs.CV

SHIFT: Steering Hidden Intermediates in Flow Transformers

Nina Konovalova, Andrey Kuznetsov, Aibek Alanov

2604.09212 2026-04-13 cs.CL cs.MA

SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

Han Luo, Guy Laban

Comments Accepted to Findings of the Association for Computational Linguistics (ACL 2026). Our code and data are available at https://github.com/lhannnn/SPASM

2604.09210 2026-04-13 cs.CV

Adding Another Dimension to Image-based Animal Detection

Vandita Shukla, Fabio Remondino, Benjamin Risse

Comments CV4Animals Workshop 2025

2604.09206 2026-04-13 cs.CV

Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

Jiahao Wang, Zikun Xu, Yuner Zhang, Zhongwei Jiang, Chenyang Lu, Shuocheng Yang, Yuxuan Wang, Jiaru Zhong, Chuang Zhang, Shaobing Xu, Jianqiang Wang

Comments Accepted by CVPR 2026

2604.09202 2026-04-13 cs.LG cs.AI

On the Role of DAG topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach

Anas Hattay, Fred Ngole Mboula, Eric Gascard, Zakaria Yahoun

2604.09201 2026-04-13 cs.CV

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

2604.09199 2026-04-13 cs.CV

Globally Optimal Pose from Orthographic Silhouettes

Agniva Sengupta, Dilara Kuş, Jianning Li, Stefan Zachow

2604.09197 2026-04-13 cs.CV cs.AI

Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma

Francesca Fati, Felipe Coutinho, Marika Reinius, Marina Rosanu, Gabriel Funingana, Luigi De Vitis, Gabriella Schivardi, Hannah Clayton, Alice Traversa, Zeyu Gao, Guilherme Penteado, Shangqi Gao, Francesco Pastori, Ramona Woitek, Maria Cristina Ghioni, Giovanni Damiano Aletti, Mercedes Jimenez-Linan, Sarah Burge, Nicoletta Colombo, Evis Sala, Maria Francesca Spadea, Timothy L. Kline, James D. Brenton, Jaime Cardoso, Francesco Multinu, Elena De Momi, Mireia Crispin-Ortuzar, Ines P. Machado

2604.09195 2026-04-13 cs.AI

Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation

Haobo Hu, Qi Mao, Yuanhang Li, Libiao Jin

2604.09189 2026-04-13 cs.CL cs.AI cs.LG

Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

Avni Mittal

2604.09188 2026-04-13 cs.SD

LatentFlowSR: High-Fidelity Audio Super-Resolution via Noise-Robust Latent Flow Matching

Fei Liu, Yang Ai, Hui-Peng Du, Yu-Fei Shi, Zhen-Hua Ling

2604.09181 2026-04-13 cs.CV cs.LG

MixFlow: Mixed Source Distributions Improve Rectified Flows

Nazir Nayal, Christopher Wewer, Jan Eric Lenssen

2604.09175 2026-04-13 cs.LG cs.AI math.ST stat.ML stat.TH

Generalization and Scaling Laws for Mixture-of-Experts Transformers

Mansour Zoubeirou a Mayaki

2604.09169 2026-04-13 cs.CV

UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation

Le-Van Thai, Tien Dat Nguyen, Hoai Nhan Pham, Lan Anh Dinh Thi, Duy-Dong Nguyen, Ngoc Lam Quang Bui

Comments Accepted at CVPR 2026 Workshop. 11 pages, 5 figures, 4 tables

2604.09167 2026-04-13 cs.CV cs.MA

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang

2604.09164 2026-04-13 cs.CV

Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

Yicheng Qiu, Keiji Yanai

Comments ICME2026

2604.09159 2026-04-13 cs.LG

Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling

Xubin Zhou, Yipeng Yang, Zhan Li

2604.09156 2026-04-13 cs.RO math.DS

On the Terminology and Geometric Aspects of Redundant Parallel Manipulators

Andreas Mueller

2604.09155 2026-04-13 cs.LG cs.AI

CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation

Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, Lequan Yu

2604.09151 2026-04-13 cs.CV nlin.PS

Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery

Sara Ameli

2604.09150 2026-04-13 cs.CL

Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning

Yi Sui, Chaozhuo Li, Dawei Song

2604.09145 2026-04-13 cs.CV

Deep Light Pollution Removal in Night Cityscape Photographs

Hao Wang, Xiaolin Wu, Xi Zhang, Baoqing Sun

Comments 17 pages, supplementary material included

2604.09143 2026-04-13 cs.LG stat.ME

Score-Driven Rating System for Sports

Vladimír Holý, Michal Černý

2604.09142 2026-04-13 cs.CV

Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang

Abstract

Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.

2604.09130 2026-04-13 cs.LG cs.AI physics.comp-ph

EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers

Yi-Lun Liao, Alexander J. Hoffman, Sabrina C. Shen, Alexandre Duval, Sam Walton Norwood, Tess Smidt

2604.09127 2026-04-13 cs.CV

FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

2604.09125 2026-04-13 cs.CV

Few-Shot Personalized Age Estimation

Jakub Paplhám, Vojtěch Franc, Artem Moroz