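arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.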

2603.24454 2026-03-26 cs.CV

Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang

Comments 14 pages, 7 figures, accepted by CVPR 2026

Abstract

Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength -- the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue -- an Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.

2603.24440 2026-03-26 cs.LG cs.AI cs.CV

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar

Comments Project Page: https://cua-suite.github.io/

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
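
The duration figures in this abstract can be sanity-checked with a small calculation. The sketch below assumes that each screenshot or frame counts as one frame of 30 fps video, which is how the abstract appears to make the comparison; the helper name `frames_to_hours` is hypothetical and is not taken from the paper or its code.

```python
def frames_to_hours(frame_count: int, fps: float = 30.0) -> float:
    """Convert a frame count to hours of video at a fixed frame rate."""
    seconds = frame_count / fps
    return seconds / 3600.0


# ScaleCUA: 2 million screenshots, treated here as 30 fps frames (assumption).
scalecua_hours = frames_to_hours(2_000_000)

# VideoCUA: about 6 million frames recorded at 30 fps.
videocua_hours = frames_to_hours(6_000_000)

print(f"ScaleCUA equivalent: {scalecua_hours:.1f} h")
print(f"VideoCUA: {videocua_hours:.1f} h")
```

Under that assumption, 2 million frames come to roughly 18.5 hours and 6 million frames to roughly 55.6 hours, consistent with the "less than 20 hours" and "approximately 55 hours" stated above.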

2603.24432 2026-03-26 cs.SD cs.CL

What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification

Massa Baali, Sarthak Bisht, Rita Singh, Bhiksha Raj

2603.24431 2026-03-26 cs.LG physics.comp-ph physics.data-an physics.flu-dyn

Learning Response-Statistic Shifts and Parametric Roll Episodes from Wave–Vessel Time Series via LSTM Functional Models

Jose del Aguila Ferrandis

2603.24430 2026-03-26 cs.SD

Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation

Shengfan Shen, Di Wu, Xingchen Song, Dinghao Zhou, Liumeng Xue, Meng Meng, Jian Luan, Shuai Wang

Comments submitted to Interspeech 2026, under review

2603.24428 2026-03-26 cs.LG

Marchuk: Efficient Global Weather Forecasting from Mid-Range to Sub-Seasonal Scales via Flow Matching

Arsen Kuzhamuratov, Mikhail Zhirnov, Andrey Kuznetsov, Ivan Oseledets, Konstantin Sobolev

2603.24413 2026-03-26 cs.CL

PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation

Manoj Balaji Jagadeeshan, Atul Singh, Nallani Chakravartula Sahith, Amrith Krishna, Pawan Goyal

2603.24407 2026-03-26 cs.CV

Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation

Ching-Lam Cheng, Bin Zhu, Shengfeng He

Comments 5 pages, accepted by ICASSP2026

2603.24401 2026-03-26 cs.RO

Enhancing Drone Light Shows Performances: Optimal Allocation and Trajectories for Swarm Drone Formations

Yunes Alqudsi

2603.24393 2026-03-26 cs.RO

3D-Mix for VLA: A Plug-and-Play Module for Integrating VGGT-based 3D Information into Vision-Language-Action Models

Bin Yu, Shijie Lian, Xiaopeng Lin, Zhaolong Shen, Yuliang Wei, Haishan Liu, Changti Wu, Hang Yuan, Bailing Wang, Cong Huang, Kai Chen

Comments 13 pages

2603.24389 2026-03-26 cs.CL cs.AI cs.CY

When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu

Comments Accepted to AIED 2026, Project page: https://qingyonghu.github.io/Interaction2Eval/

2603.24384 2026-03-26 cs.LG stat.ML

On the Use of Bagging for Local Intrinsic Dimensionality Estimation

Kristóf Péter, Ricardo J. G. B. Campello, James Bailey, Michael E. Houle

Comments Main document: 10 pages, 5 figures; Appendix: 38 pages, 27 figures

2603.24383 2026-03-26 cs.CV

ViHOI: Human-Object Interaction Synthesis with Visual Priors

Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding

Comments Accepted to CVPR 2026

2603.24382 2026-03-26 cs.LG cs.AI cs.CE

MolEvolve: LLM-Guided Evolutionary Search for Interpretable Molecular Optimization

Xiangsen Chen, Ruilong Wu, Yanyan Lan, Ting Ma, Yang Liu

2603.24376 2026-03-26 cs.CV

GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization

Pengyue Jia, Derong Xu, Yingyi Zhang, Xiaopeng Li, Wenlin Zhang, Yi Wen, Yuanshao Zhu, Xiangyu Zhao

2603.24375 2026-03-26 cs.CL

Towards Reward Modeling for AI Tutors in Math Mistake Remediation

Kseniia Petukhova, Ekaterina Kochmar

2603.24373 2026-03-26 cs.CV

PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu

2603.24372 2026-03-26 cs.CL

Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

Arsen Shebzukhov

Comments 10 pages, 10 figures, pages 10-27 appendix

2603.24366 2026-03-26 cs.LG cs.RO

CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control

Yifeng Zhang, Harsh Goel, Peizhuo Li, Mehul Damani, Sandeep Chinchali, Guillaume Sartoretti

Comments © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract

Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL-based framework designed to improve intra-neighborhood traffic by enhancing decision-making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network-level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents' capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor-aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision-making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state-of-the-art traffic signal control methods over three real-world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at https://github.com/marmotlab/CoordLight

2603.24361 2026-03-26 cs.RO

LATS: Large Language Model Assisted Teacher-Student Framework for Multi-Agent Reinforcement Learning in Traffic Signal Control

Yifeng Zhang, Peizhuo Li, Tingguang Zhou, Mingfeng Fan, Guillaume Sartoretti

2603.24357 2026-03-26 cs.RO

A Sensorless, Inherently Compliant Anthropomorphic Musculoskeletal Hand Driven by Electrohydraulic Actuators

Misato Sonoda, Ronan Hinchet, Amirhossein Kazemipour, Yasunori Toshimitsu, Robert K. Katzschmann

Comments This work has been submitted to the IEEE for possible publication

2603.24355 2026-03-26 cs.CV cs.AI

Language-Guided Structure-Aware Network for Camouflaged Object Detection

Min Zhang

2603.24322 2026-03-26 cs.CV

Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

Shiqin Wang, Haoyang Chen, Huaizhou Huang, Yinkan He, Dongfang Sun, Xiaoqing Chen, Xingyu Liu, Zheng Wang, Kaiyan Zhao

Comments Accepted by CVPR 2026

2603.24318 2026-03-26 cs.RO cs.AI

Toward Generalist Neural Motion Planners for Robotic Manipulators: Challenges and Opportunities

Davood Soleymanzadeh, Ivan Lopez-Sanchez, Hao Su, Yunzhu Li, Xiao Liang, Minghui Zheng

2603.24312 2026-03-26 cs.CV

Refining time-space traffic diagrams: A neighborhood-adaptive linear regression method

Zhihong Yao, Yi Yu, Yunxia Wu, Hao Li, Yangsheng Jiang, Zhengbing He

DOI: 10.1109/TITS.2026.3678329
Journal ref: IEEE Transactions on Intelligent Transportation Systems, 2026

Abstract

The time-space (TS) traffic diagram serves as a crucial tool for characterizing the dynamic evolution of traffic flow, with its resolution directly influencing the effectiveness of traffic theory research and engineering applications. However, constrained by monitoring precision and sampling frequency, existing TS traffic diagrams commonly suffer from low resolution. To address this issue, this paper proposes a refinement method for TS traffic diagrams based on neighborhood-adaptive linear regression. Introducing the concept of neighborhood embedding into TS diagram refinement, the method leverages local pattern similarity in TS diagrams, adaptively identifies neighborhoods similar to target cells, and fits the low-to-high resolution mapping within these neighborhoods for refinement. It avoids the over-smoothing tendency of the traditional global linear model, allows the capture of unique traffic wave propagation and congestion evolution characteristics, and outperforms the traditional neighborhood embedding method in terms of local information utilization to achieve target cell refinement. Validation on two real datasets across multiple scales and upscaling factors shows that, compared to benchmark methods, the proposed method achieves improvements of 9.16%, 8.16%, 1.86%, 3.89%, and 5.83% in metrics including MAE, MAPE, CMJS, SSIM, and GMSD, respectively. Furthermore, the proposed method exhibits strong generalization and robustness in cross-day and cross-scenario validations. In summary, requiring only a minimal amount of paired high- and low-resolution training data, the proposed method features a concise formulation, providing a foundation for the low-cost, fine-grained refinement of low-sampling-rate traffic data.
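
The general idea of fitting a low-to-high resolution mapping only within a neighborhood of similar cells can be illustrated with a toy example. This is a minimal sketch under strong simplifying assumptions, not the paper's method: each cell is reduced to a single scalar value, similarity is the absolute difference between values, and the neighborhood is a fixed `k` nearest training cells rather than an adaptively chosen one. All names (`fit_line`, `refine_cell`, `train_low`, `train_high`) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # All neighbors share the same x: fall back to a constant prediction.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def refine_cell(target_low, train_low, train_high, k=4):
    """Predict a high-resolution value from the k most similar training cells."""
    # Neighborhood: indices of the k training cells closest to the target.
    nearest = sorted(
        range(len(train_low)), key=lambda i: abs(train_low[i] - target_low)
    )[:k]
    # Local linear mapping fitted only on that neighborhood.
    slope, intercept = fit_line(
        [train_low[i] for i in nearest], [train_high[i] for i in nearest]
    )
    return slope * target_low + intercept


# Two regimes with different low-to-high mappings (synthetic data).
train_low = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
train_high = [0.0, 2.0, 4.0, 6.0, 90.0, 89.0, 88.0, 87.0]

print(refine_cell(1.5, train_low, train_high))
print(refine_cell(11.5, train_low, train_high))
```

In this synthetic data the two regimes follow different linear relationships, so a single global line would blur them together, while the local fits recover each regime separately. That is the intuition behind avoiding the over-smoothing of a global linear model.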

2603.24307 2026-03-26 cs.CL

Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

N J Karthika, Keerthana Suryanarayanan, Jahanvi Purohit, Ganesh Ramakrishnan, Jitin Singla, Anil Kumar Gourishetty

2603.24296 2026-03-26 cs.CV

AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication

Jie Song, Jun Jia, Wei Sun, Wangqiu Zhou, Tao Tan, Guangtao Zhai

2603.24294 2026-03-26 cs.CV

VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection

Jumin Lee, Siyeong Lee, Namil Kim, Sung-Eui Yoon

2603.24283 2026-03-26 cs.SD cs.AI

Bridging Biological Hearing and Neuromorphic Computing: End-to-End Time-Domain Audio Signal Processing with Reservoir Computing

Rinku Sebastian, Simon O'Keefe, Martin Trefzer

2603.24275 2026-03-26 cs.LG

Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers

Jun Ma, Xu Zhang, Zhengxing Jiao, Yaxin Hou, Hui Liu, Junhui Hou, Yuheng Jia