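arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.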

2512.22170 2026-03-17 cs.LG cs.CV

SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models

Jiesong Lian, Ruizhe Zhong, Zixiang Zhou, Xiaoyue Mi, Long Hu, Yuan Zhou, Qinglin Lu, Yixue Hao, Junchi Yan

Comments 16 pages, 9 figures

Abstract

Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RMs are susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark are available at https://github.com/lian700/SoliReward.
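The abstract mentions cross-prompt pairing of single-item binary annotations and a Bradley-Terry (BT) loss that accommodates win-tie pairs, but it does not give the formulas. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `cross_prompt_pairs` and `bt_loss` are hypothetical names, pairing every positive item with every negative item regardless of prompt is an assumed strategy, and ties are handled with a generic soft-label BT loss (target 0.5) rather than the paper's modified loss.

```python
import math
from itertools import product


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bt_loss(score_a: float, score_b: float, target: float = 1.0) -> float:
    """Soft-label Bradley-Terry loss for one pair of reward scores.

    target = 1.0 means "a wins"; target = 0.5 is an ASSUMED way to encode
    a tie (not taken from the paper). With target = 1.0 this reduces to
    the standard BT loss -log(sigmoid(score_a - score_b)).
    """
    d = score_a - score_b
    return -(target * log_sigmoid(d) + (1.0 - target) * log_sigmoid(-d))


def cross_prompt_pairs(items):
    """Build (positive, negative) pairs from single-item binary labels.

    items: list of (prompt_id, item_id, label) with label 1 = good, 0 = bad.
    ASSUMPTION: every positive is paired with every negative, even when
    the two items come from different prompts.
    """
    positives = [it for it in items if it[2] == 1]
    negatives = [it for it in items if it[2] == 0]
    return list(product(positives, negatives))


if __name__ == "__main__":
    # Hypothetical annotations: (prompt_id, item_id, binary label).
    items = [("p1", "v1", 1), ("p1", "v2", 0), ("p2", "v3", 1), ("p2", "v4", 0)]
    pairs = cross_prompt_pairs(items)
    print(len(pairs), "win pairs")
    print("win loss:", bt_loss(1.2, 0.3, target=1.0))
    print("tie loss:", bt_loss(1.2, 0.3, target=0.5))
```

Under this soft-label form, a win pair is penalized less as the score gap grows, while a tie pair is penalized least when the two scores are equal, which is one simple way to keep scores of equally good samples from drifting apart.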

2512.17266 2026-03-17 cs.AI

EventGPT: Capturing Player Impact from Team Action Sequences Using GPT-Based Framework

Miru Hong, Minho Lee, Geonhee Jo, Jae-Hee So, Pascal Bauer, Sang-Ki Ko

Comments 8 pages, 2 figures, 7 tables. To appear in Hudl Performance Insights 2025

2512.12372 2026-03-17 cs.CV

STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li, Boxin Shi

2512.11782 2026-03-17 cs.CV

MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao

Comments Accepted to CVPR 2026. Project page: https://pq-yang.github.io/projects/MatAnyone2/

2511.20157 2026-03-17 cs.CV

SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Da Li, Jiping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Xi Shen

Comments Project page: https://pokerman8.github.io/SKEL-CF/

2511.19917 2026-03-17 cs.CV

Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You

2511.18706 2026-03-17 cs.CV

CoD: A Diffusion Foundation Model for Image Compression

Zhaoyang Jia, Zihan Zheng, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Houqiang Li, Yan Lu

Comments Accepted at CVPR 2026

2511.18333 2026-03-17 cs.CV

ConsistCompose: Unified Multimodal Layout Control for Image Composition

Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang, Quan Wang, Dahua Lin

Comments Accepted to CVPR 2026; 23 pages, 17 figures

2511.17454 2026-03-17 cs.CV

Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri, Matthew Fisher, Nanxuan Zhao, Vladimir G. Kim, Pierre Alliez, Mathieu Desbrun, Wang Yifan

2511.14386 2026-03-17 cs.CV cs.AI

Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving

Kangqiao Zhao, Shuo Huai, Xurui Song, Jun Luo

Comments AAAI 2026

2511.14063 2026-03-17 cs.CV

Semantic Context Matters: Improving Conditioning for Autoregressive Models

Dongyang Jin, Ryan Xu, Jianhao Zeng, Rui Lan, Yancheng Bai, Lei Sun, Xiangxiang Chu

2511.13888 2026-03-17 cs.LG

Tractable Probabilistic Models for Investment Planning

Nicolas M. Cuadrado A., Mohannad Takrouri, Jiří Němeček, Martin Takáč, Jakub Mareček

2511.08825 2026-03-17 cs.AI

Neural Value Iteration

Yang You, Ufuk Çakır, Alex Schutz, Nick Hawes

2511.03718 2026-03-17 cs.CL cs.AI

Grounded Misunderstandings in Asymmetric Dialogue: A Perspectivist Annotation Scheme for MapTask

Nan Li, Albert Gatt, Massimo Poesio

Comments 14 pages, 5 figures, 6 tables; Camera-ready Version; Accepted by LREC 2026 (Oral)

2510.20548 2026-03-17 cs.CL cs.AI

GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao

Comments 8 pages, 3 figures, 4 tables

2510.17512 2026-03-17 cs.SD cs.LG cs.MM eess.AS

AWARE: Audio Watermarking with Adversarial Resistance to Edits

Kosta Pavlović, Lazar Stanarević, Petar Nedić, Elena Nešović, Slavko Kovačević, Igor Djurović

2510.16084 2026-03-17 cs.LG cond-mat.quant-gas math-ph math.MP physics.optics quant-ph

Near-Equilibrium Propagation training in nonlinear wave systems

Karol Sajnok, Michał Matuszewski

Comments 7 figures

2510.16021 2026-03-17 cs.LG econ.GN q-fin.EC

Feature-driven reinforcement learning for photovoltaic in continuous intraday trading

Arega Getaneh Abate, Xiao-Bing Zhang, Xiufeng Liu, Ruyu Liu

2510.12720 2026-03-17 cs.CL cs.CV cs.MM cs.SD

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, Eng Siong Chng, Xie Chen

Comments Accepted by ICLR 2026. Open Source at https://github.com/ddlBoJack/Omni-Captioner

Abstract

Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state of the art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.

2510.12615 2026-03-17 cs.LG cs.AI

A Functional Perspective on Knowledge Distillation in Neural Networks

Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis

Comments 57 pages, 23 figures and 95 tables

2510.09653 2026-03-17 cs.CV cs.AI

Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition

Ranjan Sapkota, Manoj Karkee

2510.04673 2026-03-17 cs.AI cs.CV

Watch and Learn: Learning to Use Computers from Online Videos

Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister

Comments CVPR 2026; Project page: https://chanh.ee/wandl/

2510.03481 2026-03-17 cs.RO cs.SY eess.SY

Optimization-Based Robust Permissive Synthesis for Interval MDPs

Khang Vo Huynh, David Parker, Lu Feng

2509.26207 2026-03-17 cs.SD cs.LG

The silence of the weights: a structural pruning strategy for attention-based audio signal architectures with second order metrics

Andrea Diecidue, Carlo Alberto Barbano, Piero Fraternali, Mathieu Fontaine, Enzo Tartaglione

2509.25827 2026-03-17 cs.CL cs.AI

Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

Comments 30 pages; Accepted as an oral presentation at ICLR 2026

2509.25164 2026-03-17 cs.CV

YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection

Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee

2509.24741 2026-03-17 cs.CV

Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

Xue-Feng Zhu, Tianyang Xu, Yifan Pan, Jinjie Gu, Xi Li, Jiwen Lu, Xiao-Jun Wu, Josef Kittler

2509.22756 2026-03-17 cs.RO cs.AI

Persistent Autoregressive Mapping with Traffic Rules for Autonomous Driving

Shiyi Liang, Xinyuan Chang, Changjie Wu, Huiyuan Yan, Yifan Bai, Xinran Liu, Hang Zhang, Yujian Yuan, Shuang Zeng, Mu Xu, Xing Wei

Comments AAAI 2026

2509.22407 2026-03-17 cs.AI cs.RO

EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer

Zhehao Dong, Xiaofeng Wang, Zheng Zhu, Yirui Wang, Yang Wang, Yukun Zhou, Boyuan Wang, Chaojun Ni, Runqi Ouyang, Wenkang Qin, Xinze Chen, Yun Ye, Guan Huang, Zhen Lu, Yue Yang

2509.19270 2026-03-17 cs.CL cs.AI cs.SD

SloPal: A 60-Million-Word Slovak Parliamentary Corpus with Aligned Speech and Fine-Tuned ASR Models

Erik Božík, Marek Šuppa

Comments LREC 2026