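arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.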

2506.07180 2026-05-01 cs.CL cs.AI cs.CV

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Wenrui Zhou, Mohamed Hendy, Shu Yang, Qingsong Yang, Zikun Guo, Yuyu Luo, Lijie Hu, Di Wang

Comments 27 Pages, Accepted by ACL 2026 Main Conference

Details

English abstract

As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two potential training-free mitigation strategies, revealing potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://anonymous.4open.science/r/VideoSycophancy-567F.

2505.19630 2026-05-01 cs.CL

Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning

Yichun Feng, Jiawei Wang, Lu Zhou, Yikai Zheng, Zhen Lei, Yixue Li

2505.13230 2026-05-01 cs.LG cond-mat.dis-nn stat.ML

Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks

Francesco D'Amico, Dario Bocchi, Matteo Negri

Comments Final accepted version at ICLR26 main conference; 27 pages, 21 Figures, 5 tables

2504.14988 2026-05-01 cs.CV

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

Comments Accepted to ICLR 2026

2504.14602 2026-05-01 cs.RO cs.AI cs.HC

K2MUSE: A human lower-limb multimodal walking dataset spanning task and acquisition variability for rehabilitation robotics

Jiwei Li, Bi Zhang, Xiaowei Tan, Wanxin Chen, Zhaoyuan Liu, Juanjuan Zhang, Weiguang Huo, Jian Huang, Lianqing Liu, Xingang Zhao

Comments 34 pages, 30 figures, 7 tables

Details

English abstract

The natural interaction and control performance of lower limb rehabilitation robots are closely linked to biomechanical information from various human locomotion activities. Multidimensional human motion data significantly deepen the understanding of the complex mechanisms governing neuromuscular alterations, thereby facilitating the development and application of rehabilitation robots in multifaceted real-world environments. However, existing lower limb datasets are inadequate for supplying the essential multimodal data and large-scale gait samples necessary for the development of effective data-driven approaches, and the significant effects of acquisition interference in real applications are neglected. To fill this gap, we present the K2MUSE dataset, which includes a comprehensive collection of multimodal data, comprising kinematic, kinetic, amplitude mode ultrasound (AUS), and surface electromyography (sEMG) measurements. The proposed dataset includes lower-limb multimodal data collected from two cohorts, including 30 able-bodied young adults and 12 older adults, across different inclines (0$^\circ$, $\pm$5$^\circ$, and $\pm$10$^\circ$), speeds (0.5 m/s, 1.0 m/s, and 1.5 m/s), and representative non-ideal acquisition conditions (muscle fatigue, electrode shifts, and interday differences). The kinematic and ground reaction force data were collected with a Vicon motion capture system and an instrumented treadmill with embedded force plates, whereas the sEMG and AUS data of thirteen muscles on the bilateral lower limbs were synchronously recorded. K2MUSE is released with the corresponding structured documentation, preprocessing pipelines, and example code, thereby providing a comprehensive resource for rehabilitation robot development, biomechanical analysis, and wearable sensing research. The dataset is available at https://k2muse.github.io/.

2504.02768 2026-05-01 cs.CL

MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza

Comments Published in TACL, MIT Press

2503.01835 2026-05-01 cs.CV

Primus: Enforcing Attention Usage for 3D Medical Image Segmentation

Tassilo Wald, Saikat Roy, Fabian Isensee, Constantin Ulrich, Sebastian Ziegler, Dasha Trofimova, Raphael Stock, Michael Baumgartner, Gregor Köhler, Klaus Maier-Hein

Comments Accepted in Transactions on Machine Learning Research (TMLR)

2503.01611 2026-05-01 cs.CL

In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models

David Ponce, Thierry Etchegoyhen

2503.01448 2026-05-01 cs.CV

Generative Human Geometry Distribution

Xiangjun Tang, Biao Zhang, Peter Wonka

2502.16942 2026-05-01 cs.CL

NUTSHELL: A Dataset for Abstract Generation from Scientific Talks

Maike Züfle, Sara Papi, Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Jan Niehues

2502.14270 2026-05-01 cs.LG

Predicting Fetal Birthweight from High Dimensional Data using Advanced Machine Learning

Nachiket Kapure, Harsh Joshi, Rajeshwari Mistri, Parul Kumari, Manasi Mali, Seema Purohit, Neha Sharma, Mrityunjoy Panday, Chittaranjan S. Yajnik

Comments Withdrawn due to concerns regarding overlap in text and methodology (Sections 2--4), requiring substantial revision and restructuring to ensure clarity and originality. A corrected version will be submitted separately

2502.12272 2026-05-01 cs.LG cs.AI cs.CL

Learning to Reason at the Frontier of Learnability

Thomas Foster, Anya Sims, Johannes Forkel, Mattie Fellows, Jakob Foerster

2502.07645 2026-05-01 cs.RO

From Action Labels to Sets: Rethinking Action Supervision for Imitation Learning from Corrective Feedback

Zhaoting Li, Rodrigo Pérez-Dattari, Robert Babuska, Cosimo Della Santina, Jens Kober

2502.02097 2026-05-01 cs.CV

VerteNet -- A Multi-Context Hybrid CNN Transformer for Accurate Vertebral Landmark Localization in Lateral Spine DXA Images

Arooba Maqsood, Zaid Ilyas, Afsah Saleem, Erchuan Zhang, David Suter, Parminder Raina, Jonathan M. Hodgson, John T. Schousboe, William D. Leslie, Joshua R. Lewis, Syed Zulqarnain Gilani

Comments 17 pages with 5 figures

2501.07451 2026-05-01 cs.CV

A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion

Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis

Comments Under review at Image and Vision Computing

2410.07442 2026-05-01 cs.CV

Self-Supervised Learning for Real-World Object Detection: a Survey

Alina Ciocarlan, Sidonie Lefebvre, Sylvie Le Hégarat-Mascle, Arnaud Woiselle

Details

DOI: 10.1016/j.cviu.2026.104783

English abstract

Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN and ViT-based architectures. Specifically, our benchmark is performed on the widely-used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, highlighting how certain SSL strategies are better suited for handling uncurated data. Our findings highlight that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. This survey provides a practical guide for selecting optimal SSL strategies, taking into account factors such as backbone architecture, object size, and custom pre-training requirements. Ultimately, we show that choosing an appropriate SSL pre-training strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.

2409.20302 2026-05-01 cs.AI cs.CL cs.IR

OM4OV: Leveraging Ontology Matching for Ontology Versioning

Zhangcheng Qiang, Kerry Taylor, Weiqing Wang

Comments 17 pages, 10 figures, 2 tables

2403.12235 2026-05-01 cs.RO cs.SY eess.SY

IKSPARK: Obstacle-Aware Inverse Kinematics via Convex Optimization

Liangting Wu, Roberto Tron

2402.14532 2026-05-01 cs.LG stat.ML

A Framework for Variational Inference of Lightweight Bayesian Neural Networks with Heteroscedastic Uncertainties

David J. Schodt, Ryan Brown, Michael Merritt, Samuel Park, Delsin Menolascino, Mark A. Peot

Comments Fix equation typos

2310.02277 2026-05-01 cs.LG cs.AI

Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs

Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, Zhangyang Wang

Comments Published at ICML 2024

2309.12802 2026-05-01 cs.SD cs.LG eess.AS

Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Alexandre R. Ferreira, Cláudio E. C. Campelo

Comments 9 pages, 6 figures, 7 tables

2309.12071 2026-05-01 cs.AI cs.CL

Benchmarking quantized LLaMa-based models on the Brazilian Secondary School Exam

Matheus L. O. Santos, Cláudio E. C. Campelo

Comments 8 pages, 6 figures, 4 tables

2309.02449 2026-05-01 cs.LG

League of Legends: Real-Time Result Prediction

Jailson B. S. Junior, Claudio E. C. Campelo

Comments 8 pages

2306.16050 2026-05-01 cs.CV cs.LG eess.IV

Evaluating Similitude and Robustness of Deep Image Denoising Models via Adversarial Attack

Jie Ning, Jiebao Sun, Yao Li, Zhichang Guo, Wangmeng Zuo

2306.10407 2026-05-01 cs.LG cs.AI physics.bio-ph q-bio.CB

FP-IRL: Fokker--Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes

Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati

Details

DOI: 10.1016/j.cma.2026.119010
Journal ref: Computer Methods in Applied Mechanics and Engineering, 458, 119010 (2026)

English abstract

Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated a priori, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled. We propose Fokker--Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems that can be described by Fokker--Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a correspondence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the FP potential function using our inference approach of variational system identification, from which the full set of MDP components -- reward, transition, and policy -- can be recovered using analytic expressions. We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.

2102.05231 2026-05-01 cs.CV cs.AI

Culture-inspired Multi-modal Color Palette Generation and Colorization: A Chinese Youth Subculture Case

Yufan Li, Jinggang Zhuo, Ling Fan, Harry Jiannan Wang

Comments accepted by the 3rd IEEE Workshop on Artificial Intelligence for Art Creation

2604.27733 2026-05-01 cs.LG stat.ML

Mind the Gap: Structure-Aware Consistency in Preference Learning

Mehryar Mohri, Yutao Zhong

2604.27728 2026-05-01 cs.RO

Connected Dependability Cage: Run-Time Function and Anomaly Monitoring for the Development and Operation of Safe Automated Vehicles

Iqra Aslam, Nour Habib, Abhishek Buragohain, Meng Zhang, Andreas Rausch, Vaibhav Tiwari, Mohamed Benchat

2604.27724 2026-05-01 cs.AI

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

Xupeng Chen, Binbin Shi, Chenqian Le, Jiaqi Zhang, Kewen Wang, Ran Gong, Jinhan Zhang, Chihang Wang

2604.27723 2026-05-01 cs.LG stat.ML

Optimized Deferral for Imbalanced Settings

Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong