arXivDaily: arXiv daily academic digest, updated Monday through Friday
2506.07180 2026-05-01 cs.CL cs.AI cs.CV

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Wenrui Zhou, Mohamed Hendy, Shu Yang, Qingsong Yang, Zikun Guo, Yuyu Luo, Lijie Hu, Di Wang

Comments 27 Pages, Accepted by ACL 2026 Main Conference

Abstract

As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two training-free mitigation strategies that reveal potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://anonymous.4open.science/r/VideoSycophancy-567F.

2505.19630 2026-05-01 cs.CL

Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning

Yichun Feng, Jiawei Wang, Lu Zhou, Yikai Zheng, Zhen Lei, Yixue Li

Abstract

Large language models (LLMs) struggle in real-world clinical consultations. Single-turn consultation systems require patients to describe all symptoms at once, which often leads to unclear complaints and vague diagnoses. Traditional dialogue models, constrained by static supervised learning, are limited to superficially imitating existing dialogue patterns and lack the ability to actively construct understanding in dynamic interactions, thus failing to achieve genuine clinical reasoning. To address these challenges, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework, and train a doctor agent on Qwen2.5-7B-Instruct using this framework. Within this framework, a medical consultation is modeled as a dynamic decision-making process under uncertainty. The core intelligence of the doctor agent is shifted from knowing the answer to learning and mastering a questioning methodology aimed at achieving an optimal diagnosis. Through strategic questioning, it guides the progressive emergence of key patient information in multi-turn dialogues. To support this high-fidelity simulation of the real diagnostic process, we constructed MTMedDialog, a novel English multi-turn medical consultation dataset designed for dynamic, interactive training. To validate its real-world effectiveness, rigorous evaluations including blinded human assessments and trials with real patients were conducted. DoctorAgent-RL outperformed frontier models and achieved a 70% exact diagnostic match rate, confirming its potential as a collaborative tool. By handling initial screenings, it can free clinicians to focus on complex cases, thereby addressing critical issues like physician shortages and misdiagnosis risks while alleviating the strain on healthcare resources.

2505.13230 2026-05-01 cs.LG cond-mat.dis-nn stat.ML

Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks

Francesco D'Amico, Dario Bocchi, Matteo Negri

Comments Final accepted version at ICLR26 main conference; 27 pages, 21 Figures, 5 tables

Abstract

Scaling laws in deep learning -- empirical power-law relationships linking model performance to resource growth -- have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training. In this work, we describe a richer picture by analyzing the entire training dynamics: we identify two novel dynamical scaling laws that govern how performance evolves as a function of different norm-based complexity measures. Combined, our new laws recover the well-known scaling for test error at convergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws, and we explain them through the implicit bias induced by gradient-based training.

2504.14988 2026-05-01 cs.CV

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

Comments Accepted to ICLR 2026

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

2504.14602 2026-05-01 cs.RO cs.AI cs.HC

K2MUSE: A human lower-limb multimodal walking dataset spanning task and acquisition variability for rehabilitation robotics

Jiwei Li, Bi Zhang, Xiaowei Tan, Wanxin Chen, Zhaoyuan Liu, Juanjuan Zhang, Weiguang Huo, Jian Huang, Lianqing Liu, Xingang Zhao

Comments 34 pages, 30 figures, 7 tables

Abstract

The natural interaction and control performance of lower limb rehabilitation robots are closely linked to biomechanical information from various human locomotion activities. Multidimensional human motion data significantly deepen the understanding of the complex mechanisms governing neuromuscular alterations, thereby facilitating the development and application of rehabilitation robots in multifaceted real-world environments. However, existing lower limb datasets are inadequate for supplying the essential multimodal data and large-scale gait samples necessary for the development of effective data-driven approaches, and the significant effects of acquisition interference in real applications are neglected. To fill this gap, we present the K2MUSE dataset, which includes a comprehensive collection of multimodal data, comprising kinematic, kinetic, amplitude mode ultrasound (AUS), and surface electromyography (sEMG) measurements. The proposed dataset includes lower-limb multimodal data collected from two cohorts, including 30 able-bodied young adults and 12 older adults, across different inclines (0°, ±5°, and ±10°), speeds (0.5 m/s, 1.0 m/s, and 1.5 m/s), and representative non-ideal acquisition conditions (muscle fatigue, electrode shifts, and interday differences). The kinematic and ground reaction force data were collected with a Vicon motion capture system and an instrumented treadmill with embedded force plates, whereas the sEMG and AUS data of thirteen muscles on the bilateral lower limbs were synchronously recorded. K2MUSE is released with the corresponding structured documentation, preprocessing pipelines, and example code, thereby providing a comprehensive resource for rehabilitation robot development, biomechanical analysis, and wearable sensing research. The dataset is available at https://k2muse.github.io/.

2504.02768 2026-05-01 cs.CL

MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza

Comments Published in TACL, MIT Press

Abstract

We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
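
The abstract does not spell out the scoring rule. A common convention for minimal-pair benchmarks is to count a pair as correct when the model assigns a higher score to the grammatical sentence than to the ungrammatical one. The sketch below illustrates that convention only. The function names and `toy_score` (a stand-in for a real language model's log-probability) are hypothetical and are not taken from MultiBLiMP.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model log-probability.

    It rewards subject-verb number agreement in three-word sentences of
    the form 'The <subject> <verb>.'; a real evaluation would query an LLM.
    """
    agreeing = {("dogs", "bark"), ("dog", "barks")}
    words = sentence.lower().rstrip(".").split()
    return 0.0 if (words[1], words[2]) in agreeing else -1.0


# Two illustrative subject-verb agreement pairs (made up for this sketch).
pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("The dog barks.", "The dog bark."),
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
print(accuracy)
```

Replacing `toy_score` with a function that sums token log-probabilities from an actual model would turn this into a usable evaluation loop, assuming the benchmark follows the convention described above.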

2503.01835 2026-05-01 cs.CV

Primus: Enforcing Attention Usage for 3D Medical Image Segmentation

Tassilo Wald, Saikat Roy, Fabian Isensee, Constantin Ulrich, Sebastian Ziegler, Dasha Trofimova, Raphael Stock, Michael Baumgartner, Gregor Köhler, Klaus Maier-Hein

Comments Accepted in Transactions on Machine Learning Research (TMLR)

Abstract

Transformers have achieved remarkable success across multiple fields, yet their impact on 3D medical image segmentation remains limited with convolutional networks still dominating major benchmarks. In this work, (A) we analyze current Transformer-based segmentation models and identify critical shortcomings, particularly their over-reliance on convolutional blocks. Further, we demonstrate that in some architectures, performance is unaffected by the absence of the Transformer, thereby demonstrating their limited effectiveness. To address these challenges, we move away from hybrid architectures and (B) introduce Transformer-centric segmentation architectures, termed Primus and PrimusV2. Primus leverages high-resolution tokens, combined with advances in positional embeddings and block design, to maximally leverage its Transformer blocks, while PrimusV2 expands on this through an iterative patch embedding. Through these adaptations, Primus surpasses current Transformer-based methods and competes with a default nnU-Net while PrimusV2 exceeds it and is on par with the state-of-the-art CNNs such as ResEnc-L and MedNeXt architectures across nine public datasets. In doing so, we introduce the first competitive Transformer-centric model, making Transformers state-of-the-art in 3D medical image segmentation. The code is available here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/primus.md.

2503.01611 2026-05-01 cs.CL

In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models

David Ponce, Thierry Etchegoyhen

Abstract

Instruction following is a critical ability for Large Language Models to perform downstream tasks. The standard approach to instruction tuning has relied on a specific phase of supervised fine-tuning over curated instruction datasets, optionally complemented with an alignment step over human preferences. Recent work has shown the potential of in-context learning (ICL) alternatives to guide base models towards instruction following. This type of approach is particularly relevant to circumvent the notable efforts and resources needed for supervised instruction tuning. In this work, we evaluate the viability of ICL for instruction following in scenarios where it is particularly relevant, i.e., languages other than English and across model sizes. Our results show that these scenarios result in downgraded ICL instruction following performance. We further show that applying Direct Preference Optimisation over base models can partially improve baseline results, although alternatives to current ICL instruction following will be needed to bridge the gap with larger English-centric language models.

2503.01448 2026-05-01 cs.CV

Generative Human Geometry Distribution

Xiangjun Tang, Biao Zhang, Peter Wonka

Abstract

Realistic human geometry generation is an important yet challenging task, requiring both the preservation of fine clothing details and the accurate modeling of clothing-body interactions. To tackle this challenge, we build upon Geometry distributions, a recently proposed representation that can model a single human geometry with high fidelity using a flow matching model. However, extending a single-geometry distribution to a dataset is non-trivial and inefficient for large-scale learning. To address this, we propose a new geometry distribution model based on two key techniques: (1) encoding distributions as 2D feature maps rather than network parameters, and (2) using SMPL models as the domain instead of Gaussian and refining the associated flow velocity field. We then design a generative framework adopting a two-stage training paradigm analogous to state-of-the-art image and 3D generative models. In the first stage, we compress geometry distributions into a latent space using a diffusion flow model; the second stage trains another flow model on this latent space. We validate our approach on two key tasks: pose-conditioned random avatar generation and avatar-consistent novel pose synthesis. Experimental results demonstrate that our method outperforms existing state-of-the-art methods, achieving a 57% improvement in geometry quality.

2502.16942 2026-05-01 cs.CL

NUTSHELL: A Dataset for Abstract Generation from Scientific Talks

Maike Züfle, Sara Papi, Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Jan Niehues

Abstract

Scientific communication is receiving increasing attention in natural language processing, especially to help researchers access, summarize, and generate content. One emerging application in this area is Speech-to-Abstract Generation (SAG), which aims to automatically generate abstracts from recorded scientific presentations. SAG enables researchers to efficiently engage with conference talks, but progress has been limited by a lack of large-scale datasets. To address this gap, we introduce NUTSHELL, a novel multimodal dataset of *ACL conference talks paired with their corresponding abstracts. We establish strong baselines for SAG and evaluate the quality of generated abstracts using both automatic metrics and human judgments. Our results highlight the challenges of SAG and demonstrate the benefits of training on NUTSHELL. By releasing NUTSHELL under an open license (CC-BY 4.0), we aim to advance research in SAG and foster the development of improved models and evaluation methods.

2502.14270 2026-05-01 cs.LG

Predicting Fetal Birthweight from High Dimensional Data using Advanced Machine Learning

Nachiket Kapure, Harsh Joshi, Rajeshwari Mistri, Parul Kumari, Manasi Mali, Seema Purohit, Neha Sharma, Mrityunjoy Panday, Chittaranjan S. Yajnik

Comments Withdrawn due to concerns regarding overlap in text and methodology (Sections 2--4), requiring substantial revision and restructuring to ensure clarity and originality. A corrected version will be submitted separately

Abstract

Birth weight serves as a fundamental indicator of neonatal health, closely linked to both early medical interventions and long-term developmental risks. Traditional predictive models, often constrained by limited feature selection and incomplete datasets, struggle to achieve accurate predictions, overlooking complex maternal and fetal interactions in diverse clinical settings. This research explores machine learning to address these limitations, utilizing a structured methodology that integrates advanced imputation strategies, supervised feature selection techniques, and predictive modeling. Given the constraints of the dataset, the research strengthens the role of data preprocessing in improving the model performance. Among the various methodologies explored, tree-based feature selection methods demonstrated superior capability in identifying the most relevant predictors, while ensemble-based regression models proved highly effective in capturing non-linear relationships and complex maternal-fetal interactions within the data. Beyond model performance, the study highlights the clinical significance of key physiological determinants, offering insights into maternal and fetal health factors that influence birth weight and that extend beyond statistical modeling. By bridging computational intelligence with perinatal research, this work underscores the transformative role of machine learning in enhancing predictive accuracy, refining risk assessment and informing data-driven decision-making in maternal and neonatal care. Keywords: Birth weight prediction, maternal-fetal health, MICE, BART, Gradient Boosting, neonatal outcomes, Clinipredictive.

2502.12272 2026-05-01 cs.LG cs.AI cs.CL

Learning to Reason at the Frontier of Learnability

Thomas Foster, Anya Sims, Johannes Forkel, Mattie Fellows, Jakob Foerster

Abstract

Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts - meaning they are already learned - or by none - providing no meaningful training signal. To address this, we adapt a method from the reinforcement learning literature - sampling for learnability - and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning with LLMs.
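
The curriculum described above prioritises questions whose success rate is neither 0 nor 1. One natural way to quantify "high variance of success" is the Bernoulli variance p(1 - p) of the empirical success rate p, which is zero when every attempt succeeds or fails and largest at p = 0.5. The sketch below ranks questions by that score. The top-k selection rule and all names are assumptions made for illustration; the abstract does not state how the authors select or sample questions.

```python
from typing import List, Sequence


def learnability(success_rate: float) -> float:
    """Variance of a Bernoulli outcome with the given success probability."""
    return success_rate * (1.0 - success_rate)


def select_questions(success_counts: Sequence[int], attempts: int, k: int) -> List[int]:
    """Return the indices of the k questions with the highest success variance.

    success_counts[i] is how many of `attempts` rollouts solved question i.
    Top-k selection is an illustrative assumption, not the paper's stated rule.
    """
    scores = [learnability(count / attempts) for count in success_counts]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Five hypothetical questions, each attempted 8 times.
counts = [0, 8, 4, 1, 7]
chosen = select_questions(counts, attempts=8, k=2)
print(chosen)
```

Questions 0 and 1 (never solved and always solved) score zero and are never preferred over a partially solved question, which matches the observation that they provide no meaningful training signal.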

2502.07645 2026-05-01 cs.RO

From Action Labels to Sets: Rethinking Action Supervision for Imitation Learning from Corrective Feedback

Zhaoting Li, Rodrigo Pérez-Dattari, Robert Babuska, Cosimo Della Santina, Jens Kober

Abstract

Behavior cloning (BC) optimizes policies by treating human demonstrations as pointwise action labels. While effective with accurate action labels, this formulation is brittle in practice: when human-provided actions are imperfect, treating each label as an exact target can steer the policy away from the underlying desired behavior, particularly when expressive models are used (e.g., energy-based models). As a result, we propose a human-in-the-loop alternative that replaces pointwise supervision with set-valued action targets. We introduce Contrastive policy Learning from Interactive Corrections (CLIC). CLIC leverages human corrections to construct and refine sets of desired actions, and optimizes a policy to place probability mass over these sets rather than over a single action target. This formulation naturally accommodates both absolute and relative corrections and can represent complex multi-modal behaviors. Extensive simulation and real-robot experiments show that the proposed approach leads to effective policy learning across diverse settings: CLIC remains competitive with the state of the art under accurate data while being substantially more robust under noisy, relative, and partial feedback. Our implementation is publicly available at https://clic-webpage.github.io/.

2502.02097 2026-05-01 cs.CV

VerteNet -- A Multi-Context Hybrid CNN Transformer for Accurate Vertebral Landmark Localization in Lateral Spine DXA Images

Arooba Maqsood, Zaid Ilyas, Afsah Saleem, Erchuan Zhang, David Suter, Parminder Raina, Jonathan M. Hodgson, John T. Schousboe, William D. Leslie, Joshua R. Lewis, Syed Zulqarnain Gilani

Comments 17 pages with 5 figures

Abstract

This study aims to develop and validate a deep learning model that can accurately locate vertebral landmarks in lateral spine Dual energy X-ray Absorptiometry (DXA) scans. Accurate vertebral landmark localization is critical for reliable fracture assessment and scoring of abdominal aortic calcification using the Kauppila 24-point method; however, DXA lateral spine images are low-contrast, artifact-prone, and manufacturer-dependent, while manual annotation is time-consuming and reader-dependent. This study aimed to address these challenges by developing a dual-resolution self- and cross-attention model for robust vertebral landmark localization using lateral spine DXA scans from four different scanner models. Ground-truth vertebral corner landmarks (T12 to L5) were manually annotated, and performance was evaluated using normalized mean and median localization errors against baseline and state-of-the-art methods. The proposed framework achieved superior localization accuracy across all four DXA scanner models, with a normalized mean error of 4.92 pixels and a median error of 2.35 pixels, outperforming baseline methods. The abdominal aorta crop detection algorithm achieved 100% accuracy in validation and 96% accuracy (sensitivity 0.93, specificity 0.98) in an independent test set. Generated intervertebral guides further improved inter-reader agreement, reflected by higher Cohen's weighted kappa and inter-reader correlation. The proposed deep learning framework enables accurate and robust vertebral landmark localization in lateral spine DXA images across heterogeneous imaging systems to support clinically relevant downstream analyses. The code for this work can be found at: https://github.com/zaidilyas89/VerteNet

2501.07451 2026-05-01 cs.CV

A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion

Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis

Comments Under review at Image and Vision Computing

Abstract

Model compression is essential in the deployment of large Computer Vision models on embedded devices. However, static optimization techniques (e.g. pruning, quantization, etc.) neglect the fact that different inputs have different complexities, thus requiring different amounts of computation. Dynamic Neural Networks allow the amount of computation to be conditioned on the specific input. The current literature on the topic is very extensive and fragmented. We present a comprehensive survey that synthesizes and unifies existing Dynamic Neural Networks research in the context of Computer Vision. Additionally, we provide a logical taxonomy based on which component of the network is adaptive: the output, the computation graph or the input. Furthermore, we argue that Dynamic Neural Networks are particularly beneficial in the context of Sensor Fusion for better adaptivity, noise reduction and information prioritization. We present preliminary works in this direction. We complement this survey with a curated repository listing all the surveyed papers, each with a brief summary of the solution and the code base when available: https://github.com/DTU-PAS/awesome-dynn-for-cv.

2410.07442 2026-05-01 cs.CV

Self-Supervised Learning for Real-World Object Detection: a Survey

Alina Ciocarlan, Sidonie Lefebvre, Sylvie Le Hégarat-Mascle, Arnaud Woiselle

Abstract

Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN and ViT-based architectures. Specifically, our benchmark is performed on the widely-used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, highlighting how certain SSL strategies are better suited for handling uncurated data. Our findings highlight that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. This survey provides a practical guide for selecting optimal SSL strategies, taking into account factors such as backbone architecture, object size, and custom pre-training requirements. Ultimately, we show that choosing an appropriate SSL pre-training strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.

2409.20302 2026-05-01 cs.AI cs.CL cs.IR

OM4OV: Leveraging Ontology Matching for Ontology Versioning

Zhangcheng Qiang, Kerry Taylor, Weiqing Wang

Comments 17 pages, 10 figures, 2 tables

Abstract

Due to the dynamic nature of the Semantic Web, version control is necessary to manage changes in widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many approaches treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse similarities and differences between OM and OV and formalise an OM4OV framework to offer more advanced OV support. The framework is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be effectively reused for OV tasks, but without necessary extensions, can produce skewed measurements, poor performance in detecting update entities, and limited explanation of false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, which builds on existing OM alignments to reduce the number of matching candidates and to improve overall OV performance.

2403.12235 2026-05-01 cs.RO cs.SY eess.SY

IKSPARK: Obstacle-Aware Inverse Kinematics via Convex Optimization

Liangting Wu, Roberto Tron

Abstract

Inverse kinematics (IK) is central to robot control and motion planning, yet its nonlinear kinematic mapping makes it inherently nonconvex and particularly challenging under complex constraints. We present IKSPARK (Inverse Kinematics using Semidefinite Programming And RanK minimization), an obstacle-aware IK solver for robots with diverse morphologies, including open and closed kinematic chains with spherical, revolute, and prismatic joints. Our formulation expresses IK as a semidefinite programming (SDP) problem with additional rank-1 constraints on symmetric matrices with fixed traces. IKSPARK first solves the relaxed SDP, whose infeasibility certifies infeasibility of the original IK problem, and then recovers a rank-1 solution using iterative rank-minimization methods with proven local convergence. Obstacle avoidance is handled through a convexified formulation of mixed-integer constraints. Extensive experiments show that IKSPARK computes highly accurate solutions across various kinematic structures and constrained environments without post-processing. In obstacle-rich settings, especially fixed workcell environments, IKSPARK achieves substantially higher success rates than traditional nonlinear optimization methods.

2402.14532 2026-05-01 cs.LG stat.ML

A Framework for Variational Inference of Lightweight Bayesian Neural Networks with Heteroscedastic Uncertainties

David J. Schodt, Ryan Brown, Michael Merritt, Samuel Park, Delsin Menolascino, Mark A. Peot

Comments Fix equation typos

Abstract

Obtaining heteroscedastic predictive uncertainties from a Bayesian Neural Network (BNN) is vital to many applications. Often, heteroscedastic aleatoric uncertainties are learned as outputs of the BNN in addition to the predictive means; however, doing so may necessitate adding more learnable parameters to the network. In this work, we demonstrate that both the heteroscedastic aleatoric and epistemic variance can be embedded into the variances of learned BNN parameters, improving predictive performance for lightweight networks. By complementing this approach with a moment propagation approach to inference, we introduce a relatively simple framework for sampling-free variational inference suitable for lightweight BNNs.
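
Moment propagation replaces Monte Carlo sampling with closed-form updates of means and variances as they pass through the network. As a generic illustration (not the authors' framework, whose details are not given in the abstract), the sketch below propagates moments through a single linear unit y = Σ wᵢxᵢ + b, assuming all weights, the bias, and the inputs are mutually independent. Under that assumption, E[y] = Σ E[wᵢ]E[xᵢ] + E[b] and Var[y] = Σ (Var[wᵢ](E[xᵢ]² + Var[xᵢ]) + E[wᵢ]² Var[xᵢ]) + Var[b]. The function name and the example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def linear_moments(
    x_mean: Sequence[float],
    x_var: Sequence[float],
    w_mean: Sequence[float],
    w_var: Sequence[float],
    b_mean: float = 0.0,
    b_var: float = 0.0,
) -> Tuple[float, float]:
    """Mean and variance of y = sum_i w_i * x_i + b.

    Assumes every w_i, x_i, and b are mutually independent, so the
    variance of each product w_i * x_i has a closed form.
    """
    mean = b_mean
    var = b_var
    for m, v, wm, wv in zip(x_mean, x_var, w_mean, w_var):
        mean += wm * m
        # Var[w * x] for independent w and x.
        var += wv * (m * m + v) + wm * wm * v
    return mean, var


# Deterministic input (zero input variance) through a unit with uncertain weights.
mean, var = linear_moments(
    x_mean=[1.0, 2.0],
    x_var=[0.0, 0.0],
    w_mean=[0.5, -1.0],
    w_var=[0.1, 0.2],
)
print(mean, var)
```

With a deterministic input, the output variance depends on the input through the squared input values, so the predictive variance is input-dependent even though no extra variance head is added. Nonlinear activations need their own moment approximations, which this sketch does not cover.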

2310.02277 2026-05-01 cs.LG cs.AI

Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs

Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, Zhangyang Wang

Comments Published at ICML 2024

Abstract

We present the Junk DNA Hypothesis by adopting a novel task-centric angle for the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the conception that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude weights of pre-trained models encode vital knowledge essential for tackling difficult downstream tasks - manifested as the monotonic relationship between the performance drop of downstream tasks across the difficulty spectrum, as we prune more pre-trained weights by magnitude. Moreover, we reveal that removing these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that the other popular compression technique, namely quantization, fails to exhibit a similar monotonic effect and does not as convincingly disentangle this task-difficulty information. To study this formally, we introduce several quantifiable metrics to gauge the downstream task difficulty: (1) within the same task category, and (2) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Code is available at: https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.

2309.12802 2026-05-01 cs.SD cs.LG eess.AS

Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Alexandre R. Ferreira, Cláudio E. C. Campelo

Comments 9 pages, 6 figures, 7 tables

Journal ref
https://sbia.org.br/eventos/cbic_2023/cbic2023-169/
Abstract

To train transcription models that produce robust results, a large and diverse labeled dataset is required. Finding such data with the necessary characteristics is a challenging task, especially for languages less popular than English. Moreover, producing such data requires significant effort and often money. Therefore, a strategy to mitigate this problem is the use of data augmentation techniques. In this work, we propose a framework that approaches data augmentation based on deepfake audio. To validate the produced framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and an English dataset recorded by Indian speakers were selected, ensuring the presence of a single accent in the dataset. Subsequently, the augmented data was used to train speech-to-text models in various scenarios.

2309.12071 2026-05-01 cs.AI cs.CL

Benchmarking quantized LLaMa-based models on the Brazilian Secondary School Exam

Matheus L. O. Santos, Cláudio E. C. Campelo

Comments 8 pages, 6 figures, 4 tables

Details
Journal ref
https://sbic.org.br/eventos/cbic_2023/cbic2023-177/
English abstract

Although Large Language Models (LLMs) represent a revolution in the way we interact with computers, allowing the construction of complex questions and the ability to reason over a sequence of statements, their use is restricted due to the need for dedicated hardware for execution. In this study, we evaluate the performance of LLMs based on the 7- and 13-billion-parameter LLaMA models, subjected to a quantization process and run on home hardware. The models considered were Alpaca, Koala, and Vicuna. To evaluate the effectiveness of these models, we developed a database containing 1,006 questions from the ENEM (Brazilian National Secondary School Exam). Our analysis revealed that the best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. In addition, we evaluated the computational efficiency of the models by measuring the time required for execution. On average, the 7- and 13-billion-parameter LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor.

2309.02449 2026-05-01 cs.LG

League of Legends: Real-Time Result Prediction

Jailson B. S. Junior, Claudio E. C. Campelo

Comments 8 pages

Details
Journal ref
https://sbia.org.br/eventos/cbic_2023/cbic2023-161/
English abstract

This paper presents a study on the prediction of outcomes in matches of the electronic game League of Legends (LoL) using machine learning techniques. With the aim of exploring the ability to predict real-time results, considering different variables and stages of the match, we highlight the use of unpublished data as a fundamental part of this process. With the increasing popularity of LoL and the emergence of tournaments, betting related to the game has also emerged, making the investigation in this area even more relevant. A variety of models were evaluated and the results were encouraging. A model based on LightGBM showed the best performance, achieving an average accuracy of 81.62% in intermediate stages of the match when the percentage of elapsed time was between 60% and 80%. On the other hand, the Logistic Regression and Gradient Boosting models proved to be more effective in early stages of the game, with promising results. This study contributes to the field of machine learning applied to electronic games, providing valuable insights into real-time prediction in League of Legends. The results obtained may be relevant for both players seeking to improve their strategies and the betting industry related to the game.

2306.16050 2026-05-01 cs.CV cs.LG eess.IV

Evaluating Similitude and Robustness of Deep Image Denoising Models via Adversarial Attack

Jie Ning, Jiebao Sun, Yao Li, Zhichang Guo, Wangmeng Zuo

Details
Journal ref
Neurocomputing 687 (2026) 133674
English abstract

Deep neural networks (DNNs) have shown superior performance compared to traditional image denoising algorithms. However, DNNs are inevitably vulnerable when facing adversarial attacks. In this paper, we propose an adversarial attack method named denoising-PGD which can successfully attack all the current deep denoising models while keeping the noise distribution almost unchanged. We surprisingly find that the current mainstream non-blind denoising models (DnCNN, FFDNet, ECNDNet, BRDNet), blind denoising models (DnCNN-B, Noise2Noise, RDDCNN-B, FAN), plug-and-play (DPIR, CurvPnP) and unfolding denoising models (DeamNet) almost share the same adversarial sample set on both grayscale and color images, respectively. A shared adversarial sample set indicates that all these models are similar in terms of local behavior in the neighborhood of all the test samples. Thus, we further propose an indicator to measure the local similarity of models, called robustness similitude. Non-blind denoising models are found to have high robustness similitude with each other, while hybrid-driven models are also found to have high robustness similitude with pure data-driven non-blind denoising models. According to our robustness assessment, data-driven non-blind denoising models are the most robust. We use adversarial training to compensate for the vulnerability to adversarial attacks. Moreover, the model-driven image denoising method BM3D shows resistance to adversarial attacks.

2306.10407 2026-05-01 cs.LG cs.AI physics.bio-ph q-bio.CB

FP-IRL: Fokker-Planck Inverse Reinforcement Learning - A Physics-Constrained Approach to Markov Decision Processes

Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati

Details
Journal ref
Computer Methods in Applied Mechanics and Engineering, 458, 119010 (2026)
English abstract

Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated a priori, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled. We propose Fokker-Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems that can be described by Fokker-Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a correspondence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the FP potential function using our inference approach of variational system identification, from which the full set of MDP components (reward, transition, and policy) can be recovered using analytic expressions. We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.
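For readers unfamiliar with the link between FP dynamics and free energy minimization mentioned above, the textbook form of the relationship is sketched below. The symbols (density $\rho$, potential $\psi$, inverse temperature $\beta$) are assumptions of this note, and the paper's exact formulation may differ.

```latex
% Fokker-Planck equation with drift given by a potential \psi
\frac{\partial \rho}{\partial t}
  = \nabla \cdot \left( \rho \, \nabla \psi \right) + \beta^{-1} \Delta \rho

% Associated free energy functional (potential energy plus negative entropy)
F[\rho] = \int \psi \, \rho \, dx + \beta^{-1} \int \rho \log \rho \, dx

% Along solutions of the equation above, the free energy is non-increasing
\frac{d}{dt} F[\rho]
  = - \int \rho \, \left| \nabla \left( \psi + \beta^{-1} \log \rho \right) \right|^{2} dx
  \le 0
```

In this standard setting, identifying the potential $\psi$ determines the drift of the dynamics, which is consistent with the abstract's claim that the remaining MDP components follow from the inferred potential.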

2102.05231 2026-05-01 cs.CV cs.AI

Culture-inspired Multi-modal Color Palette Generation and Colorization: A Chinese Youth Subculture Case

Yufan Li, Jinggang Zhuo, Ling Fan, Harry Jiannan Wang

Comments accepted by the 3rd IEEE Workshop on Artificial Intelligence for Art Creation

Details
English abstract

Color is an essential component of graphic design, acting not only as a visual factor but also carrying cultural implications. However, existing research on algorithmic color palette generation and colorization largely ignores the cultural aspect. In this paper, we contribute to this line of research by first constructing a unique color dataset inspired by a specific culture, i.e., Chinese Youth Subculture (CYS), which is a vibrant and trending cultural group especially for the Gen Z population. We show that the colors used in CYS have special aesthetic and semantic characteristics that are different from generic color theory. We then develop an interactive multi-modal generative framework to create CYS-styled color palettes, which can be used to put a CYS twist on images using our automatic colorization model. Our framework is illustrated via a demo system designed with the human-in-the-loop principle that constantly provides feedback to our algorithms. User studies are also conducted to evaluate our generation results.

2604.27733 2026-05-01 cs.LG stat.ML

Mind the Gap: Structure-Aware Consistency in Preference Learning

Mehryar Mohri, Yutao Zhong

Details
English abstract

Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrogate losses as proxies for the intractable pairwise ranking loss. However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous generalization guarantees. To resolve this, we formulate LLM alignment within a margin-shifted ranking framework. We derive rigorous $H$-consistency bounds that depend on enforcing a separation margin $\gamma$. Crucially, we extend this to Structure-Aware $H$-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and model limitations via the Margin-Capacity Profile, proving that heavy-tailed surrogates (such as the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.
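To make the idea of a margin-shifted preference loss concrete, here is a minimal sketch that adds a margin to the standard DPO logistic loss. This is not the paper's SA-DPO objective: where the margin enters, the `beta` default, and the linear `distance_scaled_margin` rule are all assumptions made for illustration.

```python
import math

def dpo_margin_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, margin=0.0):
    """Logistic DPO loss on one (chosen, rejected) pair with an additive margin.

    Assumption: the margin is subtracted from the scaled log-ratio difference.
    """
    z = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)) - margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def distance_scaled_margin(semantic_distance, gamma_max=1.0):
    """Hypothetical rule: near-synonymous responses (distance ~ 0) get a small margin."""
    return gamma_max * min(max(semantic_distance, 0.0), 1.0)

base = dpo_margin_loss(-1.0, -2.0, -1.5, -1.5)
shifted = dpo_margin_loss(-1.0, -2.0, -1.5, -1.5,
                          margin=distance_scaled_margin(0.5))
print(base, shifted)  # the shifted loss is larger: the pair must be separated by more
```

With a positive margin, a pair that is only barely ranked correctly still incurs a noticeable loss, which is the separation behavior the abstract attributes to the margin $\gamma$.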

2604.27728 2026-05-01 cs.RO

Connected Dependability Cage: Run-Time Function and Anomaly Monitoring for the Development and Operation of Safe Automated Vehicles

Iqra Aslam, Nour Habib, Abhishek Buragohain, Meng Zhang, Andreas Rausch, Vaibhav Tiwari, Mohamed Benchat

Details
English abstract

The advancement of automated vehicles introduces complex safety challenges, particularly in dynamic and unpredictable environments where AI-enabled perception systems must operate reliably. Ensuring compliance with safety standards such as ISO 26262 and ISO/PAS 21448 (SOTIF) is essential for addressing system malfunctions and mitigating unsafe behavior in unknown scenarios. However, as automation levels increase, vehicles must go beyond conventional functional safety by incorporating fail-operational capabilities that enable continued safe operation during system or component failures and the handling of unfamiliar or degraded operational conditions. To address these safety concerns, we propose the Connected Dependability Cage, an architectural framework designed to enable hierarchical fail-operational behavior in AI-enabled perception systems. This framework integrates two complementary monitoring mechanisms: a Function Monitor that oversees multiple heterogeneous AI-based perception pipelines and detects inconsistencies through a voting mechanism, and an Anomaly Monitor that evaluates the reliability of AI perception by detecting unknown or novel objects in scenes that may be excluded from the training dataset. In the presence of critical discrepancies, the system supports graceful degradation, ultimately enabling a transition to a minimal-risk maneuver strategy. Furthermore, whenever either monitor raises a safety flag, an automated data recording process is initiated to facilitate iterative system development and continuous improvement. Both monitors have been implemented and validated through extensive vehicle testing, demonstrating their practical effectiveness in real-world applications.
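The voting idea behind the Function Monitor can be sketched in a few lines. The following is an illustrative simplification, not the authors' implementation: each perception pipeline is assumed to emit one hashable label per object, and `function_monitor` and `min_agreement` are hypothetical names.

```python
from collections import Counter

def function_monitor(pipeline_outputs, min_agreement=2):
    """Majority vote over the outputs of heterogeneous perception pipelines.

    Returns (agreed_output, safety_flag). The flag is raised when no output
    is supported by at least `min_agreement` pipelines.
    """
    if not pipeline_outputs:
        raise ValueError("at least one pipeline output is required")
    label, votes = Counter(pipeline_outputs).most_common(1)[0]
    agreed = votes >= min_agreement
    return (label if agreed else None), (not agreed)

print(function_monitor(["pedestrian", "pedestrian", "cyclist"]))  # ('pedestrian', False)
print(function_monitor(["pedestrian", "cyclist", "car"]))         # (None, True)
```

In the framework described above, a raised flag would be the trigger for graceful degradation and for the automated data recording; both are outside this sketch.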

2604.27724 2026-05-01 cs.AI

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

Xupeng Chen, Binbin Shi, Chenqian Le, Jiaqi Zhang, Kewen Wang, Ran Gong, Jinhan Zhang, Chihang Wang

Details
English abstract

Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4xA100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.
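The coarse-to-fine index described above can be illustrated with a toy two-stage retriever. This sketch makes several simplifying assumptions that are not from the paper: centroids are formed by averaging contiguous chunks of patches rather than by clustering, brute-force scoring stands in for ANN search, and a one-directional ColBERT-style MaxSim stands in for the "exact two-way scoring". All function names are hypothetical.

```python
import numpy as np

def maxsim(query, page):
    """Late-interaction score: best-matching page vector per query vector, summed."""
    return float((query @ page.T).max(axis=1).sum())

def coarse_centroids(patches, c=8):
    """Stand-in for offline clustering: average contiguous chunks into c centroids."""
    return np.stack([chunk.mean(axis=0) for chunk in np.array_split(patches, c)])

def retrieve(query, pages, centroids, shortlist=2, top_k=1):
    # Stage 1 (coarse): score against centroids only; brute force stands in for ANN.
    coarse = np.array([maxsim(query, cen) for cen in centroids])
    candidates = np.argsort(-coarse)[:shortlist]
    # Stage 2 (fine): exact scoring with full patch embeddings on the shortlist.
    ranked = sorted(candidates, key=lambda i: -maxsim(query, pages[i]))
    return [int(i) for i in ranked[:top_k]]

# Toy corpus: page i consists of 16 copies of the i-th basis vector.
basis = np.eye(5)
pages = [np.tile(basis[i], (16, 1)) for i in range(5)]
centroids = [coarse_centroids(p, c=8) for p in pages]
query = np.tile(basis[3], (2, 1))  # two query vectors aligned with page 3

print(retrieve(query, pages, centroids))  # [3]
```

The point of the two stages is that the expensive full-patch scoring runs only on the shortlist, while the coarse stage touches just `c` vectors per page.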

2604.27723 2026-05-01 cs.LG stat.ML

Optimized Deferral for Imbalanced Settings

Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong

Details
English abstract

Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known as learning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the two-stage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Extensive experiments demonstrate the effectiveness of our approach, showing clear improvements over existing baselines on both image classification and real-world Large Language Model (LLM) routing tasks.
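A small numerical example may help show why expert imbalance matters in two-stage deferral. The sketch below defines a simple deferral cost (misclassification indicator plus a per-expert usage cost) and compares a router that always picks the majority expert with the per-input optimum. It is not the MILD algorithm; the cost definition and all names are assumptions for illustration.

```python
import numpy as np

def deferral_costs(expert_correct, expert_cost):
    """Cost of routing input i to expert j: error indicator plus expert j's usage cost."""
    return (1.0 - expert_correct) + expert_cost[None, :]

def deferral_loss(router_choice, costs):
    """Average cost incurred by a router that sends input i to expert router_choice[i]."""
    return float(costs[np.arange(len(router_choice)), router_choice].mean())

# Rows are inputs, columns are experts; 1 means the expert predicts correctly.
expert_correct = np.array([[1, 1], [1, 1], [1, 1], [0, 1]], dtype=float)
expert_cost = np.array([0.0, 0.3])  # expert 1 is more accurate but costs more to use

costs = deferral_costs(expert_correct, expert_cost)
best = costs.argmin(axis=1)            # expert 0 is optimal for 3 of the 4 inputs
always_majority = np.zeros(4, dtype=int)

print(best.tolist())                           # [0, 0, 0, 1]
print(deferral_loss(always_majority, costs))   # 0.25
print(deferral_loss(best, costs))              # about 0.075
```

Because expert 0 is the best choice for most inputs, a router trained without regard to imbalance can collapse to always choosing it, which is the failure mode the abstract describes.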