arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1618
2509.16654 2026-03-04 cs.CV

Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?

Xin Chen, Jia He, Maozheng Li, Dongliang Xu, Tianyu Wang, Yixiao Chen, Zhixin Lin, Yue Yao

Comments 5 pages, 5 figures

详情
英文摘要

Vision-Language Models (VLMs) have recently shown remarkable progress in multimodal reasoning, yet their applications in autonomous driving remain limited. In particular, the ability to understand road topology, a key requirement for safe navigation, has received relatively little attention. While some recent works have begun to explore VLMs in driving contexts, their performance on topology reasoning is far from satisfactory. In this work, we systematically evaluate VLMs' capabilities in road topology understanding. Specifically, multi-view images are projected into unified ground-plane coordinate system and fused into bird's-eye-view (BEV) lanes. Based on these BEV lanes, we formulate four topology-related diagnostic VQA tasks, which together capture essential components of spatial topology reasoning. Through extensive evaluation, we find that while frontier closed-source models (e.g., GPT-4o) achieve relatively high accuracy in some tasks, they still fail in some spatial questions that humans can answer (e.g., GPT-4o achieve only 67.8% in vector, a two-class classification problem). Furthermore, we find open-source VLMs, even at 30B scale, struggle significantly. These results indicate that spatial reasoning remains a fundamental bottleneck for current VLMs. We also find that the model's capability is positively correlated with model size, length of reasoning tokens and shots provided as examples, showing direction for future research.

2509.11663 2026-03-04 cs.RO cs.AI cs.CV

ConEQsA: Concurrent and Asynchronous Embodied Questions Scheduling and Answering

Haisheng Wang, Dong Liu, Weiming Zhi

Comments 8 pages, 6 figures

详情
英文摘要

This paper formulates the Embodied Questions Answering (EQsA) problem, introduces a corresponding benchmark, and proposes an agentic system to tackle the problem. Classical Embodied Question Answering (EQA) is typically formulated as answering one single question by actively exploring a 3D environment. Real deployments, however, often demand handling multiple questions that may arrive asynchronously and carry different urgencies. We formalize this setting as Embodied Questions Answering (EQsA) and present ConEQsA, an agentic framework for concurrent, urgency-aware scheduling and answering. ConEQsA leverages shared group memory to reduce redundant exploration, and a priority-planning method to dynamically schedule questions. To evaluate the EQsA setting fairly, we contribute the Concurrent Asynchronous Embodied Questions (CAEQs) benchmark containing 40 indoor scenes and five questions per scene (200 in total), featuring asynchronous follow-up questions and human-annotated urgency labels. We further propose metrics for EQsA performance: Direct Answer Rate (DAR), and Normalized Urgency-Weighted Latency (NUWL), which serve as a fair evaluation protocol for EQsA. Empirical evaluations demonstrate that ConEQsA consistently outperforms strong sequential baselines, and show that urgency-aware, concurrent scheduling is key to making embodied agents responsive and efficient under realistic, multi-question workloads. Code is available on https://anonymous.4open.science/r/ConEQsA.

2509.10625 2026-03-04 cs.CL cs.AI

No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi

Comments Accepted (poster) to Principled Design for Trustworthy AI at ICLR 2026

详情
英文摘要

Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this "in-advance correctness direction" trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, indicating a deeper signal than dataset-specific spurious features, and outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers and, notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models responding "I don't know", doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings to elucidate LLM internals.

2508.18817 2026-03-04 cs.RO cs.LG

Learning Acrobatic Flight from Preferences

Colin Merk, Ismail Geles, Jiaxu Xing, Angel Romero, Giorgia Ramponi, Davide Scaramuzza

Comments 8 pages, 6 figures

详情
英文摘要

Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. However, manually designed reward functions for such tasks often fail to capture the qualities that matter: we find that hand-crafted rewards agree with human judgment only 60.7% of the time, underscoring the need for preference-driven approaches. In this work, we propose Reward Ensemble under Confidence (REC), a probabilistic reward learning framework for PbRL that explicitly models per-timestep reward uncertainty through an ensemble of distributional reward models. By propagating uncertainty into the preference loss and leveraging disagreement for exploration, REC achieves 88.4% of shaped reward performance on acrobatic quadrotor control, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them zero-shot to the real world, demonstrating complex acrobatic maneuvers learned purely from preference feedback. We further validate REC on a continuous control benchmark, confirming its applicability beyond the domain of aerial robotics.

2508.18264 2026-03-04 cs.CV

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

Comments Accepted by ICLR 2026. Code: https://github.com/Ironieser/mmtok , Project Homepage: https://project.ironieser.cc/mmtok

详情
英文摘要

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual inputs to vision tokens. However, redundancy in vision tokens results in the degraded inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, it lacks a generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. Afterwards, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline with a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Finally, with only four vision tokens, 87.7% of the original performance is still preserved on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection. The code is available at https://github.com/Ironieser/mmtok

2508.05282 2026-03-04 cs.CL

Not All Errors Are Created Equal: ASCoT Addresses Late-Stage Fragility in Efficient LLM Reasoning

Dongxu Zhang, Yujun Wu, Yiding Sun, Jinnan Yang, Ning Yang, Jihua Zhu, Miao Xin, Baoliang Tian

详情
英文摘要

While Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs), ensuring reasoning reliability remains an open challenge. Contrary to the prevailing cascading failure hypothesis which posits that early errors are most detrimental, we identify a counter-intuitive phenomenon termed \textbf{Late-Stage Fragility}: errors introduced in later reasoning stages are significantly more prone to corrupting final answers. To address this, we introduce ASCoT (Adaptive Self-Correction Chain-of-Thought), a method harmonizing efficiency with robust verification. ASCoT first employs semantic pruning to compress redundant steps, then utilizes an Adaptive Verification Manager (AVM) to prioritize high risk, late-stage steps via a positional impact score, triggering a Multi-Perspective Self-Correction Engine (MSCE) only when necessary. Experiments on GSM8K and MATH-500 demonstrate that ASCoT effectively reallocates computational resources: it reduces token usage by 21\%--30\% for LLaMA-3.1-8B with negligible accuracy drops ($<1.8\%$), achieving a superior trade-off between inference efficiency and reasoning fidelity.

2508.04542 2026-03-04 cs.LG cs.CR cs.SI

Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape

Haoran Niu, K. Suzanne Barber

Comments 13 pages, 10 figures, 1 table

详情
英文摘要

It is difficult for individuals and organizations to protect personal information without a fundamental understanding of relative privacy risks. By analyzing over 5,000 empirical identity theft and fraud cases, this research identifies which types of personal data are exposed, how frequently such exposures occur, and what the consequences of those exposures are. We construct an Identity Ecosystem graph - a foundational, graph-based model in which nodes represent personally identifiable information (PII) attributes and edges represent empirical disclosure relationships between them (e.g., one PII attribute is exposed due to the exposure of another). Leveraging this graph structure, we develop a privacy risk prediction framework that uses graph theory and graph neural networks to estimate the likelihood of further disclosures when certain PII attributes are compromised. The results show that our approach effectively addresses the core question: Can the disclosure of a given identity attribute possibly lead to the disclosure of another attribute? The code for the privacy risk prediction framework is available at: https://github.com/niu-haoran/Privacy-Risk-Predictions-and-UTCID-Identity-Ecosystem.git.

2508.01592 2026-03-04 cs.CV cs.AI

DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter

Weihong Li, Shaohua Dong, Haonan Lu, Yanhao Zhang, Heng Fan, Libo Zhang

Comments Accepted by ICRA 2026

详情
英文摘要

In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key of our DMTrack lies in two simple yet effective modules, including a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA) module. The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent can bridge the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches, thereby laying the foundation for following modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks demonstrate that DMTrack achieves state-of-the-art results. Our code and models will be available at https://github.com/Nightwatch-Fox11/DMTrack.

2507.16334 2026-03-04 cs.AI cs.LG math.DG

Higher Gauge Flow Models

Alexander Strunk, Roland Assam

Comments arXiv admin note: text overlap with arXiv:2507.13414

详情
英文摘要

This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L$_{\infty}$-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.

2507.13414 2026-03-04 cs.LG cs.AI math.DG

Gauge Flow Models

Alexander Strunk, Roland Assam

详情
英文摘要

This paper introduces Gauge Flow Models, a novel class of Generative Flow Models. These models incorporate a learnable Gauge Field within the Flow Ordinary Differential Equation (ODE). A comprehensive mathematical framework for these models, detailing their construction and properties, is provided. Experiments using Flow Matching on Gaussian Mixture Models demonstrate that Gauge Flow Models yields significantly better performance than traditional Flow Models of comparable or even larger size. Additionally, unpublished research indicates a potential for enhanced performance across a broader range of generative tasks.

2506.23316 2026-03-04 cs.RO cs.CV

SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction

Zhenghao Peng, Yuxin Liu, Bolei Zhou

详情
英文摘要

Realistic and interactive traffic simulation is essential for training and evaluating autonomous driving systems. However, most existing data-driven simulation methods rely on static initialization or log-replay data, limiting their ability to model dynamic, long-horizon scenarios with evolving agent populations. We propose SceneStreamer, a unified autoregressive framework for continuous scenario generation that represents the entire scene as a sequence of tokens, including traffic light signals, agent states, and motion vectors, and generates them step by step with a transformer model. This design enables SceneStreamer to continuously introduce and retire agents over an unbounded horizon, supporting realistic long-duration simulation. Experiments demonstrate that SceneStreamer produces realistic, diverse, and adaptive traffic behaviors. Furthermore, reinforcement learning policies trained in SceneStreamer-generated scenarios achieve superior robustness and generalization, validating its utility as a high-fidelity simulation environment for autonomous driving. More information is available at https://vail-ucla.github.io/scenestreamer/ .

2506.07275 2026-03-04 cs.LG cs.HC stat.AP

Tailored Behavior-Change Messaging for Physical Activity: Integrating Contextual Bandits and Large Language Models

Haochen Song, Dominik Hofer, Rania Islambouli, Laura Hawkins, Ananya Bhattacharjee, Zahra Hassanzadeh, Jan Smeddinck, Meredith Franklin, Joseph Jay Williams

详情
英文摘要

Contextual multi-armed bandit (cMAB) algorithms offer a promising framework for adapting behavioral interventions to individuals over time. However, cMABs often require large samples to learn effectively and typically rely on a finite pre-set of fixed message templates. In this paper, we present a hybrid cMABxLLM approach in which the cMAB selects an intervention type, and a large language model (LLM) which personalizes the message content within the selected type. We deployed this approach in a 30-day physical-activity intervention, comparing four behavioral change intervention types: behavioral self-monitoring, gain-framing, loss-framing, and social comparison, delivered as daily motivational messages to support motivation and achieve a daily step count. Message content is personalized using dynamic contextual factors, including daily fluctuations in self-efficacy, social influence, and regulatory focus. Over the trial, participants received daily messages assigned by one of five models: equal randomization (RCT), cMAB only, LLM only, LLM with interaction history, or cMABxLLM. Outcomes include motivation towards physical activity and message usefulness, assessed via ecological momentary assessments (EMAs). We evaluate and compare the five delivery models using pre-specified statistical analyses that account for repeated measures and time trends. We find that the cMABxLLM approach retains the perceived acceptance of LLM-generated messages, while reducing token usage and providing an explicit, reproducible decision rule for intervention selection. This hybrid approach also avoids the skew in intervention delivery by improving support for under-delivered intervention types. More broadly, our approach provides a deployable template for combining Bayesian adaptive experimentation with generative models in a way that supports both personalization and interpretability.

2506.07177 2026-03-04 cs.CV cs.AI

Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang

Comments ICLR 2026. Project page: https://frame-guidance-video.github.io/

详情
英文摘要

Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, compatible with any video models. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.

2506.03922 2026-03-04 cs.CL cs.AI cs.CV

HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li

详情
英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

2506.01303 2026-03-04 cs.LG q-bio.NC

Dynamic Manifold Hopfield Networks for Context-Dependent Associative Memory

Chong Li, Taiping Zeng, Xiangyang Xue, Jianfeng Feng

详情
英文摘要

Neural population activity in cortical and hippocampal circuits can be flexibly reorganized by context, suggesting that cognition relies on dynamic manifolds rather than static representations. However, how such dynamic organization can be realized mechanistically within a unified dynamical system remains unclear. Continuous Hopfield networks provide a classical attractor framework in which neural dynamics follow gradient descent on a fixed energy landscape, constraining retrieval within a static attractor manifold geometry. Extending this approach, we introduce Dynamic Manifold Hopfield Networks (DMHN), continuous dynamical models in which contextual modulation dynamically reshapes attractor geometry, transforming a static attractor manifold into a context-dependent family of neural manifolds. In DMHN, network interactions are learned in a data-driven manner, to intrinsically deform the geometry of its attractor manifold across cues without explicit context-specific parameterization. As a result, in associative retrieval, DMHN achieve substantially higher capacity and robustness than classical and modern Hopfield networks: when storing $2N$ patterns in a network of $N$ neurons, DMHN attain reliable retrieval with an average accuracy of 64%, compared with 1% and 13% for classical and modern variants, respectively. Together, these results establish dynamic reorganization of attractor manifold geometry as a principled mechanism for context-dependent remapping in neural associative memory.

2505.21813 2026-03-04 cs.LG stat.ML

Optimizing Data Augmentation through Bayesian Model Selection

Madi Matymov, Ba-Hien Tran, Michael Kampffmeyer, Markus Heinonen, Maurizio Filippone

Comments 26 pages, 3 figures

详情
英文摘要

Data Augmentation (DA) has become an essential tool to improve robustness and generalization of modern machine learning. However, when deciding on DA strategies it is critical to choose parameters carefully, and this can be a daunting task which is traditionally left to trial-and-error or expensive optimization based on validation performance. In this paper, we counter these limitations by proposing a novel framework for optimizing DA. In particular, we take a probabilistic view of DA, which leads to the interpretation of augmentation parameters as model (hyper)-parameters, and the optimization of the marginal likelihood with respect to these parameters as a Bayesian model selection problem. Due to its intractability, we derive a tractable ELBO, which allows us to optimize augmentation parameters jointly with model parameters. We provide extensive theoretical results on variational approximation quality, generalization guarantees, invariance properties, and connections to empirical Bayes. Through experiments on computer vision and NLP tasks, we show that our approach improves calibration and yields robust performance over fixed or no augmentation. Our work provides a rigorous foundation for optimizing DA through Bayesian principles with significant potential for robust machine learning.

2505.19892 2026-03-04 cs.AI

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao

详情
英文摘要

Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, there lacks a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, $\textbf{(i)}$ we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. $\textbf{(ii)}$ We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. $\textbf{(iii)}$ We find that model merging offers a promising way for building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.

2505.18996 2026-03-04 cs.LG stat.ML

Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs

Bob Junyi Zou, Lu Tian

Comments Accepted at The 14th International Conference on Learning Representations (ICLR) 2026

详情
英文摘要

Hybrid neural ordinary differential equations (neural ODEs) integrate mechanistic models with neural ODEs, offering strong inductive bias and flexibility, and are particularly advantageous in data-scarce healthcare settings. However, excessive latent states and interactions from mechanistic models can lead to training inefficiency and over-fitting, limiting practical effectiveness of hybrid neural ODEs. In response, we propose a new hybrid pipeline for automatic state selection and structure optimization in mechanistic neural ODEs, combining domain-informed graph modifications with data-driven regularization to sparsify the model for improving predictive performance and stability while retaining mechanistic plausibility. Experiments on synthetic and real-world data show improved predictive performance and robustness with desired sparsity, establishing an effective solution for hybrid model reduction in healthcare applications.

2505.15008 2026-03-04 cs.LG cs.AI stat.ML

Know When to Abstain: Optimal Selective Classification with Likelihood Ratios

Alvin Heng, Harold Soh

详情
英文摘要

Selective classification enhances the reliability of predictive models by allowing them to abstain from making uncertain predictions. In this work, we revisit the design of optimal selection functions through the lens of the Neyman--Pearson lemma, a classical result in statistics that characterizes the optimal rejection rule as a likelihood ratio test. We show that this perspective not only unifies the behavior of several post-hoc selection baselines, but also motivates new approaches to selective classification which we propose here. A central focus of our work is the setting of covariate shift, where the input distribution at test time differs from that at training. This realistic and challenging scenario remains relatively underexplored in the context of selective classification. We evaluate our proposed methods across a range of vision and language tasks, including both supervised learning and vision-language models. Our experiments demonstrate that our Neyman--Pearson-informed methods consistently outperform existing baselines, indicating that likelihood ratio-based selection offers a robust mechanism for improving selective classification under covariate shifts. Our code is publicly available at https://github.com/clear-nus/sc-likelihood-ratios.

2504.10190 2026-03-04 cs.CV

Differentially Private 2D Human Pose Estimation

Kaushik Bhargav Sivangi, Paul Henderson, Fani Deligianni

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

详情
英文摘要

Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first comprehensive framework for differentially private 2D human pose estimation (2D-HPE) by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Next, we incorporate Feature Differential Privacy(FDP) to selectively privatize only sensitive features while retaining public visual cues. Finally, we propose a hybrid feature-projective DP framework that combines both approaches to balance privacy and accuracy for HPE. We evaluate our approach on the MPII dataset across varying privacy budgets, training strategies, and clipping norms. Our combined feature-projective method consistently outperforms vanilla DP-SGD and individual baselines, achieving up to 82.61\% mean PCKh@0.5 at $ε= 0.8$, substantially closing the gap to the non-private performance. This work lays foundation for privacy-preserving human pose estimation in real-world, sensitive applications.

2503.14572 2026-03-04 cs.LG cs.AI

Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation

Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers

详情
Journal ref
Transactions on Machine Learning Research (2025)
英文摘要

The capacity of foundation models allows for their application to new, unseen tasks. The adaptation to such tasks is called transfer learning. An efficient transfer learning method that circumvents parameter optimization is imprinting. The conceptual differences between studies on imprinting form the basis of our systematic investigation. In this work, we propose the general \texttt{IMPRINT} framework, identifying three main components: generation, normalization, and aggregation. Through the lens of this framework, we conduct an in-depth analysis and a comparison of the existing methods. Our findings reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. Beyond an extensive analytical grounding, our framework enables us to propose a novel variant of imprinting which outperforms previous work on transfer learning tasks by 4\%. This variant determines proxies through clustering motivated by the neural collapse phenomenon -- a connection that we draw for the first time. We publicly release our code at https://github.com/DATEXIS/IMPRINT.

2503.12567 2026-03-04 cs.CV

GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch

Abyad Enan, Mashrur Chowdhury

Comments This work has been submitted to a peer-reviewed journal and is currently under review

详情
英文摘要

Computer vision plays a critical role in ensuring the safe navigation of autonomous vehicles (AVs). An AV perception module facilitates safe navigation. This module enables AVs to recognize traffic signs, traffic lights, and various road users. However, the perception module is vulnerable to adversarial attacks, which can compromise its accuracy and reliability. One such attack is the adversarial patch attack (APA), an attack in which an adversary strategically places a specially crafted sticker on an object to deceive object classifiers. Such an APA can cause AVs to misclassify traffic signs, leading to catastrophic incidents. To enhance the security of an AV perception system against APAs, this study develops a Generative Adversarial Network (GAN)-based single-stage defense strategy for traffic sign classification. This approach is tailored to defend against APAs across different classes of traffic signs, without prior knowledge of a patch's design, and is effective against patches of varying sizes. In addition, our single-stage defense is computationally efficient, requiring significantly lower computation time than existing multi-stage defenses, making it suitable for real-time deployment in autonomous driving systems. Compared to a classifier without any defense mechanism, our experimental analysis demonstrates that the defense strategy presented in this paper improves our classifier's accuracy under APA conditions by up to 90% considering the traffic sign classes considered in this study. and overall classification accuracy is enhanced by 55% for all traffic signs considered in this study. Our defense strategy is model agnostic, making it applicable to any traffic sign classifier, regardless of the underlying classification model.

2503.07348 2026-03-04 cs.CV

Cycle-Consistent Multi-Graph Matching for Self-Supervised Annotation of C.Elegans

Christoph Karg, Sebastian Stricker, Lisa Hutschenreiter, Bogdan Savchynskyy, Dagmar Kainmueller

详情
英文摘要

In this work we present a novel approach for unsupervised multi-graph matching, which applies to problems for which a Gaussian distribution of keypoint features can be assumed. We leverage cycle consistency as loss for self-supervised learning, and determine Gaussian parameters through Bayesian Optimization, yielding a highly efficient approach that scales to large datasets. Our fully unsupervised approach enables us to reach the accuracy of state-of-the-art supervised methodology for the biomedical use case of semantic cell annotation in 3D microscopy images of the worm C. elegans. To this end, our approach yields the first unsupervised atlas of C. elegans, i.e. a model of the joint distribution of all of its cell nuclei, without the need for any ground truth cell annotation. This advancement enables highly efficient semantic annotation of cells in large microscopy datasets, overcoming a current key bottleneck. Beyond C. elegans, our approach offers fully unsupervised construction of cell-level atlases for any model organism with a stereotyped body plan down to the level of unique semantic cell labels, and thus bears the potential to catalyze respective biomedical studies in a range of further species.

2502.20314 2026-03-04 cs.LG

Adversarial Attacks in Weight-Space Classifiers

Tamir Shor, Ethan Fetaya, Chaim Baskin, Alex Bronstein

详情
英文摘要

Implicit Neural Representations (INRs) have been recently garnering increasing interest in various research fields, mainly due to their ability to represent large, complex data in a compact, continuous manner. Past work further showed that numerous popular downstream tasks can be performed directly in the INR parameter-space. Doing so can substantially reduce the computational resources required to process the represented data in their native domain. A major difficulty in using modern machine-learning approaches, is their high susceptibility to adversarial attacks, which have been shown to greatly limit the reliability and applicability of such methods in a wide range of settings. In this work, we perform an in-depth security analysis of the behavior of weight-space classifiers under adversarial attacks. Our study reveals that parameter-space models trained for classification exhibit increased robustness to standard white-box adversarial attacks compared to standard classifiers that operate in the original signal space. This is achieved without the need of any robust training. We source this robust behavior to the phenomenon of gradient-obfuscation promoted during the INR optimization process, and pinpoint the limitations of this robustness under alternative adversarial approaches. To support our claims, we develop a novel suite of adversarial attacks targeting parameter-space classifiers, and furthermore analyze practical considerations of such attacks.

2502.16894 2026-03-04 cs.CL

Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment

Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xiaoye Qu, Wei Wei, Yu Cheng

Comments Accepted by ICML 2025

详情
英文摘要

While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.

2412.17558 2026-03-04 cs.CL

A Survey of Query Optimization in Large Language Models

Mingyang Song, Mao Zheng

Comments Ongoing Work

详情
英文摘要

Query Optimization (QO) has become essential for enhancing Large Language Model (LLM) effectiveness, particularly in Retrieval-Augmented Generation (RAG) systems where query quality directly determines retrieval and response performance. This survey provides a systematic and comprehensive analysis of query optimization techniques with three principal contributions. \textit{First}, we introduce the \textbf{Query Optimization Lifecycle (QOL) Framework}, a five-phase pipeline covering Intent Recognition, Query Transformation, Retrieval Execution, Evidence Integration, and Response Synthesis, providing a unified lens for understanding the optimization process. \textit{Second}, we propose a \textbf{Query Complexity Taxonomy} that classifies queries along two dimensions, namely evidence type (explicit vs.\ implicit) and evidence quantity (single vs.\ multiple), establishing principled mappings between query characteristics and optimization strategies. \textit{Third}, we conduct an in-depth analysis of four atomic operations, namely \textbf{Query Expansion}, \textbf{Query Decomposition}, \textbf{Query Disambiguation}, and \textbf{Query Abstraction}, synthesizing a broad spectrum of representative methods from premier venues. We further examine evaluation methodologies, identify critical gaps in existing benchmarks, and discuss open challenges including process reward models, efficiency optimization, and multi-modal query handling. This survey offers both a structured foundation for research and actionable guidance for practitioners.

2411.11677 2026-03-04 cs.LG cs.CR cs.IR

Few-shot Model Extraction Attacks against Sequential Recommender Systems

Hui Zhang, Fu Liu

Comments there are something wrong in the formula

详情
英文摘要

Among adversarial attacks against sequential recommender systems, model extraction attacks represent a method to attack sequential recommendation models without prior knowledge. Existing research has primarily concentrated on the adversary's execution of black-box attacks through data-free model extraction. However, a significant gap remains in the literature concerning the development of surrogate models by adversaries with access to few-shot raw data (10\% even less). That is, the challenge of how to construct a surrogate model with high functional similarity within the context of few-shot data scenarios remains an issue that requires resolution.This study addresses this gap by introducing a novel few-shot model extraction framework against sequential recommenders, which is designed to construct a superior surrogate model with the utilization of few-shot data. The proposed few-shot model extraction framework is comprised of two components: an autoregressive augmentation generation strategy and a bidirectional repair loss-facilitated model distillation procedure. Specifically, to generate synthetic data that closely approximate the distribution of raw data, autoregressive augmentation generation strategy integrates a probabilistic interaction sampler to extract inherent dependencies and a synthesis determinant signal module to characterize user behavioral patterns. Subsequently, bidirectional repair loss, which target the discrepancies between the recommendation lists, is designed as auxiliary loss to rectify erroneous predictions from surrogate models, transferring knowledge from the victim model to the surrogate model effectively. Experiments on three datasets show that the proposed few-shot model extraction framework yields superior surrogate models.

2409.00447 2026-03-04 cs.AI

The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

I. de Rodrigo, A. Sanchez-Cuadrado, J. Boal, A. J. Lopez-Lopez

详情
英文摘要

This paper introduces the MERIT Dataset, a multimodal (text + image + layout) fully labeled dataset within the context of school reports. Comprising over 400 labels and 33k samples, the MERIT Dataset is a valuable resource for training models in demanding Visually-rich Document Understanding (VrDU) tasks. By its nature (student grade reports), the MERIT Dataset can potentially include biases in a controlled way, making it a valuable tool to benchmark biases induced in Language Models (LLMs). The paper outlines the dataset's generation pipeline and highlights its main features in the textual, visual, layout, and bias domains. To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models and that these would greatly benefit from including samples from the MERIT Dataset in their pretraining phase.

2407.12843 2026-03-04 cs.CL cs.AI

NutriBench: A Dataset for Evaluating Large Language Models on Nutrition Estimation from Meal Descriptions

Andong Hua, Mehak Preet Dhaliwal, Laya Pullela, Ryan Burke, Yao Qin

Comments ICLR 2025

详情
英文摘要

Accurate nutrition estimation helps people make informed dietary choices and is essential in the prevention of serious health complications. We present NutriBench, the first publicly available natural language meal description nutrition benchmark. NutriBench consists of 11,857 meal descriptions generated from real-world global dietary intake data. The data is human-verified and annotated with macro-nutrient labels, including carbohydrates, proteins, fats, and calories. We conduct an extensive evaluation of NutriBench on the task of carbohydrate estimation, testing twelve leading Large Language Models (LLMs), including GPT-4o, Llama3.1, Qwen2, Gemma2, and OpenBioLLM models, using standard, Chain-of-Thought and Retrieval-Augmented Generation strategies. Additionally, we present a study involving professional nutritionists, finding that LLMs can provide comparable but significantly faster estimates. Finally, we perform a real-world risk assessment by simulating the effect of carbohydrate predictions on the blood glucose levels of individuals with diabetes. Our work highlights the opportunities and challenges of using LLMs for nutrition estimation, demonstrating their potential to aid professionals and laypersons and improve health outcomes. Our benchmark is publicly available at: https://mehak126.github.io/nutribench.html

2407.01656 2026-03-04 cs.LG cond-mat.dis-nn physics.data-an q-bio.NC stat.ML

Absolute abstraction: a renormalisation group approach

Carlo Orientale Caputo, Elias Seiffert, Enrico Frausin, Matteo Marsili

Comments 35 pages, 6 figures

详情
英文摘要

Abstraction is the process of extracting the essential features from raw data while ignoring irrelevant details. It is well known that abstraction emerges with depth in neural networks, where deep layers capture abstract characteristics of data by combining lower level features encoded in shallow layers (e.g. edges). Yet we argue that depth alone is not enough to develop truly abstract representations. We advocate that the level of abstraction crucially depends on how broad the training set is. We address the issue within a renormalisation group approach where a representation is expanded to encompass a broader set of data. We take the unique fixed point of this transformation -- the Hierarchical Feature Model -- as a candidate for a representation which is absolutely abstract. This theoretical picture is tested in numerical experiments based on Deep Belief Networks and auto-encoders trained on data of different breadth. These show that representations in neural networks approach the Hierarchical Feature Model as the data get broader and as depth increases, in agreement with theoretical predictions.