arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1582
专题追踪
2502.04028 2026-02-11 cs.LG

Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning

Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

详情
英文摘要

This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. Through DMCG, we dynamically compose what we refer to as \textit{meta coordination graphs}, to learn a more expressive representation of agent interactions and use them to integrate agent information through graph convolutional networks. The goal is to enable an evolving coordination graph to guide effective coordination in cooperative MARL tasks. The graphs are jointly optimized with agents' value functions to learn to implicitly reason about joint actions, facilitating the end-to-end learning of interaction representations and coordinated policies. We demonstrate that DMCG consistently achieves state-of-the-art coordination performance and sample efficiency on challenging cooperative tasks, outperforming several prior graph-based and non-graph-based MARL baselines. Through several ablations, we also isolate the impact of individual components in DMCG, showing that the observed improvements are due to the meaningful design choices in this approach. We also include an analysis of its computational complexity to discuss its practicality in real-world applications. All codes can be found here: {\color{blue}{https://github.com/Nikunj-Gupta/dmcg-marl}.

2501.10913 2026-02-11 cs.CV cs.CL

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, Sungroh Yoon

Comments Accepted to ICCV 2025

Journal ref Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 2825-2835

详情
英文摘要

While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg-a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.

2501.04121 2026-02-11 cs.CV

Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition

Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh

Comments We expanded the paper and resubmitted as a separate submission to arXiv. This submission is outdated and readers can refer to arXiv:2506.01102

详情
英文摘要

Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos, and leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos. Our approach consists of constructing a graph where each video clip of the egocentric video corresponds to a node. During training, we consider each clip of each exocentric video (if available) as additional nodes. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute efficient. We also present a study examining on harnessing several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph and discuss their corresponding contribution to the keystep recognition performance.

2412.19571 2026-02-11 cs.RO

xFLIE: Leveraging Actionable Hierarchical Scene Representations for Autonomous Semantic-Aware Inspection Missions

Vignesh Kottayam Viswanathan, Mario A. V. Saucedo, Sumeet Gajanan Satpute, Christoforos Kanellakis, George Nikolakopoulos

Comments 20 pages

详情
英文摘要

We present a novel architecture aimed towards incremental construction and exploitation of a hierarchical 3D scene graph representation during semantic-aware inspection missions. Inspection planning, particularly of distributed targets in previously unseen environments, presents an opportunity to exploit the semantic structure of the scene during reasoning, navigation and scene understanding. Motivated by this, we propose the 3D Layered Semantic Graph (3DLSG), a hierarchical inspection scene graph constructed in an incremental manner and organized into abstraction layers that support planning demands in real-time. To address the task of semantic-aware inspection, a mission framework, termed as Enhanced First-Look Inspect Explore (xFLIE), that tightly couples the 3DLSG with an inspection planner is proposed. We assess the performance through simulations and experimental trials, evaluating target-selection, path-planning and semantic navigation tasks over the 3DLSG model. The scenarios presented are diverse, ranging from city-scale distributed to solitary infrastructure targets in simulated worlds and subsequent outdoor and subterranean environment deployments onboard a quadrupedal robot. The proposed method successfully demonstrates incremental construction and planning over the 3DLSG representation to meet the objectives of the missions. Furthermore, the framework demonstrates successful semantic navigation tasks over the structured interface at the end of the inspection missions. Finally, we report multiple orders of magnitude reduction in path-planning time compared to conventional volumetric-map-based methods over various environment scale, demonstrating the planning efficiency and scalability of the proposed approach.

2412.14708 2026-02-11 cs.AI cs.DC cs.ET cs.HC

Creation of AI-driven Smart Spaces for Enhanced Indoor Environments -- A Survey

Aygün Varol, Naser Hossein Motlagh, Mirka Leino, Sasu Tarkoma, Johanna Virkki

Comments 39 pages, 3 figures, 1 table, journal

Journal ref Internet of Things, Volume 36, 2026, 101876

详情
英文摘要

Smart spaces are ubiquitous computing environments that integrate diverse sensing and communication technologies to enhance space functionality, optimize energy utilization, and improve user comfort and well-being. The integration of emerging AI methodologies into these environments facilitates the formation of AI-driven smart spaces, which further enhance functionalities of the spaces by enabling advanced applications such as personalized comfort settings, interactive living spaces, and automatization of the space systems, all resulting in enhanced indoor experiences of the users. In this paper, we present a systematic survey of existing research on the foundational components of AI-driven smart spaces, including sensor technologies, data communication protocols, sensor network management and maintenance strategies, as well as the data collection, processing and analytics. Given the pivotal role of AI in establishing AI-powered smart spaces, we explore the opportunities and challenges associated with traditional machine learning (ML) approaches, such as deep learning (DL), and emerging methodologies including large language models (LLMs). Finally, we provide key insights necessary for the development of AI-driven smart spaces, propose future research directions, and sheds light on the path forward.

2412.08794 2026-02-11 cs.LG stat.ML

Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming

Journal ref International Conference on Learning Representations (ICLR), 2025

详情
英文摘要

In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints, utilizing only offline data. Traditional methods often face difficulties in balancing these constraints, leading to either diminished performance or increased safety risks. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders, which model the latent safety constraints. Subsequently, we frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. Our methodology is supported by theoretical analysis, including bounds on policy performance and sample complexity. Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, demonstrates that our approach not only maintains safety compliance but also excels in cumulative reward optimization, surpassing existing methods. Additional visualizations provide further insights into the effectiveness and underlying mechanisms of our approach.

2411.13883 2026-02-11 cs.LG cs.AI cs.CY

Influence of Recommender Systems on Users: A Dynamical Systems Analysis

Prabhat Lankireddy, Jayakrishnan Nair, D Manjunath

详情
英文摘要

We analyze the unintended effects that recommender systems have on the preferences of users that they are learning. We consider a contextual multi-armed bandit recommendation algorithm that learns optimal product recommendations based on user and product attributes. It is well known that the sequence of recommendations affects user preferences. However, typical learning algorithms treat the user attributes as static and disregard the impact of their recommendations on user preferences. Our interest is to analyze the effect of this mismatch between the model assumption of a static environment and the reality of an evolving environment affected by the recommendations. To perform this analysis, we introduce a model for the coupled evolution of a linear bandit recommendation system and its users, whose preferences are drawn towards the recommendations made by the algorithm. We describe a method, that is grounded in stochastic approximation theory, to come up with a dynamical system model that asymptotically approximates the mean behavior of the stochastic model. The resulting dynamical system captures the coupled evolution of the population preferences and the learning algorithm. Analyzing this dynamical system gives insight into the long-term properties of user preferences and the learning algorithm. Under certain conditions, we show that the RS is able to learn the population preferences in spite of the model mismatch. We discuss and characterize the relation between various parameters of the model and the long term preferences of users in this work. A key observation is that the exploration-exploitation tradeoff used by the recommendation algorithm significantly affects the long term preferences of users. Algorithms that exploit more can polarize user preferences, leading to the well-known filter bubble phenomenon.

2411.12188 2026-02-11 cs.CV cs.LG

Constant Rate Scheduling: A General Framework for Optimizing Diffusion Noise Schedule via Distributional Change

Shuntaro Okada, Kenji Doi, Ryota Yoshihashi, Hirokatsu Kataoka, Tomohiro Tanaka

Comments Published in Transactions on Machine Learning Research (TMLR), January 2026

详情
英文摘要

We propose a general framework for optimizing noise schedules in diffusion models, applicable to both training and sampling. Our method enforces a constant rate of change in the probability distribution of diffused data throughout the diffusion process, where the rate of change is quantified using a user-defined discrepancy measure. We introduce three such measures, which can be flexibly selected or combined depending on the domain and model architecture. While our framework is inspired by theoretical insights, we do not aim to provide a complete theoretical justification of how distributional change affects sample quality. Instead, we focus on establishing a general-purpose scheduling framework and validating its empirical effectiveness. Through extensive experiments, we demonstrate that our approach consistently improves the performance of both pixel-space and latent-space diffusion models, across various datasets, samplers, and a wide range of number of function evaluations from 5 to 250. In particular, when applied to both training and sampling schedules, our method achieves a state-of-the-art FID score of 2.03 on LSUN Horse 256$\times$256, without compromising mode coverage.

2410.03972 2026-02-11 cs.LG cs.IT cs.NE math.IT q-bio.NC

Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

Ann Huang, Satpreet H. Singh, Flavio Martinelli, Kanaka Rajan

Journal ref Advances in Neural Information Processing Systems (2025)

详情
英文摘要

Task-trained recurrent neural networks (RNNs) are widely used in neuroscience and machine learning to model dynamical computations. To gain mechanistic insight into how neural systems solve tasks, prior work often reverse-engineers individual trained networks. However, different RNNs trained on the same task and achieving similar performance can exhibit strikingly different internal solutions, a phenomenon known as solution degeneracy. Here, we develop a unified framework to systematically quantify and control solution degeneracy across three levels: behavior, neural dynamics, and weight space. We apply this framework to 3,400 RNNs trained on four neuroscience-relevant tasks: flip-flop memory, sine wave generation, delayed discrimination, and path integration, while systematically varying task complexity, learning regime, network size, and regularization. We find that higher task complexity and stronger feature learning reduce degeneracy in neural dynamics but increase it in weight space, with mixed effects on behavior. In contrast, larger networks and structural regularization reduce degeneracy at all three levels. These findings empirically validate the Contravariance Principle and provide practical guidance for researchers seeking to tune the variability of RNN solutions, either to uncover shared neural mechanisms or to model the individual variability observed in biological systems. This work provides a principled framework for quantifying and controlling solution degeneracy in task-trained RNNs, offering new tools for building more interpretable and biologically grounded models of neural computation.

2410.02387 2026-02-11 cs.LG cs.AI

BiSSL: Enhancing the Alignment Between Self-Supervised Pretraining and Downstream Fine-Tuning via Bilevel Optimization

Gustav Wagner Zakarias, Lars Kai Hansen, Zheng-Hua Tan

Journal ref Transactions on Machine Learning Research (TMLR), (02/2026)

详情
英文摘要

Models initialized from self-supervised pretraining may suffer from poor alignment with downstream tasks, reducing the extent to which subsequent fine-tuning can adapt pretrained features toward downstream objectives. To mitigate this, we introduce BiSSL, a novel bilevel training framework that enhances the alignment of self-supervised pretrained models with downstream tasks prior to fine-tuning. BiSSL acts as an intermediate training stage conducted after conventional self-supervised pretraining and is tasked with solving a bilevel optimization problem that incorporates the pretext and downstream training objectives in its lower- and upper-level objectives, respectively. This approach explicitly models the interdependence between the pretraining and fine-tuning stages within the conventional self-supervised learning pipeline, facilitating enhanced information sharing between them that ultimately leads to a model initialization better aligned with the downstream task. We propose a general training algorithm for BiSSL that is compatible with a broad range of pretext and downstream tasks. Using SimCLR and Bootstrap Your Own Latent to pretrain ResNet-50 backbones on the ImageNet dataset, we demonstrate that our proposed framework significantly improves accuracy on the vast majority of 12 downstream image classification datasets, as well as on object detection. Exploratory analyses alongside investigative experiments further provide compelling evidence that BiSSL enhances downstream alignment.

2409.17538 2026-02-11 cs.LG cs.AI cs.CL

LoRA Provides Differential Privacy by Design via Random Sketching

Saber Malekmohammadi, Golnoosh Farnadi

详情
英文摘要

Low-rank adaptation of language models has been proposed to reduce the computational and memory overhead of fine-tuning pre-trained language models. LoRA incorporates trainable low-rank matrices into some parameters of the pre-trained model, called adapters. In this work, we show theoretically that the low-rank adaptation mechanism of LoRA is equivalent to fine-tuning adapters with noisy batch gradients, with the noise variance being a decreasing function of adaptation rank ($r$). Motivated by this understanding, we prove inherent differential privacy for LoRA when adaptation matrices $A_\ell$ are frozen. We show that various factors, e.g., the adaptation rank and batch size, affect the guaranteed privacy level. Our findings provide useful insights into LoRA and uncovers the reason behind the robustness of models fine-tuned with LoRA to privacy attacks.

2406.16821 2026-02-11 cs.LG cs.AI physics.bio-ph physics.chem-ph q-bio.BM

General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design

Yue Jian, Curtis Wu, Danny Reidenbach, Aditi S. Krishnapriyan

详情
英文摘要

Structure-based drug design (SBDD) aims to generate ligands that bind strongly and specifically to target protein pockets. Recent diffusion models have advanced SBDD by capturing the distributions of atomic positions and types, yet they often underemphasize binding affinity control during generation. To address this limitation, we introduce \textbf{\textnormal{\textbf{BADGER}}}, a general \textbf{binding-affinity guidance framework for diffusion models in SBDD}. \textnormal{\textbf{BADGER} }incorporates binding affinity awareness through two complementary strategies: (1) \textit{classifier guidance}, which applies gradient-based affinity signals during sampling in a plug-and-play fashion, and (2) \textit{classifier-free guidance}, which integrates affinity conditioning directly into diffusion model training. Together, these approaches enable controllable ligand generation guided by binding affinity. \textnormal{\textbf{BADGER} } can be added to any diffusion model and achieves up to a \textbf{60\% improvement in ligand--protein binding affinity} of sampled molecules over prior methods. Furthermore, we extend the framework to \textbf{multi-constraint diffusion guidance}, jointly optimizing for binding affinity, drug-likeness (QED), and synthetic accessibility (SA) to design realistic and synthesizable drug candidates.

2406.07089 2026-02-11 cs.CV

RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent

Wenjia Xu, Zijian Yu, Boyang Mu, Zhiwei Wei, Yuanben Zhang, Guangzuo Li, Jiuniu Wang, Mugen Peng

详情
英文摘要

The unprecedented advancements in Multimodal Large Language Models (MLLMs) have demonstrated strong potential in interacting with humans through both language and visual inputs to perform downstream tasks such as visual question answering and scene understanding. However, these models are constrained to basic instruction-following or descriptive tasks, facing challenges in complex real-world remote sensing applications that require specialized tools and knowledge. To address these limitations, we propose RS-Agent, an AI agent designed to interact with human users and autonomously leverage specialized models to address the demands of real-world remote sensing applications. RS-Agent integrates four key components: a Central Controller based on large language models, a dynamic toolkit for tool execution, a Solution Space for task-specific expert guidance, and a Knowledge Space for domain-level reasoning, enabling it to interpret user queries and orchestrate tools for accurate remote sensing task. We introduce two novel mechanisms: Task-Aware Retrieval, which improves tool selection accuracy through expert-guided planning, and DualRAG, a retrieval-augmented generation method that enhances knowledge relevance through weighted, dual-path retrieval. RS-Agent supports flexible integration of new tools and is compatible with both open-source and proprietary LLMs. Extensive experiments across 9 datasets and 18 remote sensing tasks demonstrate that RS-Agent significantly outperforms state-of-the-art MLLMs, achieving over 95% task planning accuracy and delivering superior performance in tasks such as scene classification, object counting, and remote sensing visual question answering. Our work presents RS-Agent as a robust and extensible framework for advancing intelligent automation in remote sensing analysis.

2405.21016 2026-02-11 cs.CV

MpoxSLDNet: A Novel CNN Model for Detecting Monkeypox Lesions and Performance Comparison with Pre-trained Models

Fatema Jannat Dihan, Saydul Akbar Murad

详情
英文摘要

Monkeypox virus (MPXV) is a zoonotic virus that poses a significant threat to public health, particularly in remote parts of Central and West Africa. Early detection of monkeypox lesions is crucial for effective treatment. However, due to its similarity with other skin diseases, monkeypox lesion detection is a challenging task. To detect monkeypox, many researchers used various deep-learning models such as MobileNetv2, VGG16, ResNet50, InceptionV3, DenseNet121, EfficientNetB3, MobileNetV2, and Xception. However, these models often require high storage space due to their large size. This study aims to improve the existing challenges by introducing a CNN model named MpoxSLDNet (Monkeypox Skin Lesion Detector Network) to facilitate early detection and categorization of Monkeypox lesions and Non-Monkeypox lesions in digital images. Our model represents a significant advancement in the field of monkeypox lesion detection by offering superior performance metrics, including precision, recall, F1-score, accuracy, and AUC, compared to traditional pre-trained models such as VGG16, ResNet50, and DenseNet121. The key novelty of our approach lies in MpoxSLDNet's ability to achieve high detection accuracy while requiring significantly less storage space than existing models. By addressing the challenge of high storage requirements, MpoxSLDNet presents a practical solution for early detection and categorization of monkeypox lesions in resource-constrained healthcare settings. In this study, we have used "Monkeypox Skin Lesion Dataset" comprising 1428 skin images of monkeypox lesions and 1764 skin images of Non-Monkeypox lesions. Dataset's limitations could potentially impact the model's ability to generalize to unseen cases. However, the MpoxSLDNet model achieved a validation accuracy of 94.56%, compared to 86.25%, 84.38%, and 67.19% for VGG16, DenseNet121, and ResNet50, respectively.

2405.20044 2026-02-11 cs.CV

A Point-Neighborhood Learning Framework for Nasal Endoscope Image Segmentation

Pengyu Jie, Wanquan Liu, Chenqiang Gao, Yihui Wen, Rui He, Weiping Wen, Pengcheng Li, Jintao Zhang, Deyu Meng

Comments 10 pages, 10 figures,

Journal ref IEEE Transactions on Circuits and Systems for Video Technology 2025

详情
英文摘要

Lesion segmentation on nasal endoscopic images is challenging due to its complex lesion features. Fully-supervised deep learning methods achieve promising performance with pixel-level annotations but impose a significant annotation burden on experts. Although weakly supervised or semi-supervised methods can reduce the labelling burden, their performance is still limited. Some weakly semi-supervised methods employ a novel annotation strategy that labels weak single-point annotations for the entire training set while providing pixel-level annotations for a small subset of the data. However, the relevant weakly semi-supervised methods only mine the limited information of the point itself, while ignoring its label property and surrounding reliable information. This paper proposes a simple yet efficient weakly semi-supervised method called the Point-Neighborhood Learning (PNL) framework. PNL incorporates the surrounding area of the point, referred to as the point-neighborhood, into the learning process. In PNL, we propose a point-neighborhood supervision loss and a pseudo-label scoring mechanism to explicitly guide the model's training. Meanwhile, we proposed a more reliable data augmentation scheme. The proposed method significantly improves performance without increasing the parameters of the segmentation neural network. Extensive experiments on the NPC-LES dataset demonstrate that PNL outperforms existing methods by a significant margin. Additional validation on colonoscopic polyp segmentation datasets confirms the generalizability of the proposed PNL.

2405.18065 2026-02-11 cs.CV cs.AI

EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

Issar Tzachor, Boaz Lerner, Matan Levy, Michael Green, Tal Berkovitz Shalev, Gavriel Habib, Dvir Samuel, Noam Korngut Zailer, Or Shimshi, Nir Darshan, Rami Ben-Ari

Comments ICLR 2025

Journal ref International Conference on Learning Representations (ICLR), 2025

详情
英文摘要

The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on VPR-specific data. In this paper, we present an effective approach to harness the potential of a foundation model for VPR. We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only outperforms previous zero-shot approaches but also introduces results competitive with several supervised methods. We then show that a single-stage approach utilizing internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with impressive feature compactness down to 128D. Moreover, integrating our local foundation features for re-ranking further widens this performance gap. Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance, while handling challenging conditions such as occlusion, day-night transitions, and seasonal variations.

2405.07801 2026-02-11 cs.CV

Deep Learning-Based Object Pose Estimation: A Comprehensive Survey

Jian Liu, Wei Sun, Hui Yang, Zhiwen Zeng, Chongpei Liu, Jin Zheng, Xingyu Liu, Hossein Rahmani, Nicu Sebe, Ajmal Mian

Comments Accepted by IJCV 2026, 45 pages, 13 figures

Journal ref International Journal of Computer Vision, 2026

详情
英文摘要

Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made on different aspects of this area, outstanding challenges, and promising future directions, is missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation, covering all three formulations of the problem, \emph{i.e.}, instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks, providing the readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, as well as reports the performance of current state-of-the-art methods on these benchmarks, thereby facilitating the readers in selecting the most suitable method for their application. Finally, the survey identifies key challenges, reviews the prevailing trends along with their pros and cons, and identifies promising directions for future research. We also keep tracing the latest works at https://github.com/CNJianLiu/Awesome-Object-Pose-Estimation.

2312.12122 2026-02-11 cs.CV cs.GR

ZS-SRT: An Efficient Zero-Shot Super-Resolution Training Method for Neural Radiance Fields

Xiang Feng, Yongbo He, Yubo Wang, Chengkai Wang, Zhenzhong Kuang, Jiajun Ding, Feiwei Qin, Jun Yu, Jianping Fan

Journal ref Neurocomputing, Volume 590, 14 July 2024, Article 127714

详情
英文摘要

Neural Radiance Fields (NeRF) have achieved great success in the task of synthesizing novel views that preserve the same resolution as the training views. However, it is challenging for NeRF to synthesize high-quality high-resolution novel views with low-resolution training data. To solve this problem, we propose a zero-shot super-resolution training framework for NeRF. This framework aims to guide the NeRF model to synthesize high-resolution novel views via single-scene internal learning rather than requiring any external high-resolution training data. Our approach consists of two stages. First, we learn a scene-specific degradation mapping by performing internal learning on a pretrained low-resolution coarse NeRF. Second, we optimize a super-resolution fine NeRF by conducting inverse rendering with our mapping function so as to backpropagate the gradients from low-resolution 2D space into the super-resolution 3D sampling space. Then, we further introduce a temporal ensemble strategy in the inference phase to compensate for the scene estimation errors. Our method is featured on two points: (1) it does not consume high-resolution views or additional scene data to train super-resolution NeRF; (2) it can speed up the training process by adopting a coarse-to-fine strategy. By conducting extensive experiments on public datasets, we have qualitatively and quantitatively demonstrated the effectiveness of our method.

2311.08290 2026-02-11 cs.LG

On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

Nicholas E. Corrado, Josiah P. Hanna

Comments TMLR 2026

详情
英文摘要

On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce (Zhong et. al, 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled w.r.t. the current policy. We empirically evaluate PROPS on continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that (1) PROPS decreases sampling error throughout training and (2) increases the data efficiency of on-policy policy gradient algorithms.

2306.05880 2026-02-11 cs.LG cs.AI

Time Series Continuous Modeling for Imputation and Forecasting with Implicit Neural Representations

Etienne Le Naour, Louis Serrano, Léon Migus, Yuan Yin, Ghislain Agoua, Nicolas Baskiotis, Patrick Gallinari, Vincent Guigue

Journal ref Transactions on Machine Learning Research (TMLR), 2024

详情
英文摘要

We introduce a novel modeling approach for time series imputation and forecasting, tailored to address the challenges often encountered in real-world data, such as irregular samples, missing data, or unaligned measurements from multiple sensors. Our method relies on a continuous-time-dependent model of the series' evolution dynamics. It leverages adaptations of conditional, implicit neural representations for sequential data. A modulation mechanism, driven by a meta-learning algorithm, allows adaptation to unseen samples and extrapolation beyond observed time-windows for long-term predictions. The model provides a highly flexible and unified framework for imputation and forecasting tasks across a wide range of challenging scenarios. It achieves state-of-the-art performance on classical benchmarks and outperforms alternative time-continuous models.

2212.00133 2026-02-11 cs.LG math.OC stat.ML

Universal Neural Optimal Transport

Jonathan Geuter, Gregor Kornhardt, Ingimar Tomasson, Vaios Laschos

Comments 37 pages, 19 figures, accepted to ICML 2025

Journal ref Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:19196-19232, 2025

详情
英文摘要

Optimal Transport (OT) problems are a cornerstone of many applications, but solving them is computationally expensive. To address this problem, we propose UNOT (Universal Neural Optimal Transport), a novel framework capable of accurately predicting (entropic) OT distances and plans between discrete measures for a given cost function. UNOT builds on Fourier Neural Operators, a universal class of neural networks that map between function spaces and that are discretization-invariant, which enables our network to process measures of variable resolutions. The network is trained adversarially using a second, generating network and a self-supervised bootstrapping loss. We ground UNOT in an extensive theoretical framework. Through experiments on Euclidean and non-Euclidean domains, we show that our network not only accurately predicts OT distances and plans across a wide range of datasets, but also captures the geometry of the Wasserstein space correctly. Furthermore, we show that our network can be used as a state-of-the-art initialization for the Sinkhorn algorithm with speedups of up to $7.4\times$, significantly outperforming existing approaches.

2602.10074 2026-02-11 cs.CR cs.CL

CAPID: Context-Aware PII Detection for Question-Answering Systems

Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, D. B. Emerson, Shubhankar Mohapatra, Xi He

Comments Accepted to the Student Research Workshop at EACL 2026

详情
英文摘要

Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of them may be contextually relevant to the user's question, resulting in a degradation of response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but due to their closed source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance and type accuracy while preserving significantly higher downstream utility under anonymization.

2602.09970 2026-02-11 eess.AS cs.SD

BioME: A Resource-Efficient Bioacoustic Foundational Model for IoT Applications

Heitor R. Guimarães, Abhishek Tiwari, Mahsa Abdollahi, Anderson R. Avila, Tiago H. Falk

详情
英文摘要

Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.

2602.09959 2026-02-11 math.ST cs.LG stat.ML stat.TH

Statistical-Computational Trade-offs in Learning Multi-Index Models via Harmonic Analysis

Hugo Latourelle-Vigeant, Theodor Misiakiewicz

Comments 91 pages

详情
英文摘要

We study the problem of learning multi-index models (MIMs), where the label depends on the input $\boldsymbol{x} \in \mathbb{R}^d$ only through an unknown $\mathsf{s}$-dimensional projection $\boldsymbol{W}_*^\mathsf{T} \boldsymbol{x} \in \mathbb{R}^\mathsf{s}$. Exploiting the equivariance of this problem under the orthogonal group $\mathcal{O}_d$, we obtain a sharp harmonic-analytic characterization of the learning complexity for MIMs with spherically symmetric inputs -- which refines and generalizes previous Gaussian-specific analyses. Specifically, we derive statistical and computational complexity lower bounds within the Statistical Query (SQ) and Low-Degree Polynomial (LDP) frameworks. These bounds decompose naturally across spherical harmonic subspaces. Guided by this decomposition, we construct a family of spectral algorithms based on harmonic tensor unfolding that sequentially recover the latent directions and (nearly) achieve these SQ and LDP lower bounds. Depending on the choice of harmonic degree sequence, these estimators can realize a broad range of trade-offs between sample and runtime complexity. From a technical standpoint, our results build on the semisimple decomposition of the $\mathcal{O}_d$-action on $L^2 (\mathbb{S}^{d-1})$ and the intertwining isomorphism between spherical harmonics and traceless symmetric tensors.

2602.09936 2026-02-11 stat.ML cs.LG math.ST stat.TH

The Catastrophic Failure of The k-Means Algorithm in High Dimensions, and How Hartigan's Algorithm Avoids It

Roy R. Lederman, David Silva-Sánchez, Ziling Chen, Gilles Mordant, Amnon Balanov, Tamir Bendory

详情
英文摘要

Lloyd's k-means algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd's algorithm simply returns its initial partition - even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan's k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.

2602.08842 2026-02-11 cs.AR cs.RO cs.SY eess.SY

karl. - A Research Vehicle for Automated and Connected Driving

Jean-Pierre Busch, Lukas Ostendorf, Guido Linden, Lennart Reiher, Till Beemelmanns, Bastian Lampe, Timo Woopen, Lutz Eckstein

Comments 8 pages; Accepted to be published as part of the 37th Intelligent Vehicles Symposium (IV), Detroit, MI, United States, June 22-25, 2026

详情
英文摘要

As highly automated driving is transitioning from single-vehicle closed-access testing to commercial deployments of public ride-hailing in selected areas (e.g., Waymo), automated driving and connected cooperative intelligent transport systems (C-ITS) remain active fields of research. Even though simulation is omnipresent in the development and validation life cycle of automated and connected driving technology, the complex nature of public road traffic and software that masters it still requires real-world integration and testing with actual vehicles. Dedicated vehicles for research and development allow testing and validation of software and hardware components under real-world conditions early on. They also enable collecting and publishing real-world datasets that let others conduct research without vehicle access, and support early demonstration of futuristic use cases. In this paper, we present karl., our new research vehicle for automated and connected driving. Apart from major corporations, few institutions worldwide have access to their own L4-capable research vehicles, restricting their ability to carry out independent research. This paper aims to help bridge that gap by sharing the reasoning, design choices, and technical details that went into making karl. a flexible and powerful platform for research, engineering, and validation in the context of automated and connected driving. More impressions of karl. are available at https://karl.ac.

2511.02241 2026-02-11 cs.NE cs.AI cs.LG q-bio.NC

Structural Plasticity as Active Inference: A Biologically-Inspired Architecture for Homeostatic Control

Brennen A. Hill

Comments National Science Foundation (NSF) workshop on Brain-Inspired Dynamics for Engineering Energy-Efficient Circuits and Artificial Intelligence

详情
英文摘要

Traditional neural networks, while powerful, rely on biologically implausible learning mechanisms such as global backpropagation. This paper introduces the Structurally Adaptive Predictive Inference Network (SAPIN), a novel computational model inspired by the principles of active inference and the morphological plasticity observed in biological neural cultures. SAPIN operates on a 2D grid where processing units, or cells, learn by minimizing local prediction errors. The model features two primary, concurrent learning mechanisms: a local, Hebbian-like synaptic plasticity rule based on the temporal difference between a cell's actual activation and its learned expectation, and a structural plasticity mechanism where cells physically migrate across the grid to optimize their information-receptive fields. This dual approach allows the network to learn both how to process information (synaptic weights) and also where to position its computational resources (network topology). We validated the SAPIN model on the classic Cart Pole reinforcement learning benchmark. Our results demonstrate that the architecture can successfully solve the CartPole task, achieving robust performance. The network's intrinsic drive to minimize prediction error and maintain homeostasis was sufficient to discover a stable balancing policy. We also found that while continual learning led to instability, locking the network's parameters after achieving success resulted in a stable policy. When evaluated for 100 episodes post-locking (repeated over 100 successful agents), the locked networks maintained an average 82% success rate.

2506.01816 2026-02-11 math.OC cs.LG cs.NA math.DS math.NA

An adaptive data sampling strategy for stabilizing dynamical systems via controller inference

Steffen W. R. Werner, Benjamin Peherstorfer

Comments 27 pages, 9 figures

详情
英文摘要

Learning stabilizing controllers from data is an important task in engineering applications; however, collecting informative data is challenging because unstable systems often lead to rapidly growing or erratic trajectories. In this work, we propose an adaptive sampling scheme that generates data while simultaneously stabilizing the system to avoid instabilities during the data collection. Under mild assumptions, the approach provably generates data sets that are informative for stabilization and have minimal size. The numerical experiments demonstrate that controller inference with the novel adaptive sampling approach learns controllers with up to one order of magnitude fewer data samples than unguided data generation. The results show that the proposed approach opens the door to stabilizing systems in edge cases and limit states where instabilities often occur and data collection is inherently difficult.

2505.21872 2026-02-11 eess.IV cs.LG

Targeted Unlearning Using Perturbed Sign Gradient Methods With Applications On Medical Images

George R. Nahass, Zhu Wang, Homa Rashidisabet, Won Hwa Kim, Sasha Hubschman, Jeffrey C. Peterson, Chad A. Purnell, Pete Setabutr, Ann Q. Tran, Darvin Yi, Sathya N. Ravi

Comments 39 pages, 12 figures, 11 tables, 3 algorithms

Journal ref Transactions on Machine Learning Research 2025, https://openreview.net/forum?id=XE0bJg6sQN

详情
英文摘要

Machine unlearning aims to remove the influence of specific training samples from a trained model without full retraining. While prior work has largely focused on privacy-motivated settings, we recast unlearning as a general-purpose tool for post-deployment model revision. Specifically, we focus on utilizing unlearning in clinical contexts where data shifts, device deprecation, and policy changes are common. To this end, we propose a bilevel optimization formulation of boundary-based unlearning that can be solved using iterative algorithms. We provide convergence guarantees when first-order algorithms are used to unlearn. Our method introduces tunable loss design for controlling the forgetting-retention tradeoff and supports novel model composition strategies that merge the strengths of distinct unlearning runs. Across benchmark and real-world clinical imaging datasets, our approach outperforms baselines on both forgetting and retention metrics, including scenarios involving imaging devices and anatomical outliers. This work establishes machine unlearning as a modular, practical alternative to retraining for real-world model maintenance in clinical applications.

2504.03560 2026-02-11 math.OC cs.LG math.ST stat.ML stat.TH

Stochastic Optimization with Optimal Importance Sampling

Liviu Aolaritei, Bart P. G. Van Parys, Henry Lam, Michael I. Jordan

详情
英文摘要

Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its effectiveness, the performance of IS is highly sensitive to the choice of the proposal distribution and often requires stochastic calibration. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a lesser-known fundamental challenge: the decision variable and the importance sampling distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both convergence analysis and variance control. In this paper, we consider the generic setting of convex stochastic optimization with linear constraints. We propose a single-loop stochastic approximation algorithm, based on a variant of Nesterov's dual averaging, that jointly updates the decision variable and the importance sampling distribution, notably without time-scale separation or nested optimization. The method is globally convergent and achieves the minimal asymptotic variance among stochastic gradient schemes, which moreover matches the performance of an oracle sampler adapted to the optimal solution and thus effectively resolves the circular optimization challenge.