arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2975
专题追踪
2501.07809 2026-02-03 cs.LG cs.AI math.AP

Conformal mapping based Physics-informed neural networks for designing neutral inclusions

Daehee Cho, Hyeonmin Yun, Jaeyong Lee, Mikyoung Lim

详情
英文摘要

We address the neutral inclusion problem with imperfect boundary conditions, focusing on designing interface functions for inclusions of arbitrary shapes. Traditional Physics-Informed Neural Networks (PINNs) struggle with this inverse problem, leading to the development of Conformal Mapping Coordinates Physics-Informed Neural Networks (CoCo-PINNs), which integrate geometric function theory with PINNs. CoCo-PINNs effectively solve forward-inverse problems by modeling the interface function through neural network training, which yields a neutral inclusion effect. This approach enhances the performance of PINNs in terms of credibility, consistency, and stability.

2501.07114 2026-02-03 cs.CV

Semantically Guided Dynamic Visual Prototype Refinement for Compositional Zero-Shot Learning

Zhong Peng, Yishi Xu, Gerong Wang, Wenchao Chen, Bo Chen, Jing Zhang, Hongwei Liu

Comments Accepted for publication in Neurocomputing

详情
英文摘要

Compositional Zero-Shot Learning (CZSL) seeks to recognize unseen state-object pairs by recombining primitives learned from seen compositions. Despite recent progress with vision-language models (VLMs), two limitations remain: (i) text-driven semantic prototypes are weakly discriminative in the visual feature space; and (ii) unseen pairs are optimized passively, thereby inducing seen bias. To address these limitations, we present Duplex, a framework that couples dual-prototype learning with dynamic local-graph refinement of visual prototypes. For each composition, Duplex maintains a semantic prototype via prompt learning and a visual prototype for unseen pairs constructed by recombining disentangled state and object primitives from seen images. The visual prototypes are updated dynamically through lightweight aggregation on mini-batch local graphs, which incorporates unseen compositions during training without labels. This design introduces fine-grained visual evidence while preserving semantic structure. It enriches class prototypes, better disambiguates semantically similar yet visually distinct pairs, and mitigates seen bias. Experiments on MIT-States, UT-Zappos, and CGQA in closed-world and open-world settings achieve competitive performance and consistent compositional generalization. Our source code is available at https://github.com/ISPZ/Duplex-CZSL.

2412.13526 2026-02-03 cs.LG

MOMA: Masked Orthogonal Matrix Alignment for Zero-Additional-Parameter Model Merging

Fanshuang Kong, Richong Zhang, Zhijie Nie, Hang Zhou, Ziqiao Wang, Qiang Sun, Chunming Hu

详情
英文摘要

Model merging offers a scalable alternative to multi-task learning but often yields suboptimal performance on classification tasks. We attribute this degradation to a geometric misalignment between the merged encoder and static task-specific classifier heads. Existing methods typically rely on auxiliary parameters to enforce strict representation alignment. We challenge this approach by revealing that the misalignment is predominantly an orthogonal transformation, rendering such strict alignment unnecessary. Leveraging this insight, we propose MOMA (Masked Orthogonal Matrix Alignment), which rectifies the misalignment by jointly optimizing a global multi-task vector mask and task-specific orthogonal transformations. Crucially, MOMA absorbs corresponding new parameters directly into the existing model weights, achieving performance comparable to state-of-the-art baselines with zero additional parameters and zero added inference cost.

2412.09606 2026-02-03 cs.CV

Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, Yuliang Xiu

Comments Project Page: https://fanegg.github.io/Feat2GS/

Journal ref Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

详情
英文摘要

Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? With the differences in architecture and training protocols (i.e., objectives, proxy tasks), a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing works on 3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness, and require 3D data as ground-truth, which limits the scale and diversity of their evaluation set. To address these issues, we introduce Feat2GS, which readout 3D Gaussians attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness for geometry and texture via novel view synthesis, without requiring 3D data. Additionally, the disentanglement of 3DGS parameters - geometry ($\boldsymbol{x}$, $α$, $Σ$) and texture ($\boldsymbol{c}$) - enables separate analysis of texture and geometry awareness. Under Feat2GS, we conduct extensive experiments to probe the 3D awareness of several VFMs, and investigate the ingredients that lead to a 3D aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art across diverse datasets. This makes Feat2GS useful for probing VFMs, and as a simple-yet-effective baseline for novel-view synthesis. Code and data are available at https://fanegg.github.io/Feat2GS/.

2411.17982 2026-02-03 cs.RO cs.CV

HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction

Wei Zhang, Qing Cheng, David Skuddis, Niclas Zeller, Daniel Cremers, Norbert Haala

Journal ref IEEE Transactions on Robotics, Volume 41, Pages 6478-6493, Year 2025

详情
英文摘要

We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. Existing Neural SLAM or 3DGS-based SLAM methods often trade off between rendering quality and geometry accuracy, our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance the ability for geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on Replica, ScanNet, and ScanNet++, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality. The project page and source code will be made available at https://hi-slam2.github.io/.

2411.12433 2026-02-03 cs.AI

Preference-Conditioned Gradient Variations for Multi-Objective Quality-Diversity

Hannah Janmohamed, Maxence Faldor, Thomas Pierrot, Antoine Cully

详情
英文摘要

In a variety of domains, from robotics to finance, Quality-Diversity algorithms have been used to generate collections of both diverse and high-performing solutions. Multi-Objective Quality-Diversity algorithms have emerged as a promising approach for applying these methods to complex, multi-objective problems. However, existing methods are limited by their search capabilities. For example, Multi-Objective Map-Elites depends on random genetic variations which struggle in high-dimensional search spaces. Despite efforts to enhance search efficiency with gradient-based mutation operators, existing approaches consider updating solutions to improve on each objective separately rather than achieving desired trade-offs. In this work, we address this limitation by introducing Multi-Objective Map-Elites with Preference-Conditioned Policy-Gradient and Crowding Mechanisms: a new Multi-Objective Quality-Diversity algorithm that uses preference-conditioned policy-gradient mutations to efficiently discover promising regions of the objective space and crowding mechanisms to promote a uniform distribution of solutions on the non-dominated front. We evaluate our approach on six robotics locomotion tasks and show that our method outperforms or matches all state-of-the-art Multi-Objective Quality-Diversity methods in all six, including two newly proposed tri-objective tasks. Importantly, our method also achieves a smoother set of trade-offs, as measured by newly-proposed sparsity-based metrics.

2411.10701 2026-02-03 cs.CV cs.LG eess.IV

Diffusion-based Layer-wise Semantic Reconstruction for Unsupervised Out-of-Distribution Detection

Ying Yang, De Cheng, Chaowei Fang, Yubiao Wang, Changzhe Jiao, Lechao Cheng, Nannan Wang

Comments 26 pages, 23 figures, published to Neurlps2024

Journal ref Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

详情
英文摘要

Unsupervised out-of-distribution (OOD) detection aims to identify out-of-domain data by learning only from unlabeled In-Distribution (ID) training samples, which is crucial for developing a safe real-world machine learning system. Current reconstruction-based methods provide a good alternative approach by measuring the reconstruction error between the input and its corresponding generative counterpart in the pixel/feature space. However, such generative methods face a key dilemma: improving the reconstruction power of the generative model while keeping a compact representation of the ID data. To address this issue, we propose the diffusion-based layer-wise semantic reconstruction approach for unsupervised OOD detection. The innovation of our approach is that we leverage the diffusion model's intrinsic data reconstruction ability to distinguish ID samples from OOD samples in the latent feature space. Moreover, to set up a comprehensive and discriminative feature representation, we devise a multi-layer semantic feature extraction strategy. By distorting the extracted features with Gaussian noise and applying the diffusion model for feature reconstruction, the separation of ID and OOD samples is implemented according to the reconstruction errors. Extensive experimental results on multiple benchmarks built upon various datasets demonstrate that our method achieves state-of-the-art performance in terms of detection accuracy and speed. Code is available at <https://github.com/xbyym/DLSR>.

2411.02843 2026-02-03 cs.CV

Advances in Photoacoustic Imaging Reconstruction and Quantitative Analysis for Biomedical Applications

Lei Wang, Weiming Zeng, Kai Long, Hongyu Chen, Rongfeng Lan, Li Liu, Wai Ting Siok, Nizhuan Wang

Journal ref Visual Computing for Industry, Biomedicine, and Art, 2026

详情
英文摘要

Photoacoustic imaging (PAI) represents an innovative biomedical imaging modality that harnesses the advantages of optical resolution and acoustic penetration depth while ensuring enhanced safety. Despite its promising potential across a diverse array of preclinical and clinical applications, the clinical implementation of PAI faces significant challenges, including the trade-off between penetration depth and spatial resolution, as well as the demand for faster imaging speeds. This paper explores the fundamental principles underlying PAI, with a particular emphasis on three primary implementations: photoacoustic computed tomography (PACT), photoacoustic microscopy (PAM), and photoacoustic endoscopy (PAE). We undertake a critical assessment of their respective strengths and practical limitations. Furthermore, recent developments in utilizing conventional or deep learning (DL) methodologies for image reconstruction and artefact mitigation across PACT, PAM, and PAE are outlined, demonstrating considerable potential to enhance image quality and accelerate imaging processes. Furthermore, this paper examines the recent developments in quantitative analysis within PAI, including the quantification of haemoglobin concentration, oxygen saturation, and other physiological parameters within tissues. Finally, our discussion encompasses current trends and future directions in PAI research while emphasizing the transformative impact of deep learning on advancing PAI.

2410.19919 2026-02-03 cs.LG cs.AI

Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces

Avik Kar, Rahul Singh

Comments Accepted in the 41st Conference on Uncertainty in Artificial Intelligence

Journal ref PMLR 286:1924-1964, 2025

详情
英文摘要

We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs, a broad class that subsumes several important classes such as linear and RKHS MDPs, function approximation frameworks, and develop an adaptive algorithm $\text{ZoRL}$ with regret bounded as $\mathcal{O}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}}= 2d_\mathcal{S} + d_z + 3$, $d_\mathcal{S}$ is the dimension of the state space and $d_z$ is the zooming dimension. In contrast, algorithms with fixed discretization yield $d_{\text{eff.}} = 2(d_\mathcal{S} + d_\mathcal{A}) + 2$, $d_\mathcal{A}$ being the dimension of action space. $\text{ZoRL}$ achieves this by discretizing the state-action space adaptively and zooming into ''promising regions'' of the state-action space. $d_z$, a problem-dependent quantity bounded by the state-action space's dimension, allows us to conclude that if an MDP is benign, then the regret of $\text{ZoRL}$ will be small. The zooming dimension and $\text{ZoRL}$ are truly adaptive, i.e., the current work shows how to capture adaptivity gains for infinite-horizon average-reward RL. $\text{ZoRL}$ outperforms other state-of-the-art algorithms in experiments, thereby demonstrating the gains arising due to adaptivity.

2409.13363 2026-02-03 cs.LG cs.AI

FPBoost: Fully Parametric Gradient Boosting for Survival Analysis

Alberto Archetti, Eugenio Lomurno, Diego Piccinotti, Matteo Matteucci

详情
英文摘要

Survival analysis is a statistical framework for modeling time-to-event data. It plays a pivotal role in medicine, reliability engineering, and social science research, where understanding event dynamics even with few data samples is critical. Recent advancements in machine learning, particularly those employing neural networks and decision trees, have introduced sophisticated algorithms for survival modeling. However, many of these methods rely on restrictive assumptions about the underlying event-time distribution, such as proportional hazard, time discretization, or accelerated failure time. In this study, we propose FPBoost, a survival model that combines a weighted sum of fully parametric hazard functions with gradient boosting. Distribution parameters are estimated with decision trees trained by maximizing the full survival likelihood. We show how FPBoost is a universal approximator of hazard functions, offering full event-time modeling flexibility while maintaining interpretability through the use of well-established parametric distributions. We evaluate concordance and calibration of FPBoost across multiple benchmark datasets, showcasing its robustness and versatility as a new tool for survival estimation.

2408.12815 2026-02-03 cs.CV cs.AI

Staircase Cascaded Fusion of Lightweight Local Pattern Recognition and Long-Range Dependencies for Structural Crack Segmentation

Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mianzhao Wang, Shengyong Chen, Yang Lv

Comments This paper is currently under review

详情
英文摘要

Accurately segmenting structural cracks at the pixel level remains a major hurdle, as existing methods fail to integrate local textures with pixel dependencies, often leading to fragmented and incomplete predictions. Moreover, their high parameter counts and substantial computational demands hinder practical deployment on resource-constrained edge devices. To address these challenges, we propose CrackSCF, a Lightweight Cascaded Fusion Crack Segmentation Network designed to achieve robust crack segmentation with exceptional computational efficiency. We design a lightweight convolutional block (LRDS) to replace all standard convolutions. This approach efficiently captures local patterns while operating with a minimal computational footprint. For a holistic perception of crack structures, a lightweight Long-range Dependency Extractor (LDE) captures global dependencies. These are then intelligently unified with local patterns by our Staircase Cascaded Fusion Module (SCFM), ensuring the final segmentation maps are both seamless in continuity and rich in fine-grained detail. To comprehensively evaluate our method, this paper created the challenging TUT benchmark dataset and evaluated it alongside five other public datasets. The experimental results show that the CrackSCF method consistently outperforms the existing methods, and it demonstrates greater robustness in dealing with complex background noise. On the TUT dataset, CrackSCF achieved 0.8382 on F1 score and 0.8473 on mIoU, and it only required 4.79M parameters.

2407.09103 2026-02-03 cs.AI

DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents

Thomas Constum, Pierrick Tranouez, Thierry Paquet

详情
英文摘要

Information extraction from handwritten documents involves traditionally three distinct steps: Document Layout Analysis, Handwritten Text Recognition, and Named Entity Recognition. Recent approaches have attempted to integrate these steps into a single process using fully end-to-end architectures. Despite this, these integrated approaches have not yet matched the performance of language models, when applied to information extraction in plain text. In this paper, we introduce DANIEL (Document Attention Network for Information Extraction and Labelling), a fully end-to-end architecture integrating a language model and designed for comprehensive handwritten document understanding. DANIEL performs layout recognition, handwriting recognition, and named entity recognition on full-page documents. Moreover, it can simultaneously learn across multiple languages, layouts, and tasks. For named entity recognition, the ontology to be applied can be specified via the input prompt. The architecture employs a convolutional encoder capable of processing images of any size without resizing, paired with an autoregressive decoder based on a transformer-based language model. DANIEL achieves competitive results on four datasets, including a new state-of-the-art performance on RIMES 2009 and M-POPP for Handwriting Text Recognition, and IAM NER for Named Entity Recognition. Furthermore, DANIEL is much faster than existing approaches. We provide the source code and the weights of the trained models at \url{https://github.com/Shulk97/daniel}.

2407.06706 2026-02-03 cs.RO

Exploring Unstructured Environments using Minimal Sensing on Cooperative Nano-Drones

Pedro Arias-Perez, Alvika Gautam, Miguel Fernandez-Cortizas, David Perez-Saura, Srikanth Saripalli, Pascual Campoy

Comments Submitted to IEEE Robotics and Automation Letters

详情
英文摘要

Recent advances have improved autonomous navigation and mapping under payload constraints, but current multi-robot inspection algorithms are unsuitable for nano-drones due to their need for heavy sensors and high computational resources. To address these challenges, we introduce ExploreBug, a novel hybrid frontier range bug algorithm designed to handle limited sensing capabilities for a swarm of nano-drones. This system includes three primary components: a mapping subsystem, an exploration subsystem, and a navigation subsystem. Additionally, an intra-swarm collision avoidance system is integrated to prevent collisions between drones. We validate the efficacy of our approach through extensive simulations and real-world exploration experiments involving up to seven drones in simulations and three in real-world settings, across various obstacle configurations and with a maximum navigation speed of 0.75 m/s. Our tests demonstrate that the algorithm efficiently completes exploration tasks, even with minimal sensing, across different swarm sizes and obstacle densities. Furthermore, our frontier allocation heuristic ensures an equal distribution of explored areas and paths traveled by each drone in the swarm. We publicly release the source code of the proposed system to foster further developments in mapping and exploration using autonomous nano drones.

2406.13375 2026-02-03 cs.CL

ALiiCE: Evaluating Positional Fine-grained Citation Generation

Yilong Xu, Jinhua Gao, Xiaoming Yu, Baolong Bi, Huawei Shen, Xueqi Cheng

Comments NAACL 2025 Main Conference (Long paper)

详情
英文摘要

Large Language Model (LLM) can enhance its credibility and verifiability by generating text with citations. However, existing research on citation generation is predominantly limited to sentence-level statements, neglecting the significance of positional fine-grained citations that can appear anywhere within sentences. To facilitate further exploration of the positional fine-grained citation generation, we propose ALiiCE, the first automatic evaluation framework for this task. Our method employs a dependency tree based approach to parse the sentence-level claim into atomic claims. Then ALiiCE evaluates citation quality using three metrics, including positional fine-grained citation recall, precision, and coefficient of variation of citation positions. We evaluate the positional fine-grained citation generation performance of several LLMs on long-form QA datasets. Our experiments and analyses demonstrate the effectiveness and reasonableness of ALiiCE. We offer our insights into the current advancements and future directions for the positional fine-grained citation generation task.

2406.09260 2026-02-03 cs.CV cs.AI cs.RO eess.IV

Deep Transformer Network for Monocular Pose Estimation of Shipborne Unmanned Aerial Vehicle

Maneesha Wickramasuriya, Taeyoung Lee, Murray Snyder

Comments 23 pages, 25 figures, 3 tables

详情
英文摘要

This paper introduces a deep transformer network for estimating the relative 6D pose of a Unmanned Aerial Vehicle (UAV) with respect to a ship using monocular images. A synthetic dataset of ship images is created and annotated with 2D keypoints of multiple ship parts. A Transformer Neural Network model is trained to detect these keypoints and estimate the 6D pose of each part. The estimates are integrated using Bayesian fusion. The model is tested on synthetic data and in-situ flight experiments, demonstrating robustness and accuracy in various lighting conditions. The position estimation error is approximately 0.8\% and 1.0\% of the distance to the ship for the synthetic data and the flight experiments, respectively. The method has potential applications for ship-based autonomous UAV landing and navigation.

2403.16526 2026-02-03 cs.CV

ModeTv2: GPU-accelerated Motion Decomposition Transformer for Pairwise Optimization in Medical Image Registration

Haiqiao Wang, Zhuoyuan Wang, Dong Ni, Yi Wang

详情
英文摘要

Deformable image registration plays a crucial role in medical imaging, aiding in disease diagnosis and image-guided interventions. Traditional iterative methods are slow, while deep learning (DL) accelerates solutions but faces usability and precision challenges. This study introduces a pyramid network with the enhanced motion decomposition Transformer (ModeTv2) operator, showcasing superior pairwise optimization (PO) akin to traditional methods. We re-implement ModeT operator with CUDA extensions to enhance its computational efficiency. We further propose RegHead module which refines deformation fields, improves the realism of deformation and reduces parameters. By adopting the PO, the proposed network balances accuracy, efficiency, and generalizability. Extensive experiments on three public brain MRI datasets and one abdominal CT dataset demonstrate the network's suitability for PO, providing a DL model with enhanced usability and interpretability. The code is publicly available at https://github.com/ZAX130/ModeTv2.

2311.18547 2026-02-03 cs.LG cs.AI cs.SY eess.SY

Real-Time Vibration-Based Bearing Fault Diagnosis Under Time-Varying Speed Conditions

Tuomas Jalonen, Mohammad Al-Sa'd, Serkan Kiranyaz, Moncef Gabbouj

详情
英文摘要

Detection of rolling-element bearing faults is crucial for implementing proactive maintenance strategies and for minimizing the economic and operational consequences of unexpected failures. However, many existing techniques are developed and tested under strictly controlled conditions, limiting their adaptability to the diverse and dynamic settings encountered in practical applications. This paper presents an efficient real-time convolutional neural network (CNN) for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds. Additionally, we propose a novel Fisher-based spectral separability analysis (SSA) method to elucidate the effectiveness of the designed CNN model. We conducted experiments on both healthy bearings and bearings afflicted with inner race, outer race, and roller ball faults. The experimental results show the superiority of our model over the current state-of-the-art approach in three folds: it achieves substantial accuracy gains of up to 15.8%, it is robust to noise with high performance across various signal-to-noise ratios, and it runs in real-time with processing durations five times less than acquisition. Additionally, by using the proposed SSA technique, we offer insights into the model's performance and underscore its effectiveness in tackling real-world challenges.

2309.16146 2026-02-03 cs.AI

T-COL: Generating Counterfactual Explanations for General User Preferences on Variable Machine Learning Systems

Ming Wang, Daling Wang, Wenfang Wu, Shi Feng, Yifei Zhang

Journal ref Journal of Artificial Intelligence Research, Vol. 85 (2026)

详情
英文摘要

To address the interpretability challenge in machine learning (ML) systems, counterfactual explanations (CEs) have emerged as a promising solution. CEs are unique as they provide workable suggestions to users, instead of explaining why a certain outcome was predicted. The application of CEs encounters two main challenges: general user preferences and variable ML systems. On one hand, user preferences for specific values can vary depending on the task and scenario. On the other hand, the ML systems for verification may change while the CEs are performed. Thus, user preferences tend to be general rather than specific, and CEs need to be adaptable to variable ML models while maintaining robustness even as these models change. Facing these challenges, we propose general user preferences based on insights from psychology and behavioral science, and add the challenge of non-static ML systems as one preference. Moreover, we introduce a novel method, \uline{T}ree-based \uline{C}onditions \uline{O}ptional \uline{L}inks (T-COL) for generating CEs adaptable to general user preferences. Moreover, we employ T-COL to enhance the robustness of CEs with specific conditions, making CEs robust even when the ML models are replaced. To assess subjectivity preferences, we define LLM-based autonomous agents to simulate users and align them with real users. Experiments show that T-COL outperforms all baselines in adapting to general user preferences.

2106.14568 2026-02-03 cs.LG cs.CV

Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity

Shiwei Liu, Tianlong Chen, Zahra Atashgahi, Xiaohan Chen, Ghada Sokar, Elena Mocanu, Mykola Pechenizkiy, Zhangyang Wang, Decebal Constantin Mocanu

Comments published in International Conference on Learning Representations (ICLR 2022)

Journal ref Proceedings of the International Conference on Machine Learning (ICLR 2022)

详情
英文摘要

The success of deep ensembles on improving predictive performance, uncertainty estimation, and out-of-distribution robustness has been extensively studied in the machine learning literature. Albeit the promising results, naively training multiple deep neural networks and combining their predictions at inference leads to prohibitive computational costs and memory requirements. Recently proposed efficient ensemble approaches reach the performance of the traditional deep ensembles with significantly lower costs. However, the training resources required by these approaches are still at least the same as training a single dense model. In this work, we draw a unique connection between sparse neural network training and deep ensembles, yielding a novel efficient ensemble learning framework called FreeTickets. Instead of training multiple dense networks and averaging them, we directly train sparse subnetworks from scratch and extract diverse yet accurate subnetworks during this efficient, sparse-to-sparse training. Our framework, FreeTickets, is defined as the ensemble of these relatively cheap sparse subnetworks. Despite being an ensemble method, FreeTickets has even fewer parameters and training FLOPs than a single dense model. This seemingly counter-intuitive outcome is due to the ultra training/inference efficiency of dynamic sparse training. FreeTickets surpasses the dense baseline in all the following criteria: prediction accuracy, uncertainty estimation, out-of-distribution (OoD) robustness, as well as efficiency for both training and inference. Impressively, FreeTickets outperforms the naive deep ensemble with ResNet50 on ImageNet using around only 1/5 of the training FLOPs required by the latter. We have released our source code at https://github.com/VITA-Group/FreeTickets.

2010.13636 2026-02-03 cs.CV cs.LG

Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies

Yuehua Zhu, Muli Yang, Cheng Deng, Wei Liu

Comments Accepted in NeurIPS 2020 as a spotlight paper. Code can be found at https://github.com/YuehuaZhu/ProxyGML

详情
英文摘要

Deep metric learning plays a key role in various machine learning tasks. Most of the previous works have been confined to sampling from a mini-batch, which cannot precisely characterize the global geometry of the embedding space. Although researchers have developed proxy- and classification-based methods to tackle the sampling issue, those methods inevitably incur a redundant computational cost. In this paper, we propose a novel Proxy-based deep Graph Metric Learning (ProxyGML) approach from the perspective of graph classification, which uses fewer proxies yet achieves better comprehensive performance. Specifically, multiple global proxies are leveraged to collectively approximate the original data points for each class. To efficiently capture local neighbor relationships, a small number of such proxies are adaptively selected to construct similarity subgraphs between these proxies and each data point. Further, we design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels, so that a discriminative metric space can be learned during the process of subgraph classification. Extensive experiments carried out on widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate the superiority of the proposed ProxyGML over the state-of-the-art methods in terms of both effectiveness and efficiency. The source code is publicly available at https://github.com/YuehuaZhu/ProxyGML.

2602.01459 2026-02-03 cs.CV cs.AI cs.LG

Understanding vision transformer robustness through the lens of out-of-distribution detection

Joey Kuang, Alexander Wong

Comments Accepted to JCVIS 2025

详情
英文摘要

Vision transformers have shown remarkable performance in vision tasks, but enabling them for accessible and real-time use is still challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low precision issues mainly by understanding in-distribution (ID) task behaviour, but the attention mechanism may provide insight on quantization attributes by exploring out-of-distribution (OOD) situations. We investigate the behaviour of quantized small-variant popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses show the initial instabilities of 4-bit models, particularly of those trained on the larger ImageNet-22k, as the strongest FP32 model, DeiT3, sharply drop 17% from quantization error to be one of the weakest 4-bit models. While ViT shows reasonable quantization robustness for ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k respectively experienced a 15.0% and 19.2% average quantization delta in AUPR-out between full precision to 4-bit while their ImageNet-1k-only counterparts experienced a 9.5% and 12.0% delta. Overall, our results suggest pretraining on large scale datasets may hinder low-bit quantization robustness in OOD detection and that data augmentation may be a more beneficial option.

2602.01454 2026-02-03 cs.LG

Modeling Topological Impact on Node Attribute Distributions in Attributed Graphs

Amirreza Shiralinasab Langari, Leila Yeganeh, Kim Khoa Nguyen

详情
英文摘要

We investigate how the topology of attributed graphs influences the distribution of node attributes. This work offers a novel perspective by treating topology and attributes as structurally distinct but interacting components. We introduce an algebraic approach that combines a graph's topology with the probability distribution of node attributes, resulting in topology-influenced distributions. First, we develop a categorical framework to formalize how a node perceives the graph's topology. We then quantify this point of view and integrate it with the distribution of node attributes to capture topological effects. We interpret these topology-conditioned distributions as approximations of the posteriors $P(\cdot \mid v)$ and $P(\cdot \mid \mathcal{G})$. We further establish a principled sufficiency condition by showing that, on complete graphs, where topology carries no informative structure, our construction recovers the original attribute distribution. To evaluate our approach, we introduce an intentionally simple testbed model, $\textbf{ID}$, and use unsupervised graph anomaly detection as a probing task.

2602.01452 2026-02-03 cs.CV cs.AI

Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles

Penghao Deng, Jidong J. Yang, Jiachen Bian

Comments 21 pages, 15 figures, 3 tables

详情
英文摘要

Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle's front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a "part-versus-whole" semantic gap that led to large failure in recall. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.

2602.01451 2026-02-03 cs.CL

Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language

Umme Abira Azmary, MD Ikramul Kayes, Swakkhar Shatabda, Farig Yousuf Sadeque

详情
英文摘要

Question-Answering (QA) models for low-resource languages like Bangla face challenges due to limited annotated data and linguistic complexity. A key issue is determining whether models rely more on pre-encoded (parametric) knowledge or contextual input during answer generation, as existing Bangla QA datasets lack the structure required for such analysis. We introduce BanglaCQA, the first Counterfactual QA dataset in Bangla, by extending a Bangla dataset while integrating counterfactual passages and answerability annotations. In addition, we propose fine-tuned pipelines for encoder-decoder language-specific and multilingual baseline models, and prompting-based pipelines for decoder-only LLMs to disentangle parametric and contextual knowledge in both factual and counterfactual scenarios. Furthermore, we apply LLM-based and human evaluation techniques that measure answer quality based on semantic similarity. We also present a detailed analysis of how models perform across different QA settings in low-resource languages, and show that Chain-of-Thought (CoT) prompting reveals a uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Our work not only introduces a novel framework for analyzing knowledge sources in Bangla QA but also uncovers critical findings that open up broader directions for counterfactual reasoning in low-resource language settings.

2602.01448 2026-02-03 cs.RO

Towards a Novel Wearable Robotic Vest for Hemorrhage Suppression

Harshith Jella, Pejman Kheradmand, Joseph Klein, Behnam Moradkhani, Yash Chitalia

详情
英文摘要

This paper introduces a novel robotic system designed to manage severe bleeding in emergency scenarios, including unique environments like space stations. The robot features a shape-adjustable "ring mechanism", transitioning from a circular to an elliptical configuration to adjust wound coverage across various anatomical regions. We developed various arms for this ring mechanism with varying flexibilities to improve adaptability when applied to non-extremities of the body (abdomen, back, neck, etc.). To apply equal and constant pressure across the wound, we developed an inflatable ring and airbag balloon that are compatible with this shape-changing ring mechanism. A series of experiments focused on evaluating various ring arm configurations to characterize their bending stiffness. Subsequent experiments measured the force exerted by the airbag balloon system using a digital scale. Despite its promising performance, certain limitations related to coverage area are identified. The shape-changing effect of the device is limited to scenarios involving partially inflated or deflated airbag balloons, and cannot fully conform to complex anatomical regions. Finally, the device was tested on casualty simulation kits, where it successfully demonstrated its ability to control simulated bleeding.

2602.01447 2026-02-03 cs.CL cs.AI

SentiFuse: Deep Multi-model Fusion Framework for Robust Sentiment Extraction

Hieu Minh Duong, Rupa Ghosh, Cong Hoan Nguyen, Eugene Levin, Todd Gary, Long Nguyen

详情
英文摘要

Sentiment analysis models exhibit complementary strengths, yet existing approaches lack a unified framework for effective integration. We present SentiFuse, a flexible and model-agnostic framework that integrates heterogeneous sentiment models through a standardization layer and multiple fusion strategies. Our approach supports decision-level fusion, feature-level fusion, and adaptive fusion, enabling systematic combination of diverse models. We conduct experiments on three large-scale social-media datasets: Crowdflower, GoEmotions, and Sentiment140. These experiments show that SentiFuse consistently outperforms individual models and naive ensembles. Feature-level fusion achieves the strongest overall effectiveness, yielding up to 4\% absolute improvement in F1 score over the best individual model and simple averaging, while adaptive fusion enhances robustness on challenging cases such as negation, mixed emotions, and complex sentiment expressions. These results demonstrate that systematically leveraging model complementarity yields more accurate and reliable sentiment analysis across diverse datasets and text types.

2602.01443 2026-02-03 cs.AI

SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce

Alberto Castelo, Zahra Zanjani Foumani, Ailin Fan, Keat Yang Koay, Vibhor Malik, Yuanzheng Zhu, Han Li, Meysam Feghhi, Ronie Uliana, Shuang Xie, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Lingyun Wang, Zhong Wu

详情
英文摘要

A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data, identifies distinct behavioral archetypes, and simulates cohort-weighted sessions across control and treatment storefronts. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control. Even without alignment post training, SimGym agents achieve state of the art alignment with observed outcome shifts and reduces experiment cycles from weeks to under an hour , enabling rapid experimentation without exposure to real buyers.

2602.01439 2026-02-03 cs.LG cs.AI

TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse

Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn

详情
英文摘要

Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions -- including with a transformer architecture, which is known to be highly scalable -- often results in learning instability and worse performance. In this work, we ask what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.

2602.01437 2026-02-03 cs.LG stat.ML

Theoretical Analysis of Measure Consistency Regularization for Partially Observed Data

Yinsong Wang, Shahin Shahrampour

详情
英文摘要

The problem of corrupted data, missing features, or missing modalities continues to plague the modern machine learning landscape. To address this issue, a class of regularization methods that enforce consistency between imputed and fully observed data has emerged as a promising approach for improving model generalization, particularly in partially observed settings. We refer to this class of methods as Measure Consistency Regularization (MCR). Despite its empirical success in various applications, such as image inpainting, data imputation and semi-supervised learning, a fundamental understanding of the theoretical underpinnings of MCR remains limited. This paper bridges this gap by offering theoretical insights into why, when, and how MCR enhances imputation quality under partial observability, viewed through the lens of neural network distance. Our theoretical analysis identifies the term responsible for MCR's generalization advantage and extends to the imperfect training regime, demonstrating that this advantage is not always guaranteed. Guided by these insights, we propose a novel training protocol that monitors the duality gap to determine an early stopping point that preserves the generalization benefit. We then provide detailed empirical evidence to support our theoretical claims and to show the effectiveness and accuracy of our proposed stopping condition. We further provide a set of real-world data simulations to show the versatility of MCR under different model architectures designed for different data sources.

2602.01419 2026-02-03 cs.LG cs.AI

Semi-supervised CAPP Transformer Learning via Pseudo-labeling

Dennis Gross, Helge Spieker, Arnaud Gotlieb, Emmanuel Stathatos, Panorios Benardos, George-Christopher Vosniakos

详情
英文摘要

High-level Computer-Aided Process Planning (CAPP) generates manufacturing process plans from part specifications. It suffers from limited dataset availability in industry, reducing model generalization. We propose a semi-supervised learning approach to improve transformer-based CAPP transformer models without manual labeling. An oracle, trained on available transformer behaviour data, filters correct predictions from unseen parts, which are then used for one-shot retraining. Experiments on small-scale datasets with simulated ground truth across the full data distribution show consistent accuracy gains over baselines, demonstrating the method's effectiveness in data-scarce manufacturing environments.