arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1284
专题追踪
2503.12087 2026-02-09 cs.CV

Temporally Consistent Mitral Annulus Measurements from Sparse Annotations in Echocardiographic Videos

Gino E. Jansen, Mark J. Schuuring, Berto J. Bouma, Ivana Išgum

Journal ref Proc. SPIE 13406 (2025) 192-200

详情
英文摘要

This work presents a novel approach to achieving temporally consistent mitral annulus landmark localization in echocardiography videos using sparse annotations. Our method introduces a self-supervised loss term that enforces temporal consistency between neighboring frames, which smooths the position of landmarks and enhances measurement accuracy over time. Additionally, we incorporate realistic field-of-view augmentations to improve the recognition of missing anatomical landmarks. We evaluate our approach on both a public and private dataset, and demonstrate significant improvements in Mitral Annular Plane Systolic Excursion (MAPSE) calculations and overall landmark tracking stability. The method achieves a mean absolute MAPSE error of 1.81 $\pm$ 0.14 mm, an annulus size error of 2.46 $\pm$ 0.31 mm, and a landmark localization error of 2.48 $\pm$ 0.07 mm. Finally, it achieves a 0.99 ROC-AUC for recognition of missing landmarks.

2503.06558 2026-02-09 cs.LG stat.ML

Generative modelling with jump-diffusions

Adrian Baule

Comments New version contains: (i) A generalized score function in closed analytical form leading to the jump-Laplace (JL) model; (ii) Additional numerical experiments comparing JL ODE/SDE, Gaussian ODE, and Levy-Ito-Model SDE

详情
英文摘要

Score-based diffusion models generate samples from an unknown target distribution using a time-reversed diffusion process. While such models represent state-of-the-art approaches in industrial applications such as artificial image generation, it has recently been noted that their performance can be further improved by considering injection noise with heavy tailed characteristics. Here, I present a generalization of generative diffusion processes to a wide class of non-Gaussian noise processes. I consider forward processes driven by standard Gaussian noise with super-imposed Poisson jumps representing a finite activity Levy process. The generative process is shown to be governed by a generalized score function that depends on the jump amplitude distribution and can be estimated by minimizing a simple MSE loss as in conventional Gaussian models. Both probability flow ODE and SDE formulations are derived using basic technical effort. A detailed implementation for a pure jump process with Laplace distributed amplitudes yields a generalized score function in closed analytical form and is shown to outperform the equivalent Gaussian model in specific parameter regimes.

2502.13064 2026-02-09 cs.SD

A Dual-Stage Time-Context Network for Speech-Based Alzheimer's Disease Detection

Yifan Gao, Long Guo, Hong Liu

详情
英文摘要

Alzheimer's disease (AD) is a progressive neurodegenerative disorder that leads to irreversible cognitive decline in memory and communication. Early detection of AD through speech analysis is crucial for delaying disease progression. However, existing methods mainly use pre-trained acoustic models for feature extraction but have limited ability to model both local and global patterns in long-duration speech. In this letter, we introduce a Dual-Stage Time-Context Network (DSTC-Net) for speech-based AD detection, integrating local acoustic features with global conversational context in long-duration recordings.We first partition each long-duration recording into fixed-length segments to reduce computational overhead and preserve local temporal details.Next, we feed these segments into an Intra-Segment Temporal Attention (ISTA) module, where a bidirectional Long Short-Term Memory (BiLSTM) network with frame-level attention extracts enhanced local features.Subsequently, a Cross-Segment Context Attention (CSCA) module applies convolution-based context modeling and adaptive attention to unify global patterns across all segments.Extensive experiments on the ADReSSo dataset show that our DSTC-Net outperforms state-of-the-art models, reaching 83.10% accuracy and 83.15% F1.

2502.10273 2026-02-09 cs.CV cs.AI

Probing Perceptual Constancy in Large Vision-Language Models

Haoran Sun, Bingyang Wang, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Maijunxian Wang, Dezhi Luo, Hokin Deng

Comments Under Review

详情
英文摘要

Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we explored such ability in current Vision Language Models (VLMs). In this study, we evaluated 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions. We found significant variability in VLM performance across these domains, with model performance in shape constancy clearly dissociated from that of color and size constancy.

2502.01989 2026-02-09 cs.LG

VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model

Tao Zhang, Jia-Shu Pan, Ruiqi Feng, Tailin Wu

Comments ICLR 2026. 30 pages, 13 figures

详情
英文摘要

Inspired by human SYSTEM 2 thinking, LLMs excel at complex reasoning tasks via extended Chain-of-Thought. However, similar test-time scaling for diffusion models to tackle complex reasoning remains largely unexplored. From existing work, two primary challenges emerge in this setting: (i) the dependence on an external verifier indicating a notable gap from intrinsic reasoning of human intelligence without any external feedback, and (ii) the lack of an efficient search algorithm. In this paper, we introduce the Verifier-free Test-time Scalable Diffusion Model (VFScale) to achieve scalable intrinsic reasoning, which equips number-of-sample test-time scaling with the intrinsic energy function of diffusion models as the verifier. Concretely, VFScale comprises two key innovations to address the aforementioned challenges. On the training side, VFScale consists of a novel MRNCL loss and a KL regularization to improve the energy landscape, ensuring that the learned energy function itself serves as a reliable verifier. On the inference side, VFScale integrates the denoising process with a novel hybrid Monte Carlo Tree Search (hMCTS) to improve search efficiency. On challenging reasoning tasks of Maze and Sudoku, we demonstrate the effectiveness of VFScale's training objective and scalable inference method. In particular, trained with Maze sizes of up to $6\times6$, our VFScale solves 88% of Maze problems with much larger sizes of $15\times15$, while standard diffusion models completely fail. The code can be found at https://github.com/AI4Science-WestlakeU/VFScale.

2501.16093 2026-02-09 cs.CL cs.AI

STAR: Stepwise Task Augmentation with Relation Learning for Aspect Sentiment Quad Prediction

Wenna Lai, Haoran Xie, Guandong Xu, Qing Li

Comments 17 pages, 6 figures, and 7 tables

详情
英文摘要

Aspect-based sentiment analysis (ABSA) aims to identify four sentiment elements, including aspect term, aspect category, opinion term, and sentiment polarity. These elements construct a complete picture of sentiments. The most challenging task, aspect sentiment quad prediction (ASQP), requires predicting all four elements simultaneously and is hindered by the difficulty of accurately modeling dependencies among sentiment elements. A key challenge lies in the scarcity of annotated data, which limits the model ability to understand and reason about the relational dependencies required for effective quad prediction. To address this challenge, we propose a stepwise task augmentation framework with relation learning that decomposes ASQP into a sequence of auxiliary subtasks with increasing relational granularity. Specifically, STAR incrementally constructs auxiliary data by augmenting the training data with pairwise and overall relation tasks, enabling the model to capture and compose sentiment dependencies in a stepwise manner. This stepwise formulation provides effective relational learning signals that enhance quad prediction performance, particularly in low-resource scenarios. Extensive experiments across four benchmark datasets demonstrate that STAR consistently outperforms existing methods, achieving average F1 improvements of over $2\%$ under low-resource conditions.

2501.07681 2026-02-09 cs.LG cs.CV math.OC stat.ML

Dataset Distillation as Pushforward Optimal Quantization

Hong Ye Tan, Emma Slade

Comments ICLR 2026, https://openreview.net/forum?id=FMSp8AUF3m

详情
英文摘要

Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude less computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter method has the major advantages of speed and scalability in terms of size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization, based on clustering in a latent space. Compared to the previous SOTA method D\textsuperscript{4}M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset with trivial additional computation, and SOTA performance in higher image-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain SOTA distillation performance on ImageNet-1K and its subsets, outperforming diffusion guidance methods.

2412.13486 2026-02-09 cs.CV cs.CL cs.GR

T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation

Zhenhong Sun, Yifu Wang, Yonhon Ng, Yongzhi Xu, Daoyi Dong, Hongdong Li, Pan Ji

Comments https://openreview.net/forum?id=lyn2BgKQ8F

详情
英文摘要

2D concept art generation for 3D scenes is a crucial yet challenging task in computer graphics, as creating natural intuitive environments still demands extensive manual effort in concept design. While generative AI has simplified 2D concept design via text-to-image synthesis, it struggles with complex multi-instance scenes and offers limited support for structured terrain layout. In this paper, we propose a Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation after reviewing the entire cross-attention mechanism. This scheme revitalizes the ControlNet model for detailed multi-instance generation via three key modules: Prompt Balance ensures keyword representation and minimizes the risk of missing critical instances; Characteristic Priority emphasizes sketch-based features by highlighting TopK indices in feature channels; and Dense Tuning refines contour details within instance-related regions of the attention map. Leveraging the controllability of T3-S2S, we also introduce a feature-sharing strategy with dual prompt sets to generate layer-aware isometric and terrain-view representations for the terrain layout. Experiments show that our sketch-to-scene workflow consistently produces multi-instance 2D scenes with details aligned with input prompts.

2411.18212 2026-02-09 cs.LG cs.AI cs.RO cs.SY eess.SY

SCoTT: Strategic Chain-of-Thought Tasking for Wireless-Aware Robot Navigation in Digital Twins

Aladin Djuhera, Amin Seffo, Vlad C. Andrei, Holger Boche, Walid Saad

Journal ref Asilomar Conference on Signals, Systems, and Computers, 2025

详情
英文摘要

Path planning under wireless performance constraints is a complex challenge in robot navigation. However, naively incorporating such constraints into classical planning algorithms often incurs prohibitive search costs. In this paper, we propose SCoTT, a wireless-aware path planning framework that leverages vision-language models (VLMs) to co-optimize average path gains and trajectory length using wireless heatmap images and ray-tracing data from a digital twin (DT). At the core of our framework is Strategic Chain-of-Thought Tasking (SCoTT), a novel prompting paradigm that decomposes the exhaustive search problem into structured subtasks, each solved via chain-of-thought prompting. To establish strong baselines, we compare classical A* and wireless-aware extensions of it, and derive DP-WA*, an optimal, iterative dynamic programming algorithm that incorporates all path gains and distance metrics from the DT, but at significant computational cost. In extensive experiments, we show that SCoTT achieves path gains within 2% of DP-WA* while consistently generating shorter trajectories. Moreover, SCoTT's intermediate outputs can be used to accelerate DP-WA* by reducing its search space, saving up to 62% in execution time. We validate our framework using four VLMs, demonstrating effectiveness across both large and small models, thus making it applicable to a wide range of compact models at low inference cost. We also show the practical viability of our approach by deploying SCoTT as a ROS node within Gazebo simulations. Finally, we discuss data acquisition pipelines, compute requirements, and deployment considerations for VLMs in 6G-enabled DTs, underscoring the potential of natural language interfaces for wireless-aware navigation in real-world applications.

2411.08010 2026-02-09 cs.CL cs.AI

ExpressivityBench: Can LLMs Communicate Implicitly?

Joshua Tint, Som Sagar, Aditya Taparia, Kelly Raines, Bimsara Pathiraja, Caleb Liu, Ransalu Senanayake

Comments 21 pages, 7 figures

详情
英文摘要

Human communication is often implicit, conveying tone, identity, and intent beyond literal meanings. While large language models have achieved strong performance on explicit tasks such as summarization and reasoning, their capacity for expressivity, or implicit communication, remains underexplored. We introduce \textbf{ExpressivityBench}, a framework for evaluating the expressivity of LLMs using information-theoretic communication models. Our approach quantifies how well LLM-generated text communicates target properties without explicit mention, across nine tasks spanning emotion, identity, and tone. To enable scalable and reproducible evaluation, we employ LLM-based graders validated against human judgments. Our results reveal that while models are adept at expressing affective content, they struggle with sociolinguistic signals, lagging behind human baselines. This study provides a necessary step to evaluate human-like implicit communication, with implications for applications such as education, mental health support, and socially-aware dialogue systems. We provide code and data for our benchmark alongside our paper.

2411.04285 2026-02-09 cs.LG cs.AI

Robust Real-Time Mortality Prediction in the Intensive Care Unit using Temporal Difference Learning

Thomas Frost, Kezhi Li, Steve Harris

Comments To be published in the Proceedings of the 4th Machine Learning for Health symposium, Proceedings of Machine Learning Research (PMLR)

Journal ref Proc. Mach. Learn. Res. 259:350-363 (2025)

详情
英文摘要

The task of predicting long-term patient outcomes using supervised machine learning is a challenging one, in part because of the high variance of each patient's trajectory, which can result in the model over-fitting to the training data. Temporal difference (TD) learning, a common reinforcement learning technique, may reduce variance by generalising learning to the pattern of state transitions rather than terminal outcomes. However, in healthcare this method requires several strong assumptions about patient states, and there appears to be limited literature evaluating the performance of TD learning against traditional supervised learning methods for long-term health outcome prediction tasks. In this study, we define a framework for applying TD learning to real-time irregularly sampled time series data using a Semi-Markov Reward Process. We evaluate the model framework in predicting intensive care mortality and show that TD learning under this framework can result in improved model robustness compared to standard supervised learning methods. and that this robustness is maintained even when validated on external datasets. This approach may offer a more reliable method when learning to predict patient outcomes using high-variance irregular time series data.

2410.09771 2026-02-09 cs.CV cs.AI cs.LG

EUGens: Efficient, Unified, and General Dense Layers

Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski

Comments Neurips 2025

详情
英文摘要

Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, \textbf{E}fficient, \textbf{U}nified and \textbf{Gen}eral dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to \textbf{the first} unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EuGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to \textbf{27}\%) and memory efficiency (up to \textbf{30}\%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.

2410.07790 2026-02-09 cs.CV

Enhancing Hyperspectral Image Prediction with Contrastive Learning in Low-Label Regime

Salma Haidar, José Oramas

Journal ref Applied Intelligence, vol. 56, article 74 (2026)

详情
英文摘要

Self-supervised contrastive learning is an effective approach for addressing the challenge of limited labelled data. This study builds upon the previously established two-stage patch-level, multi-label classification method for hyperspectral remote sensing imagery. We evaluate the method's performance for both the single-label and multi-label classification tasks, particularly under scenarios of limited training data. The methodology unfolds in two stages. Initially, we focus on training an encoder and a projection network using a contrastive learning approach. This step is crucial for enhancing the ability of the encoder to discern patterns within the unlabelled data. Next, we employ the pre-trained encoder to guide the training of two distinct predictors: one for multi-label and another for single-label classification. Empirical results on four public datasets show that the predictors trained with our method perform better than those trained under fully supervised techniques. Notably, the performance is maintained even when the amount of training data is reduced by $50\%$. This advantage is consistent across both tasks. The method's effectiveness comes from its streamlined architecture. This design allows for retraining the encoder along with the predictor. As a result, the encoder becomes more adaptable to the features identified by the classifier, improving the overall classification performance. Qualitative analysis reveals the contrastive-learning-based encoder's capability to provide representations that allow separation among classes and identify location-based features despite not being explicitly trained for that. This observation indicates the method's potential in uncovering implicit spatial information within the data.

2410.04010 2026-02-09 cs.LG cs.AI cs.CL cs.NE

Hyperbolic Fine-Tuning for Large Language Models

Menglin Yang, Ram Samarth B B, Aosong Feng, Bo Xiong, Jihong Liu, Irwin King, Rex Ying

Comments NeurIPS 2025; https://github.com/marlin-codes/HypLoRA

详情
英文摘要

Large language models (LLMs) have demonstrated remarkable performance across various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for LLMs. In this study, we investigate the geometric characteristics of LLMs, focusing specifically on tokens and their embeddings. Our findings reveal that token frequency follows a power-law distribution, where high-frequency tokens (e.g., the, that ) constitute the minority, while low-frequency tokens (e.g., apple, dog) constitute the majority. Furthermore, high-frequency tokens cluster near the origin, whereas low-frequency tokens are positioned farther away in the embedding space. Additionally, token embeddings exhibit hyperbolic characteristics, indicating a latent tree-like structure within the embedding space. Motivated by these observations, we propose HypLoRA, an efficient fine-tuning approach that operates in hyperbolic space to exploit these underlying hierarchical structures better. HypLoRA performs low-rank adaptation directly in hyperbolic space, thereby preserving hyperbolic modeling capabilities throughout the fine-tuning process. Extensive experiments across various base models and reasoning benchmarks, specifically arithmetic and commonsense reasoning tasks, demonstrate that HypLoRA substantially improves LLM performance.

2408.04567 2026-02-09 cs.CV cs.GR

Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Zhenhong Sun, Yang Li, Pan Ji, Hongdong Li

Comments Project Page: https://xrvisionlabs.github.io/Sketch2Scene/ Code: https://github.com/Tencent/Triplet_Tuning

详情
英文摘要

3D Content Generation is at the heart of many computer graphics applications, including video gaming, film-making, virtual and augmented reality, etc. This paper proposes a novel deep-learning based approach for automatically generating interactive and playable 3D game scenes, all from the user's casual prompts such as a hand-drawn sketch. Sketch-based input offers a natural, and convenient way to convey the user's design intention in the content creation process. To circumvent the data-deficient challenge in learning (i.e. the lack of large training data of 3D scenes), our method leverages a pre-trained 2D denoising diffusion model to generate a 2D image of the scene as the conceptual guidance. In this process, we adopt the isometric projection mode to factor out unknown camera poses while obtaining the scene layout. From the generated isometric image, we use a pre-trained image understanding method to segment the image into meaningful parts, such as off-ground objects, trees, and buildings, and extract the 2D scene layout. These segments and layouts are subsequently fed into a procedural content generation (PCG) engine, such as a 3D video game engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can be seamlessly integrated into a game development environment and is readily playable. Extensive tests demonstrate that our method can efficiently generate high-quality and interactive 3D game scenes with layouts that closely follow the user's intention.

2407.19992 2026-02-09 cs.CV

High-Precision Edge Detection via Task-Adaptive Texture Handling and Ideal-Prior Guidance

Hao Shu

Comments 30 pages

详情
英文摘要

Image edge detection (ED) requires specialized architectures, reliable supervision, and rigorous evaluation criteria to ensure accurate localization. In this work, we present a framework for high-precision ED that jointly addresses architectural design, data supervision, and evaluation consistency. We propose SDPED, a compact ED model built upon Cascaded Skipping Density Blocks (CSDB), motivated by a task-adaptive architectural transfer from image super-resolution. By re-engineering texture-oriented structures for ED, SDPED effectively differentiates textures from edges while preserving fine spatial precision. Extensive experiments on four benchmark datasets (BRIND, UDED, MDBD, and BIPED2) demonstrate consistent performance improvements, particularly in Average Precision (AP), with gains of up to 22.5% on MDBD and 11.8% on BIPED2. In addition, we introduce an ideal-prior guidance strategy that incorporates noiseless data into training by treating labels as noise-free samples, providing a practical means to mitigate the subjectivity and noise inherent in human annotations. To enable fair and resolution-independent evaluation, we further adopt a fixed-pixel criterion for assessing localization accuracy. Overall, this work offers a coherent solution for high-precision ED and provides insights applicable to precision-oriented modeling in low-level and soft-computing-based vision tasks. Codes can be found on https://github.com/Hao-B-Shu/SDPED.

2407.17835 2026-02-09 cs.LG math.CT math.DG math.MG

IsUMap: Manifold Learning and Data Visualization leveraging Vietoris-Rips filtrations

Lukas Silvester Barth, Fatemeh, Fahimi, Parvaneh Joharinad, Jürgen Jost, Janis Keck

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence, 39(17), 17699-17706 (2025)

详情
英文摘要

This work introduces IsUMap, a novel manifold learning technique that enhances data representation by integrating aspects of UMAP and Isomap with Vietoris-Rips filtrations. We present a systematic and detailed construction of a metric representation for locally distorted metric spaces that captures complex data structures more accurately than the previous schemes. Our approach addresses limitations in existing methods by accommodating non-uniform data distributions and intricate local geometries. We validate its performance through extensive experiments on examples of various geometric objects and benchmark real-world datasets, demonstrating significant improvements in representation quality.

2407.14069 2026-02-09 cs.CV

Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective

Zeen Song, Wenwen Qiang, Changwen Zheng, Hui Xiong, Gang Hua

详情
英文摘要

Video contrastive learning (V-CL) has emerged as a popular framework for unsupervised video representation learning, demonstrating strong results in tasks such as action classification and detection. Yet, to harness these benefits, it is critical for the learned representations to fully capture both static and dynamic semantics. However, our experiments show that existing V-CL methods fail to effectively learn either type of feature. Through a rigorous theoretical analysis based on the Structural Causal Model and gradient update, we find that in a given dataset, certain static semantics consistently co-occur with specific dynamic semantics. This phenomenon creates spurious correlations between static and dynamic semantics in the dataset. However, existing V-CL methods do not differentiate static and dynamic similarities when computing sample similarity. As a result, learning only one type of semantics is sufficient for the model to minimize the contrastive loss. Ultimately, this causes the V-CL pre-training process to prioritize learning the easier-to-learn semantics. To address this limitation, we propose Bi-level Optimization with Decoupling for Video Contrastive Learning. (BOD-VCL). In BOD-VCL, we model videos as linear dynamical systems based on Koopman theory. In this system, all frame-to-frame transitions are represented by a linear Koopman operator. By performing eigen-decomposition on this operator, we can separate time-variant and time-invariant components of semantics, which allows us to explicitly separate the static and dynamic semantics in the video. By modeling static and dynamic similarity separately, both types of semantics can be fully exploited during the V-CL training process. BOD-VCL can be seamlessly integrated into existing V-CL frameworks, and experimental results highlight the significant improvements achieved by our method.

2407.06605 2026-02-09 cs.RO

Robust Meta-Learning of Vehicle Yaw Rate Dynamics via Conditional Neural Processes

Lars Ullrich, Andreas Völz, Knut Graichen

Comments Published in 2023 62nd IEEE IEEE Conference on Decision and Control (CDC), Singapore, Singapore, December 13 - 15, 2023

Journal ref 2023 62nd IEEE IEEE Conference on Decision and Control (CDC), Singapore, Singapore, December 13 - 15, 2023, pp. 322--327

详情
英文摘要

Trajectory planners of autonomous vehicles usually rely on physical models to predict the vehicle behavior. However, despite their suitability, physical models have some shortcomings. On the one hand, simple models suffer from larger model errors and more restrictive assumptions. On the other hand, complex models are computationally more demanding and depend on environmental and operational parameters. In each case, the drawbacks can be associated to a certain degree to the physical modeling of the yaw rate dynamics. Therefore, this paper investigates the yaw rate prediction based on conditional neural processes (CNP), a data-driven meta-learning approach, to simultaneously achieve low errors, adequate complexity and robustness to varying parameters. Thus, physical models can be enhanced in a targeted manner to provide accurate and computationally efficient predictions to enable safe planning in autonomous vehicles. High fidelity simulations for a variety of driving scenarios and different types of cars show that CNP makes it possible to employ and transfer knowledge about the yaw rate based on current driving dynamics in a human-like manner, yielding robustness against changing environmental and operational conditions.

2405.16895 2026-02-09 cs.CV

Anonymization Prompt Learning for Facial Privacy-Preserving Text-to-Image Generation

Liang Shi, Jie Zhang, Shiguang Shan

Comments Accepted by IJCV

详情
英文摘要

Text-to-image diffusion models, such as Stable Diffusion, generate highly realistic images from text descriptions. However, the generation of certain content at such high quality raises concerns. A prominent issue is the accurate depiction of identifiable facial images, which could lead to malicious deepfake generation and privacy violations. In this paper, we propose Anonymization Prompt Learning (APL) to address this problem. Specifically, we train a learnable prompt prefix for text-to-image diffusion models, which forces the model to generate anonymized facial identities, even when prompted to produce images of specific individuals. Extensive quantitative and qualitative experiments demonstrate the successful anonymization performance of APL, which anonymizes any specific individuals without compromising the quality of non-identity-specific image generation. Furthermore, we reveal the plug-and-play property of the learned prompt prefix, enabling its effective application across different pretrained text-to-image models for transferrable privacy and security protection against the risks of deepfakes.

2404.09657 2026-02-09 cs.RO cs.LG

Sampling for Model Predictive Trajectory Planning in Autonomous Driving using Normalizing Flows

Georg Rabenstein, Lars Ullrich, Knut Graichen

Comments Accepted to be published as part of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Shinhwa World, Jeju Island, Korea, June 2-5, 2024

详情
英文摘要

Alongside optimization-based planners, sampling-based approaches are often used in trajectory planning for autonomous driving due to their simplicity. Model predictive path integral control is a framework that builds upon optimization principles while incorporating stochastic sampling of input trajectories. This paper investigates several sampling approaches for trajectory generation. In this context, normalizing flows originating from the field of variational inference are considered for the generation of sampling distributions, as they model transformations of simple to more complex distributions. Accordingly, learning-based normalizing flow models are trained for a more efficient exploration of the input domain for the task at hand. The developed algorithm and the proposed sampling distributions are evaluated in two simulation scenarios.

2402.09004 2026-02-09 cs.CV cs.LG

STAG: Structural Test-time Alignment of Gradients for Online Adaptation

Juhyeon Shin, Yujin Oh, Jonghyun Lee, Saehyung Lee, Minjun Park, Dongjun Lee, Uiwon Hwang, Sungroh Yoon

详情
英文摘要

Test-Time Adaptation (TTA) adapts pre-trained models using only unlabeled test streams, requiring real-time inference and update without access to source data. We propose StructuralTest-time Alignment of Gradients (STAG), a lightweight plug-in enhancer that exploits an always-available structural signal: the classifier's intrinsic geometry. STAG derives class-wise structural anchors from classifier weights via self-structural entropy, and during adaptation analytically computes the predicted-class entropy gradient from forward-pass quantities, aligning it to the corresponding anchor with a cosine-similarity loss. This closed-form design incurs near-zero memory and latency overhead and requires no additional backpropagation beyond the underlying baseline. Across corrupted image classification and continual semantic segmentation, STAG provides broadly applicable performance gains for strong TTA baselines on both CNN and Transformer architectures regardless of the underlying normalization scheme, with particularly large gains under challenging online regimes such as imbalanced label shifts, single-sample adaptation, mixed corruption streams and long-horizon continual TTA.

2106.10823 2026-02-09 cs.CV

3D Object Detection for Autonomous Driving: A Survey

Rui Qian, Xin Lai, Xirong Li

Comments The manuscript is accepted by Pattern Recognition on 14 May 2022

详情
英文摘要

Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of perception stack especially for the sake of path planning, motion prediction, and collision avoidance etc. Taking a quick glance at the progress we have made, we attribute challenges to visual appearance recovery in the absence of depth information from images, representation learning from partially occluded unstructured point clouds, and semantic alignments over heterogeneous features from cross modalities. Despite existing efforts, 3D object detection for autonomous driving is still in its infancy. Recently, a large body of literature have been investigated to address this 3D vision task. Nevertheless, few investigations have looked into collecting and structuring this growing knowledge. We therefore aim to fill this gap in a comprehensive survey, encompassing all the main concerns including sensors, datasets, performance metrics and the recent state-of-the-art detection methods, together with their pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case study on fifteen selected representative methods is presented, involved with runtime analysis, error analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis of the surveyed works and identify promising directions for future work.

2104.10330 2026-02-09 cs.CV

BADet: Boundary-Aware 3D Object Detection from Point Clouds

Rui Qian, Xin Lai, Xirong Li

Comments The manuscript is accepted by Pattern Recognition on 6 Jan, 2022

详情
英文摘要

Currently, existing state-of-the-art 3D object detectors are in two-stage paradigm. These methods typically comprise two steps: 1) Utilize a region proposal network to propose a handful of high-quality proposals in a bottom-up fashion. 2) Resize and pool the semantic features from the proposed regions to summarize RoI-wise representations for further refinement. Note that these RoI-wise representations in step 2) are considered individually as uncorrelated entries when fed to following detection headers. Nevertheless, we observe these proposals generated by step 1) offset from ground truth somehow, emerging in local neighborhood densely with an underlying probability. Challenges arise in the case where a proposal largely forsakes its boundary information due to coordinate offset while existing networks lack corresponding information compensation mechanism. In this paper, we propose $BADet$ for 3D object detection from point clouds. Specifically, instead of refining each proposal independently as previous works do, we represent each proposal as a node for graph construction within a given cut-off threshold, associating proposals in the form of local neighborhood graph, with boundary correlations of an object being explicitly exploited. Besides, we devise a lightweight Region Feature Aggregation Module to fully exploit voxel-wise, pixel-wise, and point-wise features with expanding receptive fields for more informative RoI-wise representations. We validate BADet both on widely used KITTI Dataset and highly challenging nuScenes Dataset. As of Apr. 17th, 2021, our BADet achieves on par performance on KITTI 3D detection leaderboard and ranks $1^{st}$ on $Moderate$ difficulty of $Car$ category on KITTI BEV detection leaderboard. The source code is available at https://github.com/rui-qian/BADet.

2602.06294 2026-02-09 cs.RO

Robots That Generate Planarity Through Geometry

Jakub F. Kowalewski, Abdulaziz O. Alrashed, Jacob Alpert, Rishi Ponnapalli, Lucas R. Meza, Jeffrey Ian Lipton

详情
英文摘要

Constraining motion to a flat surface is a fundamental requirement for equipment across science and engineering. Modern precision robotic motion systems, such as gantries, rely on the flatness of components, including guide rails and granite surface plates. However, translating this static flatness into motion requires precise internal alignment and tight-tolerance components that create long, error-sensitive reference chains. Here, we show that by using the geometric inversion of a sphere into a plane, we can produce robotic motion systems that derive planarity entirely from link lengths and connectivity. This allows planar motion to emerge from self-referencing geometric constraints, and without external metrology. We demonstrate these Flat-Plane Mechanisms (FPMs) from micron to meter scales and show that fabrication errors can be attenuated by an order of magnitude in the resulting flatness. Finally, we present a robotic FPM-based 3-axis positioning system that can be used for metrology surface scans ($\pm 12$-mm) and 3D printing inside narrow containers. This work establishes an alternative geometric foundation for planar motion that can be realized across size scales and opens new possibilities in metrology, fabrication, and micro-positioning.

2602.06291 2026-02-09 cs.CL

Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu

Comments Preprint

详情
英文摘要

Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it should yield better downstream performance than incorrect solutions. Building on this idea, we propose \textbf{Consequence-Based Utility}, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM-Judges, it also exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances where the underlying solver often fails to solve.

2602.06287 2026-02-09 cs.LG cs.AI physics.ao-ph

Toward generative machine learning for boosting ensembles of climate simulations

Parsa Gooya, Reinel Sospedra-Alfonso, Johannes Exenberger

Comments SI_Toward_generative_machine_learning_for_boosting_the_ensembles_size_of_climate_simulation.pdf contains Supplementary Information

详情
英文摘要

Accurately quantifying uncertainty in predictions and projections arising from irreducible internal climate variability is critical for informed decision making. Such uncertainty is typically assessed using ensembles produced with physics based climate models. However, computational constraints impose a trade off between generating the large ensembles required for robust uncertainty estimation and increasing model resolution to better capture fine scale dynamics. Generative machine learning offers a promising pathway to alleviate these constraints. We develop a conditional Variational Autoencoder (cVAE) trained on a limited sample of climate simulations to generate arbitrary large ensembles. The approach is applied to output from monthly CMIP6 historical and future scenario experiments produced with the Canadian Centre for Climate Modelling and Analysis' (CCCma's) Earth system model CanESM5. We show that the cVAE model learns the underlying distribution of the data and generates physically consistent samples that reproduce realistic low and high moment statistics, including extremes. Compared with more sophisticated generative architectures, cVAEs offer a mathematically transparent, interpretable, and computationally efficient framework. Their simplicity lead to some limitations, such as overly smooth outputs, spectral bias, and underdispersion, that we discuss along with strategies to mitigate them. Specifically, we show that incorporating output noise improves the representation of climate relevant multiscale variability, and we propose a simple method to achieve this. Finally, we show that cVAE-enhanced ensembles capture realistic global teleconnection patterns, even under climate conditions absent from the training data.

2602.06285 2026-02-09 cs.CV

MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training

Lucia Gordon, Serge Belongie, Christian Igel, Nico Lang

详情
英文摘要

Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at lgordon99.github.io/mmearth-bench.

2602.06273 2026-02-09 cs.RO cs.HC

A High-Fidelity Robotic Manipulator Teleoperation Framework for Human-Centered Augmented Reality Evaluation

Harsh Chhajed, Tian Guo

详情
英文摘要

Validating Augmented Reality (AR) tracking and interaction models requires precise, repeatable ground-truth motion. However, human users cannot reliably perform consistent motion due to biomechanical variability. Robotic manipulators are promising to act as human motion proxies if they can mimic human movements. In this work, we design and implement ARBot, a real-time teleoperation platform that can effectively capture natural human motion and accurately replay the movements via robotic manipulators. ARBot includes two capture models: stable wrist motion capture via a custom CV and IMU pipeline, and natural 6-DOF control via a mobile application. We design a proactively-safe QP controller to ensure smooth, jitter-free execution of the robotic manipulator, enabling it to function as a high-fidelity record and replay physical proxy. We open-source ARBot and release a benchmark dataset of 132 human and synthetic trajectories captured using ARBot to support controllable and scalable AR evaluation.

2602.06271 2026-02-09 cs.SD eess.AS

Misophonia Trigger Sound Detection on Synthetic Soundscapes Using a Hybrid Model with a Frozen Pre-Trained CNN and a Time-Series Module

Kurumi Sashida, Gouhei Tanaka

Comments 13 pages, 3 figures. Submitted to IJCNN 2026

详情
英文摘要

Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. Then, we perform trigger sound detection tasks using hybrid CNN-based models. The models combine feature extraction using a frozen pre-trained CNN backbone with a trainable time-series module such as gated recurrent units (GRUs), long short-term memories (LSTMs), echo state networks (ESNs), and their bidirectional variants. The detection performance is evaluated using common SED metrics, including Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters by optimizing only the readout. We further simulate user personalization via a few-shot "eating sound" detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.