arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2393
专题追踪
2505.00527 2026-02-17 cs.RO

DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation

Zixuan Chen, Junhui Yin, Yangtao Chen, Jing Huo, Pinzhuo Tian, Jieqi Shi, Yiwen Hou, Yinchuan Li, Yang Gao

Comments RAL 2026

详情
英文摘要

Generalizing language-conditioned multi-task imitation learning (IL) models to novel long-horizon 3D manipulation tasks is challenging. To address this, we propose DeCo (Task Decomposition and Skill Composition), a model-agnostic framework that enhances zero-shot generalization to compositional long-horizon manipulation tasks. DeCo decomposes IL demonstrations into modular atomic tasks based on gripper-object interactions, creating a dataset that enables models to learn reusable skills. At inference, DeCo uses a vision-language model (VLM) to parse high-level instructions, retrieve relevant skills, and dynamically schedule their execution. A spatially-aware skill-chaining module ensures smooth, collision-free transitions between skills. We introduce DeCoBench, a benchmark designed to evaluate compositional generalization in long-horizon manipulation tasks. DeCo improves the success rate of three IL models, RVT-2, 3DDA, and ARP, by 66.67%, 21.53%, and 57.92%, respectively, on 12 novel tasks. In real-world experiments, the DeCo-enhanced model, trained on only 6 atomic tasks, completes 9 novel tasks in zero-shot, with a 53.33% improvement over the baseline model. Project website: https://deco226.github.io.

2504.17880 2026-02-17 cs.RO

Autonomous Navigation of Quadrupeds Using Coverage Path Planning with Morphological Skeleton Map

Alexander James Becoy, Kseniia Khomenko, Luka Peternel, Raj Thilak Rajan

Comments 15 pages, published to Fronters In Robotics (currently in production), major revision: title change, abstract revised, grammar fixed, mathematical notations fixed and made consistent, conclusion revised, related works extended, Algorithm 1-3 revised

Journal ref Frontiers in Robotics and AI, Volume 31, 1601862, July 2025

详情
英文摘要

This paper proposes a novel method of coverage path planning for the purpose of scanning an unstructured environment autonomously. The method uses the morphological skeleton of the prior 2D navigation map via SLAM to generate a sequence of points of interest (POIs). This sequence is then ordered to create an optimal path given the robot's current position. To control the high-level operation, a finite state machine is used to switch between two modes: navigating towards a POI using Nav2, and scanning the local surrounding. We validate the method in a leveled indoor obstacle-free non-convex environment on time efficiency and reachability over five trials. The map reader and the path planner can quickly process maps of width and height ranging between [196,225] pixels and [185,231] pixels in 2.52 ms/pixel and 1.7 ms/pixel, respectively, where their computation time increases with 22.0 ns/pixel and 8.17 $μ$s/pixel, respectively. The robot managed to reach 86.5% of all waypoints over all five runs. The proposed method suffers from drift occurring in the 2D navigation map.

2504.06193 2026-02-17 cs.LG cs.AI

Heuristic Methods are Good Teachers to Distill MLPs for Graph Link Prediction

Zongyue Qin, Shichang Zhang, Mingxuan Ju, Tong Zhao, Neil Shah, Yizhou Sun

详情
英文摘要

Link prediction is a crucial graph-learning task with applications including citation prediction and product recommendation. Distilling Graph Neural Networks (GNNs) teachers into Multi-Layer Perceptrons (MLPs) students has emerged as an effective approach to achieve strong performance and reducing computational cost by removing graph dependency. However, existing distillation methods only use standard GNNs and overlook alternative teachers such as specialized model for link prediction (GNN4LP) and heuristic methods (e.g., common neighbors). This paper first explores the impact of different teachers in GNN-to-MLP distillation. Surprisingly, we find that stronger teachers do not always produce stronger students: MLPs distilled from GNN4LP can underperform those distilled from simpler GNNs, while weaker heuristic methods can teach MLPs to near-GNN performance with drastically reduced training costs. Building on these insights, we propose Ensemble Heuristic-Distilled MLPs (EHDM), which eliminates graph dependencies while effectively integrating complementary signals via a gating mechanism. Experiments on ten datasets show an average 7.93% improvement over previous GNN-to-MLP approaches with 1.95-3.32 times less training time, indicating EHDM is an efficient and effective link prediction method. Our code is available at https://github.com/ZongyueQin/EHDM

2503.12385 2026-02-17 cs.CV

Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset

Yutao Hu, Sen Li, Jincheng Yan, Wenqi Shao, Xiaoyan Luo

Comments accepted to The Eleventh Workshop on Fine-Grained Visual Categorization in CVPR 2024

详情
英文摘要

Fine-grained visual categorization (FGVC) is a challenging but significant task in computer vision, which aims to recognize different sub-categories of birds, cars, airplanes, etc. Among them, recognizing models of different cars has significant application value in autonomous driving, traffic surveillance and scene understanding, which has received considerable attention in the past few years. However, Stanford-Car, the most widely used fine-grained dataset for car recognition, only has 196 different categories and only includes vehicle models produced earlier than 2013. Due to the rapid advancements in the automotive industry during recent years, the appearances of various car models have become increasingly intricate and sophisticated. Consequently, the previous Stanford-Car dataset fails to capture this evolving landscape and cannot satisfy the requirements of automotive industry. To address these challenges, in our paper, we introduce Car-1000, a large-scale dataset designed specifically for fine-grained visual categorization of diverse car models. Car-1000 encompasses vehicles from 166 different automakers, spanning a wide range of 1000 distinct car models. Additionally, we have reproduced several state-of-the-art FGVC methods on the Car-1000 dataset, establishing a new benchmark for research in this field. We hope that our work will offer a fresh perspective for future FGVC researchers. Our dataset is available at https://github.com/toggle1995/Car-1000.

2503.09027 2026-02-17 cs.CV

Measure Twice, Cut Once: A Semantic-Oriented Approach to Video Temporal Localization with Video LLMs

Zongshang Pang, Mayu Otani, Yuta Nakashima

Comments ICLR2026

详情
英文摘要

Temporally localizing user-queried events through natural language is a crucial capability for video models. Recent methods predominantly adapt video LLMs to generate event boundary timestamps for temporal localization tasks, which struggle to leverage LLMs' pre-trained semantic understanding capabilities due to the uninformative nature of timestamp outputs. In this work, we explore a timestamp-free, semantic-oriented framework that fine-tunes video LLMs using two generative learning tasks and one discriminative learning task. We first introduce a structural token generation task that enables the video LLM to recognize the temporal structure of input videos based on the input query. Through this task, the video LLM generates a sequence of special tokens, called structural tokens, which partition the video into consecutive segments and categorize them as either target events or background transitions. To enhance precise recognition of event segments, we further propose a query-focused captioning task that enables the video LLM to extract fine-grained event semantics that can be effectively utilized by the structural tokens. Finally, we introduce a structural token grounding module driven by contrastive learning to associate each structural token with its corresponding video segment, achieving holistic temporal segmentation of the input video and readily yielding the target event segments for localization. Extensive experiments across diverse temporal localization tasks demonstrate that our proposed framework, MeCo, consistently outperforms methods relying on boundary timestamp generation, highlighting the potential of a semantic-driven approach for temporal localization with video LLMs \footnote{Code available at https://github.com/pangzss/MeCo.

2503.01884 2026-02-17 cs.LG cs.AI

Contextual Quantum Neural Networks for Stock Price Prediction

Sharan Mourya, Hannes Leipold, Bibhas Adhikari

Journal ref Mourya, S., Leipold, H., & Adhikari, B. (2026). Contextual quantum neural networks for stock price prediction. Scientific Reports, 16, Article 34413

详情
英文摘要

In this paper, we apply quantum machine learning (QML) to predict the stock prices of multiple assets using a contextual quantum neural network. Our approach captures recent trends to predict future stock price distributions, moving beyond traditional models that focus on entire historical data, enhancing adaptability and precision. Utilizing the principles of quantum superposition, we introduce a new training technique called the quantum batch gradient update (QBGU), which accelerates the standard stochastic gradient descent (SGD) in quantum applications and improves convergence. Consequently, we propose a quantum multi-task learning (QMTL) architecture, specifically, the share-and-specify ansatz, that integrates task-specific operators controlled by quantum labels, enabling the simultaneous and efficient training of multiple assets on the same quantum circuit as well as enabling efficient portfolio representation with logarithmic overhead in the number of qubits. This architecture represents the first of its kind in quantum finance, offering superior predictive power and computational efficiency for multi-asset stock price forecasting. Through extensive experimentation on S\&P 500 data for Apple, Google, Microsoft, and Amazon stocks, we demonstrate that our approach not only outperforms quantum single-task learning (QSTL) models but also effectively captures inter-asset correlations, leading to enhanced prediction accuracy. Our findings highlight the transformative potential of QML in financial applications, paving the way for more advanced, resource-efficient quantum algorithms in stock price prediction and other complex financial modeling tasks.

2502.17315 2026-02-17 cs.CL

HIPPO: Enhancing the Table Understanding Capability of LLMs through Hybrid-Modal Preference Optimization

Haolan Wang, Zhenghao Liu, Xinze Li, Xiaocui Yang, Yu Gu, Yukun Yan, Qi Shi, Fangfang Li, Chong Chen, Ge Yu

详情
英文摘要

Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. Recent methods employ Multi-modal Large Language Models (MLLMs) to address table-related tasks across various modalities of table representations. However, existing studies mainly focus on exploring the table understanding ability of MLLMs using unimodal representations, which limits further exploration of multi-modal representations to enable more effective table reasoning. To better capture structural semantics from the tabular data, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image, optimizing MLLMs by learning more comprehensive table information from these multiple modalities. Specifically, HIPPO samples MLLM responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during Direct Preference Optimization (DPO) training. Experiments on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, achieving a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances the table reasoning capability based on unimodal representations but also facilitates the extraction of complementary semantics across modalities. The code is available at https://github.com/NEUIR/HIPPO.

2502.14560 2026-02-17 cs.LG cs.AI cs.CL

Less is More: Improving LLM Alignment via Preference Data Selection

Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

详情
英文摘要

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach achieves 3\% to 8\% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.

2502.09980 2026-02-17 cs.CV cs.RO

V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models

Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F. Smith, Yu-Chiang Frank Wang, Min-Hung Chen

Comments Accepted by ICRA 2026 (IEEE International Conference on Robotics and Automation). Project: https://eddyhkchiu.github.io/v2vllm.github.io/ Code: https://github.com/eddyhkchiu/V2V-LLM Dataset: https://huggingface.co/datasets/eddyhkchiu/V2V-GoT-QA

详情
英文摘要

Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on perception tasks like detection or tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates a Multimodal LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Multimodal Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer various types of driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. The code and data will be released to the public to facilitate open-source research in this field. Our project website: https://eddyhkchiu.github.io/v2vllm.github.io/ .

2502.07443 2026-02-17 cs.AI cs.GT

Approximating Human Strategic Reasoning with LLM-Enhanced Recursive Reasoners Leveraging Multi-agent Hypergames

Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis

详情
英文摘要

LLM-driven multi-agent-based simulations have been gaining traction with applications in game-theoretic and social simulations. While most implementations seek to exploit or evaluate LLM-agentic reasoning, they often do so with a weak notion of agency and simplified architectures. We implement a role-based multi-agent strategic interaction framework tailored to sophisticated recursive reasoners, providing the means for systematic in-depth development and evaluation of strategic reasoning. Our game environment is governed by the umpire responsible for facilitating games, from matchmaking through move validation to environment management. Players incorporate state-of-the-art LLMs in their decision mechanism, relying on a formal hypergame-based model of hierarchical beliefs. We use one-shot, 2-player beauty contests to evaluate the recursive reasoning capabilities of the latest LLMs, providing a comparison to an established baseline model from economics and data from human experiments. Furthermore, we introduce the foundations of an alternative semantic measure of reasoning to the k-level theory. Our experiments show that artificial reasoners can outperform the baseline model in terms of both approximating human behaviour and reaching the optimal solution.

2502.02415 2026-02-17 cs.LG

Fast Graph Generation via Autoregressive Noisy Filtration Modeling

Markus Krimmel, Jenna Wiens, Karsten Borgwardt, Dexiong Chen

Journal ref Transactions on Machine Learning Research, 2026

详情
英文摘要

Existing graph generative models often face a critical trade-off between sample quality and generation speed. We introduce Autoregressive Noisy Filtration Modeling (ANFM), a flexible autoregressive framework that addresses both challenges. ANFM leverages filtration, a concept from topological data analysis, to transform graphs into short sequences of subgraphs. We identify exposure bias as a potential hurdle in autoregressive graph generation and propose noise augmentation and reinforcement learning as effective mitigation strategies, which allow ANFM to learn both edge addition and deletion operations. This unique capability enables ANFM to correct errors during generation by modeling non-monotonic graph sequences. Our results show that ANFM matches state-of-the-art diffusion models in quality while offering over 100 times faster inference, making it a promising approach for high-throughput graph generation. The source code is publicly available at https://github.com/BorgwardtLab/anfm .

2502.00194 2026-02-17 cs.LG cs.AI physics.comp-ph

Physics-Informed Neural Network based Damage Identification for Truss Railroad Bridges

Althaf Shajihan, Kirill Mechitov, Girish Chowdhary, Billie F. Spencer

Comments 30 pages, 15 figures

Journal ref Structure and Infrastructure Engineering, 1-22 (2026)

详情
英文摘要

Railroad bridges are a crucial component of the U.S. freight rail system, which moves over 40 percent of the nation's freight and plays a critical role in the economy. However, aging bridge infrastructure and increasing train traffic pose significant safety hazards and risk service disruptions. The U.S. rail network includes over 100,000 railroad bridges, averaging one every 1.4 miles of track, with steel bridges comprising over 50% of the network's total bridge length. Early identification and assessment of damage in these bridges remain challenging tasks. This study proposes a physics-informed neural network (PINN) based approach for damage identification in steel truss railroad bridges. The proposed approach employs an unsupervised learning approach, eliminating the need for large datasets typically required by supervised methods. The approach utilizes train wheel load data and bridge response during train crossing events as inputs for damage identification. The PINN model explicitly incorporates the governing differential equations of the linear time-varying (LTV) bridge-train system. Herein, this model employs a recurrent neural network (RNN) based architecture incorporating a custom Runge-Kutta (RK) integrator cell, designed for gradient-based learning. The proposed approach updates the bridge finite element model while also quantifying damage severity and localizing the affected structural members. A case study on the Calumet Bridge in Chicago, Illinois, with simulated damage scenarios, is used to demonstrate the model's effectiveness in identifying damage while maintaining low false-positive rates. Furthermore, the damage identification pipeline is designed to seamlessly integrate prior knowledge from inspections and drone surveys, also enabling context-aware updating and assessment of bridge's condition.

2501.16717 2026-02-17 cs.RO

Strawberry Robotic Operation Interface: An Open-Source Device for Collecting Dexterous Manipulation Data in Robotic Strawberry Farming

Linsheng Hou, Wenwu Lu, Yanan Wang, Chen Peng, Zhenghao Fei

详情
英文摘要

The strawberry farming is labor-intensive, particularly in tasks requiring dexterous manipulation such as picking occluded strawberries. To address this challenge, we present the Strawberry Robotic Operation Interface (SROI), an open-source device designed for collecting dexterous manipulation data in robotic strawberry farming. The SROI features a handheld unit with a modular end effector, a stereo robotic camera, enabling the easy collection of demonstration data in field environments. A data post-processing pipeline is introduced to extract spatial trajectories and gripper states from the collected data. Additionally, we release an open-source dataset of strawberry picking demonstrations to facilitate research in dexterous robotic manipulation. The SROI represents a step toward automating complex strawberry farming tasks, reducing reliance on manual labor.

2501.15889 2026-02-17 cs.LG cs.AI

Adaptive Width Neural Networks

Federico Errica, Henrik Christiansen, Viktor Zaverkin, Mathias Niepert, Francesco Alesiani

Comments International Conference on Learning Representations (ICLR 2026)

详情
英文摘要

For almost 70 years, researchers have typically selected the width of neural networks' layers either manually or through automated hyperparameter tuning methods such as grid search and, more recently, neural architecture search. This paper challenges the status quo by introducing an easy-to-use technique to learn an unbounded width of a neural network's layer during training. The method jointly optimizes the width and the parameters of each layer via standard backpropagation. We apply the technique to a broad range of data domains such as tables, images, text, sequences, and graphs, showing how the width adapts to the task's difficulty. A by product of our width learning approach is the easy truncation of the trained network at virtually zero cost, achieving a smooth trade-off between performance and compute resources. Alternatively, one can dynamically compress the network until performances do not degrade. In light of recent foundation models trained on large datasets, requiring billions of parameters and where hyper-parameter tuning is unfeasible due to huge training costs, our approach introduces a viable alternative for width learning.

2501.13354 2026-02-17 cs.CV

ATRNet-STAR: A Large Dataset and Benchmark Towards Remote Sensing Object Recognition in the Wild

Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Bowen Peng, Yafei Song, Xuying Xiong, Wei Yang, Tianpeng Liu, Zhen Liu, Xiang Li

Comments 17 pages, 12 figures; Homepage: https://github.com/waterdisappear/ATRNet-STAR . in IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

详情
英文摘要

The absence of publicly available, large-scale, high-quality datasets for Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has significantly hindered the application of rapidly advancing deep learning techniques, which hold huge potential to unlock new capabilities in this field. This is primarily because collecting large volumes of diverse target samples from SAR images is prohibitively expensive, largely due to privacy concerns, the characteristics of microwave radar imagery perception, and the need for specialized expertise in data annotation. Throughout the history of SAR ATR research, there have been only a number of small datasets, mainly including targets like ships, airplanes, buildings, etc. There is only one vehicle dataset MSTAR collected in the 1990s, which has been a valuable source for SAR ATR. To fill this gap, this paper introduces a large-scale, new dataset named ATRNet-STAR with 40 different vehicle categories collected under various realistic imaging conditions and scenes. It marks a substantial advancement in dataset scale and diversity, comprising over 190,000 well-annotated samples, 10 times larger than its predecessor, the famous MSTAR. Building such a large dataset is a challenging task, and the data collection scheme will be detailed. Secondly, we illustrate the value of ATRNet-STAR via extensively evaluating the performance of 15 representative methods with 7 different experimental settings on challenging classification and detection benchmarks derived from the dataset. Finally, based on our extensive experiments, we identify valuable insights for SAR ATR and discuss potential future research directions in this field. We hope that the scale, diversity, and benchmark of ATRNet-STAR can significantly facilitate the advancement of SAR ATR.

2501.07575 2026-02-17 cs.CV cs.AI

Dataset Distillation via Committee Voting

Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen

Comments Code at: https://github.com/Jiacheng8/CV-DD

详情
英文摘要

Dataset distillation aims to synthesize a compact yet representative dataset that preserves the essential characteristics of the original data for efficient model training. Existing methods mainly focus on improving data-synthetic alignment or scaling distillation to large datasets. In this work, we propose $\textbf{C}$ommittee $\textbf{V}$oting for $\textbf{D}$ataset $\textbf{D}$istillation ($\textbf{CV-DD}$), an orthogonal approach that leverages the collective knowledge of multiple models to produce higher-quality distilled data. We first establish a strong baseline that achieves state-of-the-art performance through modern architectural and optimization choices. By integrating distributions and predictions from multiple models and generating high-quality soft labels, our method captures a broader range of data characteristics, reduces model-specific bias and the impact of distribution shifts, and significantly improves generalization. This voting-based strategy enhances diversity and robustness, alleviates overfitting, and improves post-evaluation performance. Extensive experiments across multiple datasets and IPC settings demonstrate that CV-DD consistently outperforms single- and multi-model distillation methods and generalizes well to non-training-based frameworks and challenging synthetic-to-real transfer tasks. Code is available at: https://github.com/Jiacheng8/CV-DD.

2501.05633 2026-02-17 cs.LG cs.IT eess.SP math.IT

Regularized Top-$k$: A Bayesian Framework for Gradient Sparsification

Ali Bereyhi, Ben Liang, Gary Boudreau, Ali Afana

Comments This paper has been published in IEEE Transactions on Signal Processing, vol. 73, pp. 4463 - 4478, 2025. The present arXiv version contains additional experimental results. 27 pages, 8 figures, 2 tables

Journal ref IEEE Transactions on Signal Processing, vol. 73, pp. 4463 - 4478, 2025

详情
英文摘要

Error accumulation is effective for gradient sparsification in distributed settings: initially-unselected gradient entries are eventually selected as their accumulated error exceeds a certain level. The accumulation essentially behaves as a scaling of the learning rate for the selected entries. Although this property prevents the slow-down of lateral movements in distributed gradient descent, it can deteriorate convergence in some settings. This work proposes a novel sparsification scheme that controls the learning rate scaling of error accumulation. The development of this scheme follows two major steps: first, gradient sparsification is formulated as an inverse probability (inference) problem, and the Bayesian optimal sparsification mask is derived as a maximum-a-posteriori estimator. Using the prior distribution inherited from Top-k, we derive a new sparsification algorithm which can be interpreted as a regularized form of Top-k. We call this algorithm regularized Top-k (RegTop-k). It utilizes past aggregated gradients to evaluate posterior statistics of the next aggregation. It then prioritizes the local accumulated gradient entries based on these posterior statistics. We validate our derivation through various numerical experiments. In distributed linear regression, it is observed that while Top-k remains at a fixed distance from the global optimum, RegTop-k converges to the global optimum at significantly higher compression ratios. We further demonstrate the generalization of this observation by employing RegTop-k in distributed training of ResNet-18 on CIFAR-10, as well as fine-tuning of multiple computer vision models on the ImageNette dataset. Our numerical results confirm that as the compression ratio increases, RegTop-k sparsification noticeably outperforms Top-k.

2412.20110 2026-02-17 cs.CV

Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification

Xi Yang, Pai Peng, Wulin Xie, Xiaohuan Lu, Jie Wen

Comments The authors request withdrawal of this article. This version was submitted in error. Compared to the intended final version, it contains inaccuracies and fails to accurately reflect the authors' work and conclusions

详情
英文摘要

Few-shot image classification remains a critical challenge in the field of computer vision, particularly in data-scarce environments. Existing methods typically rely on pre-trained visual-language models, such as CLIP. However, due to the modality gap, which is the inconsistent distribution of image and text features in the joint embedding space, directly using these features as class prototypes often leads to suboptimal performance. To address this issue, we propose a novel Cross-Modal Mapping (CMM) method. This method globally aligns image features with the text feature space through linear transformation and optimizes their local spatial relationships using triplet loss, thereby significantly enhancing cross-modal consistency. Experimental results show that compared to other methods, CMM simplifies the training process and demonstrates higher efficiency. Furthermore, CMM improves the average Top-1 accuracy by 1.06% on 11 benchmark datasets compared to methods that partially fine-tune the backbone, and it performs excellently on 4 distribution shift datasets. Notably, CMM effectively mitigates the modality gap in pre-trained models, enabling text features to serve as effective class prototypes for image features, thus providing an efficient and highly generalizable solution for few-shot learning.

2412.14294 2026-02-17 cs.CV cs.LG

TRecViT: A Recurrent Video Transformer

Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu

详情
英文摘要

We propose a novel block for \emph{causal} video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture \emph{TRecViT} is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, being the first causal video model in the state-space models family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large scale video datasets (SSv2, Kinetics400), while having $3\times$ less parameters, $12\times$ smaller memory footprint, and $5\times$ lower FLOPs count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in real-time. When compared with causal transformer-based models (TSM, RViT) and other recurrent models like LSTM, TRecViT obtains state-of-the-art results on the challenging SSv2 dataset. Code and checkpoints are available online https://github.com/google-deepmind/trecvit.

2412.13474 2026-02-17 cs.RO cs.SY eess.SY

Planning Human-Robot Co-manipulation with Human Motor Control Objectives and Multi-component Reaching Strategies

Kevin Haninger, Luka Peternel

Comments 10 Pages

Journal ref IEEE Robotics and Automation Letters, Volume 10, Issue 2, February 2025

详情
英文摘要

For successful goal-directed human-robot interaction, the robot should adapt to the intentions and actions of the collaborating human. This can be supported by musculoskeletal or data-driven human models, where the former are limited to lower-level functioning such as ergonomics, and the latter have limited generalizability or data efficiency. What is missing, is the inclusion of human motor control models that can provide generalizable human behavior estimates and integrate into robot planning methods. We use well-studied models from human motor control based on the speed-accuracy and cost-benefit trade-offs to plan collaborative robot motions. In these models, the human trajectory minimizes an objective function, a formulation we adapt to numerical trajectory optimization. This can then be extended with constraints and new variables to realize collaborative motion planning and goal estimation. We deploy this model, as well as a multi-component movement strategy, in physical collaboration with uncertain goal-reaching and synchronized motion tasks, showing the ability of the approach to produce human-like trajectories over a range of conditions.

2412.01168 2026-02-17 cs.RO cs.SY eess.SY

On the Surprising Effectiveness of Spectral Clipping in Learning Stable Linear and Latent-Linear Dynamical Systems

Hanyao Guo, Yunhai Han, Harish Ravichandar

详情
英文摘要

When learning stable linear dynamical systems from data, three important properties are desirable: i) predictive accuracy, ii) verifiable stability, and iii) computational efficiency. Unconstrained minimization of prediction errors leads to high accuracy and efficiency but cannot guarantee stability. Existing methods to enforce stability often preserve accuracy, but do so only at the cost of increased computation. In this work, we investigate if a seemingly-naive procedure can simultaneously offer all three desiderata. Specifically, we consider a post-hoc procedure in which we surgically manipulate the spectrum of the linear system after it was learned using unconstrained least squares. We call this approach spectral clipping (SC) as it involves eigen decomposition and subsequent reconstruction of the system matrix after any eigenvalues whose magnitude exceeds one have been clipped to one (without altering the eigenvectors). We also show that SC can be readily combined with Koopman operators to learn nonlinear dynamical systems that can generate stable predictions of nonlinear phenomena, such as those underlying complex dexterous manipulation skills involving multi-fingered robotic hands. Through comprehensive experiments involving two different applications and publicly available benchmark datasets, we show that this simple technique can efficiently learn highly-accurate predictive dynamics that are provably-stable. Notably, we find that SC can match or outperform strong baselines while being orders-of-magnitude faster. Finally, we find that SC can learn stable robot policies even when the training data includes unsuccessful or truncated demonstrations. Our code and datasets can be found at https://github.com/GT-STAR-Lab/spec_clip.

2412.00686 2026-02-17 cs.CV cs.AI

LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models

Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis

Comments 38 pages, 24 Figures, 19 Tables

详情
英文摘要

Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repetitive counting -- a common issue in a naive divide-and-conquer implementation. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.

2411.16085 2026-02-17 cs.LG cs.AI cs.CL cs.CV cs.DM

Cautious Optimizers: Improving Training with One Line of Code

Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu

详情
英文摘要

AdamW has been the default optimizer for transformer pretraining. For many years, our community searched for faster and more stable optimizers with only constrained positive outcomes. In this work, we propose a \textbf{one-line modification in Pytorch} to any momentum-based optimizer, which we rename cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing not only consistent speed-up on LLM pretraining, but also image classification, with minimum extra tuning on hyperparameters. Code is available at https://github.com/kyleliang919/C-Optim.

2411.06403 2026-02-17 cs.AI

Mastering NIM and Impartial Games with Weak Neural Networks: An AlphaZero-inspired Multi-Frame Approach

Søren Riis

详情
英文摘要

We study impartial games under fixed-latency, fixed-scale quantised inference (FSQI). In this fixed-scale, bounded-range regime, we prove that inference is simulable by constant-depth polynomial-size Boolean circuits (AC0). This yields a worst-case representational barrier: single-frame agents in the FSQI/AC0 regime cannot strongly master NIM, because optimal play depends on the global nim-sum (parity). Under our stylised deterministic rollout interface, a single rollout policy head from the structured family analysed here reveals only one fixed linear functional of the invariant, so increasing rollout budget alone does not recover the missing bits. We derive two structural bypasses: (1) a multi-policy-head rollout architecture that recovers the full invariant via distinct rollout channels, and (2) a multi-frame architecture that tracks local nimber differences and supports restoration. Experiments across multiple settings are consistent with these predictions: single-head baselines stay near chance, while two-frame models reach near-perfect restoration accuracy and multi-head FSM-controlled shootouts achieve perfect win/loss position classification. Overall, the empirical results support the view that explicit structural priors (history/differences or multiple rollout channels) are important in the FSQI/AC0 regime.

2410.18784 2026-02-17 cs.LG cs.NA eess.SP math.NA math.ST stat.ML stat.TH

Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality

Zhihan Huang, Yuting Wei, Yuxin Chen

Comments Accepted to Mathematics of Operations Research

详情
英文摘要

The denoising diffusion probabilistic model (DDPM) has emerged as a mainstream generative model in generative AI. While sharp convergence guarantees have been established for the DDPM, the iteration complexity is, in general, proportional to the ambient data dimension, resulting in overly conservative theory that fails to explain its practical efficiency. This has motivated the recent work Li and Yan (2024a) to investigate how the DDPM can achieve sampling speed-ups through automatic exploitation of intrinsic low dimensionality of data. We strengthen this line of work by demonstrating, in some sense, optimal adaptivity to unknown low dimensionality. For a broad class of data distributions with intrinsic dimension $k$, we prove that the iteration complexity of the DDPM scales nearly linearly with $k$, which is optimal when using KL divergence to measure distributional discrepancy. Notably, our work is closely aligned with the independent concurrent work Potaptchik et al. (2024) -- posted two weeks prior to ours -- in establishing nearly linear-$k$ convergence guarantees for the DDPM.

2410.10481 2026-02-17 cs.LG cs.AI cs.CR

Model-based Large Language Model Customization as Service

Zhaomin Wu, Jizhou Guo, Junyi Hou, Bingsheng He, Lixin Fan, Qiang Yang

Comments Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

Journal ref Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (2025)

详情
英文摘要

Prominent Large Language Model (LLM) services from providers like OpenAI and Google excel at general tasks but often underperform on domain-specific applications. Current customization services for these LLMs typically require users to upload data for fine-tuning, posing significant privacy risks. While differentially private (DP) data synthesis presents a potential alternative, its application commonly results in low effectiveness due to the introduction of excessive noise on data for DP. To overcome this, we introduce Llamdex, a novel framework that facilitates LLM customization as a service, where the client uploads pre-trained domain-specific models rather than data. This client-uploaded model, optionally protected by DP with much lower noise, is inserted into the base LLM via connection modules. Significantly, these connecting modules are trained without requiring sensitive domain data, enabling clients to customize LLM services while preserving data privacy. Experiments demonstrate that Llamdex improves domain-specific accuracy by up to 26% over state-of-the-art private data synthesis methods under identical privacy constraints and, by obviating the need for users to provide domain context within queries, maintains inference efficiency comparable to the original LLM service.

2410.06244 2026-02-17 cs.CV

Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization

Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou

Comments 31 pages, 33 figures, The project page and associated code can be accessed via https://jwmao1.github.io/storyiter/

详情
英文摘要

This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module, modeling all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step-by-step. Extensive experiments in the official story visualization dataset and our long story benchmark demonstrate that Story-Iter's state-of-the-art performance in long-story visualization (up to 100 frames) excels in both semantic consistency and fine-grained interactions.

2410.03919 2026-02-17 cs.LG stat.ML

Online Posterior Sampling with a Diffusion Prior

Branislav Kveton, Boris Oreshkin, Youngsuk Park, Aniket Deshmukh, Rui Song

Comments Advances in Neural Information Processing Systems 37

详情
英文摘要

Posterior sampling in contextual bandits with a Gaussian prior can be implemented exactly or approximately using the Laplace approximation. The Gaussian prior is computationally efficient but it cannot describe complex distributions. In this work, we propose approximate posterior sampling algorithms for contextual bandits with a diffusion model prior. The key idea is to sample from a chain of approximate conditional posteriors, one for each stage of the reverse diffusion process, which are obtained by the Laplace approximation. Our approximations are motivated by posterior sampling with a Gaussian prior, and inherit its simplicity and efficiency. They are asymptotically consistent and perform well empirically on a variety of contextual bandit problems.

2410.02081 2026-02-17 cs.LG

MixLinear: Extreme Low Resource Multivariate Time Series Forecasting with 0.1K Parameters

Aitian Ma, Dongsheng Luo, Mo Sha

详情
英文摘要

Recently, there has been a growing interest in Long-term Time Series Forecasting (LTSF), which involves predicting long-term future values by analyzing a large amount of historical time-series data to identify patterns and trends. There exist significant challenges in LTSF due to its complex temporal dependencies and high computational demands. Although Transformer-based models offer high forecasting accuracy, they are often too compute-intensive to be deployed on devices with hardware constraints. On the other hand, the linear models aim to reduce the computational overhead by employing either decomposition methods in the time domain or compact representations in the frequency domain. In this paper, we propose MixLinear, an ultra-lightweight multivariate time series forecasting model specifically designed for resource-constrained devices. MixLinear effectively captures both temporal and frequency domain features by modeling intra-segment and inter-segment variations in the time domain and extracting frequency variations from a low-dimensional latent space in the frequency domain. By reducing the parameter scale of a downsampled $n$-length input/output one-layer linear model from $O(n^2)$ to $O(n)$, MixLinear achieves efficient computation without sacrificing accuracy. Extensive evaluations with four benchmark datasets show that MixLinear attains forecasting performance comparable to, or surpassing, state-of-the-art models with significantly fewer parameters ($0.1K$), which makes it well-suited for deployment on devices with limited computational capacity.

2409.14823 2026-02-17 cs.SD eess.AS

HiFi-Glot: High-Fidelity Neural Formant Synthesis with Differentiable Resonant Filters

Yicheng Gu, Pablo Pérez Zarazaga, Chaoren Wang, Zhizheng Wu, Zofia Malisz, Gustav Eje Henter, Lauri Juvela

详情
英文摘要

Formant synthesis aims to generate speech with controllable formant structures, enabling precise control of vocal resonance and phonetic features. However, while existing formant synthesis approaches enable precise formant manipulation, they often yield an impoverished speech signal by failing to capture the complex co-occurring acoustic cues essential for naturalness. To address this issue, this letter presents HiFi-Glot, an end-to-end neural formant synthesis system that achieves both precise formant control and high-fidelity speech synthesis. Specifically, the proposed model adopts a source--filter architecture inspired by classical formant synthesis, where a neural vocoder generates the glottal excitation signal, and differentiable resonant filters model the formants to produce the speech waveform. Experiment results demonstrate that our proposed HiFi-Glot model can generate speech with higher perceptual quality and naturalness while exhibiting a more precise control over formant frequencies, outperforming industry-standard formant manipulation tools such as Praat. Code, checkpoints, and representative audio samples are available at https://www.yichenggu.com/HiFi-Glot/.