arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2393
专题追踪 全部专题
2510.08431 2026-02-17 cs.CV cs.LG

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

Comments ICLR 2026

详情
英文摘要

Although continuous-time consistency models (e.g., sCM, MeanFlow) are theoretically principled and empirically powerful for fast academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of evaluation benchmarks like FID. This work represents the first effort to scale up continuous-time consistency to general application-level image and video diffusion models, and to make JVP-based distillation effective at large scale. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM generally matches the state-of-the-art distillation method DMD2 on quality metrics while mitigating mode collapse and offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation. Code is available at https://github.com/NVlabs/rcm.

2510.07817 2026-02-17 cs.CV

PAGCNet: A Pose-Aware and Geometry Constrained Framework for Panoramic Depth Estimation

Kanglin Ning, Ruzhao Chen, Penghong Wang, Xingtao Wang, Ruiqin Xiong, Xiaopeng Fan

详情
英文摘要

Explicitly modeling room background depth as a geometric constraint has proven effective for panoramic depth estimation. However, reconstructing this background depth for regular enclosed regions in a complex indoor scene without external measurements remains an open challenge. To address this, we propose a pose-aware and geometry-constrained framework for panoramic depth estimation. Our framework first employs multiple task-specific decoders to jointly estimate room layout, camera pose, depth, and region segmentation from a input panoramic image. A pose-aware background depth resolving (PA-BDR) component uses tasks decoder's prediction to resolve the camera pose. Subsequently, the proposed PA-BDR component uses the camera pose to compute the background depth of regular enclosed regions and uses this background depth as a strong geometric prior. Based on the output of the region segmentation decoder, a fusion mask generation (FMG) component produces a fusion weight map to guide where and to what extent the geometry-constrained background depth should correct the depth decoder's prediction. Finally, an adaptive fusion component integrates this refined background depth with the initial depth prediction, guided by the fusion weight. Extensive experiments on Matterport3D, Structured3D, and Replica datasets demonstrate that our method achieves significantly superior performance compared to current open-source methods. Code is available at https://github.com/emiyaning/PAGCNet.

2510.06738 2026-02-17 cs.CL

AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

Boyi Zeng, Lin Chen, Ziwei He, Xinbing Wang, Zhouhan Lin

Comments ICLR 2026

详情
英文摘要

Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo-such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling-pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at https://github.com/LUMIA-Group/AWM.

2510.06714 2026-02-17 cs.LG cs.AI

Dual Goal Representations

Seohong Park, Deepinder Mann, Sergey Levine

Comments ICLR 2026

详情
英文摘要

In this work, we introduce dual goal representations for goal-conditioned reinforcement learning (GCRL). A dual goal representation characterizes a state by "the set of temporal distances from all other states"; in other words, it encodes a state through its relations to every other state, measured by temporal distance. This representation provides several appealing theoretical properties. First, it depends only on the intrinsic dynamics of the environment and is invariant to the original state representation. Second, it contains provably sufficient information to recover an optimal goal-reaching policy, while being able to filter out exogenous noise. Based on this concept, we develop a practical goal representation learning method that can be combined with any existing GCRL algorithm. Through diverse experiments on the OGBench task suite, we empirically show that dual goal representations consistently improve offline goal-reaching performance across 20 state- and pixel-based tasks.

2510.04398 2026-02-17 cs.CL cs.AI cs.CR cs.LG

SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, René Vidal

Comments Accepted at NeurIPS 2025. Code is available at https://github.com/Buyun-Liang/SECA

详情
英文摘要

Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often exhibit hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks to elicit hallucinations in LLMs, but these methods often rely on unrealistic prompts, either by inserting nonsensical tokens or by altering the original semantic intent. Consequently, such approaches provide limited insight into how hallucinations arise in real-world settings. In contrast, adversarial attacks in computer vision typically involve realistic modifications to input images. However, the problem of identifying realistic adversarial prompts for eliciting LLM hallucinations remains largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA), which elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no semantic equivalence or semantic coherence errors compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.

2510.03669 2026-02-17 cs.LG cs.CL

Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

Comments Full version of submission to 2nd AI for Math Workshop@ ICML 2025 (best paper)

详情
英文摘要

Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.

2510.02826 2026-02-17 cs.LG

Multi-scale Autoregressive Models are Laplacian, Discrete, and Latent Diffusion Models in Disguise

Steve Hong, Samuel Belkadi

详情
英文摘要

We reinterpret Visual Autoregressive (VAR) models as iterative refinement models to identify which design choices drive their quality-efficiency trade-off. Instead of treating VAR only as next-scale autoregression, we formalise it as a deterministic forward process that builds a Laplacian-style latent pyramid, together with a learned backward process that reconstructs samples in a small number of coarse-to-fine steps. This formulation makes the link to denoising diffusion explicit and highlights three modelling choices that may underlie VAR's efficiency and sample quality: refinement in a learned latent space, discrete prediction over code indices, and decomposition by spatial frequency. We support this view with controlled experiments that isolate the contribution of each factor to quality and speed. We also discuss how the same framework can be adapted to permutation-invariant graph generation and probabilistic medium-range weather forecasting, and how it provides practical points of contact with diffusion methods while preserving few-step, scale-parallel generation.

2510.02410 2026-02-17 cs.LG

OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data

Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Robert Jakob, Ning Wang, Juncheng Liu, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Rodriguez, Daniel McDuff, Elgar Fleisch, Oliver Aalami, Filipe Barata, Paul Schmiedmayer

详情
英文摘要

LLMs have emerged as powerful tools for interpreting multimodal data. In medicine, they hold particular promise for synthesizing large volumes of clinical information into actionable insights and digital health applications. Yet, a major limitation remains their inability to handle time series. To overcome this gap, we present OpenTSLM, a family of Time Series Language Models (TSLMs) created by integrating time series as a native modality to pretrained LLMs, enabling reasoning over multiple time series of any length. We investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt, models time series implicitly by concatenating learnable time series tokens with text tokens via soft prompting. Although parameter-efficient, we hypothesize that explicit time series modeling scales better and outperforms implicit approaches. We thus introduce OpenTSLM-Flamingo, which integrates time series with text via cross-attention. We benchmark both variants against baselines that treat time series as text tokens or plots, across a suite of text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all, OpenTSLM models outperform baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR, compared to 9.05 and 52.2 for finetuned text-only models. Notably, even 1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95). OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms on longer sequences, while maintaining stable memory requirements. By contrast, SoftPrompt grows exponentially in memory with sequence length, requiring around 110 GB compared to 40 GB VRAM when training on ECG-QA with LLaMA-3B. Expert reviews by clinicians find strong reasoning capabilities exhibited by OpenTSLMs on ECG-QA. To facilitate further research, we provide all code, datasets, and models open-source.

2510.01039 2026-02-17 cs.LG

Gated X-TFC: Soft Domain Decomposition for Forward and Inverse Problems in Sharp-Gradient PDEs

Vikas Dwivedi, Enrico Schiassi, Monica Sigovan, Bruno Sixou

详情
英文摘要

Physics-informed neural networks (PINNs) and related methods struggle to resolve sharp gradients in singularly perturbed boundary value problems without resorting to some form of domain decomposition, which often introduce complex interface penalties. While the Extreme Theory of Functional Connections (X-TFC) avoids multi-objective optimization by employing exact boundary condition enforcement, it remains computationally inefficient for boundary layers and incompatible with decomposition. We propose Gated X-TFC, a novel framework for both forward and inverse problems, that overcomes these limitations through a soft, learned domain decomposition. Our method replaces hard interfaces with a differentiable logistic gate that dynamically adapts radial basis function (RBF) kernel widths across the domain, eliminating the need for interface penalties. This approach yields not only superior accuracy but also dramatic improvements in computational efficiency: on a benchmark one dimensional (1D) convection-diffusion, Gated X-TFC achieves an order-of-magnitude lower error than standard X-TFC while using 80 percent fewer collocation points and reducing training time by 66 percent. In addition, we introduce an operator-conditioned meta-learning layer that learns a probabilistic mapping from PDE parameters to optimal gate configurations, enabling fast, uncertainty-aware warm-starting for new problem instances. We further demonstrate scalability to multiple subdomains and higher dimensions by solving a twin boundary-layer equation and a 2D Poisson problem with a sharp Gaussian source. Overall, Gated X-TFC delivers a simple alternative alternative to PINNs that is both accurate and computationally efficient for challenging boundar-layer regimes. Future work will focus on nonlinear problems.

2510.00634 2026-02-17 cs.CV

LAKAN: Landmark-assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection

Jiayao Jiang, Bin Liu, Qi Chu, Nenghai Yu

Comments 5 pages, 3 figures. This work has been accepted at ICASSP 2026

详情
英文摘要

The rapid development of deepfake generation techniques necessitates robust face forgery detection algorithms. While methods based on Convolutional Neural Networks (CNNs) and Transformers are effective, there is still room for improvement in modeling the highly complex and non-linear nature of forgery artifacts. To address this issue, we propose a novel detection method based on the Kolmogorov-Arnold Network (KAN). By replacing fixed activation functions with learnable splines, our KAN-based approach is better suited to this challenge. Furthermore, to guide the network's focus towards critical facial areas, we introduce a Landmark-assisted Adaptive Kolmogorov-Arnold Network (LAKAN) module. This module uses facial landmarks as a structural prior to dynamically generate the internal parameters of the KAN, creating an instance-specific signal that steers a general-purpose image encoder towards the most informative facial regions with artifacts. This core innovation creates a powerful combination between geometric priors and the network's learning process. Extensive experiments on multiple public datasets show that our proposed method achieves superior performance.

2510.00232 2026-02-17 cs.CL cs.AI cs.CY cs.LG

BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley, Zexue He

Comments Accepted by ICLR 2026

详情
英文摘要

Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance, leading to inconsistent comparisons among them. Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading model responses and expect fair and safe outputs rather than LLMs' probabilities. To enable consistent evaluation across debiasing methods and bridge this gap, we introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (covering four prompting-based and four training-based methods) on two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We further introduce a response-level metric, Bias-Free Score, to measure the extent to which LLM responses are fair, safe, and anti-stereotypical. Debiasing performances are systematically compared and analyzed across key dimensions: the prompting vs. training paradigm, model size, and generalization of different training strategies to unseen bias types. We release our benchmark, aiming to establish a unified testbed for bias mitigation research.

2509.25260 2026-02-17 cs.AI cs.CL cs.LG

Internal Planning in Language Models: Characterizing Horizon and Branch Awareness

Muhammed Ustaomeroglu, Baris Askin, Gauri Joshi, Carlee Joe-Wong, Guannan Qu

Comments Accepted to ICLR 2026

详情
英文摘要

The extent to which decoder-only language models (LMs) engage in planning, that is, organizing intermediate computations to support coherent long-range generation, remains an important question, with implications for interpretability, reliability, and principled model design. Planning involves structuring computations over long horizons, and considering multiple possible continuations, but how far transformer-based LMs exhibit them without external scaffolds, e.g., chain-of-thought prompting, is unclear. We address these questions by analyzing the hidden states at the core of transformer computations, which capture intermediate results and act as carriers of information. Since these hidden representations are redundant and encumbered with fine-grained details, we develop a pipeline based on vector-quantized variational autoencoders that compresses them into compact summary codes. These codes enable measuring mutual information and analyzing the computational structure of the underlying model behavior. Using this framework, we study planning in LMs across synthetic grammar, path-finding tasks, and natural language datasets, focusing on two planning properties: (i) the planning horizon of pre-output computations, and (ii) the extent to which the model considers alternative valid continuations. As a separate downstream use of the same pipeline, we also analyze how decision-relevant information is distributed across layers and earlier prefix blocks when producing next-token predictions. Together, these analyses advance our understanding of planning in LMs and provide a general-purpose pipeline for inspecting internal model dynamics. Our results reveal that the effective planning horizon is task-dependent, that models implicitly preserve information about unused correct continuations, and that predictions draw most on recent computations, though earlier blocks remain informative.

2509.23437 2026-02-17 cs.LG stat.ML

Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions

Steve Hong, Runa Eschenhagen, Bruno Mlodozeniec, Richard Turner

详情
英文摘要

Influence functions offer a principled way to trace model predictions back to training data, but their use in deep learning is hampered by the need to invert a large, ill-conditioned Hessian matrix. Approximations such as Generalised Gauss-Newton (GGN) and Kronecker-Factored Approximate Curvature (K-FAC) have been proposed to make influence computation tractable, yet it remains unclear how the departure from exactness impacts data attribution performance. Critically, given the restricted regime in which influence functions are derived, it is not necessarily clear better Hessian approximations should even lead to better data attribution performance. In this paper, we investigate the effect of Hessian approximation quality on influence-function attributions in a controlled classification setting. Our experiments show that better Hessian approximations consistently yield better influence score quality, offering justification for recent research efforts towards that end. We further decompose the approximation steps for recent Hessian approximation methods and evaluate each step's influence on attribution accuracy. Notably, the mismatch between K-FAC eigenvalues and GGN/EK-FAC eigenvalues accounts for the majority of the error and influence loss. These findings highlight which approximations are most critical, guiding future efforts to balance computational tractability and attribution accuracy.

2509.23094 2026-02-17 cs.CL

d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang

Comments Accepted by ICLR 2026, 21 pages, 9 figures

详情
英文摘要

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce \textit{Dual aDaptive Cache} (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (\ie, LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.

2509.22067 2026-02-17 cs.LG cs.AI

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina

详情
英文摘要

Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 1-13%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, demonstrates a comparable harmful potential. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise control over model internals does not guarantee precise control over model behavior.

2509.21719 2026-02-17 cs.CV

DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining

Shuning Sun, Jialang Lu, Xiang Chen, Jichao Wang, Dianjie Lu, Guijuan Zhang, Guangwei Gao, Zhuoran Zheng

详情
英文摘要

Videos captured in the wild often suffer from rain streaks, blur, and noise. In addition, even slight changes in camera pose can amplify cross-frame mismatches and temporal artifacts. Existing methods rely on optical flow or heuristic alignment, which are computationally expensive and less robust. To address these challenges, Lie groups provide a principled way to represent continuous geometric transformations, making them well-suited for enforcing spatial and temporal consistency in video modeling. Building on this insight, we propose DeLiVR, an efficient video deraining method that injects spatiotemporal Lie-group differential biases directly into attention scores of the network. Specifically, the method introduces two complementary components. First, a rotation-bounded Lie relative bias predicts the in-plane angle of each frame using a compact prediction module, where normalized coordinates are rotated and compared with base coordinates to achieve geometry-consistent alignment before feature aggregation. Second, a differential group displacement computes angular differences between adjacent frames to estimate a velocity. This bias computation combines temporal decay and attention masks to focus on inter-frame relationships while precisely matching the direction of rain streaks. Extensive experimental results demonstrate the effectiveness of our method on publicly available benchmarks. The code is publicly available at https://github.com/Shuning0312/ICLR-DeLiVR.

2509.19189 2026-02-17 cs.LG stat.ML

Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu

Comments 60 pages, accepted by NeurIPS 2025 as a spotlight paper

详情
英文摘要

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.

2509.18053 2026-02-17 cs.RO

V2V-GoT: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models and Graph-of-Thoughts

Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Yu-Chiang Frank Wang, Min-Hung Chen, Stephen F. Smith

Comments Accepted by ICRA 2026 (IEEE International Conference on Robotics and Automation). Project: https://eddyhkchiu.github.io/v2vgot.github.io/ Code: https://github.com/eddyhkchiu/V2V-GoT Dataset: https://huggingface.co/datasets/eddyhkchiu/V2V-GoT-QA

详情
英文摘要

Current state-of-the-art autonomous vehicles could face safety-critical situations when their local sensors are occluded by large nearby objects on the road. Vehicle-to-vehicle (V2V) cooperative autonomous driving has been proposed as a means of addressing this problem, and one recently introduced framework for cooperative autonomous driving has further adopted an approach that incorporates a Multimodal Large Language Model (MLLM) to integrate cooperative perception and planning processes. However, despite the potential benefit of applying graph-of-thoughts reasoning to the MLLM, this idea has not been considered by previous cooperative autonomous driving research. In this paper, we propose a novel graph-of-thoughts framework specifically designed for MLLM-based cooperative autonomous driving. Our graph-of-thoughts includes our proposed novel ideas of occlusion-aware perception and planning-aware prediction. We curate the V2V-GoT-QA dataset and develop the V2V-GoT model for training and testing the cooperative driving graph-of-thoughts. Our experimental results show that our method outperforms other baselines in cooperative perception, prediction, and planning tasks. Our project website: https://eddyhkchiu.github.io/v2vgot.github.io/ .

2509.17750 2026-02-17 cs.RO cs.SY eess.SY math.OC

EigenSafe: A Spectral Framework for Learning-Based Probabilistic Safety Assessment

Inkyu Jang, Jonghae Park, Sihyun Cho, Chams E. Mballo, Claire J. Tomlin, H. Jin Kim

Comments Inkyu Jang and Jonghae Park contributed equally to this work. Project Webpage: https://eigen-safe.github.io/

详情
英文摘要

We present EigenSafe, an operator-theoretic framework for safety assessment of learning-enabled stochastic systems. In many robotic applications, the dynamics are inherently stochastic due to factors such as sensing noise and environmental disturbances, and it is challenging for conventional methods such as Hamilton-Jacobi reachability and control barrier functions to provide a well-calibrated safety critic that is tied to the actual safety probability. We derive a linear operator that governs the dynamic programming principle for safety probability, and find that its dominant eigenpair provides critical safety information for both individual state-action pairs and the overall closed-loop system. The proposed framework learns this dominant eigenpair, which can be used to either inform or constrain policy updates. We demonstrate that the learned eigenpair effectively facilitates safe reinforcement learning. Further, we validate its applicability in enhancing the safety of learned policies from imitation learning through robot manipulation experiments using a UR3 robotic arm in a food preparation task.

2509.14585 2026-02-17 cs.LG math.OC

Online reinforcement learning via sparse Gaussian mixture model Q-functions

Minh Vu, Konstantinos Slavakis

详情
英文摘要

This paper introduces a structured and interpretable online policy-iteration framework for reinforcement learning (RL), built around the novel class of sparse Gaussian mixture model Q-functions (S-GMM-QFs). Extending earlier work that trained GMM-QFs offline, the proposed framework develops an online scheme that leverages streaming data to encourage exploration. Model complexity is regulated through sparsification by Hadamard overparametrization, which mitigates overfitting while preserving expressiveness. The parameter space of S-GMM-QFs is naturally endowed with a Riemannian manifold structure, allowing for principled parameter updates via online gradient descent on a smooth objective. Numerical experiments show that S-GMM-QFs match or even outperform dense deep RL (DeepRL) methods on standard benchmarks while using significantly fewer parameters. Moreover, they maintain strong performance even in low-parameter regimes where sparsified DeepRL methods fail to generalize.

2509.14530 2026-02-17 cs.RO

Learning to Pick: A Visuomotor Policy for Clustered Strawberry Picking

Zhenghao Fei, Wenwu Lu, Linsheng Hou, Chen Peng

详情
英文摘要

Strawberries naturally grow in clusters, interwoven with leaves, stems, and other fruits, which frequently leads to occlusion. This inherent growth habit presents a significant challenge for robotic picking, as traditional percept-plan-control systems struggle to reach fruits amid the clutter. Effectively picking an occluded strawberry demands dexterous manipulation to carefully bypass or gently move the surrounding soft objects and precisely access the ideal picking point located at the stem just above the calyx. To address this challenge, we introduce a strawberry-picking robotic system that learns from human demonstrations. Our system features a 4-DoF SCARA arm paired with a human teleoperation interface for efficient data collection and leverages an End Pose Assisted Action Chunking Transformer (ACT) to develop a fine-grained visuomotor picking policy. Experiments under various occlusion scenarios demonstrate that our modified approach significantly outperforms the direct implementation of ACT, underscoring its potential for practical application in occluded strawberry picking.

2509.00955 2026-02-17 cs.LG cs.AI stat.ML

ART: Adaptive Resampling-based Training for Imbalanced Classification

Arjun Basandrai, Shourya Jain, K. Ilanthenral

Comments Submitted to MLWA

详情
英文摘要

Traditional resampling methods for handling class imbalance typically uses fixed distributions, undersampling the majority or oversampling the minority. These static strategies ignore changes in class-wise learning difficulty, which can limit the overall performance of the model. This paper proposes an Adaptive Resampling-based Training (ART) method that periodically updates the distribution of the training data based on the class-wise performance of the model. Specifically, ART uses class-wise macro F1 scores, computed at fixed intervals, to determine the degree of resampling to be performed. Unlike instance-level difficulty modeling, which is noisy and outlier-sensitive, ART adapts at the class level. This allows the model to incrementally shift its attention towards underperforming classes in a way that better aligns with the optimization objective. Results on diverse benchmarks, including Pima Indians Diabetes and Yeast dataset demonstrate that ART consistently outperforms both resampling-based and algorithm-level methods, including Synthetic Minority Oversampling Technique (SMOTE), NearMiss Undersampling, and Cost-sensitive Learning on binary as well as multi-class classification tasks with varying degrees of imbalance. In most settings, these improvements are statistically significant. On tabular datasets, gains are significant under paired t-tests and Wilcoxon tests (p < 0.05), while results on text and image tasks remain favorable. Compared to training on the original imbalanced data, ART improves macro F1 by an average of 2.64 percentage points across all tested tabular datasets. Unlike existing methods, whose performance varies by task, ART consistently delivers the strongest macro F1, making it a reliable choice for imbalanced classification.

2508.20037 2026-02-17 cs.RO

Visio-Verbal Teleimpedance Interface: Enabling Semi-Autonomous Control of Physical Interaction via Eye Tracking and Speech

Henk H. A. Jekel, Alejandro Díaz Rosales, Luka Peternel

Journal ref Frontiers in Robotics and AI, Volume 13, 1749105, February 2026

详情
英文摘要

The paper presents a visio-verbal teleimpedance interface for commanding 3D stiffness ellipsoids to the remote robot with a combination of the operator's gaze and verbal interaction. The gaze is detected by an eye-tracker, allowing the system to understand the context in terms of what the operator is currently looking at in the scene. Along with verbal interaction, a Visual Language Model (VLM) processes this information, enabling the operator to communicate their intended action or provide corrections. Based on these inputs, the interface can then generate appropriate stiffness matrices for different physical interaction actions. To validate the proposed visio-verbal teleimpedance interface, we conducted a series of experiments on a setup including a Force Dimension Sigma.7 haptic device to control the motion of the remote Kuka LBR iiwa robotic arm. The human operator's gaze is tracked by Tobii Pro Glasses 2, while human verbal commands are processed by a VLM using GPT-4o. The first experiment explored the optimal prompt configuration for the interface. The second and third experiments demonstrated different functionalities of the interface on a slide-in-the-groove task.

2508.19884 2026-02-17 cs.LG

Parameter-Free Structural-Diversity Message Passing for Graph Neural Networks

Mingyue Kong, Yinglong Zhang, Chengda Xu, Xuewen Xia, Xing Xu

Comments 50 pages, 6 figures

Journal ref Neural Networks, 2026

详情
英文摘要

Graph Neural Networks (GNNs) have shown remarkable performance in structured data modeling tasks such as node classification. However, mainstream approaches generally rely on a large number of trainable parameters and fixed aggregation rules, making it difficult to adapt to graph data with strong structural heterogeneity and complex feature distributions. This often leads to over-smoothing of node representations and semantic degradation. To address these issues, this paper proposes a parameter-free graph neural network framework based on structural diversity, namely SDGNN (Structural-Diversity Graph Neural Network). The framework is inspired by structural diversity theory and designs a unified structural-diversity message passing mechanism that simultaneously captures the heterogeneity of neighborhood structures and the stability of feature semantics, without introducing additional trainable parameters. Unlike traditional parameterized methods, SDGNN does not rely on complex model training, but instead leverages complementary modeling from both structure-driven and feature-driven perspectives, thereby effectively improving adaptability across datasets and scenarios. Experimental results show that on eight public benchmark datasets and an interdisciplinary PubMed citation network, SDGNN consistently outperforms mainstream GNNs under challenging conditions such as low supervision, class imbalance, and cross-domain transfer. This work provides a new theoretical perspective and general approach for the design of parameter-free graph neural networks, and further validates the importance of structural diversity as a core signal in graph representation learning. To facilitate reproducibility and further research, the full implementation of SDGNN has been released at: https://github.com/mingyue15694/SGDNN/tree/main

2508.19228 2026-02-17 cs.LG

Predicting the Order of Upcoming Tokens Improves Language Modeling

Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji

详情
英文摘要

Multi-token prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We found MTP's exact future token prediction to be too difficult as an auxiliary loss. Instead, we propose token order prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, DeepSeek MTP (DS-MTP) and TOP objectives. The results of nine standard NLP benchmarks show that TOP overall outperforms NTP, MTP, and DS-MTP even at scale. TOP models with continued training on math and code also perform better on 4 relevant benchmarks. On the synthetic star graph task, TOP enables pathfinding on graphs where NTP, MTP, and DS-MTP fail. Our code is available at https://github.com/zaydzuhri/token-order-prediction

2508.15990 2026-02-17 cs.RO cs.CV

GelSLAM: A Real-time, High-Fidelity, and Robust 3D Tactile SLAM System

Hung-Jui Huang, Mohammad Amin Mirzaee, Michael Kaess, Wenzhen Yuan

Comments 20 pages

详情
英文摘要

Accurately perceiving an object's pose and shape is essential for precise grasping and manipulation. Compared to common vision-based methods, tactile sensing offers advantages in precision and immunity to occlusion when tracking and reconstructing objects in contact. This makes it particularly valuable for in-hand and other high-precision manipulation tasks. In this work, we present GelSLAM, a real-time 3D SLAM system that relies solely on tactile sensing to estimate object pose over long periods and reconstruct object shapes with high fidelity. Unlike traditional point cloud-based approaches, GelSLAM uses tactile-derived surface normals and curvatures for robust tracking and loop closure. It can track object motion in real time with low error and minimal drift, and reconstruct shapes with submillimeter accuracy, even for low-texture objects such as wooden tools. GelSLAM extends tactile sensing beyond local contact to enable global, long-horizon spatial perception, and we believe it will serve as a foundation for many precise manipulation tasks involving interaction with objects in hand. The video demo, code, and dataset are available at https://joehjhuang.github.io/gelslam.

2508.13213 2026-02-17 cs.AI

AI sustains higher strategic tension than humans in chess

Adamo Cerioli, Edward D. Lee, Vito D. P. Servedio

详情
英文摘要

Strategic decision-making requires balancing immediate opportunities against long-term objectives: a tension fundamental to competitive environments. We investigate this trade-off in chess by analyzing the dynamics of human and AI gameplay through a network-based metric that quantifies piece-to-piece interactions. Our analysis reveals that elite AI players sustain substantially higher levels of strategic tension for longer durations than top human grandmasters. We find that cumulative tension scales with algorithmic complexity in AI systems and increases linearly with skill level (Elo rating) in human play. Longer time controls are associated with higher tension in human games, reflecting the additional strategic complexity players can manage with more thinking time. The temporal profiles reveal contrasting approaches: highly competitive AI systems tolerate densely interconnected positions that balance offensive and defensive tactics over extended periods, while human players systematically limit tension and game complexity. These differences have broader implications for understanding how artificial and biological systems navigate complex strategic environments and for the deployment of AI in high-stakes competitive scenarios.

2508.11301 2026-02-17 cs.CV

Hyperspectral vs. RGB for Pedestrian Segmentation in Urban Driving Scenes: A Comparative Study

Jiarong Li, Imad Ali Shah, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan

Comments Submitted to IEEE ICVES, July, 2025

Journal ref Proc. 2025 IEEE International Conference on Vehicular Electronics and Safety (ICVES), pp. 387-392

详情
英文摘要

Pedestrian segmentation in automotive perception systems faces critical safety challenges due to metamerism in RGB imaging, where pedestrians and backgrounds appear visually indistinguishable.. This study investigates the potential of hyperspectral imaging (HSI) for enhanced pedestrian segmentation in urban driving scenarios using the Hyperspectral City v2 (H-City) dataset. We compared standard RGB against two dimensionality-reduction approaches by converting 128-channel HSI data into three-channel representations: Principal Component Analysis (PCA) and optimal band selection using Contrast Signal-to-Noise Ratio with Joint Mutual Information Maximization (CSNR-JMIM). Three semantic segmentation models were evaluated: U-Net, DeepLabV3+, and SegFormer. CSNR-JMIM consistently outperformed RGB with an average improvements of 1.44% in Intersection over Union (IoU) and 2.18% in F1-score for pedestrian segmentation. Rider segmentation showed similar gains with 1.43% IoU and 2.25% F1-score improvements. These improved performance results from enhanced spectral discrimination of optimally selected HSI bands effectively reducing false positives. This study demonstrates robust pedestrian segmentation through optimal HSI band selection, showing significant potential for safety-critical automotive applications.

2508.08500 2026-02-17 cs.AI

Large Language Models as Oracles for Ontology Alignment

Sviatoslav Lushnei, Dmytro Shumskyi, Severyn Shykula, Ernesto Jimenez-Ruiz, Artur d'Avila Garcez

Comments Paper accepted at the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), main conference. 21 pages

详情
英文摘要

There are many methods and systems to tackle the ontology alignment problem, yet a major challenge persists in producing high-quality mappings among a set of input ontologies. Adopting a human-in-the-loop approach during the alignment process has become essential in applications requiring very accurate mappings. However, user involvement is expensive when dealing with large ontologies. In this paper, we analyse the feasibility of using Large Language Models (LLM) to aid the ontology alignment problem. LLMs are used only in the validation of a subset of correspondences for which there is high uncertainty. We have conducted an extensive analysis over several tasks of the Ontology Alignment Evaluation Initiative (OAEI), reporting in this paper the performance of several state-of-the-art LLMs using different prompt templates. Using LLMs as Oracles resulted in strong performance in the OAEI 2025, achieving the top-2 overall rank in the bio-ml track.

2508.03346 2026-02-17 cs.AI

Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu

Comments Accepted by ICLR2026

详情
英文摘要

Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies \emph{the informational contribution of individual reasoning steps} to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80\% of low-entropy intermediate steps can be pruned with minor degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed COTs during inference by strategically incorporating [SKIP] tokens. Our method significantly improves LLM inference efficiency while preserving accuracy, paving the way for more scalable LLM deployments and a better understanding of their internal reasoning. The code and data are released in https://github.com/staymylove/COT_Compresstion_via_Step_entropy.