arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.06445 2026-02-09 cs.RO

ECO: Energy-Constrained Optimization with Reinforcement Learning for Humanoid Walking

Weidong Huang, Jingwen Zhang, Jiongye Li, Shibowen Zhang, Jiayang Wu, Jiayi Wang, Hangxin Liu, Yaodong Yang, Yao Su

Comments IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING. PREPRINT VERSION. ACCEPTED FEB, 2026

Achieving stable and energy-efficient locomotion is essential for humanoid robots to operate continuously in real-world applications. Existing MPC and RL approaches often rely on energy-related metrics embedded within a multi-objective optimization framework, which requires extensive hyperparameter tuning and often results in suboptimal policies. To address these challenges, we propose ECO (Energy-Constrained Optimization), a constrained RL framework that separates energy-related metrics from rewards, reformulating them as explicit inequality constraints. This method provides a clear and interpretable physical representation of energy costs, enabling more efficient and intuitive hyperparameter tuning for improved energy efficiency. ECO introduces dedicated constraints for energy consumption and reference motion, enforced by the Lagrangian method, to achieve stable, symmetric, and energy-efficient walking for humanoid robots. We evaluated ECO against MPC, standard RL with reward shaping, and four state-of-the-art constrained RL methods. Experiments, including sim-to-sim and sim-to-real transfers on the kid-sized humanoid robot BRUCE, demonstrate that ECO significantly reduces energy consumption compared to baselines while maintaining robust walking performance. These results highlight a substantial advancement in energy-efficient humanoid locomotion. All experimental demonstrations can be found on the project website: https://sites.google.com/view/eco-humanoid.
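
The Lagrangian enforcement mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single energy constraint of the form "average cost must not exceed a budget"; the function names, the budget, the step size, and the cost values are all hypothetical and are not taken from the paper.

```python
def update_multiplier(lam, avg_cost, budget, lr=0.05):
    """Dual ascent step: increase lambda when the constraint is violated,
    decrease it otherwise, and never let it go below zero."""
    return max(0.0, lam + lr * (avg_cost - budget))


def penalized_return(task_return, avg_cost, budget, lam):
    """Lagrangian objective a policy would maximize: task return minus the
    multiplier-weighted constraint violation."""
    return task_return - lam * (avg_cost - budget)


# Hypothetical per-iteration average energy costs measured during training.
lam = 0.0
for avg_cost in [12.0, 11.0, 10.5, 9.0, 8.0]:
    lam = update_multiplier(lam, avg_cost, budget=10.0)
```

In this kind of scheme the budget is stated in physical units, so it is the quantity one tunes, rather than a reward weight whose effect on energy use is only indirect.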

2602.06441 2026-02-09 cs.LG

Is Gradient Ascent Really Necessary? Memorize to Forget for Machine Unlearning

Zhuo Huang, Qizhou Wang, Ziming Hong, Shanshan Ye, Bo Han, Tongliang Liu

For ethical and safe AI, machine unlearning has emerged as a critical topic that aims to protect sensitive, private, and copyrighted knowledge from misuse. To achieve this goal, it is common to conduct gradient ascent (GA) to reverse the training on undesired data. However, such a reversal is prone to catastrophic collapse, which leads to serious performance degradation on general tasks. As a solution, we propose model extrapolation as an alternative to GA, which reaches the counterpart direction in the hypothesis space from one model given another reference model. Specifically, we take the original model as the reference and further train it to memorize the undesired data while keeping its predictions consistent on the remaining retained data, which yields a memorization model. Counterintuitive as it might sound, a forget model can then be obtained via extrapolation from the memorization model to the reference model. Hence, we avoid directly acquiring the forget model using GA and instead proceed with gradient descent for the memorization model, which successfully stabilizes the machine unlearning process. Our model extrapolation is simple and efficient to implement, and it converges effectively throughout training to achieve improved unlearning performance.
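
The extrapolation step can be pictured with a small sketch. It assumes that extrapolation is linear and performed directly on parameters, with a hypothetical strength `alpha`; the abstract does not say whether the authors extrapolate weights, logits, or something else, so this is only one plausible reading.

```python
def extrapolate(reference, memorization, alpha=1.0):
    """Move from the memorization model through the reference model and
    beyond it: forget = reference + alpha * (reference - memorization).

    Both models are plain dicts mapping parameter names to floats here;
    real models would hold tensors instead (assumption for illustration).
    """
    return {
        name: ref + alpha * (ref - memorization[name])
        for name, ref in reference.items()
    }


# Toy parameters, purely illustrative.
reference_model = {"w": 1.0, "b": 0.5}
memorization_model = {"w": 1.5, "b": 0.25}
forget_model = extrapolate(reference_model, memorization_model, alpha=1.0)
```

With `alpha=0` the sketch returns the reference model unchanged; larger values push further away from the memorization model.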

2602.06440 2026-02-09 cs.CL cs.AI cs.CR

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang

Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.

2602.06429 2026-02-09 cs.LG physics.geo-ph

Reclaiming First Principles: A Differentiable Framework for Conceptual Hydrologic Models

Jasper A. Vrugt, Jonathan M. Frame, Ethan Bollman

Comments 85 pages, 14 figures

Conceptual hydrologic models remain the cornerstone of rainfall-runoff modeling, yet their calibration is often slow and numerically fragile. Most gradient-based parameter estimation methods rely on finite-difference approximations or automatic differentiation frameworks (e.g., JAX, PyTorch and TensorFlow), which are computationally demanding and introduce truncation errors, solver instabilities, and substantial overhead. These limitations are particularly acute for the ODE systems of conceptual watershed models. Here we introduce a fully analytic and computationally efficient framework for differentiable hydrologic modeling based on exact parameter sensitivities. By augmenting the governing ODE system with sensitivity equations, we jointly evolve the model states and the Jacobian matrix with respect to all parameters. This Jacobian then provides fully analytic gradient vectors for any differentiable loss function. These include classical objective functions such as the sum of absolute and squared residuals, widely used hydrologic performance metrics such as the Nash-Sutcliffe and Kling-Gupta efficiencies, robust loss functions that down-weight extreme events, and hydrograph-based functionals such as flow-duration and recession curves. The analytic sensitivities eliminate the step-size dependence and noise inherent to numerical differentiation, while avoiding the instability of adjoint methods and the overhead of modern machine-learning autodiff toolchains. The resulting gradients are deterministic, physically interpretable, and straightforward to embed in gradient-based optimizers. Overall, this work enables rapid, stable, and transparent gradient-based calibration of conceptual hydrologic models, unlocking the full potential of differentiable modeling without reliance on external, opaque, or CPU-intensive automatic-differentiation libraries.
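
To make the idea of augmenting an ODE with sensitivity equations concrete, here is a minimal sketch for a single linear reservoir, dS/dt = P - k*S, with constant precipitation P. The model, the RK4 integrator, and all names are illustrative assumptions; the paper's own models and solver are not described in the abstract.

```python
import math


def rhs(state, precip, k):
    """Right-hand side of the augmented system (storage S, sensitivity dS/dk)."""
    storage, sens = state
    d_storage = precip - k * storage   # dS/dt = P - k*S
    d_sens = -storage - k * sens       # d/dt (dS/dk) = -S - k*(dS/dk)
    return (d_storage, d_sens)


def rk4_step(state, precip, k, dt):
    """One classical Runge-Kutta step for the augmented state."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = rhs(state, precip, k)
    k2 = rhs(shifted(state, k1, dt / 2), precip, k)
    k3 = rhs(shifted(state, k2, dt / 2), precip, k)
    k4 = rhs(shifted(state, k3, dt), precip, k)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(storage0, precip, k, t_end, n_steps=1000):
    """Jointly evolve S and dS/dk; the sensitivity starts at zero because
    the initial storage does not depend on k."""
    state = (storage0, 0.0)
    dt = t_end / n_steps
    for _ in range(n_steps):
        state = rk4_step(state, precip, k, dt)
    return state


def analytic(storage0, precip, k, t):
    """Closed-form S(t) and dS/dk for constant precip, used only as a check."""
    decay = math.exp(-k * t)
    eq = precip / k
    storage = eq + (storage0 - eq) * decay
    sens = (precip / k**2) * (decay - 1.0) - t * (storage0 - eq) * decay
    return storage, sens


numeric = simulate(10.0, 2.0, 0.5, 3.0)
exact = analytic(10.0, 2.0, 0.5, 3.0)
```

Given the sensitivity, the gradient of a squared-error loss follows from the chain rule: each residual (simulated minus observed storage) is multiplied by 2 and by dS/dk at that time, and the products are summed.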

2602.06427 2026-02-09 cs.CV cs.RO

Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters

Yuxiang Zhao, Yirong Yang, Yanqing Zhu, Yanfen Shen, Chiyu Wang, Zhining Gu, Pei Shi, Wei Guo, Mu Xu

Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.

2602.06426 2026-02-09 cs.LG

Beyond Code Contributions: How Network Position, Temporal Bursts, and Code Review Activities Shape Contributor Influence in Large-Scale Open Source Ecosystems

S M Rakib Ul Karim, Wenyi Lu, Sean Goggins

Open source software (OSS) projects rely on complex networks of contributors whose interactions drive innovation and sustainability. This study presents a comprehensive analysis of OSS contributor networks using advanced graph neural networks and temporal network analysis on data spanning 25 years from the Cloud Native Computing Foundation ecosystem, encompassing sandbox, incubating, and graduated projects. Our analysis of thousands of contributors across hundreds of repositories reveals that OSS networks exhibit strong power-law distributions in influence, with the top 1% of contributors controlling a substantial portion of network influence. Using GPU-accelerated PageRank, betweenness centrality, and custom LSTM models, we identify five distinct contributor roles: Core, Bridge, Connector, Regular, and Peripheral, each with unique network positions and structural importance. Statistical analysis reveals significant correlations between specific action types (commits, pull requests, issues) and contributor influence, with multiple regression models explaining substantial variance in influence metrics. Temporal analysis shows that network density, clustering coefficients, and modularity exhibit statistically significant temporal trends, with distinct regime changes coinciding with major project milestones. Structural integrity simulations show that Bridge contributors, despite representing a small fraction of the network, have a disproportionate impact on network cohesion when removed. Our findings provide empirical evidence for strategic contributor retention policies and offer actionable insights into community health metrics.

2602.06425 2026-02-09 cs.CV

POPL-KF: A Pose-Only Geometric Representation-Based Kalman Filter for Point-Line-Based Visual-Inertial Odometry

Aiping Wang, Zhaolong Yang, Shuwen Chen, Hai Zhang

Mainstream visual-inertial odometry (VIO) systems rely on point features for motion estimation and localization. However, their performance degrades in challenging scenarios. Moreover, the localization accuracy of multi-state constraint Kalman filter (MSCKF)-based VIO systems suffers from linearization errors associated with feature 3D coordinates and delayed measurement updates. To improve the performance of VIO in challenging scenes, we first propose a pose-only geometric representation for line features. Building on this, we develop POPL-KF, a Kalman filter-based VIO system that employs a pose-only geometric representation for both point and line features. POPL-KF mitigates linearization errors by explicitly eliminating both point and line feature coordinates from the measurement equations, while enabling immediate update of visual measurements. We also design a unified base-frames selection algorithm for both point and line features to ensure optimal constraints on camera poses within the pose-only measurement model. To further improve line feature quality, a line feature filter based on image grid segmentation and bidirectional optical flow consistency is proposed. Our system is evaluated on public datasets and real-world experiments, demonstrating that POPL-KF outperforms state-of-the-art (SOTA) filter-based methods (OpenVINS, PO-KF) and optimization-based methods (PL-VINS, EPLF-VINS), while maintaining real-time performance.

2602.06423 2026-02-09 cs.CL

On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

Wenbo Shang, Yuxi Sun, Jing Ma, Xin Huang

Comments Paper accepted as a conference paper at ICLR 2026

Humor is a common yet intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs). It is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, creative imagination, and so on. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, the General Theory of Verbal Humor (GTVH). To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) a conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) a retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) a caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.

2602.06422 2026-02-09 cs.CV

Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Yunze Tong, Mushui Liu, Canyu Zhao, Wanggui He, Shiyi Zhang, Hongwei Zhang, Peng Zhang, Jinlong Liu, Ju Huang, Jiamang Wang, Hao Jiang, Pipei Huang

Comments 18 pages, in submission

Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points, i.e., steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend, and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.
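
The sign-change detection described in the abstract can be sketched as follows. The exact definitions are assumptions made for illustration: incremental rewards are taken as differences between consecutive intermediate reward values, and a turning point is a step whose increment flips sign relative to the previous step and agrees with the sign of the overall change. The authors' actual rules may differ; their repository is the authoritative source.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def incremental_rewards(values):
    """Step-level increments between consecutive intermediate reward values."""
    return [after - before for before, after in zip(values, values[1:])]


def turning_points(values):
    """Indices into the increment list where the local trend flips and the
    new direction matches the overall trajectory trend (assumed definition)."""
    deltas = incremental_rewards(values)
    overall = sign(values[-1] - values[0])
    points = []
    for t in range(1, len(deltas)):
        flipped = sign(deltas[t]) != sign(deltas[t - 1])
        if overall != 0 and flipped and sign(deltas[t]) == overall:
            points.append(t)
    return points


# Hypothetical reward values of the intermediate states along one trajectory.
trajectory_values = [2, 1, 3, 5, 4, 7]
deltas = incremental_rewards(trajectory_values)
points = turning_points(trajectory_values)
```

Because the check uses only signs, it needs no threshold, which is in line with the abstract's description of the method as hyperparameter-free.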

2602.06419 2026-02-09 cs.CV

Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors

Soham Pahari, Sandeep C. Kumain

Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.

2602.06418 2026-02-09 cs.LG q-bio.BM

Adaptive Protein Tokenization

Rohit Dilip, Ayush Varshney, David Van Valen

Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.

2602.06413 2026-02-09 cs.AI

Intrinsic Stability Limits of Autoregressive Reasoning: Structural Consequences for Long-Horizon Execution

Hsien-Jyh Liao

Comments 16 pages, 7 figures, Keywords: Autoregressive Reasoning, Long-Horizon Stability, Chain-of-Thought Reasoning, Information-Theoretic Analysis, Structured Reasoning, Inference Dynamics

Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their performance often deteriorates sharply in long-horizon tasks, exhibiting systematic breakdown beyond certain scales. Conventional explanations primarily attribute this phenomenon to task complexity, such as combinatorial search explosion or long-term credit assignment challenges. In this work, we argue that these explanations are incomplete: even in linear, unbranched tasks without semantic ambiguity, autoregressive execution is subject to an intrinsic stability limit. We propose that the fundamental constraint on long-horizon reasoning arises from process-level instability in autoregressive generation rather than solely from search or task complexity, reframing long-horizon reasoning as a problem of structural governance. We derive Theorem A, showing that decision advantage in single-path autoregressive reasoning decays exponentially with execution length, imposing a fundamental bound on maintainable reasoning chains. This result implies a structural consequence: stable long-horizon reasoning requires discrete segmentation, naturally inducing graph-like execution structures such as directed acyclic graphs (DAGs). Empirical studies in both synthetic environments and real TextWorld tasks reveal observable performance cliffs consistent with theoretical predictions. Our findings provide a dynamical perspective on long-horizon reasoning failure and suggest new limitations on maintaining long-term coherence under purely autoregressive architectures. Furthermore, we highlight that short-horizon evaluation protocols may obscure structural instability, indicating a potential shift from scaling toward structured governance in future reasoning systems.

2602.06406 2026-02-09 cs.CV

Point Virtual Transformer

Veerain Sood, Bnalin, Gaurav Pandey

Comments 8 pages, 4 figures

LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.

2602.06405 2026-02-09 cs.CV cs.NE

A neuromorphic model of the insect visual system for natural image processing

Adam D. Hines, Karin Nordström, Andrew B. Barron

Comments 21 pages, 7 figures, under review

Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for understanding biological visual processing. However, many contemporary models prioritize task performance while neglecting biologically grounded processing pathways. Here, we introduce a bio-inspired vision model that captures principles of the insect visual system to transform dense visual input into sparse, discriminative codes. The model is trained using a fully self-supervised contrastive objective, enabling representation learning without labeled data and supporting reuse across tasks without reliance on domain-specific classifiers. We evaluated the resulting representations on flower recognition tasks and natural image benchmarks. The model consistently produced reliable sparse codes that distinguish visually similar inputs. To support different modelling and deployment uses, we have implemented the model as both an artificial neural network and a spiking neural network. In a simulated localization setting, our approach outperformed a simple image downsampling comparison baseline, highlighting the functional benefit of incorporating neuromorphic visual processing pathways. Collectively, these results advance insect computational modelling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks.

2602.06402 2026-02-09 cs.CV

MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi, Lin Huang, Kaihe Xu, Hong Li

Comments 20 pages, 8 figures. Technical report

Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.

2602.06394 2026-02-09 cs.AI cs.CE q-bio.GN q-fin.CP

Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

Arvid E. Gollwitzer, Paridhi Latawa, David de Gruijl, Deepak A. Subramanian, Adrián Noriega de la Colina

Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements in genomics (a 6.7 percentage point F1 gain in variant calling over BPE) and finance (a 30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.

2602.06391 2026-02-09 cs.CV

POINTS-GUI-G: GUI-Grounding Journey

Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.

2602.06390 2026-02-09 cs.LG cs.AI

Generating High-quality Privacy-preserving Synthetic Data

David Yavo, Richard Khoury, Christophe Pere, Sadoune Ait Kaci Azzou

Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. First, a mode patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a k-nearest-neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income. We assess marginal and joint distributional similarity, the performance of models trained on synthetic data and evaluated on real data, and several empirical privacy indicators, including nearest-neighbor distances and attribute inference attacks. With moderate thresholds between 0.2 and 0.35, the post-processing reduces divergence between real and synthetic categorical distributions by up to 36 percent and improves a combined measure of pairwise dependence preservation by 10 to 14 percent, while keeping downstream predictive performance within about 1 percent of the unprocessed baseline. At the same time, distance-based privacy indicators improve and the success rate of attribute inference attacks remains largely unchanged. These results provide practical guidance for selecting thresholds and applying post hoc repairs to improve the quality and empirical privacy of synthetic tabular data, while complementing approaches that provide formal differential privacy guarantees.
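
The minimum-distance filter can be sketched in a few lines. The sketch assumes numeric records compared by Euclidean distance to their single nearest real neighbor, and it only separates records that are too close from those that can stay; how the flagged records are replaced is not stated in the abstract, so that step is left out. All names, data, and the threshold value are hypothetical.

```python
import math


def nearest_real_distance(record, real_records):
    """Euclidean distance from one synthetic record to its closest real record."""
    return min(math.dist(record, real) for real in real_records)


def split_by_min_distance(synthetic, real_records, min_dist):
    """Separate synthetic records into those that respect the minimum
    distance to real data and those that violate it."""
    kept, too_close = [], []
    for record in synthetic:
        if nearest_real_distance(record, real_records) < min_dist:
            too_close.append(record)
        else:
            kept.append(record)
    return kept, too_close


# Toy two-dimensional data, purely illustrative.
real_data = [(0.0, 0.0), (1.0, 1.0)]
synthetic_data = [(0.1, 0.0), (0.5, 0.5), (1.0, 1.3)]
kept, too_close = split_by_min_distance(synthetic_data, real_data, min_dist=0.25)
```

A production version would normalize features first and use an indexed neighbor search, since the brute-force scan here is quadratic in the number of records.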

2602.06385 2026-02-09 cs.LG

Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun

Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. They often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
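
For orientation, the orthogonalized update commonly associated with spectral gradient descent can be written as below. The notation (loss $L$, step size $\eta$, LoRA factors $B$ and $A$, frozen weight $W_0$) is assumed here for illustration, since the abstract gives no formulas, and the paper's exact setting may differ.

```latex
% Orthogonalized step on a single matrix parameter W:
G_t = \nabla_W L(W_t) = U_t \Sigma_t V_t^\top \quad \text{(reduced SVD)}, \qquad
W_{t+1} = W_t - \eta \, U_t V_t^\top .

% LoRA-style parameterization, with the same kind of step applied to each
% factor separately (gradients G^B_t and G^A_t have their own SVDs):
W = W_0 + BA, \qquad
B_{t+1} = B_t - \eta \, U^B_t (V^B_t)^\top, \qquad
A_{t+1} = A_t - \eta \, U^A_t (V^A_t)^\top .
```

The update discards the singular values of the gradient, so every singular direction is moved by the same amount per step, which gives some intuition for why near-uniform spectral growth might arise.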

2602.06384 2026-02-09 cs.CL

FMBench: Adaptive Large Language Model Output Formatting

Yaoting Wang, Yun Zhou, Henghui Ding

Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.

2602.06380 2026-02-09 cs.RO cs.SY eess.SY

A Consistency-Improved LiDAR-Inertial Bundle Adjustment

Xinran Li, Shuaikang Zheng, Pengcheng Zheng, Xinyang Wang, Jiacheng Li, Zhitian Li, Xudong Zou

Details
English abstract

Simultaneous Localization and Mapping (SLAM) using 3D LiDAR has emerged as a cornerstone for autonomous navigation in robotics. While feature-based SLAM systems have achieved impressive results by leveraging edge and planar structures, they often suffer from estimator inconsistency associated with feature parameterization and estimated covariance. In this work, we present a consistency-improved LiDAR-inertial bundle adjustment (BA) with a tailored parameterization and estimator. First, we propose a stereographic-projection representation for parameterizing planar and edge features, and conduct a comprehensive observability analysis to support its integration with a consistent estimator. Second, we implement a LiDAR-inertial BA with a Maximum a Posteriori (MAP) formulation and First-Estimate Jacobians (FEJ) to preserve accurate covariance estimates and the observability properties of the system. Last, we apply our proposed BA method to LiDAR-inertial odometry.
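As background on the parameterization idea, the sketch below shows the textbook stereographic projection of a unit direction (for example, a plane normal) onto two unconstrained coordinates, together with its inverse. It assumes projection from the pole (0, 0, 1); the function names are hypothetical, and the paper's actual feature representation may differ in its details.

```python
import math


def to_plane(n):
    """Map a unit vector (x, y, z) to 2D coordinates by projecting from the pole (0, 0, 1)."""
    x, y, z = n
    if math.isclose(z, 1.0):
        raise ValueError("the projection pole (0, 0, 1) has no finite image")
    return (x / (1.0 - z), y / (1.0 - z))


def to_sphere(p):
    """Inverse map: recover the unit vector from its 2D stereographic coordinates."""
    X, Y = p
    s = X * X + Y * Y
    return (2.0 * X / (1.0 + s), 2.0 * Y / (1.0 + s), (s - 1.0) / (1.0 + s))


normal = (0.6, 0.0, -0.8)      # an arbitrary unit vector used only for illustration
coords = to_plane(normal)      # two free parameters, no unit-norm constraint to enforce
recovered = to_sphere(coords)  # always lands back on the unit sphere
print(coords, recovered)
```

The appeal of such a map is that a direction with three components and one constraint becomes two free numbers, so an optimizer can update it without re-normalizing.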

2602.06375 2026-02-09 cs.AI

Difficulty-Estimated Policy Optimization

Yu Zhao, Fan Jiang, Tianle Liu, Bo Zeng, Yu Liu, Longyue Wang, Weihua Luo

Details
English abstract

Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when encountering problems that are either too trivial or overly complex. In these scenarios, the disappearance of inter-group advantages makes the gradient signal susceptible to noise, thereby jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a 2x reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.
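The signal attenuation described above can be seen in a minimal sketch of a group-relative advantage, where each rollout's reward is standardized against the other rollouts for the same prompt. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; actual GRPO implementations vary in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and spread of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt that is too easy (all rollouts succeed) or too hard (all fail)
# produces identical rewards, so every advantage collapses to zero.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A prompt of intermediate difficulty yields a usable learning signal.
informative = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

print(too_easy, too_hard, informative)
```

Rollouts spent on the first two kinds of prompt contribute no gradient, which is the waste that filtering by estimated difficulty before the rollout phase aims to avoid.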

2602.06373 2026-02-09 cs.CL

ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis

Tianqiang Yan, Sihan Shang, Yuheng Li, Song Qiu, Hao Peng, Wenjian Luo, Jue Xie, Lizhen Qu, Yuan Gao

Comments 17 pages, 3 figures

Details
English abstract

While self-reflection can enhance language model reliability, its underlying mechanisms remain opaque, with existing analyses often yielding correlation-based insights that fail to generalize. To address this, we introduce ReBeCA (self-Reflection Behavior explained through Causal Analysis), a framework that unveils the interpretable behavioral hierarchy governing the self-reflection outcome. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates genuine determinants of performance through a three-stage Invariant Causal Prediction (ICP) pipeline. We establish three critical findings: (1) Behavioral hierarchy: Semantic behaviors of the model influence final self-reflection results hierarchically: directly or indirectly; (2) Causation matters: Generalizability in self-reflection effects is limited to just a few semantic behaviors; (3) More $\neq$ better: The confluence of seemingly positive semantic behaviors, even among direct causal factors, can impair the efficacy of self-reflection. ICP-based verification identifies sparse causal parents achieving up to 49.6% structural likelihood gains, stable across tasks where correlation-based patterns fail. Intervention studies on novel datasets confirm these causal relationships hold out-of-distribution ($p = .013$, $\eta^2_\mathrm{p} = .071$). ReBeCA thus provides a rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics.

2602.06370 2026-02-09 cs.CL

Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production

Alberto Andres Valdes Gonzalez

Comments 26 pages, 12 figures. Empirical benchmark comparing fine-tuned encoders and LLM prompting for text classification under cost and latency constraints

Details
English abstract

Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
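The multi-objective framing above can be illustrated with a minimal sketch of Pareto filtering over quality, latency, and cost, followed by a weighted utility. The candidate names and numbers are made-up placeholders, not results from the paper, and the linear utility is an assumption; the abstract does not specify the form of the paper's parameterized utility function.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    f1: float          # macro F1, higher is better
    latency_ms: float  # per-request latency, lower is better
    cost_usd: float    # cost per 1k requests, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every objective and strictly better on one."""
    no_worse = a.f1 >= b.f1 and a.latency_ms <= b.latency_ms and a.cost_usd <= b.cost_usd
    better = a.f1 > b.f1 or a.latency_ms < b.latency_ms or a.cost_usd < b.cost_usd
    return no_worse and better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def utility(c: Candidate, w_latency: float, w_cost: float) -> float:
    """Assumed linear utility: quality minus weighted latency and cost penalties."""
    return c.f1 - w_latency * c.latency_ms - w_cost * c.cost_usd


# Placeholder values for illustration only.
cands = [
    Candidate("model_a", f1=0.94, latency_ms=10.0, cost_usd=0.01),
    Candidate("model_b", f1=0.92, latency_ms=800.0, cost_usd=1.0),
    Candidate("model_c", f1=0.95, latency_ms=1200.0, cost_usd=2.0),
]
front = pareto_front(cands)
cost_sensitive = max(front, key=lambda c: utility(c, w_latency=1e-4, w_cost=1e-2))
quality_only = max(front, key=lambda c: utility(c, w_latency=0.0, w_cost=0.0))
print([c.name for c in front], cost_sensitive.name, quality_only.name)
```

Varying the two weights is one way to represent different deployment regimes: a latency-critical service and an offline batch job can pick different points on the same frontier.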

2602.06369 2026-02-09 cs.CV cs.AI

Revisiting Salient Object Detection from an Observer-Centric Perspective

Fuxi Zhang, Yifan Wang, Hengrun Zhao, Zhuohan Sun, Changxing Xia, Lijun Wang, Huchuan Lu, Yangrui Shao, Chen Yang, Long Teng

Details
English abstract

Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single ground-truth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also observer-specific factors such as preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset named OC-SODBench, comprising 33k training, validation and test images with 152k textual prompts and object pairs. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline which performs OC-SOD via a human-like "Perceive-Reflect-Adjust" process. Extensive experiments on our proposed OC-SODBench demonstrate the effectiveness of our contribution. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly "salient." Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD

2602.06366 2026-02-09 cs.RO

Towards Adaptive Environment Generation for Training Embodied Agents

Teresa Yeo, Dulaj Weerakoon, Dulanga Weerakoon, Archan Misra

Comments Accepted to AAAI-26 Bridge Program B10: Making Embodied AI Reliable with Testing and Formal Verification

Details
English abstract

Embodied agents struggle to generalize to new environments, even when those environments share similar underlying structures to their training settings. Most current approaches to generating these training environments follow an open-loop paradigm, without considering the agent's current performance. While procedural generation methods can produce diverse scenes, diversity without feedback from the agent is inefficient. The generated environments may be trivially easy, providing limited learning signal. To address this, we present a proof-of-concept for closed-loop environment generation that adapts difficulty to the agent's current capabilities. Our system employs a controllable environment representation, extracts fine-grained performance feedback beyond binary success or failure, and implements a closed-loop adaptation mechanism that translates this feedback into environment modifications. This feedback-driven approach generates training environments that are more challenging in the ways the agent needs to improve, enabling more efficient learning and better generalization to novel settings.

2602.06363 2026-02-09 cs.CV

Robust Pedestrian Detection with Uncertain Modality

Qian Bie, Xiao Wang, Bin Yang, Zhixi Yu, Jun Chen, Xin Xu

Comments Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file

Details
English abstract

Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24-hour surveillance systems. RGB captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person's silhouette, neglecting critical texture details essential for detection. In contrast, near-infrared (NIR) captures texture under low-light conditions, which alleviates the performance issues of RGB and the detail loss in TIR, thereby reducing missed detections. To this end, we construct a new Triplet RGB-NIR-TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research. However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenge existing CMPD methods that fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation. To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement to ensure the reliability of information within the modality. Furthermore, we design a Modality-Aware Interaction (MAI) module to adaptively activate or deactivate its internal interaction mechanisms per UMVR output, enabling effective complementary information fusion from available modalities.

2602.06359 2026-02-09 cs.LG cs.AI

Training Data Selection with Gradient Orthogonality for Efficient Domain Adaptation

Xiyang Zhang, Yuanhe Tian, Hongzhi Wang, Yan Song

Details
English abstract

Fine-tuning large language models (LLMs) for specialized domains often necessitates a trade-off between acquiring domain expertise and retaining general reasoning capabilities, a phenomenon known as catastrophic forgetting. Existing remedies face a dichotomy: gradient surgery methods offer geometric safety but incur prohibitive computational costs via online projections, while efficient data selection approaches reduce overhead but remain blind to conflict-inducing gradient directions. In this paper, we propose Orthogonal Gradient Selection (OGS), a data-centric method that harmonizes domain performance, general capability retention, and training efficiency. OGS shifts the geometric insights of gradient projection from the optimizer to the data selection stage by treating data selection as a constrained decision-making process. By leveraging a lightweight Navigator model and reinforcement learning techniques, OGS dynamically identifies training samples whose gradients are orthogonal to a general-knowledge anchor. This approach ensures naturally safe updates for target models without modifying the optimizer or incurring runtime projection costs. Experiments across medical, legal, and financial domains demonstrate that OGS achieves excellent results, significantly improving domain performance and training efficiency while maintaining or even enhancing performance on general tasks such as GSM8K.
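The geometric criterion behind selecting samples by gradient orthogonality can be sketched as a cosine-similarity filter against an anchor gradient. This is only an illustration of the criterion under simplifying assumptions (explicit per-sample gradients as plain lists, a hypothetical tolerance `tol`); according to the abstract, OGS itself relies on a lightweight Navigator model and reinforcement learning rather than a direct filter like this one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_near_orthogonal(sample_grads, anchor_grad, tol=0.1):
    """Return indices of samples whose gradient is nearly orthogonal to the anchor."""
    return [
        i for i, g in enumerate(sample_grads)
        if abs(cosine(g, anchor_grad)) <= tol
    ]


# Toy 2D gradients, chosen only to show the behavior of the filter.
anchor = [1.0, 0.0]        # stands in for a general-knowledge gradient direction
grads = [
    [0.0, 1.0],            # orthogonal to the anchor: kept
    [1.0, 1.0],            # partially aligned: dropped
    [-1.0, 0.05],          # strongly conflicting: dropped
    [0.05, 1.0],           # nearly orthogonal: kept
]
selected = select_near_orthogonal(grads, anchor)
print(selected)
```

To first order, an update built from near-orthogonal gradients leaves the loss along the anchor direction unchanged, which is the intuition for why such samples are less likely to erode general capabilities.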

2602.06356 2026-02-09 cs.RO cs.SY eess.SY

Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation

Gang He, Zhenyang Liu, Kepeng Xu, Li Xu, Tong Qiao, Wenxin Yu, Chang Wu, Weiying Xie

Details
English abstract

Vision-Language Navigation (VLN) requires embodied agents to interpret natural language instructions and navigate through complex continuous 3D environments. However, the dominant imitation learning paradigm suffers from exposure bias, where minor deviations during inference lead to compounding errors. While DAgger-style approaches attempt to mitigate this by correcting error states, we identify a critical limitation: Instruction-State Misalignment. Forcing an agent to learn recovery actions from off-track states often creates supervision signals that semantically conflict with the original instruction. In response to these challenges, we introduce BudVLN, an online framework that learns from on-policy rollouts by constructing supervision to match the current state distribution. BudVLN performs retrospective rectification via counterfactual re-anchoring and decision-conditioned supervision synthesis, using a geodesic oracle to synthesize corrective trajectories that originate from valid historical states, ensuring semantic consistency. Experiments on the standard R2R-CE and RxR-CE benchmarks demonstrate that BudVLN consistently mitigates distribution shift and achieves state-of-the-art performance in both Success Rate and SPL.

2602.06353 2026-02-09 cs.LG

Enhance and Reuse: A Dual-Mechanism Approach to Boost Deep Forest for Label Distribution Learning

Jia-Le Xu, Shen-Huan Lyu, Yu-Nian Wang, Ning Chen, Zhihao Qu, Bin Tang, Baoliu Ye

Details
English abstract

Label distribution learning (LDL) requires the learner to predict the degree of correlation between each sample and each label. To achieve this, a crucial task during learning is to leverage the correlation among labels. Deep Forest (DF) is a deep learning framework based on tree ensembles, whose training phase does not rely on backpropagation. DF performs in-model feature transformation using the predictions of each layer and achieves competitive performance on many tasks. However, its exploration in the field of LDL is still in its infancy. The few existing methods that apply DF to LDL do not have effective ways to utilize the correlation among labels. Therefore, we propose a method named Enhanced and Reused Feature Deep Forest (ERDF). It mainly contains two mechanisms: feature enhancement exploiting label correlation and measure-aware feature reuse. The first utilizes the correlation among labels to enhance the original features, enabling the samples to acquire more comprehensive information for the task of LDL. The second performs a reuse operation on the features of samples that perform worse than the previous layer on the validation set, in order to ensure the stability of the training process. This Enhance-Reuse pattern not only enables samples to enrich their features but also validates the effectiveness of their new features and conducts a reuse process to prevent the noise from spreading further. Experiments show that our method outperforms the compared algorithms on six evaluation metrics.