arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1795
专题追踪
2602.02545 2026-02-04 cs.LG cs.AI

Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization

Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li

详情
英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). However, recent studies question whether RL genuinely expands reasoning capacity or merely aligns existing latent capabilities, arguing that exploration remains confined within the pre-trained model's low-rank bias manifold. In this work, we challenge this accessibility boundary hypothesis by demonstrating that the latent reasoning space can be fundamentally expanded through targeted geometric interventions. We propose Manifold-Reshaping Policy Optimization (MRPO), a geometric framework designed to fundamentally restructure the inference space of LLMs. MRPO operates in two stages: first, we employ Spectral Orthogonal Exploration (SOE) to eject the policy initialization into the null space of the bias manifold; second, we integrate an Effective Rank regularization term into the policy optimization objective. This approach incentivizes the discovery and maintenance of high-dimensional reasoning trajectories against the entropy-reducing tendency of standard RL. Empirically, our 4B-parameter method achieves state-of-the-art performance on mathematical tasks, significantly outperforming larger models (e.g., Qwen3-32B) and expanding the capability boundary beyond standard GRPO. Our code is available at https://anonymous.4open.science/r/MRPO-D57B/

2602.02542 2026-02-04 cs.LG cs.AI eess.SP

Auto-Augmentation Contrastive Learning for Wearable-based Human Activity Recognition

Qingyu Wu, Jianfei Shen, Feiyi Fan, Yang Gu, Chenyang Xu, Yiqiang Chen

详情
英文摘要

For low-semantic sensor signals from human activity recognition (HAR), contrastive learning (CL) is essential to implement novel applications or generic models without manual annotation, which is a high-performance self-supervised learning (SSL) method. However, CL relies heavily on data augmentation for pairwise comparisons. Especially for low semantic data in the HAR area, conducting good performance augmentation strategies in pretext tasks still rely on manual attempts lacking generalizability and flexibility. To reduce the augmentation burden, we propose an end-to-end auto-augmentation contrastive learning (AutoCL) method for wearable-based HAR. AutoCL is based on a Siamese network architecture that shares the parameters of the backbone and with a generator embedded to learn auto-augmentation. AutoCL trains the generator based on the representation in the latent space to overcome the disturbances caused by noise and redundant information in raw sensor data. The architecture empirical study indicates the effectiveness of this design. Furthermore, we propose a stop-gradient design and correlation reduction strategy in AutoCL to enhance encoder representation learning. Extensive experiments based on four wide-used HAR datasets demonstrate that the proposed AutoCL method significantly improves recognition accuracy compared with other SOTA methods.

2602.02539 2026-02-04 cs.LG cs.CV

How Much Information Can a Vision Token Hold? A Scaling Law for Recognition Limits in VLMs

Shuxin Zhuang, Zi Liang, Runsheng Yu, Hongzong Li, Rong Feng, Shiqin Tang, Youzhi Zhang

详情
英文摘要

Recent vision-centric approaches have made significant strides in long-context modeling. Represented by DeepSeek-OCR, these models encode rendered text into continuous vision tokens, achieving high compression rates without sacrificing recognition precision. However, viewing the vision encoder as a lossy channel with finite representational capacity raises a fundamental question: what is the information upper bound of visual tokens? To investigate this limit, we conduct controlled stress tests by progressively increasing the information quantity (character count) within an image. We observe a distinct phase-transition phenomenon characterized by three regimes: a near-perfect Stable Phase, an Instability Phase marked by increased error variance, and a total Collapse Phase. We analyze the mechanical origins of these transitions and identify key factors. Furthermore, we formulate a probabilistic scaling law that unifies average vision token load and visual density into a latent difficulty metric. Extensive experiments across various Vision-Language Models demonstrate the universality of this scaling law, providing critical empirical guidance for optimizing the efficiency-accuracy trade-off in visual context compression.

2602.02538 2026-02-04 cs.LG cs.CL cs.CV

Enhancing Post-Training Quantization via Future Activation Awareness

Zheqi Lv, Zhenxuan Fan, Qi Tian, Wenqiao Zhang, Yueting Zhuang

详情
英文摘要

Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.

2602.02537 2026-02-04 cs.CV cs.LG

WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

Runjie Zhou, Youbo Shao, Haoyu Lu, Bowei Xing, Tongtong Bai, Yujie Chen, Jie Zhao, Lin Sui, Haotian Yao, Zijia Zhao, Hao Yang, Haoning Wu, Zaida Zhou, Jinguo Zhu, Zhiqi Huang, Yiping Bao, Yangyang Liu, Y. Charles, Xinyu Zhou

详情
英文摘要

We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.

2602.02536 2026-02-04 cs.LG cs.AI cs.CL cs.CV

From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation

Tianle Gu, Kexin Huang, Lingyu Li, Ruilin Luo, Shiyang Huang, Zongqi Wang, Yujiu Yang, Yan Teng, Yingchun Wang

详情
英文摘要

Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi-head scalar reward model (UniRM). UniRM provides multi-dimensional supervision by assigning attribute-level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40\% of the training data used by leading baselines. Ablations further validate our multi-attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \href{https://trustworthylab.github.io/UniMod/}{project website}.

2602.02535 2026-02-04 cs.LG cs.AI

Enhancing Psychologists' Understanding through Explainable Deep Learning Framework for ADHD Diagnosis

Abdul Rehman, Ilona Heldal, Jerry Chun-Wei Lin

Journal ref Expert Systems, Wiley, 2024

详情
英文摘要

Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental disorder that is challenging to diagnose and requires advanced approaches for reliable and transparent identification and classification. It is characterized by a pattern of inattention, hyperactivity and impulsivity that is more severe and more frequent than in individuals with a comparable level of development. In this paper, an explainable framework based on a fine-tuned hybrid Deep Neural Network (DNN) and Recurrent Neural Network (RNN) called HyExDNN-RNN model is proposed for ADHD detection, multi-class categorization, and decision interpretation. This framework not only detects ADHD, but also provides interpretable insights into the diagnostic process so that psychologists can better understand and trust the results of the diagnosis. We use the Pearson correlation coefficient for optimal feature selection and machine and deep learning models for experimental analysis and comparison. We use a standardized technique for feature reduction, model selection and interpretation to accurately determine the diagnosis rate and ensure the interpretability of the proposed framework. Our framework provided excellent results on binary classification, with HyExDNN-RNN achieving an F1 score of 99% and 94.2% on multi-class categorization. XAI approaches, in particular SHapley Additive exPlanations (SHAP) and Permutation Feature Importance (PFI), provided important insights into the importance of features and the decision logic of models. By combining AI with human expertise, we aim to bridge the gap between advanced computational techniques and practical psychological applications. These results demonstrate the potential of our framework to assist in ADHD diagnosis and interpretation.

2602.02533 2026-02-04 cs.RO cs.LG

HMVLA: Hyperbolic Multimodal Fusion for Vision-Language-Action Models

Kun Wang, Xiao Feng, Mingcheng Qu, Tonghua Su

Comments 5 pages,5 figures,ICASSP

详情
英文摘要

Vision Language Action (VLA) models have recently shown great potential in bridging multimodal perception with robotic control. However, existing methods often rely on direct fine-tuning of pre-trained Vision-Language Models (VLMs), feeding semantic and visual features directly into a policy network without fully addressing the unique semantic alignment challenges in the VLA domain. In this paper, we propose HMVLA, a novel VLA framework that exploits the inherent hierarchical structures in vision and language for comprehensive semantic alignment. Unlike traditional methods that perform alignment in Euclidean space, our HMVLA embeds multimodal features in hyperbolic space, enabling more effective modeling of the hierarchical relationships present in image text data. Furthermore, we introduce a sparsely gated Mixture of Experts (MoE) mechanism tailored for semantic alignment, which enhances multimodal comprehension between images and text while improving efficiency. Extensive experiments demonstrate that HMVLA surpasses baseline methods in both accuracy and generalization. In addition, we validate its robustness by reconstructing datasets to further test cross domain adaptability.

2602.02532 2026-02-04 cs.LG cs.AI

CADENT: Gated Hybrid Distillation for Sample-Efficient Transfer in Reinforcement Learning

Mahyar Alinejad, Yue Wang, George Atia

详情
英文摘要

Transfer learning promises to reduce the high sample complexity of deep reinforcement learning (RL), yet existing methods struggle with domain shift between source and target environments. Policy distillation provides powerful tactical guidance but fails to transfer long-term strategic knowledge, while automaton-based methods capture task structure but lack fine-grained action guidance. This paper introduces Context-Aware Distillation with Experience-gated Transfer (CADENT), a framework that unifies strategic automaton-based knowledge with tactical policy-level knowledge into a coherent guidance signal. CADENT's key innovation is an experience-gated trust mechanism that dynamically weighs teacher guidance against the student's own experience at the state-action level, enabling graceful adaptation to target domain specifics. Across challenging environments, from sparse-reward grid worlds to continuous control tasks, CADENT achieves 40-60\% better sample efficiency than baselines while maintaining superior asymptotic performance, establishing a robust approach for adaptive knowledge transfer in RL.

2602.02531 2026-02-04 cs.LG physics.flu-dyn

Hypersonic Flow Control: Generalized Deep Reinforcement Learning for Hypersonic Intake Unstart Control under Uncertainty

Trishit Mondal, Ameya D. Jagtap

Comments 34 Pages, 23 Figures

详情
英文摘要

The hypersonic unstart phenomenon poses a major challenge to reliable air-breathing propulsion at Mach 5 and above, where strong shock-boundary-layer interactions and rapid pressure fluctuations can destabilize inlet operation. Here, we demonstrate a deep reinforcement learning (DRL)- based active flow control strategy to control unstart in a canonical two-dimensional hypersonic inlet at Mach 5 and Reynolds number $5\times 10^6$. The in-house CFD solver enables high-fidelity simulations with adaptive mesh refinement, resolving key flow features, including shock motion, boundary-layer dynamics, and flow separation, that are essential for learning physically consistent control policies suitable for real-time deployment. The DRL controller robustly stabilizes the inlet over a wide range of back pressures representative of varying combustion chamber conditions. It further generalizes to previously unseen scenarios, including different back-pressure levels, Reynolds numbers, and sensor configurations, while operating with noisy measurements, thereby demonstrating strong zero-shot generalization. Control remains robust in the presence of noisy sensor measurements, and a minimal, optimally selected sensor set achieves comparable performance, enabling practical implementation. These results establish a data-driven approach for real-time hypersonic flow control under realistic operational uncertainties.

2602.02530 2026-02-04 cs.LG cs.RO

Formulating Reinforcement Learning for Human-Robot Collaboration through Off-Policy Evaluation

Saurav Singh, Rodney Sanchez, Alexander Ororbia, Jamison Heard

详情
英文摘要

Reinforcement learning (RL) has the potential to transform real-world decision-making systems by enabling autonomous agents to learn from experience. Deploying RL in real-world settings, especially in the context of human-robot interaction, requires defining state representations and reward functions, which are critical for learning efficiency and policy performance. Traditional RL approaches often rely on domain expertise and trial-and-error, necessitating extensive human involvement as well as direct interaction with the environment, which can be costly and impractical, especially in complex and safety-critical applications. This work proposes a novel RL framework that leverages off-policy evaluation (OPE) for state space and reward function selection, using only logged interaction data. This approach eliminates the need for real-time access to the environment or human-in-the-loop feedback, greatly reducing the dependency on costly real-time interactions. The proposed approach systematically evaluates multiple candidate state representations and reward functions by training offline RL agents and applying OPE to estimate policy performance. The optimal state space and reward function are selected based on their ability to produce high-performing policies under OPE metrics. Our method is validated on two environments: the Lunar Lander environment by OpenAI Gym, which provides a controlled setting for assessing state space and reward function selection, and a NASA-MATB-II human subjects study environment, which evaluates the approach's real-world applicability to human-robot teaming scenarios. This work enhances the feasibility and scalability of offline RL for real-world environments by automating critical RL design decisions through a data-driven OPE-based evaluation, enabling more reliable, effective, and sustainable RL formulation for complex human-robot interaction settings.

2602.02526 2026-02-04 cs.LG cs.AI cs.CL physics.comp-ph

The "Robert Boulton" Singularity: Semantic Tunneling and Manifold Unfolding in Recursive AI

Pengyue Hou

Comments Companion paper to arXiv:2601.11594. Provides empirical validation of the MNCIS framework in Large Language Models (GPT-2) using a recursive training protocol (N=1500). Includes complete, reproducible Python implementation of Adaptive Spectral Negative Coupling (ASNC) and Effective Rank metrics in the Appendix

详情
英文摘要

The stability of generative artificial intelligence trained on recursive synthetic data is conventionally monitored via Perplexity (PPL). We demonstrate that PPL is a deceptive metric in context-stabilized regimes (L=128). Using a rigorous sliding-window protocol (N=1500), we identify a novel failure mode termed "Semantic Tunneling." While the Baseline model maintains high grammatical fluency (PPL approx. 83.9), it suffers a catastrophic loss of semantic diversity, converging within seven generations to a single, low-entropy narrative attractor: the "Robert Boulton" Singularity. This phenomenon represents a total collapse of the latent manifold (Global Effective Rank 3.62 -> 2.22), where the model discards diverse world knowledge to optimize for statistically safe syntactic templates. To address this, we apply the Multi-Scale Negative Coupled Information Systems (MNCIS) framework recently established in Hou (2026) [arXiv:2601.11594]. We demonstrate that Adaptive Spectral Negative Coupling (ASNC) acts as a topological operator that actively induces "Manifold Unfolding." MNCIS forces the model to expand its effective rank from the anisotropic baseline of 3.62 to a hyper-diverse state of 5.35, effectively constructing an "Artificial Manifold" that resists the gravitational pull of semantic attractors and preserves the long-tail distribution of the training data.

2602.02523 2026-02-04 cs.LG cs.AI

TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis

Zerui Cheng, Jiashuo Liu, Jianzhu Yao, Pramod Viswanath, Ge Zhang, Wenhao Huang

Comments 30 pages; TabularMath technical report

详情
英文摘要

Standard tabular benchmarks mainly focus on the evaluation of a model's capability to interpolate values inside a data manifold, where models good at performing local statistical smoothing are rewarded. However, there exists a very large category of high-value tabular data, including financial modeling and physical simulations, which are generated based upon deterministic computational processes, as opposed to stochastic and noisy relationships. Therefore, we investigate if tabular models can provide an extension from statistical interpolation to computational extrapolation. We propose TabularMath, a diagnostic benchmark of 114 deterministic problems (233,472 rows) generated from verified programs based on GSM8K and AIME. We evaluate 9 tabular architectures and in-context learning (ICL) with GPT-OSS-120B. On standard regression metrics, TabPFN v2.5 performs remarkably well, achieving R^2=0.998 in-distribution and maintaining positive R^2 even under distribution shift, which is unique among the tabular models we tested. When we measure rounded consistency (exact integer match), a different picture emerges: TabPFN v2.5 drops below 10% on out-of-distribution data, while ICL maintains around 40%. This gap between R^2 and exact-match accuracy suggests that tabular models learn smooth function approximations but struggle to recover precise computational outputs under extrapolation. The two paradigms appear complementary: TabPFN scales efficiently with data; ICL achieves exact computation from few examples. We release all code and data to support further investigation.

2602.02522 2026-02-04 cs.LG cs.AI

IMU-1: Sample-Efficient Pre-training of Small Language Models

George Grigorev

Comments 16 pages

详情
英文摘要

We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances (NorMuon with cautious weight decay, muP parametrization) and a three-stage training schedule with post-hoc checkpoint EMA. We provide ablations for each component and release code, weights and data to enable reproduction: https://huggingface.co/thepowerfuldeez/imu1_base

2602.02521 2026-02-04 cs.LG cs.AI eess.SP

Scaled Dot-Product Attention implements projection of inputs onto a common surface

Terence D Sanger

详情
英文摘要

Scaled dot-product attention (SDPA) is a fundamental component responsible for the success of large-language models and other nonlinear signal processing applications. The rationale for SDPA has been based upon "query, key, value" concepts borrowed from database theory, but these concepts are difficult to reconcile with standard methods in mathematical signal processing. We show that SDPA can be rewritten in a different but mathematically equivalent form as a projection of the input vectors onto a common surface determined by the inputs themselves. Therefore SDPA discovers nonlinear dependencies in the input that are time-dependent and context-dependent. The rewritten form of SDPA permits increased speed of both feedforward and learning algorithms, but more importantly suggests potential extensions. In the context of language, we re-interpret the role of SDPA as finding a time-dependent contextual meaning determined by the surface on which the set of input vectors lies. Input token embeddings are then modified by the local context surface. This interpretation differs substantially from the concept of "self-attention", and provides a strong justification for the use of SDPA for time-series data with time-varying local nonlinear dependencies.

2602.02517 2026-02-04 cs.LG cs.AI cs.CY

What Drives Length of Stay After Elective Spine Surgery? Insights from a Decade of Predictive Modeling

Ha Na Cho, Seungmin Jeong, Yawen Guo, Alexander Lopez, Hansen Bow, Kai Zheng

详情
英文摘要

Objective: Predicting length of stay after elective spine surgery is essential for optimizing patient outcomes and hospital resource use. This systematic review synthesizes computational methods used to predict length of stay in this patient population, highlighting model performance and key predictors. Methods: Following PRISMA guidelines, we systematically searched PubMed, Google Scholar, and ACM Digital Library for studies published between December 1st, 2015, and December 1st, 2024. Eligible studies applied statistical or machine learning models to predict length of stay for elective spine surgery patients. Three reviewers independently screened studies and extracted data. Results: Out of 1,263 screened studies, 29 studies met inclusion criteria. Length of stay was predicted as a continuous, binary, or percentile-based outcome. Models included logistic regression, random forest, boosting algorithms, and neural networks. Machine learning models consistently outperformed traditional statistical models, with AUCs ranging from 0.94 to 0.99. K-Nearest Neighbors and Naive Bayes achieved top performance in some studies. Common predictors included age, comorbidities (notably hypertension and diabetes), BMI, type and duration of surgery, and number of spinal levels. However, external validation and reporting practices varied widely across studies. Discussion: There is growing interest in artificial intelligence and machine learning in length of stay prediction, but lack of standardization and external validation limits clinical utility. Future studies should prioritize standardized outcome definitions and transparent reporting needed to advance real-world deployment. Conclusion: Machine learning models offer strong potential for length of stay prediction after elective spine surgery, highlighting their potential for improving discharge planning and hospital resource management.

2602.02502 2026-02-04 cs.LG cs.AI

Sparse Adapter Fusion for Continual Learning in NLP

Min Zeng, Xi Chen, Haiqin Yang, Yike Guo

Comments This paper has been accepted to EACL 2026

详情
英文摘要

Continual learning in natural language processing plays a crucial role in adapting to evolving data and preventing catastrophic forgetting. Despite significant progress, existing methods still face challenges, such as inefficient parameter reuse across tasks, risking catastrophic forgetting when tasks are dissimilar, and the unnecessary introduction of new parameters for each task, which hampers knowledge sharing among similar tasks. To tackle these issues, we propose a Sparse Adapter Fusion Method (SAFM), which dynamically fuses old and new adapters to address these challenges. SAFM operates in two stages: the decision stage and the tuning stage. In the decision stage, SAFM determines whether to incorporate a new adapter, reuse an existing one, or add an empty adapter. The architecture search procedure, designed to prioritize reusing or adding empty adapters, minimizes parameter consumption and maximizes reuse. In the tuning stage, SAFM especially facilitates a layer-wise loss to encourage differentiation between adapters, effectively capturing knowledge within the same task. Experimental results consistently show that SAFM outperforms state-of-the-art (SOTA) methods, achieving comparable performance while utilizing less than 60% of the parameters.

2602.02501 2026-02-04 cs.LG cs.CR

Augmenting Parameter-Efficient Pre-trained Language Models with Large Language Models

Saurabh Anand, Shubham Malaviya, Manish Shukla, Sachin Lodha

Comments 22 pages, 9 figures, 11 tables, short paper was accepted in ACM SAC 2024

详情
英文摘要

Training AI models in cybersecurity with help of vast datasets offers significant opportunities to mimic real-world behaviors effectively. However, challenges like data drift and scarcity of labelled data lead to frequent updates of models and the risk of overfitting. To address these challenges, we used parameter-efficient fine-tuning techniques for pre-trained language models wherein we combine compacters with various layer freezing strategies. To enhance the capabilities of these pre-trained language models, in this work we introduce two strategies that use large language models. In the first strategy, we utilize large language models as data-labelling tools wherein they generate labels for unlabeled data. In the second strategy, large language modes are utilized as fallback mechanisms for predictions having low confidence scores. We perform comprehensive experimental analysis on the proposed strategies on different downstream tasks specific to cybersecurity domain. We empirically demonstrate that by combining parameter-efficient pre-trained models with large language models, we can improve the reliability and robustness of models, making them more suitable for real-world cybersecurity applications.

2602.02498 2026-02-04 cs.CL cs.AI cs.LG

Test-Time Detoxification without Training or Learning Anything

Baturay Saglam, Dionysis Kalogerias

详情
英文摘要

Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model's generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.

2602.02497 2026-02-04 cs.CL cs.AI

STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

Xuzhao Li, Xuchen Li, Jian Zhao, Shiyu Hu

Comments Preprint, Under review

详情
英文摘要

As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.

2602.02496 2026-02-04 cs.CL

The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

Shikhar Shiromani, Archie Chaudhury, Sri Pranav Kunda

Comments 8 pages, 1 figure

详情
英文摘要

Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that differs significantly from their internal chain of thought (CoT) reasoning in order to appease the user they are conversing with. In order to better detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric utilizing Sparse Autoencoders (SAEs) to quantify the divergence between a model's internal reasoning and its final generation. By mathematically comparing an internal truth belief, derived via sparse linear probes, to the final generated trajectory in latent space, we quantify and detect a model's tendency to engage in unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic's Sycophancy benchmark show that our method achieves an AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases where the model internally "knows" the user is wrong, consistently outperforming a decision-aligned log-probability baseline (0.41-0.50 AUROC).

2602.02488 2026-02-04 cs.LG cs.AI cs.CL cs.CV

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang

Comments Code: https://github.com/Gen-Verse/Open-AgentRL

详情
英文摘要

We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL

2602.02453 2026-02-04 cs.AI

Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Andong Chen, Wenxin Zhu, Qiuyu Ding, Yuchen Song, Muyun Yang, Tiejun Zhao

Comments Working paper

详情
英文摘要

Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.

2602.02419 2026-02-04 cs.AI cs.SE

SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Qingni Wang, Yue Fan, Xin Eric Wang

详情
英文摘要

Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibrations before testing. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples from outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround on multiple GUI grounding models for the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and potentials of substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38% percentage points over Gemini-only inference.

2602.02276 2026-02-04 cs.CL cs.AI cs.LG

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen, Dazhi Cheng, Minghan Chu, Jialei Cui, Jiaqi Deng, Muxi Diao, Hao Ding, Mengfan Dong, Mengnan Dong, Yuxin Dong, Yuhao Dong, Angang Du, Chenzhuang Du, Dikang Du, Lingxiao Du, Yulun Du, Yu Fan, Shengjun Fang, Qiulin Feng, Yichen Feng, Garimugai Fu, Kelin Fu, Hongcheng Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Chengyang Gong, Xiaochen Gong, Zhuoma Gongque, Qizheng Gu, Xinran Gu, Yicheng Gu, Longyu Guan, Yuanying Guo, Xiaoru Hao, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Jiaxi Hu, Yangyang Hu, Zhenxing Hu, Ke Huang, Ruiyuan Huang, Weixiao Huang, Zhiqi Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yu Jing, Guokun Lai, Aidi Li, C. Li, Cheng Li, Fang Li, Guanghe Li, Guanyu Li, Haitao Li, Haoyang Li, Jia Li, Jingwei Li, Junxiong Li, Lincan Li, Mo Li, Weihong Li, Wentao Li, Xinhang Li, Xinhao Li, Yang Li, Yanhao Li, Yiwei Li, Yuxiao Li, Zhaowei Li, Zheming Li, Weilong Liao, Jiawei Lin, Xiaohan Lin, Zhishan Lin, Zichao Lin, Cheng Liu, Chenyu Liu, Hongzhang Liu, Liang Liu, Shaowei Liu, Shudong Liu, Shuran Liu, Tianwei Liu, Tianyu Liu, Weizhou Liu, Xiangyan Liu, Yangyang Liu, Yanming Liu, Yibo Liu, Yuanxin Liu, Yue Liu, Zhengying Liu, Zhongnuo Liu, Enzhe Lu, Haoyu Lu, Zhiyuan Lu, Junyu Luo, Tongxu Luo, Yashuo Luo, Long Ma, Yingwei Ma, Shaoguang Mao, Yuan Mei, Xin Men, Fanqing Meng, Zhiyong Meng, Yibo Miao, Minqing Ni, Kun Ouyang, Siyuan Pan, Bo Pang, Yuchao Qian, Ruoyu Qin, Zeyu Qin, Jiezhong Qiu, Bowen Qu, Zeyu Shang, Youbo Shao, Tianxiao Shen, Zhennan Shen, Juanfeng Shi, Lidong Shi, Shengyuan Shi, Feifan Song, Pengwei Song, Tianhui Song, Xiaoxi Song, Hongjin Su, Jianlin Su, Zhaochen Su, Lin Sui, Jinsong Sun, Junyao Sun, Tongyu Sun, Flood Sung, Yunpeng Tai, Chuning Tang, Heyi Tang, Xiaojuan Tang, Zhengyang Tang, Jiawen Tao, Shiyuan Teng, Chaoran Tian, Pengfei Tian, Ao Wang, Bowen Wang, Chensi Wang, Chuang Wang, Congcong Wang, Dingkun Wang, Dinglu Wang, Dongliang Wang, Feng Wang, Hailong Wang, Haiming Wang, Hengzhi Wang, Huaqing Wang, Hui Wang, Jiahao Wang, Jinhong Wang, Jiuzheng Wang, Kaixin Wang, Linian Wang, Qibin Wang, Shengjie Wang, Shuyi Wang, Si Wang, Wei Wang, Xiaochen Wang, Xinyuan Wang, Yao Wang, Yejie Wang, Yipu Wang, Yiqin Wang, Yucheng Wang, Yuzhi Wang, Zhaoji Wang, Zhaowei Wang, Zhengtao Wang, Zhexu Wang, Zihan Wang, Zizhe Wang, Chu Wei, Ming Wei, Chuan Wen, Zichen Wen, Chengjie Wu, Haoning Wu, Junyan Wu, Rucong Wu, Wenhao Wu, Yuefeng Wu, Yuhao Wu, Yuxin Wu, Zijian Wu, Chenjun Xiao, Jin Xie, Xiaotong Xie, Yuchong Xie, Yifei Xin, Bowei Xing, Boyu Xu, Jianfan Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinbo Xu, Xinran Xu, Yangchuan Xu, Yichang Xu, Yuemeng Xu, Zelai Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Guangyao Yang, Hao Yang, Junwei Yang, Kai Yang, Ningyuan Yang, Ruihan Yang, Xiaofei Yang, Xinlong Yang, Ying Yang, Yi Yang, Yi Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Dan Ye, Wenjie Ye, Zhuorui Ye, Bohong Yin, Chengzhen Yu, Longhui Yu, Tao Yu, Tianxiang Yu, Enming Yuan, Mengjie Yuan, Xiaokun Yuan, Yang Yue, Weihao Zeng, Dunyuan Zha, Haobing Zhan, Dehao Zhang, Hao Zhang, Jin Zhang, Puqi Zhang, Qiao Zhang, Rui Zhang, Xiaobin Zhang, Y. Zhang, Yadong Zhang, Yangkun Zhang, Yichi Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yushun Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Chenguang Zhao, Feifan Zhao, Jinxiang Zhao, Shuai Zhao, Xiangyu Zhao, Yikai Zhao, Zijia Zhao, Huabin Zheng, Ruihan Zheng, Shaojie Zheng, Tengyang Zheng, Junfeng Zhong, Longguang Zhong, Weiming Zhong, M. Zhou, Runjie Zhou, Xinyu Zhou, Zaida Zhou, Jinguo Zhu, Liya Zhu, Xinhao Zhu, Yuxuan Zhu, Zhen Zhu, Jingze Zhuang, Weiyu Zhuang, Ying Zou, Xinxing Zu

Comments Kimi K2.5 tech report

详情
英文摘要

We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

2602.02230 2026-02-04 cs.LG cs.AI

SEDformer: Event-Synchronous Spiking Transformers for Irregular Telemetry Time Series Forecasting

Ziyu Zhou, Yuchen Fang, Weilin Ruan, Shiyu Wang, James Kwok, Yuxuan Liang

Comments Under review

详情
英文摘要

Telemetry streams from large-scale Internet-connected systems (e.g., IoT deployments and online platforms) naturally form an irregular multivariate time series (IMTS) whose accurate forecasting is operationally vital. A closer examination reveals a defining Sparsity-Event Duality (SED) property of IMTS, i.e., long stretches with sparse or no observations are punctuated by short, dense bursts where most semantic events (observations) occur. However, existing Graph- and Transformer-based forecasters ignore SED: pre-alignment to uniform grids with heavy padding violates sparsity by inflating sequences and forcing computation at non-informative steps, while relational recasting weakens event semantics by disrupting local temporal continuity. These limitations motivate a more faithful and natural modeling paradigm for IMTS that aligns with its SED property. We find that Spiking Neural Networks meet this requirement, as they communicate via sparse binary spikes and update in an event-driven manner, aligning naturally with the SED nature of IMTS. Therefore, we present SEDformer, an SED-enhanced Spiking Transformer for telemetry IMTS forecasting that couples: (1) a SED-based Spike Encoder converts raw observations into event synchronous spikes using an Event-Aligned LIF neuron, (2) an Event-Preserving Temporal Downsampling module compresses long gaps while retaining salient firings and (3) a stack of SED-based Spike Transformer blocks enable intra-series dependency modeling with a membrane-based linear attention driven by EA-LIF spiking features. Experiments on public telemetry IMTS datasets show that SEDformer attains state-of-the-art forecasting accuracy while reducing energy and memory usage, providing a natural and efficient path for modeling IMTS.

2602.02196 2026-02-04 cs.AI

TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang, Tao Qin, Jun Liu, Qika Lin

Comments 29pages, 10 figures

详情
英文摘要

Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms under how and why TTI succeed or fail remain poorly understood, and existing evaluation metrics fail to capture their task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework measures (1) the overall temporal dynamics of task completion and (2) identifies whether performance is primarily constrained by recursive looping behaviors or (3) by burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning, calling for explicitly optimizing the interaction dynamics between the agent and the environment.

2602.02178 2026-02-04 cs.CL

AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?

Liang Lin, Feng Xiong, Zengbin Wang, Kun Wang, Junhao Dong, Xuecai Hu, Yong Wang, Xiangxiang Chu

详情
英文摘要

Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)-based likelihood estimation. In this work, we propose AR-MAP, a novel transfer learning framework that leverages preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR-LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment and comprehensive experiments across diverse preference alignment tasks demonstrate that AR-MAP achieves competitive or superior performance compared to existing DLLM-specific alignment methods, achieving 69.08\% average score across all tasks and models. Our Code is available at https://github.com/AMAP-ML/AR-MAP.

2602.02175 2026-02-04 cs.CV

CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization

Xinquan Yu, Wei Lu, Xiangyang Luo, Rui Yang

详情
英文摘要

To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. Consider that current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs utilizing only coarse-grained image/sentence-level annotations. It comprises two branches, image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module. It integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors. Followed by the background silencing and spatial contrast constraints to suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module. It focuses on meaningful content words and leverages relative visual bias to assist token localization. Followed by the asymmetric sparse and semantic consistency constraints to mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.

2602.02163 2026-02-04 cs.CV

Reg4Pru: Regularisation Through Random Token Routing for Token Pruning

Julian Wyatt, Ronald Clark, Irina Voiculescu

Comments 11 pages, 7 figures

详情
英文摘要

Transformers are widely adopted in modern vision models due to their strong ability to scale with dataset size and generalisability. However, this comes with a major drawback: computation scales quadratically to the total number of tokens. Numerous methods have been proposed to mitigate this. For example, we consider token pruning with reactivating tokens from preserved representations, but the increased computational efficiency of this method results in decreased stability from the preserved representations, leading to poorer dense prediction performance at deeper layers. In this work, we introduce Reg4Pru, a training regularisation technique that mitigates token-pruning performance loss for segmentation. We compare our models on the FIVES blood vessel segmentation dataset and find that Reg4Pru improves average precision by an absolute 46% compared to the same model trained without routing. This increase is observed using a configuration that achieves a 29% relative speedup in wall-clock time compared to the non-pruned baseline. These findings indicate that Reg4Pru is a valuable regulariser for token reduction strategies.