arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2797
专题追踪
2602.07265 2026-02-10 cs.LG cs.AI

XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference

Daniil Vankov, Nikita Ivkin, Kyle Ulrich, Xiang Song, Ashish Khetan, George Karypis

详情
英文摘要

Mixture-of-Experts (MoE) architectures are increasingly used to efficiently scale large language models. However, in production inference, request batching and speculative decoding significantly amplify expert activation, eroding these efficiency benefits. We address this issue by modeling batch-aware expert selection as a modular optimization problem and designing efficient greedy algorithms for different deployment settings. The proposed method, namely XShare, requires no retraining and dynamically adapts to each batch by maximizing the total gating score of selected experts. It reduces expert activation by up to 30% under standard batching, cuts peak GPU load by up to 3x in expert-parallel deployments, and achieves up to 14% throughput gains in speculative decoding via hierarchical, correlation-aware expert selection even if requests in a batch drawn from heterogeneous datasets.

2602.07260 2026-02-10 cs.CV cs.LG

3D Transport-based Morphometry (3D-TBM) for medical image analysis

Hongyu Kan, Kristofor Pas, Ivan Medri, Naqib Sad Pathan, Natasha Ironside, Shinjini Kundu, Jingjia He, Gustavo Kunde Rohde

详情
英文摘要

Transport-Based Morphometry (TBM) has emerged as a new framework for 3D medical image analysis. By embedding images into a transport domain via invertible transformations, TBM facilitates effective classification, regression, and other tasks using transport-domain features. Crucially, the inverse mapping enables the projection of analytic results back into the original image space, allowing researchers to directly interpret clinical features associated with model outputs in a spatially meaningful way. To facilitate broader adoption of TBM in clinical imaging research, we present 3D-TBM, a tool designed for morphological analysis of 3D medical images. The framework includes data preprocessing, computation of optimal transport embeddings, and analytical methods such as visualization of main transport directions, together with techniques for discerning discriminating directions and related analysis methods. We also provide comprehensive documentation and practical tutorials to support researchers interested in applying 3D-TBM in their own medical imaging studies. The source code is publicly available through PyTransKit.

2602.07259 2026-02-10 cs.AI

Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Cheol Woo Kim, Davin Choo, Tzeh Yuan Neoh, Milind Tambe

详情
英文摘要

As AI systems grow more capable and autonomous, ensuring their safety and reliability requires not only model-level alignment but also strategic oversight of the humans and institutions involved in their development and deployment. Existing safety frameworks largely treat alignment as a static optimization problem (e.g., tuning models to desired behavior) while overlooking the dynamic, adversarial incentives that shape how data are collected, how models are evaluated, and how they are ultimately deployed. We propose a new perspective on AI safety grounded in Stackelberg Security Games (SSGs): a class of game-theoretic models designed for adversarial resource allocation under uncertainty. By viewing AI oversight as a strategic interaction between defenders (auditors, evaluators, and deployers) and attackers (malicious actors, misaligned contributors, or worst-case failure modes), SSGs provide a unifying framework for reasoning about incentive design, limited oversight capacity, and adversarial uncertainty across the AI lifecycle. We illustrate how this framework can inform (1) training-time auditing against data/feedback poisoning, (2) pre-deployment evaluation under constrained reviewer resources, and (3) robust multi-model deployment in adversarial environments. This synthesis bridges algorithmic alignment and institutional oversight design, highlighting how game-theoretic deterrence can make AI oversight proactive, risk-aware, and resilient to manipulation.

2602.07258 2026-02-10 cs.LG stat.ME

Robust Ultra-High-Dimensional Variable Selection With Correlated Structure Using Group Testing

Wanru Guo, Juan Xie, Binbin Wang, Weicong Chen, Xiaoyi Lu, Vipin Chaudhary, Curtis Tatsuoka

Comments 57 Pages, 5 Figures, 4 Tables

详情
英文摘要

Background: High-dimensional genomic data exhibit strong group correlation structures that challenge conventional feature selection methods, which often assume feature independence or rely on pre-defined pathways and are sensitive to outliers and model misspecification. Methods: We propose the Dorfman screening framework, a multi-stage procedure that forms data-driven variable groups via hierarchical clustering, performs group and within-group hypothesis testing, and refines selection using elastic net or adaptive elastic net. Robust variants incorporate OGK-based covariance estimation, rank-based correlation, and Huber-weighted regression to handle contaminated and non-normal data. Results: In simulations, Dorfman-Sparse-Adaptive-EN performed best under normal conditions, while Robust-OGK-Dorfman-Adaptive-EN showed clear advantages under data contamination, outperforming classical Dorfman and competing methods. Applied to NSCLC gene expression data for trametinib response, robust Dorfman methods achieved the lowest prediction errors and enriched recovery of clinically relevant genes. Conclusions: The Dorfman framework provides an efficient and robust approach to genomic feature selection. Robust-OGK-Dorfman-Adaptive-EN offers strong performance under both ideal and contaminated conditions and scales to ultra-high-dimensional settings, making it well suited for modern genomic biomarker discovery.

2602.07256 2026-02-10 cs.LG cs.AI

Graph homophily booster: Reimagining the role of discrete features in heterophilic graph learning

Ruizhong Qiu, Ting-Wei Li, Gaotang Li, Hanghang Tong

Comments ICLR 2026

详情
英文摘要

Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph-structured data. However, existing GNNs often struggle with heterophilic graphs, where connected nodes tend to have dissimilar features or labels. While numerous methods have been proposed to address this challenge, they primarily focus on architectural designs without directly targeting the root cause of the heterophily problem. These approaches still perform even worse than the simplest MLPs on challenging heterophilic datasets. For instance, our experiments show that 21 latest GNNs still fall behind the MLP on the Actor dataset. This critical challenge calls for an innovative approach to addressing graph heterophily beyond architectural designs. To bridge this gap, we propose and study a new and unexplored paradigm: directly increasing the graph homophily via a carefully designed graph transformation. In this work, we present a simple yet effective framework called GRAPHITE to address graph heterophily. To the best of our knowledge, this work is the first method that explicitly transforms the graph to directly improve the graph homophily. Stemmed from the exact definition of homophily, our proposed GRAPHITE creates feature nodes to facilitate homophilic message passing between nodes that share similar features. Furthermore, we both theoretically and empirically show that our proposed GRAPHITE significantly increases the homophily of originally heterophilic graphs, with only a slight increase in the graph size. Extensive experiments on challenging datasets demonstrate that our proposed GRAPHITE significantly outperforms state-of-the-art methods on heterophilic graphs while achieving comparable accuracy with state-of-the-art methods on homophilic graphs.

2602.07251 2026-02-10 cs.CV cs.AI

The Double-Edged Sword of Data-Driven Super-Resolution: Adversarial Super-Resolution Models

Haley Duba-Sullivan, Steven R. Young, Emma J. Reid

详情
英文摘要

Data-driven super-resolution (SR) methods are often integrated into imaging pipelines as preprocessing steps to improve downstream tasks such as classification and detection. However, these SR models introduce a previously unexplored attack surface into imaging pipelines. In this paper, we present AdvSR, a framework demonstrating that adversarial behavior can be embedded directly into SR model weights during training, requiring no access to inputs at inference time. Unlike prior attacks that perturb inputs or rely on backdoor triggers, AdvSR operates entirely at the model level. By jointly optimizing for reconstruction quality and targeted adversarial outcomes, AdvSR produces models that appear benign under standard image quality metrics while inducing downstream misclassification. We evaluate AdvSR on three SR architectures (SRCNN, EDSR, SwinIR) paired with a YOLOv11 classifier and demonstrate that AdvSR models can achieve high attack success rates with minimal quality degradation. These findings highlight a new model-level threat for imaging pipelines, with implications for how practitioners source and validate models in safety-critical applications.

2602.07243 2026-02-10 cs.RO cs.AI cs.GR

Realistic Synthetic Household Data Generation at Scale

Siddharth Singh, Ifrah Idrees, Abraham Dauhajre

Comments Accepted at Agentic AI Benchmarks and Applications for Enterprise Tasks workshop at AAAI 2026

详情
英文摘要

Advancements in foundation models have catalyzed research in Embodied AI to develop interactive agents capable of environmental reasoning and interaction. Developing such agents requires diverse, large-scale datasets. Prior frameworks generate synthetic data for long-term human-robot interactions but fail to model the bidirectional influence between human behavior and household environments. Our proposed generative framework creates household datasets at scale through loosely coupled generation of long-term human-robot interactions and environments. Human personas influence environment generation, while environment schematics and semantics shape human-robot interactions. The generated 3D data includes rich static context such as object and environment semantics, and temporal context capturing human and agent behaviors over extended periods. Our flexible tool allows users to define dataset characteristics via natural language prompts, enabling configuration of environment and human activity data through natural language specifications. The tool creates variations of user-defined configurations, enabling scalable data generation. We validate our framework through statistical evaluation using multi-modal embeddings and key metrics: cosine similarity, mutual information gain, intervention analysis, and iterative improvement validation. Statistical comparisons show good alignment with real-world datasets (HOMER) with cosine similarity (0.60), while synthetic datasets (Wang et al.) show moderate alignment (0.27). Intervention analysis across age, organization, and sleep pattern changes shows statistically significant effects (p < 0.001) with large effect sizes (Cohen's d = 0.51-1.12), confirming bidirectional coupling translates persona traits into measurable environmental and behavioral differences. These contributions enable development and testing of household smart devices at scale.

2602.07238 2026-02-10 cs.AI cs.LG econ.GN q-fin.EC

Is there "Secret Sauce'' in Large Language Model Development?

Matthias Mertens, Natalia Fischl-Lanzoni, Neil Thompson

详情
英文摘要

Do leading LLM developers possess a proprietary ``secret sauce'', or is LLM performance driven by scaling up compute? Using training and benchmark data for 809 models released between 2022 and 2025, we estimate scaling-law regressions with release-date and developer fixed effects. We find clear evidence of developer-specific efficiency advantages, but their importance depends on where models lie in the performance distribution. At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale--not proprietary technology--drives frontier advances. Away from the frontier, however, proprietary techniques and shared algorithmic progress substantially reduce the compute required to reach fixed capability thresholds. Some companies can systematically produce smaller models more efficiently. Strikingly, we also find substantial variation of model efficiency within companies; a firm can train two models with more than 40x compute efficiency difference. We also discuss the implications for AI leadership and capability diffusion.

2602.07227 2026-02-10 cs.LG cs.RO

Cerebellar-Inspired Residual Control for Fault Recovery: From Inference-Time Adaptation to Structural Consolidation

Nethmi Jayasinghe, Diana Gontero, Spencer T. Brown, Vinod K. Sangwan, Mark C. Hersam, Amit Ranjan Trivedi

详情
英文摘要

Robotic policies deployed in real-world environments often encounter post-training faults, where retraining, exploration, or system identification are impractical. We introduce an inference-time, cerebellar-inspired residual control framework that augments a frozen reinforcement learning policy with online corrective actions, enabling fault recovery without modifying base policy parameters. The framework instantiates core cerebellar principles, including high-dimensional pattern separation via fixed feature expansion, parallel microzone-style residual pathways, and local error-driven plasticity with excitatory and inhibitory eligibility traces operating at distinct time scales. These mechanisms enable fast, localized correction under post-training disturbances while avoiding destabilizing global policy updates. A conservative, performance-driven meta-adaptation regulates residual authority and plasticity, preserving nominal behavior and suppressing unnecessary intervention. Experiments on MuJoCo benchmarks under actuator, dynamic, and environmental perturbations show improvements of up to $+66\%$ on \texttt{HalfCheetah-v5} and $+53\%$ on \texttt{Humanoid-v5} under moderate faults, with graceful degradation under severe shifts and complementary robustness from consolidating persistent residual corrections into policy parameters.

2602.07226 2026-02-10 cs.LG

Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators

Zihan Zhu, Yanqiu Wu, Qiongkai Xu

详情
英文摘要

In the era of Model-as-a-Service, organizations increasingly rely on third-party AI models for rapid deployment. However, the dynamic nature of emerging AI applications, the continual introduction of new datasets, and the growing number of models claiming superior performance make efficient and reliable validation of model services increasingly challenging. This motivates the development of sample-efficient performance estimators, which aim to estimate model performance by strategically selecting instances for labeling, thereby reducing annotation cost. Yet existing evaluation approaches often fail in low-variance settings: RMSE conflates bias and variance, masking persistent bias when variance is small, while p-value based tests become hypersensitive, rejecting adequate estimators for negligible deviations. To address this, we propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level ${\varepsilon}$, enabling the evaluation of performance estimators within practically acceptable error margins. We theoretically show that proper calibration of ${\varepsilon}$ ensures reliable evaluation across different variance regimes, and we further propose an algorithm that automatically optimizes and selects ${\varepsilon}$. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.

2602.06000 2026-02-10 cs.AI cs.CL cs.LG cs.SD

Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

Ali Shendabadi, Parnia Izadirad, Mostafa Salehi, Mahmoud Bijankhan

详情
英文摘要

Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.

2602.05929 2026-02-10 cs.CL

KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu

详情
英文摘要

Large language models rely on kv-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of kv-caches and their variation across layers. We introduce KV-CoRE KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of kv-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of kv-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.

2602.05818 2026-02-10 cs.AI cs.DB

TKG-Thinker: Towards Dynamic Reasoning over Temporal Knowledge Graphs via Agentic Reinforcement Learning

Zihao Jiang, Miao Peng, Zhenyan Shan, Wenjie Xu, Ben Liu, Gong Chen, Ziqi Gao, Min Peng

详情
英文摘要

Temporal knowledge graph question answering (TKGQA) aims to answer time-sensitive questions by leveraging temporal knowledge bases. While Large Language Models (LLMs) demonstrate significant potential in TKGQA, current prompting strategies constrain their efficacy in two primary ways. First, they are prone to reasoning hallucinations under complex temporal constraints. Second, static prompting limits model autonomy and generalization, as it lack optimization through dynamic interaction with temporal knowledge graphs (TKGs) environments. To address these limitations, we propose \textbf{TKG-Thinker}, a novel agent equipped with autonomous planning and adaptive retrieval capabilities for reasoning over TKGs. Specifically, TKG-Thinker performs in-depth temporal reasoning through dynamic multi-turn interactions with TKGs via a dual-training strategy. We first apply Supervised Fine-Tuning (SFT) with chain of thought data to instill core planning capabilities, followed by a Reinforcement Learning (RL) stage that leverages multi-dimensional rewards to refine reasoning policies under intricate temporal constraints. Experimental results on benchmark datasets with three open-source LLMs show that TKG-Thinker achieves state-of-the-art performance and exhibits strong generalization across complex TKGQA settings.

2602.05625 2026-02-10 cs.AI

Reactive Knowledge Representation and Asynchronous Reasoning

Simon Kohaut, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

详情
英文摘要

Exact inference in complex probabilistic models often incurs prohibitive computational costs. This challenge is particularly acute for autonomous agents in dynamic environments that require frequent, real-time belief updates. Existing methods are often inefficient for ongoing reasoning, as they re-evaluate the entire model upon any change, failing to exploit that real-world information streams have heterogeneous update rates. To address this, we approach the problem from a reactive, asynchronous, probabilistic reasoning perspective. We first introduce Resin (Reactive Signal Inference), a probabilistic programming language that merges probabilistic logic with reactive programming. Furthermore, to provide efficient and exact semantics for Resin, we propose Reactive Circuits (RCs). Formulated as a meta-structure over Algebraic Circuits and asynchronous data streams, RCs are time-dynamic Directed Acyclic Graphs that autonomously adapt themselves based on the volatility of input signals. In high-fidelity drone swarm simulations, our approach achieves several orders of magnitude of speedup over frequency-agnostic inference. We demonstrate that RCs' structural adaptations successfully capture environmental dynamics, significantly reducing latency and facilitating reactive real-time reasoning. By partitioning computations based on the estimated Frequency of Change in the asynchronous inputs, large inference tasks can be decomposed into individually memoized sub-problems. This ensures that only the specific components of a model affected by new information are re-evaluated, drastically reducing redundant computation in streaming contexts.

2602.05494 2026-02-10 cs.LG cs.AI

A Unified Framework for Rethinking Policy Divergence Measures in GRPO

Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gallé, Chao Huang

详情
英文摘要

Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.

2602.05400 2026-02-10 cs.CL

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

Comments 45 pages, 7 figures, 8 tables

详情
英文摘要

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7\% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.

2602.05325 2026-02-10 cs.RO

RoboPaint: From Human Demonstration to Any Robot and Any View

Jiacheng Fan, Zhiyue Zhao, Yiqian Zhang, Chao Chen, Peide Wang, Hengdi Zhang, Zhengxue Cheng

Comments 17 pages

详情
英文摘要

Acquiring large-scale, high-fidelity robot demonstration data remains a critical bottleneck for scaling Vision-Language-Action (VLA) models in dexterous manipulation. We propose a Real-Sim-Real data collection and data editing pipeline that transforms human demonstrations into robot-executable, environment-specific training data without direct robot teleoperation. Standardized data collection rooms are built to capture multimodal human demonstrations (synchronized 3 RGB-D videos, 11 RGB videos, 29-DoF glove joint angles, and 14-channel tactile signals). Based on these human demonstrations, we introduce a tactile-aware retargeting method that maps human hand states to robot dex-hand states via geometry and force-guided optimization. Then the retargeted robot trajectories are rendered in a photorealistic Isaac Sim environment to build robot training data. Real world experiments have demonstrated: (1) The retargeted dex-hand trajectories achieve an 84\% success rate across 10 diverse object manipulation tasks. (2) VLA policies (Pi0.5) trained exclusively on our generated data achieve 80\% average success rate on three representative tasks, i.e., pick-and-place, pushing and pouring. To conclude, robot training data can be efficiently "painted" from human demonstrations using our real-sim-real data pipeline. We offer a scalable, cost-effective alternative to teleoperation with minimal performance loss for complex dexterous manipulation.

2602.04902 2026-02-10 cs.LG cs.AI

Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

Kingsuk Maitra

Comments 15 pages, 5 figures, 299 pages total with supplementary material (21 appendices, 27 Jupyter notebooks with embedded results)

详情
英文摘要

The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + γp_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution -- by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A--R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within $\sim$2.9% validation loss. Dedicated associative recall experiments reveal a scaling law $γ^* = 4.17 \times N^{-0.74}$ establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.

2602.04780 2026-02-10 cs.LG cond-mat.dis-nn

Dynamical Regimes of Multimodal Diffusion Models

Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen

Comments 40 pages, 14 figures

详情
英文摘要

Diffusion based generative models have achieved unprecedented fidelity in synthesizing high dimensional data, yet the theoretical mechanisms governing multimodal generation remain poorly understood. Here, we present a theoretical framework for coupled diffusion models, using coupled Ornstein-Uhlenbeck processes as a tractable model. By using the nonequilibrium statistical physics of dynamical phase transitions, we demonstrate that multimodal generation is governed by a spectral hierarchy of interaction timescales rather than simultaneous resolution. A key prediction is the ``synchronization gap'', a temporal window during the reverse generative process where distinct eigenmodes stabilize at different rates, providing a theoretical explanation for common desynchronization artifacts. We derive analytical conditions for speciation and collapse times under both symmetric and anisotropic coupling regimes, establishing strict bounds for coupling strength to avoid unstable symmetry breaking. We show that the coupling strength acts as a spectral filter that enforces a tunable temporal hierarchy on generation. We support these predictions through controlled experiments with diffusion models trained on MNIST datasets and exact score samplers. These results motivate time dependent coupling schedules that target mode specific timescales, offering a potential alternative to ad hoc guidance tuning.

2602.04438 2026-02-10 cs.RO

Gust Estimation and Rejection with a Disturbance Observer for Proprioceptive Underwater Soft Morphing Wings

Tobias Cook, Leo Micklem, Huazhi Dong, Yunjie Yang, Michael Mistry, Francesco Giorgio-Serchi

Comments 2026 IEEE International Conference on Robotics & Automation (ICRA)

详情
英文摘要

Unmanned underwater vehicles are increasingly employed for maintenance and surveying tasks at sea, but their operation in shallow waters is often hindered by hydrodynamic disturbances such as waves, currents, and turbulence. These unsteady flows can induce rapid changes in direction and speed, compromising vehicle stability and manoeuvrability. Marine organisms contend with such conditions by combining proprioceptive feedback with flexible fins and tails to reject disturbances. Inspired by this strategy, we propose soft morphing wings endowed with proprioceptive sensing to mitigate environmental perturbations. The wing's continuous deformation provides a natural means to infer dynamic disturbances: sudden changes in camber directly reflect variations in the oncoming flow. By interpreting this proprioceptive signal, a disturbance observer can reconstruct flow parameters in real time. To enable this, we develop and experimentally validate a dynamic model of a hydraulically actuated soft wing with controllable camber. We then show that curvature-based sensing allows accurate estimation of disturbances in the angle of attack. Finally, we demonstrate that a controller leveraging these proprioceptive estimates can reject disturbances in the lift response of the soft wing. By combining proprioceptive sensing with a disturbance observer, this technique mirrors biological strategies and provides a pathway for soft underwater vehicles to maintain stability in hazardous environments.

2602.04019 2026-02-10 cs.LG cs.AI

Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models

Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma

详情
英文摘要

As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.

2602.03994 2026-02-10 cs.LG cs.AI

Bypassing the Rationale: Causal Auditing of Implicit Reasoning in Language Models

Anish Sathyanarayanan, Aditya Nagarsekar, Aarush Rathore

Comments Under Review at ICLR, 2026

详情
英文摘要

Chain-of-thought (CoT) prompting is widely used as a reasoning aid and is often treated as a transparency mechanism. Yet behavioral gains under CoT do not imply that the model's internal computation causally depends on the emitted reasoning text, i.e., models may produce fluent rationales while routing decision-critical computation through latent pathways. We introduce a causal, layerwise audit of CoT faithfulness based on activation patching. Our key metric, the CoT Mediation Index (CMI), isolates CoT-specific causal influence by comparing performance degradation from patching CoT-token hidden states against matched control patches. Across multiple model families (Phi, Qwen, DialoGPT) and scales, we find that CoT-specific influence is typically depth-localized into narrow "reasoning windows," and we identify bypass regimes where CMI is near-zero despite plausible CoT text. We further observe that models tuned explicitly for reasoning tend to exhibit stronger and more structured mediation than larger untuned counterparts, while Mixture-of-Experts models show more distributed mediation consistent with routing-based computation. Overall, our results show that CoT faithfulness varies substantially across models and tasks and cannot be inferred from behavior alone, motivating causal, layerwise audits when using CoT as a transparency signal.

2602.03950 2026-02-10 cs.AI cs.LG cs.MA

Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation

Aditya Basarkar, Benyamin Tabarsi, Tiffany Barnes, Dongkuan Xu

Comments 9 pages, 7 figures, submitted to ACL ARR 2026, hyperlink to code repository provided in the abstract

详情
英文摘要

Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi-agent LLM-based systems have enhanced their mathematical reasoning capabilities, they still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain-of-thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches in the majority of reasoning benchmarks on multiple base LLMs. All code and implementations are released as open source.

2602.03881 2026-02-10 cs.CV cs.AI cs.LG

DiGAN: Diffusion-Guided Attention Network for Early Alzheimer's Disease Detection

Maxx Richard Rahman, Mostafa Hammouda, Wolfgang Maass

详情
英文摘要

Early diagnosis of Alzheimer's disease (AD) remains a major challenge due to the subtle and temporally irregular progression of structural brain changes in the prodromal stages. Existing deep learning approaches require large longitudinal datasets and often fail to model the temporal continuity and modality irregularities inherent in real-world clinical data. To address these limitations, we propose the Diffusion-Guided Attention Network (DiGAN), which integrates latent diffusion modelling with an attention-guided convolutional network. The diffusion model synthesizes realistic longitudinal neuroimaging trajectories from limited training data, enriching temporal context and improving robustness to unevenly spaced visits. The attention-convolutional layer then captures discriminative structural-temporal patterns that distinguish cognitively normal subjects from those with mild cognitive impairment and subjective cognitive decline. Experiments on the ADNI dataset demonstrate that DiGAN outperforms existing state-of-the-art baselines, showing its potential for early-stage AD detection.

2602.03786 2026-02-10 cs.AI cs.CL

AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Yongru Chen, Bang Liu, Chenglin Wu, Yuyu Luo, Jiayi Zhang

详情
英文摘要

Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple Instruction, Context, Tools, Model. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework-agnostic with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto-efficient. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra

2602.02765 2026-02-10 cs.CV

SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?

Haruhiko Murata, Kazuhiro Hotta

Comments I corrected the incorrect email address. I'm sorry for any inconvenience this may have caused

详情
英文摘要

Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components-\textbf{SPC module}, \textbf{SSVA}, and \textbf{ID-RSVD}-and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.

2602.02244 2026-02-10 cs.LG cs.CL

Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue, Sirui Han, Yike Guo, Dapeng Wu

详情
英文摘要

The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in SFT stage, CurioSFT outperforms the vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in RL stage, yielding an average improvement of 5.0 points.

2602.02206 2026-02-10 cs.LG

Fat-Cat: Document-Driven Metacognitive Multi-Agent System for Complex Reasoning

Tong Yang, Yemin Wang, Chaoning Zhang, Aming Wu

Comments This submission is withdrawn due to errors in the manuscript content and inaccuracies in the author information. The authors plan to correct these issues and may submit a revised version in the future

详情
英文摘要

The effectiveness of LLM-based agents is often limited not by model capacity alone, but by how efficiently contextual information is utilized at runtime. Existing agent frameworks rely on rigid, syntax-heavy state representations such as nested JSON, which require models to devote a substantial portion of their limited attention to syntactic processing rather than semantic reasoning. In this paper, we propose Fat-Cat, a document-driven agent architecture that improves the signal-to-noise ratio of state management. By integrating three key components: (1) a Semantic File System that represents agent state as Markdown documents aligned with common pre-training corpora, (2) a Textual Strategy Evolution module that accumulates task-solving knowledge without parameter updates, and (3) a Closed-Loop Watcher that monitors reasoning trajectories to reduce hallucinations. Extensive reasoning, retrieval, and coding benchmarks, Fat-Cat consistently improves agent performance. It enables the Kimi-k2 model to outperform the proprietary GPT-4o baseline on HotPotQA. Replacing the document-based state with JSON leads to performance drop, while empirically validating the critical necessity of document-driven state modeling over rigid syntax. The code is available at https://github.com/answeryt/Fat-Cat.

2602.01777 2026-02-10 cs.LG cs.AI math.ST stat.ML stat.TH

Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions

M. Arashi, M. Amintoosi

详情
英文摘要

Stochastic gradient methods are central to large-scale learning, but they treat mini-batch gradients as unbiased estimators, which classical decision theory shows are inadmissible in high dimensions. We formulate gradient computation as a high-dimensional estimation problem and introduce a framework based on Stein-rule shrinkage. We construct a gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging statistics from adaptive optimizers. Under a Gaussian noise model, we show our estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal. We incorporate this into the Adam optimizer, yielding SR-Adam, a practical algorithm with negligible computational cost. Empirical evaluations on CIFAR10 and CIFAR100 across multiple levels of input noise show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled approach to improving stochastic gradient estimation in deep learning.

2602.01666 2026-02-10 cs.CV

Moonworks Lunara Aesthetic II: An Image Variation Dataset

Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, Sayeef Abdullah, Sabit Hassan

详情
英文摘要

We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood; while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara's signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.