arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.02676 2026-05-12 cs.CL cs.AI

ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya

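AI summary To address the content effects that bias large language models in multilingual reasoning tasks, this work proposes a method that uses explicit structural abstraction to convert syllogisms into canonical logical representations, then applies deterministic parsing to judge validity. On the SemEval-2026 Task 11 multilingual benchmark the method ranks in the top five on every subtask and substantially reduces content bias, offering a competitive alternative to complex fine-tuning or activation-level interventions.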

Abstract

Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.
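The idea of deciding validity deterministically once a syllogism is in canonical form can be illustrated with a small sketch. This is not the authors' parser: the tuple encoding `(form, X, Y)`, the helper names `holds` and `is_valid`, and the choice of modern Boolean semantics (no existential import) are all assumptions made for illustration. The check enumerates every way the eight Venn regions of the three terms can be empty or non-empty, so the verdict depends only on logical form and never on what the terms mean.

```python
from itertools import product

# The 8 Venn regions for terms S, M, P, each encoded as (in_S, in_M, in_P).
REGIONS = list(product([False, True], repeat=3))
IDX = {"S": 0, "M": 1, "P": 2}


def holds(stmt, nonempty):
    """Evaluate a categorical statement (form, X, Y) in one model.

    `nonempty` is the set of regions containing at least one individual.
    Forms: A = All X are Y, E = No X are Y,
           I = Some X are Y, O = Some X are not Y.
    Assumption: no existential import (empty terms are allowed).
    """
    form, x, y = stmt
    xi, yi = IDX[x], IDX[y]
    x_and_y = any(r[xi] and r[yi] for r in nonempty)
    x_not_y = any(r[xi] and not r[yi] for r in nonempty)
    return {"A": not x_not_y, "E": not x_and_y, "I": x_and_y, "O": x_not_y}[form]


def is_valid(premises, conclusion):
    """Valid iff no model satisfies all premises but falsifies the conclusion."""
    for mask in product([False, True], repeat=len(REGIONS)):
        nonempty = {r for r, keep in zip(REGIONS, mask) if keep}
        if all(holds(p, nonempty) for p in premises) and not holds(conclusion, nonempty):
            return False  # found a counter-model
    return True


# "All M are P, all S are M, therefore all S are P" (Barbara)
barbara = is_valid([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P"))
# "All P are M, all S are M, therefore all S are P" (undistributed middle)
undistributed = is_valid([("A", "P", "M"), ("A", "S", "M")], ("A", "S", "P"))
print(barbara, undistributed)
```

Because there are only 256 models to enumerate, the check is exhaustive and needs no learned component; any content-dependent step would sit upstream, in mapping natural-language sentences to the `(form, X, Y)` tuples.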

2603.01743 2026-05-12 cs.CV

Action-Guided Attention for Video Action Anticipation

Tsung-Ming Tai, Sofia Casarin, Andrea Pilzer, Werner Nutt, Oswald Lanz

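AI summary Anticipating future actions in video is challenging because observed frames only provide evidence of past activity, so latent intentions must be inferred to predict what comes next. Existing transformer-based methods rely on dot-product attention over pixel representations, which lack high-level semantics and struggle to model video sequences effectively. This paper proposes Action-Guided Attention (AGA), which uses predicted action sequences as queries and keys to guide sequence modeling, strengthening attention on key past moments and combining it with the current frame embedding through a gating function, which improves the model's grasp of latent intentions and its generalization. Experiments show that AGA generalizes well on EPIC-Kitchens-100, and post-training analysis exposes the action dependencies and counterfactual evidence the model has learned, giving interpretable grounds for its predictions.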

Comments Accepted by ICLR 2026

Abstract

Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity and combines this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.

2602.22088 2026-05-12 cs.RO

Force Policy: Learning Hybrid Force-Position Control Policy under Interaction Frame for Contact-Rich Manipulation

Hongjie Fang, Shirun Tang, Mingyu Mei, Haoxiang Qin, Zihao He, Jingjing Chen, Ying Feng, Chenxi Wang, Wanxi Liu, Zaixing He, Cewu Lu, Shiquan Wang

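AI summary For robot manipulation in complex contact-rich scenarios, this work proposes Force Policy, a hybrid force-position control policy. It introduces a physically grounded interaction frame that decouples force regulation from motion execution and recovers that frame from demonstrations, so that global visual guidance and local high-frequency force feedback work together. Experiments show gains over existing methods in contact establishment, force regulation accuracy, and generalization to novel objects, improving contact stability and execution quality.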

Comments accepted by RSS 2026

Abstract

Contact-rich manipulation demands human-like integration of perception and force feedback: vision should guide task progress, while high-frequency interaction control must stabilize contact under uncertainty. Existing learning-based policies often entangle these roles in a monolithic network, trading off global generalization against stable local refinement, while control-centric approaches typically assume a known task structure or learn only controller parameters rather than the structure itself. In this paper, we formalize a physically grounded interaction frame, an instantaneous local basis that decouples force regulation from motion execution, and propose a method to recover it from demonstrations. Based on this, we address both issues by proposing Force Policy, a global-local vision-force policy in which a global policy guides free-space actions using vision, and upon contact, a high-frequency local policy with force feedback estimates the interaction frame and executes hybrid force-position control for stable interaction. Real-world experiments across diverse contact-rich tasks show consistent gains over strong baselines, with more robust contact establishment, more accurate force regulation, and reliable generalization to novel objects with varied geometries and physical properties, ultimately improving both contact stability and execution quality. Project page: https://force-policy.github.io/

2602.16596 2026-05-12 cs.LG cs.CR math.ST stat.ML stat.TH

Sequential Membership Inference Attacks

Thomas Michel, Debabrota Basu, Emilie Kaufmann

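AI summary This paper studies Sequential Membership Inference (SeMI) attacks on modern, continually updated AI models, aiming for tighter privacy audits by exploiting the sequence of model updates. The authors propose SeMI*, an optimal attack that controls the insertion time and analyzes statistics along the model sequence to decide more effectively whether a target sample was included in the training data. Experiments show that, compared with baselines that use only the final model, SeMI attacks achieve higher power and tighter privacy audits across datasets and models trained with (DP-)SGD.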

Comments 32 pages, 14 figures

Abstract

Modern AI models are not static. They go through multiple updates in their lifecycles. We propose to design Sequential Membership Inference (SeMI) attacks leading to tighter privacy audits by exploiting the sequence of models and injecting a target canary at a controlled insertion time. First, for empirical mean computation, we develop SeMI*, an optimal SeMI attack to identify the presence of a target inserted at a specific insertion step. We derive the power of SeMI* to show that accessing the model sequence yields more powerful MI attacks than scrutinising only the final model. SeMI* exhibits an isolation property -- its power depends on the statistics obtained right before and after insertion of the target. Leveraging this insight, we develop practical white-box (accessing model gradients) and black-box (accessing loss) SeMI attacks against models trained with (DP-)SGD. Across datasets and models trained with (DP-)SGD, our experiments show that SeMI attacks achieve higher powers than snapshot-independent baselines, and yield tighter privacy audits thanks to (a) control over the insertion time and (b) observations across the model sequence.
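A noiseless toy case gives some intuition for why the statistics right before and after insertion matter in the empirical-mean setting. The sketch below is not SeMI*, which is a statistical test whose details the abstract does not give; it only uses the elementary identity $x_t = t\,\mu_t - (t-1)\,\mu_{t-1}$ for running means. The function names `running_means` and `recover_inserted` and the sample values are made up for illustration.

```python
from fractions import Fraction


def running_means(samples):
    """Return the sequence of empirical means mu_1, ..., mu_n (exact arithmetic)."""
    total = Fraction(0)
    means = []
    for t, x in enumerate(samples, start=1):
        total += Fraction(x)
        means.append(total / t)
    return means


def recover_inserted(mean_before, mean_after, t):
    """Recover the t-th sample from the means released at steps t-1 and t."""
    return t * mean_after - (t - 1) * mean_before


# Hypothetical data stream; the "canary" 10 is inserted at step t = 4.
samples = [3, 5, 4, 10, 6]
means = running_means(samples)

t = 4
recovered = recover_inserted(means[t - 2], means[t - 1], t)
final_only = means[-1]  # what an attacker restricted to the final model sees
print(recovered, final_only)
```

With exact means the inserted value is fully determined by two consecutive snapshots, whereas the final mean alone blends it with every other sample. With noise or DP mechanisms the recovery becomes a hypothesis test rather than an identity, which is the regime the paper analyzes.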

2602.13759 2026-05-12 cs.LG cs.NA math.NA math.OC

Discrete Double-Bracket Flows for Isotropic-Noise Invariant Eigendecomposition

ZhiMing Li, JiaHe Feng

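AI summary This paper studies eigendecomposition on the special orthogonal group $SO(n)$ from streaming observations corrupted by isotropic noise. The authors propose a discrete double-bracket flow whose generator is a skew-symmetric matrix, which keeps the isotropic noise out of the eigenspace dynamics and makes the method invariant to the noise level. The trajectory and stability guarantees depend only on the trace-free part of the signal, which improves robustness and convergence in high-noise regimes.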

Comments 75 pages, 9 figures

Abstract

We study eigendecomposition on $SO(n)$ under streaming observations $C_k = C_{\mathrm{sig}} + \sigma_k^2 I + E_k$, where the isotropic background $\sigma_k^2 I$ may be time-varying and arbitrarily large. Standard algorithms couple their stability to $\lVert C_k \rVert_2 \approx \sigma^2$, forcing step sizes, contraction rates, and iteration counts to degrade with the noise floor. We observe that $\sigma^2 I$ lies in the center of the matrix algebra and therefore *should never enter* the eigenspace dynamics. We construct a discrete double-bracket flow whose skew-symmetric generator $\Omega = [A, \operatorname{diag}(A)]$ operates in the tangent Lie algebra $\mathfrak{so}(n)$, where scalar multiples of the identity vanish by antisymmetry. The resulting trajectory, Lyapunov function, and maximal stable step size $\eta_{\max} = 1/L_C$ depend exclusively on the trace-free signal $C_e$ -- achieving pointwise, pathwise $\sigma^2$-invariance. We establish input-to-state stability with a noise ball governed solely by trace-free perturbations, prove global convergence via strict-saddle geometry and a discrete Łojasiewicz argument, and extend the framework to top-$k$ eigentracking on the Stiefel manifold $\operatorname{St}(k,n)$ at cost $k$ matrix-vector products per step.
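The claim that an isotropic shift cannot enter the generator can be checked directly: entrywise, $[A, \operatorname{diag}(A)]_{ij} = A_{ij}(A_{jj} - A_{ii})$, and adding $cI$ to $A$ changes neither the off-diagonal entries nor the differences of diagonal entries. The sketch below verifies this on a small integer matrix so that the comparison is exact. It assumes only that $A$ is symmetric; how the paper forms $A$ from the observations $C_k$ is not stated in the abstract, and the helper names are illustrative.

```python
def matmul(X, Y):
    """Plain matrix product for square lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diag_part(X):
    """Keep the diagonal of X and zero out everything else."""
    n = len(X)
    return [[X[i][j] if i == j else 0 for j in range(n)] for i in range(n)]


def generator(A):
    """Commutator Omega = [A, diag(A)] = A diag(A) - diag(A) A."""
    D = diag_part(A)
    AD, DA = matmul(A, D), matmul(D, A)
    n = len(A)
    return [[AD[i][j] - DA[i][j] for j in range(n)] for i in range(n)]


def add_isotropic(A, c):
    """Return A + c * I, i.e. A with an isotropic background added."""
    n = len(A)
    return [[A[i][j] + (c if i == j else 0) for j in range(n)] for i in range(n)]


# Arbitrary symmetric integer matrix (illustrative values only).
A = [[2, 1, -3],
     [1, 5, 4],
     [-3, 4, -1]]

omega = generator(A)
omega_shifted = generator(add_isotropic(A, 1000))  # large isotropic shift

shift_invariant = omega == omega_shifted
skew_symmetric = all(omega[i][j] == -omega[j][i]
                     for i in range(3) for j in range(3))
print(shift_invariant, skew_symmetric)
```

Integer entries are used so the equality is exact rather than up to floating-point tolerance; with real-valued data the two generators would agree up to rounding.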

2602.13486 2026-05-12 cs.LG cs.AI cs.DC

Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity

Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang

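AI summary This paper studies rank collapse in federated low-rank adaptation (FedLoRA) under client heterogeneity, where the energy of the global update concentrates in the minimum shared rank and hurts model performance. Through theoretical analysis the authors trace rank collapse to a mismatch between aggregation weights and client contributions, and propose raFLoRA, a rank-partitioned aggregation method that mitigates rank collapse and improves performance and robustness in heterogeneous settings.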

Abstract

Federated low-rank adaptation (FedLoRA) has facilitated communication-efficient and privacy-preserving fine-tuning of foundation models for downstream tasks. In practical federated learning scenarios, client heterogeneity in system resources and data distributions motivates the use of heterogeneous LoRA ranks across clients. However, we identify a previously overlooked phenomenon in heterogeneous FedLoRA with SVD-based allocation, termed rank collapse, where the energy of the global update becomes concentrated in the minimum shared rank, resulting in suboptimal performance and high sensitivity to rank configurations. Through theoretical analysis, we reveal the root cause of rank collapse: a mismatch between rank-agnostic aggregation weights and rank-dependent client contributions, which systematically suppresses higher-rank updates at a geometric rate over rounds. Motivated by this insight, we propose raFLoRA, a rank-partitioned aggregation method that decomposes local updates into rank partitions and then aggregates each partition weighted by its effective client contributions. Extensive experiments across vision, language, and reasoning tasks show that raFLoRA prevents rank collapse, improves model performance, and enhances robustness across diverse heterogeneous configurations compared with strong FedLoRA baselines.
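The mismatch between rank-agnostic weights and rank-dependent contributions can be seen in a deliberately simplified toy. This is not the raFLoRA algorithm, whose details (including its SVD-based allocation) are not given in the abstract. Here each client update is reduced to a list of scalars, one per rank index it holds, standing in for rank-one components; the function names and numbers are hypothetical. The baseline pads missing indices with zeros and divides by the total weight, while the partitioned variant divides each rank index by the weight of the clients that actually contribute to it.

```python
def zero_padded_average(updates, weights):
    """Rank-agnostic baseline: pad shorter updates with zeros, then weight-average."""
    max_rank = max(len(u) for u in updates)
    total_w = sum(weights)
    return [
        sum(w * (u[j] if j < len(u) else 0.0) for u, w in zip(updates, weights)) / total_w
        for j in range(max_rank)
    ]


def rank_partitioned_average(updates, weights):
    """Average each rank index over only the clients that hold that index."""
    max_rank = max(len(u) for u in updates)
    aggregated = []
    for j in range(max_rank):
        contributors = [(u[j], w) for u, w in zip(updates, weights) if j < len(u)]
        w_sum = sum(w for _, w in contributors)
        aggregated.append(sum(v * w for v, w in contributors) / w_sum)
    return aggregated


# Three hypothetical clients with ranks 4, 2, and 1; every held component equals 1.0.
updates = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0], [1.0]]
weights = [1.0, 1.0, 2.0]  # e.g. proportional to local data size (assumption)

padded = zero_padded_average(updates, weights)
partitioned = rank_partitioned_average(updates, weights)
print(padded)
print(partitioned)
```

In the padded baseline the components beyond the minimum shared rank are shrunk each round because absent clients count as zeros; repeated over rounds, that shrinkage compounds, which matches the geometric suppression the abstract describes.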

2602.12606 2026-05-12 cs.LG

RelBench v2: A Large-Scale Benchmark and Repository for Relational Data

Justin Gu, Rishabh Ranjan, Charilaos Kanatsoulis, Haiming Tang, Martin Jurkovic, Valter Hudovernik, Mark Znidar, Pranshu Chaturvedi, Parth Shroff, Fengyu Li, Jure Leskovec

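AI summary RelBench v2 is a large-scale benchmark and dataset repository for relational deep learning. It adds four large relational datasets covering scholarly publications, enterprise resource planning, consumer platforms, and clinical records, for a total of more than 22 million rows. This version introduces autocomplete tasks, which require models to predict missing attribute values directly in relational tables while respecting temporal constraints, and integrates several external benchmarks and evaluation frameworks to broaden its scope. Experiments show that relational deep learning models outperform single-table baselines on autocomplete, forecasting, and recommendation tasks, underscoring the value of modeling relational structure explicitly.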

Comments Published at ICLR 2026. Website: https://relbench.stanford.edu

Abstract

Relational deep learning (RDL) has emerged as a powerful paradigm for learning directly on relational databases by modeling entities and their relationships across multiple interconnected tables. As this paradigm evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic evaluation and progress. In this paper, we introduce RelBench v2, a major expansion of the RelBench benchmark for RDL. RelBench v2 adds four large-scale relational datasets spanning scholarly publications, enterprise resource planning, consumer platforms, and clinical records, increasing the benchmark to 11 datasets comprising over 22 million rows across 29 tables. We further introduce autocomplete tasks, a new class of predictive objectives that require models to infer missing attribute values directly within relational tables while respecting temporal constraints, expanding beyond traditional forecasting tasks constructed via SQL queries. In addition, RelBench v2 expands beyond its native datasets by integrating external benchmarks and evaluation frameworks: we translate event streams from the Temporal Graph Benchmark into relational schemas for unified relational-temporal evaluation, interface with ReDeLEx to provide uniform access to 70+ real-world databases suitable for pretraining, and incorporate 4DBInfer datasets and tasks to broaden multi-table prediction coverage. Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks, highlighting the importance of modeling relational structure explicitly.

2602.11665 2026-05-12 cs.LG math.OC

Fully First-Order Algorithms for Online Bilevel Optimization

Tingkai Jia, Cheng Chen

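AI summary This paper studies nonconvex-strongly-convex online bilevel optimization using only a first-order oracle and proposes a fully first-order algorithm that avoids the Hessian-vector products required by conventional hypergradient descent, lowering computational cost. By reformulating the original problem as a single-level online problem with inequality constraints and constructing a sequence of Lagrangian functions, the authors obtain a regret of $O(1 + V_T + H_{2,T})$ with $O(T\log T)$ total iterations. They also give an improved variant that removes the dependence on $H_{2,T}$, and establish a corresponding regret bound in the stochastic setting.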

Comments make a lot of improvements

Abstract

In this work, we study nonconvex-strongly convex online bilevel optimization (OBO) using only a first-order oracle. Existing OBO algorithms are mainly based on hypergradient descent, which requires access to a Hessian-vector product (HVP) oracle and potentially incurs high computational costs. By reformulating the original OBO problem as a single-level online problem with inequality constraints and constructing a sequence of Lagrangian functions, we eliminate the need for HVPs arising from implicit differentiation. Specifically, we propose a fully first-order algorithm for OBO, and provide theoretical guarantees showing that it achieves regret of $O(1 + V_T + H_{2,T})$ with a total of $O(T\log T)$ iterations, where $V_T$ measures the variation in function values and $H_{2,T}$ characterizes the drift variation of the inner-level optimal solution. We also establish a sublinear regret bound under the single-loop structure by introducing additional gradient-variation terms. Furthermore, we develop an improved variant with an adaptive inner-iteration scheme, which removes the dependence on $H_{2,T}$ and achieves regret of $O(\log T + V_T)$. Finally, under the stochastic OBO setting, we establish the regret bound for the fully first-order algorithm, i.e., $O(T^{2/3}(1 + \sigma^2) + V_T + H_{2,T})$. Numerical experiments demonstrate the feasibility of our algorithm and support our theoretical findings.

2602.09524 2026-05-12 cs.CV

HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection

Han Zhou, Yuxuan Gao, Yinchao Du, Xuezhe Zheng

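AI summary This paper proposes HLGFA, a high-low resolution guided feature alignment framework for unsupervised industrial anomaly detection. Instead of relying on pixel-level reconstruction, it learns normality by modeling cross-resolution feature consistency between high- and low-resolution representations of normal samples. Structure and detail priors from the high-resolution branch guide the refinement of low-resolution features through conditional modulation and gated residual correction, so that at inference anomalies show up where cross-resolution alignment breaks down. Experiments on standard benchmarks show strong performance, outperforming representative reconstruction-based and feature-based methods.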

Comments 14 pages, 6 figures, references added

Abstract

Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.

2602.09520 2026-05-12 cs.LG cs.DC

Rashomon Sets and Model Multiplicity in Federated Learning

Xenia Heilmann, Luca Corbucci, Mattia Cerrato

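AI summary This paper studies model multiplicity in federated learning and introduces Rashomon sets for the federated setting, to reveal the many models that perform similarly yet draw different decision boundaries under privacy constraints and heterogeneous data. It is the first to extend Rashomon sets to federated learning, distinguishing global, t-agreement, and individual definitions, and it shows how to estimate multiplicity metrics under privacy constraints. The authors also design a multiplicity-aware federated learning pipeline; experiments show that it helps clients choose models that better fit their local data and fairness needs.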

Comments Accepted at ACM FAccT 2026

Abstract

The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision-boundary instabilities that standard metrics obscure. However, the existing definitions of Rashomon sets and multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server's coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distributions and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL. First, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and (III) individual Rashomon sets specific to each client's local distribution. Second, we show how standard multiplicity metrics can be estimated under FL's privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.

2602.09514 2026-05-12 cs.CL cs.AI

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, JinCheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, He Zhu, Yuchen Eleanor Jiang, Wei Wang, Wangchunshu Zhou

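AI summary EcoGym is a generalizable benchmark for evaluating the long-horizon plan-and-execute ability of large language models in interactive economies. It comprises three diverse environments, supports decision making over an effectively unbounded horizon, and scores models on business-relevant outcomes to test long-term strategic coherence and robustness. Experiments show that current leading models fall significantly short in either high-level strategy or efficient action execution. EcoGym provides an open, extensible testbed for studying controllability-utility trade-offs in economic settings.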

Comments update

Abstract

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending (adapted from the closed-source Vending-Bench, with full open-source release), Freelance (new), and Operation (new), implemented in a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when 365 day-loops are used for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in economic settings.

2602.09317 2026-05-12 cs.LG cs.AI stat.ML

SnareNet: Flexible Repair Layers for Neural Networks with Hard Constraints

Ya-Chi Chu, Alkiviades Boukas, Madeleine Udell

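AI summary SnareNet is a feasibility-controlled architecture for neural networks whose outputs would otherwise violate physical, operational, or safety constraints. It appends a differentiable repair layer that iterates in the constraint map's range space until the output satisfies the constraints to a user-specified tolerance. An adaptive relaxation training strategy stabilizes end-to-end training, and on several benchmarks the method attains better objective quality and more reliable constraint satisfaction, with a particular advantage on non-convex constraints.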

Abstract

Neural networks are increasingly used as fast surrogate models across various domains, but unconstrained predictions can violate physical, operational, or safety requirements. We propose SnareNet, a feasibility-controlled architecture to learn mappings whose outputs must satisfy input-dependent constraints. SnareNet appends a differentiable repair layer that navigates in the constraint map's range space, steering iterates toward feasibility and producing a repaired output that satisfies constraints to a user-specified tolerance. We stabilize end-to-end training by adaptive relaxation, a new training paradigm that snares the neural network at initialization and shrinks it into the feasible set, enabling early exploration and strict feasibility later in training. On optimization learning and trajectory planning benchmarks, SnareNet consistently attains improved objective quality while satisfying constraints more reliably than prior work, and it is the first to enforce non-convex constraints at medium-to-high precision robustly across instances.

2602.08616 2026-05-12 cs.LG cs.AI

Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete Action Spaces

Heiko Hoppe, Fabian Akkerman, Wouter van Heeswijk, Maximilian Schiffer

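AI summary This paper studies how to apply reinforcement learning efficiently in very large discrete action spaces and proposes Distance-Guided Reinforcement Learning (DGRL). The method combines Sampled Dynamic Neighborhoods with Distance-Based Updates, turning policy optimization into a stable regression task and decoupling gradient variance from the size of the action space. Experiments show performance improvements of up to 66% over existing methods across a range of structured environments, along with faster convergence and lower computational complexity.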

Comments 31 pages, 8 figures

Abstract

Reinforcement Learning (RL) is increasingly applied to large-scale decision-making problems like logistics, scheduling, and recommender systems, but existing algorithms struggle with the curse of dimensionality in such large discrete action spaces. We propose Distance-Guided Reinforcement Learning (DGRL), combining Sampled Dynamic Neighborhoods and Distance-Based Updates to enable efficient RL in problems with up to $10^{20}$ actions. Unlike prior methods, DGRL performs stochastic volumetric exploration and transforms policy optimization into a stable regression task, decoupling gradient variance from action space cardinality. On structured tasks, DGRL provably guarantees local value improvement. DGRL naturally generalizes to hybrid continuous-discrete action spaces. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments, while simultaneously improving convergence speed and computational complexity.

2602.07144 2026-05-12 cs.LG cs.AI stat.ML

BONSAI: Bayesian Optimization with Natural Simplicity and Interpretability

Samuel Daulton, David Eriksson, Maximilian Balandat, Eytan Bakshy

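AI summary BONSAI is a default-aware Bayesian optimization method that deviates from the default configuration as little as possible during optimization, making results easier to interpret and use. It prunes low-impact parameter changes while controlling the loss in acquisition value, and is compatible with acquisition functions such as expected improvement and upper confidence bound. Theoretical analysis shows that BONSAI preserves optimization guarantees while recovering the relevant coordinates at zero acquisition cost, a guarantee that prior sparse Bayesian optimization methods do not provide, and real-world applications confirm that it substantially reduces the number of non-default parameters.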

Comments 32 pages

Abstract

Bayesian optimization (BO) is a popular technique for sample-efficient optimization of black-box functions. In many applications, the parameters being tuned come with a carefully engineered default configuration, and practitioners only want to deviate from this default when necessary. Standard BO, however, does not aim to minimize deviation from the default and, in practice, often pushes weakly relevant parameters to the boundary of the search space. This makes it difficult to distinguish between important and spurious changes and increases the burden of vetting recommendations when the optimization objective omits relevant operational considerations. We introduce BONSAI, a default-aware BO policy that prunes low-impact deviations from a default configuration while explicitly controlling the loss in acquisition value. BONSAI is compatible with a variety of acquisition functions, including expected improvement and upper confidence bound (GP-UCB). We theoretically bound the regret incurred by BONSAI, showing that, under certain conditions, it enjoys the same no-regret property as vanilla GP-UCB. Moreover, assuming known ARD lengthscales -- the same assumption underlying GP-UCB regret bounds -- BONSAI provably recovers the relevant-coordinate set at zero acquisition cost, yielding a method that matches the GP-UCB regret rate while recovering the minimal-$\ell_0$ solution -- a guarantee not provided by prior sparse-BO methods. Across many real-world applications, we empirically find that BONSAI substantially reduces the number of non-default parameters in recommended configurations while maintaining competitive optimization performance, with little effect on wall time -- averaging only $1.5\times$ the candidate-generation cost of standard BO, compared to $7$-$34\times$ on average for prior sparse-BO methods (IR, ER, and SEBO).
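The idea of pruning deviations from a default under an explicit acquisition-loss budget can be sketched with a greedy coordinate reset. This is a swapped-in illustration, not BONSAI's actual policy, which the abstract does not specify; the function `prune_to_default`, the `max_loss` budget, and the toy acquisition function are all hypothetical. Starting from a candidate proposed by an ordinary acquisition optimizer, each coordinate is reset to its default if the total drop in acquisition value, measured against the original candidate, stays within the budget.

```python
def prune_to_default(candidate, default, acquisition, max_loss):
    """Greedily reset coordinates to their defaults.

    A reset is kept only if the acquisition value stays within `max_loss`
    of the value at the original (unpruned) candidate.
    """
    reference = acquisition(candidate)
    pruned = list(candidate)
    for i in range(len(pruned)):
        if pruned[i] == default[i]:
            continue  # already at the default
        trial = pruned[:i] + [default[i]] + pruned[i + 1:]
        if reference - acquisition(trial) <= max_loss:
            pruned = trial
    return pruned


def toy_acquisition(x):
    """Made-up acquisition: x[0] matters a lot, x[1] barely, x[2] not at all."""
    return -(x[0] - 2.0) ** 2 - 0.001 * (x[1] - 1.0) ** 2


default = [0.0, 0.0, 0.0]
candidate = [2.0, 1.0, 0.7]  # what an unconstrained optimizer might propose

sparse = prune_to_default(candidate, default, toy_acquisition, max_loss=0.01)
non_default = sum(1 for s, d in zip(sparse, default) if s != d)
print(sparse, non_default)
```

Measuring the loss against the original candidate rather than the previous trial keeps the cumulative loss bounded by `max_loss`, at the price of making the result depend on the order in which coordinates are visited.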

2602.07052 2026-05-12 cs.CV eess.IV

Markerless Head Tracking for Accurate and Accessible Neuronavigation

Ziye Xie, Oded Schlesinger, Raj Kundu, Jessica Y. Choi, Pablo Iturralde, Dennis A. Turner, Stefan M. Goetz, Guillermo Sapiro, Angel V. Peterchev, J. Matias Di Martino

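AI summary This paper proposes markerless head tracking to make neuronavigation more accurate and accessible. The approach uses low-cost visible and infrared light cameras with stereo and depth sensing, together with algorithmic modeling of facial geometry, in place of conventional systems that depend on physical markers. In a study with 50 subjects, the median tracking discrepancy was only 2.32 mm and 2.01 degrees, which is accurate enough for clinical uses such as transcranial magnetic stimulation and a substantial improvement over earlier markerless methods.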

Abstract

Neuronavigation is widely used in biomedical research and interventions to guide the precise placement of instruments around the head to support procedures such as transcranial magnetic stimulation. Traditional systems, however, rely on subject-mounted markers that require manual registration, may shift during procedures, and can cause discomfort. We introduce and evaluate markerless approaches that replace expensive hardware and physical markers with low-cost visible and infrared light cameras incorporating stereo and depth sensing, combined with algorithmic modeling of the facial geometry. Validation with 50 human subjects yielded a median tracking discrepancy of only 2.32 mm and 2.01$^\circ$ for the best markerless algorithm compared to a conventional marker-based system, which indicates sufficient accuracy for transcranial magnetic stimulation and a substantial improvement over prior markerless results. The study also suggests that integration of the data from the various camera sensors can improve the overall accuracy further. The proposed markerless neuronavigation methods can reduce setup cost and complexity, improve patient comfort, and expand access to neuronavigation in clinical and research settings.

2602.06457 2026-05-12 cs.LG math.OC

Achieving Better Local Regret Bound for Online Non-Convex Bilevel Optimization

Tingkai Jia, Haiguang Wang, Cheng Chen

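AI summary This paper studies online bilevel optimization with the aim of improving its local regret bounds. The authors propose two algorithms, one for the standard bilevel local regret and one for the window-averaged bilevel local regret, establish optimal regret bounds for both, and introduce an adaptive iteration strategy and a window-based analysis that strengthen the theoretical guarantees and practical performance. Experiments validate the theory and demonstrate the effectiveness of the proposed methods.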

Comments add a synthetic experiment

Abstract

Online bilevel optimization (OBO) has emerged as a powerful framework for many machine learning problems. Prior works have developed several algorithms that minimize the standard bilevel local regret or the window-averaged bilevel local regret of the OBO problem, but the optimality of existing regret bounds remains unclear. In this work, we establish optimal regret bounds for both settings. For standard bilevel local regret, we propose an algorithm with an adaptive iteration strategy that achieves the optimal regret $\Omega(1+V_T)$ with at most $O(T\log T)$ total inner-level gradient evaluations. We further develop a fully single-loop algorithm whose regret bound includes additional gradient-variation terms. For the window-averaged bilevel local regret, we design an algorithm that captures linear environmental variation through a novel window-based analysis and achieves the optimal regret $\Omega(T/W^2)$. The algorithm also supports an efficient single-loop structure, achieving an $O(T/W)$ regret bound with $O(WT)$ total gradient evaluations. Experiments validate our theoretical findings and demonstrate the practical effectiveness of the proposed methods.

2602.06286 2026-05-12 cs.AI

When Agents Say One Thing and Do Another: Validating Elicited Beliefs from LLMs

Khurram Yamin, Jingjing Tang, Santiago Cortes-Gomez, Amit Sharma, Eric Horvitz, Bryan Wilder

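AI summary This study asks whether large language models (LLMs) act on a coherent set of beliefs when making decisions. It proposes a decision-theoretic framework that elicits both probability judgments and decisions from a model and tests whether the two are consistent. The study finds that models' reported beliefs do not fully match their decisions, but the discrepancies are small for the strongest models, suggesting that their beliefs can to some extent be treated as an approximately rational basis for their decisions.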

Abstract

Large language models (LLMs) are increasingly deployed in high-stakes settings where good decisions require forming beliefs over the probability of unknown outcomes. However, it is unclear whether LLMs act as if they hold coherent beliefs when making decisions, or if so, how we could validate models' reports of such beliefs. We propose a decision-theoretic framework that elicits both probability judgments and decisions from an agent and tests their mutual consistency. Formally, our methods characterize whether it is possible for the actions to be produced by a "near-rational" decision maker who holds the elicited probability as their true belief. We show that, perhaps surprisingly, this formalization implies empirically testable conditions even without any assumption about the agent's utility function. Applying our framework to stylized clinical diagnosis tasks, we find that models' reported beliefs are demonstrably imperfect summaries of the information revealed in their decisions, but that the discrepancies are small for the strongest models.

2602.04811 2026-05-12 cs.CL cs.AI cs.LG

SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu

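AI summary SE-Bench is a benchmark environment for evaluating agents' capacity for self-evolution. It obfuscates the NumPy library and its API documentation with randomized identifiers and evaluates models without access to the documentation, so they must learn and apply knowledge that is new to them. The study reveals three key findings: the Open-Book Paradox in training, the limitations of reinforcement learning, and the viability of self-play for knowledge internalization. SE-Bench offers a rigorous testbed for studying lifelong learning and knowledge internalization in agents.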

Comments Under review

Abstract

True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.

2602.04549 2026-05-12 cs.CV

Nix and Fix: Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models

Cem Eteke, Enzo Tartaglione

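AI summary This paper proposes NiFi, a method for extreme compression of 3D Gaussian Splatting (3DGS) using diffusion models, targeting compression ratios of up to 1000x. It uses diffusion-based one-step distillation to repair the artifacts that compression introduces, preserving strong visual quality at extremely low bitrates. The work strikes a notable balance between compression efficiency and perceptual quality, opening a path to 3D content in bandwidth-constrained applications.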

Abstract

3D Gaussian Splatting (3DGS) revolutionized novel view rendering. Instead of inferring from dense spatial points, as implicit representations do, 3DGS uses sparse Gaussians. This enables real-time performance but increases space requirements, hindering rate-constrained applications. 3DGS compression emerged as a field aimed at alleviating this issue. While impressive progress has been made, at low rates, compression introduces artifacts that degrade visual quality significantly. We introduce NiFi, a method for extreme 3DGS compression through restoration via artifact-aware, diffusion-based one-step distillation. We show that our method achieves state-of-the-art perceptual quality at extremely low rates, down to 0.1 MB, and towards 1000x rate improvement over 3DGS at comparable perceptual performance. Code is available at: https://github.com/ceteke/nifi

2602.04054 2026-05-12 cs.LG cs.CV

SEIS: Subspace-based Equivariance and Invariance Scores for Neural Representations

Huahua Lin, Katayoun Farrahi, Xiaohao Cai

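AI summary This paper proposes Subspace-based Equivariance and Invariance Scores (SEIS) for analyzing the feature representations of neural networks under geometric transformations, distinguishing equivariance from invariance without labels or explicit knowledge of the transformation. The study finds that convolutional encoders shift with depth from strong equivariance toward invariance, whereas in segmentation decoders equivariance recovers in later layers. In addition, data augmentation and multi-task learning strengthen equivariance and invariance simultaneously, while Transformer-based and MLP-Mixer models exhibit different geometric properties.

Details
English abstract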
Understanding how neural representations respond to geometric transformations is essential for evaluating whether learned features preserve meaningful spatial structure. Existing approaches assess robustness primarily by comparing model outputs under transformed inputs, offering limited insight into how geometric information is organized within internal representations and failing to distinguish between information loss and re-encoding. In this work, we introduce SEIS (Subspace-based Equivariance and Invariance Scores), a subspace metric for analyzing layer-wise feature representations under geometric transformations, disentangling equivariance from invariance without requiring labels or explicit knowledge of the transformation. Through controlled experiments across diverse architectures, we uncover several consistent patterns. First, convolutional encoders exhibit a depth-wise transition from strong equivariance to increasing invariance, with both properties stabilizing within the first few training epochs. In segmentation decoders, however, equivariance tends to recover in later layers. Second, this trade-off is not intrinsic but is shaped by training decisions: data augmentation actively strengthens both equivariance and invariance simultaneously, and multi-task learning induces synergistic gains in both properties beyond what either task achieves alone. Extending our analysis beyond convolutional networks, we find that transformer-based models exhibit distinct geometric behaviors, while MLP-Mixers display intermediate characteristics.

2602.03677 2026-05-12 cs.CL

Instruction Anchor: Dissecting the Mechanistic Dynamics of Modality Arbitration

Yu Zhang, Mufan Xu, Xuefeng Bai, Kehai Chen, Pengfei Zhang, Yang Xiang, Min Zhang

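AI summary This study examines the mechanism by which multimodal large language models selectively use multimodal information according to user instructions. By analyzing information flow through the attention layers, it finds that instruction tokens act as structural anchors in modality arbitration: shallow attention layers aggregate multimodal information onto the instruction tokens, while deep attention layers selectively strengthen the relevant modality according to the instruction's intent. Experiments show that intervening on only a small number of key attention heads significantly affects modality-following ability, validating the mechanism and providing a theoretical basis for improving instruction-guided multimodal integration.

Comments Modality Following

Details
English abstract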

Modality following is the ability to selectively leverage multimodal contexts based on user instructions. It is fundamental to the safety and reliability of multimodal large language models (MLLMs) in real-world deployments. However, the internal mechanisms governing this decision-making process remain largely under-explored. In this work, we investigate the mechanism underlying modality following from an information-flow perspective. Our findings reveal that instruction tokens serve as structural anchors for modality arbitration: shallow attention layers perform undifferentiated information transfer, aggregating multimodal cues to instruction tokens as a latent buffer; in contrast, deep attention layers selectively strengthen the instruction-compliant subspace and resolve modality arbitration according to the instruction-specified intent, with a sparse subset of attention heads driving this process. Targeted attention-head interventions further validate the functional specificity of these heads: blocking only $5\%$ of the identified heads substantially degrades modality following while preserving general visual and language capabilities, whereas targeted amplification can restore failed modality-following samples by up to approximately $60\%$. Together, this work provides a mechanistic account of modality following and informs future efforts to improve how MLLMs integrate and utilize multimodal evidence under user instructions.

2602.03190 2026-05-12 cs.LG cs.AI cs.CL

PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning

Wenquan Lu, Hai Huang, Enqi Liu, Randall Balestriero

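AI summary This study proposes PrAg-PO, a policy optimization method aimed at improving the robustness and diversity of large language models on mathematical reasoning tasks. By mixing different prompt templates during training and combining them with template-specific format rewards, PrAg-PO encourages the model to generate reasoning traces under diverse instructions and output formats, which increases the diversity and stability of its reasoning. Experiments show that, compared with existing methods, PrAg-PO achieves higher reasoning accuracy on several mathematics benchmarks and effectively avoids premature collapse during training.

Details
English abstract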

Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have shown strong potential for improving the mathematical reasoning capabilities of large language models. While a growing body of work seeks to improve training entropy, rollout diversity, and exploration, most existing methods still train models with a single fixed reasoning prompt or template, which can encourage prompt-specific overfitting and unstable training dynamics. In this work, we introduce Prompt Augmented Policy Optimization (PrAg-PO), a simple policy optimization method that mixes prompt templates with template-specific format rewards during training. By encouraging models to generate reasoning traces under diverse instructions and output formats, PrAg-PO increases rollout diversity and improves robustness. Compared with GRPO and DAPO, PrAg-PO achieves significantly higher reasoning accuracy while mitigating premature training collapse. Empirically, experiments on DeepSeek-R1-Distill-Qwen-1.5B, Qwen2.5-Math-1.5B, and Qwen3-1.7B show that PrAg-PO consistently outperforms strong baselines and achieves competitive performance against recent methods on mathematics benchmarks, using only a fixed MATH Level 3-5 training set of 8.5K problems. The code and model checkpoints are available at https://github.com/wenquanlu/PrAg-PO.
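
Mixing prompt templates with template-specific format rewards can be sketched roughly as follows. Everything here is an illustrative assumption rather than the PrAg-PO implementation: the three templates, their regular expressions, the `format_bonus` value, and the function names `sample_prompt` and `reward` are hypothetical.

```python
import random
import re

# Hypothetical templates: each pairs an instruction with the regex that a
# well-formatted response to that instruction must match.
TEMPLATES = [
    ("Put the final answer in \\boxed{}.",
     re.compile(r"\\boxed\{([^{}]+)\}")),
    ("End with 'Answer: <value>'.",
     re.compile(r"Answer:\s*(\S+)\s*$")),
    ("Wrap the final answer in <answer></answer> tags.",
     re.compile(r"<answer>(.+?)</answer>", re.S)),
]


def sample_prompt(question, rng):
    """Pick one template at random and attach its instruction to the question."""
    idx = rng.randrange(len(TEMPLATES))
    instruction, _ = TEMPLATES[idx]
    return idx, question + "\n" + instruction


def reward(idx, response, gold, format_bonus=0.1):
    """Score a response against the format of the template it was sampled with.

    A response that ignores the requested format earns nothing, even when it
    would have matched another template's format.
    """
    _, pattern = TEMPLATES[idx]
    match = pattern.search(response)
    if match is None:
        return 0.0
    answer = match.group(1).strip()
    return format_bonus + (1.0 if answer == gold else 0.0)


rng = random.Random(0)
idx, prompt = sample_prompt("What is 6 * 7?", rng)
```

Because the reward is tied to the sampled template, a rollout cannot score by memorizing a single output format, which is one plausible reading of how template mixing discourages prompt-specific overfitting.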

2602.02821 2026-05-12 cs.CL cs.IT math.IT

When Efficient Communication Explains Convexity

Ashvin Ranjan, Shane Steinert-Threlkeld

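AI summary This paper examines the causes of linguistic diversity from the perspective of efficient communication, focusing on regularities in how meaning is expressed in semantic typology. Using the Information Bottleneck (IB) approach, it analyzes the relationship between convexity and communicative optimality, and finds that the convexity of the communicative need distribution plays a key role in driving that relationship. The result not only confirms that efficient communication can explain phenomena in semantic typology but also identifies the underlying factors responsible.

Details
English abstract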

Much recent work has argued that the variation in the languages of the world can be explained from the perspective of efficient communication; in particular, languages can be seen as optimally balancing competing pressures to be simple and to be informative. Focusing on the expression of meaning -- semantic typology -- the present paper asks what factors are responsible for successful explanations in terms of efficient communication. Using the Information Bottleneck (IB) approach to formalizing this trade-off, we first demonstrate and analyze a correlation between optimality in the IB sense and a novel generalization of convexity to this setting. In a second experiment, we manipulate various modeling parameters in the IB framework to determine which factors drive the correlation between convexity and optimality. We find that the convexity of the communicative need distribution plays an especially important role. These results move beyond showing that efficient communication can explain aspects of semantic typology into explanations for why that is the case by identifying which underlying factors are responsible.
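
For orientation, the simplicity/informativeness trade-off that the IB approach formalizes is commonly written as the objective below. The notation is an assumption taken from the standard IB literature on semantic typology, not from this abstract: $M$ is the speaker's meaning, $W$ the word chosen, $U$ the world state being communicated, $q(w \mid m)$ the naming system, and $\beta$ the trade-off parameter.

```latex
\min_{q(w \mid m)} \; F_\beta[q] \;=\; I_q(M; W) \;-\; \beta \, I_q(W; U)
```

Here $I_q(M; W)$ measures complexity (the pressure to be simple) and $I_q(W; U)$ measures accuracy (the pressure to be informative). The paper's generalization of convexity to this setting is not specified in the abstract and is not reproduced here.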

2602.02494 2026-05-12 cs.LG q-bio.NC

MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training

Dulhan Jayalath, Oiwi Parker Jones

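AI summary This paper proposes MEG-XL, a brain-to-text interface model, to address the problem that paralysed patients cannot readily use existing systems because they cannot supply large amounts of training data. The model uses long-context pre-training with 2.5 minutes of MEG signal per sample, a context 5-300x longer than in prior work, and so captures long-range dependencies in neural activity more effectively. Experiments show that MEG-XL reaches decoding performance comparable to or better than conventional supervised methods with little data, demonstrating the effectiveness of long-context pre-training for brain-computer interface tasks.

Comments Published as a conference paper at ICML 2026. 19 pages, 8 figures, 5 tables

Details
English abstract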

Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at https://github.com/neural-processing-lab/MEG-XL .

2602.02045 2026-05-12 cs.LG

Outlier-robust Diffusion Posterior Sampling for Bayesian Inverse Problems

Yiming Yang, Xiaoyuan Cheng, Yi He, Kaiyu Li, Wenxuan Yuan, Zhuo Sun

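AI summary This paper studies the robustness of diffusion models to outliers in Bayesian inverse problems, noting that misspecification of the observation likelihood significantly degrades recovery performance, especially when outliers are present. To address this, the authors propose robust diffusion posterior sampling, which is provably outlier-robust for linear inverse problems and compatible with existing gradient-based posterior samplers. Experiments show stronger robustness and performance gains on both scientific inverse problems and natural image tasks.

Details
English abstract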

Diffusion models have emerged as powerful learned priors for Bayesian inverse problems (BIPs). Diffusion-based solvers rely on a presumed likelihood for the observations in BIPs to guide the generation process. Likelihood misspecification is common in practical BIPs and is known to degrade recovery performance, particularly under outlier contamination. We investigate this problem by first characterizing the induced posterior deviation and proving the stability of diffusion-based solvers for linear BIPs. Our stability analysis further reveals potential robustness deficiencies of existing diffusion-based solvers under outlier-contaminated measurements. To address this issue, we propose a simple yet effective solution: robust diffusion posterior sampling, which is provably outlier-robust for linear BIPs and compatible with existing gradient-based posterior samplers. Empirical results from scientific inverse problems and natural image tasks demonstrate the effectiveness and robustness of our method, with consistent performance gains in challenging scenarios involving outlier contamination for both linear and nonlinear tasks.

2602.01977 2026-05-12 cs.CL

Beyond Local Edits: Embedding-Virtualized Knowledge for Broader Evaluation and Preservation of Model Editing

Shuainan Liu, Xuanang Chen, Ben He, Le Sun

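AI summary This paper proposes Embedding-Virtualized Knowledge (EVK), a new approach for more comprehensively evaluating and preserving the effects of edits to large language models. By introducing controlled perturbations in embedding space, EVK explores a broader knowledge region beyond explicit data annotations, and the authors build an embedding-level evaluation benchmark, EVK-Bench, to quantify the potential knowledge drift that editing induces. The paper also proposes a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing, improving knowledge preservation while maintaining editing accuracy.

Comments We voluntarily withdraw this manuscript. Extensive post-submission testing shows the method lacks the originally reported generality and effectiveness. The benchmark metrics originally designed are inadequate for assessing existing model editing algorithms. To avoid misleading the community, we have decided to withdraw this paper and will not release an updated version.

Details
English abstract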

Knowledge editing methods for large language models are commonly evaluated using predefined benchmarks that assess edited facts together with a limited set of related or neighboring knowledge. While effective, such evaluations remain confined to finite, dataset-bounded samples, leaving the broader impact of editing on the model's knowledge system insufficiently understood. To address this gap, we introduce Embedding-Virtualized Knowledge (EVK) that characterizes model knowledge through controlled perturbations in embedding space, enabling the exploration of a substantially broader and virtualized knowledge region beyond explicit data annotations. Based on EVK, we construct an embedding-level evaluation benchmark EVK-Bench that quantifies potential knowledge drift induced by editing, revealing effects that are not captured by conventional sample-based metrics. Furthermore, we propose a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing and can be seamlessly integrated into existing editing methods. Experiments demonstrate that our approach enables more comprehensive evaluation while significantly improving knowledge preservation without sacrificing editing accuracy.

2602.01442 2026-05-12 cs.LG cs.AI cs.CL

Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers

Donald Ye

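AI summary This study shows that gradient-based attribution methods exhibit a systematic bias in Transformer models: they overvalue early-layer "Gradient Bloat" components and undervalue late-layer "Hidden Hero" components. Causal experiments show that gradient attribution does not accurately reflect the causal importance of individual components, so attribution rankings diverge sharply from actual functional impact. The study attributes this bias to the difficulty gradient methods have in detecting redundancy among components, which poses new challenges for model interpretation and circuit-level analysis.

Comments 9 pages, 6 figures, under review at ICML 2026 Workshop on Mechanistic Interpretability

Details
English abstract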

Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks and up to 10 random seeds, uncovering a systematic, layer-wise failure: gradient attribution consistently overvalues early-layer Gradient Bloats and undervalues late-layer Hidden Heroes. Rank correlation collapses from $\rho = 0.72$ on sequence reversal to $0.27$ on sequence sorting, reaching $\rho = -0.18$ in individual seeds. This failure stems from first-order gradient attribution's inability to detect collective redundancy: joint Bloat ablation causes $14\times$ greater damage than individual results predict. Consequently, Bloats dominate gradient rankings despite negligible functional impact, while ablating Hidden Heroes destroys OOD accuracy ($-36.4\% \pm 22.8\%$). This systematic inversion of early-layer feature extraction and late-layer computation motivates causal validation as a prerequisite for circuit-level claims.
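
The rank correlation between an attribution ranking and a causal (ablation) ranking can be computed as in the sketch below. It assumes that $\rho$ denotes Spearman's rank correlation, which the abstract does not state, and the component scores are made-up numbers chosen to show a fully inverted ranking, not results from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical scores for five components, ordered from early to late layers.
gradient_score = [0.9, 0.7, 0.5, 0.2, 0.1]       # attribution says: early layers matter
ablation_damage = [0.05, 0.10, 0.30, 0.60, 0.80]  # ablation says: late layers matter

rho = spearman(gradient_score, ablation_damage)
```

In this toy case the two rankings are exact opposites, so `rho` is -1.0; values near zero or below, such as those the abstract reports for sequence sorting, indicate that the gradient ranking carries little or misleading information about causal importance.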

2602.01219 2026-05-12 cs.LG cs.CV

Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights

Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li

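AI summary This paper proposes Mixture-of-Top-k Attention (MiTA), an efficient attention mechanism intended to address the scalability problem of standard Transformer self-attention on long sequences. The method introduces a small set of landmark queries that dynamically select the k most relevant key-value pairs as deformable experts, and compresses the wide hidden layer into a shared expert, improving computational efficiency while preserving expressive capacity. Experiments show that MiTA delivers superior effectiveness and efficiency on vision tasks and exhibits new properties such as automatic token pruning and easy generalization from standard attention.

Comments Code is available at https://github.com/QishuaiWen/MiTA

Details
English abstract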

The vanilla self-attention mechanism in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically induced by inputs and whose hidden dimension is equal to the sequence length $N$. As the context extends, the expressive capacity of such an $N$-width MLP increases, but it becomes unscalable for extremely long sequences. Recently, this fast-weight perspective has motivated the Mixture-of-Experts (MoE) attention mechanism, which partitions the sequence into rigid blocks, treats them as fast-weight experts, and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for efficient attention mechanisms, interpreting them as making fast weights scalable through either routing or compression, and organizing them into a five-dimensional taxonomy. Then, we propose Mixture-of-Top-$k$ Attention (MiTA), which employs a small set of landmark queries to gather top-$k$ attended key-value pairs as query-aware and deformable routed experts, while compressing the $N$-width MLP into a narrower shared expert. Consequently, our MiTA improves the flexibility of prior MoE attention from rigid to deformable fast-weight experts, as well as the scalability of prior top-$k$ attention from a query-specific set to a reusable top-$k$ set. We conduct extensive experiments on vision tasks showing the superior effectiveness and efficiency of our MiTA, and also uncovering intriguing properties such as an emergent token-pruning effect and easy generalization from standard attention. Code is available at https://github.com/QishuaiWen/MiTA.
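
The fast-weight reading of attention, and the simplest way of narrowing it, can be sketched in plain Python. This shows single-query attention as a two-layer MLP of hidden width $N$, followed by ordinary per-query top-$k$ attention; it is not MiTA itself, since it has no landmark queries, reusable top-$k$ sets, or shared expert, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query attention viewed as a two-layer fast-weight MLP.

    Layer 1 uses the keys as weights and softmax as the activation, giving a
    hidden vector of width N = len(keys); layer 2 uses the values as weights.
    """
    scale = 1.0 / math.sqrt(len(query))
    hidden = softmax([scale * dot(k, query) for k in keys])
    dim = len(values[0])
    return [sum(h * v[j] for h, v in zip(hidden, values)) for j in range(dim)]


def topk_attention(query, keys, values, k):
    """Attend only to the k keys that score highest for this query."""
    scores = [dot(key, query) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return attention(query, [keys[i] for i in keep], [values[i] for i in keep])


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [1.0, 1.0]

full = attention(query, keys, values)
top1 = topk_attention(query, keys, values, k=1)
```

With `k` equal to the number of keys, `topk_attention` reproduces full attention; with `k=1` it returns the value paired with the best-matching key. The per-query selection step is the cost that, according to the abstract, MiTA avoids by reusing a top-$k$ set gathered by landmark queries.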

2602.01194 2026-05-12 cs.CV

EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting

Hao Chen, Tao Han, Jie Zhang, Song Guo, Fenghua Ling, Lei Bai

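AI summary This paper proposes EMFormer, an Efficient Multi-scale Transformer, to improve the accuracy and efficiency of long-term weather forecasting. The method extracts multi-scale features with a single convolution and combines this with an accumulative context finetuning strategy and a dynamic composite loss, mitigating catastrophic forgetting and error accumulation in long-term forecasting. Experiments show that EMFormer performs well in weather forecasting and extreme event prediction, generalizes well on vision benchmarks, and is 5.69x faster than conventional multi-scale modules.

Comments This paper has been accepted by ICML 2026

Details
English abstract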

Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long-context modeling while reducing computational overhead. First, we introduce an Efficient Multi-scale Transformer (EMFormer) to extract multi-scale features through a single convolution in both training and inference. Based on the new architecture, we further employ accumulative context finetuning to improve temporal consistency without degrading short-term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves strong performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69x speedup over conventional multi-scale modules. Code: https://github.com/chenhao-zju/emformer

2602.01015 2026-05-12 cs.CL cs.CY

Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

Conrad Borchers, Jill-Jênn Vie, Roger Azevedo

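AI summary This study examines the limitations of large language models (LLMs) in simulating learner reasoning and metacognitive judgments, evaluating them on 630 think-aloud utterances from multi-step chemistry problems. It finds that although GPT-4.1 generates fluent and contextually appropriate reasoning, that reasoning is overly coherent, verbose, and less variable than the thinking of real learners. The study attributes the gap to LLM training data lacking the expressions of affect and working-memory constraints found in real learning, revealing epistemic limitations in simulating learning with current LLMs.

Comments Manuscript under review

Details
English abstract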

Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.