arXivDaily: arXiv daily academic digest, updated Monday through Friday
2512.18248 2026-05-12 cs.LG

On the Convergence Rate of LoRA Gradient Descent

Siqiao Mu, Diego Klabjan

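AI summary: This paper studies the convergence rate of the original LoRA gradient descent algorithm, which is widely used for fine-tuning large models because it is computationally efficient and performs well. Because LoRA lacks Lipschitz smoothness, its convergence is hard to analyze, and existing theory mostly relies on strong assumptions or covers only asymptotic behavior. The paper gives the first non-asymptotic convergence analysis of LoRA gradient descent without these assumptions, proves a convergence rate of $O\left(\frac{1}{\log T}\right)$, and validates the theory with numerical experiments.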

Comments ICML 2026

Abstract

The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two "adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the original LoRA gradient descent algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the "Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate $O(\frac{1}{\log T})$, where $T$ is the number of iterations. We conduct numerical experiments to validate our theoretical findings.
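As a rough illustration of the setting this abstract describes, the sketch below runs plain gradient descent on two adapter matrices for a toy least-squares objective. The objective, dimensions, step size, and the names `lora_gd` and `stacked_outer_product` are assumptions made for illustration and are not taken from the paper. The stacking shown is one natural reading of "the outer product of the stacked adapter matrices"; the paper's exact construction may differ.

```python
import numpy as np


def lora_gd(W0, W_target, rank=2, lr=0.01, steps=500, seed=0):
    """Gradient descent on adapters B, A for 0.5 * ||W0 + B A - W_target||_F^2.

    Toy objective chosen for illustration only (assumption, not from the paper).
    """
    rng = np.random.default_rng(seed)
    m, n = W0.shape
    B = np.zeros((m, rank))                    # zero init for B, as in common LoRA practice
    A = rng.normal(scale=0.1, size=(rank, n))  # small random init for A
    losses = []
    for _ in range(steps):
        R = W0 + B @ A - W_target              # residual of the adapted weights
        losses.append(0.5 * np.sum(R ** 2))
        grad_B = R @ A.T                       # gradient with respect to B
        grad_A = B.T @ R                       # gradient with respect to A
        # Each gradient scales with the other adapter, so the gradient map is
        # not globally Lipschitz in (B, A).
        B, A = B - lr * grad_B, A - lr * grad_A
    return B, A, losses


def stacked_outer_product(B, A):
    """Return Z Z^T for the stacked matrix Z = [B; A^T]."""
    Z = np.vstack([B, A.T])
    return Z @ Z.T


rng = np.random.default_rng(1)
W0 = rng.normal(size=(6, 5))
delta = 0.5 * rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))  # rank-2 target update
B, A, losses = lora_gd(W0, W0 + delta)
X = stacked_outer_product(B, A)
# The top-right block of Z Z^T is exactly the low-rank update B A.
print(losses[0], losses[-1], np.allclose(X[:6, 6:], B @ A))
```

The block structure of `X` is why a function of the product `B A` can be rewritten as a function of `Z Z^T`; how the paper exploits this in its descent lemma is not reproduced here.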

2512.11470 2026-05-12 cs.LG cs.CL

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

Bowen Ding, Yuhan Chen, Jiayang Lyv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin

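AI summary: This study examines how to use expert trajectories more effectively when post-training large language models for mathematical reasoning. The authors propose the Plasticity-Ceiling framework, which decomposes final performance into foundational supervised fine-tuning (SFT) performance and subsequent reinforcement learning (RL) plasticity, and thereby shows the performance advantage of the sequential SFT-then-RL pipeline. The study also gives concrete training guidance: transition to RL when SFT is in the stable or mild-overfitting regime, data scale determines post-training potential, and validation loss is a reliable indicator for selecting expert trajectories.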

Comments ACL-26, Main Conference

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling by securing a robust SFT foundation with substantial RL plasticity; (2) Refuting the ``Less is More'' hypothesis in SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) The Minimum Validation Loss of SFT serves as a reliable indicator for selecting the expert trajectories that maximize the ultimate performance ceiling. Our findings provide actionable guidelines for extracting maximum value from expert trajectories.

2512.06571 2026-05-12 cs.RO

Learning Agile Striker Skills for Humanoid Soccer Robots from Noisy Sensory Input

Zifan Xu, Myoungkyu Seo, Dongmyeong Lee, Hao Fu, Jiaheng Hu, Jiaxun Cui, Yuqian Jiang, Zhihan Wang, Anastasiia Brund, Joydeep Biswas, Peter Stone

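AI summary: This paper studies how humanoid soccer robots can learn fast, robust kicking skills under noisy sensory input. The authors propose a reinforcement-learning-based system that extends the teacher-student training framework with four training stages, so that the robot adapts to different ball-goal configurations and kicks continually. The method combines tailored reward functions, realistic noise modeling, and online constrained reinforcement learning; it narrows the sim-to-real gap and shows strong kicking accuracy and goal-scoring success both in simulation and on a real robot.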

Abstract

Learning fast and robust ball-kicking skills is a critical capability for humanoid soccer robots, yet it remains a challenging problem due to the need for rapid leg swings, postural stability on a single support foot, and robustness under noisy sensory input and external perturbations (e.g., opponents). This paper presents a reinforcement learning (RL)-based system that enables humanoid robots to execute robust continual ball-kicking with adaptability to different ball-goal configurations. The system extends a typical teacher-student training framework -- in which a "teacher" policy is trained with ground truth state information and the "student" learns to mimic it with noisy, imperfect sensing -- by including four training stages: (1) long-distance ball chasing (teacher); (2) directional kicking (teacher); (3) teacher policy distillation (student); and (4) student adaptation and refinement (student). Key design elements -- including tailored reward functions, realistic noise modeling, and online constrained RL for adaptation and refinement -- are critical for closing the sim-to-real gap and sustaining performance under perceptual uncertainty. Extensive evaluations in both simulation and on a real robot demonstrate strong kicking accuracy and goal-scoring success across diverse ball-goal configurations. Ablation studies further highlight the necessity of the constrained RL, noise modeling, and the adaptation stage. This work presents a system for learning robust continual humanoid ball-kicking under imperfect perception, establishing a benchmark task for visuomotor skill learning in humanoid whole-body control.

2511.19279 2026-05-12 cs.LG cs.CL

MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

Victor Rambaud, Salvador Mascarenhas, Yair Lakretz

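AI summary: This study proposes MapFormer, a new Transformer architecture that learns cognitive maps from observational data through self-supervised learning, aiming for the strong generalization seen in humans and animals. Its core idea is to use input-dependent positional encoding matrices to disentangle input content from structural relationships, which captures abstract relations and supports path integration. Experiments show that MapFormer significantly outperforms existing models on several cognitive tasks, achieves near-perfect out-of-distribution generalization, and also performs better on naturalistic data, indicating good scalability.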

Comments 19 pages (29 with appendix), 8 figures

Abstract

A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new Transformer-based architectures, which can learn cognitive maps from observational data and perform path-integration without supervision. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved by updating position encodings with input-dependent matrices, built as exponentials of learned combinations of Lie-algebra generators. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several formal tasks targeting distinct cognitive capacities, including gating, 2D navigation and nested hierarchies (Dyck Languages). Our results demonstrate that MapFormers significantly outperform current AI architectures, achieving near-perfect OOD generalization where standard models fail. Furthermore, we show that MapFormers are scalable; evaluations on naturalistic data yield perplexity improvements over baselines, suggesting that these principles extend to large-scale, real-world domains. These results are obtained through efficient parallel computation on commutative maps, though our models can also learn non-commutative cognitive maps via sequential path-integration. Overall, these results suggest that input-dependent matrices provide a critical structural bias, by disentangling abstract relations from content in order to drive robust OOD generalization.

2511.18374 2026-05-12 cs.RO cs.SY eess.SY math.DS

Explicit Bounds on the Hausdorff Distance for Truncated mRPI Sets via Norm-Dependent Contraction Rates

Jiaxun Sun, Hengyu Xue, Yuyang Zhao

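AI summary: This paper studies explicit upper bounds on the Hausdorff distance between a truncated minimal robust positively invariant (mRPI) set and its infinite-horizon limit. Using an induced-norm contraction factor of the system matrix and a size measure of the disturbance set, it proposes a computable closed-form upper bound and an explicit horizon-selection rule that guarantees a prescribed approximation accuracy without iterative computation. Choosing different vector norms can further tighten the contraction factor and the approximation accuracy, which matters for robust invariant-set approximation and for constraint tightening in tube-based model predictive control.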

Comments 6 pages, 5 figures. Accepted at the 2026 IEEE Conference on Control Technology and Applications (CCTA), Vancouver, BC, Canada, August 12-14, 2026

Abstract

We derive a computable closed-form upper bound on the Hausdorff distance between a truncated minimal robust positively invariant (mRPI) set and its infinite-horizon limit. The bound depends only on a disturbance-set size measure and an induced-norm contraction factor of the system matrix, and it yields an explicit, fully analytic horizon-selection rule that guarantees a prescribed approximation tolerance without iterative set computations. The choice of vector norm enters as a design lever: norm shaping -- through diagonal or Lyapunov-based weighting -- tightens both the contraction factor and the resulting certificate, with direct consequences for robust invariant-set approximation and tube-based model predictive control (MPC) constraint tightening. Numerical examples illustrate the accuracy, scalability, and practical impact of the proposed bound.
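To make the idea of a closed-form bound with an analytic horizon rule concrete, the sketch below uses the textbook geometric-series argument. For the truncated set $F_s = \bigoplus_{i=0}^{s-1} A^i W$ with $\lambda = \|A\| < 1$ in some induced norm and $r_W = \max_{w \in W} \|w\|$, it gives $d_H(F_s, F_\infty) \le \lambda^s r_W / (1 - \lambda)$. This form is an assumption made for illustration; the paper's bound may be tighter or stated differently, and the function names are hypothetical.

```python
import math

import numpy as np


def truncation_bound(A, r_w, s, norm_ord=np.inf):
    """Geometric-series bound lam**s * r_w / (1 - lam) on the truncation error.

    norm_ord should be 1, 2, or np.inf so that np.linalg.norm returns an
    induced matrix norm; r_w must be measured in the matching vector norm.
    """
    lam = np.linalg.norm(A, norm_ord)
    if not 0.0 < lam < 1.0:
        raise ValueError("need an induced norm with 0 < ||A|| < 1")
    return lam ** s * r_w / (1.0 - lam)


def horizon_for_tolerance(A, r_w, eps, norm_ord=np.inf):
    """Smallest s with truncation_bound(A, r_w, s) <= eps, solved in closed form."""
    lam = np.linalg.norm(A, norm_ord)
    if not 0.0 < lam < 1.0:
        raise ValueError("need an induced norm with 0 < ||A|| < 1")
    s = math.log(eps * (1.0 - lam) / r_w) / math.log(lam)
    return max(0, math.ceil(s))


A = np.array([[0.5, 0.1],
              [0.0, 0.4]])          # example system matrix (assumption)
s = horizon_for_tolerance(A, r_w=1.0, eps=1e-3)
print(s, truncation_bound(A, 1.0, s))
```

A different norm changes `lam` and therefore the horizon, which is the design lever the abstract refers to; weighted norms would need the weighting applied to `A` and `r_w` before calling these helpers.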

2511.17879 2026-05-12 cs.LG cs.SD

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Yusong Wu, Stephen Brade, Aleksandra Teng Ma, Tia-Jane Fowler, Enning Yang, Berker Banar, Aaron Courville, Natasha Jaques, Cheng-Zhi Anna Huang

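AI summary: This paper studies how generative adversarial post-training can mitigate reward hacking in reinforcement learning post-training for real-time human-AI music collaboration. The authors propose an adversarial training method that trains on policy-generated trajectories to improve the diversity and adaptivity of melody-to-chord accompaniment generation. Experiments show that the method improves output diversity, harmonic coherence, and users' interactive experience.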

Comments v3: fix the Figure numbering bugs

Abstract

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as ``reward hacking'', affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.

2511.12878 2026-05-12 cs.CV cs.RO

Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng, Erhang Zhang, Xieyuanli Chen, Hesheng Wang

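AI summary: This paper proposes Uni-Hand, a universal hand motion forecasting framework that addresses insufficient prediction targets, modality gaps, entangled hand and head motion, and limited downstream validation in egocentric hand motion forecasting. By fusing vision and language information and introducing global context and task-aware text embeddings, the method performs multi-target prediction of hand keypoints in 2D and 3D space, and it additionally predicts hand-object interaction states to improve downstream task performance. Experiments show that Uni-Hand achieves state-of-the-art forecasting performance on multiple public datasets and newly built benchmarks, and shows strong potential in tasks such as robot policy transfer and action recognition.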

Comments Accepted by T-PAMI 2026. Code and data: https://github.com/IRMVLab/UniHand

Abstract

Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to additionally predict hand-object interaction states (contact/separation) to facilitate downstream tasks better. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves the state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also presents its impressive human-robot policy transfer to enable robotic manipulation, and effective feature enhancement for action anticipation/recognition.

2511.11159 2026-05-12 cs.LG

Adaptive Symmetrization of the KL Divergence

Omri Ben-Dov, Luiz F. O. Chamon

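AI summary: This study addresses the asymmetry of the KL divergence in distribution fitting and proposes a non-adversarial method for minimizing the Jeffreys divergence. It introduces a proxy model to approximate the reverse KL divergence of the main model and jointly optimizes the main and proxy models, which keeps the objective tractable while improving stability and fitting accuracy. Experiments show that the method outperforms maximum likelihood estimation and generative adversarial networks on density estimation and simulation-based inference tasks, particularly when data are scarce.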

Abstract

The forward Kullback-Leibler (KL) divergence is a ubiquitous objective for fitting a parameterized distribution to samples due to its tractability and equivalence to maximum likelihood estimation (MLE). Its inherent asymmetry, however, may lead to degenerate solutions that generalize poorly. While the symmetric Jeffreys divergence offers a more balanced alternative, its optimization is challenging due to the presence of a reverse KL term. Generative adversarial networks (GANs) bypass this intractability using a min-max formulation at the cost of introducing new instability issues. This work proposes a non-adversarial approach to minimize the Jeffreys divergence. To do so, it uses a proxy model to tractably approximate the reverse KL divergence of the main model. The main and proxy models are jointly fitted to the data using a constrained optimization formulation to obtain a practical algorithm that adapts the models' priorities throughout training. We evaluate our framework on various tasks, including density estimation and simulation-based inference, and demonstrate that this approach is more stable and more accurate than MLE and GANs, particularly in low-data regimes.
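The quantities named in this abstract can be illustrated on small discrete distributions. The sketch below computes the forward KL, the reverse KL, and their sum, the Jeffreys divergence. The two example distributions and the function names are assumptions for illustration; the paper's proxy-model approximation of the reverse term is not reproduced here.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions, assuming q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def jeffreys(p, q):
    """Symmetric Jeffreys divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


data = np.array([0.5, 0.4, 0.1])      # stand-in for the data distribution
model = np.array([0.1, 0.3, 0.6])     # stand-in for the fitted model

forward = kl(data, model)   # the term maximum likelihood minimizes
reverse = kl(model, data)   # needs the data density, which samples alone do not give
print(forward, reverse, jeffreys(data, model))
```

In this toy case both distributions are known, so the reverse term is easy to evaluate; with only samples from the data distribution it is not, which is the difficulty the abstract points to.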

2511.06216 2026-05-12 cs.LG

Adaptive Multi-view Graph Contrastive Learning via Fractional-order Neural Diffusion Networks

Yanan Zhao, Feng Ji, Jingyang Dai, Jiaze Ma, Keyue Jiang, Kai Zhao, Wee Peng Tay

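AI summary: This paper proposes an adaptive multi-view graph contrastive learning method based on fractional-order neural diffusion networks. By varying the fractional derivative order $α$, it automatically generates diverse local and global views without manually designed data augmentations. The method learns suitable diffusion scales from the data and produces more expressive and robust graph representations. Experiments show that it outperforms existing graph contrastive learning methods on multiple benchmark datasets.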

Comments Machine learning, diffusion neural networks. arXiv admin note: text overlap with arXiv:2504.16748

Abstract

Graph contrastive learning (GCL) learns node and graph representations by contrasting multiple views of the same graph. Existing methods typically rely on fixed, handcrafted views, usually a local and a global perspective, which limits their ability to capture multi-scale structural patterns. We present an augmentation-free, multi-view GCL framework grounded in fractional-order continuous dynamics. By varying the fractional derivative order $α \in (0,1]$, our encoders produce a continuous spectrum of views: small $α$ yields localized features, while large $α$ induces broader, global aggregation. We treat $α$ as a learnable parameter so the model can adapt diffusion scales to the data and automatically discover informative views. This principled approach generates diverse, complementary representations without manual augmentations. Extensive experiments on standard benchmarks demonstrate that our method produces more robust and expressive embeddings and outperforms state-of-the-art GCL baselines.

2511.01008 2026-05-12 cs.CL

MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

Haolin Yang, Jipeng Zhang, Zhitao He, Alexander Zhou, Yi R. Fung

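AI summary: MARS-SQL is a multi-agent reinforcement learning framework for Text-to-SQL that targets the weaknesses of large language models in logical precision and schema alignment on complex tasks. It decomposes the task into three specialized roles (schema grounding, query generation, and solution validation) and trains with a multi-turn reinforcement learning policy, so that the agent refines its SQL generation step by step by interacting with the database. Experiments show that MARS-SQL achieves leading execution accuracy on multiple benchmark datasets and generalizes well.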

Abstract

Large Language Models (LLMs) often struggle with the precise logic and schema alignment required for complex Text-to-SQL tasks. While current methods rely heavily on static prompting, they lack the ability to dynamically adapt and self-correct through environmental interaction. To bridge this gap, we propose MARS-SQL, a trainable multi-agent framework for Text-to-SQL. Rather than introducing a new standalone SQL primitive, MARS-SQL makes an agentic workflow trainable by decomposing the problem into three specialized roles: schema grounding, query generation, and solution validation. Central to our approach is a generation agent trained via a multi-turn RL policy within a ReAct-style loop. The agent learns to iteratively reason, execute intermediate SQL actions on a live database, and refine its strategy based on execution feedback. To improve robustness, we further introduce a validation mechanism that treats solution selection as a generative modeling task, identifying the optimal interaction trajectory through next-token prediction probabilities. Empirical evaluations demonstrate the effectiveness of coupling interactive learning with trajectory ranking. MARS-SQL achieves state-of-the-art performance, recording an execution accuracy of 77.84% on the BIRD development dataset and 89.75% on the Spider test dataset, while also transferring strongly to out-of-domain benchmarks. Code is available at https://github.com/YangHaolin0526/MARS-SQL.

2511.00560 2026-05-12 cs.CV

4D Neural Voxel Splatting: Dynamic Scene Rendering with Voxelized Gaussian Splatting

Chun-Tin Wu, Jun-Cheng Chen

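AI summary: Although 3D Gaussian Splatting (3D-GS) renders efficiently for novel view synthesis, extending it to dynamic scenes still incurs large memory overhead from replicating Gaussians for every frame. This paper proposes 4D Neural Voxel Splatting (4D-NVS), which combines voxel representations with neural Gaussian splatting to model dynamic scenes efficiently. The method models temporal dynamics with a compact set of neural voxels and learned deformation fields, which significantly reduces memory consumption and speeds up training while preserving high image quality. Experiments show that it outperforms existing methods in memory footprint and training speed and achieves real-time rendering with better visual quality.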

Comments 10 pages, 7 figures

Abstract

Although 3D Gaussian Splatting (3D-GS) achieves efficient rendering for novel view synthesis, extending it to dynamic scenes still results in substantial memory overhead from replicating Gaussians across frames. To address this challenge, we propose 4D Neural Voxel Splatting (4D-NVS), which combines voxel-based representations with neural Gaussian splatting for efficient dynamic scene modeling. Instead of generating separate Gaussian sets per timestamp, our method employs a compact set of neural voxels with learned deformation fields to model temporal dynamics. The design greatly reduces memory consumption and accelerates training while preserving high image quality. We further introduce a novel view refinement stage that selectively improves challenging viewpoints through targeted optimization, maintaining global efficiency while enhancing rendering quality for difficult viewing angles. Experiments demonstrate that our method outperforms state-of-the-art approaches with significant memory reduction and faster training, enabling real-time rendering with superior visual fidelity.

2511.00371 2026-05-12 cs.CL cs.CY cs.SE

Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs

Erfan Al-Hossami, Razvan Bunescu

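AI summary: This paper studies how Socratic debugging can help students find and fix errors in their code on their own. The core problem is generating reasoning trajectories (RTs) that lead students from a misconception to a contradiction and then to an updated belief. The authors introduce the reasoning trajectory generation task, build a dataset with both manually created and LLM-generated trajectories, and design LLM-based solutions for generating the corresponding conversations. Experiments show that large language models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns, indicating the method's potential for programming education.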

Comments 25 pages, 2 tables, 13 figures

Abstract

In Socratic debugging, instructors guide students towards identifying and fixing a bug on their own, instead of providing the bug fix directly. Most novice programmer bugs are caused by programming misconceptions, namely false beliefs about a programming concept. In this context, Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this contradiction, the ensuing cognitive dissonance is expected to lead the student to identify the false belief on their own, followed by an enduring belief update. In this paper, we introduce the task of reasoning trajectory generation, together with a dataset of debugging problems annotated with RTs that are manually created or LLM-generated. We then describe LLM-based solutions for generating RTs and Socratic conversations that are anchored on them. A large-scale LLM-as-judge evaluation shows that large language and reasoning models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.

2510.26067 2026-05-12 cs.RO

Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion

Chi Zhang, Mingrui Li, Wenzhe Tong, Xiaonan Huang

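AI summary: This paper studies how morphology-aware graph reinforcement learning can improve locomotion control of tensegrity robots. The method integrates a graph neural network (GNN) into the Soft Actor-Critic (SAC) algorithm and uses a graph representation of the robot's physical structure to capture coupling among components, giving more efficient and stable locomotion learning. Experiments show that it beats conventional methods in sample efficiency, robustness to noise and stiffness variations, and trajectory accuracy, and that it transfers directly from simulation to hardware without fine-tuning, achieving stable real-world locomotion.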

Comments 8 pages, 10 figures. Project page: https://tensegrity-graph-rl.github.io/

Abstract

Tensegrity robots combine rigid rods and elastic cables, offering high resilience and deployability but at the same time posing major challenges for locomotion control due to their underactuated and highly coupled dynamics. This paper introduces a morphology-aware reinforcement learning framework that integrates a graph neural network (GNN) into the Soft Actor-Critic (SAC) algorithm. By representing the robot's physical topology as a graph, the proposed GNN-based policy captures coupling among components, enabling faster and more stable learning than conventional multilayer perceptron (MLP) policies. The method is validated on a physical 3-bar tensegrity robot across three locomotion primitives, including straight-line tracking and bidirectional turning. It shows superior sample efficiency, robustness to noise and stiffness variations, and improved trajectory accuracy. Additionally, the learned policies transfer directly from simulation to hardware without fine-tuning, achieving stable real-world locomotion. These results demonstrate the advantages of incorporating structural priors into reinforcement learning for tensegrity robot control.

2510.22767 2026-05-12 cs.LG cs.CL

TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination

Omar Naim, Krish Sharma, Niyar R Barman, Nicholas Asher

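AI summary: TELL-TALE is an inference-time, task-aware layer elimination method that aims to improve large language model performance on specific tasks. It removes layers that are irrelevant or detrimental to the task, which optimizes the architecture, raises task performance without retraining, and lowers computational cost. Experiments show that it matches or surpasses baseline performance across many tasks and model families, and that combining it with fine-tuning improves results further, making it practical to deploy.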

Comments ACL 2026 Findings

Abstract

Large Language Models (LLMs) typically come with a fixed architecture, despite growing evidence that not all layers contribute equally to every downstream task. We introduce TALE (Task-Aware Layer Elimination), an inference-time method that improves task performance by selectively removing layers that are irrelevant or detrimental for a given task. TALE optimizes task-specific performance, yielding a task-optimized architecture without retraining. Across 9 tasks and 5 model families, under both zero-shot and few-shot settings, TALE consistently matches or surpasses baseline performance while simultaneously reducing computational costs. TALE also synergizes with fine-tuning, leading to further performance improvements. Computing TALE for a new task requires modest resources, making it a practical and deployable solution for task-specialized LLM inference.

2510.21954 2026-05-12 cs.CL

Model-Aware Tokenizer Transfer

Mykola Haltiuk, Aleksander Smywinski-Pohl

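AI summary: Large language models (LLMs) increasingly support many languages, but their predefined tokenizers remain a bottleneck when adapting to low-resource languages or languages with different scripts. This paper proposes Model-Aware Tokenizer Transfer (MATT), which introduces an Attention Influence Modeling (AIM) objective to transfer inter-token communication patterns from the source model to a target model that uses a new tokenizer, improving tokenizer transfer. Experiments show that MATT recovers most of the original model's performance in a short time and outperforms conventional heuristic methods, demonstrating the value of model-internal signals for tokenizer transfer.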

Abstract

Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.

2510.20797 2026-05-12 cs.CL cs.AI cs.LG

No Mean Feat: Simple, Strong Baselines for Context Compression

Yair Feldman, Yoav Artzi

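AI summary: This paper studies context compression, which lowers Transformer inference cost by replacing long inputs with short precomputed representations and is especially useful for tasks such as retrieval-augmented generation (RAG). The authors propose a standardized evaluation framework, BenchPress, and design two simple but strong baselines that clearly outperform the commonly used causal compression approach. The study finds that bidirectional attention helps when producing compressed representations and that simple pooling also compresses context effectively.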

Comments Code available at https://github.com/lil-lab/benchpress

Abstract

Context compression reduces Transformer inference costs by replacing lengthy inputs with shorter pre-computed representations. It carries significant benefits for retrieval-augmented generation (RAG) and has attracted growing research attention. However, progress remains difficult to measure due to inconsistent evaluations and baselines. We design a standard, easy-to-reproduce evaluation suite for context compression, BenchPress, along with simple, high-performance baselines for English reading comprehension. BenchPress supports benchmarking across model scales, datasets, compression ratios, and short (<1K tokens) to mid-range (<8K tokens) contexts. While the suite is applicable to any compression paradigm, our baselines target soft context compression. We establish two simple baselines that strongly outperform the widely used causal compression-token approach: mean pooling and a bidirectional compression-token variant. Our results show the benefit of bidirectional attention when computing compressed representations, and that simple pooling is an expressive compression operator.
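Mean pooling as a compression operator can be sketched in a few lines: average each run of `ratio` consecutive token vectors into one vector. The sketch below assumes contiguous, non-overlapping windows and a shorter final window; which representations the paper pools and how it handles the tail are not stated in the abstract, and the function name is hypothetical.

```python
import numpy as np


def mean_pool_compress(hidden, ratio):
    """Compress a (T, d) array of token vectors to (ceil(T / ratio), d).

    Each output row is the mean of `ratio` consecutive input rows; the last
    window may be shorter when T is not a multiple of `ratio` (assumption).
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    T, d = hidden.shape
    n_out = -(-T // ratio)                       # ceiling division
    out = np.empty((n_out, d))
    for i in range(n_out):
        window = hidden[i * ratio:(i + 1) * ratio]
        out[i] = window.mean(axis=0)
    return out


hidden = np.arange(20, dtype=float).reshape(10, 2)  # 10 toy token vectors, d = 2
compressed = mean_pool_compress(hidden, ratio=4)    # 10 tokens -> 3 vectors
print(compressed.shape)
```

The operator has no trainable parameters of its own, which is part of what makes it a simple baseline to compare against learned compression tokens.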

2510.17671 2026-05-12 cs.LG cs.AI cs.CL

LILO: Bayesian Optimization with Natural Language Feedback

Katarzyna Kobalczyk, Zhiyuan Jerry Lin, Benjamin Letham, Zhuokai Zhao, Maximilian Balandat, Eytan Bakshy

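AI summary: LILO is an optimization framework that combines Bayesian optimization with a large language model to tackle real-world optimization problems guided by complex, subjective preferences. It uses the large language model to turn natural language feedback into structured preference signals, going beyond the scalar or pairwise feedback assumed in conventional preference optimization. By integrating these preferences into a Gaussian process surrogate model, LILO keeps sample efficiency and stability while offering a more flexible feedback interface, and it outperforms conventional methods and LLM-only optimizers on multiple benchmarks.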

Journal ref Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Abstract

Many real-world optimization problems are guided by complex, subjective preferences that are difficult to express as explicit closed-form objectives. In response, we introduce Language-in-the-Loop Optimization (LILO), a Bayesian optimization (BO) framework that employs a large language model (LLM) to translate free-form natural language feedback and prior knowledge from a decision maker into structured preference signals, going beyond the restrictive scalar or pairwise feedback formats typically assumed in preferential BO. The LLM-derived preferences are integrated by a Gaussian process proxy model, enabling principled acquisition-driven exploration with calibrated uncertainty. By placing the LLM in a supporting role rather than as the optimizer itself, LILO preserves the sample efficiency and stability of BO while providing a flexible and expressive feedback interface. Across synthetic and real-world benchmarks, LILO consistently outperforms both conventional preference-based BO methods and LLM-only optimizers, with particularly strong gains in feedback-limited regimes.

2510.09592 2026-05-12 cs.CL

Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu Zhang, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

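AI summary: This paper proposes Mind-Paced Speaking (MPS), a new framework that addresses the difficulty real-time spoken language models have with chain-of-thought (CoT) reasoning because of high generation latency. Inspired by the division of labor in the human brain, the method uses a dual-brain structure in which two separate modules handle high-level reasoning and speech generation, which avoids mode switching and keeps the reasoning process intact. Experiments show that MPS outperforms existing methods in both reasoning accuracy and real-time performance, offering an effective way to combine high-quality reasoning with real-time interaction.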

Abstract

Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. MPS is the methodology underlying our released Step-Audio R1.1 system, effectively bridging the gap between high-quality reasoning and real-time interaction.

2510.09580 2026-05-12 cs.AI cs.CL

GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj K. Jha

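AI summary: This paper proposes GraphMERT, an efficient and scalable model for distilling reliable knowledge graphs (KGs) from unstructured text. The model uses a graphical encoder-only architecture and produces domain-specific knowledge graphs that are factually accurate and semantically consistent, addressing the scalability and reliability limits of conventional neurosymbolic systems. Experiments show that, in the medical domain (for example, literature on diabetes), GraphMERT's knowledge graph scores significantly higher than a large language model's on factuality and validity metrics.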

Comments Camera-ready version. Published in Transactions on Machine Learning Research (TMLR), 2026. Reviewed on OpenReview: https://openreview.net/forum?id=tnXSdDhvqc

Journal ref
Transactions on Machine Learning Research, 2026
Abstract

Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades. A marriage of the neural and symbolic components can lead to rapid advancements in AI. Yet, the field has not realized this promise since most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of purely neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side of the problem. However, automatically deriving reliable KGs from text corpora remains an open problem. We address these challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. Concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When a large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on reliability due to prompt sensitivity, shallow domain expertise, and hallucinated relations. On text obtained from PubMed papers on diabetes, our 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only 40.2% FActScore. The GraphMERT KG also attains a higher ValidityScore of 68.8%, versus 43.0% for the LLM baseline.

2510.04988 2026-05-12 cs.LG

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

Kristi Topollai, Anna Choromanska

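AI summary To address the fixed momentum coefficient commonly used in deep learning optimization, this paper proposes a model-based adaptive memory momentum mechanism that improves optimization performance by adjusting the momentum coefficient online. The method uses the current gradient and the accumulated past gradients to construct two planes that approximate the objective function, from which it derives an adaptive momentum update rule that requires no additional assumptions or hyperparameter tuning. Experiments show that the method outperforms standard SGD and Adam with hand-tuned momentum across a range of learning tasks, offering a new direction for research on adaptivity in optimization algorithms.

Details
English abstract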

The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on a momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an \textit{adaptive memory} mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework has never been used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.
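
For orientation, the constant-momentum baseline that the abstract contrasts against can be sketched as below. This is a minimal heavy-ball loop on a toy quadratic, not the authors' adaptive method; the function name `heavy_ball`, the learning rate, and the toy objective are all assumptions chosen for illustration.

```python
def heavy_ball(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Heavy-ball gradient descent with a constant momentum coefficient.

    `grad` is a callable returning the gradient at a scalar point.
    All names and default values here are illustrative assumptions.
    """
    x = float(x0)
    m = 0.0  # momentum buffer: accumulated memory of past gradients
    for _ in range(steps):
        g = grad(x)
        m = beta * m + g  # each past gradient is down-weighted by beta per step
        x = x - lr * m
    return x


# Toy convex objective f(x) = 0.5 * (x - 3)^2, so grad f(x) = x - 3.
x_final = heavy_ball(lambda x: x - 3.0, x0=0.0)
```

Here `beta` stays fixed for the whole run, which is the practice the abstract describes as widely used yet suboptimal; an adaptive scheme would recompute it at every step.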

2510.03648 2026-05-12 cs.LG

SAFA-SNN: Sparsity-Aware On-Device Few-Shot Class-Incremental Learning with Fast-Adaptive Structure of Spiking Neural Network

Huijing Zhang, Muyang Cao, Linshan Jiang, Xin Du, Di Yu, Changze Lv, Shuiguang Deng

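AI summary This paper proposes SAFA-SNN, an on-device few-shot class-incremental learning method based on spiking neural networks (SNNs), aimed at the challenge of edge devices continually learning new classes from limited data samples. The method mitigates catastrophic forgetting through sparsity-aware neuronal dynamics and a fast-adaptive network structure, uses zeroth-order optimization to handle spike non-differentiability, and applies orthogonal subspace projection to enhance the discriminability of class prototypes. Experiments show that SAFA-SNN outperforms existing methods on multiple benchmark datasets, with higher accuracy and lower energy consumption.

Comments Published as a conference paper at ICLR 2026

Details
English abstract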

Continuous learning of novel classes is crucial for edge devices to preserve data privacy and maintain reliable performance in dynamic environments. However, the scenario becomes particularly challenging when data samples are insufficient, requiring on-device few-shot class-incremental learning (FSCIL). Although existing work has explored parameter-efficient FSCIL frameworks based on artificial neural networks (ANNs), their deployment is still fundamentally constrained by limited device resources. Spiking neural networks (SNNs) process spatiotemporal information efficiently, offering lower energy consumption, greater biological plausibility, and compatibility with neuromorphic hardware than ANNs. In this work, we propose an SNN-based method containing Sparsity-Aware neuronal dynamics and Fast Adaptive structure (SAFA-SNN) for on-device FSCIL. By threshold regulation, most neurons exhibit stable spikes and others exhibit adaptive spikes. As a result, synaptic traces that encode base-class knowledge are naturally preserved, thereby alleviating catastrophic forgetting. To cope with spike non-differentiability in backpropagation, we employ a gradient-free technique, i.e., zeroth-order optimization. Moreover, class prototypes can limit overfitting on few-shot data but introduce bias. We enhance prototype discriminability by orthogonal subspace projection. Extensive experiments conducted on two standard benchmark datasets (CIFAR-100 and Mini-ImageNet) and three neuromorphic datasets (CIFAR10-DVS, DVS128 Gesture, and N-Caltech101) demonstrate that SAFA-SNN outperforms baselines, specifically achieving at least 4.01% improvement at the last incremental session on Mini-ImageNet and 20% lower energy cost on CIFAR-100 over baselines with practical implementation.

2509.25742 2026-05-12 cs.LG

Less is More: Towards Simple Graph Contrastive Learning

Yanan Zhao, Feng Ji, Jingyang Dai, Jiaze Ma, Wee Peng Tay

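AI summary This paper studies the effectiveness of graph contrastive learning (GCL) on heterophilic graphs and proposes a simple, efficient GCL method. By analyzing the relationship between graph structure and node features, the authors find that using structural features to reduce node feature noise improves contrastive learning, and on that basis they design a simple model that requires neither data augmentation nor negative samples. The model achieves superior performance on heterophilic graphs while also offering advantages in computational and memory overhead.

Details
English abstract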

Graph Contrastive Learning (GCL) has shown strong promise for unsupervised graph representation learning, yet its effectiveness on heterophilic graphs, where connected nodes often belong to different classes, remains limited. Most existing methods rely on complex augmentation schemes, intricate encoders, or negative sampling, which raises the question of whether such complexity is truly necessary in this challenging setting. In this work, we revisit the foundations of supervised and unsupervised learning on graphs and uncover a simple yet effective principle for GCL: mitigating node feature noise by aggregating it with structural features derived from the graph topology. This observation suggests that the original node features and the graph structure naturally provide two complementary views for contrastive learning. Building on this insight, we propose an embarrassingly simple GCL model that uses a GCN encoder to capture structural features and an MLP encoder to isolate node feature noise. Our design requires neither data augmentation nor negative sampling, yet achieves state-of-the-art results on heterophilic benchmarks with minimal computational and memory overhead, while also offering advantages in homophilic graphs in terms of complexity, scalability, and robustness. We provide theoretical justification for our approach and validate its effectiveness through extensive experiments, including robustness evaluations against both black-box and white-box adversarial attacks.

2509.25646 2026-05-12 cs.LG cs.NA math.NA

Deep set based operator learning with uncertainty quantification

Lei Ma, Ling Guo, Hao Wu, Tao Zhou

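AI summary This paper proposes UQ-SONet, a deep-set operator learning framework with uncertainty quantification, for learning operators in scientific computing from data. The method handles sparse and variable sensor locations through a set transformer embedding and uses a conditional variational autoencoder to approximate the conditional distribution of the solution operator, providing principled uncertainty estimates while maintaining predictive accuracy. Experiments show that the framework is robust and effective on both deterministic and stochastic partial differential equations.

Details
English abstract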

Learning operators from data is central to scientific machine learning. While DeepONets are widely used for their ability to handle complex domains, they require fixed sensor numbers and locations, lack mechanisms for uncertainty quantification, and are thus limited in practical applicability. Recent permutation-invariant extensions, such as the Variable-Input Deep Operator Network, relax these sensor constraints but still rely on sufficiently dense observations and cannot capture uncertainties arising from incomplete measurements or from operators with inherent randomness. To address these challenges, we propose UQ-SONet, a permutation-invariant operator learning framework with built-in uncertainty quantification. Our model integrates a set transformer embedding to handle sparse and variable sensor locations, and employs a conditional variational autoencoder to approximate the conditional distribution of the solution operator. By minimizing the negative ELBO, UQ-SONet provides principled uncertainty estimation while maintaining predictive accuracy. Numerical experiments on deterministic and stochastic PDEs, including the Navier-Stokes equation, demonstrate the robustness and effectiveness of the proposed framework.
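
The negative ELBO mentioned above has a standard form for a conditional variational autoencoder, written out here for reference. The symbols are assumptions, not the paper's notation: $x$ stands for the sensor observations, $y$ for the target solution values, $z$ for the latent variable, and $\theta$, $\phi$ for decoder and encoder parameters.

```latex
\mathcal{L}(\theta, \phi)
  = -\,\mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log p_\theta(y \mid x, z)\bigr]
    + \mathrm{KL}\bigl(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\bigr)
  \;\ge\; -\log p_\theta(y \mid x)
```

Minimizing $\mathcal{L}$ therefore minimizes an upper bound on the negative conditional log-likelihood; the first term rewards reconstruction accuracy and the KL term keeps the encoder close to the conditional prior.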

2509.22196 2026-05-12 cs.LG stat.ML

Mechanistic Independence: A Principle for Identifiable Disentangled Representations

Stefan Matthes, Zhiwei Han, Hao Shen

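AI summary This paper proposes a unified framework based on "mechanistic independence" for identifiable disentangled representations. Its core idea is to characterize latent factors by how the latent variables act on observed variables rather than by the distributional properties of the latents. The approach remains invariant when the latent density changes, even when such changes introduce statistical dependencies. The paper proposes several independence criteria and proves that latent subspaces are identifiable even under nonlinear, non-invertible mixing. It also establishes a hierarchy among these criteria and gives a graph-theoretic characterization of latent subspaces, providing a new theoretical foundation for the identifiability of disentangled representations.

Details
English abstract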

Disentangled representations seek to recover latent factors of variation underlying observed data, yet their identifiability is still not fully understood. We introduce a unified framework in which disentanglement is achieved through mechanistic independence, which characterizes latent factors by how they act on observed variables rather than by their latent distribution. This perspective is invariant to changes of the latent density, even when such changes induce statistical dependencies among factors. Within this framework, we propose several related independence criteria -- ranging from support-based and sparsity-based to higher-order conditions -- and show that each yields identifiability of latent subspaces, even under nonlinear, non-invertible mixing. We further establish a hierarchy among these criteria and provide a graph-theoretic characterization of latent subspaces as connected components. Together, these results clarify the conditions under which disentangled representations can be identified without relying on statistical assumptions.

2509.21743 2026-05-12 cs.AI cs.LG

Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts

Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad, Sheng Di, Zirui Liu, Ali Anwar

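AI summary This paper proposes Retrieval-of-Thought (RoT), an efficient reasoning method that addresses the added latency and cost caused by large models generating long reasoning traces. RoT retrieves and reuses "thought" steps from prior reasoning and organizes them into a composable thought graph, so that it can quickly assemble a reasoning template for a new problem and reduce redundant exploration. Experiments show that RoT substantially reduces output tokens, inference latency, and compute cost while maintaining reasoning accuracy.

Details
English abstract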

Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable ``thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.

2509.21671 2026-05-12 cs.LG q-bio.NC

Neuroprobe: Evaluating Intracranial Brain Responses to Naturalistic Stimuli

Andrii Zahorodnii, Christopher Wang, Geeling Chau, Bennett Stankovits, Charikleia Moraitaki, Eli Gross, Alexander Brady, Andrei Barbu, Boris Katz, Ila R Fiete

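AI summary This paper presents Neuroprobe, a framework of decoding tasks for evaluating brain responses to naturalistic stimuli recorded with intracranial EEG (iEEG). It is built on the BrainTreebank dataset, which contains over 40 hours of iEEG recordings from 10 subjects watching movies, and aims to systematically study when and where features are decodable across brain regions during language processing. Neuroprobe helps reveal how language and auditory information is processed in the brain and also provides a standardized evaluation platform for comparing the architectures and training methods of neural foundation models.

Comments 38 pages, 7 main figures, 16 supplementary figures, 13 tables

Details
English abstract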

High-resolution neural datasets enable foundation models for the next generation of brain-computer interfaces and neurological treatments. The community requires rigorous benchmarks to discriminate between competing modeling approaches, yet no standardized evaluation frameworks exist for intracranial EEG (iEEG) recordings. To address this gap, we present Neuroprobe: a suite of decoding tasks for studying multi-modal language processing in the brain. Unlike scalp EEG, intracranial EEG requires invasive surgery to implant electrodes that record neural activity directly from the brain with minimal signal distortion. Neuroprobe is built on the BrainTreebank dataset, which consists of over 40 hours of iEEG recordings from 10 human subjects performing a naturalistic movie viewing task. Neuroprobe serves two critical functions. First, it is a source from which neuroscience insights can be drawn. The high temporal and spatial resolution of the labeled iEEG allows researchers to systematically determine when and where computations for each aspect of language processing occur in the brain by measuring the decodability of each feature across time and all electrode locations. Using Neuroprobe, we visualize how information flows from key language and audio processing sites in the superior temporal gyrus to sites in the prefrontal cortex. We also demonstrate the time evolution of processing from simple auditory features (e.g., pitch and volume) to more complex language features (e.g., part of speech) in a purely data-driven manner. Second, as the field moves toward neural foundation models trained on large-scale datasets, Neuroprobe provides a rigorous framework for comparing competing architectures and training protocols. We make the code for Neuroprobe openly available, aiming to enable rapid progress in the field of iEEG foundation models. Public leaderboard: https://neuroprobe.dev/

2509.15816 2026-05-12 cs.LG

On the Convergence of Muon and Beyond

Da Chang, Yongxiang Liu, Ganzhao Yuan

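AI summary This paper studies the convergence of the Muon optimizer in non-convex stochastic optimization. To address the gap between its theoretical analysis and practical performance, it analyzes two momentum-based variance-reduced variants, Muon-MVR1 and Muon-MVR2. Rigorous analysis proves that, under horizon-free learning-rate schedules, Muon-MVR2 attains the optimal anytime convergence rate $\widetilde{\mathcal{O}}(T^{-1/3})$, and the paper also gives convergence guarantees under the Polyak–Łojasiewicz condition. Experiments show that the proposed methods work well in practice on the CIFAR-10 and C4 datasets.

Details
English abstract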

The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal ergodic convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To study the theoretical limits of Muon, we analyze two momentum-based variance-reduced variants: the one-batch Muon-MVR1 and the two-batch Muon-MVR2. We provide the first rigorous proof that, under \textbf{horizon-free} learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate $\widetilde{\mathcal{O}}(T^{-1/3})$, matching the lower bound for this problem class. Under the Polyak--Łojasiewicz (PL) condition, we establish anytime guarantees for Muon-MVR1 and Muon-MVR2: they attain best-iterate rates of $\widetilde{\mathcal{O}}(T^{-1/4})$ and $\widetilde{\mathcal{O}}(T^{-1/3})$ for the expected square-root suboptimality, and, given an additional uniform gradient bound along the iterates, achieve last-iterate rates of $\mathcal{O}(T^{-1/4})$ and $\mathcal{O}(T^{-1/3})$ for the objective gap, respectively. Experiments on CIFAR-10 and C4 support the practical effectiveness of the proposed variance-reduced Muon variants. Code is available at \href{https://github.com/MaeChd/MUON-MVR}{Muon-MVR} Codebase.
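
For readers unfamiliar with the Polyak–Łojasiewicz (PL) condition invoked above, its textbook Euclidean form is given below. This is a reference statement under assumed notation ($f^\star$ for the infimum of $f$, $\mu > 0$ for the PL constant); the paper may state the condition with a different norm suited to matrix-structured parameters.

```latex
\frac{1}{2}\,\bigl\|\nabla f(x)\bigr\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr)
\qquad \text{for all } x
```

The inequality says that a small gradient forces a small objective gap, which is why rates for the gradient norm can be converted into rates for $f(x) - f^\star$ without assuming convexity.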

2509.13484 2026-05-12 cs.CV cs.CY

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Liu Liu, Alexandra Kudaeva, Marco Cipriano, Fatimeh Al Ghannam, Freya Tan, Gerard de Melo, Andres Sevtsuk

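AI summary This paper proposes MINGLE, a vision-language-model approach for detecting semantically complex social group regions in urban scenes. By combining human detection, depth estimation, VLM-based reasoning, and a spatial aggregation algorithm, the method identifies and localizes regions of social interaction in images. The study also builds a new dataset of 100K urban street-view images annotated with bounding boxes and labels for individuals and social groups, providing an important resource for related research.

Comments 13 pages, 4 figures. Updated with the camera-ready version after acceptance.

Details
English abstract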

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement - semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
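
One plausible reading of stage (3), turning pairwise affiliation decisions into group regions, is a connected-components pass followed by an enclosing box per group. The sketch below illustrates that idea only; it is not the MINGLE aggregation algorithm, and `group_people`, `enclosing_box`, and the `(x1, y1, x2, y2)` box layout are assumptions made for the example.

```python
def group_people(num_people, affiliated_pairs):
    """Merge pairwise 'same group' decisions into groups via union-find."""
    parent = list(range(num_people))

    def find(i):
        # Walk to the root, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in affiliated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for person in range(num_people):
        groups.setdefault(find(person), []).append(person)
    # Keep only groups with at least two members; singletons are not groups.
    return sorted(members for members in groups.values() if len(members) > 1)


def enclosing_box(boxes):
    """Smallest (x1, y1, x2, y2) box that contains every input box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


# Hypothetical input: six detected people, three positive pairwise decisions.
groups = group_people(6, [(0, 1), (1, 2), (3, 4)])
region = enclosing_box([(0, 0, 2, 2), (1, 1, 3, 4)])
```

Union-find keeps the merge step close to linear in the number of pairs, which fits the abstract's description of the aggregation stage as lightweight.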

2509.13332 2026-05-12 cs.AI cs.CL

Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness

Pratik Jayarao, Himanshu Gupta, Neeraj Varshney, Chaitanya Dwivedi

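AI summary This paper systematically studies how explicit-reasoning ("thinking") models and non-explicit-reasoning ("non-thinking") models differ in accuracy, efficiency, and robustness within the LLM-as-a-judge framework. Comparative experiments with the open-source Qwen 3 model family show that explicit-reasoning models significantly improve judgment accuracy while keeping computational overhead low, and are more stable under a variety of bias conditions. The study also finds that the advantages of explicit reasoning hold not only for English tasks but also in multilingual settings.

Comments Accepted in 2025 NeurIPS Foundations of Reasoning in Language Models Workshop

Details
English abstract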

As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of "thinking" and "non-thinking" LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Thinking models achieve approximately 10 percentage points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting, and our results confirm that the benefits of explicit reasoning extend beyond English. Overall, our findings provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.

2509.12635 2026-05-12 cs.CL cs.AI

Positional Encoding via Token-Aware Phase Attention

Yu Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian

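AI summary This paper studies the intrinsic distance-dependent bias of Rotary Positional Embedding (RoPE) when handling long contexts and proposes a new positional encoding method, Token-Aware Phase Attention (TAPA). By introducing a learnable phase function into the attention mechanism, TAPA preserves long-range token interactions and supports direct, lightweight continual pretraining, achieving lower perplexity and stronger retrieval performance than RoPE baselines in long-context settings.

Comments 28 pages

Details
English abstract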

We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light continual pretraining, extrapolates to unseen lengths, and attains substantially lower perplexity and stronger retrieval performance in the long-context regime than RoPE-style baselines.
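
As background for the distance dependence discussed above, the sketch below shows the well-known property that a RoPE attention logit depends on positions only through their offset. It treats one 2-D feature pair as a complex number and uses a single assumed frequency `theta`; it does not reproduce the paper's bias analysis or TAPA, and `rope_score` is a hypothetical name.

```python
import cmath


def rope_score(q, k, m, n, theta=0.1):
    """Attention logit for one 2-D feature pair under RoPE.

    q and k are complex numbers standing for a 2-D query/key pair at
    positions m and n; theta is an assumed rotation frequency.
    """
    q_rot = q * cmath.exp(1j * m * theta)  # rotate the query by its position
    k_rot = k * cmath.exp(1j * n * theta)  # rotate the key by its position
    # The real part of q_rot * conj(k_rot) equals the 2-D dot product.
    return (q_rot * k_rot.conjugate()).real


q, k = complex(0.3, -1.2), complex(0.8, 0.5)
near = rope_score(q, k, m=5, n=2)        # offset 3, early in the sequence
far = rope_score(q, k, m=103, n=100)     # offset 3, late in the sequence
```

Both calls use the same offset `m - n = 3`, so `near` and `far` agree up to floating-point error: the logit is a function of relative distance alone, and that is the quantity a distance-dependent bias acts on.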