arXivDaily: arXiv daily academic digest, updated Monday to Friday
2603.13249 2026-05-29 cs.CL cs.AI cs.CY

Steering at the Source: Style Modulation Heads for Robust Persona Control

Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura

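AI summary: By identifying and intervening on only a small number of attention heads (Style Modulation Heads), this paper achieves robust control over LLM persona and style without fine-tuning, while significantly mitigating the coherency degradation caused by residual-stream interventions.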

Comments
8 main pages with appendix

Abstract

Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While it effectively controls target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.

2603.12588 2026-05-29 cs.CV

SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

Furui Chen, Han Wang, Yuhan Sun, Jianing You, Yixuan Lv, Zhuang Zhou, Hong Tan, Shengyang Li

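AI summary: To address the challenge that radiometric discrepancies between optical and SAR imagery pose for ship re-identification, the paper proposes SDF-Net, which uses a structure consistency constraint and disentangled feature learning to extract modality-invariant identity features, achieving state-of-the-art performance on the HOSS-ReID dataset.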

Abstract

Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. While existing approaches primarily rely on statistical distribution alignment or semantic matching, they often overlook a critical physical prior: ships are rigid objects whose geometric structures remain stable across sensing modalities, whereas texture appearance is highly modality-dependent. In this work, we propose SDF-Net, a Structure-Aware Disentangled Feature Learning Network that systematically incorporates geometric consistency into optical-SAR ship ReID. Built upon a ViT backbone, SDF-Net introduces a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor representations against radiometric variations. At the terminal stage, SDF-Net disentangles the learned representations into modality-invariant identity features and modality-specific characteristics. These decoupled cues are then integrated through a parameter-free additive residual fusion, effectively enhancing discriminative power. Extensive experiments on the HOSS-ReID dataset demonstrate that SDF-Net consistently outperforms existing state-of-the-art methods. The code and trained models are publicly available at https://github.com/cfrfree/SDF-Net.

2603.11331 2026-05-29 cs.LG cs.AI

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

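AI summary: The study finds that adversarial prompt-injection attacks can shift attack success rate from the slow polynomial growth seen without injection to exponential growth in the number of inference-time samples, and explains the phenomenon theoretically with a spin-glass model.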

Abstract

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We observe qualitatively consistent behavior across a broad range of large language models, spanning parameter scales from 3B to 70B. In particular, the main trends remain stable across multiple attack methods, such as GCG and AutoDAN, as well as across benchmark datasets such as AdvBench and HarmBench.
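
The contrast between the two scaling regimes can be illustrated with a textbook calculation. The sketch below is not the paper's derivation: it assumes, purely for illustration, that each sample is unsafe with probability p, and compares a fixed p with a p that varies across contexts as a Beta(a, b) variable (the Beta choice and all parameter values are assumptions). With fixed p, the chance that all k samples stay safe decays exponentially in k; with Beta-distributed p it decays only as a power law, roughly k^(-a).

```python
import math


def failure_fixed(p: float, k: int) -> float:
    """P(all k independent samples stay safe) when each is unsafe with probability p."""
    return (1.0 - p) ** k


def failure_beta(a: float, b: float, k: int) -> float:
    """E[(1 - p)^k] for p ~ Beta(a, b), which equals B(a, b + k) / B(a, b)."""
    log_val = (math.lgamma(b + k) + math.lgamma(a + b)
               - math.lgamma(b) - math.lgamma(a + b + k))
    return math.exp(log_val)


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    for k in (10, 100, 1000):
        print(k, failure_fixed(0.01, k), failure_beta(0.5, 1.0, k))
```

Doubling k shrinks the Beta-averaged failure probability by a roughly constant factor (about 2^a), whereas under a fixed p the shrink factor itself grows exponentially with k.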

2603.10474 2026-05-29 cs.LG cs.NE cs.RO

Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation

Ilseung Park, Eunsik Choi, Jangwhan Ahn, Jooeun Ahn

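AI summary: Proposes a physiology-informed reinforcement learning framework that constrains control with muscle synergies, improving the biomechanical fidelity and generalization of predictive human locomotion simulation with limited experimental data.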

Comments
Added a manuscript footnote stating "Project page with supplementary videos: https://ces40320.github.io/WebHomepage__Walk-RL ."

Abstract

Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7 to 1.8 m/s and on $\pm 6^{\circ}$ grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.
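
A minimal sketch of what "using a synergy basis as the action space" can look like: the policy outputs a few synergy activations, and a fixed basis expands them into per-muscle excitations. The function name, the toy basis values, and the clamping to [0, 1] are assumptions for illustration; the abstract does not specify these details.

```python
def synergy_to_muscles(basis, action):
    """Map a low-dimensional synergy action to per-muscle excitations.

    basis:  list of k synergy vectors, each holding one non-negative weight per muscle
    action: list of k synergy activation levels chosen by the policy
    """
    if len(basis) != len(action):
        raise ValueError("need exactly one activation per synergy")
    n_muscles = len(basis[0])
    excitations = []
    for m in range(n_muscles):
        total = sum(action[s] * basis[s][m] for s in range(len(basis)))
        excitations.append(min(1.0, max(0.0, total)))  # clamp to the valid range [0, 1]
    return excitations


# Hypothetical toy basis: 2 synergies driving 4 muscles.
TOY_BASIS = [
    [0.9, 0.6, 0.0, 0.1],
    [0.0, 0.2, 0.8, 0.7],
]

if __name__ == "__main__":
    print(synergy_to_muscles(TOY_BASIS, [0.5, 1.0]))
```

The policy here searches a 2-dimensional space instead of a 4-dimensional one; with a full musculoskeletal model the reduction is correspondingly larger.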

2603.07916 2026-05-29 cs.AI cs.DB cs.LG

Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases

Jun Yin, Peng Huo, Bangguo Zhu, Hao Yan, Senzhang Wang, Shirui Pan, Chengqi Zhang

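AI summary: To address class imbalance in entity classification on relational databases, the paper proposes the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), which strengthens minority-class representations through a relation-wise gating controller and a relation-guided minority synthesizer, improving Balanced Accuracy by 2.46% and G-Mean by 4.00% on average across 12 datasets.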

Abstract

To enable a fully data-driven learning paradigm on relational databases (RDBs), relational deep learning (RDL) has recently been proposed; it structures the RDB as a heterogeneous entity graph and adopts a graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing minority entities, leading to models that are unusable in practice. In this work, we investigate, for the first time, the class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS) to fill a critical void in the current literature. Specifically, to keep minority-related information from being submerged by its majority counterparts, we design a relation-wise gating controller that modulates neighborhood messages from each individual relation type. Based on the relation-gated representations, we further propose a relation-guided minority synthesizer for over-sampling, which integrates entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding average improvements of up to 2.46% in Balanced Accuracy and 4.00% in G-Mean over SOTA RDL methods and classic methods for handling class imbalance.
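
The two reported metrics are the usual choices when classes are imbalanced, because plain accuracy rewards always predicting the majority class. The sketch below uses the standard definitions (Balanced Accuracy as the arithmetic mean of per-class recalls, G-Mean as their geometric mean); it assumes the paper follows these conventions, which the abstract does not state explicitly.

```python
import math
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; zero if any class is never recovered."""
    recalls = per_class_recall(y_true, y_pred)
    return math.prod(recalls.values()) ** (1.0 / len(recalls))


if __name__ == "__main__":
    y_true = [0] * 8 + [1] * 2
    always_majority = [0] * 10  # plain accuracy would be 0.8
    print(balanced_accuracy(y_true, always_majority), g_mean(y_true, always_majority))
```

A classifier that ignores the minority class scores 0.5 on Balanced Accuracy and 0 on G-Mean in the binary case, which is why both metrics are sensitive to minority-class recovery.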

2603.07860 2026-05-29 cs.LG

Sparse Scheduled Diffusion Guidance for Inverse Problems

Abduragim Shtanchaev, Albina Ilina, Yazid Janati, Arip Asadulaev, Martin Takac, Eric Moulines

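AI summary: Proposes Spin, which starts posterior sampling from an intermediate timestep and applies lightweight corrections only at scheduled steps, solving inverse problems efficiently with 2-50x speedups and lower memory use on FFHQ and ImageNet.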

Abstract

Pretrained diffusion models are effective priors for Bayesian inverse problems, but posterior sampling with these priors is often costly because data-consistency guidance is applied throughout the full reverse trajectory. Existing methods have shown that vector-Jacobian products through the denoiser can sometimes be avoided, yet they typically still rely on dense guidance through the full trajectory or expensive inner solves. We introduce Sparse Scheduled Diffusion Guidance for Inverse Problems (Spin), a solver that avoids starting posterior sampling from pure noise. Spin first samples from a posterior time-marginal at an intermediate timestep $t_*$, and then uses that state as a warm start for a guided reverse diffusion process. At guidance time, instead of enforcing the measurement constraint at every denoising step, Spin applies lightweight corrections only at scheduled timesteps where the denoiser can still clean up artifacts. The resulting procedure decouples prior refinement from data consistency: the prior supplies denoising, while lightweight pixel-space optimization enforces the measurement constraint without backpropagation through the denoiser or decoder. Across linear and nonlinear inverse problems on FFHQ and ImageNet, Spin achieves competitive reconstruction quality with a substantially better runtime-memory profile, running 2x faster on pixel-space models and up to 50x faster on latent diffusion models, with lower memory costs.

2603.05488 2026-05-29 cs.CL cs.AI cs.LG

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

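AI summary: By analyzing activation probes, early forced answering, and a chain-of-thought monitor, the paper finds that reasoning models exhibit performative chain-of-thought, and it uses probe-guided early exit to save computation.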

Abstract

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B and GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
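
Probe-guided early exit can be sketched as a simple stopping rule over a stream of probe confidences. Everything specific here is an assumption for illustration: the abstract does not give the exit criterion, so the `threshold` and `patience` parameters, and the idea of requiring several consecutive confident readings, are hypothetical.

```python
def early_exit_step(probe_confidences, threshold=0.9, patience=2):
    """Return the check index at which to stop generating, or None to run to the end.

    Exits once the probe's confidence in its top answer has stayed at or above
    `threshold` for `patience` consecutive checks.
    """
    streak = 0
    for i, conf in enumerate(probe_confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return i
    return None


def token_savings(total_tokens, exit_token):
    """Fraction of tokens skipped by stopping at `exit_token`."""
    return 1.0 - exit_token / total_tokens


if __name__ == "__main__":
    confidences = [0.30, 0.50, 0.92, 0.95, 0.97, 0.99]  # made-up probe readings
    print(early_exit_step(confidences))
```

Requiring a streak rather than a single confident reading is one way to avoid exiting on a transient spike; a real system would tune both parameters against the accuracy it is willing to give up.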

2603.05002 2026-05-29 cs.LG math.OC stat.ML

Non-Euclidean Gradient Descent Operates at the Edge of Stability

Rustem Islamov, Michael Crawshaw, Jeremy Cohen, Robert Gower

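AI summary: The paper interprets the Edge of Stability phenomenon in gradient descent through directional smoothness, extends it to non-Euclidean norms, and defines a generalized sharpness; experiments show that non-Euclidean gradient descent also exhibits progressive sharpening and oscillation around the threshold.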

Abstract

The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian approaches and then hovers near the stability threshold $2/\eta$ during gradient descent (GD) with step size $\eta$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness [Mishkin et al., 2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and their normalized versions. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/\eta$. Practically, our framework provides a geometry-aware spectral diagnostic that can be applied across a broad class of non-Euclidean gradient methods.
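
The threshold $2/\eta$ comes from the classical Euclidean analysis of GD on a quadratic, which is easy to check numerically. The sketch below covers only that standard one-dimensional case, not the paper's generalized sharpness; the function name and the parameter values are chosen for illustration. On f(x) = 0.5 * s * x^2 each GD step multiplies x by (1 - eta * s), so the iterates shrink exactly when s < 2/eta.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, steps=50):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return |x| after `steps` updates."""
    x = x0
    for _ in range(steps):
        x = x - step_size * sharpness * x  # the gradient of f is sharpness * x
    return abs(x)


if __name__ == "__main__":
    eta = 0.1  # stability threshold is 2 / eta = 20
    print(gd_on_quadratic(19.0, eta))  # sharpness below 2/eta: iterates shrink
    print(gd_on_quadratic(21.0, eta))  # sharpness above 2/eta: iterates grow
```

In both runs the sign of x flips at every step because 1 - eta * s is negative; the oscillation either decays or blows up depending on which side of the threshold the sharpness falls.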

2603.04678 2026-05-29 cs.CL cs.AI

Post-Training Language Models for Crosslingual Consistency

Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza

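AI summary: To address multilingual models responding inconsistently to translation-equivalent prompts, the paper proposes an information-theoretic definition of crosslingual consistency and develops a post-training method, direct consistency optimization (DCO), to improve it.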

Comments
ICML 2026. The first two authors contributed equally. Codes available at: https://github.com/Betswish/ConsistencyRL

Abstract

Language models often respond inconsistently to translation-equivalent prompts across languages, undermining the reliability of multilingual systems. To quantify this, we give an information-theoretic definition of crosslingual consistency as a divergence bound between a model's response distribution and its round-trip pushforward across languages. We then introduce penalized consistency optimization (PCO), a post-training procedure that couples this divergence with a Kullback-Leibler penalty to a fixed reference language model. Because direct optimization of PCO requires expensive on-policy roll-outs, we propose a tractable surrogate, direct consistency optimization (DCO), which can be optimized off-policy. Across diverse language models and 26 languages, DCO significantly improves crosslingual consistency, outperforms existing methods, and enables targeted alignment of low-resource languages.

2603.04314 2026-05-29 cs.CV cs.AI

MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification

William Grolleau, Achraf Chaouch, Astrid Sabourin, Guillaume Lapouge, Catherine Achard

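AI summary: Introduces MOO, a large-scale synthetic multi-view observation dataset with images of 1,000 cattle from 128 uniformly sampled viewpoints, quantifies how viewpoint changes affect re-identification, and validates that synthetic geometric priors transfer to real-world settings.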

Comments
6 pages, 3 figures, accepted to the CVPR 2026 Workshop on Computer Vision for Animal Behavior Tracking and Modeling (CV4Animals)

Abstract

Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi-view Oriented Observation (MOO) dataset, a large-scale synthetic AG-ReID dataset of 1,000 cattle individuals captured from 128 uniformly sampled viewpoints (128,000 annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real-world applications in both zero-shot and supervised settings, demonstrating performance gains across four real-world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross-view animal ReID. MOO is publicly available at https://github.com/TurtleSmoke/MOO.

2603.03805 2026-05-29 cs.LG cs.AI cs.DB

Relational In-Context Learning via Synthetic Pre-training with Structural Prior

Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang

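AI summary: Proposes RDB-PFN, the first relational foundation model trained purely on synthetic data, which uses structural causal models to generate diverse relational databases and adapts to new databases instantly via in-context learning, outperforming existing tabular foundation models on 19 real-world relational prediction tasks.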

Abstract

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments show that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming state-of-the-art tabular foundation models evaluated on the same DFS-linearized inputs, while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN.

2603.03503 2026-05-29 cs.CV cs.LG

Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data

Mabel Heffring, Lincoln Linlin Xu

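AI summary: Proposes a Bayesian high-resolution Transformer that combines a geographically-weighted weakly supervised loss with decision-level data fusion, using Sentinel-1, RCM, and AMSR2 data for 200 m resolution pan-Arctic sea ice concentration mapping and uncertainty quantification.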

Comments
23 pages, 20 figures

Abstract

Although high-resolution mapping of pan-Arctic sea ice with reliable corresponding uncertainty is essential for operational sea ice concentration (SIC) charting, it is a difficult task due to key challenges, such as the subtle nature of ice signature features, inexact SIC labels, model uncertainty, and data heterogeneity. This study presents a novel Bayesian High-Resolution Transformer approach for 200 m resolution pan-Arctic SIC mapping and uncertainty quantification using Sentinel-1, RADARSAT Constellation Mission (RCM), and Advanced Microwave Scanning Radiometer 2 (AMSR2) data. First, to improve the extraction of small and subtle sea ice features (e.g., cracks/leads, ponds, and ice floes), we design a novel high-resolution Transformer model with both global and local modules that can better discern the subtle differences in sea ice patterns. Second, to address low-resolution and inexact SIC labels, we design a geographically-weighted weakly supervised loss function that supervises the model at the region level instead of the pixel level and prioritizes pure open water and ice pack signatures while mitigating the impact of ambiguity in the marginal ice zone (MIZ). Third, to improve uncertainty quantification, we design a Bayesian extension of the proposed Transformer model, treating its parameters as random variables to more effectively capture uncertainties. Fourth, to address data heterogeneity, we fuse three different data types (Sentinel-1, RCM, and AMSR2) at the decision level to improve both SIC mapping and uncertainty quantification. The proposed approach is evaluated under pan-Arctic minimum-extent conditions in 2021 and 2025. Results demonstrate that the proposed model achieves 0.70 overall feature detection accuracy using Sentinel-1 data, while also preserving pan-Arctic SIC patterns (Sentinel-1 $R^2$ = 0.90 relative to the ARTIST Sea Ice product).

2603.02803 2026-05-29 cs.CV

Structure-Aware Text Recognition for Ancient Greek Critical Editions

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

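AI summary: By building a large-scale synthetic corpus and a benchmark of real scans, the paper evaluates vision-language models on structure-aware text recognition and finds that Qwen3VL-8B reaches a median character error rate of 1.0% on real scans.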

Abstract

Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
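
For readers unfamiliar with the headline metric, Character Error Rate is conventionally the character-level edit distance between the recognized text and the reference, divided by the reference length. The sketch below implements that standard definition and a median over pages; how the benchmark itself normalizes text before scoring is not stated in the abstract, so no normalization is applied here.

```python
from statistics import median


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) with a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate of `hyp` against a non-empty reference `ref`."""
    return edit_distance(ref, hyp) / len(ref)


def median_cer(pairs):
    """Median CER over (reference, hypothesis) pairs, e.g. one pair per page."""
    return median(cer(r, h) for r, h in pairs)
```

For polytonic Greek, the same accented letter can be encoded in precomposed or decomposed form, so applying a consistent Unicode normalization (for example with `unicodedata.normalize`) to both strings before scoring avoids counting encoding differences as errors.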

2603.02082 2026-05-29 cs.CL

What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies

Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy, Robert Frank

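AI summary: By automatically detecting three core filler-gap constructions in spoken English corpora, the paper quantifies the distributional evidence in children's language input and analyzes children's production trajectories, providing data for the debate between innate grammatical knowledge and statistical learning.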

Comments
Camera-ready version accepted to CoNLL 2026

Abstract

Children's acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora (matrix wh-questions, embedded wh-questions, and relative clauses) and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.

2603.01311 2026-05-29 cs.CL

Catalyst-Agent: Autonomous heterogeneous catalyst screening with an LLM Agent

Achuth Chandrasekhar, Janghoon Ock, Amir Barati Farimani

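AI summary: Proposes Catalyst-Agent, an LLM-powered AI agent built on an MCP server that explores materials databases through the OPTIMADE API and computes adsorption energies with the UMA model, enabling closed-loop autonomous catalyst screening with a 33-41% success rate on the ORR, NRR, and CO2RR reactions.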

Abstract

The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods include time-consuming and expensive trial-and-error experiments guided by chemical theory, or computationally heavy first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can speed up the screening of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including structural modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 33-41% among all the materials it chooses and evaluates, and manages to converge in 1-4 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use for autonomous catalyst screening workflows.

2602.23258 2026-05-29 cs.AI cs.CL

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang

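AI summary: Proposes AgentDropoutV2, a framework that at test time corrects errors with a retrieval-augmented rectifier and prunes irreparable outputs, dynamically optimizing information flow in multi-agent systems and significantly improving performance on math and code benchmarks.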

英文摘要

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their adaptability. We propose AgentDropoutV2 (ADv2), a test-time rectify-or-reject pruning framework that dynamically optimizes MAS information flow. Acting as an active firewall, ADv2 intercepts agent outputs and employs a retrieval-augmented rectifier to iteratively correct errors. This rectification is guided by an indicator pool, which is constructed offline by distilling error patterns from historical MAS failure trajectories. Irreparable outputs are subsequently pruned to prevent error propagation. Empirical results demonstrate that ADv2 significantly boosts performance on both fixed and dynamic MAS frameworks, achieving average accuracy gains of 6.39 and 2.28 percentage points on extensive math and code benchmarks, respectively. Furthermore, ADv2 exhibits remarkable adaptivity, dynamically modulating rectification efforts based on task difficulty to resolve a wide spectrum of error patterns. Our code is released at https://github.com/TonySY2/AgentDropoutV2.
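
The rectify-or-reject loop described in the abstract can be sketched as plain control flow. This is a minimal illustration, not the released ADv2 code: `find_issue`, `rectify`, `max_rounds`, and the toy checker functions are hypothetical names introduced here, and the real system retrieves guidance from an indicator pool rather than using string matching.

```python
from typing import Callable, Optional


def rectify_or_reject(
    output: str,
    find_issue: Callable[[str], Optional[str]],
    rectify: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a repaired output, or None if the output should be pruned."""
    for _ in range(max_rounds):
        issue = find_issue(output)
        if issue is None:
            return output  # nothing left to fix: pass the message downstream
        output = rectify(output, issue)  # attempt one correction round
    # Budget exhausted: keep the output only if the last repair succeeded.
    return output if find_issue(output) is None else None


# Toy stand-ins (assumptions for illustration only).
def toy_find_issue(text: str) -> Optional[str]:
    return "wrong sum" if "2+2=5" in text else None


def toy_rectify(text: str, issue: str) -> str:
    return text.replace("2+2=5", "2+2=4")


def broken_rectify(text: str, issue: str) -> str:
    return text  # a rectifier that never manages to repair anything


repaired = rectify_or_reject("so 2+2=5", toy_find_issue, toy_rectify)
pruned = rectify_or_reject("so 2+2=5", toy_find_issue, broken_rectify)
```

Here `repaired` carries the corrected message forward, while `pruned` is `None`, which stands for dropping the message so the error cannot propagate to other agents.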

2602.21565 2026-05-29 cs.LG

Routing by Reaching: Composition of Pre-trained GFlowNets for Multi-Objective Generation

通过到达进行路由:预训练GFlowNets的组合用于多目标生成

Seokwon Yoon, Youngbin Choi, Seunghyuk Cho, Seungbeom Lee, MoonJeong Park, Dongwoo Kim

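AI summary: Proposes a framework that composes pre-trained GFlowNets at inference time, adapting quickly to multi-objective generation tasks without fine-tuning or retraining; it provably recovers the target distribution exactly under linear scalarization and quantifies approximation quality for nonlinear operators through a distortion factor.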

详情
Comments
Appears in the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

生成流网络(GFlowNets)学习按照奖励函数比例采样多样化的候选,使其非常适合科学发现,其中探索多个有希望的解决方案至关重要。进一步将GFlowNets扩展到多目标设置已引起越来越多的兴趣,因为现实世界的应用通常涉及多个相互冲突的目标。然而,现有方法需要对每个目标组合进行联合训练,这意味着目标集的任何变化都需要从头开始重新训练。我们提出了一个在推理时组合预训练GFlowNets的框架,无需微调或重新训练即可实现快速适应。重要的是,我们的框架是灵活的,能够处理从线性标量化到复杂非线性算子的多种奖励组合,这些在以前的文献中通常分开处理。我们证明,我们的方法在线性标量化下精确恢复目标分布,并通过畸变因子量化非线性算子的近似质量。在合成二维网格和真实分子生成任务上的实验表明,我们的方法达到了与基线相当的性能。

英文摘要

Generative Flow Networks (GFlowNets) learn to sample diverse candidates in proportion to a reward function, making them well-suited for scientific discovery, where exploring multiple promising solutions is crucial. Further extending GFlowNets to multi-objective settings has attracted growing interest as real-world applications often involve multiple, conflicting objectives. However, existing approaches require joint training for each combination of objectives, meaning that any change in the objective set necessitates retraining from scratch. We propose a framework that composes pre-trained GFlowNets at inference time, enabling rapid adaptation without fine-tuning or retraining. Importantly, our framework is flexible, capable of handling diverse reward combinations ranging from linear scalarization to complex nonlinear operators, which are often handled separately in previous literature. We prove that our method exactly recovers the target distribution for linear scalarization, and quantify the approximation quality for nonlinear operators through a distortion factor. Experiments on a synthetic 2D grid and real-world molecule generation tasks demonstrate that our approach achieves performance comparable to baselines.
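
One way to see why exact recovery is plausible for linear scalarization: if each pre-trained sampler draws x with probability R_i(x)/Z_i, then the distribution proportional to the weighted sum of rewards is a mixture of those samplers with mixing weights proportional to w_i Z_i. The sketch below checks this identity on a tiny discrete space. It illustrates the underlying fact only; the abstract does not say that the paper's routing mechanism is implemented this way, and all function names are assumptions.

```python
def normalize(values):
    """Scale non-negative values so they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def scalarized_target(rewards, prefs):
    """Distribution proportional to sum_i prefs[i] * rewards[i][x]."""
    n = len(rewards[0])
    combined = [sum(p * r[x] for p, r in zip(prefs, rewards)) for x in range(n)]
    return normalize(combined)


def mixture_of_samplers(rewards, prefs):
    """Mix per-objective samplers with weights proportional to prefs[i] * Z_i."""
    partition = [sum(r) for r in rewards]           # Z_i for each objective
    mix = normalize([p * z for p, z in zip(prefs, partition)])
    samplers = [normalize(r) for r in rewards]      # p_i(x) = R_i(x) / Z_i
    n = len(rewards[0])
    return [sum(m * s[x] for m, s in zip(mix, samplers)) for x in range(n)]


rewards = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 1.0, 2.0]]  # two toy objectives
prefs = [0.3, 0.7]                                      # preference weights
target = scalarized_target(rewards, prefs)
mixed = mixture_of_samplers(rewards, prefs)
```

The two lists agree up to floating-point error, so sampling an objective index from the mixing weights and then sampling from that objective's model reproduces the scalarized target in this toy setting.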

2602.20141 2026-05-29 cs.AI

Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

循环结构策略梯度用于部分可观测平均场博弈

Clarisse Wibault, Johannes Forkel, Sebastian Towers, Tiphaine Wibault, Juan Duque, George Whittle, Andreas Schaab, Yucheng Yang, Chiyuan Wang, Maike Osborne, Benjamin Moll, Jakob Foerster

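AI summary: For partially observable mean field games, proposes RSPG, the first history-aware hybrid structural method, which computes expected returns by exploiting low-dimensional state and action spaces and known transition dynamics, converging an order of magnitude faster than model-free RL methods.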

详情
AI中文摘要

平均场博弈(MFGs)为大规模群体系统中的交互建模提供了原则性框架。然而,由于无模型方法方差高而精确方法扩展性差,算法进展有限。最近的混合结构方法(HSMs)通过利用低维个体状态和动作空间以及已知的转移动力学,计算以公共噪声的蒙特卡洛轨迹为条件的精确期望回报,从而在保持可处理性的同时降低方差。然而,HSMs尚未扩展到部分可观测设置。我们提出循环结构策略梯度(RSPG),这是首个用于具有公共部分信息的MFGs的历史感知HSM。RSPG实现了比无模型RL方法快一个数量级的收敛速度,同时学习历史感知行为,这与当前的HSMs不同。为了促进对MFGs的研究,我们还引入了MFAX,这是我们基于JAX的MFG框架,支持解析和基于样本的平均场更新。MFAX和使用示例可在https://clarisse-wibault.github.io/rspg/找到。

英文摘要

Mean Field Games (MFGs) provide a principled framework for modelling interactions in large population systems. However, algorithmic progress has been limited since model-free methods are high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) reduce variance while maintaining tractability by leveraging low-dimensional individual state and action spaces and known transition dynamics to compute the exact expected return conditioned on Monte Carlo rollouts of common noise. However, HSMs have not been extended to partially observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for MFGs with public partial information. RSPG achieves an order-of-magnitude faster convergence than model-free RL methods while learning history-aware behaviour, unlike current HSMs. To facilitate research into MFGs, we also introduce MFAX, our JAX-based framework for MFGs that supports both analytic and sample-based mean-field updates. MFAX and usage examples can be found at https://clarisse-wibault.github.io/rspg/.

2602.18527 2026-05-29 cs.CV cs.AI cs.SD

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

JAEGER:模拟物理环境中的联合3D音频-视觉定位与推理

Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

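AI summary: Proposes the JAEGER framework, which extends audio-visual large language models to 3D space by integrating RGB-D observations and multi-channel first-order ambisonics for joint spatial grounding and reasoning, and introduces the neural intensity vector (Neural IV) to make sound-source direction estimation more robust.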

详情
Comments
Accepted to ICML 2026
AI中文摘要

当前的音频-视觉大语言模型(AV-LLMs)主要局限于2D感知,依赖于RGB视频和单声道音频。这种设计选择引入了基本的维度不匹配,阻碍了在复杂3D环境中可靠的声源定位和空间推理。我们通过提出JAEGER框架来解决这一限制,该框架将AV-LLMs扩展到3D空间,通过集成RGB-D观测和多通道一阶环境声学实现联合空间定位与推理。我们工作的核心贡献是神经强度向量(Neural IV),一种学习的空间音频表示,它编码了鲁棒的方向线索,以增强到达方向估计,即使在具有重叠声源的不利声学场景中也是如此。为了促进大规模训练和系统评估,我们提出了SpatialSceneQA,一个包含从模拟物理环境中整理的6.1万个指令调优样本的基准。大量实验表明,我们的方法在各种空间感知和推理任务中始终优于以2D为中心的基线,强调了显式3D建模对于推进物理环境中AI的必要性。我们的源代码、预训练模型检查点和数据集可在https://github.com/liuzhan22/JAEGER获取。

英文摘要

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets are available at https://github.com/liuzhan22/JAEGER.
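
For background, the classical (non-learned) intensity vector from first-order ambisonics can be sketched as below. This is not the paper's Neural IV, which is a learned representation; it is the textbook estimate that such a representation builds on conceptually. The encoder uses an unnormalized W channel, which is an assumption made here; scaling conventions for W differ between formats and do not change the estimated direction.

```python
import math


def encode_foa(signal, azimuth, elevation=0.0):
    """Encode a mono source into first-order channels (W, X, Y, Z)."""
    w = list(signal)
    x = [s * math.cos(azimuth) * math.cos(elevation) for s in signal]
    y = [s * math.sin(azimuth) * math.cos(elevation) for s in signal]
    z = [s * math.sin(elevation) for s in signal]
    return w, x, y, z


def intensity_azimuth(w, x, y):
    """Estimate azimuth from the time-averaged products W*X and W*Y."""
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    return math.atan2(iy, ix)


source = [math.sin(0.1 * n) for n in range(200)]  # toy single source
true_azimuth = math.radians(60.0)
w, x, y, z = encode_foa(source, true_azimuth)
estimate = intensity_azimuth(w, x, y)
```

With a single clean source the estimate matches the true azimuth. With overlapping sources the averaged products mix several directions, which is the kind of adverse scenario the abstract says a learned representation is meant to handle.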

2602.18196 2026-05-29 cs.LG

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

RAT+:密集训练,稀疏推理——用于扩张推理的循环增强注意力

Xiuying Wei, Caglar Gulcehre

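AI summary: Proposes the RAT+ architecture, which uses dense pretraining and recurrence-augmented attention so that the model can switch flexibly to sparse dilated attention at inference time, sharply reducing compute and cache overhead while maintaining high accuracy.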

详情
Comments
Accepted by ICML2026
AI中文摘要

结构化扩张注意力具有吸引人的推理效率调节旋钮:它将注意力的FLOPs和KV缓存大小减少扩张大小D的倍数,同时保持长程连接。虽然先前的工作通过从头训练每个配置来研究它,但直接将预训练注意力模型稀疏化为扩张模式会导致严重的精度下降,阻碍跨推理场景的灵活重用。我们引入RAT+,一种密集预训练架构,通过全序列循环和主动循环学习增强注意力。单个RAT+模型密集预训练一次,然后可以在推理时灵活切换到扩张注意力(可选局部窗口)或混合层/头组合,仅需短期的10亿token分辨率适应,而无需重新训练单独的稀疏模型。在100B token上训练的1.5B参数模型中,RAT+在D=16时紧密匹配密集精度,在D=64时在常识推理和LongBench任务上下降约2-3个点。我们进一步扩展到2.6B和7.6B参数,观察到更有希望的性能(例如,在注意力FLOPs和KV缓存大小减少64倍的情况下,平均精度损失1个点)。代码可在https://github.com/wimh966/rat-plus获取。

英文摘要

Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D = 16, and drops by about 2-3 points at D = 64 on commonsense reasoning and LongBench tasks. We further scale to 2.6B and 7.6B parameters and observe even more promising performance (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size). Code is available at https://github.com/wimh966/rat-plus.
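
The factor-of-D saving can be made concrete by listing which key positions a single query visits. The sketch assumes the simplest causal dilated pattern, where a query at position i attends to keys j with (i - j) divisible by D; the exact pattern used by RAT+, including local windows and the recurrence that summarizes skipped positions, is not specified in the abstract.

```python
def dilated_keys(query_pos: int, dilation: int) -> list:
    """Causal key positions visited by one query under dilation D."""
    # Step backwards from the query in strides of D, then restore ascending order.
    return list(range(query_pos, -1, -dilation))[::-1]


dense = dilated_keys(63, 1)     # every position 0..63
sparse = dilated_keys(63, 16)   # positions 15, 31, 47, 63
reduction = len(dense) / len(sparse)
```

For a query at position 63, dense attention reads 64 keys while D = 16 reads 4, so both the attention work and the number of cached keys needed by that query shrink by a factor of 16 under this assumed pattern.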

2602.16610 2026-05-29 cs.CL cs.AI cs.LG

Who can we trust? LLM-as-a-jury for Comparative Assessment

我们该信任谁?LLM作为陪审团进行比较评估

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill

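AI summary: To address inconsistent judgments and varying reliability when LLMs act as evaluators, proposes the BT-sigma model, which introduces a discriminator parameter to jointly infer item rankings and judge reliability and outperforms averaging-based aggregation.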

详情
Comments
Accepted to ICML 2026
AI中文摘要

大型语言模型(LLMs)越来越多地被用作自动评估器,用于自然语言生成评估,通常采用成对比较判断。现有方法通常依赖单一法官或聚合多个法官并假设其可靠性相同。在实践中,LLM法官在不同任务和评估方面的表现差异很大,其判断概率可能存在偏差和不一致。此外,用于法官校准的人工标注监督可能不可用。我们首先通过实验证明LLM比较概率的不一致性存在,并表明这限制了直接基于概率排名的有效性。为解决此问题,我们研究了LLM作为陪审团的设置,并提出了BT-sigma,这是Bradley-Terry模型的一种法官感知扩展,为每个法官引入一个判别参数,仅从成对比较中联合推断项目排名和法官可靠性。在基准NLG评估数据集上的实验表明,BT-sigma始终优于基于平均的聚合方法,并且学习到的判别参数与LLM判断的循环一致性的独立度量高度相关。进一步分析揭示,BT-sigma可以解释为一种无监督校准机制,通过建模法官可靠性来改进聚合。

英文摘要

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and evaluation aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminators strongly correlate with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
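
A judge-aware Bradley-Terry likelihood can be sketched as follows. The functional form is an assumption: here each judge has a discrimination value that multiplies the skill difference inside the logistic function, so a value near zero means the judge's votes carry almost no information. The abstract does not give the exact BT-sigma parameterization, and the names `win_prob` and `log_likelihood` are illustrative.

```python
import math


def win_prob(score_i: float, score_j: float, discrimination: float) -> float:
    """Probability that a judge prefers item i over item j."""
    return 1.0 / (1.0 + math.exp(-discrimination * (score_i - score_j)))


def log_likelihood(comparisons, scores, discrimination) -> float:
    """Sum of log-probabilities over (judge, winner, loser) triples."""
    return sum(
        math.log(win_prob(scores[w], scores[l], discrimination[j]))
        for j, w, l in comparisons
    )


scores = {"a": 1.0, "b": 0.0}
discrimination = {"sharp": 2.0, "noisy": 0.1}
votes = [("sharp", "a", "b"), ("noisy", "b", "a")]
ll = log_likelihood(votes, scores, discrimination)
```

Maximizing this quantity jointly over item scores and per-judge discrimination values is one way to infer rankings and judge reliability from pairwise comparisons alone; a contradictory vote from the low-discrimination judge costs little likelihood, so it barely moves the ranking.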

2602.16449 2026-05-29 cs.LG cs.AI stat.ML

GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation

GICDM: 缓解枢纽性以实现可靠的基于距离的生成模型评估

Nicolas Salvy, Hugues Talbot, Bertrand Thirion

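AI summary: To address the hubness phenomenon in the high-dimensional embedding spaces used for generative model evaluation, proposes GICDM (based on the Iterative Contextual Dissimilarity Measure), which corrects neighborhood estimation with a multi-scale extension, restoring reliable metrics and aligning with human assessment.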

详情
Comments
Forty-third International Conference on Machine Learning, 2026
AI中文摘要

生成模型评估通常依赖于高维嵌入空间来计算样本之间的距离。我们表明,这些空间中的数据集表示受到枢纽性现象的影响,这会扭曲最近邻关系并使基于距离的度量产生偏差。基于经典的迭代上下文不相似度度量(ICDM),我们引入了生成式ICDM(GICDM),一种校正真实数据和生成数据邻域估计的方法。我们引入了多尺度扩展以改善经验行为。在合成和真实基准上的大量实验表明,GICDM解决了枢纽性引起的失败,恢复了可靠的度量行为,并改善了与人类评估的一致性。

英文摘要

Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset representations in these spaces are affected by the hubness phenomenon, which distorts nearest-neighbor relationships and biases distance-based metrics. Building on the classical Iterative Contextual Dissimilarity Measure (ICDM), we introduce Generative ICDM (GICDM), a method to correct neighborhood estimation for both real and generated data. We introduce a multi-scale extension to improve empirical behavior. Extensive experiments on synthetic and real benchmarks demonstrate that GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human assessment.
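
Hubness is usually diagnosed through the k-occurrence count: how many times each point appears in the k-nearest-neighbor lists of the other points. The sketch below computes that count with the standard library only. It shows the diagnostic, not the GICDM correction itself, and the five-point layout is a toy example chosen so that one point becomes a hub.

```python
import math
from collections import Counter


def k_occurrence(points, k):
    """Count how often each point is among another point's k nearest neighbors."""
    counts = Counter()
    for i, p in enumerate(points):
        # Distances to every other point; ties are broken by index.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        counts.update(j for _, j in dists[:k])
    return [counts[j] for j in range(len(points))]


# A center point surrounded by four outer points.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
occurrences = k_occurrence(points, k=1)  # → [4, 1, 0, 0, 0]
```

The center is the nearest neighbor of all four outer points while three points are never anyone's neighbor. In high-dimensional embeddings this skew can arise without any such obvious geometry, which is why nearest-neighbor-based metrics become biased.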

2602.15382 2026-05-29 cs.CL cs.CV cs.LG

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

视觉虫洞:异构多智能体系统中的潜在空间通信

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao

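AI summary: Proposes the Vision Wormhole framework, in which a Universal Visual Codec maps reasoning traces into a shared continuous space to transfer latent states between heterogeneous VLMs without per-pair translators, reducing alignment complexity and improving efficiency.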

详情
Comments
Preprint. Work in progress
AI中文摘要

由大型语言模型驱动的多智能体系统(MAS)实现了先进的协作推理,但仍受限于离散文本通信,这带来了运行时开销和信息量化损失。虽然潜在状态传输提供了一种替代方案,但现有方法要么假设同构的发送器-接收器架构,要么依赖于特定配对的学得翻译器,限制了跨具有不连续流形的不同模型族的可扩展性。我们将为自然图像训练的视觉-语言模型(VLM)的视觉界面重新概念化为异构智能体之间的连续通信通道,并将这一思想实例化为视觉虫洞:一种通用视觉编解码器,将推理轨迹映射到共享的连续参考空间,并将其注入接收器的视觉通路,实现无需配对翻译器的跨架构潜在状态传输。该框架采用中心辐射拓扑,将对齐复杂度从$O(N^2)$降低到$O(N)$,并通过无标签的教师-学生蒸馏针对文本通道进行训练,无需并行隐藏状态监督。在异构VLM族(Qwen-VL、Gemma、SmolVLM2、LFM2.5-VL)和九个推理基准上的大量实验表明,视觉虫洞在大多数评估设置中减少了端到端挂钟时间,并产生了正的平均宏$Δ$-准确率。

英文摘要

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss. While latent state transfer offers an alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability across diverse model families with disjoint manifolds. We reconceptualize the visual interface of Vision-Language Models (VLMs), trained for natural images, as a continuous communication channel between heterogeneous agents, and instantiate this idea as the Vision Wormhole: a Universal Visual Codec maps reasoning traces into a shared continuous reference space and injects them into the receiver's visual pathway, yielding cross-architecture latent state transfer without per-pair translators. The framework adopts a hub-and-spoke topology that reduces alignment complexity from $O(N^2)$ to $O(N)$, and is trained by label-free teacher-student distillation against the text channel, requiring no parallel hidden-state supervision. Extensive experiments across heterogeneous VLM families (Qwen-VL, Gemma, SmolVLM2, LFM2.5-VL) and nine reasoning benchmarks show that the Vision Wormhole reduces end-to-end wall-clock time across most evaluated settings and yields positive macro-average $Δ$-accuracy.
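
The drop from quadratic to linear alignment cost is a counting argument. The sketch assumes one learned translator per ordered pair of models in the pairwise design, and one adapter into plus one adapter out of the shared space per model in the hub-and-spoke design; both counts are illustrative assumptions, not figures from the paper.

```python
def pairwise_translators(num_models: int) -> int:
    """One translator per ordered (sender, receiver) pair."""
    return num_models * (num_models - 1)


def hub_adapters(num_models: int) -> int:
    """One encoder into and one decoder out of the shared space per model."""
    return 2 * num_models


for n in (4, 16, 64):
    print(n, pairwise_translators(n), hub_adapters(n))
```

Adding a new model family to the pairwise design requires training against every existing model, whereas under the hub assumption it only needs its own pair of adapters to the shared space.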

2602.15239 2026-05-29 cs.LG

Size Transferability of Graph Transformers with Convolutional Positional Encodings

图Transformer的尺寸可迁移性与卷积位置编码

Javier Porras-Valenzuela, Zhiyang Wang, Xiaotao Shang, Yusu Wang, Alejandro Ribeiro

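AI summary: Connects graph Transformers to manifold neural networks through GNN positional encodings, proves that models trained on small graphs generalize to larger graphs, and validates scalability on standard benchmarks and a real-world terrain shortest-path estimation task.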

详情
AI中文摘要

Transformer在各个领域取得了显著成功,推动了图Transformer(GTs)作为基于注意力的图结构数据架构的兴起。GTs的一个关键设计选择是使用基于图神经网络(GNN)的位置编码来融入结构信息。在这项工作中,我们通过图序列的流形极限模型研究GTs,并建立了具有GNN位置编码的GTs与流形神经网络(MNNs)之间的理论联系。基于GNN在流形收敛下的可迁移性结果,我们证明了GTs从其位置编码继承了可迁移性保证。特别地,在温和假设下,在小图上训练的GTs可以证明地泛化到更大的图。我们通过标准图基准上的大量实验补充了理论,表明GTs表现出与GNN相当的可扩展行为。为了进一步展示在真实场景中的效率,我们实现了GTs用于地形上的最短路径距离估计,以更好地说明可迁移GTs的效率。我们的结果为理解GTs提供了新见解,并为在大规模设置中高效训练GTs提出了实用方向。

英文摘要

Transformers have achieved remarkable success across domains, motivating the rise of Graph Transformers (GTs) as attention-based architectures for graph-structured data. A key design choice in GTs is the use of Graph Neural Network (GNN)-based positional encodings to incorporate structural information. In this work, we study GTs through the lens of manifold limit models for graph sequences and establish a theoretical connection between GTs with GNN positional encodings and Manifold Neural Networks (MNNs). Building on transferability results for GNNs under manifold convergence, we show that GTs inherit transferability guarantees from their positional encodings. In particular, GTs trained on small graphs provably generalize to larger graphs under mild assumptions. We complement our theory with extensive experiments on standard graph benchmarks, demonstrating that GTs exhibit scalable behavior on par with GNNs. To demonstrate this in a real-world scenario, we implement GTs for shortest path distance estimation over terrains, illustrating the efficiency of transferable GTs. Our results provide new insights into the understanding of GTs and suggest practical directions for efficient training of GTs in large-scale settings.

2602.12304 2026-05-29 cs.SD cs.AI cs.MM eess.AS

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

OmniCustom: 通过联合音视频生成模型实现同步音视频定制

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

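AI summary: Proposes OmniCustom, a DiT-based zero-shot audio-video customization framework that uses a reference image and reference audio to synchronously generate video preserving identity and timbre, with speech content specified by text.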

详情
Comments
code: https://github.com/OmniCustom-project/OmniCustom
AI中文摘要

现有的主流视频定制方法侧重于基于给定参考图像和文本提示生成身份一致的视频。受益于联合音视频生成的快速发展,本文提出一个更具吸引力的新任务:同步音视频定制,旨在同步定制视频身份和音频音色。具体来说,给定参考图像$I^{r}$和参考音频$A^{r}$,该新任务要求生成保持参考图像身份并模仿参考音频音色的视频,语音内容可由用户提供的文本提示自由指定。为此,我们提出OmniCustom,一个基于DiT的强大音视频定制框架,能够以零样本方式一次性根据参考图像身份、音频音色和文本提示合成视频。我们的框架基于三个关键贡献。首先,身份和音频音色控制通过独立的参考身份和音频LoRA模块实现,这些模块通过基础音视频生成模型中的自注意力层操作。其次,我们引入了对比学习目标与标准流匹配目标一起使用。它将以参考输入为条件的预测流作为正例,以无参考条件的预测流作为负例,从而增强模型保持身份和音色的能力。第三,我们在构建的大规模高质量音视频人类数据集上训练OmniCustom。大量实验表明,OmniCustom在生成具有一致身份和音色保真度的音视频内容方面优于现有方法。项目页面:https://omnicustom-project.github.io/page/。

英文摘要

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity. Project page: https://omnicustom-project.github.io/page/.

2602.11171 2026-05-29 cs.CL cs.AI

A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search

语言引导的贝叶斯优化用于高效LoRA超参数搜索

Baek Seong-Eun, Lee Jung-Mok, Kim Sung-Bin, Tae-Hyun Oh

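AI summary: Proposes a Bayesian optimization framework that exploits the domain knowledge of pre-trained LLMs, mapping hyperparameters into a continuous space through language prompts and using subset-training proxy evaluation; in only about 30 iterations it finds LoRA hyperparameters that improve performance by more than 20% over standard hyperparameters.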

详情
Comments
Accepted at ICML 2026
AI中文摘要

使用低秩适配(LoRA)微调大型语言模型(LLM)提供了一种资源高效的方式来实现个性化或专业化。然而,LoRA对超参数选择高度敏感,且穷举超参数搜索计算成本高昂。为此,我们提出一个贝叶斯优化(BO)框架,利用预训练LLM的领域知识来高效搜索LoRA超参数。我们的方法将预训练LLM重新用作离散到连续映射模块,将超参数及其领域知识链接到连续向量空间,在其中进行BO。我们通过语言提示设计和控制映射,提供描述超参数间关系及其各自角色的领域感知文本提示。这使我们能够以自然语言将关于LoRA的领域知识显式注入LLM。我们还引入一个额外的可学习标记,以捕获提示中难以用语言描述的残差信息。这有助于BO采样更多高性能超参数。此外,通过利用LoRA训练机制中从完整数据集和子集训练数据集获得的性能之间观察到的强相关性,我们引入使用数据子集的代理训练和评估。这显著提高了我们方法的效率。我们证明,仅需约30次迭代发现的超参数,相比从约45,000种组合中找到的标准超参数,实现了超过20%的性能提升。项目页面:https://baekseongeun.github.io/lora-bo/

英文摘要

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) offers a resource-efficient way to personalize or specialize them. However, LoRA is highly sensitive to hyperparameter choices, and exhaustive hyperparameter search is computationally expensive. To address this, we propose a Bayesian Optimization (BO) framework that leverages the domain knowledge of pre-trained LLMs to efficiently search for LoRA hyperparameters. Our approach repurposes a pre-trained LLM as a discrete-to-continuous mapping module to link hyperparameters and their domain knowledge to a continuous vector space, where BO is conducted. We design and control the mapping via language prompting, providing a domain-aware textual prompt that describes the relationships among hyperparameters and their respective roles. This allows us to explicitly inject domain knowledge about LoRA into the LLM in natural language. We also introduce an additional learnable token to capture residual information that is difficult to describe linguistically in the prompt. This helps BO sample more high-performing hyperparameters. In addition, by leveraging the strong correlation observed between the performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation using a data subset. This significantly improves the efficiency of our method. We demonstrate that our hyperparameters, discovered with only about 30 iterations, achieve more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations. Project page: https://baekseongeun.github.io/lora-bo/

2602.11065 2026-05-29 cs.CL cs.AI

S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling

S-MARC:全双工对话行为建模的因果流式推理

Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy, Dhruv Hebbar, Siddhant Patel, Zeyi Austin Li, Kan Jen Cheng, Sanay Bordia, Krish Patel, Akshaj Gupta, Tingle Li, Gopala Anumanchipalli

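AI summary: Proposes the S-MARC framework, which models the intent-to-action pathway with streaming, causal, hierarchical modeling, predicts high-level communicative functions and low-level interaction behaviors, and builds a high-quality corpus, achieving robust behavior detection and interpretable reasoning in full-duplex conversation.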

详情
AI中文摘要

人类对话由隐式的思维链组织,并表现为时间结构化的对话行为。捕捉这一感知路径对于构建自然的全双工交互系统至关重要。我们提出了S-MARC(对话的流式因果建模与推理),一个用于对话行为建模与推理的流式、因果、层次化框架。通过形式化意图到动作的路径,S-MARC预测高层交际功能和低层交互行为,同时建模它们的因果和时间依赖关系。为支持这一设置,我们构建了一个高质量语料库,将可控、事件丰富的双工对话数据与行为标签配对。S-MARC将流式预测组织成持续演化的图结构,为其决策生成简洁的推理依据,并动态优化其推理过程。在合成和真实双工对话上的实验表明,S-MARC实现了鲁棒的行为检测,产生了可解释的推理链,并为全双工口语对话系统中的对话推理建立了基准基础。

英文摘要

Human conversation is organized by an implicit chain of thought and manifests as temporally structured conversational behaviors. Capturing this perceptual pathway is critical for building natural full-duplex interactive systems. We propose S-MARC (Streaming Causal Modeling and Reasoning for Conversation), a streaming, causal, and hierarchical framework for conversational behavior modeling and reasoning. By formalizing the intent-to-action pathway, S-MARC predicts high-level communicative functions and low-level interaction behaviors while modeling their causal and temporal dependencies. To support this setting, we construct a high-quality corpus that pairs controllable, event-rich duplex dialogue data with behavior labels. S-MARC organizes streaming predictions into a continuously evolving graph structure, generates concise justifications for its decisions, and dynamically optimizes its reasoning process. Experiments on synthetic and real duplex dialogues show that S-MARC achieves robust behavior detection, produces interpretable reasoning chains, and establishes a benchmark foundation for conversational reasoning in full-duplex spoken dialogue systems.

2602.10637 2026-05-29 cs.LG cond-mat.stat-mech physics.chem-ph stat.ML

Coarse-Grained Boltzmann Generators

粗粒度玻尔兹曼生成器

Weilong Chen, Bojun Zhao, Jan Eckwert, Julija Zavadlav

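AI summary: Proposes the Coarse-Grained Boltzmann Generators (CG-BGs) framework, which combines flow-based generative models with importance sampling and reweights using a learned potential of mean force (PMF), enabling equilibrium sampling of large molecular systems at reduced computational cost.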

详情
Comments
Accepted at ICML 2026
AI中文摘要

从玻尔兹曼分布中采样平衡分子构型是一个长期挑战。玻尔兹曼生成器(BGs)通过结合精确似然生成模型与重要性采样来解决这一问题,但实际可扩展性有限。同时,粗粒度代理模型通过降低有效维度来建模更大系统,但往往缺乏确保渐近正确统计量的重加权过程。在这项工作中,我们提出了粗粒度玻尔兹曼生成器(CG-BGs),一个用于粗粒度坐标空间中的降阶生成建模与重要性采样的框架。CG-BGs使用基于流的模型生成样本,并使用学习到的平均力势(PMF)进行重加权。我们表明,可以通过增强采样力匹配从快速收敛的轨迹中学习PMF。实验证明,CG-BGs在高度降阶表示中捕获溶剂介导的相互作用,同时相对于原子级BGs大幅降低计算成本,为更大分子系统的平衡采样提供了实用途径。

英文摘要

Sampling equilibrium molecular configurations from the Boltzmann distribution is a longstanding challenge. Boltzmann Generators (BGs) address this by combining exact-likelihood generative models with importance sampling, but practical scalability is limited. Meanwhile, coarse-grained surrogates enable the modeling of larger systems by reducing effective dimensionality, yet often lack a reweighting procedure required to ensure asymptotically correct statistics. In this work, we propose Coarse-Grained Boltzmann Generators (CG-BGs), a framework for reduced-order generative modeling with importance sampling in coarse-grained coordinate space. CG-BGs generate samples using a flow-based model and reweight them using a learned potential of mean force (PMF). We show that the PMF can be learned from rapidly converged trajectories via enhanced sampling force matching. Experiments demonstrate that CG-BGs capture solvent-mediated interactions in highly reduced representations while substantially reducing computational cost relative to atomistic BGs, providing a practical route toward equilibrium sampling of larger molecular systems.
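
The reweighting step shared by Boltzmann Generators can be sketched as self-normalized importance sampling: each sample drawn from the model with log-density log q(x) receives a weight proportional to exp(-U(x)/kT) / q(x). In the coarse-grained setting described above, U would be the learned PMF evaluated at coarse-grained coordinates. The function names and the toy numbers are assumptions for illustration.

```python
import math


def boltzmann_weights(energies, log_q, kT=1.0):
    """Self-normalized importance weights for samples from a proposal q."""
    log_w = [-u / kT - lq for u, lq in zip(energies, log_q)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


def reweighted_mean(values, weights):
    """Estimate an equilibrium average of an observable."""
    return sum(v * w for v, w in zip(values, weights))


energies = [0.0, 1.0, 2.0]   # toy potential values U(x)
log_q = [0.0, -1.0, -2.0]    # proposal already proportional to exp(-U)
weights = boltzmann_weights(energies, log_q)
average = reweighted_mean([10.0, 20.0, 30.0], weights)
```

When the proposal already matches the Boltzmann distribution up to a constant, as in this toy case, all weights are equal. The further the flow's density is from the target, the more uneven the weights become, which is one reason scaling to large systems is hard.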

2602.10520 2026-05-29 cs.LG

Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

优先过程而非结果:奖励潜在思维轨迹改善循环语言模型的推理能力

Jonathan Williams, Esin Tureci

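AI summary: Standard reinforcement learning (e.g., GRPO) rewards only the final latent state and so fails to improve reasoning in looped language models (LoopLMs); the proposed RLTT framework assigns reward across the entire latent reasoning trajectory for dense trajectory-level credit assignment, significantly improving mathematical reasoning and generalizing to non-mathematical tasks.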

详情
Comments
ICML 2026
AI中文摘要

循环语言模型(LoopLMs)在生成token之前执行多步潜在推理,并在较小的参数预算下在推理基准上优于传统LLM。然而,使用强化学习进一步改进LoopLM推理的尝试失败了——诸如群体相对策略优化(GRPO)等标准目标仅对最终潜在状态分配信用,与模型的内部计算存在根本性不匹配。为解决此问题,我们引入了RLTT(奖励潜在思维轨迹),这是一种强化学习框架,将奖励分布在整个潜在推理轨迹上。RLTT提供密集的、轨迹级别的信用分配,无需依赖外部验证器,并且可以直接替代GRPO,开销可忽略不计。在相同训练和推理条件下,使用Ouro-1.4B/2.6B-Thinking进行的大量实验中,RLTT在具有挑战性的数学推理基准上比GRPO取得了统计上显著的改进,在1.4B规模上,MATH-500、AIME24/26和BeyondAIME的平均准确率提高了+5.8%,在2.6B规模上提高了+10.9%。尽管仅在数学上训练,RLTT也能有效迁移到非数学推理基准,证明了轨迹级信用分配在LoopLMs中强化学习的有效性。代码可在https://github.com/jonwill8/RLTT.git获取。

英文摘要

Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) only assign credit to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-1.4B/2.6B-Thinking under identical training and inference conditions, RLTT yields statistically significant improvements over GRPO on challenging mathematical reasoning benchmarks, improving mean accuracy over MATH-500, AIME24/26, and BeyondAIME by +5.8% on the 1.4B scale, and +10.9% on the 2.6B scale. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs. Code is available at https://github.com/jonwill8/RLTT.git.
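
The contrast between final-state and trajectory-level credit can be sketched with two small functions. The first computes the group-relative advantage used by GRPO-style objectives (reward minus group mean, divided by group standard deviation; implementations differ on the exact normalization). The second spreads one advantage evenly over the latent steps, which is purely an assumption made for illustration: the abstract does not state how RLTT weights the steps.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def spread_over_steps(advantage, num_steps):
    """Hypothetical uniform split of one advantage across latent steps."""
    return [advantage / num_steps] * num_steps


group_rewards = [1.0, 0.0, 1.0, 0.0]      # e.g. correct / incorrect answers
advantages = group_relative_advantages(group_rewards)
per_step = spread_over_steps(advantages[0], num_steps=4)
```

Under final-state credit only the last latent step would receive `advantages[0]`; under the uniform split every loop iteration receives a share, so earlier latent computation also gets a learning signal.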

2602.08979 2026-05-29 cs.SD cs.CL

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

超越文本:音频章节划分的新视角

Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel

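AI summary: Through an audio-only architecture (AudioSeg), an analysis of factors affecting performance, and formalized evaluation protocols, this paper systematically studies audio chaptering, finding that AudioSeg significantly outperforms text-based methods, pauses are the most effective acoustic feature, and multimodal LLMs show promise on short audio.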

详情
Comments
Accepted at ACL 2026 (Main Conference)
AI中文摘要

音频章节划分是将长音频分割成连贯部分的任务,对于导航播客、讲座和视频越来越重要。尽管其相关性,研究仍然有限且基于文本,留下了关于利用音频信息、处理ASR错误以及无转录评估的关键问题未解决。我们通过三个贡献来解决这些空白:(1)基于文本模型与声学特征、一种新颖的仅音频架构(AudioSeg,操作于学习到的音频表示)以及多模态大模型的系统比较;(2)影响性能因素的经验分析,包括转录质量、声学特征、持续时间和说话人组成;(3)形式化的评估协议,对比依赖转录的文本空间协议与转录不变的时间空间协议。我们在YTSeg上的实验表明,AudioSeg显著优于基于文本的方法,停顿提供了最大的声学增益,而MLLMs受限于上下文长度和指令遵循能力较弱,但MLLMs在较短的音频上显示出潜力。

英文摘要

Audio chaptering, the task of segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.