arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31603 2026-06-01 cs.CV cs.AI

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu

AI summary: Proposes the Lumos-Nexus framework, which uses a two-stage design and progressive frequency bridging to substantially improve video generation fidelity while preserving reasoning ability.

Comments
Project page (https://jiazheng-xing.github.io/nexus-lumos-home/) and Code (https://github.com/alibaba-damo-academy/Lumos-Custom/) are available

Abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

2605.31598 2026-06-01 cs.CV

Linear Scaling Video VLMs for Long Video Understanding

Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles

AI summary: Proposes StateKV, which achieves linear-time video prefill through a fixed-capacity, importance-driven recurrent state, markedly reducing compute while staying close to full self-attention performance.

Abstract

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.

2605.31596 2026-06-01 cs.CV cs.LG

KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems

Alireza Kheirandish, Jihoon Hong, Sara Fridovich-Keil

AI summary: Proposes a KL-divergence-based OOD detection metric that requires no calibration data or knowledge of the shifted distribution and can both detect and localize local distribution shifts within an image.

Comments
CVPR 2026

Abstract

Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback-Leibler divergence between the diffusion prior and the posterior distribution, that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems. Our code can be found at https://github.com/voilalab/KLIP.

2605.31595 2026-06-01 cs.CV

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim

AI summary: Proposes C4G, a framework that combines a compact set of learnable Gaussian query tokens with video-diffusion-based rendering enhancement to perform pose-free, feed-forward 4D reconstruction of dynamic scenes, using far fewer Gaussians and modeling motion more robustly.

Comments
Project Page: see https://cvlab-kaist.github.io/C4G

Abstract

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.

2605.31594 2026-06-01 cs.LG math.OC

A Tight Theory of Error Feedback Algorithms in Distributed Optimization

Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut

AI summary: Gives tight convergence analyses for two mainstream error-feedback algorithms in distributed optimization (EF and EF21) by identifying optimal step sizes and constructing optimal Lyapunov functions; the results are independent of the number of agents and recover the best known guarantees in the single-agent case.

Abstract

Communication costs are a major bottleneck in distributed learning and first-order optimization. A common approach to alleviate this issue is to compress the gradient information exchanged between agents. However, such compression typically degrades the convergence guarantees of gradient-based methods. Error feedback mechanisms provide a simple and computationally cheap remedy for this issue, but numerous variants have been proposed, and their relative performance remains poorly understood. This paper provides tight convergence analyses for two of the main error-feedback algorithms from the literature, the classic Error Feedback method (EF) and Error Feedback 21 (EF21), by identifying optimal step-size choices and constructing optimal Lyapunov functions tailored to each method. The results hold independently of the number of agents and recover the known best guarantees possible in the single-agent regime.
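
For readers unfamiliar with error feedback, the EF21 update can be sketched roughly as follows. This is a minimal illustration and not the paper's analysis: the top-k compressor, the initialisation of the estimates with exact gradients, and the names `top_k` and `ef21` are all assumptions made here. Classic EF differs in that each agent accumulates the compression residual and adds it back before compressing the next step.

```python
def top_k(v, k):
    """Contractive top-k compressor: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k])
    return [vj if j in keep else 0.0 for j, vj in enumerate(v)]


def ef21(grads, x0, step, k, iters):
    """EF21 sketch: agent i stores an estimate g_i of its local gradient and
    transmits only the compressed correction C(grad_i(x) - g_i)."""
    n, x = len(grads), list(x0)
    # Assumption: estimates start at the exact local gradients.
    g = [grad(x) for grad in grads]
    for _ in range(iters):
        avg = [sum(col) / n for col in zip(*g)]          # server averages the estimates
        x = [xj - step * aj for xj, aj in zip(x, avg)]   # step along the averaged estimate
        for i, grad in enumerate(grads):
            diff = [a - b for a, b in zip(grad(x), g[i])]
            g[i] = [b + c for b, c in zip(g[i], top_k(diff, k))]
    return x
```

Each agent sends only k coordinates per round, yet the estimates track the true gradients because the quantity being compressed shrinks as the iterates settle.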

2605.31593 2026-06-01 cs.CR cs.AI

Stateful Online Monitoring Catches Distributed Agent Attacks

Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong, Ivan Zhang, Matan Shtepel, Steffi Chern, Alexander Robey, Eric Wong, Hamed Hassani

AI summary: Because malicious behavior that is spread across accounts in distributed agent attacks is hard for single-context monitors to detect, proposes a stateful online monitor based on real-time clustering that catches distributed attacks earlier and more effectively while keeping latency low.

Abstract

Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.

2605.31591 2026-06-01 cs.CV

CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference

Nurjahan Sultana, Moi Hoon Yap, Xinqi Fan, Wenqi Lu

AI summary: Proposes CoFiDA-M, a learning-with-privileged-information framework that uses clinical concepts (such as MONET probabilities) to guide feature modulation and trains an image-only student model, substantially improving melanoma recall in cross-domain skin cancer screening.

Comments
Accepted by CVPR 2026

Abstract

Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clinical (target) images, hindering real-world deployment. Existing domain adaptation methods often ignore crucial semantic invariants, such as clinical concepts. While new foundation models like MONET can provide this semantic information as dense, probabilistic scores, this metadata is unavailable at test time, creating a deployment paradox for practical image-only screening tools. We address this gap by proposing CoFiDA-M, a privileged information framework that learns from concepts at training time but deploys as an image-only model. Our method trains a teacher network that uses MONET concept probabilities to guide a FiLM modulator, transforming visual features into a semantically "edited" feature space. A lightweight, image-only student is then trained to reproduce this edited representation, not just the teacher's final predictions. This distillation "bakes" the clinical reasoning into the student's weights. On a challenging multi-dataset benchmark, our image-only student significantly outperforms state-of-the-art approaches, especially in melanoma recall. Our work provides a practical and generalizable framework for leveraging noisy, probabilistic metadata as privileged information, demonstrating strong cross-dataset robustness and potential for real-world deployment beyond dermatology. Implementation code is available at: https://github.com/mmu-dermatology-research/CoFiDA.git
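
The FiLM-style modulation and feature-level distillation described in this abstract might look roughly like the sketch below. It is an assumption-laden illustration, not the released implementation: the linear maps producing the scale and shift, the mean-squared-error distillation loss, and the names `linear`, `film`, and `feature_distill_loss` are hypothetical choices made here.

```python
def linear(x, W, b):
    """y = x @ W + b for a single vector x; W is a list of rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def film(features, concepts, Wg, bg, Wb, bb):
    """FiLM: scale and shift each feature channel with concept-derived parameters."""
    gamma = linear(concepts, Wg, bg)   # per-channel scale from concept probabilities
    beta = linear(concepts, Wb, bb)    # per-channel shift from concept probabilities
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


def feature_distill_loss(student_features, edited_features):
    """Mean squared error between student features and the teacher's edited features."""
    n = len(edited_features)
    return sum((s - e) ** 2 for s, e in zip(student_features, edited_features)) / n
```

In this reading, the concept probabilities are needed only to compute the teacher's edited features during training; the student sees images alone and is penalized for deviating from the edited representation.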

2605.31590 2026-06-01 cs.CV cs.AI

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang, Xun Xiao, Thomas Seidl, Daniel Cremers, Volker Tresp

AI summary: To tackle multi-event generation in long videos, proposes TunerDiT, a method requiring no additional training that steers generation progressively through event-partitioned masking and cross-event prompt fusion, achieving state-of-the-art results on 8 metrics.

Comments
17 pages, 13 figures

Abstract

Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method that requires no additional training for multi-event generation. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking that enforces event boundaries while allowing cross-event transition bands; (2) Cross-Event Prompt Fusion that injects neighboring event semantics for late-stage refinement. We contribute a self-curated prompt suite for benchmarking multi-event generation, i.e., Meve. TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation, compared with other training-free methods. The improvement in text alignment increases with the event count, indicating a scaling possibility with increasing event count.

2605.31589 2026-06-01 cs.CV

Recognizing Co-Speech Gestures in-the-Wild

Sindhu B Hegde, K R Prajwal, Andrew Zisserman

AI summary: Because current multimodal models struggle to capture semantic co-speech gestures, builds GRW, the first large-scale benchmark dataset, for training video models on semantic gesture classification, corresponding-word recognition, and temporal localization.

Abstract

While humans naturally gesture during speech, only a sparse subset of these movements are visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated training data. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We leverage GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.

2605.31586 2026-06-01 cs.CL cs.AI

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler

AI summary: Builds a new dataset to study how open-source language models of different sizes understand the semantics of rare English Paired-Focus constructions (such as "let alone"), finding that modestly sized models grasp both form and meaning, that semantic learning emerges later than syntactic knowledge, and that it correlates with world knowledge.

Comments
Conference on Natural Language Learning (CoNLL) 2026

Abstract

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.

2605.31584 2026-06-01 cs.CL cs.AI cs.LG

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

AI summary: Proposes the LongTraceRL framework, which generates multi-hop questions via knowledge-graph random walks, builds tiered distractors from search agent trajectories, and applies an entity-chain-based rubric reward as process supervision to improve long-context reasoning in large language models.

Abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.

2605.31581 2026-06-01 cs.AI

Choosing the Lens: Strategic Perspective Activation in Context-Dependent Argumentation

Albert Sadowski, Jarosław A. Chudziak

AI summary: Proposes context-dependent argumentation frameworks (CDAFs) and, via a defeat function and a perspective-labeled specialisation, studies how an agent can strategically influence argumentation outcomes by choosing the relevance set and priority.

Comments
Accepted to LAMAS&SR workshop at FLoC 2026

Abstract

The same arguments often need to be evaluated under different external regimes. An agent with influence over the regime has a strategic lever that standard formalisms do not directly capture. We introduce context-dependent argumentation frameworks (CDAFs), an extension of Dung's theory in which a defeat function determines, per context, which attacks succeed. A perspective-labeled specialisation derives the defeat function from a relevance set $ρ$ and a priority $π$. The relevance set is the agent's action space. In a small worked example, the agent's target argument is rejected under every full-relevance injective priority, yet accepted under partial activations, one of which no VAF audience can mirror. We define the corresponding decision problem, ACTIVATION-MANIPULATION, and record baseline complexity bounds. Tight bounds and multi-agent variants are left open.
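
To make the idea of a per-context defeat function concrete, the sketch below computes Dung's grounded extension over whichever attacks succeed in a given context. It is a toy under stated assumptions: the dictionary mapping context names to successful attacks is a hypothetical stand-in for the paper's defeat function, and the derivation from a relevance set and a priority is not reproduced.

```python
def grounded_extension(args, defeats):
    """Least fixed point of Dung's characteristic function.

    args: set of argument names; defeats: set of (attacker, target) pairs
    that succeed in the context under consideration.
    """
    ext = set()
    while True:
        # An argument is acceptable if every defeater is itself defeated by ext.
        new = {a for a in args
               if all(any((d, b) in defeats for d in ext)
                      for (b, t) in defeats if t == a)}
        if new == ext:
            return ext
        ext = new


# Hypothetical two-context example: attack a -> b succeeds only in "full".
ARGS = {"a", "b", "c"}
DEFEATS_BY_CONTEXT = {
    "full": {("a", "b"), ("b", "c")},
    "partial": {("b", "c")},
}
```

In this toy, argument `c` is accepted in the `full` context because `a` defends it, and rejected in the `partial` context where the attack on `b` no longer succeeds, which shows how control over the context changes the outcome.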

2605.31580 2026-06-01 cs.LG

Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings

Utsav Dutta, Gerardo Pastrana, Sina Khoshfetrat Pakazad, Henrik Ohlsson

AI summary: Proposes CHARM, which combines channel-level text descriptions with a Transformer encoder and learns semantic time-series embeddings using a Joint Embedding Predictive Architecture (JEPA), achieving strong performance on anomaly detection, classification, and forecasting with only a linear probe.

Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML), PMLR 306, 2026
Comments
9 pages, 5 figures, accepted at ICML 2026. arXiv admin note: substantial text overlap with arXiv:2505.14543

Abstract

Transformer-based architectures have advanced sequence modeling in language and vision, yet general-purpose representation learning for heterogeneous multivariate time series remains underexplored. We introduce CHARM (Channel-Aware Representation Model), which incorporates channel-level textual descriptions into a Transformer encoder equivariant to channel order. CHARM is trained with a Joint Embedding Predictive Architecture (JEPA) and a novel loss promoting informative, temporally stable embeddings; latent-space prediction encourages robustness to sensor noise while description-aware gating provides interpretability through learned inter-channel relationships. Across anomaly detection, classification, and short- and long-term forecasting, the learned embeddings achieve strong performance using only a linear probe. Performance is driven primarily by the JEPA objective and conditioning architecture, with text descriptions serving as channel identifiers for cross-dataset generalization.

2605.31579 2026-06-01 eess.SP cs.IT math.IT math.ST stat.TH

Functional Multi-Target Detection via Bispectrum Inversion

Anna Little, Daniel Sanz-Alonso, Mikhail Sweeney, Ruiyi Yang

AI summary: For multi-target detection with unknown translations, proposes uninitialized recovery algorithms based on autocorrelation analysis that estimate the bispectrum from a debiased third-order empirical autocorrelation and recover the signal via frequency marching or a Kotlarski-type deconvolution formula, with proven non-asymptotic recovery guarantees.

Abstract

This paper develops a functional theory for multi-target detection, where a compactly supported signal is recovered from a single noisy observation containing many unknown translations of the signal. Our formulation allows continuous, off-grid translations and correlated stationary Gaussian process noise, extending beyond the discrete, grid-aligned, white-noise models common in prior work. We analyze two uninitialized recovery algorithms based on autocorrelation analysis; in particular, both algorithms first estimate the signal's bispectrum via a debiased third-order empirical autocorrelation. The signal is then recovered from the estimated bispectrum using either a functional frequency marching scheme or a Kotlarski-type deconvolution formula. For both algorithms, we prove non-asymptotic recovery guarantees for compactly supported signals without bandlimiting assumptions. The resulting error bounds depend on the smoothness of the signal and the accuracy of bispectrum estimation, with the latter governed by the noise characteristics and the number of signal occurrences. Numerical experiments validate our theory and demonstrate accurate recovery in low-SNR regimes.
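
The reason the bispectrum is useful here is that it is unchanged by translations of the signal. The sketch below checks this in the simplest setting, a discrete periodic signal under circular shifts. Note that this grid-aligned, noise-free toy is exactly the kind of model the paper extends beyond; the function names are illustrative only.

```python
import cmath


def dft(x):
    """Plain discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def bispectrum(x):
    """B[k1][k2] = F[k1] * F[k2] * conj(F[(k1 + k2) mod n])."""
    n, F = len(x), dft(x)
    return [[F[k1] * F[k2] * F[(k1 + k2) % n].conjugate() for k2 in range(n)]
            for k1 in range(n)]
```

A circular shift by s multiplies F[k] by a phase exp(-2πi·k·s/n); in the triple product the phases for k1, k2, and k1 + k2 cancel, so the bispectrum of a shifted copy equals that of the original.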

2605.31577 2026-06-01 cs.CV

SurGe: Improved Surface Geometry in Point Maps

Karim Knaebel, Gonzalo Martin Garcia, Christian Schmidt, Ilya Fradlin, Lucas Nunes, Daan de Geus, Bastian Leibe

AI summary: To address inaccurate local surface geometry in feed-forward 3D reconstruction, proposes a point map normal metric, a point gradient matching loss, and a Neighborhood Attention Decoder (NAD) to improve local surface orientation prediction, achieving the best average rank across multiple zero-shot monocular geometry benchmarks.

Comments
Project page at https://vision.rwth-aachen.de/surge

Abstract

Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.
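
One plausible way to obtain the local surface orientation induced by neighboring 3D predictions is to take finite differences of the point map and cross them, as sketched below. The forward-difference stencil, the sign convention, and the angular error are assumptions for illustration; the abstract does not specify the exact definition of the metric.

```python
import math


def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n > 0 else (0.0, 0.0, 0.0)


def point_map_normals(points):
    """points[r][c] is a 3D point; returns unit normals on the (H-1) x (W-1) grid."""
    H, W = len(points), len(points[0])
    return [[unit(cross(sub(points[r][c + 1], points[r][c]),    # horizontal difference
                        sub(points[r + 1][c], points[r][c])))   # vertical difference
             for c in range(W - 1)] for r in range(H - 1)]


def angle_deg(n1, n2):
    """Angle in degrees between two unit normals."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n1, n2))))
    return math.degrees(math.acos(dot))
```

Comparing such normals between predicted and ground-truth point maps penalizes local bumps and tilts that a global depth error such as AbsRel can average away.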

2605.31576 2026-06-01 cs.CV

Joint Multi-Camera LiDAR Extrinsic Calibration via Learned Pairwise Initialization and Geometric Refinement

Aziz Al-Najjar, Marzieh Amini, James R. Green, Felix Kwamena

AI summary: Proposes a two-stage framework that first uses CMRNext to estimate each camera's extrinsics and 2D-3D correspondences independently, then refines them jointly via bundle adjustment to obtain a globally consistent multi-camera calibration with markedly better accuracy and consistency.

Comments
Accepted to the CVPR 2026 Workshop URVI: Unified Robotic Vision with Cross-Modal Sensing and Alignment

Abstract

Most learning-based camera-LiDAR calibration methods treat each camera-LiDAR pair independently, ignoring the rigid geometric coupling in multi-camera platforms. As a result, per-camera estimates may be individually accurate yet inconsistent at the system level. We present a two-stage framework for joint multi-camera LiDAR extrinsic calibration that combines learned pairwise matching with geometric refinement. First, CMRNext is applied independently to each camera to produce initial extrinsic estimates and dense 2D-3D correspondences. These predictions are then jointly refined through a multi-frame bundle adjustment with reprojection, per-camera prior, and relative-pose prior terms. This approach converts pairwise predictions into a globally consistent multi-camera calibration. Experiments on KITTI (in-domain for CMRNext) and Walkley (out-of-domain) datasets show improved per-camera accuracy and inter-camera consistency. On KITTI, the method achieves 0.89 cm translation error and 0.038° rotation error. On Walkley, it reduces translation error from 108.6 cm to 3.1 cm, highlighting the benefit of explicit multi-camera coupling when single-camera predictions are less reliable.

2605.31575 2026-06-01 cs.IR cs.AI

SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics

Eric Liang

AI summary: Proposes SPECTRA, a framework that generates synthetic text corpora and retrieval test collections by separating latent topical structure, text realization, metadata controls, query intent generation, and deterministic relevance oracles, in order to diagnose the scaling behavior and failure modes of retrieval systems.

Abstract

Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, and evaluation tooling, yet human-judged test collections remain expensive and may be unavailable when documents are private or still under design. This paper introduces SPECTRA, a reproducible framework for generating synthetic text corpora and retrieval test collections through a separation of latent topical structure, surface text realization, metadata controls, query intent generation, and deterministic relevance oracles. The framework is intended as a diagnostic complement to Cranfield-style and TREC-style evaluation, not as a replacement for human assessment. A single-process Python prototype generated corpora up to 60,000 documents and 9.61 million tokens while preserving controllable long-tail vocabulary growth and producing graded relevance labels for 96 queries. In the local simulation study, generation remained close to linear at roughly 12K to 14K documents per second, estimated Zipf slopes stayed near 0.86 in absolute value, and increasing cross-topic distractor text reduced BM25 nDCG@10 from 1.00 at 2% distractors to 0.43 at 36% distractors. These results show that lightweight synthetic corpora can expose retrieval-system scaling and failure modes before costly collection construction begins.
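The nDCG@10 figures quoted above can be made concrete with a minimal sketch. It assumes the common linear-gain form of DCG and assumes that the ranked list contains every judged document for the query; the abstract does not say which nDCG variant SPECTRA uses, and the function names here are illustrative only.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    # rank is 0-based, so the discount is log2(rank + 2): 1 for the top result.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=10):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.

    Assumes ranked_gains lists the graded label of every judged document,
    in the order the retrieval system returned them.
    """
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

Under these assumptions a perfectly ordered list scores 1.0, and pushing a relevant document below a distractor lowers the score, which is the effect the distractor experiment measures.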

2605.31574 2026-06-01 cs.HC

Can Generative AI help people navigate Radical Moral Disagreements? The CONSIDER prototype

生成式AI能否帮助人们应对根本性道德分歧?CONSIDER原型

William Hohnen-Ford, Sarah Chen, Kathryn B. Francis, Madeline G. Reinecke, Ilina Singh, David Lyreskog

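AI summary: Presents the CONSIDER prototype, a one-to-one AI tool built on large language models that helps users clarify their values through structured disagreement, as a way to navigate radical moral disagreements.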

详情
Comments
25 pages, 1 figure, 2 tables. Submitted manuscript
AI中文摘要

根本性道德分歧(RMDs)是高度两极分化的话题,在日常生活中日益受到审查,越来越多的证据表明这种两极分化对公众心理健康造成了可衡量的成本。为了应对这些挑战,一些研究者提出使用大型语言模型(LLMs)来支持更民主的审议和更好的道德推理。然而,现有工具由于RMDs的激烈和分裂特性,难以帮助人们有效应对。本文介绍了CONSIDER,一种用于RMD导航的一对一AI工具原型。借鉴密尔关于分歧的认识论价值的观点,CONSIDER旨在通过与对立的LLM生成观点进行结构化分歧来澄清价值观。我们描述了CONSIDER的设计逻辑,并分析了此类工具可能带来的潜在风险,以指导未来的发展。

英文摘要

Radical Moral Disagreements (RMDs) are highly polarising topics that are increasingly censored in everyday life, with growing evidence suggesting that this polarisation carries measurable costs to public mental health. To address these challenges, some researchers have proposed Large Language Models (LLMs) as a means to support more democratic deliberation and better moral reasoning. Yet existing tools are poorly calibrated to help people navigate RMDs, because of their intense and divisive characteristics. This paper introduces CONSIDER, a prototype for a one-to-one AI tool for RMD navigation. Drawing on Mill's account of the epistemic value of disagreement, CONSIDER aims at value clarification through structured disagreement with an opposing LLM-generated opinion. We describe CONSIDER's design logic and analyse potential risks posed by such tools to guide future development.

2605.31572 2026-06-01 cs.CV

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

nuReasoning:面向长尾自动驾驶的推理中心数据集与基准

Zhiyu Huang, Johnson Liu, Rui Song, Zewei Zhou, Ruining Yang, Yun Zhang, Tianhui Cai, Hanyin Zhang, Mingxuan Gao, Valeria Xu, Jiali Chen, Yishan Shen, Yiluan Guo, Tony, Qi, Jiaqi Ma

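AI summary: Introduces the nuReasoning dataset, which provides reasoning annotations for 20,000 long-tail driving clips of 20 seconds each, supports evaluation of spatial, decision, and counterfactual reasoning, and shows that reasoning supervision improves the driving performance of VLMs and VLAs.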

详情
AI中文摘要

推理对于自动驾驶在长尾场景中至关重要,车辆必须运用常识知识、理解空间关系、推断智能体交互并做出安全决策。然而,现有的自动驾驶数据集和基准主要针对感知、预测或规划,对现实长尾驾驶场景的推理监督有限。我们提出nuReasoning,一个面向推理中心自动驾驶的大规模真实世界数据集和基准。沿袭nuScenes和nuPlan的体系,nuReasoning将真实世界自动驾驶数据集和基准推向长尾驾驶场景中的推理。该数据集包含2万个片段,每个片段长20秒,采集自多个城市,具有同步的多摄像头图像、LiDAR数据、高清地图、物体标注以及人工验证的推理标注,涵盖空间推理、决策推理和反事实推理。与先前主要关注视觉问答的数据集不同,nuReasoning同时支持推理评估和规划评估,能够直接研究推理监督如何影响驾驶性能。实验表明,在nuReasoning上微调VLM可显著提升驾驶特定问答的性能,而将推理监督纳入VLA训练中,即使在推理时禁用文本推理输出,也能改善规划性能。这些结果确立了nuReasoning作为在现实长尾场景中评估和改进鲁棒、可解释、推理驱动的自动驾驶系统的基础。

英文摘要

Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.

2605.31564 2026-06-01 cs.CL cs.AI

What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation

什么先被揭开?面向图到文本生成的扩散模型轨迹分析

Qing Wang, Jacob Devasier, Chengkai Li

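AI summary: Presents the first systematic study of the decoding trajectories of masked diffusion language models in graph-to-text generation, finding that they generate entities first. To address the fixed output length caused by supervised fine-tuning, it proposes lambda-scaled structural decoding, a training-free inference-time modification that recovers +9.4 BLEU-4, and introduces Graph-LLaDA to explicitly incorporate relational graph structure.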

详情
AI中文摘要

我们首次系统研究了掩码扩散语言模型(MDLM)在图到文本生成中的应用。我们分析了MDLM的生成轨迹——即迭代解码过程中令牌被去掩码的顺序——发现与自回归LLM线性生成文本不同,MDLM自然优先处理实体,然后是关系词和功能词,结构令牌最后解决。我们进一步发现了一个先前未记录的监督微调失败模式:SFT通过过早地将结构性的句子结束令牌锚定在解码轨迹早期,破坏了这一策略,从而有效固定了输出长度,这可能导致信息遗漏或幻觉。为了解决这个问题,我们提出了λ缩放结构解码,一种无训练的推理时修改方法,降低结构令牌的置信度,并恢复了+9.4 BLEU-4。最后,我们引入了Graph-LLaDA,它将图Transformer编码器集成到LLaDA的解码过程中,以显式融入关系图结构。在LAGRANGE上的跨数据集评估表明,先前的基线过拟合于特定数据集模式,而基于LLM和MDLM的方法泛化能力显著更好。

英文摘要

We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs, which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length, which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.
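The idea of downweighting structural token confidence can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a confidence-based unmasking step in which the most confident masked positions are revealed first, and it assumes the scaling is a plain multiplication by a factor `lam`. The function name, the `candidates` layout, and the choice of structural tokens are all hypothetical.

```python
def pick_positions_to_unmask(candidates, structural_tokens, lam=0.5, n=1):
    """Choose which masked positions to reveal in one decoding step.

    candidates: dict mapping a still-masked position to (token, confidence).
    structural_tokens: set of tokens treated as structural (e.g. ".", "<eos>").
    lam: assumed scaling factor; values below 1 delay structural tokens.
    """
    def scaled_confidence(item):
        _, (token, confidence) = item
        # Hypothetical rule: shrink the confidence of structural tokens only.
        return confidence * lam if token in structural_tokens else confidence

    ranked = sorted(candidates.items(), key=scaled_confidence, reverse=True)
    return [position for position, _ in ranked[:n]]
```

With `lam=1.0` the step reduces to ordinary confidence ranking, so a confident sentence-ending token can be fixed early; with a smaller `lam` an entity token with lower raw confidence is revealed first, which matches the entity-first ordering the abstract describes.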

2605.31563 2026-06-01 cs.CL

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

分歧理由:重新思考仇恨言论检测中的分类与可解释性评估

Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti

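AI summary: Re-implements a range of models, training strategies, and evaluation metrics in a unified framework, analyzes classification and explainability metrics across label and rationale representation spaces, and finds that soft representations capture disagreement better, motivating a rethink of evaluation for subjective NLP tasks.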

详情
Comments
16 pages
AI中文摘要

人类分歧在标注中普遍存在且众所周知。然而,通过词级人类理由捕捉的解释变异仍远未得到充分探索。同时,鉴于这种变异,如何最好地评估人类标签和理由——甚至如何超越多数投票聚合理由——尚不明确。然而,理由可能提供对人类推理丰富性的额外见解,这些推理在风格、价值观和解释上可能有所不同——尤其是在像仇恨言论检测这样的主观NLP任务中。在本工作中,我们通过在不同标签和理由表示空间下系统地重新实现它们,将多样化的模型、训练策略、损失函数和现有评估指标统一到一个协议下。分类指标围绕两个关键属性组织——预测性和分布性——而可解释性指标则通过三个互补维度:合理性、忠实性和复杂性。在这个统一的监督框架中,我们评估模型在分类和可解释性指标上的行为,以及指标对标签(硬和软)和理由表示空间(硬、中间和软)选择的敏感性。结果表明,硬指标和软指标都更倾向于软表示,突显了它们在捕捉变异方面的有效性,以及重新思考主观NLP中评估的必要性。

英文摘要

Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how to best evaluate human labels and rationales -- or even how to best aggregate rationales beyond majority vote -- in light of this variation. Yet rationales may provide additional insights into the richness of human reasoning, which may differ in style, values and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics are organized along three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.

2605.31562 2026-06-01 cs.LG

Effective Biological Representation Learning by Masking Gene Expression

通过掩码基因表达实现有效的生物表示学习

Kian Kenyon-Dean, Alina Selega, Ihab Bendidi, Jordan M. Sorokin, Luca Bertinetto, David Errington, Hayley Donnella, Oren Kraus

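AI summary: Proposes TxFM, a self-supervised model that applies masked autoencoding to RNA-seq data, identifies the key architecture choices through ablation studies, and, trained on the curated DiverseRNA-1.4M dataset, yields gene representations that outperform large-scale foundation models.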

详情
Comments
31 pages, 11 figures. Preprint; presented at ICLR 2026 2nd Workshop on Foundation Models for Science: Real-World Impact and Science-First Design
AI中文摘要

RNA测序产生丰富多样的基因表达数据集,为细胞状态和功能提供了引人注目的见解,在药物发现中有许多应用。由于固有的技术噪声和实验批次效应,对此类数据进行建模具有挑战性,许多现有的转录组基础模型(FMs)表现不如线性基线。这些结果提出了一个问题:深度表示学习是否比直接使用原始转录计数具有明显优势。我们的工作通过开发一种新的自监督模型TxFM来探索这一点,重点关注归纳表示学习评估。TxFM采用了一种针对多样化RNA-seq计数数据定制的掩码自编码方法,我们的消融研究通过实验确定了强迁移性能所需的关键架构配置。此外,我们策划了一个公共训练语料库DiverseRNA-1.4M,并发现,在此策划数据集上训练的TxFM产生了高保真度的基因表示,其性能优于在规模大100倍以上的图谱级语料库上训练的FMs。总体而言,我们的结果表明,只要精心综合模型架构和训练数据策划,归纳自监督学习是转录组表示的一种可行建模方法。

英文摘要

RNA sequencing produces rich and diverse datasets of gene expression, offering compelling insights into cellular state and function that have many applications in drug discovery. Modeling such data is challenging due to inherent technical noise and experimental batch effects, as evidenced by many existing transcriptomic foundation models (FMs) underperforming relative to linear baselines. Such results raise the question of whether deep representation learning provides a distinct advantage over the direct use of raw transcript counts. Our work explores this by developing a new self-supervised model, TxFM, with a focus on inductive representation learning evaluations. TxFM employs a masked autoencoding approach tailored to diverse RNA-seq count data, and our ablation study empirically identifies crucial architecture configurations required for strong transfer performance. Additionally, we curate a public training corpus, DiverseRNA-1.4M, and find that TxFM trained on this curated dataset yields high-fidelity gene representations that outperform FMs trained on atlas-scale corpora over 100x larger. Overall, our results indicate that inductive self-supervised learning is a viable modeling approach for transcriptomics representation, provided a careful synthesis of model architecture and training data curation.

2605.31561 2026-06-01 cs.CL

What Am I Missing? Question-Answering as Hidden State Probing

我遗漏了什么?将问答作为隐藏状态探测

Chu Fei Luo, Samuel Dahan, Xiaodan Zhu

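AI summary: Proposes question-asking as an inference-time intervention, probes hidden states in a student-teacher framework, finds that the hidden state before a question is asked predicts final correctness, and designs a gating policy to decide when to ask.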

详情
AI中文摘要

自链式推理引入大型语言模型(LLMs)以来,测试时推理已成为一个重要的研究领域。然而,这一推理过程的机制仍未被充分探索——从相同的输入提示,甚至相同的部分解出发,LLMs在多次采样时可能产生不同的答案。我们提出利用提问作为推理时干预手段,以表达模型隐藏状态的信息。为此,我们提出了一个学生-教师设置,其中学生向教师提问。我们在学生提问前后训练一个探测其隐藏状态的探针,发现该探针能预测轨迹的最终正确性,甚至在生成教师答案之前。这表明,在问题生成过程中存在有意义的自我诊断信号,而非来自教师的信息传递。然后,我们将提问建模为序列决策问题,使用该探针作为质量分数,并定义一个门控策略来提问以最大化正确可能性。我们发现,提问作为干预的成功在很大程度上取决于模型的自我一致性。我们的实证结果显示检测与恢复之间存在差距;虽然我们的门控策略捕捉了模型的正确性和不确定性,但干预在恢复错误轨迹的同时,同样可能损害正确轨迹。这种诊断与纠正之间的差距对语言模型在不确定性下进行自我精炼的能力具有更广泛的影响。

英文摘要

Test-time reasoning has become a significant field of study since the introduction of chain-of-thought reasoning in large language models (LLMs). However, the mechanisms of this reasoning process are still under-explored -- from the same input prompt, and even the same partial solution, LLMs can produce varied answers if sampled multiple times. We propose to leverage question-asking as an inference-time intervention that articulates information about the model's hidden state. To achieve that, we present a student-teacher setting where a student asks a teacher questions. We train a probe on the student's hidden state before and after asking a question and find it is predictive of the trajectory's final correctness, even before generating the teacher's answer. This suggests there is a meaningful signal from the self-diagnosis that occurs during question generation rather than information transfer from the teacher. We then frame question-asking as a sequential decision problem, using this probe as a quality score, and define a gating policy to ask questions that maximize likelihood of correctness. We find that the success of question-asking as an intervention is largely dependent on the model's self-consistency. Our empirical results show a gap between detection and recovery; while our gating policy captures model correctness and uncertainty, interventions are equally likely to harm correct trajectories as they are to recover incorrect ones. This gap between diagnosis and correction has broader implications for language models' capacity for self-refinement under uncertainty.

2605.31560 2026-06-01 cs.CE cond-mat.mtrl-sci physics.app-ph physics.chem-ph

Can dents and gouges compromise the structural integrity of hydrogen transport pipelines?

凹痕和沟槽会危及氢气输送管道的结构完整性吗?

R. Das, B. Bezensek, E. Martínez-Pañeda

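AI summary: Combining experiments with a hydrogen embrittlement model, the study finds that hydrogen does not significantly increase the damage severity of dent and gouge defects, except under specific conditions in which hydrogen egress is completely precluded.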

详情
AI中文摘要

将天然气管道改造用于氢气输送需要了解外部缺陷(如凹痕和沟槽)在氢气暴露下如何影响结构完整性。为此,我们将实验与一种针对大塑性应变场景的新型氢脆模型相结合,该模型整合了:(i) 多陷阱氢传输,(ii) 有限应变塑性,以及 (iii) 依赖于氢和三轴度的损伤定律。模型的每个组成部分均通过X65管道钢的实验验证:(i) 氢渗透,(ii) 全尺寸管道压痕,以及 (iii) 在不同氢和三轴度水平下的力学测试。验证后的模型用于研究被动(在氢气暴露前压痕)和主动(在氢气存在下压痕)凹痕和沟槽。结果表明,氢气不会显著增加这些缺陷的损伤严重性,除非在内部加压且存在预先存在的被动凹痕和沟槽的管道外表面,氢的逸出被完全阻止。

英文摘要

Repurposing natural gas pipelines for hydrogen transport requires understanding how external defects, like dents and gouges, affect structural integrity under H$_2$ exposure. To address this, we combine experiments with a new hydrogen embrittlement model aimed at large plastic straining scenarios, which integrates: (i) multi-trap hydrogen transport, (ii) finite-strain plasticity, and (iii) a hydrogen- and triaxiality-dependent damage law. Each constituent of the model is validated with experiments on X65 pipeline steel: (i) hydrogen permeation, (ii) full-scale pipe-indentation, and (iii) mechanical testing at different hydrogen and triaxiality levels. The validated model is used to study \textit{passive} (indent before H$_2$ exposure) and \textit{active} (indent with H$_2$) dents and gouges. Results reveal that hydrogen does not significantly increase the damage severity of those defects, unless hydrogen egress is completely precluded at the outer surface of a pipeline that is being pressurised internally and contains a pre-existing \textit{passive} dent with a gouge.

2605.31559 2026-06-01 cs.LG

Functional Attention: From Pairwise Affinities to Functional Correspondences

函数注意力:从成对亲和性到函数对应

Jiefang Xiao, Maolin Gao, Simon Weber, Guandao Yang, Daniel Cremers

AI总结 提出函数注意力机制,将注意力重新解释为自适应基之间的函数对应,通过结构化线性算子替代softmax亲和性,实现紧凑、可泛化、分辨率不变的全局依赖表示,在PDE求解、3D分割和回归等算子学习任务中达到最先进性能。

详情
Comments
26 pages, 12 figures. Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

学习无限维函数空间之间的映射,即算子学习,对于许多机器学习应用至关重要。尽管基于Transformer的算子很流行,但它们通常依赖于token-wise注意力。这些方法将连续场视为离散token,通常忽略全局函数结构。我们引入了\emph{函数注意力},它将注意力重新解释为自适应基之间的函数对应。受几何函数映射的启发,我们的方法用结构化线性算子替换softmax亲和性。这产生了一个紧凑、可泛化、分辨率不变的表示,显式捕获全局依赖关系。实验表明,\emph{函数注意力}可以在许多算子学习任务中达到最先进的性能,包括求解PDE、3D分割和回归,同时保持对不同离散化的鲁棒性。项目页面可在https://github.com/xjffff/FUNCATTN获取。

英文摘要

Learning mappings between infinite-dimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer-based operators are popular, they often rely on token-wise attention. These methods treat continuous fields as discrete tokens and usually ignore the global functional structure. We introduce \emph{Functional Attention}, which reinterprets attention as a functional correspondence between adaptive bases. Inspired by geometric functional maps, our method replaces softmax affinities with structured linear operators. This yields a compact, generalizable, resolution-invariant representation that explicitly captures global dependencies. Experiments demonstrate that \emph{Functional Attention} can match state-of-the-art performance in many operator learning tasks, including solving PDEs, 3D segmentation, and regression, while remaining robust to varying discretizations. Project page is available at https://github.com/xjffff/FUNCATTN.

2605.31558 2026-06-01 cs.LG cs.AI

Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

位置注意力头与符号注意力头:学习动态、RoPE几何和长度泛化

Felipe Urrutia, Juan José Alegría, Cinthia Sanchez Macias, Jorge Salas, Cristian B. Calderon, Cristobal Rojas

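AI summary: Uses controlled experiments to study the learning dynamics of Transformer attention heads on positional and symbolic reasoning tasks, identifying the distinct mechanisms of positional and symbolic heads and their effect on length generalization.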

详情
AI中文摘要

基于Transformer的语言模型在当今社会广泛应用。因此,理解它们解决结构化任务的机制以及预测它们在新型场景中的行为对于安全部署至关重要。我们通过在两个结构等价的多跳推理任务上训练仅解码器Transformer(GPT-J)来研究注意力头的学习动态:一个需要位置推理的数字任务和一个需要符号推理的字母任务。利用最近引入的度量标准,该标准将注意力头的行为分类为给定提示下的位置性或符号性,我们表明成功学习与纯头(即表达为位置性或符号性的头)的出现相关。尽管任务结构等价,但它们施加了不同的机制需求:数字任务需要位置头和符号头,而字母任务仅需要符号头。然后,我们识别这些头的计算角色,描述它们实现的基本功能,并给出理论构造,展示单层基于RoPE的注意力如何通过几何可解释的查询、键和值操作实现这些功能。该分析通过一种新的差异概念形式化,在位置和符号机制对更长序列的鲁棒性上产生了定量分离。我们在受控模型和真实世界模型中经验验证了由此产生的预测,表明符号机制更可靠地外推到更长序列,而位置机制面临更严格的限制。

英文摘要

Transformer-based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured tasks and predicting how they may behave in novel scenarios is of great importance for safe deployment. We study the learning dynamics of attention heads in a controlled setting by training a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks: a number task requiring positional reasoning and a letter task requiring symbolic reasoning. Using a recently introduced metric that classifies attention-head behavior as positional or symbolic for a given prompt, we show that successful learning is associated with the emergence of pure heads, i.e., heads that express themselves as either positional or symbolic. Despite the tasks' structural equivalence, they impose different mechanistic demands: the number task requires both positional and symbolic heads, whereas the letter task requires only symbolic heads. We then identify the computational roles of these heads, characterize the basic functions they implement, and give theoretical constructions showing how single-layer RoPE-based attention can realize these functions through geometrically interpretable query, key, and value operations. This analysis yields a quantitative separation between positional and symbolic mechanisms in their robustness to longer sequences, formalized through a novel notion of discrepancy. We empirically validate the resulting predictions in both controlled and real-world models, showing that symbolic mechanisms extrapolate more reliably to longer sequences while positional mechanisms face sharper limitations.
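As background for the RoPE geometry mentioned above, the following sketch checks the standard property that a RoPE attention logit depends only on the relative offset between query and key positions. It is a generic illustration of rotary embeddings, not the paper's construction; vectors are stored as complex pairs for brevity, and the function names and the `base` default are assumptions.

```python
import cmath

def rope_rotate(pairs, position, base=10000.0):
    """Apply rotary position embedding to a vector stored as complex pairs.

    Each complex number holds two consecutive real coordinates; pair j is
    rotated by the angle position * base ** (-j / number_of_pairs).
    """
    d = len(pairs)
    return [z * cmath.exp(1j * position * base ** (-j / d))
            for j, z in enumerate(pairs)]

def attention_logit(q, k):
    """Real inner product of two vectors stored as complex pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))
```

Rotating a query to position 3 and a key to position 1 gives the same logit as positions 12 and 10, since both pairs are two steps apart. A positional head can exploit this offset dependence, whereas a symbolic head has to match on content.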

2605.31556 2026-06-01 cs.CV cs.AI cs.CL cs.CY cs.HC

Vision-Language Models Suppress Female Representations Under Ambiguous Input

视觉-语言模型在模糊输入下抑制女性表征

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

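AI summary: Introducing the zero-shot metric LALS, the study finds that under ambiguous input the internal encodings of vision-language models are systematically decoupled from their outputs, with female signal suppressed before generation, revealing how the models handle gender bias internally.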

详情
Comments
16 pages, 12 figures, 1 table
AI中文摘要

对齐训练使视觉-语言模型(VLM)避免表达人口统计偏见,当性别清晰可见时,它们基本成功。但对于模糊输入(如全副武装的工人、从背后看到的人物)——实践中常见但很少研究的情况——我们发现,在模糊输入图像时,最小的提示压力就会暴露职业-性别默认值,模型甚至对强烈女性刻板印象的职业也倾向于男性。但这些输出是否反映了模型实际内部编码的内容?我们引入LALS(潜在关联倾向分数),一种零样本度量,将视觉标记激活投影到模型的文本嵌入空间中,以测量每个标记和层的概念关联。在15个职业、超过800张性别模糊图像和四个VLM上,内部表征和输出系统性地解耦:模型通常内部编码女性关联但输出男性。逐层分析揭示了一个不对称滤波器——男性信号端到端放大,而女性信号在中间网络达到峰值并在生成前被抑制——颜色消融实验表明,文化负载的视觉线索(如服装颜色)进一步调节这些内部关联。

英文摘要

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind), cases that are common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when models are prompted with ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

2605.31555 2026-06-01 cs.DL cs.IR cs.SI

Effects of Vertex Merging & Splitting on Large Coauthorship Networks: A Counterfactual Analysis

顶点合并与分裂对大型合著网络的影响:一个反事实分析

Jinseok Kim

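AI summary: Using a counterfactual analysis, the study examines how vertex merging and splitting errors caused by author name ambiguity affect coauthorship network measures, and finds that initial-based disambiguation underestimates network size and overestimates how closely connected the network is.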

详情
Journal ref
ComplexNetworks 2025 (pp. 64-75)
Comments
12 pages, 3 figures, 2 tables, ComplexNetworks2025
AI中文摘要

研究人员分析合著网络,但网络数据中的作者姓名歧义仍然是一个重大挑战,因为它可能改变顶点数量,扭曲网络属性。尽管许多学者使用简单的启发式方法(如作者名字首字母)进行作者姓名消歧,但这些技术可能通过合并或分裂顶点而扭曲我们对网络属性的理解,引发对这些方法可靠性和有效性的担忧。本研究利用三个具有高精度算法作者姓名消歧的大型合著网络,调查了由姓名歧义引起的不同程度顶点合并和分裂错误如何影响网络度量。作为反事实场景,将合著网络研究中广泛使用的两种基于首字母的消歧方法应用于这些数据集。在随机改变合并或分裂顶点数量的同时,计算了九个合著网络度量。结果表明,基于首字母的消歧生成的合著网络低估了特定网络属性,导致发现的合著网络比实际更小且连接更紧密。相反,其他网络度量值增加,使得作者看起来比实际更具合作性,并嵌入在更少碎片化的研究社区中。该研究强调了在分析合著网络时仔细消歧顶点名称对于获得严谨有效结果的重要性。

英文摘要

Researchers analyze coauthorship networks, but author name ambiguity in their network data remains a significant challenge as it can change the number of vertices, distorting network properties. Although many scholars use straightforward heuristics for author name disambiguation, such as authors' forename initials, these techniques can skew our understanding of network properties by merging or splitting vertices, raising concerns about the reliability and validity of these methods. This study investigates how different levels of vertex merging and splitting errors that are induced by name ambiguity impact network measures, using three large coauthorship networks with highly accurate algorithmic author name disambiguation. As a counterfactual scenario, two initial-based disambiguation methods widely used in coauthorship network research were applied to these datasets. Nine coauthorship network metrics were computed while randomly varying the numbers of merged or split vertices. Results show that initial-based disambiguation generates coauthorship networks with specific network properties underestimated, leading to the discovery of coauthorship networks that are smaller and more closely connected than they genuinely are. In contrast, other network metric values increase, making authors appear more collaborative and embedded within less fragmented research communities than they are. The study emphasizes the importance of careful disambiguation of vertex names in analyzing coauthorship networks for rigorous and valid findings.
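The vertex-merging effect can be seen in a toy sketch. It assumes names arrive as "Surname, Forename" strings and that the full strings are already correctly disambiguated; the first-initial key below is one plausible reading of an initial-based method, not necessarily either of the two used in the study, and the author names are invented.

```python
def first_initial_key(name):
    """Reduce 'Surname, Forename' to 'surname, f' (first forename initial)."""
    surname, _, forename = name.partition(",")
    return f"{surname.strip().lower()}, {forename.strip()[:1].lower()}"

def build_vertices(papers, key=lambda name: name):
    """Return the set of author vertices under a given name-matching key."""
    return {key(author) for authors in papers for author in authors}

# Two invented papers; 'Kim, Jihoon' and 'Kim, Jaeho' are different people.
papers = [
    ["Kim, Jihoon", "Lee, Anna"],
    ["Kim, Jaeho", "Lee, Anna"],
]
true_vertices = build_vertices(papers)
merged_vertices = build_vertices(papers, key=first_initial_key)
```

Here the correctly disambiguated network has three vertices, while the first-initial key merges the two Kims into one and leaves two. The network therefore looks smaller, and the merged vertex appears to link both papers, which is the kind of distortion the study quantifies.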

2605.31552 2026-06-01 math.NA cs.NA

Spectral coarse spaces based on indefinite operators: the $H_k$-GenEO method

基于不定算子的谱粗空间:$H_k$-GenEO方法

Théophile Chaumont-Frelet, Victorita Dolean, Mark Fry, Ivan G. Graham, Matthias Langer

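AI summary: For highly indefinite global PDE problems, proposes $H_k$-GenEO, a spectral coarse space built from eigenvalue problems on local copies of the global problem, which is more robust as the parameter $k$ grows than methods based on positive semi-definite local eigenproblems.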

详情
AI中文摘要

GenEO('重叠区域上的广义特征值问题')是一种用于离散PDE迭代求解器预处理的粗空间构造方法。该方法结合局部PDE特征值问题的少量模态来获得全局粗空间。然后将粗求解与全局PDE的局部求解相结合以获得预处理器。对于局部特征值问题为半正定的情况,已经发展了大量的GenEO理论。这主要应用于正定全局PDE,但最近也扩展到对流-扩散-反应问题,这些问题可能既非自伴也非正定。然而,当全局问题高度不定时,基于半正定局部特征值问题的粗空间在实践中缺乏鲁棒性。本文考虑高度不定的全局PDE问题,其特点是大参数$k$(允许高度可变系数),并基于求解基于\textit{全局问题的局部副本}的特征值问题,开发了一种新的谱粗空间。我们对局部区域的直径没有约束,因此允许局部特征值问题是不定的。新方法(称为$H_k$-GenEO)随着$k$的增加,比基于半正定特征值问题的方法更加鲁棒。我们提供了预处理GMRES迭代方法鲁棒性的充分条件,这些条件涉及局部特征值问题的容差和局部PDE求解的子域大小。在实践中,观察到该方法在更弱的局部特征值问题容差条件下对$k$具有鲁棒性。实验还表明该方法能够抵抗PDE系数的高度变化。

英文摘要

GenEO ('Generalised Eigenvalue problems on the Overlap') is a method for constructing coarse spaces used in the preconditioning of iterative solvers for discrete PDEs. This method combines a (small) number of modes of local PDE eigenproblems to obtain a global coarse space. A coarse solve is then combined with local solves of the global PDE to obtain the preconditioner. A substantial theory for GenEO has been developed for the case when the local eigenproblems are positive semi-definite. This has been applied mostly to positive definite global PDEs, but also recently extended to the case of convection--diffusion--reaction problems, which may be neither self-adjoint nor positive definite. However, when the global problem is highly indefinite, coarse spaces built from positive semi-definite local eigenproblems fail to be robust in practice. In this paper we consider highly indefinite global PDE problems, characterised by a large parameter $k$ (allowing also highly variable coefficients), and we develop a new spectral coarse space built from solving eigenvalue problems based on \textit{local copies of the global problem}. We put no constraint on the diameters of the local domains, thus allowing the local eigenvalue problems to be indefinite. The new method (which we call $H_k$-GenEO) is seen to be much more robust as $k$ increases than methods based on positive semi-definite eigenproblems. We provide sufficient conditions for robustness of the preconditioned GMRES iterative method, in terms of the tolerance of the local eigenproblems and the size of the subdomains for the local PDE solves. In practice the method is observed to be robust with respect to $k$ under even weaker conditions on the local eigenproblem tolerance. The experiments also suggest the method can be resilient to high variation in PDE coefficients.

2605.31551 2026-06-01 cs.CV

SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation

SMART: SMPLest-X 网格自适应与 RAFT 跟踪用于足球姿态估计

Parthsarthi Rawat

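AI summary: Proposes SMART, which fine-tunes the SMPLest-X model and combines it with RAFT optical-flow camera tracking and foot-plane anchoring, substantially reducing 3D pose estimation error in the FIFA Skeletal Tracking Challenge.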

详情
Comments
CVPR 2026 SoccerNet FIFA Skeleton Tracking Light Challenge, Rank 6
AI中文摘要

我们介绍了参加 2026 年 FIFA 骨骼跟踪挑战赛的方法,该挑战要求从广播视频中估计足球运动员的 3D 世界空间姿态。我们的方法通过分层片段分割、多任务深度监督和广播增强对 SMPLest-X(ViT-H,687 M 参数)进行微调,并结合 RAFT 密集光流相机跟踪器、足部平面锚定和两遍时间平滑。在验证集上,SMART 相对于 FIFA 基线得分 1.053 取得了 0.647 的成绩,提升了 38.6%;在保留的测试集上,SMART 得分为 0.593(全局 MPJPE:0.324 m,局部 MPJPE:0.054 m)。

英文摘要

We present our approach to the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. Our method finetunes SMPLest-X (ViT-H, 687 M parameters) via a stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. Against the FIFA baseline score of 1.053 on the validation set, SMART achieves 0.647, a 38.6% improvement; on the held-out test set, SMART scores 0.593 (Global MPJPE: 0.324 m, Local MPJPE: 0.054 m).
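The reported 38.6% figure can be reproduced from the two validation scores, assuming the challenge score is an error-style quantity where lower is better. The sketch also includes the usual definition of MPJPE (mean per-joint position error) as the mean Euclidean distance between predicted and ground-truth joints; how the challenge separates global from local MPJPE is not specified in the abstract, so only the generic form is shown. Function names are illustrative.

```python
import math

def relative_improvement(baseline, score):
    """Fractional reduction of a lower-is-better score relative to a baseline."""
    return (baseline - score) / baseline

def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance over joints.

    Both arguments are equal-length sequences of (x, y, z) joint positions,
    in the same units (metres in the abstract's reported figures).
    """
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Validation scores quoted in the abstract: baseline 1.053, SMART 0.647.
improvement_percent = round(relative_improvement(1.053, 0.647) * 100, 1)
```

The computation gives (1.053 - 0.647) / 1.053 ≈ 0.3856, which rounds to the 38.6% stated in the abstract.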