arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31603 2026-06-01 cs.CV cs.AI

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu

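AI summary: Proposes the Lumos-Nexus framework, which uses a two-stage design and progressive frequency bridging to markedly improve video generation fidelity while preserving reasoning capability.
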
Comments
Project page (https://jiazheng-xing.github.io/nexus-lumos-home/) and Code (https://github.com/alibaba-damo-academy/Lumos-Custom/) are available
Abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

2605.31598 2026-06-01 cs.CV

Linear Scaling Video VLMs for Long Video Understanding

Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles

AI summary: Proposes StateKV, which achieves linear-time video prefill through a fixed-capacity, importance-driven recurrent state, substantially reducing compute while staying close to full self-attention performance.

Abstract

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
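
The idea of a fixed-capacity, importance-based recurrent state can be illustrated with a small sketch. This is not StateKV's implementation: the abstract does not say how importance is scored, so the per-token scores, the function name `update_state`, and the capacity value below are all assumptions made for illustration. The sketch only shows why the cost per frame stays bounded: each new frame is merged into a state that never holds more than `capacity` entries.

```python
import heapq

def update_state(state, frame_tokens, capacity):
    """Merge one frame's (score, token) pairs into the state, keeping the top `capacity`.

    The importance scores are hypothetical inputs; a real system would derive them
    from the model (for example from attention statistics).
    """
    merged = state + frame_tokens
    return heapq.nlargest(capacity, merged, key=lambda pair: pair[0])

state = []
frames = [
    [(0.9, "f0_a"), (0.1, "f0_b")],
    [(0.5, "f1_a"), (0.7, "f1_b")],
    [(0.2, "f2_a"), (0.8, "f2_b")],
]
for frame in frames:
    # Work per frame depends on capacity + frame size, not on frames seen so far.
    state = update_state(state, frame, capacity=3)

kept = [token for _, token in state]
```

Because the state size is fixed, processing N frames takes N bounded merges, which is the linear scaling the abstract refers to.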

2605.31596 2026-06-01 cs.CV cs.LG

KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems

Alireza Kheirandish, Jihoon Hong, Sara Fridovich-Keil

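AI summary: Proposes a KL-divergence-based OOD detection metric that needs no calibration data or knowledge of the shifted distribution and can detect and localize distribution shifts within an image.
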
Comments
CVPR 2026
Abstract

Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback-Leibler divergence between the diffusion prior and the posterior distribution, that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems. Our code can be found at https://github.com/voilalab/KLIP.
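
A toy calculation may help show how a KL-based score can localize a shift. The abstract does not describe how KLIP estimates the divergence, so the sketch below rests on loud assumptions: prior and posterior are each summarized per pixel by a univariate Gaussian, the divergence direction is KL(posterior || prior), and patches are non-overlapping 1D windows. The names `gaussian_kl` and `patch_scores` are hypothetical.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for univariate Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def patch_scores(posterior, prior, patch):
    """Sum per-pixel KL over non-overlapping patches of length `patch` (assumed layout)."""
    kls = [gaussian_kl(mq, sq, mp, sp)
           for (mq, sq), (mp, sp) in zip(posterior, prior)]
    return [sum(kls[i:i + patch]) for i in range(0, len(kls), patch)]

# Four "pixels" as (mean, std). The posterior matches the prior on the first two
# pixels and is shifted on the last two.
prior = [(0.0, 1.0)] * 4
posterior = [(0.0, 1.0), (0.0, 1.0), (2.0, 1.0), (2.0, 1.0)]
scores = patch_scores(posterior, prior, patch=2)
```

The first patch scores zero and the second scores high, so a threshold on per-patch scores would flag only the shifted region.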

2605.31595 2026-06-01 cs.CV

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim

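AI summary: Proposes C4G, a feed-forward 4D dynamic scene reconstruction framework that uses compact learnable Gaussian query tokens and video-diffusion-based rendering enhancement, requires no camera poses, uses significantly fewer Gaussians, and models motion more robustly.
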
Comments
Project Page: see https://cvlab-kaist.github.io/C4G
Abstract

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.

2605.31594 2026-06-01 cs.LG math.OC

A Tight Theory of Error Feedback Algorithms in Distributed Optimization

Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut

AI summary: For the two main error-feedback algorithms in distributed optimization (EF and EF21), the paper gives tight convergence analyses by identifying optimal step sizes and constructing optimal Lyapunov functions; the results are independent of the number of agents and recover the best known guarantees in the single-agent regime.

Abstract

Communication costs are a major bottleneck in distributed learning and first-order optimization. A common approach to alleviate this issue is to compress the gradient information exchanged between agents. However, such compression typically degrades the convergence guarantees of gradient-based methods. Error feedback mechanisms provide a simple and computationally cheap remedy for this issue, but numerous variants have been proposed, and their relative performance remains poorly understood. This paper provides tight convergence analyses for two of the main error-feedback algorithms from the literature, the classic Error Feedback method (EF) and Error Feedback 21 (EF21), by identifying optimal step-size choices and constructing optimal Lyapunov functions tailored to each method. The results hold independently of the number of agents and recover the known best guarantees possible in the single-agent regime.
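
For readers unfamiliar with EF21, the following is a minimal sketch of its update rule on a toy problem. Each agent keeps a gradient estimate g_i and transmits only the compressed difference between its fresh gradient and that estimate. The quadratic objectives, the top-k compressor, and the step size 0.1 are assumptions chosen for illustration; in particular, the step size is not the optimal one derived in the paper.

```python
def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def ef21(targets, x0, step, k, iters):
    """EF21 on f_i(x) = 0.5 * ||x - b_i||^2, whose gradient is x - b_i."""
    n, d = len(targets), len(x0)
    x = list(x0)
    # Each agent initializes its estimate with a compressed gradient.
    g = [top_k([x[j] - b[j] for j in range(d)], k) for b in targets]
    for _ in range(iters):
        avg = [sum(gi[j] for gi in g) / n for j in range(d)]
        x = [x[j] - step * avg[j] for j in range(d)]
        for i, b in enumerate(targets):
            grad = [x[j] - b[j] for j in range(d)]
            # Only this compressed correction would be sent over the network.
            delta = top_k([grad[j] - g[i][j] for j in range(d)], k)
            g[i] = [g[i][j] + delta[j] for j in range(d)]
    return x

# Two agents in 2D; the minimizer of the average objective is the mean of the targets.
x = ef21(targets=[[1.0, 3.0], [3.0, -1.0]], x0=[0.0, 0.0], step=0.1, k=1, iters=500)
```

Even though each agent sends one coordinate per round, the iterate approaches the mean of the targets, (2, 1), because the estimates g_i track the true gradients over time.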

2605.31593 2026-06-01 cs.CR cs.AI

Stateful Online Monitoring Catches Distributed Agent Attacks

Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong, Ivan Zhang, Matan Shtepel, Steffi Chern, Alexander Robey, Eric Wong, Hamed Hassani

AI summary: Single-context monitors struggle to detect misuse that is only visible when aggregated across accounts in distributed agent attacks. The paper proposes a stateful online monitor based on real-time clustering that catches distributed attacks earlier and more effectively while keeping latency low.

Abstract

Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.
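
The contrast between a per-transcript monitor and a stateful one can be sketched in a few lines. This is an illustrative toy, not the paper's system: the cluster identifiers are assumed to come from some upstream clustering step, and the class name `StatefulMonitor`, the thresholds, and the additive aggregation rule are all hypothetical.

```python
from collections import defaultdict

class StatefulMonitor:
    """Accumulates weak suspicion scores per cluster of related transcripts."""

    def __init__(self, single_threshold, cluster_threshold):
        self.single_threshold = single_threshold
        self.cluster_threshold = cluster_threshold
        self.cluster_totals = defaultdict(float)

    def observe(self, cluster_id, score):
        """Return True if this transcript should be escalated for closer review."""
        self.cluster_totals[cluster_id] += score
        return (score >= self.single_threshold
                or self.cluster_totals[cluster_id] >= self.cluster_threshold)

monitor = StatefulMonitor(single_threshold=0.8, cluster_threshold=1.0)
# (cluster_id, suspicion score) pairs; no single score reaches 0.8.
events = [("target-A", 0.3), ("target-B", 0.2), ("target-A", 0.3), ("target-A", 0.5)]
flags = [monitor.observe(cluster, score) for cluster, score in events]
```

A stateless monitor with the same per-transcript threshold would flag none of these events; the stateful one escalates the fourth because the scores in cluster `target-A` add up.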

2605.31591 2026-06-01 cs.CV

CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference

Nurjahan Sultana, Moi Hoon Yap, Xinqi Fan, Wenqi Lu

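AI summary: Proposes CoFiDA-M, a privileged-information framework that uses clinical concepts (MONET concept probabilities) to guide feature modulation and trains an image-only student model, significantly improving melanoma recall in cross-domain skin cancer screening.
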
Comments
Accepted by CVPR 2026
Abstract

Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clinical (target) images, hindering real-world deployment. Existing domain adaptation methods often ignore crucial semantic invariants, such as clinical concepts. While new foundation models like MONET can provide this semantic information as dense, probabilistic scores, this metadata is unavailable at test time, creating a deployment paradox for practical image-only screening tools. We address this gap by proposing CoFiDA-M, a privileged information framework that learns from concepts at training time but deploys as an image-only model. Our method trains a teacher network that uses MONET concept probabilities to guide a FiLM modulator, transforming visual features into a semantically "edited" feature space. A lightweight, image-only student is then trained to reproduce this edited representation, not just the teacher's final predictions. This distillation "bakes" the clinical reasoning into the student's weights. On a challenging multi-dataset benchmark, our image-only student significantly outperforms state-of-the-art approaches, especially in melanoma recall. Our work provides a practical and generalizable framework for leveraging noisy, probabilistic metadata as privileged information, demonstrating strong cross-dataset robustness and potential for real-world deployment beyond dermatology. Implementation code is available at: https://github.com/mmu-dermatology-research/CoFiDA.git
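
FiLM (feature-wise linear modulation) scales and shifts each feature using parameters computed from a conditioning input. The sketch below shows that operation with concept probabilities as the conditioning signal. The layer sizes, weights, and the helper names `linear` and `film` are assumptions for illustration and are not taken from the CoFiDA-M code.

```python
def linear(weights, bias, inputs):
    """Plain affine map: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def film(features, concepts, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise modulation: gamma(concepts) * features + beta(concepts)."""
    gamma = linear(w_gamma, b_gamma, concepts)
    beta = linear(w_beta, b_beta, concepts)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]

features = [1.0, 2.0]   # visual features (toy values)
concepts = [1.0, 0.0]   # two concept probabilities (toy values)
edited = film(
    features, concepts,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.5, 0.0], [0.0, 0.5]], b_beta=[0.0, 0.0],
)
```

Here the first concept is active, so the first feature is doubled and shifted while the second passes through unchanged. In the teacher-student setup described in the abstract, the student would be trained to reproduce `edited` from the image alone.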

2605.31590 2026-06-01 cs.CV cs.AI

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang, Xun Xiao, Thomas Seidl, Daniel Cremers, Volker Tresp

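AI summary: For multi-event generation in long videos, proposes TunerDiT, a training-free method that steers generation progressively through event-partitioned masking and cross-event prompt fusion, achieving state-of-the-art results on 8 metrics.
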
Comments
17 pages, 13 figures
Abstract

Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method that requires no additional training for multi-event generation. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking that enforces event boundaries while allowing cross-event transition bands; (2) Cross-Event Prompt Fusion that injects neighboring event semantics for late-stage refinement. We contribute a self-curated prompt suite for benchmarking multi-event generation, i.e., Meve. TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation, compared with other training-free methods. The improvement in text alignment increases with the event count, indicating a scaling possibility with increasing event count.
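
One way to picture event-partitioned masking with transition bands is as a frame-by-event boolean table. The construction below is a guess at the general shape, not TunerDiT's actual mask: equal-length event segments, a symmetric band measured in frames around each boundary, and the function name `event_mask` are all assumptions.

```python
def event_mask(num_frames, num_events, band):
    """mask[f][e] is True if frame f may attend to the prompt of event e."""
    seg = num_frames / num_events          # frames per event (assumed equal split)
    mask = []
    for f in range(num_frames):
        own = min(int(f / seg), num_events - 1)
        row = [e == own for e in range(num_events)]
        for next_event in range(1, num_events):
            boundary = next_event * seg
            # Frames whose center lies within `band` of a boundary see both neighbors.
            if abs(f + 0.5 - boundary) <= band:
                row[next_event - 1] = True
                row[next_event] = True
        mask.append(row)
    return mask

mask = event_mask(num_frames=8, num_events=2, band=1)
```

With 8 frames and 2 events, frames 0 to 2 see only the first prompt, frames 5 to 7 see only the second, and frames 3 and 4 form the transition band that sees both. Widening `band` trades event separation for smoother transitions.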

2605.31589 2026-06-01 cs.CV

Recognizing Co-Speech Gestures in-the-Wild

Sindhu B Hegde, K R Prajwal, Andrew Zisserman

AI summary: Because current multimodal models struggle to capture semantic co-speech gestures, the paper builds GRW, the first large-scale benchmark dataset, for training video models on semantic gesture classification, recognition of the corresponding word, and temporal localization.

Abstract

While humans naturally gesture during speech, only a sparse subset of these movements are visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated training data. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We leverage GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.

2605.31586 2026-06-01 cs.CL cs.AI

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler

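AI summary: Using a newly constructed dataset, the paper studies how open-source language models of different sizes understand the semantics of rare English Paired-Focus constructions (such as "let alone"). Modestly sized models grasp both form and meaning, and semantic learning emerges later than syntactic knowledge and correlates with world knowledge.
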
Comments
Conference on Natural Language Learning (CoNLL) 2026
Abstract

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.

2605.31584 2026-06-01 cs.CL cs.AI cs.LG

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

AI summary: Proposes LongTraceRL, which generates multi-hop questions via knowledge graph random walks, builds tiered distractors from search agent trajectories, and applies an entity-chain rubric reward as process supervision to improve long-context reasoning in large language models.

Abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
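
The positive-only rubric reward can be sketched as a small scoring function. The abstract specifies only that the rubric uses gold entities along the reasoning chain and applies solely to responses with correct final answers; the base reward of 1.0, the weight 0.5, substring matching, and the name `rubric_reward` are assumptions for illustration.

```python
def rubric_reward(response, answer, gold_answer, gold_entities, weight=0.5):
    """Outcome reward plus an entity-coverage bonus, granted only if the answer is correct."""
    if answer.strip().lower() != gold_answer.strip().lower():
        return 0.0   # positive-only: wrong answers get no rubric credit
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return 1.0 + weight * hits / len(gold_entities)

gold_entities = ["Marie Curie", "Warsaw"]
r_full = rubric_reward("Marie Curie was born in Warsaw, so the country is Poland.",
                       "Poland", "poland", gold_entities)
r_partial = rubric_reward("Warsaw is in Poland.", "Poland", "poland", gold_entities)
r_wrong = rubric_reward("Marie Curie was born in Warsaw, which is in France.",
                        "France", "poland", gold_entities)
```

Both correct responses are rewarded, but the one that cites the full entity chain scores higher. The incorrect response earns nothing even though it mentions every gold entity, which is what keeps the bonus from being gamed by entity name-dropping.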

2605.31581 2026-06-01 cs.AI

Choosing the Lens: Strategic Perspective Activation in Context-Dependent Argumentation

Albert Sadowski, Jarosław A. Chudziak

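AI summary: Proposes context-dependent argumentation frameworks (CDAFs) and, through defeat functions and a perspective-labeled specialisation, studies how an agent can strategically influence argumentation outcomes by choosing relevance sets and priorities.
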
Comments
Accepted to LAMAS&SR workshop at FLoC 2026
Abstract

The same arguments often need to be evaluated under different external regimes. An agent with influence over the regime has a strategic lever that standard formalisms do not directly capture. We introduce context-dependent argumentation frameworks (CDAFs), an extension of Dung's theory in which a defeat function determines, per context, which attacks succeed. A perspective-labeled specialisation derives the defeat function from a relevance set $ρ$ and a priority $π$. The relevance set is the agent's action space. In a small worked example, the agent's target argument is rejected under every full-relevance injective priority, yet accepted under partial activations, one of which no VAF audience can mirror. We define the corresponding decision problem, ACTIVATION-MANIPULATION, and record baseline complexity bounds. Tight bounds and multi-agent variants are left open.
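
The effect of a context-dependent defeat function can be seen on a three-argument example. The sketch computes the grounded extension of a Dung-style framework for each context, where the set of attacks that succeed is given explicitly per context. It does not implement the paper's derivation of defeat from a relevance set and a priority; the contexts, argument names, and the dictionary `defeat` are hypothetical.

```python
def grounded_extension(arguments, defeats):
    """Least fixed point of Dung's characteristic function for a defeat relation."""
    attackers = {a: {x for (x, y) in defeats if y == a} for a in arguments}
    accepted = set()
    while True:
        # An argument is defended if each of its defeaters is defeated by an accepted one.
        defended = {
            a for a in arguments
            if all(any((d, x) in defeats for d in accepted) for x in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended

arguments = {"a", "b", "c"}
attacks = {("b", "a"), ("c", "b")}   # c attacks b, b attacks a

# Hypothetical defeat function: which attacks succeed in each context.
defeat = {
    "ctx_all": attacks,              # every attack succeeds
    "ctx_no_c": {("b", "a")},        # c's attack on b fails in this context
}
result_all = grounded_extension(arguments, defeat["ctx_all"])
result_no_c = grounded_extension(arguments, defeat["ctx_no_c"])
```

Argument `a` is accepted in `ctx_all` (where `c` defends it) and rejected in `ctx_no_c`, so an agent able to choose the context controls the fate of `a` without changing any argument.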

2605.31580 2026-06-01 cs.LG

Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings

Utsav Dutta, Gerardo Pastrana, Sina Khoshfetrat Pakazad, Henrik Ohlsson

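AI summary: Proposes CHARM, which combines channel-level textual descriptions with a Transformer encoder and learns semantic time-series embeddings using a Joint Embedding Predictive Architecture (JEPA), achieving strong performance on anomaly detection, classification, and forecasting with only a linear probe.
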
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML), PMLR 306, 2026
Comments
9 pages, 5 figures, accepted at ICML 2026. arXiv admin note: substantial text overlap with arXiv:2505.14543
Abstract

Transformer-based architectures have advanced sequence modeling in language and vision, yet general-purpose representation learning for heterogeneous multivariate time series remains underexplored. We introduce CHARM (Channel-Aware Representation Model), which incorporates channel-level textual descriptions into a Transformer encoder equivariant to channel order. CHARM is trained with a Joint Embedding Predictive Architecture (JEPA) and a novel loss promoting informative, temporally stable embeddings; latent-space prediction encourages robustness to sensor noise while description-aware gating provides interpretability through learned inter-channel relationships. Across anomaly detection, classification, and short- and long-term forecasting, the learned embeddings achieve strong performance using only a linear probe. Performance is driven primarily by the JEPA objective and conditioning architecture, with text descriptions serving as channel identifiers for cross-dataset generalization.

2605.31577 2026-06-01 cs.CV

SurGe: Improved Surface Geometry in Point Maps

Karim Knaebel, Gonzalo Martin Garcia, Christian Schmidt, Ilya Fradlin, Lucas Nunes, Daan de Geus, Bastian Leibe

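AI summary: To address inaccurate local surface geometry in feed-forward 3D reconstruction, proposes a point map normal metric, a point gradient matching loss, and a Neighborhood Attention Decoder (NAD) to improve local surface orientation, achieving the best average rank across several zero-shot monocular geometry benchmarks.
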
Comments
Project page at https://vision.rwth-aachen.de/surge
Abstract

Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.
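
A surface normal can be read off a point map by crossing the finite differences to neighboring pixels, and two point maps can then be compared by the angle between their normals. The sketch below shows that idea on a 2x2 grid. The exact definition of the paper's metric is not given in the abstract, so the forward differences, the choice of neighbors, and the helper names are assumptions.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def point_map_normal(points, y, x):
    """Normal at pixel (y, x) from forward differences to the right and lower neighbors."""
    dx = sub(points[y][x + 1], points[y][x])
    dy = sub(points[y + 1][x], points[y][x])
    return normalize(cross(dx, dy))

def angle_deg(n1, n2):
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n1, n2))))
    return math.degrees(math.acos(dot))

# points[y][x] = [X, Y, Z]; a fronto-parallel plane versus a plane tilted about the Y axis.
flat = [[[float(x), float(y), 1.0] for x in range(2)] for y in range(2)]
tilted = [[[float(x), float(y), 1.0 + x] for x in range(2)] for y in range(2)]
err = angle_deg(point_map_normal(flat, 0, 0), point_map_normal(tilted, 0, 0))
```

The two maps agree on X and Y and differ by at most one unit in Z, yet their normals are 45 degrees apart, which illustrates how a normal-based metric can expose local errors that a global depth metric averages away.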

2605.31576 2026-06-01 cs.CV

Joint Multi-Camera LiDAR Extrinsic Calibration via Learned Pairwise Initialization and Geometric Refinement

Aziz Al-Najjar, Marzieh Amini, James R. Green, Felix Kwamena

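AI summary: Proposes a two-stage framework that first uses CMRNext to estimate each camera's extrinsics and 2D-3D correspondences independently, then jointly refines them with bundle adjustment to obtain a globally consistent multi-camera calibration, improving both accuracy and consistency.
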
Comments
Paper is accepted in CVPR 2026 Workshop URVI: Unified Robotic Vision with Cross-Modal Sensing and Alignment
Abstract

Most learning-based camera-LiDAR calibration methods treat each camera-LiDAR pair independently, ignoring the rigid geometric coupling in multi-camera platforms. As a result, per-camera estimates may be individually accurate yet inconsistent at the system level. We present a two-stage framework for joint multi-camera LiDAR extrinsic calibration that combines learned pairwise matching with geometric refinement. First, CMRNext is applied independently to each camera to produce initial extrinsic estimates and dense 2D-3D correspondences. These predictions are then jointly refined through a multi-frame bundle adjustment with reprojection, per-camera prior, and relative-pose prior terms. This approach converts pairwise predictions into a globally consistent multi-camera calibration. Experiments on KITTI (in-domain for CMRNext) and Walkley (out-of-domain) datasets show improved per-camera accuracy and inter-camera consistency. On KITTI, the method achieves 0.89 cm translation error and 0.038° rotation error. On Walkley, it reduces translation error from 108.6 cm to 3.1 cm, highlighting the benefit of explicit multi-camera coupling when single-camera predictions are less reliable.
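
The reprojection term in such a bundle adjustment measures how far a LiDAR point lands from its matched pixel under a candidate extrinsic. A minimal sketch of that residual follows, assuming an ideal pinhole camera without distortion. The intrinsics, the point, and the function names are made up for illustration, and the prior terms mentioned in the abstract are not modeled.

```python
def project(point_lidar, rotation, translation, fx, fy, cx, cy):
    """Map a LiDAR point into the camera frame, then apply a pinhole projection."""
    cam = [sum(rotation[r][c] * point_lidar[c] for c in range(3)) + translation[r]
           for r in range(3)]
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)

def reprojection_error(point_lidar, pixel, rotation, translation, fx, fy, cx, cy):
    """Pixel distance between the projected LiDAR point and its matched pixel."""
    u, v = project(point_lidar, rotation, translation, fx, fy, cx, cy)
    return ((u - pixel[0]) ** 2 + (v - pixel[1]) ** 2) ** 0.5

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point, pixel = [1.0, 0.5, 10.0], (370.0, 265.0)

err_true = reprojection_error(point, pixel, identity, [0.0, 0.0, 0.0], **intrinsics)
# A 10 cm lateral error in the extrinsic translation (units assumed to be meters).
err_off = reprojection_error(point, pixel, identity, [0.1, 0.0, 0.0], **intrinsics)
```

At 10 m depth a 10 cm translation error shifts the projection by 5 pixels. A joint optimizer sums such residuals over all cameras and frames, so that a rigid constraint between cameras lets well-matched views correct poorly matched ones.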

2605.31575 2026-06-01 cs.IR cs.AI

SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics

Eric Liang

AI summary: Proposes SPECTRA, a framework that generates synthetic text corpora and retrieval test collections by separating latent topical structure, text realization, metadata controls, query intent generation, and deterministic relevance oracles, in order to diagnose the scaling behavior and failure modes of retrieval systems.

Abstract

Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, and evaluation tooling, yet human-judged test collections remain expensive and may be unavailable when documents are private or still under design. This paper introduces SPECTRA, a reproducible framework for generating synthetic text corpora and retrieval test collections through a separation of latent topical structure, surface text realization, metadata controls, query intent generation, and deterministic relevance oracles. The framework is intended as a diagnostic complement to Cranfield-style and TREC-style evaluation, not as a replacement for human assessment. A single-process Python prototype generated corpora up to 60,000 documents and 9.61 million tokens while preserving controllable long-tail vocabulary growth and producing graded relevance labels for 96 queries. In the local simulation study, generation remained close to linear at roughly 12K to 14K documents per second, estimated Zipf slopes stayed near 0.86 in absolute value, and increasing cross-topic distractor text reduced BM25 nDCG@10 from 1.00 at 2% distractors to 0.43 at 36% distractors. These results show that lightweight synthetic corpora can expose retrieval-system scaling and failure modes before costly collection construction begins.
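
The nDCG@10 figures quoted above come from a standard ranking metric, sketched here for reference. The abstract does not say which gain formulation SPECTRA uses, so the linear-gain variant below is an assumption, as is the simplification that the ranked list already contains every judged document for the query.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k):
    """DCG normalized by the DCG of the ideal ordering of the same judged documents."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance labels in ranked order; 0 marks a distractor.
perfect = ndcg([3, 2, 1, 0], k=10)
distracted = ndcg([0, 3, 2, 1], k=10)
```

Moving a single distractor to the top of the ranking drops the score from 1.0 to roughly 0.70 on this toy list, which gives a feel for how quickly distractor text can erode nDCG.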

2605.31572 2026-06-01 cs.CV

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

Zhiyu Huang, Johnson Liu, Rui Song, Zewei Zhou, Ruining Yang, Yun Zhang, Tianhui Cai, Hanyin Zhang, Mingxuan Gao, Valeria Xu, Jiali Chen, Yishan Shen, Yiluan Guo, Tony, Qi, Jiaqi Ma

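AI summary: Introduces the nuReasoning dataset, with reasoning annotations for 20,000 20-second long-tail driving clips; it supports evaluation of spatial, decision, and counterfactual reasoning and shows that reasoning supervision improves the driving performance of VLMs and VLAs.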

详情
AI中文摘要

推理对于自动驾驶在长尾场景中至关重要,车辆必须运用常识知识、理解空间关系、推断智能体交互并做出安全决策。然而,现有的自动驾驶数据集和基准主要针对感知、预测或规划,对现实长尾驾驶场景的推理监督有限。我们提出nuReasoning,一个面向推理中心自动驾驶的大规模真实世界数据集和基准。沿袭nuScenes和nuPlan的体系,nuReasoning将真实世界自动驾驶数据集和基准推向长尾驾驶场景中的推理。该数据集包含2万个片段,每个片段长20秒,采集自多个城市,具有同步的多摄像头图像、LiDAR数据、高清地图、物体标注以及人工验证的推理标注,涵盖空间推理、决策推理和反事实推理。与先前主要关注视觉问答的数据集不同,nuReasoning同时支持推理评估和规划评估,能够直接研究推理监督如何影响驾驶性能。实验表明,在nuReasoning上微调VLM可显著提升驾驶特定问答的性能,而将推理监督纳入VLA训练中,即使在推理时禁用文本推理输出,也能改善规划性能。这些结果确立了nuReasoning作为在现实长尾场景中评估和改进鲁棒、可解释、推理驱动的自动驾驶系统的基础。

英文摘要

Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.

2605.31564 2026-06-01 cs.CL cs.AI

What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation

什么先被揭开?面向图到文本生成的扩散模型轨迹分析

Qing Wang, Jacob Devasier, Chengkai Li

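AI summary: Presents the first systematic study of the decoding trajectories of masked diffusion language models in graph-to-text generation, finding that they unmask entities first. To counter the fixed output length caused by supervised fine-tuning, it proposes λ-scaled structural decoding, a training-free inference-time modification that recovers +9.4 BLEU-4, and introduces Graph-LLaDA to explicitly incorporate relational graph structure.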

详情
AI中文摘要

我们首次系统研究了掩码扩散语言模型(MDLM)在图到文本生成中的应用。我们分析了MDLM的生成轨迹——即迭代解码过程中令牌被解除掩码的顺序——发现与自回归LLM线性生成文本不同,MDLM自然优先处理实体,然后是关系词和功能词,结构令牌最后解决。我们进一步发现了一个先前未记录的监督微调失败模式:SFT通过过早地将结构性的句子结束令牌锚定在解码轨迹早期,破坏了这一策略,从而有效固定了输出长度,这可能导致信息遗漏或幻觉。为了解决这个问题,我们提出了λ缩放结构解码,一种无训练的推理时修改方法,降低结构令牌的置信度,并恢复了+9.4 BLEU-4。最后,我们引入了Graph-LLaDA,它将图Transformer编码器集成到LLaDA的解码过程中,以显式融入关系图结构。在LAGRANGE上的跨数据集评估表明,先前的基线过拟合于特定数据集模式,而基于LLM和MDLM的方法泛化能力显著更好。

英文摘要

We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs, which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length, which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.
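
The idea of downweighting structural token confidence can be illustrated with a toy confidence-based unmasking step. This is only a sketch under stated assumptions: the abstract does not give the exact formulation, so multiplying the confidence by a factor `lam`, the set of structural tokens, and the function name `pick_next_position` are all hypothetical.

```python
def pick_next_position(candidates, structural_tokens, lam=0.5):
    """Choose which masked position to unmask next.

    candidates: dict mapping position -> (predicted token, confidence).
    structural_tokens: tokens treated as structural (e.g. sentence enders).
    lam: assumed scaling factor; lam < 1 delays structural tokens.
    """
    def scaled_confidence(item):
        _, (token, confidence) = item
        # Assumption: structural tokens have their confidence multiplied by lam.
        return confidence * lam if token in structural_tokens else confidence

    position, (token, _) = max(candidates.items(), key=scaled_confidence)
    return position, token
```

With `lam=1.0` the step reduces to plain highest-confidence unmasking, so a confident sentence-ending token can be fixed early; with `lam<1` content tokens are more likely to be resolved first.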

2605.31563 2026-06-01 cs.CL

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

分歧理由:重新思考仇恨言论检测中的分类与可解释性评估

Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti

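AI summary: Re-implements diverse models, training strategies, and evaluation metrics under a unified framework, analyzes classification and explainability metrics across label and rationale representation spaces, and finds that soft representations capture disagreement better, motivating a rethink of evaluation for subjective NLP tasks.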

详情
Comments
16 pages
AI中文摘要

人类分歧在标注中普遍存在且众所周知。然而,通过词级人类理由捕捉的解释变异仍远未得到充分探索。同时,鉴于这种变异,如何最好地评估人类标签和理由——甚至如何超越多数投票聚合理由——尚不明确。然而,理由可能提供对人类推理丰富性的额外见解,这些推理在风格、价值观和解释上可能有所不同——尤其是在像仇恨言论检测这样的主观NLP任务中。在本工作中,我们通过在不同标签和理由表示空间下系统地重新实现它们,将多样化的模型、训练策略、损失函数和现有评估指标统一到一个协议下。分类指标围绕两个关键属性组织——预测性和分布性——而可解释性指标则通过三个互补维度:合理性、忠实性和复杂性。在这个统一的监督框架中,我们评估模型在分类和可解释性指标上的行为,以及指标对标签(硬和软)和理由表示空间(硬、中间和软)选择的敏感性。结果表明,硬指标和软指标都更倾向于软表示,突显了它们在捕捉变异方面的有效性,以及重新思考主观NLP中评估的必要性。

英文摘要

Human disagreement is ubiquitous and well known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how to best evaluate human labels and rationales -- or even how to best aggregate rationales beyond majority vote -- in light of this variation. Yet, rationales may provide additional insights into the richness of human reasoning, which may differ in style, values, and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics are organized along three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate, and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.

2605.31562 2026-06-01 cs.LG

Effective Biological Representation Learning by Masking Gene Expression

通过掩码基因表达实现有效的生物表示学习

Kian Kenyon-Dean, Alina Selega, Ihab Bendidi, Jordan M. Sorokin, Luca Bertinetto, David Errington, Hayley Donnella, Oren Kraus

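AI summary: Proposes TxFM, a self-supervised model that applies masked autoencoding to RNA-seq data; ablations identify the key architecture choices, and training on the curated DiverseRNA-1.4M dataset yields gene representations that outperform large-scale foundation models.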

详情
Comments
31 pages, 11 figures. Preprint; presented at ICLR 2026 2nd Workshop on Foundation Models for Science: Real-World Impact and Science-First Design
AI中文摘要

RNA测序产生丰富多样的基因表达数据集,为细胞状态和功能提供了引人注目的见解,在药物发现中有许多应用。由于固有的技术噪声和实验批次效应,对此类数据进行建模具有挑战性,许多现有的转录组基础模型(FMs)表现不如线性基线。这些结果提出了一个问题:深度表示学习是否比直接使用原始转录计数具有明显优势。我们的工作通过开发一种新的自监督模型TxFM来探索这一点,重点关注归纳表示学习评估。TxFM采用了一种针对多样化RNA-seq计数数据定制的掩码自编码方法,我们的消融研究通过实验确定了强迁移性能所需的关键架构配置。此外,我们策划了一个公共训练语料库DiverseRNA-1.4M,并发现,在此策划数据集上训练的TxFM产生了高保真度的基因表示,其性能优于在规模大100倍以上的图谱级语料库上训练的FMs。总体而言,我们的结果表明,只要精心综合模型架构和训练数据策划,归纳自监督学习是转录组表示的一种可行建模方法。

英文摘要

RNA sequencing produces rich and diverse datasets of gene expression, offering compelling insights into cellular state and function that have many applications in drug discovery. Modeling such data is challenging due to inherent technical noise and experimental batch effects, as evidenced by many existing transcriptomic foundation models (FMs) underperforming relative to linear baselines. Such results raise the question of whether deep representation learning provides a distinct advantage over the direct use of raw transcript counts. Our work explores this by developing a new self-supervised model, TxFM, with a focus on inductive representation learning evaluations. TxFM employs a masked autoencoding approach tailored to diverse RNA-seq count data, and our ablation study empirically identifies crucial architecture configurations required for strong transfer performance. Additionally, we curate a public training corpus, DiverseRNA-1.4M, and find that TxFM trained on this curated dataset yields high-fidelity gene representations that outperform FMs trained on atlas-scale corpora over 100x larger. Overall, our results indicate that inductive self-supervised learning is a viable modeling approach for transcriptomics representation, provided a careful synthesis of model architecture and training data curation.

2605.31561 2026-06-01 cs.CL

What Am I Missing? Question-Answering as Hidden State Probing

我遗漏了什么?将问答作为隐藏状态探测

Chu Fei Luo, Samuel Dahan, Xiaodan Zhu

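AI summary: Proposes question-asking as an inference-time intervention, probes hidden states in a student-teacher setup, finds that the student's hidden state predicts final correctness even before the teacher answers, and designs a gating policy for deciding when to ask.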

详情
AI中文摘要

自链式推理引入大型语言模型(LLMs)以来,测试时推理已成为一个重要的研究领域。然而,这一推理过程的机制仍未被充分探索——从相同的输入提示,甚至相同的部分解出发,LLMs在多次采样时可能产生不同的答案。我们提出利用提问作为推理时干预手段,以表达模型隐藏状态的信息。为此,我们提出了一个学生-教师设置,其中学生向教师提问。我们在学生提问前后训练一个探测其隐藏状态的探针,发现该探针能预测轨迹的最终正确性,甚至在生成教师答案之前。这表明,在问题生成过程中存在有意义的自我诊断信号,而非来自教师的信息传递。然后,我们将提问建模为序列决策问题,使用该探针作为质量分数,并定义一个门控策略来提问以最大化正确可能性。我们发现,提问作为干预的成功在很大程度上取决于模型的自我一致性。我们的实证结果显示检测与恢复之间存在差距;虽然我们的门控策略捕捉了模型的正确性和不确定性,但干预在恢复错误轨迹的同时,同样可能损害正确轨迹。这种诊断与纠正之间的差距对语言模型在不确定性下进行自我精炼的能力具有更广泛的影响。

英文摘要

Test-time reasoning has become a significant field of study since the introduction of chain-of-thought reasoning in large language models (LLMs). However, the mechanisms of this reasoning process are still under-explored -- from the same input prompt, and even the same partial solution, LLMs can produce varied answers if sampled multiple times. We propose to leverage question-asking as an inference-time intervention that articulates information about the model's hidden state. To achieve that, we present a student-teacher setting where a student asks questions to a teacher. We train a probe on the student's hidden state before and after asking a question and find it is predictive of the trajectory's final correctness, even before generating the teacher's answer. This suggests there is a meaningful signal from the self-diagnosis that occurs during question generation rather than information transfer from the teacher. We then frame question-asking as a sequential decision problem, using this probe as a quality score, and define a gating policy to ask questions that maximize likelihood of correctness. We find that the success of question-asking as an intervention is largely dependent on the model's self-consistency. Our empirical results show a gap between detection and recovery; while our gating policy captures model correctness and uncertainty, interventions are equally likely to harm correct trajectories as they are to recover incorrect ones. This gap between diagnosis and correction has broader implications on language models' capacity for self-refinement under uncertainty.

2605.31559 2026-06-01 cs.LG

Functional Attention: From Pairwise Affinities to Functional Correspondences

函数注意力:从成对亲和性到函数对应

Jiefang Xiao, Maolin Gao, Simon Weber, Guandao Yang, Daniel Cremers

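AI summary: Proposes Functional Attention, which reinterprets attention as a functional correspondence between adaptive bases and replaces softmax affinities with structured linear operators, giving a compact, generalizable, resolution-invariant representation of global dependencies that matches state-of-the-art performance on operator learning tasks such as PDE solving, 3D segmentation, and regression.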

详情
Comments
26 pages, 12 figures. Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

学习无限维函数空间之间的映射,即算子学习,对于许多机器学习应用至关重要。尽管基于Transformer的算子很流行,但它们通常依赖于token-wise注意力。这些方法将连续场视为离散token,通常忽略全局函数结构。我们引入了函数注意力,它将注意力重新解释为自适应基之间的函数对应。受几何函数映射的启发,我们的方法用结构化线性算子替换softmax亲和性。这产生了一个紧凑、可泛化、分辨率不变的表示,显式捕获全局依赖关系。实验表明,函数注意力可以在许多算子学习任务中达到最先进的性能,包括求解PDE、3D分割和回归,同时保持对不同离散化的鲁棒性。项目页面可在https://github.com/xjffff/FUNCATTN获取。

英文摘要

Learning mappings between infinite-dimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer-based operators are popular, they often rely on token-wise attention. These methods treat continuous fields as discrete tokens and usually ignore the global functional structure. We introduce Functional Attention, which reinterprets attention as a functional correspondence between adaptive bases. Inspired by geometric functional maps, our method replaces softmax affinities with structured linear operators. This yields a compact, generalizable, resolution-invariant representation that explicitly captures global dependencies. Experiments demonstrate that Functional Attention can match state-of-the-art performance in many operator learning tasks, including solving PDEs, 3D segmentation, and regression, while remaining robust to varying discretizations. Project page is available at https://github.com/xjffff/FUNCATTN.

2605.31558 2026-06-01 cs.LG cs.AI

Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

位置注意力头与符号注意力头:学习动态、RoPE几何和长度泛化

Felipe Urrutia, Juan José Alegría, Cinthia Sanchez Macias, Jorge Salas, Cristian B. Calderon, Cristobal Rojas

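AI summary: Uses controlled experiments to study the learning dynamics of Transformer attention heads on positional and symbolic reasoning tasks, identifying the distinct mechanisms of positional and symbolic heads and their effect on length generalization.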

详情
AI中文摘要

基于Transformer的语言模型在当今社会广泛应用。因此,理解它们解决结构化任务的机制以及预测它们在新型场景中的行为对于安全部署至关重要。我们通过在两个结构等价的多跳推理任务上训练仅解码器Transformer(GPT-J)来研究注意力头的学习动态:一个需要位置推理的数字任务和一个需要符号推理的字母任务。利用最近引入的度量标准,该标准将注意力头的行为分类为给定提示下的位置性或符号性,我们表明成功学习与纯头(即表达为位置性或符号性的头)的出现相关。尽管任务结构等价,但它们施加了不同的机制需求:数字任务需要位置头和符号头,而字母任务仅需要符号头。然后,我们识别这些头的计算角色,描述它们实现的基本功能,并给出理论构造,展示单层基于RoPE的注意力如何通过几何可解释的查询、键和值操作实现这些功能。该分析通过一种新的差异概念形式化,在位置和符号机制对更长序列的鲁棒性上产生了定量分离。我们在受控模型和真实世界模型中经验验证了由此产生的预测,表明符号机制更可靠地外推到更长序列,而位置机制面临更严格的限制。

英文摘要

Transformer-based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured tasks and predicting how they may behave in novel scenarios is of great importance for safe deployment. We study the learning dynamics of attention heads in a controlled setting by training a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks: a number task requiring positional reasoning and a letter task requiring symbolic reasoning. Using a recently introduced metric that classifies attention-head behavior as positional or symbolic for a given prompt, we show that successful learning is associated with the emergence of pure heads, i.e., heads that express themselves as either positional or symbolic. Despite the tasks' structural equivalence, they impose different mechanistic demands: the number task requires both positional and symbolic heads, whereas the letter task requires only symbolic heads. We then identify the computational roles of these heads, characterize the basic functions they implement, and give theoretical constructions showing how single-layer RoPE-based attention can realize these functions through geometrically interpretable query, key, and value operations. This analysis yields a quantitative separation between positional and symbolic mechanisms in their robustness to longer sequences, formalized through a novel notion of discrepancy. We empirically validate the resulting predictions in both controlled and real-world models, showing that symbolic mechanisms extrapolate more reliably to longer sequences while positional mechanisms face sharper limitations.

2605.31556 2026-06-01 cs.CV cs.AI cs.CL cs.CY cs.HC

Vision-Language Models Suppress Female Representations Under Ambiguous Input

视觉-语言模型在模糊输入下抑制女性表征

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

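AI summary: Introduces LALS, a zero-shot metric, and finds that under ambiguous input the internal encodings and outputs of vision-language models are systematically decoupled, with female signal suppressed before generation, revealing how the models process gender bias internally.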

详情
Comments
16 pages, 12 figures, 1 table
AI中文摘要

对齐训练使视觉-语言模型(VLM)避免表达人口统计偏见,当性别清晰可见时,它们基本成功。但对于模糊输入(如全副武装的工人、从背后看到的人物)——实践中常见但很少研究的情况——我们发现,在模糊输入图像时,最小的提示压力就会暴露职业-性别默认值,模型甚至对强烈女性刻板印象的职业也倾向于男性。但这些输出是否反映了模型实际内部编码的内容?我们引入LALS(潜在关联倾向分数),一种零样本度量,将视觉标记激活投影到模型的文本嵌入空间中,以测量每个标记和层的概念关联。在15个职业、超过800张性别模糊图像和四个VLM上,内部表征和输出系统性地解耦:模型通常内部编码女性关联但输出男性。逐层分析揭示了一个不对称滤波器——男性信号端到端放大,而女性信号在中间网络达到峰值并在生成前被抑制——颜色消融实验表明,文化负载的视觉线索(如服装颜色)进一步调节这些内部关联。

英文摘要

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind), cases that are common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when models are prompted with ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

2605.31551 2026-06-01 cs.CV

SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation

SMART: SMPLest-X 网格自适应与 RAFT 跟踪用于足球姿态估计

Parthsarthi Rawat

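AI summary: Proposes SMART, which fine-tunes SMPLest-X and combines it with RAFT optical-flow camera tracking and foot-plane anchoring, substantially reducing 3D pose estimation error in the FIFA Skeletal Tracking Challenge.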

详情
Comments
CVPR 2026 SoccerNet FIFA Skeleton Tracking Light Challenge, Rank 6
AI中文摘要

我们介绍了参加 2026 年 FIFA 骨骼跟踪挑战赛的方法,该挑战要求从广播视频中估计足球运动员的 3D 世界空间姿态。我们的方法通过分层片段分割、多任务深度监督和广播增强对 SMPLest-X(ViT-H,687 M 参数)进行微调,并结合 RAFT 密集光流相机跟踪器、足部平面锚定和两遍时间平滑。在验证集上,SMART 相对于 FIFA 基线得分 1.053 取得了 0.647 的成绩,提升了 38.6%;在保留的测试集上,SMART 得分为 0.593(全局 MPJPE:0.324 m,局部 MPJPE:0.054 m)。

英文摘要

We present our approach to the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. Our method finetunes SMPLest-X (ViT-H, 687 M parameters) via a stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. Against the FIFA baseline score of 1.053 on the validation set, SMART achieves 0.647, a 38.6% improvement; on the held-out test set, SMART scores 0.593 (Global MPJPE: 0.324 m, Local MPJPE: 0.054 m).
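
The reported 38.6% figure and the MPJPE values can be related with a short sketch. The relative-improvement arithmetic follows directly from the two scores in the abstract; the MPJPE helpers are a generic illustration, and treating "Local MPJPE" as a root-aligned error is an assumption, since the abstract does not define it. All function names are hypothetical.

```python
import math

def relative_improvement(baseline, score):
    """Fractional reduction of a lower-is-better score versus a baseline."""
    return (baseline - score) / baseline

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def root_aligned_mpjpe(pred, gt, root=0):
    """Assumed 'local' variant: subtract the root joint before comparing."""
    def center(joints):
        r = joints[root]
        return [tuple(c - rc for c, rc in zip(j, r)) for j in joints]
    return mpjpe(center(pred), center(gt))

# Validation scores quoted in the abstract: 1.053 (baseline) and 0.647 (SMART).
improvement_pct = round(100 * relative_improvement(1.053, 0.647), 1)
```

Under this reading, a global error of 0.324 m alongside a local error of 0.054 m would suggest that most of the remaining error comes from placing the player in world space rather than from the articulated pose itself.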

2605.31550 2026-06-01 cs.CL

Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

语义三元组恢复:大型语言模型中层次化表格理解的新协议

Yibin Zhao, Fangxin Shang, Dingrui Yang, Yuqi Wang

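AI summary: Proposes the Semantic Triplet Restoration (STR) protocol, which rewrites tables as atomic facts <item path, feature path, value>, and the TripletQL router for selecting triplet subsets; on four table-QA benchmarks it matches or improves upon HTML baselines while reducing input tokens.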

详情
AI中文摘要

表格问答要求模型恢复由二维布局、合并单元格和层次化标题隐式编码的语义关系。当前流水线通常使用HTML或Markdown作为中间表格表示,但这些面向布局的序列化引入了标记开销,并要求大型语言模型从行和列跨度推断标题-单元格对齐。我们提出语义三元组恢复(STR),一种将每个单元格重写为原子事实<项目路径,特征路径,值>的协议,其中项目路径指定行实体,特征路径指定层次化属性,值包含单元格内容。我们还提出TripletQL,一种轻量级查询感知路由器,使用STR为每个问题选择适当的渲染或过滤的三元组子集。在四个中英文表格问答基准上,STR匹配或超越基于HTML的基线,同时减少输入令牌。对于较小的语言模型和较长的表格上下文,相对收益更大,表明在受限推理预算下显式语义表示特别有用。代码和数据可在https://github.com/Phoenix-ni/STR.git获取。

英文摘要

Table question answering requires models to recover semantic relations encoded implicitly by two-dimensional layout, merged cells, and hierarchical headers. Current pipelines typically use HTML or Markdown as intermediate table representations, but these layout-oriented serializations introduce markup overhead and require large language models to infer header-cell alignments from row and column spans. We propose Semantic Triplet Restoration (STR), a protocol that rewrites each cell as an atomic fact <item path, feature path, value>, where the item path specifies the row-wise entity, the feature path specifies the hierarchical attribute, and the value contains the cell content. We also present TripletQL, a lightweight query-aware router that uses STR to select an appropriate rendering or filtered subset of triplets for each question. Across four Chinese and English table-QA benchmarks, STR matches or improves upon HTML-based baselines while reducing input tokens. The relative benefit grows for smaller language models and longer table contexts, suggesting that explicit semantic representations are especially useful under constrained inference budgets. Code and data are available at https://github.com/Phoenix-ni/STR.git .
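
The <item path, feature path, value> representation can be illustrated with a small sketch. It assumes the table has already been parsed into row paths, hierarchical column paths, and a cell grid; the input layout, the " > " path separator, and the function name `table_to_triplets` are hypothetical and not taken from the STR code.

```python
def table_to_triplets(row_paths, col_paths, cells, sep=" > "):
    """Flatten a hierarchical table into (item path, feature path, value) facts.

    row_paths: one tuple of row-header levels per row, e.g. ("Asia", "Japan").
    col_paths: one tuple of column-header levels per column.
    cells:     2D list of cell values; None marks an empty cell.
    """
    triplets = []
    for r, item in enumerate(row_paths):
        for c, feature in enumerate(col_paths):
            value = cells[r][c]
            if value is None:
                continue  # skip empty cells rather than emit a fact without a value
            triplets.append((sep.join(item), sep.join(feature), value))
    return triplets

# Toy example with made-up values.
example = table_to_triplets(
    row_paths=[("Asia", "Japan"), ("Asia", "Korea")],
    col_paths=[("Population", "2020"), ("Population", "2021")],
    cells=[[1, 2], [3, None]],
)
```

Each triplet carries its full header context, so a model does not need to reconstruct header-cell alignment from row and column spans.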

2605.31547 2026-06-01 cs.LG math.DS stat.ML

The Dynamic-Probabilistic Consistency Gap in Chaotic Surrogate Modeling

混沌替代建模中的动态-概率一致性差距

Andre Herz, Matthijs Pals, Daniel Durstewitz, Georgia Koppe

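AI summary: To address the inconsistency between dynamical and probabilistic objectives in surrogate modeling of chaotic systems, proposes KAFFEE, a framework based on a differentiable extended Kalman filter that narrows the gap through likelihoods on local predictive residuals and Jacobian-based covariance propagation.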

详情
AI中文摘要

动力系统重构旨在学习捕捉时间序列数据背后动力学的替代模型。可靠部署这些替代模型需要与所学动力学一致的不确定性估计。我们揭示了一个动态-概率一致性差距:追求有限时域概率目标可能会退化动力学,或使预测不确定性脱离其应反映的局部切向动力学。我们分离出这一差距背后的三种机制:核心坍缩、噪声掩盖和盲不确定性。具体来说,我们表明开环高斯滚动目标会惩罚混沌系统中雅可比生成的协方差增长,鼓励削弱物理扩张或使不确定性与之脱钩的优化捷径。为缓解这一差距,我们提出KAFFEE(用于遍历仿真的卡尔曼感知框架),这是一个基于可微扩展卡尔曼滤波的训练框架,在通过学习的局部雅可比传输协方差的同时,评估局部预测残差(新息)的似然。在随机超混沌Lorenz-96上,KAFFEE减少了已识别的失败模式,改善了相对于开环目标的动力学不变量重建,并保持了有竞争力的预测分数。我们进一步表明,当概率性地将DSR基础模型适应于13个混沌系统时,DPC差距出现,而KAFFEE在基本保留零样本动力学的同时实现了上下文贝叶斯滤波。

英文摘要

Dynamical systems reconstruction (DSR) aims to learn surrogate models that capture the dynamics underlying time-series data. Reliably deploying these surrogates requires uncertainty estimates consistent with the learned dynamics. We expose a dynamic-probabilistic consistency (DPC) gap: the pursuit of finite-horizon probabilistic objectives can degrade dynamics or decouple predictive uncertainty from the local tangent dynamics it ought to reflect. We isolate three mechanisms behind this gap: core collapse, noise masking, and blind uncertainty. Specifically, we show that open-loop Gaussian rollout objectives can penalize Jacobian-generated covariance growth in chaotic systems, encouraging optimization shortcuts that weaken physical expansion or decouple uncertainty from it. To mitigate this gap, we propose KAFFEE (Kalman-Aware Framework For Ergodic Emulation), a differentiable extended Kalman filter-based training framework that evaluates likelihood on local predictive residuals (innovations) while transporting covariance through learned local Jacobians. On stochastic hyperchaotic Lorenz-96, KAFFEE reduces the identified failure modes, improves reconstruction of dynamical invariants relative to open-loop objectives, and maintains competitive predictive scores. We further show that the DPC gap appears when probabilistically adapting a DSR foundation model across 13 chaotic systems, where KAFFEE enables in-context Bayesian filtering while largely preserving zero-shot dynamics.
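
For orientation, the textbook extended Kalman filter recursion that the description above refers to can be written as follows. This is the standard form, not necessarily the exact objective used in KAFFEE; the symbols $f_\theta$ (learned surrogate), $H$ (a linear observation map, assumed here), $Q$, and $R$ (process and observation noise covariances) are notation introduced for this sketch.

```latex
\begin{aligned}
\hat{x}_{t|t-1} &= f_\theta(\hat{x}_{t-1|t-1}), &
J_t &= \left.\frac{\partial f_\theta}{\partial x}\right|_{\hat{x}_{t-1|t-1}}, \\
P_{t|t-1} &= J_t \, P_{t-1|t-1} \, J_t^{\top} + Q, \\
e_t &= y_t - H \hat{x}_{t|t-1}, &
S_t &= H \, P_{t|t-1} \, H^{\top} + R, \\
-\log p(y_t \mid y_{1:t-1}) &= \tfrac{1}{2}\left( e_t^{\top} S_t^{-1} e_t + \log\det S_t + d \log 2\pi \right).
\end{aligned}
```

Here $e_t$ is the innovation, $S_t$ its covariance, and $d$ the observation dimension. Because the likelihood is evaluated on one-step innovations after each filter update, covariance growth driven by $J_t$ is transported locally rather than accumulated over a long open-loop rollout.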

2605.31545 2026-06-01 cs.CL

Preference-Aware Rubric Learning for Personalized Evaluation

偏好感知的评分标准学习用于个性化评估

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua

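AI summary: Proposes PARL, a framework that learns preference-aware evaluation rubrics from user histories and applies a self-validation mechanism to enable personalized evaluation.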

详情
AI中文摘要

随着大型语言模型从通用助手演变为以用户为中心的智能体,个性化已成为使模型行为与个体偏好对齐的核心,而对个性化对齐的评估成为关键瓶颈。现有的评估方法——从自动指标到LLM作为评判者的方法——未能捕捉长期交互历史中嵌入的主观、用户特定偏好。我们确定了可靠且有效的个性化评估的三个基本原则:代表性、用户一致性和区分性。为应对这些原则,我们引入了“个性化评估即学习”范式,将个性化评估形式化为一个学习问题而非静态判断。在此范式下,我们提出了PARL(偏好感知的评分标准学习用于个性化评估),一个从原始用户历史中直接学习诱导偏好感知的评估标准,并执行自验证机制以确保与用户偏好一致的框架。PARL将评分标准诱导与区分性强化学习目标相结合,该目标对比用户撰写的回答与竞争性个性化模型输出,使学到的评分标准能够捕捉精确、用户特定的决策边界。在真实世界的个性化文本生成任务上的实验表明,PARL一致地诱导出高保真度的评分标准,可靠地识别与用户对齐的回答,并在用户和任务间泛化,同时捕捉稳定的风格偏好和细粒度的评估模式。为确保可重复性,我们的代码可在 https://github.com/SnowCharmQ/PARL 获取。

英文摘要

As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods -- ranging from automatic metrics to LLM-as-a-judge approaches -- fail to capture subjective, user-specific preferences embedded in long-term interaction histories. We identify three essential principles for reliable and effective personalized evaluation: Representativeness, User-Consistency, and Discriminativeness. To address these principles, we introduce Personalized Evaluation as Learning, a paradigm that formulates personalized evaluation as a learning problem rather than a static judgment. Under this paradigm, we propose PARL (Preference-Aware Rubric Learning for Personalized Evaluation), a framework that learns to induce preference-aware evaluation rubrics directly from raw user histories and applies a self-validation mechanism to ensure consistency with the user's preferences. PARL integrates rubric induction with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive personalized model outputs, enabling the learned rubrics to capture precise, user-specific decision boundaries. Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks, while capturing stable stylistic preferences and fine-grained evaluative patterns. To ensure reproducibility, our code is available at https://github.com/SnowCharmQ/PARL.

2605.31539 2026-06-01 cs.CV cs.LG q-bio.QM

Automated Prediction of Postoperative Pancreatic Fistula Using Preoperative Computed Tomography

利用术前计算机断层扫描自动预测术后胰瘘

Ashok Choudhary, Chris Varghese, Leo Y. Li-Han, Frank G. Lee, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad

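AI summary: Proposes an end-to-end deep learning pipeline, from pancreatic segmentation to classification, that automatically predicts postoperative pancreatic fistula risk from preoperative CT scans, providing a tool for clinical decision-making and a methodological benchmark.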

详情
AI中文摘要

术后胰瘘(POPF)是胰腺切除术后的一种严重并发症,会增加发病率、住院时间和医疗费用。我们提出了一种自动化的端到端深度学习流程——从胰腺分割到分类——用于利用术前CT扫描进行术前POPF风险估计和分层。使用包含自动分割的胰腺体积和手术结果的数据集评估了多种架构,包括自定义轻量级3D CNN基线(CNN3D)、R(2+1)D ResNet-18和ResNet-MC3-18模型。在多个3D架构上的评估显示了有前景的预测性能。该方法为胰腺特异性CT分类提供了临床有价值的工具和方法基准,支持胰腺手术中改进的术前决策。

英文摘要

Postoperative pancreatic fistula (POPF) is a serious complication after pancreatic resection, increasing morbidity, hospital stay, and healthcare costs. We present an automatic, end-to-end deep learning pipeline -- from pancreatic segmentation to classification -- for preoperative POPF risk estimation and stratification using preoperative CT scans. A dataset with auto-segmented pancreas volumes and surgical outcomes was used to evaluate multiple architectures, including a custom lightweight 3D CNN baseline (CNN3D), R(2+1)D ResNet-18, and ResNet-MC3-18 models. Evaluation across multiple 3D architectures demonstrated promising predictive performance. This approach offers a clinically valuable tool and a methodological benchmark for pancreas-specific CT classification, supporting improved preoperative decision-making in pancreatic surgery.

2605.31535 2026-06-01 cs.CV cs.AI cs.LG

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

RayDer: 从真实世界视频中可扩展的自监督新视角合成

Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer

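AI summary: Proposes RayDer, a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone; self-supervised novel view synthesis then follows clean power-law scaling, and zero-shot open-set performance is competitive with supervised methods.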

详情
Comments
Project Page: https://compvis.github.io/rayder
AI中文摘要

自监督新视角合成(NVS)在扩展方面仍然具有挑战性,尽管视频数据丰富,这主要是由于在真实视频上训练的脆弱性以及多网络系统设计的难以预测的缩放行为。我们引入了RayDer,一个统一的前馈Transformer,将相机估计、场景重建和渲染整合到一个单一骨干中,将自监督NVS转化为一个适定的单模型缩放问题。一个最小的动态状态,被视为干扰因素,吸收时变内容,使得在无约束的真实世界视频上稳定训练成为可能。重要的是,RayDer将静态场景NVS作为其目标任务:动态内容仅作为可扩展的监督被利用,而不是像动态场景(4D)NVS那样重建。在多个模型大小和数量级的数据上,RayDer展示了与数据和计算量相关的清晰幂律缩放,并优于静态场景数据混合。在大量基准测试中,RayDer实现了与最先进的有监督方法相竞争的强大零样本开放集性能。项目页面:https://compvis.github.io/rayder

英文摘要

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder