arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.16872 2026-05-28 cs.CV

DODO: Discrete OCR Diffusion Models

Sean Man, Gilad Deutch, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

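Affiliations: Technion - Israel Institute of Technology, Haifa, Israel; Amazon Web Services

AI summary: To address slow autoregressive decoding in OCR, the paper proposes DODO, the first VLM to use block discrete diffusion, achieving up to 5x faster inference while maintaining high accuracy.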

Abstract

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 5x faster inference compared to autoregressive baselines.
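The speedup argument in the abstract can be made concrete by counting sequential forward passes. The sketch below is a back-of-the-envelope cost model, not DODO's implementation: `block_size`, `steps_per_block`, and the document length are hypothetical values chosen for illustration, and every forward pass is assumed to cost the same.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one sequential forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Blocks are decoded left to right; each block takes a fixed number of
    denoising steps, and each step predicts all tokens of the block in parallel."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


num_tokens = 2000  # hypothetical length of a long document transcription
ar_passes = autoregressive_passes(num_tokens)
bd_passes = block_diffusion_passes(num_tokens, block_size=32, steps_per_block=8)
print(ar_passes, bd_passes, ar_passes / bd_passes)
```

Under this model the saving depends only on the ratio of block size to denoising steps per block, which is why a near-deterministic task like OCR, where few steps may suffice, is a natural fit.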

2602.16837 2026-05-28 cs.LG

A Structural Theory of Position Bias in Transformers

Hanna Herasimchyk, Robin Labryga, Tomislav Prusina, Sören Laue

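Affiliations: University of Hamburg

AI summary: This paper proposes a structural theory, based on residual-aware cumulative attention rollout, that explains the origin of position bias in causal Transformers and shows how residual connections change attention dynamics at infinite depth, thereby accounting for the Lost-in-the-Middle phenomenon.

Comments: Revised version with improved presentation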

Abstract

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. This bias is closely connected to the Lost-in-the-Middle phenomenon, where models underutilize information placed in the middle of the context. We show that Lost-in-the-Middle-type behavior can arise from the architecture of causal Transformers itself. To do so, we develop a structural theory of position bias based on residual-aware cumulative attention rollout. At finite depth, causal masking and residual connections induce broad, often U-shaped, influence profiles. At infinite depth, our framework resolves a discrepancy between prior attention-only collapse theory and practical Transformer behavior: residual connections fundamentally change cumulative attention dynamics. Empirically, the predicted profiles closely match measured input-token influence in pretrained language models.
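The finite-depth claim can be illustrated with a toy rollout. This is a minimal sketch under strong assumptions that are not taken from the paper: attention is uniform over the causal prefix, every layer is identical, and the residual path is modeled as a fixed mixing weight `residual_weight` (a hypothetical parameter).

```python
import numpy as np


def rollout_profile(num_tokens: int, depth: int, residual_weight: float) -> np.ndarray:
    """Influence of each input position on the last token after `depth` layers."""
    # Uniform causal attention: token i attends equally to positions 0..i.
    attention = np.tril(np.ones((num_tokens, num_tokens)))
    attention /= attention.sum(axis=1, keepdims=True)
    # Residual-aware layer map: mix the identity (skip path) with attention.
    layer = residual_weight * np.eye(num_tokens) + (1.0 - residual_weight) * attention
    # Cumulative rollout: multiply the per-layer maps across depth.
    rollout = np.linalg.matrix_power(layer, depth)
    return rollout[-1]


with_residual = rollout_profile(num_tokens=16, depth=2, residual_weight=0.5)
attention_only = rollout_profile(num_tokens=16, depth=2, residual_weight=0.0)
print(np.round(with_residual, 3))
print(np.round(attention_only, 3))
```

In this toy setting the skip path keeps weight on the final position while the causal mask accumulates weight on early positions, so both ends receive more influence than a mid-context position; with `residual_weight=0.0` only the early-position bias remains.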

2602.16284 2026-05-28 cs.LG

Fast KV Compaction via Attention Matching

Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim

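Affiliations: Massachusetts Institute of Technology

AI summary: Proposes fast compaction of the key-value cache in latent space via Attention Matching, which preserves attention outputs and achieves up to 50x compaction with little quality loss.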

Abstract

Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
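To see why some subproblems can have closed-form solutions, consider a single head with the compact keys held fixed: the attention output is then linear in the compact values, so fitting them is an ordinary least-squares problem. The sketch below is an illustration of that one idea and not the paper's algorithm; the random data, the arbitrary key subset, and the reference queries are all assumptions, and attention-mass matching is not modeled.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention_output(queries, keys, values):
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ keys.T / scale) @ values


rng = np.random.default_rng(0)
num_queries, context_len, compact_len, head_dim = 64, 128, 16, 8
queries = rng.normal(size=(num_queries, head_dim))  # hypothetical reference queries
keys = rng.normal(size=(context_len, head_dim))
values = rng.normal(size=(context_len, head_dim))

target = attention_output(queries, keys, values)  # full-context attention output

# Naive compaction: keep an arbitrary subset of the original key-value pairs.
kept = rng.choice(context_len, size=compact_len, replace=False)
compact_keys = keys[kept]
naive_error = np.linalg.norm(attention_output(queries, compact_keys, values[kept]) - target)

# Closed-form step: with compact keys fixed, the output is linear in the
# compact values, so the best values solve a least-squares problem.
compact_weights = softmax(queries @ compact_keys.T / np.sqrt(head_dim))
fitted_values, *_ = np.linalg.lstsq(compact_weights, target, rcond=None)
fitted_error = np.linalg.norm(compact_weights @ fitted_values - target)

print(naive_error, fitted_error)
```

Because the original values of the kept keys are one feasible candidate, the fitted values can never reproduce the target worse than the naive subset does.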

2602.15894 2026-05-28 cs.CL cs.LG

Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

Haihui Pan, Yuzhong Hong, Kaichen Zhang, Shaoke Lv, Junwei Bao, Hongfei Jiang, Yang Song

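Affiliations: Zuoyebang Education Technology

AI summary: Proposes the QEMPO framework, which uses a theoretically derived closed-form solution to maximize entropy while guaranteeing output quality, thereby improving LLM diversity; experiments show diversity gains without sacrificing quality.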

Abstract

In many large language model (LLM) alignment applications, users expect not only high-quality outputs but also substantial diversity. However, existing methods often face a fundamental trade-off between these objectives: approaches that improve output quality tend to reduce diversity, while methods that increase diversity often do so at the expense of quality. In this work, we propose Quality-constrained Entropy Maximization Policy Optimization (QEMPO), a novel framework that enhances the diversity of LLM outputs while explicitly preserving output quality. QEMPO is grounded in a strong theoretical foundation: we derive a closed-form analytical solution that provably maximizes entropy (a principled measure of diversity) subject to a quality constraint, with guarantees on optimality under the defined objective. Leveraging this solution, QEMPO naturally supports both online and offline training settings. Empirical results demonstrate that QEMPO consistently improves output diversity without sacrificing quality, and in many cases yields gains in both dimensions compared to existing baselines, aligning with our theoretical guarantees.
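The abstract does not state the objective itself, so the following is only the textbook form of entropy maximization under an expected-quality constraint, which may differ from the paper's exact formulation. The symbols $r$ (quality score), $\tau$ (quality threshold), and $\lambda$ (Lagrange multiplier) are assumptions rather than the paper's notation.

```latex
\max_{\pi(\cdot \mid x)} \; H\big(\pi(\cdot \mid x)\big)
\quad \text{s.t.} \quad
\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \ge \tau
\qquad \Longrightarrow \qquad
\pi^{*}(y \mid x) = \frac{\exp\big(\lambda\, r(x, y)\big)}{\sum_{y'} \exp\big(\lambda\, r(x, y')\big)},
\quad \lambda \ge 0 .
```

Here $\lambda = 0$ recovers the uniform distribution when it already meets the threshold; otherwise $\lambda$ is chosen so that the constraint holds with equality.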

2601.16800 2026-05-28 cs.CL

Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Gaurav Negi, MA Waskow, John McCrae, Omnia Zayed, Paul Buitelaar

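Affiliations: Data Science Institute; University of Galway

AI summary: This paper explores LLMs as automatic annotators for fine-grained opinion analysis and proposes a declarative annotation pipeline and an LLM adjudication method; experiments show that LLMs are reliable at the span level but struggle to reproduce relational structures, making them better suited as annotation assistants than as full replacements for human annotators.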

Abstract

Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is valuable, annotating opinions in datasets for model training requires considerable human effort and substantial cost, especially across diverse domains and real-world applications. To address this shortage of domain-specific labelled datasets, we explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis. We use a declarative annotation pipeline, an approach that reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a dedicated methodology for an LLM to adjudicate multiple labels and produce final annotations. We trial the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks. In this work, we attempt to develop fully autonomous LLM-based annotators, but our results reveal an uneven picture characterised by a critical performance bifurcation: LLMs are reliable at the span level yet struggle to faithfully reproduce the relational structures that connect those spans. This suggests that LLMs are better positioned as high-fidelity annotation assistants and data augmentation tools to expand fine-grained opinion-annotated datasets, rather than replacing human annotators entirely.

2602.13748 2026-05-28 cs.CL cs.CV

RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

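Affiliations: School of Computer Science and Technology, Soochow University

AI summary: Proposes the RMPL framework, which combines heterogeneous supervision from unimodal event extraction and multimodal relation extraction through stage-wise training to perform multimedia event extraction under low-resource conditions, with consistent improvements on the M2E2 benchmark.

Comments: Accepted by ACM ICMR 2026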

Abstract

Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

2507.22554 2026-05-28 cs.LG

DeepC4: Deep Conditional Census-Constrained Clustering for Large-scale Multitask Spatial Disaggregation of Urban Morphology

Joshua Dimasaka, Christian Geiß, Emily So

Affiliations: Department of Architecture, University of Cambridge; Cambridge University Centre for Risk in the Built Environment; Earth Observation Center, German Aerospace Center; Institute of Geography, University of Bonn

AI summary: Proposes DeepC4, a multitask deep learning method that uses local census statistics as clustering constraints and jointly learns satellite imagery patterns for coarse-to-fine spatial disaggregation of urban morphology; it outperforms existing methods on Rwandan data.

Comments: Major Revised Preprint Submitted to ISPRS Journal of Photogrammetry and Remote Sensing (in review) | Keywords: urban morphology, building exposure, physical vulnerability, spatial disaggregation, deep clustering | Data: https://doi.org/10.5281/zenodo.13119552 | Code: https://github.com/riskaudit/DeepC4

Abstract

To understand our global progress for sustainable development and disaster risk reduction in many developing economies, two recent major initiatives - the Uniform African Exposure Dataset of the Global Earthquake Model (GEM) Foundation and the Modelling Exposure through Earth Observation Routines (METEOR) Project - implemented classical spatial disaggregation techniques to generate large-scale mapping of urban morphology using the information from various satellite imagery and its derivatives, geospatial datasets of the built environment, and subnational census statistics. However, the local discrepancy with well-validated census statistics and the propagated model uncertainties remain a challenge in such coarse-to-fine-grained mapping problems, specifically constrained by weak and conditional label supervision. Therefore, we present Deep Conditional Census-Constrained Clustering (DeepC4), a novel deep learning-based spatial disaggregation approach that incorporates local census statistics as cluster-level constraints while considering multiple conditional label relationships in a joint multitask learning of the patterns of satellite imagery. As a demonstration using Rwandan urban morphology, DeepC4 achieves macro-F1 scores of 0.63, 0.78, and 0.45 and macro-mIoU of 0.57, 0.71, and 0.42 for roof, wall, and height prediction respectively, estimates national dwelling and occupant counts within 1.13% and 1.11% error compared to census records, outperforming GEM (2.03% and 3.29%), and occupies 32%-49% more 500-meter grid pixels than METEOR across provinces. As the world approaches the conclusion of many global frameworks in 2030, our work offers a new deep learning-based mapping technique that explicitly encodes well-validated census and experts' belief systems to achieve an explainable and interpretable auditing of existing coarse-grained derived information at large scales.

2602.13524 2026-05-28 cs.LG cs.AI

Singular Vectors of Attention Heads Align with Features

Gabriel Franco, Carson Loughridge, Mark Crovella

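Affiliations: Department of Computer Science, Boston University, Boston, USA; Faculty of Computing & Data Sciences, Boston University, Boston, USA

AI summary: Through theoretical analysis and experiments, this paper explains why and under what conditions the singular vectors of attention heads align with feature representations, and proposes sparse attention decomposition as a testable prediction of alignment.

Comments: To be published in ICML 2026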

Abstract

Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made the observation that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this phenomenon is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges in real models in a manner consistent with predictions. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.
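A toy case where features can be directly observed shows what alignment means in practice. The sketch below assumes an idealized setting that is not taken from the paper: features are orthonormal directions, each query-side feature interacts with exactly one key-side feature, and the interaction strengths are distinct. The matrix `w_qk` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
model_dim, num_features = 32, 4

# Hypothetical ground-truth features: orthonormal directions on the query and key sides.
query_features, _ = np.linalg.qr(rng.normal(size=(model_dim, num_features)))
key_features, _ = np.linalg.qr(rng.normal(size=(model_dim, num_features)))
strengths = np.array([4.0, 3.0, 2.0, 1.0])  # distinct interaction strengths

# Toy bilinear attention matrix: each query feature interacts with one key feature.
w_qk = query_features @ np.diag(strengths) @ key_features.T

left, singular_values, right_t = np.linalg.svd(w_qk)

# Absolute cosine between the i-th left singular vector and the i-th query feature.
alignment = np.abs(np.sum(left[:, :num_features] * query_features, axis=0))
print(np.round(alignment, 6))
```

Alignment is exact here because the construction is already a singular value decomposition with distinct singular values; the paper's question is when something similar holds in trained models, where features are neither orthogonal nor directly observable.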

2602.13075 2026-05-28 cs.LG

Unified Multi-Domain Graph Pre-training for Homogeneous and Heterogeneous Graphs via Domain-Specific Expert Encoding

Chundong Liang, Yongqi Huang, Dongxiao He, Peiyuan Li, Yawen Li, Di Jin, Weixiong Zhang

Affiliations: School of Computer Science Technology, Tianjin University, Tianjin, China; North China University of Science; School of Economics Management, Beijing University of Posts; Departments of Health Technology & Informatics, Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong

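AI summary: Proposes GPH², a unified multi-domain graph pre-training method that uses domain-specific expert encoding and a task-oriented expert fusion strategy to address cross-domain distribution shift in mixed homogeneous and heterogeneous graph settings.

Comments: 12 pages, 7 figures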

Abstract

Graph pre-training has achieved remarkable success in recent years, delivering transferable representations for downstream adaptation. However, most existing methods are designed for either homogeneous or heterogeneous graphs, thereby hindering unified graph modeling across diverse graph types. This separation contradicts real-world applications, where mixed homogeneous and heterogeneous graphs are ubiquitous, and distribution shifts between upstream pre-training and downstream deployment are common. In this paper, we empirically demonstrate that a balanced mixture of homogeneous and heterogeneous graph pre-training benefits downstream tasks and propose a unified multi-domain Graph Pre-training method across Homogeneous and Heterogeneous graphs (GPH²). To address the lack of a unified encoder for homogeneous and heterogeneous graphs, we propose a Unified Multi-View Graph Construction that simultaneously encodes both without explicit graph-type-specific designs. To cope with the increased cross-domain distribution discrepancies arising from mixed graphs, we introduce domain-specific expert encoding. Each expert is independently pre-trained on a single graph to capture domain-specific knowledge, thereby shielding the pre-training encoder from the adverse effects of cross-domain discrepancies. For downstream tasks, we further design a Task-oriented Expert Fusion Strategy that adaptively integrates multiple experts based on their discriminative strengths. Extensive experiments on mixed graphs demonstrate that GPH² enables stable transfer across graph types and domains, significantly outperforming existing graph pre-training methods.

2602.12843 2026-05-28 cs.CV

MMRad-22K: A Structured Multimodal Evidence Dataset for Chest X-ray Report Generation

Yichen Zhao, Zelin Peng, Fenghe Tang, Piao Yang, Yu Huang, Wei Shen

Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University; School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC); Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC; Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine

AI summary: To address the fragmented supervision signals in existing resources for chest X-ray report generation, the paper proposes MMRad-22K, a structured multimodal evidence dataset, and adapts a unified LVLM backbone with it, showing that structured multimodal evidence outperforms text-only or bounding-box evidence on language and clinical metrics.

Abstract

Chest X-ray (CXR) reporting follows a region-based clinical workflow in which radiologists inspect anatomical regions and integrate localized findings into a final report. However, existing resources for CXR report generation provide these supervision signals in fragmented forms. We introduce MMRad-22K, a dataset that organizes regional textual observations, anatomical grounding coordinates, localized image evidence, and report targets into structured multimodal evidence units for CXR report generation. To motivate this formulation, we first compare different evidence formats for report generation and find that structured multimodal evidence is generally more useful than text-only or bounding box-based evidence. We then adapt a unified LVLM backbone using MMRad-22K and show that adaptation with multimodal evidence outperforms both textual-evidence adaptation and end-to-end adaptation on language and clinically oriented metrics. Under the same evaluation protocol, the adapted model also reaches a performance level comparable to several open-source LVLM references. Together, these results support MMRad-22K as a practical structured multimodal resource for training and evaluating CXR report generation aligned with clinical reading workflows.
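One way to picture a structured multimodal evidence unit is as a small record that ties a regional observation to its grounding coordinates and image crop. The sketch below is a guess at a plausible layout, not the dataset's actual schema: every class name, field name, file path, and example value is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EvidenceUnit:
    """One region-level evidence unit (field names are illustrative)."""
    region: str                              # anatomical region name
    observation: str                         # regional textual observation
    bbox: Tuple[float, float, float, float]  # grounding coordinates (x_min, y_min, x_max, y_max)
    crop_path: str                           # localized image evidence, stored as a file path


@dataclass
class ReportSample:
    image_path: str
    units: List[EvidenceUnit] = field(default_factory=list)
    report: str = ""                         # report target

    def evidence_text(self) -> str:
        """Serialize the regional observations, one line per region."""
        return "\n".join(f"[{u.region}] {u.observation}" for u in self.units)


# Made-up example values, for illustration only.
sample = ReportSample(
    image_path="cxr_0001.png",
    units=[
        EvidenceUnit("left lower lung", "patchy opacity", (0.55, 0.60, 0.90, 0.92), "cxr_0001_r0.png"),
        EvidenceUnit("cardiac silhouette", "normal size", (0.35, 0.45, 0.70, 0.80), "cxr_0001_r1.png"),
    ],
    report="Patchy opacity in the left lower lung. Heart size is normal.",
)
print(sample.evidence_text())
```

Keeping text, coordinates, and crops in one unit is what distinguishes this layout from the text-only and bounding-box-only evidence formats that the abstract compares against.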

2602.12586 2026-05-28 cs.AI

Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models

Joshua Ong Jun Leang, Yu Zhao, Mihaela Cătălina Stoian, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

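Affiliations: Imperial College London; University of Edinburgh

AI summary: To address the sensitivity of plan-and-infill decoding in masked diffusion models (MDMs) to slot infilling order, the paper proposes the McDiffuSE framework, which uses Monte Carlo Tree Search (MCTS) to optimize generation order, improving performance by 3.2% on average over autoregressive baselines, with gains of 19.5% on MBPP and 4.9% on MATH500.

Comments: 8 pages, ICML 2026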

Abstract

While plan-and-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains highly sensitive to slot infilling order, often yielding substantial output variance. We introduce McDiffuSE, a framework that formulates slot selection as decision making and optimises infilling orders through Monte Carlo Tree Search (MCTS). McDiffuSE uses look-ahead simulations to evaluate partial completions before commitment, systematically exploring the combinatorial space of generation orders. Experiments show an average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP and 4.9% on MATH500. Our analysis reveals that while McDiffuSE predominantly follows sequential ordering, incorporating non-sequential generation is essential for maximising performance. We observe that larger exploration constants, rather than increased simulations, are necessary to overcome model confidence biases and discover effective orderings. These findings establish MCTS-based planning as an effective approach for enhancing generation quality in MDMs.
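The role of the exploration constant can be seen in the standard UCT selection rule used by many MCTS implementations. The abstract does not say which selection rule McDiffuSE uses, so UCT is an assumption here, and the slot statistics below are hypothetical numbers chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SlotStats:
    slot: int            # index of the masked slot to fill next
    visits: int          # number of simulations that chose this slot
    total_reward: float  # summed look-ahead reward of those simulations


def uct_score(stats: SlotStats, parent_visits: int, exploration: float) -> float:
    """UCT: average reward plus an exploration bonus for rarely tried slots."""
    if stats.visits == 0:
        return math.inf
    mean_reward = stats.total_reward / stats.visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / stats.visits)
    return mean_reward + bonus


def select_slot(children, exploration: float) -> int:
    parent_visits = sum(c.visits for c in children)
    return max(children, key=lambda c: uct_score(c, parent_visits, exploration)).slot


# Hypothetical statistics: slot 0 (the sequential choice) looks better so far,
# slot 3 (a non-sequential choice) has barely been explored.
children = [
    SlotStats(slot=0, visits=20, total_reward=14.0),
    SlotStats(slot=3, visits=2, total_reward=1.0),
]
print(select_slot(children, exploration=0.1), select_slot(children, exploration=1.0))
```

With a small constant the search keeps following the slot the model is already confident about; a larger constant makes the under-explored, non-sequential slot win the selection, which matches the abstract's observation about overcoming confidence bias.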

2602.12468 2026-05-28 cs.LG cs.FL

Continuous Diffusion Models Can Obey Formal Syntax

Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D'Antoni

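Affiliations: Department of Computer Science and Engineering, University of California-San Diego, San Diego, USA

AI summary: Proposes a training-free guidance method that uses regular expressions to constrain the generation of continuous diffusion language models so that outputs satisfy formal syntax, achieving 68-96% constraint satisfaction on JSON and natural-language benchmarks.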

Abstract

Diffusion language models offer a promising alternative to autoregressive models due to their global, non-causal generation process, but their continuous latent dynamics make discrete constraints (e.g., the output should be a JSON file that matches a given schema) difficult to impose. We introduce a training-free guidance method for steering continuous diffusion language models to satisfy formal syntactic constraints expressed using regular expressions. Our approach constructs an analytic score estimating the probability that a latent state decodes to a valid string accepted by a given regular expression, and uses its gradient to guide sampling, without training auxiliary classifiers. The denoising process targets the base model conditioned on syntactic validity. We implement our method in Diffinity on top of the PLAID diffusion model and evaluate it on 180 regular-expression constraints over JSON and natural-language benchmarks. Diffinity achieves 68-96% constraint satisfaction while incurring only a small perplexity cost relative to unconstrained sampling, outperforming autoregressive constrained decoding in both constraint satisfaction and output quality. Diffinity is open-sourced at github.com/large-loris-models/Diffinity.
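The idea of scoring how likely a decoded string is to match a regular expression can be sketched with a forward pass over a deterministic finite automaton (DFA). This is a simplified stand-in, not Diffinity's score: it assumes the decoder yields an independent categorical distribution per position, it works on a hand-written DFA, and it omits the gradient-based guidance entirely. The function name and the example automaton are hypothetical.

```python
import numpy as np


def acceptance_probability(token_probs, transitions, start, accepting):
    """Probability that a string sampled position by position is accepted by a DFA.

    token_probs: array (length, vocab), one categorical distribution per position,
                 with positions assumed independent.
    transitions: array (num_states, vocab) of next-state indices.
    """
    num_states = transitions.shape[0]
    state_probs = np.zeros(num_states)
    state_probs[start] = 1.0
    for position_probs in token_probs:
        next_probs = np.zeros(num_states)
        for state in range(num_states):
            for token, prob in enumerate(position_probs):
                next_probs[transitions[state, token]] += state_probs[state] * prob
        state_probs = next_probs
    return float(state_probs[list(accepting)].sum())


# DFA for the regular expression (ab)* over the alphabet {a, b}:
# state 0 = expecting 'a' (accepting), state 1 = expecting 'b', state 2 = dead.
transitions = np.array([[1, 2],   # from state 0: a -> 1, b -> 2
                        [2, 0],   # from state 1: a -> 2, b -> 0
                        [2, 2]])  # the dead state absorbs everything
token_probs = np.array([[0.9, 0.1],   # position 0: P(a), P(b)
                        [0.2, 0.8]])  # position 1: P(a), P(b)
probability = acceptance_probability(token_probs, transitions, start=0, accepting={0})
print(probability)
```

For this two-position example the only accepted string is "ab", so the result is 0.9 × 0.8 = 0.72 up to floating-point rounding. The computation is a sum of products of the token probabilities, so it is differentiable in them, which is the property a gradient-guided sampler would rely on.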

2602.11564 2026-05-28 cs.CV

LUVE: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts

Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, Ying Tai

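Affiliations: Nanjing University; Nanyang Technological University

AI summary: Proposes the LUVE framework, a three-stage latent-cascaded architecture (low-resolution motion generation, latent-space upsampling, and high-resolution content refinement) combined with dual frequency experts, to address the difficulties of motion modeling, semantic planning, and detail synthesis in ultra-high-resolution video generation.

Comments: ICML 2026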

Abstract

Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at https://unicornanrocinu.github.io/LUVE_web/.

2602.10820 2026-05-28 cs.LG

Adaptive Sampling and Clipping for Private Worst-Case Group Optimization

Max Cairney-Leeming, Amartya Sanyal, Christoph H. Lampert

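Affiliations: Institute of Science and Technology Austria (ISTA); University of Copenhagen

AI summary: Proposes the ASC algorithm, which adaptively controls the sampling rate and clipping threshold of each group's gradient contributions to optimize worst-case group accuracy under differential privacy while preserving model utility.

Comments: 10 pages, 3 figures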

Abstract

A central requirement for the acceptance of machine learning methods for human-centric tasks is that they should be fair, in the sense that they should work comparably well for individuals from different societal groups. A second, equally important, requirement is that they should respect the privacy of user data. While techniques exist to address each aspect in isolation, such as worst-case group optimization for the former and differentially private SGD for the latter, these are often at odds with each other, and no practical method currently exists to enforce both requirements simultaneously. In this work, we overcome this problem and propose an algorithm for optimizing the worst-case group accuracy in a differentially private way. Our main contribution is ASC (Adaptively Sampled and Clipped Worst-case Group Optimization), which adaptively controls both the sampling rate and the clipping threshold of each group's gradient contributions. Thereby, it is able to reweight the training objective in favor of harder-to-learn groups, while keeping the noise required to enforce privacy low enough to preserve model utility. Our experiments show that ASC achieves substantially higher worst-case group accuracy than prior work, without sacrificing overall average accuracy.
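The clipping half of the mechanism can be sketched as per-example gradient clipping with a threshold that depends on the example's group. This is only an illustration of group-wise clipping, not the ASC algorithm: the thresholds are fixed hypothetical values rather than adaptively updated ones, and the sampling-rate control, noise addition, and privacy accounting are all omitted.

```python
import numpy as np


def clip_per_group(gradients, groups, thresholds):
    """Scale each per-example gradient so its L2 norm is at most its group's threshold."""
    norms = np.linalg.norm(gradients, axis=1)
    limits = np.array([thresholds[g] for g in groups])
    scale = np.minimum(1.0, limits / np.maximum(norms, 1e-12))
    return gradients * scale[:, None]


# Hypothetical batch: four per-example gradients from two groups.
gradients = np.array([[3.0, 4.0], [0.3, 0.4], [6.0, 8.0], [0.0, 1.0]])
groups = [0, 0, 1, 1]
thresholds = {0: 1.0, 1: 2.0}  # a larger threshold lets group 1 contribute more

clipped = clip_per_group(gradients, groups, thresholds)
clipped_norms = np.linalg.norm(clipped, axis=1)
print(clipped_norms)
```

Raising one group's threshold increases its share of the summed gradient, which is one way a training objective can be reweighted toward harder groups; in a differentially private setting the added noise would have to be calibrated to these thresholds.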

2602.06054 2026-05-28 cs.CL

Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers

Abeer Mostafa, Thi Huyen Nguyen, Zahra Ahmadi

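Affiliations: Peter L. Reichertz Institute for Medical Informatics; L3S Research Center; Lower Saxony Center for Artificial Intelligence and Causal Methods in Medicine (CAIMed)

AI summary: Based on more than 100,000 peer-review reports, the paper uses qualitative and quantitative methods to analyze the dimensions along which the originality of AI research papers is perceived, and evaluates the reliability of large language models in assessing originality.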

Abstract

Assessing originality in AI research is arguably the most consequential yet least reliable step in peer review. Reviewer judgments of originality remain opaque, inconsistent, and dependent on comparisons to prior work that are often incomplete. In this paper, we present a large-scale, data-driven qualitative and quantitative analysis of research originality based on over 100,000 peer-review reports from leading AI venues, spanning a period of rapid growth in the field. Leveraging structured, semantically retrieved prior work and signals embedded in expert reviewer assessments, we systematically characterize how originality is perceived in practice and identify the key dimensions that most strongly influence novelty judgments. Our analysis yields a fine-grained, evidence-based framework that equips both authors and reviewers with actionable insights into how originality is evaluated. In addition, we evaluate the reliability of current large language model (LLM) agents in assessing originality. We find that these models tend to systematically overestimate novelty and struggle to detect conceptual plagiarism, particularly in the presence of paraphrasing. We release our dataset, trained models, and code at: https://anonymous.4open.science/r/Novelty-Reviewer-365C/.

2602.01992 2026-05-28 cs.AI

Emergent Analogical Reasoning in Transformers

Transformer中的涌现类比推理

Gouki Minegishi, Jingyuan Feng, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学) Google DeepMind(谷歌DeepMind)

AI总结 本研究通过范畴论中的函子概念形式化类比推理,设计合成任务探究Transformer中类比推理的涌现机制,发现其依赖于数据特征、优化选择和模型规模,并通过机制分析揭示几何对齐和函子应用两个关键组件。

Comments Accepted to ICML2026 (spotlight)

详情
AI中文摘要

类比是人类智能的核心能力,使得在一个领域中发现的抽象模式能够应用于另一个领域。尽管类比在认知中占据核心地位,但Transformer获取并实现类比推理的机制仍知之甚少。受范畴论中函子概念的启发,我们将类比推理形式化为跨类别实体间对应关系的推断。基于这一表述,我们引入了在受控设置下评估类比推理涌现的合成任务。我们发现,类比推理的涌现对数据特征、优化选择和模型规模高度敏感。通过机制分析,我们展示了Transformer中的类比推理分解为两个关键组件:(1) 嵌入空间中关系结构的几何对齐,以及(2) Transformer内部函子的应用。这些机制使模型能够将关系结构从一个类别转移到另一个类别,从而实现类比。最后,我们量化了这些效应,并在预训练的大型语言模型中观察到了相同的趋势。通过这样做,我们将类比从一个抽象的认知概念转变为现代神经网络中一个具体的、基于机制的现象。

英文摘要

Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite its central role in cognition, the mechanisms by which Transformers acquire and implement analogical reasoning remain poorly understood. In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories. Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings. We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale. Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components: (1) geometric alignment of relational structure in the embedding space, and (2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy. Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs. In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.
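The functor-style view of analogy in this abstract can be made concrete with a toy check. The sketch below is only an illustration, not the paper's synthetic task: the two small relation sets and the entity maps `F` and `F_bad` are hypothetical. It treats a map as a valid analogy when every relation between source entities reappears between their images in the target.

```python
# Toy structure-preservation check; all entities and relations are hypothetical.

def preserves_structure(source_relations, target_relations, mapping):
    """Return True if every (head, relation, tail) triple in the source
    has its image (mapping[head], relation, mapping[tail]) in the target."""
    target = set(target_relations)
    return all(
        (mapping[h], r, mapping[t]) in target
        for (h, r, t) in source_relations
    )

source_relations = [("sun", "attracts", "planet"), ("planet", "orbits", "sun")]
target_relations = [("nucleus", "attracts", "electron"),
                    ("electron", "orbits", "nucleus")]

F = {"sun": "nucleus", "planet": "electron"}      # keeps the roles aligned
F_bad = {"sun": "electron", "planet": "nucleus"}  # swaps the roles

print(preserves_structure(source_relations, target_relations, F))      # True
print(preserves_structure(source_relations, target_relations, F_bad))  # False
```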

2511.18894 2026-05-28 cs.CV cs.AI

Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels

并非所有像素都平等:面向含噪标签医学分割的像素级元学习

Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng

发表机构 * Xidian University(西安电子科技大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出MetaDCSeg框架,通过动态学习像素级权重并引入动态中心距离机制建模边界不确定性,抑制噪声标签影响并提升边界分割性能。

详情
AI中文摘要

医学图像分割对于临床应用至关重要,但常常受到噪声标注和模糊解剖边界的干扰,限制了其在现实场景中的应用。现有方法通常直接适应为实例分类设计的噪声标签学习技术,忽视了医学分割中像素级异质性及其空间和解剖上的难度差异。因此,全局假设或简单的置信度指标无法解决这些局部变化,导致边界模糊问题未得到解决。为解决这一问题,我们提出MetaDCSeg,一个鲁棒的框架,动态学习最优像素级权重以抑制噪声标签的影响,同时保留可靠标注。通过动态中心距离(DCD)机制显式建模边界不确定性,我们的方法利用前景、背景和边界中心的加权特征距离,引导模型关注模糊边界附近的难分割像素。该策略能够更精确地处理结构边界(这些边界常被现有方法忽略),并显著提升分割性能。在四个不同噪声水平的基准数据集上的大量实验表明,MetaDCSeg优于现有最先进方法。

英文摘要

Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, limiting its application in real-world scenarios. Existing methods often directly adapt noisy label learning techniques designed for instance classification, overlooking the pixel-wise heterogeneity in medical segmentation with its spatially and anatomically varying difficulties. Consequently, global assumptions or simple confidence metrics fail to address these local variations, leaving boundary ambiguities unresolved. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg outperforms existing state-of-the-art methods.
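To illustrate what pixel-wise weighting does to a segmentation loss, here is a minimal sketch of a weighted binary cross-entropy. It assumes the per-pixel weights are already given; the meta-learning procedure and the Dynamic Center Distance mechanism from the abstract are not reproduced, and the function name and example values are hypothetical.

```python
import math

def weighted_pixel_bce(probs, labels, weights, eps=1e-7):
    """Weighted binary cross-entropy averaged over pixels.
    probs, labels, and weights are flat lists with one entry per pixel."""
    total, norm = 0.0, 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        ce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * ce
        norm += w
    return total / norm if norm > 0 else 0.0

# Two pixels predicted as foreground; the second label is assumed to be noisy.
probs = [0.9, 0.9]
labels = [1, 0]
uniform_loss = weighted_pixel_bce(probs, labels, [1.0, 1.0])
downweighted_loss = weighted_pixel_bce(probs, labels, [1.0, 0.1])
print(uniform_loss, downweighted_loss)
```

Lowering the weight of the suspicious pixel shrinks its contribution to the loss, which is the effect the learned weights are meant to achieve.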

2403.11852 2026-05-28 cs.RO cs.AI

Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency

考虑随机通信延迟的高速公路匝道合流延迟感知强化学习

Amin Tabrizian, Zhitong Huang, Arsyi Aziz, Peng Wei

发表机构 * Department of Computer Science, George Washington University, Washington, D.C.(计算机科学系,乔治华盛顿大学,华盛顿特区) Connected and Automated Vehicle Program Manager, Traffic Operations Division, Virginia Department of Transportation(网联与自动驾驶车辆项目经理,交通运营处,弗吉尼亚州交通部) Department of Mechanical & Aerospace Engineering, George Washington University, Washington, D.C.(机械与航空航天工程系,乔治华盛顿大学,华盛顿特区)

AI总结 针对V2I通信随机延迟导致状态观测延迟的问题,提出DAROM框架,通过随机延迟MDP建模和延迟感知编码器恢复马尔可夫性,结合物理安全控制器实现鲁棒控制。

详情
AI中文摘要

延迟和部分可观测的状态信息给现实自动驾驶中基于强化学习(RL)的控制带来了重大挑战。在高速公路匝道合流中,路侧单元(RSU)可以感知附近交通,进行边缘感知,并通过车到基础设施(V2I)链路将状态估计传输给自车。随着智能交通基础设施和边缘计算的最新进展,这种RSU辅助感知越来越现实,并已部署在现代互联道路系统中。然而,边缘处理时间和无线传输可能引入随机的V2I通信延迟,违反马尔可夫假设并显著降低控制性能。在这项工作中,我们提出了DAROM,一种对随机延迟鲁棒的高速公路匝道合流延迟感知强化学习框架。我们将问题建模为随机延迟马尔可夫决策过程(RDMDP),并开发了一个统一的RL智能体用于联合纵向和横向控制。为了在延迟观测下恢复马尔可夫表示,我们引入了一个延迟感知编码器,该编码器以延迟观测、掩蔽动作历史和观测延迟幅度为条件来推断当前潜在状态。我们进一步集成基于物理的安全控制器以减少合流过程中的碰撞风险。在模拟城市交通(SUMO)模拟器中,使用下一代仿真(NGSIM)数据集的真实交通数据进行的实验表明,DAROM在各种交通密度下始终优于标准RL基线。特别是,基于门控循环单元(GRU)的编码器在高达2.0秒的随机V2I延迟的高密度交通中实现了超过99%的成功率。

英文摘要

Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world autonomous driving. In highway on-ramp merging, a roadside unit (RSU) can sense nearby traffic, perform edge perception, and transmit state estimates to the ego vehicle over vehicle-to-infrastructure (V2I) links. With recent advancements in intelligent transportation infrastructure and edge computing, such RSU-assisted perception is increasingly realistic and already deployed in modern connected roadway systems. However, edge processing time and wireless transmission can introduce stochastic V2I communication delays, violating the Markov assumption and substantially degrading control performance. In this work, we propose DAROM, a Delay-Aware Reinforcement Learning framework for On-ramp Merging that is robust to stochastic delays. We model the problem as a random delay Markov decision process (RDMDP) and develop a unified RL agent for joint longitudinal and lateral control. To recover a Markovian representation under delayed observations, we introduce a Delay-Aware Encoder that conditions on delayed observations, masked action histories, and observed delay magnitude to infer the current latent state. We further integrate a physics-based safety controller to reduce collision risk during merging. Experiments in the Simulation of Urban MObility (SUMO) simulator using real-world traffic data from the Next Generation Simulation (NGSIM) dataset demonstrate that DAROM consistently outperforms standard RL baselines across traffic densities. In particular, the gated recurrent unit (GRU)-based encoder achieves over 99% success in high-density traffic with random V2I delays of up to 2.0 seconds.
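The three inputs named for the Delay-Aware Encoder (delayed observation, masked action history, observed delay magnitude) can be assembled as in the sketch below. The masking convention, the normalization of the delay, and the function name are assumptions made for illustration; the abstract does not specify them.

```python
def build_encoder_input(delayed_obs, action_history, delay_steps, max_delay_steps):
    """Concatenate delayed observation, masked action history, mask, and delay.

    action_history holds the most recent actions, oldest first, with length
    max_delay_steps. Only the last `delay_steps` actions were issued after the
    observation was sensed, so earlier slots are zeroed (assumed convention).
    """
    if not 0 <= delay_steps <= max_delay_steps:
        raise ValueError("delay_steps out of range")
    if len(action_history) != max_delay_steps:
        raise ValueError("action_history must have length max_delay_steps")
    keep_from = max_delay_steps - delay_steps
    mask = [1.0 if i >= keep_from else 0.0 for i in range(max_delay_steps)]
    masked_actions = [a * m for a, m in zip(action_history, mask)]
    delay_feature = delay_steps / max_delay_steps  # normalized delay magnitude
    return list(delayed_obs) + masked_actions + mask + [delay_feature]

# Hypothetical example: a 2-D observation that is two control steps old.
encoder_input = build_encoder_input(
    delayed_obs=[12.0, -1.5],
    action_history=[0.2, 0.1, -0.3, 0.4],
    delay_steps=2,
    max_delay_steps=4,
)
print(encoder_input)
```

In a full system this vector would be fed to a learned encoder such as the GRU mentioned in the abstract; that part is omitted here.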

2602.07574 2026-05-28 cs.CV cs.CL

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

ViCA:仅视觉交叉注意力的高效多模态大语言模型

Wenjie Liu, Hao Wu, Xin Qiu, Xudong Wang, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology(宁波数字孪生研究院,宁波东方理工大学) Munich Center for Machine Learning, LMU Munich(慕尼黑机器学习中心,慕尼黑大学)

AI总结 提出ViCA架构,让视觉令牌仅通过稀疏交叉注意力与文本交互,在保持基线98%准确率的同时将视觉侧计算降至4%,实现显著加速。

详情
AI中文摘要

现代多模态大语言模型(MLLMs)采用统一的自注意力设计,在每个Transformer层处理视觉和文本令牌,导致大量计算开销。在这项工作中,我们重新审视了这种密集视觉处理的必要性,并表明投影的视觉嵌入已经与语言空间良好对齐,而有效的视觉-语言交互仅发生在少数层中。基于这些见解,我们提出了ViCA(仅视觉交叉注意力),一种极简的MLLM架构,其中视觉令牌绕过所有自注意力层和前馈层,仅通过稀疏的交叉注意力在选定层与文本交互。在三个MLLM骨干、九个多模态基准和26个基于剪枝的基线上的广泛评估表明,ViCA在将视觉侧计算减少到4%的同时保持了98%的基线准确率,始终实现了优越的性能-效率权衡。此外,ViCA提供了一个规则的、硬件友好的推理流水线,在单批推理中实现了超过3.5倍的加速,在多批推理中实现了超过10倍的加速,使视觉定位相对于仅文本的LLM的额外开销接近于零。它还与令牌剪枝方法正交,可以无缝结合以进一步提高效率。我们的代码可在 https://github.com/EIT-NLP/ViCA 获取。

英文摘要

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
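The interaction pattern described here, where text attends to visual tokens that are themselves never updated, can be sketched with a single-head cross-attention in plain Python. Learned projections, multiple heads, and layer selection are left out for brevity, so this is a simplified illustration under those assumptions rather than the ViCA implementation; all names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attend(text_tokens, visual_tokens):
    """Each text token is updated with an attention-weighted sum of visual
    tokens (residual add). Visual tokens are read-only and returned unchanged.
    Text and visual tokens are assumed to share the same dimension."""
    scale = 1.0 / math.sqrt(len(visual_tokens[0]))
    updated = []
    for q in text_tokens:
        weights = softmax([dot(q, k) * scale for k in visual_tokens])
        context = [sum(w * v[d] for w, v in zip(weights, visual_tokens))
                   for d in range(len(q))]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated, visual_tokens

visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.0, 0.0]]
new_text, same_visual = cross_attend(text, visual)
print(new_text)  # [[0.5, 0.5]]
```

Because the visual tokens skip self-attention and feed-forward computation in this pattern, their cost is limited to serving as keys and values at the layers where cross-attention is applied.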

2602.06880 2026-05-28 cs.LG

Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization

解耦自适应梯度下降中的方差与尺度不变更新以实现统一向量和矩阵优化

Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich

发表机构 * Department of Computer Science, Purdue University, West Lafayette, USA(计算机科学系,普渡大学,西拉法叶,美国) Edwardson School of Industrial Engineering, Purdue University, West Lafayette, USA(Edwardson工业工程学院,普渡大学,西拉法叶,美国)

AI总结 提出DeVA框架,通过解耦AdaGrad更新中的方差适应项和尺度不变项,统一向量自适应方法与矩阵谱优化,在语言建模和图像分类中优于Muon和SOAP,减少约6.6%的token使用。

详情
AI中文摘要

像Adam这样的自适应方法已成为大规模向量和欧几里得优化的$\textit{事实}$标准,因为它们具有带二阶性质的逐坐标自适应能力。最近,基于矩阵的谱优化器如Muon(Jordan等人,2024b)展示了将权重矩阵视为矩阵而非长向量的威力。将这些方法联系起来是困难的,因为许多自然的推广难以实际实现,而且我们也不能简单地将Adam的自适应机制搬到矩阵谱上。为了解决这个问题,我们重新表述了AdaGrad更新,并将其分解为方差适应项和尺度不变项。这种解耦产生了$\textbf{DeVA}$($\textbf{De}$coupled $\textbf{V}$ariance $\textbf{A}$daptation),一个连接基于向量的方差适应和矩阵谱优化的框架,实现了从Adam到自适应谱下降的无缝过渡。在语言建模和图像分类上的大量实验表明,DeVA持续优于Muon和SOAP(Vyas等人,2024)等最先进方法,减少了约6.6%的token使用。理论上,我们证明方差适应项有效改善了块状平滑性,促进了更快的收敛。我们的实现可在 https://github.com/Tsedao/Decoupled-Variance-Adaptation 获取。

英文摘要

Adaptive methods like Adam have become the $\textit{de facto}$ standard for large-scale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than long vectors. Linking these is hard because many natural generalizations are not feasible to implement, and we also cannot simply move the Adam adaptation to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling produces $\textbf{DeVA}$ ($\textbf{De}$coupled $\textbf{V}$ariance $\textbf{A}$daptation), a framework that bridges vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at https://github.com/Tsedao/Decoupled-Variance-Adaptation.
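One elementary way to see how a diagonal AdaGrad direction can split into two factors is the sign and magnitude factorization sketched below. This is a textbook identity chosen for illustration and is not claimed to be the paper's decomposition: for each coordinate, `g / sqrt(sum of g^2)` equals `sign(g)` times a ratio in [0, 1]. The function name is hypothetical.

```python
import math

def adagrad_step_factors(grad_history):
    """For each coordinate, factor the latest AdaGrad direction
    g / sqrt(sum of squared past gradients) into (sign(g), |g| / sqrt(sum))."""
    latest = grad_history[-1]
    factors = []
    for i, g in enumerate(latest):
        accum = sum(step[i] ** 2 for step in grad_history)
        if accum == 0.0:
            factors.append((0.0, 0.0))
            continue
        sign = float((g > 0) - (g < 0))
        ratio = abs(g) / math.sqrt(accum)
        factors.append((sign, ratio))
    return factors

history = [[3.0, 1.0], [4.0, -1.0]]
factors = adagrad_step_factors(history)
# Rescaling every gradient by a positive constant leaves both factors unchanged.
scaled = adagrad_step_factors([[10.0 * g for g in step] for step in history])
print(factors, scaled)
```

The sign factor depends only on the direction of the gradient, while the ratio shrinks when past gradients in that coordinate were large relative to the current one.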

2412.01004 2026-05-28 cs.CV

Take Only What You Need: Rank Minimization as an Implicit Forgetting Regularizer in Continual Learning

只取所需:秩最小化作为持续学习中的隐式遗忘正则化器

Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong

发表机构 * University of New South Wales(新南威尔士大学) CSIRO(澳大利亚联邦科学工业研究组织)

AI总结 本文提出CoDyRA方法,通过秩最小化作为隐式遗忘正则化器,在持续学习中平衡可塑性与稳定性,在多个基准上优于现有方法。

Comments Preprint

详情
AI中文摘要

持续学习中的核心张力是可塑性(获取新知识)与稳定性(保留先前知识)之间的权衡。我们研究如何通过容量控制(即调节每次参数更新的有效秩,这是LoRA更新中可直接控制的逐步骤量)来持续更新预训练骨干网络,使其吸收新知识的同时保留现有能力。对模块和任务间LoRA秩和放置的受控探测揭示了一致的权衡,存在一个随放置和任务变化的中等秩最佳点,没有普遍最优的固定秩;一个形式化界限表明遗忘随秩增长。基于这些发现,我们提出了持续动态秩选择LoRA(CoDyRA),该方法通过在每个组件重要性权重上施加稀疏性促进正则化,联合训练每个LoRA更新与秩最小化。监督目标驱动可塑性;秩最小化正则化遗忘。我们证明秩最小化在持续学习机制中充当隐式遗忘正则化器,通过控制相对于当前模型状态的遗忘,同时保护通用能力和先前任务知识。在MTIL、X-TAIL和TRACE(CLIP、LLaMA、Gemma)上,CoDyRA在新知识学习和遗忘方面优于先前的持续学习方法,实现了强大的可塑性-稳定性平衡。代码可在https://github.com/jeff024/codyra获取。

英文摘要

The central tension in continual learning (CL) is the trade-off between plasticity (acquiring new knowledge) and stability (retaining prior knowledge). We study how a pre-trained backbone can be continually updated to absorb new knowledge while preserving existing capabilities, via capacity control: regulating the effective rank of each parameter update, a per-step quantity directly controllable inside a LoRA update. A controlled probe of LoRA rank and placement across modules and tasks reveals a consistent trade-off, with a moderate-rank sweet spot that varies by placement and task, leaving no universally optimal fixed rank; a formal bound shows forgetting grows with rank. Building on these findings, we propose Continual Dynamic Rank-Selective LoRA (CoDyRA), which jointly trains each LoRA update with rank minimization via sparsity-promoting regularization on per-component importance weights. The supervised objective drives plasticity; rank minimization regularizes forgetting. We show that rank minimization serves as an implicit forgetting regularizer in the CL regime, protecting general capability and prior-task knowledge simultaneously by controlling forgetting against the current model state. Across MTIL, X-TAIL, and TRACE (CLIP, LLaMA, Gemma), CoDyRA outperforms prior CL methods on new knowledge learning and forgetting, achieving a strong plasticity-stability balance. Code is available at https://github.com/jeff024/codyra.
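The idea of training a LoRA update together with sparse per-component importance weights can be sketched as follows. Soft-thresholding is used here as one common sparsity-promoting step; it is an assumption for illustration, and the abstract does not state which regularizer or optimizer CoDyRA uses. All names and numbers are hypothetical.

```python
def soft_threshold(weights, threshold):
    """Proximal step for an L1 penalty: shrink each weight toward zero and
    set small ones to exactly zero."""
    out = []
    for w in weights:
        mag = max(abs(w) - threshold, 0.0)
        out.append(mag if w >= 0 else -mag)
    return out

def lora_delta(B_cols, A_rows, weights):
    """Delta W = sum_i weights[i] * outer(B_cols[i], A_rows[i])."""
    n_out, n_in = len(B_cols[0]), len(A_rows[0])
    delta = [[0.0] * n_in for _ in range(n_out)]
    for b, a, w in zip(B_cols, A_rows, weights):
        if w == 0.0:
            continue  # a pruned component contributes nothing
        for r in range(n_out):
            for c in range(n_in):
                delta[r][c] += w * b[r] * a[c]
    return delta

def active_components(weights):
    """Number of surviving components, an upper bound on the rank of Delta W."""
    return sum(1 for w in weights if w != 0.0)

B_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
importance = [0.9, 0.05, -0.4]

pruned = soft_threshold(importance, 0.1)
delta = lora_delta(B_cols, A_rows, pruned)
print(active_components(importance), active_components(pruned))  # 3 2
```

Components whose importance falls below the threshold drop out entirely, so the update uses only as much rank as the task appears to need.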

2505.18647 2026-05-28 cs.LG cs.AI

STFlow: Data-Coupled Flow Matching for Geometric Trajectory Simulation

STFlow: 用于几何轨迹模拟的数据耦合流匹配

Kiet Bennema ten Brinke, Koen Minartz, Vlado Menkovski

发表机构 * Machine Learning for Physical Sciences (ML4Sci/e) Group, Department of Mathematics & Computer Science, Eindhoven University of Technology, The Netherlands

AI总结 提出STFlow,一种基于图神经网络和层次卷积的生成模型,通过数据依赖耦合的流匹配框架,从条件随机游走而非高斯噪声去噪,降低传输成本,提高训练和推理效率,在N体系统、分子动力学和人类轨迹预测中实现最低预测误差。

Comments Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea. PMLR 306, 2026, 18 pages, 12 figures

详情
AI中文摘要

模拟动力系统的轨迹是分子动力学、生物化学和行人动力学等广泛领域中的基本问题。机器学习已成为扩展基于物理的模拟器和直接从实验数据开发模型的宝贵工具。特别是,深度生成建模和几何深度学习的最新进展通过学习复杂的轨迹分布,同时尊重固有的置换和时间平移对称性,实现了概率模拟。然而,N体系统的轨迹通常具有对导致分岔的扰动的高敏感性,以及多尺度的时间和空间相关性。为了应对这些挑战,我们引入了STFlow(时空流),一种基于图神经网络和层次卷积的生成模型。通过在流匹配框架中引入数据依赖的耦合,STFlow从条件随机游走而非高斯噪声开始去噪。这种新颖的信息先验通过降低传输成本简化了学习任务,提高了训练和推理效率。我们在N体系统、分子动力学和人类轨迹预测上验证了我们的方法。在这些基准测试中,STFlow以更少的模拟步骤实现了最低的预测误差,并提高了可扩展性。

英文摘要

Simulating trajectories of dynamical systems is a fundamental problem in a wide range of fields such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physics-based simulators and developing models directly from experimental data. In particular, recent advances in deep generative modeling and geometric deep learning enable probabilistic simulation by learning complex trajectory distributions while respecting intrinsic permutation and time-shift symmetries. However, trajectories of N-body systems are commonly characterized by high sensitivity to perturbations leading to bifurcations, as well as multi-scale temporal and spatial correlations. To address these challenges, we introduce STFlow (Spatio-Temporal Flow), a generative model based on graph neural networks and hierarchical convolutions. By incorporating data-dependent couplings within the Flow Matching framework, STFlow denoises starting from conditioned random-walks instead of Gaussian noise. This novel informed prior simplifies the learning task by reducing transport cost, increasing training and inference efficiency. We validate our approach on N-body systems, molecular dynamics, and human trajectory forecasting. Across these benchmarks, STFlow achieves the lowest prediction errors with fewer simulation steps and improved scalability.
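The effect of starting from an informed prior can be sketched with the standard linear flow-matching path. The code below uses 1-D trajectories and a simple random walk anchored at the last observed position as a stand-in for the conditioned random-walk source; the step size, the names, and the example trajectory are assumptions, and the graph network and hierarchical convolutions are not reproduced.

```python
import random

def random_walk_prior(start, num_steps, step_std, rng):
    """Sample a 1-D random walk that begins at the last observed position."""
    path = [start]
    for _ in range(num_steps - 1):
        path.append(path[-1] + rng.gauss(0.0, step_std))
    return path

def interpolate(x0, x1, t):
    """Linear probability path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]

def transport_cost(x0, x1):
    """Squared distance the flow has to cover between source and data."""
    return sum((b - a) ** 2 for a, b in zip(x0, x1))

rng = random.Random(0)
future = [5.1, 5.3, 5.2, 5.4]                # hypothetical ground-truth trajectory
walk = random_walk_prior(5.0, 4, 0.1, rng)   # source sample near the data
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # Gaussian source sample
print(transport_cost(walk, future), transport_cost(noise, future))
```

When the source sample already lies close to plausible trajectories, the velocity field has less distance to cover, which is the intuition behind the reduced transport cost mentioned in the abstract.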

2602.05897 2026-05-28 cs.CL

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

停止奖励幻觉步骤:面向小型推理模型的忠实感知步骤级强化学习

Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算与智能研究院) Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd(中国电信人工智能研究院) College of Integrated Circuits, Zhejiang University, Hangzhou, Zhejiang, China(浙江大学集成电路学院) Zhongguancun Academy, Beijing, China(中关村学院)

AI总结 针对小型推理模型在中间推理步骤中容易产生忠实性幻觉的问题,提出忠实感知步骤级强化学习(FaithRL),通过过程奖励模型提供步骤级监督和隐式截断重采样策略,减少幻觉并提高推理可靠性。

详情
AI中文摘要

随着大型语言模型变得更小更高效,小型推理模型(SRM)在资源受限环境中实现思维链(CoT)推理至关重要。然而,它们容易产生忠实性幻觉,尤其是在中间推理步骤中。现有的基于在线强化学习的缓解方法依赖于结果奖励或粗粒度的CoT评估,这可能在最终答案正确时无意中强化不忠实的推理。为了解决这些局限性,我们提出了忠实感知步骤级强化学习(FaithRL),通过来自过程奖励模型的显式忠实奖励引入步骤级监督,以及一种隐式截断重采样策略,该策略从忠实前缀生成对比信号,同时减轻步骤级奖励带来的奖励黑客(reward hacking)问题。在多个SRM和开卷问答(Open-Book QA)基准上的实验表明,FaithRL持续减少CoT和最终答案中的幻觉,从而实现更忠实和可靠的推理。代码可在 https://github.com/Easy195/FaithRL 获取。

英文摘要

As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes, while also mitigating reward hacking from step-level rewards. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.

2509.25582 2026-05-28 cs.LG

Safe In-Context Reinforcement Learning

安全的上下文强化学习

Amir Moeini, Minjae Kwon, Alper Kamil Bozkurt, Yuichi Motai, Rohan Chandra, Lu Feng, Shangtong Zhang

发表机构 * University of Virginia(弗吉尼亚大学) Virginia Commonwealth University(弗吉尼亚联邦大学)

AI总结 提出SCARED方法,在无参数更新的上下文强化学习适应过程中,通过精确惩罚对偶法在约束马尔可夫决策过程框架下保证安全,实现奖励最大化与成本约束。

Comments ICML 2026

详情
AI中文摘要

上下文强化学习(ICRL)是一种新兴的强化学习范式,其中智能体在预训练后,无需任何参数更新,仅依赖不断扩展的交互历史上下文即可适应分布外测试任务。尽管ICRL展现出令人印象深刻的泛化能力,但适应过程中的安全性尚未被探索,这限制了其在测试时行为需安全的实际部署中的应用。本文提出SCARED:基于精确惩罚对偶的安全上下文自适应强化,这是首个在约束马尔可夫决策过程框架下促进ICRL安全适应的方法。在无需参数更新的适应过程中,我们的智能体不仅最大化奖励,还将累积成本控制在用户指定的安全预算内。我们还证明智能体对安全预算有主动反应:安全预算越高,智能体行为越激进;安全预算越低,智能体行为越保守。在具有挑战性的基准测试中,SCARED始终实现安全且鲁棒的上下文适应,优于现有的ICRL和安全元强化学习基线。

英文摘要

In-context reinforcement learning (ICRL) is an emerging RL paradigm where an agent, after pretraining, can adapt to out-of-distribution test tasks without any parameter updates, instead relying on an expanding context of interaction history. While ICRL has shown impressive generalization, safety during this adaptation process remains unexplored, limiting its applicability in real-world deployments where test-time behavior is expected to be safe. In this work, we propose SCARED: Safe Contextual Adaptive Reinforcement via Exact-penalty Dual, the first method that promotes safe adaptation of ICRL under the constrained Markov decision process framework. During the parameter-update-free adaptation process, our agent not only maximizes the reward but also keeps the accumulated cost within a user-specified safety budget. We also demonstrate that the agent actively reacts to the safety budget; with a higher safety budget, the agent behaves more aggressively, and with a lower safety budget the agent behaves more conservatively. Across challenging benchmarks, SCARED consistently enables safe and robust in-context adaptation, outperforming existing ICRL and safe meta-RL baselines.
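The trade-off between reward and a user-specified safety budget can be illustrated with a textbook exact (hinge) penalty on episode cost. The abstract names an exact-penalty dual but gives no formula, so the objective below, the penalty weight, and the two example behaviors are assumptions for illustration only.

```python
def penalized_return(rewards, costs, budget, penalty_weight):
    """Total reward minus penalty_weight * max(0, total cost - budget)."""
    total_reward = sum(rewards)
    violation = max(0.0, sum(costs) - budget)
    return total_reward - penalty_weight * violation

# Two hypothetical behaviors over a three-step episode.
aggressive = {"rewards": [5.0, 5.0, 5.0], "costs": [1.0, 1.0, 1.0]}
cautious = {"rewards": [3.0, 3.0, 3.0], "costs": [0.0, 0.0, 1.0]}

def preferred(budget, penalty_weight=10.0):
    """Name of the behavior with the higher penalized return."""
    scores = {
        name: penalized_return(b["rewards"], b["costs"], budget, penalty_weight)
        for name, b in (("aggressive", aggressive), ("cautious", cautious))
    }
    return max(scores, key=scores.get)

print(preferred(budget=3.0))  # aggressive
print(preferred(budget=1.0))  # cautious
```

With a generous budget the higher-reward behavior wins, and with a tight budget the penalty flips the preference, mirroring the budget-dependent behavior reported in the abstract.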

2503.01829 2026-05-28 cs.CL cs.AI cs.LG cs.MA

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

如果你能说服我:评估大型语言模型说服效果与易受影响性的框架

Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出PMIYC框架,通过多智能体对话自动评估LLM的说服效果与易受影响性,发现不同模型在说服力和抗说服性上存在显著差异。

Comments Paper published at the ACM Conference on AI and Agentic Systems 2026

详情
AI中文摘要

大型语言模型(LLM)展现出与人类水平相当的说服能力。虽然这些能力可用于社会公益,但也存在被滥用的风险。除了关注LLM如何说服他人外,它们自身对说服的易受影响性也构成了关键的对齐挑战,引发了关于鲁棒性、安全性和伦理原则遵守的问题。为了研究这些动态,我们引入了“如果你能说服我”(PMIYC),一个用于评估多智能体交互中说服力和易受影响性的自动化框架。我们的框架提供了一种可扩展的替代方案,替代了通常用于研究LLM说服的昂贵且耗时的人工标注过程。PMIYC自动进行说服者和被说服者智能体之间的多轮对话,同时衡量说服的有效性和易受影响性。我们的综合评估涵盖了多种LLM和说服场景(例如,主观和错误信息场景)。我们通过人工评估验证了框架的有效性,并展示了与先前研究中人工评估的一致性。通过PMIYC,我们发现Llama-3.3-70B和GPT-4o表现出相似的说服效果,比Claude 3 Haiku高出30%。然而,GPT-4o在对抗错误信息方面的抵抗力比Llama-3.3-70B高出50%以上。值得注意的是,o4-mini既是有效的说服者,也是不易被说服的被说服者。这些发现为LLM的说服动态提供了实证见解,并有助于开发更安全的AI系统。

英文摘要

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Beyond the concern of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi-agent interactions. Our framework offers a scalable alternative to the costly and time-intensive human annotation process typically used to study persuasion in LLMs. PMIYC automatically conducts multi-turn conversations between Persuader and Persuadee agents, measuring both the effectiveness of and susceptibility to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the efficacy of our framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama-3.3-70B. Notably, o4-mini emerges as both an effective persuader, and a resistant persuadee. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.

2602.03668 2026-05-28 cs.RO cs.CV

MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

MVP-LAM:通过跨视角重建学习以动作为中心的潜在动作

Jung Min Lee, Dohyeok Lee, Seokhun Ju, Taehyun Cho, Jin Woo Koo, Li Zhao, Sangwoo Hong, Jungwoo Lee

发表机构 * Seoul National University, Seoul, South Korea(首尔国立大学,首尔,韩国) Konkuk University, Seoul, South Korea(建国大学,首尔,韩国) Microsoft Research Asia, Beijing, China(微软亚洲研究院,北京,中国) HodooAI Labs, Seoul, South Korea(HodooAI实验室,首尔,韩国)

AI总结 提出MVP-LAM模型,利用多视角视频通过跨视角重建目标学习与真实动作高度相关的潜在动作,提升动作预测和下游操作性能。

详情
AI中文摘要

从多样化人类视频中学习的潜在动作作为视觉-语言-动作(VLA)预训练的伪标签,但只有当它们对底层真实动作保持信息量时才能提供有效监督。为了有效监督,潜在动作应包含关于底层动作的信息,尽管这些信息不可直接获取。我们提出多视角潜在动作模型(MVP-LAM),该模型从多视角视频中学习与真实动作高度相关的潜在动作。MVP-LAM通过跨视角重建目标训练潜在动作,使得一个视角的潜在动作必须解释另一个视角的未来,从而减少对视角特定线索的依赖。在Bridge V2上,MVP-LAM生成更以动作为中心的潜在动作,与真实动作的互信息更高,动作预测性能提升,包括在分布外评估下。最后,使用MVP-LAM潜在动作预训练VLA模型提高了各种基准上的下游操作性能。代码和训练好的检查点可在https://jmsnu.github.io获取。

英文摘要

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jmsnu.github.io.

2602.03515 2026-05-28 cs.LG cs.AI cs.DC

Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

通过基旋转缓解异步流水线并行中的陈旧性问题

Hyunji Jung, Sungbin Shin, Namhoon Lee

发表机构 * POSTECH(POSTECH大学)

AI总结 针对异步流水线并行中梯度陈旧性随流水线深度线性增长的问题,提出基旋转框架,通过将优化器坐标系与Hessian特征基对齐来保持延迟更新的有效性,理论证明最小化基失配并实证在3B参数LLM训练中减少81.7%迭代次数。

Comments ICML 2026

详情
AI中文摘要

异步流水线并行通过消除同步执行中固有的流水线气泡来最大化硬件利用率,为高效大规模分布式训练提供了一条途径。然而,这种效率提升可能会被梯度陈旧性所削弱,其中使用延迟梯度的即时模型更新会在优化过程中引入噪声。关键的是,我们发现了一个常被忽视的严重问题:这种延迟随流水线深度线性增长,从根本上破坏了该方法原本意图提供的可扩展性。我们将此问题归因于优化景观的一个特定性质:Hessian特征基与标准坐标基之间的失配,这触发了坐标自适应优化器更新轨迹中的振荡。我们识别出这些振荡导致延迟更新偏离其真实对应项,使其无法用于当前迭代。这一见解通过理论分析(包括一个表明基失配放大延迟惩罚的收敛界)和实证评估得到证实。为了解决这个问题,我们提出了基旋转,一个将优化器坐标系旋转以与Hessian特征基对齐的框架,使延迟更新保持有用。我们从理论上证明基旋转最小化基失配,从而抵消放大延迟惩罚的条件。在训练高达3B参数的LLM的实证中,与性能最佳的异步基线相比,基旋转减少了81.7%所需的迭代次数。

英文摘要

Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. We trace this pathology to a specific property of the optimization landscape: the misalignment between the Hessian eigenbasis and the standard coordinate basis, which triggers oscillations in the update trajectories of coordinate-wise adaptive optimizers. We identify that these oscillations cause delayed updates to diverge from their true counterparts, invalidating their use for current iterations. This insight is formalized through theoretical analysis, including a convergence bound showing that basis misalignment amplifies the delay penalty, and substantiated with empirical evaluation. To address this, we propose basis rotation, a framework that rotates the optimizer's coordinate system to align with the Hessian eigenbasis, keeping delayed updates useful. We theoretically demonstrate that basis rotation minimizes basis misalignment, thereby counteracting the conditions that amplify delay penalties. Empirically, in training up to a 3B-parameter LLM, basis rotation reduces the required iterations by 81.7% compared to the best-performing asynchronous baseline.
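The core operation, applying a coordinate-wise update in the Hessian eigenbasis rather than in the standard basis, can be sketched on a 2-D quadratic. The sketch assumes the Hessian is known exactly and uses a sign step as a stand-in for a coordinate-wise adaptive optimizer; how the paper estimates the eigenbasis for large models is not described in the abstract, and all names here are hypothetical.

```python
import math

def eigenbasis_2x2(a, b, c):
    """Orthonormal eigenvectors of the symmetric matrix [[a, b], [b, c]],
    stored as the columns (cos t, sin t) and (-sin t, cos t) of Q."""
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

def rotated_sign_update(grad, Q, lr):
    """Rotate grad into the eigenbasis (Q^T g), take a sign step there,
    and rotate the step back (Q s)."""
    g_rot = [Q[0][i] * grad[0] + Q[1][i] * grad[1] for i in range(2)]
    s_rot = [-lr * ((g > 0) - (g < 0)) for g in g_rot]
    return [Q[r][0] * s_rot[0] + Q[r][1] * s_rot[1] for r in range(2)]

H = [[2.0, 1.0], [1.0, 2.0]]         # Hessian whose eigenvectors are off-axis
Q = eigenbasis_2x2(H[0][0], H[0][1], H[1][1])
grad = [2.0, 1.0]                    # gradient H @ x at x = (1, 0)

plain_step = [-0.1 * ((g > 0) - (g < 0)) for g in grad]  # standard basis
rotated_step = rotated_sign_update(grad, Q, lr=0.1)
print(plain_step, rotated_step)
```

In the standard basis the sign step treats both coordinates identically, whereas the rotated step acts independently along each curvature direction of the quadratic.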

2602.03491 2026-05-28 cs.CV cs.CL

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

解耦骨架与血肉:基于解缠对齐和结构感知引导的高效多模态表格推理

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室)

AI总结 提出DiSCo解缠结构-内容对齐框架和Table-GLS全局到局部结构引导推理框架,高效增强LVLM的表格理解与推理能力,无需昂贵监督或外部工具。

Comments Accepted as a Spotlight Paper at ICML 2026

详情
AI中文摘要

由于复杂的布局和紧密耦合的结构-内容信息,对表格图像进行推理对于大型视觉语言模型(LVLM)仍然具有挑战性。现有解决方案通常依赖于昂贵的监督训练、强化学习或外部工具,限制了效率和可扩展性。这项工作解决了一个关键问题:如何以最少的标注且无需外部工具来使LVLM适应表格推理?具体来说,我们首先引入了DiSCo,一种解缠结构-内容对齐框架,在多模态对齐期间明确分离结构抽象和语义基础,高效地将LVLM适应于表格结构。在DiSCo的基础上,我们进一步提出了Table-GLS,一种全局到局部结构引导推理框架,通过结构化探索和基于证据的推理来执行表格推理。跨多个基准的大量实验表明,我们的框架高效地增强了LVLM的表格理解和推理能力,特别是泛化到未见过的表格结构。我们的数据和代码可在https://github.com/AAAndy-Zhu/TableVLM获取。

英文摘要

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures. Our data and code are available at https://github.com/AAAndy-Zhu/TableVLM.

2602.02898 2026-05-28 cs.AI cs.CL

Aligning Language Model Benchmarks with Pairwise Preferences

将语言模型基准与成对偏好对齐

Marco Gutierrez, Xinyi Leng, Hannah Cyberey, Jonathan Richard Schwarz, Ahmed Alaa, Thomas Hartvigsen

发表机构 * School of Data Science, University of Virginia(弗吉尼亚大学数据科学学院) Imperial College London(伦敦帝国理工学院) Thomson Reuters Foundational Research(汤姆森路透基础研究) Department of Electrical Engineering and Computer Science, UC Berkeley and UCSF(加州大学伯克利分校电气工程与计算机科学系及加州大学旧金山分校)

AI总结 提出BenchAlign方法,通过利用语言模型在问题级别的性能与模型成对排名,自动调整离线基准权重,使新基准能根据偏好准确排序未见模型。

详情
AI中文摘要

语言模型基准是广泛使用的、计算高效的现实性能代理。然而,许多近期工作发现基准常常无法预测实际效用。为弥合这一差距,我们引入基准对齐,即利用有限的模型性能信息自动更新离线基准,旨在生成新的静态基准,以预测给定测试设置中的模型成对偏好。然后我们提出BenchAlign,这是该问题的首个解决方案,它利用语言模型在问题级别的性能以及可能在部署期间收集的模型成对排名,学习基准问题的偏好对齐权重,生成新的基准,根据这些偏好对先前未见过的模型进行排序。我们的实验表明,我们的对齐基准能够根据人类偏好模型准确地对未见模型进行排序,即使模型大小不同,同时保持可解释性。总体而言,我们的工作为将基准与实际人类偏好对齐的局限性提供了见解,这有助于加速模型开发以追求实际效用。

英文摘要

Language model benchmarks are pervasive and computationally-efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weightings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
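Learning question weights from ranked model pairs can be sketched with a logistic pairwise loss, as below. The loss, the gradient-descent settings, and the clipping of weights at zero are assumptions chosen for illustration and are not taken from BenchAlign; the model names and results are hypothetical.

```python
import math

def weighted_score(correct, weights):
    """Benchmark score as a weighted sum of per-question correctness."""
    return sum(w * c for w, c in zip(weights, correct))

def fit_question_weights(results, preferred_pairs, steps=200, lr=0.5):
    """results: model name -> list of 0/1 per-question correctness.
    preferred_pairs: (winner, loser) model names.
    Minimizes log(1 + exp(-margin)) over pairs by gradient descent,
    where margin is the weighted score gap between winner and loser."""
    n = len(next(iter(results.values())))
    weights = [1.0 / n] * n
    for _ in range(steps):
        grad = [0.0] * n
        for winner, loser in preferred_pairs:
            diff = [a - b for a, b in zip(results[winner], results[loser])]
            margin = sum(w * d for w, d in zip(weights, diff))
            coeff = -1.0 / (1.0 + math.exp(margin))  # derivative w.r.t. margin
            for i in range(n):
                grad[i] += coeff * diff[i]
        weights = [max(0.0, w - lr * g / len(preferred_pairs))
                   for w, g in zip(weights, grad)]
    return weights

results = {"A": [1, 0, 1], "B": [0, 1, 1], "C": [0, 1, 0]}
pairs = [("A", "B"), ("B", "C")]  # A preferred over B, B preferred over C
weights = fit_question_weights(results, pairs)
scores = {m: weighted_score(r, weights) for m, r in results.items()}
print(weights, scores)
```

With uniform weights models A and B tie, while the fitted weights separate them in the preferred order; the same weights can then score a model that was not part of the fitting pairs.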

2602.02855 2026-05-28 cs.LG cond-mat.dis-nn math.ST stat.TH

When pre-training hurts LoRA fine-tuning: a dynamical analysis via single-index models

当预训练损害LoRA微调:基于单指标模型的动力学分析

Gibbs Nwemadji, Bruno Loureiro, Jean Barbier

Affiliations International School of Advanced Studies; Département d’Informatique, École Normale Supérieure, PSL & CNRS; Abdus Salam International Centre for Theoretical Physics

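AI summary Through a dynamical analysis of single-index models, this paper shows mathematically that excessive pre-training can slow the convergence of LoRA fine-tuning, and characterizes how the convergence rate depends on the initial alignment and the non-linearity of the target task.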

Comments 38 pages, 14 figures

Details
AI Chinese abstract

在源任务上的预训练通常被认为有助于类似下游问题的微调。本文从数学上表明,这种朴素直觉并不总是成立:过度预训练会在计算上减慢微调优化。我们研究了在单次SGD训练的单指标模型上进行低秩适应(LoRA)微调的现象。利用微调动力学的汇总统计描述,我们精确刻画了收敛率如何依赖于初始微调对齐和目标任务的非线性程度。关键结论是,即使预训练和下游任务高度对齐,强预训练也会导致搜索阶段延长并阻碍收敛。因此,我们的理论提供了一个统一图景,说明预训练强度与任务难度如何在非平凡的可处理模型中共同塑造LoRA微调的动力学和局限性。在实践方面,我们通过实验表明,我们的理论发现超越了玩具模型,在真实数据上训练的视觉变换器模型中仍然相关。

English abstract

Pre-training on a source task is usually expected to facilitate fine-tuning on similar downstream problems. In this work, we mathematically show that this naive intuition is not always true: excessive pre-training can computationally slow down fine-tuning optimization. We study this phenomenon for low-rank adaptation (LoRA) fine-tuning on single-index models trained under one-pass SGD. Leveraging a summary statistics description of the fine-tuning dynamics, we precisely characterize how the convergence rate depends on the initial fine-tuning alignment and the degree of non-linearity of the target task. The key takeaway is that even when the pre-training and downstream tasks are well aligned, strong pre-training can induce a prolonged search phase and hinder convergence. Our theory thus provides a unified picture of how pre-training strength and task difficulty jointly shape the dynamics and limitations of LoRA fine-tuning in a nontrivial tractable model. On the practical side, we empirically show that our theoretical findings extend beyond our toy model and remain relevant in the context of a vision-transformer model trained on real data.