2604.15711 2026-05-08 cs.CV cs.AI

SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification

SSMamba: 一种用于病理图像分类的自监督混合状态空间模型

Enhui Chai, Sicheng Chen, Tianyi Zhang, Xingyu Li, Tianxiang Cui

Affiliations: School of Computer Science, Northwest University; PuzzleLogic Pte Ltd, Singapore; Department of Electrical & Computer Engineering, National University of Singapore; School of Computer Science, University of Nottingham Ningbo China

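AI summary: This paper proposes SSMamba, a hybrid self-supervised framework whose MAMIM, DMS, and LPR modules address domain shift, local-global relationship modeling, and fine-grained sensitivity in pathological image analysis; it is reported to outperform existing methods.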

Journal ref Medical Image Analysis, Volume 111, June 2026, 104080

AI-generated Chinese abstract

病理诊断高度依赖图像分析,其中感兴趣区域(ROIs)是诊断证据的主要基础,而全滑片图像(WSI)级任务主要捕捉聚合模式。为提取这些关键形态学特征,基于视觉Transformer(ViTs)和大规模自监督学习(SSL)的ROIs级基础模型(FMs)已被广泛应用。然而,在应用于ROI分析时仍存在三个核心限制:(1)跨放大域偏移,固定尺度预训练阻碍适应多样临床环境;(2)局部-全局关系建模不足,ViT骨干在FMs中存在高计算开销和不精确的局部表征;(3)细粒度敏感性不足,传统自注意力机制容易忽略细微诊断线索。为解决这些挑战,我们提出SSMamba,一种混合SSL框架,能够有效进行细粒度特征学习,而无需依赖大规模外部数据集。该框架包含三个领域自适应组件:Mamba Masked Image Modeling(MAMIM)用于缓解域偏移,方向多尺度(DMS)模块用于平衡局部-全局建模,以及局部感知残差(LPR)模块用于增强细粒度敏感性。采用两阶段流程,SSL预训练在目标ROIs数据集上,随后进行监督微调(SFT),SSMamba在10个公共ROIs数据集上优于11种最先进的(SOTA)病理FMs,并在6个公共WSI数据集上超越8种SOTA方法。这些结果验证了任务特定架构设计在病理图像分析中的优越性。

English abstract

Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.

2604.05834 2026-05-08 cs.LG

Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

乘积交互中的隐藏问题:揭示多模态对比学习中的脆弱性

Tillmann Rheude, Stefan Hegselmann, Roland Eils, Benjamin Wild

Affiliations: Berlin Institute of Health, Charité - Universitätsmedizin Berlin; Intelligent Medicine Institute, Fudan University; Department of Mathematics and Computer Science, Freie Universität Berlin

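AI summary: This paper proposes Gated Symile, which introduces a contrastive gating mechanism to address the fragility that arises in multimodal contrastive learning when a single modality is weakly informative, misaligned, or missing, improving retrieval accuracy.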

AI-generated Chinese abstract

对比学习已成为从配对数据中进行无监督学习的标准方法,如CLIP用于图像-文本匹配。然而,许多领域涉及超过两种模态,需要捕捉更高阶依赖关系的目标。Symile通过将点积替换为模态嵌入上的多线性内积(MIP)扩展CLIP。本文发现乘积交互中存在隐藏的脆弱性:一个弱信息、错位或缺失的模态会通过目标传播并扭曲跨模态检索分数。我们提出Gated Symile,一种对比门机制,通过基于注意力的、按候选基点的方式适应模态贡献。门通过将嵌入插值到可学习的中性方向,并在可靠跨模态对齐不当时提供显式NULL选项来抑制不可靠输入。在受控的合成基准和三个现实世界的三模态数据集上,Gated Symile在经过良好调优的最先进(sota)基线之上实现了更高的top-1检索准确率。更广泛地说,我们的结果强调门机制作为在存在噪声、错位或缺失输入时实现鲁棒多模态对比学习的一步。

English abstract

Contrastive learning has become a standard approach for unsupervised learning from paired data, as demonstrated by CLIP for image-text matching. However, many domains involve more than two modalities and require objectives that capture higher-order dependencies beyond pairwise alignment. Symile extends CLIP to this setting by replacing the dot product with the multilinear inner product (MIP) over modality embeddings. In this work, we show that there is a fragility which is hidden in the multiplicative interaction: a single weakly informative, misaligned, or missing modality can propagate through the objective and distort cross-modal retrieval scores. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions with an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that uncovers this fragility and three real-world trimodal datasets, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned state-of-the-art (sota) baselines. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning beyond two modalities in the presence of noise, misalignment, or missing inputs.
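
The fragility described in this abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes three-dimensional embeddings, a fixed scalar gate `g`, and an all-ones neutral direction, whereas the paper describes an attention-based gate and learnable neutral directions. The helper names `mip` and `gate` are hypothetical.

```python
import numpy as np

def mip(a, b, c):
    """Multilinear inner product of three embeddings: sum_i a_i * b_i * c_i."""
    return float(np.sum(a * b * c))

def gate(e, neutral, g):
    """Interpolate embedding e toward a neutral direction.

    g = 1 keeps e unchanged, g = 0 replaces it with the neutral direction.
    A scalar g is an assumption made for this sketch.
    """
    return g * e + (1.0 - g) * neutral

a = np.array([1.0, 2.0, 0.5])
b = np.array([0.5, 1.0, 2.0])
c_good = np.array([1.0, 1.0, 1.0])
c_missing = np.zeros(3)  # a missing modality, encoded here as zeros (assumption)

score_full = mip(a, b, c_good)       # 0.5 + 2.0 + 1.0 = 3.5
score_broken = mip(a, b, c_missing)  # 0.0: one bad factor wipes out the score

# With an all-ones neutral direction (assumption), fully gating out the third
# modality reduces the MIP to the dot product of the remaining two.
neutral = np.ones(3)
score_gated = mip(a, b, gate(c_missing, neutral, g=0.0))  # 3.5
```

Because the three embeddings are multiplied elementwise, a single uninformative factor scales every term, which is why the gate acts on the embedding before the product rather than on the final score.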

2604.05438 2026-05-08 cs.LG cs.CL

Residual-Mass Accounting for Partial-KV Decoding

残差质量会计用于部分KV解码

Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi

Affiliations: Kioxia Corporation

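AI summary: This paper proposes a residual-mass accounting method that computes retrieved-token contributions exactly and subtracts them to keep the exact and residual sets non-overlapping, improving partial-KV decoding; at a 1% exact-support budget it outperforms the Top-K baseline.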

AI-generated Chinese abstract

我们研究了一种受控的部分KV解码设置,在其中精确计算sink/tail锚点和检索token集的未归一化softmax贡献,而剩余的prefill token则由残差估计表示。我们关注在查询依赖的精确支持已选定后的会计规则,并仅将穷举Top-K作为oracle选择器,而非可部署的检索系统。所提出规则保持基础语言模型和精确分支KV张量不变。它从学习的正特征图ϕ中构建固定大小的总结状态(S,u),减去检索token的特征贡献以保持精确和残差集不重叠,并将估计的残差分子和分母与精确分支下合并归一化。在1%的精确支持预算下,我们的残差完成方法在RULER和BABILong上优于仅选择Top-K基线,在冻结的1B和3B Llama-3.2-Instruct基础架构上所有报告的上下文长度下均如此。在0.5-4%的精确支持预算扫描中,这一趋势大致持续。在LongBench上,总结结果大多有利,而多文档QA则混合。注意力输出诊断支持检索token减法作为分区一致的会计规则,同时表明主要剩余误差是不完美的学习-ϕ对未检索残差质量的近似。

English abstract

We study a controlled partial-KV decoding setting in which exact unnormalized softmax contributions are computed for sink/tail anchors and a retrieved token set, while the remaining prefill tokens are represented by a residual estimate. We focus on the accounting rule after the query-dependent exact support has been selected, and use exhaustive Top-K only as an oracle selector, not as a deployable retrieval system. The proposed rule leaves the backbone language model and the exact-branch KV tensors unchanged. It builds fixed-size summary states $(S,u)$ from learned positive feature maps $ϕ$, subtracts retrieved-token feature contributions to keep the exact and residual sets non-overlapping, and merges the estimated residual numerator and denominator with the exact branch under one normalization. At a 1% exact-support budget, our residual-completion method improves over the selection-only Top-K baseline on RULER and BABILong across frozen 1B and 3B Llama-3.2-Instruct backbones at all reported context lengths. In the 0.5-4% exact-support budget sweeps, this trend largely persists. On LongBench, summarization results are mostly favorable, while multi-document QA is mixed. Attention-output diagnostics support retrieved-token subtraction as the partition-consistent accounting rule, while indicating that the main remaining error is imperfect learned-$ϕ$ approximation of the unretrieved residual mass.
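
The accounting rule in this abstract can be sketched for a single query and a single head. This is an illustration under stated assumptions, not the paper's code: `phi` is a stand-in positive feature map built from random weights `W` rather than a learned one, the summary states are built over all prefill tokens, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dv, m = 16, 4, 3, 8            # tokens, key dim, value dim, feature dim
K = rng.normal(size=(n, d)) * 0.5
V = rng.normal(size=(n, dv))
q = rng.normal(size=d) * 0.5
W = rng.normal(size=(d, m)) * 0.5    # stand-in for learned parameters (assumption)

def phi(x):
    """Positive feature map; exp of a linear projection is an assumption."""
    return np.exp(x @ W)

def summary_states(K, V):
    """Fixed-size states: S = sum_j phi(k_j) v_j^T and u = sum_j phi(k_j)."""
    F = phi(K)
    return F.T @ V, F.sum(axis=0)

def partial_kv_attention(q, K, V, exact_idx, S, u):
    Ke, Ve = K[exact_idx], V[exact_idx]
    w = np.exp(Ke @ q)                      # exact unnormalized softmax weights
    num_exact, den_exact = w @ Ve, w.sum()
    Fe = phi(Ke)
    S_res = S - Fe.T @ Ve                   # subtract retrieved-token features so
    u_res = u - Fe.sum(axis=0)              # exact and residual sets do not overlap
    fq = phi(q)
    num_res, den_res = fq @ S_res, fq @ u_res
    # One shared normalization over exact and estimated residual mass.
    return (num_exact + num_res) / (den_exact + den_res)

def full_attention(q, K, V):
    w = np.exp(K @ q)
    return (w @ V) / w.sum()

S, u = summary_states(K, V)
out_part = partial_kv_attention(q, K, V, np.array([0, 1, 14, 15]), S, u)
out_all = partial_kv_attention(q, K, V, np.arange(n), S, u)
```

When every token is in the exact set, the subtraction leaves an empty residual and the result matches full softmax attention up to rounding, which is a quick consistency check on the partition.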

2604.04552 2026-05-08 cs.CV cs.AI

StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods

StableTTA:通过无训练测试时间适应方法提升视觉模型性能

Zheng Li, Jerry Cheng, Huanying Helen Gu

Affiliations: Department of Computer Science

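AI summary: StableTTA is a training-free test-time adaptation method that improves prediction consistency and accuracy under coherent-batch inference and achieves efficient logit aggregation through feature-level cropping.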

Comments 27 pages, 10 figures, 9 tables

AI-generated Chinese abstract

集成方法虽然能提升预测性能,但往往导致高内存和计算成本。我们发现非线性投影和投票操作引起的聚合不稳定性。为解决效率挑战和不一致性,我们提出StableTTA,一种无训练测试时间适应方法,包含两个变种。StableTTA-I针对相干批次推理设置,通过方差感知logit聚合显著提升预测一致性与准确性。StableTTA-II建立特征级裁剪,通过单次模型主干前向传递实现高效logit聚合。在ImageNet-1K上71个模型的实验表明,StableTTA-I在相干批次推理中持续提升预测准确性,而StableTTA-II提供轻量且架构无关的准确性提升,计算开销极低。这些结果表明,推理时的语义相干性和聚合稳定性为改进实际测试时间适应系统提供了有价值的视角。

English abstract

Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challenges and this inconsistency, we propose StableTTA, a training-free test-time adaptation method with two variants. StableTTA-I targets coherent-batch inference settings, where temporally or semantically adjacent observations are likely to belong to the same class. Examples include burst photography, video streams, robotics perception, and industrial inspection. Under coherent-batch inference, StableTTA-I substantially improves prediction consistency and accuracy through variance-aware logit aggregation. StableTTA-II establishes feature-level cropping, enabling efficient logit aggregation with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models demonstrate that StableTTA-I consistently improves prediction accuracy under coherent-batch inference, while StableTTA-II provides lightweight and architecture-agnostic accuracy improvements with minimal computational overhead. These results suggest that inference-time semantic coherence and aggregation stability provide useful perspectives for improving practical test-time adaptation systems.

2603.29552 2026-05-08 cs.CL cs.AI cs.LG

Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

双语婴儿语言模型的培养:利用小型模型研究多语言语言习得

Linda Zeng, Steven Y. Feng, Michael C. Frank

Affiliations: Stanford University

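AI summary: By training small language models, this paper studies multilingual acquisition and finds that bilingual models perform well in both languages, suggesting that multilingual input poses no in-principle challenge for statistical learners.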

Comments Code and data at https://github.com/lindazeng979/bilingual-babyLM

AI-generated Chinese abstract

多语言现象在全球范围内非常普遍,引发了关于儿童如何同时学习多种语言的重要理论和实践问题。例如,多语言习得是否会导致学习延迟?是否存在更好的或更差的多语言输入结构?许多相关性研究探讨这些问题,但很难得出确切结论,因为儿童无法被随机分配为多语言者,且不同语言间数据通常不匹配。我们使用语言模型训练作为模拟多种高度受控暴露条件的方法,并利用合成数据和机器翻译创建匹配的1亿词单语和双语数据集。我们在按不同暴露模式组织的单语和双语数据上训练GPT-2模型,并评估其在困惑度、语法正确性和语义知识上的表现。在各种模型规模和测量指标上,双语模型在一种语言上与单语模型表现相似,同时在第二种语言上也表现出色。这些结果表明,不同双语暴露模式之间没有显著差异,且双语输入对无偏统计学习者不存在根本性挑战。

English abstract

Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.

2603.28129 2026-05-08 cs.RO

A Position Statement on Endovascular Models and Effectiveness Metrics for Mechanical Thrombectomy Navigation, on behalf of the Stakeholder Taskforce for AI-assisted Robotic Thrombectomy (START)

关于内血管模型和机械取栓导航的有效性指标的立场声明,代表人工智能辅助机器人取栓利益相关者工作组(START)

Harry Robertshaw, Anna Barnes, Phil Blakelock, Raphael Blanc, Robert Crossley, Rebecca Fahrig, Ameer E. Hassan, Benjamin Jackson, Lennart Karstensen, Neelam Kaur, Markus Kowarschik, Jeremy Lynch, Franziska Mathis-Ullrich, Dwight Meglan, Vitor Mendes Pereira, Mouloud Ourak, Matteo Pantano, S. M. Hadi Sadati, Alice Taylor-Gee, Tom Vercauteren, Phil White, Alejandro Granados, Thomas C. Booth

Affiliations: on behalf of the Stakeholder Taskforce for AI-assisted Robotic Thrombectomy (START)

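AI summary: This paper aims to establish consensus frameworks for developing and validating AI-assisted robots for thrombectomy, standardizing effectiveness metrics and defining reference testbeds, in order to improve timely access to thrombectomy for geographically diverse populations.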

Comments Published in Journal of the American Heart Association

Journal ref J Am Heart Assoc. 2026;15:e044931

AI-generated Chinese abstract

尽管我们在对抗传染病和癌症方面取得进展,21世纪中叶的主要医疗挑战之一将是中风患病率的上升。大血管闭塞尤其具有破坏性,但有效的治疗(需在数小时内实施以获得最佳结果)仍受地理因素限制。为改善地理分布各异的人群及时获得机械取栓治疗的机会,部署机器人手术系统是一个解决方案。人工智能(AI)辅助可能有助于提升操作员在这一新兴治疗方式中的技能。我们的目标是建立共识框架,以开发和验证用于取栓的AI辅助机器人。目标包括标准化有效性指标,并在计算机模拟(in silico)、体外(in vitro)、离体(ex vivo)和体内(in vivo)环境中定义参考测试平台。为此,我们召集了神经介入、机器人、数据科学、健康经济学、政策、统计学和患者倡导领域的专家。通过孵化器日、德尔菲过程和最终立场声明建立共识。我们发现,四种基本测试环境各有不同的验证作用。逼真度要求各异:较简单的测试平台应包括与导丝和导管兼容的逼真血管解剖结构,而标准测试平台应包含可变形血管。更高级的测试平台应包括血流、脉动性和疾病特征。有效性指标分为两大类:一类用于计算机模拟、体外和离体阶段,聚焦于技术导航;另一类用于体内阶段,聚焦于临床结果。患者安全是该技术开发的核心。目前亟需完成的一项患者安全任务是将体外测量结果与体内并发症相关联。

English abstract

While we are making progress in overcoming infectious diseases and cancer, one of the major medical challenges of the mid-21st century will be the rising prevalence of stroke. Large vessel occlusions are especially debilitating, yet effective treatment (needed within hours to achieve best outcomes) remains limited due to geography. One solution for improving timely access to mechanical thrombectomy in geographically diverse populations is the deployment of robotic surgical systems. Artificial intelligence (AI) assistance may enable the upskilling of operators in this emerging therapeutic delivery approach. Our aim was to establish consensus frameworks for developing and validating AI-assisted robots for thrombectomy. Objectives included standardizing effectiveness metrics and defining reference testbeds across in silico, in vitro, ex vivo, and in vivo environments. To achieve this, we convened experts in neurointervention, robotics, data science, health economics, policy, statistics, and patient advocacy. Consensus was built through an incubator day, a Delphi process, and a final Position Statement. We identified that the four essential testbed environments each had distinct validation roles. Realism requirements vary: simpler testbeds should include realistic vessel anatomy compatible with guidewire and catheter use, while standard testbeds should incorporate deformable vessels. More advanced testbeds should include blood flow, pulsatility, and disease features. There are two macro-classes of effectiveness metrics: one for in silico, in vitro, and ex vivo stages focusing on technical navigation, and another for in vivo stages, focused on clinical outcomes. Patient safety is central to this technology's development. One requisite patient safety task needed now is to correlate in vitro measurements to in vivo complications.

2603.27389 2026-05-08 cs.LG cs.AI stat.ML

Prediction-Based Markov Violation Scores for Detecting Non-Markovian Observations in Reinforcement Learning

基于预测的马尔可夫违反评分用于检测强化学习中的非马尔可夫观测

Naveen Mysore

Affiliations: GitHub

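AI summary: This paper proposes a prediction-based Markov Violation Score for detecting non-Markovian observations in reinforcement learning. It uses a random forest and ridge regression to analyze non-Markovian structure in observation trajectories, and evaluates the score across six environments, three algorithms, and multiple noise levels.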

Comments Accepted at RLC 2026, to appear in Reinforcement Learning Journal

AI-generated Chinese abstract

强化学习算法假设观测满足马尔可夫性,但现实中的传感器常因相关噪声、延迟或部分可观测性而违反这一假设。标准性能指标将马尔可夫失效与其他子最优性来源混为一谈,使从业者无法检测此类违规。本文引入一种基于预测的马尔可夫违反评分(MVS),用于量化观测轨迹中的非马尔可夫结构。随机森林首先去除非线性马尔可夫兼容动态;岭回归随后测试历史观测是否在残差上减少预测误差,超过当前观测所提供的误差。所得评分在[0, 1]范围内,且无需构建因果图。评估涵盖六个环境(CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d)、三种算法(PPO, A2C, SAC)、受控AR(1)噪声六个强度水平以及每个条件下的10个种子。在事后检测中,16个环境-算法对中的7个,主要为高维运动任务,显示出噪声强度与MVS之间显著的正单调性(Spearman rho最高达0.78,经重复测量分析确认);在训练时噪声下,16个对中的13个表现出统计显著的奖励退化。在低维环境中记录到反转现象,随机森林吸收噪声信号,导致MVS随着真实违规增加而降低,该失败模式进行了详细分析。一个实用实验表明,MVS能够正确识别部分可观测性并指导架构选择,完全恢复因非马尔可夫观测损失的性能。源代码可在https://github.com/NAVEENMN/Markovianes上获取。

English abstract

Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without tools to detect such violations. This paper introduces a prediction-based Markov Violation Score (MVS) that quantifies non-Markovian structure in observation trajectories. A random forest first removes nonlinear Markov-compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post-hoc detection, 7 of 16 environment-algorithm pairs, primarily high-dimensional locomotion tasks, show significant positive monotonicity between noise intensity and MVS (Spearman rho up to 0.78, confirmed under repeated-measures analysis); under training-time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low-dimensional environments where the random forest absorbs the noise signal, causing MVS to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that MVS correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non-Markovian observations. Source code to reproduce all results is available at https://github.com/NAVEENMN/Markovianes.

2603.20991 2026-05-08 cs.LG cs.AI cs.CL cs.LO

Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal

压缩变换器中的结构敏感性:相对误差传播与层移除

Abhinaba Basu, Kumkum Basu, Koushik Deb

Affiliations: Indian Institute of Information Technology Allahabad, India; National Institute of Electronics; Indian Institute of Technology Patna, India; Indian Institute of Information Technology Kalyani, India

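AI summary: This paper studies how compression errors accumulate across layers in compressed transformers and how to choose layers for removal. Using the rho metric to analyze error propagation, it finds that compressing early layers hurts more, proposes a rho-based layer-removal criterion that performs best when blended with a second criterion, and provides a training-free tool for compression decisions.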

AI-generated Chinese abstract

压缩变换器权重可降低大语言模型部署成本,但每层压缩引入误差,误差随信号传递累积,其累积机制不明确。本文通过测量每层输出与输入误差比rho,发现误差随层下游rho值乘积累积,预测表示漂移。通过六种变换器(117M至8B参数)实验得出三项发现:(i) 误差随层下游rho值乘积累积,解释为何早期层压缩影响更大,深度减少稀疏性计划优于均匀计划;(ii) 同一层内简单剪枝导致组件敏感性差异大,激活感知剪枝(Wanda)可缩小差异;(iii) 深度剪枝通过层rho偏离一的排名需两次前向传递,优于ShortGPT的Block Influence,物理删除可提升运行速度。结合两种标准最佳,实现14.2 perplexity和60.0%下游准确率。十二个Lean 4范数不等式提供矩阵误差界。收缩轮廓为无训练压缩决策提供工具:层内压缩位置和层移除决策。

English abstract

Compressing transformer weights makes large language models cheaper to deploy. But each layer's compression introduces an error. These errors accumulate as the signal passes through later layers, and how they accumulate is not well understood. We measure this directly: at each layer, we take the ratio of output to input error, calling it rho. A value below one means the layer absorbs the error; above one means it grows. Computing rho on six transformers (117M to 8B parameters) yields three findings. (i) Errors at layer t scale downstream by the product of later rho values, predicting representation drift (Spearman r = -0.44, p < 10^-4). This explains why compressing early layers hurts more than late ones, and why depth-decreasing sparsity schedules outperform uniform ones. Across architecture families, however, model width and redundancy matter more than rho alone. (ii) Within a layer, naive pruning shows a ~600x spread in component sensitivity. Activation-aware pruning (Wanda) shrinks this to 3-7x; the ranking reverses across architectures, so fixed importance scores do not transfer. (iii) For depth pruning, ranking layers by how far rho is from one takes two forward passes. It beats ShortGPT's Block Influence with 1.6x lower perplexity at eight layers removed, and physical deletion delivers 1.22x wall-clock speed-up. A blend of the two criteria does best (perplexity 14.2, 60.0% downstream accuracy on LLaMA-2-7B). Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds. The contraction profile thus gives a training-free instrument for two decisions: where to compress within layers, and which to remove.
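
The rho bookkeeping in this abstract can be written down in a few lines. The sketch below is an illustration rather than the authors' code: the function names are hypothetical, the error vectors are assumed to be already measured, and treating layers with rho closest to one as the first removal candidates is an assumption, since the abstract only says layers are ranked by how far rho is from one.

```python
import numpy as np

def layer_rho(err_in, err_out):
    """rho for one layer: norm of the output error over norm of the input error."""
    return float(np.linalg.norm(err_out) / np.linalg.norm(err_in))

def downstream_gain(rhos):
    """gain[t] = product of rho over layers after t (1.0 for the last layer).

    An error injected at layer t is scaled by this factor at the output.
    """
    rhos = np.asarray(rhos, dtype=float)
    gain = np.ones_like(rhos)
    for t in range(len(rhos) - 2, -1, -1):
        gain[t] = gain[t + 1] * rhos[t + 1]
    return gain

def rank_layers_by_distance_from_one(rhos):
    """Layer indices ordered by |rho - 1|, smallest first (assumed removal order)."""
    rhos = np.asarray(rhos, dtype=float)
    return np.argsort(np.abs(rhos - 1.0), kind="stable").tolist()

rhos = [1.5, 0.5, 2.0, 1.0]                        # made-up per-layer values
gains = downstream_gain(rhos)                      # [1.0, 2.0, 1.0, 1.0]
order = rank_layers_by_distance_from_one(rhos)     # [3, 0, 1, 2]
```

In this toy profile an error introduced at layer 1 is doubled by the time it reaches the output, because the only expanding layer (rho = 2.0) sits after it.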

2603.20180 2026-05-08 cs.CV cs.AI cs.CL

Adaptive Greedy Frame Selection for Long Video Understanding

自适应贪心帧选择用于长视频理解

Yuning Huang, Xiaoyu Ji, Joseph Huang, Yichi Zhang, Fengqing Zhu

Affiliations: Purdue University

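AI summary: This paper proposes an adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget; it builds a 1 FPS candidate pool, embeds candidates in two complementary spaces, and improves accuracy on long-video question answering.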

AI-generated Chinese abstract

大型视觉-语言模型(VLMs)越来越多地应用于长视频问答,但推理常受限于输入帧数和生成的视觉标记数量。朴素的稀疏采样可能遗漏关键时刻,而纯粹的相关性驱动选择常导致近似重复帧并牺牲时间上远处的证据覆盖。我们提出一种问题自适应的贪心帧选择方法,在固定帧预算下联合优化查询相关性和语义代表性。我们的方法构建一个1~FPS候选池(上限为1000),并精确对齐时间戳,将候选者嵌入两个互补空间(SigLIP用于问题相关性,DINOv2用于语义相似性),并通过贪心最大化相关性项和设施位置覆盖项的加权和来选择帧。该目标被归一化、单调且子模,从而获得标准(1-1/e)的贪心近似保证。为考虑问题依赖的可靠性与覆盖之间的权衡,我们引入四种预设策略和一个轻量级的纯文本问题类型分类器,将每个查询路由到其最佳表现的预设策略。在MLVU上的实验显示,与均匀采样和最近基线相比,在各种帧预算下均实现了稳定的准确性提升,特别是在紧预算下改进最大。

English abstract

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
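
The greedy objective in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the relevance scores and the similarity matrix are made-up inputs standing in for SigLIP and DINOv2 outputs, similarities are assumed non-negative, and the weight `lam` and the function name are hypothetical.

```python
import numpy as np

def greedy_select(relevance, sim, budget, lam=0.5):
    """Greedily maximize lam * sum_{i in S} relevance[i]
    + (1 - lam) * sum_j max_{i in S} sim[j, i] under a frame budget."""
    n = len(relevance)
    selected = []
    cover = np.zeros(n)  # cover[j] = best similarity of frame j to any selected frame
    for _ in range(min(budget, n)):
        best, best_gain = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            coverage_gain = np.sum(np.maximum(cover, sim[:, i]) - cover)
            gain = lam * relevance[i] + (1 - lam) * coverage_gain
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        cover = np.maximum(cover, sim[:, best])
    return selected

# Frames 0 and 1 are near-duplicates with high relevance; frame 2 is distinct.
relevance = np.array([0.9, 0.85, 0.5, 0.1])
sim = np.array([
    [1.00, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.20],
    [0.10, 0.10, 0.20, 1.00],
])
picked_mixed = greedy_select(relevance, sim, budget=2, lam=0.5)     # [0, 2]
picked_relevance = greedy_select(relevance, sim, budget=2, lam=1.0)  # [0, 1]
```

With `lam=1.0` the selection collapses onto the two near-duplicate frames, while the coverage term steers the second pick toward the distinct frame, which mirrors the failure mode the abstract attributes to purely relevance-driven selection.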

2603.20079 2026-05-08 cs.CL

Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues

利用与认知负荷相关的语言线索预测解释互动中的理解状态

Yu Wang, Olcay Türk, Angela Grimminger, Hendrik Buschmeier

Affiliations: Faculty of Linguistics and Literary Studies, Bielefeld University, Bielefeld, Germany; Faculty of Arts and Humanities, Paderborn University, Paderborn, Germany; SFB/Transregio 318 ‘Constructing Explainability’, Bielefeld & Paderborn, Germany

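AI summary: This study analyzes linguistic features of speakers and listeners in dialogue to predict the listener's state of understanding. Statistical analyses of the MUNDEX corpus show that cognitive load-related linguistic cues vary with the listener's level of understanding, and combining the three cues with textual features improves prediction.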

Journal ref Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), pp. 11368-11378

AI-generated Chinese abstract

我们研究了对话中说话者和听者所展现的口头和非口头语言特征如何有助于在实时基础上预测听者的理解状态。具体而言,我们考察了三种与认知负荷相关且被认为与听者理解相关联的语言线索:说话者话语的信息价值(用惊奇度量化)和句法复杂性,以及听者互动目光行为的变化。基于对MUNDEX面对面对话棋盘游戏解释语料库的统计分析,我们发现个体线索随听者理解水平的变化而变化。听者状态('理解'、'部分理解'、'不理解'和'误解')由听众通过回顾视频回忆法进行自注释。随后的分类实验结果表明,使用两种现成分类器和一个微调的德语BERT多模态分类器,能够一般性地预测这四种理解状态,并在结合三种语言线索和文本特征时效果更佳。

English abstract

We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener's state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener's level of understanding. Listener states ('Understanding', 'Partial Understanding', 'Non-Understanding' and 'Misunderstanding') were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.

2603.19715 2026-05-08 cs.AI

Neuro-Symbolic Proof Generation for Scaling Systems Software Verification

神经符号证明生成用于系统软件验证的扩展

Baoding He, Zenan Li, Wei Sun, Yuan Yao, Taolue Chen, Xiaoxing Ma, Zhendong Su

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Computer Science, Nanjing University, China; Department of Computer Science, ETH Zurich, Switzerland; School of Computing and Mathematical Sciences, Birkbeck, University of London, UK

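AI summary: This paper proposes a neuro-symbolic proof generation framework that combines large language models with interactive theorem proving tools to automate proof search for system-level verification, improving the scalability of software verification.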

Comments Published as a conference paper at OSDI'2026; code is available at https://github.com/SoaringE/seL4-proof-search

AI-generated Chinese abstract

通过交互式定理证明进行形式验证日益用于确保关键系统的正确性,但构建大型证明脚本仍高度手动,限制了可扩展性。大语言模型(LLMs)在数学推理方面的进展使其在软件验证中的整合日益具有前景。本文介绍了一种神经符号证明生成框架,旨在自动化系统级验证项目的证明搜索。该框架在证明状态上执行最佳优先树搜索,反复查询LLM获取下一步证明步骤。在神经侧,我们使用证明状态-步骤对的数据集微调LLMs;在符号侧,我们整合了多种ITP工具来修复被拒绝的步骤、过滤和排序证明状态,并在搜索进度停滞时自动解决子目标。这种协同作用使LLM适应更加数据高效,并在语义指导下修剪搜索空间。我们将其框架实现于新的Isabelle REPL上,该REPL暴露了细粒度的证明状态和自动化工具,并在FVEL seL4基准和额外的Isabelle开发中进行了评估。在seL4上,系统证明了多达77.6%的定理,显著超越了之前的LLM方法和独立的Sledgehammer,同时解决了更多多步证明。在进一步的Isabelle基准测试中,结果表明具有很强的泛化能力,表明了一条通往可扩展自动化软件验证的可行路径。

English abstract

Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for system-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further Isabelle benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.

2603.18257 2026-05-08 cs.LG cs.AI

Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

发现你可以控制的东西:强化学习中的干预边界发现

Jiaxin Liu, Anzhe Cheng, Paul Bogdan

Affiliations: University of Illinois Urbana-Champaign; University of Southern California

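AI summary: This paper proposes Interventional Boundary Discovery, which randomizes actions to create interventional contrasts and applies FDR-corrected two-sample tests over observation dimensions to identify the controllable state dimensions.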

AI-generated Chinese abstract

当RL代理的观测包含由相同混杂因素驱动的干扰项时,仅观测数据无法确定代理控制的维度。在我们的基准测试中,即使使用状态条件化的观测选择器,当干扰项模仿可控状态变量时也会失效。我们提出了干预边界发现(IBD),将代理自身的行为通道视为随机干预的来源:随机化动作实现干预对比,各维度的两样本检验结合FDR校正生成观测维度的二进制掩码。在12个连续控制设置中,最多包含100个干扰项,IBD在12个设置中的11个中达到oracle回报,而基于互信息、状态条件化前向模型和梯度敏感度的观测基线往往在将完整观测传递给SAC时表现不佳。

English abstract

When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observational selectors can collapse when distractors mimic controllable state variables. We propose Interventional Boundary Discovery (IBD), which treats the agent's own action channel as a source of randomized interventions: randomizing actions implements an interventional contrast, and per-dimension two-sample tests with FDR correction produce a binary mask over observation dimensions. Across 12 continuous-control settings with up to 100 distractors, IBD matches oracle return in 11 of 12 settings, while observational baselines including mutual information, state-conditioned forward models, and gradient-based sensitivity often underperform simply passing the full observation to SAC.
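
The masking step in this abstract (per-dimension two-sample tests followed by FDR correction) can be sketched on synthetic data. Several choices here are assumptions and not taken from the paper: the two samples stand for observation changes collected under two randomly assigned action arms, the test is a Welch z-test with a normal approximation, and the FDR procedure is Benjamini-Hochberg. Function names are hypothetical.

```python
import math
import numpy as np

def two_sample_pvalues(x_a, x_b):
    """Per-dimension two-sided p-values from a Welch z-test (normal approximation)."""
    mean_diff = x_a.mean(axis=0) - x_b.mean(axis=0)
    se = np.sqrt(x_a.var(axis=0, ddof=1) / len(x_a)
                 + x_b.var(axis=0, ddof=1) / len(x_b))
    z = mean_diff / se
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of dimensions rejected under Benjamini-Hochberg FDR control."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = pvals[order] <= thresholds
    mask = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest rank meeting its threshold
        mask[order[: k + 1]] = True
    return mask

# Synthetic contrast: dimensions 0 and 1 respond to the action arm, 2-4 do not.
rng = np.random.default_rng(0)
arm_a = rng.normal(size=(500, 5))
arm_b = rng.normal(size=(500, 5))
arm_b[:, :2] += 1.0
mask = benjamini_hochberg(two_sample_pvalues(arm_a, arm_b))

toy_mask = benjamini_hochberg([0.001, 0.2, 0.012, 0.9])
```

The resulting boolean mask plays the role of the binary mask over observation dimensions; in a pipeline it would be applied to observations before they reach the policy.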

2603.17980 2026-05-08 cs.CV

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

感知空间:面向高效准确3D场景理解的自我运动感知视频表示

Shuyao Shi, Kang G. Shin

Affiliations: Department of Computer Science; University of Michigan; University of Michigan Ann Arbor

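AI summary: This paper proposes the Motion-MLLM framework, which combines IMU data with visual features through a motion-visual keyframe filtering module and an asymmetric cross-modal fusion module to improve the efficiency and accuracy of 3D scene understanding and spatial reasoning.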

Comments 22 pages, 10 figures

AI-generated Chinese abstract

近期多模态大语言模型(MLLMs)在3D场景中的空间推理方面展现出巨大潜力。然而,它们通常依赖计算成本高昂的3D表示,如点云或重建的鸟瞰图(BEV)地图,或缺乏物理基础以解决尺度和大小的歧义。本文通过引入由惯性测量单元(IMU)与视频同步采集的自我运动模态数据,显著增强了MLLMs。具体而言,我们提出了一种名为Motion-MLLM的新型框架,引入两个关键组件:(1)级联的运动-视觉关键帧过滤模块,利用IMU数据和视觉特征高效选择稀疏但具有代表性的关键帧;(2)非对称跨模态融合模块,其中运动令牌作为中介,将自我运动线索和跨帧视觉上下文传递到视觉表示中。通过将视觉内容与物理自我运动轨迹相结合,Motion-MLLM能够推理场景中的绝对尺度和空间关系。我们的广泛评估表明,Motion-MLLM在各种3D场景理解和空间推理任务中取得了显著改进。与基于视频帧和显式3D数据的最先进(SOTA)方法相比,Motion-MLLM在准确性方面具有竞争力,同时运行速度分别快1.30倍和1.61倍。

English abstract

Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM achieves competitive accuracy while running $1.30\times$ and $1.61\times$ faster, respectively.

2603.16309 2026-05-08 cs.CL

Omnilingual MT: Machine Translation for 1,600 Languages

多语言机器翻译:支持1600种语言的机器翻译

Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Xiang "Tony" Cao, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà

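Affiliations FAIR at Meta

AI summary This paper presents Omnilingual Machine Translation (OMT), the first machine translation system supporting more than 1,600 languages. Through a comprehensive data strategy and newly created datasets, it demonstrates performance gains in massively multilingual translation and improved cross-lingual transfer.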

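Details
AI abstract (translated from Chinese)

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. Yet compared with the world's 7,000 languages, current systems still offer limited coverage: about 200 languages on the target side and perhaps a few hundred more on the source side, thanks to cross-lingual transfer. Even these figures are hard to assess because reliable benchmarks and metrics are lacking. We present Omnilingual Machine Translation (OMT), the first MT system to support more than 1,600 languages. This scale is made possible by a comprehensive data strategy that combines large multilingual corpora with newly created datasets, including the manually curated MeDLEY bitext. We explore two ways of specializing a large language model (LLM) for MT: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, our 1B to 8B parameter models match or exceed the performance of a 70B LLM baseline, revealing a clear specialization advantage and delivering strong translation quality in low-compute settings. Our evaluation of English-to-1,600-language translation further shows that although baseline models can interpret undersupported languages, they often fail to generate meaningful translations; the OMT-LLaMA models substantially expand the set of languages for which coherent generation is possible. OMT models also improve cross-lingual transfer, coming close to solving the "understanding" part of MT for the 1,600 languages. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are evolving toward omnilinguality and are freely available.

English abstract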

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundred more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the "understanding" part of the puzzle in MT for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available.

2603.16281 2026-05-08 cs.LG q-bio.NC

Laya: A LeJEPA Approach to EEG via Latent Prediction over Reconstruction

Laya: 一种基于联合嵌入预测架构的EEG方法

Saarang Panchavati, Uddhav Panchavati, Hiroki Nariai, Corey Arnold, William Speier

Affiliations Department of Medical Informatics; UCLA; Department of Cognitive Science; UCSD; David Geffen School of Medicine; Mattel Children’s Hospital

AI summary This paper proposes Laya, a LeJEPA-based EEG foundation model that predicts latent representations rather than reconstructing signals, improving the semantic structure of EEG representations and clinical accuracy.

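Details
AI abstract (translated from Chinese)

Electroencephalography (EEG) is a widely used tool for studying brain function, with applications in clinical neuroscience, diagnosis, and brain-computer interfaces (BCIs). Recent EEG foundation models trained on large unlabeled corpora aim to learn transferable representations, but their effectiveness remains unclear: reported improvements are often modest, sensitive to downstream adaptation and fine-tuning strategies, and limited under linear probing. We hypothesize that one contributing factor is the reliance on signal reconstruction as the primary self-supervised learning (SSL) objective, which biases representations toward high-variance artifacts rather than task-relevant neural structure. To address this limitation, we explore an SSL paradigm based on Joint Embedding Predictive Architectures (JEPA), which predicts latent representations instead of reconstructing raw signals. We introduce Laya, the first EEG foundation model based on LeJEPA. We show that latent prediction yields representations that encode semantic structure in EEG: Laya embeddings track clinically meaningful state changes such as seizure onset, are robust to noise, and achieve the strongest mean clinical accuracy under frozen linear probing, particularly on tasks where the relevant neural patterns are subtle and easily masked by artifacts. Controlled ablations confirm that the choice of pretraining objective, rather than architecture or data, is the main driver of these gains.

English abstract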

Electroencephalography (EEG) is a widely used tool for studying brain function, with applications in clinical neuroscience, diagnosis, and brain-computer interfaces (BCIs). Recent EEG foundation models trained on large unlabeled corpora aim to learn transferable representations, but their effectiveness remains unclear; reported improvements over smaller task-specific models are often modest, sensitive to downstream adaptation and fine-tuning strategies, and limited under linear probing. We hypothesize that one contributing factor is the reliance on signal reconstruction as the primary self-supervised learning (SSL) objective, which biases representations toward high-variance artifacts rather than task-relevant neural structure. To address this limitation, we explore an SSL paradigm based on Joint Embedding Predictive Architectures (JEPA), which learn by predicting latent representations instead of reconstructing raw signals. We introduce Laya, the first EEG foundation model based on LeJEPA. We show that latent prediction yields representations that encode semantic structure in EEG: Laya embeddings track clinically meaningful state changes such as seizure onset, are resilient to noise, and achieve the strongest mean clinical accuracy under frozen linear probing, with particular gains on tasks where relevant neural patterns are subtle and easily obscured by artifacts. Controlled ablations against matched MAE variants confirm that the choice of pretraining objective, rather than architecture or data, is the primary driver of these gains.

2603.15646 2026-05-08 cs.LG cs.AI cs.CL

Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

基于情境评分奖励的交替强化学习:超越标量化策略

Guangchen Lan, Lian Xiong, Xin Zhou, Hejie Cui, Yuwei Zhang, Mao Li, Zhenyu Shi, Besnik Fetahu, Lihong Li, Xian Li

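Affiliations Amazon

AI summary This paper proposes the ARL-RR framework, which optimizes one semantic rubric meta-class at a time to avoid fixed scalarization, improving model performance and training efficiency.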

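Details
AI abstract (translated from Chinese)

Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing methods are limited to linearly compressing vector rewards into a scalar reward with fixed weights, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that removes the need for fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretical analysis shows that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that dynamically selects the next meta-class based on task performance, letting the policy emphasize critical objectives and thereby improving model performance. Experiments on the HealthBench dataset show that ARL-RR outperforms scalarized methods in both model performance and training efficiency across model scales (1.7B, 4B, 8B, and 14B).

English abstract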

Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve model performance. Empirically, our experiments on the HealthBench dataset with expert annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).
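
The contrast between fixed scalarization and optimizing one meta-class at a time can be illustrated with a minimal sketch. It assumes each rollout is scored with a vector of per-meta-class rubric scores; the function names, the toy scores, and the round-robin schedule are hypothetical and stand in for the paper's search-based selection, which the abstract does not specify.

```python
import numpy as np

def scalarized_reward(rubric_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Fixed scalarization: collapse (n_rollouts, n_meta_classes) scores with fixed weights."""
    return rubric_scores @ weights

def alternating_reward(rubric_scores: np.ndarray, step: int) -> np.ndarray:
    """Alternating scheme (assumed round-robin): reward only the meta-class active at this step."""
    active = step % rubric_scores.shape[1]
    return rubric_scores[:, active]

# Toy rubric scores for two rollouts over three hypothetical meta-classes.
scores = np.array([[1.0, 0.0, 0.5],
                   [0.0, 1.0, 0.5]])
weights = np.array([0.5, 0.25, 0.25])

blended = scalarized_reward(scores, weights)   # one blended scalar per rollout
focused = alternating_reward(scores, step=1)   # scores of meta-class 1 only
print(blended, focused)
```

With the blended reward, the ranking of the two rollouts depends on the hand-chosen weights; with the alternating reward, each training phase ranks rollouts on a single meta-class.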

2603.14337 2026-05-08 cs.CV

On the Nature of Attention Sink that Shapes Decoding Strategy in Omni-LLMs

关于塑造 Omni-LLMs 解码策略的注意力 sink 性质

Suho Yoo, Youngjoon Jang, Joon Son Chung

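Affiliations KAIST; VGG, University of Oxford

AI summary This paper studies the behaviour of attention sinks in Omni-LLMs, finds that they do more than reflect head redundancy and serve additional functional roles, and proposes OutRo, which improves decoding through feature-space alignment and a relaxed causal mask.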

Comments Preprint

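Details
AI abstract (translated from Chinese)

This paper aims to strengthen the reasoning of Omni-LLMs at inference time without additional training. These models jointly process video, audio, and text, and given the large number of tokens they consume, how attention is routed is key to their behaviour. We focus on attention sinks, tokens that absorb a large share of attention regardless of semantic content. A systematic analysis yields two findings: (i) high sink attention does not only indicate head redundancy, suggesting that sink value representations play additional functional roles; (ii) the sink value vector acts as a shared bias added to every token's output, serving as a global signal that organises the overall representation. Building on this, we propose OutRo, which aligns non-sink token representations with the sink in feature space and relaxes the causal mask for sink tokens at an early layer to strengthen this bias. This design improves the reasoning process without additional forward passes or access to attention maps. In extensive experiments, OutRo consistently improves performance on seven video QA benchmarks and shows strong generalisation, with only a 1.1x decoding overhead.

English abstract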

The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number of tokens they consume, how attention is routed across them is central to their behaviour. We focus specifically on attention sinks, tokens that absorb a disproportionate share of attention mass regardless of their semantic content, to understand how this routing unfolds. To this end, we conduct a systematic analysis of sink behaviour in Omni-LLMs. Our analysis yields two key findings: (i) high sink attention does not solely indicate head redundancy, suggesting that sink value representations play additional functional roles; (ii) the sink value vector acts as a shared bias added to every token's output, serving as a global signal that organises the representation as a whole. Building on this, we propose OutRo, which correspondingly aligns non-sink token representations with the sink in feature space, and relaxes the causal mask for sink tokens at an early layer to sharpen this bias before the rest of decoding proceeds. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.
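
Finding (ii) can be made concrete with a small numerical sketch. In standard softmax attention, each query's output is a weighted sum of value vectors, so it splits exactly into a sink term (the attention weight on the sink times the sink's value vector) plus a remainder. The code below only illustrates that decomposition; the toy scores, the choice of key 0 as the sink, and the omission of a causal mask are assumptions, and this is not the OutRo method.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_sink_contribution(scores: np.ndarray, values: np.ndarray, sink_idx: int = 0):
    """Split attention output into the sink term and the contribution of all other keys."""
    attn = softmax(scores)                               # (n_queries, n_keys)
    sink_part = attn[:, [sink_idx]] * values[sink_idx]   # same direction for every query
    rest_part = attn @ values - sink_part
    return attn, sink_part, rest_part

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6))
scores[:, 0] += 5.0            # hypothetical: key 0 attracts attention like a sink
values = rng.normal(size=(6, 3))

attn, sink_part, rest_part = split_sink_contribution(scores, values)
print(attn[:, 0])              # share of attention mass absorbed by the sink, per query
```

Every row of `sink_part` points along the same vector `values[0]`, scaled by that query's sink weight, which is the sense in which the sink value behaves like a shared bias on all outputs.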

2603.14209 2026-05-08 cs.CV cs.AI

ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control

ChArtist:通过统一的空间和主题控制生成图像图表

Shishi Xiao, Tongyu Zhou, David Laidlaw, Gromit Yeuk-Yin Chan

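Affiliations Adobe Research; Brown University

AI summary ChArtist generates pictorial charts with unified spatial and subject control, combining skeleton-based spatial control with subject-driven control to improve data accuracy and visual aesthetics.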

Comments Project page: https://chartist-ai.github.io/

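Details
AI abstract (translated from Chinese)

A pictorial chart is an effective medium for visual storytelling that seamlessly integrates visual elements with data charts. Creating such images is challenging, however, because the flexibility of visual elements often conflicts with the rigidity of chart structures. The process requires a creative deformation that preserves both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (such as edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for automatically generating pictorial charts, offering two distinct types of control: 1) spatial control that matches the chart structure, and 2) subject-driven control based on a reference image. To this end, we introduce a skeleton-based spatial control representation. It encodes only the chart's data-encoding information, so reference visuals can be incorporated easily without a rigid outline constraint. We implement the method on the Diffusion Transformer (DiT) and use an adaptive position encoding mechanism to manage the two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support fine-tuning of pre-trained models, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of generated charts. We believe this work shows that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: https://chartist-ai.github.io/.

English abstract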

A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: https://chartist-ai.github.io/.

2603.09986 2026-05-08 cs.CL cs.AI

Quantifying Hallucinations in Language Models on Medical Textbooks

对语言模型在医学教科书中 hallucinations 的量化

Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

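Affiliations National Institutes of Health, National Library of Medicine; Department of Computer Science, University of Maryland

AI summary This paper examines how often large language models hallucinate on textbook-grounded medical QA and how responses differ across models. It finds that LLaMA-70B-Instruct hallucinated in 19.7% of answers even when responses were rated highly plausible, and that clinicians' evaluations of model responses showed high agreement.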

Comments 8 pages, 4 figures

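Details
AI abstract (translated from Chinese)

Hallucinations, the tendency of large language models to produce factually incorrect and unsupported claims, are a serious problem in natural language processing with no effective solution yet. Existing medical QA benchmarks rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur in textbook-grounded QA and how responses to medical QA prompts differ across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA under closed-source zero-shot prompting, and the second determines hallucination rates and clinician preferences among model responses. In the first experiment, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility. In the second experiment, lower hallucination rates across models were associated with higher usefulness scores (ρ=-0.71, p=0.058). Clinicians showed high agreement, with quadratic weighted κ=0.92 in experiment 1 and τ_b=0.06 to 0.18, κ=0.57 to 0.61 in experiment 2. Our findings indicate that, across all scales and architectures tested, current large language models remain unfit for unsupervised clinical deployment, and that human expert oversight is both necessary and the dominant cost driver.

English abstract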

Hallucinations, the tendency of large language models to provide responses with factually incorrect and unsupported claims, are a serious problem within natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given closed-source zero-shot prompts, and the second determines the prevalence of hallucinations and clinician preference among model responses. In experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7\% of answers (95\% CI 18.6 to 20.7) even though 98.8\% of prompt responses received maximal plausibility. In experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($ρ=-0.71$, $p=0.058$). Clinicians produced high agreement for experiments 1 and 2 (quadratic weighted $κ=0.92$, and $τ_b=0.06$ to $0.18$, $κ=0.57$ to $0.61$, respectively). Our findings indicate that, across all scales and architectures tested, current large language models remain unfit for unsupervised clinical deployment, and that human expert oversight is both necessary and the dominant cost driver.

2603.03511 2026-05-08 cs.LG cond-mat.mtrl-sci physics.chem-ph

Orbital Transformers for Predicting Wavefunctions in Time-Dependent Density Functional Theory

轨道变换器用于时间依赖密度泛函理论中波函数的预测

Xuan Zhang, Haiyang Yu, Chengdong Wang, Jacob Helwig, Shuiwang Ji, Xiaofeng Qian

Affiliations Department of Computer Science and Engineering, Texas A&M University; Department of Materials Science and Engineering, Texas A&M University; J. Mike Walker ’66 Department of Mechanical Engineering, Texas A&M University; Department of Electrical and Computer Engineering, Texas A&M University; Department of Physics and Astronomy, Texas A&M University

AI summary This paper proposes OrbEvo, built on an equivariant graph transformer architecture. It uses equivariant conditioning to encode the strength and direction of the external electric field, breaking the symmetry from SO(3) to SO(2), and uses wavefunction pooling and the density matrix as interaction methods to accurately capture the quantum dynamics of excited states under an external field.

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

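Details
AI abstract (translated from Chinese)

We aim to learn wavefunctions simulated by time-dependent density functional theory (TDDFT), which can be efficiently represented as linear combination coefficients of atomic orbitals. In real-time TDDFT, a molecule's electronic wavefunctions evolve over time in response to an external excitation, enabling first-principles prediction of physical properties such as optical absorption, electron dynamics, and high-order response. However, conventional real-time TDDFT relies on propagating all occupied states with fine time steps, which is time-consuming. We propose OrbEvo, which is based on an equivariant graph transformer architecture and learns to evolve the full electronic wavefunction coefficients across time steps. First, to account for the external field, we design an equivariant conditioning that encodes the strength and direction of the external electric field and breaks the symmetry from SO(3) to SO(2). We also design two OrbEvo models, OrbEvo-WF and OrbEvo-DM, which use wavefunction pooling and the density matrix as the interaction method, respectively. Motivated by the central role of the density functional in TDDFT, OrbEvo-DM encodes the density matrix aggregated from all occupied electronic states into feature vectors via tensor contraction, providing a more intuitive way to learn the time evolution operator. We adopt a training strategy tailored to limit the error accumulation of time-dependent wavefunctions over autoregressive rollout. To evaluate our approach, we generate TDDFT datasets consisting of 5,000 different molecules from the QM9 dataset and 1,500 molecular configurations of the malonaldehyde molecule from the MD17 dataset. Results show that our OrbEvo models accurately capture the quantum dynamics of excited states under an external field, including time-dependent wavefunctions, time-dependent dipole moments, and optical absorption spectra.

English abstract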

We aim to learn wavefunctions simulated by time-dependent density functional theory (TDDFT), which can be efficiently represented as linear combination coefficients of atomic orbitals. In real-time TDDFT, the electronic wavefunctions of a molecule evolve over time in response to an external excitation, enabling first-principles predictions of physical properties such as optical absorption, electron dynamics, and high-order response. However, conventional real-time TDDFT relies on time-consuming propagation of all occupied states with fine time steps. In this work, we propose OrbEvo, which is based on an equivariant graph transformer architecture and learns to evolve the full electronic wavefunction coefficients across time steps. First, to account for the external field, we design an equivariant conditioning to encode both the strength and direction of the external electric field and break the symmetry from SO(3) to SO(2). Furthermore, we design two OrbEvo models, OrbEvo-WF and OrbEvo-DM, using wavefunction pooling and the density matrix as the interaction method, respectively. Motivated by the central role of the density functional in TDDFT, OrbEvo-DM encodes the density matrix aggregated from all occupied electronic states into feature vectors via tensor contraction, providing a more intuitive approach to learn the time evolution operator. We adopt a training strategy specifically tailored to limit the error accumulation of time-dependent wavefunctions over autoregressive rollout. To evaluate our approach, we generate TDDFT datasets consisting of 5,000 different molecules in the QM9 dataset and 1,500 molecular configurations of the malonaldehyde molecule in the MD17 dataset. Results show that our OrbEvo model accurately captures the quantum dynamics of excited states under an external field, including time-dependent wavefunctions, time-dependent dipole moment, and optical absorption spectra.
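
The density matrix "aggregated from all occupied electronic states" has a standard textbook form in an atomic-orbital basis: D[μ, ν] = Σ_i f_i C[μ, i] C*[ν, i], summed over occupied states i with occupations f_i. The sketch below computes it for toy coefficients. The function name, the random coefficients, the orthonormal basis, and the closed-shell occupations are illustrative assumptions, and the abstract does not say how OrbEvo-DM featurizes this matrix.

```python
import numpy as np

def density_matrix(coeffs: np.ndarray, occupations: np.ndarray) -> np.ndarray:
    """D[mu, nu] = sum_i f_i * C[mu, i] * conj(C[nu, i]) over occupied states i."""
    return (coeffs * occupations) @ coeffs.conj().T

# Toy system: 3 atomic orbitals, 2 occupied states with complex coefficients,
# as arise once the wavefunctions evolve in time.
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 2)) + 1j * rng.normal(size=(3, 2))
coeffs, _ = np.linalg.qr(raw)      # orthonormal columns (assumes an orthonormal basis)
occ = np.array([2.0, 2.0])         # closed-shell occupations (assumption)

D = density_matrix(coeffs, occ)
print(np.trace(D).real)            # equals the electron count in an orthonormal basis
```

The matrix has a fixed size set by the number of atomic orbitals, regardless of how many occupied states are summed, which is one reason it is a convenient aggregate.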

2603.02087 2026-05-08 cs.CV cs.AI cs.LG

A Detection-Gated Pipeline for Robust Glottal Area Waveform Extraction and Clinical Pathology Assessment

一种用于鲁棒声门面积波形提取和临床病理评估的检测门控流程

Harikrishnan Unnikrishnan, Rita Patel

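Affiliations Orchard Robotics; Department of Speech, Language, and Hearing Sciences; Indiana University; Department of Otolaryngology–Head and Neck Surgery; Indiana University School of Medicine

AI summary This paper proposes a two-stage modular glottal area segmentation framework for high-speed videoendoscopy, targeting accuracy, generalizability, and real-time playback. It combines a YOLOv8n glottis localizer with a U-Net segmenter and uses a gating mechanism to reduce spurious segmentations during glottal closure. Trained on the GIRAFE and BAGLS datasets, it reaches a Dice Similarity Coefficient of 0.745 in cross-dataset evaluation, performing on par with or better than existing methods.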

Comments for associated code see: https://github.com/hari-krishnan/openglottal

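Details
AI abstract (translated from Chinese)

We present a fully automated, two-stage modular glottal area segmentation framework for high-speed videoendoscopy (HSV), designed for accuracy, generalizability, and real-time playback. Our detection-gated pipeline combines a YOLOv8n glottis localizer with a U-Net segmenter; the localizer defines a tight crop to ensure a consistent field of view and gates the output to reduce spurious segmentations during glottal closure. The models were trained on the GIRAFE (N=600) and BAGLS (N=55,750) datasets. Cross-dataset portability was assessed by benchmarking GIRAFE-trained models on the BAGLS test set without fine-tuning. In these evaluations, the pipeline achieved a Dice Similarity Coefficient (DSC) of 0.745 (87% of the in-domain ceiling). On in-distribution test sets, the system achieved DSCs of 0.81 (GIRAFE) and 0.856 (BAGLS), outperforming or competing with state-of-the-art methods. An exploratory clinical study of 40 subjects showed that the glottal area coefficient of variation (CV) distinguishes healthy from pathological function (p=0.006). The system processes about 35 frames per second on commodity hardware and supports uniform extraction of laryngeal kinematic measures across varying acquisition settings. Code, weights, and software are available at https://github.com/hari-krishnan/openglottal.

English abstract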

We present a fully automated, two-stage modular glottal area segmentation framework for high-speed videoendoscopy (HSV) designed for accuracy, generalizability, and real-time playback. Our detection-gated pipeline combines a YOLOv8n glottis localizer with a U-Net segmenter; the localizer defines a tight crop to ensure a consistent field of view and gates the output to reduce spurious segmentations during glottal closure. The models were trained on the GIRAFE (N=600) and BAGLS (N=55,750) datasets. Cross-dataset portability was evaluated by benchmarking GIRAFE-trained models on the BAGLS test set without fine-tuning. In these evaluations, the pipeline achieved a Dice Similarity Coefficient (DSC) of 0.745 (87% of the in-domain ceiling). On in-distribution test sets, the system achieved DSCs of 0.81 (GIRAFE) and 0.856 (BAGLS), outperforming or competing with state-of-the-art methods. An exploratory clinical study of 40 subjects demonstrated that the glottal area Coefficient of Variation (CV) distinguished healthy from pathological function (p=0.006). The system processes ~35 frames per second on commodity hardware, enabling interactive clinical review. This design supports uniform extraction of laryngeal kinematic measures across varying acquisition settings. Code, weights, and software are available at https://github.com/hari-krishnan/openglottal.
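
The quantities named in the abstract can be sketched in a few lines. The code below shows the standard Dice Similarity Coefficient, a coefficient of variation over a glottal area waveform, and one plausible reading of output gating (return an empty mask when the detector finds no glottis). The function names, the empty-mask convention, and the use of the population standard deviation are assumptions, not the authors' implementation.

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty; scored as perfect agreement (assumed convention)
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def gated_mask(mask: np.ndarray, glottis_detected: bool) -> np.ndarray:
    """Detection gating (assumed form): suppress the segmentation when nothing is detected."""
    return mask if glottis_detected else np.zeros_like(mask)

def coefficient_of_variation(area_waveform: np.ndarray) -> float:
    """CV = standard deviation / mean of the per-frame glottal area."""
    return float(np.std(area_waveform) / np.mean(area_waveform))

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print(dice(pred, target))
print(coefficient_of_variation(np.array([2.0, 4.0, 6.0])))
print(round(0.745 / 0.856, 2))  # 0.87
```

The last line reproduces the reported "87% of the in-domain ceiling" if the ceiling is taken to be the BAGLS in-distribution DSC of 0.856, which is an inference from the numbers in the abstract.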

2603.00117 2026-05-08 cs.RO cs.AI

PEPA: a Persistently Autonomous Embodied Agent with Personalities

PEPA:具有个性的持续自主体

Kaige Liu, Yang Li, Lijun Zhu, Weinan Zhang

Affiliations School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Shanghai Innovation Institute; School of Computer Science, Shanghai Jiao Tong University; State Key Laboratory of Intelligent Manufacturing Equipment and Technology, Huazhong University of Science and Technology

AI summary This paper proposes the PEPA framework, which achieves persistent autonomy through a three-layer cognitive architecture, uses personality traits to autonomously generate goals and sustain behavioral evolution, and is shown experimentally to operate stably in dynamic environments.

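Details
AI abstract (translated from Chinese)

Living organisms achieve persistent autonomy through internally generated goals and self-sustaining behavioral organization, yet current embodied agents still depend on externally scripted tasks. This paper proposes personality traits as an intrinsic organizational principle, realizing autonomous goal generation and behavioral evolution through a three-layer cognitive architecture. Experiments show that PEPA, deployed in a multi-floor office building, autonomously arbitrates between user requests and personality-driven motivations and exhibits stable behavioral traits.

English abstract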

Living organisms exhibit persistent autonomy through internally generated goals and self-sustaining behavioral organization, yet current embodied agents remain driven by externally scripted objectives. This dependence on predefined task specifications limits their capacity for long-term deployment in dynamic, unstructured environments where continuous human intervention is impractical. We propose that personality traits provide an intrinsic organizational principle for achieving persistent autonomy. Analogous to genotypic biases shaping biological behavioral tendencies, personalities enable agents to autonomously generate goals and sustain behavioral evolution without external supervision. To realize this, we develop PEPA, a three-layer cognitive architecture that operates through three interacting systems: Sys3 autonomously synthesizes personality-aligned goals and refines them via episodic memory and daily self-reflection; Sys2 performs deliberative reasoning to translate goals into executable action plans; Sys1 grounds the agent in sensorimotor interaction, executing actions and recording experiences. We validate the framework through real-world deployment on a quadruped robot in a multi-floor office building. Operating without reliance on fixed task specifications, the robot autonomously arbitrates between user requests and personality-driven motivations, navigating elevators and exploring environments accordingly. Quantitative analysis across five distinct personality prototypes demonstrates stable, trait-aligned behaviors. The results confirm that personality-driven cognitive architectures enable sustained autonomous operation characteristic of persistent embodied systems. Code and demo videos are available at https://sites.google.com/view/pepa-persistent/.

2602.22710 2026-05-08 cs.SD cs.AI cs.HC

Same Words, Different Judgments: How Preferences Vary Across Modalities

相同词语,不同判断:偏好如何在不同模态间变化

Aaron Broukhim, Nadir Weibel, Eshin Jolly

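Affiliations Department of Computer Science and Engineering; University of California San Diego; Department of Psychology

AI summary The study examines how preference annotation differs between text and speech. It finds that about 9 raters are needed to reach good agreement, that audio raters differ markedly in decision thresholds, length bias, and user-oriented criteria, and that synthetic ratings can effectively predict inter-rater agreement. It argues that evaluating audio preference data requires modality-specific design.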

Comments Submitted to NeurIPS 2026 for review

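Details
AI abstract (translated from Chinese)

Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems with human preferences. However, evaluation protocols for such data were originally designed for text and have not been validated for speech. This paper presents the first ICC-based, controlled cross-modal study, comparing text and audio evaluations of identical semantic content across 100 prompts. The study shows that reaching good agreement within either modality (ICC(2,k)≈0.80) requires about 9 raters. At the same time, the modalities differ markedly in how people report preferences: audio raters show narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, and cross-modality agreement is near chance. The study demonstrates that synthetic ratings can effectively predict inter-rater agreement, serving as an early signal for stimulus selection and a proxy for human annotations. These findings indicate that evaluation protocols for audio preference data need modality-specific design rather than direct adaptation from text.

English abstract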

Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences. However, evaluation protocols for such data were designed for text and have not been validated for speech. We present the first ICC-based, controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. We show that achieving $\textit{good}$ agreement within either modality (ICC(2,$k$) $\approx$ .80) requires $\sim$9 raters. At the same time, modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. We demonstrate that synthetic ratings can be used to effectively predict inter-rater agreement, thus serving as an early signal for stimulus selection and proxy for human annotations. Together, these findings argue that evaluation protocols for audio preference data require modality-specific design rather than direct adaptation from text.
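
The relation between the number of raters and averaged-rating reliability follows the Spearman-Brown formula, which links the single-rater ICC(2,1) to the k-rater ICC(2,k). The sketch below uses it to back out what the reported figures imply. The function names are hypothetical, and the single-rater value it prints is a back-calculation from "ICC(2,k) ≈ .80 with about 9 raters", not a number reported in the abstract.

```python
def icc_k_from_single(icc_single: float, k: int) -> float:
    """Spearman-Brown: reliability of the mean of k raters given single-rater reliability."""
    return k * icc_single / (1.0 + (k - 1) * icc_single)

def single_from_icc_k(icc_k: float, k: int) -> float:
    """Inverse of Spearman-Brown: single-rater reliability implied by a k-rater value."""
    return icc_k / (k - (k - 1) * icc_k)

def raters_needed(icc_single: float, target: float) -> float:
    """Number of raters required for the averaged rating to reach the target reliability."""
    return target * (1.0 - icc_single) / (icc_single * (1.0 - target))

implied_single = single_from_icc_k(0.80, 9)
print(implied_single)                       # roughly 0.31 per individual rater
print(raters_needed(implied_single, 0.80))  # recovers 9 raters
```

Under these assumptions, an individual rater's reliability would be only about 0.31, which illustrates why a single annotation per item is a weak basis for preference data.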

2602.19202 2026-05-08 cs.CV

UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

UniE2F: 一种基于视频基础模型的统一扩散框架用于事件到帧重建

Gang Xu, Zhiyu Zhu, Junhui Hou

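Affiliations Department of Computer Science, City University of Hong Kong; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Department of Computer Science, City University of Hong Kong (Dongguan)

AI summary This paper proposes the UniE2F framework, which uses the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data, improves reconstruction accuracy with event-based inter-frame residual guidance, and extends to zero-shot video frame interpolation and prediction.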

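Details
AI abstract (translated from Chinese)

Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, because they record only relative intensity changes rather than absolute intensity, the resulting data streams lose substantial spatial information and static texture detail. This paper reconstructs high-fidelity video frames from sparse event data by leveraging the generative prior of a pre-trained video diffusion model. We first establish a baseline model by directly using event data as the condition for video synthesis. Then, based on the physical correlation between the event stream and video frames, we introduce event-based inter-frame residual guidance to improve the accuracy of frame reconstruction. We further extend the method to zero-shot video frame interpolation and prediction by modulating the reverse diffusion sampling process, creating a unified event-to-frame reconstruction framework. Experiments show that our method outperforms previous approaches on both real-world and synthetic datasets. We also refer reviewers to the video demo in the supplementary material for video results. The code will be made public at https://github.com/CS-GangXu/UniE2F.

English abstract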

Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.

2602.15827 2026-05-08 cs.RO cs.AI cs.LG cs.SY eess.SY

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

感知型人形扑动:通过动作匹配链式动态人类技能

Zhen Wu, Xiaoyu Huang, Lujie Yang, Yuanhang Zhang, Xi Chen, Pieter Abbeel, Rocky Duan, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, C. Karen Liu

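Affiliations Amazon FAR; UC Berkeley; CMU; Stanford University

AI summary This paper proposes the Perceptive Humanoid Parkour framework, which uses motion matching and reinforcement learning to let humanoid robots autonomously perform long-horizon, vision-based parkour in complex environments, demonstrating highly dynamic parkour skills and multi-obstacle traversal.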

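Details
AI abstract (translated from Chinese)

Although recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. This paper presents Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour. Our approach first uses motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. The framework allows flexible composition of and smooth transitions between complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions and distill them into a single depth-based, multi-skill student policy using a combination of DAgger and RL. Crucially, combining perception with skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault, or roll off obstacles of varying geometries and heights. We validate the framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing obstacles up to 1.25 m tall (96% of the robot's height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.

English abstract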

While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault or roll off obstacles of varying geometries and heights. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.
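
Motion matching "formulated as nearest-neighbor search in a feature space" reduces, in its simplest form, to picking the database entry whose feature vector is closest to a query built from the current state and the desired motion. The sketch below shows only that lookup. The three-dimensional features, the skill labels, and the Euclidean metric are made up for illustration, since the abstract does not describe PHP's actual features.

```python
import numpy as np

def match_next_clip(query: np.ndarray, features: np.ndarray) -> int:
    """Return the index of the database entry closest to the query in feature space."""
    dists = np.linalg.norm(features - query, axis=1)  # Euclidean distance to every entry
    return int(np.argmin(dists))

# Hypothetical feature database: one row per candidate motion clip.
features = np.array([[0.0, 0.0, 1.0],
                     [1.0, 0.2, 0.0],
                     [0.9, 0.8, 0.1]])
skill_names = ["walk", "step_over", "vault"]

query = np.array([1.0, 0.7, 0.0])   # hypothetical encoding of the desired next motion
best = match_next_clip(query, features)
print(skill_names[best])            # prints "vault"
```

Repeating this lookup as the query changes is what lets short skill clips be chained into a long trajectory with smooth transitions.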

2602.13670 2026-05-08 cs.LG

Advancing Analytic Class-Incremental Learning through Vision-Language Calibration

通过视觉-语言校准推进分析类增量学习

Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang

Affiliations School of Computer Science and Technology, Harbin Institute of Technology, China; CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary This paper proposes the VILA framework, which uses a two-level vision-language calibration strategy to address representation rigidity in analytic class-incremental learning with pre-trained models, improving learning efficiency and stability and performing strongly on multiple benchmarks.

Comments 20 pages, 11 figures, 9 tables. Accepted by ICML2026

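Details
AI abstract (translated from Chinese)

Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. Although analytic learning enables rapid, recursive closed-form updates, accumulated errors and feature incompatibility often undermine its effectiveness. This paper systematically studies the failure modes of PTM-based analytic CIL and identifies representation rigidity as the main bottleneck. Motivated by this, we propose VILA, a novel dual-branch framework that advances analytic CIL through a two-level vision-language calibration strategy. Specifically, at the feature level we use geometric calibration to fuse plastic, task-adapted features with a frozen, universal visual anchor, and at the decision level we use cross-modal semantic priors to correct prediction bias. This combination retains the extreme efficiency of analytic learning while overcoming its inherent brittleness. Extensive experiments on eight benchmarks show that VILA consistently performs strongly, particularly in fine-grained and long-sequence scenarios. Our framework balances high-fidelity prediction with the simplicity of analytic learning. Our code is available at https://github.com/byzhaoAI/VILA.

English abstract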

Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy is often compromised by accumulated errors and feature incompatibility. In this paper, we first conduct a systematic study to dissect the failure modes of PTM-based analytic CIL, identifying representation rigidity as the primary bottleneck. Motivated by this insight, we propose VILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Specifically, we coherently fuse plastic, task-adapted features with a frozen, universal visual anchor at the feature level through geometric calibration, and leverage cross-modal semantic priors at the decision level to rectify prediction bias. This confluence maintains analytic-learning's extreme efficiency while overcoming its inherent brittleness. Extensive experiments across eight benchmarks demonstrate that VILA consistently yields superior performance, particularly in fine-grained and long-sequence scenarios. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning. Our code is available at https://github.com/byzhaoAI/VILA.
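
The "rapid, recursive closed-form updates" of analytic learning are commonly realized as recursive ridge regression on frozen features: each new batch updates an inverse autocorrelation matrix and the classifier weights without revisiting old data. The sketch below shows that generic recursion, not VILA itself. The function names, the regularizer `gamma`, and the fixed number of output classes are simplifying assumptions.

```python
import numpy as np

def init_state(dim: int, n_classes: int, gamma: float = 1.0):
    """Start from the ridge prior: R = (gamma * I)^-1 and zero classifier weights."""
    return np.eye(dim) / gamma, np.zeros((dim, n_classes))

def analytic_update(R: np.ndarray, W: np.ndarray, X: np.ndarray, Y: np.ndarray):
    """One recursive least-squares step on a new batch (X: n x dim, Y: n x n_classes)."""
    K = np.linalg.inv(np.eye(X.shape[0]) + X @ R @ X.T)
    R_new = R - R @ X.T @ K @ X @ R            # Woodbury update of (X^T X + gamma I)^-1
    W_new = W + R_new @ X.T @ (Y - X @ W)      # correct the weights with the new residual
    return R_new, W_new

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(5, 4)), rng.normal(size=(6, 4))   # frozen features, two phases
Y1 = np.eye(3)[rng.integers(0, 3, size=5)]                  # one-hot labels
Y2 = np.eye(3)[rng.integers(0, 3, size=6)]

R, W = init_state(dim=4, n_classes=3, gamma=0.5)
R, W = analytic_update(R, W, X1, Y1)
R, W = analytic_update(R, W, X2, Y2)
```

After both phases, `W` equals the ridge solution computed on all the data at once, which is the property that makes such updates immune to forgetting for a fixed feature extractor.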

2602.13310 2026-05-08 cs.CV cs.AI

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

视觉并行思考器:用于视觉理解的分而治之推理

Haoran Xu, Hongyu Wang, Jiaze Li, Shunpeng Chen, Zizhao Tong, Jianzhong Ju, Zhenbo Luo, Jian Luan

Affiliations * Zhejiang University; Hunan University; University of Chinese Academy of Sciences; Independent Researcher

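AI summary This paper proposes Visual Para-Thinker, a parallel reasoning framework that improves visual comprehension. It combines Pa-Attention and LPRoPE for efficient multimodal processing and validates the effectiveness of parallel reasoning in the visual domain.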

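Details
AI-generated abstract

Existing test-time scaling laws for LLMs emphasize the emergence of self-reflective behavior through longer reasoning. However, this vertical scaling strategy often hits a bottleneck when the model becomes locked into a specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. Extending this paradigm to the visual domain nevertheless remains an open question. This paper first examines the role of visual partitioning in parallel reasoning and proposes two distinct strategies. Building on this, we introduce Visual Para-Thinker, the first parallel reasoning framework for MLLMs. To preserve path independence and promote reasoning diversity, our method integrates Pa-Attention with LPRoPE. Using the vLLM framework, we develop a native multimodal implementation for efficient parallel processing. Experiments on benchmarks such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.

English abstract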

Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into a specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to the visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, the first parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.

2602.11229 2026-05-08 cs.AI cs.LG

Latent Generative Solvers for Generalizable Long-Term Physics Simulation

潜在生成求解器用于通用的长期物理模拟

Zituo Chen, Sili Deng

Affiliations * Massachusetts Institute of Technology

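AI summary This paper proposes the Latent Generative Solver (LGS), which combines a physics variational autoencoder, a Pyramidal Flow-Forcing Transformer, and input noising during training to generalize across heterogeneous PDE families and remain stable over long rollouts, improving physics simulation performance.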

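Details
AI-generated abstract

Reliable physics simulation requires both generalization across heterogeneous PDE families and stability under long autoregressive rollouts, which existing neural PDE solvers cannot deliver together. This paper proposes the Latent Generative Solver (LGS), which comprises three coupled components: (i) a physics variational autoencoder that compresses twelve PDE families into a shared latent manifold; (ii) a Pyramidal Flow-Forcing Transformer that generates the next latent state by flow matching, conditioned on trajectory context; and (iii) input noising during training, for which a derived sufficient-condition contraction bound explains the observed long-horizon stability. After pretraining on a 2.5M-trajectory, 16-system dataset, LGS matches the strongest deterministic baseline at one step, wins on 15/16 systems at 5- and 10-step rollouts, cuts 20-step L2RE from 56.1% to 30.2%, and uses 13-77× less recurrent dynamics-step compute. It also adapts efficiently to a 256² Kolmogorov flow, with 1-step L2RE dropping from 0.398 to 0.129 after five finetune epochs, compared with U-AFNO's 0.653→0.343.

English abstract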

Reliable physics simulation demands two capabilities that today's neural PDE solvers do not deliver together: generalization across heterogeneous PDE families, and stability under long autoregressive rollouts. Deterministic operators accumulate error geometrically, while existing probabilistic solvers are confined to a single PDE family or short horizons. We close this gap with the Latent Generative Solver (LGS), three coupled components: (i) a Physics VAE (PhyVAE) compressing twelve PDE families into a shared latent manifold; (ii) a Pyramidal Flow-Forcing Transformer (PFlowFT) that generates the next latent by flow matching, conditioned on a per-trajectory context updated on the model's own predictions; and (iii) input noising during training, for which we derive a sufficient-condition contraction bound explaining the observed long-horizon stability. Pretrained on a 2.5M-trajectory, 16-system corpus at 128², LGS matches the strongest deterministic baseline at one step, wins on 15/16 systems at both 5- and 10-step rollout, cuts 20-step L2RE from 56.1% to 30.2%, and uses 13-77× less recurrent dynamics-step compute. It also adapts efficiently to a 256² Kolmogorov flow held out from the pretraining corpus, dropping 1-step L2RE from 0.398 to 0.129 in five finetune epochs against U-AFNO's 0.653→0.343.

2602.11183 2026-05-08 cs.RO cs.CV cs.SY eess.SY

Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering

通过记忆增强的卡尔曼滤波缓解连续导航中的误差累积

Yin Tang, Jiawei Ma, Jinrui Zhang, Alex Jinpeng Wang, Deyu Zhang

Affiliations * Big Data Institute, Central South University, Changsha, China (work done while working at CityUHK as a visiting scholar); Department of Computer Science & Institute of Digital Medicine, City University of Hong Kong, Hong Kong, China; School of Computer Science, Central South University, Changsha, China

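AI summary This paper proposes the NeuroKalman framework, which mitigates state drift in continuous navigation through a prior prediction step and a likelihood correction step. Experiments show strong performance on the TravelUAV benchmark.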

Comments ICML 2026 Camera Ready

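Details
AI-generated abstract

Continuous navigation in complex environments is critical for unmanned aerial vehicles (UAVs). However, existing vision-language navigation (VLN) models follow dead reckoning, iteratively updating the position for the next waypoint and then constructing the complete trajectory. This stepwise approach inevitably accumulates position error over time, causing the internal belief to diverge from the objective coordinates, known as "state drift", which ultimately degrades prediction of the full trajectory. Inspired by classical control theory, this paper treats such sequential prediction as a recursive Bayesian state estimation problem. We design the NeuroKalman framework, which decomposes navigation into two complementary processes: a prior prediction based on motion dynamics and a likelihood correction obtained from historical observations. We first mathematically relate kernel density estimation of the measurement likelihood to an attention-based retrieval mechanism, which allows the system to correct the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on the TravelUAV benchmark show that, with fine-tuning on only 10% of the training data, our method clearly outperforms strong baselines and effectively regulates drift accumulation.

English abstract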

Continuous navigation in complex environments is critical for Unmanned Aerial Vehicles (UAVs). However, existing Vision-Language Navigation (VLN) models follow dead reckoning, iteratively updating the position for the next waypoint prediction and subsequently constructing the complete trajectory. This stepwise approach inevitably accumulates position errors over time, resulting in misalignment between the internal belief and the objective coordinates, which is known as "state drift" and ultimately compromises the full trajectory prediction. Drawing inspiration from classical control theory, we propose to correct for errors by formulating such sequential prediction as a recursive Bayesian state estimation problem. In this paper, we design NeuroKalman, a novel framework that decouples navigation into two complementary processes: a Prior Prediction based on motion dynamics, and a Likelihood Correction from historical observations. We first mathematically associate Kernel Density Estimation of the measurement likelihood with the attention-based retrieval mechanism, which then allows the system to rectify the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on the TravelUAV benchmark demonstrate that, with fine-tuning on only 10% of the training data, our method clearly outperforms strong baselines and regulates drift accumulation.
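
Illustrative note (not from the paper): one well-known way to relate kernel density estimation to attention is that normalized Gaussian-kernel weights over stored keys coincide with softmax dot-product attention weights when all keys have the same norm. The sketch below checks this numerically. It is not NeuroKalman's derivation or code. The function names, the bandwidth value, and the random "historical anchors" are assumptions made for this example.

```python
import numpy as np


def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def kernel_weights(query, keys, bandwidth):
    """Normalized Gaussian-kernel weights of the query against each key."""
    sq_dist = ((keys - query) ** 2).sum(axis=1)
    return softmax(-sq_dist / (2.0 * bandwidth ** 2))


def attention_weights(query, keys, bandwidth):
    """Softmax dot-product attention with temperature bandwidth**2."""
    return softmax(keys @ query / bandwidth ** 2)


def demo_kernel_attention(seed=0):
    rng = np.random.default_rng(seed)
    keys = rng.normal(size=(5, 4))
    # Assumption: unit-norm keys, so ||k||^2 is constant and cancels in the softmax.
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    values = rng.normal(size=(5, 3))     # stand-in for stored historical anchors
    query = rng.normal(size=4)
    w_kernel = kernel_weights(query, keys, bandwidth=0.5)
    w_attn = attention_weights(query, keys, bandwidth=0.5)
    readout = w_attn @ values            # retrieval needs no gradient step
    return w_kernel, w_attn, readout


if __name__ == "__main__":
    w_kernel, w_attn, _ = demo_kernel_attention()
    print("weights agree:", np.allclose(w_kernel, w_attn))
```

The agreement follows from expanding the squared distance: the query-norm term is shared by every key and the key-norm term is constant under the unit-norm assumption, so only the dot product survives normalization.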

2602.07974 2026-05-08 cs.LG

Structural Learning Theory: A Metric-Topology Factorization Approach

结构学习理论:一种度量-拓扑分解方法

Xin Li

Affiliations * Department of Computer Science, University at Albany

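AI summary This paper proposes Structural Learning Theory, which addresses learning in multi-context, non-stationary environments through a metric-topology factorization. It introduces the notion of width and the contractive-similarity operator, and decomposes learning into trap discovery and funnel generalization.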

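Details
AI-generated abstract

Learning in structured, multi-context, or non-stationary environments involves two orthogonal difficulties. The first is metric: once the correct context is known, how hard is prediction within it? This is the domain of statistical learning theory (SLT). The second is structural: how many local contexts are needed, and how can they be discovered from data? This paper develops Structural Learning Theory (StrLT) for the structural axis. We introduce width, the minimum number of jointly contractive, low-risk cells needed to cover a learning problem. Width is incomparable with VC dimension: either can diverge while the other stays bounded. We show that width induces a phase transition: if the allocated number of cells K < w, learning suffers an irreducible structural error floor; if K ≥ w, the problem reduces to ordinary within-cell statistical learning. To estimate width, we introduce the contractive-similarity (CS) operator, a task-adaptive graph kernel that combines geometric locality with predictive compatibility. Its CS Laplacian exposes contractive basins through spectral separation. We further develop the metric slingshot, which reuses low-dimensional latent contraction maps to reduce funnel-learning cost. Together, width, CS estimation, and the slingshot decompose learning into trap discovery and funnel generalization, with deep implications for continual and lifelong learning in open-ended environments.

English abstract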

Learning in structured, multi-context, or non-stationary environments involves two orthogonal difficulties. The first is metric: once the correct context is known, how hard is prediction within it? This is the domain of Statistical Learning Theory (SLT). The second is structural: how many local contexts are required, and how can they be discovered from data? This paper develops Structural Learning Theory (StrLT) for the structural axis. We introduce width, the minimum number of jointly contractive and low-risk cells needed to cover a learning problem. Width is incomparable with VC dimension: either can diverge while the other remains bounded. We show that width induces a phase transition: if the allocated number of cells K < w, learning suffers an irreducible structural error floor; if K ≥ w, the problem reduces to ordinary within-cell statistical learning. To estimate width, we introduce the contractive-similarity (CS) operator, a task-adaptive graph kernel combining geometric locality with predictive compatibility. Its CS Laplacian exposes contractive basins through spectral separation. We further develop the metric slingshot, which reuses low-dimensional latent contraction maps to reduce funnel-learning cost. Together, width, CS estimation, and the slingshot decompose learning into trap discovery and funnel generalization, with deep implications for continual and lifelong learning in an open-ended environment.