arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.22616 2026-05-29 cs.CL

Chinese sensorimotor and embodiment norms for 3,000 lexicalized concepts

Jing Chen, Gábor Parti, Yin Zhong, Chu-Ren Huang, Marco Marelli

Affiliations Department of Psychology, University of Milano-Bicocca; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University; Center for Language Education, The Hong Kong University of Science and Technology

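AI summary This study provides 11-dimensional sensorimotor ratings and unidimensional embodiment ratings for 3,000 lexicalized Chinese concepts, verifies their high reliability and validity, and finds that sensorimotor information facilitates lexical processing and is partially recoverable from linguistic representations.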

English abstract

Understanding how conceptual knowledge is grounded in bodily experience, and to what extent machine systems can acquire such knowledge without direct sensorimotor experience, are central questions in both cognitive science and embodied artificial intelligence research. Large-scale normative resources are essential for investigating these questions empirically, yet such resources remain sparse for non-Indo-European languages. We present a novel normative database for 3,000 lexicalized concepts in Mandarin Chinese, comprising 11-dimensional sensorimotor ratings and unidimensional embodiment ratings collected from 378 native Mandarin speakers. The ratings demonstrate high reliability and strong cross-norm validity with existing Chinese resources, each of which covers fewer words and a subset of the 11 sensorimotor dimensions. In a validation study, we tested new variables derived from a theoretically motivated metric, Perceptual Strength of Embodiment (PSE) (Huang et al., 2025), together with seven common composite variables, on lexical decision tasks. The results suggest that PSE-Sensorimotor and Minkowski-3 are the strongest composite predictors of lexical decision performance, capturing the facilitatory effects of sensorimotor information on lexical processing. A further exploratory study showed that sensorimotor ratings are substantially recoverable from purely linguistic representations using simple regression models (mean Spearman r = .62 across dimensions), though recovery varied markedly: visual and auditory dimensions yielded higher correspondence than chemosensory ones. Representational similarity analysis further showed that the relational geometry of the sensorimotor space is also partially recoverable (r = .540), consistent with the view that distributional language use encodes aspects of embodied conceptual structure.
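The Minkowski-3 composite mentioned above is usually understood as the Minkowski norm with exponent 3 of a word's rating vector. The following is a minimal sketch under that assumption; the function name and the example ratings are hypothetical and are not taken from the database.

```python
def minkowski_composite(ratings, p=3):
    """Collapse a vector of non-negative dimension ratings into one score.

    p=1 gives the plain sum, large p approaches the maximum rating,
    and p=3 sits in between, so dominant dimensions count most.
    """
    if any(r < 0 for r in ratings):
        raise ValueError("ratings are assumed to be non-negative")
    return sum(r ** p for r in ratings) ** (1.0 / p)


# Hypothetical 11-dimensional ratings for one word (values are made up).
example = [4.5, 0.5, 1.0, 0.2, 0.1, 2.0, 3.0, 0.5, 0.3, 1.5, 0.4]
score = minkowski_composite(example)
print(round(score, 3))
```

The score always lies between the largest single rating and the sum of all ratings, which is why it is often read as a compromise between "strongest dimension" and "total strength" composites.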

2605.22255 2026-05-29 cs.CV cs.IR

Direct content-based retrieval from music scores images

Noelia Luna-Barahona, Antonio Ríos-Vila, Félix Fuentes-Hurtado, David Rizo, Jorge Calvo-Zaragoza

Affiliations Pattern Recognition and Artificial Intelligence Group, University of Alicante; Instituto Superior de Enseñanzas Artísticas de la Comunidad Valenciana

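AI summary Studies content-based retrieval from music score images, comparing transcription-based methods built on Optical Music Recognition (OMR), a transcription-free Transformer model, and a text-prompted Large Language Model across different datasets.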

Comments 17 pages (14 pages + references), 3 figures (with subfigures)

English abstract

The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content-based search in music score images remains underexplored compared to text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images, ranging from transcription-based approaches relying on Optical Music Recognition (OMR), to a transcription-free Transformer model trained to recognize queries directly from score images, and a text-prompted Large Language Model. Our experiments evaluate these models on four corpora exhibiting diverse characteristics in terms of dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval, whereas transcription-free models handle domain variability more effectively.

2605.20752 2026-05-29 cs.RO

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

Zijian Zhang, Yuqing Jiang, Qian Cheng, Xiaofan Li, Si Liu, Ding Zhao, Ping Luo, Weitao Zhou, Haibao Yu

Affiliations University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Tsinghua University; Zhejiang University; Beihang University; Carnegie Mellon University; The University of Hong Kong

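AI summary Proposes GaussianDream, a feed-forward 3D Gaussian world-model plug-in that uses learnable queries to encode current-frame 3D spatial structure and short-horizon future evolution. Training is supervised through a static reconstruction head and a future prediction head; at inference only the queries are kept to condition action generation. The method reaches state-of-the-art performance on multiple robotic manipulation benchmarks.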

Comments 19 pages, 9 figures

English abstract

Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. However, standard action-imitation learning often lacks sufficient modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution, all critical for precise robotic interaction. To address this, we propose GaussianDream, a feed-forward 3D Gaussian world-model plug-in. Specifically, we introduce learnable GaussianDream Queries in the encoder, enabling the model to capture current-frame 3D spatial structure and short-horizon future evolution. During training, the latent GaussianDream prefix is processed by a static reconstruction head and a future prediction head to produce current 3D Gaussian scene states and future Gaussian evolution states. The current branch is supervised by RGB rendering and depth, while the future branch uses future RGB, depth, and pseudo 3D scene-flow signals. During inference, GaussianDream discards all auxiliary heads and retains only the learned prefix to condition action generation, without test-time Gaussian reconstruction or future prediction. Experimental results demonstrate that GaussianDream achieves state-of-the-art performance across multiple robotic manipulation benchmarks, reaching 98.4% on LIBERO, 54.8% on RoboCasa Human-50, and 50.0% on real-robot tasks. Compared with existing 3D-enhanced VLA methods, GaussianDream achieves strong accuracy while providing higher inference efficiency than video-based world-model approaches.

2605.20612 2026-05-29 cs.LG

Matryoshka Concept Bottleneck Models

Ziye Chen, Hongbin Lin, Jie Li, Lijie Hu

Affiliations Mohamed bin Zayed University of Artificial Intelligence; The Hong Kong University of Science and Technology (Guangzhou)

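AI summary Proposes the Matryoshka Concept Bottleneck Model (MCBM), which uses a nested hierarchy for adaptive concept utilization and reduces the expected intervention cost from linear to logarithmic order, O(log K), while guaranteeing monotonic performance improvement.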

English abstract

Concept Bottleneck Models (CBMs) have emerged as a prominent paradigm for interpretable deep learning, learning by grounding predictions in human-understandable concepts. However, their practical deployment is hindered by the high cost of test-time intervention, as correcting model errors typically requires human experts to manually inspect and verify a large set of predicted concepts. Existing approaches suffer from a fundamental structural limitation: they either adopt a single static concept set, forcing experts to exhaustively annotate concepts and incurring prohibitive intervention costs, or train multiple models tailored to different concept budgets, resulting in substantial computational and maintenance overhead. To address this challenge, we propose the Matryoshka Concept Bottleneck Model (MCBM), a unified architecture that enables adaptive concept utilization within a single model. Inspired by Matryoshka Representation Learning, MCBM organizes concepts into a nested hierarchy based on maximum relevance and minimum redundancy, allowing inference at multiple levels of conceptual granularity without retraining. Theoretically, we show that MCBM reduces the expected intervention costs from linear to logarithmic order, $O(\log K)$, while guaranteeing monotonic performance improvement. Empirically, extensive experiments demonstrate that MCBM matches the performance of independently trained models while enabling dynamic and efficient expert interaction.
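To make the idea of a nested concept hierarchy built on maximum relevance and minimum redundancy more concrete, here is a minimal sketch. It is not the authors' MCBM: the greedy scoring rule, the doubling prefix sizes, and all numbers are assumptions chosen only to show how roughly log2(K) nested levels can be formed from K concepts.

```python
def mrmr_order(relevance, redundancy):
    """Greedily order concepts by relevance minus mean redundancy
    with the concepts that were already chosen."""
    remaining = list(range(len(relevance)))
    chosen = []
    while remaining:
        def gain(c):
            if not chosen:
                return relevance[c]
            penalty = sum(redundancy[c][s] for s in chosen) / len(chosen)
            return relevance[c] - penalty
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen


def nested_levels(order):
    """Prefixes of doubling size (1, 2, 4, ...), followed by the full list."""
    levels, size = [], 1
    while size < len(order):
        levels.append(order[:size])
        size *= 2
    levels.append(order[:])
    return levels


# Hypothetical scores: concepts 0 and 1 are both relevant but redundant.
relevance = [0.9, 0.85, 0.3, 0.6]
redundancy = [
    [0.0, 0.8, 0.1, 0.1],
    [0.8, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
order = mrmr_order(relevance, redundancy)
levels = nested_levels(order)
print(order, levels)
```

Each level contains the previous one, so a single ordering serves every concept budget, which is the Matryoshka-style property the abstract describes.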

2605.16608 2026-05-29 cs.LG cs.CL

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios

Sotaro Takeshita, Yurina Takeshita, Simone Paolo Ponzetto, Daniel Ruffinelli

Affiliations Data and Web Science Group, University of Mannheim; NEC Laboratories Europe; Independent Researcher

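AI summary Experimentally compares Matryoshka Representation Learning (MRL) with random truncation of text embeddings and finds that, unless embeddings are heavily truncated (reduced by at least 80%), truncated embeddings from non-MRL models match or even outperform those from MRL models.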

English abstract

Matryoshka Representation Learning (MRL) is a widely adopted approach for training text encoders so that they provide useful text representations at various sizes, obtained by simply truncating the resulting vectors at sizes predetermined at training time. Recent work has shown that randomly truncating text embeddings has minimal impact on downstream performance unless vectors are reduced in size by at least 70%, suggesting that embeddings are already robust to truncation without the use of MRL. However, no prior work has compared random truncation to MRL, so it is unclear how the two compare as embedding reduction methods. In this paper, we study this by applying the same truncation used by MRL to models trained with and without MRL. Our results across several models and downstream tasks show that, unless embeddings are heavily truncated (i.e., reduced in size by at least 80%), truncated embeddings of non-MRL models are competitive with, and often outperform, those of models trained with MRL. This suggests that truncation robustness may not necessarily come from MRL, and that whether to spend the additional training cost of MRL depends on whether heavy truncation is desired. We make our code available for reproduction.
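The truncation operation discussed above can be sketched in a few lines: keep the first k dimensions and re-normalize before comparing vectors. This is only an illustration with synthetic Gaussian vectors standing in for real text embeddings; the helper names and the noise level are assumptions, and nothing here reproduces the paper's experiments.

```python
import math
import random


def truncate(vec, k):
    """Keep the first k dimensions and re-normalize to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


random.seed(0)
dim = 256
base = [random.gauss(0, 1) for _ in range(dim)]
near = [x + 0.3 * random.gauss(0, 1) for x in base]  # stand-in for a paraphrase
far = [random.gauss(0, 1) for _ in range(dim)]       # stand-in for unrelated text

for keep in (256, 128, 64, 32):
    q, n, f = truncate(base, keep), truncate(near, keep), truncate(far, keep)
    print(keep, round(cosine(q, n), 3), round(cosine(q, f), 3))
```

With real encoders the interesting question is how quickly the gap between related and unrelated pairs shrinks as `keep` decreases, which is what the 70% and 80% thresholds in the abstract refer to.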

2605.14373 2026-05-29 cs.LG cs.AI

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

Chen Liang, Xiatao Sun, Qian Wang, Daniel Rakita

Affiliations Department of Computer Science, Yale University, New Haven, USA

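AI summary Proposes Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient zeroth-order optimization method that exploits the coherence of historical gradients to achieve O(1) query complexity per step, and finds that large finite-difference step sizes can implicitly smooth the optimization landscape, so that the method outperforms existing approaches in lightweight settings.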

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026); Project page: https://chen-dylan-liang.github.io/CoCD/

English abstract

Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with "warm starts," effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
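The idea of reusing stale coordinate gradients can be illustrated with a toy optimizer. The sketch below is not CoCD as defined in the paper; it only assumes that each step refreshes one block of finite-difference partial derivatives and keeps the older estimates for the rest. The function names, step sizes, and test objective are all hypothetical.

```python
def stale_gradient_descent(f, x0, block_size=1, h=1e-3, lr=0.02, steps=400):
    """Zeroth-order descent that refreshes one coordinate block per step.

    Each step re-estimates only `block_size` partial derivatives with
    forward differences and reuses the stale estimates for all other
    coordinates, so it costs 1 + block_size function queries per step
    regardless of the dimension.
    """
    x = list(x0)
    n = len(x)
    grad = [0.0] * n  # running gradient estimate, mostly stale
    start = 0
    for _ in range(steps):
        fx = f(x)  # one query at the current point
        for j in range(start, start + block_size):
            i = j % n
            probe = list(x)
            probe[i] += h
            grad[i] = (f(probe) - fx) / h  # one query per refreshed coordinate
        start = (start + block_size) % n
        # Move along the full estimate, fresh and stale entries alike.
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def quadratic(x):
    """Separable test objective with a different curvature per coordinate."""
    return sum((i + 1) * xi * xi for i, xi in enumerate(x))


start_point = [1.0, -2.0, 0.5, 1.5]
result = stale_gradient_descent(quadratic, start_point)
print(quadratic(start_point), quadratic(result))
```

Because a stale entry is applied for several consecutive steps, the learning rate has to be smaller than for ordinary gradient descent on the same problem; the value used here was chosen so that the toy quadratic stays stable.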

2605.14241 2026-05-29 cs.LG

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

Kexin Chu, Dawei Xiang, Wei Zhang

Affiliations University of Connecticut

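AI summary Addresses routing among multiple functionally equivalent tool providers in LLM agents. Proposes LQM-ContextRoute, a contextual bandit router that uses latency-quality matching and query-specific quality estimation to trade off latency against quality under runtime load, and that outperforms SW-UCB on several benchmarks.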

Comments 14 pages, 6 figures, 13 tables

English abstract

Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider-routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM-ContextRoute, a contextual bandit router for same-function tool providers. Its key design is latency-quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity-aware score with query-specific quality estimation and LLM-as-judge feedback, allowing it to adapt online to both load changes and provider-quality differences. On the main web-search load benchmark, LQM-ContextRoute improves F1 by +2.18 pp over SW-UCB while staying on the latency-quality frontier. In a high-heterogeneity StrategyQA setting, LQM-ContextRoute avoids additive-reward collapse and improves accuracy by up to +18 pp over SW-UCB; on heterogeneous retriever pools, it improves NDCG by +2.91 to +3.22 pp over SW-UCB. These results show that same-function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider-quality heterogeneity coexist.
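The contrast between an additive reward and a quality-per-service-cycle score can be shown with a toy ranking. The abstract does not give the exact scoring formula, so the ratio below is only an assumed reading of "expected answer quality per service cycle"; the provider names, numbers, and latency weight are hypothetical.

```python
def additive_reward(quality, latency_s, weight=1.0):
    """Baseline score: latency is subtracted from quality, so a very fast
    provider can outrank a far better but slower one."""
    return quality - weight * latency_s


def quality_per_cycle(quality, latency_s):
    """Capacity-style score: expected answer quality per second of service."""
    return quality / latency_s


# Hypothetical providers: (expected quality in [0, 1], mean latency in seconds).
providers = {
    "fast_but_poor": (0.02, 0.1),
    "balanced": (0.60, 1.0),
    "slow_but_good": (0.90, 2.0),
}

by_additive = max(providers, key=lambda p: additive_reward(*providers[p]))
by_ratio = max(providers, key=lambda p: quality_per_cycle(*providers[p]))
print(by_additive, by_ratio)
```

In this toy setting the additive score selects the provider whose answers are almost always wrong, because its low latency dominates the sum, while the ratio cannot be rescued by speed when quality is near zero.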

2605.14113 2026-05-29 cs.CV cs.AI cs.LG cs.MA

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns

Affiliations School of Computing and Communications, Lancaster University; Lancaster Medical School; PUC-Rio; Puc-Behring Institute for AI

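AI summary Proposes the ProtoMedAgent framework, which constrains generation with a neuro-symbolic bottleneck and a reflective Scribe-Critic loop to address the missing semantic structure of prototype networks in clinical reporting as well as retrieval sycophancy, and introduces a privacy gate based on k-anonymity and ℓ-diversity.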

Comments CVR 2026

English abstract

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, outperforming standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
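For readers unfamiliar with the two privacy criteria named above, the following sketch checks k-anonymity and distinct ℓ-diversity on a small table. It is a generic textbook-style check, not the paper's semantic privacy gate; the function name, field names, and records are hypothetical and fully synthetic.

```python
from collections import defaultdict


def passes_privacy_gate(records, quasi_keys, sensitive_key, k, ell):
    """Return True if every group of records sharing the same quasi-identifier
    values has at least k members (k-anonymity) and at least ell distinct
    sensitive values (distinct l-diversity)."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_keys)].append(rec[sensitive_key])
    return all(
        len(values) >= k and len(set(values)) >= ell
        for values in groups.values()
    )


# Synthetic records for illustration only.
cohort = [
    {"age_band": "60-69", "sex": "F", "finding": "condition_a"},
    {"age_band": "60-69", "sex": "F", "finding": "condition_b"},
    {"age_band": "60-69", "sex": "F", "finding": "condition_c"},
    {"age_band": "70-79", "sex": "M", "finding": "condition_a"},
    {"age_band": "70-79", "sex": "M", "finding": "condition_a"},
]
print(passes_privacy_gate(cohort, ["age_band", "sex"], "finding", k=2, ell=1))
print(passes_privacy_gate(cohort, ["age_band", "sex"], "finding", k=2, ell=2))
```

The second call fails because the second group, although large enough for k = 2, contains only one distinct finding, so anyone known to be in that group has their finding revealed.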

2605.13986 2026-05-29 cs.LG stat.ML

TabPFN-3: Technical Report

Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Mihir Manium, Shi Bin Hoo, Magnus Bühler, Anurag Garg, Dominik Safaric, Jake Robertson, Benjamin Jäger, Simone Alessi, Adrian Hayler, Vladyslav Moroshan, Lennart Purucker, Philipp Singer, Alan Arazi, Julien Siems, Jan Hendrik Metzen, Georg Grab, Nick Erickson, Siyuan Guo, Eliott Kalfon, Simon Bing, David Salinas, Clara Cornu, Lilly Charlotte Wehrhahn, Diana Kriuchkova, Kursat Kaya, Lydia Sidhoum, Marie Salmon, Jerry Chen, Madelon Hulsebos, Yann LeCun, Samuel Müller, Bernhard Schölkopf, Sauraj Gambhir, Noah Hollmann, Frank Hutter

Affiliations Prior Labs

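AI summary Presents TabPFN-3, which achieves state-of-the-art performance on tabular data by scaling up training data and optimizing inference, and which supports time series, relational, and tabular-text data.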

English abstract

Tabular data underpins most high-value prediction problems in science and industry, and TabPFN has driven the foundation model revolution for this modality. Designed with feedback from our users, TabPFN-3 builds on this foundation to scale state-of-the-art performance to datasets with 1M training rows and substantially reduce training and inference time. Pretrained exclusively on synthetic data from our prior, TabPFN-3 dramatically pushes the frontier of tabular prediction and brings substantial gains on time series, relational, and tabular-text data. On the standard tabular benchmark TabArena, a forward pass of TabPFN-3 outperforms all other models, including tuned and ensembled baselines, by a significant margin, and pareto-dominates the speed/performance frontier. On more diverse datasets, TabPFN-3 ranks first on datasets with many classes, and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M training rows and 200 features. TabPFN-3 introduces test-time compute scaling to tabular foundation models. Our API offering TabPFN-3-Plus (Thinking) exploits this to beat all non-TabPFN models by over 200 Elo on TabArena, rising to 420 Elo on the largest data subset, and outperforms AutoGluon 1.5 extreme while being 10x faster, without using LLMs, real data, internet search or any other model besides TabPFN. TabPFN-3 extends the capabilities of our models, enabling SOTA prediction on relational data (new SOTA foundation model on RelBenchV1) and tabular-text data (SOTA on TabSTAR via TabPFN-3-Plus); and improves existing integrations: a specialized checkpoint, TabPFN-TS-3, ranks 2nd on the time-series benchmark fev-bench, and SHAP-value computation is up to 120x faster. TabPFN-3 achieves this performance while being up to 20x faster than TabPFN-2.5. In addition, a reduced KV cache and row-chunking scale to 1M rows on one H100 with fast inference speed.
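To give a feel for the Elo gaps quoted above, the sketch below converts a rating difference into an expected head-to-head score using the standard Elo logistic curve (base 10, scale 400). Whether TabArena calibrates its Elo in exactly this way is an assumption here and is not stated in the abstract.

```python
def elo_expected_score(rating_gap):
    """Expected score of the higher-rated side under the standard Elo
    logistic curve (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


for gap in (0, 200, 420):
    print(gap, round(elo_expected_score(gap), 3))
```

Under that assumption, a 200-point gap corresponds to an expected score of roughly 0.76 and a 420-point gap to roughly 0.92 against the lower-rated model.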

2605.13230 2026-05-29 cs.LG cs.AI

Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Chenglong Wang, Tong Xiao, JingBo Zhu

Affiliations School of Computer Science and Engineering, Northeastern University, China; Tsinghua University; Meituan; Meta AI; NiuTrans Research, Shenyang, China

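AI summary To address the failure of reverse-KL supervision in on-policy distillation when the teacher and student policies diverge widely, proposes Teacher-Guided Policy Optimization (TGPO), in which the teacher directly guides token-level generation on student-generated contexts, combined with RLVR rewards. TGPO outperforms existing methods on reasoning benchmarks.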

English abstract

On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existing OPD methods rely on reverse KL (RKL)-based teacher supervision over trajectories sampled from the student policy. However, we identify a critical limitation: under large teacher-student policy divergence, RL-driven exploration often produces trajectories outside the teacher distribution, resulting in uninformative negative feedback. To address this, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy reasoning distillation method that remains effective under large policy divergence. Rather than relying solely on evaluative supervision, TGPO uses the teacher to directly guide token-level generation conditioned on student-generated contexts; together with RLVR-style trajectory-level rewards, TGPO steers exploration toward improved continuations. Experiments on reasoning benchmarks show that TGPO consistently outperforms existing RKL-based OPD methods and remains robust across different teacher models.
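The reverse-KL signal that existing OPD methods rely on can be written down for a single next-token distribution. The sketch below is a generic illustration, not the TGPO objective; the distributions are made up and the `eps` floor is an assumption added to keep the logarithm finite.

```python
import math


def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) over one next-token distribution."""
    return sum(
        s * math.log(s / max(t, eps))
        for s, t in zip(student, teacher)
        if s > 0
    )


teacher = [0.70, 0.25, 0.05, 0.00]
student_close = [0.60, 0.30, 0.10, 0.00]  # stays inside teacher support
student_off = [0.05, 0.05, 0.10, 0.80]    # mass on a token the teacher never emits

print(round(reverse_kl(student_close, teacher), 4))
print(round(reverse_kl(student_off, teacher), 4))
```

The second value is dominated by the floor on the teacher probability, which hints at why the penalty says little about where the student should move once its samples leave the teacher distribution.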

2605.11723 2026-05-29 cs.CV cs.AI

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, Honglie Wang, Yiyang Fan, Zhenlong Yuan, Zijun Li, Yongrui Heng, Guosheng Lin, Fan Yang, Tingting Gao

Affiliations BJTU; NTU; BUPT; Kuaishou Technology

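AI summary Proposes CaC, a coarse-to-fine anomaly reward model based on vision-language models. It combines a global temporal scan, local spatial grounding, and structured spatiotemporal chain-of-thought reasoning with a large-scale generated-video anomaly dataset and three-stage progressive training, markedly improving fine-grained anomaly detection accuracy and reducing anomalies in generated videos.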

Comments 27 pages, 10 figures

English abstract

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
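The Temporal and Spatial IoU rewards mentioned above build on the usual intersection-over-union measure. A minimal sketch of both quantities follows; how the paper scales or combines them into a reward is not given in the abstract, so only the raw IoU values are computed, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two time windows given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Both values lie in [0, 1], so they can be added to an accuracy reward without rescaling, which is one plausible reason for using them to supervise intermediate localization.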

2605.10299 2026-05-29 cs.LG

Nearly-Optimal Algorithm for Adversarial Kernelized Bandits

Shogo Iwazaki

Affiliations LY Corporation

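AI summary For kernelized bandits in adversarial environments, shows that an exponential-weight algorithm achieves a nearly optimal regret bound, gives matching lower bounds, and uses the Nyström approximation for efficient computation.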

Comments 47 pages

English abstract

This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each round. We show that the exponential-weight algorithm achieves $\tilde{O}(\sqrt{T \gamma_T})$ adversarial regret, where $T$ and $\gamma_T$ denote the number of total rounds and the maximum information gain, respectively. For squared exponential (SE) and $\nu$-Matérn kernels, we also show algorithm-independent lower bounds that guarantee the optimality of our algorithm up to polylogarithmic factors. Furthermore, we present a computationally efficient variant of our algorithm using Nyström approximation while maintaining nearly optimal regret guarantees.
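As a worked illustration of the stated rate, one can substitute the information-gain bounds commonly cited in the Gaussian process bandit literature. These bounds are an assumption here and are not part of the abstract; $d$ denotes the input dimension.

```latex
\begin{align*}
\text{SE kernel:}\quad
  & \gamma_T = O\!\left(\log^{d+1} T\right)
  && \Longrightarrow\quad
  \tilde{O}\!\left(\sqrt{T\gamma_T}\right) = \tilde{O}\!\left(\sqrt{T}\right),\\
\text{$\nu$-Mat\'ern kernel:}\quad
  & \gamma_T = \tilde{O}\!\left(T^{\frac{d}{2\nu+d}}\right)
  && \Longrightarrow\quad
  \tilde{O}\!\left(\sqrt{T\gamma_T}\right)
  = \tilde{O}\!\left(T^{\frac{1}{2}\left(1+\frac{d}{2\nu+d}\right)}\right)
  = \tilde{O}\!\left(T^{\frac{\nu+d}{2\nu+d}}\right).
\end{align*}
```

The exponent for the Matérn case approaches $1/2$ as the smoothness $\nu$ grows, so smoother reward functions bring the rate closer to the SE case.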

2605.08870 2026-05-29 cs.LG math.AT math.DG

TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection

Farid Hazratian, Ali Zia, Hien Duy Nguyen

Affiliations University of Tehran; La Trobe University; Kyushu University

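AI summary Proposes TopoGeoScore, a self-supervised geometric scoring method that uses only source-domain representations, with no target samples or labels. It extracts topological and geometric features of class manifolds and learns an interpretable linear score to select checkpoints that are robust out of distribution.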

English abstract

Out-of-distribution (OOD) robustness is difficult to diagnose when target-domain labels are unavailable. We consider a more restrictive source-only variant of unsupervised accuracy estimation: selecting robust checkpoints using only source-domain representations, with no target samples or target labels. We propose TopoGeoScore, a source-only geometric scorer for label-free OOD checkpoint selection. Given a trained checkpoint, we construct class-conditional mutual $k$-nearest-neighbour graphs from source embeddings and extract three interpretable signals: a torsion-inspired reduced Laplacian log-determinant for global class-manifold complexity, Ollivier-Ricci curvature for local neighbourhood regularity, and higher-order topological summaries for fragmented connectivity, loops, and global-local inconsistency. Instead of fixing their weights by hand, TopoGeoScore learns a non-negative linear score through a self-supervised objective that enforces invariance under approximately geometry-preserving embedding views and separation from structure-breaking views. The score remains interpretable and uses no target-domain samples or labels. Results across CIFAR-based corruption and distribution-shift benchmarks, ImageNet-C, MNLI$\to$HANS transfer, and OGBN-Arxiv suggest that source representations contain measurable global-local-topological evidence of robustness, supporting practical checkpoint selection before deployment under distribution shift.
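The mutual k-nearest-neighbour graph that the method starts from can be sketched as follows. This covers only the graph construction for one class, using Euclidean distance as an assumed metric; the function name and the toy points are hypothetical, and none of the three geometric signals is computed here.

```python
import math


def mutual_knn_edges(points, k):
    """Edges (i, j) with i < j where each point is among the other's
    k nearest neighbours under Euclidean distance."""
    def knn(i):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        return set(order[:k])

    neighbours = [knn(i) for i in range(len(points))]
    return sorted(
        (i, j)
        for i in range(len(points))
        for j in neighbours[i]
        if i < j and i in neighbours[j]
    )


# Toy embeddings of one class: three close points and one outlier.
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
edges = mutual_knn_edges(embeddings, k=2)
print(edges)
```

Requiring the neighbour relation in both directions leaves the outlier disconnected, which is the kind of fragmented connectivity the topological summaries in the abstract are meant to pick up.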

2605.08832 2026-05-29 cs.LG physics.flu-dyn

Inpainting physics: self-supervised learning for context-driven fluid simulation

Jonas Weidner, Yeray Martin-Ruisanchez, Daniel Rueckert, Benedikt Wiestler, Julian Suk

Affiliations AI for Image-Guided Diagnosis and Therapy, Technical University of Munich; Munich Center for Machine Learning (MCML); AI in Healthcare and Medicine, Technical University of Munich; Imperial College London

AI总结 提出将稳态CFD推理重构为修复问题,通过自监督学习速度场先验并在推理时施加边界约束,利用局部邻域分词器处理大规模3D网格,在颅内动脉瘤血流动力学中优于监督代理模型。

详情
AI中文摘要

计算流体动力学(CFD)的神经代理模型通常被训练为正向算子,将显式问题规范(如几何形状和边界条件)映射到解场。这使得模型与训练期间看到的条件变量绑定,并在边界条件变化或局部几何改变时限制了复用。我们提出将稳态CFD推理重构为一个修复问题:不是训练显式边界条件,而是学习速度场的自监督先验,并在推理时通过固定已知区域(如入口、出口或先前模拟中未改变的区域)来施加边界约束。为了将这一思想扩展到大规模3D网格,我们引入了一个局部邻域分词器,将高分辨率速度场表示为紧凑的空间潜在令牌,并在这些令牌上训练潜在流匹配和掩码自编码器模型。在颅内动脉瘤血流动力学中,我们的方法从稀疏边界上下文中重建完整速度场,在边界条件和数据集偏移下优于监督神经代理模型,并通过复用未改变的模拟上下文实现局部几何编辑。这些结果表明,将CFD推理视为上下文条件修复可以将神经代理从任务特定预测器转变为可复用的流先验。

英文摘要

Neural surrogate models for computational fluid dynamics (CFD) are typically trained as forward operators that map explicit problem specifications, such as geometry and boundary conditions, to solution fields. This ties the model to the conditioning variables seen during training and limits reuse under boundary-condition shifts or local geometry changes. We propose to reformulate steady CFD inference as an inpainting problem: instead of training on explicit boundary conditions, we learn a self-supervised prior over velocity fields and impose boundary constraints only during inference by fixing known regions such as inlet, outlet or unchanged regions from previous simulations. To scale this idea to large 3D meshes, we introduce a local neighbourhood tokeniser that represents high-resolution velocity fields as compact spatial latent tokens and train latent flow-matching and masked-autoencoder models on these tokens. On intracranial aneurysm hemodynamics, our method reconstructs full velocity fields from sparse boundary context, outperforms supervised neural surrogates under boundary-condition and dataset shift and enables local geometry editing by reusing unchanged simulation context. These results suggest that viewing CFD inference as context-conditioned inpainting can turn neural surrogates from task-specific predictors into reusable flow priors.

2605.08786 2026-05-29 cs.LG

PRIM: Meta-Learned Bayesian Root Cause Analysis

PRIM:元学习的贝叶斯根因分析

Christopher Lohse, Anish Dhir, Amadou Ba, Bradley Eck, Marco Ruffini, Jonas Wahl

发表机构 * University of Dublin, Trinity College(都柏林大学圣三一学院) IBM Gatsby Computational Neuroscience Unit, University College London(伦敦大学学院盖茨比计算神经科学中心) Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Saarbrücken, Germany(德国人工智能研究中心(DFKI),德国萨尔布吕肯) Department of Philosophy, University of Bergen(卑尔根大学哲学系)

AI总结 提出一种基于元学习的贝叶斯根因分析方法PRIM,通过合成先验因果模型进行贝叶斯推断,隐式识别数据生成机制变化,实现零样本快速推理。

详情
AI中文摘要

复杂系统中的根因分析(RCA)由于错误在多个变量间传播、需要结构因果知识以及测试时推理的计算成本而具有挑战性。我们提出了PRIM(基于先验拟合的元学习根因识别),一种因果元学习方法,将RCA视为对因果模型合成先验的贝叶斯推断任务。通过边缘化结构不确定性,PRIM隐式识别基线和异常时期之间数据生成机制的变化。在此过程中,PRIM无需显式统计检验即可推断分布差异,并在测试时无需模型拟合即可隐式学习因果结构。遵循基于模拟的元学习范式(先验拟合网络),PRIM使用模型平均因果估计(MACE)Transformer神经过程,该过程联合关注观测样本、异常样本以及节点的因果结构,从而在17毫秒内对多达100个变量的系统实现零样本推理。在合成基准和两个真实基准数据集PetShop和CausRCA上,PRIM与预先知道系统因果图结构的方法竞争,同时在多个任务上优于不知图结构的方法。对特定领域和数据动态的轻量级微调进一步提升了性能。

英文摘要

Root cause analysis (RCA) in complex systems is challenging due to error propagation across multiple variables, the need for structural causal knowledge, and the computational cost of inference at test time. We introduce PRIM (Prior-fitted Root cause Identification with Meta-learning), a causal meta-learning approach that frames RCA as a Bayesian inference task over a synthetic prior of causal models. By marginalising out structural uncertainty, PRIM implicitly identifies changes in the data-generating mechanism between baseline and anomalous periods. In doing so, PRIM infers distributional differences without explicit statistical testing, and implicitly learns causal structure without model fitting at test time. Following the simulation-based meta-learning paradigm of prior-fitted networks, PRIM uses a Model-Averaged Causal Estimation (MACE) transformer neural process that jointly attends over observational and anomalous samples and the causal structure of nodes, enabling zero-shot inference in 17 ms for systems with up to 100 variables. Across synthetic benchmarks and two realistic benchmark datasets, PetShop and CausRCA, PRIM is competitive with methods that are aware of the system's causal graphical structure a priori while outperforming graph-unaware methods on several tasks. Lightweight fine-tuning to specific domains and data dynamics improves performance further.

2605.06355 2026-05-29 cs.LG stat.ML

Order-Agnostic Autoregressive Modelling with Missing Data

缺失数据下的顺序无关自回归建模

Ignacio Peis, Pablo M. Olmos, Jes Frellsen

发表机构 * Technical University of Denmark(丹麦技术大学) Pioneer Centre for AI(先锋人工智能中心) Universidad Carlos III de Madrid(马德里卡洛斯三世大学)

AI总结 本文通过缺失数据视角重新审视顺序无关自回归模型,提出缺失感知训练框架,并利用其条件密度估计进行主动信息获取,在多个基准上优于传统插补方法。

详情
AI中文摘要

顺序无关自回归模型在深度生成建模中表现出色,但其在数据不完整情况下的应用尚未被充分探索。本文从缺失数据的角度重新审视这些模型。首先,我们证明它们在完全观测数据上的标准训练过程隐式地在完全随机缺失机制下进行插补,从而在高缺失率场景下实现了稳健的样本外插补性能。其次,我们提出了第一个原则性框架,用于在一般缺失机制下直接从不完整数据集中训练这些模型。第三,我们利用其摊销条件密度估计进行主动信息获取,即顺序选择对下游预测或推理最有信息量的缺失变量。在一系列真实世界基准测试中,我们的缺失感知顺序无关自回归模型(MO-ARM)持续优于已建立的插补基线。

英文摘要

Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show that their standard training procedure on fully observed data implicitly performs imputation under a missing completely at random mechanism, resulting in robust out-of-sample imputation performance in settings with high missingness. Second, we introduce the first principled framework for training them directly on incomplete datasets under general missingness mechanisms. Third, we leverage their amortized conditional density estimation to perform active information acquisition, i.e., sequentially selecting the most informative missing variables for downstream prediction or inference. Across a suite of real-world benchmarks, our Missingness-Aware Order-Agnostic Autoregressive Model (MO-ARM) consistently outperforms established imputation baselines.
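The link between order-agnostic training and missing-completely-at-random (MCAR) masking can be sketched as follows. This assumes the usual order-agnostic recipe of sampling a random variable ordering and a random position in it; the abstract does not spell out the procedure. `sample_training_task` and `mask_row` are hypothetical names.

```python
import random

def sample_training_task(num_features, rng):
    """Sample one order-agnostic training task.

    A random ordering and a random position in it are drawn; the variables
    before that position are treated as observed, the one at the position
    is the prediction target. The observed set never depends on the data
    values, which is what makes the implied mask MCAR.
    """
    order = list(range(num_features))
    rng.shuffle(order)
    step = rng.randrange(num_features)
    observed = sorted(order[:step])
    target = order[step]
    return observed, target

def mask_row(row, observed):
    """Hide every entry outside the observed set (None marks 'missing')."""
    keep = set(observed)
    return [value if i in keep else None for i, value in enumerate(row)]

rng = random.Random(0)
observed, target = sample_training_task(5, rng)
masked = mask_row([0.1, 0.2, 0.3, 0.4, 0.5], observed)
```

Each task asks the model for the conditional density of one hidden variable given a random observed subset, which has the same form as imputing a missing entry from the entries that happen to be present.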

2605.05964 2026-05-29 cs.LG

Uncertainty Estimation via Hyperspherical Confidence Mapping

基于超球面置信映射的不确定性估计

Eunseo Choi, Ho-Yeon Kim, Jaewon Lee, Taeyong Jo, Myungjun Lee, Heejin Ahn

发表机构 * KAIST(韩国科学技术院) Samsung Electronics Co., Ltd.(三星电子有限公司)

AI总结 提出超球面置信映射(HCM),通过将输出分解为幅度和归一化方向向量并利用几何约束违反程度实现无采样、无分布假设的不确定性估计,在回归和分类任务中匹配或超越集成与证据方法且推理成本更低。

Comments Accepted at ICLR 2026. 24 pages, 7 figures, including appendix. Updated references

详情
AI中文摘要

量化神经网络预测中的不确定性对于自动驾驶、医疗和制造等高安全领域至关重要。现有方法通常依赖昂贵的采样或严格的分布假设,我们提出超球面置信映射(HCM),一个简单而原则性的框架,用于无采样和无分布假设的不确定性估计。HCM将输出分解为幅度和约束在单位超球面上的归一化方向向量,从而将不确定性解释为该几何约束的违反程度,得到适用于回归和分类的确定性和可解释性估计。在多种基准和实际工业任务上的实验表明,HCM匹配或超越了集成和证据方法,且推理成本更低,置信度-错误对齐更强。我们的结果凸显了几何结构在不确定性估计中的力量,并将HCM定位为传统技术的通用替代方案。

英文摘要

Quantifying uncertainty in neural network predictions is essential for high-stakes domains such as autonomous driving, healthcare, and manufacturing. While existing approaches often depend on costly sampling or restrictive distributional assumptions, we propose Hyperspherical Confidence Mapping (HCM), a simple yet principled framework for sampling-free and distribution-free uncertainty estimation. HCM decomposes outputs into a magnitude and a normalized direction vector constrained to lie on the unit hypersphere, enabling a novel interpretation of uncertainty as the degree of violation of this geometric constraint. This yields deterministic and interpretable estimates applicable to both regression and classification. Experiments across diverse benchmarks and real-world industrial tasks demonstrate that HCM matches or surpasses ensemble and evidential approaches, with far lower inference cost and stronger confidence-error alignment. Our results highlight the power of geometric structure in uncertainty estimation and position HCM as a versatile alternative to conventional techniques.
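The geometric idea can be sketched in a few lines. The abstract does not state how the constraint violation is measured, so this sketch assumes the simplest candidate: the absolute deviation of the predicted direction's norm from 1. All function names are hypothetical.

```python
import math

def decompose(target):
    """Split a target vector into (magnitude, unit direction)."""
    magnitude = math.sqrt(sum(v * v for v in target))
    if magnitude == 0.0:
        raise ValueError("zero vector has no direction")
    return magnitude, [v / magnitude for v in target]

def constraint_violation(direction_pred):
    """Assumed uncertainty proxy: how far the predicted direction vector
    lies from the unit hypersphere, measured as | ||d|| - 1 |."""
    norm = math.sqrt(sum(v * v for v in direction_pred))
    return abs(norm - 1.0)

def recompose(magnitude_pred, direction_pred):
    """Rebuild the output vector from the two predicted parts."""
    return [magnitude_pred * v for v in direction_pred]

magnitude, direction = decompose([3.0, 4.0])   # direction lies on the sphere
confident = constraint_violation(direction)    # close to 0
uncertain = constraint_violation([0.3, 0.4])   # norm 0.5, far from the sphere
```

Under this reading a single forward pass yields both the prediction and its uncertainty score, which is consistent with the sampling-free claim in the abstract.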

2605.05155 2026-05-29 cs.CV cs.AI

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

Aes3D: 3D高斯泼溅中的美学评估

Chuanzhi Xu, Boyu Wei, Haoxian Zhou, Xuanhua Yin, Zihan Deng, Haodong Chen, Qiang Qu, Weidong Cai

发表机构 * The University of Sydney(悉尼大学) The University of Hong Kong(香港大学)

AI总结 针对3D高斯泼溅场景缺乏美学评估的问题,提出首个系统框架Aes3D,包含专用数据集Aesthetic3D和轻量级模型Aes3DGSNet,直接预测场景级美学分数,无需渲染多视图图像。

详情
AI中文摘要

随着3D高斯泼溅(3DGS)在沉浸式媒体和数字内容创作中受到关注,评估3D场景的美学对于帮助创作者构建更具视觉吸引力的3D内容变得重要。然而,现有的3D场景评估方法主要强调重建保真度和感知真实感,在很大程度上忽略了构图、和谐度和视觉吸引力等更高层次的美学属性。这一局限性源于两个关键挑战:(1)缺乏带有美学标注的通用3DGS数据集,以及(2)3DGS作为低级基元表示的内在性质,使其难以捕捉高级美学特征。为应对这些挑战,我们提出Aes3D,这是首个用于评估3D神经渲染场景美学的系统框架。Aes3D包含Aesthetic3D,这是首个专用于3D场景美学评估的数据集,基于我们提出的3D场景美学标注策略构建。此外,我们提出Aes3DGSNet,一个轻量级模型,可直接从3DGS表示预测场景级美学分数。值得注意的是,我们的模型仅基于3D高斯基元运行,无需渲染多视图图像,从而降低了计算成本和硬件要求。通过对多视图3DGS场景表示进行美学监督学习,Aes3DGSNet有效捕获高级美学线索并准确回归美学分数。实验结果表明,我们的方法在保持轻量级设计的同时实现了强劲性能,为3D场景美学评估建立了新基准。代码和数据集将在未来版本中提供。

英文摘要

As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes becomes important in helping creators build more visually compelling 3D content. However, existing evaluation methods for 3D scenes primarily emphasize reconstruction fidelity and perceptual realism, largely overlooking higher-level aesthetic attributes such as composition, harmony, and visual appeal. This limitation comes from two key challenges: (1) the absence of general 3DGS datasets with aesthetic annotations, and (2) the intrinsic nature of 3DGS as a low-level primitive representation, which makes it difficult to capture high-level aesthetic features. To address these challenges, we propose Aes3D, the first systematic framework for assessing the aesthetics of 3D neural rendering scenes. Aes3D includes Aesthetic3D, the first dataset dedicated to 3D scene aesthetic assessment, built on our proposed annotation strategy for 3D scene aesthetics. In addition, we present Aes3DGSNet, a lightweight model that directly predicts scene-level aesthetic scores from 3DGS representations. Notably, our model operates solely on 3D Gaussian primitives, eliminating the need for rendering multi-view images and thus reducing computational cost and hardware requirements. Through aesthetics-supervised learning on multi-view 3DGS scene representations, Aes3DGSNet effectively captures high-level aesthetic cues and accurately regresses aesthetic scores. Experimental results demonstrate that our approach achieves strong performance while maintaining a lightweight design, establishing a new benchmark for 3D scene aesthetic assessment. Code and datasets will be made available in a future version.

2605.05133 2026-05-29 cs.LG

Transformed Latent Variable Multi-Output Gaussian Processes

变换潜变量多输出高斯过程

Xiaoyu Jiang, Xinxing Shi, Sokratia Georgaka, Magnus Rattray, Mauricio A Álvarez

发表机构 * Department of Computer Science, University of Manchester, Manchester, UK(曼彻斯特大学计算机科学系) Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK(曼彻斯特大学生物医学与健康学院)

AI总结 提出T-LVMOGP框架,通过Lipschitz正则化神经网络构建灵活的多输出深度核,结合随机变分推理,有效扩展到高维输出场景,在气候建模和空间转录组学等基准上优于基线方法。

Comments ICML 2026

详情
AI中文摘要

多输出高斯过程(MOGP)为建模相关输出提供了一个原则性的概率框架,但在应用于具有高维输出空间的数据集时面临可扩展性瓶颈。为了保持可处理性,现有方法通常采用限制性假设,例如使用低秩或可分离和核,这可能限制表达能力。我们提出了变换潜变量多输出高斯过程(T-LVMOGP),这是一种新颖的框架,将MOGP扩展到大量输出,同时保留捕获有意义输出间依赖关系的能力。T-LVMOGP通过使用Lipschitz正则化神经网络将输入和输出特定的潜变量映射到嵌入空间,构建了一个灵活的多输出深度核。结合随机变分推理,我们的模型有效地扩展到高维输出设置。在包括超过10,000个输出的气候建模和零膨胀空间转录组学数据在内的多个基准测试中,T-LVMOGP在预测准确性和计算效率上均优于基线方法。

英文摘要

Multi-Output Gaussian Processes (MOGPs) provide a principled probabilistic framework for modelling correlated outputs but face scalability bottlenecks when applied to datasets with high-dimensional output spaces. To maintain tractability, existing methods typically resort to restrictive assumptions, such as employing low-rank or sum-of-separable kernels, which can limit expressiveness. We propose the Transformed Latent Variable MOGP (T-LVMOGP), a novel framework that scales MOGPs to a massive number of outputs while preserving the capacity to capture meaningful inter-output dependencies. T-LVMOGP constructs a flexible multi-output deep kernel by mapping inputs and output-specific latent variables into an embedding space using a Lipschitz-regularised neural network. Combined with stochastic variational inference, our model effectively scales to high-dimensional output settings. Across diverse benchmarks, including climate modelling with over 10,000 outputs and zero-inflated spatial transcriptomics data, T-LVMOGP outperforms baselines in both predictive accuracy and computational efficiency.

2605.04569 2026-05-29 cs.CV

LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention

LIVEditor-14B:基于上下文稀疏注意力的闪电统一视频编辑

Shitong Shao, Zikai Zhou, Haopeng Li, Yingwei Song, Wenliang Zhong, Lichen Bai, Zeke Xie

发表机构 * Hong Kong University of Science and Technology(香港科技大学) University of Arizona, USA(美国亚利桑那大学)

AI总结 提出上下文稀疏注意力(ISA)框架,通过冗余上下文剪枝和动态查询分组实现近无损加速,构建LIVEditor-14B模型在多个基准上超越现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

视频编辑已向上下文学习(ICL)范式发展,但由此产生的二次注意力成本造成了关键的计算瓶颈。在这项工作中,我们提出了上下文稀疏注意力(ISA),这是首个专为ICL视频编辑设计的近无损经验稀疏框架。我们的设计基于两个关键见解:首先,上下文标记的显著性显著低于源标记;其次,我们从理论上证明并经验验证了查询锐度与近似误差相关。受这些发现启发,ISA实现了一种高效的预选择策略来剪枝冗余上下文,随后通过动态查询分组机制将高误差查询路由到全注意力,将低误差查询路由到计算高效的0阶泰勒稀疏注意力。此外,我们构建了 \textbf{\texttt{LIVEditor-14B}},这是一种通过ISA和提出的视频编辑数据流水线(整理了170万高质量数据集)的新型闪电视频编辑模型。大量实验表明,LIVEditor-14B在注意力模块延迟上减少了约60%,同时在EditVerseBench、IVE-Bench和VIE-Bench上超越了最先进的方法,实现了近无损加速且不损害视觉保真度。

英文摘要

Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build \textbf{\texttt{LIVEditor-14B}}, a novel lightning video editing model built with ISA and a proposed video-editing data pipeline that curated a 1.7M high-quality dataset. Extensive experiments demonstrate that LIVEditor-14B achieves a $\sim$60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.

2605.02772 2026-05-29 cs.CV

Linearizing Vision Transformer with Test-Time Training

通过测试时训练线性化视觉Transformer

Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, Gao Huang

发表机构 * Tsinghua University(清华大学)

AI总结 提出利用测试时训练(TTT)架构与Softmax注意力的结构对齐性,结合键实例归一化和局部性增强模块,实现从预训练Transformer到线性注意力模型的有效权重迁移,在Stable Diffusion 3.5上仅需1小时微调即可达到相近的图像生成质量并加速推理。

Comments ICML 2026

详情
AI中文摘要

虽然线性复杂度注意力机制为克服二次瓶颈提供了Softmax注意力的有前途替代方案,但从头训练此类模型仍然成本高昂。继承预训练Transformer的权重提供了一种有吸引力的捷径,但Softmax与线性注意力之间的基本表征差距阻碍了有效的权重迁移。在这项工作中,我们从两个角度解决这一转换挑战:架构对齐和表征对齐。我们确定测试时训练(TTT)是一种线性复杂度架构,其两层动态公式在结构上与Softmax注意力对齐,从而能够直接继承预训练注意力权重。为了进一步对齐表征属性,包括键平移不变性和局部性,我们引入了键实例归一化和一个轻量级局部性增强模块。我们通过线性化Stable Diffusion 3.5验证了我们的方法,并推出了SD3.5-T$^5$(Transformer到测试时训练)。仅在4$\times$H20 GPU上微调1小时,SD3.5-T$^5$在文本到图像质量上即可与微调后的Softmax模型相媲美,同时在1K和2K分辨率下分别加速推理1.32倍和1.47倍。代码可在https://github.com/LeapLabTHU/Transformer-to-TTT获取。

英文摘要

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
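The key shift-invariance mentioned above can be checked numerically. Adding the same vector to every key adds the same constant to every softmax score, so the softmax weights do not change, whereas a kernelised linear attention generally does change. The sketch below assumes the common `elu(x) + 1` feature map for linear attention and uses plain mean-centring of keys across tokens as a simplified stand-in for the paper's key instance normalization; neither detail is taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_weights(query, keys):
    """Standard softmax attention weights for a single query."""
    scores = [dot(query, key) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feature_map(vec):
    """elu(x) + 1, a common positive feature map (assumption, not from the paper)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def linear_weights(query, keys):
    """Normalised linear-attention weights phi(q).phi(k_i) / sum_j phi(q).phi(k_j)."""
    phi_q = feature_map(query)
    scores = [dot(phi_q, feature_map(key)) for key in keys]
    total = sum(scores)
    return [s / total for s in scores]

def centre_keys(keys):
    """Subtract the per-dimension mean over tokens (simplified stand-in
    for key instance normalization)."""
    dims = len(keys[0])
    means = [sum(key[d] for key in keys) / len(keys) for d in range(dims)]
    return [[key[d] - means[d] for d in range(dims)] for key in keys]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
shifted = [[k + 2.0 for k in key] for key in keys]  # same offset on every key

softmax_gap = max(abs(a - b) for a, b in
                  zip(softmax_weights(query, keys), softmax_weights(query, shifted)))
linear_gap = max(abs(a - b) for a, b in
                 zip(linear_weights(query, keys), linear_weights(query, shifted)))
centred_gap = max(abs(a - b) for a, b in
                  zip(linear_weights(query, centre_keys(keys)),
                      linear_weights(query, centre_keys(shifted))))
```

Here `softmax_gap` is zero and `linear_gap` is not, while centring the keys first removes the gap for linear attention as well.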

2605.02288 2026-05-29 cs.CV

LabBuilder: Protocol-Grounded 3D Layout Generation for Interactable and Safe Laboratory

LabBuilder: 基于协议的可交互且安全的3D实验室布局生成

Jianbao Cao, Zhangrui Zhao, Bohan Feng, Zixuan Hu, Rui Li, Haiyuan Wan, Chenxi Li, Jingyuan Li, Wenzhe Cai, Lei Bai, Wanli Ouyang, Lingyu Duan, Di Huang, Minting Pan, Sha Zhang, Xinzhu Ma, Shixiang Tang, Dongzhan Zhou

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Wuhan University(武汉大学) Beihang University(北京航空航天大学) Peking University(北京大学) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出LabBuilder系统,通过协议引导和约束感知优化,从文本描述生成安全且可执行的3D实验室布局,显著优于现有方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

自动化实验室有望加速科学发现,但其部署受限于设计安全且可执行环境的难度。虽然基于模拟器的设计提供了可扩展性,但现有的3D场景生成方法主要针对家庭环境,优化视觉合理性而忽略了科学实验所需的协议基础和布局级安全约束。我们提出了LabBuilder,一个端到端系统,从简洁的文本规范生成并验证3D实验室布局。它通过三个紧密耦合的组件运行:LabForge首先整理一个包含注释资产和化学知识的元数据集,将自然语言规范转化为结构化协议;基于这些协议,LabGen通过迭代的、约束感知的优化策略合成实验室布局;最后,LabTouchstone评估生成的布局作为统一基准。大量实验表明,LabBuilder显著优于现有最先进方法,生成的实验室环境在建模的几何、化学安全和导航约束下既真实又有效。

英文摘要

Automated laboratories hold the promise of accelerating scientific discovery, yet their deployment is bottlenecked by the difficulty of designing safe and executable environments. While simulator-based design offers scalability, existing 3D scene generation methods are primarily tailored for household settings, optimizing for visual plausibility while neglecting the protocol grounding and layout-level safety constraints essential for scientific experimentation. We present LabBuilder, an end-to-end system that generates and verifies 3D laboratory layouts from concise textual specifications. It operates through three tightly coupled components: LabForge first curates a meta-dataset of annotated assets and chemical knowledge, translating natural language specifications into structured protocols; building on these protocols, LabGen synthesizes laboratory layouts via an iterative, constraint-aware optimization strategy; finally, LabTouchstone evaluates the resulting layouts as a unified benchmark. Extensive experiments demonstrate that LabBuilder significantly outperforms existing state-of-the-art methods, producing laboratory environments that are realistic and valid under modeled geometric, chemical-safety, and navigation constraints.

2605.00969 2026-05-29 cs.SD cs.AI cs.CL

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

MedMosaic:一个具有挑战性的多样化医学音频大规模基准

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

发表机构 * Centific Global Solutions Inc.(Centific全球解决方案公司) University of Maryland, College Park, MD, USA(马里兰大学学院市分校)

AI总结 为解决医学音频数据稀缺和现有基准不足的问题,提出MedMosaic数据集,包含多种医学音频类型和46701个问答对,用于评估语言和音频推理模型,实验表明推理仍具挑战性。

Comments Accepted at ICML 2026

详情
AI中文摘要

由于隐私法规和领域专业知识导致的高注释成本,医学音频数据难以收集。因此,现有基准往往未能充分代表复杂的医学音频场景。为应对这一挑战,我们提出了MedMosaic,一个医学音频问答数据集,旨在在现实临床约束下对语言和音频推理模型进行基准测试。MedMosaic包含多种医学音频类型,包括与疾病相关的生理声音、精心构建的模拟带有伪影的语音的合成声音,以及模拟不同上下文长度的真实短篇和长篇临床对话。该数据集还包含总共46,701个问答对,涵盖多项选择、顺序多轮和开放式问答等类别,从而能够系统评估多跳推理和答案生成能力。对13个音频和多模态推理模型的基准测试显示,推理对所有评估系统仍然具有挑战性,且在不同问题类型上表现差异显著。特别是,即使是像Gemini-2.5-pro这样的最先进模型也只能达到约68.1%的准确率。这些发现强调了医学推理中的持续局限性,并凸显了对更鲁棒、特定领域的多模态推理模型的需求。基准数据样本可在此处获取:https://shorturl.at/Lyp33

英文摘要

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question answering, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model such as Gemini-2.5-pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here: https://shorturl.at/Lyp33

2605.00222 2026-05-29 cs.LG physics.chem-ph

CompleteRXN: Toward Completing Open Chemical Reaction Databases

CompleteRXN:迈向完整开放化学反应数据库

Gabriel Vogel, Minouk Noordsij, Evgeny Pidko, Jana M. Weber

发表机构 * Department of Intelligent Systems(智能系统系) Delft University of Technology(代尔夫特理工大学) Department of Chemical Engineering(化学工程系)

AI总结 针对化学反应数据库(如USPTO)普遍存在的不完整问题,提出CompleteRXN基准和约束反应平衡器(CRB)模型,通过监督学习和约束解码实现高精度的反应补全。

详情
AI中文摘要

诸如USPTO等化学反应数据集存在严重的不完整性,经常缺失副产物、共反应物和化学计量系数。这限制了它们在下游应用中的适用性和可靠性。在此,我们介绍CompleteRXN,一个在现实缺失数据条件下用于反应补全的大规模监督基准。通过将USPTO记录映射到精心整理的机理反应,我们构建了一个对齐的不完整和原子平衡反应数据集。我们评估了代表性基线方法,包括一种新颖的具有约束解码的编码器-解码器反应补全模型——约束反应平衡器(CRB),以及最近的算法方法SynRBL。在我们的CompleteRXN基准上,CRB在难度递增的划分上实现了高性能,在随机划分上达到99.20%的等价准确率,在极端分布外划分上达到91.12%。SynRBL生成了许多平衡且化学上合理的补全结果,但在基准测试划分上的准确率较低。在所有方法中,性能随着不完整程度的增加而下降。当在基准之外(完整的未整理USPTO)评估反应时,我们观察到性能大幅下降,这突显了基准性能与实际鲁棒性之间的差距,并激励了未来的工作。

英文摘要

Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced and chemically plausible completions, but with lower accuracy on the benchmark test splits. Across all methods, performance degrades with increasing incompleteness. We observe a substantial drop when evaluating on reactions outside the benchmark (full uncurated USPTO), highlighting the gap between benchmark performance and practical robustness and motivating future work.
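What "atom-balanced" means can be illustrated with a small check that reports which atoms are unaccounted for on each side of a reaction. This is only a sketch: it assumes flat molecular formulas such as `C2H6O` with no parentheses or charges, whereas USPTO records are typically stored in other notations, and the function names are hypothetical.

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms in a flat molecular formula such as 'C2H4O2'.
    Parentheses, charges and isotopes are not handled (illustrative only)."""
    counts = Counter()
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(digits) if digits else 1
    return counts

def side_counts(side):
    """Total atom counts for one reaction side given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total

def missing_atoms(reactants, products):
    """Return (atoms missing from the products, atoms missing from the reactants)."""
    left, right = side_counts(reactants), side_counts(products)
    return left - right, right - left  # Counter subtraction keeps positive counts only

# Esterification recorded without its water byproduct.
reactants = [(1, "C2H4O2"), (1, "C2H6O")]   # acetic acid + ethanol
recorded = [(1, "C4H8O2")]                  # ethyl acetate only
gap_products, gap_reactants = missing_atoms(reactants, recorded)
completed = recorded + [(1, "H2O")]
balanced = missing_atoms(reactants, completed)
```

In this example the product side lacks two hydrogens and one oxygen, which is exactly the omitted water. A completion model has to propose the species that close such gaps, and a check of this kind can only confirm balance, not chemical plausibility.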

2604.27272 2026-05-29 cs.CL cs.AI cs.LG

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

当2D任务遇到1D序列化:结构化任务中的序列化摩擦

Chung-Hsiang Lo, Lu Li, Diji Yang, Tianyu Zhang, Yunkai Zhang, Yoshua Bengio, Yi Zhang

发表机构 * Northeastern University(东北大学) University of Pennsylvania(宾夕法尼亚大学) UC Santa Cruz(加州大学圣克鲁兹分校) Mila - Quebec AI Institute(魁北克人工智能研究所) University of Montreal(蒙特利尔大学) BAIR, UC Berkeley(伯克利大学BAIR实验室)

AI总结 研究通过矩阵转置、康威生命游戏和LU分解三个任务,发现将二维布局任务序列化为一维文本会因表示不匹配导致性能下降,且错误呈现空间结构模式。

详情
AI中文摘要

在LLM时代,许多符号化和结构化问题通过一维文本序列化呈现给模型。然而,其中一些问题本质上是二维的:它们的相关关系,如行列对应或空间邻接,由二维布局中的位置定义,而非顺序。这引发了一个表示问题:在一维序列中保留相同的符号条目是否也保留了计算所需的关系结构?我们通过序列化摩擦的视角研究这一问题:即相同底层任务实例和条目仍然存在,但依赖于布局的关系在一维序列化下变得隐式的表示不匹配。本研究使用三个受控合成测试任务:矩阵转置、康威生命游戏和LU分解。在每个任务中,相同的实例要么作为一维文本序列化呈现,要么作为其原生二维布局渲染为图像呈现。在整个测试集中,随着任务规模增长,一维序列化的性能下降更显著,且序列化下的错误呈现空间结构模式,表明这种呈现选择在我们的测试集中具有重要影响。为了进一步解释这些结果,我们添加了补充分析,包括视觉内探针以及混合训练转置设置下两种输入呈现的额外比较。这些发现表明,对于布局定义的任务,将输入简化为1D序列化并非中性的表示选择。

英文摘要

In the LLM era, many symbolic and structured problems are presented to models through 1D text serialization. Yet some such problems are natively two-dimensional: their relevant relations, such as row--column correspondence or spatial adjacency, are defined by position in a 2D layout rather than by sequential order. This raises a representational question: does preserving the same symbolic entries in a 1D sequence also preserve the relational structure needed for computation? We study this issue through the lens of serialization friction: the representational mismatch in which the same underlying task instances and entries are still present, but relations that depend on layout become implicit under 1D serialization. The study uses a controlled synthetic testbed of three tasks: matrix transpose, Conway's Game of Life, and LU decomposition. In each task, the same instances are presented either as 1D text serialization or as their native 2D layout rendered as an image. Across this testbed, 1D serialization degrades more sharply as task size grows, and errors under serialization exhibit spatially structured patterns, suggesting that this presentation choice is consequential within our testbed. To further interpret these results, we add supplementary analyses that include a within-visual probe and an additional comparison of the two input presentations under the mixed-training transpose setting. These findings suggest that, for layout-defined tasks, reducing inputs to 1D serialization is not a neutral choice of representation.
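The mismatch can be made concrete with index arithmetic on a row-major serialization, the layout assumed in this sketch (the abstract does not specify the paper's exact text format). Transposing becomes a non-local permutation of sequence positions, and cells that are adjacent in 2D sit roughly one row length apart in 1D. The function names are illustrative.

```python
def serialize(matrix):
    """Row-major flattening of a 2D list into a 1D sequence."""
    return [value for row in matrix for value in row]

def transpose_position(pos, rows, cols):
    """Where the element at 1D position `pos` of a rows x cols matrix
    ends up in the row-major serialization of its transpose."""
    i, j = divmod(pos, cols)
    return j * rows + i

def neighbour_offsets(cols):
    """1D offsets of the 8 surrounding cells (as used in Game of Life)
    for an interior cell of a grid with `cols` columns."""
    return sorted(di * cols + dj
                  for di in (-1, 0, 1) for dj in (-1, 0, 1)
                  if (di, dj) != (0, 0))

matrix = [[1, 2, 3],
          [4, 5, 6]]
flat = serialize(matrix)

# Apply the permutation to obtain the serialized transpose.
flat_t = [None] * len(flat)
for pos, value in enumerate(flat):
    flat_t[transpose_position(pos, rows=2, cols=3)] = value

offsets = neighbour_offsets(cols=10)
```

For a grid with 10 columns, the cells directly above and below an interior cell are 10 positions away in the sequence, and the distance grows with grid width. That is one way to read the abstract's observation that 1D serialization degrades more sharply as task size grows.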

2604.26645 2026-05-29 cs.AI cs.LG

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

SciHorizon-DataEVA:面向异构科学数据AI就绪性评估的智能体系统

Dianyu Liu, Chuan Qin, Xi Chen, Xiaohan Li, Wenxi Xu, Yuyang Wang, Xin Chen, Yuanchun Zhou, Hengshu Zhu

发表机构 * SciHorizon Team, Computer Network Information Center, Chinese Academy of Sciences(科学前沿团队,计算机网络信息中心,中国科学院)

AI总结 提出SciHorizon-DataEVA智能体系统,基于Sci-TQA2原则和层次化多智能体评估方法,实现对异构科学数据的可扩展AI就绪性评估。

详情
AI中文摘要

AI-for-Science (AI4Science) 正通过将机器学习模型嵌入跨领域的预测、模拟和假设生成工作流程,日益变革科学发现。然而,这些模型的有效性从根本上受到科学数据AI就绪性的限制,目前尚不存在可扩展且系统的评估机制。在这项工作中,我们提出了SciHorizon-DataEVA,一种新颖的智能体系统,用于对异构科学数据进行可扩展的AI就绪性评估。在评估标准层面,我们引入了Sci-TQA2原则,将AI就绪性组织为四个互补维度:治理可信度、数据质量、AI兼容性和科学适应性。每个维度被分解为可测量的原子元素,以实现细粒度且可执行的评估。为了大规模实施这些原则,我们开发了Sci-TQA2-Eval,一种通过有向循环工作流编排的层次化多智能体评估方法。我们的Sci-TQA2-Eval通过结合轻量级数据集分析、适用性感知的度量激活以及基于领域约束和数据集-论文信号的知识增强规划,动态构建数据集感知的评估规范。这些规范通过自适应的、以工具为中心的评估机制执行,该机制具有内置的验证和自我修正能力,从而实现对异构科学数据的可扩展且可靠的评估。在跨多个领域的科学数据集上的广泛实验证明了SciHorizon-DataEVA在原则性AI就绪性评估方面的有效性和通用性。

英文摘要

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.

2604.26506 2026-05-29 cs.CL cs.CR

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

SafeReview: 防御基于LLM的评审系统免受对抗性隐藏提示攻击

Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Backes, Yue Zhang, Linyi Yang

发表机构 * CISPA Westlake University(西湖大学) Southern University of Science and Technology(南方科技大学) HKUST (Guangzhou)(香港科技大学(广州))

AI总结 提出SafeReview,一种共进化对抗训练框架,通过联合训练生成器和防御者模型,增强基于LLM的同行评审系统对对抗性隐藏提示的鲁棒性。

Comments 17 pages, 5 figures, 8 tables

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地融入学术同行评审,它们对对抗性隐藏提示(即嵌入在提交内容中以操纵结果的对抗性指令)的脆弱性对学术诚信构成了严重威胁。我们提出SafeReview,一种共进化对抗训练框架,用于防御基于LLM的同行评审系统免受此类攻击。SafeReview联合训练一个生成器模型以创建复杂的攻击提示,以及一个防御者模型以在对抗性操纵下保持评审完整性。生成器经过优化以产生越来越有效的提示注入,而防御者则通过基于偏好的训练得到加强,以在干净和受攻击的提交之间保持一致的评审。实验结果表明,与静态防御相比,SafeReview提高了对自适应提示注入攻击的鲁棒性,更好地保留了受攻击下的论文排名,并跨攻击者架构具有泛化能力。这些结果证明了共进化训练作为保障LLM辅助同行评审安全的基础的潜力。

英文摘要

As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial hidden prompts, i.e., adversarial instructions embedded in submissions to manipulate outcomes, poses a critical threat to scholarly integrity. We propose SafeReview, a co-evolutionary adversarial training framework for defending LLM-based peer review systems against such attacks. SafeReview jointly trains a Generator model to create sophisticated attack prompts and a Defender model to preserve review integrity under adversarial manipulation. The Generator is optimized to produce increasingly effective prompt injections, while the Defender is strengthened through preference-based training to maintain consistent reviews between clean and attacked submissions. Experimental results show that SafeReview improves robustness against adaptive prompt injection attacks, better preserves paper ranking under attack, and generalizes across attacker architectures compared with static defenses. These results demonstrate the potential of co-evolutionary training as a foundation for securing LLM-assisted peer review.

2604.23862 2026-05-29 cs.LG cs.AI cs.CL

Graph Memory Transformer (GMT)

图记忆Transformer (GMT)

Nicola Zanarini, Niccolò Ferrari, Evelina Lamma

发表机构 * Bonfiglioli Engineering s.r.l.(博尼菲利工程公司) Department of Engineering, University of Ferrara(费拉拉大学工程学院) NAIS s.r.l.(NAIS公司)

AI总结 提出用显式学习的记忆图替换解码器-only Transformer中的前馈网络子层,保留自回归架构,实现可解释的记忆导航。

Comments 65 pages, 10 figures, 5 tables. Author list updated in arXiv metadata; no technical changes. Code available at https://github.com/Nemesis533/GMT-GraphMemoryTransformer

Abstract

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
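The abstract names the ingredients of the memory cell but not their formulas, so the following is only a rough sketch of how such a cell could be wired, not the authors' implementation (their code is in the repository linked in the comments). Everything specific here is an assumption made for illustration: inverse-square-distance weights stand in for "gravitational source routing", a softmax over edge-propagated logits plus a token projection stands in for "token-conditioned target selection", and a scalar sigmoid gate scales the displacement. The class name `MemoryCellSketch` and the weight names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MemoryCellSketch:
    """Illustrative stand-in for an FFN sublayer; NOT the GMT implementation."""

    def __init__(self, d_model=32, n_centroids=128, seed=0):
        rng = np.random.default_rng(seed)
        # Learned bank of centroids (randomly initialised here).
        self.centroids = rng.normal(size=(n_centroids, d_model))
        # Directed transition (edge) matrix, n_centroids x n_centroids.
        self.edges = 0.1 * rng.normal(size=(n_centroids, n_centroids))
        # Hypothetical projections for target selection and gating.
        self.w_target = rng.normal(size=(d_model, n_centroids)) / np.sqrt(d_model)
        self.w_gate = rng.normal(size=d_model) / np.sqrt(d_model)

    def __call__(self, h):
        # h has shape (tokens, d_model).
        diff = h[:, None, :] - self.centroids[None, :, :]
        dist2 = (diff ** 2).sum(axis=-1)  # (tokens, centroids)

        # ASSUMPTION: "gravitational" routing read as inverse-square weights.
        pull = 1.0 / (dist2 + 1e-6)
        source = pull / pull.sum(axis=-1, keepdims=True)

        # ASSUMPTION: target logits = source mass pushed along the edges,
        # plus a token-dependent term.
        target = softmax(source @ self.edges + h @ self.w_target)

        source_state = source @ self.centroids
        target_state = target @ self.centroids

        # Gated displacement: movement from source toward target,
        # rather than a retrieved value.
        gate = 1.0 / (1.0 + np.exp(-(h @ self.w_gate)))
        return gate[:, None] * (target_state - source_state)


cell = MemoryCellSketch()
tokens = np.random.default_rng(1).normal(size=(5, 32))
update = cell(tokens)  # shape (5, 32), used where the FFN output would go
```

Because the output is a gated difference of two convex combinations of centroids, its norm can never exceed the largest distance between any two centroids. The `source` and `target` weights are ordinary arrays that can be inspected per token, which is the kind of quantity the abstract describes as directly inspectable.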

2604.22280 2026-05-29 cs.CV

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun

Affiliations WeChat Vision, Tencent Inc.; Zhejiang University; Tsinghua University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

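AI summary To address the redundancy and semantic ambiguity that Chain-of-Thought reasoning introduces in retrieval, proposes RIME, a rewrite-driven multimodal embedding framework that jointly optimizes generation and embedding, and uses Cross-Mode Alignment and Refine Reinforcement Learning to achieve efficient and accurate retrieval.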

Abstract

Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.

2604.19011 2026-05-29 cs.LG cs.RO

Accelerating trajectory optimization with Sobolev-trained diffusion policies

Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier

Affiliations Inria - Département d’Informatique de l’École normale supérieure, PSL Research University; Courant Institute, New York University

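AI summary For gradient-based trajectory optimization solvers, proposes training diffusion policies with Sobolev learning to supply initial guesses; a first-order loss over trajectories and feedback gains avoids compounding errors and reduces solving time by 2x to 20x.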

Abstract

Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, convergence speed and quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet, a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this specificity, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors, and so can learn from very few trajectories to provide initial guesses reducing solving time by $2\times$ to $20 \times$. Incorporating first-order information enables predictions with fewer diffusion steps, reducing inference latency.
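The idea of a first-order (Sobolev) loss can be illustrated with a deliberately simplified sketch. It is not the paper's diffusion objective: it assumes a deterministic linear policy u = W x + b, whose Jacobian with respect to the state is W, and it assumes the solver's feedback gain K follows the convention δu = K δx, so that K is the target for that Jacobian. The function name `sobolev_loss`, the weight `lam`, and the synthetic data are all hypothetical.

```python
import numpy as np


def sobolev_loss(W, b, states, controls, gains, lam=1.0):
    """Imitation loss on controls plus a first-order loss on the policy Jacobian.

    W: (m, n) and b: (m,) define the linear policy u = W x + b (an assumption).
    states: (N, n), controls: (N, m), gains: (N, m, n) come from solver output.
    """
    pred = states @ W.T + b
    # Zeroth-order term: match the demonstrated controls.
    value_term = np.mean(np.sum((pred - controls) ** 2, axis=1))
    # First-order term: match the policy Jacobian (W everywhere, since the
    # policy is linear) to the feedback gain attached to each sample.
    jac_term = np.mean(np.sum((W[None, :, :] - gains) ** 2, axis=(1, 2)))
    return value_term + lam * jac_term


# Synthetic demonstrations generated by a single shared gain (illustrative only).
rng = np.random.default_rng(0)
K_true = rng.normal(size=(2, 4))
states = rng.normal(size=(16, 4))
controls = states @ K_true.T
gains = np.repeat(K_true[None, :, :], 16, axis=0)

loss_at_truth = sobolev_loss(K_true, np.zeros(2), states, controls, gains)
loss_perturbed = sobolev_loss(K_true + 0.1, np.zeros(2), states, controls, gains)
```

The gain term constrains how the policy responds to state deviations around each demonstration, not only its value on the demonstration itself. This gives some intuition for why supervising with feedback gains could help with the small off-trajectory deviations the abstract identifies as the source of compounding errors.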