arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11739 2026-05-22 cs.CL

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

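Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang

AI summary: This paper studies where the efficiency of on-policy distillation (OPD) comes from and proposes EffOPD, which accelerates OPD by adaptively selecting an extrapolation step size and moving along the current update direction, achieving an average 3x training speedup while maintaining comparable final performance.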

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of $3\times$ while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
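Illustrative sketch (not the authors' implementation): the abstract describes extrapolating along the current update direction with an adaptively chosen step size. The toy code below assumes the step size is picked from a small candidate list by evaluating a loss; the function name `extrapolate`, the candidate list, and the quadratic objective are all hypothetical.

```python
import numpy as np

def extrapolate(theta_prev, theta_curr, loss_fn, step_sizes=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """Move along the current update direction; keep the step size with the lowest loss.

    Assumption: the step size is selected by a loss probe over a fixed candidate list.
    Including 0.0 means the result is never worse than theta_curr under loss_fn.
    """
    direction = theta_curr - theta_prev                    # current update direction
    candidates = [theta_curr + a * direction for a in step_sizes]
    losses = [loss_fn(c) for c in candidates]
    best = int(np.argmin(losses))
    return candidates[best], step_sizes[best]

# Toy quadratic objective with its minimum at 3.0 in every coordinate.
loss = lambda th: float(np.sum((th - 3.0) ** 2))
theta_prev = np.zeros(4)
theta_curr = np.full(4, 0.5)
theta_next, alpha = extrapolate(theta_prev, theta_curr, loss)
```

In this toy run the largest candidate step wins because the update direction already points at the optimum, which mirrors the abstract's claim that early update directions align with the final one.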

2605.11246 2026-05-22 cs.LG

Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization

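Yonghan Yang, Ye Yuan, Zipeng Sun, Linfeng Du, Bowei He, Haolun Wu, Can Chen, Xue Liu

AI summary: This paper proposes SPADE, a framework that recasts forward surrogate modeling as conditional generative modeling. It models the forward likelihood p(y|x) with a diffusion model and adds a Calibrated Diffusion Estimation module and a Support-Proximity Regularization mechanism to improve optimization performance.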

Comments Accepted by ICML 2026. First two authors contributed equally

Abstract

Offline black-box optimization aims to discover novel designs with high property scores using only a static dataset, a task fundamentally challenged by the out-of-distribution (OOD) extrapolation problem. Existing approaches typically bifurcate into inverse methods, which struggle with the ill-posed nature of mapping scores to designs, and forward methods, which often lack the distributional expressivity to quantify uncertainty effectively. In this work, we propose SPADE (Support-Proximity Augmented Diffusion Estimation), a novel framework that reimagines forward surrogate modeling through the lens of conditional generative modeling. SPADE models the forward likelihood p(y|x) using a diffusion model, but with two critical enhancements to tailor it for optimization: (1) a Calibrated Diffusion Estimation module that enforces global consistency in statistical moments and pairwise rankings, and (2) a Support-Proximity Regularization mechanism that implicitly internalizes the data manifold constraint p(x) via kNN-based density estimation. Theoretically, we prove that our regularization is first-order equivalent to maximizing a Bayesian posterior with a valid design prior. Empirically, SPADE achieves state-of-the-art performance across Design-Bench tasks and an LLM data mixture optimization benchmark.
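Illustrative sketch (not SPADE itself): the abstract mentions a support-proximity regularizer built on kNN-based density estimation. The code below shows the general idea with a plain kNN-distance penalty subtracted from a surrogate score; the penalty form, `lam`, and the toy surrogate are assumptions made for illustration.

```python
import numpy as np

def knn_distance(x, data, k=3):
    """Mean Euclidean distance from x to its k nearest neighbours in data."""
    d = np.linalg.norm(data - x, axis=1)
    return float(np.sort(d)[:k].mean())

def regularized_score(x, surrogate, data, lam=1.0, k=3):
    """Surrogate prediction minus a penalty that grows as x leaves the data support."""
    return surrogate(x) - lam * knn_distance(x, data, k)

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))            # offline designs
surrogate = lambda x: float(x.sum())        # toy surrogate that extrapolates linearly
near = np.array([0.5, 0.5])                 # inside the data support
far = np.array([10.0, 10.0])                # far out of distribution
```

A design far from the dataset receives a large penalty, so an optimizer maximizing `regularized_score` is discouraged from exploiting the surrogate's out-of-distribution extrapolation.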

2605.10696 2026-05-22 cs.RO

VRA: Grounding Discrete-Time Joint Acceleration in Voltage-Constrained Actuation

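Lingwei Zhang, Jiaming Wang, Tianlin Zhang, Zhitao Song, Xuanqi Zeng, Weipeng Xia, Zhongyu Li, Yun-hui Liu

AI summary: This paper proposes VRA, which ties kinematic acceleration to the physics of voltage-constrained actuators so that commanded accelerations remain physically realizable. Experiments show that it removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations.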

Comments 10 pages, Accepted by RSS 2026

Abstract

Discrete-time joint acceleration constraints are widely used to enforce position and velocity limits. However, under voltage-constrained electric actuators, kinematically admissible accelerations may be physically unrealizable, exposing a missing execution-level abstraction. We propose Voltage-Realizable Acceleration (VRA), a joint-level acceleration interface that grounds kinematic acceleration in voltage-constrained actuator physics by restricting commanded accelerations to voltage-realizable constraints. Hardware experiments on electric actuators and a wheel-legged quadruped show that VRA removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations.
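Illustrative sketch (not the paper's model): the abstract does not state the actuator equations, so the code below assumes a textbook brushed DC motor with V = R*i + k_e*omega, torque = k_t*i, and J*a = torque, neglecting inductance, friction, and load torque. Under that assumption a voltage limit translates into a speed-dependent acceleration interval, and a commanded acceleration can be clamped to it. All parameter values are hypothetical.

```python
def voltage_realizable_bounds(omega, v_max, R, k_e, k_t, J):
    """Acceleration interval reachable with |V| <= v_max at joint speed omega."""
    i_min = (-v_max - k_e * omega) / R      # current at full negative voltage
    i_max = (v_max - k_e * omega) / R       # current at full positive voltage
    return k_t * i_min / J, k_t * i_max / J

def clamp_acceleration(a_cmd, omega, **motor):
    """Restrict a commanded acceleration to the voltage-realizable interval."""
    lo, hi = voltage_realizable_bounds(omega, **motor)
    return min(max(a_cmd, lo), hi)

motor = dict(v_max=24.0, R=1.0, k_e=0.1, k_t=0.1, J=0.01)
a_at_rest = clamp_acceleration(300.0, omega=0.0, **motor)     # limit is 240 rad/s^2
a_at_speed = clamp_acceleration(300.0, omega=100.0, **motor)  # back-EMF lowers it to 140
```

The example shows why a kinematically admissible command can be unrealizable: the same requested acceleration is clamped more tightly at speed because back-EMF consumes part of the available voltage.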

2605.08982 2026-05-22 cs.LG

PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

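Yaniv Oren, Viliam Vadocz, Joery A. de Vries, Wendelin Böhmer, Matthijs T. J. Spaan, Hendrik Baier

AI summary: This paper proposes PMCTS, a principled parallel MCTS algorithm suited to neural network evaluations that uses parallel compute for inference-time scaling and significantly outperforms popular heuristic-based baselines across several domains.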

Abstract

Monte Carlo Tree Search (MCTS) is a widely used approach for policy improvement through search and is increasingly popular in real-world applications. Due to the sequential and deterministic nature of its search, scaling the runtime of MCTS with parallel compute remains a major challenge. We introduce Particle MCTS (PMCTS), to our knowledge the first principled parallel MCTS algorithm that is suited to neural network evaluations and can preserve formal policy improvement guarantees. Empirically, PMCTS scales well with parallel compute and significantly outperforms popular heuristic-based baselines across domains.

2605.08389 2026-05-22 cs.CV cs.AI

Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

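Mingyu Liu, Sihan Huang, Yijia Fan, Yinlin Yan, Quan Zhang, Jian-Fang Hu, Jianhuang Lai

AI summary: This paper proposes DeCIR, which decouples endpoint and semantic transition learning for zero-shot composed image retrieval. It constructs paired forward/reverse edit tuples, trains separate low-rank text adapter branches, and merges them into one deployable adapter with Low-Rank Directional Merge (LRDM), improving projection-based zero-shot composed image retrieval.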

Abstract

Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint--transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.

2605.07711 2026-05-22 cs.CL

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

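Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang

AI summary: This paper proposes SimCT, an improved on-policy distillation method that enlarges the supervision space to recover the supervision signal lost to tokenizer mismatch, improving performance on mathematical reasoning and code-generation tasks.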

Comments 4 figures, 6 tables, 28 pages

Abstract

On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at https://github.com/sunjie279/SimCT-.
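Illustrative sketch (one possible reading, not the released code): "finest jointly tokenizable" units can be pictured as the pieces of text between character offsets where both tokenizers place a token boundary. The helper names and the two toy tokenizations below are hypothetical.

```python
def boundaries(tokens):
    """Character offsets at which each token ends."""
    ends, pos = set(), 0
    for t in tokens:
        pos += len(t)
        ends.add(pos)
    return ends

def finest_common_units(tokens_a, tokens_b):
    """Split the text at every offset that is a token boundary for both tokenizers."""
    text = "".join(tokens_a)
    if text != "".join(tokens_b):
        raise ValueError("both tokenizations must spell the same text")
    cuts = sorted(boundaries(tokens_a) & boundaries(tokens_b))
    units, start = [], 0
    for c in cuts:
        units.append(text[start:c])
        start = c
    return units

teacher = ["un", "believ", "able", " results"]
student = ["unbe", "liev", "able", " re", "sults"]
units = finest_common_units(teacher, student)
# units == ["unbeliev", "able", " results"]
```

Only "able" is a shared token here; exact shared-token matching would supervise that position alone, while the other two units are multi-token spans that both tokenizers can still realize.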

2605.07598 2026-05-22 cs.LG

Optimal Recourse Summaries via Bi-Objective Decision Tree Learning

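Ioannis Chatzis, Jason Liartis, Athanasios Voulodimos, Giorgos Stamou

AI summary: This paper proposes SOGAR, which formulates recourse summary learning as an optimal decision tree learning problem and finds the Pareto front, balancing recourse effectiveness against cost and producing stable, low-cost, and effective recourse summaries.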

Abstract

Actionable Recourse provides individuals with actions they can take to change an unfavorable classifier outcome. While useful at the instance level, it is ill-suited for global auditing and bias detection, since aggregating local actions is costly and often inconsistent. Recourse Summaries address this limitation by partitioning the population and assigning one shared action per subgroup, enabling comparison across subgroups. Designing summaries involves a fundamental trade-off between recourse effectiveness and recourse cost, which existing methods do not adequately address. We introduce Summaries of Optimal and Global Actionable Recourse (SOGAR), which formulates recourse summary learning as an optimal decision tree learning problem and finds the Pareto front -- the complete set of solutions where improving one objective necessarily worsens the other. SOGAR enables post-hoc selection of the desired trade-off without retraining. Using shallow axis-parallel decision trees and sparse leaf actions, SOGAR produces stable, low-cost, and effective recourse summaries that outperform existing approaches across effectiveness and cost metrics.
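Illustrative sketch: the Pareto front named in the abstract can be computed for any finite set of candidate summaries by discarding dominated ones. The code below is a generic brute-force filter over hypothetical (effectiveness, cost) pairs, not SOGAR's tree-search procedure.

```python
def pareto_front(solutions):
    """Keep (effectiveness, cost) pairs not dominated by any other pair.

    A pair dominates another if it is at least as effective and at most as
    costly, and strictly better on one of the two objectives.
    """
    front = []
    for i, (eff_i, cost_i) in enumerate(solutions):
        dominated = any(
            eff_j >= eff_i and cost_j <= cost_i and (eff_j > eff_i or cost_j < cost_i)
            for j, (eff_j, cost_j) in enumerate(solutions)
            if j != i
        )
        if not dominated:
            front.append((eff_i, cost_i))
    return sorted(front)

candidates = [(0.60, 1.0), (0.80, 2.5), (0.75, 3.0), (0.95, 4.0), (0.60, 1.5)]
front = pareto_front(candidates)
# front == [(0.6, 1.0), (0.8, 2.5), (0.95, 4.0)]
```

Once the front is available, choosing a trade-off is a lookup rather than a retraining step, which is the post-hoc selection the abstract describes.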

2605.07243 2026-05-22 cs.CL

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

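Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou

AI summary: This work proposes SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. It improves LLM inference efficiency through dynamic tree drafting and path-dependence mechanisms, and at deployment uses verifier feedback for cost-aware adaptation, outperforming existing methods in both speedup and drafting cost.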

Abstract

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.
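Illustrative sketch (an interpretation, not the authors' code): the valid-prefix mask drops the loss at later block positions once an earlier one is wrong. The code below assumes a position keeps its loss exactly when every earlier position in the block was predicted correctly, so the first wrong position is still trained on. The function name and array layout are hypothetical.

```python
import numpy as np

def valid_prefix_mask(correct):
    """correct: (batch, K) booleans, True where the drafter matched the target.

    Returns a (batch, K) boolean mask; position k is kept only if positions
    0..k-1 were all correct. Position 0 is always kept.
    """
    correct = np.asarray(correct, dtype=bool)
    prefix_ok = np.cumprod(correct, axis=1).astype(bool)   # all correct up to and including k
    mask = np.ones_like(correct)
    mask[:, 1:] = prefix_ok[:, :-1]
    return mask

correct = [[True, True, False, True],
           [False, True, True, True]]
mask = valid_prefix_mask(correct)
# mask == [[True, True, True, False],
#          [True, False, False, False]]
```

Multiplying per-position losses by this mask keeps training focused on prefixes the drafter could actually produce at inference.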

2605.06597 2026-05-22 cs.CL cs.AI cs.LG

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

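Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar

AI summary: This paper proposes UniSD, a framework for systematically studying self-distillation. It integrates several mechanisms that improve supervision reliability, representation alignment, and training stability, validates self-distillation across multiple benchmarks and models, and builds UniSDfull, the best-performing pipeline.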

Comments Website: https://unifiedsd.github.io/ Code: https://github.com/Ahren09/UniSD

Abstract

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
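Illustrative sketch: of the components listed in the abstract, EMA teacher stabilization has a standard form in which the teacher's weights are an exponential moving average of the student's. The code below shows that generic update on a toy parameter dictionary; the decay value and names are assumptions, not UniSD's settings.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(100):
    teacher = ema_update(teacher, student)
# After n steps with a fixed student, the teacher equals 1 - decay**n.
```

The teacher trails the student smoothly, which is why an EMA teacher tends to give steadier targets than the student's own latest weights.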

2605.05749 2026-05-22 cs.CV

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

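Feifei Li, Qi Song, Chi Zhang, Rui Huang

AI summary: This paper proposes a ray-aware pointer memory for streaming 3D reconstruction. A unified memory representation models both spatial location and viewing direction, and an adaptive pointer update strategy retains informative pointers while discarding redundant ones, improving long-term reconstruction stability and camera pose accuracy.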

Abstract

Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.

2605.05118 2026-05-22 cs.LG cs.AI stat.ML

On the Wasserstein Gradient Flow Interpretation of Drifting Models

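Arthur Gretton, Li Kevin Wenliang, Alexandre Galashov, James Thornton, Valentin De Bortoli, Arnaud Doucet

AI summary: This paper analyzes drifting models through Wasserstein gradient flows (WGF), relating the GMD framework to WGF paths. It presents three main results: one algorithm in the drifting-models paper corresponds to the limiting point of a WGF on the KL divergence; the algorithm actually implemented resembles the fixed point of a WGF on the Sinkhorn divergence but lacks some of its properties; and the approach extends to the limiting points of other WGFs, including MMD, the sliced Wasserstein distance, and GAN critic functions.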

Abstract

Recently, Deng et al. (2026) proposed Generative Modeling via Drifting (GMD), a novel framework for generative tasks. This note presents an analysis of GMD through the lens of Wasserstein Gradient Flows (WGF), i.e., the path of steepest descent for a functional in the space of probability measures, equipped with the geometry of optimal transport. Unlike previous WGF-based contributions, GMD can be thought of as directly targeting a fixed point of a specific WGF flow. We demonstrate three main results: first, that one algorithm proposed by Deng et al. (2026) corresponds to finding the limiting point of a WGF on the KL divergence, with Parzen smoothing on the densities. Second, that the algorithm actually implemented by Deng et al. (2026) corresponds to a different procedure, which bears some resemblance to the fixed point of a WGF on the Sinkhorn divergence, but lacks certain desirable properties of the latter. Third, the same idea can be extended to the limiting point of other WGFs, including the Maximum Mean Discrepancy (MMD), the sliced Wasserstein distance, and GAN critic functions.
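For orientation, the standard form of a Wasserstein gradient flow for a functional $\mathcal{F}$ over probability measures, and its specialization to the KL divergence against a target $\pi$, can be written as follows. This is the textbook statement without the Parzen smoothing mentioned in the abstract; the symbols $\mu_t$, $\pi$, and $\mathcal{F}$ are notation chosen here, not taken from the note.

```latex
\partial_t \mu_t
  = \nabla \cdot \Big( \mu_t \, \nabla \frac{\delta \mathcal{F}}{\delta \mu}(\mu_t) \Big),
\qquad
\mathcal{F}(\mu) = \mathrm{KL}(\mu \,\|\, \pi)
\;\Longrightarrow\;
\partial_t \mu_t
  = \nabla \cdot \Big( \mu_t \, \nabla \log \frac{\mu_t}{\pi} \Big).
```

The velocity field $-\nabla \log(\mu_t / \pi)$ vanishes when $\mu_t = \pi$, so the target distribution is a fixed point of the flow, which is the kind of limiting point the abstract refers to.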

2605.04217 2026-05-22 cs.LG cs.CL

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

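Yaobo Zhang

AI summary: This paper proposes Jordan-RoPE, a non-semisimple relative positional encoding in which a complex rotary eigenvalue and a nilpotent response share one defective Jordan block, realizing a distance-modulated phase basis. It generates oscillatory-polynomial features such as e^{-γd}cos(ωd) and e^{-γd}sin(ωd), and evaluates the construction on language models.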

Comments 15 pages, 4 figures, 6 tables; code available at https://github.com/ybzhang-nxu/jordan_rope

Abstract

Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{i\omega d}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
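Illustrative sketch: the oscillatory-polynomial features in the abstract follow from the matrix exponential of a 2x2 Jordan block. For J = [[lam, 1], [0, lam]] with lam = -gamma + i*omega, exp(d*J) = e^{lam*d} * [[1, d], [0, 1]]. The code below checks the one-parameter group law numerically; it is a minimal complex 2x2 example, not the paper's real block form or its query/key actions.

```python
import numpy as np

def jordan_operator(d, gamma, omega):
    """exp(d * J) for the Jordan block J = [[lam, 1], [0, lam]], lam = -gamma + 1j*omega."""
    lam = -gamma + 1j * omega
    return np.exp(lam * d) * np.array([[1.0, d], [0.0, 1.0]], dtype=complex)

gamma, omega = 0.05, 0.3
T3, T4, T7 = (jordan_operator(d, gamma, omega) for d in (3, 4, 7))

# Group law: composing lags 3 and 4 gives lag 7.
group_law_holds = np.allclose(T3 @ T4, T7)

# The off-diagonal entry carries the distance-modulated phase d * e^{-gamma d} * e^{i omega d}.
off_diag = T7[0, 1]
```

The diagonal entries give the plain damped rotary terms, while the off-diagonal entry is multiplied by d, which is the coupling a direct sum of separate rotary and distance channels would not produce.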

2605.04062 2026-05-22 cs.LG cs.AI

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

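Shu-Hao Zhang, Le-Tong Huang, Xiang-Sheng Deng, Xin-Yi Zou, Chen Wu, Nan Li, Shao-Qun Zhang, Zhi-Hua Zhou

AI summary: This paper proposes EdgeRazor, a framework that uses mixed-precision quantization-aware distillation to deploy large language models on resource-constrained devices, achieving higher compression ratios and better efficiency.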

Abstract

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and surpasses the strongest 3-bit baselines by 4.38, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4-10$\times$ lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16$\times$ over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor. The code is available on GitHub (https://github.com/zhangsq-nju/EdgeRazor) and Hugging Face (https://huggingface.co/collections/zhangsq-nju/edgerazor-nbit).

2605.03934 2026-05-22 cs.SD cs.AI

Towards Open World Sound Event Detection

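P. H. Hai, L. T. Minh, L. H. Son

AI summary: This paper proposes the Open-World Sound Event Detection (OW-SED) paradigm and addresses overlapping and ambiguous events with a deformable architecture and the novel WOOT framework, improving detection performance in open-world settings.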

Comments 32 pages, 3 figures. Accepted to Signal Processing (Elsevier)

Journal ref: Signal Processing, Article 110707, 2026

Abstract

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.

2605.02784 2026-05-22 cs.CV

HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar

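Yeheng Zong, Pou-Chun Kung, Yike Pan, Seth Isaacson, Yizhou Chen, Ram Vasudevan, Katherine A. Skinner

AI summary: This paper proposes HumanSplatHMR, which closes the loop between geometric pose estimation and differentiable rendering to improve human pose recovery and Gaussian Splatting avatar generation, raising rendering quality for novel views and novel poses.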

Comments Project page: https://scottyehengz.github.io/HumanSplat/

Abstract

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.

2605.02409 2026-05-22 cs.LG

Inducing Permutation Invariant Priors in Bayesian Optimization for Carbon Capture and Storage Applications

在碳捕集与封存应用中诱导排列不变的先验分布

Sofianos Panagiotis Fotias, Vassilis Gaganis

AI总结 本文提出了一种新的高斯过程核(GP-Perm),用于在碳捕集与封存项目中处理排列对称性问题,同时结合深度核学习模型(DKL-DS)以学习排列不变的嵌入,通过八个用例评估了所提出的方法。

详情
AI中文摘要

贝叶斯优化是一种迭代方法,专门用于优化昂贵的黑盒目标函数。高斯过程(GP)等代理模型是贝叶斯优化的黄金标准,但对于具有排列对称性的输入可能效率低下,因为最常用的核函数更适合向量输入,而不是无序的项集。受此问题的启发,我们转向排列不变的贝叶斯优化,用于碳捕集与封存(CCS)项目中的井位布置。高保真黑盒模拟器被设定为在群控制下操作各井,使注入井组和生产井组内部出现标准GP核函数无法利用的排列对称性。在本工作中,我们的主要贡献是一种新的高斯过程核函数(GP-Perm),它通过集合所诱导的经验表示之间的稳定散度来比较集合,从而编码排列不变性,并可以与标准核函数结合以处理额外的向量值输入。作为学习式不变性基线,我们还考虑了使用Deep Sets架构的深度核学习模型(DKL-DS)来学习排列不变的嵌入。我们在8个用例上评估了所提出的方法,包括七个合成基准和一个真实的CCS案例研究(Johansen地层)。

英文摘要

Bayesian Optimization is an iterative method, tailored to optimizing expensive black box objective functions. Surrogate models like Gaussian Processes, which are the gold standard in Bayesian Optimization, can be inefficient for inputs with permutation symmetries, as the most common kernels employed are better suited for vector inputs rather than unordered sets of items. Motivated by this issue, we turn to permutation invariant Bayesian Optimization for well placement in Carbon Capture and Storage projects. The high fidelity black box simulator is instructed to operate wells under group control, giving rise to permutation symmetries within injector and producer groups that cannot be exploited with standard GP kernels. In this work, our main contribution is a novel Gaussian Process kernel (GP-Perm) that encodes permutation invariance by comparing sets through a stable divergence between their induced empirical representations, and can be combined with standard kernels for additional vector-valued inputs. As a learned invariant baseline, we also consider a Deep Kernel Learning model (DKL-DS) using the Deep Sets architecture to learn a permutation-invariant embedding. We evaluate the proposed methodology across 8 use cases, comprising seven synthetic benchmarks and one realistic CCS case study (Johansen formation).
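
The idea of comparing two unordered sets through a divergence between their empirical representations can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the divergence used by GP-Perm, so the sketch swaps in the squared maximum mean discrepancy (MMD) with an RBF base kernel, and the names `rbf`, `mmd2`, `set_kernel`, `lengthscale`, and `outer_scale` are hypothetical.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """RBF base kernel between the rows of a (n, d) and b (m, d)."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def mmd2(x, y, lengthscale=1.0):
    """Squared MMD (biased estimate) between the empirical distributions of two sets."""
    return (rbf(x, x, lengthscale).mean()
            + rbf(y, y, lengthscale).mean()
            - 2.0 * rbf(x, y, lengthscale).mean())

def set_kernel(x, y, lengthscale=1.0, outer_scale=1.0):
    """Set-level kernel exp(-MMD^2 / (2 * outer_scale^2)).

    Every term is an average over all pairs of items, so reordering the
    rows of x or y leaves the value unchanged (up to rounding).
    """
    d2 = max(mmd2(x, y, lengthscale), 0.0)  # guard against tiny negative rounding
    return float(np.exp(-0.5 * d2 / outer_scale**2))

rng = np.random.default_rng(0)
wells = rng.uniform(0.0, 1.0, size=(4, 2))   # four hypothetical well locations (x, y)
shuffled = wells[rng.permutation(4)]         # same wells, different order
other = rng.uniform(0.0, 1.0, size=(4, 2))   # a different placement

k_same = set_kernel(wells, shuffled)
k_a = set_kernel(wells, other)
k_b = set_kernel(shuffled, other)
```

Here `k_same` is 1 up to rounding and `k_a` matches `k_b`, which is the invariance a standard vector kernel on the concatenated coordinates would not provide.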

2605.02098 2026-05-22 cs.CV

From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments

从球形到高斯:在大规模3D环境中点云裁剪策略的比较分析

Maximilian Kellner, Dominik Merkle, Michael Brunklaus, Alexander Reiterer

AI总结 本文比较了点云裁剪策略,提出了一种新的方法以提高大规模3D环境中的模型性能,特别是在户外场景中取得了新的最佳成果。

详情
AI中文摘要

大规模3D点云可能包含数亿个点。即使经过下采样,这些点云对于现代3D神经网络来说仍然太大。为了形成对场景的语义理解,点云被划分为可处理的更小的子云。通常,这种划分通过球形裁剪完成,导致周围几何上下文的丢失。为了解决这个问题,我们提出了替代方法,在保持相近点数的同时产生裁剪范围更大的子云。具体来说,我们将指数、高斯和线性裁剪方法与球形方法进行了比较。我们使用多个室内和室外环境数据集评估了三种3D深度学习模型架构。我们的结果表明,改变裁剪策略可以提高模型性能,特别是在大规模室外场景中,并取得了新的最佳成果。代码可在https://github.com/mvg-inatech/point_cloud_cropping获取。

英文摘要

Large-scale 3D point clouds can consist of hundreds of millions of points. Even after downsampling, these point clouds are too large for modern 3D neural networks. In order to develop a semantic understanding of the scene, the point clouds are divided into smaller subclouds that can be processed. Typically, this division is done using spherical crops, resulting in a loss of surrounding geometric context. To address this issue, we propose alternative methods that produce subclouds with larger crop sizes while maintaining a similar number of points. Specifically, we compare exponential, Gaussian, and linear cropping methods with the spherical method. We evaluated three 3D deep learning model architectures using multiple indoor and outdoor environment datasets. Our results demonstrate that altering the cropping strategy can enhance model performance, especially for large-scale outdoor scenes, yielding new state-of-the-art results. Code is available at https://github.com/mvg-inatech/point_cloud_cropping
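
The contrast between a hard spherical crop and a softer, distance-weighted crop can be sketched as below. This is a minimal sketch, not the repository's implementation: it assumes a Gaussian crop keeps each point with probability `exp(-d^2 / (2 * sigma^2))`, and the function names and the rule for choosing `sigma` are illustrative assumptions.

```python
import numpy as np

def spherical_crop(points, center, radius):
    """Keep every point within `radius` of `center`."""
    dist = np.linalg.norm(points - center, axis=1)
    return points[dist <= radius]

def gaussian_crop(points, center, sigma, rng):
    """Keep each point with probability exp(-d^2 / (2 sigma^2))."""
    dist = np.linalg.norm(points - center, axis=1)
    keep_prob = np.exp(-0.5 * (dist / sigma) ** 2)
    return points[rng.random(len(points)) < keep_prob]

rng = np.random.default_rng(0)
cloud = rng.uniform(-10.0, 10.0, size=(50_000, 3))  # synthetic uniform cloud
center = np.zeros(3)
radius = 2.0

# For uniform density the expected point counts match when
# (2*pi)^(3/2) * sigma^3 == (4/3) * pi * radius^3, i.e. sigma ~= 0.643 * radius.
sigma = radius * (4.0 * np.pi / 3.0) ** (1.0 / 3.0) / np.sqrt(2.0 * np.pi)

sphere = spherical_crop(cloud, center, radius)
gauss = gaussian_crop(cloud, center, sigma, rng)
sphere_extent = np.linalg.norm(sphere - center, axis=1).max()
gauss_extent = np.linalg.norm(gauss - center, axis=1).max()
```

With a comparable point budget, the Gaussian crop thins out points near the center and spends them on more distant context, so `gauss_extent` exceeds `sphere_extent`.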

2605.00392 2026-05-22 cs.CV cs.LG

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

RTPrune: 两次阅读启发的令牌修剪用于高效DeepSeek-OCR推理

Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu

AI总结 本文提出RTPrune,一种针对DeepSeek-OCR的两阶段令牌修剪方法,通过优先保留高范数视觉令牌并利用最优传输理论进行令牌配对和合并,从而在OCR任务中实现更高效的推理和更优的效率-精度权衡。

Comments 21 pages, accepted by ICML2026

详情
AI中文摘要

DeepSeek-OCR利用视觉-文本压缩来减少长文本处理成本并加速推理,但视觉令牌仍然容易出现冗余的文本和结构信息。此外,当前用于传统视觉-语言模型(VLMs)的令牌修剪方法由于不恰当的压缩机制而无法保持文本保真度。通过分析DeepSeek-OCR的解码过程,我们发现了一种独特的双阶段阅读轨迹:模型最初优先处理大多数高范数令牌,然后随后重新分配其注意力到剩余的令牌上。受此启发,我们提出RTPrune,一种专为DeepSeek-OCR设计的双阶段令牌修剪方法。在第一阶段,我们优先保留捕捉显著文本和结构信息的高范数视觉令牌。在第二阶段,剩余的令牌基于最优传输理论进行配对和合并,以实现高效的特征聚合。我们进一步引入了一个动态修剪比率,以适应令牌相似性和文本密度,从而在OCR任务中实现更优的效率-精度权衡。广泛的实验表明,RTPrune在OmniDocBench上实现了99.47%的准确率和1.23倍更快的prefill速度,当应用于DeepSeek-OCR-Large时,仅保留84.25%的令牌。

英文摘要

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23$\times$ faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
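
The two-stage shape described above (keep high-norm tokens, then fold the remainder into them) can be sketched roughly as follows. This is a simplified stand-in, not RTPrune itself: the second stage here uses a nearest-neighbor cosine-similarity assignment instead of optimal transport, the keep ratio is fixed rather than dynamic, and `prune_and_merge` is a hypothetical name.

```python
import numpy as np

def prune_and_merge(tokens, keep_ratio):
    """Stage 1: keep the highest-norm tokens.
    Stage 2: average each remaining token into its most similar kept token."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(keep_ratio * n)))
    norms = np.linalg.norm(tokens, axis=1)
    order = np.argsort(-norms)               # indices sorted by descending norm
    keep_idx = np.sort(order[:n_keep])       # preserve original token order
    rest_idx = np.sort(order[n_keep:])
    kept, rest = tokens[keep_idx], tokens[rest_idx]
    if len(rest) == 0:
        return kept.copy()

    def unit(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    sim = unit(rest) @ unit(kept).T          # cosine similarity, shape (rest, kept)
    assign = sim.argmax(axis=1)              # closest kept token for each leftover
    sums = kept.copy()
    counts = np.ones(n_keep)
    np.add.at(sums, assign, rest)            # unbuffered add handles repeated targets
    np.add.at(counts, assign, 1.0)
    return sums / counts[:, None]            # mean of each merged group

rng = np.random.default_rng(0)
tokens = rng.normal(size=(100, 16))          # 100 synthetic visual tokens
merged = prune_and_merge(tokens, keep_ratio=0.8425)
untouched = prune_and_merge(tokens, keep_ratio=1.0)
```

Merging rather than discarding the low-norm tokens keeps some of their information in the shortened sequence, which is the motivation the abstract gives for the second stage.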

2604.28177 2026-05-22 cs.CV cs.CY

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

AEGIS:一个评估人工智能生成学术图像取证分析的综合基准

Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi, Zixin Ding, Haiyang Sun, Haocheng Gao, Yuan Liu, Liangjia Wang, Yiling Huang, Yujie Wang, Yuyue Zhang, Ronghui Xi, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Haihong E

AI总结 本文提出AEGIS基准,通过七个学术类别和39个细粒度子类型覆盖,揭示了人工智能生成学术图像取证分析的内在难度,同时评估了多种模型在检测、推理和定位方面的性能,揭示了不同模型家族的互补优势。

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

我们介绍了AEGIS,一个用于评估人工智能生成学术图像取证分析的综合基准。与现有基准相比,AEGIS有三个关键改进:(1)领域特定复杂性:涵盖七个学术类别和39个细粒度子类型,暴露了内在的取证难度,其中即使GPT-5.1的整体性能也仅为48.80%,而专家模型只能达到有限的定位精度(IoU 30.09%);(2)多样化的伪造模拟:在25种生成模型中建模四种普遍的学术伪造策略,其中11种模型的平均取证准确率低于50%,表明取证技术落后于生成技术的发展;(3)多维取证评估:共同评估检测、推理和定位,揭示了不同模型家族之间的互补优势,其中多模态大语言模型(MLLMs)在文本伪影识别上的准确率高达84.74%,专家检测器在二元真实性检测上的最高准确率为79.54%。通过评估25种领先的MLLMs、九个专家模型和一个统一的多模态理解和生成模型,AEGIS成为了一个诊断测试平台,揭示了学术图像取证分析中的根本性限制。

英文摘要

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.

2604.26836 2026-05-22 cs.LG cs.SY eess.SY

Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics

面向概率神经网络动力学的不确定性感知预测安全过滤器

Bernd Frauenknecht, Lukas Kesper, Daniel Mayfrank, Henrik Hose, Sebastian Trimpe

AI总结 本文提出了一种不确定性感知的预测安全过滤器(UPSi),利用概率集成(PE)神经网络动力学模型,将未来结果建模为可达集,从而提供严格的安全预测,在基于模型的强化学习(MBRL)中提升探索安全性,同时保持与标准MBRL相当的性能。

详情
AI中文摘要

预测安全过滤器(PSFs)利用模型预测控制,在深度强化学习(RL)探索期间强制满足约束,但其对第一性原理模型或高斯过程的依赖限制了可扩展性和更广泛的适用性。与此同时,基于模型的RL(MBRL)方法通常使用概率集成(PE)神经网络,在先验知识极少的情况下从数据中捕捉复杂的高维动力学。然而,现有将PE整合到PSFs中的尝试缺乏严格的不确定性量化。我们提出不确定性感知预测安全过滤器(UPSi),这是一种利用PE动力学模型、将未来结果表述为可达集,从而提供严格安全预测的PSF。UPSi引入了显式的确定性约束以防止模型被利用(model exploitation),并可无缝集成到常见的MBRL框架中。我们在标准安全RL基准上,于Dyna式MBRL中评估了UPSi,结果显示其探索安全性相较于以往的神经网络PSFs有显著提升,同时性能与标准MBRL相当。UPSi弥合了现代MBRL的可扩展性和通用性与预测安全过滤器的安全保证之间的差距。

英文摘要

Predictive safety filters (PSFs) leverage model predictive control to enforce constraint satisfaction during deep reinforcement learning (RL) exploration, yet their reliance on first-principles models or Gaussian processes limits scalability and broader applicability. Meanwhile, model-based RL (MBRL) methods routinely employ probabilistic ensemble (PE) neural networks to capture complex, high-dimensional dynamics from data with minimal prior knowledge. However, existing attempts to integrate PEs into PSFs lack rigorous uncertainty quantification. We introduce the Uncertainty-Aware Predictive Safety Filter (UPSi), a PSF that provides rigorous safety predictions using PE dynamics models by formulating future outcomes as reachable sets. UPSi introduces an explicit certainty constraint that prevents model exploitation and integrates seamlessly into common MBRL frameworks. We evaluate UPSi within Dyna-style MBRL on standard safe RL benchmarks and report substantial improvements in exploration safety over prior neural network PSFs while maintaining performance on par with standard MBRL. UPSi bridges the gap between the scalability and generality of modern MBRL and the safety guarantees of predictive safety filters.

2604.20665 2026-05-22 cs.CV cs.AI

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

视见之代价:在单体范式内实现可信的多模态推理

Karan Goyal

AI总结 本文提出了一种新的多模态评估方法,通过信息论视角揭示了多模态推理中的视见代价问题,提出了三个新指标并提出了语义充分性准则,挑战了传统多模态评估方法。

Comments Addresses practical viability of Vlabel construction. Writing is grounded. Acknowledgement is duly added

详情
AI中文摘要

视觉语言模型(VLMs)的快速普及通常被视为促成统一的多模态知识发现,但这一说法建立在一个未经充分检验的假设之上:当前VLMs能够忠实地综合多模态数据。我们认为它们往往做不到,这一差距反映了主流的视觉编码器-投影器-LLM范式中的可信性问题。最先进的模型并非从视觉输入中提取有依据的知识,而是经常表现出功能性失明,即利用强大的语言先验绕过严重的视觉表示瓶颈。在本文中,我们质疑传统的多模态评估方法:该方法依赖数据消融或构建新数据集,因而把数据集偏差与架构能力不足混为一谈。我们提出一条信息论式的新路径:模态翻译协议,用于量化我们称之为“视见代价”(Expense of Seeing)的量。通过翻译语义负载而不是消融它们,我们构造了三个新指标,即视见之税(Toll of Seeing, ToS)、视见之咒(Curse of Seeing, CoS)和视见之谬(Fallacy of Seeing, FoS),并最终得出语义充分性准则(SSC)。此外,我们提出多模态扩展的分歧定律这一假说:随着底层语言引擎扩展到前所未有的推理能力,视觉知识瓶颈带来的惩罚可能增加而不是减少。我们主张社区应超越以“多模态增益”作为主要评估目标的做法。通过将SSC从被动的诊断约束提升为主动的架构蓝图,我们为引导下一代人工智能系统走向真正的多模态推理提供了基础。

英文摘要

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.

2604.16076 2026-05-22 cs.LG cs.AI cs.NE

Prototype-Grounded Concept Models for Verifiable Concept Alignment

面向可验证概念对齐的基于原型的概念模型

Stefano Colamonaco, David Debot, Pietro Barbiero, Giuseppe Marra

AI总结 本研究提出了一种基于原型的概念模型(PGCMs),通过将概念与学习到的视觉原型关联起来,从而提高概念对齐的可验证性和可解释性,同时保持预测性能。

详情
AI中文摘要

概念瓶颈模型(CBMs)旨在通过人类可理解的概念来组织预测,从而提高深度学习的可解释性,但它们无法验证所学概念是否与人类意图的含义一致,这反而损害了可解释性。我们提出基于原型的概念模型(PGCMs),将概念锚定在学习到的视觉原型上:这些原型是作为概念显式证据的图像局部。这种锚定使得可以直接检查概念语义,并支持在原型层面进行有针对性的人工干预以纠正不对齐。实证结果表明,PGCMs的预测性能与最先进的CBMs相当,同时显著提高了透明度、可解释性和可干预性。

英文摘要

Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, which undermines interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs achieve predictive performance comparable to that of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.

2604.15774 2026-05-22 cs.CL

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

MemEvoBench: 评估LLM代理中记忆误演化带来的安全风险

Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia, Xue Yang, Lizhuang Ma, Junchi Yan, Qibing Ren

AI总结 本文提出MemEvoBench,首个评估LLM代理长程记忆安全性的基准,针对对抗性记忆注入、带噪声的工具输出和有偏反馈,通过覆盖7个领域、36种风险类型的问答式任务和改编自20个Agent-SafetyBench环境的工作流式任务,表明记忆演化对安全性有重大影响,并指出基于静态提示的防御不足,亟需保障LLM代理记忆演化的安全性。

详情
AI中文摘要

为大型语言模型(LLMs)配备持久化记忆可以增强交互的连续性和个性化,但也引入了新的安全风险。具体而言,受污染或带偏见的记忆不断积累,可能触发异常的代理行为。现有评估方法尚未建立衡量记忆误演化的标准化框架;这一现象指由于反复接触误导信息而导致的渐进式行为漂移。为弥补这一空白,我们提出MemEvoBench,首个评估LLM代理长程记忆安全性的基准,针对对抗性记忆注入、带噪声的工具输出和有偏反馈。该框架包含覆盖7个领域、36种风险类型的问答式任务,并辅以改编自20个Agent-SafetyBench环境、带有噪声工具返回的工作流式任务。两种设置均在多轮交互中使用混合了良性与误导性内容的记忆池来模拟记忆演化。在代表性模型上的实验显示,在有偏记忆更新下安全性显著退化。我们的分析表明,记忆演化是这些失败的重要成因。此外,基于静态提示的防御被证明不足,凸显了保障LLM代理记忆演化安全的紧迫性。

英文摘要

Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

2604.11028 2026-05-22 cs.RO cs.AI

Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

联邦单体机器人:多机器人协调无需机器人内部多代理碎片化

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

AI总结 本文提出了一种联邦单体机器人(FSAR)架构,通过在单体机器人运行时基础上实现多机器人协调,避免了机器人内部的多代理碎片化,提升了协调效率和恢复能力。

Comments 30 pages, 10 figures, 9 tables. Code: https://github.com/s20sc/fsar-fleet-coordination

详情
AI中文摘要

随着具身机器人向舰队规模操作发展,多机器人协调已成为系统挑战的核心。现有方法通常将其视为增加机器人内部多代理分解的动机。我们主张另一种原则:多机器人协调不需要机器人内部的多代理碎片化。每个机器人应保持一个单体具身代理,拥有自己的持久运行时、本地策略范围、能力状态和恢复权限,而协调则通过在舰队层面的联邦实现。我们提出了联邦单体机器人(FSAR),一种基于单体机器人运行时的多机器人协调运行时架构。每个机器人暴露受控的能力表面,而非内部碎片化的代理社会。舰队协调通过共享的能力注册表、跨机器人任务委托、策略感知的权限分配、信任范围内的交互以及分层恢复协议实现。我们正式化了关键协调关系,包括权限委托、跨机器人能力请求、本地与舰队恢复边界以及分层人类监督,并描述了一种支持共享具身能力模块(ECM)发现、合同感知的跨机器人协调以及舰队层面治理的舰队运行时架构。我们在代表性的多机器人协调场景中评估了FSAR,与分解密集的基线进行比较。结果表明,在治理局部性(d=2.91,p<.001 vs. 集中控制)和恢复包含性(d=4.88,p<.001 vs. 分解密集)方面有统计学显著的提升,同时在所有场景中减少了权限冲突和策略违规。我们的结果支持了从具身代理到具身舰队的路径应通过在相干机器人运行时之间进行联邦而非在其中进行碎片化的观点。

英文摘要

As embodied robots move toward fleet-scale operation, multi-robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi-agent decomposition within each robot. We argue for a different principle: multi-robot coordination does not require intra-robot multi-agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single-Agent Robotics (FSAR), a runtime architecture for multi-robot coordination built on single-agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross-robot task delegation, policy-aware authority assignment, trust-scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter-robot capability requests, local-versus-fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract-aware cross-robot coordination, and fleet-level governance. We evaluate FSAR on representative multi-robot coordination scenarios against decomposition-heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition-heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.

2604.09095 2026-05-22 cs.LG math.OC

GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimization

GeoPAS: 在连续黑盒优化中用于算法选择的几何探测

Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir

AI总结 本文提出了一种几何探测框架,通过随机采样多尺度二维切片来表示问题实例,并结合有效性掩码感知的视觉池化进行聚合,从而在连续黑盒优化中实现算法选择。

Comments 20 pages, 9 figures, 6 tables; extended version of a GECCO 2026 poster-track paper; code available at https://github.com/BradWangW/GeoPAS

详情
AI中文摘要

连续黑盒优化的自动化算法选择依赖于在有限探测下表示问题信息,并在具有厚尾性能分布的情况下选择求解器。本文提出了一种几何探测框架,通过随机采样多尺度二维切片来表示每个问题实例。这些切片通过有效性掩码感知的视觉池化进行编码并聚合为实例表示。然后通过结合学习的实例条件估计和算法侧经验先验的对数复合分数进行求解器选择。该框架在标准单目标黑盒优化基准套件上评估,使用十二种求解器的组合,在实例级、分组随机和问题级转移协议下进行测试。在两种套件协议下,它将单最佳求解器的平均相对预期运行时间从30.37降至3.14和3.61,同时提高了中位数和上尾性能。在问题级转移下,传统自适应设置提高了典型和中等尾部性能,但使均值被罕见极端失败所主导;一个先验重的评分变体缓解了这种失败模式,尽管其鲁棒性可能依赖于基准。结果表明,粗粒度几何探测提供了有用的求解器相关信息,而鲁棒跨问题选择也取决于度量对齐的决策评分。

英文摘要

Automated algorithm selection for continuous black-box optimization depends on representing problem information under limited probing and selecting solvers under heavy-tailed performance distributions. This paper proposes a geometric probing framework that represents each problem instance by randomly sampled multi-scale two-dimensional slices of the objective landscape. The slices are encoded with validity-mask-aware visual pooling and aggregated into an instance representation. Solver selection is then performed by a logarithmic composite score combining a learned instance-conditioned estimate with an algorithm-side empirical prior. The framework is evaluated on a standard single-objective black-box optimization benchmark suite with a portfolio of twelve solvers under instance-level, grouped random, and problem-level transfer protocols. Under the two within-suite protocols, it reduces aggregate mean relative expected running time from 30.37 for the single best solver to 3.14 and 3.61, while also improving median and upper-tail performance. Under problem-level transfer, the canonical adaptive setting improves typical and moderate-tail performance but leaves the mean dominated by rare extreme failures; a prior-heavy scoring variant mitigates this failure mode, although its robustness may be benchmark-dependent. The results suggest that coarse geometric probes provide useful solver-relevant information, while robust cross-problem selection also depends on metric-aligned decision scoring.
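
The probing step, sampling random two-dimensional slices of the objective at several scales together with a validity mask, can be sketched as follows. This is a sketch under stated assumptions rather than the GeoPAS code: the slice construction (random center, two orthonormal directions from a QR decomposition, square grid) and all names are illustrative, and the objective `f` is assumed to accept a batch of points of shape `(n, dim)`.

```python
import numpy as np

def sample_slice(f, lower, upper, scale, resolution, rng):
    """Evaluate f on a random 2D plane through the search box.

    Returns (values, mask); mask is False where the plane leaves the box,
    and values are set to 0 there.
    """
    dim = len(lower)
    center = rng.uniform(lower, upper)
    # Two random orthonormal directions spanning the slice plane.
    q, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    half = 0.5 * scale * (upper - lower).max()
    coords = np.linspace(-half, half, resolution)
    a, b = np.meshgrid(coords, coords, indexing="ij")
    pts = center + a[..., None] * q[:, 0] + b[..., None] * q[:, 1]
    mask = np.all((pts >= lower) & (pts <= upper), axis=-1)
    values = np.zeros((resolution, resolution))
    values[mask] = f(pts[mask])              # only query valid points
    return values, mask

def sphere(x):
    """Toy objective, vectorized over the leading axis."""
    return np.sum(x ** 2, axis=-1)

rng = np.random.default_rng(0)
lower, upper = np.full(5, -5.0), np.full(5, 5.0)
slices = [sample_slice(sphere, lower, upper, s, 17, rng) for s in (0.1, 0.5, 1.0)]
```

Each `(values, mask)` pair is an image-like probe; carrying the mask alongside the values is what would let a downstream pooling step ignore cells outside the feasible box.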

2604.08362 2026-05-22 cs.CL cs.AI cs.LG

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

迈向真实世界的人类行为模拟:在长时间跨度、跨场景、异质行为轨迹上对大语言模型进行基准测试

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

AI总结 本文提出OmniBehavior基准测试,通过真实世界数据整合长周期、跨场景和异质行为模式,揭示现有模型在模拟复杂人类行为时的局限性,包括对正向平均人的趋同、人格同质化和乌托邦偏见,为未来高保真模拟研究指明方向。

Comments Project page: https://OmniBehavior.github.io

详情
AI中文摘要

大语言模型(LLMs)的出现揭示了通用用户模拟的潜力。然而,现有基准测试仍局限于孤立场景、狭窄动作空间或合成数据,无法捕捉真实人类行为的整体性。为弥合这一差距,我们引入OmniBehavior,首个完全基于真实世界数据构建的用户模拟基准测试,将长周期、跨场景和异质行为模式整合到统一框架中。基于此基准测试,我们首先提供了实证证据,表明以往孤立场景的数据集存在隧道视野问题,而真实世界决策依赖于长期的跨场景因果链。对最新LLMs的广泛评估显示,当前模型在模拟这些复杂行为时表现不佳,即使扩展上下文窗口,性能也趋于平稳。关键的是,模拟行为与真实行为的系统性比较揭示了根本性的结构偏差:LLMs倾向于趋同于正向平均人,表现出超活跃、人格同质化和乌托邦偏见。这导致了个体差异和长尾行为的丧失,突显了未来高保真模拟研究的关键方向。

英文摘要

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

2604.08295 2026-05-22 cs.AI cs.CV

U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

U-CECE:一个通用的多分辨率框架用于概念反事实解释

Angeliki Dimitriou, Nikolaos Chaidos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou

AI总结 本文提出U-CECE框架,旨在解决概念反事实方法在表达性和效率之间的权衡问题,通过多分辨率层次结构提供不同层次的解释能力,并在不同数据集上验证了其效率与表达性的平衡。

详情
AI中文摘要

随着AI模型日益复杂,可解释性对于建立信任至关重要,然而基于概念的反事实方法仍面临表达性与效率之间的权衡。将底层概念表示为原子集合速度快,但会丢失关系上下文;完整的图表示更加忠实,但需要求解NP难的图编辑距离(GED)问题。我们提出U-CECE,一个统一的、模型无关的多分辨率框架,用于概念反事实解释,能够适应不同的数据条件和计算预算。U-CECE涵盖三个层次的表达性:用于宽泛解释的原子概念、用于简单交互的关系型集合之集合(sets-of-sets),以及用于完整语义结构的结构图。在结构层,框架同时支持基于有监督图神经网络(GNNs)的精度导向直推式(transductive)模式和基于无监督图自编码器(GAEs)的可扩展归纳式(inductive)模式。在结构差异明显的CUB和Visual Genome数据集上的实验刻画了各层次的效率-表达性权衡,而人类调查和基于LVLM的评估表明,检索到的结构反事实与基于精确GED的真值解释在语义上等价,且往往更受青睐。

英文摘要

As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.

2604.07799 2026-05-22 cs.RO cs.AI

Learning Without Losing Identity: Capability Evolution for Embodied Agents

不失身份的学习:具身智能体的能力演化

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

AI总结 本文提出了一种以能力为中心的具身智能体演化范式,通过引入具身能力模块(ECMs)实现持续改进,同时保持智能体身份的稳定性,模拟实验表明其任务成功率优于智能体修改基线和现有技能学习方法,且未出现策略漂移或安全违规。

Comments 12 pages, 2 figures, 7 tables

详情
AI中文摘要

具身智能体被期望在动态物理环境中持续运行,并随时间不断获得新能力。现有提升智能体性能的方法通常依赖于修改智能体本身(通过提示工程、策略更新或结构重新设计),导致长期运行的系统不稳定并丧失身份。本文提出一种面向具身智能体的、以能力为中心的演化范式。我们认为,机器人应保持一个持久的智能体作为其认知身份,同时通过能力的演化实现持续改进。具体而言,我们引入具身能力模块(ECMs)的概念,它们是模块化、带版本的具身功能单元,可随时间学习、优化和组合。我们提出一个统一框架,将能力演化与智能体身份解耦。能力通过包含任务执行、经验收集、模型优化和模块更新的闭环过程演化,而所有执行均由一个强制执行安全与策略约束的运行时层管控。我们通过模拟具身任务表明,能力演化在20次迭代中将任务成功率从32.4%提升到91.3%,优于智能体修改基线和已有的技能学习方法(SPiRL、SkiMo),同时保持零策略漂移和零安全违规。我们的结果表明,将智能体身份与能力演化分离,可为长期具身智能提供可扩展且安全的基础。

英文摘要

Embodied agents are expected to operate persistently in dynamic physical environments, continuously acquiring new capabilities over time. Existing approaches to improving agent performance often rely on modifying the agent itself -- through prompt engineering, policy updates, or structural redesign -- leading to instability and loss of identity in long-lived systems. In this work, we propose a capability-centric evolution paradigm for embodied agents. We argue that a robot should maintain a persistent agent as its cognitive identity, while enabling continuous improvement through the evolution of its capabilities. Specifically, we introduce the concept of Embodied Capability Modules (ECMs), which represent modular, versioned units of embodied functionality that can be learned, refined, and composed over time. We present a unified framework in which capability evolution is decoupled from agent identity. Capabilities evolve through a closed-loop process involving task execution, experience collection, model refinement, and module updating, while all executions are governed by a runtime layer that enforces safety and policy constraints. We demonstrate through simulated embodied tasks that capability evolution improves task success rates from 32.4% to 91.3% over 20 iterations, outperforming both agent-modification baselines and established skill-learning methods (SPiRL, SkiMo), while preserving zero policy drift and zero safety violations. Our results suggest that separating agent identity from capability evolution provides a scalable and safe foundation for long-term embodied intelligence.

2604.07180 2026-05-22 cs.CV cs.AI

Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

基于能量的组织流形用于纵向多参数MRI分析

Kartikay Tehlan, Lukas Förner, Sina Wendrich, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler

AI总结 本文提出了一种基于患者特定能量建模的几何框架,用于纵向多参数MRI分析,通过训练紧凑的隐式神经表示来学习能量函数,为组织状态提供微分几何描述,无需分割标签,展示了患者特定能量流形在纵向mpMRI分析中的应用潜力。

Comments The code is available at https://github.com/tkartikay/EnFold-MRI

详情
AI中文摘要

我们提出一种用于纵向多参数MRI分析的几何框架,其基础是序列空间中的患者特定能量建模。该方法不使用空间网络在图像上操作,而是将每个体素表示为其多序列强度向量($T1$、$T1c$、$T2$、FLAIR、ADC),并通过去噪分数匹配训练一个紧凑的隐式神经表示,从单次基线扫描中学习定义在$\mathbb{R}^d$上的能量函数$E_\theta(\mathbf{u})$。学习到的能量景观无需分割标签即可给出组织状态的微分几何描述:局部极小值定义组织盆地,梯度幅值反映与状态边界的接近程度,拉普拉斯曲率刻画局部约束结构。重要的是,该基线能量流形被视为固定的几何参考:它编码了诊断时观察到的对比度组合,并且在随访时不重新训练。因此,纵向评估被表述为相对于这一基线几何对后续扫描进行评估。我们不比较解剖分割,而是分析MRI序列向量的分布在基线能量函数下如何演变。在一例后来复发的儿科病例中,随访扫描在出现明确的影像学复发之前,就已显示出能量上的渐进偏离,以及序列空间中朝向基线肿瘤相关状态的方向性位移。在一例疾病稳定的病例中,体素分布始终局限于已建立的低能盆地内,没有系统性漂移。所展示的病例作为概念验证表明,患者特定能量流形可以作为纵向mpMRI分析的几何参考系统,而无需显式分割或监督分类,为进一步研究神经肿瘤学中基于流形的风险组织追踪提供了基础。

英文摘要

We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_\theta(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.
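
For readers unfamiliar with denoising score matching, the standard form of the objective for an energy model can be written out as below. This is the textbook formulation, given as an assumption about the setup: the abstract does not state the corruption model or noise scale, so the Gaussian noise and the symbol $\sigma$ are illustrative.

```latex
% Corrupt each baseline voxel vector u with Gaussian noise of scale sigma:
\tilde{\mathbf{u}} = \mathbf{u} + \sigma\,\boldsymbol{\varepsilon},
\qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).

% The model density is p_\theta(u) \propto \exp(-E_\theta(u)), so its score is
% -\nabla E_\theta. The score of the corruption kernel is
% -(\tilde{u} - u)/\sigma^2. Matching the two gives
\mathcal{L}(\theta)
  = \mathbb{E}_{\mathbf{u},\,\boldsymbol{\varepsilon}}
    \left[ \tfrac{1}{2}
      \left\| \nabla_{\tilde{\mathbf{u}}} E_\theta(\tilde{\mathbf{u}})
        - \frac{\tilde{\mathbf{u}} - \mathbf{u}}{\sigma^{2}} \right\|^{2}
    \right].
```

Minimizing this loss makes $\nabla E_\theta$ point from observed intensity combinations toward their noisy perturbations, so observed combinations sit in low-energy basins, consistent with the basin interpretation in the abstract.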

2603.29981 2026-05-22 cs.LG stat.ML

Aligning Validation with Deployment in Spatial Prediction: Target-Weighted Cross-Validation

在空间预测中对齐验证与部署:目标加权交叉验证

Alexander Brenning, Thomas Suesse

AI总结 本文提出了一种基于加权交叉验证的面向部署的验证框架,通过引入目标加权交叉验证(TWCV),使验证任务与指定区域内预测任务的分布对齐,以减少采样偏差造成的性能估计偏差。

详情
AI中文摘要

可靠地估计预测性能对于空间环境建模至关重要,在这类建模中,机器学习模型被用于从分布不均的观测数据生成地图。标准交叉验证(CV)假设验证数据能够代表整个目标区域内的预测条件。在实践中,由于偏好性采样或聚集采样,这一假设经常不成立,导致性能估计和不确定性估计出现偏差。本文提出一种基于加权交叉验证的、面向部署的验证框架,使验证任务与指定区域内预测任务的分布对齐。该框架包括重要性加权交叉验证(IWCV)和一种基于校准的方法,即目标加权交叉验证(TWCV),后者使用具有空间意义的任务描述符,如环境协变量和预测距离。模拟实验表明,传统的非空间和空间交叉验证策略在现实的采样设计下可能出现明显偏差,而当验证任务充分覆盖部署任务空间时,加权交叉验证方法能大幅减少这种偏差。德国二氧化氮(NO₂)浓度制图的案例研究表明,标准交叉验证可能因采样偏差而高估预测误差,而加权交叉验证给出的估计与部署条件更为一致。该框架将验证任务的生成与风险估计分开,为样本分布与预测区域不一致的空间预测场景提供了改进性能评估的实用方法。

英文摘要

Reliable estimation of predictive performance is essential for spatial environmental modeling, where machine-learning models are used to generate maps from unevenly distributed observations. Standard cross-validation (CV) assumes that validation data are representative of prediction conditions across the target domain. In practice, this assumption is often violated due to preferential or clustered sampling, leading to biased performance and uncertainty estimates. We introduce a deployment-oriented validation framework based on weighted CV that aligns validation tasks with the distribution of prediction tasks across a specified domain. The framework includes importance-weighted cross-validation (IWCV) and a calibration-based approach, Target-Weighted Cross-Validation (TWCV), which uses spatially meaningful task descriptors such as environmental covariates and prediction distance. Simulation experiments show that conventional non-spatial and spatial CV strategies can exhibit substantial bias under realistic sampling designs, whereas weighted CV approaches substantially reduce this bias when validation tasks adequately cover the deployment-task space. A case study on mapping nitrogen dioxide (NO$_2$) concentrations across Germany demonstrates that standard CV can overestimate prediction error due to sampling bias, while weighted CV yields estimates more consistent with deployment conditions. The framework separates validation task generation from risk estimation and provides a practical approach for improving performance assessment in spatial prediction settings where sample distributions differ from prediction domains.
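
The importance-weighting idea behind weighted CV can be illustrated with a toy one-dimensional task descriptor. This sketch covers only a basic importance-weighted estimate, not the calibration-based TWCV procedure, and it makes several assumptions: weights come from a simple histogram density ratio, the per-point "errors" are synthetic, and all names are hypothetical.

```python
import numpy as np

def importance_weights(sample_x, target_x, edges):
    """Histogram density-ratio weights w(x) = p_target(x) / p_sample(x)."""
    p_s, _ = np.histogram(sample_x, bins=edges, density=True)
    p_t, _ = np.histogram(target_x, bins=edges, density=True)
    # Bin index of each validation point; the right edge falls in the last bin.
    idx = np.clip(np.digitize(sample_x, edges) - 1, 0, len(p_s) - 1)
    return p_t[idx] / p_s[idx]

def weighted_cv_error(errors, weights):
    """Self-normalized weighted mean of per-point validation errors."""
    return float(np.sum(weights * errors) / np.sum(weights))

rng = np.random.default_rng(0)
# Descriptor in [0, 1], e.g. a scaled prediction distance (hypothetical).
sample_x = rng.beta(1.0, 3.0, size=20_000)      # clustered sampling near 0
target_x = rng.uniform(0.0, 1.0, size=20_000)   # deployment domain is uniform
errors = sample_x ** 2                          # synthetic: error grows with x

edges = np.linspace(0.0, 1.0, 11)
w = importance_weights(sample_x, target_x, edges)
naive = float(errors.mean())
weighted = weighted_cv_error(errors, w)
target_truth = 1.0 / 3.0                        # E[x^2] under the uniform target
```

In this toy setup the unweighted mean is far from `target_truth` because easy cases are over-sampled, while the weighted estimate moves toward it. The direction of the bias is an artifact of how the example was constructed; in the NO₂ case study the abstract reports the opposite direction (overestimated error).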