arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.05118 2026-05-22 cs.LG cs.AI stat.ML

On the Wasserstein Gradient Flow Interpretation of Drifting Models

Arthur Gretton, Li Kevin Wenliang, Alexandre Galashov, James Thornton, Valentin De Bortoli, Arnaud Doucet

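Affiliation: Google DeepMind

AI summary: This note analyzes drifting models through the lens of Wasserstein gradient flows (WGF) and clarifies how the GMD framework relates to WGFs. It reports three main results: one algorithm proposed for drifting models corresponds to the limiting point of a WGF on the KL divergence; the algorithm actually implemented resembles the fixed point of a WGF on the Sinkhorn divergence but lacks some of its desirable properties; and the approach extends to the limiting points of other WGFs, including MMD, the sliced Wasserstein distance, and GAN critic functions.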

English abstract

Recently, Deng et al. (2026) proposed Generative Modeling via Drifting (GMD), a novel framework for generative tasks. This note presents an analysis of GMD through the lens of Wasserstein Gradient Flows (WGF), i.e., the path of steepest descent for a functional in the space of probability measures, equipped with the geometry of optimal transport. Unlike previous WGF-based contributions, GMD can be thought of as directly targeting a fixed point of a specific WGF. We demonstrate three main results: first, that one algorithm proposed by Deng et al. (2026) corresponds to finding the limiting point of a WGF on the KL divergence, with Parzen smoothing on the densities. Second, that the algorithm actually implemented by Deng et al. (2026) corresponds to a different procedure, which bears some resemblance to the fixed point of a WGF on the Sinkhorn divergence, but lacks certain desirable properties of the latter. Third, the same idea can be extended to the limiting point of other WGFs, including the Maximum Mean Discrepancy (MMD), the sliced Wasserstein distance, and GAN critic functions.
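For orientation, the gradient flow mentioned in the abstract is usually written as a continuity equation. The display below is the standard textbook form rather than a formula quoted from the paper; $\mathcal{F}$ (the functional being minimized, for example a KL divergence or an MMD to the data distribution), $\mu_t$ (the evolving model distribution) and $x_t$ (a particle following the flow) are generic symbols assumed here.

```latex
\partial_t \mu_t = \nabla \cdot \Big( \mu_t \, \nabla \frac{\delta \mathcal{F}}{\delta \mu}(\mu_t) \Big),
\qquad
\dot{x}_t = -\nabla \frac{\delta \mathcal{F}}{\delta \mu}(\mu_t)(x_t)
```

In this notation a fixed point is a measure $\mu^\ast$ at which the velocity field $\nabla \frac{\delta \mathcal{F}}{\delta \mu}(\mu^\ast)$ vanishes $\mu^\ast$-almost everywhere, which is the kind of object the abstract describes GMD as targeting.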

2605.04217 2026-05-22 cs.LG cs.CL

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

Yaobo Zhang

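Affiliation: School of Physics, Ningxia University

AI summary: This paper proposes Jordan-RoPE, a non-semisimple relative positional encoding in which a complex rotary eigenvalue and a nilpotent response share one defective Jordan block. The construction realizes a distance-modulated phase basis and generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$ and $e^{-\gamma d}\sin(\omega d)$, and it is evaluated on language-modeling tasks.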

Comments 15 pages, 4 figures, 6 tables; code available at https://github.com/ybzhang-nxu/jordan_rope

English abstract

Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{i\omega d}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
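The link between a defective Jordan block and the listed features can be checked numerically. The sketch below is an illustration under assumptions, not the paper's implementation: it uses a single 2x2 complex block with eigenvalue $-\gamma + i\omega$, and the names `jordan_block_exp`, `matmul2`, `GAMMA` and `OMEGA` are hypothetical.

```python
import cmath
import math


def jordan_block_exp(d, gamma, omega):
    """Return exp(d * J) for J = [[lam, 1], [0, lam]] with lam = -gamma + i*omega.

    Because J = lam*I + N with N nilpotent (N @ N = 0), the exponential has
    the closed form exp(lam*d) * [[1, d], [0, 1]].
    """
    phase = cmath.exp(complex(-gamma, omega) * d)
    return [[phase, d * phase], [0j, phase]]


def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical decay and frequency values, chosen only for illustration.
GAMMA, OMEGA = 0.05, 0.3

# Group law: acting with lag 3 and then lag 4 equals acting with lag 7.
product = matmul2(jordan_block_exp(3, GAMMA, OMEGA),
                  jordan_block_exp(4, GAMMA, OMEGA))
direct = jordan_block_exp(7, GAMMA, OMEGA)
group_law_error = max(abs(product[i][j] - direct[i][j])
                      for i in range(2) for j in range(2))

# The off-diagonal entry carries the distance-modulated features:
# real part d*exp(-gamma*d)*cos(omega*d), imaginary part d*exp(-gamma*d)*sin(omega*d).
coupled = direct[0][1]
```

The diagonal entries reproduce the plain decaying rotary phase, while the off-diagonal entry is where the factor $d$ appears, which is what separates this construction from a direct sum of RoPE and a distance channel.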

2605.04062 2026-05-22 cs.LG cs.AI

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

Shu-Hao Zhang, Le-Tong Huang, Xiang-Sheng Deng, Xin-Yi Zou, Chen Wu, Nan Li, Shao-Qun Zhang, Zhi-Hua Zhou

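Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; School of Intelligent Science and Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Microsoft AI

AI summary: This paper proposes EdgeRazor, a framework that uses mixed-precision quantization-aware distillation to make large language models lightweight enough for resource-constrained devices, achieving higher compression ratios and faster decoding than the compared baselines.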

English abstract

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and surpasses the strongest 3-bit baselines by 4.38, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4-10$\times$ lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16$\times$ over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor. The code is available on GitHub (https://github.com/zhangsq-nju/EdgeRazor) and Hugging Face (https://huggingface.co/collections/zhangsq-nju/edgerazor-nbit).
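The abstract mentions balancing forward and reverse KL terms but does not give the formula. As a rough illustration only, the sketch below blends the two directions with a weight derived from the teacher's normalized entropy. That weighting rule, and the names `kl`, `normalized_entropy` and `blended_kl`, are assumptions made for this example and are not taken from EdgeRazor.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def normalized_entropy(p):
    """Shannon entropy of p divided by its maximum log(len(p)), so it lies in [0, 1]."""
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return h / math.log(len(p))


def blended_kl(teacher, student):
    """Mix forward KL(teacher || student) and reverse KL(student || teacher).

    Hypothetical rule: a high-entropy (uncertain) teacher shifts weight to the
    mass-covering forward term, a confident teacher to the mode-seeking reverse term.
    """
    alpha = normalized_entropy(teacher)
    forward = kl(teacher, student)
    reverse = kl(student, teacher)
    return alpha * forward + (1.0 - alpha) * reverse


teacher_probs = [0.7, 0.2, 0.1]   # illustrative next-token distribution
student_probs = [0.5, 0.3, 0.2]
loss = blended_kl(teacher_probs, student_probs)
```

Forward KL penalizes the student for missing teacher mass, while reverse KL penalizes the student for placing mass where the teacher has little, so any blend trades coverage against sharpness.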

2605.03934 2026-05-22 cs.SD cs.AI

Towards Open World Sound Event Detection

P. H. Hai, L. T. Minh, L. H. Son

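Affiliations: VNU University of Engineering and Technology; Artificial Intelligence Research Center, VNU Information Technology Institute

AI summary: This paper introduces the Open-World Sound Event Detection (OW-SED) paradigm and addresses overlapping and ambiguous events with a 1D deformable architecture and the new WOOT framework, improving detection performance in open-world settings.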

Comments 32 pages, 3 figures. Accepted to Signal Processing (Elsevier)

Journal ref Signal Processing, Article 110707, 2026

English abstract

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.

2605.02784 2026-05-22 cs.CV

HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar

Yeheng Zong, Pou-Chun Kung, Yike Pan, Seth Isaacson, Yizhou Chen, Ram Vasudevan, Katherine A. Skinner

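Affiliation: University of Michigan

AI summary: This paper proposes HumanSplatHMR, which closes the loop between geometric pose estimation and differentiable rendering to improve both human pose recovery and Gaussian Splatting avatar reconstruction, yielding better renderings from novel views and novel poses.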

Comments Project page: https://scottyehengz.github.io/HumanSplat/

English abstract

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.

2605.02409 2026-05-22 cs.LG

Inducing Permutation Invariant Priors in Bayesian Optimization for Carbon Capture and Storage Applications

Sofianos Panagiotis Fotias, Vassilis Gaganis

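Affiliation: School of Mining and Metallurgical Engineering, National Technical University of Athens

AI summary: This paper proposes a new Gaussian Process kernel (GP-Perm) that handles permutation symmetries in well placement for Carbon Capture and Storage projects, and compares it with a Deep Kernel Learning model (DKL-DS) that learns a permutation-invariant embedding. The methods are evaluated on eight use cases.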

English abstract

Bayesian Optimization is an iterative method tailored to optimizing expensive black-box objective functions. Surrogate models like Gaussian Processes, which are the gold standard in Bayesian Optimization, can be inefficient for inputs with permutation symmetries, as the most common kernels employed are better suited for vector inputs than for unordered sets of items. Motivated by this issue, we turn to permutation-invariant Bayesian Optimization for well placement in Carbon Capture and Storage projects. The high-fidelity black-box simulator is instructed to operate wells under group control, giving rise to permutation symmetries within injector and producer groups that cannot be exploited with standard GP kernels. In this work, our main contribution is a novel Gaussian Process kernel (GP-Perm) that encodes permutation invariance by comparing sets through a stable divergence between their induced empirical representations, and can be combined with standard kernels for additional vector-valued inputs. As a learned invariant baseline, we also consider a Deep Kernel Learning model (DKL-DS) using the Deep Sets architecture to learn a permutation-invariant embedding. We evaluate the proposed methodology across 8 use cases, comprising seven synthetic benchmarks and one realistic CCS case study (Johansen formation).
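One way to picture a kernel that compares sets through a divergence between empirical representations is sketched below. It uses the squared Maximum Mean Discrepancy (MMD) with an RBF base kernel as the divergence; that choice, the well coordinates, and the names `rbf`, `mmd2` and `set_kernel` are assumptions for illustration, since the abstract does not say which divergence GP-Perm uses.

```python
import math


def rbf(x, y, length_scale=1.0):
    """RBF base kernel between two points given as tuples of coordinates."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def mmd2(set_a, set_b, length_scale=1.0):
    """Biased estimate of squared MMD between the empirical measures of two sets."""
    def mean_k(xs, ys):
        return sum(rbf(x, y, length_scale) for x in xs for y in ys) / (len(xs) * len(ys))
    value = mean_k(set_a, set_a) + mean_k(set_b, set_b) - 2.0 * mean_k(set_a, set_b)
    return max(0.0, value)  # clip tiny negative values caused by rounding


def set_kernel(set_a, set_b, length_scale=1.0, scale=1.0):
    """Gaussian kernel on sets: depends only on the divergence, not on item order."""
    return math.exp(-mmd2(set_a, set_b, length_scale) / (2.0 * scale ** 2))


# Hypothetical well locations (x, y) for three injectors.
wells = [(0.0, 1.0), (2.0, 0.5), (1.0, 3.0)]
wells_reordered = [wells[2], wells[0], wells[1]]
other_layout = [(0.5, 0.5), (2.5, 1.0), (1.5, 2.0)]

k_original = set_kernel(wells, other_layout)
k_reordered = set_kernel(wells_reordered, other_layout)
```

Because every sum runs over all pairs of items, relabeling the wells inside a group leaves the kernel value unchanged, which is the symmetry a standard vector-input kernel would miss.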

2605.02098 2026-05-22 cs.CV

From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments

Maximilian Kellner, Dominik Merkle, Michael Brunklaus, Alexander Reiterer

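Affiliations: Fraunhofer Institute for Physical Measurement Techniques IPM; University of Freiburg, Department of Sustainable Systems Engineering INATECH

AI summary: This paper compares point cloud cropping strategies, proposing exponential, Gaussian, and linear alternatives to spherical cropping, and shows that they can improve model performance in large-scale 3D environments, with new state-of-the-art results on outdoor scenes.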

English abstract

Large-scale 3D point clouds can consist of hundreds of millions of points. Even after downsampling, these point clouds are too large for modern 3D neural networks. In order to develop a semantic understanding of the scene, the point clouds are divided into smaller subclouds that can be processed. Typically, this division is done using spherical crops, resulting in a loss of surrounding geometric context. To address this issue, we propose alternative methods that produce subclouds with larger crop sizes while maintaining a similar number of points. Specifically, we compare exponential, Gaussian, and linear cropping methods with the spherical method. We evaluated three 3D deep learning model architectures using multiple indoor and outdoor environment datasets. Our results demonstrate that altering the cropping strategy can enhance model performance, especially for large-scale outdoor scenes, yielding new state-of-the-art results. Code is available at https://github.com/mvg-inatech/point_cloud_cropping
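The difference between a hard spherical crop and a distance-weighted crop can be sketched as follows. The keep probability $\exp(-d^2 / 2\sigma^2)$ used for the Gaussian variant is an assumption for illustration; the abstract does not state the exact sampling rule, and the function names are hypothetical (the authors' code is at the URL above).

```python
import math
import random


def spherical_crop(points, center, radius):
    """Keep every point within `radius` of `center` and nothing beyond it."""
    return [p for p in points if math.dist(p, center) <= radius]


def gaussian_crop(points, center, sigma, rng):
    """Keep each point with probability exp(-d^2 / (2 sigma^2)).

    Points near the center are almost always kept; distant points are kept
    rarely, so the crop reaches further out for a similar point budget.
    """
    kept = []
    for p in points:
        d = math.dist(p, center)
        if rng.random() < math.exp(-d * d / (2.0 * sigma * sigma)):
            kept.append(p)
    return kept


rng = random.Random(0)
cloud = [(rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(-10, 10))
         for _ in range(2000)]
origin = (0.0, 0.0, 0.0)

sphere_points = spherical_crop(cloud, origin, radius=3.0)
gauss_points = gaussian_crop(cloud, origin, sigma=3.0, rng=rng)
```

In this sketch the spherical crop has a hard boundary at the radius, whereas the Gaussian crop thins the cloud gradually and so can retain some far-away context points.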

2605.00392 2026-05-22 cs.CV cs.LG

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu

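Affiliations: Shanghai Jiao Tong University; University of Science and Technology of China; Tsinghua University

AI summary: This paper proposes RTPrune, a two-stage token pruning method for DeepSeek-OCR that first retains high-norm visual tokens and then pairs and merges the remaining tokens using optimal transport, giving faster inference and a better efficiency-accuracy trade-off on OCR tasks.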

Comments 21 pages, accepted by ICML2026

English abstract

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23$\times$ faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
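The two stages can be pictured with a small sketch. Stage 1 below keeps the highest-norm tokens, as described in the abstract; stage 2 uses greedy cosine-similarity pairing followed by averaging as a simplified stand-in for the optimal-transport matching, which the abstract does not spell out. The function name `two_stage_prune` and the fixed `keep_ratio` are hypothetical, and the paper's pruning ratio is dynamic rather than fixed.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def two_stage_prune(tokens, keep_ratio):
    """Return (kept high-norm tokens, merged remaining tokens)."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    by_norm = sorted(range(len(tokens)), key=lambda i: norm(tokens[i]), reverse=True)

    # Stage 1: keep the highest-norm tokens, preserving their original order.
    kept = [tokens[i] for i in sorted(by_norm[:n_keep])]

    # Stage 2 (stand-in for optimal transport): greedily pair each remaining
    # token with its most similar partner and replace the pair by its average.
    rest = [tokens[i] for i in sorted(by_norm[n_keep:])]
    merged = []
    while len(rest) >= 2:
        a = rest.pop(0)
        j = max(range(len(rest)), key=lambda k: cosine(a, rest[k]))
        b = rest.pop(j)
        merged.append([(x + y) / 2.0 for x, y in zip(a, b)])
    merged.extend(rest)  # an unpaired leftover token is kept as-is
    return kept, merged


visual_tokens = [[3.0, 0.0], [0.1, 0.0], [0.0, 4.0], [0.0, 0.2], [0.1, 0.1]]
kept_tokens, merged_tokens = two_stage_prune(visual_tokens, keep_ratio=0.4)
```

Merging instead of discarding the low-norm tokens keeps some of their information in the sequence, which is the motivation the abstract gives for the second stage.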

2604.28177 2026-05-22 cs.CV cs.CY

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi, Zixin Ding, Haiyang Sun, Haocheng Gao, Yuan Liu, Liangjia Wang, Yiling Huang, Yujie Wang, Yuyue Zhang, Ronghui Xi, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Haihong E

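Affiliation: Beijing University of Posts and Telecommunications

AI summary: This paper presents the AEGIS benchmark, which covers seven academic categories and 39 fine-grained subtypes and exposes the intrinsic difficulty of forensic analysis of AI-generated academic images. It evaluates a range of models on detection, reasoning, and localization, revealing complementary strengths across model families.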

Comments Accepted to ACL 2026 Main Conference

English abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
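The localization figures above are reported as IoU (intersection over union). A minimal sketch of that metric for binary masks follows; the mask layout (nested lists of 0/1) and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details taken from the benchmark.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return intersection / union


predicted_forgery = [[1, 1, 0],
                     [0, 0, 0]]
annotated_forgery = [[0, 1, 1],
                     [0, 0, 0]]
score = mask_iou(predicted_forgery, annotated_forgery)  # 1 shared pixel out of 3 marked
```

An IoU of about 30% therefore means that, on average, less than a third of the area marked by either the model or the annotation is marked by both.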

2604.26836 2026-05-22 cs.LG cs.SY eess.SY

Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics

Bernd Frauenknecht, Lukas Kesper, Daniel Mayfrank, Henrik Hose, Sebastian Trimpe

Affiliations: Institute for Data Science in Mechanical Engineering (DSME), RWTH Aachen University; Institute of Climate and Energy Systems (ICE), Energy Systems Engineering (ICE-1), Forschungszentrum Jülich GmbH

AI summary: This paper proposes the Uncertainty-Aware Predictive Safety Filter (UPSi), which formulates future outcomes as reachable sets and uses probabilistic ensemble (PE) neural network dynamics models to provide rigorous safety predictions, improving exploration safety in model-based reinforcement learning (MBRL) while keeping performance on par with standard MBRL.

English abstract

Predictive safety filters (PSFs) leverage model predictive control to enforce constraint satisfaction during deep reinforcement learning (RL) exploration, yet their reliance on first-principles models or Gaussian processes limits scalability and broader applicability. Meanwhile, model-based RL (MBRL) methods routinely employ probabilistic ensemble (PE) neural networks to capture complex, high-dimensional dynamics from data with minimal prior knowledge. However, existing attempts to integrate PEs into PSFs lack rigorous uncertainty quantification. We introduce the Uncertainty-Aware Predictive Safety Filter (UPSi), a PSF that provides rigorous safety predictions using PE dynamics models by formulating future outcomes as reachable sets. UPSi introduces an explicit certainty constraint that prevents model exploitation and integrates seamlessly into common MBRL frameworks. We evaluate UPSi within Dyna-style MBRL on standard safe RL benchmarks and report substantial improvements in exploration safety over prior neural network PSFs while maintaining performance on par with standard MBRL. UPSi bridges the gap between the scalability and generality of modern MBRL and the safety guarantees of predictive safety filters.
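As a loose illustration of turning ensemble predictions into a set-valued safety check, the sketch below takes the per-dimension interval hull of the ensemble members' next-state predictions and accepts an action only if that interval is narrow enough and lies inside the safe region. The interval hull, the width threshold `max_width`, and all function names are assumptions for this example. The abstract does not specify how UPSi constructs its reachable sets or its certainty constraint, and an interval over ensemble members is not by itself a rigorous reachable set.

```python
def interval_hull(predictions):
    """Per-dimension [min, max] over the next-state predictions of all ensemble members."""
    lows = [min(column) for column in zip(*predictions)]
    highs = [max(column) for column in zip(*predictions)]
    return lows, highs


def is_action_allowed(predictions, safe_low, safe_high, max_width):
    """Accept an action only if the ensemble is confident and the hull stays safe."""
    lows, highs = interval_hull(predictions)
    # Hypothetical certainty constraint: members must agree to within max_width.
    certain = all(high - low <= max_width for low, high in zip(lows, highs))
    # State constraint: the whole interval must lie inside the safe box.
    safe = all(s_low <= low and high <= s_high
               for low, high, s_low, s_high in zip(lows, highs, safe_low, safe_high))
    return certain and safe


# Three ensemble members predicting a 2-D next state for one candidate action.
member_predictions = [[0.9, 0.10], [1.0, 0.20], [1.1, 0.15]]
allowed = is_action_allowed(member_predictions,
                            safe_low=[-2.0, -2.0], safe_high=[2.0, 2.0],
                            max_width=0.5)
```

Rejecting actions where the members disagree is one simple way to keep a planner from exploiting regions where the learned model is unreliable.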

2604.20665 2026-05-22 cs.CV cs.AI

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Karan Goyal

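Affiliation: IIIT Delhi, India

AI summary: This paper proposes a new approach to multimodal evaluation. Taking an information-theoretic view, it quantifies the "Expense of Seeing" in multimodal reasoning, introduces three new metrics and the Semantic Sufficiency Criterion, and challenges conventional multimodal evaluation methodology.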

Comments Addresses practical viability of Vlabel construction. Writing is grounded. Acknowledgement is duly added

English abstract

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.

2604.16076 2026-05-22 cs.LG cs.AI cs.NE

Prototype-Grounded Concept Models for Verifiable Concept Alignment

Stefano Colamonaco, David Debot, Pietro Barbiero, Giuseppe Marra

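Affiliations: Department of Computer Science, KU Leuven; IBM Research, Zurich

AI summary: This work proposes Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes, making concept alignment verifiable and improving interpretability while keeping predictive performance comparable to state-of-the-art CBMs.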

English abstract

Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, which undermines interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs achieve predictive performance similar to that of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.

2604.15774 2026-05-22 cs.CL

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia, Xue Yang, Lizhuang Ma, Junchi Yan, Qibing Ren

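Affiliations: Shanghai Jiao Tong University; East China Normal University; Shandong University; Duke University

AI summary: This paper introduces MemEvoBench, the first benchmark for long-horizon memory safety in LLM agents, targeting adversarial memory injection, noisy tool outputs, and biased feedback. It combines QA-style tasks spanning 7 domains and 36 risk types with workflow-style tasks adapted from 20 Agent-SafetyBench environments. Experiments indicate that memory evolution contributes substantially to safety failures and that static prompt-based defenses are insufficient.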

English abstract

Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

2604.11028 2026-05-22 cs.RO cs.AI

Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Affiliations: School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus; School of Future Science and Engineering, Soochow University

AI summary: This paper proposes Federated Single-Agent Robotics (FSAR), an architecture that builds multi-robot coordination on single-agent robot runtimes, avoiding multi-agent fragmentation inside each robot and improving governance locality and recovery containment.

Comments 30 pages, 10 figures, 9 tables. Code: https://github.com/s20sc/fsar-fleet-coordination

English abstract

As embodied robots move toward fleet-scale operation, multi-robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi-agent decomposition within each robot. We argue for a different principle: multi-robot coordination does not require intra-robot multi-agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single-Agent Robotics (FSAR), a runtime architecture for multi-robot coordination built on single-agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross-robot task delegation, policy-aware authority assignment, trust-scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter-robot capability requests, local-versus-fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract-aware cross-robot coordination, and fleet-level governance. We evaluate FSAR on representative multi-robot coordination scenarios against decomposition-heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition-heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.

2604.09095 2026-05-22 cs.LG math.OC

GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimization

GeoPAS: 在连续黑盒优化中用于算法选择的几何探测

Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir

Affiliations * Duke Kunshan University; Mohamed bin Zayed University of Artificial Intelligence

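AI summary This paper proposes a geometric probing framework for algorithm selection in continuous black-box optimization. Each problem instance is represented by randomly sampled multi-scale two-dimensional slices of the objective landscape, which are encoded and aggregated with validity-mask-aware visual pooling.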

Comments 20 pages, 9 figures, 6 tables; extended version of a GECCO 2026 poster-track paper; code available at https://github.com/BradWangW/GeoPAS

详情
AI中文摘要

连续黑盒优化的自动化算法选择依赖于在有限探测下表示问题信息,并在具有厚尾性能分布的情况下选择求解器。本文提出了一种几何探测框架,通过随机采样多尺度二维切片来表示每个问题实例。这些切片通过有效性掩码感知的视觉池化进行编码并聚合为实例表示。然后通过结合学习的实例条件估计和算法侧经验先验的对数复合分数进行求解器选择。该框架在标准单目标黑盒优化基准套件上评估,使用十二种求解器的组合,在实例级、分组随机和问题级转移协议下进行测试。在两种套件协议下,它将单最佳求解器的平均相对预期运行时间从30.37降至3.14和3.61,同时提高了中位数和上尾性能。在问题级转移下,传统自适应设置提高了典型和中等尾部性能,但使均值被罕见极端失败所主导;一个先验重的评分变体缓解了这种失败模式,尽管其鲁棒性可能依赖于基准。结果表明,粗粒度几何探测提供了有用的求解器相关信息,而鲁棒跨问题选择也取决于度量对齐的决策评分。

英文摘要

Automated algorithm selection for continuous black-box optimization depends on representing problem information under limited probing and selecting solvers under heavy-tailed performance distributions. This paper proposes a geometric probing framework that represents each problem instance by randomly sampled multi-scale two-dimensional slices of the objective landscape. The slices are encoded with validity-mask-aware visual pooling and aggregated into an instance representation. Solver selection is then performed by a logarithmic composite score combining a learned instance-conditioned estimate with an algorithm-side empirical prior. The framework is evaluated on a standard single-objective black-box optimization benchmark suite with a portfolio of twelve solvers under instance-level, grouped random, and problem-level transfer protocols. Under the two within-suite protocols, it reduces aggregate mean relative expected running time from 30.37 for the single best solver to 3.14 and 3.61, while also improving median and upper-tail performance. Under problem-level transfer, the canonical adaptive setting improves typical and moderate-tail performance but leaves the mean dominated by rare extreme failures; a prior-heavy scoring variant mitigates this failure mode, although its robustness may be benchmark-dependent. The results suggest that coarse geometric probes provide useful solver-relevant information, while robust cross-problem selection also depends on metric-aligned decision scoring.
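The abstract describes solver selection by a logarithmic composite score that combines a learned instance-conditioned estimate with an algorithm-side empirical prior. A minimal sketch of one plausible form follows, assuming both quantities are positive running-time estimates and that the blend is a convex combination in log space. The function names, the `prior_weight` parameter, the solver labels, and all numbers are hypothetical and are not taken from the paper.

```python
import math

def composite_log_score(predicted_rt, prior_rt, prior_weight):
    """Blend two positive running-time estimates in log space (lower is better).

    prior_weight = 0 trusts only the learned per-instance estimate;
    prior_weight = 1 trusts only the algorithm-side empirical prior.
    """
    return ((1.0 - prior_weight) * math.log(predicted_rt)
            + prior_weight * math.log(prior_rt))

def select_solver(predicted, prior, prior_weight=0.5):
    """Return the solver name with the lowest composite score."""
    scores = {
        name: composite_log_score(predicted[name], prior[name], prior_weight)
        for name in predicted
    }
    return min(scores, key=scores.get)

# Hypothetical relative running times for three illustrative solvers.
predicted = {"cma": 2.0, "de": 1.5, "nm": 1.2}   # learned, instance-conditioned
prior = {"cma": 1.5, "de": 3.0, "nm": 400.0}     # empirical, heavy-tailed

for w in (0.0, 0.2, 0.8):
    print(w, select_solver(predicted, prior, prior_weight=w))
```

In this toy setting a larger `prior_weight` steers the choice away from a solver whose empirical record contains extreme failures, which is the kind of behavior the abstract attributes to its prior-heavy scoring variant.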

2604.08362 2026-05-22 cs.CL cs.AI cs.LG

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

迈向真实世界的人类行为模拟:在长时间跨度、跨场景、异质行为轨迹上对大语言模型进行基准测试

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

Affiliations * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Kuaishou Technology

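AI summary This paper introduces OmniBehavior, a user simulation benchmark built entirely from real-world data that integrates long-horizon, cross-scenario, and heterogeneous behavioral patterns. Evaluations reveal the limits of current LLMs in simulating complex human behavior, including convergence toward a positive average person, persona homogenization, and a utopian bias, and point to directions for future high-fidelity simulation research.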

Comments Project page: https://OmniBehavior.github.io

详情
AI中文摘要

大语言模型(LLMs)的出现揭示了通用用户模拟的潜力。然而,现有基准测试仍局限于孤立场景、狭窄动作空间或合成数据,无法捕捉真实人类行为的整体性。为弥合这一差距,我们引入OmniBehavior,首个完全基于真实世界数据构建的用户模拟基准测试,将长周期、跨场景和异质行为模式整合到统一框架中。基于此基准测试,我们首先提供了实证证据,表明以往孤立场景的数据集存在隧道视野问题,而真实世界决策依赖于长期的跨场景因果链。对最新LLMs的广泛评估显示,当前模型在模拟这些复杂行为时表现不佳,即使扩展上下文窗口,性能也趋于平稳。关键的是,模拟行为与真实行为的系统性比较揭示了根本性的结构偏差:LLMs倾向于趋同于正向平均人,表现出超活跃、人格同质化和乌托邦偏见。这导致了个体差异和长尾行为的丧失,突显了未来高保真模拟研究的关键方向。

英文摘要

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

2604.08295 2026-05-22 cs.AI cs.CV

U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

U-CECE:一个通用的多分辨率框架用于概念反事实解释

Angeliki Dimitriou, Nikolaos Chaidos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou

Affiliations * Artificial Intelligence and Learning Systems (AILS) laboratory, National Technical University of Athens

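AI summary This paper proposes U-CECE, a framework that addresses the trade-off between expressivity and efficiency in concept-based counterfactual methods. A multi-resolution hierarchy offers three levels of explanatory power, and experiments on the CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels.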

详情
AI中文摘要

随着AI模型日益复杂,可解释性对于建立信任至关重要,然而基于概念的反事实方法仍面临表达性与效率之间的权衡。将底层概念表示为原子集合虽然快速但忽略了关系上下文,而完整的图表示更加忠实但需要解决NP难的图编辑距离(GED)问题。我们提出了U-CECE,一个统一的、模型无关的多分辨率框架,用于概念反事实解释,能够适应数据环境和计算预算。U-CECE涵盖三个层次的表达性:原子概念用于广泛解释,关系集合-集合用于简单交互,以及结构图用于完整语义结构。在结构层,支持基于监督图神经网络(GNNs)的精度导向的归纳模式和基于无监督图自动编码器(GAEs)的可扩展归纳模式。在结构上,CUB和视觉基因组数据集的实验展示了不同层次的效率-表达性权衡,同时人类调查和LVLM基于评估表明,检索到的结构反事实与精确GED基于的地面真相解释在语义上等价,且常被优先选择。

英文摘要

As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.

2604.07799 2026-05-22 cs.RO cs.AI

Learning Without Losing Identity: Capability Evolution for Embodied Agents

无需失去身份的学习:体素代理的能力进化

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Affiliations * School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University; School of Future Science and Engineering, Soochow University

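AI summary This paper proposes a capability-centric evolution paradigm for embodied agents. Embodied Capability Modules (ECMs) enable continuous improvement while the agent's identity stays stable, and experiments on simulated tasks show higher task success rates than agent-modification baselines and established skill-learning methods, with zero safety violations.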

Comments 12 pages, 2 figures, 7 tables

详情
AI中文摘要

体素代理被期望在动态物理环境中持续运作,并随时间不断获得新能力。现有方法通常通过修改代理本身来提高性能,导致长期系统不稳定和身份丢失。本文提出了一种以能力为中心的进化范式,认为机器人应保持持久的代理作为认知身份,同时通过能力进化实现持续改进。具体而言,我们引入了体素能力模块(ECMs),代表可随时间学习、优化和组合的模块化功能单元。我们提出一个统一框架,将能力进化与代理身份解耦。能力通过包含任务执行、经验收集、模型优化和模块更新的闭环过程进化,所有执行均由运行时层控制,确保安全性和策略约束。通过模拟体素任务证明,能力进化在20次迭代中将任务成功率从32.4%提升到91.3%,优于代理修改基线和现有技能学习方法(SPiRL, SkiMo),同时保持零策略漂移和零安全违规。我们的结果表明,将代理身份与能力进化分离为长期体素智能提供了可扩展且安全的基础。

英文摘要

Embodied agents are expected to operate persistently in dynamic physical environments, continuously acquiring new capabilities over time. Existing approaches to improving agent performance often rely on modifying the agent itself -- through prompt engineering, policy updates, or structural redesign -- leading to instability and loss of identity in long-lived systems. In this work, we propose a capability-centric evolution paradigm for embodied agents. We argue that a robot should maintain a persistent agent as its cognitive identity, while enabling continuous improvement through the evolution of its capabilities. Specifically, we introduce the concept of Embodied Capability Modules (ECMs), which represent modular, versioned units of embodied functionality that can be learned, refined, and composed over time. We present a unified framework in which capability evolution is decoupled from agent identity. Capabilities evolve through a closed-loop process involving task execution, experience collection, model refinement, and module updating, while all executions are governed by a runtime layer that enforces safety and policy constraints. We demonstrate through simulated embodied tasks that capability evolution improves task success rates from 32.4% to 91.3% over 20 iterations, outperforming both agent-modification baselines and established skill-learning methods (SPiRL, SkiMo), while preserving zero policy drift and zero safety violations. Our results suggest that separating agent identity from capability evolution provides a scalable and safe foundation for long-term embodied intelligence.

2604.07180 2026-05-22 cs.CV cs.AI

Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

基于能量的组织流形用于纵向多参数MRI分析

Kartikay Tehlan, Lukas Förner, Sina Wendrich, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler

Affiliations * Dept. of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Augsburg, Germany; Digital Medicine, University Hospital Augsburg, Germany; Chair for Computer Aided Medical Procedures and Augmented Reality, Technical University of Munich, Germany; Bavarian Center for Cancer Research (BZKF) Augsburg, Germany; Dept. of Pediatrics and Adolescent Medicine, University Hospital Augsburg, Germany; Center for Advanced Analytics and Predictive Sciences, University of Augsburg, Germany

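AI summary This paper proposes a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling. A compact implicit neural representation is trained to learn an energy function that gives a differential-geometric description of tissue regimes without segmentation labels, and two cases illustrate the potential of patient-specific energy manifolds for longitudinal mpMRI analysis.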

Comments The code is available at https://github.com/tkartikay/EnFold-MRI

详情
AI中文摘要

我们提出了一种基于患者特定能量建模的几何框架,用于纵向多参数MRI分析。该框架基于序列空间中的患者特定能量建模,而不是在具有空间网络的图像上进行操作。每个体素由其多序列强度向量(T1,T1c,T2,FLAIR,ADC)表示,并通过去噪分数匹配训练紧凑的隐式神经表示,以从单次基线扫描学习一个能量函数E_θ(u) over R^d。学习的能量景观提供了没有分割标签的组织状态的微分几何描述。局部极小值定义了组织盆地,梯度大小反映了接近状态边界的可能性,拉普拉斯曲率表征了局部约束结构。重要的是,该基线能量流形被视为固定的几何参考:它编码了诊断时观察到的对比组合,并且在随访时不进行重新训练。因此,纵向评估被公式化为对后续扫描相对于此基线几何的评估。而不是比较解剖分割,我们分析MRI序列向量的分布如何在基线能量函数下演变。在一项儿童病例中,复发后随访扫描显示能量和方向位移在序列空间中逐渐偏离基线肿瘤相关状态,但在明显放射学再出现之前。在一项稳定疾病病例中,体素分布仍被限制在已建立的低能盆地内,没有系统性漂移。所展示的病例证明了患者特定能量流形可以作为纵向mpMRI分析的几何参考系统,而无需显式分割或监督分类,为进一步研究基于流形的肿瘤风险区域追踪提供了基础。

英文摘要

We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_\theta(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.

2603.29981 2026-05-22 cs.LG stat.ML

Aligning Validation with Deployment in Spatial Prediction: Target-Weighted Cross-Validation

在空间预测中对齐验证与部署:目标加权交叉验证

Alexander Brenning, Thomas Suesse

Affiliations * Friedrich Schiller University Jena; ELLIS Unit Jena

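AI summary This paper proposes a deployment-oriented validation framework based on weighted cross-validation. It introduces Target-Weighted Cross-Validation (TWCV) to align validation tasks with the distribution of prediction tasks across a specified domain, reducing the bias that preferential or clustered sampling introduces into performance estimates.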

详情
AI中文摘要

可靠地估计预测性能对于空间环境建模至关重要,其中机器学习模型用于从不均匀分布的观测数据中生成地图。标准交叉验证(CV)假设验证数据能代表目标领域内预测条件的分布。在实践中,由于选择性或集群采样,这一假设经常被违反,导致性能和不确定性估计偏倚。本文引入了一种基于加权交叉验证的部署导向验证框架,该框架通过重要性加权交叉验证(IWCV)和基于校准的方法,目标加权交叉验证(TWCV),利用具有空间意义的任务描述符如环境协变量和预测距离。模拟实验表明,传统非空间和空间交叉验证策略在现实采样设计下会表现出显著偏倚,而加权交叉验证方法在验证任务充分覆盖部署任务空间时能大幅减少这种偏倚。德国氮氧化物(NO₂)浓度制图案例研究显示,标准交叉验证由于采样偏倚会高估预测误差,而加权交叉验证则能产生更符合部署条件的估计。该框架将验证任务生成与风险估计分开,并为在样本分布与预测领域不同的空间预测设置中改进性能评估提供了实用方法。

英文摘要

Reliable estimation of predictive performance is essential for spatial environmental modeling, where machine-learning models are used to generate maps from unevenly distributed observations. Standard cross-validation (CV) assumes that validation data are representative of prediction conditions across the target domain. In practice, this assumption is often violated due to preferential or clustered sampling, leading to biased performance and uncertainty estimates. We introduce a deployment-oriented validation framework based on weighted CV that aligns validation tasks with the distribution of prediction tasks across a specified domain. The framework includes importance-weighted cross-validation (IWCV) and a calibration-based approach, Target-Weighted Cross-Validation (TWCV), which uses spatially meaningful task descriptors such as environmental covariates and prediction distance. Simulation experiments show that conventional non-spatial and spatial CV strategies can exhibit substantial bias under realistic sampling designs, whereas weighted CV approaches substantially reduce this bias when validation tasks adequately cover the deployment-task space. A case study on mapping nitrogen dioxide (NO$_2$) concentrations across Germany demonstrates that standard CV can overestimate prediction error due to sampling bias, while weighted CV yields estimates more consistent with deployment conditions. The framework separates validation task generation from risk estimation and provides a practical approach for improving performance assessment in spatial prediction settings where sample distributions differ from prediction domains.
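The weighting idea in this abstract can be illustrated with a self-normalized importance-weighted risk estimate. The sketch below assumes a single task descriptor (for example, prediction distance) and estimates weights as a ratio of binned frequencies between deployment tasks and validation tasks. The helper names, the binning approach, and the numbers are illustrative assumptions. This is not the paper's IWCV or TWCV procedure, whose weight estimation and calibration steps are not specified in the abstract.

```python
import numpy as np

def bin_ratio_weights(val_desc, target_desc, edges):
    """Weight each validation task by (target share) / (validation share) of its bin.

    Assumes every descriptor value lies within [edges[0], edges[-1]].
    """
    val_counts, _ = np.histogram(val_desc, bins=edges)
    tgt_counts, _ = np.histogram(target_desc, bins=edges)
    val_frac = val_counts / val_counts.sum()
    tgt_frac = tgt_counts / tgt_counts.sum()
    # Bin index of each validation task; clip so the right edge maps to the last bin.
    idx = np.clip(np.digitize(val_desc, edges) - 1, 0, len(edges) - 2)
    return tgt_frac[idx] / val_frac[idx]

def weighted_cv_risk(losses, weights):
    """Self-normalized weighted average of per-task validation losses."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * losses) / np.sum(weights))

# Hypothetical example: validation tasks cluster at short prediction distances,
# while deployment tasks are mostly at long distances.
val_desc = np.array([0.1, 0.2, 0.3, 0.9])
target_desc = np.array([0.1, 0.6, 0.7, 0.8])
losses = np.array([1.0, 1.0, 1.0, 5.0])
edges = np.array([0.0, 0.5, 1.0])

w = bin_ratio_weights(val_desc, target_desc, edges)
print("unweighted:", losses.mean())
print("weighted:  ", weighted_cv_risk(losses, w))
```

The weighted estimate shifts toward the losses observed under conditions that are common at deployment. As the abstract notes, this only helps when validation tasks adequately cover the deployment-task space; an empty validation bin would leave the corresponding deployment conditions unrepresented.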

2603.27355 2026-05-22 cs.AI cs.CL cs.SE

LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications

LLM Readiness Harness: 评估、可观测性和持续集成门禁用于LLM/RAG应用

Alexandre Cristovão Maiorano

Affiliations * Lumytics

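AI summary This paper presents a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow through automated benchmarks, OpenTelemetry observability, and CI quality gates. It computes scenario-weighted readiness scores with Pareto frontiers and reports results on ticket-routing workflows and BEIR grounding tasks.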

Comments 19 pages, 4 figures, 15 tables

详情
AI中文摘要

我们提出了一种用于LLM和RAG应用的准备性框架,将评估转化为部署决策流程。该系统结合了自动化基准测试、OpenTelemetry可观测性和持续集成质量门禁,通过最小的API合同聚合工作流程成功、政策合规性、 groundedness、检索命中率、成本和p95延迟,计算出场景加权的准备度分数。我们对票务路由工作流和BEIR接地任务(SciFact和FiQA)进行了评估,覆盖了完整的Azure矩阵(162/162有效单元跨数据集、场景、检索深度、种子和模型)。结果表明,准备度不是单一指标:在FiQA中,sla-first at k=5时,gpt-4.1-mini在准备度和忠实度上领先,而gpt-5.2则支付了显著的延迟成本;在SciFact中,模型质量接近但仍有操作区分。票务路由回归门禁持续拒绝不安全的提示变体,证明了该框架能够阻止风险发布,而不仅仅是报告离线分数。结果是一个可重复、操作基础的框架,用于决定LLM或RAG系统是否准备好发布。

英文摘要

We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.
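To make the aggregation step concrete, here is a minimal sketch of a scenario-weighted readiness score, a threshold gate, and a Pareto frontier over readiness and latency. It assumes each metric has already been normalized to [0, 1] with higher meaning better. The metric names, weights, threshold, and candidate labels are hypothetical and do not reproduce the harness's actual scoring rules or API.

```python
def readiness_score(metrics, weights):
    """Weighted average of normalized metrics (each in [0, 1], higher is better)."""
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total

def passes_gate(score, threshold):
    """CI-style gate: a release is blocked when the score falls below the threshold."""
    return score >= threshold

def pareto_frontier(candidates):
    """Keep candidates not dominated on (readiness: higher, latency: lower).

    candidates is a list of (name, readiness, latency) tuples.
    """
    frontier = []
    for name, score, latency in candidates:
        dominated = any(
            s2 >= score and l2 <= latency and (s2 > score or l2 < latency)
            for _, s2, l2 in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical scenario weights that favour workflow success over groundedness.
score = readiness_score(
    {"success": 1.0, "groundedness": 0.5},
    {"success": 3.0, "groundedness": 1.0},
)
print(score, passes_gate(score, threshold=0.8))

# Hypothetical candidates: (name, readiness, p95 latency in seconds).
print(pareto_frontier([("a", 0.9, 2.0), ("b", 0.8, 1.0), ("c", 0.7, 1.5)]))
```

Reporting a frontier rather than a single winner matches the abstract's point that readiness is not a single metric: a candidate with lower readiness can still be worth keeping if it is faster.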

2603.25958 2026-05-22 cs.LG

Cluster-Adaptive Feature Extraction and its Theoretical Foundation with Minkowski Weighted k-Means

基于Minkowski加权k均值的聚类自适应特征提取及其理论基础

Renato Cordeiro de Amorim, Vladimir Makarenkov

Affiliations * School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe, UK; Département d’informatique, Université du Québec à Montréal, C.P. 8888 succ. Centre-Ville, Montreal (QC) H3C 3P8 Canada; Mila - Quebec AI Institute, Montreal, QC, Canada

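AI summary This paper proposes Cluster-Adaptive Feature Extraction (CAFE), a method based on Minkowski weighted k-means. A theoretical analysis characterizes the structure of the feature weights and shows that the method suppresses high-dispersion features and amplifies informative ones.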

详情
AI中文摘要

Minkowski加权k均值(mwk-均值)算法通过引入特征权重和Minkowski距离扩展了经典k均值。我们首先证明,mwk-均值的目标函数可以表示为聚类内分散度的幂均值聚合,其中幂次由Minkowski指数p决定。这一表示揭示了p如何控制特征在选择性和均匀性之间的过渡。利用这种表示,我们推导了目标函数的界限,并刻画了特征权重的结构,证明其仅依赖于相对分散度,并遵循与分散比的幂律关系。这导致了对高分散特征抑制的显式保证,并建立了算法的收敛性。基于这些理论结果,我们引入了聚类自适应特征提取(CAFE),一种利用mwk-均值特征权重对数据进行预处理以进行无监督特征提取的方法。我们证明这种预处理反转了聚类内分散度的排序,抑制噪声特征并放大信息性特征。在受控的聚类内噪声环境下进行的大量实验表明,CAFE在传统特征提取方法的结果上始终表现出改进。

英文摘要

The Minkowski weighted $k$-means ($mwk$-means) algorithm extends classical $k$-means by incorporating feature weights and a Minkowski distance. We first show that the $mwk$-means objective can be expressed as a power-mean aggregation of within-cluster dispersions, with the order determined by the Minkowski exponent $p$. This formulation reveals how $p$ controls the transition between selective and uniform use of features. Using this representation, we derive bounds for the objective function and characterise the structure of the feature weights, showing that they depend only on relative dispersion and follow a power-law relationship with dispersion ratios. This leads to explicit guarantees on the suppression of high-dispersion features, and we establish convergence of the algorithm. Building on these theoretical results, we introduce Cluster-Adaptive Feature Extraction (CAFE), a method that uses the $mwk$-means feature weights to rescale the data prior to unsupervised feature extraction. We prove that this rescaling reverses the within-cluster dispersion ordering, suppressing noisy features and amplifying informative ones. Numerous experiments conducted under controlled within-cluster noise show that CAFE consistently improves the results of traditional feature extraction methods.
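The dependence of the feature weights on relative dispersion can be seen in the weight update commonly given for Minkowski weighted k-means in the earlier literature, $w_{kv} = 1 / \sum_{u} (D_{kv}/D_{ku})^{1/(p-1)}$, where $D_{kv}$ is the within-cluster dispersion of feature $v$ in cluster $k$. The sketch below implements that update under the assumption $p > 1$. The function name and the small `eps` guard are illustrative additions, the paper's own notation may differ, and the CAFE rescaling step is not shown.

```python
import numpy as np

def mwk_feature_weights(X, labels, centroids, p=2.0, eps=1e-12):
    """Per-cluster feature weights for Minkowski weighted k-means (requires p > 1)."""
    k, n_features = centroids.shape
    weights = np.zeros((k, n_features))
    for c in range(k):
        members = X[labels == c]
        # Within-cluster dispersion per feature: sum_i |x_iv - c_kv|^p.
        # eps avoids division by zero for constant features (illustrative guard).
        disp = (np.abs(members - centroids[c]) ** p).sum(axis=0) + eps
        # ratios[v, u] = (D_kv / D_ku)^(1 / (p - 1)); only dispersion ratios enter.
        ratios = (disp[:, None] / disp[None, :]) ** (1.0 / (p - 1.0))
        weights[c] = 1.0 / ratios.sum(axis=1)
    return weights

# One cluster, two features: feature 0 is tight, feature 1 is widely spread.
X = np.array([[0.0, 0.0], [2.0, 10.0]])
labels = np.array([0, 0])
centroids = np.array([[1.0, 5.0]])

w = mwk_feature_weights(X, labels, centroids, p=2.0)
print(w)
```

Within each cluster the weights sum to one, and the ratio of two weights is a power of the inverse dispersion ratio, so the high-dispersion feature receives the smaller weight.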

2603.20405 2026-05-22 cs.LG cs.CL cs.LO

Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP

使用 Opus 4.6 和 Rocq-MCP 的 2025 年 Putnam 问题

Guillaume Baudart, Marc Lelarge, Tristan Stérin, Jules Viennot

Affiliations * IRIF, Université Paris Cité, Inria, CNRS; DI ENS, PSL University, Inria

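AI summary This work reports an experiment in which Opus 4.6, equipped with Rocq-MCP tools built on the Model Context Protocol (MCP), autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. All proofs are publicly available.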

详情
AI中文摘要

我们报告了一项实验,其中配备有 Model Context Protocol (MCP) 工具的 Claude Opus~4.6,能够自主证明 2025 年 Putnam 数学竞赛中的 10 个问题。MCP 工具由 Claude 设计,通过分析先前在 miniF2F-Rocq 上的实验日志来编码一种“先编译,后交互回退”的策略。该代理在隔离的虚拟机上运行,无网络访问,部署了 141 个子代理,在 17.7 小时的活跃计算时间(51.6 小时墙钟时间)内消耗了约 190 亿个 token。所有证明均公开可用。

英文摘要

We report on an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a "compile-first, interactive-fallback" strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.

2603.18003 2026-05-22 cs.CV

Universal Skeleton Understanding via Differentiable Rendering and MLLMs

通过可微渲染和大语言模型实现通用骨架理解

Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu

Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, China; Tencent; Nanjing University of Aeronautics

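AI summary This paper proposes SkeletonLLM, which uses differentiable rendering to translate arbitrary skeleton sequences into the native visual modality of multimodal large language models, achieving universal skeleton understanding. A cooperative training strategy strengthens reasoning, and the model shows strong generalization in open-vocabulary action recognition while extending to motion captioning and question answering across heterogeneous skeleton formats.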

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言推理方面表现出色,但无法处理结构化非视觉数据如人体骨架。现有方法要么将骨架动力学压缩成有损特征向量以进行文本对齐,要么将运动量化为离散标记,但这些方法在异构骨架格式上泛化能力较差。我们提出了SkeletonLLM,通过将任意骨架序列转换为MLLM的本机视觉模态实现通用骨架理解。其核心是DrAction,一种可微、格式无关的渲染器,将骨骼运动学转换为紧凑的图像序列。由于整个流程是端到端可微的,MLLM的梯度可以直接引导渲染以生成任务相关信息的视觉标记。为进一步增强推理能力,我们引入了协同训练策略:因果推理蒸馏将结构化的逐步推理从教师模型转移过来,而判别微调则增强可混淆动作之间的决策边界。SkeletonLLM在开放词汇动作识别中表现出强泛化能力,其学习的推理能力自然扩展到异构骨架格式的运动描述和问答任务——表明了将MLLM应用于非本机模态的可行路径。代码:https://github.com/wangzy01/SkeletonLLM。

英文摘要

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet cannot process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization in open-vocabulary action recognition, while its learned reasoning capabilities naturally extend to motion captioning and question answering across heterogeneous skeleton formats -- suggesting a viable path for applying MLLMs to non-native modalities. Code: https://github.com/wangzy01/SkeletonLLM.

2603.16672 2026-05-22 cs.AI cs.CL cs.CY

CritiSense: Critical Digital Literacy and Resilience Against Misinformation

CritiSense: 关键数字素养与对抗虚假信息的韧性

Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali

Affiliations * Qatar Computing Research Institute; University of Padova; Hamad Bin Khalifa University

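AI summary This study presents CritiSense, a multilingual mobile media-literacy app that builds users' ability to recognize manipulation tactics through short, interactive challenges. It provides a multilingual prebunking platform and a testbed for measuring the effect of microlearning on misinformation resilience.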

Comments resilience, disinformation, misinformation, fake news, propaganda

详情
AI中文摘要

社交媒体上的虚假信息破坏了知情决策和公众信任。预警告(prebunking)通过帮助用户在遇到真实信息前识别操纵手法,提供了一种积极的补充方法。我们介绍了CritiSense,一个移动媒体素养应用,通过短而互动的挑战和即时反馈来培养这些技能。它是首个支持九种语言且模块化的平台,设计用于快速更新不同主题和领域。我们报告了93名用户的可用性研究:83.9%的用户表示总体满意,90.1%的用户认为该应用易于使用。定性反馈表明,CritiSense有助于提高数字素养技能。总体而言,它提供了一个多语言预警告平台和一个测试环境,用于衡量微学习对对抗虚假信息韧性的影响。在六个月中,我们已吸引了超过500名活跃用户。它在Apple App Store(https://apps.apple.com/us/app/critisense/id6749675792)和Google Play Store(https://play.google.com/store/apps/details?id=com.critisense&hl=en)上免费向所有用户提供。

英文摘要

Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 6 months, we have reached 500+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en).

2603.08403 2026-05-22 cs.CV

SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents

SPIRAL:通过反思规划代理实现自演化动作条件视频生成

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Liang Lv, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee

Affiliations * Zhejiang University; KnowledgeXLab at Shanghai AI Lab; National University of Singapore; Chinese Academy of Sciences; Tencent Youtu Lab; Nanyang Technological University; Wuhan University

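AI summary This paper proposes SPIRAL, a closed-loop framework that uses reflective planning agents for long-horizon action-conditioned video generation. It targets the incomplete action execution, hallucinated motions, and temporal drift of open-loop single-shot models, and its closed-loop design and self-evolution mechanism improve the consistency of generated videos.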

Comments 42 Pages, 21 Figures, Project page at https://yuyang-cloud.github.io/spiral

详情
AI中文摘要

长时域动作条件视频生成旨在合成符合复杂动作指令的时序一致视频,要求过程有序、持续执行动作和场景一致,超越传统TI2V的短时精度。现有单次视频生成模型通常采用开环方式,导致动作执行不完整、幻觉运动和时间漂移。为解决此问题,我们提出SPIRAL,一种闭环框架,通过顺序规划和迭代反思进行动作条件长时域视频生成。具体而言,SPIRAL实现一个思考-行动-反思过程:PlanAgent将高层目标分解为子动作,这些动作条件VideoGenerator生成每个片段并伴随记忆上下文,同时CriticAgent评估中间视频片段以提供迭代优化的反馈。此闭环设计进一步通过利用PlanAgent提出的行为和CriticAgent得出的奖励进行GRPO基于的后训练,以增强视频生成器的长时域一致性。此外,我们引入ActVideoGen-Dataset用于任务特定训练,并建立ActVideoGen-Bench作为专用评估套件,用于衡量动作质量和时间一致性。在多个TI2V后端和自演化策略下的实验显示,在ActVideoGen-Bench和VBench上均取得一致提升,证明了SPIRAL的有效性。

英文摘要

Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons, requiring procedural ordering, persistent action execution, and scene consistency beyond conventional TI2V's short-term fidelity. Existing single-shot video generation models typically operate in an open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs sequential planning and iterative reflection for action-conditioned long-horizon video generation. Specifically, SPIRAL instantiates a think-act-reflect process: a PlanAgent decomposes high-level goals into sub-actions, which condition a VideoGenerator to synthesize each segment alongside a memory context, while a CriticAgent evaluates intermediate video segments to provide corrective feedback for iterative refinement. This closed-loop design further supports self-evolution by utilizing PlanAgent-proposed actions and CriticAgent-derived rewards for GRPO-based post-training to enhance the video generator's long-horizon consistency. Moreover, we introduce ActVideoGen-Dataset for task-specific training, and establish ActVideoGen-Bench as a dedicated evaluation suite for measuring action quality and temporal coherence. Experiments across multiple TI2V backbones alongside the self-evolving strategy show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.

2603.03454 2026-05-22 cs.LG

[Re] FairDICE: A Fair Tradeoff in Multi-objective Offline RL

[Re] FairDICE:多目标离线RL中的公平权衡

Peter Adema, Karim Galliamov, Aleksey Evstratovskiy, Ross Geurts

Affiliations * University of Amsterdam

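AI summary This replication study examines FairDICE, a multi-objective offline reinforcement learning algorithm that automatically learns weights over objectives to reach a fair compromise. It finds that a code error reduces FairDICE to standard behaviour cloning in continuous environments and that many important hyperparameters were underspecified, so the experimental justification needs significant revision.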

Comments 12 pages, 8 figures in main text. Code at https://github.com/p-adema/re-fairdice. Reviewed at https://openreview.net/forum?id=Tr6MBt0hAj

Journal ref Published 05/2026 in Transactions on Machine Learning Research

详情
AI中文摘要

离线强化学习(RL)是RL领域的一个新兴分支,其中策略仅从演示中学习。在离线RL中,某些环境需要平衡多个目标,但现有的多目标离线RL算法未能提供有效的方法来找到公平的折中方案。FairDICE(见arXiv:2506.08062v2)通过将OptiDICE(一种离线RL算法)进行适应性修改,以自动学习多个目标的权重,例如激励目标间的公平性。由于这一贡献具有价值,本复制研究检验了关于FairDICE的可复制性声明。我们发现许多理论声明成立,但代码中的错误使FairDICE在连续环境中退化为标准行为克隆,并且许多重要的超参数最初未明确指定。在修正之后,我们通过扩展原始论文的实验表明,FairDICE可以扩展到复杂环境和高维奖励,尽管它在(在线)超参数调优上可能依赖性较强。我们得出结论,FairDICE是一种理论上有吸引力的方法,但实验验证需要显著修订。

英文摘要

Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g. incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

2603.02938 2026-05-22 cs.LG cs.AI

Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

超越一刀切:基于大语言模型的零样本图学习中的自适应子图去噪

Fengzhi Li, Liang Zhang, Yuan Zuo, Ruiqing Zhao, YanSong Liu, Yunfei Ma, Fanyu Meng, Junlan Feng

Affiliations * JIUTIAN Research; The Hong Kong University of Science and Technology (Guangzhou); MIIT Key Laboratory of Data and Decision Intelligence; Beihang University

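AI summary This paper proposes GraphSSR, a framework for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. It addresses the structural noise introduced by task-agnostic, one-size-fits-all subgraph extraction, with the aim of more accurate predictions from parsimonious, denoised subgraphs.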

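Details
AI abstract (translated from Chinese)

Graph-based tasks in the zero-shot setting still face significant challenges, owing to data scarcity and the inability of traditional graph neural networks (GNNs) to generalize to unseen domains or label spaces. Although recent work has moved toward using large language models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment problems. A recent paradigm (i.e., Graph-R1) overcomes this architectural dependency by adopting a purely text-based format and LLM-based graph reasoning, and shows improved zero-shot generalization. However, it uses a task-agnostic, one-size-fits-all subgraph extraction strategy that inevitably introduces significant structural noise (irrelevant neighbors and edges), which distorts the LLM's receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a new framework for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to the specific context through a "Sample-Select-Reason" process, allowing the model to filter out task-irrelevant neighbors on its own and overcome the one-size-fits-all problem. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. In addition, we propose SSR-RL, a two-stage reinforcement learning framework designed for adaptive subgraph denoising that explicitly regulates the sampling and selection operations in the SSR pipeline. By combining Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to reason over parsimonious, denoised subgraphs and reach accurate predictions.

English abstract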

Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise (irrelevant neighbors and edges) that distorts the LLMs' receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a "Sample-Select-Reason" process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.

2602.23231 2026-05-22 cs.CV

Skarimva: Skeleton-based Action Recognition is a Multi-view Application


Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

Affiliations: Institute for Software and Systems Engineering, University of Augsburg

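AI summary: This paper examines the importance of multi-view setups for skeleton-based action recognition. It shows that more accurate 3D skeleton data, obtained by triangulating multiple camera views, can significantly improve the performance of existing action recognition models. This indicates that input data quality is a key limiting factor, and that future research should treat multi-view applications as the standard setup.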

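Details
AI abstract (translated from Chinese)

Human action recognition plays an important role in developing intelligent interaction between humans and machines. Although much research aims to improve the machine learning algorithms used for skeleton-based action recognition, little attention has been paid to the quality of the input skeleton data itself. This work shows that using multiple camera views to triangulate more accurate 3D skeletons can significantly improve the performance of existing action recognition models. This indicates that the quality of the input data is currently the limiting factor for these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical scenarios, so future research on skeleton-based action recognition should adopt multi-view applications as the standard setup.

English abstract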

Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use cases, and that future research in skeleton-based action recognition should therefore consider multi-view applications as the standard setup.
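To make the triangulation idea in this abstract concrete, the following is a minimal sketch of how a single skeleton joint could be recovered from several camera views. It is not the authors' pipeline: the abstract does not describe their method, so this uses a generic least-squares ray intersection, and the function names (`triangulate_rays`, `solve3`, `det3`) and the camera positions are illustrative assumptions. Each camera contributes a ray from its center through the detected 2D joint, and the 3D estimate is the point closest to all rays.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the point is not determined")
    x = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        x.append(det3(m) / d)
    return x

def triangulate_rays(centers, directions):
    """Return the 3D point minimizing the summed squared distance to all rays.

    Ray i starts at centers[i] (hypothetical camera center) and points along
    directions[i] (back-projected 2D joint detection). The minimizer solves
    sum_i (I - u_i u_i^T) x = sum_i (I - u_i u_i^T) c_i, with u_i the unit direction.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = math.sqrt(sum(v * v for v in d))
        u = [v / norm for v in d]
        for r in range(3):
            for k in range(3):
                p = (1.0 if r == k else 0.0) - u[r] * u[k]
                a[r][k] += p
                b[r] += p * c[k]
    return solve3(a, b)

# Illustrative setup: three cameras observing one joint located at (1, 2, 3).
joint = (1.0, 2.0, 3.0)
cameras = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 1.0)]
rays = [tuple(j - c for j, c in zip(joint, cam)) for cam in cameras]
estimate = triangulate_rays(cameras, rays)
```

With noise-free rays the estimate coincides with the true joint; with noisy detections, adding views averages out per-camera error, which is one plausible reason multi-view input yields cleaner skeletons.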

2602.22719 2026-05-22 cs.LG

Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks


Vamshi Sunku Mohan, Kaustubh Gupta, Aneesha Das, Chandan Singh

Affiliations: Microsoft Research, Redmond; Independent Researcher

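AI summary: By identifying activation subspace bottlenecks in Mamba-family state-space models, this paper proposes a test-time intervention that steers the model by multiplying those activations by a scalar. The intervention improves performance across multiple models and benchmarks, and the paper also verifies that these bottlenecks hinder performance.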

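Details
AI abstract (translated from Chinese)

State-space models (SSMs) have emerged as an effective strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. Using tools from mechanistic interpretability, we identify activation subspace bottlenecks in the Mamba family of SSMs. We then introduce a test-time steering intervention that multiplies the activations of the identified bottlenecks by a scalar. Across 7 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27% without any task-specific tuning. Finally, we verify that the identified bottlenecks do hinder performance: modifying them yields an architecture called Stable-Mamba, which achieves improved long-context performance when retrained.

English abstract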

State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSMs using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 7 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.
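The steering intervention described in this abstract, multiplying the activations of an identified bottleneck by a scalar, can be illustrated with a small sketch. The abstract does not say how the bottleneck subspace is represented, so this assumes it is given as an orthonormal basis and rescales only the component of the activation that lies in that subspace. The function name `steer`, the basis, and the scale `alpha` are hypothetical and are not taken from the paper.

```python
def steer(activation, basis, alpha):
    """Scale the part of `activation` lying in span(basis) by `alpha`.

    Assumes `basis` is a list of orthonormal vectors (an assumption of this
    sketch). Computes h + (alpha - 1) * P h, where P projects onto the
    subspace, so the orthogonal complement of the subspace is left unchanged.
    """
    out = list(activation)
    for u in basis:
        # Coefficient of the activation along this basis direction.
        coeff = sum(h * ui for h, ui in zip(activation, u))
        for i, ui in enumerate(u):
            out[i] += (alpha - 1.0) * coeff * ui
    return out

# Illustrative case: the bottleneck is the axis-aligned subspace of channels 0 and 2.
hidden = [1.0, 2.0, 3.0, 4.0]
bottleneck = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
steered = steer(hidden, bottleneck, alpha=2.0)
```

In this axis-aligned case the operation reduces to multiplying the selected channels by `alpha`; setting `alpha` to 1 leaves the activation unchanged, which gives a simple sanity check for any hook that applies such a rescaling at test time.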