arXivDaily arXiv Daily Academic Digest, updated Monday through Friday
2606.20108 2026-06-19 cs.CV cs.LG New submission

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović

Affiliations Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria; Christian Doppler Lab for Artificial Intelligence in Retina, Medical University of Vienna, Austria

AI summary Proposes EFIQA, a framework that needs no quality labels: it uses anatomical priors, learning normal structure through masked anatomical inpainting, to generate spatial quality maps, and it outperforms supervised methods on multiple benchmarks while remaining explainable.

Comments Accepted in MIDL 2026. Code: https://github.com/penway/EFIQA

Journal ref Proceedings of Machine Learning Research 315:2248-2264, 2026

Abstract

Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning "what is degradation" from human-annotated labels, EFIQA learns "what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

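2606.20107 2026-06-19 cs.LG New submission

Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning

Asaf Cassel, Aviv Rosenberg

Affiliations Google Research

AI summary Proposes a quantile-based ensemble method that, without counts, achieves optimal variance-dependent regret bounds in finite-horizon MDPs, providing theoretical grounding for ensemble-based exploration in reinforcement learning.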

Abstract

Optimal Reinforcement Learning (RL) algorithms typically rely on carefully constructed count-based uncertainty estimates to drive exploration. Although theoretically sound, such estimates are hard to compute in practical settings and therefore offer limited insight for designing exploration heuristics. Meanwhile, ensembling has emerged as a practical approach, but remains without theoretical justification. Building on a recent ensemble-based method for Multi-Armed Bandits, we propose a quantile-based ensemble method for finite-horizon Markov Decision Processes (MDPs). Our simple count-free approach achieves optimal variance-dependent regret bounds, providing theoretical grounding for ensemble-based exploration in RL.

2606.20104 2026-06-19 cs.LG cs.AI New submission

Sensorimotor World Models: Perception for Action via Inverse Dynamics

Petr Ivashkov, Randall Balestriero, Bernhard Schölkopf

Affiliations Max Planck Institute for Intelligent Systems; Department of Computer Science, Brown University; ELLIS Institute; ETH Zürich

AI summary Proposes the sensorimotor world model (SMWM), a latent world model trained end-to-end with inverse dynamics regularization, which prevents representation collapse and learns compact, action-aligned representations, achieving competitive planning performance on 2D and 3D control tasks.

Abstract

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.
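
One way to write down the objective this abstract describes is sketched below. The notation is introduced here for illustration and is not taken from the paper: $e_\theta$ is the encoder, $f_\theta$ the latent predictor, $g_\theta$ the inverse dynamics head, $\lambda$ a weighting coefficient, and the squared-error form of both terms is an assumption.

```latex
\mathcal{L}(\theta)
  = \underbrace{\bigl\lVert f_\theta(z_t, a_t) - z_{t+1} \bigr\rVert^2}_{\text{latent prediction}}
  + \lambda \, \underbrace{\bigl\lVert g_\theta(z_t, z_{t+1}) - a_t \bigr\rVert^2}_{\text{inverse dynamics}},
  \qquad z_t = e_\theta(o_t)
```

Under this reading, a collapsed encoder that maps every observation $o_t$ to the same $z$ drives the prediction term to zero but leaves $g_\theta$ with no information about $a_t$, so the inverse dynamics term stays large. That is the sense in which a single regularizer can both prevent collapse and favor action-relevant features.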

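2606.20103 2026-06-19 cs.CV New submission

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

Affiliations Kyung Hee University

AI summary Addresses the scarcity of cross-modal features in LiDAR-camera calibration by preserving the metric geometry of the 3DGS proxy, using multi-view LiDAR depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters, which improves calibration accuracy.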

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

Abstract

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.

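2606.20102 2026-06-19 cs.CY cs.CR New submission

Artificial Intelligence as Game Changer in Cybersecurity: What We Learned in 2025-2026, and how this is relevant for Africa

Mikael Alemu Gorsky

AI summary Uses two events from 2025-2026 to argue that frontier language models have become a decisive instrument of cyber operations, while Africa is excluded from building, operating, and obtaining such models, faces a triple deficit in skills, compute, and investment, and is already subject to AI-enabled fraud; recommends responding within 6 to 12 months through threat-intelligence sharing, governance adoption, and partnership.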

Comments International Conference on Cybersecurity in the Era of Digital Transformation and Artificial Intelligence

Abstract

In 2025 and 2026, two events settled questions that had until then been speculative. In the first, a large language model executed the great majority of a state-aligned cyber-espionage campaign on its own, with human operators intervening at only a few decision points. In the second, the most capable cyber-relevant model was placed under a controlled-access program limited to a vetted set of United States technology firms, allied governments, and European standards bodies; that perimeter included no African government, operator, or university. Together the two events establish the argument of this paper: frontier language models have become a decisive instrument of cyber operations, and that instrument is built, owned, and rationed within a small circle from which Africa is absent. The paper documents Africa's exclusion on every count. The continent does not build frontier models, cannot yet operate them, and cannot, for now, obtain the most capable ones. The operational deficit is set out along three axes, skilled people, compute and electrical power, and investment, each measured against current figures; meanwhile AI-enabled fraud is already mounting against African mobile-money systems, the part of the digital economy the continent leads. Two constraints follow: the gating of frontier models by their developers, which no African decision can open, and a chosen dependence on infrastructure vendors now caught in geopolitical restriction. Because comparable but ungated models are forecast to spread within six to twelve months, the paper argues for a response that operates inside that window through threat-intelligence sharing, governance adoption, and partnership, undertaken by Africans on their own terms.

2606.20101 2026-06-19 cs.SD cs.AI cs.MM New submission

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

Affiliations Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Fisheries College, Ocean University of China; College of Information and Electrical Engineering, China Agricultural University

AI summary Proposes a hybrid two-stage diffusion transformer architecture that uses a coarse-to-fine strategy to balance global semantic alignment with local detail editing, improving performance and efficiency on tasks with overlapping audio events and complex instructions.

Abstract

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at the low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at the high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
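
The quadratic-cost argument in this abstract can be made concrete by counting query-key pairs. The sketch below is illustrative arithmetic only: the function names and the token counts are hypothetical, and it models a cross-attention block simply as audio queries attending to text keys, which may differ from the block design in the paper.

```python
def joint_attention_pairs(n_audio: int, n_text: int) -> int:
    """Query-key pairs when audio and text tokens attend jointly over their concatenation."""
    n = n_audio + n_text
    return n * n


def cross_attention_pairs(n_audio: int, n_text: int) -> int:
    """Query-key pairs when audio tokens (queries) attend only to text tokens (keys)."""
    return n_audio * n_text


if __name__ == "__main__":
    n_text = 64  # hypothetical instruction length in tokens
    # Hypothetical audio token counts for a low- and a high-resolution stage.
    for n_audio in (256, 1024):
        joint = joint_attention_pairs(n_audio, n_text)
        cross = cross_attention_pairs(n_audio, n_text)
        print(f"audio={n_audio}: joint={joint} pairs, cross={cross} pairs")
```

Joint attention grows with the square of the concatenated length, while this form of cross-attention grows linearly in the audio length for a fixed instruction, which is one reason to keep fully joint blocks where the token count is small.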

2606.20100 2026-06-19 cs.CV New submission

WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization

Qian Liang, Xiaomin Li, Ying Zhang, Jia Xu, Lihao Ni, Hongrui Li, Jingjing Li, Jing Lyu, Chen Li

Affiliations University of Electronic Science and Technology of China; Dalian University of Technology; Weixin, Tencent

AI summary Proposes the WeGenBench benchmark, comprising 4,000 bilingual Chinese-English prompts; scene classification and multi-dimensional tags enable cross-dimensional evaluation, and novel metrics based on vision-language models pinpoint model deficiencies in specific generation categories.

Abstract

Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.

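2606.20097 2026-06-19 cs.CL New submission

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen, Yue Wu, Jieping Ye

Affiliations Alibaba Group

AI summary Proposes the HydraHead architecture, which hybridizes full attention and linear attention along the head dimension; with interpretability-driven head selection and a scale-normalized fusion module, it outperforms other hybrid designs on long-context tasks and, trained on only 15B tokens, improves on the baseline by over 69% at 512K context length.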

Abstract

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

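2606.20096 2026-06-19 cs.CG q-bio.NC New submission

Quadratic Forms for Measuring Geometric Trees in 3-dimensional Space

Yossi Bokor Bleile, Emanuele Cortinovis, Herbert Edelsbrunner, Shota Uka

AI summary Proposes using quadratic forms to measure the directional spread of geometric trees, and introduces the hexplot model, with a metric based on the Fisher metric, for visualization and statistical analysis.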

Comments 16 pages, 6 figures

Abstract

Tree-like structures appear in many areas of science, and their shapes can help understand the underlying processes they drive or that give rise to them. By thinking of these structures as geometric graphs in $\mathbb{R}^3$, we gain access to tools from computational geometry and topology to study them. In this paper, we adopt the theory of quadratic forms to measure the directional spread of geometric graphs, and we introduce the hexplot model -- equipped with a metric derived from the Fisher metric on the standard triangle -- to visualize, measure, and collect statistics.
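
As a rough illustration of measuring directional spread with a quadratic form, the sketch below builds a symmetric 3x3 matrix from the edges of a geometric tree and evaluates it along chosen directions. The specific form (each edge contributes its unit direction's outer product weighted by edge length) and the names `spread_matrix` and `quadratic_form` are assumptions made for this example; the abstract does not give the paper's exact definition.

```python
import math


def spread_matrix(edges):
    """Return M = sum over edges of |e| * d d^T, where d is the unit direction of edge e.

    `edges` is a list of (p, q) pairs of 3D points. Since (q - p)(q - p)^T / |e|
    equals |e| * d d^T, the loop accumulates that expression directly.
    """
    m = [[0.0] * 3 for _ in range(3)]
    for p, q in edges:
        v = [q[i] - p[i] for i in range(3)]
        length = math.sqrt(sum(c * c for c in v))
        if length == 0.0:
            continue  # skip degenerate edges
        for i in range(3):
            for j in range(3):
                m[i][j] += v[i] * v[j] / length
    return m


def quadratic_form(m, u):
    """Evaluate u^T M u, the length-weighted spread of the edges along direction u."""
    return sum(u[i] * m[i][j] * u[j] for i in range(3) for j in range(3))


if __name__ == "__main__":
    # A toy tree in the xz-plane: a vertical trunk that splits into two branches.
    edges = [((0, 0, 0), (0, 0, 2)), ((0, 0, 2), (1, 0, 3)), ((0, 0, 2), (-1, 0, 3))]
    m = spread_matrix(edges)
    for name, u in (("x", (1, 0, 0)), ("y", (0, 1, 0)), ("z", (0, 0, 1))):
        print(name, quadratic_form(m, u))
```

With this weighting the trace of the matrix equals the total edge length, so the values along three orthogonal axes describe how that length is distributed across directions; the toy tree has no spread along y because it lies in a plane.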

2606.20095 2026-06-19 cs.CV New submission

Stitching and dimensionality effects on large artificially generated volume datasets

Lucas von Chamier, Jan Philipp Albrecht, Dagmar Kainmüller

Affiliations GFZ Helmholtz-Zentrum für Geoforschung; Max Delbrück Center for Molecular Medicine in the Helmholtz Association; Helmholtz Imaging; Humboldt-Universität zu Berlin; University of Potsdam

AI summary Studies how stitching artifacts from deep-learning generation of large images affect style transfer, comparing 2D and 3D models; finds that FID fails to detect subtle artifacts that affect downstream tasks, and that 3D models are marginally better but computationally costly.

Abstract

Generating large images via deep learning requires patching input data to accommodate hardware memory limitations, then assembling output patches, a process that can introduce stitching artifacts when neighboring patches do not align at borders. While these artifacts are known to affect segmentation tasks, their impact on generative models for style-transfer remains poorly understood. We investigated three stitching approaches and two patch dimensionalities (2D vs 3D) using cycleGAN models trained on cryo-electron microscopy datasets. We evaluated both perceptual quality and performance on downstream mitochondria segmentation. Our key findings reveal that: (1) FID scores fail to detect subtle stitching artifacts that significantly impact downstream segmentation performance, (2) 3D models with artifact-free stitching marginally outperform 2D models on downstream tasks, though the improvement barely justifies the computational cost, and (3) 2D models train more stably due to larger batch sizes. Additionally, we demonstrate that ensembling predictions from three orthogonal directions can improve low-quality volumes but provides no benefit for high-quality outputs. These results demonstrate that maximizing generative model performance on large scientific datasets requires careful consideration and mitigation of stitching artifacts, and that perceptual metrics alone are insufficient for evaluating domain adaptation quality in biomedical imaging.
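
The patch-then-assemble process described above can be sketched in one dimension. The abstract does not say which three stitching approaches were compared, so the example below shows only one common option, averaging overlapping patches, as an assumption; `split_into_patches` and `stitch_average` are hypothetical helper names, and the stride is assumed to be no larger than the patch size.

```python
def split_into_patches(signal, patch, stride):
    """Return (start, values) patches covering the signal.

    If the regular grid of starts does not reach the end, one extra patch is
    added that ends exactly at the boundary. Assumes stride <= patch.
    """
    starts = list(range(0, max(len(signal) - patch, 0) + 1, stride))
    if starts[-1] + patch < len(signal):
        starts.append(len(signal) - patch)
    return [(s, signal[s:s + patch]) for s in starts]


def stitch_average(patches, length):
    """Reassemble patches, averaging every position covered by more than one patch."""
    total = [0.0] * length
    count = [0] * length
    for start, values in patches:
        for k, v in enumerate(values):
            total[start + k] += v
            count[start + k] += 1
    return [t / c for t, c in zip(total, count)]


if __name__ == "__main__":
    signal = [float(i) for i in range(10)]
    patches = split_into_patches(signal, patch=4, stride=3)
    # A real pipeline would run a model on each patch here; this sketch passes them through.
    print(stitch_average(patches, len(signal)))
```

When a model's outputs disagree in the overlap, averaging smooths the seam instead of leaving a hard jump at the patch border, though it can blur detail; the abstract's point is that such seams can hurt downstream segmentation even when perceptual scores do not register them.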

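2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM New submission

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

Affiliations Amazon

AI summary Proposes MakeupMirror, a diffusion model that combines ControlNet geometry conditioning, region-specific transfer control, skin tone modulation, and a Langevin sampler to achieve high-quality makeup transfer while preserving facial features and skin tone; relative to Stable-Makeup, it improves facial recognition similarity by 60% and reduces skin tone difference by 50%.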

Abstract

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on the CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that, relative to Stable-Makeup, MakeupMirror improves facial recognition similarity by 60% and reduces skin tone difference by 50%, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

2606.20093 2026-06-19 cs.CL New submission

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

William Guey, Pierrick Bougault

Affiliations Department of Industrial Engineering, Tsinghua University

AI summary Uses the IFEval verifier to test self-preference in instruction-following revision across four mid-tier model families; the rate at which authors reject verified-correct edits does not differ significantly from that of fresh models, indicating that self-preference is weak or absent.

Comments 7 pages, 3 tables. Code and data: https://github.com/williamguey/self-preference-revision

Abstract

Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.
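
To show how a gap in percentage points and its 95% interval relate, the sketch below computes a Wald interval for the difference of two independent proportions. This is an assumption for illustration: the abstract does not state which interval method the authors used, the comparisons may be paired, and the counts in the example are hypothetical, not the paper's data.

```python
import math


def diff_ci(x1, n1, x2, n2, z=1.96):
    """Wald confidence interval for p1 - p2, where p1 = x1/n1 and p2 = x2/n2.

    Returns (difference, lower bound, upper bound) as fractions, not percentages.
    z = 1.96 corresponds to a two-sided 95% interval under a normal approximation.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical counts: 10 of 100 rejections by authors, 20 of 100 by fresh models.
    diff, lo, hi = diff_ci(10, 100, 20, 100)
    print(f"gap {100 * diff:+.1f} pp, 95% CI [{100 * lo:+.1f}, {100 * hi:+.1f}]")
```

An interval that contains zero, like the reported [-12.9, +2.7], is consistent with no difference between authors and fresh models, and its widest bound indicates the largest effect the sample cannot rule out.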

2606.20092 2026-06-19 cs.CV New submission

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

Affiliations University of Science and Technology of China; Shanghai AI Laboratory; Shanghai Jiao Tong University; Dalian University of Technology; Huawei Technologies Co., Ltd.; The University of Hong Kong; Tsinghua University; Peking University

AI summary Targets the memory bottleneck in long-horizon robotic manipulation with the EventVLA framework, whose dynamic Keyframe Evidence Memory module autonomously captures task-critical visual events; average success rate improves by 40% across 17 simulation tasks and 4 real-world tasks.

Abstract

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.

2606.20089 2026-06-19 cs.CL cs.AI New submission

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar

AI summary Proposes IHUBERT, a RoBERTa-base Persian pretrained model trained on a 45 GB corpus after multi-stage preprocessing (including vector-database-based semantic deduplication and domain balancing); it achieves leading results on several NLU tasks, with extractive question answering particularly strong.

Abstract

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.
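
The semantic deduplication step in the pipeline above can be sketched as a greedy filter over document embeddings. This is a minimal illustration under stated assumptions: the cosine measure, the 0.95 threshold, the toy two-dimensional vectors, and the function names are all hypothetical, and the linear scan stands in for the nearest-neighbour lookup that a vector database would provide.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_dedup(embeddings, threshold=0.95):
    """Greedily keep a document only if it is below `threshold` similarity to every kept one.

    Returns the indices of the kept documents, in their original order.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy embeddings: the second vector is nearly parallel to the first.
    docs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
    print(semantic_dedup(docs))
```

Unlike exact or near-duplicate removal on raw text, this kind of filter also drops paraphrases that embed close together; the threshold controls how aggressively such documents are removed.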

2606.20087 2026-06-19 cs.AI 新提交

Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing

基于多头注意力的特征提取器与软演员-评论家集成用于增材制造中的孔隙率预测和工艺参数优化

Kianoush Aqabakee, Leonardo Stella

发表机构 * Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic)(阿米尔卡比尔理工大学(德黑兰理工大学)电气工程系) Department of Mechanical Engineering, Amirkabir University of Technology (Tehran Polytechnic)(阿米尔卡比尔理工大学(德黑兰理工大学)机械工程系) School of Computer Science, University of Birmingham(伯明翰大学计算机科学学院)

AI总结 提出一种结合多头注意力机制与软演员-评论家算法的连续动作空间方法,用于增材制造孔隙率预测和参数优化,实现更快收敛和更高奖励。

详情
AI中文摘要

增材制造工艺优化需要精确的参数控制以最小化孔隙等缺陷。传统的使用离散动作空间的强化学习方法收敛慢且易陷入局部最优,限制了其在精密制造任务中的有效性。本研究通过采用连续动作空间并结合一种新颖架构——将多头注意力机制与软演员-评论家(SAC)算法集成,来解决这些局限性。基于注意力的特征提取器增强了智能体捕捉低维输入特征中细微变化的能力,从而在存在局部极小值的价值空间中实现更有效的探索-利用平衡。我们在激光粉末床熔融中的孔隙率预测和工艺参数优化上验证了该方法,与标准强化学习方法(包括DQN、PPO、TD3和原始SAC)相比,展示了更快的收敛速度和更高的最终奖励值。所提出的方法在14个回合内达到322.79的收敛值,在保持训练稳定性的同时优于现有方法。

英文摘要

Additive manufacturing process optimization requires precise parameter control to minimize defects such as porosity. Traditional reinforcement learning (RL) approaches using discrete action spaces suffer from slow convergence and susceptibility to local optima, limiting their effectiveness for high-precision manufacturing tasks. This study addresses these limitations by employing a continuous action space combined with a novel architecture that integrates a multi-head attention mechanism with the Soft Actor-Critic (SAC) algorithm. The attention-based feature extractor enhances the agent's ability to capture subtle variations in low-dimensional input features, enabling more effective exploration-exploitation balance for navigating value spaces with local minima. We validate our approach on porosity prediction and process parameter optimization in laser powder bed fusion, demonstrating faster convergence and higher final reward values compared to standard RL methods including DQN, PPO, TD3, and vanilla SAC. The proposed methodology achieves a convergence value of 322.79 within 14 episodes, outperforming existing approaches while maintaining stability throughout training.

2606.20084 2026-06-19 cs.AI 新提交

Residual-Space Evolutionary Optimization via Flow-based Generative Models

基于流生成模型的残差空间进化优化

Zhuo Cao, Lena Krieger, Fernanda Nader, Xuan Zhao, Hanno Scharr, Ira Assent

发表机构 * LMU Munich, Munich Center for Machine Learning (MCML), Germany(慕尼黑大学,慕尼黑机器学习中心(MCML),德国) Department of Computer Science, Aarhus University, Denmark(丹麦奥胡斯大学计算机科学系)

AI总结 提出残差空间进化优化框架,结合流生成编辑与进化算法,在残差空间分离局部利用与全局探索,用于非可微黑盒目标的数据编辑。

Comments Accepted by ICML 2026 Workshop SPIGM, 5 pages, 3 figures

详情
AI中文摘要

使用生成方法进行数据编辑通常需要可微目标和基于梯度的搜索。然而,这些假设在基于流的设置中不成立,其中编辑通过前向和反向积分执行,并且通常涉及不可微或黑盒目标。我们引入了残差空间进化优化,这是一个模型无关的框架,通过将基于流的生成编辑与进化算法相结合来解决这一差距。基于条件流匹配(CFM)可以将条件控制因素与实例特定残差分离的观察,我们的框架直接在残差空间中操作,并分离两个互补的搜索机制:自花授粉通过保留特征的残差细化进行局部利用,而交叉授粉通过跨异质样本重组残差促进更广泛的探索。作为概念验证,我们在MorphoMNIST(一个用于反事实生成的基准数据集)和晶体数据上进行了验证,表明这种探索-利用分解为平衡目标对齐、实例保留和多样性提供了有用的机制,并且可以扩展到图像之外的真实世界科学领域。

英文摘要

Data editing with generative methods typically requires differentiable objectives and gradient-based search. However, these assumptions break down in flow-based settings, where edits are performed through forward and backward integration and often involve non-differentiable or black-box objectives. We introduce residual-space evolutionary optimization, a model-agnostic framework that addresses this gap by combining flow-based generative editing with evolutionary algorithms. Building on the observation that conditional flow matching (CFM) can disentangle condition-controlled factors from instance-specific residuals, our framework directly operates in residual space and separates two complementary search regimes: self-pollination performs local exploitation through feature-preserving residual refinement, and cross-pollination promotes broader exploration by recombining residuals across heterogeneous samples. As a proof of concept, we validate on MorphoMNIST, a benchmark dataset for counterfactual generation, and on crystal data, demonstrating that this exploration--exploitation decomposition provides a useful mechanism for balancing target alignment, instance preservation, and diversity, and extends beyond images to real-world scientific domains.

2606.20083 2026-06-19 cs.CV 新提交

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Holo-World: 视频世界模型的统一相机、物体和天气控制

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

AI总结 提出Holo-World,一种从单张图像联合控制相机、物体运动和天气的统一视频世界模型,通过场景适配器和解耦CFG实现世界保持与天气迁移。

Comments Project Page: https://xiangchenyin.github.io/Holo-World Code: https://github.com/XiangchenYin/Holo-World

详情
AI中文摘要

视频世界模型正朝着在可控相机和物体运动下保持观察到的世界,同时允许其环境状态变化的方向发展。然而,这些控制仍然是孤立的,天气生成通常依赖于已经指定未来结构的源视频或重建场景。我们研究了一种基于第一帧锚定的源到状态设置,其中模型从单张图像开始,遵循明确的相机和物体控制以及可选的天气指令,然后生成一个视频,该视频要么保持源世界,要么将其转移到目标天气状态。为了解决这些挑战,我们首先构建了HoloStateData,一个状态视频数据集,将多样化的视频转换为用于相机、物体和天气监督的统一控制样本。其次,我们引入了Holo-World,一个统一的、可控制的视频世界模型,从单张图像联合控制场景。其统一场景适配器将世界保持和天气迁移分解为不同的参数子空间,使用渲染背景、几何缓冲区和物体控制来维持受控场景结构,同时建模依赖天气的外观和粒子效果。此外,场景-天气解耦CFG分别引导场景和天气残差,增强目标天气效果而不过度放大完整条件。定量和定性实验表明,Holo-World在保持精确的相机和物体控制以及一致场景结构的同时,将场景迁移到多样化的目标天气状态,在天气状态生成上优于视频到视频的天气编辑基线。我们的项目页面可在 https://xiangchenyin.github.io/Holo-World/ 获取。

英文摘要

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.

2606.20077 2026-06-19 cs.CV cs.AI 新提交

The Hidden Evolution of Disguised Visual Context inside the VLM

VLM内部伪装视觉上下文的隐藏演化

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

发表机构 * Surrey Institute for People-Centred AI, University of Surrey(萨里大学以人为本人工智能研究所) Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心)

AI总结 研究视觉语言模型中视觉令牌如何通过不同集成架构(上下文注入与逐层注入)转化为有意义表示,揭示其内部演化过程及对性能的影响。

详情
AI中文摘要

视觉令牌作为原始的外部信号进入大语言模型(LLM)。它们如何被转化为有意义的表示并与语言空间交互完全取决于集成架构——无论是将视觉令牌视为输入序列中的上下文提示,还是直接注入到LLM的中间层。对于这些架构选择如何影响视觉信息及其内部转换以与LLM集成,目前仍缺乏受控比较和理解。我们通过在相同训练条件下评估上下文注入和逐层注入的VLM集成范式,在单图像、多图像和视频基准上进行公平比较。在此过程中,我们揭示了一个隐藏的演化:视觉令牌作为伪装的视觉上下文(缺乏语言结构的原始表示)进入LLM,但根据集成范式逐渐被重塑,每种范式捕捉视觉信号的不同频率特征。我们表明,LLM内部的这种演化决定了VLM能够有效利用哪些视觉特征、视觉表示如何与语言空间对齐,以及最终每种范式在不同任务上的表现。我们进一步证明,仅关注注意力分配是不够的,性能由每一层视觉表示的质量驱动。

英文摘要

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. How these architectural choices affect visual information and its internal transformation as it integrates with the LLM remains underexplored, and controlled comparisons are lacking. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution in which visual tokens enter the LLM as disguised visual context (raw representations lacking linguistic structure) but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20076 2026-06-19 cs.CV cs.AI 新提交

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

基于可学习全局合并的可变长度分词用于扩散变换器

Dong Hoon Lee, Seunghoon Hong

发表机构 * Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea(韩国科学技术院金载哲人工智能研究生院,大田,韩国) School of Computing, KAIST, Daejeon, South Korea(韩国科学技术院计算学院,大田,韩国)

AI总结 针对固定压缩比限制扩散模型质量-计算权衡的问题,提出基于可学习全局合并的可变长度分词器,通过合并令牌实现跨长度表示对齐,在ImageNet 256×256生成中实现更优的gFID-计算权衡。

详情
AI中文摘要

潜在扩散模型(LDM)在视觉合成中占据主导地位,但其质量-计算权衡很大程度上受限于分词器的固定压缩比。可变长度分词器(VLT)通过改变令牌数量实现自适应压缩,使扩散模型能够灵活平衡质量和计算。然而,传统的VLT通过截断有序令牌序列来调节长度,这使得令牌语义依赖于令牌位置,并破坏了跨长度的表示对齐。这导致潜在分布出现跨长度偏移,阻碍单个可变长度扩散模型有效运行。为了解决这个问题,我们提出了一种新颖的可变长度分词器,通过合并令牌来调节长度。我们表明,当扩散变换器根据合并模式运行时,鼓励相似令牌合并可以实现直接的跨长度表示对齐。由于传统的合并方法是数据依赖的,使得生成过程中无法访问合并模式,我们引入了可学习的全局合并,它是数据独立的,以确保与扩散变换器的兼容性。在ImageNet 256×256生成中,我们的基于合并的可变长度分词器与扩散变换器集成,相比之前的VLT方法实现了更优的gFID-计算权衡。代码可在 https://github.com/movinghoon/lgm 获取。

英文摘要

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256×256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at https://github.com/movinghoon/lgm.

2606.20075 2026-06-19 cs.LG cs.CL 新提交

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

什么使得潜在思维链中的监督有效:一种信息论分析

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology(宁波数字孪生研究院,东方理工大学) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学系)

AI总结 本文从信息论角度分析潜在思维链中的监督失效问题,提出轨迹监督和空间监督两个维度,并引入统一潜在探针(ULP)量化信息保真度,揭示了信息-性能绑定关系。

详情
AI中文摘要

潜在思维链(Latent Chain-of-Thought, CoT)将推理内化到连续隐藏状态中,为冗长的离散推理轨迹提供了一种有前景的替代方案。然而,鲁棒的潜在推理仍然困难,因为结果监督提供的学习信号较弱,且容易导致潜在轨迹发生语义漂移。在这项工作中,我们从信息论角度分析潜在CoT,并将这种失效识别为双重崩溃:优化路径上的梯度衰减和潜在空间中的表征漂移。我们进一步将过程监督分解为两个互补维度:轨迹监督(注入密集的逐步推理信号)和空间监督(保持潜在流形的语义结构)。我们的分析表明,刚性几何压缩可能坍缩推理空间,而生成式重建提供了更灵活的语义锚点,更好地保留了信息容量。为了衡量这些效应,我们引入了统一潜在探针(Unified Latent Probe, ULP),用于量化潜在轨迹与显式推理步骤之间的互信息。实验揭示了清晰的信息-性能绑定关系:推理准确性取决于潜在链中保留的信息保真度。这些发现为潜在推理监督提供了一个原则性框架,并建议从几何模仿转向互信息最大化。我们的代码可在 https://github.com/EIT-NLP/Supervision-in-Latent-CoT 获取。

英文摘要

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at https://github.com/EIT-NLP/Supervision-in-Latent-CoT.

2606.20072 2026-06-19 cs.CL 新提交

Source-Grounded Data Generation for Text-to-JSON Learning

基于源数据的文本到JSON学习数据生成

Sunghee Ahn, Guijin Son, Youngjae Yu

发表机构 * Seoul National University(首尔大学)

AI总结 提出STAGE方法,利用电子表格作为源数据,通过LLM生成报告和JSON模式,并验证真实值,显著提升文本到JSON任务的训练数据质量。

Comments Preprint

详情
AI中文摘要

从财务文件到临床记录,传统行业严重依赖冗长、非结构化的文档来存储高价值信息。将这些信息可靠地提取为结构化的、机器可读的表示形式,是使自动化系统能够访问这些内容的关键前提。JSON是这种结构化提取的自然目标,然而构建可靠且可扩展的文本到JSON训练数据仍然具有挑战性。为了解决这一差距,我们提出了STAGE(电子表格基础的文本到JSON工件生成),一种基于源数据的数据生成管道,通过使用LLM进行可扩展合成,同时根据底层电子表格验证真实值,来构建报告和JSON模式。在STAGE-Eval(我们的基于源数据的基准测试,包含851个示例的测试集)上的评估表明,STAGE生成的训练数据优于现有方法。这使Qwen3-4B的精确匹配从31.37%提高到74.27%,值准确率从45.46%提高到90.69%。

英文摘要

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
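The abstract reports exact match and value accuracy without defining them. The sketch below shows one plausible way to compute both for a single prediction, assuming exact match means the parsed JSON equals the gold object and value accuracy is the share of gold leaf values reproduced at the same path. The helper names `flatten` and `score` are hypothetical and are not taken from STAGE.

```python
import json

def flatten(obj, prefix=""):
    """Map every leaf value to a dotted path, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    out = {}
    for key, value in items:
        out.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    return out

def score(pred_text, gold):
    """Score one model output (a string) against the gold JSON object."""
    try:
        pred = json.loads(pred_text)
    except json.JSONDecodeError:
        # Unparseable output counts as a complete miss (assumption).
        return {"exact_match": False, "value_accuracy": 0.0}
    gold_leaves, pred_leaves = flatten(gold), flatten(pred)
    hits = sum(
        1 for path, value in gold_leaves.items()
        if path in pred_leaves and pred_leaves[path] == value
    )
    return {
        "exact_match": pred == gold,
        "value_accuracy": hits / max(len(gold_leaves), 1),
    }

gold = {"revenue": 120, "items": [{"name": "A", "qty": 2}]}
result = score('{"revenue": 120, "items": [{"name": "A", "qty": 3}]}', gold)
```

Here two of the three gold leaves match, so the prediction fails exact match while still earning partial value accuracy. That is the kind of gap between the two reported numbers one would expect under these definitions.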

2606.20068 2026-06-19 cs.AI 新提交

Process-Verified Reinforcement Learning for Theorem Proving via Lean

基于Lean的过程验证强化学习用于定理证明

Minsu Kim, Se-Young Yun

发表机构 * KAIST AI(韩国科学技术院人工智能系)

AI总结 提出利用Lean证明助手提供过程级验证信号,结合GRPO风格强化学习目标,通过策略级监督提升定理证明性能。

详情
AI中文摘要

虽然基于可验证奖励的强化学习通常依赖于单一的二元验证信号,但形式推理中的符号证明助手提供了丰富、细粒度的结构化反馈。这种结构化过程与非结构化奖励之间的差距凸显了既密集又可靠的反馈的重要性。在这项工作中,我们证明Lean证明助手本身可以作为符号过程预言机,在训练期间提供结果级和细粒度的策略级验证反馈。证明尝试被解析为策略序列,Lean的细化标记出局部正确的步骤和最早失败的步骤,从而产生基于类型理论的密集、验证器基础的信用信号。我们将这些结构化奖励纳入GRPO风格的强化学习目标中,采用首次错误传播和首次令牌信用方法,平衡结果级和过程级优势。在STP-Lean和DeepSeek-Prover-V1.5上的实验表明,在大多数设置中,策略级监督优于仅结果基线,在MiniF2F和ProofNet等基准测试上取得了改进。除了经验上的提升,我们的研究还突出了一个更广阔的视角:符号证明助手不仅在评估时是验证器,而且在训练期间可以作为过程级奖励预言机。这为强化学习框架开辟了一条道路,该框架将语言模型的可扩展性与符号验证的可靠性相结合,用于形式推理。

英文摘要

While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
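To make the idea of tactic-level credit concrete, the sketch below turns per-tactic checker verdicts into dense rewards, with the earliest failing tactic taking the penalty and later tactics receiving no process credit. This is only an illustration of the general idea: the function name, the reward values, and the zero-credit rule after the first error are assumptions, not the paper's actual first-error propagation or first-token credit rules.

```python
def tactic_rewards(tactic_ok, proof_complete, step_bonus=0.1, fail_penalty=-1.0):
    """Assign a process-level reward to each tactic plus one outcome-level reward.

    tactic_ok[i] is True if the checker accepted tactic i (hypothetical input format).
    """
    rewards = []
    for ok in tactic_ok:
        if ok:
            rewards.append(step_bonus)      # locally sound step
        else:
            rewards.append(fail_penalty)    # earliest failing step takes the blame
            break
    # Tactics after the first error are never checked, so they get no process credit.
    rewards += [0.0] * (len(tactic_ok) - len(rewards))
    outcome = 1.0 if proof_complete else 0.0
    return rewards, outcome

process, outcome = tactic_rewards([True, True, False, True], proof_complete=False)
```

An outcome-only baseline would see just `outcome`; the per-tactic list is the extra signal a process-level objective could combine with it.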

2606.20065 2026-06-19 cs.IR cs.CL cs.CY 新提交

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

生成式引擎优化规模化:衡量AI搜索引擎中的品牌可见性

Pratyush Kumar

AI总结 本研究通过分析10万+提示响应,提出衡量AI搜索引擎中品牌可见性的方法,发现品牌成熟度形成三级阶梯,并识别出最受引用的内容格式和情感不稳定性。

Comments 14 pages, 4 tables; v1.0 preprint

详情
AI中文摘要

人们越来越多地从AI搜索引擎(如ChatGPT、Claude、Perplexity和Gemini)直接获取答案,而不是滚动浏览搜索结果。曾经专注于搜索引擎优化(SEO)的品牌现在必须优化这些引擎如何代表、引用和推荐它们——这一转变被称为生成式引擎优化(GEO)、答案引擎优化(AEO)和AI搜索可见性。我们将AEO和AI可见性视为GEO的一部分,并研究如何衡量AI引擎中的品牌可见性:它们在引用品牌时看重什么,依赖哪些来源,以及大型语言模型呈现什么内容。难点在于那些尚未成为权威顶级品牌的所有其他品牌——中小企业、D2C品牌、创作者和早期初创公司。我们分析了2026年3月至5月期间在Ranqo上追踪的100多个品牌的10万+提示响应。首次可见性运行形成了清晰的三级品牌地位阶梯:全球家喻户晓的品牌(如Stripe、Nike)在首次运行时出现在73%的相关AI答案中;成熟的中端市场和区域品牌(如Olipop、Klaviyo)出现在44%中;小众和小品牌仅出现在11%中——每级约30个百分点。当引擎引用来源时,约78%指向企业网站;在非企业来源中,YouTube领先,其次是Reddit、编辑媒体和维基百科。杠杆率最高的页面是排名“最佳”列表文章,是最常被引用的内容格式,约占所有引用的21%。情感是不稳定的信号:品牌被正面或负面描述的变化频率大约是品牌是否被提及的变化频率的6.7倍。这些发现为衡量GEO提供了首个大规模基线:AI品牌可见性是可测量的,因平台而异,并随品牌成熟度强烈变化。最后,我们提出了七个v1.1协议,以测试特定建议是否能因果性地提高AI可见性。

英文摘要

People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.
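The "about 30 percentage points per step" figure can be checked directly from the three first-run visibility rates quoted above. A minimal worked example (the dictionary keys are shorthand labels, not terms from the paper):

```python
# First-run visibility rates quoted in the abstract (percent of relevant AI answers).
tiers = {"household": 73, "mid_market": 44, "niche": 11}

rates = list(tiers.values())
# Gap between consecutive tiers, in percentage points.
steps = [upper - lower for upper, lower in zip(rates, rates[1:])]
mean_step = sum(steps) / len(steps)

print(steps)      # [29, 33]
print(mean_step)  # 31.0
```

The two gaps are 29 and 33 points, which is consistent with the rounded "about 30" in the abstract.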

2606.20064 2026-06-19 cs.HC 新提交

AI Conversational Interviewing: Scaling Up Semi-Structured and In-depth Interviews

AI对话式访谈:扩展半结构化与深度访谈的规模

Alexander Wuttke, Max Melchior Lang, Christopher Klamm, Quirin Würschinger, Frauke Kreuter

AI总结 本研究提出AI对话式访谈方法,通过语音、文本或自由选择模式大规模收集开放型意见数据,证明其能捕捉标准化调查遗漏的深层思考,且受访者评价不低于传统调查。

详情
AI中文摘要

舆论研究长期以来面临深度与规模之间的权衡:标准化调查能够进行大规模测量,但将受访者限制在研究者定义的类别中,掩盖了公众情绪背后多样化的意外考量。更具对话性的访谈通过开放式探究提供更丰富的见解,但其对训练有素的人类访谈者的依赖使其难以规模化。本研究引入AI对话式访谈作为一种大规模收集开放型舆论数据的方法,追求三个目标:展示对话文本数据对于封闭式问题无法触及的问题的分析价值;通过参与者自身的评估评估该方法的实际可行性;并通过实验比较语音、文本和自由选择访谈模式来指导实施。我们进行了一项研究,将AI主导的访谈与关于移民政策的标准化调查相结合,通过Prolific和Payback Panel招募了571名受访者。研究结果确立了AI对话式访谈作为社会科学工具包中可行且有价值的补充。对话记录揭示了标准化综合问卷无法捕捉的考量和推理,例如在态度水平相似的子群体中存在显著不同的移民心智模型。在完成访谈的受访者中,对AI访谈的评价在各模式下均达到或超过标准化调查,尽管完成率因条件而异。通过发布开放数据和开源流程材料,本研究为利用人工智能扩展舆论测量方法的日益增长的文献做出了贡献。

英文摘要

Public opinion research has long faced a trade-off between depth and scale: standardized surveys enable large-scale measurement but restrict respondents to researcher-defined categories, obscuring the diversity of unexpected considerations that underlie public sentiment. More conversational interviews provide richer insights through open-ended probing, but their reliance on trained human interviewers has kept them difficult to scale. This study introduces AI Conversational Interviewing as a method for collecting open-ended public opinion data at scale, pursuing three objectives: to demonstrate the analytical value of conversational text data for questions beyond the reach of closed-ended items; to assess the method's practical viability through participants' own evaluations; and to inform implementation by experimentally comparing voice-based, chat-based, and free-choice interview modes. We conducted a study combining an AI-led interview with a standardized survey on migration policy among 571 respondents recruited via Prolific and Payback Panel. The findings establish AI Conversational Interviewing as a viable and valuable addition to the social-science toolkit. The conversational transcripts surface considerations and reasoning that a comprehensive standardized battery does not capture, such as markedly different mental models of migration among subgroups with similar attitude levels. Among respondents who completed the interview, evaluations of the AI interview were at or above those of the standardized survey across modes, although completion itself varied by condition. By releasing open data and open-source pipeline materials, the study contributes to a growing literature on harnessing artificial intelligence to expand the methods of public opinion measurement.

2606.20058 2026-06-19 cs.AI 新提交

Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

面向企业级AI规模的自驱动事件驱动多智能体编排

Harsh Rao Dhanyamraju, Leonidas Raghav, Aaron Lee

AI总结 针对企业级AI中多智能体系统在规模扩展时性能下降的问题,提出任务管理器通过优先级推理、事件合并和抢占机制,在208个生产场景中验证其降低高优先级队列延迟14-75%,提升相关事件正确率超20个百分点。

详情
AI中文摘要

企业AI旨在朝着跨专业智能体的持续事件监控、检测和行动方向发展,然而现有的多智能体系统大多假设离散的请求-响应工作流,并且在企业规模下仍未得到充分探索。我们在208个源自生产的场景中评估了DAG Plan and Execute和ReAct,这些场景涵盖个人(少于10个智能体)、部门(20-80个)和企业(200个)规模,并引入了一个任务管理器,通过优先级推理、相关事件合并和抢占实现持续运行。结果表明,规模而非任务复杂性主导了编排性能:两种架构在小规模下表现良好,但在企业规模下性能下降,因为智能体发现噪声成为主要瓶颈,简单任务的下降幅度比复杂任务更严重。DAG Plan and Execute在较小规模下提供更高的精度和结构化并行化,但其较高的开销在企业规模下恶化;ReAct通过增量处理故障而更具鲁棒性。任务管理器将高优先级队列延迟降低了14-75%,并在企业规模下将相关事件正确性提高了超过20个百分点。

英文摘要

Enterprise AI aims to move toward continuous event monitoring, detection, and action across specialist agents, yet existing multi-agent systems largely assume discrete request-response workflows and remain underexplored at enterprise scale. We evaluate DAG Plan and Execute and ReAct across 208 production-derived enterprise scenarios spanning Persona (<10 agents), Department (20-80), and Enterprise (200) scales, and introduce a Task Manager for continuous operation via priority inference, related-event merging, and preemption. Results show that scale, not task complexity, dominates orchestration performance: both architectures perform well at small scale but degrade at enterprise scale as agent discovery noise becomes the primary bottleneck, with simple tasks degrading more sharply than complex ones. DAG Plan and Execute offers higher precision and structured parallelization at smaller scales, but its higher overhead worsens at enterprise scale; ReAct is more robust by handling failures incrementally. The Task Manager reduces high-priority queue latency by 14-75% and improves related-event correctness by over 20 percentage points at enterprise scale.
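The abstract names three Task Manager mechanisms: priority inference, related-event merging, and preemption. The sketch below illustrates the latter two with a priority queue. It assumes priorities are already inferred, that a lower number means more urgent, and that events sharing a key count as related; the class and method names are hypothetical and do not describe the paper's implementation.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class Task:
    priority: int                       # lower number = more urgent (assumption)
    seq: int                            # tie-breaker: arrival order
    key: str = field(compare=False)     # events sharing a key are treated as related
    payloads: list = field(default_factory=list, compare=False)

class TaskManager:
    def __init__(self):
        self._heap = []
        self._pending = {}              # key -> queued Task, used for merging
        self._seq = 0
        self.running: Optional[Task] = None

    def _enqueue(self, task):
        queued = self._pending.get(task.key)
        if queued is None:
            self._pending[task.key] = task
            heapq.heappush(self._heap, task)
        else:                           # merge related work into the queued task
            queued.payloads.extend(task.payloads)
            if task.priority < queued.priority:
                queued.priority = task.priority
                heapq.heapify(self._heap)   # restore heap order after the change

    def submit(self, key, priority, payload):
        """Queue an event; return True if it is more urgent than the running task."""
        self._enqueue(Task(priority, self._seq, key, [payload]))
        self._seq += 1
        return self.running is not None and priority < self.running.priority

    def dispatch(self):
        """Start the most urgent queued task; a still-running task is requeued."""
        if self.running is not None:
            self._enqueue(self.running)
        self.running = heapq.heappop(self._heap) if self._heap else None
        if self.running is not None:
            del self._pending[self.running.key]
        return self.running

    def finish(self):
        self.running = None

tm = TaskManager()
tm.submit("disk-alert", priority=5, payload="disk 80%")
tm.submit("disk-alert", priority=5, payload="disk 90%")    # merged with the first
first = tm.dispatch()
must_preempt = tm.submit("outage", priority=1, payload="db down")
second = tm.dispatch()                                     # disk-alert is requeued
```

Merging keeps one queue entry per related group, so a burst of duplicate events does not delay unrelated high-priority work. That is one way the queue-latency reduction described above could arise.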

2606.20056 2026-06-19 cs.RO 新提交

VFILC: Accurate Frequency Extrapolations in Imitation Learning via Sampling Frequency ILC

VFILC: 通过采样频率迭代学习控制实现模仿学习中的精确频率外推

Nozomu Masuya, Toshiaki Tsuji, Sho Sakaino

发表机构 * Grad. School of Science Technology, University of Tsukuba, Tsukuba, Japan; Engineering, Saitama University, Saitama, Japan; Information Engineering, University of Tsukuba, Tsukuba, Japan

AI总结 提出VFILC方法,结合可变频率模仿学习与前馈-反馈迭代学习控制,在三种任务中实现精确的速度外推,频率误差降低最高81%。

Comments 8 pages, 17 figures. Accepted at IROS 2026

详情
AI中文摘要

传统的基于神经网络(NN)的变速度运动模仿学习方法要么局限于内插速度,要么在外推超出训练速度范围时产生不可预测的运动。可变频率模仿学习(VFIL)通过将NN模型的采样频率与运动频率相关联,实现了速度的外推,但其开环配置导致频率误差,特别是在外推的高频设置中。本研究提出了基于VFIL和迭代学习控制(ILC)的可变频率模仿学习与迭代学习控制(VFILC),包含前馈和反馈两部分,前者利用VFIL的优势,后者调整频率误差。实验结果表明,所提方法成功且精确地外推了运动速度,并在所有三个任务中减少了频率误差;特别是在以训练数据中平均速度的两倍进行外推时,与简单前馈VFIL相比,反馈在擦拭任务中将频率误差显著降低了81%,在摇晃任务中降低了50%。即使在受复杂摩擦特性影响的接触密集混合任务的内插频率下,所提方法相比VFIL也将精度提高了27%。

英文摘要

Conventional neural network (NN)-based imitation learning methods for variable-speed motion either restricted their scope to interpolated speeds, or generated unpredictable motions when extrapolating beyond trained velocity ranges. Variable-frequency imitation learning (VFIL) enabled extrapolations of speeds by linking the NN model's sampling frequency to the motion frequency, whereas its open-loop configuration caused frequency errors, especially in the extrapolated high-frequency settings. This study proposes variable-frequency imitation learning with iterative learning control (VFILC) based on a combination of VFIL and iterative learning control (ILC) with both feedforward and feedback parts, the former taking advantage of VFIL and the latter adjusting the frequency errors. The experimental results showed that the proposed method successfully and accurately extrapolated motion speeds and reduced frequency errors in all three tasks, and that the feedback especially reduced the frequency errors by a remarkable 81% in the wiping task and 50% in the shaking task, both compared to simple feedforward VFIL, when extrapolating at double the average speed in the training data. The proposed method also improved accuracy by 27% compared with VFIL even at an interpolated frequency for a contact-rich mixing task affected by complex friction traits.
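The split between a feedforward part and a feedback part that corrects frequency errors can be illustrated with a generic proportional-type iterative learning update on a scalar frequency command. The abstract does not give the actual VFILC update law, so the rule below, the gain, and the toy plant (motion that comes out 20% slower than commanded) are all assumptions.

```python
def run_ilc(f_target, plant, gain=0.5, trials=10):
    """Refine a frequency command over repeated trials using the observed error."""
    command = f_target                # feedforward part: start from the desired frequency
    errors = []
    for _ in range(trials):
        achieved = plant(command)     # frequency actually realised in this trial
        error = f_target - achieved
        errors.append(error)
        command += gain * error       # feedback part: correct the command between trials
    return command, errors

# Toy plant (assumption): the executed motion is 20% slower than commanded.
command, errors = run_ilc(2.0, lambda f: 0.8 * f)
```

With this plant the error shrinks by a factor of 0.6 per trial, and the command approaches 2.5, the value at which the achieved frequency equals the 2.0 target. An open-loop scheme would stay at the initial error.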

2606.20055 2026-06-19 cs.LG 新提交

PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection

PaAno+:用于时间序列异常检测的多尺度编码与跨变量注意力

Youji Zhu, Hongbing Wang, Wenchao Liu, Xiaodong Liu, Xiangguang Xiong

发表机构 * School of Mathematical Sciences, Guizhou Normal University(贵州师范大学数学科学学院) School of Big Data and Computer Science, Guizhou Normal University(贵州师范大学大数据与计算机科学学院)

AI总结 提出PaAno+模型,通过多尺度特征提取、跨变量融合注意力和补丁窗口排序预任务,实现轻量高效的时间序列异常检测,在TSB-AD基准上达到SOTA。

详情
AI中文摘要

时间序列异常检测在工业和医疗监测等关键领域具有重要的实用价值。当前基于Transformer和大模型的检测方法计算开销过大,而现有的轻量级替代方案受限于特征提取不足以及多变量间依赖关系建模不充分。为缓解上述缺陷,本研究在面向补丁的表征学习范式下,开发了一种轻量高效的异常检测模型PaAno+。在编码器模块中,使用具有差异化感受野的卷积核构建多尺度特征提取主干,以捕获层次化时间特征;随后通过跨尺度自适应注意力聚合结合残差连接优化,进一步稳定特征表征学习。嵌入跨变量融合注意力模块以显式表征变量间相关性,使模型能够在复杂运行条件下识别异常模式。此外,定制了一种基于时间补丁窗口排序的新型前置任务,以揭示时间序列的内在结构特性,并利用三元组损失优化补丁嵌入空间以增强特征判别性。在TSB-AD基准上的大量实验表明,所提出的PaAno+在单变量和多变量任务上均实现了最先进的检测精度,在包括VUS-PR在内的评估指标上相对于原始PaAno取得了显著性能提升。凭借紧凑的网络设计,该模型实现了良好的计算效率,能够在资源受限的终端上部署用于实时异常推理。

英文摘要

Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. Current Transformer- and large-model-based detection approaches incur excessive computational overhead, while existing lightweight alternatives are constrained by insufficient feature extraction and inadequate modeling of dependencies across variables. To mitigate these drawbacks, this study develops a lightweight, efficient anomaly detection model, dubbed PaAno+, within the patch-oriented representation learning paradigm. In the encoder module, a multiscale feature-extraction backbone is constructed using convolutional kernels with differentiated receptive fields to capture hierarchical temporal characteristics; subsequent cross-scale adaptive attention aggregation, combined with residual connection optimization, further stabilizes feature representation learning. A cross-variable fusion attention module is embedded to explicitly characterize inter-variable correlations, empowering the model to identify anomalous patterns amid intricate operational conditions. Moreover, a novel pretext task based on temporal patch-window sorting is customized to uncover intrinsic structural properties of time series, and triplet loss is leveraged to optimize the patch embedding space for enhanced feature discrimination. Extensive experiments on the TSB-AD benchmark demonstrate that the proposed PaAno+ achieves state-of-the-art detection accuracy on both univariate and multivariate tasks, yielding significant performance gains across evaluation metrics, including VUS-PR, relative to the original PaAno. Leveraging a compact network design, the presented model achieves favorable computational efficiency, enabling deployment on resource-limited terminals for real-time anomaly inference.
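The triplet loss mentioned above has a standard margin-based form, sketched here on plain Python vectors. How the model chooses anchor, positive, and negative patches is not stated in the abstract, so the example vectors and the margin value are illustrative assumptions.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin triplet loss using Euclidean distance between embeddings."""
    d_pos = math.dist(anchor, positive)   # should be small for similar patches
    d_neg = math.dist(anchor, negative)   # should be large for dissimilar patches
    return max(0.0, d_pos - d_neg + margin)

# Well-separated triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Violating triplet: the "positive" is farther than the "negative".
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training effort concentrates on triplets that still violate that ordering.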

2606.20053 2026-06-19 cs.LG 新提交

Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States

用于电池内部状态自回归预测的神经代理架构比较研究

Gihyun Lee, Thorben Menne, Simon Olma, Jakob Hilgert, Sangyoung Park

AI总结 系统比较四种神经网络架构(MLP、ResNet、U-Net、FNO)作为自回归状态转移算子,预测锂离子电池DFN模型内部状态,发现U-Net因多尺度空间归纳偏置在精度和速度上最优。

Comments 8 pages, 5 figures

详情
AI中文摘要

Doyle-Fuller-Newman (DFN) 模型以高保真度解析锂离子电池的内部电化学状态。然而,其控制方程的数值求解对于实时部署而言计算成本过高,限制了从单个电池到电池组及车队规模应用的可扩展性。虽然机器学习代理可以通过GPU加速大幅降低推理延迟,但现有大多数方法学习的是特定操作条件下的解近似,而非可泛化的状态演化动力学。本文系统比较了四种神经网络架构(MLP、ResNet、U-Net、FNO),它们被构建为自回归状态转移算子,可预测广泛操作条件下的完整DFN内部状态。为确保受控的架构比较,所有模型在统一框架下训练,采用多步展开和电流条件化,隔离了空间归纳偏置的影响。结果表明,U-Net的多尺度特征层次在300步自回归展开后,所有内部状态变量的平均最终步nRMSE达到3%,同时相比数值求解器实现了5.38倍的加速。这些发现强调了空间归纳偏置是代理性能的关键决定因素,推动了用于下一代电池管理系统和数字孪生的内部状态可观测性代理的发展。

英文摘要

The Doyle-Fuller-Newman (DFN) model resolves internal electrochemical states in lithium-ion batteries with high fidelity. However, the numerical solution of its governing equations is computationally prohibitive for real-time deployment, limiting scalability from individual cells to pack and fleet-scale applications. While machine learning surrogates can substantially reduce inference latency through GPU acceleration, most existing approaches learn solution approximations tied to specific operating conditions rather than learning generalizable state-evolution dynamics. This work presents a systematic comparison of four neural network architectures (MLP, ResNet, U-Net, FNO) formulated as autoregressive state-transition operators that predict full DFN internal states across a wide range of operating conditions. To ensure a controlled architectural comparison, all models are trained under a unified framework using multi-step unrolling and current-conditioning, isolating the impact of spatial inductive bias. Results demonstrate that the U-Net's multi-scale feature hierarchy achieves a mean final-step nRMSE of 3% averaged across all internal state variables after 300-step autoregressive rollouts, while providing a 5.38x speed-up over the numerical solver. These findings highlight spatial inductive bias as a critical determinant of surrogate performance, advancing the development of surrogates for internal state observability for next-generation battery management systems and digital twins.
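An autoregressive state-transition surrogate is applied by feeding each predicted state back in as the next input, and the final-step nRMSE then measures how far the rollout has drifted. The sketch below uses toy linear step functions in place of a trained network and a DFN solver, and normalises RMSE by the range of the reference values. Both choices are assumptions, since the abstract does not specify the normalisation.

```python
import math

def rollout(step_fn, state, currents):
    """Apply a one-step transition model repeatedly, feeding each output back in."""
    trajectory = []
    for current in currents:
        state = step_fn(state, current)
        trajectory.append(state)
    return trajectory

def nrmse(pred, ref):
    """RMSE normalised by the range of the reference values (one common convention)."""
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))
    return rmse / (max(ref) - min(ref))

# Toy stand-ins (assumptions): a "solver" step and a surrogate with 1% per-step error.
def true_step(state, current):
    return [x - 0.001 * current for x in state]

def surrogate_step(state, current):
    return [x - 0.00101 * current for x in state]

start = [1.0, 0.9, 0.8]
currents = [1.0] * 300                      # 300-step rollout, current-conditioned
reference = rollout(true_step, start, currents)
predicted = rollout(surrogate_step, start, currents)
final_nrmse = nrmse(predicted[-1], reference[-1])
```

In this toy setup the small per-step error accumulates to a final-step nRMSE of roughly 0.015. Multi-step unrolled training aims to keep this kind of compounding drift under control.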

2606.20048 2026-06-19 cs.RO New submission

MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs

MirrorDuo:基于镜像演示对的反射一致视觉运动学习

Zheyu Zhuang, Ruiyu Wang, Giovanni Luca Marchetti, Florian T. Pokorny, Danica Kragic

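AI summary Proposes MirrorDuo, which uses reflection consistency to generate a mirrored copy of each original demonstration as data augmentation, significantly improving behaviour cloning performance under the same data budget and supporting zero- and few-shot skill transfer.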

Comments Published in CoRL 2025

Journal ref CoRL 2025

Details
AI Chinese abstract

基于图像的行为克隆利用从无处不在的RGB相机捕获的演示。然而,它仍然受到收集多样化演示成本的限制,特别是在工作空间变化中泛化。我们提出MirrorDuo,一种基于反射的公式,操作于图像、本体感受和完整的6自由度末端执行器动作元组,为每个原始演示生成镜像对应物,有效实现“收集一个,免费获得一个”。它可以作为现有学习管道(如标准行为克隆或扩散策略)的数据增强策略,或作为反射等变策略网络的结构先验。通过利用原始域和镜像域之间的重叠,当演示均匀分布在工作空间两侧时,MirrorDuo在相同数据预算下实现了显著改进的性能。当演示仅限于一侧时,MirrorDuo能够在目标布局中仅使用零或五个演示实现向镜像工作空间的高效技能迁移。

English abstract

Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or five demos in the target arrangement.
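
The idea of mirroring an (image, proprioception, action) tuple can be sketched as follows. This is not the authors' implementation: it assumes the mirror plane is y = 0 in the robot base frame, that the camera view is symmetric about that plane so a horizontal image flip is the matching image transform, and that rotations are given as 3x3 matrices. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]

# Reflection M = diag(1, -1, 1) across the plane y = 0 (assumed mirror plane).
SIGNS = (1.0, -1.0, 1.0)


def mirror_image(image: Sequence[Sequence[int]]) -> List[List[int]]:
    """Flip every pixel row left-to-right."""
    return [list(row[::-1]) for row in image]


def mirror_vector(v: Vec3) -> Vec3:
    """Reflect a position or translation: (x, y, z) -> (x, -y, z)."""
    return (SIGNS[0] * v[0], SIGNS[1] * v[1], SIGNS[2] * v[2])


def mirror_rotation(rot: Mat3) -> Mat3:
    """Conjugate by the reflection, M R M, so the result is still a proper rotation."""
    return [[SIGNS[i] * rot[i][j] * SIGNS[j] for j in range(3)] for i in range(3)]


def mirror_sample(image, ee_pos: Vec3, ee_rot: Mat3, act_trans: Vec3, act_rot: Mat3):
    """Build the mirrored counterpart of one demonstration step."""
    return (
        mirror_image(image),
        mirror_vector(ee_pos),
        mirror_rotation(ee_rot),
        mirror_vector(act_trans),
        mirror_rotation(act_rot),
    )


# A +90 degree rotation about z becomes a -90 degree rotation about z when mirrored.
rz_plus_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(mirror_rotation(rz_plus_90))
```

Conjugating the rotation rather than multiplying by the reflection once keeps the determinant at +1, so the mirrored pose remains a valid end-effector orientation.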

2606.20047 2026-06-19 cs.IR New submission

PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

PACMS: 作为LLM代理可插拔引擎的子模上下文选择

Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami

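AI summary Proposes PACMS, a context selection method based on submodular function maximization that, at prompt assembly time, picks content by relevance from the session, memory, and tool outputs, replacing truncation and improving information retention in long conversations.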

Details
AI Chinese abstract

对话和工具使用的LLM代理在上下文窗口中操作,该窗口同时从多个方向填充。随着会话进行,代理积累用户和助手轮次、从持久记忆存储中提取的条目,以及通常最大的工具调用输出(如文件读取、搜索结果和API响应)。一旦累积上下文超过模型的令牌预算,框架必须决定保留什么。当前机制是最近截断,有时辅以定期摘要。这是主题盲目的:会话早期建立的事实仅仅因为陈旧而被丢弃,即使当前用户查询正是关于该事实;相反,冗长但无关的近期材料被保留。必须在多轮中回忆信息的代理(记忆的定义案例)正是最近截断失败的地方。现有替代方案位于代理组装步骤之外。检索增强生成将外部文档提取到提示中,但不仲裁代理的“已存在”池化上下文。上下文压缩方法通过重写或修剪文本来减少令牌计数,但以查询盲目和有损的方式操作。两者都不将记忆条目、对话轮次和工具输出视为一个单一的候选池,在提示组装时按相关性进行选择。

English abstract

Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and, often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval-augmented generation fetches external documents into the prompt but does not arbitrate the agent's already-present pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.
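
A minimal sketch of what selecting from a single pooled candidate set under a token budget could look like is given below. It is not PACMS itself: the abstract does not state the objective, so this example assumes a simple query-term coverage function (which is monotone submodular) and a greedy rule that picks the best marginal gain per token. The `Candidate` type, the whitespace tokenization, and the caller-supplied token counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Candidate:
    source: str   # "turn", "memory", or "tool"
    text: str
    tokens: int   # token cost, supplied by the caller in this sketch


def terms(text: str) -> Set[str]:
    """Crude whitespace tokenization (an assumption, not a real tokenizer)."""
    return set(text.lower().split())


def select_context(pool: List[Candidate], query: str, budget: int) -> List[Candidate]:
    """Greedy budgeted selection maximizing coverage of the query's terms."""
    query_terms = terms(query)
    covered: Set[str] = set()
    selected: List[Candidate] = []
    remaining = list(pool)
    used = 0
    while True:
        best, best_ratio = None, 0.0
        for cand in remaining:
            if used + cand.tokens > budget:
                continue  # would exceed the token budget
            # Marginal gain: query terms this candidate newly covers.
            gain = len((terms(cand.text) & query_terms) - covered)
            ratio = gain / max(cand.tokens, 1)
            if ratio > best_ratio:
                best, best_ratio = cand, ratio
        if best is None:
            break  # nothing affordable adds coverage
        selected.append(best)
        covered |= terms(best.text) & query_terms
        used += best.tokens
        remaining.remove(best)
    return selected


pool = [
    Candidate("turn", "the deploy region is eu-west-1", 6),   # old but relevant
    Candidate("tool", "log line " * 50, 100),                 # recent, verbose, irrelevant
    Candidate("memory", "user prefers terse answers", 4),
]
chosen = select_context(pool, "which region do we deploy to", budget=10)
print([c.source for c in chosen])
```

Unlike recency truncation, the choice here depends only on relevance to the current query and on token cost, so an early turn can be kept while a large recent tool output is dropped.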