arXivDaily: arXiv daily academic digest, updated Monday through Friday
CS (Computer Science) 2212
2606.09828 2026-06-09 cs.CV New submission

Latent Spatial Memory for Video World Models

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, Bohan Zhuang

Affiliations: Zhejiang University, Microsoft Research, Adelaide University, Monash University

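AI summary: Proposes Mirage, a latent spatial memory framework that builds and queries a 3D cache directly in the diffusion latent space, avoiding pixel-space reconstruction. It reports up to 10.57× faster video generation and a 55× smaller memory footprint.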

Comments
Project Page: https://aka.ms/latent-spatial-memory, Code: https://github.com/microsoft/LatentSpatialMemory
Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
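
The depth-guided back-projection mentioned in this abstract can be illustrated with a generic pinhole-camera sketch. This is not Mirage's implementation: the function name `backproject_latents`, the (H, W, C) latent layout, and the assumption that the intrinsics are expressed at latent-grid resolution are all hypothetical choices made for illustration.

```python
import numpy as np

def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (H, W, C) latent grid to 3D points using per-token depth.

    All argument conventions here are assumptions for this sketch:
    latents      -- (H, W, C) latent features
    depth        -- (H, W) depth at each latent token centre
    K            -- (3, 3) pinhole intrinsics at latent-grid resolution
    cam_to_world -- (4, 4) camera-to-world transform
    Returns (H*W, 3) world-space points and the matching (H*W, C) features.
    """
    H, W, C = latents.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous coordinates of token centres, shape (H, W, 3).
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)
    # Camera-space rays with z = 1, one per token.
    rays = pix.reshape(-1, 3) @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latents.reshape(-1, C)
```

Each latent token keeps its feature vector and gains a 3D position, so a cache built this way could be re-projected into a new camera without decoding to pixels, which is the property the abstract emphasizes.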

2606.09827 2026-06-09 cs.RO cs.CV New submission

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang

Affiliations: Tsinghua University, The University of Hong Kong, Dexmal, StepFun

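AI summary: Proposes MemoryVLA++, a framework that achieves full temporal modeling through working memory, a perceptual-cognitive memory bank, and a world model that imagines future states, markedly improving performance on long-horizon, memory-dependent, and imagination-dependent tasks in simulation and on real robots.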

Comments
The project is available at https://shihao1895.github.io/MemoryVLA-PP-Web
Abstract

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
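
One way to picture a memory bank updated by redundancy-aware consolidation is the sketch below. It is an assumption-laden illustration, not the paper's update rule: the abstract does not say how redundancy is measured, so this sketch assumes cosine similarity between temporally adjacent, nonzero entries, merges by averaging, and uses the hypothetical names `cosine` and `add_with_consolidation`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two nonzero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def add_with_consolidation(bank, entry, capacity):
    """Append entry; if the bank exceeds capacity, merge the most
    similar pair of temporally adjacent entries by averaging them."""
    bank = bank + [entry]
    if len(bank) <= capacity:
        return bank
    # Index of the adjacent pair (i, i + 1) with the highest similarity.
    i = max(range(len(bank) - 1), key=lambda k: cosine(bank[k], bank[k + 1]))
    merged = [(x + y) / 2 for x, y in zip(bank[i], bank[i + 1])]
    return bank[:i] + [merged] + bank[i + 2:]
```

Merging only adjacent entries keeps the bank in temporal order, so the bank size stays bounded while distinct events remain separate.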

2606.09826 2026-06-09 cs.CV cs.AI New submission

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi

Affiliations: The University of Hong Kong, LIGHTSPEED, The Chinese University of Hong Kong, Tsinghua University

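AI summary: Proposes OmniGameArena, a unified benchmark of 12 UE5 games, together with the Improvement Dynamics Curve (IDC), which uses a reflection mechanism to evaluate VLM agents' cold-start scores, improvement dynamics, and generalization.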

Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

2606.09825 2026-06-09 cs.LG cs.AI cs.SY eess.SY math.OC New submission

An Agency-Transferring Model-Free Policy Enhancement Technique

Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko

Affiliations: Center for Engineering Systems and Sciences, Central University, Sirius University of Science and Technology

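AI summary: Proposes a method that embeds a suboptimal baseline policy into reinforcement learning training and gradually transfers agency from the baseline to a learnable policy, improving training efficiency and yielding a standalone policy that outperforms the baseline.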

Abstract

Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
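
The step-wise arbitration described in this abstract can be sketched as a coin flip whose bias moves from the baseline to the learner over training. The abstract does not give the actual arbitration rule, so the linear schedule and the names `baseline_probability` and `arbitrate` below are assumptions for illustration only.

```python
import random

def baseline_probability(step, total_steps, start=1.0, end=0.0):
    """Probability of acting with the baseline policy at a given step.
    Assumed schedule: linear decay from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def arbitrate(step, total_steps, baseline_action, learner_action, rng=random):
    """Pick the baseline action with the scheduled probability,
    otherwise the learner's action."""
    p = baseline_probability(step, total_steps)
    return baseline_action if rng.random() < p else learner_action
```

With this schedule the baseline controls every step at the start of training and none at the end, matching the abstract's description of a final policy that runs without baseline support.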

2606.09822 2026-06-09 cs.CL cs.FL New submission

Causally Evaluating the Learnability of Formal Language Tasks

Vésteinn Snæbjarnarson, Anej Svete, Josef Valvoda, Reda Boumasmoud, Brian DuSell, Ryan Cotterell

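AI summary: Introduces the binning semiring to control how often a targeted property occurs, and combines it with a causal graphical model and decomposed KL divergence metrics to show that standard correlational evaluation is confounded when analyzing the learnability of formal language tasks.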

Abstract

Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are hard to delineate and can confound one another. To rigorously investigate the relationship between data frequency and learnability, we turn to a controlled setting using formal languages induced from probabilistic finite automata. These serve as a methodological testbed to demonstrate that standard correlational evaluation practices are inherently flawed. To enable causal analysis, we introduce the binning semiring, an algebraic object that lets us control how often a targeted property occurs in a sampled corpus. We formulate the experimental pipeline as a causal graphical model and derive decomposed Kullback-Leibler divergence metrics to measure the learnability of specific sub-tasks. Our experiments show that evaluating learnability without causal intervention leads to incorrect conclusions due to confounders in correlational analysis, and serve as a warning about correlational pitfalls in natural-language settings.
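
To make the controlled setting concrete, here is a minimal sketch of sampling strings from a probabilistic finite automaton. The two-state automaton, its probabilities, and the name `sample_string` are invented for illustration; the sketch does not implement the paper's binning semiring.

```python
import random

# Hypothetical two-state PFA over {a, b}. Each state maps to a list of
# (probability, symbol, next_state); symbol None means "stop".
PFA = {
    0: [(0.5, "a", 0), (0.3, "b", 1), (0.2, None, None)],
    1: [(0.6, "b", 1), (0.4, None, None)],
}

def sample_string(pfa, start=0, rng=random):
    """Sample one string by walking the automaton until a stop transition."""
    state, out = start, []
    while True:
        r, acc = rng.random(), 0.0
        # Pick a transition by inverse-CDF sampling; the last transition
        # acts as the fallback if rounding leaves r >= acc.
        for p, sym, nxt in pfa[state]:
            acc += p
            if r < acc:
                break
        if sym is None:
            return "".join(out)
        out.append(sym)
        state = nxt
```

Every sampled string has the form a*b*. How often a property such as "contains a b" appears in a corpus follows from the transition probabilities, and that frequency is the kind of quantity the abstract says the authors control.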

2606.09821 2026-06-09 cs.LG New submission

Rethinking the Divergence Regularization in LLM RL

Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang

Affiliations: Tencent Hunyuan, UIUC, NUS

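AI summary: Hard clipping or hard masking in methods such as PPO is a poor proxy for distributional shift in long-tailed vocabularies. DRPO replaces the hard mask with a smooth advantage-weighted quadratic regularizer, preserving the trust-region geometry while providing continuous gradient weights and improving training stability and efficiency.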

Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
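
The claim that the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies can be checked with a small numeric example. The probabilities below are made up, the clip range 0.2 is a commonly used PPO setting, and the helper names are hypothetical. The sketch does not implement DPPO or DRPO and ignores the advantage sign that the full PPO objective also considers.

```python
def importance_ratio(p_new, p_old):
    """PPO-style importance ratio for one sampled token."""
    return p_new / p_old

def outside_clip_range(ratio, eps=0.2):
    """True if the ratio falls outside [1 - eps, 1 + eps]."""
    return ratio < 1.0 - eps or ratio > 1.0 + eps

def absolute_shift(p_new, p_old):
    """Absolute change in the sampled token's probability."""
    return abs(p_new - p_old)

rare = (0.002, 0.001)    # rare token: probability doubles
common = (0.58, 0.50)    # common token: probability moves by 0.08

for name, (p_new, p_old) in (("rare", rare), ("common", common)):
    r = importance_ratio(p_new, p_old)
    print(name, round(r, 3), outside_clip_range(r),
          round(absolute_shift(p_new, p_old), 3))
```

The rare token trips the ratio test even though its probability mass barely moves, while the common token stays inside the clip range despite a much larger absolute shift. That mismatch is what motivates a trust region defined by absolute probability shift.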

2606.09820 2026-06-09 math.FA cs.LG math.PR q-fin.MF stat.ML New submission

Weighted universal approximation of differentiable maps on infinite-dimensional manifolds

Philipp Schmocker, Josef Teichmann

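AI summary: Using a weighted Nachbin theorem, generalizes the universal approximation theorem for functional input neural networks to differentiable maps, including approximation of derivatives, with applications to approximating non-anticipative functionals and path space functionals.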

Comments
77 pages, 3 figures
Abstract

We generalize the universal approximation theorem for functional input neural networks (FNN) to differentiable maps by including the approximation of the derivatives. An FNN maps the input from a possibly infinite-dimensional weighted manifold to the real-valued hidden layer, on which a non-linear scalar activation function is applied, and then returns the output into a Banach space via some linear readouts. By proving a weighted Nachbin theorem, we establish a universal approximation theorem (UAT) for differentiable maps, which goes beyond the usual formulation on compact sets and also includes the approximation of the derivatives. This leads us to approximation results for non-anticipative functionals including the horizontal and vertical derivatives. As a further application, we show that linear functions of the signature are able to approximate path space functionals including their directional derivatives.

2606.09816 2026-06-09 cs.CV cs.AI math.PR New submission

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws

Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

Affiliations: University of Pennsylvania, University of Cambridge, University of Oxford, Harvard University, MIT, University of Washington

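AI summary: Proposes PTL-Diffusion, whose forward noising process converges to a periodic family of Gaussian terminal laws rather than a single distribution, explicitly embedding phase structure and improving distributional matching on low-dimensional manifolds, with lower errors on point-cloud and face datasets.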

Abstract

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.
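
A toy example shows why a periodically forced Ornstein-Uhlenbeck process has a periodic limit rather than a single invariant law. The sketch assumes the scalar process dX = -theta (X - amp sin(omega t)) dt + sigma dW, whose mean obeys a linear ODE. This sinusoidal forcing is an assumption; the abstract does not specify the paper's forcing term. The function names are hypothetical.

```python
import math

def forced_ou_mean(theta, amp, omega, t_end, dt=1e-3, m0=0.0):
    """Euler integration of the mean ODE
    dm/dt = -theta * (m - amp * sin(omega * t))."""
    m, t = m0, 0.0
    for _ in range(int(round(t_end / dt))):
        m += -theta * (m - amp * math.sin(omega * t)) * dt
        t += dt
    return m

def periodic_limit_mean(theta, amp, omega, t):
    """Closed-form periodic steady state of the same linear ODE."""
    scale = amp * theta / (theta ** 2 + omega ** 2)
    return scale * (theta * math.sin(omega * t) - omega * math.cos(omega * t))
```

After the transient decays at rate theta, the integrated mean tracks the closed-form periodic solution, so the terminal mean depends on the phase of t instead of settling to a constant.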

2606.09813 2026-06-09 cs.RO cs.CV New submission

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Zhenyu Wu, Xiuwei Xu, Yukun Zhou, Yifan Li, Qiuping Deng, Xiaofeng Wang, Zheng Zhu, Bingyao Yu, Ziwei Wang, Jiwen Lu, Haibin Yan

Affiliations: Beijing University of Posts and Telecommunications, Tsinghua University, GigaAI, Nanyang Technological University

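AI summary: Proposes the iMac framework, which uses raw visual images as the action representation and combines an image-action encoder with a dynamic predictor for high-fidelity future state prediction and closed-loop control, outperforming conventional vector-based action control in prediction accuracy, task success rate, and cross-scene generalization.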

Comments
Project page: https://imac-wm.github.io/
Abstract

Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.

2606.09811 2026-06-09 cs.RO cs.AI cs.CV New submission

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

Affiliations: Shanghai Jiao Tong University, Shanghai AI Laboratory, Baidu AI Cloud, The University of Hong Kong

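AI summary: Proposes AHA-WAM, an asynchronous horizon-adaptive world-action model built on dual Diffusion Transformers, which decouples timing between a low-frequency world planner and a high-frequency action executor for efficient closed-loop control and achieves state-of-the-art performance on RoboTwin and real-world tasks.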

Comments
Project page: https://serene-sivy.github.io/aha-wam/
Abstract

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

2606.09806 2026-06-09 cs.LG cs.AI New submission

Topological Neural Operators

Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal

Affiliations: Imperial College London, University of San Francisco

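AI summary: Proposes Topological Neural Operators (TNOs), which use Discrete Exterior Calculus on cell complexes for cross-dimensional coupling and a hierarchical structure to improve long-range information propagation, outperforming existing operators on PDE benchmarks.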

Abstract

We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO
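
The gradient- and curl-type operators of Discrete Exterior Calculus mentioned in this abstract are fixed incidence matrices, and a single triangle is enough to show the compatibility structure they expose. The sketch below is a textbook illustration with a hand-built complex, not code from the paper, and the variable names are arbitrary.

```python
import numpy as np

# One oriented triangle with vertices 0, 1, 2 and edges (0,1), (1,2), (0,2).
# d0 maps vertex values to edge differences (a discrete gradient).
d0 = np.array([
    [-1,  1,  0],   # edge (0,1): f[1] - f[0]
    [ 0, -1,  1],   # edge (1,2): f[2] - f[1]
    [-1,  0,  1],   # edge (0,2): f[2] - f[0]
])
# d1 maps edge values to the circulation around the face (0,1,2), whose
# boundary is (0,1) + (1,2) - (0,2) (a discrete curl).
d1 = np.array([[1, 1, -1]])

f = np.array([3.0, -1.0, 4.0])   # arbitrary function on vertices
grad_f = d0 @ f                  # values on edges
curl_grad_f = d1 @ grad_f        # value on the face
print(grad_f, curl_grad_f)
```

Because `d1 @ d0` is the zero matrix, the curl of any discrete gradient vanishes exactly. This identity comes from the topology alone, independent of any learned weights.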

2606.09803 2026-06-09 cs.CV cs.GR cs.LG New submission

Echo-Memory: A Controlled Study of Memory in Action World Models

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

Affiliations: Joy Future Academy

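AI summary: Proposes Echo-Memory, a controlled-variable study of memory mechanisms in action-conditioned world models, finding that raw context capacity and block-wise state-space recurrence are critical for open-domain return tasks.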

Comments
9 figures and 28 pages, Code at https://github.com/Echo-Team-Joy-Future-Academy-JD/Echo-Memory
Abstract

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

2606.09802 2026-06-09 cs.LG cs.AI stat.ML New submission

Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts

Udvas Das, Waris Radji, Debabrota Basu, Odalric-Ambrym Maillard

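AI summary: For linear contextual stochastic multi-armed bandits in which user preferences and context distributions drift over time, proposes the Dri-MED algorithm, which handles non-stationary noise via heteroskedastic regression and attains instance-dependent bounds on regret and constraint violations.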

Abstract

We consider a variant of the linear contextual stochastic multi-armed bandit in which the learner must provide recommendations to a group of users, each with a personalized preference vector, while the context distributions drift over time. Under practitioner-friendly assumptions, we reduce this setting to a linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case in which the learner must ensure that the mean reward of each decision exceeds that of a baseline strategy $\boldsymbol{\pi}_0$ at each decision step. We introduce Dri-MED, an algorithm inspired by the linear version of the MED strategy and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal{O}}\left(\frac{\kappa}{\tilde{\Delta}} d^2 \log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\pi_0$, with a variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show that Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignore the drift and preference structure.
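
Heteroskedastic regression in a linear bandit is commonly realized as variance-weighted ridge regression, sketched below. The sketch assumes the per-round noise variances are known, which the abstract does not state, and `weighted_ridge` is a hypothetical name. Dri-MED's actual estimator may differ.

```python
import numpy as np

def weighted_ridge(X, y, sigma2, lam=1.0):
    """Estimate theta in y_t = x_t^T theta + noise_t, Var(noise_t) = sigma2[t].

    Each round is weighted by 1 / sigma2[t], so noisier rounds count less.
    X: (T, d) contexts, y: (T,) rewards, sigma2: (T,) positive variances
    (assumed known for this sketch), lam: ridge regularization strength.
    """
    w = 1.0 / np.asarray(sigma2, dtype=float)
    Xw = X * w[:, None]
    A = Xw.T @ X + lam * np.eye(X.shape[1])
    b = Xw.T @ y
    return np.linalg.solve(A, b)
```

With all variances equal to 1 this reduces to ordinary ridge regression.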

2606.09801 2026-06-09 math.CO cs.DM New submission

On the generalized Turán number of complete bipartite graphs

Oliver Janzer, Sean Longbrake, Liana Yepremyan

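AI summary: Proves that ex(n, K_{a,b}, K_{s,t}) = Θ(n^s) when s ∈ {2,3}, s < a ≤ b, and t is sufficiently large, and that for every graph F with at least one edge there are infinitely many real numbers r such that ex(n, F, H) = Θ(n^r) for some graph H.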

Comments
17 pages
英文摘要

For graphs $F$ and $H$, the generalized Turán number $\mathrm{ex}(n,F,H)$ denotes the maximum number of copies of $F$ in an $H$-free graph on $n$ vertices. We prove that if $s\in \{2,3\}$, $s< a\leq b$ and $t$ is sufficiently large, then $\mathrm{ex}(n,K_{a,b},K_{s,t})=\Theta(n^s)$. The $s=2$, $a=b=3$ case of this result answers a question of Spiro. Proving another conjecture of Spiro, we show that for every graph $F$ with at least one edge, there exist infinitely many real numbers $r$ such that $\mathrm{ex}(n,F,H)=\Theta(n^r)$ holds for some graph $H$.

2606.09800 2026-06-09 cs.SE cs.AI cs.MA 新提交

FASE: Fast Adaptive Semantic Entropy for Code Quality

FASE: 用于代码质量的快速自适应语义熵

Shizhe Lin, Ladan Tahvildari

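AI summary: Proposes Fast Adaptive Semantic Entropy (FASE), which approximates functional correctness via minimum spanning trees; on HumanEval and BigCodeBench it improves Spearman correlation by 25% and ROCAUC by 19% over existing semantic entropy methods, at only about 0.3% of the computational cost of traditional approaches.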

详情
AI中文摘要

多智能体代码生成通过模拟人类软件工程生命周期,为自主软件开发提供了一种有前景的范式。然而,系统可靠性仍然受到LLM幻觉和跨交互智能体错误传播的阻碍。虽然语义熵提供了一种无需真实答案即可量化不确定性的原则性方法,但当前方法通常依赖于成本高昂的LLM驱动的等价性检查。在这项工作中,我们引入了快速自适应语义熵(FASE),这是一种基于结构和语义不相似图的最小生成树来近似功能正确性的新型度量。在HumanEval和BigCodeBench上的评估表明,FASE优于通过LLM蕴含的最先进语义熵,在使用Qwen3-Embedding-8B模型时,与基于真实测试用例的Pass@1相比,Spearman相关性平均提升25%,ROCAUC分数提升19%。此外,通过消除成本高昂的LLM驱动的等价性评估,FASE的计算开销可忽略不计,其运行成本仅为传统语义熵方法的约0.3%。这些结果使FASE成为优化现实世界多智能体工作流中不确定性量化的实用且经济高效的解决方案。

英文摘要

Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.
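The abstract does not spell out how FASE turns a dissimilarity graph into a score, so the following is only a toy sketch of the general idea: build a complete graph over sampled code candidates, weight edges by dissimilarity, and use the total weight of the minimum spanning tree as a dispersion signal. The character-level `dissimilarity` function is a hypothetical stand-in for the structural and embedding-based measures mentioned in the abstract, and `mst_weight` is an illustrative name, not the paper's metric.

```python
from difflib import SequenceMatcher


def dissimilarity(a: str, b: str) -> float:
    """Hypothetical stand-in: 1 minus a character-level similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def mst_weight(samples: list[str]) -> float:
    """Total edge weight of the minimum spanning tree (Prim's algorithm)
    over the complete pairwise-dissimilarity graph of the samples."""
    n = len(samples)
    if n < 2:
        return 0.0
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    in_tree = [False] * n
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the cheapest node not yet in the tree and add it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dissimilarity(samples[u], samples[v]))
    return total


agreeing = ["return a + b"] * 3
mixed = ["return a + b", "return a - b", "raise NotImplementedError"]
low, high = mst_weight(agreeing), mst_weight(mixed)
```

Identical candidates give a tree weight of zero, while divergent candidates give a larger value, which is the kind of signal an uncertainty score can build on without any LLM-driven equivalence check.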

2606.09798 2026-06-09 cs.RO 新提交

SynManDex: Synthesizing Human-like Dexterous Grasps from Synthetic Human Pre-Grasps

SynManDex: 从合成人类预抓取中合成类人灵巧抓取

Yanming Shao, Zanxin Chen, Wenwei Lin, Mingjie Zhou, Tianxing Chen, Xiaokang Yang, Yichen Chi, Yao Mu

发表机构 * Shanghai AI Lab(上海人工智能实验室) Shanghai Jiaotong University(上海交通大学) Shenzhen University(深圳大学) Fudan University(复旦大学) University of Hong Kong(香港大学) ZTE Corporation(中兴通讯股份有限公司)

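AI summary: Proposes the SynManDex pipeline, which uses generated human pre-grasps as proposals and achieves force-closure contacts through robot-native optimization to generate human-like dexterous grasps, with high success rates and human-likeness in simulation and on real robots.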

详情
AI中文摘要

人类手-物交互编码了功能意图,但直接迁移到机器人手上常因形态、接触和可达性约束而失败。我们提出SynManDex,一个合成流水线,使用生成的人类预抓取作为可负担性感知的提议,并通过机器人原生优化解决最终接触。SynManDex采样物体条件化的数字人类预抓取,将其重定向到灵巧机器人手姿态,优化目标实体上的力闭合接触,并接受通过每一步检查的轨迹。所得关键帧支持抓取-举起演示以及各种抓取操作任务,如倒茶、拍照和吹笛子,这些任务通过VLM代理设计。因此,SynManDex结合了高抓取质量(86.4%抓取稳定性)和4.67/5的类人性(93.4%)。在仿真中达到80.7%的成功率,在应用于36自由度双臂灵巧机器人平台时,真实机器人成功率为25/30(83.3%)。

英文摘要

Human hand-object interactions encode functional intent, but direct transfer to robotic hands often fails under morphology, contact, and reachability constraints. We present SynManDex, a synthetic pipeline that uses generated human pre-grasps as affordance-aware proposals and resolves the final contacts with robot-native optimization. SynManDex samples object-conditioned digital human pre-grasps, retargets them to dexterous robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits only trajectories that pass the checks at every step. The resulting keyframes support both grasp-and-lift demonstrations and various prehensile manipulation tasks such as tea pouring, photo taking, and flute playing, designed via VLM agents. As a result, SynManDex combines high grasp quality (86.4% grasp stability) with 4.67/5 human-likeness (93.4%). It achieves an 80.7% success rate in simulation and 25/30 (83.3%) real-robot successes when applied to a 36-DOF bimanual dexterous robotic platform.

2606.09795 2026-06-09 math.CO cs.IT cs.NA hep-th math-ph math.IT math.MP math.NA 新提交

Finite-n Estimate of Dedekind Numbers by Layer-Ratio Monte Carlo

通过层比蒙特卡洛方法对戴德金数的有限n估计

Tian-Shun Chen, Hao Feng, Haozhe Wang, Kilar Zhang

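AI summary: Recasts the Dedekind number problem as a layer-ratio reconstruction problem on the ideal lattice of the Boolean lattice, estimates the layer ratios with reversible fixed-layer Markov chains to estimate the Dedekind number M(n), and finds a two-shoulder feature near the central layer for n=9.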

详情
Comments
27 pages, 6 figures, 7 tables
AI中文摘要

戴德金问题计数单调布尔函数,等价于布尔格的下集。我们将此枚举重新表述为秩理想格的惠特尼数的有限层比重建问题。精确的相邻层双重计数通过可加元素数和可移除元素数的局部平均值表达每个层比。可逆固定层马尔可夫链估计这些平均值,从而估计戴德金数M(n)。在M(8)和M(9)上的回测校准了固定协议下的种子级变异性,并测量了观察到的蒙特卡洛预算缩放。所得估计探测了理想格的惠特尼数序列。尽管这些行先前被经验描述为单峰的,但高精度n=9估计在中心秩附近有一个浅的双肩特征,与经验描述相反;n=11和n=13中心窗口估计显示出更大对比度的类似模式。M(10)的协议估计为\[ \widehat M(10)=(8.9360\pm0.0010)\times 10^{78}, \] 其中显示的不确定性是生产预算下跨n缩放定律的基于预算的预测尺度。

英文摘要

Dedekind's problem counts monotone Boolean functions, equivalently downsets of a Boolean lattice. We recast this enumeration as a finite layer-ratio reconstruction problem for the Whitney numbers of the ranked ideal lattice. An exact adjacent-layer double count expresses each layer ratio through local averages of the number of addable elements and the number of removable elements. Reversible fixed-layer Markov chains estimate these averages and hence estimate the Dedekind number M(n). Backtests at M(8) and M(9) calibrate seed-level variability under the fixed protocol and measure the observed Monte Carlo budget scaling. The resulting estimate probes the Whitney-number sequence of the ideal lattice. Although these rows have previously been described empirically as unimodal, the high-precision n=9 estimate has a shallow two-shoulder feature around the central rank, contrary to that empirical description; n=11 and n=13 center-window estimates show a larger-contrast analogous pattern. The protocol estimate for M(10) is \[ \widehat M(10)=(8.9360\pm0.0010)\times 10^{78}, \] where the displayed uncertainty is the budget-based forecast scale from the cross-n scaling law under the production budget.
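The adjacent-layer double count can be checked exactly on a tiny case. The sketch below assumes the natural reading of the abstract: layer k holds the downsets with k elements, an element is addable if adding it keeps the family a downset, and removable likewise. Counting pairs of downsets that differ by one element from both sides gives W_{k+1}/W_k = (mean addable on layer k) / (mean removable on layer k+1). Here the means are computed by brute force for n = 3 rather than by Markov chain sampling, and all names are illustrative.

```python
from fractions import Fraction

N = 3
SETS = range(1 << N)  # subsets of {0, 1, 2} encoded as bitmasks


def is_downset(fam: int) -> bool:
    """fam is a bitmask over SETS; check closure under deleting one element."""
    for s in SETS:
        if (fam >> s) & 1:
            for i in range(N):
                if (s >> i) & 1 and not (fam >> (s ^ (1 << i))) & 1:
                    return False
    return True


def addable(fam: int) -> int:
    """Number of sets that can be added while staying a downset."""
    return sum(1 for s in SETS
               if not (fam >> s) & 1 and is_downset(fam | (1 << s)))


def removable(fam: int) -> int:
    """Number of sets that can be removed while staying a downset."""
    return sum(1 for s in SETS
               if (fam >> s) & 1 and is_downset(fam & ~(1 << s)))


downsets = [f for f in range(1 << (1 << N)) if is_downset(f)]
layers = {}
for f in downsets:
    layers.setdefault(bin(f).count("1"), []).append(f)
whitney = [len(layers[k]) for k in range((1 << N) + 1)]

# Verify the layer-ratio identity exactly on every adjacent pair of layers.
for k in range(1 << N):
    mean_add = Fraction(sum(addable(f) for f in layers[k]), whitney[k])
    mean_rem = Fraction(sum(removable(f) for f in layers[k + 1]), whitney[k + 1])
    assert Fraction(whitney[k + 1], whitney[k]) == mean_add / mean_rem

print(sum(whitney), whitney)  # 20 [1, 1, 3, 3, 4, 3, 3, 1, 1]
```

Multiplying the ratios upward from the single empty downset recovers every layer count, and their sum is the Dedekind number, here M(3) = 20. For large n the two means cannot be enumerated, which is where the abstract's fixed-layer Markov chains come in.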

2606.09794 2026-06-09 cs.CV cs.GR 新提交

Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction

超越球谐函数:重新思考辐射重建的外观模型

Ewa Miazga, Jorge Condor, Piotr Didyk

发表机构 * École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) Università della Svizzera Italiana(意大利语区瑞士大学)

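AI summary: Systematically evaluates a range of spherical functions and proposes the Normalized Anisotropic Spherical Gabor function, which models high-frequency appearance effects efficiently with a compact representation, achieving up to fivefold memory savings and higher quality in radiance-field reconstruction.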

详情
Comments
19 pages, 11 figures
AI中文摘要

视角相关的外观建模在新视角合成与重建中仍是一个具有挑战性的问题。准确表示复杂的角度效应通常需要大量的内存和计算资源。对于新的基于学习的方法,常见做法是依赖球谐函数(SH)。然而,捕捉镜面反射等高频率现象需要高阶展开,这会增加内存使用和计算成本。因此,大多数方法采用低阶SH,这限制了建模复杂视角相关效应的能力,导致表示过于平滑或漫反射。为解决这些限制,我们系统评估了场景重建中多种球面函数。其中一些函数在本文中首次被引入图形学和计算机视觉领域。基于实验洞察,我们提出了一种新的球面公式——归一化各向异性球面Gabor函数,它能够在保持紧凑表示的同时高效建模和学习高频外观效果。与现有方法相比,我们的函数在重建如闪光等视角相关现象时实现了更高质量,同时内存效率提高五倍,且评估更高效。我们在辐射场重建任务中验证了其性能。

英文摘要

View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on spherical harmonics (SH). However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from these experiments, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function, that enables efficient modeling and learning of high-frequency appearance effects while maintaining a compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.

2606.09792 2026-06-09 cs.CV 新提交

End-to-End Optimization of Incoherent Imaging for Classification Under Detector-Limited Readout

探测器有限读出下非相干成像分类的端到端优化

Archer Wang, Joshua Chen, Sachin Vaidya, Marin Soljačić

发表机构 * Research Laboratory of Electronics, Massachusetts Institute of Technology(麻省理工学院电子研究实验室) Department of Physics, Massachusetts Institute of Technology(麻省理工学院物理系)

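AI summary: For detector-limited readout, improves incoherent-imaging classification through end-to-end optimization of a phase mask; proves that there is no gain under full readout, while under limited readout the optimized optics yield substantial improvement by increasing class separability.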

详情
AI中文摘要

光学前端(如超表面)和神经网络后端的端到端联合优化已广泛应用于成像任务,但缺乏一个形式化框架来描述此类系统何时以及为何优于传统透镜成像。本文聚焦于分类这一核心成像任务,探究端到端优化非相干成像相位掩模何时能提升性能。我们发现,这些增益主要出现在探测器读出受限的情况下,而在全读出下则有限。在后一种情况下,我们证明没有非相干相位掩模能超过探测器测量与类别标签之间的理想信道互信息;传统聚焦透镜接近这一上限,联合优化无实证增益。当探测器读出受限时(通过粗空间采样或有限测量次数),优化光学系统可通过增加探测器测量中的类可分性来显著提升分类性能。这些增益在低探测器噪声下最大,并随噪声增大而减小,因为光学系统在信号到达探测器前塑造信号,但无法去除之后添加的噪声。该优势还取决于任务的光谱结构:当类别判别内容集中在比类内变化更低的空间频率时,协同设计帮助最大。我们开发了一个理论框架来形式化这些区别,并在合成数据和标准基准(MNIST、FashionMNIST、SVHN)上测试其预测。

英文摘要

End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging is largely lacking. This paper focuses on object classification, a central imaging task, and asks when end-to-end optimization of a phase mask for incoherent imaging improves performance over a conventional focusing lens. We find that these gains arise primarily under constrained detector readout and are limited under full detector readout. In the latter setting, we prove that no incoherent phase mask exceeds the ideal-channel mutual information between detector measurements and class labels; a conventional focusing lens approaches this ceiling, and joint optimization yields no empirical gain. When detector readout is constrained -- by coarse spatial sampling or a limited number of measurements -- optimized optics can substantially improve classification by increasing class separability in the detector measurements. These gains are largest under low detector noise and shrink as noise grows, because the optics shape the signal before it reaches the detector but cannot remove noise added afterward. The advantage also depends on the spectral structure of the task: co-design helps most when class-discriminative content is concentrated at lower spatial frequencies than within-class variation. We develop a theoretical framework formalizing these distinctions and test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN).

2606.09789 2026-06-09 cs.CY 新提交

Principled Uncertainty in Clinical AI: End-to-End Bayesian Modelling and Algorithmic Equity Auditing Across Multimodal Patient Data

临床AI中的原则性不确定性:跨多模态患者数据的端到端贝叶斯建模与算法公平审计

Oladimeji Anthonio, Dimeji Abdulsobur Olawuyi, Oloruntoba Ajayi, Temiloluwa Aderemi, Joseph Odamo

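AI summary: Proposes an end-to-end Bayesian uncertainty modelling framework that treats calibrated uncertainty as an equity measure, and identifies epistemic-uncertainty equity gaps for primary/rural facility patients and low socioeconomic status patients in simulated patient data.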

详情
AI中文摘要

临床人工智能(AI)系统在缺乏原则性不确定性量化的情况下常规生成预测,限制了其在高风险医疗环境中的可信度。本文提出一个综合研究计划,解决两个相互关联的问题:(1)开发一个完全端到端的贝叶斯不确定性建模框架,用于多模态临床数据;(2)将校准的不确定性估计作为跨患者亚组算法公平性的正式度量。我们构建了一个概率深度学习架构,包括特定模态的变分编码器、精度加权后期融合机制以及一个分离偶然不确定性和认知不确定性的分解不确定性输出头。该系统使用复合贝叶斯损失进行训练,包括二元交叉熵、Kullback-Leibler散度正则化和不确定性校准惩罚。我们使用期望校准误差(ECE = 0.096)评估模型校准,并在1000名模拟患者的数据集上,按设施类型、社会经济地位、年龄组和生理性别进行亚组公平审计。结果表明,认知不确定性系统地识别出服务不足的人群:初级/农村设施患者表现出15.3%的不确定性公平差距(p < 0.001,效应量 = 0.698),低社会经济地位患者表现出6.8%的差距(p < 0.001),老年患者表现出3.9%的差距(p < 0.001),而未检测到显著的性别差异。这些发现表明,校准不确定性不仅是概率模型的技术属性,而且是一种具有直接临床相关性的可操作公平信号。

英文摘要

Clinical artificial intelligence (AI) systems routinely produce predictions without principled quantification of uncertainty, limiting their trustworthiness in high-stakes medical environments. This paper presents an integrated research programme addressing two interconnected problems: (1) the development of a fully end-to-end Bayesian uncertainty modelling framework for multimodal clinical data, and (2) the application of calibrated uncertainty estimates as a formal measure of algorithmic equity across patient subgroups. We construct a probabilistic deep learning architecture comprising modality-specific variational encoders, a precision-weighted late fusion mechanism, and a decomposed uncertainty output head that separates aleatoric from epistemic uncertainty. The system is trained with a composite Bayesian loss incorporating binary cross-entropy, Kullback-Leibler divergence regularisation, and an uncertainty calibration penalty. We evaluate model calibration using Expected Calibration Error (ECE = 0.096) and conduct a subgroup equity audit across facility type, socioeconomic status, age group, and biological sex on a dataset of 1,000 simulated patients. Results demonstrate that epistemic uncertainty systematically identifies underserved populations: primary/rural facility patients show a 15.3% uncertainty equity gap (p < 0.001, effect size = 0.698), low socioeconomic status patients exhibit a 6.8% gap (p < 0.001), and elderly patients show a 3.9% gap (p < 0.001), whilst no significant sex-based disparity is detected. These findings establish that calibrated uncertainty is not merely a technical property of probabilistic models but constitutes an actionable equity signal with direct clinical relevance.
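For readers unfamiliar with the calibration metric quoted above, here is a minimal sketch of Expected Calibration Error in one common binary form: predictions are binned by predicted probability, and the gap between the mean predicted probability and the observed positive rate is averaged with weights proportional to bin size. The bin count and the equal-width binning are assumptions; the abstract does not state which variant the authors used.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions with equal-width probability bins.

    probs  : predicted probabilities of the positive class, each in [0, 1]
    labels : ground-truth labels, 0 or 1
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)    # average confidence
            frac_pos = sum(y for _, y in b) / len(b)  # observed frequency
            ece += len(b) / n * abs(frac_pos - mean_p)
    return ece


perfect = expected_calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
overconfident = expected_calibration_error([0.9] * 4, [1, 1, 1, 0])
```

In the second call the model predicts 0.9 four times but is right only three times out of four, so the single occupied bin contributes a gap of 0.15.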

2606.09788 2026-06-09 cs.CV 新提交

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

POTATR: 一种用于页面级表格提取的轻量级图像到图模型

Brandon Smock, Libin Liang, Max Sokolov, Amrit Ramesh, Valerie Faucon-Morin, Tayyibah Khanam, Maury Courtland

发表机构 * Kensho Technologies

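AI summary: Proposes POTATR, a lightweight image-to-graph model (29M parameters) that outperforms frontier models on page-level table extraction while running over 130× faster at roughly 300× lower cost, reaching a GriTS_Con of 0.964 with spatially grounded output.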

详情
Comments
16 pages, split from PubTables-v2 paper
AI中文摘要

大规模文档处理需要上下文感知的表格提取(TE),既准确又高效。然而,当前方法需要数十亿参数、数百个自回归步骤或昂贵的API推理。受此启发,我们引入了页面对象表格Transformer(POTATR),这是一个轻量级的29M参数图像到图模型,扩展了表格Transformer(TATR)用于上下文感知的页面级TE。在PubTables-v2单页面基准测试中,POTATR超越了所有测试模型(包括前沿MLLM),实现了0.964的$\textrm{GriTS}_\textrm{Con}$,同时运行速度提高130倍以上,成本降低约300倍。此外,POTATR的输出是空间可解释的:每个识别元素都有一个边界框,支持视觉验证和几何文本分配。因此,POTATR在执行统一的页面级TE的同时,可以与其他模型组合,通过外部OCR扩展到扫描文档,并通过跨页面合并等技术扩展到全文档TE。代码和模型将发布。

英文摘要

Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark -- including frontier MLLMs -- achieving $\textrm{GriTS}_\textrm{Con}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.

2606.09787 2026-06-09 cs.LG cs.NI 新提交

Zero Touch Predictive Orchestration: Automating Time-Series Models for the Cloud-Edge Continuum

零接触预测性编排:为云边连续体自动化时间序列模型

Abd Elghani Meliani, Arora Sagar, Adlen Ksentini, Raymond Knopp

发表机构 * Eurecom OpenAirInterface

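AI summary: To address the node cold-start problem in the cloud-edge continuum, proposes an automated time-series forecasting architecture that combines data mixing with neural architecture search, improving forecasting accuracy and accelerating convergence.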

详情
Comments
19 pages, 14 figures
AI中文摘要

云边连续体(CEC)通过将资源分布到远边缘来支持延迟关键型应用,但其极端波动性使得通过时间序列预测进行主动零接触管理至关重要。然而,编排器面临严重的“冷启动”问题:新发现的节点缺乏训练局部预测模型所需的历史数据,而通用模型无法捕捉独特的硬件和微服务行为。为解决此问题,我们提出了一种由新颖的数据混合方法驱动的全自动时间序列预测架构。在基础设施层面,我们引入了一个轻量级、技术无关的资源暴露器(RE),它动态发现节点并持续收集可定制的遥测数据(例如,计算、网络、能源)。为了克服这些初始局部样本的稀疏性,我们的框架自动将它们与TimeTrack(我们公开的高分辨率数据集,以45秒间隔收集)合并。这协同了TimeTrack的基础高频时间模式与局部节点数据的精确校准。通过神经架构搜索(NAS)引擎处理,系统自动生成高精度的基线模型。实验结果表明,将目标数据与TimeTrack合并有效缓解了冷启动挑战。与仅使用稀疏局部样本训练、仅使用通用数据集训练或将目标数据与标准替代数据集混合相比,这种集成显著提高了以均方误差(MSE)、平均绝对误差(MAE)和平均绝对百分比误差(MAPE)衡量的预测准确性,并加速了收敛,为持续MLOps部署奠定了坚实基础。

英文摘要

The Cloud-Edge Continuum (CEC) enables latency-critical applications by distributing resources to the far edge, but its extreme volatility makes proactive Zero Touch Management via time-series forecasting essential. However, orchestrators face a severe "cold start" problem: newly discovered nodes lack the historical data required to train localized predictive models, while generalized models fail to capture unique hardware and microservice behaviors. To solve this, we propose a fully automated time-series prediction architecture driven by a novel data-mixing methodology. At the infrastructure level, we introduce a lightweight, technology-agnostic Resource Exposer (RE) that dynamically discovers nodes and continuously collects customizable telemetry (e.g., compute, network, energy). To overcome the sparsity of these initial local samples, our framework automatically merges them with TimeTrack, our publicly available, high-resolution dataset collected at 45-second intervals. This synergizes TimeTrack's foundational, high-frequency temporal patterns with the precise calibration of the local node data. Processed through a Neural Architecture Search (NAS) engine, the system automatically generates highly accurate baseline models. Experimental results demonstrate that merging the target data with TimeTrack effectively mitigates the cold start challenge. This integration significantly improves forecasting accuracy measured in Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) and accelerates convergence compared to training on the sparse local samples alone, training solely on generic datasets, or mixing the target data with standard alternative datasets, establishing a robust foundation for continuous MLOps deployment.

2606.09785 2026-06-09 math.CO cs.DM 新提交

Biclique decompositions from Welzl orders

从Welzl序的二部团分解

Jean Cardinal, Rose McCarty, Yelena Yuditsky

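AI summary: Studies partitions of a graph's edges into complete bipartite subgraphs (biclique decompositions), uses Welzl orders to show that graphs of low neighborhood complexity admit small decompositions, and applies this to the Zarankiewicz problem, matrix multiplication, quantum circuit complexity, and shortest path algorithms.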

详情
AI中文摘要

图的二部团分解是将其边划分为完全二部子图。我们考虑顶点可排序的图,使得每个顶点的邻域是亚线性个区间的并集。我们观察到这些图允许以紧凑的二部团分解形式表示,其中分解的大小以其二部团的顶点数之和来衡量。结合这一结果与Welzl在1988年证明的低邻域复杂度图存在合适顶点排序的结论,我们恢复并扩展了几个已知结果,达到对数因子。这些结果包括Zarankiewicz问题、矩阵乘法、量子电路复杂度以及“结构良好”实例中最短路径算法的上界。

英文摘要

A biclique decomposition of a graph is a partition of its edges into complete bipartite subgraphs. We consider graphs whose vertices can be ordered such that the neighborhood of every vertex is the union of a sublinear number of intervals. We observe that these graphs admit compact representations in the form of biclique decompositions of small size. Here, the size of a decomposition is measured as the sum of the number of vertices of its bicliques. Combining this result with the existence of suitable vertex orderings for graphs of low neighborhood complexity, as proven by Welzl in 1988, we recover and extend several known results up to logarithmic factors. These results include upper bounds on the Zarankiewicz problem, matrix multiplication, quantum circuit complexity, and shortest path algorithms in "well-structured" instances.
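One standard way to turn interval-structured neighborhoods into bicliques, which may or may not be the construction used in the paper, is a segment-tree decomposition. The sketch below assumes a bipartite setting for simplicity: right-hand vertices are ordered 0..n-1 and each left-hand vertex has a neighborhood given as disjoint closed intervals. Each interval is split into canonical segment-tree ranges, and every range used becomes one biclique. The function names are illustrative.

```python
from collections import defaultdict


def canonical_nodes(lo, hi, node_lo, node_hi, out):
    """Split [lo, hi] into canonical segment-tree ranges inside [node_lo, node_hi]."""
    if hi < node_lo or node_hi < lo:
        return
    if lo <= node_lo and node_hi <= hi:
        out.append((node_lo, node_hi))
        return
    mid = (node_lo + node_hi) // 2
    canonical_nodes(lo, hi, node_lo, mid, out)
    canonical_nodes(lo, hi, mid + 1, node_hi, out)


def biclique_decomposition(n, intervals):
    """intervals[u] lists disjoint closed intervals (lo, hi) over vertices 0..n-1.

    Returns bicliques as (left_vertices, right_range) pairs.
    """
    groups = defaultdict(list)
    for u, ivs in intervals.items():
        for lo, hi in ivs:
            segs = []
            canonical_nodes(lo, hi, 0, n - 1, segs)
            for seg in segs:
                groups[seg].append(u)
    return [(us, range(a, b + 1)) for (a, b), us in groups.items()]


n = 8
intervals = {"a": [(0, 5)], "b": [(2, 3), (6, 7)], "c": [(1, 6)]}
bicliques = biclique_decomposition(n, intervals)
size = sum(len(us) + len(vs) for us, vs in bicliques)  # decomposition size
```

Because the intervals of one vertex are disjoint, their canonical ranges are disjoint too, so every edge lands in exactly one biclique. Each interval contributes O(log n) ranges and the ranges used on one tree level cover at most n vertices, so the total size is O((k + n) log n) for k intervals overall, consistent with the "up to logarithmic factors" phrasing in the abstract.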

2606.09782 2026-06-09 cs.HC 新提交

Cohort-based Semantic Labeling: AI-Enabled Recovery of Visualization Semantics from Deployed SVGs

基于分组的语义标注:AI 驱动的从已部署 SVG 中恢复可视化语义

Jeongah Lee, Hima Varshini Surisetty, Durga Nirmaleswaran, Jahnavi Sharma, Srikiran Kavuri, Narges Mahyar, Ali Sarvghad

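AI summary: Proposes the CSL pipeline, which automatically recovers visualization semantics from SVGs through cohort-based decomposition and hybrid semantic grounding, achieving high-accuracy labeling.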

详情
AI中文摘要

许多基于网络的可视化以可缩放矢量图形(SVG)形式部署,这种格式忠实地保留了视觉外观,但通常省略了机器解释所需的高层语义结构。一旦渲染和发布,关于可视化组件、角色和编码的信息不再显式可用,限制了查询、可访问性增强、解释、个性化和转换等下游操作。为了解决这一差距,我们引入了 CSL,一个 AI 驱动的多阶段管道,通过两种互补机制从已部署的 SVG 中自动恢复可视化语义:(1)基于分组的分解,将异构 SVG 图元组织成结构上连贯的子集,减少语义分配空间;(2)混合语义锚定,将基于模型的推理与确定性结构验证和传播相结合,使标注既上下文敏感又结构锚定。CSL 生成语义 SVG(SSVG),一种 SVG 元素被标注了图形标记类型、可视化角色和数据角色的表示。我们将 CSL 实现为端到端原型,并在 102 个 SVG 可视化上进行了评估,在标记类型、可视化角色和数据角色恢复上分别达到了 0.822、0.853 和 0.860 的全局宏平均准确率。与非分组的整图基线的消融实验表明,分组显著提高了准确率(配对 t 检验:t > 20,p < 0.001;Cohen's d > 2.0),对随机选择的 SVG 进行 100 次重复标注,三个属性的平均一致性超过 91.9%。这些结果有力证明了 CSL 可以将已部署的 SVG 转换为机器可用的语义表示,从而实现更易访问、自适应和用户可控的可视化系统。

英文摘要

Many web-based visualizations are deployed as Scalable Vector Graphics (SVG), a format that faithfully preserves visual appearance but typically omits the higher-level semantic structure needed for machine interpretation. Once rendered and published, information about a visualization's components, roles, and encodings is no longer explicitly available, limiting downstream operations such as querying, accessibility augmentation, explanation, personalization, and transformation. To address this gap, we introduce CSL, an AI-enabled, multi-stage pipeline for automatically recovering visualization semantics from deployed SVGs through two complementary mechanisms: (1) cohort-based decomposition, which organizes heterogeneous SVG primitives into structurally coherent subsets that reduce the semantic assignment space, and (2) hybrid semantic grounding, which combines model-based inference with deterministic structural validation and propagation to make labeling both context-sensitive and structurally anchored. CSL produces Semantic SVG (SSVG), a representation in which SVG elements are annotated with graphical mark type, visualization role, and data role. We implemented CSL as an end-to-end prototype and evaluated it on 102 SVG visualizations, achieving global macro-averaged accuracies of 0.822 for mark type, 0.853 for visualization role, and 0.860 for data-role recovery. An ablation against a non-cohort whole-chart baseline showed that cohorting significantly improves accuracy (paired t-test: t > 20, p < 0.001; Cohen's d > 2.0), and repeated labeling of a randomly selected SVG over 100 runs yielded mean agreement above 91.9% across all three attributes. These results provide strong evidence that CSL can transform deployed SVGs into machine-usable semantic representations, enabling more accessible, adaptive, and user-steerable visualization systems.

2606.09780 2026-06-09 cs.SD cs.NE 新提交

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

声音生成中的质量-多样性搜索:用于音频探索的创新引擎研究

Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani, Kyrre Glette

发表机构 * University of Oslo(奥斯陆大学)

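AI summary: Combines quality-diversity algorithms with a supervised discriminative model, generates diverse synthetic sounds using multi-band CPPNs and DSP graphs, and analyses evolutionary paths and temporal niches, demonstrating the potential of innovation engines for sound discovery.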

详情
Comments
This is an extended version of the previously published conference paper "Towards Sound Innovation Engines Using Pattern-Producing Networks and Audio Graphs": https://doi.org/10.1007/978-3-031-56992-0_14
AI中文摘要

本研究解决了作曲家和声音设计师在创建和优化工具以实现其音乐目标时所面临的挑战。通过利用进化过程促进多样性并培养偶然发现,我们自动化了在未知声音空间中的搜索以发现声音,认为促进多样性的算法可以弥合声音的理论实现与实际可访问性之间的差距。我们描述了一个生成式声音合成系统,该系统将质量多样性(QD)算法与监督判别模型相结合,灵感来自创新引擎算法,并探索了不同配置以及所选合成方法与判别模型之间的相互作用。我们研究了组合模式生成网络(CPPN)和数字信号处理(DSP)图之间的交互,引入了一种新颖的方法,该方法使用多个专门针对不同频率范围的CPPN;这产生了更简单的网络,同时保持了与单CPPN设置相当的性能。我们还通过分析音乐和非音乐背景之间的目标切换来研究进化垫脚石,揭示了谱系如何穿越看似不可能的路径到达当前精英。将先前研究的行为空间扩展到包括各种声音持续时间,我们发现了时间生态位内的特化。结果表明,CPPN和DSP图与多维表型精英档案(MAP-Elites)和深度学习分类器相结合,可以生成大量多样的合成声音,在时间和上下文维度上具有多样性和创新性。我们通过在线探索器和渲染的声音文件呈现生成的声音对象,并在音乐创作的背景下,展示了一个实验性应用,该应用展示了它们在不同持续时间和上下文中的创造潜力。

英文摘要

This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for sound discovery, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and practical accessibility of sounds. We describe a system for generative sound synthesis combining Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and explore different configurations and the interplay between the chosen synthesis approach and the discriminative model. We examine the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, introducing a novel approach that uses multiple specialised CPPNs for different frequency ranges; this yields simpler networks while maintaining performance comparable to single-CPPN setups. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages traverse unlikely paths to current elites. Expanding the behaviour space of a previous study to include various sound durations, we uncover specialisation within temporal niches. Results indicate that CPPN and DSP graphs coupled with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier can generate a substantial variety of synthetic sounds, diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and as rendered sound files, and, in the context of music composition, an experimental application that showcases their creative potential across various durations and contexts.
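As background for the MAP-Elites component, here is a minimal sketch of the archive loop on a toy problem. In the system described above the genome would be a CPPN plus DSP graph, the cell would come from a classifier's predicted class and the sound duration, and the fitness would be classifier confidence; in this sketch all three are replaced by hypothetical stand-ins (a point in the unit square, a ten-bin descriptor, and a simple distance-based score).

```python
import random


def map_elites(evaluate, mutate, random_solution, n_iters=2000, n_init=100, seed=0):
    """Keep the best solution found so far in each behaviour cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, solution)
    for it in range(n_iters):
        if archive and it >= n_init:
            # Select a random elite and perturb it.
            parent = rng.choice(list(archive.values()))[1]
            child = mutate(parent, rng)
        else:
            child = random_solution(rng)
        fitness, cell = evaluate(child)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive


def random_solution(rng):
    return (rng.random(), rng.random())


def mutate(sol, rng):
    # Gaussian perturbation, clipped back into the unit square.
    return tuple(min(1.0, max(0.0, v + rng.gauss(0, 0.1))) for v in sol)


def evaluate(sol):
    x, y = sol
    cell = min(int(x * 10), 9)   # toy behaviour descriptor: ten bins along x
    fitness = 1.0 - abs(y - 0.5)  # toy quality score, best at y = 0.5
    return fitness, cell


archive = map_elites(evaluate, mutate, random_solution)
```

The key property is that a child competes only against the current elite of its own cell, so a mediocre solution in an empty niche is kept and can later serve as a stepping stone toward elites in other niches.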

2606.09778 2026-06-09 quant-ph cs.AI 新提交

Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution

谁赢得了安全?具有安全归因的干预感知量子预测控制

Yifan Wang

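AI summary: Proposes Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), which uses a primal-dual intervention budget and a safety-attribution protocol to quantify and improve the intrinsic safety of a quantum policy, so that protective layers do not mask policy deficiencies.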

详情
Comments
7 pages, 4 figures
AI中文摘要

硬安全过滤器越来越多地部署在学习控制器的下游,以保证运行时约束满足。然而,一个从不违反约束的过滤控制器可能仍然没有学到任何关于安全性的知识:过滤器可以静默地修复一个不称职的上游策略,使得过滤后的成功衡量的是过滤器,而不是策略。我们认为,安全策略学习应该问谁赢得了安全——策略还是其保护层——并且我们使这个问题可测量。我们引入了干预感知变分量子可微预测控制(IA-VQC-DPC),它(i)在原始-对偶干预预算下训练一个紧凑的变分量子电路(VQC)策略,该预算惩罚对可微控制障碍函数(CBF)投影的依赖,并且(ii)通过一个安全性归因协议进行评估,该协议将执行轨迹修正分解为CBF项和部署运行时保护项,并通过关闭保护评估对策略进行压力测试。在闭环、高保真BOPTEST建筑控制模拟器上(5个种子,每种方法60个回合),干预感知训练显著降低了量子策略的原始预过滤违规和总安全层依赖(两者p < 10^-4),且没有显著的能耗回归;在约400个参数的相同预算下,量子策略比匹配的经典策略显著更安全、更舒适。关闭保护评估证实了改进是策略层面的,并揭示了一个有价值的负面结果:一个学习的可微能量头只有与分布感知的运行时保护配对时才安全。该归因协议在量子策略和建筑之外具有通用性。

英文摘要

Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filtered controller that never violates a constraint may still have learned nothing about safety: the filter can silently repair an incompetent upstream policy, so that post-filter success measures the filter, not the policy. We argue that safe policy learning should ask who earns the safety (the policy or its protective layers), and we make this question measurable. We introduce Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), which (i) trains a compact variational quantum circuit (VQC) policy under a primal-dual intervention budget that penalizes reliance on a differentiable Control-Barrier-Function (CBF) projection, and (ii) is evaluated with a safety-attribution protocol that decomposes the executed-trajectory correction into a CBF term and a deployment runtime-guard term, and stress-tests the policy with guard-off evaluation. On closed-loop, high-fidelity BOPTEST building-control emulators (5 seeds, 60 episodes per method), intervention-aware training significantly lowers the quantum policy's raw pre-filter violation and total safety-layer reliance (both p < 10^-4) with no significant energy regression; at an equal budget of approximately 400 parameters, the quantum policy is significantly safer and more comfortable than a matched classical policy. Guard-off evaluation confirms the improvement is policy-level and exposes a valuable negative result: a learned differentiable energy head is only safe when paired with a distribution-aware runtime guard. The attribution protocol is general beyond quantum policies and buildings.
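The point that post-filter success can hide an unsafe policy is easy to see on a toy system. The sketch below is not the paper's setup: it assumes scalar dynamics x' = u, a barrier h(x) = x_max - x, and two hand-written policies. With these choices the CBF condition h' + alpha*h >= 0 reduces to u <= alpha*(x_max - x), so the projection is a simple clamp, and the accumulated correction plays the role of an intervention measure.

```python
def cbf_filter(u_nom, x, x_max, alpha=1.0):
    """Project the nominal action onto the CBF-safe set of the toy system.

    Returns the executed action and the size of the correction.
    """
    u_safe = min(u_nom, alpha * (x_max - x))
    return u_safe, abs(u_safe - u_nom)


def rollout(policy, x0=0.0, x_max=1.0, dt=0.1, steps=50):
    """Run the filtered closed loop and record how much the filter intervened."""
    x, total_intervention, corrected_steps = x0, 0.0, 0
    for _ in range(steps):
        u_nom = policy(x)
        u, corr = cbf_filter(u_nom, x, x_max)
        total_intervention += corr
        corrected_steps += corr > 0
        x += dt * u
    return x, total_intervention, corrected_steps


def reckless(x):
    return 1.0  # always pushes toward the constraint


def cautious(x):
    return 0.5 * (0.8 - x)  # settles below the constraint on its own


x_r, interv_r, steps_r = rollout(reckless)
x_c, interv_c, steps_c = rollout(cautious)
```

Both filtered rollouts stay below x_max, so a post-filter violation count cannot tell the two policies apart. Only the intervention totals show that the filter did the work for the reckless policy, which is the kind of attribution the abstract argues for.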

2606.09777 2026-06-09 cs.RO 新提交

AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning

AetheRock: 一种用于力引导视觉触觉学习的臂戴式机器人教学系统

Hong Li, Yue Xu, Yihan Tang, Yankang Dong, Chenyuan Liu, Chenyang Yu, Xuyang Li, Siyuan Huang, Yujun Shen, Nan Xue, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团) Shanghai Innovation Institute(上海创新研究院) Beijing Institute for General Artificial Intelligence (BIGAI)(北京通用人工智能研究院)

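AI summary: Proposes AetheRock, an arm-worn device for collecting gripper-force, vision, and tactile data, and designs the ForceVT framework, which uses force and vision to guide tactile learning, addressing incompatible sensor assembly in force-aware robot learning.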

详情
AI中文摘要

力和触觉感知在接触密集操作中不可或缺。然而,由于手持或可穿戴设备中触觉和力传感器的不兼容装配,力感知机器人学习面临关键挑战。为解决这些限制,我们首先引入AetheRock用于夹爪力、视觉和触觉数据收集,这是一种臂戴式设备,指尖配备模块化且易于制造的视觉触觉传感器GelSlim-MiniFab,人体手指接触区域配备电阻式压力传感器,定制PCB模块,以及用于舒适和稳健收集的可穿戴套件。在此基础上,我们提出ForceVT,一种表示学习框架,利用力和视觉引导保真度无关的触觉学习,实现在任何触觉情况下的鲁棒推理。实际实验表明,AetheRock实现了合格的数据效率,且ForceVT有效缓解了视觉触觉传感器在制造和使用不一致时的低效问题。总体而言,我们的工作通过创新的硬件设计和算法减轻了夹爪力-视觉-触觉机器人学习的局限性。

英文摘要

Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.

2606.09774 2026-06-09 cs.AI cs.CL 新提交

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

SIGA: 用于科学模拟的自演化编码智能体适配器

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin

发表机构 * University of California, San Diego(加利福尼亚大学圣迭戈分校)

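AI summary: Proposes the SIGA adapter, which turns a general coding agent into an operator of scientific simulation software through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination, achieving a roughly 36× speedup on GEOS and supporting self-evolution for further gains.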

详情
AI中文摘要

高级科学模拟器暴露了专门的输入语言,将模拟目标转化为可执行配置,但学习这些语言可能需要领域科学家花费数小时到数天。我们将模拟器设置研究为智能体-工具接口接地问题:需要哪些最小的模拟器特定适配才能使现成的编码智能体操作真实的科学软件?我们的直觉是,编码智能体已经知道如何导航文件、编辑代码、运行命令和修复输出,但它们缺乏模拟器的可执行契约:其词汇、结构约束、验证规则和终止条件。我们介绍了SIGA,一个模拟器接口接地适配器,通过检索、程序记忆、轨迹内验证和验证强制终止来提供此契约。我们主要在GEOS上评估SIGA,GEOS是一个用于地下科学的开源多物理场模拟器。SIGA在大约五分钟内生成完整的GEOS输入文件,TreeSim高于0.90,与花费大约三小时的扩展预算人类专家相当,实现了大约36倍的挂钟加速。在更难的保留集上,接地将TreeSim从0.720提高到0.789,相对于裸智能体提高了大约10%,并且可以将跨种子的标准差降低16倍。自演化通过从先前轨迹重写适配器内容进一步改进SIGA,产生了最高的保留GEOS平均值,并匹配或超过了最强的手工设计配置。迁移到OpenFOAM和LAMMPS表明,主导机制因接口而异:当结构完整性是瓶颈时,验证最重要;而当领域正确性是瓶颈时,记忆和检索最重要。这些结果表明,轻量级、可自我改进的接地层可以将通用编码智能体转变为科学软件的实用操作员。

英文摘要

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
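The two headline ratios in this abstract follow directly from the figures it quotes; a quick check, using only the rounded numbers given above:

```python
# About three hours for the human expert versus about five minutes for SIGA.
speedup = (3 * 60) / 5

# Relative TreeSim gain on the held-out set: 0.720 (bare agent) to 0.789 (grounded).
rel_gain = (0.789 - 0.720) / 0.720

print(f"speedup ~ {speedup:.0f}x, relative gain ~ {rel_gain:.1%}")
# prints: speedup ~ 36x, relative gain ~ 9.6%
```

The 9.6% figure matches the abstract's "roughly 10% relative gain"; the absolute TreeSim improvement is 0.069.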

2606.09772 2026-06-09 cs.CV New submission

SemDINO: A DINOv3-Driven Network for Cross-Temporal Semantic Alignment in Change Detection

SemDINO: A DINOv3-Based Cross-Temporal Semantic Alignment Network for Change Detection

Xinyu Tong, Meihua Zhou, Jinxiao Sun, Yingjie Tang, Lei Wang

Affiliations * Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Computer Science, Xiangtan University; College of Information and Communication Engineering, Harbin Engineering University

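AI summary: Proposes the SemDINO network, which uses a dual-branch encoder, multi-scale temporal interaction, and semantic purification and change enhancement modules to address insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes in semantic change detection.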

Details
AI Chinese abstract

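Semantic change detection (SCD) aims to simultaneously locate land-cover changes and identify the semantic categories before and after the transition. However, existing methods suffer from insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes caused by illumination, season, and registration noise. To address these problems, we propose a novel end-to-end semantic change detection network named SemDINO, which integrates a dual-branch encoder, multi-scale temporal interaction, semantic purification, change enhancement, and decoupled multi-task prediction into a unified framework. Specifically, we construct a dual-branch encoder that combines a CNN backbone with frozen DINOv3 features through gated pyramid fusion, yielding rich multi-scale semantic representations. A multi-scale temporal bidirectional transformer interaction (M-TBTT) module is then proposed to achieve global cross-temporal feature alignment and information interaction. To further enhance genuine changes and suppress pseudo-changes, we jointly introduce semantic purification (SCP), bidirectional change enhancement (BiChangeEnhance), and multi-scale change enhancement (MCE) modules. Finally, a multi-branch CD prediction head is designed to jointly output a binary change mask, bi-temporal semantic maps, and an edge constraint. Extensive experiments on public remote sensing CD datasets show that SemDINO achieves superior performance and generalization compared with state-of-the-art methods in complex scenarios with interference factors.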

English abstract

Semantic change detection (SCD) aims to simultaneously locate land-cover changes and identify the semantic categories before and after the transition. However, existing methods suffer from insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes caused by illumination, season, and registration noise. To address these issues, we propose a novel end-to-end semantic change detection network named SemDINO, which integrates a dual-branch encoder, multi-scale temporal interaction, semantic purification, change enhancement, and decoupled multi-task prediction into a unified framework. Specifically, we construct a dual-branch encoder that combines a CNN backbone and frozen DINOv3 features via gated pyramid fusion, enabling rich multi-scale semantic representation. Then, a multi-scale temporal bidirectional transformer interaction (M-TBTT) module is proposed to achieve global cross-temporal feature alignment and information interaction. To further enhance genuine changes and suppress pseudo-variations, we jointly introduce semantic purification (SCP), bidirectional change enhancement (BiChangeEnhance), and multi-scale change enhancement (MCE) modules. Finally, a multi-branch CD prediction head is designed to jointly output a binary change mask, bi-temporal semantic maps, and an edge constraint. Extensive experiments on public remote sensing CD datasets demonstrate that SemDINO achieves superior performance and generalization ability over state-of-the-art methods, especially in complex scenarios with interference factors.
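The abstract does not spell out how gated pyramid fusion is computed. As a rough illustration only, the sketch below assumes one common form of gating applied at each pyramid level: a sigmoid gate computed from both inputs blends the CNN feature and the frozen DINOv3 feature channel by channel. The function `gated_fuse`, its weights, and the toy feature values are hypothetical and are not taken from SemDINO.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(cnn_feat, dino_feat, w_cnn, w_dino, bias):
    """Blend two same-length feature vectors with a per-channel gate.

    Assumed (hypothetical) form, not the SemDINO formulation:
        g     = sigmoid(w_cnn * c + w_dino * d + bias)
        fused = g * c + (1 - g) * d
    """
    lengths = {len(cnn_feat), len(dino_feat), len(w_cnn), len(w_dino), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same number of channels")
    fused = []
    for c, d, wc, wd, b in zip(cnn_feat, dino_feat, w_cnn, w_dino, bias):
        gate = sigmoid(wc * c + wd * d + b)
        fused.append(gate * c + (1.0 - gate) * d)
    return fused


# Toy two-level "pyramid": each level is a flat list of channel values.
cnn_pyramid = [[0.2, 1.0], [0.5, -0.3, 0.8]]
dino_pyramid = [[0.6, 0.0], [0.1, 0.4, 0.8]]

# Fuse level by level with made-up gate parameters.
fused_pyramid = [
    gated_fuse(c, d, [1.0] * len(c), [-1.0] * len(c), [0.0] * len(c))
    for c, d in zip(cnn_pyramid, dino_pyramid)
]
print(fused_pyramid)
```

Because the gate lies strictly between 0 and 1, every fused channel is a convex combination of the two inputs. In a trained network the gate parameters would be learned rather than fixed as they are here.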

2606.09770 2026-06-09 q-bio.NC cs.LG New submission

Discovering Functionally Selective Brain Regions with a Deep Topographic Multimodal Model

Discovering Functionally Selective Brain Regions: A Deep Topographic Multimodal Model

Badr AlKhamissi, Johannes Mehrer, Lara Marinov, Ahmed Abdelaal, Abdulkadir Gokce, Martin Schrimpf

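AI summary: Proposes Topo-Omni, a model that fine-tunes a pretrained foundation model with a spatial smoothness objective to integrate visual, auditory, and language/cognitive processing on a single contiguous in-silico cortical sheet, producing multimodal clusters consistent with human neuroimaging and enabling the discovery of new brain regions.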

Details
Comments
Preprint. First two authors contributed equally.
AI Chinese abstract

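Nearby neurons in cortex have similar response profiles, producing systematic spatial organization across sensory and cognitive systems. Recent topographic models reproduce some aspects of this structure but remain unimodal and impose spatial constraints on each layer separately, yielding fragmented maps that capture neither the contiguity of cortical processing streams nor integration across modalities. We introduce Topo-Omni, a topographic multimodal model in which visual, auditory, and language/cognitive processing share a single contiguous in-silico cortical sheet. By fine-tuning a pretrained foundation model with a spatial smoothness objective, the architecture develops clusters across modalities that are consistent with human neuroimaging, from sensory to cognitive systems. Driving or suppressing a cluster selectively biases or impairs perception, paralleling human intervention studies. Finally, we use our model to screen for novel clusters on the in-silico sheet, discover new natural-landscape and animal networks, and validate them in human data. A single spatial principle thus organizes representations across modalities and processing stages, yielding testable hypotheses about cortical organization.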

English abstract

Nearby neurons in cortex share similar response profiles, producing systematic spatial organization across sensory and cognitive systems. Recent topographic models reproduce aspects of this structure but remain unimodal and spatially constrain each layer separately, yielding fragmented maps that capture neither the contiguity of cortical processing streams nor their integration across modalities. We introduce Topo-Omni, a topographic multimodal model in which visual, auditory, and language/cognitive processing share a single contiguous in-silico sheet. Built by fine-tuning a pretrained foundation model with a spatial smoothness objective, this architecture develops clusters across modalities that are consistent with human neuroimaging, from sensory to cognitive systems. Driving or suppressing a cluster selectively biases or impairs perception, paralleling human intervention studies. Finally, we use our model to screen for novel clusters in silico and discover new natural-landscape and animal networks, which we validate in human data. A single spatial principle thus organizes representations across modalities and processing stages, yielding testable hypotheses about cortical organization.
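The abstract names a spatial smoothness objective without giving its form. One simple way such a penalty can be written, purely as an assumption for illustration, is the mean squared difference between the responses of units that sit next to each other on a 2D sheet. The function `smoothness_penalty` and the toy sheets below are hypothetical and are not the Topo-Omni loss.

```python
def smoothness_penalty(sheet):
    """Mean squared difference between horizontally and vertically adjacent units.

    `sheet` is a rectangular list of rows, with one response value per unit.
    Lower values mean neighbouring units respond more similarly.
    """
    rows, cols = len(sheet), len(sheet[0])
    total, pairs = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # right-hand neighbour
                total += (sheet[r][c] - sheet[r][c + 1]) ** 2
                pairs += 1
            if r + 1 < rows:  # neighbour below
                total += (sheet[r][c] - sheet[r + 1][c]) ** 2
                pairs += 1
    return total / pairs if pairs else 0.0


# Two toy 4x4 sheets with the same number of responsive units (value 1.0).
clustered = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]               # one contiguous patch
scattered = [[float((r + c) % 2) for c in range(4)] for r in range(4)]  # checkerboard

print(smoothness_penalty(clustered))
print(smoothness_penalty(scattered))
```

The clustered sheet scores 4/24 (about 0.167) and the checkerboard scores 1.0, so minimizing a penalty of this kind alongside a task loss would favor contiguous clusters of similarly tuned units.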