arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17415 2026-05-25 cs.LG cs.AI cs.DB cs.IR

IVF-TQ: Calibration-Free Streaming Vector Search via a Codebook-Free Residual Layer

Tarun Sharma

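AI summary: This paper proposes IVF-TQ, a streaming vector search index that achieves calibration-free approximate nearest neighbor search through a codebook-free residual compression layer. The core idea is to avoid a codebook by using a fixed random rotation and a precomputed Lloyd-Max scalar quantizer configured only by bit width and dimension, so that the index stays stable on streaming data without training. Experiments show that IVF-TQ maintains good performance across multiple datasets and memory regimes without retraining or per-dataset bit-budget tuning, markedly improving search efficiency and robustness in streaming settings.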

English abstract

Approximate nearest neighbor (ANN) indexes deployed against streaming corpora silently lose recall over weeks. The standard diagnosis is distribution shift, but under shuffled-i.i.d. ingestion -- no shift at all -- product quantization still degrades -3.8pp at sub-matched bit budgets. The dominant production compression methods (PQ, OPQ, ScaNN) all fit a codebook to an initial sample and reuse it as the database grows by orders of magnitude. This paper presents IVF-TQ, an inverted-file index whose residual compression layer is data-independent: a fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer parameterised only by the bit width b and dimension d. Only the IVF coarse k-means partition is trained. A uniform-over-sphere inner-product error bound depending only on (b, d, delta) provides a structural guarantee no learned-codebook method admits. The same codebook-free design enables an IVF-amplification effect that closes the gap to Extended RaBitQ to within statistical noise (+17.7pp over flat TQ at matched bit budget), and an Adaptive variant that refreshes the partition without touching the compression layer. Across nine controlled cells (three 10M datasets, three PQ memory regimes, three seeds), per-batch PQ codebook retraining never recovers the streaming gap; IVF-PQ streaming stability requires per-dataset bit-budget tuning, while IVF-TQ holds at one fixed (b, d) configuration on all three datasets with Delta in [-0.80, +0.56]pp. The contribution is operational: no codebook to train, no per-dataset bit-budget tuning, no retraining cycle that ever closes the gap.
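The data-independent compression layer described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: all function names are hypothetical, the rotation is built by Gram-Schmidt on seeded Gaussian vectors, and the quantizer uses the textbook Lloyd-Max levels for a unit-variance Gaussian at b = 2 bits, on the assumption that the coordinates of a rotated, norm-normalized residual are close to Gaussian after scaling by sqrt(d). Storing the residual norm next to the codes is also an assumption.

```python
import math
import random

# Textbook Lloyd-Max levels and thresholds for a unit-variance Gaussian, b = 2 bits.
# Assumption: the paper's precomputed quantizer may use different values.
LEVELS_2BIT = [-1.510, -0.4528, 0.4528, 1.510]
THRESHOLDS_2BIT = [-0.9816, 0.0, 0.9816]


def random_rotation(d, seed=0):
    """Return a d x d orthonormal matrix (list of rows) fixed by the seed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Modified Gram-Schmidt: remove components along earlier rows.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def rotate(R, x):
    """Apply the rotation: y = R x."""
    return [sum(r * a for r, a in zip(row, x)) for row in R]


def encode(R, residual):
    """Quantize a residual to 2-bit codes without looking at any other data."""
    d = len(residual)
    norm = math.sqrt(sum(a * a for a in residual))
    scale = math.sqrt(d) / norm if norm > 0.0 else 0.0
    y = rotate(R, residual)
    # A code is the number of thresholds the scaled coordinate exceeds (0..3).
    codes = [sum(1 for t in THRESHOLDS_2BIT if y_i * scale > t) for y_i in y]
    return norm, codes


def decode(R, norm, codes):
    """Map codes back to levels, undo the scaling, apply the transpose of R."""
    d = len(codes)
    y_hat = [LEVELS_2BIT[c] * norm / math.sqrt(d) for c in codes]
    return [sum(R[i][j] * y_hat[i] for i in range(d)) for j in range(d)]
```

Nothing in `encode` depends on previously ingested vectors, which is the property the abstract relies on for streaming stability; only the coarse IVF partition that produces the residual would be trained.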

2605.12260 2026-05-25 cs.CL

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

Jingyi Peng, Zhongwei Wan, Weiting Liu, Qiuzhuang Sun

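AI summary: PRISM is an efficient retrieval framework for long-horizon agents that addresses memory management in storing and retrieving conversation history. It treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory and combines four independent inference-time components for efficient retrieval and compression. Experiments show that PRISM keeps accuracy high while substantially reducing the context budget, giving a superior balance between answer quality and retrieval efficiency.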

Comments
Preprint
English abstract

Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
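The idea of retrieval as min-cost selection with intent-dependent edge costs can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not specify the graph schema, the cost function, or the search procedure, so this example uses plain Dijkstra search over a toy typed graph, with a hypothetical `costs_by_type` table standing in for Query-Sensitive Edge Costing.

```python
import heapq


def min_cost_path(edges, costs_by_type, source, target):
    """Dijkstra over a typed graph; an edge's cost depends only on its relation type.

    edges: list of (head, relation_type, tail) triples (hypothetical schema).
    costs_by_type: relation_type -> non-negative cost for the detected intent.
    """
    graph = {}
    for head, rel, tail in edges:
        graph.setdefault(head, []).append((rel, tail))

    best = {source: 0.0}
    heap = [(0.0, source, [source])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for rel, nxt in graph.get(node, []):
            new_cost = cost + costs_by_type[rel]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return float("inf"), []


# Toy memory graph: two routes from the query node to the answer node.
EDGES = [
    ("query", "temporal", "event_a"),
    ("event_a", "temporal", "answer"),
    ("query", "causal", "event_b"),
    ("event_b", "causal", "answer"),
]
```

With `{"temporal": 1.0, "causal": 5.0}` the search follows the temporal route, and swapping the two costs makes it follow the causal route, so the same graph yields different evidence for different detected intents.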

2605.23903 2026-05-25 cs.CV

Geo-Align: Video Generation Alignment via Metric Geometry Reward

Zizun Li, Haoyu Guo, Runzhe Teng, Chunhua Shen, Tong He

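AI summary: Geo-Align is a reinforcement-learning framework for video re-rendering that targets the weak control over physical scale and camera trajectory that existing methods show on real-world videos. It introduces a metric 3D estimator to extract precise camera trajectories from generated videos and combines it with a perceptual reward mechanism to optimize the model, improving control over rotation and translation deviations. Experiments show that Geo-Align outperforms existing supervised learning methods in both camera controllability and visual fidelity, demonstrating its effectiveness without paired data.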

English abstract

Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first Reinforcement Learning framework specifically designed for camera-controlled video re-rendering. Built upon a pretrained model, we optimize the model through a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from generated videos, explicitly penalizing deviations in rotation and translation. Furthermore, we meticulously designed a data pipeline strategy based on real-world conditioning videos and target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both precise camera controllability and visual fidelity, indicating the effectiveness of our method.

2605.23902 2026-05-25 cs.CV

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren

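AI summary: This paper proposes PiD, a pixel diffusion decoder that addresses the inefficiency and limited detail of conventional latent decoding in high-resolution text-to-image generation. PiD reformulates latent decoding as a conditional pixel diffusion process that unifies decoding and upsampling, denoising directly in high-resolution pixel space to produce 4x and even 8x upscaled images at low latency. With a lightweight sigma-aware adapter and model distillation, PiD markedly improves generation speed and efficiency while preserving visual quality, and it applies to several latent representations, including conventional VAE latents and recent semantic latents.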

Comments
Project Page: https://research.nvidia.com/labs/sil/projects/pid/
English abstract

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes $4\times$ and even $8\times$ upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of $512 \times 512$ images into $2048 \times 2048$ pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about $6\times$ faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.

2605.23901 2026-05-25 cs.LG cs.AI cs.IT math.IT

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma

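AI summary: Taking a Shannon information-theoretic view, this paper models LLM training as information transmission over a noisy channel and proposes the Shannon Scaling Law to explain non-monotonic phenomena that conventional monotonic scaling laws cannot describe, such as catastrophic overtraining and quantization-induced degradation. By mapping model parameters to channel bandwidth and training tokens to signal power, the theory shows that scaling model size or data without preserving a sufficient signal-to-noise ratio amplifies noise and produces U-shaped performance degradation. Experiments show that it outperforms conventional scaling laws across multiple tasks and perturbation settings, with good fit and extrapolation.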

Comments
Accepted by ICML 2026
English abstract

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
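For readers unfamiliar with the theorem the abstract builds on, here is a small sketch of the classical Shannon-Hartley capacity, C = B log2(1 + S / (N0 B)). It is only the textbook channel formula: the abstract does not give the functional form of the Shannon Scaling Law itself, and how parameters and tokens enter it is not reproduced here. The function names are hypothetical.

```python
import math


def shannon_capacity(bandwidth, signal_power, noise_density):
    """Shannon-Hartley capacity in bits per second for white noise of density N0."""
    noise_power = noise_density * bandwidth  # noise grows with bandwidth
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def wideband_limit(signal_power, noise_density):
    """Capacity ceiling as bandwidth grows without bound at fixed signal power."""
    return signal_power / (noise_density * math.log(2.0))
```

At fixed signal power, widening the bandwidth raises capacity with diminishing returns and never passes `wideband_limit`, because the admitted noise grows with the bandwidth. This is the classical intuition behind the abstract's point that scaling without preserving SNR is not free; the U-shaped degradation itself comes from the paper's formulation, not from this formula.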

2605.23899 2026-05-25 cs.AI

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Zisu Huang, Jingwen Xu, Yifan Yang, Ziyang Gong, Qihao Yang, Muzhao Tian, Xiaohua Wang, Changze Lv, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Xue Yang, Dongdong Chen, Xiaoqing Zheng, Chong Luo

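AI summary: This paper systematically studies the full lifecycle of model-generated agent skills (experience generation, skill extraction, and skill consumption) to assess how well these skills actually work and what influences them. It builds a utility-grounded evaluation framework and runs extensive experiments across five task domains, finding that model-generated skills are beneficial overall but show non-trivial negative transfer, and that extractors and consumers do not behave uniformly. Through an in-depth analysis of each stage, the paper identifies what determines skill quality and proposes a meta-skill that guides skill extraction toward higher practical utility.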

English abstract

Language agents increasingly improve by reusing \emph{skills} -- structured procedural artifacts distilled from past experience. In particular, \emph{domain-level} and \emph{model-generated} skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited, with no comprehensive study spanning the full skill lifecycle -- \textbf{experience generation}, \textbf{skill extraction}, and \textbf{skill consumption} -- to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete \emph{meta-skill} that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer.

2605.23898 2026-05-25 cs.AI

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, Han Liu

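AI summary: This paper studies the spatial numerical understanding of vision-language models (VLMs), asking whether their numerical outputs are genuinely grounded in spatial perception. The authors propose the SpaceNum framework, which uses two bidirectional tasks, Num2Space and Space2Num, to evaluate how well models map between numbers and spatial structure in dynamic spatial exploration and static spatial reasoning. They find that current VLMs perform poorly on spatial numerical understanding: they rely mainly on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual input.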

Comments
Project page: https://sterzhang.github.io/SpaceNum-Home
English abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether these numerical outputs are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.

2605.23897 2026-05-25 cs.CV cs.AI cs.CL

ETCHR: Editing To Clarify and Harness Reasoning

Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin

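AI summary: Multimodal large language models have advanced visual reasoning, but purely textual reasoning chains remain a bottleneck for questions that require fine-grained focus or view transformations. To address this, the paper proposes ETCHR, a question-conditioned image editor decoupled from the understanding model, trained with a two-stage recipe that targets the language-side and generation-side gaps in turn to improve edit correctness and reasoning accuracy. Experiments show that ETCHR substantially improves reasoning performance across multiple tasks and plugs into different open- and closed-source models.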

Comments
Code, model and data are open-sourced at https://github.com/InternLM/ETCHR
English abstract

Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decoupling it from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.

2605.23895 2026-05-25 cs.CV

From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain

Yuval Golbari, Navve Wasserman, Matias Cosarinsky, Roman Beliy, Aude Oliva, Antonio Torralba, Michal Irani, Tamar Rott Shaham

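AI summary: This paper studies how to identify the brain regions that represent specific visual concepts in the human brain. It proposes BrainCause, an automated framework that combines generative models and brain models to synthesize controlled stimuli and run causal validation, separating regions that truly represent a concept from regions driven only by correlated visual or semantic cues. The method recovers known functional localizations and finds new candidate representations; the validation shows that relying on activation strength alone can produce many false positives, underscoring the importance of causal validation.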

English abstract

Identifying which brain regions represent a visual concept in the human brain is a central challenge in neuroscience. Existing approaches have localized coarse functional regions (e.g., faces, places) through activation maximization, identifying regions that activate strongly for a target concept relative to other concepts. Yet strong activation alone does not establish that a region represents the concept itself, as responses may instead be driven by correlated visual or semantic cues. We introduce BrainCause, an automated framework that combines generative and brain models to synthesize controlled stimuli and validate neural representations through targeted causal testing. Given a query specifying a concept of interest, our framework constructs targeted stimulus sets comprising concept images, counterfactual edits that remove the target concept while preserving other image content, and images with candidate correlated distractors. It then uses an image-to-fMRI encoding model to predict brain responses and searches for representations that respond specifically to the target concept over correlated alternatives. BrainCause returns validated candidate representations and proposes follow-up fMRI experiments to further test or extend its discoveries. Our approach successfully recovers known functional localizations and identifies new candidate representations across dozens of concepts, validated on both predicted and measured fMRI data. Critically, we show that without causal validation, a large fraction of localizations would be false positives, confirming that activation alone is insufficient evidence of representation.

2605.23893 2026-05-25 cs.LG

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang

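AI summary: This paper proposes Complete-muE, a framework for hyperparameter transfer and scaling between dense FFN and Mixture-of-Experts (MoE) architectures. A two-bridge system handles the changes in architecture and expert count that existing tools cannot, covering changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models, as well as width, depth, batch size, and training duration for general Transformer models. Experiments show that Complete-muE yields relatively stable hyperparameter optima across model architectures and parameter counts, so a single tuning run on a dense model transfers near-optimally to all MoE configurations, which markedly accelerates MoE convergence.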

Comments
27 pages
English abstract

We propose Complete-muE, a framework for hyperparameter transfer across dense FFN and any Mixture-of-Experts (MoE) setups in transformer blocks. Existing tools such as $μ$P (requires fixed architecture) or SDE (requires fixed per-step token count) cannot directly solve the hyperparameter transfer problem in MoE setups, because dense-to-MoE transfer or scaling the total number of experts changes both the architecture and the tokens per expert. Complete-muE solves this challenge with a two-bridge system: Bridge I maps between dense FFN and Dense MoE by active-width $μ$P with a normalized router scale. Bridge II maps between Dense MoE and sparse MoE by activated-expert scaling, where the first-order SDE LR/WD correction cancels while a bounded residual $σ_0$ shift remains. The resulting transfer rule, which we term Complete-muE, covers changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models as well as network width/depth, batch size, and duration changes for general Transformer models. Extensive language model and diffusion model pretraining experiments confirm that Complete-muE yields relatively stable hyperparameter optima across model architectures and parameter counts -- with only minor drift consistent with the non-strict SDE behavior of Bridge II. In practice this drift is small enough that hyperparameters tuned on a single dense reference transfer near-optimally to all MoE configurations -- \emph{tune dense once, transfer to all} is the practical recipe at the core of Complete-muE. This lets MoE models converge faster than dense models when scaling model capacity, without costly hyperparameter search.

2605.23892 2026-05-25 cs.CV cs.AI cs.GR cs.LG cs.RO

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

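AI summary: Visual geometry transformers perform well at multi-view 3D reconstruction, but their computational cost grows quadratically with input sequence length, which limits efficiency and scalability. This paper proposes a simple and general remedy: restrict the number of key/value tokens each query interacts with in global attention. The method uses a two-stage framework that first selects which frames to keep, favoring diverse coverage of the scene, and then drops redundant tokens within the kept frames using a layer-aware sparsification strategy guided by attention entropy. Experiments show that the method accelerates visual geometry transformers by more than 85% while maintaining or improving performance.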

Comments
Project Page: https://zsh2000.github.io/good-token-hunting.github.io, Code: https://github.com/zsh2000/gotohunt
English abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
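The two signals named in the abstract, diversity across frames and the entropy of an attention pattern, can be illustrated with a short sketch. This is a stand-in under stated assumptions rather than the authors' procedure: greedy farthest-point sampling over hypothetical per-frame feature vectors is used here as one common diversity-based selection rule, and the entropy function only computes the quantity; how the paper turns entropy into per-layer sparsity is not given in the abstract and is not modeled.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one normalized attention row."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def select_diverse_frames(features, k):
    """Greedy farthest-point sampling over per-frame feature vectors.

    Starts from frame 0 and repeatedly adds the frame farthest from the
    frames already selected. Returns frame indices in selection order.
    """
    selected = [0]
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < min(k, len(features)):
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        # Track each frame's distance to its nearest selected frame.
        min_dist = [min(m, math.dist(f, features[nxt]))
                    for m, f in zip(min_dist, features)]
    return selected
```

A uniform attention row has maximal entropy and a one-hot row has zero entropy, so the value separates layers that spread attention widely from layers that concentrate it on a few tokens.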

2605.23891 2026-05-25 cs.CV

Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework

Xiao Cao, Yansong Qu, Xiangzhen, Chang, Wen Xiao, Jiakui Hu, Heyuan Li, Jialun Liu, Zhiyong Huang, Xuelong Li

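AI summary: This paper proposes Smart-Insertion-V, an end-to-end dual-stream framework for high-quality mask-free video object insertion. An image stream synchronously guides video generation, and a closed-loop feedback mechanism makes insertion more robust; Dual-World-View RoPE and a Decoupled Guidance Module address feature entanglement and style leakage and improve semantic alignment and style adaptation. Experiments show that the method achieves state-of-the-art plausibility of insertion position and visual harmony.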

English abstract

Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing methods struggle when references exhibit severe stylistic domain gaps with the source scene. To overcome this, we propose \textit{\textbf{Smart-Insertion-V}}, an end-to-end \textbf{Dual-Stream} framework that concurrently conducts video insertion and image style transfer. Within this framework, the image stream synchronously guides the video generation process, while a \textbf{Closed-loop Feedback} mechanism is further incorporated to ensure robust insertion. Inevitably, integrating these diverse conditioning signals results in feature entanglement and style leakage. To tackle this issue, we design \textbf{Dual-World-View RoPE} to distinguish different signals via spatial-temporal offsets without incurring heavy training overhead. Furthermore, to facilitate spatial grounding and stylistic adaptation, we introduce a \textbf{Decoupled Guidance Module} that leverages a Vision-Language Model for semantic reasoning while preserving the original temporal guidance with the native text encoder. To bridge the data gap for the harmonious reference insertion task, we propose a data curation pipeline and will release an \textbf{open-source dataset}. Experiments demonstrate that our method can insert objects into plausible positions while achieving the most harmonious results.

2605.23889 2026-05-25 cs.CV

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Chong Cheng, Peilin Tao, Nanjie Yao, Guanzhi Ding, Xianda Chen, Yuansen Du, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Zhengqing Chen, Hao Wang

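AI summary: HorizonStream is a long-horizon attention model for streaming 3D reconstruction that targets the drift, jitter, and collapse that temporal heterogeneity causes in online reconstruction. It introduces the notion of an evidence influence kernel and factorizes geometric propagation into a long-range factor and a short-range factor, handled by Geometric Linear Attention and Geometric Local Attention respectively, giving multi-timescale propagation of geometric evidence and stable spatial matching. Experiments show that HorizonStream, trained on only 48-frame clips, handles sequences of more than 10,000 frames stably and delivers strong streaming 3D reconstruction performance.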

English abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an \emph{evidence influence kernel} and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
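The constant-memory, linear-time behavior attributed to linear attention with channel-wise decay can be seen in a generic recurrence. This is a minimal sketch of decayed linear attention in general, not the paper's Geometric Linear Attention: the decay rates here are fixed inputs rather than learned, there is no normalization or gating, and the function name is hypothetical.

```python
def decayed_linear_attention(queries, keys, values, decay):
    """Causal linear attention with one decay rate per key channel.

    State update: S[i][j] = decay[i] * S[i][j] + k[i] * v[j]
    Output:       out[j]  = sum_i q[i] * S[i][j]
    The state has a fixed size (d_k x d_v) no matter how long the stream is.
    """
    d_k = len(decay)
    d_v = len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs
```

A channel with decay near 1 retains old evidence for a long time, while a channel with decay near 0 keeps only recent frames, so a mix of rates gives several timescales in one fixed-size state.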

2605.23888 2026-05-25 cs.CV

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

GenRecon: 桥接生成先验的多视图3D场景重建

Katharina Schmid, Nicolas von Lützow, Jozef Hladký, Angela Dai, Matthias Nießner

AI总结 本文提出了一种基于生成先验的高质量多视角三维场景重建方法GenRecon,通过将场景分割为局部重叠的区域,并在每个区域上进行条件生成,实现了大范围场景的高精度重建。研究利用先进的生成形状模型Trellis.2作为先验,并提出了一种基于投影的条件机制,将多视角图像特征提升为与生成模型对齐的三维表示,从而生成几何一致、视图一致的重建结果。该方法在室内环境重建中表现出色,相比现有方法在重建质量上提升了16%。

详情
Comments
Project page: https://kasothaphie.github.io/GenRecon/
AI中文摘要

我们提出了一种新的方法,从多视图RGB图像进行高保真3D场景重建,该方法将重建与强大的生成式3D先验紧密结合。我们将场景重建视为在空间局部、重叠的块上的条件3D生成,这些块共同覆盖场景,将生成扩展到大的场景范围。关键的是,我们继承了最先进的生成形状模型(以Trellis.2为例)的保真度和完整性,并将其推广到场景级别。为此,我们提出了一种基于投影的条件机制,该机制将带姿态的多视图图像特征提升为与生成模型对齐的连贯3D表示,独立于视图顺序并空间锚定到场景,从而产生高保真、多视图一致的生成几何。这使得将Trellis.2的强对象级先验提升到多视图、场景规模的生成,产生室内环境的忠实、可编辑的PBR网格重建。因此,我们获得了高保真结果,比最先进的重建方法性能提升16%。

英文摘要

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.

2605.23887 2026-05-25 cs.DB cs.AI cs.CR cs.LG cs.MA

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

CHRONOS:面向演化数据市场的时态感知多智能体协调

Joydeep Chandra

AI总结 CHRONOS 是一种面向动态数据市场的多智能体协调框架,旨在解决静态设计中因数据演化带来的检索效率下降、价值分配不准确和隐私预算过度消耗等问题。该方法采用三层架构,分别通过时间感知的神经微分方程、基于突变点检测的夏普利价值评估和满足差分隐私的强化学习算法,实现高效且隐私保护的市场协调。实验表明,CHRONOS 在多个基准上表现出优越的检索性能和隐私效率,具有较高的实用价值。

详情
AI中文摘要

时态知识图谱数据市场在静态设计中面临三个耦合的失败:随着边演化,过时的混合索引捷径降低召回率;分布漂移后,固定的Shapley定价错误归因价值;不协调的智能体过度消耗共享的差分隐私预算。我们提出CHRONOS,一个三层架构,通过显式的公共和私有分离统一处理这些挑战。第一层应用神经ODE时间衰减到捷径边,提供每个查询的期望召回损失界为Big-O of Pq lambda delta t,单调包络保证将边界宽松度降低到观测损失的1.8到3.2倍。第二层将Shapley估值条件化在检测到的变点上,并在噪声下提供有限样本误差保证。第三层使用EXP3-IX实现Big-O of sqrt(T log T)遗憾,同时通过矩会计强制执行epsilon和delta差分隐私。CHRONOS每轮使用高斯机制发布私有化亲和矩阵;所有检索和排序都是后处理,不产生额外隐私成本。我们提供多轮结算、500个卖家的可扩展性分析,以及与加速基线的比较。在四个基准上,CHRONOS在10个结果时召回率为0.937,每秒2.74个查询,延迟161毫秒,在zCDP组合下总epsilon为4.25,delta为10^{-6}。这些结果表明一个竞争性的操作点。一个局限性是,在此隐私水平下,发布的估值仍受噪声主导;效用主要来自公共索引路由和由低敏感度统计驱动的自适应调度。

英文摘要

Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. We present CHRONOS, a three-layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a per-query expected recall-loss bound of Big-O of Pq lambda delta t, with a monotone-envelope guarantee reducing bound looseness to 1.8 to 3.2 times observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve Big-O of the square root of T log T regret while enforcing epsilon and delta differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post-processing, incurring no extra privacy cost. We provide multi-epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows 0.937 recall at ten, 2.74 queries per second, 161 ms latency, and total epsilon of 4.25 at delta of 10 to the power of negative 6 under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise-dominated; utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.
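To make the privacy accounting mentioned above concrete, the sketch below composes repeated Gaussian-mechanism releases under zCDP and converts the total to an (epsilon, delta) guarantee using the standard bound of Bun and Steinke. The helper names, the sensitivity, the noise scale, and the epoch counts are hypothetical; the abstract does not state the parameters that produce its reported epsilon of 4.25, and this sketch does not attempt to reproduce that figure.

```python
import math


def gaussian_rho(sensitivity, sigma):
    """zCDP parameter of one Gaussian-mechanism release: rho = sensitivity^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def zcdp_to_epsilon(rho, delta):
    """Standard conversion: rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def total_epsilon(sensitivity, sigma, epochs, delta):
    """Compose `epochs` identical releases (rho adds up linearly), then convert once."""
    rho_total = epochs * gaussian_rho(sensitivity, sigma)
    return zcdp_to_epsilon(rho_total, delta)


# Hypothetical settings: sensitivity 1, noise scale 10, delta = 1e-6.
eps_10_epochs = total_epsilon(1.0, 10.0, epochs=10, delta=1e-6)
eps_20_epochs = total_epsilon(1.0, 10.0, epochs=20, delta=1e-6)
```

Because retrieval and ranking are described as post-processing of the released matrix, only the per-epoch releases would enter such a budget.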

2605.23885 2026-05-25 cs.CL

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

数据约束下的多语言知识迁移:通过词汇干预

Anastasiia Sedova, Natalie Schluter, Skyler Seto, Maartje ter Hoeve

AI总结 本文研究了在数据有限的情况下如何通过词级干预实现多语言知识迁移,提出了一种名为LINK的方法,通过在高资源语言(如英语)的预训练数据中使用双语词表进行词替换,从而提升模型对低资源语言的知识迁移能力。该方法无需额外训练或大量平行语料,仅需双语词汇表即可实现,实验表明在多种语言和不同规模模型上均能显著提升下游任务性能。

详情
AI中文摘要

跨语言知识迁移对于为训练数据不足的语言构建高性能多语言语言模型至关重要。当目标语言数据稀缺时,许多涉及科学推理、常识推理和世界知识的下游任务所需的知识必须主要从高资源语言获取,因此有效的知识迁移至关重要。现有的改进此类跨语言知识迁移的方法需要大量平行数据、翻译系统、辅助模型或额外的训练阶段,而这些对于许多语言来说基本不可用。我们提出LINK——一种数据级干预方法,通过使用双语词汇对预训练数据中的高资源部分进行词汇替换,在模型预训练期间改进知识迁移。对于给定的替换比例,在高资源(英语)训练语料库的一部分中,随机选择的单词被替换为其词级翻译,无需额外的模型训练,仅需一个双语词汇,而该词汇几乎可以零成本获得。对五种模型规模下的八种语言的评估显示,目标语言的下游任务有显著改进,达到同等性能的训练速度最高可提升2倍。

英文摘要

Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
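The substitution step described above can be sketched in a few lines. This is an illustrative approximation rather than the authors' pipeline: the function name `lexical_substitute`, whitespace tokenization, lowercase lookup, and the choice to apply the ratio only to words that have a lexicon entry are all assumptions.

```python
import random


def lexical_substitute(text, lexicon, ratio, seed=0):
    """Swap a fraction of translatable words for their word-level translations.

    `lexicon` maps lowercase source words to target-language words (hypothetical
    format). `ratio` is applied to the words that have a lexicon entry, which is
    an assumption; the abstract does not say how the ratio is normalized.
    """
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in lexicon]
    n_swap = round(ratio * len(candidates))
    for i in rng.sample(candidates, n_swap):
        words[i] = lexicon[words[i].lower()]
    return " ".join(words)


toy_lexicon = {"cat": "gato", "water": "agua"}  # toy English-to-Spanish entries
sentence = "the cat drinks water"
unchanged = lexical_substitute(sentence, toy_lexicon, ratio=0.0)
half = lexical_substitute(sentence, toy_lexicon, ratio=0.5)
full = lexical_substitute(sentence, toy_lexicon, ratio=1.0)
```

A fixed seed keeps the corpus transformation reproducible across runs, which matters when comparing replacement ratios.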

2605.23883 2026-05-25 cs.CV cs.AI

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

PGT: 用于提升多模态大语言模型视觉定位的程序化生成任务

Rim Assouel, Amir Bar, Michal Drozdzal, Adriana Romero-Soriano

AI总结 尽管多模态大语言模型(MLLMs)已取得显著进展,但在细粒度理解任务上仍存在不足。本文提出了一种名为PGT的过程生成任务框架,通过在图像上叠加明确的几何原语生成密集的监督信号,从而提升模型的视觉 grounding 能力,并作为低成本的诊断工具识别感知失败的原因。实验表明,PGT 在多种基准测试中显著提升了模型性能,表明细粒度感知瓶颈可通过增强监督信号有效解决。

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)取得了显著进展,但这些模型在细粒度理解任务上仍然存在困难。在这项工作中,我们提出了程序化生成任务(PGT),一个简单的数据驱动框架,具有双重目的:诱导细粒度视觉理解,并作为低成本的诊断工具来识别感知失败的来源。通过在图像上叠加明确的几何基元,PGT生成额外的密集监督,将视觉定位能力与语义先验解耦。在关系、定量和3D/深度理解基准上的大量实验表明,PGT在各种架构上均取得了显著提升。在使用PGT数据增强的LLaVA-v1.5-Instruct上进行指令微调,在What'sUp基准上提升高达+20%,在CV-Bench-2D上提升+13.3%,同时保持通用感知能力。此外,在PGT数据上微调最先进的MLLMs,在What'sUp上提升高达+5.5%,在CV-Bench-2D上提升+8.3%。这些发现表明,PGT有效解决了细粒度感知的瓶颈,揭示了许多空间推理缺陷源于监督信号不足,而非固有的架构或分辨率限制。

英文摘要

Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.

2605.23879 2026-05-25 stat.ML cs.CR cs.LG math.ST stat.TH

On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy

球形Hellinger-Kantorovich流的稳定性及其对差分隐私的影响

Aratrika Mustafi, Soumya Mukherjee

AI总结 本文研究了球形Hellinger-Kantorovich梯度流的稳定性问题,并探讨其在差分隐私中的应用。作者建立了该梯度流的扰动理论,分析了不同势函数下流的动力学差异,并给出了与时间相关的log-似然比和Rényi散度的统一上界,进一步推导了KL散度的界。这些结果被用于差分隐私中的指数机制采样,提供了基于SHK梯度流的纯差分隐私和近似差分隐私保证,并分离了机制本身的次优性与有限时间采样误差的影响。

详情
AI中文摘要

梯度流采样将吉布斯分布解释为概率测度上能量泛函的最小值,并生成收敛到该目标的动力学。在球形Hellinger-Kantorovich (SHK)几何下,流耦合输运和反应,并与生灭Langevin动力学一致。本文发展了SHK梯度流的摄动理论。对于两个势函数$V$和$V^{\prime}$,我们从共同初始值出发比较相关的流,并量化势差异随时间传播的程度。一个统一的扰动界给出了对数似然比和Rényi散度的无维、逐点控制,而额外的结构使我们能够推导出KL散度的界。我们将这些结果应用于差分隐私中指数机制的近似采样。似然比控制为基于SHK的采样器提供了显式的时间依赖纯DP保证,而KL界通过hockey-stick散度给出了近似DP证书。我们还推导了一个效用界,将指数机制的内在次优性与有限时间采样误差分离。

英文摘要

Gradient-flow sampling interprets a Gibbs distribution as the minimizer of an energy functional over probability measures and generates dynamics converging to this target. Under spherical Hellinger-Kantorovich (SHK) geometry, the flow couples transport and reaction and coincides with birth-death Langevin dynamics. In this work, we develop a perturbation theory for SHK gradient flows. For two potentials $V$ and $V^{\prime}$, we compare the associated flows from a common initialization and quantify how potential discrepancies propagate over time. A uniform perturbation bound yields dimension-free, pointwise control of the log-likelihood ratio and Rényi divergence, while additional structure allows us to derive bounds for the KL divergence as well. We apply these results to approximate sampling for the exponential mechanism in differential privacy. The likelihood-ratio control provides explicit time-dependent Pure-DP guarantees for SHK-based samplers, while the KL bound yields Approximate-DP certificates via hockey-stick divergence. We also derive a utility bound separating intrinsic exponential-mechanism suboptimality from finite-time sampling error.

2605.23878 2026-05-25 cs.CV

LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

LaMo: 视频生成中物理真实性的自监督潜在运动先验

Bo Jiang, Depu Meng, Yihan Hu, Yichen Xie, Tianshuo Xu, Wei Zhan

AI总结 现代视频生成模型虽然能生成视觉吸引人的视频,但在物理和运动一致性方面仍存在不足,限制了其作为可靠世界模拟器的应用。本文提出LaMo,一种基于自监督学习的潜在运动先验方法,通过从未标注的训练视频中提取运动线索,无需外部模拟器或物理数据即可提升视频生成的物理真实性。LaMo引入了两个轻量级模块,在训练和采样阶段分别用于约束运动漂移和引导运动先验,能够方便地集成到现有视频扩散模型中,并在多个基准测试中展现出优越的物理一致性提升效果。

详情
Comments
Project Page: https://lamo-ai.github.io/
AI中文摘要

现代视频生成器能产生视觉上吸引人的片段,但在物理和运动一致性方面仍有困难,限制了其作为可靠世界模拟器的使用。现有的补救措施通常依赖外部模拟器、教师模型或精心策划的物理聚焦数据。我们探索了一种互补的自监督方向:从用于训练视频扩散模型的无标签视频中提取运动线索。我们提出LaMo,它根据当前潜在变量和提示,对帧间潜在变化制定了一个潜在运动先验。该先验通过两个轻量级读出器暴露:一个用于训练期间的宏运动漂移损失,以及一个用于采样期间的微运动场引导。这两个组件都是即插即用的,与现有的视频扩散骨干网络兼容,无需架构或I/O更改。在VideoPhy和VideoPhy2上,LaMo改进了CogVideoX骨干网络,并优于最近使用外部监督的物理感知基线。在VBench上,它在保持整体生成质量的同时改善了运动相关维度。这些结果表明,无标签视频包含有用的运动监督,可用于提高现代视频扩散模型的物理保真度。

英文摘要

Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.

2605.23872 2026-05-25 cs.LG cs.NA math.NA stat.ML

Training-Free Looped Transformers

免训练循环Transformer

Lizhang Chen, Jonathan Li, Chen Liang, Ni Lao, Qiang Liu

AI总结 本文提出了一种无需训练的循环变压器模型,通过在冻结的预训练模型中引入一个轻量级的推理时包装器,对连续的中间层块进行循环应用,而无需额外微调或结构修改。研究发现,直接重复使用中间层块会导致性能下降,因此作者借鉴常微分方程的前向欧拉方法,将循环视为对同一近似的优化,采用更小的阻尼子步骤替代单一的大更新。实验表明,该方法在多种模型架构上均能有效提升推理性能,如在MMLU-Pro等基准测试中取得显著提升。

详情
AI中文摘要

我们引入了免训练循环Transformer,其中轻量级推理时包装器循环冻结检查点的连续中间块层,无需额外微调、继续训练或架构更改。与先前使用循环结构端到端训练的循环Transformer方法不同,我们在测试时将循环性改造到预训练模型上。我们表明,简单的块重新应用通常会降低性能,凸显了循环应用策略的重要性。受将预归一化Transformer块视为ODE上的前向欧拉步骤的启发,我们将循环视为同一近似的细化,用一个大的更新替换为更小的阻尼子步骤。在七个密集、稀疏MoE和MLA+MoE模型家族中,我们的方法在MMLU-Pro上将Qwen3-4B-Instruct提升了2.64个百分点,在CommonsenseQA上将Qwen3-30B-A3B-Instruct提升了1.14个百分点,在OpenBookQA上将Moonlight-16B-A3B-Instruct提升了1.20个百分点。

英文摘要

We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.
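The forward-Euler view can be made concrete with a toy residual block. The sketch below contrasts naive reapplication (k full steps) with k damped sub-steps of size 1/k. The uniform 1/k schedule and the linear toy function are assumptions for illustration only; the abstract does not specify the actual damping rule.

```python
import math


def residual_step(x, f, step=1.0):
    """One residual update x + step * f(x), i.e. a forward-Euler step of size `step`."""
    return [xi + step * fi for xi, fi in zip(x, f(x))]


def naive_loop(x, f, k):
    """Reapply the full block k times (integrates the ODE to time k, not time 1)."""
    for _ in range(k):
        x = residual_step(x, f)
    return x


def damped_loop(x, f, k):
    """Replace one large update with k sub-steps of size 1/k (still integrates to time 1)."""
    for _ in range(k):
        x = residual_step(x, f, step=1.0 / k)
    return x


def toy_f(x):
    # Toy stand-in for a block's residual branch: dx/dt = -0.5 * x.
    return [-0.5 * xi for xi in x]


x0 = [1.0]
exact = math.exp(-0.5)            # true ODE solution at time 1
single = residual_step(x0, toy_f)[0]
naive = naive_loop(x0, toy_f, 4)[0]
damped = damped_loop(x0, toy_f, 4)[0]
```

In this toy setting the damped loop lands closer to the exact time-1 solution than a single step does, while naive reapplication overshoots by integrating four times as far, which mirrors the abstract's observation that naive block reapplication usually hurts.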

2605.23871 2026-05-25 stat.ML cs.LG math.ST stat.TH

Move on Muon : A Hamiltonian probability gradient flow perspective of Muon optimizer

Muon上的移动:Muon优化器的哈密顿概率梯度流视角

Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur

AI总结 本文从哈密顿概率梯度流的视角,研究了Muon优化器的连续时间动力学行为,提出了正则化Muon优化的梯度流形式,并揭示了其与核范数的Fenchel对偶平滑之间的联系。通过将Muon优化推广到有限粒子概率目标函数,作者推导了其惯性连续时间极限,并建立了参数-动量对的概率相空间平均场方程,证明了该动力学为阻尼哈密顿概率动力系统,具有单调递减的哈密顿能量。此外,文章还分析了目标函数的收敛性,并将该方法扩展到适用于变换器混合专家模型的块状Muon概率流。

详情
AI中文摘要

我们开发了一种在矩阵值参数概率测度空间上的梯度流,该梯度流由正则化Muon(理想化Muon优化器的解析平滑版本)诱导。关键观察是正则化正交化映射是核范数的光滑Fenchel对偶平滑的梯度。这确定了(正则化)Muon更新为更新变量中的镜像/近端步骤,其中动量充当对偶坐标。我们利用这一结构将Muon从单个矩阵参数提升到形如$J(ρ)=R\left(\int F d ρ\right)$的有限粒子概率目标,这一设置由神经网络训练的均场描述所激发,并推导出惯性连续时间极限。利用这一结构,我们在步长和动量的惯性缩放下推导出有限粒子连续时间极限,然后过渡到参数-动量对概率律上的相空间均场方程。所得流可被证明是阻尼哈密顿概率动力学,其动能由正则化Muon镜像势诱导。我们证明了一个精确的哈密顿耗散恒等式,显示哈密顿能量单调递减。虽然目标函数本身在惯性Muon动力学下不一定单调,但在额外的梯度优势、有界动量和曲率/对齐假设下,我们获得了目标间隙的连续和离散时间指数收敛率。我们还研究了均场极限方程的适定性,并建立了相互作用粒子系统的混沌传播保证。最后,我们将公式扩展到乘积矩阵空间上的Hilbert值特征映射,得到适用于平滑Transformer混合专家模型的块状Muon概率流。

英文摘要

We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(ρ)=R\left(\int F d ρ\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.

2605.23868 2026-05-25 cs.CV

Vision Transformers Need Better Token Interaction

视觉Transformer需要更好的Token交互

Linxiang Su

AI总结 视觉Transformer(ViT)在学习图像级表示方面表现出色,但在长时间训练后,其对密集预测任务的块表示效果会下降。本文分析了这一现象,指出其原因不仅在于高范数伪影,还涉及语义扩散问题,即全局语义信息在块之间非局部地传播。为此,作者提出通过稀疏注意力机制来增强块之间的选择性交互,在保持全局连接性的前提下提升了ViT在密集预测任务中的性能,如语义分割等。

详情
Comments
7 pages
AI中文摘要

视觉Transformer(ViT)可以学习强大的图像级表示,但在长时间训练过程中,其补丁表示对于密集预测变得不那么有效。我们重新审视这种密集退化现象,并认为它不能仅由高范数伪影完全解释。相反,我们描述了“语义扩散”:一种优化捷径,其中全局语义信息通过补丁token传播,超出了局部合理的范围。我们的分析表明,密集表示质量不能仅由局部性来捕捉:浅层特征可以保持与前景区域更好的对齐,但表现不如深层特征,而[CLS]特征在密集预测中仍然具有互补性。这些观察表明,目标不应该是移除全局上下文,而是使token交互更具选择性。因此,我们研究稀疏注意力作为最小干预,用entmax-1.5替换softmax注意力,同时保持全局token连接。在ImageNet-1K上训练200个epoch的DINOv1 ViT-S/16上,这一改变保持了ImageNet线性探测准确率,并显著提高了语义分割性能:VOC mIoU从42.80提高到48.78,ADE20K从19.85提高到21.97,Cityscapes从36.79提高到37.87。这些结果表明,选择性token混合是改善密集ViT表示的一种简单而有效的偏置。

英文摘要

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and [CLS] features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
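For readers unfamiliar with entmax-1.5, the sketch below computes it for a single vector of attention scores. It uses the known closed form p_i = max(z_i / 2 - tau, 0)^2 with the threshold tau chosen so the weights sum to one, and finds tau by bisection. The bisection solver and the function name `entmax15` are illustrative choices, not the paper's implementation, which is not described in the abstract.

```python
def entmax15(scores, n_iter=60):
    """Entmax with alpha = 1.5 for one score vector, solved by bisection on tau.

    Weights have the form max(z_i / 2 - tau, 0) ** 2, so low-scoring entries
    receive exactly zero weight, unlike softmax.
    """
    half = [s / 2.0 for s in scores]
    lo = max(half) - 1.0  # at this tau the largest entry alone contributes 1, so the sum is >= 1
    hi = max(half)        # at this tau every entry is clipped, so the sum is 0
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        total = sum(max(h - tau, 0.0) ** 2 for h in half)
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    probs = [max(h - lo, 0.0) ** 2 for h in half]
    norm = sum(probs)  # >= 1 by construction; renormalize to remove bisection error
    return [p / norm for p in probs]


weights = entmax15([1.0, 0.5, -3.0])
```

In this example the third score is far enough below the others that its weight is exactly zero, which is the selectivity the abstract argues for; every token still remains eligible for attention, so global connectivity is preserved.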

2605.23867 2026-05-25 cs.HC cs.AI

Human Decision-Making with Persuasive and Narrative LLM Explanations

具有说服性和叙事性LLM解释的人类决策

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu

AI总结 本研究探讨了生成式语言模型(LLM)在分类任务中生成的叙事性解释对人类决策性能的影响。通过大规模人类行为实验,研究发现LLM生成的叙事解释的说服力并未显著提升决策准确性,但可能增加人类对AI预测的依赖,并可能对决策反应时间和判断AI预测正确性的能力产生负面影响。研究结果表明,在AI预测中加入叙事解释可能带来决策性能的权衡,未来需要进一步研究其具体影响机制和适用场景。

详情
AI中文摘要

大型语言模型(LLMs)有潜力在分类任务中辅助和改善人类决策,不仅通过提供相当准确的预测,还通过生成这些预测的连贯叙事解释。先前的研究表明,人们通常认为AI叙事解释易于理解、可信且具有说服力,能够改变信念和观点;然而,关于叙事解释对客观人类决策表现的影响知之甚少。在这里,我们进行了一项大规模人类行为实验,以评估使用LLM生成的不同说服力叙事解释的决策表现。我们发现,基于LLM的解释的说服力程度(或缺乏说服力)并未显著影响决策准确性,相比于简单的AI预测本身,这与基于特征重要性的可解释AI的典型结果一致。我们发现有证据表明叙事增加了对AI的依赖,但无论AI预测正确还是错误都是如此。探索性分析还表明,更具说服力的叙事可能对决策响应时间以及区分正确和错误AI预测的能力产生不利影响。总体而言,这项工作表明,将叙事解释与AI预测结合可能会对决策表现产生权衡,需要更多研究来确定叙事解释如何以及何时影响人类决策。

英文摘要

Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also through their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found that the degree of persuasiveness, or lack thereof, of LLM-based explanations did not meaningfully affect decision accuracy relative to a simple AI prediction alone, in agreement with typical results for explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, but did so both when the AI prediction was correct and when it was incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.

2605.23863 2026-05-25 cs.RO

Robotic Strawberry Harvesting with Robust Vision and Deep Reinforcement Learning based Sim-to-Real Control

基于鲁棒视觉和深度强化学习仿真到现实控制的机器人草莓采摘

Al Bashir, Shao-Yang Chang, Partho Ghose, Prem Raj, Chen-Kang Huang, Azlan Zahid

AI总结 本文提出了一种闭环的机器人草莓采摘系统,结合了鲁棒视觉模块、基于仿真训练的深度强化学习控制以及ROS平台的实物机器人执行。研究中设计了一种改进的YOLO26-seg架构HRAttnEdge-YOLO26-seg,提升了复杂场景下的实例分割性能,并在仿真环境中训练了基于PPO的策略控制器,实现了对UR10e机械臂的精准控制。实验表明,该系统在温室环境中成功采摘了281颗草莓,达到了较高的成功率,展示了仿真训练与任务感知结合在农业机器人中的实用性和高效性。

详情
AI中文摘要

本研究提出一种闭环机器人草莓采摘系统,结合鲁棒视觉模块、仿真训练的深度强化学习(DRL)控制和基于ROS的真实机器人执行。在感知方面,我们提出HRAttnEdge-YOLO26-seg,一种改进的YOLO26-seg架构,融合高分辨率P2分支、分割路径注意力和边缘监督原型学习,以改善杂乱场景中的实例分割。在控制方面,我们在Isaac Lab中训练目标条件近端策略优化(PPO)策略,生成UR10e机械臂的平滑关节位置指令,并将其部署在UR10e机器人上,用于目标水果的接近和采摘。这种基于仿真的方法减少了硬件依赖,降低了开发成本,并允许在真实部署前无需大量物理试验即可进行可扩展的策略训练。所提出的视觉模型在评估方法中表现出最高的整体性能。在自采集和公开数据集上,该模型的分割性能提升了10%至14%。在受控室内测试中,PPO控制器产生的运动比基于逆运动学(IK)的MoveIt基线更稳定且动态更平滑。在温室试验中,所提出的集成系统采摘了281颗草莓,实现了96.6%的接近成功率、91.3%的抓取-拉动成功率和84.3%的总体采摘成功率。这些结果表明,任务特定感知与仿真训练的PPO相结合,可以作为传统依赖规划器的操作中接近方法的实用且资源高效的替代方案,从而在复杂农业环境中实现可靠的闭环机器人采摘。

英文摘要

This study presents a closed-loop robotic strawberry harvesting system that combines a robust vision module, simulation-trained deep reinforcement learning (DRL) control, and ROS-based real-robot execution. For perception, we propose HRAttnEdge-YOLO26-seg, a modified YOLO26-seg architecture that incorporates a high-resolution P2 branch, segmentation-path attention, and edge-supervised prototype learning to improve instance segmentation in cluttered scenes. For control, we train a target-conditioned Proximal Policy Optimization (PPO) policy in Isaac Lab to produce smooth joint-position commands for a UR10e manipulator and deploy it on a UR10e robot for target-fruit reaching and harvesting. This simulation-based approach reduces hardware dependency, lowers development cost, and allows scalable policy training without exhaustive physical trials before real deployment. The proposed vision model demonstrated the highest overall performance among the evaluated methods. On both self-collected and public datasets, the model showed a 10 to 14% improvement in segmentation performance. In controlled in-house tests, the PPO controller produced stable and dynamically smoother motion than an inverse kinematics (IK)-based MoveIt baseline. In greenhouse trials, the proposed integrated system harvested 281 strawberries, achieving 96.6% reaching success, 91.3% grasp-and-pull success, and 84.3% overall harvesting success. These results illustrate that task-specific perception combined with simulation-trained PPO can serve as a practical and resource-efficient alternative to conventional planner-dependent reaching in manipulation, enabling reliable closed-loop robotic harvesting in complex agricultural environments.

2605.23861 2026-05-25 cs.LG cs.AI cs.CV

Leveraging Foundation Models for Causal Generative Modeling

利用基础模型进行因果生成建模

Aneesh Komanduri, Xintao Wu

AI总结 该论文研究如何利用预训练基础模型进行因果生成建模,旨在提升AI系统在反事实推理方面的能力。提出了一种名为FM-CGM的模块化框架,通过概念提取器、概念操作器和反事实生成器三个核心组件,实现了端到端的视觉因果推理。该方法结合了因果推理模型和文本到图像扩散模型,并引入了因果语义引导机制,有效支持零样本因果发现与反事实图像生成,具有重要的理论与应用价值。

详情
AI中文摘要

因果生成建模对于开发能够进行反事实推理的可靠且透明的AI系统至关重要。现有方法侧重于在生成模型训练过程中整合因果约束,但通常缺乏统一框架来利用预训练基础模型的零样本推理能力。我们提出FM-CGM,一个使用预训练基础模型进行端到端视觉因果推理的模块化框架。FM-CGM通过三个核心组件形式化因果流程:概念提取器、概念操作器和反事实生成器。通过利用大型推理模型进行因果推断,以及文本到图像扩散模型进行生成,我们的方法实现了零样本因果发现、干预和反事实生成。然后,我们开发了因果语义引导(CSG),一种基于交叉注意力的机制,确保语义干预传播到后代概念,同时保留不变区域。我们实验证明,我们的方法能够识别合理的因果结构,并适用于忠实的反事实图像生成。

英文摘要

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.

2605.23857 2026-05-25 cs.LG cs.CL

Strong Teacher Not Needed? On Distillation in LLM Pretraining

不需要强教师?关于大语言模型预训练中的蒸馏

Taiming Lu, Zhuang Liu

AI总结 本文挑战了知识蒸馏中“强教师优于弱教师”的常见假设,研究了在大语言模型预训练中不同教师-学生关系对蒸馏效果的影响。通过调整模型规模和训练数据量,作者构建了强-弱、同级和弱-强的教师-学生关系,并发现即使使用小型或未充分训练的教师,通过合理混合语言建模和蒸馏损失,也能有效提升学生模型性能。研究还表明,更强的教师并不总是更好,过度增加教师规模或训练数据可能削弱蒸馏效果,同时蒸馏在提升模型泛化能力方面比领域内拟合更具优势。

详情
AI中文摘要

知识蒸馏通常假设强到弱的关系,即更强的教师会产生更好的学生。在这项工作中,我们检验了关于大语言模型预训练中蒸馏的这一假设。通过改变架构大小和训练token预算,我们创建了强到弱、同级和弱到强的师生关系,并研究了每种情况下蒸馏的有效性。我们发现教师不需要强:通过适当混合语言建模和知识蒸馏损失,即使是小型和训练不足的教师也能提升较大的学生。同时,更强的教师并不总是更好:通过更多参数或更多训练token进一步推动教师,可能会饱和甚至逆转蒸馏收益。我们进一步观察到,蒸馏更容易改善泛化(分布外和下游性能)而非域内拟合。这些结果共同挑战了蒸馏预训练总是需要强教师的普遍信念。

英文摘要

Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.
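The "proper mixing" of the two losses can be pictured as a convex combination of next-token cross-entropy and a KL term toward the teacher. The sketch below computes that mix for a single token position. The linear weighting with `kd_weight`, the forward KL direction, and the absence of a temperature are assumptions; the abstract does not specify the exact objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]


def mixed_loss(student_logits, teacher_logits, target, kd_weight):
    """(1 - kd_weight) * cross-entropy + kd_weight * KL(teacher || student) for one token."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    lm_loss = -math.log(p_student[target])
    kd_loss = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return (1.0 - kd_weight) * lm_loss + kd_weight * kd_loss


# Hypothetical two-token vocabulary with an uninformed student.
pure_lm = mixed_loss([0.0, 0.0], [2.0, 0.0], target=0, kd_weight=0.0)
pure_kd_same = mixed_loss([2.0, 0.0], [2.0, 0.0], target=0, kd_weight=1.0)
blended = mixed_loss([0.0, 0.0], [2.0, 0.0], target=0, kd_weight=0.5)
```

Setting `kd_weight` to 0 recovers ordinary language-model training, so sweeping it is one simple way to probe how much a weak teacher can contribute.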

2605.23856 2026-05-25 cs.RO

Point Tracking Improves World Action Models

点跟踪改进了世界动作模型

Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, Juho Kannala

AI总结 该论文提出了一种名为JOPAT的联合像素与轨迹世界动作模型,用于改进机器人策略学习中的环境动态建模。JOPAT通过单个去噪扩散变换器同时预测潜视觉观测、带可见性标志的2D点轨迹以及动作,核心思想是利用点轨迹提供更鲁棒且能捕捉长期动态的显式运动表示。实验表明,JOPAT在涉及遮挡、物体交互和屏幕外运动的长时序任务中显著优于基于像素的传统方法。

详情
AI中文摘要

机器人策略学习受益于捕捉环境动态的世界动作模型,但像素级预测将动态与光照、纹理等无关因素纠缠在一起,使学习到的表示容易受到与任务无关的视觉变化的影响。我们提出了JOPAT,一种联合像素与跟踪的世界动作模型,它在单个去噪扩散变换器中预测潜在视觉观测、带可见性的2D点跟踪和动作。关键洞察在于,跟踪提供了运动的显式表示,能够捕捉长时域动态,并在遮挡或部分离帧运动下保持鲁棒性,比仅建模像素外观具有更大的实用性。在LIBERO和真实世界的LeRobot任务上,JOPAT优于基于像素的基线,在涉及遮挡、物体交互和离屏运动的长时域任务上提升最大。

英文摘要

Robot policy learning benefits from world-action models that capture environment dynamics, but pixel-level prediction entangles dynamics with nuisance factors such as lighting and texture, making learned representations vulnerable to task-irrelevant visual variation. We propose JOPAT, a JOint Pixel-And-Track World-Action Model that predicts latent visual observations, 2D point tracks with visibility, and actions in a single denoising diffusion transformer. The key insight is that tracks provide an explicit representation of motion that captures long-horizon dynamics and remains robust under occlusion or partial out-of-frame motion, offering greater utility than modeling pixel appearance alone. On LIBERO and real-world LeRobot tasks, JOPAT improves over pixel-based baselines, with the largest gains on long-horizon tasks involving occlusion, object interaction, and off-screen motion.

2605.23854 2026-05-25 cs.LG math.ST stat.ML stat.TH

Entrywise Error Bounds for Spectral Ranking with Semi-Random Adversaries

半随机对抗下谱排序的逐项误差界

Dongmin Lee, Anuran Makur, Japneet Singh

AI总结 本文研究了在半随机对抗环境下谱方法用于谱排序的逐项误差界问题。针对能够任意增强某些边采样概率的半随机对手,作者分析了无权重谱方法的性能,并发现其表现高度依赖生成图的谱特性。通过适当重加权观测边以抵消对手影响,可恢复接近均匀采样图的渐近性能。数值实验验证了理论结果的有效性。

详情
Comments
17 pages, 2 figures, 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2
AI中文摘要

Bradley-Terry-Luce (BTL) 模型估计是一种基于成对比较数据对项目集合进行排序的成熟策略。尽管在均匀采样图的情况下,谱估计和最大似然估计等 BTL 估计方法的理论性能已得到充分研究,但将这些结果推广到更广泛的随机图类已被证明具有挑战性。在这项工作中,我们研究了谱算法在半随机对抗下的逐项误差,该对抗可以任意提升某些边的采样概率。我们发现,未加权谱方法的性能严重依赖于生成图的谱性质。此外,我们表明,通过适当地重新加权观察到的边以对抗对抗并恢复谱间隙,可以恢复接近均匀采样图的渐近性能。最后,我们提供了支持我们理论发现的数值模拟。

英文摘要

Bradley-Terry-Luce (BTL) model estimation is a well-established strategy to rank a collection of items given a dataset of pairwise comparisons. Although the theoretical performance of BTL estimation methods, such as spectral and maximum likelihood estimation, is well studied in the regime of uniformly sampled graphs, generalizing such results to a wider class of random graphs has proved challenging. In this work, we investigate the entry-wise error of spectral algorithms against a semi-random adversary that can arbitrarily boost the sampling probabilities of certain edges. We find that the performance of the unweighted spectral method is heavily dependent on the spectral properties of the generated graph. Furthermore, we show that asymptotic performance approaching that of uniformly sampled graphs can be recovered by appropriately reweighting the observed edges to counteract the adversary and restore the spectral gap. Finally, we provide numerical simulations that support our theoretical findings.
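As background for the spectral method discussed above, the sketch below implements a plain unweighted spectral estimator in the style of Rank Centrality: pairwise win fractions define a Markov chain whose stationary distribution serves as the item scores. The input format, the normalizer, and the power-iteration solver are illustrative assumptions, and the reweighting scheme the paper proposes against the adversary is not shown.

```python
def spectral_scores(win_frac, n_iter=2000):
    """Estimate BTL scores as the stationary distribution of a comparison Markov chain.

    win_frac[i][j] is the fraction of i-vs-j comparisons won by j, or None if the
    pair was never compared (hypothetical input format).
    """
    n = len(win_frac)
    d = n  # any normalizer at least as large as the maximum degree keeps rows valid
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and win_frac[i][j] is not None:
                # Move from i toward j in proportion to how often j beats i.
                P[i][j] = win_frac[i][j] / d
        P[i][i] = 1.0 - sum(P[i])  # remaining mass stays at i
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        # Power iteration: pi <- pi P.
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Noise-free check: with exact BTL win probabilities the true weights are recovered.
true_w = [0.5, 0.3, 0.2]
exact_fracs = [
    [None if i == j else true_w[j] / (true_w[i] + true_w[j]) for j in range(3)]
    for i in range(3)
]
estimated = spectral_scores(exact_fracs)
```

The chain satisfies detailed balance with respect to the true weights when the win fractions are exact, which is why the stationary distribution recovers them; with an uneven comparison graph, convergence and accuracy depend on the chain's spectral gap, the quantity the abstract highlights.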

2605.23847 2026-05-25 cs.RO

Instrumentation for Imitation Learning: Enhancing Training Datasets for Clothes Hanger Insertion

模仿学习的仪器化:增强衣架插入训练数据集

Remko Proesmans, Thomas Lips, Francis wyffels

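AI summary: This paper studies how integrating sensors into objects (instrumentation) can improve imitation learning for clothes hanger insertion. The authors train diffusion policies on 180 teleoperated demonstrations and compare policies with and without access to instrumentation data. Policies that use instrumentation data achieve success rates 14-25 percentage points higher than vision-only policies and show greater task awareness. Augmenting the training set with rollouts from an instrumented expert policy also lets a vision-only policy reach performance close to that of the expert, supporting instrumentation as a way to improve imitation learning.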

Comments
Accepted for presentation at ICRA2026
Abstract

Large behaviour models have transformed the field of robotic manipulation, but prohibitive data requirements have thus far prevented a revolution similar to that of vision-language models. We believe that instrumentation, i.e. sensor integration in objects, can provide invaluable state information and enable efficient learning for robotic manipulation. In this paper, we present instrumented imitation learning of clothes hanger insertion. Using 180 teleoperated demonstrations, we train diffusion policies with and without access to instrumentation data. Results show that policies leveraging instrumentation outperform vision-only counterparts by 14-25 percentage points and exhibit greater task awareness. Crucially, a black-box imitation learning policy learns to prioritise instrumentation signals without explicit guidance. In addition, enhancing the teleoperation dataset with rollouts from an instrumented expert policy enables a vision-only student policy to achieve performance comparable to the instrumented expert, thereby surpassing the original vision-only policy. These findings establish instrumentation as a promising strategy to enhance imitation learning for robotic manipulation. Datasets are available on Zenodo.

2605.23845 2026-05-25 cs.CV

Learning a Particle Dynamics Model with Real-world Videos

Chanho Kim, Suhas V. Sumukh, Li Fuxin

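AI summary: This paper proposes a method for learning a particle dynamics model from unlabeled real-world videos, aiming to overcome the limitations that traditional physics simulators and world models trained on synthetic data face in real-world scenes. Built on a Gaussian splatting framework, the method uses rendering supervision to learn the position and rotation changes of dense Gaussian particles directly, without particle-level labels. The authors also present a real-world dataset of about 500 videos of diverse object interactions.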

Comments
CVPR 2026 Findings
Abstract

Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristic subsampling anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.