arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.31129 2026-06-01 cs.LG

Generalizing Multi-Scale Time-Series Modeling with a Single Operator

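Cheonwoo Lee, Dooho Lee, Doyun Choi, Jaemin Yoo

AI summary: Proposes the SiGMA architecture, which achieves distance-aware scaling through a learnable discrete Gaussian kernel, addressing the fixed, discrete scaling that limits existing methods, and reaches the best performance on both long- and short-term forecasting tasks.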

Comments
Accepted at ICML 2026

Abstract

Multi-scale modeling has emerged as an effective design principle for time-series forecasting by capturing temporal dynamics at multiple resolutions. As no principled foundation has been established in the literature, we unify existing scaling methods into a scaling operator family, revealing a fundamental limitation of existing approaches: reliance on fixed and discrete scaling. To address this limitation, we propose SiGMA (Single Generalized Multi-scale Architecture), which enables distance-aware scaling via the learnable discrete Gaussian (LDG) kernel grounded in scale-space theory. We evaluate SiGMA comprehensively on long- and short-term forecasting benchmarks against state-of-the-art multi-scale baselines. SiGMA outperforms all competitors on both tasks, especially achieving the best performance in 13 out of 16 long-term evaluation settings. Beyond accuracy, SiGMA significantly improves training speed by up to 5.3 times and reduces memory consumption by up to 3.8 times over the strongest competitors. Code is available at https://github.com/cheonwoolee/SiGMA.
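
As a rough illustration of the scale-space idea the abstract refers to, the sketch below builds the classical discrete Gaussian kernel T(n, t) = exp(-t) I_n(t) and uses it to smooth a series at a continuous scale t. This is not the paper's learnable LDG kernel: t is a fixed number here, and the function names, truncation radius, and boundary handling are assumptions made for the example.

```python
import math

def bessel_i(n, t, terms=50):
    """Modified Bessel function of the first kind I_n(t), via its power series."""
    term = (t / 2.0) ** n / math.factorial(n)
    total = 0.0
    for k in range(terms):
        total += term
        term *= (t / 2.0) ** 2 / ((k + 1) * (k + 1 + n))
    return total

def discrete_gaussian_kernel(t, radius):
    """Discrete Gaussian exp(-t) * I_|n|(t), truncated to |n| <= radius and renormalized."""
    weights = [math.exp(-t) * bessel_i(abs(n), t) for n in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(series, t, radius):
    """Convolve a series with the kernel; indices are clamped at the boundaries (an assumption)."""
    kernel = discrete_gaussian_kernel(t, radius)
    last = len(series) - 1
    out = []
    for i in range(len(series)):
        acc = 0.0
        for offset, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + offset, 0), last)
            acc += w * series[j]
        out.append(acc)
    return out
```

Because t is a real number rather than an integer downsampling factor, the amount of smoothing can vary continuously, which is the property a learnable scale would build on.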

2605.31127 2026-06-01 cs.LG cs.NA math.NA

Scalable Bayesian Inference for Nonlinear Conservation Laws

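Tim Weiland, Philipp Hennig

AI summary: Proposes a numerically conservative method based on Gaussian process priors for uncertainty quantification in nonlinear conservation laws, and uses sparse approximation techniques to solve large-scale forward and inverse problems efficiently.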

Comments
27 pages, 13 figures, 3 tables

Abstract

Nonlinear conservation laws are at the heart of many of the most important dynamical systems in science and engineering. In practical applications, such systems are often subject to various sources of uncertainty, e.g. due to sparse or noisy measurements. Inferring physical quantities and fields of interest then becomes an ill-posed problem which both classical numerical methods and modern deep learning-based methods struggle to treat appropriately. Recent work has framed classical numerical methods as Bayesian inference under Gaussian process priors, resulting in a physics-aware treatment of uncertainties. Following this line of work, we develop a novel numerically conservative method for uncertainty-aware simulations of nonlinear conservation laws. We use recent sparse approximation techniques to scale up to large-scale forward and inverse problems. For forward simulation, we inherit the accuracy of classical solvers while providing structured uncertainty quantification. On inverse problems, we recover posteriors over nonparametric source fields in seconds -- outperforming neural baselines that take minutes to produce a less accurate point estimate.
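
To make "numerically conservative" concrete, here is a minimal sketch of a classical finite-volume update for the inviscid Burgers equation with a Lax-Friedrichs flux and periodic boundaries. It is a standard deterministic solver, not the Gaussian-process method of the paper; the choice of equation, flux, and function names are assumptions for illustration.

```python
import math

def burgers_flux(u):
    """Physical flux f(u) = u^2 / 2 of the inviscid Burgers equation."""
    return 0.5 * u * u

def lax_friedrichs_step(u, dx, dt):
    """One conservative update on a periodic grid of cell averages."""
    n = len(u)
    flux = []
    for i in range(n):
        # Numerical flux at the interface between cell i and cell i + 1.
        left, right = u[i], u[(i + 1) % n]
        central = 0.5 * (burgers_flux(left) + burgers_flux(right))
        dissipation = 0.5 * (dx / dt) * (right - left)
        flux.append(central - dissipation)
    # flux[i - 1] with i = 0 wraps to the last interface (periodic boundary).
    return [u[i] - (dt / dx) * (flux[i] - flux[i - 1]) for i in range(n)]

def simulate(u0, dx, dt, steps):
    """Advance the initial cell averages by a fixed number of steps."""
    u = list(u0)
    for _ in range(steps):
        u = lax_friedrichs_step(u, dx, dt)
    return u
```

Each interface flux is added to one cell and subtracted from its neighbour, so the sum of cell averages is unchanged up to rounding, even after shocks form.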

2605.31126 2026-06-01 cs.CL cs.AI cs.LG

Not All Synthetic Data Is Yours to Learn From

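Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang

AI summary: Studies whether language models can learn from their own generated text in self-training with no prompts, teacher, verifier, or reward model; finds that compatibility between the synthetic data and the student is the key factor, and shows that capability and verbatim memorization can be separated.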

Abstract

Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.

2605.31124 2026-06-01 cs.CV

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

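Zhizhen Pan, Hesong Wang, Huan Wang

AI summary: To address the large parameter count and limited deployability of VGGT, proposes the QVGGT quantization framework, which combines selective mixed precision, token filtering, and task-aware scale search to achieve near-lossless W4A16 quantization, substantially reducing memory and accelerating inference.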

Comments
Accepted by CVPR 2026. Project page: https://ddsacu.github.io/QVGGT/

Abstract

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3$\sim$4.9$\times$ memory reduction and up to 2.8$\times$ real hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.
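
The sketch below shows what a weight-only 4-bit quantizer with a scale search might look like in its simplest form. It scores candidate scales only by reconstruction error on the weights themselves; the task-aware criterion described in the abstract (multi-head supervision and cross-head geometric consistency) is not modelled. All function names and the candidate grid are assumptions.

```python
def quantize_int4(weights, scale):
    """Map floats to signed 4-bit integer codes in [-8, 7]."""
    return [max(-8, min(7, round(w / scale))) for w in weights]

def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]

def reconstruction_error(weights, scale):
    """Squared error between the weights and their quantized reconstruction."""
    restored = dequantize(quantize_int4(weights, scale), scale)
    return sum((w - r) ** 2 for w, r in zip(weights, restored))

def search_scale(weights, num_candidates=100):
    """Grid-search shrink factors of the max-abs scale and keep the lowest-error one.

    Assumes at least one weight is non-zero.
    """
    base = max(abs(w) for w in weights) / 7.0
    candidates = [base * (k / num_candidates) for k in range(1, num_candidates + 1)]
    return min(candidates, key=lambda s: reconstruction_error(weights, s))
```

The last candidate is the plain max-abs scale, so the searched scale is never worse than that baseline under this error measure.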

2605.31121 2026-06-01 cs.RO cs.AI

TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues

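Tianle Zeng, Hanjing Ye, Jianwei Peng, Jingwen Yu, Hanxuan Chen, Hong Zhang

AI summary: To address navigation degradation caused by semantic-cue interruptions in outdoor vision-language navigation, proposes a unified framework that maintains stable navigation through prolonged cue-free phases using traversability-consistent executable guidance and an uncertainty-aware 3D cue memory, with marked success-rate gains on quadrupedal and wheeled platforms.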

Abstract

Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, where informative goal cues become sparse, occluded, or leave the field of view. Once such cues disappear, agents enter a cue-free phase and often degrade into backtracking, oscillatory headings, or aimless exploration. While memory-based methods attempt to bridge these gaps, they often fail under traversability-driven detours: the remembered cue direction may be infeasible, forcing detours that prolong cue-free phases and gradually render robot-centric cues stale and implicit histories blurred. This makes traversability a stability condition for maintaining goal-directed guidance, rather than merely a local safety concern. We propose a unified outdoor VLN framework that survives semantic-cue interruptions by maintaining traversability-consistent executable guidance throughout prolonged cue-free phases. Specifically, our method extracts semantic bearings from visibility-gated goal or exploration cues and grounds them into executable headings using a real-time near-field traversability profile, providing goal-consistent feasible guidance beyond reject-only safety filtering. To prevent guidance degradation during detours, we lift intermittent 2D evidence into a world-aligned 3D cue memory with an uncertainty-aware readout mechanism, ensuring guidance remains continuously reachable and stable as the robot moves. We evaluate the framework on quadrupedal and wheeled platforms over 600--1000 m routes. Our method improves simulation success rate by over 10 percentage points over the strongest baseline and achieves a real-world success rate of 40%, compared to 17.5% for the strongest baseline, with substantially higher robustness during prolonged cue-free intervals.
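
One simple way to picture grounding a semantic bearing into an executable heading is to pick the traversable heading closest to the desired bearing, rather than only rejecting unsafe commands. The sketch below does exactly that; the profile format, the threshold, and the function names are hypothetical and not taken from the paper.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def ground_bearing(goal_bearing_deg, profile, min_score=0.5):
    """Return the traversable heading closest to the goal bearing.

    profile: list of (heading_deg, traversability_score in [0, 1]) pairs
    (a hypothetical near-field profile format).
    Returns None when no heading clears the threshold.
    """
    feasible = [heading for heading, score in profile if score >= min_score]
    if not feasible:
        return None
    return min(feasible, key=lambda h: angular_difference(h, goal_bearing_deg))
```

With a goal bearing of 10 degrees and the straight-ahead direction blocked, the function returns the nearest open heading instead of stopping.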

2605.31120 2026-06-01 cs.GR cs.AI cs.LG

SWIM: Single-Instance Whole-Body Imitation for swiMming

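Binglun Wang, Edmond S. L. Ho, He Wang

AI summary: Proposes SWIM, a physics-based swimming motion synthesis method that uses single-instance imitation learning to achieve full-body coordination and continuous interaction with fluids, outperforming existing methods in data efficiency, stability, robustness, and generalization.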

Abstract

We propose a new method for synthesizing physically-based swimming motions. Physically-based character animation aims to generate physically valid, controllable, and natural-looking motions which can respond to unexpected disturbances, where one dictating factor of difficulty is the complexity of the task, especially the level of sophistication of the required interactions with the environment. Existing research has succeeded in various tasks in static and dynamic environments. We push the difficulty further to swimming, which requires full-body coordination and continuous interactions with fluids, a new level of complexity when it comes to interacting with the environment. This complexity imposes challenges in learning control under volatile environmental forces, generalizing control to different environments and swimming styles, lack of data references, and prohibitively slow physical simulation which is inevitable during control learning. To this end, we propose SWIM, a new imitation method for swimming motions, which can learn from a single swimming motion and generalize to unseen environments, body conditions, and swimming styles. Extensive evaluation and comparison demonstrate that SWIM is data-efficient, stable, robust, and generalizable, outperforming alternative methods across multiple classes of tasks and metrics.

2605.31119 2026-06-01 cs.RO cs.LG

Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning

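Navin Sriram Ravie, Andrew Jong, Krrish Jain, John Liu, Omar Alama, Bijo Sebastian, Sebastian Scherer

AI summary: Proposes a continual learning framework that lets mobile robots learn online from disturbances and attribute anomalous behaviour to causes through semantics, enabling better prediction and planning.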

Abstract

In robotics, dangers and adversity modes are often embodiment-specific and relative to each agent. A frontier of autonomous mobile robotics is to enable agents to operate effectively in the wild in unseen unstructured environments. A significant challenge in unseen unstructured environments is that it may not be possible to predict all the dangers to the specific robot. Although recent work has used large foundation vision-language models (VLMs) to preemptively predict an exhaustive list of common-sense dangers, it remains difficult to capture possible interaction and embodiment-dependent adversities. We propose a continual learning framework for a mobile embodied agent to learn online from disturbances and attribute anomalous behaviours to causes through semantics, enabling better prediction and planning of the world in the future. Our framework, "Don't Fool Me Twice", first observes disturbances and describes their effects on the robot; this description is augmented with visual context to query a VLM to predict possible causes; the local disturbance is characterized using kernel regression, which allows for efficient, few-shot modeling of transient anomalies. We leverage semantic voxel-centric modeling to estimate epistemic uncertainty, enabling richer downstream recovery by treating interaction-driven disturbances as learnable spatial behaviors. We present four hypotheses and validate them in simulation and on hardware across embodiments and adversity modes.
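
The abstract mentions kernel regression for characterizing local disturbances from few samples. A minimal sketch of one common form, the Nadaraya-Watson estimator with a Gaussian kernel, is given below. The data layout, the bandwidth, and the use of summed kernel weight as a crude evidence measure are assumptions; the paper's voxel-centric uncertainty estimate is not reproduced here.

```python
import math

def gaussian_kernel(p, q, bandwidth):
    """Gaussian similarity between two positions."""
    squared = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-squared / (2.0 * bandwidth ** 2))

def predict_disturbance(query, observations, bandwidth=1.0):
    """Nadaraya-Watson estimate of disturbance magnitude at a query position.

    observations: list of (position, magnitude) pairs.
    Returns (estimate, support), where support is the summed kernel weight;
    a low support means the query is far from anything observed so far.
    """
    weights = [gaussian_kernel(query, pos, bandwidth) for pos, _ in observations]
    support = sum(weights)
    if support == 0.0:
        return 0.0, 0.0
    estimate = sum(w * mag for w, (_, mag) in zip(weights, observations)) / support
    return estimate, support
```

The estimator needs no training phase, so a handful of observed disturbances already yields a usable local prediction.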

2605.31116 2026-06-01 cs.CV cs.RO

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

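Jiahui Li, Jiawei Sun, Zixiang Ren, Ming Liu, Jiamin Shi, Ruiteng Zhao, Zhiyang Liu, Liying Liu, Zuoguan Wang, Kaidi Yang

AI summary: To address the lack of visual supervision on the scene-token bottleneck in end-to-end driving, proposes the Neural Token Reconstruction (NTR) framework, which uses self-distillation masked latent reconstruction to constrain scene tokens to retain richer visual representations, achieving state-of-the-art driving performance.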

Abstract

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.
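
The abstract reports lower pairwise redundancy among the learned scene tokens without giving the formula. One plausible reading, sketched below as an assumption, is the mean absolute cosine similarity over all token pairs; the paper's exact definition may differ, and effective rank is not computed here.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_redundancy(tokens):
    """Mean absolute cosine similarity over all token pairs (assumed metric).

    Requires at least two tokens; 0 means mutually orthogonal tokens,
    1 means every token points along the same direction.
    """
    pairs = list(combinations(tokens, 2))
    return sum(abs(cosine(u, v)) for u, v in pairs) / len(pairs)
```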

2605.31115 2026-06-01 cs.CV

Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning

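Hao Zheng, Hu Wang, Tiantian Zheng, Prajjwal Bhattarai, Tuka Alhanai

AI summary: Proposes Polyphony, a three-stage method that addresses inter-hand dependencies, visual asymmetry, and semantic ambiguity in dual-hand action segmentation through an alternately trained dual-hand vision transformer, semantic feature conditioning, and diffusion-based segmentation, reaching the best performance on multiple datasets.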

Comments
CVPR 2026

Abstract

Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12x larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation.

2605.31113 2026-06-01 cs.CL

TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

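Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

AI summary: For user-generated content platforms such as Wikipedia, proposes TSM-Bench, a multilingual, multi-generator, multi-task benchmark, and finds that existing detectors lose 10-40% accuracy on task-specific MGT and exhibit a generalization asymmetry.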

Abstract

Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on generic text generation tasks (e.g., "Write an article about machine learning."). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These task-specific MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of SOTA MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce TSM-Bench, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (i) average detection accuracy drops by 10-40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data -- even across domains -- but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. TSM-Bench therefore provides a critical foundation for developing and evaluating future models.

2605.31111 2026-06-01 cs.LG

Subspace-Decomposed JEPAs: Disentangling Progression and Content in Latent World Models

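Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet

AI summary: Proposes SD-JEPA, which decomposes the JEPA latent space into orthogonal progression and content subspaces, constrained by a cosine-margin triplet loss and SIGReg regularization respectively; it outperforms the LeWM baseline on control benchmarks and shows that the progression coordinate can serve as a scene-aware compass.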

Abstract

Joint-Embedding Predictive Architectures (JEPAs) learn compact latent world models by predicting future embeddings, but no single coordinate of the latent is designated to encode task progression. We carve the JEPA latent into two orthogonal subspaces with disjoint roles: a low-dimensional progression subspace shaped by a cosine-margin triplet loss, and a high-dimensional content subspace regularised by the existing SIGReg objective of LeWM. We prove that the two anti-collapse forces act on disjoint coordinates, so they compose additively rather than competing on the same dimensions. Our method, SD-JEPA, improves over the LeWM baseline on the majority of its control benchmarks at matched compute, and outperforms the strongest non-LeWM JEPA baseline on Push-T; a subspace-ablation falsifier confirms the split is the load-bearing ingredient. Beyond planning, the resulting 1-D angular progression coordinate functions as a scene-aware compass on the latent. It advances with task progress, regresses when the agent backtracks, and under controlled perturbations both spikes and relocalises to a semantically appropriate new task-phase sector, separating the moment of surprise from its meaning in a way that prediction-error scalars cannot. Three quantitative tests back this up: $|\Delta\theta_t|$ outperforms the standard latent-prediction-error surprise at localising semantic events on 40 held-out cube episodes by up to +0.18 pooled AUROC (97.5% per-episode win rate at $\pm 1$-step tolerance); a within-episode linear probe across all four environments (40 episodes per env) shows the 8-dimensional progression subspace (4.2% of the latent) explains 72-95% of task-progress variance.
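
A minimal sketch of the two ingredients named in the abstract follows: splitting a latent vector into progression and content coordinates, and a cosine-margin triplet loss on the progression part. The split by plain slicing, the margin value, and the choice of which frames serve as positives and negatives are assumptions; only the 8-dimensional progression size comes from the abstract.

```python
import math

PROGRESSION_DIMS = 8  # the abstract reports an 8-dimensional progression subspace

def split_latent(z, progression_dims=PROGRESSION_DIMS):
    """Split a latent into (progression, content) by slicing coordinates.

    Disjoint coordinate slices are orthogonal subspaces; the paper's actual
    split may be constructed differently (assumption).
    """
    return z[:progression_dims], z[progression_dims:]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_margin_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss asking cos(anchor, positive) to exceed cos(anchor, negative) by a margin."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```

In a typical use of this kind of loss, the positive would be a frame at a similar task stage and the negative a frame from a distant stage, with the loss applied only to the progression slice.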

2605.31110 2026-06-01 cs.RO

Building Generalization Into Behavior Generation Via Adaptive Compositions of Regularities

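Aravind Battaje, Malte Bernhard, Vito Mengers, Oliver Brock

AI summary: Uses the AICON framework to study adaptive composition of regularities (predictable relationships within the robot-environment system) as a key mechanism for generalization in behavior generation, and validates it in simulated experiments.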

Comments
10 pages, 6 figures

Abstract

Generalization in robotics requires prior knowledge about how the world is structured, yet this structure changes from one situation to the next. This paper investigates the proposition that generalization arises from adaptively composing regularities -- predictable relationships within the robot-environment system -- into situation-appropriate structures for behavior generation. We examine this proposition by analyzing the mechanism in AICON (Active InterCONnect), a framework representing regularities as interacting processes in a differentiable network, where sensory feedback realizes composition and gradient descent generates behavior. To isolate adaptive composition as the key mechanism, we study a simple simulated problem in which all relevant regularities can be identified. We expose the resulting model to a wide range of novel conditions not considered during design, and we find that it generates context-appropriate behavior in all but one case, where encoded regularities are provably insufficient. Ablations reveal that the network automatically modulates which regularities influence behavior based on their informativeness. These results suggest that adaptive composition of regularities constitutes a powerful inductive bias for building generalization into behavior generation.

2605.31108 2026-06-01 cs.CV cs.LG

Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

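Jonathan Swinnen, Tinne Tuytelaars

AI summary: Proposes a domain incremental learning method that combines a main task head with a self-supervised masked autoencoder head and uses test-time training to identify the best LoRA adapter, so the model can remember a domain again; the method suits video stream data.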

Abstract

In this work we introduce a novel approach to domain incremental learning, adapting models over time to evolving, non-stationary data. In contrast to other works, we do not attempt to avoid catastrophic forgetting, but rather allow it and exploit it. Our model combines a main task head with a self-supervised masked autoencoder (MAE) head. We then learn domain-specific LoRA adapters during incremental training. Each adapter specializes to its domain, naturally inducing forgetting on other domains in both heads. At inference, we perform online test-time training on the self-supervised MAE head to identify which LoRA best matches the current input, so the model can 'remember' the domain again. Our scheme is especially well-suited to real-world streaming data, such as video, where consecutive samples are highly correlated and domain shifts are gradual. We demonstrate our method on domain-incremental action recognition and semantic segmentation tasks.
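
The selection step can be pictured as follows: given a reconstruction loss per adapter on each incoming frame, keep a running score per adapter and pick the lowest. The sketch below covers only this bookkeeping, with exponential smoothing as an assumed way to exploit the temporal correlation of a stream; the online test-time training on the MAE head is not reproduced, and the class and parameter names are hypothetical.

```python
class AdapterSelector:
    """Tracks a smoothed reconstruction loss per adapter and picks the lowest."""

    def __init__(self, adapter_names, smoothing=0.8):
        self.smoothing = smoothing
        self.scores = {name: None for name in adapter_names}

    def update(self, losses):
        """Fold in one frame of losses (dict: adapter name -> loss) and return the best adapter."""
        for name, loss in losses.items():
            previous = self.scores[name]
            if previous is None:
                self.scores[name] = loss
            else:
                self.scores[name] = self.smoothing * previous + (1.0 - self.smoothing) * loss
        # Only adapters that have been scored at least once are eligible.
        scored = {name: s for name, s in self.scores.items() if s is not None}
        return min(scored, key=scored.get)
```

With smoothing, a single noisy frame does not flip the choice, while a sustained change in which adapter reconstructs best does so after a few frames, matching the gradual domain shifts the abstract describes.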

2605.31106 2026-06-01 cs.LG

Riemannian Diffusion Models on General Manifolds via Physics-Informed Neural Networks

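Gyeonghoon Ko, Juho Lee

AI summary: Because heat kernels on Riemannian manifolds are hard to compute analytically, proposes approximating the heat kernel by solving the manifold heat equation with a physics-informed neural network, enabling training and sampling of diffusion models.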

Abstract

Riemannian diffusion models generalize score-based generative modeling to manifold-supported data via stochastic diffusion equations on the manifold. However, training requires sampling from and differentiating the manifold heat kernel, which is rarely available in closed form beyond a few highly symmetric manifolds. We propose a general approach that approximates the heat kernel by directly solving the manifold heat equation with a physics-informed neural network (PINN). Given an explicit manifold specification, we choose a coordinate system, derive the corresponding heat (Fokker--Planck) equation and a short-time asymptotic approximation, and then train a PINN to learn the log heat kernel. The resulting surrogate enables both forward noising (heat-kernel sampling) and conditional-score evaluation for denoising score matching. We demonstrate the method on diverse manifolds including $S^2$, $SO(3)$, $\mathrm{SPD}(n)$, and permutation-quotiented point clouds.
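
For contrast with the general case, $S^2$ is one of the symmetric manifolds where the heat kernel has a known series form. The sketch below evaluates it with a Legendre expansion, under the convention $\partial_t u = \Delta u$ (an assumption; a Brownian-motion convention with $\tfrac{1}{2}\Delta$ would halve the exponents). It is a reference computation of the kind a learned surrogate could be checked against, not the PINN approach of the paper.

```python
import math

def sphere_heat_kernel(theta, t, max_degree=60):
    """Heat kernel on the unit sphere S^2 at geodesic distance theta and time t.

    Uses p_t(theta) = sum_l (2l + 1) / (4 pi) * exp(-l (l + 1) t) * P_l(cos theta),
    truncated at max_degree; the truncation suits moderate t, not t near zero.
    """
    x = math.cos(theta)
    p_prev, p_curr = 1.0, x            # Legendre polynomials P_0 and P_1
    total = 1.0 / (4.0 * math.pi)      # l = 0 term
    for l in range(1, max_degree + 1):
        total += (2 * l + 1) / (4.0 * math.pi) * math.exp(-l * (l + 1) * t) * p_curr
        # Bonnet recurrence: (l + 1) P_{l+1} = (2l + 1) x P_l - l P_{l-1}
        p_prev, p_curr = p_curr, ((2 * l + 1) * x * p_curr - l * p_prev) / (l + 1)
    return total
```

The kernel is a probability density over the sphere, so integrating it against the area element $2\pi \sin\theta \, d\theta$ should give one, which makes a convenient sanity check.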

2605.31105 2026-06-01 cs.CL

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

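Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai

AI summary: Proposes GRKV, whose ridge-regression merge steps minimize the discrepancy between compressed-cache and full-cache attention outputs, addressing the over-merging and information loss caused by the imbalanced merge pattern of span-based retention.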

Comments
21 pages, 7 figures

Abstract

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.
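
To illustrate how a ridge-regression merge could spread evicted information over all retained tokens, the sketch below solves a small least-squares problem: find an update to the retained values whose attention-weighted effect matches the contribution lost by eviction, with an L2 penalty on the update. The objective, the fixed attention weights, and every name here are assumptions made for the example; the abstract does not give GRKV's exact formulation.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ridge_merge_update(attn_retained, residual, lam):
    """Closed-form ridge solution delta = (A^T A + lam I)^-1 A^T R.

    attn_retained: queries x retained attention weights (A).
    residual: queries x dim output lost by eviction (R), e.g. A_evicted @ V_evicted.
    Returns the retained x dim update to add to the retained values.
    """
    a_t = transpose(attn_retained)
    gram = matmul(a_t, attn_retained)
    for i in range(len(gram)):
        gram[i][i] += lam
    return solve(gram, matmul(a_t, residual))
```

A larger `lam` shrinks the update, which is the role the abstract assigns to regularization: limiting how far retained tokens are pulled toward evicted ones.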

2605.31100 2026-06-01 cs.AI cs.DB cs.IR

Vector Linking via Cross-Model Local Isometric Consistency

通过跨模型局部等距一致性的向量链接

Ziying Chen, Yang Cao, He Sun, Beining Yang, Tianjian Yang

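AI summary: Proposes an iterative, reference-based geometric embedding hashing method built on local geometric consistency, which recovers cross-model vector correspondences from a small set of seed anchors and achieves accurate, robust vector linking.

Details
Comments
Accepted at ICML 2026
AI Chinese abstract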

我们研究向量链接:给定由不同黑盒编码器在部分重叠数据集上生成的两个嵌入云,仅使用向量恢复跨模型对象对应关系。实验和理论上表明,独立训练的对比编码器表现出局部几何一致性:短距离近似保持(按比例因子),而长距离因模型特定失真而不保持。基于此,我们提出一种迭代的、基于参考的几何嵌入哈希方法,从微小的种子锚点集恢复向量链接。它通过到采样配对锚点的距离表示每个向量,通过哈希空间匹配提出候选链接,并在Beta-Bernoulli后验中跨视图聚合证据,以引导高置信度链接作为新锚点。在多个基准测试和嵌入模型对上的实验表明,该方法在不同重叠度、种子预算和域外锚点下实现准确且鲁棒的链接,并应用于向量数据库集成和跨模型聚类。代码见https://github.com/DBgroup-Edinburgh/VecLinking。

English abstract

We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not due to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta-Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain anchors, with applications to vector database integration and cross-model clustering. Code is available at https://github.com/DBgroup-Edinburgh/VecLinking.
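
Two ingredients named in this abstract, representing a vector by its distances to paired anchors and scoring a candidate link with a Beta-Bernoulli posterior, can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: plain nearest-neighbour matching stands in for hash-space matching, the second embedding is an exact rotation and rescaling of the first (the paper claims only local consistency), and the names `anchor_signature`, `link_by_signature`, and `beta_posterior_mean` are hypothetical.

```python
import numpy as np

def anchor_signature(X, anchors):
    # Distance from every vector to every anchor, divided by the mean
    # distance so that a global scale factor cancels out.
    d = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=-1)
    return d / d.mean()

def link_by_signature(X, Y, anchors_x, anchors_y):
    # Nearest-neighbour matching in signature space (stand-in for hashing).
    sx, sy = anchor_signature(X, anchors_x), anchor_signature(Y, anchors_y)
    cost = np.linalg.norm(sx[:, None, :] - sy[None, :, :], axis=-1)
    return cost.argmin(axis=1)

def beta_posterior_mean(hits, trials, alpha=1.0, beta=1.0):
    # Posterior mean of a Bernoulli rate under a Beta(alpha, beta) prior.
    return (alpha + hits) / (alpha + beta + trials)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal map
Y = 3.0 * X @ R                                  # second "model": rotated and rescaled
links = link_by_signature(X, Y, X[:8], Y[:8])    # first 8 rows act as paired anchors
confidence = beta_posterior_mean(hits=3, trials=4)
```

In an iterative scheme of the kind the abstract describes, each resampled anchor set would contribute one trial per candidate link, and links whose posterior mean is high enough would be promoted to anchors for the next round.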

2605.31099 2026-06-01 cs.CL cs.AI

KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning

KnowledgeGain: 评估和优化面向读者学习的科学新闻生成

Dominik Soós, Meng Jiang, Jian Wu

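AI summary: Proposes the KnowledgeGain metric, which evaluates science news quality by measuring readers' knowledge gain, and uses an LLM reader simulator to optimize generation and improve reader learning.

Details
AI Chinese abstract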

科学新闻是研究界与公众之间传播发现的重要媒介。然而,大多数用于生成或摘要文本的指标评估语义相似性和事实一致性,但并未衡量读者从新闻中学到了多少知识。我们引入了KnowledgeGain,这是一个通过测量读者阅读后获得的知识量来评估科学新闻质量的指标。为了评估该指标,我们首先进行了一项受控人类研究,表明该指标成功捕捉了人类读者阅读不同类型科学媒体时获得的知识差异。这些数据使我们能够校准一个仅基于提示的LLM读者模拟器。我们用它来在人类评估之前对候选文章进行排序和过滤。第二项人类研究表明,使用该模拟器选择的文章在阅读后准确性和标准化KnowledgeGain上均优于强生成基线。我们的工作是朝着生成更符合Bloom分类法知识和理解目标的科学新闻迈出的一步。

English abstract

Science news is an important medium to communicate discoveries between the research communities and the public. Yet, most metrics for generated or summarized text evaluate semantic similarity and factual consistency, but do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gained after reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric successfully captures the differential knowledge gained by human readers reading different types of science media. The data allowed us to calibrate a prompt-only LLM reader simulator. We use it to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.

2605.31097 2026-06-01 cs.DB cs.AI

SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition

SpecDB: 通过面向特征的分解生成LLM定制的数据库

Yunkai Lou, Longbin Lai, Shunyang Li, Zhengping Qian, Ying Zhang

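AI summary: Presents SpecDB, a system that uses large language models, feature-oriented decomposition, and the DBGraph dependency graph to generate customized relational databases automatically from natural-language workload descriptions; on TPC-C it performs comparably to PostgreSQL and MySQL at about 3% of their code size.

Details
AI Chinese abstract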

主流关系数据库在部署时提供统一的特征集,尽管单个工作负载只使用可用子系统的一小部分。我们研究是否可以根据目标工作负载按需生成具有匹配特征集的数据库。我们提出SpecDB,一个使用大语言模型(LLM)合成定制化关系数据库的系统。我们调查了9个生产系统,并将其分解为10个功能模块,每个模块进一步划分为实现变体。为了捕获跨模块依赖关系,包括不相交子树中的实现必须协同设计的情况,我们采用FODA特征模型,并用合作边扩展它,得到依赖图DBGraph。SpecDB通过分层模块构建流水线来操作DBGraph,其中每个模块由专门的子代理(由三个内部代理驱动:主代理、测试代理、架构代理)生成、验证和集成,以及一个精炼代理,该代理根据用户提供的精炼工具(对现有数据库源代码具有只读访问权限)迭代修复和调整组装的数据库。配套的选择组件将自然语言工作负载描述转换为一组实现变体,提供从工作负载描述到可部署数据库的端到端流水线。我们在TPC-C上使用BenchmarkSQL评估SpecDB。生成的数据库(23,779行Rust代码)在1个和10个仓库下完成了60分钟的TPC-C测试,零错误。在10个仓库下,它达到tpmC=130,而PostgreSQL为128,MySQL为127,延迟相当,代码量约为它们的3%。由于代理在模块规范级别而非产品源代码级别操作,它原则上可以跨系统边界组合技术。随着LLM成本的下降,为目标工作负载生成专用数据库正变得简单。

English abstract

Mainstream relational databases ship a uniform feature set across deployments, although individual workloads exercise only a fraction of the available subsystems. We investigate whether a database can instead be generated on demand with a feature set matched to the target workload. We present SpecDB, a system that uses large language models (LLMs) to synthesize customized relational databases. We survey 9 production systems and decompose them into 10 functional modules, each further divided into implementation variants. To capture cross-module dependencies, including cases where implementations in disjoint subtrees must be co-designed, we adopt the FODA feature model and extend it with a cooperate edge, yielding a dependency graph DBGraph. SpecDB operationalizes DBGraph through a layered module-construction pipeline in which each module is generated, validated, and integrated by a dedicated subagent (driven by three inner agents: Main, Tester, Architect), and a Refining Agent that iteratively repairs and tunes the assembled database against a user-supplied refining harness with read-only access to existing database source code. A companion selection component translates a natural-language workload description into a set of implementation variants, providing an end-to-end pipeline from workload description to deployable database. We evaluate SpecDB on TPC-C with BenchmarkSQL. The generated database (23,779 lines of Rust) completes 60-minute TPC-C at 1 and 10 warehouses with zero errors. At 10 warehouses it reaches tpmC=130, compared to 128 for PostgreSQL and 127 for MySQL, with comparable latency at ~3% of their code size. Because the agent operates at module-specification level rather than product source, it can in principle combine techniques across system boundaries. Paired with falling LLM costs, generating a purpose-built database for a target workload is becoming straightforward.

2605.31096 2026-06-01 cs.CV

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

iVGR: 通过强化学习将视觉基础推理内化到多模态大语言模型中

Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han

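AI summary: Proposes iVGR, a framework that uses reinforcement learning and a dual-stream training strategy to internalize visual localization into textual reasoning, avoiding the interference of explicit visual grounding at inference time and improving fine-grained perception.

Details
Comments
Accepted by ICML 2026
AI Chinese abstract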

尽管视觉基础链式思维(CoT)已成为增强多模态大语言模型(MLLM)细粒度感知的有前途范式,但其在推理阶段的有效性仍未得到充分探索。在这项工作中,我们经验性地发现,与没有显式视觉基础的标准文本CoT相比,在推理时强制要求视觉基础CoT中的显式对象框通常会降低性能。我们假设视觉定位能力可以内化到文本CoT中,而强制性的显式基础会对模型的主要目标(答案预测)引入不必要的干扰。为了解决这个问题,我们提出了内化视觉基础推理(iVGR),一种新颖的强化学习框架,将定位能力转移到文本推理过程中。我们采用双流训练策略,通过提出的一致性奖励将文本流与高质量的视觉基础流对齐,使模型在推理时无需显式基础即可准确定位。大量实验表明,我们的方法在细粒度基准上显著优于现有基线,同时保持支持工具辅助推理工作流的灵活性。

English abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.

2605.31094 2026-06-01 cs.CV cs.AI

Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation

重新定义实例匹配:全景分割评估中部件感知匹配的统一框架

Erik Großkopf, Soumya Snigdha Kundu, Hendrik Möller, Nicolas Münster, Mehdi Astaraki, Paula Tamara Buzduga, Kerstin Ritter, Benedikt Wiestler, Jan Kirschke, Jonathan Shapey, Tom Vercauteren, Florian Kofler

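AI summary: Recasts segment matching in panoptic segmentation as a constrained bipartite assignment problem, defines four matching strategies, extends the framework to part-aware evaluation, and releases a unified open-source package built on Panoptica.

Details
Comments
9 pages, 4 figures
AI Chinese abstract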

全景质量(PQ)度量是联合评估实例分割和语义分割的标准。然而,其原始定义依赖于预测片段和真实片段之间的一对一匹配,只有当IoU阈值超过0.5时才是直接的。低于0.5时,在一个探索不足的问题空间中会出现多种匹配策略。我们通过将片段匹配重新表述为约束二分分配问题,系统地阐明了这个空间。独立地约束预测端和真实端的度数,产生了四种匹配策略:一对一、多对一、一对多和多对多。我们表明,前三种在PQ框架内是良好定义的,而多对多则超出其范围。当实例被碎片化、相邻物体难以划分或标注有噪声时,这些策略变得相关。我们框架的核心是基于顶点的TP、FN和FP计数,锚定于真实片段和预测片段,而不是匹配边。我们进一步表明,该框架自然地扩展到部件感知全景分割,并在生物医学数据上探索了部件感知评估。在可配置的案例研究中,我们报告了不同阈值和匹配策略组合在实际中的表现。我们发布了一个基于Panoptica的统一开源包,它暴露了基于Voronoi的区域分析、部件感知评估和阈值下曲线面积作为可配置选项。

English abstract

The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction- and ground-truth-side degrees yields four matching strategies: One-to-One, Many-to-One, One-to-Many, and Many-to-Many. We show that the first three are well-defined within the PQ framework, while Many-to-Many falls outside it. These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex-based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part-aware panoptic segmentation, and we explore part-aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open-source package built on Panoptica. It exposes Voronoi-based region-wise analysis, part-aware evaluation, and Area Under Threshold Curve computations as configurable options.
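
For orientation, the baseline One-to-One case that this abstract starts from can be sketched in a few lines of plain Python. The sketch uses the standard PQ definition (sum of matched IoUs divided by TP + FP/2 + FN/2) and is not the Panoptica API. Segments are modelled as non-empty, non-overlapping sets of pixel indices, and the function names are hypothetical. With an IoU threshold of at least 0.5, each ground-truth segment can match at most one prediction, so a greedy scan is exact; the other matching strategies discussed in the abstract are not covered here.

```python
def iou(a, b):
    # Intersection over union of two non-empty pixel-index sets.
    return len(a & b) / len(a | b)

def panoptic_quality(pred, gt, thr=0.5):
    """One-to-One PQ for lists of pixel sets; assumes thr >= 0.5."""
    used, iou_sum, tp = set(), 0.0, 0
    for g in gt:
        for j, p in enumerate(pred):
            if j not in used and iou(g, p) > thr:
                used.add(j)              # each prediction is matched at most once
                iou_sum += iou(g, p)
                tp += 1
                break
    fp = len(pred) - tp                  # unmatched predictions
    fn = len(gt) - tp                    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0

gt = [{1, 2, 3, 4}, {10, 11}]
pred = [{1, 2, 3}, {20, 21}]
pq = panoptic_quality(pred, gt)          # one match with IoU 0.75, one FP, one FN
```

Note that TP, FP, and FN are counted here per segment (per vertex of the bipartite graph), which is the accounting the abstract describes as central to its framework.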

2605.31093 2026-06-01 cs.CV

Cross-Modal Clinical Knowledge Integration for Mammography Report Generation

跨模态临床知识整合用于乳腺X线报告生成

Jiayi Zhu, Fuxiang Huang, Yu Xie, Xi Wang, Zhixuan Chen, Yuan Guo, Qingcong Kong, Zhenhui Li, Qiong Luo, Hao Chen

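AI summary: Proposes MammoRG, a framework that simulates the clinical reporting workflow through two-stage training and incorporates the BI-RADS guideline and prior knowledge to improve the clinical consistency of generated reports.

Details
Comments
16 pages, 5 figures
AI Chinese abstract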

乳腺癌是一个主要的全球健康问题,乳腺X线筛查在早期检测中起着核心作用。大量的筛查检查给放射科医生带来了沉重的工作负担,使得准确且一致的报告生成成为一个关键的临床挑战。现有的自动乳腺X线报告生成方法主要关注直接的视觉到文本映射,而忽略了放射科医生在实际工作中遵循的结构化临床推理过程。为了解决这一局限性,我们提出了MammoRG,一个乳腺X线报告生成框架,它通过遵循BI-RADS指南并整合先验临床知识来明确模拟临床报告工作流程,从而生成诊断报告。具体来说,MammoRG采用两阶段训练框架。在第一阶段,模型通过基于分类的监督学习从患者的四视图乳腺X线图像中整合临床相关的先验知识。在第二阶段,引入术语感知的监督微调策略,将乳腺X线特异性临床术语建模为原子语义单元,从而生成具有更高临床一致性的高质量报告。为了促进生成报告的临床效能评估,我们进一步开发了MammoRGTool,一个专用的乳腺X线报告解析工具,它从自由文本报告中提取结构化临床信息。大量实验表明,MammoRG在多个临床效能指标上持续优于现有方法,特别是在与诊断相关的BI-RADS F1上,它在内部、外部1、外部2和VinDr-Mammo数据集上分别超过第二名模型2.73%、2.04%、1.90%和3.27%。

English abstract

Breast cancer is a major global health concern, and mammography screening plays a central role in early detection. The large volume of screening examinations creates a substantial workload for radiologists, making accurate and consistent report generation a critical clinical challenge. Existing automated mammography report generation methods primarily focus on direct visual-to-text mapping, while overlooking the structured clinical reasoning process followed by radiologists in real-world practice. To address this limitation, we propose MammoRG, a mammography report generation framework that explicitly simulates the clinical reporting workflow by following the BI-RADS guideline and incorporating prior clinical knowledge to produce diagnostic reports. Specifically, MammoRG adopts a two-stage training framework. In the first stage, the model learns to integrate clinically relevant prior knowledge from a patient's four-view mammograms through classification-based supervision. In the second stage, a terminology-aware supervised fine-tuning strategy is introduced to model mammography-specific clinical terms as atomic semantic units, enabling the generation of high-quality reports with improved clinical consistency. To facilitate clinical efficacy evaluation of generated reports, we further develop MammoRGTool, a dedicated mammography report parsing tool that extracts structured clinical information from free-text reports. Extensive experiments demonstrate that MammoRG consistently outperforms existing methods across multiple clinical efficacy metrics, particularly in diagnosis-related BI-RADS F1, where it surpasses the second-best model by 2.73%, 2.04%, 1.90%, and 3.27% on the internal, external 1, external 2, and VinDr-Mammo datasets, respectively.

2605.31090 2026-06-01 cs.CV cs.AI

On Revisiting Entropy for Identifying Mislabeled Images

重新审视熵在识别错误标注图像中的应用

Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

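AI summary: Proposes the signed entropy integral (SEI), a training-dynamics statistic that captures the magnitude and temporal trend of prediction entropy to identify mislabeled training samples, achieving state-of-the-art performance on medical imaging datasets.

Details
Comments
ICML 2026
AI Chinese abstract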

训练数据集中的错误标注样本会严重降低深度网络的性能,因为过参数化模型倾向于记忆错误标签。我们通过提出一种利用训练动态的错误标注数据检测新方法来应对这一挑战。我们的方法基于一个关键观察:正确标注的样本在训练过程中熵持续下降,而错误标注的样本在整个训练过程中保持相对较高的熵。基于这一见解,我们引入了一个有符号熵积分(SEI)统计量,它捕捉了训练周期中预测熵的幅度和时间趋势。SEI广泛适用于分类网络,并且在与对比语言-图像预训练(CLIP)架构集成时表现出特别的有效性。通过在四个医学影像数据集(由于诊断复杂性,该领域特别容易受到标注错误的影响)上进行涵盖不同模态和病理的广泛实验,我们证明SEI在错误标注数据识别中达到了最先进的性能,在保持计算效率和实现简单性的同时优于现有方法。我们的代码可在 https://github.com/MedAITech/SEI 获取。

English abstract

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets -- a domain particularly susceptible to labeling errors due to diagnostic complexity -- spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.
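
The observation behind this abstract, that entropy falls for clean samples and stays high for mislabeled ones, can be made concrete with a small sketch. The abstract does not give the SEI formula, so the code below does not implement it. It only computes per-epoch prediction entropy and two illustrative summaries, an area under the entropy curve (magnitude) and a last-minus-first difference (trend). The function names and the toy probability trajectories are assumptions.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each row of class probabilities.
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_summary(probs_per_epoch):
    """Illustrative summaries of one sample's entropy trajectory (not SEI)."""
    h = prediction_entropy(probs_per_epoch)      # one entropy value per epoch
    area = ((h[:-1] + h[1:]) / 2.0).sum()        # trapezoidal area: magnitude
    trend = h[-1] - h[0]                         # negative if entropy decreased
    return area, trend

# Toy trajectories over three epochs for a two-class problem.
clean = [[0.5, 0.5], [0.8, 0.2], [0.99, 0.01]]   # model grows confident
noisy = [[0.5, 0.5], [0.55, 0.45], [0.5, 0.5]]   # model stays uncertain
clean_area, clean_trend = entropy_summary(clean)
noisy_area, noisy_trend = entropy_summary(noisy)
```

On these toy inputs the clean sample has the smaller area and a negative trend, while the uncertain sample keeps a large area and a flat trend, which is the contrast a statistic combining both signals would exploit.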

2605.31082 2026-06-01 cs.SD cs.MM

Sound effects in media: A comparative analysis of recorded and synthetic samples in live-action and animation

媒体中的音效:实拍与动画中录制样本与合成样本的比较分析

Nelly Garcia, Joshua Reiss

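AI summary: Compares the believability of procedurally generated synthetic sound effects with real recordings in live-action and animated scenes, finding that synthetic sounds perform well in drama and sci-fi scenes but are less believable for everyday actions in cartoons.

Details
Comments
ArtsIT, Interactivity and Game Creation 2024
AI Chinese abstract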

为故事创作声音对于电影、电视剧和视频游戏等作品中环境的建立至关重要。这一过程通常涉及重复、分层和录制真实物体或使用音效库,这可能耗时且重复。为了解决这些挑战,程序化音频(也称为数字拟音)提供了一种解决方案,允许声音设计师快速生成样本。尽管效率高,但合成样本与真实样本相比的可信度仍存在问题。在我们的研究中,我们比较了由在线程序化引擎生成的合成样本,并将其与动画和实拍画面集成。我们的结果表明,程序化音频在戏剧和科幻场景中非常有效且被认为可信,特别是对于激光、打击、空气和火箭等声音模型,而合成声音在表现日常动作的卡通制作中不太可信。最后,我们确定了需要优化的特定模型,并根据音频专业人士的反馈强调了需要改进的音频特征。

English abstract

Creating sound for storytelling is crucial to establishing the environment in productions such as films, TV series and video games. This process often involves repeating, layering and recording real objects or using sound libraries, which can be time-consuming and repetitive. To address these challenges, procedural audio, also known as digital foley, offers a solution by allowing sound designers to quickly generate samples. Despite its efficiency, questions remain about the believability of synthetic samples compared to real ones. In our study, we compared synthetic samples generated by an online procedural engine and integrated them with both animated and live-action visuals. Our results indicate that procedural audio is highly effective and perceived as believable in drama and sci-fi scenes, particularly for sound models such as lasers, hits, air and rockets, whereas synthetic sounds weren't as believable in cartoon productions when representing everyday actions. Finally, we identified specific models that needed optimisation and highlighted audio features that needed improvement with feedback from audio professionals.

2605.31080 2026-06-01 cs.MM cs.AI cs.CL cs.CV cs.HC

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

策展人引导的多语言艺术描述对盲人和低视力观众的小型视觉语言模型试点研究

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller

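AI summary: Uses the small vision-language model Qwen2.5-VL-3B-Instruct to generate curator-guided art descriptions in German, Romanian, and Serbian for blind and low-vision audiences, finding that language-specific adapters give better controllability and visually grounded description quality than a multilingual adapter for Romanian and Serbian.

Details
Comments
7 pages, 2 figures, 3 tables. Preprint
AI Chinese abstract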

盲人和低视力(BLV)观众在视觉艺术描述方面仍然服务不足,尤其是在跨语言和博物馆环境中,隐私和知识产权限制可能倾向于使用小型本地视觉语言模型(VLM)。本试点研究使用Qwen2.5-VL-3B-Instruct,针对德语、罗马尼亚语和塞尔维亚语,调查了策展人引导的多语言艺术描述。我们从艺术品图像和元数据构建了一个平行的BLV导向字幕语料库,并在固定骨干网络和训练预算下,比较了语言特定的LoRA适配器与单个多语言适配器。评估结合了自动词汇和基于嵌入的指标,以及针对小型罗马尼亚BLV试点研究校准的LLM作为评判协议。在我们的试点设置下,语言特定适配器在罗马尼亚语和塞尔维亚语上表现出更稳定的可控性和视觉基础描述质量,而多语言适配器在德语上仍具有竞争力。我们将这些发现视为小型本地VLM的部署导向证据,并强调在得出关于多语言可访问性的总体结论之前,需要进行更大规模的BLV用户研究和更广泛的语言覆盖。

English abstract

Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.

2605.31075 2026-06-01 cs.CV

Task-Focused Memorization for Multimodal Agents

面向多模态智能体的任务聚焦记忆

Tao Zou, Yichen He, Tian Qiu, Yuan Lin, Hang Li

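AI summary: Proposes TaskMem, a reinforcement-learning-based framework for task-focused memorization policy learning, whose two-phase training lets multimodal agents dynamically select task-relevant memories from streaming observations; it improves VQA accuracy by 5.3% to 7.0% on three streaming benchmarks.

Details
AI Chinese abstract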

长期记忆对于多模态智能体构建连贯经验、积累世界知识和实现持续学习至关重要。然而,构建有效记忆不仅涉及记忆模块设计和准确性、保真度等基本要求,关键挑战在于决定记忆什么。多模态智能体(如具身智能体)在真实或虚拟环境中持续感知、推理和行动,接收无界的多模态观测流。面对这种信息组合爆炸,智能体必须选择性地保留与其环境角色相关且对未来任务有价值的内容。为弥合这一差距,我们将记忆生成建模为可学习的记忆策略,并引入TaskMem(任务聚焦记忆策略学习),一种基于强化学习的框架,使策略能够动态调整其关注点以适应环境中遇到的实际任务需求。TaskMem采用两阶段训练范式:第一阶段在基本保真度要求下优化记忆质量,学习如何记忆;第二阶段在部署后进行,智能体通过在其基础MLLM上调整适配器来学习记忆什么,利用近期环境任务定义奖励模型,引导记忆策略聚焦于任务相关的内容。为评估我们的方法,我们将VideoMME、EgoLife和EgoTempo重新构建为流式基准,模拟智能体处理流式观测并处理在线到达任务的真实场景。为隔离记忆评估,问题必须仅使用智能体的记忆回答,而不访问原始视频。基于Qwen3-VL-30B-A3B,TaskMem在这些基准上分别将VQA准确率提高了6.3%、7.0%和5.3%。

English abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.

2605.31073 2026-06-01 cs.CL

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

ConsisGuard:在LLM护栏中对齐安全审议与策略执行

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi, Bingyu Zhu, YueFeng Chen, Zeyu Yang, Jungang Lou, Longtao Huang, Ningyu Zhang, Kui Ren, Hui Xue

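AI summary: Proposes ConsisGuard, a framework that uses Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment to resolve the inconsistency between deliberation and enforcement in reasoning-based LLM guardrails, improving safety detection performance and reducing policy execution failures.

Details
Comments
18 pages, 9 figures
AI Chinese abstract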

基于推理的LLM护栏通过在做出最终决策前生成明确理由来改进安全审核。然而,它们的理由并不总是导致忠实的执行:模型可能在推理中识别出有害意图,但仍然预测安全标签,或者在没有策略依据的情况下发布不安全决策。我们将这种安全关键性失败模式识别为审议到执行的差距。与一般的思维链忠实性不同,护栏可靠性要求策略执行一致性:生成的推理应基于安全策略,最终决策应由该推理蕴含。我们提出ConsisGuard,一个用于基于推理的LLM护栏的一致性感知框架。ConsisGuard执行策略到决策轨迹蒸馏和功能耦合对齐,对齐安全审议与决策执行之间的内部耦合。在提示和响应有害性检测基准上的实验表明,ConsisGuard在减少策略执行失败的同时提高了检测性能。这些结果表明,可靠的基于推理的护栏需要准确忠实地执行安全策略。

English abstract

Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.

2605.31070 2026-06-01 cs.LG cs.GT

Learning to Bid in FCR Markets: A Best-of-Both-Worlds Approach

在FCR市场中学习投标:一种两全其美的方法

Marius Potfer, Cheng Wan, Pierre Gruet

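AI summary: For the European Frequency Containment Reserve (FCR) market, where bidders observe only partial feedback such as the clearing price and awarded quantity, recasts the multi-country FCR clearing problem as a repeated multi-unit uniform-price auction and adapts a Best-of-Both-Worlds combinatorial semi-bandit algorithm that achieves logarithmic pseudo-regret in stochastic environments and square-root regret in adversarial ones; experiments confirm the theoretical scaling and competitive practical performance.

Details
Comments
Algorithms and data available at https://data.mendeley.com/datasets/htprbf47dg/1
AI Chinese abstract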

在欧洲频率控制储备(FCR)市场中,由于竞争报价是隐藏的,投标者只能观察到来自市场的部分反馈,如出清价格和分配数量,因此对于灵活性提供商而言,投标具有挑战性。对于活跃在单个国家的参与者,我们证明多国FCR出清问题可以转化为针对内生对手报价向量的重复多单位统一价格拍卖。这种重新表述产生了一个在线学习问题,并使我们能够适应一种两全其美的组合半强盗算法,该算法可从这种标准市场反馈中实现。由此产生的投标者在随机环境中实现对数伪遗憾,在对抗环境中实现$\mathcal{O}(\sqrt{T})$遗憾。综合实验验证了预期的缩放性,对历史欧洲FCR数据的回测显示了实际中的竞争性能:该方法在稳定产品上表现尤其出色,而EXP3类型的基线在更强的非平稳性下可能更安全。总体而言,结果表明,当学习规则与产品级市场稳定性相匹配时,基于学习的FCR市场投标在理论上是有根据的,在实践中是有用的。

English abstract

Bidding in the European Frequency Containment Reserve (FCR) market is challenging for flexibility providers because competing offers are hidden and bidders observe only partial feedback from the market, such as the clearing price and awarded quantity. For a participant active in a single country, we show that the multi-country FCR clearing problem can be recast as a repeated multi-unit uniform-price auction against an endogenous vector of opposing bids. This reformulation yields an online learning problem and allows us to adapt a Best-of-Both-Worlds combinatorial semi-bandit algorithm implementable from this standard market feedback. The resulting bidder achieves logarithmic pseudo-regret in stochastic environments and $\mathcal{O}(\sqrt{T})$ regret in adversarial ones. Synthetic experiments confirm the expected scaling, and backtests on historical European FCR data show competitive performance in practice: the method performs especially well on stable products, while EXP3-type baselines can be safer under stronger non-stationarity. Overall, the results show that learning-based bidding in FCR markets is theoretically grounded and practically useful when the learning rule matches product-level market stability.
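
The feedback structure described in this abstract is easier to see with a toy clearing rule. The sketch below is a generic single-zone, multi-unit uniform-price procurement auction, not the actual multi-country FCR clearing, which involves cross-border constraints that are ignored here. The function name, the bidder labels, and the assumed marginal cost of 9.0 are hypothetical. Offers are accepted in ascending price order until demand is met, and every accepted unit is paid the price of the marginal accepted offer.

```python
def clear_uniform_price(offers, demand):
    """offers: list of (bidder, price, quantity). Returns (clearing_price, awards)."""
    awards, remaining, clearing_price = {}, demand, None
    for bidder, price, quantity in sorted(offers, key=lambda o: o[1]):
        if remaining <= 0:
            break
        take = min(quantity, remaining)          # partial acceptance at the margin
        awards[bidder] = awards.get(bidder, 0) + take
        remaining -= take
        clearing_price = price                   # price of the last accepted offer
    return clearing_price, awards

# "us" is the learning bidder; the other offers are hidden from it in practice.
offers = [("us", 10.0, 2), ("a", 8.0, 3), ("b", 12.0, 4)]
price, awards = clear_uniform_price(offers, demand=6)

# The bidder observes only the clearing price and its own awarded quantity.
feedback = (price, awards.get("us", 0))
profit = (price - 9.0) * awards.get("us", 0)     # 9.0 is an assumed marginal cost
```

A bandit-style learner of the kind the abstract mentions would repeat this round many times, choosing its offer each round from the `feedback` of earlier rounds alone.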

2605.31069 2026-06-01 cs.CV cs.CL

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

面向有效长视频事件预测的多级事件语义挖掘

Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

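AI summary: Proposes VISTA, a framework that performs long-video event prediction through multi-level event semantics mining (detail level, event level, and future level), addressing existing models' inability to precisely extract event details and analyze event development at a fine grain.

Details
AI Chinese abstract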

准确预测未来事件是内容理解和决策制定的基础,涉及多个领域。先前研究主要关注文本或短视频场景,而长视频事件预测具有多模态上下文丰富和叙事复杂的特点,尚未得到充分探索。同时,基于大语言模型和视觉语言模型构建的近期长视频语言模型在长视频问答和摘要方面表现出潜力,但难以泛化到事件预测,因为它们既不能精确提取事件相关细节,也无法对事件发展进行细粒度分析。为弥补这一差距,我们提出VISTA,一个用于长视频事件预测的多级事件语义挖掘框架。首先,VISTA应用以角色为中心的视觉提示精确提取事件相关视觉细节,增强细节级语义;其次,采用知识增强的迭代检索策略,引导大语言模型逐步构建逻辑连贯的事件链,从而改善事件级叙事;最后,VISTA采用类人的先提议后检索策略生成多样化的面向未来的提议并整合多级线索,产生稳健准确的预测。在真实数据集上的大量实验验证了VISTA在长视频事件预测中的有效性。

English abstract

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.

2605.31068 2026-06-01 cs.CV

HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning

HQ-JEPA: 用于跨模态遥感表示学习的混合量子联合嵌入预测架构

Md Aminur Hossain, Ayush V. Patel, Sanjay K. Singh, Biplab Banerjee

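AI summary: Proposes HQ-JEPA, a hybrid quantum-classical architecture that learns semantic representations from Sentinel-1/2 imagery through joint-embedding prediction, cross-modal alignment, SIGReg Gaussian regularization, and a quantum fidelity loss, outperforming strong baselines on GeoBench classification and segmentation tasks.

Details
Comments
19 pages
AI Chinese abstract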

我们提出了HQ-JEPA,一种用于跨模态遥感表示学习的混合量子-经典联合嵌入预测架构。该框架将JEPA风格的掩码潜在预测扩展到配对的Sentinel-1和Sentinel-2图像,通过从可见上下文区域预测掩码目标表示,同时在共享嵌入空间中对齐异构模态特征。为了提高表示质量,HQ-JEPA结合了四个互补目标:潜在令牌预测、跨模态令牌对齐、融合潜在空间中基于SIGReg的高斯正则化,以及基于可微SWAP测试的保真度量子相似性(FQS)损失。与像素重建方法不同,HQ-JEPA直接在潜在空间中学习语义表示,并使用基于量子态重叠的相似性作为额外的正则化信号。我们在线性探测和微调设置下,在GeoBench分类和分割任务上评估了预训练编码器。结果表明,HQ-JEPA在强自监督和遥感基础模型基线上取得了具有竞争力且通常更优的性能,证明了将预测性自监督、跨模态几何正则化和基于量子保真度的表示学习相结合对遥感应用的好处。

English abstract

We introduce HQ-JEPA, a hybrid quantum-classical joint-embedding predictive architecture for cross-modal remote sensing representation learning. The proposed framework extends JEPA-style masked latent prediction to paired Sentinel-1 and Sentinel-2 imagery by predicting masked target representations from visible context regions while aligning heterogeneous modality features in a shared embedding space. To improve representation quality, HQ-JEPA combines four complementary objectives: latent token prediction, cross-modal token alignment, SIGReg-based Gaussian regularization in the fused latent space, and a differentiable SWAP-test-based Fidelity Quantum Similarity (FQS) loss. Unlike pixel reconstruction methods, HQ-JEPA learns semantic representations directly in latent space and uses quantum state-overlap-based similarity as an additional regularization signal. We evaluate the pretrained encoder on GeoBench classification and segmentation tasks under linear probing and fine-tuning settings. Results show that HQ-JEPA achieves competitive and often superior performance over strong self-supervised and remote sensing foundation-model baselines, demonstrating the benefit of integrating predictive self-supervision, cross-modal geometric regularization, and quantum fidelity-based representation learning for remote sensing applications.
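
The quantity a SWAP test estimates can be computed classically for small vectors, which may help readers unfamiliar with it. For two normalized states, the ancilla is measured in |0> with probability 1/2 + |<a|b>|^2 / 2, so the squared overlap is recovered as 2 P(0) - 1. The NumPy sketch below evaluates that relation directly. It is not the paper's FQS loss, whose exact form the abstract does not give, and the function names are hypothetical.

```python
import numpy as np

def swap_test_p0(a, b):
    # Probability of measuring the ancilla in |0> after a SWAP test on
    # the normalized states a and b: P(0) = 1/2 + 1/2 * |<a|b>|^2.
    a = np.asarray(a, dtype=complex) / np.linalg.norm(a)
    b = np.asarray(b, dtype=complex) / np.linalg.norm(b)
    return 0.5 + 0.5 * abs(np.vdot(a, b)) ** 2   # vdot conjugates its first argument

def overlap_from_p0(p0):
    # Invert the relation above to recover the squared overlap |<a|b>|^2.
    return 2.0 * p0 - 1.0

same = swap_test_p0([1, 1], [1, 1])              # identical states
orthogonal = swap_test_p0([1, 0], [0, 1])        # orthogonal states
```

Identical states give a probability of 1 and orthogonal states give 1/2, so the recovered overlap ranges from 1 down to 0 and can serve as a similarity signal between two embeddings encoded as quantum states.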

2605.31066 2026-06-01 cs.RO

Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

空中VLA模型能协作吗?基于CARLA-Air的闭环空地协调评估

Tianle Zeng, Yanci Wen, Xueang Yu, Hong Zhang

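AI summary: Builds the CARLA-Air simulation environment to evaluate aerial vision-language-action models on air-ground cooperation tasks, finds that current models struggle to turn single-agent competence into stable cooperative behavior, and identifies three key components for zero-shot cooperation: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment.

Details
Comments
Code at https://github.com/louiszengCN/CarlaAir
AI Chinese abstract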

最近的空中视觉-语言-动作(VLA)模型展示了有前景的单无人机能力,例如跟踪移动物体和导航到语言指定的地标。然而,这些能力能否转移到空地协作中尚不清楚,其中无人机和无人地面车辆必须在共享的闭环物理世界中联合行动。我们通过CARLA-Air研究这个问题,这是一个单进程空地评估环境,在同一个虚幻引擎运行时内统一了CARLA和AirSim。通过共享相同的世界状态、物理时钟和感知流水线,CARLA-Air实现了物理一致的无人机-无人地面车辆交互,并精确测量仿真时间戳对齐和有效协调延迟。利用CARLA-Air,我们在两个互补的诊断任务上评估了代表性的空中VLA和规划基线:移动平台降落和遮挡恢复护航。结果表明,当前的空中VLA模型通常能够跟踪或跟随地面伙伴,但难以将这种单智能体能力转化为稳定的协作行为。状态提示提供的益处有限,而朴素的双向交互未能持续提高性能,并且可能放大大多数基线的错误。这些发现表明,在测试的基于文本的提示接口下,零样本协作空地VLA需要当前范式之外的三个组件:显式的伙伴状态感知、低延迟动作协调和团队目标对齐。我们的代码可在https://github.com/louiszengCN/CarlaAir获取。

English abstract

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV--UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.