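arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.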

2606.14762 2026-06-16 cs.CV cs.AI New submission

Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

Julian Abelarde, Hugo Garrido-Lestache Belinchon

Affiliations: Department of Computer Science and Software Engineering, Milwaukee School of Engineering

AI summary: Proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis through micro-level indexing (analyzing the full transcript, individual sentences, and semantic groupings), and uses relevance-based heatmaps to visualize semantic chunking and matching.

Abstract

As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased [1]. Although many existing AI programs provide high-level video summaries based on AI-generated transcripts [2, 3, 4, 5], these approaches are often limited to coarse overviews and lack detailed analysis of a video's structure, thematic progression, and semantic relationships, all of which are required for comprehensive video analysis. This paper proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis [6, 12, 13]. The first stage of the process indexes the video at a micro level by (1) analyzing the full transcript, (2) analyzing individual transcript sentences, and (3) grouping these sentences by semantic similarity using an LLM as a judge [6, 13]. Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This framework establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps. Limitations and future expansions of the framework are also discussed.
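
The sentence-level step described in this abstract could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function name `build_sentence_prompt`, the prompt wording, and the one-sentence context window are assumptions. The abstract states only that each evaluation prompt includes the global transcript analysis and adjacent sentence information.

```python
from typing import List


def build_sentence_prompt(
    sentences: List[str],
    index: int,
    global_analysis: str,
    window: int = 1,
) -> str:
    """Assemble a hypothetical evaluation prompt for one transcript sentence.

    The prompt carries the global transcript analysis plus up to `window`
    neighbouring sentences on each side, so the judging LLM sees context.
    """
    before = sentences[max(0, index - window):index]
    after = sentences[index + 1:index + 1 + window]
    parts = [
        "Global transcript analysis:",
        global_analysis,
        "Previous sentence(s): " + (" ".join(before) if before else "(none)"),
        "Target sentence: " + sentences[index],
        "Next sentence(s): " + (" ".join(after) if after else "(none)"),
        "Question: does the target sentence continue the topic of the "
        "previous sentence(s)?",
    ]
    return "\n".join(parts)


# Toy transcript (invented for illustration).
transcript = [
    "Today we cover gradient descent.",
    "The update rule subtracts the scaled gradient.",
    "Next, a word about office hours.",
]
prompt = build_sentence_prompt(transcript, 1, "A lecture on optimization basics.")
print(prompt)
```

The string returned here would be sent to whichever LLM acts as the judge; the call itself is omitted because the abstract does not name a model or API.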

2606.14760 2026-06-16 cs.CV cs.AI New submission

GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

Yu Luo, Kun Hu, Mengwei He, Xiaogang Zhu, Shan Zeng, Allen Benter, Wei Xiang, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

Affiliations: The University of Sydney; Edith Cowan University; Adelaide University; Wuhan Polytechnic University; Climate, Orange Agricultural Institute; La Trobe University

AI summary: Proposes GeoRoPE, which addresses scale mismatch in remote-sensing foundation models through geo-coordinate calibration and geo-frequency calibration, improving cross-resolution robustness and scale-sensitive representation learning.

Abstract

Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose GeoRoPE, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, Geo-Coordinate Calibration (GCC) rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, Geo-Frequency Calibration (GFC) adjusts the native RoPE frequency with a relation-specific factor, enabling position-sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.
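
The scale mismatch that GCC targets can be made concrete with a small sketch. Everything here is an assumption for illustration: the function names, the square-patch convention, and the way the GFC factor enters the rotation angle are not taken from the paper, whose abstract gives no formulas.

```python
from typing import Tuple


def geo_calibrated_offset(
    token_offset: Tuple[int, int],
    gsd_m_per_px: float,
    patch_size_px: int,
) -> Tuple[float, float]:
    """Convert a token-grid offset (rows, cols) into metres on the ground.

    One token-grid step covers patch_size_px pixels, i.e.
    patch_size_px * gsd_m_per_px metres (square patches assumed).
    """
    step_m = gsd_m_per_px * patch_size_px
    return (token_offset[0] * step_m, token_offset[1] * step_m)


def rope_angle(rel_coord: float, base_freq: float, freq_factor: float = 1.0) -> float:
    """Rotation angle for one RoPE frequency.

    freq_factor is a placeholder for a relation-specific GFC factor; how the
    paper computes it is not stated in the abstract.
    """
    return rel_coord * base_freq * freq_factor


offset = (0, 3)  # same token offset: three tokens to the right
coarse = geo_calibrated_offset(offset, gsd_m_per_px=10.0, patch_size_px=16)
fine = geo_calibrated_offset(offset, gsd_m_per_px=0.5, patch_size_px=16)
print(coarse)  # (0.0, 480.0) metres
print(fine)    # (0.0, 24.0) metres
```

The same three-token offset spans 480 m at a 10 m GSD but only 24 m at a 0.5 m GSD, which is the physical inconsistency the abstract describes for grid-based positional priors.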

2606.14759 2026-06-16 cs.CV cs.AI New submission

Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

Yiheng Cao, Gustavo Andrade-Miranda, Jiatian Zhang, Guillaume Sallé, Xin Gao

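Affiliations: Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences; SyCoIA, IMT Mines Ales

AI summary: Proposes a text-to-video generation method that decouples cardiac spatial structure from temporal motion: a fine-tuned diffusion model synthesizes the initial frame, and a latent flow model conditioned on a cardiac phase embedding generates the full motion, achieving high temporal consistency and anatomical controllability.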

Journal ref ISBI 2026 - IEEE International Symposium on Biomedical Imaging, Apr 2026, London, United Kingdom. pp.1-4

Abstract

Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving a FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.

2606.14758 2026-06-16 cs.CV cs.AI New submission

Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

Emirhan Bilgiç, Baptiste Caramiaux, Zhi Yan, Gianni Franchi

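Affiliations: U2IS, ENSTA, Institut Polytechnique de Paris; ISIR, Université Sorbonne, Pierre et Marie Curie; AMIAD, Pôle Recherche

AI summary: To address semantic hallucination in explanations of vision-language models, proposes the Linear Semantic Attribution (LSA) theoretical framework and introduces Orthogonal Semantic Projection (OSP), which orthogonalizes the query vector to remove interference from shared features and minimize hallucination.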

Comments 41 pages in total. 5 figures, and 2 tables in the main paper; 10 figures and 17 tables in the appendix

Abstract

As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps highlight prominent image regions even when prompted with incorrect text descriptions (e.g., highlighting a dog when prompted with "cat"). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing in the literature. We demonstrate that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce Orthogonal Semantic Projection (OSP), a geometric intervention that utilizes the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, rendering the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: https://github.com/emirhanbilgic/Orthogonal-Semantic-Projection
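
The core geometric operation named in this abstract, orthogonalizing a query vector against distractor concepts, can be illustrated with a generic projection residual. This is a sketch under assumptions, not the code from the linked repository: it uses plain Gram-Schmidt on toy 3-dimensional vectors, and the "cat"/"dog" embeddings are invented numbers.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(query: Vector, distractors: List[Vector], eps: float = 1e-12) -> Vector:
    """Return the component of `query` orthogonal to span(distractors).

    Builds an orthogonal basis of the distractor span with Gram-Schmidt,
    then subtracts the query's projection onto each basis vector.
    """
    basis: List[Vector] = []
    for d in distractors:
        r = list(d)
        for b in basis:
            coeff = dot(r, b) / dot(b, b)
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        if dot(r, r) > eps:  # skip linearly dependent distractors
            basis.append(r)
    residual = list(query)
    for b in basis:
        coeff = dot(residual, b) / dot(b, b)
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return residual


# Toy embeddings: axis 0 is a shared "animal" feature (hypothetical values).
cat = [1.0, 1.0, 0.0]
dog = [1.0, 0.0, 1.0]
cat_residual = orthogonalize(cat, [dog])
print(cat_residual)            # [0.5, 1.0, -0.5]
print(dot(cat_residual, dog))  # 0.0
```

After the projection, the query has zero inner product with the distractor, so any score that is linear in the query no longer responds to the distractor direction.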

2606.14757 2026-06-16 cs.CV cs.LG New submission

Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher

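Affiliations: ETH Zürich

AI summary: Proposes VIOLIN, a lightweight masked attention mechanism that encodes spatial structure via space-filling curves, injecting a spatial inductive bias into Vision Transformers with minimal parameter and compute overhead and markedly improving performance for small models and limited data.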

Comments ICML 2026

Abstract

Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This becomes particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase the performance. Beyond fine-tuning, VIOLIN improves various small-scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited data settings.
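
The mask construction described in this abstract (scan with several curves, build one decay mask per curve, combine, multiply with attention) could look roughly like the sketch below. The details are assumptions: a snake scan stands in for the space-filling curves, the decay is `gamma ** distance_along_curve`, and masks are combined by averaging. The abstract does not specify any of these choices.

```python
from typing import List

Matrix = List[List[float]]


def snake_order(h: int, w: int, by_columns: bool = False) -> List[int]:
    """pos[r * w + c] is the step at which a snake scan visits patch (r, c)."""
    pos = [0] * (h * w)
    step = 0
    outer, inner = (w, h) if by_columns else (h, w)
    for a in range(outer):
        inner_range = range(inner) if a % 2 == 0 else range(inner - 1, -1, -1)
        for b in inner_range:
            r, c = (b, a) if by_columns else (a, b)
            pos[r * w + c] = step
            step += 1
    return pos


def decay_mask(pos: List[int], gamma: float) -> Matrix:
    """mask[i][j] = gamma ** |pos[i] - pos[j]|: nearer on the curve, larger weight."""
    n = len(pos)
    return [[gamma ** abs(pos[i] - pos[j]) for j in range(n)] for i in range(n)]


def combine(masks: List[Matrix]) -> Matrix:
    """Average the curve-specific masks (one possible combination rule)."""
    n = len(masks[0])
    return [[sum(m[i][j] for m in masks) / len(masks) for j in range(n)] for i in range(n)]


def apply_mask(attn: Matrix, mask: Matrix) -> Matrix:
    """Elementwise product of the attention matrix and the mask."""
    return [[a * m for a, m in zip(arow, mrow)] for arow, mrow in zip(attn, mask)]


h, w, gamma = 2, 2, 0.5
masks = [
    decay_mask(snake_order(h, w), gamma),
    decay_mask(snake_order(h, w, by_columns=True), gamma),
]
combined = combine(masks)
attn = [[0.25] * (h * w) for _ in range(h * w)]  # uniform toy attention
masked = apply_mask(attn, combined)
print(combined[0])  # [1.0, 0.3125, 0.3125, 0.25]
```

With a row-wise and a column-wise scan combined, the horizontal and vertical neighbours of patch 0 receive the same weight (0.3125) and the diagonal patch receives less (0.25). Whether the masked attention rows are renormalized afterwards is left open here.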

2606.14756 2026-06-16 cs.CV cs.AI cs.LG New submission

Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

Abhi Gupta, Polina Barabanshchikova, Vikas Garg, Samuel Kaski, Tommi Jaakkola

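Affiliations: Massachusetts Institute of Technology; University of Washington; University of Cambridge

AI summary: Proposes Divide-and-Denoise, which coordinates multiple pre-trained diffusion models through a fair division game, partitioning the sample into regions at sampling time and guiding each model to denoise its region; this resolves model dominance and conflict and outperforms baselines on conditional image generation.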

Comments Accepted as spotlight at ICML 2026

Abstract

The abundance of pre-trained diffusion models provides an opportunity for composition. Combining several models, however, runs the risk of one model dominating or models disagreeing with each other. Here, we propose Divide-and-Denoise, a method for coordinating multiple pre-trained diffusion models during sampling. Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models. Central to our method is the notion of an allocation which defines the responsibility of each model to every region of the noisy sample. At every timestep, we then denoise by (i) updating the allocation by solving a fair division game, where we divide the sample into regions that maximize total utility under fairness constraints, and (ii) aligning the models with this allocation, where we guide each model to denoise within its assigned region. This leads to a new composite denoising process that evolves in tandem with a division process. We evaluate Divide-and-Denoise on conditional image generation. Across several quality metrics, including the GenEval benchmark, our method outperforms baselines and resolves common failures including missing objects and mismatched attributes. Experiments show that Divide-and-Denoise utilizes each model's expertise without neglecting any other model.
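
To make the idea of an allocation more tangible, here is a toy sketch of one timestep's "divide, then compose" logic. It is heavily simplified and not the paper's algorithm: the utilities are given numbers, the fairness constraint is reduced to "every model owns at least `min_share` regions", and the solver is a greedy heuristic rather than the fair division game the authors solve. All function names are hypothetical.

```python
from typing import List


def fair_allocation(utility: List[List[float]], min_share: int) -> List[int]:
    """Assign each region to one model; utility[k][p] is model k's utility on region p.

    Starts from the utility-maximizing assignment, then greedily moves the
    cheapest regions to any model that owns fewer than min_share regions.
    """
    num_models = len(utility)
    num_regions = len(utility[0])
    if num_models * min_share > num_regions:
        raise ValueError("min_share is infeasible for this many regions")
    owner = [
        max(range(num_models), key=lambda k: utility[k][p])
        for p in range(num_regions)
    ]
    counts = [owner.count(k) for k in range(num_models)]
    for k in range(num_models):
        while counts[k] < min_share:
            # Regions whose current owner can spare one without becoming deficient.
            donors = [
                p for p in range(num_regions)
                if owner[p] != k and counts[owner[p]] > min_share
            ]
            # Pick the region where handing over costs the least utility.
            p_best = min(donors, key=lambda q: utility[owner[q]][q] - utility[k][q])
            counts[owner[p_best]] -= 1
            owner[p_best] = k
            counts[k] += 1
    return owner


def compose(predictions: List[List[float]], owner: List[int]) -> List[float]:
    """Composite prediction: each region takes the output of its assigned model."""
    return [predictions[owner[p]][p] for p in range(len(owner))]


utility = [[0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.5]]  # model 0 dominates everywhere
owner = fair_allocation(utility, min_share=1)
composite = compose([[1.0] * 4, [2.0] * 4], owner)
print(owner)      # [0, 0, 0, 1]
print(composite)  # [1.0, 1.0, 1.0, 2.0]
```

Without the minimum-share constraint, model 0 would own every region, which mirrors the "one model dominating" failure mentioned in the abstract; with it, model 1 receives the region where giving it up costs the least.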

2606.14755 2026-06-16 cs.CV cs.AI New submission

Where Does Texture Evidence Live in SAM? Features, Proposal Masks, and Texture Segmentation

Nadav Orenstein, Aviad Cohen Zada, Shai Avidan, Gal Oren

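Affiliations: Tel Aviv University; Stanford University; Technion

AI summary: Studies whether texture-relevant evidence exists in a frozen Segment Anything Model (SAM), analyzing multiscale features with a minimal clustering readout and the automatic proposal bank with a supervised readout; finds that SAM is not texture-blind and that its default failures stem from readout mismatch and commitment failure.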

Comments 26 pages, 13 figures, 20 tables. Code available at https://github.com/Scientific-Computing-Lab/ArchiTexture

Abstract

Texture segmentation stresses foundation segmentation because meaningful regions are defined by material or repeated appearance rather than object identity. Segment Anything Models (SAMs) often fail by default on such texture-defined partitions, but this failure is ambiguous: the texture evidence may be absent, missing from the proposal bank, or present but selected or assembled incorrectly by an object-centric readout. We ask what texture-relevant evidence is already preserved in frozen SAM before adaptation. We study two frozen evidence spaces: multiscale features, probed with a minimal clustering readout, and the automatic proposal bank, treated as evidence for a supervised consolidation readout. SAM is frozen throughout; we do not fine-tune the backbone or retrain the proposal generator. Across RWTD, STLD, an ADE20K-selected refined-crop complement, and a ControlNet-stitched PTD bridge archive, frozen SAM is not a texture segmenter by default, but its failures are not simple texture blindness. Coarse frozen features preserve texture organization, and proposal banks often contain texture-aligned masks or fragments. Natural scenes more often require assembly and commitment over fragments, while cleaner synthetic cases more often reduce to selecting an already coherent proposal. Default mask failure should therefore be decomposed into representation evidence, proposal-bank support, readout mismatch, and commitment failure.

2606.14754 2026-06-16 cs.CV cs.AI New submission

Sub-Semantic Image Segmentation

Aviad Cohen Zada, Nadav Orenstein, Shai Avidan, Gal Oren

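Affiliations: Tel Aviv University; Stanford University; Technion

AI summary: Proposes sub-semantic image segmentation by coupling a vision-language model with SAM and introducing DETECTURE to resolve language leakage, prompt competition, and semantic distortion, achieving the best performance on the self-built TextureADE dataset.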

Comments 23 pages. Code: https://github.com/Scientific-Computing-Lab/TextureDetecture

Abstract

Images can be segmented based on visual cues (i.e., texture segmentation) or into objects (i.e., semantic segmentation). We propose a new category of sub-semantic image segmentation that blurs the line between the two. In sub-semantic image segmentation, language is not used to name whole objects. Instead, it is used to partition an image into stable appearance patterns that can be described by language. To do that, we couple a general-purpose vision-language model to SAM 3, a promptable segmentation backbone whose native text pathway can ground rich descriptions into masks. Simple coupling fails for a number of reasons that we identify in the paper, and we overcome them by introducing DETECTURE that resolves three concrete failure modes -- language leakage between texture regions, prompt competition inside the segmentation backbone, and semantic distortion at the language-to-mask interface. Since there is no dataset of sub-semantic image segmentation, we introduce one, termed TextureADE. The new dataset is derived from the ADE20K dataset using a system we designed. We compare DETECTURE to a number of baselines and find that it achieves the strongest performance on several datasets using different metrics. Code is available at https://github.com/Scientific-Computing-Lab/TextureDetecture.

2606.14753 2026-06-16 cs.CV cs.AI New submission

Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

Chiradeep Ghosh, Dakshina Ranjan Kisku

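Affiliations: Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, India

AI summary: Proposes a probabilistic transformer based on a Gaussian Mixture Model and the EM algorithm that reduces self-attention complexity from quadratic to linear, enabling efficient image captioning on Flickr30K.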

Comments 8 pages, 8 figures

Abstract

Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. Accomplishing this task requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations, such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model focuses on improving computational efficiency by restructuring the vision transformer architecture. In this approach, the standard self-attention mechanism in Vision Transformers is replaced with a probabilistic transformer approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all image patches, the model groups similar patches into a fixed number of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic O(n^2) to linear O(nK), where K << n. An autoregressive GPT-based decoder is used for caption generation. The model is evaluated on the Flickr30K dataset, demonstrating competitive performance and significant improvement over existing works.
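
The complexity argument in this abstract rests on replacing an n x n interaction table with an n x K table of soft cluster assignments. A minimal EM sketch shows where the O(nK) cost comes from. It is a generic isotropic-Gaussian mixture on toy 1-D "patch features", written under assumptions: fixed variance, uniform mixing weights by default, and no claim to match the paper's layer design.

```python
import math
from typing import List, Optional


def e_step(
    patches: List[List[float]],
    means: List[List[float]],
    variance: float = 1.0,
    weights: Optional[List[float]] = None,
) -> List[List[float]]:
    """Responsibilities resp[i][k] of cluster k for patch i: an n x K table."""
    k = len(means)
    if weights is None:
        weights = [1.0 / k] * k
    resp = []
    for x in patches:
        logits = [
            math.log(weights[j])
            - sum((a - b) ** 2 for a, b in zip(x, means[j])) / (2.0 * variance)
            for j in range(k)
        ]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        resp.append([e / total for e in exps])
    return resp


def m_step(patches: List[List[float]], resp: List[List[float]]) -> List[List[float]]:
    """Update each cluster mean as a responsibility-weighted average of patches."""
    k = len(resp[0])
    dim = len(patches[0])
    means = []
    for j in range(k):
        total = max(sum(r[j] for r in resp), 1e-12)
        means.append(
            [sum(r[j] * x[d] for r, x in zip(resp, patches)) / total for d in range(dim)]
        )
    return means


patches = [[0.0], [0.2], [5.0], [5.2]]  # toy 1-D features, two obvious groups
means = [[0.0], [5.0]]
for _ in range(5):
    resp = e_step(patches, means)
    means = m_step(patches, resp)
print(means)
```

Each EM pass touches n x K patch-cluster pairs. As an illustrative count (the numbers are not from the paper), n = 196 patches with K = 8 clusters gives 1,568 pairs per pass, against 38,416 pairs for full pairwise attention.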

2606.14752 2026-06-16 cs.CV cs.AI cs.LG cs.RO New submission

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

Xirui Kang, Yanpei Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Danpeng Chen, Xiaoyi Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

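Affiliations: Square Robot; City University of Hong Kong; Tsinghua University

AI summary: Proposes X-Tokenizer, which discretizes actions into a semantic interface through Semantic Residual Quantization (SRQ) and Masked Action Modeling (MAM); after pretraining on 2.4M trajectories, it improves the multimodal grounding and long-horizon task performance of VLA models.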

Comments Project page: https://x-square-robot.github.io/X-Tokenizer_projectPage/

Abstract

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
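
SRQ is described as an asymmetric structure imposed on residual vector quantization (RVQ). For readers unfamiliar with RVQ itself, the sketch below shows the plain mechanism: each level quantizes what the previous levels left over. It is generic RVQ with hand-written toy codebooks; the MAM training of the first level and every other SRQ-specific detail are not shown, and the codebook values are invented.

```python
from typing import List, Tuple

Vector = List[float]


def nearest(codebook: List[Vector], v: Vector) -> int:
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(x: Vector, codebooks: List[List[Vector]]) -> Tuple[List[int], Vector]:
    """Quantize x level by level; return one code per level and the final residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: List[int], codebooks: List[List[Vector]]) -> Vector:
    """Reconstruct by summing the selected codeword from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


coarse = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]      # level 1: rough direction
fine = [[0.25, -0.5], [0.25, 0.5], [-0.25, -0.5], [-0.25, 0.5]]  # level 2: correction
action = [1.3, -0.4]  # toy 2-D action
codes, leftover = rvq_encode(action, [coarse, fine])
print(codes)                            # [0, 0]
print(rvq_decode(codes, [coarse, fine]))  # [1.25, -0.5]
```

The first code alone already says "move along +x", while the second refines the magnitude. That split between a coarse first level and detail-carrying deeper levels is the structure the abstract says SRQ exploits.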

2606.14749 2026-06-16 cs.CV cs.AI New submission

Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

Chih-Wei Huang, Chang-Wen Huang, Chung-Ping Chiang, Tsung-Wei Pan

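Affiliations: AI Research Center, National Taiwan Ocean Univ.; Dept. of Aquaculture, National Taiwan Ocean Univ.; Center of Excellence for the Oceans, National Taiwan Ocean University

AI summary: Proposes a high-throughput 3D behavioral phenotyping framework that combines deep-learning object detection with binocular stereo vision, enabling real-time monitoring, body-length estimation, and 3D trajectory reconstruction of juvenile fish in high-density environments; it quantifies the true physical speed of free-swimming juvenile fish for the first time and establishes a circadian locomotion baseline for early warning of physiological stress.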
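
How a stereo pair yields physical speed can be shown with the textbook pinhole relations. This sketch assumes a rectified stereo rig and uses made-up calibration values; the function names, parameters, and numbers are hypothetical and are not taken from the paper, which is summarized here without implementation detail.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def triangulate(
    u_left: float,
    u_right: float,
    v: float,
    focal_px: float,
    baseline_m: float,
    cx: float,
    cy: float,
) -> Point3D:
    """3D position (metres, left-camera frame) from a rectified stereo match.

    Depth follows Z = f * B / d, where d is the horizontal disparity in pixels.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    z = focal_px * baseline_m / disparity
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


def speed_m_per_s(p0: Point3D, p1: Point3D, dt_s: float) -> float:
    """Physical speed between two 3D positions observed dt_s seconds apart."""
    return math.dist(p0, p1) / dt_s


# Hypothetical calibration: 1000 px focal length, 10 cm baseline, 1280x720 image.
p0 = triangulate(640.0, 440.0, 360.0, 1000.0, 0.1, 640.0, 360.0)
p1 = triangulate(700.0, 500.0, 440.0, 1000.0, 0.1, 640.0, 360.0)
print(p0, p1)
print(speed_m_per_s(p0, p1, dt_s=0.1))
```

Both detections have a disparity of 200 px, so both lie about 0.5 m from the camera; the second is displaced roughly 3 cm in x and 4 cm in y, giving about 0.5 m/s over a 0.1 s frame interval.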

2606.14748 2026-06-16 cs.CV cs.AI New submission

Is My Vision-Language Data in Your AI? Membership Inference Test (MINT) Demo 2

Daniel DeAlcala, Gonzalo Mancera, Julian Fierrez, Aythami Morales, Ruben Tolosana, Ruben Vera-Rodriguez

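Affiliations: Universidad Autonoma de Madrid

AI summary: Presents the Membership Inference Test (MINT) framework, which detects training data using several architectures, reaches up to 90% accuracy on face recognition and LLMs, and is accompanied by a multimodal auditing platform.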

Comments IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

Abstract

We present the Membership Inference Test (MINT) Demo 2, a framework designed to improve transparency in machine learning training processes. MINT is a technique for experimentally determining whether specific data were used during machine learning model training. We establish the theoretical framework and propose multiple architectures for MINT depending on the amount of information known about the models that are being audited. Experimental results using a popular face recognition model, 4 state-of-the-art LLMs, and multiple, diverse, and large-scale public image and text databases achieve promising accuracy levels in the detection of training data of up to 90%. Building on these results, we introduce a comprehensive web platform that expands these capabilities to image and text modalities. The platform integrates a diverse technological stack, including MINT, aMINT, and gMINT, allowing users to audit a wide range of models. This demonstrator aims to promote AI transparency and provides a practical tool to foster compliance with emerging AI regulations.

2606.14747 2026-06-16 cs.CV cs.AI New submission

MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

Haitian Wang, Ruoxi Sun, Quantong Qiu, Juntao Li, Junhui Li, Hua Chen, Jinxiong Chang, Min Zhang

Affiliations: Soochow University; Ant Group

AI summary: To address the lack of systematic evaluation of multimodal embedding models in long-context scenarios, proposes MMLongEmbed, the first comprehensive benchmark, covering retrieval tasks over text, document, and video modalities; it reveals bottlenecks such as reliance on superficial feature matching and difficulty capturing deep semantic dependencies.

Abstract

Recent advancements have significantly expanded the theoretical context windows of Multimodal Embedding Models (MEMs). However, larger context windows do not necessarily translate into effective comprehension and representation of long-context multimodal inputs, which remains a critical bottleneck for real-world deployment. To address the lack of systematic evaluation in this setting, we introduce MMLongEmbed, the first comprehensive benchmark for evaluating MEMs in long-context scenarios. MMLongEmbed comprises four retrieval tasks spanning multiple context-length ranges, covering text, document, and video modalities. Through extensive evaluation of state-of-the-art models, we find that current architectures rely heavily on superficial feature matching and struggle to capture deep semantic and structural dependencies. We further observe that performance degradation varies systematically with context length and key information placement. Moreover, models exhibit substantially different robustness to redundant contextual information across modalities. For reproducibility, the benchmark and code are publicly available.

2606.14746 2026-06-16 cs.CV New submission

Style-CCL: Content-Preserving Style Transfer via Curriculum Continual Learning

Shiwen Zhang, Haoyuan Wang, Xianghao Zang, Haibin Huang, Chi Zhang, Xuelong Li

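Affiliations: Institute of Artificial Intelligence (TeleAI), China Telecom

AI summary: To address the entanglement of content and style features in Diffusion Transformers for style transfer, proposes Style-CCL, a multi-stage curriculum continual learning framework that trains in stages from semantic to texture styles and from clean to synthetic data, uses random memory rehearsal to prevent catastrophic forgetting, and achieves state-of-the-art style similarity, content consistency, and aesthetic quality.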

Comments code and models of QwenStyle are released at https://github.com/witcherofresearch/Qwen-Image-Style-Transfer/ and https://github.com/Tele-AI/TeleStyle/

Abstract

Content-preserving style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to entangled content and style features. With a reverse triplet synthesis pipeline to build a million-scale training set and a dual-branch Style-Content DiT (SC-DiT) that decouples style and content via separate RoPE embeddings and causal masking, we observe that a one-stage training paradigm on mixed style categories causes semantic styles to dominate, hindering texture style learning and harming content preservation. To address these issues, we propose Style-CCL, a Multi-Stage Curriculum Continual Learning framework that trains SC-DiT from semantic (easy) to texture (hard) styles, and from clean to synthetic data, with Random Memory Rehearsal across stages to avoid catastrophic forgetting. Extensive experiments demonstrate that our Style-CCL achieves state-of-the-art performance in three core metrics: style similarity, content consistency, and aesthetic quality.

2606.14741 2026-06-16 cs.CV cs.LG New submission

HorusEye: Language as Dynamic Attention for Emergency Visual Analysis

Armel Yara

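Affiliations: Armel Yara

AI summary: Proposes the HorusEye framework, which uses language feedback to dynamically guide visual analysis; evaluates several VLMs in emergency scenarios, finds that the effect of language feedback is model-dependent, and reveals a cropping paradox in thermal imagery.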

Comments 18 pages, 9 figures, 11 tables

Abstract

We introduce HorusEye, Language as Dynamic Attention for Emergency Visual Analysis. Our investigation follows five stages. The first is benchmarking RefCOCO-Degraded, a dataset of 15,244 images (3,811 base images × 4 conditions: Clean, Fog, Smoke, and Thermal) with systematic visual degradation. Through four research questions, we evaluate multiple VLMs (Gemini, Qwen2-VL, BLIP-2, LLaVA, Kosmos-2) across visual grounding (the second stage), language feedback recovery (the third), health VQA tasks (the fourth), and hallucination analysis (the final stage). Our key finding is that language feedback effectiveness is model-dependent: Gemini achieves a +47.3% improvement in thermal conditions through iterative language feedback, while Qwen2-VL shows a -5.1% degradation under the same protocol. We also identify the 'Thermal Paradox', where cropping strategies that improve RGB performance fail catastrophically in thermal imagery. Furthermore, BLIP-2 uniquely hallucinates more under degradation, making it unsuitable for emergency deployment.

2606.14740 2026-06-16 cs.CV New submission

GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

Sujay Belsare, Sudarshan Nikhil, Sushant Kumar, Ponnurangam Kumaraguru, Chirag Agarwal

发表机构 * IIIT Hyderabad（印度海得拉巴国际信息技术学院）； University of Virginia（弗吉尼亚大学）

AI总结提出GridVQA-X诊断框架，通过合成数据生成数学保证的解释，并训练纯推理与捷径依赖的配对模型，揭示现有可解释性方法无法区分真实跨模态推理与浅层捷径。

Comments 23 pages, 15 Figures, Accepted for poster presentation at CVPR 2026 TRUE-V Workshop

详情

AI中文摘要

随着视觉-语言模型的不断发展，其预测结果对相关利益方具有可解释性变得至关重要。然而，可解释性领域并未跟上多模态发展的步伐。尽管最近的多模态可解释人工智能（MxAI）方法生成解释以归因不同模态之间的交互，但当前的评估协议缺乏区分真正跨模态推理（例如，空间组合）与浅层跨模态捷径（例如，词袋属性匹配）所需的地面真相。目前尚不清楚MxAI方法是否忠实地捕捉了协同交互，或者仅仅是对作为简单特征检测器的模型进行推理幻觉。在本文中，我们介绍了GridVQA-X，这是第一个专门设计用于评估跨模态可解释性的诊断框架。与自然数据集不同，GridVQA-X利用封闭世界合成逻辑生成独特的、数学上保证的解释。我们利用这个受控环境，在相同的架构上训练配对的真实模型：$M_{\ ext{pure}}$，学习稳健的空间关系推理，以及$M_{\ ext{spur}}$，结构上被迫依赖跨模态捷径。这种行为差异创建了一个严格的测试平台：一个忠实的解释器必须为每个模型报告不同的推理路径。我们的发现表明，广泛使用的方法无法区分依赖真正空间关系推理的模型和利用跨模态捷径的模型，突显了在捕捉真正跨模态协同方面的关键差距，并错误地表示了多模态模型实际如何做出决策。

英文摘要

With the increasing development of Vision-Language Models, it becomes imperative that their predictions are readily explainable to relevant stakeholders. However, the field of explainability has not kept pace with the multimodal surge. While recent Multimodal Explainable AI (MxAI) methods generate explanations to attribute the interaction between different modalities, current evaluation protocols lack the ground truth required to distinguish between true cross-modal reasoning (e.g., spatial composition) and shallow cross-modal shortcuts (e.g., Bag-of-Words attribute matching). It remains unknown whether MxAI methods faithfully capture synergistic interactions or merely hallucinate reasoning on models acting as simple feature detectors. In this paper, we introduce GridVQA-X, the first diagnostic framework specifically designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. We utilize this controlled environment to train paired ground-truth models on identical architectures: $M_{\text{pure}}$, which learns robust spatial-relational reasoning and $M_{\text{spur}}$, which is structurally forced to rely on cross-modal shortcuts. This behavioral divergence creates a rigorous testbed: a faithful explainer must report distinct reasoning pathways for each model. Our findings reveal that widely used methods fail to distinguish between models relying on genuine spatial-relational reasoning and those exploiting cross-modal shortcuts, highlighting a critical gap in capturing true cross-modal synergy and misrepresenting how multimodal models actually make decisions.

2606.14735 2026-06-16 cs.CV New submission

UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

Romiyal George, Sathiyamohan Nishankar, Selvarajah Thuseethan, Roshan G. Ragel

Affiliations * University of Peradeniya; Charles Darwin University

AI summary: Proposes UtVAA, an ultra-tiny ViT architecture whose Affix Attention block combines local and global features, achieving high-accuracy image classification at very low parameter counts and FLOPs, suitable for mobile devices.

Comments 13 pages, 7 figures

Abstract

Vision Transformers (ViTs) have demonstrated strong representation capability in image classification. However, their quadratic self-attention complexity and large parameter counts limit deployment on resource-constrained mobile and edge devices. This paper introduces UtVAA, an ultra-tiny Vision Transformer architecture designed for efficient visual recognition under strict computational budgets. It incorporates a novel Affix Attention block that combines depthwise-pointwise local feature extraction, linear self-attention, coordinate attention for spatial dependency modelling, and a lightweight ternary fusion strategy to integrate local and global representations. In addition, Dilated Bottleneck blocks expand the receptive field using dilated depthwise separable convolutions while maintaining low FLOPs and stable optimisation through residual connections. UtVAA is implemented in scalable Tiny, Medium, and Large variants, with the smallest model containing 204.67K parameters and 53.95M FLOPs. Experimental results on CIFAR-10, CIFAR-100, PlantVillage-Tomato and SLIF-Tomato datasets show that UtVAA achieves competitive accuracy within a sub-million-parameter regime. Overall, the results demonstrate that transformer-based vision models can be redesigned into ultra-tiny architectures without significant loss in discriminative performance, making UtVAA suitable for mobile and edge deployment. Code is available at https://github.com/romiyal/UtVAA
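
To make the parameter savings behind depthwise separable convolutions concrete, the following is a minimal sketch that compares weight counts for a standard convolution and its depthwise-plus-pointwise factorization. The layer sizes are hypothetical and are not taken from UtVAA; the function names are illustrative only.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k standard convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """One k x k depthwise filter per input channel, then a 1 x 1 pointwise mix."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))
```

For these sizes the factorized layer needs roughly an eighth of the weights, which is the kind of saving that makes sub-million-parameter budgets plausible.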

2606.14732 2026-06-16 cs.CV cs.AI cs.LG cs.MM New submission

Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

Matiur Rahman Minar, Seunghun Oh, GangHyeon Jeong, Unsang Park

Affiliations * Department of Computer Science and Engineering, Sogang University; Department of Artificial Intelligence, Sogang University

AI summary: Proposes the Steady-Forcing framework, which uses visual anchors, motion memory, distillation, and related techniques to balance background stability and motion continuity in long-horizon fixed-camera nature video generation, outperforming existing methods.

Comments Project page: https://minar09.github.io/steadyforcing/

Abstract

Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long-horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggests that generic VBench aggregate scores under-penalize fixed-camera artifacts and reward drift-induced optical flow as Dynamic Degree, while not directly penalizing texture hardening or flow stagnation, which motivates future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/
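
The exponential moving-average memory mentioned above can be illustrated with a generic EMA update. This is only a sketch of the standard formula applied to a plain feature vector; how EMA-Sink actually stores and applies its memory is not described in the abstract, and the `decay` value and function name are assumptions.

```python
def ema_update(memory, new_value, decay=0.9):
    """Blend the running memory with a new observation, element by element.

    A decay close to 1 changes the memory slowly (stable but sluggish);
    a smaller decay tracks recent motion more closely.
    """
    return [decay * m + (1.0 - decay) * x for m, x in zip(memory, new_value)]


# Hypothetical 2-dimensional motion feature observed over three blocks.
memory = [0.0, 0.0]
for observation in ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]):
    memory = ema_update(memory, observation, decay=0.5)
print(memory)
```

The single `decay` parameter is where a stability-motion trade-off would surface in such a scheme.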

2606.14731 2026-06-16 cs.CV New submission

BBR-Net: Boundary-Balanced Replay for Continual Medical Image Segmentation

Zahid Ullah, Sieun Choi, Jihie Kim

Affiliations * Department of Computer Science and Artificial Intelligence, Dongguk University

AI summary: Proposes the Boundary-Balanced Replay Network (BBR-Net), which selects replay samples by boundary-aware priority and class balance, reducing catastrophic forgetting in continual cardiac ultrasound segmentation while preserving target-domain adaptation.

Abstract

Continual learning for medical image segmentation remains challenging under domain shift because replay-based methods often preserve appearance information without explicitly modeling anatomical structure. This study investigates whether structural consistency governs knowledge retention in continual cardiac ultrasound segmentation. We propose the Boundary-Balanced Replay Network (BBR-Net), which selects replay samples using boundary-aware priority and class balance to preserve anatomically informative regions. The method is evaluated on CAMUS and CardiacNet under forward (CAMUS to CardiacNet) and reverse (CardiacNet to CAMUS) task orders. In the forward setting, BBR-Net retains source-task performance close to an offline joint-training reference, while markedly reducing catastrophic forgetting and preserving competitive target-task adaptation. Ablation results show that boundary-aware prioritization contributes to retention and improves the balance between source-task preservation and target-task adaptation when combined with class-aware sampling. In contrast, the reverse setting reveals that structure-aware replay fails when initial representations are learned from noisy and structurally inconsistent data. To isolate this effect, we conduct a controlled structural perturbation analysis by progressively corrupting source-task boundaries while keeping the dataset, architecture, and training protocol fixed. Forgetting increases consistently as structural reliability decreases, suggesting that replay effectiveness is strongly influenced by the quality of stored structural information, rather than by memory capacity alone. These findings indicate that preserving anatomical structure under domain shift is a central factor in continual medical image segmentation, and that replay mechanisms should account for structural reliability to support robust knowledge retention.
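
One way to picture replay selection by boundary-aware priority and class balance is a round-robin over classes that always takes the highest-scoring remaining sample. The sketch below is an assumption about how such a selector could look, not BBR-Net's actual procedure; the sample tuples, the `boundary_score` field, and the class labels are hypothetical.

```python
from collections import defaultdict


def select_replay(samples, budget):
    """Pick up to `budget` sample ids, cycling over classes.

    samples: list of (sample_id, class_label, boundary_score).
    Within each class, samples with higher boundary scores are taken first.
    """
    by_class = defaultdict(list)
    for sample_id, label, score in samples:
        by_class[label].append((score, sample_id))
    for items in by_class.values():
        items.sort(reverse=True)  # highest boundary score first

    selected = []
    rank = 0
    while len(selected) < budget:
        added = False
        for label in sorted(by_class):
            if rank < len(by_class[label]) and len(selected) < budget:
                selected.append(by_class[label][rank][1])
                added = True
        if not added:  # every class is exhausted
            break
        rank += 1
    return selected


# Hypothetical replay candidates from a source task.
candidates = [("a", "lv", 0.9), ("b", "lv", 0.2), ("c", "myo", 0.5), ("d", "myo", 0.7)]
print(select_replay(candidates, budget=3))
```

Cycling over classes keeps the buffer balanced, while the per-class sort is where the boundary priority enters.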

2606.14730 2026-06-16 cs.CV New submission

Hierarchical GRU with Input-Conditioned Slot Queries for Ball Action Anticipation

Parthsarthi Rawat

Affiliations * GameChanger by Dick’s Sporting Goods

AI summary: Proposes a hierarchical model that uses a local Transformer, a GRU, and an input-conditioned event-slot decoder, combined with frequency-reweighted Hungarian matching and Gaussian soft labels, achieving 17.91% mAP on the SoccerNet benchmark.

Comments CVPR 2026 SoccerNet Ball Action Anticipation Challenge, Validated Rank 4

2606.14728 2026-06-16 cs.CV New submission

FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

Harry Zhang, Luca Carlone

Affiliations * Massachusetts Institute of Technology

AI summary: Proposes FUSE, a probabilistic framework that fuses aleatoric and epistemic uncertainty in vision-language models via Bayesian fusion into a scalar uncertainty measure that reliably predicts output correctness, achieving state-of-the-art uncertainty calibration.

2606.14727 2026-06-16 cs.CV New submission

FairGen: Preference-Aligned Diffusion for Demographically Equitable Medical Image Synthesis

Zhimin Li, Ruichen Zhang, Zhen Tan, Howard J Aizenstein, Jingtong Hu, Tianlong Chen

Affiliations * University of Pittsburgh, Swanson School of Engineering; The University of North Carolina at Chapel Hill, Department of Computer Science; Arizona State University, School of Computing and Augmented Intelligence; University of Pittsburgh, Department of Psychiatry

AI summary: Proposes the FairGen framework, which embeds physician preferences into the diffusion generation process to synthesize demographically balanced medical images, achieving fairness improvements of 95.9%, 80.0%, and 35.2% on skin, chest radiograph, and brain MRI tasks respectively while maintaining diagnostic accuracy.

Comments Accepted for publication in npj Digital Medicine. 20 pages, 6 figures

Abstract

Medical imaging is central to modern diagnostics, and artificial intelligence (AI) systems are increasingly used to support image-based analysis by improving efficiency, accuracy, and access to care. However, inequities in healthcare access and differential disease prevalence create severe demographic imbalances in clinical image data. Such imbalances are compounded by the fact that diseases can manifest with distinct features across demographic groups, rendering certain phenotypic presentations naturally rare. AI models trained on such imbalanced data risk perpetuating diagnostic bias and widening healthcare disparities. Here we introduce FairGen, a fairness-aware diffusion framework that synthesizes demographically balanced medical images while preserving pathology-relevant visual features. By embedding physician-aligned preferences into the generation process, FairGen improves subgroup coverage during synthesis and downstream classification. Applied to dermatology, radiology, and neuroimaging benchmark tasks, FairGen achieves fairness improvements of 95.9% for skin images, 80.0% for chest radiography, and 35.2% for brain MRI, while maintaining competitive diagnostic accuracy relative to models trained on original clinical data. Clinician-facing expert review and external validation on independent cohorts further support that these gains extend beyond standard fidelity metrics and are not confined to the original in-distribution datasets.

2606.14725 2026-06-16 cs.CV New submission

Interpolation between Convolution and Attention via K-Nearest Neighbors

Mingi Kang

Affiliations * Bowdoin College

AI summary: Proposes ConvNN, a unified framework that treats convolution and self-attention as special cases of k-nearest-neighbor aggregation, with configurable similarity functions and neighbor selection strategies enabling continuous interpolation between local and global aggregation.

Comments Undergraduate Thesis in Computer Science at Bowdoin College

Abstract

The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. Convolutional Neural Networks are defined by spatially local convolution operations, while Transformers rely on global self-attention. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and weighted aggregation. Convolution selects neighbors by spatial proximity while self-attention selects by feature similarity, revealing that they lie on a continuous spectrum rather than representing categorically different computations. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. ConvNN exactly recovers standard and depthwise convolution by restricting neighbor selection to normalized spatial coordinates, and exactly recovers self-attention and its sparse variants, including KVT-attention, by replacing spatial proximity with scaled dot-product similarity. Beyond these special cases, ConvNN serves as a drop-in replacement for both convolution and attention layers, enabling systematic exploration of the intermediate spectrum between local and global aggregation through configurable similarity functions, neighbor selection strategies, positional encodings, and aggregation kernels.
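
The idea that convolution and self-attention are both "select neighbors, then aggregate with weights" can be sketched on a tiny 1D sequence. This is an illustrative toy under stated assumptions, not the ConvNN implementation: values are scalars, there are no learned projections, and all function names are hypothetical.

```python
import math


def knn_aggregate(values, score, k, weight):
    """For each position i, pick the k highest-scoring positions j and
    return the weighted sum of their values."""
    n = len(values)
    out = []
    for i in range(n):
        neighbors = sorted(range(n), key=lambda j: score(i, j), reverse=True)[:k]
        w = weight(i, neighbors)
        out.append(sum(wj * values[j] for wj, j in zip(w, neighbors)))
    return out


values = [1.0, 2.0, 3.0, 4.0]

# Convolution-like: neighbors chosen by spatial proximity, fixed kernel weights.
KERNEL = {-1: 0.25, 0: 0.5, 1: 0.25}


def spatial_score(i, j):
    return -abs(i - j)


def kernel_weights(i, neighbors):
    # Neighbors outside the kernel support get weight 0 (acts like zero padding).
    return [KERNEL.get(j - i, 0.0) for j in neighbors]


# Attention-like: neighbors chosen by feature similarity, softmax weights.
def feature_score(i, j):
    return values[i] * values[j]


def softmax_weights(i, neighbors):
    scores = [feature_score(i, j) for j in neighbors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


conv_like = knn_aggregate(values, spatial_score, 3, kernel_weights)
attn_like = knn_aggregate(values, feature_score, len(values), softmax_weights)
print(conv_like)  # matches a zero-padded 3-tap filter: [1.0, 2.0, 3.0, 2.75]
print(attn_like)
```

Swapping the score function and the value of `k` moves the same routine between a local filter and global softmax attention, which is the spectrum the abstract describes.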

2606.14724 2026-06-16 cs.CV cs.AI New submission

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Xinze Zhang

Affiliations * University of Southern California

AI summary: Proposes the VigilFormer framework, which combines deformable spatio-temporal attention with causal temporal modeling and uses sparse attention, contrastive multiple-instance learning, and adaptive frame skipping to achieve real-time anomaly detection while maintaining high accuracy.

Abstract

Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.
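
A dilated causal convolution, as used by the classifier described above, only looks at the current and earlier time steps, spaced `dilation` apart. The sketch below shows the operation on a plain list of scalars; it is a generic illustration rather than the CAC module itself, and the kernel and input values are made up.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """y[t] = sum_i kernel[i] * x[t - i * dilation], with x treated as 0 before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:  # never read from the future or before the start
                acc += w * x[idx]
        out.append(acc)
    return out


# Hypothetical snippet-level scores and a 2-tap kernel with dilation 2.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dilated_causal_conv1d(signal, kernel=[1.0, 1.0], dilation=2))
```

Because output `t` depends only on inputs at or before `t`, changing a later frame cannot alter an earlier score, which is what makes the model usable on streaming video.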

2606.14723 2026-06-16 cs.CV New submission

Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

Durga Sandeep Saluru

Affiliations * Independent Researcher

AI summary: To address the single-model accuracy ceiling and the failure of self-consistency strategies in implicit video question answering, proposes a label-free, training-free disagreement-driven cross-model routing method that routes disagreeing samples to a second model, improving average accuracy by 1.43 on the ImplicitQA benchmark.

Abstract

We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout. On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4.8) that consumes uniformly sampled frames with adaptive thinking. On the 1001-question validation set with public ground truth -- our main evaluation -- the method improves AvgAcc by +1.43 over the best single sample of the primary model, with per-category gains concentrated on Motion & Trajectory (+5.49), Inferred Counting (+3.45), and Vertical Spatial Reasoning (+1.82) -- the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model), confirming the validation result on an independent split.
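
The routing rule described above reduces to a small decision: keep the primary model's answer when its samples agree, and consult a second model only when they do not. The following is a minimal sketch with placeholder callables; the function name and the way the fallback is invoked are assumptions, and no real model API is called.

```python
def route_answer(primary_samples, fallback):
    """Return (answer, source) for one multiple-choice question.

    primary_samples: answers sampled from the primary model, e.g. ["B", "B", "B"].
    fallback: zero-argument callable that queries the second model; it is
              only invoked when the primary samples disagree.
    """
    if len(set(primary_samples)) == 1:
        return primary_samples[0], "primary"
    return fallback(), "fallback"


# Hypothetical answers; the lambda stands in for a call to a second model.
print(route_answer(["B", "B", "B"], fallback=lambda: "C"))
print(route_answer(["A", "B", "B"], fallback=lambda: "C"))
```

Since the fallback is called lazily, the extra cost is paid only on the disagreeing subset, which the abstract puts at roughly 20% of questions.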

2606.14720 2026-06-16 cs.CV New submission

AI for Maritime Security: Comparative Evaluation of CNN and Vision Transformer Architectures for Maritime Object Detection

Ismet Gocer, Zakirul Bhuiayn, Shakeel Ahmad, Raza Hasan

Affiliations * Southampton Solent University School of Technology and Maritime Industries

AI summary: Evaluates six deep learning models, including CNNs and a Vision Transformer, for detecting ships on the sea surface under varied weather conditions; the ViT reaches 100% accuracy with the fastest processing time, demonstrating the potential of AI vision systems for maritime surveillance.

Comments 24 Pages

Abstract

This study aims to enhance maritime security by using advanced Artificial Intelligence (AI) and Computer Vision (CV) techniques. For this purpose, intelligent object detection systems were designed and assessed that can detect the presence of ships on the sea surface under different real-time environments. To achieve this goal, a maritime image dataset with 6,468 images was used, covering different weather conditions such as cloudy, foggy, rainy, and sunny environments. Six deep learning architectures were evaluated, including a base Convolutional Neural Network (CNN) model, four transfer learning models (Xception, VGG16, MobileNetV2, and EfficientNetV2L), and a Vision Transformer (ViT) model. The models were compared using multiple performance indicators, including accuracy, Type I and Type II errors, model size, and video processing time. The results show that model performance varies depending on computational constraints and deployment conditions. While lightweight architectures are suitable for resource-limited devices, the ViT achieved the best overall performance, reaching 100% accuracy with the lowest error rates and the fastest video processing time. The findings highlight the potential of AI-driven computer vision systems for maritime surveillance, border protection, and autonomous navigation.
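
For readers unfamiliar with the error terms used in the comparison, accuracy and Type I and Type II error rates follow directly from confusion-matrix counts. The sketch below uses made-up counts, not results from the study, and treats "ship present" as the positive class.

```python
def error_rates(tp, fp, tn, fn):
    """Return (accuracy, type_i_rate, type_ii_rate) from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    type_i = fp / (fp + tn)    # false alarms among ship-free images
    type_ii = fn / (fn + tp)   # missed ships among images that contain ships
    return accuracy, type_i, type_ii


# Hypothetical counts for a binary ship / no-ship classifier.
accuracy, type_i, type_ii = error_rates(tp=90, fp=5, tn=95, fn=10)
print(accuracy, type_i, type_ii)
```

In a surveillance setting the two error types carry different costs: a Type II error is a missed vessel, while a Type I error is a false alarm.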

2606.14716 2026-06-16 cs.CV cs.AI cs.RO New submission

RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

Kushal Khemani, Evan Leri, George Xu, Amit Hod

Affiliations * NEXEDGE Research Lab

AI summary: Proposes RAMS, a runtime controller that monitors device pressure, calibrates switching thresholds, and switches dynamically among three YOLOv8 model sizes; introduces detection-conditioned policies and a VRU-weighted accuracy score, balancing latency and accuracy across several embedded platforms.

Abstract

Edge object detection on embedded hardware requires balancing inference latency and detection quality under changing resource pressure. We present RAMS, a lightweight runtime controller that monitors device pressure, calibrates switching thresholds from idle behavior, and dynamically selects among three resident YOLOv8 tiers (NANO/SMALL/MEDIUM at 320/416/640 px) without model-reload latency. RAMS defines five switching policies, including two detection-conditioned variants that prevent aggressive downgrades after recent vulnerable-road-user (VRU) detections. We further introduce the VRU-Weighted Accuracy Score (SWAS), a scalar metric for offline policy comparison without ground-truth annotations, together with an oracle-bounded variant that separates detector circularity from genuine tier-retention benefit. Across Raspberry Pi 5, x86 laptops, and Jetson Orin ONNX/TensorRT deployments, the same controller equations operate over a 37x latency range. On Jetson Orin TensorRT under heavy load, the safety2 policy achieves 3.41 ms mean latency, 5.6x faster than fixed-MEDIUM inference, while retaining 74% of its proxy accuracy through near-NANO operation with selective SMALL and MEDIUM locks during VRU-positive windows. Detection-conditioned switching improves SWAS by 25.4% under oracle scoring and 47.3% under detector-derived scoring relative to threshold-only policies under heavy load. Live KITTI evaluation reports per-tier VRU recall of 24.2%, 41.2%, and 59.0%, showing that reactive overrides are fundamentally limited by baseline detector recall.
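
A detection-conditioned switching policy of the kind described above can be sketched as a threshold rule plus an override that blocks downgrades shortly after a VRU detection. The thresholds, the hold window, and the floor tier below are hypothetical values, and the function is an illustration rather than any of the five RAMS policies.

```python
TIERS = ["NANO", "SMALL", "MEDIUM"]  # ordered from cheapest to most accurate


def select_tier(pressure, low, high, frames_since_vru, hold_frames, floor_tier="SMALL"):
    """Choose a detector tier for the next frame.

    pressure: current resource pressure in [0, 1].
    low, high: switching thresholds (assumed to be calibrated elsewhere).
    frames_since_vru: frames since the last VRU detection, or None if none seen.
    hold_frames: length of the window in which downgrades below floor_tier are blocked.
    """
    # Threshold-only decision: more pressure means a smaller model.
    if pressure >= high:
        tier = "NANO"
    elif pressure >= low:
        tier = "SMALL"
    else:
        tier = "MEDIUM"

    # Detection-conditioned override: stay at or above the floor after a recent VRU.
    recent_vru = frames_since_vru is not None and frames_since_vru < hold_frames
    if recent_vru and TIERS.index(tier) < TIERS.index(floor_tier):
        tier = floor_tier
    return tier


print(select_tier(0.95, low=0.5, high=0.8, frames_since_vru=None, hold_frames=10))
print(select_tier(0.95, low=0.5, high=0.8, frames_since_vru=3, hold_frames=10))
```

The override only ever raises the tier, so under light load the policy behaves exactly like the threshold-only rule.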

2606.14694 2026-06-16 cs.CL New submission

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Junlong Tong, Wenqi Xu, Yingqi Fan, Anhao Zhao, Xuan Lu, Yang Tan, Xiaoyu Shen

Affiliations * Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University; The Hong Kong Polytechnic University; Southeast University; Xi’an Jiaotong-Liverpool University

AI summary: Proposes the AdaSR framework, which uses Hierarchical Relative Policy Optimization (HRPO) to enable adaptive reasoning over streaming input, achieving a better balance among reasoning accuracy, computational efficiency, and streaming latency.

Abstract

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives continuously and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than a supervised fine-tuning baseline. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
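
The contrast between a single sequence-level advantage and per-phase advantages can be illustrated with group-relative normalization applied separately to each phase. This is one possible reading of the abstract, sketched under the assumption that each rollout receives one reward for its streaming phase and one for its final deliberation; it is not the HRPO objective, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """Normalize rewards within a group of rollouts: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def phase_advantages(stream_rewards, final_rewards):
    """Return one (streaming_advantage, final_advantage) pair per rollout.

    Tokens from the streaming phase would use the first value and tokens from
    the final deliberation the second, instead of sharing a single number.
    """
    return list(zip(group_relative(stream_rewards), group_relative(final_rewards)))


# Hypothetical rewards for a group of two rollouts of the same prompt.
print(phase_advantages(stream_rewards=[1.0, 0.0], final_rewards=[0.0, 1.0]))
```

In this toy case the first rollout is credited for its streaming behavior and penalized for its final answer, a distinction that one shared advantage per sequence could not express.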

2606.14398 2026-06-16 cs.LG New submission

A theoretical model for task routing in mixture-of-expert transformers

Vinoth Nandakumar, Yongli Xiang, Yunzhi Yao, Peike Li, Tongliang Liu

Affiliations * University of Sydney; Zhejiang University; Google Research

AI summary: Uses a discrete language model to prove that a single-layer MoE Transformer can achieve task specialization through its experts, supporting empirical findings.

Abstract

Mixture-of-experts (MoE) layers enable the scaling of transformer models while keeping the inference compute fixed. While task-expert specialization has been observed in empirical studies of frontier MoE transformer models, existing theoretical work analyzes this using continuous mixture models that cannot be used to model natural language effectively. An important open question is to theoretically explain task-expert specialization in transformer MoE models using discrete models of language. To address this, we represent structured knowledge via syntactic templates and finite key-value dictionaries, and prove formally that a single-layer MoE transformer can encode knowledge by using experts that specialize in the corresponding tasks. Our construction shows how queries are routed to unique, task-specific experts whose size depends solely on the intrinsic complexity of the given task (i.e., the combined size of its syntactic templates and factual dictionary). Our construction provides theoretical support for empirical results on localized knowledge circuits in MoE models. We support our theoretical findings with experiments evaluating model performance under varying MoE loss functions.
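
The intuition of routing a query to a single task-specific expert that holds a finite key-value dictionary can be shown with a toy lookup. This is a conceptual illustration only, not the paper's transformer construction; the templates, facts, and function names are invented for the example.

```python
# Each hypothetical expert stores the key-value dictionary for one task template.
EXPERTS = {
    "capital_of": {"France": "Paris", "Japan": "Tokyo"},
    "author_of": {"Hamlet": "Shakespeare"},
}


def route(query):
    """Send a (template, key) query to the one expert for that template."""
    template, key = query
    expert = EXPERTS[template]  # top-1 routing: exactly one expert is consulted
    return expert[key]


def expert_size(template):
    """Size of an expert grows with its own dictionary, not with other tasks."""
    return len(EXPERTS[template])


print(route(("capital_of", "Japan")))
print(expert_size("capital_of"), expert_size("author_of"))
```

Adding facts to one task enlarges only that task's expert, mirroring the claim that expert size depends on the intrinsic complexity of its own task.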

2606.14238 2026-06-16 cs.RO cs.AI New submission

When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs

Abhinaw Priyadershi, Jelena Frtunikj

Affiliations * NVIDIA Corporation; NVIDIA GmbH

AI summary: For safety certification of VLA driving planners under ISO 21448, proposes a two-dimensional safety envelope approach; a GMM identifies six severity bands, revealing scenario-specific differences in risk.

Abstract

Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $\sigma \leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($\sigma = 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k = 6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate. STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $\sigma$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard.
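
The gap between an aggregate threshold and per-scenario thresholds can be sketched by finding, for each scenario, the largest noise level whose error increase stays within budget. The curves below are invented numbers, the scenario names are placeholders, and reading the 15% budget as a relative ADE increase is an assumption.

```python
def safe_threshold(curve, budget=0.15):
    """Largest sigma such that it and every smaller tested sigma stay within budget.

    curve: list of (sigma, relative_ade_increase) pairs sorted by sigma.
    Returns None if even the smallest tested sigma exceeds the budget.
    """
    safe = None
    for sigma, increase in curve:
        if increase > budget:
            break
        safe = sigma
    return safe


# Hypothetical degradation curves for two scenarios.
curves = {
    "scenario_a": [(10, 0.02), (30, 0.08), (50, 0.14), (70, 0.20)],
    "scenario_b": [(10, 0.01), (30, 0.03), (50, 0.06), (70, 0.10)],
}
per_scenario = {name: safe_threshold(c) for name, c in curves.items()}
aggregate = min(per_scenario.values())  # a single conservative value hides scenario_b
print(per_scenario, aggregate)
```

This covers only the "when" axis; the severity axis from the abstract would need a second, per-scenario measure such as the share of high-severity failures.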

Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

Where Does Texture Evidence Live in SAM? Features, Proposal Masks, and Texture Segmentation

Sub-Semantic Image Segmentation

Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

Is My Vision-Language Data in Your AI? Membership Inference Test (MINT) Demo 2

MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

Style-CCL: Content-Preserving Style Transfer via Curriculum Continual Learning

HorusEye: Language as Dynamic Attention for Emergency Visual Analysis

GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

BBR-Net: Boundary-Balanced Replay for Continual Medical Image Segmentation

Hierarchical GRU with Input-Conditioned Slot Queries for Ball Action Anticipation

FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

FairGen: Preference-Aligned Diffusion for Demographically Equitable Medical Image Synthesis

Interpolation between Convolution and Attention via K-Nearest Neighbors

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

AI for Maritime Security: Comparative Evaluation of CNN and Vision Transformer Architectures for Maritime Object Detection

RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

A theoretical model for task routing in mixture-of-expert transformers

When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs