Large Vision Models / VLM - arXivDaily Topic

2606.20161 2026-06-19 cs.CV New submission 85%

ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

Tong Wang, Siwen Wang, Yaolei Qi, Jinxing Zhou, Yuting He, Guanyu Yang, Yutong Xie

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); School of Medicine, Case Western Reserve University

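Topic match (visual grounding): uses a vision-language agent to select reliable temporal anchors, combined with SAM2, for video polyp segmentation.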

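AI summary: Proposes ARTEMIS, a framework that uses a vision-language agent to select reliable temporal anchors and combines SAM2 propagation with reliability-aware robust learning to learn high-quality video polyp segmentation masks from imperfect supervision (points, scribbles, or a few dense labels). It reports state-of-the-art performance on SUN-SEG and CVC-ClinicDB-612.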

Abstract

Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at https://github.com/wangtong627/ARTEMIS.
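The abstract says noisy supervision is down-weighted rather than discarded, but it does not give the loss itself. As a minimal sketch of that general idea only, assuming a per-sample reliability score in [0, 1], a reliability-weighted binary cross-entropy might look like the following. The function name `reliability_weighted_bce` and the weighting scheme are hypothetical and are not taken from the ARTEMIS paper.

```python
import math

def reliability_weighted_bce(probs, pseudo_labels, reliabilities, eps=1e-7):
    """Binary cross-entropy where each sample is scaled by a reliability weight.

    All three arguments are equal-length sequences; reliabilities lie in [0, 1].
    Illustrative stand-in only, not the loss defined in the ARTEMIS paper.
    """
    if not (len(probs) == len(pseudo_labels) == len(reliabilities)):
        raise ValueError("inputs must have the same length")
    total, weight_sum = 0.0, 0.0
    for p, y, r in zip(probs, pseudo_labels, reliabilities):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += r * bce
        weight_sum += r
    # Normalise by the total weight so the scale does not depend on
    # how many samples are trusted.
    return total / weight_sum if weight_sum > 0 else 0.0

# Two predictions of 0.9; the second pseudo-label (0) disagrees with the model.
uniform = reliability_weighted_bce([0.9, 0.9], [1, 0], [1.0, 1.0])
downweighted = reliability_weighted_bce([0.9, 0.9], [1, 0], [1.0, 0.1])
```

With a low reliability on the suspect label, that sample still contributes to the loss but no longer dominates it, which is the behaviour the abstract contrasts with discarding difficult samples outright.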

2606.19627 2026-06-19 cs.IR cs.AI cs.LG New submission 70%

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d'Andrea, Ellie Langhans

Affiliations: Zalando Switzerland AG; TU Wien; Zalando SE

Topic match (visual grounding): a CLIP-based multimodal retrieval system for cold-start e-commerce video.

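AI summary: To address extreme cold-start and bias problems in e-commerce video feeds, proposes VCG, a scalable multimodal retrieval system built on a domain-adapted vision-language model (CLIP-based) that enables zero-shot retrieval. Online A/B testing shows a 50% uplift in deep video completion.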

Abstract

The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.
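Retrieval in a shared semantic space, as described above, generally reduces to a nearest-neighbour search over embeddings. The sketch below illustrates that step with cosine similarity over toy 3-dimensional vectors. The video ids, the vectors, and the `retrieve` helper are all hypothetical; in a real system the embeddings would come from a CLIP-style encoder, and nothing here reflects VCG's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve(query_vec, video_index, k=2):
    """Return the ids of the k videos whose embeddings are closest to query_vec."""
    scored = [(cosine(query_vec, vec), vid) for vid, vec in video_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored[:k]]

# Toy embeddings standing in for encoder outputs (assumed values).
video_index = {
    "video_sneakers": [0.9, 0.1, 0.0],
    "video_dress": [0.0, 0.2, 0.9],
    "video_boots": [0.7, 0.6, 0.1],
}
user_vec = [1.0, 0.2, 0.0]  # a user mapped into the same space
top = retrieve(user_vec, video_index, k=2)
```

Because ranking depends only on the embeddings, a newly uploaded video can be retrieved as soon as it is encoded, with no interaction history, which is what makes this kind of approach suited to cold-start settings.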
