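arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Collected today/current date: 5. Sources: cs.CV, cs.GR, cs.MM
2606.19103 2026-06-18 cs.CV cs.AI New submission 90%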

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

Mukund Khanna, Raj Singh Yadav, Kunal Singh

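Affiliations: Fractal Analytics

Topic match: Image editing. Instruction-based image editing that preserves product identity.

AI summary: To address insufficient preservation of product features in instruction-based image editing, the paper proposes the ProductConsistency dataset and a cyclic consistency reward, combining supervised fine-tuning with reinforcement learning to significantly improve product consistency, text rendering, and visual quality.

Comments: CVPR HiGen 2026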

Abstract

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models on OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md
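As a rough illustration of the caption-similarity idea behind the Cyclic Consistency reward, the sketch below scores how closely captions of an edited image match the original product description. The bag-of-words cosine similarity, the averaging over captions, and all function names are assumptions made for illustration; the abstract does not say which similarity measure or captioning model the authors use, and captioning itself is left out.

```python
import math
import re
from collections import Counter


def _bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def caption_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors, in [0, 1].

    Assumption: a learned text embedding could be substituted here.
    """
    va, vb = _bag_of_words(a), _bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def cyclic_consistency_reward(original_description: str, edited_captions: list[str]) -> float:
    """Mean similarity between the source description and each caption
    generated from the edited image (hypothetical aggregation)."""
    if not edited_captions:
        return 0.0
    scores = [caption_similarity(original_description, c) for c in edited_captions]
    return sum(scores) / len(scores)
```

Under this sketch, an edit whose captions still mention the product's brand and type earns a higher reward than one whose captions describe a different object.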

2606.18906 2026-06-18 cs.CV New submission 90%

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

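Affiliations: Sookmyung Women’s University; Yonsei University; Samsung Research

Topic match: Image editing. Proposes a multi-object image editing method that suppresses attention leakage.

AI summary: To address semantic blending and object duplication in multi-object image editing, the paper proposes BindEdit, which suppresses attention leakage within a single diffusion trajectory through joint regularization of cross- and self-attention, a cross-attention re-balancing mechanism, and a region fidelity term, enabling precise edits.

Comments: Preprint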

Abstract

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve both forms of leakage, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.
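The cross-attention re-balancing described above can be pictured with a minimal sketch: inside the editable region, attention weights on target tokens are scaled up, weights on source tokens are scaled down, and each row is renormalized. The multiplicative form, the `boost` and `damp` factors, and the function name are hypothetical; the abstract does not give the actual formulation.

```python
def rebalance_cross_attention(attn, edit_mask, target_tokens, source_tokens,
                              boost=2.0, damp=0.5):
    """Rescale per-position attention over text tokens, then renormalize.

    attn: list of rows, one per spatial position; each row sums to 1 over tokens.
    edit_mask: list of bools, True where the position is editable.
    target_tokens / source_tokens: sets of token indices.
    boost, damp: hypothetical scale factors (not taken from the paper).
    """
    out = []
    for row, editable in zip(attn, edit_mask):
        if not editable:
            out.append(list(row))  # positions outside the mask stay untouched
            continue
        scaled = [
            w * (boost if j in target_tokens else damp if j in source_tokens else 1.0)
            for j, w in enumerate(row)
        ]
        total = sum(scaled)
        out.append([w / total for w in scaled] if total else list(row))
    return out
```

In this toy form, a position inside the mask that attended mostly to a source token shifts part of that weight to the target token, while positions outside the mask keep their original attention.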

2606.19073 2026-06-18 cs.CV New submission 85%

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

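Affiliations: Wangxuan Institute of Computer Technology, Peking University, Beijing, China; National Institute of Health Data Science, Peking University, Beijing, China

Topic match: Image editing. Image HOI editing using I2V models.

AI summary: Proposes the HOI-Edit benchmark and the SCPE framework, which uses the temporal generation ability of I2V models for dynamic human-object interaction editing and iteratively refines prompts through self-correction, achieving performance competitive with SOTA.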

Abstract

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). Existing benchmarks leave this challenge unaddressed: they conflate HOI with static attributes and rely on global metrics that cannot simultaneously assess dynamic interaction validity and the preservation of entangled human-object pairs. We therefore first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels. It features an automated metric, HOI-Eval, that reliably evaluates instance-level interaction by having a VLM perform Q&A after thinking with images containing grounded human-object pairs. Because the task is essentially about remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models and find them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to present the target HOI more accurately. Frames extracted from these videos serve as the final editing results. On HOI-Edit, SCPE achieves interaction performance competitive with state-of-the-art (SOTA) editing models such as Nano Banana. Code is available at https://github.com/oceanflowlab/HOI-Edit.
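The iterative prompt refinement in SCPE can be sketched as a generate, evaluate, refine loop. Everything below is a hypothetical skeleton: the three callables stand in for an I2V model, a VLM-based checker, and a prompt rewriter, and taking the last frame as the edited image is an assumption, since the abstract does not say which frames are extracted.

```python
from typing import Callable, Tuple


def self_correcting_edit(
    image: str,
    target_hoi: str,
    generate_video: Callable[[str, str], list],
    evaluate: Callable[[list, str], Tuple[bool, str]],
    refine_prompt: Callable[[str, str], str],
    max_rounds: int = 3,
):
    """Generate, check, and re-prompt until the target interaction is judged present.

    Returns (frame, prompt, rounds_used), where prompt is the one that produced
    the returned frame.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    prompt = target_hoi
    for round_idx in range(1, max_rounds + 1):
        frames = generate_video(image, prompt)
        ok, feedback = evaluate(frames, target_hoi)
        if ok or round_idx == max_rounds:
            # Assumption: the last frame of the video is used as the edit.
            return (frames[-1] if frames else None), prompt, round_idx
        # Fold the diagnosed failure back into the next prompt.
        prompt = refine_prompt(prompt, feedback)
```

The loop stops as soon as the evaluator accepts a video, so easy edits cost a single generation and only failed attempts trigger another round.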

2605.21431 2026-06-18 cs.CV Updated version 85%

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

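Affiliations: Shenzhen Campus of Sun Yat-sen University; Taobao & Tmall Group of Alibaba

Topic match: Image editing. Interactive video virtual try-on, which falls under image generation and editing.

AI summary: This paper proposes the iTryOn framework, which uses spatial-semantic guidance to resolve semantic ambiguity and complex garment deformation in interactive video virtual try-on, enabling a more dynamic and controllable virtual try-on experience.

Comments: Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026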

Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
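The abstract does not explain how A-RoPE works. Assuming it builds on standard rotary position embedding (RoPE), the sketch below shows only that standard mechanism: each pair of vector components is rotated by an angle proportional to the position, so that dot products between rotated vectors depend only on the position offset. The function names and the `base` value are illustrative, and nothing here reflects the action-aware part of the authors' design.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector.

    Each consecutive pair (x[i], x[i+1]), for even i, is rotated by
    position * base ** (-i / d), where d is the vector length.
    """
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Because `position` can be any real number, the same rotation could in principle be driven by timestamps rather than token indices, which is one plausible way to synchronize time-stamped captions with video frames.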

2604.03156 2026-06-18 cs.CV Updated version 85%

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Yuhan Pu, Hao Zheng, Ziqian Mo, Zirui Pang, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

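Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology; Shenzhen University; Claremont McKenna College; Research Institute of Petroleum Exploration and Development, CNPC

Topic match: Image editing. A multi-agent framework for conditional image editing with quality assessment.

AI summary: Proposes CAMEO, a multi-agent framework that recasts conditional image editing as a quality-aware, feedback-driven process. By decomposing editing into stages and embedding an evaluation loop, it raises the average win rate by 20% on anomaly insertion and human pose switching tasks.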

Abstract

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves a 20% higher win rate on average than multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.
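The staged, closed-loop process described above might be organized along the following lines. This is a hypothetical skeleton, not the authors' pipeline: every callable is a placeholder supplied by the caller, and the `needs_reference` flag, the best-of-N selection, and the `threshold` stopping rule are assumptions introduced to make the sketch concrete.

```python
def closed_loop_edit(task, plan, write_prompt, generate, score, fetch_reference,
                     n_hypotheses=3, threshold=0.8, max_iters=3):
    """Plan once, then alternate generation and evaluation until a hypothesis passes.

    Placeholder callables (all hypothetical):
      plan(task) -> dict with a boolean "needs_reference"
      write_prompt(task, plan, feedback) -> str
      generate(prompt, reference, seed) -> candidate edit
      score(candidate, task) -> (float in [0, 1], feedback str)
      fetch_reference(task) -> reference guidance
    """
    steps = plan(task)
    # Reference grounding is requested only when the plan says it is needed.
    reference = fetch_reference(task) if steps.get("needs_reference") else None
    best, best_score, feedback = None, -1.0, ""
    for _ in range(max_iters):
        prompt = write_prompt(task, steps, feedback)
        for seed in range(n_hypotheses):
            candidate = generate(prompt, reference, seed)
            s, fb = score(candidate, task)
            if s > best_score:
                best, best_score, feedback = candidate, s, fb
        if best_score >= threshold:
            break  # quality gate passed; stop refining
    return best, best_score
```

Keeping the evaluator inside the loop means the feedback attached to the current best candidate shapes the next prompt, in contrast to a one-shot call that returns whatever the first generation produces.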