Multimodal Large Models - arXivDaily Topic

2606.18249 2026-06-19 cs.CV New submission 90%

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

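Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; Qwen Team, Alibaba Inc.

Topic match (image-text multimodal): unified multimodal autoregressive modeling that bridges visual understanding and generation.

AI summary: Proposes UniAR, a framework that bridges visual understanding and generation through a single discrete visual tokenizer, using parallel bitwise prediction and diffusion decoding; it reports state-of-the-art results on image generation and editing while staying competitive on multimodal understanding.

Comments: ICML 2026. Project page: https://sharelab-sii.github.io/uniar-web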

Abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
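The lookup-free bitwise quantization mentioned in the abstract can be illustrated with a minimal sketch. It assumes the generic sign-based formulation, in which each latent channel becomes one bit and the token index is simply those bits packed into an integer, so a d-bit code addresses 2^d entries without storing a codebook table. The function names and the toy latent are hypothetical, and UniAR's actual tokenizer (multi-level feature fusion, spatial grouping) is not reproduced here.

```python
from typing import List, Sequence


def lfq_quantize(latent: Sequence[float]) -> List[int]:
    """Binarize each channel by sign: 1 if the value is positive, else 0."""
    return [1 if value > 0 else 0 for value in latent]


def bits_to_index(bits: Sequence[int]) -> int:
    """Pack bits (most significant first) into an implicit codebook index."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def index_to_code(index: int, num_bits: int) -> List[float]:
    """Recover the quantized vector in {-1, +1}^num_bits from its index."""
    return [
        1.0 if (index >> (num_bits - 1 - i)) & 1 else -1.0
        for i in range(num_bits)
    ]


# Hypothetical 4-channel latent for one visual token.
latent = [0.7, -1.2, 0.1, -0.4]
bits = lfq_quantize(latent)
index = bits_to_index(bits)
code = index_to_code(index, len(bits))
print(bits, index, code)
```

Because the index is computed rather than looked up, adding one channel doubles the effective vocabulary at the cost of a single extra bit per token, which is one plausible reading of "scaling the effective visual vocabulary at minimal cost".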

2606.19534 2026-06-19 cs.CV cs.AI cs.CL New submission 85%

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

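Affiliations: Peking University; MSALab; ByteDance

Topic match (image-text multimodal): parallel region perception with a multimodal diffusion language model.

AI summary: Proposes PerceptionDLM, which exploits the parallel decoding of diffusion language models, combining efficient prompting with structured attention masks to perceive multiple regions in parallel and substantially improve inference efficiency; also builds the ParaDLC-Bench benchmark for evaluation.

Comments: Code available at https://github.com/MSALab-PKU/PerceptionDLM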

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
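One way to picture the structured attention masking described above is a block-structured mask. The sketch below assumes a layout that the abstract does not spell out: a shared prefix (image and prompt tokens) visible to every token, followed by one caption segment per region, where each segment attends bidirectionally to itself and to the prefix but not to other regions. `build_region_mask` and the token counts are hypothetical.

```python
from typing import List


def build_region_mask(prefix_len: int, region_lens: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k."""
    # Segment 0 is the shared prefix; segments 1..R are the region captions.
    segments = [0] * prefix_len
    for region_id, length in enumerate(region_lens, start=1):
        segments.extend([region_id] * length)

    total = len(segments)
    return [
        [segments[k] == 0 or segments[k] == segments[q] for k in range(total)]
        for q in range(total)
    ]


# Two prefix tokens, then a 2-token caption and a 1-token caption.
mask = build_region_mask(prefix_len=2, region_lens=[2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With such a mask, all region captions can share one forward pass without leaking text from one region into another, which is the property sequence-level parallelism needs.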

2606.19706 2026-06-19 cs.CV cs.CL New submission 80%

NEST: Narrative Event Structures in Time for Long Video Understanding

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

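Affiliations: Department of Computer Science, Virginia Tech

Topic match (image-text multimodal): multimodal narrative event annotation covering visual content, dialogue, and audio.

AI summary: Introduces the NEST dataset (1005 full-length movies), which uses multimodal narrative event annotations and relation links to evaluate how well models understand event structure, temporal order, and long-range dependencies in long videos; experiments show that tasks such as event detection remain highly challenging.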

Abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress: for example, whether a model can connect an early setback, such as a job loss, to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures these events with structured annotations and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.
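The ERE figures above are F1 scores. As a reminder of how that number is formed, here is a small worked sketch, assuming relations are scored as exact-match (head, type, tail) triples. The matching rule, the event identifiers, and the relation names are illustrative assumptions, not NEST's evaluation protocol.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (head event, relation type, tail event)


def f1_score(predicted: Set[Relation], gold: Set[Relation]) -> float:
    """Harmonic mean of precision and recall over exact-match triples."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one movie.
gold = {("e1", "before", "e2"), ("e2", "causes", "e3"), ("e1", "part_of", "e4")}
predicted = {("e1", "before", "e2"), ("e2", "causes", "e4")}

# One correct triple: precision 1/2, recall 1/3.
score = f1_score(predicted, gold)
print(round(score, 4))
```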

2606.19413 2026-06-19 cs.LG New submission 80%

Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

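Affiliations: Applied Artificial Intelligence Initiative

Topic match (image-text multimodal): fusion of text and numerical data in multimodal time series forecasting.

AI summary: Addresses "text collapse" in multimodal time series forecasting, where the text branch is neglected, by proposing REST-TS: the text branch is trained solely to predict the residual that the numerical backbone cannot explain, which forces it to extract genuine content and yields state-of-the-art performance.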

Abstract

Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing frameworks that we term text collapse: the text branch converges to a content-independent transformation, contributing negligible discriminative signal regardless of the input description. We argue that text collapse is a consequence of a fundamental asymmetry in time series forecasting: the numerical input is strongly autocorrelated with the output, making the numerical backbone inherently dominant, while the text branch, despite carrying complementary and often critical information, is insufficiently utilized, leading to its systematic underexploitation. To address this, we propose REST-TS (Residual-Exclusive Supervision for Text in Time Series), which turns the asymmetry into a design principle: the numerical backbone produces its own independent numerical forecast, and the text branch is exclusively supervised to predict the structured components of the residual, the prediction gap that numbers cannot explain. Because no numerical pathway can reduce these losses, the text branch must extract genuine content from the input description. Evaluated across diverse real-world domains and backbone architectures, REST-TS achieves state-of-the-art performance and consistently demonstrates greater text-branch utilization than existing frameworks, providing strong empirical evidence that supervising the text branch on the residual compels it to extract genuine content from the input.
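The residual-exclusive idea can be sketched numerically. The example below assumes the simplest possible setup: the residual is the plain difference between the target and the numerical forecast, both branches use mean squared error, and the final forecast is the sum of the two outputs. The abstract speaks of "structured components" of the residual, which this sketch does not model, and all names and numbers are hypothetical.

```python
from typing import List, Tuple


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def residual_exclusive_losses(
    y_true: List[float], y_num: List[float], r_text: List[float]
) -> Tuple[float, float, List[float]]:
    """Score the numerical forecast on y_true and the text output on the residual."""
    # Gap the numerical backbone leaves unexplained. In an autodiff framework
    # this target would be treated as a constant (an assumption here), so the
    # text loss cannot be lowered through the numerical pathway.
    residual = [t - n for t, n in zip(y_true, y_num)]
    loss_num = mse(y_num, y_true)
    loss_text = mse(r_text, residual)
    forecast = [n + r for n, r in zip(y_num, r_text)]
    return loss_num, loss_text, forecast


y_true = [10.0, 12.0, 15.0]   # observed future values
y_num = [9.0, 12.0, 13.0]     # numerical backbone forecast
r_text = [1.0, 0.0, 1.0]      # text branch's predicted residual
loss_num, loss_text, forecast = residual_exclusive_losses(y_true, y_num, r_text)
print(loss_num, loss_text, forecast)
```

The point of the split is visible in the two losses: the text branch is graded only on what the numbers missed, so a content-independent output cannot score well unless the residual itself is constant.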

2606.20527 2026-06-19 cs.CL cs.CV New submission 70%

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

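Affiliations: Technical University of Munich; Munich Center for Machine Learning; Princeton Center for Information and Technology Policy

Topic match (image-text multimodal): studies visual bias in multimodal large language models.

AI summary: Presents the StylisticBias benchmark, which varies a single visual attribute at a time, and finds that age and body type dominate identity-level bias, while about 15 attributes such as fashion style explain nearly 80% of the variation in bias, so bias is concentrated in a small set of visual cues.

Comments: Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026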

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.
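A statement such as "about 15 attributes account for nearly 80% of the total variation" is a cumulative-share calculation. The sketch below shows the arithmetic on invented numbers: rank attributes by their contribution and count how many are needed to reach a target share. The attribute names, the values, and the function `attributes_for_share` are hypothetical and do not come from the benchmark.

```python
from typing import Dict, List


def attributes_for_share(effects: Dict[str, float], share: float) -> List[str]:
    """Smallest top-ranked set of attributes whose summed effect reaches `share`."""
    total = sum(effects.values())
    if total <= 0:
        return []
    ranked = sorted(effects.items(), key=lambda item: item[1], reverse=True)
    selected: List[str] = []
    running = 0.0
    for name, value in ranked:
        selected.append(name)
        running += value
        if running / total >= share:
            break
    return selected


# Invented per-attribute contributions to the variation in model judgments.
effects = {
    "fashion_style": 40.0,
    "hairstyle": 25.0,
    "glasses": 20.0,
    "jewelry": 10.0,
    "background": 5.0,
}
top = attributes_for_share(effects, share=0.8)
print(top)
```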

2606.19882 2026-06-19 cs.CV cs.LG New submission 70%

Multimodal Concept Bottleneck Models

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

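Affiliations: UC San Diego

Topic match (image-text multimodal): a multimodal model that combines images and text.

AI summary: Proposes the Multimodal Concept Bottleneck Model (MM-CBM), which uses dual concept bottleneck layers to align image and text embeddings, enabling interpretable zero-shot classification and image retrieval, with an average accuracy improvement of up to 51.26% across four benchmarks.

Comments: Presented at NeurIPS 2025 Mechanistic Interpretability Workshop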

Abstract

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are limited by their inability to generalize beyond a fixed set of predefined classes and by the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose the Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs to CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.
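The dual-bottleneck idea can be sketched with toy vectors. The example assumes each bottleneck is a single linear projection onto named concepts and that zero-shot classification compares image and class-text concept activations by cosine similarity. The concept names, weights, two-dimensional embeddings, and function names are all invented for illustration; the actual MM-CBM layers and training procedure are not described at this level in the abstract.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def concept_layer(embedding: Vector, weights: Matrix) -> Vector:
    """Project an embedding onto concept directions (one weight row per concept)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_classify(
    image_emb: Vector, class_text_embs: Dict[str, Vector],
    w_image: Matrix, w_text: Matrix,
) -> str:
    """Pick the class whose text concepts best match the image concepts."""
    image_concepts = concept_layer(image_emb, w_image)
    scores = {
        name: cosine(image_concepts, concept_layer(text_emb, w_text))
        for name, text_emb in class_text_embs.items()
    }
    return max(scores, key=scores.get)


# Hypothetical concepts: whiskers, floppy ears, fur.
w_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_text = w_image  # shared weights only to keep the toy example small
class_text_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
predicted = zero_shot_classify([1.0, 0.0], class_text_embs, w_image, w_text)
print(predicted)
```

Because the comparison happens in concept space, the per-concept products between the two activation vectors can be read off as an explanation of the decision, which is the interpretability benefit the abstract refers to.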

2606.19727 2026-06-19 cs.CL cs.AI New submission 70%

NRITYAM: Language Models Meet Art and Heritage of Dance

Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang

Affiliations: Shenzhen Technology University; New Delhi Institute of Management; Technische Universität Dresden; Ramakrishna Mission Vivekananda Educational and Research Institute; Indian Institute of Technology; Swami Vivekananda Institute of Technology; GuangDong Engineering Technology Research Center of Edge Intelligence

Topic match (image-text multimodal): includes evaluation of multimodal models, involving vision and language.

AI summary: Presents the NRITYAM benchmark, 9,260 cultural question-answer pairs across 12 languages, for evaluating how well language models understand global dance traditions, covering several model types.

Comments: 18 pages, 12 figures, in ECML PKDD'26

Abstract

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at https://github.com/niladrighosh03/NRITYAM.

2606.20559 2026-06-19 cs.CV cs.LG New submission 60%

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

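Affiliations: University of North Carolina at Charlotte

Topic match (image-text multimodal): distillation that fuses knowledge from multimodal teachers.

AI summary: Proposes UNIEGO, a hierarchical multi-teacher distillation framework in which proxy models translate heterogeneous teacher knowledge into a homogeneous egocentric space and selective proxy distillation adaptively keeps only reliable supervision; reports state-of-the-art results on three egocentric video understanding tasks.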

Abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, or a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
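Two mechanisms from the abstract lend themselves to a short sketch: initializing the student as a convex combination of proxy parameters, and keeping only proxies that are both correct and confident for a given sample. The version below makes assumptions the abstract does not state: the mixing weights come from a softmax over learnable logits, proxies share a flat parameter layout, and "confident" means the predicted-class probability clears a fixed threshold. All names and values are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def convex_init(proxy_params: List[List[float]], mixing_logits: List[float]) -> List[float]:
    """Weighted average of proxy parameter vectors with softmax weights."""
    weights = softmax(mixing_logits)
    size = len(proxy_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, proxy_params))
        for i in range(size)
    ]


def select_proxies(proxy_probs: List[List[float]], label: int, min_confidence: float) -> List[int]:
    """Indices of proxies that predict `label` with probability >= min_confidence."""
    selected = []
    for idx, probs in enumerate(proxy_probs):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(idx)
    return selected


# Two toy proxies with two parameters each, mixed with equal logits.
student_init = convex_init([[1.0, 2.0], [3.0, 4.0]], mixing_logits=[0.0, 0.0])

# Three proxies scoring one sample whose true class is 0.
reliable = select_proxies(
    [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]], label=0, min_confidence=0.6
)
print(student_init, reliable)
```

In this toy run the second proxy is correct but not confident and the third is wrong, so only the first would supervise the student for that sample.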
