2505.19937 2026-06-17 cs.CL cs.SD eess.AS (version update)

ALAS: An Automatic Latent Alignment Score for Audio Language Models

Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan

AI summary: Proposes ALAS, a training-free metric that evaluates the audio-text alignment quality of Speech-LLMs by computing the cross-modal cosine similarity between audio and text representations, and reveals how alignment depth relates to task demands.

Abstract

Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio-text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growing number of fusion strategies, there is no standard way to measure how well a Speech-LLM internally binds audio frames to text tokens. We introduce ALAS (Automatic Latent Alignment Score), a model- and task-agnostic metric that probes the LLM's per-layer hidden states, scoring the cross-modal cosine similarity between audio and text representations against a Whisper-derived reference. ALAS needs only a frozen forward pass and an off-the-shelf ASR reference, with no training or fitted classifier, and is calibrated to an interpretable uniform baseline that is comparable across tasks. Applying ALAS to four open-source Speech-LLMs (AF3, Qwen2-Audio, Qwen-Omni, SALMONN) across emotion recognition (IEMOCAP), open-ended SQA (LibriSQA), and multi-choice audio understanding (MMAU-speech), we find that the depth and strength of alignment reflect each model's audio-encoder design and the acoustic-versus-semantic demands of the task, and that ALAS tracks but does not duplicate task accuracy, exposing models that score well without genuinely grounding in the audio. We release ALAS as an open-source library so that practitioners can probe their own Speech-LLMs or try it on new tasks.
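
The cross-modal cosine similarity that the abstract mentions can be illustrated with a minimal sketch. This is not the ALAS library or its scoring rule: the comparison against a Whisper-derived reference and the calibration to a uniform baseline are not shown, and the names `cosine`, `similarity_matrix`, `audio_states`, and `text_states` are hypothetical, with made-up toy vectors standing in for one layer's hidden states.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def similarity_matrix(audio_states, text_states):
    """Rows index audio frames, columns index text tokens."""
    return [[cosine(a, t) for t in text_states] for a in audio_states]

# Toy hidden states for a single layer (illustrative numbers only).
audio_states = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
text_states = [[2.0, 0.0], [0.0, 1.0]]

sim = similarity_matrix(audio_states, text_states)
for row in sim:
    print([round(value, 3) for value in row])
```

In a real probe, the two lists would presumably be the hidden states at the audio and text positions of each LLM layer, so one such matrix would exist per layer.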

2506.17639 2026-06-17 cs.RO cs.AI (version update)

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

Yuxuan Chen, Yixin Han, Yize Huang, Xiao Li

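AI summary: Proposes RLRC, a three-stage compression and recovery pipeline that combines structured pruning, recovery via SFT and reinforcement learning, and quantization, achieving up to an 8x memory reduction and a 2.3x inference speedup while maintaining the task success rate.

Comments: 8 pages, 10 figures; accepted by RA-L 2026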

DOI: 10.1109/LRA.2026.3700379
Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8864-8871, July 2026

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter sizes and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on these insights, we present RLRC, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via SFT and RL, and subsequent quantization. The RL stage incorporates a critic warm-up strategy and BC loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8x memory reduction and a 2.3x inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io

2506.10981 2026-06-17 cs.CV (version update)

Mordal: Automated Pretrained Model Selection for Vision Language Models

Shiqi He, Insu Jang, Mosharaf Chowdhury

AI summary: Proposes Mordal, a framework that automates the search for the best vision language model for a user-defined task by reducing the number of candidate models and the time needed to evaluate them; it uses 8.9x to 11.6x fewer GPU hours than grid search and improves weighted Kendall's τ by about 69% on average.

Abstract

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$--$11.6\times$ fewer GPU hours than grid search. We also find that Mordal achieves about 69% higher weighted Kendall's $τ$ on average than the state-of-the-art model selection method across diverse tasks.

2409.17502 2026-06-17 cs.LG (version update)

Broadcast Product: Redefining Shape-aligned Element-wise Multiplication and Beyond

Yusuke Matsui, Tatsuya Yokota

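AI summary: Introduces the broadcast product $\boxdot$, which formally extends the Hadamard product to element-wise multiplication of tensors with mismatched shapes, establishes its algebraic properties and its connections to linear algebra, and lays a mathematical foundation for broadcast-aware tensor operations.

Comments: TMLR 2026. OpenReview: https://openreview.net/forum?id=zv0OtOPpPO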

Abstract

Broadcast operations are widely used in scientific computing libraries, yet their mathematical formulation is often implicit and inconsistently represented in machine learning literature. This problem frequently leads to invalid equations when element-wise products are written despite mismatched tensor shapes. In this paper, we formalize such operations by introducing the broadcast product $\boxdot$, which explicitly extends the Hadamard product through shape-aligned element duplication. We provide a rigorous definition of the broadcast product, analyze its algebraic properties, and show how it can be expressed using standard linear algebra. Building on this framework, we formulate least-squares problems and sketch a proof-of-concept broadcast decomposition. As a preliminary illustration, we show that the formalism enables a new family of decompositions with distinct structural properties from conventional tensor decompositions. This work establishes a mathematical foundation for broadcast-aware tensor operations, connecting practical implementations with rigorous tensor analysis.
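
The idea of extending the Hadamard product through shape-aligned element duplication can be sketched for the simplest case, an m x n matrix multiplied by a length-n vector. This is an illustration under stated assumptions, not the paper's definition of $\boxdot$: the alignment convention (the vector is copied along rows, as in NumPy-style trailing-axis broadcasting) is assumed, and the helper names `hadamard`, `duplicate_rows`, `broadcast_product`, `matmul`, and `diag` are hypothetical.

```python
def hadamard(A, B):
    """Element-wise product of two matrices with identical shapes."""
    assert len(A) == len(B) and all(len(ra) == len(rb) for ra, rb in zip(A, B))
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def duplicate_rows(b, m):
    """Copy the length-n vector b into an m x n matrix (explicit duplication)."""
    return [list(b) for _ in range(m)]

def broadcast_product(A, b):
    """Duplicate b to the shape of A, then take the ordinary Hadamard product."""
    return hadamard(A, duplicate_rows(b, len(A)))

def diag(b):
    """Square matrix with b on the diagonal and zeros elsewhere."""
    n = len(b)
    return [[b[i] if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * c for a, c in zip(row, col)) for col in zip(*B)] for row in A]

A = [[1, 2, 3],
     [4, 5, 6]]
b = [10, 20, 30]

result = broadcast_product(A, b)
print(result)                 # [[10, 40, 90], [40, 100, 180]]
print(matmul(A, diag(b)))     # same values, written with standard linear algebra
```

For this particular case the duplicated product coincides with right-multiplication by a diagonal matrix, which is one concrete instance of expressing a broadcast operation with standard linear algebra.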

2408.12099 2026-06-17 cs.CV cs.CR (version update)

Query-Efficient Video Adversarial Attack with Stylized Logo on Service Computing

Duoxun Tang, Yuxin Cao, Xi Xiao, Derui Wang, Sheng Wen, Tianqing Zhu

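AI summary: Proposes SLA, a black-box video attack framework that uses stylized logos and reinforcement learning to generate adversarial examples with a low budget and high verisimilitude, outperforming existing methods in targeted attacks.

Comments: Accepted to IEEE Transactions on Dependable and Secure Computing (TDSC)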

Abstract

In service computing, video classification has become fundamental to many intelligent applications. While Deep Neural Networks (DNNs) have demonstrated excellent performance in recognizing video content, recent studies have shown that DNNs are highly vulnerable to adversarial examples. Understanding adversarial attacks therefore helps in responding to emergency situations. To improve attack performance, many style-transfer-based attacks and patch-based attacks have been proposed. However, the global perturbation of the former introduces unnatural global colors, while the latter struggles to succeed in targeted attacks because of its limited perturbation space. Moreover, compared to the plethora of methods targeting image classifiers, video adversarial attacks remain relatively underexplored. Therefore, to generate adversarial examples with a low budget and higher verisimilitude, we propose a novel black-box video attack framework called Stylized Logo Attack (SLA). SLA proceeds in three stages. The first stage builds a style reference set for logos, which not only makes the generated examples more natural but also carries more target-class features in targeted attacks. Then, Reinforcement Learning is employed to determine the style reference and position parameters of the logo within the video, which ensures that the stylized logo is placed in the video with optimal attributes. Finally, perturbations are optimized step by step to improve the fooling rate. Experimental results indicate that SLA achieves better performance than state-of-the-art methods and still maintains good deception effects against various defense methods. We believe SLA can raise awareness among the security community about the reliability and security of video classification systems and serve as a memorandum of possible attack methods.

2406.07435 2026-06-17 cs.CV cs.LG eess.IV (version update)

Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration

Shashank Agnihotri, Julia Grabinski, Janis Keuper, Margret Keuper

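AI summary: Addresses the poor robustness of image restoration networks caused by aliasing and proposes BOA-Restormer, which performs part of the downsampling and upsampling operations in the frequency domain to guarantee alias-free paths, improving robustness at low cost.

Comments: Tags: Adversarial attack, image restoration, image deblurring, frequency sampling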

2404.09790 2026-06-17 cs.CV (version update)

NTIRE 2024 Challenge on Image Super-Resolution (x4): Methods and Results

Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Hongyu An, Xinfeng Zhang, Zhiyuan Song, Ziyue Dong, Qing Zhao, Xiaogang Xu, Pengxu Wei, Zhi-chao Dou, Gui-ling Wang, Chih-Chung Hsu, Chia-Ming Lee, Yi-Shiuan Chou, Cansu Korkmaz, A. Murat Tekalp, Yubin Wei, Xiaole Yan, Binren Li, Haonan Chen, Siqi Zhang, Sihan Chen, Amogh Joshi, Nikhil Akalwadi, Sampada Malagi, Palani Yashaswini, Chaitra Desai, Ramesh Ashok Tabib, Ujwala Patil, Uma Mudenagudi, Anjali Sarvaiya, Pooja Choksy, Jagrit Joshi, Shubh Kawa, Kishor Upla, Sushrut Patwardhan, Raghavendra Ramachandra, Sadat Hossain, Geongi Park, S. M. Nadim Uddin, Hao Xu, Yanhui Guo, Aman Urumbekov, Xingzhuo Yan, Wei Hao, Minghan Fu, Isaac Orais, Samuel Smith, Ying Liu, Wangwang Jia, Qisheng Xu, Kele Xu, Weijun Yuan, Zhan Li, Wenqin Kuang, Ruijin Guan, Ruting Deng, Zhao Zhang, Bo Wang, Suiyi Zhao, Yan Luo, Yanyan Wei, Asif Hussain Khan, Christian Micheloni, Niki Martinel

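AI summary: Reviews the NTIRE 2024 challenge on image super-resolution (x4), summarizing the submitted solutions and results, which push the performance boundary of single-image super-resolution and outline current trends.

Comments: NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4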

DOI: 10.1109/CVPRW63382.2024.00617
Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 6108-6132

Abstract

This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.
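
Since the track ranks entries by PSNR, a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), may help. It assumes 8-bit single-channel images stored as nested lists; the abstract does not state the challenge's exact evaluation protocol (color channels, border cropping), so none of that is modeled here, and the function name `psnr` and the toy images are hypothetical.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized grayscale images."""
    flat_ref = [pixel for row in reference for pixel in row]
    flat_est = [pixel for row in estimate for pixel in row]
    if len(flat_ref) != len(flat_est):
        raise ValueError("images must contain the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2x2 images: every pixel of the estimate is off by exactly 1, so MSE = 1.
hr_image = [[10, 20], [30, 40]]
sr_image = [[11, 19], [31, 39]]

score = psnr(hr_image, sr_image)
print(f"PSNR: {score:.2f} dB")
```

Higher values mean the super-resolved output is closer to the high-resolution reference in the mean-squared-error sense.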

2404.01965 2026-06-17 cs.LG cs.AI (version update)

Generalized Sinkhorn Algorithm for Mean-Field Schrödinger Bridge

Asmaa Eldesoukey, Yongxin Chen, Abhishek Halder

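AI summary: For the mean-field Schrödinger bridge problem, proposes a generalized Hopf-Cole transform and designs a Sinkhorn-type recursive algorithm to solve the resulting system of integro-partial differential equations, proves convergence under weak assumptions, and validates the approach with numerical experiments.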

2603.27049 2026-06-17 stat.ML cs.LG (version update)

Overcoming the Incentive Collapse Paradox

Qichuan Yin, Ziwei Su, Shuangning Li

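AI summary: To address incentive collapse in AI-assisted tasks, proposes a sentinel-auditing payment mechanism that sustains positive human effort at finite cost, and builds an incentive-aware active statistical inference framework that optimizes the auditing rate and sampling allocation.

Comments: Accepted to ICML 2026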

Abstract

AI-assisted task delegation is increasingly common, yet human effort in such systems is costly and typically unobserved. Recent work by Bastani and Cachon (2025) and Sambasivan et al. (2021) shows that accuracy-based payment schemes suffer from incentive collapse: as AI accuracy improves, sustaining positive human effort requires unbounded payments. We study this phenomenon in a budget-constrained principal-agent framework with strategic human agents whose output accuracy depends on unobserved effort. Our first contribution is a general impossibility result showing that incentive collapse is not merely a limitation of simple linear payments, but arises for any payment rule based only on observed task accuracy. To overcome this barrier, we propose a sentinel-auditing payment mechanism that enforces a strictly positive and controllable level of human effort at finite cost, independent of AI accuracy. Building on this incentive-robust foundation, we develop an incentive-aware active statistical inference framework that jointly optimizes (i) the auditing rate and (ii) active sampling and budget allocation across tasks of varying difficulty to minimize the final statistical loss under a single budget. Experiments demonstrate improved cost-error tradeoffs relative to standard active learning and auditing-only baselines.

2603.19697 2026-06-17 eess.AS cs.MM cs.SD (version update)

Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction

Doyeop Kwak, Suyeon Lee, Joon Son Chung

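AI summary: Proposes Plug-and-Steer, which decouples separation from target selection and uses a frozen audio-only backbone with a Latent Steering Matrix to achieve high-fidelity audio-visual target speaker extraction.

Comments: Accepted by Interspeech 2026; demo available at https://plugandsteer.github.io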

Abstract

The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation and target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can act as a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to that of the original backbones. Audio samples are available at: https://plugandsteer.github.io

2603.04438 2026-06-17 eess.IV cs.AI cs.LG (version update)

CogGen: Cognitive-Load-Inspired Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction

Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang

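AI summary: Proposes CogGen, a framework based on the cognitive easy-to-hard principle that uses self-paced curriculum learning and an MRI-aware dual-threshold weighting strategy to decompose CS-MRI reconstruction into a staged inversion problem; the theory shows reduced local sufficient-iteration and cumulative noise-amplification bounds, and experiments show it outperforms existing unsupervised and supervised methods.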

Abstract

Fully unsupervised deep generative modeling (FU-DGM) offers significant potential for compressively sampled magnetic resonance imaging (CS-MRI) reconstruction. Representative FU-DGM formulations, such as deep image prior (DIP) and implicit neural representation (INR), employ architectural bias to induce a low-dimensional manifold in the image space that aligns with the forward observation. However, as the underlying inverse system is highly ill-posed, prolonged iterative fitting in FU-DGM typically leads to poor efficiency and noise amplification. In this paper, guided by the cognitive principle of easy-to-hard learning, we propose CogGen, an FU-DGM framework that reformulates CS-MRI reconstruction as a staged inversion problem. Specifically, CogGen implements a self-paced curriculum learning (SPCL)-driven progressive scheduling strategy through an MRI-aware dual-threshold weighting criterion, which adaptively regulates k-space measurement participation. The data-consistency residual thresholding evaluates the fitting reliability of the current generator, while the k-space radius thresholding controls stage-wise measurement exposure, thereby avoiding uniform fitting throughout optimization. Theoretically, our analysis shows that, when early stages favor easy-to-fit measurements, CogGen yields a reduced local sufficient-iteration bound and a smaller cumulative noise-amplification bound, explaining the improved convergence behavior and reconstruction fidelity of CogGen within a finite iteration budget. Numerical experiments demonstrate that both CogGen instantiations, CogGen-DIP and CogGen-INR, achieve superior performance over prevailing CS-MRI reconstruction techniques, including unsupervised and supervised pipelines.
