arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.18995 2026-06-03 cs.CL cs.AI cs.LG

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

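Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

AI summary: Proposes the $R^2$-dLLM framework, which reduces spatial and temporal redundancy in diffusion LLM decoding at both the inference and training stages, cutting decoding steps by up to 88% while preserving generation quality.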

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.
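The idea of finalizing temporally stable tokens can be illustrated with a small sketch. This is not the authors' decoding rule: the function name `finalize_stable_tokens`, the `patience` parameter, and the "unchanged for the last `patience` steps" criterion are all assumptions made for illustration.

```python
from typing import List, Optional


def finalize_stable_tokens(history: List[List[int]], patience: int) -> List[Optional[int]]:
    """Hypothetical stability rule (an assumption, not the paper's method).

    `history[s][p]` is the token id predicted at decoding step s for position p.
    A position is finalized if its prediction was identical over the last
    `patience` steps; otherwise it stays undecided (None).
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if not history:
        return []
    num_positions = len(history[0])
    if len(history) < patience:
        # Not enough steps observed yet to call anything stable.
        return [None] * num_positions
    recent = history[-patience:]
    finalized: List[Optional[int]] = []
    for pos in range(num_positions):
        distinct = {step[pos] for step in recent}
        finalized.append(recent[-1][pos] if len(distinct) == 1 else None)
    return finalized
```

Under this assumed rule, a finalized position would no longer be remasked, so later steps only need to resolve the positions that are still `None`.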

2604.15748 2026-06-03 cs.CV

Concept-wise Attention for Fine-grained Concept Bottleneck Models

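Minghong Zhong, Guoshuai Zou, Kanghao Chen, Dexia Chen, Ruixuan Wang

AI summary: Proposes concept-wise attention (CoAt-CBM), which uses learnable concept-wise visual queries and concept contrastive optimization to achieve adaptive fine-grained image-concept alignment, addressing pre-training bias and mutual exclusivity among concepts and substantially improving performance.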

Comments Withdrawn by authors for revision and improvement

Abstract

Recently, impressive performance has been achieved in Concept Bottleneck Models (CBMs) by utilizing the image-text alignment learned by a large pre-trained vision-language model (e.g., CLIP). However, there exist two key limitations in concept modeling. Existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Moreover, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts, leading to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. Then, a novel concept contrastive optimization guides the model to handle the relative importance of the concept scores, enabling concept predictions to faithfully reflect the image content and improving alignment. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The code will be made available upon acceptance.

2604.01410 2026-06-03 cs.CL

Assessing Pause Thresholds for empirical Translation Process Research

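Devi Sri Bandaru, Michael Carl, Xinyue Ren

AI summary: Compares five approaches for computing pause thresholds and proposes a new method for computing Production Unit Breaks, in order to separate automated from reflective translation processes.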

Comments In Proceedings of "Translation in Transition 8", 2026

Abstract

Text production (and translation) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann, 2016), this paper compares five approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.
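To make the role of a pause threshold concrete, the sketch below splits a keystroke log into stretches of typing wherever the gap between consecutive keystrokes exceeds a fixed threshold. It is a generic illustration rather than any of the five approaches compared in the paper or its proposed method; the function name and the 1000 ms threshold in the usage line are assumptions.

```python
from typing import List


def split_into_typing_stretches(timestamps_ms: List[int], threshold_ms: int) -> List[List[int]]:
    """Group keystroke timestamps into stretches separated by long pauses.

    A new stretch starts whenever the gap to the previous keystroke is
    strictly greater than `threshold_ms` (a hypothetical fixed threshold).
    """
    stretches: List[List[int]] = []
    current: List[int] = []
    for t in timestamps_ms:
        if current and t - current[-1] > threshold_ms:
            stretches.append(current)
            current = []
        current.append(t)
    if current:
        stretches.append(current)
    return stretches


# Example keystroke log in milliseconds, segmented with an assumed 1000 ms threshold.
units = split_into_typing_stretches([0, 150, 300, 2500, 2600, 6000], threshold_ms=1000)
```

Choosing `threshold_ms` is exactly the open question the abstract describes: a lower value yields more, shorter stretches, and a higher value merges them.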

2603.26738 2026-06-03 cs.CV cs.AI cs.CL

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

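Guifeng Deng, Pan Wang, Mengfan Niu, Jiquan Wang, Shuying Rao, Junyi Xie, Xi'ang Chen, Sha Zhao, Gang Pan, Wanjun Guo, Tao Li, Haiteng Jiang

AI summary: Proposes SleepVLM, a rule-grounded vision-language model that stages sleep from multi-channel PSG waveform images and generates clinician-readable explanations consistent with AASM scoring criteria, improving interpretability while maintaining high accuracy.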

Comments Under review

Abstract

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) that stages sleep from multi-channel polysomnography (PSG) waveform images and generates clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Independent expert evaluation by two trained sleep technologists further validated the model's reasoning quality, with mean scores of 3.75-3.96 out of 5 across factual accuracy, evidence comprehensiveness, and logical coherence on both datasets. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
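The agreement figures above are reported as Cohen's kappa, which corrects observed agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes the standard unweighted statistic for two label sequences; the stage labels in the usage lines are made-up illustrative data, not results from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative (fabricated for the example) epoch labels from a model and a scorer.
model_stages = ["W", "N1", "N2", "N2", "REM", "W"]
expert_stages = ["W", "N2", "N2", "N2", "REM", "N1"]
kappa = cohens_kappa(model_stages, expert_stages)
```

In this toy example $p_o = 4/6$ and $p_e = 10/36$, so $\kappa = 14/26 \approx 0.538$.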

2603.29346 2026-06-03 cs.CL

L-ReLF: A Framework for Lexical Dataset Creation

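Anass Sedrati, Mounir Afifi, Reda Benkhadra

AI summary: Proposes the L-ReLF framework, a reproducible technical pipeline for creating high-quality, structured lexical datasets for low-resource languages, addressing the lack of standardized terminology and supporting downstream NLP applications.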

Comments Accepted to the 2026 International Conference on Natural Language Processing (ICNLP). 6 pages, 1 figure

Journal ref
Proc. 2026 8th International Conference on Natural Language Processing (ICNLP), pp. 261-265, IEEE
Abstract

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

2603.27455 2026-06-03 cs.CV

From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis

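Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk

AI summary: Proposes the NAS3R framework, which jointly estimates 3D geometry and camera parameters from unannotated images via self-supervised learning, training through novel view synthesis without ground-truth annotations or pretrained priors.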

Abstract

In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.

2506.13107 2026-06-03 cs.LG stat.ML

Honesty in Causal Forests: When It Helps and When It Hurts

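Yanfang Hou, Carlos Fernández-Loría

AI summary: Through a bias-variance trade-off analysis, finds that honest estimation (splitting the data between subgroup definition and effect estimation) reduces the accuracy of individual treatment effect estimates when heterogeneity is strong and data are plentiful, and recommends treating it as a form of regularization rather than a default.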

Abstract

Causal forests estimate how treatment effects vary across individuals, guiding personalized interventions in areas like marketing, operations, and public policy. A standard practice is honest estimation: dividing the data into two samples, one to define subgroups and another to estimate treatment effects within them. This is intended to reduce overfitting and is the default in many software packages. But is it the right choice? We show that honest estimation can reduce the accuracy of estimates of individual treatment effects, especially when effect heterogeneity is substantial and datasets are large enough to detect it. The reason is a bias-variance trade-off: honesty lowers the risk of overfitting but increases the risk of underfitting by limiting the data available to detect and model heterogeneity. Across more than 7,000 benchmark datasets, we find that the cost of using honesty by default can be as high as requiring 27% more data to match the performance of models trained without it. Honesty is best understood as a form of regularization. Whether to adopt it should depend on the goals of the application and its empirical performance, not on reflexive default use.
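The sample-splitting idea behind honest estimation can be shown on a deliberately tiny example: one sample chooses the subgroup boundary, and a disjoint sample estimates the treatment effect inside each subgroup. This is a toy single-split sketch, not a causal forest and not the authors' experimental setup; the median split rule, the function name, and the synthetic rows are all assumptions.

```python
from statistics import mean, median
from typing import Dict, List, Optional, Tuple

Row = Tuple[float, int, float]  # (covariate x, treated flag 0/1, outcome y)


def honest_subgroup_effects(
    structure_sample: List[Row], estimation_sample: List[Row]
) -> Tuple[float, Dict[str, Optional[float]]]:
    """Define subgroups on one sample, estimate effects on the other."""
    # Step 1: the structure sample alone decides where to split (assumed rule: median of x).
    threshold = median(x for x, _, _ in structure_sample)
    # Step 2: the estimation sample alone supplies the difference in mean outcomes.
    groups = {
        "low": [row for row in estimation_sample if row[0] <= threshold],
        "high": [row for row in estimation_sample if row[0] > threshold],
    }
    effects: Dict[str, Optional[float]] = {}
    for name, rows in groups.items():
        treated = [y for _, t, y in rows if t == 1]
        control = [y for _, t, y in rows if t == 0]
        effects[name] = mean(treated) - mean(control) if treated and control else None
    return threshold, effects


# Synthetic, noise-free rows: the true effect is 1.0 for x < 0.5 and 3.0 otherwise.
rows: List[Row] = []
for i in range(16):
    x = [0.1, 0.2, 0.8, 0.9][i % 4]
    treated = (i // 4) % 2
    rows.append((x, treated, treated * (1.0 if x < 0.5 else 3.0)))

threshold, effects = honest_subgroup_effects(rows[0::2], rows[1::2])
```

The trade-off discussed in the abstract is visible in the call itself: each step sees only half of `rows`, which protects against overfitting but leaves less data for detecting heterogeneity.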

2603.01372 2026-06-03 cs.LG cs.AI

Causal Neural Probabilistic Circuits

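Weixin Chen, Han Zhao

AI summary: Proposes the Causal Neural Probabilistic Circuit (CNPC), which combines a neural attribute predictor with a causal probabilistic circuit compiled from a causal graph to support exact interventional inference that respects causal dependencies, improving the classification accuracy of concept bottleneck models under interventions.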

Abstract

Concept Bottleneck Models (CBMs) enhance the interpretability of end-to-end neural networks by introducing a layer of concepts and predicting the class label from the concept predictions. A key property of CBMs is that they support interventions, i.e., domain experts can correct mispredicted concept values at test time to improve the final accuracy. However, typical CBMs apply interventions by overwriting only the corrected concept while leaving other concept predictions unchanged, which ignores causal dependencies among concepts. To address this, we propose the Causal Neural Probabilistic Circuit (CNPC), which combines a neural attribute predictor with a causal probabilistic circuit compiled from a causal graph. This circuit supports exact, tractable causal inference that inherently respects causal dependencies. Under interventions, CNPC models the class distribution based on a Product of Experts (PoE) that fuses the attribute predictor's predictive distribution with the interventional marginals computed by the circuit. We theoretically characterize the compositional interventional error of CNPC w.r.t. its modules and identify conditions under which CNPC closely matches the ground-truth interventional class distribution. Experiments on five benchmark datasets in both in-distribution and out-of-distribution settings show that, compared with five baseline models, CNPC achieves higher task accuracy across different numbers of intervened attributes.
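A Product of Experts combines distributions by multiplying them elementwise and renormalizing. The sketch below shows that generic operation for two categorical distributions over the same outcomes; how CNPC actually parameterizes and fuses its experts is not specified in the abstract, so the function and the example numbers are assumptions for illustration only.

```python
from typing import List


def product_of_experts(p: List[float], q: List[float]) -> List[float]:
    """Fuse two categorical distributions: multiply elementwise, then normalize."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    unnormalized = [a * b for a, b in zip(p, q)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("experts assign zero joint mass to every outcome")
    return [u / total for u in unnormalized]


# Hypothetical example: a predictor's belief fused with a second expert's marginal.
predictor_belief = [0.5, 0.5]
second_expert = [0.9, 0.1]
fused = product_of_experts(predictor_belief, second_expert)
```

An uninformative expert (here the uniform `predictor_belief`) leaves the other expert's distribution unchanged, while an outcome that either expert rules out receives zero fused probability.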

2602.17063 2026-06-03 cs.LG cs.AI cs.CL cs.CV

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

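Akira Sakai, Yuma Ichikawa

AI summary: Studies the sign-bit bottleneck in sub-bit model compression, uses sign lock-in theory to explain where the randomness of weight signs comes from, and proposes a from-scratch low-rank sign-template training method to break through the bottleneck.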

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

Sub-bit model compression targets storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. This randomness gives rise to the lower bound of sub-bit model compression -- the one-bit wall. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood of zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a from-scratch low-rank sign-template training method that prevents the emergence of this one-bit wall.
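The claim that most weights retain their initialization signs suggests a simple diagnostic: compare the sign of each weight before and after training. The sketch below is one assumed way to measure this on flat lists of weights (treating zero as non-negative); it is not the paper's measurement protocol, and the weights shown are invented.

```python
from typing import List


def sign_retention_rate(initial: List[float], trained: List[float]) -> float:
    """Fraction of weights whose sign matches the sign at initialization.

    Zero is treated as non-negative, an arbitrary convention for this sketch.
    """
    if len(initial) != len(trained) or not initial:
        raise ValueError("weight lists must be non-empty and of equal length")
    retained = sum((w0 >= 0) == (w1 >= 0) for w0, w1 in zip(initial, trained))
    return retained / len(initial)


# Invented weights: only the third one crosses zero during "training".
rate = sign_retention_rate([0.3, -0.2, 0.5, -0.7], [0.1, -0.4, -0.05, -0.9])
```

A rate near 1 would be consistent with the lock-in picture, in which sign flips require a weight to pass through a small neighborhood of zero.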

2602.12221 2026-06-03 cs.CV

Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

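Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou

AI summary: Proposes the UniDFlow framework, which decouples understanding and generation via task-specific low-rank adapters and uses reference-based multimodal preference alignment to improve faithfulness and controllability, achieving state-of-the-art performance on multiple benchmarks.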

Abstract

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.

2602.02890 2026-06-03 cs.LG

Self-Soupervision: Cooking Model Soups without Labels

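Anthony Fuller, James R. Green, Evan Shelhamer

AI summary: Proposes Self-Soupervision, which extends model soups to self-supervised learning, using unlabeled data and mixing parameters trained with different self-supervised algorithms to improve model robustness and accuracy.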

Comments code: https://github.com/antofuller/self_soupervision data: https://huggingface.co/datasets/antofuller/mini-VTAB

Abstract

Model soups are strange and strangely effective combinations of parameters. They take a model (the stock), fine-tune it into multiple models (the ingredients), and then mix their parameters back into one model (the soup) to improve predictions. While all known soups require supervised learning, and optimize the same loss on labeled data, our recipes for Self-Soupervision generalize soups to self-supervised learning (SSL). Our Self-Souping lets us flavor ingredients on new data sources, e.g. from unlabeled data from a task for transfer or from a shift for robustness. We show that Self-Souping on corrupted test data, then fine-tuning back on uncorrupted train data, boosts robustness by +3.5% (ImageNet-C) and +7% (LAION-C). Self-Soupervision also unlocks countless SSL algorithms to cook the diverse ingredients needed for more robust soups. We show for the first time that ingredients can differ in their SSL hyperparameters -- and more surprisingly, in their SSL algorithms. We cook soups of MAE, MoCoV3, MMCR, and LeJEPA ingredients that are more accurate than any single SSL ingredient.
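The "mix their parameters back into one model" step can be sketched as a uniform average over ingredients that share the same parameter names and shapes. Uniform averaging is only one possible mixing rule and is assumed here for illustration; the abstract does not say which rule Self-Soupervision uses, and the parameter dictionaries below are invented.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened values


def uniform_soup(ingredients: List[Params]) -> Params:
    """Average each parameter elementwise across all ingredients."""
    if not ingredients:
        raise ValueError("need at least one ingredient")
    names = ingredients[0].keys()
    if any(ing.keys() != names for ing in ingredients):
        raise ValueError("all ingredients must share the same parameter names")
    count = len(ingredients)
    return {
        name: [sum(values) / count for values in zip(*(ing[name] for ing in ingredients))]
        for name in names
    }


# Two invented ingredients fine-tuned from the same stock.
soup = uniform_soup([
    {"w": [1.0, 3.0], "b": [0.0]},
    {"w": [3.0, 5.0], "b": [2.0]},
])
```

Averaging in parameter space only makes sense when the ingredients start from a common stock, which is why the abstract describes fine-tuning one model into several before mixing.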

2602.01903 2026-06-03 cs.LG stat.ML

Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

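Mingyi Li, Taira Tsuchiya, Kenji Yamanishi

AI summary: For online tabular Markov decision processes with known transitions, proposes best-of-both-worlds algorithms that achieve data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime, and proves that the global-optimization approach is nearly optimal.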

Comments Accepted at ICML 2026. 72 pages, 4 tables

Abstract

This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.

2510.16392 2026-06-03 cs.AI

RGMem: Renormalization Group-inspired Memory Evolution for Language Agents

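Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, Yanfang Liu

AI summary: Proposes the RGMem framework, which applies renormalization group ideas to long-term conversational memory through multi-scale coarse-graining, thresholded updates, and rescaling, hierarchically integrating facts into user preferences and outperforming existing memory systems on the LOCOMO and PersonaMem benchmarks.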

Comments Accepted to ICML 2026

Abstract

Personalized and continuous interactions are critical for LLM-based conversational agents, yet finite context windows and static parametric memory hinder the modeling of long-term, cross-session user states. Existing approaches, including retrieval-augmented generation and explicit memory systems, primarily operate at the fact level, making it difficult to distill stable preferences and deep user traits from evolving and potentially conflicting dialogues. To address this challenge, we propose RGMem, a self-evolving memory framework inspired by the renormalization group (RG) perspective on multi-scale organization and emergence. RGMem models long-term conversational memory as a multi-scale evolutionary process: episodic interactions are transformed into semantic facts and user insights, which are then progressively integrated through hierarchical coarse-graining, thresholded updates, and rescaling into a dynamically evolving user profile. By explicitly separating fast-changing evidence from slow-varying traits and enabling non-linear, phase-transition-like dynamics, RGMem enables robust personalization beyond flat retrieval or static summarization. Extensive experiments on the LOCOMO and PersonaMem benchmarks demonstrate that RGMem consistently outperforms SOTA memory systems, achieving stronger cross-session continuity and improved adaptation to evolving user preferences. Code is available at https://github.com/fenhg297/RGMem.

2602.00392 2026-06-03 cs.LG

Localized, High-resolution Geographic Representations with Slepian Functions

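Arjun Rao, Ruth Crasto, Tessa Ooms, David Rolnick, Konstantin Klemmer, Marc Rußwurm

AI summary: Proposes a geographic location encoder built from spherical Slepian functions that concentrates representational capacity inside a region of interest, reaching high resolution at low computational cost, and introduces a hybrid Slepian-spherical harmonic encoder that balances local and global performance; it outperforms baselines on tasks including classification and regression.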

Comments ICML 2026

Abstract

Geographic data is fundamentally local. Disease outbreaks cluster in population centers, ecological patterns emerge along coastlines, and economic activity concentrates within country borders. Machine learning models that encode geographic location, however, distribute representational capacity uniformly across the globe, struggling at the fine-grained resolutions that localized applications require. We propose a geographic location encoder built from spherical Slepian functions that concentrate representational capacity inside a region-of-interest and scale to high resolutions without extensive computational demands. For settings requiring global context, we present a hybrid Slepian-Spherical Harmonic encoder that efficiently balances the trade-off between local and global performance, while retaining desirable properties such as pole-safety and spherical-surface-distance preservation. Across five tasks spanning classification, regression, and image-augmented prediction, Slepian encodings outperform baselines and retain performance advantages across a wide range of neural network architectures.

2601.22841 2026-06-03 cs.CV

How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models

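Leonard Hackel, Tom Burgert, Begüm Demir

AI summary: Uses post-hoc slimming (uniform width reduction of the encoder's transformer blocks) to assess representational redundancy in eight remote sensing foundation models, finding that they retain 69% to 109% relative accuracy under aggressive width reduction while models pretrained on natural images degrade sharply, which indicates that remote sensing models encode information redundantly and can be slimmed effectively.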

Abstract

Large-scale foundation models (FMs) in remote sensing (RS) (denoted as RS FMs) are developed following paradigms established in computer vision (CV), yet the validity of transferring CV scaling laws to RS has not been systematically examined. We hypothesize that RS FMs enter an overparameterized regime at substantially smaller scales than their CV counterparts, with task-relevant information encoded redundantly across model dimensions. To test this hypothesis, we apply post-hoc slimmability, uniform width reduction of pretrained encoder transformer blocks, as a tool to measure representational redundancy across eight state-of-the-art RS FMs on classification, segmentation, and change detection tasks. RS FMs retain 69% to 109% relative accuracy on RS datasets under aggressive width reduction, while masked autoencoder (MAE) and DINOv2 pretrained on natural images (denoted as CV MAE and CV DINOv2) degrade sharply on ImageNet subsets of matched class count over the same range of computational requirements. A CV MAE evaluated directly on the same RS datasets narrows but does not close the gap, indicating that both dataset characteristics and domain-specific pretraining contribute to the differences between the models. Mechanistic analyses such as feature correlation, explained variance, and effective dimensionality indicate that task-relevant variance concentrates in few principal components and is redundantly encoded across model dimensions. We further show that learned slimmable training improves over post-hoc slimmability for contrastive objectives, while reconstruction-based objectives do not benefit from current slimmable training protocols. Our findings establish post-hoc slimming as a practical deployment strategy for resource-constrained RS applications and as a diagnostic tool for representational redundancy in RS FMs. Upon acceptance, we will publish all code.

2601.22599 2026-06-03 cs.SD cs.HC

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

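Kai Li, Jintao Cheng, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu

AI summary: Proposes an automated pipeline that eliminates event co-occurrence through a semantically consistent synthesis protocol and uses it to build Hive, a high-quality synthetic dataset, enabling models trained on far less data to reach separation accuracy and generalization comparable to models trained at large scale.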

Comments Accepted to ICML 2026

Abstract

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio, which was trained on a dataset $\sim$500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing the purity of supervision signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://cslikai.cn/Hive.

2512.19347 2026-06-03 cs.RO

OMP: One-step Meanflow Policy with Directional Alignment

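Han Fang, Yize Huang, Yuheng Zhao, Paul Weng, Xiao Li, Yutong Ban

AI summary: Proposes the One-step MeanFlow Policy (OMP), which uses a directional alignment mechanism and a Differential Derivation Equation to address spectral bias and gradient starvation when applying MeanFlow to robot manipulation, enabling high-fidelity real-time manipulation.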

Comments Accepted as poster of ICML-2026

详情
AI中文摘要

机器人操作日益采用数据驱动的生成策略框架,但该领域面临持续的权衡:扩散模型推理延迟高,而基于流的方法通常需要复杂的架构约束。尽管在图像生成领域,均值流范式提供了单步推理的路径,但其直接应用于机器人领域受到关键理论病理的阻碍,特别是低速度区域中的谱偏差和梯度饥饿。为克服这些限制,我们提出了一步均值流策略(OMP),一种专为高保真实时操作设计的新型框架。我们引入轻量级方向对齐机制,以显式同步预测速度与真实均值速度。此外,我们实现了微分推导方程(DDE)来近似雅可比向量积(JVP)算子,该算子解耦前向和后向传播,显著降低内存复杂度。在Adroit和Meta-World基准上的大量实验表明,OMP在成功率和轨迹精度上优于最先进方法,特别是在高精度任务中,同时保持了单步生成的效率。

英文摘要

Robot manipulation has increasingly adopted data-driven generative policy frameworks, yet the field faces a persistent trade-off: diffusion models suffer from high inference latency, while flow-based methods often require complex architectural constraints. Although the MeanFlow paradigm offers a path to single-step inference in the image generation domain, its direct application to robotics is impeded by critical theoretical pathologies, specifically spectral bias and gradient starvation in low-velocity regimes. To overcome these limitations, we propose the One-step MeanFlow Policy (OMP), a novel framework designed for high-fidelity, real-time manipulation. We introduce a lightweight directional alignment mechanism to explicitly synchronize predicted velocities with true mean velocities. Furthermore, we implement a Differential Derivation Equation (DDE) to approximate the Jacobian-Vector Product (JVP) operator, which decouples forward and backward passes to significantly reduce memory complexity. Extensive experiments on the Adroit and Meta-World benchmarks demonstrate that OMP outperforms state-of-the-art methods in success rate and trajectory accuracy, particularly in high-precision tasks, while retaining the efficiency of single-step generation.
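
The abstract does not define the DDE, so the following is only a generic illustration of what approximating a Jacobian-vector product can look like: a central finite difference that needs forward evaluations only. It is not the paper's method; the function name `jvp_central_difference` and the step size `eps` are assumptions made for this sketch.

```python
def jvp_central_difference(f, x, v, eps=1e-5):
    """Approximate J_f(x) @ v with a central finite difference.

    f maps a list of floats to a list of floats; x is the evaluation
    point and v the direction. Generic technique, not the paper's DDE.
    """
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    return [(a - b) / (2 * eps) for a, b in zip(f(x_plus), f(x_minus))]


def example_fn(x):
    # Toy function with Jacobian [[2*x0, 0], [x1, x0]]
    return [x[0] ** 2, x[0] * x[1]]


approx = jvp_central_difference(example_fn, [3.0, 2.0], [1.0, 0.0])
```

Because only forward evaluations of `f` are used, no computation graph has to be kept in memory, which is the general appeal of such approximations.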

2601.11429 2026-06-03 cs.CL cs.AI

Relational Linearity is a Predictor of Hallucinations

关系线性是幻觉的预测因子

Yuetian Lu, Yihong Liu, Sebastian Gerstner, Lea Hirlimann, Jonas Rohweder, Hinrich Schütze

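AI summary Using a benchmark of synthetic unknown entities, finds that language models hallucinate more readily when answering questions about linear relations, and that relational linearity correlates strongly with hallucination rate.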

Comments 15 pages, 6 figures, 14 tables

详情
AI中文摘要

幻觉是语言模型(LMs)的一个核心失败模式。我们关注对诸如“格伦·古尔德演奏哪种乐器?”这类问题的幻觉,但针对设计为模型未知的合成实体提问。我们发现,像Gemma-7B-IT这样的LM经常产生幻觉,即它们难以识别幻觉事实不属于其知识。基于线性关系嵌入的思想,我们提出以下假设:(i)由于用于表示它们的抽象方案,LM可以轻松地为线性关系的非存在主体生成合理的对象,这可能导致幻觉。(ii)对于非线性关系,这种生成对象的机制不可用,因此更容易避免幻觉。为了验证这一假设,我们创建了SyntHal,一个针对15种关系的合成未知实体基准。我们发现,在四个指令调优模型中,关系线性度是模型为未知主体生成对象(而非拒绝回答)的强预测因子,相关系数$r \in [.58, .84]$。

英文摘要

Hallucination is a central failure mode of language models (LMs). We focus on hallucinations in response to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities designed to be unknown to the model. We find that LMs like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. Based on the idea of linear relational embeddings, we put forward the following hypothesis. (i) Due to the abstract scheme that is used to represent them, LMs can easily produce plausible objects for non-existing subjects of linear relations, which can lead to hallucinations. (ii) For a nonlinear relation, this mechanism for producing an object is not available and so a hallucination is easier to avoid. To test this hypothesis, we create SyntHal, a synthetic unknown-entity benchmark for 15 relations. We find that across four instruction-tuned models, relational linearity is a strong predictor of models hallucinating an object for an unknown subject vs refusing to give an answer, with correlations $r \in [.58, .84]$.
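
To make the reported correlations concrete, here is a minimal sketch of a correlation between a per-relation linearity score and a per-relation hallucination rate. It assumes Pearson's $r$ is the statistic (the abstract does not say which coefficient is used), and the two lists are made-up illustrative values, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-relation values, for illustration only
linearity = [0.1, 0.3, 0.5, 0.7, 0.9]
halluc_rate = [0.2, 0.25, 0.5, 0.6, 0.8]
r = pearson_r(linearity, halluc_rate)
```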

2512.22539 2026-06-03 cs.RO cs.CV

VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

VLA-Arena:一个用于基准测试视觉-语言-动作模型的开源框架

Borong Zhang, Jiahao Li, Jiachen Shen, Yuhao Zhang, Yishuai Cai, Yuanpei Chen, Juntao Dai, Jiaming Ji, Yaodong Yang

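AI summary Proposes the VLA-Arena benchmark, which quantifies task difficulty along three orthogonal axes (task structure, language command, visual observation) to systematically evaluate the capability frontiers and failure modes of vision-language-action models.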

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管视觉-语言-动作模型(VLA)正快速向通用机器人策略发展,但定量理解其局限和失败模式仍然困难。为此,我们引入了一个名为VLA-Arena的全面基准。我们提出了一种新颖的结构化任务设计框架,用于在三个正交轴上量化难度:(1)任务结构,(2)语言命令,以及(3)视觉观察。这使我们能够系统地设计具有细粒度难度级别的任务,从而精确测量模型能力边界。对于任务结构,VLA-Arena的170个任务被分为四个维度:安全性、干扰物、外推和长时域。每个任务设计有三个难度级别(L0-L2),仅在L0上进行微调以评估通用能力。正交于此,语言(W0-W4)和视觉(V0-V4)扰动可应用于任何任务,以实现鲁棒性的解耦分析。我们对最先进的VLA进行了广泛评估,揭示了几个关键局限性,包括强烈的记忆化倾向而非泛化、不对称鲁棒性、缺乏对安全约束的考虑,以及无法组合已学技能以完成长时域任务。为了促进针对这些挑战的研究并确保可重复性,我们提供了完整的VLA-Arena框架,包括从任务定义到自动评估的端到端工具链,以及用于微调的VLA-Arena-S/M/L数据集。我们的基准、数据、模型和排行榜可在https://vla-arena.github.io获取。

英文摘要

While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at https://vla-arena.github.io.

2512.18268 2026-06-03 cs.RO cs.CG

On The Computational Complexity of Minimum Aerial Photographs for Planar Region Coverage

关于平面区域覆盖的最小航拍照片的计算复杂性

Si Wei Feng

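AI summary Studies the computational complexity of covering a simple planar polygon with squares and circles, proving inapproximability gaps and developing a 2.828-optimal approximation algorithm.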

Comments I have not communicated well with other contributors to the work when submitting this paper

详情
AI中文摘要

随着无人机技术的普及,航拍在环境监测、结构检查、执法等日常场景中变得普遍。该领域的一个核心挑战是在尊重图像分辨率和有限拍摄数量等约束的同时,高效地用照片覆盖目标区域,使照片能够完整捕捉该区域。本文研究了使用正方形和圆形覆盖简单平面多边形的计算复杂性。具体来说,它展示了1.165(对于正方形)和1.25(对于受限正方形中心)的不可近似性间隙,并开发了一个2.828-最优近似算法,表明这些问题在计算上难以近似。本文的直觉可以扩展到航拍之外更广泛的应用,如农药喷洒和战略传感器放置。

英文摘要

With the popularity of drone technologies, aerial photography has become prevalent in many daily scenarios such as environment monitoring, structure inspection, and law enforcement. A central challenge in this domain is the efficient coverage of a target area with photographs that entirely capture the region, while respecting constraints such as the image resolution and the limited number of pictures that can be taken. This work investigates the computational complexity of covering a simple planar polygon using squares and circles. Specifically, it shows inapproximability gaps of $1.165$ (for squares) and $1.25$ (for restricted square centers), demonstrating that these problems are computationally intractable to approximate, and develops a $2.828$-optimal approximation algorithm. The intuitions of this work can extend beyond aerial photography to broader applications such as pesticide spraying and strategic sensor placement.

2512.21235 2026-06-03 cs.RO

RoboCade: Gamifying Robot Data Collection

RoboCade: 游戏化机器人数据收集

Suvir Mirchandani, Mia Tang, Jiafei Duan, Jubayer Ibn Hamid, Michael Cho, Dorsa Sadigh

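AI summary Proposes RoboCade, a gamified remote teleoperation platform that embeds gamification elements such as visual feedback, sound effects, and progress bars to engage general users in collecting demonstration data, and shows that this data improves downstream policy success rates by 16-56% while users find the platform 24% more enjoyable.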

Comments 10 pages, 9 figures. International Conference on Robotics and Automation (ICRA) 2026

详情
AI中文摘要

从人类演示中模仿学习已成为训练自主机器人策略的主流方法。然而,收集演示数据集成本高昂:通常需要访问机器人,并在冗长乏味的过程中持续付出努力。这些因素限制了可用于训练策略的数据规模。我们旨在通过让更广泛的受众参与既易于访问又具有激励性的游戏化数据收集体验来解决这一可扩展性挑战。具体来说,我们开发了一个游戏化远程操作平台RoboCade,以吸引普通用户收集对下游策略训练有益的数据。为此,我们将游戏化策略嵌入系统界面和数据收集任务的设计中。在系统界面中,我们包含视觉反馈、音效、目标可视化、进度条、排行榜和徽章等组件。我们还提出了构建与有用下游目标任务具有重叠结构的游戏化任务的原则。我们在三个操作任务(包括空间排列、扫描和插入)上实例化了RoboCade。为了说明游戏化机器人数据收集的可行性,我们通过平台收集了一个演示数据集,并表明使用该数据共同训练机器人策略可以提高非游戏化目标任务的成功率(+16-56%)。此外,我们进行了一项用户研究,验证了新手用户认为游戏化平台比标准非游戏化平台显著更有趣(+24%)。这些结果凸显了游戏化数据收集作为收集演示数据的一种可扩展、可访问且引人入胜的方法的前景。

英文摘要

Imitation learning from human demonstrations has become a dominant approach for training autonomous robot policies. However, collecting demonstration datasets is costly: it often requires access to robots and needs sustained effort in a tedious, long process. These factors limit the scale of data available for training policies. We aim to address this scalability challenge by involving a broader audience in a gamified data collection experience that is both accessible and motivating. Specifically, we develop a gamified remote teleoperation platform, RoboCade, to engage general users in collecting data that is beneficial for downstream policy training. To do this, we embed gamification strategies into the design of the system interface and data collection tasks. In the system interface, we include components such as visual feedback, sound effects, goal visualizations, progress bars, leaderboards, and badges. We additionally propose principles for constructing gamified tasks that have overlapping structure with useful downstream target tasks. We instantiate RoboCade on three manipulation tasks -- including spatial arrangement, scanning, and insertion. To illustrate the viability of gamified robot data collection, we collect a demonstration dataset through our platform, and show that co-training robot policies with this data can improve success rate on non-gamified target tasks (+16-56%). Further, we conduct a user study to validate that novice users find the gamified platform significantly more enjoyable than a standard non-gamified platform (+24%). These results highlight the promise of gamified data collection as a scalable, accessible, and engaging method for collecting demonstration data.

2512.05530 2026-06-03 cs.AI

MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

MIND:面向多模态大模型的多理由集成判别推理框架

Chuang Yu, Jinmiao Zhao, Mingxuan Zhao, Yunpeng Liu, Xiujun Shu, Yuanhao Feng, Bo Wang, Xiangyu Yue

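AI summary To address the limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleading cues of multimodal large language models, proposes the MIND reasoning framework, which uses an "Understand -> Rethink -> Correct" mechanism to shift from passive imitation-based reasoning to active discriminative reasoning.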

Comments Accepted to ICML 2026

详情
AI中文摘要

最近,多模态大语言模型(MLLMs)被广泛应用于推理任务。然而,它们存在多理由语义建模有限、逻辑鲁棒性不足以及易受误导线索影响的问题。因此,我们提出了一个多理由集成判别(MIND)推理框架,旨在赋予MLLMs类似人类的“理解-反思-纠正”认知能力,实现从基于被动模仿的推理到主动判别推理的范式演变。具体而言,我们引入了理由增强与判别(RAD)范式,提供了统一且可扩展的数据基础。同时,我们设计了渐进式两阶段纠正学习(P2CL)策略:第一阶段增强多理由正向学习,第二阶段实现主动逻辑判别与纠正。此外,为了缓解多理由语义空间中的表示纠缠,我们提出了多理由对比对齐(MCA)优化策略。大量实验表明,我们的MIND在多个公共数据集上达到了最先进的性能。我们的数据和代码可在https://github.com/YuChuang1205/MIND获取。

英文摘要

Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleading cues. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with the human-like cognitive abilities of "Understand -> Rethink -> Correct" and to move from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which provides a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy: the first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy. Extensive experiments show that MIND achieves SOTA performance on multiple public datasets. Our data and code are available at https://github.com/YuChuang1205/MIND.

2512.00360 2026-06-03 cs.CL

CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA

CourseTimeQA: 一个讲座视频基准和一种用于时间戳问答的延迟约束跨模态融合方法

Vsevolod Kovalev, Parteek Kumar

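AI summary For timestamped question answering over educational lecture videos under a single-GPU latency/memory budget, proposes the CourseTimeQA benchmark and CrossFusion-RAG, a lightweight latency-constrained cross-modal retriever that uses frozen encoders, shallow query-agnostic cross-attention, and temporal-consistency regularization to improve nDCG@10 by 0.10 and MRR by 0.08, with a median end-to-end latency of about 1.55 s.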

Comments This paper is being withdrawn because an error in our measurement procedure produced incorrect values in our reported retrieval results (Tables I, II, V, and VI, and the corresponding headline figures in the Abstract). Several of our empirical claims depend on these measurements and therefore do not hold as stated

详情
AI中文摘要

我们研究了在单GPU延迟/内存预算下,对教育讲座视频进行时间戳问答。给定自然语言查询,系统检索相关的时间戳片段并合成有依据的答案。我们提出了CourseTimeQA(52.3小时,涵盖六门课程的902个查询)和一种轻量级、延迟约束的跨模态检索器(CrossFusion-RAG),它结合了冻结编码器、一个学习的512->768视觉投影、在ASR和帧上的浅层查询无关交叉注意力(带有时间一致性正则化)以及一个小型交叉注意力重排序器。在CourseTimeQA上,CrossFusion-RAG相比强基线BLIP-2检索器,nDCG@10提升了0.10,MRR提升了0.08,同时在单个A100上实现了约1.55秒的中位端到端延迟。最接近的对比方法(零样本CLIP多帧池化;CLIP+交叉编码器重排序器+MMR;学习的后期融合门控;仅文本混合方法(带交叉编码器重排序及其MMR变体);字幕增强文本检索;非学习的时间平滑)在匹配的硬件和索引下进行了评估。我们报告了在ASR噪声(WER四分位数)下的鲁棒性、时间定位的诊断结果,以及完整的训练/调优细节,以支持可重复的比较。

英文摘要

We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating; text-only hybrid with cross-encoder reranking and its MMR variant; caption-augmented text retrieval; non-learned temporal smoothing) are evaluated under matched hardware and indexing. We report robustness across ASR noise (WER quartiles), diagnostics for temporal localization, and full training/tuning details to support reproducible comparison.
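
For readers unfamiliar with the two retrieval metrics quoted above, the following is a minimal sketch of nDCG@k and MRR. It assumes the linear-gain form of DCG (the abstract does not say whether linear or exponential gain is used), and the function names are chosen for this illustration only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(ranked_lists):
    """Mean over queries of 1/rank of the first relevant item (0 if none)."""
    total = 0.0
    for rels in ranked_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

Each input list holds the relevance labels of the retrieved segments in ranked order, so `ndcg_at_k([0, 1])` scores a ranking that places the only relevant segment second.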

2510.17149 2026-06-03 cs.AI

ProtocolBench: Which LLM MultiAgent Protocol to Choose?

ProtocolBench:选择哪个LLM多智能体协议?

Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, Jiaxuan You

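AI summary Proposes the ProtocolBench benchmark, which systematically compares multi-agent protocols on task success, latency, overhead, and robustness, and designs ProtocolRouter, a learnable protocol router that dynamically selects the best protocol.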

Comments Accepted to ICML 2026. Camera-ready version. Code and benchmark artifacts: https://github.com/ulab-uiuc/AgentProtocols

详情
AI中文摘要

随着大规模多智能体系统的发展,通信协议层已成为影响性能和可靠性的关键但评估不足的因素。尽管存在多种协议(A2A、ACP、ANP、Agora等),选择往往依赖直觉且缺乏标准化指导。我们引入ProtocolBench,一个沿四个可测量轴(任务成功率、端到端延迟、消息或字节开销、故障下的鲁棒性)系统比较智能体协议的基准。在ProtocolBench上,协议选择显著影响系统行为。在流队列场景中,不同协议的整体完成时间差异高达36.5%,平均端到端延迟相差3.48秒。在故障风暴恢复下,不同协议的鲁棒性也持续存在差异。除评估外,我们提出ProtocolRouter,一个可学习的协议路由器,根据需求和运行时信号为每个场景(或每个模块)选择协议。ProtocolRouter相比最佳单协议基线将故障风暴恢复时间降低高达18.1%,并在GAIA等场景中取得更高成功率。我们还发布了ProtocolRouterBench以标准化协议评估并提高大规模可靠性。

英文摘要

As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping performance and reliability. Despite the existence of diverse protocols (A2A, ACP, ANP, Agora, etc.), selection is often intuition-driven and lacks standardized guidance. We introduce ProtocolBench, a benchmark that systematically compares agent protocols along four measurable axes: task success, end-to-end latency, message or byte overhead, and robustness under failures. On ProtocolBench, protocol choice significantly influences system behavior. In the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols, and mean end-to-end latency differs by 3.48 s. Under Fail-Storm Recovery, resilience also differs consistently across protocols. Beyond evaluation, we present ProtocolRouter, a learnable protocol router that selects per-scenario (or per-module) protocols from requirement and runtime signals. ProtocolRouter reduces Fail-Storm recovery time by up to 18.1% versus the best single-protocol baseline, and achieves scenario-specific gains such as higher success in GAIA. We also release ProtocolRouterBench to standardize protocol evaluation and improve reliability at scale.

2510.16302 2026-06-03 cs.AI cs.IR

DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA

DTKG: 用于多跳问答的双轨知识图谱验证推理框架

Changhao Wang, Yanfang Liu, Xinxin Fan, Ao Tian, Lanzhi Zhou, Yunfeng Lu

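AI summary Proposes the DTKG framework, which uses a classification stage and a branch processing stage to handle parallel fact verification and chained multi-hop reasoning separately, improving the efficiency and accuracy of multi-hop QA.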

Comments Accepted to ICML 2026

详情
AI中文摘要

问答中的多跳推理在现代大型语言模型的检索增强生成中扮演关键角色。通过从知识图谱中检索实体的关系结构可以获得准确答案。考虑到固有的关系依赖和推理模式,多跳推理通常分为两类:i) 并行事实验证多跳推理问题,即需要同时验证多个独立子问题;ii) 链式多跳推理问题,即需要顺序多步推理,中间结论作为后续推理的必要前提。目前,多跳推理方法单独使用两种技术之一:基于LLM响应的事实验证和基于KG路径的链构建。然而,前者擅长并行事实验证但在链式推理任务上表现不佳,而后者擅长链式多跳推理但在处理并行事实验证推理时存在冗余路径检索问题。这些限制降低了多跳问答任务的效率和准确性。为解决这一挑战,我们提出了一种新颖的双轨KG验证和推理框架DTKG,其灵感来自认知科学中的双过程理论。具体来说,DTKG包括两个主要阶段:分类阶段和分支处理阶段。

英文摘要

Multi-hop reasoning for question answering (QA) plays a critical role in retrieval-augmented generation (RAG) for modern large language models (LLMs). Accurate answers can be obtained by retrieving the relational structure of entities from a knowledge graph (KG). Given the inherent relation dependencies and reasoning patterns, multi-hop reasoning questions can generally be classified into two categories: i) parallel fact-verification multi-hop reasoning questions, which require simultaneous verification of multiple independent sub-questions; and ii) chained multi-hop reasoning questions, which demand sequential multi-step inference with intermediate conclusions serving as essential premises for subsequent reasoning. Currently, multi-hop reasoning approaches employ only one of two techniques: LLM response-based fact verification or KG path-based chain construction. However, the former excels at parallel fact verification but underperforms on chained reasoning tasks, while the latter is proficient in chained multi-hop reasoning but suffers from redundant path retrieval when handling parallel fact-verification reasoning. These limitations reduce the efficiency and accuracy of multi-hop QA. To address this challenge, we propose DTKG, a novel dual-track KG verification and reasoning framework inspired by the Dual Process Theory in cognitive science. Specifically, DTKG comprises two main stages: the Classification Stage and the Branch Processing Stage.

2510.09845 2026-06-03 cs.LG cs.AI cs.CV

Harnessing Self-Supervised Deep Learning and Geostationary Remote Sensing for Advancing Wildfire and Associated Air Quality Monitoring: Improved Smoke and Fire Front Masking using GOES and TEMPO Radiance Data

利用自监督深度学习和地球静止遥感推进野火及相关空气质量监测:使用GOES和TEMPO辐射数据改进烟雾和火锋掩膜

Nicholas LaHaye, Thilanka Munashinge, Hugo Lee, Xiaohua Pan, Gonzalo Gonzalez Abad, Hazem Mahmoud, Jennifer Wei

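AI summary Using hourly data from NASA's TEMPO satellite mission and self-supervised deep learning, this study presents an innovative system that effectively distinguishes smoke from clouds with GOES-18 and TEMPO data and maps wildfire fronts and smoke plumes in near real time, significantly outperforming existing operational products.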

Comments https://2025.ieeeigarss.org/view_paper.php?PaperNum=6389&SessionID=1611

详情
AI中文摘要

这项工作展示了通过利用NASA的TEMPO卫星任务前所未有的每小时数据以及自监督深度学习的进展,改善美国西部野火和空气质量管理的可能性。我们展示了一种创新的自监督深度学习系统在绘制近实时每小时野火火锋和烟雾羽流扩散方面的有效性:成功使用GOES-18和TEMPO数据区分烟雾与云层,不同传感模态生成的烟雾和火掩膜之间具有强一致性,并且对于相同案例相比业务产品有显著改进。

英文摘要

This work demonstrates the possibilities for improving wildfire and air quality management in the western United States by leveraging the unprecedented hourly data from NASA's TEMPO satellite mission and advances in self-supervised deep learning. We demonstrate the efficacy of an innovative self-supervised deep learning system for mapping the near-real-time hourly spread of wildfire fronts and smoke plumes: it successfully distinguishes smoke plumes from clouds using GOES-18 and TEMPO data, shows strong agreement across the smoke and fire masks generated from different sensing modalities, and improves significantly over operational products for the same cases.

2509.25859 2026-06-03 cs.CV cs.SY eess.SY

LiDAR Point Cloud Colourisation Using Multi-Camera Fusion and Low-Light Image Enhancement

使用多相机融合和低光图像增强的LiDAR点云着色

Pasindu Ranasinghe, Dibyayan Patra, Bikram Banerjee, Simit Raval

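AI summary Proposes a hardware-agnostic method that uses multi-camera fusion and a low-light enhancement module to colourise mechanical LiDAR point clouds with 360-degree coverage, recovering scene details even in low-light conditions.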

详情
Journal ref
Sensors 25(21), 6582 (2025)
AI中文摘要

近年来,相机数据与LiDAR测量的融合已成为增强空间理解的一种强大方法。本研究引入了一种新颖的、与硬件无关的方法,该方法使用多个相机输入从机械LiDAR生成着色点云,提供完整的360度覆盖。主要创新在于其在低光照条件下的鲁棒性,这是通过在融合管道中集成低光图像增强模块实现的。系统需要初始校准以确定相机内参,然后自动计算LiDAR与相机之间的几何变换,无需专门的校准目标,简化了设置。数据处理框架使用颜色校正来确保融合前相机馈送的一致性。该算法使用Velodyne Puck Hi-Res LiDAR和四相机配置进行了测试。优化后的软件实现了实时性能,即使在极低照度下也能可靠着色,成功恢复了原本无法检测的场景细节。

英文摘要

In recent years, the fusion of camera data with LiDAR measurements has emerged as a powerful approach to enhance spatial understanding. This study introduces a novel, hardware-agnostic methodology that generates colourised point clouds from mechanical LiDAR using multiple camera inputs, providing complete 360-degree coverage. The primary innovation lies in its robustness under low-light conditions, achieved through the integration of a low-light image enhancement module within the fusion pipeline. The system requires initial calibration to determine intrinsic camera parameters, followed by automatic computation of the geometric transformation between the LiDAR and cameras, removing the need for specialised calibration targets and streamlining the setup. The data processing framework uses colour correction to ensure uniformity across camera feeds before fusion. The algorithm was tested using a Velodyne Puck Hi-Res LiDAR and a four-camera configuration. The optimised software achieved real-time performance and reliable colourisation even under very low illumination, successfully recovering scene details that would otherwise remain undetectable.
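
The core geometric step in this kind of fusion can be sketched as follows: transform each LiDAR point into a camera frame, project it with the camera intrinsics, and read the colour at the resulting pixel. This is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes an ideal pinhole camera without lens distortion, a known rigid transform (`rotation`, `translation`), nearest-pixel lookup, and a single camera; all function and parameter names are hypothetical.

```python
def project_point(point_lidar, rotation, translation, fx, fy, cx, cy):
    """Map a LiDAR point to pixel coordinates (u, v) with a pinhole model.

    Returns None when the point lies behind the camera.
    """
    # Rigid transform into the camera frame: p_cam = R @ p_lidar + t
    xc, yc, zc = (
        sum(r * c for r, c in zip(row, point_lidar)) + t
        for row, t in zip(rotation, translation)
    )
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def colourise(points, image, rotation, translation, fx, fy, cx, cy):
    """Pair each point that projects inside the image with the pixel colour there."""
    height, width = len(image), len(image[0])
    coloured = []
    for pt in points:
        uv = project_point(pt, rotation, translation, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= u < width and 0 <= v < height:
            coloured.append((pt, image[v][u]))  # image is indexed [row][column]
    return coloured
```

In a multi-camera setup, the same routine would be run once per camera, with points that fall outside one image picked up by a neighbouring camera.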

2509.19305 2026-06-03 cs.LG cs.AI eess.SP

Wavelet Fourier Diffuser: Frequency-Aware Diffusion Model for Reinforcement Learning

小波傅里叶扩散器:用于强化学习的频率感知扩散模型

Yifu Luo, Yongzhe Chang, Xueqian Wang

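AI summary To address the frequency shift caused by existing diffusion models ignoring frequency-domain features in offline reinforcement learning, proposes WFDiffuser, which decomposes trajectories with the Discrete Wavelet Transform and strengthens frequency-domain modeling with the Short-Time Fourier Transform and cross attention, effectively mitigating frequency shift on the D4RL benchmark and improving trajectory stability and decision-making performance.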

Comments IJCNN 2025

详情
Journal ref
IJCNN 2025
AI中文摘要

扩散概率模型通过直接建模轨迹序列,在离线强化学习中展现出显著潜力。然而,现有方法主要关注时域特征而忽略频域特征,根据我们的观察,这会导致频率偏移和性能下降。在本文中,我们从频域的新视角研究强化学习问题。我们首先观察到,仅使用时域的方法会无意中引入频域低频分量的偏移,从而导致轨迹不稳定和性能下降。为了解决这个问题,我们提出了小波傅里叶扩散器(WFDiffuser),一种新颖的基于扩散的强化学习框架,它集成了离散小波变换将轨迹分解为低频和高频分量。为了进一步增强每个分量的扩散建模,WFDiffuser采用短时傅里叶变换和交叉注意力机制来提取频域特征并促进跨频率交互。在D4RL基准上的大量实验结果表明,WFDiffuser有效缓解了频率偏移,从而产生更平滑、更稳定的轨迹,并相比现有方法提高了决策性能。

英文摘要

Diffusion probabilistic models have shown significant promise in offline reinforcement learning (RL) by directly modeling trajectory sequences. However, existing approaches primarily focus on time-domain features while overlooking frequency-domain features, which, according to our observations, leads to frequency shift and degraded performance. In this paper, we investigate the RL problem from the new perspective of the frequency domain. We first observe that time-domain-only approaches inadvertently introduce shifts in the low-frequency components of the frequency domain, which results in trajectory instability and degraded performance. To address this issue, we propose Wavelet Fourier Diffuser (WFDiffuser), a novel diffusion-based RL framework that integrates the Discrete Wavelet Transform to decompose trajectories into low- and high-frequency components. To further enhance diffusion modeling for each component, WFDiffuser employs the Short-Time Fourier Transform and cross-attention mechanisms to extract frequency-domain features and facilitate cross-frequency interaction. Extensive experimental results on the D4RL benchmark demonstrate that WFDiffuser effectively mitigates frequency shift, leading to smoother, more stable trajectories and improved decision-making performance over existing methods.
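
The decomposition step can be illustrated with a one-level Haar transform, the simplest discrete wavelet. This is a minimal sketch: the abstract does not state which wavelet or how many levels WFDiffuser uses, so the Haar choice and the function names are assumptions. The approximation coefficients carry the low-frequency trend of a 1D trajectory channel, and the detail coefficients carry the high-frequency residue.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficient lists."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: reconstructs the original signal."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


# Toy trajectory channel: flat first pair, jump in the second pair
low, high = haar_dwt([1.0, 1.0, 2.0, 4.0])
restored = haar_idwt(low, high)
```

The transform is invertible, so the two components can be modeled separately and recombined without losing information.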

2509.11323 2026-06-03 cs.CV cs.AI

Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding

基于语义无关编码的KalmanNet多目标跟踪运动估计

Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long

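AI summary Proposes Semantic-Independent KalmanNet (SIKNet), which improves motion estimation with a Semantic-Independent Encoder (SIE) and is more robust and accurate in MOT than the traditional Kalman filter and learning-aided filters.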

详情
AI中文摘要

运动估计是多目标跟踪(MOT)中的关键组成部分。它通过分析连续帧图像中物体位置的变化来预测物体的轨迹,减少跟踪失败和身份切换。基于线性恒速模型的卡尔曼滤波器(KF)是MOT中最常用的方法之一。然而,当KF参数不匹配且物体非平稳运动时,可能产生不理想的结果。在这项工作中,我们利用学习辅助滤波器来处理MOT的运动估计。具体地,我们提出了一种名为语义无关KalmanNet(SIKNet)的新方法,该方法通过两步使用语义无关编码器(SIE)对状态向量(输入特征)进行编码。首先,SIE使用核大小为1的一维卷积,该卷积沿不同状态向量中同语义元素维度进行卷积,以编码独立的语义信息。然后,它采用全连接层和非线性激活层来编码异语义元素之间的非线性和交叉依赖信息。为了独立评估MOT中运动估计模块的性能,我们从几个开源MOT数据集构建了一个大规模半模拟数据集。实验结果表明,所提出的SIKNet优于传统KF,并且比现有的学习辅助滤波器具有更好的鲁棒性和准确性。代码可在(https://github.com/SongJgit/filternet 和 https://github.com/SongJgit/TBDTracker)获取。

英文摘要

Motion estimation is a crucial component of multi-object tracking (MOT). It predicts object trajectories by analyzing changes in object positions across consecutive frames, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when the KF's parameters are mismatched and objects move non-stationarily. In this work, we use a learning-aided filter to handle motion estimation in MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) in two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves better robustness and accuracy than existing learning-aided filters. The code is available at https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker.
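
As background for the baseline that SIKNet is compared against, here is a minimal sketch of a Kalman filter with a linear constant-velocity model, reduced to one spatial dimension. It illustrates the classical filter only, not SIKNet; the noise parameters `q` and `r`, the simplified diagonal process noise, and the function names are assumptions for this example.

```python
def kf_predict(x, P, dt=1.0, q=1e-2):
    """Predict step for state x = (position, velocity) with F = [[1, dt], [0, 1]]."""
    p, v = x
    (p00, p01), (p10, p11) = P
    x_pred = (p + dt * v, v)
    # P_pred = F P F^T + q * I (diagonal process noise, a simplification)
    P_pred = (
        (p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11),
        (p10 + dt * p11, p11 + q),
    )
    return x_pred, P_pred


def kf_update(x, P, z, r=1.0):
    """Update step with a position-only measurement z (H = [1, 0])."""
    p, v = x
    (p00, p01), (p10, p11) = P
    innovation = z - p
    s = p00 + r                      # innovation variance
    k0, k1 = p00 / s, p10 / s        # Kalman gain
    x_new = (p + k0 * innovation, v + k1 * innovation)
    # P_new = (I - K H) P
    P_new = (
        ((1 - k0) * p00, (1 - k0) * p01),
        (p10 - k1 * p00, p11 - k1 * p01),
    )
    return x_new, P_new


# Track an object moving at one unit per frame, starting from a poor initial guess
x, P = (0.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for t in range(1, 31):
    x, P = kf_predict(x, P)
    x, P = kf_update(x, P, z=float(t))
```

The fixed `q` and `r` are the parameters that can be mismatched in practice, which is the weakness that learning-aided filters aim to address.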

2508.15130 2026-06-03 cs.CV

HiRQA: Hierarchical Ranking and Quality Alignment for Opinion-Unaware Image Quality Assessment

HiRQA: 面向无意见图像质量评估的层次化排序与质量对齐

Vaishnav Ramesh, Haining Wang, Md Jahidul Islam

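AI summary Proposes the HiRQA framework, which uses hierarchical ranking and contrastive learning for self-supervised no-reference image quality assessment, generalizing to authentic distortions without subjective labels.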

Comments Accepted for publication in Machine Vision and Applications

详情
AI中文摘要

尽管无参考图像质量评估(NR-IQA)取得了显著进展,但数据集偏差和对主观标签的依赖仍阻碍其泛化性能。我们提出HiRQA(层次化排序与质量对齐),一个自监督、无意见的框架,通过结合排序和对比学习提供层次化的质量感知嵌入。与依赖于推理时的原始参考或辅助模态的先前方法不同,HiRQA仅使用输入图像预测质量分数。我们引入了一种新颖的高阶排序损失,通过失真对之间的关系排序来监督质量预测,以及一个嵌入距离损失,强制特征距离与感知差异之间的一致性。由结构化文本提示引导的训练时对比对齐损失进一步增强了学习到的表示。仅在合成图像失真上训练的HiRQA能够泛化到真实退化,通过对各种未见失真(如镜头光晕、雾霾、运动模糊和低光条件)的全面评估得到了证明。为了实时部署,我们引入了HiRQA-S,一个轻量级变体,每张图像的推理时间仅为3.5毫秒。在合成和真实基准上的大量实验验证了HiRQA的竞争性能、强泛化能力和可扩展性。HiRQA模型和推理管道可在https://github.com/uf-robopi/HiRQA获取。

英文摘要

Despite significant progress in no-reference image quality assessment (NR-IQA), dataset biases and reliance on subjective labels continue to hinder their generalization performance. We propose HiRQA (Hierarchical Ranking and Quality Alignment), a self-supervised, opinion-unaware framework that offers a hierarchical, quality-aware embedding through a combination of ranking and contrastive learning. Unlike prior approaches that depend on pristine references or auxiliary modalities at inference time, HiRQA predicts quality scores using only the input image. We introduce a novel higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs, along with an embedding distance loss that enforces consistency between feature distances and perceptual differences. A training-time contrastive alignment loss, guided by structured textual prompts, further enhances the learned representation. Trained only on synthetic image distortions, HiRQA generalizes to authentic degradations, as demonstrated through comprehensive evaluations on various unseen distortions such as lens flare, haze, motion blur, and low-light conditions. For real-time deployment, we introduce HiRQA-S, a lightweight variant with an inference time of only 3.5 ms per image. Extensive experiments across synthetic and authentic benchmarks validate HiRQA's competitive performance, strong generalization ability, and scalability. The HiRQA model and inference pipeline are available at: https://github.com/uf-robopi/HiRQA.