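arXivDaily daily academic digest: synced with the full arXiv data feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.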

2605.20725 2026-05-21 cs.CV

Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label

Jingyang Mao, Ningkang Peng, Yanhui Gu

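AI summary: This paper proposes Holistic Reliability Propagation, which decouples annotation from prediction to improve robustness under noisy labels. Bilevel meta-learning produces two batch-normalized scalars, one for the given label and one for the pseudo-label, and routes these reliabilities to different objectives, raising average accuracy on synthetic and real-world benchmarks.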

2605.20723 2026-05-21 cs.LG

Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds

Lakshani Manamperi, Disumi Pathirana, Thiwanka Pathirana, Nipun Premarathna, Kutila Gunasekera

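AI summary: This paper presents a method for efficient DNN inference on resource-constrained Android devices. Five mechanisms spread memory pressure across multiple devices, enabling ONNX inference without model modification and significantly reducing battery consumption and latency.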

Comments 6 pages, 3 figures, 4 tables. Accepted at the ICML 2026 Workshop on Machine Learning for the Global South

Abstract

Deploying large deep neural networks on memory-constrained mobile devices is a central challenge in edge ML. While compression, pruning, and quantization reduce per-parameter cost, transformer-based models remain too large for the 3.3-7.4 GB RAM envelope of commodity Android handsets. We present the DNN pipeline scheduling subsystem of CROWDio, which achieves practical ONNX inference across resource-constrained Android workers without model modification, by distributing memory pressure across devices via five mechanisms: JIT deferred partition loading, a single-partition-resident constraint, a 4-tier affinity scheduler, a zlib-compressed tensor transport, and a streaming 1:1 dependency model. Evaluated on DistilBERT (Sanh et al., 2019) (approximately 67 M parameters, SST-2) across five Android handsets over ten runs, our system holds peak per-device RSS to 43 ± 2 MB and limits battery draw to 50 ± 3 mAh per run, while streaming concurrency cuts batch latency 34% below barrier synchronisation.
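As an illustration of the zlib-compressed tensor transport mentioned in this abstract, here is a minimal sketch of packing an activation tensor for transfer between workers. The wire layout (a little-endian shape header followed by float32 data) and the names `pack_tensor` and `unpack_tensor` are assumptions made for this example, not CROWDio's actual format.

```python
import struct
import zlib
from array import array


def pack_tensor(values, shape):
    """Serialize float32 values behind a small shape header, then zlib-compress.

    Hypothetical layout: uint32 ndim, ndim x uint32 dims, raw float32 payload.
    """
    payload = array("f", values).tobytes()
    header = struct.pack("<I", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    return zlib.compress(header + payload, 6)


def unpack_tensor(blob):
    """Invert pack_tensor: return (flat list of floats, shape tuple)."""
    raw = zlib.decompress(blob)
    (ndim,) = struct.unpack_from("<I", raw, 0)
    shape = struct.unpack_from(f"<{ndim}I", raw, 4)
    data = array("f")
    data.frombytes(raw[4 + 4 * ndim:])
    return list(data), tuple(shape)
```

How much this saves depends on the data: sparse or repetitive activations compress well, while dense high-entropy float tensors may shrink very little.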

2605.20722 2026-05-21 cs.LG cs.AI

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

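AI summary: This paper proposes AGPO, a critic-free refinement of GRPO that uses group-level statistics to control update magnitude and exploration. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH.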

Abstract

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.

2605.20721 2026-05-21 cs.LG

Robust Recommendation from Noisy Implicit Feedback: A GMM-Weighted Bayes-label Transition Matrix Framework

Zongyu Li, Xuanyu Liu, Gongce Cao, Shirui Sun, Yaqi Fang, Yongshuai Yu

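AI summary: This paper proposes a Robust GMM-weighted Bayes-label Transition Matrix framework (RGBT). It uses a Gaussian mixture model to produce instance-specific reliability scores that systematically calibrate the Bayes-label transition matrix estimate and reduce bias, achieving consistent estimation and a significant reduction in estimation variance while using every sample.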

Abstract

Learning from implicit feedback in recommender systems is fundamentally challenged by pervasive label noise. While conventional denoising approaches often discard noisy instances to ensure robustness, this strategy inevitably suffers from low data utilization. Alternative methods that employ a Bayes-label transition matrix (BLTM) can leverage all available data, but their estimates tend to be biased in practical recommendation scenarios. To address these limitations, this paper proposes a Robust GMM-weighted Bayes-label Transition Matrix framework (RGBT). Our solution utilizes a Gaussian Mixture Model (GMM) to derive instance-specific reliability scores, which systematically calibrate the BLTM estimation to mitigate bias. Theoretical analysis confirms that our approach, by leveraging the BLTM framework with GMM calibration, simultaneously ensures full sample utilization, delivers consistent estimation, and critically, achieves a significant reduction in estimation variance. Extensive experiments on multiple real-world and synthetically flipped datasets demonstrate that RGBT not only utilizes noisy samples more effectively than mainstream reliable sample-based denoising methods, but also achieves significantly superior calibration capability of the transition matrix compared to state-of-the-art transition matrix-based denoising approaches.

2605.20713 2026-05-21 cs.CV cs.AI cs.LG

SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

Miaobo Hu, Shuhao Hu, Bokun Wang, Rui Chen, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

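AI summary: This study proposes the SAVER framework, which uses selective visual evidence to improve multimodal named entity recognition and relation extraction, cutting computational overhead and improving accuracy.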

Abstract

Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper-Pearson upper bounds. When activated, a submodular relevance-diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text-image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.
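To make the Clopper-Pearson step concrete, here is a rough sketch of a one-sided upper bound on an error rate, found by bisection on the binomial CDF, plus a simple threshold search on held-out data. The function names, the `risk` and `alpha` defaults, and the search rule in `calibrate_threshold` are assumptions for illustration. The abstract does not spell out SAVER's exact calibration procedure.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, alpha=0.05):
    """One-sided (1 - alpha) upper confidence bound after k errors in n trials."""
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid  # CDF still above alpha, so the bound lies higher
        else:
            hi = mid
    return hi


def calibrate_threshold(scores, errors, risk=0.2, alpha=0.05):
    """Lowest gate threshold whose activated held-out set has a bounded error rate.

    Hypothetical rule: activate when score >= threshold; errors are 0/1 flags.
    """
    for t in sorted(set(scores)):
        picked = [e for s, e in zip(scores, errors) if s >= t]
        if picked and clopper_pearson_upper(sum(picked), len(picked), alpha) <= risk:
            return t
    return None
```

Scanning several thresholds on the same split weakens the nominal guarantee, so a careful implementation would correct for that or fix the candidate grid in advance.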

2605.20712 2026-05-21 cs.CL cs.AI

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

Kavya Manohar, Arghya Bhattacharya, Kush Juvekar, Kumarmanas Nethil

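AI summary: SCRIBE provides a categorical breakdown of word error rate through sandhi-tolerant alignment and domain vocabulary injection, addressing the shortcomings of conventional word error rate on agglutinative languages, and releases rich transcription models for Hindi, Malayalam, and Kannada.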

Comments Submitted to Interspeech 2026

2605.20708 2026-05-21 cs.CV cs.AI

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

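AI summary: This paper studies cross-layer information flow in Diffusion Transformers. A systematic empirical analysis identifies three concrete symptoms of conventional residual addition, and the proposed Diffusion-Adaptive Routing (DAR) performs learnable, timestep-adaptive, non-incremental aggregation of sublayer outputs to improve model performance.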

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

2605.20696 2026-05-21 cs.LG

Distributed Direct Preference Optimization

Zhanhong Jiang

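AI summary: This paper studies the convergence and time complexity of Direct Preference Optimization (DPO) in distributed settings, analyzes how fragmented preference data affects optimization dynamics in federated and decentralized learning, and proposes robust, scalable methods with theoretical guarantees.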

Comments 29 pages, 12 figures

Abstract

Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Empirically, we corroborate our theoretical insights on standard alignment benchmarks, demonstrating that our proposed methods not only enjoy strong theoretical guarantees but also deliver robust and scalable performance in practice. The code base is available here.
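For orientation, the standard per-pair DPO loss and one federated aggregation round can be sketched as follows. The FedAvg-style size-weighted average is an assumption chosen for illustration, since the abstract does not state which aggregation rule the paper analyzes. The names `dpo_loss` and `fedavg` are hypothetical.

```python
from math import exp, log1p


def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_w / ref_l: the same quantities under the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return log1p(exp(-margin))  # equals -log(sigmoid(margin))


def fedavg(client_params, client_sizes):
    """Size-weighted average of client parameter vectors (assumed aggregation rule)."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]
```

In a federated round, each client would take a few local gradient steps on `dpo_loss` over its own preference pairs before the server averages the results. Those local steps on differing preference distributions are where client drift comes from.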

2605.20693 2026-05-21 cs.CL cs.AI stat.ML

Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

Tong Wang, Yiqing Xu, Leo Yang Yang

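AI summary: This paper proposes an interpretable discriminative text representation method that uses agreement and label disentanglement to keep features interpretable and reproducible. Experiments show that it performs well across multiple text classification tasks and yields clearer, less label-entangled features.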

Abstract

Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's κ, and selects features by residual held-out predictive gain. A stylized analysis connects the κ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human-human and human-LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.
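The chance-adjusted agreement used for screening is Cohen's κ, which can be computed as in the sketch below for two annotators labeling the same items. The function name `cohens_kappa` is illustrative, and how LFD sets its screening cutoff is not stated in the abstract.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Agreement expected by chance if the two annotators labeled independently.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A screen in this spirit would keep a candidate feature only if two LLM annotators applying its definition reach a κ above some chosen cutoff.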

2605.20689 2026-05-21 cs.CL cs.AI cs.IR cs.LG

DIVE: Embedding Compression via Self-Limiting Gradient Updates

Dongfang Zhao

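AI summary: This paper proposes DIVE, which uses a self-limiting triplet loss and a head-wise NT-Xent contrastive loss to address the overfitting that scarce labeled data causes in embedding compression, improving retrieval performance.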

Abstract

High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose DIVE (Dimensionality reduction with Implicit View Ensembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, DIVE outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.
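The self-limiting behavior of a hinge-based triplet loss is easy to see in a small sketch: once the negative is farther than the positive by at least the margin, both the loss and its gradients vanish. The distance-based form and the default margin of 0.2 are assumptions for this example, not values from the paper.

```python
def triplet_hinge(d_pos, d_neg, margin=0.2):
    """Hinge triplet loss on two distances.

    Returns (loss, dloss/dd_pos, dloss/dd_neg) for a single triplet.
    """
    slack = d_pos - d_neg + margin
    if slack <= 0:
        # Margin already satisfied: no loss and no gradient, so no further update.
        return 0.0, 0.0, 0.0
    return slack, 1.0, -1.0
```

Because satisfied triplets contribute nothing, the adapter stops moving embeddings that are already ranked correctly, which is the bounded-perturbation property the abstract describes.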

2605.20684 2026-05-21 cs.CL

Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

Linus Ng Junjia, Ezekiel Tee Kongquan, Kelvin Heng, Kenneth Zhu Ke, Zhao Jing Yuan

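AI summary: This paper proposes a two-phase non-parametric retrieval architecture that targets the gap between retrieved results and decision utility in credit underwriting. It builds a candidate pool by combining lexical and dense multilingual retrieval, then scores passages for utility with an LLM-as-a-Judge mechanism, improving retrieval efficiency and usefulness.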

Abstract

Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.

2605.20682 2026-05-21 cs.CV

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Rongbin Tan, Fangfang Lin, Zhenlong Yuan, Min Qiu, Kejin Cui, Mengmeng Wang, Yi Wang, Zijian Song, Zhiyuan Wang, Jiyuan Wang, Yue Wang, Shuhan Song, Huawei Cao

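AI summary: This paper proposes the IndusAgent framework, which integrates visual observations, high-resolution local patches, and expert normalcy priors to improve zero-shot performance in open-vocabulary industrial anomaly detection, and validates the method's robustness and generalization.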

Abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.

2605.20680 2026-05-21 cs.CV

DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions

Jiaqi Chen, Qinfu Xu, Liyuan Pan

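AI summary: This paper proposes EIS-HAR, a method that combines an event camera with an inertial measurement unit. It reduces motion blur with a non-linear warping module and then extracts spatiotemporal features. It also introduces the DarkShake-DVS benchmark dataset for human action recognition under low light and 6-DoF camera motion.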

Comments 8 pages, 7 figures

Abstract

Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise the reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: the absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and the lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 real-world clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

2605.20678 2026-05-21 cs.LG cs.AI

Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting

Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu

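AI summary: This paper proposes the Dynamic TMoE framework, which dynamically instantiates heterogeneous experts and prunes redundant ones to optimize capacity, and uses a temporal memory router to keep expert selection stable and context-aware, improving performance on non-stationary time series forecasting.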

Comments 27 pages, 7 figures. Accepted to ICML 2026

Abstract

Non-stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during the learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone-07/Dynamic-TMoE.
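As a rough illustration of MMD-based drift detection, the sketch below computes a biased estimate of squared MMD between two windows of scalar observations using an RBF kernel. The kernel choice, the `gamma` value, and the idea of comparing the result against a fixed threshold are assumptions. The abstract does not give Dynamic TMoE's exact detector.

```python
from math import exp


def rbf(x, y, gamma=1.0):
    """RBF kernel on scalars."""
    return exp(-gamma * (x - y) ** 2)


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two 1-D samples."""

    def mean_k(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)
```

A drift-aware loop in this spirit would compare a recent window against a reference window and, when `mmd2` exceeds a chosen threshold, treat it as a regime shift that may warrant a new expert.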

2605.20676 2026-05-21 cs.CV

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

Mozhgan Nasr Azadani, Yimu Wang, Yongpeng Zhu, Lihong Chen, Milan Ganai, Sean Sedwards, Marco Pavone, Krzysztof Czarnecki

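AI summary: This paper proposes the VISTAQA benchmark, which evaluates both free-form answer correctness and pixel-level evidence grounding in visual question answering. Its GROVE metric ties answer correctness to visual evidence alignment. Experiments show that existing systems score poorly under it, revealing a substantial gap between answer accuracy and visual evidence alignment.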

Abstract

Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.
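The effect of a per-sample geometric mean can be sketched as follows. This is not the official GROVE implementation: the name `joint_score`, the [0, 1] score ranges, and the plain average across samples are assumptions, and the handling of hallucination-aware examples is left out.

```python
from math import sqrt


def joint_score(answer_scores, grounding_scores):
    """Average over samples of sqrt(answer_score * grounding_score).

    Both inputs are per-sample scores in [0, 1], for example answer accuracy
    and mask IoU.
    """
    pairs = list(zip(answer_scores, grounding_scores))
    return sum(sqrt(a * g) for a, g in pairs) / len(pairs)
```

A sample with a correct answer but no overlap with the evidence mask contributes zero, which is how the geometric mean keeps one dimension from compensating for the other.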

2605.20674 2026-05-21 cs.LG

Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

Herman Bergström, Aditya Mehrotra, Rahul G. Krishnan

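AI summary: This paper proposes CoMET, a fine-tuning-free multimodal classification method that passes each modality through a frozen pre-trained backbone, compresses the embeddings with PCA, and feeds them to a tabular foundation model for prediction. It shows that PCA is a strong, robust adaptor across modalities, proposes PALPooling to improve representation quality, and achieves state-of-the-art multimodal results without any training.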

Comments 30 pages, 17 figures

Abstract

We introduce CoMET (Composing Modality Encoders with Tabular foundation models), a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and concatenate as input into a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices to act as an adaptor yielding strong, robust performance across modalities. When the CLS tokens of the foundation model align poorly with downstream tasks, we propose PALPooling, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training. On hierarchical tasks with large fine-grained class spaces, our approach enables fast and scalable classification, handling datasets with over 500,000 samples and 2,000 classes without any fine-tuning. Overall, our results show that the composition of foundation models is a simple, yet powerful, out-of-the-box solution for multimodal learning, challenging the necessity of complex, end-to-end training pipelines for new problems.

2605.20669 2026-05-21 cs.CV

GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection

Jiahao Kong

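AI summary: This paper proposes the GSA-YOLO framework, which uses structured sparsity and adaptive knowledge distillation to improve detection robustness and inference efficiency in real-time X-ray security inspection, balancing high accuracy with high efficiency.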

Comments 41 pages, 8 figures, submitted to Scientific Reports

英文摘要

X-ray security inspection requires accurate real-time detection of prohibited items, but existing models often struggle to balance the challenges of severe occlusion, complex clutter, and strict speed requirements. To overcome these challenges, this paper proposes GSA-YOLO, a novel lightweight framework built upon the YOLOv8n architecture, specifically engineered to enhance detection robustness and inference efficiency. GSA-YOLO strategically integrates structured sparsity and adaptive knowledge transfer through three core components: Group Lasso (GL) applied to the network neck for robust feature extraction; Sparse Structure Selection (SSS) applied to the detection head for significant model slimming; and an Adaptive Knowledge Distillation (Ada-KD) mechanism for comprehensive accuracy recovery. This integrated approach synergistically enhances feature representation while pruning redundant channels, maximizing model efficiency without sacrificing performance. Rigorous evaluations on the HiXray and PIDray datasets confirm GSA-YOLO's comprehensive capability, achieving a leading inference speed of 189.62 FPS, accompanied by a reduction in computational cost from 8.7G to 8.0G. Crucially, GSA-YOLO secures mAP50:95 results of 0.531 and 0.679 on HiXray and PIDray, demonstrating 2.4% and 1.8% improvements over the baseline, respectively. Compared to other models, GSA-YOLO exhibits enhanced accuracy while maintaining computational efficiency, making it a promising solution for practical X-ray security inspection.
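
The Group Lasso term named in the abstract can be illustrated with a minimal sketch. It assumes the standard penalty form (a sum of per-group L2 norms scaled by a coefficient) with one group per channel; the function name, the toy weights, and the value of `lam` are hypothetical and are not taken from the paper.

```python
import math

def group_lasso_penalty(groups, lam=0.01):
    """Standard Group Lasso form: lam times the sum of per-group L2 norms."""
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)

# Toy "layer" with three channels (groups); the second channel is all zeros.
channels = [[3.0, 4.0], [0.0, 0.0], [0.6, 0.8]]
penalty = group_lasso_penalty(channels, lam=0.5)

# Channels whose norm has been driven to (near) zero are pruning candidates.
prunable = [i for i, g in enumerate(channels)
            if math.sqrt(sum(w * w for w in g)) < 1e-6]
```

Because the penalty acts on whole groups rather than on individual weights, it tends to zero out entire channels, which is what makes structured pruning possible afterwards.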

2605.20668 2026-05-21 cs.CL cs.AI cs.LG

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

AI summary: Through a large-scale expert annotation study, this paper examines the capabilities and limits of AI reviewers in scientific peer review. It finds that AI reviews perform well on correctness, significance, and sufficiency of evidence but show weaknesses such as limited domain knowledge and inadequate context management, indicating that AI reviewers complement rather than replace human reviewers.

Comments Work in progress

Abstract

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

2605.20667 2026-05-21 cs.CV

LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

Liming Hou, Yueping Peng, Hexiang Hao, Ji Wang, Xuekai Zhang, Wei Tang, Zecong Ye, Xin Ying, Yubo He

AI summary: This work proposes LER-YOLO, a reliability-aware sparse mixture-of-experts method for UAV detection in RGB-infrared remote-sensing pairs. An uncertainty-aware target alignment module and a reliability-guided sparse MoE fusion module improve the reliability of cross-modal interaction.

Comments 17 pages, 6 figures, 8 tables

Abstract

Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7 ± 0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.
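
The selection of k experts described above can be sketched with generic top-k gating. The abstract does not say how the reliability prior enters the gate, so the sketch assumes, purely for illustration, that a scalar reliability in (0, 1] adds its logarithm to the logit of the interactive-fusion expert; `top_k_route`, `route`, and all numbers are hypothetical.

```python
import math

EXPERTS = ["rgb_dominant", "ir_dominant", "interactive_fusion"]

def top_k_route(logits, k):
    """Keep the k largest gate logits and softmax-normalise over them only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def route(base_logits, reliability, k=2):
    # Assumption (not from the paper): low reliability penalises the
    # interactive-fusion expert by adding log(reliability) to its logit.
    logits = list(base_logits)
    logits[2] += math.log(max(reliability, 1e-6))
    return {EXPERTS[i]: w for i, w in top_k_route(logits, k).items()}

aligned = route([0.2, 0.5, 1.0], reliability=0.9)
misaligned = route([0.2, 0.5, 1.0], reliability=0.05)
```

With high reliability the interactive-fusion expert stays among the selected experts; with low reliability it drops out and the two single-modality experts share the weight.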

2605.20666 2026-05-21 cs.RO

A Semantic and Occlusion-Aware GM-PHD Filter

Jovan Menezes, Mark Campbell

AI summary: This paper proposes a new birth model that incorporates semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter.

Comments Accepted at ICRA 2026

Abstract

This paper proposes a new birth model including semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter. Unlike prior approaches that rely on simplistic or uniform assumptions, the proposed Semantic-Occlusion Aware (S-OA) birth model defines initialization terms by explicitly considering regions of occlusion and by leveraging semantic information about the environment. This enables the filter to accurately represent where new objects are more likely to appear, thereby improving tracking performance in complex and high-density driving scenarios. The method is evaluated through Monte Carlo simulations and experiments on the KITTI dataset. Performance is assessed by measuring the latency between first detection and track initiation, along with the mean absolute cardinality error and the Optimal Subpattern Assignment (OSPA) metric. Results demonstrate that the S-OA birth model reduces initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases. A sensitivity analysis of birth model weights is also provided. Overall, the findings underscore the benefits of integrating occlusion reasoning and semantic priors into Bayesian tracking frameworks for autonomous driving.
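
For readers unfamiliar with the OSPA metric used in the evaluation, here is a minimal sketch of its usual definition with cutoff `c` and order `p`. It uses brute-force assignment, which is only practical for very small sets (real implementations typically use the Hungarian algorithm). The default parameter values and the toy points are illustrative assumptions, not the paper's settings.

```python
import math
from itertools import permutations

def ospa(X, Y, c=10.0, p=2):
    """OSPA distance between two finite point sets (brute-force assignment)."""
    if len(X) > len(Y):
        X, Y = Y, X                      # ensure len(X) <= len(Y)
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0                       # both sets empty

    def d(a, b):
        return min(math.dist(a, b), c)   # distance saturated at the cutoff c

    # Best assignment of the m points in X to distinct points in Y.
    best = min(
        sum(d(x, Y[j]) ** p for x, j in zip(X, perm))
        for perm in permutations(range(n), m)
    )
    # Each unmatched point contributes the cutoff penalty c**p.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)

truth = [(0.0, 0.0), (10.0, 0.0)]
estimate = [(0.0, 1.0), (10.0, 0.0)]
localisation_only = ospa(truth, estimate)       # same cardinality
missed_track = ospa([(0.0, 0.0)], truth)        # one object not tracked
```

The second call shows why the metric is sensitive to late track initiation: a missing track is charged the full cutoff, so it dominates small localisation errors.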

2605.20659 2026-05-21 cs.CV cs.LG

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

Yuxi Liu, Zekun Zhang, Yixiang Cai, Renjia Deng, Yutong He, Kun Yuan

AI summary: This study proposes RoPeSLR, a 3D RoPE-based sparse low-rank attention framework that targets the high complexity of long-sequence generation in diffusion transformers. By combining a high-frequency semantic spike set with an extremely low-rank background continuum, it achieves sub-quadratic sparsity and sub-linear rank growth and performs well in ultra-long video inference.

Abstract

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3% average VBench degradation).
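
The "relative-position structure" of RoPE mentioned above is the property that a query-key dot product depends only on the position offset, not on absolute positions. The sketch below checks this for a single 1-D frequency pair; the frequency `theta`, the vectors, and the positions are arbitrary illustration values, and the 3D case used in video models is not reproduced here.

```python
import math

def rope(vec, pos, theta=0.1):
    """Rotate a 2-D vector by the angle pos * theta (one RoPE frequency pair)."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def score(q, k, m, n):
    """Dot product of a query at position m with a key at position n."""
    qm, kn = rope(q, m), rope(k, n)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 2.0), (0.5, -1.5)
s_near = score(q, k, m=3, n=7)        # offset n - m = 4
s_far = score(q, k, m=103, n=107)     # same offset, both positions shifted by 100
```

Both scores agree up to floating-point error because rotating the query by m and the key by n leaves a net rotation of n - m inside the dot product.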

2605.20651 2026-05-21 cs.CV

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

Tuopusen Huang, Ding Ma, Xiangqian Wu

AI summary: This paper proposes LSENet, which introduces three new modules to address the vessel discontinuities and loss of detail caused by low local contrast in OCTA vessel segmentation. Experiments show state-of-the-art performance on several public datasets with fewer parameters.

Abstract

Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture. However, most of these methods focus only on holistic representation and struggle with the low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core modules. To address vessel discontinuities, we introduce the Patch Information Enhance (PIE) module, which replaces standard skip connections and performs patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion (MFF) module feeds the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) refines features from all levels and uses a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that LSENet achieves state-of-the-art performance while requiring fewer parameters.

2605.20648 2026-05-21 cs.RO cs.AI

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

Benedict Quartey, Sebastian Castro, Eric Rosen, Wil Thomason, George Konidaris, Stefanie Tellex

AI summary: This paper proposes skills that jointly learn predicates and actions. Using closed-loop visuomotor policies, robots can compose skills zero-shot without retraining.

Abstract

Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

2605.20645 2026-05-21 cs.CV

Seeing Through Fog: Towards Fog-Invariant Action Recognition

Enqi Liu, Liyuan Pan, Zhi Gao, Lingzhi Li, Qing Li

AI summary: This paper presents the FogAct benchmark dataset and the FogNet model to address action recognition in foggy conditions. FogNet is a two-stream CLIP model that extracts fog-invariant semantic information, improving recognition performance in fog.

Abstract

Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. FogAct and FogNet are available on our project page.

2605.20644 2026-05-21 cs.LG cs.AI cs.RO

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

Caicheng Wang, Zili Wang, Shuyou Zhang, Yongzhe Xiang, Zheyi Li, Liangyou Li, Jianrong Tan

AI summary: This paper proposes a manufacturability knowledge-integrated reinforcement learning framework for optimizing free-form pipe routing in aeroengines. Encoding manufacturing knowledge as constraints improves the manufacturability and geometric smoothness of the pipe paths.

Abstract

Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current practices in pipe routing remain largely decoupled from down-stream manufacturing, leading to labor-intensive, trial-and-error iterations to achieve manufacturable designs. To address this problem, this study proposes the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates the routing problem as a boundary value problem in the Frenet frame. In this framework, the pipe path is represented by curvature and torsion profiles, which are generated using cubic Hermite interpolation. To integrate design and manufacturing, domain-specific manufacturing knowledge is embedded as constraints on the permissible ranges of curvature and torsion. The path optimization is performed using the proximal policy optimization algorithm with stochastic exploration and a stage-guided reward mechanism. A unified mapping formulation then translates the optimized path into motion trajectories for the bending die, enabling direct fabrication on a six-axis free-bending machine. Experimental results demonstrate that FPRO consistently generates collision-free, manufacturable paths with smoother geometric profiles compared to Cartesian-based methods. It also achieves faster convergence and superior performance in terminal alignment, path length, obstacle avoidance, and manufacturability compared to state-of-the-art reinforcement learning baselines. Real-world validation confirms the close geometric correspondence between the manufactured pipe and its digital design, validating the practical feasibility of FPRO.
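
The idea of generating a curvature profile by cubic Hermite interpolation and keeping it inside a permissible range can be sketched as follows. The Hermite basis is standard; everything else is an assumption for illustration: the abstract does not say how the range constraint is enforced, so simple clipping is used here, and `curvature_profile`, `k_max`, and the numbers are hypothetical.

```python
def hermite(p0, p1, m0, m1, t):
    """Cubic Hermite interpolation on t in [0, 1] from endpoint values and tangents."""
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1
    h10 = t3 - 2 * t2 + t
    h01 = -2 * t3 + 3 * t2
    h11 = t3 - t2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

def curvature_profile(k_start, k_end, slope_start, slope_end, k_max, samples=5):
    """Sample a curvature profile and clip it to the range [-k_max, k_max]."""
    profile = []
    for i in range(samples):
        t = i / (samples - 1)
        k = hermite(k_start, k_end, slope_start, slope_end, t)
        profile.append(max(-k_max, min(k_max, k)))   # assumed enforcement: clipping
    return profile

# Illustrative values only; k_max stands in for a bending-machine limit.
profile = curvature_profile(0.0, 0.02, 0.2, 0.0, k_max=0.03)
```

The endpoints reproduce the boundary curvatures exactly, while interior samples that would overshoot the limit are held at `k_max`. A torsion profile could be generated the same way.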

2605.20643 2026-05-21 cs.LG cs.AI cs.CL

AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

AI summary: This paper proposes AVSD, an adaptive-view self-distillation method that balances cross-view consensus with teacher-specific privileged signals, addressing the teacher-student information asymmetry and the choice of privileged information in self-distillation.

Comments Code: https://github.com/duykhuongnguyen/AVSD

Abstract

Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

2605.20642 2026-05-21 cs.LG

Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions

Mirerfan Gheibi, Gashin Ghazizadeh

AI summary: This paper compares hard and soft labels for annotator distributions and finds that, when each example has only a few annotations, hard-label methods outperform soft-label training, with larger gains where the sparse empirical target is farther from the full annotator distribution.

Comments 14 pages, 12 figures. Accepted to the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML @ ICML 2026)

Abstract

When annotators disagree, that disagreement can reflect epistemic uncertainty rather than simple label noise. We study hard-label delivery as an alternative to the usual choices of collapsing votes to a single label or training directly on the empirical soft-label distribution. We focus on two primary hard-label methods: multipass, which cycles through observed votes while keeping the dataset size fixed, and stochastic label sampling (SLS), which samples one label per example at the start of each epoch. On CIFAR-10H, we find that when only a small number of annotations per example is available, hard-label delivery improves over soft-label training, with larger improvements where the sparse empirical target is farther from the full annotator distribution. When full annotator distributions are available, both hard-label methods match soft-label training. We use deterministic control as an ablation of multipass and shuffled SLS as a control that breaks the example-to-distribution match. We also show that SLS and soft-label cross-entropy optimize the same expected objective. Hard-label delivery also converges to flatter basins, with supporting descriptive evidence from OOD detection on SVHN and CIFAR-100. Overall, these results suggest that multipass is a strong practical default when raw vote counts are available, while SLS offers a lightweight alternative that remains competitive when only a few votes per example are available and matches soft-label training when full annotator distributions are available.
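
The statement that SLS and soft-label cross-entropy optimize the same expected objective follows from linearity of expectation: if a label y is drawn with probability p[y], the expected hard-label loss is the sum of p[y] * (-log q[y]), which is exactly the soft-label cross-entropy. The sketch below checks this numerically, assuming that SLS draws labels from the empirical vote distribution; the vote counts and model probabilities are made-up illustration values.

```python
import math

def cross_entropy(target, pred):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(q) for t, q in zip(target, pred) if t > 0)

votes = [6, 3, 1]                       # hypothetical annotator votes for 3 classes
p = [v / sum(votes) for v in votes]     # empirical soft label
q = [0.5, 0.3, 0.2]                     # hypothetical model prediction

soft_loss = cross_entropy(p, q)

# Expected loss when one hard label y is sampled with probability p[y].
expected_sls_loss = sum(
    p[y] * cross_entropy([1.0 if j == y else 0.0 for j in range(len(q))], q)
    for y in range(len(q))
)
```

The two losses agree in expectation only. A single sampled label still gives a noisier gradient than the soft target, which is one reason the two training schemes can behave differently in practice.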

2605.20640 2026-05-21 cs.CV cs.AI

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

Yunlong Wang, Jinjin Shi, Wenbin Gao, Xuran Xu, Runyu Shi, Ying Huang

AI summary: This paper proposes a feature supervision method for Multimodal Diffusion Transformers (MM-DiT). A lightweight cross-modal alignment mechanism implicitly extracts multi-granularity vision-aligned text representations to improve text-image alignment, photorealism, and aesthetic quality, yielding joint improvements along the Pareto frontier.

Abstract

Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.

2605.20630 2026-05-21 cs.AI

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

Alimurtaza Mustafa Merchant, Krish Veera, Sajal Kumar Goyla, Shambhawi Bhure, Dhaval Patel, Kaoutar El Maghraoui

AI summary: This paper studies temporal semantic caching and workflow optimization in agentic plan-execute pipelines, proposes two complementary optimization layers to improve efficiency, and demonstrates their effect on industrial asset operations workflows.

Comments 13 pages, 8 figures, 3 appendices

Abstract

Industrial asset operations workflows are latency-sensitive because a single user query may require coordination over sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline exposes repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques such as KV-cache reuse and embedding-based semantic caching were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache and a set of MCP workflow optimizations combining disk-backed tool-discovery caching and dependency-aware parallel step execution. MCP workflow optimizations corresponded to a 1.67x speedup and reduced median end-to-end latency by about 40.0% while the temporal-cache benchmark achieved a median of 30.6x speedup on cache hits. Beyond the speedup, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, providing a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.
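
The failure mode of purely semantic caching described above (similar wording, different time or asset parameters) can be illustrated with a toy cache that returns a stored answer only when the query text is similar and the structured parameters match exactly. This is a sketch under stated assumptions, not the paper's design: token-set Jaccard similarity stands in for an embedding model, and the class name, parameter keys, threshold, and stored values are hypothetical.

```python
import re

def tokens(text):
    """Lower-cased alphanumeric tokens; a stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

class TemporalSemanticCache:
    """Toy cache: reuse an answer only if text is similar AND parameters match."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []                # list of (token set, params, answer)

    def put(self, query, params, answer):
        self.entries.append((tokens(query), dict(params), answer))

    def get(self, query, params):
        q = tokens(query)
        for toks, stored_params, answer in self.entries:
            if stored_params == params and jaccard(q, toks) >= self.threshold:
                return answer
        return None                      # miss: caller must run the full pipeline

cache = TemporalSemanticCache()
cache.put("mean temperature of chiller 6 last week",
          {"asset": "chiller-6", "window": "W10"}, "ANSWER_W10")

hit = cache.get("Mean temperature of chiller 6 last week?",
                {"asset": "chiller-6", "window": "W10"})
stale = cache.get("mean temperature of chiller 6 last week",
                  {"asset": "chiller-6", "window": "W11"})
```

The second lookup uses identical wording but a different time window, so it misses. A cache keyed on text similarity alone would have returned the outdated answer.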

2605.20626 2026-05-21 cs.CL cs.AI cs.CV

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

AI summary: This paper proposes a retrieval-augmented long-context translation approach to cultural image captioning. A two-stage pipeline generates a Spanish intermediate caption and then uses retrieval-augmented many-shot prompting to produce the target-language caption, substantially improving captioning performance for Bribri, Guaraní, and Orizaba Nahuatl and winning the shared task.

Abstract

We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.
