arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.19619 2026-05-20 cs.LG cs.AI math.OC stat.ML

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

Feihu Huang, Yuning Luo, Songcan Chen

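AI summary: This paper studies the generalization error of the Muon optimizer and proposes MiMuon, an improved mixed Muon optimizer that is proved to have a lower generalization error while keeping the same convergence rate as Muon.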

Comments 25 pages

Abstract

Matrix-structured parameters appear frequently in many artificial intelligence models, such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) have not yet been established. In this paper, we study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is the training sample size, $T$ denotes the number of iterations, and $\kappa>0$ denotes the minimum difference between the singular values of the gradient estimate. To enhance the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer that uses gradient orthogonalization cautiously and is a hybrid of the Muon and momentum-based SGD optimizers. We then prove that our MiMuon optimizer has a lower generalization error of $O\big(\frac{1}{N}\big)$ than the $O\big(\frac{1}{N\kappa^{T}}\big)$ of the Muon optimizer, since $\kappa$ is generally very small. We also study the convergence properties of our MiMuon algorithm and prove that it has the same convergence rate of $O\big(\frac{1}{T^{1/4}}\big)$ as the Muon algorithm. Numerical experiments on training large models, including Qwen3-0.6B and YOLO26m, demonstrate the efficiency of the MiMuon optimizer.
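To get a feel for why the $\kappa^{T}$ factor matters, the two bounds can be compared numerically with all constants dropped. This is only an illustrative sketch: the values of `n`, `t`, and `kappa` are assumptions chosen for the example, not figures from the paper, and the function names are hypothetical.

```python
def muon_bound(n: int, kappa: float, t: int) -> float:
    """Order of the Muon bound 1 / (N * kappa**T), constants omitted."""
    return 1.0 / (n * kappa ** t)


def mimuon_bound(n: int) -> float:
    """Order of the MiMuon bound 1 / N, constants omitted."""
    return 1.0 / n


# Hypothetical values: 10,000 samples, 10 iterations, singular-value gap 0.5.
n, t, kappa = 10_000, 10, 0.5
ratio = muon_bound(n, kappa, t) / mimuon_bound(n)
print(ratio)  # equals 1 / kappa**t up to floating-point rounding
```

For any $\kappa < 1$ the ratio is $\kappa^{-T}$, so the gap between the two bounds grows exponentially with the number of iterations.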

2605.19618 2026-05-20 cs.LG stat.ME

A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees

Massimo Aria, Agostino Gnasso, Carmela Iorio

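AI summary: This paper proposes a divergence-based framework of measures for evaluating the reconstruction quality of explainable ensemble trees. By distinguishing agreement from association, it offers a new diagnostic for pinpointing why a reconstruction fails.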

Abstract

Validating interpretable surrogate models for ensemble learners requires measuring agreement between the ensemble's internal representation and its surrogate approximation, rather than mere association. Correlation-based approaches are scale-invariant and fail to detect systematic discrepancies in co-occurrence structure. We propose a statistical framework grounded in the agreement-association distinction, centered on the normalized Loss of Interpretability (nLoI). Rooted in the Cressie-Read power divergence family with lambda equal to 2, the nLoI admits a closed-form decomposition into within-node and between-node components, providing a unique diagnostic capability to identify precisely where and why reconstruction fails. The framework incorporates four complementary measures capturing distinct structural facets of approximation quality. A unified permutation testing procedure delivers valid inference for all measures within a single resampling pass. Theoretical properties, including boundedness and symmetry, are established for each metric. Monte Carlo simulations and empirical evaluations confirm exact Type I error control and demonstrate that these measures detect reconstruction fidelity gradients invisible to correlation-based alternatives. The framework is developed and illustrated in the context of Explainable Ensemble Trees (E2Tree), and empirical evaluation on three benchmark datasets illustrates the practical utility of the framework.
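For readers unfamiliar with the Cressie-Read family mentioned above, here is a minimal sketch of the power divergence in its standard textbook form. How the nLoI normalizes and decomposes this quantity is not spelled out in the abstract, so the code shows only the underlying divergence; the function name `cressie_read` and the sample counts are assumptions made for illustration.

```python
def cressie_read(observed, expected, lam=2.0):
    """Cressie-Read power divergence between observed and expected counts.

    Standard form: 2 / (lam * (lam + 1)) * sum(o * ((o / e) ** lam - 1)).
    lam = 1 recovers Pearson's chi-square statistic.
    """
    if lam in (0.0, -1.0):
        raise ValueError("lam = 0 and lam = -1 are defined only as limits")
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))
    return 2.0 / (lam * (lam + 1.0)) * total


# Hypothetical co-occurrence counts with equal totals.
observed = [10.0, 20.0, 30.0]
expected = [20.0, 20.0, 20.0]
pearson = cressie_read(observed, expected, lam=1.0)
lam_two = cressie_read(observed, expected, lam=2.0)
```

Unlike a correlation, this quantity changes when the observed counts are rescaled relative to the expected ones, which is the agreement-versus-association distinction the abstract draws.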

2605.19613 2026-05-20 cs.CV

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

Shuwei Li, Lei Tan, Robby T. Tan

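AI summary: This paper proposes VLM-CC, a framework that achieves cross-camera color constancy through iterative, feedback-driven refinement guided by vision-language model evaluation, replacing direct RGB regression with perceptual feedback to improve robustness.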

Comments In CVPR 2026

Abstract

Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at https://github.com/NothingIknow/VLM-CC.
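The feedback loop described above can be sketched roughly as follows. This is a toy under strong assumptions: the image is reduced to three channel means, the LoRA-tuned VLM is replaced by a hypothetical `mock_judge` that simply reports the brightest channel, and the multiplicative update rule and step size are invented for illustration rather than taken from the paper.

```python
CHANNELS = ("red", "green", "blue")


def white_balance(channel_means, illuminant):
    """Divide each channel mean by the current illuminant estimate."""
    return [m / g for m, g in zip(channel_means, illuminant)]


def mock_judge(balanced, tol=0.01):
    """Hypothetical stand-in for the VLM: name the dominant residual cast, or None."""
    avg = sum(balanced) / 3.0
    deviations = [b / avg - 1.0 for b in balanced]
    k = max(range(3), key=lambda i: deviations[i])
    return CHANNELS[k] if deviations[k] >= tol else None


def refine_illuminant(channel_means, step=0.005, max_iter=1000):
    """Nudge the illuminant estimate along the reported direction until no cast remains."""
    illuminant = [1.0, 1.0, 1.0]
    for _ in range(max_iter):
        cast = mock_judge(white_balance(channel_means, illuminant))
        if cast is None:
            break
        illuminant[CHANNELS.index(cast)] *= 1.0 + step
    return illuminant


# A gray scene lit by a reddish illuminant (hypothetical channel means).
scene = [0.6, 0.5, 0.4]
estimate = refine_illuminant(scene)
balanced = white_balance(scene, estimate)
```

The point of the sketch is the control flow: the estimator never regresses RGB values directly, it only receives a qualitative direction per iteration.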

2605.19607 2026-05-20 cs.CV cs.AI cs.LG

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

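AI summary: This paper proposes Spectral Integrated Gradients (SIG), which builds integration paths from a singular value decomposition to reduce noise and improve feature-attribution quality, outperforming traditional path-based methods.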

Comments 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix

Abstract

Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients (SIG), which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at https://github.com/leekwoon/sig/.
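A rough sketch of an SVD-ordered integration path, assuming a single 2-D input and NumPy, is given below. It is not the authors' implementation: the function names, the fixed number of steps per component, and the midpoint Riemann sum are all assumptions, and the toy model is a quadratic with a known gradient so that completeness can be checked by hand.

```python
import numpy as np


def spectral_path(baseline, x, steps_per_component=8):
    """Yield (midpoint, delta) segments that add the singular components of
    x - baseline one at a time, largest singular value first."""
    u, s, vt = np.linalg.svd(x - baseline, full_matrices=False)
    current = baseline.astype(float).copy()
    for k in range(len(s)):  # np.linalg.svd returns s in descending order
        delta = s[k] * np.outer(u[:, k], vt[k]) / steps_per_component
        for _ in range(steps_per_component):
            yield current + 0.5 * delta, delta
            current = current + delta


def path_attributions(grad_fn, baseline, x, steps_per_component=8):
    """Accumulate gradient * path increment along the spectral path."""
    attributions = np.zeros_like(x, dtype=float)
    for midpoint, delta in spectral_path(baseline, x, steps_per_component):
        attributions += grad_fn(midpoint) * delta
    return attributions


# Toy model f(X) = sum(X**2), whose gradient is 2 * X.
x = np.array([[3.0, 1.0], [1.0, 2.0]])
baseline = np.zeros_like(x)
attributions = path_attributions(lambda m: 2.0 * m, baseline, x)
completeness_gap = attributions.sum() - (np.sum(x ** 2) - np.sum(baseline ** 2))
```

Because the path still starts at the baseline and ends at the input, the attributions sum to the difference in model output, as with the straight-line path; only the order in which structure is introduced changes.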

2605.19605 2026-05-20 cs.CV

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

Ayushi Sharma, Clemens Mosig, Lukas Drees, Salim Soltani, Janusch Vajna-Jehle, Aaron Sheppard, Belqis Ahmadi, Jonathan Schmid, Paul Neumeier, Nathan Jacobs, Jan Dirk Wegner, Teja Kattenborn

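AI summary: This paper introduces two new open datasets for joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery, addressing the lack of harmonized global datasets and delivering notable performance gains across multiple biomes.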

Comments Preprint. Under review. All rights reserved

Abstract

Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at deadtrees.earth/releases/dte-aerial-bench.
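The relative improvement quoted for boreal forests follows directly from the two F1 scores given in the abstract:

```latex
\[
\frac{0.58 - 0.40}{0.40} = \frac{0.18}{0.40} = 0.45 = 45\%
\]
```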

2605.19604 2026-05-20 cs.AI

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan, Yilun Yao, Feiyu Wang, Yanshu Wang, Dingsiyi, Tong Yang

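AI summary: This paper proposes Formal Skill, a programmable runtime skill abstraction for LLM agents that improves agent efficiency and accuracy through JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state.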

Abstract

Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.
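To make the ingredients listed above concrete, here is a minimal sketch of a skill that combines a JSON manifest, a Python executor, a hook policy, and skill-local state. It is not FairyClaw's API: every name (`MANIFEST`, `Skill`, `submit`, `forbid_after_done`) and the manifest layout are hypothetical and chosen only to illustrate the idea.

```python
import json
from dataclasses import dataclass, field
from typing import Any

# Hypothetical skill manifest: metadata plus an action schema.
MANIFEST = json.loads("""
{
  "name": "submit_report",
  "actions": {"submit": {"required": ["path"]}}
}
""")


def submit(args: dict, state: dict) -> str:
    """Executor: perform the action and advance the skill-local state."""
    state["phase"] = "done"
    return f"submitted {args['path']}"


def forbid_after_done(action: str, args: dict, state: dict) -> None:
    """Hook policy: enforce completion discipline in code, not in prompt text."""
    if state["phase"] == "done":
        raise PermissionError(f"{action} is not allowed after completion")


@dataclass
class Skill:
    manifest: dict
    executors: dict
    hooks: list
    state: dict = field(default_factory=lambda: {"phase": "idle"})

    def invoke(self, action: str, args: dict) -> Any:
        # Validate arguments against the schema, run hooks, then execute.
        schema = self.manifest["actions"][action]
        missing = [k for k in schema["required"] if k not in args]
        if missing:
            raise ValueError(f"missing arguments: {missing}")
        for hook in self.hooks:
            hook(action, args, self.state)
        return self.executors[action](args, self.state)


skill = Skill(MANIFEST, {"submit": submit}, [forbid_after_done])
first = skill.invoke("submit", {"path": "report.md"})
```

A second `skill.invoke("submit", ...)` call would be rejected by the hook, so the "submit only once" rule costs no prompt tokens and cannot be ignored by the model.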

2605.19600 2026-05-20 cs.RO

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

Jinhan Li, Xijie Huang, Zhaoqi Wang, Yijin Wang, Weiqi Ge, Qiyi He, Mo Zhu, Fei Gao, Yuze Wu, Xin Zhou

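AI summary: This paper proposes FlyMirage, a fully automated pipeline that uses a generative world model to produce large-scale, diverse, and photorealistic UAV vision-language navigation data in support of next-generation embodied navigation models.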

Abstract

In the field of Vision-Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real-world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages a large language model (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large-scale, diverse, and photorealistic aerial VLN dataset, with dynamically feasible flying trajectories, designed to support the development of next-generation embodied navigation models.

2605.19597 2026-05-20 cs.CL

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang

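AI summary: This paper presents LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Natural-language items and their reference formalizations are forward-authored and expert-audited, annotated answers are verified with Z3, expert rubrics are constructed for natural-to-formal grading, and selected items are hardened through a closed-loop adversarial workflow. The benchmark contains 246 Base items and 190 Hard items, and an evaluation of 14 frontier LLMs shows substantial gaps in current models.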

Abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
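The idea of checking an annotated answer against its formalization can be illustrated with a small entailment test. The benchmark uses Z3 for this; the sketch below deliberately swaps in a brute-force truth-table search over propositional variables so that it runs without any solver installed. The item, the variable names, and the `entails` helper are all made up for the example.

```python
from itertools import product


def entails(variables, premises, conclusion):
    """Return True if every assignment satisfying all premises also satisfies
    the conclusion, i.e. premises AND NOT conclusion is unsatisfiable."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a countermodel
    return True


# Hypothetical item: "If it rains, the ground is wet. It rains."
variables = ["rain", "wet"]
rule = lambda env: (not env["rain"]) or env["wet"]
fact = lambda env: env["rain"]
goal = lambda env: env["wet"]

verified = entails(variables, [rule, fact], goal)
unsupported = entails(variables, [rule], goal)
```

An annotated answer of "the ground is wet" would pass this check with both premises and fail it with the rule alone, which is the kind of mismatch a solver-backed audit is meant to catch.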

2605.19595 2026-05-20 cs.CV cs.AI

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

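AI summary: This paper proposes an optimized YOLO26-MoE model that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector to adapt to subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameters are optimized by an LLM agent, and the model reaches 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95 on UAV images, outperforming the latest YOLO versions.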

Abstract

The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

2605.19594 2026-05-20 cs.RO

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

Jingyu Li, Zhe Liu, Wenxiao Wu, Li Zhang

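AI summary: This paper proposes MCNav, a memory-aware navigation framework with a dynamic cognitive map that stores efficiently queryable information about relevant objects in explored areas. It addresses missed or misidentified targets in zero-shot goal-oriented navigation through goal re-validation and missed-goal re-exploration, stabilized by blacklist and double-check mechanisms, and achieves state-of-the-art performance.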

Abstract

Navigating to instance-level targets in complex environments is a challenging problem. Many existing zero-shot methods achieve strong performance by modeling the entire environment and leveraging large language models for scene understanding. However, such strategies primarily focus on exploring new regions while lacking a deeper exploitation of information from previously explored areas. Consequently, when targets are missed or misidentified within previously visited regions, navigation failures occur frequently. To address these limitations, we propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this memory structure, MCNav introduces two memory-aware exploration strategies: goal re-validation, which re-assesses previously seen objects to correct matching failures, and missed goal re-exploration, which estimates the likelihood that a target is present in an explored region from contextual cues. These strategies are further stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. We evaluate MCNav on the HM3Dv1 and HM3Dv2 datasets across three different tasks, where it achieves state-of-the-art performance, particularly on the instance-level goal navigation task.

2605.19593 2026-05-20 cs.AI cs.DC

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

Mert Yildiz, Pietro Spadaccino, Alexey Rolich, Francesca Cuomo, Andrea Baiocchi

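AI summary: This empirical study examines how different LLMs behave across hardware platforms, focusing on the performance impact of layer offloading and preemption. It shows that offloading degrades decode throughput non-linearly, that preemption overhead varies across models and hardware platforms, and it offers guidance for designing efficient multi-model LLM serving systems.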

Comments The 2026 Mediterranean Artificial Intelligence and Networking Conference (MAIN 2026)

Abstract

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.

2605.19592 2026-05-20 cs.RO cs.AI

Implicit Action Chunking for Smooth Continuous Control

Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun, Yuankai Wu, Huachun Tan, Yong Wang

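AI summary: This paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Its dual-window design ensures physical smoothness and consistent temporal-difference targets without expanding the action space, avoiding the optimization difficulties of explicit action chunking and its incompatibility with standard step-wise interaction.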

Abstract

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control and safer behavior with reduced jitter, and attains a 100% success rate.
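A regularizer based on first-order action differences might look like the sketch below. The abstract does not give the exact norm or weighting, so the mean squared difference used here, and the function name `action_smoothness_penalty`, are assumptions for illustration only.

```python
def action_smoothness_penalty(actions):
    """Mean squared first-order difference between consecutive actions.

    `actions` is a sequence of equal-length action vectors; the squared L2
    norm is an assumed choice, not necessarily the one used in DWS.
    """
    diffs = [
        sum((cur_i - prev_i) ** 2 for prev_i, cur_i in zip(prev, cur))
        for prev, cur in zip(actions, actions[1:])
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical 1-D control signals: one oscillating, one slowly ramping.
jittery = action_smoothness_penalty([[1.0], [-1.0], [1.0], [-1.0]])
smooth = action_smoothness_penalty([[0.0], [0.1], [0.2]])
```

Adding a term like this to the actor loss penalizes the oscillating signal far more than the ramp, while the policy still outputs a single action per step.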

2605.19589 2026-05-20 cs.LG physics.flu-dyn

Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments

Takshak Shende, Viktor Popov

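AI summary: This paper proposes a physics-informed graph neural network surrogate for predicting turbulent nanoparticle dispersion in dental clinical environments, combining an improved graph network with physical models to increase computational efficiency and accuracy.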

Comments 40 pages, 12 figures

Abstract

Dental aerosol procedures produce sub-50 micrometre nuclei that can remain airborne for long periods in enclosed clinics, creating pathways for airborne pathogen transmission. Reynolds-Averaged Navier-Stokes (RANS) simulations with Euler-Lagrange particle tracking capture this transport accurately but require very long run times per scenario, which precludes real-time clinical decision support in 3D. We present the Eulerian-Lagrangian Graph Interaction Network (ELGIN), a physics-informed graph surrogate that jointly predicts carrier-flow dynamics on the OpenFOAM polyhedral mesh and the per-parcel motion of the polydisperse spray cloud. ELGIN couples a multi-head Graph Transformer with Jacobi-preconditioned learnable pressure projection and a turbulence-closure head to a sigmoid-gated Lagrangian Interaction Network through differentiable inverse-distance mesh-parcel coupling, and advances parcels with a symplectic Stormer-Verlet integrator. A four-stage physics-informed curriculum stabilises 260-step autoregressive rollouts without gradient explosion. A parameter sweep with foam-extend 4.1 OpenFOAM reactingParcelFoam across clinically relevant ventilation rates and handpiece spray speeds provides CFD ground truth. This article reports a single-case demonstration in which both ELGIN and a Lagrangian-only baseline (M0) are trained and evaluated on Sweep_Case_03 of a twenty-case sweep; full 16/2/2 retraining is in progress and will replace all reported metrics. On this case, ELGIN tracks the foam-extend particle cloud much more closely than M0: mean parcel displacement error falls from 19.56% to 16.20% of room width and cloud radius-of-gyration error from 9.85% to 6.58%. A 26-second rollout completes in ~64 s on a 4 GB GPU, approximately 37x faster than the foam-extend reference pipeline, toward per-appointment infection-risk screening once the multi-case checkpoint is in place.
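For reference, a Störmer-Verlet step in its common velocity-Verlet form can be sketched as follows. This is the generic integrator for a position-dependent acceleration, not ELGIN's learned update: in ELGIN the acceleration would come from the network, and velocity-dependent forces such as drag are ignored here. The harmonic-oscillator test case and the step size are assumptions; only the 260-step rollout length is borrowed from the abstract.

```python
def stormer_verlet_step(x, v, accel, dt):
    """Advance position x and velocity v by one step of size dt.

    `accel` maps a position to an acceleration (force per unit mass).
    """
    v_half = v + 0.5 * dt * accel(x)          # half kick
    x_new = x + dt * v_half                   # drift
    v_new = v_half + 0.5 * dt * accel(x_new)  # half kick
    return x_new, v_new


def energy(x, v):
    """Total energy of a unit-mass, unit-stiffness harmonic oscillator."""
    return 0.5 * (x * x + v * v)


x, v = 1.0, 0.0
initial_energy = energy(x, v)
for _ in range(260):  # same number of steps as the rollout length above
    x, v = stormer_verlet_step(x, v, lambda pos: -pos, dt=0.01)
energy_drift = abs(energy(x, v) - initial_energy)
```

The energy error stays bounded rather than growing step by step, which is the usual reason for preferring a symplectic scheme in long autoregressive rollouts.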

2605.19587 2026-05-20 cs.AI

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng

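AI summary: This paper proposes SceneCode, which generates editable indoor scenes through executable programs, addressing the limited control over object structure in existing methods and improving the precision and interactivity of scene generation.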

Abstract

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner--designer--critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.

2605.19584 2026-05-20 cs.LG stat.ML

Online Market Making and the Value of Observing the Order Book

在线市场做市与观察订单簿的价值

Davide Maran, Marcello Restelli

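AI summary: This paper studies an online market-making problem in which a learner sequentially posts bid and ask prices while interacting with traders who hold private valuations. Unlike existing online learning formulations that assume fully censored feedback, it introduces an action-dependent feedback model inspired by real limit order books, and shows that this additional information fundamentally changes the learnability of the problem. In the stochastic setting, it proposes an elimination-based algorithm that achieves O(√T) regret with high probability, without any smoothness assumption on the distribution of trader valuations. The result is then extended to a broad class of mean-reverting price processes, covering both local autoregressive dynamics and a weaker global drift condition based on cumulative deviations from the mean. Under either assumption, high-probability O(√T) regret bounds are established, relying on a new concentration inequality of independent interest. Finally, in the adversarial setting, an explore-then-perturb algorithm guarantees O(T^{2/3}) regret in expectation.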

Comments Accepted at COLT 2026

详情
AI中文摘要

我们研究了一个在线做市问题,其中学习者针对单一资产依次发布买入价和卖出价,并与持有私人估值的交易者交互。与假设反馈完全删失(censored)的现有在线学习建模不同,我们引入了受真实限价订单簿启发的动作依赖反馈模型:当发生交易时,交易者的估值保持隐藏,而当没有发生交易时,会揭示关于供给和需求的有用反馈。我们证明,这种额外信息从根本上改变了问题的可学习性。在市场价格独立同分布的随机设置中,我们提出了一种基于消除的算法,以高概率达到O(√T)的遗憾,且无需对交易者估值分布的光滑性做任何假设。然后我们将这一结果扩展到一大类均值回归价格过程,既考虑局部的自回归动态,也考虑基于相对均值的累积偏差的较弱全局漂移条件。在任一假设下,我们都建立了高概率O(√T)的遗憾界,其依赖于一个具有独立价值的新集中不等式。最后,在价格为非适应性(oblivious)的对抗性设置中,我们设计了一种先探索后扰动的算法,保证期望遗憾为O(T^{2/3})。我们的结果量化了在线做市中观察订单簿的价值,并表明与标准的老虎机(bandit)反馈模型相比,即使是有限的动作依赖反馈也能显著改善遗憾保证。

英文摘要

We study an online market-making problem in which a learner sequentially posts bid and ask prices for a single asset while interacting with traders holding private valuations. Unlike existing online learning formulations that assume fully censored feedback, we introduce an action-dependent feedback model inspired by real limit order books: when a trade occurs, the trader's valuation remains hidden, whereas when no trade occurs, informative feedback about supply and demand is revealed. We show that this additional information fundamentally changes the learnability of the problem. In the stochastic setting with i.i.d. market prices, we propose an elimination-based algorithm that achieves $O(\sqrt T)$ regret with high probability, without requiring any smoothness assumptions on the distribution of trader valuations. We then extend this result to a broad class of mean-reverting price processes by considering both local, autoregressive dynamics and a weaker global drift condition based on cumulative deviations from the mean. Under either assumption, we establish high-probability $O(\sqrt T)$ regret bounds, relying on a new concentration inequality of independent interest. Finally, in the adversarial setting with oblivious prices, we design an explore-then-perturb algorithm that guarantees $O(T^{2/3})$ regret in expectation. Our results quantify the value of observing the order book in online market making and demonstrate that even limited, action-dependent feedback can substantially improve regret guarantees compared to standard bandit feedback models.

2605.19580 2026-05-20 cs.RO

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

PAPO-VLA: 为视觉-语言-动作模型进行规划感知的策略优化

Peizheng Guo, Jingyao Wang, Changwen Zheng, Wenwen Qiang

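AI summary: This paper proposes PAPO-VLA, a planning-aware policy optimization method for vision-language-action models that identifies planning actions and emphasizes them during optimization to improve the reliability of VLA policies.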

详情
AI中文摘要

视觉-语言-动作(VLA)模型在语言引导的机器人任务中展现出有前途的能力。然而,使VLA策略可靠仍然具有挑战性,因为一个操作任务是通过闭环交互完成的,其中每个动作都会影响后续的执行。为了分析这个问题,我们重新审视VLA策略在执行过程中的作用,并认为VLA策略同时扮演着规划者和执行者两个角色:规划者做出任务导向的决策以改变执行方向,而执行者通过密集的连续动作来实现这些决策。这种观点表明,提高VLA可靠性需要特别关注规划动作。现有的优化方法可以模仿动作或改进完整的轨迹,但通常不明确识别规划动作或衡量其对任务成功的重要性。为了解决这个问题,我们提出了PAPO-VLA,即针对VLA模型的规划感知策略优化方法。PAPO-VLA首先通过联合考虑动作变化和轨迹结果来识别规划动作,然后通过因果充分性和因果必要性估计其重要性,并最终将这种重要性纳入GRPO优势估计中。这样,更重要规划动作会受到更强的优化关注,同时整个轨迹仍然通过轨迹级反馈进行优化。在多个基准上的实验展示了PAPO-VLA的有效性。

英文摘要

Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.

2605.19577 2026-05-20 cs.CL

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

GoLongRL: 以能力为导向的长上下文强化学习与多任务对齐

Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li

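AI summary: This paper presents GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, which leads to homogeneous task coverage and reward formulations that do not adequately reflect practical long-context requirements. The contributions are: (1) Capability-oriented data construction with a full open release, comprising a dataset of 23,000 RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It combines curated open-source samples from established corpora with synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, this dataset alone outperforms the closed-source QwenLong-L1.5 dataset, and a Qwen3-30B-A3B model trained on it reaches long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability. (2) TMN-Reweight for heterogeneous multitask optimization. To address the optimization challenges posed by heterogeneous rewards, TMN-Reweight combines task-level mean normalization, which aligns reward scales across tasks, with difficulty-adaptive weighting, which yields more reliable advantage estimates. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across the reported evaluations.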

详情
AI中文摘要

我们提出了GoLongRL,一种完全开源、以能力为导向的长上下文强化学习后训练配方,用于可验证奖励(RLVR)。现有长上下文强化学习方法往往将数据构建视为设计越来越复杂的检索路径,导致任务覆盖同质化和奖励形式无法充分反映实际长上下文需求。我们的工作提供了两个贡献。(1)以能力为导向的数据构建并完全开源。我们公开发布了一个包含23,000个RLVR样本的数据集、完整的构建流程和所有训练代码。基于长上下文能力的分类学,数据集涵盖9种任务类型,每种任务类型都配有其自然评估指标。它包含从现有语料库中精心挑选的开源样本和合成样本,其问答对是从真实源文档如书籍、学术论文和多轮对话中生成的。在相同的 vanilla GRPO 设置下,我们的数据集单独优于闭源的 QwenLong-L1.5 数据集。此外,我们的 Qwen3-30B-A3B 模型在该数据上训练后,长上下文性能与 DeepSeek-R1-0528 和 Qwen3-235B-A22B-Thinking-2507 相当,表明更广泛的覆盖和更大的奖励多样性显著有助于长上下文能力的提升。(2)TMN-Reweight 用于异构多任务优化。为了解决异构奖励带来的优化挑战,我们提出了 TMN-Reweight,它结合了任务层面的均值归一化以实现跨任务奖励尺度对齐,以及难度自适应加权以获得更可靠的优势估计。TMN-Reweight 进一步在 vanilla GRPO 上提高了平均性能,在报告的评估中,通用能力得以保持或提升。

英文摘要

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.

2605.19576 2026-05-20 cs.AI cs.CL cs.SE

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

库漂移:在自我演化的LLM技能库中诊断和修复一种无声的失败模式

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

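AI summary: This paper studies library drift, a silent failure mode in self-evolving LLM skill libraries. Through a reproducible trigger experiment, trace-level diagnostics, and a verified fix, it shows that unbounded skill accumulation leads to retrieval degradation, false-positive injections, and performance stagnation, and it proposes a validated governance recipe that raises held-out pass@1 on MBPP+ hard-100 from a 0.258 baseline to a late-window mean of 0.584.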

详情
英文摘要

Self-evolving skill libraries face a silent failure mode we term "library drift": unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom (LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp; SkillsBench), yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.

2605.19568 2026-05-20 cs.CL

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

m3BERT: 一种现代的、多语言的、俄罗斯套娃双向编码器

Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su

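AI summary: This paper proposes m3BERT, a modern, multilingual Matryoshka bidirectional encoder. By jointly optimizing representations across transformer layers and multiple embedding dimensions, it addresses the poor adaptability of existing pretrained models to different deployment scenarios and demonstrates efficiency and practical value in industrial retrieval.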

Comments KDD 2026

详情
AI中文摘要

嵌入模型在工业信息检索系统中至关重要,如搜索和广告。然而,现有预训练模型通常具有固定架构和嵌入维度,这在适应具有不同业务驱动约束的多样化部署场景时带来了显著挑战。一种常见做法是在资源受限任务中通过部分参数初始化从更大预训练模型进行微调。这种方法往往效果不佳,因为预训练和下游使用之间的不匹配阻碍了预训练优势的完全实现。为了解决这一限制,我们引入了m3BERT:一种现代的、多语言的、俄罗斯套娃双向编码器,其特征是新颖的预训练策略,联合优化Transformer层和多个嵌入维度的表示。这使得单个模型能够针对不同的资源和准确率目标进行定制,同时保持与预训练的一致性。结合最近的架构改进,m3BERT采用三阶段预训练:单语预训练、多语适应以服务多样化用户群体,以及在大规模网络领域语料库上进行关键的持续预训练以增强商业检索中的实用性。m3BERT在Bing-Click大型工业检索数据集上显著优于现有最先进的嵌入模型,展示了其作为高效基础的实用性和适应性,用于资源感知的工业检索系统。进一步在公共数据集上的实验也证实了我们多粒度俄罗斯套娃预训练策略的通用有效性。

英文摘要

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.

2605.19562 2026-05-20 cs.RO cs.LG math.OC

Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

面向空地协同交接任务的学习加速优化轨迹规划

Jingshan Chen, Bochen Yu, Henrik Ebel, Peter Eberhard

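AI summary: This paper proposes a learning-augmented trajectory planning framework for handover missions between cooperating unmanned aerial and ground vehicles. Decoupled encoder-decoder LSTM networks predict coordinated handover trajectories that warm-start the optimizer, yielding faster convergence and a higher optimization success rate.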

Comments Preprint of a contribution accepted for publication in the RoManSy 2026 Springer proceedings

详情
AI中文摘要

本文提出了一种学习增强的轨迹规划框架,用于无人机(UAV)与无人地面车辆(UGV)协同的交接任务。尽管集中式轨迹优化能够确保动力学可行性和任务最优性,但其高计算成本限制了实时应用。我们提出了一种神经代理规划器,利用解耦的编码器-解码器长短期记忆(LSTM)网络,根据任务规范生成协调的交接轨迹预测。这些预测为下游的集中式优化器提供有信息量的热启动,从而加速收敛到动力学可行的解。基准评估显示,与冷启动优化相比,该学习增强的规划框架实现了三倍以上的加速和100%的优化成功率。结果表明,将数据驱动的推断与基于模型的细化相结合,能够为异构多机器人系统提供快速且可靠的轨迹生成。

英文摘要

This paper presents a learning-augmented trajectory planning framework for cooperative unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) handover missions. While centralized trajectory optimization ensures dynamic feasibility and task optimality, its high computational cost limits real-time applicability. We propose a neural surrogate planner utilizing decoupled encoder-decoder long short-term memory (LSTM) networks to generate coordinated handover trajectory predictions from the task specifications. These predictions serve as informed warm starts for the downstream centralized optimizer, thereby accelerating convergence to dynamically feasible solutions. Benchmark evaluations demonstrate that the learning-augmented planning framework achieves more than a threefold speedup and 100% optimization success rate compared to cold start optimization. The results indicate that combining data-driven inference with model-based refinement enables fast and reliable trajectory generation for heterogeneous multi-robot systems.
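The effect of an informed warm start on an iterative optimizer can be illustrated with a toy sketch. This is not the paper's planner: plain gradient descent on a quadratic stands in for the centralized trajectory optimizer, and a slightly perturbed copy of the optimum stands in for the LSTM prediction. The function name, step size, and all numbers are illustrative assumptions.

```python
import numpy as np


def gradient_descent(target, x0, lr=0.1, tol=1e-6, max_iter=10_000):
    """Minimize ||x - target||^2 and return (solution, iterations used)."""
    x = np.asarray(x0, dtype=float).copy()
    for iteration in range(max_iter):
        grad = 2.0 * (x - target)
        if np.linalg.norm(grad) < tol:
            return x, iteration
        x -= lr * grad
    return x, max_iter


# Hypothetical "optimal trajectory parameters" for one mission.
target = np.array([3.0, -2.0, 5.0])

# Cold start: an uninformed initial guess.
_, cold_iters = gradient_descent(target, np.zeros(3))

# Warm start: a prediction that is already close to the optimum
# (stand-in for the output of a learned surrogate planner).
_, warm_iters = gradient_descent(target, target + 0.01)

print(f"cold start: {cold_iters} iterations, warm start: {warm_iters} iterations")
```

Both runs reach the same solution. The warm start only changes how far the optimizer has to travel, which is the mechanism the abstract relies on.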

2605.19561 2026-05-20 cs.LG cs.AI

TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization

TORQ:MXFP4量化中的两级正交旋转

Zukang Xu, Xing Hu, Dawei Yang

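AI summary: This paper proposes TORQ, a framework that reshapes the geometry of the activation space through optimized coordinate transformations to address the accuracy degradation of MXFP4 activation quantization, substantially improving quantization accuracy.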

Comments 17 pages, 4 figures, 13 tables

详情
AI中文摘要

随着大型语言模型(LLMs)向实际部署迈进,微缩FP4(MXFP4)格式已成为下一代低比特推断的基石,因其在高动态范围与硬件效率之间的平衡能力。然而,直接将MXFP4应用于LLM激活量化不可避免地导致显著的精度下降。在本文中,我们从理论上分析MXFP4激活量化的误差结构,揭示出性能下降的根本原因在于激活分布与MXFP4块浮点格式之间的两个结构性不平衡:(1)极端块间方差不平衡和(2)块内代码书利用不平衡。为了解决这些挑战,我们提出了TORQ(MXFP4量化中的两级正交旋转),一种无训练的后训练量化(PTQ)框架,通过最优坐标变换重塑激活空间的几何属性。在宏观层面,TORQ利用Schur-Horn定理通过块间正交旋转重新分配激活能量,防止高方差块驱动共享缩放因子,从而保留小幅度元素的精度。在微观层面,TORQ采用最大熵引导的块内旋转以缓解代码书坍塌并最大化MXFP4代码书的信息容量。在主流LLM如LLaMA3和Qwen3上的实验表明,与现有方法相比,TORQ显著提高了MXFP4激活量化的准确性:在Qwen3-32B上,WikiText的困惑度降低到8.43(相比BF16的7.61),平均准确率从直接RTN的38.40%增加到73.63%(相比BF16的74.82%),大幅缩小了4位浮点量化与全精度推断之间的差距。

英文摘要

As Large Language Models (LLMs) advance toward practical deployment, the Microscaling FP4 (MXFP4) format has emerged as a cornerstone for next-generation low-bit inference, owing to its ability to balance high dynamic range with hardware efficiency. However, directly applying MXFP4 to LLM activation quantization inevitably leads to significant accuracy degradation. In this paper, we theoretically analyze the error structure of MXFP4 activation quantization, revealing that the root cause of this performance drop lies in two structural imbalances between activation distributions and the MXFP4 block floating-point format: (1) extreme inter-block variance imbalance and (2) intra-block codebook utilization imbalance. To address these challenges, we propose TORQ (Two-level Orthogonal Rotation for MXFP4 Quantization), a training-free Post-Training Quantization (PTQ) framework designed to reshape the geometric properties of the activation space through optimal coordinate transformations. At the macroscopic level, TORQ leverages the Schur-Horn theorem to redistribute activation energy via inter-block orthogonal rotation, preventing high-variance blocks from driving up shared scaling factors and thereby preserving the precision of small-magnitude elements. At the microscopic level, TORQ employs maximum-entropy-guided intra-block rotation to alleviate codebook collapse and maximize the MXFP4 codebook's information capacity. Experiments on mainstream LLMs such as LLaMA3 and Qwen3 show that TORQ significantly improves the accuracy of MXFP4 activation quantization compared to existing methods: on Qwen3-32B, the perplexity on WikiText is reduced to 8.43 (vs. 7.61 for BF16), and the average accuracy increases from 38.40% with direct RTN to 73.63% (vs. 74.82% for BF16), substantially narrowing the gap between 4-bit floating-point quantization and full-precision inference.
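For readers unfamiliar with the format, the following sketch shows MXFP4-style fake quantization (32-element blocks, one shared power-of-two scale, 4-bit E2M1 elements) and where an orthogonal rotation would be applied around it. It is only an illustration under stated assumptions: a fixed normalized Hadamard matrix stands in for a rotation, whereas TORQ's inter-block and intra-block rotations are optimized and are not reproduced here. The function names are hypothetical.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MXFP4 shares one power-of-two scale across 32 elements


def quantize_mxfp4(x):
    """Fake-quantize a 1-D array whose length is a multiple of BLOCK."""
    blocks = np.asarray(x, dtype=float).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)  # avoid log2(0) for all-zero blocks
    # Shared scale: exponent of the block maximum minus the largest E2M1
    # exponent (6.0 = 1.5 * 2**2), so the maximum lands near the top of the grid.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = blocks / scale
    # Round every magnitude to the nearest grid point (values above 6 saturate).
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(np.shape(x))


def hadamard(n):
    """Normalized Sylvester Hadamard matrix (orthogonal); n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


rng = np.random.default_rng(0)
x = rng.standard_normal(BLOCK)
x[3] = 40.0  # one large activation outlier dominates the shared scale

h = hadamard(BLOCK)
err_plain = np.linalg.norm(quantize_mxfp4(x) - x)
# Rotate, quantize in the rotated basis, then rotate back with the transpose.
err_rotated = np.linalg.norm(h.T @ quantize_mxfp4(h @ x) - x)
print(f"plain error: {err_plain:.3f}, rotated error: {err_rotated:.3f}")
```

Whether the rotated error is lower depends on the input and on the rotation chosen. A fixed rotation can also make things worse, which is one reason to select rotations deliberately rather than arbitrarily.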

2605.19559 2026-05-20 cs.CV cs.AI

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

EgoCoT-Bench: 面向MLLM的有依据、可验证、以操作为中心的思维链推理基准

Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang

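AI summary: This paper introduces EgoCoT-Bench, a benchmark for evaluating the fine-grained, operation-centric reasoning of MLLMs from a first-person perspective. It contains 3,172 verifiable QA pairs covering perception and retrospection, anticipation, and high-level reasoning, and targets the limited support for fine-grained reasoning and evidence verification in existing benchmarks.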

详情
AI中文摘要

多模态大语言模型(MLLMs)的快速发展引发了对第一人称视频理解的广泛关注,特别是MLLMs识别细粒度手-物体交互、跟踪物体状态随时间的变化,以及从第一人称视角推理动态环境中操作过程的能力。然而,现有的第一人称视频基准对有依据(grounded)的推理依据评估有限:它们对细粒度、以操作为中心的推理支持不足,也很少检查模型给出的推理依据是否建立在显式的时空证据之上。为了弥补这一差距,我们引入了EgoCoT-Bench,一个细粒度的第一人称基准,面向有依据且可验证的、以操作为中心的推理,并带有显式的逐步推理标注。总体而言,EgoCoT-Bench包含覆盖351个第一人称视频的3172个可验证问答对,分为四个任务组、共12个子任务组,涵盖感知与回顾、预测和高层推理。该基准通过时空场景图(STSG)引导的生成框架构建,并由人工标注者进一步完善,以确保正确性、第一人称相关性和细粒度质量。实验结果表明,第一人称细粒度推理仍然困难,并进一步揭示许多多模态模型给出的解释虽然答案正确,但所依据的证据与答案不一致。我们希望EgoCoT-Bench能为第一人称视频理解中有依据且可验证的推理提供有用的测试平台。项目页面和补充材料见:https://dstardust.github.io/EgoCoT/

英文摘要

The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability of MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce EgoCoT-Bench, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos, organized into four task groups with a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graph (STSG)-guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance, and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations whose answers are correct but whose cited evidence is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.

2605.19556 2026-05-20 cs.CV

EpiDiffVO: Geometry-Aware Epipolar Diffusion for Robust Visual Odometry

EpiDiffVO: 面向鲁棒视觉里程计的几何感知对极扩散

Prateeth Rao

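AI summary: This paper proposes a sparse epipolar matching framework that reduces correspondence redundancy by optimizing for geometric consistency, and combines an epipolar diffusion process with a graph neural network for visual odometry.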

Comments 8 pages, 5 figures, in revision to be submitted to IEEE RA-L

详情
AI中文摘要

从图像对中估计相对位姿,本质上只需要一个由几何一致的对应点构成的最小子集。然而,大多数基于学习的方法依赖于密集匹配或直接回归,导致冗余并降低几何可解释性。在本工作中,我们提出了一种稀疏对极匹配框架,预测一组紧凑的对应点,并针对不同时间基线下的几何一致性进行优化。为了处理残余噪声和未对齐问题,我们引入了对极扩散过程,该过程对对应点的不确定性建模,并将关键点向对极一致的方向细化。细化后的对应点与深度线索一起被提升为图表示,形成一个编码点间关系结构的Steiner图。图神经网络从中学习出一个紧凑的、信息量大的对应点子集,并将其传递给可微的奇异值分解求解器,进行端到端的几何估计。相对位姿从所得的本质矩阵中恢复,并在TartanAir和KITTI SLAM数据集上以视觉里程计的设置进行评估。实验结果表明,将稀疏匹配、基于扩散的细化和基于图的子集选择相结合,可以减少对应点的冗余,同时在具有挑战性的基线下保持鲁棒的位姿估计。

英文摘要

Estimating relative pose from image pairs fundamentally requires only a minimal subset of geometrically consistent correspondences. However, most learning-based approaches rely on dense matching or direct regression, leading to redundancy and reduced geometric interpretability. In this work, we propose a sparse epipolar matching framework that predicts a compact set of correspondences optimized for geometric consistency across varying temporal baselines. To address residual noise and misalignment, we introduce an epipolar diffusion process that models correspondence uncertainty and refines keypoints toward epipolar consistency. The refined correspondences, along with depth cues, are lifted into a graph representation forming a Steiner graph that encodes relational structure between points. A graph neural network learns a compact subset of informative correspondences, which are passed to a differentiable singular value decomposition solver for end-to-end geometric estimation. Relative pose is recovered from the resulting essential matrix and evaluated in a visual odometry setting on the TartanAir and KITTI SLAM datasets. Experimental results demonstrate that combining sparse matching, diffusion-based refinement, and graph-based subset selection reduces correspondence redundancy while maintaining robust pose estimation across challenging baselines.
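The last stage mentioned in the abstract, recovering an essential matrix from a small set of correspondences with an SVD, can be sketched with the classical linear eight-point method. This is the textbook algorithm in plain NumPy, offered as an assumption about the general shape of such a solver. It is not the paper's differentiable implementation, and the function names and synthetic camera motion are hypothetical.

```python
import numpy as np


def essential_from_correspondences(x1, x2):
    """Linear eight-point estimate of E from (N, 2) normalized image points, N >= 8."""
    n = x1.shape[0]
    h1 = np.hstack([x1, np.ones((n, 1))])
    h2 = np.hstack([x2, np.ones((n, 1))])
    # Each correspondence gives one linear equation x2^T E x1 = 0
    # in the nine entries of E (row-major order).
    a = np.einsum("ni,nj->nij", h2, h1).reshape(n, 9)
    _, _, vt = np.linalg.svd(a)
    e = vt[-1].reshape(3, 3)  # right singular vector of the smallest singular value
    # Project onto the essential manifold: singular values (1, 1, 0).
    u, _, vt_e = np.linalg.svd(e)
    return u @ np.diag([1.0, 1.0, 0.0]) @ vt_e


def epipolar_residuals(e, x1, x2):
    """Algebraic epipolar error x2^T E x1 for every correspondence."""
    n = x1.shape[0]
    h1 = np.hstack([x1, np.ones((n, 1))])
    h2 = np.hstack([x2, np.ones((n, 1))])
    return np.einsum("ni,ij,nj->n", h2, e, h1)


# Synthetic, noise-free scene: 20 points seen from two camera poses.
rng = np.random.default_rng(0)
points = rng.uniform([-2.0, -2.0, 4.0], [2.0, 2.0, 8.0], size=(20, 3))
angle = 0.1
rotation = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(angle), 0.0, np.cos(angle)]])
translation = np.array([1.0, 0.0, 0.2])
points_cam2 = points @ rotation.T + translation

x1 = points[:, :2] / points[:, 2:]
x2 = points_cam2[:, :2] / points_cam2[:, 2:]

e_est = essential_from_correspondences(x1, x2)
print("max |x2^T E x1|:", np.abs(epipolar_residuals(e_est, x1, x2)).max())
```

With noisy correspondences the residuals no longer vanish, which is where selecting a compact, geometrically consistent subset (as the abstract describes) starts to matter.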

2605.19554 2026-05-20 cs.CV

Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting

基于语义感知空间加权的自创文本到物体生成

Yue Yu, Haibo Chen, Shuo Chen, Jian Yang, Jun Li

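AI summary: This paper proposes SCDiff, a self-creative diffusion model that uses a learnable spatial weighting module and a visual-semantic mixing loss to improve the creativity and semantic alignment of text-to-image generation.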

详情
AI中文摘要

在文本到图像(T2I)生成中注入创造力是一个重大挑战,因为合成图像不仅要具有视觉新颖性和惊喜,还应具有艺术价值。然而,当前T2I模型主要优化于字面文本-图像对齐,其噪声预测网络限制生成到高概率区域,导致生成结果缺乏真实创造力。为此,我们提出了一种自创扩散(SCDiff)模型,用于有意义的T2I生成,包含两个核心模块:可学习的空间加权(LSW)模块和视觉-语义混合损失(VSML)。LSW模块设计了一个参数化的Kaiser-Bessel窗,以强化中心图像特征,促进新颖和令人惊讶的生成。VSML模块引入了双重损失函数:相似性损失约束新图像与文本描述对齐,而多样性损失最大化其与原始图像的区别,从而增强语义价值和视觉新颖性。大量实验表明,我们的模型显著提高了创造力、语义对齐性和视觉一致性,提供了一个简单但强大的框架用于生成创意物体。

英文摘要

Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain generation to high-probability regions, so their outputs lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generation, featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains the new image to align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.
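A Kaiser-Bessel window that emphasizes the center of a feature map can be sketched as follows. The abstract does not say how SCDiff parameterizes or learns the window, so this is only an assumed, non-learnable illustration: a separable 2-D window built from `numpy.kaiser`, with the shape parameter `beta` fixed by hand where the paper would learn it. The function names are hypothetical.

```python
import numpy as np


def kaiser_weight_map(height, width, beta):
    """Separable 2-D Kaiser-Bessel window; larger beta concentrates weight at the center."""
    return np.outer(np.kaiser(height, beta), np.kaiser(width, beta))


def apply_spatial_weight(features, beta):
    """Weight a (C, H, W) feature map so that central positions are emphasized."""
    _, height, width = features.shape
    return features * kaiser_weight_map(height, width, beta)[None, :, :]


features = np.ones((3, 5, 5))            # toy feature map
weighted = apply_spatial_weight(features, beta=4.0)
print(np.round(weighted[0], 3))          # weights peak at the center, decay toward the border
```

Setting `beta=0.0` gives a flat window (all weights equal to one), so the parameter interpolates between no spatial preference and a strong central focus.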

2605.19541 2026-05-20 cs.SD

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

利用强化学习优化神经语音编解码器用于300bps通信

Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang

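AI summary: This paper proposes ClariCodec, a neural speech codec operating at 300 bps that treats quantisation as a stochastic policy and uses reinforcement learning to optimise intelligibility, reducing word error rate at extreme compression levels.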

详情
AI中文摘要

在卫星和水下信道等带宽受限的通信中,语音往往需要以超低比特率传输,此时可懂度是首要目标。在如此极端的压缩水平下,使用声学重建损失训练的编解码器倾向于将比特分配给感知细节,导致词错误率(WER)显著恶化。本文提出了ClariCodec,一种工作在300比特每秒(bps)的神经语音编解码器,它将量化重新表述为随机策略,从而可以通过强化学习(RL)优化可懂度。具体来说,编码器使用由WER驱动的奖励进行微调,而声学重建流程保持冻结。即使不使用强化学习,ClariCodec在300 bps下于LibriSpeech test-clean集上也达到了4.64%的WER,已经可与工作在更高比特率的编解码器相竞争。进一步的强化学习微调将WER降低到test-clean上的3.55%和test-other上的10.4%,相当于23%的相对降低,同时保持感知质量。

英文摘要

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.
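Treating quantisation as a stochastic policy can be pictured with a minimal sketch. The abstract does not specify ClariCodec's policy or estimator, so the code below assumes a softmax over negative squared distances to the codebook and a plain REINFORCE (score-function) surrogate with a baseline. The codebook, temperature, and reward values are hypothetical. The final line only re-checks the arithmetic of the reported relative WER reduction on test-clean.

```python
import numpy as np


def code_policy(latent, codebook, temperature=1.0):
    """Softmax distribution over codebook entries; closer codes get higher probability."""
    logits = -np.sum((codebook - latent) ** 2, axis=1) / temperature
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def reinforce_loss(log_prob, reward, baseline):
    """Surrogate loss whose gradient favors codes with reward above the baseline."""
    return -(reward - baseline) * log_prob


# Toy codebook with 8 entries in a 4-dimensional latent space.
codebook = np.vstack([np.eye(4), -np.eye(4)])
latent = codebook[2] + 0.05  # encoder output close to code 2

probs = code_policy(latent, codebook)
rng = np.random.default_rng(0)
index = rng.choice(len(codebook), p=probs)  # sample instead of taking the argmin

# Reward = negative WER of the decoded utterance (hypothetical values).
reward, baseline = -0.04, -0.05
loss = reinforce_loss(np.log(probs[index]), reward, baseline)

# Reported numbers: 4.64% WER before RL, 3.55% after, on test-clean.
relative_reduction = (4.64 - 3.55) / 4.64  # about 0.235, i.e. roughly 23%
print(index, float(loss), round(relative_reduction, 3))
```

Sampling rather than always picking the nearest code is what makes the log-probability, and therefore a reward-weighted gradient, available to the encoder.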

2605.19539 2026-05-20 cs.CV

Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

信任它还是不信任它:Trust3R用于前馈3D重建的证据不确定性

Zihao Zhu, Wenyuan Zhao, Nuo Chen, Chao Tian, Zhiwen Fan

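AI summary: This paper proposes Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. It combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head to produce per-point pointmap uncertainty estimates, improving the accuracy and reliability of the reconstructed geometry.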

Comments Accepted at ICML 2026. 10 pages main paper, with appendix

详情
AI中文摘要

几何基础模型有希望从未经校准的图像中进行无约束的密集几何预测。然而,在当前的前馈设计中,其预测的置信度分数是启发式的,缺乏概率解释,且通常无法指示预测几何的可信区域和程度。为解决这一差距,我们提出了Trust3R,一种轻量级的证据不确定性框架用于前馈3D重建。Trust3R结合了门控残差均值细化和正态-逆 Wishart 证据头,生成每一点的几何不确定性的闭合形式多元学生t分布。这种设计在提供概率基础的点云不确定性估计的同时,增加了适度的推断开销。我们在多样化的室内和室外基准上进行了评估,并与MASt3R内置的置信度图以及跨越单次通过异方差回归和基于采样的方法(如MC dropout和深度集合)的常见不确定性感知基线进行了比较。实验结果表明,Trust3R在风险覆盖和稀疏化方面表现一致,并且在几何准确性方面总体有所提高。这些收益体现在跨基准的更强的不确定性排名上,ScanNet++上AURC降低了25%,AUSE降低了41%,为不确定性感知加权在下游几何管道中提供了实用的可靠性信号。项目页面和代码可在https://trust3r-z.github.io/上找到。

英文摘要

Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images. However, in current feed-forward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To address this gap, we present Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks and compare against MASt3R's built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk-coverage and sparsification, and generally improves geometric accuracy. These gains are reflected in stronger uncertainty ranking across benchmarks, with 25% lower AURC and 41% lower AUSE on ScanNet++, providing a practical reliability signal for uncertainty-aware weighting in downstream geometry pipelines. The project page and code are available at https://trust3r-z.github.io/.
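The closed-form Student-t mentioned above is a standard consequence of the Normal-Inverse-Wishart prior. As a reminder, and assuming the usual parameterization (the paper's exact parameterization and the symbols below are not given in the abstract), if the evidential head outputs NIW parameters $(\boldsymbol{\mu}, \kappa, \boldsymbol{\Psi}, \nu)$ for a point $\mathbf{x} \in \mathbb{R}^d$ with $d = 3$, then marginalizing the unknown mean and covariance gives:

```latex
\begin{aligned}
(\mathbf{m}, \boldsymbol{\Sigma}) &\sim \mathrm{NIW}(\boldsymbol{\mu}, \kappa, \boldsymbol{\Psi}, \nu),
\qquad \mathbf{x} \mid \mathbf{m}, \boldsymbol{\Sigma} \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma}), \\
p(\mathbf{x}) &= t_{\nu - d + 1}\!\left(\mathbf{x} \,\middle|\, \boldsymbol{\mu},\;
\frac{\kappa + 1}{\kappa\,(\nu - d + 1)}\,\boldsymbol{\Psi}\right), \\
\operatorname{Cov}[\mathbf{x}] &= \frac{\kappa + 1}{\kappa\,(\nu - d - 1)}\,\boldsymbol{\Psi}
= \underbrace{\frac{\boldsymbol{\Psi}}{\nu - d - 1}}_{\mathbb{E}[\boldsymbol{\Sigma}]}
+ \underbrace{\frac{1}{\kappa}\,\frac{\boldsymbol{\Psi}}{\nu - d - 1}}_{\operatorname{Cov}[\mathbf{m}]},
\qquad \nu > d + 1 .
\end{aligned}
```

Under this reading, the predictive covariance splits into an expected data-noise term and a term for uncertainty about the mean that shrinks as the evidence parameter $\kappa$ grows. This gives the per-point uncertainty a probabilistic interpretation that a heuristic confidence score lacks.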

2605.19538 2026-05-20 cs.CV cs.AI

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

CaptchaMind: 通过强化学习与显式推理监督训练CAPTCHA求解器

Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng, Guanhua Chen, Baotian Hu, Longyue Wang, Weihua Luo

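AI summary: This paper proposes CaptchaMind, a reinforcement-learning-based CAPTCHA solver trained with explicit reasoning process supervision. It achieves an 82.9% average success rate across eight tasks, substantially outperforming existing methods.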

Comments 17 pages, 12 figures

详情
AI中文摘要

CAPTCHAs被广泛部署作为人类验证机制,经常阻止智能代理在现实网络环境中完成端到端自动化。解决现代CAPTCHAs需要稳健的多步骤视觉推理和交互能力,但基于训练的方法由于缺乏大规模训练数据和过程级注释而一直缺席。我们介绍了CaptchaBench,第一个支持大规模训练的CAPTCHA基准,包含16,000个程序生成的样本,覆盖八个任务类别,并带有详细的区域和过程级注释。系统评估表明,现有方法在需要精细视觉细节捕获和区域级比较的任务上表现一致失败。因此,我们提出了CaptchaMind,一种基于强化学习的求解器,通过显式推理过程监督训练,实现了82.9%的平均成功率,跨八个任务和71.0%在现实实例上的表现,显著优于所有现有方法,无需闭源API。

英文摘要

CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning process supervision, achieving 82.9% average success rate across eight tasks and 71.0% on real-world instances, substantially outperforming all existing methods without closed-source APIs.

2605.19533 2026-05-20 cs.CV

Replacement Learning: Training Neural Networks with Fewer Parameters

替代学习:用更少的参数训练神经网络

Yuming Zhang, Peizhe Wang, Tianyang Han, Hengyu Shi, Junhao Su, Dongzhi Guan, Jiabin Liu, Jiaji Wang

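AI summary: This paper proposes Replacement Learning (RepL), which replaces selected blocks of a neural network instead of deleting them to reduce the redundancy of full-depth backpropagation, lowering parameter count, memory usage, and training time while maintaining performance.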

Comments 16 pages

详情
AI中文摘要

端到端训练结合全深度反向传播仍然是优化深度神经网络的主要范式,但随着模型变深,其效率会下降。由于每个块必须在单一全局目标下执行和微分,全深度反向传播引入了显著的参数冗余、激活-内存成本和训练延迟,尤其是在相邻层具有高度相关学习模式时。直接跳过或删除层可以降低成本,但通常会削弱表示能力或需要特定架构的重用设计。在本文中,我们提出了替代学习(RepL),一种训练时的范式,通过替换选定的块而不是简单地删除它们来减少全深度冗余。对于每个被移除的块,RepL插入一个轻量级计算层,通过可学习的转换从其相邻前序和后序块的参数合成一个替代操作符,并将该合成操作符应用于前序激活。这样,RepL在保持局部上下文连续性的同时避免了不必要的全层计算。我们为CNNs和ViTs实例化RepL,使用定制化的参数融合块来处理卷积通道、特征分辨率和Transformer子模块。在CIFAR-10、SVHN、STL-10、ImageNet、COCO和CityScapes等数据集上的广泛实验表明,RepL在减少可训练参数、GPU内存使用和训练时间的同时,在分类、检测和分割任务中与标准端到端训练相匹配或超越。此外,在WikiText-2、迁移学习、推理吞吐量、检查点、随机深度和INT8量化等额外结果中进一步展示了其通用性和兼容性。

英文摘要

End-to-end training with full-depth backpropagation remains the dominant paradigm for optimizing deep neural networks, but its efficiency deteriorates as models grow deeper. Since every block must be executed and differentiated under a single global objective, full-depth BP introduces substantial parameter redundancy, activation-memory cost, and training latency, especially when neighboring layers exhibit highly correlated learning patterns. Directly skipping or removing layers can reduce cost, but often weakens representation capacity or requires architecture-specific reuse designs. In this paper, we propose Replacement Learning (RepL), a training-time paradigm that reduces full-depth redundancy by replacing selected blocks rather than simply discarding them. For each removed block, RepL inserts a lightweight computing layer that synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation, and applies the synthesized operator to the preceding activation. In this way, RepL preserves local contextual continuity while avoiding unnecessary full-layer computation. We instantiate RepL for CNNs and ViTs with tailored parameter-fusion blocks that handle convolutional channels, feature resolutions, and transformer submodules. Extensive experiments on CIFAR-10, SVHN, STL-10, ImageNet, COCO, and CityScapes show that RepL reduces trainable parameters, GPU memory usage, and training time while matching or surpassing standard end-to-end training across classification, detection, and segmentation. Additional results on WikiText-2, transfer learning, inference throughput, checkpointing, stochastic depth, and INT8 quantization further demonstrate its generality and compatibility.

2605.19532 2026-05-20 cs.CV cs.LG

Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

通过基于核心标记注意力的种子选择提升文本到图像扩散模型

Yunzhe Zhang, Hongfu Liu, Pengyu Hong

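AI summary This paper studies how the random seed affects generation quality in text-to-image diffusion models and proposes a seed selection method based on core-token attention that improves text-image alignment and visual quality without any training.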

Comments Preprint

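Details
AI Chinese abstract

Text-to-image diffusion models can generate high-quality images, but their output is extremely sensitive to the random seed: different initial seeds often lead to marked differences in image quality and prompt-image alignment. We revisit this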

English abstract

Text-to-image diffusion models can synthesize high-quality images, yet the outcome is notoriously sensitive to the random seed: different initial seeds often yield large variations in image quality and prompt-image alignment. We revisit this "seed effect" and show that attention dynamics over prompt core tokens, the content-bearing words, measured during the first few denoising steps, strongly predict final generation quality. Building on this observation, we introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play method that ranks seeds for a given prompt by leveraging cross-attention to core tokens during the denoising process. ABSS requires no finetuning and does not alter the initial noise; it scores and ranks all candidate seeds, keeps only the top-k for full generation, and discards the rest, without relying on a fixed accept/reject threshold. Operating purely at inference time, ABSS can serve as a lightweight pre-selection add-on for existing seed-optimization pipelines, enabling additional gains. Across three benchmarks, extensive experiments show that ABSS enables consistent improvements in text-image alignment and visual quality for Stable Diffusion variants, as corroborated by human preference and alignment metrics.
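
The score-rank-keep procedure described in this abstract can be sketched as follows. The abstract does not give the scoring formula, so this sketch assumes a seed's score is the attention mass on core tokens averaged over the first few denoising steps; `core_token_score`, `select_top_k`, and the attention values are hypothetical and made up purely for illustration.

```python
from typing import Dict, List, Sequence


def core_token_score(
    early_attention: Sequence[Sequence[float]], core_indices: Sequence[int]
) -> float:
    """Average, over the early steps, of the attention mass on core tokens.

    Assumption: this averaging rule is a stand-in; the abstract only says
    that cross-attention to core tokens is used for ranking.
    """
    per_step = [sum(step[i] for i in core_indices) for step in early_attention]
    return sum(per_step) / len(per_step)


def select_top_k(scores: Dict[int, float], k: int) -> List[int]:
    """Return the k seeds with the highest scores, best first.

    There is no accept/reject threshold: seeds are only ranked.
    """
    return sorted(scores, key=lambda seed: scores[seed], reverse=True)[:k]


# Made-up per-token attention weights for two early denoising steps,
# keyed by candidate seed. Token index 1 is treated as the core token.
attention_by_seed = {
    11: [[0.1, 0.6, 0.3], [0.1, 0.7, 0.2]],
    22: [[0.5, 0.2, 0.3], [0.4, 0.3, 0.3]],
    33: [[0.2, 0.5, 0.3], [0.3, 0.5, 0.2]],
}
core_indices = [1]

scores = {
    seed: core_token_score(attn, core_indices)
    for seed, attn in attention_by_seed.items()
}

# Only these seeds would go on to full generation; the rest are discarded.
kept = select_top_k(scores, k=2)
```

Because only the first few steps are needed to score a seed, discarded seeds cost a fraction of a full generation in this picture.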

2605.19529 2026-05-20 cs.AI

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

生成-评估一致性:为LLM赋能的自适应评估的必要有效性标准

Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh

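AI summary This paper proposes Generative-Evaluative Agreement (GEA) as a validity criterion for LLM-enabled adaptive assessment. By measuring whether an LLM's scoring function can recover the skill levels specified for its generative function, it finds that validity varies across skill categories, and it proposes granular, skill-decomposed rubrics as the main way to improve GEA.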

Comments BEA 2026

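Details
AI Chinese abstract

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a measure of whether an LLM's scoring function can recover the skill levels its generative function was instructed to produce. In the first direct measurement on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698), with a systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and overestimation of low skill inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for improving GEA, and we outline complementary mitigations.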

English abstract

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
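
The quantities reported in this abstract can be illustrated with a small sketch: a Pearson correlation between the skill levels the generator was instructed to produce and the levels the scorer assigns, plus the mean signed difference as a bias measure. The abstract does not state exactly how GEA is computed, so treating it as a Pearson r is an assumption, and the skill values below are made up for illustration, not taken from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_bias(intended: Sequence[float], scored: Sequence[float]) -> float:
    """Mean signed error; a positive value means the scorer overestimates."""
    return sum(s - i for i, s in zip(intended, scored)) / len(intended)


# Hypothetical data: instructed skill levels and the scores assigned later.
intended_levels = [1, 2, 3, 4, 5]
scored_levels = [2, 3, 3, 5, 5]

r = pearson_r(intended_levels, scored_levels)
bias = mean_bias(intended_levels, scored_levels)

# Squaring r gives the share of variance recovered. For the reported
# r = 0.698 this is about 0.487, i.e. roughly half.
reported_r_squared = 0.698 ** 2
```

The squared correlation is what links the reported r = 0.698 to the phrase "roughly half the intended variance".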