arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04907 2026-06-04 cs.RO

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu

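AI summary: Proposes WAM-Nav, an asymmetric Diffusion Transformer model that jointly learns action generation and latent visual foresight. A shared Diffusion Transformer jointly diffuses long-horizon actions and short-horizon visual predictions, and a dual-stream contextual conditioning mechanism plus a goal alignment module let a single policy support Image-Goal, Point-Goal, and No-Goal navigation. On the ClutterScenes and InternScenes benchmarks, success rates on Image-Goal and Point-Goal navigation improve by 15.7% and 3.3%, respectively, and real-world deployment reaches an average 85% task success rate.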

Abstract

Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.

2606.04906 2026-06-04 cs.CL cs.AI

'Your AI Text is not Mine': Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions

Nils Dycke, Marina Sakharova, Nico Daheim, Iryna Gurevych

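AI summary: To address the lack of a shared definition of harmful use in AI-generated text detection, the paper systematically defines several notions of AI-generated text, builds AITDNA, a benchmark of human-machine co-constructed texts annotated with detailed genesis information, and evaluates a range of detectors under the different notions.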

Abstract

Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature of what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and these are often only loosely related to real-world needs and applications. To address this gap, we systematically define various notions of AI-generated text and their characteristics. To study these, we collect AITDNA, a new benchmark of human-machine co-constructed texts that is annotated with detailed genesis information, such as the entire edit and AI-interaction history. We benchmark various machine-generated text detectors and find that they often perform well only for specific notions, not as broad detectors. We release code and data publicly.

2606.04903 2026-06-04 cs.LO cs.AI cs.MA cs.PL

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

Aaron Sterling

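AI summary: Proposes the Agentic Redux architecture, uses the typed lambda calculus to prove that its executions on appropriate domains are semantically correct with auditable decisions, and introduces Ontology-First Agent Design.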

Abstract

We introduce the LLM agent architecture Agentic Redux, intended for nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, when run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains: healthcare billing compliance and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creating agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill in order to work the problems in the domain.
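
The abstract mentions an append-only ledger but does not say how it is built. As a rough illustration only, the sketch below shows one common way to make such a ledger tamper-evident, by chaining SHA-256 hashes. The `Ledger` class, its method names, and the hash chaining itself are assumptions for illustration, not the Agentic Redux implementation.

```python
import hashlib
import json


class Ledger:
    """Hypothetical append-only decision log with a SHA-256 hash chain."""

    GENESIS = "0" * 64  # placeholder hash that precedes the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(prev_hash, payload):
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, decision):
        """Record a JSON-serializable decision and return its hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        payload = json.dumps(decision, sort_keys=True)
        entry_hash = self._digest(prev_hash, payload)
        self._entries.append({"prev": prev_hash, "payload": payload, "hash": entry_hash})
        return entry_hash

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            expected = self._digest(prev_hash, entry["payload"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


ledger = Ledger()
ledger.append({"agent": "coder", "decision": "flag claim for review"})
ledger.append({"agent": "auditor", "decision": "approve"})
```

Because each hash covers the previous one, an auditor can replay the log in order and detect any later modification of an earlier decision.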

2606.04898 2026-06-04 cs.CV

CDPM-Align: Multi-Scale Guidance-Aligned Diffusion Pretraining for Robust Few-Shot Anatomical Landmark Detection

Roberto Di Via, Irina Voiculescu, Francesca Odone, Vito Paolo Pastore

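AI summary: Proposes CDPM-Align, a multi-scale guidance-aligned conditional diffusion pre-training method that learns robust representations through generative pre-training and improves the accuracy and uncertainty of anatomical landmark detection in few-shot, low-annotation settings.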

Comments
Accepted MICCAI 2026
Abstract

Anatomical landmark detection is a fundamental task in medical image analysis, supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, which also requires reliable and robust predictions. Despite its clinical relevance, the impact of representation learning in this context is still underexplored. In this work, we introduce CDPM-Align, a multi-scale guidance-aligned conditional diffusion pre-training method for anatomical landmark detection. Our experimental setup focuses on few-image and few-annotation regimes. Specifically, we employ three popular heterogeneous small-scale benchmark datasets for representation learning via conditional generative pre-training. Furthermore, we consider low-annotation scenarios for the downstream task of landmark detection, with 10 and 25 annotated images, reflecting realistic trade-offs between clinical effort and resource constraints for annotations. Our results confirm that generative pre-training enables the model to learn a robust representation. This improves both accuracy and uncertainty on the downstream tasks, advancing towards safe and efficient clinical deployment.

2606.04891 2026-06-04 cs.CV cs.CG

Hierarchical Space Partition for Surface Reconstruction

Minjie Tang, Xiangfei Li

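AI summary: To recover details lost to the limitations of LiDAR scanning in point-cloud reconstruction, proposes a hierarchical space partition driven by plane classification and priority-based growth, and generates a watertight polygonal mesh through min-cut optimization.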

Journal ref
in 2026 International Conference on 3D Vision (3DV), Vancouver, BC, Canada, 2026, pp. 207-216
Comments
Published in 2026 International Conference on 3D Vision (3DV)
Abstract

Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g. range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To address this, we propose a plane assembling strategy that effectively recovers missing details while maintaining model compactness. We classify all the planes extracted from the scene into three categories: highly visible, barely visible, and invisible. The invisible planes, which are recovered by scene structure analysis, indicate the missing details. The three types of planes correspond to the three growth priorities. Each plane grows according to the priority level, and the space is partitioned progressively, namely, the hierarchical partition. Subsequently, we generate a watertight polygonal mesh from the partition via a min-cut-based optimization. Finally, comparisons on public datasets show the effectiveness and superiority of our method against mainstream approaches. The project page is available at https://hsr-3dv.github.io/.

2606.04889 2026-06-04 cs.CL

GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

Tej Deep Pala, Vernon Toh, Soujanya Poria

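AI summary: To counter the gradient-signal dilution caused by uniform advantage distribution in reinforcement learning, proposes GRAIL, a token-wise advantage reweighting method based on gradient-activation saliency. It improves reasoning alignment without process-level supervision, with an average accuracy gain of 3.60% across five models.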

Abstract

Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
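
To make the contrast with uniform broadcasting concrete, here is a minimal sketch of token-wise advantage reweighting. It assumes per-token saliency scores have already been computed. The abstract does not give the saliency formula or the normalization, so the function name `reweight_advantages` and the mean-one normalization are assumptions, not the GRAIL method itself.

```python
def reweight_advantages(sequence_advantage, saliency, eps=1e-8):
    """Spread one sequence-level advantage over tokens in proportion to saliency.

    Assumed normalization: weights are scaled to have mean 1, so the average
    per-token advantage still equals the sequence-level advantage.
    """
    if not saliency:
        return []
    if any(s < 0 for s in saliency):
        raise ValueError("saliency scores are assumed to be non-negative")
    mean_saliency = sum(saliency) / len(saliency)
    if mean_saliency < eps:
        # No usable signal: fall back to uniform broadcasting.
        return [sequence_advantage] * len(saliency)
    return [sequence_advantage * s / mean_saliency for s in saliency]


# Uniform fallback versus a sequence where the last token is most salient.
uniform = reweight_advantages(1.0, [0.0, 0.0, 0.0])
weighted = reweight_advantages(1.0, [0.1, 0.1, 0.4])
```

With this normalization, the total update magnitude over the sequence stays the same as in the uniform case. Only its distribution across tokens changes.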

2606.04888 2026-06-04 cs.CV

HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition Scenarios

Yinxiang Yu, Maoxiang Chu, Qi Niu, Guanghu Liu, Wei Xu, Haotian Wang, Zhi Chen, Yutian Zhu, Yuelong Fan, Guanghao Liao

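AI summary: To handle multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection, proposes HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network that combines dual-stream DINOv3 feature fusion with class-specific multi-expert decoding for pixel-level segmentation of Vessels, Yellow and Black Spots, and Blood Spots. It reaches a mean Dice of 72.11% and a mean IoU of 58.44% on ML-SASD-Mix.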

Comments
Submitted to Medical Image Analysis; 47 pages, 31 figures, 14 tables
Abstract

Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at https://github.com/FX-CMX/HD-DinoMoE.
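
For readers unfamiliar with the two reported metrics, the sketch below computes Dice and Intersection-over-Union for a single pair of binary masks. It illustrates the standard definitions only. The function name, the flat-list mask format, and the empty-mask convention are assumptions, and the paper's exact averaging protocol is not stated in the abstract.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat binary masks given as sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Convention assumed here: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair, Dice is never smaller than IoU, which is consistent with the reported mean Dice exceeding the mean IoU.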

2606.04884 2026-06-04 cs.RO

D$^3$-MoE: Dual Disentangled Diffusion Mixture-of-Experts for Style-Controllable End-to-End Autonomous Driving

Renju Feng, Rukang Wang, Ning Xi, Jianguo Yu, Liping Lu, Pan Zhou, Duanfeng Chu

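AI summary: Proposes the D$^3$-MoE framework, which disentangles trajectory modeling along a behavioral axis (diffusion-based generation decoupled from selection) and a physical axis (separate longitudinal and lateral experts) to achieve style-controllable end-to-end autonomous driving, with state-of-the-art planning performance on the NAVSIM benchmark.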

Comments
8 pages, 6 figures
Abstract

Traditional end-to-end autonomous driving frameworks frequently suffer from the "style-averaging" dilemma when trained on high-variance human demonstrations, yielding homogenized, style-uncontrollable, and even kinematically unsafe policies. To overcome this limitation, we present D$^3$-MoE (Dual Disentangled Diffusion Mixture-of-Experts), which disentangles trajectory modeling along two complementary axes. On the behavioral axis, generation is decoupled from selection: a style-conditioned diffusion process synthesizes multi-style candidate trajectories in parallel within a single scene, allowing a downstream module to select the optimal trajectory based on user preference or an evaluation score. On the physical axis, decoupled longitudinal and lateral routers activate their respective experts during inference time, trained without manual labels using self-supervised targets from orthogonal ground-truth kinematics. These activated experts, architected as Diffusion Transformers (DiT) and equipped with style-conditioned AdaLN and asymmetric lateral-fusion cross-attention, independently predict their corresponding physical state before being reassembled into a unified, kinematically coherent trajectory. Extensive evaluations on the challenging NAVSIM benchmark demonstrate that D$^3$-MoE achieves state-of-the-art planning performance, reaching 88.2 PDMS and 84.3 EPDMS by default. Moreover, our Best-of-Three ensemble strategy effectively broadens the multi-modal solution space, raising performance to 91.3 PDMS and 87.5 EPDMS. Both quantitative and qualitative analyses jointly confirm the framework's advantages in planning quality and style controllability.
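
The decoupling of generation from selection can be pictured with a small selection step over already-generated candidates. This is a sketch under stated assumptions: the style names, the `score` callable, and the function `select_trajectory` are hypothetical, and the PDMS/EPDMS scoring used on NAVSIM is not reproduced here.

```python
def select_trajectory(candidates, score, preferred_style=None):
    """Pick one trajectory from style-conditioned candidates.

    candidates: dict mapping a style name to a trajectory (any object).
    score: callable returning a number for a trajectory; higher is better.
    preferred_style: if given and present in candidates, it overrides the score.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    if preferred_style in candidates:
        return preferred_style, candidates[preferred_style]
    best_style = max(candidates, key=lambda style: score(candidates[style]))
    return best_style, candidates[best_style]


# Toy candidates: each trajectory is a list of positions along the road.
candidates = {
    "aggressive": [3.0, 6.0],
    "normal": [2.0, 4.0],
    "conservative": [1.0, 2.0],
}


def toy_score(trajectory):
    # Hypothetical score: prefer ending close to 4.0 metres ahead.
    return -abs(trajectory[-1] - 4.0)


by_score = select_trajectory(candidates, toy_score)
by_preference = select_trajectory(candidates, toy_score, preferred_style="conservative")
```

Keeping the selector separate from the generator means the same candidate set can serve either a user preference or an automatic evaluation score.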

2606.04883 2026-06-04 cs.CL cs.LO

Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

Kári Rögnvaldsson, Chenhao Sun, Jasper Dekoninck, Martin Vechev

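AI summary: Proposes an action routing agent with a data plane and a control plane. The control plane observes failed trajectories and estimates the success probability and cost to decide dynamically whether to keep proving or to re-decompose, cutting cost by 25.8% on average on a subset of PutnamBench while preserving performance.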

Abstract

Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by 25.8% over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
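
The control plane's continue-or-restart choice can be illustrated with a simple routing rule. The abstract does not describe how the estimates are produced or combined, so everything below is an assumption: the `Estimate` type, the `choose_action` function, and the success-probability-per-cost criterion.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    p_success: float  # estimated probability that the next attempt succeeds
    cost: float       # estimated cost of the next attempt, in any consistent unit


def choose_action(continue_est, restart_est, budget_left):
    """Return "continue", "restart", or "stop" under a hypothetical routing rule."""
    options = {"continue": continue_est, "restart": restart_est}
    affordable = {
        name: est for name, est in options.items() if 0 < est.cost <= budget_left
    }
    if not affordable:
        return "stop"
    # Prefer the action with the highest estimated success probability per unit cost.
    return max(
        affordable,
        key=lambda name: affordable[name].p_success / affordable[name].cost,
    )


stuck = choose_action(Estimate(0.05, 1.0), Estimate(0.30, 2.0), budget_left=10.0)
promising = choose_action(Estimate(0.40, 1.0), Estimate(0.30, 2.0), budget_left=10.0)
```

In this toy rule, a target that keeps failing sees its continue estimate drop until restarting from a new decomposition becomes the better use of the remaining budget.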

2606.04881 2026-06-04 cs.CV cs.AI

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Yueying Zou, Peipei Li, Qianrui Teng, Dianyan Xu, Zekun Li

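AI summary: Proposes DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. It preserves appearance diversity through stochastic diffusion decoding and age-conditioned semantic modulation, and introduces the Cross-age Identity Relation Regulator (CARR), which guides denoising at inference time to improve sequence-level ordinal reliability.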

Comments
11 pages, 10 figures, 5 tables
Abstract

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

2606.04880 2026-06-04 cs.CV

MAOAM: Unified Object and Material Selection with Vision-Language Models

Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer

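AI summary: Proposes the MAOAM framework, which uses a vision-language model with a segmentation head to select objects and materials precisely through text or click interactions, and designs a data generation pipeline to address the lack of material selection data.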

Comments
Accepted to SIGGRAPH 2026 Conference. Project page: https://jadenpark0.github.io/project_pages/maoam/
Abstract

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual-semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.

2606.04877 2026-06-04 cs.LO cs.AI cs.PL cs.SE

Abduction Prover in Isabelle/HOL

Yutaka Nagashima, Daniel Sebastian Goc

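AI summary: To address the limited automation of proof assistants based on expressive logics, proposes an Abduction Prover for Isabelle/HOL that uses abductive reasoning to identify useful conjectures and construct proof scripts automatically.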

Comments
Accepted to Isabelle2026
Abstract

Proof assistants based on expressive logics suffer from limited automation for proof search, which raises the cost of formal verification with proof assistants. We address this problem by introducing the Abduction Prover for Isabelle/HOL. Given a challenging proof goal, the Abduction Prover constructs a proof script for the goal by identifying useful conjectures using abductive reasoning.

2606.04876 2026-06-04 cs.LG

Towards Pretraining Text Encoders for TabPFN

Mustafa Tajjar, Alexander Pfefferle, Lennart Purucker, Frank Hutter

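AI summary: Proposes the TabPFN Text Adapter, a lightweight adapter that maps text embeddings into TabPFN's embedding space, avoiding the PCA bottleneck, preserving TabPFN's numerical strengths, and training more efficiently than end-to-end text-tabular pipelines.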

Abstract

Tabular foundation models, such as TabPFN, achieve strong performance on tabular datasets with numerical and categorical data, but do not natively handle high-cardinality text features. Standard pipelines, therefore, embed text with a language model and compress the resulting vectors with PCA into a small number of scalar features before inputting them into TabPFN. This creates an information bottleneck: most embedding dimensions are discarded, and the compressed representation must then be expanded again by TabPFN's feature encoder. End-to-end alternatives can avoid PCA, but they require large amounts of pretraining data containing text cells and usually perform subpar compared to tabular foundation models that were pretrained on large amounts of synthetic data. Inspired by modality-alignment approaches like LLaVA (vision-to-LLM token projection) and TableGPT-style systems (table-to-LLM token projection), we introduce the TabPFN Text Adapter (text-to-TFM token projection). We freeze both the sentence encoder and TabPFN, and train only a lightweight adapter that maps text embeddings into a short sequence of tokens in TabPFN's embedding space. This design removes the PCA bottleneck, preserves TabPFN's numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.
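
The shape of the adapter idea, one text embedding in and a short sequence of tokens out, can be sketched in plain Python. The abstract says only that the adapter is lightweight, so the single linear layer is an assumption. The names `make_adapter`, `apply_adapter`, `d_text`, `num_tokens`, and `d_model` are hypothetical, and the weights here are random rather than trained.

```python
import random


def make_adapter(d_text, num_tokens, d_model, seed=0):
    """Create a (num_tokens * d_model) x d_text weight matrix.

    In a real setup these would be the only trainable parameters; the sentence
    encoder and the tabular model stay frozen.
    """
    rng = random.Random(seed)
    scale = d_text ** -0.5
    return [
        [rng.gauss(0.0, scale) for _ in range(d_text)]
        for _ in range(num_tokens * d_model)
    ]


def apply_adapter(weights, embedding, num_tokens, d_model):
    """Project one text embedding to num_tokens vectors of size d_model."""
    flat = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(num_tokens)]


weights = make_adapter(d_text=8, num_tokens=3, d_model=4)
tokens = apply_adapter(weights, [1.0] * 8, num_tokens=3, d_model=4)
```

Unlike a PCA step that keeps a few scalar components, every input dimension contributes to every output token, so no embedding dimension is discarded before the tabular model sees the text.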

2606.04871 2026-06-04 cs.CV

Recent Advances and Trends in Learning-based 3D Representations

Adrien Schockaert, Hamid Laga, Hazem Wannous, Vincent Magnier, Guillaume Dufaye, Jean-François Witz

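AI summary: Surveys the families of 3D representations, from discrete explicit formats to continuous implicit fields based on neural rendering or primitive splatting, analyzes their strengths, limitations, and key applications, and emphasizes the paradigm shift toward implicit representations.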

Abstract

The selection of an appropriate 3D representation is a fundamental design decision that dictates the efficiency, quality, and capabilities of modern computer vision and graphics pipelines for tasks such as 3D reconstruction, novel-view synthesis and rendering, shape and motion analysis, recognition, and generation. While traditional representations (e.g., meshes, point clouds, and volumetric grids) remain standard outputs of 3D sensors (e.g., LiDAR and 3D scanners) and are widely used in downstream applications (e.g., editing and simulation), recent neural and primitive-based representations (e.g., 3D Gaussian Splatting) offer compact and differentiable alternatives, opening a wide range of opportunities in applications such as games, AR/VR, autonomous driving, robot navigation, and medical imaging, to name a few. The goal of this paper is to survey the main families of 3D representations from discrete explicit formats to continuous implicit fields based either on neural rendering or primitive splatting. For each type of representation, we present the general formulation and its variants, discuss its benefits and limitations, and highlight key applications. We conclude the paper by outlining the open challenges and potential directions for future research. Distinct from recent surveys that broadly cover 3D object and scene reconstruction, this paper provides a focused analysis on the evolution of 3D representations themselves. We specifically emphasize the paradigm shift toward implicit representations, offering a novel perspective on how these emerging formats fundamentally alter 3D/4D workflows.

2606.04867 2026-06-04 cs.AI

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

Yanjing Ren, Reza Ebrahimi, TengTeng Ma

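AI summary: Presents AICompanionBench, described as the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories, and evaluates 20 LLMs on detecting unsafe interactions. Stronger models are accurate on explicit harmful content but struggle to identify implicit unsafe interactions.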

Abstract

As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
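
The two failure modes the abstract highlights, weak detection of nuanced categories and benign conversations flagged as harmful, are easy to hide behind overall accuracy. The sketch below shows one way to surface them from judge outputs. The function names and toy labels are illustrative assumptions, not the paper's evaluation code, and only the category names come from the abstract.

```python
from collections import defaultdict


def per_category_recall(gold, predicted):
    """Fraction of conversations in each gold category that the judge labels correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        if g == p:
            hits[g] += 1
    return {category: hits[category] / totals[category] for category in totals}


def benign_false_positive_rate(gold, predicted, benign="no-harm"):
    """Fraction of benign conversations that the judge marks as some harm category."""
    benign_predictions = [p for g, p in zip(gold, predicted) if g == benign]
    if not benign_predictions:
        return 0.0
    flagged = sum(1 for p in benign_predictions if p != benign)
    return flagged / len(benign_predictions)


# Toy labels for five conversations (not taken from the dataset).
gold = ["manipulation", "manipulation", "no-harm", "no-harm", "control"]
predicted = ["manipulation", "no-harm", "no-harm", "control", "control"]
recall = per_category_recall(gold, predicted)
false_positive_rate = benign_false_positive_rate(gold, predicted)
```

Reporting recall per category alongside the benign false-positive rate keeps a judge that over-flags from looking as good as one that is selective.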

2606.04866 2026-06-04 cs.LG

Provably Reduced Sample Cost in Prior-Guided Hyperparameter Optimization

在先验引导的超参数优化中可证明的样本成本降低

Leona Hennig, Jasmin Brandt, Lukas Fehring, Barbara Hammer, Marius Lindauer, Marcel Wever

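AI summary: Using the formal framework of fixed-budget best-arm identification, this paper gives the first prior-dependent sample complexity bounds for multi-fidelity hyperparameter optimization, proves that informative priors can significantly reduce the number of evaluations, and reports experiments with budget savings of up to 90%.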

详情
AI中文摘要

自动化机器学习(AutoML)中的大规模超参数优化(HPO)消耗大量计算资源,引发了关于可扩展性和能源效率的日益关注。现有方法启发式地利用先验信息来加速黑箱和多保真度设置,但缺乏对先验信息性如何定量减少样本复杂度的刻画。在这项工作中,我们通过固定预算最佳臂识别的形式化视角,首次给出了带先验的多保真度HPO的依赖分布的样本复杂度界。通过将先验直接建模在臂均值(即配置性能)上,我们推导出显式的、依赖分布的误差界,量化了先验与评估预算之间的关系。我们的分析表明,信息性先验(将概率质量集中在接近最优的臂上)能够减少所需的评估次数,而无信息或误导性先验则恢复基线性能。我们在合成基准和LCBench(一个用于深度学习的常见多保真度HPO基准)上进行了概念验证实验,以确认我们的理论结果,在保持解质量的同时实现了高达90%的预算削减。总之,我们的结果为先验引导和计算高效的绿色AutoML提供了原则性基础。

英文摘要

Large-scale hyperparameter optimization (HPO) in automated machine learning (AutoML) consumes substantial computational resources, raising growing concerns about scalability and energy efficiency. Existing methods use prior information heuristically to accelerate both black-box and multi-fidelity settings, but they lack a characterization of how prior informativeness quantitatively reduces sample complexity. In this work, we provide the first distribution-dependent sample complexity bounds for multi-fidelity HPO with priors through the formal lens of fixed-budget best-arm identification. By modeling priors directly over arm means as configuration performance, we derive explicit, distribution-dependent error bounds that quantify the relationship between priors and evaluation budget. Our analysis shows that informative priors, which concentrate probability mass on near-optimal arms, yield reductions in the number of required evaluations, whereas baseline performance is recovered with uninformative or misleading priors. We conduct proof-of-concept experiments on a synthetic benchmark and on LCBench, a common multi-fidelity HPO benchmark for deep learning, to confirm our theoretical results, achieving up to 90% budget reduction while retaining solution quality. Together, our results provide a principled foundation for prior-guided and compute-efficient green AutoML.

2606.04863 2026-06-04 cs.CV

IRIS-GAN: Staged Specialist Detection of Deepfake Faces

IRIS-GAN: 深度伪造人脸的分阶段专家检测

Jaume M. Trenchs, Veronica Sanz

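AI summary: Proposes IRIS-GAN, a specialist fake-face detector trained through staged exposure to different GAN families, which achieves high detection rates under cross-generator shift and reveals generator-dependent spatial response patterns through Grad-CAM analysis.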

详情
Comments
20 pages, 10 figures
AI中文摘要

我们引入IRIS-GAN,一种针对跨生成器迁移下合成人脸图像的专业取证检测器。我们并非解决通用合成图像检测问题,而是专注于由生成对抗网络(GAN)生成的人脸,这些网络在深度伪造内容中处于领先地位,并通过分阶段暴露于日益苛刻的GAN族同时保留早期生成器来训练检测器。最终模型在考虑的GAN族中实现了超过99%的伪造检测率,并以98.9%的准确率分类了一个外部真实人脸数据集。Grad-CAM分析进一步揭示了可测量的生成器依赖的空间响应模式,这些模式对于仅使用热图的二级分类器仍然具有信息量。对扩散生成人脸的族外测试证实了IRIS-GAN是一个专家检测器,具有一定能力检测非GAN深度伪造。这些结果确立了分阶段训练作为鲁棒GAN人脸取证的有效策略。

英文摘要

We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.

2606.04860 2026-06-04 cs.LG cs.AI

Learning Empirically Admissible Neural Heuristics for Combinatorial Search

学习组合搜索的经验可容许神经启发式

Siddharth Sahay

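AI summary: For combinatorial search problems, proposes a validation-calibrated framework that combines an admissible Bellman operator with an asymmetric loss function to train empirically admissible neural heuristics, preserving path optimality while substantially reducing search node expansions.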

详情
Comments
13 pages, 3 figures, 2 tables, 1 algorithm
AI中文摘要

寻找诸如魔方、滑动拼图游戏和Lights Out等组合谜题的最优解路径仍然是人工智能中的经典挑战。启发式搜索算法(如A*)仅在使用可容许启发式(即从不高估真实剩余代价的启发式)时才能保证路径最优性。深度强化学习方法(如DeepCubeA)训练深度神经网络来近似代价到目标的启发式。然而,标准均方误差训练经常产生高估,违反可容许性并损害解的最优性。在本文中,我们介绍了一个可泛化的框架,用于学习验证校准的可容许神经启发式。我们使用低估的可容许贝尔曼算子结合非对称损失函数来训练价值网络,以惩罚高估。为了考虑残差神经函数逼近误差,我们提出了一个基于验证打乱计算的校准安全偏移量。我们证明,在校准的神经启发式下,在评估协议下未观察到可容许性违反,并在实践中保持了路径最优性,同时与标准分析基线相比,在2x2魔方上减少了高达83.0%的搜索节点扩展,在3x3 Lights Out网格上减少了19.9%,在8-Puzzle上减少了1.9%。

英文摘要

Finding optimal solution paths for combinatorial puzzles like the Rubik's Cube, sliding tile puzzles, and Lights Out remains a classical challenge in artificial intelligence. Heuristic search algorithms such as A* guarantee path optimality only when using an admissible heuristic, that is, one that never overestimates the true remaining cost-to-go. Deep reinforcement learning (RL) methods like DeepCubeA train deep neural networks to approximate cost-to-go heuristics. However, standard mean-squared error (MSE) training regularly yields overestimations, violating admissibility and compromising solution optimality. In this paper, we introduce a generalizable framework for learning validation-calibrated admissible neural heuristics. We train a value network using an underestimating Admissible Bellman Operator combined with an Asymmetric Loss function to penalize overestimation. To account for residual neural function approximation errors, we propose a post-hoc calibration safety offset computed over validation scrambles. We demonstrate that our calibrated neural heuristics achieve no observed admissibility violations under the evaluation protocol and preserve path optimality in practice while reducing search node expansions by up to 83.0% on a 2 by 2 Rubik's Cube, 19.9% on a 3 by 3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.
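
The two ingredients named in this abstract, an asymmetric loss and a validation-derived safety offset, can be illustrated with a minimal sketch. The weight `over_weight`, the function names, and the toy validation values are assumptions for illustration only; the abstract does not specify the paper's exact loss or offset rule.

```python
def asymmetric_loss(pred, target, over_weight=10.0):
    """Squared error that weights overestimation (pred > target) more heavily.

    over_weight is a hypothetical penalty factor, not a value from the paper.
    """
    err = pred - target
    return (over_weight if err > 0 else 1.0) * err * err


def safety_offset(preds, true_costs):
    """Largest overestimate observed on validation states (0.0 if none)."""
    return max(0.0, max(p - t for p, t in zip(preds, true_costs)))


def calibrated(pred, offset):
    """Shift a raw prediction down by the offset, clamped at zero."""
    return max(0.0, pred - offset)


# Toy validation set: raw network outputs and true optimal costs-to-go.
val_preds = [3.5, 5.0, 2.0, 7.5]
val_true = [3.0, 6.0, 2.0, 7.0]

offset = safety_offset(val_preds, val_true)
calibrated_preds = [calibrated(p, offset) for p in val_preds]
```

After the shift, no calibrated value exceeds its true cost on the validation set. That is only an empirical guarantee on the states that were checked, which matches the abstract's wording of "no observed admissibility violations".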

2606.04857 2026-06-04 cs.LG

Rethinking Incompleteness: Formalizing Protocol Divergence and Train-Once Learning for Robust IMVC

重新思考不完备性:形式化协议发散与单次训练学习用于鲁棒IMVC

Haolu Liu, Xiyue Wang, Xuanting Xie, Liangjian Wen, Zhao Kang

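AI summary: Observing that the standard IMVC evaluation paradigm overlooks the fact that missing rate alone is insufficient to characterize data incompleteness, proposes formal measures of protocol divergence and designs the CRAFT architecture, which uses per-sample independence and mask-aware fusion so that a model trained once generalizes to diverse missing patterns.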

详情
AI中文摘要

标准IMVC评估为不同的缺失数据配置分别训练模型。我们表明,这种范式掩盖了一个基本脆弱性:仅缺失率不足以刻画数据不完备性。具体而言,我们表明,具有相同名义缺失率的协议在完全观测样本的比例上可能相差高达$50\times$,从而引发截然不同的学习机制。我们将这一现象形式化为不完备性发散,提供了捕捉缺失数据协议间结构差异的度量。我们进一步证明,对于一大类基于重构的目标函数,当完整样本比例低于临界阈值时,学习在结构上变得不适定,导致接近随机的性能。为了绕过这一理论界限,我们提出了CRAFT(完整数据鲁棒注意力掩码融合变换器)。CRAFT通过两个关键特性将鲁棒性的负担从损失函数转移到架构上:(i)每个样本的独立性,消除了对完整样本共现的依赖,以及(ii)掩码感知变长融合,通过注意力掩码仅聚合观测到的视图。这种设计允许单个模型在完整数据上训练一次,即可在推理时泛化到不同的缺失模式,无需重新训练。在七个基准上的大量实验表明,CRAFT匹配或超越了每个配置的基线,同时将训练开销降低了$8.8\times$,证明对缺失数据的鲁棒性可以作为固有的架构属性实现。代码(CRAFT)和我们的imvc-audit工具包可在https://anonymous.4open.science/r/CRAFT-BF80/ 和 https://anonymous.4open.science/r/imvc-audit-8263/ 获取。

英文摘要

Standard IMVC evaluation retrains separate models for different missing-data configurations. We show that this paradigm obscures a fundamental vulnerability: missing rate alone is insufficient to characterize data incompleteness. Specifically, we show that protocols with identical nominal missing rates can differ by up to $50\times$ in their proportion of fully observed samples, inducing drastically different learning regimes. We formalize this phenomenon as incompleteness divergence, providing measures that capture structural disparities across missing-data protocols. We further prove that for a broad class of reconstruction-based objectives, learning becomes structurally ill-posed when the proportion of complete samples falls below a critical threshold, leading to near-random performance. To bypass this theoretical bound, we propose CRAFT (Complete-data Robust Attention-masked Fusion Transformer). CRAFT shifts the burden of robustness from the loss function to the architecture via two key properties: (i) per-sample independence, which removes reliance on complete-sample co-occurrence, and (ii) mask-aware variable-length fusion, which aggregates only observed views through attention masking. This design allows a single model, trained once on complete data, to generalize to diverse missing patterns at inference time without retraining. Extensive experiments on seven benchmarks show that CRAFT matches or outperforms per-configuration baselines while reducing training overhead by $8.8\times$, demonstrating that robustness to missing data can be achieved as an inherent architectural property. Code (CRAFT) and our imvc-audit toolkit are available at https://anonymous.4open.science/r/CRAFT-BF80/ and https://anonymous.4open.science/r/imvc-audit-8263/.
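
The claim that identical nominal missing rates can hide very different proportions of fully observed samples can be checked on a toy example. The two masks below are made up for illustration (1 = view observed, 0 = view missing) and are not protocols from the paper.

```python
def missing_rate(mask):
    """Fraction of (sample, view) entries that are missing."""
    total = sum(len(row) for row in mask)
    observed = sum(sum(row) for row in mask)
    return 1 - observed / total


def complete_fraction(mask):
    """Fraction of samples for which every view is observed."""
    return sum(all(row) for row in mask) / len(mask)


# Four samples, four views each; both protocols drop exactly 4 of 16 entries.
concentrated = [
    [1, 0, 0, 0],  # one sample loses three views
    [1, 1, 1, 0],  # one sample loses one view
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
spread = [
    [0, 1, 1, 1],  # every sample loses exactly one view
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
```

Both masks have a missing rate of 0.25, yet half of the samples are complete under `concentrated` and none are under `spread`, so a method that relies on complete samples sees two very different problems.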

2606.04853 2026-06-04 cs.RO

Teaching Robots to Say 'I Don't Know' : SENTINEL for Uncertainty-Aware SLAM

教机器人说‘我不知道’:用于不确定性感知SLAM的SENTINEL

Abhishek S, Badrikanath Praharaj, Sreeram MV

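AI summary: Proposes SENTINEL, a framework that uses geometric scan statistics and cross-modal depth consistency to give low-cost 2D LiDAR a training-free, label-free reliability score, rejecting corrupted scans and falling back to wheel odometry to prevent silent SLAM corruption.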

详情
Comments
6 pages, 10 figures, 3 tables. Accepted at the Uncertainty in Open-World Robotics Workshop, held in conjunction with the International Conference on Robotics and Automation (ICRA 2026)
AI中文摘要

低成本2D LiDAR缺乏高端传感器用于诊断测量故障的强度通道,但它们广泛用于教育和预算机器人平台。我们提出SENTINEL,一个无需训练、无需标签的可靠性估计框架,为仅测距的LiDAR提供有效的诊断信号。SENTINEL结合基于几何的扫描统计与LiDAR和RGB-D相机之间的跨模态深度一致性,计算每个扫描的可靠性分数(0到1)。当分数低于阈值时,损坏的扫描被拒绝,机器人回退到校准的轮式里程计,防止无声的SLAM损坏。我们在配备RPLidar A2M12和Intel RealSense D435i的GEFIER R1四轮差速转向机器人上评估SENTINEL,在包含中央障碍物上受控透明和反射故障元素的185 cm × 245 cm场地中。跨五种表面条件(包括玻璃、镜子、光面纸以及混合镜子和光面纸条件)的空间可靠性图显示了干净情况和故障情况之间的清晰分离,允许受影响区域被识别为拒绝或噪声。由于这些故障模式在仿真中不存在,验证完全在真实硬件上进行。

英文摘要

Low-cost 2D LiDARs lack the intensity channel that higher-end sensors use to diagnose measurement failures, yet they are widely used on educational and budget robotics platforms. We present SENTINEL, a training-free, label-free reliability estimation framework that gives range-only LiDAR an effective diagnostic signal. SENTINEL combines geometry-based scan statistics with cross-modal depth consistency between LiDAR and an RGB-D camera to compute a per-scan reliability score between 0 and 1. When the score falls below a threshold, corrupted scans are rejected and the robot falls back to calibrated wheel odometry, preventing silent SLAM corruption. We evaluate SENTINEL on a GEFIER R1 four-wheel skid-steer robot equipped with an RPLidar A2M12 and an Intel RealSense D435i in a 185 cm by 245 cm arena containing controlled transparent and reflective failure elements on a central obstacle. Spatial reliability maps across five surface conditions, including glass, mirror, shiny paper, and a mixed mirror and shiny-paper condition, show clear separation between clean and failure cases, allowing affected regions to be identified as reject or noise. Because these failure modes are absent in simulation, validation is performed entirely on real hardware.
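
The gating logic described in this abstract (score each scan, reject below a threshold, fall back to odometry) might look roughly like the sketch below. Everything specific here is an assumption: the abstract does not give the scan statistics, the weighting, or the threshold, so the valid-return fraction, the tolerance `tol`, the weight `w`, and the value 0.6 are hypothetical stand-ins.

```python
import math


def reliability(lidar, depth, max_range=12.0, tol=0.10, w=0.5):
    """Hypothetical per-scan reliability score in [0, 1].

    geom  : fraction of beams with a finite, in-range return
    agree : fraction of comparable beams where LiDAR and camera depth
            differ by at most tol metres
    """
    valid = [r for r in lidar if math.isfinite(r) and 0.0 < r <= max_range]
    geom = len(valid) / len(lidar)
    pairs = [(r, d) for r, d in zip(lidar, depth)
             if math.isfinite(r) and math.isfinite(d)]
    agree = sum(abs(r - d) <= tol for r, d in pairs) / len(pairs) if pairs else 0.0
    return w * geom + (1 - w) * agree


def select_pose_source(score, threshold=0.6):
    """Use the scan only when it is trusted; otherwise fall back to odometry."""
    return "lidar_scan_matching" if score >= threshold else "wheel_odometry"


# Toy beams (metres): a clean wall, then a glass panel that drops or distorts returns.
depth_camera = [1.0, 1.5, 2.0, 2.5]
clean_score = reliability([1.02, 1.48, 2.05, 2.5], depth_camera)
glass_score = reliability([math.inf, math.inf, 4.0, 2.5], depth_camera)
```

With these toy values the clean scan is kept and the glass scan is rejected, which is the behaviour the abstract describes; a real system would tune every constant on hardware.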

2606.04850 2026-06-04 cs.LG cs.AI cs.AR math.OC

Uncertainty-Aware End-to-End Co-Design of Neural Network Processors: From Training and Mapping to Fabrication

不确定性感知的神经网络处理器端到端协同设计:从训练、映射到制造

Yuyang Du, Yujun Huang, Gioele Zardini

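AI summary: Proposes a unified framework grounded in monotone co-design theory that achieves end-to-end co-design of neural network processors through four interoperable design blocks (network training, chip mapping, wafer-level fabrication, and compute resource allocation), and introduces Confidence (the inverse of success probability) as an explicit, optimizable resource for handling uncertainty.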

详情
Comments
14 pages
AI中文摘要

设计神经网络处理器是一个端到端的协同设计问题:网络架构和训练预算决定了推理工作负载;硬件映射决策决定了芯片面积、延迟和能量;这些特性决定了制造良率和生产成本。在实践中,这些决策是在不同阶段做出的,现有的协同设计方法与特定算法紧密耦合,使得改进一个组件而不重新设计整个流水线变得困难。本文提出了一个基于单调协同设计理论的统一框架,该框架组合了四个可互操作的设计模块,涵盖网络训练、芯片映射、晶圆级制造和计算资源分配。每个模块仅向系统其余部分暴露功能-资源接口,因此任何模块都可以在不改变其他模块结构的情况下进行优化。一个核心贡献是对不确定性的处理:该框架没有将随机结果简化为点估计,而是引入置信度(成功概率的倒数)作为与成本、时间和功耗并列的显式可优化资源。三个案例研究验证了该方法。第一个案例恢复了跨异构应用场景的帕累托最优实现。第二个案例确认置信度作为一个连续可调的设计旋钮,而非事后诊断指标。第三个案例表明,改进单个模块的实现集会自动传播到全局帕累托前沿,而无需修改协同设计图。

英文摘要

Designing a neural network processor is an end-to-end co-design problem: network architecture and training budget determine the inference workload; hardware mapping decisions determine chip area, latency, and energy; and these characteristics govern fabrication yield and manufacturing cost. In practice, these decisions are made in separate stages, and existing co-design methodologies are tightly coupled to specific algorithms, making it difficult to improve one component without reworking the entire pipeline. This paper presents a unified framework, grounded in monotone co-design theory, that composes four interoperable design blocks spanning network training, chip mapping, wafer-level fabrication, and compute resource allocation. Each block exposes only a functionality-resource interface to the rest of the system, so any block can be refined without structural changes elsewhere. A central contribution is the treatment of uncertainty: rather than collapsing stochastic outcomes into point estimates, the framework introduces Confidence, the inverse of success probability, as an explicit and optimizable resource alongside cost, time, and power. Three case studies validate the approach. The first recovers Pareto-optimal implementations across heterogeneous application scenarios. The second confirms that Confidence functions as a continuously tunable design knob rather than a post-hoc diagnostic. The third demonstrates that improving a single block's implementation set automatically propagates to the global Pareto front, without modifying the co-design diagram.
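
To make the Confidence resource concrete, the sketch below computes it as the inverse of success probability, as the abstract defines it, and filters a handful of design options down to a Pareto front over cost and Confidence. Treating both as quantities to minimize is an assumption based on the abstract calling Confidence a resource; the design names and numbers are invented for illustration.

```python
def confidence(success_probability):
    """Confidence as defined in the abstract: the inverse of success probability."""
    if not 0.0 < success_probability <= 1.0:
        raise ValueError("success probability must be in (0, 1]")
    return 1.0 / success_probability


def pareto_front(options):
    """Names of options not dominated in (cost, confidence); lower is better for both."""
    front = []
    for name, cost, conf in options:
        dominated = any(
            c2 <= cost and f2 <= conf and (c2 < cost or f2 < conf)
            for _, c2, f2 in options
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical implementations: (name, cost, confidence).
designs = [
    ("A", 10.0, confidence(0.5)),
    ("B", 20.0, confidence(0.8)),
    ("C", 25.0, confidence(0.5)),
    ("D", 40.0, confidence(1.0)),
]
front = pareto_front(designs)
```

Design C costs more than A for the same success probability, so it drops out; A, B, and D remain as distinct trade-offs between cost and the chance of success, rather than being collapsed into one point estimate.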

2606.04847 2026-06-04 cs.CV cs.CL cs.LG

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

MusaCoder: 在摩尔线程GPU上通过全栈训练实现原生GPU内核生成

Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

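AI summary: Proposes MusaCoder, a full-stack training framework that combines progressive data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback reinforcement learning to generate efficient native GPU kernels on CUDA and MUSA backends; the 9B model matches frontier closed-source models and the 27B model sets a new state of the art.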

详情
AI中文摘要

原生GPU内核生成将高级张量程序转换为可执行、高效的低级代码。现有大型语言模型(LLMs)在此任务上表现不佳,而基于执行的强化学习面临稀疏奖励、奖励黑客和训练不稳定性问题。我们提出MusaCoder,一个用于在CUDA和MUSA后端上生成原生GPU内核的全栈训练框架。MusaCoder结合了渐进式内核导向数据合成、保持多样性的拒绝微调以及通过MooreEval(一个分布式验证器和奖励环境)进行的执行反馈强化学习(RL)。为了稳定RL,MusaCoder引入了PrimeEcho用于首轮锚定的多轮奖励、Buffered Dynamic Retry用于从全失败的困难样本中恢复信号,以及MirrorPop用于离策略序列过滤。在KernelBench和MUSA移植变体上的实验表明,MusaCoder在正确性和经验加速方面均优于强开源和专有基线,其中9B模型匹配或超越前沿闭源模型,27B模型建立了新的最优结果。这些结果不仅证明了全栈执行反馈训练对原生内核生成的有效性,也展示了摩尔线程GPU支持完整LLM后训练栈的能力,为新兴加速器上的大模型训练和优化提供了实用基础。

英文摘要

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

2606.04846 2026-06-04 cs.CL

Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas

K-12教育中的大语言模型:与州课程标准和学生角色的对齐

Lisa Korver, Tomo Lazovich, Sherief Reda

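AI summary: Develops an LLM-based pipeline to evaluate how well different LLMs align with U.S. state history curriculum standards, and uses controlled user-persona experiments to analyze model sensitivity to geography, grade level, gender, and race. The study finds that models can adjust their presentation of historical topics, though the shifts may stem from states' perceived political leanings; that they adapt well to grade level while showing low sensitivity to race and gender; and that misalignment between LLMs and curriculum standards poses potential risks to student learning.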

详情
AI中文摘要

随着大语言模型(LLM)在教育环境中日益普及,它们引发了关于其使用伦理的重要问题。公开可用的在线聊天机器人能力和准确性迅速提升,导致更广泛的使用,包括寻求作业帮助的学生。这使得考虑这些模型是否与教育标准对齐变得至关重要。由于美国的课程标准由各州制定,它们在所需内容、重点和叙事焦点上存在显著差异。在这项工作中,我们开发了一个基于LLM的流程,以识别各州美国历史课程的变化,并评估不同LLM反映这些州特定课程差异的程度。此外,我们进行控制实验,通过陈述用户属性(如地理位置、年级、性别和种族)来改变用户角色,以评估LLM响应对用户特征的敏感性。我们发现,虽然模型能够调整历史主题的呈现,但这些转变可能源于各州的政治倾向,并不一定反映实际的课程内容。此外,模型成功适应学生的年级水平,而对种族或性别的敏感性最小,这表明它们能够以有限的人口统计偏差对用户角色进行有用的适应。总之,这些发现突显了开放访问LLM聊天机器人可能因与州课程标准错位而导致学生学习成果受损的潜在风险,并强调了需要更强大的对齐技术。

英文摘要

As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability and accuracy, leading to more widespread use, including among students looking for help with their homework. This makes it crucial to consider whether these models are aligned with educational standards. Because curriculum standards in the United States are set at the state level, they differ significantly in required content, emphasis, and narrative focus. In this work, we develop an LLM-based pipeline to identify variations in U.S. History curricula across states and evaluate the extent to which different LLMs reflect these state-specific curricular differences. In addition, we conduct controlled experiments that vary user personas by stating user attributes such as geographic location, grade level, gender, and race to evaluate the sensitivity of LLM responses to user characteristics. We find that while models are able to adjust their presentation of historical topics, these shifts may come from the perceived political leanings of states and do not necessarily reflect actual curriculum content. Additionally, models successfully adapt to a student's grade level while showing minimal sensitivity to race or gender, suggesting they are capable of useful adaptation to student personas with limited demographic bias. Together, these findings highlight the risk that open access to LLM chatbots may harm student learning outcomes through misalignment with state curriculum standards, and they underscore the need for more robust alignment techniques.

2606.04845 2026-06-04 stat.ML cs.LG math.ST stat.CO stat.TH

Bayesian learning for the stochastic shortest path problem

随机最短路径问题的贝叶斯学习

Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

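AI summary: For the stochastic shortest path problem, proposes a Bayesian framework that constructs the posterior over the optimal action-value function Q* directly through Bellman's optimality equations and addresses the unidentifiability caused by relaxing the likelihood, enabling uncertainty quantification and data-efficient learning.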

详情
Comments
50 pages, 19 figures
AI中文摘要

序列决策问题通常被建模为马尔可夫决策过程(MDP)。我们关注随机最短路径(SSP)问题,这是一个具有吸收终止状态的无限水平无折扣MDP。我们开发了一个贝叶斯框架,通过与决策任务的交互来学习最优决策策略。具体来说,我们学习最优动作价值函数$Q^*$,但与许多现有的贝叶斯方法不同,我们不依赖于不现实的建模假设和临时近似。我们的方法是通过贝尔曼最优方程直接构建$Q^*$的后验信念。对于确定性奖励,我们将后验描述为具有流形密度的分布。为了简化推理,我们放松了似然,使得勒贝格密度存在。但这样做的代价是产生不可识别性问题。具体来说,放松后的后验可能在不当决策规则上有显著质量,而精确后验则不会。我们还计算了$Q^*$的表格参数化、高斯似然放松和高斯先验下最优动作选择的精确后验概率,这在基准测试研究中很有用。对深海基准测试变体的数值研究验证了我们的发现。我们证明了我们的框架能够忠实地量化不确定性,并且与其他基于时间差分的贝叶斯方法相比,数据效率更高。最后,我们对未来工作提出了建议。

英文摘要

Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task. Specifically, we learn the optimal action-value function $Q^*$, but unlike many existing Bayesian approaches, we do not rely on unrealistic modelling assumptions and ad-hoc approximations. Our approach is to directly construct the posterior beliefs for $Q^*$ through Bellman's optimality equations. For deterministic rewards, we characterise the posterior as a distribution with a manifold density. To facilitate simpler inference, we relax the likelihood so that a Lebesgue density exists. The flip side is that this relaxation creates unidentifiability issues. Specifically, the relaxed posterior can have significant mass on improper decision rules, while the exact posterior will not. We also calculate the exact posterior probabilities for optimal action selections for the tabular parametrisation of $Q^*$, a Gaussian likelihood relaxation and a Gaussian prior, which is useful in benchmarking studies. Numerical studies on variants of the Deep Sea benchmark verify our findings. We demonstrate that our framework faithfully quantifies uncertainty and, compared to other temporal-difference-based Bayesian methodologies, is more data efficient. We conclude with recommendations for future work.
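
For readers unfamiliar with the object being inferred, the sketch below computes $Q^*$ for a tiny undiscounted SSP by iterating Bellman's optimality equation, $Q^*(s,a) = r(s,a) + \max_{a'} Q^*(s',a')$ with value zero at the terminal state. This is a plain point-estimate computation on a made-up deterministic chain, not the Bayesian posterior construction of the paper; the states, actions, and rewards are assumptions for illustration.

```python
TERMINAL = 2

# Hypothetical deterministic SSP: (state, action) -> (next_state, reward).
transitions = {
    (0, "step"): (1, -1.0),
    (0, "jump"): (2, -3.0),
    (1, "step"): (2, -1.0),
}


def q_star(transitions, terminal, sweeps=50):
    """Iterate the undiscounted Bellman optimality equation to a fixed point."""
    q = {sa: 0.0 for sa in transitions}
    for _ in range(sweeps):
        for (s, a), (s_next, r) in transitions.items():
            if s_next == terminal:
                future = 0.0  # absorbing terminal state has zero value
            else:
                future = max(v for (s2, _), v in q.items() if s2 == s_next)
            q[(s, a)] = r + future
    return q


q = q_star(transitions, TERMINAL)
best_action_at_0 = max(("step", "jump"), key=lambda a: q[(0, a)])
```

Taking two unit-cost steps beats the single expensive jump, so the greedy action at state 0 is `step`. A Bayesian treatment would place a distribution over these Q-values instead of a single table.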

2606.04844 2026-06-04 cs.SD cs.CV

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

漂移增强评分:文本驱动的零样本音频-语言分类噪声鲁棒性

Tu Vo, Sheir Zaheer, Chan Y. Park

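AI summary: Proposes Drift-Augmented Scoring (DAS), which uses noise-conditioned text prompts to predict the drift direction of audio embeddings and adds a bonus score to each class, substantially improving zero-shot audio classification accuracy and mAP under noise without requiring gradients or test-time batching.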

详情
AI中文摘要

对比音频-语言模型(如CLAP)能够实现零样本音频分类:通过将音频嵌入与文本提示嵌入匹配来标记声音,无需标注音频。但在声学噪声下,这种匹配会失效,标准基准测试中,0 dB SNR时准确率和mAP下降12-30个百分点。我们提出漂移增强评分(DAS),这是一种添加到余弦评分中的每类小奖励。当噪声音频嵌入向该类噪声条件文本提示预测的方向漂移时,奖励该类。该奖励仅从文本推导,计算一次并缓存,推理时每类只需一个内积,无需梯度或测试时批处理。在LAION CLAP骨干网络上,我们将DAS与Acevedo等人同期方法的四种变体在UrbanSound8K和完整FSD50K评估集上进行比较,将每个片段与城市声学场景噪声混合,覆盖一系列SNR。DAS在所有测试条件下均提升了指标:UrbanSound8K上准确率提高+2.60至+5.75个百分点,FSD50K上mAP提高+1.50至+1.74个百分点。

英文摘要

Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.
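
One way to read the scoring rule in this abstract is sketched below: each class gets a cached drift direction derived from text alone, and inference adds one inner product per class to the cosine score. The exact construction is an assumption. Here the drift direction is taken to be the normalized difference between a noise-conditioned and a clean text embedding, and `alpha` is a hypothetical bonus weight; the paper may define both differently. The 3-dimensional vectors are toy values, not CLAP embeddings.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def drift_direction(clean_text_emb, noisy_text_emb):
    """Assumed drift direction: unit vector from the clean to the noisy prompt embedding."""
    return normalize([n - c for n, c in zip(noisy_text_emb, clean_text_emb)])


def das_scores(audio_emb, clean_text_embs, drift_dirs, alpha=0.1):
    """Cosine score plus a per-class bonus of one extra inner product."""
    a = normalize(audio_emb)
    return [dot(a, normalize(t)) + alpha * dot(a, d)
            for t, d in zip(clean_text_embs, drift_dirs)]


# Two classes whose clean prompts are equally close to the noisy audio clip.
clean = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
noisy = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
drifts = [drift_direction(c, n) for c, n in zip(clean, noisy)]  # cached once

audio = [1.0, 1.0, 1.0]
plain = das_scores(audio, clean, drifts, alpha=0.0)
scores = das_scores(audio, clean, drifts)
```

Plain cosine scoring ties the two classes, while the bonus favours class 0, whose predicted drift matches the direction in which the audio embedding has moved. The drift vectors depend only on text, so they can be precomputed as the abstract describes.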

2606.04836 2026-06-04 cs.CV

3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks

注意力任务期间自闭症谱系障碍筛查的3D时间分析

Inam Qadir, Elizabeth B Varghese, Dena Al-Thani, Marwa Qaraqe

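AI summary: Proposes a DECA-based 3D temporal analysis framework that extracts head pose and facial expression features and uses LSTM/GRU classifiers for ASD screening in VR-CPT tasks, with multimodal fusion reaching 84.6% accuracy.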

详情
AI中文摘要

对学龄儿童进行准确的自闭症谱系障碍(ASD)筛查对于识别早期可能遗漏的病例以及及时干预以支持社交、认知和学业发展至关重要。当前的ASD筛查依赖于主观评估和2D分析方法,无法捕捉ASD行为特征的空间位移模式。本研究提出了一种新颖的3D时间分析框架,该框架基于DECA(详细表情捕捉与动画)这一3D建模框架,用于提取全面的头部姿态参数(包括平移分量$T_x, T_y, T_z$)以及独立于姿态变化的面部表情。基于LSTM和GRU的时间分类器在从39名7-12岁参与者(19名ASD,20名TD)在虚拟现实-持续性能测试任务中收集的视频数据提取的3D特征上进行训练。GRU模型表现出优越性能,其中3D头部姿态特征达到83.9%的准确率,3D面部特征达到81.4%的准确率,分别比2D基线方法高出10.7%和7.5%。此外,通过PCA降维的3D头部姿态和面部特征的多模态融合达到了84.6%的最高准确率,优于单模态方法。这项工作为针对学龄人群ASD识别中当前诊断局限性的客观、自动化筛查工具奠定了基础。

英文摘要

Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis methods that fail to capture spatial displacement patterns characteristic of ASD behaviors. In this study, a novel 3D temporal analysis framework is presented, built on top of DECA (Detailed Expression Capture and Animation), a 3D modeling framework, to extract comprehensive head pose parameters (including translational components $T_x, T_y, T_z$) and facial expressions independent of pose variations. LSTM and GRU-based temporal classifiers were trained on the extracted 3D features from video data collected from 39 participants (19 ASD, 20 TD) aged 7-12 years during Virtual Reality-Continuous Performance Test tasks. The GRU-based models demonstrated superior performance, with 3D head pose features achieving 83.9% accuracy and 3D facial features reaching 81.4% accuracy, outperforming 2D baseline approaches by 10.7% and 7.5%, respectively. Furthermore, multimodal fusion of 3D head pose and facial features with PCA-based dimensionality reduction achieved the highest accuracy of 84.6%, outperforming unimodal approaches. This work establishes a foundation for objective, automated screening tools addressing current diagnostic limitations in ASD identification for school-age populations.

2606.04834 2026-06-04 cs.LG

Prediction Under Imperfect Compression: A Theory of Approximate MDL

非完美压缩下的预测:近似最小描述长度理论

Qian Li, Xinyu Mao, Shang-Hua Teng, Guangxu Yang

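AI summary: Studies the conditions under which the Minimum Description Length (MDL) principle still guarantees reliable sequential prediction under approximate optimization, proving robustness to additive slack and characterizing the necessity of regularization.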

详情
Comments
26 pages
AI中文摘要

最小描述长度(MDL)通过优化总描述长度 $L(\mathrm{model})+L(\mathrm{data} \ | \ \mathrm{model})$ 形式化了奥卡姆剃刀原则。对于序列预测,MDL 方法反复选择在观测前缀上具有最小目标得分的模型进行下一步预测。经典 MDL 预测理论表明,精确优化 MDL 目标确实提供了支持可靠预测的强压缩保证。然而,实际机器学习通常只能通过近似优化目标函数来找到模型。为弥合这一差距,本文解决了以下基本问题:在何种近似和正则化形式下,近似 MDL 仍能保证可靠的序列预测?本文提供了一个原则性的刻画。我们证明,对于平衡 MDL 目标更一般形式 $λ\cdot L(\mathrm{model})+L(\mathrm{data} \ | \ \mathrm{model})$ 的任意加性松弛 $C$,当 $λ\ge1$ 时,累积期望平方预测误差有限。$λ>1$ 的情况通过亲和-望远镜论证证明,而边界情况 $λ=1$ 通过基于精确静态 MDL 边界的似然比停止论证证明。我们的结果表明,经典 MDL 正则化对任意固定加性优化误差保持鲁棒。此外,我们建立了近似 MDL 框架刻画的尖锐性:当 $0<λ<1$ 时,在可估测度的通用类中,过拟合可能导致无限累积期望误差,因此需要强形式的模型复杂度正则化。另外,在乘性近似下,模型选择可能在每个正则化区域 $λ>0$ 中失败,因此加性近似既充分又必要。

英文摘要

Minimum Description Length (MDL) formalizes the principle of Occam's razor by optimizing the total description length $L(\mathrm{model})+L(\mathrm{data} \mid \mathrm{model})$. For sequential prediction, the MDL method repeatedly selects a model that minimizes the objective on the observed prefix and uses it for the next-step prediction. Classical MDL prediction theory shows that exact optimization of the MDL objective provides a strong compression guarantee that supports reliable prediction. In practice, however, machine learning can usually only optimize the objective approximately. To bridge this gap, this paper addresses the following fundamental question: under what forms of approximation and regularization does approximate MDL still guarantee reliable sequential prediction? This work offers a principled characterization. We prove that for any approximation with additive slack $C$ of the more general balanced MDL objective $\lambda \cdot L(\mathrm{model})+L(\mathrm{data} \mid \mathrm{model})$, the cumulative expected squared prediction error is finite for all $\lambda \ge 1$. The case $\lambda>1$ is proved by an affinity-telescoping argument, while the boundary case $\lambda=1$ is proved by a likelihood-ratio stopping argument based on exact static MDL bounds. Our results establish that classical MDL regularization remains robust to any fixed additive optimization error. Furthermore, we establish that this characterization is sharp: when $0<\lambda<1$, overfitting can incur infinite cumulative expected error in the universal class of estimable measures, and hence a strong form of model-complexity regularization is necessary. In addition, under multiplicative approximation, model selection may fail in every regularized regime $\lambda>0$; thus, additive approximation is both sufficient and essential.
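
The balanced objective and the notion of additive slack can be made concrete with a small sketch. A model counts as an acceptable approximate minimizer if its score is within $C$ of the best score. The candidate models and their code lengths below are invented for illustration, and the function names are not from the paper.

```python
def objective(model_bits, data_bits, lam=1.0):
    """Balanced MDL objective: lam * L(model) + L(data | model)."""
    return lam * model_bits + data_bits


def approximate_minimizers(candidates, lam=1.0, slack=0.0):
    """Names of models whose score is within an additive slack of the minimum."""
    scores = {name: objective(m, d, lam) for name, (m, d) in candidates.items()}
    best = min(scores.values())
    return sorted(name for name, s in scores.items() if s <= best + slack)


# Hypothetical code lengths in bits: (L(model), L(data | model)).
candidates = {
    "small": (10.0, 100.0),
    "medium": (40.0, 65.0),
    "large": (200.0, 5.0),
}

exact = approximate_minimizers(candidates, lam=1.0, slack=0.0)
with_slack = approximate_minimizers(candidates, lam=1.0, slack=5.0)
weak_penalty = approximate_minimizers(candidates, lam=0.1, slack=0.0)
```

With $\lambda = 1$ the medium model wins, and a slack of 5 bits also admits the small one. Dropping $\lambda$ to 0.1 makes the large, data-fitting model the minimizer, which gives an informal picture of why the regime $0<\lambda<1$ is where the abstract locates the overfitting failure.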

2606.04829 2026-06-04 cs.RO

M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking

M3imic: 学习用于多模态运动模仿的通用全身控制器

Zuxing Lu, Ziang Zheng, Yao Lyu, Jingyu Liu, Feihong Zhang, Song Lu, Xin Yuan, Changyin Sun, Xingxing Zuo, Shengbo Eben Li

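AI summary: Proposes the M3imic framework, which uses modality-specific encoders to map heterogeneous motion reference modalities (robot joint angles, human pose trajectories, and end-effector poses) into a shared latent space and trains a single policy with large-scale reinforcement learning, achieving sim-to-real transfer without modality-specific retraining.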

详情
AI中文摘要

构建通用全身控制器对于使人形机器人在广泛的下游任务(包括 locomotion 和 loco-manipulation)中具备多样化的运动能力至关重要。不同任务依赖于不同的运动参考模态:locomotion 主要依赖于协调的机器人关节轨迹,而 manipulation 则需要精确的末端执行器轨迹跟踪。现有方法常常忽视密集的机器人关节角度与稀疏的末端执行器位姿之间的表示不匹配问题。为解决这一问题,我们提出了 Multi-Modal Mimic (M3imic),一个通用的多模态全身控制框架,它使用模态特定编码器将异构运动参考模态(包括机器人关节角度、人体姿态轨迹和末端执行器位姿)映射到共享潜在空间,从而统一这些模态。利用模拟器中的大规模强化学习,我们训练了一个单一策略,该策略能够在无需模态特定重训练的情况下实现跨多种运动参考模态的 sim-to-real 迁移。在 Unitree G1 机器人上进行了广泛的仿真和真实世界实验以评估所提出的框架。在仿真中,该策略在未见过的测试数据集上达到了 98.42% 的峰值成功率,展示了其卓越的泛化能力。代码可在 https://github.com/Renforce-Dynamics/MultiModalWBC 获取。

英文摘要

Building a general-purpose whole-body controller is essential for enabling diverse motion capabilities in humanoid robots across a wide range of downstream tasks, including locomotion and loco-manipulation. Different tasks rely on distinct motion reference modalities: locomotion primarily depends on coordinated robot joint trajectories, whereas manipulation requires precise end-effector trajectory tracking. Existing methods often overlook the representational mismatch between dense robot joint angles and sparse end-effector poses. To address this, we propose Multi-Modal Mimic (M3imic), a versatile multi-modal whole-body control framework that unifies heterogeneous motion reference modalities, including robot joint angles, human pose trajectories, and end-effector poses, using modality-specific encoders to map them into a shared latent space. Leveraging large-scale reinforcement learning in simulation, we train a single policy that achieves sim-to-real transfer across multiple motion reference modalities without modality-specific retraining. Extensive simulation and real-world experiments on the Unitree G1 robot are conducted to evaluate the proposed framework. In simulation, the policy achieves a peak success rate of 98.42% on an unseen test dataset, demonstrating its exceptional generalization capability. The code is available at https://github.com/Renforce-Dynamics/MultiModalWBC

2606.04828 2026-06-04 cs.CL

A French Corpus Annotated for Multiword Expressions with Adverbial Function

一个标注了副词性多词表达的法语语料库

Eric Laporte, Takuya Nakamura, Stavroula Voyatzi

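AI summary: Presents a French corpus annotated for multiword expressions with adverbial function, intended to support research on information retrieval, information extraction, and deep and shallow syntactic parsing.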

详情
Journal ref
Language Resources and Evaluation Conference (LREC), Linguistic Annotation Workshop, 2008, Marrakech, Morocco, pp.48-51
AI中文摘要

本文介绍了一个标注了副词性多词表达(MWEs)的法语语料库。该语料库旨在用于信息检索和信息抽取以及深层和浅层句法分析的研究。我们界定了所标注的多词表达的类型,描述了用于标注的资源和方法,并简要评论了结果。标注后的语料库可在 http://infolingu.univ-mlv.fr/ 根据 LGPLLR 许可证获取。

英文摘要

This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. The corpus is designed for research on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit the kinds of MWEs that we annotated, describe the resources and methods used for the annotation, and briefly comment on the results. The annotated corpus is available at http://infolingu.univ-mlv.fr/ under the LGPLLR license.

2606.04825 2026-06-04 cs.RO

HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning

HapTile: 用于接触丰富模仿学习的触觉感知视觉-触觉-语言-动作数据集

Amirhosein Alian, Yongqiang Zhao, Shiyi Gu, Xuyang Zhang, Zhuo Chen, Christopher E. Mower, Haitham Bou-Ammar, Shan Luo

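AI summary: Proposes the HapTile dataset, which integrates fingertip tactile feedback and operator haptic sensing to provide joint vision-tactile-language-action data for contact-rich robot manipulation tasks, and validates its usefulness for policy learning.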

详情
AI中文摘要

尽管触觉感知对于可靠操作至关重要,但大多数现有的视觉-语言-动作(VLA)数据集仍然仅基于视觉,而那些确实包含触觉信息的数据集通常缺乏任务多样性、语言条件和动作轨迹的联合组合。此外,现有的遥操作流程很少为操作员提供触觉反馈,尽管触觉反馈在演示质量和操作稳定性中具有公认的作用。在这项工作中,我们提出了HapTile,一个接触基础的视觉触觉操作数据集,它通过嵌入两个层次的物理交互感知超越了仅视觉轨迹数据集:机器人末端执行器上的指尖触觉反馈,以及遥操作侧的触觉感知演示。数据收集平台将触觉反馈直接集成到遥操作控制器中,使操作员能够实时感知接触交互。它基于一个标准且可复现的机器人系统构建,该系统配备了定制设计的指尖触觉传感器。该数据集涵盖了日常操作任务,包括拾取与放置、折叠、按压、堆叠以及其他常规活动,这些任务涉及广泛的接触丰富技能。每个任务都配有语言指令,用于根据操作目标对策略进行条件化,同时还有同步的视觉触觉观察和动作轨迹。此外,我们使用两个基线模型对接触丰富的策略学习进行了基准研究,以评估所提出的接触基础数据集的有效性。数据集和更多详细信息可在我们的网站上获取:haptile-dataset.github.io。

英文摘要

Despite the importance of tactile sensing for reliable manipulation, most existing Vision-Language-Action (VLA) datasets remain vision-only, and those that do incorporate tactile information typically lack the joint combination of task diversity, language conditioning, and action trajectories. Furthermore, existing teleoperation pipelines rarely provide haptic feedback to the operator, despite its established role in demonstration quality and manipulation stability. In this work, we present HapTile, a contact-grounded visuotactile manipulation dataset that advances beyond vision-only trajectory datasets by embedding physical interaction sensing at two levels: fingertip tactile feedback at the robot end-effector, and haptic-informed demonstrations at the teleoperator side. The data collection platform integrates haptic feedback directly into the teleoperation controller, enabling the operator to perceive contact interactions in real time. It is built around a standard and reproducible robotic system equipped with custom-designed fingertip tactile sensors. The dataset comprises everyday manipulation tasks spanning a broad range of contact-rich skills, including pick-and-place, folding, pressing, stacking, and other routine activities. Each task is paired with language instructions that condition the policy on the manipulation objective, together with synchronized visuotactile observations and action trajectories. In addition, we provide a benchmarking study on contact-rich policy learning using two baseline models to evaluate the effectiveness of the proposed contact-grounded dataset. The dataset and additional details are available on our website: haptile-dataset.github.io.