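arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.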

2606.04907 2026-06-04 cs.RO

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

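Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu

AI summary: Proposes WAM-Nav, an asymmetric diffusion Transformer that jointly learns action generation and latent visual prediction. A shared Diffusion Transformer jointly diffuses long-horizon actions and short-horizon visual predictions, and a dual-stream contextual conditioning mechanism plus a goal alignment module let a single policy handle Image-Goal, Point-Goal, and No-Goal navigation. On the ClutterScenes and InternScenes benchmarks, success rates improve by 15.7% on Image-Goal and 3.3% on Point-Goal navigation, and real-world deployment reaches an average 85% task success rate.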

Abstract

Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.

2606.04906 2026-06-04 cs.CL cs.AI

'Your AI Text is not Mine': Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions

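Nils Dycke, Marina Sakharova, Nico Daheim, Iryna Gurevych

AI summary: AI-generated text detection lacks a unified definition of harmful use. This paper systematically defines several notions of AI-generated text, builds AITDNA, a benchmark of human-AI collaborative texts annotated with detailed generation processes, and evaluates a range of detectors under the different notions.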

2606.04903 2026-06-04 cs.LO cs.AI cs.MA cs.PL

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

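Aaron Sterling

AI summary: Proposes the Agentic Redux architecture, uses a typed λ-calculus to prove that its execution semantics are correct on suitable domains and that its decisions are auditable, and introduces an ontology-first approach to agent design.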

2606.04898 2026-06-04 cs.CV

CDPM-Align: Multi-Scale Guidance-Aligned Diffusion Pretraining for Robust Few-Shot Anatomical Landmark Detection

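Roberto Di Via, Irina Voiculescu, Francesca Odone, Vito Paolo Pastore

AI summary: Proposes CDPM-Align, a multi-scale guidance-aligned conditional diffusion pre-training method that learns robust representations through generative pre-training and improves the accuracy and uncertainty estimation of anatomical landmark detection in few-shot, low-annotation settings.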

Comments: Accepted at MICCAI 2026

Abstract

Anatomical landmark detection is a fundamental task in medical image analysis, supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, which also requires reliable and robust predictions. Despite its clinical relevance, the impact of representation learning in this context is still underexplored. In this work, we introduce CDPM-Align, a multi-scale guidance-aligned conditional diffusion pre-training method for anatomical landmark detection. Our experimental setup focuses on few-image and few-annotation regimes. Specifically, we employ three popular heterogeneous small-scale benchmark datasets for representation learning via conditional generative pre-training. Furthermore, we consider low-annotation scenarios for the downstream task of landmark detection, with 10 and 25 annotated images, reflecting realistic trade-offs between clinical effort and resource constraints for annotations. Our results confirm that generative pre-training enables the model to learn a robust representation. This improves both accuracy and uncertainty estimation on the downstream tasks, advancing towards safe and efficient clinical deployment.

2606.04891 2026-06-04 cs.CV cs.CG

Hierarchical Space Partition for Surface Reconstruction

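Minjie Tang, Xiangfei Li

AI summary: To address detail lost in point-cloud reconstruction because of LiDAR scanning limitations, proposes a hierarchical space partition based on plane classification and priority-driven growth, then generates a watertight polygonal mesh via min-cut optimization.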

DOI: 10.1109/3DV69130.2026.00027
Journal ref: in 2026 International Conference on 3D Vision (3DV), Vancouver, BC, Canada, 2026, pp. 207-216
Comments: Published in 2026 International Conference on 3D Vision (3DV)

Abstract

Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g. range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To address this, we propose a plane assembling strategy that effectively recovers missing details while maintaining model compactness. We classify all the planes extracted from the scene into three categories: highly visible, barely visible, and invisible. The invisible planes, which are recovered by scene structure analysis, indicate the missing details. The three types of planes correspond to three growth priorities. Each plane grows according to its priority level, so the space is partitioned progressively; we call this the hierarchical partition. Subsequently, we generate a watertight polygonal mesh from the partition via a min-cut-based optimization. Finally, comparisons on public datasets show the effectiveness and superiority of our method over mainstream approaches. The project page is available at https://hsr-3dv.github.io/.

2606.04889 2026-06-04 cs.CL

GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

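Tej Deep Pala, Vernon Toh, Soujanya Poria

AI summary: Uniform advantage assignment in reinforcement learning dilutes the gradient signal. GRAIL reweights advantages per token using gradient-activation saliency, improving reasoning alignment without process-level supervision and raising average accuracy by 3.60% across several models.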

Abstract

Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
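
The token-wise reweighting idea can be illustrated with a minimal sketch. It assumes per-token saliency scores are already available (the abstract does not say how gradient-activation saliency is computed), and the function name `reweight_advantages` and the mean-1 normalisation are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def reweight_advantages(seq_advantage, saliency, eps=1e-8):
    """Spread one sequence-level advantage over tokens in proportion to saliency.

    `saliency` is a hypothetical stand-in for per-token gradient-activation
    saliency. Weights are normalised to mean 1, so the average token-level
    advantage stays (almost exactly) equal to the sequence-level advantage.
    """
    s = np.asarray(saliency, dtype=float)
    weights = s / (s.mean() + eps)
    return seq_advantage * weights

# Two low-saliency filler tokens followed by two high-saliency reasoning tokens.
token_adv = reweight_advantages(0.5, [0.1, 0.1, 0.8, 1.0])
```

With uniform saliency the function reduces to broadcasting the sequence-level advantage, which is the baseline behaviour the abstract criticises.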

2606.04888 2026-06-04 cs.CV

HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition Scenarios

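Yinxiang Yu, Maoxiang Chu, Qi Niu, Guanghu Liu, Wei Xu, Haotian Wang, Zhi Chen, Yutian Zhu, Yuelong Fan, Guanghao Liao

AI summary: To handle multi-source distribution discrepancies, diverse anomaly morphologies, and scleral specular reflection, proposes HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network that combines dual-stream DINOv3 feature fusion with class-specific multi-expert decoding for pixel-level segmentation of Vessels, Yellow and Black Spots, and Blood Spots, reaching a mean Dice of 72.11% and a mean IoU of 58.44% on ML-SASD-Mix.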

Comments: Submitted to Medical Image Analysis; 47 pages, 31 figures, 14 tables

Abstract

Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at https://github.com/FX-CMX/HD-DinoMoE.
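
For readers unfamiliar with the two reported metrics, the sketch below computes Dice and Intersection-over-Union for a single pair of binary masks. The helper name `dice_and_iou` and the convention of scoring two empty masks as 1.0 are assumptions made for illustration; the paper's evaluation code may differ.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2.0 * inter / total if total > 0 else 1.0
    iou = inter / union if union > 0 else 1.0
    return float(dice), float(iou)

pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)  # 2 shared pixels, 3 + 3 predicted/true, union of 4
```

For a single mask pair the two metrics are tied by IoU = Dice / (2 - Dice); averages over classes or images need not satisfy that identity.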

2606.04884 2026-06-04 cs.RO

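D$^3$-MoE: Dual Disentangled Diffusion Mixture-of-Experts for Style-Controllable End-to-End Autonomous Driving

Renju Feng, Rukang Wang, Ning Xi, Jianguo Yu, Liping Lu, Pan Zhou, Duanfeng Chu

AI summary: Proposes the D$^3$-MoE framework, which disentangles along a behavioral axis (diffusion-based generation decoupled from selection) and a physical axis (separate longitudinal and lateral experts) to achieve style-controllable end-to-end autonomous driving, with state-of-the-art planning performance on the NAVSIM benchmark.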

Comments: 8 pages, 6 figures

Abstract

Traditional end-to-end autonomous driving frameworks frequently suffer from the "style-averaging" dilemma when trained on high-variance human demonstrations, yielding homogenized, style-uncontrollable, and even kinematically unsafe policies. To overcome this limitation, we present D$^3$-MoE (Dual Disentangled Diffusion Mixture-of-Experts), which disentangles trajectory modeling along two complementary axes. On the behavioral axis, generation is decoupled from selection: a style-conditioned diffusion process synthesizes multi-style candidate trajectories in parallel within a single scene, allowing a downstream module to select the optimal trajectory based on user preference or an evaluation score. On the physical axis, decoupled longitudinal and lateral routers activate their respective experts during inference time, trained without manual labels using self-supervised targets from orthogonal ground-truth kinematics. These activated experts, architected as Diffusion Transformers (DiT) and equipped with style-conditioned AdaLN and asymmetric lateral-fusion cross-attention, independently predict their corresponding physical state before being reassembled into a unified, kinematically coherent trajectory. Extensive evaluations on the challenging NAVSIM benchmark demonstrate that D$^3$-MoE achieves state-of-the-art planning performance, reaching 88.2 PDMS and 84.3 EPDMS by default. Moreover, our Best-of-Three ensemble strategy effectively broadens the multi-modal solution space, raising performance to 91.3 PDMS and 87.5 EPDMS. Both quantitative and qualitative analyses jointly confirm the framework's advantages in planning quality and style controllability.

2606.04883 2026-06-04 cs.CL cs.LO

Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

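Kári Rögnvaldsson, Chenhao Sun, Jasper Dekoninck, Martin Vechev

AI summary: Proposes an action routing agent with a data plane and a control plane. It observes failed trajectories, estimates success probability and cost, and dynamically decides whether to keep proving or re-decompose, cutting cost by 25.8% on average on a subset of PutnamBench while preserving performance.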

Abstract

Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by $25.8\%$ over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
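
The control-plane decision described above can be sketched as a comparison of expected cost per success. The abstract does not state the actual decision rule, so everything here is an assumption: the `Option` type, the `route` function, and the rule itself (repeated independent attempts with success probability p and cost c take 1/p tries on average, hence expected cost c / p).

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    p_success: float  # estimated probability that the next attempt succeeds
    cost: float       # estimated cost of one more attempt

def expected_cost_per_success(opt):
    """Expected spend until success under independent, identical attempts."""
    if opt.p_success <= 0:
        return float("inf")
    return opt.cost / opt.p_success

def route(continue_opt, restart_opt):
    """Pick the cheaper action in expectation (illustrative rule only)."""
    return min((continue_opt, restart_opt), key=expected_cost_per_success).name

# Continuing is cheap per attempt but unlikely to succeed; restarting costs more up front.
decision = route(Option("continue", 0.05, 1.0), Option("restart", 0.30, 3.0))
```

Here continuing has an expected cost of 20 units per success against 10 for restarting, so the sketch chooses to restart from a new breakdown.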

2606.04881 2026-06-04 cs.CV cs.AI

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

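Yueying Zou, Peipei Li, Qianrui Teng, Dianyan Xu, Zekun Li

AI summary: Proposes DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. It preserves appearance diversity through stochastic diffusion decoding and age-conditioned semantic modulation, and introduces the Cross-age Identity Relation Regulator (CARR), which guides denoising at inference time to improve sequence-level ordinal reliability.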

Comments: 11 pages, 10 figures, 5 tables

Abstract

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

2606.04880 2026-06-04 cs.CV

MAOAM: Unified Object and Material Selection with Vision-Language Models

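Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer

AI summary: Proposes MAOAM, a framework that pairs a vision-language model with a segmentation head to select objects and materials precisely from text or click interaction, and designs a data generation pipeline to address the lack of material selection data.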

DOI: 10.1145/3799902.3811186
Comments: Accepted to SIGGRAPH 2026 Conference. Project page: https://jadenpark0.github.io/project_pages/maoam/

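IRIS-GAN: Staged Specialist Detection of Deepfake Faces

Jaume M. Trenchs, Veronica Sanz

AI summary: Proposes IRIS-GAN, a specialist fake-face detector trained through staged exposure to different GAN families. It achieves high detection rates under cross-generator shift, and Grad-CAM analysis reveals generator-dependent spatial response patterns.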

Comments: 20 pages, 10 figures

Abstract

We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.

2606.04860 2026-06-04 cs.LG cs.AI

Learning Empirically Admissible Neural Heuristics for Combinatorial Search

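Siddharth Sahay

AI summary: For combinatorial search problems, proposes a validation-calibrated framework that combines an admissible Bellman operator with an asymmetric loss to train empirically admissible neural heuristics, substantially reducing search node expansions while preserving path optimality in practice.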

Comments: 13 pages, 3 figures, 2 tables, 1 algorithm

Abstract

Finding optimal solution paths for combinatorial puzzles like the Rubik's Cube, sliding tile puzzles, and Lights Out remains a classical challenge in artificial intelligence. Heuristic search algorithms, such as A*, guarantee path optimality only when using an admissible heuristic, that is, one that never overestimates the true remaining cost-to-go. Deep reinforcement learning (RL) methods like DeepCubeA train deep neural networks to approximate cost-to-go heuristics. However, standard mean-squared error (MSE) training regularly yields overestimations, violating admissibility and compromising solution optimality. In this paper, we introduce a generalizable framework for learning validation-calibrated admissible neural heuristics. We train a value network using an underestimating Admissible Bellman Operator combined with an Asymmetric Loss function to penalize overestimation. To account for residual neural function approximation errors, we propose a post-hoc calibration safety offset computed over validation scrambles. We demonstrate that our calibrated neural heuristics achieve no observed admissibility violations under the evaluation protocol and preserve path optimality in practice while reducing search node expansions by up to 83.0% on a 2 by 2 Rubik's Cube, 19.9% on a 3 by 3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.
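
The two ingredients named in the abstract, an asymmetric loss and a post-hoc safety offset, can be sketched as follows. The function names, the weight `over_weight=10.0`, and the choice of the largest validation overestimate as the offset are illustrative assumptions; the abstract does not give the paper's exact definitions.

```python
import numpy as np

def asymmetric_loss(pred, target, over_weight=10.0):
    """Squared error that weights overestimation (pred > target) more heavily."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    w = np.where(err > 0, over_weight, 1.0)
    return float(np.mean(w * err ** 2))

def safety_offset(val_pred, val_true):
    """Largest overestimate observed on validation states (0 if there is none)."""
    gap = np.asarray(val_pred, dtype=float) - np.asarray(val_true, dtype=float)
    return float(max(gap.max(), 0.0))

def calibrated(pred, offset):
    """Shift predictions down by the offset and clip at zero cost-to-go."""
    return np.maximum(np.asarray(pred, dtype=float) - offset, 0.0)

val_pred = [3.2, 5.0, 7.4]   # hypothetical network outputs
val_true = [3.0, 6.0, 7.0]   # hypothetical true costs-to-go
offset = safety_offset(val_pred, val_true)
h = calibrated(val_pred, offset)
```

Such an offset only removes the overestimates seen on the validation set, which matches the abstract's wording of "no observed admissibility violations" rather than a formal guarantee.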

2606.04857 2026-06-04 cs.LG

Rethinking Incompleteness: Formalizing Protocol Divergence and Train-Once Learning for Robust IMVC

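Haolu Liu, Xiyue Wang, Xuanting Xie, Liangjian Wen, Zhao Kang

AI summary: Standard IMVC evaluation overlooks that the missing rate alone cannot characterize data incompleteness. The paper proposes formal measures of protocol divergence and designs the CRAFT architecture, which uses per-sample independence and mask-aware fusion so that a single training run generalizes to many missing patterns.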

Abstract

Standard IMVC evaluation retrains separate models for different missing-data configurations. We show that this paradigm obscures a fundamental vulnerability: missing rate alone is insufficient to characterize data incompleteness. Specifically, we show that protocols with identical nominal missing rates can differ by up to $50\times$ in their proportion of fully observed samples, inducing drastically different learning regimes. We formalize this phenomenon as incompleteness divergence, providing measures that capture structural disparities across missing-data protocols. We further prove that for a broad class of reconstruction-based objectives, learning becomes structurally ill-posed when the proportion of complete samples falls below a critical threshold, leading to near-random performance. To bypass this theoretical bound, we propose CRAFT (Complete-data Robust Attention-masked Fusion Transformer). CRAFT shifts the burden of robustness from the loss function to the architecture via two key properties: (i) per-sample independence, which removes reliance on complete-sample co-occurrence, and (ii) mask-aware variable-length fusion, which aggregates only observed views through attention masking. This design allows a single model, trained once on complete data, to generalize to diverse missing patterns at inference time without retraining. Extensive experiments on seven benchmarks show that CRAFT matches or outperforms per-configuration baselines while reducing training overhead by $8.8\times$, demonstrating that robustness to missing data can be achieved as an inherent architectural property. Code (CRAFT) and our imvc-audit toolkit are available at https://anonymous.4open.science/r/CRAFT-BF80/ and https://anonymous.4open.science/r/imvc-audit-8263/.
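
Mask-aware fusion over a variable number of observed views can be illustrated with a small NumPy sketch. It assumes single-query dot-product attention and the hypothetical name `masked_attention_fusion`; CRAFT's real architecture is not specified in the abstract beyond "attention masking".

```python
import numpy as np

def masked_attention_fusion(views, mask, query):
    """Fuse per-view embeddings, attending only to observed views.

    views: (V, D) embeddings, one row per view (rows of missing views are ignored)
    mask:  (V,) booleans, True where the view is observed
    query: (D,) query vector (in a real model this would be learned)
    """
    views = np.asarray(views, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one view must be observed")
    views = np.where(mask[:, None], views, 0.0)   # neutralise missing rows
    scores = views @ np.asarray(query, dtype=float) / np.sqrt(views.shape[1])
    scores = np.where(mask, scores, -np.inf)      # missing views get zero weight
    scores = scores - scores[mask].max()          # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ views, weights

views = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
fused, weights = masked_attention_fusion(views, [True, True, False], query=[1.0, 0.0])
```

Because each sample is fused using only its own mask, the same function handles any missing pattern at inference time without retraining, which is the property the abstract emphasises.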

2606.04853 2026-06-04 cs.RO

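Teaching Robots to Say 'I Don't Know': SENTINEL for Uncertainty-Aware SLAM

Abhishek S, Badrikanath Praharaj, Sreeram MV

AI summary: Proposes SENTINEL, a framework that uses geometric scan statistics and cross-modal depth consistency to give low-cost 2D LiDAR a training-free, label-free reliability score, rejecting corrupted scans and falling back to wheel odometry to prevent silent SLAM corruption.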

Comments: 6 pages, 10 figures, 3 tables. Accepted at the Uncertainty in Open-World Robotics Workshop, held in conjunction with the International Conference on Robotics and Automation (ICRA 2026)

Abstract

Low-cost 2D LiDARs lack the intensity channel that higher-end sensors use to diagnose measurement failures, yet they are widely used on educational and budget robotics platforms. We present SENTINEL, a training-free, label-free reliability estimation framework that gives range-only LiDAR an effective diagnostic signal. SENTINEL combines geometry-based scan statistics with cross-modal depth consistency between LiDAR and an RGB-D camera to compute a per-scan reliability score between 0 and 1. When the score falls below a threshold, corrupted scans are rejected and the robot falls back to calibrated wheel odometry, preventing silent SLAM corruption. We evaluate SENTINEL on a GEFIER R1 four-wheel skid-steer robot equipped with an RPLidar A2M12 and an Intel RealSense D435i in a 185 cm by 245 cm arena containing controlled transparent and reflective failure elements on a central obstacle. Spatial reliability maps across five surface conditions, including glass, mirror, shiny paper, and a mixed mirror and shiny-paper condition, show clear separation between clean and failure cases, allowing affected regions to be identified as reject or noise. Because these failure modes are absent in simulation, validation is performed entirely on real hardware.
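
The gating behaviour described above (score each scan, reject it below a threshold, fall back to odometry) can be sketched in a few lines. The abstract does not say how the two cues are combined or which threshold is used, so the product rule, the 0.5 threshold, and the names `reliability` and `select_pose_source` are all assumptions for illustration.

```python
def reliability(geometry_score, depth_consistency):
    """Combine two sub-scores in [0, 1] into one per-scan score (assumed product rule)."""
    for s in (geometry_score, depth_consistency):
        if not 0.0 <= s <= 1.0:
            raise ValueError("sub-scores must lie in [0, 1]")
    return geometry_score * depth_consistency

def select_pose_source(score, threshold=0.5):
    """Use the LiDAR scan when it looks reliable, otherwise fall back to wheel odometry."""
    return "lidar_scan" if score >= threshold else "wheel_odometry"

clean = select_pose_source(reliability(0.9, 0.95))   # LiDAR and depth camera agree
mirror = select_pose_source(reliability(0.9, 0.2))   # LiDAR and depth camera disagree
```

A product makes the score drop whenever either cue is weak; a minimum or a weighted sum would be equally plausible stand-ins.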

2606.04850 2026-06-04 cs.LG cs.AI cs.AR math.OC

Uncertainty-Aware End-to-End Co-Design of Neural Network Processors: From Training and Mapping to Fabrication

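Yuyang Du, Yujun Huang, Gioele Zardini

AI summary: Proposes a unified framework grounded in monotone co-design theory that achieves end-to-end co-design of neural network processors through four interoperable design blocks (network training, chip mapping, wafer-level fabrication, and compute resource allocation), and introduces Confidence (the inverse of success probability) as an explicit, optimizable resource for handling uncertainty.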

Comments: 14 pages

Abstract

Designing a neural network processor is an end-to-end co-design problem: network architecture and training budget determine the inference workload; hardware mapping decisions determine chip area, latency, and energy; and these characteristics govern fabrication yield and manufacturing cost. In practice, these decisions are made in separate stages, and existing co-design methodologies are tightly coupled to specific algorithms, making it difficult to improve one component without reworking the entire pipeline. This paper presents a unified framework, grounded in monotone co-design theory, that composes four interoperable design blocks spanning network training, chip mapping, wafer-level fabrication, and compute resource allocation. Each block exposes only a functionality-resource interface to the rest of the system, so any block can be refined without structural changes elsewhere. A central contribution is the treatment of uncertainty: rather than collapsing stochastic outcomes into point estimates, the framework introduces Confidence, the inverse of success probability, as an explicit and optimizable resource alongside cost, time, and power. Three case studies validate the approach. The first recovers Pareto-optimal implementations across heterogeneous application scenarios. The second confirms that Confidence functions as a continuously tunable design knob rather than a post-hoc diagnostic. The third demonstrates that improving a single block's implementation set automatically propagates to the global Pareto front, without modifying the co-design diagram.
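
To make the idea of Confidence as a resource concrete, the sketch below treats each candidate design as a tuple of resources to be minimised, with Confidence computed as the inverse of success probability as stated in the abstract, and filters the Pareto-optimal designs. The helper names and the example numbers are hypothetical, and a brute-force filter stands in for the paper's co-design solver.

```python
def confidence(p_success):
    """Confidence as defined in the abstract: the inverse of success probability."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("success probability must lie in (0, 1]")
    return 1.0 / p_success

def dominates(a, b):
    """a dominates b if it is no worse in every resource and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the designs that no other design dominates (all resources minimised)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical designs as (cost, Confidence) pairs.
designs = [
    (10.0, confidence(0.5)),   # cheap, succeeds half the time
    (12.0, confidence(0.8)),   # slightly dearer, much more likely to succeed
    (15.0, confidence(0.6)),   # dearer and less reliable than the second design
]
front = pareto_front(designs)
```

The third design is dropped because the second is both cheaper and more likely to succeed, while the first two remain as a genuine cost-versus-Confidence trade-off.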

2606.04847 2026-06-04 cs.CV cs.CL cs.LG

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

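Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

AI summary: Proposes MusaCoder, a full-stack training framework that combines progressive data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback reinforcement learning to generate efficient native GPU kernels on CUDA and MUSA backends; the 9B model matches frontier closed-source models and the 27B model sets a new state of the art.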

Abstract

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

2606.04846 2026-06-04 cs.CL

Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas

Lisa Korver, Tomo Lazovich, Sherief Reda

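AI summary: This study develops an LLM-based pipeline to evaluate how well different LLMs align with state-level U.S. History curriculum standards, and uses controlled user-persona experiments to analyze model sensitivity to geography, grade level, gender, and race. It finds that models adjust their presentation of historical topics, though possibly because of states' perceived political leanings rather than curriculum content; that they adapt well to grade level but show little sensitivity to race or gender; and that misalignment between LLMs and curriculum standards poses a potential risk to student learning.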

Abstract

As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability and accuracy, leading to more widespread use, including among students looking for help with their homework. This makes it crucial to consider whether these models are aligned with educational standards. Because curriculum standards in the United States are set at the state level, they differ significantly in required content, emphasis, and narrative focus. In this work, we develop an LLM-based pipeline to identify variations in U.S. History curricula across states and evaluate the extent to which different LLMs reflect these state-specific curricular differences. In addition, we conduct controlled experiments that vary user personas by stating user attributes such as geographic location, grade level, gender, and race to evaluate the sensitivity of LLM responses to user characteristics. We find that while models are able to adjust their presentation of historical topics, these shifts may come from the perceived political leanings of states and do not necessarily reflect actual curriculum content. Additionally, models successfully adapt to a student's grade level while showing minimal sensitivity to race or gender, suggesting they are capable of useful adaptation to student personas with limited demographic bias. Together, these findings highlight the potential risk that open access to LLM chatbots harms student learning outcomes through misalignment with state curriculum standards, and they underscore the need for more robust alignment techniques.

2606.04845 2026-06-04 stat.ML cs.LG math.ST stat.CO stat.TH

Bayesian learning for the stochastic shortest path problem

Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

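AI summary: For the stochastic shortest path problem, proposes a Bayesian framework that constructs the posterior distribution of the optimal action-value function $Q^*$ directly through Bellman's optimality equations, and identifies the unidentifiability issues that relaxing the likelihood introduces, enabling uncertainty quantification and data-efficient learning.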

Comments: 50 pages, 19 figures

Abstract

Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task. Specifically, we learn the optimal action-value function $Q^*$, but unlike many existing Bayesian approaches, we do not rely on unrealistic modelling assumptions and ad-hoc approximations. Our approach is to directly construct the posterior beliefs for $Q^*$ through Bellman's optimality equations. For deterministic rewards, we characterise the posterior as a distribution with a manifold density. To facilitate simpler inference, we relax the likelihood so that a Lebesgue density exists. The flip side is that this relaxation creates unidentifiability issues. Specifically, the relaxed posterior can have significant mass on improper decision rules, while the exact posterior will not. We also calculate the exact posterior probabilities for optimal action selections for the tabular parametrisation of $Q^*$, a Gaussian likelihood relaxation and a Gaussian prior, which is useful in benchmarking studies. Numerical studies on variants of the Deep Sea benchmark verify our findings. We demonstrate that our framework faithfully quantifies uncertainty and, compared to other temporal-difference-based Bayesian methodologies, is more data efficient. We conclude with recommendations for future work.
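
The Bellman optimality equations that the abstract builds its posterior on can be illustrated on a toy problem. The following is a minimal sketch, assuming a hypothetical three-state SSP (`TRANSITIONS`, `TERMINAL`, and `solve_q_star` are illustrative names, not from the paper) with a deterministic reward of -1 per step. It computes the fixed point $Q^*(s,a) = \sum_{s'} p(s' \mid s,a)\,[r + \max_{a'} Q^*(s',a')]$ by plain value iteration, with $Q^*$ fixed at zero in the terminal state. It does not reproduce the paper's Bayesian inference.

```python
# Hypothetical toy SSP: states 0 and 1 are non-terminal, state 2 is absorbing.
# TRANSITIONS[state][action] = list of (probability, next_state, reward).
TERMINAL = 2
TRANSITIONS = {
    0: {"step": [(1.0, 1, -1.0)],
        "jump": [(0.25, TERMINAL, -1.0), (0.75, 0, -1.0)]},
    1: {"step": [(1.0, TERMINAL, -1.0)],
        "jump": [(0.5, TERMINAL, -1.0), (0.5, 0, -1.0)]},
}


def solve_q_star(transitions, terminal, tol=1e-10, max_sweeps=10_000):
    """Iterate the undiscounted Bellman optimality operator until it stops changing."""
    q = {s: {a: 0.0 for a in actions} for s, actions in transitions.items()}

    def state_value(s):
        # The terminal state is absorbing and cost-free, so its value is zero.
        return 0.0 if s == terminal else max(q[s].values())

    for _ in range(max_sweeps):
        largest_change = 0.0
        for s, actions in transitions.items():
            for a, outcomes in actions.items():
                updated = sum(p * (r + state_value(s2)) for p, s2, r in outcomes)
                largest_change = max(largest_change, abs(updated - q[s][a]))
                q[s][a] = updated
        if largest_change < tol:
            break
    return q


q_star = solve_q_star(TRANSITIONS, TERMINAL)
# Greedy decision rule induced by Q*: the best action in each non-terminal state.
greedy = {s: max(acts, key=acts.get) for s, acts in q_star.items()}
```

In this toy problem the greedy rule takes `step` in both states. A Bayesian treatment such as the one described in the abstract would place a distribution over `q_star` rather than return a single table.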

2606.04844 2026-06-04 cs.SD cs.CV

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

Tu Vo, Sheir Zaheer, Chan Y. Park

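AI summary: Proposes Drift Augmented Scoring (DAS), which uses noise-conditioned text prompts to predict the direction in which audio embeddings drift and adds a per-class bonus to the score. Without gradients or test-time batching, it improves noisy zero-shot audio classification by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.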

Abstract

Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.
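
The scoring rule described above (cosine score plus a cached, text-derived per-class bonus) can be sketched as follows. The abstract does not give the exact formula, so this sketch makes loudly labeled assumptions: the drift direction for a class is taken to be the normalized difference between the mean noise-conditioned prompt embedding and the clean prompt embedding, the bonus is `alpha` times one inner product with the audio embedding, and the three-dimensional vectors stand in for real CLAP embeddings. All function names and the value of `alpha` are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def drift_direction(clean_emb, noisy_embs):
    """Assumed drift: mean noise-conditioned text embedding minus clean text embedding."""
    clean = normalize(clean_emb)
    noisy = [normalize(e) for e in noisy_embs]
    mean_noisy = [sum(e[i] for e in noisy) / len(noisy) for i in range(len(clean))]
    return normalize([m - c for m, c in zip(mean_noisy, clean)])


def build_cache(class_prompts):
    """Computed once from text only: label -> (clean text embedding, drift direction)."""
    return {label: (normalize(clean), drift_direction(clean, noisy))
            for label, (clean, noisy) in class_prompts.items()}


def das_scores(audio_emb, cache, alpha=0.5):
    """Cosine score plus a per-class bonus costing one extra inner product per class."""
    a = normalize(audio_emb)
    return {label: dot(a, text) + alpha * dot(a, drift)
            for label, (text, drift) in cache.items()}


# Toy stand-ins for text embeddings: label -> (clean prompt, [noise-conditioned prompts]).
class_prompts = {
    "siren": ([1.0, 0.0, 0.0], [[0.7, 0.3, 0.6]]),
    "dog_bark": ([0.0, 1.0, 0.0], [[0.0, 0.9, -0.4]]),
}
cache = build_cache(class_prompts)

noisy_siren_audio = [0.5, 0.55, 0.6]  # toy embedding of a siren clip mixed with noise
cosine_only = das_scores(noisy_siren_audio, cache, alpha=0.0)
with_bonus = das_scores(noisy_siren_audio, cache, alpha=0.5)
cosine_pick = max(cosine_only, key=cosine_only.get)
das_pick = max(with_bonus, key=with_bonus.get)
```

In this toy setup the plain cosine score picks `dog_bark` for the noisy siren clip, while the bonus term moves the decision to `siren` because the audio embedding has drifted in the direction predicted for that class. The cache is built once, so inference adds only the extra `dot(a, drift)` per class.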

2606.04836 2026-06-04 cs.CV

3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks

Inam Qadir, Elizabeth B Varghese, Dena Al-Thani, Marwa Qaraqe

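AI summary: Proposes a DECA-based 3D temporal analysis framework that extracts head pose and facial expression features and uses LSTM/GRU classifiers for ASD screening during VR-CPT tasks; multimodal fusion reaches 84.6% accuracy.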

Abstract

Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis methods that fail to capture spatial displacement patterns characteristic of ASD behaviors. In this study, a novel 3D temporal analysis framework is presented, built on top of DECA (Detailed Expression Capture and Animation), a 3D modeling framework, to extract comprehensive head pose parameters (including translational components $T_x, T_y, T_z$) and facial expressions independent of pose variations. LSTM and GRU-based temporal classifiers were trained on the extracted 3D features from video data collected from 39 participants (19 ASD, 20 TD) aged 7-12 years during Virtual Reality-Continuous Performance Test tasks. The GRU-based models demonstrated superior performance, with 3D head pose features achieving 83.9% accuracy and 3D facial features reaching 81.4% accuracy, outperforming 2D baseline approaches by 10.7% and 7.5%, respectively. Furthermore, multimodal fusion of 3D head pose and facial features with PCA-based dimensionality reduction achieved the highest accuracy of 84.6%, outperforming unimodal approaches. This work establishes a foundation for objective, automated screening tools addressing current diagnostic limitations in ASD identification for school-age populations.

2606.04834 2026-06-04 cs.LG

Prediction Under Imperfect Compression: A Theory of Approximate MDL

Qian Li, Xinyu Mao, Shang-Hua Teng, Guangxu Yang

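AI summary: Studies the conditions under which the Minimum Description Length (MDL) principle still guarantees reliable sequential prediction under approximate optimization, proving robustness to additive slack and characterizing when regularization is necessary.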

Comments: 26 pages

Abstract

Minimum Description Length (MDL) formalizes the principle of Occam's razor by optimizing the total description length: $L(\mathrm{model}) + L(\mathrm{data} \mid \mathrm{model})$. For sequential prediction, the MDL method repeatedly selects a model with a minimum objective score on the observed prefix for the next-step prediction. Classical MDL prediction theory shows that exact optimization of the MDL objective indeed provides a strong compression guarantee that supports reliable prediction. However, practical machine learning can usually find models only by approximately optimizing the objective function. To bridge this gap, this paper addresses the following fundamental question: Under what forms of approximation and regularization does approximate MDL still guarantee reliable sequential prediction? This work offers a principled characterization. We prove that for any approximation with additive slack $C$ of the more general balanced MDL objective $\lambda \cdot L(\mathrm{model}) + L(\mathrm{data} \mid \mathrm{model})$, the cumulative expected squared prediction error is finite for all $\lambda \ge 1$. The case $\lambda > 1$ is proved by an affinity-telescoping argument, while the boundary case $\lambda = 1$ is proved by a likelihood-ratio stopping argument based on exact static MDL bounds. Our results establish that classical MDL regularization remains robust to any fixed additive optimization error. Furthermore, we establish that our characterization of the approximate MDL framework is sharp: when $0 < \lambda < 1$, overfitting can incur infinite cumulative expected error in the universal class of estimable measures, and hence a strong form of model-complexity regularization is necessary. In addition, under multiplicative approximation, model selection may fail in every regularized regime $\lambda > 0$, and thus additive approximation is both sufficient and essential.
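
The balanced objective and the notion of additive slack can be made concrete with a small sketch. It assumes a hypothetical finite class of Bernoulli models with hand-picked codeword lengths (the names `models`, `balanced_mdl`, and `approx_mdl_candidates` are illustrative, not from the paper) and measures $L(\mathrm{data} \mid \mathrm{model})$ as the negative log2-likelihood. It only illustrates which models an optimizer with additive slack $C$ may return; it does not reproduce the paper's setting or proofs.

```python
import math


def data_code_length(bits, p):
    """L(data | model) in bits for a Bernoulli(p) model: negative log2-likelihood."""
    return -sum(math.log2(p if b == 1 else 1.0 - p) for b in bits)


def balanced_mdl(bits, model, lam=1.0):
    """Balanced objective: lam * L(model) + L(data | model)."""
    p, model_bits = model
    return lam * model_bits + data_code_length(bits, p)


def approx_mdl_candidates(bits, models, lam=1.0, slack=0.0):
    """Models an approximate optimizer may return: objective within `slack` of the minimum."""
    scores = {name: balanced_mdl(bits, m, lam) for name, m in models.items()}
    best = min(scores.values())
    return {name for name, s in scores.items() if s <= best + slack}


# Hypothetical model class: name -> (Bernoulli parameter, L(model) in bits).
models = {"fair": (0.5, 1.0), "biased": (0.75, 2.0), "very_biased": (0.9, 4.0)}
prefix = [1, 1, 0, 1, 1, 1, 0, 1]  # observed prefix: six ones, two zeros

exact = approx_mdl_candidates(prefix, models, lam=1.0, slack=0.0)
within_one_bit = approx_mdl_candidates(prefix, models, lam=1.0, slack=1.0)
# The next-step prediction is the selected model's probability of observing a 1.
next_step_prob = models[next(iter(exact))][0]
```

With exact optimization only `biased` is selected on this prefix; allowing one bit of additive slack also admits `fair`, which is the kind of imperfect selection whose effect on prediction error the abstract analyzes.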

2606.04829 2026-06-04 cs.RO

M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking

Zuxing Lu, Ziang Zheng, Yao Lyu, Jingyu Liu, Feihong Zhang, Song Lu, Xin Yuan, Changyin Sun, Xingxing Zuo, Shengbo Eben Li

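AI summary: Proposes the M3imic framework, which maps heterogeneous motion reference modalities (robot joint angles, human pose trajectories, end-effector poses) into a shared latent space with modality-specific encoders and trains a single policy with large-scale reinforcement learning, achieving sim-to-real transfer without modality-specific retraining.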

Abstract

Building a general-purpose whole-body controller is essential for enabling diverse motion capabilities in humanoid robots across a wide range of downstream tasks, including locomotion and loco-manipulation. Different tasks rely on distinct motion reference modalities: locomotion primarily depends on coordinated robot joint trajectories, whereas manipulation requires precise end-effector trajectory tracking. Existing methods often overlook the representational mismatch between dense robot joint angles and sparse end-effector poses. To address this, we propose Multi-Modal Mimic (M3imic), a versatile multi-modal whole-body control framework that unifies heterogeneous motion reference modalities, including robot joint angles, human pose trajectories, and end-effector poses, using modality-specific encoders to map them into a shared latent space. Leveraging large-scale reinforcement learning in the simulator, we train a single policy that achieves sim-to-real transfer across multiple motion reference modalities without modality-specific retraining. Extensive simulation and real-world experiments on the Unitree G1 robot are conducted to evaluate the proposed framework. In simulation, the policy achieves a peak success rate of 98.42% on an unseen test dataset, demonstrating its exceptional generalization capability. The code is available at https://github.com/Renforce-Dynamics/MultiModalWBC

2606.04828 2026-06-04 cs.CL

A French Corpus Annotated for Multiword Expressions with Adverbial Function

Eric Laporte, Takuya Nakamura, Stavroula Voyatzi

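AI summary: Introduces a French corpus annotated for multiword expressions with adverbial function, intended to support research on information retrieval, information extraction, and deep and shallow syntactic parsing.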

2606.04825 2026-06-04 cs.RO

HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning

Amirhosein Alian, Yongqiang Zhao, Shiyi Gu, Xuyang Zhang, Zhuo Chen, Christopher E. Mower, Haitham Bou-Ammar, Shan Luo

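AI summary: Proposes the HapTile dataset, which integrates fingertip tactile feedback and operator-side haptic feedback to provide joint vision-tactile-language-action data for contact-rich robot manipulation tasks, and evaluates its effectiveness for policy learning with two baseline models.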

Abstract

Despite the importance of tactile sensing for reliable manipulation, most existing Vision-Language-Action (VLA) datasets remain vision-only, and those that do incorporate tactile information typically lack the joint combination of task diversity, language conditioning, and action trajectories. Furthermore, existing teleoperation pipelines rarely provide haptic feedback to the operator, despite its established role in demonstration quality and manipulation stability. In this work, we present HapTile, a contact-grounded visuotactile manipulation dataset that advances beyond vision-only trajectory datasets by embedding physical interaction sensing at two levels: fingertip tactile feedback at the robot end-effector, and haptic-informed demonstrations at the teleoperator side. The data collection platform integrates haptic feedback directly into the teleoperation controller, enabling the operator to perceive contact interactions in real time. It is built around a standard and reproducible robotic system equipped with custom-designed fingertip tactile sensors. The dataset comprises everyday manipulation tasks spanning a broad range of contact-rich skills, including pick-and-place, folding, pressing, stacking, and other routine activities. Each task is paired with language instructions that condition the policy on the manipulation objective, together with synchronized visuotactile observations and action trajectories. In addition, we provide a benchmarking study on contact-rich policy learning using two baseline models to evaluate the effectiveness of the proposed contact-grounded dataset. The dataset and additional details are available on our website: haptile-dataset.github.io.
