arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.18250 2026-06-17 cs.CV New submission

Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion

Nils Morbitzer, Jonathan Evers, Artem Savkin, Thomas Stauner, Nassir Navab, Federico Tombari, Stefano Gasperini

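Affiliations: Technical University of Munich (TUM); Munich Center for Machine Learning (MCML)

AI summary: Proposes FR3D, a world model that disentangles the scene's 3D evolution from the agent's trajectory and uses a teacher-student distillation strategy to achieve geometrically consistent future dynamic 3D reconstruction from monocular observations, with zero-shot generalization.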

Comments ICML 2026. Project page: https://fr3d-wm.github.io

Abstract

Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d-wm.github.io.

2606.18249 2026-06-17 cs.CV New submission

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

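Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; Qwen Team, Alibaba Inc.

AI summary: Proposes UniAR, a framework that bridges visual understanding and generation with a single discrete visual tokenizer, using parallel bitwise prediction and diffusion decoding to reach state-of-the-art image generation and editing while remaining competitive on multimodal understanding.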

Comments Accepted by ICML2026. Project page https://sharelab-sii.github.io/uniar-web

Abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
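To make the idea of lookup-free bitwise quantization concrete, here is a minimal sketch. The abstract does not specify UniAR's scheme, so the sign-based binarization below is an assumption, and the names `bitwise_quantize` and `effective_vocab_size` are hypothetical. It only illustrates why a code made of independent bits needs no codebook lookup and why the effective vocabulary grows as 2 to the number of bits.

```python
from typing import List, Tuple


def bitwise_quantize(features: List[float]) -> Tuple[List[int], int]:
    """Quantize each feature channel to one bit by its sign.

    Assumption: sign-based binarization stands in for the paper's scheme,
    which the abstract does not describe in detail.
    """
    bits = [1 if x > 0.0 else 0 for x in features]
    # Pack the bits into one integer token id; bit i carries weight 2**i.
    token_id = sum(b << i for i, b in enumerate(bits))
    return bits, token_id


def effective_vocab_size(num_bits: int) -> int:
    """Number of distinct codes representable with num_bits binary channels."""
    return 2 ** num_bits


bits, token_id = bitwise_quantize([0.7, -1.2, 0.1, -0.3])
print(bits, token_id, effective_vocab_size(len(bits)))
```

Because every bit is decided independently, adding one channel doubles the effective vocabulary without storing or searching an embedding table.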

2606.18247 2026-06-17 cs.RO cs.AI New submission

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

Mingtong Zhang, Dhruv Shah

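Affiliations: Princeton University

AI summary: Proposes VERITAS, a framework that uses a pre-trained generalist robot policy as the generator and a gradient-free visual verifier to evaluate actions at inference time, enabling inference-time policy steering without additional training as well as offline policy improvement.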

Comments Website: https://veritas-improvement.github.io

Abstract

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism for practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for inference-time steering and self-improvement of generalist robot policies. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human intervention. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.
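The generator-verifier pattern can be sketched as best-of-N selection. This is an illustration under stated assumptions, not the paper's procedure: the abstract does not say how verifier scores are turned into a choice, and `steer`, `toy_generator`, and `toy_verifier` are hypothetical names. The verifier here only scores candidates, so no gradients are involved.

```python
import random
from typing import Callable, List, Sequence

Action = Sequence[float]


def steer(
    generate: Callable[[], Action],
    verify: Callable[[Action], float],
    num_candidates: int = 8,
) -> Action:
    """Sample candidate actions from the generator and keep the best-scored one."""
    candidates: List[Action] = [generate() for _ in range(num_candidates)]
    # The verifier is used purely as a scoring function; nothing is trained.
    return max(candidates, key=verify)


# Toy stand-ins: a random 2-D action generator and a distance-based verifier.
rng = random.Random(0)
target = (0.5, -0.2)


def toy_generator() -> Action:
    return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))


def toy_verifier(action: Action) -> float:
    # Higher is better: negative squared distance to a hypothetical target action.
    return -sum((a - t) ** 2 for a, t in zip(action, target))


chosen = steer(toy_generator, toy_verifier, num_candidates=16)
print(chosen)
```

In this framing, the rollouts that the verifier accepts could also be stored as training data, which is the role the abstract gives to verified rollouts in offline policy improvement.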

2606.18246 2026-06-17 cs.CL New submission

Variable-Width Transformers

Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

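Affiliations: MIT; MIT-IBM Watson AI Lab

AI summary: Proposes a variable-width Transformer architecture that is narrow in the middle and wide at both ends, using a parameter-free residual resizing mechanism for nonuniform capacity allocation; it outperforms uniform-width baselines on language modeling loss, FLOPs, and KV cache cost.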

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.

2606.18243 2026-06-17 cs.CV cs.GR cs.RO New submission

MOCHI: Motion Enhancement of Collaborative Human-object Interactions

Jiye Lee, Yonghun Choi, Jungdam Won

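Affiliations: Department of Computer Science and Engineering, Seoul National University

AI summary: To address contact misalignment between hands and objects, motion jitter, and missing finger detail in multi-human object interaction data, proposes MOCHI, a two-stage framework that first generates physically plausible hand grasps through optimization and then refines full-body motion with a diffusion-based model, effectively enhancing noisy data.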

Comments SIGGRAPH 2026 Journal (ACM TOG); Project page: https://jiyewise.github.io/projects/MOCHI/

Abstract

Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI, where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose; these optimized grasps are then extended into complete hand-object interaction sequences. Subsequently, the full-body motion of all participants is refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show the robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.

2606.18242 2026-06-17 cs.CV New submission

EventDrive: Event Cameras for Vision-Language Driving Intelligence

Dongyue Lu, Rong Li, Ao Liang, Lingdong Kong, Wei Yin, Lai Xing Ng, Benoit R. Cottereau, Camille Simon Chane, Wei Tsang Ooi

Affiliations: NUS; HKUST(GZ); Horizon Robotics; A*STAR, I2R; IPAL, CNRS IRL 2955, Singapore; University Toulouse, CNRS, CerCo, Toulouse, France; ETIS UMR 8051, CY Cergy Paris University, ENSEA, CNRS, France

AI summary: Proposes the EventDrive benchmark and model suite, which fuses event streams with RGB frames through a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module, improving driving reasoning across perception, understanding, prediction, and planning.

Comments CVPR2026, 34 pages, 15 figures, 15 tables, project page: https://dylanorange.github.io/projects/eventdrive

Abstract

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

2606.18239 2026-06-17 cs.RO New submission

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen

Affiliations: Shanghai AI Laboratory; Xi’an Jiaotong University; Institute for AI Industry Research (AIR), Tsinghua University; Tsinghua University; University of Science and Technology of China; Shanghai Jiao Tong University; Zhejiang University

AI summary: Proposes the EBench benchmark, which diagnoses generalist mobile manipulation models along 5 capability dimensions and 4 generalization dimensions, revealing that models with similar success rates differ markedly in capability.

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.

2606.18237 2026-06-17 cs.CL cs.AI cs.LG New submission

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

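Affiliations: School of Computer Science, Carnegie Mellon University; Datadog

AI summary: Proposes the ReproRepo framework, which uses GitHub issues as a supervision signal to evaluate reproducibility on 1,149 papers, and finds that Codex with GPT-5.5 surfaces semantically related reproduction issues for about 90% of the papers.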

Abstract

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

2606.18235 2026-06-17 cs.AI New submission

EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation

Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma, Guosheng Lin, Hao Wang

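Affiliations: HKUST(GZ); Nanyang Technological University; Xi’an Jiaotong University; XGRIDS

AI summary: Proposes a self-evolving zero-shot object-goal navigation framework that extracts rules from past trajectories, retrieves them with an upper-confidence-bound strategy, and adds a memory-guided preflection module, reducing ineffective exploration and improving success rate by 10.1%.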

Abstract

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models, but they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on the upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
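One way to read "retrieval based on upper confidence bound" is sketched below. The scoring formula is an assumption: the abstract says only that retrieval balances semantic relevance and historical success, so the weighted sum, the classic UCB1-style exploration bonus, and the names `Rule`, `ucb_score`, and `retrieve` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str
    relevance: float  # semantic similarity to the current query, in [0, 1]
    successes: int = 0  # episodes where using the rule led to success
    uses: int = 0  # episodes where the rule was retrieved


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0, w: float = 0.5) -> float:
    """Relevance plus empirical success plus a UCB1-style exploration bonus."""
    if rule.uses == 0:
        return math.inf  # untested rules are tried first
    mean_success = rule.successes / rule.uses
    bonus = c * math.sqrt(math.log(total_uses) / rule.uses)
    return w * rule.relevance + (1.0 - w) * mean_success + bonus


def retrieve(rules: List[Rule], k: int = 1) -> List[Rule]:
    """Return the k rules with the highest UCB score."""
    total_uses = sum(r.uses for r in rules)
    return sorted(rules, key=lambda r: ucb_score(r, total_uses), reverse=True)[:k]


memory = [
    Rule("check the kitchen for mugs", relevance=0.9, successes=1, uses=10),
    Rule("scan open areas before corridors", relevance=0.2, successes=9, uses=10),
    Rule("look near sofas for remotes", relevance=0.5),
]
print([r.text for r in retrieve(memory, k=2)])
```

The bonus term shrinks as a rule is used more often, so rarely tried rules still get retrieved occasionally while rules that are relevant but keep failing lose rank over time.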

2606.18231 2026-06-17 cs.CV cs.LG cs.RO New submission

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

Rishit Dagli, Donglai Xiang, Vismay Modi, Xuning Yang, Gavriel State, David I. W. Levin, Maria Shugrina

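Affiliations: NVIDIA

AI summary: Proposes AdaVoMP, which uses a sparse adaptive voxel structure and an autoregressive Transformer encoder-decoder to predict high-resolution, spatially varying Young's modulus, Poisson's ratio, and density for 3D objects, at 16^3 times the resolution of prior art and with higher accuracy.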

Comments Project Page and hi-res paper: https://research.nvidia.com/labs/sil/projects/adavomp/. ICML 2026

Abstract

Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($ν$), and density ($ρ$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially varying ($E$, $ν$, $ρ$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state of the art. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

2606.18222 2026-06-17 cs.CL cs.DL New submission

Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

Joy Bose

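Affiliations: Independent Researcher, Bangalore, India

AI summary: Builds a parallel commentary corpus of Indian philosophy with over 125,000 records, about 8,500 of which align root verses across 18 commentators, and analyzes argumentative style and conceptual relationships through stylometry and a constrained large language model pipeline, revealing patterns of disagreement between schools.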

Comments 12 pages, 1 figure. Open Source Code available at https://github.com/joyboseroy/darshana-graph and dataset at https://huggingface.co/datasets/joyboseroy/darshana-graph

Abstract

We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.

2606.18216 2026-06-17 cs.CL New submission

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

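Affiliations: NVIDIA

AI summary: Proposes ZPPO, which injects teacher knowledge into the prompt rather than the policy gradient to address teacher-gradient dominance in small-model knowledge distillation and policy drift in reinforcement learning, outperforming existing methods across model scales.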

Comments Project page: https://byungkwanlee.github.io/ZPPO-page/

Abstract

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
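The prompt replay buffer described in the abstract can be sketched as a bounded FIFO queue with a graduation rule. This is a minimal illustration, not the authors' implementation: the class name `PromptReplayBuffer`, its methods, and the choice to ignore duplicate questions are assumptions. Only the two exit conditions (mean rollout accuracy reaching one half, and FIFO eviction under finite capacity) come from the abstract.

```python
from collections import deque
from typing import Deque, List


class PromptReplayBuffer:
    """Holds hard questions until they graduate or are evicted in FIFO order."""

    def __init__(self, capacity: int, graduate_at: float = 0.5) -> None:
        self.capacity = capacity
        self.graduate_at = graduate_at  # mean rollout accuracy needed to graduate
        self.questions: Deque[str] = deque()

    def add(self, question: str) -> None:
        """Insert a hard question, evicting the oldest one if the buffer is full."""
        if question in self.questions:
            return  # assumption: a question is stored at most once
        if len(self.questions) >= self.capacity:
            self.questions.popleft()  # FIFO eviction
        self.questions.append(question)

    def update(self, question: str, rollout_correct: List[bool]) -> bool:
        """Remove the question if its mean rollout accuracy reaches the threshold.

        Returns True when the question graduated and was removed.
        """
        if not rollout_correct or question not in self.questions:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            self.questions.remove(question)
            return True
        return False


buffer = PromptReplayBuffer(capacity=2)
for q in ["q1", "q2", "q3"]:
    buffer.add(q)  # adding "q3" evicts "q1"
graduated = buffer.update("q2", [True, False])  # accuracy 0.5 reaches the threshold
print(list(buffer.questions), graduated)
```

Questions that stay in the buffer are the ones the student still fails more often than not, which is where the BCQ and NCQ prompts would be regenerated on the next pass.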

2606.18209 2026-06-17 cs.LG New submission

Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

Trisha Mittal, Akshay Mehra, Joshua Kimball

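Affiliations: Dolby Laboratory

AI summary: Evaluates seven state-of-the-art dataset distillation methods through large-scale standardized experiments and finds that on large datasets they perform no better than, or worse than, coresets while costing more to construct; coresets offer better coverage of the data distribution and better computational efficiency.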

Abstract

Dataset distillation (DD) has emerged as a prominent approach in data-centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of the distilled data from that of the evaluation protocol. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV New submission

Looped World Models

Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

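Affiliations: FaceMind Research Asia

AI summary: Proposes Looped World Models (LoopWM), which iteratively refine latent environment states through a parameter-shared Transformer block, achieving up to 100x parameter efficiency and establishing iterative latent depth as a new scaling axis for world simulation.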

Comments Technical Report

Abstract

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches, along with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.

2606.18206 2026-06-17 cs.AI 新提交

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

不动点推理器:稳定且自适应的深度循环Transformer

Sajad Movahedi, Vera Milovanović, Shlomo Libo Feigin, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto

发表机构 * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center(ELLIS研究所蒂宾根,马克斯·普朗克智能系统研究所,蒂宾根人工智能中心) ETH Zurich(苏黎世联邦理工学院) Swiss Institute of Bioinformatics(瑞士生物信息学研究所) Université Paris Cité(巴黎西岱大学) Liquid AI

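AI summary: To address depth-induced signal propagation problems in looped architectures, proposes FPRM, a model built on pre-layer normalization and residual scaling that uses fixed-point convergence as an end-to-end halting mechanism; it adapts its compute and effectively improves performance on reasoning benchmarks such as Sudoku and Maze.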

Comments Code available at https://github.com/nilskiKonjIzDunava/fprm

详情
AI中文摘要

循环架构为学习需要组合推理的任务的逐步程序提供了归纳偏置。通过循环达到的有效层数决定了这些模型找到的解的质量。与深层架构类似,循环架构容易受到由深度引起的信号传播问题的影响,因为停止决策被推迟。在本文中,我们使用预层归一化和残差缩放来解决这个信号传播问题。基于这些架构修改,我们提出了FPRM,一种基于Transformer的不动点推理模型,它在循环架构中使用不动点收敛作为端到端停止机制。我们表明,不动点停止允许FPRM根据任务难度调整其计算量。FPRM在常见的推理基准(即Sudoku、Maze、状态跟踪和ARC-AGI)上是有效的。

英文摘要

Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The number of effective layers reached by looping determines the quality of the solution these models find. Like deep architectures, looped architectures are prone to a signal propagation problem induced by depth as the halting decision is postponed. In this paper, we address this signal propagation issue using pre-norm layers and residual scaling. Building on these architectural modifications, we propose FPRM, a Transformer-based Fixed-Point Reasoning Model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. We show that fixed-point halting allows FPRM to adapt its compute to task difficulty. FPRM is effective on common reasoning benchmarks, namely Sudoku, Maze, state-tracking, and ARC-AGI.
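The fixed-point halting idea can be sketched as follows. This is a generic fixed-point iteration under stated assumptions, not FPRM itself: the update `step` is a toy contractive map rather than a Transformer block, the tolerance and iteration cap are arbitrary, and `contraction` is a hypothetical stand-in for task difficulty.

```python
def fixed_point_loop(step, state, tol=1e-6, max_iters=1000):
    """Iterate state <- step(state) until successive states differ by < tol.

    Returns the final state and the number of iterations actually used.
    """
    for iteration in range(1, max_iters + 1):
        new_state = step(state)
        # Largest coordinate-wise change between successive iterates.
        change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if change < tol:
            return state, iteration
    return state, max_iters


def make_step(contraction, target):
    """Toy contractive update whose fixed point is `target`.

    `contraction` in (0, 1) is a hypothetical stand-in for task difficulty:
    values closer to 1 converge more slowly.
    """
    def step(state):
        return [contraction * s + (1.0 - contraction) * t
                for s, t in zip(state, target)]
    return step


target = [1.0, -2.0, 0.5]
start = [0.0, 0.0, 0.0]

easy_state, easy_iters = fixed_point_loop(make_step(0.5, target), start)
hard_state, hard_iters = fixed_point_loop(make_step(0.9, target), start)
```

Both runs stop on their own once the state stops changing, and the slower-contracting ("harder") map uses more iterations. That is the sense in which halting on convergence lets compute adapt to the input.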

2606.18205 2026-06-17 cs.CL 新提交

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

使用ISO语言标记框架和TEI Lex-0分析与编码Al-Mawrid阿拉伯语-英语词典

Diaa Fayed, Laurent Romary

发表机构 * Faculty of Information Technology and Computer Science, Sinai University(信息科技与计算机科学学院, Sinai大学) Inria ISO committee TC 37 (Language and terminology)(ISO术语委员会TC 37(语言和术语))

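AI summary: Presents a systematic method for digitizing the Al-Mawrid Arabic-English dictionary and encoding it as a standardized computational lexicon under the dual standards ISO LMF and TEI Lex-0, achieving 91% structural parsing accuracy.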

Comments 44 pages, 58 figures, 12 tables. Submitted to Language Resources and Evaluation, under review since Aug 2025, round 3

详情
AI中文摘要

本文提出了一种稳健的方法,用于系统化数字化和编码Al-Mawrid阿拉伯语-英语词典,将其从传统的印刷资源转变为标准化的计算词典。针对阿拉伯语词汇基础设施中的显著空白,本研究采用双重标准框架,将ISO词汇标记框架(LMF)与文本编码倡议TEI Lex-0指南对齐。通过应用编辑视角处理词典的宏观和微观结构,研究解决了20世纪双语词典中典型的结构歧义和标点不一致问题。该方法基于对词典词汇知识密度的实证分析。基于代表性样本(字母Ayn,占总量的4.6%),研究为编码过程提供了科学依据,展示了91%的结构解析准确率。信息提取规则的定量评估显示出高性能,同义词的精确率为85%,召回率为98%,其他形态语义特征的精确率为88%。除了技术描述,本文还与现有阿拉伯语词汇资源进行了批判性比较,并讨论了TEI Lex-0在建模特定阿拉伯语现象(如隐式“开放集”语义关系和分散的形态线索)时的局限性。此外,研究通过建立可扩展的基于前缀的引用系统,探索了语言关联开放数据(LLOD)集成的潜力,促进了该资源在语义网中的包含。最终成果是一个可互操作、机器可处理的资源,为阿拉伯语自然语言处理和数字人文社区中复杂遗留双语词典的逆向数字化提供了可复现的工作流程。

英文摘要

This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.

2606.18203 2026-06-17 cs.CL cs.AI 新提交

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

RubricsTree: 面向个人健康代理在健康记忆与医疗技能上的可扩展且不断演进的开放式评估

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

发表机构 * Google Research(谷歌研究院) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

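AI summary: Proposes the RubricsTree framework, which uses an expert-aligned hierarchical taxonomy (with over 100 atomic Boolean rubrics) and context-adaptive routing to provide scalable, auditable, and evolving open-ended evaluation, yielding relative gains of up to ~66% on HealthBench.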

详情
AI中文摘要

基于LLM的个人健康代理利用用户健康(传感器)指标,为缓解全球医疗资源获取不均提供了有希望的途径。然而,大规模临床部署仍受限于开放式评估瓶颈:医生标注可靠但成本高且不可扩展,而LLM作为评判者的评估虽可扩展但主观、不一致,且有时临床对齐不佳。我们引入了RubricsTree,一个可扩展的评估框架,具有专家对齐的层次化分类法,包含超过100个原子级、临床可验证的布尔规则,这些规则通过迭代的人机协同策展协议(由经验丰富的医生领导的专家小组)从4000个真实用户查询的洞察中演化而来。一个上下文感知的自适应路由器每查询仅激活相关的自动加权规则子集,提供可扩展评估所需的吞吐量,同时保持专家对齐的质量。通过系统的元评估,我们展示了RubricsTree:(i) 在具有挑战性的开放式查询上,专家对齐程度显著超过强大的大规模评估基线;(ii) 可靠地惩罚上下文退化的响应;(iii) 当用作结构化指令、文本反馈或性能优化的训练奖励时,在HealthBench上为Gemini、GPT和Qwen模型系列带来高达约66%的相对提升。因此,RubricsTree为产品级个人健康AI的持续优化提供了可扩展、可审计且不断演进的评估基础设施。

英文摘要

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

2606.18195 2026-06-17 cs.CL 新提交

Learning from the Self-future: On-policy Self-distillation for dLLMs

从自我未来学习:面向扩散LLM的在线自蒸馏

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

发表机构 * Tsinghua University(清华大学) Technical University of Munich(慕尼黑工业大学) Nanyang Technological University(南洋理工大学) University of British Columbia(不列颠哥伦比亚大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) ELLIS Institute Tubingen(ELLIS蒂宾根研究所) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) Tubingen AI Center(蒂宾根人工智能中心)

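AI summary: Proposes d-OPSD, the first on-policy self-distillation framework for diffusion large language models; it uses self-generated answers as suffix conditioning to learn from self future-experience, shifts supervision from the token level to the step level, and outperforms RLVR and SFT baselines on reasoning benchmarks with about 10% of the optimization steps.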

Comments Preprint

详情
AI中文摘要

在线自蒸馏(OPSD)已被证明对后训练大型语言模型(LLMs)有效,但其在扩散LLMs(dLLMs)上的应用仍未探索。现有的OPSD方法本质上是自回归中心的,它们通过从左到右的前缀条件化和词元级差异监督注入特权信息,这种设计与dLLMs的任意顺序生成根本冲突。我们提出了d-OPSD,这是首个为dLLMs量身定制的OPSD框架。我们的方法有两个核心贡献。首先,我们通过使用自我生成的答案作为后缀条件来重新构建自我教师,使学生模型能够从“自我未来经验”而非特权前缀中学习。其次,我们将监督从词元级转向步骤级,使训练与dLLMs的迭代去噪过程对齐。在四个推理基准上的实验表明,d-OPSD在样本效率上始终优于RLVR和SFT基线,仅需RLVR约10%的优化步骤,为dLLM后训练开辟了一条有前景的途径。代码可在该https URL获取。

英文摘要

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.

2606.18192 2026-06-17 cs.AI 新提交

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

斯坦福EDGAR文件数据集:将美国公司及财务披露重建为布局忠实且令牌高效的预训练数据

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Nanjing University(南京大学) Stanford University(斯坦福大学)

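AI summary: To address the scarcity of long-context documents, proposes the SEFD dataset, which reconstructs SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation; it is token-efficient and overlaps with Common Crawl by less than 0.1%.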

Comments Preprint. Includes appendix, tables, and figures

详情
AI中文摘要

随着高质量公共网络语料库日益枯竭,干净的长上下文文档已成为大型语言模型(LLM)训练数据中稀缺且昂贵的来源。现有的长上下文语料库通常是专有的且获取成本高昂、合成生成的,或集中在编程等狭窄领域。我们介绍了斯坦福EDGAR文件数据集(SEFD),这是将SEC文件重建为布局忠实的MultiMarkdown格式的开放数据集,用于金融语言建模和评估。SEFD使经过审计的财务报表、风险披露、所有权报告、会计说明和影响市场的事件文件能够用作长上下文预训练数据,并作为金融推理、预测、合规和文档理解的基础。生成的语料库令牌高效、可直接用于模型,并且与Common Crawl衍生的语料库重叠率低于0.1%。我们发布了SEFD-v1,一个152B令牌的初始公共快照,并提供了更大的1850万文件档案(估计为550B令牌)的语料库级分析。我们进一步引入了两个基于SEFD的基准:EDGAR-Forecast,用于评估模型知识截止后基于文件的数值预测;以及EDGAR-OCR,用于评估复杂金融表格的转录。

英文摘要

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

2606.18191 2026-06-17 cs.AI cs.MA 新提交

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

DRFLOW:用于个性化工作流预测的深度研究基准

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

发表机构 * ServiceNow AI Research(ServiceNow人工智能研究)

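AI summary: Proposes the DRFLOW benchmark for evaluating how well AI agents predict personalized workflows from heterogeneous sources; it contains 100 tasks across 5 domains and defines 7 diagnostic metrics, and experiments show that existing agents have limited performance.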

详情
AI中文摘要

深度研究(DR)系统越来越多地用于复杂信息寻求任务,但现有工作主要关注生成报告和摘要。相比之下,许多企业任务需要代理识别具体的工作流,即一系列行动步骤。例如,代理不应总结预算政策,而应能确定回答诸如“在固定预算下如何申请新员工?”这类问题所需的步骤。因此,我们引入DRFLOW,一个用于评估代理从异构源预测个性化工作流的基准。每个任务要求代理从分散来源中识别相关证据,然后使用这些证据预测用户任务的正确行动步骤序列。DRFLOW包含跨五个领域的100个任务,1246个参考工作流步骤,基于超过3900个来源。我们定义了七个诊断指标,涵盖事实依据、步骤恢复、结构排序、条件解决和个性化。我们进一步提出DRFLOW-Agent(DRFA),一个面向工作流的参考代理,用于预测个性化工作流。我们表明,尽管DRFA相比强基线代理有所改进(平均F1分数提升高达10.02%),但在这些工作流指标上仍有很大的改进空间,表明预测完整且正确的个性化工作流仍然是深度研究的一个挑战性前沿。

英文摘要

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing work mainly focuses on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify concrete workflows, that is, sequences of action steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?" Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent that predicts personalized workflows. We show that although DRFA improves over strong baseline agents (by up to 10.02% in average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.
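As an illustration of what a step-recovery score could look like, the sketch below computes set-overlap precision, recall, and F1 between predicted and reference workflow steps. The abstract does not define DRFLOW's metrics, so this is a generic formulation under assumptions: steps are matched by exact text after simple normalization, and the function names and the example steps are hypothetical.

```python
def normalize(step):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(step.lower().split())


def step_f1(predicted, reference):
    """Set-overlap precision, recall, and F1 between two lists of steps."""
    pred = {normalize(s) for s in predicted}
    ref = {normalize(s) for s in reference}
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference workflow and agent prediction.
reference = [
    "Check remaining budget",
    "Draft headcount justification",
    "Submit request to finance",
    "Obtain manager approval",
]
predicted = [
    "check remaining budget",
    "Submit request to finance",
    "Email the recruiter",
]

precision, recall, f1 = step_f1(predicted, reference)
```

Here two of three predicted steps match and two of four reference steps are recovered, giving precision 2/3, recall 1/2, and F1 of 4/7. A set-based score like this ignores ordering, which is presumably why the benchmark lists structural ordering as a separate metric.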

2606.18189 2026-06-17 cs.RO 新提交

Beyond Failure Recovery: An Engagement-Aware Human-in-the-loop Framework for Robotic Systems

超越故障恢复:一种面向机器人系统的参与感知人在回路框架

Jiaying Fang, Joyce Yang, Zhanxin Wu, Bohan Yang, Tapomayukh Bhattacharjee

发表机构 * Cornell University(康奈尔大学)

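AI summary: Proposes Engagement-aware Model Predictive Control (E-MPC), which plans the frequency and type of interaction to maintain user engagement while bounding workload; validated on a robot-assisted feeding system, it improves user experience without reducing task success.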

Comments Project website at https://emprise.cs.cornell.edu/empc

详情
Journal ref
Robotics: Science and Systems 2026
AI中文摘要

传统的人机协同方法通常仅在机器人遇到故障或不确定性时才让用户介入,将人类主要视为提升机器人性能的工具。然而,在许多以人为中心的机器人环境中,交互应通过让用户参与决策来支持参与度,而非将其限制于故障驱动的干预。这在物理护理场景中尤为突出,因为行动受限会降低用户实时干预或调节机器人行为的能力。因此,故障驱动的交互策略可能使用户在任务的大部分时间里沦为被动观察者。例如,行动受限的用户在持续被动接受机器人喂食时可能感到参与度不足。同时,过于频繁的交互可能令人疲惫并增加用户工作负荷。为解决这一权衡,我们提出了一种用户参与感知方法——参与感知模型预测控制(E-MPC),该方法规划交互以在维持参与度的同时满足工作负荷约束。E-MPC利用一个用户交互动力学模型,该模型捕捉用户参与度如何随交互频率和类型变化。机器人并非仅在任务执行出现困难时才请求输入,而是主动考虑用户在整个任务中偏好的参与水平,平衡自主性与交互,同时确保任务成功。我们通过多项消融实验和基线对比在仿真中评估了E-MPC。结果表明,该方法在多种用户画像下均有效。此外,我们在一个机器人辅助咬取系统中,与模拟行动受限的真实参与者进行了用户研究,显示E-MPC在维持任务成功的同时改善了用户体验。

英文摘要

Conventional human-in-the-loop approaches typically involve users only when a robot encounters failure or uncertainty, treating humans primarily as tools for improving robot performance. However, in many human-centered robotics settings, interaction should support engagement by keeping users involved in decision-making rather than limiting them to failure-driven interventions. This is particularly compelling in physical caregiving, where mobility limitations can reduce users' ability to intervene or modulate the robot's behavior in the moment. As a result, failure-driven interaction policies may relegate users to passive observers for long stretches of the task. For example, a user with mobility limitations may feel less engaged when being continuously and passively fed by a robot. At the same time, overly frequent interaction can be tiring and increase the user's workload. To address this trade-off, we propose Engagement-aware MPC (E-MPC), a user-engagement-aware method that plans interaction to maintain engagement while respecting a workload constraint. E-MPC leverages a user interaction dynamics model that captures how user engagement evolves as a function of both the frequency and type of interaction. Rather than requesting input only when difficulties arise during task execution, the robot proactively considers the user's preferred level of engagement throughout the task, balancing autonomy and interaction while ensuring task success. We evaluate E-MPC in simulation with several ablations and baseline comparisons. Results demonstrate the effectiveness of our approach across diverse user personas. In addition, we conduct a real-world user study with participants with emulated mobility limitations on a robot-assisted bite acquisition system, showing that E-MPC improves user experience while maintaining task success.

2606.18186 2026-06-17 cs.LG cs.AI 新提交

Kolmogorov Regression for Robust Diffusion Policies

用于鲁棒扩散策略的Kolmogorov回归

Lekan Molu

发表机构 * Bala Cynwyd, PA 19004(巴拉辛威德, PA 19004)

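AI summary: Proposes a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space, replacing stochastic score matching with a deterministic boundary-value PDE problem; a precision-weighted loss and a residual diagnostic provide convergence guarantees, trajectory regularization, and reward-free failure detection.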

详情
AI中文摘要

有限维扩散策略由于离散化伪影导致时间漂移,降低了长期性能(当部署在物理系统上时)。我们引入了一个后向Kolmogorov方程,将扩散策略提升至Cameron-Martin空间——希尔伯特空间的一个子集。本质上,用确定性边界值PDE问题替代随机分数匹配。我们的核心创新基于高斯测度理论,其中扩散噪声协方差算子由有色噪声分布实现,该分布规定了推理时模型样本的正则性概念。我们使用推导出的精度加权Cameron-Martin损失训练扩散模型,并引入Kolmogorov残差作为推理时的PDE诊断。这些替换产生了:(i) 收敛保证,其中界的常数取决于核的有效秩而非动作维度,(ii) 通过谱加权改进轨迹规则性,以及(iii) 无需奖励信号的确定性故障检测器。在两个应用领域的验证显示了显著改进:在PushT操作基准测试中,Cameron-Martin损失在最大回合奖励上实现了17%的提升(0.95对比0.78的MSE),并通过引入的残差幅度在推理期间减少了67.6%的步间漂移。类似地,在具有恒定在制品(CONWIP)流量控制的6站生产线上,我们实现了比经典LSTM基线低28.4%的RMSE;高饥饿事件召回率(测试周期中为1.0),以及有效的瓶颈识别(测试集中Precision@1=1.0,信噪比13倍)。然后,我们使用Hamilton-Jacobi可达性理论认证调度策略,与100次模拟运行中的无控制调度相比,死锁事件减少了96%(防止了351个事件)。

英文摘要

Finite-dimensional (FD) diffusion policies exhibit temporal drift owing to discretization artifacts that degrade long-horizon performance when deployed on physical systems. We introduce a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space, a subset of the Hilbert space, in effect replacing stochastic score matching with a deterministic boundary-value PDE problem. Our core innovation builds on Gaussian measure theory, in which the diffusion noise covariance operator is realized from a colored noise distribution that prescribes a notion of regularity on samples from the model at inference time. We train the diffusion model with a derived precision-weighted Cameron-Martin loss and introduce a Kolmogorov residual as a PDE diagnostic during inference. These substitutions yield (i) convergence guarantees where the bound's constants depend on the effective rank of the kernel rather than the action dimension, (ii) improved trajectory regularity via spectral weighting, and (iii) a deterministic failure detector that needs no reward signals. Validation across two application domains demonstrates substantial improvements: on the PushT manipulation benchmark, the Cameron-Martin loss achieves a 17% improvement in maximum episode reward (0.95 vs. 0.78 for MSE) and a 67.6% reduction in inter-step drift during inference via the introduced residual magnitude. Similarly, on a 6-station manufacturing line with constant work-in-process (CONWIP) flow control, we achieve 28.4% lower RMSE than classical LSTM baselines, high starvation-event recall (1.0 in test cycles), and effective bottleneck identification (Precision@1 = 1.0 on the test set, 13x signal-to-noise ratio). We then certify the dispatch policies with Hamilton-Jacobi reachability theory, which reduces deadlock events by 96% compared to uncontrolled dispatch over 100 simulated runs (351 events prevented).

2606.18180 2026-06-17 cs.CV 新提交

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

EgoCS-400K:面向世界模型的自我中心游戏数据集

Rongjin Guo, Dong Liang, Yuhao Liu, Fang Liu, Tianyu Huang, Gerhard P. Hancke, Rynson W. H. Lau

发表机构 * City University of Hong Kong(香港城市大学)

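AI summary: To support world-model research, builds EgoCS-400K, a large-scale egocentric gameplay dataset with 400,000 first-person videos and 10,000 hours of gameplay trajectories, supporting interactive visual modeling tasks such as action-conditioned future prediction and state- and event-aware scene rollout.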

详情
AI中文摘要

从视频生成到交互式世界建模的转变对数据提出了新要求:除了带字幕的视频外,世界模型还需要基于驱动未来场景变化的动作、相机运动、状态和事件的时间对齐的视频-动作-语言轨迹。然而,大规模获取此类数据十分困难。网络视频数据集提供广泛的视觉覆盖,但缺乏可执行动作和可靠状态;机器人数据集提供动作和状态监督,但成本高昂且场景多样性有限;现有模拟器通常缺乏大规模人类驱动的交互轨迹。在本文中,我们介绍了EgoCS-400K,一个大规模基于回放的自我中心反恐精英世界模型数据集,该数据集基于公开的职业CS和CS2比赛演示构建,保留了人类游戏轨迹,并支持解析、回放、渲染和时间对齐。我们提取玩家状态、视角方向、移动、键盘/按钮输入、视角变化、武器使用、游戏事件和回合级上下文,并从相同轨迹渲染干净的第一人称视频。EgoCS-400K包含超过40万第一人称视频和1万小时游戏时间,来自1000多场比赛和4万回合,涵盖13张地图和每回合10个玩家视角。它支持一系列交互式视觉建模任务,包括动作条件未来预测、状态和事件感知场景展开、基于回放的描述以及智能体自我中心动作理解。通过大规模连接视觉观察与人类动作、相机运动、游戏状态和事件,EgoCS-400K在被动网络视频、可控游戏模拟和昂贵的真实世界具身数据之间架起了一座实用桥梁。

英文摘要

The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.
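The dataset figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only the numbers in the abstract, which are stated as round figures or lower bounds ("over", "more than"), so the derived values are rough averages rather than reported statistics. It also assumes one video per player viewpoint per round and that the 10,000 hours are spread across all videos.

```python
# Figures quoted in the abstract (all stated as lower bounds or round numbers).
videos = 400_000
hours = 10_000
matches = 1_000
rounds = 40_000
viewpoints_per_round = 10

# One first-person video per player viewpoint per round (an assumption).
videos_from_rounds = rounds * viewpoints_per_round

# Average clip length implied by the totals, in seconds.
avg_seconds_per_video = hours * 3600 / videos

# Average number of rounds per match implied by the totals.
rounds_per_match = rounds / matches
```

Under these assumptions the counts are mutually consistent: 40,000 rounds times 10 viewpoints gives 400,000 videos, the implied average clip is about 90 seconds, and a match averages about 40 rounds.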

2606.18156 2026-06-17 cs.CV cs.AI 新提交

ReAge3D: Re-Aging 3D Faces with View Consistency

ReAge3D:具有视角一致性的3D人脸回龄

Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

发表机构 * Texas A&M University(德克萨斯农工大学) Netflix Eyeline Studios

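AI summary: Proposes the ReAge3D framework, which uses the 2D diffusion model DiffReaging and a center-out editing propagation strategy to achieve multi-view-consistent 3D face re-aging that preserves identity and detail and outperforms existing methods.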

详情
AI中文摘要

我们提出了一种新颖的框架,用于实现逼真且可控的3D人脸回龄,生成高度详细、保留身份的结果。现有的3D编辑方法虽然对粗粒度的语义变化有效,但不适合回龄,因为即使回龄2D视图之间的微小不一致也会导致对微妙但感知上重要的年龄相关细节的过度平滑。为了解决这一挑战,我们首先引入了一个基于2D扩散的回龄模型DiffReaging,该模型在合成生成的图像对上训练。我们进一步提出了一种中心向外编辑传播策略,利用该回龄模型重建多视图一致的回龄图像。具体来说,从回龄的正面枢轴视图开始,我们通过扭曲和我们提出的Masked-DiffReaging过程重建其余视图。通过在扩散过程的每一步注入现有内容,Masked-DiffReaging确保重建区域与现有像素保持连贯。由此产生的一致回龄视图集监督回龄3D表示的优化。我们的方法在视觉上和定量上都优于现有的3D编辑技术,能够对3D人脸模型中的年龄变换进行平滑、细粒度的控制。

英文摘要

We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.

2606.18154 2026-06-17 cs.AI 新提交

Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

通过智能体发现混合结构学习心脏电生理数字孪生

Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

发表机构 * Rochester Institute of Technology(罗彻斯特理工学院)

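AI summary: Proposes the LEADS framework, in which an LLM agent iteratively discovers hybrid physics-neural models within a structured action space to build personalized cardiac electrophysiology digital twins, outperforming human-designed models and other LLM-based methods.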

Comments 10 pages, 4 figures

详情
AI中文摘要

构建个性化心脏电生理(EP)数字孪生需要为每个患者识别合适的模型结构,而不仅仅是拟合参数。传统方法依赖专家手动指定混合物理-神经架构,这需要深厚的领域专业知识,且无法跨患者迁移。最近的工作应用大型语言模型(LLM)来生成或充当混合模型。然而,尽管这些基于LLM的方法具有有希望的泛化能力,但它们缺乏稳定心脏模拟所需的结构先验。因此,我们提出LEADS,一个将心脏EP领域知识形式化为结构化动作空间,并利用LLM智能体发现混合模型的框架。该智能体遵循迭代推理-行动循环来选择、组合和优化混合模型,同时梯度下降处理参数拟合。所提出的LEADS设计每个候选模型都朝向物理基础、可解释和数值稳定,同时允许开放式的架构发现。我们在具有三个真实反应模型的合成数据和真实心脏EP数据上验证了LEADS,证明其优于人工设计的混合模型和其他基于LLM的混合建模方法。

英文摘要

Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, while gradient descent handles parameter fitting. LEADS steers every candidate model toward being physically grounded, interpretable, and numerically stable, while allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling approaches.

2606.18153 2026-06-17 cs.CV 新提交

Neural Tree Reconstruction for the Open Forest Observatory

开放森林观测站的神经树重建

Marissa Ramirez de Chanlatte, Arjun Rewari, Trevor Darrell, Derek J. N. Young

发表机构 * Berkeley AI Research, University of California, Berkeley(加州大学伯克利分校伯克利人工智能研究) Department of Plant Sciences, University of California, Davis(加州大学戴维斯分校植物科学系)

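AI summary: To address the poor reconstruction quality of classical structure-from-motion in the Open Forest Observatory, proposes introducing neural radiance fields to improve the detail and robustness of 3D tree reconstruction, and outlines future work.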

Comments Published as a workshop paper at "Tackling Climate Change with Machine Learning", ICLR 2024

详情
AI中文摘要

开放森林观测站(OFO)是一项跨大学及其他合作伙伴的合作项目,旨在让生态学家、土地管理者和公众能够低成本地进行森林测绘。OFO正在构建一个地理空间森林数据库,以及通过无人机进行森林测绘的开源方法和工具。这些数据对多种气候应用非常有用,包括优先安排重新造林工作、减少野火风险以及监测碳封存。在OFO森林地图数据库的当前版本中,3D树图是使用经典的运动恢复结构技术创建的。这种方法容易出现伪影,缺乏细节,并且在森林地面(输入数据即俯拍图像的可视性有限)上尤其困难。这些重建错误可能会传播到下游的科学任务中(例如野火模拟)。3D重建的进展,包括神经辐射场(NeRF)等方法,产生了更高质量的结果,对稀疏视图更具鲁棒性,并支持数据驱动的先验。我们探索了将NeRF纳入OFO数据集的方法,概述了支持更先进的3D视觉模型的未来工作,并描述了高质量3D重建对林业应用的重要性。

英文摘要

The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for forest mapping by uncrewed aerial vehicle. Such data are useful for a variety of climate applications, including prioritizing reforestation efforts, informing wildfire hazard reduction, and monitoring carbon sequestration. In the current iteration of the OFO's forest map database, 3D tree maps are created using classical structure-from-motion techniques. This approach is prone to artifacts, lacks detail, and has particular difficulty on the forest floor, where the input data (overhead imagery) has limited visibility. These reconstruction errors can propagate to downstream scientific tasks (e.g., a wildfire simulation). Advances in 3D reconstruction, including methods like Neural Radiance Fields (NeRF), produce higher-quality results that are more robust to sparse views and support data-driven priors. We explore ways to incorporate NeRFs into the OFO dataset, outline future work to support even more state-of-the-art 3D vision models, and describe the importance of high-quality 3D reconstructions for forestry applications.

2606.18147 2026-06-17 cs.AI 新提交

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

WEQA: 可穿戴健康问答中的查询自适应智能推理

Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

发表机构 * University of Cambridge(剑桥大学) Tsinghua University(清华大学) University College London(伦敦大学学院) Dartmouth College(达特茅斯学院) Google Research(谷歌研究院)

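AI summary: Proposes the WEQA framework, in which an LLM controller dynamically combines sensor analysis with pretrained models to answer questions about wearable health data; it improves benchmark accuracy by 24%, and expert evaluation shows substantial gains in usefulness and clinical soundness.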

详情
AI中文摘要

语言模型在医学问答中表现出色,有时甚至超过普通医生的准确率。然而,关于可穿戴健康数据的问题回答仍然具有挑战性且研究不足,因为这些无处不在的传感器产生连续、高维和纵向的数据,难以与LLM预训练中的文本中心分布对齐。传感器模态和用户意图的多样性无法通过固定的推理工作流或单一的预训练基础模型有效处理。为了解决这些挑战,我们提出了WEQA,一个查询自适应智能体框架,将LLM推理与专门的可穿戴分析和建模工具统一起来。采用LLM控制器来合成执行计划,动态地将每个查询路由到适当的传感器分析和预训练模型组合,并利用外部知识进行基于证据的响应审计。我们还整理了一个基准测试,涵盖四个开放的可穿戴数据集,包括三个不同健康领域的分析和预测任务。实验表明,我们的框架比LLM和智能体基线准确率提高24%,一项由12名医学专家和8名用户进行的盲法研究显示,在实用性和临床合理性方面有显著提升。

英文摘要

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

2606.18144 2026-06-17 cs.AI cs.CY cs.LG cs.RO 新提交

Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So

记忆作为消耗性资产:为具身智能体定价闪存耐久性及其局限性

Josef Liyanjun Chen

发表机构 * KAIKAKU

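AI summary: Proposes treating a robot's flash endurance as depreciating capital priced with a single shadow price η, which yields cost-optimal placement across storage tiers; measures the sign of the value-write association χ on real robot logs and finds that it depends on the deployment regime.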

详情
AI中文摘要

机器人的闪存耐久性是一种不可再生资源:每次持久化写入都会消耗数千次编程/擦除周期中的一次,且无法补充,然而目前没有实际部署的机器人内存系统对哪些记忆值得消耗一次擦除周期进行定价。我们将具身记忆视为折旧资本,并用单一耐久性影子价格η对该资源定价,这使得在RAM/板载NVM/云层级中进行成本最小化的放置成为一个在磨损增强的每字节索引中的阈值。无论价值-写入关联χ的符号如何,该索引都是成本最优的;只有当χ>0时,最优解才变为非单调,将机器人最有价值的记忆从闪存中移出。因此,关键点是经验性的,我们在预定义的关口上测量真实机器人日志中的χ:其符号是部署场景的一个属性——在重复的长时域操作中为正(χ̂≈+1.0×10^{-3},在全功率下可复现),在较短时域任务中为零,在非重复遥操作中为负。两个边界限制了该结果。在高端3,000 P/E TLC闪存按数据手册价格计算时,耐久性预算处于休眠状态;而在廉价边缘机器人使用的商用QLC/eMMC(约1,000 P/E)上则具有约束力。当约束生效时,学习到的磨损感知控制器仅在任务价值上与基于价格的路由持平,因为实现的价值在RAM、NVM和云层级之间是不变的:租金决定设备寿命和成本,而非任务性能。磨损感知放置是否能提高任务价值仍是一个开放问题——χ是针对价值代理测量的,而非单调最优解虽已被证明,但尚未在数据中观察到。

English abstract

A robot's flash endurance is a non-renewable stock: every persisted write spends one of a few thousand program/erase cycles and never refills, yet no fielded robot memory system prices which memories are worth an erase cycle. We treat embodied memory as depreciating capital and price that stock with a single endurance shadow price $η$, which makes cost-minimizing placement across a RAM / on-board NVM / cloud hierarchy a threshold in a wear-augmented per-byte index. The index is cost-optimal whatever the sign of the value-write association $χ$; only when $χ> 0$ does the optimum turn non-monotone, sending a robot's most valuable memories off its flash. The pivot is thus empirical, and we measure $χ$ on real robot logs at a pre-specified gate: its sign is a property of the deployment regime -- positive on recurrent long-horizon manipulation ($\hatχ \approx +1.0 \times 10^{-3}$, replicated at full power), null on a shorter-horizon suite, and negative on non-recurrent teleoperation. Two boundaries scope the result. The endurance budget is dormant on premium 3,000-P/E TLC at datasheet prices and binding on the commodity QLC/eMMC ($\sim$1,000 P/E) that cheaper edge robots run. And where it binds, a learned wear-aware controller only ties price-based routing on task value, because realized value is tier-invariant across RAM, NVM, and cloud: the rent governs device lifetime and cost, not task performance. Whether wear-aware placement improves task value remains open -- $χ$ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.
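
As a rough illustration of the threshold idea in this abstract, the sketch below adds a wear charge to each byte written to flash and persists locally only when the wear-augmented flash cost is below the cloud cost. Everything in it is an assumption made for illustration: the function names, the dollar figures, and especially the naive amortized wear price (device price divided by total writable bytes), which stands in for the paper's shadow price η and is not its definition. Write amplification and the RAM tier are ignored.

```python
def amortized_wear_price(device_price: float, capacity_bytes: float, pe_cycles: int) -> float:
    """Hypothetical stand-in for eta: replacement cost spread over every
    byte the device can absorb before its program/erase cycles run out."""
    return device_price / (capacity_bytes * pe_cycles)


def choose_tier(eta: float, flash_cost_per_byte: float, cloud_cost_per_byte: float) -> str:
    """Threshold rule: keep the write on flash only if the wear-augmented
    per-byte flash cost is strictly below the per-byte cloud cost."""
    wear_augmented_flash_cost = flash_cost_per_byte + eta
    return "flash" if wear_augmented_flash_cost < cloud_cost_per_byte else "cloud"


# Made-up numbers: a 64 GB device costing 40 dollars, in two endurance grades.
CAPACITY_BYTES = 64e9
DEVICE_PRICE = 40.0
FLASH_COST_PER_BYTE = 1e-13   # assumed non-wear cost of a local write
CLOUD_COST_PER_BYTE = 5e-13   # assumed cost of sending a byte to the cloud

eta_3000 = amortized_wear_price(DEVICE_PRICE, CAPACITY_BYTES, 3000)
eta_1000 = amortized_wear_price(DEVICE_PRICE, CAPACITY_BYTES, 1000)

tier_3000 = choose_tier(eta_3000, FLASH_COST_PER_BYTE, CLOUD_COST_PER_BYTE)
tier_1000 = choose_tier(eta_1000, FLASH_COST_PER_BYTE, CLOUD_COST_PER_BYTE)
print(tier_3000, tier_1000)
```

With these made-up numbers the 3,000-cycle device keeps the write on flash and the 1,000-cycle device sends it to the cloud. This only shows how a per-byte threshold can flip as endurance shrinks; it reproduces no result from the paper.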

2606.18142 2026-06-17 cs.AI cs.CL cs.CY New submission

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models


Jasmine Brazilek, Oliver Tulio, Joel Christoph, Miles Tidmarsh, Carol Kline, Arturs Kanepajs

Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

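AI summary: Proposes TAC, the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when carrying out actions such as travel booking for users. Seven frontier models are evaluated; all score below the chance level of 64%, and the best model reaches only 53%.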

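AI Chinese abstract (translated to English)

AI agents are shifting from advisors to actors, booking travel, planning menus, and managing procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate models' text responses to question-answer prompts, but do not test whether the welfare reasoning in those responses transfers to agentic deployment, where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios covering six categories of animal exploitation, expanded to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of 64%, with the best performer (Claude Opus 4.7) at 53%. A single welfare-aware sentence in the system prompt yields gains of 47 to 63 percentage points in Claude and GPT-5.5, 26 percentage points in GPT-5.2, and under 12 percentage points in DeepSeek and Gemini. An auxiliary Inspect Scout audit (using Gemini 2.5 Flash Lite as judge, on 288 base-condition transcripts from the top two models) flags no transcripts for evaluation awareness, suggesting that the below-chance rates do not stem from the models recognising the evaluation. We discuss category-level variation across cultural domains, the limits of text-response welfare benchmarks, and implications for the EU General-Purpose AI Code of Practice systemic risk framework.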

English abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2606.18135 2026-06-17 cs.SD cs.AI New submission

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)


Sinclair Gurny, Ryan Quinn

Affiliations: Certus Innovations

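AI summary: Introduces C3GD, a public gunshot dataset with more than 8000 field-collected data points from 28 firearms across 16 calibers, for tasks such as caliber classification and gunshot detection, with rich metadata to support generalization and academic analysis.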

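AI Chinese abstract (translated to English)

In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible dataset for analyzing muzzle blast sounds. The dataset aims to provide a variety of firearms, calibers, ammunition, microphones, and microphone positions, with metadata more detailed than that of other currently available datasets. It contains more than 8000 field-collected data points from 28 firearms across 16 calibers. Because field data collection is costly, existing research has mostly used gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. The dataset focuses primarily on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diverse real-world reference. It aims to provide enough diversity to generalize to more real-world applications, while providing enough metadata for detailed academic analysis.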

English abstract

In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible data set developed for the analysis of firearm muzzle blast sounds. The dataset aims to provide a wide variety of firearms, calibers, cartridges, microphones, and microphone locations with metadata detailed beyond what is currently otherwise available. It comprises more than 8000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified and real-world reference. The dataset aims to provide enough diversity to be able to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.