arXivDaily arXiv每日学术速递 周一至周五更新
2606.18249 2026-06-17 cs.CV 新提交

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

统一多模态自回归建模:共享上下文-视觉分词器是实现统一的关键

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(可信具身AI研究院,复旦大学) Shanghai Innovation Institute(上海创新研究院) Qwen Team, Alibaba Inc.(通义实验室,阿里公司)

AI总结 提出UniAR框架,通过单一离散视觉分词器桥接视觉理解与生成,采用并行位预测和扩散解码,在图像生成和编辑上达到最优,同时保持多模态理解竞争力。

Comments Accepted by ICML2026. Project page this https URL (https://sharelab-sii.github.io/uniar-web)

AI中文摘要

统一多模态建模旨在将视觉理解和生成集成到单个系统中。然而,现有方法通常依赖两个不同的视觉分词器,这分割了表示空间并阻碍了真正的统一建模。我们提出UniAR,一个统一的自回归框架,其中单个离散视觉分词器作为理解和生成之间的关键桥梁,使得模型能够直接解释其自身生成的视觉标记而无需额外的重新编码,从而实现共享上下文。UniAR采用预训练的视觉编码器,结合多级特征融合和无查找的逐位量化方案,在保留高层语义和低层细节的同时,以最小代价扩展有效视觉词汇。在此基础上,统一自回归模型采用并行逐位预测来联合预测空间分组的多级视觉编码,大幅减少视觉序列长度并加速生成。最后,基于扩散的视觉解码器对离散视觉标记进行操作,以解码高保真图像。通过大规模预训练,随后进行监督微调和强化学习,UniAR在图像生成和图像编辑上达到了最先进的性能,同时在多模态理解基准上保持竞争力。项目页面可在此URL获取。

英文摘要

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at this https URL.
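
To make the "lookup-free bitwise quantization" idea concrete, here is a minimal sketch, not UniAR's actual tokenizer. It assumes the common sign-based formulation in which each latent channel is binarized independently, so a d-channel feature maps to one of 2^d codes without any codebook lookup. The helper names and the 4-channel example are hypothetical.

```python
def lfq_quantize(features):
    """Binarize each channel by sign: positive -> 1, otherwise 0 (assumed scheme)."""
    return [1 if x > 0 else 0 for x in features]


def bits_to_index(bits):
    """Pack bits (most significant first) into a single integer token id."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_to_bits(index, num_bits):
    """Unpack an integer token id back into its bits (most significant first)."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


features = [0.7, -1.2, 0.1, -0.3]   # toy 4-channel latent, illustrative only
bits = lfq_quantize(features)       # [1, 0, 1, 0]
token_id = bits_to_index(bits)      # 0b1010 == 10
vocab_size = 2 ** len(bits)         # 16 codes from only 4 binary decisions
```

Under this assumption, predicting the bits in parallel needs d binary outputs rather than one softmax over 2^d entries, which is one plausible reading of how the effective vocabulary can grow at low cost.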

2606.18246 2026-06-17 cs.CL 新提交

Variable-Width Transformers

变宽Transformer

Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

发表机构 * MIT(麻省理工学院) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室)

AI总结 提出一种中间窄、两端宽的变宽Transformer架构,通过无参数残差缩放机制实现非均匀容量分配,在语言模型困惑度、FLOPs和KV缓存上优于均匀宽度基线。

AI中文摘要

扩展模型规模,特别是深度和宽度,推动了基于Transformer的语言模型的显著进步。然而,大多数架构在所有层中保持恒定宽度,即使不同层可能扮演不同的计算角色,也均匀分配固定的参数和计算预算。在这项工作中,我们通过提出一个×形> <former架构,实证研究了跨网络深度的非均匀容量分配。该设计保持较宽的早期和晚期层,同时缩窄中间层,利用无参数残差缩放机制。在从200M到2B参数(密集)和3B参数(MoE)的仅解码器语言模型中,我们的> <former在语言建模损失上始终优于参数匹配的均匀基线。通过降低平均层宽度,该架构还减少了总体FLOPs(在拟合的损失匹配缩放曲线下减少22%)以及更小的KV缓存内存和I/O成本(减少15%)。在分析中,我们展示了这种瓶颈结构导致残差流中定性不同的表示。总体而言,我们的结果表明,非均匀宽度分配可以导致语言模型更资源最优的缩放。

英文摘要

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.

2606.18242 2026-06-17 cs.CV 新提交

EventDrive: Event Cameras for Vision-Language Driving Intelligence

EventDrive: 用于视觉-语言驾驶智能的事件相机

Dongyue Lu, Rong Li, Ao Liang, Lingdong Kong, Wei Yin, Lai Xing Ng, Benoit R. Cottereau, Camille Simon Chane, Wei Tsang Ooi

发表机构 * NUS(新加坡国立大学) HKUST(GZ)(香港科技大学(广州)) Horizon Robotics(地平线机器人) A*STAR, I2R(新加坡科技研究局,资讯通信研究院) IPAL, CNRS IRL 2955, Singapore(IPAL,法国国家科学研究中心国际联合实验室2955,新加坡) University Toulouse, CNRS, CerCo, Toulouse, France(图卢兹大学,法国国家科学研究中心,CerCo,法国图卢兹) ETIS UMR 8051, CY Cergy Paris University, ENSEA, CNRS, France(ETIS UMR 8051,CY塞尔吉-巴黎大学,ENSEA,法国国家科学研究中心,法国)

AI总结 提出EventDrive基准和模型套件,通过多时域事件金字塔和时域混合专家模块融合事件流与RGB帧,在感知、理解、预测和规划四维度提升驾驶推理性能。

Comments CVPR2026, 34 pages, 15 figures, 15 tables, project page: this https URL (https://dylanorange.github.io/projects/eventdrive)

AI中文摘要

事件相机通过异步亮度变化感知世界,具有微秒级延迟和高动态范围,其运动保真度远超基于帧的传感器,并能捕捉传统曝光常遗漏的时间结构。这些特性使事件成为自动驾驶中RGB的有力补充,尤其在帧感知可能不可靠的模糊、眩光和快速运动场景下。然而,现有的事件感知视觉-语言模型仍局限于通用感知,未能揭示事件传感如何促进整个驾驶循环中的推理和决策。我们提出EventDrive,一个大规模基准和模型套件,统一了事件流、RGB帧和语言监督,涵盖四个核心维度:感知、理解、预测和规划,包括描述、结构化问答、定位、运动状态识别、轨迹预测和规划任务。在此基础上,EventDrive-VLM引入了多时域事件金字塔和时域混合专家模块,自适应地编码和融合异步与基于帧的信息,用于下游推理。在多样化任务上的全面评估表明,事件流在时间精度、运动感知和鲁棒性方面提供了显著提升,将事件传感置于驾驶智能的核心。

英文摘要

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

2606.18239 2026-06-17 cs.RO 新提交

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

EBench: 通用移动操作策略的要素诊断

Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Xi’an Jiaotong University(西安交通大学) Institute for AI Industry Research (AIR), Tsinghua University(清华大学智能产业研究院) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学)

AI总结 提出EBench基准,从5个能力和4个泛化维度诊断通用移动操作模型,揭示不同模型在成功率相近时能力差异显著。

AI中文摘要

我们提出EBench,一个仿真基准,用于诊断通用移动操作策略,超越单一的成功率标量。EBench包含26个多样且具有挑战性的操作任务,沿5个能力维度和4个泛化维度进行标注。我们评估了最先进的通用操作模型,包括$\pi_0$、$\pi_{0.5}$、XVLA和InternVLA-A1,并揭示出成功率相近的模型展现出截然不同的能力轮廓:$\pi_{0.5}$实现了最高的测试成功率和最佳的训练-测试保持率,而InternVLA-A1在移动操作上占主导地位,但在灵巧任务上崩溃,XVLA与其他策略相比在一组不相交的原子技能上表现出优势。除了能力轮廓分析,EBench还从4个代表性角度分析了泛化能力,识别了不同分布偏移因素的影响。结果揭示了模型在总体得分背后的优势和弱点。我们希望这个基准能提供广泛的诊断信号,以指导通用操作模型的迭代。

英文摘要

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $\pi_0$, $\pi_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: $\pi_{0.5}$ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models that lie behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.

2606.18237 2026-06-17 cs.CL cs.AI cs.LG 新提交

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

ReproRepo: 利用 GitHub 仓库问题扩展可重复性审计

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

发表机构 * School of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机科学学院) Datadog

AI总结 提出 ReproRepo 框架,利用 GitHub issues 作为监督信号,对 1149 篇论文进行可重复性评估,发现 Codex with GPT-5.5 能识别约 90% 论文的语义相关复现问题。

AI中文摘要

从论文和已发布代码中复现研究结果对科学进步至关重要。现有工作引入了基准测试来评估 LLM 代理是否能协助可重复性,但由于数据整理和评估需要大量人工努力,这些基准难以扩展。我们提出了 ReproRepo,一个可扩展的可重复性评估框架,利用人类提出的 GitHub issues 作为真实复现障碍的自然监督信号。我们在来自主要会议的 1149 篇近期机器学习论文上实例化 ReproRepo,并评估了四种前沿模型代理配置。我们的结果表明,即使不执行代码,LLM 代理也能从论文-仓库对中识别出许多现实世界的可重复性问题:我们研究中的最佳代理,即带有 GPT-5.5 的 Codex,为研究中约 90% 的论文揭示了至少一个语义相关的人类报告的障碍。进一步分析表明,代理在揭示可见故障和识别正确语义区域方面特别有效,但在精确定位方面可能仍不足。ReproRepo 可作为未来在真实世界可重复性审计中评估 LLM 代理的可重用、可扩展框架。我们的代码发布在此 https URL。

英文摘要

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at this https URL.

2606.18235 2026-06-17 cs.AI 新提交

EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation

EvolveNav: 用于零样本目标导航的主动预反思与自进化记忆

Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma, Guosheng Lin, Hao Wang

发表机构 * HKUST(GZ)(香港科技大学(广州)) Nanyang Technological University(南洋理工大学) Xi’an Jiaotong University(西安交通大学) XGRIDS(深圳格物智联)

AI总结 提出自进化零样本目标导航框架,通过从历史轨迹提取规则并基于置信上界检索,结合记忆引导预反思模块,减少无效探索,成功率提升10.1%。

AI中文摘要

零样本目标导航(ZS-OGN)要求具身智能体在没有任何先验训练的情况下探索并定位目标物体。为此,近期方法利用基础模型,但它们通常依赖静态先验且缺乏适应性,导致重复错误和代价高昂的试错。本文提出一种自进化的ZS-OGN框架,实现连续的测试时改进。具体而言,我们通过从过去轨迹中提取可操作知识来构建智能体规则记忆。然后,我们提出一种基于置信上界的检索策略,通过平衡语义相关性和历史成功率来选择有效规则。此外,我们引入一个记忆引导的预反思模块,在行动前预测潜在结果,减少低效探索。大量实验表明,我们的方法优于现有的零样本基线,在减少不必要步骤的同时实现了10.1%的成功率提升。

英文摘要

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models, but they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on the upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
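
The retrieval step can be pictured with a standard UCB1-style score. The sketch below is an illustration under assumptions, not the paper's formula: the abstract only says the strategy balances semantic relevance and historical success, so the additive combination, the exploration constant `c`, and the rule records are all hypothetical.

```python
import math


def ucb_score(relevance, successes, uses, total_uses, c=1.0):
    """Relevance plus empirical success rate plus a UCB1 exploration bonus.

    The additive form is an assumption made for this sketch.
    """
    if uses == 0:
        return float("inf")  # untried rules are explored first
    mean_success = successes / uses
    bonus = c * math.sqrt(math.log(total_uses) / uses)
    return relevance + mean_success + bonus


def select_rule(rules, c=1.0):
    """Return the name of the rule with the highest UCB score."""
    total = max(sum(r["uses"] for r in rules), 1)
    best = max(
        rules,
        key=lambda r: ucb_score(r["relevance"], r["successes"], r["uses"], total, c),
    )
    return best["name"]


# Hypothetical rule memory: one rule is more relevant, the other succeeds more often.
rules = [
    {"name": "search near sofas", "relevance": 0.9, "successes": 2, "uses": 10},
    {"name": "check the kitchen", "relevance": 0.6, "successes": 9, "uses": 10},
]
chosen = select_rule(rules)
```

With equal usage counts the exploration bonus is identical for both rules, so the higher success rate of the second rule outweighs the first rule's relevance advantage. A rule that has never been tried would be selected ahead of both.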

2606.18231 2026-06-17 cs.CV cs.LG cs.RO 新提交

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

自适应体积力学属性场:分辨率无关

Rishit Dagli, Donglai Xiang, Vismay Modi, Xuning Yang, Gavriel State, David I.W. Levin, Maria Shugrina

发表机构 * NVIDIA(英伟达)

AI总结 提出AdaVoMP方法,利用稀疏自适应体素结构和自回归Transformer编解码器,为3D物体预测高分辨率空间变化的杨氏模量、泊松比和密度,相比现有技术分辨率提升16^3倍且更准确。

Comments Project Page and hi-res paper: this https URL (https://research.nvidia.com/labs/sil/projects/adavomp/). ICML 2026

AI中文摘要

精确的力学属性(或材料)杨氏模量($E$)、泊松比($\nu$)和密度($\rho$)对于数字世界的可靠物理模拟至关重要,但大多数3D资产缺乏这些信息。我们提出AdaVoMP,一种预测输入3D物体跨表示形式的精确密集空间变化($E$,$\nu$,$\rho$)的方法,在分辨率、准确性和内存效率上优于现有技术。我们技术的基础是一种稀疏自适应体素结构SAV,它能高效地表示输入3D形状和材料场输出。我们将最准确的先前方法VoMP的固定体素模型替换为一种新颖的稀疏Transformer编码器-解码器模型,该模型学习为每个输入形状自回归地生成唯一的SAV来表示其材料,实现比先前技术高$16^3$倍的分辨率。实验表明,即使测试时计算量少于所有先前技术,AdaVoMP也能估计出更准确的体积属性。这使得我们能够将高分辨率复杂3D物体转换为可模拟的资产,从而实现逼真的可变形模拟。

英文摘要

Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially-varying ($E$, $\nu$, $\rho$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

2606.18222 2026-06-17 cs.CL cs.DL 新提交

Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

Darshana Graph:用于比较印度哲学的平行注释语料库,附文体计量与探索性图分析

Joy Bose

发表机构 * Independent Researcher(独立研究者) Bangalore, India(印度班加罗尔)

AI总结 构建包含超12.5万条记录的印度哲学平行注释语料库,其中约8500条记录实现跨18位注释者的根颂对齐,通过文体计量和约束大语言模型管道分析论证风格与概念关系,揭示学派间分歧模式。

Comments 12 pages, 1 figure. Open Source Code available at this https URL (https://github.com/joyboseroy/darshana-graph) and dataset at this https URL (https://huggingface.co/datasets/joyboseroy/darshana-graph)

AI中文摘要

我们介绍了Darshana Graph,一个包含超过12.5万条文本记录的语料库,涵盖古典印度教、佛教和耆那教哲学传统,这些记录来自公共领域和开放许可的翻译,包括《薄伽梵歌》、《梵经》、主要《奥义书》、《巴利经典》和核心耆那教经典。其独特贡献在于一个结构独特的子集,包含约8500条印度教和耆那教记录,其中相同的根本颂或经句与代表吠檀多五个学派及其他见(darshanas)的十八位历史注释者对齐,从而能够直接比较独立解释传统如何解读相同的源材料。据我们所知,没有其他公开资源提供如此规模的跨注释者对齐。我们展示了基于该语料库的两项分析。首先,一种无需机器学习的透明文体计量比较,通过经典引用密度、明确反驳率和句子复杂度来衡量论证风格。它发现引用密度与反驳率之间存在中等负相关,在相关教义谱系的三位注释者中反驳率显著增加,并且在巴利经典内部存在可测量的体裁层面差异。其次,我们描述了一个受约束的大语言模型管道,该管道使用预定义的关系词汇和确定性事后验证来提取概念之间的类型化哲学关系。生成的图揭示了跨学派分歧模式,同时也揭示了重要的提取局限性,包括独立基于嵌入的分析与图派生结果不一致的情况。我们发布了完整的语料库、提取的关系图以及所有源代码。

英文摘要

We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.

2606.18216 2026-06-17 cs.CL 新提交

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

近端策略优化区域:教师存在于提示中,而非梯度中

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

发表机构 * NVIDIA(英伟达)

AI总结 提出ZPPO方法,通过将教师知识注入提示而非策略梯度,解决小模型知识蒸馏中教师梯度主导和强化学习策略漂移问题,在多种规模模型上超越现有方法。

Comments Project page: this https URL (https://byungkwanlee.github.io/ZPPO-page/)

AI中文摘要

知识蒸馏将教师的能力传递给小型学生模型,但在小模型场景下存在脆弱性:强制学生模仿更大教师的logits会使其集中于教师最尖锐的模式,从而损害训练语料库之外基准家族的泛化能力。强化学习通过基于学生自身rollout进行训练避免了logit模仿。然而,在每次rollout都失败(产生零优势并被静默丢弃)的问题上,将更强教师的响应注入策略梯度会破坏同策略假设并导致漂移。我们提出近端策略优化区域(ZPPO),受维果茨基最近发展区启发,将教师保留在提示中而非策略梯度中。在难题上,ZPPO构建两种重新表述的提示:二元候选包含问题(BCQ)将一个正确教师响应与一个错误学生响应配对作为匿名候选,学生必须区分;负候选包含问题(NCQ)将学生的错误rollout聚合到单个提示中,以揭示其共同的失败模式。提示重放缓冲区循环每个难题,直到其毕业(学生在该问题上的平均rollout准确率达到一半)或在有限容量下被FIFO逐出,从而在学生当前最近发展区内放大BCQ和NCQ。在Qwen3.5系列上,使用四个学生规模(0.8B-9B)和27B教师,后训练为视觉语言模型并在31个基准套件(16个VLM、10个LLM、5个视频)上评估,ZPPO优于离/同策略蒸馏和GRPO,且在最小规模上增益最大。

英文摘要

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
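
The replay-buffer behaviour described above (recirculate until graduation at mean rollout accuracy of one half, otherwise FIFO eviction under finite capacity) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the class name, method names, and the string question ids are hypothetical, and capacity is assumed to be at least 1.

```python
from collections import deque


class PromptReplayBuffer:
    """Toy FIFO buffer of hard questions with a graduation threshold."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity        # assumed >= 1
        self.graduate_at = graduate_at  # mean rollout accuracy needed to graduate
        self.queue = deque()

    def add(self, question):
        """Insert a hard question; return the evicted question, if any."""
        if question in self.queue:
            return None
        evicted = None
        if len(self.queue) >= self.capacity:
            evicted = self.queue.popleft()  # oldest question leaves first (FIFO)
        self.queue.append(question)
        return evicted

    def report(self, question, rollout_rewards):
        """Remove the question if its mean rollout accuracy reaches the threshold."""
        mean_accuracy = sum(rollout_rewards) / len(rollout_rewards)
        if question in self.queue and mean_accuracy >= self.graduate_at:
            self.queue.remove(question)
            return True
        return False


buf = PromptReplayBuffer(capacity=2)
buf.add("q1")
buf.add("q2")
evicted = buf.add("q3")                         # buffer is full, so "q1" is evicted
graduated = buf.report("q2", [1, 0, 1, 0])      # mean accuracy 0.5 -> graduates
still_hard = buf.report("q3", [0, 0, 0, 1])     # mean accuracy 0.25 -> stays
```

In this sketch a question stays in circulation only while the student's accuracy on it is below one half, which mirrors the "zone of proximal development" framing: questions that become easy leave the buffer, and stale ones are pushed out by newer hard questions.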

2606.18209 2026-06-17 cs.LG 新提交

Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

重新思考用于分类的数据集蒸馏:蒸馏集是否优于核心集?

Trisha Mittal, Akshay Mehra, Joshua Kimball

发表机构 * Dolby Laboratory(杜比实验室)

AI总结 本文通过大规模标准化实验评估七种最先进的数据集蒸馏方法,发现其在大型数据集上性能不如或仅相当于核心集,且构建成本更高,核心集在数据分布覆盖和计算效率上更具优势。

AI中文摘要

数据集蒸馏(DD)已成为以数据为中心的机器学习中的一种重要方法,旨在通过将大型数据集中的信息压缩到少量合成样本中,合成紧凑的训练集以实现高效训练。然而,DD方法通常在不一致的评估协议下进行评估,从标准ERM到单/多教师监督,这使得难以从评估中分离出蒸馏数据的有效性。此外,许多先前方法声称DD优于数据剪枝方法(如核心集选择),其假设是将浓缩数据集限制为真实样本的子集会从根本上限制其表达能力。在这项工作中,我们通过使用标准化数据集和评估协议进行大规模实验,批判性地评估DD方法以评估其内在有效性。我们在ImageNet-1K、ImageNet100和ImageNette上对七种最先进的DD方法进行了基准测试,使用了三种广泛采用的训练协议,并与三种核心集策略进行比较。我们的结果表明,虽然一些DD方法甚至未能优于简单的随机子集,但最先进的DD方法在大型数据集上与核心集相当或更差,并且构建成本显著更高。除了准确性,我们还评估了浓缩集的代表性、多样性和质量,发现核心集始终能更好地覆盖原始数据分布。这些发现凸显了当前DD方法的实际优势有限,并表明核心集仍然具有竞争力,并且通常是以数据为中心的学习中计算效率更高的替代方案。

英文摘要

Dataset distillation (DD) has emerged as a prominent approach in data-centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV 新提交

Looped World Models

循环世界模型

Hongyuan Adam Lu, Z.L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

发表机构 * FaceMind Research Asia

AI总结 提出循环世界模型(LoopWM),通过参数共享的Transformer块迭代细化潜在环境状态,实现高达100倍参数效率,并建立迭代潜在深度作为世界模拟的新缩放轴。

Comments Technical Report

AI中文摘要

当前的世界模型面临一个基本矛盾:忠实的长期模拟需要深度计算,但更深的模型部署成本高且容易产生累积误差。我们通过引入循环世界模型(LoopWM)来解决这一问题,这是首个用于世界建模的循环架构。我们的方法通过一个参数共享的Transformer块迭代地细化潜在环境状态。这带来了高达100倍于传统方法的参数效率,并具有自适应计算能力,可自动调整深度以匹配每个预测步骤的复杂性。与缩放模型大小和训练数据正交,LoopWM建立了迭代潜在深度作为世界模拟的新缩放轴,这可能显著推动社区发展。

英文摘要

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches, with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.

2606.18206 2026-06-17 cs.AI 新提交

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

不动点推理器:稳定且自适应的深度循环Transformer

Sajad Movahedi, Vera Milovanović, Shlomo Libo Feigin, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto

发表机构 * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center(ELLIS研究所蒂宾根,马克斯·普朗克智能系统研究所,蒂宾根人工智能中心) ETH Zurich(苏黎世联邦理工学院) Swiss Institute of Bioinformatics(瑞士生物信息学研究所) Université Paris Cité(巴黎西岱大学) Liquid AI

AI总结 针对循环架构中深度导致的信号传播问题,提出基于预层归一化和残差缩放的FPRM模型,利用不动点收敛作为端到端停止机制,在Sudoku、Maze等推理基准上自适应计算并有效提升性能。

Comments Code available at this https URL (https://github.com/nilskiKonjIzDunava/fprm)

AI中文摘要

循环架构为学习需要组合推理的任务的逐步程序提供了归纳偏置。通过循环达到的有效层数决定了这些模型找到的解的质量。与深层架构类似,循环架构容易受到由深度引起的信号传播问题的影响,因为停止决策被推迟。在本文中,我们使用预层归一化和残差缩放来解决这个信号传播问题。基于这些架构修改,我们提出了FPRM,一种基于Transformer的不动点推理模型,它在循环架构中使用不动点收敛作为端到端停止机制。我们表明,不动点停止允许FPRM根据任务难度调整其计算量。FPRM在常见的推理基准(即Sudoku、Maze、状态跟踪和ARC-AGI)上是有效的。

英文摘要

Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The number of effective layers reached by looping determines the quality of the solution these models find. Like deep architectures, looped architectures are prone to a signal propagation problem induced by depth as the halting decision is postponed. In this paper, we address this signal propagation issue using pre-norm layers and residual scaling. Building on these architectural modifications, we propose FPRM, a Transformer-based Fixed-Point Reasoning Model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. We show that fixed-point halting allows FPRM to adapt its compute to task difficulty. FPRM is effective on common reasoning benchmarks, namely Sudoku, Maze, state-tracking, and ARC-AGI.
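
Fixed-point halting can be illustrated with a toy loop that re-applies the same update until the state stops changing. This is a minimal sketch under assumptions, not FPRM itself: the scalar contraction maps stand in for a shared Transformer block, and the tolerance, iteration cap, and function names are hypothetical.

```python
def iterate_to_fixed_point(step, state, tol=1e-6, max_iters=500):
    """Apply `step` repeatedly; halt when the largest coordinate change is below `tol`.

    Returns the final state and the number of iterations actually used.
    """
    for k in range(1, max_iters + 1):
        new_state = step(state)
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            return state, k
    return state, max_iters


def easy_step(state):
    """Strong contraction toward 2.0: converges in few iterations."""
    return [0.5 * x + 1.0 for x in state]


def hard_step(state):
    """Weak contraction toward 2.0: needs many more iterations."""
    return [0.9 * x + 0.2 for x in state]


easy_state, easy_iters = iterate_to_fixed_point(easy_step, [0.0])
hard_state, hard_iters = iterate_to_fixed_point(hard_step, [0.0])
```

Both runs approach the same fixed point, but the weaker contraction uses more iterations before halting. That is the sense in which a convergence-based stopping rule lets compute adapt to difficulty in this toy setting.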

2606.18205 2026-06-17 cs.CL 新提交

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

使用ISO语言标记框架和TEI Lex-0分析与编码Al-Mawrid阿拉伯语-英语词典

Diaa Fayed, Laurent Romary

发表机构 * Faculty of Information Technology and Computer Science, Sinai University(信息科技与计算机科学学院, Sinai大学) Inria ISO committee TC 37 (Language and terminology)(ISO术语委员会TC 37(语言和术语))

AI总结 本文提出了一种系统化方法,将Al-Mawrid阿拉伯语-英语词典数字化并编码为标准化计算词典,采用ISO LMF和TEI Lex-0双重标准,实现91%的结构解析准确率。

Comments 44 pages, 58 figures, 12 tables. Submitted to Language Resources and Evaluation, under review since Aug 2025, round 3

AI中文摘要

本文提出了一种稳健的方法,用于系统化数字化和编码Al-Mawrid阿拉伯语-英语词典,将其从传统的印刷资源转变为标准化的计算词典。针对阿拉伯语词汇基础设施中的显著空白,本研究采用双重标准框架,将ISO词汇标记框架(LMF)与文本编码倡议TEI Lex-0指南对齐。通过应用编辑视角处理词典的宏观和微观结构,研究解决了20世纪双语词典中典型的结构歧义和标点不一致问题。该方法基于对词典词汇知识密度的实证分析。基于代表性样本(字母Ayn,占总量的4.6%),研究为编码过程提供了科学依据,展示了91%的结构解析准确率。信息提取规则的定量评估显示出高性能,同义词的精确率为85%,召回率为98%,其他形态语义特征的精确率为88%。除了技术描述,本文还与现有阿拉伯语词汇资源进行了批判性比较,并讨论了TEI Lex-0在建模特定阿拉伯语现象(如隐式“开放集”语义关系和分散的形态线索)时的局限性。此外,研究通过建立可扩展的基于前缀的引用系统,探索了语言关联开放数据(LLOD)集成的潜力,促进了该资源在语义网中的包含。最终成果是一个可互操作、机器可处理的资源,为阿拉伯语自然语言处理和数字人文社区中复杂遗留双语词典的逆向数字化提供了可复现的工作流程。

英文摘要

This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.
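
For readers who want a single figure for the synonym extraction rules, precision and recall can be combined into an F1 score. The abstract does not report F1, so the value computed here is derived arithmetic for illustration only, and the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall for synonyms as stated in the abstract.
synonym_f1 = f1_score(0.85, 0.98)
```

The result is roughly 0.91, sitting closer to the lower of the two inputs, as the harmonic mean always does.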

2606.18203 2026-06-17 cs.CL cs.AI 新提交

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

RubricsTree: 面向个人健康代理在健康记忆与医疗技能上的可扩展且不断演进的开放式评估

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

发表机构 * Google Research(谷歌研究院) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出RubricsTree框架,通过专家对齐的层次化分类法(含100多个原子布尔规则)和上下文自适应路由,实现可扩展、可审计且不断演进的开放式评估,在HealthBench上使模型性能提升高达约66%。

AI中文摘要

基于LLM的个人健康代理利用用户健康(传感器)指标,为缓解全球医疗资源获取不均提供了有希望的途径。然而,大规模临床部署仍受限于开放式评估瓶颈:医生标注可靠但成本高且不可扩展,而LLM作为评判者的评估虽可扩展但主观、不一致,且有时临床对齐不佳。我们引入了RubricsTree,一个可扩展的评估框架,具有专家对齐的层次化分类法,包含超过100个原子级、临床可验证的布尔规则,这些规则通过迭代的人机协同策展协议(由经验丰富的医生领导的专家小组)从4000个真实用户查询的洞察中演化而来。一个上下文感知的自适应路由器每查询仅激活相关的自动加权规则子集,提供可扩展评估所需的吞吐量,同时保持专家对齐的质量。通过系统的元评估,我们展示了RubricsTree:(i) 在具有挑战性的开放式查询上,专家对齐程度显著超过强大的大规模评估基线;(ii) 可靠地惩罚上下文退化的响应;(iii) 当用作结构化指令、文本反馈或性能优化的训练奖励时,在HealthBench上为Gemini、GPT和Qwen模型系列带来高达约66%的相对提升。因此,RubricsTree为产品级个人健康AI的持续优化提供了可扩展、可审计且不断演进的评估基础设施。

英文摘要

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
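
The combination of Boolean rubrics, per-query routing, and weighting can be sketched in a few lines. Everything below is a simplified stand-in rather than the RubricsTree implementation: tag overlap replaces the context-aware router, the weights are hand-set rather than automatically derived, and the rubric ids, tags, and verdicts are hypothetical.

```python
def route(rubrics, query_tags):
    """Activate only rubrics whose tags overlap the query's tags (stand-in router)."""
    return [r for r in rubrics if r["tags"] & query_tags]


def weighted_score(active, verdicts):
    """Weighted fraction of active Boolean rubrics that the response satisfies."""
    total = sum(r["weight"] for r in active)
    if total == 0:
        return 0.0
    passed = sum(r["weight"] for r in active if verdicts[r["id"]])
    return passed / total


rubrics = [
    {"id": "r1", "tags": {"sleep"}, "weight": 2.0},
    {"id": "r2", "tags": {"sleep", "safety"}, "weight": 1.0},
    {"id": "r3", "tags": {"nutrition"}, "weight": 1.0},
]
active = route(rubrics, {"sleep"})                # r3 is not relevant to this query
verdicts = {"r1": True, "r2": False}              # Boolean judgments for active rubrics
score = weighted_score(active, verdicts)          # 2.0 / 3.0
```

Because each rubric is a yes/no check, the final score can be traced back to the specific rubrics that passed or failed, which is one way to read the "auditable" claim in the abstract.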

2606.18195 2026-06-17 cs.CL 新提交

Learning from the Self-future: On-policy Self-distillation for dLLMs

从自我未来学习:面向扩散LLM的在线自蒸馏

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

发表机构 * Tsinghua University(清华大学) Technical University of Munich(慕尼黑工业大学) Nanyang Technological University(南洋理工大学) University of British Columbia(不列颠哥伦比亚大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) ELLIS Institute Tubingen(ELLIS蒂宾根研究所) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) Tubingen AI Center(蒂宾根人工智能中心)

AI总结 提出首个面向扩散大语言模型的在线自蒸馏框架d-OPSD,通过自我生成答案作为后缀条件实现从自我未来经验学习,并将监督从词元级转向步骤级,在推理基准上以约10%的优化步骤超越RLVR和SFT基线。

Comments Preprint

详情
AI中文摘要

在线自蒸馏(OPSD)已被证明对后训练大型语言模型(LLMs)有效,但其在扩散LLMs(dLLMs)上的应用仍未探索。现有的OPSD方法本质上是自回归中心的,它们通过从左到右的前缀条件化和词元级差异监督注入特权信息,这种设计与dLLMs的任意顺序生成根本冲突。我们提出了d-OPSD,这是首个为dLLMs量身定制的OPSD框架。我们的方法有两个核心贡献。首先,我们通过使用自我生成的答案作为后缀条件来重新构建自我教师,使学生模型能够从“自我未来经验”而非特权前缀中学习。其次,我们将监督从词元级转向步骤级,使训练与dLLMs的迭代去噪过程对齐。在四个推理基准上的实验表明,d-OPSD在样本效率上始终优于RLVR和SFT基线,仅需RLVR约10%的优化步骤,为dLLM后训练开辟了一条有前景的途径。代码可在该https URL获取。

英文摘要

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at this https URL.

2606.18192 2026-06-17 cs.AI 新提交

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

斯坦福EDGAR文件数据集:将美国公司及财务披露重建为布局忠实且令牌高效的预训练数据

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Nanjing University(南京大学) Stanford University(斯坦福大学)

AI总结 为解决长上下文文档稀缺问题,提出SEFD数据集,将SEC文件重建为布局忠实的MultiMarkdown格式,用于金融语言建模与评估,具有令牌高效、与Common Crawl重叠率低于0.1%的特点。

Comments Preprint. Includes appendix, tables, and figures

详情
AI中文摘要

随着高质量公共网络语料库日益枯竭,干净的长上下文文档已成为大型语言模型(LLM)训练数据中稀缺且昂贵的来源。现有的长上下文语料库通常是专有的且获取成本高昂、合成生成的,或集中在编程等狭窄领域。我们介绍了斯坦福EDGAR文件数据集(SEFD),这是将SEC文件重建为布局忠实的MultiMarkdown格式的开放数据集,用于金融语言建模和评估。SEFD使经过审计的财务报表、风险披露、所有权报告、会计说明和影响市场的事件文件能够用作长上下文预训练数据,并作为金融推理、预测、合规和文档理解的基础。生成的语料库令牌高效、可直接用于模型,并且与Common Crawl衍生的语料库重叠率低于0.1%。我们发布了SEFD-v1,一个152B令牌的初始公共快照,并提供了更大的1850万文件档案(估计为550B令牌)的语料库级分析。我们进一步引入了两个基于SEFD的基准:EDGAR-Forecast,用于评估模型知识截止后基于文件的数值预测;以及EDGAR-OCR,用于评估复杂金融表格的转录。

英文摘要

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

2606.18191 2026-06-17 cs.AI cs.MA 新提交

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

DRFLOW:用于个性化工作流预测的深度研究基准

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

发表机构 * ServiceNow AI Research(ServiceNow人工智能研究)

AI总结 提出DRFLOW基准,评估AI代理从异构源预测个性化工作流的能力,包含5领域100任务,并设计7个诊断指标,实验显示现有代理性能有限。

详情
AI中文摘要

深度研究(DR)系统越来越多地用于复杂信息寻求任务,但现有工作主要关注生成报告和摘要。相比之下,许多企业任务需要代理识别具体的工作流,即一系列行动步骤。例如,代理不应总结预算政策,而应能确定回答诸如“在固定预算下如何申请新员工?”这类问题所需的步骤。因此,我们引入DRFLOW,一个用于评估代理从异构源预测个性化工作流的基准。每个任务要求代理从分散来源中识别相关证据,然后使用这些证据预测用户任务的正确行动步骤序列。DRFLOW包含跨五个领域的100个任务,1246个参考工作流步骤,基于超过3900个来源。我们定义了七个诊断指标,涵盖事实依据、步骤恢复、结构排序、条件解决和个性化。我们进一步提出DRFLOW-Agent(DRFA),一个面向工作流的参考代理,用于预测个性化工作流。我们表明,尽管DRFA相比强基线代理有所改进(平均F1分数提升高达10.02%),但在这些工作流指标上仍有很大的改进空间,表明预测完整且正确的个性化工作流仍然是深度研究的一个挑战性前沿。

英文摘要

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing works mainly focus on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify concrete workflows, that is, sequences of action steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?". Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. We show that although DRFA improves over strong baseline agents (by up to 10.02% in average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.
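
As a rough illustration of what a step-recovery score might look like, the sketch below computes an F1 score between predicted and reference workflow steps. The abstract does not define DRFLOW's seven metrics, so exact string matching, the function name `step_f1`, and the example steps are assumptions; a real benchmark would likely match steps semantically rather than by exact text.

```python
from typing import Iterable


def step_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """F1 over workflow steps, using exact string match (an assumption)."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or references
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical workflow for the headcount question quoted in the abstract.
reference_steps = ["check budget", "draft request",
                   "get VP approval", "submit to HR"]
predicted_steps = ["check budget", "draft request", "email manager"]

f1 = step_f1(predicted_steps, reference_steps)
```

With two of three predicted steps correct (precision 2/3) and two of four reference steps recovered (recall 1/2), the F1 is 4/7. A set-based score like this ignores step order, which is presumably why the benchmark reports structural ordering as a separate metric.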

2606.18189 2026-06-17 cs.RO 新提交

Beyond Failure Recovery: An Engagement-Aware Human-in-the-loop Framework for Robotic Systems

超越故障恢复:一种面向机器人系统的参与感知人在回路框架

Jiaying Fang, Joyce Yang, Zhanxin Wu, Bohan Yang, Tapomayukh Bhattacharjee

发表机构 * Cornell University(康奈尔大学)

AI总结 提出一种参与感知模型预测控制(E-MPC)方法,通过规划交互频率和类型来维持用户参与度并控制工作负荷,在机器人辅助进食系统中验证了其提升用户体验且不降低任务成功率的效果。

Comments Project website at this https URL (https://emprise.cs.cornell.edu/empc)

详情
AI中文摘要

传统的人机协同方法通常仅在机器人遇到故障或不确定性时才让用户介入,将人类主要视为提升机器人性能的工具。然而,在许多以人为中心的机器人环境中,交互应通过让用户参与决策来支持参与度,而非将其限制于故障驱动的干预。这在物理护理场景中尤为突出,因为行动受限会降低用户实时干预或调节机器人行为的能力。因此,故障驱动的交互策略可能使用户在任务的大部分时间里沦为被动观察者。例如,行动受限的用户在持续被动接受机器人喂食时可能感到参与度不足。同时,过于频繁的交互可能令人疲惫并增加用户工作负荷。为解决这一权衡,我们提出了一种用户参与感知方法——参与感知模型预测控制(E-MPC),该方法规划交互以在维持参与度的同时满足工作负荷约束。E-MPC利用一个用户交互动力学模型,该模型捕捉用户参与度如何随交互频率和类型变化。机器人并非仅在任务执行出现困难时才请求输入,而是主动考虑用户在整个任务中偏好的参与水平,平衡自主性与交互,同时确保任务成功。我们通过多项消融实验和基线对比在仿真中评估了E-MPC。结果表明,该方法在多种用户画像下均有效。此外,我们在一个机器人辅助咬取系统中,与模拟行动受限的真实参与者进行了用户研究,显示E-MPC在维持任务成功的同时改善了用户体验。

英文摘要

Conventional human-in-the-loop approaches typically involve users only when a robot encounters failure or uncertainty, treating humans primarily as tools for improving robot performance. However, in many human-centered robotics settings, interaction should support engagement by keeping users involved in decision-making rather than limiting them to failure-driven interventions. This is particularly compelling in physical caregiving, where mobility limitations can reduce users' ability to intervene or modulate the robot's behavior in the moment. As a result, failure-driven interaction policies may relegate users to passive observers for long stretches of the task. For example, a user with mobility limitations may feel less engaged when being continuously and passively fed by a robot. At the same time, overly frequent interaction can be tiring and increase the user's workload. To address this trade-off, we propose Engagement-aware MPC (E-MPC), a user-engagement-aware method that plans interaction to maintain engagement while respecting a workload constraint. E-MPC leverages a user interaction dynamics model that captures how user engagement evolves as a function of both the frequency and type of interaction. Rather than requesting input only when difficulties arise during task execution, the robot proactively considers the user's preferred level of engagement throughout the task, balancing autonomy and interaction while ensuring task success. We evaluate E-MPC in simulation with several ablations and baseline comparisons. Results demonstrate the effectiveness of our approach across diverse user personas. In addition, we conduct a real-world user study with participants with emulated mobility limitations on a robot-assisted bite acquisition system, showing that E-MPC improves user experience while maintaining task success.

2606.18186 2026-06-17 cs.LG cs.AI 新提交

Kolmogorov Regression for Robust Diffusion Policies

用于鲁棒扩散策略的Kolmogorov回归

Lekan Molu

发表机构 * Bala Cynwyd, PA 19004(巴拉辛威德, PA 19004)

AI总结 提出后向Kolmogorov方程将扩散策略提升至Cameron-Martin空间,用确定性边界值PDE问题替代随机分数匹配,通过精度加权损失和残差诊断实现收敛保证、轨迹规则化和无奖励故障检测。

详情
AI中文摘要

有限维扩散策略由于离散化伪影导致时间漂移,降低了长期性能(当部署在物理系统上时)。我们引入了一个后向Kolmogorov方程,将扩散策略提升至Cameron-Martin空间——希尔伯特空间的一个子集。本质上,用确定性边界值PDE问题替代随机分数匹配。我们的核心创新基于高斯测度理论,其中扩散噪声协方差算子由有色噪声分布实现,该分布规定了推理时模型样本的正则性概念。我们使用推导出的精度加权Cameron-Martin损失训练扩散模型,并引入Kolmogorov残差作为推理时的PDE诊断。这些替换产生了:(i) 收敛保证,其中界的常数取决于核的有效秩而非动作维度,(ii) 通过谱加权改进轨迹规则性,以及(iii) 无需奖励信号的确定性故障检测器。在两个应用领域的验证显示了显著改进:在PushT操作基准测试中,Cameron-Martin损失在最大回合奖励上实现了17%的提升(0.95对比0.78的MSE),并通过引入的残差幅度在推理期间减少了67.6%的步间漂移。类似地,在具有恒定在制品(CONWIP)流量控制的6站生产线上,我们实现了比经典LSTM基线低28.4%的RMSE;高饥饿事件召回率(测试周期中为1.0),以及有效的瓶颈识别(测试集中Precision@1=1.0,信噪比13倍)。然后,我们使用Hamilton-Jacobi可达性理论认证调度策略,与100次模拟运行中的无控制调度相比,死锁事件减少了96%(防止了351个事件)。

英文摘要

Finite-dimensional (FD) diffusion policies exhibit temporal drift owing to discretization artifacts that degrade long-horizon performance when deployed on physical systems. We introduce a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space -- a subset of the Hilbert space. In essence, this replaces stochastic score matching with a deterministic boundary-value PDE problem. Our core innovation builds on Gaussian measure theory, in which the diffusion noise covariance operator is realized from a colored noise distribution that prescribes a notion of regularity on samples from the model at inference time. We train the diffusion model with a derived precision-weighted Cameron-Martin loss and introduce a Kolmogorov residual as a PDE diagnostic during inference. These substitutions yield (i) convergence guarantees where the bound's constants depend on the effective rank of the kernel rather than action dimension, (ii) improved trajectory regularity via spectral weighting, and (iii) a deterministic failure detector without reward signals. Validation across two application domains demonstrates substantial improvements: on the PushT manipulation benchmark, the Cameron-Martin loss achieves a 17% improvement in maximum episode reward (0.95 vs. 0.78 for MSE) and a 67.6% reduction in inter-step drifts during inference via the introduced residual magnitude. Similarly, on a 6-station manufacturing line with constant work-in-process (CONWIP) flow control, we achieve 28.4% lower RMSE than classical LSTM baselines, a high starvation-event recall (1.0 in test cycles), and effective bottleneck identification (Precision@1 = 1.0 in test set, 13x signal-to-noise ratio). We then certify the dispatch policies with Hamilton-Jacobi reachability theory, which reduces deadlock events by 96% compared to uncontrolled dispatch over 100 simulated runs (351 events prevented).
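
For readers unfamiliar with the term, the standard Cameron-Martin norm can be written out as below. The first two lines are the textbook definition for a Gaussian measure $\mathcal{N}(0, C)$ on a separable Hilbert space $H$, where $C$ is a trace-class covariance operator with eigenpairs $(\lambda_k, e_k)$. The last line is only an assumed form of a "precision-weighted" loss, with $\epsilon_\theta$ denoting the model's noise prediction and $\epsilon$ the target; the abstract does not give the paper's exact loss.

```latex
\begin{align}
  H_C &= C^{1/2} H, \\
  \|h\|_{H_C}^2 &= \big\| C^{-1/2} h \big\|_{H}^2
               = \sum_{k \ge 1} \frac{\langle h, e_k \rangle_H^2}{\lambda_k}, \\
  \mathcal{L}(\theta) &= \mathbb{E}\,\big\| \epsilon_\theta - \epsilon \big\|_{H_C}^2
               = \mathbb{E} \sum_{k \ge 1} \lambda_k^{-1}
                 \langle \epsilon_\theta - \epsilon, e_k \rangle_H^2
  \quad \text{(assumed form)}.
\end{align}
```

Because the weights $\lambda_k^{-1}$ grow as the eigenvalues decay, errors along low-variance (typically high-frequency) directions are penalized more heavily than under a plain MSE, which is consistent with the abstract's claim of improved trajectory regularity via spectral weighting.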

2606.18180 2026-06-17 cs.CV 新提交

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

EgoCS-400K:面向世界模型的自我中心游戏数据集

Rongjin Guo, Dong Liang, Yuhao Liu, Fang Liu, Tianyu Huang, Gerhard P. Hancke, Rynson W. H. Lau

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 为支持世界模型研究,构建大规模自我中心游戏数据集EgoCS-400K,包含40万第一人称视频和1万小时游戏轨迹,支持动作条件未来预测、状态事件场景展开等交互式视觉建模任务。

详情
AI中文摘要

从视频生成到交互式世界建模的转变对数据提出了新要求:除了带字幕的视频外,世界模型还需要基于驱动未来场景变化的动作、相机运动、状态和事件的时间对齐的视频-动作-语言轨迹。然而,大规模获取此类数据十分困难。网络视频数据集提供广泛的视觉覆盖,但缺乏可执行动作和可靠状态;机器人数据集提供动作和状态监督,但成本高昂且场景多样性有限;现有模拟器通常缺乏大规模人类驱动的交互轨迹。在本文中,我们介绍了EgoCS-400K,一个大规模基于回放的自我中心反恐精英世界模型数据集,该数据集基于公开的职业CS和CS2比赛演示构建,保留了人类游戏轨迹,并支持解析、回放、渲染和时间对齐。我们提取玩家状态、视角方向、移动、键盘/按钮输入、视角变化、武器使用、游戏事件和回合级上下文,并从相同轨迹渲染干净的第一人称视频。EgoCS-400K包含超过40万第一人称视频和1万小时游戏时间,来自1000多场比赛和4万回合,涵盖13张地图和每回合10个玩家视角。它支持一系列交互式视觉建模任务,包括动作条件未来预测、状态和事件感知场景展开、基于回放的描述以及智能体自我中心动作理解。通过大规模连接视觉观察与人类动作、相机运动、游戏状态和事件,EgoCS-400K在被动网络视频、可控游戏模拟和昂贵的真实世界具身数据之间架起了一座实用桥梁。

英文摘要

The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.

2606.18156 2026-06-17 cs.CV cs.AI 新提交

ReAge3D: Re-Aging 3D Faces with View Consistency

ReAge3D:具有视角一致性的3D人脸回龄

Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

发表机构 * Texas A&M University(德克萨斯农工大学) Netflix Eyeline Studios

AI总结 提出ReAge3D框架,通过2D扩散模型DiffReaging和中心向外编辑传播策略,实现多视角一致的3D人脸回龄,保持身份和细节,优于现有方法。

详情
AI中文摘要

我们提出了一种新颖的框架,用于实现逼真且可控的3D人脸回龄,生成高度详细、保留身份的结果。现有的3D编辑方法虽然对粗粒度的语义变化有效,但不适合回龄,因为即使回龄2D视图之间的微小不一致也会导致对微妙但感知上重要的年龄相关细节的过度平滑。为了解决这一挑战,我们首先引入了一个基于2D扩散的回龄模型DiffReaging,该模型在合成生成的图像对上训练。我们进一步提出了一种中心向外编辑传播策略,利用该回龄模型重建多视图一致的回龄图像。具体来说,从回龄的正面枢轴视图开始,我们通过扭曲和我们提出的Masked-DiffReaging过程重建其余视图。通过在扩散过程的每一步注入现有内容,Masked-DiffReaging确保重建区域与现有像素保持连贯。由此产生的一致回龄视图集监督回龄3D表示的优化。我们的方法在视觉上和定量上都优于现有的3D编辑技术,能够对3D人脸模型中的年龄变换进行平滑、细粒度的控制。

英文摘要

We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.
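
The idea of injecting existing content at every diffusion step can be sketched as a masked overwrite inside the denoising loop. This is a toy illustration on a flat list of pixel values, not the paper's Masked-DiffReaging procedure: the function name `masked_denoise`, the stand-in `toy_step` denoiser, and the choice to paste known pixels back without re-noising them are all assumptions.

```python
from typing import Callable, List


def masked_denoise(x: List[float],
                   known: List[float],
                   mask: List[bool],
                   denoise_step: Callable[[List[float], int], List[float]],
                   num_steps: int) -> List[float]:
    """Run a denoising loop, re-injecting known pixels after every step.

    mask[i] is True where pixel i already has content (e.g. warped from
    the pivot view) and must be preserved.
    """
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        # Overwrite the known region so the free region is always
        # denoised in the context of the existing pixels.
        x = [k if m else v for v, k, m in zip(x, known, mask)]
    return x


def toy_step(x: List[float], t: int) -> List[float]:
    """Stand-in denoiser: pulls every value halfway toward 0.5."""
    return [0.5 * v + 0.25 for v in x]


known = [1.0, 0.0, 0.0, 0.0]
mask = [True, False, False, True]
out = masked_denoise([0.3, 0.9, 0.1, 0.7], known, mask, toy_step, num_steps=3)
```

After the loop, the masked positions hold exactly the known values, while the unmasked positions have been updated by the denoiser at every step. In a real diffusion model the denoiser sees the whole image, so the preserved pixels influence how the missing regions are filled in.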

2606.18154 2026-06-17 cs.AI 新提交

Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

通过智能体发现混合结构学习心脏电生理数字孪生

Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

发表机构 * Rochester Institute of Technology(罗彻斯特理工学院)

AI总结 提出LEADS框架,利用LLM智能体在结构化动作空间中迭代发现混合物理-神经模型,实现个性化心脏电生理数字孪生构建,优于人工设计和其他LLM方法。

Comments 10 pages, 4 figures

详情
AI中文摘要

构建个性化心脏电生理(EP)数字孪生需要为每个患者识别合适的模型结构,而不仅仅是拟合参数。传统方法依赖专家手动指定混合物理-神经架构,这需要深厚的领域专业知识,且无法跨患者迁移。最近的工作应用大型语言模型(LLM)来生成或充当混合模型。然而,尽管这些基于LLM的方法具有有希望的泛化能力,但它们缺乏稳定心脏模拟所需的结构先验。因此,我们提出LEADS,一个将心脏EP领域知识形式化为结构化动作空间,并利用LLM智能体发现混合模型的框架。该智能体遵循迭代推理-行动循环来选择、组合和优化混合模型,同时梯度下降处理参数拟合。所提出的LEADS设计每个候选模型都朝向物理基础、可解释和数值稳定,同时允许开放式的架构发现。我们在具有三个真实反应模型的合成数据和真实心脏EP数据上验证了LEADS,证明其优于人工设计的混合模型和其他基于LLM的混合建模方法。

英文摘要

Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, while gradient descent handles parameter fitting. LEADS steers every candidate model toward being physically grounded, interpretable, and numerically stable, while allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling approaches.

2606.18153 2026-06-17 cs.CV 新提交

Neural Tree Reconstruction for the Open Forest Observatory

开放森林观测站的神经树重建

Marissa Ramirez de Chanlatte, Arjun Rewari, Trevor Darrell, Derek J. N. Young

发表机构 * Berkeley AI Research, University of California, Berkeley(加州大学伯克利分校伯克利人工智能研究) Department of Plant Sciences, University of California, Davis(加州大学戴维斯分校植物科学系)

AI总结 针对开放森林观测站中经典运动恢复结构方法重建质量差的问题,提出引入神经辐射场提升3D树重建的细节与鲁棒性,并展望未来工作。

Comments Published as a workshop paper at "Tackling Climate Change with Machine Learning", ICLR 2024

详情
AI中文摘要

开放森林观测站(OFO)是一项跨大学及其他合作伙伴的合作项目,旨在让生态学家、土地管理者和公众能够低成本地进行森林测绘。OFO正在构建一个地理空间森林数据库,以及通过无人机进行森林测绘的开源方法和工具。这些数据对多种气候应用非常有用,包括优先安排重新造林工作、减少野火风险以及监测碳封存。在OFO森林地图数据库的当前版本中,3D树图是使用经典的运动恢复结构技术创建的。这种方法容易出现伪影,缺乏细节,并且在森林地面(输入数据即俯拍图像的可视性有限)上尤其困难。这些重建错误可能会传播到下游的科学任务中(例如野火模拟)。3D重建的进展,包括神经辐射场(NeRF)等方法,产生了更高质量的结果,对稀疏视图更具鲁棒性,并支持数据驱动的先验。我们探索了将NeRF纳入OFO数据集的方法,概述了支持更先进的3D视觉模型的未来工作,并描述了高质量3D重建对林业应用的重要性。

英文摘要

The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for forest mapping by uncrewed aerial vehicle. Such data are useful for a variety of climate applications, including prioritizing reforestation efforts, informing wildfire hazard reduction, and monitoring carbon sequestration. In the current iteration of the OFO's forest map database, 3D tree maps are created using classical structure-from-motion techniques. This approach is prone to artifacts, lacks detail, and has particular difficulty on the forest floor, where the input data (overhead imagery) has limited visibility. These reconstruction errors can potentially propagate to downstream scientific tasks (e.g., a wildfire simulation). Advances in 3D reconstruction, including methods like Neural Radiance Fields (NeRF), produce higher-quality results that are more robust to sparse views and support data-driven priors. We explore ways to incorporate NeRFs into the OFO dataset, outline future work to support even more state-of-the-art 3D vision models, and describe the importance of high-quality 3D reconstructions for forestry applications.

2606.18147 2026-06-17 cs.AI 新提交

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

WEQA: 可穿戴健康问答中的查询自适应智能推理

Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

发表机构 * University of Cambridge(剑桥大学) Tsinghua University(清华大学) University College London(伦敦大学学院) Dartmouth College(达特茅斯学院) Google Research(谷歌研究院)

AI总结 提出WEQA框架,通过LLM控制器动态组合传感器分析与预训练模型,实现可穿戴健康数据问答,在基准测试中准确率提升24%,专家评估显示实用性和临床合理性显著提高。

详情
AI中文摘要

语言模型在医学问答中表现出色,有时甚至超过普通医生的准确率。然而,关于可穿戴健康数据的问题回答仍然具有挑战性且研究不足,因为这些无处不在的传感器产生连续、高维和纵向的数据,难以与LLM预训练中的文本中心分布对齐。传感器模态和用户意图的多样性无法通过固定的推理工作流或单一的预训练基础模型有效处理。为了解决这些挑战,我们提出了WEQA,一个查询自适应智能体框架,将LLM推理与专门的可穿戴分析和建模工具统一起来。采用LLM控制器来合成执行计划,动态地将每个查询路由到适当的传感器分析和预训练模型组合,并利用外部知识进行基于证据的响应审计。我们还整理了一个基准测试,涵盖四个开放的可穿戴数据集,包括三个不同健康领域的分析和预测任务。实验表明,我们的框架比LLM和智能体基线准确率提高24%,一项由12名医学专家和8名用户进行的盲法研究显示,在实用性和临床合理性方面有显著提升。

英文摘要

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

2606.18144 2026-06-17 cs.AI cs.CY cs.LG cs.RO 新提交

Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So

记忆作为消耗性资产:为具身智能体定价闪存耐久性及其局限性

Josef Liyanjun Chen

发表机构 * KAIKAKU

AI总结 本文提出将机器人闪存耐久性视为折旧资本,通过单一影子价格η进行定价,实现成本最优的存储层级分配,并基于真实机器人日志测量价值-写入关联χ的符号,发现其取决于部署场景。

详情
AI中文摘要

机器人的闪存耐久性是一种不可再生资源:每次持久化写入都会消耗数千次编程/擦除周期中的一次,且无法补充,然而目前没有实际部署的机器人内存系统对哪些记忆值得消耗一次擦除周期进行定价。我们将具身记忆视为折旧资本,并用单一耐久性影子价格η对该资源定价,这使得在RAM/板载NVM/云层级中进行成本最小化的放置成为一个在磨损增强的每字节索引中的阈值。无论价值-写入关联χ的符号如何,该索引都是成本最优的;只有当χ>0时,最优解才变为非单调,将机器人最有价值的记忆从闪存中移出。因此,关键点是经验性的,我们在预定义的关口上测量真实机器人日志中的χ:其符号是部署场景的一个属性——在重复的长时域操作中为正(χ̂≈+1.0×10^{-3},在全功率下可复现),在较短时域任务中为零,在非重复遥操作中为负。两个边界限制了该结果。在高端3,000 P/E TLC闪存按数据手册价格计算时,耐久性预算处于休眠状态;而在廉价边缘机器人使用的商用QLC/eMMC(约1,000 P/E)上则具有约束力。当约束生效时,学习到的磨损感知控制器仅在任务价值上与基于价格的路由持平,因为实现的价值在RAM、NVM和云层级之间是不变的:租金决定设备寿命和成本,而非任务性能。磨损感知放置是否能提高任务价值仍是一个开放问题——χ是针对价值代理测量的,而非单调最优解虽已被证明,但尚未在数据中观察到。

英文摘要

A robot's flash endurance is a non-renewable stock: every persisted write spends one of a few thousand program/erase cycles and never refills, yet no fielded robot memory system prices which memories are worth an erase cycle. We treat embodied memory as depreciating capital and price that stock with a single endurance shadow price $\eta$, which makes cost-minimizing placement across a RAM / on-board NVM / cloud hierarchy a threshold in a wear-augmented per-byte index. The index is cost-optimal whatever the sign of the value-write association $\chi$; only when $\chi > 0$ does the optimum turn non-monotone, sending a robot's most valuable memories off its flash. The pivot is thus empirical, and we measure $\chi$ on real robot logs at a pre-specified gate: its sign is a property of the deployment regime -- positive on recurrent long-horizon manipulation ($\hat{\chi} \approx +1.0 \times 10^{-3}$, replicated at full power), null on a shorter-horizon suite, and negative on non-recurrent teleoperation. Two boundaries scope the result. The endurance budget is dormant on premium 3,000-P/E TLC at datasheet prices and binding on the commodity QLC/eMMC ($\sim$1,000 P/E) that cheaper edge robots run. And where it binds, a learned wear-aware controller only ties price-based routing on task value, because realized value is tier-invariant across RAM, NVM, and cloud: the rent governs device lifetime and cost, not task performance. Whether wear-aware placement improves task value remains open -- $\chi$ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.
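
The contrast between a "dormant" budget on 3,000-P/E TLC and a "binding" one on roughly 1,000-P/E QLC/eMMC follows from simple endurance arithmetic, sketched below. The formula is the usual back-of-the-envelope lifetime estimate rather than the paper's pricing model, and the 64 GB capacity, 100 GB/day write rate, and function name `flash_lifetime_days` are assumptions chosen for illustration.

```python
def flash_lifetime_days(pe_cycles: int,
                        capacity_gb: float,
                        writes_gb_per_day: float,
                        write_amplification: float = 1.0) -> float:
    """Estimate days until the rated program/erase budget is exhausted.

    Total writable volume = P/E cycles x capacity / write amplification.
    """
    if writes_gb_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write rate and amplification must be positive")
    total_writable_gb = pe_cycles * capacity_gb / write_amplification
    return total_writable_gb / writes_gb_per_day


# Hypothetical 64 GB device written at 100 GB/day (illustrative numbers).
tlc_days = flash_lifetime_days(3000, 64, 100)   # premium TLC
qlc_days = flash_lifetime_days(1000, 64, 100)   # commodity QLC/eMMC
```

Under these assumed numbers the TLC part lasts three times as long as the QLC part, so the same write policy that never approaches the endurance limit on one device can exhaust it within the service life of the other. That is the regime in which a shadow price on erase cycles would start to change placement decisions.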

2606.18142 2026-06-17 cs.AI cs.CL cs.CY 新提交

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

你的AI旅行代理会为你预订斗牛:前沿AI模型中隐含动物福利的代理基准

Jasmine Brazilek, Oliver Tulio, Joel Christoph, Miles Tidmarsh, Carol Kline, Arturs Kanepajs

发表机构 * Compassion Aligned Machine Learning(同情对齐机器学习) Sentient Futures(感知未来) Harvard Kennedy School(哈佛肯尼迪学院) Appalachian State University Department of Management(阿巴拉契亚州立大学管理系)

AI总结 提出首个代理基准TAC,测试AI代理在为用户执行旅行预订等操作时是否避免涉及动物剥削的选项。评估七个前沿模型,所有模型得分低于随机水平64%,最佳模型仅53%。

详情
AI中文摘要

AI代理正从顾问转变为行动者,代表用户预订旅行、规划菜单和管理采购。现有的AI与动物福利基准评估模型对问答提示的文本响应,但未检验这些响应中的福利推理是否迁移到代理部署中(模型必须使用工具采取行动)。我们引入TAC(旅行代理同情心),这是首个衡量AI代理在代表用户行动时是否避免涉及动物剥削选项的代理基准。TAC向AI代理提供十二个手工编写的旅行预订场景,涵盖六类动物剥削,并扩展至四十八个样本以控制价格、评分和位置混淆因素。我们评估了来自四个实验室的七个前沿模型。每个模型得分均低于随机水平64%,最佳表现者(Claude Opus 4.7)为53%。系统提示中的单一福利意识句子在Claude和GPT-5.5中带来47至63个百分点的提升,在GPT-5.2中提升26个百分点,在DeepSeek和Gemini中提升不足12个百分点。一项辅助的Inspect Scout审计(使用Gemini 2.5 Flash Lite作为评判者,对前两名模型的288个基础条件转录进行审计)未标记任何评估意识转录,表明低于随机水平的比率并非源于模型识别出评估。我们讨论了跨文化领域的类别级变化、文本响应福利基准的局限性以及欧盟通用AI实践准则系统性风险框架的影响。

英文摘要

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2606.18135 2026-06-17 cs.SD cs.AI 新提交

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)

描述符:Certus 口径分类枪声数据集 (C3GD)

Sinclair Gurny, Ryan Quinn

发表机构 * Certus Innovations

AI总结 介绍一个公开的枪声数据集 C3GD,包含超过8000个来自28种枪支、16种口径的实地采集数据点,用于口径分类、枪声检测等任务,提供丰富的元数据以支持泛化与学术分析。

详情
AI中文摘要

在这项工作中,我们介绍了 Certus 口径分类枪声数据集 (C3GD),这是一个公开可访问的数据集,用于分析枪口爆炸声。该数据集旨在提供多种枪支、口径、弹药、麦克风和麦克风位置,其元数据详细程度超过当前已有的其他数据集。它包含来自28种枪支、16种口径的超过8000个实地采集数据点。由于实地数据采集成本高昂,现有研究多使用从互联网收集的枪声音频,这增加了低质量数据和标签噪声的风险。该数据集主要关注口径分类,但也可用于枪声检测、音频分离和音频信号处理,提供了多样化的真实世界参考。该数据集旨在提供足够的多样性,以便泛化到更多实际应用,同时提供足够的元数据以进行详细的学术分析。

英文摘要

In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible dataset developed for the analysis of firearm muzzle blast sounds. The dataset aims to provide a wide variety of firearms, calibers, cartridges, microphones, and microphone locations, with metadata more detailed than what is otherwise currently available. It comprises more than 8000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified, real-world reference. The dataset aims to provide enough diversity to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.

2606.18132 2026-06-17 cs.AI 新提交

Knowledge Reutilization in Meta-Reinforcement Learning

元强化学习中的知识复用

Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出一种元知识复用框架,通过动力学简化智能体学习任务知识并迁移至异构智能体,利用贝叶斯非参数先验和高层策略生成任务级指导,显著降低跟踪误差并提高样本效率。

Comments 18 pages initial submission

详情
AI中文摘要

元强化学习通过从相关任务中提取共享结构实现快速适应,但现有的端到端方法通常将任务推理与具身特定控制耦合。这种耦合可能模糊非参数任务语义,降低样本效率,并限制跨智能体复用。我们提出一个元知识复用框架,在动力学简化的智能体上学习任务级知识,并将其迁移至异构智能体。该框架使用贝叶斯非参数先验组织潜在任务模式,并使用高层策略生成任务级幅度指导。为了桥接可复用任务知识与不同具身,我们引入一个语义-幅度接口和一个轻量级时间适配器,将冻结的元知识转换为具身特定低层控制器的时间对齐子目标。在多个运动智能体上的实验表明,与最近的最先进基线相比,我们的框架将最终步跟踪误差降低了94.75%–99.79%,并且仅使用约23.8%的交互数据即可达到相当的部署性能。

英文摘要

Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To bridge reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that our framework reduces final-step tracking error by 94.75% -- 99.79% compared with recent state-of-the-art baselines and achieves comparable deployment performance with about 23.8% of their interaction data.

2606.18124 2026-06-17 cs.CL New submission

Unintended Effects of Geographic Conditioning in Large Language Models

大型语言模型中地理条件化的意外效应

Naz Col, David M. Chan

Affiliations University of California, Berkeley(加州大学伯克利分校)

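AI summary Evaluates the geographic leakage that arises when large language models receive geographically neutral prompts alongside location information in user metadata, and finds that the location placeholder "Unknown" itself also triggers leakage, revealing a generative conditioning effect of the user profile frame.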

Comments To appear at the Second Workshop on Customizable NLP (CustomNLP4U) at ACL 2026

AI Chinese abstract

现代对话式AI系统经常依赖用户元数据来本地化响应,但由这种隐藏上下文引入的意外区域偏见仍然知之甚少。在这项工作中,我们评估了位置泄露:即模型在接收地理中立用户提示时仍生成地理引用的现象。在创意写作和开放式问答提示中,即使是最先进的LLM,在暴露于位置元数据时也会系统性地偏向特定区域的输出,泄露比基线高出多达793倍(例如,Llama 3.1-8B从0.04%增加到31.7%,Qwen3-8B和Claude Sonnet 4.6分别为21.3%和8.8%)。我们的分析进一步揭示了一种新颖的结构性条件化效应:将注入的位置替换为占位符“Unknown”仍会使泄露比基线高出多达72倍,这表明用户档案框架本身,独立于任何地理内容,充当了生成条件化信号。

English abstract

Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate location leakage: the phenomenon where a model generates geographic references despite receiving a geographically neutral user prompt. Across both creative writing and open-ended Q&A prompts, even state-of-the-art LLMs systematically favor region-specific outputs when exposed to location metadata, with leakage spiking by up to 793 times above baseline (e.g., from 0.04% to 31.7% for Llama 3.1-8B, and 21.3% and 8.8% for Qwen3-8B and Claude Sonnet 4.6, respectively). Our analysis further shows a novel structural conditioning effect: replacing the injected location with the placeholder "Unknown" still elevates leakage by up to 72 times above baseline, demonstrating that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.
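As a quick check on the headline ratio, the fold increase can be recomputed from the rounded rates quoted in the abstract. This is a minimal sketch, not the authors' evaluation code: `fold_increase` is a hypothetical helper, and the inputs are the rounded percentages given for Llama 3.1-8B. With these rounded inputs the result is about 792.5, close to the quoted "up to 793 times" figure, which was presumably computed from unrounded rates.

```python
def fold_increase(baseline_rate: float, observed_rate: float) -> float:
    """Return how many times larger observed_rate is than baseline_rate.

    Hypothetical helper for illustration; both rates must use the same unit
    (here, percent of responses containing a geographic reference).
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return observed_rate / baseline_rate


# Rounded rates quoted in the abstract for Llama 3.1-8B, in percent.
llama_fold = fold_increase(baseline_rate=0.04, observed_rate=31.7)
print(f"Fold increase over baseline: {llama_fold:.1f}x")
```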

2606.18123 2026-06-17 cs.CV New submission

Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology

使用多模态混合专家病理基础模型预测免疫生物标志物,赋能精准肿瘤学

Tianyu Liu, Ziqing Wang, Zhaokang Liang, Tong Ding, Peter Humphrey, Lorraine Colón-Cartagena, Emily Ling-Lin Pai, Kenneth Tou En Chang, Mohamed Kahila, Jonathan Chong Kai Liew, Tinglin Huang, Rex Ying, Kaize Ding, Faisal Mahmood, Wengong Jin

Affiliations Program of Computational Biology and Bioinformatics, Yale University(耶鲁大学计算生物学与生物信息学项目) Broad Institute of MIT and Harvard(麻省理工学院与哈佛大学博德研究所) Department of Statistics and Data Science, Northwestern University(西北大学统计与数据科学系) Department of Computer Science, Northeastern University(东北大学计算机科学系) Department of Computer Science, Harvard University(哈佛大学计算机科学系) Department of Pathology, Yale University(耶鲁大学病理学系) Department of Anatomic Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania(宾夕法尼亚大学医院解剖病理学与检验医学系) Department of Pathology and Laboratory Medicine, University of California, San Francisco(加州大学旧金山分校病理学与检验医学系) Department of Pathology and Laboratory Medicine, KK Women’s and Children’s Hospital(竹脚妇幼医院病理学与检验医学系) Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学院生物统计学、流行病学与信息学系)

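AI summary Proposes MixTIME, a multimodal foundation model that uses a mixture-of-experts architecture to integrate pathology foundation models from different modalities and predicts multiplex immunofluorescence protein expression from H&E whole-slide images, achieving state-of-the-art performance across 17 protein markers and enhancing downstream tasks such as spatial domain identification and survival prediction.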

Comments 5 figures

AI Chinese abstract

预测与肿瘤免疫微环境(TIME)相关的免疫生物标志物对于推进精准肿瘤学至关重要,但现有方法主要局限于单一图像模态,且存在分辨率不足以及未能充分利用互补的临床和生物学信息的问题。本文介绍MixTIME,一种多模态基础模型,利用混合专家(MoE)架构整合在不同模态上训练的病理基础模型:纯图像(UNIv2)、图像文本(CONCHv1.5)和图像转录组(STPath)表示,用于从苏木精-伊红(HE)全切片图像进行像素级和切片级的多重免疫荧光(mIF)蛋白表达预测。MixTIME采用可学习路由器动态加权专家贡献,并使用分布和趋势感知的损失函数进行训练。在两个不同规模的数据集上进行基准测试,MixTIME在17个蛋白标记物上通过相关性指标衡量达到了最先进的性能。预测的mIF图谱显著增强了下游任务,包括空间域识别、生存预测以及由全球多个机构的病理专家验证的AI辅助病理报告生成。此外,MixTIME能够跨临床时间点纵向追踪蛋白表达动态,并揭示与肿瘤微环境中耐药性和免疫抑制相关的蛋白-基因相互作用模式。总之,MixTIME为计算病理学中的多模态生物标志物发现和临床转化提供了一个可扩展的框架。

English abstract

Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture-of-experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities: image-only (UNIv2), image-text (CONCHv1.5), and image-transcriptomic (STPath) representations, for pixel-level and slide-level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (H&E) whole-slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution- and tendency-aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI-assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein-gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
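The idea of a router that weights expert contributions can be illustrated with a small sketch. The abstract does not specify the router's form, so the softmax-weighted sum below is an assumed, commonly used choice rather than MixTIME's actual design. The names `softmax`, `mix_experts`, `router_logits`, and `expert_features` are hypothetical, and the toy vectors merely stand in for features from three experts.

```python
import math


def softmax(logits):
    """Convert router logits into weights that are positive and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(router_logits, expert_features):
    """Fuse per-expert feature vectors with softmax router weights.

    Assumed design for illustration only: in a trained model the logits
    would come from a learned layer instead of being passed in by hand.
    """
    weights = softmax(router_logits)
    dim = len(expert_features[0])
    fused = [
        sum(w * feats[i] for w, feats in zip(weights, expert_features))
        for i in range(dim)
    ]
    return weights, fused


# Toy stand-ins for three experts' features on one image patch.
expert_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
router_logits = [0.0, 0.0, 0.0]  # equal logits give equal weights
weights, fused = mix_experts(router_logits, expert_features)
print(weights, fused)
```

Raising one logit relative to the others shifts the fused vector toward that expert's features, which is the sense in which such a router weights contributions per input.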