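arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.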

2605.19209 2026-05-20 cs.RO cs.MA

Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning

Manohari Goarin, Yang Zhou, Giuseppe Loianno

Affiliations: New York University; University of California Berkeley

AI summary: This paper proposes a hierarchical framework that combines a graph attention planner with a decentralized nonlinear model predictive controller, so that multiple robots under communication constraints can be assigned to goals and given safe trajectories at the same time. The graph neural network approach yields a scalable, decentralized solution.

Comments 8 pages, 6 figures, Accepted at the IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays up to 200 ms and practical feasibility with decentralized on-board inference.


2605.19207 2026-05-20 cs.CV cs.AI cs.LG

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos

Affiliations: Robotics and AI group, in the Department of Computer Science, Electrical and Space Engineering at Luleå University of Technology

AI summary: This paper presents a new heuristic approach for tunable performance in RL-based quadrotor control through reward design and termination conditions. A reward structure with dual-bandwidth exponentials yields a critically damped setpoint-tracking response with low steady-state error. Trained with Proximal Policy Optimization (PPO) together with episode truncation conditions, the policy reaches the desired performance within 6 million time steps. Intuitive heuristic rules for adjusting the reward weights and exponential coefficients produce faster (acrobatic-like) and slower (inspection-like) settling times while retaining the baseline critically damped response and roughly 2% steady-state error.

Comments Accepted in the 34th Mediterranean Conference on Control and Automation

Abstract

Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.
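The abstract does not give the reward formula, so the following is only one possible reading of a reward built from two exponentials of different bandwidths: a wide term that still provides a signal far from the setpoint and a narrow term that rewards precision close to it. The function name, the weights, and the coefficients are all hypothetical and are not taken from the paper.

```python
import math


def dual_exponential_reward(error, w_wide=0.5, k_wide=1.0,
                            w_narrow=0.5, k_narrow=50.0):
    """Hypothetical reward: sum of a wide and a narrow exponential.

    `error` is a scalar tracking error (for example, distance to the
    setpoint). All default values are illustrative assumptions.
    """
    sq = error * error
    wide = w_wide * math.exp(-k_wide * sq)        # informative far from the setpoint
    narrow = w_narrow * math.exp(-k_narrow * sq)  # sharp peak near the setpoint
    return wide + narrow


if __name__ == "__main__":
    for e in (0.0, 0.1, 0.5, 1.0):
        print(e, round(dual_exponential_reward(e), 4))
```

Under this reading, raising a weight or a coefficient changes how strongly the policy is pushed toward the setpoint, which is the kind of knob the heuristic rules in the abstract adjust; the actual rules may differ.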


2605.19156 2026-05-20 cs.AI cs.CY cs.LG cs.MA

How Far Are We From True Auto-Research?

Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie

Affiliations: Cornell University

AI summary: Using ResearchArena, this paper evaluates the quality of papers generated by different agents. Although the agents can produce papers that look competitive, experimental rigor falls short, with fabricated results, underpowered experiments, and plan/execution mismatch, indicating that auto-research still needs further development.

Abstract

Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex 5%/8% paper-vs-artifact mismatch / fabricated references versus Kimi Code 77%/72%, a $\sim$15$\times$ spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still gapped from the true auto-research.
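As a quick check of the counts quoted in the abstract, the sketch below multiplies out the experimental grid and compares the reported rates. It assumes that an "agent-domain pair" means one of the three agents paired with one of the 13 seeds; the variable names are illustrative and do not come from the paper.

```python
# Agents named in the abstract.
agents = ["Claude Code", "Codex", "Kimi Code"]
seeds = 13            # computer science seeds (assumed to be the "domains")
trials_per_pair = 3   # trials per agent-domain pair

total_papers = len(agents) * seeds * trials_per_pair
print(total_papers)

# Reported rates in percent: paper-vs-artifact mismatch / fabricated references.
codex = {"mismatch": 5, "fabricated_refs": 8}
kimi = {"mismatch": 77, "fabricated_refs": 72}

# Ratio of Kimi Code's rate to Codex's rate for each failure type.
ratios = {key: kimi[key] / codex[key] for key in codex}
print(ratios)
```

The product is 117, which matches the paper count in the abstract. The mismatch ratio comes out at 15.4 and the fabricated-reference ratio at 9, so the quoted roughly 15-fold spread appears to refer to the mismatch rates.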


2605.19155 2026-05-20 cs.CV

Efficient coding along the visual hierarchy

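Ananya Passi, Brian S. Robinson, Michael F. Bonner

Affiliations: Department of Cognitive Science, Johns Hopkins University; Applied Physics Laboratory, Johns Hopkins University

AI summary: This paper studies how the efficient coding principle can build a human-aligned hierarchy of visual features from limited data. It proposes an unsupervised learning method that compresses inputs onto the principal modes of variation in natural images, producing features that progress from edges and colors to textures and shapes. Combined with supervised fine-tuning, it improves brain alignment and speeds up category learning.

Comments 34 pages, 6 figures

GRASP: Deterministic argument ranking in interaction graphs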

Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher

Affiliations: MPI-IS Tübingen; Tübingen AI Center; ELLIS Institute Tübingen; Eberhard Karls Universität Tübingen; LIONS, EPFL

AI summary: This paper proposes GRASP, a framework that aggregates stable local interaction judgments into a global ranking to address the inconsistency of holistic verdicts when large language models act as judges. It emphasizes structural sufficiency rather than persuasiveness or rhetorical appeal.

Comments Preprint

Abstract

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack--defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.


2605.19140 2026-05-20 cs.AI

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

Affiliations: Stern School of Business; New York University; Department of Computer Science; Dartmouth College; Virginia Tech

AI summary: This work studies workflow learning under interface constraints. It proposes IC-Q, an asynchronous decentralized Q-learning algorithm, and derives a finite-sample bound for neural IC-Q, which the authors describe as the first finite-sample guarantee for neural Q-learning under decentralized partial observability.

Abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.


2605.19137 2026-05-20 cs.CV

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

Affiliations: Eindhoven University of Technology

AI summary: This paper explores data-efficient video pre-training that freezes a pre-trained image foundation model and trains only a temporal module, reducing the need for large-scale video data and compute.

Comments Accepted to CVPR 2026 Workshops CV4Smalls

Abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .


2605.19136 2026-05-20 cs.RO

Automatically Improving Simulation Physics for Articulated Objects

Anh-Quan Pham

Affiliations: Penn; PennPAL Lab

AI summary: This work studies how a quantitative evaluation framework and a multi-modal, simulator-feedback method can improve the physical realism and stability of articulated objects in simulation, and thereby the efficiency and effectiveness of robot learning.

Abstract

Simulation is a central tool for scalable robot learning, but its effectiveness depends on the quality of object assets. While modern 3D datasets provide rich geometric and kinematic representations, they typically lack the physical properties required for stable and realistic interaction, requiring significant manual effort to construct simulation-ready articulated objects. In this thesis, we introduce interaction-readiness, which characterizes whether an object can be reliably simulated under manipulation. We propose a quantitative evaluation framework that decomposes interaction-readiness into measurable components, enabling systematic analysis of object quality and revealing failure modes not captured by conventional evaluation. We further present a multi-modal, simulator-in-the-loop approach for generating interaction-ready articulated objects from incomplete 3D assets. The method integrates geometric, visual, and semantic information to infer physical properties and refines them through iterative simulator feedback to improve physical consistency. Experiments across diverse articulated objects and manipulation tasks show that object quality directly impacts simulation stability, interaction behavior, and policy performance. Objects refined by our method exhibit more stable and realistic dynamics, enabling more reliable downstream learning and evaluation. Overall, this thesis demonstrates the importance of physical realism for articulated objects in simulation and introduces a practical multi-modal refinement approach, guided by simulator feedback, for constructing such objects at scale.


2605.19135 2026-05-20 cs.LG

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

Affiliations: Meta Superintelligence Labs; Stanford University; Meta Reality Labs; The University of Tokyo

AI summary: This study examines how children acquire robust language grounding from limited visuo-linguistic input and introduces the EgoBabyVLM Challenge to drive models toward grounded language learning from naturalistic data.

Abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.


2605.19127 2026-05-20 cs.AI

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag

Affiliations: ETH Zurich; ETH AI Center

AI summary: This paper introduces POLAR-Bench, a benchmark for evaluating the trade-off between privacy and utility in LLM agents. Across 10 domains and 7,852 samples, it scores privacy and utility by deterministic set-membership and varies the privacy policy dimension and the attack strategy along two orthogonal axes, producing a 5×5 diagnostic surface per model. The results show that current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1-30B range perform worse, with the weakest leaking over half.

Comments Preprint

Abstract

LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a $5 \times 5$ diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1--30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model's intent-following breaks down, providing a foothold for privacy alignment where it matters most.
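To make "deterministic set-membership" scoring concrete, here is a minimal sketch. It assumes each dialogue is reduced to the set of attribute names the trusted model disclosed, and that utility and leakage are the fractions of task-relevant and protected attributes found in that set. The function name, the attribute names, and the exact formulas are hypothetical and are not taken from the paper.

```python
def score_dialogue(disclosed, task_relevant, protected):
    """Return (utility, leak_rate) from plain set membership.

    utility   = share of task-relevant attributes that were disclosed
    leak_rate = share of protected attributes that were disclosed
    Both definitions are assumptions made for illustration.
    """
    disclosed = set(disclosed)
    task_relevant = set(task_relevant)
    protected = set(protected)

    utility = (len(disclosed & task_relevant) / len(task_relevant)
               if task_relevant else 1.0)
    leak_rate = (len(disclosed & protected) / len(protected)
                 if protected else 0.0)
    return utility, leak_rate


if __name__ == "__main__":
    # Hypothetical booking task: name and date may be shared,
    # passport number and diagnosis must not be.
    result = score_dialogue(
        disclosed={"name", "date", "passport_number"},
        task_relevant={"name", "date"},
        protected={"passport_number", "diagnosis"},
    )
    print(result)
```

Because the score depends only on set intersections, the same transcript always yields the same numbers, with no judge model in the loop.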


2605.19120 2026-05-20 cs.RO

CosFly: Plan in the Matrix, Fly in the World

Hanxuan Chen, Xiangyue Wang, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Binbo Li, Kangli Wang, Ji Pei

Affiliations: Autel Robotics; Nanjing University; Peking University; Southern University of Science and Technology; University of Hong Kong

AI summary: This paper presents CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, and CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments. CosFly converts complex 3D worlds into structured obstacle representations for planning, projects the resulting trajectories into multi-modal sensor data, and supports configurable fixed-FOV zoom levels.

Abstract

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.


2605.19111 2026-05-20 cs.CV cs.AI

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

Affiliations: Boston University; Adobe; IBM Research

AI summary: This paper proposes FAGER, a framework for evaluating and improving the factual accuracy of text-to-image models. It builds a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then evaluates with a VLM. FAGER outperforms prior metrics on a factuality test and can refine T2I outputs without training.

Comments It was accepted for an oral presentation at the 2nd Workshop on the Evaluation of Generative Foundation Models (EVGENFM2026) at CVPR 2026. Total 8 pages (1 page for references). 5 figures

Abstract

Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.


2605.19107 2026-05-20 cs.LG eess.SP

Performance Monitoring of Proton Exchange Membrane Water Electrolyzer by Transformers-Based Machine Learning Model

Bingqing Chen, Ivan Batalov, Qiu Chen, Weiqi Ji, Lei Cheng

Affiliations: Bosch Research & Technology Center

AI summary: This paper proposes a transformer-based machine learning framework that performs virtual electrochemical characterization during normal operation. An encoder-decoder architecture reconstructs polarization curves, enabling continuous state-of-health monitoring of proton exchange membrane water electrolyzers.

Abstract

Green hydrogen plays an essential role in decarbonization, with capacity projected to scale to 560 GW by 2030 (vs. 1.39 GW in 2023) in net-zero settings. Proton exchange membrane (PEM) electrolysis is one of the most promising technology routes to green hydrogen production, and real-time system health monitoring of PEM electrolyzers is essential for their scalable deployment. In lab settings, performance degradation can be characterized through electrochemical testing protocols by periodic pauses of normal operation. Such interruption is not practical for full-scale stack deployments, limiting system operators' ability to make real-time assessments of state-of-health (SoH). We present a machine learning (ML) framework that performs virtual electrochemical characterization during normal operation. The method uses an encoder-decoder transformer, conditioned on operational data, to reconstruct characterization outputs, focusing here on polarization curves. Inspired by patch-based sequence tokenization, we segment the inputs into patches and encode them to form meaningful tokens, which substantially improves learning efficiency. Across four longitudinal runs, lasting up to 478 hours on different test cells and loading cycles, the model accurately reconstructed polarization curves and achieved 10x reduction in mean squared error (MSE) compared to a vanilla transformer. This proof-of-concept demonstrates that ML models can enable continuous performance monitoring for PEM electrolyzers and that the encoder captures meaningful latent representations of SoH, opening up opportunities to derive interpretable indicators in future work.
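The patch-based tokenization step can be illustrated with a small sketch. It assumes a one-dimensional series of operational measurements cut into fixed-length windows, each of which would then be encoded into one token. The name `patchify`, the patch length, and the stride are illustrative choices, and the paper's actual segmentation may differ.

```python
def patchify(series, patch_len, stride=None):
    """Split a sequence into fixed-length patches.

    With the default stride (equal to patch_len) the patches do not
    overlap. Trailing samples that do not fill a whole patch are dropped.
    """
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    step = patch_len if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    last_start = len(series) - patch_len
    return [list(series[i:i + patch_len])
            for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    # Hypothetical stack-voltage readings, one per time step.
    voltage = [1.80, 1.81, 1.83, 1.82, 1.84, 1.86, 1.85, 1.87, 1.88, 1.90]
    for patch in patchify(voltage, patch_len=4):
        print(patch)
```

Grouping samples this way shortens the sequence the transformer attends over (here from 10 samples to 2 patches), which is one plausible reason patching helps learning efficiency.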


2605.19104 2026-05-20 cs.RO cs.AI

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

Branden Frieden, James M. Ferguson, Alan Kuntz, Varun Shankar

Affiliations: The Robotics Center and the Kahlert School of Computing at the University of Utah; The Departments of Computer Science and Electrical and Computer Engineering at Vanderbilt University

AI summary: This paper proposes a neural-operator learning approach for design-space surrogate modeling of tendon-driven continuum robots. By mapping robot design parameters and tendon actuation inputs to the resulting configuration, a single model generalizes across a large class of robot designs.

Comments Accepted to ICRA 2026

Abstract

Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.

