arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19209 2026-05-20 cs.RO cs.MA

Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning

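Manohari Goarin, Yang Zhou, Giuseppe Loianno

Affiliations New York University; University of California, Berkeley

AI summary This paper proposes a hierarchical framework that combines a graph attention planner with a decentralized nonlinear model predictive controller, so that multiple robots operating under communication constraints can be assigned goals and generate safe trajectories concurrently. It builds on graph neural network methods to obtain a scalable decentralized solution.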

Comments 8 pages, 6 figures, Accepted at the IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays up to 200 ms and practical feasibility with decentralized on-board inference.

2605.19207 2026-05-20 cs.CV cs.AI cs.LG

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

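Sumanth Meenan Kanneti, Aryan Shah

Affiliations Georgia State University

AI summary This paper proposes a multi-strategy compression framework for brain tumor classification from MRI. It combines quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student, and Float16 post-training quantization on a lightweight MobileNetV2 backbone, aiming at efficient and accurate brain tumor screening in low-resource healthcare settings.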

Abstract

Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
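The size and accuracy figures in this abstract can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers; the function name `compression_summary` is hypothetical and does not come from the paper.

```python
def compression_summary(full_mb, quant_mb, acc_full, acc_quant):
    """Summarize a size/accuracy trade-off from reported figures."""
    return {
        # How many times smaller the quantized model is.
        "ratio": full_mb / quant_mb,
        # Share of the original size that was removed, in percent.
        "size_reduction_pct": 100.0 * (1.0 - quant_mb / full_mb),
        # Accuracy change in percentage points (positive = quantized is higher).
        "accuracy_delta_pp": acc_quant - acc_full,
    }

# Figures quoted in the abstract for the MobileNetV2 pipeline.
summary = compression_summary(35.34, 5.76, acc_full=82.20, acc_quant=82.37)
print({k: round(v, 2) for k, v in summary.items()})
```

The ratio rounds to 6.14, matching the abstract, and the accuracy difference is 0.17 percentage points in favor of the quantized model, which is consistent with the claim of no meaningful accuracy loss.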

2605.19206 2026-05-20 cs.RO

CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

Taeyun Kim, Alvin Jinsung Choi, Dasol Hong, Hyun Myung

Affiliations School of Electrical Engineering, KAIST

AI summary CLUE adaptively prioritizes contextual cues using a unified semantic map to address zero-shot object-goal navigation, improving the robustness and efficiency of navigation.

Comments 8 pages, 5 figures

Abstract

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target's association with room types using LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target's ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.

2605.19202 2026-05-20 cs.RO cs.AI math.OC

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Viswa Narayanan Sankaranarayanan, George Nikolakopoulos

Affiliations Robotics and AI group, Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, Sweden

AI summary This paper proposes a deep reinforcement learning-based quadrotor controller for autonomous inspection missions in under-canopy forest environments. An end-to-end control policy tracks inspection view poses, and a Traveling Salesman Problem planner combined with an RRT* planner supports safe and reliable deployment on long-range missions.

Comments Submitted to 2026 IEEE 22nd International Conference on Automation Science and Engineering

Abstract

This paper addresses the problem of using a deep Reinforcement Learning (RL)-based low-level quadrotor controller within an autonomous quadrotor navigation stack for aerial inspection missions in under-canopy forest environments. Specifically, it presents an end-to-end (mapping states to RPMs) quadrotor control policy that achieves inspection view-pose tracking (simultaneous position and yaw reference tracking), which is crucial for various target inspection behaviors and point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller in long-range missions, the paper uses a higher-level navigation guidance layer comprising a Traveling Salesman Problem (TSP) planner and a Rapidly-exploring Random Tree Star (RRT*) planner. Given a known map of a forest and a set of user-specified inspection regions, the TSP planner finds the optimal visitation sequence. Between two target regions, an RRT* planner generates collision-free paths that respect the tracking limitations of the lower-level end-to-end RL policy. Through five target inspection scenarios, the paper demonstrates that an RL-based motor-level stabilizing controller, supported by a navigation guidance layer, can be used effectively as the low-level inspection execution module for under-canopy forest inspection missions.
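To make the TSP step concrete, here is a minimal sketch of choosing a visitation order for a small set of inspection regions. It rests on assumptions the abstract does not state: regions are reduced to 2D points, the cost between them is straight-line distance (a real stack would more plausibly use RRT* path lengths), and the search is brute force. The names `route_length` and `best_visit_order` are hypothetical.

```python
import itertools
import math

def route_length(start, order, return_to_start=True):
    """Total Euclidean length of start -> order[0] -> ... (-> start)."""
    stops = [start, *order]
    if return_to_start:
        stops.append(start)
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))

def best_visit_order(start, regions, return_to_start=True):
    """Exhaustive search over visiting orders; only viable for a handful of regions."""
    best = min(
        itertools.permutations(regions),
        key=lambda order: route_length(start, order, return_to_start),
    )
    return list(best), route_length(start, best, return_to_start)

# Three inspection regions on a line, starting from the origin, no return leg.
regions = [(2.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
order, length = best_visit_order((0.0, 0.0), regions, return_to_start=False)
print(order, length)
```

Exhaustive search grows factorially with the number of regions, so anything beyond roughly ten regions would call for a heuristic or a dedicated solver.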

2605.19201 2026-05-20 cs.LG cs.AI

On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis

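Danu Kim

Affiliations Korea International School, Jeju Campus

AI summary This paper proposes PneumoNet, a domain-incremental learning method for resource-limited settings. It combines a lightweight CNN for on-device prediction, a dual-stage balanced buffer for class-balanced replay, and a dynamic class-weighted loss that corrects training-batch imbalance. On a PneumoniaMNIST dataset simulating five realistic domain-shift scenarios, it reaches 86.6% accuracy while being smaller and more efficient than the baselines.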

Comments Presented at 32nd Samsung Humantech Paper Awards

Abstract

Deep learning models detect pneumonia from chest X-rays with high accuracy, but the performance declines under domain shifts caused by differences in devices, patients, or institutions. We present PneumoNet, a domain-incremental learning method for point-of-care pneumonia diagnosis in resource-limited settings. PneumoNet combines a lightweight CNN for on-device prediction, a dual-stage balanced buffer for class-balanced replay, and a dynamic class-weighted loss to correct training-batch imbalances. Evaluated on a domain-shifted PneumoniaMNIST dataset simulating five realistic domain change scenarios, PneumoNet achieves 86.6% accuracy with 1.4% forgetting while being smaller and faster than existing baselines. These results highlight PneumoNet's potential to enable adaptive, privacy-preserving diagnostic AI directly on point-of-care medical devices in real-world and pandemic-ready healthcare.
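One common way to realize a dynamic class-weighted loss is to recompute inverse-frequency class weights for every training batch. The sketch below illustrates that idea only; the abstract does not give PneumoNet's actual weighting formula, so the formula and the names `batch_class_weights` and `weighted_cross_entropy` are assumptions.

```python
import math
from collections import Counter

def batch_class_weights(labels, num_classes):
    """Inverse-frequency weights for one batch: w_c = N / (K * n_c).

    Classes absent from the batch get weight 0.0 (they contribute no loss terms).
    """
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs[i][c] is a predicted probability."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

# A batch with three samples of class 0 and one of class 1.
labels = [0, 0, 0, 1]
weights = batch_class_weights(labels, num_classes=2)
print(weights)
```

With these weights each class contributes the same total weight to the loss regardless of how many samples it has in the batch, which is the imbalance correction the abstract describes in general terms.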

2605.19196 2026-05-20 cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

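Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

Affiliations Yale University; IBM Research

AI summary This paper introduces REFLECT, a benchmark for evaluating how well LLM judges detect fine-grained failures in agentic settings. It shows that current LLM judges are unreliable on reasoning, tool-use, and report-quality failures, and offers guidance for building more reliable evaluation pipelines.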

Abstract

Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

2605.19194 2026-05-20 cs.CL

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

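Rui Chu

Affiliations Rui Chu

AI summary This paper proposes the MMoA framework, which introduces LSTM-based gating to address the limited temporal and contextual awareness of conventional Mixture-of-Agents methods, with the aim of a more efficient multi-agent system.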

Abstract

The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM-based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context-aware aggregation. We evaluate MMoA on standard instruction-following benchmarks, including AlpacaEval 2.0, MT-Bench, and Arena-Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi-agent LLM systems.

2605.19193 2026-05-20 cs.LG

Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection

Andrea Morandi

Affiliations Cisco

AI summary This paper proposes a Wald-SPRT compute governor for multi-agent LLM debates that uses calibration to detect failures, reducing compute usage with limited accuracy loss.

Abstract

Multi-agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over-spending on easy items and under-spending on hard ones. We adapt Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log-likelihood ratio of "useful convergence" vs "not yet useful" under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best-effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type-I/type-II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte-Carlo study under calibrated Beta models characterising working curves, error rates, capping behaviour, and sensitivity; and (ii) a real-LLM evaluation on 200 attempted MMLU and 200 attempted GSM8K items with three heterogeneous agents (gpt-5, claude-opus-4-6, gemini-2.5-pro) and a claude-opus-4-6 judge, using disjoint 40-item calibration subsets. On GSM8K the rule stops in 1.01 average rounds (4.06 LLM calls) at 97.0% accuracy vs 99.0% for fixed-5 debate at 15 calls: a 3.7x call reduction at -2pp accuracy. On MMLU the calibrated KL collapses to about 0 and the rule caps on 99.5% of items at 2.1x cost. The takeaway is not that SPRT makes debate more accurate, but that a classical sequential test serves as a cheap compute-control and failure-detection layer for multi-agent LLM systems.
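The stopping rule described above can be sketched as a standard Wald SPRT over judge scores with Beta likelihoods. This is an illustrative sketch, not the paper's implementation: the Beta parameters for the two hypotheses, the error rates, the round cap, and the name `sprt_governor` are all assumed here, whereas the paper fits its models on a 40-item calibration subset.

```python
import math

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

def sprt_governor(scores, h1=(8.0, 2.0), h0=(2.0, 2.0),
                  alpha=0.05, beta=0.05, r_max=5):
    """Return (decision, rounds_used) for a stream of judge scores in [0, 1].

    h1 / h0 are hypothetical Beta parameters for "useful convergence" and
    "not yet useful"; alpha and beta are the target type-I / type-II error rates.
    """
    upper = math.log((1 - beta) / alpha)   # cross it: accept "useful convergence"
    lower = math.log(beta / (1 - alpha))   # cross it: accept "not yet useful"
    llr = 0.0
    rounds = 0
    for rounds, score in enumerate(scores, start=1):
        score = min(max(score, 1e-6), 1 - 1e-6)  # keep the logs finite
        llr += beta_logpdf(score, *h1) - beta_logpdf(score, *h0)
        if llr >= upper:
            return "stop_converged", rounds
        if llr <= lower:
            return "stop_not_useful", rounds
        if rounds >= r_max:
            break
    return "capped", rounds

print(sprt_governor([0.99, 0.99]))   # two high-consensus rounds
print(sprt_governor([0.66] * 10))    # uninformative scores hit the cap
```

When the two fitted Beta models are nearly identical, each round adds almost nothing to the log-likelihood ratio and the rule runs to the cap, which is the behavior the abstract reports on MMLU. The reported call reduction on GSM8K follows from 15 / 4.06 ≈ 3.7.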

2605.19185 2026-05-20 cs.LG cs.AI

Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

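Shiheng Zhang

Affiliations Department of Applied Mathematics, University of Washington

AI summary This paper studies which graph value extensions are planner-admissible under the operational argmin-Q planner. It gives a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, the greedy rollout reaches the goal. AMLE realizes this certificate through a comparison-principle fill-distance bound, whereas harmonic extension can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order.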

Abstract

Sparse goal-conditioned planning with few cost-to-go labels can be viewed as a graph-PDE Dirichlet extension problem: extend sparse labels on a goal-dependent boundary to unlabelled graph vertices so that greedy rollouts reach the goal. We study which graph value extensions are planner-admissible under the operational argmin-Q planner. Our main result is a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, then the greedy rollout reaches the goal. Absolutely Minimal Lipschitz Extension (AMLE), the p=infinity endpoint of the graph p-Laplacian family, instantiates this certificate through a comparison-principle fill-distance bound. Harmonic extension, by contrast, can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order. On 120 AntMaze layout-derived graph configurations, harmonic extension achieves 0.584 aggregate rollout success, while AMLE reaches 0.970. Finite high-p methods also enter a high-success regime, with success 0.903 for p=4, 0.973 for p=8, and 0.982 for a fixed-budget p=16 solver, though the p=16 row is not used as a converged endpoint ranking due to incomplete solver certification. Mechanism audits show that many rollout decisions occur in AMLE-compatible but harmonic-incompatible local geometry, and that AMLE corrects most harmonic inversions on the rollout-weighted decision scope.
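The two extension rules can be compared on a toy graph. The sketch below assumes an unweighted graph, where the harmonic update is the neighbor mean and the AMLE (infinity-harmonic) update is the midpoint of the largest and smallest neighbor values; it uses plain Gauss-Seidel sweeps. The function names and graphs are hypothetical and are not the paper's solver or its AntMaze setup.

```python
def harmonic_rule(values):
    """Harmonic update: average of all neighbor values."""
    return sum(values) / len(values)

def amle_rule(values):
    """AMLE update on an unweighted graph: midpoint of the extreme neighbors."""
    return 0.5 * (max(values) + min(values))

def extend(adj, boundary, rule, sweeps=500):
    """Extend boundary labels to all vertices by repeated local updates."""
    u = {v: boundary.get(v, 0.0) for v in adj}
    for _ in range(sweeps):
        for v in adj:
            if v not in boundary:
                u[v] = rule([u[n] for n in adj[v]])
    return u

def greedy_rollout(adj, u, start, goal, max_steps=100):
    """Follow the neighbor with the lowest extended cost-to-go (unit edge costs)."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        path.append(min(adj[path[-1]], key=lambda n: u[n]))
    return path

# A hub with one cheap neighbor and three expensive ones.
hub = {"c": ["a", "b1", "b2", "b3"], "a": ["c"], "b1": ["c"], "b2": ["c"], "b3": ["c"]}
labels = {"a": 0.0, "b1": 1.0, "b2": 1.0, "b3": 1.0}
print(extend(hub, labels, harmonic_rule)["c"])  # pulled toward the majority
print(extend(hub, labels, amle_rule)["c"])      # depends only on the extremes
```

On the hub graph the harmonic value at the center is 0.75 while the AMLE value is 0.5, which shows how the harmonic rule weights values by how many neighbors carry them. On a simple path graph both rules reduce to linear interpolation and the greedy rollout walks straight to the goal.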

2605.19173 2026-05-20 cs.CL

Prompting language influences diagnostic reasoning and accuracy of large language models

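Adrien Bazoge, Josselin Corvellec, Sofiane Djillali Sid-Ahmed, Pierre-Antoine Gourraud

Affiliations Data Clinic, Nantes University Hospital

AI summary This study examines how prompting language affects the clinical diagnostic reasoning and accuracy of large language models by comparing performance in English and French. Four models performed better in English, while o3 showed no overall language effect, indicating that prompting language is a key determinant of clinical performance.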

Abstract

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

2605.19172 2026-05-20 cs.LG cs.AI

Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand

Yihong Tang, Tong Nie, Junlin He, Qianjun Huang, Dingyi Zhuang, Lijun Sun

Affiliations McGill University; The Hong Kong Polytechnic University; University of Toronto; MIT

AI summary This paper proposes Bridge, a framework that combines an inductive contextual graph backbone with a time-aware memory module to forecast urban delivery demand in newly added service regions that lack historical records, improving forecasts for cold-start regions.

Abstract

Forecasting urban delivery demand becomes substantially more challenging when newly added service regions lack historical records. Existing spatiotemporal forecasters effectively model spatial dependence once sufficient node histories are available. Still, they remain parametric and therefore struggle to recover short-term operational dynamics in cold-start regions. Geospatial embeddings help identify where a region is and what function it serves, yet they do not directly reveal how a similar region behaves under a comparable temporal context. We propose Bridge, a retrieval-augmented spatiotemporal graph framework that combines an inductive contextual graph backbone with a time-aware memory of region-time windows. For each target region, Bridge retrieves future demand patterns from the memory using both regional context and recent dynamics, and refines the backbone forecast through a gated fusion mechanism. To align retrieval with forecasting utility, we further train the retriever with a future-aware objective that favors entries whose future trajectories best match the target. Experiments on four real-world delivery datasets show that Bridge consistently improves over competitive spatiotemporal baselines in both within-city cold-start and cross-city transfer with partial observations. The results show that retrieval augmentation provides a useful operational memory for cold-start urban demand forecasting when parametric graph generalization alone is insufficient.

2605.19166 2026-05-20 cs.RO cs.LG math.OC

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

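Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos

Affiliations Robotics and AI group, Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology

AI summary This paper proposes a heuristic approach for tunable performance in RL-based quadrotor control through reward design and termination conditions. A reward structure with dual bandwidth exponentials yields a critically damped response in setpoint tracking with low steady-state error. Trained with Proximal Policy Optimization (PPO) together with episode truncation conditions, the policy reaches the desired performance within 6 million time steps. Intuitive heuristic rules for adjusting the reward weights and exponential coefficients give faster (acrobatic-like) and slower (inspection-like) settling times while retaining the baseline critically damped response and approximately 2% steady-state error.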

Comments Accepted in the 34th Mediterranean Conference on Control and Automation

Abstract

Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.

2605.19156 2026-05-20 cs.AI cs.CY cs.LG cs.MA

How Far Are We From True Auto-Research?

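Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie

Affiliations Cornell University

AI summary Using ResearchArena, this paper evaluates the quality of papers generated by different agents. The agents can produce papers that look competitive, but experimental rigor falls short, with fabricated results, underpowered experiments, and mismatches between plan and execution, indicating that auto-research still needs further development.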

Abstract

Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals that this picture is overstated: SAR scores are poorly aligned with actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex shows 5%/8% paper-vs-artifact mismatch / fabricated references versus 77%/72% for Kimi Code, a roughly 15x spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still far from true auto-research.

2605.19155 2026-05-20 cs.CV

Efficient coding along the visual hierarchy

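Ananya Passi, Brian S. Robinson, Michael F. Bonner

Affiliations Department of Cognitive Science, Johns Hopkins University; Applied Physics Laboratory, Johns Hopkins University

AI summary This paper studies whether the efficient coding principle can build a human-aligned hierarchy of visual features from limited data. It proposes an unsupervised learning procedure that compresses inputs onto the dominant modes of variation in natural images, yielding features that progress from edges and colors to textures and shapes. Combining it with supervised fine-tuning improves brain alignment and speeds up category learning.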

Comments 34 pages, 6 figures

Abstract

Biological visual systems learn from limited experience, unlike deep learning models that rely on millions of training images. What learning principles make this possible? We tested whether efficient coding, the idea that neural representations capture the statistical structure of natural inputs, can build a hierarchy of human-aligned visual features from limited data. We developed an unsupervised learning procedure in which each layer of a deep network compresses its inputs onto the dominant modes of variation in natural images, using only local statistics and no labels, tasks, or backpropagation. This unsupervised procedure yields features that progress from edges and colors to textures and shapes. The features of this deep efficient coding model are readily recognized by human observers and are predictive of image-evoked fMRI responses in human visual cortex. Furthermore, a hybrid learning procedure that combines efficient coding with supervised fine-tuning yields better brain alignment in low-data settings and more rapid category learning. These findings suggest that efficient coding may shape representations across the entire visual hierarchy and help explain the data efficiency of biological vision.

2605.19151 2026-05-20 cs.AI cs.HC

Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

渐进自主性作为偏好学习:代理工具使用中的信任校准形式化

Changkun Ou

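Affiliations Changkun Ou

AI summary This paper formalizes trust calibration for agentic tool use as a preference-learning problem: a policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function and escalates to the human where the approval outcome is most uncertain. The approach inherits the inference machinery and sample-efficiency argument of Preferential Bayesian Optimization but pursues a different objective.

Chinese abstract (AI-generated)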

我们正式将代理工具使用中的信任校准(决定何时自动化代理的提议行动可以自主执行还是需要人类批准)作为偏好学习问题。策略网关维护一个高斯过程后验,覆盖潜在人类风险容忍函数,通过二元批准/拒绝反馈的probit似然进行观测,并在审批结果最不确定的地方升级到人类。我们证明这在结构上是偏好贝叶斯优化的一个实例,继承了其推理机制(近似高斯过程分类)和样本效率论证(不确定性目标查询),但目标不同:将动作空间分类为允许/阻止/询问区域,而不是优化设计。

English abstract

We formalize trust calibration for agentic tool use (deciding when an automated agent's proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.
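
The allow/block/ask decision can be sketched under a simplifying assumption: the Gaussian-process posterior at a proposed action has already been reduced to a latent mean and variance. With a probit likelihood, the predictive approval probability then has the closed form Φ(μ / √(1 + σ²)). The width of the "ask" band and the function names are hypothetical choices for illustration, not taken from the paper.

```python
import math

def approve_probability(mean, variance):
    """P(approve) under a probit likelihood when the latent value is N(mean, variance)."""
    z = mean / math.sqrt(1.0 + variance)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gate(mean, variance, band=0.2):
    """Classify an action: escalate ("ask") when the approval outcome is close to a coin flip."""
    p = approve_probability(mean, variance)
    if abs(p - 0.5) <= band:
        return "ask"
    return "allow" if p > 0.5 else "block"
```

Note that a large posterior variance pulls the approval probability toward 0.5, so poorly explored regions of the action space are escalated even when the posterior mean looks favorable.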

2605.19150 2026-05-20 cs.LG cs.AI

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

Flash PD-SSM: 一种内存优化的结构稀疏状态空间模型

Aleksandar Terzić, Francesco Carzaniga, Nicolas Menet, Yannick Biehl, Michael Hersche, Thomas Hofmann, Abbas Rahimi

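Affiliations IBM Research – Zurich; Department of Computer Science, ETH Zürich

AI summary This paper proposes Flash PD-SSM, a memory-optimized structured sparse state-space model that improves expressivity while remaining efficient, achieving throughput comparable to widely used structured SSMs and showing higher accuracy and efficiency across several tasks.

Chinese abstract (AI-generated)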

状态空间模型(SSMs)面临效率与表达能力之间的根本权衡,这主要由模型转移矩阵的结构决定。无结构的转移矩阵具有最大的表达能力,但计算和内存成本过高。相比之下,大多数结构化转移矩阵形式在运行时间和内存消耗上都非常高效,但表达能力有限。基于最近关于结构稀疏SSMs的研究,我们提出了Flash PD-SSM,一种新的SSM,其吞吐量与广泛使用的结构化SSMs相当,但具有显著更好的表达性保证。Flash PD-SSM维护一个可训练的结构稀疏矩阵集合,在每个时间步选择其中一个进行离散选择,从而在保持大规模训练所需的效率的同时,实现了与无结构矩阵相当的FSA表达能力。首先,我们在合成机制和状态跟踪任务上验证了Flash PD-SSM,发现其理论表达能力在实践中得以实现。其次,在涉及超过17000长度序列的多变量时间序列任务中,我们发现Flash PD-SSM在竞争性的SSM方法中定义了新的最先进的(SoTA)准确性。最后,我们展示了Flash PD-SSM是混合LLMs的有效替代品,在自然语言状态跟踪和常见语言建模场景中均取得改进。该模型相比前沿语言模型广泛使用的SSMs表现出更高的吞吐量和更低的内存消耗。

English abstract

State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model's transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.
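
To illustrate why selecting a single sparse matrix per time step is enough to track finite-state automaton transitions, the following sketch hard-codes a small automaton as matrices with exactly one nonzero entry per column. This is only an illustration of the expressivity argument: in the paper the matrices are trainable and the selection is learned, whereas here both are fixed by hand, and the 3-state automaton is a hypothetical example.

```python
import numpy as np

def one_hot_column_matrix(next_state):
    """Transition matrix with a single 1 per column: column s maps state s to next_state[s]."""
    n = len(next_state)
    m = np.zeros((n, n))
    for s, t in enumerate(next_state):
        m[t, s] = 1.0
    return m

# Hypothetical automaton: symbol 0 keeps the state, symbol 1 advances it modulo 3.
TRANSITIONS = [
    one_hot_column_matrix([0, 1, 2]),
    one_hot_column_matrix([1, 2, 0]),
]

def run(symbols, start=0):
    """Apply one selected matrix per input symbol and return the final state index."""
    h = np.zeros(3)
    h[start] = 1.0
    for sym in symbols:
        h = TRANSITIONS[sym] @ h   # discrete selection of one matrix per time step
    return int(np.argmax(h))
```

Because each matrix has one nonzero per column, a state update costs time linear in the state size rather than quadratic, which hints at how sparsity can coexist with automaton-level expressivity.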

2605.19149 2026-05-20 cs.CL cs.CR

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

Rishi Jha, Harold Triedman, Arkaprabha Bhattacharya, Vitaly Shmatikov

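Affiliations Department of Computer Science

AI summary This study examines accidental meltdowns, in which agents respond to benign environmental errors with unsafe behavior. Experiments show that meltdowns of varying severity occur in 64.7% of agent rollouts that encounter simulated errors, and that these behaviors are not captured by existing reliability or safety benchmarks.

Comments 32 pages, 8 figures, 4 tables

Chinese abstract (AI-generated)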

使用计算机和Web的代理不可避免地会遇到错误:无法访问的网页、缺失的文件、本地和远程的配置错误等。这些错误并不会难倒基于最先进模型的代理,它们会"乐于助人"地继续寻找完成任务的方法。我们提出、刻画并测量了一种新的代理失败类型,称为"意外崩溃"(accidental meltdown):在没有任何对抗性输入的情况下,对良性环境错误做出不安全或有害的行为。由于现有的可靠性或安全基准无法捕捉崩溃,我们建立了崩溃行为的分类法。随后,我们实现了一套与具体代理无关的基础设施,用于向rollout环境中注入模拟的本地和远程错误,并用它系统地评估了由GPT、Grok和Gemini驱动的代理系统。评估表明,在遇到模拟错误的代理rollout中,有64.7%出现了严重程度和成功程度各异的崩溃(例如进行未经授权的侦察或绕过访问控制),涵盖代理系统、底层模型和错误类型的所有组合。在其中超过一半的崩溃中,不安全行为没有报告给用户。通过比较同一代理在有错误和无错误情况下的行为,我们发现针对错误的探索与不安全和有害行为相关。

English abstract

Agents operating with computer and Web use inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. They helpfully continue to look for ways to complete their tasks. We introduce, characterize, and measure a new type of agent failure we call accidental meltdown: unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by the existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors. We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type. In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.

2605.19141 2026-05-20 cs.LG cs.AI cs.CL cs.CY cs.HC

GRASP: Deterministic argument ranking in interaction graphs

GRASP:交互图中的确定性论证排名

Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher

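Affiliations MPI-IS Tübingen; Tübingen AI Center; ELLIS Institute Tübingen; Eberhard Karls Universität Tübingen; LIONS, EPFL

AI summary This paper proposes GRASP, a framework that aggregates stable local interaction judgments into a global ranking to address the inconsistency of holistic verdicts when large language models act as judges. It measures structural sufficiency rather than persuasion or rhetorical appeal.

Comments Preprint

Chinese abstract (AI-generated)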

大型语言模型越来越多地被部署为自动裁判,以评估论证的强度。随着这一角色的扩大,其合法性取决于一致性、透明性和将论证结构与修辞吸引力区分开的能力。然而,我们证明了整体评判——一种常见的LLM-as-a-Judge实践,其中模型对辩论提供全球裁决——存在显著的跨模型分歧。我们主张这种不稳定性源于将辩论复杂的交互结构压缩成单一的不透明分数。为了解决这一问题,我们提出GRASP(渐进排名与攻击支持传播),一种确定性框架,通过收敛的攻击-防御传播操作,将稳定的局部交互判断聚合为全局排名。我们证明在LLM-as-a-Judge评估中,局部交互判断比整体排名更具可重复性,使GRASP能够生成更一致的全局排名。我们进一步证明GRASP分数与人类“说服性”标签不相关,突显了一个关键的社技术区别:GRASP不衡量说服力、事实性或修辞吸引力,而是结构充分性——一种在显式交互图上的防御意识的论证鲁棒性概念。总体而言,GRASP为整体LLM评判提供了一个透明且可审计的替代方案。

English abstract

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack--defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
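
The idea of turning local attack judgments into a deterministic global ranking can be sketched with a simple fixed-point iteration. The update rule below is an h-categoriser-style gradual semantics, swapped in for illustration because the abstract does not specify GRASP's actual propagation operator; it handles attacks only, and defense appears indirectly because an attacker that is itself attacked loses strength. The base strengths and argument names are hypothetical.

```python
def propagate(base, attacks, iters=200, tol=1e-10):
    """Iterate score(a) = base(a) / (1 + sum of scores of a's attackers).

    base: dict mapping argument -> base strength in (0, 1].
    attacks: list of (attacker, target) pairs from local interaction judgments.
    """
    score = dict(base)
    for _ in range(iters):
        new = {}
        for a in base:
            pressure = sum(score[b] for b, target in attacks if target == a)
            new[a] = base[a] / (1.0 + pressure)
        delta = max(abs(new[a] - score[a]) for a in base)
        score = new
        if delta < tol:
            break
    return score

def rank(base, attacks):
    """Global ranking by final score, with ties broken by name so the result is deterministic."""
    s = propagate(base, attacks)
    return sorted(s, key=lambda a: (-s[a], a))

base = {"A": 1.0, "B": 1.0, "C": 1.0}
attacks = [("C", "B"), ("B", "A")]   # C attacks B, B attacks A
scores = propagate(base, attacks)
order = rank(base, attacks)
```

In this toy chain, C is unattacked and ranks first, while A ends above B because its only attacker has been weakened by C, which is the defense-aware behavior the abstract describes in general terms.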

2605.19140 2026-05-20 cs.AI

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

学习交接:接口约束下可证明收敛的工作流学习

Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

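Affiliations Stern School of Business; New York University; Department of Computer Science; Dartmouth College; Virginia Tech

AI summary This study addresses workflow learning under interface constraints. It proposes IC-Q, an asynchronous decentralized Q-learning algorithm, and gives a finite-sample bound for neural IC-Q, which the authors describe as, to their knowledge, the first finite-sample guarantee for neural Q-learning under decentralized partial observability.

Chinese abstract (AI-generated)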

我们研究这样一种设置下的工作流学习:专门化的代理通过共享工件(artifact)交接控制权,每个代理只能观察到该工件的一个局部函数及其自身的私有状态,并且没有任何集中式学习者能够访问联合轨迹。这正是多代理LLM流水线跨越组织、供应商或信任边界时的运行模式。我们将这种模式形式化为接口约束的半马尔可夫决策过程(IC-SMDP),其决策时刻发生在交接时刻,并设计了IC-Q,一种异步去中心化的Q学习算法,其中每次交接时的跨代理协调恰好是一个标量。我们的主要结果是神经IC-Q的有限样本界,该界在随机选项持续时间折扣下分解为三个可独立控制的误差源:神经函数近似误差、接口表示差距和混合时间残差。建立这个界需要将近似信息状态(AIS)框架从单代理原始步MDP提升到多代理SMDP,并在随机持续时间下控制马尔可夫噪声,这两点在先前工作中均未完成。据我们所知,这是去中心化部分可观测条件下神经Q学习的第一个有限样本保证。四个实验(一个逐项验证该界的受控合成IC-SMDP、多LLM数学推理、多代理路由以及多代理CPU编程)表明,IC-Q在没有任何代理观察联合轨迹的情况下即可匹配集中式oracle,且三个误差源各自沿其对应的轴按该界所预测的方式缩放。

English abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
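
A tabular sketch may help make the handoff mechanism concrete. It assumes a standard SMDP-style Q-learning update in which the bootstrap term is discounted by the discount factor raised to the option duration, and in which the only quantity crossing the agent boundary is the receiving agent's best local value. The paper analyzes a neural variant; the class name, the observations, and the hyperparameters here are hypothetical.

```python
from collections import defaultdict

class HandoffAgent:
    """One agent with a private Q-table over its own local observations."""

    def __init__(self, actions, alpha=0.5, gamma=0.9):
        self.q = defaultdict(float)
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma

    def value(self, obs):
        """The single scalar sent back at a handoff: the best local Q-value."""
        return max(self.q[(obs, a)] for a in self.actions)

    def update(self, obs, action, reward, duration, next_value):
        """SMDP-style update: the bootstrap term is discounted by gamma ** duration."""
        target = reward + (self.gamma ** duration) * next_value
        self.q[(obs, action)] += self.alpha * (target - self.q[(obs, action)])

# Two agents that never see each other's observations or trajectories.
writer = HandoffAgent(actions=["pass", "retry"])
reviewer = HandoffAgent(actions=["finish"])

# The reviewer finishes the task and receives the terminal reward.
reviewer.update("draft", "finish", reward=1.0, duration=1, next_value=0.0)
# The writer held control for 3 steps, then handed off; it learns from one scalar.
writer.update("start", "pass", reward=0.0, duration=3, next_value=reviewer.value("draft"))
```

The `reward` argument stands for the reward accumulated while the agent held control, and `duration` for the random number of primitive steps between handoffs.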

2605.19137 2026-05-20 cs.CV

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

迈向数据高效的视频预训练:使用冻结的图像基础模型

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

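Affiliations Eindhoven University of Technology

AI summary This paper explores data-efficient video pre-training that freezes a pre-trained image foundation model and trains only a temporal module, reducing the need for large-scale video data and compute.

Comments Accepted to CVPR 2026 Workshops CV4Smalls

Chinese abstract (AI-generated)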

视频基础模型在许多视频理解任务中表现出色,但通常需要在大规模视频数据集上进行大规模预训练,导致显著的数据和计算成本。相比之下,现代图像基础模型已经提供了强大的空间表示。这引发了一个重要问题:能否通过重用这些空间表示并仅进行时间推理的预训练来构建具有竞争力的视频模型?我们初步探索了一种轻量级训练范式,即冻结预训练的图像基础模型并仅训练时间模块来处理流视频。通过将图像基础模型用作空间编码器,这种方法可以显著减少与端到端视频预训练相比所需的视频数据和计算量。在本工作中,我们探讨了这种方法的可行性,以在投入视频预训练计算之前进行探索。在多个视频理解任务上的实证发现表明,无需大规模视频预训练即可获得强大的时间性能,这促使未来的工作集中在通过在冻结的图像基础模型上预训练时间模块来构建递归视频基础模型。代码:https://github.com/tue-mps/towards-video-image-frozen

English abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .
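
The split between a frozen spatial encoder and a trainable recurrent module can be sketched as a forward pass over a frame stream. Everything below is a stand-in: the "image foundation model" is a fixed random projection, the temporal module is a plain tanh recurrence, and training is omitted. The sketch only shows which parameters would be frozen and which would be learned; it is not the architecture from the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_FEAT, D_STATE = 64, 16, 8

# Stand-in for a frozen image foundation model: these weights are never updated.
W_FROZEN = rng.normal(size=(D_FEAT, D_IMG)) / np.sqrt(D_IMG)

def encode_frame(frame):
    """Per-frame spatial features from the frozen encoder."""
    return np.tanh(W_FROZEN @ frame)

class RecurrentTemporalModule:
    """The only trainable part: a small recurrence over per-frame features."""

    def __init__(self):
        self.w_in = rng.normal(size=(D_STATE, D_FEAT)) * 0.1
        self.w_rec = rng.normal(size=(D_STATE, D_STATE)) * 0.1

    def step(self, state, feat):
        return np.tanh(self.w_in @ feat + self.w_rec @ state)

def process_stream(frames, module):
    """Consume frames one at a time, keeping only a fixed-size state (streaming)."""
    state = np.zeros(D_STATE)
    for frame in frames:
        state = module.step(state, encode_frame(frame))
    return state

video = rng.normal(size=(5, D_IMG))   # five flattened stand-in frames
summary = process_stream(video, RecurrentTemporalModule())
```

Because the state has a fixed size, memory use does not grow with video length, which is what makes a recurrent module a natural fit for streaming input.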

2605.19136 2026-05-20 cs.RO

Automatically Improving Simulation Physics for Articulated Objects

自动提升仿真的物理特性用于关节物体

Anh-Quan Pham

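Affiliations Penn; PennPAL Lab

AI summary This thesis studies how a quantitative evaluation framework and a multi-modal, simulator-in-the-loop method can improve the physical realism and stability of articulated objects in simulation, enabling more reliable robot learning and evaluation.

Chinese abstract (AI-generated)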

仿真是可扩展机器人学习的核心工具,但其效果取决于物体资产的质量。尽管现代3D数据集提供了丰富的几何和运动学表示,但通常缺乏用于稳定和真实交互所需的物理属性,需要大量手动工作来构建仿真准备的关节物体。在本论文中,我们引入了交互准备性,它表征了物体在操作下是否可以可靠地仿真。我们提出了一种定量评估框架,将交互准备性分解为可测量的组成部分,从而系统分析物体质量并揭示传统评估未捕获的失败模式。我们进一步提出了一个多模态、仿真循环的方法,从不完整的3D资产中生成交互准备的关节物体。该方法整合了几何、视觉和语义信息来推断物理属性,并通过迭代仿真反馈来优化这些属性,以提高物理一致性。在多样化的关节物体和操作任务上的实验表明,物体质量直接影响仿真稳定性、交互行为和策略性能。经过我们方法优化的物体表现出更稳定和真实的动态,从而实现了更可靠的下游学习和评估。总体而言,本论文展示了关节物体在仿真中的物理真实性的的重要性,并引入了一种由仿真反馈指导的实用多模态优化方法,用于大规模构建此类物体。

English abstract

Simulation is a central tool for scalable robot learning, but its effectiveness depends on the quality of object assets. While modern 3D datasets provide rich geometric and kinematic representations, they typically lack the physical properties required for stable and realistic interaction, requiring significant manual effort to construct simulation-ready articulated objects. In this thesis, we introduce interaction-readiness, which characterizes whether an object can be reliably simulated under manipulation. We propose a quantitative evaluation framework that decomposes interaction-readiness into measurable components, enabling systematic analysis of object quality and revealing failure modes not captured by conventional evaluation. We further present a multi-modal, simulator-in-the-loop approach for generating interaction-ready articulated objects from incomplete 3D assets. The method integrates geometric, visual, and semantic information to infer physical properties and refines them through iterative simulator feedback to improve physical consistency. Experiments across diverse articulated objects and manipulation tasks show that object quality directly impacts simulation stability, interaction behavior, and policy performance. Objects refined by our method exhibit more stable and realistic dynamics, enabling more reliable downstream learning and evaluation. Overall, this thesis demonstrates the importance of physical realism for articulated objects in simulation and introduces a practical multi-modal refinement approach, guided by simulator feedback, for constructing such objects at scale.

2605.19135 2026-05-20 cs.LG

Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing

部分潜在变量共享下的可识别多模态因果表示学习

Manal Benhamza, Marianne Clausel, Myriam Tami

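Affiliations Paris-Saclay University, CentraleSupélec, MICS Lab; Lorraine University, CRAN

AI summary This paper studies the identifiability of multimodal causal representation learning under partial latent sharing. Each modality is generated through nonlinear mixing functions, and without imposing any parametric distribution on the latent variables the authors establish component-wise identifiability guarantees for the causal latent representation, with results that also cover the undercomplete case.

Chinese abstract (AI-generated)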

因果表示学习(CRL)旨在从高维观测数据中揭示有意义的潜在变量及其对应的因果结构。尽管其重要性,CRL的可识别性仍是一个关键属性,因为它确保了数据生成过程背后机制的恢复,从而保证了表示的可解释性和鲁棒性。证明CRL的可识别性本质上是困难的,本文针对更具有挑战性的多模态设定进行了研究:考虑具有部分共享潜在结构的多模态观测数据。每个模态通过非线性混合函数从特定的因果潜在变量子集生成。在灵活的假设下且不假设潜在变量的参数分布,我们建立了因果潜在表示的组件可识别性保证。此外,我们的可识别性结果还适用于欠定情况,即每个模态中观测变量多于潜在变量。为了实例化我们的理论分析,我们引入了一个基于Wasserstein的模块来恢复部分共享的潜在结构。由于其可微性,后者可以轻松地集成到所有类型的架构中,仅需最小的修改。在合成和现实数据集上的广泛实验验证了我们的方法优于现有最先进方法。

English abstract

Causal representation learning (CRL) seeks to uncover meaningful latent variables and their corresponding causal structure from high-dimensional observational data. Identifiability is a crucial property in CRL, as it ensures the recovery of the mechanisms behind the data generation process, and hence the interpretability and robustness of the representation. Proving identifiability in CRL is intrinsically difficult, and in this work we address an even more challenging setting: multimodality. We consider multimodal observed data with a partially shared latent structure. Each modality is generated, through nonlinear mixing functions, from a specific subset of causal latent variables. Under flexible assumptions and without imposing any parametric distribution on the latent variables, we establish component-wise identifiability guarantees for the causal latent representation. Furthermore, our identifiability results apply to the undercomplete scenario where, for each modality, there are more observed than latent variables. To instantiate our theoretical analysis, we introduce a Wasserstein-based module to recover the partially shared latent structure. Because it is differentiable, this module can be easily integrated into all types of architecture with only minimal changes. Extensive experiments on synthetic and realistic datasets validate the superiority of our approach over SOTA methods.

2605.19133 2026-05-20 cs.CV cs.AI

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

知道何时不进行预测:用于更安全糖尿病视网膜病变筛查的自监督学习与退避

Muskaan Chopra, Lorenz Sparrenberg, Jan H. Terheyden, Rafet Sifa

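Affiliations Rheinische Friedrich-Wilhelms-Universität Bonn; University Hospital Bonn - Department of Ophthalmology; Fraunhofer IAIS; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary This paper studies how the length of self-supervised pretraining affects calibrated confidence and confidence-based abstention. SSL pretraining improves selective prediction compared to training from scratch, but longer pretraining does not consistently improve reliability, which underscores the importance of abstention-aware evaluation.

Comments Accepted at IJCAI 2026

Chinese abstract (AI-generated)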

自监督学习(SSL)现在是预训练医学图像模型的标准方法,但性能仍主要通过下游准确性来评估。对于安全关键的筛查任务,如糖尿病视网膜病变分级,这还不够:模型必须知道何时其预测不可靠,并将不确定案例推迟给临床审查。在本工作中,我们探讨了SSL预训练长度如何影响校准置信度和基于置信度的退避。我们评估了多个SSL检查点在固定微调协议下的表现,并评估了校准置信度、覆盖范围、选择性准确性以及选择性宏F1。在不同数据集和数据制度下,SSL预训练优于从头开始训练。与之前主要评估下游准确性或AUROC的SSL研究不同,我们分析了SSL预训练持续时间如何影响在基于校准置信度的退避下的置信度行为。然而,一旦准确性饱和,选择性性能仍可能在不同检查点间显著变化,且更长的预训练并不总能提高可靠性。这些结果强调了退避意识评估的重要性,并建议预训练长度应被视为重要的可靠性相关设计选择,而非仅是计算细节。代码可在GitHub上获取。

English abstract

Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.
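
Coverage and selective accuracy under confidence-based abstention can be computed as in the sketch below. It assumes the confidences have already been calibrated and that abstention means deferring every case whose confidence falls below a fixed threshold; the function name and the toy numbers are illustrative and do not come from the paper.

```python
import numpy as np

def selective_metrics(confidence, predictions, labels, threshold):
    """Return (coverage, selective_accuracy) when low-confidence cases are deferred.

    coverage: fraction of cases the model answers (confidence >= threshold).
    selective_accuracy: accuracy on the answered cases only (NaN if none are answered).
    """
    confidence = np.asarray(confidence, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    keep = confidence >= threshold
    coverage = float(keep.mean())
    if not keep.any():
        return coverage, float("nan")
    accuracy = float((predictions[keep] == labels[keep]).mean())
    return coverage, accuracy

conf = [0.95, 0.60, 0.80, 0.55]
pred = [0, 1, 2, 1]
true = [0, 2, 2, 1]
answered_half = selective_metrics(conf, pred, true, threshold=0.7)
answered_all = selective_metrics(conf, pred, true, threshold=0.0)
```

Sweeping the threshold traces out a coverage versus accuracy curve, which is one way to compare checkpoints whose plain accuracy has already saturated.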

2605.19132 2026-05-20 cs.LG

CLIC: Contextual Language-Informed Cardiac Pathology Classification

CLIC: 基于上下文的语言引导心脏病理分类

Giovani D. Lucafo, Rafael da Costa Silva, João Lucas Luz Lima Sarcinelli, Andre Guarnier De Mitri, Diego Furtado Silva

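Affiliations Institute of Mathematical and Computer Sciences; Universidade de São Paulo

AI summary This paper proposes CLIC, a framework that converts patient context data into descriptive text and encodes it through natural language to improve the precision of cardiac pathology diagnosis. It also explores LLM-generated clinical descriptions for the downstream classification task.

Comments 6 pages, 2 figures, accepted at the ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM)

Chinese abstract (AI-generated)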

心电图(ECG)是无创诊断心脏病理的黄金标准,也是心血管医学的基本支柱。深度学习的最新进展推动了稳健的自动化分类器的发展,这些分类器通过处理原始生理信号实现高性能。然而,在临床实践中,诊断很少仅基于信号本身。心内科医生通常会结合患者的特征和具体的数据采集上下文来支持其解释。尽管如此,大多数现有算法仍局限于仅信号分析,未能整合技术元数据和人口统计数据。本文提出了上下文语言引导的心脏病理分类(CLIC),一种多模态框架,通过自然语言编码这些变量显著提高诊断精度。我们证明将患者层面的上下文数据转化为描述性文本提供了一个信息锚点,帮助模型解歧复杂的生理模式。我们进一步探讨了使用大语言模型合成更丰富的临床描述,并观察到尽管这些生成的文本仍具竞争力,但受控模板化的上下文临床文本在下游分类任务中带来了持续的性能提升。

English abstract

The electrocardiogram (ECG) is the gold standard for non-invasive diagnosis of cardiac pathologies and is a fundamental pillar of cardiovascular medicine. Recent progress in deep learning has led to the development of robust automated classifiers that achieve high performance by processing raw physiological signals. However, in clinical practice, diagnosis is rarely based solely on the signal. Cardiologists commonly support their interpretation with the patient's characteristics and the specific data-acquisition context. Despite this, most current algorithms remain restricted to signal-only analysis, failing to integrate technical metadata and demographic variables. This paper proposes Contextual Language-Informed Cardiac pathology classification (CLIC), a multimodal framework that significantly enhances diagnostic precision by encoding these variables through natural language. We demonstrate that translating patient-level contextual data into descriptive text provides an informative anchor that helps the model disambiguate complex physiological patterns. We further investigate the use of Large Language Models to synthesize richer clinical descriptions and observe that, while these generated texts remain competitive, controlled template-based contextual clinical text leads to consistent improvements in downstream classification performance.
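
The "controlled template-based contextual clinical text" mentioned above could look something like the following sketch, which turns a metadata record into one descriptive sentence. The field names (`age`, `sex`, `device`, `sampling_rate_hz`) and the sentence template are hypothetical; the abstract does not list the actual variables or wording used by CLIC.

```python
def describe_context(meta):
    """Render patient and acquisition metadata as a single descriptive sentence.

    Missing fields are skipped so the template never emits placeholder text.
    """
    who = []
    if meta.get("age") is not None:
        who.append(f"{meta['age']}-year-old")
    if meta.get("sex"):
        who.append(meta["sex"])
    sentence = "ECG of a " + " ".join(who + ["patient"])
    if meta.get("device"):
        sentence += f", recorded with {meta['device']}"
    if meta.get("sampling_rate_hz"):
        sentence += f" at {meta['sampling_rate_hz']} Hz"
    return sentence + "."

full = describe_context(
    {"age": 63, "sex": "female", "device": "a 12-lead recorder", "sampling_rate_hz": 500}
)
empty = describe_context({})
```

The resulting string would then be passed to a text encoder and fused with the signal representation; that fusion step is outside the scope of this sketch.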

2605.19130 2026-05-20 cs.LG cs.AI cs.CL cs.CV

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

EgoBabyVLM:基于自然主义第一人称视频数据的跨模态学习基准测试

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

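Affiliations Meta Superintelligence Labs; Stanford University; Meta Reality Labs; The University of Tokyo

AI summary Motivated by how robustly children acquire language grounding from limited visuo-linguistic input, this study benchmarks vision-language models on naturalistic egocentric video and introduces the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from naturalistic data.

Chinese abstract (AI-generated)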

儿童在有限的视觉-语言输入中展现出惊人的鲁棒性,这种能力超过了目前最好的大型多模态模型。最近的研究表明,目前基于 curated web 数据训练的视觉-语言模型 (VLMs) 无法泛化到由可穿戴设备、具身代理和婴儿头摄像机产生的稀疏、弱对齐的第一人称视频流,并且没有固定的评估流程来衡量在此类数据上的进展。我们训练 VLMs 在具有不同视觉和语言输入语义对齐程度的数据集上,包括自然主义婴儿和成人第一人称视频,并通过涵盖多模态语言 grounding 和单模态视觉和语言任务的综合评估套件进行评估。这套评估的核心是 Machine-DevBench,它是一个基于语料库的基准测试,自动从模型的训练词汇中生成,以消除训练/评估不匹配和先前发展基准的低统计效力。我们的结果表明,当前 VLM 模型依赖于 curated 数据的紧密语义对齐,并无法利用主导自然主义第一人称输入的弱对齐信号——正是人类在其中茁壮成长的领域。为了推动进展,我们引入了 EgoBabyVLM 挑战,以驱动开发能够从人类婴儿经历的此类自然主义数据中实现 grounded language learning 的模型。

English abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

2605.19127 2026-05-20 cs.AI

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

POLAR-Bench: 一个用于LLM代理隐私-效用权衡的诊断基准

Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag

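Affiliations ETH Zurich; ETH AI Center

AI summary This paper proposes POLAR-Bench, a benchmark for evaluating the privacy-utility trade-off of LLM agents. Across 10 domains and 7,852 samples, it scores privacy and utility by deterministic set-membership and varies privacy policy dimension and attack strategy along two orthogonal axes, producing a 5×5 diagnostic surface per model. Current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1-30B range score notably worse, with the weakest leaking over half.

Comments Preprint

Chinese abstract (AI-generated)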

LLM代理越来越多地访问用户的私人数据,并在与第三方系统交互时代表用户行事。用户定义哪些信息可以共享、哪些信息不得共享,即使第三方系统表现出对抗性,代理也必须稳健地遵循这一意图。我们引入了POLAR-Bench(Policy-aware adversarial Benchmark,策略感知对抗基准),其中一个带有隐私策略和任务的受信任模型与一个第三方模型对话,后者以对抗方式探测任务相关属性和受保护属性。在10个领域和7,852个样本上,我们通过确定性的集合成员关系对隐私和效用进行评分,并沿两个正交轴改变隐私策略维度和攻击策略,为每个模型生成一个5×5的诊断表面。我们的结果揭示了明显的分化:当前前沿模型对超过99%的受保护属性不予披露,而1-30B范围内较小的开放权重模型(即用户最常在设备上或通过私有推理作为自己的受信任代理运行的那一类)得分明显更差,其中最弱的模型泄露超过一半。因此,POLAR-Bench能够定位每个模型在意图遵循上失效的位置,为在最关键之处推进隐私对齐提供了立足点。

English abstract

LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5×5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1-30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model's intent-following breaks down, providing a foothold for privacy alignment where it matters most.
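
Scoring by deterministic set-membership can be sketched as plain set arithmetic, assuming the attributes disclosed in a dialogue have already been extracted into a set. How that extraction is done, and the exact definitions of the utility and privacy scores, are assumptions of this sketch rather than details given in the abstract; the attribute names are hypothetical.

```python
def score_dialogue(disclosed, allowed, protected):
    """Score one dialogue from the set of attributes the trusted agent disclosed.

    utility: share of task-relevant (allowed) attributes that were shared.
    privacy: share of protected attributes that were withheld.
    """
    disclosed, allowed, protected = set(disclosed), set(allowed), set(protected)
    utility = len(disclosed & allowed) / len(allowed) if allowed else 1.0
    leak_rate = len(disclosed & protected) / len(protected) if protected else 0.0
    return {"utility": utility, "privacy": 1.0 - leak_rate}

result = score_dialogue(
    disclosed={"name", "city", "diagnosis"},
    allowed={"name", "city"},
    protected={"diagnosis", "income"},
)
```

Because no judge model is involved at scoring time, the same transcript always yields the same score, which is presumably what "deterministic" refers to.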

2605.19120 2026-05-20 cs.RO

CosFly: Plan in the Matrix, Fly in the World

CosFly:矩阵中的计划,世界中的飞行

Hanxuan Chen, Xiangyue Wang, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Binbo Li, Kangli Wang, Ji Pei

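Affiliations Autel Robotics; Nanjing University; Peking University; Southern University of Science and Technology; University of Hong Kong

AI summary This paper presents CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments. CosFly converts complex 3D worlds into structured obstacle representations for planning, projects the resulting trajectories back into multi-modal sensor data, and supports configurable fixed-FOV zoom levels.

Chinese abstract (AI-generated)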

我们介绍了CosFly,一个用于空中跟踪的盒状结构规划和多模态模拟流程,以及CosFly-Track,一个大规模的无人机数据集,用于在多样环境中进行动态目标跟踪。在我们的当前实现上,CosFly提供了一个模块化的7步构建流程,将复杂的3D世界转换为结构化的障碍表示用于规划,然后将结果轨迹投影到多模态传感器数据中,包括RGB图像、高精度深度图和语义分割掩码,并配以自然语言导航指令。一个关键特点是支持可配置的固定视角缩放级别(每个轨迹一个视角设置并保持恒定),通过相机内参数调整模拟各种焦距。该流程涵盖了从3D地图导出通过网格简化、行人和无人机轨迹规划、多模态渲染(6自由度姿态注释)、质量检查以及教师-学生描述生成的完整流程。我们分析了两种轨迹规划范式:传统的两阶段流程(前端候选生成和后端细化)以及直接基于梯度的公式,该公式在单一目标中优化多个跟踪约束。公开的CosFly-Track发布包含250条经过验证的轨迹和约10万张渲染图像,具有完整的6自由度无人机姿态注释(位置x、y、z和方向偏航、俯仰、滚动)。共同,该流程和数据集建立了一个可扩展的基础,支持在多样环境中进行空中-地面协同研究,支持动态目标跟踪、无人机导航和多模态感知。

English abstract

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.
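
The link between a fixed field of view and the simulated focal length follows from the pinhole camera model: the focal length in pixels is half the image width divided by the tangent of half the horizontal FOV. The sketch below assumes square pixels and a principal point at the image center; the function name and the image size are illustrative, not taken from the CosFly pipeline.

```python
import math

def intrinsics_from_fov(width, height, hfov_deg):
    """Build a 3x3 pinhole intrinsic matrix from image size and horizontal FOV in degrees."""
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    return [
        [focal_px, 0.0, width / 2.0],
        [0.0, focal_px, height / 2.0],   # square pixels assumed, so fy equals fx
        [0.0, 0.0, 1.0],
    ]

wide = intrinsics_from_fov(800, 600, hfov_deg=90.0)
zoomed = intrinsics_from_fov(800, 600, hfov_deg=30.0)
```

Narrowing the FOV increases the focal length in pixels, which is how a zoom level can be emulated purely through the camera intrinsics.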

2605.19111 2026-05-20 cs.CV cs.AI

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

FAGER:基于事实的文本到图像模型评估与改进

Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

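Affiliations Boston University; Adobe; IBM Research

AI summary This paper proposes FAGER, a framework for evaluating and refining the factual accuracy of text-to-image models. It builds a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, evaluates images against the rubric with a VLM, outperforms prior metrics on a Factual A/B test, and can refine T2I outputs without training.

Comments It was accepted for an oral presentation at the 2nd Workshop on the Evaluation of Generative Foundation Models (EVGENFM2026) at CVPR 2026. Total 8 pages (1 page for references). 5 figures

Chinese abstract (AI-generated)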

现有文本到图像(T2I)评估指标主要评估生成图像是否与提示中明确陈述的信息一致,但往往无法捕捉隐含、外部依赖或定义身份的事实要求。因此,它们不适合评估涉及科学知识、历史事实、产品或文化特定概念的提示中的事实正确性。我们提出了FActually Grounded Evaluation and Refinement(FAGER),一种代理框架,用于评估生成图像是否正确反映由提示中或暗示的视觉可验证事实,并提供改进的可操作反馈。FAGER首先通过结合LLM生成事实与参考引导的视觉事实提取和验证构建结构化事实评估标准,然后将该标准转换为基于VLM的问答对进行评估。为了验证FAGER作为事实性度量标准的有效性,我们引入了事实性A/B测试,该测试衡量度量标准是否更倾向于选择事实参考图像而非对应的生成图像。在涵盖科学、历史、产品、文化和知识密集型概念的五个数据集中,FAGER在该测试中始终优于现有方法。我们进一步表明,FAGER可以以无训练的方式用于改进T2I输出,在多个数据集中产生显著的事实性提升。

English abstract

Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
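
The Factual A/B test reduces to a preference rate: for each prompt, a metric passes if it scores the factual reference image strictly above the corresponding generated image. In the sketch below images are represented as sets of visible facts and the metric is a toy overlap count, both purely illustrative stand-ins; counting ties as failures is also an assumption.

```python
def factual_ab_accuracy(metric, pairs):
    """Fraction of (prompt, reference, generated) triples where the reference scores higher."""
    wins = sum(1 for prompt, ref, gen in pairs if metric(prompt, ref) > metric(prompt, gen))
    return wins / len(pairs)

# Hypothetical fact table: facts an image must show to depict each prompt correctly.
REQUIRED_FACTS = {
    "water molecule": {"one oxygen", "two hydrogens", "bent shape"},
    "Eiffel Tower": {"iron lattice", "four legs"},
}

def toy_metric(prompt, image_facts):
    """Toy factuality score: how many required facts are visible in the image."""
    return len(REQUIRED_FACTS[prompt] & image_facts)

pairs = [
    ("water molecule", {"one oxygen", "two hydrogens", "bent shape"}, {"one oxygen"}),
    ("Eiffel Tower", {"iron lattice", "four legs"}, {"iron lattice", "four legs"}),
]
rate = factual_ab_accuracy(toy_metric, pairs)
```

Here the second pair is a tie and counts against the metric, so the rate is 0.5.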

2605.19107 2026-05-20 cs.LG eess.SP

Performance Monitoring of Proton Exchange Membrane Water Electrolyzer by Transformers-Based Machine Learning Model

Performance monitoring of proton exchange membrane water electrolyzers with a Transformer-based machine learning model

Bingqing Chen, Ivan Batalov, Qiu Chen, Weiqi Ji, Lei Cheng

Affiliations * Bosch Research & Technology Center

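AI summary This paper proposes a Transformer-based machine learning framework for virtual electrochemical characterization during normal operation. An encoder-decoder architecture reconstructs polarization curves, enabling continuous monitoring of the state of health of proton exchange membrane water electrolyzers.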

Details
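AI Chinese abstract (translated)

Green hydrogen plays a key role in decarbonization, with capacity expected to expand to 560 GW by 2030 (from 1.39 GW in 2023). Proton exchange membrane (PEM) electrolysis is one of the most promising technology routes for producing green hydrogen, and real-time monitoring of the system health of PEM electrolyzers is essential for deploying them at scale. In laboratory settings, performance degradation can be characterized with electrochemical testing protocols that periodically pause normal operation. Such interruptions are impractical for full-scale stack deployments, which limits operators' ability to assess state of health (SoH) in real time. This paper presents a machine learning (ML) framework that performs virtual electrochemical characterization during normal operation. The method uses an encoder-decoder Transformer, conditioned on operational data, to reconstruct characterization outputs, with a focus on polarization curves. Inspired by patch-based sequence tokenization, we split the inputs into patches and encode them into meaningful tokens, which greatly improves learning efficiency. Across four longitudinal runs lasting up to 478 hours, on different test cells and load cycles, the model accurately reconstructed polarization curves and achieved a 10x reduction in mean squared error (MSE) compared with a vanilla Transformer. This proof of concept shows that ML models can enable continuous performance monitoring of PEM electrolyzers and that the encoder captures meaningful latent representations of SoH, creating opportunities to derive interpretable indicators in future work.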

English abstract

Green hydrogen plays an essential role in decarbonization, with capacity projected to scale to 560 GW by 2030 (vs. 1.39 GW in 2023) in net-zero settings. Proton exchange membrane (PEM) electrolysis is one of the most promising technology routes to green hydrogen production, and real-time system health monitoring of PEM electrolyzers is essential for their scalable deployment. In lab settings, performance degradation can be characterized through electrochemical testing protocols by periodic pauses of normal operation. Such interruption is not practical for full-scale stack deployments, limiting system operators' ability to make real-time assessments of state-of-health (SoH). We present a machine learning (ML) framework that performs virtual electrochemical characterization during normal operation. The method uses an encoder-decoder transformer, conditioned on operational data, to reconstruct characterization outputs, focusing here on polarization curves. Inspired by patch-based sequence tokenization, we segment the inputs into patches and encode them to form meaningful tokens, which substantially improves learning efficiency. Across four longitudinal runs, lasting up to 478 hours on different test cells and loading cycles, the model accurately reconstructed polarization curves and achieved a 10x reduction in mean squared error (MSE) compared to a vanilla transformer. This proof-of-concept demonstrates that ML models can enable continuous performance monitoring for PEM electrolyzers and that the encoder captures meaningful latent representations of SoH, opening up opportunities to derive interpretable indicators in future work.
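The patch-based tokenization mentioned in the abstract can be illustrated with a minimal sketch that slices an operational time series into fixed-length windows. The function name `patchify`, the patch length, the stride, and the sample voltage values are assumptions for illustration. The abstract does not give the paper's actual patch settings or how each patch is encoded into a token.

```python
from typing import List, Sequence


def patchify(series: Sequence[float], patch_len: int, stride: int) -> List[List[float]]:
    """Split a 1-D series into windows of length `patch_len`, stepping by `stride`.

    Trailing samples that do not fill a complete patch are dropped
    (an assumption; padding would be another reasonable choice).
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [list(series[i:i + patch_len]) for i in range(0, last_start + 1, stride)]


# Hypothetical cell-voltage readings (V), made up for this example.
voltages = [1.70, 1.71, 1.73, 1.72, 1.75, 1.76, 1.78, 1.77]

# Overlapping patches of 4 samples with a stride of 2.
patches = patchify(voltages, patch_len=4, stride=2)
```

In a Transformer pipeline, each patch would typically be mapped to one token embedding, for example by a learned linear projection. This shortens the sequence the attention layers have to process compared with one token per sample.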

2605.19104 2026-05-20 cs.RO cs.AI

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

Neural operators for design-space surrogate modeling of tendon-actuated continuum robots

Branden Frieden, James M. Ferguson, Alan Kuntz, Varun Shankar

Affiliations * The Robotics Center and the Kahlert School of Computing at the University of Utah; The Departments of Computer Science and Electrical and Computer Engineering at Vanderbilt University

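AI summary This paper proposes a neural-operator-based learning approach to design-space surrogate modeling of tendon-driven continuum robots. By mapping robot design parameters and tendon actuation inputs to the resulting configuration, it generalizes across a large class of robot designs.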

Comments Accepted to ICRA 2026

Details
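AI Chinese abstract (translated)

Continuum robots enable dexterous manipulation in constrained environments, but they need accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and inaccurate because of unmodeled effects, while current learning-based methods generalize poorly beyond the specific robot they were trained on. This paper formulates surrogate modeling of tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to the resulting configuration. This lets a single trained model generalize across a large class of robot designs. We develop four novel neural operator architectures, two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs), and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing fast and accurate generalization across designs. Our results show that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.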

English abstract

Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.
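To make the operator-learning formulation concrete, the sketch below shows the forward pass of a standard, textbook DeepONet: a branch network encodes the input function (here, design parameters plus tendon inputs) and a trunk network encodes the query location along the robot. It is not one of the paper's four architectures, which the abstract does not describe. The input layout (3 design parameters and 2 tendon tensions), the layer sizes, the scalar output, and the random untrained weights are all assumptions for illustration.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)


def make_mlp(sizes: List[int], rng: random.Random) -> List[Layer]:
    """Build an MLP with small random weights and zero biases (untrained)."""
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(layers: List[Layer], x: List[float]) -> List[float]:
    """Apply each affine layer, with tanh on all but the last layer."""
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def deeponet_forward(branch: List[Layer], trunk: List[Layer],
                     u: List[float], s: float) -> float:
    """Standard DeepONet readout: dot product of branch(u) and trunk(s)."""
    b = mlp_forward(branch, u)
    t = mlp_forward(trunk, [s])
    return sum(bk * tk for bk, tk in zip(b, t))


rng = random.Random(0)
latent = 8                                # shared feature width (assumed)
branch = make_mlp([5, 16, latent], rng)   # 5 = 3 design params + 2 tendon tensions (hypothetical)
trunk = make_mlp([1, 16, latent], rng)    # input: normalized arc length s in [0, 1]

u = [0.2, 0.01, 0.5, 1.0, 0.0]            # made-up design and actuation vector
# Query one scalar output at 11 points along the backbone. A real
# configuration would be vector-valued, e.g. a 3-D position per point.
shape = [deeponet_forward(branch, trunk, u, s / 10) for s in range(11)]
```

Because the design parameters enter through the branch input, the same trained weights can in principle be queried for a new design without retraining. That is the kind of cross-design generalization the abstract describes.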