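arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.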

2606.09499 2026-06-09 cs.RO cs.AI cs.CR New submission

Targeting World Models to Compromise Robot Learning Pipelines

Ethan Rathbun, Ahmed Agha, Saaduddin Mahmud, Christopher Amato, Alina Oprea, Eugene Bagdasarian

Affiliations: Northeastern University; University of Massachusetts Amherst

AI summary: Proposes new data poisoning attacks that target world models: malicious prompts or compromised transition dynamics are injected into seemingly safe data, so that the world model generates dangerous training trajectories and the downstream policy becomes unsafe.

Comments: 8 Pages, CoRL Preprint

Abstract

World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot training data or simulating real world environments, with many works proposing their integration into the robot learning pipeline. While highly practical, in this work we demonstrate that world models introduce a uniquely stealthy and effective data poisoning entry point into the robot learning supply chain that can result in the deployment of unsafe or otherwise compromised robotic policies despite training on seemingly safe ground truth training data. In contrast to traditional data poisoning techniques which directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets which are only activated once fed through a world model as input. This can result in the generation of synthetic, dangerous robot training trajectories and subsequently unsafe or compromised robot policies. We demonstrate the effectiveness of our attacks against both state of the art action conditioned and text conditioned world models, showing a full end-to-end backdoor on a downstream DRL policy and a proof-of-concept for the VLA setting. Overall these findings necessitate research into more secure world models and reevaluating their position within the robot learning supply chain.

2606.09498 2026-06-09 cs.CL New submission

Self-Harness: Harnesses That Improve Themselves

Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu

Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: Proposes the Self-Harness paradigm, in which an LLM agent iteratively improves its own harness through weakness mining, harness proposal, and proposal validation; on Terminal-Bench-2.0 it raises held-out pass rates for three models by 21.4, 14.3, and 14.2 percentage points, respectively.

Abstract

The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.
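
The three-stage loop described in the abstract can be pictured with a toy sketch. Everything in it is hypothetical: the function names (`self_harness_loop`, `mine`, `propose`, `evaluate`), the representation of a harness as a set of rule names, and the acceptance rule (keep an edit only if a regression score strictly improves) are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Harness = FrozenSet[str]  # hypothetical: a harness is just a set of rule names


def self_harness_loop(
    harness: Harness,
    mine: Callable[[Harness], Iterable[str]],
    propose: Callable[[Harness, Iterable[str]], Iterable[Harness]],
    evaluate: Callable[[Harness], float],
    rounds: int = 3,
) -> Tuple[Harness, float]:
    """Toy loop: mine weaknesses, propose edits, keep only edits that pass regression."""
    best = evaluate(harness)
    for _ in range(rounds):
        weaknesses = list(mine(harness))                 # stage 1: weakness mining
        for candidate in propose(harness, weaknesses):   # stage 2: harness proposal
            score = evaluate(candidate)                  # stage 3: proposal validation
            if score > best:                             # assumed acceptance rule
                harness, best = candidate, score
    return harness, best


# Toy stand-ins so the sketch runs end to end.
HELPFUL = {"retry_on_timeout", "check_exit_code"}
HARMFUL = {"always_sudo"}


def mine(h: Harness):
    # Report every helpful rule the harness still lacks as a "weakness".
    return sorted(HELPFUL - h)


def propose(h: Harness, weaknesses):
    # One minimal edit per weakness, plus one deliberately bad edit.
    for w in weaknesses:
        yield h | {w}
    yield h | {"always_sudo"}


def evaluate(h: Harness) -> float:
    # Hypothetical regression score: helpful rules add, harmful rules subtract.
    return len(h & HELPFUL) - len(h & HARMFUL)


final, score = self_harness_loop(frozenset(), mine, propose, evaluate)
```

In this toy run the harmful edit is proposed every round but never accepted, because it never raises the regression score; that is the role the abstract assigns to the validation stage.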

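2606.09495 2026-06-09 cs.CV New submission

ContextShift: A Controlled Benchmark for Context Dependence in Object Detection

Dan Zlotnikov, Alex Lazarovich, Ohad Ben-Shahar

Affiliations: Ben-Gurion University of the Negev

AI summary: Proposes the ContextShift benchmark, which systematically manipulates object-context relationships through geometric transformations and background substitution. Detector degradation shows up mainly as more missed detections and fewer predictions, statistical co-occurrence does not correlate linearly with effective visual context, and context-aware augmentation improves robustness.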

Abstract

Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP on uncontrolled distribution shifts, which can obscure how performance degrades under context change. We introduce ContextShift, a controlled benchmark that systematically manipulates object--context relationships while preserving object appearance. Built on COCO 2017, it isolates context as an independent variable through geometric transformations and synthetic and natural background substitutions, including a continuous compatibility axis based on normalized pointwise mutual information (NPMI). Across diverse detector architectures, we observe a consistent degradation pattern: false negatives increase by up to 227% and prediction volume decreases by up to 44%, while false positives remain stable or decline. This suppression behavior is not captured by aggregate metrics such as AP, which can mask substantial recall loss and changes in prediction dynamics. Further analysis suggests that degradation is driven less by reduced confidence than by a reduced formation of valid detection candidates. Moreover, performance along the statistical compatibility axis is non-monotonic, peaking at intermediate NPMI and degrading toward both extremes, indicating that statistical co-occurrence does not correlate linearly with effective visual context. Finally, we show that context-aware augmentation improves robustness: every augmented variant outperforms the dataset-only baseline on both original and manipulated test images, partially recovering performance lost to prediction-suppression failures by exposing models to object--context decoupling during training.
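
For readers unfamiliar with the NPMI axis mentioned above, here is a minimal sketch of normalized pointwise mutual information computed from raw co-occurrence counts. The function name, the count-based inputs, and the conventions for the degenerate cases are assumptions for illustration; the abstract does not say how the benchmark estimates these probabilities.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized PMI in [-1, 1] from co-occurrence counts over `total` images."""
    if count_xy == 0:
        return -1.0  # convention: the pair never co-occurs
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        return 1.0  # both always present; avoids dividing by log(1) = 0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


independent = npmi(count_xy=25, count_x=50, count_y=50, total=100)
always_together = npmi(count_xy=50, count_x=50, count_y=50, total=100)
never_together = npmi(count_xy=0, count_x=50, count_y=50, total=100)
```

Values near 1 mean an object and a context almost always appear together, values near 0 mean they are statistically independent, and -1 means they never co-occur.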

2606.09489 2026-06-09 cs.AI New submission

LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

Giorgio Leonardi, Stefania Montani, Manuel Striani, Alessandro Canessa, Delfina Ferrandi

Affiliations: Computer Science Institute, DiSIT, University of Piemonte Orientale; Integrated Laboratory of AI and Medical Informatics, DAIRI, SS. Antonio e Biagio e Cesare Arrigo Hospital

AI summary: Proposes a modular framework based on LLM orchestration that automatically extracts patient traces from unstructured clinical text, identifies normative rules from guideline text, and computes a conformance indicator. Evaluated in stroke care, where more than 86% of the traces were conformant.

Abstract

Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.

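2606.09484 2026-06-09 cs.CL New submission

Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism

Kumar Thushalika, Sukumar Kishanthan, Asela Hevapathige

Affiliations: University of Ruhuna; University of Moratuwa; University of Melbourne

AI summary: Using a graph isomorphism detection task, this study exposes a "false success" of LLMs: although they reach near-perfect accuracy on the detection task, they fail to recognize identical graphs whose node labels have been permuted, which indicates reliance on surface patterns rather than abstract structural reasoning.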

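2606.09483 2026-06-09 cs.CL cs.AI New submission

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu

Affiliations: Tencent

AI summary: Proposes DCPM, a system that draws on dual-process theory to organize agent memory into a cognitive capability hierarchy. A synchronous daytime writer handles belief revision and an asynchronous nighttime engine handles schema induction, with notable gains on implicit cross-session inference tasks.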

Abstract

Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
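
One way to picture the "doubly linked supersedes chains" mentioned in the abstract is a record that points both to the belief it replaces and to the belief that later replaces it. The sketch below is only a guess at the data structure: the `Belief` class, its field names, and the helper functions are hypothetical and are not taken from DCPM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Belief:
    text: str
    supersedes: Optional["Belief"] = None      # link to the older belief
    superseded_by: Optional["Belief"] = None   # link to the newer belief


def revise(old: Belief, new_text: str) -> Belief:
    """Record a revision by linking the new belief to the one it replaces."""
    new = Belief(text=new_text, supersedes=old)
    old.superseded_by = new
    return new


def current(belief: Belief) -> Belief:
    """Walk forward to the latest belief in the chain."""
    while belief.superseded_by is not None:
        belief = belief.superseded_by
    return belief


def history(belief: Belief) -> List[str]:
    """Walk backward from the latest belief and return the trajectory, oldest first."""
    node: Optional[Belief] = current(belief)
    texts: List[str] = []
    while node is not None:
        texts.append(node.text)
        node = node.supersedes
    return texts[::-1]


b1 = Belief("prefers tea")
b2 = revise(b1, "prefers coffee")
b3 = revise(b2, "prefers decaf coffee")
```

Keeping links in both directions lets a reader start from any point in the chain and recover either the latest belief or the full trajectory of how it changed.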

2606.09479 2026-06-09 cs.CV cs.DL New submission

Optical Music Recognition for Real-World Manuscripts with Synthetic Data

Jiří Mayer, Martina Dvořáková, Vojtěch Dvořák, Markéta Herzánová Vlková, Filip Bím, Pavel Pecina, Samuel Šomorjai, Petr Žabička, Jan Hajič

Affiliations: Institute of Formal and Applied Linguistics, Charles University; Moravian Library

AI summary: For recognizing complex real-world piano manuscripts in resource-constrained settings, proposes domain adaptation with synthetic manuscript images, which significantly improves performance and avoids expensive fine-grained annotation.

Comments: Accepted for publication at the ICDAR 2026 conference

Abstract

Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable of recognising notation at all levels of complexity. However, the impact of this progress has been limited by the visual domains of available training datasets, which are largely born-digital. Existing large collections of sheet music in libraries and other heritage institutions contain predominantly manuscripts, whose visual domains are highly diverse and different, so existing OMR systems fail when applied in the real world. These institutions are often resource-constrained, so large in-domain datasets cannot be expected. We provide a first baseline on real-world manuscripts with complex piano notation in the resource-constrained scenario. Using fine-grained music notation graph (MuNG) annotations and the Smashcima synthesis tool, we then show that while some direct transcriptions of in-domain data remain essential, domain adaptation using synthetic musical manuscript images brings significant improvement. Furthermore, the symbols used do not need to be in-domain, so the expensive fine-grained annotation can be avoided. We thus bring OMR closer to one of its stated goals: preserving and promoting musical cultural heritage.

2606.09477 2026-06-09 cs.CV New submission

Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems

Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv

Affiliations: Naval Aviation University; National University of Defense Technology

AI summary: Proposes two minimal solvers based on IMU priors that need only four point correspondences and reduce multi-camera relative pose estimation to a univariate 6th-degree polynomial, lowering computational complexity and performing well within RANSAC frameworks.

Abstract

Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.
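
The link between fewer correspondences and RANSAC efficiency can be made concrete with the standard iteration-count estimate: the number of random samples needed to draw at least one all-inlier sample with a given confidence. The sketch below uses that textbook formula. The 50% inlier ratio and the six-point comparison are arbitrary reference values chosen for illustration, not figures from the paper.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int, confidence: float = 0.99) -> int:
    """Samples needed so that at least one is all-inlier with probability `confidence`."""
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample <= 0.0:
        raise ValueError("inlier_ratio must be positive")
    if p_good_sample >= 1.0:
        return 1  # every sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))


four_point = ransac_iterations(inlier_ratio=0.5, sample_size=4)
six_point = ransac_iterations(inlier_ratio=0.5, sample_size=6)
```

Because the probability of an all-inlier sample falls exponentially with the sample size, each correspondence removed from the minimal set cuts the required iterations sharply, on top of any savings from solving a lower-degree polynomial per sample.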

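2606.09476 2026-06-09 cs.RO New submission

Goal Sets, Not Goal States: Queryable Robot Goals through Goal-Set Hindsight Relabeling

Carlos Vélez García, Miguel Cazorla, Jorge Pomares

Affiliations: INESCOP; University of Alicante

AI summary: Proposes Goal-Set Hindsight Relabeling (GS-HER), which generalizes hindsight relabeling from single goal states to predicate-level goal sets. Queryable binary predicates decouple success conditions from state dimensions, improving offline GCRL when states contain redundant dimensions and letting a single model support multiple goal predicates.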

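2606.09474 2026-06-09 cs.CV New submission

Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration

Silas Kwabla Gah, Ebenezer Owusu

Affiliations: University of Ghana

AI summary: Proposes the Open-V framework, which achieves training-free generalized few-shot segmentation by coordinating frozen semantic priors (SAM3 PCS and a K-shot CLIP support centroid) at inference time, surpassing trained methods on several benchmarks.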

Abstract

Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recognition and segmentation capabilities, raising a different question: can GFSS be solved through inference-time coordination of frozen semantic priors rather than parameter adaptation? We answer this question with Open-V, a training-free GFSS framework that combines Segment Anything (SAM3) Promptable Concept Segmentation (PCS) with a K-shot CLIP support centroid through calibrated per-pixel semantic arbitration. Open-V introduces no trainable components and supports arbitrary semantic categories at inference time. Beyond segmentation performance, our study contributes three broader findings. First, we show that support information can be incorporated through inference-time semantic grounding, and that its contribution increases as foundation-model text priors weaken on label-disjoint vocabularies. Second, we identify a reproducibility confound in foundation-model segmentation, demonstrating that preprocessing and evaluation-space mismatches can silently distort reported performance. Finally, we validate Open-V across PASCAL-5i, COCO-20i, and ADE-OW, showing that training-free coordination of foundation-model priors generalizes across both conventional GFSS and open-vocabulary evaluation settings. On PASCAL-5i (1-shot), Open-V attains base/novel/harmonic mIoU of 78.4/77.5/77.9, without GFSS-specific training, surpassing the strongest trained baseline by +17.7 HM.
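
The reported harmonic score can be sanity-checked from the base and novel numbers, assuming "HM" means the ordinary harmonic mean of base and novel mIoU (the abstract does not define it). The function name below is ours.

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of two scores; it is pulled toward the lower of the two."""
    return 2.0 * base * novel / (base + novel)


hm = harmonic_mean(78.4, 77.5)  # PASCAL-5i 1-shot base and novel mIoU from the abstract
```

Under that assumption the value rounds to 77.9, which matches the figure quoted in the abstract.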

2606.09471 2026-06-09 cs.LG cs.CL New submission

Escaping the KL Agreement Trap in On-Policy Distillation

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; The Hong Kong Polytechnic University; Eastern Institute of Technology, Ningbo

AI summary: In on-policy distillation, the student can fall into a low-KL agreement trap that weakens the training signal. To address this, the paper proposes KAT, a dynamic termination rule that filters weak supervision; on math benchmarks it improves avg@k by 2.66% and pass@k by 3.43% while cutting rollout length by 59.73%.

2606.09470 2026-06-09 cs.CL cs.AI New submission

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Affiliations: Centre for Language Studies, Radboud University

AI summary: Proposes a rubric-guided SpeechLLM trained with a hybrid objective to jointly predict sentence-level and word/phoneme-level labels and generate natural-language rationales, matching or outperforming single-granularity models on SpeechOcean762.

Comments: Accepted to Interspeech 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO)

Abstract

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence-level (accuracy, fluency, prosody), word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

2606.09461 2026-06-09 cs.CL New submission

H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

Shiping Zhu, Yibo Yang, Zhengyang Wang, Tiancheng Shen, Dandan Guo, Ming-Hsuan Yang

Affiliations: Jilin University; Shanghai Jiao Tong University; University of California at Merced

AI summary: Proposes the H2HMem benchmark, which uses dyadic and multi-party multimodal conversations to evaluate agents on memory recall, reasoning, and application, revealing substantial limitations of existing models in multimodal, multi-participant settings.

Comments: 22 pages, 6 figures

Abstract

Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human-assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants. However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human-human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application. Experiments with advanced agents reveal substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions, highlighting substantial room for improvement in next-generation LLM agents.

2606.09459 2026-06-09 cs.CL New submission

AbstRAG: Learning to Abstract for Retrieval Problems

Lei Xu, Xin Quan, Daniel Pedronette, André Freitas

Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne (EPFL); São Paulo State University; University of Manchester; CRUK National Biomarker Centre

AI summary: To address the abstraction gap between queries and document evidence, proposes AbstRAG, which treats abstraction as an explicit retrieval object and applies a reflective refinement mechanism, improving retrieval and generation performance on three benchmarks.

Abstract

Retrieval-augmented generation often fails when the query, the document evidence, and the user's intent are expressed at different levels of abstraction. A query may ask about a class, a relation, or an event, while the document only states specific instances, indirect framings, or scoped formulations. We define this mismatch as an abstraction gap: the minimal set of typed assumptions required to align query intent with the available evidence. To close this gap, we introduce AbstRAG, which treats abstraction as an explicit retrieval object. AbstRAG decomposes the query--evidence gap into expression, conceptual, intent--evidence, and event-type components, and scores relevance by combining match quality, a query-independent utility prior, and the cost of the required bridges. Its central mechanism is reflective refinement: a critic diagnoses retrieval failures, localizes the failed abstraction operator, proposes a minimal stage-specific patch, and accepts the patch only under sufficiency and compression controls. Across three within-document retrieval benchmarks against seven baselines, AbstRAG outperforms on nDCG@10 in 18 of 21 paired-bootstrap contrasts and improves generation accuracy by 1.9%, 5.2%, and 4.0% across the three benchmarks; ablations confirm that reflective refinement drives most of the retrieval gain and the compression control alone reduces over-expansion false positives from 73.7% to 0% on a stress slice.
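
The retrieval results above are reported in nDCG@10. As a reminder of what that metric measures, here is a minimal sketch using the linear-gain form of DCG. It is a simplification: the ideal ranking is computed by re-sorting the same retrieved list, which assumes that list contains every relevant item, and the paper may use a different gain function or evaluation tool.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0])   # relevant passage ranked first
demoted = ndcg_at_k([0, 1, 0])   # same passage ranked second
empty = ndcg_at_k([0, 0, 0])     # nothing relevant retrieved
```

A relevant passage that slips from rank 1 to rank 2 already costs more than a third of the score, which is why the metric is sensitive to the ordering changes a retrieval method makes near the top of the list.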

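2606.09457 2026-06-09 cs.RO New submission

$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models

Zhenguo Sun, Yu Sun, Hande Huang, Alois Knoll

Affiliations: Technical University of Munich

AI summary: Proposes the $ω$-EVA framework, which realizes an Envision-Verify-Act loop through a latent interactive world model. It generates actions using action-conditioned latent dynamics and a language-conditioned flow policy without generating future video, and improves policy performance across diverse robot manipulation tasks.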

Abstract

Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $ω$-EVA, a latent interactive world model that realizes an Envision--Verify--Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $ω$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $ω$-EVA demonstrates a compact and competitive performance--scale--data trade-off, making the world model an active action-feedback module rather than a passive predictor.

2606.09456 2026-06-09 cs.LG New submission

Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tencent; The Hong Kong University of Science and Technology

AI summary: Proposes cross-tokenizer on-policy distillation, in which a precise token-mapping algorithm lets the teacher's probability-distribution signal propagate across different tokenizers, substantially improving compute efficiency.

Abstract

On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teacher and student models to share the same tokenizer, restricting the applicability of OPD within the model series. Current mainstream practice typically employs Supervised Fine-Tuning (SFT) on teacher-generated responses for cross-tokenizer distillation, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, ensuring that high-fidelity token-level signals can propagate across different tokenizers with a precise token-mapping algorithm. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.

2606.09451 2026-06-09 cs.RO cs.CV cs.LG New submission

Dense Force Estimation with an Event-based Optical Tactile Sensor

Agis Politis, René Zurbrügg, Valentina Cavinato

Affiliations: Sony Advanced Visual Sensing, Zurich, Switzerland; ETH Zürich

AI summary: Proposes the first method to reconstruct dense 3D force fields with an event camera: surface displacements are estimated from event data and mapped to forces, with a mean absolute error of (0.14 N, 0.10 N, 0.93 N) at an operating rate of 100 Hz.

Abstract

Humans rely on spatially dense, geometry and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Elements Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.


2606.09450 2026-06-09 cs.AI New submission

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

TheoremBench: 评估LLMs在形式数学中的定理证明能力

QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets

Affiliations: Skolkovo Institute of Science and Technology; HSE University; Artificial Intelligence Research Institute; Sberbank

AI summary: Introduces the TheoremBench benchmark, whose structured theorem families and fine-grained evaluation metrics reveal behavioral biases of current provers on complex theorems.

Comments: Preprint version (20 pages, 10 figures)

Chinese abstract (AI-generated)

LLMs最近在形式证明基准上取得了强劲结果。然而，现有评估仍高度集中在竞赛式问题上，且往往未能捕捉模型在更长、依赖关系更丰富的数学发展中的行为。我们引入TheoremBench，这是一个Lean4基准，旨在评估超越竞赛设置的定理证明器。该基准由近一百个经典定理构建，并以两种互补形式发布：一个简洁主版本，每个实例包含一个目标定理；以及一个前提版本，将每个定理扩展为一个结构化的相关证明任务族，包括主定理以及自动提取的支持性子定理。这种设计不仅能够评估最终定理是否从零开始被证明，还能评估通过定理内部证明结构的部分进展。我们的实验表明，显式前提显著提高了Lean4能力证明器模型的性能。为了提供全面评估，我们引入了定理级覆盖率和令牌效率指标，这些指标揭示了证明行为中的定性差异。结果表明，当前的证明器仍然强烈偏向于简单的子定理，并且通常通过冗长且低效的策略轨迹而非紧凑的证明计划来求解定理。因此，TheoremBench提供了对形式推理能力的更细粒度视角，并强调了结构基准设计对于评估Lean4定理证明器的重要性。

English abstract

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.
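
As a rough illustration of what a theorem-level coverage metric could look like on the premised version, the sketch below scores each theorem family by the fraction of its tasks (main theorem plus subtheorems) that a prover solved. The data layout, the names `theorem_coverage` and `results`, and this particular definition of coverage are assumptions for illustration; the abstract does not spell out the metric.

```python
from typing import Dict


def theorem_coverage(family: Dict[str, bool]) -> float:
    """Fraction of proving tasks in one theorem family that were solved."""
    if not family:
        return 0.0
    return sum(family.values()) / len(family)


# Hypothetical prover results: True means the task was proved.
results = {
    "thm_A": {"main": False, "sub_1": True, "sub_2": True, "sub_3": False},
    "thm_B": {"main": True, "sub_1": True},
}

coverage = {name: theorem_coverage(tasks) for name, tasks in results.items()}
main_solved = sum(tasks["main"] for tasks in results.values())
print(coverage, main_solved)
```

Under this definition a prover that fails the main theorem of `thm_A` still receives partial credit for the subtheorems it closed, which a pass/fail score on the plain main version would hide.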


2606.09449 2026-06-09 cs.CL New submission

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

无金标准推理：自动形式化的代理-裁判理论

Lei Xu, Xin Quan, André Freitas

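Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne (EPFL); University of Manchester; CRUK National Biomarker Centre, University of Manchester

AI summary: Proposes a reference-free proxy-judge framework that replaces gold-standard matching with per-axis property checks to drive iterative refinement of autoformalizations, with a theoretical convergence guarantee and improved pass rates in experiments.

Chinese abstract (AI-generated)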

复杂的推理任务日益要求系统生成其正确性无法通过与单一参考精确匹配来判断的输出。自动形式化（AF）是一个代表性例子；它要求模型将非正式的数学或逻辑推理翻译成可形式化检查的对象，然而专家验证的形式化在玩具案例之外无法扩展，且一个非正式论证可以有许多有效的形式化呈现。因此，进展取决于部分、结构化的代理能否替代精确参考。我们为AF引入了一个无参考的代理-裁判框架，用多轴属性检查向量替代金标准匹配。该框架沿三个结构范围组织代理：涵盖所引发对象的全局属性、子组件内部的模块属性、以及将其重新对齐到非正式来源的跨域属性，并将每个轴聚合成一个裁决向量。该向量驱动一个反思性精炼循环，其中违反的坐标将控制器引导到匹配的修复目标，因此每次迭代仅更改被判断为错误的部分。在有界裁判噪声下，期望的内在差距几何级数收缩到噪声相关的平台。在miniF2F、ProofNet、e-SNLI和ProntoQA上的七个形式化骨干中，精炼持续提升通过率超过单次ICL基线，并且在基线有改进空间的基准上，多轴代理优于匹配的标量代理。因此，结构化代理判断既提供了实用的精炼信号，也提供了在精确参考不可用时收敛的理论依据。

English abstract

Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.
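
One way to picture "contracts geometrically to a noise-dependent plateau" is the scalar recurrence g_{t+1} = rho * g_t + eps, with contraction rate 0 < rho < 1 and a noise term eps, whose fixed point is eps / (1 - rho). This recurrence, the symbols `rho` and `eps`, and the function name `expected_gap` are assumptions chosen for illustration; the abstract states the qualitative result but not its exact form.

```python
def expected_gap(g0: float, rho: float, eps: float, steps: int) -> list:
    """Iterate the assumed recurrence g_{t+1} = rho * g_t + eps."""
    gaps = [g0]
    for _ in range(steps):
        gaps.append(rho * gaps[-1] + eps)
    return gaps


rho, eps = 0.5, 0.05          # hypothetical contraction rate and judge-noise level
gaps = expected_gap(g0=1.0, rho=rho, eps=eps, steps=30)
plateau = eps / (1.0 - rho)   # fixed point of the recurrence
print(gaps[:4], gaps[-1], plateau)
```

In this toy model more refinement rounds cannot push the gap below the plateau; only a less noisy judge (smaller `eps`) or a stronger contraction (smaller `rho`) lowers it.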


2606.09447 2026-06-09 cs.AI New submission

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

AliyunConsoleAgent：通过蒸馏和强化学习在真实云环境中训练Web智能体

Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang

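Affiliations: Alibaba Cloud China

AI summary: Proposes the AliyunConsoleAgent framework: supervised fine-tuning on distilled frontier-model trajectories, followed by reinforcement learning in real cloud environments with GRPO and a dual-channel outcome reward model. It automates documentation verification and approaches the success rate of frontier proprietary models at much lower cost.

Chinese abstract (AI-generated)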

我们提出AliyunConsoleAgent，一个用于真实云控制台自动化文档验证的Web智能体框架。主流云平台包含数百个产品，功能迭代迅速，导致控制台UI频繁与对应文档不一致。验证文档流程准确反映当前控制台并能够端到端执行，每年需要约400万次重复检查，但人工覆盖率仍低于1%。虽然基于前沿专有模型的智能体系统取得了高成功率，但其高昂成本和数据隐私限制阻碍了大规模部署。我们提出一个两阶段训练范式：首先对蒸馏的前沿模型轨迹进行监督微调，然后在真实云环境中使用组相对策略优化（GRPO）和双通道结果奖励模型进行强化学习。为了支持大规模RL训练，我们构建了一个高确定性的回滚系统，采用基于Terraform的资源预置和LLM驱动的按需置备，有效隔离环境噪声与训练信号。我们进一步引入基于后端审计日志的规则奖励评估协议，提供客观、抗奖励破解的结果判断。我们的模型从机械的指令遵循演变为具有云控制台和产品特定理解的自主决策。在一个具有挑战性的278任务基准上（最佳前沿模型仅达到65.34%成功率），AliyunConsoleAgent-32B实现了63.52%的平均成功率——相比基础模型提升20.24个百分点，与最佳前沿专有模型的差距缩小至1.82个百分点（bootstrap 95% CI [-1.27, 7.39]）——而推理成本降低92%。

English abstract

We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.
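
GRPO scores each rollout relative to the other rollouts sampled for the same task rather than against a learned value function. The sketch below shows that group-relative normalization and re-derives the figures quoted in the abstract. The function name, the use of the population standard deviation, and the example rewards are assumptions; the paper's reward model and training loop are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards for four rollouts of one console task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# Consistency check on the reported numbers (percentages from the abstract).
gap_to_frontier = round(65.34 - 63.52, 2)     # 1.82 percentage points
implied_base_rate = round(63.52 - 20.24, 2)   # 43.28 percent
print(gap_to_frontier, implied_base_rate)
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update favors whatever distinguished the successes within the same task.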


2606.09446 2026-06-09 cs.CV New submission

Leveraging Morphology for Historical Script Metrological Analysis

利用形态学进行历史手稿计量分析

Malamatenia Vlachou Efstathiou, Raphaël Baena, Dominique Stutzmann, Mathieu Aubry

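Affiliations: LIGM, École des Ponts et Chaussées, IP Paris, CNRS, France; Institut de Recherche et d’Histoire des Textes, Paris, Île-de-France, France

AI summary: Proposes a transformer-based detection architecture with a prototype-based line reconstruction module that learns character prototypes from line-level transcriptions, enabling scalable and meaningful paleographic measurements, and shows that these measurements can differentiate graphical profiles and reveal subtle variations.

Chinese abstract (AI-generated)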

手写文本识别的进展使得历史文献的大规模转录成为可能，但仍为古文字学（历史手稿研究）提供有限的可解释视觉测量。本文的主要见解是，形态学手稿分析，特别是从行级转录中学习字符原型的能力，能够定义可扩展、有意义且稳定的古文字测量。更精确地说，我们利用基于Transformer的检测架构和基于原型的线重建模块来学习原型字符及其出现、变形和定位。我们的贡献有两方面。首先，我们引入了一种深度架构和学习方法，仅通过行级转录监督即可实现高效的字符建模，显著改进了可学习打字机基线，并实现了准确的字符边界框预测，释放了其在古文字测量中的潜力。其次，我们介绍并展示了由我们的架构实现的字符、双字母组和图形单元之间间距的自动测量的古文字相关性。为了演示，我们将巴黎手稿BnF fr. 2813（14世纪末由查理五世委托，由四名抄写员抄写）的注释扩展到160页。我们可视化这些页面上的测量结果，显示它们不仅使我们能够区分图形轮廓，还能发现和分析细微变化。这个案例研究概述了我们方法的可扩展性及其在所需训练数据方面的节俭性，因为单列文本就足以对160页中的每一页进行计算。数据和代码公开于：https://malamatenia.github.io/morphology4metrology-analysis。

English abstract

Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interpretable visual measurements for paleography, the study of historical scripts. In this paper, our main insight is that morphological script analysis, in particular the capacity to learn character prototypes from line-level transcriptions, enables the definition of scalable, meaningful, and stable paleographic measurements. More precisely, we leverage a transformer-based detection architecture together with a prototype-based line reconstruction module to learn prototypical characters and their occurrence, deformation, and positioning. Our contributions are twofold. First, we introduce a deep architecture and learning methodology that enables efficient character modeling with only line-level transcription supervision, significantly improving over the Learnable Typewriter baseline and enabling accurate character bounding box prediction, unlocking its potential for paleographic measurements. Second, we introduce and demonstrate the paleographical relevance of automatic measurements enabled by our architecture for characters, bi-grams, and spaces between graphical units. For this demonstration, we extend the annotations of the codex Paris, BnF, fr. 2813, commissioned in the late fourteenth century by Charles V and copied by four hands, to 160 pages. We visualize our measurements over these pages, showing how they enable us not only to differentiate graphical profiles, but also to discover and analyze subtle variations. This case study outlines the scalability of our approach and its frugality in terms of required training data, since a single column of text is sufficient to compute our measurements on each of the 160 pages. Data and code are publicly available at: https://malamatenia.github.io/morphology4metrology-analysis.
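
Once character bounding boxes are available, measurements such as character widths and the spacing between adjacent characters reduce to simple geometry. The sketch below assumes each box is given as `(label, x_min, x_max)` in pixels along one text line; this layout and the name `spacing_measurements` are hypothetical and do not come from the released code.

```python
def spacing_measurements(boxes):
    """Return per-character widths and gaps between adjacent boxes.

    boxes: list of (label, x_min, x_max) tuples for one text line.
    """
    ordered = sorted(boxes, key=lambda b: b[1])  # left-to-right reading order
    widths = {}
    for label, x0, x1 in ordered:
        widths.setdefault(label, []).append(x1 - x0)
    # Gap between each box and its right-hand neighbor, keyed by the bi-gram.
    gaps = [(a[0] + b[0], b[1] - a[2]) for a, b in zip(ordered, ordered[1:])]
    return widths, gaps


line = [("a", 10, 22), ("r", 40, 49), ("c", 24, 34)]  # made-up detections
widths, gaps = spacing_measurements(line)
print(widths, gaps)
```

Aggregating such widths and gaps per page or per scribe is one plausible route to the graphical profiles the abstract mentions.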


2606.09441 2026-06-09 cs.AI cs.AR New submission

SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

SIFT: 利用注意力不变性实现RAG预填充快速计算的索引选择

Rya Sanovar, Srikant Bharadwaj, Hritvik Taneja, Moinuddin Qureshi

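Affiliations: Georgia Institute of Technology; Microsoft

AI summary: Targets the redundant prefill computation and higher TTFT caused by documents recurring across RAG queries. SIFT extracts the locations of high attention scores offline and, exploiting attention invariance, computes attention only at the marked locations during prefill, improving TTFT by 1.71x while keeping accuracy within 1% of full recomputation.

Chinese abstract (AI-generated)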

检索增强生成（RAG）向LLM查询注入相关文档以提高响应质量。这种注入增加了提示长度并减慢了首个令牌生成时间（TTFT）。与标准查询不同，RAG查询具有上下文复用的独特属性，即相同文档在用户查询中重复出现。因此，为每个RAG查询完全重新计算文档会导致冗余计算并增加TTFT。先前的工作离线预计算RAG文档的KV张量，并在在线预填充期间粗略地重新计算一些令牌。然而，由于高延迟的磁盘传输，这种KV复用在现代GPU上通常比完全重新计算更慢。此外，这种粗粒度的重新计算会降低准确性。为了解决这些限制，本文提出了SIFT：利用注意力不变性实现RAG预填充快速计算的索引选择。SIFT离线处理文档，并提取每个文档中高注意力分数的细粒度位置。接下来，我们识别出以下注意力不变性见解，使我们能够在运行时利用提取的位置：（1）局部注意力不变性：文档内高注意力分数的位置不受周围文档的影响。这有助于我们预测文档自注意力中高分数出现的位置。（2）交叉注意力一致性：具有高文档内注意力的键也会吸引后续文档的交叉注意力。这有助于我们预测文档对未来文档注意力中高分数出现的位置。关键的是，SIFT不存储任何KV数据，仅以两个紧凑的位向量的形式存储高分数位置。SIFT的存储比KV张量小24000倍，避免了昂贵的磁盘传输。在预填充期间，SIFT仅计算标记位置的注意力，将TTFT提升1.71倍，同时将精度保持在完全重新计算的1%以内。

English abstract

Retrieval-Augmented Generation (RAG) injects LLM queries with relevant documents to improve response quality. This injection increases prompt length and slows time to first token (TTFT). Unlike standard queries, RAG queries have a unique property of context reuse where the same documents recur across user queries. Thus, fully recomputing documents for every RAG query performs redundant computation and increases TTFT. Prior work precomputes KV tensors of RAG documents offline and coarsely recomputes some tokens during online prefill. However, such KV reuse is often slower than full recomputation on modern GPUs due to high-latency disk transfers. Further, such coarse-grained recomputation degrades accuracy. To address these limitations, this paper proposes SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance. SIFT processes documents offline and extracts fine-grained locations of high attention scores for each document. Next, we identify the following attention invariance insights that enable us to exploit the extracted locations during runtime: (1) Local-Attention Invariance: The locations of high attention scores within a document remain invariant to surrounding documents. This helps us predict the location of high scores where the document attends to itself. (2) Cross-Attention Consistency: Keys with high intra-document attention also attract cross-attention from subsequent documents. This helps us predict the location of high scores where the document attends to future documents. Critically, SIFT stores no KV data and only stores locations of high scores in the form of two compact bit vectors. SIFT's storage is up to 24,000x smaller than KV tensors, obviating costly disk transfers. During prefill, SIFT computes the attention only for the marked locations and improves TTFT by 1.71x while holding accuracy within 1% of full recompute.
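
A minimal sketch of computing attention only at marked locations, assuming a single per-key bit mask for one query. How SIFT actually lays out its two bit vectors is not described in the abstract, so `marked`, `sparse_attention_row`, and the toy tensors below are hypothetical.

```python
import math


def sparse_attention_row(q, keys, values, marked):
    """Softmax attention for one query, restricted to marked key positions."""
    idx = [i for i, bit in enumerate(marked) if bit]
    dim = len(values[0])
    if not idx:                      # nothing marked: return a zero vector
        return [0.0] * dim
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, keys[i])) / scale for i in idx]
    top = max(scores)                # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
marked = [1, 1, 0]                   # one bit per key position
out = sparse_attention_row([0.0, 0.0], keys, values, marked)
print(out)
```

The unmarked third key is never scored, so its value cannot influence the output. Storing one bit per position instead of full key and value vectors is what makes an index of this kind far smaller than a KV cache.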


2606.09435 2026-06-09 cs.CL New submission

MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models

MUDIDI：一种基于语言模型的多语言词典数字化两阶段框架

David Setiawan, Temuulen Khishigsuren, Milind Agarwal, Pagnarith Pit, Aso Mahmudi, Ekaterina Vylomova

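Affiliations: School of Computing and Information Systems, The University of Melbourne; Melbourne School of Psychological Sciences, The University of Melbourne; LILT

AI summary: Proposes MUDIDI, a two-stage framework for digitizing multilingual dictionaries with language models. LLMs outperform OCR systems and vision-language models on character recognition, markup preservation, and entry segmentation for most writing systems, and the authors release an annotated dataset drawn from 30 public-domain dictionaries.

Comments: 9 pages, preprint, submitted to EMNLP 2026

Chinese abstract (AI-generated)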

多语言词典是低资源和濒危语言最有价值的文献资源之一，但许多仍仅以扫描件形式存在。几十年来，由于语言特有的文字、包含缩写和交叉引用条目的复杂多栏布局，其数字化并转换为机器可读格式几乎不可能。最近的视觉语言模型提供了有希望的解决方案，但尚不清楚它们在保留字符、标记和处理词典结构方面的表现。我们提出MUDIDI，一个用于多语言词典数字化的两阶段框架。第一阶段评估字符识别和标记保留的质量；第二阶段专注于词典条目分割，随后映射到机器可读的词典模式——SIL的多词典格式化器。我们还发布了一个数据集，包含从30本公共领域词典中收集的人工标注的词典条目，这些词典涵盖多种文字系统、语系和格式。我们在该数据集上对OCR系统、通用大语言模型（LLM）和视觉语言模型（VLM）进行了基准测试，展示了LLM在大多数文字系统和语言的两个阶段中的优越性能，并为更具挑战性的场景提供了改进结果的实用指南。最后，我们表明向LLM补充额外信息（如词典引言）可以提高数字化词典的质量。Github: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/

English abstract

Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format was nearly impossible due to language-specific scripts and complex multi-column layouts full of entries with abbreviations and cross-references. Recent vision-language models offer a promising solution, but it is unclear how well they preserve characters and markup and process lexicographic structure. We introduce MUDIDI, a two-stage framework for multilingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation with subsequent mapping into a machine-readable lexicographic schema, SIL's Multi-Dictionary Formatter. We also release a dataset that consists of human-annotated lexicographic entries collected from 30 public-domain dictionaries featuring diverse writing systems, language families, and formats. We benchmark OCR systems, general-purpose Large Language Models (LLMs), and Vision Language Models (VLMs) on the dataset, demonstrating superior performance of LLMs across most writing systems and languages in both stages, and provide practical guidelines on improving the results for more challenging scenarios. Finally, we show that supplying additional information, such as the dictionary's introduction, to the LLMs can improve the quality of the digitized dictionary. GitHub: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/
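
To make the Stage Two target concrete, the sketch below serializes one segmented entry into backslash-coded MDF-style fields, using the common markers `\lx` (lexeme), `\ps` (part of speech), `\ge` (English gloss), and `\de` (English definition). The input dictionary layout, the function name `to_mdf`, and the sample entry are assumptions; the pipeline's actual mapping is not described in the abstract.

```python
def to_mdf(entry):
    """Serialize a segmented dictionary entry as MDF-style marker lines."""
    marker_for = [
        ("headword", "lx"),
        ("pos", "ps"),
        ("gloss", "ge"),
        ("definition", "de"),
    ]
    lines = [f"\\{marker} {entry[field]}"
             for field, marker in marker_for
             if entry.get(field)]          # skip fields the entry lacks
    return "\n".join(lines)


# Hypothetical entry as it might come out of entry segmentation.
entry = {"headword": "rumah", "pos": "n", "gloss": "house"}
record = to_mdf(entry)
print(record)
```

Fields missing from the scan are simply omitted rather than filled in, which keeps the record faithful to the source page.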


2606.09434 2026-06-09 cs.LG New submission

Operator learning for solving Fokker-Planck equations with various initial conditions

算子学习求解不同初始条件下的福克-普朗克方程

Li Zeng, Xiaoliang Wan, Yaobin Wang, Fabio Nobile, Tao Zhou

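Affiliations: Fuzhou University; Louisiana State University; Beijing Normal-Hong Kong Baptist University; École Polytechnique Fédérale de Lausanne; Chinese Academy of Sciences

AI summary: Proposes a physics-informed neural network framework based on conditional normalizing flows that uses the Chapman-Kolmogorov equation and a base distribution taken from a linearized SDE to approximate the FPE solution operator across initial conditions, with a time-weighted loss to mitigate small-time instabilities.

Chinese abstract (AI-generated)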

福克-普朗克方程（FPE）在描述由随机动力学支配的系统概率密度函数（PDF）的时间演化中起着关键作用。本文提出了一种基于条件归一化流的物理信息神经网络（PINN）框架，用于高效逼近整个初始条件范围内FPE的解算子。利用马尔可夫随机过程的Chapman-Kolmogorov方程，将问题重新表述为逼近从任意点狄拉克质量开始的初始时刻的转移PDF。采用关联线性化随机微分方程（SDE）的PDF作为归一化流的基分布，该分布提供了目标PDF的良好近似，特别是在小时间尺度下，从而避免了与狄拉克δ初始分布相关的映射奇异性。此外，引入时间加权损失函数以减轻小时间尺度下出现的数值不稳定性，在时间推进过程中实现因果性与训练难度之间的平衡。通过多种数值实验展示了所提方法的有效性和鲁棒性。

English abstract

The Fokker-Planck equation (FPE) plays a pivotal role in describing the time evolution of probability density functions (PDFs) for systems governed by stochastic dynamics. In this work, we propose a conditional normalizing flow-based physics-informed neural network (PINN) framework for efficiently approximating the solution operator of the FPE for a whole range of initial conditions. Leveraging the Chapman-Kolmogorov equation for Markovian stochastic processes, the problem is reformulated into approximating a transition PDF starting at initial time from a Dirac mass centered at an arbitrary point. The PDF of an associated linearized stochastic differential equation (SDE) is employed as the base distribution for the normalizing flow, providing a good approximation of the target PDF, especially for small times, and thereby avoiding the singularity of the map associated with the Dirac delta initial distribution. Furthermore, a time-weighted loss function is introduced to mitigate numerical instabilities arising at small times, achieving a balance between causality and training difficulty as time progresses. A variety of numerical experiments are presented to illustrate the effectiveness and robustness of the proposed method.
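
The reformulation described above can be written out as follows. The notation (drift f, diffusion coefficient σ, D = σσᵀ, transition density G) and the restriction to time-independent coefficients are assumptions made for illustration and need not match the paper's.

```latex
% Assumed SDE with drift f and diffusion \sigma; D = \sigma\sigma^{\top}.
\begin{align}
  \mathrm{d}X_t &= f(X_t)\,\mathrm{d}t + \sigma(X_t)\,\mathrm{d}W_t, \\
  \partial_t p(x,t) &= -\nabla\cdot\big(f(x)\,p(x,t)\big)
    + \tfrac{1}{2}\sum_{i,j}\partial_{x_i}\partial_{x_j}\big(D_{ij}(x)\,p(x,t)\big),
    \qquad p(x,0) = p_0(x), \\
  p(x,t) &= \int G(x,t \mid y,0)\,p_0(y)\,\mathrm{d}y,
    \qquad G(x,0 \mid y,0) = \delta(x-y).
\end{align}
```

Because the last line holds for every initial density p_0, learning the single transition density G, conditioned on the starting point y, yields the solution for the whole range of initial conditions.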


2606.09433 2026-06-09 cs.AI New submission

Bayesian Selective Latent Inference for Wastewater-First Influenza Monitoring

贝叶斯选择性潜在推断用于污水优先的流感监测

Yixuan Zhang, Yang Song, Hao Wang, Samir Bhatt, Hengguan Huang

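Affiliations: University of Copenhagen; Rutgers University; Imperial College London

AI summary: Proposes Bayesian Selective Latent Inference (BSLI), which combines a posterior over latent burden, answerability certification, and a cost-calibrated Bellman policy to optimize query and abstention decisions in wastewater-first influenza monitoring.

Comments: Corresponding authors: Hengguan Huang and Samir Bhatt. Hengguan Huang is the lead corresponding author

Chinese abstract (AI-generated)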

污水流感监测可以在临床报告之前揭示社区传播，但仅凭污水并不能完全识别人类负担。现有的污水模型假设固定的证据集，而通用的证据获取方法将官方监测流视为可互换的昂贵特征。我们将污水优先的流感监测视为一个选择性决策问题：从强制性的污水证据开始，系统必须决定污水是否足够，接下来查询哪个延迟的官方流，以及在源模糊下何时弃权是唯一科学上可辩护的行动。我们提出了贝叶斯选择性潜在推断（BSLI），这是一种原则性的贝叶斯方法，它维护潜在负担和可识别性的后验分布，通过明确的科学门认证可回答性，并使用精确的成本校准Bellman策略优化查询-停止决策。我们证明了关键的变分、可回答性、Bellman最优性和一维成本校准性质。在一个包含5,933个预测事件和3,102个源模糊事件的固定公共数据基准上，BSLI改善了匹配预算的成本-性能前沿，同时在源模糊下保持保守的弃权。

English abstract

Wastewater influenza surveillance can reveal community circulation before clinical reporting, but wastewater alone is not a fully identifiable proxy for human burden. Existing wastewater models assume a fixed evidence set, while generic evidence-acquisition methods treat official surveillance streams as interchangeable costly features. We cast wastewater-first influenza monitoring as a selective decision problem: starting from mandatory wastewater evidence, the system must decide whether wastewater is sufficient, which delayed official stream to query next, and when abstention is the only scientifically defensible action under source ambiguity. We propose Bayesian Selective Latent Inference (BSLI), a principled Bayesian method that maintains a posterior over latent burden and identifiability, certifies answerability through explicit scientific gates, and optimizes query-stop decisions with an exact cost-calibrated Bellman policy. We prove the key variational, answerability, Bellman-optimality, and one-dimensional cost-calibration properties. On a fixed public-data benchmark with 5,933 forecasting episodes and 3,102 source-ambiguity episodes, BSLI improves the matched-budget cost-performance frontier while preserving conservative abstention under source ambiguity.


2606.09432 2026-06-09 cs.LG New submission

Graph Mamba Operator: A Latent Simulator for Interacting Particle Systems

Graph Mamba Operator: 一种用于相互作用粒子系统的潜在模拟器

Karn Tiwari, Niladri Dutta, N M Anoop Krishnan, Prathosh A P

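Affiliations: Indian Institute of Science, Bangalore; Indian Institute of Technology, Delhi

AI summary: Proposes the Graph Mamba Operator (GraMO), which couples state-space models and graph-based interaction learning in a single recurrence to jointly model long-range spatial and temporal dependencies, achieving the lowest error on N-body, motion capture, and robotics datasets.

Comments: Under Submission

Chinese abstract (AI-generated)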

建模相互作用的动力系统需要捕捉空间相互作用以及长期时间依赖。图神经网络（GNNs）提供了一种自然的表示，但通常依赖于自回归滚动，并分别处理空间和时间动态，导致长期预测中误差累积。现有方法还侧重于局部交互和短时间上下文，限制了它们捕捉多跳依赖和全局结构的能力。我们引入了图Mamba算子（GraMO），一种潜在空间模拟器，将状态空间模型与基于图的交互学习集成在一起。与先前将节点排序或分阶段应用空间和时间更新的工作不同，GraMO在单个循环中耦合了基于图的交互和时间状态更新。该更新在潜在状态上是线性的，具有跨状态自适应变化的输入相关系数。我们在N体系统、运动捕捉和机器人数据集上评估了GraMO，在基准测试中实现了最低误差，并在长期预测中取得了最大增益。

English abstract

Modeling interacting dynamical systems requires capturing spatial interactions alongside long-range temporal dependencies. Graph neural networks (GNNs) provide a natural representation but typically rely on autoregressive rollouts and treat spatial and temporal dynamics separately, leading to error accumulation over long horizons. Existing approaches also focus on local interactions and short temporal contexts, limiting their ability to capture multi-hop dependencies and global structure. We introduce the Graph Mamba Operator (GraMO), a latent-space simulator that integrates state-space models with graph-based interaction learning. In contrast to prior work that sequences nodes or applies spatial and temporal updates in separate stages, GraMO couples graph-based interactions and temporal state updates within a single recurrence. The update is linear in the latent state, with input-dependent coefficients that adapt across regimes. We evaluate GraMO on N-body systems, motion capture, and robotics datasets, achieving the lowest error across benchmarks and the largest gains in long-horizon prediction.
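
The idea of a single recurrence that is linear in the latent state, with input-dependent coefficients and graph interactions, can be sketched with scalar node states. The gate `sigmoid(x_i)`, the mean-over-neighbors message, and the name `coupled_step` are illustrative assumptions and not GraMO's actual parameterization.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coupled_step(h, x, neighbors):
    """One update h' = a(x) * h + (1 - a(x)) * mean(neighbor h) + x.

    For a fixed input x the map is affine in h: the coefficients depend
    on the input, not on the state, as in selective state-space models.
    """
    new_h = []
    for i, (h_i, x_i) in enumerate(zip(h, x)):
        a = sigmoid(x_i)                       # input-dependent retention gate
        nbrs = neighbors[i]
        msg = sum(h[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        new_h.append(a * h_i + (1.0 - a) * msg + x_i)
    return new_h


# Three particles on a chain 0 - 1 - 2; only particle 0 starts excited.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h1 = coupled_step([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], neighbors)
h2 = coupled_step(h1, [0.0, 0.0, 0.0], neighbors)
print(h1, h2)
```

Because the spatial message and the temporal carry-over happen in the same step, information from particle 0 reaches particle 2 after two updates without a separate message-passing stage.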


2606.09430 2026-06-09 cs.LG cs.AI New submission

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

LargeMonitor: 通过大型预训练模型监控在线无任务持续学习

Mingqi Yuan, Xiaoquan Sun, Shihao Luo, Jiayu Chen

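Affiliations: HKU; Qicore Tech

AI summary: Proposes LargeMonitor, a framework that uses large pretrained models (LVMs and LMMs) to decouple detection from diagnosis, providing zero-shot drift detection and semantic diagnosis of drift causes in task-free continual learning and improving the performance of existing algorithms.

Chinese abstract (AI-generated)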

在线无任务持续学习（TFCL）要求智能体在严格单次遍历约束下，从无界、非平稳的数据流中顺序积累知识，且无显式任务标识。现有在线TFCL范式主要依赖于参数高效的提示调整或由训练耦合优化动态（如经验损失波动或潜在距离演变）驱动的动态结构扩展。因此，这些训练耦合求解器对分布漂移的结构起源不可知，机械地在根本不同的流变化上强制执行固定策略。为解决这一问题，我们提出LargeMonitor，一个利用大型预训练基础模型自主编排无任务连续适应的框架。具体而言，LargeMonitor引入一个解耦的检测模块，利用大型视觉模型（LVM）的冻结、稳定表示空间，实现鲁棒的零样本漂移检测，无需训练依赖的干扰或脆弱的阈值调整。在确认漂移后，该框架激活一个由大型多模态模型（LMM）驱动的上下文感知诊断模块，以解释流变化的精确语义病因（例如，新类出现 vs. 环境域偏移）。这种双阶段能力使连续学习者能够动态部署自适应且特定于漂移的优化策略。在多个TFCL设置和基准上的大量实验表明，LargeMonitor实现了对复杂数据流的精确、鲁棒检测和诊断，同时持续提升现有在线TFCL算法的性能。

English abstract

Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.


2606.09428 2026-06-09 cs.CL New submission

Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios

引导我出去：危机场景下评估VLM操作员通信的框架

Giacomo Gonella, Stefano Menini, Marco Guerini

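Affiliations: Fondazione Bruno Kessler; University of Trento

AI summary: Proposes a benchmarking framework that evaluates vision-language models guiding civilians through simulated evacuations across communication strategies (narrowcast vs. broadcast), environment representations (visual vs. graph-based), and threat behaviors (static vs. moving). Narrowcast lowers fail rates, the visual representation drives performance, and moving threats raise fail rates.

Chinese abstract (AI-generated)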

有效的危机响应需要空间定位的通信，将平民的语言指导与物理环境联系起来，考虑结构瓶颈、不断变化的威胁和代理特定背景。然而，当前危机通信中的NLP研究主要局限于静态、纯文本分类设置，忽视了AI操作员在动态、具身场景中的关键通信作用。我们通过一个新的基准框架来解决这一差距，该框架用于评估视觉语言模型（VLM）在模拟疏散中引导平民代理的任务。我们测试了两种通信策略（窄播与广播）、两种环境表示（视觉与基于图）和两种威胁行为（静态与移动），跨越九张不同结构复杂度的地图。我们的结果表明，与广播相比，窄播在所有难度级别上持续降低平民失败率。指导质量很大程度上取决于VLM操作员如何表示世界：视觉模态驱动性能，而添加邻接图则依赖于模型且通常有害。移动威胁在所有条件下提高失败率，因为通信必须随时间持续适应。这些发现共同表明，将VLM作为AI操作员部署在疏散场景中仍然是一个非平凡挑战，其中通信策略和输入表示的选择可以直接决定干预的成功或失败。

English abstract

Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limited to static, text-only classification settings, overlooking the critical communicative role of AI operators in dynamic, embodied scenarios. We address this gap with a novel benchmarking framework for evaluating Vision-Language Models (VLMs) tasked with guiding civilian agents through simulated evacuations. We test two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) across nine maps of varying structural complexity. Our results show that Narrowcast consistently reduces civilian Fail rates compared to Broadcast across all difficulty levels. Guidance quality depends heavily on how the VLM operator represents the world: the visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise Fail rates across all conditions as communication must continuously adapt over time. Together, these findings show that deploying VLMs as AI operators in evacuation scenarios remains a non-trivial challenge, where the choice of communication strategy and input representation can directly determine the success or failure of the intervention.


2606.09424 2026-06-09 cs.CL New submission

Toward Signing Activity Projection in Sign Language Interaction

面向手语交互中的手语活动预测

Takao Obi, Wang Yusong, Koji Inoue, Kotaro Funakoshi

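Affiliations: Institute of Science Tokyo; Kyoto University

AI summary: Explores transferring the Voice Activity Projection (VAP) framework to dyadic sign language interaction, deriving signing activity streams from the Public DGS Corpus and predicting turn-taking from pose-derived features. SHIFT/HOLD prediction looks promising, whereas SHIFT prediction remains difficult.

Chinese abstract (AI-generated)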

社交机器人不仅需要与以语音为中心的系统所假设的用户进行稳健交互，还需要与依赖不同模态（例如手语）进行交流的多样化用户进行交互。一个重要的能力差距是与手语用户进行预测性轮换。尽管语音活动预测（VAP）已成功用于模拟口语交互中的未来语音活动，但该框架是否适用于手语交互仍不清楚。本文提出了将VAP架构适应双人手语交互的初步迁移研究。使用公共DGS语料库的交互录音，我们从词汇手语标注中推导出二进制手语活动流，并制定轮换预测的代理任务。模型使用每个手语者提取的基于姿态的手部、眼部区域和嘴部区域特征。结果表明，SHIFT/HOLD预测是有前景的，尤其是利用手部线索，而SHIFT预测仍然困难。这些发现为将预测性轮换模型从口语交互迁移到手语交互的潜力和当前局限性提供了初步证据。手语交互的预测建模仍然需要超越语音衍生类别的手语特定事件定义。

English abstract

Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been successfully used to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial transfer study of adapting a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.
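
A minimal sketch of the preprocessing described above: turning time-stamped sign annotations into per-frame binary activity streams, then labeling each gap in activity as HOLD (the same signer continues) or SHIFT (the other signer takes over). The frame rate, the interval format, and the names `activity_stream` and `label_gaps` are assumptions; the paper's exact event definitions are not given in the abstract.

```python
def activity_stream(intervals, n_frames, fps):
    """Binary per-frame stream: 1 while any annotated sign is active."""
    stream = [0] * n_frames
    for start_s, end_s in intervals:
        first = max(0, round(start_s * fps))
        last = min(n_frames, round(end_s * fps))
        for t in range(first, last):
            stream[t] = 1
    return stream


def label_gaps(a, b):
    """Label each mutual-inactivity gap by who signs before and after it."""
    labels, prev, in_gap = [], None, False
    for fa, fb in zip(a, b):
        if fa and not fb:
            cur = "A"
        elif fb and not fa:
            cur = "B"
        else:
            cur = None                      # mutual inactivity or overlap
        if cur is None:
            if not fa and not fb and prev is not None:
                in_gap = True               # a gap opened after someone signed
            continue
        if in_gap:
            labels.append("HOLD" if cur == prev else "SHIFT")
        prev, in_gap = cur, False
    return labels


fps, n_frames = 10, 20                      # hypothetical 2-second clip
signer_a = activity_stream([(0.0, 0.5), (1.5, 2.0)], n_frames, fps)
signer_b = activity_stream([(0.8, 1.2)], n_frames, fps)
events = label_gaps(signer_a, signer_b)
print(events)
```

Overlapping signing is treated here like a pause in the bookkeeping, which is a simplification; sign-specific event definitions of the kind the authors call for would refine exactly this step.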


2606.09416 2026-06-09 cs.RO cs.AI cs.SE New submission

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

面向物理AI的驾驭工程：机器人中间件即驾驭层

Sanghoon Lee, Jiyeong Chae, Kyung-Joon Park

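Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST)

AI summary: Argues that robot middleware is the harness layer for Physical AI: it must mediate control, computing, and communication simultaneously and add three missing enforcement functions (Projection, Isolation, and Transfer), sketched as a ROS 2 Harness Profile.

Comments: 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026)

Chinese abstract (AI-generated)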

在物理AI时代，机器人中间件面临新的角色。学习策略、规划器和视觉-语言-动作（VLA）模型现在作为控制路径上的因果参与者进入已部署的机器人，但将它们与定时、调度和网络集成的层尚未被命名。最近的语言智能体工作将此层命名为驾驭层，即中介工具、管理状态、约束资源和记录执行的外部系统。机器人社区尚未采用这一框架，我们提出机器人中间件就是那个驾驭层。物理AI驾驭层与软件驾驭层的区别在于其干预位置。软件驾驭层在工具调用边界进行中介。物理AI驾驭层必须同时干预控制、计算和通信，因为学习策略的输出跨越所有三者：其命令改变轨迹，其推理时间改变调度，其有效载荷改变带宽。机器人中间件是机器人栈中最低的层，具有对所有三者的中介抽象，因此最适合组合它们的强制实施。它已经提供了驾驭层所需的大部分功能，但缺乏针对AI模型的强制实施。我们将这种缺失的强制实施命名为三个功能：投影在输出时门控每个输出，隔离约束模型的执行和传输时隙，转移在检查失败时回退到经过验证的基线。每个功能目前以手工构建的应用程序代码形式出现在已部署的机器人系统中，构建在机器人中间件已提供的表面上。机器人中间件应该将它们作为组合所有三者的层，而不是作为最佳的单轴强制器。我们将其勾勒为ROS 2驾驭配置文件，这是一个部署工件，携带AI模型声明的输出区域、推理预算和运行机制，而中间件在ROS 2、DDS和Zenoh上强制实施它们。

English abstract

Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.
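
The three enforcement functions can be sketched as a thin wrapper around a learned policy. Everything below is hypothetical: `HarnessProfile`, `harnessed_step`, the box-shaped output region, and the after-the-fact timing check are illustrative stand-ins, not a ROS 2 API and not the authors' design. A real Isolation mechanism would bound execution through scheduling rather than measure it afterwards.

```python
import time
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HarnessProfile:
    """Declared limits for one AI model (assumed fields)."""
    lower: Sequence[float]    # output region, per command dimension
    upper: Sequence[float]
    budget_s: float           # inference time budget in seconds


def harnessed_step(policy, baseline, obs, profile, clock=time.perf_counter):
    """Run the policy, gate its output, and fall back if a check fails."""
    start = clock()
    command = list(policy(obs))
    elapsed = clock() - start
    in_budget = elapsed <= profile.budget_s                  # Isolation (checked post hoc)
    in_region = all(lo <= c <= hi                            # Projection gate
                    for c, lo, hi in zip(command, profile.lower, profile.upper))
    if in_budget and in_region:
        return command, "policy"
    return list(baseline(obs)), "baseline"                   # Transfer


profile = HarnessProfile(lower=[-1.0], upper=[1.0], budget_s=1.0)
stop = lambda obs: [0.0]                                     # verified fallback command
accepted = harnessed_step(lambda obs: [0.2], stop, None, profile)
rejected = harnessed_step(lambda obs: [5.0], stop, None, profile)
print(accepted, rejected)
```

Keeping the limits in a declarative profile, separate from the policy code, mirrors the abstract's idea of a deployment artifact that the middleware enforces on the model's behalf.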
