arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.08068 2026-06-09 cs.LG New submission

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

Yi Xie, Zhanke Zhou, Chentao Cao, Bo Liu, Bo Han

Affiliations: University of Arizona; Hong Kong Baptist University

AI summary: Proposes the DICE framework, which addresses instability in multi-agent LLM coordination through entropy-regularized equilibrium selection (HQRE), with linearly convergent updates and bounded Bayesian regret. Across 11 benchmarks it improves accuracy-cost trade-offs, and on reasoning and planning tasks DICE-PC gains 4.3 percentage points on average and DICE-FT gains 8.5.

Abstract

Multi-agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best-of-N sampling. We argue that a core source of this instability is ill-posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected. We formalize a broad class of such systems as discounted incomplete-information Markov games and show that two common pathologies, oscillation between competing conventions and drift across them, can both induce unstable learning and linear Bayesian regret. To obtain a well-posed target, we introduce the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy-regularized equilibrium concept with agent- and state-dependent temperatures. Under a monotonicity condition, HQRE is unique, admits linearly convergent mirror updates, and yields bounded Bayesian regret; the same condition yields rollout-measurable stability diagnostics. We instantiate this objective in two algorithms: DICE-PC, which coordinates frozen models through prompt-control actions, and DICE-FT, which performs parameter-efficient mirror fine-tuning. Across eleven benchmarks in four domains, DICE improves accuracy-cost trade-offs over strong within-class baselines; on reasoning and planning tasks, DICE-PC improves by 4.3 percentage points on average and DICE-FT by 8.5 points.
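
To make the entropy-regularized target concrete, here is a minimal sketch. It is not the authors' DICE implementation: it is a two-player common-payoff matrix game in which each agent plays a logit (quantal) response with its own temperature, iterated with a damped update in logit space. The payoff matrix, the temperatures, the step size `eta`, and the function names are illustrative assumptions, and the state-dependent temperatures and Markov-game setting from the abstract are omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def solve_hqre_toy(payoff, temps, init, eta=0.5, steps=300):
    """Entropic mirror-style iteration for a two-player common-payoff game.

    payoff[a][b] is the shared payoff, temps = (tau_row, tau_col) are the
    assumed per-agent temperatures, init = (row_strategy, col_strategy).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_logits = [math.log(p) for p in init[0]]
    col_logits = [math.log(p) for p in init[1]]
    for _ in range(steps):
        row, col = softmax(row_logits), softmax(col_logits)
        # Expected payoff of each pure action against the other agent's mix.
        row_u = [sum(payoff[a][b] * col[b] for b in range(n_cols)) for a in range(n_rows)]
        col_u = [sum(payoff[a][b] * row[a] for a in range(n_rows)) for b in range(n_cols)]
        # Move logits toward payoff / temperature; the fixed point is the logit response.
        row_logits = [(1 - eta) * z + eta * u / temps[0] for z, u in zip(row_logits, row_u)]
        col_logits = [(1 - eta) * z + eta * u / temps[1] for z, u in zip(col_logits, col_u)]
    return softmax(row_logits), softmax(col_logits)

PAYOFF = [[2.0, 0.0], [0.0, 1.0]]  # two competing conventions, one with higher payoff
TEMPS = (1.0, 1.5)                 # assumed heterogeneous temperatures

# Start near each convention in turn and compare where the iteration ends up.
from_a = solve_hqre_toy(PAYOFF, TEMPS, ([0.9, 0.1], [0.9, 0.1]))
from_b = solve_hqre_toy(PAYOFF, TEMPS, ([0.1, 0.9], [0.1, 0.9]))
print(from_a)
print(from_b)
```

At these temperatures both starting points end at the same mixed strategies, which is the kind of uniqueness the abstract attributes to HQRE under its monotonicity condition. With much lower temperatures this toy game can have several fixed points again, so the outcome would depend on the starting convention.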

2606.08067 2026-06-09 cs.LG New submission

Beyond Homophily: Towards Generalized Graph Reconstruction Attack and Defense

Zhanke Zhou, Bo Han, Xuan Li, Jiangchao Yao, Sanmi Koyejo, Michael K. Ng

Affiliations: Hong Kong Baptist University; Shanghai Jiao Tong University; Stanford University

AI summary: Addresses the leakage of training-graph adjacency by graph neural networks, proposing the Markov-chain-approximation-based attack MC-GRA (+) and defense MC-GPB (+), which respectively improve reconstruction fidelity and reduce reconstruction success on homophilic and heterophilic graph benchmarks.

Abstract

Graph neural networks (GNNs) are widely deployed on relational data, yet they can leak sensitive or proprietary information about the training graph adjacency, e.g., social ties, transactions, and interactions. This work studies graph reconstruction attacks (GRA), a form of model inversion that reconstructs the training adjacency from a trained GNN, given different levels of attacker-side information. We first provide a systematic characterization of when and why adjacency becomes recoverable through features, labels, embeddings, and predictions, with leakage modulated by graph homophily, heterophily, and the model's inductive bias. Motivated by these findings, we view GNN inference through a Markov chain approximation lens, treating the layered forward computation as a chain of topology-dependent representations. Building on this view, we develop complementary attack and defense methods. On the attack side, we propose MC-GRA (+), which reconstructs the adjacency by optimizing a surrogate adjacency whose GNN-induced representations align with those of the target model at each layer. On the defense side, we propose MC-GPB (+), which suppresses adjacency-dependent information throughout the representation chain while aiming to preserve classification accuracy under a privacy-utility trade-off. Experiments across homophilic/heterophilic graph benchmarks and GNNs show that our attacks improve reconstruction fidelity over prior methods, while our defenses reduce reconstruction success with only minor accuracy loss.

2606.08064 2026-06-09 cs.RO New submission

Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

Zihao Wang, Shijie Peng, Kerui Wu, Yu Huang, Ruiqi Xue, Dong Liu, Tian Xu, Lei Yuan, Yang Yu

Affiliations: National Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Beijing Academy of Artificial Intelligence (BAAI)

AI summary: Proposes Marope, a hierarchical reinforcement learning framework for cooperative long rope skipping with multiple humanoid robots. Decentralized rope-turning policies are trained with multi-agent reinforcement learning, an upper-level scheduling policy coordinates their execution, and diverse jumping policies are incorporated to improve generalization. It outperforms baselines in simulation and real-world experiments.

Abstract

Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single-agent and interaction-free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi-agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower-level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real-world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.

2606.08059 2026-06-09 cs.RO New submission

Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

Zifan Wang, Yizhao Li, Teli Ma, Qiang Zhang, Yudong Fan, Hao Xu, Shuo Yang, Junwei Liang

Affiliations: Mondo Robotics; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Artificial General Intelligence Institute, University of Science and Technology of China

AI summary: Proposes the Perceptive Behavior Foundation Model (Perceptive BFM), which adapts human motion priors to the robot's local terrain through terrain-conformal reference synthesis (TCRS), enabling terrain-aware humanoid control.

Abstract

Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce the Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop terrain-conformal reference synthesis (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.

2606.08057 2026-06-09 cs.RO cs.AI New submission

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

Yichen Niu, Haoran Lv, Xinrui Zhang, Xueyao Wan, Shiyu Gao, Ying Ai, Hui Xu, Yongqi Hu, Hengyi Zhang, Yang Xie, Zhaxizhuoma, Yue Zhao, Zhenshan Bing, Yan Ding, Jianxing Liu

Affiliations: School of Astronautics, Harbin Institute of Technology; Lumos Robotic; Suzhou Research Institute, Harbin Institute of Technology; Shanghai Jiao Tong University; Shanghai AI Lab; Nanjing University; Xi’an Jiaotong-Liverpool University; Fudan University

AI summary: Proposes EgoAERO, a framework that requires no object assets. From a single egocentric RGB-D video it reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego-motion compensation, and adaptive contact optimization, then converts them into robot policies with two-stage residual learning, enabling dexterous manipulation from a single demonstration.

Abstract

Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.

2606.08056 2026-06-09 cs.CL cs.AI New submission

What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing

Oline Ranum, Simon Hadfield, Richard Bowden

Affiliations: Centre for Vision, Speech and Signal Processing, University of Surrey

AI summary: Targets spatial indexing, which makes up 10-15% of signing content yet is largely overlooked, with a framework that decomposes the problem into index detection and discourse entity linking. It establishes a baseline for index-aware sign language modeling and serves as an auxiliary expert that augments frozen sign language recognition models.

Abstract

Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.

2606.08051 2026-06-09 cs.AI cs.LG New submission

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

Donghao Huang, Tomas Drietomsky, Benjamin Barrett, Zhaoxia Wang

Affiliations: Singapore Management University; Mastercard; A*STAR Centre for Frontier AI Research

AI summary: Motivated by the production need to extract structured merchant information from noisy bank strings in financial transactions, systematically evaluates 24 model variants. Qwen 3.5 4B, with roughly half the parameters, is only 0.35 F1 points below the 8B baseline; the 0.8B model matches models 2.5-4x larger; and chain-of-thought fine-tuning brings limited gains.

Comments: 9 pages, 5 figures, 5 tables. Submitted to the IEEE International Conference on Data Mining (ICDM) 2026

Abstract

Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task, but deploying 8-billion-parameter models imposes prohibitive memory, latency, and cost constraints. To identify more efficient alternatives, we conduct a deployment-focused study of 24 model variants spanning four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B, systematically evaluating accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. Our findings show that: (1) reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, only 0.20 points below the rank-32 baseline; (2) Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline while using roughly half the parameters; (3) the 0.8B Qwen 3.5 model achieves 94.75% F1, matching models 2.5-4x larger and offering an attractive latency-accuracy trade-off; (4) chain-of-thought fine-tuning generally improves F1 by 0.3-1.8 points across most models, although Qwen 3.5 4B performs best with direct JSON-only prompting; and (5) Qwen 3.5 Think and Nothink training templates produce nearly identical results (F1 differences <0.004), indicating that explicit reasoning supervision is unnecessary for structured extraction tasks. We further deploy all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observe that benchmark performance transfers reliably to production, with an average F1 change of only 0.8 points. Aya 3.35B, based on the Cohere2 architecture, is the sole exception, exhibiting a 3-5 point decline under serving conditions. Based on these results, we provide deployment recommendations across accuracy and latency requirements, ...
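
The rank comparison in finding (1) can be put in perspective with a rough count of adapter parameters. The sketch below uses the standard LoRA factorization, in which the update to a d_out x d_in weight matrix is expressed as the product of a d_out x r matrix and an r x d_in matrix. The 4096 x 4096 projection size is an illustrative assumption, not a figure from the abstract; the F1 scores are the ones quoted above.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)

D = 4096  # assumed square projection size, for illustration only

rank8 = lora_params(D, D, 8)
rank32 = lora_params(D, D, 32)
full = D * D  # parameters in the dense matrix the adapter sits beside

# F1 gaps quoted in the abstract, recomputed from the reported scores.
gap_rank8 = round(96.95 - 96.75, 2)
gap_qwen4b = round(96.95 - 96.60, 2)

print(rank8, rank32, full)
print(gap_rank8, gap_qwen4b)
```

Under this assumption the rank-8 adapter trains a quarter of the parameters of the rank-32 adapter for a 0.20-point F1 difference, and both are a small fraction of the dense matrix.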

2606.08049 2026-06-09 cs.AI cs.MA New submission

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

Amine El Hattami, Nicolas Chapados, Christopher Pal

Affiliations: ServiceNow Research; Mila; Polytechnique Montréal; Canada CIFAR AI Chair

AI summary: Proposes the SKILL.nb framework, which manages the lifecycle reliability of agent workflows through selective formalization and gate-conditioned execution, reaching 53.7% single-round success on WebArena-Verified and retaining 91.7% of initially successful tasks across re-executions.

Abstract

AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
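
The gate-conditioned execution described above can be illustrated with a small sketch. This is not the SKILL.nb notebook format or API: the `Step` class, `run_step`, `guided_fallback`, and the `button_id` gate are all hypothetical names chosen for the example. A step runs its code only when every gate validates against the current state, and otherwise falls back to the natural-language guidance.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    """One workflow step: guidance text, executable code, and validation gates."""
    guidance: str
    code: Callable[[dict], dict]
    gates: List[Callable[[dict], bool]] = field(default_factory=list)

def run_step(step, state, fallback):
    """Run the step's code if every gate validates; otherwise fall back locally."""
    if all(gate(state) for gate in step.gates):
        return step.code(state), "code"
    return fallback(step.guidance, state), "fallback"

def guided_fallback(guidance, state):
    # Stand-in for an agent acting on the natural-language guidance.
    return {**state, "note": f"handled by guidance: {guidance}"}

click_merge = Step(
    guidance="Open the merge request and press the merge button.",
    code=lambda s: {**s, "merged": True},
    gates=[lambda s: s.get("button_id") == "merge-btn"],  # hypothetical selector check
)

# Same step, once against the page it was learned on and once after drift.
stable_state, stable_path = run_step(click_merge, {"button_id": "merge-btn"}, guided_fallback)
drifted_state, drifted_path = run_step(click_merge, {"button_id": "merge-v2"}, guided_fallback)
print(stable_path, drifted_path)
```

Keeping the fallback local to the step means drift in one place does not invalidate the rest of the stored workflow, which is the property the abstract emphasizes.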

2606.08048 2026-06-09 cs.CL New submission

Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge

Juntong Shi, Brian L. Trippe, Jure Leskovec, Stefano Ermon, Minkai Xu

Affiliations: Stanford University

AI summary: Proposes the PoE-Bridge framework, which builds an intermediate distribution as a product of experts to combine the parallel decoding of diffusion language models with the quality of autoregressive models, achieving a 5x speedup and recovering at least 95% of AR performance.

Comments: ICML 2026

Abstract

Diffusion language models (DLMs) offer substantial speed advantages through parallel decoding, but the lack of token dependencies limits generation quality compared to autoregressive (AR) models. Recent progress attempts to bridge the gap via importance sampling, with DLM being the proposal and AR being the target. However, due to the huge gap between their distributions, the sampling requires a large number of particles and is thus expensive to compute. In this paper, we introduce PoE-Bridge, a novel decoding framework that drastically improves generation speed and accuracy by introducing an intermediate distribution to bridge the gap. The distribution is constructed as a Product-of-Experts (PoE) of the DLM proposal and the AR target. With the intermediate distribution, we first use the DLM to draft multiple continuations in parallel, then apply rejection sampling to verify the drafted tokens and move the resulting candidates toward the PoE. We then use importance sampling to further correct the PoE-aligned candidates toward the AR target. We further propose several improved techniques, including mixed-temperature sampling for enhanced diversity and elastic rejection windows for reducing wasted verification. Empirically, PoE-Bridge achieves significantly improved accuracy with $5\times$ speedup over the standard DLM decoding approach, and recovers at least 95% of the target AR model's performance, efficiently advancing most of the quality gap on challenging mathematical reasoning and coding tasks. Our code is available at https://github.com/juntongshi48/poe-bridge.
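
The two-stage correction can be sketched on a single categorical distribution. This is a toy illustration under stated assumptions, not the released PoE-Bridge code: the intermediate distribution is taken as the plain normalized product q(x)p(x), whereas the paper may weight or temper the experts, and the token-level windows and mixed temperatures are omitted. The distributions `Q` and `P` and all function names are made up for the example.

```python
import random

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def product_of_experts(q, p):
    """Intermediate distribution proportional to q(x) * p(x)."""
    return normalize([qi * pi for qi, pi in zip(q, p)])

def rejection_sample(q, target, rng):
    """Draw one sample from `target` using proposals from `q`."""
    bound = max(t / qi for t, qi in zip(target, q))  # envelope constant M
    while True:
        x = rng.choices(range(len(q)), weights=q)[0]
        if rng.random() < target[x] / (bound * q[x]):
            return x

Q = [0.40, 0.30, 0.20, 0.10]  # toy "DLM" proposal over four tokens
P = [0.70, 0.05, 0.05, 0.20]  # toy "AR" target
POE = product_of_experts(Q, P)

rng = random.Random(0)
samples = [rejection_sample(Q, POE, rng) for _ in range(2000)]

# Importance weights p / poe correct PoE-distributed samples toward the AR target.
weights = [P[x] / POE[x] for x in samples]
total_weight = sum(weights)
estimate = [sum(w for x, w in zip(samples, weights) if x == k) / total_weight
            for k in range(len(P))]
print([round(e, 3) for e in estimate])
```

Because the product sits between proposal and target, the rejection step starts from a distribution the proposal already covers, and the importance weights that remain are less extreme than weighting proposal samples directly toward the target.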

2606.08046 2026-06-09 cs.AI cs.CV cs.LG New submission

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

Dimitrios Michail, Eleni Saka, Ioannis Giannopoulos, Ioannis Papoutsis

Affiliations: Harokopio University of Athens; National Technical University of Athens; Vienna University of Technology; National Observatory of Athens

AI summary: Proposes OSMGraphCLIP, a model that learns global location embeddings from the heterogeneous graph structure of OpenStreetMap using a multi-scale graph encoder and contrastive alignment, matching or exceeding satellite-based baselines on the majority of downstream benchmarks spanning climate, ecology, socioeconomic indicators, and other domains.

Abstract

We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStreetMap (OSM) data. OSMGraphCLIP represents geographic environments as heterogeneous graphs of typed OSM features, preserving the topological and semantic relationships among roads, buildings, land-use regions, and points of interest. A multi-scale graph encoder captures both fine-grained local structure and broader landscape composition, and supervises a spherical-harmonics location encoder through a contrastive alignment objective. We evaluate OSMGraphCLIP across a diverse suite of downstream geospatial regression and classification tasks spanning climate, ecology, socioeconomic indicators, public health, land cover, biodiversity, and wildfire forecasting, and show that structured OSM data alone supports strong global location representations across domains. OSMGraphCLIP matches or exceeds satellite-based baselines on the majority of benchmarks, with the most pronounced advantage on socioeconomic and public-health tasks, where OSM's explicit semantic annotation of the built environment encodes patterns of human activity that satellite pixels can only capture indirectly. On ecological and environmental tasks, the model remains closely competitive with imagery-based methods despite using no Earth observation data. Qualitative analysis confirms that the learned embeddings organize geographic space coherently, recovering biome boundaries, urban gradients, and tropical--temperate distinctions from map topology alone.
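
A contrastive alignment objective of the CLIP kind is commonly written as a symmetric InfoNCE loss. The sketch below shows that general form on toy two-dimensional embeddings. Whether OSMGraphCLIP uses exactly this loss, and with what temperature, is an assumption; the embeddings and names are illustrative.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_norm - row[i]
    return total / len(logits)

def contrastive_loss(graph_emb, loc_emb, temperature=0.1):
    """Symmetric InfoNCE: row i of each list describes the same location."""
    g = [l2_normalize(v) for v in graph_emb]
    loc = [l2_normalize(v) for v in loc_emb]
    sims = [[sum(a * b for a, b in zip(gi, lj)) / temperature for lj in loc] for gi in g]
    sims_t = [list(col) for col in zip(*sims)]  # location-to-graph direction
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(sims_t))

GRAPH = [[1.0, 0.0], [0.0, 1.0]]    # toy graph-encoder outputs
ALIGNED = [[0.9, 0.1], [0.1, 0.9]]  # location embeddings matching row by row
SWAPPED = [ALIGNED[1], ALIGNED[0]]  # same embeddings paired with the wrong graph

aligned_loss = contrastive_loss(GRAPH, ALIGNED)
swapped_loss = contrastive_loss(GRAPH, SWAPPED)
print(aligned_loss, swapped_loss)
```

The loss is small when each location embedding is closest to its own graph embedding and large when the pairs are swapped, which is the signal a graph encoder could use to supervise a location encoder.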

2606.08044 2026-06-09 cs.LG cs.AI cs.CL New submission

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

Affiliations: Stanford University; University of Illinois Urbana-Champaign; Technical University of Denmark

AI summary: Identifies an "audit gap" between behavioral safety and robustness under intervention. By constructing dissociated models and introducing the Latent Vulnerability Score (LVS), it shows that behavioral safety metrics are insufficient measures of representation-level robustness.

Comments: Preprint

Abstract

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.

2606.08039 2026-06-09 cs.RO New submission

MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

Manan Tayal

Affiliations: TAU-Intelligence

AI summary: Presents MuJoCo-Drones-Gym, a GPU-accelerated multi-drone simulator built on the MuJoCo physics engine. It supports an arbitrary number of Crazyflie 2.x nano-quadcopters, offers modular physics models, action interfaces, and observation spaces, integrates PettingZoo for multi-agent reinforcement learning, and includes seven task environments such as hover and velocity tracking.

Comments: 18 pages, 8 figures, 7 tables

Abstract

Robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms and as the data source for training reinforcement learning (RL) policies. Yet, existing quadcopter learning environments often face a trade-off between physical fidelity, multi-agent support, and the throughput required by modern deep RL pipelines. In this paper, we present MuJoCo-Drones-Gym, an open-source Gymnasium-compatible multi-drone environment built on top of the MuJoCo physics engine. MuJoCo-Drones-Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano-quadcopters and exposes a modular API for selecting (i) the physics model (rigid-body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter-drone downwash), (ii) the action interface (per-motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii) the observation space (kinematic state vectors, RGB / depth / segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper enables drop-in multi-agent reinforcement learning, while a suite of seven task environments (hover, velocity tracking, multi-drone hover, waypoint navigation, formation flight, gate racing, and a generic multi-agent template) demonstrates the breadth of the interface. We describe the environment design, the underlying physics and quadcopter dynamics, and illustrate its use through control and learning examples that mirror those of the closely related gym-pybullet-drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelizability.

2606.08038 2026-06-09 cs.SD New submission

Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis

Zhuolin Yi, Jun Xue, Yanzhen Ren, Yihuan Huang, Yi Chai, Daixian Li, Guanxiang Feng, Jiajun Liu

Affiliations: School of Cyber Science and Engineering, Wuhan University

AI summary: By decoupling training data scale from diversity, finds that diversity matters more than scale: excessive scale can lead to overfitting, while smaller but more diverse datasets perform better in cross-dataset evaluation.

Comments: Accepted by Interspeech 2026

Abstract

The scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the "scale-first" paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross-domain generalization due to overfitting. (2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger-scale datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.

2606.08037 2026-06-09 cs.LG cs.AI New submission

SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification

SafeECGMatch:面向开放集心电图分类的校准感知联合频率与时间空间半监督学习

Hongkyu Koh, Ikbeom Jang

发表机构 * Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出SafeECGMatch框架,通过双分支架构提取时频特征,结合自适应标签平滑和温度缩放校准模型,在标签分布不匹配下实现可靠的开集分类和OOD检测。

详情
Comments
8 pages. Accepted to the KDD-UC 2026 (ACM International Conference on Data Mining and Knowledge Discovery - Undergraduate Consortium 2026)
AI中文摘要

心电图(ECG)分类模型常面临严重的标签稀缺问题,使得半监督学习(SSL)成为降低标注成本的有效策略。然而,在临床环境中,未标注数据池通常包含分布外(OOD)异常或标注集中不存在的诊断类别。标准SSL会强制对这些未见类别分配错误的伪标签,产生过度自信的预测。为解决此问题,我们提出SafeECGMatch,一个校准感知的安全SSL框架,用于标签分布不匹配下的单标签ECG分类。方法上,SafeECGMatch采用双分支架构,通过ECG特定的数据增强提取时频潜在表示。关键地,它通过自适应标签平滑和温度缩放动态对齐置信度与经验准确性,在时间和频谱域上校准多类分类器和OOD检测器。这种联合优化实现了可信的OOD拒绝和可靠的伪标签分配。在PTB-XL和PhysioNet/CinC Challenge基准上评估,SafeECGMatch达到了最先进的准确性和校准性能,推动了生理时间序列中可靠知识发现。代码可在https://github.com/labhai/SafeECGMatch获取。

英文摘要

Electrocardiogram (ECG) classification models often suffer from severe label scarcity, making semi-supervised learning (SSL) an attractive strategy for reducing annotation costs. In clinical settings, however, unlabeled pools frequently contain out-of-distribution (OOD) anomalies or diagnostic groups absent from the labeled set. Standard SSL forces incorrect pseudo-labels onto these unseen classes, producing overconfident predictions. To address this, we propose SafeECGMatch, a calibration-aware safe SSL framework for single-label ECG classification under label distribution mismatch. Methodologically, SafeECGMatch employs a dual-branch architecture extracting time-frequency latent representations via ECG-specific augmentations. Crucially, it dynamically aligns confidence with empirical accuracy through adaptive label smoothing and temperature scaling, calibrating both the multiclass classifier and the OOD detector across temporal and spectral domains. This joint optimization allows trustworthy OOD rejection and reliable pseudo-labeling. Evaluated on the PTB-XL and PhysioNet/CinC Challenge benchmarks, SafeECGMatch achieves state-of-the-art accuracy and calibration, advancing reliable knowledge discovery in physiological time-series. Code is available at https://github.com/labhai/SafeECGMatch.
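Illustration (not from the paper): temperature scaling, one of the two calibration techniques named in the abstract, divides the logits by a scalar T before the softmax. The sketch below shows only the generic technique; the logits and the value T = 2.0 are made-up assumptions, and how SafeECGMatch fits or adapts its temperature is not stated in the abstract.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]      # hypothetical classifier outputs for one ECG
p_raw = softmax(logits)       # uncalibrated probabilities
p_cal = softmax(logits, 2.0)  # T = 2.0 is an assumed value, not a fitted one
```

Scaling by T > 1 lowers the top-class confidence without changing which class is predicted, which is why it is typically fitted on held-out data after training.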

2606.08035 2026-06-09 cs.CV 新提交

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

DyCo-RL: 用于视觉推理的动态跨模态协调

Hangui Lin, Yan Shu, Zhengyang Liang, Chi Liu, Xiangrui Liu, Minghao Qin, Teng Long, Zheng Liu, Nicu Sebe

发表机构 * University of Trento(特伦托大学) BAAI(北京智源人工智能研究院) Singapore Management University(新加坡管理大学) IQuest Research

AI总结 提出DyCo-RL,通过Fisher-Rao测地距离量化模态内注意力转移,实现动态跨模态协调,并利用对齐引导的优势重加权优化策略,提升多模态大模型在视觉推理中的表现。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为增强多模态大语言模型(MLLMs)视觉推理的主要范式。然而,现有的RLVR方法主要针对推理结果进行优化,从根本上忽略了生成过程中所需的细粒度跨模态协调。通过token级分析和控制干预,我们揭示了在思维链(CoT)推理过程中,MLLMs经常无法在提取视觉证据和合成文本上下文之间动态交替——这种协调崩溃与推理失败存在因果关系。受这些发现的启发,我们提出了DyCo-RL,它将动态跨模态协调集成到RLVR优化中。具体来说,DyCo-RL使用Fisher-Rao测地距离来度量模态内注意力转移,将token分配到视觉导向或文本导向的功能角色。然后,它评估token实际注意力分配与其分配角色之间的一致性,利用该分数在策略优化期间进行对齐引导的优势重加权。大量实验表明,算法无关的DyCo-RL应用于Qwen2.5-VL-3B/7B时,在涵盖视觉中心和数学推理的七个基准测试中,一致地改进了四种代表性的RLVR算法。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
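Illustration (not from the paper): for two discrete probability distributions p and q, the Fisher-Rao geodesic distance has the closed form 2·arccos(Σ √(pᵢqᵢ)). The sketch below computes that quantity for two hypothetical attention distributions; how DyCo-RL normalizes attention or thresholds the distance to assign token roles is not specified in the abstract, so the vectors and names here are assumptions.

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions."""
    # Bhattacharyya coefficient: sum of sqrt(p_i * q_i)
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(1.0, max(0.0, bc))  # clamp against floating-point drift
    return 2.0 * math.acos(bc)

# Hypothetical attention over three visual tokens at two consecutive steps
attn_prev = [0.7, 0.2, 0.1]
attn_next = [0.2, 0.3, 0.5]
shift = fisher_rao_distance(attn_prev, attn_next)
```

The distance is 0 for identical distributions and reaches its maximum, π, when the two distributions have disjoint support.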

2606.08034 2026-06-09 cs.CV cs.AI cs.CL 新提交

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Sci-Rho:面向STEM问题的多语言视觉基础符号基准

Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

发表机构 * Independent Researcher(独立研究员) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Binus University(比努斯大学) Bandung Institute of Technology(万隆理工学院)

AI总结 提出Sci-Rho,一个多语言、视觉基础的STEM问题动态基准,包含4242个模板和42420个实例,评估17个VLM发现最差精度与平均精度存在差距,且小模型跨语言性能下降。

详情
Comments
22 pages
AI中文摘要

符号基准已成为评估模型在STEM相关问题微小修改下鲁棒性的关键方法。然而,现有符号基准大多局限于数学推理,缺乏视觉基础,且主要以英语为主。在这项工作中,我们引入了Sci-Rho(科学鲁棒性),一个面向视觉基础STEM问题的动态基准,涵盖五个学科和七种语言,包含由领域专家(包括奥林匹克奖牌得主)精心设计的4,242个问题模板(每种语言606个)。每个模板实现为可执行的Python代码,通过改变数值、视觉模式、几何形状、颜色方案和函数类型,生成多样但等价的问题实例,总共产生42,420个实例,每个实例都配有推理步骤和真实解决方案。我们评估了17个最先进的VLM,发现最差情况准确率(定义为模型在每种生成变体上均正确回答的问题模板比例)与平均准确率之间存在明显差距。我们还发现,较小的模型在不同语言上表现出显著的性能下降,而专有模型和较大模型保持鲁棒。步骤级评估反映了相同的趋势,揭示了平均F1与最差情况F1分数之间的显著差距。最后,我们对VLM注意力头的检查显示,图像标记与文本标记的相对注意力分配存在显著的跨语言变化。我们的工作强调了超越静态基准的评估作为衡量VLM质量指标的重要性。

英文摘要

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.
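Illustration (not from the paper): the abstract defines worst-case accuracy as the proportion of templates answered correctly on every generated variation. The sketch below contrasts it with average accuracy, which is assumed here to be the fraction of correct answers over all instances; the template IDs and outcomes are invented.

```python
def average_accuracy(results):
    """Fraction of all generated instances answered correctly (assumed definition)."""
    flat = [ok for variants in results.values() for ok in variants]
    return sum(flat) / len(flat)

def worst_case_accuracy(results):
    """Fraction of templates answered correctly on every variation."""
    return sum(all(variants) for variants in results.values()) / len(results)

# Hypothetical per-template outcomes, one boolean per generated variation
results = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [False, False, False],
}
avg = average_accuracy(results)      # 5 of 9 instances correct
worst = worst_case_accuracy(results) # only t1 is correct on all variations
```

A single failed variation removes a template from the worst-case count, so the worst-case score can never exceed the average.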

2606.08033 2026-06-09 cs.CV cs.LG 新提交

Balancing Real and Synthetic Data for CNN-based Masonry Crack Detection

基于CNN的砌体裂缝检测中真实与合成数据的平衡

Mattia Forlesi, Alfonso Esposito, Ivan Zyrianoff, Alessandro Marzani, Marco Di Felice

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 针对砌体裂缝检测中真实数据不足的问题,提出用合成数据补充训练,通过调整真实与合成数据比例,发现20%真实数据加合成数据即可达到甚至超越纯真实数据的效果。

详情
AI中文摘要

裂缝是建筑健康的关键指标,早期识别对于防止有害损害至关重要。深度学习(DL)的进展,特别是卷积神经网络(CNN),已实现可扩展的自动裂缝检测解决方案。然而,CNN性能高度依赖于大规模多样化数据集的可用性,这对于砌体等复杂表面尤其具有挑战性。收集足够的真实数据耗时,而公开数据集可能不充分。为解决这一限制,我们探索生成合成裂缝数据,以补充真实数据并提高训练效果。真实数据集由从博洛尼亚及周边地区建筑收集的砌体裂缝图像组成。相比之下,合成数据集使用裂缝叠加工具生成,该工具以受控方向和位置向背景图像添加裂缝。使用真实数据集训练多种DL架构,以确定最佳性能模型(InceptionV4),用于生成数据的实验。通过改变真实与合成数据的比例,在InceptionV4上测试了六种训练场景,并在由真实图像组成的测试集上使用F1分数和平均交并比(mIoU)指标进行评估。结果表明,在合成数据上训练加上少量20%真实数据,可获得与仅使用真实数据训练相当的结果。此外,20/80(合成/真实)场景实现了76%的F1分数和80%的平均IoU,优于纯真实情况。可以看出,该方法展示了合成数据在减少收集工作同时提高裂缝检测准确性的潜力。

英文摘要

Cracks are a critical indicator of building health, and early-stage identification is fundamental to preventing harmful damage. Advances in deep learning (DL), particularly convolutional neural networks (CNNs), have enabled scalable solutions for automated crack detection. However, CNN performance strongly depends on the availability of large and diverse datasets, which is particularly challenging for complex surfaces such as masonry. Collecting sufficient real data is time-consuming, while publicly available datasets may not be adequate. To address this limitation, we explored generating synthetic crack data, which complements real data and improves training effectiveness. The real dataset consists of masonry crack images collected from buildings in Bologna and surrounding areas. In contrast, the synthetic dataset was generated using a crack overlay tool that adds cracks to background images in a controlled orientation and placement. The real dataset was used to train several DL architectures to identify the best-performing model (InceptionV4), which was then employed for experiments with generated data. Six training scenarios were tested on InceptionV4 by varying the ratio of real and synthetic data, with evaluation performed on a test set composed of real images using the F1-score and mean Intersection over Union (mIoU) metrics. Results show that training on synthetic data plus a modest addition of 20% real data achieves results comparable to training on real data only. Moreover, the 20/80 scenario (synthetic/real) achieved a 76% F1-score and 80% mean IoU, outperforming the real-only case. Overall, the method demonstrates the potential of synthetic data to reduce collection efforts while enhancing crack detection accuracy.
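Illustration (not from the paper): the two reported metrics, F1-score and mean IoU, can be computed from binary labels as sketched below. The abstract does not say whether labels are per image or per pixel, so the flat 0/1 lists (1 = crack) and the sample values are assumptions.

```python
def f1_score(pred, truth):
    """F1 for the positive (crack) class from 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def iou(pred, truth, cls):
    """Intersection over union for one class."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else 0.0

def mean_iou(pred, truth, classes=(0, 1)):
    """Unweighted mean of the per-class IoU values."""
    return sum(iou(pred, truth, c) for c in classes) / len(classes)

truth = [1, 1, 0, 0, 0, 1]  # hypothetical ground truth
pred = [1, 0, 0, 0, 1, 1]   # hypothetical model output
f1 = f1_score(pred, truth)
miou = mean_iou(pred, truth)
```

Because mIoU averages over both classes, a model that finds background well but misses cracks can still score moderately, which is one reason the two metrics are usually reported together.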

2606.08029 2026-06-09 cs.RO 新提交

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

IntentNav: 从人类演示中学习空间-视觉物体导航

Yuxin Cai, Zongtai Li, Maonan Wang, Muyi Bao, Haokun Zhu, Ruofei Bai, Ding Zhao, Zirui Li, Wenshan Wang, Wei-Yun Yau, Ji Zhang, Chen Lv

发表机构 * Nanyang Technological University(南洋理工大学) Carnegie Mellon University(卡内基梅隆大学) The Chinese University of Hong Kong(香港中文大学) A*STAR Institute for Infocomm Research (I2R)(新加坡科技研究局资讯通信研究院)

AI总结 提出IntentNav框架,通过人类演示学习类人物体导航策略,利用前沿标注和意图对齐目标实现最优性能,并零样本迁移到多种机器人平台。

详情
Comments
26 pages, 9 figures
AI中文摘要

物体导航要求机器人在未知环境中搜索未观察到的目标,通过在部分可观测性下决定下一步探索位置。有效的搜索类似于人类探索:选择性探查视觉上有希望的前沿,同时依赖空间记忆避免重复访问。我们提出IntentNav,一个从人类演示中学习类人ObjectNav策略的空间-视觉模仿框架。为了从低级人类动作推断高级搜索意图,我们引入了基于前沿的人类意图标注,该方法前瞻人类演示并标注最能解释演示者未来搜索方向的前沿。我们构建了一个空间-视觉候选空间,其中BEV记忆跟踪已探索区域、未探索前沿和轨迹历史,而自我中心视觉记忆为每个候选提供语义线索。训练一个VLM策略在这些基于上下文的候选中进行选择,使用意图对齐目标以鼓励一致且类人的探索。IntentNav在MP3D、HM3D-v1和HM3D-v2 ObjectNav基准上实现了最先进的性能。所提出的候选级导航界面无需进一步VLM微调即可零样本迁移到轮式、四足和类人机器人。项目页面:https://anonymous.4open.science/w/IntentNav/

英文摘要

Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding where to explore next under partial observability. Effective search resembles human-like exploration: selectively probing visually promising frontiers while relying on spatial memory to avoid redundant revisits. We propose IntentNav, a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations. To infer high-level search intent from low-level human actions, we introduce Frontier-based Human-Intent Labeling, which looks ahead in human demonstrations and labels the frontier that best explains the demonstrator's future search direction. We construct a spatial-visual candidate space, where BEV memory tracks explored regions, unexplored frontiers, and trajectory history, while egocentric visual memory provides semantic cues for each candidate. A VLM policy is trained to select among these grounded candidates, using an Intent-Aligned Objective to encourage consistent and human-like exploration. IntentNav achieves state-of-the-art performance on the MP3D, HM3D-v1 and HM3D-v2 ObjectNav benchmarks. The proposed candidate-level navigation interface transfers zero-shot to wheeled, quadruped, and humanoid robots without further VLM fine-tuning. Project page: https://anonymous.4open.science/w/IntentNav/

2606.08028 2026-06-09 cs.LG 新提交

Noise-Adaptive High-Probability Regret Bounds for Online Convex Optimization

噪声自适应的在线凸优化高概率遗憾界

Wentao Zhang, Yutong Zhang, Wentao Mo

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) College of Mathematics, Sichuan University(四川大学数学学院)

AI总结 针对强凸损失在线凸优化,提出噪声自适应高概率遗憾界,在完全信息下实现与噪声水平相关的乘性改进,并证明赌博反馈下遗憾与置信度的线性关系,同时为约束优化提供联合高概率保证。

详情
Comments
Accepted to 2026 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases(ECML-PKDD 2026)
AI中文摘要

我们研究了具有强凸损失的在线凸优化(OCO)的高概率遗憾界,并建立了三个结果,解决了噪声自适应性、反馈结构和约束满足交叉领域的开放问题。对于具有次高斯随机梯度的完全信息设置,我们证明了一个噪声自适应的高概率遗憾界,其中鞅偏差项与噪声水平$\sigma$而非梯度界$G$成比例,相比经典的Azuma-Hoeffding基线实现了$G/\sigma$的乘性改进。我们的分析引入了一个指数超鞅论证,绕过了Freedman不等式的有界差分要求,从而无需截断伪影即可直接处理无界次高斯噪声。对于赌博机(bandit)反馈,我们证明了一个极小极大下界:高概率遗憾与$\log(1/\delta)$线性增长,而完全信息下的置信成本为$\sqrt{\log(1/\delta)}$。这构成了强凸OCO在不同反馈模型下置信成本的正式分离。关于具有满足Slater条件的随机约束的约束OCO,我们为累积遗憾和长期约束违反提供了同时的高概率保证,实现了$\mathcal{O}(\sqrt{T\log(m/\delta)})$的遗憾和$\mathcal{O}(\sqrt{T}/(\zeta\delta) + m\sqrt{T\log(m/\delta)})$的违反。合成实验证实了所有理论预测。

英文摘要

We study high-probability regret bounds for online convex optimization (OCO) with strongly convex losses and establish three results that resolve open questions at the intersection of noise adaptivity, feedback structure, and constraint satisfaction. For the full-information setting with sub-Gaussian stochastic gradients, we prove a noise-adaptive high-probability regret bound in which the martingale deviation term scales with the noise level $\sigma$ rather than the gradient bound $G$, yielding a multiplicative improvement of $G/\sigma$ over the classical Azuma-Hoeffding baseline. Our analysis introduces an exponential supermartingale argument that bypasses the bounded-difference requirement of Freedman's inequality, enabling direct treatment of unbounded sub-Gaussian noise without truncation artifacts. For bandit feedback, we prove a minimax lower bound: the high-probability regret scales linearly in $\log(1/\delta)$, in contrast to the $\sqrt{\log(1/\delta)}$ confidence cost under full information. This constitutes a formal separation in the confidence cost of strongly convex OCO across feedback models. Regarding constrained OCO with stochastic constraints satisfying a Slater condition, we provide simultaneous high-probability guarantees for both cumulative regret and long-run constraint violation, achieving $\mathcal{O}(\sqrt{T\log(m/\delta)})$ regret and $\mathcal{O}(\sqrt{T}/(\zeta\delta) + m\sqrt{T\log(m/\delta)})$ violation. Synthetic experiments corroborate all theoretical predictions.
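Illustration (not from the paper): the full-information setting above, strongly convex losses with noisy gradients, can be simulated with textbook projected online gradient descent using step size 1/(μt). This is a generic baseline and not the authors' algorithm or analysis; the quadratic loss, the Gaussian noise, and every parameter value below are assumptions.

```python
import random

def noisy_ogd(mu=1.0, c=1.0, sigma=0.5, T=2000, radius=5.0, seed=0):
    """Projected OGD on f(x) = (mu/2) * (x - c)^2 with Gaussian gradient noise."""
    rng = random.Random(seed)
    x = 0.0
    regret = 0.0
    for t in range(1, T + 1):
        # The comparator x* = c has zero loss, so the loss itself is the regret term
        regret += 0.5 * mu * (x - c) ** 2
        grad = mu * (x - c) + rng.gauss(0.0, sigma)  # noisy gradient at x
        x -= grad / (mu * t)                         # step size 1 / (mu * t)
        x = max(-radius, min(radius, x))             # project onto [-radius, radius]
    return x, regret

x_final, regret = noisy_ogd()
```

Rerunning with different values of `sigma` gives a quick empirical feel for how the accumulated regret depends on the noise level rather than on a worst-case gradient bound.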

2606.08027 2026-06-09 cs.LG cs.AI 新提交

CausShield: Sample Reconstruction-Resilient Vertical FL via Causal Representation Learning

CausShield: 通过因果表示学习实现样本重建鲁棒的纵向联邦学习

Yongqi Jiang, Yansong Gao, Siguang Chen, Anmin Fu

发表机构 * Nanjing University of Science and Technology(南京理工大学) University of Western Australia(西澳大学) Hohai University(河海大学) Nanjing University(南京大学)

AI总结 针对纵向联邦学习中样本重建攻击的防御问题,提出基于因果表示学习的CausShield方法,将共享表示分解为任务相关与无关部分,实现全周期隐私保护,理论证明收敛性,实验优于七种最新方法。

详情
AI中文摘要

纵向联邦学习(VFL)是一种分布式学习范式,利用跨孤立方的垂直划分特征,无需共享原始样本;然而,它仍然容易受到主动样本重建攻击。现有防御方法由于要么抑制任务相关信息的同时也抑制了隐私敏感特征,要么依赖端到端监督训练来收敛防御模块(这暴露了早期轮次的脆弱性),因此无法在模型效用和隐私保护之间实现令人满意的权衡。为了解决这一挑战,我们采用结构因果模型(SCM)的见解,构建了CausShield。从任务学习的角度来看,原始样本中的因果特征是那些直接相关且有助于学习目标的特征,而非因果特征与任务无关,但通常编码了样本特定的私有信息,从而促进了重建。重要的是,我们奠定了理论基础来证明这一见解。因此,CausShield将VFL中客户端与协调服务器之间的共享表示分解为任务相关和任务无关的组件,以确保全周期的隐私保护。然而,由于在保持模型效用的同时减轻隐私泄露的双重目标,这种分解本质上具有挑战性。我们通过一个精心制定的优化问题来解决这一问题,该问题通过无监督表示学习求解。我们进一步从理论上证明CausShield保持了标准VFL的收敛行为。大量实验将CausShield与七种最新方法(包括InvL (USENIX Security'25))进行比较,并评估了对高级重建攻击(如URVFL (NDSS'25))的鲁棒性。结果表明,CausShield在隐私保护、模型效用和计算效率方面始终表现优异。

英文摘要

Vertical federated learning (VFL) is a distributed learning paradigm that leverages vertically partitioned features across isolated parties without sharing raw samples; however, it remains vulnerable to active sample reconstruction attacks. Existing defenses fail to achieve a satisfactory trade-off between model utility and privacy protection, due to either suppressing task-relevant information alongside privacy-sensitive features or relying on end-to-end supervised training to converge the defense module, which exposes the model to early-epoch vulnerability. To address this challenge, we adopt a structural causal model (SCM) insight and construct CausShield. From a task-learning standpoint, causal features within a raw sample are those that are directly relevant and contributory to the learning objective, whereas non-causal features are task-irrelevant but often encode sample-specific private information, thereby facilitating reconstruction. Importantly, we lay a theoretical foundation to prove this insight. CausShield thus decomposes the shared representations between the client and the coordinating server in VFL into task-relevant and task-irrelevant components to ensure full-cycle privacy protection. Nonetheless, the decomposition is inherently challenging due to the dual objectives of preserving model utility while mitigating privacy leakage. We address this via a carefully formulated optimization problem, which is solved through unsupervised representation learning. We further theoretically prove that CausShield preserves the convergence behavior of standard VFL. Extensive experiments compare CausShield against seven state-of-the-art methods, including InvL (USENIX Security'25), and evaluate robustness against advanced reconstruction attacks such as URVFL (NDSS'25). Results demonstrate that CausShield consistently outperforms these methods in privacy protection, model utility, and computational efficiency.

2606.08021 2026-06-09 cs.LG cs.AI cs.MA 新提交

Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure

语义法定数保证:面向非确定性AI基础设施的集体认证

Jun He, Deying Yu

发表机构 * OpenKedge.io

AI总结 提出语义法定数保证(SQA),一种通过多样化验证者群体和风险自适应法定数谓词,将非确定性LLM代理的不安全操作批准率从18.5%降至0.3%的控制平面原语。

详情
Comments
21 pages, 2 figures, 6 tables
AI中文摘要

随着大型语言模型(LLM)代理被集成到自主云操作中,分布式系统面临一个语义可靠性问题:提议代理可以生成语法有效且静态授权但操作不安全的生产环境变更,例如修改IAM策略、开放防火墙安全组或执行数据导出。经典的分布式共识协议复制确定性状态转换,但不评估提议意图的安全性。为弥补这一差距,我们引入语义法定数保证(SQA),一种用于治理非确定性代理基础设施的控制平面原语。SQA将提议表示为绑定到密码证据链的声明性执行合约,并将其路由到由只读、沙盒验证代理组成的多样化面板。SQA在风险自适应法定数谓词下聚合其判断,该谓词强制执行模型和原型多样性,根据校准的保证分数调整权重,并尊重特定原型的否决。通过的提议仅通过主权执行门执行。我们在云原生控制平面中实例化SQA,并为非确定性验证者形式化了一个相关的认知失败模型。在500个受基础设施启发的变更场景中(安全性结果基于留出的安全/不安全试验,排除模糊场景),SQA将不安全批准率从单代理验证的18.5%降低到0.3%,同时在所研究的各风险等级上增加了1.45至4.12秒的中位验证延迟。

英文摘要

As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability problem: proposer agents can generate production mutations, such as modifying IAM policies, opening firewall security groups, or executing data exports, that are syntactically valid and statically authorized but operationally unsafe. Classical distributed consensus protocols replicate deterministic state transitions but do not evaluate the safety of the proposed intent. To address this gap, we introduce Semantic Quorum Assurance (SQA), a control-plane primitive for governing non-deterministic agentic infrastructure. SQA represents proposals as declarative execution contracts bound to cryptographic evidence chains and routes them to a diverse panel of read-only, sandboxed validator agents. SQA aggregates their judgments under a risk-adaptive quorum predicate that enforces model and archetype diversity, adjusts weights based on calibrated assurance scores, and respects archetype-specific vetoes. Admitted proposals execute only through a sovereign execution gate. We instantiate SQA in a cloud-native control plane and formalize a correlated cognitive failure model for non-deterministic validators. On 500 infrastructure-inspired mutation scenarios, with safety results reported on held-out safe/unsafe trials excluding ambiguous scenarios, SQA reduces unsafe approval from 18.5% for single-agent validation to 0.3% while adding median validation latency of 1.45 to 4.12 seconds across the studied risk buckets.
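Illustration (not from the paper): a risk-adaptive quorum predicate of the kind described above could combine a weighted approval share, a diversity requirement, and vetoes. Everything in this sketch is hypothetical, including the `Vote` fields, the thresholds, and the risk levels; the abstract does not give SQA's actual predicate.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    model: str        # underlying LLM that produced the judgment
    archetype: str    # validator role, e.g. "security"
    approve: bool
    weight: float     # stand-in for a calibrated assurance score
    veto: bool = False

# Hypothetical thresholds: (minimum approval share, minimum distinct models/archetypes)
RISK_RULES = {"low": (0.5, 1), "medium": (0.67, 2), "high": (0.8, 3)}

def quorum_admits(votes, risk):
    """Admit only if no veto, enough weighted approval, and enough diversity."""
    min_share, min_distinct = RISK_RULES[risk]
    if any(v.veto for v in votes):
        return False
    total = sum(v.weight for v in votes)
    if total == 0:
        return False
    approvals = [v for v in votes if v.approve]
    share = sum(v.weight for v in approvals) / total
    models = {v.model for v in approvals}
    archetypes = {v.archetype for v in approvals}
    return (share >= min_share
            and len(models) >= min_distinct
            and len(archetypes) >= min_distinct)

votes = [
    Vote("model-a", "security", True, 0.9),
    Vote("model-b", "cost", True, 0.8),
    Vote("model-c", "compliance", False, 0.3),
]
```

With these example votes, a medium-risk proposal would be admitted while a high-risk one would not, because only two distinct models approved.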

2606.08018 2026-06-09 cs.AI 新提交

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

UniQL:迈向方言通用的文本到SQL基准测试

Jianling Gao, Chongyang Tao, Jiayuan Bai, Liu Yang, Xuanguang Pan, Jinrui Liu, Shihao Xing, Xiaohan Xu, Jie Liang, Shuai Ma

发表机构 * SKLCCSE, Beihang University(北京航空航天大学软件开发环境国家重点实验室) The University of Hong Kong(香港大学)

AI总结 提出UniQL基准,通过跨16种SQL方言的对齐标注,评估模型在不同数据库系统间的泛化能力,揭示现有模型在方言通用性上的不足。

详情
AI中文摘要

现有的文本到SQL基准测试主要集中在SQLite上,这使得评估模型能否跨异构SQL方言泛化变得困难。然而,现实世界的数据库系统在语法、函数、类型系统和执行语义上存在显著差异,因此相同的自然语言意图通常需要特定方言的SQL实现。我们引入了UniQL,一个用于跨方言文本到SQL评估的人工验证基准。UniQL将1,534个自然语言问题与16种SQL方言的可执行SQL注释对齐,产生了24,544个方言特定的查询。所有方言共享相同的意图、对齐的模式和数据库内容,从而实现了对方言泛化的可控评估。UniQL通过一个混合流水线构建,结合了数据库迁移、SQL翻译、执行引导验证、迭代规则总结和人工验证。在开源和闭源LLM上的实验表明,当前模型远未达到方言通用,在不同数据库系统间性能差异显著,且从SQLite成功到其他方言的迁移有限。这些发现凸显了对齐的跨方言基准和更注重方言的文本到SQL方法的必要性。代码和数据可在https://github.com/JerryGao818/UniQL获取。

英文摘要

Existing text-to-SQL benchmarks are largely centered on SQLite, making it difficult to evaluate whether models can generalize across heterogeneous SQL dialects. However, real-world database systems differ substantially in syntax, functions, type systems, and execution semantics, so the same natural language intent often requires dialect-specific SQL realizations. We introduce UniQL, a human-verified benchmark for cross-dialect text-to-SQL evaluation. UniQL aligns 1,534 natural language questions with executable SQL annotations across 16 SQL dialects, yielding 24,544 dialect-specific queries. All dialects share the same intents, aligned schemas and database contents, enabling controlled evaluation of dialect generalization. UniQL is constructed through a hybrid pipeline combining database migration, SQL translation, execution-guided verification, iterative rule summarization, and human validation. Experiments on both open-source and closed-source LLMs show that current models remain far from dialect-universal, with substantial performance variation across database systems and limited transfer from SQLite success to other dialects. These findings highlight the need for aligned cross-dialect benchmarks and more dialect-aware text-to-SQL methods. Code and data are available at https://github.com/JerryGao818/UniQL

2606.08016 2026-06-09 cs.CV cs.AI cs.CL 新提交

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

IEA:通过三阶段多任务对齐的业余友好型对话式图像编辑代理

Zichen Zhu, Yuheng Sun, Mingxuan Zhu, Wenjie Ma, Situo Zhang, Zhexiang Wang, Ziyue Yang, Danyang Zhang, Kunyao Lan, Zihan Zhao, Dingye Liu, Siqi Xiang, Lu Chen, Kai Yu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institution(上海创新研究院) Huawei Technologies Ltd.(华为技术有限公司) Nanyang Technological University(南洋理工大学) Jiangsu Key Lab of Language Computing(江苏省语言计算重点实验室)

AI总结 提出IEA对话式图像编辑代理,通过三阶段多任务训练学习操作参数化工具,实现可解释编辑轨迹,在像素距离和ROUGE-L指标上优于基线,用户研究中指令跟随和感知质量表现最佳。

详情
Comments
[CVPR 2026 Findings] Our data and code are released at https://github.com/OpenDFM/Image_Edit_Agent
AI中文摘要

当前的图像编辑软件通常依赖于固定滤镜或专家调参,导致业余用户的意图与结果之间存在差距。生成模型创建的图像可能包含伪影、不合理的细节或偏离真实感的风格漂移,并且对编辑原因缺乏解释。我们提出IEA,一个对话式图像编辑代理,它学习在显式、可解释的动作空间中操作参数化工具。IEA通过三阶段多任务流水线进行训练:(1) 在蒸馏专家编辑上进行SFT,(2) 使用GRPO进行奖励优化,奖励包括相似度改进、工具有用性和意图总结,(3) 大规模合成微调以联合掌握图像编辑、细化和用户意图总结。通过逐步操作16个编辑工具,IEA产生透明的编辑轨迹,可以检查和调试。在定量实验中,它在编辑任务上获得更低的像素距离,在总结任务上获得比强基线更高的ROUGE-L。在用户研究中,它在指令跟随方面在工具调用方法中排名最佳,同时在整体感知质量上超越生成方法。我们的结果验证了可解释的、以工具为中心的VLM作为人类指令引导图像润色的可靠路径。

英文摘要

Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.

2606.08015 2026-06-09 cs.RO 新提交

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Q-VGM: 基于Q引导的值梯度匹配的流匹配VLA策略

Ziqian Wang, Jiayu Sun, Xingjian Mao, Minqian Wang, Yao Mu

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Michigan, Ann Arbor(密歇根大学安娜堡分校) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出Q-VGM离策略(off-policy)强化学习方法,通过将值梯度转化为去噪时间上的值梯度场,避免反向传播去噪链,高效微调流匹配VLA策略,在LIBERO等任务上显著提升成功率。

详情
Comments
13 pages, 3 figures, 4 tables
AI中文摘要

我们提出Q引导的值梯度匹配(Q-VGM),一种离策略(off-policy)强化学习方法,解决了微调流匹配视觉-语言-动作(VLA)策略中长期存在的挑战:如何高效地根据学习到的Q函数改进一个表达力强的流匹配动作专家。有效的改进必须利用评论家的一阶(梯度)信息,但这对于流策略很困难,因为直接通过其多步去噪过程反向传播值函数在VLA规模下数值不稳定,而策略梯度方法所需的可处理动作似然在迭代去噪下不可用。现有的基于值的方法要么通过整个去噪链反向传播,要么仅在测试时使用评论家而不更新策略,要么将评论家改进的动作作为终端标签蒸馏而不监督速度场。Q-VGM通过利用VGG-Flow(一种生成建模中流对齐的值梯度视角)绕过了这些问题,它将值梯度转化为去噪时间上的值梯度场,而不是不稳定的端到端目标。这不需要动作似然,也不需要反向传播去噪链,并且在一个固定的重放缓冲区上操作。评论家是一个动作敏感的Cal-QL集成,基于紧凑的RLT特征和每层动作注入。Q-VGM实现了一种实用的少样本初始化然后从经验中学习的范式:从少样本SFT pi0.5 VLA开始,该方法利用自生成的rollout数据显著提升任务性能,无需额外的专家监督。在LIBERO上,Q-VGM将平均成功率从75.0%提升到92.5%;在RoboTwin 2.0上,从76.4%提升到87.2%;在两个真实机器人桌面任务上,从40.0%提升到67.5%,在所有三种设置中均优于所有相同骨干、相同评论家的基线。

英文摘要

We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms the value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical paradigm of few-shot initialization followed by learning from experience: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.

2606.08014 2026-06-09 cs.CV cs.AI 新提交

GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence

GVC-Seg: 基于几何视觉对应的免训练3D实例分割

Liang Xu, Fangjing Wang, Jinyu Yang, Feng Zheng

发表机构 * Victoria University of Wellington(惠灵顿维多利亚大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Southern University of Science and Technology(南方科技大学)

AI总结 提出GVC-Seg,一种免训练的3D实例分割方法,通过几何与视觉特征对应消除多模型集成中的置信度偏差,在多个基准上达到最优性能。

详情
Comments
10 pages, 5 figures
AI中文摘要

点云数据中的精确3D实例分割对于机器视觉应用至关重要。最近的研究利用多个预训练基础模型生成3D提案,然后应用提案聚合方法,显著提升了性能。然而,由于不同分割模型之间置信度水平的固有差异,它们通常会产生次优结果,导致偏向于置信度更高的模型。这种偏差本质上是模型依赖的,并受到数据预处理技术和训练策略等因素的影响。为了解决这一偏差,我们提出了一种新颖的、免训练的3D实例分割方法,通过几何视觉对应(GVC-Seg)来利用3D几何线索与2D视觉线索之间的对应关系,以减轻置信度偏差。此外,在实例掩码生成和实例语义推理过程中,分别引入了3D提案生成模块和掩码感知的CLIP特征提取模块。通过这种方式,GVC-Seg增强了提案质量评估,确保了不同模型之间的无偏集成学习。大量实验表明,我们的方法在多个具有挑战性的基准上达到了最先进的性能,同时在开放词汇语义分割设置中也展现出强大的潜力。

英文摘要

Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre-trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub-optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model with higher confidence. This bias is inherently model-dependent and is influenced by factors such as data preprocessing techniques and training strategies. To address this bias, we propose a novel, training-free 3D instance segmentation approach via Geometric Visual Correspondence (GVC-Seg), which exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate the confidence bias. Additionally, a 3D proposal generation module and a mask-aware CLIP feature extraction module are introduced during the instance mask generation and instance semantic reasoning, respectively. In this way, GVC-Seg enhances proposal quality assessment, ensuring unbiased ensemble learning across different models. Extensive experiments demonstrate that our method achieves state-of-the-art performance on several challenging benchmarks, while also exhibiting strong potential in open-vocabulary semantic segmentation settings.

2606.08013 2026-06-09 cs.LG 新提交

Evaluating the Impact of Task Granularity on Catastrophic Forgetting in Continual Learning

评估任务粒度对持续学习中灾难性遗忘的影响

Emre Alyamac, Himanshu Janmeda, Shashwat Krishna, Yash Vijay

发表机构 * College of Engineering(工程学院) College of Natural Science(自然科学学院)

AI总结 研究任务粒度顺序对持续学习中灾难性遗忘的影响,通过CIFAR-100上的粗到细、细到粗和平坦三种训练策略,结合弹性权重巩固(EWC)方法,发现先学习一般类别可减少遗忘。

详情
Comments
8 pages, 4 figures, 5 tables
AI中文摘要

灾难性遗忘,即学习新信息时突然丢失先前获得的知识,仍然是持续学习中的核心挑战。本项目研究模型学习信息的顺序是否影响其保留知识的能力。具体而言,我们提出疑问:先学习一般类别(如“动物” vs “交通工具”)再学习具体类别(如“狗” vs “猫”)是否比一次性学习所有类别更能减少遗忘?我们在CIFAR-100上测试了三种方法:(1)粗到细:先训练2个超类,再扩展到10个具体子类;(2)细到粗:先训练10个子类,再分组为2个超类;(3)平坦:从一开始就训练所有10个类别。我们使用弹性权重巩固(EWC)来防止过渡期间的遗忘。我们的假设是,先学习一般模式可以为模型建立一个稳定的基础,帮助其在学习更详细区分时保留知识。我们使用标准指标(准确率、精确率、召回率、F1)以及持续学习指标(如反向迁移和遗忘率)进行评估。这项工作可为需要增量学习的实际系统设计学习序列提供参考。

英文摘要

Catastrophic forgetting, the abrupt loss of previously acquired knowledge upon learning new information, remains the central challenge in Continual Learning. This project investigates whether the order in which a model learns information affects how well it retains knowledge. Specifically, we ask: does learning general categories first (like "animals" vs "vehicles") before learning specific classes (like "dog" vs "cat") reduce forgetting compared to learning all classes at once? We test three approaches on CIFAR-100: (1) Coarse-to-Fine: train on 2 super-classes, then expand to 10 specific sub-classes, (2) Fine-to-Coarse: train on 10 sub-classes, then group into 2 super-classes, and (3) Flat: train on all 10 classes from the start. We use Elastic Weight Consolidation (EWC) to prevent forgetting during transitions. Our hypothesis is that learning general patterns first creates a stable foundation that helps the model retain knowledge when learning more detailed distinctions. We evaluate using standard metrics (accuracy, precision, recall, F1) plus continual learning metrics like backward transfer and forgetting rates. This work could inform how we design learning sequences for real-world systems that need to learn incrementally.
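Illustration (not from the paper): Elastic Weight Consolidation adds a quadratic penalty, (λ/2)·Σ Fᵢ(θᵢ − θᵢ*)², that pulls each parameter toward its value after the previous task, weighted by a diagonal Fisher information estimate. The sketch below shows only that penalty on plain lists; the parameter values, Fisher estimates, and λ = 100 are invented, and the project's actual settings are not given in the abstract.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_i_star)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )

def total_loss(new_task_loss, params, anchor_params, fisher_diag, lam=100.0):
    """New-task loss plus the EWC penalty that protects old-task knowledge."""
    return new_task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)

anchor = [1.0, -2.0]   # hypothetical weights after the first (e.g. coarse) stage
fisher = [5.0, 0.01]   # hypothetical importance: first weight matters far more
params = [1.1, -1.0]   # hypothetical weights during the second stage
penalty = ewc_penalty(params, anchor, fisher, lam=100.0)
loss = total_loss(0.4, params, anchor, fisher)
```

In this example the small move of the important first weight contributes five times more penalty than the large move of the unimportant second weight, which is how EWC steers updates away from parameters the earlier task relied on.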

2606.08002 2026-06-09 cs.CV New submission

Aqua Boundary-Saliency Attention Module for Lightweight Underwater Salient Instance Segmentation Detection Transformer

M. Fazri Nizar, Julian Supardi, Muhammad Naufal Rachmatullah

Affiliations: Universitas Sriwijaya

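AI summary: Proposes the Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS-DETR), which embeds underwater prior cues through the Aqua Boundary-Saliency Attention Module; it reaches leading performance on four datasets and 4.31-6.34 ms latency on an NVIDIA T4 GPU.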

Comments
This work has been submitted to the IEEE for possible publication
English abstract

Underwater instance segmentation integrates pixel-level mask prediction and instance-level discrimination for marine resource exploration, ecological monitoring, and underwater robotic perception. Recent prompt-based and auxiliary-modality methods improve mask quality, but their reliance on large foundation models, prompt generation, or extra modality estimation complicates efficient deployment. This work introduces Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS-DETR), a compact detection-transformer framework built around the Aqua Boundary-Saliency Attention Module (AquaBSAM). AquaBSAM embeds underwater boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2-initialized multi-scale features through bounded residual modulation, while auxiliary mask supervision and small-object copy-paste are training-only. Extensive evaluation on four recent underwater instance segmentation datasets, UIIS, UIIS10K, USIS10K, and USIS16K, shows competitively leading performance against previous state-of-the-art works across category-aware and salient-instance protocols. TensorRT half-precision (FP16) benchmarking on an NVIDIA T4 graphics processing unit (GPU) achieves 4.31-6.34 milliseconds (ms) latency, supporting real-time inference under an accessible reproduction setting.

2606.08000 2026-06-09 cs.CL cs.AI New submission

Summarization is Not Dead Yet

Dongqi Liu, Chenxi Whitehouse, Zheng Zhao, Zhuchen Cao, Jian Li, Yabiao Wang

Affiliations: Saarland University; Max Planck Institute for Informatics; University of Cambridge; University of Edinburgh; Zhejiang University; Tencent YouTu Lab

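AI summary: A multi-dimensional evaluation finds that human reference summaries still outperform large language models in informativeness and faithfulness, while LLM outputs lead only in surface-level coherence and fluency, indicating that summarization remains a challenging research problem.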

English abstract

The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.

2606.07996 2026-06-09 cs.CL cs.AI New submission

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

Kaixin Lan, Mu You, Tao Fang, Binkai Ou, Lidia S. Chao, Derek F. Wong

Affiliations: University of Macau; Macau Millennium College; BoardWare Information System Limited

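AI summary: Proposes MC-PDD, which masks specific tokens, has the LLM predict the missing content, and compares prediction hit rates between a candidate corpus and a reference non-member corpus to detect pretraining data in a black-box setting, with performance comparable to existing methods.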

Comments
The manuscript consists of 10 pages formatted in the IEEE/ACM two-column style
English abstract

Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model's pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.
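The abstract says MC-PDD checks whether the hit-rate difference between a candidate corpus and a reference non-member corpus is statistically significant, but it does not name the test. As a rough illustration only, the sketch below uses a one-sided two-proportion z-test, which is an assumption and not necessarily the authors' choice. The function name, arguments, and counts are hypothetical.

```python
import math


def hit_rate_z_test(hits_cand: int, n_cand: int, hits_ref: int, n_ref: int):
    """One-sided two-proportion z-test on masked-token prediction hit rates.

    Tests whether the candidate corpus has a higher hit rate than the
    reference non-member corpus. Returns (z, p_value).
    This is an assumed stand-in for the unspecified test in the abstract.
    """
    if n_cand <= 0 or n_ref <= 0:
        raise ValueError("both corpora need at least one masked prediction")
    rate_cand = hits_cand / n_cand
    rate_ref = hits_ref / n_ref
    # Pooled hit rate under the null hypothesis of equal rates.
    pooled = (hits_cand + hits_ref) / (n_cand + n_ref)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_cand + 1.0 / n_ref))
    if std_err == 0.0:
        return 0.0, 1.0  # all hits or all misses: no evidence either way
    z = (rate_cand - rate_ref) / std_err
    # Upper-tail probability of a standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 60/100 hits on the candidate corpus, 40/100 on the reference.
z, p = hit_rate_z_test(60, 100, 40, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A small p-value would suggest that the model predicts masked tokens in the candidate corpus more often than in unseen text. Where to set the significance threshold is a separate decision that this sketch leaves open.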

2606.07995 2026-06-09 cs.CL New submission

Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR

Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao, Besnik Fetahu, Bing Yin

Affiliations: Amazon; Duke University

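AI summary: Proposes the ShopTrajQA benchmark and a Customer Agent Framework that uses RLVR to train an agent to autonomously retrieve and parse external trajectory files through a code interpreter, bypassing the LLM context-window limit and achieving strong performance on reasoning over ultra-long shopping trajectories.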

English abstract

Understanding customer shopping trajectories is essential for enabling personalized shopping experiences. However, shopping records (e.g., customers' searches, clicks, and purchases) often span multiple years, resulting in extremely long trajectories that pose significant challenges for existing large language models (LLMs). Despite the importance of this problem, existing benchmarks are limited to short customer trajectories, while real-world trajectories from large e-commerce platforms are rarely accessible due to data privacy constraints. To address this gap, we introduce ShopTrajQA, a long-context evaluation benchmark constructed from real-world product information and simulated shopping trajectories. The dataset includes variants of up to 32k and 64k tokens, enabling systematic evaluation of model robustness under varying context lengths. Through comprehensive benchmarking of frontier LLMs, we identify critical performance gaps in reasoning over long shopping trajectory data. To address these challenges, we propose a Customer Agent Framework for ultra-long context management. Leveraging a Reinforcement Learning with Verifiable Rewards (RLVR) agentic training paradigm, our approach stores trajectories as external local files and trains the agent to autonomously retrieve and parse them through code-interpreter interactions (e.g., SQL queries), effectively bypassing the fixed in-context window constraints of LLMs. Experimental results demonstrate that our framework achieves strong performance on ShopTrajQA and generalizes to other complex reasoning tasks.
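To make the idea of querying a trajectory outside the context window concrete, here is a minimal sketch of the kind of SQL lookup an agent might issue through a code interpreter. The table schema, column names, helper functions, and sample events are hypothetical and do not come from the paper or the ShopTrajQA dataset. An in-memory SQLite database stands in for the external local files the abstract describes.

```python
import sqlite3


def build_trajectory_db(events):
    """Load (timestamp, action, product) tuples into an in-memory SQLite table.

    The schema is an illustrative assumption, not the paper's storage format.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts TEXT, action TEXT, product TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    conn.commit()
    return conn


def purchases_in_year(conn, year):
    """Return products purchased in the given year, oldest first."""
    rows = conn.execute(
        "SELECT product FROM events "
        "WHERE action = 'purchase' AND substr(ts, 1, 4) = ? "
        "ORDER BY ts",
        (str(year),),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical multi-year trajectory with ISO-formatted dates.
events = [
    ("2021-03-02", "search", "running shoes"),
    ("2021-03-02", "click", "trail runner X"),
    ("2021-03-05", "purchase", "trail runner X"),
    ("2023-11-20", "purchase", "rain jacket"),
]
conn = build_trajectory_db(events)
print(purchases_in_year(conn, 2021))  # ['trail runner X']
```

Only the few rows returned by the query need to enter the model's context, rather than the full multi-year trajectory.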