arXivDaily: daily arXiv research digest, updated Monday through Friday
2606.05665 2026-06-05 cs.CV

V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation

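Tao Liu, Leela Krishna, Gouti Pavan Kumar, Sreeja K, Vishav Garg

Affiliations: arXiv.org cs.CV (Computer Vision)

AI summary: Existing metrics for video-to-video generation cannot measure instruction following and frame-level correspondence at the same time. V2V-Bench, a benchmark with 11 dimensions in 5 categories, is used to evaluate three models and correlates strongly with human judgments.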

Comments: Accepted at ICML 2026 workshop

Abstract

Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging editing tasks and evaluates two commercial models, Grok Imagine and Gemini Veo3, and one open-source model, Open Sora 2. Results show complementary model strengths: Grok performs better on editing fidelity, while Veo3 achieves stronger visual quality. On six V2V-specific dimensions, V2V-Bench reaches a Spearman correlation of 0.905 with human judgments.
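
For readers unfamiliar with the statistic quoted above, the following is a minimal sketch of how a Spearman rank correlation between benchmark scores and human ratings can be computed. The score lists are made-up illustration values, not data from the paper, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-video scores: automatic metric vs. mean human rating.
metric_scores = [0.62, 0.71, 0.55, 0.90, 0.48, 0.80]
human_scores = [3.1, 3.0, 2.5, 4.8, 2.0, 4.1]
rho = spearman(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

Because only ranks matter, the metric and the human ratings may live on different scales; a value near 1 means the two orderings of the videos largely agree.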

2606.05663 2026-06-05 cs.RO

Preserving Full 6-DOF Actuation Under Abrupt Total Rotor Failures: Passive Fault-Tolerant Flight Control Using a Biaxial-Tilt Hexacopter

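Yipeng Yang, Yiqiao Tang, Hao Zhang, Jinqi Jiang, Jianfeng He, Rumo Chen, Xinghu Yu, Zhan Li, Huijun Gao

Affiliations: Tsinghua University

AI summary: For a biaxial-tilt overactuated hexacopter under abrupt total rotor failures, the paper proposes two passive fault-tolerant control schemes that need no fault detection, achieves full 6-DOF trajectory tracking, and validates robustness in simulation and flight experiments.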

Abstract

Conventional multirotors suffer from a rapid collapse of attainable wrench space (AWS) under abrupt total rotor failures, rendering full 6-DOF recovery physically impossible. This paper addresses passive fault-tolerant flight of a biaxial-tilt overactuated hexacopter (BTO) under abrupt total rotor failures that are a priori unknown to the controller. The control design and analysis focus on representative abrupt rotor-failure cases for which the post-failure system remains fully actuated, while no explicit fault detection, isolation, or fault-mode switching is assumed. First, we extend the inscribed-sphere metric of the AWS by incorporating the transient-wrench-jump term, enabling quantitative feasibility assessment under up to three simultaneous rotor failures and benchmarking against uniaxial-tilt and coplanar hexacopters. Second, we develop two computationally efficient passive schemes without relying on fault detection or online optimization. One scheme operates at the controller layer by combining a high-order fully actuated (HOFA) controller with a linear extended state observer (LESO) for lumped-disturbance rejection. The other scheme operates at the allocator layer by using model-reference adaptive control allocation with momentum-based wrench estimation to compensate for control-allocation biases. Simulations and flight experiments validate stable hovering and 6-DOF trajectory tracking under single and multiple rotor failures. Further systematic comparisons confirm that the BTO provides larger recovery margins than uniaxial-tilt and coplanar designs. Additional onboard-sensor-only experiments, including indoor tracking under wind disturbance, outdoor tracking under extreme conditions, narrow-frame traversal, and contact-based aerial writing, further validate the robustness of the proposed framework in complex operational environments.

2606.05661 2026-06-05 cs.AI cs.CL

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, Joseph E. Gonzalez

Affiliations: UC Berkeley; Snorkel AI; University of Wisconsin-Madison

AI summary: Introduces CL-Bench, the first expert-validated continual learning benchmark, spanning six domains. A gain metric isolates online learning ability, and the evaluation finds that current systems overfit and reuse knowledge poorly.

Abstract

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

2606.05660 2026-06-05 cs.RO cs.AI

Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation

Dabin Kim, Daemin Park, Sangyub Lee, Jinsik Kim, Yeongtak Oh, Jongho Shin, Sungroh Yoon

Affiliations: UNIST InnoCORE AI-Space Solar Initiative; Ulsan National Institute of Science and Technology (UNIST); Automation and Systems Research Institute; Department of Electrical and Computer Engineering; Interdisciplinary Program in Artificial Intelligence; LG Electronics

AI summary: A survey of safety in long-horizon robotic manipulation from an embodied AI perspective. It organizes the literature by when intervention happens (planning time, policy time, execution time), analyzes evidence strength, and points out gaps in current safety guarantees along with future directions.

Comments: 63 pages, 6 figures

Abstract

Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.

2606.05652 2026-06-05 cs.CV

CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors

Shengxi Li, Zhaokun Hu, Ce Zheng, Mai Xu, Jingyuan Xia, Si Liu

Affiliations: Department of Electronic Information Engineering, Beihang University; School of Cyber Science and Technology, Beihang University; College of Electronic Science, National University of Defense Technology; Institute of Artificial Intelligence, Beihang University

AI summary: Proposes CoFi-UCGen, a coarse-to-fine unsupervised conditional generation framework. Adversarial semantic reciprocal learning and bit-codes disentangle global from fine-grained semantics without labels, and a hierarchical modulation mechanism in the diffusion model controls generation.

Abstract

Unsupervised conditional image generation (UCGen) aims to control generation without relying on manually annotated labels, yet remains challenging due to unstructured semantic representations across granularities. To address this, we propose a novel coarse-to-fine UCGen framework (CoFi-UCGen) that explicitly disentangles global semantics from fine-grained variations; to the best of our knowledge, it is the first successful attempt at both coarse- and fine-grained conditional generation without any labels. Specifically, we first propose the adversarial semantic reciprocal learning theory to ensure semantic consistency and completeness between the image and latent spaces. Based on this consistency, we propose bit-codes to learn a structured coarse-grained latent space, and further prove that our bit-codes carry distinct global semantics while independent noise sampling is preserved for generation. Building upon these bit-codes, we establish a fine-grained semantic basis and introduce a hierarchical modulation mechanism in diffusion models that injects coarse conditions layer by layer to progressively control fine-grained attributes during generation. Extensive experiments demonstrate that without any label priors or pre-trained feature extractors, our CoFi-UCGen consistently outperforms existing UCGen methods in terms of image quality, semantic consistency, and control accuracy, verifying the effectiveness of explicit coarse-to-fine semantic decomposition for the challenging UCGen task.

2606.05647 2026-06-05 cs.AI cs.CL cs.CY cs.HC

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

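Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

Affiliations: Northeastern University

AI summary: A large-scale user study of whether human developers can detect malicious code inserted by AI agents during long coding tasks. It finds that 94% of developers miss the sabotage, analyzes why, and offers design suggestions for safety monitors.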

Comments: 34 pages, 30 figures, 3 tables

Abstract

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

2606.05644 2026-06-05 cs.AI

FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

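Zhe Yu, Wenpeng Xing, Tiancheng Zhao, Mohan Li, Changting Lin, Meng Han

Affiliations: Binjiang Institute of Zhejiang University; Zhejiang University; Guangzhou University; GenTel.io

AI summary: When retrieved evidence conflicts with parametric memory in retrieval-augmented generation, models tend to ignore the context. FIDES, a training-free decoder, fuses three internal signals (output surface, hidden representations, and prediction trajectory) to adjust intervention strength per token, substantially improving context fidelity.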

Abstract

When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths -- output surface, hidden representations, and prediction trajectory -- and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones -- four primary 7B/8B models and two scaling backbones up to 70B -- FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.
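
The shift from a single global contrastive weight to a per-step weight can be sketched as follows. This is an illustration only: it uses the common contrastive form (1 + alpha) * logits_with_context - alpha * logits_without_context, and it stands in a single Jensen-Shannon divergence signal for the three fused signals the abstract describes. The function names, the `alpha_max` parameter, and the logits are all assumptions, not the FIDES implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; ranges from 0 to ln 2."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def step_weight(with_ctx, without_ctx, alpha_max=1.0):
    """Hypothetical conflict signal: scale alpha by how far the two
    next-token distributions disagree (0 = identical, 1 = disjoint)."""
    conflict = js_divergence(softmax(with_ctx), softmax(without_ctx)) / math.log(2)
    return alpha_max * conflict


def contrastive_logits(with_ctx, without_ctx, alpha):
    """Amplify what the retrieved context adds relative to memory alone."""
    return [(1 + alpha) * c - alpha * m for c, m in zip(with_ctx, without_ctx)]


def decode_step(with_ctx, without_ctx, alpha_max=1.0):
    """Greedy choice for one step using a step-specific contrast weight."""
    alpha = step_weight(with_ctx, without_ctx, alpha_max)
    adjusted = contrastive_logits(with_ctx, without_ctx, alpha)
    token = max(range(len(adjusted)), key=adjusted.__getitem__)
    return token, alpha


# Made-up logits over a 3-token vocabulary for one conflicted step:
# memory alone strongly prefers token 0, while the context raises token 1.
with_ctx = [1.2, 1.0, 0.0]
without_ctx = [3.0, 0.0, 0.0]
token, alpha = decode_step(with_ctx, without_ctx)
print(token, round(alpha, 3))
```

On steps where the two distributions agree, the weight falls to zero and the context-conditioned logits pass through unchanged, which is the selectivity a global weight cannot provide.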

2606.05641 2026-06-05 cs.CV

Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure

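Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh, Armstrong Aboah

Affiliations: NDSU

AI summary: Proposes CrackGeoFM, a multi-task framework that pairs a frozen visual foundation backbone with crack-specific adaptation modules for mask prediction, skeleton reconstruction, and uncertainty estimation. On 20 datasets it achieves state-of-the-art segmentation, topology preservation, and calibrated uncertainty.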

Comments: 60 pages, 17 figures, 11 tables

Abstract

Reliable crack assessment requires not only accurate pixel-level masks but also connected crack geometry and confidence estimates that remain stable under domain shift. However, existing segmentation models can achieve high overlap scores while fragmenting cracks, missing fine branches, and providing no calibrated uncertainty. To address this gap, this paper proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation for mask prediction, skeleton reconstruction, and uncertainty estimation. The framework integrates a Frequency-Guided Crack Enhancement Module (FCEM) to enhance high-frequency crack cues, a Crack-Domain Feature Adaptation Module (CFAM) to adapt frozen backbone features to crack-domain patterns, and a Structure-Aware Multi-Task Decoder (SMTD) to jointly decode masks, skeletons, and uncertainty. Across 20 crack datasets, CrackGeoFM achieves state-of-the-art segmentation, improved topology preservation, calibrated uncertainty, and effective few-shot adaptation with only five labeled images. These results support reliable, generalizable, and engineering-oriented crack analysis for infrastructure assessment.

2606.05639 2026-06-05 cs.LG

Q-GNN: Query-Conditioned Graph Neural Networks with Type Awareness for Knowledge Graph Completion

Dongxiao He, Ruqiong Zhang, Zhizhi Yu, Ling Ding, Di Jin, Guangquan Xu, Zhiyong Feng

Affiliations: College of Intelligence and Computing, Tianjin University

AI summary: Proposes Q-GNN, which fuses the query entity's structural context and semantic type information to strengthen graph neural network reasoning for knowledge graph completion.

Abstract

Knowledge Graph Completion (KGC) aims to predict missing triplets from incomplete knowledge graphs, which is crucial for downstream applications. Recently, Graph Neural Network (GNN)-based methods have achieved remarkable success by performing message passing over query-centered local subgraphs. However, a query is jointly defined by its entity and its relation, and both carry information indispensable for reasoning. Yet these methods rely solely on the query relation as the guiding signal, and the information inherent in the query entity is not leveraged to guide inference; the entity serves merely as a structural anchor for subgraph extraction. To this end, we incorporate query entity information into the reasoning process from two perspectives: the first is structural context, i.e., the neighboring structure and relation patterns around the entity, which is encoded by a dedicated context encoder and used to modulate messages; the second is the semantic type of the entity, inferred by a large language model, which is incorporated into attention computation and final scoring to provide type-level prior constraints. Together, these two sources of information enable the reasoning process to be guided by both the query relation and the query entity. Experimental results on standard benchmarks demonstrate the effectiveness of the proposed Q-GNN.

2606.05636 2026-06-05 cs.LG

StableRCA: Robust Graph-Agnostic Mechanism-Level Root Cause Analysis

Xiaoyu Lin, Nicholas Tagliapietra, Kehan Li, Lavdim Halilaj, Juergen Luettin

Affiliations: Department of Computer Science, Tsinghua University; Bosch Center for Artificial Intelligence; Computer Science Department, TU Darmstadt

AI summary: Proposes StableRCA, which estimates local Markov boundaries and detects conditional distribution shifts, avoiding global graph discovery to achieve robust mechanism-level root cause analysis.

Abstract

Root-Cause Analysis (RCA) seeks to identify the variables responsible for abnormal system behavior in complex domains such as manufacturing, cloud computing, and healthcare. Existing approaches face a critical bottleneck: graph-based causal methods can identify intervention targets but typically require a known or accurately estimated causal graph, while graph-free statistical methods either localize marginal anomalies rather than structural causes, or rely on restrictive assumptions about graph structure or functional form. We propose StableRCA, a local mechanism-level RCA framework that avoids global graph discovery by estimating local Markov boundaries and detecting conditional distribution shifts within them. Leveraging the Independent Causal Mechanism principle, we show that intervention targets can be identified with probability converging exponentially in sample size under faithful Markov boundary recovery and non-degenerate mechanism shifts. Experiments on synthetic benchmarks and five real-world datasets demonstrate that StableRCA is robust to graph misspecification, effective under multiple intervention targets, scalable to large systems, and reliable across diverse application domains. Code is available at: https://anonymous.4open.science/r/StableRCA-E362

2606.05635 2026-06-05 cs.CV cs.MM

ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

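Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang, Xinran Qin, Teng Ma, Jiaqi Xu, Zhixin Wang, Zhikai Chen, Xuecheng Qi, Renjing Pei, Fan Li

Affiliations: Huawei Noah’s Ark Lab; Sun Yat-sen University

AI summary: Proposes the triple-shot composition task. A three-stage training pipeline (Chain-of-Thought fine-tuning, semi-supervised fine-tuning, and Group Relative Policy Optimization) generates establishing, medium, and close-up crops from a single human-centric image, each with a brief description, to support visual narration.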

Abstract

Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set -- establishing, medium, and close-up -- from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.

2606.05633 2026-06-05 cs.AI

Answer Presence Drives RAG Rewriting Gains

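Yuejie Li, Yueying Hua, Ke Yang, Li Zhang, Yueping He, Yueping He, Ruiqi Li, Bolin Chen, Tao Wang, Bowen Li, Chengjun Mao

Affiliations: Ant Group

AI summary: A controlled intervention audit finds that the gains rewriters bring to retrieval-augmented QA are driven mainly by the gold answer string appearing in the rewritten context, not by improved evidence quality.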

Abstract

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold into rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to -3.33 to -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
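
The four context edits described above can be illustrated with plain string operations. This sketch is not the released intervention runner: the function names are hypothetical, the placebo is implemented here as deleting a length-matched span that does not overlap the answer (one possible reading of the abstract), and sentence boundaries are approximated by ". ".

```python
import random


def remove_gold(context, gold):
    """Delete every occurrence of the gold answer string."""
    return context.replace(gold, "")


def remove_placebo(context, gold, rng):
    """Delete a random span of len(gold) characters that avoids the gold answer."""
    n = len(gold)
    blocked = set()
    start = context.find(gold)
    while start != -1:
        blocked.update(range(start, start + n))
        start = context.find(gold, start + 1)
    candidates = [i for i in range(len(context) - n + 1)
                  if not blocked.intersection(range(i, i + n))]
    if not candidates:
        return context  # no room for a non-overlapping span
    i = rng.choice(candidates)
    return context[:i] + context[i + n:]


def inject_prefix(context, gold):
    """Prepend the gold answer as its own sentence."""
    return gold + ". " + context


def inject_midpoint(context, gold):
    """Insert the gold answer at the sentence boundary nearest the midpoint."""
    boundaries = [i + 2 for i in range(len(context) - 1) if context[i:i + 2] == ". "]
    if not boundaries:
        return inject_prefix(context, gold)
    mid = min(boundaries, key=lambda b: abs(b - len(context) / 2))
    return context[:mid] + gold + ". " + context[mid:]


# Made-up example context and answer.
gold = "Jane Doe"
with_answer = "The film was directed by Jane Doe. It premiered in 1999. Critics praised it."
without_answer = "It premiered in 1999. Critics praised it."

removed = remove_gold(with_answer, gold)
placebo = remove_placebo(with_answer, gold, random.Random(0))
prefixed = inject_prefix(without_answer, gold)
midpoint = inject_midpoint(without_answer, gold)
```

Comparing reader F1 on `removed` against `placebo` separates the effect of losing the answer string from the effect of losing an arbitrary span of the same length.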

2606.05632 2026-06-05 cs.AI

Evaluation of LLMs for Mathematical Formalization in Lean

Tyson Klingner, Drew Bladek, Escher Crawford, Bohao Chen, Ariel Fu, Kaira Nair, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

Affiliations: University of California, Berkeley; University of Washington

AI summary: Compares several large language models at generating formal proofs in Lean 4, using pass@k and refine@k on subsets of miniF2F and miniCTX. Gemini 3.1 Pro and Claude Opus 4.7 perform best, while NVIDIA Nemotron 3 Super and GPT-OSS 120B are the most efficient once cost is considered.

Comments: 15 pages, 13 figures, 10 tables. Comments welcome!

Abstract

Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We compare the effectiveness of various LLMs at producing formal proofs in Lean 4, with the goal of assisting those seeking to use LLMs in their own projects. We use both pass@k and refine@k as comparison metrics and evaluate on subsets of the miniF2F and miniCTX datasets. Overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best: Gemini 3.1 Pro achieved a 92% success rate on miniF2F via refine@32, whereas Opus 4.7 achieved an 86% success rate on miniCTX via refine@32. When cost is taken into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs below $0.01 per correct proof.
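
As background for the pass@k metric mentioned above, here is a minimal sketch of the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), for n sampled proofs of which c are correct. Whether the authors compute pass@k exactly this way is an assumption; the abstract does not say, and refine@k is not covered here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given n total samples of which c passed the checker (k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical theorem: 32 sampled proofs, 3 of which type-check.
for k in (1, 8, 32):
    print(k, round(pass_at_k(32, 3, k), 3))
```

Averaging this per-problem value over the benchmark gives the reported pass@k; drawing more than k samples per problem reduces the variance of the estimate.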

2606.05626 2026-06-05 cs.CL cs.AI cs.LG

When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer

Zhen Sun, Yifan Liao, Zhicong Huang, Jiaheng Wei, Cheng Hong, Yutao Yue, Xinlei He

Affiliations: Wuhan University; Ant Group; The Hong Kong University of Science and Technology (Guangzhou); Institute of Deep Perception Technology, JITRI

AI summary: Lifelong machine-generated text attribution must keep adapting to new generators while retaining old knowledge. RidgeFT, a lightweight analytic update framework, uses covariance calibration and fixed random features to enable closed-form updates without exemplar replay.

Comments: 12 pages

Abstract

Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.
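
The idea of replay-free closed-form updates from class-level sufficient statistics can be sketched with ordinary ridge regression on one-hot targets: the weights are (G + lambda * I)^-1 S, where G is the running sum of feature outer products and S holds one per-class feature sum. This sketch omits the encoder, covariance calibration, and random features described in the abstract; the class name `RidgeStats`, the regularizer value, and the toy features are assumptions.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


class RidgeStats:
    """Keeps only a Gram matrix and per-class feature sums, never raw examples."""

    def __init__(self, dim, lam=1e-2):
        self.dim, self.lam = dim, lam
        self.gram = [[0.0] * dim for _ in range(dim)]
        self.class_sums = {}

    def add_class(self, label, features):
        """Fold a new class into the statistics; old data is not revisited."""
        total = [0.0] * self.dim
        for f in features:
            for i in range(self.dim):
                total[i] += f[i]
                for j in range(self.dim):
                    self.gram[i][j] += f[i] * f[j]
        self.class_sums[label] = total

    def weights(self):
        """Closed-form ridge solution, one weight vector per class."""
        a = [[self.gram[i][j] + (self.lam if i == j else 0.0)
              for j in range(self.dim)] for i in range(self.dim)]
        return {label: solve(a, s) for label, s in self.class_sums.items()}

    def predict(self, feature):
        w = self.weights()
        return max(w, key=lambda lab: sum(wi * fi for wi, fi in zip(w[lab], feature)))


# Toy frozen-encoder features for three hypothetical generators.
stats = RidgeStats(dim=3)
stats.add_class("A", [[1.0, 0.1, 0.0], [0.9, 0.0, 0.0]])
stats.add_class("B", [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
before = stats.predict([0.9, 0.1, 0.0])
stats.add_class("C", [[0.0, 0.1, 1.0], [0.0, 0.0, 0.9]])  # new generator arrives
after_old = stats.predict([0.9, 0.1, 0.0])
after_new = stats.predict([0.0, 0.05, 1.0])
```

Because the solution depends on the data only through the accumulated sums, adding a generator yields the same weights as retraining on all data at once, which is why no exemplar buffer is needed in this simplified setting.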

2606.05625 2026-06-05 cs.AI cs.LG

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

自承诺延迟:一种用于提示隐式劫持的无奖励探针

Bonan Shen, Youting Wang, Dingyan Shang, Tao Ning

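Institutions * Stanford University; Tsinghua University

AI summary: Proposes self-commitment latency, a metric that measures how early a reasoning context commits to the model's own final answer. It detects prompted implicit hacking without a reward signal and reaches AUROC 0.878-0.926 on GSM8K.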

详情
AI中文摘要

当语言模型的思维链看似良性时,隐式奖励劫持难以审计:最终答案可能被提示捷径锚定,而书面推理仍类似于普通问题求解。基于验证器的探针通过测量早期截断的推理上下文获得高奖励来暴露此类行为,但需要任务特定的奖励信号。本文提出一种弱输入替代方案——自承诺延迟,它测量提示推理上下文对模型自身最终答案的承诺时机。我们在受控配对GSM8K设置中使用Qwen2.5-3B-Instruct-4bit评估该探针,比较普通提示与包含答案提示的提示。与诚实上下文相比,包含提示的上下文显著更早且以更低不确定性做出承诺。主要延迟指标——阈值为0.8时的首次承诺延迟——达到AUROC 0.878;支持的全曲线摘要达到承诺范围AUROC 0.926和平均未承诺质量AUROC 0.904。当两种提示条件都正确回答时信号更强,且在不同阈值下保持稳定。这些结果表明,存在捷径的推理上下文会留下早期行为承诺特征,无需奖励模型、外部评判或训练分类器即可检测。

英文摘要

Implicit reward hacking is hard to audit when a language model's chain of thought appears benign: a final answer may be anchored by a prompt shortcut while the written reasoning still resembles ordinary problem solving. Verifier-based probes expose such behavior by measuring how early truncated reasoning contexts obtain high reward, but require a task-specific reward signal. This paper proposes a weaker-input alternative, self-commitment latency, which measures how early a prompted reasoning context commits to the model's own final answer. We evaluate the probe in a controlled paired GSM8K setting using Qwen2.5-3B-Instruct-4bit, comparing ordinary prompts with prompts that include an answer hint. Hinted contexts commit substantially earlier and with lower uncertainty than honest contexts. The primary latency metric, first-commitment latency at threshold 0.8, reaches AUROC 0.878; supporting whole-curve summaries reach AUROC 0.926 for commitment range and 0.904 for mean uncommitted mass. The signal is stronger when both prompt conditions answer correctly and remains stable across thresholds. These results show that shortcut-available reasoning contexts can leave an early behavioral commitment signature detectable without a reward model, external judge, or trained classifier.
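The first-commitment latency described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the commitment curves are made-up numbers, the normalization by trace length is an assumption, and `first_commitment_latency` and `auroc` are hypothetical helper names.

```python
from typing import Sequence


def first_commitment_latency(commit_probs: Sequence[float], threshold: float = 0.8) -> float:
    """Fraction of the trace consumed before the commitment probability first reaches threshold.

    commit_probs[i] is assumed to be the probability that the model, given the
    reasoning truncated at step i, produces its own eventual final answer.
    Returns 1.0 when the threshold is never reached.
    """
    n = len(commit_probs)
    if n == 0:
        raise ValueError("commit_probs must not be empty")
    for step, prob in enumerate(commit_probs):
        if prob >= threshold:
            return step / n
    return 1.0


def auroc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up commitment curves, for illustration only.
hinted_traces = [[0.9, 0.95, 1.0, 1.0], [0.5, 0.85, 0.9, 1.0]]
honest_traces = [[0.1, 0.2, 0.6, 0.9], [0.1, 0.1, 0.3, 0.7]]

# Earlier commitment should flag a hinted context, so the detector score is the negated latency.
hinted_scores = [-first_commitment_latency(t) for t in hinted_traces]
honest_scores = [-first_commitment_latency(t) for t in honest_traces]
print(auroc(hinted_scores, honest_scores))
```

The probe needs only the model's own answer distribution at each truncation point, which is why no reward model or trained classifier appears anywhere in the sketch.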

2606.05624 2026-06-05 cs.CV cs.GR

KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

KV-Control: 用于轨迹控制文本到运动的参数高效K/V注入

Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo, Dongjie Fu, Xiaohao Cai, Hansung Kim

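Institutions * University of Science and Technology of China; Tsinghua University

AI summary: Proposes KV-Control, a compact attention-side control interface that injects key/value memories through a part-tokenized motion substrate and a trajectory encoder, achieving precise trajectory control without overwriting the pretrained text-conditioned motion prior.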

详情
AI中文摘要

文本条件3D人体运动模型现在可以从提示中合成合理的运动,但实际动画和具身代理工作流程很少止步于文本:角色可能需要遵循草绘的根路径,达到末端执行器目标,或满足多关节轨迹,同时保持语言描述的步态、风格和意图。这暴露了一个控制权衡。轨迹控制器应该精确而不覆盖预训练的文本条件运动先验,但现有解决方案要么复制生成器的大部分以重新获得每层控制访问,要么将大部分成本转移到测试时优化。我们引入KV-Control,一种用于冻结掩码文本到运动变换器的紧凑注意力侧控制接口。关键思想是将几何约束作为自注意力中的记忆提供,而不是通过全局姿态标记注入或仅在输出侧强制执行。为了支持该接口,我们共同设计了部分标记化的运动基元和控制器:PartVQ学习解剖对齐的部分码本,T-Concat将每个帧-部分标记暴露为注意力可寻址站点,KV-Control在每个自注意力层注入控制条件的键/值记忆,同时保留预训练的查询流、文本交叉注意力、FFN和所有骨干权重。生成的适配器仅在共享轨迹编码器之上添加可训练的注入参数,但在继承的细化协议下以亚厘米精度跟踪根和多关节约束,同时保留文本条件的运动质量。KV-Control将轨迹条件重新定义为轻量级记忆检索,为文本到运动生成提供了一个小型、精确且透明的控制接口。

英文摘要

Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.

2606.05622 2026-06-05 cs.CL

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

AdaPlanBench: 在世界约束和用户约束下评估大语言模型智能体的自适应规划能力

Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji

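Institutions * University of Illinois Urbana-Champaign

AI summary: Existing benchmarks underexplore adaptive planning under progressively revealed dual constraints. This work proposes AdaPlanBench, a dynamic interactive benchmark built on 307 household tasks and a scalable constraint construction pipeline, to evaluate whether LLM agents can iteratively revise their plans from feedback during interaction.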

详情
AI中文摘要

语言模型对现实世界问题进行规划时,通常涉及世界约束和用户约束,这些约束可能不会事先完全明确,而是通过交互逐步披露。然而,现有基准仍未充分探索在这种逐步揭示的双重约束下的自适应规划。为填补这一空白,我们引入了AdaPlanBench,这是一个动态交互基准,用于评估大语言模型(LLM)智能体是否能够在逐步揭示的世界约束和用户约束下自适应地规划和重新规划。AdaPlanBench基于307个家务任务构建,并配备了一个可扩展的约束构建流程,为每个任务增加双重约束。在运行时,智能体通过多轮协议与环境交互,其中隐藏的约束仅在智能体提出违反它们的计划时才会被揭示,从而需要在累积反馈下迭代修订计划。这使得规划具有挑战性,因为智能体必须从反馈中推断并跟踪约束,同时有效地重新规划。在十个领先的LLM上的实验表明,在双重约束下的自适应规划仍然具有挑战性,最佳模型仅达到67.75%的准确率。我们进一步观察到,随着约束的累积,性能会下降,其中用户约束尤其构成巨大挑战,而失败通常源于较弱的物理基础知识和降低的有效性。这些结果将AdaPlanBench确立为双重约束交互规划的测试平台,并凸显了LLM智能体可靠适应动态揭示约束的挑战。

英文摘要

Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
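The reveal-on-violation protocol can be pictured with a toy loop. This is only a sketch under simplifying assumptions: constraints are reduced to banned action names, the "agent" just picks the first candidate plan that avoids everything revealed so far, and `run_episode` and all action names are hypothetical, not part of the benchmark.

```python
from typing import List, Optional, Set, Tuple

Plan = Tuple[str, ...]


def run_episode(hidden_banned: Set[str], candidates: List[Plan],
                max_turns: int = 5) -> Tuple[Optional[Plan], int]:
    """Propose plans until one violates no hidden constraint or the turn budget runs out.

    A hidden constraint is revealed only when a proposed plan violates it,
    so the agent accumulates knowledge turn by turn.
    """
    known: Set[str] = set()
    for turn in range(1, max_turns + 1):
        # Toy agent: first candidate that respects every constraint revealed so far.
        plan = next((p for p in candidates if not known & set(p)), None)
        if plan is None:
            return None, turn - 1  # no admissible candidate left
        violated = hidden_banned & set(plan)
        if not violated:
            return plan, turn
        known |= violated  # feedback reveals only the violated constraints
    return None, max_turns


# One world constraint (broken microwave) and one user constraint (no dairy), both hidden.
hidden = {"use_microwave", "add_milk"}
plans: List[Plan] = [
    ("use_microwave", "add_water"),
    ("use_stove", "add_milk"),
    ("use_stove", "add_water"),
]
final_plan, turns_used = run_episode(hidden, plans)
print(final_plan, turns_used)
```

Each failed proposal reveals one more constraint here, which mirrors why performance is reported to degrade as constraints accumulate: the agent must track everything revealed so far while re-planning.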

2606.05616 2026-06-05 cs.CL

What's in a Name? Morphological Shortcuts by LLMs in Pharmacology

名字里有什么?LLM在药理学中的形态捷径

Kaijie Mo, Thomas Yang, Chantal Shaib, Qing Yao, William Rudman, Ramez Kouzy, Kanishka Misra, Byron C. Wallace, Junyi Jessy Li

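Institutions * The University of Texas at Austin; Northeastern University; MD Anderson Cancer Center

AI summary: Studies morphological shortcuts in pharmacology, where LLMs reason from affix cues, and uses fictitious drug-name experiments and an attribution framework to reveal the mechanism and its safety risks.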

详情
Comments
22 pages
AI中文摘要

单词的形态常常能为其含义提供线索,但纯粹依赖这些映射在高风险领域可能导致过度泛化。例如,在医学领域,LLM可以仅凭词缀(如wugcillin)自信地推理虚构药物,并生成看似合理的临床内容。我们提出了LLM在药理学中“词缀启发式”的行为和机制研究。使用由真实词缀构建的虚构药物名称,我们表明仅词缀信号就能引发类别水平的药理反应。我们引入了一个框架,用于识别模型的药物语义主要受词缀、词干还是整个药物名称驱动。应用于653种药物,我们的框架揭示模型通常主要通过词缀线索诱导药物含义,但很少明确表明这种依赖,有时还会错误地将词缀共享药物的属性混淆。跨模型的激活修补进一步将这种行为定位到早期到中期层。这些发现表明,形态捷径对安全性构成了微妙但可衡量的风险。

英文摘要

The morphological form of a word can often give cues to its meaning, but purely relying on these mappings can lead to overgeneralization in high-stakes domains. In the medical domain, for instance, LLMs can confidently reason about fictitious drugs from their affixes alone (e.g., wugcillin) and generate plausible-looking clinical content. We present a behavioral and mechanistic study of LLM "affix heuristics" in pharmacology. Using fictitious drug names built from real affixes, we show that affix signals alone elicit class-level pharmacological responses. We introduce a framework for identifying whether a model's drug semantics are driven mainly by the affix, the stem, or the drug name as a whole. Applied across 653 drugs, our framework reveals that models often induce drug meaning primarily through affix cues, yet rarely explicitly indicate this reliance, and sometimes incorrectly conflate properties among affix-sharing drugs. Activation patching across models further localizes this behavior to early-mid layers. These findings show that morphological shortcuts pose a subtle but measurable risk to safety.

2606.05614 2026-06-05 cs.AI

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

安全悖论:增强的安全意识如何使LLM易受后验攻击

Long P. Hoang, Hai V. Le, Shaoyang Xu, Wei Lu, Wenxuan Zhang

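Institutions * Singapore University of Technology and Design; Nanyang Technological University

AI summary: Shows that safety-aligned LLMs are vulnerable to Posterior Attack because of their internal safety-evaluation ability. Experiments and theoretical analysis indicate that stronger safety judgment is easier to exploit, and causal interventions verify the link.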

详情
AI中文摘要

大型语言模型(LLM)经过严格对齐以拒绝有害请求,这一过程内在培养了评估和识别不安全内容的潜在能力。在这项工作中,我们揭示了这种高级安全意识无意中引入了一个致命漏洞。我们提出了后验攻击(Posterior Attack),一种单次查询的越狱方法,通过提示模型生成其内部分类器通常会标记为不安全的精确有害响应来绕过防护栏。通过对30个开源LLM(参数规模高达35B)和前沿模型(如GPT-5、Claude 4.6)的广泛实证评估,我们观察到一个显著现象:具有更优安全判断能力的模型更容易受到这种利用。为了解释这一点,我们形式化了安全悖论(Safety Paradox),分析表明安全对齐的单调改进自然放大了后验漏洞。最后,我们通过强化学习干预建立了因果联系,示例说明人为降低模型的安全判断能力可使其免疫攻击,而增强判断则会加剧漏洞。我们的发现揭示了当前对齐范式中的潜在缺陷,表明防御机制可能需要进一步的结构性改进。

英文摘要

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.

2606.05613 2026-06-05 cs.AI

Multilingual Fine-Tuning via Localized Gradient Conflict Resolution

通过局部梯度冲突解决的多语言微调

Long P. Hoang, Yiran Zhao, Wei Lu, Wenxuan Zhang

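Institutions * Singapore University of Technology and Design; Salesforce AI Research; Nanyang Technological University

AI summary: Proposes Bucket-Level MOO, a framework that recasts multilingual fine-tuning as a multi-objective optimization problem and improves multilingual performance through localized gradient conflict resolution.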

详情
AI中文摘要

大型语言模型(LLMs)的快速发展已将跨语言多功能性确立为现代系统的定义特征。然而,微调这些模型经常引发跨语言的负面干扰。为了解决这个问题,我们将多语言微调重构为多目标优化(MOO)问题。具体来说,我们引入了Bucket-Level MOO,一个可扩展的分布式框架,它在参数桶上局部应用基于梯度的MOO算法。这使得冲突感知更新成为可能,而无需重建完整梯度向量的高昂通信开销。理论上,我们证明了这种局部解决自然地强制执行精炼帕累托平稳性,这是帕累托最优性的一个严格更紧的必要条件。实验上,Bucket-Level MOO通过驱动LLMs构建特定的语言维度来减轻干扰,提高了表示的可分离性。在四个基础LLM上的广泛实验表明,我们的方法在标准微调范式上显著提高了所见和未见的多语言性能。

英文摘要

The rapid evolution of Large Language Models (LLMs) has established cross-lingual versatility as a defining feature of modern systems. However, fine-tuning these models frequently induces negative interference across languages. To address this, we reformulate multilingual fine-tuning as a multi-objective optimization (MOO) problem. Specifically, we introduce Bucket-Level MOO, a scalable distributed framework that applies gradient-based MOO algorithms locally on parameter buckets. This enables conflict-aware updates without the prohibitive communication overhead of reconstructing full gradient vectors. Theoretically, we prove this localized resolution natively enforces Refined Pareto Stationarity, a strictly tighter necessary condition for Pareto optimality. Empirically, Bucket-Level MOO mitigates interference by driving LLMs to construct distinct language-specific dimensions, improving representational separability. Extensive experiments across four base LLMs demonstrate that our method significantly improves both seen and unseen multilingual performance over standard fine-tuning paradigms.
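To make "conflict resolution applied locally on parameter buckets" concrete, the sketch below resolves per-language gradients one bucket at a time. The abstract does not say which gradient-based MOO algorithm is used, so a PCGrad-style projection is swapped in here purely as a stand-in; the function names and the tiny gradients are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def resolve_bucket(task_grads: Sequence[Sequence[float]]) -> Vector:
    """PCGrad-style stand-in: remove from each task gradient its component
    along any conflicting gradient (negative dot product), then average."""
    projected = []
    for i, grad in enumerate(task_grads):
        g = list(grad)
        for j, other in enumerate(task_grads):
            if i == j:
                continue
            d = dot(g, other)
            if d < 0:  # conflict: project g onto the normal plane of `other`
                scale = d / dot(other, other)
                g = [a - scale * b for a, b in zip(g, other)]
        projected.append(g)
    n = len(task_grads)
    return [sum(column) / n for column in zip(*projected)]


def bucket_level_update(per_task_buckets: Sequence[Sequence[Sequence[float]]]) -> List[Vector]:
    """per_task_buckets[t][b] is task t's gradient for bucket b.
    Each bucket is resolved independently, so no full gradient vector is ever assembled."""
    num_buckets = len(per_task_buckets[0])
    return [resolve_bucket([task[b] for task in per_task_buckets]) for b in range(num_buckets)]


# Two languages, two buckets: bucket 0 conflicts, bucket 1 does not.
grads_lang_a = [[1.0, 0.0], [1.0, 1.0]]
grads_lang_b = [[-1.0, 1.0], [2.0, 0.0]]
updates = bucket_level_update([grads_lang_a, grads_lang_b])
print(updates)
```

Because every bucket is handled on its own, the resolution step only ever touches one bucket's worth of gradients at a time, which is the property the abstract credits for avoiding the communication cost of full-gradient reconstruction.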

2606.05611 2026-06-05 cs.CV

What's Under the Skin? Estimating Swine Body Condition

皮肤之下是什么?估算猪体况

Mk Bashar, Kuljit Bhatti, Gary Rohrer, Madonna Benjamin, Tami Brown-Brandl, Daniel Morris

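Institutions * arXiv.org cs.CV (Computer Vision)

AI summary: Proposes PigFormer, a system that predicts the subcutaneous backfat thickness, loin muscle depth, and total tissue thickness of pigs from RGB-D depth images through a two-stage pipeline (a geometric front-end and a Slice Attention Encoder), enabling non-contact body condition monitoring.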

详情
AI中文摘要

母猪体况是养殖者的重要指标,因为它对泌乳性能和仔猪存活率有很大影响。然而,生产中使用的体况测量方法(如视觉评分和卡尺)与底层组织成分的相关性较差。超声波扫描可以直接测量皮下背膘厚度和腰肌深度,但操作劳动密集且无法规模化生产。我们提出了PigFormer,一个端到端的两阶段系统,它从天花板安装的RGB-D相机获取原始深度帧,并预测最后肋骨处的皮下背膘厚度、腰肌深度和总组织厚度。第一阶段是几何前端,通过SAM3-to-MaskDINO分割蒸馏、地平面去除和方向归一化将原始深度转换为标准化高度图。第二阶段是切片注意力编码器,将每个高度图视为一系列横截面切片,并捕捉沿整个背侧表面的空间关系。在两个设施的多站点数据集(319头母猪和小母猪实例)上,PigFormer实现了2.43毫米的背膘平均绝对误差和3.87毫米的整体平均绝对误差。它优于强大的单阶段ResNet-18和ViT-small基线。PigFormer为商业养猪生产中实现连续、自动化、非接触式体况监测提供了一条实用途径。代码可在https://github.com/iambashar/Pigformer获取。

英文摘要

Sow body condition is an important indicator for growers as it has a large impact on lactation performance and piglet survival. However, body condition measures used during production, such as visual scoring and calipers, correlate poorly with underlying tissue composition. Ultrasound scans can provide direct measurements of subcutaneous backfat thickness and loin muscle depth, but their operation is labor intensive and not scalable for production. We present PigFormer, an end-to-end two-stage system that takes raw depth frames from a ceiling-mounted RGB-D camera and predicts subcutaneous backfat thickness, loin muscle depth, and total tissue thickness at the last rib. Stage 1 is a geometric front-end that converts raw depth into a standardized height map via SAM3-to-MaskDINO segmentation distillation, ground-plane removal, and orientation normalization. Stage 2 is a Slice Attention Encoder that treats each height map as a sequence of cross-sectional slices and captures spatial relationships along the full dorsal surface. On a multi-site dataset of 319 sow and gilt instances from two facilities, PigFormer achieves 2.43 mm backfat MAE and 3.87 mm overall MAE. It outperforms strong single-stage ResNet-18 and ViT-small baselines. PigFormer offers a practical path toward continuous, automated, non-contact body condition monitoring in commercial swine production. Code is available at https://github.com/iambashar/Pigformer.

2606.05610 2026-06-05 cs.CL

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

LLM持续预训练中最优超参数的可预测缩放定律

Yongwei Zhou, Juncheng Diao, Junlin Shang, Peiguang Li, Rongxiang Weng

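Institutions * MeiTuan; University of Chinese Academy of Sciences; Harbin Institute of Technology

AI summary: Finds that optimal hyperparameters such as learning rate and batch size follow stable, predictable scaling laws during continued pre-training, and proposes a two-stage framework that uses small-scale proxy models and state-aware prediction to cut hyperparameter search overhead by up to 90% with comparable or better performance.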

详情
AI中文摘要

大型语言模型(LLM)持续预训练的效果取决于超参数配置,如学习率和批大小。然而,当前实践通常依赖启发式方法或网格搜索,导致训练不稳定和成本过高。在这项工作中,我们首先通过实验发现,在整个持续预训练过程中,最优超参数遵循稳定且可预测的缩放定律。利用这些见解,我们提出了一个新框架,用于建立给定检查点的计算预算与最优超参数之间的定量关系。我们的方法分为两个阶段:(1)经验定律发现,其中我们训练小规模代理模型,通过标准的损失-计算缩放定律推导出将计算预算映射到最优超参数的函数;(2)状态感知超参数预测,其中我们评估初始检查点的验证损失,并使用逆缩放定律估计其等效预训练计算量——即从零开始达到相同损失所需的计算量。结合计划的计算预算,我们预测目标运行的最优超参数。实验结果表明,我们的方法将超参数搜索开销降低了高达90%,同时实现了与基线相当或更优的性能。这个与模型无关的框架可跨架构推广,为从任意给定点开始的多样化持续预训练场景提供了一种原则性且高效的方法。

英文摘要

The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) Empirical Law Discovery, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) State-Aware Hyperparameter Prediction, where we evaluate an initial checkpoint's validation loss and use the inverse scaling law to estimate its equivalent pre-training compute, that is, the compute needed to achieve the same loss from scratch. Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.
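The "equivalent pre-training compute" step amounts to inverting a loss-compute law. The sketch below assumes a power law with an irreducible term, L(C) = L_inf + a * C^(-b), and a power-law map from compute to the optimal hyperparameter; both functional forms, every coefficient, and the choice to add the two compute quantities are illustrative assumptions, since the abstract does not state them.

```python
def loss_from_compute(compute: float, a: float, b: float, irreducible: float) -> float:
    """Assumed loss-compute law: L(C) = irreducible + a * C**(-b)."""
    return irreducible + a * compute ** (-b)


def equivalent_compute(val_loss: float, a: float, b: float, irreducible: float) -> float:
    """Invert the assumed law: compute needed to reach val_loss when training from scratch."""
    if val_loss <= irreducible:
        raise ValueError("val_loss must exceed the irreducible loss")
    return (a / (val_loss - irreducible)) ** (1.0 / b)


def predict_hyperparameter(total_compute: float, coeff: float, exponent: float) -> float:
    """Assumed hyperparameter law: h*(C) = coeff * C**exponent."""
    return coeff * total_compute ** exponent


# Hypothetical coefficients, as if fitted on small proxy models.
A, B, L_INF = 10.0, 0.1, 1.5
checkpoint_loss = loss_from_compute(1e18, A, B, L_INF)  # stands in for a measured validation loss
c_eq = equivalent_compute(checkpoint_loss, A, B, L_INF)

planned_budget = 2e17
# Assumption: the checkpoint's equivalent compute and the planned budget simply add.
learning_rate = predict_hyperparameter(c_eq + planned_budget, coeff=3.0, exponent=-0.2)
print(c_eq, learning_rate)
```

Working in equivalent compute lets a checkpoint of unknown provenance be placed on the same axis as the proxy runs, which is what makes the fitted law reusable for runs "starting from any given point".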

2606.05606 2026-06-05 cs.LG cs.AI math.OC

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

跨时代自适应展开优化用于强化学习后训练

Yiming Zong, Yige Wang, Jiashuo Jiang

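Institutions * Department of Industrial Engineering & Decision Analytics, Hong Kong University of Science and Technology

AI summary: Because prompts differ widely in the training signal they provide, proposes CERO, which uses Bayesian estimates of each prompt's success probability and a Fenchel-dual optimization to allocate rollout budgets adaptively, improving sample efficiency under a fixed total budget.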

详情
AI中文摘要

LLM后训练通常依赖于对每个提示采样多次展开的强化学习方法,但大多数现有方法对每个提示使用固定的展开预算,尽管不同提示提供的训练信号差异很大。本文研究在固定全局预算下的自适应展开分配,并将问题形式化为具有提示级递减收益的在线资源分配。我们的方法CERO维护每个提示成功概率的Beta后验分布,并使用后验期望伯努利方差作为额外展开价值的贝叶斯估计。我们利用该估计构建累积分配上的凹饱和效用函数,得到一个目标函数,其中跨提示和跨时代的决策通过全局预算耦合。由于所得目标在时间上不可分离,我们推导出Fenchel对偶重写,并通过投影在线梯度下降更新提示级和预算级对偶变量。在固定提示效用下,我们证明相对于离线分配基准的$O(\sqrt{K})$遗憾界。在数学推理问题上的实验表明,CERO在多个开源LLM和基准上持续优于GRPO,证明自适应展开预算可以提高样本效率。

英文摘要

LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
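The value estimate at the core of the method has a closed form: for p ~ Beta(a, b), the posterior expected Bernoulli variance is E[p(1-p)] = ab / ((a+b)(a+b+1)). The sketch below shows only that quantity; the uniform Beta(1, 1) prior and the class name are assumptions, and the Fenchel-dual allocation itself is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class PromptPosterior:
    """Beta posterior over one prompt's success probability."""
    alpha: float = 1.0  # prior pseudo-count of successes (uniform prior assumed)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, successes: int, failures: int) -> None:
        """Conjugate update from observed rollout outcomes."""
        self.alpha += successes
        self.beta += failures

    def expected_bernoulli_variance(self) -> float:
        """E[p(1-p)] under Beta(alpha, beta) = alpha*beta / ((alpha+beta)(alpha+beta+1))."""
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total * (total + 1.0))


mixed = PromptPosterior()
mixed.update(successes=4, failures=4)   # prompt solved about half the time

solved = PromptPosterior()
solved.update(successes=8, failures=0)  # prompt already solved every time

print(mixed.expected_bernoulli_variance(), solved.expected_bernoulli_variance())
```

A prompt with mixed outcomes scores higher than one the model always solves, which matches the intuition that extra rollouts are worth more where outcomes are still uncertain.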

2606.05605 2026-06-05 cs.LG cs.NE

From Prediction to Self: Developmental Conditions for Agency in Minimal Neural Systems

从预测到自我:最小神经系统中能动性的发展条件

Evan Ye

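Institutions * Independent Researcher

AI summary: Through 40 incrementally extended experiments, studies how a minimal GRU system comes to distinguish self-caused from world-caused influence, identifies four developmental conditions that must be met in strict order, and proposes agency gain as a metric.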

详情
Comments
18 pages, 6 figures
AI中文摘要

一个仅仅预测世界的系统如何区分自身的因果影响与其他一切?我们在一个最小192维GRU中通过40个受控实验(按发展序列排列,一次添加一个组件)追踪这一转变,并跟踪系统是否能区分自我引起的变化与世界引起的变化。发展路径揭示了必须严格按顺序满足的四个条件:(1)形成稳定吸引子的持久状态,(2)连接输出到输入的因果动作循环,(3)使隐式因果知识显式的本体感觉反馈,以及(4)异步觉醒——感知学习必须在动作学习开始之前巩固。我们提出能动性增益(A = Err_world - Err_self),即了解自身动作的预测优势,作为跟踪这一过程的度量。自我感知预测器在周期性(正弦)和混沌(洛伦兹)环境中始终优于自我盲预测器,并且该度量在移除所有辅助组件后仍然有效。只有前向采样的动作选择产生有意义的能动性增益;两种基于梯度的替代方案退化。同样重要的是12个被证伪的假设,它们映射了发展停滞的地方:仅靠预测编码不会产生自我表征。

英文摘要

How does a system that merely predicts the world come to distinguish its own causal influence from everything else? We trace this transition in a minimal 192-dimensional GRU through 40 controlled experiments arranged as a developmental sequence, adding components one at a time and tracking whether the system can distinguish self-caused from world-caused changes. The developmental path reveals four conditions that must be satisfied in strict order: (1) persistent state forming stable attractors, (2) a causal action loop linking output to input, (3) proprioceptive feedback that makes implicit causal knowledge explicit, and (4) asynchronous awakening: perceptual learning must consolidate before action learning begins. We propose agency gain (A = Err_world - Err_self), the predictive advantage of knowing one's own action, as a metric to track this process. The self-aware predictor consistently outperforms the self-blind predictor across periodic (sinusoidal) and chaotic (Lorenz) environments, and the metric survives ablation of all auxiliary components. Only forward-sampled action selection produces meaningful agency gain; two gradient-based alternatives degenerate. Equally significant are 12 falsified hypotheses mapping where development stalls: predictive coding alone does not produce self-representation.
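The agency gain A = Err_world - Err_self can be computed directly from two prediction traces. In this sketch Err_world is read as the error of a predictor that does not see the system's own action (self-blind) and Err_self as the error of one that does (self-aware); that reading, the use of mean squared error, and the numbers are assumptions made for illustration.

```python
from typing import Sequence


def mean_squared_error(predictions: Sequence[float], targets: Sequence[float]) -> float:
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def agency_gain(targets: Sequence[float],
                self_blind_predictions: Sequence[float],
                self_aware_predictions: Sequence[float]) -> float:
    """A = Err_world - Err_self; positive when knowing one's own action helps prediction."""
    err_world = mean_squared_error(self_blind_predictions, targets)
    err_self = mean_squared_error(self_aware_predictions, targets)
    return err_world - err_self


# Made-up observations and predictions.
observed = [1.0, 2.0, 3.0]
blind = [0.0, 2.0, 4.0]   # predictor without access to the action
aware = [1.0, 2.0, 3.5]   # predictor conditioned on the action
print(agency_gain(observed, blind, aware))
```

A value near zero would mean the action carries no predictive information, which is the degenerate case the abstract reports for the gradient-based action-selection variants.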

2606.05602 2026-06-05 cs.AI cs.HC cs.LG

Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization

修正思维,而非动作:通过知识缺口定位实现可解释的AI辅助

Ayano Hiranaka, Ya-Chuan Hsu, Stefanos Nikolaidis, Erdem Bıyık, Daniel Seita

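Institutions * University of Tokyo; National Institute of Information and Communications Technology

AI summary: Proposes SENSEI, a framework that infers user misconceptions over a structured knowledge representation and offers targeted suggestions. It shows zero-shot compositional generalization on long-horizon tasks and corrects 90% of student misconceptions.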

详情
Comments
Accepted to International Conference on Machine Learning (ICML) 2026
AI中文摘要

在人机协作中,AI助手通常通过行为反馈(例如辅助驾驶中的警报或方向盘提示)来纠正次优的人类行为。此类干预可以缓解即时错误,但长期改进需要解决导致重复错误的潜在误解。我们引入了SENSEI,一个从交互行为推断用户误解并提供针对性、最小但充分建议的框架。我们的方法通过操作结构化知识表示来定位和纠正错误行为的根源,从而脱离动作或轨迹层面的干预。在具有不同误解和相应行为的三个长时任务中,SENSEI展示了零样本组合泛化能力,尽管仅针对单一误解案例进行训练,却能解开多个重叠的误解。一项用户研究进一步表明,我们的方法能够识别真实的人类误解,并提供有效的指导,从而提高长时任务表现,成功纠正了90%的学生误解。代码和项目页面见https://misoshiruseijin.github.io/SENSEI/。

英文摘要

AI assistants in human-AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering-wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long-term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action- or trajectory-level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long-horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero-shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single-misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long-horizon task performance, successfully correcting 90% of student misconceptions. Code and project page are available at https://misoshiruseijin.github.io/SENSEI/.

2606.05599 2026-06-05 cs.LG math.ST stat.ME stat.ML stat.TH

Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations

通过平滑激活函数缓解深度神经网络一致收敛中的维度灾难

Yizhe Ding, Runze Li, Jia Liu, Lingzhou Xue

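Institutions * Department of Statistics, The Pennsylvania State University

AI summary: By analyzing smoothly activated deep neural networks, establishes a theoretical framework for uniform convergence and proves that these networks mitigate the curse of dimensionality by adaptively exploiting the low-dimensional hierarchical composition structure of the target function.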

详情
Comments
30 pages, 5 figures
AI中文摘要

本文为平滑激活深度神经网络(DNN)估计量的一致收敛建立了理论框架。虽然标准ReLU网络在各种非参数回归任务中,在$L^2(P)$范数下达到了极小化最优速率,但我们建立了一个理论下界,表明最小二乘ReLU估计量在其一致收敛行为中可能遭受维度灾难。受下游任务中对最坏情况可靠性的需求驱动,我们通过分析平滑激活DNN(平滑DNN),包括前馈和残差结构,来解决这一局限性。我们为这些模型的逼近器建立了新的伪维数界、非渐近逼近保证和Hölder范数界。利用这些结果,我们推导了平滑DNN估计量在多种统计上下文(包括Huber回归、最小二乘回归、分位数回归和逻辑回归)中的非渐近一致收敛速率。我们证明,平滑DNN可以通过自适应利用目标函数的低维层次组合结构来缓解一致收敛中的维度灾难。通过模拟研究和实际应用的支持,我们的结果将平滑DNN定位为在需要一致保证的统计学习任务中,理论上合理且实践上可行的ReLU网络替代方案。

英文摘要

This paper establishes a theoretical framework for the uniform convergence of smoothly activated deep neural network (DNN) estimators. While standard ReLU networks achieve minimax-optimal rates in the $L^2(P)$ norm for various nonparametric regression tasks, we establish a theoretical lower bound demonstrating that least-squares ReLU estimators can suffer from the curse of dimensionality in their uniform convergence behavior. Motivated by the need for reliable uniform guarantees in downstream tasks requiring worst-case reliability, we address this limitation by analyzing smoothly activated DNNs (smooth DNNs), encompassing both feedforward and residual structures. We establish novel pseudo-dimension bounds, non-asymptotic approximation guarantees, and Hölder-norm bounds for the approximators of these models. Leveraging these results, we derive non-asymptotic uniform convergence rates for smooth DNN estimators across multiple statistical contexts, including Huber, least-squares, quantile, and logistic regression. We prove that smooth DNNs can mitigate the curse of dimensionality in uniform convergence by adaptively exploiting the low-dimensional hierarchical composition structure of the target function. Supported by both simulation studies and a real-world application, our results position smooth DNNs as a theoretically grounded and practically viable alternative to ReLU networks for statistical learning tasks requiring uniform guarantees.

2606.05588 2026-06-05 cs.RO cs.LG

Auditing Demonstration Curation Metrics: Action-Only Scorers Fail on the Structural Defects That Degrade Imitation Policies

审计示范策展指标:仅动作评分器在降低模仿策略的结构缺陷上失败

Aarav Bedi

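Institutions * Aarav Bedi

AI summary: Builds a controlled testbed that injects two classes of demonstration defects (subtle perturbations and structural errors) and audits seven curation metrics. Action-only metrics fail to detect structural errors and some score them in inverted order, while state-trajectory metrics detect them in part but recover little downstream performance.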

详情
Comments
5 pages, 3 figures, 4 tables
AI中文摘要

模仿学习策略继承了其训练示范的质量,越来越多的策展指标声称能自动评分和过滤低质量示范。这些指标各自在不同协议的不同数据上验证,因此不清楚哪些指标真正识别出损害策略的示范。我们构建了一个受控测试平台,其中示范缺陷以已知类型注入,并沿两个轴审计七种策展指标:每个指标区分缺陷示范与清洁示范的效果,以及基于每个指标策展的子集训练行为克隆策略是否提高任务成功率。我们研究两种缺陷机制。细微扰动(相关动作噪声、震颤、截断)可通过多变量离群值评分检测,一旦移除,可恢复全部下游差距。结构错误,即示范在关键时刻执行错误动作,对我们测试的每个仅动作指标都是不可见的,其中两个指标是倒置的:它们将缺陷示范评分为更高质量,并用于策展时,往往使策略处于或低于未策展基线,而非高于基线。只有检查状态轨迹的指标能检测结构错误,即使最好的指标也只能恢复三分之一的下游差距。高检测准确性并不保证下游改进。我们发布了测试平台和所有策展实现。

英文摘要

Imitation-learning policies inherit the quality of the demonstrations they are trained on, and a growing set of curation metrics promise to score and filter low-quality demonstrations automatically. These metrics are each validated on different data with different protocols, so it is unclear which of them actually identify the demonstrations that harm a policy. We build a controlled testbed in which demonstration defects are injected with known type, and audit seven curation metrics along two axes: how well each separates defective from clean demonstrations, and whether training a behavior-cloning policy on each metric's curated subset improves task success. We study two defect regimes. Subtle perturbations (correlated action noise, tremor, truncation) are detectable by multivariate outlier scoring and, once removed, recover the full downstream gap. Structural errors, where the demonstration executes a wrong action at a key moment, are invisible to every action-only metric we test, and two of them are inverted: they score defective demonstrations as higher quality and, used for curation, tend to leave the policy at or below the uncurated baseline rather than above it. Only metrics that examine the state trajectory detect structural errors, and even the best of them recovers just a third of the downstream gap. High detection accuracy does not guarantee downstream improvement. We release the testbed and all curation implementations.
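Statements such as "recovers the full downstream gap" or "just a third of the downstream gap" suggest a normalized recovery fraction. The sketch below shows one plausible definition, with the gap measured between a policy trained on uncurated data and one trained on clean data; this definition and the success rates are assumptions, not figures from the paper.

```python
def gap_recovered(uncurated_success: float, curated_success: float, clean_success: float) -> float:
    """Fraction of the (clean - uncurated) success-rate gap closed by curation.

    1.0 means curation matches training on clean data, 0.0 means no benefit,
    and a negative value means curation left the policy below the uncurated baseline.
    """
    gap = clean_success - uncurated_success
    if gap <= 0:
        raise ValueError("clean_success must exceed uncurated_success")
    return (curated_success - uncurated_success) / gap


# Made-up task success rates.
print(gap_recovered(uncurated_success=0.40, curated_success=0.50, clean_success=0.70))
print(gap_recovered(uncurated_success=0.40, curated_success=0.35, clean_success=0.70))
```

Reporting this fraction next to detection accuracy makes the abstract's last point visible: a metric can separate defective from clean demonstrations well and still recover little of the gap.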

2606.05587 2026-06-05 cs.CV cs.AI cs.LG

HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery

HDST-GNN:用于无人机航拍图像多目标跟踪的异质动态时空图神经网络

Phillip Jiang

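Institutions * Phillip Jiang

AI summary: To reduce identity switches caused by small, dense, and occluded objects in UAV imagery, proposes HDST-GNN, a heterogeneous dynamic spatiotemporal graph neural network that improves tracking through altitude-adaptive edge construction, heterogeneous node representation, and occlusion-gated temporal aggregation.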

详情
Comments
18 pages, 4 figures, 6 tables
AI中文摘要

无人机航拍图像的多目标跟踪(MOT)面临独特挑战:序列间高度变化、目标小而密集、频繁遮挡导致身份切换。现有基于图的跟踪器假设固定空间上下文并统一处理所有目标,忽略了检测、活跃轨迹和丢失目标等异质生命周期状态。我们提出HDST-GNN,一种异质动态时空图神经网络,包含三项创新。首先,高度自适应边构建根据平均目标面积估计相机高度代理,并相应调整图连接半径。其次,异质节点表示将检测(D型)、确认轨迹(T型)和丢失轨迹(L型)建模为不同节点类型,具有专用投影和类型化边关系。第三,遮挡门控时序聚合根据每个节点的遮挡置信度门控其注意力贡献,防止被遮挡节点破坏邻居嵌入。HDST-GNN使用可微Sinkhorn头部,结合交叉熵和三元组损失进行端到端训练。在VisDrone2019-MOT上使用oracle检测时,HDST-GNN达到94.51% MOTA和97.24% IDF1,比SORT高出+5.0 MOTA点,身份切换减少81%。使用真实YOLOv8n检测时,HDST-GNN相比SORT身份切换减少49%。消融研究证实了每个组件的独立贡献。

英文摘要

Multi-object tracking (MOT) from UAV imagery presents unique challenges: altitude varies across sequences, objects are small and densely packed, and frequent occlusion causes identity switches. Existing graph-based trackers assume fixed spatial context and treat all objects uniformly, ignoring the heterogeneous lifecycle states of detections, active tracklets, and lost targets. We propose HDST-GNN, a Heterogeneous Dynamic Spatiotemporal Graph Neural Network with three novel contributions. First, Altitude-Adaptive Edge Construction estimates a camera-altitude proxy from mean object area and adjusts the graph connectivity radius accordingly. Second, Heterogeneous Node Representation models detections (Type-D), confirmed tracklets (Type-T), and lost tracklets (Type-L) as distinct node types with dedicated projections and typed edge relations. Third, Occlusion-Gated Temporal Aggregation gates each node's attention contribution by its occlusion confidence, preventing occluded nodes from corrupting neighbour embeddings. HDST-GNN is trained end-to-end with a differentiable Sinkhorn head using joint cross-entropy and triplet loss. On VisDrone2019-MOT with oracle detections, HDST-GNN achieves 94.51% MOTA and 97.24% IDF1, outperforming SORT by +5.0 MOTA points and reducing identity switches by 81%. With real YOLOv8n detections, HDST-GNN reduces identity switches by 49% vs. SORT. Ablation studies confirm the independent contribution of each component.
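Two of the three components can be illustrated with small functions. Everything here is a guess at one workable form: the square-root scaling of the radius with mean object area, the multiplicative gate (1 - occlusion confidence), and all names and numbers are assumptions rather than the paper's definitions.

```python
import math
from typing import List, Sequence


def adaptive_radius(object_areas: Sequence[float], base_radius: float, reference_area: float) -> float:
    """Scale the graph connectivity radius by an altitude proxy.

    Assumption: mean object area shrinks with altitude, and pixel distances scale
    with the square root of area, so radius = base_radius * sqrt(mean_area / reference_area).
    """
    mean_area = sum(object_areas) / len(object_areas)
    return base_radius * math.sqrt(mean_area / reference_area)


def occlusion_gated_aggregate(neighbor_embeddings: Sequence[Sequence[float]],
                              attention_logits: Sequence[float],
                              occlusion_conf: Sequence[float]) -> List[float]:
    """Attention-weighted mean of neighbor embeddings, with each neighbor's weight
    multiplied by (1 - occlusion confidence) before renormalization."""
    max_logit = max(attention_logits)
    exp_scores = [math.exp(s - max_logit) for s in attention_logits]  # stable softmax numerators
    gated = [e * (1.0 - o) for e, o in zip(exp_scores, occlusion_conf)]
    total = sum(gated)
    dim = len(neighbor_embeddings[0])
    if total == 0.0:
        return [0.0] * dim  # every neighbor is fully occluded
    return [sum(w * emb[d] for w, emb in zip(gated, neighbor_embeddings)) / total
            for d in range(dim)]


radius = adaptive_radius([300.0, 500.0], base_radius=50.0, reference_area=100.0)
message = occlusion_gated_aggregate(
    neighbor_embeddings=[[1.0, 0.0], [3.0, 2.0]],
    attention_logits=[0.0, 0.0],
    occlusion_conf=[0.0, 1.0],  # second neighbor is fully occluded
)
print(radius, message)
```

With the second neighbor fully occluded, its embedding drops out of the aggregate, which is the behavior the abstract describes as preventing occluded nodes from corrupting neighbour embeddings.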

2606.05586 2026-06-05 cs.CV cs.MM

BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

BMCR: 基于强化学习的自适应主干模块组合用于遥感目标检测

Wenlin Liu, Xikun Hu, Ping Zhong

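Institutions * College of Electronic Science and Technology, National University of Defense Technology

AI summary: Proposes BMCR, which uses reinforcement learning to dynamically compose a modular backbone from CNN and ViT modules, addressing input-adaptive feature extraction across inputs of varying complexity in remote sensing object detection and achieving leading results on several datasets.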

详情
Abstract

In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT) based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5 and DIOR-R, BMCR achieves 79.31%, 73.41% and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.

2606.05576 2026-06-05 cs.CV

UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning

Gexin Huang, Yanting Yang, Myeongkyun Kang, Beidi Zhao, Jun Zhou, Chen Zhou, Gang Wang, Zu-hua Gao, Xiaoxiao Li

Affiliations: University of British Columbia; Vector Institute; BC Cancer Agency; The Hong Kong Polytechnic University

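AI summary: Proposes the UltraVR benchmark, which uses structured chain-of-thought annotations to diagnose the evidence-grounded reasoning of vision-language models on ultra-resolution images, and finds that model errors concentrate in evidence grounding and local perception.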

Comments
10 pages, 1 figure
Abstract

Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images, where critical evidence is tiny, subtle, spatially distant, or distributed, remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis rather than black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings establish UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but also where their ultra-resolution reasoning process breaks.
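
The process-level diagnosis described in the abstract can be pictured as tallying step-level errors by reasoning label. The sketch below is a minimal illustration under assumed conventions: the `(label, is_correct)` record shape and the function name `error_rate_by_label` are hypothetical and do not reflect the benchmark's actual data format or scoring code. Only the five label names come from the abstract.

```python
from collections import defaultdict

# The five reasoning labels named in the abstract.
LABELS = (
    "evidence grounding",
    "local perception",
    "quantification",
    "evidence integration",
    "decision inference",
)


def error_rate_by_label(steps):
    """Return {label: error rate} for every label that appears in `steps`.

    `steps` is an iterable of (label, is_correct) pairs, one per reasoning
    step. This record shape is an assumption made for illustration.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for label, is_correct in steps:
        if label not in LABELS:
            raise ValueError(f"unknown reasoning label: {label!r}")
        totals[label] += 1
        if not is_correct:
            errors[label] += 1
    return {label: errors[label] / totals[label] for label in LABELS if totals[label]}


# Made-up step records, not drawn from the benchmark.
toy_steps = [
    ("evidence grounding", False),
    ("evidence grounding", True),
    ("local perception", False),
    ("decision inference", True),
]
print(error_rate_by_label(toy_steps))
```

Grouping by label in this way shows which stage of the pipeline accounts for the most failures, which final-answer accuracy alone cannot reveal.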