arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22997 2026-05-25 cs.CV

Scene Reconstruction as Mapping Priors for 3D Detection

Yang Fu, Yuliang Zou, Hao Xiang, Xin Huang, Yijing Bai, Chen Song, Weijing Shi, Govind Thattai, Dragomir Anguelov, Mingxing Tan, Yingwei Li

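AI summary In autonomous driving, maps are critical for motion planning but remain under-used in perception tasks such as 3D object detection. This paper proposes a scalable solution that automatically builds dense mapping priors and introduces MPA3D, a framework that integrates those priors with multiple sensor modalities to improve 3D detection. Experiments on the Waymo Open Dataset report new state-of-the-art results, supporting the value of scalable scene priors for 3D detection.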

Comments Accepted to CVPR 2026

Abstract

In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks such as 3D object detection. Maps can provide robust structural priors of the static environment, helping resolve ambiguities and correct for sensor data sparsity or noise, especially for distant objects or under adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for efficient, large-scale deployment. In this paper, we propose a scalable solution to systematically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Priors Augmented 3D Detection (MPA3D) framework to effectively integrate mapping priors with different sensor modalities. Extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, proving the effectiveness of scalable reconstructed scene priors for enhancing 3D detection.

2605.22996 2026-05-25 cs.CV

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

Adil Meric, Lin Geng Foo, Mert Kiray, Benjamin Busam, Rishabh Dabral, Christian Theobalt

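AI summary This paper presents CoMoGen, a controllable video generation framework that produces videos with realistic interaction dynamics conditioned on an input image and a binary mask sequence. It introduces a lightweight MaskAdapter module that encodes the mask sequence into a residual signal and injects it into a Multi Modal Diffusion Transformer (MMDiT) through a cosine-weighted schedule. Fine-tuning only the MMDiT layers responsible for motion generation with Low-Rank Adaptation (LoRA) focuses training on motion-critical components and reduces computational cost. Experiments show that CoMoGen outperforms existing methods in motion fidelity and perceptual realism, reaching state-of-the-art performance.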

Abstract

We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify which layers are responsible for motion generation. Therefore, we propose a novel way to determine "Motion Layers" operating in the attention space of MMDiT. We fine-tune the model by applying Low-Rank Adaptation (LoRA) to the Motion Layers, without requiring any architectural change in the MMDiT. This selective adaptation enables our method to focus on motion-critical components, yielding reduced computational cost. Despite its simplicity, CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and achieves state-of-the-art performance in motion fidelity and perceptual realism. Project page: mericadil.github.io/CoMoGen.

2605.22993 2026-05-25 cs.CL cs.AI

A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

Chuanbo Hu, Minglei Yin, Bin Liu, Wenqi Li, Lynn K. Paul, Shuo Wang, Xin Li

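AI summary This study proposes TPA, a proactive multi-agent dialogue framework for assessing Social Language Disorder (SLD) traits in autism spectrum disorder. A doctor agent proactively selects targeted questioning strategies to systematically surface latent language-disorder traits in patient dialogue, improving diagnostic efficiency. Experiments show that TPA outperforms existing baselines on several key metrics, substantially raising SLD trait coverage and diagnostic efficiency, and supporting AI-assisted clinical screening.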

Abstract

Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, and is validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6 percentage points higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.

2605.22991 2026-05-25 cs.RO

Verified Task-Space Motion Planning Under Joint-Space Constraints

Hanjiang Hu, Changliu Liu, Yebin Wang

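AI summary This paper studies verified task-space motion planning under joint-space constraints. Conventional task-space planners such as Bug2 can suffer tracking drift and fail to reach the goal when joint-angle limits apply. The proposed method uses a second-order polynomial approximation of the inverse kinematics together with the S-procedure to compute the largest Cartesian hyperrectangle that is certifiably reachable under joint displacement bounds, which enables adaptive step sizes during planning. Experiments across multiple adversarial scenarios show zero joint-limit violations and a 100% goal-reaching rate.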

Abstract

Reactive task-space planners such as Bug2 operate with fixed Cartesian step sizes and are unaware of the manipulator's joint-angle limits. When the Jacobian is poorly conditioned, even small Cartesian steps can demand joint changes that exceed admissible bounds; clipping the joints to their limits causes tracking drift and can prevent goal reaching entirely. We address this by computing, at each planning step, the largest Cartesian hyperrectangle that is certifiably reachable under joint displacement bounds. Using a second-order polynomial approximation of the inverse kinematics and the S-procedure, we formulate a small semidefinite program whose solution yields the certified half-width $\lambda^\star$. An equivalent bisection procedure exploiting the quadratic structure solves the certification in sub-millisecond time. Integrating this certificate with Bug2 yields a planner whose step size adapts to local kinematic conditioning. In a statistical evaluation over 94 adversarial scenarios spanning six joint-limit settings, the SOS-verified planner achieves zero joint-limit violations with a 100% goal-reaching rate, whereas a standard Bug2 planner violates joint limits in 6-11% of steps and fails to reach the goal in up to 18% of scenarios.
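
The idea of certifying a step size by bisection can be illustrated with a much simpler stand-in. The sketch below is not the paper's semidefinite program or its second-order model: it assumes a first-order model dq = jinv · dx, for which the worst-case joint displacement over a Cartesian box is known in closed form, and bisects for the largest half-width that keeps every joint within its bound. The names `jinv`, `dq_max`, `is_certified`, and `largest_half_width` are hypothetical.

```python
def worst_case_joint_step(jinv, lam):
    """Largest |dq_i| over all Cartesian steps with |dx_j| <= lam,
    assuming the linear model dq = jinv @ dx (an illustrative assumption)."""
    return [lam * sum(abs(a) for a in row) for row in jinv]


def is_certified(jinv, lam, dq_max):
    """True if every joint stays within its displacement bound for the whole box."""
    worst = worst_case_joint_step(jinv, lam)
    return all(w <= bound for w, bound in zip(worst, dq_max))


def largest_half_width(jinv, dq_max, lam_hi=1.0, tol=1e-9):
    """Bisect for the largest certified half-width in [0, lam_hi].

    Relies on certification being monotone: if a box is certified,
    every smaller box is certified as well.
    """
    if is_certified(jinv, lam_hi, dq_max):
        return lam_hi
    lo, hi = 0.0, lam_hi  # lo is always certified, hi never is
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_certified(jinv, mid, dq_max):
            lo = mid
        else:
            hi = mid
    return lo
```

For this linear stand-in the answer also equals the minimum over joints of `dq_max[i]` divided by the absolute row sum of `jinv`, so bisection is not strictly needed; it is shown because the same loop carries over to models where the certification check has no closed form.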

2605.22986 2026-05-25 cs.RO cs.AI cs.HC cs.LG

Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations

Helena Merker, Nick Walker, Andreea Bobu

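AI summary This study addresses underspecified features when learning reward functions from human demonstrations and proposes a framework that identifies and corrects reward misalignment through targeted explanations. The core method analyzes how consistent each feature is across demonstrations to identify underspecified features, explains the resulting uncertainty in natural language, and actively requests targeted additional demonstrations. Experiments in simulated and real-robot tasks show that the method significantly improves reward learning compared with random querying and passive data collection.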

Abstract

Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features (task-relevant aspects of behavior). In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
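
The statistical signal described in the abstract (low variation for well-specified features, high variation for underspecified ones) can be sketched in a few lines. This is a minimal illustration, not the authors' method: it assumes each demonstration has already been reduced to one scalar value per feature on a comparable scale, and it uses a fixed variance threshold. The feature names and the `flag_underspecified` helper are hypothetical.

```python
from statistics import pvariance


def flag_underspecified(demo_features, threshold):
    """Flag features whose value varies widely across demonstrations.

    demo_features: list of dicts mapping feature name -> scalar value,
                   one dict per demonstration (all with the same keys).
    threshold:     variance above which a feature is treated as
                   underspecified (an illustrative, hand-picked cutoff).
    Returns (variances, flagged_names).
    """
    names = list(demo_features[0].keys())
    variances = {
        name: pvariance([demo[name] for demo in demo_features])
        for name in names
    }
    flagged = sorted(name for name, var in variances.items() if var > threshold)
    return variances, flagged


# Three demonstrations: distance to the table is held constant,
# cup tilt varies a lot, so tilt is the candidate for a targeted query.
demos = [
    {"table_distance": 0.5, "cup_tilt": 0.0},
    {"table_distance": 0.5, "cup_tilt": 1.0},
    {"table_distance": 0.5, "cup_tilt": 2.0},
]
variances, flagged = flag_underspecified(demos, threshold=0.1)
```

In a real system the flagged names would feed the natural-language explanation and the request for corrective demonstrations; how features are scaled before comparing variances is a design choice this sketch leaves open.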

2605.22984 2026-05-25 cs.LG cs.AI

Test-Time Training Undermines Safety Guardrails

Simone Antonelli, Sadegh Akhondzadeh, Aleksandar Bojchevski

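AI summary This paper studies the safety risks that Test-Time Training (TTT) introduces alongside its performance gains. The authors note that TTT lets a model adjust its parameters during inference, which improves tasks such as few-shot learning and retrieval-augmented generation but also opens new attack vulnerabilities that make safety guardrails easier to bypass. Experiments show that TTT significantly raises attack success rates, and the attacks transfer well across models of different scales. The authors propose a lightweight detector based on perplexity shift to flag potential TTT attack requests.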

Comments 30 pages, 4 figures. Project page: https://uoc-tail.github.io/ttt-jailbreak/

Abstract

Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. We identify three threat models for TTT and demonstrate how attackers can leverage them to bypass safety filters. Our results show that TTT can significantly increase the Attack Success Rate (ASR) and the ASR over 10 generation trials (ASR@10). For example, under LoRA, the few-shot and generation-phase threat models achieve an average ASR@10 of 95% and 93% respectively, across models from different families and scales. These vulnerabilities transfer to production fine-tuning APIs. We also show that TTT-induced overfitting can produce degenerate outputs that inflate ASR under standard judges, and propose a validity-aware evaluation to correct for this. Our findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. As a first step toward defense, we propose a lightweight provider-side detector that flags TTT requests via the perplexity shift on a private harmful holdout, but robust deployment will ultimately require dynamic alignment.
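
A perplexity-shift check of the kind mentioned at the end of the abstract could look roughly like the sketch below. The abstract does not give the detector's details, so everything here is an assumption: the provider is assumed to have per-token log-probabilities on a private holdout text before and after the requested adaptation, a drop in perplexity on that holdout is assumed to be the suspicious direction, and the `max_drop` threshold is arbitrary. The function names are hypothetical.

```python
import math


def log_perplexity(token_logprobs):
    """Mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(log_perplexity(token_logprobs))


def flags_request(before, after, max_drop=0.5):
    """Flag an adaptation request if holdout log-perplexity fell by more than max_drop.

    before, after: per-token log-probabilities on the same holdout text,
                   measured before and after the adaptation step.
    """
    shift = log_perplexity(after) - log_perplexity(before)
    return shift < -max_drop
```

Working in log-perplexity keeps the threshold additive; a `max_drop` of 0.5 corresponds to perplexity falling by a factor of about 1.65.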

2605.22981 2026-05-25 cs.CL cs.AI cs.LG

Memorization Dynamics of Fill-in-the-Middle Pretraining

Tobias von Arx, Tanguy Dieudonné

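AI summary This paper studies how the fill-in-the-middle (FIM) pretraining objective affects verbatim memorization in language models. Training matched Llama 3.2 models on a corpus with repeated content shows that FIM more often recovers short or partially matching spans, whereas the conventional left-to-right (LTR) objective more often assigns high confidence to long exact continuations. The experiments also show that verbatim memorization under FIM training grows approximately linearly with the number of repetitions, and that suffix context alone is not enough for accurate recall, with prefix context playing the key role. The study stresses that a single evaluation format can miss the complexity of memorization behavior.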

Comments MemFM @ ICML 2026

Abstract

Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.

2605.22973 2026-05-25 cs.LG cs.AI

Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection

Muhammad Rajabinasab, Michael E. Houle, Oussama Chelly, Arthur Zimek

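AI summary This paper examines how unsupervised feature selection methods are evaluated, noting that most are never compared against random feature selection, which makes their actual contribution hard to measure. The authors propose random feature selection as an evaluation baseline and show experimentally that many state-of-the-art methods are worse than random selection in both performance and efficiency. The study therefore argues that new unsupervised feature selection methods must be benchmarked against random selection to demonstrate real improvement.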

Comments Preprint submitted to Elsevier Pattern Recognition Letters

Abstract

Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsupervised evaluation metrics computed on selected datasets, along with comparisons to existing methods. However, in the absence of an established evaluation baseline, it is difficult to determine the value added to the existing literature by each of these methods, and how effective their underlying approaches are. We propose using random feature selection as a baseline for evaluating unsupervised feature selection methods. We empirically show that many of the state-of-the-art methods in unsupervised feature selection are outperformed by random feature selection in both performance and efficiency. Accordingly, we emphasize the strict requirement of considering random feature selection as a baseline in the development process of novel unsupervised feature selection methods to ensure a consistent improvement over random feature selection.
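
A random feature selection baseline is cheap to add to an evaluation harness. The sketch below is one plausible way to set it up, not the authors' protocol: it draws `k` distinct feature indices uniformly at random, repeats the draw over several seeds, and scores each subset with whatever metric the study already uses. The helper names and the `score_fn` callback are hypothetical.

```python
import random


def random_feature_selection(n_features, k, seed=0):
    """Return k distinct feature indices drawn uniformly at random, sorted."""
    if not 0 <= k <= n_features:
        raise ValueError("k must be between 0 and n_features")
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_features), k))


def select_columns(rows, indices):
    """Keep only the chosen feature columns of a row-major dataset."""
    return [[row[i] for i in indices] for row in rows]


def baseline_scores(rows, k, score_fn, n_trials=10, seed=0):
    """Score several random k-feature subsets with the study's own metric.

    score_fn takes the reduced dataset and returns a number; averaging the
    returned list gives the baseline a proposed method should beat.
    """
    n_features = len(rows[0])
    scores = []
    for trial in range(n_trials):
        indices = random_feature_selection(n_features, k, seed=seed + trial)
        scores.append(score_fn(select_columns(rows, indices)))
    return scores
```

Repeating the draw matters because a single random subset can be lucky or unlucky; reporting the mean and spread over trials gives a fairer comparison.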

2605.22972 2026-05-25 cs.LG cs.AI

A mathematical theory of balancing relational generalization and memorization

Luke Cheng, Samuel Lippl

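AI summary This paper examines how learning systems balance relational generalization against memorizing exceptions, and introduces a new task, transitive inference with exceptions, to test both abilities. Through theoretical analysis and experiments, the study finds that neural network models can balance generalization and memorization under different representational structures, but that success depends on the specific representational geometry. The theory explains why the task is mechanistically challenging, and experiments with pretrained language models confirm its predictions.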

Abstract

Humans, animals, and modern machine learning models exhibit impressive abilities to learn complex behaviors and generalize these behaviors to unseen situations. This ability requires us to learn rules and regularities that allow for such generalizations. At the same time, in most complex environments, any rule will have its exceptions. How do learning systems balance between learning general regularities and memorizing exceptions? We argue that a lack of task paradigms has hindered the study of this essential ability. To address this gap, we introduce a novel task, transitive inference with exceptions, that tests for relational generalization and memorization of an exception to the relational rule. We then analytically characterize the behavior of a simple, theoretically tractable model of neural network learning (kernel ridge regression) across a broad family of representations and task parameters. We find that these models can balance between relational generalization and memorization, but unlike for transitive inference without an exception, successful generalization is sensitive to the specific representational geometry. We explain why this task is more challenging mechanistically by drawing on our analytical theory. Finally, we validate our theoretical insights in pretrained language models that are finetuned on ordered relations, finding that these models successfully generalize according to the transitive rule, but also make the kinds of systematic mistakes predicted by our theory. Overall, our theory shows how learning systems can balance between relational generalization and memorization, explains how this can go wrong, and emphasizes the need for new task paradigms designed to probe this ability.

2605.22971 2026-05-25 cs.CL cs.HC

Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

Ko Watanabe, Shoya Ishimaru

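AI summary This paper studies whether large language models (LLMs) can estimate employees' domain knowledge by analyzing their Slack communication logs. Using 27,188 messages from 43 users, it compares the zero-shot estimates of seven models, including the Gemini, Claude, and GPT families, and finds that Gemini 2.5 Flash performs best with the lowest error. Estimation accuracy correlates only weakly with message volume. The study highlights both the feasibility and the current limits of automated expertise mapping, and points to the importance of privacy protection and richer knowledge representations.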

Abstract

Employees often struggle to identify "who knows what," leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.
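
The reported error metric is a mean absolute error between model estimates and self-reported ratings. A minimal sketch of that computation follows, assuming both are expressed on the same 0 to 100 scale so the result reads as percentage points; the scale, the function name, and the sample values are illustrative assumptions, not data from the study.

```python
def mean_absolute_error(estimates, self_reports):
    """Average absolute gap between model estimates and self-reported ratings."""
    if len(estimates) != len(self_reports) or not estimates:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(abs(e - s) for e, s in zip(estimates, self_reports))
    return total / len(estimates)


# Hypothetical ratings for two skills of one participant.
mae = mean_absolute_error([50, 80], [60, 60])
```

Because MAE averages unsigned gaps, it does not show whether a model systematically over- or underestimates; the mean signed error would be needed for that.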

2605.22964 2026-05-25 cs.LG

Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization

Artur Back de Luca, Kimon Fountoulakis

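AI summary This paper studies how hard exact certification is for circuits and Transformer models under minimal overparametrization. The authors prove that adding even a small number of parameters can make the number of examples needed for certification grow exponentially, so exact certification is hard across several hypothesis classes. The experiments examine certification of constructed circuits and trained Transformers on a binary addition task, and show that imperfect models can evade detection by large random samples.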

Comments 38 pages, 5 figures

Abstract

As state-of-the-art neural networks are deployed on reasoning and algorithmic tasks, exactness guarantees become increasingly important. However, high average-case accuracy can still mask inconsistent behaviors. This motivates exact certification, which asks for the smallest set of labeled examples needed to certify that a learned hypothesis equals the target. We show that while some hypotheses are easy to certify, even minimal overparametrization can make certification exponentially hard across several hypothesis classes. For threshold circuits of depth $\ge 2$, adding a single extra gate can force certificate sizes exponential in the input dimension. We show an analogous hardness result for log-precision Transformers with only constant architectural overhead. We also characterize approximate certification, showing that allowing only polynomially many mistakes still requires exponentially large certificates, whereas constant relative-error guarantees can hide exponentially many mistakes. Empirically, we study certification for constructed circuits and trained Transformers for recognizing binary addition. While the constructed circuits instantiate the exponential barrier for certification, the trained Transformer analysis shows that imperfect models can evade detection by large uniformly sampled certificate candidates.

2605.22963 2026-05-25 cs.CL cs.AI

Graph Alignment Topology as an Inductive Bias for Grounding Detection

Paul Landes, Pranav Herur, Adam Cross, Jimeng Sun

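AI summary This paper studies how graph alignment topology can serve as an inductive bias for improving the factual accuracy of large language model (LLM) outputs. The authors build bipartite graphs between reference information and model outputs and use a graph neural network to model the alignment structure, learning directly over alignment topology. The method achieves state-of-the-art results on several hallucination detection and question-answering datasets, outperforming existing methods and foundational LLMs such as GPT-4o.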

Abstract

Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does not encode whether responses are grounded with respect to a reference. These issues limit the use of LLMs in domains where strict factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches improve factuality through retrieval augmentation, self-consistency, or claim verification, but generally do not learn directly over alignment topology. To leverage alignment topology as an inductive bias, we construct aligned bipartite graphs between reference information and LLM outputs and train a graph neural network (GNN) to model alignment structure using message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including foundational LLMs such as GPT-4o.

2605.22962 2026-05-25 cs.CV cs.CE cs.HC cs.SE q-bio.NC

GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction

Iba Baig, Kevin Li, Yanbin Xu, Seiji Cattelain, Marie Hallo, Hayato Ono, Sho Tsuji, Ming Bo Cai

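AI summary This study presents the GazeBehavior Annotation Toolkit (GBAT), an AI-based tool for automatically annotating egocentric eye-tracking and video data from child-caregiver interactions. Using deep learning, the toolkit supports post-hoc synchronization across multiple videos, semi-automatic annotation of gaze targets, and categorization of participants' poses and hand actions, which improves the efficiency and scalability of data preprocessing and feature extraction. It supports large-scale, longitudinal studies of attentional dynamics and naturalistic behavior in early human development.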

Comments submitted to IEEE International Conference on Development and Learning (ICDL), 2026

Abstract

Video recordings of child-caregiver interactions enable investigation of attentional dynamics during naturalistic behavior. Such multimodal recording also allows researchers to examine how attention interacts with action and language use in real time. However, manual annotation of such data is time-consuming. Here, we introduce GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization across multiple videos, semi-automatic annotation of gaze target categories, and categorization of participants' poses and hand actions. This toolkit improves the efficiency and scalability of feature extraction from human egocentric eye-tracking and video data. Such improvement is critical in supporting large-scale and longitudinal investigations of attentional dynamics and naturalistic behavior in human early development.

2605.22635 2026-05-25 cs.LG cs.CL cs.CV

The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo

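AI summary In multi-task radiology report generation, existing linear scalarization strategies struggle to balance the hard constraints of clinical supervision against the smoothness requirements of report generation. This paper analyzes the problem from a gradient-dynamics perspective, characterizes it as a "Double Dilemma" of drift-term deviation and diffusion-term decay, and proposes CAME-Grad, a model-agnostic optimizer. Through conflict-averse direction rectification and magnitude-enhanced energy injection, the optimizer ensures geometric validity and avoids local optima. Experiments show that it significantly improves clinical efficacy across multiple RRG methods.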

Comments Accepted by ICML 2026

Abstract

While multi-task-learning-based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most studies focus on architectural design yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids local optimal solutions. Then, an adaptive gradient fusion mechanism is used to establish a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.

2605.22423 2026-05-25 cs.CV

Moment-Reenacting: Inverse Motion Degradation with Cross-shutter Guidance

Xiang Ji, Guixu Lin, Zhengwei Yin, Jiancheng Zhao, Yinqiang Zheng

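AI summary This paper studies how to invert motion degradation caused by fast motion or low light in computational imaging, and proposes a unified framework that reconstructs moving scenes by combining the complementary characteristics of global shutter (GS) blur and rolling shutter (RS) distortion. The authors design a dual-shutter system that captures synchronized blur-RS image pairs and build a triaxial imaging system to collect a real-world dataset for training and evaluation. The proposed network uses a dual-stream module to separate the contextual and temporal aspects of motion, yielding high-quality frame reconstruction and strong, broadly applicable results on high-speed video reconstruction under complex motion degradation.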

Comments Accepted by TPAMI

Abstract

Motion degradation, manifested as blur in global shutter (GS) images or rolling shutter (RS) distortion in RS counterparts, remains a fundamental challenge in computational imaging, especially under fast motion or low-light conditions. While prior works have treated blur decomposition and RS temporal super-resolution as separate tasks, this separation fails to exploit their intrinsic complementarity. In this paper, we propose a unified framework to invert motion degradation and reenact the imaging moment by jointly leveraging the complementary characteristics of GS blur and RS distortion. To this end, we introduce a novel dual-shutter setup that captures synchronized blur-RS image pairs and demonstrate that this combination effectively resolves temporal and spatial ambiguities inherent in both modalities. To allow flexible performance-cost trade-offs, we further extend this dual-shutter setup to a stereo blur-RS configuration with a narrow baseline. In addition, we construct a triaxial imaging system to collect a real-world dataset with aligned GS-RS pairs and ground-truth high-speed frames, enabling robust training and evaluation beyond synthetic data. Our proposed network explicitly disentangles motion into context-aware and temporally sensitive representations via a dual-stream motion interpretation module, followed by a self-prompted frame reconstruction stage. Extensive experiments validate the superiority and generalizability of our approach, establishing a new paradigm for realistic high-speed video reconstruction under complex motion degradations. Code and additional resources are available at https://jixiang2016.github.io/dualBR_site/.

2605.22373 2026-05-25 cs.LG cs.CL

Boundary-targeted Membership Inference Attacks on Safety Classifiers

针对安全分类器的边界目标成员推断攻击

Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah

AI总结 该研究探讨了针对安全分类器的边界定向成员推理攻击问题,这类分类器常用于生成式AI系统中以过滤有害内容或识别高风险用户。研究提出了一种新的攻击方法,通过识别分类器最不自信的样本,揭示模型在训练数据上的记忆性特征,从而推断出样本是否属于训练集。实验表明,该方法在检测用户情绪支持需求的分类器上,能以较低的误报率恢复更多被标记为高风险的对话,效果显著优于现有成员推理攻击方法,并进一步分析了边界样本的特性,指出基于内容的过滤策略难以有效防御此类攻击。

详情
AI中文摘要

安全分类器是生成式AI系统中的重要保障,用于过滤有害内容或识别与大语言模型交互时处于风险中的用户。尽管这些模型是必要的,但它们是在包含自残和心理健康讨论等敏感数据集上训练的,这引发了重要但尚未充分理解的隐私问题。成员推断攻击(MIA)允许对手推断用于训练模型的示例的成员身份。在这项工作中,我们假设识别分类器最不自信的示例对于对手推断成员身份是有信息的。这反映了局部泛化失败,其中模型依赖记忆来解决训练集中的歧义。为了研究这一点,我们引入了一种新的边界目标选择策略,该策略识别低置信度示例,从而放大训练集中示例成员身份的信号。我们的实验结果表明,在针对检测可能需要情感支持的用户的微调分类器上,对手可以以5%的假阳性率恢复安全分类器标记为指示用户困扰的对话中的19%。这比单独使用最先进的MIA方法攻击高出3.5倍。最后,我们描述了边界示例的特征,并表明基于内容的过滤对于保护无效,而现有的噪声策略可以有效减轻这些示例的敏感性。

英文摘要

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that identifying the examples on which the classifier is least confident is informative for an adversary seeking to infer membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples, which amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is $3.5$ times more than attacking using state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, whereas existing noise strategies can effectively mitigate the susceptibility of these examples.
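The selection step and the reported metric can be illustrated with a minimal sketch. It assumes a binary classifier whose positive-class probabilities are already available, and per-example attack scores from some existing MIA. The helper names `boundary_select` and `tpr_at_fpr` are hypothetical and do not come from the paper, and "closest to 0.5" is only one plausible reading of "least confident".

```python
def boundary_select(probs, k):
    """Return indices of the k examples whose positive-class
    probability is closest to 0.5 (assumed notion of low confidence)."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:k]


def tpr_at_fpr(scores, is_member, max_fpr):
    """Best true-positive rate over all thresholds whose
    false-positive rate does not exceed max_fpr.
    Assumes at least one member and one non-member."""
    members = [s for s, m in zip(scores, is_member) if m]
    non_members = [s for s, m in zip(scores, is_member) if not m]
    best = 0.0
    for threshold in sorted(set(scores)):
        fpr = sum(s >= threshold for s in non_members) / len(non_members)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in members) / len(members)
            best = max(best, tpr)
    return best
```

In this sketch an attacker would first call `boundary_select` on the classifier outputs, then evaluate the attack scores of only the selected examples with `tpr_at_fpr(..., max_fpr=0.05)`.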

2605.22350 2026-05-25 cs.LG stat.ML

Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation

神经网络的部分融合:集成与权重聚合之间的高效权衡

Fabian Morelli, Stephan Eckstein

AI总结 该论文提出了一种神经网络的部分融合方法,在集成学习与权重聚合之间实现计算成本与性能的灵活权衡。核心思想是基于神经元层面的相似性,仅对最相似的神经元进行权重聚合,从而在保持较高准确率的同时降低计算开销。研究还展示了通过部分最优运输方法识别和匹配相似神经元的具体实现,并将权重聚合与部分融合视为集成模型的广义剪枝过程,允许对神经元进行删除或线性组合操作,进一步拓展了模型优化的灵活性。

Comments Accepted to ICML 2026

详情
AI中文摘要

神经网络的集成通常优于单个网络,但计算成本高昂,而权重聚合产生的聚合模型成本较低,但精度也较低。我们引入了网络的部分融合,它在集成和权重聚合之间进行插值,从而允许在计算成本和性能之间进行灵活的权衡。实现这一目标的一种直接方法是扩展现有的基于不同网络之间神经元级相似性的权重聚合方法,其中部分融合仅聚合最相似神经元的权重。我们展示了一种特定方法,通过部分最优传输联合识别哪些神经元最相似并进行匹配。此外,我们将权重聚合和部分融合视为集成模型的广义剪枝,其中神经元不仅可以被删除,还可以线性组合。最后,我们表明,应用于单个网络的广义剪枝通过允许基于相似性隔离、删除和线性组合神经元之间的权衡,产生了与部分融合类似的优势。我们的代码可在 https://github.com/Fabian-Mor/partial_fusion_nn 获取。

英文摘要

Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuron-level similarity between different networks, where partial fusion then only aggregates weights of neurons which are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons cannot just be deleted, but also linearly combined. Finally, we show that generalized pruning applied to a single network yields similar benefits as partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity. Our code is available at https://github.com/Fabian-Mor/partial_fusion_nn.
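The interpolation between an ensemble and full weight aggregation can be sketched for a single layer. This is a simplified illustration under stated assumptions: greedy cosine-similarity matching stands in for the partial optimal transport used in the paper, only incoming weight vectors are handled (outgoing weights would also need adjusting), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two weight vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def partial_fuse(neurons_a, neurons_b, fuse_fraction):
    """Average the most similar neuron pairs and keep the rest unmerged.

    fuse_fraction = 0 keeps every neuron of both networks (ensemble-like);
    fuse_fraction = 1 merges as many pairs as possible (aggregation-like).
    """
    pairs = sorted(
        ((cosine(u, v), i, j)
         for i, u in enumerate(neurons_a)
         for j, v in enumerate(neurons_b)),
        reverse=True,
    )
    budget = int(fuse_fraction * min(len(neurons_a), len(neurons_b)))
    used_a, used_b, fused = set(), set(), []
    for _, i, j in pairs:
        if len(fused) == budget:
            break
        if i in used_a or j in used_b:
            continue  # each neuron is merged at most once
        used_a.add(i)
        used_b.add(j)
        fused.append([(a + b) / 2 for a, b in zip(neurons_a[i], neurons_b[j])])
    kept_a = [u for i, u in enumerate(neurons_a) if i not in used_a]
    kept_b = [v for j, v in enumerate(neurons_b) if j not in used_b]
    return fused + kept_a + kept_b
```

The width of the returned layer shrinks as `fuse_fraction` grows, which is the cost side of the tradeoff the abstract describes.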

2605.22272 2026-05-25 cs.RO cs.CV

Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

Imagine2Real: 通过视频生成先验实现零样本人形机器人-物体交互

Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

AI总结 全身人形机器人-物体交互(HOI)因高质量3D数据稀缺而面临瓶颈。现有基于视频生成先验的方法由于依赖几何先验(如显式CAD模型)而存在表示不对齐问题,并因密集变形和形态不匹配而面临重定向复杂性问题。本文提出Imagine2Real,一种无需几何信息的零样本HOI框架,通过将机器人和物体运动统一为4D点轨迹解决表示不对齐问题,并通过稀疏关键点跟踪绕开重定向误差,结合行为基础模型的潜在空间保持自然步态,最终在动作捕捉系统中实现零样本物理部署。

详情
AI中文摘要

全身人形机器人-物体交互(HOI)受限于高保真3D数据的稀缺性。虽然视频生成先验提供了一种有前景的替代方案,但现有方法由于依赖几何先验(如显式CAD模型)而遭受表示不对齐问题,并且由于密集变形和形态不匹配而产生重定向复杂性。我们提出了Imagine2Real,一个零样本HOI框架,用于灵活、无几何的交互。为了解决不对齐问题,我们将机器人和物体的运动统一为4D点轨迹。为了克服重定向复杂性,我们的关键点跟踪器仅跟踪稀疏的关键点(基座、手和物体),完全绕过了误差放大的重定向过程。为了在这些稀疏信号下保持自然步态,我们利用行为基础模型(BFM)的潜在空间作为跟踪器的搜索域。通过渐进式训练策略,Imagine2Real学习到具有简单跟踪奖励的鲁棒行为,从而在动作捕捉(mocap)系统内实现零样本物理部署。

英文摘要

Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from representation misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and retargeting complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.

2605.22216 2026-05-25 cs.CV

A Robust Semantic Segmentation Pipeline for the CVPR 2026 8th UG2+ Challenge Track 2

面向CVPR 2026第八届UG2+挑战赛赛道2的鲁棒语义分割流程

Jinming Chai, Libo Yan, Licheng Jiao, Fang Liu

AI总结 本文提出了针对CVPR 2026第八届UG2+挑战赛Track 2(恶劣天气下的语义分割)的解决方案,旨在解决在不良天气条件下进行图像语义分割的难题。我们设计了一种半监督分割流水线,仅基于挑战赛提供的WeatherProof数据集进行训练,无需额外数据。方法以UniMatch V2为基线模型,将所有退化天气图像作为未标注数据进行半监督学习,并在推理阶段采用测试时增强技术以提升分割结果的鲁棒性和准确性。

详情
AI中文摘要

本报告介绍了我们针对WeatherProof数据集挑战赛(即CVPR 2026第八届UG2+挑战赛赛道2:恶劣天气下的语义分割)的解决方案。针对恶劣天气条件下的语义分割任务,我们提出了一种半监督分割流程。我们的方法仅使用WeatherProof数据集进行训练,未使用任何额外的外部数据。具体而言,我们采用UniMatch V2作为基线模型,并将所有退化天气图像视为未标记数据进行半监督训练,从而充分利用挑战赛提供的数据分布。在推理过程中,我们进一步应用测试时增强,以提高最终预测的鲁棒性和分割精度。代码已公开:https://github.com/ylb888/weatherproof-challenge-unimatchv2。

英文摘要

This report presents our solution for the WeatherProof Dataset Challenge, namely CVPR 2026 8th UG2+ Challenge Track 2: Semantic Segmentation in Adverse Weather. For the semantic segmentation task under adverse weather conditions, we propose a semi-supervised segmentation pipeline. Our method is trained exclusively on the WeatherProof dataset, without using any additional external data. Specifically, we adopt UniMatch V2 as the baseline model and treat all degraded-weather images as unlabeled data for semi-supervised training, thereby fully exploiting the data distribution provided by the challenge. During inference, we further apply test-time augmentation to improve the robustness and segmentation accuracy of the final predictions. The code is publicly available at: https://github.com/ylb888/weatherproof-challenge-unimatchv2.

2605.22020 2026-05-25 cs.CV

ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

ForeSplat:面向前馈3D高斯泼溅的优化感知预判

Yuke Li, Weihang Liu, Cheng Zhang, Yuefeng Zhang, Jiadi Cui, Zixuan Wang, Junran Ding, Haoyu Wu, Yujiao Shi, Jingyi Yu, Xin Lou

AI总结 本文提出ForeSplat,一种优化感知的前馈3D高斯溅射训练框架,旨在提升模型在有限网络容量下的重建质量。通过引入MetaGrad方法,ForeSplat将部分场景建模任务转移给优化器,使前馈模型能生成更利于后续优化的初始化表示,从而在更少优化步骤内达到更高的重建精度。实验表明,该方法在多种网络结构上均能有效提升重建效果,为轻量级高保真3D重建提供了实用路径。

详情
AI中文摘要

前馈3D高斯泼溅模型能够实现快速单次重建,但将其扩展到匹配逐场景优化质量时,受到大规模3D标注稀缺的根本限制。一种实用的折衷方案是“先预测后优化”,即通过预测后优化来弥补前馈网络有限的能力。然而,标准的前馈3DGS仅针对零步渲染误差进行训练,忽略了其输出是否为下游优化器提供了良好的初始化。我们提出了ForeSplat,一个优化感知的训练框架,使前馈3DGS模型能够产生明确设计用于快速、有效精细化的初始化。通过将部分场景建模负担转移给优化器,ForeSplat显著减轻了前馈模型的能力压力,即使使用紧凑网络也能实现高质量重建。其核心是MetaGrad,一种轻量级多锚点元梯度训练规则,通过3DGS优化器避免了昂贵的高阶微分。MetaGrad展开一个短的内循环细化轨迹,采样锚点状态,并将聚合的一阶梯度反向传播到预测头,作为替代的优化感知信号。这种微调不增加推理成本,并在几步细化后几秒内实现高质量重建。我们在多种骨干网络上实例化ForeSplat,包括AnySplat、Pi3X以及专为边缘部署定制的蒸馏变体。在所有测试架构中,经过ForeSplat训练的初始化在更少的细化步骤内收敛,并达到比原始版本更高的峰值重建质量,即使完全收敛也是如此。该框架持续弥合了摊销预测与逐场景优化之间的差距,为轻量级、高保真3D重建开辟了实用路径。

英文摘要

Feed-forward 3D Gaussian Splatting models offer fast single-pass reconstruction, but scaling them to match per-scene optimization quality is fundamentally hindered by the scarcity of large-scale 3D annotations. A practical compromise is predict-then-refine, where post-prediction optimization compensates for the limited capacity of the feed-forward network. However, standard feed-forward 3DGS is trained solely for zero-step rendering error, ignoring whether its output constitutes a good initialization for the downstream optimizer. We present ForeSplat, an optimization-aware training framework that equips feed-forward 3DGS models to produce initializations explicitly designed for rapid, effective refinement. By offloading part of the scene-modeling burden to the optimizer, ForeSplat substantially reduces the capacity pressure on the feed-forward model, making high-quality reconstruction feasible even with compact networks. At its core is MetaGrad, a lightweight multi-anchor meta-gradient training rule that bypasses costly higher-order differentiation through the 3DGS optimizer. MetaGrad unrolls a short inner-loop refinement trajectory, samples anchor states, and back-propagates aggregated first-order gradients to the prediction head as a surrogate optimization-aware signal. This fine-tuning adds no inference cost and enables high-quality reconstruction within seconds after a few refinement steps. We instantiate ForeSplat on diverse backbones, including AnySplat, Pi3X, and a distilled variant tailored for edge deployment. Across all tested architectures, a ForeSplat-trained initialization converges in fewer refinement steps and reaches a higher peak reconstruction quality than its vanilla counterpart, even fully converged. The framework consistently bridges the gap between amortized prediction and per-scene optimization, establishing a practical path toward lightweight, high-fidelity 3D reconstruction.

2605.21906 2026-05-25 cs.CV

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

从解剖到疾病表型的通用CT表示:通过聚合预训练

Yuheng Li, Yuan Gao, Haoyu Dong, Yuxiang Lai, Shansong Wang, Mojtaba Safari, James E. Baciak, Xiaofeng Yang

AI总结 该研究提出了一种名为FlexiCT的CT基础模型,通过聚合式持续预训练方法,在56个公开数据集的26万余例CT影像上进行训练,构建了一个大规模的CT表征学习资源。模型分三个阶段进行预训练,涵盖二维轴向、三维解剖结构以及报告引导的语义对齐,支持切片级、体积级和视觉-语言分析。实验表明,FlexiCT在多个下游任务中表现优异,并能通过嵌入信息反映肿瘤阶段等疾病表型特征,为CT影像的通用表征学习提供了新方法。

详情
AI中文摘要

计算机断层扫描(CT)是三维医学成像的核心,但基于CT的人工智能仍然分散在用于分割、分类、配准和报告分析的任务特定模型中。这里我们提出FlexiCT,一个CT基础模型系列,通过对来自56个公开数据集的266,227个CT体积进行聚合连续预训练,形成了用于CT表示学习的大规模公共资源。FlexiCT采用三阶段聚合预训练:二维轴向预训练、三维解剖预训练和报告引导的语义对齐。这种训练策略支持切片级、体积级和视觉语言分析。在五个下游任务族(分割、分类、配准、视觉语言理解和临床检索)中,FlexiCT在多个基准上匹配或超过先前的任务特定方法。其嵌入进一步沿着与不同肿瘤阶段相关的梯度组织CT扫描,表明CT基础模型可以捕获与疾病表型表征相关的影像特征。项目页面和代码见:https://ricklisz.github.io/flexict.github.io 和 https://github.com/ricklisz/FlexiCT。

英文摘要

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code are available at: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.

2605.21851 2026-05-25 cs.LG cs.AI

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

OPPO: 用于LLM推理中令牌级信用分配的贝叶斯价值递归

Yu Li, Rui Miao, Tian Lan, Zhengling Qi

AI总结 该论文提出了一种名为OPPO的新型算法,用于改进大语言模型(LLM)在推理任务中的信用分配机制。OPPO基于一种关键观察:传统方法中用于局部判别的 oracle 信号本质上是模型对最终成功概率的贝叶斯更新。通过沿轨迹累积该信号,OPPO能够在不依赖价值网络或额外采样的情况下,直接计算出每个位置的成功概率估计和令牌级优势,从而更准确地识别推理过程中的关键步骤。实验表明,OPPO在多个数学、科学和代码推理基准上显著优于现有方法。

详情
AI中文摘要

具有可验证奖励的强化学习已成为提升LLM推理的标准方法,但主流算法GRPO为每个令牌分配单一轨迹级优势,稀释了关键推理步骤的信号,并在无信息步骤中注入噪声。源自在线策略蒸馏的无评论家替代方案通过预言机条件似然比提供每令牌信号,但每个信号孤立于该位置之前累积的轨迹级证据。我们提出Oracle-Prompted Policy Optimization (OPPO),它基于一个简单观察:先前蒸馏式方法用于局部区分的预言机信号,也是模型对最终成功信念的自然贝叶斯更新。沿轨迹累积信号,以一次额外前向传播的代价,以闭式形式给出每个位置成功概率的运行估计,以及无需学习价值网络和额外采样的令牌级优势。一阶分析将优势分解为蒸馏方法使用的每令牌区分信号,乘以一个状态权重,该权重将信用集中在真正关键的令牌上,并具有方向性方差减少保证。该框架包含两种估计器,区别仅在于谁对证据评分:自预言机重用学生模型,将在线策略蒸馏奖励作为严格特例恢复;教师预言机将评分委托给更强的冻结模型。在两个基础LLM上,跨越七个数学、科学和代码推理基准,OPPO在AMC'23上比GRPO、DAPO和SDPO提升高达+6.0分,在AIME'24上提升+5.2分,且增益随响应长度单调增加。

英文摘要

Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a running estimate of the success probability at every position, together with a token-level advantage that requires no learned value network and no additional rollouts. A first-order analysis factorizes the advantage into the per-token discrimination signal used by distillation methods modulated by a state weight that concentrates credit on genuinely pivotal tokens, with a directional variance-reduction guarantee. The framework admits two estimators differing only in which model scores the evidence: a self-oracle that reuses the student and recovers the on-policy distillation reward as a strict special case, and a teacher-oracle that delegates scoring to a stronger frozen model. On two base LLMs across seven mathematics, science, and code reasoning benchmarks, OPPO improves over GRPO, DAPO, and SDPO by up to $+6.0$ points on AMC'23 and $+5.2$ points on AIME'24, with gains that widen monotonically with response length.
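The Bayesian reading of the oracle signal can be made concrete with a small sketch. By Bayes' rule, P(success | tokens up to t) equals P(success | tokens before t) multiplied by the ratio of the oracle-conditioned token likelihood to the plain policy likelihood, so summing per-token log ratios gives a running success estimate. The code below assumes those log ratios are already computed. The clipping to 1.0, the difference-based advantage, and the function names are illustrative assumptions and are not claimed to be OPPO's exact formulation.

```python
import math


def running_success_estimates(prior, log_ratios):
    """Return [p_0, p_1, ..., p_T] with p_t = p_{t-1} * exp(log_ratio_t).

    log_ratio_t is assumed to be
    log pi(y_t | oracle, y_<t) - log pi(y_t | y_<t).
    Estimates are clipped to 1.0 because approximate ratios
    can push the product above a valid probability.
    """
    estimates = [prior]
    for log_ratio in log_ratios:
        estimates.append(min(1.0, estimates[-1] * math.exp(log_ratio)))
    return estimates


def token_advantages(estimates):
    """Token-level advantage taken here as the change in estimated
    success probability caused by each token (an assumed definition)."""
    return [after - before for before, after in zip(estimates, estimates[1:])]
```

Under this assumed definition, a token that leaves the estimate unchanged receives zero advantage, while a token that sharply moves it receives most of the credit.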

2605.21605 2026-05-25 cs.CV

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

GenEvolve: 通过工具编排的视觉经验蒸馏实现自我进化的图像生成智能体

Sixiang Chen, Zhaohu Xing, Tian Ye, Xinyu Geng, Yunlong Lin, Jianyu Lai, Xuanhua He, Fuxiang Zhai, Jialin Gao, Lei Zhu

AI总结 本文提出了一种名为GenEvolve的自进化图像生成框架,旨在应对日益复杂和多样的图像生成需求。该方法通过工具协调的视觉经验蒸馏技术,使智能体能够在生成过程中自主学习和优化策略,包括证据收集、参考选择和提示构建等关键步骤。GenEvolve通过对比不同生成轨迹,提炼结构化视觉经验并用于指导模型训练,显著提升了生成质量与效率,并在多个基准测试中取得了优于现有方法的性能。

详情
AI中文摘要

开放式图像生成已不再是简单的提示词到图像问题。高质量生成通常需要智能体将模型的内部生成能力与外部资源相结合。随着请求变得更加多样化和苛刻,我们旨在开发一个通用的图像生成智能体,该智能体能够通过轨迹自我进化,并在各种生成挑战中更有效地使用工具。为此,我们提出了GenEvolve,一个基于工具编排的视觉经验蒸馏的自我进化框架。在GenEvolve中,每次生成尝试都被建模为工具编排的轨迹,智能体收集证据、选择参考、调用生成技能,并将它们组合成提示-参考程序。与主要依赖图像级标量奖励的现有智能体生成方法不同,GenEvolve针对同一请求比较多个轨迹,并将最佳-最差差异抽象为结构化视觉经验,仅提供给特权教师分支。受在线策略自蒸馏的启发,视觉经验蒸馏提供密集的令牌级监督,帮助学生内化更好的搜索、知识激活、参考选择和提示构建。我们进一步构建了GenEvolve-Data和GenEvolve-Bench。在公共基准和GenEvolve-Bench上的实验表明,与强基线相比有显著提升,在当前的图像生成框架中达到了最先进的性能。我们的网站如下:https://ephemeral182.github.io/GenEvolve/

英文摘要

Open-ended image generation is no longer a simple prompt-to-image problem. High-quality generation often requires an agent to combine a model's internal generative ability with external resources. As requests become more diverse and demanding, we aim to develop a general image-generation agent that can self-evolve through trajectories and use tools more effectively across varied generation challenges. To this end, we propose GenEvolve, a self-evolving framework based on Tool-Orchestrated Visual Experience Distillation. In GenEvolve, each generation attempt is modeled as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing agentic generation methods that mainly rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience, provided only to a privileged teacher branch. Inspired by on-policy self-distillation, Visual Experience Distillation provides dense token-level supervision, helping the student internalize better search, knowledge activation, reference selection, and prompt construction. We further construct GenEvolve-Data and GenEvolve-Bench. Experiments on public benchmarks and GenEvolve-Bench show substantial gains over strong baselines, achieving state-of-the-art performance among current image-generation frameworks. Our website is as follows: https://ephemeral182.github.io/GenEvolve/

2605.21489 2026-05-25 cs.LG cs.AI cs.CV stat.CO stat.ML

Variance Reduction for Expectations with Diffusion Teachers

具有扩散教师的期望方差缩减

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

AI总结 本文研究了如何在使用预训练扩散模型作为“教师”进行下游任务(如文本到3D生成、单步蒸馏等)时,降低梯度估计的方差。提出了一种名为CARV的计算感知方差控制框架,通过分层蒙特卡洛估计器,将昂贵的上游计算过程与廉价的扩散噪声重采样相结合,并结合时间步重要性采样和分层逆CDF构造,有效减少了计算成本。实验表明,CARV在不改变目标函数的前提下显著提升了计算效率,但在某些任务中梯度方差的降低并未带来生成质量的提升,表明此时方差已不再是性能瓶颈。

Comments Project page: https://research.nvidia.com/labs/sil/projects/CARV/

详情
AI中文摘要

预训练的扩散模型作为冻结教师,为文本到3D、单步蒸馏和数据归因等下游流程提供支持。这些流程消耗的教师梯度是关于噪声水平和高斯噪声样本的蒙特卡洛期望;其估计器方差主导了计算成本,因为每次抽取都需要昂贵的上游工作(渲染、模拟、编码)。我们引入了CARV,一个计算感知的方差核算框架,它激发了一种分层蒙特卡洛估计器:通过廉价的扩散噪声重采样来摊销昂贵的上游计算,并通过时间步重要性采样和分层逆CDF构造加以强化。在我们的文本到3D蒸馏和归因实验中,CARV在不改变目标的情况下提供了2-3倍的有效计算乘数(主要来自摊销重用;约25%来自IS+分层);在单步蒸馏中,相同的技术将梯度方差降低了一个数量级,但并未改善下游FID,标志着MC方差不再是瓶颈的区间。

英文摘要

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
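A hierarchical estimator of this kind can be sketched in a few lines. The sketch assumes a discrete set of timesteps, a uniform-over-timesteps target expectation, and caller-supplied `upstream` and `integrand` callables. All names are hypothetical and the weighting scheme is a generic importance-sampling construction, not necessarily the one CARV uses.

```python
import random


def stratified_inverse_cdf(weights, n, rng):
    """Draw n timestep indices from the distribution given by `weights`,
    using one uniform draw per stratum [k/n, (k+1)/n)."""
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    draws = []
    for k in range(n):
        u = (k + rng.random()) / n
        # Fall back to the last index to guard against float round-off.
        idx = next((i for i, c in enumerate(cdf) if u < c), len(weights) - 1)
        draws.append(idx)
    return draws


def amortized_estimate(upstream, integrand, weights, n_outer, n_inner, rng):
    """Estimate the uniform-over-timesteps mean of integrand(state, t, eps).

    The expensive `upstream` call runs once per outer sample and is reused
    for n_inner cheap (timestep, noise) resamples.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    acc = 0.0
    for _ in range(n_outer):
        state = upstream()  # expensive step (rendering, simulation, encoding)
        for t in stratified_inverse_cdf(weights, n_inner, rng):
            eps = rng.gauss(0.0, 1.0)
            # Importance weight 1 / (T * p_t) keeps the target unchanged.
            acc += integrand(state, t, eps) / (len(weights) * probs[t])
    return acc / (n_outer * n_inner)
```

The importance weights leave the objective unchanged while `weights` can concentrate samples on timesteps that contribute most variance, and raising `n_inner` buys extra gradient samples without extra upstream calls.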

2605.21487 2026-05-25 cs.CV

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Uni-Edit: 智能编辑作为统一模型调优的通用任务

Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li

AI总结 本文提出了一种名为Uni-Edit的智能图像编辑任务,作为统一多模态模型(UMMs)调优的通用任务。与传统的多任务混合训练方法不同,Uni-Edit通过单一任务、单一训练阶段和单一数据集,同时提升模型在图像理解、生成和编辑三方面的能力。研究引入了一种自动化且可扩展的数据合成方法,将多样化的视觉问答数据转化为复杂且有效的编辑指令,从而显著提升了模型的编辑性能,并在多个基准测试中验证了其对多模态能力的全面提升效果。

Comments Project Page: https://zhengdian1.github.io/Uni-Edit-proj/ Code: https://github.com/zhengdian1/Uni-Edit

详情
AI中文摘要

目前,增强统一多模态模型(UMMs)的图像理解、生成和编辑能力主要依赖于混合多任务训练。由于固有的任务冲突,这种策略需要复杂的多阶段流水线、大量数据混合和平衡技巧,仅能实现性能折衷而非真正的相互增强。为了打破这一范式,我们提出Uni-Edit,一种智能图像编辑任务,作为UMM调优的第一个通用任务。与复杂的混合流水线不同,Uni-Edit仅使用一个任务、一个训练阶段和一个数据集,即可同时提升所有三种能力。具体来说,我们首先识别出图像编辑本质上是一个理想的通用任务,因为它自然需要视觉理解和生成。然而,现有的编辑数据依赖于过于简单的指令,严重低估了模型的理解能力。为解决这一问题,我们引入了第一个自动化且可扩展的智能编辑数据合成流水线,将多样化的VQA数据转化为复杂且有效的编辑指令,其中嵌入了问题和嵌套逻辑。由此产生了Uni-Edit-148k数据集,将多样化的推理密集型指令与高质量编辑图像配对。在BAGEL和Janus-Pro上的大量实验表明,仅对Uni-Edit进行调优即可在所有三种能力上实现全面增强,无需任何辅助操作。

英文摘要

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

2605.21139 2026-05-25 cs.CV cs.LG

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

蒸馏思考,预见行动:面向自动驾驶的认知-物理强化学习

Yang Wu, Qiang Meng, Zhaojiang Liu, Youquan Liu, Jian Yang, Jin Xie

AI总结 当前端到端自动驾驶模型受到模仿学习行为克隆天花板的限制,为此,本文提出CoPhy认知-物理强化学习框架,通过将视觉语言模型知识蒸馏到鸟瞰图编码器中,实现零推理成本的认知能力,并构建自回归的鸟瞰图世界模型以预测候选动作的未来语义地图,从而在物理环境层面预见行动后果。该方法结合物理奖励和认知奖励优化驾驶策略,不仅在NAVSIM基准上取得最优性能,还支持通过用户定义的语言指令实现更安全、更灵活的驾驶控制。

详情
AI中文摘要

当前的端到端自动驾驶模型从根本上受到模仿学习的行为克隆上限的限制。虽然强化学习提供了更智能自主性的路径,但它需要两个缺失的基础设施:(1)理解交通语义和驾驶意图的认知基础,以及(2)能够预见候选行动后果的前瞻性物理环境。为此,我们提出了CoPhy,一个用于自动驾驶的认知-物理强化学习框架。为了蒸馏思考,我们将VLM知识蒸馏到BEV编码器中,然后完全丢弃VLM,以零推理成本保留认知能力,同时将认知通道作为可插拔接口释放,用于可选的人类语言命令。为了预见行动,我们构建了一个自回归BEV世界模型,该模型明确预测以候选行动为条件的未来语义地图,作为一个可解释的物理沙盒,从中直接推导出安全指标。基于这一双重基础设施,我们通过GRPO优化驾驶策略,采用新颖的双奖励机制:从BEV rollout导出的物理奖励强制执行硬安全约束,而来自语言对齐评分器的认知奖励确保意图合规。大量实验表明,CoPhy不仅在NAVSIM v1和v2基准上取得了最先进的结果,而且通过认知信息化的场景合规性和通过用户定义的语言指令实现的灵活意图控制,实现了更安全的驾驶。

英文摘要

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

2605.21071 2026-05-25 cs.CL cs.AI

Fine-grained Claim-level RAG Benchmark for Law

细粒度声明级法律RAG基准

Souvick Das, Sallam Abualhaija, Domenico Bianculli

AI总结 本文提出ClaimRAG-LAW,一个支持英法双语、面向法律专家与非专家用户的细粒度法律检索增强生成(RAG)基准数据集,涵盖多种真实场景的问答类型。研究通过细粒度评估框架分析当前先进法律RAG系统的检索、生成及主张级表现,揭示了其在法律领域中存在的局限性,为提升法律AI系统的可靠性提供了重要参考。

详情
AI中文摘要

大型语言模型(LLM)的快速进展正在将语义搜索转向问答范式,用户提出问题,LLM生成回答。在法律等高风险领域,检索增强生成(RAG)通常用于减轻生成回答中的幻觉。然而,先前的研究表明,无论是通用还是法律专用的RAG系统,仍然以不同速率产生幻觉,这使得细粒度评估变得至关重要。尽管有需求,现有的法律RAG系统评估框架缺乏分别对检索和生成性能进行详细分析所需的粒度。此外,当前的基准主要是英文且集中于法律专家查询,忽视了非专家需求。我们引入了ClaimRAG-LAW,一个全面的法律RAG数据集,支持法语和英语,面向专家和非专家,并包含反映现实场景的多样化问题类型。我们进一步应用细粒度评估框架对最先进的法律RAG系统进行评估,揭示了法律领域在检索、生成和声明级分析方面的局限性。

英文摘要

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stakes domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite this need, existing evaluation frameworks for legal RAG systems lack the granularity required to analyze retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework to state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

2605.20919 2026-05-25 cs.LG cs.AI cs.PL

Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures

Sutra: 以张量操作RNN作为向量符号架构的编译目标

Emma Leonhart

AI总结 Sutra 是一种类型化的纯函数式编程语言,其前向传播过程被编译为 PyTorch 神经网络。该语言通过将程序中的原始操作、控制流和字符串 I/O 等全部转换为一个融合的张量操作图,实现了对向量符号架构的高效编译。研究展示了 Sutra 在多种嵌入表示上的高精度解码能力,并验证了其可微分性,使得同一程序既能作为逻辑程序运行,也能作为可训练的神经网络进行优化。

Comments Modified NeurIPS submission, see AI declaration and replication materials at end of paper

详情
AI中文摘要

Sutra是一种带类型的纯函数式编程语言,其编译后的前向传播是一个PyTorch神经网络。编译器将整个程序——包括原语、控制流、字符串I/O——通过beta归约降级为一个在冻结嵌入基质上的融合张量操作图。旋转绑定、解绑、捆绑、多项式Kleene三值逻辑以及尾递归循环均被降级为张量操作;Kleene连接词是在{-1, 0, +1}真值网格上精确的拉格朗日插值多项式。验证通过两种方式测试同一事实。(1) 同一程序在跨越两种模态的四个冻结嵌入上运行——三种文本编码器(nomic-embed-text、all-minilm、mxbai-embed-large)和一种蛋白质语言模型(ESM-2)——并在每个基质上以宽度k=8实现100%的解码准确率,而教科书式的Hadamard乘积已经崩溃(mxbai-embed-large上2.5%,all-minilm上7.5%)。(2) PyTorch自动求导流经实际编译的图:一个用.su编写的模糊规则分类器从随机初始化(18.7±9.5%;随机概率=20%,五类)通过反向传播经过发射图(符号源未修改)训练到100.0±0.0%(三个种子)。一个加权变体额外训练一个标量余弦增益,并将其作为数值字面量写回.su源文件;重新编译重现训练后的行为,每个logit误差约2e-7,因此训练后的模型本身是可读、可重编译的代码。因此,同一工件既是一个逻辑程序,也是一个可训练的神经网络。

英文摘要

Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces the whole program -- primitives, control flow, string I/O -- to one fused tensor-op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three-valued logic, and tail-recursive loops all lower to tensor operations; the Kleene connectives are Lagrange-interpolated polynomials exact on the {-1, 0, +1} truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities -- three text encoders (nomic-embed-text, all-minilm, mxbai-embed-large) and one protein language model (ESM-2) -- and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai-embed-large, 7.5% on all-minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy-rule classifier written in .su trains from random init (18.7 +/- 9.5%; chance = 20%, five classes) to 100.0 +/- 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e-7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.
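The claim that Kleene connectives can be exact polynomials on the {-1, 0, +1} grid can be checked directly. The sketch below assumes the strong Kleene convention (false = -1, unknown = 0, true = +1, AND = min, OR = max, NOT = negation) and writes out the polynomials of degree at most two in each variable that agree with min and max on the nine grid points. Whether Sutra emits these exact expressions is an assumption, and the function names are hypothetical.

```python
def k_not(x):
    """Kleene negation: swaps true (+1) and false (-1), fixes unknown (0)."""
    return -x


def k_and(x, y):
    """Polynomial that equals min(x, y) on the {-1, 0, +1} grid."""
    return (x + y + x * y - x * x - y * y + x * x * y * y) / 2


def k_or(x, y):
    """Polynomial that equals max(x, y) on the grid,
    obtained from De Morgan: or(x, y) = -and(-x, -y)."""
    return (x + y - x * y + x * x + y * y - x * x * y * y) / 2
```

Because these are ordinary polynomials, they run unchanged on tensors and remain differentiable between grid points, which is what lets a logic program double as a trainable graph.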

2605.20558 2026-05-25 cs.CL

When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

当不规则性有帮助:神经形态学中归纳偏置的子类分析

Wen Zhang

AI总结 该研究分析了神经形态生成系统在处理日语过去时动词变位时的表现,发现模型错误主要集中在一个结构特殊且数据极少的不规则子类上。通过对照实验表明,移除这一小类比移除所有不规则动词更能提升模型泛化能力,说明不同不规则模式对模型稳定性的影响不同。研究指出,错误集中源于极低频形态模式与特定音系过程(如重音)的相互作用,强调形态评估应引入更细致的子类分析以揭示模型缺陷。

详情
AI中文摘要

神经形态生成系统通常在基准数据集上达到高总体准确率,但这种性能可能掩盖集中在罕见形态子类中的系统性错误。我们考察日语过去式动词屈折,发现一个非常小、结构特定的不规则子类(<1%的数据)占据了模型错误的不成比例份额。受控消融实验表明,移除该子类比移除所有不规则动词带来更大的泛化提升,表明并非所有不规则性对模型不稳定性贡献相同。这些发现表明,错误集中是由极端低频形态模式与特定形态音韵过程(特别是促音化)之间的交互驱动的。我们认为形态评估应纳入比标准变位类别更细粒度的子类分析。

英文摘要

Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (<1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.

2605.20201 2026-05-25 cs.CL cs.AI cs.LG

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

基于代理思维链调优的长上下文推理

Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata

AI总结 该研究针对大语言模型在长上下文复杂推理任务中表现不佳的问题,提出了一种名为ProxyCoT的新训练框架。该方法通过在短代理上下文中获取高质量的推理轨迹,并将其迁移到完整的长上下文中,从而提升模型的长上下文推理能力。实验表明,ProxyCoT在多个数据集上均优于现有方法,且计算开销更低,同时具备良好的跨领域泛化能力。

Comments Long paper, ACL 2026 (Main conference)

详情
AI中文摘要

近期的大语言模型支持高达1000万token的输入,但在需要复杂推理的长上下文任务上表现不佳。此类任务可以通过仅使用输入的一个子集(即代理上下文)而非完整序列来解决。尽管共享相同的底层推理过程,模型在代理上下文和完整上下文之间表现出显著的性能差异。为了改进长上下文推理,我们提出了ProxyCoT,一种新颖的训练框架,将推理能力从短代理上下文迁移到完整长上下文。具体来说,我们首先通过强化学习或从更大的教师模型蒸馏,在代理上下文中获得高质量的思维链推理轨迹,然后通过监督微调将这些生成的轨迹锚定到完整长上下文中。跨不同数据集的实验表明,ProxyCoT在减少计算开销的同时,始终优于强基线。此外,使用ProxyCoT训练的模型能够将其长上下文推理能力泛化到域外任务。

英文摘要

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.