arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.16020 2026-06-09 cs.AI Version update

IRAM-Omega-Q: A Computational Framework for Uncertainty Regulation in Adaptive Agents

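Veronique Ziegler

Affiliation: Independent Researcher

AI summary: This paper proposes the IRAM-Omega-Q framework, which combines a quantum-like state representation with closed-loop adaptive control and compares causal control orderings to examine how architecture affects uncertainty regulation.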

Comments 14 pages, 6 figures

Abstract

Adaptive agents operating under uncertainty must do more than optimize task outputs: they must maintain a workable internal state under noise, perturbation, and changing conditions. This paper introduces IRAM-Omega-Q, a computational framework for modeling uncertainty regulation in adaptive agents under stochastic disturbance. The framework combines a quantum-like state representation with closed-loop adaptive control over an internal entropy signal. The quantum-like formalism is used instrumentally: the evolving state is a normalized complex amplitude vector, coherent evolution is propagated exactly as psi(t + Delta t) = exp(-i H Delta t) psi(t), and a derived density matrix supports entropy and coherence-gap analysis. Two causal control orderings are compared. In regulation-first (RF) ordering, adaptive regulation is available before current-cycle disturbance and attenuates incoming exposure; in disturbance-first (DF) ordering, current-cycle disturbance is received before a new regulatory response can be computed, and stabilization acts reactively. Publication-mode, matched-seed simulations show broadly comparable coherence-gap trajectories but lower sustained adaptive gain under RF. Susceptibility maps based on post-burn-in temporal fluctuations further show that DF shifts the critical initial-gain ridge toward larger values across multiple disturbance intervals. These results identify ordering as an architectural determinant of regulatory demand and threshold location within an otherwise shared regime structure.
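The coherent update and the derived density matrix mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the 2-level Hamiltonian `H`, the step size, and the use of a diagonal (population) Shannon entropy as the entropy signal are all hypothetical choices made here.

```python
import numpy as np

def propagate(psi, H, dt):
    """Exact coherent step psi(t + dt) = exp(-i H dt) psi(t) for Hermitian H."""
    evals, evecs = np.linalg.eigh(H)          # H = V diag(evals) V^dagger
    phases = np.exp(-1j * evals * dt)         # eigenvalues of exp(-i H dt)
    return evecs @ (phases * (evecs.conj().T @ psi))

def density_matrix(psi):
    """Pure-state density matrix rho = |psi><psi|."""
    return np.outer(psi, psi.conj())

def diagonal_entropy(rho):
    """Shannon entropy (nats) of the populations on the diagonal of rho.
    Assumed stand-in for the paper's internal entropy signal."""
    p = np.real(np.diag(rho))
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log(p)))

# Hypothetical 2-level system, chosen only for illustration.
H = np.array([[1.0, 0.5], [0.5, -1.0]])
psi = np.array([1.0 + 0j, 0.0 + 0j])
psi_next = propagate(psi, H, dt=0.1)
rho = density_matrix(psi_next)
entropy = diagonal_entropy(rho)
```

Because the step is unitary, the norm of `psi_next` stays at 1, so no renormalization is needed between coherent steps in this sketch.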

2512.08724 2026-06-09 cs.LG Version update

Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search

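Manos Plitsis, Giorgos Bouritsas, Vassilis Katsouros, Yannis Panagakis

Affiliation: University of Edinburgh

AI summary: This paper proposes the Bias-Guided Prompt Search framework, which automatically generates prompts that maximize bias in the resulting images, exposing hidden biases in text-to-image models and improving fairness evaluation.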

Comments ICML 2026. Code is here: https://github.com/manosplitsis/BGPS

Abstract

Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race, and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets, either manually constructed or generated with large language models (LLMs), as part of their training and/or evaluation procedures. Besides the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI's internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e., they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.

2603.14342 2026-06-09 cs.CV cs.AI Version update

AgroOmni: A Large-Scale Multi-view Agricultural Dataset for Cross-Scale Multimodal Reasoning

Jiarui Zhang, Junqi Hu, Zurong Mai, Yang Liu, Yuhang Chen, Shuohong Lou, Henglian Huang, Hong Cheng, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng

Affiliations: Sun Yat-sen University; Tsinghua Shenzhen International Graduate School; HuanTian Wisdom Technology Co., Ltd.; China Agricultural University; Southwest Jiaotong University; National Supercomputing Center in Shenzhen; The Chinese University of Hong Kong

AI summary: This paper presents the AgroOmni dataset, whose 288K visual question answering pairs cover 56 specialized task categories, to address scale bias in multi-view, cross-scale agricultural multimodal reasoning; the proposed AgroNVILA model reaches a state-of-the-art 62.32% on the AgroMind benchmark.

Abstract

Modern agricultural data is sourced from diverse platforms and spans multiple spatial scales, ranging from ground-level close-up photography to Unmanned Aerial Vehicle (UAV) aerial observation and satellite remote sensing imagery. Accordingly, agricultural multimodal reasoning demands robust cross-scale spatial understanding. However, due to the lack of multi-view agricultural benchmark datasets, existing multimodal large language models (MLLMs) exhibit severe ground-level bias, which leads to scale confusion and then semantic collapse in agricultural perception tasks, such as misinterpreting farmland imagery as walls or floors. To address this, we introduce AgroOmni, a large-scale multi-view training corpus with 288K Visual Question Answering pairs covering 56 specialized task categories across 14 task types, designed to capture diverse scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, which achieves a new state-of-the-art of 62.32% on the AgroMind benchmark (+15.03% over GPT-5.2), effectively mitigating the multi-view cross-scale gap for holistic agricultural understanding. Diagnostic evaluations on AgMMU further reveal an inherent heterogeneity between macro-priors and micro-diagnostics through constrained zero-shot performance. Meanwhile, even minimal fine-tuning leads to a dramatic performance gain of AgroNVILA on AgMMU, strongly demonstrating its generalization capability empowered by AgroOmni. Full training scripts are publicly available at https://anonymous.4open.science/r/AgroOmni-6510.

2603.14147 2026-06-09 cs.AI cs.LG Version update

An Alternative Trajectory for Generative AI

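Margarita Belova, Yuval Kansal, Yihao Liang, Jiaxin Xiao, Niraj K. Jha

Affiliation: Princeton University

AI summary: This paper proposes improving generative AI by building domain-specific superintelligence (DSS), using symbolic abstractions to strengthen domain reasoning while avoiding the model collapse associated with LLM-generated synthetic data, with the aim of sustainable development.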

Abstract

The generative artificial intelligence (AI) ecosystem is undergoing rapid transformations that threaten its sustainability. As models transition from research prototypes to high-traffic products, the energetic burden has shifted from one-time training to recurring, unbounded inference. This is exacerbated by reasoning models that inflate compute costs by orders of magnitude per query. The prevailing pursuit of artificial general intelligence through scaling of monolithic models is colliding with hard physical constraints: grid failures, water consumption, and diminishing returns on data scaling. This trajectory yields models with impressive factual recall that nonetheless struggle in domains requiring in-depth reasoning, possibly due to insufficient abstractions in training data. Current large language models (LLMs) exhibit genuine reasoning depth only in domains like mathematics and coding, where rigorous, pre-existing abstractions provide structural grounding. In other fields, the current approach fails to generalize well. We propose an alternative trajectory based on domain-specific superintelligence (DSS). We argue for first constructing explicit symbolic abstractions (knowledge graphs, ontologies, and formal logic) to underpin synthetic curricula enabling small language models to master domain-specific reasoning without the model collapse problem typical of LLM-based synthetic data methods. Rather than a single generalist giant model, we envision "societies of DSS models": dynamic ecosystems where orchestration agents route tasks to distinct DSS back-ends. This paradigm shift decouples capability from size, enabling intelligence to migrate from energy-intensive data centers to secure, on-device experts. By aligning algorithmic progress with physical constraints, DSS societies move generative AI from an environmental liability to a sustainable force for economic empowerment.

2603.13259 2026-06-09 cs.CL cs.AI Version update

How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

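Javier Marín

Affiliation: University of California, Berkeley

AI summary: The study shows that when a transformer processes factual queries, the hidden-state pathways of correct and incorrect continuations rotate apart, and that the model's late-layer preference regarding incorrect continuations is not localized.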

Abstract

When a decoder-only transformer is forced to process matched correct and incorrect single-token continuations of a factual query, the two pathways through hidden-state space diverge in a specific way: displacement vectors from the query-only representation maintain approximately equal magnitude but rotate apart in direction. The angular separation grows through mid-depth, and late layers resolve the asymmetric outcome: a logit-lens preference that, in the incorrect run, falls far below the naive prior of equal probability, corresponding to the model assigning approximately 11.5 times more probability to the incorrect token than to the correct one. We characterize this two-phase pattern (rotational divergence in mid-depth followed by late-layer asymmetric commitment) as the empirical geometric signature of what looks externally like the model rejecting a wrong continuation, while remaining explicit that it is an observational characterization, not a causal account. The pattern is consistent across six decoder-only transformers including five architecture families from 1B to 13B parameters. A seventh model (Qwen2 1.5B) shows a flat profile under the present extraction protocol that is plausibly a tokenizer-fragmentation artefact rather than a real scale floor; the question of an emergence threshold is left open. Single-layer activation patching does not recover the correct token at any layer band, meaning the late-layer asymmetry is not localized to a discrete component under the protocol used. Taken together, the evidence is consistent with a distributed-by-trajectory account of factual constraint processing (geometric structure that emerges cumulatively across many layers rather than from a single localized circuit) and inconsistent with the simplest single-layer localized-recall account.
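The geometric quantities in this abstract (displacement vectors from the query-only representation, their magnitudes, and their angular separation) can be sketched for a single layer as below. The toy 2-D hidden states are hypothetical values chosen for illustration, not model outputs, and the helper names are assumptions rather than the author's code.

```python
import math

def displacement(state, base):
    """Displacement of a hidden state from the query-only representation."""
    return [s - b for s, b in zip(state, base)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def angle_degrees(u, v):
    """Angle between two displacement vectors, in degrees."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

# Hypothetical hidden states at one layer.
h_query = [0.0, 0.0]       # query-only run
h_correct = [1.0, 0.0]     # query + correct continuation
h_incorrect = [0.0, 1.0]   # query + incorrect continuation

d_correct = displacement(h_correct, h_query)
d_incorrect = displacement(h_incorrect, h_query)
magnitude_ratio = norm(d_correct) / norm(d_incorrect)
separation = angle_degrees(d_correct, d_incorrect)
```

Repeating this per layer would give an angular-separation profile over depth; a magnitude ratio near 1 with a growing angle corresponds to the "equal magnitude, rotating apart" description.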

2603.12666 2026-06-09 cs.LG cs.AI Version update

RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

Hanbum Ko, Chanhui Lee, Ye Rin Kim, Rodrigo Hormazabal, Sehui Han, Sungbin Lim, Sungwoong Kim

Affiliations: Department of Artificial Intelligence, Korea University; Department of Statistics, Korea University; Materials Intelligence Lab, LG AI Research

AI summary: Trained with supervised fine-tuning and reinforcement learning, RetroReasoner captures chemists' disconnection-based strategic reasoning and improves the accuracy and diversity of retrosynthesis predictions.

Comments 35 pages, 19 figures

Abstract

Retrosynthesis prediction aims to identify reactants that can synthesize a given product molecule. Although molecular large language models (LLMs) have recently shown promising results, most existing methods either generate reactants directly or provide only generic product-level analysis, without explicitly reasoning about bond-disconnection strategies that justify specific reactant choices. This paper proposes RetroReasoner, a retrosynthetic reasoning model that captures chemists' strategic disconnection-based thinking. RetroReasoner is trained with supervised fine-tuning and reinforcement learning. For supervised fine-tuning, SyntheticRetro generates structured disconnection rationales paired with reactant predictions. For reinforcement learning, a round-trip reward evaluates predicted reactants by passing them through a forward synthesis model and rewarding predictions that reconstruct the original product. RetroReasoner can also be applied to multi-step retrosynthetic planning by incorporating it into a parallelized Monte Carlo tree search framework, reducing search time while increasing the number and diversity of valid synthetic pathways. Experimental results show that RetroReasoner outperforms prior baselines, including not only molecular LLMs but also retrosynthesis-specific expert models, and generates a broader range of feasible reactant proposals, especially for challenging reaction instances.
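The round-trip reward described above can be illustrated with a small sketch. Everything here is an assumption for illustration: the binary 0/1 reward, the `canonical` placeholder (a real pipeline would canonicalize SMILES with a cheminformatics toolkit), and the lookup-table `toy_forward_model` standing in for a learned forward synthesis model.

```python
def canonical(smiles):
    """Placeholder canonicalization: trim and sort dot-separated fragments."""
    return ".".join(sorted(part.strip() for part in smiles.split(".")))

def round_trip_reward(product, predicted_reactants, forward_model):
    """Return 1.0 if the forward model maps the reactants back to the product."""
    reconstructed = forward_model(predicted_reactants)
    return 1.0 if canonical(reconstructed) == canonical(product) else 0.0

def toy_forward_model(reactants):
    """Hypothetical forward model: a one-entry lookup table
    (acetic acid + ethanol -> ethyl acetate)."""
    lookup = {"CC(=O)O.OCC": "CC(=O)OCC"}
    return lookup.get(canonical(reactants), "")

reward_good = round_trip_reward("CC(=O)OCC", "OCC.CC(=O)O", toy_forward_model)
reward_bad = round_trip_reward("CC(=O)OCC", "CC.OCC", toy_forward_model)
```

A graded reward (for example, a similarity score between the reconstructed and original product) would be an alternative design; the abstract does not say which form is used.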

2603.12453 2026-06-09 cs.CL Version update

CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

Christos Tzouvaras, Konstantinos Skianis, Athanasios Voulodimos

Affiliations: University of Ioannina; National Technical University of Athens

AI summary: This paper proposes a two-stage heterogeneous ensemble that combines self-consistency with weighted voting, plus a novel post-hoc correction mechanism, Deliberative Complexity Gating, for political evasion detection, reaching a Macro-F1 score of 0.85 on the evaluation set.

Abstract

This paper describes our system for SemEval-2026 Task 6, which classifies the clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place and tying the second-best reported score.
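The combination of self-consistency and weighted voting across two models can be sketched as follows. The model names, sample lists, and weights are hypothetical, and the aggregation rule (weighting each model's per-label vote share) is one plausible reading rather than the authors' exact procedure; DCG is not sketched because the abstract does not specify its rule.

```python
from collections import Counter

LABELS = ("Clear Reply", "Ambivalent", "Clear Non-Reply")

def self_consistency(samples):
    """Share of votes each label receives across repeated samples of one model."""
    counts = Counter(samples)
    return {label: counts[label] / len(samples) for label in LABELS}

def weighted_vote(per_model_samples, weights):
    """Combine per-model vote shares using per-model weights."""
    scores = {label: 0.0 for label in LABELS}
    for model, samples in per_model_samples.items():
        shares = self_consistency(samples)
        for label in LABELS:
            scores[label] += weights[model] * shares[label]
    return max(scores, key=scores.get), scores

# Hypothetical outputs from two different LLMs, three samples each.
per_model_samples = {
    "model_a": ["Ambivalent", "Ambivalent", "Clear Reply"],
    "model_b": ["Clear Reply", "Clear Reply", "Clear Reply"],
}
weights = {"model_a": 0.6, "model_b": 0.4}
label, scores = weighted_vote(per_model_samples, weights)
```

In this toy case the more heavily weighted model leans toward Ambivalent, but the second model's unanimous samples tip the ensemble to Clear Reply.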

2502.01226 2026-06-09 cs.LG stat.ML Version update

Adaptive Prior Selection in Gaussian Process Bandits with Thompson Sampling

Jack Sandberg, Morteza Haghir Chehreghani

Affiliation: Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg

AI summary: This paper studies two algorithms for prior selection and regret minimization in Gaussian process bandits, proves a sublinear regret bound for HP-GP-TS, and validates the algorithms experimentally.

Comments 30 pages, 12 figures

Abstract

Gaussian process (GP) bandits provide a powerful framework for performing blackbox optimization of unknown functions. The characteristics of the unknown function depend heavily on the assumed GP prior. Most work in the literature assumes that this prior is known, but in practice this seldom holds. Instead, practitioners often rely on maximum likelihood estimation to select the hyperparameters of the prior, which lacks theoretical guarantees. In this work, we study two algorithms for joint prior selection and regret minimization in GP bandits based on GP Thompson sampling (GP-TS): Prior-Elimination GP-TS (PE-GP-TS) that disqualifies priors with poor predictive performance, and HyperPrior GP-TS (HP-GP-TS) that utilizes a bi-level Thompson sampling scheme. We theoretically analyze the algorithms and establish a sublinear regret bound for HP-GP-TS. In addition, we demonstrate the effectiveness of these algorithms compared to the alternatives through extensive experiments with synthetic and real-world data.
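One way a bi-level Thompson sampling step could look on a finite grid of arms is sketched below: first sample a candidate prior from its posterior given the data, then sample a function from the GP posterior under that prior and play its maximizer. This is an assumed reading of "bi-level", not the authors' algorithm; the RBF kernels, lengthscales, noise level, and observations are hypothetical.

```python
import numpy as np

def rbf_kernel(x, lengthscale):
    """Squared-exponential kernel matrix on a 1-D grid."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def _gram(K, idx, noise):
    return K[np.ix_(idx, idx)] + noise * np.eye(len(idx))

def log_evidence(K, idx, y, noise):
    """Log marginal likelihood of observations y at grid indices idx."""
    A = _gram(K, idx, noise)
    _, logdet = np.linalg.slogdet(A)
    quad = y @ np.linalg.solve(A, y)
    return -0.5 * (quad + logdet + len(idx) * np.log(2 * np.pi))

def gp_posterior(K, idx, y, noise):
    """Posterior mean and covariance over the whole grid."""
    A = _gram(K, idx, noise)
    K_star = K[:, idx]
    mean = K_star @ np.linalg.solve(A, y)
    cov = K - K_star @ np.linalg.solve(A, K_star.T)
    return mean, cov

def bilevel_ts_step(kernels, log_hyperprior, idx, y, noise, rng):
    # Level 1: sample a prior from the posterior over candidate priors.
    log_w = np.array([lp + log_evidence(K, idx, y, noise)
                      for K, lp in zip(kernels, log_hyperprior)])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    j = rng.choice(len(kernels), p=w)
    # Level 2: sample a function under that prior and pick its argmax.
    mean, cov = gp_posterior(kernels[j], idx, y, noise)
    chol = np.linalg.cholesky(cov + 1e-6 * np.eye(len(mean)))  # jitter for stability
    sample = mean + chol @ rng.standard_normal(len(mean))
    return int(np.argmax(sample)), w

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 25)
kernels = [rbf_kernel(grid, ls) for ls in (0.05, 0.2, 1.0)]
log_hyperprior = np.log(np.full(3, 1.0 / 3.0))
idx = [3, 12, 20]                      # hypothetical arms played so far
y = np.array([0.1, 0.9, -0.4])         # hypothetical noisy rewards
arm, weights = bilevel_ts_step(kernels, log_hyperprior, idx, y, noise=0.01, rng=rng)
```

Unlike maximum likelihood selection, this sketch keeps sampling from every candidate prior in proportion to its posterior weight rather than committing to a single one.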

2603.10395 2026-06-09 cs.LG Version update

Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

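Baoheng Zhu, Deyu Bo, Delvin Ce Zhang, Xiao Wang

Affiliation: University of Science and Technology of China

AI summary: This paper proposes the Graph-GRPO framework for training graph flow models with verifiable rewards, derives an expression for the transition probability, and introduces a local exploration strategy; experiments show strong results in generation quality and molecular optimization tasks.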

Comments Accepted by ICML 2026

Abstract

Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, a.k.a. the graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph and regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0% and 97.5% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on the molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.

2510.01661 2026-06-09 cs.RO Version update

Symskill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation

Yifei Simon Shao, Yuchen Zheng, Sunan Sun, Pratik Chaudhari, Vijay Kumar, Nadia Figueroa

Affiliation: GRASP Laboratory, University of Pennsylvania

AI summary: By jointly learning predicates, operators, and skills, Symskill achieves data-efficient and reactive long-horizon manipulation, combining compositional generalization with real-time recovery.

Comments ICRA 2026 Best Conference Paper Award; ICRA 2026 Best Paper Award on Planning and Control; CoRL 2025 Best Paper Award on Learning Effective Abstractions for Planning (LEAP) Workshop (https://symskill.github.io/)

Abstract

Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery. We introduce SymSkill, a unified framework that jointly learns predicates, operators, and skills from unlabeled, unsegmented demonstrations, combining compositional generalization with real-time recovery. Offline, SymSkill learns symbolic abstractions and goal-oriented skills directly from demonstrations. Online, given a conjunction of learned predicates, it uses a symbolic planner to compose and reorder skills to achieve symbolic goals while recovering from failures at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill supports safe execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with 85% success and composes them into multi-step plans without additional data. On a real Franka robot, it learns from 5 minutes of play data and performs 12-step tasks from goal specifications. Code and additional analysis are available at https://symskill.github.io/ .
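How a symbolic planner might compose skills toward a conjunction of predicates, and replan after a failure, can be illustrated with a small breadth-first search over operators with preconditions and effects. The predicates, operators, and scenario are hypothetical; in SymSkill these would be learned from demonstrations, and the planner used by the authors may differ.

```python
from collections import deque
from typing import NamedTuple

class Operator(NamedTuple):
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

def plan(initial, goal, operators):
    """Breadth-first search for a shortest operator sequence reaching the goal."""
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for op in operators:
            if op.preconditions <= state:
                nxt = (state - op.delete_effects) | op.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, steps + [op.name]))
    return None  # goal unreachable with these operators

# Hypothetical operators for a toy kitchen task.
operators = [
    Operator("open_drawer", frozenset({"drawer_closed"}),
             frozenset({"drawer_open"}), frozenset({"drawer_closed"})),
    Operator("pick_cup", frozenset({"hand_empty", "cup_on_table"}),
             frozenset({"holding_cup"}), frozenset({"hand_empty", "cup_on_table"})),
    Operator("place_cup_in_drawer", frozenset({"holding_cup", "drawer_open"}),
             frozenset({"cup_in_drawer", "hand_empty"}), frozenset({"holding_cup"})),
]
goal = {"cup_in_drawer"}
first_plan = plan({"drawer_closed", "hand_empty", "cup_on_table"}, goal, operators)

# Symbolic-level recovery: the cup was dropped after the drawer was opened,
# so planning restarts from the newly observed state.
recovery_plan = plan({"drawer_open", "hand_empty", "cup_on_table"}, goal, operators)
```

Replanning from the observed state, rather than resuming a fixed script, is what lets the same operator set handle disturbances without extra data.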

2603.08630 2026-06-09 cs.LG physics.comp-ph Version update

Integral Formulas for Vector Signal Tensor Products

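Valentin Heyraud, Zachary Weller-Davies, Jules Tilly

Affiliation: InstaDeep

AI summary: This paper derives integral formulas that simplify the Vector Signal Tensor Product; by obtaining explicit expressions for the anti-symmetric Gaunt coefficients, it enables efficient simulation of the Clebsch-Gordan tensor product with up to a 9x reduction in tensor product evaluations, laying groundwork for applications in SO(3)-equivariant neural networks.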

Comments 17 pages, 3 figures

Abstract

We derive integral formulas that simplify the Vector Signal Tensor Product recently introduced by Xie et al., which generalizes the Gaunt tensor product to anti-symmetric couplings. In particular, we obtain explicit closed-form expressions for the anti-symmetric analogues of the Gaunt coefficients. This enables us to simulate the Clebsch-Gordan tensor product using a single Vector Signal Tensor Product, yielding up to a $9\times$ reduction in the required tensor product evaluations. Our results enable efficient and practical implementations of the Vector Signal Tensor Product, paving the way for applications of this generalization of Gaunt Tensor Products in $\mathrm{SO}(3)$-equivariant neural networks. Moreover, we discuss how the Gaunt and the Vector Signal Tensor Products allow one to control the expressivity-runtime tradeoff associated with the usual Clebsch-Gordan Tensor Products. Finally, we investigate low-rank decompositions of the normalizations of the considered tensor products in view of their use in equivariant neural networks.

2603.07445 2026-06-09 cs.CL cs.LG Version update

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

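Guoli Wang, Haonan Shi, Tu Ouyang, An Wang

Affiliation: Case Western Reserve University

AI summary: Proposes the PACT framework, which constrains the model's confidence on safety-related tokens to prevent fine-tuning-induced safety-alignment drift while preserving downstream task performance.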

Comments Accepted to KDD 2026

Abstract

Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility. Our code is available at https://github.com/Glresearch1/PACT.
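The idea of regularizing only safety-related positions toward a reference model can be sketched as a masked penalty added to the task loss. The use of KL divergence, the `weight` parameter, and the toy two-token distributions are assumptions for illustration; the abstract does not state the exact form of the PACT regularizer or how safety tokens are identified.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def safety_token_penalty(ref_probs, tuned_probs, safety_mask):
    """Mean KL(reference || fine-tuned) over positions flagged as safety-related."""
    terms = [kl_divergence(r, t)
             for r, t, flagged in zip(ref_probs, tuned_probs, safety_mask) if flagged]
    return sum(terms) / len(terms) if terms else 0.0

def total_loss(task_loss, ref_probs, tuned_probs, safety_mask, weight=1.0):
    """Task loss plus the masked confidence-matching penalty."""
    return task_loss + weight * safety_token_penalty(ref_probs, tuned_probs, safety_mask)

# Hypothetical per-step output distributions over a 2-token vocabulary.
ref = [[0.9, 0.1], [0.5, 0.5]]
mask = [True, False]                      # only step 0 is safety-related
drift_on_safety = [[0.5, 0.5], [0.5, 0.5]]
drift_elsewhere = [[0.9, 0.1], [0.1, 0.9]]

penalty_safety = safety_token_penalty(ref, drift_on_safety, mask)
penalty_elsewhere = safety_token_penalty(ref, drift_elsewhere, mask)
```

Drift at the unmasked step incurs no penalty here, which mirrors the abstract's point that non-safety tokens stay largely unconstrained for task adaptation.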

2602.17911 2026-06-09 cs.CL cs.AI Version update

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han

Affiliations: University of Illinois Urbana-Champaign; National Institutes of Health

AI summary: This paper presents the CondMedQA benchmark and the Condition-Gated Reasoning framework, which builds condition-aware knowledge graphs to improve condition-dependent reasoning in biomedical question answering.

Abstract

Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to the given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.
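Gating reasoning paths on query conditions can be illustrated with edges that carry explicit applicability conditions. The `Edge` fields, the gating rule, and the placeholder entities (`drug_A`, `condition_X`, `comorbidity_Y`) are hypothetical and carry no medical meaning; the actual CGR graph schema is not described in the abstract.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    head: str
    relation: str
    tail: str
    requires: frozenset      # conditions that must be present for the edge to apply
    excluded_by: frozenset   # conditions that block the edge (e.g. contraindications)

def gate(edges, patient_conditions):
    """Keep only edges whose conditions are compatible with the query conditions."""
    conditions = frozenset(patient_conditions)
    return [e for e in edges
            if e.requires <= conditions and not (e.excluded_by & conditions)]

# Placeholder knowledge graph with three candidate answers for one question.
edges = [
    Edge("condition_X", "treated_by", "drug_A", frozenset(), frozenset({"comorbidity_Y"})),
    Edge("condition_X", "treated_by", "drug_B", frozenset(), frozenset()),
    Edge("condition_X", "treated_by", "drug_C", frozenset({"comorbidity_Y"}), frozenset()),
]

answers_plain = [e.tail for e in gate(edges, set())]
answers_with_y = [e.tail for e in gate(edges, {"comorbidity_Y"})]
```

The same question yields different surviving paths depending on the patient conditions, which is the behavior a conditional QA benchmark would test.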

2603.04865 2026-06-09 cs.SD Version update

The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights

Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Affiliations: School of Electrical Engineering, KAIST, Daejeon, Republic of Korea; University of Melbourne, Australia; Fortemedia Singapore, Singapore; Xi'an University of Posts & Telecommunications, Xi'an, China; Xi'an Lianfeng Acoustic Technologies Co., Ltd., China

AI summary: This paper presents the Environmental Sound Deepfake Detection Challenge, covering robustness evaluation, system architectures, and future research directions, and outlines key challenges and opportunities for environmental sound deepfake detection.

Comments Accepted by Interspeech 2026

Abstract

Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.

2603.05500 2026-06-09 cs.LG cs.AI cs.CL Version update

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

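Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

Affiliation: University of Cambridge

AI summary: POET-X reduces compute and memory overhead by optimizing the orthogonal equivalence transformation, enabling efficient and stable LLM training and supporting pretraining of billion-parameter models on a single H100 GPU.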

Comments ICML 2026 Oral (15 pages, 7 figures, project page: https://spherelab.ai/poetx/)

Abstract

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.
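Why an orthogonal equivalence transformation is spectrum-preserving can be checked numerically: multiplying a matrix on the left and right by orthogonal matrices leaves its singular values unchanged. The sketch below assumes the parameterization `W = R @ W0 @ P` with a fixed `W0`; the matrix sizes and random draws are hypothetical, and this is not the POET-X implementation.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # flip column signs so diag(r) is positive

rng = np.random.default_rng(0)
W0 = rng.standard_normal((4, 3))     # fixed base weight matrix (assumed)
R = random_orthogonal(4, rng)        # left orthogonal factor
P = random_orthogonal(3, rng)        # right orthogonal factor
W = R @ W0 @ P                       # orthogonal equivalence transformation

sv_before = np.linalg.svd(W0, compute_uv=False)
sv_after = np.linalg.svd(W, compute_uv=False)
```

The dense products `R @ W0 @ P` in this sketch are the kind of matrix multiplication the abstract identifies as the overhead in the original POET implementation.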

2601.21149 2026-06-09 cs.LG cs.AI 版本更新

Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

移动性嵌入的POI:从人类移动中学习场所身份与使用方式

Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出ME-POIs框架,通过对比学习将大规模人类移动数据与语言模型嵌入结合,学习场所功能,并在五个地图丰富任务上超越文本或移动性单独基线。

详情
AI中文摘要

近期地理空间基础模型的进展强调了学习真实世界位置(特别是人类活动集中的兴趣点POI)通用表示的重要性。然而,现有方法主要关注从静态文本元数据中提取的场所身份,或学习与轨迹上下文相关的表示,这些表示捕捉的是移动规律而非场所的实际使用方式(即POI的功能)。我们认为POI功能是通用POI表示中缺失但关键的信号。我们提出了移动性嵌入的POI(ME-POIs),这是一个框架,通过大规模人类移动数据增强从语言模型派生的POI嵌入,以学习基于真实世界使用的、以POI为中心且上下文无关的表示。ME-POIs将个体访问编码为时间上下文化的嵌入,并通过对比学习将其与可学习的POI表示对齐,以捕捉跨用户和时间的使用模式。为解决长尾稀疏性问题,我们提出了一种新机制,从附近频繁访问的POI跨多个空间尺度传播时间访问模式。我们在五个新提出的地图丰富任务上评估ME-POIs,测试其捕捉POI身份和功能的能力。在所有任务中,用ME-POIs增强文本嵌入始终优于纯文本和纯移动性基线。值得注意的是,仅使用移动数据训练的ME-POIs在某些任务上能超越纯文本模型,凸显了POI功能是准确且可泛化的POI表示的关键组成部分。

英文摘要

Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., POI function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.

2509.20906 2026-06-09 cs.CV cs.RO 版本更新

Distant Object Localisation from Noisy Image Segmentation Sequences

基于噪声图像分割序列的远距离目标定位

Julius Pesonen, Arno Solin, Eija Honkavaara

发表机构 * Research Council of Finland(芬兰研究理事会) RCF Flagship Forest–Human–Machine Interplay—Building Resilience, Redefining Value Networks and Enabling Meaningful Experiences (UNITE)(RCF旗舰森林-人类-机器交互——构建韧性,重新定义价值网络和赋能有意义体验(UNITE))

AI总结 针对远距离目标定位问题,提出多视图三角测量和粒子滤波两种方法,后者还能提供形状和不确定性估计,结合无人机图像分割与GNSS姿态估计实现可靠野火监测。

详情
AI中文摘要

基于相机测量序列的3D目标定位对于安全关键的监视任务(如基于无人机的野火监测)至关重要。使用相机检测到的目标定位通常可以通过专门的传感器配置或3D场景重建来解决。然而,对于远距离目标或受限于可用计算资源的任务,这两种解决方案都不可行。在本文中,我们表明该任务可以通过多视图三角测量或粒子滤波来解决,后者还提供形状和不确定性估计。我们使用3D模拟和基于无人机的图像分割序列以及基于全球导航卫星系统(GNSS)的相机姿态估计来研究这些解决方案。结果表明,将所提出的方法与现有的图像分割模型和无人机携带的计算资源相结合,可以为基于无人机的野火监测提供可靠的系统。所提出的解决方案与检测方法无关,还能快速适应类似任务。代码可在以下网址获取:https://fgi_nls.gitlab.io/public/distant-localisation

英文摘要

3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with specialised sensor configurations or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved with either multi-view triangulation or particle filters, with the latter also providing shape and uncertainty estimates. We studied the solutions using 3D simulation and drone-based image segmentation sequences with global navigation satellite system (GNSS) based camera pose estimates. The results suggest that combining the proposed methods with pre-existing image segmentation models and drone-carried computational resources yields a reliable system for drone-based wildfire monitoring. The proposed solutions are independent of the detection method, also enabling quick adaptation to similar tasks. Code is available at https://fgi_nls.gitlab.io/public/distant-localisation
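As a rough illustration of multi-view triangulation from known camera positions, the sketch below finds the point closest, in the least-squares sense, to a set of viewing rays. It is a generic textbook formulation, not the authors' code; the function name `triangulate_rays`, the camera positions, and the target are hypothetical, and the rays are noise-free.

```python
import numpy as np

def triangulate_rays(origins, directions):
    """Least-squares point closest to a set of 3D rays."""
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        # Projector onto the plane orthogonal to the ray direction.
        M = np.eye(3) - np.outer(d, d)
        A += M
        b += M @ o
    # Normal equations of sum_i ||M_i (x - o_i)||^2.
    return np.linalg.solve(A, b)

# Hypothetical setup: three camera positions observing one distant target.
target = np.array([100.0, 50.0, 0.0])
origins = np.array([[0.0, 0.0, 30.0],
                    [10.0, 0.0, 30.0],
                    [20.0, 5.0, 30.0]])
directions = target - origins          # noise-free viewing rays
estimate = triangulate_rays(origins, directions)
```

With noisy segmentation-derived rays and a short baseline relative to the target distance, the system becomes poorly conditioned along the viewing direction, which is one reason a filtering approach with explicit uncertainty can be attractive.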

2603.04125 2026-06-09 cs.CV 版本更新

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

基于特征残差判别的小样本开放集动作识别基线研究与基准

Stefano Berti, Giulia Pasquale, Lorenzo Natale

发表机构 * Humanoid Sensing and Perception, Istituto Italiano di Tecnologia, Genoa, Italy(人形传感与感知实验室,意大利理工学院,意大利热那亚)

AI总结 针对小样本动作识别在开放集场景下的不足,提出基于特征残差判别器的架构扩展,在五个数据集上实现未知类拒绝能力提升且不损失闭集精度,设立新基准。

详情
AI中文摘要

小样本动作识别(FS-AR)已显示出有希望的结果,但常受限于闭集假设,在现实开放集场景中失效。虽然小样本开放集(FSOS)识别在图像领域已很成熟,但其在时空视频数据上的扩展仍未被充分探索。为解决此问题,我们提出基于特征残差判别器(FR-Disc)的架构扩展,将先前在骨骼数据上的工作适配到更复杂的视频领域。在五个数据集上的大量实验表明,虽然常见的开放集技术仅提供边际增益,但我们的FR-Disc显著增强了未知类拒绝能力,且不损害闭集精度,为FSOS-AR设立了新的最先进水平。项目网站、代码和基准可在以下网址获取:https://hsp-iit.github.io/fsosar/

英文摘要

Few-Shot Action Recognition (FS-AR) has shown promising results but is often limited by a closed-set assumption that fails in real-world open-set scenarios. While Few-Shot Open-Set (FSOS) recognition is well-established for images, its extension to spatio-temporal video data remains underexplored. To address this, we propose an architectural extension based on a Feature-Residual Discriminator (FR-Disc), adapting previous work on skeletal data to the more complex video domain. Extensive experiments on five datasets demonstrate that while common open-set techniques provide only marginal gains, our FR-Disc significantly enhances unknown rejection capabilities without compromising closed-set accuracy, setting a new state-of-the-art for FSOS-AR. The project website, code, and benchmark are available at: https://hsp-iit.github.io/fsosar/.

2603.01613 2026-06-09 cs.CV 版本更新

Uncertainty-Aware Hierarchical Re-Localization in OpenStreetMap via Semantic Alignment

基于语义对齐的OpenStreetMap中不确定性感知分层重定位

Yuchen Zou, Xiao Hu, Lihuang Fang, Yuqing Tang

发表机构 * International Digital Economy Academy(国际数字经济学院) School of Automation Science and Engineering, Xi’an Jiaotong University(西安交通大学自动化科学与工程学院) Department of Electronic and Electrical Engineering, Southern University of Science and Technology(南方科技大学电子与电气工程系)

AI总结 提出不确定性感知分层搜索框架,利用目标级DINO-ViT令牌减少跨模态差异,通过粗FFT相关和不确定性控制的局部细化实现高效定位,在精度和速度上显著优于现有方法。

Comments 7 pages, 4 figures

详情
AI中文摘要

单目重定位使机器人能够从视觉观测中估计相机姿态。然而,许多现有方法依赖密集地图或大型参考图像数据库,面临可扩展性限制和隐私风险。OpenStreetMap(OSM)作为一种轻量级隐私保护地图,提供具有全局可扩展性的语义和几何信息。尽管如此,由于自然图像与OSM之间的跨模态差异以及基于全局地图定位的高成本,OSM定位仍然具有挑战性。在本文中,我们提出了一种具有语义对齐的不确定性感知分层搜索框架,用于OSM中的定位。首先,利用目标级DINO-ViT令牌来减少地面视角观测与OSM向量之间的语义差距。其次,将全局密集匹配分解为粗FFT相关和不确定性控制的局部细化。大量实验表明,我们的方法显著提高了定位精度和速度。在单个数据集上训练时,我们方法的3°方向召回率甚至优于最先进方法的5°召回率。

英文摘要

Monocular re-localization enables robots to estimate camera poses from visual observations. However, many existing methods rely on dense maps or large reference image databases, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight privacy-preserving map, offers semantic and geometric information with global scalability. Nonetheless, OSM localization remains challenging due to cross-modal discrepancies between natural images and OSM, as well as the high cost of global map-based localization. In this paper, we propose an uncertainty-aware hierarchical search framework with semantic alignment for localization in OSM. First, object-centric DINO-ViT tokens are exploited to reduce the semantic gap between ground-view observations and OSM vectors. Second, global dense matching is decomposed into coarse FFT correlation and uncertainty-controlled local refinement. Extensive experiments demonstrate that our method significantly improves localization accuracy and speed. When trained on a single dataset, the 3$^\circ$ orientation recall of our method even outperforms the 5$^\circ$ recall of state-of-the-art methods.

2602.24181 2026-06-09 cs.CV cs.AI 版本更新

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

混合饮食使DINO成为杂食视觉编码器

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra

发表机构 * Google DeepMind(谷歌DeepMind) University College London(伦敦大学学院)

AI总结 针对DINOv2等预训练视觉编码器在不同视觉模态间特征对齐差的问题,提出杂食视觉编码器,通过后训练框架学习模态无关特征空间,实现跨模态鲁棒理解。

Comments CVPR 2026 Highlight

详情
AI中文摘要

预训练的视觉编码器(如DINOv2)在单模态任务上表现出色。然而,我们观察到它们的特征在不同视觉模态之间对齐不佳。例如,同一场景的RGB图像及其对应深度图的特征嵌入,其余弦相似度与两个随机不相关图像几乎相同。为了解决这个问题,我们提出了杂食视觉编码器,一种学习模态无关特征空间的后训练框架。我们通过双重目标微调编码器:首先,最大化同一场景不同模态之间的特征对齐;其次,一个蒸馏目标,将学习到的表示锚定到完全冻结的教师模型。由此产生的学生编码器通过为给定场景生成更一致的嵌入(无论输入模态是RGB、深度、分割等)而变得“杂食”。这种方法在保留原始基础模型的判别语义的同时,实现了鲁棒的跨模态理解。杂食模型权重可在以下网址获取:https://github.com/google-deepmind/representations4d

英文摘要

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their features are poorly aligned across different visual modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a post-training framework that learns a modality-agnostic feature space. We fine-tune the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to a fully frozen teacher. The resulting student encoder becomes "omnivorous" by producing more consistent embeddings for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model. Omnivorous model weights are available at https://github.com/google-deepmind/representations4d.

2509.24762 2026-06-09 cs.LG 版本更新

In-Context Learning of Temporal Point Processes with Foundation Inference Models

基于基础推理模型的时间点过程上下文学习

David Berghaus, Patrick Seifner, Kostadin Cvejoski, César Ojeda, Ramsés J. Sánchez

发表机构 * Lamarr Institute(拉马尔研究所) Fraunhofer IAIS(弗劳恩霍夫人工智能研究所) University of Bonn(波恩大学) JetBrains Research(JetBrains研究) University of Potsdam(波茨坦大学)

AI总结 提出一种基于摊销推理和上下文学习的点过程基础推理模型FIM-PP,通过大规模合成数据预训练,无需额外训练即可估计真实MTPP,或快速微调至目标系统。

Comments This paper is published as a conference paper at ICLR 2026

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
AI中文摘要

利用带标记的时间点过程(MTPP)对多种事件类型的事件序列进行建模,为揭示支配性动态规则和预测未来事件提供了一种原则性方法。当前MTPP推理的神经网络方法依赖于为每个目标系统训练单独的专用模型。我们采用一种截然不同的方法:利用摊销推理和上下文学习,预训练一个深度神经网络,以从由事件序列集合定义的上下文中推断事件历史的条件强度函数。预训练是在从广泛霍克斯过程分布中采样的大规模合成MTPP数据集上进行的。预训练后,我们的点过程基础推理模型(FIM-PP)可以在无需任何额外训练的情况下从真实世界数据中估计MTPP,或者快速微调至目标系统。实验表明,这种摊销方法在常见基准数据集上的下一事件预测任务中与专用模型的性能相匹配。

英文摘要

Modeling event sequences of multiple event types with marked temporal point processes (MTPPs) provides a principled way to uncover governing dynamical rules and predict future events. Current neural network approaches to MTPP inference rely on training separate, specialized models for each target system. We pursue a radically different approach: drawing on amortized inference and in-context learning, we pretrain a deep neural network to infer, in-context, the conditional intensity functions of event histories from a context defined by sets of event sequences. Pretraining is performed on a large synthetic dataset of MTPPs sampled from a broad distribution of Hawkes processes. Once pretrained, our Foundation Inference Model for Point Processes (FIM-PP) can estimate MTPPs from real-world data without any additional training, or be rapidly finetuned to target systems. Experiments show that this amortized approach matches the performance of specialized models on next-event prediction across common benchmark datasets.
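For readers unfamiliar with the quantity being inferred, the sketch below evaluates the conditional intensity of a multivariate Hawkes process with an exponential kernel, one common member of the Hawkes family from which the abstract says the synthetic training data are sampled. The exponential kernel, the parameter values, and the function name `hawkes_intensity` are assumptions for illustration; the paper's sampling distribution may use other kernels.

```python
import numpy as np

def hawkes_intensity(t, event_times, event_marks, mu, alpha, beta):
    """Conditional intensity per mark k at time t:
    lambda_k(t) = mu[k] + sum_{t_i < t} alpha[k, m_i] * exp(-beta * (t - t_i))
    """
    past = event_times < t
    decay = np.exp(-beta * (t - event_times[past]))
    # alpha[k, m] is the excitation that an event of mark m adds to mark k.
    return mu + alpha[:, event_marks[past]] @ decay

# Hypothetical two-mark example.
mu = np.array([0.2, 0.1])                  # baseline rates
alpha = np.array([[0.5, 0.0],
                  [0.3, 0.4]])             # cross-excitation weights
beta = 1.0                                 # decay rate
event_times = np.array([1.0, 2.0])
event_marks = np.array([0, 1])

lam_before = hawkes_intensity(0.5, event_times, event_marks, mu, alpha, beta)
lam_after = hawkes_intensity(3.0, event_times, event_marks, mu, alpha, beta)
```

Before any event the intensity equals the baseline `mu`; after events it is raised by exponentially decaying contributions, which is the history dependence an amortized model has to infer from context sequences.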

2602.22919 2026-06-09 cs.CV 版本更新

Chain of Flow: ECG-Conditioned 4D Cardiac Cine Generation from Patient-Specific Anatomical Anchor

流动链:基于患者特定解剖锚点的ECG条件4D心脏电影生成

Haofan Wu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang

发表机构 * School of Engineering, College of Engineering and Physical Sciences, University of Birmingham(英国伯明翰大学工程学院) William Harvey Research Institute, NIHR Barts Biomedical Research Centre, Queen Mary University London(伦敦玛丽女王大学威廉·哈维研究所) Barts Heart Centre, St Bartholomew’s Hospital, Barts Health NHS Trust(巴茨心脏中心,圣巴塞洛缪医院,巴茨健康NHS信托) Division of Cardiology, Johns Hopkins University School of Medicine(约翰霍普金斯大学医学院心脏病科)

AI总结 提出Chain of Flow (COF)框架,利用患者特定MRI和当前ECG生成4D心脏电影,在UK Biobank上实现高图像保真度和下游功能性能。

详情
AI中文摘要

心脏电影磁共振成像(MRI)是功能性心脏评估的核心,然而在分析时可能无法直接获得完整的当前电影序列。我们引入了流动链(COF),这是一个心电图(ECG)条件框架,结合患者特定MRI和当前ECG,用于生成特定于受试者的4D心脏电影。在UK Biobank数据集上,COF在共享同次就诊可评估基准上实现了强图像级保真度和下游功能导向性能。多切片和多分辨率分析表明,在短轴堆叠和异质采集分辨率上,结构生成质量稳定。跨重采样输入MRI相位的受控相位鲁棒性分析进一步提供了同次就诊代理支持,当目标MRI相位未直接观察到时,使用患者特定MRI加当前ECG。跨次就诊路线提供了探索性序列证据,在当前面向感兴趣区域读出中增益最明显。疾病类别功能审计和病例级容积轨迹证据审查进一步描绘了当前患者特定MRI加ECG公式在解剖感知下游心脏分析中保持稳定的情况。代码可在 https://anonymous.4open.science/r/COF-paper-release-C88B 获取。

英文摘要

Cardiac cine magnetic resonance imaging (MRI) is central to functional cardiac assessment, yet a full current cine sequence may not always be directly available at the point of analysis. We introduce Chain of Flow (COF), an electrocardiography (ECG)-conditioned framework that combines patient-specific MRI and current ECG for subject-specific 4D cardiac cine generation. On the UK Biobank dataset, COF achieves strong image-level fidelity and downstream function-oriented performance on a shared same-visit evaluable benchmark. Multi-slice and multi-resolution analyses indicate stable structural generation quality across the short-axis stack and heterogeneous acquisition resolutions. Controlled phase-robustness analyses across resampled input MRI phases further provide same-visit proxy support for patient-specific MRI plus current ECG when a target MRI phase is not directly observed. A cross-visit route provides exploratory serial evidence, with the clearest gains in current-facing region-of-interest readout. Disease-category functional audits and case-level volume-trajectory evidence review further delineate where the current patient-specific MRI plus ECG formulation remains stable for anatomy-aware downstream cardiac analysis. Code is available at https://anonymous.4open.science/r/COF-paper-release-C88B.

2602.22766 2026-06-09 cs.CL 版本更新

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

想象力有助于视觉推理,但尚未在潜在空间中实现

You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过因果中介分析发现,多模态大语言模型中的潜在推理存在输入-潜在和潜在-答案两个关键断连,表明其有效性有限,并提出显式想象方法CapImagine,在视觉推理任务中表现更优。

Comments ICML 2026 Poster

详情
AI中文摘要

潜在视觉推理旨在通过在多模态大语言模型的隐藏状态中进行冥想,模仿人类的想象过程。虽然被认为是视觉推理的一种有前景的范式,但其有效性的潜在机制仍不清楚。为了揭示其功效的真正来源,我们使用因果中介分析来研究潜在推理的有效性。我们将该过程建模为因果链:输入作为处理变量,潜在标记作为中介变量,最终答案作为结果变量。我们的发现揭示了两个关键的断连:(a) 输入-潜在断连:对输入进行剧烈扰动导致潜在标记的变化可以忽略不计,表明潜在标记未能有效关注输入序列。(b) 潜在-答案断连:对潜在标记的扰动对最终答案的影响极小,表明潜在标记对结果施加的因果效应有限。此外,广泛的探测分析显示,潜在标记编码的视觉信息有限且表现出高度相似性。因此,我们质疑潜在推理的必要性,并提出了一种简单的替代方法CapImagine,该方法教会模型使用文本进行显式想象。在视觉中心基准上的实验表明,CapImagine显著优于复杂的潜在空间基线,突显了通过显式想象进行视觉推理的优越潜力。

英文摘要

Latent visual reasoning aims to mimic the human imagination process by meditating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating that latent tokens exert only a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.

2601.22669 2026-06-09 cs.LG 版本更新

Beyond Fixed Rounds: Data-Free Early Stopping for Practical Federated Learning

超越固定轮次:面向实际联邦学习的无数据早停法

Youngjoon Lee, Hyukjoon Lee, Seungrok Jung, Andy Luo, Jinu Gong, Yang Cao, Joonhyuk Kang

发表机构 * arXiv

AI总结 提出一种无数据早停框架,通过监控任务向量增长率确定最优停止点,在皮肤病变/血细胞/结肠病理分类任务中达到与基于验证集的早停相当的性能,且仅需少量额外轮次。

Comments Under Review

详情
AI中文摘要

联邦学习(FL)无需传输原始数据即可实现去中心化协作学习。然而,依赖固定的全局轮次或验证数据进行超参数调优会带来高计算成本和隐私风险,阻碍了实际部署。为解决这一问题,我们提出了一种无数据早停框架,该框架仅使用服务器端参数监控任务向量的增长率来确定最优停止点。在皮肤病变/血细胞/结肠病理分类上的数值结果表明,我们的方法与多种最先进FL方法中基于验证集的早停性能相当。特别是,所提出的框架平均需要45/12/31(皮肤病变/血细胞/结肠病理)额外轮次即可实现比基于验证数据早停高12.3%/8.9%/3.9%的性能。此外,该框架仅需9/8/14额外轮次即可筛选不良配置,不到固定轮次预算的3%。据我们所知,这是首个为FL方法提出的无数据早停框架。我们的代码已开源。

英文摘要

Federated Learning (FL) facilitates decentralized collaborative learning without transmitting raw data. However, reliance on fixed global rounds or validation data for hyperparameter tuning hinders practical deployment by incurring high computational costs and privacy risks. To address this, we propose a data-free early stopping framework that determines the optimal stopping point by monitoring the task vector's growth rate using only server-side parameters. The numerical results on skin lesion/blood cell/colon pathology classification demonstrate that our approach is comparable to the validation-based early stopping across various state-of-the-art FL methods. In particular, the proposed framework requires an average of 45/12/31 (skin lesion/blood cell/colon pathology) additional rounds to achieve over 12.3%/8.9%/3.9% higher performance than early stopping based on validation data. Moreover, the proposed framework requires only 9/8/14 additional rounds to screen bad configurations, which is less than 3% of the fixed-round budget. To the best of our knowledge, this is the first work to propose a data-free early stopping framework for FL methods. Our code is available at this open repository.

2602.21172 2026-06-09 cs.AI cs.CV 版本更新

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NoRD: 一种无需推理的高数据效率视觉-语言-动作模型

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition Texas A&M University(德克萨斯农工大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出NoRD模型,通过无需推理标注和仅需<60%数据微调,结合Dr. GRPO算法克服难度偏差,实现与现有VLA模型相当的性能,显著降低数据与计算开销。

Comments Accepted to CVPR 2026. Code available at: https://github.com/Applied-Open-Source/nord

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过统一的端到端架构取代模块化流水线,推动了自动驾驶的发展。然而,当前的VLA模型面临两个昂贵的要求:(1)大规模数据集收集,(2)密集的推理标注。在这项工作中,我们通过NoRD(无需推理驾驶)解决了这两个挑战。与现有的VLA模型相比,NoRD在仅使用<60%的数据且无需推理标注的情况下实现了竞争性能,从而减少了3倍的token数量。我们发现,当将标准组相对策略优化(GRPO)应用于在这种小规模、无推理数据集上训练的策略时,它未能产生显著的改进。我们表明,这种限制源于难度偏差,它不成比例地惩罚了GRPO中产生高方差rollout的场景的奖励信号。NoRD通过引入Dr. GRPO(一种旨在减轻LLM中难度偏差的最新算法)克服了这一限制。因此,NoRD在Waymo和NAVSIM上以极少的训练数据和零推理开销实现了竞争性能,从而实现了更高效的自主系统。网站:https://nord-vla-ai.github.io/

英文摘要

Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NORD (No Reasoning for Driving). Compared to existing VLAs, NORD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3x fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NORD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NORD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems. Website: https://nord-vla-ai.github.io/
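The difficulty-bias argument can be made concrete by comparing two ways of turning a group of rollout rewards into advantages. The sketch below assumes the commonly described formulations: GRPO standardizes rewards within a group (subtract the mean, divide by the standard deviation), while Dr. GRPO, as we understand it, keeps only the mean subtraction (it also changes a length normalization that is not shown here). The reward values and function names are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-standardized advantages: every group is rescaled to unit spread.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards):
    # Mean-centered advantages without the per-group std division (assumed form).
    return rewards - rewards.mean()

# Hypothetical rollout rewards for two driving scenarios.
low_var = np.array([0.50, 0.52, 0.48, 0.50])   # rollouts nearly agree
high_var = np.array([0.0, 1.0, 0.0, 1.0])      # rollouts disagree strongly

grpo_low, grpo_high = grpo_advantages(low_var), grpo_advantages(high_var)
dr_low, dr_high = dr_grpo_advantages(low_var), dr_grpo_advantages(high_var)
```

Under the standardized form both groups end up with the same advantage scale, so the large reward differences in the high-variance scenario are shrunk relative to the tiny differences in the low-variance one; without the division, the high-variance scenario keeps its larger signal.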

2602.21889 2026-06-09 cs.AI cs.LG 版本更新

2-Step Agent: A Framework for the Interaction of a Decision Maker with AI Decision Support

2-Step Agent: 一个用于决策者与AI决策支持交互的框架

Otto Nyberg, Fausto Carcassi, Davide Tugnoli, Giovanni Cinà

发表机构 * Department of Medical Informatics, Amsterdam UMC University of Amsterdam(医学信息学系,阿姆斯特丹大学医学中心,阿姆斯特丹大学) Institute for Logic, Language and Computation, University of Amsterdam(逻辑、语言和计算研究所,阿姆斯特丹大学) Department of Mathematics and Earth Sciences, University of Trieste(数学与地球科学系,的里雅斯特大学)

AI总结 本文提出2-Step Agent框架,用于研究决策者如何学习和利用基于机器学习的决策支持,并揭示了即使在理想条件下,ML-DS也可能导致更严重的负面影响。

Comments 17 pages, 17 figures

详情
AI中文摘要

机器学习模型的预测支持人类在多个领域做出决策,包括高风险领域如医疗和司法。然而,我们仍然缺乏对决策者如何从基于机器学习的决策支持(ML-DS)中学习的清晰理解。在本文中,我们介绍了一个通用的计算框架,即2-Step Agent,以捕捉这一过程。由于机器学习模型的预测包含关于训练数据的信息,预测也可以用于推断。我们的框架建模了(i)针对新观测的预测如何影响理性贝叶斯代理的信念,以及(ii)这种信念变化如何影响因果效应的估计、下游决策和后续结果。除了框架本身外,我们还做出了三个贡献。首先,在线性高斯设定下,我们推导出了解决我们引入的具有挑战性的贝叶斯推断问题的可计算解,即代理从ML预测中推断。其次,我们通过实验确定了ML-DS有益的条件。第三,我们证明了即使ML模型设定正确,且代理是完全理性的,单个不一致的先验信念也足以使ML-DS导致比没有决策支持更差的下游结果。因此,即使在理想条件下,ML-DS也可能弊大于利。

英文摘要

Predictions from ML models support human decision making in several fields, including high-stakes ones such as healthcare and the judiciary. Yet, we still lack a clear understanding of how decision makers learn from ML-based decision support (ML-DS). In this paper, we introduce a general computational framework, the 2-Step Agent, to capture this process. As a prediction from an ML model contains information about the training data, a prediction can also be used for inference. Our framework models (i) how a prediction for a new observation affects the beliefs of a rational Bayesian agent, and (ii) how this change in beliefs affects the estimation of causal effect, the downstream decision, and the subsequent outcome. In addition to the framework itself, we make three contributions. First, for the linear Gaussian setting, we derive a tractable solution for the challenging Bayesian inference problem we introduced, i.e. one in which the agent infers from an ML prediction. Second, we experimentally identify conditions under which ML-DS is beneficial. Third, we show that a single misaligned prior belief can be sufficient for ML-DS to lead to worse downstream outcomes compared to no decision support even when the ML model is well-specified and the agent is perfectly rational. Hence, even under ideal conditions, ML-DS can do more harm than good.

2602.19330 2026-06-09 cs.LG 版本更新

CTS-Bench: Benchmarking Graph Coarsening Trade-offs for GNNs in Clock Tree Synthesis

CTS-Bench: 面向时钟树综合中GNN的图粗化权衡基准测试

Barsat Khadka, Kawsher Roxy, Md Rubel Ahmed

发表机构 * The University of Southern Mississippi(南密西西比大学) Intel Corporation(英特尔公司) Louisiana Tech University(路易斯安那理工大学)

AI总结 提出CTS-Bench基准套件,系统评估图粗化对GNN在时钟树综合中预测精度与计算效率的权衡,发现粗化虽降低内存和加速训练,但会移除关键结构信息导致零样本评估下R²为负。

Comments Accepted to ML Bench'26 ASPLOS

详情
AI中文摘要

图神经网络(GNN)在电子设计自动化中的物理设计分析中越来越受到关注,特别是用于建模时钟树综合行为,如时钟偏斜和缓冲复杂性。然而,由于在原始门级网表上操作的内存和运行时间成本过高,实际部署仍然有限。图粗化通常用于提高可扩展性,但其对CTS关键学习目标的影响尚未得到充分表征。本文介绍了CTS-Bench,一个基准测试套件,用于系统评估基于GNN的CTS分析中图粗化、预测精度和计算效率之间的权衡。CTS-Bench包含跨越五个架构的4,860个收敛的物理设计解决方案,并提供来自布局后设计的配对原始门级和聚类图表示。以时钟偏斜预测作为代表性CTS任务,我们展示了明确的精度-效率权衡。虽然图粗化将GPU内存使用减少高达17.2倍,并将训练加速高达3倍,但它也移除了对建模时钟分布至关重要的结构信息,经常导致零样本评估下R²为负。我们的发现表明,即使全局物理指标保持不变,通用图聚类技术也可能从根本上损害CTS学习目标。CTS-Bench支持对CTS感知的图粗化策略进行原则性评估,支持在现实物理设计约束下对GNN架构和加速器进行基准测试,并为开发学习辅助的CTS分析和优化技术提供了基础。

英文摘要

Graph Neural Networks (GNNs) are increasingly explored for physical design analysis in Electronic Design Automation, particularly for modeling Clock Tree Synthesis behavior such as clock skew and buffering complexity. However, practical deployment remains limited due to the prohibitive memory and runtime cost of operating on raw gate-level netlists. Graph coarsening is commonly used to improve scalability, yet its impact on CTS-critical learning objectives is not well characterized. This paper introduces CTS-Bench, a benchmark suite for systematically evaluating the trade-offs between graph coarsening, prediction accuracy, and computational efficiency in GNN-based CTS analysis. CTS-Bench consists of 4,860 converged physical design solutions spanning five architectures and provides paired raw gate-level and clustered graph representations derived from post-placement designs. Using clock skew prediction as a representative CTS task, we demonstrate a clear accuracy-efficiency trade-off. While graph coarsening reduces GPU memory usage by up to 17.2x and accelerates training by up to 3x, it also removes structural information essential for modeling clock distribution, frequently resulting in negative $R^2$ scores under zero-shot evaluation. Our findings indicate that generic graph clustering techniques can fundamentally compromise CTS learning objectives, even when global physical metrics remain unchanged. CTS-Bench enables principled evaluation of CTS-aware graph coarsening strategies, supports benchmarking of GNN architectures and accelerators under realistic physical design constraints, and provides a foundation for developing learning-assisted CTS analysis and optimization techniques.
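A negative $R^2$ may look surprising, so a short worked example helps: $R^2 = 1 - SS_{res}/SS_{tot}$ drops below zero whenever a model's squared error exceeds that of simply predicting the mean of the targets. The skew values and predictions below are made up for illustration and are not taken from CTS-Bench.

```python
import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical clock-skew targets and two predictors.
skew = np.array([10.0, 12.0, 14.0, 16.0])
mean_baseline = np.full_like(skew, skew.mean())   # always predicts the mean
poor_model = np.array([16.0, 14.0, 12.0, 10.0])   # gets the ordering backwards

r2_baseline = r2_score(skew, mean_baseline)
r2_poor = r2_score(skew, poor_model)
```

Here the mean baseline scores exactly 0 while the reversed predictor scores -3, so a negative score under zero-shot evaluation means the model transfers worse than a constant guess.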

2602.18695 2026-06-09 cs.LG 版本更新

Insertion Based Sequence Generation with Learnable Order Dynamics

基于可学习顺序动态的插入式序列生成

Dhruvesh Patel, Benjamin Rozonoyer, Gaurav Pandey, Tahira Naseem, Ramón Fernandez Astudillo, Andrew McCallum

发表机构 * University of Washington(华盛顿大学) Google Research(谷歌研究院)

AI总结 提出LoFlexMDM,一种具有可学习顺序动态的插入式掩码扩散模型,通过学习数据依赖的插入和解掩码速率,在分子生成任务上提升样本质量。

Comments Some updated results. Accepted at ICML 2026. Code and checkpoints available at https://github.com/dhruvdcoder/LoFlexMDM

详情
AI中文摘要

现有的基于插入的掩码扩散模型通过交替进行token插入和解掩码来生成序列,它们使用固定的调度,不依赖于数据。对于像图和分子这样的结构化序列,学习数据依赖的生成顺序可以通过减少动作空间的不确定性来提高生成质量。我们提出了LoFlexMDM,一种具有可学习顺序动态的插入式掩码扩散模型,它学习数据依赖的插入和解掩码速率。我们将离散流匹配框架推广到处理变长序列,提出了一种可处理的调度参数化方法以及一个用于联合训练生成器和目标顺序动态的训练目标。在从头设计和片段约束的分子生成任务中,LoFlexMDM相比FlexMDM分别将样本质量提升了高达17.5%和6.7%。这些结果表明,学习目标生成顺序可以在不牺牲可处理训练的情况下改进插入式扩散模型。我们在以下网址开源了代码:https://this URL。

英文摘要

Existing insertion-based masked diffusion models that generate sequences by interleaving token insertion with unmasking use fixed schedules that are not dependent on the data. For structured sequences like graphs and molecules, learning data-dependent generation orders can improve generation quality by reducing uncertainty over the action space. We propose LoFlexMDM, an insertion-based masked diffusion model with learnable order dynamics that learns data-dependent insertion and unmasking rates. We generalize the discrete flow matching framework to work with variable-length sequences, propose a tractable schedule parameterization and a training objective for joint training of the generator and the target order dynamics. On De Novo and fragment-constrained molecule generation, LoFlexMDM improves sample quality over FlexMDM by up to 17.5% and 6.7%, respectively. These results show that learning the target generation order can improve insertion-based diffusion models without giving up tractable training. We open source the code at https://github.com/dhruvdcoder/LoFlexMDM.

2602.18020 2026-06-09 cs.CV cs.RO 版本更新

UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

UAOR: 面向视觉-语言-动作模型的不确定性感知观测重注入

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Zichen Wen, Bowen Fang, Tao Yu, Xiangnan Wu, Qisen Ma, Kai Wang, Ziheng He, Yingda Li, Zhengbo Zhang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所模式识别新技术实验室) Shanghai Jiao Tong University(上海交通大学) FiveAges(五代)

AI总结 提出UAOR模块,通过动作熵检测不确定性,在语言模型高不确定层重注入观测信息,无需额外训练或数据,提升VLA模型在仿真和真实任务中的性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型利用预训练的视觉-语言模型(VLM)作为骨干,将图像和指令映射到动作,展现出在可泛化机器人操作中的显著潜力。为了提升性能,现有方法通常引入额外的观测线索(如深度图、点云)或辅助模块(如目标检测器、编码器),以实现更精确和可靠的任务执行,但这些方法通常需要昂贵的数据收集和额外训练。受语言模型中的前馈网络(FFN)可作为“键值记忆”的发现启发,我们提出不确定性感知观测重注入(UAOR),一种有效、无需训练且即插即用的VLA模型模块。具体地,当当前语言模型层表现出由动作熵衡量的高不确定性时,它通过注意力检索将关键观测信息重注入下一层的前馈网络(FFN)。该机制直接在高不确定性层用观测证据增强隐藏状态,从而实现更准确和可靠的动作生成。综合实验表明,我们的方法以最小开销一致地提升了多种VLA模型在仿真和真实任务中的性能。值得注意的是,UAOR消除了对额外观测线索或模块的需求,使其成为现有VLA流程中通用且实用的即插即用组件。项目页面见 https://uaor.jiabingyang.cn

英文摘要

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism directly augments the hidden states with observation evidence at high-uncertainty layers, enabling more accurate and reliable action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.

2602.17337 2026-06-09 cs.CV 版本更新

Polaffini: A feature-based approach for robust affine and polyaffine image registration

Polaffini: 一种基于特征的鲁棒仿射和多仿射图像配准方法

Antoine Legouhy, Cosimo Campo, Ross Callaghan, Hojjat Azadbakht, Hui Zhang

发表机构 * Hawkes Institute & Department of Computer Science, University College London, London, UK(伦敦大学学院霍克斯研究所及计算机科学系,英国伦敦) Institut Pasteur, Université Paris Cité, Unité de Neuroanatomie Appliquée et Théorique(巴斯德研究所,巴黎城市大学,应用与理论神经解剖学单元) AINOSTICS ltd., Manchester, UK(AINOSTICS有限公司,曼彻斯特,英国)

AI总结 提出Polaffini框架,利用深度学习分割模型提取解剖对应点,通过闭式解实现全局和局部仿射匹配,生成从仿射到多仿射的可调平滑变换,在结构对齐和下游非线性配准初始化上优于传统方法。

Comments associated github repo: https://github.com/CIG-UCL/polaffini

详情
AI中文摘要

在这项工作中,我们提出了Polaffini,一个稳健且通用的解剖学基础配准框架。医学图像配准主要由基于强度的配准方法主导,这些方法依赖于对齐质量的替代度量。相比之下,基于特征的方法通过识别明确的解剖对应点进行操作,理论上更理想,但由于难以可靠地提取特征而基本不再受青睐。然而,得益于深度学习的近期进展,这些挑战现已显著克服,预训练的分割模型能够即时提供可靠、精细的解剖描绘。我们旨在证明这些进展可用于创建新的解剖学基础图像配准算法。为此,我们提出Polaffini,它从这些分割区域中以特别简单的方式获得具有一一对应关系的解剖学基础特征点:提取它们的质心。这些特征点通过闭式解实现高效的全局和局部仿射匹配。这些匹配用于生成从仿射到多仿射的整体变换,并具有可调平滑度。多仿射变换比仿射变换具有更多的自由度,允许更精细的对齐,并且它们在对数-欧几里得框架中的嵌入确保了微分同胚性质。Polaffini既可用于独立配准,也可作为后续非线性配准的预对齐,我们将其与流行的基于强度的配准技术进行了评估。结果表明,Polaffini在结构对齐方面优于竞争方法,并为下游非线性配准提供了改进的初始化。Polaffini快速、稳健且准确,使其特别适合集成到医学图像处理流程中。

英文摘要

In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties. Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
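The centroid-plus-closed-form idea can be sketched in a few lines. The example below extracts one centroid per labelled region and fits a single global affine map to matched points by ordinary least squares; it is an illustrative stand-in under stated assumptions, not the Polaffini implementation (see the associated repository for that). The function names, the toy label map, and the synthetic point sets are hypothetical, and the local, polyaffine, and log-Euclidean parts are not shown.

```python
import numpy as np

def label_centroids(seg, labels):
    # One feature point per segmented region: the mean voxel coordinate.
    return np.array([np.argwhere(seg == lab).mean(axis=0) for lab in labels])

def fit_affine(src, dst):
    # Closed-form least-squares fit of dst ≈ src @ A.T + t.
    src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coordinates
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return params[:-1].T, params[-1]                   # A (d x d), t (d,)

# Toy 2D label map with two square regions.
seg = np.zeros((4, 4), dtype=int)
seg[0:2, 0:2] = 1
seg[2:4, 2:4] = 2
centroids = label_centroids(seg, labels=[1, 2])

# Synthetic matched 3D points related by a known affine map.
rng = np.random.default_rng(0)
src = rng.standard_normal((6, 3))
A_true = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
t_true = np.array([1.0, -2.0, 0.5])
dst = src @ A_true.T + t_true
A_est, t_est = fit_affine(src, dst)
```

Because matched centroids come with 1-to-1 correspondence, no iterative optimization over an intensity-based similarity metric is needed for this step; the fit reduces to a small linear solve.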