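arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.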

2605.23278 2026-06-01 cs.CL stat.ML

When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

Francesco Corielli

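AI summary: By distinguishing the full conditional language process, the marginal text process, and the model-induced distribution, the paper argues that the usefulness of next-token prediction depends on strong assumptions (stationarity, representativeness, ergodicity) and on the observed prefix being sufficient for the latent context, and it explains RAG and tool use as conditional sufficiency mechanisms.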

Abstract

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
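
To make the information-theoretic condition concrete, here is a minimal sketch, not taken from the paper, that computes the conditional mutual information I(Y; C | X) between a next token Y and an omitted circumstance C given a prefix X on a toy joint distribution. The function name and both toy distributions are illustrative assumptions.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint):
    """Return I(Y; C | X) in bits for a joint pmf given as {(x, c, y): prob}."""
    p_x = defaultdict(float)
    p_xc = defaultdict(float)
    p_xy = defaultdict(float)
    for (x, c, y), p in joint.items():
        p_x[x] += p
        p_xc[(x, c)] += p
        p_xy[(x, y)] += p
    total = 0.0
    for (x, c, y), p in joint.items():
        if p > 0:
            ratio = p * p_x[x] / (p_xc[(x, c)] * p_xy[(x, y)])
            total += p * math.log2(ratio)
    return total


# Hypothetical case 1: the prefix alone determines the next token.
sufficient = {
    ("a", 0, 0): 0.25, ("a", 1, 0): 0.25,
    ("b", 0, 1): 0.25, ("b", 1, 1): 0.25,
}
# Hypothetical case 2: the next token copies the hidden circumstance.
insufficient = {
    ("a", 0, 0): 0.25, ("a", 1, 1): 0.25,
    ("b", 0, 0): 0.25, ("b", 1, 1): 0.25,
}

cmi_sufficient = conditional_mutual_information(sufficient)
cmi_insufficient = conditional_mutual_information(insufficient)
```

In the first toy distribution the prefix is a sufficient statistic, so the residual information is zero. In the second, one full bit about the next token is carried only by the omitted circumstance.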

2605.22967 2026-06-01 cs.LG

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

Benjamin Rozonoyer, Jacopo Minniti, Dhruvesh Patel, Neil Band, Avishek Joey Bose, Tim G. J. Rudner, Andrew McCallum

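AI summary: Proposes Learned Relay Representations (Relay), which passes latent information through a differentiable channel so that masked diffusion models can think ahead across denoising steps, reducing inference latency and improving performance.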

Comments 16 pages, 3 figures. Equal contribution: Benjamin Rozonoyer, Jacopo Minniti, and Dhruvesh Patel. Code: https://github.com/jacopo-minniti/relay

Abstract

When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded, forcing every subsequent refinement step to recompute the valuable internal information stored as model representations. To avoid a hard reset between denoising rounds, we propose Learned Relay Representations (Relay), a method that allows MDMs to be forward-thinking when denoising by explicitly learning how to propagate latent information for the benefit of future denoising steps. Relay introduces a differentiable per-token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state-of-the-art Diffusion Language Models (DLMs), and is seamlessly compatible with techniques like block diffusion and KV caching. We first provide a thorough justification of the design choices in Relay on a challenging Sudoku-based planning task. We then scale Relay to Fast-dLLM v2, a state-of-the-art DLM, outperforming standard supervised finetuning on coding tasks while reducing inference latency by up to 32%. Our empirical results demonstrate that state-of-the-art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier. We provide code for all our experiments.

2605.22639 2026-06-01 cs.RO

Symmetries Here and There, Combined Everywhere: Cross-space Symmetry Compositions in Robotics

Loizos Hadjiloizou, Rodrigo Pérez-Dattari, Noémie Jaquier

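AI summary: Proposes a cross-space symmetry composition framework that uses the differential-geometric structure of forward kinematics to achieve joint equivariance to configuration-space and task-space symmetries, and shows in dual-arm experiments that jointly exploiting multiple symmetries improves generalization.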

Comments 8 pages, 8 figures, 1 table

2605.20992 2026-06-01 cs.CV

CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction

Hao Xu, Yilin Liu, Yinqiao Wang, Chi-Wing Fu, Niloy J. Mitra

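AI summary: Proposes the CHOIR framework, which uses contact as an explicit coupling signal to reconstruct 4D hand-object interaction sequences from monocular video, including hand motion, object shape and 6D pose, and contact information, significantly improving object reconstruction, physical plausibility, and temporal consistency.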

Abstract

We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module to predict ray-depth corrections and rectify hand-object relative placement, then derive initial per-frame contact correspondences on the rectified geometry. Last, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.

2605.20036 2026-06-01 cs.LG

D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market

Taijie Chen, Rui Su, Siyuan Feng, Laoming Zhang, Hongyang Zhang, Haijiao Wang, Zhaofeng Ma, Jintao Ke, Li Ma

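AI summary: For the dynamic ride-hailing market, proposes D$^3$-Subsidy, a hierarchical diffusion-based framework that achieves city-level subsidy control through a prefix-conditioned diffusion model and a Lagrangian-dual mapping, increasing order volume and GMV while satisfying subsidy-rate caps and low-latency constraints.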

Comments 14 pages, 14 figures

Abstract

Ride-hailing platforms like DiDi Chuxing operate in highly dynamic environments where balancing driver supply and passenger demand is critical. Although driver-side subsidies serve as a primary lever to align these forces and improve key KPIs like completed rides (\texttt{Rides}) and gross merchandise value (\texttt{GMV}), optimizing them in production requires simultaneously meeting three constraints: (i) responsiveness to stochastic shocks, (ii) strict subsidy-rate caps, and (iii) low-latency execution at city scale. These requirements rule out expensive per-order optimization, calling for a forward-looking, constraint-aware city-level controller for online sequential decision making. To meet these requirements, we introduce D$^3$-Subsidy (Dynamic Driver-side Diffusion-based Subsidy), a hierarchical diffusion-based framework for deployable city-wide subsidy control. To bridge the train-inference gap, D$^3$-Subsidy employs a prefix-conditioned diffusion model that samples plausible future trajectories from immutable historical observations, ensuring the training protocol aligns with the fixed-history nature of online deployment. These generated plans are then decoded by a context-conditioned inverse module into low-dimensional city-level control signals. For scalable execution, we bridge the gap between city-level planning and fine-grained dispatch via a Lagrangian-dual-derived mapping, which embeds subsidy-rate caps directly into order-driver incentives without iterative optimization. Additionally, a multi-city pretraining strategy with parameter-efficient fine-tuning enables robust transfer across heterogeneous cities. Extensive offline evaluations demonstrate that D$^3$-Subsidy improves \texttt{Rides} and \texttt{GMV} while enhancing cap compliance, and a real-world A/B test confirms significant uplift while keeping budget-related violation metrics within operational thresholds.

2506.21035 2026-06-01 cs.LG

Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

Haodong Lu, Chongyang Zhao, Minhui Xue, Lina Yao, Kristen Moore, Dong Gong

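AI summary: To address the redundancy, interference, and forgetting caused by coarse expert granularity in continual learning, proposes MoRAM, which treats rank-1 adapters as fine-grained experts and associative memory units and expands incrementally through a self-activation mechanism, significantly improving the plasticity-stability trade-off and generalization.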

Comments Accepted at ICML2026. Project page: https://artificer-ai-lab.github.io/MoRAM/

Abstract

Continual learning (CL) with large pre-trained models aims to incrementally acquire knowledge without catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods expand capacity by adding isolated new experts while freezing old ones, but still suffer from redundancy, interference, routing ambiguity, and consequent forgetting. We investigate the issues stemming from coarse-grained expert granularity. Coarse-grained experts (e.g., high-rank LoRA) encode low-specialty information, leading to expert duplication/interference and routing degradation/confusion as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices act as linear associative memories, MoRAM achieves CL as incremental expansion of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 experts as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a content-addressable retrieval and recall over the incrementally accumulated memory of learning snapshots. Extensive experiments on CLIP and LLMs show that MoRAM significantly outperforms state-of-the-art methods, achieving a better plasticity-stability trade-off, stronger generalization, and reduced forgetting. Project Page: https://artificer-ai-lab.github.io/MoRAM/.

2605.21470 2026-06-01 cs.LG cs.AI

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini, Christos Kozyrakis

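AI summary: Proposes an agent just-in-time compilation system in which JIT-Planner generates code plans, JIT-Scheduler explores parallelization strategies, and an invariant-enforcing tool protocol constrains tool use, significantly reducing latency and improving accuracy.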

Comments Accepted at ICML 2026

Abstract

Computer-use agents (CUAs) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, a system that compiles task descriptions directly into executable code that may include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying precondition and postcondition requirements to reduce the rate of incorrect tool use. Across five applications, JIT-Planner achieves $10.4\times$ speedup and 28\% higher accuracy over Browser-Use, while JIT-Scheduler achieves $2.4\times$ speedup and 9\% higher accuracy over OpenAI CUA.
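
As a rough illustration of Monte Carlo cost estimation over latency distributions, the sketch below compares a sequential plan with a partly parallel one. It is not the paper's JIT-Scheduler: the step names, the log-normal latency distributions, and the `estimate_latency` helper are all hypothetical, and the paper learns its latency distributions rather than fixing them by hand.

```python
import random


def estimate_latency(plan, samplers, n_samples=2000, seed=0):
    """Monte Carlo estimate of a plan's expected latency.

    plan: list of stages; each stage is a list of step names run in parallel.
    A stage takes as long as its slowest step; stages run one after another.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += sum(max(samplers[step](rng) for step in stage) for stage in plan)
    return total / n_samples


# Hypothetical per-step latency distributions (seconds).
samplers = {
    "open_menu": lambda rng: rng.lognormvariate(0.0, 0.5),
    "read_prices": lambda rng: rng.lognormvariate(0.5, 0.5),
    "read_reviews": lambda rng: rng.lognormvariate(0.5, 0.5),
}

sequential = [["open_menu"], ["read_prices"], ["read_reviews"]]
parallel = [["open_menu"], ["read_prices", "read_reviews"]]

sequential_cost = estimate_latency(sequential, samplers)
parallel_cost = estimate_latency(parallel, samplers)
```

A scheduler built on this idea would pick the candidate plan with the lowest estimated cost. Here the parallel plan wins because its second stage costs the maximum of two step latencies instead of their sum.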

2605.21108 2026-06-01 cs.LG cs.AI

Efficient Learning of Deep State Space Models via Importance Smoothing

John-Joseph Brady, Nikolas Nusken, Yunpeng Li

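AI summary: Proposes parallel variational Monte Carlo (PVMC), which combines variational inference with sequential Monte Carlo to train deep state space models efficiently on both discriminative and generative tasks, with a 10x speedup.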

Comments Accepted to the proceedings of ICML 2026

Abstract

Latent state space systems are ubiquitous in statistical modelling, arising naturally when time series are observed through noisy measurements. However, training deep state space models (DSSMs) at scale remains difficult. Two largely distinct strategies have emerged for training DSSMs. The first, auto-encoding DSSMs, trains generative models by optimising a variational lower bound. The second backpropagates through the outputs of classical sequential Monte Carlo (SMC) algorithms. Such approaches can train DSSMs for both discriminative and generative tasks, but their inherently sequential forward passes scale poorly on modern hardware. We propose \emph{parallel variational Monte Carlo} (PVMC), a new training method that bridges these paradigms and robustly trains DSSMs for both discriminative and generative tasks. Across a set of benchmark experiments, PVMC matches or exceeds state-of-the-art performance while training $10\times$ faster than the fastest competing SMC-based approach.

2605.21007 2026-06-01 cs.CV cs.RO

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

Daojie Peng, Bingtao Wang, Fulong Ma, Liang Zhang, Jun Ma

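AI summary: Proposes LiteViLNet, a lightweight multi-modal network that uses a dual-stream encoder, depth-wise separable convolutions, and a multi-scale feature fusion module to reach 96.36% MaxF on the KITTI dataset with 14.04M parameters, balancing accuracy and efficiency.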

Abstract

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
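
The parameter saving from depth-wise separable convolutions can be checked with simple arithmetic. The sketch below counts weights (biases omitted) for a standard convolution and for its depth-wise separable replacement. The layer sizes are illustrative assumptions and are not taken from LiteViLNet.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
reduction = standard / separable
```

For this hypothetical layer the standard convolution has 73,728 weights and the separable version 8,768, a reduction of roughly 8.4x.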

2605.20873 2026-06-01 cs.AI cs.LG

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou

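AI summary: Proposes the PlanningBench framework, which uses a constraint-driven synthesis pipeline to generate scalable, diverse, and verifiable planning data for evaluating and training LLMs, and validates its effectiveness in improving planning ability.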

Abstract

Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.

2601.22538 2026-06-01 cs.LG stat.AP

Learning-to-Defer in Non-Stationary Time Series via Switching State-Space Models

Yannis Montreuil, Letian Yu, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

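AI summary: Proposes the L2D-SLDS framework, which uses a factorized switching linear-Gaussian state-space model to handle non-stationary streaming data, continuously updates beliefs about unqueried experts through a shared factor, and designs a learner-aware query score that balances immediate cost against information gain, enabling online learning-to-defer.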

Abstract

Learning-to-defer (L2D) routes each decision to a system's own predictor or to an external expert. Streaming time-series settings break the offline-L2D assumptions: the data are non-stationary, expert availability shifts over time, and the internal predictor is trained online. We propose L2D-SLDS, a one-stage online L2D framework based on a factorized switching linear-Gaussian state-space model over all potential residuals: a discrete regime, a shared global factor, and per-expert idiosyncratic states. The always-observed internal residual continuously updates beliefs about every unqueried expert through the shared factor, and a learner-aware query score balances immediate cost against latent-state information gain and one-step learner improvement. We prove an oracle inequality against a time-varying learn-and-defer comparator, decomposing regret into a query-bonus budget, an SLDS predictive-cost-error term~$\mathcal{E}_{\mathrm{SLDS}}$, and the internal learner's interval dynamic regret. On synthetic, Melbourne, Jena, and 24-expert Delhi benchmarks, L2D-SLDS is competitive with or improves on contextual- and non-stationary-bandit baselines while deferring on ${<}2\%$ of real-data rounds.

2509.10308 2026-06-01 cs.LG

GraphCSVAE: Graph Categorical Structured Variational Autoencoder for Spatiotemporal Auditing of Physical Vulnerability Towards Sustainable Post-Disaster Risk Reduction

Joshua Dimasaka, Christian Geiß, Robert Muir-Wood, Emily So

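AI summary: Proposes the GraphCSVAE framework, which integrates deep learning, graph representation, and categorical probabilistic inference and uses time-series satellite data and expert priors to model physical vulnerability, and validates its spatiotemporal auditing capability in two post-disaster regions.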

Comments Accepted for publication in Progress in Disaster Science (on May 20, 2026) and at the 8th International Disaster and Risk Conference, IDRC 2025 | Keywords: weakly supervised, graph, categorical, vulnerability, remote sensing, spatiotemporal | The data and code are respectively available at https://doi.org/10.5281/zenodo.16656471 and https://github.com/riskaudit/GraphCSVAE

DOI: 10.1016/j.pdisas.2026.100601

Abstract

In the aftermath of disasters, many institutions worldwide face challenges in monitoring changes in disaster risk, limiting assessment of progress towards the UN Sendai Framework for Disaster Risk Reduction 2015-2030. While numerous efforts have substantially advanced the large-scale modeling of hazard and exposure through Earth observation and data-driven methods, progress remains limited in modeling another equally important yet challenging element of the risk equation: physical vulnerability. To address this gap, we introduce Graph Categorical Structured Variational Autoencoder (GraphCSVAE), a probabilistic data-driven framework for modeling physical vulnerability by integrating deep learning, graph representation, and categorical probabilistic inference, using time-series satellite-derived datasets and expert priors. We introduce a weakly supervised first-order transition matrix to capture changes in the spatiotemporal distribution of vulnerability across two disaster-affected and socioeconomically disadvantaged regions: the cyclone-impacted Khurushkul community in Bangladesh and the mudslide-affected city of Freetown in Sierra Leone. Across both case studies, the framework constructs large-scale graph representations spanning 2016-2023 and evaluates posterior compositional distributions against expert priors using Aitchison distance due to the lack of temporal groundtruth labels. The work reveals post-disaster regional dynamics in physical vulnerability, offering valuable insights into localized spatiotemporal auditing and sustainable strategies for post-disaster risk reduction.
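
For readers unfamiliar with the Aitchison distance mentioned in the abstract, a minimal sketch follows. It uses the standard definition (Euclidean distance between centered log-ratio transforms of two strictly positive compositions). The example compositions are hypothetical and are not data from the paper.

```python
import math


def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Hypothetical three-class vulnerability compositions.
prior = [0.2, 0.3, 0.5]
posterior = [0.25, 0.25, 0.5]

distance = aitchison_distance(prior, posterior)
# Rescaling a composition leaves the distance unchanged.
rescaled_distance = aitchison_distance([2.0, 3.0, 5.0], posterior)
```

Because the transform works on log-ratios, the distance depends only on the relative proportions of the classes, which is why it suits comparisons between compositional distributions.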

2605.19806 2026-06-01 cs.CL cs.AI

Chunking German Legal Code

Max Prior, Natalia Milanova, Andreas Schultz

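AI summary: Using the German Civil Code as a benchmark corpus for German statutory law, the study compares the performance of multiple chunking strategies in retrieval-augmented generation and finds that chunking based on the law's inherent structure (such as sections and subsections) outperforms semantically enriched methods in recall and computational efficiency.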

Comments Accepted at the Eighth Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 21st International Conference on Artificial Intelligence and Law (ICAIL 2026)

Abstract

This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure, particularly section- and subsection-based retrieval, achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.
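
A structure-aligned chunker of the kind the abstract describes can be sketched in a few lines. This is not the authors' implementation: the regular expression, the `chunk_by_section` helper, and the placeholder statute text are assumptions for illustration, and real statute markup would likely need more careful handling.

```python
import re


def chunk_by_section(text):
    """Split statute text into one chunk per section, keyed by its number."""
    # Assumes each section starts a line with the section sign and a number.
    pattern = re.compile(r"^§\s*(\d+[a-z]?)\b", re.MULTILINE)
    matches = list(pattern.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "section": match.group(1),
            "text": text[match.start():end].strip(),
        })
    return chunks


# Placeholder text, not actual statute wording.
sample = (
    "§ 1 Heading A\n"
    "Body of section one.\n"
    "§ 2 Heading B\n"
    "(1) First subsection.\n"
    "(2) Second subsection.\n"
)

chunks = chunk_by_section(sample)
```

Each chunk keeps a whole section together, so a retrieved chunk maps directly onto the section-level labels used for evaluation.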

2605.19145 2026-06-01 cs.LG

PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks

Srijith Nair, Atilla Eryilmaz, Jia Liu

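AI summary: Proposes a Pareto-optimality framework based on a multi-task learning perspective that achieves minimal-forgetting continual learning under conflicting tasks by finding Pareto-optimal solutions, and derives Pareto-minimal-forgetting algorithms for linear regression, basis-function regression, and loss functions with a quadratic upper bound.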

Comments 25 pages, 4 figures, 4 algorithms

Abstract

In the literature, many continual learning (CL) algorithms have been proposed to address the issue of catastrophic forgetting in ML models (i.e., learning new tasks leads to the loss of performance on previously learned tasks). Although all CL approaches use some form of memory to retain information about past tasks, a grounded understanding of what information needs to be stored to minimize catastrophic forgetting remains elusive. Recently, it has been recognized that under the strong assumption of the existence of a common global minimizer over all tasks, catastrophic forgetting can be completely avoided. However, in practice, tasks rarely have a common global minimizer, and a certain amount of forgetting is inevitable. In this paper, we propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense. We derive Pareto-minimal-forgetting CL algorithms for linear and basis-function regression, and general loss functions which have a quadratic upper bound, e.g., logistic regression. For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footprint of $\mathcal{O}(d^2)$ for models with $d$ parameters.
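
One way to see why an $\mathcal{O}(d^2)$ memory can suffice for quadratic problems is the sketch below. It is not the PMF-CL algorithm: it is a plain weighted-sum scalarization of least-squares tasks, chosen because a minimizer of a positively weighted sum of convex losses is Pareto-optimal. The class name, the solver, and the two toy tasks are illustrative assumptions.

```python
def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


class QuadraticMemory:
    """Keeps only A = sum(w * X^T X) and b = sum(w * X^T y): O(d^2) numbers."""

    def __init__(self, d):
        self.A = [[0.0] * d for _ in range(d)]
        self.b = [0.0] * d

    def add_task(self, X, y, weight=1.0):
        # Fold a task's data into the statistics; raw samples are not stored.
        for row, target in zip(X, y):
            for i, xi in enumerate(row):
                self.b[i] += weight * xi * target
                for j, xj in enumerate(row):
                    self.A[i][j] += weight * xi * xj

    def minimizer(self):
        # Minimizer of the weighted sum of squared-error losses seen so far.
        return solve(self.A, self.b)


# Two hypothetical conflicting tasks with different individual minimizers.
memory = QuadraticMemory(d=2)
memory.add_task([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])  # alone: w = (1, 0)
memory.add_task([[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0])  # alone: w = (0, 1)
w = memory.minimizer()
```

With equal weights the compromise lands at (0.5, 0.5). Changing the `weight` argument moves the solution along the trade-off between the two tasks, while the stored state stays at a d x d matrix and a length-d vector however many tasks arrive.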

2605.18807 2026-06-01 cs.LG cs.AI

Block-Based Double Decoders

Asher Labovich, Benjamin Bradley, Vanessa Alexander, Chaitanya Harsha

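AI summary: Proposes a block-based double-decoder architecture that uses a doubly causal block attention mask to enable full-loss supervision and static sequence packing, combining decoder training efficiency with encoder-decoder inference efficiency; in scaling-law experiments it outperforms encoder-decoder models and approaches decoder models, while reducing the KV cache and per-token compute at inference by at least 2/3.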

Comments 8 pages main, 13 pages total

2605.18803 2026-06-01 cs.LG cs.AI

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

Ahmet H. Güzel, Jenny Seidenschwarz, Benjamin Graham, Jonathan Sadeghi, Jeffrey Hawke, Ilija Bogunovic

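AI summary: Proposes a KL-constrained adversarial curriculum that trains a policy to expose high-error trajectories of a diffusion world model and continually fine-tunes the model on them, combined with a prioritized adversarial trajectory buffer, to address robustness on rare critical transitions that are under-represented in passive data.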

英文摘要

Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model while remaining close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world-model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.

2605.18606 2026-06-01 cs.LG

Physics-Aligned Canonical Equivariant Fourier Neural Operator under Symmetry-Induced Shifts

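Jiaxiao Xu, Changhong Mou, Yeyu Zhang, Fengxiang He

AI summary: Proposes PACE-FNO, which aligns the input field to a reference frame via Lie-algebra coordinate estimation, applies a standard FNO, and restores the target frame. It uses the continuous symmetries of evolution equations on periodic domains to separate coordinate alignment from physical evolution, reducing OOD relative error by up to 12x across several PDEs.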

Comments 36 pages, 14 figures, 10 tables

Abstract

Neural operators approximate PDE solution maps, but they need not respect the symmetries of the governing equation. In out-of-distribution (OOD) regimes, a standard neural operator must often learn coordinate alignment and physical evolution within a single map, which can hurt generalization. We use known continuous symmetries of evolution equations on periodic domains to separate these two roles. We propose the Physics-Aligned Canonical Equivariant Fourier Neural Operator (PACE-FNO), which estimates the input frame with a Lie-algebra coordinate estimator, maps the field to a reference frame, applies a standard Fourier Neural Operator (FNO), and restores the prediction to the target frame. We train alignment and operator prediction jointly using bounded symmetry perturbations, with an optional low-dimensional refinement step that updates the estimated frame at inference. Equivariance is enforced by the input and output transformations, while the FNO architecture remains unchanged. Across 1-D and 2-D Burgers, shallow-water, and Navier-Stokes equations on periodic domains, PACE-FNO matches the in-distribution (ID) accuracy of standard neural operators and reduces out-of-distribution (OOD) relative error by up to 12x over FNO with symmetry augmentation (FNO+Aug) under translations and Galilean shifts, with smaller gains for coupled rotation-translation shifts. Ablations show that aligning the input and restoring the output frame account for most OOD gains; inference-time refinement provides a smaller correction.
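The align, apply, restore pattern described above can be sketched for the simplest case, a cyclic translation on a periodic 1-D grid. This is only a toy under stated assumptions: the frame estimator here is a hypothetical argmax rule rather than the learned Lie-algebra coordinate estimator, and the inner operator is an arbitrary stand-in for an FNO.

```python
def roll(field, shift):
    """Cyclically shift a periodic 1-D field to the right by `shift` cells."""
    n = len(field)
    return [field[(i - shift) % n] for i in range(n)]


def estimate_frame(field):
    """Hypothetical frame estimator: index of the (assumed unique) maximum."""
    return max(range(len(field)), key=lambda i: field[i])


def canonical_apply(operator, field):
    """Map to the reference frame, apply the operator, restore the input frame."""
    k = estimate_frame(field)
    reference = roll(field, -k)       # the peak now sits at index 0
    prediction = operator(reference)  # the operator only ever sees aligned inputs
    return roll(prediction, k)


def position_weighted(field):
    """Toy operator that is NOT translation-equivariant on its own."""
    return [value * (i + 1) for i, value in enumerate(field)]
```

Because every shifted copy of a field is mapped to the same reference input, the wrapped operator commutes with translations even though the inner operator does not, which is the sense in which equivariance comes from the input and output transformations rather than from the architecture.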

2605.18364 2026-06-01 cs.LG math.OC

Proximal basin hopping: global optimization with guarantees

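Guillaume Lauga, Cesare Molinari, Samuel Vaiter

AI summary: Proposes proximal basin hopping (PBH), a theoretical framework that combines proximal optimization with local minimization to build algorithms that converge to a global minimum with high probability. On synthetic hard functions and practical problems such as fitting deep learning scaling laws, it outperforms known algorithms with theoretical guarantees, and the gap widens as the dimension grows.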

2605.18024 2026-06-01 cs.LG cs.AI cs.MA

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

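Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han

AI summary: Proposes an interaction-breaking adversarial learning framework that constructs attacks from an information-theoretic perspective to disrupt inter-agent interactions, and trains agents to act reliably under such disruption, improving robustness.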

Comments 9 pages for main, 33 pages for total, Accepted to ICML 2026

2605.18023 2026-06-01 cs.CV

DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection

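Donghong Jiang, Endian Lin, Hanqing Liu, Mingjie Liu, Luoping Cui, Zhao Yang, Chuang Zhu

AI summary: Proposes the DSAA framework, which strengthens attribute semantics with an Attribute Prefix Adapter at the text embedding stage and a Key/Value Modulator at the BERT encoding stage, and adds an attribute-aware contrastive loss to improve fine-grained open-vocabulary detection.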

Abstract

Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ an Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.

2605.17524 2026-06-01 cs.LG cs.DB

Covariance Structure and Coordinate Heterogeneity Govern Binary Quantization of Contrastive Embeddings

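Wenxuan Xiao

AI summary: By analyzing the covariance structure of InfoNCE-trained representations, shows how the off-diagonal entries of the covariance matrix and coordinate heterogeneity respectively affect the ranking fidelity and design choices of binary quantization, and derives a scaling law to guide system design.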

Comments 21 pages, 1 figure, 19 tables (6 in main text, 13 in appendix)

Abstract

Binary quantization (BQ) compresses high-dimensional embeddings into one or two bits per coordinate, enabling nearest neighbor search at extreme speed. Yet a striking puzzle persists: BQ achieves competitive recall on contrastive embeddings but fails on others, and two leading systems adopt diametrically opposite strategies (random rotation vs. preserving coordinate axes) without a common theory explaining when each is appropriate. We address this puzzle by connecting the Gaussian structure recently established for InfoNCE-trained representations to a statistical framework for BQ quality. Our analysis reveals two distinct roles of the covariance matrix. First, the full covariance structure, not merely its diagonal, determines the absolute level of ranking fidelity, with off-diagonal correlations contributing 30-50% of the signal. Second, coordinate heterogeneity (the non-uniformity of per-coordinate variances) governs key design choices: how much each additional bit contributes, and whether random rotation helps or hurts. We derive approximate expressions for ranking fidelity under a Gaussian model, show that the magnitude bit carries information proportional to heterogeneity, and show that random rotation destroys precisely the signal that one paradigm exploits while creating the isotropy that the other requires. A phenomenological scaling law predicts fidelity across models and dimensions. Experiments on 18 datasets spanning 9 embedding families support the main predictions and provide, to our knowledge, the first principled design guide for binary quantization systems.
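For readers unfamiliar with the setting, a minimal sketch of the one-bit case may help: each coordinate is reduced to its sign, and candidates are ranked by Hamming distance between codes. It assumes the simplest axis-preserving scheme, with no rotation and no magnitude bit, and is not a reconstruction of either system the abstract mentions; the function names are hypothetical.

```python
def sign_code(vector):
    """1-bit code per coordinate: 1 if the coordinate is positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query, corpus):
    """Corpus indices sorted by Hamming distance to the query's sign code."""
    q = sign_code(query)
    dists = [hamming(q, sign_code(v)) for v in corpus]
    return sorted(range(len(corpus)), key=lambda i: dists[i])
```

Ranking fidelity, in the abstract's sense, asks how closely an ordering like this one tracks the ordering given by the original full-precision similarities.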

2605.17373 2026-06-01 cs.LG cs.AI

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

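Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu

AI summary: Proposes the FML-Bench benchmark, which separates agent strategy from infrastructure and defines process-level metrics. Evaluating six agent strategies, it finds that greedy hill-climbing nearly matches the best tree search, and that an adaptive strategy that switches based on search density can outperform the other agents.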

Comments Our benchmark is available at: https://github.com/qrzou/FML-bench

Abstract

AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.

2605.17101 2026-06-01 cs.CL cs.AI

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

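Yongfeng Huang, Ruiying Chen, James Cheng

AI summary: To address the mismatch between single-round static retrieval and the multi-stage process of clinical reasoning in medical question answering, proposes the SEMA-RAG framework. Through task decoupling and dynamic multi-round exploration, three specialist agents handle clinical interpretation, self-evolving retrieval, and evidence adjudication respectively, improving accuracy by 6.46 percentage points on average across multiple benchmarks.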

Comments Accepted to Findings of ACL 2026

Abstract

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.

2605.16215 2026-06-01 cs.AI cs.CL

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

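Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

AI summary: Proposes Fully Open Meditron, the first fully open pipeline for building clinical large language models. With an auditable dataset, a reproducible training framework, and a use-aligned evaluation protocol, it achieves state-of-the-art domain performance without sacrificing auditability or reproducibility.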

Comments Preprint. 31 pages, 10 figures. Code, models, and data: https://github.com/EPFLiGHT/FullyOpenMeditron

Abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

2602.12005 2026-06-01 cs.CL

LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

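Szilvia Ujváry, Louis Béthune, Pierre Ablin, João Monteiro, Marco Cuturi, Michael Kirchhof

AI summary: Studies which tokens a Small Language Model (SLM) should learn during pretraining and which it should delegate via a <CALL> token, and proposes LaCy, a method that combines loss with factuality signals to improve the factual accuracy of SLMs in cascaded generation.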

Comments 40 pages, 26 figures, 10 tables, preprint. v3-v4: new results for RAG, ablations and additional analysis

Abstract

Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a <CALL> token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, it is insufficient for identifying which predictions would actually lead to factually or semantically invalid continuations. Some high-loss tokens correspond to acceptable alternative continuations of a pretraining document and therefore should not trigger a <CALL>. This suggests that learnability cannot be characterized from loss alone, but requires additional domain-specific signals about the role of a token in the sentence. In Wikipedia-like domains, we show that augmenting the loss signal with lightweight grammatical information from a spaCy parser substantially improves delegation decisions. Based on this insight, we propose LaCy, a novel pretraining method that combines loss with factuality signals to decide which tokens an SLM should learn. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and when to call for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.

2602.00747 2026-06-01 cs.CL cs.AI

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

AI summary: Proposes the DeMix framework, which predicts optimal data ratios via model merging, improving benchmark performance while lowering search cost.

Comments 18 pages, 5 figures, accepted at ICML 2026

Abstract

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
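The idea of deriving a mixture proxy by weighted model merging can be pictured with a small sketch. The abstract does not specify the merging operator, so the linear weighted average of parameters below is an assumption, and the function name and the dict-of-lists parameter format are hypothetical.

```python
def merge_models(models, weights):
    """Weighted average of parameter dicts that share the same keys and shapes.

    `models` is a list of {name: [floats]} dicts, one per component model;
    `weights` plays the role of a candidate data mixture (assumed linear merging).
    """
    total = sum(weights)
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(weights, models)) / total
            for i in range(size)
        ]
    return merged
```

Under this assumption, trying a new mixture costs one pass over the parameters rather than a training run, which is what lets the search evaluate many sampled mixtures.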

2605.15706 2026-06-01 cs.LG

Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models

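Xingjian Wu, Junkai Lu, Siyu Yan, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang

AI summary: Proposes the Differentiable Mixture-of-Agents (DMoA) framework, which dynamically activates agents through a differentiable, context-aware routing mechanism to enable elastic collaboration during inference, and achieves the best performance on 9 benchmarks.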

Abstract

Recent advances in Large Language Models (LLMs) have catalyzed the development of multi-agent systems (MAS) for complex reasoning tasks. However, existing MAS typically rely on pre-defined or pre-compiled communication topologies, which limits their flexibility and adaptability to dynamic task requirements. In this work, we propose Differentiable Mixture-of-Agents (DMoA), a self-evolving multi-agent framework that enables elastic and adaptive agent collaboration during inference. Instead of statically constructing workflows, DMoA dynamically routes and activates agents at each reasoning step, allowing the system to implicitly simulate diverse communication topologies and adapt to evolving demands. To achieve this, we design a differentiable, context-aware routing mechanism that leverages recurrent structures to incorporate historical and contextual information, producing sparse agent activations in a step-wise manner. Furthermore, we introduce predictive entropy as a self-supervised signal to optimize the routing process, enabling efficient test-time adaptation without external annotations. Extensive experiments across 9 benchmarks demonstrate that DMoA achieves state-of-the-art performance while exhibiting strong efficiency, robustness, and ensembling capabilities.
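Predictive entropy, the self-supervised signal named above, is the Shannon entropy of a model's output distribution. The sketch below shows only that quantity; how DMoA feeds it into routing is not described in the abstract and is not modeled here, and the function name is hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a discrete predictive distribution.

    Zero-probability outcomes contribute nothing, so they are skipped.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A peaked distribution gives a value near zero and a uniform one gives the maximum, so the quantity can be computed at test time without any labels.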

2605.15470 2026-06-01 cs.LG physics.ao-ph

Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting

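Daniel Holmberg, Joel Oskarsson, Erik Wikingsson, Fredrik Lindsten, Teemu Roos

AI summary: Proposes Njord, a probabilistic model that combines a deep latent variable framework with a graph neural network to sample ensemble forecasts in a single forward pass for both global and regional oceans. It introduces K-means cluster meshes that adapt to irregular sea surface geometry and achieves lower errors than deterministic baselines when evaluated against observations.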

Comments Preprint

Abstract

Ocean dynamics are inherently chaotic, yet existing machine learning ocean models produce only deterministic forecasts. We introduce Njord, a probabilistic data-driven model for ocean forecasting, applicable to both global and regional domains. Njord combines a deep latent variable framework with a graph neural network architecture, enabling sampling each forecast step in a single forward pass. We apply Njord globally at 0.25° resolution and regionally to the Baltic Sea at 2 km resolution. To scale to these large ocean grids we introduce K-means cluster meshes that adapt to irregular sea surface geometry. Experiments demonstrate strong performance on both domains compared to deterministic machine learning baselines, while also providing uncertainty estimates from the sampled ensemble forecasts. On the global OceanBench benchmark, Njord achieves the lowest errors on average across upper-ocean variables when evaluated against real-world observations, with the largest improvements in surface temperature prediction.

2604.15215 2026-06-01 cs.RO

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

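Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

AI summary: Proposes HiST-AT, a hierarchical spatiotemporal action tokenizer that clusters actions hierarchically through two levels of vector quantization and reconstructs them using both spatial and temporal information, achieving state-of-the-art performance on multiple simulated and real robotic manipulation benchmarks.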

Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
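The two successive levels of vector quantization can be sketched as two nearest-neighbor lookups: an action is first assigned to a fine-grained subcluster, and that subcluster is then mapped to a coarser cluster. The codebooks, the Euclidean metric, and the function names below are illustrative assumptions; the learned codebooks and the timestamp reconstruction of HiST-AT are not modeled.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sqdist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def tokenize(action, sub_codebook, cluster_codebook):
    """Return (cluster_id, subcluster_id) for one action vector."""
    sub_id = nearest(sub_codebook, action)                        # lower level
    cluster_id = nearest(cluster_codebook, sub_codebook[sub_id])  # higher level
    return cluster_id, sub_id
```

With a hypothetical subcluster codebook `[[0, 0], [0, 1], [5, 5], [6, 5]]` and cluster codebook `[[0, 0.5], [5.5, 5]]`, nearby actions share a cluster id while keeping distinct subcluster ids.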

2402.17672 2026-06-01 cs.CV eess.IV

SDF2Net: Shallow to Deep Feature Fusion Network for PolSAR Image Classification

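Mohammed Q. Alkhatib, M. Sami Zitouni, Mina Al-Saad, Nour Aburaed, Hussain Al-Ahmad

AI summary: Proposes SDF2Net, a novel three-branch complex-valued CNN fusion network that improves PolSAR image classification accuracy through shallow-to-deep feature fusion, outperforming existing methods on three datasets.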

Abstract

Polarimetric synthetic aperture radar (PolSAR) images encompass valuable information that can facilitate extensive land cover interpretation and generate diverse output products. Extracting meaningful features from PolSAR data poses challenges distinct from those encountered in optical imagery. Deep learning (DL) methods offer effective solutions for overcoming these challenges in PolSAR feature extraction. Convolutional neural networks (CNNs) play a crucial role in capturing PolSAR image characteristics by leveraging kernel capabilities to consider local information and the complex-valued nature of PolSAR data. In this study, a novel three-branch fusion of complex-valued CNNs, named the Shallow to Deep Feature Fusion Network (SDF2Net), is proposed for PolSAR image classification. To validate the performance of the proposed method, classification results are compared against multiple state-of-the-art approaches using the airborne synthetic aperture radar (AIRSAR) datasets of Flevoland and San Francisco, as well as the ESAR Oberpfaffenhofen dataset. The results indicate that the proposed approach improves overall accuracy, with gains of 1.3% and 0.8% on the AIRSAR datasets and 0.5% on the ESAR dataset. Analyses conducted on the Flevoland data underscore the effectiveness of the SDF2Net model, revealing a promising overall accuracy of 96.01% even with only a 1% sampling ratio.
