arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18475 2026-05-19 cs.LG cs.AI

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang, Lihua Zhang, Xu Han

AI summary: This paper proposes GAMMA, a framework that learns module-wise precision preferences within a post-training pipeline, optimizes a teacher-forced hidden-state reconstruction objective, and uses integer programming for exact budget allocation, improving large language model accuracy under arbitrary budgets and outperforming fixed-precision baselines and search-based mixed-precision methods.

Abstract

Mixed-precision quantization improves the budget-accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B-32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
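
The budget-projection step can be illustrated with a small multiple-choice knapsack. This is only a sketch under stated assumptions: the abstract does not give GAMMA's actual integer program, so the function name `allocate_bits`, the integer module sizes, and the per-bit-width penalty tables are all hypothetical stand-ins for the learned preferences.

```python
from typing import Dict, List, Tuple


def allocate_bits(
    sizes: List[int],
    costs: List[Dict[int, float]],
    budget: int,
) -> Tuple[float, List[int]]:
    """Pick one bit width per module, minimizing total penalty under a bit budget.

    sizes[i]  : size of module i in integer units (hypothetical)
    costs[i]  : maps bit width -> penalty for module i (hypothetical, lower is better)
    budget    : maximum allowed sum of sizes[i] * bits[i]
    """
    # best[used] = (lowest total penalty, chosen bit widths) among assignments
    # of the modules processed so far that consume exactly `used` budget units.
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for size, options in zip(sizes, costs):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (cost, picks) in best.items():
            for bits, penalty in options.items():
                total = used + size * bits
                if total > budget:
                    continue  # this choice would exceed the budget
                cand = (cost + penalty, picks + [bits])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for any assignment")
    return min(best.values(), key=lambda entry: entry[0])
```

Because the penalty tables do not depend on the budget, calling `allocate_bits` again with a different `budget` reuses the same scores, which mirrors the score-reuse idea described in the abstract.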

2605.18467 2026-05-19 cs.CV

InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, Boxin Shi

AI summary: This paper proposes InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. By building the large-scale audio-video editing dataset InsAVE-80K and adapting a generative model, it achieves higher-quality joint audio-video editing.

Abstract

Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive experiments demonstrate that InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects on two evaluation sets, highlighting its potential for controllable content creation. Project page: https://hjzheng.net/projects/InstructAV2AV/.

2605.18466 2026-05-19 cs.CV

Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

Daiqi Liu, Lukas Mulzer, Md Hasan, Nyvenn de Castro, Fangxu Xing, Xingjian Kang, Chengze Ye, Siyuan Mei, Yipeng Sun, Tomás Arias-Vergara, Jana Hutter, Jonghye Woo, Andreas Maier, Paula Andrea Pérez-Toro

AI summary: This paper proposes a three-stage framework that trains with acoustic and phonological supervision but needs only real-time MRI images at inference. Phonological representations are converted into spatial bounding-box priors for articulator localization, the visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, transferring multimodal knowledge into a single-modality inference pipeline. Experiments show that the method outperforms existing unimodal and multimodal methods on the 75-Speaker Annot-16 and USC-TIMIT datasets.

Comments: under review

Abstract

Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on the 75-Speaker Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.

2605.18462 2026-05-19 cs.CL

From BERT to T5: A Study of Named Entity Recognition

Mei Jia

AI summary: This paper studies named entity recognition with the pretrained models BERT and T5, compares the two models on the sequence labelling task, and analyzes common errors and the effect of hyperparameters on performance.

Comments: 11 pages, 9 figures

Abstract

Named entity recognition (NER) is one of the essential preliminary steps in modern NLP applications. This report focuses on the NER task by fine-tuning two pretrained models: (i) an encoder-only model (BERT) with a simple classification head, and (ii) a sequence-to-sequence model (T5) with few-shot prompts. Under the original 7-class and the simplified 3-class tag schemes, BERT is trained with a weighted cross-entropy loss, and T5 is fine-tuned with two validation strategies. The report also conducts an ablation study over different hyperparameters. Moreover, the related analysis provides insights into common errors in BERT and into the two models' performance. Based on a range of performance metrics, this report compares the two architectures and explores their abilities in the sequence labelling task, laying the groundwork for further practical use cases.
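
The weighted cross-entropy mentioned for BERT can be sketched in plain Python. The abstract does not state the class weights or the normalization, so the weights here are hypothetical and the loss is normalized by the sum of the weights of the target classes, which is one common convention.

```python
import math
from typing import List


def weighted_cross_entropy(
    logits: List[List[float]],
    labels: List[int],
    class_weights: List[float],
) -> float:
    """Class-weighted cross-entropy over a batch of token logits.

    logits        : one row of unnormalized scores per token
    labels        : gold class index per token
    class_weights : per-class weight (hypothetical values, e.g. larger for rare tags)
    """
    total = 0.0
    weight_sum = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]  # negative log-probability of the gold class
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum
```

Raising the weight of a rare entity tag makes its tokens contribute more to the loss, which is the usual motivation for weighting when most tokens carry the non-entity tag.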

2605.18460 2026-05-19 cs.AI cs.LG cs.NE

When Fireflies Cluster; Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

MKA Ariyaratne, Azwirman Gusrialdi, Yury Nikulin, Jaakko Peltonen

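AI summary: This paper proposes an improved Firefly Algorithm for data clustering that addresses limitations of traditional methods such as K-Means in handling non-uniform cluster shapes and densities and in requiring a predefined number of clusters. The algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a novel TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. An application to robotic sensor networks shows its practical value, with experiments showing better clustering quality than K-Means and shorter intra-cluster path distances. These results confirm the algorithm's robustness in complex spatial clustering tasks, with possible future extensions to higher-dimensional and adaptive scenarios.

Comments: 34 pages, 19 figures

Abstract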

This work presents a novel variant of the Firefly Algorithm (FA) for data clustering, addressing limitations of traditional methods like K-Means that struggle with non-uniform cluster shapes, densities, and the need for pre-defining the number of clusters. The proposed algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a novel TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. Application to robotic sensor networks highlights its practical value, with experiments showing improved clustering quality and reduced intra-cluster path distances compared to K-Means. These results confirm the algorithm's robustness in complex spatial clustering tasks, with potential for future extensions to higher-dimensional and adaptive scenarios.
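
For orientation, the movement rule of the standard Firefly Algorithm can be sketched as follows. This is the textbook update, not the centroid-guided variant proposed in the paper, whose details the abstract does not give; the function name and default parameters are assumptions.

```python
import math
from typing import List, Optional


def firefly_step(
    x_i: List[float],
    x_j: List[float],
    beta0: float = 1.0,
    gamma: float = 1.0,
    noise: Optional[List[float]] = None,
) -> List[float]:
    """Move firefly i toward a brighter firefly j.

    Attractiveness decays with squared distance: beta0 * exp(-gamma * r^2).
    `noise` is the optional random perturbation term, passed in explicitly
    so the step stays deterministic when it is omitted.
    """
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    attraction = beta0 * math.exp(-gamma * r2)
    if noise is None:
        noise = [0.0] * len(x_i)
    return [a + attraction * (b - a) + e for a, b, e in zip(x_i, x_j, noise)]
```

With `gamma` set to zero the attraction no longer depends on distance, while a large `gamma` makes distant fireflies nearly invisible to each other.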

2605.18459 2026-05-19 cs.LG stat.ML

Adaptive Experimentation for Censored Survival Outcomes

Yuxin Wang, Dennis Frauen, Jonas Schweisthal, Maresa Schröder, Emil Javurek, Stefan Feuerriegel

AI summary: This paper proposes a new adaptive experimentation framework for estimating causal effects under right censoring. By deriving the semiparametric efficiency bound for the average survival effect curve, it obtains a closed-form efficiency-optimal allocation policy, and numerical experiments show consistent efficiency gains over uniform randomization and censoring-agnostic baselines.

Abstract

Adaptive experimentation enables efficient estimation of causal effects, but existing methods are not designed for survival data with censoring, where event times are only partially observed (e.g., overall survival in cancer trials but with dropout). In this paper, we develop a novel framework for adaptive experimentation to estimate causal effects under right censoring. For this, we derive the semiparametric efficiency bound for the average survival effect curve as a function of the treatment allocation policy and thereby obtain a closed-form efficiency-optimal allocation policy. The policy generalizes classical Neyman allocation to survival settings by prioritizing patient strata where both event and censoring dynamics induce high uncertainty. Building on this, we propose the Adaptive Survival Estimator (ASE), an adaptive framework that learns the allocation policy and estimates the average survival effect curve sequentially. Our framework has three main benefits: (i) it accommodates arbitrary machine learning models for nuisance estimation; (ii) it is guided by a closed-form efficiency-optimal allocation policy; and (iii) it admits strong theoretical guarantees, including asymptotic normality via a martingale central limit theorem. We demonstrate our framework across various numerical experiments to show consistent efficiency gains over uniform randomization and censoring-agnostic baselines.
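
The classical Neyman allocation that the policy generalizes can be written in a few lines. This sketch covers only the classical, uncensored two-arm case; the survival-specific policy derived in the paper is not reproduced here, and the function names are assumptions.

```python
def neyman_allocation(sd_treated: float, sd_control: float) -> float:
    """Treatment probability proportional to the treated-arm standard deviation."""
    if sd_treated < 0 or sd_control < 0:
        raise ValueError("standard deviations must be non-negative")
    total = sd_treated + sd_control
    if total == 0:
        return 0.5  # no outcome variability: any split is equally good
    return sd_treated / total


def scaled_variance(pi: float, sd_treated: float, sd_control: float) -> float:
    """Variance of the difference-in-means estimator, up to a 1/n factor."""
    return sd_treated ** 2 / pi + sd_control ** 2 / (1.0 - pi)
```

For standard deviations 3 and 1 the rule assigns treatment with probability 0.75, giving a scaled variance of 16 compared with 20 under a 50/50 split.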

2605.18454 2026-05-19 cs.LG cs.AI cs.SC

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

Chengpeng Hu, Yingqian Zhang, Hendrik Baier

AI summary: This paper proposes ProRL, an interpretable programmatic reinforcement learning framework that achieves efficient scheduling through human-readable and editable programmatic policies, addressing the limited transparency and computational efficiency of conventional deep reinforcement learning.

Abstract

Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architectures and non-interpretable policy decisions can lead to critical trust and usability concerns for human decision makers. In addition, the computational requirements of DNNs can further hinder practical deployment in resource constrained environments. In this work, we propose ProRL, a novel interpretable programmatic reinforcement learning framework that achieves high-performance scheduling with human-readable and editable programmatic policies (i.e., programs). We first introduce a domain-specific language for scheduling (DSL-S) to represent scheduling strategies as structured programs. ProRL then explores the program space defined by DSL-S using local search to identify incomplete programs, which are subsequently completed by learning their parameters via Bayesian optimization. ProRL learns which scheduling heuristic rules to select, and hence, it naturally incorporates existing heuristics already used in industrial scenarios. Experiments on widely used benchmark instances demonstrate the strong performance of ProRL against existing heuristics and DRL baselines. Furthermore, ProRL performs well under strongly constrained computational resources, such as training with only 100 episodes. Our code is available at https://github.com/HcPlu/ProRL.

2605.18451 2026-05-19 cs.CV cs.GR

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Yixuan Yang, Zhen Luo, Wanshui Gan, Jinkun Hao, Junru Lu, Jinghao Yan, Zhaoyang Lyu, Xudong Xu

AI summary: This paper proposes Code-as-Room, a framework that generates 3D rooms through a structured execution harness, represents rooms as Blender code, and introduces a dedicated benchmark for code-based 3D room synthesis for evaluation.

Abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.

2605.18449 2026-05-19 cs.LG cs.AI

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai

AI summary: This paper proposes an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning problem to better reflect customers with bounded rationality, yielding more accurate estimates of impulse purchase rates and shelf traffic densities.

Comments: Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.
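
The balance between reward maximization and stochasticity can be illustrated with a Boltzmann (softmax) policy over action values, the form that maximum entropy RL policies take. This is a minimal sketch: the action values and the temperature are hypothetical, and the paper's actual training setup is not described in the abstract.

```python
import math
from typing import List


def soft_policy(q_values: List[float], temperature: float) -> List[float]:
    """Action probabilities proportional to exp(Q / temperature).

    A low temperature approaches greedy (shortest-path-like) behaviour;
    a high temperature approaches uniformly random movement.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]
```

In this picture a fully rational shopper corresponds to a temperature near zero, and bounded rationality corresponds to an intermediate temperature that still favours high-value moves without always taking them.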

2605.18441 2026-05-19 cs.RO cs.SY eess.SY

REACT: Environment-Adaptive Architecture for Continuous Formation Navigation of Wheeled Mobile Robots

Jianghong Dong, Yifeng Zhang, Jiawei Wang, Mengchi Cai, Keqiang Li, Guillaume Sartoretti

AI summary: This paper proposes the REACT architecture, which combines centralized formation generation with distributed formation maintenance to address the adaptability of formation navigation for wheeled mobile robots in complex environments, achieving continuous formation navigation without trajectory conflicts.

Abstract

Formation control of wheeled mobile robots (WMRs) has been extensively studied due to its broad applications in fields such as logistics transportation, environmental monitoring, and search and rescue. However, most existing works mainly focus on tracking predefined formations, which limits their adaptability to complex real-world environments. To address this, we propose REACT (Real-time Environment-Adaptive architecture for Continuous formation navigaTion), a hierarchical architecture integrating centralized formation generation and distributed formation maintenance. Specifically, our upper layer generates new environment-adaptive formations when necessary and uses our proposed TCF-R2T (Trajectory-Conflict-Free Robot-to-Target assignment) algorithm to compute conflict-free WMR-to-target assignments in polynomial time, enabling timely formation transitions without trajectory conflicts. At the lower layer, each WMR executes our developed JSTP (Joint Spatio-Temporal trajectory Planning) method to maintain the generated formation by simultaneously optimizing spatial positions and temporal durations, thereby enhancing coordination among WMRs and enabling continuous navigation in obstacle-rich environments and dynamic-obstacle scenarios. Both simulation and real-world experiments validate the effectiveness and practical applicability of REACT. Experimental videos are available on our project website: https://dongjh20.github.io/REACT-website.

2605.18437 2026-05-19 cs.LG cs.DC

Heterogeneous Tasks Offloading in Vehicular Edge Computing: A Federated Meta Deep Reinforcement Learning Approach

Yaorong Huang, Jingtao Luo, Xuechao Wang

AI summary: This paper proposes FedMAGS, a federated meta deep reinforcement learning framework for heterogeneous task offloading in vehicular edge computing. It uses Graph Attention Networks to capture DAG dependencies, a Seq2Seq policy to generate structured offloading decisions, and federated meta-learning for fast adaptation across distributed MEC servers.

Abstract

Vehicular edge computing (VEC) enables latency-sensitive vehicular applications by offloading computation-intensive tasks to nearby edge servers. However, real-world vehicular workloads are typically modeled as heterogeneous directed acyclic graph (DAG) tasks with complex dependency structures, making joint offloading and resource allocation highly challenging. Moreover, distributed MEC deployment raises privacy concerns when collaboratively training learning-based policies. In this paper, we propose a Federated Meta Deep Reinforcement Learning framework with GAT-Seq2Seq modeling (FedMAGS) for heterogeneous task offloading in VEC systems. The proposed approach leverages Graph Attention Networks to capture DAG dependencies, a Seq2Seq-based policy to generate structured offloading decisions, and federated meta-learning to enable fast adaptation across distributed MEC servers without sharing raw data. Extensive simulations demonstrate that FedMAGS achieves faster convergence, lower execution delay, and better scalability compared with state-of-the-art baselines. In addition, the federated design preserves data privacy while reducing communication overhead, making the framework well suited for dynamic and large-scale VEC environments.

2605.18436 2026-05-19 cs.CV

A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation

Pau Torras, Jiří Mayer, Carles Badal, Martina Dvořáková, Markéta Herzanová Vlková, Gerard Asbert, Vojtěch Dvořák, Samuel Šomorjai, Jan Hajič, Alicia Fornés

AI summary: This paper presents a dataset of 1,309 pages of historical sheet music for training optical music recognition systems. It provides MusicXML transcriptions and symbol annotations, is the largest handwritten music dataset to date, and is suitable for training and evaluating both end-to-end and object detection-based OMR systems.

Comments: Under review at Scientific Data

Abstract

A large amount of musical heritage has been digitised by memory institutions: libraries, museums, and archives. Nevertheless, the field of Optical Music Recognition (OMR) has struggled with making this music machine-readable, despite advances in deep learning, mostly because no datasets for training systems in realistic conditions were available. The MusiCorpus dataset aims to remedy this situation by providing 1,309 pages of historical sheet music, primarily handwritten, with MusicXML transcriptions and symbol annotations. It is the largest dataset of handwritten music to date and the first dataset containing a realistic and representative sample of musical document collections from memory institutions, suitable for training and evaluating both end-to-end and object detection-based OMR systems and comparing their performance.

2605.18430 2026-05-19 cs.LG

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

Liang Wang, Heng Meng, Zekai Xiang, Jin Liu, Pingyi Zhou, Litao Chen, Yongqiang Tang

AI summary: This paper proposes Text2CAD-Bench, the first benchmark that systematically evaluates text-to-CAD across geometric complexity and application diversity, and finds that current models perform well on basic geometry but degrade on complex topology and advanced features.

Abstract

Text-to-CAD generation aims to create parametric CAD models from natural language, enabling rapid prototyping and intuitive design workflows. However, existing benchmarks focus on basic primitives and simple sketch-extrude sequences, lacking advanced features essential for real-world applications and covering only traditional mechanical parts. We introduce Text2CAD-Bench, the first benchmark systematically evaluating text-to-CAD across geometric complexity and application diversity. Our benchmark comprises 600 human-curated examples spanning four levels: L1-L2 cover fundamental geometry with standard features, L3 introduces complex topology and freeform surfaces, and L4 extends to real-world domains beyond mechanical parts. Each example pairs dual-style prompts: geometric descriptions mimicking non-expert users, and procedural sequences aligned with expert-level conventions. Evaluating mainstream general LLMs and domain-specific models, we find that current models perform reasonably on basic geometry but degrade substantially on complex topology and advanced features. We release our benchmark to drive progress in text-to-CAD research.

2605.18425 2026-05-19 cs.LG math.ST stat.TH

Generative Adversarial Learning from Deterministic Processes

Joris C. Kühl, Hanno Gottschalk

AI summary: This paper studies the successful application of generative adversarial networks to non-i.i.d. data, proves that an infinite-dimensional generative adversarial learning model can learn the invariant distribution of a chaotic dynamical system from a single deterministic time series, and gives convergence rates.

Comments: 37 pages, 3 figures

Abstract

Physical AI is being successfully applied to data which does not follow the traditional paradigm of independent and identically distributed (i.i.d.) samples. In fact, physical AI is often trained on data which is not random at all, and is instead derived from chaotic dynamical systems like turbulence. We aim to explain the empirical success of these methods using the example of generative adversarial networks (GANs), whose statistical learning theory under the i.i.d. assumption is generally well understood. We prove that it is possible, using an infinite-dimensional model of generative adversarial learning (GAL), to learn the invariant distribution of a sufficiently chaotic dynamical system from a single deterministically evolving time series of its states or measurements thereof, and give explicit rates for the convergence to the solution in terms of the Jensen-Shannon divergence.
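
The Jensen-Shannon divergence in which the convergence rates are stated can be computed for discrete distributions as below. This sketch only illustrates the quantity itself on finite probability vectors; the paper's infinite-dimensional setting and its rates are not reproduced, and the function names are assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their midpoint mixture."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the KL divergence, this quantity is symmetric and bounded by log 2, a bound reached when the two distributions have disjoint support.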

2605.18423 2026-05-19 cs.RO cs.CY

REBAR: Reference Ethical Benchmark for Autonomy Readiness

Jonathan Diller, David Barnes, Rebekah Bogdanoff, Rhett Collier, Roddy Collins, Keith Fieldhouse, Yonatan Gefen, Cameron Johnson, Anuriha Kodali, Brad Kriel, Varun Murali, James Niehaus, Mish Sukharev, Joseph VanPelt, Anthony Hoogs, Vijay Kumar, Arslan Basharat

AI summary: This paper proposes the REBAR framework, which provides a computable Autonomy Readiness Level through rigorous testing to quantify ethical performance and address the shortcomings of existing ethical AI frameworks.

Comments: To be presented at the 2026 Workshop on Robot Ethics - Ethical, Legal and User Perspectives in Robotics and Automation (WOROBET)

Abstract

As autonomous systems grow more advanced, objective metrics to evaluate their ethical and legal compliance are critical for informing end users of their limitations and ensuring accountability of those who misuse them. Current ethical embodied AI frameworks remain mostly qualitative, focusing on system design (through safety guardrails or targeted red teaming), and the realized guardrails often directly disallow unsafe behavior without providing the user with an override or interpretable reason. Instead, there is a need for computable metrics through rigorous testing that allow a user to determine the applicability of the system to the task. To address this gap, we introduce the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative test and evaluation framework for autonomous systems. REBAR maps operating metrics into a computable Autonomy Readiness Level (ARL) rubric that can quantify ethical performance. Key innovations of the framework include a neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios, LLM-driven at-scale generation of test instances, and a versatile, photorealistic simulation environment. By evaluating white-box autonomy solutions through this rigorous testing pipeline, REBAR delivers an objective and repeatable benchmark score, bridging the gap between abstract principles and verifiable, accountable autonomy.

2605.18421 2026-05-19 cs.CL cs.AI cs.LG

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

AI summary: This paper proposes EvoMemBench, which evaluates agent memory from a self-evolving perspective with a unified benchmark built along two dimensions, memory scope and memory content. It compares 15 memory methods and finds that current memory systems are not yet a general solution: long-context baselines remain competitive, memory helps most when context is insufficient or tasks are difficult, retrieval-based methods excel on knowledge-intensive tasks, and procedural and long-term memory methods are more effective when the stored experience matches the task structure.

Abstract

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

2605.18419 2026-05-19 cs.CV cs.AI

Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

面向病理组织学中鲁棒视觉上下文学习的几何感知不确定性核心集

Franciskus Xaverius Erick, Johanna Paula Müller, Bernhard Kainz

AI总结 本文提出GAUC,一种无需训练的核心集选择方法,直接在预训练的多模态嵌入空间中操作,通过联合优化三个目标提升视觉上下文学习的准确性、校准性和提示鲁棒性。

详情
AI中文摘要

视觉-语言模型(VLMs)能够将视觉感知与开放式临床推理结合,使其在计算组织病理学中具有吸引力。然而,在稀缺的专家标注病理数据上微调数十亿参数代价过高;上下文学习(ICL)无需更新参数,仅以示例图像-文本对作为VLM的条件输入,但对所选示例和查询措辞高度敏感,导致诊断不可靠。现有选择策略或依赖忽略全局数据结构的查询相关最近邻检索,或需要昂贵的参数更新,或忽视了VLM的联合视觉-文本嵌入几何。我们提出GAUC,一种无需训练的核心集选择方法,直接在预训练的多模态嵌入空间中操作。GAUC联合优化三个目标:(1)最大均值差异项,保证核心集与完整数据集之间的分布一致性;(2)有效互信息差异正则化器,利用VLM的联合视觉-文本对齐来限制提示改写下的性能下降;(3)预测方差惩罚,抑制过于自信且不稳定的输出。在CRC-100K和MHIST数据集以及多个开源VLM架构上,GAUC在准确率、校准性和提示鲁棒性上均稳定优于近期的ICL选择方法和数据集蒸馏基线,且无需任何梯度更新。

英文摘要

Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM's joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.
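The first GAUC objective, a Maximum Mean Discrepancy (MMD) term between coreset and full dataset, can be illustrated with a minimal sketch. The Gaussian kernel, the `gamma` bandwidth, the biased estimator, and the toy 2-D points are all assumptions made for illustration; the abstract does not state which kernel or estimator GAUC uses, and the other two objectives are not modelled here.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_kernel(xs, ys, gamma=1.0):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(coreset, dataset, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(coreset, coreset, gamma)
            - 2.0 * mean_kernel(coreset, dataset, gamma)
            + mean_kernel(dataset, dataset, gamma))

# Toy embedding space with two well-separated clusters (hypothetical data).
dataset = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
balanced = [(0.0, 0.0), (5.0, 5.0)]    # one point per cluster
one_sided = [(0.0, 0.0), (0.1, 0.0)]   # both points from one cluster

print(mmd_squared(balanced, dataset), mmd_squared(one_sided, dataset))
```

In this toy setting the coreset that covers both clusters has the smaller MMD, which is the sense in which the term rewards distributional fidelity.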

2605.18409 2026-05-19 cs.SD

EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge

EnvTriCascade: 面向ESDD2 2026挑战赛的环境感知三阶段级联框架

Hengyan Huang, Xiaoxuan Guo, Jiayi Zhou, Yuankun Xie, Jian Liu, Haonan Cheng, Long Ye, Qin Zhang

AI总结 本文提出EnvTriCascade框架,通过环境感知的三阶段级联结构区分原始录音与被篡改的混合信号,并在ESDD2挑战赛中取得较高的Macro-F1分数。

详情
AI中文摘要

在现实场景中,ADD已从仅针对语音的伪造发展到更具挑战性的组件级设置,其中语音和环境声音可能被独立篡改。为解决这一问题,我们提出EnvTriCascade,一个面向ESDD2挑战赛的环境感知三阶段级联框架。首先,混合一致性检测器提供二元先验,用于区分原始录音和被篡改的混合音频,并校准最终决策。其次,两个互补的五分类检测器分别利用SSLAM+XLS-R和EAT-large+XLS-R表示提取鲁棒的多分支特征,并通过跨分支注意力门控分类器进行整合。为了增强对不同混合条件的鲁棒性,我们引入了RawBoost增强。我们的系统仅在官方CompSpoofV2数据集上训练,在测试集上取得0.8266的Macro-F1分数,显著优于官方基线,并在挑战赛中排名第二。

英文摘要

ADD in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior to distinguish original recordings from manipulated mixtures, which calibrates the final decisions. Next, two complementary five-class detectors, leveraging SSLAM+XLS-R and EAT-large+XLS-R representations, extract robust multi-branch features integrated via a cross-branch attention-gated classifier. To enhance robustness against diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
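For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Macro-F1 score over a multi-class task is usually computed: per-class F1 values are averaged with equal weight. The integer labels and predictions are hypothetical, and the challenge's official scoring script is not described in the abstract, so this should be read as the textbook definition rather than the organisers' implementation.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never appears and is never predicted scores 0 here
        # (an assumed convention; other tools may skip such classes).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical three-class example.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(round(macro_f1(y_true, y_pred, labels=[0, 1, 2]), 4))
```

Because every class contributes equally, a rare class that is detected poorly lowers Macro-F1 as much as a frequent one would.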

2605.18408 2026-05-19 cs.CV

Historical Knowledge Graphs for Global Maritime Estimated Time of Arrival

全球海运估计到达时间的历史知识图谱

Neofytos Dimitriou

AI总结 本文提出利用AIS数据构建全球海运历史知识图谱的方法,通过高斯混合模型预处理提取轨迹,利用速度分布构建图谱,实现高效的航行时间预测,为港口运营和减排提供支持。

详情
AI中文摘要

准确的船舶预计到达时间预测对港口运营和脱碳至关重要,但在缺乏成本高昂的上下文数据的情况下,全球范围的航行时间预测仍然困难。本文提出一种仅使用自动识别系统(AIS)数据构建历史海运知识图谱的方法。首先,通过基于高斯混合模型的预处理流程从含噪AIS数据中提取分段轨迹。然后迭代处理这些轨迹,并按船舶类型、航行时间和航行方向分层存储速度分布,由此构建的全球图谱包含5,433个geohash-3节点和12,334条边。该图谱可通过一个分层的、基于优先级的系统查询任意两个位置之间的航行时间预测,该系统利用历史统计数据,并带有原则性的回退机制。在按时间留出的测试集上,中位RMSE为22.75分钟(分段级)和30.90分钟(轨迹级),其中69.1%的轨迹误差在实际到达时间的20%以内。在第二个外部测试集上,中位RMSE为27.36分钟(分段级)和37.46分钟(轨迹级),其中62.1%的轨迹在20%以内。这些结果印证了该方法的潜力:它能够实现全球航行时间预测,并为准时到达规划和减排提供坚实基础。

英文摘要

Accurate vessel estimated-time-of-arrival forecasts are critical for port operations and decarbonization, yet global-scale travel-time prediction remains difficult without costly contextual data. Herein, I present a methodology for constructing a historical maritime knowledge graph using only Automatic Identification System (AIS) data. First, segmented trajectories are extracted from noisy AIS data using a Gaussian-mixture-model-based preprocessing pipeline. The graph is then constructed by iteratively processing the trajectories and storing speed distributions stratified by vessel type, time of travel, and direction of travel; the resulting global graph comprises 5,433 geohash-3 nodes and 12,334 edges. The graph can be queried to retrieve travel-time predictions between any two locations via a hierarchical, priority-based system that uses historical statistics with principled fallback. On a temporally held-out test set, median RMSE is 22.75 min (segment-level) and 30.90 min (trajectory-level), with 69.1% of trajectories within 20% of actual arrival time. On a second external test set, median RMSE is 27.36 min (segment-level) and 37.46 min (trajectory-level), with 62.1% of trajectories within 20%. These results corroborate the promise of the method, enabling global travel-time prediction and providing a strong foundation for just-in-time arrival planning and emissions reduction.
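The idea of a priority-based query with fallback can be sketched roughly as follows. Everything here is hypothetical: the key layout `(edge, vessel_type, hour_bucket)`, the fallback order, the default speed, and the toy numbers are assumptions chosen for illustration, since the abstract does not specify the actual hierarchy or the statistics stored on each edge.

```python
def lookup_speed(stats, edge, vessel_type, hour_bucket):
    """Return the most specific stored speed (knots) for an edge, or None."""
    candidates = (
        (edge, vessel_type, hour_bucket),  # most specific stratum
        (edge, vessel_type, None),         # ignore time of travel
        (edge, None, None),                # any vessel, any time
    )
    for key in candidates:
        if key in stats:
            return stats[key]
    return None

def travel_time_hours(stats, path, distances_nm, vessel_type, hour_bucket,
                      default_knots=12.0):
    """Sum per-edge travel times along a path of graph edges."""
    total = 0.0
    for edge, distance in zip(path, distances_nm):
        speed = lookup_speed(stats, edge, vessel_type, hour_bucket)
        if speed is None:
            speed = default_knots  # last-resort fallback (assumed value)
        total += distance / speed
    return total

# Hypothetical historical statistics keyed by (edge, vessel type, hour bucket).
stats = {
    (("a", "b"), "cargo", 0): 10.0,
    (("b", "c"), None, None): 20.0,
}
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(travel_time_hours(stats, path, [50.0, 40.0, 24.0], "cargo", 0))
```

The first edge is answered from the most specific stratum, the second from the coarsest one, and the third from the default, which is the general shape of a fallback chain.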

2605.18401 2026-05-19 cs.CL cs.AI

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

SkillsVote: 代理技能从收集、推荐到进化的生命周期治理

Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, Feiyu Xiong, Zhiyu Li

AI总结 本文提出SkillsVote框架,对代理技能进行从收集、推荐到进化的生命周期治理,在不更新模型的情况下提升了冻结代理在Terminal-Bench 2.0和SWE-Bench Pro上的性能。

详情
Comments
44 pages, 7 figures, 5 tables
AI中文摘要

长周期LLM代理留下的轨迹可能成为可重用的经验,但原始轨迹噪声大且难以管理。我们将代理技能视为一种经验模式,结合可执行脚本和不可执行的指导。然而,开放技能生态系统包含冗余、不均匀、环境敏感的产物,随意更新会污染未来上下文。我们提出了SkillsVote,一个用于代理技能生命周期治理的框架,从收集和推荐到进化。SkillsVote对百万级开源语料库进行环境需求、质量和可验证性分析,然后合成可验证技能的任务。在执行前,SkillsVote在结构化技能库中进行代理库搜索以暴露教学技能上下文。在执行后,它将轨迹分解为技能关联的子任务,将结果归因于技能使用、代理探索、环境和结果信号,并只接受成功的可重用发现以进行证据门控更新。在评估中,离线进化使GPT-5.2在Terminal-Bench 2.0上提升高达7.9个百分点,而在线进化使SWE-Bench Pro提升高达2.6个百分点。总体而言,受控的外部技能库可以在不更新模型的情况下提升冻结代理,当系统控制暴露、信用和保存时。

英文摘要

Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic library search over a structured skill library to expose instructional skill context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. In our evaluation, offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp, while online evolution improves SWE-Bench Pro by up to 2.6 pp. Overall, governed external skill libraries can improve frozen agents without model updates when systems control exposure, credit, and preservation.

2605.18390 2026-05-19 cs.CV

Vision Foundation Models as Generalist Tokenizers for Image Generation

视觉基础模型作为图像生成的通用标记器

Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

AI总结 本文提出了一种基于冻结视觉基础模型(VFM)的通用图像标记器VFMTok,通过区域自适应量化框架和语义重建目标,提升了图像生成的质量和效率,同时在离散和连续潜在空间中实现了高保真度的类别条件合成。

详情
Comments
4 figures and 14 tables
AI中文摘要

在本文中,我们探索了一个尚未被充分探索的方向:直接在冻结的视觉基础模型(VFM)之上构建通用图像标记器。为了构建此标记器,我们利用冻结的VFM作为编码器,并引入两个关键创新:(1)区域自适应量化框架,用于消除标准2D网格特征中的空间冗余;(2)语义重建目标,使解码输出与VFM的表示对齐,以保持语义保真度。基于这些设计,我们提出了VFMTok,一种能够在离散和连续潜在空间中无缝运行的通用视觉标记器。VFMTok在合成质量上取得了显著提升,同时大幅提高了标记效率。对于离散自回归(AR)生成,它将模型收敛速度提升3倍,并在ImageNet类别条件合成上实现了最先进的gFID值1.36。同样,对于连续空间生成,将VFMTok与去噪模型结合,可获得极佳的gFID值1.25。此外,由于潜在空间本身捕捉了丰富的空间语义,VFMTok能够在两种生成范式中无需无分类器引导(w/o CFG)即可实现高保真度的类别条件合成,显著加快了推理速度。除了这些显著的实证结果外,我们还系统地研究了我们方法的底层机制。我们发现,VFM预训练过程中使用的特定自监督学习目标决定了其作为标记器的有效性。具体来说,联合优化全局对比学习和潜在掩码图像建模的VFM能为图像标记化提供最佳表示。这些见解为未来图像标记器的设计奠定了坚实的基础,并提供了有价值的指导。

英文摘要

In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.

2605.18387 2026-05-19 cs.LG cs.AI

Graph Hierarchical Recurrence for Long-Range Generalization

图层次递归用于长距离泛化

Stefano Carotti, Marco Pacini, Alessio Gravina, Davide Bacciu, Bruno Lepri, Sebastiano Bontorin

AI总结 本文提出了一种名为图层次递归(GHR)的新框架,通过在输入图和通过池化获得的层次抽象上联合操作,解决了图神经网络和图转换器在长距离相关性捕捉任务中的限制,并在多个长距离基准测试中表现出色,参数效率高。

详情
AI中文摘要

图神经网络(GNNs)和图转换器(GTs)已成为图学习的基本范式,结合了深度模型的表示学习能力与归纳偏置带来的样本效率。尽管它们十分有效,大量研究表明这些模型在需要捕捉图中远距离区域之间相关性的任务中仍面临根本性限制。为了解决这一问题,我们引入了图层次递归(GHR),一种新的框架,同时在输入图和通过池化获得的层次抽象上进行操作。我们还表明,现有模型的局限性在超出范围的泛化中更加明显,即测试实例涉及的相互作用距离比训练时观察到的更长。相比之下,尽管设计简单,GHR具有三个关键优势:在长距离依赖上表现强劲、超出范围的泛化能力更好,以及参数效率高。为了验证这些主张,我们展示了在一组广泛的长距离基准测试中,GHR始终优于现有的图模型,而所用参数量最低仅为当前最先进模型的1%。这些结果为当前通过扩大架构规模来获得图基础模型的趋势指出了一个互补方向,表明仅增加模型容量可能不足以实现泛化。

英文摘要

Graph Neural Networks (GNNs) and Graph Transformers (GTs) are now a fundamental paradigm for graph learning, combining the representation-learning capabilities of deep models with the sample efficiency induced by their inductive biases. Despite their effectiveness, a large body of work has shown that these models still face fundamental limitations in tasks that require capturing correlations between distant regions of a graph. To address this issue, we introduce Graph Hierarchical Recurrence (GHR), a novel framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. We also show that the limitations of existing models are even more pronounced in out-of-range generalization, where test instances involve interactions over distances longer than those observed during training. By contrast, despite its simple design, GHR provides three key advantages: strong performance on long-range dependencies, improved out-of-range generalization, and high parameter efficiency. To corroborate these claims, we show that across a broad set of long-range benchmarks, GHR consistently outperforms existing graph models while using as little as 1% of the parameters of current state-of-the-art models. These results suggest a complementary direction to the current trend of scaling architectures to obtain graph foundation models, indicating that increased model capacity alone may not be sufficient for generalization.

2605.18385 2026-05-19 cs.RO cs.AI

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

面向动态室内环境的泛在建图与定位

Halim Djerroud, Nico Steyn, Olivier Rabreau, Patrick Bonnin, Abderraouf Benali

AI总结 本文提出UbiSLAM,一种用于动态室内环境实时建图和定位的创新解决方案,通过部署固定RGB-D相机网络,缓解传统SLAM系统对环境变化敏感、依赖移动单元传感器等问题,提升机器人在环境中的定位精度和响应性。

详情
Journal ref
Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Volume 1, pages 537-548, SciTePress, 2025. ISBN: 978-989-758-737-5, ISSN: 2184-433X
AI中文摘要

我们提出了UbiSLAM,一种用于动态室内环境实时建图和定位的创新解决方案。通过在工作空间内战略性地部署固定RGB-D相机网络,UbiSLAM解决了传统SLAM系统常见的局限性,如对环境变化的敏感性和对移动单元传感器的依赖。这种固定传感器方法实现了实时、全面的建图,提高了在环境中运行的机器人的定位精度和响应性。由UbiSLAM生成的集中式地图持续更新,为机器人提供准确的全局视图,从而改善导航、减少碰撞,并促进共享空间中更流畅的人机交互。除优势之外,UbiSLAM也面临挑战,特别是在确保完整空间覆盖和处理盲区方面,这需要融合来自机器人自身的数据。在本文中,我们讨论了潜在的解决方案,如用于优化相机位置和朝向的自动校准,以及用于实时数据共享的增强通信协议。所提出的模型降低了单个机器人单元的计算负载,使较简单的机器人平台也能有效运行,同时增强了整个系统的鲁棒性。

英文摘要

We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly encountered in traditional SLAM systems, such as sensitivity to environmental changes and reliance on mobile unit sensors. This fixed-sensor approach enables real-time, comprehensive mapping, enhancing the localization accuracy and responsiveness of robots operating within the environment. The centralized map generated by UbiSLAM is continuously updated, providing robots with an accurate global view, which improves navigation, minimizes collisions, and facilitates smoother human-robot interactions in shared spaces. Beyond its advantages, UbiSLAM faces challenges, particularly in ensuring complete spatial coverage and managing blind spots, which necessitate data integration from the robots themselves. In this paper we discuss potential solutions, such as automatic calibration for optimal camera placement and orientation, along with enhanced communication protocols for real-time data sharing. The proposed model reduces the computational load on individual robotic units, allowing less complex robotic platforms to operate effectively while enhancing the robustness of the overall system.

2605.18383 2026-05-19 cs.LG

TabH2O: A Unified Foundation Model for Tabular Prediction

TabH2O:用于表格预测的统一基础模型

Pascal Pfeiffer, Dmitry Gordeev, Mathias Müller, Laura Fink, Joan Salvà Soler, Mark Landry, Branden Murray, Marcos V. Conde, Sri Satish Ambati

AI总结 本文提出TabH2O,一种统一的基础模型,通过上下文学习在单次前向传递中实现分类和回归。该模型基于TabICL架构进行了关键改进,包括统一训练、单阶段预训练和噪声感知预训练,从而在表格数据预测任务中表现出色。

详情
Comments
Technical Report - https://tabh2o.h2oai.com/
AI中文摘要

我们提出了TabH2O,一种用于表格数据的基础模型,该模型通过上下文学习在单次前向传递中实现分类和回归。TabH2O基于TabICL架构进行了若干关键改进:(1) 统一训练,一个模型通过双头架构同时处理分类和回归,消除了对单独模型的需要,从而降低了总预训练成本;(2) 单阶段预训练,通过训练稳定性改进(有界可扩展softmax、阶段间归一化、可学习残差缩放、logit软上限)消除了多阶段课程学习的需要,使模型能够从一开始就使用完整长度序列进行训练;(3) 噪声感知预训练,合成数据集包含显式噪声维度以教导模型对无关特征具有鲁棒性。我们在TALENT基准(300个数据集)上评估了TabH2O v1(29.2M参数),其中它在6种评估方法中的平均排名为2.55,优于调优的CatBoost(4.07)、H2O AutoML(4.18)和LightGBM(5.08),与TabPFN v2.6(2.74)竞争,但落后于TabICL v2(2.12),并在分类和回归任务中81%的测试数据集上位列前三名。

英文摘要

We present TabH2O, a foundation model for tabular data that performs classification and regression in a single forward pass via in-context learning. TabH2O builds on the TabICL architecture with several key modifications: (1) unified training: a single model handles both classification and regression via a dual-head architecture, eliminating the need for separate models and reducing total pretraining cost; (2) single-stage pretraining: training stability improvements (bounded scalable softmax, inter-stage normalization, learnable residual scaling, logit soft-capping) eliminate the need for multi-stage curriculum learning, enabling training with full-length sequences from the start; and (3) noise-aware pretraining: synthetic datasets include explicit noise dimensions to teach the model robustness to irrelevant features. We evaluate TabH2O v1 (29.2M parameters) on the TALENT benchmark (300 datasets), where it achieves an average rank of 2.55 out of 6 evaluated methods, outperforming tuned CatBoost (4.07), H2O AutoML (4.18), and LightGBM (5.08), remaining competitive with TabPFN v2.6 (2.74), and trailing TabICL v2 (2.12), while placing in the top 3 on 81% of the test datasets across classification and regression tasks.

2605.18381 2026-05-19 cs.LG

Generating Physically Consistent Molecules with Energy-Based Models

利用基于能量的模型生成物理一致的分子

Christoph Griesbacher, Lea Bogensperger, Andreas Habring, Thomas Pock

AI总结 本文提出了一种基于能量模型(EBM)的方法EBMol,用于生成三维分子,通过学习原子可加的标量势能恢复了能量归纳偏差,从而在QM9和GEOM-Drugs数据集上实现了最先进的性能,并展示了学习的能量景观作为质量度量用于配置排序和过滤,以及通过形状引导采样实现可控生成。

详情
AI中文摘要

处于平衡状态的分子遵循玻尔兹曼分布,这使得底层的能量景观成为一个有物理依据的建模目标。然而,这样的景观难以从数据中学习,学到之后也难以采样。扩散模型和流匹配模型通过学习噪声与数据之间的时间条件分数或传输场来规避这些困难,以失去能量归纳偏置为代价换取更易处理的训练目标。我们引入EBMol,一种基于能量的模型(EBM),它学习一个原子可加的标量势能,从而恢复这种归纳偏置,且训练过程中无需显式模拟。我们的方法采用受流启发的恢复场匹配(Restoring Field Matching)目标来近似能量景观。我们采用镜像朗之万(Mirror-Langevin)算法进行采样,从而能够统一更新原子位置和原子类型,并引入并行回火(parallel tempering)以实现推理时的计算扩展。EBMol是首个在QM9和GEOM-Drugs上达到最先进性能的三维分子生成EBM。此外,我们表明学习到的能量景观可以作为一种有原则的质量度量,用于对构型进行排序和过滤,并通过基于势能组合的形状引导采样和零样本连接子设计,展示了无需重新训练的可控生成。

英文摘要

Molecules in equilibrium follow a Boltzmann distribution, making the underlying energy landscape a physically grounded modeling objective. However, such landscapes are difficult to learn from data and, once learned, hard to sample from. Diffusion and flow-matching models sidestep these difficulties by learning a time-conditional score or transport field between noise and data, losing the energy inductive bias in exchange for a more tractable training objective. We introduce EBMol, an energy-based model (EBM) that restores this inductive bias by learning an atom-additive scalar potential without explicit simulation during training. Our method employs a flow-inspired Restoring Field Matching objective to approximate the energy landscape. We adopt the Mirror-Langevin algorithm for sampling, enabling unified updates of atomic positions and types, and incorporate parallel tempering for inference-time compute scaling. EBMol is the first EBM for 3D molecular generation to achieve state-of-the-art performance on QM9 and GEOM-Drugs. Moreover, we show that the learned energy landscape serves as a principled quality metric for ranking and filtering configurations, and demonstrate controllable generation without retraining through shape-steered sampling via potential composition and zero-shot linker design.
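The link between an energy landscape and a ranking of configurations can be made concrete with the Boltzmann distribution, $p_i \propto \exp(-E_i / kT)$. The sketch below is only an illustration of that textbook relation: the energy values and the `kT` setting are made up, and EBMol's learned potential, Restoring Field Matching objective, and Mirror-Langevin sampler are not reproduced here.

```python
import math

def boltzmann_weights(energies, kT=1.0):
    """Normalised Boltzmann probabilities for a list of energies."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    unnormalised = [math.exp(-(e - e_min) / kT) for e in energies]
    partition = sum(unnormalised)
    return [u / partition for u in unnormalised]

def rank_by_energy(energies):
    """Indices of configurations ordered from lowest to highest energy."""
    return sorted(range(len(energies)), key=lambda i: energies[i])

# Hypothetical energies for three candidate configurations.
energies = [2.0, 0.5, 1.0]
print(rank_by_energy(energies))
print(boltzmann_weights(energies))
```

Lower energy means higher probability, so sorting by energy is equivalent to sorting by Boltzmann weight; filtering could then keep only configurations below an energy threshold.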

2605.18380 2026-05-19 cs.AI

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

QSTRBench: 评估语言模型利用定性空间与时间演算进行推理能力的新基准

Anthony G. Cohn, Robert E. Blackwell

AI总结 本文提出QSTRBench基准,用于评估大语言模型在定性空间和时间推理方面的能力,针对多种定性演算设置组合推理、逆关系和概念邻域等任务。结果显示模型表现因演算不同而差异显著,其中PA最简单,RCC-22最难。

详情
Comments
74 pages, 20 figures
AI中文摘要

我们介绍了一个用于评估大语言模型(LLMs)的大规模定性空间和时间推理(QSTR)基准。我们针对以下QSTR演算提出了关于组合推理(使用组合表,CT)、逆关系和概念邻域(CN)的问题:点代数(PA)、Allen区间代数、区间与持续时间(INDU)、区域连接演算(RCC-5、RCC-8和RCC-22)、九交模型、基本方向演算和STAR。RCC-22的CN在此首次发布。一个扩展基准针对选定的演算系统性地改变问题的呈现方式,包括前缀/中缀表示、词语/符号/临时造词以及示意性描述。我们报告了当前前沿模型的结果。所有被测模型的表现都优于随机猜测,但没有模型能够始终正确回答所有问题。性能因演算不同而差异显著,其中PA最简单,RCC-22最难。我们以开放许可证发布该基准和我们的结果,以促进对LLM定性空间/时间推理能力的进一步评估。

英文摘要

We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for the following QSTR calculi: Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation, including prefix/infix, words/symbols/nonce terms, and schematic descriptions, for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing but none can consistently answer all questions correctly. Performance varies sharply by calculus, with PA being the most straightforward, and RCC-22 the most difficult. We release the benchmark and our results under an open licence to facilitate further assessment of qualitative spatio/temporal reasoning in LLMs.
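To make "composition table" and "converse relation" concrete, here is a minimal sketch for the Point Algebra, the simplest calculus in the list, with base relations `<`, `=`, `>`. The function names and the brute-force check over a small integer domain are illustrative choices, not part of the benchmark itself.

```python
PA_RELATIONS = ("<", "=", ">")

# Converse: if x r y holds, then y CONVERSE[r] x holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}

def compose(r1, r2):
    """Relations possible between x and z, given x r1 y and y r2 z."""
    if r1 == "=":
        return {r2}
    if r2 == "=":
        return {r1}
    if r1 == r2:
        return {r1}           # e.g. x < y and y < z forces x < z
    return set(PA_RELATIONS)  # e.g. x < y and y > z leaves x vs z open

def holds(relation, a, b):
    """Evaluate a base relation on two concrete numbers."""
    return {"<": a < b, "=": a == b, ">": a > b}[relation]

def brute_force_compose(r1, r2, domain=range(3)):
    """Derive the same table entry by enumerating small integer triples."""
    found = set()
    for x in domain:
        for y in domain:
            for z in domain:
                if holds(r1, x, y) and holds(r2, y, z):
                    found.update(r for r in PA_RELATIONS if holds(r, x, z))
    return found

print(compose("<", "<"), compose("<", ">"))
```

A composition question in this style asks which relations between x and z remain possible; an answer is correct only if it names exactly the set in the table entry.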

2605.18379 2026-05-19 cs.LG

Beyond Square Roots: Explicit Memory-Efficient Factorization for Multi-Epoch Private Learning

超越平方根:多轮差分隐私学习的显式内存高效分解

Nikita P. Kalinin, Aki Rehn, Joel Daniel Andersson, Antti Honkela, Christoph H. Lampert

AI总结 本文提出了一种统一的分解方法γ-BIFR,用于多轮差分隐私学习,该方法在低内存和低带宽情况下显著提升了RMSE、放大RMSE和隐私训练性能,同时提供了更紧的理论保证。

详情
AI中文摘要

相关噪声机制是提高差分隐私模型训练效用最具前景的方法之一,但严格的保证需要显式、可分析的分解,而实际部署需要内存效率。最近的研究开发了带状逆分解,通过利用相关矩阵的带状结构来同时满足这两个要求。带宽控制用于在迭代之间相关噪声的噪声缓冲区大小,从而控制效用和内存成本之间的权衡。现有分解强调这种权衡:DP-λCGD通过仅使用一个步骤的噪声缓冲区实现了高内存效率,但限制了其效用增益,而带状逆平方根(BISR)分解利用更大的相关窗口,在大带宽下渐近最优,但在低带宽下表现不佳。我们提出γ-BIFR,是这两种分解的统一泛化。在低内存、低带宽情况下,γ-BIFR显著提高了RMSE、放大RMSE和隐私训练性能,同时为多轮参与误差提供了更紧的理论保证。

英文摘要

Correlated-noise mechanisms are among the most promising approaches for improving the utility of differentially private model training, but rigorous guarantees require explicit, analyzable factorizations, and practical deployment requires memory efficiency. Recent works have developed banded inverse factorizations, which address both requirements by exploiting a banded structure in the correlation matrix. The bandwidth controls the size of the noise buffer used to correlate noise across iterations, and thus governs the tradeoff between utility and memory cost. Existing factorizations highlight this tradeoff: DP-$λ$CGD achieves high memory efficiency by using only a one-step noise buffer, but this limits its utility gains, while the banded inverse square root (BISR) factorization exploits larger correlation windows and is asymptotically optimal for large bandwidths but performs poorly at low bandwidths. We propose $γ$-BIFR, a unified generalization of both factorizations. In the low-memory, low-bandwidth regime, $γ$-BIFR significantly improves RMSE, amplified RMSE, and private training performance, while yielding tighter theoretical guarantees for multi-participation error in multi-epoch training.

2605.18374 2026-05-19 cs.LG cs.AI

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

超越推理时间搜索:强化学习合成可重用求解器

Soheyl Massoudi, Gabriel Apaza, Milad Habibi, Mark Fuge

AI总结 本文探讨了强化学习能否将组合优化的推理成本转移到代码LLM的权重中,从而合成可重用的求解器。通过Synergistic Dependency Selection问题,研究发现强化学习能有效生成约束感知的模拟退火模板,并在多个领域展示出更高的效率和鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)通常将组合优化视为推理时间的过程,通过采样、搜索或重复提示单独解决每个实例。我们询问强化学习是否可以将部分推理成本转移到代码LLM的权重中,从而让模型为整个问题家族合成可重用的求解器。我们研究了Synergistic Dependency Selection(SDS),一种受约束的二次背包问题的受控变体,旨在暴露特定的失败模式:局部信号和严格可行性约束使贪心启发式方法具有吸引力但不可靠。在相同的框架下,Best-of-64基础模型采样在接近全局虚拟最佳求解器(VBS)的28.7%差距处饱和;代码审计显示基础模型经常检索模拟退火模板但错误实现Metropolis接受规则。我们使用可行性门控奖励和轻量结构框架对Qwen2.5-Coder-14B-Instruct进行微调,使用组相对策略优化(GRPO)。所得到的策略在99.8%的可行SDS输出中收敛到一个约束感知的模拟退火模板,达到VBS的5.0%差距,并且在生成后执行/搜索成本方面比累积Best-of-64评估便宜91倍。一次编译检查显示,每个种子的最优冻结求解器在SDS测试集上重复使用时仍然高度竞争,而额外领域评估在作业调度问题上提供了更窄但积极的证据,表明框架可以超越SDS。负消融揭示了这种配方的局限性:标准稳定器会降低性能,软可行性门控失败,结果仍对奖励归一化和领域特定设计选择敏感。

英文摘要

Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.
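Since the code audits single out the Metropolis acceptance rule as the step the base model gets wrong, a minimal sketch of the standard rule may help. It assumes a minimisation convention, where `delta` is the change in cost caused by a candidate move (for a maximisation problem such as a knapsack objective, one would negate the objective first). The function name and the optional `u` argument, which lets a caller pass in the uniform random draw, are illustrative choices and are not taken from the paper.

```python
import math
import random

def metropolis_accept(delta, temperature, u=None):
    """Standard Metropolis rule for a move that changes the cost by `delta`.

    Improving or neutral moves (delta <= 0) are always accepted; worsening
    moves are accepted with probability exp(-delta / temperature).
    """
    if delta <= 0:
        return True
    if u is None:
        u = random.random()  # uniform draw in [0, 1)
    return u < math.exp(-delta / temperature)

# A worsening move of +1.0 at temperature 1.0 is accepted with
# probability exp(-1), roughly 0.37.
print(metropolis_accept(1.0, 1.0, u=0.30))  # draw below the threshold
print(metropolis_accept(1.0, 1.0, u=0.40))  # draw above the threshold
```

Typical mistakes include flipping the sign of `delta` or comparing against `exp(delta / temperature)`, either of which makes the search accept bad moves too often or reject them always.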

2605.18373 2026-05-19 cs.RO cs.LG math.DS math.OC

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

基于高效Koopman算子模型预测控制的动态机器人布料折叠

Edoardo Caldarelli, Franco Coltraro, Adrià Colomé, Lorenzo Rosasco, Carme Torras

AI总结 本文提出了一种基于Koopman算子的模型预测控制方法,用于快速生成布料折叠轨迹,结合物理仿真和高效的核基Koopman算子回归,以提高折叠任务的效率和精度。

详情
Comments
Accepted for presentation at the 2026 IEEE International Conference on Robotics and Automation (ICRA)
AI中文摘要

机器人布料折叠是一项具有挑战性的任务,尤其是在动态折叠任务中,需要通过快速运动利用布料的动力学特性进行折叠。当受到这种快速运动的影响时,布料动力学的复杂性会阻碍系统识别和折叠轨迹的规划,导致在使用物理布料模型时仿真到现实的转移困难。与人类在折叠任务中表现出的灵活性相比,机器人通常使用小而刚性的衣物,要么太慢,要么太快但不精确,需要多次尝试才能获得相对良好的折叠效果。在本文中,我们通过生成快速折叠轨迹来解决这些问题,采用了一种新的模型预测控制器,结合基于物理的布料动力学仿真和高效的核基Koopman算子回归。Koopman算子回归是一种日益流行的机器学习技术,用于非线性系统识别,用于获得被折叠布料的线性模型。此类代理模型,通过高保真的物理布料仿真器的数据进行训练,可以用于合适的模型预测控制算法中,替代昂贵的非线性模型,以高效地生成由机器人执行的折叠轨迹。在模拟和真实机器人实验中,我们展示了Koopman算子基于模型提供的线性化如何能够有效地生成未见过的姿势的快速折叠轨迹,而不牺牲折叠的准确性。

英文摘要

Robotic cloth folding is a challenging task, particularly when considering dynamic folding tasks, which aim at folding cloth by fast motions that leverage its dynamics. When subject to such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, resulting in a difficult simulation-to-reality transfer when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. Both in simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.

2605.18365 2026-05-19 cs.CV

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

GeoFlow: 在视频生成中强制隐式几何一致性

Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng, Gordon Wetzstein

AI总结 本文提出GeoFlow,一种通过强化学习微调来增强视频生成中几何一致性的方法,通过引入几何一致性奖励,有效减少时间上的几何伪影,同时保持感知质量。

详情
Comments
Project Page: https://geometryflow.github.io/
AI中文摘要

生成几何一致的视频仍然是一个开放性挑战:在网络规模数据上训练的文本到视频扩散模型仅隐式地处理几何,导致在相机运动下出现物体形变、纹理漂移和非刚性背景。现有解决方案要么只是作为副产品改进一致性,要么仅适用于静态场景,要么需要完全重新对齐模型的潜在空间。我们引入了一种几何一致性奖励,直接衡量生成视频中的运动是否与一个连贯的场景相容。我们的关键见解是,在物理一致的视频中,背景运动应能由刚性的相机诱导光流解释,而独立运动的物体应沿运动轨迹保持外观身份。我们利用光流、深度-位姿预测和基于特征的对应关系来分离刚性区域和动态区域,并分别评估它们的一致性。将此奖励与强化微调相结合,可将几何一致性从一种涌现属性转化为视频生成器的显式优化目标。该方法与模型无关,适用于同时包含相机运动和物体运动的多样化动态场景。实验表明,相比强基线,该方法显著减少了时间上的几何伪影,同时保持了感知质量。代码和模型权重已发布。

英文摘要

Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth--pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.