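arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.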

2601.03048 2026-05-28 cs.CV cs.AI cs.CC

On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

Siyi Lyu, Quan Liu, Feng Yan

AI summary: By formalizing spatial understanding as a group homomorphism problem, the paper argues that constant-depth Transformers, being bounded by $\mathsf{TC^0}$, cannot capture the spatial structure of non-solvable groups (e.g., SO(3)) in a single forward pass, assuming $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$.

Abstract

Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as learning a Group Homomorphism Problem -- where latent embeddings preserve the algebraic structure of physical transformations acting on images -- we identify a fundamental computational bottleneck. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision are strictly bounded by the complexity class $\mathsf{TC^0}$. Under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, a complexity boundary emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures in a single forward pass. To empirically validate this theoretical gap, we propose the Latent Space Algebra (LSA) benchmark, which reveals a significant degradation in ViT representations as the compositional depth of non-solvable tasks increases.
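
The word problem mentioned in the abstract can be illustrated with a small sketch. It assumes the symmetric group S5 as a finite non-solvable stand-in (the abstract's own example is SO(3)), and the helper names `compose`, `inverse`, and `word_is_identity` are hypothetical. The loop tracks the running product of a word one generator at a time; it is not the paper's LSA benchmark.

```python
IDENTITY = (0, 1, 2, 3, 4)


def compose(p, q):
    """Return the permutation 'apply q first, then p' on {0, ..., 4}."""
    return tuple(p[q[i]] for i in range(len(q)))


def inverse(p):
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)


def word_is_identity(word):
    """Decide the word problem: does the product of the word equal the identity?"""
    acc = IDENTITY
    for g in word:
        acc = compose(g, acc)  # the number of sequential steps grows with the word length
    return acc == IDENTITY


a = (1, 2, 3, 4, 0)  # a 5-cycle (assumed generator)
b = (1, 0, 2, 3, 4)  # a transposition (assumed generator)

trivial_word = word_is_identity([a, b, inverse(b), inverse(a)])
nontrivial_word = word_is_identity([a, b])
```

Longer words correspond to the greater compositional depth that the abstract associates with degraded representations.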

2601.01616 2026-05-28 cs.LG eess.SP

Real Time NILM Based Power Monitoring of Identical Induction Motors Representing Cutting Machines in Textile Industry

Md Istiauk Hossain Rifat, Moin Khan, Zohara Kamal, Md Borhan Uddin Khan, Mohammad Zunaed

AI summary: To address outdated energy monitoring in the textile industry, the paper proposes a real-time non-intrusive load monitoring (NILM) framework that uses the MATNILM model to disaggregate the power of identical induction motors. It demonstrates that real-time monitoring is feasible and shows that disaggregation becomes difficult when several identical machines run simultaneously.

Comments: 9 pages, 9 figures

Abstract

The textile industry in Bangladesh is one of the most energy-intensive sectors, yet its monitoring practices remain largely outdated, resulting in inefficient power usage and high operational costs. To address this, we propose a real-time Non-Intrusive Load Monitoring (NILM)-based framework tailored for industrial applications, with a focus on identical motor-driven loads representing textile cutting machines. A hardware setup comprising voltage and current sensors, Arduino Mega and ESP8266 was developed to capture aggregate and individual load data, which was stored and processed on cloud platforms. A new dataset was created from three identical induction motors and auxiliary loads, totaling over 180,000 samples, to evaluate the state-of-the-art MATNILM model under challenging industrial conditions. Results indicate that while aggregate energy estimation was reasonably accurate, per-appliance disaggregation faced difficulties, particularly when multiple identical machines operated simultaneously. Despite these challenges, the integrated system demonstrated practical real-time monitoring with remote accessibility through the Blynk application. This work highlights both the potential and limitations of NILM in industrial contexts, offering insights into future improvements such as higher-frequency data collection, larger-scale datasets and advanced deep learning approaches for handling identical loads.
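
A small sketch may help show why identical loads are hard to disaggregate. It enumerates the on/off assignments that are consistent with one aggregate power reading. The wattages, the tolerance, and the function name `candidate_states` are assumptions for illustration and are not taken from the paper's dataset or from MATNILM.

```python
from itertools import product


def candidate_states(aggregate_w, ratings_w, tol_w=5.0):
    """Return every on/off assignment whose summed power matches the aggregate reading."""
    return [
        states
        for states in product((0, 1), repeat=len(ratings_w))
        if abs(sum(s * r for s, r in zip(states, ratings_w)) - aggregate_w) <= tol_w
    ]


# Three distinct loads: the aggregate reading pins down a single assignment.
distinct = candidate_states(1100.0, [300.0, 800.0, 1500.0])

# Three identical motors: the reading says one motor is on, but not which one.
identical = candidate_states(750.0, [750.0, 750.0, 750.0])
```

With distinct ratings the aggregate identifies one assignment, while with identical ratings several assignments explain the same reading equally well, which matches the difficulty the abstract reports for per-appliance disaggregation.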

2504.10079 2026-05-28 cs.CV

Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

Hongyu Qu, Ling Xing, Jiachao Zhang, Rui Yan, Yazhou Yao, Xiangbo Shu

AI summary: Proposes HR2G-shot, a framework that unifies inter-frame, inter-video, and inter-task relation modeling to learn task-specific temporal patterns from a holistic view, improving few-shot action recognition.

Abstract

Few-shot action recognition (FSAR) aims to recognize novel action categories from few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies or inter-video interaction at the coarse video-level granularity. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture shared fine-grained temporal patterns across videos and to reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Going beyond inter-frame temporal interactions, we further devise two components to explore inter-video and inter-task relationships, respectively: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from a bank that stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current leading FSAR methods.

2601.00501 2026-05-28 cs.CV

CPPO: Contrastive Perception Policy Optimization for VLM Agents

Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Yong Zhang, Mohammad Akbari

AI summary: Proposes CPPO, a self-supervised contrastive perception policy optimization method that strengthens the visual grounding of vision-language models through a contrastive perception loss, requires no extra models or annotations, and outperforms prior methods on perception-critical tasks.

Abstract

We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision--language models (VLMs). Reliable perception is a core requirement for VLM-based agents that must reason and act in open-ended environments: faulty visual grounding cascades directly into faulty actions, hallucinated tool calls, and unsafe decisions. While reinforcement learning (RL) has significantly improved reasoning in language models, extending these advances to multimodal agents requires improving both perception and reasoning. Prior works address this challenge mainly through explicit perception rewards, which often require extra LLM judges, ground-truth annotations, or forced separation of perception from reasoning. CPPO addresses this limitation in a self-supervised manner by extending the RL objective with a Contrastive Perception Loss (CPL) that provides a direct learning signal for visual grounding. The contrastive objective encourages the model to become more sensitive to input visual information. To apply this signal effectively, CPPO identifies perception tokens using an entropy-shift mechanism in the model's output distributions under perturbed images and applies the contrastive loss selectively to those tokens during training. Experiments show that CPPO surpasses prior methods while avoiding extra models, making training more efficient and scalable, and yielding policies that are better suited to perception-critical agentic tasks.
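
The entropy-shift idea described above can be sketched as follows. This is only an illustration under stated assumptions: the use of the absolute entropy difference, the threshold value, and the name `perception_token_mask` are hypothetical, and the abstract does not specify how CPPO computes or thresholds the shift.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def perception_token_mask(clean_dists, perturbed_dists, threshold=0.3):
    """Flag positions whose entropy changes by more than `threshold` when the image is perturbed."""
    return [
        abs(entropy(q) - entropy(p)) > threshold
        for p, q in zip(clean_dists, perturbed_dists)
    ]


# Toy per-position distributions over a 4-token vocabulary (assumed values).
clean = [[0.97, 0.01, 0.01, 0.01], [0.70, 0.10, 0.10, 0.10]]
perturbed = [[0.25, 0.25, 0.25, 0.25], [0.68, 0.12, 0.10, 0.10]]

mask = perception_token_mask(clean, perturbed)
```

In this toy case the first position becomes much less certain once the image is perturbed, so it would be treated as a perception token and receive the contrastive loss; the second position barely changes and would be left out.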

2512.23959 2026-05-28 cs.CL cs.AI cs.LG

HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling

Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu

AI summary: Proposes HGMem, a hypergraph-based working memory in which hyperedges represent memory units and higher-order interactions form progressively, strengthening global understanding and complex reasoning in multi-step RAG.

Comments: ICML 2026; Code released at https://github.com/Encyclomen/HGMem

Abstract

Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Although many RAG systems incorporate a working memory to consolidate information, existing designs primarily function as a passive storage for isolated facts. This static nature overlooks crucial high-order correlations among primitive facts, thereby limiting models' capacity for multi-step reasoning and resulting in fragmented reasoning and weak global sense-making within extended contexts. We introduce HGMem, a hypergraph-based working memory system, extending the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph where hyperedges correspond to distinct memory units, enabling the progressive formation of high-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving the memory into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning. We evaluate HGMem on several challenging global sense-making benchmarks. Extensive experiments and in-depth analyses demonstrate that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse datasets.
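
A minimal data-structure sketch may make the hypergraph view of memory more concrete. It assumes that a memory unit is a hyperedge over a set of entity names and that two units can be fused into one higher-order unit. The class `HypergraphMemory` and its methods are hypothetical and do not reflect the API of the released HGMem code.

```python
from dataclasses import dataclass, field


@dataclass
class HypergraphMemory:
    """Memory whose units are hyperedges: each unit links any number of entity nodes."""

    hyperedges: dict = field(default_factory=dict)  # unit id -> (entity set, note)
    next_id: int = 0

    def add_unit(self, entities, note):
        """Store a new memory unit over the given entities and return its id."""
        unit_id = self.next_id
        self.hyperedges[unit_id] = (frozenset(entities), note)
        self.next_id += 1
        return unit_id

    def merge_units(self, a, b, note):
        """Fuse two units into one higher-order unit covering both entity sets."""
        ents_a, _ = self.hyperedges.pop(a)
        ents_b, _ = self.hyperedges.pop(b)
        return self.add_unit(ents_a | ents_b, note)

    def units_about(self, entity):
        """Return the ids of all units that mention the entity."""
        return [uid for uid, (ents, _) in self.hyperedges.items() if entity in ents]


mem = HypergraphMemory()
u0 = mem.add_unit({"Alice", "Acme"}, "Alice works at Acme")
u1 = mem.add_unit({"Acme", "Berlin"}, "Acme is based in Berlin")
u2 = mem.merge_units(u0, u1, "Alice likely works in Berlin")
```

Replacing the two source units with the merged one is a design choice made here for brevity; a real system might keep the originals alongside the higher-order unit.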

2512.22777 2026-05-28 cs.LG cs.AI

Adapting, Fast and Slow: On Few-Shot Transportability of Compositions

Kasra Jalaldoust, Elias Bareinboim

AI summary: Studies, in the few-shot setting, how causal mechanisms learned in source domains can be composed into a target-domain predictor using causal transportability theory. The paper distinguishes module transportability from circuit transportability and proposes a gradient-based relaxation of circuit search that exhibits fast or slow adaptation.

Abstract

Generalization across domains requires stable structure that links the source and target distributions. Building on causal transportability theory, we study a sequential prediction setting in which the target predictor can be represented as a circuit composed of causal mechanisms that are learnable from source data. We introduce two classes of transportability. Module transportability captures the atomic case, where the target predictor is given by a mechanism learnable from a single source domain. Circuit transportability generalizes this idea to target predictors obtained by composing several modules learned from source data, enabling zero-shot prediction even when no source mechanism directly predicts the target label. We study these classes of circuits under increasingly relaxed assumptions. First, we provide conditions under which the relevant circuits can be learned from source data alone, given causal knowledge about the source and target domains. We then relax these structural assumptions by allowing limited data from the target domain. In particular, we develop a supervised domain adaptation scheme that learns circuits without requiring explicit causal structure. The resulting few-shot guarantees tie the achievable error to the size of the smallest target circuit composable from modules learned from source data. Finally, we propose a gradient-based relaxation of the symbolic circuit search and evaluate it empirically, showing that it qualitatively tracks the predicted regimes of fast adaptation -- with and without process supervision over intermediate positions -- and slow adaptation when no source mechanism matches.
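
The notion of a smallest target circuit composed from source modules can be sketched with a brute-force search. This is a toy symbolic enumeration, not the gradient-based relaxation proposed in the paper; the modules, the few-shot examples, and the function name `smallest_circuit` are assumptions for illustration.

```python
from itertools import product


def smallest_circuit(modules, examples, max_size=3):
    """Return the shortest sequence of module names whose composition fits all target examples."""
    for size in range(1, max_size + 1):
        for names in product(sorted(modules), repeat=size):

            def run(x, names=names):
                # Apply the modules left to right.
                for name in names:
                    x = modules[name](x)
                return x

            if all(run(x) == y for x, y in examples):
                return names
    return None


# Hypothetical mechanisms learned from source domains.
modules = {
    "double": lambda x: 2 * x,
    "inc": lambda x: x + 1,
    "square": lambda x: x * x,
}

# A handful of labeled target samples; the hidden target is x * x + 1.
few_shot = [(1, 2), (2, 5), (3, 10)]
circuit = smallest_circuit(modules, few_shot)
```

Because candidates are tried in order of size, a small matching circuit is found after few evaluations, which loosely mirrors the link between circuit size and achievable error stated in the abstract. When no composition matches, the search returns `None`, corresponding to the slow-adaptation case.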

2501.09934 2026-05-28 cs.LG cs.AI

HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning

Xiaohong Yang, Minghui Liwang, Xianbin Wang, Zhipeng Cheng, Seyyedali Hosseinalipour, Huaiyu Dai, Zhenzhen Jiao

AI summary: To tackle model obsolescence, inefficient data utilization, and unbalanced resource allocation in multi-model training for vehicle-edge-cloud hierarchical federated learning, the paper proposes HEART, which combines a hybrid synchronous-asynchronous aggregation rule with a two-stage optimization algorithm (improved PSO plus GA, followed by a greedy algorithm) to minimize global training latency while balancing tasks.

Comments: Accepted by IEEE Transactions on Cloud Computing (22 pages, 7 figures)

Abstract

The rapid growth of AI-enabled Internet of Vehicles (IoV) calls for efficient Machine Learning (ML) solutions that can handle high vehicular mobility and decentralized data. This has motivated the emergence of Hierarchical Federated Learning over vehicle-edge-cloud architectures (VEC-HFL). Nevertheless, one aspect that is underexplored in the literature on VEC-HFL is that vehicles often need to execute multiple ML tasks simultaneously, and this multi-model training environment introduces crucial challenges. First, improper aggregation rules can lead to model obsolescence and prolonged training times. Second, vehicular mobility may result in inefficient data utilization by preventing the vehicles from returning their models to the network edge. Third, achieving a balanced resource allocation across diverse tasks becomes paramount, as it strongly affects the effectiveness of collaborative training. We take one of the first steps towards addressing these challenges by proposing a framework for multi-model training in dynamic VEC-HFL with the goal of minimizing global training latency while ensuring balanced training across various tasks, a problem that turns out to be NP-hard. To facilitate timely model training, we introduce a hybrid synchronous-asynchronous aggregation rule. Building on this, we present a novel method called Hybrid Evolutionary And gReedy allocaTion (HEART). The framework operates in two stages: first, it achieves balanced task scheduling through a hybrid heuristic approach that combines improved Particle Swarm Optimization (PSO) and Genetic Algorithms (GA); second, it employs a low-complexity greedy algorithm to determine the training priority of assigned tasks on vehicles. Experiments on real-world datasets demonstrate the superiority of HEART over existing methods.

2512.18566 2026-05-28 cs.LG cs.SY eess.SY q-bio.NC

Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment

Ruiqi Chen, Giacomo Vedovati, Todd Braver, ShiNung Ching

AI summary: Proposes DFORM, a framework that aligns the trajectories of two dynamical systems through a nonlinear coordinate transformation in order to assess topological equivalence and locate low-dimensional dynamical motifs in high-dimensional models.

Comments: 57 pages, 18 figures. For associated code, see https://github.com/rq-Chen/DFORM_stable

DOI: 10.1162/NECO.a.1526
Journal ref: Neural Computation (2026) 38 (6): 1006-1061

Abstract

Dynamical systems models such as recurrent neural networks (RNNs) are increasingly popular in theoretical neuroscience for hypothesis-generation and data analysis. Evaluating the dynamics in such models is key to understanding their learned generative mechanisms. However, such evaluation is impeded by two major challenges: First, comparison of learned dynamics across models is difficult because there is no enforced equivalence of their coordinate systems. Second, identification of mechanistically important low-dimensional motifs (e.g., limit sets) is intractable in high-dimensional nonlinear models such as RNNs. Here, we propose a comprehensive framework to address these two issues, termed Diffeomorphic vector field alignment FOR learned Models (DFORM). DFORM learns a nonlinear coordinate transformation between the state spaces of two dynamical systems, which aligns their trajectories in a maximally one-to-one manner. In so doing, DFORM enables an assessment of whether two models exhibit topological equivalence, i.e., similar mechanisms despite differences in coordinate systems. A byproduct of this method is a means to locate dynamical motifs on low-dimensional manifolds embedded within higher-dimensional systems. We verified DFORM's ability to identify linear and nonlinear coordinate transformations using canonical topologically equivalent systems, RNNs, and systems related by nonlinear flows. DFORM was also shown to provide a quantification of similarity between topologically distinct systems. We then demonstrated that DFORM can locate important dynamical motifs including invariant manifolds and saddle limit sets within high-dimensional models. Finally, using a set of RNN models trained on human functional MRI (fMRI) recordings, we illustrated that DFORM can identify limit cycles from high-dimensional data-driven models, which agreed well with prior numerical analysis.
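
A linear toy case can illustrate what it means for a coordinate change to align two vector fields. One standard way to state that a smooth map phi carries the field f onto the field g is the condition g(phi(x)) = Dphi(x) f(x). The sketch below checks this for two linear systems related by an invertible matrix P; the matrices and the name `alignment_residual` are assumptions, and DFORM itself learns a nonlinear map rather than using a known linear one.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


A = [[0.0, -1.0], [1.0, 0.0]]      # system 1: dx/dt = A x (a rotation field)
P = [[2.0, 1.0], [0.0, 1.0]]       # coordinate change phi(x) = P x
P_inv = [[0.5, -0.5], [0.0, 1.0]]  # inverse of P
B = matmul(matmul(P, A), P_inv)    # system 2: dy/dt = B y, the same dynamics in new coordinates


def alignment_residual(x):
    """Max-norm of g(phi(x)) - Dphi(x) f(x); zero when phi aligns the two fields at x."""
    lhs = matvec(B, matvec(P, x))  # g(phi(x))
    rhs = matvec(P, matvec(A, x))  # Dphi(x) f(x), with Dphi = P for a linear map
    return max(abs(l - r) for l, r in zip(lhs, rhs))


residual = alignment_residual([1.0, 2.0])
```

A learned alignment would minimize a residual of this kind over many sampled states, and a residual that cannot be driven to zero would suggest that the two systems are not equivalent under the class of maps considered.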

2512.17375 2026-05-28 cs.LG cs.CL cs.CR

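DSSE: Drone Swarm Search Environment

Manuel Castanares, Luis F. S. Carrete, Enrico F. Damiani, Leonardo D. M. de Abreu, José Fernando B. Brancalion, Fabrício J. Barth

AI summary: A PettingZoo-based multi-agent reinforcement learning environment in which drones search for targets using dynamic probability inputs.

Comments: 7 pages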

2512.12649 2026-05-28 cs.RO cs.SY eess.SY

Bayesian Optimization Parameter Tuning Framework for a Lyapunov Based Path Following Controller

Zhewen Zheng, Wenjing Cao, Hongkang Yu, Mo Chen, Takashi Suzuki

AI summary: Because interdependent parameters make manual tuning of nonlinear geometric controllers inefficient, the paper proposes a data-efficient tuning method that treats the closed-loop system as a black box and applies Bayesian optimization with a Gaussian-process surrogate. Experiments on Honda's AI-Formula three-wheeled robot show improved controller performance within 32 trials.

Comments: The authors request withdrawal because the current arXiv version does not reflect the complete and finalized authorship record of the manuscript. The author list and contribution record require correction before further public dissemination

Abstract

Parameter tuning in real-world experiments is constrained by the limited evaluation budget available on hardware. The path-following controller studied in this paper reflects a typical situation in nonlinear geometric controllers, where multiple gains influence the dynamics through coupled nonlinear terms. Such interdependence makes manual tuning inefficient and unlikely to yield satisfactory performance within a practical number of trials. To address this challenge, we propose a Bayesian optimization (BO) framework that treats the closed-loop system as a black box and selects controller gains using a Gaussian-process surrogate. BO offers model-free exploration, quantified uncertainty, and data-efficient search, making it well suited for tuning tasks where each evaluation is costly. The framework is implemented on Honda's AI-Formula three-wheeled robot and assessed through repeated full-lap experiments on a fixed test track. The results show that BO improves controller performance within 32 trials, including 15 warm-start initial evaluations, indicating that it can efficiently locate high-performing regions of the parameter space under real-world conditions. These findings demonstrate that BO provides a practical, reliable, and data-efficient tuning approach for nonlinear path-following controllers on real robotic platforms.

2512.09800 2026-05-28 cs.LG cs.DC cs.PF

Ariel-ML: Computing Parallelization with Embedded Rust for Neural Networks on Heterogeneous Multi-core Microcontrollers

Zhaolan Huang, Kaspar Schleiser, Gyungmin Myung, Emmanuel Baccelli

AI summary: To meet the need for parallelized TinyML inference on multi-core MCUs, the paper proposes Ariel-ML, an embedded-Rust toolkit that combines a generic TinyML pipeline with multi-core support, achieving low-latency inference on several 32-bit MCU families with a memory footprint comparable to C/C++ toolkits.

Abstract

Low-power microcontroller (MCU) hardware is currently evolving from single-core architectures to predominantly multi-core architectures. In parallel, new embedded software building blocks are increasingly written in Rust, while C/C++ dominance fades in this domain. On the other hand, small artificial neural networks (ANNs) of various kinds are increasingly deployed in edge AI use cases and executed directly on low-power MCUs. In this context, both incremental improvements and novel services will have to be continuously retrofitted using ANN execution in software embedded on sensing/actuating systems already deployed in the field. However, there has so far been no Rust embedded software platform automating parallelization for inference computation on multi-core MCUs executing arbitrary TinyML models. This paper fills this gap by introducing Ariel-ML, a novel toolkit we designed that combines a generic TinyML pipeline and an embedded Rust software platform, and that can take full advantage of the multi-core capabilities of various 32-bit microcontroller families (Arm Cortex-M, RISC-V, ESP-32). We published the full open source code of its implementation, which we used to benchmark its capabilities using a zoo of various TinyML models. We show that Ariel-ML outperforms prior art in terms of inference latency as expected, and we show that, compared to pre-existing toolkits using embedded C/C++, Ariel-ML achieves comparable memory footprints. Ariel-ML thus provides a useful basis for TinyML practitioners and resource-constrained embedded Rust developers.

2512.09786 2026-05-28 cs.LG cs.PF cs.SD eess.AS eess.SP

TinyDéjàVu: Smaller RAM and Faster Inference with Neural Networks on MCUs for Sensor Data Streams

Zhaolan Huang, Emmanuel Baccelli

AI summary: Proposes TinyDéjàVu, a framework that optimizes data flow during neural network inference on sensor time series, saving up to 90% of RAM on microcontrollers at equal compute latency.

Abstract

Examples of embedded intelligence include a wide variety of tiny neural networks used on board wireless sensors and actuators, which are expected to continuously perform inference on time series of the data they sense. In order to fit lifetime and energy consumption requirements when operating on battery, such hardware is exclusively based on microcontrollers with as little memory as possible, e.g., 128 kB of RAM. In this context, optimizing data flows during inference across neural network layers becomes crucial. In this paper, we introduce a new framework, TinyDéjàVu, and novel algorithms we designed to drastically reduce the RAM budget required by inference using various neural network models for sensor data time series on typical microcontroller hardware. We publish the implementation of TinyDéjàVu as open source, and we perform reproducible benchmarks on common microcontroller hardware (Arm Cortex-M). We show that TinyDéjàVu can save up to 90% of RAM usage with equal compute latency compared to prior work (StreamiNNC) on overlapping sliding window inputs.
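
The benefit of reusing work across overlapping sliding windows can be illustrated with a generic streaming convolution. This sketch shows a well-known buffering trick only; it is not the algorithm introduced by TinyDéjàVu, and the class name `StreamingConv1d` and the toy signal are assumptions.

```python
def conv1d_valid(signal, kernel):
    """Plain 'valid' 1-D convolution (correlation form) over a fully buffered window."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


class StreamingConv1d:
    """Keeps only the last len(kernel) samples and emits one new output per incoming sample."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.buffer = []

    def push(self, sample):
        """Feed one sample; return the new output, or None while the buffer is still filling."""
        self.buffer.append(sample)
        if len(self.buffer) > len(self.kernel):
            self.buffer.pop(0)
        if len(self.buffer) < len(self.kernel):
            return None
        return sum(s * w for s, w in zip(self.buffer, self.kernel))


stream = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
kernel = [1.0, 0.0, -1.0]

layer = StreamingConv1d(kernel)
streamed = [y for y in (layer.push(s) for s in stream) if y is not None]
batch = conv1d_valid(stream, kernel)
```

Both paths produce the same outputs, but the streaming layer holds three samples at any time instead of the whole window, which is the kind of RAM saving that matters on a device with 128 kB of memory.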

2512.00814 2026-05-28 cs.CV

Artemis: Structured Visual Reasoning for Perception Policy Learning

Wei Tang, Yanpeng Sun, Shan Zhang, Weihao Bo, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li

AI summary: Proposes Artemis, which replaces linguistic reasoning with structured visual reasoning, where each intermediate step is a (label, bounding-box) pair, improving visual perception policies and handling diverse perception tasks in a unified way.

Abstract

Recent reinforcement-learning frameworks for visual perception policy usually incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning method that performs structured visual reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision of proposal quality, and it avoids the ambiguity introduced by language-based reasoning. Building upon verifiable and spatially grounded reasoning chains, Artemis provides a unified architecture for diverse perceptual tasks, without requiring the task-specific designs relied upon by prior perceptual policy models. Trained using grounding and detection samples in natural image domains, Artemis generalizes to counting and geometric perception tasks. At its core, a spatially grounded, object-centric chain rule provides a principled foundation for scalable and general perceptual policies.
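
A short sketch can show why a (label, bounding-box) step is verifiable in a way that free-form text is not. It assumes that a step is checked against annotated boxes with an intersection-over-union threshold; the names `ReasoningStep` and `step_is_verified` and the 0.5 threshold are hypothetical and are not taken from the Artemis training recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningStep:
    """One intermediate reasoning step: an object label and its box."""

    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def step_is_verified(step, annotations, min_iou=0.5):
    """A step counts as verified if some annotation has the same label and enough overlap."""
    return any(
        step.label == label and iou(step.box, box) >= min_iou
        for label, box in annotations
    )


annotations = [("cup", (1, 0, 10, 10)), ("table", (0, 8, 40, 30))]
good_step = ReasoningStep("cup", (0, 0, 10, 10))
bad_step = ReasoningStep("cup", (30, 20, 40, 30))
```

Each step can be scored on its own, which is what makes direct supervision of intermediate proposals possible.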

2511.20934 2026-05-28 cs.AI cs.CV cs.LG

Guaranteed Optimal Compositional Explanations for Neurons

Biagio La Rosa, Leilani H. Gilpin

AI summary: Introduces the first framework, built on a decomposition, a heuristic, and a search algorithm, for computing guaranteed optimal compositional explanations over the full state space, and shows that 10-40% of beam-search explanations are suboptimal when concepts overlap.

Comments: Accepted at ICML 2026 (Oral), 43 pages, 10 figures

Abstract

Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations and concepts through logical rules, typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts assumptions related to the structure of the combinations and beam search to restrict the state space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations over the entire state space spanned by the adopted assumptions. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations in a time comparable to exhaustive beam search. Using this framework, we demonstrate that 10-40% of explanations previously obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.
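
The objects being searched over can be sketched on a toy scale. The example scores logical formulas over concept masks by their intersection over union with a neuron's activation mask, and exhaustively enumerates all formulas of length two. The masks, the concept names, and the function `best_length2_explanation` are assumptions; exhaustive enumeration is feasible only because the example is tiny, and it is not the optimal algorithm or the heuristic proposed in the paper.

```python
from itertools import permutations

# Logical operators applied to sets of pixel indices.
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "AND NOT": lambda a, b: a - b,
}


def iou(x, y):
    """Spatial alignment between two masks given as sets of pixel indices."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def best_length2_explanation(neuron_mask, concept_masks):
    """Score every 'c1 OP c2' formula and return (best score, formula)."""
    best = (0.0, None)
    for c1, c2 in permutations(sorted(concept_masks), 2):
        for op_name, op in OPS.items():
            score = iou(neuron_mask, op(concept_masks[c1], concept_masks[c2]))
            if score > best[0]:
                best = (score, f"{c1} {op_name} {c2}")
    return best


neuron = {1, 2, 3, 4}
concepts = {"water": {1, 2, 3, 4, 5, 6}, "boat": {5, 6, 7}, "sky": {8, 9}}

score, formula = best_length2_explanation(neuron, concepts)
```

Here "water" and "boat" overlap, and the best formula subtracts one from the other. Overlapping concepts are the case in which the abstract reports that beam search often returns suboptimal explanations.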

2511.20439 2026-05-28 cs.CV cs.AI

Object-Centric Vision Token Pruning for Vision Language Models

Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

AI summary: Proposes OC-VTP, which lightly pre-trains an object-centric vision token pruner that directly selects the most representative vision tokens, improving VLM inference efficiency while preserving accuracy.

Abstract

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs without fine-tuning any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.
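
The selection criterion described above, keeping the tokens from which the rest can best be reconstructed, can be illustrated with a simple greedy sketch. It assumes a nearest-kept-token reconstruction and a greedy search, both chosen here for brevity; OC-VTP instead uses a pre-trained object-centric pruner, and the names `reconstruction_error` and `greedy_select` are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two token vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def reconstruction_error(tokens, kept):
    """Total error when every token is replaced by its nearest kept token."""
    return sum(min(sq_dist(t, tokens[k]) for k in kept) for t in tokens)


def greedy_select(tokens, budget):
    """Add, one at a time, the token that most reduces the reconstruction error."""
    kept = []
    for _ in range(budget):
        candidates = [i for i in range(len(tokens)) if i not in kept]
        best = min(candidates, key=lambda i: reconstruction_error(tokens, kept + [i]))
        kept.append(best)
    return kept


# Toy 2-D "tokens" forming two clusters (assumed values).
tokens = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
kept = greedy_select(tokens, budget=2)
error = reconstruction_error(tokens, kept)
```

With a budget of two, the sketch keeps one token from each cluster, so the pruned set still represents both regions of the toy input.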


2511.02558 2026-05-28 cs.CV cs.LG q-bio.NC

Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction

预测未来解剖结构：纵向脑MRI到MRI的预测

Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka

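AI summary: Studies the prediction of future brain MRIs from baseline MRIs, using five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on the ADNI and AIBL datasets to achieve high-fidelity voxel-level prediction, and verifies cross-cohort generalization.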

DOI: 10.1109/ISBI61048.2026.11515462
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), Apr. 2026

AI中文摘要

从基线磁共振图像（MRI）预测未来脑状态是神经影像学的一个核心挑战，对研究阿尔茨海默病（AD）等神经退行性疾病具有重要意义。大多数现有方法预测未来认知评分或临床结果，例如从轻度认知障碍向痴呆的转化。相反，本文研究纵向MRI图像到图像的预测，该预测可以预测参与者未来数年的整个脑部MRI，内在建模复杂的、空间分布的神经退行模式。我们在两个纵向队列（ADNI和AIBL）上实施并评估了五种深度学习架构（UNet、U2-Net、UNETR、时间嵌入UNet和ODE-UNet）。使用捕捉全局相似性和局部差异的指标，将预测的随访MRI与实际随访扫描直接进行比较。表现最佳的模型实现了高保真预测，并且所有模型都能很好地泛化到独立的外部数据集，展示了稳健的跨队列性能。我们的结果表明，深度学习可以在体素水平上可靠地预测参与者特定的脑部MRI，为个体化预后提供了新的机会。

英文摘要

Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.


2511.15390 2026-05-28 cs.CV

Automatic Pruning Discovery for Large Language Models

大型语言模型的自动剪枝发现

Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang

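AI summary: Proposes AutoPrune, which uses LLMs to design pruning algorithms automatically, optimizes prompts with a Graph-driven Chain-of-Thought, and adds Skew-aware Dynamic Sparsity Allocation to address the outlier problem at high pruning ratios, outperforming existing methods on mainstream benchmarks.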

Comments 15 pages, 10 figures

AI中文摘要

大型语言模型（LLMs）在广泛任务上取得了显著性能，但由于其庞大的规模，阻碍了实际部署。现有的针对LLMs的剪枝方法（例如Wanda）严重依赖手动设计的剪枝算法，从而导致巨大的人力成本并需要专家知识。此外，我们首次识别出在高剪枝率下由均匀稀疏性导致的严重异常值问题，这引发了关于如何为LLMs设计自适应剪枝稀疏度的额外担忧。LLMs能否自行剪枝？在这项工作中，我们通过提出一种名为AutoPrune的新型剪枝方法给出了肯定答案，该方法首次通过利用LLMs自动为其自身设计最优剪枝算法，无需任何专家知识，从而克服了专家知识的限制。具体来说，为了缓解LLMs的黑箱性质，我们提出了一种图驱动思维链（GCoT）来优化提示，显著增强了学习剪枝算法中的推理过程，并使我们能够生成具有卓越性能和可解释性的下一代剪枝算法。最后，基于对异常值问题的洞察，我们引入了偏态感知动态稀疏分配（SDSA）来克服异常值问题，减轻高剪枝率下的性能下降。我们在主流LLMs基准上进行了广泛实验，证明了AutoPrune的优越性，它始终优于最先进的竞争对手。

英文摘要

Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, which leads to huge labor costs and requires expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind the dramatic performance degradation under high pruning ratios, which is caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity suited to LLMs. Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called AutoPrune, the first to overcome the limits of expert knowledge by leveraging LLMs to automatically design optimal pruning algorithms for themselves. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors.
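
For context on the hand-designed baseline the abstract names, here is a minimal sketch of a Wanda-style criterion: each weight is scored by its magnitude times the L2 norm of the matching input feature over calibration data, and the lowest-scoring weights in every output row are zeroed at the same (uniform) sparsity. This is not AutoPrune's generated algorithm, GCoT, or SDSA, none of which the abstract specifies; the function name `wanda_style_prune` and the toy tensors are hypothetical.

```python
import math


def wanda_style_prune(weights, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weights:     list of rows, shape (out_features, in_features)
    activations: calibration inputs, shape (n_samples, in_features)
    sparsity:    fraction of weights removed in every row (uniform)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the calibration samples.
    feature_norms = [
        math.sqrt(sum(sample[j] ** 2 for sample in activations))
        for j in range(n_in)
    ]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weights:
        # Score = |weight| * input-feature norm.
        scores = [abs(w) * feature_norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Hypothetical toy layer: feature 1 has large activations, feature 2 tiny ones.
weights = [[0.9, -0.1, 0.5, 0.2], [0.05, 0.8, -0.3, 0.4]]
activations = [[1.0, 10.0, 0.1, 1.0], [1.0, 10.0, 0.1, 1.0]]
pruned = wanda_style_prune(weights, activations, sparsity=0.5)
```

In the first row, the small weight `-0.1` survives because its input feature is large, whereas plain magnitude pruning would have dropped it. Every row is pruned at the same ratio here, which is the uniform-sparsity setting the abstract argues against at high pruning ratios.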


2511.14558 2026-05-28 cs.CV

Explaining Digital Pathology Models via Clustering Activations

通过激活聚类解释数字病理学模型

Adam Bajger, Jan Obdržálek, Vojtěch Kůr, Rudolf Nenutil, Petr Holub, Vít Musil, Tomáš Brázdil

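AI summary: Proposes an explainability method based on clustering the activations of convolutional neural networks; by showing the model's global behaviour and providing fine-grained information, it improves understanding of and trust in digital pathology models.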

DOI: 10.1109/ISBI61048.2026
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

AI中文摘要

我们提出了一种基于聚类的可解释性技术，用于基于卷积神经网络的数字病理学模型。与常用的基于显著性图的方法（如遮挡、GradCAM或相关性传播）不同，这些方法突出显示对单个切片预测贡献最大的区域，而我们的方法展示了所考虑模型的全局行为，同时提供了更细粒度的信息。结果聚类不仅可以可视化以理解模型，还可以增加对其操作的信心，从而在临床实践中更快地采用。我们还评估了我们的技术在现有用于检测前列腺癌的模型上的性能，证明了其实用性。

英文摘要

We present a clustering-based explainability technique for digital pathology models based on convolutional neural networks. Unlike commonly used methods based on saliency maps, such as occlusion, GradCAM, or relevance propagation, which highlight regions that contribute the most to the prediction for a single slide, our method shows the global behaviour of the model under consideration, while also providing more fine-grained information. The resulting clusters can be visualised not only to understand the model, but also to increase confidence in its operation, leading to faster adoption in clinical practice. We also evaluate the performance of our technique on an existing model for detecting prostate cancer, demonstrating its usefulness.


2511.09572 2026-05-28 cs.AI cs.LG cs.SE

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

SynthTools: 用于扩展智能体开发中合成工具的框架

Tommaso Castellani, Naimeng Ye, Daksh Mittal, Thomson Yen, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

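AI summary: Proposes SynthTools, an end-to-end LLM-based pipeline that generates large-scale, diverse tool-use environments through environment generation, simulation, validation, and task construction, improving agents' tool-use capabilities.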

AI中文摘要

为了使智能体系统能够使用外部工具解决复杂、长期的任务，我们需要大量多样且可控的工具使用环境。我们引入了SynthTools，一个完全基于LLM的管道，涵盖整个生命周期：环境生成、模拟、验证和任务构建。通过端到端地使用LLM，我们的框架补充了其他受限于真实API复杂性的工具使用环境，并通过设计确保可扩展性和可控性。该框架由三个组件组成：自上而下的环境生成，分层构建多样化的、基于领域的工具环境；环境模拟与验证，确保工具能够可靠地模拟并过滤掉无法模拟的工具；以及自下而上的任务与轨迹生成，产生可解决且可验证的任务以及多步轨迹，对难度、长度、轨迹组成和领域焦点进行控制以保证灵活性。作为具体实例，我们发布了包含6800个环境和100个领域中的73883个经过验证的工具、79925个可验证任务的数据集，以及大规模生成轨迹的管道。在这些任务生成的轨迹语料库上训练不同规模的Qwen3模型，在多个工具使用基准测试（包括真实API）上取得了提升，表明在合成数据上训练的工具使用能力可能迁移到某些真实环境。这些结果共同表明，SynthTools可以作为大规模训练工具使用智能体的有用基础设施。

英文摘要

For agentic systems to use external tools to solve complex, long-horizon tasks, we need a large set of diverse and controllable tool-use environments. We introduce SynthTools, a fully LLM-based pipeline spanning the entire lifecycle: environment generation, simulation, validation and task construction. By operating end-to-end through LLMs, our framework complements other tool-use environments bottlenecked by the complexity of real APIs, and ensures scalability and controllability by design. The framework consists of three components: top-down environment generation, which hierarchically constructs diverse, domain-grounded tool environments; environment simulation and validation, which ensures tools can be reliably emulated and filters out those that cannot; and bottom-up task and trajectory generation, which produces solvable and verifiable tasks together with multi-step trajectories, exposing control over difficulty, length, trajectory composition, and domain focus to guarantee flexibility. As a concrete instantiation, we release the dataset comprising $73{,}883$ validated tools across $6{,}800$ environments and $100$ fields, $79{,}925$ verifiable tasks as well as the pipeline to generate trajectories at scale. Training Qwen3 models of various sizes on a corpus of trajectories generated from these tasks yields gains across multiple tool-use benchmarks, including real APIs, indicating tool-use capabilities trained on synthetic data may transfer to some real environments. Together, these results suggest that SynthTools can serve as a useful infrastructure for large-scale training of tool-use agents.


2511.05550 2026-05-28 cs.SD cs.CL cs.LG

Assessing Factual Music Comprehension in Large Audio Language Models

评估大型音频语言模型中的事实音乐理解能力

Daniel Chenyu Lin, Michael Freeman, John Thickstun

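AI summary: Because the existing MusicQA dataset cannot measure the factual correctness of model answers, proposes an evaluation protocol based on verifiable information that scores models objectively with precision, recall, and F1; defines six factual retrieval tasks on three datasets and benchmarks nine recent LALMs.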

Comments 16 pages; second submission

AI中文摘要

大型音频语言模型（LALMs）利用多模态表示生成对音频自然语言查询的开放式回答。本文（1）提供经验证据表明，使用流行的MusicQA数据集评估LALMs无法衡量模型关于音乐的回答是否事实正确，（2）开发了一种新的评估LALMs音乐理解能力的协议。具体来说，我们提出一个评估协议，提示LALM提供可事实验证的信息，并将其开放式回答解析为结构化格式，使用精确率、召回率和F1分数进行客观评估。利用该协议，我们定义了一个基准测试，包含在三个不同数据集（MusicNet、Free Music Archive和OverClocked ReMix）上定义的六项事实信息检索任务。我们对九个最近的LALMs进行了基准测试，包括前沿模型如Gemini和最新的开放模型如Music Flamingo，并在https://github.com/DCL2004/LALM-Eval发布了评估脚本套件，以方便新LALMs的基准测试。

英文摘要

Large audio language models (LALMs) leverage multimodal representations to generate open-ended answers to natural language queries about audio. In this paper, we (1) provide empirical evidence that assessment of LALMs using the popular MusicQA dataset fails to measure whether a model's responses about music are factually correct, and (2) develop a new protocol for assessing the music comprehension capabilities of LALMs. Specifically, we propose an evaluation protocol that prompts an LALM for factually verifiable information, and parses its open-ended response into a structured format that can be objectively assessed using Precision, Recall, and F1 scores. Using this protocol, we define a benchmark consisting of six factual information retrieval tasks defined on three diverse datasets: MusicNet, the Free Music Archive, and OverClocked ReMix. We benchmark nine recent LALMs, including frontier models like Gemini and the latest open models like Music Flamingo, and release the suite of evaluation scripts at https://github.com/DCL2004/LALM-Eval to facilitate benchmarking of new LALMs.
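
To make the scoring step concrete, the following sketch computes Precision, Recall, and F1 once an open-ended response has been parsed into a set of items. It is only an assumption about how such scoring might look: the function `precision_recall_f1`, the set-based matching, and the instrument labels are hypothetical and are not taken from the released evaluation scripts.

```python
def precision_recall_f1(predicted: set, reference: set) -> tuple:
    """Set-based precision, recall, and F1 for one parsed response."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: instruments parsed from a model's answer
# versus the instruments listed in the dataset annotation.
reference_instruments = {"violin", "cello", "piano"}
predicted_instruments = {"violin", "piano", "flute", "harp"}

p, r, f1 = precision_recall_f1(predicted_instruments, reference_instruments)
```

Here two of the four predicted instruments are correct and two of the three reference instruments are recovered, so a verbose answer that lists many guesses is penalized through precision rather than rewarded.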


2511.02398 2026-05-28 cs.LG

A Spatially Informed Gaussian Process UCB Method for Decentralized Coverage Control

一种用于分散覆盖控制的空间信息高斯过程UCB方法

Gennaro Guidone, Luca Monegaglia, Elia Raimondi, Han Wang, Mattia Bianchi, Florian Dörfler

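AI summary: Proposes a decentralized algorithm based on the Gaussian process upper confidence bound (GP-UCB) that combines an expected locational cost with a variance-based exploration term, letting agents autonomously balance exploration and exploitation for coverage control of unknown spaces.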