arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.13591 2026-06-12 cs.AI cs.CL (updated version)

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

Affiliations: Department of Data Science and Artificial Intelligence, Hong Kong Polytechnic University; Department of Applied Mathematics, Hong Kong Polytechnic University

AI summary: Introduces DSAEval, a benchmark of 641 real-world data science problems featuring multimodal environment perception, multi-query interactions, and multi-dimensional evaluation. A systematic evaluation of 13 advanced agentic LLMs finds that Claude-Sonnet-4.5 performs best overall and that multimodal perception improves performance on vision-related tasks by 2.04% to 11.30%.

Abstract

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., image and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities, including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 13 recent advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, MiMo-V2-Pro and GPT-5.2 lead in duration and step efficiency, respectively, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions.

2508.12681 2026-06-12 cs.RO cs.LG cs.SY eess.SY (updated version)

Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory

Johann Licher, Max Bartholdt, Henrik Krauss, Tim-Lukas Habich, Thomas Seel, Moritz Schappler

Affiliations: Institute of Mechatronic Systems, Leibniz University Hannover; Department of Advanced Interdisciplinary Studies, The University of Tokyo; Institute of Assembly Technology and Robotics, Leibniz University Hannover

AI summary: Proposes a real-time nonlinear model-predictive control framework based on a domain-decoupled physics-informed neural network (DD-PINN), achieving accurate dynamic control of a soft continuum robot with end-effector position errors below 3 mm.

Comments
Submitted to IEEE Transactions on Robotics, 20 pages, 14 figures

Abstract

Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s².

2509.18085 2026-06-12 cs.LG cs.AI cs.CL (updated version)

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Christopher Lott, Fatih Porikli, Mingu Lee

Affiliations: University of Waterloo

AI summary: Proposes Spiffy, a speculative decoding algorithm for diffusion LLMs that uses calibrated draft graphs to accelerate inference while preserving the output distribution, cutting model inferences by up to 8.6× and raising the token rate by up to 6.3×.

Comments
Original version uploaded on Sep 22, 2025. (v2): Extended Table 2 with additional analysis and referenced it in Sec 5.2. (v3): Added note to Sec 4.2 and Appendix A.2 specifying conditions for losslessness. (v4): Updated with the version accepted to ICML 2026 workshops

Abstract

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm to accelerate dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation to eliminate the overheads of an independent draft model, structuring draft states in the form of a novel directed draft graph to take advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate its ability to accelerate LLaDA, Dream, and SDAR models in combination with KV caching and threshold-based dynamic unmasking, leading to up to an 8.6× reduction in model inferences and up to a 6.3× acceleration in token rate.

2601.06227 2026-06-12 cs.LG cs.AI (updated version)

When Smaller Wins: Dual-Stage Distillation and Pareto-Guided Compression of Liquid Neural Networks for Edge Battery Prognostics

Dhivya Dharshini Kannan, Wei Li, Wei Zhang, Jianbiao Wang, Zhi Wei Seh, Man-Fai Ng

Affiliations: Singapore Institute of Technology; Institute of Materials Research and Engineering; Agency for Science, Technology and Research; Institute of High Performance Computing

AI summary: Proposes DLNet, a framework that compresses a high-capacity liquid neural network into edge-deployable models through Euler discretization, dual-stage knowledge distillation, and Pareto-guided compression; in battery health prediction, the small model outperforms the large one.

Comments
Accepted at International Conference on Pattern Recognition, ICPR 2026. Code available at: https://github.com/Dhivya-DD17/DLNet

Abstract

Battery management systems increasingly require accurate battery health prognostics under strict on-device constraints. This paper presents DLNet, a practical framework with dual-stage distillation of liquid neural networks that turns a high-capacity model into compact and edge-deployable models for battery health prediction. DLNet first applies Euler discretization to reformulate liquid dynamics for embedded compatibility. It then performs dual-stage knowledge distillation to transfer the teacher model's temporal behavior and recover it after further compression. Pareto-guided selection under joint error-cost objectives retains student models that balance accuracy and efficiency. We evaluate DLNet on a widely used dataset and validate real-device feasibility on an Arduino Nano 33 BLE Sense using int8 deployment. The final deployed student achieves a low error of 0.0066 when predicting battery health over the next 100 cycles, which is 15.4% lower than the teacher model. It reduces the model size from 616 kB to 94 kB, an 84.7% reduction, and takes 21 ms per inference on the device. These results support a practical "smaller wins" observation: with proper supervision and selection, a small model can match or exceed a large teacher for edge-based prognostics. Beyond batteries, the DLNet framework can extend to other industrial analytics tasks with strict hardware constraints.
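
The Pareto-guided selection step described in this abstract can be illustrated with a minimal sketch. It assumes each student model is summarized by two numbers to minimize, a prediction error and a deployment cost; the function name `pareto_front`, the candidate names, and every error and cost value below are hypothetical and are not taken from the paper. Only the 616 kB and 94 kB sizes in the final check come from the abstract.

```python
from typing import List, Tuple

# (name, error, cost): all candidate values below are hypothetical placeholders.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate dominates (lower is better on both axes)."""
    front = []
    for name, err, cost in candidates:
        dominated = any(
            e <= err and c <= cost and (e < err or c < cost)
            for _, e, c in candidates
        )
        if not dominated:
            front.append((name, err, cost))
    return front


students = [
    ("student_a", 0.010, 300.0),  # accurate enough, but large
    ("student_b", 0.008, 100.0),  # more accurate and smaller than student_a
    ("student_c", 0.012, 40.0),   # least accurate, but smallest
    ("student_d", 0.013, 120.0),  # worse than student_b on both axes
]
front = pareto_front(students)
front_names = [name for name, _, _ in front]

# Size reduction reported in the abstract: 616 kB down to 94 kB.
reduction_percent = round((616 - 94) / 616 * 100, 1)
```

In this toy set only `student_b` and `student_c` survive, since each of the other two is beaten on both error and cost by `student_b`. The last line recomputes the size reduction from the two sizes quoted in the abstract.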

2601.03184 2026-06-12 cs.LG cs.AI (updated version)

Decentralized Autoregressive Generation

Stepan Maschan, Haoxuan Qu, Jun Liu

Affiliations: Lancaster University

AI summary: Uses the Discrete Flow Matching framework to prove that decentralized and centralized training are theoretically equivalent, and shows experimentally that decentralized training stays competitive on multimodal benchmarks.

Abstract

The decentralization of autoregressive generation has attracted considerable attention in recent years as a solution to scaling bottlenecks. However, despite promising empirical results, this paradigm currently lacks rigorous theoretical justification. In this work, we formally establish the theoretical equivalence between decentralized and centralized training. To achieve this, we adapt the Discrete Flow Matching framework for autoregressive generation, leveraging its inherent properties to demonstrate that global models naturally decompose into independent experts. Finally, we conduct extensive experiments across diverse multimodal benchmarks, empirically validating that decentralized training maintains competitive parity with standard centralized architectures.

2601.06279 2026-06-12 cs.CV (updated version)

EyeTheia: A Lightweight and Accessible Eye-Tracking Toolbox

Stevenson Pather, Niels Martignène, Arnaud Bugnet, Fouad Boutaleb, Fabien D'Hondt, Deise Santana Maia

Affiliations: Univ. Lille, Inserm, CHU Lille, U1172 - LilNCog - Lille Neuroscience & Cognition; Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL; Centre national de ressources et de résilience (CN2R)

AI summary: Proposes EyeTheia, a lightweight webcam-based eye-tracking pipeline that combines MediaPipe landmark extraction with a CNN model, reduces prediction error through user-specific fine-tuning, and agrees with a commercial solution on a Dot-Probe task.

Comments
Code for EyeTheia: https://github.com/patherstevenson/EyeTheia. Experimental platform for the cognitive neuroscience task (BAWEB IAPS): https://git.interactions-team.fr/INTERACTIONS/calypso/src/branch/main/src/medita/

Abstract

We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.

2601.04885 2026-06-12 cs.CL cs.AI cs.LG (updated version)

CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia

Affiliations: Southeast University; ByteDance Inc.; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China

AI summary: Proposes CuMA, a framework that uses demographic-aware routing to separate conflicting gradients into expert subspaces, addressing the mean collapse of dense models in multicultural alignment and achieving the best performance on benchmarks including WorldValuesBench.

Comments
ACL 2026 Main

Abstract

As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

2507.07947 2026-06-12 cs.LG cs.AI (updated version)

Reconstructing Template-Memorized Images from Natural Prompts

Sol Yarkoni, Mahmood Sharif, Roi Livni

Affiliations: School of Electrical & Computer Engineering; School of Computer Science & AI; Tel Aviv University

AI summary: Proposes a low-resource attack that exploits patterns in templated e-commerce data to reconstruct memorized training images from natural prompts, exposing privacy risks.

Abstract

Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, prior work has introduced techniques and attacks that reconstruct images, or parts of images, from training data. While these results demonstrate that training data can be recovered, existing methods often rely on high computational resources, partial access to the training set, or carefully engineered prompts. In this work, we present a new attack that requires low resources, assumes little to no access to the training data, and identifies seemingly benign prompts that can lead to potentially risky image reconstruction. We further show that such reconstructions may occur unintentionally, even for users without specialized knowledge. For example, we observe that for one existing model, the prompt "blue Unisex T-Shirt" generates the face of a real individual. Moreover, by combining the identified vulnerabilities with real-world prompt data, we discover prompts that reproduce memorized visual elements. Our approach builds on insights from prior work and leverages domain knowledge to expose a fundamental vulnerability arising from the use of scraped e-commerce data, where templated layouts and images are closely tied to pattern-like textual prompts. The code for our attack is publicly available at https://github.com/TheSolY/lr-tmi.

2601.02177 2026-06-12 cs.CV cs.CR (updated version)

Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

Oliver Custance, Saad Khan, Simon Parkinson

Affiliations: University of Cambridge

AI summary: ESP32 experiments indicate that poor multi-person gait identification performance stems mainly from the sensing-quality limits of commodity WiFi rather than from the choice of algorithm.

Abstract

WiFi Channel State Information (CSI) has shown promise for single-person gait identification, raising interest in its use for contactless biometrics, continuous authentication, and passive identification. However, the feasibility of multi-person identification on low-cost commodity devices remains unclear. A critical question is whether weak multi-person performance is primarily an algorithmic limitation, or whether it reflects a more fundamental sensing ceiling on commodity WiFi hardware. We address this question through a systematic empirical study using commodity ESP32 WiFi sensors. We evaluated six different signal separation methods (FastICA, SOBI, PCA-ICA, NMF, Wavelet, and Tensor decomposition) across seven scenarios spanning 1-10 people in both controlled and realistic indoor environments. To investigate beyond classification accuracy, we introduce three diagnostic metrics: intra-subject variability (ISV), inter-subject distinguishability (ISD), and performance degradation rate (PDR). Across all methods, performance remains moderate (39%-56% accuracy), with limited evidence that algorithmic choice alone solves the problem. The best-performing method, NMF, reaches 56% accuracy, while all methods exhibit extremely high feature-space overlap (97%-99%), unstable within-subject representations, and marked environmental sensitivity. These findings suggest that, under commodity ESP32 CSI constraints, dense multi-person gait identification is limited more by sensing quality and spatial diversity than by the chosen separation algorithm. Our results have direct implications for security and privacy: they call into question the practicality of commodity WiFi CSI as a robust multi-user biometric primitive for authentication, while also placing important bounds on the passive identification capabilities achievable with low-cost off-the-shelf WiFi hardware.

2304.13836 2026-06-12 cs.LG cs.AI cs.CV stat.ME (updated version)

On Pitfalls of RemOve-And-Retrain: Data Processing Inequality Perspective

Junhwa Song, Keumgang Cha, Junghoon Seo

Affiliations: KAIST

AI summary: Exposes flaws in the ROAR benchmark from an information-theoretic perspective: data-agnostic post-processing can raise ROAR scores, leading to misjudgment of how informative attribution maps are, and the benchmark shows a bias toward blurriness.

Comments
Accepted at the 2026 ICML Workshop on Mechanistic Interpretability

Abstract

The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, cannot add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.
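
One way to state the data processing inequality that this abstract relies on, in notation assumed here rather than taken from the paper: let f denote the model's decision function, A an attribution map computed from it, and g a post-processing step chosen without reference to the model or the data. Then f, A, and g(A) form a Markov chain, and the mutual information I cannot increase along it.

```latex
f \longrightarrow A \longrightarrow g(A)
\quad\Longrightarrow\quad
I\bigl(f;\, g(A)\bigr) \;\le\; I\bigl(f;\, A\bigr)
```

Under this reading, if g(A) earns a better ROAR score than A, the improvement cannot be credited to additional information about f.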

2509.07150 2026-06-12 cs.LG cond-mat.mtrl-sci (updated version)

PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Andy Xu, Rohan Desai, Larry Wang, Ethan Ritz, Gabriel Hope

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes PLaID++, which combines a symmetry-aware Wyckoff text representation, temperature scaling as entropy regularization, and reinforcement learning from verifiable rewards to generate stable, novel crystals that satisfy space-group properties, at a rate about 50% higher than prior methods.

Comments
Code available at https://github.com/andaero/PLaID, model weights at https://huggingface.co/HOPE-Lab-HMC/PLaID

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improve correctness in LLMs. However, in many scientific problems, the objective is not necessarily to produce the correct answer, but instead to produce a diverse array of candidates that satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation which improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer which counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a roughly 50% greater rate than prior methods and conditionally generates structures with desired space group properties. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
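
The claim that temperature scaling acts as an entropy regularizer can be checked on a toy distribution. The sketch below is a generic illustration, not the PLaID++ training code: the logits are hypothetical, and `softmax` and `entropy` are helper names introduced here. It shows that dividing logits by a larger temperature spreads probability mass more evenly and so raises the Shannon entropy of the sampling distribution.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token logits with one strongly preferred option.
logits = [4.0, 1.0, 0.5, 0.0]

h_low = entropy(softmax(logits, temperature=0.5))
h_base = entropy(softmax(logits, temperature=1.0))
h_high = entropy(softmax(logits, temperature=2.0))
```

Entropy rises with temperature here and stays below the uniform-distribution limit of log 4 nats, which is the sense in which a higher temperature counteracts collapse onto a single mode.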

2511.23030 2026-06-12 cs.RO cs.CV (updated version)

DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

Casimir Feldmann, Maximum Wilder-Smith, Vaishakh Patil, Michael Oechsle, Michael Niemeyer, Keisuke Tateno, Marco Hutter

Affiliations: Robotic Systems Lab, ETH Zurich; Google

AI summary: Proposes DiskChunGS, which partitions the scene into spatial chunks and stores inactive regions on disk to overcome GPU memory limits, enabling large-scale 3D Gaussian SLAM; it reconstructs full sequences on multiple datasets with improved visual quality.

Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 4, 2026

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
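
The out-of-core idea in this abstract, spatial chunks with only the active ones kept in GPU memory, can be sketched in a few lines. This is a toy model under stated assumptions, not the DiskChunGS implementation: chunks are axis-aligned cubes, "active" means within a fixed chunk radius of the camera, and the `gpu` and `disk` tiers are plain dictionaries standing in for real memory and storage. All names (`chunk_key`, `active_keys`, `ChunkStore`) are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]
ChunkKey = Tuple[int, int, int]


def chunk_key(point: Point, chunk_size: float) -> ChunkKey:
    """Map a 3D position to the integer index of the cubic chunk containing it."""
    x, y, z = point
    return (
        math.floor(x / chunk_size),
        math.floor(y / chunk_size),
        math.floor(z / chunk_size),
    )


def active_keys(camera: Point, chunk_size: float, radius: int = 1) -> Set[ChunkKey]:
    """Chunks within `radius` chunks of the camera along each axis."""
    cx, cy, cz = chunk_key(camera, chunk_size)
    span = range(-radius, radius + 1)
    return {(cx + dx, cy + dy, cz + dz) for dx in span for dy in span for dz in span}


class ChunkStore:
    """Toy two-tier store: dicts stand in for GPU memory and disk."""

    def __init__(self, chunk_size: float) -> None:
        self.chunk_size = chunk_size
        self.gpu: Dict[ChunkKey, List[Point]] = {}
        self.disk: Dict[ChunkKey, List[Point]] = {}

    def insert(self, point: Point) -> None:
        """Append a point to its chunk, in whichever tier currently holds that chunk."""
        key = chunk_key(point, self.chunk_size)
        tier = self.disk if key in self.disk else self.gpu
        tier.setdefault(key, []).append(point)

    def update(self, camera: Point, radius: int = 1) -> None:
        """Evict chunks that left the active set, then load the ones that entered it."""
        wanted = active_keys(camera, self.chunk_size, radius)
        for key in list(self.gpu):
            if key not in wanted:
                self.disk[key] = self.gpu.pop(key)
        for key in wanted:
            if key in self.disk:
                self.gpu[key] = self.disk.pop(key)


store = ChunkStore(chunk_size=10.0)
for p in [(1.0, 1.0, 1.0), (55.0, 0.0, 0.0), (-3.0, 0.0, 0.0)]:
    store.insert(p)

store.update(camera=(0.0, 0.0, 0.0))
near_origin = set(store.gpu)

store.update(camera=(52.0, 0.0, 0.0))
after_move = set(store.gpu)
```

With the camera at the origin, the two chunks around it stay resident and the distant one is evicted; after the camera moves, the roles swap. A real system would serialize chunk contents to files and bound the resident set by memory budget rather than by a fixed radius.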

2510.16928 2026-06-12 cs.CL (updated version)

ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

Emily Chang, Niyati Bafna

Affiliations: Toyota Technological Institute at Chicago; Johns Hopkins University, Center for Language and Speech Processing

AI summary: To address existing benchmarks' limited language coverage and focus on higher-order tasks, proposes ChiKhaPo, a benchmark of 8 subtasks covering 2700+ languages that evaluates LLM lexical comprehension and generation; 6 SOTA models struggle on it.

Abstract

Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

2511.19652 2026-06-12 cs.CV (updated version)

Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai

Affiliations: Department of Biomedical Informatics, Harvard Medical School; Department of Pathology, Massachusetts General Hospital; Department of Pathology and Laboratory Medicine, Brown University

AI summary: Proposes GIANT, a training-free method that lets general-purpose multimodal models navigate WSIs autonomously by iteratively selecting multi-magnification crops and aggregating evidence, reaching state-of-the-art results on four of the five MultiPathQA benchmarks.

Abstract

Recent advances in large multimodal models have allowed for the development of interactive chat models that can converse and reason about pathology whole-slide images (WSIs). However, existing slide-level chat systems are often highly specialized, typically compressing WSIs into fixed slide-level embeddings or relying on multi-component pipelines, which can lose multi-scale detail and limit generalizability beyond the target task. We present GIANT (Gigapixel Image Agent for Navigating Tissue), a simple, training-free approach that lets general-purpose multimodal models navigate WSIs on their own, iteratively selecting multi-magnification crops and aggregating evidence over time. To evaluate generalizability in WSI question answering and to promote reproducibility, we introduce MultiPathQA, a benchmark suite spanning five clinical challenges and 934 questions over 868 unique WSIs. This includes a new set of 128 pathologist-authored multiple-choice questions designed to mirror real diagnostic search and multi-scale reasoning. Using GPT-5, GIANT outperforms models specialized for pathology question answering, achieving state-of-the-art performance on four out of five benchmarks.

2511.17221 2026-06-12 cs.CV cs.RO (updated version)

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand

Affiliations: Chalmers University of Technology; Zenseact

AI summary: Proposes QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly from 4D spatio-temporal queries across adjacent frames, supervised by vision foundation models or lidar data; a contractive scene representation enables long-range supervision under constant memory, and semantic RayIoU improves by 26% on the Occ3D-nuScenes benchmark.

Abstract

Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. Project page: https://research.zenseact.com/publications/queryocc/
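
A contractive scene representation of the kind this abstract mentions can be illustrated with a well-known example. The sketch below uses the contraction popularized by mip-NeRF 360 as a stand-in; it is an assumption that QueryOcc's mapping behaves similarly, and the paper's exact function may differ. Points inside the unit ball are left unchanged, and everything outside is squeezed into a shell of radius less than 2, so an unbounded scene fits in a bounded volume. The function name `contract` is hypothetical, and coordinates are assumed to be pre-scaled so that the near field lies within unit distance.

```python
import math
from typing import Sequence, Tuple


def contract(point: Sequence[float]) -> Tuple[float, ...]:
    """Identity inside the unit ball; outside, map radius r to 2 - 1/r along the same direction."""
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return tuple(float(c) for c in point)
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(scale * c for c in point)


near = contract((0.5, 0.0, 0.0))       # inside the unit ball: unchanged
mid = contract((4.0, 0.0, 0.0))        # radius 4 maps to 2 - 1/4
far = contract((1.0e6, 0.0, 0.0))      # very distant point: radius approaches 2
```

The mapping is continuous at the unit sphere and monotone in radius, which matches the abstract's description of preserving near-field detail while smoothly compressing distant regions.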

2511.04260 2026-06-12 cs.CV cs.AI version update

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet:面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

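Affiliations: Department of Mathematics and Computer Science, University of Catania

AI summary: Proposes Proto-LeakNet, which exploits the signal-leak traces left by diffusion models and combines closed-set classification with density-based open-set evaluation for interpretable generator attribution; trained only on closed-set data, it remains effective on unseen generators.

Details
Comments
44 pages, 27 figures, 11 tables
AI Chinese abstract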

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明,扩散管道会在其输出中无意中留下持久的统计痕迹,称为信号泄漏,特别是在潜在表示中。基于这一观察,我们提出了Proto-LeakNet,一个信号泄漏感知且可解释的归因框架,它将闭集分类与基于密度的开集评估相结合,对学习到的嵌入进行开集评估,从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域,重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征,而一个特征加权原型头则结构化嵌入空间并实现透明的归因。仅在闭集数据上训练并达到98.13%的宏AUC,Proto-LeakNet学习到的潜在几何结构在后处理下保持鲁棒,超越了最先进的方法,并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在以下链接获取:this https URL。

English abstract

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed-set data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.

2511.11022 2026-06-12 cs.RO version update

Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

用于验证多智能体协同自动驾驶的微型测试平台

Hyunchul Bae, Eunjae Lee, Jehyeop Han, Minhee Kang, Jaehyeon Kim, Junggeun Seo, Minkyun Noh, Heejin Ahn

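Affiliations: School of Electrical Engineering; School of Mechanical Engineering; Korea Advanced Institute of Science and Technology

AI summary: Proposes CIVAT, a miniature testbed that integrates V2V/V2I communication with the ROS2 framework and validates cooperative autonomous driving functions through infrastructure-based perception and intersection management experiments.

Details
Comments
Accepted by ICRA 2026, 8 pages
AI Chinese abstract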

协同自动驾驶通过实现车辆与智能路侧基础设施之间的实时协作来扩展车辆自主性,仍然是一个具有挑战性但至关重要的问题。然而,现有的测试平台均未采用配备感知、边缘计算和通信能力的智能基础设施。为填补这一空白,我们设计并实现了一个1:15比例的微型测试平台CIVAT,用于验证协同自动驾驶,该平台包括一个缩小的城市地图、配备车载传感器的自动驾驶车辆以及智能基础设施。所提出的测试平台通过共享Wi-Fi和ROS2框架,以发布-订阅模式集成V2V和V2I通信,实现车辆与基础设施之间的信息交换,从而达成协同驾驶功能。作为案例研究,我们通过基于基础设施的感知和交叉口管理实验验证了该系统。

English abstract

Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.
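
The publish-subscribe pattern mentioned above can be illustrated with a toy in-process message bus. This is a sketch of the pattern only, not the ROS2 API and not CIVAT's implementation; the `Bus` class, the topic name, and the message fields are hypothetical.

```python
from collections import defaultdict


class Bus:
    """Toy publish-subscribe bus: publishers and subscribers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # Register a callback that is invoked for every message on the topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every current subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(message)


bus = Bus()
received = {"car_1": [], "car_2": []}

# Two vehicles listen to the same (hypothetical) infrastructure perception topic.
bus.subscribe("/infrastructure/detections", received["car_1"].append)
bus.subscribe("/infrastructure/detections", received["car_2"].append)

# A roadside unit publishes one detection; it does not need to know who listens.
bus.publish("/infrastructure/detections", {"id": 7, "x": 1.2, "y": 0.4})
```

The decoupling is the point of the pattern: vehicles can be added or removed without changing the infrastructure node, which suits a testbed where the number of participants varies between experiments.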

2503.10919 2026-06-12 cs.RO cs.SY eess.SY nlin.PS version update

Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds

基于绝热谱子流形的数据驱动软体机器人控制

Roshan S. Kaundinya, John Irvin Alora, Jonas G. Matt, Luis A. Pabon, Marco Pavone, George Haller

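Affiliations: Institute for Mechanical Systems, ETH Zürich; Autonomous Systems Lab, Stanford University; Automatic Control Laboratory, ETH Zürich

AI summary: To address the difficulty of controlling soft robots in nonlinear regimes, proposes a model-predictive control strategy based on adiabatic spectral submanifolds (aSSMs) that builds low-dimensional attracting manifolds from data and achieves high-accuracy trajectory tracking, with performance gains of up to 10x.

Details
Comments
41 pages, 24 figures, IJRR (2026) in press
AI Chinese abstract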

软体机器人的机械复杂性给基于模型的控制带来了重大挑战。具体而言,线性数据驱动模型难以在探索具有显著非线性行为的复杂空间扩展路径上控制软体机器人。为了解释这些非线性,我们基于最新的绝热谱子流形(aSSM)理论开发了一种模型预测控制策略。该理论适用是因为重度阻尼机器人的内部振动衰减速度远快于机器人沿预定路径的期望速度。在这种情况下,低维吸引不变流形(aSSM)从路径发出并承载机器人的主导动力学。借助这一最新理论,我们仅从数据出发设计了一种基于aSSM的模型预测控制方案。我们展示了数据驱动模型在跨不同任务跟踪动态轨迹方面的有效性。我们在软体躯干机器人和基于Cosserat杆的弹性软臂的高保真、高维有限元模型上进行了验证,额外实验确认了即使在存在实验噪声的情况下也具有鲁棒性能。值得注意的是,我们发现五维或六维aSSM简化模型在所有闭环控制任务中的跟踪性能比其他数据驱动建模方法高出最多10倍。

English abstract

The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate the effectiveness of our data-driven model in tracking dynamic trajectories across diverse tasks. We validate on high-fidelity, high-dimensional finite-element models of a soft trunk robot and Cosserat-rod-based elastic soft arms, with additional experiments confirming robust performance even in the presence of experimental noise. Notably, we find that five- or six-dimensional aSSM-reduced models improve on the tracking performance of other data-driven modeling methods by a factor of up to 10 across all closed-loop control tasks.

2504.21561 2026-06-12 cs.CV version update

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

通过逐步偏好调优的多模态智能体迭代工具使用探索

Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

Affiliations: Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; State Key Laboratory of General Artificial Intelligence, BIGAI; State Key Laboratory of General Artificial Intelligence, Peking University; Harbin Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; Department of Automation, Tsinghua University

AI summary: Proposes SPORT, an iterative loop of task synthesis, step sampling, step verification, and preference tuning that lets multimodal agents explore and refine tool-usage strategies on their own without pre-collected data, improving results on the GTA and GAIA benchmarks by 6.41% and 3.64%, respectively.

Details
Comments
24 pages
AI Chinese abstract

多模态智能体将控制器(例如视觉语言模型)与外部工具集成,在解决复杂多模态任务方面展现了卓越的能力。现有训练这些智能体的方法,包括监督微调和强化学习,都依赖于大量人工标注的任务-答案对和工具轨迹。然而,对于复杂多模态任务,此类标注成本过高或难以实现。本文提出一种无需任何预收集数据的多模态智能体迭代工具使用探索方法,即SPORT,通过逐步偏好优化来改进工具使用轨迹。我们的方法使多模态智能体能够通过自我探索和优化自主发现有效的工具使用策略,消除了人工标注的瓶颈。SPORT包含四个迭代组件:任务合成、步骤采样、步骤验证和偏好调优。我们首先使用语言模型合成多模态任务。然后,我们引入一种新颖的轨迹探索方案,其中步骤采样和步骤验证交替执行以解决合成任务。在步骤采样中,智能体尝试不同的工具并获取相应结果。在步骤验证中,我们使用验证器提供AI反馈以构建逐步偏好数据。该数据随后通过偏好调优用于更新控制器的工具使用,生成SPORT智能体。通过与真实环境交互,SPORT智能体逐渐演化为更精细和更有能力的系统。在GTA和GAIA基准上的评估显示,SPORT智能体分别实现了6.41%和3.64%的提升,突显了我们方法的泛化性和有效性。项目页面见该URL。

English abstract

Multimodal agents, which integrate a controller (e.g., a vision-language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose an iterative tool usage exploration method for multimodal agents without any pre-collected data, namely SPORT, via step-wise preference optimization to refine the trajectories of tool usage. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, respectively, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.
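
The step verification stage can be pictured as turning several sampled candidate steps into one preference pair. The sketch below is a guess at the general shape under stated assumptions, not SPORT's actual procedure: the verifier is assumed to return a numeric score, the best-versus-worst pairing rule is an assumption, and all names (`build_preference_pair`, the tool names, the score table) are hypothetical.

```python
def build_preference_pair(context, candidates, verifier):
    """Build one step-wise preference record from sampled candidate steps.

    `verifier(context, candidate)` is assumed to return a numeric score
    (for example, AI feedback mapped to a number). Returns None when the
    verifier cannot separate the candidates.
    """
    scored = [(verifier(context, candidate), candidate) for candidate in candidates]
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    if best[0] == worst[0]:
        return None  # no preference signal at this step
    return {"prompt": context, "chosen": best[1], "rejected": worst[1]}


# Toy verifier backed by a fixed score table, standing in for AI feedback.
scores = {"call_ocr": 0.9, "call_search": 0.2, "answer_directly": 0.5}
pair = build_preference_pair(
    "task: read the sign in the image",
    list(scores),
    lambda context, step: scores[step],
)
```

Records of this prompt/chosen/rejected form are the usual input to preference-tuning objectives, which is why the pair is built per step rather than per full trajectory.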

2510.16380 2026-06-12 cs.CL cs.AI cs.CY cs.HC cs.LG version update

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

MoReBench:评估语言模型中的程序性和多元道德推理,超越结果

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Downey, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

Affiliations: University of Washington; New York University; Scale AI; Harvard University; University of Michigan; UNC Chapel Hill; Center for AI Safety; Stanford University; MIT; University of Oxford

AI summary: Proposes the MoReBench benchmark, with 1,000 moral scenarios and more than 23,000 rubric criteria, to evaluate procedural reasoning in language models' moral reasoning; finds that existing benchmarks fail to predict model performance and that models favor particular moral frameworks.

Details
Comments
46 pages, 8 figures, 10 tables. Published in ICLR 2026. Accepted at CHAI workshop and SPP 2026 (non-archival)
AI Chinese abstract

随着人工智能系统的进步,我们越来越依赖它们与我们共同或代替我们做出决策。为了确保这些决策符合人类价值观,我们不仅需要理解它们做出了什么决策,还需要理解它们如何得出这些决策。推理语言模型能够提供最终响应和(部分透明的)中间思考轨迹,这为研究AI的程序性推理提供了及时的机会。与通常有客观正确答案的数学和代码问题不同,道德困境是过程导向评估的绝佳测试平台,因为它们允许多种可辩护的结论。为此,我们提出了MoReBench:包含1000个道德场景,每个场景配有一组专家认为在推理该场景时必须包含(或避免)的评分标准。MoReBench包含超过2.3万条标准,包括识别道德考量、权衡利弊以及给出可操作的建议,覆盖了AI为人类道德决策提供建议以及自主做出道德决策的情况。此外,我们整理了MoReBench-Theory:150个示例,用于测试AI是否能在规范伦理学的五个主要框架下进行推理。我们的结果表明,规模定律以及现有的数学、代码和科学推理任务基准无法预测模型进行道德推理的能力。模型还显示出对特定道德框架(例如边沁式的行为功利主义和康德义务论)的偏好,这可能是流行训练范式的副作用。这些基准共同推动了面向过程推理的评估,以实现更安全、更透明的AI。

English abstract

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases in which AI advises humans on moral decisions as well as cases in which it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

2510.16311 2026-06-12 cs.LG version update

Toward General Digraph Contrastive Learning: A Dual Spatial Perspective

面向一般有向图对比学习:双空间视角

Zhengyu Wu, Daohan Su, Yang Zhang, Xunkai Li, Rong-Hua Li, Guoren Wang

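Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes the S2-DiGCL framework, which performs contrastive learning on directed graphs from a dual spatial perspective of the complex and real domains; with adaptive modulation of the magnetic Laplacian and path-based subgraph augmentation, it improves node classification and link prediction by 4.41% and 4.34%, respectively.

Details
AI Chinese abstract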

图对比学习(GCL)已成为一种从图中提取一致表示而无需标签信息的强大工具。然而,现有方法主要关注无向图,忽略了在实际网络(如社交网络和推荐系统)中基础且不可或缺的关键方向信息。本文提出了S2-DiGCL,一种新颖的框架,强调从复杂域和实数域视角对有向图进行对比学习的空间洞察。从复数域视角,S2-DiGCL在磁拉普拉斯中引入个性化扰动,以自适应地调制边相位和方向语义。从实数域视角,它采用基于路径的子图增强策略,捕捉细粒度的局部不对称性和拓扑依赖性。通过联合利用这两个互补的空间视图,S2-DiGCL构建了高质量的正负样本,从而实现更通用和鲁棒的有向图对比学习。在7个真实有向图数据集上的大量实验证明了我们方法的优越性,在监督和无监督设置下,节点分类和链接预测分别实现了4.41%和4.34%的性能提升,达到了最先进水平。

English abstract

Graph Contrastive Learning (GCL) has emerged as a powerful tool for extracting consistent representations from graphs, independent of labeled information. However, existing methods predominantly focus on undirected graphs, disregarding the pivotal directional information that is fundamental and indispensable in real-world networks (e.g., social networks and recommendations). In this paper, we introduce S2-DiGCL, a novel framework that emphasizes spatial insights from complex- and real-domain perspectives for directed graph (digraph) contrastive learning. From the complex-domain perspective, S2-DiGCL introduces personalized perturbations into the magnetic Laplacian to adaptively modulate edge phases and directional semantics. From the real-domain perspective, it employs a path-based subgraph augmentation strategy to capture fine-grained local asymmetries and topological dependencies. By jointly leveraging these two complementary spatial views, S2-DiGCL constructs high-quality positive and negative samples, leading to more general and robust digraph contrastive learning. Extensive experiments on 7 real-world digraph datasets demonstrate the superiority of our approach, achieving SOTA performance with 4.41% improvement in node classification and 4.34% in link prediction under both supervised and unsupervised settings.
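
For orientation, the standard unnormalized magnetic Laplacian of a digraph with adjacency matrix A is L = D_s - A_s ⊙ exp(iΘ), where A_s = (A + Aᵀ)/2, D_s is the degree matrix of A_s, and Θ = 2πq(A - Aᵀ). The sketch below computes it with a single global charge parameter `q`; the personalized perturbation that S2-DiGCL applies is not described in the abstract and is not reproduced here, and the function name is hypothetical.

```python
import cmath
import math


def magnetic_laplacian(adj, q=0.25):
    """Unnormalized magnetic Laplacian of a digraph given as a dense 0/1 matrix.

    Edge direction is encoded in the complex phase of the off-diagonal
    entries, so the result is Hermitian even though `adj` is not symmetric.
    """
    n = len(adj)
    lap = [[0j] * n for _ in range(n)]
    for u in range(n):
        degree = 0.0
        for v in range(n):
            sym = 0.5 * (adj[u][v] + adj[v][u])                   # symmetrized weight
            theta = 2.0 * math.pi * q * (adj[u][v] - adj[v][u])   # signed phase
            degree += sym
            lap[u][v] = -sym * cmath.exp(1j * theta)
        lap[u][u] += degree  # add the symmetrized degree on the diagonal
    return lap


# A single directed edge 0 -> 1: with q = 0.25 the phase is +pi/2 in one
# direction and -pi/2 in the other.
L = magnetic_laplacian([[0, 1], [0, 0]], q=0.25)
```

Changing `q`, or perturbing the phase per edge, changes how strongly direction is expressed while leaving the symmetrized magnitudes untouched, which gives an intuition for why the phase is a natural place to inject augmentations.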

2510.05430 2026-06-12 cs.RO version update

Active Semantic Perception

主动语义感知

Huayi Tang, Pratik Chaudhari

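Affiliations: General Robotics, Automation, Sensing and Perception (GRASP) Laboratory

AI summary: Proposes an active semantic perception method based on a compact multi-layer scene graph and large language models for efficient exploration of unknown environments; experiments in simulation and on a real robot show that it outperforms existing methods.

Details
AI Chinese abstract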

我们开发了一种主动语义感知方法,该方法利用场景的语义进行探索等任务。我们构建了一个紧凑的多层场景图,能够以不同抽象级别表示大型复杂室内环境,例如对应于房间、物体、墙壁、窗户等的节点,以及它们几何结构的细粒度细节。我们基于大语言模型(LLM)开发了一个程序,用于采样与场景部分观测一致的未观测区域的新可能场景图。我们开发了一个程序,用于计算潜在航点在该场景图上的信息增益,以实现复杂的空间推理:例如,从客厅出去的两扇门中,一扇可能通向厨房,另一扇通向卧室。我们在仿真中的逼真3D室内公寓以及现实世界中的Unitree Go 2机器人上评估了我们的方法。定性和定量分析表明,我们的方法能够比现有方法更快、更准确地确定环境中高层和低层的语义信息。

English abstract

We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows, etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample new plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. We develop a procedure to compute the information gain of a potential waypoint upon this scene graph to enable sophisticated spatial reasoning: for example, of the two doors that lead out of the living room, one probably leads to the kitchen and the other to the bedroom. We evaluate our approach in realistic 3D indoor apartments in simulation and also on a Unitree Go 2 robot in the real world. Qualitative and quantitative analysis shows that our approach can pin down high-level and low-level semantic information in the environment more quickly and more accurately than existing approaches.

2503.06573 2026-06-12 cs.CL cs.AI version update

WildIFEval: Instruction Following in the Wild

WildIFEval: 野外指令遵循

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

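Affiliations: The Hebrew University of Jerusalem; IBM Research

AI summary: Proposes WildIFEval, a dataset of 7K real user instructions with multiple constraints, for evaluating the instruction-following ability of LLMs; finds that all models still have substantial room for improvement.

Details
Comments
Accepted to the 5th Workshop on Generation, Evaluation and Metrics (GEM) at ACL 2026
AI Chinese abstract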

最近的LLMs在遵循用户指令方面取得了显著成功,但处理具有多个约束的指令仍然是一个重大挑战。在这项工作中,我们引入了WildIFEval——一个包含7K条真实用户指令的大规模数据集,这些指令具有多样化的多约束条件。与以往的数据集不同,我们的收集涵盖了广泛的词汇和主题约束范围,这些约束是从自然用户指令中提取的。我们将这些约束分为八个高级类别,以捕捉它们在现实场景中的分布和动态。利用WildIFEval,我们进行了大量实验来评估领先LLMs的指令遵循能力。WildIFEval清晰地区分了小型和大型模型,并表明所有模型在此类任务上仍有很大的改进空间。我们分析了约束数量和类型对性能的影响,揭示了模型约束遵循行为的有趣模式。我们发布数据集以促进在复杂现实条件下指令遵循的进一步研究。

English abstract

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.

2510.03896 2026-06-12 cs.CV cs.RO version update

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

GAE: 利用可泛化动作专家释放VLM的物理潜力

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

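Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes the Generalizable Action Expert (GAE), which turns a VLM's high-level intent into continuous action trajectories through a sparse geometric interface; an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme decouples action dynamics from geometry grounding, yielding strong generalization across visual domains, viewpoints, and instructions.

Details
AI Chinese abstract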

视觉语言模型展示了强大的推理和规划能力,但将这些预测转化为精确的机器人动作仍是一个核心挑战。现有的视觉-语言-动作方法通常将推理和动作生成纠缠在一起,导致泛化能力有限。我们提出了通用动作专家(GAE),一个任务无关的模型,将稀疏几何规划转化为密集的机器人动作。我们的方法引入了一个稀疏几何接口:VLM预测代表高层意图的稀疏3D路点,而GAE将这些路点与实时点云观测一起映射到连续动作轨迹。GAE在一个包含来自仿真和真实世界机器人的15万条轨迹的大规模点云-轨迹数据集上进行预训练。为了进一步提高效率和泛化能力,我们引入了动作预训练-点云微调(APPF)方案,将学习动作动力学与几何基础解耦。预训练后,GAE被冻结并在下游任务中重用,只需对VLM进行轻量级微调以生成稀疏接口。实验表明,我们的方法在多样化的视觉域、相机视角和自然语言指令下实现了强大的性能和泛化能力。

English abstract

Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

2505.20076 2026-06-12 cs.LG version update

ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

ExPLAIND:统一模型、数据和训练归因以研究模型行为

Florian Eichin, Yupei Du, Philipp Mondorf, Maria Matveev, Barbara Plank, Michael A. Hedderich

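Affiliations: University of Michigan

AI summary: Proposes ExPLAIND, a framework that unifies attribution to model components, data, and the training trajectory and supports explanations across granularities; it derives parameter-wise and step-wise influence scores from gradient path kernels and a kernel-machine view of AdamW, and examines Grokking in a Transformer and a two-phase dynamic in EuroLLM pretraining.

Details
Comments
published at ICML 2026, code at https://github.com/mainlp/explaind
AI Chinese abstract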

事后可解释性方法通常将模型行为归因于其组件、数据或训练轨迹中的某一个,并且往往局限于局部到全局谱中的特定粒度。这导致解释缺乏统一视角,可能遗漏关键交互。我们提出了ExPLAIND,一个理论扎实的统一框架,它整合了模型组件、数据和训练轨迹,同时支持跨粒度的解释。我们推广了最近关于梯度路径核的工作,将AdamW训练的模型重新表述为核机器。从得到的核特征图中,我们推导出新的参数级和步骤级影响分数。我们在多种设置下实证验证了模型行为的分解结果,并将ExPLAIND应用于两个案例研究。我们对一个表现出Grokking现象的Transformer的发现支持了先前提出的学习阶段,同时将最后阶段细化为外层在记忆后围绕一个表示管道对齐的阶段。对于EuroLLM预训练,ExPLAIND揭示了一个两阶段动态:第一阶段以外部MLP学习为特征,第二阶段以中间注意力层的相对影响增加为特征。这些结果确立了ExPLAIND作为解释模型行为和训练动态的统一框架。

English abstract

Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation, and are often tied to a particular level of granularity along the local-to-global spectrum. This leads to explanations that lack a unified view and may miss key interactions. We present ExPLAIND, a theoretically grounded, unified framework that integrates model components, data, and training trajectory while supporting explanations across granularities. We generalize recent work on gradient path kernels, reformulating models trained by AdamW as kernel machines. From the resulting kernel feature maps, we derive novel parameter-wise and step-wise influence scores. We empirically validate the resulting decomposition of model behavior in several settings and apply ExPLAIND to two case studies. Our findings on a Transformer exhibiting Grokking support previously proposed learning phases, while refining the final phase as one in which outer layers align around a representation pipeline learned after memorization. For EuroLLM pretraining, ExPLAIND reveals a two-phase dynamic, with the first characterized by outer-layer MLP learning and the second by increased relative influence of intermediate attention layers. These results establish ExPLAIND as a unified framework for interpreting model behavior and training dynamics.

2509.22050 2026-06-12 cs.LG version update

BrainPro: Towards Large-scale Brain State-aware EEG Representation Learning

BrainPro:迈向大规模脑状态感知的脑电图表征学习

Yi Ding, Muyun Jiang, Weibang Jiang, Shuailei Zhang, Xinliang Zhou, Chenyu Liu, Shanglin Li, Yong Li, Cuntai Guan

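Affiliations: Nanyang Technological University; Shanghai Jiao Tong University; Advanced Telecommunications Research Institute International; Southeast University

AI summary: Proposes BrainPro, a model that learns shared and state-specific representations through retrieval-based spatial alignment and a brain state-decoupling module, achieving the best performance on nine public BCI datasets.

Details
Comments
31 pages, 11 figures
AI Chinese abstract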

脑电图(EEG)反映了潜在的脑状态,其活动分布在大脑区域并表现为头皮上的空间模式。学习这些空间结构化的、与状态相关的模式需要跨数据集的一致空间表征。然而,现有的EEG基础模型通常基于自注意力机制,该机制不保留位置特定信息,并且难以对齐不同通道配置记录的信号。此外,脑状态包含共享和状态特定的区域活动,这表明学习神经生理学上合理的、状态感知的表征可以补充当前模型所针对的共享表征,并改善下游解码。为了解决这些局限性,我们提出了BrainPro,一个大型EEG模型,它结合了基于检索的空间学习机制用于跨布局空间对齐,以及一个脑状态解耦模块,通过并行编码器和区域感知重建学习共享和状态特定表征。在大型EEG语料库上预训练后,BrainPro在跨越情感、运动、语音、压力、精神疾病和注意力任务的九个公共BCI数据集上实现了最先进的性能。对空间滤波器、通道丢失鲁棒性和编码器贡献的分析进一步验证了其空间对齐和状态感知路径的有效性。这些结果表明,BrainPro实现了学习空间模式的更好可解释性,并产生了有益于多种EEG解码任务的表征。

English abstract

Electroencephalography (EEG) reflects underlying brain states, whose activities are distributed across brain regions and manifest as spatial patterns on the scalp. Learning these spatially structured, state-related patterns requires consistent spatial representations across datasets. However, existing EEG foundation models are typically based on self-attention, which does not preserve location-specific information and struggles to align signals recorded with different channel configurations. Moreover, brain states contain both shared and state-specific regional activity, suggesting that learning neurophysiologically plausible, state-aware representations can complement the shared representations targeted by current models and improve downstream decoding. To address these limitations, we propose BrainPro, a large EEG model that combines a retrieval-based spatial learning mechanism for cross-layout spatial alignment with a brain state-decoupling module that learns both shared and state-specific representations through parallel encoders and region-aware reconstruction. Pre-trained on a large EEG corpus, BrainPro achieves state-of-the-art performance across nine public BCI datasets spanning emotion, motor, speech, stress, mental disease, and attention tasks. Analyses of spatial filters, channel-drop robustness, and encoder contributions further validate the effectiveness of its spatial alignment and state-aware pathways. These results show that BrainPro achieves improved interpretability of learned spatial patterns and produces representations that benefit diverse EEG decoding tasks.

2509.21398 2026-06-12 cs.CV eess.IV version update

Skeleton Sparsification and Densification Scale-Spaces

骨架稀疏化和致密化尺度空间

Julia Gierke, Pascal Peter

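Affiliations: Mathematical Image Analysis Group, Saarland University; Department of Mathematics and Computer Science, Saarland University

AI summary: Proposes skeletonisation scale-spaces, which sparsify the medial axis to simplify shapes hierarchically, and introduces densification as the inverse, coarse-to-fine process, with applications to robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

Details
AI Chinese abstract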

Hamilton-Jacobi骨架,也称为中轴,是一种强大的形状描述符,它根据最大内切圆的中心来表示二值对象。尽管应用广泛,但中轴对噪声敏感:微小的边界变化可能导致骨架不成比例地扩大和产生不必要的分支。经典的剪枝方法通过系统地移除多余的骨架分支来缓解这一缺陷。这种骨架的顺序简化类似于稀疏化尺度空间的原理,该空间将图像嵌入到从越来越稀疏的像素表示重建的族中。我们通过引入骨架化尺度空间将两者结合起来:它们利用中轴的稀疏化来实现形状的层次简化。与传统的剪枝不同,我们的框架固有地满足关键的尺度空间特性,如层次结构、可控简化和对几何变换的等变性。我们在连续和离散公式中提供了严格的理论基础,并通过致密化进一步扩展了这一概念。通过逐步增长骨架而不是收缩它,我们允许从粗到细尺度的逆过程。致密化尺度空间甚至可以超越原始骨架,产生与实际问题相关的过完备形状表示。通过概念验证实验,我们展示了我们的框架在实际任务中的有效性,包括鲁棒骨架化、形状压缩和增材制造的刚度增强。

English abstract

The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: Minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. By growing the skeleton successively instead of shrinking it, we allow inverse progression from coarse to fine scales. Densification scale-spaces can even reach beyond the original skeleton to produce overcomplete shape representations with relevancy for practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

2509.19526 2026-06-12 cs.LG cs.SY eess.SY version update

Metriplectic Conditional Flow Matching for Dissipative Dynamics

度量辛条件流匹配用于耗散动力学

Ali Baheri, Lars Lindemann

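Affiliations: Rochester Institute of Technology, Rochester, NY, USA; Automatic Control Laboratory, ETH Zürich, Switzerland

AI summary: Proposes metriplectic conditional flow matching (MCFM), which learns dissipative dynamics by building the conservative-dissipative split into both the vector field and a structure-preserving sampler, guaranteeing monotonically decreasing energy and long-horizon stability.

Details
AI Chinese abstract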

度量辛条件流匹配(MCFM)在不违反第一原理的情况下学习耗散动力学。神经替代模型常常注入能量并破坏长期推演的稳定性;MCFM 则将保守-耗散分解同时融入向量场和结构保持采样器。MCFM 通过短时间过渡上的条件流匹配进行训练,避免了长时间推演伴随的梯度计算。在推理时,Strang-prox 方案交替进行辛更新和近端度量步骤,确保离散能量衰减;当有可信能量可用时,可选投影强制严格衰减。我们提供了连续和离散时间保证,将该参数化和采样器与守恒、单调耗散和稳定推演联系起来。在一个受控机械基准上,MCFM 产生的相图更接近真实情况,并且与同等表达能力的无约束神经流相比,能量增加和正能量率事件显著减少,同时匹配终端分布拟合。

English abstract

Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure-preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. At inference, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous- and discrete-time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.
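
The idea of alternating a symplectic update with a proximal dissipative step can be shown on a hand-written toy system. This is an illustration under stated assumptions, not the paper's sampler: the energy H = (q² + p²)/2, the friction law dp/dt = -γp, and all names are chosen here for the example, and nothing is learned. The conservative part uses the exact harmonic-oscillator flow (a rotation, which is symplectic and preserves H), and the dissipative part uses implicit Euler, which coincides with the proximal map of (γ/2)p².

```python
import math


def energy(q, p):
    """Toy energy H = (q^2 + p^2) / 2 of a unit harmonic oscillator."""
    return 0.5 * (q * q + p * p)


def strang_prox_step(q, p, h, gamma):
    """One Strang-split step: half dissipation, full rotation, half dissipation."""
    # Half step of dissipation: implicit Euler on dp/dt = -gamma * p,
    # i.e. the proximal map of (gamma / 2) * p**2 with step size h / 2.
    p = p / (1.0 + 0.5 * h * gamma)
    # Full conservative step: exact oscillator flow, a rotation in phase space.
    c, s = math.cos(h), math.sin(h)
    q, p = c * q + s * p, -s * q + c * p
    # Second half step of dissipation.
    p = p / (1.0 + 0.5 * h * gamma)
    return q, p


q, p = 1.0, 0.0
energies = [energy(q, p)]
for _ in range(200):
    q, p = strang_prox_step(q, p, h=0.1, gamma=0.5)
    energies.append(energy(q, p))
```

Each substep either preserves the energy or shrinks the momentum, so the recorded energies never increase beyond rounding error regardless of the step size; that per-step property is the kind of discrete guarantee the abstract refers to.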

2509.01630 2026-06-12 cs.LG cs.MA cs.RO cs.SY eess.SY version update

DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization

DiffCoord: 分布式多智能体轨迹优化的可微协调

Bingheng Wang, Yichao Gao, Tianchen Sun, Shanker Ajay, Lin Zhao

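Affiliations: Department of Electrical and Computer Engineering, National University of Singapore

AI summary: Proposes the DiffCoord framework, which jointly optimizes the coupled parameters of a truncated ADMM-DDP pipeline through end-to-end meta-learning, uses agent-wise neural networks for task adaptation, and scales to different numbers of agents. Validated on a cooperative aerial transport system, it reduces per-agent gradient computation time by up to 70% compared with existing methods.

Details
AI Chinese abstract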

将交替方向乘子法(ADMM)与微分动态规划(DDP)相结合,为分布式多智能体轨迹优化提供了一个可扩展的框架。在实践中,ADMM通常被截断以提高计算效率,这紧密耦合了原本分别控制协调质量和任务性能的参数。在本文中,我们提出了可微协调(DiffCoord),一个统一框架,联合元学习截断ADMM-DDP管道的这些耦合参数。这些参数由智能体神经网络生成以实现任务自适应,并且同构智能体之间共享相同的网络,从而能够扩展到不同数量的智能体。我们通过端到端微分ADMM-DDP管道实现了高效的元学习。值得注意的是,这产生了一个辅助的ADMM-LQR分布式梯度求解器,用于计算和协调关于这些参数的元梯度。该求解器继承了管道的计算结构,使得关键计算结果可以重用,并能够在智能体和轨迹时间线上高效并行化。我们通过协作空中运输系统的数值和物理实验验证了DiffCoord,该系统在狭窄空间中重新配置四旋翼编队以实现安全的六自由度负载操作。它能够鲁棒地适应变化的团队规模和负载动力学,同时与最先进的轨迹梯度方法相比,将每智能体梯度计算时间减少高达70%。

English abstract

Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
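
Why truncation couples the parameters can be seen on a toy consensus problem. The sketch below runs scaled-form consensus ADMM for a fixed number of iterations on scalar quadratics f_i(x) = (x - a_i)², whose consensus optimum is the mean of the a_i. It is an assumption-laden illustration only: there is no DDP, no dynamics, and no learning, and the function and parameter names are hypothetical.

```python
def truncated_consensus_admm(targets, rho, num_iters):
    """Run consensus ADMM for exactly `num_iters` iterations.

    Each agent i minimizes (x_i - targets[i])**2 subject to x_i == z.
    Returns the local variables x and the shared consensus variable z.
    """
    n = len(targets)
    z = 0.0
    u = [0.0] * n  # scaled dual variables, one per agent
    x = [0.0] * n
    for _ in range(num_iters):
        # Local step: closed-form minimizer of (x - a)^2 + (rho / 2) * (x - z + u)^2.
        x = [(2.0 * a + rho * (z - ui)) / (2.0 + rho) for a, ui in zip(targets, u)]
        # Coordination step: average the agents' proposals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return x, z


targets = [0.0, 2.0, 4.0]
_, z_short = truncated_consensus_admm(targets, rho=1.0, num_iters=2)
_, z_long = truncated_consensus_admm(targets, rho=1.0, num_iters=50)
```

With only two iterations the consensus value is still visibly off the optimum, and how far off depends on `rho`; once the iteration budget is fixed, the penalty parameter affects both agreement quality and the solution itself, which is the kind of coupling the abstract proposes to meta-learn.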

2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS Version update

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

Affiliations * University of California, San Diego

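AI summary: Proposes the GetNetUPAM framework, which preserves ecological heterogeneity through hierarchical nested cross-validation, and evaluates ARPA-N, a network that integrates CBAM spatial attention, to generalize robustly under high-noise, low-SNR conditions; in a region with zero training support it cuts false positives per hour roughly 10-fold.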

Details
Comments
Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined
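AI Chinese abstract (translated to English)

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions, along with evaluation protocols that expose deployment-relevant failure modes; both needs remain largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources cause distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We propose GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than to tune for inflated hold-out scores. By partitioning the data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a different environmental regime, which prevents overfitting to local noise or sensor artifacts. The inner stratified folds measure generalization across the full UPAM signal distribution and enforce strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues that standard CNNs exploit on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the Balleny Islands region, which has zero training support, it reduces false positives per hour by more than an order of magnitude (about 10x) at a fixed 90% recall and improves metrics consistently across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.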

English abstract

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
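
The splitting scheme described in the abstract (outer folds defined by site-year blocks, inner stratified folds drawn only from the remaining data) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the record fields `site`, `year`, and `label`, and the toy data are all hypothetical, and the inner stratification is a simple round-robin deal per class.

```python
from collections import defaultdict


def site_year_outer_folds(records):
    """Yield (block, train_idx, test_idx) with one outer fold per (site, year) block."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[(rec["site"], rec["year"])].append(i)
    for block in sorted(blocks):
        test_idx = blocks[block]
        train_idx = [i for b in sorted(blocks) if b != block for i in blocks[b]]
        yield block, train_idx, test_idx


def stratified_inner_folds(indices, labels, k):
    """Split `indices` into k folds, dealing each class round-robin to keep class ratios."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for cls in sorted(by_class):
        for j, i in enumerate(by_class[cls]):
            folds[j % k].append(i)
    return folds


# Hypothetical toy data: 2 sites x 2 years, 6 clips per block, alternating labels.
records = [
    {"site": site, "year": year, "label": n % 2}
    for site in ("A", "B")
    for year in (2014, 2015)
    for n in range(6)
]
labels = [r["label"] for r in records]

for block, train_idx, test_idx in site_year_outer_folds(records):
    inner = stratified_inner_folds(train_idx, labels, k=3)
    # Model development and stability estimates would use only `inner`;
    # `test_idx` (the held-out site-year block) is scored once afterwards.
    print(block, len(train_idx), len(test_idx), [len(f) for f in inner])
```

Because every clip from the held-out site-year block is excluded from the inner folds, nothing tuned on the inner folds can adapt to that block's noise or sensor characteristics, which is the separation the abstract emphasizes.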