arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.19136 2026-05-20 cs.RO

Automatically Improving Simulation Physics for Articulated Objects

Anh-Quan Pham

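AI summary: This paper studies how a quantitative evaluation framework and a multi-modal, simulator-in-the-loop refinement method can improve the physical realism and stability of articulated objects in simulation, making downstream robot learning more reliable.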

Abstract

Simulation is a central tool for scalable robot learning, but its effectiveness depends on the quality of object assets. While modern 3D datasets provide rich geometric and kinematic representations, they typically lack the physical properties required for stable and realistic interaction, requiring significant manual effort to construct simulation-ready articulated objects. In this thesis, we introduce interaction-readiness, which characterizes whether an object can be reliably simulated under manipulation. We propose a quantitative evaluation framework that decomposes interaction-readiness into measurable components, enabling systematic analysis of object quality and revealing failure modes not captured by conventional evaluation. We further present a multi-modal, simulator-in-the-loop approach for generating interaction-ready articulated objects from incomplete 3D assets. The method integrates geometric, visual, and semantic information to infer physical properties and refines them through iterative simulator feedback to improve physical consistency. Experiments across diverse articulated objects and manipulation tasks show that object quality directly impacts simulation stability, interaction behavior, and policy performance. Objects refined by our method exhibit more stable and realistic dynamics, enabling more reliable downstream learning and evaluation. Overall, this thesis demonstrates the importance of physical realism for articulated objects in simulation and introduces a practical multi-modal refinement approach, guided by simulator feedback, for constructing such objects at scale.

2605.19135 2026-05-20 cs.LG

Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing

Manal Benhamza, Marianne Clausel, Myriam Tami

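AI summary: This paper studies the identifiability of multimodal causal representation learning under partial latent sharing. Each modality is generated through nonlinear mixing functions, and, without assuming a parametric distribution on the latent variables, the authors establish component-wise identifiability guarantees for the causal latent representation; the results also cover the undercomplete case.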

Abstract

Causal representation learning (CRL) seeks to uncover meaningful latent variables and their corresponding causal structure from high-dimensional observational data. Identifiability is a crucial property in CRL, as it ensures the recovery of the mechanisms behind the data generation process, and hence the interpretability and robustness of the representation. Proving identifiability in CRL is intrinsically difficult, and in this work we address an even more challenging setting: multimodality. We consider multimodal observed data with a partially shared latent structure. Each modality is generated, through nonlinear mixing functions, from a specific subset of causal latent variables. Under flexible assumptions and without imposing any parametric distribution on the latent variables, we establish component-wise identifiability guarantees for the causal latent representation. Our identifiability results also apply to the undercomplete scenario, in which each modality has more observed than latent variables. To instantiate our theoretical analysis, we introduce a Wasserstein-based module to recover the partially shared latent structure. Because it is differentiable, this module can be easily integrated into all types of architecture with only minimal changes. Extensive experiments on synthetic and realistic datasets validate the superiority of our approach over state-of-the-art methods.
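
The abstract does not say how the Wasserstein-based module is built. As a rough illustration of the kind of quantity such a module could compare, the sketch below computes the empirical Wasserstein-1 distance between two equal-size 1-D samples, for example one latent coordinate as estimated from two modalities. The function name and the equal-size restriction are assumptions made for this example, not details from the paper.

```python
def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    For 1-D samples of equal size, the optimal transport plan matches
    sorted values in order, so the distance is the mean absolute gap.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    if not xs:
        raise ValueError("samples must be non-empty")
    gaps = (abs(a - b) for a, b in zip(sorted(xs), sorted(ys)))
    return sum(gaps) / len(xs)


# A latent coordinate shared across modalities should give a small distance.
print(wasserstein_1d([3, 1, 2], [1, 2, 3]))  # 0.0
print(wasserstein_1d([0, 1, 2], [1, 2, 3]))  # 1.0
```

A small distance between a pair of coordinates is only a heuristic hint that they may be shared; the paper's identifiability guarantees rest on its own assumptions, which this sketch does not reproduce.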

2605.19133 2026-05-20 cs.CV cs.AI

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

Muskaan Chopra, Lorenz Sparrenberg, Jan H. Terheyden, Rafet Sifa

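AI summary: This paper studies how the length of self-supervised pretraining affects calibrated confidence and confidence-based abstention. SSL pretraining improves selective prediction over training from scratch, but longer pretraining does not consistently improve reliability, which underscores the importance of abstention-aware evaluation.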

Comments Accepted at IJCAI 2026

Abstract

Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.
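
Coverage and selective accuracy, two of the metrics named above, can be illustrated with a minimal sketch. It assumes the common convention that the model abstains whenever its calibrated confidence falls below a threshold; the function name, the threshold, and the toy data are hypothetical and not taken from the paper.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, selective_accuracy) under confidence-based abstention.

    Coverage is the fraction of cases the model answers; selective accuracy
    is the accuracy on the answered cases only (None if nothing is answered).
    """
    kept = [(p, y) for c, p, y in zip(confidences, predictions, labels)
            if c >= threshold]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, None
    correct = sum(1 for p, y in kept if p == y)
    return coverage, correct / len(kept)


conf = [0.95, 0.60, 0.80, 0.40]   # toy calibrated confidences
pred = [0, 1, 2, 1]               # predicted DR grades
gold = [0, 2, 2, 1]               # reference grades
print(selective_metrics(conf, pred, gold, threshold=0.7))  # (0.5, 1.0)
```

Raising the threshold lowers coverage and usually raises selective accuracy, which is why two checkpoints with equal overall accuracy can still behave differently under abstention.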

2605.19132 2026-05-20 cs.LG

CLIC: Contextual Language-Informed Cardiac Pathology Classification

Giovani D. Lucafo, Rafael da Costa Silva, João Lucas Luz Lima Sarcinelli, Andre Guarnier De Mitri, Diego Furtado Silva

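AI summary: This paper proposes CLIC, a framework that converts patient context data into descriptive text and encodes it through natural language to improve cardiac pathology classification, and also explores LLM-generated clinical descriptions for the downstream classification task.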

Comments 6 pages, 2 figures, accepted at the ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM)

Abstract

The electrocardiogram (ECG) is the gold standard for non-invasive diagnosis of cardiac pathologies and is a fundamental pillar of cardiovascular medicine. Recent progress in deep learning has led to the development of robust automated classifiers that achieve high performance by processing raw physiological signals. However, in clinical practice, diagnosis is rarely based solely on the signal. Cardiologists commonly support their interpretation with the patient's characteristics and the specific data-acquisition context. Despite this, most current algorithms remain restricted to signal-only analysis, failing to integrate technical metadata and demographic variables. This paper proposes Contextual Language-Informed Cardiac pathology classification (CLIC), a multimodal framework that significantly enhances diagnostic precision by encoding these variables through natural language. We demonstrate that translating patient-level contextual data into descriptive text provides an informative anchor that helps the model disambiguate complex physiological patterns. We further investigate the use of Large Language Models to synthesize richer clinical descriptions and observe that, while these generated texts remain competitive, controlled template-based contextual clinical text leads to consistent improvements in downstream classification performance.
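
To make the idea of controlled template-based contextual text concrete, here is a minimal sketch that turns a few metadata fields into one descriptive sentence. The field names and the template wording are hypothetical; the abstract does not list the variables or the templates actually used.

```python
def describe_context(age, sex, device, sampling_rate_hz):
    """Render patient and acquisition metadata as a fixed-template sentence.

    All fields and the wording are illustrative assumptions.
    """
    return (f"ECG of a {age}-year-old {sex} patient, "
            f"recorded with {device} at {sampling_rate_hz} Hz.")


text = describe_context(age=63, sex="female", device="a 12-lead device",
                        sampling_rate_hz=500)
print(text)
```

A fixed template keeps the text deterministic for a given record, which is one plausible reason a controlled template could be easier for a text encoder to exploit than free-form generated descriptions.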

2605.19130 2026-05-20 cs.LG cs.AI cs.CL cs.CV

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

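AI summary: This work examines how robustly children acquire language grounding from limited visuo-linguistic input and introduces the EgoBabyVLM Challenge, which pushes models toward grounded language learning from naturalistic data.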

Abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.
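
The phrase "logarithmic frequency bins" can be illustrated with a short sketch that groups a training vocabulary by the order of magnitude of each word's count. Base-10 bins and the function name are assumptions for this example; the abstract does not specify how Machine-DevBench defines its bins.

```python
from collections import Counter, defaultdict


def log_frequency_bins(tokens):
    """Group words by floor(log10(count)) of their corpus frequency.

    The bin index is computed from the number of decimal digits of the
    count, which equals floor(log10(count)) without floating-point error.
    """
    counts = Counter(tokens)
    bins = defaultdict(list)
    for word, n in counts.items():
        bins[len(str(n)) - 1].append(word)
    return {index: sorted(words) for index, words in bins.items()}


corpus = ["the"] * 12 + ["dog"] * 3 + ["ball"]
print(log_frequency_bins(corpus))
```

Sampling test items evenly from each bin would keep rare and frequent words equally represented, which is one way to read the abstract's point about matching evaluation to the training vocabulary.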

2605.19127 2026-05-20 cs.AI

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag

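AI summary: This paper proposes POLAR-Bench, a benchmark for evaluating the privacy-utility trade-off of LLM agents. Across 10 domains and 7,852 samples, it scores privacy and utility by deterministic set membership and varies privacy policy dimension and attack strategy along two orthogonal axes, producing a 5×5 diagnostic surface. Current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1-30B range score notably worse, with the weakest leaking over half.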

Comments Preprint

Abstract

LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5×5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1-30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model's intent-following breaks down, providing a foothold for privacy alignment where it matters most.
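
Scoring by deterministic set membership can be sketched as two set intersections. The sketch assumes that the attributes disclosed in a dialogue have already been extracted as a collection of names; how POLAR-Bench detects disclosure is not described in the abstract, and the attribute names below are hypothetical.

```python
def score_dialogue(disclosed, task_relevant, protected):
    """Return (utility, leak_rate) by set membership.

    utility   = share of task-relevant attributes that were disclosed
    leak_rate = share of protected attributes that were disclosed
    Both reference sets are assumed to be non-empty.
    """
    disclosed = set(disclosed)
    task_relevant = set(task_relevant)
    protected = set(protected)
    utility = len(disclosed & task_relevant) / len(task_relevant)
    leak_rate = len(disclosed & protected) / len(protected)
    return utility, leak_rate


utility, leak_rate = score_dialogue(
    disclosed={"destination", "dates", "passport_number"},
    task_relevant={"destination", "dates"},
    protected={"passport_number", "home_address"},
)
print(utility, leak_rate)  # 1.0 0.5
```

Because both scores are plain set ratios, the same transcript always receives the same score, with no judge model in the loop.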

2605.19120 2026-05-20 cs.RO

CosFly: Plan in the Matrix, Fly in the World

Hanxuan Chen, Xiangyue Wang, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Binbo Li, Kangli Wang, Ji Pei

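AI summary: This paper presents CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, and CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments. CosFly converts complex 3D worlds into structured obstacle representations for planning, projects the resulting trajectories into multi-modal sensor data, and supports configurable fixed-FOV zoom levels.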

Abstract

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.
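
The link between a fixed FOV setting and focal length follows from the standard pinhole camera relation, f = (W / 2) / tan(FOV / 2) for image width W and horizontal field of view FOV. The sketch below applies that relation; the image width and FOV values are illustrative and not taken from the CosFly-Track release.

```python
import math


def focal_length_px(image_width_px, horizontal_fov_deg):
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    half_fov_rad = math.radians(horizontal_fov_deg) / 2.0
    return (image_width_px / 2.0) / math.tan(half_fov_rad)


# A narrower FOV corresponds to a longer focal length, i.e. a zoomed-in view.
for fov in (90, 60, 30):
    print(fov, round(focal_length_px(1280, fov), 1))
```

Drawing one FOV per trajectory and holding it constant, as the abstract describes, means the resulting intrinsic matrix is fixed for every frame of that trajectory.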

2605.19111 2026-05-20 cs.CV cs.AI

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

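AI summary: This paper proposes FAGER, a framework for evaluating and improving the factual accuracy of text-to-image models. It builds a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then evaluates with a VLM. FAGER outperforms prior metrics on a factuality test and can refine T2I outputs without training.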

Comments It was accepted for an oral presentation at the 2nd Workshop on the Evaluation of Generative Foundation Models (EVGENFM2026) at CVPR 2026. Total 8 pages (1 page for references). 5 figures

Abstract

Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
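
The Factual A/B test can be read as a preference rate: over paired images, how often does a metric score the factual reference above the generated image? The sketch below computes that rate from two score lists. Counting ties as failures is an assumption of this example; the abstract does not state how ties are handled.

```python
def ab_preference_rate(reference_scores, generated_scores):
    """Fraction of pairs where the metric scores the reference strictly higher."""
    pairs = list(zip(reference_scores, generated_scores))
    if not pairs:
        raise ValueError("at least one pair is required")
    wins = sum(1 for ref, gen in pairs if ref > gen)
    return wins / len(pairs)


# Toy metric scores for three (reference, generated) pairs.
rate = ab_preference_rate([0.9, 0.8, 0.4], [0.5, 0.85, 0.1])
print(round(rate, 3))  # 0.667
```

A metric that is blind to factual content would land near 0.5 on such a test, so values well above that suggest the metric is sensitive to factual correctness.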

2605.19107 2026-05-20 cs.LG eess.SP

Performance Monitoring of Proton Exchange Membrane Water Electrolyzer by Transformers-Based Machine Learning Model

Bingqing Chen, Ivan Batalov, Qiu Chen, Weiqi Ji, Lei Cheng

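AI summary: This paper proposes a transformer-based machine learning framework for virtual electrochemical characterization during normal operation. An encoder-decoder architecture reconstructs polarization curves, enabling continuous state-of-health monitoring of proton exchange membrane water electrolyzers.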

Abstract

Green hydrogen plays an essential role in decarbonization, with capacity projected to scale to 560 GW by 2030 (vs. 1.39 GW in 2023) in net-zero settings. Proton exchange membrane (PEM) electrolysis is one of the most promising technology routes to green hydrogen production, and real-time system health monitoring of PEM electrolyzers is essential for their scalable deployment. In lab settings, performance degradation can be characterized through electrochemical testing protocols that periodically pause normal operation. Such interruptions are not practical for full-scale stack deployments, limiting system operators' ability to make real-time assessments of state-of-health (SoH). We present a machine learning (ML) framework that performs virtual electrochemical characterization during normal operation. The method uses an encoder-decoder transformer, conditioned on operational data, to reconstruct characterization outputs, focusing here on polarization curves. Inspired by patch-based sequence tokenization, we segment the inputs into patches and encode them to form meaningful tokens, which substantially improves learning efficiency. Across four longitudinal runs, lasting up to 478 hours on different test cells and loading cycles, the model accurately reconstructed polarization curves and achieved a 10x reduction in mean squared error (MSE) compared to a vanilla transformer. This proof-of-concept demonstrates that ML models can enable continuous performance monitoring for PEM electrolyzers and that the encoder captures meaningful latent representations of SoH, opening up opportunities to derive interpretable indicators in future work.
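
Patch-based tokenization, as mentioned above, starts by cutting a time series into fixed-length windows before each window is embedded as one token. The sketch below shows only the segmentation step; the patch length, the stride, and the function name are assumptions, since the abstract gives no values.

```python
def to_patches(series, patch_len, stride):
    """Split a sequence into fixed-length patches taken every `stride` steps.

    Trailing samples that do not fill a complete patch are dropped.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


voltages = [1.80, 1.81, 1.83, 1.82, 1.85, 1.86]    # toy operating signal
print(to_patches(voltages, patch_len=2, stride=2))  # non-overlapping patches
print(to_patches(voltages, patch_len=3, stride=1))  # overlapping patches
```

With non-overlapping patches of length P, a series of length T yields about T / P tokens instead of T, which shortens the sequence the transformer has to attend over.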

2605.19104 2026-05-20 cs.RO cs.AI

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

Branden Frieden, James M. Ferguson, Alan Kuntz, Varun Shankar

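AI summary: This paper proposes a neural-operator learning approach to design-space surrogate modeling of tendon-driven continuum robots. It maps robot design parameters and tendon actuation inputs to resulting configurations, so that a single model generalizes across a large class of robot designs.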

Comments Accepted to ICRA 2026

Abstract

Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.

2605.19101 2026-05-20 cs.SD cs.LG

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

Yanru Wu, Jianning Wang, Chongxin Gan, Yang Li

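AI summary: This paper proposes GST, a heterogeneity-aware dataset scheduling method that groups datasets and introduces them on a progressive schedule, balancing the stability of parallel training with the efficiency of sequential optimization and converging 30-40% faster on 14 AudioQA datasets.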

Abstract

Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixing. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation. Extensive evaluations on 14 AudioQA datasets spanning speech, music, and environmental sounds demonstrate that GST achieves 30-40% faster convergence than standard parallel training while maintaining or even surpassing the performance of mix-all training. Our results provide both theoretical insights and a practical, model-agnostic framework for efficient large-scale ALLM optimization.
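
The abstract does not define its gradient-based affinity metrics. A common stand-in for the idea of conflicting gradients is the cosine similarity between the average gradients of two datasets, sketched below on plain Python lists. Treat the choice of cosine similarity and the toy gradient vectors as assumptions of this example, not as the GST metric.

```python
import math


def cosine_affinity(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors.

    Positive values suggest the datasets pull parameters in similar
    directions; negative values suggest conflicting updates.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm = math.sqrt(sum(a * a for a in grad_a)) * math.sqrt(sum(b * b for b in grad_b))
    return dot / norm if norm else 0.0


speech_grad = [1.0, 0.0]   # toy per-dataset gradients
music_grad = [0.0, 1.0]
noise_grad = [-1.0, 0.0]
print(cosine_affinity(speech_grad, music_grad))  # 0.0, orthogonal updates
print(cosine_affinity(speech_grad, noise_grad))  # -1.0, conflicting updates
```

Pairwise scores of this kind could then feed a clustering step that forms the affinity-aware groups, although the grouping procedure itself is not described in the abstract.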

2605.19099 2026-05-20 cs.AI cs.CL cs.MA

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu

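AI summary: This paper proposes DecisionBench, a benchmark for evaluating emergent delegation in long-horizon agentic workflows. A five-condition reference sweep yields core findings on end-task quality, routing fidelity, and the counterfactual performance ceiling.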

Comments 28 pages, 9 figures, 11 tables. Code and data: https://huggingface.co/decisionbench

Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
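
One plausible reading of routing fidelity-at-k is the share of delegation decisions in which the chosen peer model is among the k best peers for that task according to a reference ranking. The sketch below implements that reading; the exact definition used by DecisionBench is not given in the abstract, and the model names are placeholders.

```python
def routing_fidelity_at_k(chosen, reference_rankings, k):
    """Fraction of delegations whose chosen peer is in the top-k reference peers.

    chosen[i] is the peer picked for task i; reference_rankings[i] lists
    peers for task i from best to worst (an assumed oracle ranking).
    """
    if not chosen:
        raise ValueError("at least one delegation is required")
    hits = sum(1 for peer, ranking in zip(chosen, reference_rankings)
               if peer in ranking[:k])
    return hits / len(chosen)


chosen = ["model_a", "model_c"]
rankings = [["model_a", "model_b", "model_c"],
            ["model_b", "model_a", "model_c"]]
print(routing_fidelity_at_k(chosen, rankings, k=1))  # 0.5
print(routing_fidelity_at_k(chosen, rankings, k=3))  # 1.0
```

Under this reading, fidelity can differ widely between conditions even when mean end-task quality is equal, because a suboptimal peer may still solve many tasks.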

2605.19095 2026-05-20 cs.LG cs.AI stat.ML

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

Aaron Defazio

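AI summary: This paper proposes ScheduleFree+, a learning-rate-free and schedule-free method for training large language models that greatly outperforms conventional Warmup-Stable-Decay (WSD) schedules at scale, and shows that Schedule-Free Learning is most effective for long-duration training.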

Abstract

Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.

2605.19093 2026-05-20 cs.AI cs.LG

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy

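AI summary: This paper studies Bayesian optimization of system prompts with dynamic representations when only aggregate feedback is available. It proposes ReElicit, an embedding-by-elicitation method in which an LLM builds an interpretable feature space, a Gaussian process surrogate guides the selection of target feature vectors, and the LLM turns those vectors into optimized system prompts.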

Abstract

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.
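
The acquisition step can be illustrated with an upper confidence bound (UCB) rule, which scores each candidate feature vector by its predicted mean plus a multiple of its predictive standard deviation. The abstract does not name the acquisition function ReElicit uses, so UCB, the `kappa` parameter, and the toy surrogate outputs below are all assumptions.

```python
def ucb_select(candidates, means, stds, kappa=2.0):
    """Pick the candidate maximizing mean + kappa * std (UCB acquisition).

    `means` and `stds` stand in for the posterior mean and standard
    deviation that a Gaussian process surrogate would provide.
    """
    scores = [m + kappa * s for m, s in zip(means, stds)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Hypothetical feature vectors: (formality, step-by-step emphasis).
candidates = [(0.2, 0.9), (0.8, 0.5), (0.5, 0.1)]
means = [0.50, 0.60, 0.40]   # predicted aggregate score
stds = [0.05, 0.01, 0.20]    # predictive uncertainty
print(ucb_select(candidates, means, stds))             # explores: (0.5, 0.1)
print(ucb_select(candidates, means, stds, kappa=0.0))  # exploits: (0.8, 0.5)
```

With only 30 evaluations available, the balance between exploring uncertain regions and exploiting known good ones matters a great deal, which is the usual motivation for a surrogate-driven rule over random search.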

2605.19092 2026-05-20 cs.LG cs.AI cs.CL

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

Alexander Boesgaard Lorup

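AI summary: This paper proposes a counterfactual likelihood test for measuring influence between private reasoning channels. It replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the shift in the downstream target's negative log-likelihood, in order to assess direct and indirect influence across private and public channels.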

Comments 12 pages, 4 figures, 5 tables

英文摘要

Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.
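The donor-swap measurement described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `LogProbFn`, `nll_shift`, and `toy_logprob` are hypothetical names, and the toy scorer merely stands in for a real model's log-likelihood.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: returns log p(target | private block, public tokens).
LogProbFn = Callable[[Sequence[str], Sequence[str], Sequence[str]], float]

def nll_shift(logprob: LogProbFn, private: Sequence[str], donor: Sequence[str],
              public: Sequence[str], target: Sequence[str]) -> float:
    """NLL(target | donor swap) minus NLL(target | natural private block)."""
    if len(donor) != len(private):
        # Length matching keeps downstream token positions unchanged.
        raise ValueError("donor block must be length-matched to the private block")
    natural_nll = -logprob(private, public, target)
    counterfactual_nll = -logprob(donor, public, target)
    return counterfactual_nll - natural_nll

def toy_logprob(private, public, target):
    # Toy stand-in for a model: a target token is likelier if it appears
    # anywhere in the context the scorer can see.
    visible = set(private) | set(public)
    return sum(math.log(0.9 if tok in visible else 0.1) for tok in target)

# Unmasked scorer: the private block is visible, so swapping it shifts the NLL.
leaky = nll_shift(toy_logprob, ["x", "=", "7"], ["y", "=", "3"], ["so"], ["7"])

# Masked scorer: the private block is hidden, so the swap changes nothing.
masked = nll_shift(lambda prv, pub, tgt: toy_logprob([], pub, tgt),
                   ["x", "=", "7"], ["y", "=", "3"], ["so"], ["7"])
```

A positive shift means the downstream target depended on the swapped block. A shift of exactly zero, as in the masked case, corresponds to the identical natural and counterfactual scores the abstract reports for its graph-separation control.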

2605.19091 2026-05-20 cs.LG

Chessformer: A Unified Architecture for Chess Modeling

Daniel Monroe, George Eilender, Philip Chalmers, Zhenwei Tang, Ashton Anderson

AI summary: This paper proposes Chessformer, a unified architecture for chess modeling that advances all three of the field's central goals at once: maximizing playing strength, predicting human play, and enabling interpretability.

Comments: International Conference on Learning Representations (2026)

Abstract

Chess has long served as a canonical testbed for artificial intelligence, but modeling approaches for its central tasks have diverged. Maximizing playing strength, predicting human play, and enabling interpretability are typically solved with disparate architectures, and these designs are often misaligned with the geometry of the domain. This raises the natural question of whether these objectives require separate modeling paradigms, or if there exists a single architecture that supports them simultaneously. We introduce Chessformer, a unified architecture that advances the state of the art on all three central goals in chess modeling. Chessformer is an encoder-only transformer that represents board squares as tokens, augments self-attention with a novel dynamic positional encoding called Geometric Attention Bias (GAB) that adapts to domain-specific geometry, and predicts actions with an attention-based source-destination policy head. We evaluate Chessformer on each front. First, we develop \maiathree, a family of models for human move prediction that reaches 57.1% move-matching accuracy, significantly surpassing the previous state of the art with fewer than a quarter of the parameters. Second, we integrate Chessformer into Leela Chess Zero, a leading open-source engine, adding over 100 Elo of playing strength and resulting in tournament victories over Stockfish in major computer chess competitions. Third, we show that Chessformer's square-token design makes attention patterns and activations directly attributable to board squares, enabling granular interpretability analyses that prior architectures do not naturally support. More broadly, our results demonstrate that aligning a model's tokenization, positional encoding, and output design with the underlying structure of a domain can yield simultaneous gains in performance, human compatibility, and interpretability.

2605.19080 2026-05-20 cs.LG cs.AI

MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning

Ankita Awasthi, Marco Apolinario, Kaushik Roy

AI summary: This paper proposes MANGO, a framework that balances stability and plasticity in online continual learning through gradient gating and meta-learned regularization, mitigating forgetting of past tasks while learning new tasks efficiently.

Abstract

In Online Continual Learning (OCL), a neural network learns sequentially from a non-stationary data stream in a single pass, with access only to a limited memory replay buffer. This contrasts sharply with offline continual learning, where training runs for multiple epochs over large datasets. The main challenge in OCL is to overcome catastrophic forgetting of past tasks (stability) while learning new ones efficiently (plasticity). Existing methods counter forgetting via replay-based rehearsal, output-level distillation, fixed regularization, or meta-learning on the current data. However, these methods have limitations: rehearsal introduces a stored-sample bias; distillation operates on output distributions without modulating parameter updates; fixed regularization penalizes parameters irrespective of sensitivity; and stream-only meta-learning lacks a feedback-controlled parameter update. We propose Meta-Adaptive Network Gradient Optimization (MANGO), an OCL framework that balances stability and plasticity via gradient gating and meta-learned regularization. Gradient gating scales parameter updates based on sensitivity, preventing destructive updates. Meta-learned regularization adapts stability coefficients by evaluating the effect of each parameter update on replay. In MANGO, replay acts as both a training signal and a forgetting evaluator. We evaluate our method on three standard OCL benchmark datasets. MANGO outperforms strong baselines, achieving state-of-the-art results with consistent performance across replay sizes. In domain-incremental learning on CLEAR-10 and class-incremental learning on CIFAR-100 and Tiny-ImageNet, it achieves higher accuracy than all baselines and attains positive Backward Transfer on CLEAR-10, overcoming forgetting.

2605.19077 2026-05-20 cs.CL cs.AI

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

Yanjun Lin, Zimo Xiao, Kartik Natarajan, Mahesh Sankaranarayanan, Niraj Nawanit, Rakshit Parashar, Austin Zhang, Karthik Konaraddi, Rishita Mote, Wei Niu

AI summary: This work proposes ReacTOD, a bounded neuro-symbolic architecture for zero-shot dialogue state tracking that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation, and reports new zero-shot state-of-the-art results.

Comments: Accepted at TrustNLP Workshop at ACL 2026

Abstract

Task-oriented dialogue systems -- handling transactions, reservations, and service requests -- require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% -- demonstrating cross-benchmark generalization without task-specific training data.
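The idea of a bounded loop governed by a deterministic validator can be illustrated with a small sketch. Everything here is an assumption made for illustration: the schema format, the `validate` and `bounded_loop` names, and the toy proposer that stands in for an LLM tool call are not taken from ReacTOD.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Schema = Dict[str, Set[str]]   # slot name -> allowed values (hypothetical format)
Update = Dict[str, str]        # proposed slot-value changes

def validate(update: Update, schema: Schema) -> List[str]:
    """Deterministic check: return error messages, or an empty list if valid."""
    errors = []
    for slot, value in update.items():
        if slot not in schema:
            errors.append(f"unknown slot: {slot}")
        elif value not in schema[slot]:
            errors.append(f"invalid value for {slot}: {value}")
    return errors

def bounded_loop(propose: Callable[[List[str]], Update], schema: Schema,
                 max_steps: int = 3) -> Tuple[Optional[Update], List[List[str]]]:
    """Request an update, feeding validator errors back, at most max_steps times."""
    trace: List[List[str]] = []
    feedback: List[str] = []
    for _ in range(max_steps):
        update = propose(feedback)
        feedback = validate(update, schema)
        trace.append(feedback)          # structured record of each attempt
        if not feedback:
            return update, trace
    return None, trace                  # budget exhausted: commit nothing

# Toy proposer standing in for the model: it misspells the value on the
# first attempt and corrects itself once it receives validator feedback.
schema = {"hotel-day": {"monday", "tuesday"}}

def toy_propose(feedback: List[str]) -> Update:
    return {"hotel-day": "tuesday"} if feedback else {"hotel-day": "tusday"}

state, trace = bounded_loop(toy_propose, schema)
```

The step bound keeps the worst-case cost fixed, and returning `None` when the budget runs out means an invalid update is never written to the dialogue state.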

2605.19076 2026-05-20 cs.LG physics.flu-dyn

The impact of observation density on Bayesian inversion of latent dynamics in shock-dominated flows

Bipin Tiwari, Muhammad Abid, Omer San

AI summary: This paper develops a non-intrusive reduced-order modeling framework for efficient Bayesian initial-state inversion with uncertainty quantification, combining a convolutional autoencoder with a learned latent-space forward operator to recover initial states in shock-dominated flows accurately and efficiently.

Abstract

Inferring unknown initial states in shock-dominated compressible flows from sparse and noisy measurements is a challenging ill-posed inverse problem due to nonlinear wave interactions and limited sensing. In this work, we develop a non-intrusive reduced-order modeling framework for efficient Bayesian initial-state inversion with uncertainty quantification. The framework combines a convolutional autoencoder with a learned latent-space forward operator. The autoencoder compresses high-dimensional flow fields into a compact nonlinear latent representation, while the forward operator predicts final-time latent states from encoded initial conditions. This AE-ROM surrogate enables rapid forward evaluations and is embedded within a No-U-Turn Sampler (NUTS) for posterior exploration. The framework is demonstrated using 500 high-fidelity Sod shock tube simulations generated through Latin hypercube sampling and solved using a fifth-order WENO scheme. The inverse problem seeks to recover unknown left and right density and pressure states from sparse noisy observations of final-time density and pressure fields. Results show that the AE-ROM accurately reconstructs key shock-tube structures, including the rarefaction wave, contact discontinuity, and shock front. A latent dimension of 32 provides an effective balance between reconstruction accuracy and reduced-space compactness, while 250 training simulations are sufficient for accurate reconstruction. Increasing observation density significantly contracts posterior uncertainty, reducing the mean posterior standard deviation by approximately 78% for density and 76% for pressure. Overall, the proposed framework provides a computationally efficient and uncertainty-aware approach for inverse analysis of shock-dominated flows, with potential extensions to multidimensional compressible-flow and digital-twin applications.
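The abstract states that the training simulations were drawn by Latin hypercube sampling over the unknown left and right states. A minimal standard-library sketch of that sampling scheme is shown below; the function name and the parameter bounds are illustrative assumptions, since the abstract does not give the ranges used.

```python
import random
from typing import List, Sequence, Tuple

def latin_hypercube(n: int, bounds: Sequence[Tuple[float, float]],
                    seed: int = 0) -> List[List[float]]:
    """Draw n points so that each dimension has exactly one point per stratum."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniformly placed point in each of n equal-width strata ...
        points = [low + (high - low) * (i + rng.random()) / n for i in range(n)]
        # ... then shuffle so strata are paired randomly across dimensions.
        rng.shuffle(points)
        columns.append(points)
    return [list(row) for row in zip(*columns)]

# Four unknowns: left density, right density, left pressure, right pressure.
# These bounds are placeholders, not the values used in the paper.
bounds = [(0.5, 1.5), (0.05, 0.5), (0.5, 1.5), (0.05, 0.5)]
samples = latin_hypercube(500, bounds)
```

Compared with independent uniform draws, this stratification guarantees that every slice of each parameter range is represented, which matters when each sample costs a high-fidelity simulation.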

2605.19075 2026-05-20 cs.CV cs.AI

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu, David Doermann

AI summary: This work proposes CRAFT, which combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop that iteratively verifies and repairs claims, yielding accurate evidence aggregation for multimodal video question answering.

Comments: Accepted at ACL 2026 Multimodal Augmented Generation via MultimodAl Retrieval Workshop

Abstract

Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.
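The final citation-merging stage, which emits each fact once with all supporting source identifiers, might look roughly like the sketch below. The function name, the whitespace-and-case normalization, and the sample claims are all assumptions for illustration; the actual pipeline in the linked repository may match claims quite differently.

```python
from typing import Dict, Iterable, List, Tuple

def merge_citations(claims: Iterable[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Emit each fact once with every source that supports it, in first-seen order."""
    merged: Dict[str, List[str]] = {}
    for fact, source_id in claims:
        # Naive normalization (an assumption): lowercase and collapse whitespace.
        key = " ".join(fact.lower().split())
        sources = merged.setdefault(key, [])
        if source_id not in sources:
            sources.append(source_id)
    return list(merged.items())

# Made-up toy claims with hypothetical video identifiers.
claims = [
    ("The bridge reopened Monday.", "vid_03"),
    ("the bridge  reopened monday.", "vid_11"),
    ("Two lanes remain closed.", "vid_03"),
]
merged = merge_citations(claims)
```

Exact-match normalization only catches trivial duplicates; paraphrased claims would need a semantic comparison, which is presumably where the entailment and screening models named in the abstract come in.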

2605.19074 2026-05-20 cs.CV cs.AI

Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

Sumit Laha, Ankit Sharma, Hassan Foroosh

AI summary: This paper proposes a multi-horizon forecasting framework in which jointly optimizing over several future values helps deep neural networks capture latent inter-step temporal dependencies, improving the accuracy and robustness of photovoltaic power output prediction.

Abstract

The rapid global expansion of solar photovoltaic (PV) capacity, reaching a record 597 GW in 2024, highlights the urgent need for robust forecasting models to mitigate the grid instability caused by the intermittent nature of solar irradiance. While deep learning-based direct forecasting using ground-based sky images (GSI) has emerged as a dominant approach, existing literature is often constrained by single-architecture evaluations and an exclusive focus on single-horizon (point) prediction. This paper proposes a transition from traditional single-horizon estimation toward a multi-horizon forecasting framework, leading to an architecture-independent improvement in accuracy. We hypothesize and demonstrate experimentally that joint optimization over a sequence of future values allows deep neural networks to better capture latent inter-step temporal dependencies by avoiding premature convergence of the network in terms of both weight gradients and filter diversity. Building on this architecture-independent improvement and combining sequential sky imagery with historical PV generation data, we evaluate the models' ability to predict power output at multiple discrete future time steps simultaneously. Our methodology is validated through a comparative analysis across diverse deep learning architectures. The results demonstrate that this multi-horizon approach significantly enhances predictive accuracy and robustness across the entire forecast horizon while keeping computational cost low. By achieving superior performance with negligible overhead compared to single-horizon models, this work provides a scalable and efficient solution to improve the resilience of modern power grids.
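The difference between the single-horizon and multi-horizon objectives can be made concrete with a small sketch. The abstract does not state the loss used, so mean squared error and the function names here are assumptions chosen for illustration.

```python
from typing import Sequence

def multi_horizon_mse(pred: Sequence[Sequence[float]],
                      target: Sequence[Sequence[float]]) -> float:
    """Squared error averaged over every sample and every forecast horizon."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        if len(p_row) != len(t_row):
            raise ValueError("prediction and target must cover the same horizons")
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def single_horizon_mse(pred: Sequence[Sequence[float]],
                       target: Sequence[Sequence[float]], horizon: int) -> float:
    """Point-forecast objective: squared error at one chosen horizon only."""
    return sum((p[horizon] - t[horizon]) ** 2
               for p, t in zip(pred, target)) / len(pred)

# One sample with three future steps (made-up numbers).
pred = [[1.0, 2.0, 3.0]]
target = [[1.0, 2.5, 2.0]]
joint = multi_horizon_mse(pred, target)
first_only = single_horizon_mse(pred, target, horizon=0)
```

In this toy case the first horizon is predicted perfectly, so the single-horizon loss is zero and gives no training signal, while the joint loss still penalizes the errors at later steps.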

2605.19073 2026-05-20 cs.LG cs.AI

Riemannian Networks over Full-Rank Correlation Matrices

Ziheng Chen, Xiaojun Wu, Bernhard Schölkopf, Nicu Sebe

AI summary: This paper studies Riemannian networks over full-rank correlation matrices, extending basic layers to correlation geometries and introducing accurate backpropagation methods, and demonstrates effectiveness in comparisons against existing SPD and Grassmannian networks.

Comments: Accepted to ICML 2026

Abstract

Representations on the Symmetric Positive Definite (SPD) manifold have garnered significant attention across different applications. In contrast, the manifold of full-rank correlation matrices, a normalized alternative to SPD matrices, remains largely underexplored. This paper introduces Riemannian networks over the correlation manifold, leveraging five recently developed correlation geometries. We systematically extend basic layers, including Multinomial Logistic Regression (MLR), Fully Connected (FC), and convolutional layers, to these geometries. In addition, we present methods for accurate backpropagation for two correlation geometries. Experiments comparing our approach against existing SPD and Grassmannian networks demonstrate its effectiveness.

2605.19066 2026-05-20 cs.CL

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Vukosi Marivate

AI summary: This paper examines the tension in low-resource NLP evaluation caused by scarce annotation resources, traces how evaluation practice has evolved over the past decade, and surveys emerging responses along with their equity and validity trade-offs.

Comments: Under Review

Abstract

Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014 to present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses -- including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning -- and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

2605.19063 2026-05-20 cs.LG

Mapping Uncharted Symmetries: Machine Discovery in Combinatorics

Eugenio Cainelli, Lorenzo Luccioli, Alessandro Iraci, Michele D'Adderio, Giovanni Paolini

AI summary: This paper proposes a machine-learning approach to research in combinatorics: by constructing simple mathematical functions that satisfy exact distributional constraints, it discovers a new combinatorial interpretation of the $q,t$-Narayana polynomials and gives a combinatorial proof of their symmetry in a previously unsolved case.

Comments: 20 pages

Abstract

Inspired by long-standing open problems in algebraic combinatorics, we show that modern machine learning can meaningfully contribute to verifiable mathematical discoveries. In particular, we focus on the construction of simple mathematical functions under exact distributional constraints, a setting we formalize as Simple Learning Under Rigid Proportions (SLURP). We tackle this problem by introducing two methods: MapSeek-Functional, which models the desired function by alternating pseudo-labeling and supervised training steps; and MapSeek-Symbolic, which directly produces symbolic formulas. We successfully apply both methods to a research problem in algebraic combinatorics, discovering a new combinatorial interpretation of the $q,t$-Narayana polynomials arising from representation theory. To our knowledge, this is the first such interpretation based on noncrossing partitions. Using one discovered statistic, we find a combinatorial proof of the symmetry of these polynomials in a previously unsolved case. To streamline verification and reproducibility, we release all code, including a Lean 4 formalization of all the mathematical discoveries in this paper.

2605.19060 2026-05-20 cs.CV cs.AI eess.IV

LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

Xinhe Zhang, Yuyang Zhang, Pengfei Jin, Arnau Marin-Llobet, Na Li, Quanzheng Li

AI summary: This paper proposes LiFT, a framework that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning, addressing both the high computational cost of volumetric models and the failure of 2D slice generators to preserve anatomical consistency across the third dimension in high-resolution 3D medical image generation.

Abstract

High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional $z$-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at $\sim 135\times$ lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

2605.19050 2026-05-20 cs.LG physics.chem-ph q-bio.QM

Generative Pseudo-Force Fields for Molecular Generation

Stefaan Simon Pierre Hessmann, Khaled Kahouli, Stefan Gugler, Michael Plainer, Frank Noé, Klaus-Robert Müller, Niklas Wolf Andreas Gebauer

AI summary: This paper proposes generative pseudo-force fields (GPFFs) to resolve the trade-off in molecular generation between energy-based relaxation and the sampling efficiency of data-driven generative models; an MLFF is trained on a quadratic pseudo-potential energy surface defined relative to reference equilibrium structures, enabling efficient and stable conformation generation.

Abstract

Generating stable molecular conformations typically forces a tradeoff between the physical realism of energy-based relaxation and the sampling efficiency of data-driven generative models. While machine learning force fields (MLFFs) can sample stable conformations by relaxing molecular geometries according to physical forces, they require costly ab-initio training data. Conversely, diffusion models (DMs) learn from equilibrium data alone but depend on noise schedules and time-step conditioning. In this work, we propose generative pseudo-force fields (GPFFs) to bridge these paradigms by training an MLFF on a quadratic pseudo-potential energy surface relative to reference equilibrium structures. Because no ab-initio calculations are required for the perturbed geometries, non-equilibrium training data can be generated on the fly by perturbing the equilibria with Gaussian noise. We show that GPFFs constitute a time-step-agnostic variant of variance-exploding DMs: the score comes from the predicted pseudo-forces, but because force magnitudes implicitly encode the noise level, no time-step conditioning is needed. Our GPFF can hence be used as a drop-in replacement in standard diffusion sampling (ancestral, Heun), and it also facilitates more efficient, adaptive variants and an MLFF-inspired direct denoising scheme. Our proposed sampling algorithms support arbitrary structural priors and geometric constraints. On QM9, GPFF reaches 100% validity at 256 neural function evaluations (NFE) and over 50% at just 6 NFE, outperforming diffusion baselines across all samplers. Combined with custom priors, we showcase the fast and accurate generation process of our method in a molecular editor for a drug design setting, where a molecule is generated in real time.
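The on-the-fly training data described in this abstract can be sketched directly from the definition of a quadratic pseudo-potential. The sketch assumes the simplest form, E(x) = 0.5 * |x - x_eq|^2 with unit stiffness, so the pseudo-force is x_eq - x; the paper's actual scaling, units, and function names are not given in the abstract, and the geometry below is only a rough water-like placeholder.

```python
import math
import random
from typing import List, Tuple

Coords = List[Tuple[float, ...]]

def pseudo_force_sample(equilibrium: Coords, sigma: float,
                        rng: random.Random) -> Tuple[Coords, Coords]:
    """Perturb an equilibrium geometry; return (noisy coords, pseudo-forces)."""
    noisy, forces = [], []
    for atom in equilibrium:
        pos = tuple(c + sigma * rng.gauss(0.0, 1.0) for c in atom)
        noisy.append(pos)
        # F = -dE/dx for E = 0.5 * |x - x_eq|^2: a pull back toward equilibrium.
        forces.append(tuple(e - p for e, p in zip(atom, pos)))
    return noisy, forces

def rms(vectors: Coords) -> float:
    """Root-mean-square over all vector components."""
    comps = [c for v in vectors for c in v]
    return math.sqrt(sum(c * c for c in comps) / len(comps))

# Placeholder three-atom geometry (illustrative coordinates, not reference data).
equilibrium = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
noisy, forces = pseudo_force_sample(equilibrium, 0.1, random.Random(0))
_, forces_2x = pseudo_force_sample(equilibrium, 0.2, random.Random(0))
```

Because the label equals the negated noise displacement, doubling `sigma` doubles the force magnitude. This is the sense in which force magnitudes encode the noise level, which is why no time-step input is needed.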

2605.19049 2026-05-20 cs.LG cs.AI

KVBuffer: IO-aware Serving for Linear Attention

Longwei Zou, Lin Zhong

AI summary: This paper proposes KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, it lets serving systems compute linear attention outputs more flexibly and efficiently, reducing memory access and decoding latency.

Abstract

Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For speculative decoding, KVBuffer verifies draft tokens in parallel and avoids storing temporary states. For short contexts, KVBuffer computes attention outputs directly from buffered keys and values, without creating or updating the linear attention state. We implement KVBuffer in SGLang for Qwen3-Next. Our evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and increase the maximum number of serving requests by 5x for speculative decoding when verifying four draft tokens.
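The deferred-update idea in this abstract can be checked on a toy model. The sketch below uses plain additive linear attention (state S accumulates outer products k v^T, output is q S) with no normalization, gating, or decay. That is a simplifying assumption: it does not reproduce the attention variant used in Qwen3-Next or KVBuffer's actual kernels, and the function names are hypothetical.

```python
import random
from typing import List

Vec = List[float]

def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))

def recurrent_decode(qs: List[Vec], ks: List[Vec], vs: List[Vec]) -> List[Vec]:
    """Baseline: update the d_k x d_v state on every step, then read q_t S."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        outs.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outs

def buffered_decode(qs: List[Vec], ks: List[Vec], vs: List[Vec],
                    chunk: int) -> List[Vec]:
    """Defer state updates: read the stale state plus the buffered keys/values."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    buf_k: List[Vec] = []
    buf_v: List[Vec] = []
    outs = []
    for q, k, v in zip(qs, ks, vs):
        buf_k.append(k)
        buf_v.append(v)
        out = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
        for bk, bv in zip(buf_k, buf_v):
            w = dot(q, bk)                      # weight on one buffered token
            out = [o + w * x for o, x in zip(out, bv)]
        outs.append(out)
        if len(buf_k) == chunk:                 # apply deferred updates in one batch
            for bk, bv in zip(buf_k, buf_v):
                for i in range(d_k):
                    for j in range(d_v):
                        S[i][j] += bk[i] * bv[j]
            buf_k, buf_v = [], []
    return outs

rng = random.Random(1)
qs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]
ks = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]
vs = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(7)]
baseline = recurrent_decode(qs, ks, vs)
buffered = buffered_decode(qs, ks, vs, chunk=4)
```

Both functions return the same outputs up to floating-point rounding. The buffered version writes the large state once per chunk instead of once per token, which is the memory-traffic saving the abstract describes.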

2605.19042 2026-05-20 cs.AI

Interference-Aware Multi-Task Unlearning

Ying-Hua Huang, Rui Fang, Hsi-Wen Chen, Ming-Syan Chen

AI summary: This paper proposes an interference-aware multi-task unlearning method that uses task-aware gradient projection and instance-level gradient orthogonalization to address the task-level and instance-level interference caused by shared parameters; experiments on multi-task computer vision benchmarks show reduced unlearning interference.

Abstract

Machine unlearning aims to remove the contribution of designated training data from a trained model while preserving performance on the remaining data. Existing work mainly focuses on single-task settings, whereas modern models often operate in multi-task setups with shared backbones, where removing supervision for one task or instance can unintentionally affect others. We introduce multi-task unlearning with two settings: full-task unlearning, which removes a target instance from all tasks, and partial-task unlearning, which removes supervision only from selected tasks. We show that shared parameters couple the forget and retain sets, causing task-level interference on non-target tasks and instance-level interference on other instances. To address this issue, we propose an interference-aware framework that combines task-aware gradient projection, which constrains updates within task-specific subspaces, with instance-level gradient orthogonalization, which reduces conflicts between forget and retain signals. Experiments on two multi-task computer vision benchmarks across five tasks show that our method achieves effective unlearning while maintaining strong generalization, reducing UIS compared with the strongest baseline by 30.3% in full-task unlearning and 52.9% in partial-task unlearning.
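Gradient orthogonalization between forget and retain signals can be illustrated with a standard vector projection. This is a generic sketch under an assumption: the abstract does not specify the exact operator, so the function below simply removes the component of the forget gradient that lies along the retain gradient, and its name is hypothetical.

```python
from typing import List

def orthogonalize(forget_grad: List[float], retain_grad: List[float]) -> List[float]:
    """Remove from forget_grad its component along retain_grad."""
    norm_sq = sum(r * r for r in retain_grad)
    if norm_sq == 0.0:
        return list(forget_grad)        # nothing to project against
    scale = sum(f * r for f, r in zip(forget_grad, retain_grad)) / norm_sq
    return [f - scale * r for f, r in zip(forget_grad, retain_grad)]

# Toy two-parameter example: the forget update overlaps the retain direction.
forget = [1.0, 1.0]
retain = [1.0, 0.0]
update = orthogonalize(forget, retain)  # [0.0, 1.0]
```

After projection the update has zero inner product with the retain gradient, so to first order it leaves the retain loss unchanged while still moving along the remaining forget direction.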

2605.19038 2026-05-20 cs.RO cs.LG

Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

用时空逻辑引导神经符号场景生成

Lorenzo Bonin, Francesco Giacomarra, Luca Bortolussi, Jyotirmoy V. Deshmukh, Francesca Cairoli

AI总结 本研究提出STRELGen框架,结合扩散模型和时空逻辑规范,高效生成安全关键的多智能体驾驶场景,提升自动驾驶系统的鲁棒性验证能力。

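Details
AI Chinese abstract (translated to English)

Progress in autonomous driving technology has far outpaced progress in safety evaluation methods. Conventional testing relies on exposing autonomous driving systems to large numbers of real traffic scenes, a brute-force approach that is costly and statistically ineffective at capturing rare, safety-critical edge cases. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety-critical driving scenarios. STRELGen combines a multi-agent trajectory-generation diffusion model (DM) with Spatio-Temporal Logic (STREL) specifications, which encode complex safety and realism properties in a highly interpretable formalism. Crucially, monitoring the degree to which these specifications are satisfied is differentiable, which permits gradient-based search. At inference time, we optimize directly over the DM's latent space to maximize satisfaction of the STREL formula. The result is efficient generation of highly plausible, safety-critical multi-agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress-testing autonomous driving systems, going beyond the limits of brute-force data collection.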

English abstract

The rapid advancement of autonomous driving (AD) technologies has outpaced the development of robust safety evaluation methods. Conventional testing relies on exposing AD systems to vast numbers of real-world traffic scenes -- a brute-force approach that is prohibitively expensive and statistically ineffective at capturing the rare, safety-critical edge cases essential for validating real-world robustness. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety-critical driving scenarios. STRELGen synergistically combines a multi-agent trajectory-generation diffusion model (DM) with Spatio-Temporal Logic (STREL) specifications that encode complex safety and realism properties through a highly interpretable formalism. Crucially, monitoring satisfaction levels of these specifications is differentiable, enabling gradient-based search. At inference time, we optimize directly over the DM latent space to maximize STREL formula satisfaction. The result is efficient generation of highly plausible yet safety-critical multi-agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress-testing autonomous driving systems, moving beyond the limitations of brute-force data collection.
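
The idea of gradient-based search over a latent space, driven by a differentiable satisfaction score, can be sketched with a toy example. Everything below is an assumption made for illustration and not STRELGen itself: `decode` is a hypothetical stand-in for a generative model, the property "eventually the gap falls below a critical value" covers only a temporal operator (no spatial ones), finite differences replace automatic differentiation, and the quadratic penalty is a crude substitute for staying close to the learned data distribution.

```python
import math


def soft_max(values, temperature=0.1):
    """Smooth approximation of max via log-sum-exp (always >= the true max)."""
    m = max(values)
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)


def decode(latent, steps=10):
    """Hypothetical decoder: latent -> gap in metres between two vehicles over time."""
    offset, closing_speed = latent
    return [20.0 + offset - closing_speed * t for t in range(steps)]


def robustness(latent, critical_gap=2.0):
    """Smooth score for 'eventually gap < critical_gap'; positive means satisfied."""
    return soft_max([critical_gap - gap for gap in decode(latent)])


def objective(latent, prior_weight=0.5):
    """Robustness minus a penalty that keeps the latent near the origin (assumed prior)."""
    return robustness(latent) - prior_weight * sum(z * z for z in latent)


def numeric_grad(f, x, h=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def optimize_latent(latent, lr=0.05, iters=200):
    """Plain gradient ascent on the objective, starting from the given latent."""
    latent = list(latent)
    for _ in range(iters):
        grad = numeric_grad(objective, latent)
        latent = [z + lr * g for z, g in zip(latent, grad)]
    return latent


start = [0.0, 0.0]
found = optimize_latent(start)
print(robustness(start), robustness(found))
```

The starting latent decodes to a constant 20 m gap, so the property is violated and the score is negative. Gradient ascent moves the latent toward a scenario in which the gap closes, which makes the score positive, while the penalty keeps the latent from drifting without bound.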

2605.19035 2026-05-20 cs.AI

Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

Trustworthy Agent Network: In Agent Networks, Trust Must Be Built In, Not Added After the Fact

Yixiang Yao, Yuhang Yao, Xinyi Fan, Jiechao Gao, Jie Wang, Minjia Zhang, Srivatsan Ravi, Carlee Joe-Wong

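AI summary This paper argues that trust in agent networks must be built in rather than added after the fact, and proposes a comprehensive conceptual framework that establishes trust in agent networks through four design pillars.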

Comments Accepted by SIGKDD 2026 Blue Sky Ideas Track

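Details
AI Chinese abstract (translated to English)

The rapid development of large language models has given rise to autonomous LLM agents capable of complex reasoning and execution. As these agents move from isolated operation to collaborative ecosystems, we are seeing the emergence of Agent-to-Agent (A2A) networks, a paradigm in which heterogeneous agents coordinate autonomously to solve multi-step tasks. Although these networks may outperform a single agent completing the entire task, they introduce systemic vulnerabilities, such as adversarial composition, semantic misalignment, and cascading operational failures, that existing agent alignment techniques cannot address. In this vision paper, we argue that the trustworthiness of A2A networks cannot be fully guaranteed by retrofitting existing protocols, most of which were designed for individual agents. Instead, it must be architected into the A2A coordination framework from the very start. We propose a comprehensive conceptual framework that establishes trust in A2A systems through four design pillars.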

English abstract

The rapid advancement of Large Language Models has given rise to autonomous LLM-based agents capable of complex reasoning and execution. As these agents transition from isolated operation to collaborative ecosystems, we witness the emergence of the Agent-to-Agent (A2A) network, a paradigm where heterogeneous agents autonomously coordinate to solve multi-step tasks. While these networks may offer better task performance compared to simply using one agent to complete the entire task, they introduce systemic vulnerabilities, such as adversarial composition, semantic misalignment, and cascading operational failures, that existing agent alignment techniques cannot address. In this vision paper, we argue that the trustworthiness of A2A networks cannot be fully guaranteed via retrofitting on existing protocols that are largely designed for individual agents. Rather, it must be architected from the very beginning of the A2A coordination framework. We present a comprehensive conceptual framework that situates trust in A2A systems through four design pillars.