arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22567 2026-05-22 cs.CL

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Jingbo Zhu, Tong Xiao

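AI summary: This paper proposes LANG, a framework that uses language-conditioned hints to guide exploration in non-English reasoning tasks. It addresses the trade-off that reinforcement learning faces in multilingual settings between input-language consistency and reasoning quality, improving reasoning performance without harming language consistency.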

Comments
Accepted to ACL 2026 (main conference)
Abstract

Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.
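
The abstract does not specify the decay schedule or the switch, so the following is only a minimal sketch of how a progressive hint decay with per-language horizons might look. The linear decay, the function names `hint_ratio` and `truncate_hint`, and the `HORIZONS` values are all hypothetical and are not taken from the paper.

```python
def hint_ratio(step: int, horizon: int, start: float = 1.0) -> float:
    """Fraction of the hint still shown at `step`; decays linearly to 0 at `horizon`."""
    if horizon <= 0:
        return 0.0
    return max(0.0, start * (1.0 - step / horizon))


# Hypothetical per-language horizons: a harder language keeps its hints longer.
HORIZONS = {"fr": 100, "sw": 400}


def truncate_hint(hint_tokens: list[str], lang: str, step: int) -> list[str]:
    """Keep only the leading part of the hint allowed by the current ratio."""
    keep = int(len(hint_tokens) * hint_ratio(step, HORIZONS[lang]))
    return hint_tokens[:keep]
```

Under these assumed horizons, the same training step withdraws more of the hint for the easier language than for the harder one.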

2605.22566 2026-05-22 cs.LG

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

Ao Li, Shangpeng Yang, Fahao Chen, Tianheng Xu, Peng Li, Zhou Su

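AI summary: This paper proposes GraphFlow, a graph-based workflow management approach that dynamically generates task-specific workflows from a unified graph structure, wGraph, to improve the efficiency and performance of LLM-agent serving. Experiments show strong results on multiple benchmark datasets, with significant performance gains and a reduced memory footprint.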

Comments
Accepted to ICML 2026
Abstract

Large Language Model (LLM)-based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow-assisted agent serving systems typically rely on predefined templates and shallow matching mechanisms, which limit their ability to capture deep semantic relationships and generalize to previously unseen tasks. To address these limitations, we propose a new workflow management paradigm that represents workflows using a unified graph, termed wGraph, where each node corresponds to an atomic operation. wGraph serves as a shared substrate from which task-specific workflows are dynamically instantiated. Building on wGraph primitives, we introduce GraphFlow, a system that efficiently integrates workflows into agent serving through two key designs. First, adaptive workflow generation dynamically constructs workflows from wGraph based on task semantics and constraint requirements. Second, workflow state management exploits wGraph structure to efficiently manage Key-Value (KV) caches, reducing redundant computation during agent serving. Extensive experiments across five benchmark datasets show that GraphFlow consistently outperforms state-of-the-art methods, yielding an average performance improvement of approximately 4.95 percentage points, while achieving an approximately 4$\times$ reduction in memory footprint.

2605.22564 2026-05-22 cs.CL cs.LG cs.SE

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti

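AI summary: This paper proposes SynAE, a framework for assessing the quality of synthetic data for multi-turn tool-calling agents. It evaluates validity, fidelity, and diversity across four metric categories and shows that no single metric is sufficient to fully characterize synthetic data quality.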

Abstract

Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

2605.22563 2026-05-22 cs.CV

Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain

Francesco Benedetto, Roberto Basla, Luca Magri, Giacomo Boracchi

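AI summary: This work proposes a new framework for generating cell phantom videos in the Elliptical Fourier Descriptor (EFD) domain. Representing phantom evolution as a multivariate time series of EFD coefficients introduces a strong prior and enables efficient generation of temporally coherent videos. The experiments support that modeling temporal evolution in EFD space yields biologically plausible phantom videos, offering a way to generate synthetic annotated data and reduce annotation effort.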

Comments
6 pages, Accepted at the International Conference on Image Processing (ICIP) 2026
Abstract

Training Deep Neural Networks for tracking individual cells in biomedical videos requires a large amount of annotated data. The annotation of videos for cell tracking is very time consuming and often requires domain expertise; this explains the limited availability of public annotated data to address important medical problems like tissue repair or cancer treatment. Generating synthetic videos along with their Ground Truth annotations is a promising solution that relies, as a foundational first step, on the synthesis of single cell annotations (or phantoms). Phantoms need to be time consistent, as they have to replicate biological processes that are specific to the cell types. In this work, we propose a novel framework for generating videos of cell phantoms in the Elliptical Fourier Descriptors (EFDs) domain, a compact and geometrically interpretable representation for 2D closed contours. We represent the cell phantom evolution as a multivariate time series of EFD coefficients, introducing a strong prior for cell morphology and enabling the efficient generation of sequences that evolve coherently in time. Our experimental validation proves that modelling the temporal evolution in EFD space enables the generation of biologically plausible phantom videos. Our method can be used in generative pipelines for synthesizing annotated data for cell tracking, thus strongly mitigating the annotation effort for creating new datasets. Our code is available for download here: https://github.com/FrancescoBenedetto99/efd-cell-video-gen.
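
To make the EFD representation concrete, here is a minimal sketch of the standard reconstruction of a closed 2D contour from elliptical Fourier coefficients. It is not the authors' code (see their repository for that); the function name `efd_to_contour` and the coefficient layout `(a_n, b_n, c_n, d_n)` per harmonic are assumptions chosen for illustration.

```python
import numpy as np


def efd_to_contour(coeffs, center=(0.0, 0.0), num_points=100):
    """Rebuild a closed contour from EFD coefficients.

    coeffs: array of shape (N, 4), one row (a_n, b_n, c_n, d_n) per harmonic n = 1..N.
    Returns an array of shape (num_points, 2) with (x, y) samples.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    t = np.linspace(0.0, 1.0, num_points, endpoint=False)
    n = np.arange(1, coeffs.shape[0] + 1)[:, None]
    cos_nt = np.cos(2.0 * np.pi * n * t)  # shape (N, num_points)
    sin_nt = np.sin(2.0 * np.pi * n * t)
    x = center[0] + coeffs[:, 0] @ cos_nt + coeffs[:, 1] @ sin_nt
    y = center[1] + coeffs[:, 2] @ cos_nt + coeffs[:, 3] @ sin_nt
    return np.stack([x, y], axis=1)


# A toy "video": one harmonic whose semi-axes change over three frames.
frames = [efd_to_contour([[2.0 + 0.1 * k, 0.0, 0.0, 1.0]]) for k in range(3)]
```

In this view a phantom video is a sequence of coefficient arrays, so a time-series model only has to predict a few numbers per frame, and every prediction decodes to a closed contour.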

2605.22561 2026-05-22 cs.LG

Regret-Based $(ε,δ)$-optimal Stopping Criteria for Bayesian Optimization

Haowei Wang, Jingyi Wang, Qiyu Wei

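AI summary: This paper proposes a stopping criterion based on a tighter instantaneous regret bound for the Gaussian process upper confidence bound (GP-UCB) algorithm, ensuring an $ε$-optimal solution with high probability $1-δ$ upon termination, and validates its effectiveness through numerical experiments.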

Comments
21 pages
Abstract

Bayesian optimization (BO) is a widely used iterative black-box optimization method that utilizes Gaussian process (GP) surrogate models. In practice, BO is typically terminated after a fixed evaluation budget is exhausted, which can incur unnecessary cost and provides no optimality guarantee on solution quality. Recent research on practical stopping criteria has made empirical progress, yet a theoretically sound stopping criterion remains a work in progress. In this work, we present provably tighter instantaneous regret bounds for GP upper confidence bound (GP-UCB) at any given iteration. Then, we propose stopping criteria for GP-UCB based on this tighter bound that ensure an $ε$-optimal solution with high probability $1-δ$ upon termination. Numerical experiments are performed to validate and demonstrate the effectiveness and efficiency of our stopping criteria.
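
The idea of a regret-based stopping rule can be sketched with a generic confidence-bound gap test. This is not the tighter bound proposed in the paper. It assumes a finite candidate set, a GP posterior mean `mu` and standard deviation `sigma` on that set, and a confidence parameter `beta` chosen so that the bounds hold jointly with probability at least $1-δ$; the function name `should_stop` is hypothetical.

```python
import numpy as np


def should_stop(mu, sigma, evaluated_idx, beta, eps):
    """Stop when the best evaluated point is provably within eps of the optimum.

    If f lies inside [mu - sqrt(beta)*sigma, mu + sqrt(beta)*sigma] everywhere,
    then f(x*) <= max UCB and f(best evaluated) >= max LCB over evaluated points,
    so the simple regret is at most their difference.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    width = np.sqrt(beta) * sigma
    ucb = mu + width
    lcb = mu - width
    gap = float(ucb.max() - lcb[evaluated_idx].max())
    return gap <= eps, gap
```

The rule terminates as soon as the upper bound on the simple regret drops to `eps`, rather than after a fixed evaluation budget.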

2605.22558 2026-05-22 cs.CV

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang

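AI summary: This paper proposes GeoWeaver, a framework that grounds visual tokens with geometric evidence before scene reasoning, improving spatial reasoning while preserving multimodal capabilities.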

Abstract

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver.

2605.22556 2026-05-22 cs.LG

ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation

Haoan Feng, Xin Xu, Leila De Floriani

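AI summary: This paper proposes ImplicitTerrainV2, which combines a spectral control mechanism, wavelet-guided spatial adaptivity, derivative-aware supervision, and post-training model compression to obtain a compact, efficient neural terrain data format, improving the accuracy and efficiency of terrain analysis.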

Comments
14 pages, 8 figures
Abstract

Digital elevation models (DEMs) underpin terrain analysis in Geographic Information Systems (GIS), but in their common raster form, they rely on interpolation for off-grid sampling and finite-difference operators for derivative-based analysis. Implicit neural representations (INRs) offer a continuous alternative, but prior terrain INRs lack explicit frequency control, neglect the gradient structure of terrain, and remain too large and costly to train for practical deployment. We present ImplicitTerrainV2, which advances terrain INRs toward a compact, efficient neural terrain data format by combining a spectral control mechanism with wavelet-guided spatial adaptivity, derivative-aware supervision, and post-training model compression. At its core, a wavelet complexity field (WCF) derives spatially-adaptive frequency masks from analytically computed wavelet coefficients, localizing high-frequency capacity to complex terrain regions. The same field guides complexity-aware adaptive sampling that concentrates training in high-complexity regions, while gradient matching applies extra supervision to enforce the smooth manifold structure of terrain DEMs for improved derivative fidelity. Post-training mixed-precision quantization and entropy coding reduce storage to 1.23 bpp with a 0.28 dB PSNR drop. On 50 Swiss terrain tiles, ImplicitTerrainV2 reaches 66.25 dB end-to-end PSNR, improving over the prior work by 5.70 dB while using 3.2x fewer parameters and training in 55 s per tile on a single GPU. Our compressed neural format is competitive with several established DEM codecs in rate-distortion performance, while additionally supporting off-grid point queries, closed-form derivative evaluation, and resolution-independent reconstruction, which may benefit many downstream GIS applications.

2605.22552 2026-05-22 cs.CV cs.MM

FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

Haokun Wen, Xuemeng Song, Xinghao Xie, Xiaolin Chen, Xiangyu Zhao, Weili Guan

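AI summary: This paper proposes FashionLens, a framework that achieves versatile fashion image retrieval through task-adaptive learning, addressing the inability of existing methods to handle diverse retrieval needs.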

Abstract

Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
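
The query calibrator is described as using adaptive spherical linear interpolation. The sketch below shows only the plain slerp operation between two embeddings, not the proposal-guided calibrator itself; how the interpolation weight `t` is chosen adaptively is not given in the abstract and is left as an input here. The function name `slerp` and the fallback for degenerate inputs are assumptions.

```python
import numpy as np


def slerp(q, p, t, eps=1e-8):
    """Spherical linear interpolation between the directions of q and p.

    Both vectors are L2-normalized first, so the result stays on the unit sphere.
    Falls back to q when the vectors are (anti)parallel and the arc is ill-defined.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    q = q / np.linalg.norm(q)
    p = p / np.linalg.norm(p)
    omega = np.arccos(np.clip(np.dot(q, p), -1.0, 1.0))  # angle between q and p
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        return q
    return (np.sin((1.0 - t) * omega) * q + np.sin(t * omega) * p) / sin_omega
```

Unlike a straight linear blend, the interpolated vector keeps unit norm, which matters when retrieval scores are cosine similarities.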

2605.22550 2026-05-22 cs.CV

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

Varun A. Paturkar, Shankar Gangisetty, C. V. Jawahar

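AI summary: This paper presents the MOTOR dataset for studying two-wheeler rider behavior in dense, unstructured traffic. By fusing multi-view, multimodal data, it provides a new research foundation for driving assistance systems.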

Abstract

Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 1,629 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code are available at https://varuniiith.github.io/MOTOR-Dataset/.

2605.22549 2026-05-22 stat.ML cs.LG

A Martingale Kernel Independence Test

Felix Laumann, Zhaolu Liu, Mauricio Barahona

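AI summary: This paper proposes two studentised statistics that, through self-normalisation and a half-sample split, enable independence testing without permutation calibration, substantially reducing computational cost while matching the type-I error and power of permutation-calibrated baselines.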

Abstract

The Hilbert-Schmidt Independence Criterion (HSIC) and its joint-independence extension $d\mathrm{HSIC}$ are degenerate $V$-statistics whose data-dependent weighted-$χ^2$ null limits force a permutation calibration that multiplies the per-test cost by the number of permutations, in practice two orders of magnitude. Adapting the recent martingale MMD construction for two-sample testing to the (joint) independence problem, we introduce two studentised statistics whose null distributions are standard normal regardless of the data law, so that a single normal-quantile lookup replaces the permutation step entirely. The first, $m\mathrm{HSIC}$, is a self-normalised lower-triangular sum of the Hadamard product of two empirically centred Gram matrices. Under independence and bounded-fourth-moment kernels it converges to a standard normal. It is consistent against every fixed alternative, and runs at quadratic cost in the sample size without any sample split, matching the biased HSIC $V$-statistic. Our second statistic, $md\mathrm{HSIC}$, achieves finite-sample consistency with a single half-sample split: the centring is estimated on one half and the lower-triangular self-normalised martingale is run on the other, shrinking the conditional-mean residual to a quantity that is exponentially small in $d$, so the statistic is asymptotically standard normal at every fixed number of jointly tested variables, with a per-test cost that grows only linearly in $d$. On synthetic data with per-variable input dimension from $1$ to $500$ and between $2$ and $10$ jointly tested variables, both statistics match the empirical type-I error rate and test power of permutation-calibrated baselines while running $25$ to $60\times$ faster.
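
For reference, the biased HSIC $V$-statistic that the abstract uses as its cost baseline can be written as the full sum of the Hadamard product of two centred Gram matrices, divided by $n^2$. The sketch below implements only that classical statistic with Gaussian kernels; it is not $m\mathrm{HSIC}$ or $md\mathrm{HSIC}$, whose lower-triangular self-normalisation is not specified in enough detail here. The function names and the fixed bandwidth are assumptions.

```python
import numpy as np


def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel on the rows of x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def biased_hsic(x, y, bandwidth=1.0):
    """Biased HSIC V-statistic: sum(Kc * Lc) / n^2 = trace(K H L H) / n^2."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    Kc = H @ gaussian_gram(x, bandwidth) @ H
    Lc = H @ gaussian_gram(y, bandwidth) @ H
    return float(np.sum(Kc * Lc) / n ** 2)
```

Calibrating this value normally requires recomputing it on many permuted copies of `y`, which is the step the proposed statistics replace with a single normal-quantile lookup.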

2605.22544 2026-05-22 cs.CL cs.IR

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Yevhen Kostiuk, Kenneth Enevoldsen

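AI summary: This paper studies the shortcomings of single-prompt evaluation for instruction-tuned embedding models. It finds that the default prompt can systematically understate or overstate performance and that leaderboard rankings are not robust to prompt selection, and it recommends improving benchmarks through multi-prompt evaluation or sensitivity reporting.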

Abstract

Instruction embedding models have become common among state-of-the-art models, yet they are typically evaluated using a single prompt per task. This single-point evaluation ignores a main problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, for a total of 990 cases. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
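
The ranking claim can be illustrated with a small check over per-prompt scores. This is a sketch under one possible reading, in which prompts are picked independently for each model so that a model "can top the leaderboard" if its best prompt beats every other model's worst prompt; the paper's exact procedure may differ. The function name and the toy scores are made up for illustration.

```python
def can_top_leaderboard(scores: dict[str, list[float]]) -> dict[str, bool]:
    """For each model, check whether a favorable prompt choice puts it in first place."""
    result = {}
    for model, vals in scores.items():
        others_worst = [min(v) for m, v in scores.items() if m != model]
        result[model] = all(max(vals) > worst for worst in others_worst)
    return result


# Study size stated in the abstract: models x datasets x prompts per dataset.
NUM_CASES = 6 * 11 * 15

# Toy per-prompt scores (invented numbers, not results from the paper).
toy_scores = {"A": [0.60, 0.70], "B": [0.65, 0.72], "C": [0.50, 0.66]}
```

With overlapping score ranges like the toy ones above, every model can be made to rank first, which is the kind of fragility a single point estimate hides.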

2605.22540 2026-05-22 cs.CE cs.AI

Dynamic Hypergraph Representation Learning for Multivariate Time Series without Prior Knowledge

Marco Gregnanin, Johannes De Smedt, Giorgio Gnecco, Maurizio Parton

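AI summary: This paper proposes a dynamic hypergraph representation learning method for multivariate time series that requires no prior knowledge. It builds hypergraphs through community detection and an attention mechanism, then uses a Dynamic Hypergraph Attention Convolution Network for prediction.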

Abstract

Hypergraphs have the capacity to capture higher-dimensional relationships among entities across various domains, making them a subject of growing interest within the research community for understanding the structure and dynamics of complex systems. However, a key challenge is the derivation of hypergraph representations from time series data in situations where the structure of the hypergraph is limited or absent. In this study, we propose a model that constructs a dynamic hypergraph representation for multivariate time series without relying on prior knowledge of the data. This is achieved by applying community detection to the time series and transforming the resulting communities, obtained through an attention mechanism, into a hypergraph using a clique-based technique. Hypergraph representations are derived from different time series datasets, and the resulting hypergraphs are then used by a Dynamic Hypergraph Attention Convolution Network (DHACN) for multivariate time series predictions. This research advances the field of hypergraph representation by introducing a novel approach that is better suited to uncover high-order relationships without prior knowledge.

2605.22538 2026-05-22 cs.CV

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou

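AI summary: This paper proposes SAMOSA, a framework that explicitly exploits motion, geometry, and semantic cues to improve SAM 2 on complex nonlinear visual object tracking, yielding a more robust and generalizable tracker.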

Abstract

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.

2605.22537 2026-05-22 cs.LG

F-TIS: Harnessing Diverse Models in Collaborative GRPO

Nikolay Blagoev, Oğuzhan Ersoy, Wendelin Boehmer, Lydia Yiyu Chen

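AI summary: This paper proposes F-TIS, which uses heterogeneous models in collaborative GRPO training to improve the local model's learning. It achieves communication-efficient training with final model convergence matching on-policy training, and in some settings improves generalization on out-of-distribution tasks.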

Comments
Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)
Abstract

Reinforcement learning methods such as GRPO have seen great popularity in LLM post-training. In GRPO, models produce completions to a set of prompts, which are rewarded, and the policy is updated towards the relatively high-reward completions. Due to the auto-regressive nature of models, the generation phase of this style of training can be extremely time consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models in training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with varying compute resources and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models, that is, different models collaborating on the same tasks. However, this leads to highly off-policy samples during training, which prior work has identified as harmful to GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve a local model's learning. Our framework allows various models to collaborate in the same RL training run while being communication efficient. We extensively evaluate F-TIS in various heterogeneous setups and we show that it exhibits identical final model convergence to purely on-sample training. Furthermore, we observe in some setups better generalization on out-of-distribution tasks than on-policy training, increasing the model's performance by up to 12\%.
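
The two ingredients named in the abstract can be sketched generically: group-relative advantages as in GRPO, and importance weights that are truncated from above. The filtering rule used by F-TIS is not described in the abstract, so the `min_ratio` filter below is a hypothetical stand-in, as are the function names and the default `cap`.

```python
import numpy as np


def group_advantages(rewards):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)


def filtered_truncated_weights(logp_local, logp_behavior, cap=2.0, min_ratio=0.01):
    """Importance ratio pi_local / pi_behavior, capped at `cap`.

    Samples whose ratio falls below `min_ratio` get weight 0 (hypothetical filter:
    the local model considers them too unlikely to learn from).
    """
    diff = np.asarray(logp_local, dtype=float) - np.asarray(logp_behavior, dtype=float)
    ratio = np.exp(diff)
    return np.where(ratio >= min_ratio, np.minimum(ratio, cap), 0.0)
```

Capping bounds the variance contributed by completions from very different peer models, at the price of some bias in the gradient estimate.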

2605.22536 2026-05-22 cs.CV cs.CL

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong

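AI summary: This paper introduces SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. A physically grounded degradation synthesis engine generates nine degradation types, which are used to evaluate the spatial reasoning of multimodal large language models under visual degradation. Fine-tuning under degraded conditions is shown to improve robustness.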

英文摘要

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

2605.22535 2026-05-22 cs.AI

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

TerminalWorld: 在真实世界终端任务上评估智能体的基准测试

Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye

AI总结 本文提出TerminalWorld,一个可扩展的数据引擎,能够自动从真实世界终端记录中反向工程高保真的评估任务。通过处理80,870条终端记录,生成1,530个经过验证的任务,涵盖18个真实世界类别,从短日常操作到超过50步的工作流,覆盖1,280个唯一命令。从中精选出200个代表性任务作为Verified子集。在八个前沿模型和六个智能体上全面评估发现,当前系统仍难以处理真实终端工作流,最高通过率为62.5%。此外,TerminalWorld捕捉到与现有专家整理的基准(如Terminal-Bench)不同的真实终端能力,仅与它们的分数有弱相关性(Pearson r=0.20)。自动化引擎使TerminalWorld本身具有真实性和可扩展性,使其能够评估智能体在真实终端环境中随着开发者实践的发展而变化。数据和代码可在https://github.com/EuniAI/TerminalWorld获取。

详情
AI中文摘要

我们介绍了TerminalWorld,一个可扩展的数据引擎,能够自动从'现实世界'终端记录中反向工程高保真的评估任务。处理80,870条终端记录,该引擎生成1,530个经过验证的任务,涵盖18个真实世界类别,从短日常操作到超过50步的工作流,覆盖1,280个唯一命令。从中我们精选出200个代表性、人工审核的任务作为Verified子集。在八个前沿模型和六个智能体上对TerminalWorld-Verified进行全面评估发现,当前系统仍难以处理真实终端工作流,最高通过率为仅62.5%。此外,TerminalWorld捕捉到与现有专家整理的基准(如Terminal-Bench)不同的真实终端能力,仅与它们的分数有弱相关性(Pearson r=0.20)。自动化引擎使TerminalWorld本身具有真实性和可扩展性,使其能够评估智能体在真实终端环境中随着开发者实践的发展而变化。数据和代码可在https://github.com/EuniAI/TerminalWorld获取。

英文摘要

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.

2605.22531 2026-05-22 cs.LG

Disentanglement Beyond Generative Models with Riemannian ICA

超越生成模型的解缠:黎曼ICA

Edmond Cunningham

AI总结 本文提出黎曼ICA,一种不依赖生成模型的解缠方法,通过引入解缠张量来研究局部解缠特性,为理解无生成假设下的特征解缠提供了理论基础。

详情
AI中文摘要

在解缠理论基础与现代表示学习实践之间存在差距。现有的理论框架,特别是独立成分分析(ICA)及其非线性变体,假设数据背后存在统计独立的潜在变量,使得解缠等同于识别生成数据的潜在变量。这种生成框架具有可解释性和理论依据,但其强假设使其难以应用于现代表示学习。现代预训练编码器通常学习出具有解缠特性的特征,而无需做出生成假设,但缺乏解释这些特征作为独立变化因素的一般理论。本文通过引入黎曼ICA,将ICA的全局生成模型替换为局部几何结构。RICA基于观察到,在ICA中,数据点的潜在变化因素可以通过从该点出发的径向曲线映射到潜在空间中的轴对齐直线来理解。我们利用黎曼几何正式化这一观点,并以与现有生成方法一致的方式提出我们的理论。我们的主要贡献是解缠张量,它编码了我们称为点解缠的二阶解缠概念。该张量依赖于数据对数似然的Hessian以及模型诱导的里奇曲率。在受控源恢复设置中,RICA在多个流形上恢复了源,而ICA基线的成功取决于用于表示观测的坐标。本文为研究无生成模型假设下的局部解缠提供了理论基础。

英文摘要

There is a gap between the theoretical foundations of disentanglement and the practice of modern representation learning. Existing theoretical frameworks, particularly Independent Component Analysis (ICA) and its nonlinear variants, assume a generative model with statistically independent latent variables underlying the data so that disentanglement amounts to identifying the latents that could have generated the data. This generative framework is interpretable and theoretically justified, but its strong assumptions make it difficult to apply to modern representation learning. Modern pretrained encoders often learn features that exhibit disentangled properties without making generative assumptions, yet there is no general theory for interpreting these features as independent factors of variation. We take a step toward such a theory by introducing Riemannian ICA (RICA), which replaces ICA's global generative model with local geometric structure. RICA is founded on the observation that in ICA, the factors of variation underlying a data point can be understood through radial curves emanating from the point that map to axis-aligned lines in the latent space. We formalize this perspective using Riemannian geometry and introduce our theory in a way that is consistent with the existing generative approach. Our main contribution is the disentanglement tensor, which encodes a second-order notion of disentanglement that we call pointwise disentanglement. This tensor depends on the Hessian of the data log likelihood as well as the Ricci curvature induced by the model. In a controlled source recovery setting with known ground-truth sources, RICA recovers sources across several manifolds, while the success of ICA baselines depends on the coordinates used to represent the observations. Our work provides a theoretical basis for studying local disentanglement without assuming a global generative model.

2605.22530 2026-05-22 cs.AI

A Subjective Logic-based method for runtime confidence updates in safety arguments

基于主观逻辑的方法用于安全论证中的运行时置信度更新

Benjamin Herd, Jessica Kelly, Clarissa Heinemann, João-Vitor Zacchi

AI总结 本文提出了一种基于主观逻辑的方法,用于在安全论证中实现动态定量保证,通过整合设计时证据和时间窗口内的运行时安全性能指标(SPIs),在开发生命周期中量化和传播置信度。在运行时,SPI证据被持续评估,针对的声明通过规则更新,当没有违反时增加置信度,当发生违反时施加即时惩罚。该设计优先考虑安全相关响应性,而非精确的经典贝叶斯后验更新。

详情
Journal ref
Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26), 2026
Comments
Accepted for publication at the 41st ACM/SIGAPP Symposium on Applied Computing (SAC 2026)
AI中文摘要

我们提出了一种方法,用于动态定量保证,该方法通过在单一的主观逻辑(SL)基础上的保证案例中整合设计时证据和时间窗口内的运行时安全性能指标(SPIs),从而增强静态安全案例,实现连续的运行时驱动的置信度更新。该方法通过量化和传播置信度,贯穿整个开发生命周期。在运行时,SPI证据被持续评估,并通过规则更新目标声明:在没有违反的情况下增加置信度,在发生违反时施加即时惩罚。该设计优先考虑安全相关响应性,而非精确的经典贝叶斯后验更新。我们通过基于模拟的施工区辅助功能演示该方法,重点在于基于机器学习的施工锥检测组件,并展示置信度如何随着SPI证据在操作中的观察而演变。

英文摘要

We present a method for dynamic quantitative assurance that enhances static safety cases with continuous, runtime-driven confidence updates. The method quantifies and propagates confidence across the development lifecycle by integrating design-time evidence and windowed runtime Safety Performance Indicators (SPIs) within a single Subjective Logic (SL)-based assurance case. At runtime, SPI evidence is continuously evaluated, and targeted claims are updated using a rule that increases confidence in the absence of violations and imposes prompt penalties when violations occur. This design prioritizes safety-relevant responsiveness over exact classical Bayesian posterior updates. We demonstrate the method using a simulation-based construction zone assist function, focusing on an ML-based construction cone detection component, and show how confidence evolves as SPI evidence is observed in operation.
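The update behaviour described above can be sketched with the standard Subjective Logic mapping from evidence counts to a binomial opinion (belief, disbelief, uncertainty, with non-informative prior weight W = 2). The asymmetric rule in `update`, the `penalty` factor, and all names are assumptions chosen to illustrate "slow gain, prompt penalty"; the abstract does not give the authors' exact rule.

```python
from dataclasses import dataclass

W = 2.0  # non-informative prior weight of a binomial opinion


@dataclass(frozen=True)
class Opinion:
    r: float = 0.0          # accumulated positive evidence
    s: float = 0.0          # accumulated negative evidence
    base_rate: float = 0.5  # prior probability used when evidence is missing

    @property
    def belief(self):
        return self.r / (self.r + self.s + W)

    @property
    def disbelief(self):
        return self.s / (self.r + self.s + W)

    @property
    def uncertainty(self):
        return W / (self.r + self.s + W)

    @property
    def projected(self):
        # Projected probability: belief plus the prior's share of uncertainty.
        return self.belief + self.base_rate * self.uncertainty


def update(op, violations, penalty=5.0):
    """Hypothetical windowed SPI update (assumption, not the paper's rule).

    A window without violations adds one unit of positive evidence;
    each violation adds `penalty` units of negative evidence at once.
    """
    if violations == 0:
        return Opinion(op.r + 1.0, op.s, op.base_rate)
    return Opinion(op.r, op.s + penalty * violations, op.base_rate)


claim = Opinion()
for _ in range(8):          # eight clean windows
    claim = update(claim, violations=0)
after_clean = claim.projected
claim = update(claim, violations=1)
after_violation = claim.projected
```

With these assumed numbers, eight clean windows raise the projected confidence from 0.5 to 0.9, and a single violation drops it to 0.6, which is the kind of responsiveness the abstract prioritises over an exact Bayesian posterior.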

2605.22529 2026-05-22 cs.LG cs.AI

Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets

在网络安全AI中稳定可解释性脆弱性:公共基准数据集中的多重共线性影响与缓解

Ioannis J. Vourganas, Anna Lito Michala

AI总结 本文研究了在入侵检测(IDS)中使用AI可解释性时的一个未被探索但重要的漏洞:多重共线性导致的不稳定性。尽管广泛依赖于事后可解释性工具如SHAP或LIME,但相关特征对解释鲁棒性的影响未被评估。我们引入了一个正式定理,表明多重共线性会放大归因方差。这证明了在多重共线性下,解释和特征重要性是非可识别的。在代表性的基准数据集UNSW-NB15上,通过一系列全面的实验验证了该定理。评估了四种广泛使用的模型家族,包括线性、基于树的、核和神经网络模型,在基于VIF和相关性阈值的完整和剪枝特征集上。我们提出了新的指标Explanability Fragility Score,并提出了两种新的缓解方法,具有变量整合复杂度。CAA-Filtering专注于通过分组训练模型的归因来稳定解释。SHARP是一种新的训练时间正则化框架,通过惩罚归因不稳定性,使可解释性稳定性可控且单调提高。研究结果支持稳定的预测性能,使用Kendall's τ量化在重采样解释中的不稳定性。这项工作对XAI在安全关键领域中的可信度和可重复性有直接影响,并促使将多重共线性缓解措施纳入IDS流程,为从业者提供了一套指南。

详情
Comments
35 pages, 3 figures, submitted to ACM TAISAP
AI中文摘要

本文研究了在入侵检测(IDS)中使用AI可解释性时的一个未被探索但重要的漏洞:多重共线性导致的不稳定性。尽管广泛依赖于事后可解释性工具如SHAP或LIME,但相关特征对解释鲁棒性的影响未被评估。我们引入了一个正式定理,表明多重共线性会放大归因方差。这证明了在多重共线性下,解释和特征重要性是非可识别的。在代表性的基准数据集UNSW-NB15上,通过一系列全面的实验验证了该定理。评估了四种广泛使用的模型家族,包括线性、基于树的、核和神经网络模型,在基于VIF和相关性阈值的完整和剪枝特征集上。我们提出了新的指标Explanability Fragility Score,并提出了两种新的缓解方法,具有变量整合复杂度。CAA-Filtering专注于通过分组训练模型的归因来稳定解释。SHARP是一种新的训练时间正则化框架,通过惩罚归因不稳定性,使可解释性稳定性可控且单调提高。研究结果支持稳定的预测性能,使用Kendall's τ量化在重采样解释中的不稳定性。这项工作对XAI在安全关键领域中的可信度和可重复性有直接影响,并促使将多重共线性缓解措施纳入IDS流程,为从业者提供了一套指南。

英文摘要

This paper investigates an unexplored yet impactful vulnerability in AI explainability used in intrusion detection systems (IDS): multicollinearity-induced instability. Despite extensive reliance on post-hoc explainability tools such as SHAP or LIME, the impact of correlated features on explanation robustness has not been evaluated. We introduce a formal theorem stating that multicollinearity inflates attribution variance. This demonstrates that explanations and feature importances are non-identifiable under multicollinearity. A suite of comprehensive experiments validates the theorem on a representative benchmark dataset, UNSW-NB15. Four widely used families of models are evaluated, including linear, tree-based, kernel, and neural, across full and pruned feature sets based on VIF and correlation thresholding. We propose the novel metric of Explanability Fragility Score and two novel methods to mitigate it with variable integration complexity. CAA-Filtering focuses on stabilising explanations by grouping attributions of trained models. SHARP is a novel training-time regularisation framework that penalises attribution instability, enabling controllable and monotonic improvement of explainability stability. The findings support stable predictive performance, using Kendall's τ to quantify instability across bootstrapped explanations. This work has direct implications for the trustworthiness and reproducibility of XAI in security-critical contexts, and motivates incorporating multicollinearity mitigations into IDS pipelines, providing a set of guidelines for practitioners.
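Two quantities named in this abstract, the variance inflation factor (VIF) and Kendall's τ, are simple enough to sketch directly. The code below is an illustration under stated assumptions: the VIF helper covers only the two-feature case (where R² is the squared Pearson correlation), Kendall's τ is the tie-free tau-a variant, and the function names are hypothetical rather than taken from the paper.

```python
import math
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two score vectors (e.g. feature attributions
    from two bootstrap resamples). 1.0 means identical feature rankings."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_features(x, y):
    """VIF = 1 / (1 - R^2); with a single other predictor, R^2 = r^2.
    Undefined (division by zero) when the features are perfectly collinear."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)
```

A low τ between attribution vectors computed on different bootstrap samples would indicate unstable explanations, and a high VIF flags the correlated features that VIF-based pruning would target.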

2605.22521 2026-05-22 cs.RO cs.HC

Quantifying Full-Body Immersion

量化全身沉浸

Alihan Bakir, Ekrem Yüksel, Fabio Zuliani, Neil Chennoufi, Francesco Bruno, Jamie Paik

AI总结 本文提出了一种基于全身动态交互的沉浸式虚拟体验新范式,通过音频视觉沉浸、物理沉浸和全身沉浸三个层次,结合模块化机器人表面单元实现可扩展的沉浸环境渲染,推动人与虚拟环境的共生。

详情
Comments
This manuscript is under consideration for possible publication in Nature. Copyright may be transferred to Nature without further notice if the manuscript is accepted for publication.
AI中文摘要

人类正处于又一场数字革命的前沿,现实与虚拟世界的界限正在消融,重塑我们对周围环境的认知和交互方式。在此背景下,我们引入了一种以全身动态交互为核心的沉浸式虚拟体验新范式。我们的方法通过三个不同的层次重新定义沉浸:音频视觉沉浸,捕捉感官真实;物理沉浸,提供触觉反馈;以及全身沉浸(FBI),其中动态的身体互动无缝整合到虚拟环境中。该创新的核心是一种基于模块化机器人表面单元的可扩展、可分布平台,这些单元受到自然界适应性设计的启发。这些单元能够渲染沉浸式环境,从亲密的个人体验到大规模多用户设置,动态适应实时互动。模块化系统在整个空间中分布力、形状和运动反馈,复制环境的物理特性,并通过FBI实现新的深度参与。通过结合可扩展性、适应性和动态物理参与,该框架弥合了现实与虚拟世界之间的鸿沟。它提供了一种前所未有的沉浸水平,使用户能够以共生的方式与虚拟空间进行全身互动。这项工作不仅推动了沉浸技术的发展,还重新定义了人类与虚拟环境共存的方式,为人类与环境合成的新时代奠定了基础。

英文摘要

Humanity is at the forefront of yet another digital revolution, where the lines between real and virtual worlds are dissolving, reshaping how we perceive and interact with our surroundings. In this context, we introduce a transformative paradigm for immersive virtual experiences centered around whole-body kinetic interactions. Our approach redefines immersion through three distinct levels: audio-visual immersion, capturing sensory realism; physical immersion, delivering haptic feedback; and full-body immersion (FBI), where dynamic bodily interaction integrates seamlessly with virtual environments. At the core of this innovation lies a scalable, distributable platform based on modular robotic surface units inspired by the adaptive designs of nature. These units enable the rendering of immersive environments at any scale, from intimate personal experiences to expansive multi-user settings, dynamically adapting to interactions in real time. The modular system distributes force, shape, and motion feedback throughout entire spaces, replicating the physical characteristics of the environment and enabling a new depth of engagement through FBI. By combining scalability, adaptability, and dynamic physical engagement, this framework bridges the gap between real and virtual worlds. It offers an unprecedented level of immersion where users can engage their entire bodies in symbiotic interactions with the virtual space. This work not only advances immersive technology but also redefines how humans and virtual environments coexist, setting a foundation for a new era of human-environment synthesis.

2605.22513 2026-05-22 cs.AI

Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems

为不确定非线性系统参考跟踪设计快速适应的元学习

Jiaqi Yan, Ankush Chakrabarty, Niklas Schmid, John Lygeros, Alisa Rupenyan

AI总结 本文针对不确定非线性系统的参考跟踪问题,提出基于元学习的控制框架,通过利用源系统数据加速训练并提升控制性能,通过两阶段方法实现对目标系统的快速适应。

详情
Comments
13 pages
AI中文摘要

在本文中,我们解决了不确定非线性系统的参考跟踪问题。由于从目标系统收集数据往往具有挑战性,我们的目标是利用有限的目标系统数据设计最优控制器。元学习提供了一个有前景的范式,通过利用源系统(与目标系统结构相似的系统)的离线数据来加速训练并提高控制性能。受此启发,我们提出了一种基于元学习的控制框架,将隐式模型无关元学习(iMAML)算法适应到控制设置中。该框架分为两个阶段:一个(离线)元训练阶段,其中从源数据中学习聚合表示以捕捉相似系统之间的共享系统动态;一个(在线)元适应阶段,其中仅使用少量数据样本和有限的适应步骤对目标系统进行微调。我们将此框架表述为一个双层优化问题,并提供一个具有降低存储复杂性和较少近似值的高效解决方案。所提出的框架具有通用性,允许各种学习算法的整合。为了展示这种灵活性,我们提出两种特定的学习算法,分别基于神经状态空间模型和深度Q网络。这两种方法的主要区别在于是否需要显式系统识别。数值模拟和硬件实验表明,所提出的方法增强了控制性能,并且在大多数情况下均优于基线方法。

英文摘要

In this paper, we address the problem of reference tracking for uncertain nonlinear systems. Since collecting data from the target system (i.e., the system of interest) is often challenging, our objective is to design optimal controllers using limited target system data. Meta-learning provides a promising paradigm by leveraging offline data from source systems (systems sharing structural similarities with the target system) to accelerate training and enhance control performance. Motivated by this idea, we propose a meta-learning-based control framework that tailors the implicit model-agnostic meta-learning (iMAML) algorithm to the control setting. The framework operates in two phases: an (offline) meta-training phase, where an aggregated representation is learned from source data to capture the shared system dynamics among similar systems, and an (online) meta-adaptation phase, where this representation is fine-tuned on the target system using only a few data samples and limited adaptation steps. We formulate this framework as a bi-level optimization problem and provide an efficient solution with reduced storage complexity and few approximations. The proposed framework is general, allowing various learning algorithms to be integrated. To demonstrate this flexibility, we propose two specific learning algorithms that can be incorporated into our framework based on a neural state-space model and a deep Q-network, respectively. The primary distinction between these approaches is whether explicit system identification is required. Numerical simulations and hardware experiments demonstrate that the proposed methods enhance control performance and consistently outperform baseline approaches.
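For orientation, the bi-level structure of the standard iMAML formulation, which the abstract says is tailored to the control setting, can be written as follows. The symbols are assumptions for illustration: $\theta$ is the meta-learned (aggregated) representation, $\phi_i$ the parameters adapted to source system $i$, $\lambda$ the proximal regularisation strength, and $\mathcal{L}_i^{\mathrm{tr}}$, $\mathcal{L}_i^{\mathrm{test}}$ the training and held-out losses on system $i$. The control-specific losses used by the authors are not given in the abstract.

```latex
\begin{aligned}
\theta^{\star} &= \arg\min_{\theta} \; \frac{1}{M}\sum_{i=1}^{M}
  \mathcal{L}_i^{\mathrm{test}}\bigl(\phi_i^{\star}(\theta)\bigr)
  && \text{(outer problem: meta-training)} \\
\phi_i^{\star}(\theta) &= \arg\min_{\phi} \;
  \mathcal{L}_i^{\mathrm{tr}}(\phi) + \frac{\lambda}{2}\,\lVert \phi - \theta \rVert^{2}
  && \text{(inner problem: adaptation)} \\
\frac{d\phi_i^{\star}}{d\theta} &=
  \Bigl(I + \tfrac{1}{\lambda}\,\nabla^{2}_{\phi}\mathcal{L}_i^{\mathrm{tr}}(\phi_i^{\star})\Bigr)^{-1}
  && \text{(implicit Jacobian)}
\end{aligned}
```

The last line follows from differentiating the inner stationarity condition $\nabla_{\phi}\mathcal{L}_i^{\mathrm{tr}}(\phi_i^{\star}) + \lambda(\phi_i^{\star} - \theta) = 0$ with respect to $\theta$. Because the Jacobian depends only on the inner solution and not on the optimisation path, no unrolled trajectory needs to be stored.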

2605.22507 2026-05-22 cs.LG stat.ML

Generative Modeling by Value-Driven Transport

通过价值驱动传输进行生成建模

Pablo Moreno-Muñoz, Adrian Müller, Gergely Neu

AI总结 本文提出了一种基于测度传输的离散时间随机控制表述的新生成建模框架,将问题表述为线性规划,其对偶变量直接编码最优控制策略,并开发了高效的免模拟原始-对偶算法来计算近似最优价值函数和价值驱动传输(VDT)策略,这些策略在多个实验中表现出较强的性能和良好的可扩展性潜力。

详情
AI中文摘要

我们提出了一种基于测度传输的离散时间随机控制表述的新生成建模框架。通过借鉴控制理论中的经典结果,我们将问题表述为一个线性规划,其对偶变量对应于控制问题的最优价值函数,而该价值函数直接编码了最优控制策略。利用这一线性规划表述,我们开发了一种高效的免模拟原始-对偶算法,用于计算近似最优价值函数及其相关的价值驱动传输(VDT)策略,这些策略近似于真正的最优策略。我们展示了经过良好训练的 VDT 策略与其他基于流、扩散或 Schrödinger 桥的最新方法相比具有许多有利的性质:它们产生直线传输路径,可以快速且鲁棒地模拟,并且可以用与扩散模型和基于流的模型完全相同的方式加以增强(例如,条件生成、无分类器引导和无配对的数据到数据翻译都很容易整合)。我们在一系列实验中评估了我们的方法,结果表明其性能强大且具有良好的可扩展性潜力。

英文摘要

We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the optimal value function of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated value-driven transport (VDT) policies, which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, and unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.

2605.22506 2026-05-22 cs.CR cs.LG

EnCAgg: Enhanced Clustering Aggregation for Robust Federated Learning against Dynamic Model Poisoning

EnCAgg: 增强型聚类聚合用于对抗动态模型中毒的联邦学习

Tianyun Zhang, Zhen Yang, Haozhao Wang, Ru Zhang, Yongfeng Huang

AI总结 本文提出了一种新的鲁棒聚合方法,通过利用少量已知的良性客户端作为参考,准确识别和过滤恶意梯度,同时保留尽可能多的良性梯度,即使恶意客户端的数量未知且变化。方法包括密度基低维梯度聚类、增强聚类低维梯度生成模型和低维梯度重新聚类。

详情
AI中文摘要

联邦学习面临越来越多的模型中毒攻击威胁,这些攻击损害了其在提高隐私保护方面的应用。现有的防御方法通常依赖于固定的阈值或使用固定数量的聚类来进行区分恶意梯度和良性梯度。然而,这些方法难以适应恶意客户端的动态中毒策略,且由于客户端本地数据集的异质性,常常导致良性梯度的丢失。为了解决这些问题,我们提出了一种新的鲁棒聚合方法,该方法利用少量已知的良性客户端作为参考,能够准确识别和过滤恶意梯度,同时尽可能保留良性梯度,即使恶意客户端的数量未知且变化。首先,我们引入了一种基于密度的低维梯度聚类方法,将梯度投影到两个最分散的维度,并应用基于密度的聚类来识别恶意梯度,同时保留聚类中的良性梯度和可能的良性异常值。其次,我们设计了一种增强聚类低维梯度生成模型,该模型学习生成与良性簇边界对齐的伪梯度。这些伪梯度充当桥梁,连接稀疏的良性梯度异常值。第三,我们引入了低维梯度重新聚类,将生成的伪梯度与真实梯度一起聚类,以恢复被误分类为噪声点的良性梯度,使更多的良性梯度能够参与聚合。在MNIST、CIFAR-10和MIND数据集上的广泛实验表明,我们的方法在动态中毒场景下表现出卓越的保真度和鲁棒性。

英文摘要

Federated learning faces increasing threats from model poisoning attacks, which undermine its use for improving privacy. Existing defense methods typically rely on fixed thresholds or perform clustering with a fixed number of clusters to distinguish malicious gradients from benign ones. However, these methods struggle to adapt to the dynamic poisoning strategies of malicious clients, and often lose benign gradients because of the heterogeneity of clients' local datasets. To address these problems, we propose a novel robust aggregation method that leverages a small number of known benign clients as references, enabling accurate identification and filtering of malicious gradients while retaining as many benign gradients as possible, even when the number of malicious clients is unknown and variable. First, we introduce a density-based low-dimensional gradient clustering method, which projects gradients onto the two most divergent dimensions and applies density-based clustering to identify malicious gradients while retaining clustered benign gradients and potentially benign outliers. Second, we design a clustering-enhancing low-dimensional gradient generator model, which learns to generate pseudo-gradients aligned with the boundary of the benign cluster. These pseudo-gradients act as bridges to connect sparse benign gradient outliers. Third, we introduce low-dimensional gradient re-clustering, which clusters the generated pseudo-gradients together with real gradients to recover benign gradients misclassified as noise points, enabling more benign gradients to participate in aggregation. Extensive experiments on the MNIST, CIFAR-10, and MIND datasets demonstrate that our method exhibits superior fidelity and robustness under dynamic poisoning scenarios.
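The first of the three stages described above (projection followed by density-based clustering) can be sketched as follows. This is an illustration under assumptions: "most divergent dimensions" is interpreted here as the two coordinates with the highest variance across clients, the clustering is a plain DBSCAN written out by hand, and the `eps` and `min_pts` values are arbitrary. The generator and re-clustering stages are not shown.

```python
import math
from statistics import pvariance


def top2_dims(grads):
    """Indices of the two coordinates with the largest variance across clients
    (an assumed reading of 'most divergent dimensions')."""
    dims = range(len(grads[0]))
    variances = [pvariance([g[d] for g in grads]) for d in dims]
    order = sorted(dims, key=lambda d: variances[d], reverse=True)
    return order[0], order[1]


def project(grads):
    """Project every client gradient onto the two selected coordinates."""
    i, j = top2_dims(grads)
    return [(g[i], g[j]) for g in grads]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: -1 for noise,
    otherwise a cluster id starting at 0."""
    n = len(points)
    labels = [None] * n

    def neighbours(p):
        return [q for q in range(n) if math.dist(points[p], points[q]) <= eps]

    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        nb = neighbours(p)
        if len(nb) < min_pts:
            labels[p] = -1          # provisional noise, may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in nb if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise reached from a core point: border
            if labels[q] is not None:
                continue
            labels[q] = cluster
            nq = neighbours(q)
            if len(nq) >= min_pts:   # q is a core point, keep expanding
                queue.extend(nq)
    return labels


# Toy example: four similar client gradients and one far-away gradient.
grads = [
    [0.00, 1.00, 0.5],
    [0.10, 1.10, 0.5],
    [0.05, 0.95, 0.5],
    [0.10, 1.00, 0.5],
    [5.00, -4.00, 0.5],
]
labels = dbscan(project(grads), eps=0.3, min_pts=3)
```

In this toy run the four similar gradients form one cluster and the distant one is labelled as noise. In the method described above, noise points are not discarded outright, since some may be benign outliers that the later stages try to recover.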

2605.22505 2026-05-22 cs.AI

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

通过优先级排名直接评估Harness优化器

Kai Tzu-iunn Ong, Minseok Kang, Dongwook Choi, Junhee Cho, Seungju Kim, Seungwon Lim, Geunha Jang, Minwoo Oh, Bogyung Jeong, Sunghwan Kim, Taeyoon Kwon, Jinyoung Yeo

AI总结 本文提出通过优先级排名直接评估Harness优化器,以解决传统方法中因缺乏oracle harness而无法有效评估优化器中间步骤的问题,展示了该方法在多步骤优化中的可靠性。

详情
Comments
Preprint. Work in Progress
AI中文摘要

Harness优化通过让优化器代理迭代更新目标代理的harness来实现自动化代理创建。尽管其成功,当前研究仅通过观察目标代理的性能提升来评估优化器,这种间接的末端改进评估忽视了优化器在中间步骤中的行动,这些行动往往错误且阻碍代理性能。因此,不清楚harness优化是受优化器有信息的更新行动驱动还是单纯的试错。这需要直接评估harness优化器。然而,由于缺乏oracle harness,直接评估harness优化器是非平凡且昂贵的。为此,我们提出了一种简单且低成本的设计来直接评估它们,即优先级排名。通过让harness优化器对给定harness中的组件(例如工具)按其更新时对代理性能改进/阻碍的潜力进行排序,我们的设计在不昂贵的rollout或手动检查的情况下量化了优化器在步骤层面的能力。更重要的是,优化器的排名性能与它们在实际多步骤harness优化中改进代理的能力相关,建立了优先级排名作为优化能力可靠预测指标。优先级排名通过Shor实现,Shor是182个由人类验证的优化场景的集合,涵盖多个领域、设计和时间阶段。代码和数据可在https://github.com/k59118/Harness_Optimizer_Evaluation找到。

英文摘要

Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end-improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply by trial and error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers' ranking performance correlates with their ability to improve agents in actual multi-step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human-verified optimization scenarios spanning domains, designs, and time stages. Code and data can be found at https://github.com/k59118/Harness_Optimizer_Evaluation.

2605.22504 2026-05-22 cs.AI cs.CV

LACO: Adaptive Latent Communication for Collaborative Driving

LACO:适应性潜在通信用于协同驾驶

Tianhao Chen, Yuheng Wu, Dongman Lee

AI总结 本文提出LACO,一种无需训练的潜在通信范式,通过迭代潜在推理、跨时间显著性归因和结构化语义知识蒸馏,解决协同驾驶中潜在通信的延迟和信息丢失问题,实验证明其在降低通信和推理延迟的同时保持了强大的协同驾驶性能。

详情
AI中文摘要

协同驾驶旨在通过使联网车辆在部分可观测性下协调以提高安全性和效率。最近的方法已从共享视觉特征进行感知发展到通过基础模型交换基于语言的推理以实现行为协调。尽管用语言交流提供直观的信息,但引入了两个挑战:由自回归解码引起的高延迟,以及由于将丰富的内部表示压缩成离散标记而引起的信息丢失。为了解决这些挑战,我们分析了协同驾驶中潜在通信在多智能体设置下的固有限制。我们的分析揭示了智能体身份混淆,即直接融合潜在状态会使车辆间的决策表示相互纠缠。受此启发,我们提出了LACO,一种无需训练的潜在通信范式,能够无缝地将预训练驾驶模型适应到协同设置中。LACO引入了迭代潜在审议(ILD)用于潜在推理,跨时间显著性归因(CHSA)用于通信高效的信息选择,以及结构化语义知识蒸馏(SSKD)以稳定以自我为中心的决策。在CARLA中的闭环实验表明,LACO显著降低了通信和推理延迟,同时保持了强大的协同驾驶性能。

英文摘要

Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

2605.22502 2026-05-22 cs.AI cs.LG

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

将代理工作流编译为LLM权重:在成本上减少两个数量级的情况下实现接近前沿质量

Simon Dennis, Rivaan Patil, Kevin Shabahang, Hao Guo

AI总结 本文研究如何将代理工作流编译为LLM权重以提高效率,通过在旅行预订、Zoom支持和保险索赔等任务中验证,展示了编译方法在减少成本的同时保持高质量性能。

详情
Comments
19 pages
AI中文摘要

代理编排框架大量涌现,LangGraph、CrewAI、Google ADK、OpenAI Agents SDK、Semantic Kernel、Strands和LlamaIndex的GitHub星标总数已超过290,000。所有框架都遵循相同模式:一个外部编排器位于LLM之上,每回合注入指令并路由决策。最近的工作表明,对于过程性任务,这种架构不如直接在前沿模型的系统提示中提供过程[Dennis et al., 2026a],但后者的代价是消耗上下文窗口、每次对话都需要一个前沿模型,并将专有过程暴露给第三方提供者。将过程编译到小型微调模型的权重中(即创建一个“地下代理”)应能解决所有这些担忧,先前工作(SimpleTOD、FireAct、SynTOD、WorkflowLLM、Agent Lumos)已展示了该技术的可行性。然而,开发者的采用却压倒性地倾向于编排。我们识别了三个感知障碍,并在旅行预订(14个节点)、Zoom支持(14个节点,产品特定知识)和保险索赔(55个节点,6个决策中心)中通过实证方法逐一解决。

英文摘要

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

2605.22501 2026-05-22 cs.CL cs.AI cs.IR

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

BeLink: 生物医学实体链接结合生成性重新排序

Darya Shlyk, Stefano Montanelli, Lawrence Hunter

AI总结 本文提出了一种基于生成模型的重新排序方法,通过指令微调提高生物医学实体链接的效率和准确性,在多个基准测试中实现了3%-24%的链接准确率提升,同时减少了推理时间。

详情
Comments
Accepted to ACM SIGIR 2026
AI中文摘要

尽管近年来取得了进展,但使用大语言模型(LLMs)的生物医学实体链接(BEL)仍然计算效率低下,难以在实际应用中部署。在本工作中,我们证明了在BEL流水线的重新排序阶段对开源生成模型进行指令微调可以提供有效的解决方案。我们提出了一种集束式指令微调公式,使候选人的选择变得快速且准确。我们的方法在多个BEL基准测试中表现出色,比最先进的方法在链接准确性上提高了3%-24%,同时减少了推理时间。我们将我们的生成性重新排序器整合到BeLink中,这是一个模块化、端到端的系统,旨在实际的生物医学实体链接应用中使用。

英文摘要

Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when applied at the re-ranking stage of the BEL pipeline. We propose a set-wise instruction-tuning formulation that enables fast and accurate candidate selection. Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art. We integrate our generative re-ranker into BeLink, a modular, end-to-end system designed for practical real-world BEL applications.

2605.22498 2026-05-22 cs.LG cs.AI cs.SC

The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning

神经编译器:程序到网络的翻译用于混合科学机器学习

Lucas Sheneman

AI总结 该研究提出了一种神经编译器,能够将程序转换为可微的PyTorch模块,用于混合科学机器学习,通过符号规范生成正确且可微的模块,实现系统化的可组合性。

详情
Comments
Use: 21 pages, 10 figures, 10 tables. Preprint; source code available at https://github.com/sheneman/neural_compiler
AI中文摘要

科学机器学习经常需要结合已知的物理规律与从数据中学习的未知参数或校正项。现有方法要么忽略已知结构,要么将其编码为软惩罚项,要么需要为每个方程手动编写PyTorch代码。我们提出了神经编译器,一种将用一阶类Scheme表达式语言编写的程序转换为冻结、可微的PyTorch模块的系统。这些模块在浮点精度范围内匹配源程序,并通过autograd提供梯度。在混合模型中,编译模块精确编码已知的物理规律,而学习组件则建模未知的剩余部分。我们在六个实验领域评估了该编译器:费曼物理方程、洛特卡-沃尔泰拉动力学、阻尼摆、一维热方程、三维向量力学以及组合泛化。编译模块在单个方程上与手动编写的PyTorch实现数值上一致,表明编译没有精度损失。仅用1到4个可训练参数,编译模型在大多数情况下能够将物理常数恢复到不到1%的误差,而拥有超过8500个参数的标准PINN基线模型误差为7%到93%。编译模块还可以零误差地组合,而神经近似方法在深度组合链中会积累大误差。编译器的主要价值不是优于手动编写方程的精度,而是系统化的可组合性:它从符号规范生成正确且可微的模块,而无需手动重写每个方程。该系统支持51个基本操作,包括向量和矩阵代数,能够实现PDE离散化和混合科学模型。这种字符串输入、模块输出的接口也为大语言模型提供了自然的目标,这些模型可以将科学描述翻译成可执行的可微模块。

英文摘要

Scientific machine learning often requires combining known physics with unknown parameters or correction terms learned from data. Existing approaches either ignore known structure, encode it as a soft penalty, or require hand-written PyTorch code for each equation. We present The Neural Compiler, a system that translates programs written in a first-order Scheme-like expression language into frozen, differentiable PyTorch modules. These modules match the source program to floating-point precision and provide gradients through autograd. In hybrid models, the compiled module encodes known physics exactly while learned components model the unknown remainder. We evaluate the compiler across six experiment domains: Feynman physics equations, Lotka-Volterra dynamics, a damped pendulum, a one-dimensional heat equation, three-dimensional vector mechanics, and compositional generalization. Compiled modules match hand-coded PyTorch implementations numerically for single equations, showing no accuracy loss from compilation. With only 1 to 4 trainable parameters, compiled models recover physical constants to less than 1 percent error in most cases, while standard PINN baselines with more than 8500 parameters show 7 to 93 percent error. Compiled modules also compose with zero error, while neural approximations can accumulate large errors in deep composition chains. The main value of the compiler is not improved accuracy over hand-coded equations, but systematic composability: it generates correct, differentiable modules from symbolic specifications without rewriting each equation by hand. The system supports 51 primitive operations, including vector and matrix algebra, enabling PDE discretizations and hybrid scientific models. This string-in, module-out interface also provides a natural target for large language models that translate scientific descriptions into executable differentiable modules.

2605.22496 2026-05-22 cs.LG

The Signal in the Noise: OOD Detection Through Goodness-of-Fit Testing in Factorised Latent Spaces

噪声中的信号:通过因子化潜在空间中的拟合性检验进行分布外检测

Philipp Bomatter, Jack Geary, Henry Gouk

AI总结 本文提出了一种基于因子化潜在空间中拟合性检验的分布外检测方法SITN,该方法无需访问分布外数据,计算开销小,并能严格控制误报率。

详情
AI中文摘要

深度生成模型为分布外检测提供了自然的基础,但先前的工作表明,它们分配的似然在区分分布内与分布外数据方面是出了名地不可靠。在本文中,我们通过利用连续归一化流的微分同胚和质量保持性质来解决这个问题。我们的分析表明,分布外样本被映射到在噪声先验下高度非典型的噪声样本,而这种非典型性无法通过似然来捕捉。基于这一观察,我们提出了一种新的方法--Signal in the Noise (SITN)--用于单样本级别的分布外检测。SITN 不需要访问分布外数据,计算开销小,并提供严格的误报率控制。通过标准基准和合成扰动的全面评估,突显了该方法的有效性,并且不存在基于似然的方法所固有的复杂性偏差。

English abstract

Deep generative models offer a natural foundation for out-of-distribution (OOD) detection, yet prior work has shown that their assigned likelihoods are notoriously unreliable indicators for in- vs out-of-distribution data. In this paper, we address this problem by leveraging the diffeomorphic and mass-preserving properties of continuous normalising flows. Our analysis shows that OOD samples are mapped to noise samples that are highly atypical under the noise prior in ways not captured by the likelihood. Based on this observation, we propose a new method -- Signal in the Noise (SITN) -- for OOD detection on the single-sample level. SITN requires no access to OOD data, incurs minimal computational overhead, and provides strict control of false positive rates. Comprehensive evaluations through standard benchmarks and synthetic perturbations highlight the method's effectiveness and the absence of the complexity bias inherent to likelihood-based methods.
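
The idea of testing whether a single sample's noise vector looks typical under a factorised prior can be sketched as follows. This is not the authors' SITN procedure. It is a minimal illustration that assumes the prior is a standard normal with independent coordinates and that swaps in a one-sample Kolmogorov-Smirnov test as the goodness-of-fit test; the function names and the threshold `alpha` are hypothetical. Under these assumptions, rejecting when the p-value falls below `alpha` keeps the false positive rate on in-distribution data at roughly `alpha`, up to the accuracy of the asymptotic p-value.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def ks_statistic(latent):
    """Largest gap between the empirical CDF of the coordinates and the N(0, 1) CDF."""
    xs = sorted(latent)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = STD_NORMAL.cdf(x)
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


def ks_pvalue(stat, n, terms=100):
    """Asymptotic Kolmogorov p-value; an approximation that improves as n grows."""
    t = stat * math.sqrt(n)
    if t < 0.2:
        # The series converges slowly here and the p-value is ~1 anyway.
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


def is_atypical(latent, alpha=0.01):
    """Flag a single latent vector whose coordinates do not look like N(0, 1) draws."""
    return ks_pvalue(ks_statistic(latent), len(latent)) < alpha


# Deterministic stand-ins for latent vectors: exact normal quantiles (typical)
# and the same vector stretched by a factor of two (atypical scale).
n = 512
typical = [STD_NORMAL.inv_cdf((i + 0.5) / n) for i in range(n)]
stretched = [2.0 * z for z in typical]

print(is_atypical(typical), is_atypical(stretched))
```

A test on the coordinate distribution can reject a latent vector even when a summary such as its likelihood looks unremarkable, which is the kind of atypicality the abstract says likelihoods miss.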

2605.22493 2026-05-22 cs.LG cs.AI cs.RO

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

理解动作分块行为克隆中的多模态失败

Lorenzo Mazza, Massimiliano Datres, Ariel Rodriguez, Sebastian Bodenstedt, Gitta Kutyniok, Stefanie Speidel

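AI summary: This paper studies the mechanisms by which behavioral cloning fails under multimodality, analyzes how different multimodal parameterizations break down in different ways in action-chunking policies, and proposes improving robustness by adjusting the degree of regularization and refining the generative policy.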

AI Chinese abstract

当同一观测对应多个有效动作时,行为克隆会变得困难。我们针对动作分块策略研究这一问题,并表明不同的多模态参数化会以不同的方式失效。对于隐变量策略,后验-先验正则化使部署时的采样更可靠,但过度正则化会抹去区分各演示模式所需的动作条件信息。减弱这种正则化可以保留模式信息,但此时成功与否取决于先验是否覆盖相关的隐变量区域。对于动作空间生成策略,多模态性受到从基分布到动作的传输映射平滑性的限制:Lipschitz 常数较小的映射无法为许多彼此充分分离的模式分配可观的概率。因此,覆盖许多模式要么需要基空间中的急剧过渡,要么需要动作空间中位于支撑集之外的桥接区域。在合成多模态任务和机器人仿真基准上的实验支持了这些机制。

English abstract

Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.
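
The Lipschitz constraint mentioned in this abstract can be made concrete with a short worked inequality. This is a sketch under assumed notation, not the paper's own statement: $T$ denotes the base-to-action map, $L$ its Lipschitz constant, and $M_i$, $M_j$ two mode regions in action space separated by a distance of at least $\Delta$.

```latex
% Assumed notation (not from the abstract): T is the base-to-action map,
% L its Lipschitz constant, M_i and M_j two mode regions at distance >= \Delta.
\[
  \|T(z) - T(z')\| \le L \,\|z - z'\| \quad \text{for all } z, z' .
\]
\[
  z \in T^{-1}(M_i),\; z' \in T^{-1}(M_j)
  \;\Longrightarrow\;
  \Delta \le \|T(z) - T(z')\| \le L \,\|z - z'\|
  \;\Longrightarrow\;
  \|z - z'\| \ge \frac{\Delta}{L} .
\]
```

The preimages of two separated modes are therefore at least $\Delta / L$ apart in base space. With a small $L$ that gap is wide, and any base mass falling in it is mapped to actions outside both modes. This matches the trade-off the abstract describes between sharp transitions in base space (large $L$) and off-support bridge regions in action space.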