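arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.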

2406.09407 2026-06-04 cs.CV

Towards Evaluating the Robustness of Visual State Space Models

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan

Affiliations: Mohamed Bin Zayed University of AI; Center of Secure Cyber-Physical Security Systems; Linköping University; Australian National University

AI summary: This paper comprehensively evaluates the robustness of Vision State Space Models (VSSMs) under a range of perturbations, including occlusions, image structure, common corruptions, and adversarial attacks, and compares them with architectures such as Transformers and CNNs, revealing their strengths and limitations.

Comments Accepted at The 5th Workshop of Adversarial Machine Learning on Computer Vision (CVPRW 2025)

Abstract

Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs' adversarial robustness, we conduct a frequency-based analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.

2404.11309 2026-06-04 cs.CV

Achieving Rotation-Invariant Convolution via Non-Learnable Orientation Alignment Operators

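Hanlin Mo, Peihong Lei, You Hao, Guoying Zhao

Affiliations: University of Science and Technology of China

AI summary: Proposes rotation-invariant convolutions (RIConvs) built on non-learnable operators. They have the same number of learnable parameters as standard convolutions and a similar computational process, and they improve accuracy on several vision tasks, especially when training data is limited.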

Abstract

Achieving rotational invariance in deep neural networks without data augmentation is a research hotspot. Intrinsic invariance enables features to capture targets' inherent properties, enhancing deep learning performance in visual tasks. Based on various types of non-learnable operators, this paper proposes a comprehensive set of convolution operations that are naturally invariant to arbitrary rotations. Unlike most prior methods, these rotation-invariant convolutions (RIConvs) have the same number of learnable parameters and a similar computational process as standard convolutions, making them interchangeable. Using the MNIST-Rot dataset, we validate their invariance across rotation angles and compare them with previous rotation-invariant CNNs, where two gradient-based RIConvs achieve state-of-the-art results. Then, we integrate RIConvs with classic CNN backbones and evaluate them on texture recognition, aircraft type recognition, and remote sensing image classification tasks. Results show that RIConvs significantly improve accuracy, particularly with limited training data, and enhance performance even with data augmentation.

1905.04235 2026-06-04 cs.RO cs.SY eess.SY

Autonomous Locomotion Mode Transition in Quadruped Track-Legged Robots: A Simulation-Based Analysis for Step Negotiation

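Jie Wang, Krispin Davies

Affiliations: University of Cambridge; ClearPath AI

AI summary: Proposes a method for autonomous locomotion mode transition in quadruped hybrid robots, particularly when negotiating steps of different heights, using an energy-efficiency evaluation mechanism to achieve smooth transitions.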

DOI: 10.1016/j.simpat.2024.102893

Abstract

Hybrid track/wheel-legged robots combine the advantages of wheel-based and leg-based locomotion, granting adaptability across varied terrains through efficient transitions between rolling and walking modes. However, automating these transitions remains a significant challenge. In this paper, we introduce a method designed for autonomous mode transition in a quadruped hybrid robot with a track/wheel-legged configuration, especially during step negotiation. Our approach hinges on a decision-making mechanism that evaluates the energy efficiency of both locomotion modes using a proposed energy-based criterion. To guarantee a smooth negotiation of steps, we incorporate two climbing gaits designated for the assessment of energy usage in walking locomotion. Simulation results validate the method's effectiveness, showing successful autonomous transitions across steps of diverse heights. Our suggested approach has universal applicability and can be modified to suit other hybrid robots of similar mechanical configuration, provided their locomotion energy performance is studied beforehand.

2402.02555 2026-06-04 cs.CV cs.CL

High-Quality Entity Segmentation and Grounding

Lu Qi, Yi-Wen Chen, Tao Zhang, Xiangtai Li, Xu Yang, Bo Du, Ming-Hsuan Yang

Affiliations: Wuhan University; Insta360 Research; Department of EECS, University of California, Merced; Nanyang Technological University; Institute of Automation of the Chinese Academy of Sciences

AI summary: Proposes the ESG pipeline, supported by the new EntitySeg dataset and a two-stage decoupled design (CropFormer for high-quality segmentation, GELLA for accurate noun extraction and semantic matching), to achieve high-quality entity segmentation and grounding, and shows its effectiveness on five tasks.

Abstract

In this work, we propose ESG, a pipeline for high-quality entity segmentation and grounding supported by a new dataset, EntitySeg. First, the proposed dataset, named EntitySeg, contains images spanning various image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Second, ESG mainly consists of two modules: CropFormer for high-quality entity segmentation, and GELLA for accurate noun extraction from sentences and semantic matching between language and visual regions. Unlike existing grounding methods that jointly train a segmentation model and a large language model, ESG adopts a two-stage decoupled design, preserving high-quality masks and grounding robustness without the trade-offs often introduced by joint training. CropFormer ensures high-quality entity segmentation results, which can then be encoded into the GELLA model for effective grounding. Extensive experimental results demonstrate the effectiveness of our proposed pipeline across five tasks, including entity segmentation, panoptic segmentation, open-vocabulary segmentation, referring segmentation, and panoptic localized narratives. Furthermore, the GELLA module of the ESG pipeline is highly flexible and capable of processing mask inputs from any segmentation framework, thanks to its lightweight colormap/vision encoder, language/mask decoder, and association module. The entity segmentation dataset and grounding code will be released at https://github.com/qqlu/Entity.

2209.15448 2026-06-04 cs.LG math.ST stat.ME stat.TH

Blessing from Human-AI Interaction: Super Reinforcement Learning in Confounded Environments

Jiayi Wang, Zhengling Qi, Chengchun Shi

Affiliations: Department of Mathematical Sciences, University of Texas at Dallas; Department of Statistics, London School of Economics and Political Science; Department of Decision Sciences, George Washington University

AI summary: Proposes super policy learning, which uses observed actions from human-AI interaction as input. Under unmeasured confounding, it uses proximal causal inference to obtain a super-policy that outperforms both the standard optimal policy and the behavior policy.

Abstract

As AI becomes more prevalent throughout society, effective methods of integrating humans and AI systems that leverage their respective strengths and mitigate risk have become an important priority. In this paper, we introduce the paradigm of super policy learning, which takes advantage of human-AI interaction for data-driven sequential decision making. This approach utilizes the observed action, either from AI or humans, as input for achieving a stronger oracle in policy learning for the decision maker (humans or AI). In a decision process with unmeasured confounding, the actions taken by past agents can offer valuable insights into undisclosed information. By including this information in the policy search in a novel and legitimate manner, the proposed super policy learning will yield a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior one (e.g., past agents' actions). We call this stronger oracle a blessing from human-AI interaction. Furthermore, to address the issue of unmeasured confounding in finding super-policies using batch data, a number of nonparametric and causal identifications are established under the framework of proximal causal inference. Building upon these novel identification results, we develop several super-policy learning algorithms and systematically study their theoretical properties, such as finite-sample regret guarantees. Finally, we illustrate the effectiveness of our proposal through extensive simulations and real-world applications.

2606.05150 2026-06-04 cs.NE cs.AI

Multi-Column RBF Neural Network Using Adaptive and Non-Adaptive Particle Swarm Optimization

Ammar Hoori, Yuichi Motai

Affiliations: Department of Biomedical Engineering, Case Western Reserve University; Department of Electrical and Computer Engineering, Virginia Commonwealth University

AI summary: To address the scalability of RBF neural network training on large datasets, proposes multi-column RBF networks based on particle swarm optimization (PSO) and adaptive PSO (APSO), named MC-PSO and MC-APSO, which train multiple RBFNs in parallel and use subset specialization to improve accuracy and speed.

Comments 15 Page, Under Review

Abstract

The radial basis function neural network (RBFN) trained with a gradient descent algorithm provides an effective fully connected structure in both shallow and deep networks. Error correction (ErrCor), a state-of-the-art gradient-based training method, selects optimal hidden units to improve accuracy. Alternatively, as a population-based algorithm, the particle swarm optimization algorithm (PSO) uses the swarm experience to optimize RBFN parameters, offering global search and robustness to local minima. Adaptive PSO (APSO) has emerged as an improved variant of PSO. The APSO algorithm improves convergence speed by dynamically adjusting swarm parameters during optimization. Both ErrCor and PSO demonstrate improved results and competitive convergence. However, with large datasets, these methods face scalability challenges such as excessive kernel computations and large hidden layer structures. A recent multi-column RBFN approach (MCRN) improves ErrCor performance by deploying small RBFNs in a parallel system. Inspired by MCRN's success, we propose two novel approaches to improve PSO performance: the multi-column RBFN with PSO (MC-PSO) and the multi-column RBFN with APSO (MC-APSO). These methods introduce parallel RBFN structures trained using evolutionary swarm methods. Each RBFN is independently trained on a specific spatial subset of the dataset using either the PSO or the APSO algorithm. The resulting specialist RBFNs are tailored to their respective subsets. During testing, only selected RBFNs, where the test instance neighbors are located, contribute to the multi-column output. This specialization improves accuracy, while parallelism enhances speed. We evaluate the proposed methods on various benchmark datasets. MC-PSO and MC-APSO outperform ErrCor, PSO, APSO, and MCRN in terms of accuracy and recall. They also demonstrate faster training and testing times in most experiments.
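For readers unfamiliar with PSO, the swarm update that the abstract refers to can be sketched as follows. This is a generic textbook PSO on a toy objective, not the authors' MC-PSO or APSO code; the function name `pso_minimize`, the parameter values (`w`, `c1`, `c2`), and the sphere objective are illustrative assumptions.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over `dim` dimensions with a basic particle swarm."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position (personal best).
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    # The swarm shares the best position found so far (global best).
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

In a multi-column setup as the abstract describes it, one such optimizer would presumably run independently per data subset, with the toy objective replaced by the training error of that subset's RBFN. An adaptive variant would additionally change `w`, `c1`, and `c2` during the run; how APSO does so is not specified in the abstract.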

2606.05129 2026-06-04 cs.CR cs.LG

Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption

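Jian Yang, Yuan Tong, Qinbin Li, Zeyi Wen, Xiaofang Zhou

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Hong Kong University of Science and Technology; University of California, Berkeley

AI summary: To address privacy leakage in distributed causal structure learning, proposes a method based on fully homomorphic encryption that uses circuit simplification, approximations of division and logarithm, and SIMD batching to learn causal structure efficiently on encrypted data, and that can be extended to support differential privacy.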

Abstract

Preserving data privacy is an important topic in structural data management and data mining. However, privacy leakage in distributed causal structure learning is a persistent challenge, especially in cases where data transmission and computation are required. In this paper, we propose a method based on fully homomorphic encryption (FHE) that performs calculations on ciphertexts, keeping data encrypted during transmission and computation. Nevertheless, applying FHE to causal structure learning is challenging due to the high computation cost and the limited support for division and logarithm operations in FHE. To tackle this challenge, we propose a series of novel techniques including (i) circuit simplification for better efficiency, (ii) approximation of division and logarithm through the Newton-Raphson reciprocal and Taylor expansion, and (iii) a batching technique with SIMD acceleration to enhance the whole learning process. Additionally, our method can be easily extended beyond FHE, as we demonstrate by porting it to support differential privacy. Empirical results show that our method achieves high consistency with the plaintext version and produces comparable causal structures on the datasets tested. Finally, our method is efficient and practical, completing causal structure learning in tens of minutes even under the privacy protection of FHE.
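To make technique (ii) concrete, here is a minimal plaintext sketch of a Newton-Raphson reciprocal and a Taylor-series logarithm. It runs on ordinary Python floats rather than ciphertexts, and the function names, iteration counts, initial guess, and expansion point are assumptions chosen for illustration; the abstract does not give the paper's actual choices.

```python
def nr_reciprocal(x, y0, iters=6):
    """Approximate 1/x using only multiplication and subtraction.

    The iteration y <- y * (2 - x * y) converges when 0 < y0 < 2 / x,
    so the caller must supply a suitable initial guess y0 (assumed known).
    """
    y = y0
    for _ in range(iters):
        y = y * (2.0 - x * y)
    return y


def taylor_log(x, terms=8):
    """Approximate ln(x) for x close to 1 via ln(1 + u) = u - u^2/2 + u^3/3 - ...

    Only additions and multiplications touch the input; the 1/k
    coefficients are plaintext constants.
    """
    u = x - 1.0
    total = 0.0
    power = 1.0
    for k in range(1, terms + 1):
        power *= u  # u^k
        coeff = (1.0 if k % 2 == 1 else -1.0) / k
        total += coeff * power
    return total


approx_quarter = nr_reciprocal(4.0, y0=0.1)
approx_log = taylor_log(1.2)
```

Both routines avoid division and logarithm on the input itself, which is the property that makes approximations of this kind expressible as arithmetic circuits over encrypted values. The accuracy depends on the input range, so in practice inputs would need to be scaled into the region where the approximations converge.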

2606.05124 2026-06-04 cs.GR cs.CV cs.LG

Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting

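Hongyu Zhou, Zorah Lähner

Affiliations: University of Bonn; Lamarr Institut

AI summary: To resolve the conflict between geometric representation and appearance rendering in 3D Gaussian Splatting, proposes adding a geometry opacity parameter to each splat, together with a transparency-curated optimization pipeline, decoupling geometry from appearance and improving rendering and geometry performance on complex scenes, especially those with transparent objects.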

Abstract

After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show that 3DGS in its default form is inherently unsuited to represent texture and geometry at the same time, by training with complete ground-truth texture and geometry information. We also propose a simple solution by applying a single additional geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of datasets, and that complex scenes with transparent objects in particular benefit significantly from our method.

2606.05045 2026-06-04 math.DS cs.LG

Learning Control-Affine Reduced-Order Models via Autoencoders

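Ali Mjalled, Martin Mönnigmann

Affiliations: Automatic Control and Systems Theory, Ruhr-Universität Bochum

AI summary: Proposes a framework that uses autoencoders to learn a reduced-order latent space and control-affine state-space dynamics simultaneously, extends it to a sequence-based model to improve prediction accuracy, and motivates it by applying feedback linearization to the learned models.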

Abstract

We present in this paper a framework for the identification of control-affine reduced-order models (ROMs). The proposed method utilizes autoencoders (AEs) to transform the high-dimensional states, and potentially the high-dimensional inputs, into reduced latent ones suitable for control-affine state-space dynamics. This is achieved by simultaneous training of the AE and the state-space model. In addition, we extend the discrete ROM formulation to a sequence-based model, which processes state and input histories to improve prediction accuracy while preserving the control-affine structure. We motivate our framework by applying feedback linearization to the derived models, and we present guidelines for its efficient use. The proposed framework is assessed on two numerical examples and its performance is compared to a baseline model, where the AE identifies a latent space with linear state-space dynamics. The assessment involves evaluating the prediction accuracy of the ROM on test data and its effectiveness in controlling the system to a desired state or trajectory.
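As a reading aid, the control-affine latent structure described in the abstract can be written as below. The symbols are assumptions introduced here, not the paper's notation: $\varphi_x$ and $\varphi_u$ are the state and input encoders, $\psi$ is the decoder, $z_k$ and $v_k$ are the latent state and latent input, and $f$, $g$ are learned maps. The last line sketches why this structure suits feedback linearization, under the additional assumption that $g(z_k)$ is square and invertible.

```latex
\begin{align}
z_k &= \varphi_x(x_k), \qquad v_k = \varphi_u(u_k), \\
z_{k+1} &= f(z_k) + g(z_k)\, v_k, \\
\hat{x}_{k+1} &= \psi(z_{k+1}), \\
v_k &= g(z_k)^{-1}\bigl(\nu_k - f(z_k)\bigr)
  \;\Longrightarrow\; z_{k+1} = \nu_k .
\end{align}
```

Here $\nu_k$ is a virtual input that an outer linear controller could choose. In this notation, the linear baseline mentioned in the abstract would correspond to $f(z_k) = A z_k$ and $g(z_k) = B$ for constant matrices $A$ and $B$.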

2606.05037 2026-06-04 cs.SE cs.AI

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

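Arquimedes Canedo, Grama Chethan

Affiliations: Siemens Digital Industries Software, USA

AI summary: Proposes self-reflective APIs, which return machine-readable structured suggestions when validation fails so that an AI agent can repair the request and retry without external reasoning. On Anthropic models, this raises task-completion rate by 36.7 to 40.0 percentage points and improves per-success token efficiency by 1.8 to 2.2 times.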

Abstract

When an AI agent calls an API and hits a validation error, it needs more than what went wrong: it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot ($N=30$ per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by $+36.7$ to $40.0$ pp over plain-English diagnoses on Anthropic models (Fisher's exact $p \le 0.0022$), at $1.8$ to $2.2\times$ better per-success token efficiency. The lift is not significant on gpt-4o-mini ($p=0.435$); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
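The idea of a machine-readable recovery payload can be illustrated with a small sketch. Only the `recovery_feedback.suggestions[]` path comes from the abstract; the validator, the suggestion schema (`op`, `from`, `to`, `field`, `value`), the field names `qty`, `quantity`, and `currency`, and the retry helper are hypothetical and are not taken from the authors' repository.

```python
ALLOWED_CURRENCIES = {"USD", "EUR"}  # hypothetical enum for the example


def validate(request):
    """Hypothetical validator that attaches structured repair suggestions."""
    errors, suggestions = [], []
    if "quantity" not in request:
        errors.append("missing field: quantity")
        if "qty" in request:
            suggestions.append({"op": "rename", "from": "qty", "to": "quantity"})
    currency = request.get("currency")
    if currency not in ALLOWED_CURRENCIES:
        errors.append("invalid currency")
        if isinstance(currency, str) and currency.upper() in ALLOWED_CURRENCIES:
            suggestions.append(
                {"op": "set", "field": "currency", "value": currency.upper()})
    if errors:
        return {"ok": False, "errors": errors,
                "recovery_feedback": {"suggestions": suggestions}}
    return {"ok": True}


def apply_suggestions(request, suggestions):
    """Apply suggestions mechanically; no interpretation of error text needed."""
    repaired = dict(request)
    for s in suggestions:
        if s["op"] == "rename" and s["from"] in repaired:
            repaired[s["to"]] = repaired.pop(s["from"])
        elif s["op"] == "set":
            repaired[s["field"]] = s["value"]
    return repaired


def call_with_recovery(request, max_retries=2):
    """Retry loop: stop on success, on an empty suggestion list, or at the cap."""
    response = validate(request)
    attempts = 0
    while not response["ok"] and attempts < max_retries:
        suggestions = response["recovery_feedback"]["suggestions"]
        if not suggestions:
            break
        request = apply_suggestions(request, suggestions)
        response = validate(request)
        attempts += 1
    return request, response


fixed, result = call_with_recovery({"qty": 3, "currency": "usd"})
```

The contrast with a plain-English diagnosis is that the client applies each suggestion as data instead of parsing a sentence. When the validator cannot propose a repair, the list is empty and the loop stops rather than guessing.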

2606.05004 2026-06-04 cs.CR cs.AI

SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models

Peihua Mai, Xuanrong Gao, Youlong Ding, Xianglong Du, Wei Liu, Yan Pang

Affiliations: National University of Singapore (Chongqing) Research Institute; Chongqing Key Laboratory of Trusted Perception and Interaction Technology for Intelligent and Connected Vehicles; National University of Singapore; Hebrew University of Jerusalem; State Key Laboratory of Intelligent Vehicle Safety Technology, Chongqing, China; CHONGQING CHANGAN AUTOMOBILE Co., Ltd

AI summary: Proposes SharedRequest, a model-agnostic privacy-preserving inference framework that achieves efficient privacy protection through batch-level obfuscation and semantic grouping, with over 20% higher utility than differential privacy baselines and up to 5× lower query cost.

Comments accepted by ACL 2026 (main)

Abstract

With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increasingly critical issue. Existing privacy-preserving inference methods sacrifice either utility or efficiency, and often require model-specific modifications that limit their compatibility. In this paper, we propose SharedRequest, a model-agnostic framework for privacy-preserving LLM inference that reformulates privacy protection at the batch level rather than the individual-prompt level. The key idea is to obscure sensitive information by mixing original prompts with noisy variants, while grouping semantically equivalent instructions to amortize the inference cost over a large batch of queries with minimal impact on LLM response quality. This design is independent of the LLM architecture, requiring no access to model parameters or architectural modification. Empirical results demonstrate that SharedRequest achieves over $20\%$ higher utility compared to prior differential privacy baselines, and its shared-prompt mechanism reduces query cost by up to $5\times$ compared to non-batched inference.

2606.04989 2026-06-04 cs.HC cs.RO

What Can Eye Gaze Teach Us About Real-World Cycling? Insights From the Oxford RobotCycle Project

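Benjamin Hardin, Efimia Panagiotaki, Daniele De Martini, Lars Kunze

Affiliations: University of Oxford; University of the West of England

AI summary: Using wearable eye tracking glasses, this study analyzes eye gaze patterns across environments (bike lanes, car lanes, and shared bus lanes) and events (passes and pedestrians) to reveal subconscious differences in perceived danger while cycling, and assesses the potential of eye tracking for estimating cyclist stress and cognitive workload.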

Abstract

Although much is known about the physical danger of cycling situations, less is understood about the perceived danger of cycling. Furthermore, perception of danger may be filtered at a subconscious level and therefore be difficult to self-report. These subconscious perceptions can, however, be revealed through physiological metrics such as eye gaze. This paper explores the perceived safety of cycling in Oxford, United Kingdom, and examines the ability of wearable eye tracking glasses to produce insights about differences in perception under different environments and events. It finds that eye gaze patterns change between bike lanes, car lanes, and shared bus lanes, reflecting the different cognitive challenges of each lane type. It also shows that different intersections elicit significantly different eye gaze patterns, which may have implications for cyclist stress. Finally, eye gaze patterns differ in the presence of events such as passes and pedestrians in the road compared to cycling with no events. The paper draws conclusions on the benefits and limitations of using wearable eye trackers to estimate stress and cyclist workload.

2606.04967 2026-06-04 cs.SE cs.AI

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

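Sanderson Oliveira de Macedo

Affiliations: Federal Institute of Goias

AI summary: Proposes a six-dimension process taxonomy, scores and compares six AI software development frameworks against it, and reveals a structural trade-off between process depth and portability.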

Abstract

AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.

2606.04957 2026-06-04 cs.CR cs.IR cs.LG

NLLog: Lightweight, Explainable SOC Anomaly Detection via Log-to-Language Rewriting

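Samuel Ndichu, Tao Ban, Seiichi Ozawa, Takeshi Takahashi, Daisuke Inoue

Affiliations: University of Tokyo; National Institute of Information and Communications Technology

AI summary: Proposes the NLLog pipeline, which rewrites log templates into natural-language sentences, combines TF-IDF weighting with tree-ensemble classification, and uses TreeSHAP to provide explainable anomaly detection, achieving low false-positive rates and low latency on the HDFS, BGL, and AIT datasets.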

Comments 15 pages, 11 figures, 12 tables; submitted to ACSAC 2026

Abstract

System-generated logs underpin security monitoring, yet their rigid template-based format hinders both automated analysis and human comprehension. We present NLLog (Natural-Language Log), a lightweight pipeline that deterministically rewrites parsed templates into WHO-WHAT-SEVERITY sentences, pools them with term-frequency-inverse-document-frequency weighting, classifies sessions with tree ensembles, and back-projects evidence with TreeSHAP for analyst review. On Hadoop Distributed File System (HDFS) and Blue Gene/L (BGL) corpora, NLLog exceeds two reproduced matched-protocol baselines; across HDFS, BGL, and the AIT Alert Data Set, it sustains low false-positive rates with commodity-hardware latency suitable for security operations center triage. Coverage, sparse-versus-dense, faithfulness, and adversarial ablations show that fallback sufficiency is corpus-dependent, that an enrollment-time coverage check can surface refinement requirements before deployment, and that an auditable deterministic rewrite combined with lightweight dense encoding provides a measurable representation layer for log-anomaly detection and triage.

2606.04952 2026-06-04 cs.HC cs.CL

Clinical Assistant for Remote Engagement Link (CARE-link): A Web-Based Electronic Health Records Software for Managing Diabetes

Prince Ebenezer Adjei, Joshua Teye Tettey, Toufiq Musah, Audrey Agbeve, John Amuasi

Affiliations: Global One Health Research Group, Bernhard Nocht Institute of Tropical Medicine; Global Health and Infectious Diseases Research Group, Kumasi Centre for Collaborative Research in Tropical Medicine; Department of Computer Engineering, Kwame Nkrumah University of Science and Technology; Department of Global Health, School of Public Health, Kwame Nkrumah University of Science and Technology

AI summary: CARE-link is an open-source, web-based clinical support platform that connects clinicians and patients through LLM-mediated workflows to improve the management of gestational diabetes. The system aggregates patient-generated data from outside the hospital, provides clinical decision support, and offers patients explanations of management plans and lifestyle guidance through a WhatsApp interface.

2606.04946 2026-06-04 cs.DS cs.LG stat.ML

A General Framework for Dynamic Consistent Submodular Maximization

动态一致子模最大化的通用框架

Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam

发表机构 * ETH Zurich（苏黎世联邦理工学院）； KTH Royal Institute of Technology（皇家理工学院）； University of Toronto（多伦多大学）

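AI summary: For submodular maximization in the fully dynamic setting, the paper proposes a general algorithmic framework that yields the first constant-factor approximations with sublinear consistency.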

Comments Accepted at ICML 2026

AI中文摘要

一致性是动态子模最大化中的一个重要性质，它要求算法始终维持一个接近最优的解，并且在每一步只对解进行少量调整。先前的工作仅在仅插入的情况下探讨了这个问题，其中算法面临 $n$ 个插入的流，并建立了基数约束版本的下界和上界。我们在全动态设置中考虑这个问题，其中操作流可能同时包含插入和删除。我们开发了一个通用框架来设计该设置下的算法，并通过实例化得到了首个具有次线性一致性的常数因子近似。对于基数约束，我们提出了一个 $\frac 12 - O(\varepsilon)$ 近似，其一致性为 $O\left(\frac{1}{\varepsilon^2}\right)$。对于秩-$k$ 拟阵约束，我们构造了一个 $\frac 14 - O(\varepsilon)$ 近似于动态最优解，其一致性为 $O\left(\frac{\log k}{\varepsilon^2}\right)$。

英文摘要

Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of $n$ insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream of operations may contain both insertions and deletions. We develop a general framework for designing algorithms for this setting, and instantiate it to obtain the first constant-factor approximations with sublinear consistency. For cardinality constraints, we propose a $\frac 12 - O(\varepsilon)$ approximation that is $O\left(\frac{1}{\varepsilon^2}\right)$ consistent. For rank-$k$ matroid constraints, we construct a $\frac 14 - O(\varepsilon)$ approximation to the dynamic optimum that is $O\left(\frac{\log k}{\varepsilon^2}\right)$ consistent.
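
To make the notion of consistency concrete, the following is a minimal sketch, not the paper's algorithm. It replays a stream of insertions and deletions, naively recomputes a greedy solution for a coverage objective after each update, and records how many elements of the solution change per step. The coverage objective, the function names, and the use of the symmetric difference as the change count are all assumptions made for illustration.

```python
def coverage(sets, chosen):
    """Coverage value (a submodular function): number of distinct elements covered."""
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


def greedy(sets, k):
    """Standard greedy for a cardinality constraint k; ties are broken by name."""
    chosen = []
    for _ in range(min(k, len(sets))):
        candidates = [n for n in sorted(sets) if n not in chosen]
        best = max(candidates, key=lambda n: coverage(sets, chosen + [n]))
        chosen.append(best)
    return set(chosen)


def replay(stream, k):
    """Recompute greedy after every update and record the per-step change count."""
    sets, solution, changes = {}, set(), []
    for op, name, elems in stream:
        if op == "insert":
            sets[name] = set(elems)
        else:  # "delete"
            sets.pop(name, None)
        new_solution = greedy(sets, k)
        # Hypothetical consistency measure: size of the symmetric difference.
        changes.append(len(solution ^ new_solution))
        solution = new_solution
    return solution, changes
```

Recomputing from scratch offers no bound on the per-step change count. The abstract's contribution is to keep that count small (for example $O(1/\varepsilon^2)$ under a cardinality constraint) while retaining a constant-factor approximation.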

2606.04909 2026-06-04 cs.IR cs.CL

BEATS: Bootstrapping E-commerce Attribute Taxonomies for Search through Iterative Human-AI Collaboration

BEATS: 通过迭代人机协作引导电商搜索属性分类

Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang, Yun-Nung Chen

Affiliations: National Taiwan University; Rakuten Group, Inc.; Taiwan Rakuten Ichiba, Inc.; Rakuten Asia Pte. Ltd.

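AI summary: To address the lack of structured attribute schemas on e-commerce platforms in emerging markets, the paper proposes BEATS, a human-in-the-loop LLM pipeline that builds product attribute taxonomies from scratch and improves search system performance through attribute tagging.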

Comments 6 pages, 1 figure, 5 tables. Accepted to SIGIR 2026 Industry Track. Official version: https://doi.org/10.1145/3805712.3808520

DOI: 10.1145/3805712.3808520

AI中文摘要

新兴市场的电商平台通常使用欠发达的产品目录，仅包含类别分类而缺乏结构化属性模式。缺乏细粒度产品属性限制了搜索能力——阻碍分面过滤、降低查询理解、削弱搜索系统使用的语义表示。我们提出BEATS，一种人机协作的LLM框架，用于从零开始引导产品属性分类。我们的方法扩展了一个多阶段LLM生成流水线，包含两个关键生产阶段：(1) 模型开发者主动进行质量检查以过滤错误输出，以及(2) 领域专家本地工作人员进行人工标注以验证生成的属性。该框架迭代运行——每个生成阶段的提示基于质量检查观察和标注者在连续轮次中的反馈进行优化，逐步提高属性质量。一旦属性分类建立，我们使用LLM对单个产品项目进行结构化属性标注，丰富其上下文表示。丰富的目录直接有益于搜索系统的多个组件：实现细粒度基于属性的过滤、为排序模型提供结构化特征、改善密集检索的语义表示。我们通过在属性丰富的产品数据上训练密集检索模型来验证生成的分类，证明相对于使用原始目录信息的基线有一致的改进。我们的系统已在台湾乐天部署，丰富了9个主要类别，涵盖2,694个子类别，生成了67,277个属性，超过540万产品已使用生成的属性进行标注，并计划丰富整个产品目录。

英文摘要

E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities -- preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively -- prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.

2606.04903 2026-06-04 cs.LO cs.AI cs.MA cs.PL

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

基于人类编写本体的可审计且安全的LLM智能体

Aaron Sterling

发表机构 * Thistleseeds

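AI summary: Proposes the Agentic Redux architecture, uses a typed λ-calculus to prove that its execution semantics are correct and its decisions auditable on suitable domains, and introduces an ontology-first approach to agent design.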

2606.04877 2026-06-04 cs.LO cs.AI cs.PL cs.SE

Abduction Prover in Isabelle/HOL

Isabelle/HOL中的溯因证明器

Yutaka Nagashima, Daniel Sebastian Goc

发表机构 * Institute of Computer Science, the Czech Academy of Sciences（捷克科学院计算机科学研究所）

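AI summary: To address the low degree of automation in proof assistants based on expressive logics, the paper proposes an abduction prover for Isabelle/HOL that uses abductive reasoning to identify useful conjectures and automatically construct proof scripts.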

Comments Accepted to Isabelle2026

2606.04845 2026-06-04 stat.ML cs.LG math.ST stat.CO stat.TH

Bayesian learning for the stochastic shortest path problem

随机最短路径问题的贝叶斯学习

Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

Affiliations: Department of Engineering, University of Cambridge, UK; School of Mathematics and Physics, University of Wollongong, Wollongong, Australia

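AI summary: For the stochastic shortest path problem, the paper proposes a Bayesian framework that constructs the posterior distribution of the optimal action-value function Q* directly through Bellman's optimality equations, and addresses the unidentifiability introduced by relaxing the likelihood, enabling uncertainty quantification and data-efficient learning.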

Comments 50 pages, 19 figures

AI中文摘要

序列决策问题通常被建模为马尔可夫决策过程（MDP）。我们关注随机最短路径（SSP）问题，这是一个具有吸收终止状态的无限水平无折扣MDP。我们开发了一个贝叶斯框架，通过与决策任务的交互来学习最优决策策略。具体来说，我们学习最优动作价值函数$Q^*$，但与许多现有的贝叶斯方法不同，我们不依赖于不现实的建模假设和临时近似。我们的方法是通过贝尔曼最优方程直接构建$Q^*$的后验信念。对于确定性奖励，我们将后验描述为具有流形密度的分布。为了简化推理，我们放松了似然，使得勒贝格密度存在。但这样做的代价是产生不可识别性问题。具体来说，放松后的后验可能在不当决策规则上有显著质量，而精确后验则不会。我们还计算了$Q^*$的表格参数化、高斯似然放松和高斯先验下最优动作选择的精确后验概率，这在基准测试研究中很有用。对深海基准测试变体的数值研究验证了我们的发现。我们证明了我们的框架能够忠实地量化不确定性，并且与其他基于时间差分的贝叶斯方法相比，数据效率更高。最后，我们对未来工作提出了建议。

英文摘要

Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task. Specifically, we learn the optimal action-value function $Q^*$, but unlike many existing Bayesian approaches, we do not rely on unrealistic modelling assumptions and ad-hoc approximations. Our approach is to directly construct the posterior beliefs for $Q^*$ through Bellman's optimality equations. For deterministic rewards, we characterise the posterior as a distribution with a manifold density. To facilitate simpler inference, we relax the likelihood so that a Lebesgue density exists. The flip side is to create unidentifiability issues. Specifically, the relaxed posterior can have significant mass on improper decision rules, while the exact posterior will not. We also calculate the exact posterior probabilities for optimal action selections for the tabular parametrisation of $Q^*$, a Gaussian likelihood relaxation and a Gaussian prior, which is useful in benchmarking studies. Numerical studies on variants of the Deep Sea benchmark verify our findings. We demonstrate that our framework faithfully quantifies uncertainty and, compared to other temporal-difference-based Bayesian methodologies, is more data efficient. We conclude with recommendations for future work.
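
For reference, the Bellman optimality equations that the posterior is built on can be written as below for an undiscounted SSP with deterministic rewards. The notation (reward $r$, transition kernel $P$, terminal set $\mathcal{T}$) is assumed here for illustration and is not taken from the paper.

```latex
Q^*(s,a) = r(s,a) + \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^*(s',a'),
\qquad
Q^*(s,a) = 0 \quad \text{for all } s \in \mathcal{T}.
```

There is no discount factor, so the equations are well posed only because every trajectory eventually reaches an absorbing terminal state.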

2606.04769 2026-06-04 cs.CR cs.AI cs.SE

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

现实世界 MCP 服务器中的描述-代码不一致性：测量、检测与安全影响

Yutao Shi, Xiaohan Zhang, Xiangjing Zhang, Xihua Shen, Hui Ouyang, Huming Qiu, Mi Zhang, Min Yang

发表机构 * University of Science and Technology of China（中国科学技术大学）

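AI summary: To address inconsistencies between tool descriptions and code implementations in MCP servers, the paper proposes DCIChecker, an automated detection framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method, and reports a 9.93% inconsistency rate on a large-scale dataset together with the resulting security risks.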

Comments Preprint

AI中文摘要

模型上下文协议 (MCP) 已成为赋能大型语言模型 (LLM) 使用外部工具的关键标准。在此生态系统中，LLM 依赖 MCP 服务器提供的自然语言描述来选择和执行函数。这种交互隐含假设工具描述忠实地反映了其底层实现，而该假设在实践中并未得到强制验证。因此，MCP 部署可能遭受名为描述-代码不一致性 (DCI) 的问题，即工具对其能力和安全边界的描述与代码实际行为不一致。本文对现实世界 MCP 服务器中的 DCI 进行了全面研究。我们正式定义了该问题，并提出了一个涵盖功能不一致和未声明副作用的综合分类法。在此分类法指导下，我们开发了 DCIChecker，一个自动框架，结合结构感知静态分析与 Direct-Reverse-Arbitration 提示方法，交叉验证工具描述与实际代码实现。我们将该框架应用于一个大规模数据集，包含从 2,214 个现实世界 MCP 服务器中提取的 19,200 个描述-代码对。我们的测量揭示 DCI 普遍存在，其中 9.93% 的对存在不一致。我们进一步证明 DCI 造成了关键的防御盲点，助长了从操作故障到隐蔽恶意行为等多种风险。最后，我们提出了缓解策略以强制语义一致性并增强新兴智能体生态系统的可靠性。

英文摘要

The Model Context Protocol (MCP) has emerged as a critical standard empowering Large Language Models (LLMs) to utilize external tools. In this ecosystem, LLMs rely on natural language descriptions provided by MCP servers to select and execute functions. This interaction implicitly assumes that tool descriptions faithfully reflect their underlying implementations, while this assumption is not mandatorily verified in practice. As a result, MCP deployments may suffer from a problem named Description-Code Inconsistency (DCI), where a tool's description of its capabilities and security boundaries is not consistent with what the code actually does. In this paper, we present a comprehensive study of DCI in real-world MCP servers. We formally define the problem and propose a comprehensive taxonomy spanning functionality inconsistencies and undeclared side effects. Guided by this taxonomy, we develop DCIChecker, an automated framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method to cross-validate tool descriptions against actual code implementations. We apply this framework to a large-scale dataset comprising 19,200 description-code pairs extracted from 2,214 real-world MCP servers. Our measurement reveals that DCI is widespread, with 9.93% of these pairs exhibiting inconsistencies. We further demonstrate that DCI creates a critical defense blind spot, facilitating varied risks from operational failures to stealthy malicious behaviors. Finally, we propose mitigation strategies to enforce semantic consistency and enhance the reliability of the emerging agentic ecosystem.

2606.04757 2026-06-04 math.OC cs.LG

Near-Optimal Decentralized Stochastic Convex Optimization over Networks

网络上的近最优去中心化随机凸优化

Nitai Kluger, Amit Attia, Tomer Koren

Affiliations: Blavatnik School of Computer Science, Tel Aviv University; Google Research Tel Aviv

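AI summary: For decentralized stochastic smooth convex optimization, the paper proposes an accelerated decentralized method that, under a total budget of N gradient samples, raises the supported number of workers to M≲√ρ N^{3/4}, and shows that this is optimal up to logarithmic factors.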

Comments 12 papers

AI中文摘要

我们研究去中心化随机光滑凸优化，其中$M$个工作者使用局部随机梯度并通过固定八卦网络上的仅邻居通信来最小化平均目标。该设置中的一个核心问题是，在总梯度样本预算为$N$的情况下，确定可以使用的最大工作者数量，同时仍保持集中式$O(1/\sqrt N)$统计速率。我们引入了一种加速去中心化方法，该方法在最多$\smash{M\lesssim \sqrt\rho\,N^{3/4}}$个工作者时保持该速率，其中$\rho$是八卦网络的谱间隙，改进了先前最佳的最大缩放$\smash{M\lesssim \rho\sqrt N}$。该方法基于一步延迟随机加速方案，使工作者能够将小批量与加速八卦交错进行，同时控制残差分歧，其保证仅对数依赖于最优-局部异质性。我们还为线性跨度去中心化一阶方法建立了匹配的下界，表明该方法在对数因子内是最优的。

英文摘要

We study decentralized stochastic smooth convex optimization, where $M$ workers minimize an average objective using local stochastic gradients and neighbor-only communication over a fixed gossip network. A central question in this setting is to determine the largest number of workers that can be used under a total budget of $N$ gradient samples while still preserving the centralized $O(1/\sqrt N)$ statistical rate. We introduce an accelerated decentralized method that preserves this rate for up to $\smash{M\lesssim \sqrt{\rho}\,N^{3/4}}$ workers, where $\rho$ is the spectral gap of the gossip network, improving the best prior maximal scaling of $\smash{M\lesssim \rho\sqrt N}$. The method is based on a one-step-delayed stochastic acceleration scheme that enables workers to interleave minibatching with accelerated gossip while controlling residual disagreement, and its guarantee depends only logarithmically on the optimum-local heterogeneity. We also establish a matching lower bound for linear-span decentralized first-order methods, showing that the method is optimal up to logarithmic factors.
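
The gap between the two worker scalings is easiest to see numerically. The sketch below evaluates both bounds with all constants and logarithmic factors dropped, which is an assumption, so the outputs indicate orders of magnitude only. The function names are hypothetical.

```python
import math


def prior_max_workers(rho, n_samples):
    """Prior scaling M ~ rho * sqrt(N), constants ignored."""
    return rho * math.sqrt(n_samples)


def new_max_workers(rho, n_samples):
    """Scaling from the abstract, M ~ sqrt(rho) * N^(3/4), constants ignored."""
    return math.sqrt(rho) * n_samples ** 0.75


def improvement_factor(rho, n_samples):
    """Ratio of the two scalings, which simplifies to N^(1/4) / sqrt(rho)."""
    return new_max_workers(rho, n_samples) / prior_max_workers(rho, n_samples)
```

With a spectral gap of 0.01 and a budget of 10^8 samples, the prior scaling allows on the order of 10^2 workers and the new one on the order of 10^5. The ratio grows both with the sample budget and as the network becomes less well connected.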

2606.04755 2026-06-04 hep-ex cs.AI cs.IR

Archi: Agentic Operations at the CMS Experiment

Archi: CMS实验中的代理操作

Pietro Lugato, Luca Lavezzo, Jason Mohoney, Hasan Ozturk, Muhammad Hassan Ahmed, Juan Pablo Salas, Viphava Ohm, Krittin Phornsiricharoenphant, Gabriele Benelli, Mariarosaria D'Alfonso, Manasvita Joshi, Warren Nam, Aron Soha, Samantha Sunnarborg, Austin Swinney, Jack Tucker, Dmytro Kovalskyi, Tim Kraska, Christoph Paus

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； CMS Collaboration（CMS合作组）； CERN（欧洲核子研究中心）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）； Fermi National Accelerator Laboratory（费米国家加速器实验室）； Brown University（布朗大学）； Harvard University（哈佛大学）

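AI summary: Presents Archi, an open-source framework that integrates heterogeneous data sources and deploys configurable, private agents to support computing operations at the CMS experiment, and that proves effective on real-world queries.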

AI中文摘要

我们提出Archi，一个面向科学合作的开源端到端框架，它结合了异构数据源的系统化摄取和组织，以及可配置、私有且可扩展的代理的部署，这些代理能够检索和推理这些数据。自2026年2月起，Archi的一个实例已部署在CERN大型强子对撞机的CMS实验计算操作团队中，作为技术操作员的辅助代理，通过结合文档、历史数据和实时监控系统提供检索和分析能力。我们根据操作员反馈和从生产使用中收集的问题集对系统进行评估，这些问题由人工和自动化专家组评分。该系统在操作任务中证明有效，解决了CMS操作员提出的真实世界查询。我们还观察到，本地托管的开源权重模型表现具有竞争力，从而能够对敏感数据进行完全私有管理。

英文摘要

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

2606.04739 2026-06-04 cs.SE cs.AI

Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models

重新审视Vul-RAG：基于RAG的漏洞检测的可复现性与可复制性——使用开放权重模型

Sabrina Kaniewski, Fabian Schmidt, Tobias Heer

发表机构 * Institute for Secure Networked Systems, Esslingen University（安全网络系统研究所，埃斯林根大学）； Institute for Intelligent Systems, Esslingen University（智能系统研究所，埃斯林根大学）

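AI summary: Using local deployment and a range of open-weight models, the study reproduces and extends the Vul-RAG framework, finding that performance plateaus at about 0.30 pairwise accuracy and that gains in model capability do not substantially improve it.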

Comments Accepted at AI&CCPS 2026 workshop, co-located with the 21st International Conference on Availability, Reliability and Security (ARES 2026). This is the authors' preprint version

AI中文摘要

大型语言模型（LLMs）在自动化软件漏洞检测方面展现出强大潜力，尤其是在检索增强生成（RAG）设置中。然而，对于依赖专有模型和API的方法，可复现性和可复制性在很大程度上仍未得到探索，这引发了一个问题：报告的结果是否具有普遍性，还是主要依赖于特定的模型选择。在这项工作中，我们对Vul-RAG进行了可复现性研究，Vul-RAG是一个基于RAG的源代码漏洞检测框架，它利用高级漏洞知识增强LLMs。我们首先使用报告中的开放权重基线模型，在完全本地和开放权重的设置下复现了结果。然后，我们将评估扩展到一组多样化的最新开放权重LLMs，包括代码专用、通用和推理模型，参数规模各异。结果证实，Vul-RAG的发现可以在本地部署下复现，但存在微小偏差。在所有评估的模型中，我们观察到性能在约0.30成对准确率（即漏洞函数和修补函数都被正确分类的代码对）处达到平台期。值得注意的是，即使对于更新更先进的模型，这一平台期仍然存在，表明仅凭模型能力的提升并不能显著提高性能。最后，我们讨论了检测效果、模型能力和模型规模之间的实际影响和权衡。实现和评估工件可在 https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG 公开获取。

英文摘要

Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local and open-weights setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.
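
The pairwise accuracy metric defined in the abstract can be sketched as follows. A pair counts as correct only when the vulnerable function is flagged and its patched counterpart is not. The input representation, a list of boolean prediction pairs, is an assumption for illustration and is not the format used in the Vul-RAG artifacts.

```python
def pairwise_accuracy(predictions):
    """Fraction of code pairs with both functions classified correctly.

    predictions: list of (flagged_vulnerable, flagged_patched) booleans, where
    True means the detector labelled that function as vulnerable.
    """
    if not predictions:
        raise ValueError("at least one code pair is required")
    correct = sum(
        1
        for flagged_vulnerable, flagged_patched in predictions
        if flagged_vulnerable and not flagged_patched
    )
    return correct / len(predictions)
```

A detector that flags everything, or nothing, scores 0 under this metric, so it is much stricter than per-function accuracy.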

2606.04689 2026-06-04 quant-ph cs.LG

QPredSGG: Hybrid Quantum Predicate Learning for Long-Tailed Scene Graph Generation

QPredSGG：面向长尾场景图生成的混合量子谓词学习

Prerana Ramkumar, Nouhaila Innan, Muhammad Shafique

Affiliations: Department of Computer Science, University of Waterloo; Machine Learning Research Group, University of Waterloo

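AI summary: To counter the classification bias caused by long-tailed predicate distributions in scene graph generation, the paper replaces the classical predicate head with a Quantum Predicate Head (QP-Head) that compresses features using amplitude embedding and strongly entangling layers, achieving parameter-efficient long-tail relation classification on Visual Genome 150.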

Comments 11 pages, 5 figures

AI中文摘要

场景图生成（SGG）需要对物体及其交互进行关系推理，但性能常受严重的长尾谓词不平衡限制。经典SGG模型通常依赖数据集统计，导致预测偏向频繁关系而非细粒度语义谓词。尽管现有去偏策略提高了平均召回率，但当前框架中的谓词分类仍常依赖参数成本高的大型经典决策模块。本文通过用加权交叉熵训练的量子谓词头（QP-Head）替换因果特征增强网络（CFEN）中的经典谓词头，引入了一种用于SGG的混合量子谓词分类器。据我们所知，这是首批评估混合量子架构在Visual Genome 150上进行场景图谓词分类的研究之一。我们研究了量子比特数、编码策略、纠缠结构和电路深度对关系预测的影响。最佳4量子比特QP-Head使用振幅嵌入和强纠缠层将4096维对特征压缩为16维量子兼容表示，对应256倍缩减。它实现了57.25%的mR@100，而经典CFEN参考为41.1%，同时仅使用96个可训练量子参数。扩展到8量子比特保持了强大的长尾性能，达到55.38%的mR@100，使用384个量子参数，而深度分析显示了表达能力和运行时间开销之间的权衡。这些结果表明，紧凑的混合量子谓词头可以支持复杂视觉推理任务中参数高效的长尾关系分类。

英文摘要

Scene Graph Generation (SGG) requires relational reasoning over objects and their interactions, but performance is often limited by severe long-tail predicate imbalance. Classical SGG models frequently rely on dataset statistics, leading to biased predictions toward frequent relations rather than fine-grained semantic predicates. Although existing debiasing strategies improve mean recall, predicate classification in current frameworks still often depends on large classical decision modules with high parameter cost. This work introduces a hybrid quantum predicate classifier for SGG by replacing the classical predicate head in Causal Feature Enhancement Network (CFEN) with a Quantum Predicate Head (QP-Head) trained using weighted cross-entropy. To the best of our knowledge, this is among the first studies to evaluate a hybrid quantum architecture for scene graph predicate classification on Visual Genome 150. We study the effect of qubit count, encoding strategy, entangling structure, and circuit depth on relational prediction. The best 4-qubit QP-Head uses Amplitude Embedding and Strongly Entangling Layers to compress 4096-dimensional pair features into a 16-dimensional quantum-compatible representation, corresponding to a 256$\times$ reduction. It achieves an mR@100 of 57.25%, compared with 41.1% for the classical CFEN reference, while using only 96 trainable quantum parameters. Scaling to 8 qubits maintains strong long-tail performance, reaching an mR@100 of 55.38% with 384 quantum parameters, while the depth analysis shows a trade-off between expressibility and runtime overhead. These results suggest that compact hybrid quantum predicate heads can support parameter-efficient long-tail relational classification in complex visual reasoning tasks.
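
The dimension and parameter figures in the abstract can be checked with a few lines of arithmetic. Amplitude embedding on n qubits encodes a vector of length 2^n, which gives the 16-dimensional representation and the 256× reduction. The layer counts computed below are an inference and are not stated in the abstract: they assume three trainable rotation angles per qubit per strongly entangling layer.

```python
def amplitude_dim(n_qubits):
    """Length of the state vector that amplitude embedding fills on n qubits."""
    return 2 ** n_qubits


def compression_factor(input_dim, n_qubits):
    """How many times smaller the encoded representation is than the input."""
    return input_dim // amplitude_dim(n_qubits)


def implied_layers(n_params, n_qubits, angles_per_qubit=3):
    """Layer count implied by a parameter budget (assumes 3 angles per qubit per layer)."""
    layers, remainder = divmod(n_params, n_qubits * angles_per_qubit)
    if remainder:
        raise ValueError("parameter count is not a whole number of layers")
    return layers
```

Under that assumption, 96 parameters on 4 qubits correspond to 8 layers and 384 parameters on 8 qubits to 16 layers.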

2606.04680 2026-06-04 eess.AS cs.CL cs.SD

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

读你所听：基于声学差异的无参考假设评估

Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China（X-LANCE实验室、计算机科学学院、上海交通大学、中国）； MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China（人工智能MOE重点实验室、江苏省语言计算重点实验室、中国）

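AI summary: Proposes the READ metric, which uses a pretrained autoregressive TTS model to compute the acoustic discrepancy between speech and text hypotheses, evaluates ASR hypotheses without reference transcripts, and achieves up to 20% relative error-rate reduction under noisy conditions.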

Comments Submitted to Interspeech 2026. 6 pages, 4 figures

2606.04670 2026-06-04 math.NA cs.LG cs.MS cs.NA

Fitting scattered data with optional monotonicity constraints on GPU: LipFit package

在GPU上拟合带有可选单调性约束的散乱数据：LipFit包

Gleb Beliakov

Affiliations: School of Information Technology, Deakin University

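AI summary: Proposes a method for interpolating and approximating multivariate scattered data that produces optimal Lipschitz-continuous approximations subject to monotonicity constraints, implemented in LipFit, a GPU-parallelized Python package.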

2606.04658 2026-06-04 cs.NE cs.LG

U-Net-Accelerated Quality-Diversity Optimization for Climate-Adaptive Urban Layouts

U-Net加速的质量-多样性优化用于气候适应性城市布局

Alexander Hagg, Tania Guerrero, Dirk Reith

Affiliations: Institute of Technology, Resource and Energy-efficient Engineering (TREE); Bonn-Rhein-Sieg University of Applied Sciences; Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)

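AI summary: Proposes replacing a slow physics simulator with a U-Net surrogate model inside an offline MAP-Elites algorithm to rapidly generate thousands of diverse, climate-evaluated building layouts.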

AI中文摘要

优化城市布局以适应气候需要在建筑密度与冷空气通风之间取得平衡。由于基于物理的气候模拟计算成本高昂，规划者通常只能评估少于十个手动设计方案。质量-多样性（QD）算法提供了一种系统性地照亮设计空间的方法，但需要代理模型才能实用。在本文中，我们用一个空间深度学习代理（U-Net）替换了缓慢的监管物理模拟器，并将其嵌入离线MAP-Elites循环中。我们系统地比较了这种空间方法与传统的高斯过程（GP）代理在不同训练数据策略（准随机Sobol采样 vs. 主动QD自举）下的表现。结果表明，标量GP代理在随机样本上训练时灾难性地失败，需要昂贵的、主动生成的QD存档才能泛化。相比之下，U-Net的空间归纳偏置使其能够稳健地学习底层物理映射（R² = 0.996），完全独立于训练数据来源。这使得离线QD优化仅需一次性随机训练样本批次即可实现高度准确的适应度排名（ρ = 0.994）。最终流程部署在开源OpenSKIZZE工具中，能在十分钟内生成数千个多样化且经气候评估的建筑布局。

英文摘要

Optimizing urban layouts for climate adaptation requires balancing building density with cold-air ventilation. Because physics-based climate simulations are computationally expensive, planners typically evaluate fewer than ten manual designs. Quality-Diversity (QD) algorithms offer a way to systematically illuminate the design space, but they require surrogate models to be practical. In this paper, we replace a slow, regulatory physics simulator with a spatial deep-learning surrogate (U-Net) inside an offline MAP-Elites loop. We systematically compare this spatial approach with a traditional Gaussian process (GP) surrogate across different training-data strategies (quasi-random Sobol sampling vs. active QD bootstrapping). Our results reveal that scalar GP surrogates fail catastrophically when trained on random samples, requiring expensive, actively generated QD archives to generalize. In contrast, the spatial inductive bias of the U-Net allows it to learn the underlying physics mapping robustly ($R^2 = 0.996$), completely independent of the training data source. This allows offline QD optimization to achieve highly accurate fitness rankings ($\rho = 0.994$) using only a one-time batch of random training samples. The resulting pipeline, deployed in the open-source OpenSKIZZE tool, generates thousands of diverse, climate-evaluated building layouts in under ten minutes.

2606.04603 2026-06-04 cs.IR cs.LG stat.ML

Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval

面向不确定性感知检索的分布近似最近邻搜索

Olivier Jeunen

发表机构 * Antwerp, Belgium（比利时安特卫普）

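AI summary: Proposes the DINOSAUR framework, which samples multiple embeddings per item to build the index and samples the user embedding at retrieval time, implicitly marginalizing over embedding uncertainty and improving coverage of long-tail items without changes to model architecture or index infrastructure.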

AI中文摘要

近似最近邻搜索索引构成了现实世界推荐系统的骨干，支持在百万级物品目录上进行实时候选检索。通常，为每个用户和每个物品学习一个点估计嵌入。在服务时，用户嵌入查询索引以获取相关物品。由于这些表示是从稀疏交互数据中学习的，它们带有噪声，可能无法捕捉所有有助于“相关性”的细微差别——忽略了其固有的基本不确定性。结果是检索管道系统性地偏向于少数嵌入估计良好的热门头部物品，而牺牲了长尾中多数小众、多样和偶然的内容。我们提出了DINOSAUR（面向不确定性感知检索的分布近似最近邻搜索）：一个简单且与基础设施兼容的框架，将嵌入不确定性纳入候选生成。DINOSAUR不为点估计建立索引，而是为每个物品采样$S_i$个嵌入，并在这一增强集上构建索引。类似地，在查询时，对用户嵌入进行采样。这种双边的随机检索过程隐式地边缘化了嵌入不确定性，无需改变模型架构或ANN索引基础设施。在分析方面，我们展示了当不确定性消失时，DINOSAUR恢复标准的点估计检索，并刻画了增加的嵌入方差如何扩展不确定物品可检索的潜在空间区域。可重复的实证观察与这些预期一致，显示出在离线召回率小幅损失的情况下，覆盖率大幅提升。

英文摘要

Approximate Nearest Neighbour search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to ``relevance'' -- ignoring the fundamental uncertainty that is inherent to them. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content. We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings per item and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure. On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
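
The two-sided sampling idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the author's implementation. It uses isotropic Gaussian noise around each embedding mean, inner-product scoring, and an exact brute-force scan standing in for a real ANN index. All function names are hypothetical.

```python
import random


def sample_embedding(mean, std, rng):
    """Draw one embedding from an isotropic Gaussian around the point estimate."""
    return [rng.gauss(m, std) for m in mean]


def build_index(items, samples_per_item, rng):
    """Index several sampled embeddings per item instead of one point estimate.

    items: dict mapping item_id -> (mean_vector, std).
    """
    index = []
    for item_id, (mean, std) in items.items():
        for _ in range(samples_per_item):
            index.append((item_id, sample_embedding(mean, std, rng)))
    return index


def retrieve(index, user_mean, user_std, k, rng):
    """Sample a user embedding, then return the top-k distinct items by inner product."""
    query = sample_embedding(user_mean, user_std, rng)
    ranked = sorted(
        index,
        key=lambda entry: -sum(q * x for q, x in zip(query, entry[1])),
    )
    result = []
    for item_id, _ in ranked:
        if item_id not in result:
            result.append(item_id)
        if len(result) == k:
            break
    return result
```

Setting every standard deviation to zero makes all samples equal to their means, so the sketch reduces to ordinary point-estimate retrieval. This matches the limiting behaviour stated in the abstract.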

2606.04594 2026-06-04 cs.DC cs.AI cs.SE

Ekka: Automated Diagnosis of Silent Errors in LLM Inference

Ekka: LLM推理中静默错误的自动诊断

Yile Gu, Zhen Zhang, Shaowei Zhu, Xinwei Fu, Jun Wu, Yida Wang, Baris Kasikci

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）

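AI summary: Proposes Ekka, a system that automatically diagnoses silent errors in LLM inference frameworks by aligning and comparing intermediate execution states via differential debugging, reaching 80% pass@1 and 88% pass@5 diagnosis accuracy on a benchmark of real-world errors.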

Comments ICML 2026

AI中文摘要

LLM服务框架随着复杂的软件栈和大量优化而快速发展。快速开发过程可能引入静默错误，即输出质量在没有任何显式错误信号的情况下悄然下降。由于高层症状与底层根本原因之间存在巨大的语义鸿沟，诊断静默错误非常困难。我们观察到，通过利用语义正确的参考实现，静默错误的诊断可以有效地构建为差分调试问题。我们提出了Ekka，一个自动诊断系统，通过系统地对齐和比较目标框架与参考框架之间的中间执行状态来识别根本原因。我们构建了一个来自流行服务框架的真实静默错误基准，Ekka显示出80%的pass@1诊断准确率和88%的pass@5诊断准确率，优于现有系统。Ekka还诊断了服务框架中的4个新静默错误，所有错误均已得到开发者确认。

英文摘要

LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.
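
The differential-debugging framing can be illustrated with a toy comparison of two execution traces. The abstract does not describe how Ekka aligns states, so the trace format (an ordered list of named stages, each with numeric values) and the tolerance are assumptions, and the function name is hypothetical.

```python
def first_divergence(reference, target, tol=1e-3):
    """Return the first stage where the target trace departs from the reference.

    reference, target: lists of (stage_name, values) in execution order, assumed
    to be already aligned stage by stage. Returns None if no stage diverges.
    """
    for (ref_name, ref_vals), (tgt_name, tgt_vals) in zip(reference, target):
        if ref_name != tgt_name:
            raise ValueError("traces are not aligned at stage " + ref_name)
        if len(ref_vals) != len(tgt_vals):
            return ref_name  # a shape mismatch already counts as a divergence
        max_error = max(
            (abs(r - t) for r, t in zip(ref_vals, tgt_vals)), default=0.0
        )
        if max_error > tol:
            return ref_name
    return None
```

Reporting the earliest diverging stage matters because every later stage inherits the error. The final output alone says little about where the root cause sits.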
