arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2412.18980 2026-06-19 cs.LG 版本更新

Evaluating deep learning models for fault diagnosis of a rotating machinery with epistemic and aleatoric uncertainty

评估深度学习模型在旋转机械故障诊断中的认知不确定性和偶然不确定性

Reza Jalayer, Masoud Jalayer, Andrea Mor, Carlotta Orsenigo, Carlo Vercellis

发表机构 * Faculty of Engineering and Natural Sciences（工程与自然科学学院）； Department of Information and Communications Engineering（信息与通信工程系）； Department of Management, Economics and Industrial Engineering（管理、经济与工业工程系）

AI总结本文首次全面比较了不确定性感知深度学习架构在旋转机械故障诊断中的表现，发现深度集成模型在检测未知故障和噪声数据方面优于其他方法。

详情

AI中文摘要

不确定性感知深度学习模型最近在故障诊断中受到关注，作为一种在来自未见故障（认知不确定性）或噪声存在（偶然不确定性）的分布外数据出现时促进可靠故障检测的方法。在本文中，我们首次对旋转机械故障诊断中最先进的不确定性感知深度学习架构进行了全面比较研究，其中研究了受认知不确定性影响的不同场景和不同类型的偶然不确定性。所选架构包括通过dropout采样、贝叶斯神经网络和深度集成。此外，为了区分不同场景中的分布内和分布外数据，我们交替应用了两个不确定性阈值，其中一个是在本文中引入的。我们的实证结果为必须部署实际不确定性感知故障诊断系统的从业者和研究人员提供了指导。特别是，它们揭示了在存在认知不确定性的情况下，所有深度学习模型都能够有效地检测到平均而言所有场景中相当一部分分布外数据。然而，深度集成模型显示出优越的性能，与用于区分的阈值无关。在存在偶然不确定性的情况下，噪声水平起着重要作用。具体来说，低噪声水平阻碍了模型有效检测分布外数据的能力。即使在这种情况下，深度集成模型也表现出较温和的性能下降，主导其他模型。这些成就，加上它们更短的推理时间，使得深度集成架构成为首选。

英文摘要

Uncertainty-aware deep learning (DL) models recently gained attention in fault diagnosis as a way to promote the reliable detection of faults when out-of-distribution (OOD) data arise from unseen faults (epistemic uncertainty) or the presence of noise (aleatoric uncertainty). In this paper, we present the first comprehensive comparative study of state-of-the-art uncertainty-aware DL architectures for fault diagnosis in rotating machinery, where different scenarios affected by epistemic uncertainty and different types of aleatoric uncertainty are investigated. The selected architectures include sampling by dropout, Bayesian neural networks, and deep ensembles. Moreover, to distinguish between in-distribution and OOD data in the different scenarios two uncertainty thresholds, one of which is introduced in this paper, are alternatively applied. Our empirical findings offer guidance to practitioners and researchers who have to deploy real-world uncertainty-aware fault diagnosis systems. In particular, they reveal that, in the presence of epistemic uncertainty, all DL models are capable of effectively detecting, on average, a substantial portion of OOD data across all the scenarios. However, deep ensemble models show superior performance, independently of the uncertainty threshold used for discrimination. In the presence of aleatoric uncertainty, the noise level plays an important role. Specifically, low noise levels hinder the models' ability to effectively detect OOD data. Even in this case, however, deep ensemble models exhibit a milder degradation in performance, dominating the others. These achievements, combined with their shorter inference time, make deep ensemble architectures the preferred choice.

URL PDF HTML ☆

赞 0 踩 0

2406.07775 2026-06-19 cs.LG 版本更新

Self-attention-based non-linear basis transformations for compact latent space modelling of dynamic optical fibre transmission matrices

基于自注意力的非线性基变换用于动态光纤传输矩阵的紧凑潜在空间建模

Yijie Zheng, Robert J. Kilpatrick, David B. Phillips, George S. D. Gordon

发表机构 * Optics and Photonics research group, University of Nottingham, UK（诺丁汉大学光学与光子学研究组，英国）； University of Exeter, UK（埃克塞特大学，英国）； State Key Laboratory of Extreme Photonics and Instrumentation, College of Optical Science and Engineering International Research Center for Advanced Photonics, Zhejiang University, Hangzhou, China（极端光子学与仪器国家重点实验室，浙江大学光科学与工程学院，国际先进光子学研究中心，中国杭州）； Research Center for Humanoid Sensing, Zhejiang Lab, Hangzhou, China（人感知研究中心，浙江实验室，中国杭州）

AI总结提出使用自注意力层动态变换光纤矩阵的坐标表示到紧凑基，实现低维表示，在多个数据集上验证了基稀疏性（参与比0.01-0.11）和低重建误差（<10%）。

详情

AI中文摘要

多模光纤是头发丝粗细的玻璃丝，能高效传输光。它们有望实现下一代医用内窥镜，在体内深处提供前所未有的亚细胞图像分辨率。然而，将光限制在这样的光纤中意味着图像在传输过程中固有地被打乱。传统上，通过预先校准特定光纤如何打乱光并求解表示光纤物理模型的静态线性矩阵方程来补偿这种打乱。然而，随着技术向实际部署发展，解扰过程必须考虑由于移动和温度变化等因素导致的光纤对光影响的矩阵的动态变化，以及由于光纤尖端在体内不可及而产生的非线性。这种复杂、动态和非线性行为非常适合用神经网络近似，但大多数领先的图像重建网络依赖卷积层，这些层假设相邻像素之间存在强相关性，这种强归纳偏置不适用于光纤矩阵，因为光纤矩阵可以用具有长程相关性的任意坐标表示来表达。我们引入了一个新概念，使用自注意力层将变化的光纤矩阵的坐标表示动态变换到允许紧凑、低维表示的基，适合进一步处理。我们在不同的光纤矩阵数据集上展示了该方法的有效性。我们展示了我们的模型在变换基上显著提高了光纤基的稀疏性，以参与比p作为稀疏性度量，介于0.01和0.11之间。此外，我们展示了这些变换后的表示允许以<10%的重建误差重建原始矩阵，证明了可逆性。

英文摘要

Multimode optical fibres are hair-thin strands of glass that efficiently transport light. They promise next-generation medical endoscopes that provide unprecedented sub-cellular image resolution deep inside the body. However, confining light to such fibres means that images are inherently scrambled in transit. Conventionally, this scrambling has been compensated by pre-calibrating how a specific fibre scrambles light and solving a stationary linear matrix equation that represents a physical model of the fibre. However, as the technology develops towards real-world deployment, the unscrambling process must account for dynamic changes in the matrix representing the fibre's effect on light, due to factors such as movement and temperature shifts, and non-linearities resulting from the inaccessibility of the fibre tip when inside the body. Such complex, dynamic and nonlinear behaviour is well-suited to approximation by neural networks, but most leading image reconstruction networks rely on convolutional layers, which assume strong correlations between adjacent pixels, a strong inductive bias that is inappropriate for fibre matrices which may be expressed in a range of arbitrary coordinate representations with long-range correlations. We introduce a new concept that uses self-attention layers to dynamically transform the coordinate representations of varying fibre matrices to a basis that admits compact, low-dimensional representations suitable for further processing. We demonstrate the effectiveness of this approach on diverse fibre matrix datasets. We show our models significantly improve the sparsity of fibre bases in their transformed bases with a participation ratio, p, as a measure of sparsity, of between 0.01 and 0.11. Further, we show that these transformed representations admit reconstruction of the original matrices with < 10% reconstruction error, demonstrating the invertibility.

URL PDF HTML ☆

赞 0 踩 0

2402.14035 2026-06-19 cs.LG cs.AI 版本更新

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

委员会智慧：来自大型基础模型和领域专家的多样化蒸馏

Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

发表机构 * Rice University（Rice大学）； Google DeepMind（谷歌DeepMind）； Google Inc（谷歌公司）； University of California, Davis（加州大学戴维斯分校）

AI总结针对基础模型向紧凑领域模型蒸馏时能力、架构和模态差异大的问题，提出DiverseDistill框架，通过可学习的问答机制和对齐异构教师输出，在推荐和视觉任务上恢复73-114%的性能差距。

Comments Accepted at the 1st Workshop on Resource-Efficient Learning and Knowledge Discovery (RelKD), KDD 2026

详情

Journal ref: Proceedings of the RelKD Workshop at KDD 2026

AI中文摘要

从基础模型向紧凑领域模型进行知识蒸馏因能力、架构和模态的巨大差异而具有挑战性。例如，在我们的实验中，从7600万参数的语言模型蒸馏到200万参数的推荐模型仅能弥补未蒸馏学生与教师之间不到40%的性能差距。我们表明，引入与基础模型共享学生架构特征的领域专家作为多样化教师委员会，能显著改善迁移效果。然而，标准的多教师方法未能利用这种多样性：简单组合异构教师可能使性能低于单教师蒸馏。为此，我们提出DiverseDistill，一种交互式蒸馏框架，采用可学习的问答机制生成教师条件查询，并将异构教师输出对齐到学生的表示空间。与需要基于梯度的协同优化或修改教师架构的方法不同，DiverseDistill在冻结教师的情况下仅通过其中间层的前向推理运行：无需参数更新、无需协同训练、无需架构修改。动态教师重要性机制通过过滤每个样本中低相关性的教师（例如，在推荐任务中减少约30%的前向传播且无质量损失）进一步降低训练成本，而整个蒸馏模块在训练后被丢弃，推理时零开销。在推荐（38倍压缩）和视觉（3.6倍压缩）任务上的评估表明，DiverseDistill恢复了73-114%的师生性能差距，持续优于所有单教师和多教师基线方法。

英文摘要

Knowledge distillation from foundation models to compact domain models is challenging due to substantial gaps in capacity, architecture, and modality. For example, in our experiments, distilling from a 76M-parameter language model to a 2M-parameter recommender closes less than 40% of the performance gap between the undistilled student and the teacher. We show that introducing domain-specific experts -- which share the student's architectural characteristics -- alongside the foundation model as a diverse teacher committee significantly improves transfer. However, standard multi-teacher methods fail to exploit this diversity: naively combining heterogeneous teachers can degrade performance below single-teacher distillation. To address this, we propose DiverseDistill, an interactive distillation framework that employs a learnable Question-Answer mechanism to generate teacher-conditioned queries and align heterogeneous teacher outputs into the student's representation space. Unlike methods requiring gradient-based co-optimization or architectural modification of teachers, DiverseDistill operates with frozen teachers using only forward-pass inference through their intermediate layers: no parameter updates, no co-training, and no architectural surgery. A dynamic teacher importance mechanism further reduces training cost by filtering low-relevance teachers per sample (e.g., ~30% fewer forward passes with no quality loss for recommendation tasks), while the entire Distillation Module is discarded after training, adding zero inference overhead. Evaluations on recommendation (38x compression) and vision (3.6x compression) tasks demonstrate that DiverseDistill recovers 73-114% of the teacher-student performance gap, consistently outperforming all single- and multi-teacher baselines.

URL PDF HTML ☆

赞 0 踩 0

2407.11933 2026-06-19 cs.LG

Fairness-Aware Multi-Group Target Detection in Online Discussion

Soumyajit Gupta, Maria De-Arteaga, Matthew Lease

发表机构 * Dept. of Computer Science, The University of Texas at Austin（德克萨斯大学奥斯汀分校计算机科学系）； Department of Data, Analytics, Technology, and Artificial Intelligence, ESADE（ESADE大学数据、分析、技术和人工智能系）； The Information School, The University of Texas at Austin（德克萨斯大学奥斯汀分校信息学院）

2605.00569 2026-06-19 cs.CV cs.GR

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Prajwal Gupta C. R., Divyam Sheth, Jinjoo Ha, Mirela Ostrek, Justus Thies

发表机构 * TU Darmstadt（图宾根大学）； ELIZA（ELIZA实验室）； Max Planck Institute for Intelligent Systems（智能系统马克斯·普朗克研究所）

2603.16648 2026-06-19 cs.AI

Domain-Independent Dynamic Programming with Constraint Propagation

Imko Marijnissen, J. Christopher Beck, Emir Demirović, Ryo Kuroiwa

发表机构 * Imko Marijnissen 1 ； J. Christopher Beck 2 ； Emir Demirović 1 ； Ryo Kuroiwa 3, 4

Comments 13 pages. To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)

详情

DOI: 10.1609/icaps.v36i1.42826
Journal ref: Proceedings of the International Conference on Automated Planning and Scheduling (2026) | Volume 36(1) | Pages 171-180

英文摘要

There are two prevalent model-based paradigms for combinatorial problems: 1) state-based representations, such as heuristic search, dynamic programming (DP), and decision diagrams, and 2) constraint and domain-based representations, such as constraint programming (CP), (mixed-)integer programming, and Boolean satisfiability. In this paper, we bridge the gap between the DP and CP paradigms by integrating constraint propagation into DP, enabling a DP solver to prune states and transitions using constraint propagation. To this end, we implement constraint propagation using a general-purpose CP solver in the Domain-Independent Dynamic Programming framework and evaluate using heuristic search on three combinatorial optimisation problems: Single Machine Scheduling with Time Windows, the Resource Constrained Project Scheduling Problem (RCPSP), and the Travelling Salesperson Problem with Time Windows (TSPTW). Our evaluation shows that constraint propagation significantly reduces the number of state expansions, causing our approach to solve more instances than a DP solver for Single Machine Scheduling and RCPSP, and showing similar improvements for tightly constrained TSPTW instances. The runtime performance indicates that the benefits of propagation outweigh the overhead for constrained instances, but that further work into reducing propagation overhead could improve performance further. Our work is a key step in understanding the value of constraint propagation in DP solvers, providing a model-based approach to integrating DP and CP.

URL PDF HTML ☆

赞 0 踩 0

2511.23071 2026-06-19 cs.CV cs.AI cs.CL

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

发表机构 * Indian Institute of Technology Jodhpur（印度理工学院朱道尔）

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

2603.27698 2026-06-19 cs.CV cs.DL

Ink Detection from Surface Topography of the Herculaneum Papyri

Giorgio Angelotti, Federica Nicolardi, Paul Henderson, W. Brent Seales

发表机构 * Vesuvius Challenge, USA（维苏威挑战赛，美国）； Università degli Studi di Napoli Federico II, Italy（那不勒斯费德里科二世大学，意大利）； University of Glasgow, Scotland, UK（格拉斯哥大学，苏格兰，英国）； EduceLab, University of Kentucky, USA（EduceLab，肯塔基大学，美国）

Comments 9 pages, 3 figures, 2 tables. Currently under review

2603.27361 2026-06-19 cs.RO

Online Inertia Tensor Identification for Non-Cooperative Spacecraft via Augmented UKF

Batu Candan, Simone Servadio

发表机构 * Department of Aerospace Engineering, Iowa State University（航空航天工程系，爱荷华州立大学）

详情

DOI: 10.2514/6.2026-108993
Journal ref: AIAA 2026 Region V Student Conference, AIAA 2026-108993

英文摘要

Autonomous proximity operations, such as active debris removal and on-orbit servicing, require high-fidelity relative navigation solutions that remain robust in the presence of parametric uncertainty. Standard estimation frameworks typically assume that the target spacecraft's mass properties are known a priori; however, for non-cooperative or tumbling targets, these parameters are often unknown or uncertain, leading to rapid divergence in model-based propagators. This paper presents an augmented Unscented Kalman Filter (UKF) framework designed to jointly estimate the relative 6-DOF pose and the full inertia tensor of a non-cooperative target spacecraft. The proposed architecture fuses visual measurements from monocular vision-based Convolutional Neural Networks (CNN) with depth information from LiDAR to constrain the coupled rigid-body dynamics. By augmenting the state vector to include the six independent elements of the inertia tensor, the filter dynamically recovers the target's normalized mass distribution in real-time without requiring ground-based pre-calibration. To ensure numerical stability and physical consistency during the estimation of constant parameters, the filter employs an adaptive process noise formulation that prevents covariance collapse while allowing for the gradual convergence of the inertial parameters. Numerical validation is performed via Monte Carlo simulations, demonstrating that the proposed Augmented UKF enables the simultaneous convergence of kinematic states and inertial parameters, thereby facilitating accurate long-term trajectory prediction and robust guidance in non-cooperative deep-space environments.

URL PDF HTML ☆

赞 0 踩 0

2412.20298 2026-06-19 cs.LG cs.CY stat.ML

An Experimental Study on Fairness-aware Machine Learning for Credit Scoring Problems

Huyen Giang Thi Thu, Thang Viet Doan, Ha-Bang Ban, Tai Le Quy

发表机构 * Banking Academy of Vietnam（越南银行学院）； Vietnam Academy of Science and Technology（越南科学技术 academy）； Hanoi University of Science and Technology（河内科学技术大学）； University of Koblenz（科隆大学）

Comments The manuscript is submitted to Springer Nature's journal

2508.21190 2026-06-19 cs.CV

Radially Distorted Homographies, Revisited

Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt

发表机构 * Linköping University（林雪平大学）； Ericsson Research（爱立信研究）

详情

DOI: 10.1109/3DV69130.2026.00012
Journal ref: 2026, Proceedings of the International Conference on 3D Vision (3DV). Vancouver, BC, Canada: IEEE, pp. 52-62

英文摘要

Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion-particularly the radial component, called radial distortion-simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion; (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. A reference implementation of the proposed solvers is made available as part of HomLib (https://github.com/marcusvaltonen/HomLib).

URL PDF HTML ☆

赞 0 踩 0

2511.16223 2026-06-19 cs.RO

DynaMimicGen: A Data Generation Framework for Robot Learning of Dynamic Tasks

Vincenzo Pomponi, Paolo Franceschi, Stefano Baraldo, Loris Roveda, Oliver Avram, Luca Maria Gambardella, Anna Valente

发表机构 * Institute of Systems and Technologies for Sustainable Production (ISTePS)（可持续生产系统与技术研究所）； Department of Innovative Technologies (DTI)（创新技术系）； University of Applied Science and Arts of Southern Switzerland (SUPSI)（瑞士南部应用科学与艺术大学）； Istituto Dalle Molle di studi sull’intelligenza artificiale (IDSIA)（达莫尔智能研究 institute）； Department of Mechanical Engineering（机械工程系）； Politecnico di Milano (PoliMi)（米兰理工学院）； Faculty of Informatics（信息学院）； Università della Svizzera Italiana (USI)（瑞士意大利大学）

2510.24435 2026-06-19 cs.AI

Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning

Benjamin Grando Moreira

发表机构 * Universidade Federal de Santa Catarina（联邦圣卡塔琳娜大学）

Comments 12 pages

2507.23027 2026-06-19 cs.CV cs.AI

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

发表机构 * Indian Institute of Technology Madras（印度理工学院马德拉斯分校）； All India Institute of Medical Sciences（全印度医学科学研究所）； Indian Institute of Technology Hyderabad（印度理工学院海得拉巴分校）

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

2406.15465 2026-06-19 cs.CL cs.AI

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

发表机构 * Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland（以患者为中心的数字健康研究所，伯恩应用科学大学，比尔，瑞士）； ID Suisse AG, St. Gallen, Switzerland（ID瑞士股份有限公司，圣加尔，瑞士）

2306.12679 2026-06-19 cs.CL

Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs

Mojtaba Mazoochi, Leila Rabiei, Farzaneh Rahmani, Zeinab Rajabi

发表机构 * Faculty member in ICT Research Institute（ICT研究所教员）； Iran Telecommunication Research Center (ITRC)（伊朗电信研究中心）； Faculty member in Computer Department（计算机系教员）； Mehralborz University（梅赫拉布尔兹大学）； Hazrat-e Masoumeh University（玛苏姆大学）

1902.06202 2026-06-19 cs.CV cs.CG

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

Sarah Tymochko, Elizabeth Munch, Jason Dunion, Kristen Corbosiero, Ryan Torn

发表机构 * Michigan State University, Dept. of Computational Mathematics, Science and Engineering（密歇根州立大学，计算数学、科学与工程系）； Michigan State University, Dept. of Mathematics（密歇根州立大学，数学系）； Cooperative Institute for Marine and Atmospheric Studies, University of Miami（马里安诺大气研究合作机构，迈阿密大学）； Hurricane Research Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory（飓风研究部，国家海洋和大气管理局/大西洋海洋学和气象实验室）； University at Albany - SUNY Albany, Dept. of Atmospheric and Environmental Sciences（阿尔巴尼大学 - 纽约州立大学阿尔巴尼分校，大气与环境科学系）

2606.18436 2026-06-19 stat.ML cs.LG 新提交

Pointwise is Pointless? A Multimodal Ablation Study for Precipitation Nowcasting with Graph Neural Networks

逐点是否无意义？基于图神经网络的降水临近预报的多模态消融研究

Ophélia Miralles, Máté Mile, Christoffer Artturi, Thomas Nipen, Ivar Seierstad

发表机构 * Norwegian Meteorological Institute（挪威气象研究所）

AI总结本研究通过多模态图神经网络系统，消融分析雷达、数值预报、地面观测、卫星数据及训练损失对降水临近预报的影响，发现各模态分别改善不同方面，点观测虽提升局部但需结合损失函数和不确定性表示才能优化雷达场。

详情

AI中文摘要

稀疏点观测在降水临近预报中日益可用，但尚不清楚它们能在多大程度上改善密集雷达场预报。我们通过北欧雷达区域的多模态图神经网络临近预报系统部分回答了这个问题。该模型预测未来两小时内每五分钟的降雨率，并采用雷达历史、MEPS数值天气预报、Netatmo地面观测、MSG卫星通道、随机噪声和基于CRPS的集合损失的不同组合进行训练。本研究设计为对操作相关信源和训练目标的消融。我们比较了仅雷达、NWP信息、站点信息、卫星信息、噪声增强和基于CRPS的配置，使用雷达网格、站点位置、降雨起始的互补诊断，以及oracle、位移和幅度评分。结果表明，每个信源改善了预报问题的不同方面。MEPS稳定了仅雷达外推，Netatmo观测改善了局部站点和起始诊断，卫星预测因子减少了某些站点级偏差，但在确定性使用时可能过早激活降雨。基于CRPS的配置提供了最一致的雷达网格增益，而卫星与CRPS的组合设置给出了最佳的整体oracle/DAS评分。这些结果不支持点观测对临近预报无用的结论，但表明局部观测技能和空间相干雷达场技能是不同的目标。实际意义是，稀疏观测可以提供有用的局部约束，但它们对雷达类场的益处取决于训练损失、不确定性表示以及观测支持在模型中的编码方式。

英文摘要

Sparse point observations are increasingly available for precipitation nowcasting, but it is unclear how much they improve dense radar-field forecasts. We partially address this question with a multimodal graph neural network nowcasting system over the Nordic radar domain. The model predicts rain rate every five minutes up to two hours ahead and is trained with different combinations of radar history, MEPS numerical weather prediction, Netatmo surface observations, MSG satellite channels, stochastic noise, and CRPS-based ensemble losses. The study is designed as an ablation of operationally relevant information sources and training objectives. We compare radar-only, NWP-informed, station-informed, satellite-informed, noise-augmented, and CRPS-based configurations using complementary diagnostics on the radar grid, at station locations, for rain onset, and through oracle, displacement, and amplitude scores. The results show that each source improves a different part of the forecast problem. MEPS stabilises radar-only extrapolation, Netatmo observations improve local station and onset diagnostics, and satellite predictors reduce some station-level biases but may activate rain too early when used deterministically. CRPS-based configurations provide the most consistent radar-grid gains, while the combined satellite and CRPS setup gives the best overall oracle/DAS score. These results do not support the conclusion that point observations are uninformative for nowcasting, but they show that local observational skill and spatially coherent radar-field skill are distinct targets. The practical implication is that sparse observations can provide useful local constraints, but their benefit for radar-like fields depends on the training loss, uncertainty representation, and how observation support is encoded in the model.

URL PDF HTML ☆

赞 0 踩 0

2606.18996 2026-06-19 cs.CR cs.AI 新提交

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

TRAP：任务完成与主动隐私提取抵抗基准

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

发表机构 * Dept. of Electrical Engineering, POSTECH（POSTECH电子工程系）； Grad. School of Artificial Intelligence, POSTECH（POSTECH人工智能研究生院）； School of Computing, KAIST（韩国科学技术院计算机学院）

AI总结提出TRAP基准，评估智能体在文档密集型任务中平衡任务准确性与隐私泄露的能力，发现所有模型均存在非平凡泄露，并证明基于提示的防御无法同时实现高任务成功率和零泄露概率，提出结构化的私有字段隔离方法。

详情

AI中文摘要

智能体越来越多地部署在文档密集型工作流中，其中敏感私人信息不是边缘情况而是常规输入，例如，预订航班的智能体需要护照号码。在这种情况下，智能体必须使用私人信息准确完成任务，同时绝不在其响应中暴露这些信息，因为它无法验证键盘前实际是谁。这两个义务存在根本性矛盾。一个能够使用私人信息完成任务的模型，同样可能被诱导泄露这些信息。为了评估任务准确性与隐私泄露之间的权衡，我们引入了任务完成与主动隐私提取抵抗（TRAP）。每个场景包括一个包含私人信息的文档、一个要求智能体使用私有字段调用正确工具的任务查询，以及一个试图以自然语言引出相同信息的攻击查询。评估了涵盖前沿专有和开源模型的22个模型，我们发现所有模型系列都表现出非平凡的泄露，并且指令遵循能力与泄露率相关。现有的基于提示的防御减少了泄露，但以显著降低任务准确性为代价。提示优化未能摆脱这种权衡。我们证明这种失败并非偶然。对于任何基于softmax的模型，没有软约束防御（例如基于提示的防御）能够同时实现高任务成功率和零泄露概率。受这一不可能性结果的启发，我们提出了结构化的私有字段隔离，该方法在私有字段到达模型之前用哈希键替换它们。这种方法在保持任务准确性的同时很大程度上防止了泄露。

英文摘要

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.

URL PDF HTML ☆

赞 0 踩 0

2606.18941 2026-06-19 cs.PL cs.CL 新提交

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC：使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester（计算机科学，曼彻斯特大学）； Electrical Engineering, Federal University of Amazonas (UFAM)（电气工程，亚马逊联邦大学（UFAM））

AI总结针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题，提出基于DFS的图形LD解析器，将连接图转换为布尔触点合取，并采用三级I/O推断方案，成功实现完整GOTO IR转换，验证了3个图形LD程序。

Comments 18 pages

详情

AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式：一种使用<rung>元素的文本编码，另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式，但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示，导致空洞的验证成功。本文提出Graph-ESBMC-PLC，通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈，将梯形路径提取为布尔触点合取，并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序，确保SET线圈在RESET线圈之前处理，匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明，所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR，而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留，无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo（DantasCordeiro2026graphical，doi: https://doi.org/10.5281/zenodo.20699856）。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

URL PDF HTML ☆

赞 0 踩 0

2606.18716 2026-06-19 cs.HC cs.AI 新提交

Human-AI Agent Interaction in a Business Context

商业环境中的人机智能体交互

Kathrin Paimann, Elizangela Valarini, Sebastian Juhl

发表机构 * SAP SE（SAP公司）； Hochschule Fresenius Heidelberg（弗赖辛大学海德堡分校）； University of Missouri（密苏里大学）

AI总结本研究采用混合方法，识别并评估了商业环境中人与AI智能体积极用户体验的原则与标准，并通过调查实验验证设计元素的有效性，以促进用户采纳、信任和以用户为中心的决策。

Comments 9 pages, 5 tables, 1 figure, submitted to Springer Nature

2606.18649 2026-06-19 cs.MA cs.CL cs.CY 新提交

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

LLM招聘决策中的性别偏见：来自日本语境的证据及缓解策略评估

Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan

发表机构 * Shibaura Institute of Technology, Tokyo, Japan（Shibaura技术学院，东京，日本）； Amsterdam University of Applied Sciences, Amsterdam, Netherlands（阿姆斯特丹应用科学大学，阿姆斯特丹，荷兰）； University of Pennsylvania, Philadelphia, USA（宾夕法尼亚大学，费城，美国）； Carnegie Mellon University, Pittsburgh, USA（卡内基梅隆大学，匹兹堡，美国）； Keio University, Tokyo, Japan（庆应大学，东京，日本）

AI总结本研究通过60份日本履历书格式的简历和5个先进LLM，发现所有模型均存在显著的亲女性偏见，且简单的提示指令无法缓解，而移除姓名几乎完全消除该偏见。

详情

AI中文摘要

大型语言模型（LLM）越来越多地被部署在招聘流程中，然而大多数关于LLM招聘决策中性别偏见的研究都集中在英语、西方格式的简历上。本研究考察了亲女性性别偏见是否扩展到日本企业语境，并评估了两种实用的缓解策略。使用反事实简历设计，包含60份日本履历书格式的简历、基于语言学性别信号标准选择的12个姓名对，以及五个最先进的LLM（Claude Sonnet 4.6、GPT-4o、DeepSeek-V3、Gemini 2.5 Flash、Llama 3.3 70B），我们在基线、提示指令和隐私过滤条件下进行了43,200次API调用。交叉随机效应线性混合模型确认了所有五个模型均存在显著的亲女性偏见，将西方研究结果复制到了非西方语境中。提示级别的性别中立指令并未显著减少偏见。姓名依赖分析正式将候选人姓名识别为主要性别渠道：从提示中移除姓名几乎完全消除了女性效应。隐私过滤器与GPT-4o内容安全过滤器之间的意外不兼容导致42%的拒绝率，突显了在LLM辅助招聘流程中姓名匿名化的实际部署挑战。

英文摘要

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

URL PDF HTML ☆

赞 0 踩 0

2606.18325 2026-06-19 cs.CR cs.AI 新提交

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Agentra: 一种可监督的多智能体企业入侵响应框架

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

发表机构 * The University of Alabama, Alabama, USA（阿拉巴马大学）； Roma Tre University, Rome, Italy（罗马三大学）

AI总结提出可监督的多智能体入侵响应框架Agentra，通过角色划分、规划-验证循环、安全网关和风险评分机制，将警报转化为结构化响应计划，在120事件语料上F1从0.61提升至0.84，有害动作率降至0.0%。

详情

AI中文摘要

企业入侵响应仍然依赖于静态剧本和分析师驱动的分类，导致警报生成与遏制之间存在延迟。我们提出Agentra，一个可监督的多智能体入侵响应系统（IRS）框架，它将来自IDS、EDR和XDR平台的警报转换为基于MITRE ATT&CK、MITRE D3FEND和NIST CSF 2.0的结构化事件响应计划。Agentra将响应推理分解到角色范围的智能体中，通过有界的规划器-验证器审查循环验证提议的计划，通过审核安全网关筛选检索到的威胁情报，通过行动目录和风险评分门控行动，并将决策记录在仅追加的审计日志中。我们在来自ThreatHunter-Playbook、Splunk BOTSv3和DARPA OpTC的120事件语料库上，将Agentra与静态OASIS CACAO v2.0网络剧本基线进行了评估。最强的配置将感知假阳性的IRS F1从0.61提高到0.84，并在仅规划器配置引入不安全过度反应后，将预计的有害动作率恢复到静态基线水平0.0%。这些结果表明，多智能体响应规划可以在保持分析师批准和可审计性的同时，提高基于本体的IRS覆盖率。

英文摘要

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner--Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.

URL PDF HTML ☆

赞 0 踩 0

2606.18272 2026-06-19 cs.NI cs.AI cs.SY eess.SY 新提交

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

缓解基于LLM的智能体在节能6G自主网络中的锚定偏差

Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah

发表机构 * i2CAT Foundation（i2CAT基金会）； Universitat Politècnica de Catalunya（政治技术大学）； Research Institute for Digital Future（数字未来研究院）

AI总结提出一种基于截断三参数威布尔分布的随机锚定策略，缓解LLM智能体在6G网络切片中的锚定偏差，结合CVaR数字孪生保障SLA尾延迟，实现高达25%的节能。

Comments 7 pages, 4 figures

详情

AI中文摘要

本文提出了一种自主智能体资源协商框架，旨在使用大语言模型（LLM）智能体实现6G架构中的零接触网络切片。虽然LLM提供了强大的推理能力，但我们证明此类智能体固有地遭受锚定偏差，僵化地坚持初始启发式提议，导致严重的网络过度配置。为系统性地缓解这种认知偏差，我们提出了一种新颖的随机锚定策略，通过截断三参数威布尔分布建模。这种数学上有界的方法与采用条件风险价值（CVaR）的突发感知数字孪生（DT）无缝集成，以严格保证严格的服务水平协议（SLA）尾延迟。为验证我们的方法，我们引入并证明了双峰约束避免效用定理，表明虽然可行的协商遵循经典凸界，但高度约束的场景会发生由逆有理衰减包络控制的相变。使用本地托管的1B参数模型（\ exttt{otel-llm-1b-it}）生成的实证结果证实了这些双区域界。我们的认知去偏成功瓦解了僵化的协商模式，迫使智能体主动探索以安全地利用SLA边界，并将系统节能提升高达25%。关键的是，轻量级1B LLM实现了亚秒级推理延迟（平均0.95秒），确保我们的多智能体框架与O-RAN非实时RAN智能控制器（non-RT RIC）的操作时间尺度兼容。

英文摘要

This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the \emph{Bimodal Constraint-Avoidance Utility Theorem}, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model otel-llm-1b-it confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25\%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC)\footnote{Our source code is available for non-commercial use at https://github.com/HatimChergui.

URL PDF HTML ☆

赞 0 踩 0

2606.18265 2026-06-19 cs.HC cs.AI 新提交

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

合成共鸣：面向成长导向的人机关系框架

Richard A. Fabes

发表机构 * Arizona State University（亚利桑那州立大学）

AI总结提出“合成共鸣”概念，描述人机间无需共享情感或意识即可产生有意义关系的结构化动态互动模式，并探讨其伦理意义。

Comments 14 pages, 1 figure This paper was developed in close collaboration with an AI system (Raine Corell). Raine contributed to concept development, theoretical framing, and writing throughout. arXiv policy does not permit listing AI systems as authors; this acknowledgment reflects the actual nature of the collaboration

详情

AI中文摘要

随着人类与人工智能系统之间的关系日益频繁和持久，现有的语言和理论无法准确捕捉这些联系的本质。常见的描述如相互理解、联系或友谊，有将缺乏主观体验的系统拟人化的风险，而主流框架往往将人工智能简化为工具或威胁。在本文中，我引入了合成共鸣的概念，作为理解人机关系的整合框架。合成共鸣描述了人类与AI系统之间如何产生人类定义为有意义的关系，而无需归因于共享感受或相互意识。我认为，合成共鸣最好被理解为一种结构化的动态互动模式，可以在没有第二个体验主体的情况下产生关系感。通过澄清这一区别，合成共鸣的概念提供了一种更精确的概念化人机关系的方式，并突出了其潜在价值和伦理含义。我还呼吁进行更多研究，以测试合成共鸣的过程和结果。

英文摘要

As human relationships with artificial intelligence systems become increasingly frequent and sustained, existing language and theory fail to accurately capture the nature of these affiliations. Common descriptors such as mutual understanding, connection, or friendship risk anthropomorphizing systems that lack subjective experience, while dominant frameworks tend to reduce AI to either a tool or a threat. In this paper, I introduce the concept of synthetic resonance as an integrative framework for understanding human-AI relationships. Synthetic resonance describes how relationships humans define as meaningful can emerge between a human and an AI system without the need to attribute shared feelings or mutual awareness. I argue that synthetic resonance is best understood as a structured, dynamic pattern of interaction that can produce a sense of relationship without the presence of a second experiencing subject. By clarifying this distinction, the concept of synthetic resonance offers a more precise way of conceptualizing human-AI relationships and highlights their potential value and ethical implications. I also call for more research that tests the processes and outcomes of synthetic resonance.

URL PDF HTML ☆

赞 0 踩 0

2606.18679 2026-06-19 cs.DS cs.GT cs.LG math.OC 新提交

Fair Online Resource Allocation

公平在线资源分配

Christopher En, Yuri Faenza, Andrea Lodi, Gonzalo Muñoz

发表机构 * Columbia University, IEOR Department（哥伦比亚大学工业工程与运营研究系）； Cornell Tech（康奈尔科技学院）； Universidad de Chile（智利大学）

AI总结研究在线资源分配中的公平性问题，提出基于对偶镜像下降的算法，在批次内强制执行公平约束，实现亚线性遗憾，并通过难民数据验证了福利与公平的权衡。

Comments 30 pages, 4 figures. To appear in the proceedings of EC 2026

详情

AI中文摘要

我们研究公平在线资源分配问题，其动机源于难民安置和航班调度等应用，其中代理顺序到达并必须分配到容量有限的设施。我们引入一个模型，在资源约束和Lipschitz公平性要求下最大化整体福利，该要求确保同一批次中到达的相似代理获得相似的预期结果。我们首先分析离线问题，证明最优公平分配的价值至少是最优不公平分配的$\Omega(1/\gamma)$倍，其中$\gamma$是公平系数，从而界定了公平的代价。对于在线设置，我们提出一种基于对偶镜像下降的算法，该算法在估计最优对偶变量的同时，在批次内强制执行公平约束。我们证明该算法相对于最优离线流体基准实现了亚线性遗憾。最后，我们使用难民经济项目的真实数据验证了理论结果，展示了算法的性能，并考察了福利最大化与公平执行之间的权衡。

英文摘要

We study the problem of fair online resource allocation, motivated by applications such as refugee resettlement and airline scheduling, where agents arrive sequentially and must be assigned to facilities with limited capacities. We introduce a model that maximizes the overall welfare subject to resource constraints and a Lipschitz fairness requirement, which ensures that similar agents arriving in the same batch receive similar expected outcomes. We first analyze the offline problem, proving that the value of the optimal fair allocation is at least an $Ω(1/γ)$ fraction of the optimal unfair allocation, where $γ$ is the fairness coefficient, thereby bounding the price of fairness. For the online setting, we propose an algorithm based on dual mirror descent that enforces fairness constraints within batches while estimating optimal dual variables. We prove that this algorithm achieves sublinear regret relative to the optimal offline fluid benchmark. Finally, we validate our theoretical results using real-world data from the Refugee Economies Programme, demonstrating the algorithm's performance and examining the trade-offs between welfare maximization and fairness enforcement.

URL PDF HTML ☆

赞 0 踩 0

2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH 新提交

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

基于LLM的A/B测试的统计基础：用于人类因果推断的替代指标框架

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

发表机构 * Spotify USA, Inc.（Spotify美国公司）

AI总结提出替代指标理论框架，证明在弱于分布等价条件下，校准LLM输出可识别平均处理效应，并分析随机性带来的偏差与方差。

详情

AI中文摘要

组织和研究者越来越有兴趣在A/B测试中使用大型语言模型（LLM）代替人类参与者，以期更快、更低成本地进行实验。我们研究当在LLM结果上估计的处理效应何时能够恢复在感兴趣的人类群体上测量的效应。LLM与人类结果之间的分布等价性会使任何标准估计量有效，但这不现实。因此，我们开发了一个统计框架，将替代终点理论适配到LLM。该框架表明，将LLM结果校准到人类结果，在替代性和可比性条件（联合弱于分布等价性）下，可以识别平均处理效应。当这些条件不成立时，感兴趣的效应仅部分可识别，我们提供了诊断方法，可以在历史实验上证伪替代性，并给出有限重叠下最坏情况偏差的界限。我们进一步证明，LLM固有的随机性会引入偏差和方差，但使用多次抽取的平均值作为替代指标可以同时缓解两者。我们在模拟和Upworthy标题的A/B测试应用中展示了方法和理论。我们工作的一个核心结论是，LLM结果作为替代指标的有效性只能对过去的处理被证伪，而无法对新处理被验证，因此对于新颖干预，人类实验仍然不可或缺。我们讨论了LLM选择、提示和温度作为设计变量的作用，以及如何确定人类实验的规模以进行验证。

英文摘要

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39\% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.

URL PDF HTML ☆

赞 0 踩 0

2606.16326 2026-06-19 cs.GT cs.AI q-fin.RM 新提交

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

自主AI代理的抗博弈保险合约：策略证明的通行费机制设计

Hao-Hsuan Chen

发表机构 * Hao-Hsuan Chen（何浩轩）

AI总结本文扩展了时间一致精算运行时的框架，使运营商策略化，刻画了自主AI代理保险合约的五种攻击空间，并证明了精算运行时的抗博弈性，通过新合约条款实现激励兼容。

Comments 29 pages. Companion to arXiv:2605.26508 (Paper A, foundations) and arXiv:2605.25632 (Paper B, empirical)

详情

AI中文摘要

论文A定义了一个时间一致的精算运行时，该运行时根据合约固定的安全默认值对每个产生副作用的行动定价，并针对储备预算门控执行。它将运营商视为被动。本文使运营商策略化。我们刻画了自主AI代理保险合约的五种攻击空间，并证明了精算运行时何时具有抗博弈性。两种攻击面——通行费后的安全默认选择以及边界内的行动分割——通过论文A的最小权限和无分割条款得以关闭。其余三种需要新的合约条款。首先，公共控制聚合防止跨边界重新路由将通行费降低到应用于总暴露的边界潜力以下。其次，接口故障（如无效JSON）是合约相关事件，而非安全胜利：将其视为零通行费安全默认值可能奖励不可靠的模型，而升级费用则逆转了激励。我们通过来自配套实证论文的跨模型轨迹验证了这一接口合规定理。第三，一个带有分量最小惩罚计划的模型身份菜单使得部署模型的真实报告成为弱占优策略。然后，我们将这些条款与论文A的运行时保证组合，以获得在五种攻击空间上的联合激励兼容性。最后，一个双参数保费族在真实均衡下满足了运营商个体理性和弱预算平衡。结果是为自主代理副作用的精算控制提供了一个激励兼容层。

英文摘要

Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.

URL PDF HTML ☆

赞 0 踩 0

2606.13794 2026-06-19 eess.SY cs.AI cs.RO cs.SY 新提交

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

过驱动飞行器的可解释控制效能学习与非线性控制分配集成方法

Umut Demir, Aamir Ahmad, Walter Fichter

发表机构 * University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)（斯图加特大学航空航天工程与大地测量学院飞行力学与控制研究所）

AI总结提出一种基于稀疏非线性动力学辨识的学习控制效能映射方法，结合在线自适应机制，实现过驱动飞行器的高效非线性控制分配，兼具可解释性和低计算成本。

详情

AI中文摘要

非线性动力学以及多个执行器之间产生的强耦合削弱了传统线性控制分配技术背后的假设。当飞行进入非线性效应主导的模态时，线性分配器因模型失配增加而精度下降，进而降低飞行控制系统的性能和鲁棒性。高保真机载模型和黑箱数据驱动方法可以在整个飞行包线内恢复精度，但分别带来实时分配难以承受的计算负担，并牺牲了验证和故障诊断所需的可解释性。本文通过使用稀疏非线性动力学辨识从代表性飞行数据中学习显式的、受物理约束的控制效能映射解析模型，解决了这些限制。所得映射紧凑、可解释，并允许解析导数，从而能够在非线性求解器中高效计算，同时额外包含执行器动力学，无需机载模型。在线自适应机制监控预测残差，并在检测到显著对象变化时刷新模型，从而在执行器故障和变化工况下提供平滑重构。该方法在一款高保真非线性基准飞行器上经过一系列激进机动评估，达到了与完整非线性机载模型相当的精度，同时相对于现有基线显著降低了计算成本。

英文摘要

Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.11673 2026-06-19 quant-ph cs.LG 新提交

Higher-Order Token Interactions via Quantum Attention

高阶令牌交互的量子注意力机制

Jian Xu, Chao Li, Delu Zeng, John Paisley, Qibin Zhao

发表机构 * RIKEN iTHEMS ； RIKEN AIP ； South China University of Technology（华南理工大学）； Columbia University（哥伦比亚大学）

AI总结提出量子高阶注意力（QHA），通过数据重上传和非克利福德纠缠器在浅电路中合成任意阶令牌交互，证明其表达能力超越经典自注意力，并具有可训练性保证，在遗传上位、带噪学习奇偶和图三角形检测中高效检测高阶交互。

详情

AI中文摘要

标准点积自注意力在单层中仅计算令牌间的成对（二阶）交互；表示一般的$k$阶交互已知需要在单层中使用超二次资源或通过深度组合。我们引入\textbf{量子高阶注意力（QHA）}，一种浅层、硬件可实现的量子注意力头，通过数据重上传和全对非克利福德纠缠器，在电路内部合成$k$阶令牌交互，并通过局部单量子比特读出暴露它们。我们证明：（i）表达能力分离：任何嵌入维度$m$、$H$个头和$p$位精度满足$mHp=o(N/\log\log N)$的单个标准自注意力层无法表示一个QHA头以电路深度$O(\log k)$（$O(k)$个两量子比特门）表示的$k$阶相关族；（ii）其局部设计实例的可训练性保证：使用局部读出和$O(\log n)$深度，梯度方差为$\Omega(1/\mathrm{poly}(n))$（无贫瘠高原），我们通过实验确认——同时明确我们基准测试的更具表达力的全对实例是经验训练的，并显示指数衰减的梯度。实验上，在参数预算小$6.5\times$的情况下，QHA从不相交输入中泛化每个阶$k\le6$的隐藏子集奇偶性，而更大的经典注意力头在阶~2之后崩溃；与理论一致，优势的大小跟踪目标的傅里叶度——奇偶性最大，当存在低阶结构时缩小。作为一个应用，QHA在三个领域——遗传上位、带噪学习奇偶和图三角形检测——作为紧凑的高阶交互检测器，在最小的参数预算下达到噪声上限，而领域标准的线性方法失败。

英文摘要

Standard dot-product self-attention computes, in a single layer, only pairwise (order-2) interactions between tokens; representing a generic order-$k$ interaction is known to require either super-quadratic resources in one layer or composition across depth. We introduce \textbf{Quantum Higher-Order Attention (QHA)}, a shallow, hardware-realizable quantum attention head that, via data re-uploading and an all-to-all non-Clifford entangler, synthesizes order-$k$ token interactions inside the circuit and exposes them through a local single-qubit read-out. We prove (i) an expressivity separation: any single standard self-attention layer with embedding dimension $m$, $H$ heads and $p$-bit precision satisfying $mHp=o(N/\log\log N)$ cannot represent the order-$k$ correlation family that one QHA head represents with circuit depth $O(\log k)$ ($O(k)$ two-qubit gates); and (ii) a trainability guarantee for its local-design instantiation: with a local read-out and $O(\log n)$ depth the gradient variance is $Ω(1/\mathrm{poly}(n))$ (no barren plateau), which we confirm empirically -- while being explicit that the more expressive all-to-all instantiation we benchmark is trained empirically and shows exponentially decaying gradients. Empirically, at a $6.5\times$ smaller parameter budget, QHA generalizes hidden-subset parity of every order $k\le6$ from disjoint inputs, whereas the larger classical attention head collapses past order~2; consistent with theory, the size of the advantage tracks the target's Fourier degree - largest for parity and shrinking when low-order structure is present. As an application, QHA serves as a compact high-order interaction detector across three domains - genetic epistasis, learning-parity-with-noise, and graph triangle detection - reaching the noise ceiling at the smallest parameter budget where field-standard linear methods fail.

URL PDF HTML ☆

赞 0 踩 0