arXivDaily: arXiv daily academic digest, updated Monday through Friday
2311.04938 2026-05-22 cs.CV cs.AI cs.LG

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

改进的DDIM采样与矩匹配高斯混合模型

Prasad Gabbur

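AI summary: This paper proposes using a Gaussian mixture model (GMM) as the reverse transition operator in the DDIM framework. Constraining the GMM parameters to match the moments of the DDPM forward marginals improves sample quality when few sampling steps are used; experiments show that the GMM kernel outperforms the conventional Gaussian kernel on FID and IS.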

Comments 34 pages, 12 figures; Accepted to TMLR; Code open sourced

Journal ref
Transactions on Machine Learning Research, 05/2026
AI Chinese abstract

我们提出在去噪扩散隐式模型(DDIM)框架中使用高斯混合模型(GMM)作为反向转换操作符(核),这是用于加速从预训练去噪扩散概率模型(DDPM)采样的最广泛使用的 approaches 之一。具体而言,我们通过约束GMM参数来匹配DDPM前向边缘的一阶和二阶中心矩。我们发现矩匹配足以获得与原始DDIM高斯核相等或更好的样本质量。我们分别在无条件模型(训练于CelebAHQ和FFHQ)、类条件模型(训练于ImageNet)以及使用Stable Diffusion v2.1在COYO700M数据集上进行文本到图像生成实验。我们的结果表明,当采样步骤数较小时,使用GMM核可显著提升生成样本的质量,如在ImageNet 256x256上,使用10个采样步骤时,GMM核的FID为6.94,IS为207.85,而高斯核分别为10.15和196.73。此外,我们还为修正流匹配模型推导了新的SDE采样器,并对所提出的方法进行了实验。我们发现使用1-修正流和2-修正流模型均有所改进。代码:https://github.com/pgabbur/ddim-gmm。

English abstract

We propose using a Gaussian Mixture Model (GMM) as the reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically, we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on the COYO700M dataset. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example, on ImageNet 256x256, using 10 sampling steps, we achieve an FID of 6.94 and an IS of 207.85 with a GMM kernel, compared to 10.15 and 196.73, respectively, with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.
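The moment-matching constraint described in the abstract can be illustrated in one dimension. The sketch below is not the paper's parametrization: it assumes a scalar target, a single variance shared by all components, and user-chosen weights and offsets (all hypothetical names). It only shows how the component parameters can be constrained so that the mixture's mean and variance equal those of a target Gaussian.

```python
def moment_matched_gmm(target_mean, target_var, weights, offsets):
    """Build a 1-D Gaussian mixture whose overall mean and variance equal
    target_mean and target_var.

    weights and offsets are free (hypothetical) design choices; every
    component shares one variance, which is solved for below.
    """
    total = sum(weights)
    w = [x / total for x in weights]                      # normalise the weights
    shift = sum(wi * di for wi, di in zip(w, offsets))
    centred = [di - shift for di in offsets]              # weighted offsets now sum to 0
    spread = sum(wi * di * di for wi, di in zip(w, centred))
    shared_var = target_var - spread                      # law of total variance
    if shared_var <= 0:
        raise ValueError("offsets are too spread out for the target variance")
    means = [target_mean + di for di in centred]
    return w, means, shared_var


def mixture_moments(w, means, shared_var):
    """Mean and variance of the mixture, computed from its parameters."""
    mean = sum(wi * mi for wi, mi in zip(w, means))
    var = shared_var + sum(wi * (mi - mean) ** 2 for wi, mi in zip(w, means))
    return mean, var


w, means, shared_var = moment_matched_gmm(1.0, 2.0, [0.3, 0.7], [-1.0, 0.5])
print(mixture_moments(w, means, shared_var))  # close to (1.0, 2.0) up to rounding
```

Centring the offsets fixes the first moment, and subtracting their weighted spread from the target variance fixes the second; the remaining freedom (weights and offsets) is what distinguishes the mixture from a single Gaussian.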

2306.05905 2026-05-22 cs.LG math.OC

TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

TreeDQN: 一种用于组合优化的高效离策略强化学习方法

D. Sorokin, A. Kostin, L. Savchenko, G. Gusev, A. V. Savchenko

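AI summary: TreeDQN improves the sample efficiency of off-policy reinforcement learning on combinatorial optimization tasks by optimizing the geometric mean of expected return, and performs strongly on synthetic tasks and on a task from the ML4CO competition.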

Comments Accepted in Knowledge-Based Systems

AI Chinese abstract

解决组合优化任务的一种方便方法是分支定界法。其分支启发式可以学习以解决大量相似任务。在这里取得的有希望的结果是通过最近出现的基于树马尔可夫决策过程的在线策略强化学习方法实现的。为了克服其主要缺点,即训练时间非常大和不稳定,我们提出了TreeDQN(树深度Q网络),一种样本效率高的离策略RL方法,通过优化预期回报的几何平均来训练。为了理论支持我们的方法的训练过程,我们证明了树MDP中Bellman算子的收缩性质。结果表明,我们的方法所需的训练数据最多减少10倍,并在合成任务上比已知的在线策略方法运行更快。此外,TreeDQN在ML4CO竞赛中的挑战性实际任务上显著优于最先进的技术。

English abstract

A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. Promising results have been achieved by a recent on-policy reinforcement learning method based on the tree Markov Decision Process. To overcome its main disadvantages, namely, very long training time and unstable training, we propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the geometric mean of expected return. To theoretically support the training procedure for our method, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data and performs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms the state-of-the-art techniques on a challenging practical task from the ML4CO competition.

1607.06330 2026-05-22 cs.CL

La representación de la variación contextual mediante definiciones terminológicas flexibles

通过灵活的术语定义实现语境变化的表示

Antonio San Martín

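AI summary: This thesis studies how flexible terminological definitions can reflect the way specialized concepts in the environmental domain vary across contexts. It proposes the flexible terminological definition and uses an empirical study to analyze how contextual variation affects terminological definitions.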

Comments PhD Thesis. in Spanish. University of Granada. 2016

AI Chinese abstract

在本博士论文中,我们应用认知语言学的原理到术语定义,并提出了一种称为灵活术语定义的提案。这包括由一组同一概念的定义组成,包括一个一般定义(在此情况下,涵盖整个环境领域)以及额外的定义,从相关子领域的角度描述该概念。由于语境是构建词汇单位(包括术语)意义的关键因素,我们假设术语定义可以并且应该反映语境的影响,尽管定义传统上被视为与任何语境因素无关的意义表达。本论文的主要目标是分析语境变化对专业环境概念的影响,以它们在术语定义中的表示为视角。具体而言,我们专注于基于主题限制的语境变化。为了完成本博士论文的目标,我们进行了实证研究,包括对一组具有语境变化的概念进行分析,并为其中两个概念创建灵活的定义。在我们实证研究的第一部分,我们将我们的领域依赖性语境变化概念划分为三种不同的现象:调制、视角化和子概念化。这些现象是叠加的,即所有概念都经历调制,一些概念还经历视角化,最后,少数概念还经历子概念化。在第二部分,我们应用这些概念到术语定义,并提出了如何构建灵活定义的指南,从知识提取到定义的实际写作。

English abstract

In this doctoral thesis, we apply premises of cognitive linguistics to terminological definitions and present a proposal called the flexible terminological definition. This consists of a set of definitions of the same concept made up of a general definition (in this case, one encompassing the entire environmental domain) along with additional definitions describing the concept from the perspective of the subdomains in which it is relevant. Since context is a determining factor in the construction of the meaning of lexical units (including terms), we assume that terminological definitions can, and should, reflect the effects of context, even though definitions have traditionally been treated as the expression of meaning void of any contextual effect. The main objective of this thesis is to analyze the effects of contextual variation on specialized environmental concepts with a view to their representation in terminological definitions. Specifically, we focused on contextual variation based on thematic restrictions. To accomplish the objectives of this doctoral thesis, we conducted an empirical study consisting of the analysis of a set of contextually variable concepts and the creation of a flexible definition for two of them. As a result of the first part of our empirical study, we divided our notion of domain-dependent contextual variation into three different phenomena: modulation, perspectivization and subconceptualization. These phenomena are additive in that all concepts experience modulation, some concepts also undergo perspectivization, and finally, a small number of concepts are additionally subjected to subconceptualization. In the second part, we applied these notions to terminological definitions and presented guidelines on how to build flexible definitions, from the extraction of knowledge to the actual writing of the definition.

2605.22779 2026-05-22 cs.SE cs.LG

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

FAME:面向失败的混合专家模型用于消息级日志异常检测

Huanchi Wang, Zihang Huang, Yifang Tian, Kristina Dzeparoska, Hans-Arno Jacobsen, Alberto Leon-Garcia

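AI summary: This paper presents FAME, a failure-aware mixture-of-experts model for message-level log anomaly detection. It trains a lightweight router and domain experts from a small amount of labeled data, enabling efficient anomaly detection, and achieves high precision and recall on the BGL and Thunderbird datasets.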

Comments 12 pages, 5 figures

AI Chinese abstract

生产系统每天生成数百万条日志行,但大多数异常检测器在会话或窗口级别工作,标记的是行组而非特定消息。这种粗粒度迫使操作员每条警报都要检查许多常规行。消息级检测提供更细粒度,但仍然具有挑战性。一个事件模板可能对应正常和异常消息,故障源于异构子系统,大规模行级标注不切实际。尽管大型语言模型(LLMs)可以推断日志语义,但将其应用于每条行对于持续监控来说成本太高。我们提出了FAME(Failure-Aware Mixture-of-Experts),一种标签高效的面向消息级的混合专家框架,该框架仅在离线时使用LLM一次。我们最多为每个模板标注K条标注行以推导二元正常/异常指标和代表性示例。LLM提出将模板划分为故障领域,并通过认证步骤验证该提议后再进行训练。FAME训练了一个轻量级路由器和领域专家,这些专家在本地运行,并输出异常预测和故障领域标签。在BGL上,FAME在K=100时达到F1=98.16,将标注工作量减少76倍,并检测出86.3%的未见过的EventIDs异常。在Thunderbird上,FAME达到F1=99.95,具有完美的召回率。

English abstract

Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity, but remains challenging. A single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once offline. We annotate at most K labeled lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100, reducing annotation effort by 76x, and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.

2605.22736 2026-05-22 math.OC cs.LG cs.NA math.DG math.NA

Optimization over the intersection of manifolds

在两个流形交集上的优化

Yan Yang, Bin Gao, Ya-xiang Yuan

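AI summary: This paper proposes a geometric method for optimization over the intersection of two manifolds that uses a retraction on only one manifold and updates the iterate along two orthogonal directions. It proves that clean intersection and intrinsic transversality are equivalent, and demonstrates the method on sparse and low-rank optimization problems.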

Comments 26 pages, 5 figures, 3 tables

AI Chinese abstract

在两个流形交集上的优化出现在广泛的应用中,但受到可行区域耦合几何的阻碍。在本文中,我们证明了正则性——清洁交集和内在横贯性——是等价的,这导致了可处理的交集切空间投影。因此,我们提出了一种几何方法,该方法仅在单个流形上使用重新参数化,并在两个正交方向上更新迭代点。具体而言,迭代点停留在一个流形上,而这两个方向分别负责渐近接近另一个流形和减少目标函数。在内在横贯性下,我们推导了可行性和最优性度量的收敛速度,并证明了每个积累点都是第一阶 stationary 的。在稀疏和低秩优化问题上的数值实验,包括拟合球形数据、在真实数据上近似双曲嵌入和计算压缩模式,展示了所提方法的有效性。

English abstract

Optimization over the intersection of two manifolds arises in a broad range of applications, but is hindered by the coupled geometry of the feasible region. In this paper, we prove that the regularities -- clean intersection and intrinsic transversality -- are equivalent, which yields a tractable projection onto the tangent space of the intersection. Therefore, we propose a geometric method that employs a retraction on only one manifold and updates the iterate along two orthogonal directions. Specifically, the iterates stay on one manifold, and the two directions are responsible for asymptotically approaching the other manifold and decreasing the objective function, respectively. Under intrinsic transversality, we derive the convergence rate for both the feasibility and optimality measures, and show that every accumulation point is first-order stationary. Numerical experiments on problems stemming from sparse and low-rank optimization, including fitting spherical data, approximating hyperbolic embeddings on real data, and computing compressed modes, demonstrate the effectiveness of the proposed method.

2605.22709 2026-05-22 cs.CR cs.ET cs.RO cs.SY eess.SY

TriSweep: A Four-Drone Swarm Framework for Electromagnetic Side-Channel Analysis

TriSweep: 一种四无人机群框架用于电磁侧信道分析

Eric Yocam, Varghese Vaidyan

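AI summary: This paper presents TriSweep, a simulation framework in which a four-drone swarm performs autonomous standoff electromagnetic side-channel analysis of embedded microcontrollers at 0.25-1.5 m. Spatially specialized collector drones feed a stationary accumulator drone, which provides signal gain and mask cancellation; the swarm is evaluated in simulation only.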

Comments Simulation framework + systems design for a four-drone swarm performing standoff electromagnetic side-channel analysis. No hardware fabricated yet

AI Chinese abstract

电磁(EM)侧信道分析传统上假设存在一个静止且近距离的探测器,这种威胁模型低估了空中对手的威胁。TriSweep是一种模拟框架,设计并评估了一种四无人机群架构,用于自主远距离电磁侧信道分析(EM-SCA)嵌入式微控制器,距离为0.25-1.5米。三个空间专业化收集无人机——锚点(全频谱)、掩码探测器(掩码寄存器加载泄漏)和密码探测器(掩码SubBytes输出泄漏)——将信号馈入一个固定积累无人机,该无人机通过两个空间分离泄漏流的中心乘积进行相干结合(+4.8 dB信噪比增益)和二次掩码消除。在三个真实的ANSSI ASCAD数据集(ATmega8515掩码AES-128和50/100样本非同步变体)上评估该框架,其在0.25米范围内主要掩码数据集上实现了模拟密钥排名为18±1.7(五种子)。通过轮廓跟踪轨迹交叉相关对齐,单无人机排名从89降低到21,在100样本抖动变体上展示了对无人机悬停振动的补偿。积累无人机中的两个通道CNN收敛到损失为0.454(与随机基线5.545相比)并在非同步数据集上提高了排名。尚未制造物理硬件;原型构建是下一步计划。

English abstract

Electromagnetic (EM) side-channel analysis traditionally assumes a stationary, close-proximity probe - a threat model that underestimates aerial adversaries. TriSweep is a simulation framework that designs and evaluates a four-drone swarm architecture for autonomous standoff EM-SCA of embedded microcontrollers at 0.25-1.5 m. Three spatially specialized collector drones - Anchor (full-spectrum), Mask Probe (mask-register loading leakage), and Cipher Probe (masked SubBytes output leakage) - feed a stationary Accumulator drone that performs coherent combining (+4.8 dB SNR gain) and second-order mask cancellation via a centered product of the two spatially separated leakage streams. Evaluated against three real ANSSI ASCAD datasets (ATmega8515 masked AES-128 and 50/100-sample desynchronized variants), the framework achieves a simulated key rank of 18 +/- 1.7 (five-seed) at 0.25 m on the primary masked dataset. Profiling-trace cross-correlation alignment reduces single-drone rank from 89 to 21 on the 100-sample-jitter variant, demonstrating compensation for drone hover vibration. A two-channel CNN in the Accumulator converges to a loss of 0.454 (vs. random baseline 5.545) and improves rank on desynchronized datasets. No physical hardware has been fabricated; prototype construction is the planned next step.

2605.22666 2026-05-22 math.CO cs.LG math.PR

Holographic functions and neural networks

全息函数与神经网络

Balazs Szegedy

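AI summary: This paper studies holographic functions through three notions of bounded complexity (a sampling property, a structural property, and a computational property) and proves that the three are equivalent up to quantitative changes of the parameters.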

AI Chinese abstract

模糊布尔函数是映射 $f:\cube^n o [0,1]$,其中 $n\in\mathbb N$。我们介绍了并比较了三种表示此类函数具有有界复杂度的方式。第一种是采样性质:函数值 $f(x)$ 可以通过随机选择的少量坐标值在小误差和高概率下恢复。我们称其为全息性质。第二种是结构性质:$f$ 与在有限多个有界线性坐标形式上的一次多项式一致。第三种是计算性质:$f$ 与具有有限个非输入神经元、有界Lipschitz激活函数和有界输入权重的神经网络的输出一致。我们证明了这三种性质在参数上是等价的。从全息性到多项式结构的推论使用了超图正则性的弱变种。

English abstract

A fuzzy Boolean function is a map $f:\cube^n\to [0,1]$, where $n\in\mathbb N$. We introduce and compare three ways of saying that such a function has bounded complexity. The first is a sampling property: the value $f(x)$ can be recovered, up to small error and with high probability, from the values of a bounded number of randomly chosen coordinates of $x$. We call this the holographic property. The second is a structural property: $f$ is uniformly close to a bounded-degree polynomial in boundedly many bounded linear coordinate forms. The third is computational: $f$ is uniformly close to the output of a neural network with a bounded number of non-input neurons, bounded Lipschitz activation functions and bounded incoming weights. We prove that these three properties are equivalent up to quantitative changes of the parameters. The implication from holography to polynomial structure uses a variant of a weak version of hypergraph regularity.

2605.22653 2026-05-22 cs.DS cs.LG

The Secretary Problem with a Stochastic Precursor

带随机前导的秘书问题

Franziska Eberle, Alexander Lindermayr

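AI summary: This paper studies the secretary problem with a stochastic precursor and shows that a prediction can be valuable solely because of its arrival time. In the random-order model, a single uniformly timed precursor yields a success probability of at least 1/2, improving on the classic 1/e benchmark. In the adversarial-order model, sufficiently concentrated precursors recover constant success guarantees.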

AI Chinese abstract

在学习增强的在线算法中,预测通常因其提供的价值估计、解决方案或算法推荐而被重视。本文表明,预测仅因其到达时间而有价值。我们研究了带随机前导的秘书问题:一种无内容的信号,保证在最佳项目之前到达,但其他时间是随机的。该信号不携带额外信息;然而,其到达时间本身改变了最优停止策略的结构。我们分别在随机顺序和对抗性顺序模型中刻画了最优策略。在随机顺序中,单个均匀时间的前导可使成功概率达到至少1/2,优于经典1/e的基准。随着前导时间越来越晚,成功概率接近1。在对抗性顺序中,对于传统模型无法提供强保证的情况,足够集中的前导可恢复常数成功保证。我们的结果表明,这种新型的异步时间信息是在线决策中的独特且强大的建议形式,可能对其他问题也有效。

English abstract

In learning-augmented online algorithms, predictions are usually valued for what they say: a value estimate, a solution, or an algorithmic recommendation. This paper shows that predictions can also be valuable solely due to their arrival time. We study the fundamental secretary problem augmented with a stochastic precursor: a content-free signal that is guaranteed to arrive no later than the best item, but is otherwise stochastically timed. The signal does not carry any additional information; nevertheless, its timing alone changes the structure of optimal stopping. We characterize optimal policies in the random-order and adversarial-order models. In random order, a single uniformly timed precursor already gives success probability at least $\frac12$, improving on the classic $\frac1e$ benchmark. With increasingly late precursors, the success probability approaches $1$. In adversarial order, for which traditional models do not admit strong guarantees, sufficiently concentrated precursors recover constant success guarantees. Our results show that such novel forms of asynchronous temporal information are a distinct and powerful form of advice in online decision making and may also be effective for other problems.
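For context on the classic $\frac1e$ benchmark mentioned in the abstract, the following is a minimal simulation of the textbook secretary rule in the random-order model: skip roughly the first $n/e$ items, then accept the first item that beats everything seen so far. It does not implement the paper's precursor-based policies, which the abstract does not specify; the function names and the values of `n` and `trials` are illustrative assumptions.

```python
import math
import random


def classic_secretary_trial(n, rng):
    """Run the classic 1/e stopping rule once; return True if the best item is picked."""
    values = list(range(n))          # n - 1 is the best item
    rng.shuffle(values)              # random arrival order
    cutoff = int(n / math.e)         # length of the observation phase
    best_seen = max(values[:cutoff]) if cutoff > 0 else -1
    for v in values[cutoff:]:
        if v > best_seen:            # first item that beats the observation phase
            return v == n - 1
    return False                     # the best item was in the observation phase


def success_rate(n, trials, seed=0):
    """Monte Carlo estimate of the success probability of the classic rule."""
    rng = random.Random(seed)
    wins = sum(classic_secretary_trial(n, rng) for _ in range(trials))
    return wins / trials


print(success_rate(n=50, trials=20000, seed=1))  # roughly 1/e for moderate n
```

A precursor-aware policy would additionally condition its stopping decision on when the content-free signal arrives; the simulation above is only the baseline that such a policy is compared against.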

2605.22621 2026-05-22 cs.CR cs.LG cs.NI

UNAD+: An Explainable Hybrid Framework for Unknown Network Attack Detection

UNAD+: 一种用于未知网络攻击检测的可解释混合框架

Saif Alzubi, Frederic Stahl

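AI summary: This paper presents UNAD+, a framework that combines an unsupervised ensemble with a supervised refinement stage and adds an integrated explainability layer, improving both the performance and the transparency of unknown network attack detection.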

AI Chinese abstract

先前未见的网络攻击检测仍然是入侵检测系统面临的主要挑战。尽管监督学习方法在已知攻击类别上表现良好,但当新攻击类型未在训练数据中表示时,它们的性能受限。无监督方法更适合检测零日攻击,因为它们不需要标记的攻击样本,但它们通常具有较高的误报率,这限制了其在现实中的实用性。本文提出了UNAD+,一种改进的未知网络攻击检测框架,源自之前提出的Unknown Network Attack Detector (UNAD)。UNAD+结合了仅良性样本的无监督集成、加权多数投票(WMV),一种在伪标记检测上训练的监督精修阶段,以及一个后验可解释性层,提供局部和全局解释。该框架在CICIDS2017和NSL-KDD基准数据集上进行了评估。结果表明,UNAD+在原始UNAD框架上有所改进,在基准数据集上实现了超过98%的F1分数,同时显著减少了误报率,并通过集成可解释性增强了透明度和部署适用性。

English abstract

The detection of previously unseen network attacks remains a major challenge for intrusion detection systems. Although supervised learning methods often perform well on known attack classes, they are limited when new attack types are not represented in the training data. Unsupervised methods are more suitable for detecting zero-day attacks, as they do not require labelled attack samples, but they often suffer from high false positive rates, which limits their real-world usefulness. This paper presents UNAD+, an enhanced framework for unknown network attack detection derived from the previously proposed Unknown Network Attack Detector (UNAD). UNAD+ combines a benign-only unsupervised ensemble with Weighted Majority Voting (WMV), a supervised refinement stage trained on pseudo-labelled detections, and a post hoc explainability layer that provides both local and global explanations. The framework was evaluated on the CICIDS2017 and NSL-KDD benchmark datasets. The results show that UNAD+ improves on the original UNAD framework, achieving F1-scores above 98% across the benchmark datasets while significantly reducing false positives and enhancing transparency and deployment suitability through integrated explainability.
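Weighted Majority Voting, as named in the abstract, can be sketched generically as follows. The abstract does not say how UNAD+ derives its detector weights or its decision threshold, so the weights, the 0.5 threshold, and the function name here are assumptions for illustration only.

```python
def weighted_majority_vote(votes, weights, threshold=0.5):
    """Combine binary detector votes (1 = attack, 0 = benign) into one label.

    Returns (label, score), where score is the weight share of detectors
    that voted 1. The threshold of 0.5 is an assumed default.
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w for v, w in zip(votes, weights) if v == 1) / total
    label = 1 if score > threshold else 0
    return label, score


# One trusted detector flags the flow, two weaker detectors do not.
print(weighted_majority_vote([1, 0, 0], [0.6, 0.2, 0.2]))
# With equal weights the same votes are outvoted.
print(weighted_majority_vote([1, 0, 0], [1.0, 1.0, 1.0]))
```

In a pipeline like the one the abstract describes, the resulting labels would serve as pseudo-labels for the supervised refinement stage.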

2605.22612 2026-05-22 cs.CY cs.AI cs.LG

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

医疗LLM基准测试的可靠性仅取决于其显式假设

Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder

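AI summary: This paper argues that the evaluation-deployment gap of healthcare LLM benchmarks stems from implicit assumptions rather than from poor benchmark design, and proposes BenchmarkCards and staged evaluation to address it.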

Comments 13 pages, 1 figure

AI Chinese abstract

基准测试对于医疗评估是必要的,但不足以预测部署性能。我们的观点是,评估-部署差距并非源于基准设计不当,而是源于关于用户如何与模型交互的隐式假设,这些假设无法仅通过基准测试本身来揭示。为了使这一观点更明确,我们提出了将假设分为两类的分类:任务假设,可通过对话数据单独测试;以及结果假设,需要结果数据和行为研究来测试。关键的是,结果假设依赖于人类行为,即使设计良好的基准也无法直接观察。为了证明该框架的实用性,我们回顾性分析了一个医疗RCT作为案例研究,并发现差距自然分为大致相等的任务和结果差距。为此,我们做出了两项贡献:首先,我们提出BenchmarkCards,一种记录假设的工具;其次,我们提出分阶段评估,一种系统测试假设并评估性能的程序。

English abstract

Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance. Our position is that the evaluation--deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this, we make two contributions: first, we propose BenchmarkCards, an artifact that documents assumptions, and second, we propose staged evaluation, a procedure that systematically tests assumptions and evaluates performance.

2605.22604 2026-05-22 cs.CR cs.AI cs.LG cs.SE

Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms

无卡人工智能银行业创新:基于机器学习算法的全面框架用于网络安全与欺诈防范

Md Israfeel

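AI summary: This paper proposes a comprehensive framework that uses machine learning algorithms to strengthen cybersecurity and fraud mitigation in cardless AI banking systems, generating virtual cards through AI-powered data cryptography to reduce the risk of information exposure.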

AI Chinese abstract

无卡人工智能(AI)银行业的发展标志着金融领域的一次范式转变,为用户提供前所未有的安全性和便利性。本文概述了一个全面的框架,旨在增强网络安全,引入自动生成的虚拟卡,并在无卡AI银行系统中减轻欺诈风险。该框架设想了一种未来银行架构,利用AI驱动的数据加密技术来创建安全的虚拟卡以实现无缝交易。通过强调安全的通信渠道,它确保了银行系统、持卡人和第三方供应商之间的金融活动的完整性。基于AI的授权方法在验证每一笔交易的同时,主动识别潜在欺诈,展示了该框架在加强无卡AI银行业安全方面的有效性。初始方法,包含一个AI驱动的基于特征的银行系统,确保生成带有加密数据的虚拟卡,减少信息暴露并降低欺诈风险。整合机器学习算法为潜在的欺诈活动增加了一层保护。最后,所提出的框架为无卡AI银行系统建立了一个全面的网络安全和欺诈防范范式。其实施使金融机构能够应对传统银行业相关的安全问题,为一个不仅抗欺诈而且对用户安全和方便的未来银行业景观铺平道路。

English abstract

The advent of cardless artificial intelligence (AI) banking heralds a paradigm shift in the financial landscape, offering users unprecedented security and convenience. This paper outlines a comprehensive framework designed to enhance cybersecurity, introduce auto-generated virtual cards, and mitigate fraud risks within cardless AI banking systems. The framework envisions a future banking architecture that employs AI-powered data cryptography to create secure virtual cards for seamless transactions. By emphasizing secure communication channels, it ensures the integrity of financial activities among banking systems, cardholders, and third-party vendors. AI-based authorization methodologies play a pivotal role in authenticating each transaction while proactively identifying potential fraud, demonstrating the framework's efficacy in fortifying cardless AI banking security. The initial approach, featuring an AI-driven, feature-based banking system, ensures the generation of virtual cards with encrypted data, minimizing information exposure and reducing fraud risks. Integrating a machine learning algorithm adds an additional layer of protection against potential fraudulent activities. In conclusion, the proposed framework establishes a holistic cybersecurity and fraud-mitigation paradigm for cardless AI banking systems. Its implementation empowers financial institutions to address security concerns associated with traditional banking, paving the way for a future banking landscape that is not only fraud-resistant but also secure and convenient for users.

2605.22568 2026-05-22 cs.CR cs.AI

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

在不欺骗自己的情况下衡量安全:为什么基准测试智能体是困难的

Sahar Abdelnabi, Chris Hicks, Konrad Rieck, Ahmad-Reza Sadeghi

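AI summary: This paper examines core challenges in the benchmarks used to evaluate AI agents in security-critical roles (benchmark vulnerabilities, temporal staleness, and runtime uncertainty) and outlines directions toward more robust and trustworthy evaluation frameworks.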

AI Chinese abstract

用于评估在安全关键角色中AI代理的基准测试存在关键弱点。基于最近的经验证据,我们指出了三个核心挑战,这些挑战削弱了安全评估:基准漏洞、时间滞后的不准确性和运行时的不确定性。然后,我们概述了构建更稳健和可信评估框架的实用方向。

English abstract

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.

2605.22549 2026-05-22 stat.ML cs.LG

A Martingale Kernel Independence Test

一个鞅核独立性检验

Felix Laumann, Zhaolu Liu, Mauricio Barahona

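AI summary: This paper proposes two studentised statistics that, through self-normalisation and (for the second statistic) a half-sample split, enable independence testing without permutation calibration, substantially reducing computational cost while matching the type-I error rate and power of permutation-calibrated baselines.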

AI Chinese abstract

Hilbert-Schmidt Independence Criterion (HSIC) 及其联合独立性扩展 dHSIC 是退化 V 统计量,其数据依赖的加权 χ² 空间迫使排列校准,导致每测试成本乘以排列次数,实际中为两到三个数量级。通过将最近的鞅 MMD 构造应用于两样本检验到联合独立性问题,我们引入了两个学生化统计量,其空分布为标准正态分布,无论数据分布如何,因此单次正态分位数查找可完全替代排列步骤。第一个,mHSIC,是两个经验中心 Gram 矩阵的 Hadamard 积的自归一化下三角和。在独立性和有界四次矩核下,它收敛于标准正态分布。它对所有固定替代一致,且在样本量二次成本下运行,无需样本分割,与偏置 HSIC V 统计量匹配。第二个统计量 mdHSIC 通过单个半样本分割实现有限样本一致性:中心化估计在一半,下三角自归一化鞅在另一半运行,使条件均值残差缩成指数小量,因此在任意固定联合测试变量数下,统计量渐近标准正态分布,每测试成本仅与 d 线性增长。在合成数据中,输入维度从 1 到 500,联合测试变量从 2 到 10,两种统计量在运行速度上比排列校准基线快 25 到 60 倍,同时保持相同的经验 I 类错误率和测试功效。

English abstract

The Hilbert-Schmidt Independence Criterion (HSIC) and its joint-independence extension $d\mathrm{HSIC}$ are degenerate $V$-statistics whose data-dependent weighted-$\chi^2$ null limits force a permutation calibration that multiplies the per-test cost by the number of permutations, in practice two orders of magnitude. Adapting the recent martingale MMD construction for two-sample testing to the (joint) independence problem, we introduce two studentised statistics whose null distributions are standard normal regardless of the data law, so that a single normal-quantile lookup replaces the permutation step entirely. The first, $m\mathrm{HSIC}$, is a self-normalised lower-triangular sum of the Hadamard product of two empirically centred Gram matrices. Under independence and bounded-fourth-moment kernels it converges to a standard normal. It is consistent against every fixed alternative, and runs at quadratic cost in the sample size without any sample split, matching the biased HSIC $V$-statistic. Our second statistic, $md\mathrm{HSIC}$, achieves finite-sample consistency with a single half-sample split: the centring is estimated on one half and the lower-triangular self-normalised martingale is run on the other, shrinking the conditional-mean residual to a quantity that is exponentially small in $d$, so the statistic is asymptotically standard normal at every fixed number of jointly tested variables, with a per-test cost that grows only linearly in $d$. On synthetic data with per-variable input dimension from $1$ to $500$ and between $2$ and $10$ jointly tested variables, both statistics match the empirical type-I error rate and test power of permutation-calibrated baselines while running $25$ to $60\times$ faster.
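To make the cost argument in the abstract concrete, here is a small sketch of the permutation-calibrated baseline it refers to: the biased HSIC $V$-statistic, computed as the mean of the Hadamard product of two centred Gram matrices, followed by a permutation test. This is the standard baseline, not the paper's $m\mathrm{HSIC}$ or $md\mathrm{HSIC}$ statistics. The Gaussian kernel, the bandwidth of 1.0, scalar inputs, and all function names are assumptions.

```python
import math
import random


def gaussian_gram(xs, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar inputs."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in xs] for a in xs]


def centre(gram):
    """Double-centre a Gram matrix (equivalent to H K H with H = I - 11^T / n)."""
    n = len(gram)
    row = [sum(r) / n for r in gram]
    col = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[gram[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic_biased(xs, ys, bandwidth=1.0):
    """Biased HSIC V-statistic: trace(K H L H) / n^2, at quadratic cost in n."""
    n = len(xs)
    kc = centre(gaussian_gram(xs, bandwidth))
    lc = centre(gaussian_gram(ys, bandwidth))
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / n ** 2


def permutation_p_value(xs, ys, num_perms, rng):
    """Permutation calibration: one extra HSIC evaluation per permutation."""
    observed = hsic_biased(xs, ys)
    shuffled = list(ys)
    exceed = 0
    for _ in range(num_perms):
        rng.shuffle(shuffled)
        if hsic_biased(xs, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (num_perms + 1)


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(60)]
print(permutation_p_value(xs, xs, num_perms=50, rng=rng))  # small for dependent data
```

Each permutation repeats the full quadratic-cost computation, which is the multiplicative overhead that a statistic with a known standard normal null distribution would avoid.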

2605.22540 2026-05-22 cs.CE cs.AI

Dynamic Hypergraph Representation Learning for Multivariate Time Series without Prior Knowledge

动态超图表示学习用于无先验知识的多变量时间序列

Marco Gregnanin, Johannes De Smedt, Giorgio Gnecco, Maurizio Parton

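AI summary: This paper proposes a method for learning dynamic hypergraph representations of multivariate time series without prior knowledge. It builds hypergraphs using community detection and an attention mechanism, and uses a Dynamic Hypergraph Attention Convolution Network for prediction.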

AI Chinese abstract

超图有能力捕捉跨不同领域的实体之间的高维关系,使其成为研究社区中理解和分析复杂系统结构和动态的热门话题。然而,一个关键挑战是在超图结构有限或不存在的情况下,从时间序列数据中推导出超图表示。在本研究中,我们提出了一种模型,通过应用社区检测到时间序列并利用注意力机制将所得社区转换为超图,从而为多变量时间序列构建动态超图表示。通过不同时间序列数据集推导出的超图,然后由动态超图注意力卷积网络(DHACN)用于多变量时间序列预测。本研究通过引入一种新的方法,推动了超图表示领域的发展,该方法更适合在无先验知识的情况下揭示高阶关系。

English abstract

Hypergraphs have the capacity to capture higher-dimensional relationships among entities across various domains, making them a subject of growing interest within the research community for understanding the structure and dynamics of complex systems. However, a key challenge is the derivation of hypergraph representations from time series data in situations where the structure of the hypergraph is limited or absent. In this study, we propose a model that constructs a dynamic hypergraph representation for multivariate time series without relying on prior knowledge of the data. This is achieved by applying community detection to the time series and transforming the resulting communities, obtained through an attention mechanism, into a hypergraph using a clique-based technique. Hypergraph representations are derived from different time series datasets, and the resulting hypergraphs are then used by a Dynamic Hypergraph Attention Convolution Network (DHACN) for multivariate time series predictions. This research advances the field of hypergraph representation by introducing a novel approach that is better suited to uncover high-order relationships without prior knowledge.

2605.22506 2026-05-22 cs.CR cs.LG

EnCAgg: Enhanced Clustering Aggregation for Robust Federated Learning against Dynamic Model Poisoning

EnCAgg: 增强型聚类聚合用于对抗动态模型中毒的联邦学习

Tianyun Zhang, Zhen Yang, Haozhao Wang, Ru Zhang, Yongfeng Huang

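AI summary: This paper proposes a robust aggregation method that uses a small number of known benign clients as references to identify and filter malicious gradients accurately while retaining as many benign gradients as possible, even when the number of malicious clients is unknown and variable. The method comprises density-based low-dimensional gradient clustering, an enhancing clustering low-dimensional gradient generator model, and low-dimensional gradient re-clustering.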

AI Chinese abstract

联邦学习面临越来越多的模型中毒攻击威胁,这些攻击损害了其在提高隐私保护方面的应用。现有的防御方法通常依赖于固定的阈值或使用固定数量的聚类来进行区分恶意梯度和良性梯度。然而,这些方法难以适应恶意客户端的动态中毒策略,且由于客户端本地数据集的异质性,常常导致良性梯度的丢失。为了解决这些问题,我们提出了一种新的鲁棒聚合方法,该方法利用少量已知的良性客户端作为参考,能够准确识别和过滤恶意梯度,同时尽可能保留良性梯度,即使恶意客户端的数量未知且变化。首先,我们引入了一种基于密度的低维梯度聚类方法,将梯度投影到两个最分散的维度,并应用基于密度的聚类来识别恶意梯度,同时保留聚类中的良性梯度和可能的良性异常值。其次,我们设计了一种增强聚类低维梯度生成模型,该模型学习生成与良性簇边界对齐的伪梯度。这些伪梯度充当桥梁,连接稀疏的良性梯度异常值。第三,我们引入了低维梯度重新聚类,将生成的伪梯度与真实梯度一起聚类,以恢复被误分类为噪声点的良性梯度,使更多的良性梯度能够参与聚合。在MNIST、CIFAR-10和MIND数据集上的广泛实验表明,我们的方法在动态中毒场景下表现出卓越的保真度和鲁棒性。

English abstract

Federated learning faces increasing threats from model poisoning attacks, which harm its application to improving privacy. Existing defense methods typically rely on fixed thresholds or perform clustering with a fixed number of clusters to distinguish malicious gradients from benign ones. However, these methods adapt poorly to the dynamic poisoning strategies of malicious clients, and often result in the loss of benign gradients due to the heterogeneity of clients' local datasets. To address these problems, we propose a novel robust aggregation method that leverages a small number of known benign clients as references, enabling accurate identification and filtering of malicious gradients while retaining as many benign gradients as possible, even when the number of malicious clients is unknown and variable. First, we introduce a density-based low-dimensional gradient clustering method, which projects gradients onto the two most divergent dimensions and applies density-based clustering to identify malicious gradients while retaining clustered benign gradients and potentially benign outliers. Second, we design an enhancing clustering low-dimensional gradient generator model, which learns to generate pseudo-gradients aligned with the boundary of the benign cluster. These pseudo-gradients act as bridges to connect sparse benign gradient outliers. Third, we introduce low-dimensional gradient re-clustering that clusters the generated pseudo-gradients together with real gradients to recover benign gradients misclassified as noise points, enabling more benign gradients to participate in aggregation. Extensive experiments on the MNIST, CIFAR-10, and MIND datasets demonstrate that our method exhibits superior fidelity and robustness under dynamic poisoning scenarios.

2605.22463 2026-05-22 quant-ph cs.LG

Reinforcement learning for ion shuttling on trapped-ion quantum computers

基于受限离子量子计算机的离子穿梭强化学习

Maximilian Schier, Lea Richtmann, Christian Staufenbiel, Tobias Schmale, Daniel Borcherding, Michèle Heurs, Bodo Rosenhahn

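AI summary: This paper uses reinforcement learning to optimize ion shuttling on trapped-ion quantum computers. By learning a strategy through direct interaction with the problem, the approach reduces shuttling operations by up to 36.3% and is applicable to a range of chip architectures.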

Comments 15 pages + 9 pages supplementary material, 6 figures

AI Chinese abstract

可扩展的受限离子量子计算通常通过具有不同功能区域的模块化芯片实现,如存储、状态准备和门执行。为了执行量子电路,离子必须在这些区域之间运输,这一过程称为离子穿梭。为了获得可靠计算结果,必须优化穿梭过程。然而,随着离子数量的增加,这一过程成为高维优化问题,最优解无法高效计算。本文首次将强化学习(RL)应用于离子穿梭的优化,RL适用于此类场景,因为它能够通过直接与问题交互学习策略。我们证明我们的RL方法优于当前最先进的启发式技术,减少了多达36.3%的穿梭操作。此外,我们展示了该方法可以轻松应用于各种芯片架构。我们的方法为研究芯片设计中的穿梭效率提供了灵活的工具,因此对于未来更复杂的架构具有高度相关性。

English abstract

Scalable trapped-ion quantum computing is commonly realized with modular chips that feature distinct zones with specific functionalities, such as storage, state preparation, and gate execution. To execute a quantum circuit, the ions must be transported between these zones. This process is called ion shuttling. To achieve reliable computation results, the shuttling process must be optimized. However, as the number of ions increases, this becomes a high-dimensional optimization problem where optimal solutions cannot be computed efficiently. We demonstrate, to the best of our knowledge, the first use of reinforcement learning (RL) for the optimization of ion shuttling. RL is well-suited for such scenarios, as it enables learning a strategy through direct interaction with the problem. We show that our RL approach outperforms current state-of-the-art heuristic techniques, yielding a reduction in shuttling operations of up to 36.3 %. Furthermore, we show that our method is easily applicable to various chip architectures. Our approach offers a versatile method to study shuttling efficiency during chip design and, therefore, a highly relevant tool for future, more complex architectures.

2605.22441 2026-05-22 cs.CR cs.AI

A Constant-Time Implementation Methodology for Activation Functions on Microcontrollers

在微控制器上实现激活函数的常数时间方法

Andrii Tyvodar, Andreas Rechberger, Dirmanto Jap, Shivam Bhasin, Bernhard Jungk, Jakub Breier, Xiaolu Hou

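AI summary This paper proposes a constant-time implementation methodology for activation functions on embedded microcontrollers. It combines branchless selection, fixed-cost Padé-based approximation, dummy arithmetic where needed, and cycle alignment to obtain timing-regular implementations, validated on ReLU, sigmoid, tanh, GELU, and Swish.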

详情
AI中文摘要

嵌入式神经网络推断可能通过定时侧信道泄露信息,包括由激活函数评估引起的泄露。本文提出了一种在嵌入式微控制器上实现激活函数的常数时间方法,并在ARM Cortex-M4平台上验证了ReLU、sigmoid、tanh、GELU和Swish函数。所提出的方法结合了无分支选择、固定成本Padé近似、必要的虚拟算术和周期对齐,以获得定时规律的激活函数实现。作为动机,我们评估了一种基于脱同步的防护措施,并展示了其仍易受基于模板的定时攻击攻击。实验结果表明,所得到的受保护实现对于所有测试输入具有相同的周期数,包括三函数设置下的88个周期和五函数设置下的108个周期。同时,数值误差分析表明,近似的非线性函数保留了高精度。这些结果表明,所提出的方法为构建在嵌入式推断中抗侧信道攻击的激活函数提供了实用基础。

英文摘要

Embedded neural-network inference can leak information through timing side channels, including leakage caused by the evaluation of activation functions. This work proposes a constant-time implementation methodology for activation functions on embedded microcontrollers and validates it on ReLU, sigmoid, tanh, GELU, and Swish on an ARM Cortex-M4 platform. The proposed methodology combines branchless selection, fixed-cost Padé-based approximation, dummy arithmetic where needed, and cycle alignment to obtain timing-regular activation-function implementations. As motivation, we also evaluate a desynchronization-based countermeasure and show that it remains vulnerable to a template-based timing attack. Experimental results show that the resulting protected implementations achieve identical cycle counts for all tested inputs, including 88 cycles in the three-function setting and 108 cycles in the five-function setting. At the same time, the numerical-error analysis indicates that the approximated nonlinear functions retain high accuracy. These results suggest that the proposed methodology provides a practical basis for constructing side-channel-resistant activation functions in embedded inference.
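
Two of the ingredients named above, branchless selection and a fixed-cost Padé approximation, can be sketched as follows. This is a minimal illustration of the arithmetic only: the function names are hypothetical, the [3/2] Padé approximant of tanh is an assumption (the abstract does not say which approximants the paper uses), and Python itself gives no timing guarantees, so real constant-time behavior would have to be checked on the target Cortex-M4.

```python
import math


def relu_branchless_i32(x: int) -> int:
    """ReLU for a value in the signed 32-bit range, selected without if/else."""
    # x >> 31 is -1 (all ones) for negative x and 0 otherwise,
    # so its complement is an all-zeros or all-ones selection mask.
    mask = ~(x >> 31)
    return x & mask


def tanh_pade_3_2(x: float) -> float:
    """[3/2] Pade approximant of tanh around 0 (assumed choice for this sketch).

    Every input goes through the same multiply/add/divide sequence,
    which is what makes the cost independent of the input value.
    """
    x2 = x * x
    return x * (15.0 + x2) / (15.0 + 6.0 * x2)


print(relu_branchless_i32(-5), relu_branchless_i32(7))
print(abs(tanh_pade_3_2(0.5) - math.tanh(0.5)))
```

The approximant is accurate only near zero, so a practical implementation would also need a range-limiting step; how the paper handles that is not stated in the abstract.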

2605.22438 2026-05-22 stat.ML cs.GT cs.LG

Do Not Trust The Auctioneer: Learning to Bid in Feedback-Manipulated Auctions

不要相信拍卖师:在反馈操纵拍卖中学习出价

Luigi Foscari, Matilde Tullii, Vianney Perchet

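AI summary This paper studies learning to bid in feedback-manipulated auctions and proposes an algorithm that combines a robust interval-elimination branch with an optimistic branch, complemented by a matching lower bound in the single-active-region case.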

详情
AI中文摘要

Shilling是指通过人工出价使竞争看起来更激烈以推高价格。我们研究了重复的第一价格拍卖,在其中shilling影响反馈但不影响分配:学习者在真实竞争出价中获胜或失败,但在失败后观察到真实出价和一个独立的shill出价的最大值。这种操纵改变了学习者所观察到的内容,从而影响其学习出价的方式,而不会改变当前拍卖的结果。我们分析了与最佳出价基准相比的遗憾,假设shill-bid分布已知。即使如此,shilling仍可能掩盖真实出价,而有用的侧信息仅通过间歇性低shill事件出现。我们的算法结合了一个鲁棒的区间消除分支,该分支忽略shilled报告并达到动态定价率$\tilde{\mathcal{O}}(T^{2/3})$,以及一个乐观分支,该分支去偏失败侧报告并利用其在可靠时的结果信息,达到第一价格拍卖的速率$\tilde{\mathcal{O}}(\sqrt{T})$。一个验证和竞赛过程让算法在不知道正确尺度或反馈几何学的情况下使用这些乐观更新。我们用单活跃区域情况下的匹配下界补充了上界,除了对数因子外。总体而言,结果表明,即使只有反馈的shilling也能显著改变重复出价的统计难度。

英文摘要

Shilling is the use of artificial bids to make competition appear stronger and push prices upward. We study repeated first-price auctions in which shilling affects feedback but not allocation: the learner wins or loses against the real competing bid, but after a loss observes the maximum of the real bid and an independent shill bid. Thus the manipulation changes what the learner observes and hence how it learns to bid, without changing the outcome of the current auction. We analyze regret with respect to the best bid benchmark, assuming that the shill-bid distribution is known. Even then, shilling can mask the real bid, while useful side information appears only through intermittent low-shill events. Our algorithm combines a robust interval-elimination branch, which ignores the shilled report and achieves the dynamic-pricing rate $\tilde{\mathcal{O}}(T^{2/3})$, with an optimistic branch that debiases losing-side reports and exploits the resulting suffix information when it is reliable, achieving the first-price-auction rate $\tilde{\mathcal{O}}(\sqrt{T})$. A validation and racing procedure lets the algorithm use these optimistic updates without knowing the right scale or feedback geometry in advance. We complement the upper bounds with a matching lower bound, up to logarithmic factors, in the single-active-region case. Overall, the results show that even feedback-only shilling can sharply alter the statistical difficulty of repeated bidding.
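
Why a known shill-bid distribution makes debiasing possible can be seen from a simplified identity. The notation is introduced here and is not taken from the paper: the real competing bid $m$ has CDF $F$, the shill bid $s$ is independent of $m$ with known CDF $G$, and the report is $r = \max(m, s)$. Ignoring, for simplicity, the fact that reports are only seen after a loss, independence gives:

```latex
\Pr[r \le x] = \Pr[m \le x]\,\Pr[s \le x] = F(x)\,G(x)
\quad\Longrightarrow\quad
F(x) = \frac{\Pr[r \le x]}{G(x)} \quad \text{wherever } G(x) > 0 .
```

Where $G(x)$ is small, the division amplifies estimation noise, which is consistent with the abstract's remark that useful information arrives only through intermittent low-shill events. The paper's actual estimator may differ from this sketch.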

2605.22437 2026-05-22 cs.CR cs.AI cs.LG

Characterizing the Fault Response of the Intel Neural Compute Stick 2 Under Single-Pulse Electromagnetic Fault Injection

对Intel神经计算Stick 2在单脉冲电磁故障注入下的故障响应进行表征

Štefan Kučerák, Jakub Breier, Xiaolu Hou

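AI summary This paper characterizes the fault response of the Intel Neural Compute Stick 2 under single-pulse electromagnetic fault injection. A systematic campaign identifies four reproducible outcome classes, and the paper discusses mitigation strategies graded by class.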

详情
AI中文摘要

视觉处理单元和其他商业神经网络推断加速器越来越多地应用于安全相关的边缘应用,但它们在瞬态硬件干扰下的故障响应在开放文献中仍然缺乏表征。对于Intel Movidius Myriad X,封装为Intel神经计算Stick 2(NCS2),只有单篇可行性研究已发表。我们报告了一项系统性的单脉冲电磁故障注入(EMFI)测试,该测试在运行三个ImageNet训练的卷积神经网络(ResNet-18、ResNet-50、VGG-11)的OpenVINO运行时上进行。在1,536次热点测试和约16,000次参数搜索测试中,单脉冲产生四种可重复的故障类别:无测量精度变化、轻微的静默数据破坏、主要的持续退化,该退化在后续推断中持续直到模型重新加载,以及需要USB电源循环的设备挂起;这些结果分别解释为无影响、SDC可能带有类似SET或小的持久状态机制、SEU-like持续破坏,以及SEFI-like功能丧失。两个发现是核心。首先,主要退化类别可以在18-31%的测试中诱导,其中崩溃后的top-1精度低于5%,在所有后续推断中持续直到显式模型重新加载 - 这一状态没有任何推断API级别的机制可以检测。第二,这一状态也可以通过向空闲设备发送脉冲来诱导,表明仅靠加载时的完整性检查是不够的。我们讨论了按类别分级的缓解策略,重点是可以在应用级别实现的机制,而无需修改设备固件或OpenVINO运行时。

英文摘要

Vision processing units and other commercial neural-network inference accelerators are increasingly deployed in safety-relevant edge applications, but their fault response under transient hardware disturbances remains poorly characterized in the open literature. For the Intel Movidius Myriad X, packaged as the Intel Neural Compute Stick 2 (NCS2), only a single feasibility study has been published. We report a systematic single-pulse electromagnetic fault injection (EMFI) campaign on the NCS2 running three ImageNet-trained convolutional neural networks (ResNet-18, ResNet-50, VGG-11) on the OpenVINO runtime. Across 1,536 spot-test trials at characterized hotspots and approximately 16,000 parameter-search trials, single pulses produce four reproducible outcome classes: no measured accuracy change, minor silent data corruption, major persistent degradation that survives across subsequent inferences until model reload, and device hangs requiring USB power-cycling; these outcomes are respectively interpreted as no-effect, SDC with possible SET-like or small persistent-state mechanisms, SEU-like persistent corruption, and SEFI-like loss of functionality. Two findings are central. First, the major-degradation class can be induced at 18-31% of trials at characterized hotspots, with post-collapse top-1 accuracy below five percent and persistence across all subsequent inferences until explicit model reload - a regime that no inference-API-level mechanism detects. Second, this regime is also inducible by pulses delivered to an idle device with the model already loaded, demonstrating that load-time integrity checks alone are insufficient. We discuss mitigation strategies graded by class, focusing on mechanisms implementable at the application level without modification to the device firmware or the OpenVINO runtime.

2605.22425 2026-05-22 eess.IV cs.CV

Time-varying rPPG signal separation via block-sparse signal model

基于块稀疏信号模型的时变rPPG信号分离

Kosuke Kurihara, Yoshihiro Maeda, Daisuke Sugimura, Takayuki Hamamoto

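AI summary This paper proposes an rPPG signal extraction method that exploits the quasi-periodic characteristics of rPPG signals. It builds a time-varying signal separation framework that adapts to illumination fluctuations, and experiments on a public dataset demonstrate its effectiveness.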

Comments Accepted by IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

远程光脉冲波形图(rPPG)通过分析面部视频中细微的颜色变化来实现非接触式心脏脉搏信号测量。然而,由于rPPG信号极弱且易受光照噪声影响,提取rPPG信号仍然具有挑战性。本文提出了一种rPPG信号提取方法,利用rPPG信号的近似周期特性,将其近似周期性建模为时频域中的块稀疏结构。为了整合块稀疏模型并实现光照波动下的自适应信号分离,我们构建了时变信号分离框架。使用公共数据集的实验验证了该方法的有效性。

英文摘要

Remote photoplethysmography (rPPG) enables non-contact measurement of cardiac pulse signals by analyzing subtle color changes in facial videos. Nevertheless, extracting rPPG signals remains challenging because of their extremely weak signal strength and susceptibility to illumination noise. In this paper, we propose an rPPG signal extraction method that exploits the quasi-periodic characteristics of rPPG signals. Our approach models quasi-periodicity of the rPPG signal, which arises from the stable cardiac cycle, as a block-sparse structure in the time-frequency domain. To incorporate a block-sparse model and enable adaptive signal separation under illumination fluctuations, we construct a time-varying signal separation framework. Experiments using a public dataset demonstrate the effectiveness of our method.
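
A common building block for block-sparse models is group soft-thresholding, which shrinks whole blocks of coefficients together instead of one coefficient at a time. The sketch below shows that generic operator only; it is not the paper's separation framework, and the function name, block layout, and threshold `tau` are assumptions made for illustration.

```python
import numpy as np


def block_soft_threshold(coeffs, block_size, tau):
    """Shrink each contiguous block of coefficients by its L2 norm.

    Blocks whose norm is below tau are set to zero entirely; stronger
    blocks are scaled down. len(coeffs) must be a multiple of block_size.
    """
    c = np.asarray(coeffs, dtype=float)
    blocks = c.reshape(-1, block_size)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    # Guard against division by zero for all-zero blocks.
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return (blocks * scale).reshape(c.shape)


# First block (norm 5) survives with scale 0.8; second block (norm ~0.14) vanishes.
print(block_soft_threshold([3.0, 4.0, 0.1, 0.1], block_size=2, tau=1.0))
```

In a time-frequency setting, a block could stand for a group of neighboring frequency bins around the pulse rate, so a steady cardiac rhythm keeps a few blocks active while scattered noise is suppressed.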

2605.22379 2026-05-22 cs.HC cs.AI cs.LG

Cross-Subject EEG Emotion Recognition Based on Temporal Asynchronous Alignment Contrastive Learning

基于时间异步对齐对比学习的跨受体EEG情绪识别

Ying Xie, Yi Zheng, Zehui Xiao, Wenkai Lu, Mengting Liu

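AI summary This paper proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework for cross-subject EEG emotion recognition. It addresses temporal misalignment of responses among subjects by changing the similarity calculation strategy, mitigating the effects of inter-subject differences and temporal delays.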

Comments 16 pages, 7 figures

详情
AI中文摘要

随着科技的发展,情绪研究的重要性日益凸显。近年来,基于脑电图(EEG)的情绪识别已成为一个活跃的研究领域,因其客观性和高时间分辨率。然而,大多数现有方法侧重于优化编码器结构以增强特征提取能力,而对相似性计算策略关注较少,特别是忽略了不同受体之间响应的潜在时间不一致问题。为了解决这些不足,本文受ColBERT在自然语言处理(NLP)中的晚期交互机制启发,提出了一种基于时间异步对齐的对比学习(TA2CL)框架。该方法将传统的全局

英文摘要

With the advancement of science and technology, the importance of emotion research has become increasingly evident. Electroencephalography (EEG)-based emotion recognition has emerged as an active research area in recent years, owing to its objectivity and high temporal resolution. However, most existing methods focus on optimizing encoder structures to enhance feature extraction capabilities, while paying relatively little attention to similarity calculation strategies, particularly overlooking the potential temporal misalignment of responses among different subjects. To address these shortcomings, this paper draws inspiration from the late interaction mechanism of ColBERT in natural language processing (NLP) and proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework. This method transforms the traditional global "hard alignment" similarity calculation approach into a fine-grained local matching mechanism, enabling the model to adaptively search for and align "locally highly correlated" segments between two EEG signals, thereby effectively mitigating the effects of inter-subject differences and temporal delays. Experimental results demonstrate that the proposed method achieves strong performance across multiple public datasets. Specifically, on the FACED dataset, it achieves an accuracy of 64.5% for the nine-class classification task and 79.5% for the binary classification task, while on the SEED and SEED-V datasets, it achieves accuracies of 86.4% and 70.1%, respectively, validating the method's effectiveness and generalization capability.
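
The contrast between global "hard alignment" and late-interaction matching can be sketched with ColBERT's MaxSim scoring applied to per-segment embeddings. This is a minimal illustration assuming each EEG signal has already been encoded into a matrix of segment embeddings (rows are time segments); the function names are hypothetical and the paper's exact similarity may differ.

```python
import numpy as np


def _unit_rows(m):
    """L2-normalize each row so dot products become cosine similarities."""
    m = np.asarray(m, dtype=float)
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def hard_alignment_score(a, b):
    """Compare segment i of a only with segment i of b (same time index)."""
    return float(np.sum(_unit_rows(a) * _unit_rows(b)))


def late_interaction_score(a, b):
    """MaxSim: each segment of a is matched with its most similar segment of b."""
    sim = _unit_rows(a) @ _unit_rows(b).T
    return float(sim.max(axis=1).sum())


# The same two response patterns, but the second subject responds in swapped order.
subject_1 = [[1.0, 0.0], [0.0, 1.0]]
subject_2 = [[0.0, 1.0], [1.0, 0.0]]
print(hard_alignment_score(subject_1, subject_2))    # 0.0
print(late_interaction_score(subject_1, subject_2))  # 2.0
```

Index-wise comparison scores the shifted responses as unrelated, while the local matching recovers the full similarity.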

2605.22363 2026-05-22 math.OC cs.AI cs.GT

Incentive-Aligned Vehicle-to-Vehicle Energy Trading via Nash-Integrated Multi-Agent Reinforcement Learning

通过纳什整合多智能体强化学习实现激励对齐的车对车能源交易

Yujin Lin, Yue Yang, Hao Wang

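AI summary This paper integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient (Nash-MADDPG) for incentive-aligned vehicle-to-vehicle energy trading, improving social welfare, trading volume, and fairness over Double Auction.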

Comments The 24th IEEE International Conference on Industrial Informatics, 2026

详情
AI中文摘要

车对车(V2V)能源交易允许电动车辆(EVs)之间进行去中心化的点对点能源交换,从而减少对电网的依赖并利用剩余容量获取收益。然而,协调具有不同充电需求和不确定到达-离开时间表的自利EV代理仍然具有挑战性。现有方法要么需要集中优化但计算受限,要么缺乏公平性保障。本文将纳什博弈解整合到多智能体深度确定性策略梯度中,即纳什-MADDPG,用于激励对齐的V2V能源交易。纳什博弈确定高效的双方面定价,而纳什引导的价格接近性奖励使代理学习朝着博弈最优策略方向发展。在30天连续运行的评估中,与双重拍卖相比,社会福利提高了61.6%,交易量提高了62.9%,同时实现了更高的公平性,如贾恩指数提高了40.1%。在6-100个代理跨越30天的时间范围内进行测试,连续车辆周转确认了在种群规模上的可扩展性和在纳什博弈基准附近的经验稳定价格。

英文摘要

Vehicle-to-vehicle (V2V) energy trading enables decentralized peer-to-peer energy exchange among electric vehicles (EVs), reducing grid dependency while monetizing surplus capacity. However, coordinating self-interested EV agents with diverse charging needs and uncertain arrival-departure schedules remains challenging. Existing approaches either require centralized optimization with computational limitations or lack fairness guarantees. This paper integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient, yielding Nash-MADDPG, for incentive-aligned V2V energy trading. Nash bargaining determines efficient bilateral pricing, while Nash-guided price proximity rewards align agent learning toward bargaining-optimal strategies. Evaluation over 30 days of continuous operation demonstrates improvements of 61.6% in social welfare and 62.9% in trading volume over Double Auction, while achieving superior fairness, including a 40.1% improvement in Jain's index. Testing with 6-100 agents over a 30-day horizon with continuous vehicle turnover confirms scalability across population size and empirically stable pricing near the Nash Bargaining benchmark.
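
Two quantities mentioned above have simple closed forms that are easy to check numerically. The sketch assumes a toy bilateral trade with linear utilities and zero disagreement payoffs, in which case the Nash bargaining price is the midpoint between the seller's reserve price and the buyer's valuation; the paper's utility model is not given in the abstract, and the function names are hypothetical. Jain's index is the standard fairness measure $(\sum_i x_i)^2 / (n \sum_i x_i^2)$.

```python
def nash_bargaining_price(seller_reserve: float, buyer_value: float) -> float:
    """Price maximizing (p - seller_reserve) * (buyer_value - p).

    Under the toy linear-utility assumption the maximizer is the midpoint.
    """
    if buyer_value < seller_reserve:
        raise ValueError("no mutually beneficial price exists")
    return 0.5 * (seller_reserve + buyer_value)


def jain_index(allocations) -> float:
    """Jain's fairness index: 1.0 when all shares are equal, 1/n when one agent gets everything."""
    xs = [float(x) for x in allocations]
    total = sum(xs)
    squares = sum(x * x for x in xs)
    return total * total / (len(xs) * squares)


print(nash_bargaining_price(0.10, 0.30))
print(jain_index([1, 1, 1, 1]), jain_index([1, 0, 0, 0]))
```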

2605.22343 2026-05-22 cs.MA cs.AI cs.SE

Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

Sibyl-AutoResearch:自主研究需要自我进化的试验与错误机制,而非论文生成器

Chengcheng Wang, Qinhua Xie, Wei He, Jianyuan Guo, Shiqi Wang, Chang Xu

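AI summary This paper introduces Sibyl-AutoResearch, a self-evolving framework for autonomous research that addresses how current systems lose trial experience. Two auditable conversion units, trial-to-behavior and trial-to-harness-behavior conversion, link trial signals to later research actions and recurring process failures to system updates.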

详情
AI中文摘要

自主研究系统日益使科学工作流程可执行:代理可以提出想法、运行代码、检查结果并起草论文。但可执行的工作流程本身并不产生研究判断。我们分析了当前系统在试验经验积累方面的不足:弱证据变成散文,试点信号变成广泛声明,记忆保持文本,重复的过程失败不改变后来的行为。我们引入Sibyl-AutoResearch,一个自我进化的AutoResearch框架,围绕科学试验与错误机制构建。一个机制让代理运行有界试验,保存积极和消极结果,并将教训路由到后来的规划、验证、声明范围、调度、批评、写作和机制修复。我们通过两个可审计的转换单元正式化这一过程:试验到行为转换,将试验信号链接到后来的研究行动,以及试验到机制行为转换,将重复的过程失败链接到系统更新。我们实现了该框架在SIBYL中,这是一个基于文件的自主研究系统,暴露了状态、角色、记忆、门、和制品痕迹所需以检查这些转换路径。回顾性审计识别出八个高置信度的转换事件,中位延迟为一个迭代,最大延迟为三个迭代。一个恢复失败注册表进一步展示了如何通过五个自然发生的失败类别,包括重复结果、过时数字和不支持的统计数据,被阻止、降级或路由到后来的修复。这些痕迹不建立比较性能的主张;它们表明所提出的转换单元可以从现实的自主研究工作空间中恢复。SIBYL框架和系统可在https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem上获得。

英文摘要

Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But executable workflows do not by themselves produce research judgment. We analyze where current systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains textual, and recurring process failures do not change later behavior. We introduce Sibyl-AutoResearch, a self-evolving AutoResearch framework built around Scientific Trial-and-Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route lessons into later planning, validation, claim scope, scheduling, critique, writing, and harness repair. We formalize this through two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file-backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, with a median latency of one iteration and a maximum latency of three iterations. A recovered-failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units are recoverable from realistic autonomous-research workspaces. The SIBYL framework and system are available at https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem.

2605.22321 2026-05-22 cs.CR cs.AI cs.SE

Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions

对时间、空间和语义规避的自主代理进行基准测试

Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian, Jialuo Chen, Jingyi Wang, Xiaofang Yang, Shiwen Cui, Changhua Meng, Xinhao Deng, Zhen Wang

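AI summary This paper proposes a multi-dimensional evasion framework targeting LLM-based agent systems, with temporal, spatial, and semantic attack vectors. It quantifies these threats with the A3S-Bench benchmark across practical threat scenarios and reveals systemic, architecture-level vulnerabilities in current autonomous agent systems.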

Comments 21 pages, 9 figures, 7 tables. Code and data available at https://github.com/antgroup/Agent3Sigma-Stage

详情
AI中文摘要

随着自主代理(例如OpenClaw)越来越多地利用深度系统级权限执行复杂任务,它们引入了严重的、未缓解的安全风险。当前的漏洞分析大多集中在单轮、无状态行为上,忽略了状态ful、多轮交互和动态工具调用中扩大的攻击面。在本文中,我们提出了一种新的、多维的规避框架,针对基于LLM的代理系统。我们引入了三种隐蔽的攻击向量:(1)时间规避,将恶意负载碎片化地分布在连续的交互轮次中;(2)空间规避,将负载隐藏在复杂的外部 artifacts 中,以逃避标准LLM解析机制;(3)语义规避,将恶意意图隐藏在良性上下文噪声之下。为了系统地量化这些威胁,我们构建了A3S-Bench,一个包含2,254个真实世界代理执行轨迹的综合基准。评估一个标准代理框架,分别与10个主流LLM后端整合,针对20个实际威胁场景,我们展示了我们的规避框架将平均风险触发率从28.3%的基准提升到52.6%。这些发现揭示了当前自主代理系统在架构层面的系统性漏洞,现有防御措施无法解决这些问题,突显了需要针对这些独特威胁设计的防御机制的紧迫性。

英文摘要

As autonomous agents (e.g., OpenClaw) increasingly operate with deep system-level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses overwhelmingly focus on single-turn, stateless behaviors, overlooking the expanded attack surface inherent in stateful, multi-turn interactions and dynamic tool invocations. In this paper, we propose a novel, multi-dimensional evasion framework targeting LLM-based agent systems. We introduce three stealthy attack vectors: (1) Temporal evasion, which fragments malicious payloads across sequential interaction turns; (2) Spatial evasion, which conceals payloads within complex external artifacts that evade standard LLM parsing mechanisms; and (3) Semantic evasion, which obscures malicious intents beneath benign contextual noise. To systematically quantify these threats, we construct A3S-Bench, a comprehensive benchmark comprising 2,254 real-world agent execution trajectories. Evaluating a standard agent framework separately integrated with 10 mainstream LLM backbones against 20 practical threat scenarios, we demonstrate that our evasion framework elevates the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture-level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting an urgent need for defense mechanisms tailored to the unique threats.

2605.22306 2026-05-22 cs.MA cs.AI

ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps

ACCoRD:基于深度学习的O-RAN xApps中的Actor-Critic冲突解决

Cezary Adamczyk, Adrian Kliks

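AI summary This paper proposes ACCoRD, a method that resolves control conflicts in the O-RAN Near-Real Time RAN Intelligent Controller using an artificial neural network trained with the reinforcement learning algorithm PPO-Clip. It improves on the efficiency of rule-based approaches in medium and high traffic scenarios.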

详情
AI中文摘要

冲突缓解(ConMit)是智能网络控制在开放无线电接入网络(O-RAN)中的关键部分。本文提出了一种名为ACCoRD的方法,通过在近实时RAN智能控制器中使用一个通过强化学习算法PPO-Clip训练的人工神经网络(ANN)来解决检测到的控制冲突。实现的人工神经网络分析有关网络和冲突控制决策的数据,以推断最优的冲突解决(CR)操作。冲突解决代理在每次解决冲突后从网络收集反馈,以评估其效率并在批量训练中调整ANN的权重。所提出方法的评估基于仿真数据。提出了一种新的评估CR解决方案的方法。结果表明,所提出的基于ANN的方法通过显著减少由冲突控制决策引起的负面网络事件,提高了规则方法的效率。

英文摘要

Conflict Mitigation (ConMit) is a crucial part of intelligent network control in Open Radio Access Networks (O-RAN). In this paper, we propose a method named ACCoRD to resolve detected control conflicts in the Near-Real Time RAN Intelligent Controller using a Conflict Resolution (CR) Agent with an Artificial Neural Network (ANN) trained with the reinforcement learning algorithm PPO-Clip. The implemented ANN analyzes data about the network and conflicting control decisions to infer optimal CR actions. The CR Agent gathers feedback from the network after each resolved conflict to assess its efficiency and adjust the ANN's weights during batch training. The evaluation of the proposed approach is based on simulation data. A new methodology for evaluating CR solutions is proposed. Results show that the proposed ANN-based method improves on the efficiency of rule-based approaches by significantly reducing negative network events caused by conflicting control decisions in medium and high traffic scenarios.
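
For readers unfamiliar with PPO-Clip, its clipped surrogate objective can be written in a few lines. This is the generic objective, not ACCoRD's training code; the function name and the default clipping range `eps=0.2` are illustrative assumptions.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratio     : new-policy / old-policy probability of each sampled action
    advantage : advantage estimate for each sampled action
    eps       : clipping range; moving the ratio beyond 1 +/- eps earns no extra gain
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))


# A good action whose probability rose by 50% is only credited up to 1 + eps.
print(ppo_clip_objective([1.5], [1.0]))
# A bad action whose probability halved is evaluated at the clipped ratio 1 - eps.
print(ppo_clip_objective([0.5], [-1.0]))
```

Taking the minimum of the clipped and unclipped terms keeps each batch update close to the policy that collected the feedback.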

2605.22268 2026-05-22 cs.NI cs.AI cs.CV

Impact of Atmospheric Turbulence and Pointing Error on Earth Observation

大气湍流和指向误差对地球观测的影响

Celia Sánchez-de-Miguel, Antonio M. Mercado-Martínez, Beatriz Soret, Antonio Jurado-Navas, Miguel Castillo-Vázquez

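AI summary This paper studies how atmospheric turbulence and pointing error affect Earth Observation imagery. It presents an enhanced image simulator that generates physically realistic distorted images and, as a case study, evaluates YOLOv8 and RetinaNet under different levels of turbulence and pointing error.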

Comments Conference

详情
AI中文摘要

地球观测(EO)图像常常受到大气湍流和指向抖动的退化;然而,这些效应很少被考虑在用于训练基于AI的检测模型的数据集中。基于先前的工作,本文提出了一种增强的图像模拟器,能够将垂直路径的大气湍流和卫星指向抖动(源于平台和传感器振动)纳入其中,以生成物理上真实的失真图像。作为案例研究,使用YOLOv8和RetinaNet在由所提出模拟器生成的图像上评估船舶检测,结果表明,在理想条件下,YOLOv8的召回率从91%下降到弱湍流存在时的60%,在强湍流或抖动下低于40%。相比之下,RetinaNet表现出更大的鲁棒性,在退化条件下保持约75%的召回率。这些结果突显了在EO训练数据集中纳入真实物理退化的重要性,以确保AI模型在操作环境中的可靠性能,如在海上监控应用中所展示的那样。

英文摘要

Earth Observation (EO) imagery is often degraded by atmospheric turbulence and pointing jitter; yet, these effects are rarely considered in datasets used to train AI-based detection models. Based on prior work, this paper presents an enhanced image simulator that enables the incorporation of vertical-path atmospheric turbulence and satellite pointing jitter, arising from platform and sensor vibrations, to generate physically realistic distorted images. As a case study, vessel detection is evaluated using YOLOv8 and RetinaNet on images generated by the proposed simulator under different levels of turbulence and pointing errors. Results show that YOLOv8 recall decreases from 91% under ideal conditions to 60% in the presence of weak turbulence, and falls below 40% under strong turbulence or jitter. In contrast, RetinaNet demonstrates greater robustness, maintaining approximately 75% recall across degraded conditions. These results highlight the importance of incorporating realistic physical degradations into EO training datasets to ensure reliable performance of AI-based models in operational environments, as demonstrated in maritime surveillance applications.

2605.22207 2026-05-22 eess.SY cs.LG cs.SY

Kernel-Based Safe Exploration in Deep Reinforcement Learning

基于核的深度强化学习安全探索

Rupak Majumdar, Nikhil Singh, Sadegh Soudjani

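AI summary This paper proposes a kernel-based method for safe exploration in deep reinforcement learning. It learns an optimal policy and a barrier function simultaneously during exploration; the barrier bounds the probability of reaching unsafe states, and the probabilistic safety guarantees improve with more exploration.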

Comments Accepted at L4DC Conference (22 Jan 2026)

详情
AI中文摘要

安全性在将深度强化学习算法部署到现实世界时是一个主要关注点。一种有前景的方向是学习一个屏障函数,以确保学习的策略不会访问危险区域。屏障函数是从状态到实数的函数,它将初始状态赋予低值,将危险状态赋予高值,并在每次转移中减少期望值;这样的函数可用于限制到达危险状态的概率。以前的研究直接从探索数据中学习屏障函数,但需要大量数据或对系统动力学的限制。在本文中,我们展示了如何利用核嵌入来学习深度强化学习中随机系统的屏障函数。我们的算法,称为基于核的安全探索(KBSE),在探索过程中同时学习最优策略和屏障函数。屏障函数是通过迭代计算得到的,并以条件均值嵌入表示,随着探索的增加,它们提供更好的概率安全保证。探索算法使用学习到的屏障函数来识别安全违规。在发生违规时,它会干预,将危险动作改为安全动作,从而确保探索仅限于限制到达危险状态概率的动作。我们评估了KBSE在多个复杂的连续控制基准上的性能。实验结果表明,我们的新算法适用于合成概率安全的控制策略,而不会影响奖励的累积。

英文摘要

Safety has been a major concern when deploying deep reinforcement learning algorithms in the real world. A promising direction that ensures that the learned policy does not visit unsafe regions is to learn a barrier function along with the policy. A barrier is a function from states to reals that assigns low values to the initial states, high values to the unsafe states, and decreases in expectation on each transition; such a function can be used to bound the probability of reaching unsafe states. Previous attempts learned a barrier function directly from exploration data, but this required either large amounts of data or restrictions on the system dynamics. In this paper, we show how kernel embeddings can be used to learn barrier functions during deep reinforcement learning for stochastic systems with unknown dynamics. Our algorithm, kernel-based safe exploration (KBSE), learns an optimal policy and a barrier simultaneously during exploration. The barriers are computed iteratively, represented as conditional mean embeddings, and provide better probabilistic safety guarantees with more exploration. The exploration algorithm uses the learned barrier functions to identify safety violations. In the case of violation, it intervenes to modify the unsafe action to a safe action, thereby ensuring that the exploration is restricted to actions that bound the probability of reaching unsafe states. We evaluate KBSE on several complex continuous control benchmarks. Experimental results establish our new algorithm to be suitable for synthesizing control policies that are probabilistically safe without degradation in reward accumulation.
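
One standard way to turn the three barrier conditions into a probability bound is sketched below. The constants $\eta$ and $\gamma$ and the non-negativity of $B$ are assumptions of this sketch; the paper's exact conditions may differ.

```latex
B(x) \ge 0 \ \ \forall x, \qquad
B(x_0) \le \eta \ \ \text{for initial states } x_0, \qquad
B(x) \ge \gamma \ \ \text{for unsafe } x, \qquad
\mathbb{E}\big[B(x_{t+1}) \mid x_t\big] \le B(x_t).

% B(x_t) is then a non-negative supermartingale, so Ville's inequality gives
\Pr\big[\exists t:\ x_t \text{ unsafe}\big]
\le \Pr\Big[\sup_{t} B(x_t) \ge \gamma\Big]
\le \frac{B(x_0)}{\gamma}
\le \frac{\eta}{\gamma}.
```

A barrier that is small on initial states and large on unsafe states therefore yields a small bound $\eta / \gamma$ on the probability of ever reaching an unsafe state.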

2605.22206 2026-05-22 cs.NE cs.AI cs.RO

Temporal Coding as a Substrate for Sensorimotor Object Inference: A Spiking Reinterpretation of Thousand Brains Architecture

时间编码作为感觉运动物体推断的子基质:一种脉冲重解释的千脑架构

Joy Bose

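AI summary This work proposes replacing dense vectors with rank-order spike packets so that the order of sensor contacts is encoded, using an STDP learning rule and a learnable parameter lambda. Synthetic experiments show that temporal coding outperforms dense accumulation across different spatial arrangements and noise levels.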

Comments 18 pages, 5 figures

详情
AI中文摘要

千脑理论(TBT)及其开源的Monty框架通过感觉运动推断进行物体识别——通过主动移动传感器跨物体表面并逐接触建立证据。当前实现将每个接触编码为密集浮点向量。虽然Monty跟踪步间位移并跨接触积累证据,但其将每个接触的特征激活模式视为无序集合——特征遇到的顺序不具有表征意义。在TBT中,接触的顺序具有空间意义:知道在从左到右的扫过中特征A在特征B之前被感受到,可以告诉你A和B在物体上的位置。密集向量丢弃了这种顺序。我们提出用等级顺序脉冲包替代密集向量:每个接触产生一连串神经事件的短暂爆发,其中最强烈激活的神经元首先放电。连续爆发之间的时间间隔隐含地编码传感器位移,而无需显式坐标计算。一种生物启发的学习规则(STDP)将遍历方向编码到突触权重中。一个可学习的参数lambda调整对早期与近期接触的依赖程度,适应每个物体的几何形状。我们推导出三个可检验的预测,并指定了四个组件的大约450行NumPy实现。三个合成实验验证了核心主张:时间编码在具有相同特征但不同空间排列的物体上实现完美判别准确性,而密集积累在偶然情况下表现不佳;时间编码在所有测试噪声水平上保持30-50个百分点的优势;适应性的lambda收敛到不同的值,反映物体几何复杂性。对Monty的YCB基准的端到端评估留待未来工作。

英文摘要

The Thousand Brains Theory (TBT) and its open-source Monty framework model object recognition through sensorimotor inference -- identifying objects by actively moving a sensor across their surface and building evidence contact by contact. The current implementation encodes each contact as a dense floating-point vector. While Monty tracks inter-step displacement and accumulates evidence across contacts, it treats the feature activation pattern at each contact as an unordered set - the directional sequence in which features are encountered carries no representational weight. In TBT, the sequence of contacts carries spatial meaning: knowing that feature A was felt before feature B during a left-to-right sweep tells you something about where A and B sit on the object. Dense vectors discard this ordering. We propose replacing dense vectors with rank-order spike packets: each contact produces a brief burst of neural events where the most strongly activated neuron fires first. The time gap between successive bursts implicitly encodes sensor displacement without explicit coordinate calculations. A biologically motivated learning rule (STDP) encodes traversal direction into synaptic weights. A learnable parameter lambda adjusts reliance on earlier versus recent contacts, adapting to each object's geometry. We derive three testable predictions and specify an implementation of four components in approximately 450 lines of NumPy. Three synthetic experiments confirm the core claims: temporal coding achieves perfect discrimination accuracy on objects with identical features in different spatial arrangements, where dense accumulation performs at chance; temporal coding maintains a 30-50 percentage point advantage across all tested noise levels; the adaptive lambda converges to distinct values, reflecting object geometric complexity. End-to-end evaluation on Monty's YCB benchmark is left for future work.
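
Rank-order coding as described above can be sketched in a few lines of NumPy. This is an illustration of the encoding idea only, not the implementation specified in the paper; the function name, the packet size `k`, and the toy activation values are assumptions.

```python
import numpy as np


def rank_order_packet(activations, k):
    """Return the indices of the k most active neurons, strongest first.

    The position in the returned list is the firing order within the burst.
    """
    a = np.asarray(activations, dtype=float)
    order = np.argsort(-a, kind="stable")  # descending activation
    return [int(i) for i in order[:k]]


# One contact: neuron 1 is most active, then neuron 3, then neuron 2.
packet = rank_order_packet([0.1, 0.9, 0.5, 0.7], k=3)
print(packet)

# Two sweeps over the same two contacts in opposite directions.
sweep_left_to_right = [packet, rank_order_packet([0.8, 0.2, 0.1, 0.3], k=3)]
sweep_right_to_left = list(reversed(sweep_left_to_right))

# As unordered collections the sweeps are indistinguishable; as sequences they differ.
print(sorted(sweep_left_to_right) == sorted(sweep_right_to_left))
print(sweep_left_to_right == sweep_right_to_left)
```

The last two lines show the point made in the abstract: an order-agnostic representation cannot separate the two sweeps, while the ordered sequence of packets can.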

2605.22175 2026-05-22 cs.SE cs.AI

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

SWE-Mutation:LLMs能否在软件工程中生成可靠的测试套件?

Yuxuan Sun, Yuze Zhao, Yufeng Wang, Yao Du, Zhiyuan Ma, Jinbo Wang, Mengdi Zhang, Kai Zhang, Zhenya Huang

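AI summary This paper introduces SWE-Mutation, a benchmark for evaluating LLM-generated test suites using systematically mutated solutions, and finds that current LLMs fall short in generating reliable and discriminative test suites.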

Comments 24 pages, 8 figures

详情
Journal ref
ACL 2026 Findings
AI中文摘要

评估软件工程能力已成为现代大语言模型(LLMs)的核心组成部分;然而,进一步扩展的关键瓶颈不在于高质量解决方案的稀缺,而在于高质量测试套件的缺乏。测试套件对于合成程序修复轨迹和在强化学习中提供精确反馈信号至关重要。不幸的是,由于标注成本高且困难,高质量测试套件长期以来难以获得,而由LLM自动生成的测试套件往往肤浅且缺乏足够的判别力。作为构建高质量测试套件的第一步,我们介绍了SWE-Mutation,一个用于评估LLM生成测试套件的基准。该基准通过引入系统性变异的解决方案来表征测试套件,这些变异试图“欺骗”测试套件并通过验证。我们进一步提出了一种代理、语言无关的框架,用于自动生成复杂的变异体。我们的基准包含2,636个变异体,源自800个原始实例,并包含覆盖九种编程语言的多语言子集。对七种LLM的实验表明,即使DeepSeek-V3.1也仅达到10.20%的验证率和36.15%的检测率,突显了当前LLM的不足。此外,我们的代理变异策略增强了现实性,与传统方法相比,将平均检测率从71.04%降低到39.81%。这些发现揭示了当前LLM在生成可靠且具有判别力的测试套件方面存在的持续缺陷。

英文摘要

Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.
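
The idea of scoring a test suite by how many mutated solutions it catches can be shown with a toy example. Everything here is hypothetical (the reference function, the three hand-written mutants, and both suites), and the benchmark's exact metric definitions are not given in the abstract; the sketch only illustrates a detection rate as the fraction of mutants that fail at least one test.

```python
def reference(a, b):
    """Correct solution: the larger of two numbers."""
    return max(a, b)


# Hand-written mutants, each wrong in a different way.
MUTANTS = [
    lambda a, b: a,                                   # always returns the first argument
    lambda a, b: b,                                   # always returns the second argument
    lambda a, b: max(a, b) if a != b else a + 1,      # wrong only on ties
]

# A test suite is a list of (arguments, expected result) pairs.
WEAK_SUITE = [((2, 1), 2)]
STRONG_SUITE = [((2, 1), 2), ((1, 2), 2), ((3, 3), 3)]


def passes(impl, suite):
    """True if the implementation satisfies every test in the suite."""
    return all(impl(*args) == expected for args, expected in suite)


def detection_rate(mutants, suite):
    """Fraction of mutants that the suite rejects."""
    detected = sum(not passes(m, suite) for m in mutants)
    return detected / len(mutants)


print(detection_rate(MUTANTS, WEAK_SUITE))
print(detection_rate(MUTANTS, STRONG_SUITE))
```

Both suites accept the reference solution, but only the stronger one distinguishes it from all three mutants, which is the kind of discriminative power the benchmark is designed to measure.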

2605.22124 2026-05-22 stat.ML cs.LG math.PR

From Betting to Empirical Bernstein LIL

从赌局到经验伯恩斯坦LIL

Francesco Orabona

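AI summary This report derives a law of the iterated logarithm (an empirical Bernstein LIL, per the title) from the guarantee on the wealth of an online betting strategy.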

详情

英文摘要

This is a verbatim copy of a technical report I wrote in 2017-2018 to obtain the law of the iterated logarithm using the guarantee on the wealth of an online betting strategy.
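
For orientation, the classical law of the iterated logarithm for i.i.d. random variables $X_1, X_2, \dots$ with mean zero and variance $\sigma^2$ (the Hartman-Wintner form) reads as follows. This is the textbook statement only; the report's empirical Bernstein version is not reproduced here.

```latex
\limsup_{n \to \infty} \frac{S_n}{\sqrt{2 n \log\log n}} = \sigma
\quad \text{almost surely}, \qquad S_n = \sum_{i=1}^{n} X_i .
```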