arXivDaily: arXiv daily academic digest, updated Monday through Friday
2403.16825 2026-05-28 cs.LG math.OC math.PR stat.ML

Weak Convergence Analysis of Online Neural Actor-Critic Algorithms

Samuel Chun-Hei Lam, Justin Sirignano, Ziheng Wang

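AI summary: The paper proves that a single-layer neural network trained with the online actor-critic algorithm converges in distribution to a random ordinary differential equation (ODE) as the number of hidden units and the number of training steps tend to infinity, and it analyzes the limiting behavior using a Poisson equation and weak convergence techniques.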

Abstract

We prove that a single-layer neural network trained with the online actor-critic algorithm converges in distribution to a random ordinary differential equation (ODE) as the number of hidden units and the number of training steps $\rightarrow \infty$. In the online actor-critic algorithm, the distribution of the data samples dynamically changes as the model is updated, which is a key challenge for any convergence analysis. We establish the geometric ergodicity of the data samples under a fixed actor policy. Then, using a Poisson equation, we prove that the fluctuations of the model updates around the limit distribution due to the randomly arriving data samples vanish as the number of parameter updates $\rightarrow \infty$. Using the Poisson equation and weak convergence techniques, we prove that the actor neural network and critic neural network converge to the solutions of a system of ODEs with random initial conditions. Analysis of the limit ODE shows that the limit critic network will converge to the true value function, which will provide the actor with an asymptotically unbiased estimate of the policy gradient. We then prove that the limit actor network will converge to a stationary point.

2308.04823 2026-05-28 cs.CL

Evaluating the Generation Capabilities of Large Chinese Language Models

Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang

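AI summary: Proposes the CG-Eval automated evaluation framework and the Gscore composite index for evaluating the generative capabilities of large Chinese language models across multiple academic disciplines.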

Abstract

This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework designed for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines. CG-Eval stands out for its automated process, which critically assesses models based on their proficiency in generating precise and contextually relevant responses to a diverse array of questions within six key domains: Science and Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical Practitioner Qualification Examination, Judicial Examination, and Certified Public Accountant Examination. Alongside this, we introduce Gscore, an innovative composite index developed from a weighted sum of multiple metrics. Gscore uniquely automates the quality measurement of a model's text generation against reference standards, providing a detailed and nuanced assessment of model performance. This automation not only enhances the efficiency and scalability of the evaluation process but also ensures objective and consistent assessment across various models. The detailed test data and results, highlighting the robust capabilities and comparative performance of the evaluated models, are accessible at http://cgeval.besteasy.com/.
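
As a rough illustration of what a composite index built from a weighted sum of metrics can look like, the sketch below combines several per-answer scores into a single number. The metric names, values, and weights are hypothetical placeholders chosen for the example; they are not the metrics or weights that CG-Eval uses for Gscore.

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of metric values, with the weights normalized to sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


# Hypothetical metric values in [0, 1] for one generated answer.
example_metrics = {"bleu": 0.5, "rouge": 0.25, "semantic_similarity": 0.75}
# Hypothetical weights; a larger weight gives that metric more influence.
example_weights = {"bleu": 1.0, "rouge": 1.0, "semantic_similarity": 2.0}

score = composite_score(example_metrics, example_weights)
print(f"composite score: {score:.4f}")
```

Normalizing the weights keeps the composite on the same scale as the individual metrics, which makes scores comparable if the weighting is later changed.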

2308.07772 2026-05-28 cs.LG cs.AI

MOLE: MOdular Learning FramEwork via Mutual Information Maximization

Tianchao Li, Yulong Pei

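AI summary: Proposes MOLE, an asynchronous local learning framework that modularizes a network by layers and trains each module by mutual information maximization, giving gradient-isolated local optimization. It applies to vector-, grid-, and graph-type data and is validated on graph-level and node-level tasks.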

Comments accepted by icml llw

Journal ref
ICML Workshop on Localized Learning (LLW) 2023

Abstract

This paper introduces an asynchronous, local learning framework for neural networks, named the Modular Learning Framework (MOLE). The framework modularizes neural networks by layers, defines a training objective for each module via mutual information, and trains the modules sequentially by mutual information maximization. MOLE turns training into local optimization with gradients isolated across modules, a scheme that is more biologically plausible than backpropagation (BP). We run experiments on vector-, grid-, and graph-type data. In particular, the framework can solve both graph-level and node-level tasks for graph-type data. MOLE has thus been shown experimentally to apply across different types of data.

2304.12986 2026-05-28 cs.CL cs.AI

Measuring Massive Multitask Chinese Understanding

Hui Zeng

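AI summary: To address the lack of capability assessments for Chinese large language models, the paper proposes a multitask test covering four domains (medicine, law, psychology, and education), with 15 subtasks in medicine and 8 in education, and evaluates models by zero-shot accuracy. The best model leads the worst by nearly 18.6 percentage points on average, and all models perform worst in the legal domain.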

Abstract

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.

2305.06426 2026-05-28 cs.AI cs.SY eess.SY math.OC

Planning a Community Approach to Diabetes Care in Low- and Middle-Income Countries Using Optimization

Katherine B. Adams, Justin J. Boutilier, Sarang Deo, Yonatan Mintz

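AI summary: Proposes an optimization framework that personalizes community health worker visit plans to maximize glycemic control at the community level while balancing resources between screening new patients and managing enrolled patients.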

Comments 50 pages, 13 figures

Abstract

Diabetes is a global health priority, especially in low- and middle-income countries, where over 50% of premature deaths are attributed to high blood glucose. Community Health Worker (CHW) programs can provide affordable and culturally tailored solutions for early detection and management of diabetes. We introduce an optimization framework to determine personalized CHW visits that maximize glycemic control at a community level. Our framework explicitly models the trade-off between screening new patients and providing management visits to individuals who are enrolled in treatment. We account for patients' motivational states, which affect their decisions to enroll in or drop out of treatment and, therefore, the effectiveness of the intervention. By estimating patients' health and motivational states, our model builds visit plans that account for patients' trade-offs when deciding to enroll in treatment, leading to reduced dropout rates and improved resource allocation. We apply our approach to generate CHW visit plans using operational data from urban slums in India. We find that our approach can reduce fasting blood glucose by up to 25% with the same capacity as the best baseline method. Our experiments also demonstrate that our approach performs well with imperfect information.

2006.06049 2026-05-28 cs.LG stat.ML

On Mixup Regularization

Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, Jean-Philippe Vert

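AI summary: By interpreting Mixup as a combination of a data transformation and a random perturbation, the paper clarifies its regularization effects, motivates applying the data transformation at test time, and shows that the perturbation induces mechanisms such as label smoothing and a reduced Lipschitz constant.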

Journal ref
Journal of Machine Learning Research, 23(325):1-31, 2022

Abstract

Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has been shown empirically to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step in explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We gain two core insights from this new interpretation. First, the data transformation suggests that, at test time, a model trained with Mixup should also be applied to transformed data, a one-line change in code that we show empirically to improve both accuracy and calibration of the prediction. Second, we show how the random perturbation of the new interpretation of Mixup induces multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator. These schemes interact synergistically with each other, resulting in a self-calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We corroborate our theoretical analysis with experiments that support our conclusions.
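
The convex-combination step that defines Mixup can be sketched in a few lines. This is a minimal illustration under stated assumptions: the mixing weight is drawn from a Beta(alpha, alpha) distribution, as is common in Mixup setups, and the inputs, one-hot labels, and the value of `alpha` are made up for the example. It does not implement the test-time data transformation the paper proposes.

```python
import random


def mixup_pair(x1, y1, x2, y2, lam):
    """Convex combination of two (input, one-hot label) pairs with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def sample_lambda(alpha, rng):
    """Draw the mixing weight from Beta(alpha, alpha) (an assumption of this sketch)."""
    return rng.betavariate(alpha, alpha)


rng = random.Random(0)  # fixed seed so the example is reproducible
lam = sample_lambda(0.4, rng)

# Two hypothetical training points with one-hot labels for a 2-class problem.
x_mix, y_mix = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0], lam)
print(lam, x_mix, y_mix)
```

Because the labels are mixed with the same weight as the inputs, the mixed label is a soft target whose entries still sum to one.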

2206.15475 2026-05-28 cs.LG stat.ME

Causal Machine Learning: A Survey and Open Problems

Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, Ricardo Silva

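AI summary: Surveys five main directions of causal machine learning (CausalML): causal supervised learning, causal generative modeling, causal explanations, causal fairness, and causal reinforcement learning. It systematically compares the methods in each direction, points out open problems, and discusses applications in computer vision, natural language processing, and graph representation learning.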

Comments v03. Work in progress. Feedback and comments are highly appreciated!

Abstract

Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This perspective enables us to reason about the effects of changes to this process (interventions) and what would have happened in hindsight (counterfactuals). We categorize work in CausalML into five groups according to the problems they address: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, and (5) causal reinforcement learning. We systematically compare the methods in each category and point out open problems. Further, we review data-modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.

2112.09305 2026-05-28 cs.LG stat.ML

Gaussian RBF Centered Kernel Alignment (CKA) in the Large Bandwidth Limit

Sergio A. Alvarez

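AI summary: Proves that centered kernel alignment (CKA) based on a Gaussian RBF kernel converges to linear CKA in the large-bandwidth limit, and finds that the onset of convergence is sensitive to the geometry of the feature representations, with representation eccentricity bounding the range of bandwidths for which Gaussian CKA behaves nonlinearly.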

Comments 11 pages, 3 figures

Journal ref
IEEE TPAMI, vol. 45, issue 5, 01 May 2023, pages 6587-6593

Abstract

We prove that Centered Kernel Alignment (CKA) based on a Gaussian RBF kernel converges to linear CKA in the large-bandwidth limit. We show that convergence onset is sensitive to the geometry of the feature representations, and that representation eccentricity bounds the range of bandwidths for which Gaussian CKA behaves nonlinearly.
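
The large-bandwidth behavior can be checked numerically on a toy example. The sketch below computes CKA as the normalized inner product of double-centered Gram matrices, once with a linear kernel and once with a Gaussian RBF kernel. It assumes the parametrization $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, which may differ from the paper's convention, and the data points are invented for illustration.

```python
import math


def gram_linear(X):
    """Gram matrix of the linear kernel k(x, y) = <x, y>."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def gram_rbf(X, sigma):
    """Gram matrix of the Gaussian RBF kernel with bandwidth sigma."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [[math.exp(-sqdist(xi, xj) / (2.0 * sigma ** 2)) for xj in X] for xi in X]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def frob_inner(A, B):
    """Frobenius inner product of two matrices of equal shape."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def cka(K, L):
    """Centered kernel alignment between two Gram matrices."""
    Kc, Lc = center(K), center(L)
    return frob_inner(Kc, Lc) / math.sqrt(frob_inner(Kc, Kc) * frob_inner(Lc, Lc))


# Two hypothetical representations of the same five inputs.
X = [[0.0, 1.0], [1.0, 0.5], [2.0, -1.0], [3.0, 0.0], [-1.0, 2.0]]
Y = [[1.0], [0.0], [2.0], [5.0], [-3.0]]

linear_value = cka(gram_linear(X), gram_linear(Y))
rbf_wide = cka(gram_rbf(X, 1e4), gram_rbf(Y, 1e4))    # very large bandwidth
rbf_narrow = cka(gram_rbf(X, 0.5), gram_rbf(Y, 0.5))  # small bandwidth
print(linear_value, rbf_wide, rbf_narrow)
```

With the very large bandwidth the Gaussian value should agree closely with the linear one, while the small-bandwidth value is free to differ; how quickly the two meet as the bandwidth grows is the question the paper addresses.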

1901.03808 2026-05-28 cs.LG eess.SP stat.ML

ECGadv: Generating Adversarial Electrocardiogram to Misguide Arrhythmia Classification System

Huangxun Chen, Chenyu Huang, Qianyi Huang, Qian Zhang, Wei Wang

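AI summary: Targeting ECG diagnosis systems based on deep neural networks, the paper analyzes the properties of ECGs and designs adversarial attack schemes under two attack models, exposing blind spots in these systems and calling for countermeasures.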

Comments Accepted by AAAI 2020

Journal ref
Proceedings of the AAAI conference on artificial intelligence 2020

Abstract

Electrocardiogram (ECG) diagnosis systems powered by deep neural networks (DNNs) have recently made promising progress toward taking over tedious examinations from cardiologists. However, their vulnerability to adversarial attacks still lacks comprehensive investigation. Existing attacks from the image domain are not directly applicable because ECGs differ in how they are visualized and in their dynamic properties. This paper therefore takes a step toward thoroughly exploring adversarial attacks on DNN-powered ECG diagnosis systems. We analyze the properties of ECGs to design effective attack schemes under two attack models. Our results demonstrate the blind spots of DNN-powered diagnosis systems under adversarial attacks, which calls for adequate countermeasures.

2605.28787 2026-05-28 cs.IR cs.AI

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

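AI summary: Compares a Baseline Agent (searching the open web) with a Semantic Agent (using schema.org metadata) on data retrieval. Semantic metadata yields higher precision when retrieving actionable data (65.7% higher overall precision), while the Baseline Agent achieves broader coverage but suffers "Last-Mile Utility" failures.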

Abstract

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.

2605.28729 2026-05-28 stat.ML cs.LG

Beyond Lipschitz: Data-Driven Robustness via Discrete Modulus of Continuity

Jürgen Dölz, Michael Multerer, Michele Palma

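AI summary: Proposes a data-driven robustness framework based on the discrete modulus of continuity (DMOC), a nonlinear generalization of Lipschitz continuity, together with a scalable minibatch algorithm, enabling fine-grained robustness assessment relative to the data distribution.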

Abstract

Robustness of neural networks is commonly quantified via local or global Lipschitz constants. However, Lipschitz continuity can be overly coarse or overly restrictive as a global robustness measure, failing to capture nuanced, data-dependent behavior. We propose a data-driven, architecture-agnostic framework based on the discrete modulus of continuity (DMOC), a nonlinear generalization of Lipschitz continuity that provides a finer notion of robustness. Unlike many existing approaches, DMOC does not require access to model internals and instead evaluates regularity relative to the data distribution. This shifts the focus from the model to the data, which provide a data-driven baseline of regularity against which the network's robustness is assessed. We establish convergence results for DMOC-induced seminorms with explicit data-driven rates in terms of the separation distance, and introduce a scalable minibatch algorithm that reduces the quadratic cost of exact computation, enabling application to large-scale data sets such as ImageNet. Empirically, DMOC serves as an architecture-independent diagnostic: it distinguishes trained from untrained networks, reveals underfitting and overfitting regimes, and yields, as a special case, tight Lipschitz estimates comparable to state-of-the-art methods such as ECLipsE and ECLipsE-fast.
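
To make the idea of a modulus of continuity evaluated on data concrete, the sketch below takes the classical definition, $\omega_f(\delta) = \sup\{\|f(x) - f(y)\| : \|x - y\| \le \delta\}$, and restricts the supremum to pairs of sample points. This is an assumption made for illustration: the paper's exact DMOC definition and its minibatch algorithm may differ, and `toy_model` is a hypothetical stand-in for a network that is only queried through forward evaluations.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def discrete_modulus(points, outputs, delta):
    """Largest output gap over all sample pairs whose inputs are within delta.

    Brute force over all pairs, so the cost is quadratic in the sample count.
    """
    worst = 0.0
    for i, j in combinations(range(len(points)), 2):
        if euclidean(points[i], points[j]) <= delta:
            worst = max(worst, euclidean(outputs[i], outputs[j]))
    return worst


def toy_model(x):
    # Hypothetical black-box model: only its outputs are used, never its internals.
    return [abs(x[0])]


samples = [[-2.0], [-1.0], [0.0], [0.5], [2.0]]
values = [toy_model(x) for x in samples]

# Evaluate the modulus at several radii to obtain a curve rather than one constant.
curve = [(d, discrete_modulus(samples, values, d)) for d in (0.5, 1.0, 2.0, 4.0)]
print(curve)
```

The resulting curve is nondecreasing in the radius, and its shape carries more information than a single Lipschitz constant, which would only bound the curve by a straight line through the origin.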

2605.28703 2026-05-28 cs.NE cs.AI cs.DS math.OC

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

Inès Benito, Johannes F. Lutzeyer, Benjamin Doerr

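AI summary: Through experiments and theoretical analysis, the paper compares Lamarckian, Baldwinian, and Darwinian evolution on Maximum Independent Set and Maximum Cut problems, shows that evolutionary algorithms augmented with local search (Baldwinian evolution in particular) consistently outperform Darwinian evolution, and proves runtime bounds on the Deceptive Leading Block benchmark.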

Comments To appear in the proceedings of PPSN 2026

Abstract

Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with the generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolution consistently outperform Darwinian evolution, confirming the great potential of local-search-augmented evolutionary algorithms. Notably, in the great majority of cases, all EAs outperform recent deep learning baselines and approach the performance of highly specialised heuristic and exact solvers. We furthermore report a high-performing set of generalist parameters for all studied evolution types that we hope will be of use to practitioners in the future. On the theoretical side, we extend the existing Deceptive Leading Block benchmark to arbitrary block length and use tools from modern theoretical runtime analysis to prove upper and lower bounds on the expected runtime. For block lengths greater than two, Baldwinian evolution is asymptotically faster than Lamarckian evolution, which in turn is asymptotically faster than Darwinian evolution. When accounting for the cost of the local search procedure in fitness evaluations, the ordering depends on the implementation, with Baldwinian evolution staying fastest from small block lengths onwards, explaining its strong empirical performance.
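
The difference between the three evolution types lies in how local search interacts with the genotype during fitness evaluation. The sketch below shows one common way to express it, using OneMax (the number of ones in a bit string) as a toy fitness function and a greedy bit-flip hill climber as the local search. Both are illustrative assumptions; the paper itself studies Maximum Independent Set, Maximum Cut, and the Deceptive Leading Block benchmark.

```python
def onemax(bits):
    """Toy fitness: number of ones in the bit string."""
    return sum(bits)


def local_search(bits, fitness):
    """Greedy one-bit-flip hill climbing: accept the first improving flip, repeat."""
    current = list(bits)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            candidate = current[:]
            candidate[i] ^= 1
            if fitness(candidate) > fitness(current):
                current, improved = candidate, True
                break
    return current


def evaluate(genotype, fitness, mode):
    """Return (genotype kept in the population, fitness used for selection)."""
    if mode == "darwinian":
        # No local search: the genotype is judged as it is.
        return list(genotype), fitness(genotype)
    improved = local_search(genotype, fitness)
    if mode == "baldwinian":
        # The learned fitness guides selection, but the genotype is not rewritten.
        return list(genotype), fitness(improved)
    if mode == "lamarckian":
        # The locally improved solution is written back into the genotype.
        return improved, fitness(improved)
    raise ValueError(f"unknown mode: {mode}")


parent = [0, 1, 0, 0]
results = {m: evaluate(parent, onemax, m) for m in ("darwinian", "baldwinian", "lamarckian")}
for mode, (kept, score) in results.items():
    print(mode, kept, score)
```

In this toy setting the Baldwinian and Lamarckian variants report the same fitness, but only the Lamarckian one changes what is inherited, which is the distinction the runtime analysis in the paper builds on.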

2605.28697 2026-05-28 eess.IV cs.AI cs.CV

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

Thierry Judge, Nicolas Duchateau, Andreas Østvik, Khuram Faraz, Anders Austlid Taskén, Sigve Karlsen, Thor Edvardsen, Harald Brunvand, Md Abulkalam Azad, Havard Dalen, Bjørnar Grenne, Gabriel Kiss, Pierre-Yves Courand, Lasse Lovstakken, Pierre-Marc Jodoin, Olivier Bernard

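AI summary: To address the lack of reliable motion references for strain estimation in echocardiography, the paper proposes a simulation strategy that combines speckle decorrelation measures from real videos with an iterative refinement process, producing a realistic dataset for training motion estimation algorithms and achieving performance on global and regional strain that surpasses the clinical reference.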

Comments 10 pages

Abstract

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.
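
For readers unfamiliar with strain as a quantity, the sketch below shows how a longitudinal strain value can be derived once myocardial points have been tracked between two frames, using the standard Lagrangian definition of relative length change. The coordinates are hypothetical, and the sketch says nothing about how the paper's network estimates motion; it only illustrates the step from tracked points to a strain percentage.

```python
import math


def contour_length(points):
    """Length of the polyline through the tracked myocardial points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def longitudinal_strain(reference_points, current_points):
    """Lagrangian strain in percent: length change relative to the reference frame."""
    l0 = contour_length(reference_points)
    if l0 == 0:
        raise ValueError("reference contour has zero length")
    return 100.0 * (contour_length(current_points) - l0) / l0


# Hypothetical tracked points (in mm) at end-diastole and end-systole.
end_diastole = [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)]
end_systole = [(0.0, 0.0), (0.0, 4.0), (0.0, 8.0)]

strain = longitudinal_strain(end_diastole, end_systole)
print(f"longitudinal strain: {strain:.1f}%")
```

Shortening of the contour gives a negative value, which is why longitudinal strain is usually reported as a negative percentage. Restricting the contour to one wall segment would give a regional rather than global value.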

2605.28693 2026-05-28 q-bio.NC cs.AI

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Joséphine Raugel, Maximilian Seitzer, Marc Szafraniec, Huy V. Vo, Jérémy Rapin, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean-Rémi King

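AI summary: Using fMRI and MEG recordings of human brain responses to natural images, the study finds that backpropagated gradients of pretrained models predict signals in higher-level visual cortex and at later latencies, but that their spatial and temporal organization does not match the brain's hierarchy, suggesting that deep networks and the brain may rely on different learning mechanisms.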

Comments 13 pages, 9 figures

Abstract

Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward activations to map backpropagated gradients onto neural data. Focusing on a recent self-supervised vision model (DINOv3) and reproducing results on eight vision models, we find that backpropagated gradients can reliably predict both fMRI and MEG signals, specifically in higher-level visual cortex and for later latencies. However, the spatial and temporal organization of these backpropagated gradients in the brain diverges from the patterns expected under a biologically plausible backpropagation mechanism: specifically, both the order in which gradients are computed and their spatial organization diverge from the temporal and spatial hierarchies of the human brain. Together, these results suggest that, although deep networks and the brain may share similar representational content, they likely rely on fundamentally different mechanisms to learn those representations.

2605.28680 2026-05-28 cs.HC cs.AI cs.CY

AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness

Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian

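AI summary: Based on interviews with 24 employees from the IT, service, and healthcare sectors, the study examines the perceived impact of AI on job satisfaction and finds that expectations about how AI will change job decency and meaningfulness differ across occupational domains, which in turn shapes overall satisfaction.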

Comments Accepted to CSCW 2026 / Proceedings of the ACM on Human-Computer Interaction (PACMHCI)

Abstract

The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about its experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differing perceptions of its underlying decency and meaningfulness. For instance, IT and healthcare workers anticipate increased satisfaction with decency aspects like working hours but decreased satisfaction with meaningfulness aspects like social image, due to misconceptions about AI handling most of their tasks. Conversely, service workers foresee no improvement in their working hours but a higher social standing due to the perceived status boost associated with working with AI.

2605.28645 2026-05-28 cs.CR cs.CL

GraphSteal: Structural Knowledge Stealing from Graph RAG via Traversal Reconstruction

Jinze Gu, Qinghua Mao, Xi Lin, Jun Wu

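AI summary: Proposes a structure-oriented reconstruction framework that uses Depth-Wise Heuristic Search and Breadth-Wise Diffusion Search to recover the hidden knowledge graph from black-box Graph RAG systems with high fidelity, revealing sensitive entities, relations, and structural dependencies.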

Abstract

Retrieval-Augmented Generation (RAG) enhances LLMs by grounding generation in query-relevant external evidence. Beyond unstructured text corpora, Graph RAG integrates knowledge graphs into the retrieval pipeline, enabling LLMs to access entities, relations, and multi-hop dependencies encoded in structured knowledge. However, the same structured knowledge that empowers Graph RAG also creates a new privacy attack surface. We demonstrate that Graph RAG systems can be turned into structural oracles: through adaptive black-box interactions, an adversary can elicit sufficient relational evidence to reconstruct substantial portions of the hidden knowledge graph. We propose a structure-oriented reconstruction framework that recovers targeted graphs from both local and global perspectives. Specifically, Depth-Wise Heuristic Search extracts fine-grained node attributes by recursively expanding entity-centered evidence, while Breadth-Wise Diffusion Search infers graph topology by propagating across relation-induced neighborhoods. Experiments on generic and healthcare scenarios demonstrate that our method can recover over 90% of the original knowledge graph from representative Graph RAG systems, revealing sensitive entities, relations, and structural dependencies with high fidelity. Existing guardrails provide limited defense against our attack, highlighting the inherent difficulty of safeguarding structural privacy in Graph RAG pipelines.

2605.28632 2026-05-28 cs.CR cs.AI

Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

Ziyang You, Huilong He, Xiaoke Yang, Xuxing Lu

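AI summary: Proposes SeedHijack, a blind supply-chain attack on LLM watermarking that replaces the pseudo-random number generator (PRNG) while preserving watermark integrity and evading detection.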

Comments Preprint prepared for submission to IEEE TIFS. 12 pages, 8 figures

Abstract

Cryptographic watermarking is a leading defense for attributing text generated by large language models (LLMs). Existing schemes, including KGW, Unigram, and DipMark, derive their security guarantees from the assumption that the underlying pseudo-random number generator (PRNG) is trustworthy. This work introduces SeedHijack, the first supply-chain attack on LLM watermarking that is simultaneously (i) blind -- requiring no knowledge of the watermark key, detector, or model logits, (ii) integrity-preserving -- amplifying rather than erasing the watermark signal, and (iii) orthogonal to detection -- the attack-induced bias is statistically independent of all content-side detector statistics, ensuring that amplification and evasion coexist without trade-off. Rather than perturbing generated text, SeedHijack replaces the PRNG at the supply-chain layer, biasing green-list selection without altering output tokens or degrading text quality. Across three watermarking schemes and three open-source LLMs, the attack triggers 0/6 state-of-the-art content-side statistical detectors while inflating the watermark z-score up to 2.42x (system-level defenses such as entropy-source attestation remain orthogonal and complementary). A quantum random number generator (QRNG) countermeasure is shown to fully neutralize the attack while preserving benign watermarking utility. These findings establish PRNG integrity as a first-class security requirement for cryptographic content-provenance systems.
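
For context on the watermark z-score mentioned above, green-list schemes in the style of KGW commonly test whether the number of green-list tokens exceeds what chance would produce, using a one-proportion z-test. The sketch below computes that statistic under this assumption. The token counts and the green-list fraction `gamma` are invented for the example, and other schemes named in the abstract may use a different test.

```python
import math


def watermark_z_score(green_count, total_tokens, gamma):
    """One-proportion z-test for the number of green-list tokens.

    gamma is the expected fraction of green tokens in unwatermarked text.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# Hypothetical text: 70 of 100 scored tokens fall in the green list, gamma = 0.5.
z = watermark_z_score(green_count=70, total_tokens=100, gamma=0.5)
print(f"z-score: {z:.2f}")
```

A larger z-score means stronger evidence of a watermark, so an attack that inflates this value makes the watermark look stronger instead of erasing it.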

2605.28613 2026-05-28 math.OC cs.LG stat.ML

Implicit Regularization in Perturbed Deep Matrix Factorization: Spectral Conditions and Stability

扰动深度矩阵分解中的隐式正则化:谱条件与稳定性

Jingzhe Wang, Hung-Hsu Chou

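AI summary This paper studies the stability of low-rank implicit regularization in perturbed deep matrix factorization: it derives spectral conditions that characterize the low-rank phase in the noiseless setting, then proves that gradient descent converges under perturbation and that the low-rank phase persists.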

详情
AI中文摘要

本文研究了扰动深度矩阵分解中低秩隐式正则化的稳定性,其中目标矩阵被噪声矩阵破坏。我们首先推导了充分的谱条件,使得梯度下降在无噪声情况下表现出低秩阶段。这些条件展示了目标谱、初始化和步长如何共同决定非空低秩区间的存在性。然后我们分析了扰动的梯度下降动力学,证明了收敛保证,并量化了扰动如何影响迭代复杂度和特征值恢复。最后,我们表明低秩阶段在扰动下仍然存在,且与扰动大小有显式依赖关系。数值实验支持了理论发现。

英文摘要

This paper studies the stability of low-rank implicit regularization in perturbed deep matrix factorization, where the target matrix is corrupted by a noise matrix. We first derive sufficient spectral conditions under which gradient descent exhibits a low-rank phase in the noiseless setting. These conditions show how the target spectrum, initialization, and step size jointly determine the existence of a nonempty low-rank interval. We then analyze the perturbed gradient descent dynamics, proving convergence guarantees and quantifying how the perturbation affects iteration complexity and eigenvalue recovery. Finally, we show that the low-rank phase persists under perturbation, with explicit dependence on the perturbation size. Numerical experiments support the theoretical findings.

2605.28597 2026-05-28 cs.CR cs.AI cs.LG

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

立场:淘汰“正向后门”标签——秘密对齐需要严格且系统的评估

Jianwei Li, Jung-Eun Kim

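AI summary This paper argues for retiring the "positive backdoor" label and treating trigger-activated hidden behaviors as Secret Alignment; evaluating three representative applications across six core properties exposes their brittleness and motivates a call for rigorous evaluation.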

Comments ICML 2026

详情
AI中文摘要

这篇立场论文认为,AI/ML社区应停止过度宣称并淘汰“正向后门”标签,而应将触发激活的隐藏行为视为秘密对齐。关键在于,基于秘密对齐的保护性主张在缺乏严格、标准化评估的情况下,默认不应被视为安全。私有AI时代,通过开放权重的LLM和可访问的训练/推理栈,语言模型成为私有数字资产,产生了关于未授权访问、模型盗窃和行为滥用的安全问题。最近,一系列被称为“正向后门”的工作被提出以应对这些挑战。为将我们的立场建立在证据基础上,我们将这些提议统一为用于访问门控、所有权归属和安全执行的隐蔽触发-行为关联,并评估了三个代表性应用在六个核心属性上的表现:有效性、无害性、持久性、效率、鲁棒性和可靠性。我们的结果揭示了触发-行为映射的显著脆弱性——尤其是在机密性、完整性和可用性(CIA)方面——这些往往被现有声称低估。我们进一步将这些结果与行为密度和决策复杂性联系起来,提供了一个理解部署时风险的行为视角,并激励社区范围内的评估,使秘密对齐主张可证明。

英文摘要

This position paper argues that the AI/ML community should stop overclaiming and retire the label "positive backdoor," and instead treat trigger-activated hidden behaviors as Secret Alignment. Crucially, protective claims based on Secret Alignment should be presumed not secure by default unless supported by rigorous, standardized evaluation. The Private AI era, enabled by open-weight LLMs and accessible training/inference stacks, turns language models into privately owned digital assets, creating security concerns around unauthorized access, model theft, and behavioral misuse. Recently, a line of work framed as "positive backdoors" has been proposed to address these challenges. To ground our position in evidence, we unify these proposals as covert trigger-behavior associations for access gating, ownership attribution, and safety enforcement, and evaluate three representative applications across six core properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. Our results reveal substantial brittleness of trigger-behavior mappings, especially with respect to confidentiality, integrity, and availability (CIA), that existing claims often underrepresent. We further relate these outcomes to behavior density and decision complexity, offering a behavioral lens for understanding deployment-time risks and motivating community-wide evaluation that makes Secret Alignment claims provable.

2605.28596 2026-05-28 astro-ph.CO cs.LG

Dark Quest II: A Wide-Coverage Neural Network Emulator of the Nonlinear Matter Power Spectrum Across Extended Cosmologies

黑暗探索 II:扩展宇宙学下非线性物质功率谱的广覆盖神经网络仿真器

Satoshi Tanaka, Takahiro Nishimichi, Yosuke Kobayashi

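AI summary Presents DarkEmulator2, a neural network emulator that predicts the nonlinear matter power spectrum with subpercent accuracy in a nine-dimensional w0waνoCDM parameter space by training jointly on multi-resolution simulations.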

Comments 53 pages, 44 figures, emulator code available at https://github.com/DarkQuestCosmology/dark_emulator2_public

详情
AI中文摘要

\textsc{DarkEmulator2} 是一个神经网络仿真器,用于在九维 $w_0 w_a \nu o \mathrm{CDM}$ 参数空间中模拟非线性物质功率谱,作为 \textsc{Dark Quest II} (DQ2) 程序的仿真器组件开发。它基于 \textsc{Ginkaku} 代码生成的模拟进行训练,该代码的数值实现、精度测试和后处理流程在配套论文中描述。设计遵循统一策略:除了宇宙学参数向量外,我们还向神经网络的输入补充了三类物理动机的辅助量——线性物质功率谱、模拟分辨率描述符以及初始高斯随机场的低维总结——这些预计将改善跨参数空间的泛化能力。跨三个模拟分辨率层级联合训练单个网络,使得仿真器能够利用少量高分辨率模拟,同时保留来自较低分辨率模拟的广泛覆盖。对于 $L_{\mathrm{box}}=1\,h^{-1}\mathrm{Gpc}$ 盒子,粒子数为 $N=3000^{3}$,仿真器在粒子奈奎斯特尺度 $k_{\mathrm{Ny}}\simeq 10\,h\,\mathrm{Mpc}^{-1}$ 以内以亚百分比精度再现模拟物质功率谱。仿真器在校准波数范围内保持准确,而其最高 $k$ 预测取决于模拟分辨率和散粒噪声。我们在独立测试集上验证仿真器,并通过与多个公开仿真器和广泛使用的拟合公式的交叉比较,表征模型间一致性及其残差中参数依赖的趋势。

英文摘要

\textsc{DarkEmulator2} is a neural network emulator of the nonlinear matter power spectrum in a nine-dimensional $w_0 w_a \nu o \mathrm{CDM}$ parameter space, developed as the emulator component of the \textsc{Dark Quest II} (DQ2) program. It is trained on simulations generated with the \textsc{Ginkaku} code, whose numerical implementation, accuracy tests, and post-processing pipeline are described in the companion paper. The design follows a unified strategy: in addition to the cosmological parameter vector, we supplement the neural network's inputs with three families of physically motivated auxiliary quantities -- the linear matter power spectrum, descriptors of the simulation resolution, and a low-dimensional summary of the initial Gaussian random field -- that are expected to improve generalization across the parameter space. Training a single network jointly across three simulation resolution tiers allows the emulator to exploit a small number of high-resolution simulations while retaining broad coverage from lower-resolution simulations. For an $L_{\mathrm{box}}=1\,h^{-1}\mathrm{Gpc}$ box with $N=3000^{3}$ particles, the emulator reproduces the simulated matter power spectrum to subpercent accuracy up to the particle Nyquist scale, $k_{\mathrm{Ny}}\simeq 10\,h\,\mathrm{Mpc}^{-1}$. The emulator remains accurate over the calibrated wavenumber range, while its highest-$k$ predictions depend on the simulation resolution and shot noise. We validate the emulator on independent test suites and, through a cross-comparison with several public emulators and widely used fitting formulas, characterize the inter-model consistency and the parameter-dependent trends in their residuals.

2605.28594 2026-05-28 cond-mat.stat-mech cs.AI physics.comp-ph

Thermodynamic properties of chemically disordered compounds via AI-driven estimation of partition function with the PULSE method

通过PULSE方法基于AI驱动配分函数估计的化学无序化合物热力学性质

Baptiste Bernard, Luca Messina, Eiji Kawasaki, Emeric Bourasseau

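AI summary Proposes an improved PULSE method that samples and estimates the partition function via unsupervised learning to compute thermodynamic properties of chemically disordered compounds efficiently and at low cost, and validates its precision and efficiency on the 2D Ising model.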

Comments 13 pages, 11 figures, submitted to Physical Chemistry Chemical Physics

详情
AI中文摘要

在本文中,我们提出了PULSE方法(配分函数无监督学习采样与评估)的改进版本,用于估计化学无序化合物的热力学性质。目的是降低这类材料蒙特卡罗方法的计算成本,并证明这种生成工具可以通过采样和估计系统的配分函数来估计热力学性质。为了验证这种创新方法,我们使用2D Ising模型作为基准。我们证明,与传统蒙特卡罗采样方法相比,我们的方法能够以高精度和效率准确再现平均性质。我们的结果突出了PULSE方法的效率和适应性,使其成为研究那些传统方法因化学无序影响而过于低效、无法低成本计算性质的材料的有价值工具。

英文摘要

In this article, we present an improved version of the PULSE method (Partition function Unsupervised Learning Sampling and Evaluation) for estimating the thermodynamic properties of chemically disordered compounds. The aim is to reduce the computational cost of Monte Carlo approaches for this type of material and to demonstrate that this generative tool can estimate thermodynamic properties by sampling and estimating the partition function of the system. To validate this innovative approach, we use the 2D Ising model as a benchmark. We demonstrate that our method accurately reproduces average properties with high precision and efficiency compared to traditional Monte Carlo sampling methods. Our results highlight the efficiency and adaptability of the PULSE method, making it a valuable tool for studying materials for which conventional methods are too inefficient to compute properties affected by chemical disorder at low cost.
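To make concrete what "estimating the partition function" means on the 2D Ising benchmark, the sketch below computes Z and the mean energy exactly by brute-force enumeration on a tiny periodic lattice. This is the kind of reference value a sampling-based estimate can be checked against; it is not the PULSE method, and the function names, the 3x3 lattice size, and the coupling value are assumptions chosen for illustration.

```python
import itertools
import math


def ising_energy(spins, size, coupling=1.0):
    """Energy of one configuration on a size x size periodic lattice.

    Each site is paired with its right and down neighbours, so every bond is
    counted exactly once (this requires size >= 3 with periodic wrap-around).
    """
    energy = 0.0
    for row in range(size):
        for col in range(size):
            s = spins[row * size + col]
            right = spins[row * size + (col + 1) % size]
            down = spins[((row + 1) % size) * size + col]
            energy -= coupling * s * (right + down)
    return energy


def exact_partition_function(size, beta, coupling=1.0):
    """Return (Z, mean energy) by summing over all 2**(size*size) configurations."""
    z = 0.0
    weighted_energy = 0.0
    for spins in itertools.product((-1, 1), repeat=size * size):
        e = ising_energy(spins, size, coupling)
        w = math.exp(-beta * e)  # Boltzmann weight of this configuration
        z += w
        weighted_energy += e * w
    return z, weighted_energy / z


z, mean_energy = exact_partition_function(size=3, beta=0.4)
print(f"Z = {z:.4f}, <E> = {mean_energy:.4f}, F = {-math.log(z) / 0.4:.4f}")
```

Exact enumeration scales as 2^(N) in the number of sites, which is why sampling or learned estimators are needed for realistic system sizes.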

2605.28588 2026-05-28 cs.CR cs.AI

Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

技术报告:探索智能体技能生态系统的新兴威胁

Luca Beurer-Kellner, Aleksei Kudrinskii, Marco Milanta, Kristian Bonde Nielsen, Hemang Sarkar, Liran Tal

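AI summary Analyzing 3,984 AI agent skills, this study finds 76 malicious payloads, exposes security threats in the skill ecosystem, and presents a threat taxonomy and the observed attack patterns.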

Comments 10 pages, technical report

详情
AI中文摘要

我们分析了来自主要市场的3,984个AI智能体技能,发现了76个确认的恶意载荷,包括凭证窃取、后门安装和数据泄露。13.4%的技能至少包含一个关键级别的安全问题,截至发表之日,至少有8个手动确认的恶意技能仍在clawhub.ai上公开可用。本报告记录了我们的方法论,基于真实样本提出了威胁分类,并详细描述了观察到的攻击模式。随着技能市场快速增长,AI智能体获得敏感凭证和系统的访问权限,自动化安全分析不再是可选项。

英文摘要

We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration. 13.4% of all skills contain at least one critical-level security issue and at least 8 manually confirmed malicious skills remain publicly available on clawhub.ai as of the date of publication. This report documents our methodology, presents a threat taxonomy based on real-world samples, and details the attack patterns we observed. As skill marketplaces grow rapidly and AI agents gain access to sensitive credentials and systems, automated security analysis is no longer optional.

2605.28565 2026-05-28 cs.DL cs.AI cs.CL cs.IR

Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

验证性误导:衡量搜索增强型大语言模型中的结构性引用失败

Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang, Dongha Lee

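AI summary To address citation trustworthiness in search-augmented LLMs, the paper introduces the CITETRACE dataset and a three-dimension evaluation framework, and identifies a systematic "Verified Misguidance" pattern: models cite real, accessible sources yet fall short on intent alignment, source suitability, or answer-source fidelity, exposing users to structurally misleading citations.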

Comments Work in progress

详情
AI中文摘要

搜索增强型大语言模型的用户依赖引用作为回答基于真实来源的证据,但很少自行验证引用的页面。每天数百万次查询通过这些系统,使得引用质量成为用户是被告知还是被误导的无声决定因素——然而现有基准各自孤立地处理一个方面,导致决定引用可信度的联合结构未被衡量。我们构建了CITETRACE,一个大规模数据集,追踪从用户查询到检索来源再到生成答案的完整引用链:来自28个社区的11,200个真实世界查询,与来自五个提供商的十个模型的112,000个回答配对,产生761,495个可评估的引用对。我们设计了一个三维评估框架,使用专家验证的预定义矩阵和五级忠实度标准,对每个引用在意图-目的对齐、来源适宜性和答案-来源忠实度上进行评分;该框架适用于任何产生带引用回答的系统。大规模应用该框架,我们识别出一种系统性的模式,称为验证性误导(VM):模型引用真实、可访问的来源,但在一个或多个维度上失败,产生忠实度-适宜性权衡,其中忠实模型选择不合适的来源,反之亦然。在我们的池中,30.6%的引用扭曲了其来源,27.1%的引用源自领域不合适的来源;在回答层面,高达96%的用户至少遇到一个结构性误导的引用。提供商层面的差异解释了88-96%的引用质量方差,表明来源选择更多受超出单个模型能力的因素控制,而非LLM本身。总之,CITETRACE及其评估框架为诊断部署的搜索增强系统中的结构性引用失败提供了首个资源。

英文摘要

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.

2605.28557 2026-05-28 cs.LO cs.AI

Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration

基于LLM的Oracle到PostgreSQL迁移的Token优化策略

Oleg Grynets, Dmytro Babarytskyi, Vasyl Lyashkevych

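AI summary This paper formalizes and evaluates twelve token optimization strategies for Oracle-to-PostgreSQL migration, balancing cost, syntactic validity, semantic preservation, and structural fidelity.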

Comments 11 pages, 3 figures, 5 tables, 38 references

详情
AI中文摘要

LLM越来越多地用于软件现代化、代码翻译和数据库迁移。然而,基于LLM的Oracle2PostgreSQL迁移仍然受到高Token消耗、长上下文退化、方言特定的语义差异以及查询转换过程中语义漂移风险的限制。将大型Oracle SQL/PL-SQL工件、模式定义、过程逻辑和迁移指令直接包含到模型上下文中会增加成本并可能降低生成质量。本文将Token优化视为基于LLM的Oracle2PostgreSQL迁移中的一个约束转换问题。研究形式化并评估了十二种Token优化策略:基线表示、上下文剪枝、最小化、基于DSL的语义压缩、元数据增强、上下文重构、模式蒸馏、自适应路由、基于AST的最小化、标识符掩码、输出约束强制和混合优化。这些策略在10和100个Oracle SQL查询样本上使用有效语法率、精确匹配、语义匹配、CodeBLEU和Token效率进行评估。结果表明,轻度上下文剪枝几乎保持了基线水平的语义质量,在100个查询样本上实现了89.75%的语义匹配,而未优化基线为89.80%。自适应路由提供了最佳的实际权衡,输入Token减少8.72%,输出Token减少5.49%,同时保持88.40%的语义匹配,并将Token效率提高6.67%。激进的模式蒸馏将Token效率提高了132.22%,但导致语义匹配下降44.50个百分点。研究结果表明,Token优化不能简单地视为提示缩短;它必须作为一个多目标迁移问题来评估,平衡成本、语法有效性、语义保持和结构保真度。

英文摘要

LLMs are increasingly used for software modernization, code translation, and database migration. However, LLM-based Oracle2PostgreSQL migration remains constrained by high token consumption, long-context degradation, dialect-specific semantic differences, and the risk of semantic drift during query transformation. Direct inclusion of large Oracle SQL/PL-SQL artefacts, schema definitions, procedural logic, and migration instructions into the model context increases cost and may reduce generation quality. This paper treats token optimization as a constrained transformation problem in LLM-based Oracle2PostgreSQL migration. The study formalizes and evaluates twelve token optimization strategies: baseline representation, context pruning, minification, DSL-based semantic compression, metadata augmentation, context refactoring, schema distillation, adaptive routing, AST-based minification, identifier masking, output constraint enforcement, and hybrid optimization. The strategies are evaluated on samples of 10 and 100 Oracle SQL queries using Valid Syntax Rate, Exact Match, Semantic Match, CodeBLEU, and Token Efficiency. The results show that mild context pruning preserves semantic quality almost at the baseline level, achieving 89.75% Semantic Match on the 100-query sample compared with 89.80% for the unoptimized baseline. Adaptive routing provides the best practical trade-off, reducing input tokens by 8.72% and output tokens by 5.49% while maintaining 88.40% Semantic Match and increasing Token Efficiency by 6.67%. Aggressive schema distillation increases Token Efficiency by 132.22% but results in a 44.50-percentage-point decrease in Semantic Match. The findings demonstrate that token optimization cannot be treated as simple prompt shortening; it must be evaluated as a multi-objective migration problem balancing cost, syntactic validity, semantic preservation, and structural fidelity.

2605.28516 2026-05-28 stat.ML cs.LG

Conservative neural posterior estimation via distributionally robust training

通过分布鲁棒训练实现保守神经后验估计

William Laplante, Yuga Hikida, Charita Dellaporta, François-Xavier Briol, Ayush Bharti

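AI summary Proposes DRO-NPE, which replaces the standard NPE objective with a worst-case loss over a Wasserstein ambiguity set to control overfitting and reduce posterior overconfidence, improving coverage and calibration under low simulation budgets.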

详情
AI中文摘要

基于神经后验估计(NPE)的模拟推断在有限模拟预算下通常会产生过度自信且不可靠的后验。为了解决这个问题,我们提出了DRO-NPE,一种分布鲁棒方法,它将标准NPE目标替换为Wasserstein模糊集上的最坏情况损失。我们引入了基于KL的误覆盖和误校准度量,并利用这些度量表明DRO-NPE目标控制了过拟合并减少了后验过度自信。我们的方法是可处理的、可并行化的,并且易于与标准归一化流集成。在基准SBI任务中,DRO-NPE一致地提高了覆盖率和校准性能,同时缩小了经验NPE损失与总体NPE损失之间的差距,从而在低模拟情况下实现更可靠的推断。

英文摘要

Simulation-based inference with neural posterior estimation (NPE) often yields overconfident and unreliable posteriors under limited simulation budgets. To address this, we propose DRO-NPE, a distributionally robust approach that replaces the standard NPE objective with a worst-case loss over a Wasserstein ambiguity set. We introduce KL-based metrics for miscoverage and miscalibration, and use these to show that the DRO-NPE objective controls overfitting and reduces posterior overconfidence. Our method is tractable, parallelisable, and readily integrates with standard normalising flows. Across benchmark SBI tasks, DRO-NPE consistently improves coverage and calibration, while narrowing the gap between empirical and population NPE loss, leading to more reliable inference in low-simulation regimes.

2605.28515 2026-05-28 cs.SE cs.AI

Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

LLM 是否偏袒其提供商?测量代码生成中的垂直整合偏差

Melih Catal, Alex Wolf, Tiago Ferreiro Matos, Pooja Rani, Harald Gall

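AI summary This paper introduces the VIBench benchmark, which uses 20 provider-selectable software-integration scenarios to measure vertical integration bias in direct and agentic code generation by frontier LLMs; six of the ten affiliated models show significant bias, and agentic workflows amplify it to +39.2 percentage points.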

详情
AI中文摘要

大型语言模型已成为软件开发不可或缺的一部分,尤其是随着代理能力的出现。然而,许多前沿 LLM 与特定提供商有关联。这引发了一个问题:生成的代码是否偏袒提供商自身的生态系统而非可比较的替代方案,从而可能限制开发者的选择并增加对单一提供商的依赖。我们将这种行为定义为垂直整合偏差,并引入 VIBench,一个用于在 20 个提供商可选的软件集成场景中测量直接和代理代码生成中 VIB 的基准。通过评估 10 个前沿提供商关联模型与 3 个非关联对照模型,我们发现直接生成中存在正的 VIB,其中十个关联模型中有六个显示出统计显著效应,最高达 +18.8 个百分点。代理工作流进一步放大了 VIB,达到 +39.2 个百分点。此外,代理工作流中早期的关联生态系统选择可能持续存在于概念上解耦的下游文件中,持续性高达 90.3%。这些发现强调了在代码生成中测量和考虑 VIB 的必要性,尤其是在代理能力日益普及的背景下。

英文摘要

Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet, many frontier LLMs are affiliated with specific providers. This raises the question of whether generated code favors the provider's own ecosystem over comparable alternatives, potentially constraining developers' choices and increasing dependence on a single provider. We define this behavior as Vertical Integration Bias (VIB) and introduce \textsc{VIBench}, a benchmark for measuring VIB in direct and agentic code generation across $20$ provider-selectable software-integration scenarios. Evaluating $10$ frontier provider-affiliated models against $3$ non-affiliated controls, we find positive VIB in direct generation, with six of ten affiliated models showing statistically significant effects up to $+18.8$ percentage points (pp). Agentic workflows further amplify VIB, reaching $+39.2$ pp. Moreover, early affiliated-ecosystem choices in agentic workflows can persist into conceptually decoupled downstream files, with persistence as high as $90.3\%$. These findings underscore the need to measure and account for VIB in code generation, especially as agentic capabilities become more prevalent.

2605.28498 2026-05-28 cs.HC cs.AI

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

验证决策:温暖度和用户特征如何塑造对信息搜索中对话代理的依赖

Mert Yazan, Frederik Bungaran Ishak Situmeang, Suzan Verberne

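AI summary A mixed-subjects experiment finds that users still over-rely on conversational AI even when fact-checking tools are available; verification behavior is driven mainly by user characteristics such as prior trust, while a warm conversational style affects reliance indirectly by increasing agreement with incorrect answers.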

Comments Under review for Computers in Human Behavior

详情
AI中文摘要

对话式人工智能(AI)提供了高效便捷的信息访问途径。然而,当用户盲目信任AI并在不进行事实核查的情况下接受其答案时,可能会导致过度依赖。信息搜索日益遵循一种结合对话AI与网络搜索的混合交互范式,使得事实核查更加容易。本文考察了这种交互范式是否能有效抑制依赖。我们进一步探究了驱动用户验证AI答案的潜在因素(例如数字素养和对话温暖度)。我们进行了一项混合被试问答实验,参与者与温暖或中性的聊天机器人互动。我们的发现表明,尽管用户同时拥有对话和网络搜索的访问权限,依赖仍然存在。验证决策主要由现有的用户感知(例如对聊天机器人的先验信任)驱动,而非答案属性,一些用户无论上下文如何都会进行事实核查,而另一些用户则默认信任聊天机器人。温暖的对话风格通过增加对错误聊天机器人的认同,对依赖产生了间接但关键的影响。咨询额外的AI来源可预测更高的准确性,而传统网络搜索则不然。我们的研究通过以下方式扩展了过度依赖研究:(a)证明了即使在可进行事实核查的情况下,过度依赖仍然存在;(b)将验证行为识别为用户依赖性;(c)揭示了对话温暖度对过度依赖的间接影响,这对设计可信赖的对话搜索系统具有启示意义。

英文摘要

Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overreliance when users blindly trust AI and accept its answers without fact-checking. Information search increasingly follows a hybrid interaction paradigm that combines conversational AI with web search, making fact-checking easier. In this paper, we examine whether this interaction paradigm is effective in curbing reliance. We further investigate the underlying factors (e.g., digital literacy and conversation warmth) that drive users to verify AI answers. We conduct a mixed-subjects question-answering experiment where participants interact with either a warm or a neutral chatbot. Our findings reveal that reliance persists despite users having access to both conversational and web search. The decision to verify is driven primarily by existing user perceptions (e.g., prior trust in chatbots) rather than answer properties, with some users fact-checking regardless of the context and others trusting chatbots by default. Warm conversational style has an indirect yet critical influence on reliance by increasing agreement with the chatbot when it is incorrect. Consulting additional AI sources predicts higher accuracy, while traditional web search does not. Our study extends overreliance research by: (a) demonstrating its persistence despite access to fact-checking, (b) identifying verification behavior as user-dependent, and (c) revealing conversational warmth's indirect effect on overreliance with implications for designing trustworthy conversational search systems.

2605.28480 2026-05-28 eess.AS cs.SD

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

Audio-Mind: 一种可审计的音频理解智能体框架

Yucheng Wang, Jing Peng, Hanqi Li, Chenghao Wang, Wenming Tu, Yu Xi, Zhaokai Sun, Kai Yu, Shuai Wang

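AI summary Proposes Audio-Mind, a framework that uses conditional evidence acquisition to dynamically combine a strong frontend with planner-guided tool use, addressing the question of when agentic evidence acquisition helps audio understanding; it reaches 80.4% accuracy on MMAR and 82.8% on MSU-Bench and produces auditable reasoning traces.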

详情
AI中文摘要

音频智能体通过将音频问题分解为工具调用、中间证据和迭代推理步骤来扩展大型音频语言模型(LALM)。然而,随着LALM变得更强,关键挑战从启用工具使用转变为确定智能体证据获取何时真正有益于音频理解。我们提出Audio-Mind,一个用于音频理解中条件性证据获取的可审计且可插拔框架。Audio-Mind动态结合强前端与规划器引导的工具使用,在初始证据足够时保留前端判断,同时为存在未解决证据差距的问题获取有界的外部证据。在MMAR和MSU-Bench上的实验表明,Audio-Mind优于先前的音频智能体基线,在MMAR上达到80.4%的准确率,在MSU-Bench上达到82.8%的准确率。匹配骨干网络的比较突显了这种设计的重要性:在强音频前端下,如果工作流不保留前端的整体音频基础判断,智能体分解可能成为编排瓶颈。除了准确性,Audio-Mind还产生更高质量、可审计的推理轨迹,暴露不确定性、工具证据和答案理由,为更可靠的音频问答标注和错误分析提供潜在基础。

英文摘要

Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.

2605.28364 2026-05-28 stat.ML cs.LG

Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation

基于多项逻辑函数逼近的强化学习的方差自适应最优算法

Wonyoung Kim, Min-Hwan Oh, Garud Iyengar, Assaf Zeevi

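AI summary For reinforcement learning with multinomial logit function approximation, proposes a computationally efficient variance-adaptive algorithm that attains an instance-wise optimal regret bound, with experiments showing it outperforms conventional approaches.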

详情
AI中文摘要

基于多项逻辑(MNL)函数逼近的强化学习因其灵活性和广泛适用性已成为一个重要框架。虽然现有研究在最坏情况分析下建立了遗憾保证,但它们未能捕捉性能如何依赖于学习者和环境之间交互的变异性。在本文中,我们为基于MNL的马尔可夫决策过程开发了一种新的理论分析,得到了显式的方差自适应遗憾界。我们的算法计算高效,并实现了实例级最优遗憾率,缩小了上下界之间的差距。我们的数值实验验证了我们的方法比传统方法更有效地学习最优策略。

英文摘要

Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance depends on the variability of the interaction between the learner and the environment. In this paper, we develop a new theoretical analysis for MNL-based Markov decision processes that yields explicit variance-adaptive regret bounds. Our algorithm is computationally efficient and achieves the instance-wise optimal rate of regret, narrowing the gap between upper and lower bounds. Our numerical experiments validate that our method learns optimal policies more efficiently than conventional approaches.
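For readers unfamiliar with the setting, MNL function approximation is commonly taken to mean that the probability of each candidate next state is a softmax over linear scores of known features. The sketch below shows only that parameterization; the feature vectors, the parameter `theta`, and the function name are hypothetical, and the paper's variance-adaptive algorithm is not reproduced here.

```python
import math


def mnl_transition_probs(features, theta):
    """Softmax over candidate next states: p_i is proportional to exp(<features[i], theta>)."""
    logits = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    shift = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - shift) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical next states, each described by a 2-dimensional feature vector.
candidate_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
theta = [0.5, -0.5]
print(mnl_transition_probs(candidate_features, theta))
```

In this form the unknown transition model reduces to the single vector `theta`, which is the quantity a learner estimates from observed transitions.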

2605.28353 2026-05-28 cs.NE cs.AI cs.SC

Improving Evaluation of Recombination-based Cartesian Genetic Programming

改进基于重组的笛卡尔遗传编程的评估

Duy Long Tran, Anja Jankovic, Marie Anastacio, Holger Hoos, Roman Kalkreuth

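AI summary Using hyperparameter optimization, this study evaluates two recombination operators, subgraph crossover and discrete phenotypic recombination, on the SRBench benchmarking platform and shows that hyperparameter optimization can improve the performance of recombination-based Cartesian Genetic Programming.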

Comments Accepted for presentation as workshop paper in the graph-based genetic programming workshop (GGP) at the Genetic and Evolutionary Computation Conference (GECCO). To appear in the GECCO'26 conference companion. GECCO'26 will be held July 13-17, 2026 in San Jose, Costa Rica

详情
Journal ref
GECCO'26 Companion: Genetic and Evolutionary Computation Conference Companion, July 13-17, 2026, San Jose, Costa Rica
AI中文摘要

笛卡尔遗传编程传统上使用变异作为其主要且通常是唯一的遗传算子来驱动进化搜索。尽管近年来取得了进展,但由于明显的性能提升不足,基于重组的方法长期以来一直被避免。本研究在符号回归基准平台SRBench上检验了最近提出的两种重组算子:子图交叉和离散表型重组。利用TinyverseGP框架中提供的实现,我们对这两种算子的相应表示进行了超参数优化。我们的工作表明,超参数优化可以导致基于重组的笛卡尔遗传编程的性能提升。

英文摘要

Cartesian Genetic Programming has traditionally used mutation as its main and often sole genetic operator to drive evolutionary search. Despite advancements in recent years, recombination-based approaches have long been avoided, due to an apparent lack of performance gains. This study examines two recently suggested recombination-based operators, subgraph crossover and discrete phenotypic recombination, on SRBench, a benchmarking platform for symbolic regression. Using the implementations provided in the TinyverseGP framework, we perform hyperparameter optimisation of the respective representations with these two operators. Our work demonstrates that hyperparameter optimisation can lead to improvements in performance for recombination-based Cartesian Genetic Programming.