
2606.04903 2026-06-04 cs.LO cs.AI cs.MA cs.PL

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

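Aaron Sterling

Affiliations: Thistleseeds

AI summary: Proposes the Agentic Redux architecture, uses a typed λ-calculus to prove that its execution semantics are correct on suitable domains and that its decisions are auditable, and introduces an ontology-first approach to agent design.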

2606.04845 2026-06-04 stat.ML cs.LG math.ST stat.CO stat.TH

Bayesian learning for the stochastic shortest path problem

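Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

Affiliations: Department of Engineering, University of Cambridge, UK; School of Mathematics and Physics, University of Wollongong, Wollongong, Australia

AI summary: For the stochastic shortest path problem, proposes a Bayesian framework that constructs the posterior over the optimal action-value function Q* directly from Bellman's optimality equations and addresses the unidentifiability introduced by relaxing the likelihood, enabling uncertainty quantification and data-efficient learning.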

Comments 50 pages, 19 figures

Abstract

Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task. Specifically, we learn the optimal action-value function $Q^*$, but unlike many existing Bayesian approaches, we do not rely on unrealistic modelling assumptions and ad-hoc approximations. Our approach is to directly construct the posterior beliefs for $Q^*$ through Bellman's optimality equations. For deterministic rewards, we characterise the posterior as a distribution with a manifold density. To facilitate simpler inference, we relax the likelihood so that a Lebesgue density exists. The flip side is that this creates unidentifiability issues. Specifically, the relaxed posterior can have significant mass on improper decision rules, while the exact posterior will not. We also calculate the exact posterior probabilities for optimal action selections for the tabular parametrisation of $Q^*$, a Gaussian likelihood relaxation and a Gaussian prior, which is useful in benchmarking studies. Numerical studies on variants of the Deep Sea benchmark verify our findings. We demonstrate that our framework faithfully quantifies uncertainty and, compared to other temporal-difference-based Bayesian methodologies, is more data efficient. We conclude with recommendations for future work.
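For orientation, the Bellman optimality equations that the posterior is built on can be written in a standard textbook form for an SSP. The notation here (states $s$, actions $a$, reward $r$, transition kernel $P$, terminal set $\mathcal{T}$) is assumed for illustration and may differ from the paper's own:

```latex
Q^*(s,a) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[\, r(s,a,s') + \max_{a'} Q^*(s',a') \,\right],
\qquad
Q^*(s,a) = 0 \quad \text{for } s \in \mathcal{T}.
```

No discount factor appears because the problem is undiscounted; in this sketch it is the absorbing terminal states that pin the recursion down.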


2606.04769 2026-06-04 cs.CR cs.AI cs.SE

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

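Yutao Shi, Xiaohan Zhang, Xiangjing Zhang, Xihua Shen, Hui Ouyang, Huming Qiu, Mi Zhang, Min Yang

Affiliations: University of Science and Technology of China

AI summary: Addresses mismatches between tool descriptions and code implementations in MCP servers by proposing DCIChecker, an automated detection framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method; on a large-scale dataset it finds a 9.93% inconsistency rate and the associated security risks.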

Comments Preprint

Abstract

The Model Context Protocol (MCP) has emerged as a critical standard empowering Large Language Models (LLMs) to utilize external tools. In this ecosystem, LLMs rely on natural language descriptions provided by MCP servers to select and execute functions. This interaction implicitly assumes that tool descriptions faithfully reflect their underlying implementations, while this assumption is not mandatorily verified in practice. As a result, MCP deployments may suffer from a problem named Description-Code Inconsistency (DCI), where a tool's description of its capabilities and security boundaries is not consistent with what the code actually does. In this paper, we present a comprehensive study of DCI in real-world MCP servers. We formally define the problem and propose a comprehensive taxonomy spanning functionality inconsistencies and undeclared side effects. Guided by this taxonomy, we develop DCIChecker, an automated framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method to cross-validate tool descriptions against actual code implementations. We apply this framework to a large-scale dataset comprising 19,200 description-code pairs extracted from 2,214 real-world MCP servers. Our measurement reveals that DCI is widespread, with 9.93% of these pairs exhibiting inconsistencies. We further demonstrate that DCI creates a critical defense blind spot, facilitating varied risks from operational failures to stealthy malicious behaviors. Finally, we propose mitigation strategies to enforce semantic consistency and enhance the reliability of the emerging agentic ecosystem.


2606.04757 2026-06-04 math.OC cs.LG

Near-Optimal Decentralized Stochastic Convex Optimization over Networks

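Nitai Kluger, Amit Attia, Tomer Koren

Affiliations: Blavatnik School of Computer Science, Tel Aviv University; Google Research Tel Aviv

AI summary: For decentralized stochastic smooth convex optimization, proposes an accelerated decentralized method that, under a total budget of N gradient samples, raises the number of workers that can be supported to M ≲ √ρ N^{3/4}, and proves that this is optimal up to logarithmic factors.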

Comments 12 pages

Abstract

We study decentralized stochastic smooth convex optimization, where $M$ workers minimize an average objective using local stochastic gradients and neighbor-only communication over a fixed gossip network. A central question in this setting is to determine the largest number of workers that can be used under a total budget of $N$ gradient samples while still preserving the centralized $O(1/\sqrt{N})$ statistical rate. We introduce an accelerated decentralized method that preserves this rate for up to $M \lesssim \sqrt{\rho}\,N^{3/4}$ workers, where $\rho$ is the spectral gap of the gossip network, improving the best prior maximal scaling of $M \lesssim \rho\sqrt{N}$. The method is based on a one-step-delayed stochastic acceleration scheme that enables workers to interleave minibatching with accelerated gossip while controlling residual disagreement, and its guarantee depends only logarithmically on the optimum-local heterogeneity. We also establish a matching lower bound for linear-span decentralized first-order methods, showing that the method is optimal up to logarithmic factors.
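To get a feel for the gap between the two scalings, here is a purely illustrative calculation. The values $\rho = 10^{-2}$ and $N = 10^{8}$ are chosen for convenience and do not come from the paper, and the constants and logarithmic factors hidden by $\lesssim$ are ignored:

```latex
\rho = 10^{-2},\quad N = 10^{8}:
\qquad
\rho\sqrt{N} = 10^{-2}\cdot 10^{4} = 10^{2},
\qquad
\sqrt{\rho}\,N^{3/4} = 10^{-1}\cdot 10^{6} = 10^{5},
\qquad
\frac{\sqrt{\rho}\,N^{3/4}}{\rho\sqrt{N}} = \frac{N^{1/4}}{\sqrt{\rho}} = 10^{3}.
```

Under these assumed values the admissible worker count grows by three orders of magnitude, and the ratio $N^{1/4}/\sqrt{\rho}$ shows that the improvement is larger for bigger sample budgets and for poorly connected networks (small $\rho$).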


2606.04739 2026-06-04 cs.SE cs.AI

Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models

Sabrina Kaniewski, Fabian Schmidt, Tobias Heer

Affiliations: Institute for Secure Networked Systems, Esslingen University; Institute for Intelligent Systems, Esslingen University

AI summary: Reproduces and extends the Vul-RAG framework using local deployment and a range of open-weight models, finding that performance plateaus at about 0.30 pairwise accuracy and that greater model capability does not substantially improve it.

Comments Accepted at AI&CCPS 2026 workshop, co-located with the 21st International Conference on Availability, Reliability and Security (ARES 2026). This is the authors' preprint version

Abstract

Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local and open-weights setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.
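The pairwise accuracy metric defined in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code; the function name and the boolean encoding (True means "flagged as vulnerable") are assumptions made here:

```python
from typing import Iterable, Tuple


def pairwise_accuracy(predictions: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of code pairs where BOTH functions are classified correctly.

    Each item is (prediction_for_vulnerable_fn, prediction_for_patched_fn),
    where True means the detector flagged the function as vulnerable.
    A pair counts only if the vulnerable function is flagged and the
    patched function is not.
    """
    pairs = list(predictions)
    if not pairs:
        raise ValueError("need at least one code pair")
    correct = sum(1 for vul_pred, patched_pred in pairs if vul_pred and not patched_pred)
    return correct / len(pairs)


# Toy data (hypothetical): two of the four pairs are fully correct.
example = [(True, False), (True, True), (False, False), (True, False)]
print(pairwise_accuracy(example))  # 0.5
```

A detector that flags everything scores 0 on this metric, which is why it is stricter than per-function accuracy.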


2606.04689 2026-06-04 quant-ph cs.LG

QPredSGG: Hybrid Quantum Predicate Learning for Long-Tailed Scene Graph Generation

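Prerana Ramkumar, Nouhaila Innan, Muhammad Shafique

Affiliations: Department of Computer Science, University of Waterloo; Machine Learning Research Group, University of Waterloo

AI summary: To counter the classification bias caused by long-tailed predicate distributions in scene graph generation, replaces the classical predicate head with a Quantum Predicate Head (QP-Head) that compresses features via amplitude embedding and strongly entangling layers, achieving parameter-efficient long-tail relation classification on Visual Genome 150.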

Comments 11 pages, 5 figures

Abstract

Scene Graph Generation (SGG) requires relational reasoning over objects and their interactions, but performance is often limited by severe long-tail predicate imbalance. Classical SGG models frequently rely on dataset statistics, leading to biased predictions toward frequent relations rather than fine-grained semantic predicates. Although existing debiasing strategies improve mean recall, predicate classification in current frameworks still often depends on large classical decision modules with high parameter cost. This work introduces a hybrid quantum predicate classifier for SGG by replacing the classical predicate head in Causal Feature Enhancement Network (CFEN) with a Quantum Predicate Head (QP-Head) trained using weighted cross-entropy. To the best of our knowledge, this is among the first studies to evaluate a hybrid quantum architecture for scene graph predicate classification on Visual Genome 150. We study the effect of qubit count, encoding strategy, entangling structure, and circuit depth on relational prediction. The best 4-qubit QP-Head uses Amplitude Embedding and Strongly Entangling Layers to compress 4096-dimensional pair features into a 16-dimensional quantum-compatible representation, corresponding to a 256$\times$ reduction. It achieves an mR@100 of 57.25%, compared with 41.1% for the classical CFEN reference, while using only 96 trainable quantum parameters. Scaling to 8 qubits maintains strong long-tail performance, reaching an mR@100 of 55.38% with 384 quantum parameters, while the depth analysis shows a trade-off between expressibility and runtime overhead. These results suggest that compact hybrid quantum predicate heads can support parameter-efficient long-tail relational classification in complex visual reasoning tasks.
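The dimension figures in the abstract follow from simple arithmetic, sketched below. The 16-dimensional representation and the 256-fold reduction follow from amplitude embedding on 4 qubits. The layer counts are not stated in the abstract: they are inferred here under the assumption that each strongly entangling layer carries 3 rotation angles per qubit, which is the convention of common templates but may not match the paper's circuit.

```python
def amplitude_dim(n_qubits: int) -> int:
    """Amplitude embedding stores one value per basis state: 2**n amplitudes."""
    return 2 ** n_qubits


def implied_layers(n_params: int, n_qubits: int, angles_per_qubit: int = 3) -> float:
    """Layer count implied by a parameter budget.

    ASSUMPTION: every layer has `angles_per_qubit` trainable angles per qubit.
    """
    return n_params / (n_qubits * angles_per_qubit)


feature_dim = 4096  # pair-feature size quoted in the abstract

print(amplitude_dim(4))                 # 16
print(feature_dim // amplitude_dim(4))  # 256
print(implied_layers(96, 4))            # 8.0  (inferred, not stated in the source)
print(implied_layers(384, 8))           # 16.0 (inferred, not stated in the source)
```

Note that an 8-qubit register could hold 2**8 = 256 amplitudes; the abstract does not say what input dimension the 8-qubit variant uses, so that is left open here.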


2606.04680 2026-06-04 eess.AS cs.CL cs.SD

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China

AI summary: Proposes the READ metric, which uses a pretrained autoregressive TTS model to compute the acoustic discrepancy between speech and a text hypothesis, evaluating ASR hypotheses without reference transcripts and achieving up to a 20% relative error-rate reduction under noisy conditions.

Comments Submitted to Interspeech 2026. 6 pages, 4 figures

2606.04658 2026-06-04 cs.NE cs.LG

U-Net-Accelerated Quality-Diversity Optimization for Climate-Adaptive Urban Layouts

Alexander Hagg, Tania Guerrero, Dirk Reith

Affiliations: Institute of Technology, Resource and Energy-efficient Engineering (TREE); Bonn-Rhein-Sieg University of Applied Sciences; Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)

AI summary: Replaces a slow physics simulator with a U-Net surrogate model inside an offline MAP-Elites algorithm to quickly generate thousands of diverse, climate-evaluated building layouts.

Abstract

Optimizing urban layouts for climate adaptation requires balancing building density with cold-air ventilation. Because physics-based climate simulations are computationally expensive, planners typically evaluate fewer than ten manual designs. Quality-diversity (QD) algorithms offer a way to systematically illuminate the design space, but they require surrogate models to be practical. In this paper, we replace a slow, regulatory physics simulator with a spatial deep-learning surrogate (U-Net) inside an offline MAP-Elites loop. We systematically compare this spatial approach with a traditional Gaussian process (GP) surrogate across different training-data strategies (quasi-random Sobol sampling vs. active QD bootstrapping). Our results reveal that scalar GP surrogates fail catastrophically when trained on random samples, requiring expensive, actively generated QD archives to generalize. In contrast, the spatial inductive bias of the U-Net allows it to learn the underlying physics mapping robustly ($R^2 = 0.996$), completely independent of the training data source. This allows offline QD optimization to achieve highly accurate fitness rankings ($\rho = 0.994$) using only a one-time batch of random training samples. The resulting pipeline, deployed in the open-source OpenSKIZZE tool, generates thousands of diverse, climate-evaluated building layouts in under ten minutes.


2606.04594 2026-06-04 cs.DC cs.AI cs.SE

Ekka: Automated Diagnosis of Silent Errors in LLM Inference

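Yile Gu, Zhen Zhang, Shaowei Zhu, Xinwei Fu, Jun Wu, Yida Wang, Baris Kasikci

Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes Ekka, a system that automatically diagnoses silent errors in LLM inference frameworks by aligning and comparing intermediate execution states through differential debugging, reaching 80% pass@1 and 88% pass@5 diagnosis accuracy on a benchmark of real-world errors.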

Comments ICML 2026

Abstract

LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.
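The differential-debugging idea described above can be sketched as follows. This is not Ekka's implementation: the checkpoint names, the flat-list representation of intermediate states, and the tolerances are all hypothetical, and the hard part in practice (aligning states across two different frameworks) is assumed to be already done.

```python
import math
from typing import Dict, Optional, Sequence


def first_divergence(
    target: Dict[str, Sequence[float]],
    reference: Dict[str, Sequence[float]],
    order: Sequence[str],
    rel_tol: float = 1e-3,
    abs_tol: float = 1e-6,
) -> Optional[str]:
    """Return the first checkpoint (in execution order) whose values disagree."""
    for name in order:
        t, r = target[name], reference[name]
        same_shape = len(t) == len(r)
        if not same_shape or not all(
            math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol) for a, b in zip(t, r)
        ):
            return name
    return None  # every aligned state matched within tolerance


# Hypothetical aligned checkpoints from a reference and a target run.
order = ["embed", "attn_0", "mlp_0", "logits"]
reference = {"embed": [0.10, 0.20], "attn_0": [0.30, 0.40], "mlp_0": [0.50, 0.60], "logits": [1.0, 2.0]}
target = {"embed": [0.10, 0.20], "attn_0": [0.30, 0.47], "mlp_0": [0.52, 0.71], "logits": [1.3, 1.6]}

print(first_divergence(target, reference, order))  # attn_0
```

Everything downstream of the first mismatch also differs, so reporting the earliest divergent state is what narrows the search from "the output looks worse" to a specific component.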


2606.04592 2026-06-04 cs.CY cs.AI cs.HC

Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?

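Leonard Kinzinger, Jochen Hartmann

Affiliations: Technical University of Munich

AI summary: Builds individual-level digital twins from German Socio-Economic Panel data and evaluates how construction choices (model, information depth, embedding method, reasoning mode) affect accuracy across more than 2 million twin responses; information depth reaches a cost-efficient Pareto point at the 75% entropy quantile, and best-cell accuracy reaches 78.8%.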

Abstract

LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.


2606.04582 2026-06-04 physics.comp-ph cs.LG physics.app-ph

Reconstructing Unobservable Temperature Fields via Simulation-Aided Intelligent Sensing

Monika Stipsitz, Hèlios Sanchis-Alepuz, Jacob Reynvaan, Silvester Sabathiel

Affiliations: Silicon Austria Labs; Republic of Austria; Styrian Business Promotion Agency; federal state of Carinthia; Upper Austrian Research; Austrian Association for the Electric and Electronics Industry

AI summary: Proposes generating datasets from randomized physics simulations to train neural networks that reconstruct internal temperature fields from sparse sensors, enabling real-time online monitoring.

Comments Presented at IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Nancy, France, 2026

2606.04576 2026-06-04 stat.ML cs.LG econ.EM q-fin.RM

ReSGA: A Large Tail Risk Model for Learning Value-at-Risk and Expected Shortfall

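Yichi Zhang, Ke Zhu, Zhoufan Zhu

Affiliations: Hong Kong University; Xiamen University

AI summary: Proposes the retrieval-enhanced self-grouping autoencoder (ReSGA), which uses millions of parameters to capture cross-sectional dependence and long-term temporal dynamics of assets; on US equity data from 1926 to 2023 it outperforms twelve benchmark models, and a new size-enhanced left-side momentum strategy turns its forecasts into economic gains.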

Abstract

Learning Value-at-Risk (VaR) and Expected Shortfall (ES) is important for managing financial risks effectively. Existing approaches with limited parameters are vulnerable to model misspecification in the era of big data. To address this limitation, we propose a large tail risk model, the retrieval-enhanced self-grouping autoencoder (ReSGA), which is designed with millions of parameters to exploit the rich cross-sectional dependence and long-term temporal dynamics of assets using their characteristics. Applied to monthly US equity returns from 1926 to 2023 with 153 firm characteristics, ReSGA outperforms twelve econometric and machine learning competitors in terms of out-of-sample loss and statistical backtesting. In addition, its forecast advantages can translate into significant economic gains from long-short decile portfolios that are constructed by a new size-enhanced left-side momentum strategy. To clarify the role of complexity, we further conduct a systematic scaling analysis and demonstrate that improvements in joint VaR-ES forecasting are primarily driven by data complexity rather than model complexity. Finally, our analyses of group-importance and transfer-learning exhibit the interpretability and cross-market generalizability of ReSGA.
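For readers less familiar with the two risk measures, the sketch below computes simple empirical versions of VaR and ES from a sample of returns. It is a textbook historical-simulation estimate, not ReSGA; the sign convention (both measures reported as returns, so they are negative in the left tail) and the toy data are assumptions made here.

```python
import math
from typing import Sequence, Tuple


def empirical_var_es(returns: Sequence[float], alpha: float) -> Tuple[float, float]:
    """Empirical VaR and ES at tail level `alpha` (e.g. 0.05).

    VaR is the alpha-quantile of returns; ES is the mean of the returns
    at or below that quantile, so ES <= VaR always holds.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not returns:
        raise ValueError("need at least one return")
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))  # number of tail observations
    var = ordered[k - 1]
    es = sum(ordered[:k]) / k
    return var, es


# Toy monthly returns (hypothetical data).
returns = [-0.10, -0.05, -0.02, 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
var, es = empirical_var_es(returns, alpha=0.2)
print(var, es)  # roughly -0.05 and -0.075
```

ES averages over the whole tail beyond VaR, which is why the two are usually forecast jointly, as the abstract describes.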


2606.04522 2026-06-04 cs.IR cs.AI cs.DB cs.LG

ANN Search: Recall What Matters

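Dimitris Dimitropoulos, Nikos Mamoulis

Affiliations: University of Ioannina; Archimedes, Athena RC

AI summary: Proposes replacing Recall@k with the inverse approximation ratio 1/Ratio@k for evaluating approximate nearest neighbor search quality; experiments show that the latter reflects practical utility more accurately and lowers computational overhead.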

Abstract

Approximate nearest neighbor (ANN) search has become a core primitive in information retrieval and modern machine learning tasks, from classification to retrieval-augmented generation. The community evaluates and tunes ANN algorithms primarily on their throughput at a given Recall@k, the fraction of true exact neighbors retrieved. We argue that what really matters in ANN search is the quality of the retrieved results and not their overlap with the true kNN set. We show that using Recall@k to assess retrieval quality forces unnecessary computational overhead and investigate replacing it by 1/Ratio@k, the inverse approximation ratio. 1/Ratio@k evaluates the differences between the distances of the retrieved and true neighbors. It is judge-free, hyperparameter-free, and computable from standard ANN benchmark inputs alone. We benchmark state-of-the-art ANN algorithms across diverse datasets spanning a wide range of intrinsic dimensionalities, evaluating the two metrics comprehensively across efficiency, downstream classification, and retrieval-augmented generation. On the efficiency axis, optimizing for 1/Ratio@k reaches operational quality thresholds at a substantially lower computational cost than Recall@k. In downstream tasks, performance indicators (label precision, semantic similarity, BERTScore, and LLM-graded quality) remain highly stable even when Recall@k drops significantly. The inverse approximation ratio, on the other hand, closely mirrors this stability, tracking true utility much better than Recall@k. Ultimately, while Recall@k overstates the true cost of approximation, 1/Ratio@k offers a more accurate, deployable proxy for actual ANN quality.
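The contrast between the two metrics can be illustrated with a small sketch. The abstract does not spell out the formula for Ratio@k, so this assumes a common definition from the ANN literature (the mean, over ranks, of retrieved distance divided by true distance); the paper's exact definition may differ, and the IDs and distances are made up.

```python
from typing import Sequence


def recall_at_k(retrieved_ids: Sequence[int], true_ids: Sequence[int]) -> float:
    """Fraction of the true k nearest neighbors that were retrieved."""
    return len(set(retrieved_ids) & set(true_ids)) / len(true_ids)


def inverse_ratio_at_k(retrieved_dists: Sequence[float], true_dists: Sequence[float]) -> float:
    """1 / Ratio@k, with Ratio@k ASSUMED to be the mean rank-wise distance ratio.

    Both inputs hold k positive distances to the query. A value of 1.0 means
    the retrieved neighbors are exactly as close as the true ones.
    """
    pairs = list(zip(sorted(retrieved_dists), sorted(true_dists)))
    ratio = sum(r / t for r, t in pairs) / len(pairs)
    return 1.0 / ratio


# Hypothetical query: two true neighbors are missed, but the substitutes
# are only 10% farther away than the neighbors they replace.
true_ids, true_dists = [1, 2, 3, 4], [1.0, 2.0, 4.0, 5.0]
retrieved_ids, retrieved_dists = [1, 2, 7, 8], [1.0, 2.0, 4.4, 5.5]

print(recall_at_k(retrieved_ids, true_ids))             # 0.5
print(inverse_ratio_at_k(retrieved_dists, true_dists))  # about 0.952
```

In this toy case recall halves while the distance-based score barely moves, which is the kind of divergence the abstract argues matters for downstream tasks.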


2606.04517 2026-06-04 cs.NI cs.AI

Treat Traffic Like Trees: A Semantic-Preserving Hierarchical Graph-Based Expert Framework for Encrypted Traffic Analysis

Yuantu Luo, Jun Tao, Linxiao Yu, Guang Cheng

Affiliations: School of Cyber Science and Engineering, Southeast University; Purple Mountain Laboratories; Engineering Research Center of Blockchain Application, Supervision and Management (Southeast University); Engineering Research Center of Security for Ubiquitous Network, Jiangsu Province

AI summary: Proposes PTGAMoE (Protocol Tree Graph Attention with Mixture of Experts), a semantic-preserving hierarchical graph-based expert framework that uses field-level graph construction and an expert committee design, significantly outperforms existing models under strict no-data-leakage settings, and provides interpretable protocol-level feature-importance analysis.

Comments This work has been submitted to the IEEE for possible publication

Abstract

Graph-based deep learning methods have been widely employed in encrypted traffic analysis to exploit latent correlations across different granularities. However, while complex preprocessing pipelines and sophisticated model structures often achieve strong performance, they may obscure inherent protocol semantics during representation learning. Moreover, the hierarchical structure of protocol layers and their corresponding fields, defined by protocol specifications and routinely utilized in manual traffic analysis, remains underexplored in existing learning frameworks. In this paper, we propose Protocol Tree Graph Attention with Mixture of Experts (PTGAMoE), a semantic-preserving hierarchical graph-based expert framework for encrypted traffic analysis. The field-based graph construction and expert committee design enable PTGAMoE to quantify the model's preferences for specific fields and protocols. Extensive experimental results on representative benchmark datasets under strict no-data-leakage settings demonstrate that PTGAMoE significantly outperforms state-of-the-art (SOTA) models. Furthermore, the semantic-preserving design provides interpretable insights into protocol-level feature importance and expert-level contributions, reflecting the model's decision-making logic in encrypted traffic classification tasks.


2606.04499 2026-06-04 cs.SI cs.LG

Modeling and Interpreting Teamwork Dynamics in Cancer Care Outcome Prediction

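Yuhua Huang, Hsiao-Ying Lu, Kwan-Liu Ma

Affiliations: University of California, Davis

AI summary: Uses collaboration networks derived from electronic health records together with machine learning methods to study how teamwork dynamics among healthcare professionals affect cancer patient survival prediction, and interprets the key network characteristics.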

Abstract

Cancer care requires a longitudinal approach in which treatments are planned and delivered over time according to the needs of each individual patient. While prior research has thoroughly explored how clinical and demographic factors, such as comorbidities and age, inform treatment planning, far less attention has been devoted to the delivery phase of care. Yet planning and delivery are both team-based processes that depend on coordinated efforts among multiple healthcare professionals (HCPs). As such, the human factors embedded in these collaborative practices are crucial to optimizing patient outcomes. Despite this importance, the existing literature on human factors in cancer care is limited, and very few studies have investigated how collaboration within care teams evolves over the course of treatment. To fill this gap, this work examines how HCPs' collaboration, captured through electronic health record (EHR) systems, affects cancer patient outcomes, with particular emphasis on teamwork dynamics. We represent EHR-mediated HCP interactions as networks and apply machine learning methods to identify predictive signals of patient survival embedded in these collaborative structures. We further interpret model predictions by pinpointing network characteristics and dynamic patterns associated with particular outcomes. We evaluate our model through robustness analyses to ensure that the findings are stable and not driven by stochastic variation in training. Additionally, our insights align with hypotheses proposed in the medical literature, and our results provide empirical, data-driven evidence supporting these claims. Overall, our work contributes a practical workflow for leveraging digital traces of collaboration to evaluate and strengthen longitudinal team-based healthcare, offering actionable insights to guide data-informed interventions in healthcare delivery.


2606.04486 2026-06-04 cs.CR cs.CL cs.LG stat.ML

Global Sketch-Based Watermarking for Diffusion Language Models

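Daniel Zhao

Affiliations: Harvard University

AI summary: Proposes a global vector-sketch watermarking method for masked diffusion language models that controls the overall statistical features of the text, so that detection does not depend on local context.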

2606.04460 2026-06-04 cs.CR cs.AI cs.LG

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, Jianhong Tu, Gabriel Han, Chenguang Wang, Jingxuan He, Wenbo Guo, Dawn Song

Affiliations: Stanford University; UC Berkeley

AI summary: Proposes CyberGym-E2E, a large-scale, realistic end-to-end cybersecurity benchmark whose automated pipeline converts open-source vulnerability data into evaluation environments, assessing AI agents across the full lifecycle of vulnerability discovery, PoC generation, and patch generation.

Comments ICML 2026

2606.04446 2026-06-04 cs.DC cs.LG

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

Liyuan Zhang, Jiarui Zhang, Jinwei Yao, Ran Yan, Yuchen Yang, Jiahao Zhang, Tongkai Yang, Yi Wu, Binhang Yuan

Affiliations * Peking University; Tsinghua University; HKUST; UIUC; Ant Group

AI summary: Proposes the D^2SD framework, which uses dual diffusion draft models and a confidence-guided prefix tree to raise the acceptance rate of speculative decoding, outperforming existing diffusion methods and autoregressive speculative decoding baselines.

Abstract

Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, resulting in a limited acceptance rate. Naively batching more draft candidate sequences only introduces a marginal improvement, as redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree, where the first diffusion drafter generates a block along with per-position confidence scores that are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
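The single-sequence baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of greedy prefix verification only, not the D^2SD method: the function name `accept_prefix` and the idea of passing the target model's per-position choices as a plain list are assumptions made for the example.

```python
from typing import List, Sequence


def accept_prefix(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the draft tokens accepted under greedy verification.

    `target[i]` stands for the token the target model would emit at
    position i given the preceding draft tokens (a hypothetical input
    used only for illustration).
    """
    accepted: List[int] = []
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            # First mismatch: this token and every later draft token are discarded.
            break
        accepted.append(draft_token)
    return accepted


def best_of_candidates(candidates: Sequence[Sequence[int]],
                       target: Sequence[int]) -> List[int]:
    """Pick the candidate with the longest accepted prefix (illustrative)."""
    return max((accept_prefix(c, target) for c in candidates), key=len, default=[])
```

With a single draft, one early mismatch wastes the rest of the block; `best_of_candidates` shows why alternative continuations that branch near the likely rejection point can recover more accepted tokens.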

2606.04444 2026-06-04 eess.IV cs.LG

Scaling Datasets for Multi-Sensor, Multi-Agent, and Multi-Domain Learning in Autonomous Systems

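R. Spencer Hallyburton, David Hunt, Miroslav Pajic

Affiliations * Department of Electrical and Computer Engineering, Duke University

AI summary: Proposes a modular dataset generation pipeline built on AVstack and CARLA that creates terabyte-scale multi-domain data with ground-truth labels, supports single- and multi-agent setups with flexible sensor configurations, and targets application-specific training and collaborative autonomy research.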

2606.04429 2026-06-04 stat.ML cs.LG

Flatness and Generalization: Learning Multi-Index Models with Homogeneous Neural Networks

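Harsh Vardhan, Hossein Taheri, Arya Mazumdar

Affiliations * Department of Computer Science; University of California, San Diego; Halicioğlu Data Science Institute

AI summary: Studies the relationship between flatness and generalization when two-layer homogeneous neural networks learn multi-index models, proving that the flattest interpolators always generalize, while the flatness of certain non-generalizing interpolators cannot approach the flattest value.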

Abstract

A common heuristic used to explain the generalization of first-order gradient methods on non-convex neural networks is that "flat interpolators generalize well" (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017), where flatness can be measured by the trace of the Hessian of the empirical loss. However, Dinh et al. (2017) showed that, using symmetry of the network that can change flatness while keeping the population and empirical losses unchanged, any interpolator can be made sharper or flatter. This result makes the earlier heuristic statement vacuous. In this paper, we show that for learning an unknown multi-index model with $2$-layer non-convex homogeneous neural networks, there is a connection between flatness and generalization, despite the existence of symmetries. This connection pertains to the "flattest" interpolators, i.e., the interpolators that have orderwise minimum flatness among all interpolators. First, we show that there exists a natural class of non-generalizing interpolators whose flatness cannot be made closer to the flattest possible, even using symmetries. Second, we show that for data generated by a sum of single-index models, if the approximation error and label noise are low, any flattest interpolator achieves small population loss, i.e., the flattest interpolators always generalize. This establishes a direct link between flatness and generalization which applies to a large class of activations and realistic data distributions.
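The rescaling symmetry mentioned in the abstract can be seen on a toy example. The sketch below assumes a one-neuron network f(x) = a * relu(w * x) with squared loss, which is a deliberate simplification and not the paper's setting; the function names and the sample values are illustrative. At an exact interpolator the residual is zero, so the loss Hessian reduces to the mean outer product of the gradient of f, and its trace is the mean squared gradient norm.

```python
import numpy as np


def predict(a: float, w: float, x: np.ndarray) -> np.ndarray:
    """Toy two-layer homogeneous network: f(x) = a * relu(w * x)."""
    return a * np.maximum(w * x, 0.0)


def hessian_trace_at_interpolator(a: float, w: float, x: np.ndarray) -> float:
    """Trace of the Hessian of 0.5 * mean((f(x) - y)^2) where f fits y exactly.

    With zero residual the Hessian equals mean(grad f grad f^T),
    so the trace is mean(|grad f|^2).
    """
    pre = w * x
    df_da = np.maximum(pre, 0.0)   # derivative of f with respect to a
    df_dw = a * x * (pre > 0)      # derivative of f with respect to w
    return float(np.mean(df_da**2 + df_dw**2))


x = np.array([0.5, 1.0, 2.0])
a, w = 4.0, 1.0

for c in (0.5, 1.0, 2.0, 4.0):
    a_c, w_c = a / c, w * c        # rescaling leaves the function unchanged
    same_function = np.allclose(predict(a, w, x), predict(a_c, w_c, x))
    trace = hessian_trace_at_interpolator(a_c, w_c, x)
    print(f"c={c}: same function={same_function}, Hessian trace={trace:.4f}")
```

Every rescaled pair computes the same function, yet the trace changes with c. For these inputs it is smallest at c = 2, where the two layers are balanced (a = w = 2), which is the kind of "flattest point along a symmetry orbit" the abstract's notion of flattest interpolators refers to.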

2606.04425 2026-06-04 cs.CR cs.AI

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

Yuanbo Xie, Tianyun Liu, Yingjie Zhang, Suchen Liu, Yulin Li, Liya Su, Tingwen Liu

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; AI Sec Lab, Beijing Chaitin Technology Co., Ltd

AI summary: Introduces cross-session stored prompt injection, in which persistent state turns prompt injection from a single-session, model-level threat into a long-lived, system-level vulnerability, and builds a taxonomy, benchmark, and sandbox toolkit to evaluate the risk.

Comments position paper

Abstract

Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.

2606.04404 2026-06-04 stat.ML cs.LG

Knockoffs-based False Discovery Rate Control and Simplification for Deep Neural Networks

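Huiqi Zhang, Wenyu Liao, Yiqing Shi, Xiaobo Huang, Fang Xie

Affiliations * bnbu.edu.cn

AI summary: Building on the knockoff method and regularized neural networks, proposes three variable-screening methods under false discovery rate control (single-layer filtering, multi-layer filtering, and variable-weight aggregation filtering) to simplify deep neural networks and reduce computational complexity.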

2606.04388 2026-06-04 cs.CR cs.AI cs.LG

TITAN-FedAnil+: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises

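Muhammad Hadi, Muhammad Jahangir, Talha Shafique, Muhammad Khuram Shahzad

Affiliations * University of Science and Technology of China

AI summary: Proposes the TITAN-FedAnil+ framework, which filters malicious updates through affinity propagation-based adaptive clustered aggregation, improves efficiency with GPU-accelerated vectorization, and achieves lightweight blockchain resynchronization through a signed state jump mechanism, reducing memory overhead by up to 81% on resource-constrained edge devices.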

Comments 8 pages, 5 figures; code available at https://github.com/error8149/FedAnilPlus-Optimized

Abstract

Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data heterogeneity arising from non-IID distributions and decentralized security threats remain significant challenges, particularly in resource-constrained enterprise environments. This paper presents TITAN-FedAnil+, a Trust-Based Adaptive Network for blockchain-enabled federated learning in intelligent enterprises. The proposed framework introduces affinity propagation-based adaptive clustered aggregation to identify and filter malicious updates without requiring prior knowledge of the number of attackers. In addition, GPU-accelerated vectorization is employed to improve computational efficiency, while a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results demonstrate substantial reductions in memory overhead, achieving up to 81% savings across 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN-FedAnil+ effectively improves robustness, scalability, and resource efficiency for secure federated learning deployments in intelligent enterprise environments.

2606.04387 2026-06-04 cs.IR cs.AI

Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking

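Chenyu Zhang, Yiwen Liu, Yin Sun, Xinyuan Zhang, Yuji Cao, Junming Jiao, Juyi Qiao

Affiliations * Intelligent Business Team, Li Auto Inc.

AI summary: For sales lead conversion in high-stakes domains, proposes HPRO on top of an LLM-based discriminative framework, which jointly models structured and unstructured data through hierarchical preference ranking optimization and improves both scoring and ranking performance.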

Abstract

Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to prolonged decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and an inability to capture relative lead priority. While Large Language Models (LLMs) offer superior semantic understanding of customer interactions, general-purpose LLMs are ill-suited for lead ranking: they generate text rather than comparable scores, and lack alignment with the hierarchical priorities of sales funnels. We introduce an LLM-based discriminative framework for sales lead scoring, which supports joint modeling of structured CRM features and unstructured customer interactions. On top of this framework, we propose HPRO (Hierarchical Preference Ranking Optimization), which augments sales lead scoring with a hierarchical preference ranking objective. HPRO employs a margin-aware Bradley-Terry formulation to transform sparse binary labels into dense, funnel-aware preference pairs, enabling lead scoring to leverage both pointwise and pairwise supervision. Experiments on large-scale data from a leading NEV brand demonstrate state-of-the-art classification (AUC 0.8161) and ranking performance (+39.7% precision among top-ranked leads). A 132-day online A/B test validates 9.5% sales volume uplift, confirming real-world commercial impact.
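One common way to write a margin-aware Bradley-Terry objective is sketched below. The abstract does not give HPRO's exact loss, so everything here is an assumption for illustration: the pairwise loss -log sigmoid(s_i - s_j - margin), the integer funnel stages, and the helper names `bt_margin_loss` and `funnel_pairs`.

```python
import math
from typing import List, Sequence, Tuple


def bt_margin_loss(s_pos: float, s_neg: float, margin: float = 0.0) -> float:
    """Pairwise loss -log sigmoid(s_pos - s_neg - margin), computed stably."""
    z = s_pos - s_neg - margin
    # softplus(-z) = log(1 + exp(-z)), written to avoid overflow for large |z|
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def funnel_pairs(stages: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Turn per-lead funnel stages into (higher, lower, stage_gap) index pairs.

    A lead that progressed further down the funnel is preferred over one
    that stopped earlier; the gap can be used to scale the margin.
    """
    pairs = []
    for i, stage_i in enumerate(stages):
        for j, stage_j in enumerate(stages):
            if stage_i > stage_j:
                pairs.append((i, j, stage_i - stage_j))
    return pairs


# Hypothetical example: three leads with model scores and funnel stages
# (0 = no response, 1 = test drive, 2 = purchase).
scores = [1.2, -0.3, 0.4]
stages = [2, 0, 1]
total = sum(bt_margin_loss(scores[i], scores[j], margin=0.5 * gap)
            for i, j, gap in funnel_pairs(stages))
print(f"pairwise loss over {len(funnel_pairs(stages))} pairs: {total:.4f}")
```

Three leads with a single purchase yield three preference pairs rather than one positive label, which is the sense in which sparse binary labels become denser pairwise supervision.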

2606.04382 2026-06-04 cs.DL cs.AI cs.IR

LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

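Kwok Leong Tang

Affiliations * Library of Congress

AI summary: Proposes the LCSHBench benchmark, a multilingual set of bibliographic records built on multi-library consensus, which evaluates automated subject cataloging by exact and concept matching and shows that a low-rank fine-tuned embedder improves cross-lingual retrieval.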

Abstract

Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages from the openly licensed Harvard, Columbia, and Princeton catalogs. Records enter only when at least two independent cataloging agencies assigned LCSH; we release per-catalog provenance plus union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in exact expression (39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.
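The exact recall@k metric and the two answer views can be illustrated with a short sketch. It assumes that the "union" view is the set union of all catalogs' headings and the "unanimous" view is their intersection, and that recall@k is the fraction of gold headings found in the top k results; the abstract does not spell out these definitions, and the heading strings below are made up.

```python
from typing import Dict, Iterable, List, Set, Tuple


def recall_at_k(ranked: List[str], gold: Iterable[str], k: int) -> float:
    """Fraction of gold headings that appear in the top-k ranked headings."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(set(ranked[:k]) & gold_set) / len(gold_set)


def answer_views(per_catalog: Dict[str, List[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (union, unanimous) heading sets across catalogs (assumed definitions)."""
    heading_sets = [set(headings) for headings in per_catalog.values()]
    return set.union(*heading_sets), set.intersection(*heading_sets)


# Hypothetical record cataloged by three libraries.
record = {
    "harvard": ["Bread", "Baking"],
    "columbia": ["Bread"],
    "princeton": ["Bread", "Cooking"],
}
union_view, unanimous_view = answer_views(record)
ranked = ["Cooking", "History", "Bread", "Baking"]
print(recall_at_k(ranked, union_view, k=3), recall_at_k(ranked, unanimous_view, k=3))
```

Scoring against the union view rewards any heading some cataloger chose, while the unanimous view only credits headings every catalog agreed on, so the same ranking can score differently under the two views.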

2606.04380 2026-06-04 stat.ML cs.LG

REGAIN: REconciliation GAIN-driven Auxiliary Direction Learning

Weijia Li, Shun Hu, Yanfei Kang

Affiliations * School of Mathematical Sciences, Beihang University, Beijing, China; School of Economics and Management, Beihang University, Beijing, China

AI summary: Proposes the REGAIN framework, which learns normalized auxiliary directions and, using a frozen forecasting oracle, selects directions by target-weighted loss reduction to improve forecast reconciliation.

Abstract

Forecast reconciliation usually starts from a fixed measurement system and asks how forecasts should be projected onto a coherent space. We ask a different question: which additional linear measurements should be forecast and included in the reconciliation system? We propose REGAIN, a reconciliation-gain framework that learns normalized auxiliary directions, forecasts the induced series with a frozen forecasting oracle, and selects directions by their target-weighted loss reduction after augmented generalized least-squares reconciliation. Unlike variance-based components or predictability-based auxiliary selection, REGAIN optimizes the downstream effect of an auxiliary measurement on the final reconciled forecasts. We provide a statistical characterization showing that useful auxiliary directions must provide complementary information about unresolved target uncertainty, rather than merely being easy to forecast. The analysis also clarifies the covariance-risk reduction mechanism, the role of bias changes in realized quadratic risk, and the stability of estimated gain signals. A stagewise learning algorithm with held-out gain screening is developed, together with an optional joint refinement step. Experiments on Beijing PM2.5 and Australian Tourism data show that gain-selected measurements can improve both ordinary multivariate and hierarchical forecasts, especially when they reveal residual uncertainty not captured by the original measurement system.
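For readers unfamiliar with reconciliation, the standard generalized least-squares projection can be sketched as follows. This is the textbook formula, not REGAIN itself: the tiny two-series hierarchy, the identity weight matrix, the forecast values, and the way the auxiliary direction is appended as an extra row of the structure matrix are all assumptions made for illustration.

```python
import numpy as np


def gls_reconcile(S: np.ndarray, y_hat: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project base forecasts onto the coherent space spanned by S.

    Computes S (S^T W^-1 S)^-1 S^T W^-1 y_hat, where W is the (assumed known)
    covariance of the base forecast errors.
    """
    W_inv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
    return S @ (G @ y_hat)


# Hierarchy: total = A + B. Rows of S map the bottom series (A, B) to each measurement.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([10.0, 4.0, 5.0])            # incoherent: 4 + 5 != 10
y_tilde = gls_reconcile(S, y_hat, np.eye(3))

# Augment the system with one normalized auxiliary direction (hypothetical choice)
# and a forecast of the induced series a^T (A, B).
a = np.array([1.0, -1.0]) / np.sqrt(2.0)
S_aug = np.vstack([S, a])
y_hat_aug = np.append(y_hat, -0.5)
y_tilde_aug = gls_reconcile(S_aug, y_hat_aug, np.eye(4))[:3]

print("reconciled:", y_tilde)
print("reconciled with auxiliary measurement:", y_tilde_aug)
```

Both outputs satisfy total = A + B, but the auxiliary forecast shifts how the discrepancy is split between A and B. Whether that shift helps depends on how much new information the auxiliary forecast carries about the targets, which is the quantity the abstract's gain criterion is meant to measure.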

2606.04374 2026-06-04 cs.IR cs.AI

DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling

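Bokang Wang, Xing Fang, Mingmin Jin, Jing Wang, Zhentao Song, Guangxin Song, Jianbo Zhu

Affiliations * Taobao & Tmall Group of Alibaba

AI summary: To address the difficulty continuous embeddings have in capturing fine-grained attribute distinctions in e-commerce search, proposes the Discrete Semantic Identifier Relevance Model (DSIRM), which uses query-bridged contrastive quantization to inject query-item interaction supervision and learn relevance-aware partitions, and uses a generative LLM to predict item identifiers, improving relevance modeling.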

Comments Jing Wang (Corresponding Author)

Abstract

Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturing fine-grained attribute distinctions. While discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it more difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. To address the issue of unsupervised SIDs, we propose to explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). Specifically, we present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. On the other hand, we explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), proving its massive industrial value.
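Plain residual quantization and prefix matching, the two building blocks the abstract names, can be sketched as below. This is the unsupervised baseline form only: it does not include the query-bridged contrastive supervision or the LLM-based query-side prediction, and the codebooks, vectors, and function names are invented for the example.

```python
from typing import List, Sequence, Tuple

import numpy as np


def rq_encode(x: np.ndarray, codebooks: List[np.ndarray]) -> Tuple[int, ...]:
    """Residual quantization: at each level pick the nearest codeword,
    then quantize what is left over at the next level."""
    residual = x.astype(float)
    sid = []
    for codebook in codebooks:
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        sid.append(index)
        residual = residual - codebook[index]
    return tuple(sid)


def shared_prefix_len(query_sid: Sequence[int], item_sid: Sequence[int]) -> int:
    """Number of leading SID levels on which query and item agree."""
    length = 0
    for q, i in zip(query_sid, item_sid):
        if q != i:
            break
        length += 1
    return length


# Hypothetical two-level codebooks: coarse clusters first, fine offsets second.
codebooks = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
]
sid_x = rq_encode(np.array([10.0, 11.0]), codebooks)
sid_y = rq_encode(np.array([11.0, 10.0]), codebooks)
sid_z = rq_encode(np.array([0.0, 0.9]), codebooks)
print(sid_x, sid_y, sid_z, shared_prefix_len(sid_x, sid_y), shared_prefix_len(sid_x, sid_z))
```

A longer shared prefix means the two identifiers agree down to a finer partition, so the prefix length (or one indicator per level) can serve as a discrete relevance feature alongside dense similarity scores.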

2606.04370 2026-06-04 eess.AS cs.SD eess.SP

Masked Wavelet Scattering Transform Neural Field for Sound Field Reconstruction

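Xinmeng Luan, Samuel A. Verburg, Efren Fernandez-Grande, Gary Scavone

Affiliations * Fonds de recherche du Québec – Nature et technologies

AI summary: Proposes a sound field reconstruction method for sparse observations that uses the wavelet scattering transform as a multi-scale feature extractor, combined with neural field optimization and a masking strategy, and validates it on HRTF upsampling.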

Comments 5 pages, 2 figures, conference

2606.04362 2026-06-04 cs.IR cs.CL

Disentangling Answer Engine Optimization from Platform Growth: A Log-Based Natural Experiment on ChatGPT Referral Traffic

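Keisuke Watanabe, Kazuki Nakayashiki

Affiliations * Glasp Inc.

AI summary: Using a natural experiment with untreated pages on the same domain as the control, separates the causal effect of Answer Engine Optimization (AEO) on referral traffic from the confounding effect of the platform's own growth.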

Comments 9 pages, 4 figures, 1 table

Abstract

Large language model (LLM) "answer engines" such as ChatGPT now send measurable referral traffic to the open web, and a practice analogous to search engine optimization, here called Answer Engine Optimization (AEO), has emerged. Public AEO success stories typically quote large raw growth multiples, but raw referral growth is confounded by the rapid platform-level growth of the answer engines themselves. We report a longitudinal field study on a single high-traffic domain (glasp.co) whose corpus of hundreds of thousands of YouTube question-and-answer pages received a defined bundle of AEO interventions in January 2026 (detailed in Section 4). Because the interventions were concentrated on one subset of the site, the untreated remainder of the same domain acts as a contemporaneous control that absorbs the platform tailwind. Using first-party analytics and server logs rather than probabilistic third-party estimators, we find: (1) raw growth is dominated by the platform tailwind: on monthly aggregates total ChatGPT referrals grew 5.7x while untreated pages on the same domain grew 3.5x over the same window; (2) an interrupted time-series model on the weekly treated/control ratio estimates a discrete, intervention-aligned level increase of 1.82x (95% CI 1.31-2.54, HAC p=0.001), robust across engagement-filtered traffic (2.27x) and alternative specifications; (3) however, a conservative placebo-in-time permutation test yields p=0.16, so the effect is suggestive, not conclusive, given a short and noisy pre-period; and (4) Google organic clicks to treated pages did not fall beyond the ambient site-wide trend and indexation was preserved, consistent with the SEO-protection rule. The methodological message, separating treatment from platform tailwind with an on-domain control, matters more than any single multiple, and implies that headline AEO multiples substantially overstate causal effect.

2606.04361 2026-06-04 eess.SY cs.MA cs.RO cs.SY math.DS math.OC

When Freshness Is Not Enough: Distribution-Aware Age of Information for Networked LQR Control

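Abdullah Y. Etcibasi, C. Emre Koksal, Eylem Ekici

Affiliations * Department of Electrical and Computer Engineering, The Ohio State University

AI summary: Shows that in networked control systems, minimizing mean Age of Information (AoI) alone is not enough to optimize LQR tracking performance; the full distribution of inter-scheduling intervals (including higher-order and exponential moments) must be considered.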

Abstract

Age of Information (AoI) has become a central metric for the design of wireless update systems, especially in applications where fresh measurements support tracking, estimation, and control. Despite its popularity, the use of mean AoI or peak AoI as a surrogate for closed-loop performance is often motivated by intuition rather than by a control-theoretic derivation. This paper examines whether minimizing the mean AoI is in fact optimal for networked control systems. For scalar linear time-invariant systems with delayed intermittent updates, we show that, under state-independent scheduling policies, the infinite-horizon LQR tracking problem reduces to an optimization over the distribution of inter-scheduling intervals. The resulting objective depends on higher-order statistical moments, and in unstable or correlated regimes on exponential moments, of the inter-scheduling process rather than only on its mean. Consequently, policies with identical mean AoI can induce substantially different tracking costs. We further extend the analysis to disturbances with exponentially decaying autocorrelation and derive equivalent cost formulations that expose the role of the full interval distribution. Finally, we validate the theory using real vehicle trajectories from the NGSIM US-101 dataset. The empirical results match the predicted performance trends, demonstrating that mean AoI alone is insufficient for control-oriented network design.
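The claim that equal mean AoI can hide different tracking costs can be checked on a toy model. The sketch below is a strong simplification and not the paper's setting: it assumes zero delivery delay, an error that resets to zero at every update, a scalar system x_{k+1} = a x_k + w_k with unit-variance noise, and two made-up interval distributions chosen to share the same first and second moments.

```python
from typing import Dict


def mean_aoi(dist: Dict[int, float]) -> float:
    """Time-average age for i.i.d. update intervals (age runs 0..Y-1 per interval)."""
    mean_len = sum(y * p for y, p in dist.items())
    age_area = sum(p * y * (y - 1) / 2 for y, p in dist.items())
    return age_area / mean_len


def mean_error_cost(dist: Dict[int, float], a: float, sigma2: float = 1.0) -> float:
    """Time-average error variance when the error resets to zero at each update."""
    def interval_cost(y: int) -> float:
        total, var = 0.0, 0.0
        for _ in range(y):
            total += var                 # error variance at the current age
            var = a * a * var + sigma2   # one more step without an update
        return total

    mean_len = sum(y * p for y, p in dist.items())
    return sum(p * interval_cost(y) for y, p in dist.items()) / mean_len


# Two hypothetical policies with E[Y] = 2 and E[Y^2] = 5, but different third moments.
policy_a = {1: 1 / 2, 3: 1 / 2}
policy_b = {1: 1 / 3, 2: 1 / 2, 4: 1 / 6}

for a in (1.0, 1.2):
    print(f"a={a}: AoI {mean_aoi(policy_a):.4f} vs {mean_aoi(policy_b):.4f}, "
          f"cost {mean_error_cost(policy_a, a):.4f} vs {mean_error_cost(policy_b, a):.4f}")
```

For a = 1 the error variance grows linearly with age, so the cost coincides with mean AoI and the two policies tie. For the unstable case a = 1.2 the variance grows geometrically, the cost depends on the exponential moment of the interval length, and the policy with the occasional long gap becomes more expensive even though its mean AoI is unchanged.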
