2605.24326 2026-05-26 cs.DC cs.AI cs.NI

ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training

Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, Carole-Jean Wu

Affiliations: Harvard University; Meta Platforms, Inc.

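AI summary: To navigate the complex design space of large-scale AI model training across data centers (scale-across), the paper proposes ScaleAcross Explorer, an optimizer that jointly optimizes parallelism placement, parallelism scheduling, and network layer technologies, achieving training speedups of up to 64.62%.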

Comments 28 pages, 27 figures

Abstract

The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to this paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct an in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate training speedups of up to 64.62% over the production configuration and up to 37.59% over the state-of-the-art baseline across a wide range of design points.

2605.24324 2026-05-26 quant-ph cs.LG

Unlocking Apple's Private Cloud Compute: Privacy-Preserving AI Analysis

Yannik Dittmar, Marvin Jerome Stephan, Thomas Völkl, Matthias Hollick, Jiska Classen

Affiliations: Hasso Plattner Institute, University of Potsdam; TU Darmstadt, Secure Mobile Networking Lab; IMDEA Networks Institute, Madrid, Spain

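AI summary: Reverse-engineers the on-device implementation of Apple's Private Cloud Compute (PCC) to evaluate its privacy properties, and opens its non-public interfaces to support custom queries and independent benchmarking.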

DOI: 10.1145/3765613.3811691
Journal ref: Proceedings of the 19th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec 2026)

Abstract

Many existing Artificial Intelligence (AI) solutions on mobile devices rely on an extensive collection of sensitive data, raising privacy concerns and often requiring storage for both context and model improvement. Apple's Private Cloud Compute (PCC) aims to address this by emphasizing mobile device integration and a privacy-first design. The central claim of PCC is that it does not store any user data and that user input and user accounts are unlinkable. While most of the PCC system specifications are public, compiled binaries add a layer of opaqueness. There are no reproducible builds, and there are no symbols within those binaries, creating potential discrepancies between the specification and what is shipped to the user. Additionally, the underlying models and interfaces for querying PCC are not openly accessible, limiting academic evaluation of model properties, such as accuracy. This poses a challenge in assessing whether a privacy-preserving approach like PCC is actually trustworthy while also providing high-quality answers. We are the first to reverse-engineer the PCC implementation on mobile devices to evaluate privacy aspects and to open its non-public interfaces on local devices to support custom PCC queries. We demonstrate this level of access beyond Apple's intended use cases by independently benchmarking the PCC model. We enable future research by making our PCC benchmarking framework publicly available.

2605.24213 2026-05-26 cs.SE cs.AI cs.LG

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Zhimin Zhao, Zehao Wang, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan

Affiliations: Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen's University; Concordia University; Lahore University of Management Sciences (LUMS)

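AI summary: An empirical study of 57 evaluation harnesses that derives a five-stage harness model and classifies 16,560 issues. The Specification stage accounts for the most issues (41.4%), and the leading root causes are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%). The study lays an empirical foundation for treating evaluation engineering as a distinct software engineering concern.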

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.

2605.24212 2026-05-26 stat.AP cs.AI cs.LG stat.ML

Distributionally Robust Transfer Learning with Structurally Missing Covariates, with Application to Cross-National Cardiac Arrest Prediction

Siqi Li, Chuan Hong, Ziye Tian, Benjamin Sieu-Hon Leong, Koshi Nakagawa, Hideharu Tanaka, Sang Do Shin, Khuong Quoc Dai, Do Ngoc Son, Marcus Eng Hock Ong, Nan Liu, Molei Liu

Affiliations: Centre for Biomedical Data Science, Duke-NUS Medical School, Singapore; Duke-NUS AI + Medical Sciences Initiative, Duke-NUS Medical School, Singapore; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA; Duke Clinical Research Institute, Durham, NC, USA; Emergency Medicine Department, National University Hospital, Singapore; Department of Sport and Medical Science, Faculty of Physical Education, Kokushikan University, Tokyo, Japan; Graduate School of Emergency Medical System, Kokushikan University, Tokyo, Japan; Department of Emergency Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea; Center for Emergency Medicine, Bach Mai Hospital, Hanoi, Vietnam; Center for Critical Care Medicine, Bach Mai Hospital, Hanoi, Vietnam; Health Services Research Centre, Singapore Health Services, Singapore; Department of Emergency Medicine, Singapore General Hospital, Singapore; Pre-hospital & Emergency Research Centre, Health Services Research and Population Health, Duke-NUS Medical School, Singapore

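AI summary: Proposes the DRUM framework, which uses distributionally robust optimization and a neural network generator to handle covariates that are structurally missing in the target domain, transferring prediction models to an unlabeled target domain. Its effectiveness is validated on cross-national cardiac arrest prediction.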

Abstract

Deploying clinical prediction models across healthcare systems often fails when key training covariates are unavailable at deployment and labeled outcomes are limited in the target domain. For example, high-performing models for out-of-hospital cardiac arrest (OHCA) rely on detailed prehospital measurements routinely collected in high-resource settings but unavailable in many international registries. Existing methods either discard missing covariates, sacrificing predictive information, or rely on untestable assumptions about their target distribution. We propose DRUM (Distributionally Robust Unsupervised transfer learning with structurally Missing covariates), a framework that transfers prediction models to target populations where certain covariates are structurally absent and outcome labels are unavailable. DRUM partitions covariates into shared components ($X$), observed across all settings, and missing components ($A$), observed only in the source. Rather than imputing missing covariates, DRUM optimizes worst-case predictive performance over the unknown target distribution of $A \mid X$ using a neural network generator, with a robustness parameter controlling allowable deviation from the source conditional. We further develop a bias correction procedure that reduces sensitivity to nuisance estimation error. Simulations show substantial improvements in both mean and worst-case prediction error under distribution shift. Applied to cross-national OHCA prediction, transferring models from a US registry to multiple Asian registries where prehospital variables are unrecorded, DRUM yields better-calibrated predictions and improved clinical classification performance across sites.

2605.24207 2026-05-26 cs.DB cs.LG

Incorporating Deep Learning Design in Database Queries

Yuval Lev Lubarsky, Dean Light, Boaz Berger, Shunit Agmon, Benny Kimelfeld

Affiliations: University of Washington

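AI summary: Proposes an approach that naturally integrates deep learning into database queries by associating each tuple with a learnable vector embedding, so that queries operate jointly on data and embeddings, enabling relational deep learning.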

Abstract

Deep learning over relational databases is conventionally realized by translating data into graph representations and applying graph-based neural networks within external frameworks. This round-trip between the database and external machine learning (ML) systems introduces non-trivial engineering overhead. In effect, these graph neural networks operate on tuple embeddings and manipulate them in ways that capture the interactions induced by relational joins. Given this natural correspondence, there is no fundamental reason why specifying a neural network over relational data should be substantially harder than querying it. We propose an approach that naturally integrates deep learning with database queries. The key idea is to associate each tuple with provenance, represented as a vector embedding with learnable parameters. Queries are lifted to operate jointly on data and embeddings, mapping input relations with embedded tuples to output relations with embedded tuples. This approach provides a declarative foundation for relational deep learning, facilitating integration with database systems, optimization, and wide adoption. We describe RelaNN, a proof-of-concept implementation of this approach built on top of PyTorch and cuDF. We illustrate the utility of RelaNN by implementing various graph-learning models, including graph convolutional networks, heterogeneous graph transformers, hypergraph neural networks and deep homomorphism networks. The simplicity of the programs and their competitive runtime performance demonstrate a concrete path toward making the implementation of state-of-the-art neural networks over databases as simple as writing a query.

2605.24183 2026-05-26 cs.DB cs.AI cs.LG

AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

Darek Kleczek, Fuheng Zhao, Alexander W. Lee, Julien Tissier, Pawel Liskowski, Ugur Cetintemel, Anupam Datta

Affiliations: Brown University and Snowflake

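AI summary: Proposes the AvalancheBench benchmark, which evaluates the analytical understanding of enterprise data agents through latent world recovery and reveals how early mistakes propagate into systematically wrong recommendations.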

Abstract

We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.
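The idea of partial credit against a known latent world can be illustrated with a minimal Python sketch. This is not AvalancheBench's scoring code: the `RubricItem` type, the `rubric_score` function, and every item name below are hypothetical, and the sketch assumes an unweighted rubric where each latent fact counts equally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One fact planted in the latent world (hypothetical structure)."""
    kind: str  # e.g. "segment", "driver", "temporal_event", "relationship"
    name: str  # hypothetical identifier for the planted fact


def rubric_score(latent_world: set[RubricItem], recovered: set[RubricItem]) -> float:
    """Fraction of latent-world facts the agent recovered (partial credit)."""
    if not latent_world:
        raise ValueError("latent world must contain at least one rubric item")
    return len(latent_world & recovered) / len(latent_world)


# Hypothetical ground truth used to generate the observed data.
latent_world = {
    RubricItem("segment", "price_sensitive_students"),
    RubricItem("driver", "shipping_cost"),
    RubricItem("temporal_event", "spring_promotion"),
    RubricItem("temporal_event", "supplier_outage"),
}

# Hypothetical agent output: one correct driver plus a generic segment
# that does not correspond to any planted fact.
recovered = {
    RubricItem("driver", "shipping_cost"),
    RubricItem("segment", "all_customers"),
}

score = rubric_score(latent_world, recovered)
print(f"rubric recovered: {score:.0%}")
```

Under these assumptions a generic segmentation earns no credit, which mirrors the failure mode the abstract reports; a real rubric would likely also need fuzzy matching between an agent's wording and the planted facts.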

2605.24180 2026-05-26 physics.soc-ph cs.AI cs.DL cs.HC

Understanding Conversation Patterns in Multi-Agent Programming: A Case Study of Fibonacci Game Development

Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, Miroslaw Staron

Affiliations: Chalmers University of Technology; University of Gothenburg, Gothenburg, Sweden; Research & Development, Volvo Car Corporation, Gothenburg, Sweden

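AI summary: By analyzing conversations between Designer and Programmer agents across 12 combinations of open-source LLMs, the paper identifies three key dimensions of multi-agent interaction: efficiency, consistency, and effectiveness. It finds that the DeepSeek-R1:DeepSeek-R1 pair converges stably to the correct solution from the first iteration, while other combinations diverge or reach consensus on incorrect solutions.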

Comments 10 pages, 7 figures, AIware, FSE 2026

DOI: 10.1145/3805760.3814914

Abstract

Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions, which are essential for future autonomous SE.

2605.24137 2026-05-26 cs.SE cs.AI

Optimal Non-Asymptotic Edgeworth Expansions for Multivariate Neural Network Outputs

Lucia Celli

Affiliations: Department of Mathematics, University of Luxembourg

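AI summary: For the outputs of finite-width fully connected neural networks, the paper uses Edgeworth expansions of arbitrary order to approximate their deviation from the Gaussian limit, and gives upper and lower bounds in total variation distance.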

Comments 34 pages, 2 figures

2605.24069 2026-05-26 cs.CR cs.AI

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin, Biyu Zhou, Wenjie Xiao, Wantao Liu

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

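AI summary: Addresses Tool Description Poisoning (TDP) attacks on LLM agents that integrate external tools through the Model Context Protocol (MCP). The paper proposes the MCP-TDP Security Benchmark with 32 realistic test cases, finds severe vulnerabilities in an evaluation of 8 mainstream LLMs, and proposes a Reactive Self-Correction defense mechanism.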

Abstract

The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.

2605.24067 2026-05-26 physics.ao-ph cs.LG

Seeing Inside the Storm: Improving Nowcasting by Integrating Meteorological Drivers

Minghui Qiu, Jun Chen, Lin Chen, Weifeng Chen, Shuxin Zhong, Zhidan Liu, Yu Zhang, Kaishun Wu

Affiliations: Guangzhou Meteorological Observatory

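AI summary: Proposes the MeteoLogist framework, which integrates thermodynamic, kinematic, and microphysical drivers through Physics-Tailored Encoders, a Temporal-Phase Aligner, and a Cross-Field Spatial Aggregator to model the full storm life cycle, significantly improving nowcasting performance.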

Abstract

Most nowcasting systems, built on radar reflectivity, focus on current precipitation, ignoring the atmospheric precursors -- such as low-level convergence, turbulent eddies, and latent heating -- that offer a fleeting window to foresee storm birth. We introduce MeteoLogist, a physics-inspired radar intelligence framework that models the full life cycle of convection -- from its precursors to organized storm evolution. However, exploiting these precursors is non-trivial: they originate from multiple meteorological drivers -- thermodynamic, kinematic, and microphysical -- that evolve asynchronously (C1) and remain spatially fragmented (C2). To this end, MeteoLogist designs three tightly integrated components. The Physics-Tailored Encoders process radar echoes according to their intrinsic physical scales and semantics, forming thermodynamic, kinematic, and microphysical streams that capture distinct dynamical regimes. The Temporal-Phase Aligner addresses C1 by leveraging causal temporal attention to capture when and how different drivers interact and activate. The Cross-Field Spatial Aggregator addresses C2 through cross-regional fusion, aligning weak and scattered precursors across neighboring cells to expose upstream triggers and enforce spatial coherence. Evaluated on 3D-NEXRAD (2020--2022, US-wide), MeteoLogist boosts high-impact detection (CSI40) by +9.7% over strong baselines, and achieves a remarkable 37.67% gain during the storm-developing stage -- demonstrating true foresight in sensing storms before they appear. The code can be found in the supplementary material.

2605.24050 2026-05-26 cs.SE cs.AI stat.AP

More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

Hongwen Song, Song, Wei

Affiliations: Databricks Inc.

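AI summary: Studies the performance degradation that occurs when LLM agent skill libraries are expanded, decomposes the degradation into two effects (skill shadowing and context overhead), and shows experimentally that skill shadowing is the primary bottleneck.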

Abstract

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow -- by up to 21% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and the full library. Moreover, we propose to decompose the pass rate drop by conditioning on skill invocation -- which skills the agent selects during a trajectory -- into two effects: skill shadowing, where the agent selects wrong skills more often as the library expands, and context overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize the magnitude of their impact on the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the skill shadowing effect grows with library size and significantly contributes to the performance degradation, whereas the context overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that skill selection failure, not the enlarged context, is the primary bottleneck when expanding skill libraries.
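Conditioning on skill selection can be made concrete with a small Python sketch. This is an illustration built on the law of total probability, not the paper's estimator or its upper bounds: the split into `shadowing` and `overhead` terms, the function names, and all probabilities below are hypothetical assumptions chosen only to show that the two terms add up to the total pass rate drop.

```python
import math


def pass_rate(p_correct: float, pass_given_correct: float, pass_given_wrong: float) -> float:
    """Law of total probability over whether the agent selected the right skill."""
    return p_correct * pass_given_correct + (1.0 - p_correct) * pass_given_wrong


def decompose_drop(small: tuple, full: tuple) -> tuple:
    """Split the pass rate drop (small library minus full library) into two terms.

    Each argument is (P(correct selection), P(pass | correct), P(pass | wrong)).
    This particular split is one assumed choice among several valid ones.
    """
    s0, a0, b0 = small
    s1, a1, b1 = full
    # Drop caused by selecting the wrong skill more often.
    shadowing = (s0 - s1) * (a0 - b0)
    # Drop caused by worse execution at a fixed selection outcome.
    overhead = s1 * (a0 - a1) + (1.0 - s1) * (b0 - b1)
    return shadowing, overhead


# Hypothetical numbers, not taken from the paper.
small_library = (0.95, 0.80, 0.30)
full_library = (0.70, 0.79, 0.30)

shadowing, overhead = decompose_drop(small_library, full_library)
total_drop = pass_rate(*small_library) - pass_rate(*full_library)
print(f"total={total_drop:.3f} shadowing={shadowing:.3f} overhead={overhead:.3f}")
```

With these made-up inputs most of the drop comes from the selection term, which is the qualitative pattern the abstract describes; the actual magnitudes depend on the paper's measurements.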

2605.24031 2026-05-26 q-fin.CP cs.LG

Volatility Surface Reconstruction using Deep Learning under No-Arbitrage Constraints

Pablo Rodriguez Manzi

Affiliations: Universidad de Buenos Aires

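AI summary: Studies deep learning models that reconstruct implied volatility surfaces from sparse, noisy option quotes under no-arbitrage constraints, comparing several neural network architectures with the classical SVI parameterization. Transformer and U-Net models reconstruct accurately under sparse observations, and soft arbitrage penalties effectively reduce arbitrage violations.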

Comments MSc thesis, Universidad de Buenos Aires, 2026. 94 pages, 27 figures

2605.24016 2026-05-26 cs.AR cs.AI

SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling

Jeongmin Jin, Kyeongwon Lee, Mundo Jeong, Jongin Choi, Woojoo Lee

Affiliations: National Research Foundation of Korea; Institute of Information & communications Technology Planning & Evaluation

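AI summary: Targeting the computational bottleneck of locally coupled Kuramoto drift in diffusion sampling, the paper proposes SA-Kura, the first dedicated digital systolic-array accelerator. Reformulating the coupling computation enables efficient systolic execution, yielding 193x and 6.57x speedups over software and GPU execution, respectively.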

Comments 8 pages, 6 figures, 1 table; ACM/IEEE ISLPED 2026 accepted paper

Abstract

Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5 x 5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pair-wise sinusoidal coupling into neighbor accumulation independent of the center phase followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel only, compared with software execution of the same kernel on the processor core in the same SoC platform, SA-Kura reduces latency and energy by 193x and 69.4x, respectively. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57x faster and achieves approximately 46.0x lower energy per pixel.
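The reformulation described in the abstract rests on the identity sin(θ_j − θ_i) = sin θ_j cos θ_i − cos θ_j sin θ_i, which lets the neighbor sums be accumulated without knowing the center phase. The NumPy sketch below checks this numerically. It is not the SA-Kura datapath: the uniform stencil weights, the periodic boundary handling via `np.roll`, the absence of a coupling-strength factor, and both function names are assumptions made for illustration.

```python
import numpy as np


def drift_direct(theta: np.ndarray, radius: int = 2) -> np.ndarray:
    """Center-dependent form: sum of sin(theta_j - theta_i) over a (2r+1)x(2r+1) window."""
    out = np.zeros_like(theta)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            neighbor = np.roll(theta, shift=(dy, dx), axis=(0, 1))
            out += np.sin(neighbor - theta)
    return out


def drift_reformulated(theta: np.ndarray, radius: int = 2) -> np.ndarray:
    """Accumulate sin and cos of neighbors first, then combine once per center."""
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    sum_sin = np.zeros_like(theta)
    sum_cos = np.zeros_like(theta)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # These accumulations do not depend on the center phase.
            sum_sin += np.roll(sin_t, shift=(dy, dx), axis=(0, 1))
            sum_cos += np.roll(cos_t, shift=(dy, dx), axis=(0, 1))
    # Single center-dependent multiply-subtract per pixel.
    return cos_t * sum_sin - sin_t * sum_cos


rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, size=(8, 8))
print(np.allclose(drift_direct(theta), drift_reformulated(theta)))
```

In the reformulated version the trigonometric functions are evaluated once per pixel instead of once per neighbor pair, which is presumably what makes a regular accumulate-then-combine dataflow possible; the center pixel contributes zero in both forms, so including it in the window is harmless.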
