arXivDaily arXiv Daily Academic Digest, updated Monday to Friday
2511.17514 2026-06-09 cs.NI cs.AI cs.IT math.IT cross-list

XAI-on-RAN: Explainable, AI-native, and GPU-Accelerated RAN Towards 6G

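Osman Tugay Basaran, Falko Dressler

Affiliations School of Electrical Engineering and Computer Science, Technische Universität Berlin

AI summary To address the opacity of AI decisions in mission-critical 6G scenarios, the paper proposes an explainable AI-native RAN framework, mathematically models the trade-offs among transparency, latency, and GPU utilization, and reports experiments in which the hybrid XAI model xAI-Native outperforms baselines.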

Comments 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: AI and ML for Next-Generation Wireless Communications and Networking (AI4NextG)

Abstract

Artificial intelligence (AI)-native radio access networks (RANs) will serve vertical industries with stringent requirements: smart grids, autonomous vehicles, remote healthcare, industrial automation, etc. To meet these requirements, modern 5G/6G designs increasingly leverage AI for network optimization, but the opacity of AI decisions poses risks in mission-critical domains. These use cases are often delivered via non-public networks (NPNs) or dedicated network slices, where reliability and safety are vital. In this paper, we motivate the need for transparent and trustworthy AI in high-stakes communications (e.g., healthcare, industrial automation, and robotics) by drawing on the 3rd Generation Partnership Project (3GPP) vision for non-public networks. We design a mathematical framework to model the trade-offs between transparency (explanation fidelity and fairness), latency, and graphics processing unit (GPU) utilization in deploying explainable AI (XAI) models. Empirical evaluations demonstrate that our proposed hybrid XAI model, xAI-Native, consistently surpasses conventional baseline models in performance.

2506.22459 2026-06-09 eess.SP cs.LG cs.SY eess.SY cross-list

Physics-Embedded Neural Networks for sEMG-based Continuous Motion Estimation

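Wending Heng, Chaoyuan Liang, Yihui Zhao, Zhiqiang Zhang, Glen Cooper, Zhenhong Li

Affiliations University of Manchester; University of Bristol; University of Leeds

AI summary Proposes a Physics-Embedded Neural Network (PENN) that combines interpretable musculoskeletal forward dynamics with data-driven residual learning to achieve physiologically consistent and accurate continuous motion estimation, outperforming existing methods on RMSE and $R^2$.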

Comments Accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Abstract

Accurately decoding human motion intentions from surface electromyography (sEMG) is essential for myoelectric control and has wide applications in rehabilitation robotics and assistive technologies. However, existing sEMG-based motion estimation methods often rely on subject-specific musculoskeletal (MSK) models that are difficult to calibrate, or purely data-driven models that lack physiological consistency. This paper introduces a novel Physics-Embedded Neural Network (PENN) that combines interpretable MSK forward-dynamics with data-driven residual learning, thereby preserving physiological consistency while achieving accurate motion estimation. The PENN employs a recursive temporal structure to propagate historical estimates and a lightweight convolutional neural network for residual correction, leading to robust and temporally coherent estimations. A two-phase training strategy is designed for PENN. Experimental evaluations on six healthy subjects show that PENN outperforms state-of-the-art baseline methods in both root mean square error (RMSE) and $R^2$ metrics.

2502.09194 2026-06-09 cs.IT cs.AI math.IT cross-list

XAInomaly: Explainable and Interpretable Deep Contractive Autoencoder for O-RAN Traffic Anomaly Detection

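Osman Tugay Basaran, Falko Dressler

Affiliations School of Electrical Engineering and Computer Science, TU Berlin, Germany

AI summary Proposes the XAInomaly framework, which uses a semi-supervised deep contractive autoencoder to learn robust representations of normal network behavior and introduces the fastshap-C explainable AI technique, enabling accurate, scalable, and interpretable anomaly detection in O-RAN.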

Comments 22 pages, 9 Figures, Submitted to Journal (First revision completed)

Abstract

Generative Artificial Intelligence (AI) techniques have become integral to advancing next-generation wireless communication systems by enabling sophisticated data modeling and feature extraction for enhanced network performance. In the realm of open radio access networks (O-RAN), characterized by their disaggregated architecture and heterogeneous components from multiple vendors, the deployment of generative models offers significant advantages for network management tasks such as traffic analysis, traffic forecasting, and anomaly detection. However, the complex and dynamic nature of O-RAN introduces challenges that necessitate not only accurate detection mechanisms but also reduced complexity, scalability, and, most importantly, interpretability to facilitate effective network management. In this study, we introduce the XAInomaly framework, an explainable and interpretable Semi-supervised (SS) Deep Contractive Autoencoder (DeepCAE) design for anomaly detection in O-RAN. Our approach leverages the generative modeling capabilities of our SS-DeepCAE model to learn compressed, robust representations of normal network behavior that capture essential features, enabling the identification of deviations indicative of anomalies. To address the black-box nature of deep learning models, we propose a reactive Explainable AI (XAI) technique called fastshap-C.

2606.07235 2026-06-09 cs.IR cs.LG version update

FLOWREADER: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A

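Ambuj Mehrish, Sebastiano Vascon

Affiliations Ca’ Foscari University of Venice

AI summary Proposes FLOWREADER, which models evidence assembly in long multimodal documents as a min-cost flow problem, with a unified scoring vector controlling source selection, sink selection, and edge costs; it outperforms top-k retrieval when evidence is fragmented.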

Abstract

Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides: broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-$k$ chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FLOWREADER, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector $h$ controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths, a compact non-redundant subset is selected by entropy-regularized replicator dynamics, and parallel VLM workers under a dual-process gate produce the answer, with a single System-2 refinement pass triggered when answer consistency is low or the routed flow is strained. On VisDoMBench, FLOWREADER is best on the two subsets dominated by fragmented evidence, PaperTab ($58.40$, $+1.30$ over G$^{2}$-Reader) and SlideVQA ($72.93$, $+0.62$), and competitive on SPIQA, FetaTab, and SciGraphQA. Macro-averaged across all five subsets, FLOWREADER ($65.47$) is within $0.74$ of the strongest baseline (G$^{2}$-Reader, $66.21$). Overall, these results show that min-cost flow performs well on fragmented multimodal evidence, where top-$k$ retrieval fails. It also provides a unified way to control scoring, routing, selection, and adaptive compute together.
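
The abstract names MMR (maximal marginal relevance) as the rule behind source selection. As a rough illustration of that general technique only, and not of FLOWREADER's scoring vector $h$ or its flow formulation, the sketch below greedily picks items that are relevant to a query but dissimilar to items already chosen. The function names, the use of cosine similarity, and the trade-off weight `lam` are assumptions made for this example.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mmr_select(query, candidates, k, lam=0.7):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lam is a hypothetical trade-off weight: 1.0 ranks purely by relevance,
    smaller values penalize similarity to already selected candidates.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small `lam`, a near-duplicate of an already selected item loses to a less relevant but more distinct item; with `lam=1.0` the rule reduces to plain top-$k$ by relevance.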

2606.06497 2026-06-09 cs.GR cs.CV cs.HC version update

Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

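Adam Cole, Rebecca Fiebrink, Mick Grierson

Affiliations Creative Computing Institute, University of the Arts London

AI summary Proposes the Real-Time AttentionBender tool, which manipulates the self-attention, cross-attention, and feed-forward networks of video diffusion transformers to give per-layer, per-step, and per-token interactive control over generation, strengthening artists' creative agency and material intimacy with the model.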

Comments 5 pages, 4 figures. Accepted to ACM Creativity & Cognition XAIxArts Workshop 2026

Abstract

Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.

2606.05363 2026-06-09 cs.GT cs.LG econ.TH math.OC version update

Should Demand Models Incorporate Competitor Prices? Oblivious Learning and Algorithmic Collusion

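Yuhang Wu, Assaf Zeevi

Affiliations University of California, Berkeley; University of Washington

AI summary Studies whether pricing algorithms in competitive markets should explicitly model competitors' prices; comparing oblivious and informed learning strategies, it finds that the informed strategy is a Nash equilibrium with prices converging to the competitive outcome, while collusive patterns are not robust.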

Comments Preliminary version "Oblivious Learning, Price Exploration and Collusive Dynamics" accepted at EC 2026

Abstract

On a platform with many sellers, should a pricing algorithm explicitly model competitors' prices when learning demand? Classical learning arguments suggest an affirmative answer: ignoring competitors induces model misspecification and inefficiency. In contrast, recent work on algorithmic collusion suggests that strategic obliviousness -- deliberately ignoring competitor prices -- may facilitate collusive outcomes and improve profits. We study this modeling choice in a stylized competitive market with unknown noisy demand, in which multiple sellers repeatedly set prices and estimate demand via iterated least squares, and either incorporate competitors' prices into their demand models (informed) or ignore them (oblivious). We first show that, relative to a monopolist, an oblivious seller in a competitive market must explore more aggressively to compensate for the loss of dynamic competitor information. Building on this insight, we characterize market dynamics when all sellers are oblivious and show that prices converge to the competitive outcome under sufficient exploration, while a continuum of pseudo-equilibria arises when exploration decays. Analyzing the resulting price trajectories, we uncover an excursion phenomenon that gives rise to transient collusive patterns that dissipate as learning progresses. In markets with both oblivious and informed sellers, the informed strictly out-earn the oblivious. Read as a strategy game, the modeling choice has a unique Nash equilibrium: the all-informed market, in which prices converge to the competitive outcome efficiently. Overall, our results indicate that collusive patterns are not robust and are not sustained by oblivious modeling; therefore, incorporating competitor information, together with sufficient price exploration, remains a reliable strategy for sellers in competitive markets.

2606.04581 2026-06-09 cs.DC cs.AI cs.NI version update

Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

Haotian Zheng, Zhanwei Wang, Mingyao Cui, Chang Cai, Hongyang Du, Kaibin Huang

Affiliations Department of Electrical and Computer Engineering, The University of Hong Kong

AI summary Proposes the multi-access speculative inference (Multi-SPIN) architecture, which jointly optimizes draft-length control and bandwidth allocation to maximize sum token goodput in heterogeneous edge systems.

Abstract

Speculative inference (SPIN) was originally developed as an efficient architecture to accelerate Large Language Models (LLMs). In this work, we propose its distributed deployment to enable cooperative token generation in a multiuser edge system; its advantage is to effectively balance computational loads between resource-constrained devices and servers. The resulting architecture, termed Multi-access SPIN (Multi-SPIN), utilizes on-device small language models to generate and upload candidate token drafts, while an edge server operates the LLM to verify them in parallel batches. Given the severe heterogeneity in users' computation and communication capabilities, the draft length emerges as a critical control variable that influences node-level computation loads and multi-access latency, thereby governing the sum token goodput. Consequently, considering frequency-division multiple access, we investigate the problem of multi-access draft control, a joint optimization of draft-length control and bandwidth allocation to maximize sum token goodput. We examine two cases: (1) homogeneous draft lengths across users to facilitate server-side batching, and (2) heterogeneous draft lengths to introduce a new dimension for goodput enhancement. By developing decomposition methods, we reduce these complex optimizations into tractable sub-problems, which allow efficient draft control algorithms to be derived in closed form. Our analysis shows that the optimal bandwidth allocation compensates users with weaker computation-and-communication capabilities in the homogeneous case due to the batching synchronization requirements, whereas its heterogeneous-case counterpart rewards users with higher acceptance rates by relaxing such requirements. Experiments using Llama-2 and Qwen3.5 model pairs across diverse tasks demonstrate that Multi-SPIN improves goodput by up to 88% over heterogeneity-agnostic baselines.
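
A minimal sketch of why draft length acts as a control variable, under assumptions that are not taken from the paper: each drafted token is accepted independently with probability `alpha`, the verifier contributes one extra token per round (the usual speculative-decoding estimate), and a round costs a per-token drafting and upload time plus a fixed verification time. All function names and timing parameters are hypothetical, and the single-user model ignores the batching and bandwidth coupling that the abstract describes.

```python
def expected_tokens(alpha, draft_len):
    """Expected tokens produced per verification round.

    Assumes i.i.d. acceptance with probability alpha and one bonus token
    from the verifier: (1 - alpha^(L+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return draft_len + 1.0
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)

def goodput(alpha, draft_len, t_draft, t_upload, t_verify):
    """Tokens per unit time for one user under the simplified round-time model."""
    round_time = draft_len * (t_draft + t_upload) + t_verify
    return expected_tokens(alpha, draft_len) / round_time

def best_draft_len(alpha, t_draft, t_upload, t_verify, max_len=32):
    """Brute-force search over draft lengths 1..max_len for the highest goodput."""
    return max(
        range(1, max_len + 1),
        key=lambda n: goodput(alpha, n, t_draft, t_upload, t_verify),
    )
```

Under this toy model, a user with a higher acceptance rate benefits from longer drafts, while slow drafting or uploading pushes the best draft length down; this is the kind of per-user trade-off that a joint draft-length and bandwidth optimization has to balance.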

2606.04227 2026-06-09 cs.DS cs.AI version update

Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry

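Jason L. Volk

Affiliations Invariant Research

AI summary For the sheaf cohomology $H^1$ of dynamically evolving 1-dimensional cellular complexes, proposes an incremental maintenance algorithm that processes each edit in O(1) time under a bounded local geometry assumption, with correctness ensured at synchronization points.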

Comments 2 figures, 2 tables, 1 algorithm; code at https://github.com/Jasonleonardvolk/sigma

Abstract

We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-dimensional cellular complexes equipped with finite-dimensional cellular sheaves. The classical computation of $H^1$ via factorization of the coboundary matrix requires $O(n^3)$ time; when the complex evolves with a stream of $m$ edits, full recomputation after each edit costs $O(mn^3)$. Under a bounded local geometry assumption -- bounded cell size $v_{\max}$, bounded stalk dimension $d$, and bounded nerve degree $D$ -- each edit (vertex insertion, edge insertion, restriction map update) affects only a bounded set of local coboundary blocks. The algorithm therefore processes lazy streaming edits in $O(1)$ time with respect to the total complex size $n$ (with cost polynomial in the local geometry parameters $v_{\max}$, $d$, and $D$, which are treated as constants independent of $n$), deferring local eigensolves and Mayer-Vietoris global assembly to synchronization points (Flush). At synchronization, the maintained state agrees with the corresponding batch assembly of the partitioned sheaf model; we observe zero measured drift in all batch-verified runs (through $V = 10^6$). We also give an amortized $O(|E|)$ streaming construction for the cellular decomposition and discuss an adversarial algebraic-RAM barrier arguing that unpartitioned non-trivial sheaves ($d \geq 2$, non-identity restriction maps) do not admit the same locality. Experiments on Barabási-Albert graphs with up to $5 \times 10^6$ vertices and $1.7 \times 10^7$ streaming edits show 35 $\mu$s median lazy per-edit update latency (excluding flush); query time (global assembly at synchronization) is $O(n)$ per flush in the implemented full-traversal path. Exact synchronization costs are reported separately.
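
For intuition only, the sketch below handles the simplest special case, which is explicitly not the paper's algorithm: a constant sheaf with identity restriction maps on a graph, where $\dim H^1 = d \cdot (|E| - |V| + c)$ and $c$ is the number of connected components. In that case each vertex or edge insertion changes only a few counters, which a union-find structure can maintain. The class name and interface are hypothetical; restriction-map updates, local eigensolves, and Mayer-Vietoris assembly from the abstract are out of scope here.

```python
class IncrementalH1:
    """Tracks dim H^1 of a growing graph with a constant sheaf (identity maps)."""

    def __init__(self, stalk_dim=1):
        self.stalk_dim = stalk_dim
        self.parent = {}          # union-find forest over vertices
        self.num_edges = 0
        self.num_components = 0

    def _find(self, v):
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited vertex directly at the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_vertex(self, v):
        if v not in self.parent:
            self.parent[v] = v
            self.num_components += 1

    def add_edge(self, u, v):
        self.add_vertex(u)
        self.add_vertex(v)
        self.num_edges += 1
        root_u, root_v = self._find(u), self._find(v)
        if root_u != root_v:
            # The edge joins two components, so no new cycle is created.
            self.parent[root_u] = root_v
            self.num_components -= 1

    def h1_dim(self):
        cycles = self.num_edges - len(self.parent) + self.num_components
        return self.stalk_dim * cycles
```

An edge that joins two components leaves the count unchanged, while an edge inside one component adds `stalk_dim` to it. The abstract argues that sheaves with non-identity restriction maps need the partitioned model and deferred assembly rather than counters like these.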

2606.01567 2026-06-09 cs.CR cs.AI cs.CL version update

Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

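Yoshinari Fujinuma, Varun Gangal, Traian Rebedea, Makesh Narsimhan Sreedhar, Prasoon Varshney, Rebecca Qian, Anand Kannappan

Affiliations Patronus AI; NVIDIA

AI summary Studies the security threats that LLM-based agents face when reusing skills, proposes guardian defenses (dynamic and static) that cut the attack success rate by more than half, and tests robustness against attack reframing.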

Comments First version, small updates and clarifications likely in v2

Abstract

Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this introduces a new attack surface for agents to manage. We study two complementary directions for this threat. First, we evaluate guardian-based defenses: an intermediary LLM agent that acts as a mediator for skill file access (dynamic guardian) or pre-rewrites these files at build time (static guardian). Across three LLM agent families, our guardians cut attack success rate (ASR) by well over half while preserving task utility. Second, we stress-test them through attack reframing, using four attacks that preserve the malicious instruction but change the phrasing. In the non-guardian setup, reframing pushes the ASR up to 81.4%, but the dynamic guardian brings it down to 18.6%, showing that real-time mediation is a robust defense.

2606.01342 2026-06-09 cs.DS cs.LG version update

Towards Optimal Robustness in Learning-Augmented Paging

Peng Chen, Hailiang Zhao, Xueyan Tang, Yixuan Wang, Shuiguang Deng

Affiliations Zhejiang University, Hangzhou, China; Nanjing University of Aeronautics; Nanyang Technological University, Singapore

AI summary Proposes a new framework that, through the relative prediction budget primitive, achieves the optimal robustness bound $H_k + O(1)$ in learning-augmented paging, and validates its practical performance experimentally.

Comments ICML 2026

Abstract

Learning-augmented paging has been extensively studied in recent years. A key advantage over naive ML-based approaches is bounded robustness, which guarantees worst-case performance even when predictions are inaccurate, making these algorithms valuable for real-world systems. Prior work achieves robustness bounds of $2H_k + O(1)$ in the randomized setting, leaving a gap to the optimal competitive ratio $H_k$. In this paper, we study how to close this gap. We begin by reviewing online optimality and proving a new property of the latest $H_k$-competitive algorithm, which facilitates our analysis in the learning-augmented setting. Then, we review existing learning-augmented paging algorithms and introduce a unifying primitive, the relative prediction budget, which captures the essence of establishing robustness and reveals that prior algorithms either overuse or underutilize predictions. Guided by the above analysis, we develop a new framework that achieves the best-possible robustness up to an additive constant for learning-augmented paging: $H_k + O(1)$. Experiments further demonstrate strong practical performance.
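
To make the gap concrete: $H_k = \sum_{i=1}^{k} 1/i$ is the $k$-th harmonic number, so moving from $2H_k + O(1)$ to $H_k + O(1)$ halves the leading term of the robustness bound. The short sketch below computes both leading terms for a given cache size $k$. It ignores the additive $O(1)$ constants, which the abstract does not specify, and the function names are illustrative.

```python
from fractions import Fraction

def harmonic(k):
    """Exact k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(Fraction(1, i) for i in range(1, k + 1))

def leading_terms(k):
    """Return (H_k, 2*H_k) as floats, dropping the additive O(1) constants."""
    h = harmonic(k)
    return float(h), float(2 * h)
```

Because $H_k$ grows like $\ln k$, the absolute difference between the two leading terms widens slowly as the cache size increases.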

2606.00419 2026-06-09 stat.ML cs.LG version update

Parameter-Free and Group Conditional Online Conformal Prediction

Beepul Bharti, Ambar Pal, Jacopo Teneggi, Jeremias Sulam

Affiliations Data Science and AI Institute (DSAI), Johns Hopkins University; Mathematical Institute for Data Science (MINDS), Johns Hopkins University; Department of Biomedical Engineering, Johns Hopkins University; Department of Computer Science, Johns Hopkins University; Amazon Responsible AI

AI summary Proposes a parameter-free algorithm for group-conditional online conformal prediction that guarantees group-conditional coverage without tuning, and validates its effectiveness and reliability on synthetic and real-world data.

Abstract

Uncertainty quantification (UQ) is critical for the deployment of machine learning predictors in real-world scenarios where the data distribution may shift over time (i.e., data may not be exchangeable). Online conformal prediction (OCP) methods address this issue at the expense of either (i) group-wise error control or (ii) learning-rate independent implementation. Group-conditional coverage is essential for fairness across different collections of data points and for providing finer UQ guarantees. Parameter-free optimization is crucial for robustness to adversarial and unknown data shifts. We propose a parameter-free algorithm for group-conditional OCP and demonstrate that it achieves the best group-conditional coverage guarantees. We evaluate our algorithm on synthetic and real-world data, demonstrating that our method not only improves the reliability of existing parameter-free OCP methods but also provides prediction intervals that are comparable in size to well-tuned group-conditional approaches. By unifying group-conditional coverage with parameter-free online algorithms, our work lays a foundation for fair and robust uncertainty quantification in shifting environments.
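
As background for the learning-rate dependence mentioned above, here is a minimal sketch of a common online threshold update used in online conformal prediction: the threshold on the nonconformity score rises after a miss and falls slightly after a hit. This is the kind of step-size-dependent baseline the abstract contrasts with, not the proposed parameter-free, group-conditional algorithm. The function name and the default values of `alpha`, `eta`, and `q0` are assumptions for illustration.

```python
def online_quantile_tracking(scores, alpha=0.1, eta=0.05, q0=0.0):
    """Track a score threshold q online, targeting miscoverage rate alpha.

    scores: nonconformity scores revealed one at a time.
    eta: the step size (learning rate) that a parameter-free method avoids tuning.
    Returns the threshold used at each step and the 0/1 miscoverage indicators.
    """
    q = q0
    thresholds, errors = [], []
    for s in scores:
        thresholds.append(q)
        err = 1 if s > q else 0       # miss: the score exceeds the current threshold
        errors.append(err)
        q += eta * (err - alpha)      # raise q after a miss, lower it after a hit
    return thresholds, errors
```

For bounded scores the long-run miss rate approaches `alpha` even when the score distribution shifts, but how quickly the threshold adapts, and how much it oscillates, depends on `eta`.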

2605.28510 2026-06-09 cs.SE cs.AI cs.IR version update

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

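Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli

Affiliations University of Bologna

AI summary Proposes HYBRIDSOURCETRACKER, a hybrid two-stage provenance-tracking pipeline that combines vector search with fingerprint matching for efficient, scalable provenance tracking of LLM-generated code.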

Abstract

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, yet detection requires comparing fragments of code to the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
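
For readers unfamiliar with Winnowing, the re-ranking stage can be pictured with the following minimal sketch of the classic fingerprinting scheme: hash every $k$-gram of tokens, slide a window of $w$ hashes, and keep the minimum hash of each window. The tokenization, the CRC32 hash, the default values of `k` and `w`, and the `containment` score are assumptions for this example; the abstract does not describe HST's actual configuration.

```python
import zlib

def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]

def winnow(tokens, k=5, w=4):
    """Return the set of (hash, position) fingerprints chosen by Winnowing."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    fingerprints = set()
    num_windows = max(len(hashes) - w + 1, 1)
    for start in range(num_windows):
        window = hashes[start:start + w]
        smallest = min(window)
        # Break ties by taking the rightmost minimum in the window.
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints

def containment(query_tokens, doc_tokens, k=5, w=4):
    """Fraction of the query's fingerprint hashes that also occur in the document."""
    query = {h for h, _ in winnow(query_tokens, k, w)}
    doc = {h for h, _ in winnow(doc_tokens, k, w)}
    return len(query & doc) / len(query) if query else 0.0
```

Any shared run of at least `w + k - 1` tokens yields at least one common fingerprint, which is what makes the scheme reliable for verbatim copies. Scoring one query this way against every document in a corpus is the linear-time cost that a vector-search first stage is meant to avoid.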

2605.27852 2026-06-09 cs.GR cs.CV version update

ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang, Chengrui Wu, Xudong Xu, Xingang Pan

Affiliations S-Lab, Nanyang Technological University, Singapore; Feeling AI; University of Oxford; Nanyang Technological University; Shanghai AI Laboratory

AI summary Proposes ClothTransformer, which reformulates cloth simulation as autoregressive sequence modeling in a latent space, uses a unified Transformer architecture across diverse scenarios, achieves roughly 4 to 9 times lower error than existing methods, and introduces a scalable latent-space formulation and a penetration-free dataset.

Abstract

Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios -- body-driven garments, robotic manipulation, and free-fall collisions -- under a single model and achieves approximately $4$--$9{\times}$ lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ${\sim}$493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts. Project Page: https://yucrazing.github.io/clothtransformer/

2605.27441 2026-06-09 cs.IR cs.LG version update

A Unified Structured Query Understanding Framework for Industrial Semantic Search

Ping Liu, Qianqi Shen, Jianqiang Shen, Chunnan Yao, Kevin Kao, Rajat Arora, Dan Xu, Baofen Zheng, Yunxiang Ren, Benjamin Le, Ali Hooshmand, Igor Lapchuk, Juan Bottaro, Raghavan Muthuregunathan, Caleb Johnson, Liangjie Hong, Jingwei Wu, Wenjing Zhang

Affiliations LinkedIn Corporation

AI summary Proposes a unified structured query understanding system that consolidates multiple heterogeneous functions into a single small language model (SLM) and introduces the Query Illuminator framework for auto-annotation and evaluation, validated on LinkedIn's Job Search and People Search.

Comments Accepted by KDD-ADS 2026

Abstract

Query understanding in large-scale industrial search systems is typically implemented as a cascade of disparate, task-specific components. While individually optimizable, this fragmented architecture incurs high maintenance overhead and results in inconsistent behaviors, particularly for long-tail queries. In this work, we propose and deploy a unified structured query understanding system that consolidates these heterogeneous functions into a single Small Language Model (SLM) that performs schema-constrained generation. To address the data bottlenecks inherent in unified modeling, we introduce Query Illuminator, a dual-purpose framework serving as: (i) a teacher model for high-quality auto-annotation and distillation, and (ii) a surrogate judge for scalable evaluation where human labels are scarce. We validate this approach through extensive offline and online tests within LinkedIn's Job Search system. Furthermore, we demonstrate the framework's horizontal extensibility through a cross-domain case study on People Search. The results show improved user engagement and reduced operational costs, achieved while satisfying strict low-latency serving constraints on limited GPU resources.

2605.27410 2026-06-09 quant-ph cs.LG cs.NE 版本更新

Zero-shot Quantum Neural Architecture Search

零样本量子神经架构搜索

Tung Dao, Son N. Tran, Huynh Thi Thanh Binh

发表机构 * Hanoi University of Science and Technology(河内科技大学) Deakin University(迪肯大学)

AI总结 针对变分量子算法中电路架构设计的高计算成本问题,基于量子神经正切核的Gram矩阵收敛性,提出零样本代理模型和MCTS框架MZeQAS,无需完整训练即可高效搜索高性能架构。

详情
AI中文摘要

变分量子算法是利用近期量子硬件的主要方法,通过参数化量子电路和经典优化来获得优势。尽管前景广阔,但VQA的实际部署受到设计平衡表达性、可训练性和硬件约束的量子电路架构的挑战。现有的基于进化的量子神经架构搜索方法解决了这些挑战,但由于候选电路的重复训练而导致高计算成本。在这项工作中,我们确定了量子神经正切核的Gram矩阵收敛的设置。基于这一观察,我们设计了一个零样本代理模型来估计候选性能而无需完整训练,显著加速了架构搜索过程。利用该代理,我们提出了MZeQAS,一种基于蒙特卡洛树搜索的零样本量子神经架构搜索框架,用于VQA。通过将基于代理的性能估计与MCTS探索相结合,MZeQAS高效地发现了高性能架构。实验结果表明,MZeQAS在搜索效率和解决方案质量方面均优于现有方法,为在噪声中等规模量子设备上推进VQA部署提供了一个可扩展且有效的框架。

英文摘要

Variational Quantum Algorithms (VQAs) are a leading approach to exploiting near-term quantum hardware, leveraging parameterized quantum circuits and classical optimization to achieve advantage. Despite their promise, the practical deployment of VQAs is challenged by the difficulty of designing quantum circuit architectures that balance expressivity, trainability, and hardware constraints. Existing evolutionary-based quantum neural architecture search methods address these challenges but suffer from high computational costs due to repeated training of candidate circuits. In this work, we identify a setting in which the Gram matrix of the Quantum Neural Tangent Kernel converges. Building on this observation, we design a zero-shot surrogate model to estimate candidate performance without full training, significantly accelerating the architecture search process. Using this surrogate, we propose MZeQAS, a Monte Carlo Tree Search (MCTS)-based Zero-Shot Quantum Neural Architecture Search framework for VQAs. By integrating proxy-based performance estimation with MCTS exploration, MZeQAS efficiently discovers high-performing architectures. Experimental results demonstrate that MZeQAS outperforms existing approaches in terms of both search efficiency and solution quality, providing a scalable and effective framework for advancing VQA deployment on noisy intermediate-scale quantum devices.

2605.26703 2026-06-09 econ.TH cs.GT cs.LG stat.ML 版本更新

Proper Calibeating

Proper Calibeating

Dean P. Foster, Sergiu Hart

发表机构 * Department of Statistics, Wharton, University of Pennsylvania, Philadelphia, and Amazon, New York(宾夕法尼亚大学沃顿商学院统计系,费城;亚马逊公司,纽约) Institute of Mathematics, Department of Economics, and Federmann Center for the Study of Rationality, The Hebrew University of Jerusalem(耶路撒冷希伯来大学数学研究所、经济系及费德曼理性研究中心)

AI总结 本文将经典校准预测和calibeating概念扩展到恰当(proper)评分规则,定义proper-calibration和proper-calibeating,证明校准蕴含proper-calibration而calibeating不一定蕴含proper-calibeating,展示如何保证proper-calibeating和proper-multicalibeating,并证明在不确定性决策中对预测做最佳回应时,proper-calibration与通用无遗憾等价。

Comments v2: Updated section 6 "Decision Making Under Uncertainty"

详情
AI中文摘要

经典概念“校准预测”及其更近期的改进“calibeating”是相对于标准二次评分规则定义的。我们将这些概念扩展到$\textit{恰当}$(proper)评分规则类(对这类规则而言,最佳预测就是真实分布),并通过要求误差在所有有界恰当评分规则上一致收敛到零来定义$\textit{proper-calibration}$和$\textit{proper-calibeating}$。我们首先证明校准总是蕴含proper-calibration,而calibeating不一定蕴含proper-calibeating。其次,我们展示如何保证proper-calibeating和proper-multicalibeating。最后,我们证明了在不确定性决策中对预测进行最佳回应时,proper-calibration与通用无遗憾之间的等价性。

英文摘要

The classic concept of "calibrated forecasts" and its more recent refinement, "calibeating," are defined with respect to the standard quadratic scoring rule. We extend these notions to the class of $\textit{proper}$ scoring rules (for which the best forecast is the true distribution) and define $\textit{proper-calibration}$ and $\textit{proper-calibeating}$ by requiring the errors to converge to zero uniformly over all bounded proper scoring rules. We first establish that calibration always implies proper-calibration, whereas calibeating need not imply proper-calibeating. Second, we show how to guarantee proper-calibeating and proper-multicalibeating. Finally, we demonstrate the equivalence between proper-calibration and universal no regret when best replying to forecasts in decision-making under uncertainty.
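As a small illustration of what makes a scoring rule proper, the sketch below checks numerically, for a binary outcome, that the expected quadratic (Brier) loss and the expected log loss are both minimized when the forecast equals the true probability. The function names, the grid search, and the value `p_true = 0.3` are illustrative assumptions; nothing here is taken from the paper beyond the definition quoted in the abstract.

```python
import math

def expected_quadratic_loss(q: float, p: float) -> float:
    """Expected Brier loss of forecast q when the outcome is 1 with probability p."""
    return p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2

def expected_log_loss(q: float, p: float) -> float:
    """Expected negative log-likelihood of forecast q under true probability p."""
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

def best_forecast(loss, p: float, grid_size: int = 999) -> float:
    """Grid-search the forecast in (0, 1) that minimizes the expected loss."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(grid, key=lambda q: loss(q, p))

p_true = 0.3  # assumed true probability, for illustration only
print(best_forecast(expected_quadratic_loss, p_true))
print(best_forecast(expected_log_loss, p_true))
```

Both searches land on the grid point equal to `p_true`, which is the defining property of a proper rule. The paper's notions additionally require errors to vanish uniformly over all bounded proper rules, which this two-rule check does not attempt to capture.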

2605.30123 2026-06-09 cs.CR cs.LG 版本更新

Privacy-Enhanced Zero-Order Federated Learning via xMK-CKKS over Wireless Channels

基于xMK-CKKS的无线信道隐私增强零阶联邦学习

Anthony Ayli, Khalil Harris, Jihad Fahs, Mohamad Assaad

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对无线联邦学习中单密钥同态加密的客户端安全漏洞,提出一种无需信道估计的四阶段协议,利用xMK-CKKS多密钥同态加密实现安全聚合,并集成零阶优化,在保证收敛率的同时降低通信开销。

Comments 12 pages, 3 figures

详情
AI中文摘要

同态加密(HE)允许服务器在不解密的情况下对加密数据进行运算,从而在联邦学习(FL)中实现隐私保护聚合。现有的空中(OTA)同态加密方法主要依赖单密钥HE方案,并需要信道估计或预均衡来补偿无线衰落。然而,单密钥HE仍然容易受到持有共享密钥的诚实但好奇(HBC)客户端的攻击,而多密钥HE通过为每个设备分配各自的密钥来提供更强的客户端级安全性。我们提出了一种四阶段协议,使xMK-CKKS能够在共享无线信道上进行聚合,而无需信道估计。该协议通过同一信道实现重传部分公钥和密文,使占主导地位的大模数加密项在解密过程中代数相消。我们将该协议与零阶FL相结合,应用于缓慢变化的视距(LoS)主导信道,其中每个设备每轮仅传输一个加密标量,通信/加密开销与模型维度无关。我们证明,由加密和无线聚合引入的残余噪声保持了标准收敛速率\(O(1/\sqrt{K})\)(其中$K$为通信轮数),仅附加一个可忽略的噪声基底。该协议假设服务器不可信,并能抵御HBC客户端,防止任何客户端恢复其他参与者的本地更新。MNIST上的数值结果验证了理论分析。

英文摘要

Homomorphic encryption (HE) enables privacy-preserving aggregation in federated learning (FL) by allowing the server to operate on encrypted data without decryption. Existing HE-over-the-air (OTA) methods mainly rely on single-key HE schemes and require channel estimation or pre-equalization to compensate for wireless fading. However, single-key HE remains vulnerable to honest-but-curious (HBC) clients holding the shared secret key, while multi-key HE provides stronger client-level security by assigning each device its own secret key. We propose a four-phase protocol that enables the aggregation of xMK-CKKS over a shared wireless channel without channel estimation. The protocol retransmits partial public keys and ciphertexts through the same channel realization, so that the dominant large-modulus encryption terms cancel algebraically during decryption. We integrate this protocol with zero-order FL over slowly varying LoS-dominant channels, where each device transmits a single encrypted scalar per round and the communication/encryption overhead is independent of the model dimension. We show that the residual noise induced by encryption and wireless aggregation preserves the standard convergence rate \(O(1/\sqrt{K})\) up to a negligible noise floor, where $K$ is the number of communication rounds. The protocol assumes a non-trusted server and is secure against HBC clients, preventing any client from recovering the local updates of other participants. Numerical results on MNIST validate the theoretical analysis.
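The claim that each device sends a single scalar per round, regardless of model dimension, can be illustrated with a plain zero-order (derivative-free) update. The sketch below is an assumption-laden toy: it uses a two-point estimator along a shared random unit direction, hypothetical quadratic local losses, and plaintext averaging. It contains no encryption, no wireless channel, and is not the paper's four-phase protocol; all function names are made up for illustration.

```python
import math
import random

def unit_direction(dim: int, rng: random.Random) -> list[float]:
    """Random direction on the unit sphere (shared by all devices via a common seed)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def local_loss(theta: list[float], center: list[float]) -> float:
    """Hypothetical local objective: 0.5 * ||theta - center||^2."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, center))

def device_scalar(theta, center, u, mu: float = 1e-3) -> float:
    """Two-point zero-order estimate of the directional derivative along u.
    This one float is all a device would need to transmit in a round."""
    plus = [t + mu * x for t, x in zip(theta, u)]
    minus = [t - mu * x for t, x in zip(theta, u)]
    return (local_loss(plus, center) - local_loss(minus, center)) / (2.0 * mu)

def run(centers, rounds: int = 200, lr: float = 0.5, seed: int = 0):
    """Plaintext stand-in for the aggregation loop; returns final model and loss history."""
    dim = len(centers[0])
    rng = random.Random(seed)
    theta = [0.0] * dim
    history = []
    for _ in range(rounds):
        u = unit_direction(dim, rng)
        scalars = [device_scalar(theta, c, u) for c in centers]  # one scalar per device
        g = sum(scalars) / len(scalars)                          # server-side average
        theta = [t - lr * g * x for t, x in zip(theta, u)]       # step along shared direction
        history.append(sum(local_loss(theta, c) for c in centers) / len(centers))
    return theta, history

centers = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]  # made-up device optima
theta, history = run(centers)
print(history[0], history[-1])
```

Because every device can regenerate the direction `u` from the shared seed, the uplink payload stays one number per round whether the model has three parameters or three million.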

2605.09823 2026-06-09 cs.MA cs.AI 版本更新

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

CalBench: 评估多智能体大语言模型中的协调-隐私权衡

Chelsea Zou, Yiheng Yao, Selena She, Noah Goodman, Robert D. Hawkins

发表机构 * Stanford University(斯坦福大学)

AI总结 提出CalBench基准,用于在私有信息下评估多智能体日程协调中任务完成、成本、通信、公平性和隐私泄露的权衡。

详情
AI中文摘要

个人AI助手开始作为代表行事,能够访问日历、收件箱和用户偏好。日程安排使信任问题具体化:助手必须与其他助手协调,同时决定透露关于其所代表的人的哪些信息。我们引入了CalBench,一个用于在私有信息下进行多智能体日程安排的可控基准。在每个任务中,$N$个智能体管理各自的私有日历,并安排$M$个传入会议流,同时最小化干扰成本。由于没有智能体可以检查另一个智能体的日历,成功需要语言介导的协调而非集中规划。CalBench生成可解场景,配备CP-SAT oracle解决方案和去中心化的非LLM参考协议,能够在匹配信息约束下评估任务成功、额外成本、通信效率、负担公平性和隐私泄露。在七个模型系列中,我们发现仅完成度会遗漏重要失败:智能体留下可避免的成本,通信量不能预测更低的遗憾,保护隐私的沉默可能剥夺队友公平负担分配所需的成本信息。CalBench提供了一个可重复的测试平台,用于研究自主助手在规模化部署前能否代表用户进行协调。

英文摘要

Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.

2605.25085 2026-06-09 cs.IT cs.AI cs.LG math.IT 版本更新

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

自回归语言模型中的多项式上下文截断敏感性:KV缓存压缩的序列Wyner-Ziv界

Munsik Kim

发表机构 * Independent Researcher(独立研究者)

AI总结 研究自回归语言模型中在线KV缓存压缩的率失真极限,将其建模为序列Wyner-Ziv信源编码,发现下一词分布对上下文截断的敏感性呈多项式衰减,并推导了仅后缀缓存策略的每词内存需求。

详情
AI中文摘要

我们研究了自回归语言模型中在线KV缓存压缩的率失真极限,将其建模为模型诱导的滤流(filtration)上的序列Wyner-Ziv信源编码,其中下一步查询作为解码器边信息。实验上,在涵盖两个系列、参数规模0.5-3B的四个模型中,我们发现下一词分布对上下文截断的敏感性呈多项式衰减而非几何衰减:在外推中幂律拟合比指数拟合好一个数量级,拟合指数可由“汇聚加最近”(sink-plus-recent)KL测量独立恢复,并通过保持位置的消融实验验证了该衰减并非位置编码造成的伪影。在相应的多项式截断敏感性假设下,我们的主要结果刻画了仅后缀缓存策略的每词内存需求:滑动窗口方案以窗口大小$w = O(\varepsilon^{-1/\alpha})$达到失真$\varepsilon$,且在附加的双边贝叶斯风险条件下,逆定理表明在该策略类内$w = \Omega(\varepsilon^{-1/\alpha})$是必要的,因此仅后缀策略的缩放为$\Theta(\varepsilon^{-1/\alpha})$。循环式或传播式缓存摘要能否超越此缩放仍是开放问题。一个显式的块马尔可夫方案达到上界;在附加的前向衰减和正则性假设(仅由截断敏感性无法推出)下,其收敛速率指数与逆定理匹配,否则相差两倍。实验上,该多项式规律预测了具体缓存策略的退化曲线:在同等预算下,基于最近性的驱逐策略(滑动窗口、汇聚加最近)相对于随机保留将失真抑制约两个数量级,且失真随预算呈幂律衰减。

英文摘要

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and $0.5$-$3$B parameters, we find that the next-token distribution's sensitivity to context truncation decays \emph{polynomially} rather than \emph{geometrically}: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and the decay is verified to be free of positional-encoding artifacts by a position-preserving ablation. Under a corresponding \emph{polynomial truncation-sensitivity} assumption, our main result characterizes the per-token memory requirement of \emph{suffix-only} cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/\alpha})$, and -- under an additional two-sided Bayes-risk condition -- a converse shows $w = \Omega(\varepsilon^{-1/\alpha})$ is necessary within this policy class, so the scaling is $\Theta(\varepsilon^{-1/\alpha})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. Empirically, the polynomial law predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget, with a power-law decay in the budget.
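To see where the window scaling comes from, here is a short derivation under an assumed explicit form of the truncation-sensitivity bound. The constant $C$, the decay base $\rho$, and the symbol $D(w)$ for the distortion incurred by keeping only the last $w$ tokens are notational assumptions introduced for illustration; the abstract states only the resulting orders.

```latex
% Assumption (illustrative): keeping only the last w tokens incurs distortion
% D(w) <= C w^{-\alpha} for some constant C > 0 and exponent \alpha > 0.
\[
  C\, w^{-\alpha} \le \varepsilon
  \quad\Longleftrightarrow\quad
  w \ge \left(\frac{C}{\varepsilon}\right)^{1/\alpha},
  \qquad\text{so}\qquad
  w = \left\lceil (C/\varepsilon)^{1/\alpha} \right\rceil
    = O\!\left(\varepsilon^{-1/\alpha}\right)
  \text{ guarantees } D(w) \le \varepsilon .
\]
% Contrast with a geometric decay D(w) <= C \rho^{w}, 0 < \rho < 1:
\[
  C\, \rho^{w} \le \varepsilon
  \quad\Longleftrightarrow\quad
  w \ge \frac{\log (C/\varepsilon)}{\log (1/\rho)}
  = O\!\left(\log \tfrac{1}{\varepsilon}\right).
\]
```

For a sense of scale under these assumptions: with $C = 1$, $\alpha = 1/2$ and $\varepsilon = 10^{-2}$, the polynomial case needs $w \ge 10^{4}$ tokens, whereas a geometric decay with $\rho = 0.9$ would need only about 44.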

2605.24660 2026-06-09 cs.IR cs.AI cs.LG 版本更新

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

LLM 智能体应看到多少工具?一种机会校正的答案

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell

发表机构 * Meta Platforms(Meta平台)

AI总结 针对 LLM 智能体工具选择中候选列表长度优化问题,提出基于机会校正的 Bits-over-Random (BoR) 指标,并将其转化为强化学习奖励,实现每查询自适应深度选择,在保持覆盖率的同时显著减少展示工具数量并提升下游工具选择准确率。

Comments 13 pages, 2 figures

详情
AI中文摘要

在 LLM 智能体使用工具之前,检索系统必须决定向智能体展示哪些候选工具。这个候选列表应该多长?展示太多工具,模型难以选择;展示太少,正确的工具可能不会出现。大多数系统对每个查询应用固定的候选列表大小,但缺乏标准指标来评估该大小是否合适。我们将展示给 LLM 智能体的工具数量作为评估对象,并应用 Bits-over-Random (BoR),一种机会校正的指标,询问在给定深度下的成功是否优于随机选择在同一深度下的表现。我们在三个工具选择基准、多个评分器以及从 20 到 3,251 个工具不等的注册表上评估 BoR。然后,我们将相同的原理转化为强化学习 (RL) 奖励,用于每查询选择工具候选列表深度。RL 智能体故意设计得简单,作为指标的探针而非提议的系统。随着候选列表增长,随机包含正确工具的机会增加,因此奖励自然减少,减少了对工程化深度惩罚的需求。在 BFCL(370 个工具)上,学习到的策略几乎匹配展示 50 个工具的覆盖率(90.3% 对 90.8%),而平均仅展示 7 个。在 ToolBench(3,251 个工具)上,固定展示 5 个工具实现了更高的总覆盖率(64.7% 对 61.9%),但在困难查询(正确工具排名第 6-20 位)上未找到任何工具。BoR 智能体通过搜索更深层,在这些查询上找到了 16.7%。使用 Claude Sonnet 4.6 的下游验证表明,更短的自适应列表也提高了 LLM 选择正确工具的能力:与始终展示 5 个工具时的 87.1% 相比,达到了 93.1%;在中等难度查询(正确工具存在但未排名第一)上,从 60.9% 扩大到 76.8%。

英文摘要

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools ($90.3\%$ vs $90.8\%$) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage ($64.7\%$ vs $61.9\%$) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds $16.7\%$ on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: $93.1\%$ versus $87.1\%$ when always shown 5 tools, widening to $76.8\%$ vs $60.9\%$ on medium-difficulty queries where the correct tool is present but not ranked first.
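The chance-correction idea can be made concrete with a small sketch. The abstract does not give the exact BoR formula, so the form below (log base 2 of observed coverage over the coverage of a uniformly random shortlist of the same depth) is an assumption chosen to match the name, and both function names are hypothetical. It also treats the adaptive policy's average depth of 7 as if it were a fixed depth, which is a simplification.

```python
import math

def random_hit_rate(depth: int, registry_size: int) -> float:
    """Chance that a uniformly random shortlist of `depth` tools contains the one correct tool."""
    return min(depth, registry_size) / registry_size

def bits_over_random(observed_hit_rate: float, depth: int, registry_size: int) -> float:
    """Assumed chance-corrected score: log2(observed coverage / random-baseline coverage)."""
    baseline = random_hit_rate(depth, registry_size)
    return math.log2(observed_hit_rate / baseline)

# Coverage figures quoted in the abstract for BFCL (370 tools).
fixed_50 = bits_over_random(0.908, depth=50, registry_size=370)
adaptive_7 = bits_over_random(0.903, depth=7, registry_size=370)
print(round(fixed_50, 2), round(adaptive_7, 2))
```

Under this assumed form, the shorter list scores higher even though its raw coverage is marginally lower, because a random 50-tool shortlist already contains the right tool far more often than a random 7-tool one. That is the sense in which a growing shortlist earns less reward without an explicit depth penalty.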

2605.22781 2026-06-09 cs.OS cs.AI 版本更新

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

DeltaBox: 通过毫秒级沙箱检查点/回滚扩展有状态AI代理

Yunpeng Dong, Jingkai He, Shiqi Liu, Yuze Hou, Dong Du, Zhonghu Xu, Si Yu, Baochuan Yang, Yubin Xia, Haibo Chen

发表机构 * Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University(并行与分布式系统研究所,上海交通大学) Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China(领域特定操作系统工程研究中心,中华人民共和国教育部,中国) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文提出DeltaBox,一种通过DeltaFS和DeltaCR机制实现毫秒级检查点/回滚的新型AI代理沙箱,解决了传统方法在高频状态探索中的延迟问题。

详情
AI中文摘要

LLM驱动的AI代理需要高频状态探索(例如测试时的树搜索和强化学习),依赖于对完整沙箱状态(包括文件和进程状态,例如内存、上下文等)的快速检查点和回滚(C/R)。现有机制需要完整复制状态,导致每次C/R的延迟达到数百毫秒到秒级,严重限制了深度搜索和大规模扇出。本文观察到AI代理中的相邻检查点高度相似,因此沙箱应仅复制连续检查点之间的变化(关键洞察)。然而,实现这一想法并不简单,主要是由于缺乏操作系统支持。本文提出新的操作系统级抽象DeltaState,通过两个协同设计的操作系统机制,为AI代理实现基于变化的事务性C/R。首先,DeltaFS通过将文件状态组织成分层结构,在检查点时动态冻结可写层并插入新层,实现基于变化的文件系统C/R,将文件更新简化为写时复制,使回滚成为简单的层切换。其次,DeltaCR通过增量转储实现基于变化的进程状态C/R,并通过绕过传统流水线、直接从冻结的模板进程fork()来加速回滚。随后我们提出DeltaBox,一种新型代理沙箱,通过这两种新机制实现毫秒级的C/R。在SWE-bench和RL微基准测试上的评估显示,DeltaBox以毫秒级延迟完成检查点和回滚(分别为14ms和5ms),使代理在固定时间预算内能够探索明显更多的节点。

英文摘要

LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory, contexts, etc.). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that subsequent checkpoints in AI agents are highly similar. Therefore, instead of full duplication, a sandbox should only duplicate the changes between consecutive checkpoints (Key Insight). However, it is non-trivial to realize the idea, mainly due to the missing OS support. This paper proposes a new OS-level abstraction, DeltaState, to enable change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers and dynamically freezing the writable layer and inserting a new one during checkpoint, reducing file updates to copy-on-write, and making rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond-level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback with millisecond-level latency (14ms and 5ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.
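The layered checkpoint idea described for DeltaFS can be pictured with a toy in-memory model. This is a hypothetical illustration only: the class `LayeredState` and its methods are invented here, the real mechanism lives in the OS and filesystem, and file deletion (which would need whiteout markers) is deliberately left out.

```python
class LayeredState:
    """Toy model of layered, change-based checkpoint/rollback (not the DeltaFS implementation)."""

    def __init__(self):
        self._layers = [{}]  # frozen layers first, the single writable layer last

    def write(self, path: str, data: str) -> None:
        """Writes only touch the top layer, so frozen layers are never copied or modified."""
        self._layers[-1][path] = data

    def read(self, path: str):
        """Resolve a path from the newest layer down to the oldest."""
        for layer in reversed(self._layers):
            if path in layer:
                return layer[path]
        return None

    def checkpoint(self) -> int:
        """Freeze the current writable layer and push a fresh empty one.
        The cost does not depend on how much state already exists."""
        self._layers.append({})
        return len(self._layers) - 1  # number of frozen layers at this checkpoint

    def rollback(self, checkpoint_id: int) -> None:
        """Drop every layer created after the checkpoint and start a new writable layer."""
        self._layers = self._layers[:checkpoint_id] + [{}]


state = LayeredState()
state.write("main.py", "v1")
cp = state.checkpoint()
state.write("main.py", "v2")
state.write("notes.txt", "scratch")
state.rollback(cp)
print(state.read("main.py"), state.read("notes.txt"))  # prints: v1 None
```

In this model, checkpoint and rollback only append to or truncate a short list of layers, which is the property that makes rollback "a simple layer switch" rather than a full copy.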

2605.16972 2026-06-09 cs.HC cs.AI 版本更新

WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI

WhiteTesseract: 通过XR和对话式AI重新诠释文化遗产

Jingjing Li, Zhi Liu, Xiyao Jin, Tatsuki Fushimi, Yoichi Ochiai

发表机构 * University of Tsukuba(筑波大学)

AI总结 本研究通过结合XR和对话式AI,提出WhiteTesseract系统,旨在提升文化遗产展览的沉浸感和个性化体验,增强观众的参与度和反思能力。

Comments 38 pages, 13 figures. Accepted for publication in ACM Journal on Computing and Cultural Heritage (JOCCH)

详情
AI中文摘要

文化遗产展览往往难以维持观众的注意力并促进反思性参与。实体展览依赖固定的解说工具,无法适应个体的背景或好奇心,其效果高度依赖于参观者的个人情境、先前知识和文化素养。同时,数字展览更注重便利性和可及性,但可能削弱定义具身文化体验的物理情境和社会情境。WhiteTesseract通过高分辨率XR和对话式AI实现现场解说来弥补这一差距。该系统借助艺术品识别整合空间智能,使参观者能够有选择地减少环境干扰(通过削减现实,diminished reality),并进行情境感知对话(通过大语言模型)。目标是保留物理和社会环境的丰富性,同时提供灵活的个人反思空间,在不损害物理真实性的前提下增强个人情境。我们在一个克劳德·莫奈展览中部署了该系统,并与26名参与者进行了受控用户研究。定量结果表明,WhiteTesseract的调节使平均观看时间从35.3秒显著增加到98.3秒(p < 0.001)。对529次参观者与AI互动的分析发现,60%的互动超出了事实性查询,包括分析性、情感性和比较性的提问。这些发现展示了XR和AI如何通过支持更深入、更个性化的参与来丰富实体展览体验,而不取代文化遗产的具身价值。我们讨论了现实部署的技术和社会限制以及受控环境的局限性。

英文摘要

Cultural heritage exhibitions often struggle to sustain attention and support reflective engagement. Physical exhibitions rely on fixed interpretive aids that lack adaptability to individual backgrounds or curiosity, and their effectiveness depends heavily on a visitor's Personal Context, prior knowledge, and cultural literacy. Meanwhile, digital exhibitions prioritize convenience and accessibility but risk weakening the Physical and Social Contexts that define embodied cultural experience. WhiteTesseract addresses this gap by enabling in-situ interpretation through high-resolution XR and conversational AI. The system integrates spatial intelligence via artwork recognition to allow visitors to selectively reduce environmental distractions (via diminished reality) and engage in context-aware dialogue (via large language models). The goal is to preserve the richness of the physical and social environment while providing a flexible space for personal reflection, enhancing Personal Context without compromising physical authenticity. We deployed the system in a Claude Monet exhibition and conducted a controlled user study with 26 participants. Quantitative results showed that WhiteTesseract modulation significantly increased average viewing duration from 35.3 to 98.3 seconds (p < 0.001). Analysis of 529 visitor-AI interactions revealed that 60% extended beyond factual queries to include analytical, emotional, and comparative inquiries. These findings demonstrate how XR and AI can enrich the physical exhibition experience by supporting deeper, more personalized engagement without displacing the embodied value of cultural heritage. We discuss technical and social constraints for real-world deployment and limitations of our controlled setting.

2602.08916 2026-06-09 cs.SC cs.ET cs.LG 版本更新

AMS-HD: Hyperdimensional Computing for Real-Time and Energy-Efficient Acute Mountain Sickness Detection

AMS-HD:用于实时和节能急性高山病检测的高维计算

Abu Masum, Mehran Moghadam, M. Hassan Najafi, Bige Unluturk, Ulkuhan Guler, Beth A. Beidleman, Sercan Aygun

发表机构 * School of Computing and Informatics, University of Louisiana at Lafayette(路易斯安那大学拉斐特分校计算与信息学院) Department of Electrical, Computer, and Systems Engineering, Case Western Reserve University(凯斯西储大学电气、计算机与系统工程系) Electrical and Biomedical Engineering, Michigan State University(密歇根州立大学电气与生物医学工程系) Electrical and Computer Engineering Department, Worcester Polytechnic Institute(伍斯特理工学院电气与计算机工程系) US Army Research Institute of Environmental Medicine(美国陆军环境医学研究所)

AI总结 本文提出AMS-HD框架,利用高维计算实现实时急性高山病检测,通过特征选择、超向量编码和位置投影提升分类效率,在多种平台上实现高准确率和低能耗。

详情
AI中文摘要

目标:急性高山病(AMS)是最常见的高原疾病,影响攀登至海拔2500米以上的未习服人群,并可能恶化为危及生命的脑水肿或肺水肿。基于可穿戴生理信号进行AMS检测的传统机器学习(ML)方法往往无法满足连续监测对实时性和硬件效率的要求。方法:本文提出AMS-HD,首个基于高维计算(HDC)的实时AMS检测框架,涵盖面向移动平台的高层双极(-1/+1)计算和面向FPGA与ASIC的低层二进制(0/1)计算。框架整合互信息特征选择、超向量编码和位置投影以提高分类效率。验证覆盖ARM、FPGA和智能手表-智能手机平台,使用可穿戴设备可获取的血氧饱和度(SpO2)和心率信号。结果:AMS-HD在二分类和多分类中均达到或优于SVM和MLP基线,二分类准确率高达91%,F1分数达90%,在外部AMS相关数据集上准确率高达85%。在FPGA上,AMS-HD将LUT和触发器用量分别减少7.3倍和5.8倍,功耗比MLP低3.9倍。在移动平台上,AMS-HD每次会话仅消耗1%电量、占用60字节内存、推理时间为2.50毫秒,能耗比SVM低约2倍、比MLP低3倍以上。结论:AMS-HD为实时AMS监测提供了一个可扩展、硬件感知的传统ML替代方案,以显著更低的资源消耗实现了有竞争力的性能。意义:本文首次提出用于高山病检测的完整HDC框架,连接可穿戴推理和底层硬件部署,为资源受限的健康监测提供解决方案。

英文摘要

Objective: Acute mountain sickness (AMS) is the most prevalent altitude illness, affecting unacclimatized individuals ascending above 2,500 m and potentially escalating to life threatening cerebral or pulmonary edema. Conventional machine learning (ML) methods for AMS detection from wearable physiological signals often fail to meet real-time hardware efficiency requirements of continuous monitoring. Methods: We present AMS-HD, the first hyperdimensional computing (HDC)-based framework for real-time AMS detection, spanning high-level bipolar (-1/+1) computing for mobile platforms and low-level binary (0/1) computing for FPGA and ASIC targets. The framework integrates mutual information feature selection, hypervector encoding, and positional projection to enhance classification efficiency. Validation spans ARM, FPGA, and smartwatch-smartphone platforms using wearable-accessible SpO2 and heart rate signals. Results: AMS-HD matches or outperforms SVM and MLP baselines in both binary and multiclass classification, achieving up to 91% accuracy and 90% F1-score in binary classification, and up to 85% accuracy on external AMS-related datasets. On FPGA, AMS-HD reduces LUT and flip-flop usage by 7.3x and 5.8x, while consuming 3.9x less power than MLP. On mobile platforms, AMS-HD requires only 1% battery per session, 60 Bytes of memory, and 2.50 ms inference time -- approximately 2x and more than 3x lower energy consumption than SVM and MLP. Conclusion: AMS-HD provides a scalable, hardware-aware alternative to conventional ML for real-time AMS monitoring, achieving competitive performance with substantially lower resource consumption. Significance: This work presents the first complete HDC framework for altitude sickness detection, bridging wearable inference and low-level hardware deployment for resource-constrained health monitoring.
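For readers unfamiliar with hyperdimensional computing, the sketch below shows the generic bipolar (-1/+1) building blocks the abstract alludes to: random hypervectors, binding by element-wise multiplication, bundling by summation, and classification by similarity to class prototypes. It is a textbook-style illustration, not the AMS-HD pipeline: the dimensionality, the two quantization levels, the one-example prototypes, and every name are assumptions, and the paper's mutual-information feature selection and positional projection are not modeled.

```python
import math
import random

DIM = 2048  # hypervector dimensionality (illustrative choice)

def random_hv(rng: random.Random) -> list[int]:
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bind(a: list[int], b: list[int]) -> list[int]:
    """Binding by element-wise multiplication (self-inverse for bipolar vectors)."""
    return [x * y for x, y in zip(a, b)]

def bundle(vectors: list[list[int]]) -> list[int]:
    """Bundling by element-wise summation (kept as integers, no thresholding)."""
    return [sum(col) for col in zip(*vectors)]

def cosine(a: list[int], b: list[int]) -> float:
    """Cosine similarity between two integer vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

rng = random.Random(0)
feature_ids = {"spo2": random_hv(rng), "heart_rate": random_hv(rng)}
levels = {"low": random_hv(rng), "high": random_hv(rng)}

def encode(sample: dict[str, str]) -> list[int]:
    """Encode a sample as the bundle of (feature id * quantized level) bindings."""
    return bundle([bind(feature_ids[name], levels[level]) for name, level in sample.items()])

# Hypothetical class prototypes, each built from a single made-up example.
prototypes = {
    "ams": encode({"spo2": "low", "heart_rate": "high"}),
    "healthy": encode({"spo2": "high", "heart_rate": "low"}),
}

def classify(sample: dict[str, str]) -> str:
    """Return the label whose prototype is most similar to the encoded sample."""
    query = encode(sample)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

print(classify({"spo2": "low", "heart_rate": "high"}))
```

Every operation here is an element-wise integer multiply, add, or compare, which is why this style of model tends to map well onto small FPGA and microcontroller targets.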

2605.16223 2026-06-09 cs.GR cs.AI cs.CV 版本更新

Evaluating Design Video Generation: Metrics for Compositional Fidelity

评估设计视频生成:构成保真度的度量标准

Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta

发表机构 * Lica World(Lica世界) San Francisco, United States of America(美国旧金山) ICML’26 Workshop on Human-AI Co-Creativity, Seoul, South Korea(ICML’26 人类-人工智能协同创作研讨会,韩国首尔)

AI总结 本文提出一个自动化评估框架,用于评估设计动画中布局、动作正确性、时间质量和内容保真度,以替代主观人类评估,为该领域提供统一基准。

Comments ICML 2026 Workshop on Human-AI Co-Creativity

详情
AI中文摘要

生成视频模型越来越多地用于设计动画任务,但该领域缺乏标准化评估框架。与自然视频生成不同,设计动画施加了结构化约束:特定组件需以规定类型、方向、速度和时间进行动画,而非动画区域必须保持稳定,布局结构必须保持。本文提供了一个全面自动化的评估框架,从四个维度组织:布局保真度、动作正确性、时间质量及内容保真度。这消除了对主观人类评估的依赖,并为该领域建立了一个共同的基准。我们在此发布代码和数据集:https://github.com/purvanshi/lica-bench。

英文摘要

Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field. We release the code and dataset here: https://github.com/purvanshi/lica-bench.

2605.16163 2026-06-09 physics.ao-ph cs.LG 版本更新

SwAIther-Precip: Lead-Time-Aware Bias Correction Enables Kilometer-Scale Downscaling of Global AI Precipitation Forecasts over Switzerland

SwAIther-Precip:考虑提前时间的偏倚校正实现瑞士全球AI降水预报的公里级降尺度

Dan Assouline, Erwan Koch, Federico Amato, Filippo Quarenghi, Daniele Nerini, Thibaut Loiseau, Kyle van de Langemheen, Tom Beucler

发表机构 * European Centre for Medium-Range Weather Forecasts(欧洲中期天气预报中心) University of Geneva(日内瓦大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出SwAIther-Precip框架,通过校正提前时间依赖性偏倚,提升全球AI降水预报的公里级概率降尺度能力,实验显示CRPS降低48%。

详情
AI中文摘要

在复杂地形上,公里尺度的高技巧中期降水预报仍具挑战,因为降水源于多尺度非线性过程,全球模型无法以可承受的成本显式解析。全球AI天气模型可以给出高技巧的中期预报,但其原生0.25度分辨率限制了在局地灾害应用中的直接使用。统计降尺度有助于弥合这一差距,但现有方法往往难以处理全球预报中状态依赖、尤其是提前时间依赖的偏倚。我们引入SwAIther-Precip,一种考虑提前时间的降尺度框架,将粗分辨率AIFS预报转换为瑞士上空的公里级概率降水场。首先,一个通过特征级线性调制以提前时间为条件的U-Net,在粗分辨率下确定性地校正系统性偏倚。这种针对性校正使后续的超分辨率阶段只需以校正后的降水为条件,成本更低,并且可以直接基于观测而非完整大气状态进行训练。随后,基于扩散的模型独立于提前时间生成精细尺度的空间变异性。使用AIFS预报和CombiPrecip雷达-雨量计观测,SwAIther-Precip将CRPS相对于原始AIFS降低48%。生成的降水场再现了观测到的空间变异性,谱保真度在大尺度上高于0.85、在小尺度上为0.88,对应于1公里网格上约4公里的有效分辨率,适用于最长5天的提前时间。跨提前时间训练进一步提升了长时效性能,在6天提前时间上相对于针对特定提前时间的模型CRPS降低13%。这些结果表明,在生成式超分辨率之前显式校正提前时间依赖的偏倚,是对全球AI降水预报进行高效公里级概率降尺度的关键。

英文摘要

Skillful medium-range precipitation forecasting at kilometer scale remains challenging over complex terrain because precipitation arises from multiscale nonlinear processes that global models cannot explicitly resolve at affordable cost. Global AI weather models can produce skillful medium-range forecasts, but their native 0.25 degrees resolution limits direct use for local hazard applications. Statistical downscaling can help bridge this gap, yet existing approaches often struggle with state-dependent, and especially lead-time-dependent, biases in global forecasts. We introduce SwAIther-Precip, a lead-time-aware downscaling framework that converts coarse-resolution AIFS forecasts into probabilistic km-scale precipitation fields over Switzerland. First, a U-Net conditioned on lead time via feature-wise linear modulation deterministically corrects systematic biases at coarse resolution. This targeted correction enables a cheaper super-resolution stage conditioned only on corrected precipitation, allowing direct training on observations rather than on the full atmospheric state. A diffusion-based model then generates fine-scale spatial variability independently of lead time. Using AIFS forecasts and CombiPrecip radar-gauge observations, SwAIther-Precip reduces CRPS by 48% relative to raw AIFS. The generated fields reproduce observed spatial variability with spectral fidelity above 0.85 at large scales and 0.88 at small scales, corresponding to an effective resolution of approximately 4 km on a 1 km grid for lead times up to 5 days. Training across lead times further improves long-range performance, yielding a 13% CRPS reduction at 6 days relative to lead-time-specific models. These results show that explicitly correcting lead-time-dependent biases before generative super-resolution is key to efficient km-scale probabilistic downscaling of global AI precipitation forecasts.
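The headline numbers are stated in CRPS (continuous ranked probability score), which the abstract does not define. As a reading aid, the sketch below implements the standard empirical estimator for an ensemble forecast, CRPS = E|X - y| - 0.5 E|X - X'|, where lower is better. The function name and the precipitation values are made up for illustration, and how the paper aggregates CRPS over grid points and lead times is not specified here.

```python
def crps_ensemble(members: list[float], observation: float) -> float:
    """Empirical CRPS of an ensemble: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(members)
    error_term = sum(abs(x - observation) for x in members) / n
    spread_term = sum(abs(a - b) for a in members for b in members) / (n * n)
    return error_term - 0.5 * spread_term

# Made-up precipitation ensembles (mm/h) for one grid cell, illustration only.
raw = crps_ensemble([0.0, 0.5, 4.0, 6.0], observation=2.0)
sharper = crps_ensemble([1.5, 2.0, 2.5, 3.0], observation=2.0)
print(raw, sharper)
```

For a single-member (deterministic) forecast the spread term vanishes and CRPS reduces to the absolute error, so the score rewards ensembles that are both close to the observation and appropriately spread. A 48% reduction means the corrected forecasts score 0.52 times the raw CRPS on average.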

2605.14285 2026-06-09 eess.IV cs.LG 版本更新

ForcingDAS: Unified and Robust Data Assimilation via Diffusion Forcing

通过扩散强迫实现统一且稳健的数据同化:ForcingDAS

Yixuan Jia, Siyi Chen, Yida Pan, Xiao Li, Lianghe Shi, Chanyong Jung, Haijie Yuan, Ismail Alkhouri, Yue Cynthia Wu, Saiprasad Ravishankar, Jeffrey A Fessler, Qing Qu

发表机构 * University of Michigan(密歇根大学) University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) Massachusetts Institute of Technology(麻省理工学院) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出ForcingDAS,一种基于扩散强迫的统一数据同化框架,能够捕捉长时序依赖并减少误差积累,同时在推理时无需重新训练即可实现滤波到平滑的全谱应用。

详情
AI中文摘要

数据同化(DA)利用含噪声的不完整观测估计演化中动力系统的状态,广泛应用于科学模拟以及天气和气候科学。在实践中,滤波方法依赖于帧到帧的转移模型。然而,当观测是非马尔可夫的(即观测只是更高维潜在状态的部分切片,如真实天气数据)时,这些模型十分脆弱:它们容易在长时域上累积误差。同时,基于学习的DA方法通常只针对单一模式,要么是滤波(临近预报、实时预报),要么是平滑(回顾性再分析),这使本应共享的先验被拆分到各个面向特定应用的流程中。为解决这两个问题,我们引入ForcingDAS,一种统一且稳健的DA框架。该框架基于扩散强迫(Diffusion Forcing),为每一帧分配独立的噪声水平,学习联合轨迹先验而非帧到帧的转移。这使其能够捕捉长时域时间依赖性并减少误差累积。此外,同一个训练好的模型在推理时可覆盖从滤波到平滑的完整谱系。具体而言,临近预报、固定滞后平滑和批量再分析仅通过推理调度来选择,无需重新训练。我们在二维纳维-斯托克斯涡度、降水临近预报和全球大气状态估计上评估了ForcingDAS。在所有设置中,单个模型与针对各自模式专门设计的学习型基线和经典基线相比具有竞争力或更优,其中在真实天气基准上的增益最大。

英文摘要

Data assimilation (DA) estimates the state of an evolving dynamical system from noisy, partial observations, and is widely used in scientific simulation as well as weather and climate science. In practice, filtering methods rely on frame-to-frame transition models. However, these models are fragile when observations are non-Markovian (when they form only a partial slice of a higher-dimensional latent state as in real-world weather data): they tend to accumulate errors over long horizons. At the same time, learned DA methods typically commit to a single regime, either filtering (nowcasting, real-time forecasting) or smoothing (retrospective reanalysis), which splits what should be a shared prior across application-specific pipelines. To address both issues, we introduce ForcingDAS, a unified and robust DA framework. Built on Diffusion Forcing with an independent noise level assigned to each frame, ForcingDAS learns a joint-trajectory prior instead of frame-to-frame transitions. This allows it to capture long-horizon temporal dependencies and reduce error accumulation. In addition, the same trained model spans the full filtering to smoothing spectrum at inference time. Specifically, nowcasting, fixed-lag smoothing, and batch reanalysis are selected through the inference schedule alone, without retraining. We evaluate ForcingDAS on 2D Navier-Stokes vorticity, precipitation nowcasting, and global atmospheric state estimation. Across all settings, a single model is competitive with or outperforms both learned and classical baselines that are specialized for individual regimes, with the largest gains observed on real-world weather benchmarks.

2602.00056 2026-06-09 cs.CY cs.AI 版本更新

How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

超数据化如何影响前沿AI的可持续性成本

Sophia N. Wilson, Sebastian Mair, Mophat Okinyi, Erik B. Dam, Janin Koch, Raghavendra Selvan

发表机构 * University of Copenhagen(哥本哈根大学) Linköping University(林雪平大学) Techworker Community Africa(非洲技术工人社区) Univ. Lille, Inria, CNRS, Centrale Lille(里尔大学,Inria,CNRS,Centrale Lille)

AI总结 本文研究超数据化对前沿AI的环境、社会和经济成本的影响,通过分析Hugging Face Hub的55万数据集,揭示数据增长、存储能耗及全球数据基础设施差异,提出Data PROOFS建议以缓解相关成本。

Comments Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency. Montreal, Canada

详情
AI中文摘要

大规模数据在过去十年中推动了前沿人工智能(AI)模型的成功。这种扩展依赖于大型科技公司持续努力聚合和整理互联网级数据集。本文从可持续性角度研究大规模数据在AI中的环境、社会和经济成本。我们主张该领域正从基于数据构建模型转向主动创建数据以构建模型。我们将这一转变称为超数据化,它标志着前沿AI及其社会影响的关键转折点。为量化数据相关成本并将其置于具体情境中,我们分析了Hugging Face Hub上约550,000个数据集,重点是数据集增长、存储相关的能耗和碳足迹,以及基于语言数据的社会代表性。我们还通过肯尼亚数据工人的定性反馈来考察其中涉及的劳动,包括受大型科技公司直接雇佣以及接触露骨、令人不适的内容等情况。我们进一步利用外部数据来源,通过展示全球数据中心基础设施的不平等来佐证我们的发现。我们的分析表明,超数据化带来了显著且不断增长的环境成本,同时系统性地将劳动风险和代表性伤害向全球南方转移。因此,我们提出了涵盖溯源、资源意识、所有权、开放性、节俭和标准的Data PROOFS建议,以缓解这些成本。我们的工作旨在让支撑前沿AI却常被忽视的数据成本变得可见,并在研究社区内外激发更广泛的讨论。

英文摘要

Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication drives substantial and growing environmental costs while systematically redistributing labour risks and representational harms toward the Global South. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.

2604.10271 2026-06-09 cs.CR cs.CL cs.IR Updated version

Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution

Robert Dilworth

Affiliation * Department of Computer Science and Engineering, Mississippi State University

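AI summary This paper studies how homoglyph substitution can degrade stylometry systems, and how it can conceal personally identifying information in text to prevent identity leakage.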

Comments 30 pages, 9 figures

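Details
AI-generated abstract (translated from Chinese)

A data breach involving government-issued IDs such as passports and driver's licenses appears, at first glance, more serious than a voluntary disclosure on a nondescript social-media platform. Yet from online posts a forensic linguist can stylometrically estimate an author's age range and geographic location. This paper explores how homoglyph substitution, the replacement of characters with visually similar alternatives, can degrade stylometric systems and thereby limit the leakage of personal information from written text.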

English abstract

In what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution--the replacement of characters with visually similar alternatives (e.g., "h" $\texttt{[U+0068]}$ $\rightarrow$ "һ" $\texttt{[U+04BB]}$)--on text can degrade stylometric systems.
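A minimal sketch of the substitution step described in the abstract, assuming a small hand-picked Latin-to-Cyrillic table. The mapping `HOMOGLYPHS` and the function `substitute_homoglyphs` are illustrative names, not taken from the paper, and the paper's actual character set and selection policy may differ.

```python
# Hypothetical mapping from Latin letters to visually similar Cyrillic code points.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "h": "\u04bb",  # CYRILLIC SMALL LETTER SHHA (the abstract's example)
}


def substitute_homoglyphs(text, mapping=HOMOGLYPHS):
    """Replace every character that has an entry in `mapping`; keep the rest."""
    return "".join(mapping.get(ch, ch) for ch in text)


original = "the author"
disguised = substitute_homoglyphs(original)

# The two strings have the same length and look alike on screen, but they
# differ at every position whose character appears in the mapping.
changed_positions = [i for i, (a, b) in enumerate(zip(original, disguised)) if a != b]
```

The disguised string compares unequal to the original although it renders almost identically, which is the property a character-level stylometric feature extractor would trip over.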

2605.00327 2026-06-09 cs.IR cs.AI Updated version

DynamicPO: Dynamic Preference Optimization for Recommendation

Xingyu Hu, Kai Zhang, Jiancan Wu, Shuli Wang, Chi Wang, Wenshuai Chen, Yinhua Zhu, Haitao Wang, Xingxing Wang, Xiang Wang

Affiliations * University of Science and Technology of China; Shanghai Innovation Institute; Meituan

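AI summary This paper proposes the DynamicPO framework, which addresses preference optimization collapse and improves recommendation accuracy through Dynamic Boundary Negative Selection and Dual-Margin Dynamic beta Adjustment.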

Comments DASFAA 2026 Best Paper

Details
English abstract

In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi-negative objective functions to leverage abundant implicit-feedback negatives and sharpen preference boundaries. However, our empirical analyses reveal a counterintuitive phenomenon, preference optimization collapse, where increasing the number of negative samples can lead to performance degradation despite a continuously decreasing training loss. We further theoretically demonstrate that this collapse arises from gradient suppression, caused by the dominance of easily discriminable negatives over boundary-critical negatives that truly define user preference boundaries. As a result, boundary-relevant signals are under-optimized, weakening the model's decision boundary. Motivated by these observations, we propose DynamicPO (Dynamic Preference Optimization), a lightweight and plug-and-play framework comprising two adaptive mechanisms: Dynamic Boundary Negative Selection, which identifies and prioritizes informative negatives near the model's decision boundary, and Dual-Margin Dynamic beta Adjustment, which calibrates optimization strength per sample according to boundary ambiguity. Extensive experiments on three public datasets show that DynamicPO effectively prevents optimization collapse and improves recommendation accuracy on multi-negative preference optimization methods, with negligible computational overhead. Our code and datasets are available at https://github.com/xingyuHuxingyu/DynamicPO.
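The abstract does not state the loss or the selection rule, so the following is only a toy illustration under loudly stated assumptions: a mean of pairwise DPO-style terms $-\log\sigma(\beta \cdot \text{margin})$ over negatives, and a hypothetical rule that keeps the negatives with the smallest absolute margin. It shows one way in which adding well-separated negatives can lower the average loss while the hard negative near the boundary stays just as unresolved. It is not DynamicPO itself.

```python
import math


def pairwise_loss(margin, beta=1.0):
    """-log(sigmoid(beta * margin)) for one (positive, negative) pair."""
    return math.log1p(math.exp(-beta * margin))


def mean_multi_negative_loss(margins, beta=1.0):
    """Assumed objective: average of the pairwise terms over all negatives."""
    return sum(pairwise_loss(m, beta) for m in margins) / len(margins)


def select_boundary_negatives(margins, k):
    """Hypothetical rule: keep the k negatives whose margin is closest to 0."""
    return sorted(margins, key=abs)[:k]


# Margins are implicit-reward gaps (positive minus negative); values are made up.
hard = [0.1]        # one negative close to the decision boundary
easy = [5.0] * 9    # nine negatives that are already well separated

loss_hard_only = mean_multi_negative_loss(hard)
loss_mixed = mean_multi_negative_loss(hard + easy)
selected = select_boundary_negatives(hard + easy, k=1)
```

Under this assumed objective the mixed batch reports a lower loss than the hard negative alone, even though the hard pair's own term is unchanged; restricting the batch to `selected` puts the weight back on the boundary case.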

2604.26993 2026-06-09 math.NA cs.LG cs.NA math.OC Updated version

State-Dependent Lyapunov Analysis of Rank-1 Matrix Factorization

Jaehong Moon

Affiliation * Industrial & Enterprise Systems Engineering, University of Illinois at Urbana-Champaign

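AI summary This paper studies the convergence of gradient descent for rank-1 matrix factorization through a state-dependent Lyapunov lens. It introduces a parameterized quadratic certificate I(δ;·) and proves convergence to global minimizers below the critical step size; above it, the iterates enter a balanced terminal regime and, for a range of step sizes, the reduced dynamics exhibit period-2 behavior.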

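Details
AI-generated abstract (translated from Chinese)

We study gradient descent for rank-1 matrix factorization from a state-dependent Lyapunov perspective. The central object is a parameterized quadratic certificate I(δ;·) whose boundary-inward property induces a monotone state parameter δ_t, certifying that the trajectory stays within a shrinking family of level sets. For certified initializations below the critical step size, this mechanism proves convergence to global minimizers. Above the critical step size, the same monotone-state mechanism leads instead to a balanced terminal regime; for a range of post-critical step sizes, the reduced dynamics exhibit period-2 behavior consistent with edge-of-stability phenomena. We further show that the scalar certificate is not an arbitrary algebraic construction: under structural axioms and a natural state-parameter normalization, it is uniquely determined by the monotonicity mechanism. Numerical experiments suggest that this state-dependent Lyapunov mechanism persists beyond the proved cases, including two-dimensional rank-1 approximation and quartic augmentations of scalar factorization.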

English abstract

We study gradient descent for rank-1 matrix factorization through a state-dependent Lyapunov perspective. The central object is a parameterized quadratic certificate $I(δ;\,\cdot)$ whose boundary-inward property induces a monotone state parameter $δ_t$, thereby certifying that the trajectory is confined to a shrinking family of level sets. For certified initializations below the critical step size, this mechanism proves convergence to global minimizers. Above the critical step size, the same monotone-state mechanism instead leads to a balanced terminal regime; for a range of post-critical step sizes, the reduced dynamics exhibit period-2 behavior consistent with edge-of-stability phenomena. We further show that the scalar certificate is not an ad hoc algebraic construction: under structural axioms and a natural state-parameter normalization, it is uniquely determined by the monotonicity mechanism. Numerical experiments suggest that this state-dependent Lyapunov mechanism persists beyond the proved cases, including two-dimensional rank-1 approximation and quartic augmentations of scalar factorization.
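The abstract does not give the certificate $I(δ;\,\cdot)$ explicitly, so the sketch below does not reproduce it. It only simulates plain gradient descent on an assumed scalar objective $f(x, y) = \tfrac12 (xy - 1)^2$ with a small, hypothetical step size, and tracks the imbalance $x^2 - y^2$. For this objective one GD step satisfies $x'^2 - y'^2 = (x^2 - y^2)(1 - \eta^2 r^2)$ with $r = xy - 1$, so the imbalance cannot grow while $|\eta r| \le 1$.

```python
def gd_scalar_factorization(x, y, eta, steps, target=1.0):
    """Run plain gradient descent on f(x, y) = 0.5 * (x*y - target)**2.

    Returns a list of (x, y, loss) tuples, one per iterate including the start.
    """
    history = []
    for _ in range(steps):
        r = x * y - target                      # residual
        history.append((x, y, 0.5 * r * r))
        # Simultaneous update with the gradient (r * y, r * x).
        x, y = x - eta * r * y, y - eta * r * x
    r = x * y - target
    history.append((x, y, 0.5 * r * r))
    return history


# Hypothetical initialization and step size, chosen only for illustration.
history = gd_scalar_factorization(x=1.5, y=0.5, eta=0.05, steps=2000)
x_final, y_final, loss_final = history[-1]

# Imbalance |x^2 - y^2| at the first and last iterate.
imbalance_start = abs(history[0][0] ** 2 - history[0][1] ** 2)
imbalance_end = abs(x_final ** 2 - y_final ** 2)
```

Re-running with a larger `eta` and printing the last few entries of `history` is a simple way to inspect whether the iterates settle or oscillate; no particular outcome is claimed here for step sizes other than the one used above.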