arXivDaily: daily arXiv academic digest, updated Monday through Friday
2311.01468 2026-05-25 cs.CL cs.LG

Remember what you did so you know what to do next

Manuel R. Ciosici, Alex Hedges, Yash Kankanampati, Justin Martin, Marjorie Freedman, Ralph Weischedel

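Affiliations: Information Sciences Institute, University of Southern California

AI summary: This paper studies using a moderately sized large language model (GPT-J, 6 billion parameters) to plan for a simulated robot in ScienceWorld across 30 classes of science-experiment goals. Including more prior steps in the input lets the model outperform a reinforcement-learning-based approach by up to 3.5x. The study also notes that performance varies widely across task classes, so averages can hide specific problems, and reports a 2.2x improvement even when training on only 6.5% of the training data.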

Comments Identical to EMNLP 2023 Findings

Abstract

We explore using a moderately sized large language model (GPT-J, 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit compared to reinforcement learning (Wang et al., 2022). Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement-learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, the improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement-learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6B-parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when it incorporates GPT-3.5 turbo, which has 29 times more parameters than GPT-J.
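
The difference between the Markov setting (one previous step) and a filled input buffer can be pictured as a prompt builder that packs the most recent steps into a fixed token budget. The sketch below is only an illustration under stated assumptions, not the authors' code: `build_prompt`, the whitespace-based `count_tokens`, and the budget values are all hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): count whitespace-separated words."""
    return len(text.split())


def build_prompt(goal: str, history: list[str], max_tokens: int, markov: bool = False) -> str:
    """Return the goal followed by as many of the most recent steps as fit in max_tokens.

    With markov=True, only the single previous step is considered.
    """
    candidates = history[-1:] if markov else history
    budget = max_tokens - count_tokens(goal)
    kept: list[str] = []
    # Walk backwards so the newest steps survive when the budget runs out.
    for step in reversed(candidates):
        cost = count_tokens(step)
        if cost > budget:
            break
        kept.append(step)
        budget -= cost
    kept.reverse()  # restore chronological order
    return "\n".join([goal, *kept])
```

Calling `build_prompt(goal, history, max_tokens, markov=True)` gives the single-step variant; dropping the flag keeps every step that fits, oldest steps being discarded first.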

2110.01552 2026-05-25 cs.CL cs.AI cs.LG

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

Manuel R. Ciosici, Joe Cecil, Alex Hedges, Dong-Ho Lee, Marjorie Freedman, Ralph Weischedel

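Affiliations: Information Sciences Institute, University of Southern California

AI summary: This paper proposes a new task for evaluating the question-answering ability of pre-trained language models (PTLMs) in open-book and closed-book settings, using introductory college textbooks in the social sciences and humanities as the instructional material. Using true/false statements based on the textbooks' review questions, the authors find that closed-book PTLMs show limited performance (at best 56%), suggesting they may not have understood the textbook; in the open-book setting, where the model may retrieve a relevant paragraph, performance is better (about 60%). The task offers a new benchmark for assessing how well PTLMs understand substantial instructional text.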

Comments Identical to the EMNLP 2021 version

Abstract

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5, fine-tuned with BoolQ, achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).

2101.05400 2026-05-25 cs.CL cs.AI cs.LG

Machine-Assisted Script Curation

Manuel R. Ciosici, Joseph Cummings, Mitchell DeHaven, Alex Hedges, Yash Kankanampati, Dong-Ho Lee, Ralph Weischedel, Marjorie Freedman

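Affiliations: Information Sciences Institute, University of Southern California

AI summary: This paper presents MASC, a system for human-machine collaborative script authoring. The system suggests event types, links to Wikidata, and sub-events that may have been forgotten, and the resulting scripts record the entities that participate in multiple sub-events and the temporal order of the sub-events, helping users author structurally complex event scripts. Case-study scripts illustrate how useful these automations are to the script writer.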

Comments Identical to the NAACL 2021 Demo version

Abstract

We describe Machine-Aided Script Curator (MASC), a system for human-machine collaborative script authoring. Scripts produced with MASC include (1) English descriptions of sub-events that comprise a larger, complex event; (2) event types for each of those events; (3) a record of entities expected to participate in multiple sub-events; and (4) temporal sequencing between the sub-events. MASC automates portions of the script creation process with suggestions for event types, links to Wikidata, and sub-events that may have been forgotten. We illustrate how these automations are useful to the script writer with a few case-study scripts.

2605.23887 2026-05-25 cs.DB cs.AI cs.CR cs.LG cs.MA

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

Joydeep Chandra

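Affiliations: BNRIST, Tsinghua University

AI summary: CHRONOS is a multi-agent coordination framework for evolving data marketplaces. It targets three failures of static designs as data evolves: degraded retrieval recall, misattributed value, and over-consumption of the privacy budget. Its three-layer architecture combines neural-ODE temporal decay, changepoint-conditioned Shapley valuation, and a differentially private EXP3-IX bandit scheduler. Experiments on four benchmarks report a competitive operating point in retrieval performance and privacy cost.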

Abstract

Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. We present CHRONOS, a three-layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a per-query expected recall-loss bound of $O(P_q \lambda\, \delta t)$, with a monotone-envelope guarantee reducing bound looseness to 1.8 to 3.2 times observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve $O(\sqrt{T \log T})$ regret while enforcing $(\varepsilon, \delta)$-differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post-processing, incurring no extra privacy cost. We provide multi-epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows recall@10 of 0.937, 2.74 queries per second, 161 ms latency, and a total $\varepsilon$ of 4.25 at $\delta = 10^{-6}$ under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise-dominated; utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.
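
The privacy figures above are reported under zCDP composition. As a rough illustration of how such accounting works, the sketch below uses the standard textbook relations: a Gaussian mechanism with L2 sensitivity Δ and noise standard deviation σ satisfies ρ-zCDP with ρ = Δ²/(2σ²), ρ values add under composition, and ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP. It is not the paper's accountant, and the function names and any inputs you pass are hypothetical.

```python
import math


def gaussian_zcdp_rho(sensitivity: float, sigma: float) -> float:
    """zCDP parameter of one Gaussian-mechanism release: rho = sensitivity^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_rho(rhos: list[float]) -> float:
    """zCDP composes additively across releases (e.g. one release per epoch)."""
    return sum(rhos)


def zcdp_to_epsilon(rho: float, delta: float) -> float:
    """Convert rho-zCDP to an (epsilon, delta)-DP guarantee."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def total_epsilon(sensitivity: float, sigma: float, epochs: int, delta: float) -> float:
    """Total epsilon after `epochs` identical Gaussian releases (illustrative setup)."""
    rho = compose_rho([gaussian_zcdp_rho(sensitivity, sigma)] * epochs)
    return zcdp_to_epsilon(rho, delta)
```

Because retrieval and ranking are described as post-processing, only the per-epoch releases would enter `compose_rho` in a setup like this one.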

2605.23879 2026-05-25 stat.ML cs.CR cs.LG math.ST stat.TH

On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy

Aratrika Mustafi, Soumya Mukherjee

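Affiliations: Department of Statistics, Pennsylvania State University

AI summary: This paper studies the stability of spherical Hellinger-Kantorovich (SHK) gradient flows and its implications for differential privacy. The authors develop a perturbation theory for these flows, analyze how flows under different potentials diverge over time, give uniform time-dependent bounds on the log-likelihood ratio and Rényi divergence, and further derive bounds on the KL divergence. Applied to sampling from the exponential mechanism, the results yield pure-DP and approximate-DP guarantees for SHK-based samplers and separate the mechanism's intrinsic suboptimality from finite-time sampling error.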

Abstract

Gradient-flow sampling interprets a Gibbs distribution as the minimizer of an energy functional over probability measures and generates dynamics converging to this target. Under spherical Hellinger-Kantorovich (SHK) geometry, the flow couples transport and reaction and coincides with birth-death Langevin dynamics. In this work, we develop a perturbation theory for SHK gradient flows. For two potentials $V$ and $V^{\prime}$, we compare the associated flows from a common initialization and quantify how potential discrepancies propagate over time. A uniform perturbation bound yields dimension-free, pointwise control of the log-likelihood ratio and Rényi divergence, while additional structure allows us to derive bounds for the KL divergence as well. We apply these results to approximate sampling for the exponential mechanism in differential privacy. The likelihood-ratio control provides explicit time-dependent Pure-DP guarantees for SHK-based samplers, while the KL bound yields Approximate-DP certificates via hockey-stick divergence. We also derive a utility bound separating intrinsic exponential-mechanism suboptimality from finite-time sampling error.
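
For reference, the exponential mechanism is itself a Gibbs distribution: an output r is drawn with probability proportional to exp(ε·u(r) / (2Δu)), where u is the utility and Δu its sensitivity. On a finite output set it can be sampled exactly, as in the sketch below. This only illustrates the target distribution; it does not implement the SHK flow or any approximate sampler from the paper, and the function names are hypothetical.

```python
import math
import random


def exponential_mechanism_probs(utilities: list[float], epsilon: float, sensitivity: float) -> list[float]:
    """Probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    top = max(utilities)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(scale * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(utilities: list[float], epsilon: float, sensitivity: float, rng: random.Random) -> int:
    """Draw one output index from the exponential mechanism."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(range(len(utilities)), weights=probs, k=1)[0]
```

In Gibbs form, the potential is V(r) = -ε·u(r) / (2Δu), which is the kind of target a gradient-flow sampler would be asked to approach in continuous settings.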

2605.23867 2026-05-25 cs.HC cs.AI

Human Decision-Making with Persuasive and Narrative LLM Explanations

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu

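Affiliations: DEVCOM Army Research Laboratory; University of Texas at Dallas; Virginia Polytechnic Institute and State University

AI summary: This study examines how narrative explanations generated by large language models (LLMs) affect human decision-making performance in classification tasks. In a large-scale human behavioral experiment, the persuasiveness of the LLM-generated narratives did not meaningfully improve decision accuracy, but narratives increased reliance on AI predictions and may have harmed response times and the ability to tell correct from incorrect AI predictions. The results suggest that adding narrative explanations to AI predictions involves tradeoffs, and that more work is needed on how and when they affect decisions.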

Abstract

Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also through their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found that the degree of persuasiveness, or lack thereof, of LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results for explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, both when the AI prediction was correct and when it was incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.

2605.23778 2026-05-25 physics.ao-ph cs.LG physics.comp-ph

The physics of AI weather models

George Craig, Tobias Selz, Matthias Beylich, Kirsten I. Tempest

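Affiliations: Meteorological Institute, LMU Munich

AI summary: This paper asks whether AI weather models implicitly solve physical equations, even if not the ones used by conventional numerical weather prediction models. By computing correlations of forecast skill and Centered Kernel Alignment, the authors find that different AI weather models represent the atmosphere similarly despite differences in architecture and capacity. They propose that the models implement a particle description of the atmosphere, in which the latent variables at each mesh point correspond to a particle's position in a high-dimensional latent space, and hypothesize that the particles follow a gradient flow of a learned free energy functional. Analysis of the GraphCast and Aurora models is consistent with this hypothesis.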

Abstract

Could it be that AI weather models are solving physical equations, although they may not be the equations used by conventional NWP models? We compute correlations of forecast skill and Centered Kernel Alignment, providing evidence that different AI weather models represent the atmosphere in similar ways, despite differences in architecture and capacity. We argue that the architecture and training of the AI models constrain the form of the physical laws that they might simulate. In particular, we propose that the models implement a particle description of the atmosphere, where the latent variables at each mesh point correspond to the position of a particle in the high-dimensional latent space. We hypothesize that the movement of the particles follows a gradient flow in the latent space towards a minimum of a learned free energy functional. Analysis of the GraphCast and Aurora models shows that they make changes on large spatial scales in the early processor layers and move to smaller scales with increasing layer depth, consistent with the gradient flow hypothesis.
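
Centered Kernel Alignment compares two sets of representations of the same inputs. A minimal sketch of the linear-kernel variant follows, assuming rows are samples and columns are features: after centering each column, CKA(X, Y) = ||XᵀY||_F² / (||XᵀX||_F · ||YᵀY||_F). The abstract does not say which kernel the authors used, so the linear choice and the function names here are assumptions.

```python
import math


def center_columns(m: list[list[float]]) -> list[list[float]]:
    """Subtract each column's mean so every feature has zero mean over samples."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a: list[list[float]], b: list[list[float]]) -> float:
    """Squared Frobenius norm of a^T b, for a (n x p) and b (n x q) stored row-major."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: list[list[float]], y: list[list[float]]) -> float:
    """Linear CKA between two representations of the same n samples."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator  # undefined if either input is constant
```

Linear CKA is unchanged by rotating or uniformly rescaling either representation, which is what makes it usable across models with different latent spaces.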

2605.23712 2026-05-25 cs.CE cs.LG

Operator Learning for Reconstructing Flow Fields from Sparse Measurements: a Language Model Approach

Qian Zhang, George Em Karniadakis

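Affiliations: Division of Applied Mathematics, Brown University

AI summary: This paper addresses reconstructing flow fields from sparse measurements, a fundamental problem in fluid mechanics, and proposes an operator learning framework based on language-model architectures that reconstructs flows in a mesh-free manner. The method casts reconstruction as a sequence-to-sequence task with sparse measurements as context and unobserved locations as queries, capturing spatial correlations and long-range dependencies. On four benchmark datasets it achieves competitive reconstruction accuracy, even with less than 10% of the data observed.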

Abstract

Reconstructing flow fields from sparse measurements is a fundamental problem in fluid mechanics with broad implications for modeling, control, and design. In this work, we propose a novel operator learning framework that leverages the architecture of language models to perform flow reconstruction in a mesh-free manner. We reformulate flow field reconstruction as a sequence-to-sequence learning task, where sparse measurements are treated as context and unobserved locations as queries. Our model learns to reconstruct the full flow field from sparse inputs, effectively capturing spatial correlations and long-range dependencies. We evaluate the proposed approach on four benchmark datasets: (1) two-dimensional vortex street simulations, (2) daily average temperature data across the contiguous United States, (3) three-dimensional blood flow simulations based on dissipative particle dynamics, and (4) three-dimensional turbulent jet flow measurements obtained via particle tracking velocimetry. Across all cases, our method demonstrates competitive reconstruction accuracy, even with highly incomplete data (less than 10% observed), and achieves efficient performance. The results highlight the potential of language models as robust and scalable tools for scientific data reconstruction, and suggest a promising direction toward the development of foundation models for scientific and engineering applications.

2605.21504 2026-05-25 q-fin.ST cs.AI

Multivariate Financial Forecasting using the Chronos Time Series Foundation Models

Sanjiv R Das, Tarang Goyal, Mohini Yadav

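Affiliations: Santa Clara University

AI summary: Using the open-source time-series foundation model Chronos-2, this paper evaluates pretrained time-series models for economic and financial forecasting, focusing on whether multivariate (MV) inputs improve accuracy over univariate (UV) baselines. The study covers the Magnificent-7 equities, U.S. Treasury interest rates, and a combined panel, with rolling monthly evaluations from 2000 to 2025. MV forecasts consistently outperform UV forecasts for both interest rates and equities, with generally lower error dispersion. Mixing series across the two markets reduces accuracy, indicating that noisy context can degrade model performance.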

Comments 10 pages, 3 tables, 3 figures

Abstract

Using Chronos-2, an open-source time-series foundation model, we evaluate pretrained time-series models for economic and financial forecasting with an emphasis on whether multivariate (MV) inputs improve accuracy relative to univariate (UV) baselines. The study covers two panels (the Magnificent-7 equities and U.S. Treasury interest rates) as well as a combined panel, using rolling monthly evaluations from 2000 to 2025. We vary input window lengths and forecast horizons and report RMSE and MAPE. Across datasets, MV forecasts consistently outperform UV forecasts, with especially strong gains for interest rates and meaningful improvements for equities. Series-level comparisons show MV improvements in every case, and error dispersion is generally lower under MV inputs. We also provide parameter-heatmap and time-series visualizations. However, mixing time series across equity and interest rate markets reduces forecast accuracy, indicating that adding noisy context degrades model performance. Overall, the results indicate that foundation models can leverage cross-series information to improve forecast accuracy in finance, and that the benefits are strongest when related series are modeled jointly under disciplined rolling protocols. Beyond using an open-source foundation model, this paper also showcases how AI may be used for financial research.
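
The two reported error metrics and a rolling evaluation split can be written down in a few lines. The sketch below uses the standard definitions of RMSE and MAPE; `rolling_windows` is a hypothetical helper showing one plausible way to slide a fixed input window and forecast horizon over a series, and is not taken from the paper.

```python
import math
from typing import Iterator


def rmse(actual: list[float], forecast: list[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual: list[float], forecast: list[float]) -> float:
    """Mean absolute percentage error, in percent; undefined if any actual value is 0."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def rolling_windows(series: list[float], window: int, horizon: int) -> Iterator[tuple[list[float], list[float]]]:
    """Yield (context, target) pairs: `window` past points, then the next `horizon` points."""
    for end in range(window, len(series) - horizon + 1):
        yield series[end - window:end], series[end:end + horizon]
```

MAPE is scale-free but unstable near zero, which matters for interest rates close to 0%; RMSE keeps the units of the series.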

2604.26145 2026-05-25 cs.HC cs.AI

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

Ben Knight, Wm. Matthew Kennedy, Danielle Carvalho, Isaac Pattis, James Edgell

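Affiliations: Oxford University Press; Oxford Internet Institute; University of Oxford

AI summary: This study examines explanation failures in AI language learning systems: instant feedback that looks helpful on the surface but is fundamentally flawed, and that may reinforce learners' misconceptions and harm learning outcomes. It presents a portion of L2-Bench, a benchmark for evaluating AI systems in language education across critical feedback dimensions such as diagnostic accuracy and causes of error, and analyzes how AI systems can fail on these dimensions and the "explainability pitfalls" that result. It discusses how the language learning context amplifies these risks and calls for more attention to them when designing evaluation frameworks.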

Comments Accepted to Misleading Impacts Resulting from AI Generated Explanations (MIRAGE) Workshop @ IUI 2026

Abstract

AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners, and even teachers, to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. We argue that these failures are conducive to "explainability pitfalls": AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention, specifically when designing evaluation frameworks. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur, in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.

2604.24920 2026-05-25 cs.CR cs.AI

SUDP: Secret-Use Delegation Protocol for Agentic Systems

Xiaohang Yu, Hejia Geng, Xinmeng Zeng, William Knottenbelt

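Affiliations: Imperial College London; University of Oxford; Stanford University

AI summary: Agentic systems increasingly use user secrets to act on APIs, messaging platforms, and cloud services, and current runtimes often authorize actions by exposing the secret or an artifact derived from it, which creates security risks. This paper proposes the Secret-Use Delegation Protocol (SUDP), in which a requester proposes an operation, the user authorizes it, and a custodian performs the bounded use, so the requester gains no reusable authority. The authors identify seven security properties spanning authorization integrity and secret confidentiality; when integrated with a hardware-rooted runtime, SUDP satisfies them under standard cryptographic assumptions.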

Abstract

Agentic systems increasingly act with user secrets for APIs, messaging platforms, and cloud services. Today's agent runtimes typically implement authorization by exposure: enabling action often means placing a reusable secret, or a reusable artifact derived from it, inside the runtime, so a transient prompt-injection or tool-side compromise becomes durable account compromise. Existing defenses cover adjacent pieces such as secret storage, scoped delegation, sender-constrained tokens, and runtime monitoring, but leave the combined agentic obligation without a common specification: an untrusted autonomous requester should be able to cause a user-authorized secret-backed operation without gaining reusable authority over it. We formalize this as the Agent Secret Use (ASU) problem and identify seven security properties any solution must satisfy, spanning authorization integrity and secret confidentiality. We propose the Secret-Use Delegation Protocol (SUDP), in which a requester proposes a canonical operation, the user authorizes it with a fresh authenticator-backed grant, and a custodian redeems the grant to perform the bounded use; reusable authority never crosses the requester boundary. We specialize SUDP for LLM-driven agents, where it applies whenever a tool call would exercise user-enrolled authority-bearing material. Under standard cryptographic assumptions, SUDP satisfies all seven properties when integrated with a hardware-rooted runtime. A reference implementation is available at https://github.com/xhyumiracle/sudp.
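
The propose, authorize, redeem pattern described above can be pictured with a toy model. The sketch below is not SUDP and does not reflect its message formats or the reference implementation: the class names, the HMAC-based grant (standing in for an authenticator-backed grant), and the in-memory nonce set are all assumptions made for illustration. It shows only the core idea that a grant is bound to one canonical operation, is single-use, and never hands the secret to the requester.

```python
import hashlib
import hmac
import json
import secrets


def canonicalize(operation: dict) -> bytes:
    """Deterministic encoding so the user and the custodian authenticate the same bytes."""
    return json.dumps(operation, sort_keys=True, separators=(",", ":")).encode()


class User:
    def __init__(self, grant_key: bytes):
        self._grant_key = grant_key  # hypothetical key shared with the custodian only

    def authorize(self, operation: dict) -> dict:
        nonce = secrets.token_hex(16)  # fresh value per grant
        message = nonce.encode() + canonicalize(operation)
        tag = hmac.new(self._grant_key, message, hashlib.sha256).hexdigest()
        return {"nonce": nonce, "tag": tag}


class Custodian:
    def __init__(self, grant_key: bytes, api_secret: str):
        self._grant_key = grant_key
        self._api_secret = api_secret  # stays inside the custodian
        self._used_nonces: set[str] = set()

    def redeem(self, operation: dict, grant: dict) -> str:
        message = grant["nonce"].encode() + canonicalize(operation)
        expected = hmac.new(self._grant_key, message, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, grant["tag"]):
            raise PermissionError("grant does not match this operation")
        if grant["nonce"] in self._used_nonces:
            raise PermissionError("grant already redeemed")
        self._used_nonces.add(grant["nonce"])
        # A real custodian would perform the call with self._api_secret here.
        return f"performed {operation['action']}"
```

In this toy, the requester only ever holds the operation and the grant; replaying the grant or swapping in a different operation is rejected by the custodian.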

2602.12534 2026-05-25 stat.ML cs.DS cs.LG math.ST stat.TH

Linear Regression with Unknown Truncation Beyond Gaussian Features

Alexandros Kouridakis, Anay Mehrotra, Alkis Kalavasis, Constantine Caramanis

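Affiliations: UT Austin; Stanford University; Yale University

AI summary: This paper studies how to efficiently estimate the unknown regression parameters in truncated linear regression when the survival set of the response is unknown. Unlike prior work that assumes a known survival set or strong distributional assumptions such as Gaussianity, the proposed algorithm requires only sub-Gaussian feature vectors and runs in polynomial time. Its core is a new subroutine that efficiently learns unions of a bounded number of intervals from positive examples only, under a smoothness condition, which may be of independent interest.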

Abstract

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.
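
To make the observation model concrete, the sketch below simulates truncated data: a sample $(x, y)$ is kept only when $y$ lands in a survival set made of a few intervals. It illustrates the problem setup only, not the paper's estimator; the Gaussian features and noise, the function names, and the example intervals are assumptions chosen for simplicity.

```python
import random


def in_survival_set(y: float, intervals: list[tuple[float, float]]) -> bool:
    """True if y falls inside any of the closed intervals that make up the survival set."""
    return any(lo <= y <= hi for lo, hi in intervals)


def truncated_samples(w: list[float], intervals: list[tuple[float, float]], n: int, seed: int = 0):
    """Draw until n samples survive: y = <w, x> + noise is observed only if y is in the survival set."""
    rng = random.Random(seed)
    observed = []
    while len(observed) < n:
        x = [rng.gauss(0.0, 1.0) for _ in w]  # Gaussian features (assumption)
        y = sum(wi * xi for wi, xi in zip(w, x)) + rng.gauss(0.0, 1.0)
        if in_survival_set(y, intervals):
            observed.append((x, y))  # rejected draws are never seen by the learner
    return observed
```

The learner sees only the surviving pairs, never the rejected ones, which is why the survival set has to be inferred from positive examples alone.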

2511.15503 2026-05-25 cs.AR cs.DC cs.LG cs.PF

DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

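Affiliations: University of Toronto; Vector Institute; Barcelona Supercomputing Center; ETH Zürich; Nvidia; Max Planck Institute for Software Systems

AI summary: This paper presents DCC, a data-centric compiler for machine learning kernels on Processing-In-Memory (PIM) architectures. It targets the performance bottleneck caused by mismatched data layouts between the Host processor and PIM cores in memory-intensive workloads such as large language models. DCC jointly optimizes data rearrangement and compute code generation, combining a multi-layer PIM abstraction with a performance prediction model to support different PIM devices. Experiments report speedups over GPU-only execution on individual ML kernels and on end-to-end LLM inference.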

Abstract

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent and need to be jointly optimized during the tuning process. Therefore, we design DCC, the first data-centric ML compiler for PIM systems that jointly optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations, and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations on various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x on average (up to 7.71x on LLaMA-2) over GPU. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.
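
The layout mismatch behind the rearrangement cost can be shown with a toy address mapping. In the sketch below, the Host layout interleaves consecutive elements across banks, while the PIM layout keeps each contiguous chunk inside one bank. Element-granularity interleaving and the function names are simplifying assumptions; real devices interleave at coarser granularities, and this is not DCC's code.

```python
def host_layout(data: list, num_banks: int) -> list[list]:
    """Interleaved: element i is stored in bank i % num_banks."""
    banks: list[list] = [[] for _ in range(num_banks)]
    for i, value in enumerate(data):
        banks[i % num_banks].append(value)
    return banks


def pim_layout(data: list, num_banks: int) -> list[list]:
    """Blocked: each bank holds one contiguous chunk of the data."""
    chunk = -(-len(data) // num_banks)  # ceiling division
    return [data[b * chunk:(b + 1) * chunk] for b in range(num_banks)]


def rearrange_host_to_pim(banks: list[list]) -> list[list]:
    """Recover the logical order from the interleaved banks, then re-block it per bank."""
    num_banks = len(banks)
    total = sum(len(bank) for bank in banks)
    flat = [banks[i % num_banks][i // num_banks] for i in range(total)]
    return pim_layout(flat, num_banks)
```

With eight elements and four banks, nearly every element changes bank in `rearrange_host_to_pim`, which hints at why the rearrangement cost cannot be ignored when choosing a compute schedule.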

2502.20349 2026-05-25 q-bio.NC cs.AI

Naturalistic Computational Cognitive Science: Towards generalizable models and theories that capture the full range of natural behavior

Wilka Carvalho, Andrew Lampinen

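Affiliations: Kempner Institute for the Study of Natural and Artificial Intelligence; Harvard University; Google DeepMind

AI summary: This paper discusses how recent progress in AI can help cognitive science build generalizable theories that span the full range of natural situations and behaviors. It argues that more naturalistic experimental paradigms, and computational models that accommodate them, help capture natural intelligence and improve how well theories generalize. The paper reviews related work in cognitive science, neuroscience, and AI, and suggests that integrating these advances allows more naturalistic phenomena to be studied without giving up experimental control or theoretical depth.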

Abstract

How can cognitive science build generalizable theories that span the full scope of natural situations and behaviors? We argue that progress in Artificial Intelligence (AI) offers timely opportunities for cognitive science to embrace experiments with increasingly naturalistic stimuli, tasks, and behaviors; and computational models that can accommodate these changes. We first review a growing body of research spanning neuroscience, cognitive science, and AI that suggests that incorporating a broader range of naturalistic experimental paradigms, and models that accommodate them, may be necessary to resolve some aspects of natural intelligence and ensure that our theories generalize. We review cases from cognitive science and neuroscience where naturalistic paradigms elicit distinct behaviors or engage different processes. We then discuss recent progress in AI that shows that learning from naturalistic data yields qualitatively different patterns of behavior and generalization, and examine how these findings impact the conclusions we draw from cognitive modeling, and can help yield new hypotheses for the roots of cognitive and neural phenomena. We then suggest that integrating recent progress in AI and cognitive science will enable us to engage with more naturalistic phenomena without giving up experimental control or the pursuit of theoretically grounded understanding. We offer practical guidance on how methodological practices can contribute to cumulative progress in naturalistic computational cognitive science, and illustrate a path towards building computational models that solve the real problems of natural cognition, together with a reductive understanding of the processes and principles by which they do so.

2605.23663 2026-05-25 cs.HC cs.LG

Detecting Drunk Driving Using Off-the-Shelf Smartwatches

使用现成智能手表检测酒驾

Robin Deuber, Lanlan Yang, Michal Bechny, Christoph Heck, Matthias Pfäffli, Matthias Bantle, Florian von Wangenheim, Elgar Fleisch, Wolfgang Weinmann, Manuel Günther, Felix Wortmann, Varun Mishra

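Affiliations * University of Bern; University of St. Gallen; Northeastern University

AI summary This paper studies whether off-the-shelf consumer smartwatches can detect alcohol-impaired driving. The system uses wrist accelerometer data and physiological signals derived from heart rate variability, with data collected in a randomized, controlled three-arm test-track study (n=54). The authors train logistic regression models on window-aggregated features and a two-tower 1D convolutional neural network; the CNN reaches a participant-averaged AUROC of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended 0.05 g/dL limit.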

Comments 27 pages, 7 figures

详情
AI中文摘要

酒精影响驾驶仍然是道路交通事故和死亡的一个主要但可预防的原因,许多驾驶员低估了自己的醉酒程度。与车载系统相比,使用消费级智能手表的移动酒驾检测提供了一种可扩展的方式,无需额外车载硬件即可触发预防性干预并提高意识。我们引入了一个系统,利用手腕加速度计数据和心率变异性衍生的生理信号来检测酒精相关的驾驶障碍。我们在一个随机、对照的三组测试轨道研究(n=54)中收集数据,并训练了带有窗口聚合特征的逻辑回归模型和一个双塔一维卷积神经网络(CNN),以检测酒精影响下的驾驶。CNN在检测任何酒精中毒时实现了参与者平均受试者工作特征曲线下面积(AUROC)为0.88,在检测驾驶超过WHO推荐的0.05 g/dL限值时AUROC为0.86。据我们所知,这是第一个(1)展示使用消费级智能手表检测酒驾的工作,(2)在封闭测试轨道的真实车辆中开发和评估此类系统,以及(3)严格评估对未见参与者的泛化能力。这些发现共同凸显了基于可穿戴设备的传感在支持可扩展、测量驱动的酒精相关交通伤害预防方面的潜力。

英文摘要

Alcohol-impaired driving remains a major yet preventable cause of road traffic injury and death, with many drivers underestimating their level of intoxication. Compared to in-vehicle systems, mobile drunk-driving detection using consumer smartwatches offers a scalable way to trigger preventive interventions and increase awareness without additional in-vehicle hardware. We introduce a system that leverages wrist accelerometer data and heart rate variability-derived physiological signals to detect alcohol-related driving impairment. We collected data in a randomized, controlled three-arm test-track study (n=54) and trained both logistic regression models with window-aggregated features and a two-tower 1D convolutional neural network (CNN) to detect alcohol-impaired driving. The CNN achieved a participant-averaged area under the receiver operating characteristic curve (AUROC) of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended limit of 0.05 g/dL. To the best of our knowledge, this is the first work to (1) demonstrate drunk-driving detection using consumer smartwatches, (2) develop and evaluate such a system in a real vehicle on a closed test track, and (3) rigorously assess generalization to unseen participants. Together, these findings highlight the potential of wearable-based sensing to support scalable, measurement-driven prevention of alcohol-related traffic harm.
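The abstract reports a participant-averaged AUROC but does not spell out the averaging procedure. The sketch below assumes one plausible reading: compute AUROC separately for each held-out participant, then take the unweighted mean. The function names and the `(participant_id, label, score)` record layout are hypothetical, not taken from the paper.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def participant_averaged_auroc(records):
    """records: iterable of (participant_id, label, score) tuples (assumed layout)."""
    by_participant = {}
    for pid, label, score in records:
        labels, scores = by_participant.setdefault(pid, ([], []))
        labels.append(label)
        scores.append(score)
    # Unweighted mean over participants, so each person counts equally.
    values = [auroc(labels, scores) for labels, scores in by_participant.values()]
    return sum(values) / len(values)
```

Averaging per participant keeps drivers with many recorded windows from dominating the metric, which matters when the stated goal is generalization to unseen participants.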

2605.23643 2026-05-25 cs.CR cs.LG

Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

更少努力,更短证明:Tamarin中安全协议分析的强化学习

Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem, Niklas Medinger

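Affiliations * CISPA Helmholtz Center for Information Security; Technical University of Munich

AI summary This paper presents a reinforcement learning framework, inspired by AlphaZero and AlphaProof, for proof search in the Tamarin protocol verifier. A stateless API exposes Tamarin as an RL environment, and a neural heuristic trained on completed subproofs guides Monte Carlo Tree Search. Across 16 case studies, the method finds more proofs automatically than Tamarin's standard search and produces shorter proofs than both the standard and human-engineered heuristics, reducing the human effort required.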

详情
AI中文摘要

像Tamarin和ProVerif这样的工具在分析和验证复杂的现实世界协议(如EMV、5G和WPA2)方面取得了显著成功,甚至检测到了零日漏洞。尽管取得了这些成功,验证此类协议仍然是一项耗时、具有挑战性的任务,通常需要大量的人力和专业知识。在本文中,我们提出了一个受AlphaZero和AlphaProof启发的强化学习(RL)框架,该框架为Tamarin实现了一种新的证明搜索风格。我们为Tamarin开发了一个无状态API,充当经典的RL环境。我们通过一个从完成的子证明中学习的神经启发式来指导蒙特卡洛树搜索(MCTS)。我们在16个案例研究上评估了我们的框架,范围从经典协议模型到近期出版物中具有挑战性的最先进协议模型。我们的方法比Tamarin的标准搜索自动找到更多的证明,并且比标准和人工设计的启发式产生更短的证明。我们的流程开箱即用,可帮助Tamarin用户在活跃研究中减少所需的人力。此外,我们的标准化接口为用户提供了一种与Tamarin交互的程序化方式。最后,我们的工作展示了将基于RL的方法适应Tamarin领域的巨大潜力。

英文摘要

Tools like Tamarin and ProVerif have achieved notable success in analyzing and verifying complex real-world protocols such as EMV, 5G, and WPA2, even detecting zero-day exploits. Despite these successes, verifying such protocols remains a time-consuming, challenging task, often requiring significant human effort and expertise. In this paper, we present a reinforcement learning (RL) framework inspired by AlphaZero and AlphaProof that implements a new style of proof search for Tamarin. We have developed a stateless API for Tamarin that acts as a classical RL environment. We guide a Monte Carlo Tree Search (MCTS) by a neural heuristic that learns from completed subproofs. We evaluate our framework on 16 case studies, ranging from classical protocol models to challenging state-of-the-art protocol models from recent publications. Our method finds more proofs automatically than Tamarin's standard search and produces shorter proofs than both the standard and human-engineered heuristics. Our pipeline is applicable out of the box to assist Tamarin users in active research, reducing the human effort required. Moreover, our standardized interface provides a programmatic way for users to interact with Tamarin. Finally, our work demonstrates the promising potential of adapting RL-based methods to the Tamarin domain.
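AlphaZero-style search usually chooses which node to expand with a PUCT rule that trades off the current value estimate against a learned prior. The abstract does not say which selection rule the authors use, so the following is only a generic sketch of that idea. The names `puct_score`, `select_child`, `c_puct`, and the proof-step labels are all hypothetical.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Generic PUCT score: exploitation term plus a prior-weighted exploration bonus."""
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_child(children, parent_visits, c_puct=1.5):
    """children: dict mapping a proof-step name to (q_value, prior, visit_count)."""
    return max(
        children,
        key=lambda name: puct_score(
            children[name][0], children[name][1], parent_visits, children[name][2], c_puct
        ),
    )
```

In this form, a rarely visited proof step with a moderate prior can outrank a well-explored step with a higher value estimate, which is what lets a learned heuristic steer the search toward unexplored proof branches.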

2605.23623 2026-05-25 cs.CR cs.AI cs.LG

Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection

时间概念漂移下的对抗脆弱性:Android恶意软件检测的纵向研究

Ahmed Sabbah, Mohammed Kharma, Radi Jarrar, Samer Zein, David Mohaisen

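Affiliations * Department of Computer Science, Birzeit University; Department of Computer Science, University of Central Florida

AI summary This paper evaluates the adversarial robustness of Android malware detectors under temporal concept drift, using more than a decade of applications with static and dynamic feature representations. Models are assessed under three deployment protocols (same-year training and testing, cross-year deployment without updates, and expanding-window retraining), with temporal linkage metrics (RobustDrop, ΔASR, and Adversarial Amplification Factor) relating distribution shift to robustness degradation. As the train-test gap grows, clean and adversarial accuracy decline, while attack success increases in a configuration-dependent way; expanding-window retraining mitigates but does not eliminate the robustness loss.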

Comments 42 pages, 4 tables, 10 figures

详情
AI中文摘要

我们提出了一种纵向的、考虑漂移的对抗鲁棒性评估,使用从模拟器和真实设备执行中提取的静态和动态特征表示,跨越超过十年的Android应用。数据集按年度切片组织,并在三种模拟现实学习场景的部署协议下进行评估:(1)同年度训练和测试,(2)跨年度部署且不更新模型,(3)使用累积历史数据进行扩展窗口重训练。在多个分类器家族中,使用FGSM和SPSA在可行性约束下生成对抗样本。我们测量了干净性能、对抗准确率(AA)、攻击成功率(ASR),并引入了时序关联指标——RobustDrop、$\Delta$ASR和对抗放大因子(AAF)——以量化分布漂移与鲁棒性退化之间的关系。结果表明,在评估的基于迁移的特征空间设置下,时间分离与对抗鲁棒性降低相关。随着训练-测试间隔增加,干净准确率和对抗准确率下降,而攻击成功率呈现配置相关的增加,特别是在FGSM扰动和静态特征下。扩展窗口重训练可以缓解但无法消除在持续分布演化下的鲁棒性损失。这些发现表明,在评估智能检测系统在演化数据分布下的长期鲁棒性时,应考虑时间漂移,并强调了在长期对抗环境中需要漂移感知的鲁棒性评估框架。

英文摘要

We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics -- RobustDrop, $\Delta$ASR, and Adversarial Amplification Factor (AAF) -- to quantify the relationship between distribution shift and robustness degradation. Results show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
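The three deployment protocols can be read as three ways of pairing training years with a test year. The sketch below enumerates those pairings and adds one common definition of attack success rate. Both are assumptions: the abstract does not say whether cross-year deployment tests only on later years, and the paper's exact ASR formula may differ. `deployment_splits` and `attack_success_rate` are hypothetical names.

```python
def deployment_splits(years):
    """Return (train_years, test_year) pairs for each protocol."""
    years = sorted(years)
    return {
        # (1) train and test on the same yearly slice
        "same_year": [([y], y) for y in years],
        # (2) train once, deploy on later years without updating (assumed: later years only)
        "cross_year": [([tr], te) for tr in years for te in years if te > tr],
        # (3) retrain on all history accumulated before the test year
        "expanding_window": [(years[:i], years[i]) for i in range(1, len(years))],
    }


def attack_success_rate(y_true, y_clean_pred, y_adv_pred):
    """Share of samples classified correctly on clean input that become wrong under attack."""
    eligible = [(t, a) for t, c, a in zip(y_true, y_clean_pred, y_adv_pred) if c == t]
    if not eligible:
        return 0.0
    return sum(1 for t, a in eligible if a != t) / len(eligible)
```

Restricting ASR to samples that were correct before the attack separates adversarial damage from errors already caused by drift, which is the distinction the temporal linkage metrics are meant to quantify.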

2605.23619 2026-05-25 eess.AS cs.SD

Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech

Canary与WavLM的帧对齐融合用于助听器处理语音的非侵入式清晰度预测

Kazushi Nakazawa

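Affiliations * Advanced Media, Inc.

AI summary This paper studies non-intrusive (reference-free) intelligibility prediction for hearing-aid-processed speech in the 3rd Clarity Prediction Challenge, using two frozen speech encoders, Canary and WavLM. It compares several fusion strategies and finds that the best model temporally prepares WavLM with a learnable strided convolution and fuses it with Canary on the coarser Canary timeline before pooling, reaching Eval RMSE 24.96±0.06 and Eval correlation 0.796±0.001. The analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.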

Comments 7 pages, 2 figures

详情
AI中文摘要

非侵入式清晰度预测估计听力受损听众对助听器处理语音的理解程度,无需干净参考。我们在第三届清晰度预测挑战赛中研究此任务,使用两个冻结的语音编码器Canary和WavLM。核心问题不仅在于是否应结合互补的预训练表示,还在于它们的交互应发生在何处。我们在共享的左右保留双耳框架下比较了单骨干基线、统一分数平均、池后融合、交叉注意力、帧对齐融合和反向对齐。在比较的系统中,最佳模型使用可学习的步进卷积对WavLM进行时间准备,并在池化前在较粗的Canary时间线上将其与Canary融合,达到Eval RMSE 24.96±0.06和Eval Corr 0.796±0.001。严重性、增强系统、层窗口和时间偏移分析表明,池化前的粗局部时间对应是该任务的有用归纳偏置。

英文摘要

Non-intrusive intelligibility prediction estimates how well hearing-impaired listeners understand hearing-aid-processed speech without a clean reference. We study this task in the 3rd Clarity Prediction Challenge using two frozen speech encoders, Canary and WavLM. The central question is not only whether complementary pretrained representations should be combined, but where their interaction should occur. We compare single-backbone baselines, uniform score averaging, pool-late fusion, cross-attention, frame-aligned fusion, and reverse alignment under a shared left/right-preserving binaural framework. Among the compared systems, the best model temporally prepares WavLM with a learnable strided convolution and fuses it with Canary on the coarser Canary timeline before pooling, reaching Eval RMSE 24.96$\pm$0.06 and Eval Corr 0.796$\pm$0.001. Severity, enhancement-system, layer-window, and temporal-shift analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.
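Frame-aligned fusion needs the two encoders on a common timeline. The sketch below shows the mechanics in a deliberately simplified form: a strided 1D convolution brings the finer sequence down to the coarser frame rate, and the two streams are then concatenated frame by frame. It assumes a scalar kernel shared across feature dimensions with fixed weights, whereas the paper's convolution is learnable; the stride, kernel, and function names are hypothetical.

```python
def strided_conv1d(frames, kernel, stride):
    """Downsample a list of feature vectors along time with a shared scalar kernel."""
    k = len(kernel)
    out = []
    for start in range(0, len(frames) - k + 1, stride):
        dim = len(frames[start])
        out.append(
            [sum(kernel[j] * frames[start + j][d] for j in range(k)) for d in range(dim)]
        )
    return out


def frame_aligned_fusion(coarse_frames, fine_frames, kernel, stride):
    """Bring the fine stream onto the coarse timeline, then concatenate per frame."""
    downsampled = strided_conv1d(fine_frames, kernel, stride)
    n = min(len(coarse_frames), len(downsampled))  # guard against off-by-one lengths
    return [coarse_frames[i] + downsampled[i] for i in range(n)]
```

Fusing before pooling keeps each coarse frame paired with the fine frames that cover the same stretch of audio, which is the local temporal correspondence the abstract identifies as a useful inductive bias.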

2605.23604 2026-05-25 eess.AS cs.SD

Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss

基于对齐感知声学融合的词级建模用于听力损失患者文本辅助可懂度预测

Kazushi Nakazawa

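Affiliations * Advanced Media, Inc.

AI summary This paper addresses text-assisted speech intelligibility prediction for hearing-impaired listeners in CPC3 by modeling word-level correctness conditioned on the reference transcript. A frozen Whisper encoder analyzes the degraded speech, a teacher-forced decoder conditions on the canonical transcript, and sentence intelligibility is the average predicted correctness probability over valid reference words. Adding a word-aligned local acoustic branch and an utterance-level global acoustic branch for calibration improves on the decoder baseline (RMSE 24.92, correlation 0.795), reaching RMSE 24.39 and correlation 0.806.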

Comments 7 pages, 2 figures

详情
AI中文摘要

我们针对CPC3中听力受损者的文本辅助语音可懂度预测问题。尽管目标是句子级百分比,但它由参考词识别结果决定。我们将预测建模为参考条件下的词级正确性建模:冻结的Whisper编码器分析退化语音,教师强制解码器以规范转录为条件,句子可懂度通过对有效参考词的预测正确概率取平均得到。为了补充转录条件解码器状态,我们添加了一个基于字符级交叉注意力对齐的词对齐局部声学分支,以及一个用于校准的语句级全局声学分支。在官方评估集上,解码器基线获得RMSE 24.92和相关系数0.795,而联合融合将错误词F1提升至0.778,MCC 0.626,相关系数0.806,RMSE 24.39。使用Whisper medium的类似趋势表明,增益来自预测粒度和对齐感知融合。

英文摘要

We address text-assisted speech intelligibility prediction for hearing-impaired listeners in CPC3. Although the target is a sentence-level percentage, it is determined by reference-word recognition outcomes. We formulate prediction as reference-conditioned word-level correctness modeling: a frozen Whisper encoder analyzes degraded speech, a teacher-forced decoder conditions on the canonical transcript, and sentence intelligibility is obtained by averaging predicted correctness probabilities over valid reference words. To complement transcript-conditioned decoder states, we add a word-aligned local acoustic branch based on character-level cross-attention alignment and an utterance-level global acoustic branch for calibration. On the official evaluation set, the decoder baseline obtains RMSE 24.92 and correlation 0.795, while joint fusion improves to incorrect-word F1 0.778, MCC 0.626, correlation 0.806, and RMSE 24.39. A similar trend with Whisper medium suggests that the gain comes from prediction granularity and alignment-aware fusion.
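The step from word-level predictions to the sentence-level target is a masked average. A minimal sketch, assuming the per-word correctness probabilities and a validity mask are already available; the function name and the 0-100 scaling are illustrative choices based on the abstract's description of the target as a percentage.

```python
def sentence_intelligibility(word_probs, valid_mask):
    """Average predicted correctness over valid reference words, as a percentage."""
    kept = [p for p, valid in zip(word_probs, valid_mask) if valid]
    if not kept:
        raise ValueError("sentence has no valid reference words")
    return 100.0 * sum(kept) / len(kept)
```

For example, probabilities `[0.9, 0.5, 0.1, 0.7]` with the third word masked out give a predicted intelligibility of about 70%.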

2605.23591 2026-05-25 stat.ML cond-mat.dis-nn cs.LG math.ST stat.TH

Asymmetric Scaling Laws from Sparse Features

基于稀疏特征的非对称缩放定律

John Sous, Michael Winer

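Affiliations * Yale University; Energy Sciences Institute; Institute for Advanced Study; Alignment Research Center

AI summary This paper introduces a model of neural scaling laws under sparse activations, in which test loss is often dominated by rare coordinates never observed in the training input, a bottleneck absent from dense models. The authors derive the asymptotic population loss in the underparameterized and overparameterized regimes and show a double-descent peak near the interpolation threshold, with two distinct scaling exponents whose gap is set by the degree of sparsity. They also derive a compute-optimal frontier that favors dataset size over model capacity, give a scaling law for the probability that fixed-step gradient descent becomes unstable, and show that the sparsity-induced effect persists under nonlinear activations.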

详情
AI中文摘要

我们引入了一个稀疏激活下的神经缩放定律模型。在该模型中,测试损失通常由训练输入中从未观察到的稀有坐标主导。这种机制引入了一个密集模型中不存在的新瓶颈。我们推导了欠参数化和过参数化区域的渐近总体损失,并表明损失在插值阈值附近出现双下降峰值——其中参数数量刚好足以拟合训练数据——导致损失曲线由两个不同的缩放指数控制:一个用于过参数化区域,一个用于欠参数化区域,其差距由稀疏程度决定。此外,我们推导了一个计算最优边界,在固定计算预算下倾向于增加数据集大小而非模型容量。我们还分析了梯度下降动力学,并确定了固定步长梯度下降变得不稳定的概率的缩放定律。我们进一步表明,稀疏诱导效应在非线性激活下仍然存在。

英文摘要

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold -- where the number of parameters is just sufficient to fit the training data -- resulting in a loss curve governed by two distinct scaling exponents -- one for the overparameterized regime and one for the underparameterized regime -- with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.
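The bottleneck from never-observed coordinates can be illustrated with elementary probability. The sketch assumes each coordinate is active independently in each training sample with a fixed probability `p`; this is an illustrative simplification, not the paper's model, and the function names are hypothetical.

```python
def prob_never_observed(p, n_samples):
    """Chance that a coordinate active with probability p appears in none of n samples."""
    return (1.0 - p) ** n_samples


def expected_unseen(activation_probs, n_samples):
    """Expected number of coordinates that stay unobserved after n samples."""
    return sum(prob_never_observed(p, n_samples) for p in activation_probs)
```

Under this assumption, a coordinate with `p = 1e-4` is still unseen after 1,000 samples with probability of about 0.90, so adding data is the only way to shrink the pool of coordinates the model has never encountered.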

2605.23572 2026-05-25 cs.IR cs.AI cs.LG

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

HARNESS-LM: 一种在赞助搜索中利用小语言模型的三阶段训练方案

Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani, Amit Singh, Manik Varma

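Affiliations * Microsoft AI

AI summary Balancing retrieval quality against production latency is a key challenge in sponsored search. This paper proposes HARNESS-LM (HLM), a three-phase training framework that transfers the capabilities of large retrievers into compact models: train a billion-parameter-scale teacher, distill it into a sub-600M-parameter student by aligning query representations with an L2 objective, then apply a contrastive refinement stage. On a Bing Ads benchmark, HLM recovers over 98% of the teacher's precision with up to 27x lower query-encoder latency and 20x higher throughput; online A/B testing shows +1% revenue, +0.6% impressions, and +0.4% clicks.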

Comments 9 pages, 3 figures, 10 tables

详情
AI中文摘要

在赞助搜索的竞争格局中,平衡检索质量与生产延迟是一个关键挑战。尽管基于小语言模型(SLM)的大型检索模型(如Qwen3-Embedding-4B/8B)在公共基准上设定了强上限,但其在高吞吐、延迟敏感环境中的部署仍不切实际。本文提出HARNESS-LM(HLM),一个三阶段训练框架,用于将大规模检索器的能力迁移至紧凑、成本高效的模型。该方法包括:(1)通过微调十亿参数规模的SLM训练高性能参考(“教师”)检索器;(2)通过L2目标对齐查询表示,将知识蒸馏至低于600M参数的学生编码器;(3)应用最终对比精炼阶段以优化学生的检索性能。我们还对关键设计选择进行了全面的实证研究,包括对齐目标、嵌入维度、模型规模、架构和优化策略,以确定在生产环境中最为有效的配置。在真实世界的Bing Ads评估基准上,HLM在多种设置下恢复了参考检索器超过98%的精度,同时在NVIDIA A100 GPU上实现了高达27倍的在线查询编码器延迟降低和20倍的吞吐量提升。在Bing Ads上的在线A/B测试进一步显示,与当前生产中运行的检索器集成(部署190M参数模型)相比,收入提升+1%,展示量提升+0.6%,点击量提升+0.4%,清晰突显了HLM方案在真实世界赞助搜索场景中的实际效果。

英文摘要

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
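Phase (2) aligns student and teacher query representations with an L2 objective. A minimal sketch of such a loss follows, assuming the two embeddings already share a dimensionality (for example after a projection layer) and that the loss is the mean squared Euclidean distance over a batch. The abstract confirms neither detail, and `l2_alignment_loss` is a hypothetical name.

```python
def l2_alignment_loss(student_vecs, teacher_vecs):
    """Mean squared Euclidean distance between paired student and teacher embeddings."""
    if len(student_vecs) != len(teacher_vecs) or not student_vecs:
        raise ValueError("need a non-empty batch of paired embeddings")
    total = 0.0
    for s, t in zip(student_vecs, teacher_vecs):
        if len(s) != len(t):
            raise ValueError("student and teacher embeddings must have equal dimension")
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student_vecs)
```

An alignment loss of this kind only needs queries and the teacher's embeddings, not relevance labels, which is why a separate contrastive stage is still useful afterwards to tune the student for retrieval itself.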

2605.23562 2026-05-25 cs.MA cs.AI

ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

ARMS: 稀疏奖励多智能体强化学习的自动奖励塑形

Elie Abboud, Oren Gal

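Affiliations * Department of Marine Technologies

AI summary Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where reward shaping must preserve the strategic structure of the problem. This paper proposes ARMS, a self-supervised framework that learns dense shaping signals from sparse environment rewards through trajectory ranking, and shows that, if certain conditions hold, shaping preserves each agent's best-response set under fixed opponent policies and hence the set of Nash equilibria. Experiments in a partially observable multi-agent pathfinding domain show improved sample efficiency and generalization to unseen environments, and reveal an oscillatory failure mode caused by limited exploration and coupled policy-reward dynamics.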

详情
AI中文摘要

稀疏奖励是多智能体强化学习(MARL)中的一个主要瓶颈,其中同时学习会导致非平稳性并使奖励设计尤其精细。奖励塑形可以加速学习,但在多智能体环境中,它必须保留问题的战略结构,而不仅仅是改善短期优化。我们提出了多智能体系统中的自动奖励塑形(ARMS),这是一个用于MARL的自监督奖励塑形框架,通过轨迹排序从稀疏环境奖励中学习稠密塑形信号。由于单智能体轨迹排序保证不能直接迁移到MARL,我们通过条件最优反应推理重新表述策略不变性,并证明如果某些条件成立,则使用塑形奖励在固定对手策略下保留每个智能体的最优反应集,从而保留纳什均衡集。在此视角指导下,ARMS在策略学习和奖励学习之间交替,同时跨智能体共享塑形参数以提高效率。在部分可观测的多智能体路径规划领域中的实验表明,ARMS在奖励稀疏性和智能体数量增加的情况下提高了采样效率,泛化到未见过的环境,并揭示了一种MARL特有的失败模式,其中有限的探索和耦合的策略-奖励动态导致振荡行为。增加探索可缓解此效应并稳定学习。据我们所知,ARMS是第一个其设计动机来自博弈论均衡保持结果的MARL自动奖励塑形框架。

英文摘要

Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy--reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.

2605.23550 2026-05-25 math.OC cs.AI cs.NA math.NA

RA-DCA: A Randomized Active-Set DCA for Directional Stationarity in Max-Structured DC Programs

RA-DCA:面向最大结构DC规划方向平稳性的随机活动集DCA

Yi-Shuai Niu

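Affiliations * Beijing Institute of Mathematical Sciences and Applications

AI summary This paper studies nonsmooth difference-of-convex (DC) programs whose subtracted convex term is a finite maximum of smooth convex functions. Standard DCA may converge to critical points that are not directionally stationary, while exact active-vertex screening is expensive when active sets are large or combinatorial. The proposed RA-DCA projects active gradients onto sampled directions, checks a sampled vertex residual, and uses a small linear program only as a low-residual fallback, which preserves DCA's descent structure and reduces randomized screening to matrix multiplications. In MATLAB experiments the safeguard avoids nonstationary critical points, and the screening idea remains useful on combinatorial problems.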

Comments 40 pages, 7 figures

详情
AI中文摘要

我们研究非光滑差凸规划,其中被减的凸项是光滑凸函数的有限最大值。在此设定下,标准DCA迭代可能收敛到非方向平稳的临界点,而当活动集较大或具有组合性质时,精确的活动顶点筛选可能代价高昂。我们提出RA-DCA,一种顶点优先的随机活动集DCA,它将活动梯度投影到采样方向,检查采样顶点残差,并仅在低残差凸组合回退时使用一个小型线性规划。该方法保留了DCA的下降结构,并将随机筛选层简化为矩阵乘法。在所述正则性、数值活动集一致性和随机嵌入假设下,受保护方法生成的每个聚点以概率1是方向平稳的。MATLAB实验首先在退化的最大仿射、最大二次和稀疏支撑函数模型上测试该定理,其中保护机制避免了非平稳临界点并紧密跟踪完整活动顶点扫描。随后,块top-k测试表明,当精确聚合枚举具有组合性质时,相同的筛选思想仍然有用。修剪回归、互补性和QUBO诊断区分了活动集选择有助于问题的情况与由多起点搜索、DC分裂或其他问题特定特征主导的情况。

英文摘要

We study nonsmooth difference-of-convex programs whose subtracted convex term is a finite maximum of smooth convex functions. In this setting, standard DCA iterations may converge to critical points that are not directionally stationary, whereas exact active-vertex screening can be expensive when active sets are large or combinatorial. We propose RA-DCA, a vertex-first randomized active-set DCA that projects active gradients onto sampled directions, checks a sampled vertex residual, and uses a small linear program only as a low-residual convex-combination fallback. The method preserves the descent structure of DCA and reduces the randomized screening layer to matrix multiplications. Under the stated regularity, numerical active-set consistency, and random-embedding assumptions, every accumulation point generated by the safeguarded method is directionally stationary with probability one. MATLAB experiments first test the theorem on degenerate max-affine, max-quadratic, and sparse support-function models, where the safeguard avoids nonstationary critical points and closely tracks a full active-vertex scan. Block top-k tests then show that the same screening idea remains useful when exact aggregate enumeration is combinatorial. Trimmed-regression, complementarity, and QUBO diagnostics separate cases where active-set selection helps from cases dominated by multistart search, the DC split, or other problem-specific features.

2605.23508 2026-05-25 cs.GR cs.AI cs.CV cs.MM eess.IV

DrawVideo: Generating Long Video from Storyboard Keyframe Sketches

DrawVideo: 从故事板关键帧草图生成长视频

Chuanzhi Xu, Huiqi Liang, Bang Shi, Huiming Zhang, Yifan Xiao, Guangcheng Lin, Haodong Chen, Qiang Qu, Zhicheng Lu, Weidong Cai

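Affiliations * The University of Sydney; Charles Sturt University

AI summary DrawVideo is a sketch-guided, storyboard-driven framework for controllable long-video generation. It decomposes a long video into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt, and follows a hierarchical strategy: generate a structure-aligned reference keyframe, expand the motion prompt into derivative keyframes representing action states, then synthesize clips between adjacent keyframes. The paper also introduces SketchLongVideo, described as the first dataset for sketch-guided text-to-long-video generation; experiments report strong structural controllability, appearance consistency, and visual stability.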

Comments 45 pages, 19 figures

详情
AI中文摘要

长视频生成需要高保真合成、连贯的叙事结构以及用户对长时间跨度的控制。现有的文本到视频方法通常依赖单一长提示,限制了对姿态、构图、布局和运动的控制。我们提出 DrawVideo,一种草图引导、故事板驱动的可控长视频生成框架。DrawVideo 将长视频分解为独立可控的镜头,每个镜头由黑白草图、外观提示和运动提示定义。草图控制姿态和布局,外观提示定义身份、场景和风格,运动提示引导时间动态。DrawVideo 遵循分层“全局多镜头、局部单草图”策略:首先生成结构对齐的参考关键帧,然后将运动提示扩展为代表动作状态的衍生关键帧,最后在相邻关键帧之间合成片段以构建每个镜头。我们还引入了 SketchLongVideo,这是首个用于草图引导的文本到长视频生成的数据集,通过镜头检测、关键帧提取、视觉语言识别、提示分解和草图转换从动画视频构建。实验表明,DrawVideo 实现了强大的结构可控性、外观一致性、视觉稳定性和连贯的长视频生成。

英文摘要

Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.

2605.23459 2026-05-25 cs.SE cs.AI

AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

AI 保证:企业 AI 系统的综合测试策略

Chitra Badagi, Divye Singh, Animesh Sen, Adinath Shirsath

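Affiliations * Thoughtworks Technologies

AI summary This paper proposes a comprehensive assurance strategy for enterprise AI systems built on large language models, retrieval pipelines, and autonomous agents, which pose risks that traditional software quality assurance was not designed to address. It argues that AI testing should focus on continuous risk reduction rather than strict correctness verification, and that evaluation should be a core engineering discipline alongside development. The paper introduces a structured AI Failure Taxonomy, a revised five-layer AI Assurance Pyramid, and operational guidance on evaluation-driven development, RAG system testing, model lifecycle management, and governance.
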
详情
AI中文摘要

企业 AI 系统构建于大语言模型、检索管道和自主代理之上,引入了一类传统软件质量保证从未设计应对的风险。这些系统是概率性的、上下文敏感的和涌现性的:它们无法在经典意义上被验证为正确,只能通过不断增加信心来评估。本文提出了一种围绕三个关键原则的企业 AI 系统综合保证策略:第一,AI 测试应侧重于持续风险降低而非严格正确性验证;第二,评估必须与开发一起被视为核心工程学科;第三,AI 保证中的失败可能导致与传统确定性软件系统根本不同的组织影响。我们引入了结构化的 AI 故障分类法,提出了修订后的五层 AI 保证金字塔,并提供了关于评估驱动开发、RAG 系统测试、模型生命周期管理和治理的操作指南。目标是让工程领导者和从业者掌握一种既有哲学基础又可操作部署的策略。

英文摘要

Enterprise AI systems, built on large language models, retrieval pipelines and autonomous agents, introduce a class of risks that traditional software quality assurance was never designed to address. These systems are probabilistic, context-sensitive and emergent: they cannot be verified to be correct in the classical sense, but only evaluated with increasing confidence. This paper presents a comprehensive assurance strategy for enterprise AI systems built around three key principles: first, that AI testing should focus on continuous risk reduction rather than strict correctness verification; second, that evaluation must be treated as a core engineering discipline alongside development; and third, that failures in AI assurance can lead to organizational impacts that are fundamentally different from those seen in traditional deterministic software systems. We introduce a structured AI Failure Taxonomy, propose a revised five-layer AI Assurance Pyramid and provide operational guidance on evaluation-driven development, RAG system testing, model lifecycle management and governance. The goal is to equip engineering leaders and practitioners with a strategy that is both philosophically grounded and operationally deployable.

2605.23448 2026-05-25 cs.CR cs.AI

AI Security Research Should Better Incentivize Defense Research

AI安全研究应更好地激励防御研究

Youqian Zhang

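Affiliations * The Hong Kong Polytechnic University

AI summary This paper argues that AI security research is imbalanced, producing more work on attacking AI systems than on defending them. Drawing on papers from subfields including federated learning, speech recognition, membership inference, and large language models, it finds skewed attack-to-defense ratios. Attack papers are often evaluated under favorable conditions that overstate real-world threats, while defenses are held to a stricter standard that few can meet, leaving few usable, deployed protections. The author therefore calls for stronger incentives for defense research.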

Comments 14 pages,3 figures,3 tables

详情
AI中文摘要

本文考察了人工智能(AI)安全研究中的不平衡:该领域倾向于产出更多关于攻击AI系统的研究,而非防御。通过相关学术论文,我们发现跨子领域(包括联邦学习、语音识别、成员推断、大语言模型等)存在偏斜的攻击-防御比例。这种不平衡可能远不止简单的计数:攻击论文通常在有利条件下进行评估,使威胁看起来比实际更严重,而防御则面临更严格的标准,很少有方法能达到。结果是文献中充斥着已证明的漏洞,而可用且已部署的防御则很少。因此,我们认为AI安全研究应更好地激励防御研究。

英文摘要

This work examines an imbalance in artificial intelligence (AI) security research: the field tends to produce more work on attacking AI systems than on defending them. Drawing on related academic papers, we find biased attack-to-defense ratios across subfields, including federated learning, speech recognition, membership inference, and large language models. The imbalance may go well beyond a simple count: attack papers are routinely evaluated under favorable conditions that make threats look more severe than they are in practice, while defenses are held to a stricter standard that few can meet. The result is a literature rich in demonstrated vulnerabilities and thin on usable and deployed protections. We thus argue that AI security research should better incentivize defense research.

2605.23426 2026-05-25 cs.HC cs.AI

Socially fluent AI decouples conversational signals from source identity in online interaction

社交流畅的AI在在线互动中解耦对话信号与来源身份

Lixiang Yan, Yueqiao Jin, Xibin Han, Dragan Gašević

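Affiliations * School of Education, Tsinghua University; Faculty of Information Technology, Monash University; Faculty of Education, The University of Hong Kong

AI summary This study tests whether people can identify socially fluent AI agents from conversational signals in synchronous text-based group interaction. Across 786 participants and 1,572 post-interaction identity judgments, people did not distinguish undisclosed AI teammates from human teammates above chance, even though conversational behaviour contained cues that supported highly accurate computational classification. Participants instead relied on suspicion heuristics such as response speed, fluency, and perceived scriptedness, which were only weakly related to actual identity. This creates a vulnerability to coordinated AI agents that can influence online discourse at scale.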

详情
AI中文摘要

社交流畅的智能体AI现在能够以类似于普通人类对话的方式参与在线互动,这可能削弱人们仅凭对话信号推断谁是人类的能力。我们在同步文本群组交互中测试了这种可能性,将未公开的AI代理作为普通队友嵌入到分析性、创造性和伦理任务中。在786名参与者进行的1572次交互后身份判断中,人们区分AI和人类队友的能力未高于随机水平。这种失败并非因为交互缺乏身份相关信息。对话行为包含区分AI与人类的稳健线索,并支持高度准确的计算分类。相反,参与者依赖熟悉的怀疑启发式,包括响应速度、流畅性和感知的脚本化,这些与真实身份只有弱相关。表征分析进一步表明,判断是基于主观印象而非编码真实身份的行为结构组织的。这种分离为能够大规模影响和操纵在线话语的协调AI代理创造了新的脆弱性。

英文摘要

Socially fluent agentic AI can now participate in online interaction in ways that resemble ordinary human conversation, potentially weakening people's ability to infer who is human from conversational signals alone. We tested this possibility in synchronous text-based group interaction by embedding undisclosed AI agents as ordinary teammates across analytical, creative, and ethical tasks. Across 786 participants who made 1,572 post-interaction identity judgments, people did not distinguish AI from human teammates above chance. This failure did not arise because the interaction lacked identity-relevant information. Conversational behaviour contained robust cues that differentiated AI from humans and supported highly accurate computational classification. Instead, participants relied on familiar suspicion heuristics, including response speed, fluency, and perceived scriptedness, that were only weakly related to actual identity. Representational analyses further showed that judgments were organised around subjective impressions rather than the behavioural structure encoding ground truth. This dissociation creates new vulnerabilities to coordinated AI agents that can influence and manipulate online discourse at scale.

2605.23424 2026-05-25 cs.IT cs.LG math.IT

Sparse In-Network Learning via Shortest-Path Backpropagation and Finite-Rate Gating

通过最短路径反向传播和有限速率门控的稀疏网内学习

Mohammad Reza Deylam Salehi

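Affiliations * Nice, France

AI summary This paper addresses sparse communication in in-network learning (INL) and proposes Dijkstra-pruned INL (D-INL), which removes non-tree links by retaining a capacity-aware shortest-path tree rooted at the fusion node. Local routing is modeled as a finite-rate stochastic gate to balance sparsity against predictive information. In a distributed-classification experiment, D-INL reduces training exchange by 70.4% while keeping accuracy within the standard deviation of dense INL, and finite-rate regularization further reduces the estimated latent rate by 45.7% relative to unregularized Dijkstra INL.
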
详情
AI中文摘要

网内学习(INL)通过通信图交换潜在激活和反向传播误差来训练分布式神经模块。本文提出Dijkstra剪枝INL(D-INL),通过保留融合节点处的容量感知最短路径树来移除非树链接。为了平衡稀疏性和预测信息,局部路由(或聚合)被建模为有限速率随机门控,其速率为$R_g=I(Z; T)$。我们推导了一个率-失真-泛化界,并在可复现的分布式分类实验上验证了该方法,其中D-INL将训练交换量减少了70.4%,同时将精度保持在密集INL的标准差范围内。与未正则化的Dijkstra INL相比,添加有限速率正则化进一步将估计的潜在速率降低了45.7%。

英文摘要

In-network learning (INL) trains distributed neural modules by exchanging latent activations and backpropagated errors over a communication graph. This letter proposes Dijkstra-pruned INL (D-INL), which removes non-tree links by retaining a capacity-aware shortest-path tree rooted at the fusion node. To balance sparsity and predictive information, local routing (or aggregation) is modeled as a finite-rate stochastic gate with rate $R_g=I(Z; T)$. We derive a rate-distortion-generalization bound and validate the method on a reproducible distributed-classification experiment, where D-INL reduces training exchange by $70.4\%$ while preserving accuracy within the standard deviation of dense INL. Adding finite-rate regularization further reduces the estimated latent rate by $45.7\%$ relative to unregularized Dijkstra INL.
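The pruning step keeps only the links of a shortest-path tree rooted at the fusion node. The sketch below runs Dijkstra's algorithm over an undirected link set and drops every non-tree link. It assumes that "capacity-aware" means link cost equals `1 / capacity`; the abstract does not define the cost, and the function names and data layout are hypothetical.

```python
import heapq


def shortest_path_tree(links, root):
    """links: {(u, v): capacity} for undirected links. Returns {node: parent} toward root."""
    adjacency = {}
    for (u, v), capacity in links.items():
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        cost = 1.0 / capacity  # assumed cost model: high capacity means cheap link
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))
    dist, parent, done = {root: 0.0}, {}, set()
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, cost in adjacency.get(u, []):
            candidate = d + cost
            if v not in dist or candidate < dist[v]:
                dist[v], parent[v] = candidate, u
                heapq.heappush(heap, (candidate, v))
    return parent


def pruned_links(links, root):
    """Keep only the links that belong to the shortest-path tree."""
    parent = shortest_path_tree(links, root)
    tree_edges = {frozenset((child, par)) for child, par in parent.items()}
    return {edge: cap for edge, cap in links.items() if frozenset(edge) in tree_edges}
```

A tree over n reachable nodes keeps exactly n - 1 links, so activations and backpropagated errors travel along a single path per node instead of over every available link.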

2605.23378 2026-05-25 math.OC cs.LG

Selective Ambulance Dispatch Under Contextual Travel-Time Uncertainty

上下文旅行时间不确定性下的选择性救护车调度

Zikun Lin, Daniel Zhuoyu Long, Viet Anh Nguyen

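Affiliations * Department of Systems Engineering and Engineering Management

AI summary This paper studies selective ambulance dispatch for out-of-hospital cardiac arrest under travel-time uncertainty. The proposed IDEAL framework sends a second ambulance only when the optimistic gap between primary and secondary paths exceeds a threshold, limiting the fleet capacity consumed by always-dual dispatch. It learns context-specific edge travel times with a weakly supervised bilevel representation network, models uncertainty via Burg-divergence perturbations, and casts the gap computation as a difference-of-convex program with an efficient oracle. Evaluated on historical records and real-time adaptive simulations in collaboration with the Hong Kong Fire Services Department, it achieves a stronger response-time/resource trade-off than region-based and Google-based baselines.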

详情
AI中文摘要

救护车响应在院外心脏骤停(OHCA)中具有时间紧迫性,调度员必须在及时到达与有限车队容量之间取得平衡。静态区域和确定性旅行时间估计易受动态拥堵影响,而始终双调度增加了冗余但消耗了车队容量。我们提出IDEAL(智能双调度急救车),一种选择性双调度框架,仅当主要路径与次要路径之间的乐观差距超过阈值时才派出第二辆救护车。IDEAL利用弱监督双层表示网络,从行程级调度记录(包括未观测路线)中学习上下文特定的边旅行时间。我们使用小批量保守梯度训练非光滑模型,并证明渐近收敛保证。IDEAL通过Burg散度扰动对学习表示空间中的共享度量进行建模,从而引起边旅行时间的相关变化,并从历史低估误差中学习上下文特定半径。对于实时决策,IDEAL将乐观差距计算转化为凸差规划,并推导出具有复杂度保证的高效预言机。与香港消防处合作,我们使用历史OHCA记录和实时自适应模拟评估IDEAL。相对于所有基于区域和基于谷歌的基线,结果实现了更强的响应时间/资源权衡。

英文摘要

Ambulance response is time-critical in out-of-hospital cardiac arrest (OHCA), where dispatchers must balance timely arrivals with limited fleet capacity. Static territories and deterministic travel-time estimates are vulnerable to dynamic congestion, while always-dual dispatch adds redundancy but consumes fleet capacity. We propose IDEAL (Intelligent Dual dispatch of Emergency AmbuLances), a selective dual-dispatch framework that sends a second ambulance only when the optimistic gap between primary and secondary paths exceeds a threshold. IDEAL learns context-specific edge travel times from trip-level dispatch records, including unobserved routes, using a weakly supervised bilevel representation network. We train the nonsmooth model with mini-batch conservative gradients and prove an asymptotic convergence guarantee. IDEAL models uncertainty via Burg-divergence perturbations to a shared metric in the learned representation space, thereby inducing correlated changes in edge travel times and learning context-specific radii from historical underprediction errors. For real-time decisions, IDEAL casts optimistic-gap computation as a difference-of-convex program and derives an efficient oracle with complexity guarantees. In collaboration with the Hong Kong Fire Services Department, we evaluate IDEAL using historical OHCA records and real-time adaptive simulations. The results achieve a stronger response-time/resource trade-off relative to all region-based and Google-based baselines.

2605.23348 2026-05-25 cs.DC cs.AI cs.NI

XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms

XWind: A Cross-site Large Language Model Inference Serving Router for Renewable Energy Farms

Tella Rajashekhar Reddy, Atharva Deshmukh, Liangcheng Yu, Chaojie Zhang, Mike Shepperd, Rohan Gandhi, Anjaly Parayil, Srinivasan Iyengar, Ajay Manchepalli, Debopam Bhattacherjee

Affiliation * Microsoft

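AI Summary As demand for AI compute grows rapidly, power grids face enormous pressure, while renewable energy such as wind goes underused. This paper proposes AI Greenferencing, a complementary AI infrastructure deployment model that places modular AI compute at wind farms so that local demand matches the renewable supply. To handle the inference-serving challenges caused by fluctuating wind power, the team designs XWind, a lightweight, reactive, and workload-agnostic AI inference router that schedules requests dynamically using real-time signals. On a 64-GPU A100 testbed, it reduces P99 end-to-end latency by up to 52% over the strongest alternative, with consistent gains across workload types, load levels, and GPU generations.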

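Details
AI Chinese Abstract (English translation)

AI power demand is growing at an unprecedented rate, while power grids are often in poor condition and struggle to keep up. Grid expansion brings high capital expenditure and long-distance transmission losses, yet renewable energy is abundant at the source and simply not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, with a focus on wind. It allows the AI footprint to expand, creates local behind-the-meter demand at renewable sites, and helps ease the growing strain on power utilities. Our feasibility analysis shows that more than 890 GW of wind capacity lies within a 50 ms network round-trip time of Azure data centers, and that site-wise right-sizing combined with the spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals (inference latency, KV-cache utilization, and queue depth) to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.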

English Abstract

AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.
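
To make the idea of routing on real-time signals concrete, here is a minimal sketch of a reactive router that looks only at the three signals named in the abstract: inference latency, KV-cache utilization, and queue depth. The abstract does not describe XWind's actual policy, so everything here is an assumption made for illustration: the `SiteSignals` type, the bottleneck-style `load_score`, the normalization constants, and the site names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteSignals:
    latency_ms: float     # recent inference latency observed at the site
    kv_cache_util: float  # fraction of KV cache in use, 0.0 to 1.0
    queue_depth: int      # requests currently waiting at the site


def load_score(s: SiteSignals, latency_target_ms: float = 2000.0,
               max_queue: int = 32) -> float:
    """Score a site by its most saturated signal (hypothetical heuristic).

    Each signal is scaled to roughly 0..1 and the largest one wins, so a
    site is considered loaded as soon as any single resource is.
    """
    return max(s.latency_ms / latency_target_ms,
               s.kv_cache_util,
               s.queue_depth / max_queue)


def pick_site(signals: Dict[str, SiteSignals]) -> str:
    """Route the next request to the site with the lowest load score."""
    return min(signals, key=lambda name: load_score(signals[name]))


# Hypothetical snapshot of three wind-powered sites.
sites = {
    "north": SiteSignals(latency_ms=900.0, kv_cache_util=0.95, queue_depth=4),
    "coast": SiteSignals(latency_ms=1200.0, kv_cache_util=0.40, queue_depth=10),
    "plains": SiteSignals(latency_ms=600.0, kv_cache_util=0.55, queue_depth=30),
}
print(pick_site(sites))
```

In this snapshot "north" has the lowest latency-to-target ratio after "plains" but a nearly full KV cache, and "plains" has a long queue, so the sketch prefers "coast". A real router would also have to handle the site configuration step the abstract mentions, which this sketch leaves out.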