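arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.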

2511.22486 2026-06-16 physics.plasm-ph cs.LG Version update

The Machine Learning Approach to Moment Closure Relations for Plasma: A Review

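Samuel Burles, Enrico Camporeale

Affiliations: School of Physical and Chemical Sciences, Queen Mary University of London; Space Weather TREC, University of Colorado

AI summary: This paper reviews research on machine-learning methods for developing improved closure models in plasma fluid models, covering two families of methods, neural-network surrogates and equation discovery, and discusses the challenges of offline testing and online simulation along with future directions.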

Comments 58 pages, 6 figures

Abstract

The requirement for large-scale global simulations of plasma is an ongoing challenge in both space and laboratory plasma physics. Any simulation based on a fluid model inherently requires a closure relation for the high order plasma moments. This review compiles and analyses the recent surge of machine learning approaches developing improved plasma closure models capable of capturing kinetic phenomena within plasma fluid models. We survey two methodological families: neural-network surrogates (from multilayer perceptrons to Fourier neural operators, the latter recently reproducing both linear and non-linear Landau damping online within a fluid solver) and equation-discovery methods such as sparse regression; and organise the studies by whether they are tested offline against reference data or online within a time-evolving solver. We outline the challenges associated with machine-learning closures, including off-diagonal pressure-tensor accuracy, generalisation beyond the training distribution, and stable integration into large-scale simulations, and the directions future research might take to address them.
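The equation-discovery family named in the abstract can be illustrated with a toy sparse regression. The sketch below is not taken from the review: it assumes a sequentially thresholded least-squares loop (one common sparse-regression variant), a hypothetical polynomial candidate library, and synthetic noise-free data. The names `solve_normal_equations` and `stlsq` are illustrative.

```python
def solve_normal_equations(a, y):
    """Least-squares fit of a @ c = y via the normal equations.

    Adequate for this tiny, well-conditioned demo; a real closure fit
    would use a numerically sturdier solver.
    """
    n = len(a[0])
    # Augmented system [A^T A | A^T y].
    m = [[sum(row[i] * row[j] for row in a) for j in range(n)]
         + [sum(row[i] * yi for row, yi in zip(a, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def stlsq(theta, y, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares over a candidate library."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coef = [0.0] * n_terms
    for _ in range(max_iter):
        if not active:
            break
        sub = [[row[j] for j in active] for row in theta]
        fit = solve_normal_equations(sub, y)
        coef = [0.0] * n_terms
        for j, c in zip(active, fit):
            coef[j] = c
        kept = [j for j in active if abs(coef[j]) >= threshold]
        if kept == active:
            break
        # Prune small terms, then refit on the surviving columns.
        coef = [c if j in kept else 0.0 for j, c in enumerate(coef)]
        active = kept
    return coef


# Hypothetical data: the "closure" to recover is y = 2*x - 0.5*x**3.
xs = [i / 10 - 1 for i in range(21)]
theta = [[1.0, x, x**2, x**3] for x in xs]  # candidate terms 1, x, x^2, x^3
y = [2.0 * x - 0.5 * x**3 for x in xs]
coef = stlsq(theta, y)
print(coef)
```

In a closure setting the library columns would be functions of the lower-order moments and the target a higher-order moment; the threshold controls how sparse the recovered expression is.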

2601.08056 2026-06-16 q-bio.NC cs.RO Version update

The embodied brain: Bridging the brain, body, and behavior with biorealistic neuromechanical models

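Sibo Wang-Chen, Pavan Ramdya

Affiliations: EPFL

AI summary: This paper reviews biorealistic neuromechanical models, which embed artificial neural controllers within body models in simulated environments to reveal behavioral-control algorithms arising from interactions among the nervous system, body, and environment, and to promote exchange between neuroscience, robotics, and machine learning.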

Comments 18 pages, 4 figures (including 1 graphical abstract), 1 table

Abstract

Animal behavior reflects interactions between the nervous system, body, and environment. Therefore, biomechanics and environmental context must be considered to understand algorithms for behavioral control. Computational models that embed artificial neural controllers within body models in simulated environments are a powerful tool for this purpose. Here, we review advances in biorealistic neuromechanical models while also highlighting emerging opportunities ahead. We first show how these models enable inference of biophysical variables that are difficult to measure experimentally. By systematically perturbing these models, one can generate new experimentally testable hypotheses. We then examine how neuromechanical models facilitate the exchange between neuroscience, robotics, and machine learning, and showcase their applications in healthcare. We envision that coupling experimental studies with active probing of their neuromechanical surrogates will significantly accelerate progress in neuroscience.

2604.09679 2026-06-16 cs.MA cs.AI Version update

HCP-MAD: Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

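Yiqing Liu, Hantao Yao, Wu Liu, Allen He, Yongdong Zhang

Affiliations: University of Science and Technology of China

AI summary: Proposes the HCP-MAD framework, which uses heterogeneous consensus verification and an adaptive stopping mechanism to substantially reduce the token cost of multi-agent debate while maintaining accuracy.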

Abstract

Multi-Agent Debate (MAD) is a collaborative framework in which multiple agents iteratively refine solutions through the generation of reasoning and alternating critique cycles. Current work primarily optimizes intra-round topologies and inter-round interactions separately, limiting the adaptation of token costs to task complexity. This work introduces Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate (HCP-MAD), leveraging consensus as a dynamic signal to facilitate progressive reasoning. The core motivation is that a majority of straightforward tasks can be effectively resolved via lightweight pair-agent debates, while complex tasks require expanded collaboration. Firstly, Heterogeneous Consensus Verification conducts rapid consensus verification using a pair of heterogeneous agents for early stopping. Next, Heterogeneous Pair-Agent Debate applies an adaptive stopping criterion to terminate mutual critique of reasoning traces. Finally, the unresolved tasks are addressed through Escalated Collective Voting by aggregating diverse perspectives from additional agents. Experiments across six benchmarks show that HCP-MAD enhances accuracy while substantially reducing token costs. Code is available at https://github.com/fuyu66/HCP-MAD.
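The three-stage escalation described in the abstract can be sketched as a short control loop. This is a minimal illustration, not the authors' implementation: agents are modeled as hypothetical callables `(question, peer_answer) -> answer`, exact answer agreement stands in for the paper's adaptive stopping criterion, and a plain majority vote stands in for Escalated Collective Voting.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interface: (question, peer_answer or None) -> answer.
Agent = Callable[[str, Optional[str]], str]


def progressive_debate(question: str,
                       pair: Tuple[Agent, Agent],
                       extra_agents: List[Agent],
                       max_rounds: int = 2) -> Tuple[str, str]:
    """Return (answer, stage at which the task was resolved)."""
    first, second = pair

    # Stage 1: consensus verification by two heterogeneous agents.
    ans_a, ans_b = first(question, None), second(question, None)
    if ans_a == ans_b:
        return ans_a, "consensus_verification"

    # Stage 2: pair debate; each agent sees the other's previous answer.
    # Agreement is an assumed stand-in for the adaptive stopping rule.
    for _ in range(max_rounds):
        ans_a, ans_b = first(question, ans_b), second(question, ans_a)
        if ans_a == ans_b:
            return ans_a, "pair_debate"

    # Stage 3: escalate to additional agents and take a majority vote.
    votes = Counter([ans_a, ans_b] + [agent(question, None) for agent in extra_agents])
    return votes.most_common(1)[0][0], "collective_voting"


# Stub agents for illustration only.
def says_four(question, peer):
    return "4"

def says_seven(question, peer):
    return "7"

def follower(question, peer):
    # Adopts the peer's answer once it has seen one.
    return peer if peer is not None else "5"

print(progressive_debate("2+2?", (says_four, follower), []))
```

Easy questions exit at the first stage with two model calls, which is where the token savings claimed in the abstract would come from; only unresolved questions pay for the extra agents.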

2603.27998 2026-06-16 eess.AS cs.LG Version update

HRIR-Former: Grid-Free Time-Domain Reconstruction of Head-Related Impulse Responses with a Spatially Encoded Transformer

Shaoheng Xu, Chunyi Sun, Jihui Zhang, Amy Bastine, Prasanga N. Samarasinghe, Thushara D. Abhayapala, Hongdong Li

Affiliations: The Australian National University; The University of Queensland

AI summary: Proposes HRIR-Former, a time-domain, grid-free binaural Transformer that predicts HRIRs at arbitrary directions from sparse measurements, using sinusoidal spatial features, a Conv1D refinement module, and auxiliary ITD/ILD heads; it outperforms existing methods on the SONICOM dataset.

Comments Accepted at Interspeech 2026, Sydney, Australia

Abstract

Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose HRIR-Former, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase preprocessing is unnecessary.

2402.00094 2026-06-16 cs.NE cs.AI cs.LG Version update

Deep Neural Networks: A Formulation Via Non-Archimedean Analysis

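W. A. Zúñiga-Galindo

Affiliations: University of Texas Rio Grande Valley School of Mathematical & Statistical Sciences

AI summary: Proposes deep neural networks with a multilayer tree-like architecture based on the ring of integers of a non-Archimedean local field; these networks are robust universal approximators of real-valued functions defined on the ring, and the paper also proves universal approximation for square-integrable functions on the unit interval.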

Comments Final version accepted in the Journal of Fourier Analysis and Applications

2603.22766 2026-06-16 cs.HC cs.AI Version update

From Overload to Convergence: Supporting Multi-Issue Human-AI Negotiation with Bayesian Visualization

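Mehul Parmar, Chaklam Silpasuwanchai

Affiliations: Asian Institute of Technology

AI summary: To address the decline in human performance caused by cognitive load in multi-issue negotiation, proposes an uncertainty visualization based on Bayesian estimation of agreement probability; experiments show that it improves human negotiation outcomes and efficiency while preserving human control.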

Comments Accepted for publication to CHI 2026. v2: Added Appendix B (system prompts) and Appendix C (payoff matrices) in response to replication requests. Dataset independently available at https://doi.org/10.5281/zenodo.20545331

DOI: 10.1145/3772318.3790358

Abstract

As AI systems increasingly mediate negotiations, understanding how the number of negotiated issues impacts human performance is crucial for maintaining human agency. We designed a human-AI negotiation case study in a realistic property rental scenario, varying the number of negotiated issues; empirical findings show that without support, performance stays stable up to three issues but declines as additional issues increase cognitive load. To address this, we introduce a novel uncertainty-based visualization driven by Bayesian estimation of agreement probability. It shows how the space of mutually acceptable agreements narrows as negotiation progresses, helping users identify promising options. In a within-subjects experiment (N=32), it improved human outcomes and efficiency, preserved human control, and avoided redistributing value. Our findings surface practical limits on the complexity people can manage in human-AI negotiation, advance theory on human performance in complex negotiations, and offer validated design guidance for interactive systems.

2603.22376 2026-06-16 cs.IR cs.AI Version update

Closing the Auto-Research Loop: An AI Co-Scientist for Production Search Ranking

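Liwei Wu, Cho-Jui Hsieh

Affiliations: Trip.com Group; UCLA

AI summary: Proposes an AI Co-Scientist framework that integrates LLM agents with cloud compute to automatically iterate through idea generation, code implementation, GPU experimentation, and result analysis, yielding an additional +0.083% offline gain on a search-ranking task.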

Comments Submitted to EMNLP for review on June 14, 2026

Abstract

We present an AI Co-Scientist framework that closes the research loop for the production search-ranking system of a large online travel platform -- pairing LLM agents with direct cloud-compute access so that idea generation, code implementation, GPU experimentation, and result analysis iterate end-to-end with a human scientist in the loop. The framework uses a hybrid agent architecture: single-LLM agents handle routine work, while multi-LLM consensus (GPT-5.2, Gemini Pro 3, Claude Opus 4.5) is invoked for higher-stakes decisions. On the production ranking task, a human-designed transformer baseline (V2) yielded $+0.118\%$ over a pre-transformer baseline (V1); the AI Co-Scientist's automated loop on top of V2 contributed an additional $+0.083\%$, for a combined $+0.201\%$ offline gain delivered in roughly one extra week of wall-clock time (single-run numbers; statistical limits discussed in the paper). The most useful AI proposals -- unified long-sequence layouts, slot-type embeddings, and multi-phase learning-rate schedules -- are standard practice in NLP and Vision but were absent from our production stack, suggesting that LLM agents can serve as cross-disciplinary connectors for ranking teams. We also report deployment context, negative results, and lessons learned.

2603.21613 2026-06-16 cs.IR cs.AI Version update

AgenticRec: A Recommendation-Oriented Agentic Framework with Progressive Tool-Integrated Reasoning Optimization

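Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li

Affiliations: Xiamen University

AI summary: Proposes the AgenticRec framework, which models recommendation as a tool-integrated reasoning process and designs a two-stage training paradigm that improves recommendation accuracy through activation under implicit feedback and progressive preference refinement.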

Abstract

Recommender agents built on Large Language Models offer a promising paradigm for personalized recommendation. However, existing agents typically suffer from a misalignment between their tool-integrated reasoning trajectories and recommendation feedback, limiting their ability to distinguish fine-grained user preferences. To address these challenges, we propose AgenticRec, an agentic recommendation framework that formulates recommendation as a tool-integrated reasoning process over a recommendation-oriented tool suite. Built upon this framework, we further develop a dedicated two-stage training paradigm tailored for recommender agents. In the first stage, we introduce Recommendation-Oriented Trajectory Activation, which optimizes the agentic recommendation ability under implicit feedback. In the second stage, Progressive Preference Refinement further refines the agent through bidirectional preference reasoning over self-bootstrapped hard pairs, progressively sharpening preference boundaries. Theoretical analysis and extensive experiments demonstrate the effectiveness of AgenticRec. Our code is available at https://anonymous.4open.science/r/AgenticRec-FB16.

2603.19595 2026-06-16 cs.IR cs.CL Version update

All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

Can Lv, Heng Chang, Shengyu Tao, Mingju Chen, Zhaoxin Fan, Ziwei Zhang, Yuchen Guo, Shiji Zhou

Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing; Beihang University; Tsinghua University; Chalmers University of Technology

AI summary: Proposes the All-Mem framework, which uses an online/offline, non-destructive, topology-structured memory bank to address the retrieval redundancy and noise that arise as histories grow in lifelong interactive agents, improving retrieval and QA performance on LoCoMo and LongMemEval-s.

Abstract

Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long-term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology-structured memory bank via explicit, non-destructive consolidation, avoiding the irreversible information loss typical of summarization-based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence-scored topology edits executed with gating using three operators: Split, Merge, and Update, while preserving immutable evidence for traceability. At query time, typed links enable hop-bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LoCoMo and LongMemEval-s show improved retrieval and QA over representative baselines. The code is available at https://github.com/LvCan926/All-Mem.
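The query-time step, hop-bounded and budgeted expansion over typed links, can be sketched as a breadth-first traversal. This is an illustration under assumptions rather than All-Mem's actual code: the adjacency format, the link-type names, and the function name `expand` are all hypothetical.

```python
from collections import deque


def expand(links, anchors, max_hops, budget, allowed_types=None):
    """Collect memory nodes reachable from the anchors.

    links: dict mapping node -> list of (link_type, neighbor) pairs.
    max_hops: maximum number of links followed from any anchor.
    budget: maximum number of nodes returned, anchors included.
    allowed_types: optional set of link types that may be followed.
    """
    visited = list(dict.fromkeys(anchors))[:budget]  # de-duplicate, keep order
    seen = set(visited)
    queue = deque((node, 0) for node in visited)
    while queue and len(visited) < budget:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop bound reached on this path
        for link_type, neighbor in links.get(node, []):
            if allowed_types is not None and link_type not in allowed_types:
                continue
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append((neighbor, hops + 1))
                if len(visited) >= budget:
                    break  # retrieval budget exhausted
    return visited


# Hypothetical memory graph: "a" is an active anchor, the rest are archived.
links = {
    "a": [("split_from", "b"), ("merged_into", "c")],
    "b": [("updated_from", "d")],
    "d": [("split_from", "e")],
}
print(expand(links, ["a"], max_hops=2, budget=10))
```

Bounding both the hop count and the node budget keeps the expansion cost predictable regardless of how large the archived history has grown.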

2603.18897 2026-06-16 cs.DC cs.AI Version update

Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, Kaiqiang Xu, Kai Chen, Yuqing Yang

Affiliations: Shanghai Jiao Tong University; Microsoft Research; Stevens Institute of Technology; Google; Hong Kong University of Science and Technology

AI summary: Proposes PASTE, a system that speculatively executes predicted future tool calls in parallel with LLM generation, reducing task completion time by 43.5%.

2602.08029 2026-06-16 gr-qc astro-ph.IM cs.CV Version update

Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields

Berthy T. Feng, Andrew A. Chael, David Bromley, Aviad Levis, William T. Freeman, Katherine L. Bouman

Affiliations: Caltech; MIT; NSF IAIFI; Princeton University; Niels Bohr International Academy; University of Toronto

AI summary: Proposes PI-DEF, which uses differentiable neural rendering to jointly reconstruct a 4D emissivity field and a 3D velocity field from EHT measurements, introducing physics as a soft constraint; it significantly outperforms existing methods on simulated data.

Comments CVPR 2026

Abstract

With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose PI-DEF, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a physics-agnostic approach. We demonstrate how our method may be used to estimate other physics parameters of the black hole, such as its spin.

2603.13584 2026-06-16 cs.SE cs.AI Version update

An Empirical Investigation of Pre-Trained Deep Learning Model Reuse in the Scientific Process

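Nicholas M. Synovic, Karolina Ryzka, Alessandra V. Vellucci Solari, Kenny Lyons, James C. Davis, George K. Thiruvathukal

Affiliations: Loyola University Chicago; Purdue University West Lafayette, IN, USA

AI summary: Through an empirical study of 17,718 peer-reviewed open-access papers, quantifies the reuse patterns, utilization, and impact of pre-trained deep learning models (PTMs) in the natural sciences, finding that "Biochemistry, Genetics and Molecular Biology" reuses PTMs the most, that the "adaptation" reuse pattern is the most common, and that the "testing" stage is most affected by PTM integration.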

Comments 22 pages, 7 figures, 4 tables

2603.10562 2026-06-16 math.OC cs.LG cs.SY eess.SY Version update

Quantization Robustness of Monotone Operator Equilibrium Networks

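James Li, Philip H. W. Leong, Thomas Chaffey

Affiliations: School of Electrical and Computer Engineering, The University of Sydney

AI summary: Analyzes how weight quantization affects the convergence and equilibrium of monotone operator equilibrium networks deployed on low-precision hardware, provides theoretical guarantees based on spectral perturbation and the monotonicity margin, and verifies a phase transition between quantization precision and convergence in MNIST experiments.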

Comments 6 pages, 4 figures. Accepted for publication in IEEE Control Systems Letters (L-CSS)

Abstract

Monotone operator equilibrium networks are implicit-layer models whose output is the unique equilibrium of a monotone operator, guaranteeing existence, uniqueness, and convergence. When deployed on low-precision hardware, weights are quantized, potentially destroying these guarantees. We analyze weight quantization as a spectral perturbation of the underlying monotone inclusion. Convergence of the quantized solver is guaranteed whenever the spectral-norm weight perturbation is smaller than the monotonicity margin; the displacement between quantized and full-precision equilibria is bounded in terms of the perturbation size and margin; and a condition number characterizing the ratio of the operator norm to the margin links quantization precision to forward error. MNIST experiments confirm a phase transition at the predicted threshold: three- and four-bit post-training quantization diverge, while five-bit and above converge. The backward-pass guarantee enables quantization-aware training, which recovers provable convergence at four bits.
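The sufficient condition stated in the abstract (spectral-norm weight perturbation smaller than the monotonicity margin) can be checked numerically for a given weight matrix. The sketch below rests on assumptions: the uniform symmetric quantizer is a hypothetical choice that may differ from the paper's scheme, the margin is supplied by the caller rather than derived from a network parameterization, and the spectral norm is estimated by power iteration.

```python
import math


def quantize(weights, bits):
    """Uniform symmetric quantizer (assumed scheme; needs bits >= 2, nonzero weights)."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for row in weights for w in row) / levels
    return [[round(w / scale) * scale for w in row] for row in weights]


def spectral_norm(a, iters=200):
    """Estimate the largest singular value by power iteration on A^T A.

    The estimate approaches the true value from below, so a production
    check would add a safety factor or use an SVD routine.
    """
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in av))


def quantization_preserves_guarantee(weights, bits, margin):
    """True when ||W - Q(W)||_2 < margin, the condition named in the abstract."""
    quantized = quantize(weights, bits)
    delta = [[w - q for w, q in zip(row, qrow)]
             for row, qrow in zip(weights, quantized)]
    return spectral_norm(delta) < margin


# Hypothetical 2x2 weights and margin, for illustration only.
W = [[0.5, -0.25], [0.125, 0.75]]
print(quantization_preserves_guarantee(W, bits=8, margin=0.05))
print(quantization_preserves_guarantee(W, bits=2, margin=0.05))
```

Because the condition is only sufficient, a `False` result does not prove divergence; it means the guarantee no longer applies at that bit width.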

2603.03417 2026-06-16 cs.CR cs.AI Version update

Parallel Test-Time Scaling with Multi-Sequence Verifiers

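Yegon Kim, Seungyoo Lee, Chaeyun Jang, Hyungi Lee, Juho Lee

Affiliations: Graduate School of AI, KAIST

AI summary: Proposes the Multi-Sequence Verifier (MSV), which predicts correctness conditioned on the candidate set, improving calibration, raising best-of-N selection accuracy, and enabling an early-stopping strategy that reaches the same accuracy on mathematical reasoning tasks with less than half the latency.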

Abstract

Parallel test-time scaling, which generates multiple candidate solutions for a single problem, is a powerful technique for improving large language model performance. However, it is hindered by two key bottlenecks: accurately selecting the correct solution from the candidate pool, and the high inference latency from generating many full solutions. We argue that both challenges are fundamentally linked to verifier calibration, as a well-calibrated verifier improves answer selection and enables early-stopping strategies to reduce latency. However, existing non-generative verifiers are limited as they score each candidate in isolation, overlooking rich contextual information across the set of candidates. To address this, we introduce the Multi-Sequence Verifier (MSV), a lightweight verifier that predicts each candidate's correctness conditioned on the full sampled set. MSV achieves improved calibration, which directly enhances best-of-N selection performance and empowers a novel early-stopping framework. Across challenging mathematical reasoning benchmarks, MSV improves best-of-64 accuracy by up to 6% relative to strong baselines, and in the early-stopping setting reaches the same accuracy as baselines with less than half the latency.

2512.22420 2026-06-16 cs.DC cs.AI Version update

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai

Affiliations: State Key Laboratory of Complex & Critical Software Environment, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology

AI summary: Proposes the Nightjar framework, which dynamically adjusts the speculative length and proactively disables speculative decoding to optimize throughput under both high and low load, with gains of up to 14.76%.

Abstract

Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use fixed lengths and cannot adapt to workload changes or decide when to stop speculation. The cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV cache capacity, limiting batch size and degrading throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative framework. It first adjusts to the request load by dynamically selecting the optimal speculative length for different batch sizes. Crucially, Nightjar proactively disables speculative decoding when the MAB planner determines that speculation is no longer beneficial, and during the disabled phase, offloads the draft model to the CPU only under GPU memory pressure. This reclaims memory for the KV cache, thereby facilitating larger batch sizes and maximizing overall system throughput. Experiments show that Nightjar achieves up to 14.76% higher throughput than standard speculative decoding and up to 20.18% lower latency in the main benchmark suite under dynamic request arrival rates for real-time LLM serving scenarios.
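The idea of a multi-armed-bandit (MAB) planner that picks a speculative length, or disables speculation, can be sketched with a standard UCB1 bandit. The abstract does not say which bandit algorithm Nightjar uses, so UCB1 is a swapped-in assumption here. The arm set, the normalized reward table, and the names `UCB1Planner` and `run` are hypothetical, and batch-size conditioning and draft-model offloading are omitted.

```python
import math


class UCB1Planner:
    """UCB1 bandit over speculative lengths; arm 0 means speculation is off."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}
        self.total = 0

    def select(self):
        # Try every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        return max(
            self.arms,
            key=lambda arm: self.means[arm]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[arm]),
        )

    def update(self, arm, reward):
        # Incremental mean of observed rewards (throughput scaled to [0, 1]).
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical normalized throughput per speculative length and load level.
REWARD = {
    "low": {0: 0.5, 2: 0.7, 4: 0.9},   # memory-bound: longer drafts help
    "high": {0: 0.8, 2: 0.5, 4: 0.3},  # compute-bound: verification hurts
}


def run(load, rounds=300):
    """Return the arm the planner settled on (the most frequently chosen)."""
    planner = UCB1Planner(arms=[0, 2, 4])
    for _ in range(rounds):
        arm = planner.select()
        planner.update(arm, REWARD[load][arm])
    return max(planner.counts, key=planner.counts.get)


print(run("low"), run("high"))
```

In a serving system the reward would come from measured throughput per decoding step, and one planner per batch-size bucket is a plausible way to reflect the batch-size dependence the abstract describes.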

2603.01131 2026-06-16 cs.MA cs.AI Version update

MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis

Yuqi Zhan, Xinyue Wu, Tianyu Lin, Yutong Bao, Xiaoyu Wang, Weihao Cheng, Huangwei Chen, Feiwei Qin, Zhu Zhu

Affiliations: Princeton University; Springer Heidelberg; ABC Institute; Rupert-Karls-University Heidelberg; Hangzhou Dianzi University; Zhejiang University; Children’s Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Children and Adolescents’ Health and Diseases

AI summary: Proposes the MedCollab framework, which strengthens multi-agent collaboration through IBIS-structured argumentation and Hierarchical Disease Relation Chains (HDRC), improving the accuracy, traceability, and report quality of clinical diagnosis.

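AI-generated abstract (translated from Chinese)

Large language models (LLMs) show promise in clinical diagnosis but remain limited by unreliable report generation, weak evidence grounding, and opaque reasoning. We propose MedCollab, an IBIS-guided multi-agent framework for full-cycle clinical diagnosis and diagnostic report generation. Simulating a hospital consultation, MedCollab dynamically recruits specialist and examination agents based on patient records. Each diagnostic hypothesis is structured as an evidence-linked argument through the Issue-Based Information System (IBIS), improving traceability and auditability. MedCollab further builds Hierarchical Disease Relation Chains (HDRC) that organize accepted hypotheses into clinically meaningful pathological and comorbidity relations. A verifier-guided consensus module audits reasoning quality, detects contradictions, and updates agent weights across rounds. Experiments on ClinicalBench and MIMIC-IV show that MedCollab outperforms strong LLM and medical multi-agent baselines in diagnostic accuracy, department routing, evidence consistency, and report quality. These results indicate that structured argumentation and disease-relation modeling can improve the reliability, transparency, and clinical coherence of LLM-based diagnosis.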

Abstract

Clinical diagnosis is a gradual process of evidence integration, in which physicians move from symptoms and medical history to examinations, competing hypotheses, disease relations, and treatment decisions. Large language models have advanced medical text understanding and generation. Yet their clinical use remains limited by weak evidence grounding, opaque reasoning, and inconsistent links among differential diagnosis, final diagnosis, diagnostic basis, and treatment planning. We introduce MedCollab, a multi-agent framework for full-cycle clinical diagnosis and report generation. MedCollab coordinates specialist and examination agents according to patient records. It structures agent deliberation with an Issue-Based Information System (IBIS) protocol, so that each diagnostic position is supported by patient-specific evidence and medical knowledge. It also builds Hierarchical Disease Relation Chains (HDRC) to connect accepted hypotheses through progression, complication, and comorbidity relations. During multi-round deliberation, a verifier-guided consensus module evaluates evidence support, medical plausibility, and logical conflicts. It then adjusts agent contributions and filters unsupported reasoning. Experiments on ClinicalBench and MIMIC-IV show that MedCollab outperforms leading LLMs and medical multi-agent baselines in diagnostic accuracy, evidence consistency, and clinical reasoning quality. These results indicate that structured and auditable collaboration can produce more faithful and clinically coherent diagnostic reports.

2511.05522 2026-06-16 eess.SP cs.AI Version update

AIRMap: AI-Generated Radio Maps for Wireless Digital Twins

Ali Saeizadeh, Miead Tehrani-Moayyed, Davide Villa, J. Gordon Beattie, Pedram Johari, Stefano Basagni, Tommaso Melodia

Affiliations: VIAVI Solutions, Inc.; National Telecommunications and Information Administration (NTIA); U.S. National Science Foundation

AI summary: Proposes AIRMap, a deep-learning framework that performs ultra-fast radio-map estimation from a 2D elevation map with a U-Net autoencoder, predicting path gain with under 4 dB RMSE in 4 ms, more than 100x faster than GPU-accelerated ray tracing.

Comments 15 pages, 19 figures, This work has been accepted for publication on IEEE Transactions on Wireless Communications

Abstract

Accurate, low-latency channel modeling is essential for real-time wireless network simulation and digital-twin applications. Traditional modeling methods such as ray tracing are, however, computationally demanding and ill-suited to modeling dynamic conditions. In this paper, we propose AIRMap, a deep-learning framework for ultra-fast radio-map estimation, along with an automated pipeline for creating the largest radio-map dataset to date. AIRMap uses a single-input U-Net autoencoder that processes only a 2D elevation map of terrain and building heights. Trained on 1.2M Boston-area samples and validated across four distinct urban and rural environments with varying terrain and building density, AIRMap predicts path gain with under 4 dB RMSE in 4 ms per inference on an NVIDIA L40S, over 100x faster than radio maps based on GPU-accelerated ray tracing. A lightweight calibration using just 20% of field measurements reduces the median error to approximately 5%, significantly outperforming traditional simulators, which exceed 50% error. Integration into the Colosseum emulator and the Sionna SYS platform demonstrates near-zero error in spectral efficiency and block-error rate compared to measurement-based channels. These findings validate AIRMap's potential for scalable, accurate, and real-time radio-map estimation in wireless digital twins.

2602.17587 2026-06-16 math.ST cs.LG stat.ML stat.TH Version update

Asymptotically Optimal Sequential Testing with Markovian Data

Alhad Sethi, Kavali Sofia Sagar, Shubhada Agrawal, Debabrota Basu, P. N. Karthik

Affiliations: Indian Institute of Science, Bangalore; Indian Institute of Technology, Hyderabad; Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL

AI summary: For data generated by an ergodic finite-state Markov chain, proposes an asymptotically optimal sequential hypothesis test whose expected stopping time asymptotically matches an instance-dependent lower bound, with applications to detecting model misspecification in Markov chain Monte Carlo and testing structural properties of Markov decision processes.

Comments ICML 2026

英文摘要

We study one-sided and $\alpha$-correct sequential hypothesis testing for data generated by an ergodic, finite-state Markov chain. The null hypothesis is that the unknown transition matrix belongs to a prescribed set $P$ of stochastic matrices, and the alternative corresponds to a disjoint set $Q$. We establish a non-asymptotic instance-dependent lower bound on the expected stopping time of any valid sequential test under the alternative, which is asymptotically tight. Our novel analysis improves the existing lower bounds, which are either asymptotic or provably sub-optimal in this setting. Our lower bound incorporates both the stationary distribution and the transition structure induced by the unknown Markov chain. We further propose an optimal test whose expected stopping time matches this lower bound asymptotically as $\alpha \to 0$. We illustrate the usefulness of our framework through applications to sequential detection of model misspecification in Markov Chain Monte Carlo and to testing structural properties, such as the linearity of transition dynamics, in Markov decision processes. Our findings yield a sharp and general characterization of optimal sequential testing procedures under Markovian dependence.

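2509.14959 2026-06-16 eess.AS cs.AI (updated version)

Discrete optimal transport is a strong audio adversarial attack

Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan

Affiliations: University of Rochester; Rochester Institute of Technology

AI summary: Proposes discrete optimal transport (DOT) as a black-box attack that uses distribution alignment (WavLM embeddings and entropic optimal transport) to substantially degrade speaker verification and anti-spoofing systems without access to model parameters or gradients.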

Abstract

In this paper, we investigate discrete optimal transport (DOT) as a black-box attack against modern automatic speaker verification (ASV) and anti-spoofing countermeasure (CM) systems. Our attack operates as a post-processing distribution-alignment step. Frame-level WavLM embeddings of generated speech (or another person's speech) are aligned to an unpaired bona fide speech pool using entropic optimal transport and a top-k barycentric projection, followed by neural vocoding. Unlike gradient-based attacks, the proposed method requires no access to model parameters, gradients, or training data. Experiments on ASVspoof2019 and ASVspoof5 demonstrate that the DOT attack substantially increases CM EER and substantially degrades ASV performance across multiple spoofing attacks. The attack transfers across datasets and remains effective after CM fine-tuning. Analysis using speaker similarity, Fréchet Audio Distance, and visualization of embedding distributions suggests that DOT succeeds by shifting source speech toward bona fide regions of the representation space rather than by maximizing speaker similarity. These results indicate that optimal-transport-based distribution alignment represents a previously underexplored attack vector for contemporary ASV and anti-spoofing systems.
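
The alignment step named in the abstract (entropic optimal transport followed by a top-k barycentric projection) can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: toy 2-D vectors stand in for frame-level WavLM embeddings, the marginals are uniform, the cost is squared Euclidean distance, and the function names, `eps`, `iters`, and `k` are hypothetical. The vocoding stage is not covered, and this is not the authors' implementation.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_plan(src, tgt, eps=0.1, iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn scaling.

    Assumes distances are small relative to eps; very large costs would
    underflow the kernel to zero and break the divisions below.
    """
    n, m = len(src), len(tgt)
    kernel = [[math.exp(-sq_dist(s, t) / eps) for t in tgt] for s in src]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def topk_barycentric(plan, tgt, k=2):
    """Map each source frame to the plan-weighted mean of its k heaviest targets."""
    dim = len(tgt[0])
    projected = []
    for row in plan:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        mass = sum(row[j] for j in top)
        projected.append([sum(row[j] * tgt[j][d] for j in top) / mass for d in range(dim)])
    return projected


# Toy data: two "source" frames and a pool of three "bona fide" frames.
source_frames = [[0.0, 0.0], [1.0, 1.0]]
target_pool = [[0.0, 0.1], [1.0, 0.9], [0.5, 0.5]]
plan = sinkhorn_plan(source_frames, target_pool)
aligned = topk_barycentric(plan, target_pool, k=2)
```

Restricting the projection to the k heaviest entries of each row keeps every output frame close to a few real target frames instead of averaging over the whole pool.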

2602.14780 2026-06-16 cs.MA cs.CY cs.RO cs.SY eess.SY (updated version)

ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic

Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner

Affiliations: IEEE

AI summary: Presents ROSA, a system combining Transformer-based multi-agent trajectory prediction with coordinated speed guidance to improve efficiency and safety in multimodal, mixed traffic at roundabouts, with prediction accuracy surpassing prior work.

Comments 8 pages, 1 figure, 4 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2025 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version

Abstract

We present ROSA -- Roundabout Optimized Speed Advisory -- a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single-step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five-second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real-time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: github.com/urbanAIthi/ROSA.
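
The abstract reports ADE and FDE without defining them. Assuming the usual definitions (average displacement error is the mean Euclidean distance between predicted and true positions over the horizon, final displacement error is that distance at the last step), the two metrics can be computed as below. The function name and the toy trajectory are illustrative only.

```python
import math


def displacement_errors(predicted, actual):
    """Return (ADE, FDE) for one trajectory given as a list of (x, y) points in metres."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    ade = sum(distances) / len(distances)  # mean error over the whole horizon
    fde = distances[-1]                    # error at the final predicted step
    return ade, fde


# Toy example: the prediction drifts steadily away from the true path.
ade, fde = displacement_errors(
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
)
```

Because errors accumulate when a single-step model is rolled out autoregressively, FDE is typically larger than ADE, which matches the pattern in the reported figures.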

2602.14710 2026-06-16 cs.IR cs.AI (updated version)

Orcheo: A Modular Full-Stack Platform for Conversational Search

Shaojie Jiang, Svitlana Vakulenko, Maarten de Rijke

Affiliations: University of Amsterdam; AI Colleagues; WU Vienna University of Economics and Business

AI summary: Presents Orcheo, an open-source platform that addresses the lack of a unified framework and the difficulty of deploying prototypes in conversational search research through a modular architecture, production-ready infrastructure, and 45+ off-the-shelf components.

Comments Accepted to SIGIR 2026

DOI: 10.1145/3805712.3808613

Abstract

Conversational search (CS) requires a complex software engineering pipeline that integrates query reformulation, ranking, and response generation. CS researchers currently face two barriers: the lack of a unified framework for efficiently sharing contributions with the community, and the difficulty of deploying end-to-end prototypes needed for user evaluation. We introduce Orcheo, an open-source platform designed to bridge this gap. Orcheo offers three key advantages: (i) A modular architecture promotes component reuse through single-file node modules, facilitating sharing and reproducibility in CS research; (ii) Production-ready infrastructure bridges the prototype-to-system gap via dual execution modes, secure credential management, and execution telemetry, with built-in AI coding support that lowers the learning curve; (iii) Starter-kit assets include 45+ off-the-shelf components for query understanding, ranking, and response generation, enabling the rapid bootstrapping of complete CS pipelines. We describe the framework architecture and validate Orcheo's utility through case studies that highlight modularity and ease of use. Orcheo is released as open source under the MIT License at https://github.com/AI-Colleagues/orcheo.

2602.09222 2026-06-16 cs.CR cs.AI (updated version)

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea

Affiliations: Northeastern University; Mozilla Corporation

AI summary: Presents MUZZLE, a framework that uses agent trajectories to automatically identify high-salience injection surfaces and adaptively generate context-aware malicious instructions for evaluating the security of web agents against indirect prompt injection, discovering 44 new attacks as well as cross-application attack strategies.

Abstract

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 44 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties across different LLMs and agent scaffolds. MUZZLE also identifies novel attack strategies, including 3 cross-application prompt injection attacks and an agent-tailored phishing scenario.

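2602.05965 2026-06-16 cs.MA cs.AI (updated version)

Learning to Share: Selective Memory for Efficient Parallel Agentic Systems

Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah

Affiliations: arXiv.org; cs.MA

AI summary: Proposes LTS, a mechanism in which a controller trained with reinforcement learning selectively shares intermediate information across teams, reducing the computational overhead of parallel agentic systems while maintaining or improving task performance.

Comments ICML 2026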

Abstract

Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/

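2602.01394 2026-06-16 eess.AS cs.LG cs.SD (updated version)

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya

Affiliations: Bar-Ilan University; OriginAI

AI summary: Proposes an unsupervised audio-visual speech separation method that jointly models speech and noise with diffusion priors and inverse sampling, outperforming supervised baselines in the single-microphone setting and supporting off-screen speaker separation.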

Abstract

This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in WER across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream detection of the acoustic scene. Code and pretrained models will become available upon acceptance. Demo page: https://ssnaps2026.github.io/ssnaps2026/

2601.20875 2026-06-16 stat.AP cs.LG econ.EM stat.ME stat.ML (updated version)

Drivers, Receivers, and Dynamic Linkages: The Directed Structure of SDG Interdependence, 2000–2024

Md Muhtasim Munif Fahim, Md Jahid Hasan Imran, Md. Naim Molla, Luknath Debnath, Tonmoy Shil, Ehsanul Bashar Pranto, Md Mostafizur Rahman Likhon, Md Shafin Sanyan Saad, Md. Rezaul Karim

Affiliations: Data Science Research Lab, Department of Statistics, University of Rajshahi

AI summary: Using panel Granger causality tests and local projections, the paper analyses the directed interdependence network of the 17 Sustainable Development Goals across 114 countries from 2000 to 2024, finding 84 significant linkages (40 synergies, 44 trade-offs) and fragile driver-receiver rankings, with peace and strong institutions as a net receiver and poverty reduction as an effect-size-weighted driver.

Comments 27 pages, 5 figures. Panel Granger non-causality and local projections on 114 countries (2000-2024). Submitted to Sustainability Science

Abstract

Governments with limited fiscal and administrative capacity need to know which Sustainable Development Goals (SDGs) propagate progress through the goal system and how quickly. We map the directed interdependence structure of all seventeen goals using a balanced panel of 114 countries observed annually from 2000 to 2024. The goal series are persistent, trending, and cross-sectionally dependent, so we apply two estimators matched to this regime: a Dumitrescu-Hurlin panel Granger non-causality test, run on first-differenced series, to recover the directed interaction network, and panel local projections with Driscoll-Kraay standard errors to measure the dynamic magnitude of 31 theory-derived indicator linkages. Of 272 directed goal pairs, 84 linkages survive false-discovery control (40 synergies, 44 trade-offs; network density 0.31). Synergies and trade-offs occur at comparable strength, so no single goal behaves as a universal accelerator, and the goal-level hierarchy itself is fragile. Driver-receiver rankings correlate weakly across lag orders and centrality metrics, and under a country bootstrap only two roles are distinguishable from zero: peace and strong institutions as the clearest net receiver, and poverty reduction as the most probable effect-size-weighted driver. The supported linkages are dynamic, accruing over four to five years: sanitation and poverty improvements are the strongest predictors of lower child mortality, and the education-child-health association is corroborated in independent World Development Indicators data across 183 countries. These results caution against rankings-based accelerator policy and support adaptive portfolios built on supported, time-lagged linkages monitored through constituent indicators.
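
The network figures quoted in the abstract are mutually consistent, which a few lines of arithmetic confirm: seventeen goals give 17 × 16 ordered pairs, and 84 surviving linkages over that total give the stated density. The helper names below are illustrative, not taken from the paper.

```python
def directed_pairs(n_nodes):
    """Number of ordered (source, target) pairs with source != target."""
    return n_nodes * (n_nodes - 1)


def network_density(n_edges, n_nodes):
    """Share of possible directed links that are present."""
    return n_edges / directed_pairs(n_nodes)


goals = 17
synergies, trade_offs = 40, 44
linkages = synergies + trade_offs

pairs = directed_pairs(goals)               # 17 * 16
density = network_density(linkages, goals)  # 84 / 272
```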

2601.19697 2026-06-16 cs.SE cs.AI (updated version)

AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion

Tianyue Jiang, Yanli Wang, Yanlin Wang, Daya Guo, Ensheng Shi, Yuchi Ma, Jiachi Chen, Zibin Zheng

Affiliations: School of Software, Sun Yat-sen University; Sun Yat-sen University

AI summary: To address the misalignment between retrieval and target code and the inability to exploit inference information in repository-level code completion, the paper proposes the AlignCoder framework, which combines query enhancement with reinforcement-learning-based retriever training and improves the EM score on CrossCodeEval by 18.1%.

Comments To appear at ASE'25

Abstract

Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code in the retrieval process, and the inability of existing retrieval methods to effectively utilize the inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement learning based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely-used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score compared to baselines on the CrossCodeEval benchmark. The results show that our framework achieves superior performance and exhibits high generalizability across various code LLMs and programming languages.

2512.19011 2026-06-16 cs.CR cs.AI cs.CL cs.LG (updated version)

Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale

Vasudev Majhi, Dhruv Gupta, Advait Singh, Matthew Barker, Dhruv Kumar

Affiliations: BITS Pilani; Trustwise

AI summary: Studies how CPU-class classifiers (such as SVMs and gradient-boosted trees) perform at screening LLM inputs for safety, finds them complementary to GPU models, and designs GuardChain, a three-stage pipeline in which the CPU stage resolves 80% of in-distribution queries at near-peak accuracy, lowering deployment cost.

Comments Under Review. 25 pages, 5 figures, 38 tables

Abstract

Safety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches impose significant per-query latency and infrastructure cost. Very little research has asked whether CPU-based classifiers, such as support vector machines and gradient-boosted trees trained on TF-IDF features, can match their accuracy across the conditions that production deployments encounter. We evaluate five CPU classifier families, Mamba-130M as an SSM-based GPU classifier, and transformer-based GPU models (DeBERTa-v3 and Gemma-2B with LoRA) across nine jailbreak sources and three regimes: in-distribution (D1), out-of-distribution (D2), and adversarially obfuscated (D3). On D1, the best CPU classifier matches the best transformer GPU model at roughly one-fifth the deployment cost. On D2, CPU classifiers fail via confident miscalibration, producing high-confidence false negatives that bypass escalation entirely. On D3, CPU classifiers outperform transformer GPU models by more than 26 percentage points in F1. Based on these complementary failure modes, we design GuardChain, a three-stage safety pipeline (Regex -> CPU -> GPU) that routes each prompt to the cheapest stage capable of a confident decision. The CPU stage alone resolves 80% of in-distribution prompts at near-peak accuracy, and the GPU stage recovers the out-of-distribution failures. For practitioners deploying LLM safety at scale, this work provides evidence that GPU-class infrastructure is unnecessary for the majority of traffic.
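
The routing idea behind a Regex -> CPU -> GPU cascade can be sketched as follows. This is a minimal illustration under stated assumptions, not GuardChain itself: the blocklist pattern, the confidence thresholds `low` and `high`, the 0.5 cut-off for the final stage, and the scorer callables are all hypothetical stand-ins, since the abstract does not specify them.

```python
import re

# Hypothetical stage-1 pattern; a real deployment would maintain many rules.
BLOCKLIST = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def guard_cascade(prompt, cpu_score, gpu_score, low=0.1, high=0.9):
    """Route a prompt to the cheapest stage that can decide confidently.

    cpu_score and gpu_score are callables returning P(unsafe) in [0, 1].
    Returns (label, deciding_stage).
    """
    # Stage 1: near-free pattern match for known attack strings.
    if BLOCKLIST.search(prompt):
        return ("unsafe", "regex")

    # Stage 2: cheap classifier; accept its verdict only when it is confident.
    p_cpu = cpu_score(prompt)
    if p_cpu >= high:
        return ("unsafe", "cpu")
    if p_cpu <= low:
        return ("safe", "cpu")

    # Stage 3: uncertain prompts escalate to the expensive model.
    return ("unsafe" if gpu_score(prompt) >= 0.5 else "safe", "gpu")


def toy_cpu_score(prompt):
    """Stand-in for a TF-IDF based classifier (illustrative only)."""
    if "jailbreak" in prompt:
        return 0.95
    return 0.5 if "hypothetically" in prompt else 0.02


def toy_gpu_score(prompt):
    """Stand-in for a fine-tuned transformer (illustrative only)."""
    return 0.8
```

Confidence thresholds alone do not protect against the out-of-distribution failure the abstract describes: a high-confidence false negative at the CPU stage returns "safe" and never reaches the GPU stage, so escalation has to be triggered by some other signal in that regime.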

2512.22560 2026-06-16 cs.DC cs.AI cs.LG (updated version)

RollArt: Disaggregated Multi-Task Agentic RL Training at Scale

Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang

Affiliations: HKUST; Alibaba Group; Tongyi Lab, Alibaba

AI summary: Presents RollArt, a system that disaggregates the reinforcement learning pipeline across heterogeneous hardware for efficient multi-task agentic RL training, achieving a 1.31-2.05x reduction in training time compared with existing systems.

Comments 19 pages, 15 figures

Abstract

Agentic Reinforcement Learning (RL) trains LLMs through multi-turn interactions with environments, producing workloads that mix compute-bound prefill, bandwidth-bound decoding, CPU-heavy environment execution, and bursty reward evaluation. Existing systems either colocate all stages on a single GPU cluster or decouple them only at a coarse granularity, overlooking hardware heterogeneity and incurring substantial synchronization overhead across stages. We present ROLLART, a system for multi-task agentic RL on disaggregated infrastructure. ROLLART maps each pipeline stage to best-fit hardware, routing prefill-heavy tasks to compute-optimized GPUs, decode-heavy tasks to bandwidth-optimized GPUs, and environments to CPU clusters. It decouples rollout at the trajectory level, allowing generation, environment interaction, and reward scoring to proceed independently, so that slow or failed environments never block the others. ROLLART offloads stateless reward computation to serverless infrastructure and overlaps rollout with training via staleness-bounded asynchronous weight synchronization. Our results demonstrate that ROLLART effectively improves training throughput and achieves a 1.31-2.05× reduction in training time compared to various RL systems. We also evaluated ROLLART by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, demonstrating its stability and scalability.

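2510.10981 2026-06-16 stat.ML cs.LG (updated version)

In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

Tomoya Wakayama, Taiji Suzuki

Affiliations: University of Tokyo; National Institute of Information and Communications Technology

AI summary: Within a meta-learning framework, the paper decomposes the total in-context learning risk into a Bayes Gap and a Posterior Variance, and shows that the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.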

Abstract

This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.
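
Why a split of this kind is orthogonal can be seen in the squared-loss case. The derivation below is a generic illustration and not necessarily the paper's exact formulation: the notation ($D_n$ for the in-context examples, $x$ for the query, $y$ for its label, $\hat f$ for the trained predictor with fixed parameters, $f^\star$ for the Bayes-optimal in-context predictor) is assumed here.

```latex
% Assumed notation: f^\star(D_n, x) = \mathbb{E}[\, y \mid D_n, x \,].
\begin{align*}
\mathbb{E}\big[(\hat f(D_n, x) - y)^2\big]
  &= \mathbb{E}\big[(\hat f - f^\star)^2\big]
   + 2\,\mathbb{E}\big[(\hat f - f^\star)(f^\star - y)\big]
   + \mathbb{E}\big[(f^\star - y)^2\big] \\
  &= \underbrace{\mathbb{E}\big[(\hat f - f^\star)^2\big]}_{\text{Bayes Gap}}
   + \underbrace{\mathbb{E}\big[\operatorname{Var}(y \mid D_n, x)\big]}_{\text{Posterior Variance}} .
\end{align*}
% The cross term vanishes by the tower property: \hat f - f^\star is a
% function of (D_n, x), and \mathbb{E}[\, f^\star - y \mid D_n, x \,] = 0.
```

The first term depends on the trained model and the second does not, which is the sense in which the second component is model-independent.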

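2511.20709 2026-06-16 cs.SE cs.AI cs.CR (updated version)

DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

Rupam Patir, Keyan Guo, Suvadra Barua, Abhijeet Pathak, Dinesh Gudimetla, Jiawei Guo, Hongxin Hu, Haipeng Cai

Affiliations: University at Buffalo, SUNY

AI summary: Presents DualGauge, the first automated framework for jointly evaluating the correctness and security of specification-only code generation; benchmarking on 307 tasks shows that functional correctness overestimates reliable code generation, that joint success rates stay below 15%, and that neither model-side factors nor agentic systems reliably improve them.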

Abstract

Large language models (LLMs) and LLM-based coding agents are now used to generate code from natural-language specifications, yet ensuring such code is both functionally correct and secure remains a challenge. We present DualGauge, the first fully automated framework for jointly evaluating correctness and security of specification-only code generation, supported by DualGauge-Bench, a language-agnostic benchmark of 307 coding tasks each paired with functional and security tests derived from the same specification. Evaluating 10 representative LLMs across Python, C++, and JavaScript, we find that functional correctness substantially overestimates reliable code generation: even the strongest model remains below 15% joint security-functionality success in every language. Common model-side factors (scale, extended thinking, quantization, instruction tuning, and code specialization) do not reliably improve joint performance, suggesting secure-and-correct code generation does not simply emerge from stronger coding capability. Evaluation of 3 leading agentic coding systems (Codex, OpenHands, and Claude Code) shows that iterative scaffolding provides no advantage over direct (LLM-based) generation on specification-only tasks. A qualitative audit reveals failures concentrate at the output contract boundary and in guards that exist but are insufficient, patterns that only joint benchmarking reliably exposes.
