arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30432 2026-06-08 math.DS cs.LG cs.SI nlin.AO physics.soc-ph (updated version)

Learning effective models from network dynamics data with multiple initial conditions using weak form SINDy

Moyi Tian, Daniel A. Messenger, Vanja Dukic, Nancy Rodríguez, David M. Bortz

Affiliations Department of Applied Mathematics, University of Colorado, Boulder, CO 80309, United States; Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, United States

AI summary This paper uses Weak Form Sparse Identification of Nonlinear Dynamics (WSINDy) to learn effective models from network dynamics data with multiple initial conditions, and evaluates how the noise level and the number of trajectories affect learning accuracy.

Comments 24 pages, 14 figures, 1 table. Code available at https://github.com/Moyi-Tian/WSINDy-NetworkDynamics

Abstract

Social systems consist of networks of individuals who influence one another through social interactions. Studying how processes evolve on these networks can help us better understand patterns of social behavior. We study a system that couples online and offline social activity and investigate how to learn effective models directly from data using Weak Form Sparse Identification of Nonlinear Dynamics (WSINDy), a method for discovering governing equations. We assess learning performance using data generated by a mean-field approximation model of a stochastic interaction process on networks and test how accurately the system can be recovered under different noise levels. Our results show that using more trajectories improves accuracy when noise is high, but only a small number of additional trajectories is needed to gain most of the benefit, with little improvement beyond that. We also learn effective ODE models from averaged stochastic data on networks. When traditional mean-field approximations fail, identifying continuum ODEs directly from stochastic processes yields efficient models that better match the data and provide deeper insight into the underlying dynamics.
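
The weak-form idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' WSINDy code: the logistic test equation, the polynomial library, the bump test function, and all function names and parameter values below are assumptions chosen for illustration. Each row integrates the data against a compactly supported test function, so the time derivative is moved onto the test function by integration by parts, and rows from several initial conditions are stacked before a sequentially thresholded least-squares fit.

```python
import numpy as np

def logistic_trajectory(u0, t, sigma, rng):
    """Exact solution of du/dt = u - u^2 plus additive Gaussian noise (toy data)."""
    u = 1.0 / (1.0 + (1.0 / u0 - 1.0) * np.exp(-t))
    return u + sigma * rng.standard_normal(t.size)

def weak_rows(t, u, library, radius=20, stride=5, p=4):
    """Rows of the weak system: G[k, j] = int phi_k f_j(u) dt, b[k] = -int phi_k' u dt."""
    dt = t[1] - t[0]
    s = np.linspace(-1.0, 1.0, 2 * radius + 1)
    phi = (1.0 - s**2) ** p  # bump that vanishes at both window edges
    dphi = -2.0 * p * s * (1.0 - s**2) ** (p - 1) / (radius * dt)
    theta = np.column_stack([f(u) for f in library])
    width = 2 * radius + 1
    starts = range(0, t.size - width + 1, stride)
    G = np.array([dt * (phi @ theta[i:i + width]) for i in starts])
    b = np.array([-dt * (dphi @ u[i:i + width]) for i in starts])
    return G, b

def stlsq(G, b, threshold=0.2, iters=10):
    """Sequentially thresholded least squares: drop small terms, refit the rest."""
    w = np.linalg.lstsq(G, b, rcond=None)[0]
    for _ in range(iters):
        keep = np.abs(w) >= threshold
        w = np.zeros_like(w)
        if keep.any():
            w[keep] = np.linalg.lstsq(G[:, keep], b, rcond=None)[0]
    return w

def fit(initial_conditions, sigma=0.0, seed=0):
    """Stack weak-form rows from several trajectories and fit the library [u, u^2, u^3]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 10.0, 501)
    library = [lambda u: u, lambda u: u**2, lambda u: u**3]
    blocks = [weak_rows(t, logistic_trajectory(u0, t, sigma, rng), library)
              for u0 in initial_conditions]
    G = np.vstack([g for g, _ in blocks])
    b = np.concatenate([rhs for _, rhs in blocks])
    return stlsq(G, b)

# Coefficients of [u, u^2, u^3]; the generating model is du/dt = 1*u - 1*u^2.
print(fit([0.05, 0.2, 0.5, 1.5, 2.0], sigma=0.01))
```

Changing `sigma` and the number of initial conditions passed to `fit` gives a rough feel for the noise versus trajectory-count trade-off the abstract describes, although this toy problem says nothing about the paper's network models.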

2605.25645 2026-06-08 cs.DC cs.AI (updated version)

Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

Jatin Kishnani, Mayank Goel, Amit Singh, Pulkit Agrawal, Sairanjan Mishra

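Affiliations Google Cloud

AI summary This paper presents the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware. Through an empirical comparison with GPU platforms, it provides code-level adaptations and shows that TPUs have advantages in training speed and cost.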

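AI-generated abstract (translated from Chinese)

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and a TPU v6e-8 (Trillium) for inference, we document all the code-level adaptations required to port a GPU-native training recipe built on PyTorch, HuggingFace TRL, and FSDP to the JAX + Tunix/Qwix stack. These adaptations cover mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup required to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput. Compared with a 2xH100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. Inference throughput differs by less than 3% between platforms, while time to first token on the TPU is 2x lower (235 ms vs. 475 ms). Overall, for a representative training-plus-serving workload, the TPU configuration is 1.82x cheaper. Our work fills a key gap in the open tooling ecosystem and gives practitioners a reproducible, production-ready recipe for deploying Gemma 4 on TPU infrastructure.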

Abstract

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe, built on PyTorch, HuggingFace TRL, and FSDP, to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we cover the vLLM-TPU Docker setup required to serve Gemma 4 on v6e-8 and explain the observed latency and throughput characteristics across a QPS sweep spanning 512 to 16k input tokens. Across both workloads we compare performance and cost against a similarly priced 2xH100 GPU baseline running identical hyperparameters. The TPU completes training 1.61x faster at 2.12x lower cost. For inference, TPU v6e-8 matches GPU at short context (<=2048 tokens) and decisively outperforms at long context: 66% higher throughput and 23.6x faster TTFT at 4096-token inputs (61 ms vs 1,443 ms at QPS=4). Our work closes a critical gap in the open tooling ecosystem and provides practitioners with a recipe for Gemma 4 Dense 31B deployment on TPU infrastructure.
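
The headline ratios in the abstract can be sanity-checked with a few lines of arithmetic. This is only a sketch under a stated assumption: that training cost equals hourly price multiplied by wall-clock time, and that "2.12x lower cost" means the GPU cost divided by the TPU cost. The abstract does not give hourly prices, so the derived price ratio is an implication of that assumption rather than a reported figure.

```python
# Figures quoted in the abstract.
training_speedup = 1.61        # GPU wall-clock time / TPU wall-clock time
training_cost_ratio = 2.12     # GPU training cost / TPU training cost
ttft_gpu_ms, ttft_tpu_ms = 1443.0, 61.0

# Assumption: cost = hourly price * time, so the cost ratio factors into
# (hourly price ratio) * (time ratio).
implied_hourly_price_ratio = training_cost_ratio / training_speedup

# Time-to-first-token ratio at 4096-token inputs, QPS=4.
ttft_ratio = ttft_gpu_ms / ttft_tpu_ms

print(f"implied GPU/TPU hourly price ratio: {implied_hourly_price_ratio:.2f}")
print(f"TTFT ratio: {ttft_ratio:.2f}")
```

Under that assumption, roughly a third of the cost advantage would come from pricing and the rest from the shorter run time.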

2603.11075 2026-06-08 cs.AR cs.AI (updated version)

VeriHGN: Heterogeneous Graph-Based Congestion Prediction for Chip Layout Verification

Runbang Hu, Bo Fang, Bingzhe Li, Yuede Ji

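Affiliations The University of Texas at Arlington; The University of Texas at Dallas

AI summary This paper proposes the VeriHGN framework, which uses an enhanced heterogeneous graph to unify circuit components and spatial grids, enabling more accurate modeling of the interaction between logical intent and physical realization and improving the accuracy and correlation of congestion prediction.

Comments Accepted at KDD 2026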

Abstract

As Very Large Scale Integration (VLSI) designs continue to scale in size and complexity, layout verification has become a central challenge in modern Electronic Design Automation (EDA) workflows. In practice, congestion can only be accurately identified after detailed routing, making traditional verification both time-consuming and costly. Learning-based approaches have therefore been explored to enable early-stage congestion prediction and reduce routing iterations. However, although prior methods incorporate both netlist connectivity and layout features, they often model the two in a loosely coupled manner and primarily produce numerical congestion estimates. We propose VeriHGN, a verification framework built on an enhanced heterogeneous graph that unifies circuit components and spatial grids into a single relational representation, enabling more faithful modeling of the interaction between logical intent and physical realization. Experiments on industrial benchmarks, including ISPD2015, CircuitNet-N14, and CircuitNet-N28, demonstrate that VeriHGN achieves the best or near-best performance over state-of-the-art methods in prediction accuracy and correlation metrics.

2605.17561 2026-06-08 cs.SE cs.AI cs.MA (updated version)

Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports

Mahmut Furkan Gon, Emre Dinc, Tevfik Emre Sungur, Eray Tuzun

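Affiliations Department of Computer Engineering, Bilkent University

AI summary This study introduces a standardized, root-cause-oriented taxonomy for subclassifying invalid bug reports and experimentally tests the accuracy of different approaches to invalid-report subclassification and no-code fix generation. It also analyzes how different configurations perform on a gold-standard benchmark created by the authors.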

Comments Submitted to IEEE Transactions on Software Engineering (TSE) and currently under review

Abstract

Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not require code changes, and are resolved with a no-code fix. Having customer support manually determine the root cause of invalid bug reports and provide actionable resolutions wastes considerable resources. Our goal is to introduce a standardized taxonomy for root-cause-oriented invalid bug report subclassification, and to perform experiments that test the accuracy of various approaches to invalid subclassification and no-code fix generation. We study how different configurations perform on a gold-standard benchmark we have created. Using a manually curated benchmark for higher-quality analysis, we experimented with vanilla LLMs, Retrieval Augmented Generation (RAG), and agentic web search to identify invalid subclasses and generate no-code fixes. We evaluated the results against manually labeled ground truth data that includes the invalid subclass and no-code fixes from the original bug reports. We measured subclass detection performance with weighted F1-Score, and assessed no-code fix suggestions using BERTScore and Judge LLM success rates. For subclassification, retrieval augmented generation achieves the highest overall performance with 0.66 weighted F1, slightly outperforming vanilla LLMs at 0.65 and agentic web search at 0.64. At the subclass level, performance peaks at 0.85 F1 for Non-reproducibility and 0.79 for Feature Request and Question, while Wrong Version remains the most challenging with scores between 0.00 and 0.29. For no-code fix generation, agentic web search achieves the highest overall Judge LLM success rate at 68.9%, compared to 64.4% for RAG applications and 64.9% for vanilla LLMs, with subclass-level peaks of 87.4% for Working as Designed and 72.2% for Question.
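
For readers unfamiliar with the weighted F1 score used above, the following minimal sketch shows how per-subclass F1 values are averaged with weights proportional to each subclass's support. The function name and the four example labels are illustrative only; they are not taken from the paper's benchmark.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator >= n >= 1 for labels in y_true
        total += n * f1
    return total / len(y_true)

# Toy labels, made up for illustration.
y_true = ["Question", "Question", "Wrong Version", "Non-reproducibility"]
y_pred = ["Question", "Non-reproducibility", "Question", "Non-reproducibility"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # → 0.4167
```

Because the weights follow class frequency, a rare and difficult subclass such as Wrong Version can score near zero while moving the overall weighted F1 only slightly.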

2605.17548 2026-06-08 cs.SE cs.AI (updated version)

Rethinking Code Review in the Age of AI: A Vision for Agentic Code Review

Hüseyin Özgür Kamalı, Erdem Tuna, Vahid Haratian, Eray Tüzün

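Affiliations Microsoft; Ankara University; Bilkent University

AI summary This paper examines how code review is evolving in the age of AI and proposes an AI-powered code review workflow that combines specialized agents with human-controlled quality gates, aiming to improve the efficiency and reliability of code review.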

Comments Submitted to ACM Transactions on Software Engineering Methodology (TOSEM). A shorter version of this work has been presented at ICSE-JAWs 2026, Rio de Janeiro, Brazil

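AI-generated abstract (translated from Chinese)

Code review has evolved over decades, from informal peer checking to today's pull request (PR) workflows, yet it remains a largely manual, uneven, and cognitively demanding process. The rise of Artificial Intelligence (AI) coding assistants has intensified this challenge: while these tools increase the speed of code generation, they also increase the volume of code requiring review, turning code review into a growing bottleneck. Current AI support remains fragmented, with tools focusing on isolated tasks such as reviewer recommendation, PR description generation, or comment suggestion rather than the entire PR review workflow. This paper reviews the historical evolution of code review practices and examines the shift driven by large language models (LLMs) and agentic AI systems. We then present a vision for an AI-powered code review workflow that combines specialized agents with human-controlled quality gates. Our framework spans five stages: PR Creation, PR Augmentation, Reviewer Selection, AI-Assisted Code Review, and PR Retrospective, with humans retained at key decision points to preserve judgment, accountability, and team-level understanding. We identify the main open challenges for responsible adoption, including reliability, bias, privacy, automation bias, transparency, and evaluation, and propose a research agenda for more effective human-AI collaboration in software engineering.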

Abstract

Code review has evolved for decades, from informal peer checking to today's pull request (PR) workflows, yet it remains a largely manual and cognitively demanding process. The rise of Artificial Intelligence (AI) coding assistants has intensified this challenge: while these tools increase code production velocity, they also expand the volume of code requiring review, turning code review into a growing bottleneck. Current AI support in code review remains fragmented, with tools focusing on isolated tasks such as reviewer recommendation, PR description generation, or comment suggestion rather than the end-to-end PR review workflow. We address this gap by treating review effectiveness as an outcome of the full code review lifecycle rather than a single stage, proposing a framework that carries context across stage boundaries. We propose a future vision for code review in which reviewers transition from manual inspectors into supervisory operators of agents. In this vision, staged, AI-powered workflows aim to align the pace of code generation with shared understanding and accountable engineering. In this paper, we review the historical evolution of code review practices, identify challenges in traditional code review systems, and examine the shift driven by large language models (LLMs) and agentic AI systems. We then present a vision for an AI-powered code review workflow combining specialized agents with human-controlled quality gates. Our framework spans five stages: PR Creation, PR Augmentation, Reviewer Selection, AI-Assisted Code Review, and PR Retrospective, with humans retained at key decision points to preserve judgment, accountability, and team-level understanding. Finally, we identify key adoption challenges and outline research directions for evaluation, governance, and responsible human-AI collaboration.

2605.13268 2026-06-08 quant-ph cs.LG (updated version)

Physics Guided Generative Optimization for Trotter Suzuki Decomposition

WenBin Yan

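Affiliations University of Colorado Boulder

AI summary Proposes P-GONE, which combines a conditional diffusion model, a graph neural network, and REINFORCE fine-tuning to jointly optimize term grouping, order, and time-step allocation in Trotter-Suzuki decomposition, achieving 19.4x circuit depth compression at fidelity ≥ 0.95.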

Abstract

Trotter-Suzuki product formulas are the standard route to Hamiltonian evolution on noisy intermediate-scale quantum (NISQ) hardware, but their accuracy depends on three coupled choices: term grouping, product-formula order, and time-step allocation. Grouping and order are discrete, which makes direct gradient optimization infeasible and forces existing compilers to rely on static heuristics. We describe P-GONE, a method that combines a conditional diffusion model (D3PM + DDPM), a graph neural network (GNN) encoder, and closed-loop REINFORCE fine-tuning to jointly learn grouping, order, and time-step optimization over a mixed discrete-continuous space. Under fidelity-matched conditions ($F \geq 0.95$), the method achieves circuit depth 86 versus 1673 for Qiskit fourth-order (ungrouped, Suzuki-4), about $19.4\times$ compression, and 141 for Paulihedral (first-order Trotter), about $1.6\times$ compression. At $T=0.90$ the method also beats the Qiskit group-commuting teacher (65 vs 103, $1.6\times$ compression), though at $T=0.95$ the teacher still leads, a stratified pattern that points toward fidelity-aware fine-tuning. Under a standard depolarizing noise model, the method achieves noisy fidelity roughly $2\times$ the Qiskit fourth-order baseline (0.743 vs 0.380). Ablation shows a clear hierarchy: order learning $>$ time allocation $>$ grouping. Best-of-N sampling ($N=32$ is a practical sweet spot) and CFG guidance give flexible fidelity-depth trade-offs at inference. The method works well on structured Hamiltonians (TFIM, Heisenberg), but random Pauli Hamiltonians fail entirely at $T \geq 0.95$, a boundary that defines where the method applies.
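
To make the role of product-formula order and step count concrete, here is a small numerical sketch comparing first-order and symmetric second-order Trotter splittings on a toy single-qubit Hamiltonian H = X + Z. It is unrelated to P-GONE itself; the Hamiltonian, the evolution time, and all function names are assumptions for illustration.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def evolve(h, t):
    """exp(-i h t) for a Hermitian matrix h, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(h)
    return (vecs * np.exp(-1j * vals * t)) @ vecs.conj().T

def trotter1(a, b, t, n):
    """First-order formula: (e^{-i a t/n} e^{-i b t/n})^n."""
    step = evolve(a, t / n) @ evolve(b, t / n)
    return np.linalg.matrix_power(step, n)

def trotter2(a, b, t, n):
    """Symmetric second-order formula: (e^{-i a t/2n} e^{-i b t/n} e^{-i a t/2n})^n."""
    half = evolve(a, t / (2 * n))
    step = half @ evolve(b, t / n) @ half
    return np.linalg.matrix_power(step, n)

def error(u, v):
    """Spectral-norm distance between two unitaries."""
    return np.linalg.norm(u - v, ord=2)

exact = evolve(X + Z, 1.0)
for n in (1, 4, 16):
    print(n, error(trotter1(X, Z, 1.0, n), exact), error(trotter2(X, Z, 1.0, n), exact))
```

More steps or a higher order reduce the error but lengthen the circuit, which is the fidelity versus depth trade-off that the abstract's joint optimization targets.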

2601.13508 2026-06-08 cond-mat.mtrl-sci cs.AI (updated version)

Autonomous computational catalysis through an agentic research system

Honghao Chen, Jiangjie Qiu, Yi Shen Tew, Xiaonan Wang

Affiliations Beijing Key Laboratory of Artificial Intelligence for Advanced Chemical Engineering Materials, State Key Laboratory of Chemical Engineering and Low-Carbon Technology, Department of Chemical Engineering, Tsinghua University

AI summary Proposes CatMaster, a catalysis-native agentic research system that turns natural-language requests into computational studies and executes them autonomously, from modelling to closed-loop catalyst design. In a CO2-to-CO catalyst design case it identifies competitive active-site motifs.

Comments 25 pages for main manuscript; SI not available here

Abstract

Autonomous agents are beginning to transform scientific research from tool-assisted workflows toward self-sustaining discovery processes. Computational catalysis provides a representative challenge, as catalyst discovery requires high-level questions to be translated into coordinated model construction, atomistic simulation, mechanistic analysis, and iterative design across multiple scales. Here we introduce CatMaster, a catalysis-native agentic research system that recasts computational catalysis as a low-barrier virtual ecosystem for autonomous research. CatMaster maintains an evolving research state and extends capabilities through self-feedback across model construction, calculation, critique, and catalyst-design decisions within one extensible environment. Across progressively challenging tasks, CatMaster converts natural-language requests into concrete computational studies, from essential atomistic modelling and standard calculations to mechanism exploration and closed-loop catalyst design. It showed robust execution in representative computational-catalysis scenarios and near-leading performance across selected MatBench tasks, with the phonon scenario demonstrating its modelling self-evolution capability. In the independent CO2-to-CO catalyst design case, CatMaster used iterative self-critique and evidence refinement to identify competitive B-CoN4 and NiN3B/N-NiN3B motifs. These results establish a virtual-ecosystem paradigm in which AI agents move beyond simulation execution toward end-to-end computational research, providing a foundation for autonomous discovery in catalysis and materials science.

2605.10792 2026-06-08 math.OC cs.LG (updated version)

Implicit Neural Optimal Transport via Fixed-Point Optimization

Yesom Park, Eric Gelphman, Stanley Osher, Samy Wu Fung

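Affiliations Department of Mathematics, University of California, Los Angeles; Department of Applied Mathematics and Statistics, Colorado School of Mines

AI summary Proposes an implicit neural formulation of optimal transport that uses a single potential and a proximal fixed-point problem to avoid adversarial training, giving a stable and efficient single-network framework that recovers both forward and backward transport maps.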

Comments 37 pages, submitted to SIAM Journal on Mathematical Data Science (currently under review)

Abstract

We propose an implicit neural formulation of optimal transport that eliminates adversarial min-max optimization and multi-network architectures commonly used in existing approaches. Our key idea is to parameterize a single potential in the Kantorovich dual and reformulate the associated c-transform as a proximal fixed-point problem. This yields a stable single-network framework in which dual feasibility is enforced exactly through proximal optimality conditions rather than adversarial training. Despite the inner fixed-point computation, gradients can be computed without differentiating through the fixed-point iterations, enabling efficient training without requiring implicit differentiation. We further establish convergence of stochastic gradient descent. The resulting framework is efficient, scalable, and broadly applicable: it simultaneously recovers forward and backward transport maps and naturally extends to class-conditional settings. Experiments on high-dimensional Gaussian benchmarks, physical datasets, and image translation tasks demonstrate strong transport accuracy together with improved training stability and favorable computational and memory efficiency.
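
The phrase "c-transform as a proximal fixed-point problem" can be made concrete in a toy setting. The sketch below is not the paper's algorithm. It assumes the quadratic cost c(x, y) = 0.5 (x - y)^2 in one dimension, for which the stationarity condition of inf_x [c(x, y) - phi(x)] reads x = y + phi'(x), and it assumes phi' is a contraction so that plain fixed-point iteration converges. The quadratic potential and all names are hypothetical.

```python
def c_transform_quadratic(y, phi, grad_phi, iters=100):
    """Evaluate inf_x [0.5*(x - y)**2 - phi(x)] by iterating x <- y + grad_phi(x)."""
    x = y
    for _ in range(iters):
        x = y + grad_phi(x)
    return x, 0.5 * (x - y) ** 2 - phi(x)

# Toy potential phi(x) = -0.5 * a * x^2 with a < 1, so the iteration contracts.
a = 0.5
phi = lambda x: -0.5 * a * x**2
grad_phi = lambda x: -a * x

x_star, value = c_transform_quadratic(3.0, phi, grad_phi)
print(x_star, value)  # closed form: x* = y / (1 + a), value = a * y^2 / (2 * (1 + a))
```

In this quadratic-cost setting the minimizer is the proximal point of -phi at y, which is one way to read the abstract's "proximal" wording. How the paper treats general costs and neural potentials is not stated in the abstract.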

2605.08717 2026-06-08 cs.SE cs.AI (updated version)

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Chenyu Zhao, Shenglin Zhang, Yihang Lin, Wenwei Gu, Zhimin Chen, Yongqian Sun, Dan Pei, Chetan Bansal, Saravan Rajmohan, Minghua Ma

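Affiliations Nankai University; Tsinghua University; Microsoft

AI summary Proposes the PROBE framework, which converts runtime evidence into structured recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. Across settings such as code repair and workflow recovery, it reaches 65.37% diagnosis accuracy and a 21.79% recovery rate.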

Abstract

Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.
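
The reported percentages can be turned into approximate case counts, which makes the diagnosis-recovery gap easier to see. This sketch assumes every percentage is computed over all 257 initially unresolved cases and that the two percentage-point leads refer to the baseline's own diagnosis and recovery rates. The abstract does not state the counts, so the numbers below are implied rather than reported.

```python
total_cases = 257
probe_diag, probe_recovery = 65.37, 21.79   # PROBE percentages from the abstract
lead_diag, lead_recovery = 43.58, 12.45     # percentage-point leads over the baseline

def implied_count(percent, n=total_cases):
    """Nearest whole number of cases matching a percentage of n."""
    return round(percent / 100 * n)

probe_diag_cases = implied_count(probe_diag)
probe_recovered_cases = implied_count(probe_recovery)
baseline_diag_cases = implied_count(probe_diag - lead_diag)
baseline_recovered_cases = implied_count(probe_recovery - lead_recovery)

print(probe_diag_cases, probe_recovered_cases)
print(baseline_diag_cases, baseline_recovered_cases)
```

On these assumptions, far fewer cases are recovered than are correctly diagnosed, in line with the abstract's remark that accurate diagnosis alone is insufficient.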

2605.06647 2026-06-08 cs.IR cs.AI cs.LG (updated version)

Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval

Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

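Affiliations Meta Superintelligence Labs; Rice University

AI summary Proposes SIRA, which compresses multi-round exploration into a single corpus-discriminative retrieval step: an LLM enriches document vocabulary, predicts vocabulary missing from the query, and filters terms using corpus statistics. SIRA achieves the strongest average retrieval performance on BEIR benchmarks and surpasses RL-trained agentic systems on downstream QA tasks.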

Abstract

Retrieval-augmented agents are increasingly the interface to large knowledge bases, yet most treat retrieval as a black box: they issue exploratory queries, inspect snippets, and reformulate until evidence emerges. This resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, causing extra retrieval rounds, latency, and poor recall. We introduce Superintelligent Retrieval Agent (SIRA), which casts superintelligence in retrieval as compressing multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask which terms are relevant; it asks which terms separate the desired evidence from corpus-level confusers. Offline, an LLM enriches each document with missing search vocabulary; at query time, it predicts evidence vocabulary the query omits; and corpus statistics serve as tool calls that filter terms that are absent, overly common, or unlikely to create retrieval margin. The final step is a single weighted BM25 call combining the query with the validated expansion. Across ten BEIR benchmarks, SIRA achieves the strongest average retrieval performance in our comparison, beating dense retrievers, learned sparse retrievers, and LLM search-agent baselines while using no relevance labels or retriever fine-tuning. On downstream QA, its retrieval-only answer coverage exceeds recent RL-trained agentic QA systems on NQ and HotpotQA. We also introduce BrowseComp-Wikipedia, a hard-search benchmark of 232 BrowseComp-derived queries over a 25,587,229-document Wikipedia index. Even without index-time enrichment, using only grounded Wikipedia categories, SIRA outperforms multi-round Perplexity agents at every budget, reaching 9.70% Recall@1, 15.27% Recall@10, and 36.14% Recall@100.
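
The last two steps described above, filtering candidate expansion terms with corpus statistics and then issuing a single weighted BM25 call, can be sketched in plain Python. This is an illustrative toy, not SIRA: the documents, the candidate terms (standing in for an LLM's guesses), the document-frequency cutoff, the 0.5 expansion weight, and the particular BM25 idf variant are all assumptions.

```python
import math
from collections import Counter

def filter_expansion(candidates, docs, max_df_ratio=0.25):
    """Keep candidate terms that occur in the corpus but are not overly common."""
    tokenized = [set(d.lower().split()) for d in docs]
    kept = []
    for term in candidates:
        df = sum(term in d for d in tokenized)
        if 0 < df <= max_df_ratio * len(docs):
            kept.append(term)
    return kept

def bm25_scores(docs, weighted_terms, k1=1.2, b=0.75):
    """BM25 with a per-term query weight; idf uses the log(1 + ...) smoothing."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term, weight in weighted_terms.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += weight * idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "aspirin reduces fever and mild pain",
    "acetylsalicylic acid inhibits platelet aggregation",
    "fever is a common symptom of infection",
    "pain management guidelines for common conditions",
]
query = {"aspirin": 1.0, "blood": 1.0, "clot": 1.0}
candidates = ["platelet", "acetylsalicylic", "common", "thrombosis"]  # hypothetical LLM guesses

kept = filter_expansion(candidates, docs)
expanded = {**query, **{term: 0.5 for term in kept}}
print(kept)
print(bm25_scores(docs, query))
print(bm25_scores(docs, expanded))
```

In this toy corpus the second document shares no term with the original query, so it only becomes retrievable once the validated expansion terms are added.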

2605.06215 2026-06-08 physics.chem-ph cs.AI (updated version)

COF26: A new on-top functional for multiconfiguration pair-density functional theory

Yuhao Chen, Donald G. Truhlar, Xiao He

Affiliations Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development; Shanghai Frontiers Science Center of Molecule Intelligent Syntheses; School of Chemistry and Molecular Engineering, East China Normal University; Department of Chemistry, Chemical Theory Center, and Minnesota Supercomputing Institute, University of Minnesota; Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University; New York University–East China Normal University Center for Computational Chemistry, New York University Shanghai

AI summary Proposes the COF26 functional, developed with a large-language-model-assisted optimization workflow. It performs well for both strongly and weakly correlated systems and is recommended for future MC-PDFT calculations.

Abstract

Multiconfiguration pair-density functional theory (MC-PDFT) provides an efficient and accurate framework for computing electronic energies in strongly correlated molecular systems, with the quality of the on-top functional being a key determinant of its predictive accuracy. Here, we introduce MMCDDB26, a rigorously curated benchmark database comprising 76 datasets and 1,495 reactions. We further propose a constrained, large-language-model-assisted optimization workflow for the development and assessment of MC-PDFT functionals. Using this workflow, we optimized the parameters of the MC23/MC25 functionals on MMCDDB26 to obtain MC26. Compared with earlier functionals of the same class, MC26 improves the accuracy on the training set and achieves a more balanced overall performance. In addition, we developed the hybrid meta-functional COF26. We find that COF26 delivers superior performance for both strongly and weakly correlated systems, and therefore recommend COF26 for future MC-PDFT calculations.

2511.02399 2026-06-08 cs.SE cs.AI (updated version)

Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework

Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng

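Affiliations Fudan University; Nanyang Technological University

AI summary Proposes the EvoDev framework, which uses feature decomposition, dependency modeling, and context propagation to achieve iterative end-to-end software development, improving on Claude Code by 57.3% on Android tasks.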

Comments Accepted by ISSTA 2026

Abstract

Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each feature node in the feature map maintains multi-layer contexts, including business logic, software design, and code implementation, which are propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by 57.3%, while improving single-agent performance by 16.0%-58.5% across different base LLMs, highlighting the importance of feature decomposition, dependency modeling, context propagation, and workflow-aware agent design for end-to-end software development. Moreover, our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
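
The Feature Map described above is a directed acyclic graph whose nodes carry context that flows along dependency edges. A minimal sketch of that data structure follows, using Python's standard `graphlib` module. The feature names, the one-line context strings, and the helper functions are hypothetical and are not EvoDev's implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical feature map: each feature maps to the features it depends on.
feature_map = {
    "login": set(),
    "profile": {"login"},
    "feed": {"login"},
    "notifications": {"profile", "feed"},
}

# Hypothetical per-feature context (standing in for business logic, design, code).
contexts = {name: f"context of {name}" for name in feature_map}

def development_order(deps):
    """Order features so that every dependency is developed first."""
    return list(TopologicalSorter(deps).static_order())

def inherited_context(deps, ctx):
    """For each feature, gather the contexts of all direct and transitive dependencies."""
    result = {}
    for feature in development_order(deps):
        gathered = {}
        for dep in sorted(deps[feature]):
            gathered.update(result[dep])
            gathered[dep] = ctx[dep]
        result[feature] = gathered
    return result

order = development_order(feature_map)
inherited = inherited_context(feature_map, contexts)
print(order)
print(sorted(inherited["notifications"]))
```

Processing features in topological order means each iteration can rely on context from everything it depends on, which is the property the abstract attributes to context propagation.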

2604.23025 2026-06-08 cs.CR cs.LG (updated version)

Self-Supervised Learning for Android Malware Detection on a Time-Stamped Dataset

Annan Fu, Hao Pei, Maryam Tanha

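Affiliations Mastercard Canada

AI summary To address temporal bias in machine learning detectors, the authors build a time-stamped dataset and apply BYOL self-supervised pre-training, reaching 98% accuracy and 89% F1 under time-aware evaluation.

Comments Accepted for publication in IEEE ICC 2026. © 2026 IEEE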

Abstract

Android malware detectors built with machine learning often suffer from temporal bias: models are trained and evaluated without respecting apps' actual release times, inflating accuracy and weakening real-world robustness. We address this by constructing a time-stamped dataset of benign and malicious Android apps and introducing a timestamp-verification procedure to ensure temporal accuracy. We then propose a detection framework that uses Bootstrap Your Own Latent (BYOL) for self-supervised pre-training to learn obfuscation-resilient representations, followed by supervised classification. Under time-aware evaluation, the method attains 98% accuracy and 89% F1. We further characterize malware behavior by analyzing true positives and false negatives using VirusTotal and the MITRE ATT&CK framework. To support reproducibility and further innovation, we release our dataset and source code.
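
Time-aware evaluation, as described above, means the model never trains on apps released after the ones it is tested on. The sketch below shows one simple way to enforce that with a date cutoff. The sample records, the cutoff date, and the function name are made up for illustration and do not come from the paper's dataset or code.

```python
from datetime import date

# Hypothetical samples; real entries would come from a time-stamped app dataset.
samples = [
    {"app": "a", "released": date(2021, 3, 1), "malicious": False},
    {"app": "b", "released": date(2022, 7, 15), "malicious": True},
    {"app": "c", "released": date(2023, 1, 9), "malicious": False},
    {"app": "d", "released": date(2023, 11, 30), "malicious": True},
]

def time_aware_split(items, cutoff):
    """Train on apps released strictly before the cutoff, test on the rest."""
    train = [s for s in items if s["released"] < cutoff]
    test = [s for s in items if s["released"] >= cutoff]
    return train, test

train, test = time_aware_split(samples, date(2023, 1, 1))
print([s["app"] for s in train], [s["app"] for s in test])  # → ['a', 'b'] ['c', 'd']
```

A random shuffle split, by contrast, could place app "d" in training and app "a" in testing, which is the kind of leakage from the future that inflates accuracy.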

2604.17948 2026-06-08 cs.CR cs.AI cs.MA Updated version

RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

Parteek Jamwal, Minghao Shao, Boyuan Chen, Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Kashish Satija, Mariam Shafey, Mohamed Mahmoud, Moncif Dahaji Bouffi, Pasindu Wickramasinghe, Siyona Goel, Yaakulya Sabbani, Hakim Hacid, Mthandazo Ndhlovu, Eleanna Kafeza, Sanjay Rawat, Muhammad Shafique

Affiliations * University of California, Berkeley

AI summary Proposes RAVEN, a framework that combines LLM agents with retrieval-augmented generation to automatically produce vulnerability analysis reports following the Google Project Zero template, reaching an average quality score of 54.21% on 105 samples.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.

2604.09552 2026-06-08 cs.IR cs.AI cs.CL Updated version

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

Affiliations * School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269; Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

AI summary Proposes MCERF, a framework that pairs the multimodal retriever ColPali with LLM reasoning and uses hybrid lookup, vision-to-text fusion, a high-reasoning mode, and self-consistency decisions. It achieves a 41.1% relative gain in average accuracy on the DesignQA benchmark and answers multimodal questions over engineering documents without full rulebook ingestion.

Abstract

Engineering rulebooks and technical standards contain multimodal information such as dense text, tables, and illustrations that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) a Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) a High-Reasoning LLM mode for complex multimodal questions, and (iv) a Self-Consistency decision step to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. Furthermore, this work establishes and compares two routing approaches, a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to the most suitable pipeline. Evaluation on the DesignQA benchmark shows that the system achieves a relative gain of 41.1% in average accuracy across all tasks over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks obtained without complete rulebook ingestion. This shows how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.

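2604.07821 2026-06-08 cs.MA cs.AI cs.CL Updated version

More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

Advait Yadav, Sid Black, Oliver Sourbut

Affiliations * GitHub

AI summary Investigates why LLMs fail at zero-cost collaboration in multi-agent systems. Using an environment stripped of strategic complexity, the study finds that more capable models (such as o3) can cooperate worse, separates competence failures from active information withholding, and proposes targeted interventions.

Comments Accepted to the ICML 2026 main conference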

Abstract

Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation fails. Many real-world coordination problems are not social dilemmas: helping others -- sharing documentation, unblocking a teammate -- costs the helper almost nothing while producing substantial collective benefit. Whether LLM agents cooperate in this regime, where helping is free and they are explicitly instructed to do so, remains unknown. We build a turn-based multi-agent environment that strips away all strategic complexity, making cooperation costless and trivially optimal. Across eight widely used LLMs, capability does not predict cooperation: OpenAI o3 reaches only 17% of optimal collective performance while the weaker o3-mini reaches 50%, despite identical instructions to maximize group revenue. Using a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, and find that several capable models actively withhold information despite gaining nothing from withholding. Targeted interventions address each mode: explicit protocols roughly double the performance of competence-limited models, while small sharing incentives unlock cooperation-limited ones. Our results suggest that scaling intelligence alone will not solve coordination in multi-agent systems, and will require deliberate cooperative design, even when helping costs nothing.

2604.05360 2026-06-08 cs.HC cs.AI Updated version

OGA-AID: Clinician-in-the-loop AI Report Drafting Assistant for Multimodal Observational Gait Analysis in Post-Stroke Rehabilitation

Khoi T. N. Nguyen, Nghia D. Nguyen, Hui Yu Koh, Patrick W. H. Kwong, Karen Sui Geok Chua, Ananda Sidarta, Baosheng Yu

Affiliations * Rehabilitation Research Institute of Singapore, Nanyang Technological University, Singapore; Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore; The Grainger College of Engineering, University of Illinois Urbana-Champaign, United States; Department of Rehabilitation Sciences, The Hong Kong Polytechnic University, Hong Kong; VinUni-Illinois Smart Health Center, VinUniversity, Vietnam; Institute of Rehabilitation Excellence, Tan Tock Seng Hospital, NHG Health, Singapore

AI summary Proposes OGA-AID, a clinician-in-the-loop multi-agent LLM system that coordinates three specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured gait assessment reports. It outperforms single-pass multimodal baselines on real patient data and shows that AI-assisted analysis and human clinical judgment are complementary.

Comments 2026 CV4Clinic CVPR Workshop Proceedings

Abstract

Gait analysis is essential in post-stroke rehabilitation but remains time-intensive and cognitively demanding, especially when clinicians must integrate gait videos and motion-capture data into structured reports. We present OGA-AID, a clinician-in-the-loop multi-agent large language model system for multimodal report drafting. The system coordinates 3 specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured assessments. Evaluated with expert physiotherapists on real patient data, OGA-AID consistently outperforms single-pass multimodal baselines with low error. In clinician-in-the-loop settings, brief expert preliminary notes further reduce error compared to reference assessments. Our findings demonstrate the feasibility of multimodal agentic systems for structured clinical gait assessment and highlight the complementary relationship between AI-assisted analysis and human clinical judgment in rehabilitation workflows.

2604.04226 2026-06-08 cs.MA cs.AI Updated version

SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for Agentic Web

Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui, Ziheng Zhang, Guangtao Zeng, Wenzheng Tang, Yikun Wang, Yuanjian Zhou, Zimian Peng, Yong Yu, Weiwen Liu, Hiroki Kobayashi, Weinan Zhang

Affiliations * Shanghai Jiao Tong University; The University of Tokyo; Huazhong University of Science and Technology; Shanghai Innovation Institute; Nankai University; Singapore University of Technology and Design; Queen’s University; Fudan University; Zhejiang University

AI summary Proposes SW-$A^2$-Bench, the first benchmark for software agent generation, in which coding agents automatically convert code repositories into autonomous software agents. It evaluates the faithfulness and interoperability of the generated agents, with the aim of scaling the Agentic Web.

Abstract

The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accomplish user goals. However, the capacity of the Agentic Web is still limited by an insufficient population of autonomous software agents, which has become a crucial challenge for scaling it. To alleviate this, we study the task of automatically converting existing code repositories into autonomous software agents via coding agents, decompose the process into critical stages, and identify key technical hurdles. To systematically evaluate this capability, we propose SoftWare Agent generation for Agentic Web Bench (SW-$A^2$-Bench), the first benchmark designed for software agent generation. SW-$A^2$-Bench evaluates not only whether software agents can be generated, but also whether the generated agents are faithful to the source repositories and interoperable with other agents in multi-agent workflows. Our experiments demonstrate that our approach effectively activates the functional capabilities of code repositories and enables interoperable multi-agent collaboration in the Agentic Web. We believe that this work will provide a standardized evaluation for software agent generation and will contribute to scaling the capacity of the Agentic Web.

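2603.04982 2026-06-08 cs.CY cs.AI cs.HC Updated version

Training for Technology: Adoption and Productive Use of Generative AI in Legal Analysis

Benjamin M. Chen, Hong Bao

Affiliations * University of Hong Kong Faculty of Law

AI summary A randomized experiment finds that law students given untrained access to a large language model perform worse, while brief training significantly raises both adoption and scores, indicating that productive use of generative AI depends on training.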

Abstract

Can targeted user training unlock the productive potential of generative artificial intelligence in professional settings? We study this question using a randomized experiment in which 164 law students completed an issue-spotting examination under one of three conditions: no GenAI access, optional access to a large language model (LLM), or LLM access with a brief training intervention. Untrained LLM access proved counterproductive: relative to participants without any LLM access, untrained users wrote significantly shorter answers, committed more case misstatements, and scored marginally lower, though most differences fall short of conventional significance. Training reversed this pattern. Trained participants adopted the LLM at higher rates (41% vs. 26%; p = 0.044), scored 0.27 grade points higher than untrained users--roughly one fine grade--(p = 0.027), and stated applicable rules more accurately (p = 0.014). Principal stratification analysis suggests training operates primarily through adoption rather than effectiveness--the adoption lower bound (1.06) exceeds the effectiveness upper bound (0.42) at strict mean dominance--though confidence intervals are wide. More broadly, these findings challenge the view that GenAI primarily benefits lower-skilled workers: without training, higher-ability practitioners opt out while lower-ability users adopt but unproductively. Realizing GenAI's productivity gains requires investment in both access and instruction.

2603.22327 2026-06-08 cs.IR cs.AI cs.DL Updated version

Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr Błaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H. Torr, Jakob Foerster, Scott A. Hale, Thomas Rawson, Anne Cori, Elizaveta Semenova, Adam Mahdi

Affiliations * University of Oxford; Imperial College London; University of Nottingham; Snowflake AI Research; Independent

AI summary Introduces AgentSLR, an evaluation harness with an automated workflow and an expert-annotated dataset, to test LLM capabilities across the stages of epidemiological systematic reviews. No model leads across the board, and structured extraction is the main bottleneck.

Abstract

Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts. The harness evaluates each review stage as a separate unit with dedicated metrics enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, showing sub-task specialisation often hidden by aggregate benchmarks. Structured data extraction is a major bottleneck, with no model exceeding an average field-level F1 of 0.67. Estimated costs vary substantially, by up to 96 times across evaluated models. Documented failure modes suggest that the evaluated models are not yet reliable enough for unsupervised deployment in epidemiology, where findings can inform public policy.
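
The field-level F1 figure quoted above can be illustrated with a small Python sketch. The definition used here (micro-averaged F1 over exact-match field and value pairs of a single record), the `field_level_f1` helper, and the placeholder record are assumptions for illustration; the abstract does not state how AgentSLR computes the metric.

```python
def field_level_f1(predicted, reference):
    """F1 over exact-match (field, value) pairs of one extracted record."""
    pred_items = {(k, v) for k, v in predicted.items() if v is not None}
    ref_items = {(k, v) for k, v in reference.items() if v is not None}
    true_pos = len(pred_items & ref_items)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_items)  # share of extracted fields that are right
    recall = true_pos / len(ref_items)      # share of reference fields that were found
    return 2 * precision * recall / (precision + recall)

# Placeholder record, not taken from the paper's dataset.
reference = {"pathogen": "pathogen-A", "parameter": "R0", "value": "1.8"}
predicted = {"pathogen": "pathogen-A", "parameter": "R0", "value": "2.1", "year": "2014"}

score = field_level_f1(predicted, reference)
print(round(score, 4))
```

In this example two of the four predicted fields match two of the three reference fields, so precision is 1/2, recall is 2/3, and F1 is 4/7.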

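2603.20990 2026-06-08 cs.IR cs.AI Updated version

$\mathrm{ECI}_{\mathrm{sem}}$: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives

Aarush Sinha, Rahul Seetharaman, Aman Bansal

Affiliations * Department of Computer Science and Engineering, Indian Institute of Technology (IIT), Kharagpur, India

AI summary Proposes $\mathrm{ECI}_{\mathrm{sem}}$, a training-free diagnostic that ranks candidate hard-negative sources using frozen target-encoder embeddings. On MS MARCO negative sources its rankings match the strongest aggregate BEIR transfer results, and they remain stable under sample-size, temperature, tokenizer, and IDF-corpus perturbations.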

Abstract

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose $\mathrm{ECI}_{\mathrm{sem}}$, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. $\mathrm{ECI}_{\mathrm{sem}}$ is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. $\mathrm{ECI}_{\mathrm{sem}}$ builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family $\mathrm{ECI}_{\mathrm{sem}}$ ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.

2603.20967 2026-06-08 stat.ML cs.LG math.ST stat.TH Updated version

Hard labels sampled from sparse targets mislead rotation invariant algorithms

Avrajit Ghosh, Bin Yu, Manfred Warmuth, Peter Bartlett

Affiliations * University of California, Berkeley; University of Wisconsin, Madison

AI summary For binary classification with sparse targets, proves that rotation-invariant algorithms (such as gradient descent on the logistic loss) have an excess-risk lower bound of $\Omega((d-1)/n)$, whereas a non-rotation-invariant algorithm that reparameterizes the weights as $u_i v_i$ achieves an upper bound of $O(s \log d / n)$.

Journal ref ICML-2026

Abstract

One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $\sigma(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e. the number of samples $n$ exceeds the input dimension $d$) with examples $(\mathbf{x}_i,\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$, it is sufficient to recover $\mathbf{w}^{\star}$ and hence achieve the Bayes risk. However, we prove that when the examples are labeled by hard labels $y_i$ sampled from the same conditional distribution $\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, then rotation-invariant algorithms are provably suboptimal: they incur an excess risk $\Omega\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation-invariant algorithms with excess risk $O\!\left(\frac{s\log d}{n}\right)$. The simplest rotation-invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bound uses gradient descent on the weights $u_i,v_i$, where now the linear weight $w_i$ is reparameterized as $u_iv_i$.
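
The reparameterized update mentioned at the end of the abstract can be written out with the chain rule. The following is a sketch under assumptions not stated in the abstract: a step size $\eta$, the empirical logistic loss $L$, and an index $j$ for weight coordinates (the abstract uses $i$ for both examples and coordinates).

$$
L(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + e^{-y_i \mathbf{x}_i^{\top}\mathbf{w}}\right), \qquad w_j = u_j v_j,
$$

$$
\frac{\partial L}{\partial w_j} = -\frac{1}{n}\sum_{i=1}^{n} y_i\, x_{ij}\, \sigma\left(-y_i \mathbf{x}_i^{\top}\mathbf{w}\right), \qquad \frac{\partial L}{\partial u_j} = v_j \frac{\partial L}{\partial w_j}, \qquad \frac{\partial L}{\partial v_j} = u_j \frac{\partial L}{\partial w_j},
$$

$$
u_j \leftarrow u_j - \eta\, v_j \frac{\partial L}{\partial w_j}, \qquad v_j \leftarrow v_j - \eta\, u_j \frac{\partial L}{\partial w_j}.
$$

Each coordinate's step is scaled by the current magnitude of its partner factor, so the update treats coordinates individually and is not invariant under rotations of the input, unlike plain gradient descent on $\mathbf{w}$.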

2603.13428 2026-06-08 cs.SE cs.AI Updated version

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

Gangda Deng, Zhaoling Chen, Zhongming Yu, Haoyang Fan, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Rajgopal Kannan, Le Cong, Mengdi Wang, Qian Zhang, Viktor Prasanna, Xiangru Tang, Xingyao Wang

Affiliations * USC; UCR; UCSD; Army Research Office; Stanford; Princeton; Haven OpenHands

AI summary Existing benchmarks ignore the temporal dependencies and technical debt of software evolution. EvoClaw addresses this by reconstructing verifiable Milestone DAGs from commit logs and evaluating whether AI agents can sustain system integrity and limit error accumulation during continuous development.

Comments ICML 2026

Abstract

With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline that reconstructs verifiable Milestone DAGs from noisy commit logs, where milestones are defined as functionally cohesive development goals. These executable sequences enable EvoClaw, a novel benchmark that requires agents to sustain system integrity and limit error accumulation, dimensions of long-term software evolution largely missing from current benchmarks. Our evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly from >80% on isolated tasks to at most 38% in continuous settings, exposing agents' profound struggle with long-term maintenance and error propagation.

2512.00883 2026-06-08 cs.MM cs.CV cs.SD Updated version

Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents

Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao, Shijie Cheng

Affiliations * Tsinghua University; Beijing Institute of Technology

AI summary Proposes a unified Audio-Visual World Model (AVWM) formulation and AV-CDiT, a conditional diffusion Transformer that jointly predicts binaural audio and visual dynamics. It achieves high-fidelity multimodal prediction on the 30-hour AVW-4k benchmark and is validated on embodied navigation.

Abstract

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.

2602.04894 2026-06-08 cs.CR cs.AI Updated version

Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software

Tomer Kordonsky, Amit LeVi, Maayan Yamin, Noam Benzimra, Avi Mendelson

Affiliations * Technion - Israel Institute of Technology

AI summary Proposes the Feature-Security Table (FSTab), which predicts backend vulnerabilities from frontend features in a black-box attack and quantifies how consistently a model reproduces vulnerabilities across programs, rephrasings, and domains. Experiments show cross-domain attack success of up to 94%.

Comments ICML 2026, Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD)

Abstract

LLMs are increasingly used for code generation, but their outputs often follow recurring templates that can induce predictable vulnerabilities. We study vulnerability persistence in LLM-generated software and introduce Feature--Security Table (FSTab) with two components. First, FSTab enables a black-box attack that predicts likely backend vulnerabilities from observable frontend features and knowledge of the source LLM, without access to the backend or source code. Second, FSTab provides a model-centric evaluation that quantifies how consistently a model reproduces the same vulnerabilities across programs, semantics-preserving rephrasings, and application domains. We evaluate FSTab on state-of-the-art code LLMs, including GPT-5.2, Claude-4.5 Opus, and Gemini-3 Pro, across diverse application domains. Our results show strong cross-domain transfer: even when the target domain is excluded from training, FSTab achieves up to 94% attack success and 93% vulnerability coverage on Internal Tools (Claude-4.5 Opus). These findings expose an underexplored attack surface in LLM-generated software and highlight the security risks of code generation. Our code is available at https://github.com/fstabicml2026/FSTab

2509.11208 2026-06-08 stat.ML cs.LG Updated version

Predictable Compression Failures: Order Sensitivity and Information Budgeting for Evidence-Grounded Binary Adjudication

Leon Chlon, Ahmed Karim, Maggie Chlon, MarcAntonio Awada

Affiliations * GitHub

AI summary Studies how evidence order affects Transformer-based binary adjudication, introduces the QMV bound and the EDFL law, and uses an Information Sufficiency Ratio gate to make answer/abstain decisions at low hallucination rates.

Abstract

Transformers used for evidence-grounded binary adjudication (e.g., support/refute, yes/no, or verifier-backed pass/fail decisions) can be sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers under a verifier-relative Bernoulli predicate. We treat evidence order as a nuisance variable and formalize an expectation-realization gap: next-token training can minimize expected conditional description length over orderings while a fixed ordering remains position-sensitive. Our Quantified Martingale Violation (QMV) bound predicts the dispersion induced by adjacent-rank positional sensitivity, with $O(\log n)$ growth in the harmonic regime; our Expectation-level Decompression Law (EDFL) specializes a KL convexity/data-processing bound to Bernoulli predicates, yielding Bits-to-Trust (B2T), Risk-of-Hallucination (RoH), and an Information Sufficiency Ratio (ISR) gate for answer/abstain decisions. On 3,059 grounded items from FEVER, HotpotQA, NQ-Open, PopQA, and Controls, we observe logarithmic dispersion and positive Jensen gains from uniform permutation mixtures. In one pre-specified held-out audit (528 items), the analytically fixed ISR$=1$ gate attains 0.0-0.7% hallucination with 20.6-27.9% abstention (95% CIs), supporting the operating point without claiming universal calibration across all model families or unrestricted generation.

2602.16908 2026-06-08 cond-mat.mtrl-sci cs.LG quant-ph Updated version

Multi-objective optimization and quantum hybridization of equivariant deep learning interatomic potentials

G. Laskaris, D. Morozov, D. Tarpanov, A. Seth, J. Procelewska, G. Sai Gautam, A. Sagingalieva, R. Brasher, A. Melnikov

Affiliations * Terra Quantum AG; LIACS, Leiden University; Nanoscience Center and Department of Chemistry, University of Jyväskylä; Department of Materials Engineering, Indian Institute of Science; Schaeffler Technologies AG & Co. KG

AI summary To handle the trade-off between accuracy and inference time in the Allegro model, the authors apply multi-objective hyperparameter optimization and design two variants, one with extra classical layers and one with quantum-classical hybrid layers. Evaluations on several datasets show the hybrid variant's advantage in force prediction accuracy.

Comments 15 pages, 7 figures, 6 tables

Journal ref Comput. Mater. Sci. 270, 114742 (2026)

Abstract

Allegro is a machine learning interatomic potential model designed to predict atomic properties in molecules using E(3) equivariant neural networks. When training this model, there tends to be a trade-off between accuracy and inference time. For this reason, we apply multi-objective hyperparameter optimization to both objectives. Additionally, we experiment with modified architectures by constructing variants of Allegro: one extended with additional classical layers and one incorporating quantum-classical hybrid layers. We evaluate all models on QM9, rMD17-aspirin, rMD17-benzene, and a self-generated dataset of copper-lithium structures. In our results, both variants surpass Allegro in force prediction accuracy across multiple datasets. The classical variant consistently improves over the baseline, while the quantum-classical hybrid variant achieves the best overall force prediction accuracy on the Cu-Li dataset, where it was fully optimized, outperforming the classical variant by approximately 13%. Notably, the hybrid variant also achieves competitive results on the remaining datasets despite using hyperparameters transferred from Cu-Li without dataset-specific optimization, suggesting that quantum-classical hybridization is a promising direction for enhancing MLIP architectures.
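
The accuracy versus inference-time trade-off mentioned in the abstract is typically resolved by keeping only non-dominated configurations. The sketch below shows a minimal Pareto filter over two objectives that are both minimized. The trial names and numbers are invented for illustration and are not results from the paper, and the paper's actual optimizer is not specified here.

```python
def pareto_front(candidates):
    """Return the candidates that are not dominated on (error, latency).

    Both objectives are minimized. Candidate `o` dominates `c` when it is
    no worse on both objectives and strictly better on at least one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o["error"] <= c["error"]
            and o["latency"] <= c["latency"]
            and (o["error"] < c["error"] or o["latency"] < c["latency"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    # Sort by latency so the trade-off curve reads from fast to accurate.
    return sorted(front, key=lambda c: c["latency"])


# Hypothetical hyperparameter trials (illustrative values only).
trials = [
    {"name": "A", "error": 0.050, "latency": 1.0},
    {"name": "B", "error": 0.030, "latency": 2.5},
    {"name": "C", "error": 0.045, "latency": 3.0},  # dominated by B
    {"name": "D", "error": 0.020, "latency": 6.0},
    {"name": "E", "error": 0.060, "latency": 1.2},  # dominated by A
]

front_names = [c["name"] for c in pareto_front(trials)]
print(front_names)  # ['A', 'B', 'D']
```

Each point left on the front represents a different compromise; choosing among them is a deployment decision rather than something the filter can settle.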

2602.15084 2026-06-08 physics.plasm-ph cs.AI cs.LG Version update

TokaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics

Tobia Boschi, Andrea Loreti, Nicola C. Amorisco, Rodrigo H. Ordonez-Hurtado, Cécile Rousseau, George K. Holt, Eszter Székely, Alexander Whittle, Samuel Jackson, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Vassil Alexandrov, Mykhaylo Zayats

Affiliations: IBM Research; UK Atomic Energy Authority; STFC Hartree Centre

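AI summary: Presents TokaMind, described by the authors as, to their knowledge, the first open-source foundation model for tokamak plasma dynamics. It is a multi-modal Transformer pretrained on the MAST dataset, supports multiple data modalities and missing-signal handling, and after fine-tuning outperforms the strongest baseline on all but one of the 14 TokaMark tasks.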

Details
Abstract

We present TokaMind, to our knowledge the first open-source foundation model for tokamak plasma dynamics, based on a Multi-Modal Transformer (MMT) and pretrained on heterogeneous diagnostics from the publicly available MAST dataset. TokaMind supports multiple data modalities (time-series, 2D profiles, and videos) with different sampling rates, robust missing-signal handling, and efficient task adaptation via selectively loading and freezing four model components. To represent multi-modal signals, we use a lightweight fixed-basis Discrete Cosine Transform embedding (DCT3D) and provide a clean interface for alternative embeddings (e.g., Variational Autoencoders). We evaluate TokaMind on the recently introduced MAST benchmark TokaMark, which comprises 14 tasks with heterogeneous reconstruction and forecasting objectives. Our results show that fine-tuned TokaMind outperforms the strongest benchmark baseline on all but one task. Compared with training the same architecture from scratch under a matched epoch budget, warm-start adaptation is most beneficial on demanding downstream settings, including long-horizon forecasting and high-dimensional equilibrium objectives. These findings highlight the value of multi-modal pretraining for tokamak plasma dynamics and provide a practical, extensible foundation for future fusion modeling tasks. Training code and model weights are publicly available at github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamind and huggingface.co/UKAEA-IBM-STFC, respectively.
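
To make the idea of a fixed-basis Discrete Cosine Transform embedding concrete, the sketch below projects a signal onto the first few orthonormal DCT-II modes and reconstructs it. It is a one-dimensional illustration only; the function names, the test signal, and the number of retained coefficients are assumptions and do not reproduce the paper's DCT3D embedding.

```python
import math


def dct_basis(n):
    """Orthonormal DCT-II basis: row k is the k-th cosine mode at n sample points."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)])
    return rows


def embed(signal, n_coeffs):
    """Project a signal onto the first n_coeffs DCT modes (no trainable parameters)."""
    basis = dct_basis(len(signal))
    return [sum(b * x for b, x in zip(basis[k], signal)) for k in range(n_coeffs)]


def reconstruct(coeffs, n):
    """Map coefficients back to n samples; exact when all n coefficients were kept."""
    basis = dct_basis(n)
    return [sum(c * basis[k][t] for k, c in enumerate(coeffs)) for t in range(n)]


# Illustrative smooth signal (an assumption, not MAST data).
signal = [math.sin(2 * math.pi * t / 32) + 0.1 * t for t in range(32)]

full = reconstruct(embed(signal, 32), 32)     # keep every mode
compact = reconstruct(embed(signal, 8), 32)   # keep only the 8 lowest modes

max_err_full = max(abs(a - b) for a, b in zip(signal, full))
max_err_compact = max(abs(a - b) for a, b in zip(signal, compact))
print(f"max error, all 32 modes: {max_err_full:.2e}")
print(f"max error, 8 modes:      {max_err_compact:.2e}")
```

Because the basis is fixed, the embedding needs no training and can be inverted directly, which is one plausible reason to call such an embedding lightweight.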

2602.06245 2026-06-08 stat.ML cs.LG Version update

Inheritance Between Feedforward and Convolutional Networks via Model Projection

Nicolas Ewen, Jairo Diaz-Rodriguez, Kelly Ramsay

Affiliations: Department of Mathematics and Statistics

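AI summary: Introduces inheritance between model classes, proves that generalized feedforward networks form a strict subset of generalized convolutional networks, and uses model projection to recover a restricted reverse inheritance path that supports parameter-efficient transfer learning.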

Details
Abstract

Neural-network techniques are often transferred across architecture families by analogy, but such transfer is valid only when the assumptions required by a technique are preserved. We introduce this idea as inheritance between model classes. Using a unified node-level framework with tensor-valued activations, we prove that generalized feedforward networks (GFFNs) form a strict subset of generalized convolutional networks (GCNNs), so GCNN properties transfer directly to GFFNs. The reverse direction is not automatic: standard CNN nodes use spatial kernels, while FFN nodes use one scalar weight per input contribution. We introduce model projection to recover a restricted reverse inheritance path. Projection freezes each convolutional input-channel sub-function and learns one scalar coefficient for each input-output channel contribution, giving projected CNN nodes the GFFN-style trainable structure of scalar-weighted input recombination. This inherited structure leads naturally to parameter-efficient transfer learning. Across multiple ImageNet-pretrained CNN backbones and downstream image-classification datasets, model projection is competitive with standard and PEFT baselines and provides an effective initialization for subsequent full fine-tuning.
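
The projection step described in the abstract can be sketched for a single one-dimensional convolutional node. Each per-channel convolution is frozen, and the only trainable quantities are one scalar per input-output channel pair. This is a minimal reading of the abstract, not the authors' implementation: the function names, the 1D setting, the integer toy values, and the choice to start every coefficient at 1 are assumptions.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation of a list with a kernel."""
    k = len(kernel)
    return [
        sum(signal[t + j] * kernel[j] for j in range(k))
        for t in range(len(signal) - k + 1)
    ]


def conv_node(inputs, kernels):
    """Standard node: out[o] = sum_i conv(inputs[i], kernels[o][i])."""
    outputs = []
    for row in kernels:
        maps = [conv1d(x, k) for x, k in zip(inputs, row)]
        outputs.append([sum(m[t] for m in maps) for t in range(len(maps[0]))])
    return outputs


def projected_node(inputs, frozen_kernels, coeffs):
    """Projected node: out[o] = sum_i coeffs[o][i] * conv(inputs[i], frozen_kernels[o][i]).

    The kernels stay fixed; only the scalars in `coeffs` would be trained.
    """
    outputs = []
    for o, row in enumerate(frozen_kernels):
        # Frozen sub-functions: one feature map per input channel.
        maps = [conv1d(x, k) for x, k in zip(inputs, row)]
        # Trainable part: scalar-weighted recombination of those maps.
        outputs.append(
            [sum(a * m[t] for a, m in zip(coeffs[o], maps)) for t in range(len(maps[0]))]
        )
    return outputs


# Toy node with 2 input channels, 2 output channels, kernel width 3.
inputs = [[1, 2, 3, 4, 5], [0, 1, 0, -1, 0]]
kernels = [[[1, 0, -1], [1, 1, 1]], [[0, 1, 0], [2, 0, 0]]]

ones = [[1, 1], [1, 1]]
baseline = conv_node(inputs, kernels)
at_init = projected_node(inputs, kernels, ones)          # matches the pretrained node
reweighted = projected_node(inputs, kernels, [[2, 0], [0, 1]])

n_conv_params = sum(len(k) for row in kernels for k in row)  # 12 kernel weights
n_proj_params = sum(len(row) for row in ones)                # 4 scalar coefficients
print(baseline, at_init, reweighted, n_conv_params, n_proj_params)
```

With all coefficients equal to 1 the projected node reproduces the original convolution, so training can start from the pretrained behaviour while updating far fewer parameters.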

2602.01177 2026-06-08 quant-ph cs.IT cs.LG math.IT Version update

Privacy Implies Stability: Information-Theoretic Generalization Bounds for Quantum Learning

Ayanava Dasgupta, Naqueeb Ahmad Warsi, Masahito Hayashi

Affiliations: Indian Statistical Institute, Kolkata; School of Data Science, The Chinese University of Hong Kong, Shenzhen; International Quantum Academy, Futian District, Shenzhen; Graduate School of Mathematics, Nagoya University

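AI summary: Proposes an information-theoretic framework connecting stability, privacy, and generalization for quantum learning algorithms, shows that quantum differential privacy directly yields generalization guarantees, and finds that quantum non-orthogonality makes information-theoretic admissibility compatible with privacy.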

Comments 36 pages, 3 figures; The introduction has been substantially rewritten to provide better context, and certain proofs have been relocated from the appendices to the main body of the paper; The core mathematical framework and technical results remain unchanged

Details
Abstract

We develop an information-theoretic framework connecting stability, privacy, and generalization for quantum learning algorithms. Learning procedures are modeled as quantum instruments with classical-quantum outputs, and losses are represented by observables. We prove that under a classical-quantum sub-Gaussian condition, an information-theoretic stability measure controls the expected generalization error. Furthermore, we establish a high-probability generalization bound using quantum Rényi divergences to manage higher-order dependencies under non-commutativity. In the trusted Data Processor setting, quantum differential privacy (QDP) provides a mechanism for stability. We show that one-neighbor QDP strictly bounds the information leaked by the classical-quantum output. Combining this with our stability theorem yields a direct privacy-to-generalization guarantee. We also explore an untrusted Data Processor setting. Here, output privacy alone is insufficient since an adversarial processor could perform a highly informative procedure before applying noisy post-processing. To combat this, we introduce Information-Theoretic Admissibility (ITA), a certification condition ensuring the prescribed procedure is not just a degraded version of a strictly more informative, physically allowed operation on the encoded ensemble. We prove a fundamental separation: while admissibility and privacy are in strong tension in classical models, quantum non-orthogonality makes them compatible. A quantum measurement can be ITA - exhausting all relevant accessible information - without perfectly recovering the classical dataset. We illustrate this separation through a concrete quantum ITA example.