arXivDaily: daily arXiv digest, updated Monday through Friday
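2605.17561 2026-06-08 cs.SE cs.AI cs.MA Updated version

Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports

Mahmut Furkan Gon, Emre Dinc, Tevfik Emre Sungur, Eray Tuzun

Affiliations: Department of Computer Engineering, Bilkent University

AI summary: This study introduces a standardized, root-cause-oriented taxonomy for subclassifying invalid bug reports and experimentally tests how accurately different approaches perform invalid subclassification and no-code fix generation. It also analyzes how different configurations perform on a gold-standard benchmark created by the authors.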

Comments
Submitted to IEEE Transactions on Software Engineering (TSE) and currently under review

Abstract

Issues encountered when using software are reported as bug reports. However, many bug reports are invalid, meaning they do not require code changes and are resolved with a no-code fix. Manually determining the root cause of invalid bug reports and having customer support provide actionable resolutions wastes considerable resources. Our goal is to introduce a standardized taxonomy for root-cause-oriented invalid bug report subclassification and to run experiments that test the accuracy of various approaches on invalid subclassification and no-code fix generation. We study how different configurations perform on a gold-standard benchmark we have created. Using this manually curated benchmark for higher-quality analysis, we experimented with vanilla LLMs, retrieval-augmented generation (RAG), and agentic web search to identify invalid subclasses and generate no-code fixes. We evaluated the results against manually labeled ground-truth data that includes the invalid subclass and no-code fixes from the original bug reports. We measured subclass detection performance with weighted F1-score and assessed no-code fix suggestions using BERTScore and Judge LLM success rates. For subclassification, RAG achieves the highest overall performance with 0.66 weighted F1, slightly outperforming vanilla LLMs at 0.65 and agentic web search at 0.64. At the subclass level, performance peaks at 0.85 F1 for Non-reproducibility and 0.79 for Feature Request and Question, while Wrong Version remains the most challenging, with scores between 0.00 and 0.29. For no-code fix generation, agentic web search achieves the highest overall Judge LLM success rate at 68.9%, compared to 64.4% for RAG and 64.9% for vanilla LLMs, with subclass-level peaks of 87.4% for Working as Designed and 72.2% for Question.
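The weighted F1 metric named in this abstract can be illustrated with a small sketch. It averages per-class F1 scores, weighting each class by its number of true instances. The labels below are toy values chosen for illustration, not data from the benchmark, and the helper name `weighted_f1` is hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)          # true instances per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1  # weight by class support
    return score

# Toy labels (hypothetical, not taken from the paper's benchmark).
y_true = ["Question", "Question", "Wrong Version", "Feature Request"]
y_pred = ["Question", "Question", "Question", "Feature Request"]
print(round(weighted_f1(y_true, y_pred), 2))
```

Because each class is weighted by its support, a rare class that is never predicted correctly (here the single Wrong Version report) lowers the overall score only in proportion to its frequency.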

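2605.13268 2026-06-08 quant-ph cs.LG Updated version

Physics Guided Generative Optimization for Trotter Suzuki Decomposition

WenBin Yan

Affiliations: University of Colorado Boulder

AI summary: Proposes P-GONE, which combines a conditional diffusion model, a graph neural network, and REINFORCE fine-tuning to jointly optimize term grouping, order, and time-step allocation in Trotter-Suzuki decomposition, achieving about 19.4× circuit-depth compression at fidelity ≥ 0.95.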

Abstract

Trotter-Suzuki product formulas are the standard route to Hamiltonian evolution on noisy intermediate-scale quantum (NISQ) hardware, but their accuracy depends on three coupled choices: term grouping, product-formula order, and time-step allocation. Grouping and order are discrete, which makes direct gradient optimization infeasible and forces existing compilers to rely on static heuristics. We describe P-GONE, a method that combines a conditional diffusion model (D3PM + DDPM), a graph neural network (GNN) encoder, and closed-loop REINFORCE fine-tuning to jointly learn grouping, order, and time-step optimization over a mixed discrete-continuous space. Under fidelity-matched conditions ($F \geq 0.95$), the method achieves circuit depth 86 versus 1673 for Qiskit fourth-order (ungrouped, Suzuki-4), about $19.4\times$ compression, and 141 for Paulihedral (first-order Trotter), about $1.6\times$ compression. At $T=0.90$ the method also beats the Qiskit group-commuting teacher (65 vs 103, $1.6\times$ compression), though at $T=0.95$ the teacher still leads -- a stratified pattern that points toward fidelity-aware fine-tuning. Under a standard depolarizing noise model, the method achieves noisy fidelity roughly $2\times$ the Qiskit fourth-order baseline (0.743 vs 0.380). Ablation shows a clear hierarchy: order learning $>$ time allocation $>$ grouping. Best-of-N sampling ($N=32$ is a practical sweet spot) and CFG guidance give flexible fidelity-depth trade-offs at inference. The method works well on structured Hamiltonians (TFIM, Heisenberg), but random Pauli Hamiltonians fail entirely at $T \geq 0.95$ -- a boundary that defines where the method applies.
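For readers unfamiliar with the "order" choice mentioned in this abstract, the standard textbook product formulas for a two-term Hamiltonian $H = A + B$ are sketched below. The symbols $A$, $B$, $n$ (number of steps), and $\tau$ (step size) are notation assumed here for illustration and are not taken from the paper.

```latex
% First-order (Lie-Trotter) formula with n steps:
e^{-iHt} = \left( e^{-iAt/n}\, e^{-iBt/n} \right)^{n} + O\!\left(\frac{t^{2}}{n}\right)

% Second-order (symmetric) formula:
S_2(\tau) = e^{-iA\tau/2}\, e^{-iB\tau}\, e^{-iA\tau/2},
\qquad
e^{-iHt} = S_2(t/n)^{n} + O\!\left(\frac{t^{3}}{n^{2}}\right)

% Fourth-order Suzuki formula, built recursively from S_2:
S_4(\tau) = S_2(p\tau)^{2}\, S_2\big((1-4p)\tau\big)\, S_2(p\tau)^{2},
\qquad
p = \frac{1}{4 - 4^{1/3}}
```

Higher orders reduce the error per step but multiply the number of exponentials, which is consistent with the depth gap the abstract reports between the ungrouped fourth-order baseline and the learned configuration.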

2605.10792 2026-06-08 math.OC cs.LG Updated version

Implicit Neural Optimal Transport via Fixed-Point Optimization

Yesom Park, Eric Gelphman, Stanley Osher, Samy Wu Fung

Affiliations: Department of Mathematics, University of California, Los Angeles; Department of Applied Mathematics and Statistics, Colorado School of Mines

AI summary: Proposes an implicit neural formulation of optimal transport that avoids adversarial training by using a single potential function and a proximal fixed-point problem, yielding a stable, efficient single-network framework that recovers both forward and backward transport maps.

Comments
37 pages, submitted to SIAM Journal on Mathematical Data Science (currently under review)

Abstract

We propose an implicit neural formulation of optimal transport that eliminates adversarial min-max optimization and multi-network architectures commonly used in existing approaches. Our key idea is to parameterize a single potential in the Kantorovich dual and reformulate the associated c-transform as a proximal fixed-point problem. This yields a stable single-network framework in which dual feasibility is enforced exactly through proximal optimality conditions rather than adversarial training. Despite the inner fixed-point computation, gradients can be computed without differentiating through the fixed-point iterations, enabling efficient training without requiring implicit differentiation. We further establish convergence of stochastic gradient descent. The resulting framework is efficient, scalable, and broadly applicable: it simultaneously recovers forward and backward transport maps and naturally extends to class-conditional settings. Experiments on high-dimensional Gaussian benchmarks, physical datasets, and image translation tasks demonstrate strong transport accuracy together with improved training stability and favorable computational and memory efficiency.
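The following is a hedged sketch of how a c-transform can turn into a proximal fixed-point problem. It assumes the quadratic cost $c(x,y) = \tfrac{1}{2}\|x-y\|^2$ and a differentiable potential $\varphi_\theta$; the paper's exact formulation, cost, and sign conventions may differ.

```latex
% Kantorovich dual with a single potential \varphi and its c-transform:
\sup_{\varphi} \; \int \varphi^{c}(x)\, d\mu(x) + \int \varphi(y)\, d\nu(y),
\qquad
\varphi^{c}(x) = \inf_{y} \big\{ c(x,y) - \varphi(y) \big\}

% Assumed quadratic cost: the inner problem is a proximal problem for -\varphi.
\varphi^{c}(x) = \inf_{y} \Big\{ \tfrac{1}{2}\|x - y\|^{2} - \varphi(y) \Big\}

% First-order optimality of the minimizer y^\star gives a fixed-point equation:
y^{\star} = x + \nabla \varphi(y^{\star})

% Envelope (Danskin-type) argument: y^\star is held fixed when differentiating
% in the parameters, so no backpropagation through the iterations is needed.
\nabla_{\theta}\, \varphi_{\theta}^{c}(x) = -\,\nabla_{\theta}\, \varphi_{\theta}(y)\Big|_{y = y^{\star}}
```

Under these assumptions, the last line is one way to read the abstract's claim that gradients are available without differentiating through the fixed-point iterations.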

2605.08717 2026-06-08 cs.SE cs.AI Updated version

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Chenyu Zhao, Shenglin Zhang, Yihang Lin, Wenwei Gu, Zhimin Chen, Yongqian Sun, Dan Pei, Chetan Bansal, Saravan Rajmohan, Minghua Ma

Affiliations: Nankai University; Tsinghua University; Microsoft

AI summary: Proposes the PROBE framework, which converts runtime evidence into structured recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate; across settings such as code repair and workflow recovery it achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate.

Abstract

Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.
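As a quick sanity check on the reported percentages, the sketch below converts them into whole-case counts out of the 257 cases. The counts are inferred from the rounding and are not stated in the abstract. The baseline figures assume the percentage-point gaps are subtracted from PROBE's own rates, and the strongest baseline may be a different system for each metric.

```python
TOTAL_CASES = 257  # initially unresolved cases, as stated in the abstract

def implied_count(rate_percent, total=TOTAL_CASES):
    """Nearest whole number of cases matching a reported percentage."""
    return round(total * rate_percent / 100)

def as_percent(count, total=TOTAL_CASES):
    """Percentage of `total`, rounded to two decimals as in the abstract."""
    return round(100 * count / total, 2)

diagnosed = implied_count(65.37)   # inferred, not reported
recovered = implied_count(21.79)   # inferred, not reported

# Baseline rates implied by the stated percentage-point gaps.
baseline_diagnosis = round(65.37 - 43.58, 2)
baseline_recovery = round(21.79 - 12.45, 2)

print(diagnosed, as_percent(diagnosed))
print(recovered, as_percent(recovered))
print(baseline_diagnosis, baseline_recovery)
```

The gap between the diagnosed and recovered counts is the diagnosis-recovery gap the abstract describes: most correctly diagnosed failures still do not lead to a successful subsequent attempt.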

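2605.06647 2026-06-08 cs.IR cs.AI cs.LG Updated version

Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval

Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

Affiliations: Meta Superintelligence Labs; Rice University

AI summary: Proposes SIRA, which compresses multi-round exploration into a single corpus-discriminative retrieval. It uses an LLM to enrich document vocabulary and to predict vocabulary missing from the query, filters terms with corpus statistics, achieves the strongest average retrieval performance on BEIR benchmarks, and surpasses RL-trained agentic systems on downstream QA tasks.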

Abstract

Retrieval-augmented agents are increasingly the interface to large knowledge bases, yet most treat retrieval as a black box: they issue exploratory queries, inspect snippets, and reformulate until evidence emerges. This resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, causing extra retrieval rounds, latency, and poor recall. We introduce Superintelligent Retrieval Agent (SIRA), which casts superintelligence in retrieval as compressing multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask which terms are relevant; it asks which terms separate the desired evidence from corpus-level confusers. Offline, an LLM enriches each document with missing search vocabulary; at query time, it predicts evidence vocabulary the query omits; and corpus statistics serve as tool calls that filter terms that are absent, overly common, or unlikely to create retrieval margin. The final step is a single weighted BM25 call combining the query with the validated expansion. Across ten BEIR benchmarks, SIRA achieves the strongest average retrieval performance in our comparison, beating dense retrievers, learned sparse retrievers, and LLM search-agent baselines while using no relevance labels or retriever fine-tuning. On downstream QA, its retrieval-only answer coverage exceeds recent RL-trained agentic QA systems on NQ and HotpotQA. We also introduce BrowseComp-Wikipedia, a hard-search benchmark of 232 BrowseComp-derived queries over a 25,587,229-document Wikipedia index. Even without index-time enrichment, using only grounded Wikipedia categories, SIRA outperforms multi-round Perplexity agents at every budget, reaching 9.70% Recall@1, 15.27% Recall@10, and 36.14% Recall@100.
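The final step described in this abstract, a weighted BM25 call over the query plus a filtered expansion, might look roughly like the sketch below. This is not the authors' implementation: the function names, the document-frequency threshold, the expansion weight of 0.5, and the toy corpus are all assumptions made for illustration, and the candidate terms stand in for vocabulary an LLM would predict.

```python
import math
from collections import Counter

def bm25_scores(docs, weighted_terms, k1=1.2, b=0.75):
    """Score tokenized docs against {term: weight} using Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term, weight in weighted_terms.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
            s += weight * idf * norm
        scores.append(s)
    return scores

def filter_expansion(docs, candidates, max_df_ratio=0.5):
    """Drop candidates that are absent from the corpus or too common in it."""
    df = Counter(t for d in docs for t in set(d))
    return [t for t in candidates if 0 < df[t] <= max_df_ratio * len(docs)]

# Toy corpus and query (hypothetical).
docs = [
    "the aspirin trial reduced stroke risk".split(),
    "the trial judge dismissed the case".split(),
    "the acetylsalicylic acid dose lowered clot risk".split(),
    "the weather report for the city".split(),
]
query = {"aspirin": 1.0, "risk": 1.0}
candidates = ["acetylsalicylic", "the", "ibuprofen"]  # stand-in for LLM output

kept = filter_expansion(docs, candidates)
expanded = {**query, **{t: 0.5 for t in kept}}  # 0.5 is an assumed weight
scores_plain = bm25_scores(docs, query)
scores_expanded = bm25_scores(docs, expanded)
```

In this toy case the filter keeps only the rare synonym, which raises the score of the third document without changing the scores of documents that do not contain it.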

2511.02399 2026-06-08 cs.SE cs.AI Updated version

Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework

Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng

Affiliations: Fudan University; Nanyang Technological University

AI summary: Proposes the EvoDev framework, which enables iterative end-to-end software development through feature decomposition, dependency modeling, and context propagation, outperforming Claude Code by 57.3% on Android tasks.

Comments
Accepted by ISSTA 2026

Abstract

Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each feature node in the feature map maintains multi-layer contexts, including business logic, software design, and code implementation, which are propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by 57.3%, while improving single-agent performance by 16.0%-58.5% across different base LLMs, highlighting the importance of feature decomposition, dependency modeling, context propagation, and workflow-aware agent design for end-to-end software development. Moreover, our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
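The Feature Map idea in this abstract, a directed acyclic graph whose nodes pass context along dependency edges, can be sketched as follows. The feature names, the `impl:` note format, and the function `build_order_with_context` are hypothetical. The sketch covers only dependency ordering and context inheritance, not EvoDev's agents or its multi-layer contexts.

```python
from graphlib import TopologicalSorter

# Hypothetical feature map: feature -> set of features it depends on.
feature_map = {
    "login": set(),
    "profile": {"login"},
    "feed": {"login"},
    "share_post": {"feed", "profile"},
}

def build_order_with_context(feature_map):
    """Return (feature, inherited_context) pairs in dependency order."""
    context = {}  # feature -> context notes visible when building it
    plan = []
    for feature in TopologicalSorter(feature_map).static_order():
        inherited = []
        for dep in sorted(feature_map[feature]):
            # A feature sees everything its dependency saw, plus the
            # dependency's own implementation note.
            for note in context[dep] + [f"impl:{dep}"]:
                if note not in inherited:
                    inherited.append(note)
        context[feature] = inherited
        plan.append((feature, inherited))
    return plan

plan = build_order_with_context(feature_map)
for feature, inherited in plan:
    print(feature, inherited)
```

Processing features in topological order guarantees that every dependency's context exists before a dependent feature is developed, which is the property an iterative pipeline of this kind relies on.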

2604.23025 2026-06-08 cs.CR cs.LG 版本更新

Self-Supervised Learning for Android Malware Detection on a Time-Stamped Dataset

基于时间戳数据集的自监督学习安卓恶意软件检测

Annan Fu, Hao Pei, Maryam Tanha

发表机构 * Mastercard Canada(Mastercard加拿大)

AI总结 针对机器学习检测器的时间偏差问题,构建时间戳数据集并采用BYOL自监督预训练,在时间感知评估下达到98%准确率和89%F1分数。

Comments
Accepted for publication in IEEE ICC 2026. © 2026 IEEE

Abstract

Android malware detectors built with machine learning often suffer from temporal bias: models are trained and evaluated without respecting apps' actual release times, inflating accuracy and weakening real-world robustness. We address this by constructing a time-stamped dataset of benign and malicious Android apps and introducing a timestamp-verification procedure to ensure temporal accuracy. We then propose a detection framework that uses Bootstrap Your Own Latent (BYOL) for self-supervised pre-training to learn obfuscation-resilient representations, followed by supervised classification. Under time-aware evaluation, the method attains 98% accuracy and 89% F1. We further characterize malware behavior by analyzing true positives and false negatives using VirusTotal and the MITRE ATT&CK framework. To support reproducibility and further innovation, we release our dataset and source code.
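The temporal bias described in this abstract comes from mixing older and newer apps across the train and test sets. A minimal sketch of a time-aware split is shown below. The record fields, the cutoff date, and the function name are assumptions for illustration and do not reflect the authors' dataset schema.

```python
from datetime import date

def time_aware_split(samples, cutoff):
    """Train on apps released before `cutoff`; test on apps released on or after it."""
    train = [s for s in samples if s["released"] < cutoff]
    test = [s for s in samples if s["released"] >= cutoff]
    return train, test

# Hypothetical records; field names are illustrative only.
samples = [
    {"apk": "app_a", "released": date(2021, 3, 1), "malicious": False},
    {"apk": "app_b", "released": date(2022, 7, 15), "malicious": True},
    {"apk": "app_c", "released": date(2023, 1, 9), "malicious": False},
    {"apk": "app_d", "released": date(2023, 11, 30), "malicious": True},
]

train, test = time_aware_split(samples, cutoff=date(2023, 1, 1))
print([s["apk"] for s in train], [s["apk"] for s in test])
```

Unlike a random split, this keeps every training sample strictly older than every test sample, so the model is never evaluated on apps that predate what it was trained on.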

2604.17948 2026-06-08 cs.CR cs.AI cs.MA Updated version

RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

Parteek Jamwal, Minghao Shao, Boyuan Chen, Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Kashish Satija, Mariam Shafey, Mohamed Mahmoud, Moncif Dahaji Bouffi, Pasindu Wickramasinghe, Siyona Goel, Yaakulya Sabbani, Hakim Hacid, Mthandazo Ndhlovu, Eleanna Kafeza, Sanjay Rawat, Muhammad Shafique

Affiliations: University of California, Berkeley

AI summary: Proposes the RAVEN framework, which combines LLM agents with retrieval-augmented generation to automatically produce vulnerability analysis reports that follow the Google Project Zero template, with an average quality score of 54.21% on 105 samples.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine that retrieves relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task-specific LLM Judge that evaluates reports on structural integrity, ground-truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.

2604.09552 2026-06-08 cs.IR cs.AI cs.CL Updated version

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

Affiliations: School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269; Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

AI summary: Proposes the MCERF framework, which pairs the multimodal retriever ColPali with LLM reasoning and uses strategies such as hybrid lookup, vision-to-text fusion, a high-reasoning mode, and self-consistency decisions. On the DesignQA benchmark it achieves a 41.1% relative gain in average accuracy and handles multimodal question answering over engineering documents without ingesting the full rulebook.

Abstract

Engineering rulebooks and technical standards contain multimodal information such as dense text, tables, and illustrations that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes the Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) a Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) a High Reasoning LLM mode for complex multimodal questions, and (iv) a Self-Consistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. Furthermore, this work establishes and compares two routing approaches, a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that the system improves average accuracy across all tasks with a relative gain of +41.1% over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks achieved without complete rulebook ingestion. This shows how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.

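2604.07821 2026-06-08 cs.MA cs.AI cs.CL Updated version

More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

Advait Yadav, Sid Black, Oliver Sourbut

Affiliations: GitHub

AI summary: Studies why LLMs fail at zero-cost collaboration in multi-agent systems. Using an environment stripped of strategic complexity, it finds that more capable models (such as o3) cooperate worse, separates competence failures from active information withholding, and proposes targeted interventions.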

Comments
Accepted to the ICML 2026 main conference

Abstract

Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation fails. Many real-world coordination problems are not social dilemmas: helping others -- sharing documentation, unblocking a teammate -- costs the helper almost nothing while producing substantial collective benefit. Whether LLM agents cooperate in this regime, where helping is free and they are explicitly instructed to do so, remains unknown. We build a turn-based multi-agent environment that strips away all strategic complexity, making cooperation costless and trivially optimal. Across eight widely used LLMs, capability does not predict cooperation: OpenAI o3 reaches only 17% of optimal collective performance while the weaker o3-mini reaches 50%, despite identical instructions to maximize group revenue. Using a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, and find that several capable models actively withhold information despite gaining nothing from withholding. Targeted interventions address each mode: explicit protocols roughly double the performance of competence-limited models, while small sharing incentives unlock cooperation-limited ones. Our results suggest that scaling intelligence alone will not solve coordination in multi-agent systems, and will require deliberate cooperative design, even when helping costs nothing.

2604.05360 2026-06-08 cs.HC cs.AI Updated version

OGA-AID: Clinician-in-the-loop AI Report Drafting Assistant for Multimodal Observational Gait Analysis in Post-Stroke Rehabilitation

Khoi T. N. Nguyen, Nghia D. Nguyen, Hui Yu Koh, Patrick W. H. Kwong, Karen Sui Geok Chua, Ananda Sidarta, Baosheng Yu

Affiliations: Rehabilitation Research Institute of Singapore, Nanyang Technological University, Singapore; Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore; The Grainger College of Engineering, University of Illinois Urbana-Champaign, United States; Department of Rehabilitation Sciences, The Hong Kong Polytechnic University, Hong Kong; VinUni-Illinois Smart Health Center, VinUniversity, Vietnam; Institute of Rehabilitation Excellence, Tan Tock Seng Hospital, NHG Health, Singapore

AI summary: Proposes OGA-AID, a clinician-in-the-loop multi-agent LLM system that coordinates three specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured gait assessment reports. It outperforms single-pass multimodal baselines on real patient data and shows the complementary relationship between AI-assisted analysis and human clinical judgment.

详情
Comments
2026 CV4Clinic CVPR Workshop Proceedings
AI中文摘要

步态分析在中风后康复中至关重要,但仍然是时间密集型和认知要求高的,特别是当临床医生必须将步态视频和运动捕捉数据整合到结构化报告中时。我们提出了OGA-AID,一种临床医生在环的多智能体大语言模型系统,用于多模态报告起草。该系统协调3个专业智能体,将患者运动记录、运动学轨迹和临床资料综合成结构化评估。在真实患者数据上由专家物理治疗师评估,OGA-AID始终优于单次多模态基线,且误差低。在临床医生在环设置中,简短的专家初步笔记进一步降低了与参考评估相比的误差。我们的研究结果证明了多模态智能体系统用于结构化临床步态评估的可行性,并突出了在康复工作流程中AI辅助分析与人类临床判断之间的互补关系。

英文摘要

Gait analysis is essential in post-stroke rehabilitation but remains time-intensive and cognitively demanding, especially when clinicians must integrate gait videos and motion-capture data into structured reports. We present OGA-AID, a clinician-in-the-loop multi-agent large language model system for multimodal report drafting. The system coordinates 3 specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured assessments. Evaluated with expert physiotherapists on real patient data, OGA-AID consistently outperforms single-pass multimodal baselines with low error. In clinician-in-the-loop settings, brief expert preliminary notes further reduce error compared to reference assessments. Our findings demonstrate the feasibility of multimodal agentic systems for structured clinical gait assessment and highlight the complementary relationship between AI-assisted analysis and human clinical judgment in rehabilitation workflows.

2604.04226 2026-06-08 cs.MA cs.AI Updated version

SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for Agentic Web

Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui, Ziheng Zhang, Guangtao Zeng, Wenzheng Tang, Yikun Wang, Yuanjian Zhou, Zimian Peng, Yong Yu, Weiwen Liu, Hiroki Kobayashi, Weinan Zhang

Affiliations: Shanghai Jiao Tong University; The University of Tokyo; Huazhong University of Science and Technology; Shanghai Innovation Institute; Nankai University; Singapore University of Technology and Design; Queen's University; Fudan University; Zhejiang University

AI summary: Proposes SW-$A^2$-Bench, the first benchmark for software agent generation, in which coding agents automatically convert code repositories into autonomous software agents. It evaluates the faithfulness and interoperability of the generated agents, with the aim of scaling the Agentic Web.

详情
AI中文摘要

智能体网络正在成为一种新兴范式,其中自主软件智能体与在线资源及其他智能体交互以完成用户目标。然而,智能体网络的容量仍受限于自主软件智能体数量不足,这已成为扩展智能体网络的关键挑战。为缓解这一问题,我们研究了通过编码智能体自动将现有代码仓库转化为自主软件智能体的任务,将过程分解为关键阶段,并识别关键技术障碍。为系统评估这一能力,我们提出了面向智能体网络的软件智能体生成基准(SW-$A^2$-Bench),这是首个专为软件智能体生成设计的基准。SW-$A^2$-Bench不仅评估软件智能体是否能够生成,还评估生成的智能体是否忠实于源代码仓库,以及在多智能体工作流中是否与其他智能体可互操作。实验表明,我们的方法有效激活了代码仓库的功能能力,并在智能体网络中实现了可互操作的多智能体协作。我们相信,这项工作将为软件智能体生成提供标准化评估,并有助于未来扩展智能体网络的容量。

英文摘要

The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accomplish user goals. However, the capacity of the Agentic Web is still limited by an insufficient population of autonomous software agents, which has become a crucial challenge for scaling it. To alleviate this, we study the task of automatically converting existing code repositories into autonomous software agents via coding agents, decompose the process into critical stages, and identify key technical hurdles. To systematically evaluate this capability, we propose SoftWare Agent generation for Agentic Web Bench (SW-$A^2$-Bench), the first benchmark designed for software agent generation. SW-$A^2$-Bench evaluates not only whether software agents can be generated, but also whether the generated agents are faithful to the source repositories and interoperable with other agents in multi-agent workflows. Our experiments demonstrate that our approach effectively activates the functional capabilities of code repositories and enables interoperable multi-agent collaboration in the Agentic Web. We believe that this work will provide a standardized evaluation for software agent generation and will contribute to scaling the capacity of the Agentic Web.

2603.20990 2026-06-08 cs.IR cs.AI Updated version

$\mathrm{ECI}_{\mathrm{sem}}$: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives

Aarush Sinha, Rahul Seetharaman, Aman Bansal

Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology (IIT), Kharagpur, India

AI summary: Proposes ECI, a training-free diagnostic that ranks candidate negative sources using frozen target-encoder embeddings. On MS MARCO negative sources its rankings match the strongest aggregate BEIR transfer results, and they remain stable under several perturbations.

Abstract

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose $\mathrm{ECI}_{\mathrm{sem}}$, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. $\mathrm{ECI}_{\mathrm{sem}}$ is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. $\mathrm{ECI}_{\mathrm{sem}}$ builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family $\mathrm{ECI}_{\mathrm{sem}}$ ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.

2603.20967 2026-06-08 stat.ML cs.LG math.ST stat.TH Updated version

Hard labels sampled from sparse targets mislead rotation invariant algorithms

从稀疏目标采样的硬标签误导旋转不变算法

Avrajit Ghosh, Bin Yu, Manfred Warmuth, Peter Bartlett

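Affiliations: University of California, Berkeley; University of Wisconsin, Madison

AI summary: For binary classification with sparse targets, the paper proves that rotation-invariant algorithms (such as gradient descent on the logistic loss) incur excess risk $\Omega\!\left(\frac{d-1}{n}\right)$, whereas a non-rotation-invariant algorithm that reparameterizes each weight as $u_i v_i$ achieves $O\!\left(\frac{s\log d}{n}\right)$.

Details
Journal ref
ICML-2026
AI Chinese abstract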

最常见的机器学习设置之一是逻辑回归。在许多分类模型中,包括神经网络,最终预测是通过将逻辑链接函数应用于线性得分获得的。在二元逻辑回归中,反馈可以是软标签(对应于数据的真实条件概率,如在蒸馏中)或采样的硬标签(取值为$\pm 1$)。我们指出即使在特别有利的设置中也会出现一个基本问题,其中目标是学习形式为$\sigma(\mathbf{x}^{\top}\mathbf{w}^{\star})$的无噪声软目标。在过约束情况(即样本数$n$超过输入维度$d$)下,使用样本$(\mathbf{x}_i,\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$足以恢复$\mathbf{w}^{\star}$,从而达到贝叶斯风险。然而,我们证明当样本由从相同条件分布$\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$采样的硬标签$y_i$标记,且$\mathbf{w}^{\star}$是$s$-稀疏时,旋转不变算法可证明是次优的:它们的超额风险为$\Omega\!\left(\frac{d-1}{n}\right)$,而存在简单的非旋转不变算法,其超额风险为$O(\frac{s\log d}{n})$。最简单的旋转不变算法是逻辑损失上的梯度下降(带早停)。针对稀疏目标实现上述上界的简单非旋转不变算法使用对权重$u_i,v_i$的梯度下降,其中线性权重$w_i$被重参数化为$u_i v_i$。

English abstract

One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $\sigma(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e. the number of samples $n$ exceeds the input dimension $d$) with examples $(\mathbf{x}_i,\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$, it is sufficient to recover $\mathbf{w}^{\star}$ and hence achieve the Bayes risk. However, we prove that when the examples are labeled by hard labels $y_i$ sampled from the same conditional distribution $\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, then rotation-invariant algorithms are provably suboptimal: they incur an excess risk $\Omega\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation-invariant algorithms with excess risk $O(\frac{s\log d}{n})$. The simplest rotation-invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bound uses gradient descent on the weights $u_i,v_i$, where now the linear weight $w_i$ is reparameterized as $u_iv_i$.
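The two algorithm families named in the abstract can be contrasted in a minimal NumPy sketch. It assumes plain full-batch gradient descent, labels in $\{-1,+1\}$, and an initialization of $u_i = 0.1$, $v_i = 0$ for the reparameterized variant; the step size, step count, and initialization are illustrative choices, not values from the paper, and the sketch makes no claim about reproducing the stated risk bounds.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function 1 / (1 + exp(-z)).
    return np.exp(-np.logaddexp(0.0, -z))

def logistic_loss(w, X, y):
    # Mean logistic loss log(1 + exp(-y * x^T w)) for labels y in {-1, +1}.
    return float(np.mean(np.logaddexp(0.0, -y * (X @ w))))

def logistic_grad(w, X, y):
    # Gradient of the mean logistic loss with respect to w.
    return -(X.T @ (y * sigmoid(-y * (X @ w)))) / len(y)

def sample_hard_labels(X, w_star, rng):
    # Draw y_i = +1 with probability sigmoid(x_i^T w_star), else -1.
    p = sigmoid(X @ w_star)
    return np.where(rng.random(len(p)) < p, 1.0, -1.0)

def gd_plain(X, y, lr=0.2, steps=300):
    # Rotation-invariant baseline: gradient descent directly on w from zero.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * logistic_grad(w, X, y)
    return w

def gd_reparam(X, y, lr=0.2, steps=300, init=0.1):
    # Gradient descent on (u, v) with w = u * v elementwise.
    # The initialization (u = init, v = 0) is an assumption of this sketch.
    d = X.shape[1]
    u, v = np.full(d, init), np.zeros(d)
    for _ in range(steps):
        g = logistic_grad(u * v, X, y)           # dL/dw at w = u * v
        u, v = u - lr * g * v, v - lr * g * u    # chain rule: dL/du = g*v, dL/dv = g*u
    return u * v
```

To experiment, draw `X` with `rng.standard_normal((n, d))`, build an $s$-sparse `w_star`, label with `sample_hard_labels`, and compare the two returned weight vectors against `w_star`.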

2603.13428 2026-06-08 cs.SE cs.AI Updated version

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

EvoClaw: 评估AI代理在持续软件演化中的表现

Gangda Deng, Zhaoling Chen, Zhongming Yu, Haoyang Fan, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Rajgopal Kannan, Le Cong, Mengdi Wang, Qian Zhang, Viktor Prasanna, Xiangru Tang, Xingyao Wang

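Affiliations: USC; UCR; UCSD; Army Research Office; Stanford; Princeton; Haven OpenHands

AI summary: Existing benchmarks overlook the temporal dependencies and technical debt of software evolution. The paper proposes the EvoClaw benchmark, built on verifiable Milestone DAGs reconstructed from commit logs, to evaluate whether AI agents can sustain system integrity and limit error accumulation during continuous development.

Details
Comments
ICML 2026
AI Chinese abstract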

随着AI代理越来越多地被部署为长期运行的系统,自主构建并持续演化定制软件以在动态环境中进行交互变得至关重要。然而,现有基准测试在孤立的、一次性的编码任务上评估代理,忽视了真实世界软件演化中固有的时间依赖和技术债务。为弥补这一差距,我们引入了DeepCommit,一个从嘈杂的提交日志中重建可验证里程碑DAG的代理管道,其中里程碑被定义为功能内聚的开发目标。这些可执行序列使得EvoClaw成为可能,这是一个新颖的基准测试,要求代理维持系统完整性并限制错误累积,这些是当前基准测试中大部分缺失的长期软件演化的维度。我们对4个代理框架下的12个前沿模型的评估揭示了一个关键弱点:整体性能得分从孤立任务上的>80%显著下降到持续设置中的最多38%,暴露了代理在长期维护和错误传播方面的深刻困境。

English abstract

With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline that reconstructs verifiable Milestone DAGs from noisy commit logs, where milestones are defined as functionally cohesive development goals. These executable sequences enable EvoClaw, a novel benchmark that requires agents to sustain system integrity and limit error accumulation, dimensions of long-term software evolution largely missing from current benchmarks. Our evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly from >80% on isolated tasks to at most 38% in continuous settings, exposing agents' profound struggle with long-term maintenance and error propagation.

2512.00883 2026-06-08 cs.MM cs.CV cs.SD Updated version

Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents

视听世界模型:为具身智能体奠定多感官想象的基础

Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao, Shijie Cheng

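Affiliations: Tsinghua University; Beijing Institute of Technology

AI summary: Proposes a unified formulation of Audio-Visual World Models (AVWM) and AV-CDiT, a conditional diffusion Transformer that jointly predicts binaural audio and visual dynamics. On the 30-hour AVW-4k benchmark it achieves high-fidelity multimodal prediction, and its usefulness is validated in embodied navigation.

Details
AI Chinese abstract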

世界模型通过模拟环境动态使智能体能够规划和推理未来状态。虽然现有方法主要关注视觉观察,但现实世界的感知本质上涉及多种感觉模态。音频提供了关键的空间和时间线索,如声源定位和声学场景属性,但其整合到世界模型中仍相对未被充分探索。先前的工作尚未建立低层动作控制下视听世界建模的通用公式,也未阐明如何联合捕捉物理上合理的双耳音频和视觉动态。本文提出了视听世界模型(AVWM)的统一公式,将多模态环境模拟建模为具有同步视听观测的部分可观测马尔可夫决策过程。作为解决该问题的基础步骤,我们构建了AVW-4k,一个受控基准数据集,包含30小时的双耳视听轨迹,覆盖76个室内环境并带有动作标注。我们提出了AV-CDiT,一种视听条件扩散Transformer,采用新颖的模态专家架构平衡视觉和听觉学习,通过三阶段训练策略优化以实现有效的多模态整合。在该基准上的大量实验表明,AV-CDiT在视觉和听觉模态上实现了高保真多模态预测。此外,我们验证了其在具身导航中的实际效用,证明AVWM改进了视觉-语言模型引导的智能体在连续视听导航中的表现。

English abstract

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.

2602.04894 2026-06-08 cs.CR cs.AI Updated version

Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software

从黑盒LLM生成的软件中提取重复漏洞

Tomer Kordonsky, Amit LeVi, Maayan Yamin, Noam Benzimra, Avi Mendelson

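Affiliations: Technion - Israel Institute of Technology

AI summary: Proposes the Feature-Security Table (FSTab), which enables a black-box attack that predicts backend vulnerabilities from frontend features and quantifies how consistently a model reproduces vulnerabilities across programs, rephrasings, and domains. Experiments report up to 94% attack success under cross-domain transfer.

Details
Comments
ICML 2026, Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD)
AI Chinese abstract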

LLM越来越多地用于代码生成,但其输出通常遵循重复模板,可能导致可预测的漏洞。我们研究了LLM生成软件中的漏洞持久性,并引入了特征-安全表(FSTab),包含两个组件。首先,FSTab支持黑盒攻击,通过可观察的前端特征和源LLM的知识预测可能的后端漏洞,无需访问后端或源代码。其次,FSTab提供以模型为中心的评估,量化模型在跨程序、语义保持重述和应用域中复现相同漏洞的一致性。我们在最先进的代码LLM(包括GPT-5.2、Claude-4.5 Opus和Gemini-3 Pro)上评估了FSTab,覆盖多种应用域。我们的结果显示强大的跨域迁移:即使目标域在训练中被排除,FSTab在内部工具(Claude-4.5 Opus)上仍能达到94%的攻击成功率和93%的漏洞覆盖率。这些发现暴露了LLM生成软件中一个未被充分探索的攻击面,并凸显了代码生成的安全风险。我们的代码可在https://github.com/fstabicml2026/FSTab获取。

English abstract

LLMs are increasingly used for code generation, but their outputs often follow recurring templates that can induce predictable vulnerabilities. We study vulnerability persistence in LLM-generated software and introduce Feature--Security Table (FSTab) with two components. First, FSTab enables a black-box attack that predicts likely backend vulnerabilities from observable frontend features and knowledge of the source LLM, without access to the backend or source code. Second, FSTab provides a model-centric evaluation that quantifies how consistently a model reproduces the same vulnerabilities across programs, semantics-preserving rephrasings, and application domains. We evaluate FSTab on state-of-the-art code LLMs, including GPT-5.2, Claude-4.5 Opus, and Gemini-3 Pro, across diverse application domains. Our results show strong cross-domain transfer: even when the target domain is excluded from training, FSTab achieves up to 94% attack success and 93% vulnerability coverage on Internal Tools (Claude-4.5 Opus). These findings expose an underexplored attack surface in LLM-generated software and highlight the security risks of code generation. Our code is available at https://github.com/fstabicml2026/FSTab

2509.11208 2026-06-08 stat.ML cs.LG Updated version

Predictable Compression Failures: Order Sensitivity and Information Budgeting for Evidence-Grounded Binary Adjudication

可预测的压缩失败:基于证据的二元裁决的顺序敏感性与信息预算

Leon Chlon, Ahmed Karim, Maggie Chlon, MarcAntonio Awada

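Affiliations: GitHub

AI summary: Studies how evidence order affects Transformer-based binary adjudication, proposes the QMV bound and the EDFL law, and uses an Information Sufficiency Ratio (ISR) gate to make answer/abstain decisions at low hallucination rates.

Details
AI Chinese abstract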

用于基于证据的二元裁决(例如,支持/反驳、是/否或验证器支持的通过/失败决策)的Transformer可能对可交换证据呈现的顺序敏感,在验证器相关的伯努利谓词下产生跨排列的分散性和不可靠的尝试答案。我们将证据顺序视为一个干扰变量,并形式化了一个期望-实现差距:下一个词训练可以最小化顺序上的期望条件描述长度,而固定顺序仍保持位置敏感性。我们的量化鞅违反(QMV)界预测了由相邻秩位置敏感性引起的分散性,在调和区具有$O(\log n)$增长;我们的期望级解压定律(EDFL)将KL凸性/数据处理界专门化到伯努利谓词,产生信任比特(B2T)、幻觉风险(RoH)以及用于答案/弃权决策的信息充分率(ISR)门。在来自FEVER、HotpotQA、NQ-Open、PopQA和Controls的3,059个有依据项目上,我们观察到对数分散性和均匀排列混合的正Jensen增益。在一个预先指定的保留审计(528个项目)中,分析固定的ISR$=1$门实现了0.0-0.7%的幻觉率,20.6-27.9%的弃权率(95%置信区间),支持该操作点,但未声称对所有模型系列或不受限生成具有通用校准。

English abstract

Transformers used for evidence-grounded binary adjudication (e.g., support/refute, yes/no, or verifier-backed pass/fail decisions) can be sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers under a verifier-relative Bernoulli predicate. We treat evidence order as a nuisance variable and formalize an expectation-realization gap: next-token training can minimize expected conditional description length over orderings while a fixed ordering remains position-sensitive. Our Quantified Martingale Violation (QMV) bound predicts the dispersion induced by adjacent-rank positional sensitivity, with $O(\log n)$ growth in the harmonic regime; our Expectation-level Decompression Law (EDFL) specializes a KL convexity/data-processing bound to Bernoulli predicates, yielding Bits-to-Trust (B2T), Risk-of-Hallucination (RoH), and an Information Sufficiency Ratio (ISR) gate for answer/abstain decisions. On 3,059 grounded items from FEVER, HotpotQA, NQ-Open, PopQA, and Controls, we observe logarithmic dispersion and positive Jensen gains from uniform permutation mixtures. In one pre-specified held-out audit (528 items), the analytically fixed ISR$=1$ gate attains 0.0-0.7% hallucination with 20.6-27.9% abstention (95% CIs), supporting the operating point without claiming universal calibration across all model families or unrestricted generation.
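The "positive Jensen gains from uniform permutation mixtures" mentioned above follow from the convexity of the negative logarithm: averaging the predicted probabilities over evidence orderings can never give a higher log loss than the average of the per-ordering log losses. The sketch below illustrates only that inequality, in nats, for hypothetical per-ordering probabilities; the function name and inputs are illustrative, and the paper's actual estimators for QMV, B2T, RoH, and ISR are not reproduced here.

```python
import math

def jensen_gain(probs):
    """Gain (in nats) from mixing predictions uniformly over orderings.

    probs[k] is the probability the model assigns to the correct label
    when the evidence is presented in ordering k (hypothetical inputs).
    """
    # Average log loss if one ordering is used at a time.
    mean_of_losses = sum(-math.log(p) for p in probs) / len(probs)
    # Log loss of the uniform mixture of the per-ordering predictions.
    loss_of_mixture = -math.log(sum(probs) / len(probs))
    # Non-negative by Jensen's inequality; zero when all probs are equal.
    return mean_of_losses - loss_of_mixture
```

The gain grows with the spread of the per-ordering probabilities, so an order-insensitive model gains nothing from mixing.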

2602.16908 2026-06-08 cond-mat.mtrl-sci cs.LG quant-ph Updated version

Multi-objective optimization and quantum hybridization of equivariant deep learning interatomic potentials

等变深度学习原子间势的多目标优化与量子混合

G. Laskaris, D. Morozov, D. Tarpanov, A. Seth, J. Procelewska, G. Sai Gautam, A. Sagingalieva, R. Brasher, A. Melnikov

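Affiliations: Terra Quantum AG; LIACS, Leiden University; Nanoscience Center and Department of Chemistry, University of Jyväskylä; Department of Materials Engineering, Indian Institute of Science; Schaeffler Technologies AG & Co. KG

AI summary: To address the trade-off between accuracy and inference time in the Allegro model, the paper applies multi-objective hyperparameter optimization and designs two variants, one with additional classical layers and one with quantum-classical hybrid layers. The hybrid variant achieves the best force prediction accuracy on the Cu-Li dataset and competitive results on the other datasets.

Details
Journal ref
Comput. Mater. Sci. 270, 114742 (2026)
Comments
15 pages, 7 figures, 6 tables
AI Chinese abstract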

Allegro是一种机器学习原子间势模型,旨在使用E(3)等变神经网络预测分子中的原子性质。在训练该模型时,精度与推理时间之间往往存在权衡。为此,我们对这两个目标应用多目标超参数优化。此外,我们通过构建Allegro的变体来尝试修改架构:一种扩展了额外的经典层,另一种结合了量子-经典混合层。我们在QM9、rMD17-阿司匹林、rMD17-苯以及一个自生成的铜-锂结构数据集上评估所有模型。结果表明,两种变体在多个数据集上的力预测精度均超过Allegro。经典变体持续优于基线,而量子-经典混合变体在完全优化的Cu-Li数据集上取得了最佳的整体力预测精度,比经典变体高出约13%。值得注意的是,尽管混合变体在其他数据集上使用了从Cu-Li转移的超参数而未进行特定数据集的优化,但仍取得了有竞争力的结果,这表明量子-经典混合是增强MLIP架构的一个有前景的方向。

English abstract

Allegro is a machine learning interatomic potential model designed to predict atomic properties in molecules using E(3) equivariant neural networks. When training this model, there tends to be a trade-off between accuracy and inference time. For this reason, we apply multi-objective hyperparameter optimization to both objectives. Additionally, we experiment with modified architectures by constructing variants of Allegro: one extended with additional classical layers and one incorporating quantum-classical hybrid layers. We evaluate all models on QM9, rMD17-aspirin, rMD17-benzene, and a self-generated dataset of copper-lithium structures. Both variants surpass Allegro in force prediction accuracy across multiple datasets. The classical variant consistently improves over the baseline, while the quantum-classical hybrid variant achieves the best overall force prediction accuracy on the Cu-Li dataset, where it was fully optimized, outperforming the classical variant by approximately 13%. Notably, the hybrid variant also achieves competitive results on the remaining datasets despite using hyperparameters transferred from Cu-Li without dataset-specific optimization, suggesting that quantum-classical hybridization is a promising direction for enhancing MLIP architectures.
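When two objectives such as force error and inference time are optimized together, the result is a set of non-dominated configurations rather than a single best one. A minimal sketch of that selection step is shown below; it assumes each trial is summarized as a `(force_error, inference_time)` pair with both values minimized, which is an illustrative encoding and not the paper's optimization procedure.

```python
def pareto_front(points):
    """Return the points that no other point dominates.

    Each point is (force_error, inference_time); lower is better for both.
    A point q dominates p if q is no worse in both objectives and strictly
    better in at least one.
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front
```

A practitioner would then pick one configuration from the front according to the accuracy or latency budget at hand.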

2602.15084 2026-06-08 physics.plasm-ph cs.AI cs.LG Updated version

TokaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics

TokaMind: 用于托卡马克等离子体动力学的多模态Transformer基础模型

Tobia Boschi, Andrea Loreti, Nicola C. Amorisco, Rodrigo H. Ordonez-Hurtado, Cécile Rousseau, George K. Holt, Eszter Székely, Alexander Whittle, Samuel Jackson, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Vassil Alexandrov, Mykhaylo Zayats

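Affiliations: IBM Research; UK Atomic Energy Authority; STFC Hartree Centre

AI summary: Presents TokaMind, described as the first open-source foundation model for tokamak plasma dynamics. It is a Multi-Modal Transformer pretrained on the MAST dataset, supports multiple data modalities and missing-signal handling, and after fine-tuning outperforms the strongest baseline on all but one of the 14 TokaMark tasks.

Details
AI Chinese abstract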

我们提出TokaMind,据我们所知,这是首个用于托卡马克等离子体动力学的开源基础模型,基于多模态Transformer(MMT)并在公开可用的MAST数据集上的异构诊断数据上预训练。TokaMind支持多种数据模态(时间序列、2D轮廓和视频),具有不同的采样率、鲁棒的缺失信号处理,并通过选择性加载和冻结四个模型组件实现高效任务适配。为了表示多模态信号,我们使用轻量级固定基离散余弦变换嵌入(DCT3D),并为替代嵌入(例如变分自编码器)提供干净接口。我们在最近引入的MAST基准TokaMark上评估TokaMind,该基准包含14个具有异构重建和预测目标的任务。我们的结果表明,微调后的TokaMind在所有任务上均优于最强的基准基线,仅一个任务除外。与在匹配的epoch预算下从头训练相同架构相比,热启动适配在要求苛刻的下游设置中最为有益,包括长时域预测和高维平衡目标。这些发现突显了多模态预训练对托卡马克等离子体动力学的价值,并为未来的聚变建模任务提供了实用、可扩展的基础。训练代码和模型权重分别公开在github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamind和huggingface.co/UKAEA-IBM-STFC。

English abstract

We present TokaMind, to our knowledge the first open-source foundation model for tokamak plasma dynamics, based on a Multi-Modal Transformer (MMT) and pretrained on heterogeneous diagnostics from the publicly available MAST dataset. TokaMind supports multiple data modalities (time-series, 2D profiles, and videos) with different sampling rates, robust missing-signal handling, and efficient task adaptation via selectively loading and freezing four model components. To represent multi-modal signals, we use a lightweight fixed-basis Discrete Cosine Transform embedding (DCT3D) and provide a clean interface for alternative embeddings (e.g., Variational Autoencoders). We evaluate TokaMind on the recently introduced MAST benchmark TokaMark, which comprises 14 tasks with heterogeneous reconstruction and forecasting objectives. Our results show that fine-tuned TokaMind outperforms the strongest benchmark baseline on all but one task. Compared with training the same architecture from scratch under a matched epoch budget, warm-start adaptation is most beneficial on demanding downstream settings, including long-horizon forecasting and high-dimensional equilibrium objectives. These findings highlight the value of multi-modal pretraining for tokamak plasma dynamics and provide a practical, extensible foundation for future fusion modeling tasks. Training code and model weights are publicly available at github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamind and huggingface.co/UKAEA-IBM-STFC, respectively.

2602.06245 2026-06-08 stat.ML cs.LG Updated version

Inheritance Between Feedforward and Convolutional Networks via Model Projection

前馈网络与卷积网络之间的继承关系:通过模型投影

Nicolas Ewen, Jairo Diaz-Rodriguez, Kelly Ramsay

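Affiliations: Department of Mathematics and Statistics

AI summary: Introduces inheritance between model classes, proves that generalized feedforward networks form a strict subset of generalized convolutional networks, and uses model projection to recover a restricted reverse inheritance path for parameter-efficient transfer learning.

Details
AI Chinese abstract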

神经网络技术通常通过类比在不同架构家族之间转移,但这种转移仅在技术所需假设被保留时才有效。我们将这一思想引入为模型类之间的继承。使用统一的节点级框架和张量值激活,我们证明广义前馈网络(GFFN)是广义卷积网络(GCNN)的严格子集,因此GCNN的性质直接转移到GFFN。反向方向并非自动:标准CNN节点使用空间核,而FFN节点对每个输入贡献使用一个标量权重。我们引入模型投影来恢复受限的反向继承路径。投影冻结每个卷积输入通道子函数,并为每个输入-输出通道贡献学习一个标量系数,使投影后的CNN节点具有标量加权输入重组的GFFN风格可训练结构。这种继承结构自然导致参数高效的迁移学习。在多个ImageNet预训练CNN骨干网络和下游图像分类数据集上,模型投影与标准和PEFT基线竞争,并为后续全微调提供有效的初始化。

English abstract

Neural-network techniques are often transferred across architecture families by analogy, but such transfer is valid only when the assumptions required by a technique are preserved. We introduce this idea as inheritance between model classes. Using a unified node-level framework with tensor-valued activations, we prove that generalized feedforward networks (GFFNs) form a strict subset of generalized convolutional networks (GCNNs), so GCNN properties transfer directly to GFFNs. The reverse direction is not automatic: standard CNN nodes use spatial kernels, while FFN nodes use one scalar weight per input contribution. We introduce model projection to recover a restricted reverse inheritance path. Projection freezes each convolutional input-channel sub-function and learns one scalar coefficient for each input-output channel contribution, giving projected CNN nodes the GFFN-style trainable structure of scalar-weighted input recombination. This inherited structure leads naturally to parameter-efficient transfer learning. Across multiple ImageNet-pretrained CNN backbones and downstream image-classification datasets, model projection is competitive with standard and PEFT baselines and provides an effective initialization for subsequent full fine-tuning.
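The projection step described above (freeze each input-channel sub-function, learn one scalar per input-output channel pair) can be sketched in NumPy as follows. This is an illustrative reading of the abstract under simple assumptions: stride 1, no padding, no bias, and a naive loop-based cross-correlation. The function names are hypothetical and the paper's actual implementation may differ.

```python
import numpy as np

def correlate2d_valid(x, k):
    # Plain "valid" 2D cross-correlation of one channel x with kernel k.
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def channel_responses(x, kernels):
    # Frozen part: responses[o, i] is input channel i filtered by kernels[o, i].
    # x has shape (n_in, H, W); kernels has shape (n_out, n_in, kh, kw).
    n_out, n_in = kernels.shape[:2]
    return np.stack([
        np.stack([correlate2d_valid(x[i], kernels[o, i]) for i in range(n_in)])
        for o in range(n_out)
    ])

def projected_conv(x, kernels, alpha):
    # Output channel o = sum_i alpha[o, i] * responses[o, i].
    # Only alpha (shape (n_out, n_in)) would be trained; kernels stay frozen.
    responses = channel_responses(x, kernels)
    return np.einsum("oi,oihw->ohw", alpha, responses)
```

With `alpha` set to all ones the node reduces to the original convolution, and the trainable parameter count drops from `n_out * n_in * kh * kw` to `n_out * n_in`.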

2602.01177 2026-06-08 quant-ph cs.IT cs.LG math.IT Updated version

Privacy Implies Stability: Information-Theoretic Generalization Bounds for Quantum Learning

隐私蕴含稳定性:量子学习的信息论泛化界

Ayanava Dasgupta, Naqueeb Ahmad Warsi, Masahito Hayashi

Affiliations: Indian Statistical Institute, Kolkata; School of Data Science, The Chinese University of Hong Kong, Shenzhen; International Quantum Academy, Futian District, Shenzhen; Graduate School of Mathematics, Nagoya University

AI summary: Develops an information-theoretic framework connecting stability, privacy, and generalization for quantum learning algorithms, shows that quantum differential privacy yields a direct generalization guarantee, and finds that quantum non-orthogonality makes Information-Theoretic Admissibility compatible with privacy.

Details
Comments
36 pages, 3 figures; The introduction has been substantially rewritten to provide better context, and certain proofs have been relocated from the appendices to the main body of the paper; The core mathematical framework and technical results remain unchanged
AI Chinese abstract

我们开发了一个信息论框架,连接量子学习算法的稳定性、隐私和泛化。学习过程被建模为具有经典-量子输出的量子仪器,损失由可观测量表示。我们证明,在经典-量子次高斯条件下,信息论稳定性度量控制期望泛化误差。此外,我们利用量子Rényi散度处理非交换性下的高阶依赖,建立了一个高概率泛化界。在可信数据处理者设置中,量子差分隐私(QDP)提供了一种稳定性机制。我们证明单邻居QDP严格限制了经典-量子输出泄露的信息。结合我们的稳定性定理,直接得到隐私到泛化的保证。我们还探索了不可信数据处理者设置。在此,仅输出隐私是不够的,因为对抗性处理者可能在应用噪声后处理之前执行高度信息性的过程。为了解决这个问题,我们引入了信息论可容许性(ITA),这是一种认证条件,确保规定程序不仅仅是编码系综上一个严格更具信息性、物理允许操作的退化版本。我们证明了一个基本分离:虽然在经典模型中可容许性和隐私存在强烈张力,但量子非正交性使它们兼容。量子测量可以是ITA——耗尽所有相关的可访问信息——而无需完美恢复经典数据集。我们通过一个具体的量子ITA例子说明了这种分离。

English abstract

We develop an information-theoretic framework connecting stability, privacy, and generalization for quantum learning algorithms. Learning procedures are modeled as quantum instruments with classical-quantum outputs, and losses are represented by observables. We prove that under a classical-quantum sub-Gaussian condition, an information-theoretic stability measure controls the expected generalization error. Furthermore, we establish a high-probability generalization bound using quantum Rényi divergences to manage higher-order dependencies under non-commutativity. In the trusted Data Processor setting, quantum differential privacy (QDP) provides a mechanism for stability. We show that one-neighbor QDP strictly bounds the information leaked by the classical-quantum output. Combining this with our stability theorem yields a direct privacy-to-generalization guarantee. We also explore an untrusted Data Processor setting. Here, output privacy alone is insufficient since an adversarial processor could perform a highly informative procedure before applying noisy post-processing. To combat this, we introduce Information-Theoretic Admissibility (ITA), a certification condition ensuring the prescribed procedure is not just a degraded version of a strictly more informative, physically allowed operation on the encoded ensemble. We prove a fundamental separation: while admissibility and privacy are in strong tension in classical models, quantum non-orthogonality makes them compatible. A quantum measurement can be ITA - exhausting all relevant accessible information - without perfectly recovering the classical dataset. We illustrate this separation through a concrete quantum ITA example.

2512.23128 2026-06-08 cs.HC cs.AI cs.MA Updated version

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

这是一个陷阱!面向网络代理的任务重定向说服基准

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Błaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip H. S. Torr, Adam Mahdi, Adel Bibi

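Affiliations: University of Cambridge

AI summary: Introduces the TRAP benchmark, which measures how susceptible LLM-powered web agents are to prompt injection in dynamic web content. Agents are redirected in 25% of tasks on average, revealing systemic, psychologically driven vulnerabilities.

Details
Comments
ICML 2026
AI Chinese abstract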

由大型语言模型驱动的网络代理越来越多地用于电子邮件管理或专业网络等任务。然而,它们对动态网页内容的依赖使其容易受到提示注入攻击:隐藏在界面元素中的对抗性指令,说服代理偏离其原始任务。我们引入了任务重定向代理说服基准(TRAP),这是一个研究说服技术如何在现实任务中误导自主网络代理的基准。在六个前沿模型中,代理平均在25%的任务中容易受到提示注入(GPT-5为13%,DeepSeek-R1为43%),小的界面或上下文变化通常会使成功率翻倍,并揭示网络代理中系统的、由心理驱动的漏洞。我们还提供了一个模块化的社会工程注入框架,并在高保真网站克隆上进行受控实验,允许进一步扩展基准。

English abstract

Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

2511.04567 2026-06-08 physics.plasm-ph cs.CE cs.LG physics.comp-ph Updated version

Machine Learning for Electron-Scale Turbulence Modeling in W7-X

W7-X中电子尺度湍流建模的机器学习方法

Ionut-Gabriel Farcas, Don Lawrence Carl Agapito Fernando, Alejandro Banon Navarro, Gabriele Merlo, Frank Jenko

Affiliations: Department of Mathematics and Division of Computational Modeling and Data Analytics, Academy of Data Science, Virginia Tech; Max Planck Institute for Plasma Physics

AI summary: For electron temperature gradient turbulence in the Wendelstein 7-X stellarator, the paper builds physics-guided scaling-law reduced models using regression with active learning, predicts heat fluxes, and evaluates interpolation and extrapolation performance.

Details
Journal ref
Phys. Plasmas 33, 000000 (2026)
Comments
15 pages, 7 tables, 14 figures
AI Chinese abstract

构建湍流输运的降阶模型对于加速剖面预测和实现参数探索、设计优化等多查询任务至关重要。本文研究了Wendelstein 7-X (W7-X)仿星器中电子温度梯度(ETG)湍流的机器学习驱动降阶模型。我们开发了物理引导的标度律,以预测七个径向位置处的ETG热流作为三个关键等离子体参数的函数:归一化电子温度梯度($ω_{T_e}$)、归一化电子温度与密度梯度之比($η_e$)以及电子与离子温度比($τ$)。模型系数通过回归结合主动学习策略确定。该过程使用低基数稀疏网格训练数据初始化标度律,并通过从现有模拟数据库中选择信息量最大的样本迭代丰富训练集。使用每个径向位置超过393个点的样本外数据集评估模型的预测性能。利用在七个训练径向位置识别的系数,我们进一步推导了标度律系数作为径向位置函数的回归参数化。然后在训练中未使用的三个额外径向位置评估所得模型,包括插值和适度外推情况。总体而言,我们的降阶模型表现出良好的预测性能,并达到与原始参考模拟相当的精度,包括在插值和适度外推范围内。一个重要发现是,单一的径向无关模型无法充分描述W7-X核心区的ETG输运,表明存在当前公式未捕捉的几何依赖物理。

English abstract

Constructing reduced models for turbulent transport is essential for accelerating profile predictions and enabling many-query tasks such as parameter exploration and design optimization. This work investigates machine-learning-driven reduced models for Electron Temperature Gradient (ETG) turbulence in the Wendelstein 7-X (W7-X) stellarator. We develop physics-guided scaling laws to predict the ETG heat flux at seven radial locations as functions of three key plasma parameters: the normalized electron temperature gradient ($ω_{T_e}$), the ratio of normalized electron temperature and density gradients ($η_e$), and the electron-to-ion temperature ratio ($τ$). The model coefficients are determined through regression combined with an active learning strategy. The procedure initializes the scaling laws using low-cardinality sparse-grid training data and iteratively enriches the training set by selecting maximally informative samples from an existing simulation database. The predictive performance of the models is assessed using out-of-sample datasets comprising more than $393$ points per radial location. Using the coefficients identified at the seven training radial locations, we further derive regression-based parameterizations for the scaling-law coefficients as functions of radial position. The resulting models are then evaluated at three additional radial locations not used during training, including both interpolation and moderate extrapolation cases. Overall, our reduced models demonstrate good predictive performance and achieve accuracy comparable to the original reference simulations, including in interpolation and moderate extrapolation regimes. An important finding is that a single radius-independent model cannot adequately describe ETG transport across the W7-X core, suggesting the presence of geometry-dependent physics not captured by the present formulation.
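The abstract does not state the functional form of the scaling laws. As a hedged illustration of regression-based coefficient fitting, the sketch below assumes a pure power law $Q = c\,\omega_{T_e}^{a}\,\eta_e^{b}\,\tau^{d}$ and fits it by ordinary least squares in log space. The power-law form, the function names, and the parameter names `c`, `a`, `b`, `d` are all assumptions of this example, and the active-learning sample selection is not shown.

```python
import numpy as np

def fit_power_law(omega, eta, tau, flux):
    """Fit log(flux) = log(c) + a*log(omega) + b*log(eta) + d*log(tau).

    All inputs are 1D arrays of positive values with equal length.
    The power-law form is an assumption of this sketch.
    """
    design = np.column_stack([
        np.ones(len(omega)), np.log(omega), np.log(eta), np.log(tau)
    ])
    coef, *_ = np.linalg.lstsq(design, np.log(flux), rcond=None)
    return {
        "c": float(np.exp(coef[0])),
        "a": float(coef[1]),
        "b": float(coef[2]),
        "d": float(coef[3]),
    }

def predict_flux(params, omega, eta, tau):
    # Evaluate the fitted power law at new parameter values.
    return params["c"] * omega ** params["a"] * eta ** params["b"] * tau ** params["d"]
```

Fitting one such model per radial location and then regressing the coefficients against radius would mirror the two-stage structure the abstract describes.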

2511.02748 2026-06-08 cs.NI cs.LG Updated version

Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning

面向6G的智能体世界建模:近实时生成式状态空间推理

Farhad Rezazadeh, Amir Ashtari Gargari, Hatim Chergui, Sandra Lagen, Merouane Debbah, Houbing Song, Lingjia Liu

Affiliations: BrainOmega and the Technical University of Catalonia (UPC); Centre Tecnologic de Telecomunicacions de Catalunya (CTTC/CERCA); i2CAT Foundation; Khalifa University of Science and Technology; University of Maryland, Baltimore County (UMBC); Virginia Tech

AI summary: Proposes WM-MS3M, an agentic world-modeling framework that learns an action-conditioned generative state space for near-real-time counterfactual forecasting and PRB planning in O-RAN. On realistic O-RAN traces it lowers prediction error and speeds up inference relative to baselines.

Details
Comments
13 Pages, 3 Figures, 4 Tables
AI Chinese abstract

我们认为第六代(6G)智能并非流畅的令牌预测,而是想象与选择的能力——模拟未来场景、权衡取舍并以校准的不确定性行动。我们通过反事实动力学和世界建模(WM)范式重新定义开放无线接入网(O-RAN)近实时(Near-RT)控制,该范式学习动作条件的生成式状态空间。这使得超越大语言模型(LLM)作为主要建模基元的定量“假设”预测成为可能。诸如物理资源块(PRB)之类的动作在因果世界模型中被视为一等控制输入,并且对预测和假设分析中的偶然不确定性和认知不确定性进行建模。一个基于智能体模型预测控制(MPC)的交叉熵方法(CEM)规划器在短时域上运行,利用数据驱动的PRB边界内的先验均值展开以最大化确定性奖励。该模型将多尺度结构化状态空间混合(MS3M)与紧凑随机潜变量耦合形成WM-MS3M,总结关键绩效指标(KPI)历史并在假设PRB序列下预测下一步KPI。在真实O-RAN轨迹上,WM-MS3M相比MS3M在参数减少32%且延迟相似的情况下将平均绝对误差(MAE)降低1.69%,相比注意力/混合基线实现35-80%更低的均方根误差(RMSE)和2.3-4.1倍更快的推理速度,从而支持稀有事件模拟和离线策略筛选。

English abstract

We argue that sixth-generation (6G) intelligence is not fluent token prediction but the capacity to imagine and choose: to simulate future scenarios, weigh trade-offs, and act with calibrated uncertainty. We reframe open radio access network (O-RAN) near-real-time (Near-RT) control via counterfactual dynamics and a world modeling (WM) paradigm that learns an action-conditioned generative state space. This enables quantitative "what-if" forecasting beyond large language models (LLMs) as the primary modeling primitive. Actions such as physical resource blocks (PRBs) are treated as first-class control inputs in a causal world model, and both aleatoric and epistemic uncertainty are modeled for prediction and what-if analysis. An agentic, model predictive control (MPC)-based cross-entropy method (CEM) planner operates over short horizons, using prior-mean rollouts within data-driven PRB bounds to maximize a deterministic reward. The model couples multi-scale structured state-space mixtures (MS3M) with a compact stochastic latent to form WM-MS3M, summarizing key performance indicator (KPI) histories and predicting next-step KPIs under hypothetical PRB sequences. On realistic O-RAN traces, WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference, enabling rare-event simulation and offline policy screening.
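The cross-entropy method (CEM) planner mentioned above can be sketched generically: sample candidate action sequences within bounds, score each with a rollout, and refit a Gaussian to the best candidates. The sketch assumes a scalar PRB allocation per step and a user-supplied `rollout_reward` function standing in for the learned world model; the population size, elite count, and iteration count are illustrative defaults, not the paper's settings.

```python
import numpy as np

def cem_plan(rollout_reward, horizon, low, high,
             iters=20, pop=128, elite=16, seed=0):
    """Plan a bounded action sequence with the cross-entropy method.

    rollout_reward(seq) -> float scores one candidate sequence; in the
    paper's setting this would be a world-model rollout (assumed here).
    """
    rng = np.random.default_rng(seed)
    mean = np.full(horizon, (low + high) / 2.0)
    std = np.full(horizon, (high - low) / 2.0)
    for _ in range(iters):
        # Sample candidates and keep them inside the action bounds.
        samples = np.clip(rng.normal(mean, std, size=(pop, horizon)), low, high)
        rewards = np.array([rollout_reward(s) for s in samples])
        # Refit the sampling distribution to the highest-reward candidates.
        elites = samples[np.argsort(rewards)[-elite:]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # floor keeps sampling non-degenerate
    return np.clip(mean, low, high)
```

In an MPC loop only the first action of the returned sequence would be applied before replanning from the next observed state.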

2509.22685 2026-06-08 eess.IV cs.CV cs.GR Updated version

VIRTUS-FPP: Virtual Sensor Modeling for Fringe Projection Profilometry in NVIDIA Isaac Sim

VIRTUS-FPP:NVIDIA Isaac Sim中条纹投影轮廓测量的虚拟传感器建模

Adam Haroon, Anush Lakshman, Badrinath Balasubramaniam, Beiwen Li

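Affiliations: Department of Mechanical Engineering, Iowa State University; College of Engineering, University of Georgia

AI summary: Presents VIRTUS-FPP, the first end-to-end virtual sensor modeling framework for fringe projection profilometry in NVIDIA Isaac Sim. It provides physically grounded simulation without dependence on pre-calibrated physical systems and achieves sub-millimeter reconstruction accuracy.

Details
Comments
10 pages, 13 figures, accepted for publication in IEEE Sensors Journal
AI Chinese abstract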

条纹投影轮廓测量(FPP)是一种用于3D表面重建的高精度结构光传感技术,但其实际部署常受限于复杂的校准程序、对环境条件的敏感性以及物理实验的高成本。同时,机器人研究日益依赖如NVIDIA Isaac Sim等仿真平台进行可扩展的开发与验证,但目前缺乏FPP等光学计量传感器的精确虚拟表示。本文提出VIRTUS-FPP,这是首个在NVIDIA Isaac Sim中实现的用于条纹投影轮廓测量的端到端虚拟传感器建模框架,能够对完整的FPP流程(包括结构光投影、图像形成、校准和3D重建)进行物理保真模拟,且无需依赖预校准的物理系统。该框架利用逆相机模型表示投影仪,确保了几何和光度保真度与结构光原理一致。通过连接光学计量与机器人仿真,VIRTUS-FPP实现了高保真合成数据生成、传感流程的系统评估以及真实世界FPP系统的数字孪生复制。实验结果表明,该框架具有亚毫米级重建精度,且模拟与物理测量之间具有强对应性,突显了其有效性及在推动感知驱动型机器人、仿真到现实迁移以及可扩展光学传感器设计方面的潜力。

English abstract

Fringe projection profilometry (FPP) is a high-precision structured-light sensing technique for 3D surface reconstruction, yet its practical deployment is often constrained by complex calibration procedures, sensitivity to environmental conditions, and the high cost of physical experimentation. At the same time, robotics research increasingly relies on simulation platforms such as NVIDIA Isaac Sim for scalable development and validation, but accurate virtual representations of optical metrology sensors such as FPP are not currently available. In this work, we present VIRTUS-FPP, the first end-to-end virtual sensor modeling framework for fringe projection profilometry implemented in NVIDIA Isaac Sim, enabling physically grounded simulation of the complete FPP pipeline, including structured light projection, image formation, calibration, and 3D reconstruction, without dependence on pre-calibrated physical systems. The framework leverages an inverse camera model for projector representation, ensuring geometric and photometric fidelity consistent with structured-light principles. By bridging optical metrology and robotics simulation, VIRTUS-FPP enables high-fidelity synthetic data generation, systematic evaluation of sensing pipelines, and digital twin replication of real-world FPP systems. Experimental results demonstrate sub-millimeter reconstruction accuracy and strong correspondence between simulated and physical measurements, highlighting the framework's effectiveness and its potential to advance perception-driven robotics, simulation-to-reality transfer, and scalable optical sensor design.

2508.17693 2026-06-08 cs.DB cs.AI cs.CL Updated version

Database Normalization via Dual-LLM Self-Refinement

Eunjae Jo, Nakyung Lee, Gyuyeong Kim

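Affiliations: University of California, Berkeley

AI summary: Proposes Miffie, a framework that uses a dual-model self-refinement architecture and large language models to automate database normalization without human intervention while maintaining high accuracy.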

Details
Comments
7 pages
English abstract

Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that leverages the capability of large language models. Miffie enables automated data normalization without human effort while preserving high accuracy. The core of Miffie is a dual-model self-refinement architecture that combines the best-performing models for normalized schema generation and verification, respectively. The generation module eliminates anomalies based on the feedback of the verification module until the output schema satisfies the requirement for normalization. We also carefully design task-specific zero-shot prompts to guide the models for achieving both high accuracy and cost efficiency. Experimental results show that Miffie can normalize complex database schemas while maintaining high accuracy.
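The generate-and-verify loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not Miffie's implementation: `refine`, `stub_generate`, `stub_verify`, `max_rounds`, and the feedback format are all hypothetical, and both "models" are replaced by deterministic stubs so the loop runs without any LLM access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Schema = Dict[str, List[str]]   # table name -> column names
Issue = Tuple[str, str, str]    # (table, offending column, determinant); hypothetical format


@dataclass
class Verdict:
    ok: bool
    issues: List[Issue] = field(default_factory=list)


def refine(schema: Schema,
           generate: Callable[[Schema, List[Issue]], Schema],
           verify: Callable[[Schema], Verdict],
           max_rounds: int = 5) -> Tuple[Schema, int]:
    """Alternate generation and verification until the verifier accepts."""
    issues: List[Issue] = []
    candidate = schema
    for round_no in range(1, max_rounds + 1):
        candidate = generate(candidate, issues)  # generator sees the latest feedback
        verdict = verify(candidate)
        if verdict.ok:
            return candidate, round_no
        issues = verdict.issues
    raise RuntimeError("no schema accepted within max_rounds")


def stub_generate(schema: Schema, issues: List[Issue]) -> Schema:
    # Stand-in for the generation model: moves each flagged column
    # into a separate table keyed by its determinant.
    result = {table: list(cols) for table, cols in schema.items()}
    for table, column, determinant in issues:
        result[table].remove(column)
        result.setdefault(f"{determinant}_lookup", [determinant]).append(column)
    return result


def stub_verify(schema: Schema) -> Verdict:
    # Stand-in for the verification model: flags one hard-coded
    # transitive dependency (order_id -> customer_id -> customer_name).
    issues = [
        (table, "customer_name", "customer_id")
        for table, cols in schema.items()
        if {"order_id", "customer_id", "customer_name"} <= set(cols)
    ]
    return Verdict(ok=not issues, issues=issues)


orders: Schema = {"orders": ["order_id", "customer_id", "customer_name"]}
final_schema, rounds = refine(orders, stub_generate, stub_verify)
print(final_schema, rounds)
```

In this toy run the first candidate is rejected, the verifier's feedback drives one correction, and the second candidate is accepted. The `max_rounds` cap is a design choice of the sketch to guarantee termination; the abstract does not say how the loop is bounded.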

2503.00065 2026-06-08 cs.CR cs.LG Updated version

ADAGE: Active Defenses Against GNN Extraction

Jing Xu, Franziska Boenisch, Adam Dziedzic

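Affiliations: CISPA Helmholtz Center for Information Security

AI summary: Proposes ADAGE, the first general active defense framework. By monitoring query diversity and progressively perturbing outputs, it prevents a range of GNN model stealing attacks while preserving downstream task performance.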

Details
Comments
Accepted at AsiaCCS 2026
English abstract

Graph Neural Networks (GNNs) achieve high performance in various real-world applications, such as drug discovery, traffic state prediction, and recommendation systems. The fact that building powerful GNNs requires a large amount of training data, powerful computing resources, and human expertise turns the models into lucrative targets for model stealing attacks. Prior work has revealed that the threat vector of stealing attacks against GNNs is large and diverse, as an attacker can leverage various heterogeneous signals ranging from node labels to high-dimensional node embeddings to create a local copy of the target GNN at a fraction of the original training costs. This diversity in the threat vector renders the design of effective and general defenses challenging, and existing defenses usually focus on one particular stealing setup. Additionally, they solely provide means to identify stolen model copies rather than preventing the attack. To close this gap, we propose the first general Active Defense Against GNN Extraction (ADAGE). ADAGE builds on the observation that stealing a model's full functionality requires highly diverse queries to leak its behavior across the input space. Our defense monitors this query diversity and progressively perturbs outputs as the accumulated leakage grows. In contrast to prior work, ADAGE can prevent stealing across all common attack setups. Our extensive experimental evaluation using six benchmark datasets, four GNN models, and three types of adaptive attackers shows that ADAGE penalizes attackers to the degree of rendering stealing impossible, whilst preserving predictive performance on downstream tasks. ADAGE thereby contributes towards securely sharing valuable GNNs in the future.
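The idea of tracking query diversity and scaling output perturbation with accumulated leakage can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the class name `DiversityAwareDefense`, the grid-cell diversity estimate over feature vectors in [0, 1), and the linear noise schedule are hypothetical and are not ADAGE's actual diversity measure or perturbation scheme, which the abstract does not specify.

```python
import random
from typing import List, Sequence, Set, Tuple


class DiversityAwareDefense:
    """Toy per-user defense: noise grows with the share of input space queried."""

    def __init__(self, cells_per_dim: int = 4, dims: int = 2,
                 max_noise: float = 1.0, seed: int = 0) -> None:
        self.cells_per_dim = cells_per_dim
        self.total_cells = cells_per_dim ** dims
        self.max_noise = max_noise
        self.seen: Set[Tuple[int, ...]] = set()
        self.rng = random.Random(seed)

    def _cell(self, x: Sequence[float]) -> Tuple[int, ...]:
        # Map a query with features in [0, 1) to a coarse grid cell.
        top = self.cells_per_dim - 1
        return tuple(min(int(v * self.cells_per_dim), top) for v in x)

    def leakage(self) -> float:
        # Fraction of grid cells this user has touched so far.
        return len(self.seen) / self.total_cells

    def respond(self, x: Sequence[float], probs: Sequence[float]) -> List[float]:
        # Record the query, then perturb the model output in proportion
        # to the accumulated leakage and renormalize to a distribution.
        self.seen.add(self._cell(x))
        scale = self.max_noise * self.leakage()
        noisy = [max(p + self.rng.uniform(-scale, scale), 1e-9) for p in probs]
        total = sum(noisy)
        return [p / total for p in noisy]


clean = [0.7, 0.2, 0.1]

# A benign user who keeps querying one region accumulates little leakage.
focused = DiversityAwareDefense()
for _ in range(20):
    focused_out = focused.respond([0.1, 0.1], clean)

# A user sweeping the whole input space reaches maximal leakage.
diverse = DiversityAwareDefense()
for i in range(4):
    for j in range(4):
        diverse_out = diverse.respond([i / 4 + 0.01, j / 4 + 0.01], clean)

print(focused.leakage(), diverse.leakage())
```

The sketch keeps one defense object per user so that a focused, task-specific query pattern is perturbed only slightly, while a sweep of the input space is penalized more and more heavily.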

2506.11066 2026-06-08 cs.SE cs.AI Updated version

CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, Fakhri Karray

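Affiliations: Linköping University; MBZUAI; TU Darmstadt; Shanghai Jiao Tong University; EPFL; University of Groningen; Google Tokyo; Alibaba Group; TU Munich

AI summary: Proposes CoQuIR, the first large-scale multilingual benchmark for code quality-aware retrieval, covering four dimensions: correctness, efficiency, security, and maintainability. Using fine-grained annotations and two quality-centric metrics, it evaluates 23 models, finds that top models often fail to distinguish defective code, and explores training methods that improve quality awareness.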

Details
English abstract

Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from their more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.
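The abstract names Pairwise Preference Accuracy without defining it. One plausible reading, sketched here as an assumption rather than the benchmark's definition, is the fraction of (query, higher-quality snippet, lower-quality snippet) triples in which a retriever scores the higher-quality snippet above the lower-quality one. The function names, the toy `token_overlap` scorer, and the two example triples are hypothetical.

```python
from typing import Callable, Iterable, Tuple

# (query, higher-quality snippet, lower-quality snippet)
Triple = Tuple[str, str, str]


def pairwise_preference_accuracy(triples: Iterable[Triple],
                                 score: Callable[[str, str], float]) -> float:
    """Share of triples where the better snippet gets the strictly higher score."""
    items = list(triples)
    if not items:
        raise ValueError("need at least one triple")
    wins = sum(1 for query, good, bad in items if score(query, good) > score(query, bad))
    return wins / len(items)


def token_overlap(query: str, snippet: str) -> float:
    # Quality-blind toy retriever: counts shared whitespace-separated tokens.
    return float(len(set(query.lower().split()) & set(snippet.lower().split())))


triples = [
    # The better snippet also happens to match the query lexically.
    ("sum a list",
     "total = sum(values)  # sum a list",
     "total = 0\nfor v in values: total += v"),
    # The insecure snippet matches the query lexically; the parameterized one does not.
    ("run sql query",
     'cur.execute("SELECT * FROM t WHERE id = ?", (uid,))',
     'cur.execute("SELECT * FROM t WHERE id = " + uid)  # run sql query'),
]

accuracy = pairwise_preference_accuracy(triples, token_overlap)
print(accuracy)
```

The second triple shows how a scorer driven purely by lexical relevance can rank a string-concatenated SQL call above its parameterized counterpart, which is the kind of failure a quality-aware metric is meant to expose.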

2505.17739 2026-06-08 cs.MA cs.CY cs.HC cs.RO Updated version

Feasible Action Space Reduction for Quantifying Causal Responsibility in Continuous Spatial Interactions

Ashwin George, Luciano Cavalcante Siebert, David A. Abbink, Arkady Zgonnikov

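Affiliations: Delft University of Technology

AI summary: Proposes a formulation of the FeAR metric for continuous action spaces, used to quantify the causal responsibility of agents in spatial interactions, and demonstrates its application to analysing backward-looking responsibility and estimating forward-looking responsibility.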

Details
Comments
In review
English abstract

Understanding the causal influence of one agent on another agent is crucial for safely deploying artificially intelligent systems such as automated vehicles and mobile robots into human-inhabited environments. Existing models of causal responsibility deal with simplified abstractions of scenarios with discrete actions, thus limiting real-world use when understanding responsibility in spatial interactions. Based on the assumption that spatially interacting agents are embedded in a scene and must follow an action at each instant, Feasible Action-Space Reduction (FeAR) was proposed as a metric for causal responsibility in a grid-world setting with discrete actions. Since real-world interactions involve continuous action spaces, this paper proposes a formulation of the FeAR metric for measuring causal responsibility in space-continuous interactions. We illustrate the utility of the metric in prototypical space-sharing conflicts, and showcase its applications for analysing backward-looking responsibility and in estimating forward-looking responsibility to guide agent decision making. Our results highlight the potential of the FeAR metric for designing and engineering artificial agents, as well as for assessing the responsibility of agents around humans.