arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2084
专题追踪
2605.26819 2026-05-27 cs.IR cs.AI

RAGEAR: Retrieval-Augmented Graph-Enhanced Academic Recommender

RAGEAR: 检索增强的图增强学术推荐器

Francesco Granata, Lorenzo Lamazzi, Misael Mongiovì, Francesco Poggi, Valeria Secchini

发表机构 * Department of Mathematics and Computer Science, University of Catania, Italy(卡塔尼亚大学数学与计算机科学系,意大利) Institute for Cognitive Science and Technology, National Research Council, Italy(意大利国家研究委员会认知科学与技术研究所)

AI总结 提出RAGEAR,一种神经符号推荐系统,结合密集检索和知识图谱,通过图感知聚合函数将片段级证据传播到课程级推荐,在学术课程推荐中优于元数据基线。

详情
AI中文摘要

我们提出了RAGEAR(检索增强的图增强学术推荐器),一种用于学术课程推荐的神经符号推荐系统。RAGEAR将完整讲座转录本的密集检索与符号知识图谱相结合,该图谱建模课程、课程、转录本片段、学分、学习计划和课程信息。知识图谱支持基于结构化约束(如学分、学科、学习计划和先修课程)的符号过滤和情境化。与基于元数据的方法不同,它通过检索与学生查询语义对齐的转录本片段来利用细粒度的教学内容。主要贡献是一种图感知聚合函数,它将片段级证据传播到课程级推荐。得分结合了三个因素:与课程相关的检索相似性份额、其相关片段的基于排名的强度以及证据在课程间的分布。我们通过人工评估样本和大规模基于LLM的相关性评估,在152个学生类查询上评估了RAGEAR。结果表明,讲座转录本优于仅元数据检索,并且RAGEAR进一步提高了基于转录本的归一化SumP基线的排名质量,尤其是在排名靠前的推荐中。

英文摘要

We present RAGEAR (Retrieval-Augmented Graph-Enhanced Academic Recommender), a neurosymbolic recommender system for academic course recommendation. RAGEAR combines dense retrieval over full lecture transcripts with a symbolic Knowledge Graph modelling courses, lessons, transcript chunks, credits, study plans, and curricular information. The Knowledge Graph supports symbolic filtering and contextualisation based on structured constraints, such as credits, academic disciplines, study plans, and prerequisites. Unlike metadata-based approaches, it exploits fine-grained instructional content by retrieving transcript chunks semantically aligned with a student's query. The main contribution is a graph-aware aggregation function that propagates chunk-level evidence to course-level recommendations. The score combines three factors: the share of retrieved similarity associated with a course, the rank-based strength of its relevant chunks, and the distribution of evidence across lessons. We evaluate RAGEAR on 152 student-like queries through a human evaluation sample and a large-scale LLM-based relevance assessment. Results show that lecture transcripts improve over metadata-only retrieval, and that RAGEAR further improves ranking quality over a transcript-based normalized SumP baseline, especially for top-ranked recommendations.

2605.26807 2026-05-27 cs.SE cs.AI

HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML

HTMLCure:将浏览器体验转化为面向交互式HTML的状态引导修复

Jiajun Wu, Jian Yang, Tuney Zheng, Wei Zhang, Haowen Wang, Yihang Lou, Xianglong Liu

发表机构 * Beihang University(北京航空航天大学) IQuest Research(IQuest研究院) Peking University(北京大学)

AI总结 提出HTMLCure框架,通过浏览器交互执行、状态感知诊断和闭环修复引擎,从大规模HTML页面中筛选并修复可修复页面,显著提升SFT数据质量和模型性能。

Comments 27 pages, 11 figures. Code: https://github.com/wuyuVerse/HTMLCure

详情
AI中文摘要

LLM现在可以生成完整的HTML页面,但其中许多页面仅在表面上正确:它们渲染一次,然后在滚动、悬停、点击、调整大小或游戏过程中失败。基于截图的评估可能遗漏这些失败,而过滤会丢弃许多仍然可修复的页面。我们引入了HTMLCure,一个浏览器体验框架,在系统与页面交互后评估HTML。评估器跨视口和交互状态执行页面,记录确定性的浏览器证据,并向VLM提供来自执行轨迹的精选关键帧,而非孤立截图。相同的状态信号驱动闭环修复引擎:HTMLCure诊断当前页面,选择特定状态的修复家族,再次运行每个候选页面,并导出质量清理后的页面用于SFT。在97K提示语料库上,这将直接可用的种子扩展为63703个质量清理页面的候选池,从中我们构建了最终的40K页面精炼SFT集。在相同骨干和训练方案下,HTMLCure-27B-Refined在HTMLBench-400上达到50.6分,确定性测试用例通过率为45.2%,与Kimi-K2.6和GPT-5.4等强参考行处于相同性能区间。在发布的MiniAppBench验证集上,它达到81.2的平均分,比原始27B SFT提高15.3分,接近强参考系统的水平。

英文摘要

LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HTMLCure, a browser experience framework that evaluates HTML after the system has interacted with it. The evaluator executes the page across viewports and interaction states, records deterministic browser evidence, and gives the VLM curated keyframes from the executed trajectory rather than isolated screenshots. The same state signal drives a closed loop repair engine: HTMLCure diagnoses the current page, chooses a state specific repair family, runs each candidate again, and exports quality cleared pages for SFT. On a 97K prompt corpus, this expands the directly usable seed into a candidate pool of 63703 quality cleared pages, from which we construct the final refined SFT set of 40K pages. Under the same backbone and training recipe, HTMLCure-27B-Refined reaches 50.6 on HTMLBench-400 with 45.2% deterministic test case pass, placing it in the same performance band as strong reference rows such as Kimi-K2.6 and GPT-5.4. On the released MiniAppBench validation split, it reaches 81.2 average, improving raw 27B SFT by 15.3 points and approaching the level of strong reference systems.

2605.26786 2026-05-27 cs.CY cs.AI cs.LG

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

大数据分析在糖尿病管理中的应用:卢旺达医疗系统需求评估

Silas Majyambere, Tony Lindgren, Workneh Y. Ayele, Celestin Twizere

发表机构 * University of Rwanda(卢旺达大学)

AI总结 本研究通过利益相关者研讨会评估卢旺达医疗系统采用大数据分析管理糖尿病的准备情况,并提出了一个基于可解释机器学习模型的实用框架。

详情
AI中文摘要

糖尿病是一种慢性代谢疾病,如果不及早诊断和管理,可能导致严重的健康问题。大数据分析和机器学习为分析大型健康数据集、支持早期发现和更好的治疗决策提供了实用工具。然而,它们在常规临床实践中的使用仍然有限。本研究考察了卢旺达医疗系统采用大数据分析管理糖尿病的准备情况。随着该国不断扩大电子病历和健康信息系统的使用,改善预测、监测和临床决策的新机遇随之出现。我们举办了一个为期五天的研讨会,涉及25名关键利益相关者,包括临床医生、数据管理员、政策制定者、医学研究人员、营养学家和技术提供商,以评估准备情况并识别现有差距。研究结果突出了大数据分析实施的潜力和主要挑战。基于这些结果,本文提出了一个实用的大数据分析框架,利用可解释的机器学习模型支持糖尿病管理策略。

英文摘要

Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BDA) and machine learning offer practical tools for analyzing large health datasets and supporting early detection and better treatment decisions. However, their use in routine clinical practice is still limited. This study examines the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management. As the country continues to expand its use of electronic medical records and health information systems, new opportunities arise for improving prediction, monitoring, and clinical decision-making. A five-day workshop involving 25 key stakeholders, including clinicians, data managers, policymakers, medical researchers, nutritionists, and technology providers, was conducted to assess preparedness and identify existing gaps. The findings highlight both the potential and the main challenges of BDA implementation. Based on these results, the paper proposes a practical BDA framework to support diabetes management strategies using explainable machine learning models.

2605.26769 2026-05-27 cs.CY cs.AI

Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability

生成式人工智能与高等教育中少数群体知识的边缘化:以残疾为例

Fatiha Tali-Otmani

发表机构 * Université Toulouse Jean Jaurès-UMR EFTS(图卢兹让·雅克·儒勒大学-UMR EFTS)

AI总结 研究通过教育科学、批判技术研究和残疾研究,揭示生成式人工智能如何通过以英语和西方为中心的训练数据集强化认知殖民性,导致残疾人群体的双重边缘化,并探讨研究者与机器混合以维护认知多样性的可能性及其结构性限制。

详情
AI中文摘要

生成式人工智能通过重构科学知识的生产和验证过程,重新定义了高等教育。这些系统并非中立;它们积极促进了非霸权认识论的边缘化。本研究借鉴教育科学、批判技术研究和残疾研究,证明训练数据集(主要来自英语和西方中心)强化了认知殖民性。残疾人的情况特别清晰地说明了这一现象。技术架构常常将这些个体限制在刻板的刻板印象中,或将他们排除在设计过程之外,导致双重边缘化。本文探讨了研究者与机器之间的混合是否可能维护认知多样性,同时承认当算法校正作为纯粹姑息策略时固有的结构性限制。

英文摘要

Generative artificial intelligence redefines higher education by restructuring the processes through which scientific knowledge is produced and validated. These systems are not neutral; they actively contribute to the marginalization of non-hegemonic epistemologies. This research draws upon educational sciences, critical technology studies, and disability studies to demonstrate that training datasets, which remain predominantly Anglophone and Western-centric, reinforce epistemic coloniality. The situation of persons with disabilities provides a particularly clear illustration of this phenomenon. Technological architectures frequently confine these individuals to reductive stereotypes or exclude them from the design process, leading to a double marginalization. This article examines whether a hybridization between the researcher and the machine might preserve epistemic plurality, while acknowledging the structural limitations inherent in algorithmic correction when used as a purely palliative strategy.

2605.26754 2026-05-27 cs.CR cs.AI

Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

Cordon-MAS:通过信息流控制防御 RAG 的知识投毒

Zhe Yu, Wenpeng Xing, Gaolei Li, Shuguang Xiong, Hongzhi Wang, Xuyang Teng, Meng Han

发表机构 * Zhejiang University(浙江大学) Binjiang Institute of Zhejiang University(浙江大学滨江研究院) Shanghai Jiao Tong University(上海交通大学) Zhejiang Lab(浙江实验室) Harbin Institute of Technology(哈尔滨工业大学) Hangzhou Dianzi University(杭州电子科技大学)

AI总结 针对检索增强生成(RAG)中的 Confundo 式投毒攻击,提出 Cordon-MAS 框架,通过分离证据提取、跨源审计和答案合成到具有非对称内存权限的智能体中,将攻击成功率相对降低 92.4%,将投毒问题从检测重新定义为信息流控制。

详情
AI中文摘要

检索增强生成(RAG)日益支撑着高风险应用,但仍易受到 Confundo 式投毒攻击,其中对抗性优化的文档操纵生成的输出。现有防御假设检测到中毒证据即可防止危害。我们证明这一假设不正确:模型存在监控-控制差距——它们可以检测到检索证据中的矛盾,但仍会依据中毒声明行动。我们引入 Cordon 原则——任何能够进行最终合成的智能体都不得访问不可信的自然语言证据——并通过 CORDON-MAS 实现该原则,这是一个隔离框架,通过将证据提取、跨源审计和答案合成分离到具有非对称内存权限的智能体中,在架构上强制执行该原则。在五个 BEIR 数据集上,CORDON-MAS 相对于未防御的 RAG 将攻击成功率降低了 92.4%。这将 RAG 投毒问题从检测问题重新定义为信息流控制问题。

英文摘要

Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning where adversarially optimized documents manipulate generated outputs. Existing defenses assume that detecting poisoned evidence prevents harm. We show this assumption is incorrect: models exhibit a monitoring-control gap -- they can detect contradictions in retrieved evidence yet still act on poisoned claims. We introduce the Cordon Principle -- no agent capable of final synthesis may access untrusted natural-language evidence -- and realize it through CORDON-MAS, a compartmentalized framework that enforces this principle architecturally by separating evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges. Across five BEIR datasets, CORDON-MAS reduces attack success rate by 92.4\% relative to undefended RAG. This reframes RAG poisoning from a detection problem to an information-flow control problem.

2605.26741 2026-05-27 cond-mat.mtrl-sci cs.AI

MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation

MatFormBench: 一个面向目标驱动材料配方的基准评估框架

Linhan Wu, Chenxi Wang, Chuhan Yang, Zhengwei Yang, Yuyang Liu

发表机构 * DeepVerse

AI总结 针对现有材料机器学习基准仅关注正向属性预测而缺乏逆向优化评估的问题,提出MatFormBench基准框架,集成物理驱动配方生成方案与多维度评分指标,系统评估39种逆向设计算法。

Comments 26 pages

详情
AI中文摘要

材料的逆向设计显著推进了目标驱动的配方优化,然而现有的材料机器学习基准仍局限于正向属性预测,未能系统评估逆向优化和生成算法,这一关键差距阻碍了目标驱动材料设计的进展。为解决这一局限性,我们提出了MatFormBench,一个新颖的基准评估生态系统,专门用于评估和指导目标驱动配方的生成策略。MatFormBench集成了一个物理驱动的配方生成方案,用于生成忠实模拟真实材料结构-属性响应关系的合成样本,并辅以五个递增难度级别来量化这些关系的复杂性。为了严格评估算法性能,我们进一步提出了MatFormScore,一个多维指标,全面量化五个关键轴上的性能:目标成功率、搜索效率、探索能力、鲁棒性和稳定性。我们通过评估39种不同的逆向设计算法来验证MatFormBench,涵盖经典的代理辅助黑箱搜索、最先进的深度生成模型以及日益流行的基于大语言模型(LLM)的推荐策略。在1170次标准化算法-任务评估中,基于扩散的模型展现出最强的整体性能,而基于变分自编码器(VAE)和遗传算法(GA)的方法在特定场景中表现出独特优势。通过为目标驱动材料配方建立统一的评估标准,MatFormBench实现了可重复的基准测试、原则性的算法比较和逆向设计策略的诊断分析,为推进材料逆向设计提供了基础工具。

英文摘要

Inverse design of materials has significantly advanced target-driven formulation optimization, yet existing materials machine learning benchmarks remain limited to forward property prediction, failing to systematically evaluate inverse optimization and generation algorithms, a critical gap that hinders the progress of target-driven materials design. To address this limitation, we propose MatFormBench, a novel benchmarking ecosystem tailored to evaluate and guide generative strategies for target-driven formulation. MatFormBench integrates a physics-driven formulation generation scheme to generate synthetic samples that faithfully emulate realistic materials structure-property response relationships, complemented by five escalating difficulty levels to quantify the complexity of these relationships. To rigorously assess algorithm performance, we further propose MatFormScore, a multi-dimensional metric that comprehensively quantifies performance across five critical axes: target success, search efficiency, exploratory capacity, robustness, and stability. We validate MatFormBench by evaluating 39 diverse inverse design algorithms, covering classical surrogate-assisted black-box search, state-of-the-art deep generative models, and increasingly popular Large Language Model (LLM)-based recommendation strategies. Across 1170 standardized algorithm-task evaluations, diffusion-based models demonstrate the strongest overall performance, while Variational Autoencoder (VAE)-based and Genetic Algorithm (GA)-based methods exhibit distinct advantages in specific scenarios. By establishing a unified evaluation standard for target-driven materials formulation, MatFormBench enables reproducible benchmarking, principled algorithm comparison, and diagnostic analysis of inverse design strategies, providing a foundational tool for advancing materials inverse design.

2605.26726 2026-05-27 eess.IV cs.AI cs.CV

Measuring Prediction Uncertainty in Neural Cellular Automata

神经细胞自动机中的预测不确定性测量

Ario Sadafi, Michael Deutges, Nassir Navab, Carsten Marr

发表机构 * Computational Health Center, Helmholtz Munich, Neuherberg, Germany(赫尔姆霍茨慕尼黑计算健康中心) Helmholtz AI, Helmholtz Munich, Neuherberg, Germany(赫尔姆霍茨慕尼黑人工智能研究所) Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany(慕尼黑技术大学计算机辅助医疗程序研究所) Munich Center for Machine Learning, Munich, Germany(慕尼黑机器学习中心) Department of Medicine III, Ludwig-Maximilian-University Hospital, Munich, Germany(慕尼黑路德维希-马克西米利安大学医院第三医学部) Department of Physics, University of Munich, Munich, Germany(慕尼黑大学物理系) German Cancer Consortium (DKTK), partner site Munich, Germany(德国癌症研究中心(DKTK)慕尼黑分部)

AI总结 提出一种基于动态系统收敛性的不确定性度量方法,通过扰动自动机状态并观察预测稳定性来评估神经细胞自动机在医学图像分割中的可信度。

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情
AI中文摘要

神经细胞自动机(NCA)为编码器-解码器分割网络提供了一种轻量级替代方案。然而,决定何时应信任预测可能很困难。在这里,我们研究基于NCA的医学图像分割的不确定性估计,无需修改底层架构或重新训练模型。我们的方法通过将NCA视为一个动态系统来激发,其中收敛吸引子对应于可信预测。具体地,我们提出了弹性(resilience),这是一种简单的度量,通过探测在自动机状态微小扰动下最终预测的稳定性来利用NCA固有的迭代结构。返回相同解的预测被认为是可信的,而显著变化的预测被标记为不确定。我们使用选择性预测指标($\Delta$Dice@90和AURC)和排序指标(AUROC和AUPRC)通过其预测分割质量的能力来评估不确定性。在多个医学分割基准测试中,弹性比基线更可靠地识别失败案例,提高了基于NCA模型的信任度和安全性。

英文摘要

Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ($Δ$Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.

2605.26717 2026-05-27 cs.IR cs.AI

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

L2Rec:面向个性化推荐的LLM双视图理解

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

发表机构 * Netease Cloud Music(网易云音乐)

AI总结 提出L2Rec方法,通过双视图个性化混合专家机制在参数层面统一行为与语义理解,实现端到端个性化推荐,实验证明优于现有方法。

Comments Accepted at SIGIR 2026

详情
AI中文摘要

将大型语言模型(LLM)适配于个性化推荐需要将其通用能力与用户特定偏好对齐,同时有效利用行为信号和语义信号。现有方法通常在输入层(例如,将行为嵌入注入令牌空间)或输出层(例如,独立编码器的对比对齐)整合这些信号,存在分布差距或缺乏端到端任务监督。在这项工作中,我们引入了L2Rec,它在LLM的参数层面统一了行为和语义理解。我们的关键洞察是,同一组Transformer参数可以作为两个视图的共享媒介:通过双视图个性化混合专家(DPMoE)机制应用视图特定的个性化低秩扰动,L2Rec使得单个LLM主干能够为每个用户产生互补的行为和语义适应,且表示层面的不对齐最小化。一个自适应跨视图融合模块进一步将双视图输出整合为统一的用户偏好。在四个数据集上的实验表明,L2Rec持续优于最先进的基线方法,并且在大型工业平台上的在线A/B测试验证了关键参与指标的显著改进。

英文摘要

Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specific preferences while effectively leveraging both behavioral and semantic signals. Existing approaches typically integrate these signals at either the input level (e.g., injecting behavioral embeddings into the token space) or the output level (e.g., contrastive alignment of separate encoders), suffering from distribution gaps or lack of end-to-end task supervision. In this work, we introduce L2Rec, which unifies behavioral and semantic understanding at the parameter level of LLMs. Our key insight is that the same set of Transformer parameters can serve as a shared medium for both views: by applying view-specific, personalized low-rank perturbations via a Dual-view Personalized Mixture-of-Experts (DPMoE) mechanism, L2Rec enables a single LLM backbone to produce complementary behavioral and semantic adaptations for each user with minimal representation-level misalignment. An adaptive cross-view fusion module further integrates the dual-view outputs into a unified user preference. Experiments on four datasets show that L2Rec consistently outperforms state-of-the-art baselines, and online A/B testing on a large-scale industrial platform validates significant improvements in key engagement metrics.

2605.26713 2026-05-27 stat.ML cs.LG

Transformers Can Learn Posterior Predictive Distributions In-Context

Transformer可以在上下文中学习后验预测分布

Gyeonghun Kang, Changwoo J. Lee, Xiang Cheng

发表机构 * Department of Statistical Science, Duke University, Durham, NC, USA(统计科学系,达勒姆大学,达勒姆,NC,美国) Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA(电气与计算机工程系,达勒姆大学,达勒姆,NC,美国)

AI总结 本文通过构造证明Transformer能够实现针对后验预测均值和方差的梯度下降算法,并研究其逼近后验预测分布的误差界,揭示了归一化和注意力深度对泛化能力的关键作用。

详情
AI中文摘要

先验数据拟合网络(PFN)最近已成为贝叶斯预测任务的一种强大方法,通过上下文学习近似后验预测分布(PPD)。尽管它们具有强大的实证性能和超越点预测的能力,但对Transformer在上下文中学习分布的算法能力的理论理解仍然缺乏。聚焦于高斯过程回归问题,我们通过构造证明Transformer可以实现针对后验预测均值和方差的梯度下降算法,随后通过非线性映射产生PPD的分箱概率。我们根据注意力深度和分箱分辨率研究了近似PPD的误差界。基于这些结果,我们进一步证明了归一化和注意力深度的选择在使Transformer能够超越预训练样本大小范围进行外推中的关键作用。我们进行了模拟实验,验证了我们的发现,为针对PPD的PFN的表达能力以及架构选择如何影响泛化能力提供了见解。

英文摘要

Prior-data fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, theoretical understandings of the algorithmic capability of transformers to learn distributions in context are still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.

2605.20251 2026-05-27 cs.SE cs.AI

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

ProcCtrlBench: 评估LLM编码智能体中的过程级缺陷与控制保持

Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun

发表机构 * Amap, Alibaba Group(阿里云)

AI总结 提出ProcCtrlBench基准,通过可复用的缺陷本体和标准化轨迹表示,从过程证据而非仅最终结果评估LLM编码智能体的执行质量,并引入控制保持量化执行过程的可解释性、可中断性等属性。

Comments 22 pages, 8 figures

详情
AI中文摘要

现有的LLM编码智能体基准主要评估最终结果。虽然有助于衡量整体能力,但这些指标提供的可见性有限,常常遗漏执行过程中出现的缺陷。我们提出了ProcCtrlBench,一个用于LLM编码智能体执行过程评估的基准。ProcCtrlBench将重复出现的执行缺陷组织成一个可复用的本体,涵盖4类11种缺陷类型,并通过标准化的过程证据而非仅最终结果来评估智能体轨迹。为了支持异构智能体之间的比较,ProcCtrlBench将原始日志标准化为统一的轨迹表示,并报告基于过程发现的校准评分卡。此外,ProcCtrlBench使用控制保持作为量化执行过程质量的方式,捕获执行是否保持可解释、可中断、可纠正、可逆,并在需要时能够交还控制权。我们在从三个基准(AndroidBench、TerminalBench和SWE-bench-Verified)中采样的200个案例上评估了ProcCtrlBench。结果表明,ProcCtrlBench可以以有用的可靠性实例化,提供比直接阈值化更稳定的语义,并揭示了传统基于结果的评估常常忽略的执行质量的有意义差异。

英文摘要

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution. We present ProcCtrlBench, a benchmark for execution-process evaluation in LLM coding agents. ProcCtrlBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone. To support comparison across heterogeneous agents, ProcCtrlBench standardizes raw logs into a unified trajectory representation and reports calibrated scorecards over process-level findings. In addition, ProcCtrlBench uses control preservation as a way to quantify execution-process quality, capturing whether execution remains interpretable, interruptible, correctable, reversible, and able to hand back authority when needed. We evaluate ProcCtrlBench on 200 cases sampled from three benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Results show that ProcCtrlBench can be instantiated with useful reliability, provides more stable semantics than direct thresholding, and reveals meaningful differences in execution quality that are often overlooked by conventional outcome-based evaluation.

2605.04932 2026-05-27 stat.ML cs.LG

Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

协变量漂移下部署风险的雅可比-速度界

Jonathan R. Landers

发表机构 * Independent Researcher(独立研究者)

AI总结 针对动态协变量漂移下冻结预测器的长期部署风险,提出基于时域庞加莱不等式和雅可比-速度定理的路径控制方法,并设计漂移对齐切线正则化(DTR)以降低风险波动。

Comments 8 pages, 4 figures, 4 tables

详情
AI中文摘要

我们研究了动态协变量漂移下冻结预测器的长期部署问题。时域庞加莱不等式首先将时间风险波动降低为导数能量。然后,雅可比-速度定理提供了相应的路径控制。在明确的规则性和支配假设下,该定理将沿部署路径的方向切线能量识别为控制量。在低秩漂移下,该量减少为漂移子空间中的方向雅可比能量,从而激发了漂移对齐切线正则化(DTR)和匹配的监测代理。DTR不是各向同性地平滑网络,而是仅沿估计的漂移方向惩罚敏感性。我们通过四个实验验证了从定理到方法的流程:一个用于时域不等式的合成基准,一个与各向同性雅可比正则化对比的受控合成实验,以及在UCI空气质量数据集和Tetouan电力消耗数据集上的两个冻结部署研究。DTR在受控低秩区域降低了风险波动和方向增益,并优于各向同性平滑。它还在两个真实数据集上给出了验证选择的部署增益,其中空气质量子空间是从目标正交传感器运动估计的。适度的漂移子空间错误指定是可容忍的,而正交错误指定则基本消除了收益。

英文摘要

We study long-horizon deployment of a frozen predictor under dynamic covariate shift. A time-domain Poincare inequality first reduces temporal risk volatility to derivative energy. A Jacobian-velocity theorem then supplies the corresponding pathwise control. Given explicit regularity and domination assumptions, the theorem identifies directional tangent energy along the deployment path as the governing quantity. Under low-rank drift, that quantity reduces to directional Jacobian energy in the drift subspace, motivating drift-aligned tangent regularization (DTR) and a matched monitoring proxy. Rather than smoothing the network isotropically, DTR penalizes sensitivity only along estimated drift directions. We validate the theorem-to-method pipeline in four experiments: a synthetic benchmark for the time-domain inequality, a controlled synthetic comparison against isotropic Jacobian regularization, and two frozen-deployment studies on the UCI Air Quality and Tetouan power-consumption datasets. DTR reduces risk volatility and directional gain in the controlled low-rank regime and beats isotropic smoothing there. It also gives validation-selected deployment gains on both real datasets, with the Air Quality subspace estimated from target-orthogonal sensor motion. Moderate drift-subspace misspecification is tolerable while orthogonal misspecification largely removes the benefit.

2605.03309 2026-05-27 cs.CR cs.AI cs.SE

Cryptographic Registry Provenance: Structural Defense Against Dependency Confusion in AI Package Ecosystems

加密注册表溯源:针对AI包生态系统中依赖混淆的结构性防御

Alan L. McCann

发表机构 * Mashin, Inc.(Mashin公司)

AI总结 提出加密注册表溯源系统,通过注册表身份签名、双重签名模型和权威命名空间绑定三层结构防御依赖混淆攻击。

Comments 15 pages, 1 figure, 4 tables. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

详情
AI中文摘要

依赖混淆攻击利用了软件分发中的结构性缺陷:一旦包被安装,就没有加密证据证明是哪个注册表分发的。所有现有防御都是基于配置的,并且在配置错误时会静默失败。我们提出一个加密分发溯源系统,包含三个组件:(1) 加密注册表身份,每个注册表持有一个Ed25519密钥对,并对其分发的每个工件进行签名;(2) 双重签名模型,发布者在打包时签名,注册表在发布时副署;(3) 权威命名空间绑定,消费者固定注册表指纹,解析器从加密上拒绝来自未授权注册表的工件。这些创建了三层防御,需要同时攻破才能成功攻击。对八个生态系统(npm、Cargo、Hex.pm、PyPI、Go模块、Docker/OCI、NuGet、Maven)的比较显示,没有现有生态系统结合了强制发布者签名、加密注册表身份、强制注册表副署和消费者端加密执行。该系统扩展到AI生成溯源作为签名属性,以及治理强制依赖解析。一个案例研究将分发溯源与三层运行时治理架构集成,创建了一个无加密间隙的四阶段生命周期链。

英文摘要

Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proof of which registry distributed it. Every existing defense is configuration-based and fails silently when misconfigured. We present a cryptographic distribution provenance system comprising three components: (1) cryptographic registry identity, where every registry holds an Ed25519 keypair and signs every artifact it distributes; (2) a dual-signature model, where the publisher signs at packaging time and the registry countersigns at publication time; and (3) authoritative namespace binding, where consumers pin registry fingerprints and the resolver cryptographically rejects artifacts from unauthorized registries. These create three defense layers requiring simultaneous compromise for a successful attack. A comparison across eight ecosystems (npm, Cargo, Hex.pm, PyPI, Go modules, Docker/OCI, NuGet, Maven) shows no existing ecosystem combines mandatory publisher signing, cryptographic registry identity, mandatory registry countersigning, and consumer-side cryptographic enforcement. The system extends to AI-generation provenance as a signed attribute and governance-enforced dependency resolution. A case study integrates distribution provenance with a three-layer runtime governance architecture, creating a four-phase lifecycle chain with no cryptographic gaps.

2605.02958 2026-05-27 cs.CR cs.AI cs.CL cs.LG

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

追踪拒绝的动态:利用潜在拒绝轨迹进行鲁棒越狱检测

Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

发表机构 * Peking University(北京大学) Nanyang Technological University(南洋理工大学) Beijing Jiaotong University(北京交通大学)

AI总结 通过因果追踪识别出稀疏的“拒绝轨迹”激活模式,并提出轻量级白盒检测器SALO,基于隐藏状态窗口实现鲁棒越狱检测。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

详情
AI中文摘要

表征工程分析通常使用从终端或池化表示中提取的静态方向来描述拒绝。我们质疑这种观点是否忽略了拒绝是如何在层-标记位置上构建的。通过因果追踪,我们识别出一个 extit{拒绝轨迹}:一种稀疏的上游激活模式,即使当诸如GCG的攻击抑制终端拒绝信号时,该模式也常常持续存在。基于这一观察,我们提出了SALO(稀疏激活定位算子),一种轻量级白盒检测器,它在选定层窗口的原始隐藏状态体积上操作。在Qwen、Llama和Mistral模型上,SALO在固定的XSTest校准工作点下,改进了多个攻击家族的越狱检测。我们进一步分析了静态RepE风格基线、ROI敏感性、自适应GCG攻击和编码输入边界情况,阐明了拒绝轨迹监测的前景和局限性。

英文摘要

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a \textit{Refusal Trajectory}: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.

2605.01037 2026-05-27 cs.CR cs.AI cs.PL

Certified Purity for Cognitive Workflow Executors: From Static Analysis to Cryptographic Attestation

认知工作流执行器的认证纯度:从静态分析到密码学证明

Alan L. McCann

发表机构 * Mashin, Inc.(Mashin公司)

AI总结 提出一种认证纯度架构,通过WebAssembly编译、密码签名证书和运行时验证门,将认知工作流系统中的治理执行从运行时约定转变为结构性能力边界,消除BEAM虚拟机上的对抗性绕过。

Comments 23 pages, 4 figures, 8 tables. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

详情
AI中文摘要

我们提出了一种认证纯度架构,将认知工作流系统中的治理执行从运行时约定转变为结构性能力边界。先前的三层治理架构证明了治理完备性、来源完备性以及不可治理效应的不可能性,这依赖于纯模块约束:即步骤执行器不能执行效应。该约束通过模块导入图分析来执行,但不足以对抗BEAM虚拟机上的对抗性绕过。本文通过四种机制弥补了这一差距:(1)受限的WebAssembly编译目标,其中产生效应的指令在结构上缺失;(2)纯度证书,即密码学签名的证明,将执行器二进制文件与其导入分类绑定;(3)运行时验证门,在未认证执行器进入治理管道之前拒绝它们;以及(4)通过远程证明实现跨组织验证的可移植治理凭证。我们证明了四个定理:结构性纯度由构造保证,所有五种BEAM绕过类别的绕过消除,证书完整性,以及门完备性。该保证相对于显式的可信计算基成立。在四个已实现执行器上的评估显示,验证延迟为39-42微秒,完整计划周期低于400微秒,运行时开销低于100毫秒HTTP请求的0.4%,并且重复调用之间零确定性分歧。

英文摘要

We present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention into a structural capability boundary. A prior three-layer governance architecture proves governance completeness, provenance completeness, and the impossibility of ungoverned effects, conditional on the pure module constraint: that step executors cannot perform effects. That constraint was enforced by module import graph analysis, which is insufficient against adversarial bypass on the BEAM virtual machine. This paper closes the gap through four mechanisms: (1) a restricted WebAssembly compilation target where effect-producing instructions are structurally absent; (2) purity certificates, cryptographically signed proofs binding executor binaries to their import classifications; (3) a runtime verification gate that rejects uncertified executors before they enter the governance pipeline; and (4) portable governance credentials via remote attestation for cross-organizational verification. We prove four theorems: structural purity by construction, bypass elimination for all five BEAM bypass classes, certificate integrity, and gate completeness. The guarantee holds relative to an explicit Trusted Computing Base. Evaluation on four implemented executors shows verification latency of 39--42 us, full plan cycle under 400 us, runtime overhead under 0.4% of a 100 ms HTTP request, and zero determinism divergences across repeated invocations.

2605.26679 2026-05-27 cs.CR cs.AI

Certified Causal Attribution for Real-Time Attack Forensics in 6G Network Slicing

面向6G网络切片的实时攻击取证的可认证因果归因

Minh K. Quan, Pubudu N. Pathirana

发表机构 * School of Engineering, Deakin University(德肯大学工程学院)

AI总结 提出DA-GC框架,结合资源条件格兰杰因果与资源争用模型,在6G网络切片中实现亚100毫秒内的高精度跨切片攻击归因,并提供了完整的正式认证栈。

Comments IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY

详情
AI中文摘要

6G网络中的跨切片攻击归因需要在100毫秒内通过共享基础设施识别因果传播链。现有方法在满足严格SLA时难以保持准确性,因为共享资源争用会产生虚假相关性,在标准格兰杰检验下与真实因果链接难以区分。我们提出DA-GC,一个可认证的因果归因框架,将资源条件格兰杰因果与公理推导的资源争用模型(RCM)相结合,系统性地阻断资源介导的混杂。在包含15个切片的生产仿真6G测试平台和1100个攻击场景中,DA-GC在87毫秒内实现了89.2%的归因准确率。相比最强基线,准确率提升7.9个百分点,延迟降低2.7倍,同时展示了跨拓扑泛化和概念漂移鲁棒性。关键的是,DA-GC拥有全面的正式认证栈。我们提供了数学证明的有效性证书,保证在序列依赖遥测和分段平稳性下的统计可靠性。此外,我们建立了严格的安全边界,包括对抗性利用欺骗的崩溃点δ* ≈ 0.95,并定义了可证明隐私和鲁棒部署所需的最小差分隐私噪声。

英文摘要

Cross-slice attack attribution in 6G networks requires identifying causal propagation chains through shared infrastructure in under 100 ms. Existing methods struggle to satisfy this strict SLA without sacrificing accuracy, because shared resource contention creates spurious correlations that are indistinguishable from genuine causal links under standard Granger tests. We propose DA-GC, a certified causal attribution framework that integrates resource-conditioned Granger causality with an axiomatically derived Resource Contention Model (RCM) to systematically block resource-mediated confounding. On a 15-slice production-emulation 6G testbed with 1,100 attack scenarios, DA-GC achieves 89.2% attribution accuracy at 87 ms. This represents a 7.9 percentage-point improvement over the strongest baseline at 2.7x lower latency, alongside demonstrated cross-topology generalization and concept-drift resilience. Crucially, DA-GC is backed by a comprehensive formal certification stack. We provide mathematically proven validity certificates for statistical soundness under serially dependent telemetry and piecewise-stationarity. Furthermore, we establish strict security bounds, including an adversarial utilization spoofing breakdown point of $δ^* \approx 0.95$, and define the minimum differential-privacy noise required for a provably private and robust deployment.

2605.26675 2026-05-27 stat.ML cs.LG

CART Random Forests as Sequential Allocation over Random Opportunity Sets: A Stochastic-Control Theory of Ensemble Risk

CART随机森林作为随机机会集上的序贯分配:集成风险的随机控制理论

Tianxing Mei, Yingying Fan, Mingming Leng, Jinchi Lv

发表机构 * Faculty of Business, Lingnan University(岭南大学商学院) Data Sciences and Operations Department, University of Southern California(南加州大学数据科学与运营部门)

AI总结 本文从随机控制视角将CART随机森林建模为随机机会集上的序贯分配过程,通过分离特征子采样和信息分裂策略两个设计杠杆,揭示了森林均方误差的构成,并证明了CART策略的局部稳定性与全局次优性。

Comments 69 pages, 1 figure

详情
AI中文摘要

CART随机森林是最广泛使用的现代预测方法之一,具有充分记录的经验成功。然而,在机制层面,由于其复杂性,该算法通常被视为黑箱。在本文中,我们发展了特征子采样CART随机森林的随机控制视角,称为CART随机机会集分配(CART-ROSA)。在每个节点,特征的随机子集被解释为随机可行动作集,CART分裂规则被解释为掩码动作分配策略。该策略在信息性分裂计数状态上诱导出一个受控的随机过程,其终末分布决定了森林均方误差(MSE)中的单棵树误差和树间交互项。这种表示通过分离两个设计杠杆——特征子采样引起的信息性机会率和掩码内分裂策略的收缩强度——打开了CART森林的黑箱。我们证明CART策略是局部稳定的:它收缩了信息性分裂分配中的不平衡,并集中了终末树的几何结构。然而,在系统层面,它对森林目标可能是全局次优的。针对线性模型,我们显式推导了MSE风险展开。我们的结果表明,运筹学视角如何使从CART森林的标准算法描述难以触及的理论缺口变得可处理。

英文摘要

CART random forests are among the most widely used modern predictive methods, with well-documented empirical success. Yet, at the mechanistic level, the algorithm is often treated as a black box because of its complexity. In this paper, we develop a stochastic-control perspective on feature-subsampled CART random forests, named CART random opportunity-set allocation (CART-ROSA). At each node, the random subset of features is interpreted as a random feasible action set, and the CART split rule as a masked-action allocation policy. This policy induces a controlled stochastic process over informative split-count states, whose terminal law determines both single-tree error and cross-tree interaction terms in the forest mean squared error (MSE). Such representation opens the black box of CART-forests by separating two design levers: the informative-opportunity rate induced by feature subsampling, and the contraction strength from the within-mask split policy. We establish that the CART policy is locally stabilizing: it contracts imbalances in informative split allocations and concentrates terminal tree geometry. At the system level, however, it can be globally suboptimal for the forest objective. Specializing to the linear model, we derive the MSE risk expansion explicitly. Our results show how an operations-research perspective makes tractable a theoretical gap difficult to access from the standard algorithmic description of CART forests.

2605.26640 2026-05-27 eess.SY cs.LG cs.SY math.OC stat.ML

Sample Complexity of Policy Gradient for Log-Growth Control

对数增长控制的策略梯度样本复杂度

Qiuhua Pan, Yukai Shen, Liwei Zhang, Cailian Chen, Xinping Guan

发表机构 * State Key Laboratory of Submarine Geoscience, School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(submarine 地球科学国家重点实验室,自动化与智能感知学院,上海交通大学) Key Laboratory of System Control and Information Processing, Ministry of Education of China(系统控制与信息处理国家重点实验室,中华人民共和国教育部) Shanghai Key Laboratory of Perception and Control in Industrial Network Systems(上海工业网络系统感知与控制重点实验室) Paris Elite Institute of Technology, Shanghai Jiao Tong University(巴黎精英理工学院,上海交通大学)

AI总结 针对乘性噪声驱动标量线性系统的对数增长控制问题,利用奇点对称性消除梯度估计发散,证明了策略梯度的样本复杂度。

Comments 43 pages, 4 figures, 2 tables; includes supplementary material

详情
AI中文摘要

我们研究了策略梯度在对数增长控制中的样本复杂度——即从观测到的状态转移中学习一个反馈增益,该增益能够最优稳定一个通过乘性噪声驱动通道的标量线性系统。目标函数 $J(K) = \mathbb{E}[\log|1+BK|]$ 是闭环系统的顶部李雅普诺夫指数。该问题存在一个我们称为尖点障碍的结构性困难:最优增益 $K^*$ 总是将噪声奇点 $b_{\rm sing}(K) = -1/K$ 置于支撑集内部。在这个奇异最优处,策略梯度仅作为柯西主值存在,而非勒贝格积分,且自然的单样本梯度估计量具有无穷方差。因此,标准的一阶随机优化分析在最优处不适用,仅对目标函数进行平滑处理无法解决这一困难。然而,该障碍具有可利用的对称性:柯西核是关于移动极点位移的奇函数,因此将每个观测值与其关于极点的反射配对可以抵消发散部分。这一抵消同时控制了总体曲率、梯度估计量方差以及估计噪声密度时产生的偏差。结合这些界与一个闭式单转移梯度预言,我们证明:当噪声密度已知时,投影小批量策略梯度(初始化于稳定区域的任意紧子集内)的总样本复杂度为 $\tilde{O}(1/\eta)$;当噪声密度需估计时,对于 $C^s$ 噪声密度($s \geq 2$),样本复杂度为 $\tilde{O}(\eta^{-(2s+1)/(2s)})$。

英文摘要

We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/η)$ when the noise density is known and $\tilde{O}(η^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.

2605.26627 2026-05-27 eess.SY cs.RO cs.SY

Breaking the Epistemic Trap: Active Perception Under Compound Uncertainty

打破认知陷阱:复合不确定性下的主动感知

Chayan Banerjee, Ethan Goan

发表机构 * School of Electrical Engineering and Robotics(电气工程与机器人学学院)

AI总结 针对强化学习在安全关键领域中因状态-动力学耦合不确定性导致的失败,提出基于互信息的复合不确定性系数和主动信息寻求策略的适应性安全架构。

详情
AI中文摘要

在安全关键领域部署强化学习,从自动驾驶到医疗决策支持,受到系统遇到不熟悉条件时出现的失败的限制。我们认为,根本瓶颈不是单个挑战,如变化的动力学或不完整的观测,而是它们的协同交互,我们称之为认知陷阱:代理无法在不知道系统动力学的情况下估计其状态,也无法在没有准确状态信息的情况下学习动力学。在模拟运动中的概念验证实验表明,结合这些不确定性导致的失败远严重于单独挑战,性能下降77%,而单独效应相加为46%,展示了传统方法忽略的复合失败模式。这些方法采用被动的认知立场,无法解决这种耦合的不确定性。我们提出将安全重新定义为信息问题,引入一个适应性安全架构,围绕三个贡献构建:复合不确定性系数(κ),一种基于互信息的度量,量化状态-动力学耦合,可在线上计算而无需完整的联合信念推断;由MaxInfoRL目标驱动的信息寻求策略,主动探测系统动力学;以及随认知耦合上升而收紧的机制自适应安全约束。这种范式转变,从被动鲁棒性到主动感知,为在不确定性下运行、识别自身无知并战略性地采取行动解决它的决策系统提供了原则性路径。

英文摘要

Deploying reinforcement learning in safety critical domains, from autonomous vehicles to medical decision support, is constrained by failures arising when systems encounter unfamiliar conditions. We argue that the fundamental bottleneck is not individual challenges like changing dynamics or incomplete observations, but their synergistic interaction, which we term the Epistemic Trap: agents cannot estimate their state without knowing system dynamics, nor learn dynamics without accurate state information. Proof-of-concept experiments in simulated locomotion reveal that combining these uncertainties causes failures far worse than either challenge alone, a 77% performance degradation against the 46% by adding the individual effects, demonstrating compounding failure modes that conventional methods overlook. Such approaches adopt a passive epistemic stance that cannot resolve this coupled uncertainty. We propose reframing safety as an information problem, introducing an Adaptive Safety Architecture built around three contributions: the Compound Uncertainty Coefficient ($κ$), a mutual information based metric that quantifies state dynamics coupling and is computable online without full joint belief inference; information seeking policies governed by a MaxInfoRL objective that actively probe system dynamics; and regime-adaptive safety constraints that tighten as epistemic coupling rises. This paradigm shift, from passive robustness to active perception, offers a principled path toward decision making systems that operate under uncertainty, recognize their own ignorance, and act strategically to resolve it.

2605.26577 2026-05-27 eess.SY cs.AI cs.LG cs.SY math.OC

Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial

桥接控制与神经网络验证器 alpha-beta-CROWN:教程

Haoyu Li, Xiangru Zhong, Hao Cheng, Bin Hu, Huan Zhang

发表机构 * Department of Computer Science(计算机科学系) Department of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 本教程提出一个统一框架,通过将控制问题与神经网络验证器 α,β-CROWN 桥接,实现控制器属性的可扩展形式验证。

Comments ACC 2026 Tutorial

详情
AI中文摘要

基于学习的控制器合成方法因其高表达力和强经验性能而受到欢迎。然而,在自动驾驶、机器人技术和电力系统等安全关键场景中,仅凭经验性能是不够的,对控制器的稳定性、安全性等属性进行形式验证是非常可取的。不幸的是,许多先前的验证方法要么依赖于系统或证书的特定结构假设,难以在不同设置间迁移,要么在高维神经网络系统上可扩展性差。在本教程中,我们提出了一个统一框架,旨在通过将控制与最先进的神经网络验证器 $α,\!β$-CROWN(alpha-beta-CROWN)桥接来弥合这一差距。其核心是,$α,\!β$-CROWN 是一个通用的边界引擎,用于表示为计算图的非线性函数:给定一个输入域,它可以产生认证边界和非线性函数的显式线性松弛。这些认证边界本身对于可达性分析等任务很有用,并且它们为执行可满足性检查和优化的更复杂例程提供了基础。更具体地说,许多控制问题归结为验证状态域上的实值不等式(例如,李雅普诺夫理论)。因此,$α,\!β$-CROWN 通过计算紧边界并基于边界递归划分和剪枝子域,实现了这些条件的可扩展验证。得益于 GPU 并行化,该流程在对传统方法具有挑战性的验证和优化问题上展示了卓越的可扩展性。在本教程中,我们讨论了 $α,\!β$-CROWN 的基础知识,并介绍了其在各种控制相关任务中的应用。

英文摘要

Learning-based methods for synthesizing controllers have gained popularity due to their high expressiveness and strong empirical performance. However, in safety-critical scenarios such as autonomous driving, robotics, and power systems, empirical performance alone is insufficient, and formal verification of controller properties such as stability and safety is highly desirable. Unfortunately, many prior verification approaches are either tied to specific structural assumptions on the system or the certificate, making them difficult to transfer across settings, or suffer from poor scalability on higher-dimensional neural network systems. In this tutorial, we present a unified framework that aims to mitigate this gap via bridging control with the state-of-the-art neural network verifier $α,\!β$-CROWN (alpha-beta-CROWN). At its core, $α,\!β$-CROWN is a general-purpose bounding engine for nonlinear functions represented as computation graphs: given an input domain, it can produce certified bounds and explicit linear relaxation of the nonlinear function. These certified bounds are useful on their own for tasks such as reachability analysis, and they also provide the foundation for more complex routines that perform satisfiability checking and optimization. More specifically, many control problems reduce to verifying real-valued inequalities over a state domain (e.g., Lyapunov theory). Consequently, $α,\!β$-CROWN enables scalable verification of such conditions by computing tight bounds and recursively partitioning and pruning subdomains based on the bounds. Thanks to GPU parallelization, this pipeline demonstrates superior scalability on verification and optimization problems that are challenging for traditional approaches. In this tutorial, we discuss the basics of $α,\!β$-CROWN and introduce its application to various control-related tasks.

2605.26548 2026-05-27 cs.CR cs.LG

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro:语言模型能解决长周期软件安全任务吗?

Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang, Chunqiu Steven Xia, Lingming Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出SEC-bench Pro基准,通过三阶段流程构建包含V8和SpiderMonkey共183个已验证漏洞的测试集,评估前沿语言模型在长周期漏洞狩猎任务中的表现,发现最高成功率仅48.8%。

详情
AI中文摘要

大型语言模型(LLM)现已支持自动化软件安全任务,包括漏洞发现和概念验证(PoC)生成。现有基准因依赖模糊测试工具、目标特定描述或漏洞复现任务,未能真实评估LLM在实际漏洞狩猎场景中的表现。我们提出SEC-bench Pro,一个用于衡量智能体在关键、高复杂度软件系统上进行漏洞狩猎的基准。本工作通过一个三阶段流程(漏洞收集、环境重建和基于oracle的验证)披露带有具体PoC输入的报告,并将修复链接到可复现任务中。我们用V8和SpiderMonkey上的183个已验证漏洞实例化SEC-bench Pro,其中包括一个V8子集,其累计Google漏洞奖励计划奖金超过150万美元。这些实例涵盖浏览器级和运行时级执行条件下的内存安全、沙箱、JIT和竞态条件漏洞。我们的评估表明,使用前沿模型的编码智能体在两个引擎上的成功率均低于40%。开放权重的Kimi-K2.6基线在V8上达到11.7%,而最强前沿配置在V8上达到32.0%,在SpiderMonkey上达到38.8%。ClaudeCode和Codex解决了互补的实例集,它们的双智能体联合在V8上达到37.9%,在SpiderMonkey上达到48.8%。SEC-bench Pro为评估基于LLM的安全智能体提供了稳健的环境,并揭示了长周期漏洞狩猎任务中的局限性。

英文摘要

Large language models (LLMs) now support automated software security tasks, including vulnerability discovery and proof-of-concept (PoC) generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug hunting scenarios because they rely on fuzzing harnesses, target-specific descriptions, or vulnerability-reproduction tasks. We present SEC-bench Pro, a benchmark for measuring agent bug hunting on critical, high-complexity software systems. This work discloses reports with concrete PoC inputs and links fixes into reproducible tasks through a three-phase pipeline for vulnerability collection, environment reconstruction, and oracle-based validation. We instantiate SEC-bench Pro with 183 validated vulnerabilities across V8 and SpiderMonkey, including a V8 subset with more than $1.5 million in cumulative Google Vulnerability Reward Program awards. These instances span memory-safety, sandbox, JIT, and race-condition bugs under browser-grade and runtime-grade execution conditions. Our evaluation shows that coding agents with frontier models remain below 40% success on both evaluated engines. The open-weight Kimi-K2.6 baseline reaches 11.7% on V8, while the strongest frontier configuration reaches 32.0% on V8 and 38.8% on SpiderMonkey. ClaudeCode and Codex solve complementary instance sets, and their two-agent union reaches 37.9% on V8 and 48.8% on SpiderMonkey. SEC-bench Pro provides robust environments for assessing LLM-based security agents and exposes limitations in long-horizon bug hunting tasks.

2605.26542 2026-05-27 cs.CR cs.AI

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

ChainCaps: 通过单调能力衰减实现组合安全的工具使用智能体

Xiaochong Jiang, Shiqi Yang, Ziwei Li, Lifei Liu, Haoran Yu, Yichen Liu

发表机构 * Independent Researcher, Seattle, WA, USA(华盛顿州塞勒姆独立研究员) Independent Researcher, New York City, NY, USA(纽约市纽约独立研究员) King Abdullah University of Science and Technology(国王阿卜杜勒阿齐兹科学技术大学)

AI总结 针对工具组合中的权限洗钱漏洞,提出ChainCaps机制,通过运行时能力预算交集传播规则,在不修改智能体或工具服务器的情况下,将攻击成功率从25-68%降至0-4.8%,同时保持96-100%的良性任务完成率。

Comments Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

详情
AI中文摘要

工具使用智能体越来越多地在开放式部署环境中运行,它们会在运行时组合文件系统、Web API、代码解释器和企业服务。这造成了工具组合中的安全缺口:智能体可以通过每个工具的权限检查,但仍然产生不安全的端到端效果,例如读取机密文档、总结并将其发送到外部端点。我们将这种失败模式称为权限洗钱。ChainCaps通过一个运行时规则解决这一问题:每个值携带一个特定于接收器的能力预算,工具组合通过交集传播预算。一个值在通过工具链时可能保留或失去权限,但无法通过组合获得新权限。我们将ChainCaps实现为一个透明的MCP代理,无需对智能体或工具服务器进行任何更改。在来自三个提供商的五个前沿模型的82个任务上,ChainCaps将攻击成功率从25-68%降低到0-4.8%,同时保留了96-100%的良性完成率。在重放实验中,它也优于标量IFC和每函数隔离基线。清单质量是主要的部署瓶颈:专家清单达到100%的攻击阻止,而朴素清单则降至27.3%。我们的主张仅限于在可信清单和代理可见数据移动下的显式流组合安全性,这是当前部署的工具使用智能体中的一个实际缺口。

英文摘要

Tool-using agents increasingly operate in open-ended deployment environments, where they compose file systems, web APIs, code interpreters, and enterprise services at runtime. This creates a safety gap in tool composition: an agent can satisfy every per-tool permission check and still produce an unsafe end-to-end effect, such as reading a confidential document, summarizing it, and sending the summary to an external endpoint. We call this failure mode permission laundering. ChainCaps addresses it with a runtime rule: every value carries a sink-specific capability budget, and tool composition propagates budgets by intersection. A value can preserve or lose authority as it moves through a tool chain, but it cannot gain new authority through composition. We implement ChainCaps as a transparent MCP proxy that requires no changes to the agent or tool servers. On 82 tasks across five frontier models from three providers, ChainCaps reduces attack success rate from 25-68% to 0-4.8% while preserving 96-100% benign completion. In replay experiments, it also outperforms scalar-IFC and per-function-isolation baselines. Manifest quality is the dominant deployment bottleneck: expert manifests reach 100% attack blocking, while naive manifests fall to 27.3%. Our claims are limited to explicit-flow composition safety under trusted manifests and proxy-visible data movement, a practical gap in deployed tool-using agents today.

2605.26540 2026-05-27 physics.chem-ph cs.AI

DGLD: Domain-Gated Latent Diffusion for the Discovery of Novel Energetic Materials

DGLD: 用于发现新型含能材料的域门控潜在扩散

Yehudit Aperstein, Alexander Apartsin

发表机构 * Department of Intelligent Systems, Afeka Tel -Aviv College of Engineering(智能系统系,阿费卡特拉维工程学院) School of Computer Science, Faculty of Sciences, HIT -Holon Institute of Technology(计算机科学学院,科学学院,希伯来理工学院)

AI总结 提出域门控潜在扩散模型(DGLD),通过标签质量门控、多任务评分引导和四阶段化学验证漏斗,从稀疏标记数据中生成12个经DFT验证的新型含能材料候选物,其中领先化合物L1和E1在爆速和结构新颖性上表现优异。

Comments 49 pages, 25 figures

详情
AI中文摘要

含能材料的性能提升直接转化为推进剂质量减少、弹头小型化以及更高效的民用气体发生器,然而十五年来没有新的HMX类化合物被公开。设计这样一个化合物是一个稀疏标签问题:在约6.6万个标记的CHNO分子中,只有约3000个带有实验或DFT质量测量值,而在完整混合物上训练的朴素生成模型要么记忆高性能尾部,要么在无校准的情况下外推。我们引入了域门控潜在扩散(DGLD):训练时的标签质量门控、采样时的多任务评分模型引导,以及一个以第一性原理DFT审计结束的四阶段化学验证漏斗。结果是12个经DFT确认的新候选物。主打化合物3,4,5-三硝基-1,2-异噁唑(L1)达到ρ_cal=2.09 g/cm³和D_K-J,cal=8.25 km/s,且与所有65980个训练分子结构不同(最近邻Tanimoto系数0.27)。另一个主打候选物E1(4-硝基-1,2,3,5-氧杂三唑)在标定爆速(D_K-J,cal=9.00 km/s)上超过L1,且其化学型家族与L1的不相交。DGLD是唯一在DFT级别上落入生产性象限(同时新颖且达标)的方法。SMILES-LSTM精确记忆了其18.3%的输出;SELFIES-GA的最佳新颖候选物在DFT审计下损失了3.5 km/s;REINVENT 4生成了新颖的高氮杂环,但峰值仅为D=9.02 km/s。代码、检查点和918个挖掘的硬负样本已在Zenodo上发布(DOI 10.5281/zenodo.19821953);下一个进入HMX类能带的化合物可以以几个GPU天的成本被发现、验证并推荐合成。

英文摘要

Energetic-materials performance gains translate directly into reduced propellant mass, smaller warheads, and more efficient civilian gas-generators, yet no new HMX-class compound has been disclosed in fifteen years. Designing one is a sparse-label problem: of ~66 k labelled CHNO molecules only ~3 k carry experimental or DFT-quality measurements, and naive generative models trained on the full mixture either memorise the high-performance tail or extrapolate without calibration. We introduce Domain-Gated Latent Diffusion (DGLD): a label-quality gate at training time, multi-task score-model guidance at sample time, and a four-stage chemistry-validation funnel ending in first-principles DFT audit. The result is 12 DFT-confirmed novel leads. The headline compound, 3,4,5-trinitro-1,2-isoxazole (L1), reaches \r{ho}_"cal" =2.09 g/cm3 and D_"K-J,cal" =8.25 km/s and is structurally dissimilar from all 65 980 training molecules (nearest-neighbour Tanimoto 0.27). A co-headline lead, E1 (4-nitro-1,2,3,5-oxatriazole), exceeds L1 on calibrated detonation velocity (D_"K-J,cal" =9.00 km/s) from a chemotype family disjoint from L1's. DGLD is the only method to land in the productive quadrant (simultaneously novel and on-target) at DFT level. SMILES-LSTM memorises 18.3% of its outputs exactly; SELFIES-GA's best novel candidate loses 3.5 km/s under DFT audit; REINVENT 4 generates novel high-N heterocycles but peaks at D=9.02 km/s. Code, checkpoints, and 918 mined hard negatives are released on Zenodo (DOI 10.5281/zenodo.19821953); the next compound to enter the HMX-class band can be discovered, validated, and recommended for synthesis at the cost of a few GPU-days.

2605.26523 2026-05-27 cs.DC cs.AI cs.LG

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit: 通过不确定性引导的自适应分割实现连续音频表示学习

Minh K. Quan, Pubudu N. Pathirana

发表机构 * School of Engineering, Deakin University(德肯大学工程学院)

AI总结 提出StreamSplit框架,通过分布式的混合损失和强化学习策略实现边缘设备上的流式对比学习,在降低延迟、带宽和能耗的同时保持高精度。

Comments Accepted at ACM MobiSys 2026

详情
AI中文摘要

大批量对比学习(CL)是现代表示学习的基础,但与边缘设备波动的资源约束根本不相容。这种冲突造成了一个困境:设备上的小批量会降低模型保真度,而将计算卸载到云端则会导致不可接受的延迟和带宽成本。现有解决方案通常采用静态模型压缩,无法适应边缘环境的运行时波动。为弥合这一差距,我们提出了StreamSplit,一种新颖的框架,使得流式对比学习在异构ARM客户端平台上变得实用。StreamSplit解决了环境音频的连续性与CLAP和COLA等模型的离散批量需求之间的冲突。我们引入:(1)一种基于分布的流式框架,将表示质量与本地批量大小解耦,使用易于处理的混合损失在稀疏更新的情况下保持保真度;(2)一种不确定性引导的自适应分割器,使用轻量级强化学习(RL)策略动态划分计算。独特的是,该策略将实时资源监控与嵌入歧义性相结合,以动态优化准确率-延迟权衡。我们在从资源受限的Raspberry Pi 4到高性能Apple M2的多种硬件上评估了StreamSplit。结果表明,与以服务器为中心的基线相比,StreamSplit将每样本延迟降低了高达4.7倍,带宽减少了77.1%,能耗减少了52.3%。关键的是,它保持了与服务器中心模型相差2.2%以内的准确率,证明了自适应分布式学习是现代边缘生态系统的一条可行路径。

英文摘要

Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade model fidelity, while offloading to the cloud incurs unacceptable latency and bandwidth costs. Existing solutions often resort to static model compression, which fails to adapt to the runtime volatility of edge environments. To bridge this gap, we present StreamSplit, a novel framework that makes streaming CL practical across heterogeneous ARM client platforms. StreamSplit resolves the conflict between the continuous nature of ambient audio and the discrete batch requirements of models like CLAP and COLA. We introduce: (1) A distribution-based streaming framework that decouples representation quality from local batch size, using a tractable Hybrid Loss to maintain fidelity despite sparse updates; and (2) An Uncertainty-Guided Adaptive Splitter that uses a lightweight Reinforcement Learning (RL) policy to dynamically partition computation. Uniquely, this policy integrates real-time resource monitoring with embedding ambiguity to optimize the accuracy-latency trade-off on the fly. We evaluate StreamSplit on diverse hardware, from the resource-constrained Raspberry Pi 4 to the high-performance Apple M2. Results demonstrate that StreamSplit reduces per-sample latency by up to 4.7x and cuts bandwidth by 77.1% and energy by 52.3% compared to server-centric baselines. Crucially, it maintains accuracy within 2.2% of server-centric models, proving that adaptive, distributed learning is a viable path for the modern edge ecosystem.

2605.26508 2026-05-27 q-fin.RM cs.AI

Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents

自主AI智能体时间一致性反事实精算运行时的基础

Hao-Hsuan Chen

发表机构 * Department of Risk Management and Insurance(风险管理与保险系)

AI总结 本文提出一种精算运行时层,通过为每个动作分配时间一致的反事实风险费用,并建立边界内无套利和预算保证,为自主AI智能体提供基础数学框架。

Comments 10 pages. Foundational paper of a multi-paper program on actuarial runtime for autonomous AI agents; previously posted on SSRN (id 6761960). Empirical companion: arXiv:2605.25632. Proof companions included as ancillary files

详情
AI中文摘要

我们为自主AI智能体提出一个基础的精算运行时层,其中每个带有副作用的动作都承担一个时间一致的反事实风险费用,该费用根据合同固定的安全默认值计算,并位于明确的承保边界内。该框架将每个动作的保险作为主要分析单元,并用动作前交易层取代事后年度责任保险。本文建立了四个结构性结果:(i) 在选定的安全默认映射和延续策略下,定义明确的反事实费用,具有显式的非唯一性;(ii) 承保边界内的无分割性质,将路径分解的动作映射为边界势能,并推论出博弈抵抗与边界设计的关系;(iii) 不可逆权威溢价,分为严格正的动作级部分和集合级稳健资本增加的充要特征;(iv) 保守运行时门控定理,将高概率费用包络转化为执行动作预算保证。该结果是更广泛项目的数学基础层:一个实证配套通过精算动作接口和权威前沿实验实例化运行时;一个机制设计配套研究战略操作者激励和跨边界聚合;一个动态承保配套研究经验评级和审计重放校准。本文陈述了原始合约、费用恒等式、边界内无套利结果以及后续层所依赖的预算保证。

英文摘要

We propose a foundational runtime actuarial layer for autonomous AI agents in which every side-effect-bearing action carries a time-consistent, counterfactual risk toll computed against a contractually fixed safe default, inside an explicit underwriting boundary. The framework treats per-action insurance as the primary unit of analysis and replaces post-hoc annual liability cover with a pre-action transaction layer. The paper establishes four structural results: (i) a well-defined counterfactual toll under a chosen safe-default mapping and continuation policy, with explicit non-uniqueness; (ii) a no-splitting property within an underwriting boundary that telescopes path-decomposed actions into a boundary potential, with a corollary tying gaming-resistance to boundary design; (iii) an irreversible-authority premium, split into a strictly positive action-level component and an if-and-only-if characterisation of the set-level robust capital increase; and (iv) a conservative runtime gating theorem that translates high-probability toll envelopes into an executed-action budget guarantee. The result is the mathematical base layer for a broader program: an empirical companion instantiates the runtime through an Actuarial Action Interface and authority-frontier experiments; a mechanism-design companion studies strategic operator incentives and cross-boundary aggregation; and a dynamic-underwriting companion studies experience rating and audit-replay calibration. The present paper states the primitive contract, the toll identity, the within-boundary no-arbitrage result, and the budget guarantee on which those later layers depend.

2605.26457 2026-05-27 cs.SE cs.AI cs.CL cs.PL

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Verus-SpecGym: 用于评估规范自动形式化的智能体环境

Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck

发表机构 * CMU(卡内基梅隆大学) Amazon(亚马逊)

AI总结 提出 Verus-SpecGym 环境与 Verus-SpecBench 基准,通过执行规范机制和对抗性测试评估 LLM 智能体将非正式编程问题转化为形式规范的能力,发现前沿模型可解决 77.8% 的任务但存在遗漏假设等脆弱性。

Comments Preprint

详情
AI中文摘要

AI 编码智能体越来越多地用于编写真实世界的软件,但确保其输出正确性仍然是一个基本挑战。形式化验证提供了一条有希望的路径:智能体生成代码的同时生成机器检查的证明,保证代码满足形式规范。然而,无法保证形式规范本身与用户意图一致。在这项工作中,我们研究规范自动形式化:LLM 智能体能否将非正式编程问题转化为忠实的形式规范。我们引入了 Verus-SpecBench,一个包含 581 个规范编写任务的基准,这些任务源自针对 Rust 验证器 Verus 的 Codeforces 问题,以及 Verus-SpecGym,一个智能体环境,模型在其中与 Verus、bash 和文件系统交互以开发这些规范。核心挑战在于评估:专家编写的参考规范编写成本高昂,LLM 评判者可能遗漏细微错误。我们通过以下方式解决这一问题:(a) 扩展 Verus 的 exec_spec 机制,使生成的规范可以作为 Rust 代码执行;(b) 针对官方 Codeforces 测试和从 Codeforces "hacks"(即竞争对手编写的用于破解不正确解决方案的边缘情况)中提取的对抗性案例进行测试。在 Verus-SpecBench 上,最强的模型 Gemini 3.1 Pro 解决了 77.8% 的任务,其他前沿模型解决了 51.1-57.8%,而开源模型仅达到 21.5-25.5%。我们对失败模式的分析表明,模型生成的规范可能遗漏重要的输入假设、接受不正确的输出以及拒绝有效的输出。我们还发现,LLM 作为评判者的评估遗漏了我们评估者捕获的 26% 的失败。总体而言,我们的结果表明,规范自动形式化对于前沿智能体来说是可行的,但即使在它们已经能够生成正确代码的问题上仍然脆弱。代码、数据和日志可在 https://github.com/formal-verif-is-cool/verus-spec-gym 获取。

英文摘要

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

2605.26451 2026-05-27 cs.HC cs.CV

Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation

先设计,后编码:无模板的美观幻灯片生成

Zhiyao Cui, Chenxu Wang, Shuyue Hu, Yiqun Zhang, Wenqi Shao, Qiaosheng Zhang, Zhen Wang

发表机构 * School of Cybersecurity, Northwestern Polytechnical University(西北工业大学网络安全学院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institution(上海创新研究院) Fudan University(复旦大学)

AI总结 提出DeepSlides层次化幻灯片生成流程,通过解耦设计与实现、引入SlideDesign数据集和多智能体强化学习训练范式,在无模板条件下生成高质量幻灯片。

详情
AI中文摘要

自动生成演示幻灯片需要在严格的空间约束下协调叙事结构与页面级图形设计。对于这种结构化多模态任务,良好的设计流程对于确保幻灯片的最终质量至关重要。现有方法依赖固定模板或直接生成可执行代码,从而限制了LLM的创意布局设计能力,并绕过了关键的幻灯片页面设计步骤。为解决这些限制,本文(1)提出了一种层次化的幻灯片生成工作流DeepSlides,无需任何预定义模板或样式,系统化地组织幻灯片设计任务,将幻灯片页面设计与实现解耦;(2)引入了SlideDesign数据集,专门针对幻灯片生成任务定制;(3)提出了一种多智能体强化学习训练范式,并训练了一对模型SlideQwens,用于幻灯片设计和实现。实验结果表明,我们提出的框架在评估指标上优于基线方法,并在人类偏好评估中取得了优越性能。数据集和代码可在https://github.com/sxswz213/DeepSlides获取。

英文摘要

Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at https://github.com/sxswz213/DeepSlides.

2605.26429 2026-05-27 stat.ME cs.AI cs.LG stat.ML

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

面向大规模分布外检测的结构自适应共形推断

Rongyi Sun, Wenguang Sun, Zinan Zhao

发表机构 * Center for Data Science and School of Mathematical Sciences, Zhejiang University(数据科学中心和数学科学学院,浙江大学)

AI总结 提出结构自适应共形q值(SCQ)和伪分数引导的直推式自动模型选择(P-TAMS),在成对可交换性下实现结构化分布外检测的有限样本错误率控制、功效提升和可解释性增强。

详情
AI中文摘要

本文针对高风险机器学习应用中的结构化分布外(OOD)检测问题。传统共形方法依赖于联合可交换性,难以融入时空或分组结构等辅助信息。为克服这一局限,我们提出结构自适应共形q值(SCQ),这是一种整合个体检验证据与结构模式的显著性指标。我们还开发了伪分数引导的直推式自动模型选择(P-TAMS),将共形化模型选择适应于候选模型工具箱中的结构化OOD检测。SCQ和P-TAMS共同在成对可交换性下形成一个统一框架,提供有限样本错误率控制、改进的功效和增强的可解释性。在模拟和真实数据上的实验表明,所提方法控制了错误发现率,并在多种设置下表现良好。

英文摘要

This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal methods rely on joint exchangeability, making it difficult to incorporate auxiliary information such as spatiotemporal or grouping structures. To overcome this limitation, we propose the structure-adaptive conformal q-value (SCQ), a significance index that integrates individual test evidence with structural patterns. We also develop pseudo-score-guided transductive automated model selection (P-TAMS), which adapts conformalized model selection to structured OOD testing across a toolbox of candidate models. Together, SCQ and P-TAMS form a unified framework under pairwise exchangeability, providing finite-sample error-rate control, improved power, and enhanced interpretability. Experiments on simulated and real data demonstrate that the proposed approach controls the false discovery rate and performs well across diverse settings.

2605.26424 2026-05-27 cs.IR cs.AI cs.LG

Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

Uniboost:基于价值对齐的全局协调实现公平高效的流量分配

Ge Fan, Nan Zhao, Kai Meng, Cong Luo, Yang Fu, Huiping Chu, Jialin Liu, Yuning Jiang, Bo Zheng

发表机构 * Taobao \& Tmall Group of Alibaba Hangzhou China Taobao \& Tmall Group of Alibaba Beijing China Taobao \& Tmall Group of Alibaba

AI总结 提出Uniboost统一流量分配框架,通过后验价值对齐机制和独立线性提升范式,解决耦合分配、分数膨胀和可解释性问题,提升流量分配效率和推荐性能。

Comments accepted by SIGIR 2026

详情
AI中文摘要

随着互联网服务的快速发展,推荐系统已变得不可或缺。特别是混合(重排序)阶段在跨不同业务目标分配流量中起着关键作用。然而,现有方法常受限于耦合的分配方案、分数膨胀和缺乏可解释性。为应对这些挑战,我们提出Uniboost,一个统一的流量分配框架。Uniboost引入后验价值对齐机制,将抽象模型分数校准到具有明确业务语义的锚定指标,显著增强可解释性。此外,它采用独立的线性提升范式来解耦复杂的加权方案,实现每个计划贡献的精确归因。我们通过在线A/B测试和深入数据分析验证了Uniboost的有效性,展示了三个关键发现:1)降低加权分数的整体权重有效减轻了意外的业务干扰,产生更高效的微观流量分配策略;2)事后分析和聚合仪表板提供了直观的宏观洞察,指导整体流量分配机制的设计;3)提出的“有效完成分数”作为易于获取的后验指标,为内容推荐管道提供了可靠的锚点。综合来看,我们的实验表明,Uniboost不仅在微观层面提升了流量分配效率和推荐性能,还为系统迭代提供了宏观指导。因此,这项工作为大规模工业推荐系统提供了一种高效可控的流量调节解决方案。

英文摘要

With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, score inflation, and a lack of interpretability. To address these challenges, we propose Uniboost, a unified traffic allocation framework. Uniboost introduces a posterior value alignment mechanism that calibrates abstract model scores to anchor metrics with explicit business semantics, significantly enhancing interpretability. Furthermore, it employs an independent linear boosting paradigm to decouple complex weighting schemes, enabling precise attribution of each plan's contribution. We validate the effectiveness of Uniboost through online A/B tests and in-depth data analysis, demonstrating three key findings: 1) Reducing the overall weight of weighted scores effectively mitigates unintended business interference, yielding a more efficient micro-level traffic allocation strategy; 2) Post-hoc analyses and aggregated dashboards provide intuitive, macro-level insights that guide the design of the overall traffic allocation mechanism; 3) The proposed "Effective Completion Score" serves as an easily obtainable post-metric that offers a reliable anchor for content recommendation pipelines. Collectively, our experiments show that Uniboost not only improves traffic allocation efficiency and recommendation performance at the micro level but also provides macro-level guidance for system iteration. Thus, this work provides an efficient and controllable traffic regulation solution for large-scale industrial recommendation systems.

2605.26413 2026-05-27 stat.ME cs.AI cs.LG stat.ML

Confounder Detection via Treatment Intent: A New Observational Study Design

通过治疗意图进行混杂检测:一种新的观察性研究设计

Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim

发表机构 * UCLA(加州大学洛杉矶分校) ETH Zurich(苏黎世联邦理工学院) Columbia University(哥伦比亚大学)

AI总结 提出一种通过询问治疗决策者比较配对单元来揭示未观测混杂因素的新研究设计,并在ICU数据中验证其有效性。

详情
AI中文摘要

理解干预的效果是科学进步的核心,随机对照试验(RCT)在许多应用领域被视为因果推断的金标准。然而,RCT成本高、耗时长,且常受伦理或实际限制,这促使我们需要能够从观察性数据中得出结论的因果方法。尽管此类数据收集规模日益扩大,但将其用于因果推断常因并非所有影响治疗分配和结果的变量都被观测到而受阻,这一问题称为未观测混杂。在本文中,我们介绍了一种称为通过治疗意图进行混杂检测的新研究设计。其思路是询问做出治疗决策的人类专家,并要求他们比较由原则性匹配策略提出的单元对,目的是引出解释治疗决策为何不同的未观测变量。我们为此类程序提供了理论基础,确定了此类研究设计可能引出未观测混杂因素的条件。基于这些新建立的基础,我们研究了重症监护病房(ICU)中干预的治疗效果。首先,我们展示了强烈表明ICU中收集的电子健康记录(EHR)存在未观测混杂的经验证据。通过使用临床文本笔记作为医生知识的代理并利用自然语言处理,我们在已知真实情况的半合成环境中为我们的方法提供了概念验证。

英文摘要

Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold standard for causal inference in many applied fields. However, RCTs are costly, time-consuming, and often constrained by ethical or practical limitations, motivating the need for causal methods able to draw conclusions from observational data. While such data is collected at ever larger scale, making its use for causal inference is often hindered by the fact that not all variables affecting treatment allocation and the outcome are observed: an issue known as unobserved confounding. In this paper, we introduce a new study design called confounder detection via treatment intent. The idea is to query a human expert who makes treatment decisions, and ask them to compare pairs of units proposed by a principled matching strategy, with the goal of eliciting unobserved variables that explain why treatment decisions differ. We provide a theoretical basis for such a procedure, ascertaining conditions under which such a study design may elicit unobserved confounders. Building on this newly established foundations, we study treatment effects of interventions in the intensive care unit (ICU). First, we show empirical evidence strongly indicating that electronic health records (EHRs) collected in ICUs are subject to unobserved confounding. By using clinical text notes as a proxy for physicians' knowledge and leveraging natural language processing, we provide a proof of concept for our methodology in a semi-synthetic environment with a known ground truth.

2605.26409 2026-05-27 cs.CR cs.AI cs.LG

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

通过模型的行为几何进行越狱易感性预测与缓解

Hayden Helm, Xiaodong Liu, Weiwei Yang

发表机构 * Microsoft Research(微软研究院)

AI总结 本文通过形式化模型群体的行为几何,利用已评估和防御的模型,实现高效的易感性预测和防御迁移,在79个模型和100个系统配置上,易感性检测AUPRC达0.94且探针减少约98%,防御迁移性能优于同供应商分配。

详情
AI中文摘要

评估和缓解生成系统对越狱攻击的易感性对其安全部署至关重要。由于可部署系统的数量众多,对每种配置进行全面评估和优化是不切实际的。本文形式化了模型群体的行为几何,通过利用先前评估和防御过的模型,支持群体内高效的易感性预测和有效的防御迁移。我们将该框架应用于涵盖24个提供商的79个模型以及单个基础模型的100个系统配置。使用行为几何的简单方法在易感性检测中达到了0.94的AUPRC,与全面评估相比,探针数量减少了约98%。使用行为几何选择从哪个模型迁移优化后的防御,在无额外探针成本的情况下优于同供应商分配(+2%,p = 0.03),且一组三个模型足以覆盖整个群体。结果对超参数选择和评判者具有鲁棒性。

英文摘要

Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In this paper, we formalize the behavioral geometry of a population of models that, by leveraging previously evaluated and defended models, supports both efficient susceptibility prediction and effective defense transfer across a population. We apply the framework to 79 models spanning 24 providers and to 100 system configurations of a single base model. Simple methods that use the behavioral geometry reach an AUPRC of $0.94$ for susceptibility detection with $\approx98\%$ fewer probes relative to a full evaluation. Using the behavioral geometry to select which model to transfer an optimized defense from outperforms same-provider assignment ($+2\%$, $p = 0.03$) at no additional probe cost, with a set of three models sufficient to cover the population. Results are robust to hyperparameter selection and judge.