arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.00040 2026-06-02 cs.CY cs.AI

Tracing GenAI Literacy: Uncovering Student-AI Interaction Patterns in Academic Writing through Epistemic Network Analysis


Angxuan Chen, Jiyou Jia

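Affiliations: Department of Educational Technology, Graduate School of Education, Peking University

AI summary: Using learning analytics and epistemic network analysis on interaction logs from 162 students in a GenAI-assisted abstract writing task, this study finds distinct interaction patterns: high-literacy students use iterative refinement and strategic questioning, while low-literacy students rely on direct generation commands.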


Abstract

As Generative AI (GenAI) becomes integral to education, fostering GenAI literacy is critical. However, current assessments largely rely on self-reported scales, lacking insights into how literacy manifests in actual learning processes. This study leverages Learning Analytics (LA) to bridge this gap. We collected interaction logs from 162 university students engaged in a GenAI-assisted abstract writing task. Using Epistemic Network Analysis (ENA), we modeled and compared the questioning strategies of students with varying GenAI literacy levels. Preliminary results reveal distinct interaction signatures: high-literacy students engage in iterative refinement and strategic questioning, while low-literacy students rely on direct generation commands. This work contributes to the workshop by demonstrating how process data can characterize GenAI literacy, paving the way for data-driven literacy assessment and real-time interventions.
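
The abstract does not spell out the coding scheme or ENA settings, so the following is only a minimal sketch of the accumulation step that ENA is generally built on: counting which codes co-occur within a moving window of turns, then normalizing the resulting vector. The code labels (`GENERATE`, `REFINE`, `QUESTION`), the window size, and the function name are all hypothetical, and the dimensional reduction that ENA applies afterwards is omitted.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_vector(coded_turns, codes, window=2):
    """Accumulate code co-occurrences for one student (simplified ENA step).

    coded_turns: list of sets, the codes assigned to each prompt in order.
    codes: the full set of code labels (hypothetical labels in this sketch).
    window: number of turns (current turn included) that count as context.
    """
    pairs = list(combinations(sorted(codes), 2))
    counts = Counter()
    for end in range(len(coded_turns)):
        start = max(0, end - window + 1)
        current = set(coded_turns[end])
        context = set().union(*coded_turns[start:end + 1])
        for a, b in pairs:
            # A connection needs one code in the current turn and the
            # other somewhere in the window (which includes the current turn).
            if (a in current and b in context) or (b in current and a in context):
                counts[(a, b)] += 1
    vec = [counts[p] for p in pairs]
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize so that students with long and short logs are comparable.
    return pairs, [v / norm if norm else 0.0 for v in vec]


CODES = {"GENERATE", "REFINE", "QUESTION"}
iterative = [{"QUESTION"}, {"REFINE"}, {"REFINE"}]
direct = [{"GENERATE"}, {"GENERATE"}]
pairs, iterative_vec = cooccurrence_vector(iterative, CODES)
_, direct_vec = cooccurrence_vector(direct, CODES)
```

In this toy input, the iterative log produces a connection between `QUESTION` and `REFINE`, while the log made only of generation commands produces no connections at all, which is the kind of contrast a network comparison would surface.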

2606.00039 2026-06-02 cs.CY cs.AI cs.HC

Beyond Categories of Caste: Examining Caste Bias and Morality in Text-to-Image AI Models


Divyanshu Kumar Singh, Dipto Das, Deepika Rama Subramanian, Koustuv Saha, Stephen Voida, Bryan Semaan

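Affiliations: University of Colorado Boulder; University of Toronto; University of Illinois Urbana-Champaign

AI summary: Combining an algorithmic audit with critical discourse analysis, the paper shows how text-to-image models perpetuate caste bias beyond the upper- versus lower-caste binary, and proposes an anti-caste approach to fairness in AI systems.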


Abstract

Text-to-Image (T2I) models have shown promising utility across various domains. However, such models are also amplifying harmful societal biases in their outputs. In the context of South Asia, recent work has shown that caste biases and stereotypes are being perpetuated through Generative AI (GenAI) systems. While this research offers highly relevant insight into invisibilized narratives of caste discrimination through GenAI systems, it often treats caste as an identity category. Therefore, in this work we shift our ontology to focus on the relational aspect of caste. This enables us to develop a more nuanced understanding of the mechanics of caste discrimination by and through T2I models. Combining an algorithmic audit with critical discourse analysis, we draw on a conceptual frame challenging Brahminical Normativity to show how caste biases are perpetuated beyond the simple binaries of upper- vs. lower-caste categories. Our contributions are two-fold. Beyond challenging the understanding of caste as a category, we propose an anti-caste approach to tackle the issue of caste bias and fairness in AI systems.

2606.00037 2026-06-02 cs.CY cs.AI cs.HC

Update Opacity: Epistemic Accessibility and Governance Under AI System Change


Andrea Ferrario, Joshua Hatherley

Affiliations: Institute of Biomedical Ethics and History of Medicine, University of Zürich; SUPSI, Dalle Molle Institute for Artificial Intelligence (IDSIA); ETH Zürich; Center for Philosophy of AI, University of Copenhagen

AI summary: To address AI system updates that leave users unable to understand why outputs changed, the paper proposes a governance framework that combines the EU AI Act with Machine Learning Operations, using trustworthiness profiles and threshold-based disclosure to make updates transparent.


Abstract

Machine learning models embedded in deployed AI systems are routinely updated to maintain correct functioning over time. Yet such updates can generate update opacity: users may not be able to understand why the same input now yields a different output. We argue that update opacity is best understood as a diachronic failure of epistemic accessibility: the problem is that materially relevant changes may fail to remain accessible to human users in forms that support understanding, calibrated reliance, and appropriate action under real role- and time-specific constraints. This makes update opacity a governance problem. Not all change is equally relevant, and disclosing every update would itself undermine use through overload. To address this problem, we combine two complementary governance approaches: the EU AI Act, which helps specify the system-level perimeter of normatively relevant change, and Machine Learning Operations, which provides operational tools for tracking and comparing change over time. On this basis, we propose a framework that models system change through trustworthiness profiles and trustworthiness levels, and uses threshold-based disclosure to surface materially relevant within-envelope change to different stakeholders over time. We illustrate the approach with a medical AI example and derive practical implications for lifecycle documentation, post-market monitoring, and update disclosure.
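
The abstract names threshold-based disclosure but gives no formula, so the sketch below is only one possible reading: a profile is a mapping from metric name to value, and a change is surfaced to a stakeholder when it meets that stakeholder's threshold. The metric names, the numbers, the stakeholder roles, and the function `changes_to_disclose` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def changes_to_disclose(
    before: Dict[str, float],
    after: Dict[str, float],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return the metrics whose absolute change meets a stakeholder's threshold."""
    flagged = []
    for metric, limit in thresholds.items():
        delta = abs(after[metric] - before[metric])
        if delta >= limit:
            flagged.append(metric)
    return flagged


# Hypothetical profiles of a medical AI system before and after an update.
profile_v1 = {"sensitivity": 0.91, "specificity": 0.88}
profile_v2 = {"sensitivity": 0.84, "specificity": 0.885}

# Hypothetical role-specific thresholds: clinicians see only large shifts,
# auditors see nearly everything.
clinician = {"sensitivity": 0.05, "specificity": 0.05}
auditor = {"sensitivity": 0.001, "specificity": 0.001}

for_clinician = changes_to_disclose(profile_v1, profile_v2, clinician)
for_auditor = changes_to_disclose(profile_v1, profile_v2, auditor)
```

Here the clinician is told only about the drop in sensitivity, while the auditor also sees the small shift in specificity. This mirrors the abstract's point that disclosing every change to everyone would cause overload.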

2606.00033 2026-06-02 cs.CY cs.AI

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing


Michael Lan, Narmeen Fatimah Oozeer, Chaithanya Bandi, Philip Quirke, Austin Meek, Fazl Barez, Amirali Abdullah

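Affiliations: University of Delaware; University of Oxford; ThoughtWorks

AI summary: Mechanistic interpretability (MI) experiments lack a standardized auditing system. The paper proposes an auditable framework built on a continuous collaborative reviewing platform, expert-verified guidelines, and source-based auditing systems, to raise MI's credibility in high-stakes areas such as AI safety.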

Comments Accepted at ACL 2026 main conference


Abstract

While mechanistic interpretability (MI) has produced important insights into neural network internals, the field has yet to establish a standardized system to audit experiments. As such, many of its findings remain underutilized in safety-critical applications such as medical AI and autonomous systems, as stakeholders cannot certify their validity. Recent work demonstrates this concretely: two papers found conflicting conclusions for the same behavior, and a third study revealed that both were partially correct but incomparable due to methodological inconsistencies. Without standardized auditing, such ambiguities hinder adoption in high-stakes contexts requiring strong correctness guarantees. We call for the MI community to work towards developing a novel reviewing system that complements peer review via: (1) continuous reviewing supported by a Collaborative Reviewing Platform where meta-science results and discussions (such as critiques, negative results, post-hoc extensions, reproductions, replications, and partial results) that fit outside of papers are organized and discussed, allowing for comments and revisions to be made at any time; (2) generalizing good practices found on this platform into expert-verified guidelines and protocols to improve auditing efficiency; and (3) source-based auditing systems that track the arguments that claims depend on. This position paper encourages constructive debate over the necessity, design and implementation of such a framework, providing early concrete examples to help catalyze these dialogues. Overall, we propose that auditing MI itself is essential for its application in AI safety, industry, and governance.

2606.00015 2026-06-02 cs.HC cs.AI cs.CY cs.ET

SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant


Yifan Zhang, Xinkui Zhao, Zuxin Wang, Zhengyi Zhou, Guanjie Chen, Shuiguang Deng, Jianwei Yin

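Affiliations: School of Software Technology, Zhejiang University

AI summary: To address the challenges of teaching operating systems courses, the paper presents SortingHat, a 3D digital-human teaching assistant that combines retrieval-augmented generation with multi-agent reinforcement learning to provide personalized guidance, adaptive content generation, and automated assessment.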

Journal ref
WWW '25: Companion Proceedings of the ACM on Web Conference 2025, pages 2951-2954

Abstract

Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures and the diversity of running environments. Traditional teaching methods often fail to address the diverse backgrounds, learning speeds, and practical needs of students. To tackle these challenges, we present SortingHat, a personalized digital teaching assistant tailored specifically for OS education. SortingHat integrates advanced AI technologies, including a retrieval augmented generation (RAG) framework and multi agent reinforcement learning (MARL), to deliver adaptive, scalable, and effective educational support. SortingHat features a 3D digital human interface powered by large language models (LLMs) to provide personalized, empathetic, and context aware guidance. It generates tailored exercises based on each student's learning history and academic performance, reinforcing weak areas and challenging advanced concepts. Additionally, the system incorporates a robust evaluation pipeline that ensures fair, consistent, and unbiased grading of student submissions while delivering personalized, actionable feedback for improvement. By combining personalized guidance, adaptive content creation, and automated assessment, SortingHat transforms OS education into an engaging, immersive, and scalable experience.

2606.00011 2026-06-02 cs.HC cs.AI cs.LG

RuleEdit: Failure-Guided Human-AI Model Editing with Prospective Impact Preview


Min Hun Lee, Justin Yu Feng Teo

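Affiliations: Singapore Management University

AI summary: The paper presents RuleEdit, a system that detects likely failures through mismatch signals from rule tables and previews the impact of model edits, significantly improving human-AI team performance in stroke rehabilitation assessment.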


Abstract

Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences of model edits before committing them. We present RuleEdit, an interactive, rule-guided human-AI model editing system that (i) surfaces likely failures through interpretable mismatch signals from rule tables and (ii) supports user-authored rule feedback with prospective previews of projected performance changes and embedding shifts. We instantiate RuleEdit in stroke rehabilitation assessment and evaluate it with health professionals and students. Rule-guided failure detection significantly increased Human + AI performance by 14.16% (p < 0.001) while improving rejection of incorrect AI and reducing both over- and under-reliance as well as ChangedToWrong decisions. In addition, presenting prospective embedding previews improved participants' feedback for model adaptation, increasing post-update local performance gains from 11.50% to 36.38% after incorporating users' rule-based feedback (p < 0.001). Our findings show that mismatch-based failure cues and prospective impact previews can support failure-aware human-AI model editing, while also revealing a local-global tradeoff: edits that help a specific case can degrade performance when transferred globally. We discuss implications for designing failure-aware and controllable human-AI systems.

2606.00001 2026-06-02 cs.HC cs.CV cs.MM

Shu Dao: A Calligraphy Score Framework Linking Calligraphy, Music, and Performance


Lican Huang

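Affiliations: Hangzhou Domain Zones Technology Co., Ltd.

AI summary: The paper proposes the CWSR notation and the Shu Dao framework, which model East Asian calligraphy as a structured, score-like performance and support human-AI co-creation.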

Comments 47 pages

Journal ref
Journal of Advances in Information Science and Technology, 2026 4(2), 1-47. https://yvsou.com/journal/index.php/jaist/article/view/43

Abstract

This paper introduces Calligraphy Writing Score Representation (CWSR) and proposes Shu Dao as a framework that interprets East Asian calligraphy as a performative art rather than a static visual artifact. Inspired by traditions such as Japanese Shodō and embodied cultural practices such as Chadao, the framework models calligraphy as a structured performance analogous to musical notation. Instead of representing characters as fixed images, the proposed approach encodes each brush stroke as an ordered and executable action, forming a calligraphy score. Characters are organized within a structured spatial grid, and strokes are annotated with attributes including stroke type, execution order, spatial coordinates, trajectory, compositional role, and dynamic properties such as brush pressure and pacing. This representation captures temporal and expressive aspects of calligraphic writing that are typically absent from image-based representations. The paper makes three main contributions. First, it introduces CWSR as a structured notation system for representing calligraphy across multiple levels, including strokes, character structures, and compositional organization (e.g., layout and zhangfa), together with their rhythmic and performative dynamics. Second, it conceptualizes Shu Dao as a score-mediated framework that models calligraphy as structured performance. Third, it establishes a computational foundation for the analysis, visualization, and executable generation of calligraphic works by AI-based calligraphic agents. Together, these contributions bridge calligraphy, musical notation, and performative cultural practices, supporting human-AI co-creation in computational calligraphy and digital humanities research.
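
The attribute list in the abstract suggests a record-like structure per stroke. The sketch below shows what such a score could look like as plain data. It is not the CWSR format itself: every class name, field name, unit, and value here is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Stroke:
    stroke_type: str                       # e.g. "heng" (horizontal), "shu" (vertical)
    order: int                             # execution order within the character
    trajectory: List[Tuple[float, float]]  # points inside the character's grid cell
    role: str                              # compositional role (hypothetical labels)
    pressure: List[float] = field(default_factory=list)  # one value per point
    duration_s: float = 0.0                # pacing, here as seconds per stroke


@dataclass
class Character:
    glyph: str
    grid_cell: Tuple[int, int]             # (row, column) in the layout grid
    strokes: List[Stroke]

    def performance(self) -> List[Stroke]:
        """Strokes in the order they would be executed, like notes in a score."""
        return sorted(self.strokes, key=lambda s: s.order)


# The character "十" (ten): a horizontal stroke followed by a vertical one.
shi = Character(
    glyph="十",
    grid_cell=(0, 0),
    strokes=[
        Stroke("shu", 2, [(0.5, 0.1), (0.5, 0.9)], "spine", [0.4, 0.9], 1.1),
        Stroke("heng", 1, [(0.1, 0.5), (0.9, 0.5)], "crossbar", [0.6, 0.7], 0.8),
    ],
)
executed = [s.stroke_type for s in shi.performance()]
total_time = sum(s.duration_s for s in shi.strokes)
```

Storing strokes as ordered actions rather than pixels keeps the temporal information (order, pressure, pacing) that an image of the finished character would lose.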

2605.30743 2026-06-02 cond-mat.mtrl-sci cs.CE cs.CL

A Padding Method for Enhanced Encoding of Inorganic Structures with Varying Chemical Compositions


Thang Dang, Haderbache Amir, Tzanakakis Alexandros, Yoshimoto Yuta

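Affiliations: Fujitsu Limited; National Technical University of Athens

AI summary: The paper proposes an encoding method that uses crystal symmetry information (Wyckoff position length-aware padding), combined with an end-to-end generation system, to improve the accuracy and stability of generated inorganic materials. It raises reconstruction accuracy by 5.3% on proton conductor data and generates 63.5% more novel stable materials than the baseline model on the perov-5 dataset.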


Abstract

Designing novel inorganic materials through generative models remains an important challenge for materials science, driven by the complexity and diversity of inorganic structures across expansive chemical compositions and structural landscapes. The vast combinatorial space of inorganic compounds demands innovative, AI-driven approaches to overcome limitations in generative accuracy and efficiency. To address this, we introduce a novel method that redefines the encoding and generation of inorganic materials by utilizing a domain-specific, symmetry-aware representation. Our approach not only refines the representation of intricate inorganic structures but also contributes to the field of material discovery by enhancing the precision and stability of generated candidates. Central to our methodology is a novel padding technique that exploits crystal symmetry information to enhance the encoding process. By integrating Wyckoff position length-aware padding into an encoder architecture, we achieve a more robust and informed representation of inorganic materials. This symmetry-driven enhancement enables deep learning models to generate stable, previously unexplored inorganic structures with superior accuracy and computational efficiency. Furthermore, we introduce an end-to-end system that leverages machine learning potential models to seamlessly generate novel and stable inorganic materials, including ones unseen in the training data, from initial data to validated output. This pipeline integrates advanced generative models with stability analysis, marking a significant leap forward in the automated exploration and design of next-generation inorganic materials. Our method improved reconstruction accuracy by 5.3% on proton conductor data and generated 63.5% more novel stable inorganic materials than the baseline model on the perov-5 dataset.

2605.27527 2026-06-02 astro-ph.IM cs.LG

Probabilistic Data-Driven Modelling of Astrophysical Transients: The Neural Process Family for Ultrafast and Class-Agnostic Light Curve Reconstruction with NightLANP


Siddharth Chaini, Federica B. Bianco, Ashish Mahabal

Affiliations: NASA FINESST Fellow; Department of Physics and Astronomy, University of Delaware; University of Delaware, Data Science Institute; Joseph R. Biden, Jr. School of Public Policy and Administration, University of Delaware; Vera C. Rubin Observatory; Division of Physics, Mathematics, and Astronomy, California Institute of Technology; Center for Data Driven Discovery, California Institute of Technology

AI summary: For reconstructing sparse, irregular light curves, the paper introduces the neural process family (exemplified by Attentive Neural Processes), which combines the probabilistic framework of Gaussian Processes with the scalability of deep learning. Meta-learning enables fast, cross-band, class-agnostic inference that outperforms Gaussian Processes and neural networks on simulated Rubin data.


Abstract

Astrophysical observations from Earth are subject to weather, environmental, and scientific constraints that lead to sparse, irregular light curves. On the eve of the Vera C. Rubin Observatory Legacy Survey of Space and Time, its dataset offers unprecedented opportunities for transient science. Yet a key challenge remains its cadence, sparse and irregular across six bands, limiting inference. Interpolation helps mitigate this, with Gaussian Processes the standard, but they struggle with cross-band correlations, require a priori kernel specification, and must be fit to each light curve individually, hence scaling poorly. Here, we introduce the neural process family for light curve reconstruction, combining the probabilistic framework of Gaussian Processes with the scalability of deep learning. By meta-learning on diverse simulated transients, Attentive Neural Processes shift the bulk of computation to training, enabling rapid, amortized inference with a class-agnostic model. Evaluated on realistic Rubin cadences across 15 transient classes, we show that even an unoptimized, out-of-the-box Attentive Neural Process consistently outperforms all benchmarks -- a suite of Gaussian Processes and neural networks -- on every tested metric, spanning regression quality, astrophysical feature recovery, and probabilistic calibration. Our model interpolates all bands simultaneously in microseconds, over four orders of magnitude faster than the next-best neural benchmark and five faster than Gaussian Processes, demonstrating the potential of neural processes for the nightly Rubin alert stream. Attentive Neural Processes avoid the overconfidence of standard neural networks and the underconfidence of Gaussian Processes, delivering sharp, well-calibrated uncertainties. This work establishes the neural process family as a scalable, probabilistic foundation for real-time transient science in the Rubin era.

2605.26874 2026-06-02 cs.DB cs.AI cs.LG

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations


Madhulatha Mandarapu, Sandeep Kunkunuru

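Affiliations: VaidhyaMegha Private Limited, India

AI summary: Using a typed knowledge graph as the data layer, the study lifts the same GPT-4 model from 65% to 82-83% on industrial maintenance scenarios with LLM-generated Cypher, and reaches 99% on graph-answerable scenarios with native graph primitives and no LLM. It also introduces generation-augmented knowledge (GAK) to handle facts missing from the data, answering 81.8% of the 88 scenarios that the benchmark flags as non-deterministic.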

Comments v2: reframed around the knowledge graph as a grounding substrate with a 3-tier router (text-to-Cypher; native graph/optimization primitives; generation-augmented knowledge, GAK). Adds a benchmark-grounded GAK evaluation on 88 real non-deterministic AssetOpsBench scenarios with provenance-tagged enrichment. 18 pages. Code: github.com/samyama-ai/assetops-kg


Abstract

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) establishes that GPT-4 agents achieve 65% on 139 industrial maintenance scenarios, and compares LLM orchestration paradigms (Agent-As-Tool vs. Plan-Execute) on a fixed data layer. We ask the orthogonal question: how much does the data model behind the tools matter? We treat a typed knowledge graph as a grounding substrate and route each question by how it is best answered: (i) LLM-generated Cypher for structured retrieval, which lifts the same GPT-4 model from 65% to 82-83%; (ii) native graph and optimization primitives, with no LLM, reaching 99% on graph-answerable scenarios; and (iii) generation-augmented knowledge (GAK) for answers absent from the data -- the engine's agent materializes the missing facts as provenance-tagged graph nodes, then answers. A recurring theme is inverted LLM usage: we constrain the LLM to query generation or one-shot enrichment from a typed schema and let the graph execute deterministically. On the 88 real AssetOpsBench failure-mode scenarios the benchmark itself flags non-deterministic -- ten equipment types absent from the graph -- GAK lifts answerability from zero to 100% of equipment types and answers 81.8% of scenarios, every materialized fact tagged source:LLM-derived for auditability. We also contribute 40 graph-native scenarios. For structured operational domains the data layer -- not the LLM orchestration -- is the primary lever, and a typed knowledge graph serves as a grounding substrate between raw industrial data and LLM reasoning.

2605.30169 2026-06-02 cs.CY cs.AI cs.MA

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms


Botao Amber Hu, Helena Rong, Max Van Kleek

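Affiliations: University of Oxford; New York University Shanghai

AI summary: The paper argues that language model agents are ontologically dissociative (replaceable modules, fluid identity) and so cannot provide the persistent identity, behavioral predictability, and sanction sensitivity that reputation mechanisms require. It proposes a shift to observability-based, ex ante, constitutive, protocol-based behavioral constraints.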

Comments Accepted by FAccT 2026


Abstract

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from "Know Your Customer" and credit scores to "Know Your Agent" regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically dissociative: they are essentially an assemblage of mutable modules (foundation models, system prompts, tool-access policies, external memory, and, in some cases, a multi-agent system as a whole), any of which may change agent behavior, with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, we argue that this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability, the very properties that reputation mechanisms aim to sustain, thereby collapsing trust. We argue that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.

2506.02075 2026-06-02 stat.ME cs.LG

Position: Stop Chasing the C-index when Evaluating Survival Analysis Models


Christian Marius Lillelund, Shi-ang Qi, Russell Greiner, Christian Fischer Pedersen

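Affiliations: University of Copenhagen

AI summary: This position paper critically examines evaluation practice in survival analysis, argues that concordance measures such as the C-index are overused and misaligned with modeling objectives, proposes a double-helix ladder framework to align evaluation metrics with model assumptions, and shows experimentally that misalignment leads to misleading comparisons.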

Comments ICML 2026 Position Paper Track (Spotlight)


Abstract

The current state of evaluation in survival analysis is plagued by the persistent use of evaluation metrics in ways that are misaligned with the stated modeling objective. In addition, many such evaluations are based on censoring assumptions that are left implicit or unjustified. This means that the reported performance can be misleading and may fail to answer the scientific or modeling question the evaluation was intended to address. In this position paper, we critically examine evaluation practices in survival analysis and highlight how censoring makes evaluation fundamentally different from standard regression or classification. We place particular focus on concordance-based measures, such as the C-index, which we show are heavily overused in the literature. To help identify appropriate metrics, we propose a set of key desiderata and introduce a double-helix ladder, in which valid evaluation requires alignment between metric and modeling assumptions. Through controlled experiments, we show that violations of this alignment can lead to misleading model comparisons. We conclude by providing practical guidance on how to evaluate a survival model.
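
For readers unfamiliar with the metric under discussion, here is a minimal sketch of Harrell's concordance index for right-censored data, written from its standard definition rather than from the paper. A pair is comparable when the subject with the shorter observed time had an event, and tied risk scores count as half. Tied times are simply skipped in this simplified version.

```python
from typing import Sequence


def harrell_c_index(
    times: Sequence[float],
    events: Sequence[bool],
    risks: Sequence[float],
) -> float:
    """Fraction of comparable pairs in which the earlier failure has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [1.0, 2.0, 3.0, 4.0]
events = [True, True, False, True]   # the third subject is censored
well_scaled = [4.0, 3.0, 2.0, 1.0]
distorted = [400.0, 3.1, 3.0, -9.0]  # same ordering, very different values

c_a = harrell_c_index(times, events, well_scaled)
c_b = harrell_c_index(times, events, distorted)
```

Both risk vectors get the same score because the index depends only on how subjects are ranked. That makes it blind to how well predicted times or probabilities are calibrated, which is one reason a metric needs to match the question being asked.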

2605.29287 2026-06-02 cs.IR cs.CV

UniNote: A Unified Embedding Model for Multimodal Representation and Ranking


Jinghan Zhao, Wenwei Jin, Anqi Li, Jintao Tong, Luya Mo, Jiawei Li, Bin Li, Yao Hu

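Affiliations: Xiaohongshu, Beijing, China; Shanghai Jiao Tong University; Huazhong University of Science and Technology; Beijing Institute of Technology

AI summary: The paper proposes UniNote, a unified embedding model trained in two stages (contrastive SFT, then reinforcement learning) to address three problems in industrial item-to-item retrieval: balancing global representation against local retrieval, the inefficiency of decoupled pipelines, and the precision-latency trade-off. Deployed at Xiaohongshu, it significantly improves retrieval quality and cost efficiency.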

Comments Accepted by KDD Ads Track 2026


Abstract

Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to the challenges of balancing global content representation with fine-grained local retrieval, the systemic inefficiency of decoupled embedding-and-ranking pipelines, and the inherent trade-offs between model precision and serving latency. To solve these issues, we propose UniNote, a unified embedding model designed for industrial I2I retrieval. Tailored retrieval strategies are introduced to support representation learning over complex, multimodal content at varying granularities. To operationalize these strategies, UniNote employs a two-stage training paradigm: the first stage leverages contrastive SFT to establish robust base embeddings, while the second stage refines ranking quality through a reinforcement learning (RL) process that aligns the model with content relevance. Our results show that UniNote achieves SOTA performance across diverse I2I tasks. Deployed at Xiaohongshu and integrated with Matryoshka Representation Learning (MRL), UniNote achieved significant improvements in retrieval quality and cost efficiency in large-scale applications.
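
The abstract does not describe how MRL is used in deployment. As general background, Matryoshka-style embeddings are trained so that a leading prefix of the vector is itself a usable embedding, which lets a system trade precision for storage and latency. The sketch below shows only the serving-side step (truncate, then renormalize), with made-up vectors and hypothetical function names.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 4-dimensional embeddings for two items.
note_a = [3.0, 4.0, 12.0, 0.5]
note_b = [3.0, 4.0, -1.0, 2.0]

full = cosine(truncate_embedding(note_a, 4), truncate_embedding(note_b, 4))
short = cosine(truncate_embedding(note_a, 2), truncate_embedding(note_b, 2))
```

With these toy values the two items are identical in the first two coordinates, so the truncated similarity is 1 while the full-vector similarity is lower. A shorter prefix is cheaper to store and compare but can blur distinctions that only the later coordinates carry.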

2605.29107 2026-06-02 cs.CR cs.AI

GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization


Ojas Nimase, Zhe Chen, Gengpei Qi, Yue Zhao, Xiyang Hu

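Affiliations: University of Southern California; Arizona State University

AI summary: The paper presents GEO-Bench, a benchmark that evaluates ranking-manipulation attacks in generative engine optimization under one protocol, comparing the effectiveness and stealth of black-box prompt-based attacks, white-box gradient-based attacks, and white-hat strategies.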

详情
AI中文摘要

大型语言模型(LLMs)越来越多地对用户查询的产品、文档和推荐进行排名,这使得操纵这些排名成为公平性和信息完整性日益关注的问题。关于生成式引擎优化(GEO)的研究已经产生了许多操纵方法,但每种方法都在自己的数据集上使用自己的指标进行评估,因此它们的相对强度和可检测性仍不清楚。我们提出了GEO-Bench,这是一个在统一协议下评估GEO排名操纵攻击的基准。它统一了黑盒提示攻击(TAP、Zero-Shot)、白盒梯度攻击(STS、RAF、StealthRank)以及十种白帽C-SEO策略。我们针对一个固定的开源权重排序器(Llama-3.1-8B-Instruct)在五个数据集上对每种方法进行评分,使用有效性(NRG、Success@α、Promote@α)和隐蔽性(关键词违规率、困惑度比率)指标。我们的评估表明,对抗性攻击在有效性和隐蔽性之间存在权衡;黑盒内容重写在排名提升方面达到或超过梯度攻击,同时生成更流畅的文本,并且可以在某些领域逃避基于关键词和困惑度的检测;访问模型并不能预测攻击强度。通过标准化数据集、攻击实现和指标,GEO-Bench实现了这些攻击范式之间的首次直接比较,并支持检测方法的开发。

英文摘要

Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulating these rankings a growing concern for fairness and information integrity. Research on generative engine optimization (GEO) has produced many manipulation methods, but each is evaluated on its own dataset with its own metrics, so their relative strength and detectability stay unclear. We present GEO-Bench, a benchmark that evaluates GEO ranking-manipulation attacks under one protocol. It unifies black-box prompt-based attacks (TAP, Zero-Shot), white-box gradient-based attacks (STS, RAF, StealthRank), and ten white-hat C-SEO strategies. We score every method on five datasets against a fixed open-weight ranker (Llama-3.1-8B-Instruct), using metrics for both effectiveness (NRG, Success@α, Promote@α) and stealth (keyword violation rate, perplexity ratio). Our evaluation shows that effectiveness and stealth trade off across adversarial attacks, that black-box content rewriting matches or exceeds gradient-based attacks on rank promotion while producing more fluent text and can evade both keyword- and perplexity-based detection on some domains, and that the access model does not predict attack strength. By standardizing datasets, attack implementations, and metrics, GEO-Bench enables the first direct comparison across these attack paradigms and supports the development of detection methods.

2605.28952 2026-06-02 cs.CR cs.DS cs.IT cs.LG math.IT math.ST stat.TH

Optimal Rates for Differentially Private Hypothesis Testing with E-values

基于E值的差分隐私假设检验的最优速率

Ben Jacobsen, Tomas Gonzalez, Gavin Brown, Kassem Fawaz, Aaditya Ramdas

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究在ε-差分隐私约束下,使用e值进行假设检验时所能达到的最大e-power,并给出最优速率及匹配算法。

Comments Corrected typos; updated references; generalized proposition 3.1

详情
AI中文摘要

近年来,e值作为支持任意时刻有效(anytime-valid)和自适应数据分析的灵活工具引起了广泛关注。假设检验是许多此类应用的核心,而这些应用通常涉及私有或敏感数据。在这项工作中,我们回答了一个简单但重要的问题:给定两个分布 $\mathbb{P}$ 和 $\mathbb{Q}$,当使用满足 $\varepsilon$-差分隐私的e值检验 $X\sim \mathbb{P}^n$ 对 $X\sim\mathbb{Q}^n$ 时,所能达到的最大e-power是多少?我们刻画了该问题的最优速率,并提供了一个精确匹配的算法。在序贯设置中,当观测值逐个到达且分析者选择何时停止时,我们给出了任何私有e过程的停止时间的匹配上下界。数值实验证实了我们算法的实用性,在多种序贯检验问题和隐私水平下,我们的算法所需数据少于最近提出的DP-SPRT。

英文摘要

E-values have attracted considerable interest in recent years as flexible tools for enabling anytime-valid and adaptive data analysis. Hypothesis testing is at the core of many of these applications, which can often involve private or sensitive data. In this work, we answer a simple but important question: given two distributions $\mathbb{P}$ and $\mathbb{Q}$, what is the maximum achievable e-power when testing $X\sim \mathbb{P}^n$ against $X\sim\mathbb{Q}^n$ with e-values that satisfy $\varepsilon$-differential privacy? We characterize the optimal rate for this problem and provide an algorithm which matches it exactly. In the sequential setting, when observations arrive one-by-one and the analyst chooses when to halt, we give matching upper and lower bounds on the stopping times of any private e-process. Numerical experiments confirm the practicality of our algorithms, which require less data than the recently proposed DP-SPRT across a range of sequential testing problems and privacy levels.
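
For readers new to the terminology, the sketch below computes the non-private baseline that the question is measured against. It assumes the usual definitions: an e-value is a nonnegative statistic with expectation at most 1 under the null, and e-power is its expected logarithm under the alternative. The Bernoulli parameters and function names are illustrative, and nothing here implements the paper's private algorithm.

```python
import math

def lr_e_value(xs, p, q):
    """Likelihood-ratio e-value for Bernoulli(p) (null) vs Bernoulli(q)."""
    e = 1.0
    for x in xs:
        e *= (q if x == 1 else 1 - q) / (p if x == 1 else 1 - p)
    return e

def exact_expectation(theta, p, q, n, fn):
    """E[fn(e-value)] when the n samples are i.i.d. Bernoulli(theta).

    The likelihood ratio depends only on the number of ones, so we sum
    over k ~ Binomial(n, theta) instead of sampling.
    """
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * theta**k * (1 - theta) ** (n - k)
        xs = [1] * k + [0] * (n - k)
        total += prob * fn(lr_e_value(xs, p, q))
    return total

p, q, n = 0.5, 0.7, 10
mean_under_null = exact_expectation(p, p, q, n, lambda e: e)  # validity check
e_power = exact_expectation(q, p, q, n, math.log)             # growth under Q
kl = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
```

Without a privacy constraint the likelihood ratio attains e-power `n * kl`, so this value is the natural reference point when reading how much e-power an $\varepsilon$-DP e-value gives up.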

2605.25889 2026-06-02 cs.CR cs.LG

Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models

能力与鲁棒性不可兼得:视觉-语言-动作模型的信息论界

Jianwei Tai

发表机构 * Jianwei Tai(Tai Jianwei)

AI总结 本文证明视觉-语言-动作模型的能力与鲁棒性之间存在信息论权衡,能力与鲁棒性之和受限于任务熵与对抗信道容量之和,并通过实验验证了该界。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在干净输入上达到高成功率,但在小的对抗扰动下崩溃:$16/255$ PGD攻击将OpenVLA-7B在LIBERO上的成功率从$95\%$降至$5\%$以下。这种权衡是否存在理论下限此前未知。我们证明它存在。对于任何VLA策略,能力$I(A^\star;A_\pi)$与鲁棒性$I(A_\pi;\tilde{A}_\pi)-I(A_\pi;\delta)$之和至多为$H(A^\star)+I(X;\tilde{X})$,即任务熵加对抗信道容量。证明归结为两次应用数据处理不等式。像素级界宽松约$10^3$纳特,可作为上限保证;针对特定编码器的推论将其收紧一个数量级以上,进入实际能力已消耗$5$--$9\%$预算的区间。我们在$308$个单元中验证了主定理,零违反:$252$个闭式高斯VLA、$48$个OpenVLA-7B$+$LIBERO$+$PGD($4$个套件$\times$ $4$个$\epsilon$ $\times$ $3$个种子)、$4$个Square-Attack和$4$个多步($T{=}10$)。一个互补的可测性不等式$\mathrm{Rob}_{\text{disc}} \le \mathrm{Cap}_{\text{disc}}$进一步在跨越OpenVLA、OpenVLA-OFT(连续$L_1$)和SmolVLA(流匹配)的$144$个跨架构单元中成立。相同的构造产生了三个无标签诊断工具:部署前的编码器上限、定位输入侧与语言模型侧干预的防御取证探针,以及可在离散令牌、$L_1$回归和流匹配策略间比较的与输出头无关的鲁棒性比。这些工具共同提供了防御与架构比较目前所缺乏的跨设置比较轴。

英文摘要

Vision-Language-Action (VLA) models reach high success rates on clean inputs but collapse under small adversarial perturbations: a $16/255$ PGD attack drops OpenVLA-7B's LIBERO success from $95\%$ to under $5\%$. Whether this trade-off has a theoretical floor was open. We prove that it does. For any VLA policy, capability $I(A^\star;A_\pi)$ and robustness $I(A_\pi;\tilde{A}_\pi)-I(A_\pi;\delta)$ sum to at most $H(A^\star)+I(X;\tilde{X})$, the task entropy plus adversarial channel capacity. The proof reduces to two applications of the Data Processing Inequality. The pixel-level bound is loose by $\sim 10^3$ nats and serves as a ceiling guarantee; an encoder-specific corollary tightens it by over an order of magnitude, into a regime where realized capability already consumes $5$--$9\%$ of the budget. We validate our main theorem with zero violations across $308$ cells: $252$ closed-form Gaussian-VLA, $48$ OpenVLA-7B$+$LIBERO$+$PGD ($4$ suites $\times$ $4$ $\epsilon$ values $\times$ $3$ seeds), $4$ Square-Attack, and $4$ multi-step ($T{=}10$). A complementary measurability inequality $\mathrm{Rob}_{\text{disc}} \le \mathrm{Cap}_{\text{disc}}$ further holds across $144$ cross-architecture cells spanning OpenVLA, OpenVLA-OFT (continuous-$L_1$), and SmolVLA (flow-matching). The same construction yields three label-free diagnostics: a pre-flight encoder ceiling, a defense-forensics probe that localizes input-side vs. language-model intervention, and a head-agnostic robustness ratio comparable across discrete-token, $L_1$-regression, and flow-matching policies. Together these provide the cross-setting axis that defense and architecture comparisons currently lack.
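
Set out as a display equation, the bound stated in the abstract reads as follows. The notation assumes the abstract's undefined macros stand for $A^\star$ (target action), $A_\pi$ (policy action on clean input), $\tilde{A}_\pi$ (policy action on perturbed input), and $\tilde{X}$ (perturbed input); the abstract does not spell these out, so treat the reading as a best guess.

```latex
\underbrace{I(A^\star; A_\pi)}_{\text{capability}}
\;+\;
\underbrace{I(A_\pi; \tilde{A}_\pi) - I(A_\pi; \delta)}_{\text{robustness}}
\;\le\;
\underbrace{H(A^\star)}_{\text{task entropy}}
\;+\;
\underbrace{I(X; \tilde{X})}_{\text{adversarial channel capacity}}
```

For reference, the Data Processing Inequality states that for any Markov chain $U \to V \to W$, $I(U;W) \le I(U;V)$: processing a variable cannot increase the information it carries about its source.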

2602.07666 2026-06-02 cs.CR cs.AI

SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

SoK: DARPA 人工智能网络挑战赛 (AIxCC):竞赛设计、架构与经验教训

Cen Zhang, Younggi Park, Fabian Fleischer, Yu-Fu Fu, Jiho Kim, Dongkwan Kim, Youngjoon Kim, Qingxiao Xu, Andrew Chin, Ze Sheng, Hanqing Zhao, Michael Pelican, David J. Musliner, Jeff Huang, Jon Silliman, Mikel Mcdaniel, Jefferson Casavant, Isaac Goldthwaite, Nicholas Vidovich, Matthew Lehman, Taesoo Kim

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Texas A&M University(德克萨斯农工大学) Smart Information Flow Technologies (SIFT)(智能信息流技术公司) Kudu Dynamics(Kudu动态公司) Microsoft(微软)

AI总结 本文系统分析 DARPA 人工智能网络挑战赛 (AIxCC),探讨其竞赛设计、决赛系统的架构方法,并总结驱动性能的因素、技术进展及未来研究方向。

Comments Camera-ready version; Systematization of Knowledge and post-competition analysis of DARPA AIxCC (2023-2025)

详情
Journal ref
USENIX Security 2026
AI中文摘要

DARPA 的人工智能网络挑战赛 (AIxCC, 2023--2025) 是迄今为止规模最大的竞赛,旨在构建完全自主的网络推理系统 (CRS),利用人工智能的最新进展——特别是大型语言模型 (LLM)——来发现和修复真实世界开源软件中的漏洞。本文首次对 AIxCC 进行系统分析。基于设计文档、源代码、执行轨迹以及与组织者和参赛团队的讨论,我们审视了竞赛的结构和关键设计决策,描述了决赛 CRS 的架构方法,并分析了最终计分板之外的竞赛结果。我们的分析揭示了真正驱动 CRS 性能的因素,识别了各团队取得的技术进步,并指出了未来研究中仍需解决的局限性。最后,我们总结了组织未来竞赛的经验教训,以及在实际中部署自主 CRS 的更广泛见解。

英文摘要

DARPA's AI Cyber Challenge (AIxCC, 2023--2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CRSs) that leverage recent advances in AI -- particularly large language models (LLMs) -- to discover and remediate vulnerabilities in real-world open-source software. This paper presents the first systematic analysis of AIxCC. Drawing on design documents, source code, execution traces, and discussions with organizers and competing teams, we examine the competition's structure and key design decisions, characterize the architectural approaches of finalist CRSs, and analyze competition results beyond the final scoreboard. Our analysis reveals the factors that truly drove CRS performance, identifies genuine technical advances achieved by teams, and exposes limitations that remain open for future research. We conclude with lessons for organizing future competitions and broader insights toward deploying autonomous CRSs in practice.

2505.11158 2026-06-02 eess.IV cs.CV

Diffusion Models for Hyperspectral Image Analysis: A Comprehensive Review

扩散模型在高光谱图像分析中的应用:综述

Xing Hu, Xiangcheng Liu, Qianqian Duan, Lian Zhang, Huiliang Shang, Linhua Jiang, Haima Yang, Dawei Zhang

发表机构 * School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology(上海理工大学光电信息与计算机工程学院) School of Electronics and Electrical Engineering, Shanghai University of Engineering Science(上海工程技术大学电子电气工程学院) Medical Artificial Intelligence Lab, The First Hospital of Hebei Medical University, Hebei Medical University(河北医科大学第一医院医学人工智能实验室) Hangzhou Institute of Technology, Xidian University(西安电子科技大学杭州研究院)

AI总结 本文系统综述了扩散模型(包括去噪扩散概率模型和基于随机微分方程的生成框架)在高光谱图像处理中的最新进展,分类现有方法,强调其处理高维数据的优势,并与传统方法比较性能,特别关注变化检测和灾后异常识别等关键应用,同时讨论计算成本和训练稳定性等局限,并展望未来研究方向。

Comments Published in Neural Networks

详情
Journal ref
Neural Networks (2026) 109109
AI中文摘要

高光谱图像(HSI)分析在遥感、农业和环境监测中起着关键作用。然而,传统方法通常难以处理HSI数据中固有的高维度、光谱冗余和噪声,限制了其准确性和可扩展性。最近,扩散模型(包括去噪扩散概率模型和其他基于随机微分方程的生成框架)在捕捉复杂光谱空间结构和生成高保真HSI数据方面显示出强大潜力。这些模型为噪声抑制、数据增强、分类和异常检测等任务提供了有效解决方案。本文系统总结了扩散模型在HSI处理中的最新进展。我们对现有方法进行分类,强调其处理高维数据的优势,并与传统方法进行性能比较。特别关注变化检测和灾后异常识别等关键应用。本文还讨论了当前局限性,如计算成本和训练稳定性,并概述了潜在的研究方向。我们的主要贡献可总结如下:提供了基于扩散的HSI方法的系统分类,考察了它们在主要遥感任务中的应用,并提供了对未来研究潜在方向的见解。通过这些努力,本综述旨在支持社区利用深度学习模型实现更有效和高效的高光谱图像分析。

英文摘要

Hyperspectral image (HSI) analysis plays a critical role in remote sensing, agriculture, and environmental monitoring. However, traditional methods often struggle to handle the high dimensionality, spectral redundancy, and noise inherent in HSI data, limiting their accuracy and scalability. Recently, diffusion models, including denoising diffusion probabilistic models and other generative frameworks based on stochastic differential equations, have shown strong potential in capturing complex spectral-spatial structures and generating high-fidelity HSI data. These models offer effective solutions for tasks such as noise suppression, data augmentation, classification, and anomaly detection. This review presents a systematic summary of recent advances in diffusion models for HSI processing. We categorize existing methods, highlight their strengths in handling high-dimensional data, and compare their performance with conventional approaches. Special attention is given to critical applications such as change detection and post-disaster anomaly identification. The review also discusses current limitations, such as computational cost and training stability, and outlines potential research directions. Our main contributions can be summarized as follows: we provide a systematic taxonomy of diffusion-based HSI methods, examine their applications across major remote sensing tasks, and offer perspectives on potential directions for future research. With these efforts, this review seeks to support the community in harnessing deep learning models to achieve more effective and efficient hyperspectral image analysis.

2605.24248 2026-06-02 cs.CR cs.AI cs.SE

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

认证工具服务器准入:模型上下文协议的安全扩展

Alfredo Metere

发表机构 * Enclawed LLC(Enclawed公司)

AI总结 针对MCP协议缺乏信任机制的问题,提出mcp-attested扩展,通过离线签名的许可断言、默认拒绝的工具白名单,以及按发行版变体(flavor)门控的强制执行模式与防篡改审计日志,实现安全的服务器准入与工具边界控制。

详情
AI中文摘要

模型上下文协议(MCP)标准化了大语言模型(LLM)代理与外部工具服务器之间的消息交换,但并未标准化信任:主机读取服务器自行声明的工具列表并分发调用,却没有关于可以使用哪些服务器、在何种敏感级别下使用、或服务器的哪些工具在允许范围内的概念。这项工作源于一个具体需求:让Enclawed代理安全地使用Google外部运营的MCP服务器(Gmail、Calendar、Drive),在准入服务器的同时限制其可驱动的工具,且不改变MCP或Enclawed自身的工具应用程序编程接口(API)。我们构建的机制mcp-attested(已随开源的enclawed-oss发行版和enclaved变体发布)具有普适性:使未经中介的第三方连接对单个用户不安全的那个缺口,同样使受监管的部署无法通过认证。我们用三种增量式机制弥补这一缺口:(1)一个小型的、离线签名的许可断言,由服务器发布在众所周知(well-known)的统一资源标识符(URI)上,主机在分派任何工具调用之前依据固定的信任根对其进行验证;(2)一个默认拒绝的按服务器工具允许列表,因此准入一个服务器并不意味着信任它的每个工具;(3)一个按发行版变体(flavor)门控的强制执行模式,将检查从警告变为硬性拒绝,且每个决策都写入防篡改审计日志。我们给出了线路格式、验证算法、安全分析以及由LLM驱动的对抗性评估;随后以规范性的征求意见稿(RFC 2119)形式陈述该设计,包括模式、验证规则、错误注册表、well-known注册和机器可检查的一致性向量,以便它可以作为MCP附录被采纳,而不必重新发明。未扩展的主机会忽略该well-known文档,行为与现在完全相同。

英文摘要

The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not trust: a host reads a server's self-declared tool list and dispatches calls, with no notion of which servers it may use, at what sensitivity, or which of a server's tools are in bounds. This work grew out of a concrete need -- letting the Enclawed agent use Google's externally-operated MCP servers (Gmail, Calendar, Drive) safely, admitting the server and bounding the tools it may drive, without changing MCP or Enclawed's own tool application-programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion a server publishes at a well-known Uniform Resource Identifier (URI) and a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so admitting a server is not trusting its every tool; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials, with every decision written to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation; we then state the design in normative Request-for-Comments (RFC 2119) form -- schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors -- so it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as today.
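
The three mechanisms can be pictured with a small host-side sketch. Everything here is an assumption made for illustration: the function names, the assertion fields, and the server and tool names are hypothetical, and an HMAC over a shared demo key stands in for the offline signature checked against a pinned trust root (a real design would presumably use an asymmetric signature). The audit log is a plain list, not a tamper-evident one, and none of this reproduces the mcp-attested wire format.

```python
import hashlib
import hmac
import json

PINNED_KEY = b"demo-trust-root"  # stand-in for a pinned trust root (assumption)

def sign(assertion, key):
    """Produce a demo signature over a canonical JSON encoding."""
    payload = json.dumps(assertion, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def admit_server(assertion, signature, key=PINNED_KEY):
    """Mechanism 1: verify the clearance assertion before any dispatch."""
    return hmac.compare_digest(sign(assertion, key), signature)

def authorize(server, tool, admitted, allowlist, enforce, audit_log):
    """Mechanisms 2 and 3: deny-by-default allowlist plus enforcement mode.

    Returns True if the call may proceed. With enforce=False a violation
    is only logged as a warning; with enforce=True it is a hard denial.
    """
    allowed = server in admitted and tool in allowlist.get(server, set())
    decision = "allow" if allowed else ("deny" if enforce else "warn")
    audit_log.append({"server": server, "tool": tool, "decision": decision})
    return allowed or not enforce

# Hypothetical server and tools, for illustration only.
assertion = {"server": "mail.example", "clearance": "internal"}
signature = sign(assertion, PINNED_KEY)
admitted = {assertion["server"]} if admit_server(assertion, signature) else set()
allowlist = {"mail.example": {"read_message"}}
log = []

ok_read = authorize("mail.example", "read_message", admitted, allowlist, True, log)
ok_send = authorize("mail.example", "send_message", admitted, allowlist, True, log)
warn_send = authorize("mail.example", "send_message", admitted, allowlist, False, log)
tampered_ok = admit_server({**assertion, "clearance": "secret"}, signature)
```

The point of the separation is visible in `ok_send`: the server is admitted, yet a tool outside its allowlist is still refused once enforcement is on.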

2602.11210 2026-06-02 cs.SE cs.AI cs.LG

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox:用于构建软件工程智能体的无容器强化学习

Danlong Yuan, Wei Wu, Enhan Zhao, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SWE-MiniSandbox,一种轻量级无容器方法,通过内核级隔离和预缓存技术降低磁盘使用和准备时间,实现可扩展的强化学习训练。

详情
AI中文摘要

强化学习已成为训练软件工程智能体的关键范式,但现有流程通常依赖每个任务的容器进行隔离。在大规模场景下,预构建的容器镜像会带来显著的存储开销、缓慢的环境设置,并且需要容器管理权限。我们提出SWE-MiniSandbox,一种轻量级、无容器的方法,能够在无需牺牲隔离性的情况下实现SWE智能体的可扩展强化学习训练。SWE-MiniSandbox不依赖每个实例的容器,而是在由内核级机制支持的隔离工作空间中执行每个任务,从而大幅降低系统开销。它利用轻量级环境预缓存技术,消除了对庞大容器镜像的需求。因此,我们的方法将磁盘使用量降低到基于容器的流程所需的大约5%,并将环境准备时间缩短到容器基线的大约25%。实验结果表明,SWE-MiniSandbox实现了与标准基于容器的流程相当的评估性能。通过消除对重型容器基础设施的依赖,SWE-MiniSandbox为扩展基于强化学习的SWE智能体提供了一个实用且可访问的基础,特别是在资源受限的研究环境中。

英文摘要

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow down environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.

2605.15229 2026-06-02 cs.SE cs.AI

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

PBT-Bench:基于属性测试的AI智能体基准

Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du

发表机构 * Tsinghua University(清华大学) University of Washington(华盛顿大学) Beneficial AI Foundation(有益人工智能基金会)

AI总结 提出PBT-Bench基准,包含100个基于属性测试的问题,用于评估AI智能体从文档中推导语义不变量并生成输入策略的能力。

详情
AI中文摘要

现有的代码基准测试衡量的是智能体能否生成任何能复现已知bug的测试,或者能否生成修复描述问题的补丁。两者都没有分离出基于属性测试的独特技能:从文档中推导语义不变量,然后构建足够精确的输入生成策略,使得随机搜索能够揭示违规。我们引入了PBT-Bench,一个包含40个真实Python库中100个精心策划的基于属性测试问题的基准。每个问题注入一个或多个语义bug(共365个,平均每个问题3.65个),设计使得默认策略的随机输入几乎不会触发它们;智能体必须阅读库的文档,识别相关不变量,并指定一个Hypothesis @given策略,将质量集中在触发区域。bug按三个难度级别(L1-L3)分层,涵盖单约束边界bug到有状态、跨函数协议违规。我们在两种提示机制(开放式基线与显式Hypothesis脚手架)下评估了八个当代LLM,每个配置进行三次独立运行。在PBT引导提示下,模型间的bug召回率从42.1%到83.4%不等;在开放式基线下,从31.4%到76.7%不等。Hypothesis脚手架将中等能力模型提升了超过20个百分点,但对最强模型提升较小,有两个例外显示出退化,表明结构化提示可能干扰某些模型行为而非补充。最难的bug被证明是模型特定的:不同架构在不同问题上失败,留下没有单一模型能填补的持续空白。我们发布基准、测试框架和完整评估语料库,以支持下游关于文档基础的语义推理工作。

英文摘要

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding) for three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation, suggesting the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning.
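
The phrase "concentrates mass in the trigger region" can be made concrete with a toy comparison. The sketch uses only the standard library rather than Hypothesis itself, and the buggy function, its documented invariant, and both sampling ranges are invented for illustration; they are not drawn from PBT-Bench.

```python
import random

LIMIT = 1000

def clamp_buggy(x, limit=LIMIT):
    """Documented invariant: the result never exceeds `limit`.

    Injected bug: the comparison is off by one, so x == limit + 1
    slips through unclamped.
    """
    return limit if x > limit + 1 else x

def violates_invariant(x):
    """Property check derived from the documentation."""
    return clamp_buggy(x) > LIMIT

def count_hits(draw, trials, rng):
    """Number of sampled inputs that expose the violation."""
    return sum(violates_invariant(draw(rng)) for _ in range(trials))

def default_draw(rng):
    """Unconstrained integers: the trigger is one value in about 2e9."""
    return rng.randint(-10**9, 10**9)

def targeted_draw(rng):
    """Boundary-focused integers: the trigger is one value in five."""
    return rng.randint(LIMIT - 2, LIMIT + 2)

rng = random.Random(0)
default_hits = count_hits(default_draw, 500, rng)
targeted_hits = count_hits(targeted_draw, 500, rng)
```

In Hypothesis terms, the targeted sampler corresponds to passing something like `st.integers(min_value=LIMIT - 2, max_value=LIMIT + 2)` to `@given` instead of an unbounded `st.integers()`; the skill being benchmarked is working out from the documentation that the boundary is where to look.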

2605.19847 2026-06-02 cs.CR cs.IR cs.LG

Auditing Privacy in Multi-Tenant RAG under Account Collusion

多租户RAG中账户共谋下的隐私审计

Florian A. D. Burnat

发表机构 * University of Bath(巴斯大学)

AI总结 针对多租户RAG中同一索引下账户共谋导致隐私泄露加剧的问题,提出一种可验证的审计协议,用于认证噪声-选择检索并报告共谋上限内的隐私损失。

详情
AI中文摘要

多租户RAG服务通常将账户视为隐私边界:每个账户针对租户索引获得$(\varepsilon_{\text{acc}},\delta_{\text{acc}})$-DP检索保证。我们表明,这种框架低估了同一索引下账户共谋造成的泄露。对于高斯“先加噪后选择”(noise-then-select)检索,$k$个协同的同租户账户组合后的联合泄露为$\Theta(\sqrt{k}\,\varepsilon_{\text{acc}})$,而非$\varepsilon_{\text{acc}}$;我们给出与之匹配的成员推断攻击,并在标量、top-$K$、训练嵌入器和生产规模的HNSW设置中验证了预测的$\sqrt{k}$ AUC趋势。然后,我们给出一个验证者可运行的审计协议,该协议对noise-then-select检索进行认证,并针对规模不超过声明上限$k_{\max}$的联盟报告$(\textsf{PASS},\varepsilon_{\text{audit}})$,而不泄露索引或改变检索决策规则。该声明仅针对检索通道:生成通道泄露和对抗鲁棒的联盟规模估计是互补的审计谓词。

英文摘要

Multi-tenant RAG services often treat the account as the privacy boundary: each account receives an $(\varepsilon_{\text{acc}},\delta_{\text{acc}})$-DP retrieval guarantee against the tenant index. We show that this framing understates leakage under same-index account collusion. For Gaussian noise-then-select retrieval, $k$ coordinated same-tenant accounts compose to joint leakage $\Theta(\sqrt{k}\,\varepsilon_{\text{acc}})$, not $\varepsilon_{\text{acc}}$; we give a matching membership-inference attack and validate the predicted $\sqrt{k}$ AUC trend in scalar, top-$K$, trained-embedder, and production-scale HNSW settings. We then give a verifier-runnable audit protocol that attests noise-then-select retrieval and reports $(\textsf{PASS},\varepsilon_{\text{audit}})$ for coalitions up to a declared cap $k_{\max}$, without disclosing the index or changing the retrieval decision rule. The claim is retrieval-channel only: generation-channel leakage and adversarially robust coalition-size estimation are complementary audit predicates.
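
One way to see where a $\sqrt{k}$ factor can come from is the scalar case. The sketch assumes each of $k$ colluding accounts observes the same score plus independent Gaussian noise and that the coalition simply averages its observations; the gap and noise values are arbitrary. It is an intuition aid under those assumptions, not the paper's attack or its audit protocol.

```python
import math

def gaussian_auc(mean_gap, sigma):
    """AUC for separating N(0, sigma^2) from N(mean_gap, sigma^2).

    For two equal-variance Gaussians the AUC is Phi(mean_gap / (sigma * sqrt(2))).
    """
    z = mean_gap / (sigma * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def colluding_auc(mean_gap, sigma, k):
    """AUC when k accounts average their independent noisy scores.

    Averaging shrinks the noise standard deviation to sigma / sqrt(k),
    which is equivalent to widening the gap by a factor of sqrt(k).
    """
    return gaussian_auc(mean_gap, sigma / math.sqrt(k))

# Membership gap 0.5 against per-account noise sigma = 2.0 (made-up values).
aucs = {k: colluding_auc(0.5, 2.0, k) for k in (1, 4, 16)}
```

A single account sees an attack only slightly better than chance, while the advantage grows steadily with the coalition size, which is why a per-account guarantee alone understates the exposure.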

2605.18694 2026-06-02 math.OC cs.LG stat.ML

Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

自适应梯度方法能否在重尾噪声下收敛?以 AdaGrad 为例

Zijian Liu

发表机构 * Zijian Liu(刘子健)

AI总结 本文研究 AdaGrad 在重尾梯度噪声下的收敛性,首次证明当尾指数 p 满足 4/3 < p ≤ 2 时,无需先验知识即可获得非凸优化的收敛率,并给出了算法相关的下界。

Comments ICML 2026. v2: simplification of the proof

详情
AI中文摘要

现代机器学习中的许多任务在优化过程中观察到涉及重尾梯度噪声。为了应对这一现实且具有挑战性的场景,引入了新的机制,如梯度裁剪和梯度归一化,以确保一阶算法的收敛性。然而,自适应梯度方法,一类著名的现代优化器,包括流行的 $\mathtt{Adam}$ 和 $\mathtt{AdamW}$,即使没有上述任何额外操作,通常也表现良好。因此,自然要问:自适应梯度方法能否在重尾噪声下收敛而无需任何算法更改?在这项工作中,我们通过研究一个特例 $\mathtt{AdaGrad}$(自适应梯度方法的起源)迈出了回答这个问题的第一步。我们首次证明了当尾指数 $p$ 满足 $4/3 < p \leq 2$ 时,$\mathtt{AdaGrad}$ 在非凸优化中的可证明收敛率。值得注意的是,这一结果无需任何关于 $p$ 的先验知识,因此对尾指数是自适应的。此外,我们开发了一个算法相关的下界,表明现有的重尾优化极小极大速率无法由 $\mathtt{AdaGrad}$ 达到。最后,我们考虑了 $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$(理论研究中 $\mathtt{AdaGrad}$ 的一个流行变体),并证明了在额外温和假设下,对于任何 $1 < p \leq 2$ 都成立的改进速率。

英文摘要

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes the popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any of the extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.
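
For orientation, here is a scalar sketch of the AdaGrad-Norm update as it is commonly written (a single accumulated step-size denominator, with no clipping or normalization added). The objective $f(x) = x^2/2$, the symmetric Pareto-type noise with shape 1.5, and all hyperparameters are assumptions chosen for illustration; the sketch makes no claim about the rates proved in the paper.

```python
import math
import random

def heavy_tailed_noise(rng, shape=1.5):
    """Zero-mean symmetric noise with a Pareto tail.

    Its p-th absolute moment is finite only for p < shape, so with
    shape = 1.5 the variance is infinite.
    """
    magnitude = rng.paretovariate(shape) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, steps=2000, seed=0):
    """Run scalar AdaGrad-Norm on noisy gradients.

    b_t^2 = b_{t-1}^2 + g_t^2 and x_{t+1} = x_t - eta * g_t / b_t.
    Returns the final iterate and the history of b_t.
    """
    rng = random.Random(seed)
    x, b_sq = x0, b0 ** 2
    b_history = []
    for _ in range(steps):
        g = grad(x) + heavy_tailed_noise(rng)   # stochastic gradient
        b_sq += g * g                            # accumulate squared norm
        x -= eta * g / math.sqrt(b_sq)           # step is at most eta in size
        b_history.append(math.sqrt(b_sq))
    return x, b_history

# Gradient of f(x) = x^2 / 2 is x.
x_final, b_history = adagrad_norm(lambda x: x, x0=5.0)
```

Because $b_t^2 \ge g_t^2$, every step has magnitude at most `eta`, so a single extreme noise draw cannot throw the iterate arbitrarily far; it instead inflates $b_t$ and shrinks all later steps. That built-in damping is one informal reason to ask whether the method needs clipping at all.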

2605.14791 2026-06-02 astro-ph.IM astro-ph.CO cs.AI

Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

超越AI助手:迈向宇宙学中的自主发现

Licong Xu, Thomas Borrett

发表机构 * Institute of Astronomy, University of Cambridge(剑桥大学天文研究所) Kavli Institute for Cosmology, University of Cambridge(剑桥大学卡弗里宇宙学研究所) Cavendish Astrophysics, University of Cambridge(剑桥大学卡文迪许天体物理研究所)

AI总结 本文提出两种互补的智能体系统(CMBEvolve和CosmoEvolve),通过LLM引导的代码进化与树搜索以及虚拟多智能体研究实验室,实现宇宙学中的自主科学发现,并在弱引力透镜异常检测和ACT DR6数据分析中展示了初步成果。

Comments 4 pages, 2 figures, Contribution to the 2026 Cosmology session of the 60th Rencontres de Moriond

详情
AI中文摘要

人工智能智能体的最新进展正在将AI从工具推向自主科学发现。我们讨论了两种互补的宇宙学智能体系统:\texttt{CMBEvolve},通过LLM引导的代码进化和树搜索,针对具有明确定量目标的任务;以及\texttt{CosmoEvolve},通过虚拟多智能体研究实验室,针对开放式科学工作流。作为初步演示,我们将\texttt{CMBEvolve}应用于弱引力透镜图中的分布外检测,它通过代码进化迭代地提高基准分数;将\texttt{CosmoEvolve}应用于自主的ACT DR6数据分析,它识别出非平凡的、依赖于配对和尺度的行为,并生成分析级诊断。这些例子展示了宇宙学如何为AI科学家系统的发展同时提供受控的基准任务和现实的开放式研究问题。

英文摘要

Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two complementary agentic systems for cosmology: \texttt{CMBEvolve}, which targets tasks with explicit quantitative objectives through LLM-guided code evolution and tree search, and \texttt{CosmoEvolve}, which targets open-ended scientific workflows through a virtual multi-agent research laboratory. As preliminary demonstrations, we apply \texttt{CMBEvolve} to out-of-distribution detection in weak-lensing maps, where it iteratively improves the benchmark score through code evolution, and \texttt{CosmoEvolve} to autonomous ACT DR6 data analysis, where it identifies non-trivial pair- and scale-dependent behaviour and produces analysis-grade diagnostics. These examples show how cosmology can provide both controlled benchmark tasks and realistic open-ended research problems for the development of AI scientist systems.

2605.13430 2026-06-02 stat.ME cs.AI cs.LG

Towards a holistic understanding of Selection Bias for Causal Effect Identification

走向因果效应识别中选择偏差的整体理解

Yiwen Qiu, Filip Kovačević, Shimeng Huang, Peter Spirtes, Francesco Locatello

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 研究在观测研究中存在选择偏差时,如何利用弱假设刻画倾向得分和选择概率,给出平均处理效应可识别性的充要条件,扩展了现有图形识别准则。

Comments 9 pages for the main text, ICML 2026

详情
AI中文摘要

选择偏差在观测研究中普遍存在。例如,大规模生物库数据可能表现出“健康志愿者偏差”,即受访者比他们所要代表的人群更健康、社会经济地位更高。从这样的子人群中恢复因果效应是因果推断中的一个重要问题,因为从选定人群估计平均处理效应(ATE)可能导致对整个群体的ATE估计严重偏倚。本文研究了选择偏差下ATE的可识别性。我们利用概率类的弱假设刻画倾向得分和选择概率,给出了ATE可识别性的充要条件。与以往工作相比,我们的结果扩展了现有的图形可识别性准则,并在存在选择偏差的情况下,以严格更弱的条件提供了对因果效应识别更全面的理解。

英文摘要

Selection bias is pervasive in observational studies. For example, large-scale biobank data can exhibit "healthy volunteer bias" when respondents are healthier and of higher socio-economic status than the population they are meant to represent. Recovering causal effects from such a sub-population is an important problem in causal inference, as estimating the average treatment effect (ATE) from a selected population can result in a severely biased estimate of the ATE for the whole population. In this paper, we investigate the identifiability of the ATE under selection bias. We provide necessary and sufficient conditions for ATE identifiability, leveraging weak assumptions on probability classes to characterize the propensity score and selection probability. Compared to previous works, our results extend existing graphical identifiability criteria and offer a more comprehensive understanding of causal effect identification with strictly weaker conditions in the presence of selection bias.

2605.12768 2026-06-02 stat.ML cs.LG

ISOMORPH: A Supply Chain Digital Twin for Simulation, Dataset Generation, and Forecasting Benchmarks

ISOMORPH:用于仿真、数据集生成和预测基准的供应链数字孪生

Zhizhen Zhang, Hyemin Gu, Benjamin J. Zhang, Daniel Elenius, Michael Tyrrell, Theo J. Bourdais, Houman Owhadi, Markos A. Katsoulakis, Tuhin Sahai

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University of North Carolina(北卡罗来纳大学) SRI International(SRI国际) California Institute of Technology(加州理工学院)

AI总结 本文提出ISOMORPH,首个公开的多级物流网络数字孪生,通过可配置参数和模块化拓扑生成具有牛鞭效应等动态特性的数据集,并评估基础模型的零样本预测性能。

详情
AI中文摘要

开放的时间序列预测(TSF)基准涵盖零售、能源、天气和交通,但供应链物流仍未得到充分服务。我们引入了ISOMORPH,这是第一个具有可解释、用户可配置参数以及模块化拓扑、需求和控制规则的多级物流网络的公开数字孪生。该模拟器在离散时间上推进一个有向路由图:需求从库存中满足或记录为积压,并触发整个网络的补货。状态跟踪库存、未结订单、在途货物以及平滑的需求估计,在可处理的状态空间上产生马尔可夫动力学。发布的数据以经验一致的程度再现了牛鞭效应,同时三个守恒定律为模拟器扩展提供了验证工具。我们发布了两个目录规模(C=50和C=200)、六种场景扫描和20种拉丁超立方体扰动的数据集。这些数据集展示了固定TSF基准中基本缺失的动态特性,包括方差放大、级联瓶颈、制度转换以及通过共享宏观冲击的跨通道耦合。对四个基础模型(Chronos、Moirai、TimesFM和Lag-Llama)的零样本评估在低至中等预测范围上产生了超过公开GIFT-Eval参考的MASE值,支持将其纳入现有基准套件。相同的模型通过需求侧参数的拉丁超立方体扰动提供预测置信带,实现了标准TSF数据集上不可用的前向不确定性量化(UQ),并证明基础模型可以作为基于数字孪生的UQ的快速替代。代码(MIT):https://github.com/tuhinsahai/ISOMORPH。交互演示:https://huggingface.co/spaces/HyeminGu/ISOMORPH-demo。

英文摘要

Open time-series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply-chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi-echelon logistics network with interpretable, user-configurable parameters and modular topology, demand, and control rules. The simulator advances a directed routing graph in discrete time: demand is served from inventory or recorded as backlog and triggers replenishment throughout the network. The state tracks inventory, outstanding orders, in-transit shipments, and a smoothed demand estimate, yielding Markovian dynamics on a tractable state space. The released data reproduces the bullwhip effect at empirically consistent magnitudes, while three conservation laws provide verification tools for simulator extensions. We release datasets at two catalogue scales ($C=50$ and $C=200$), six scenario sweeps, and 20 Latin-hypercube perturbations. These datasets exhibit dynamics largely absent from fixed TSF benchmarks, including variance amplification, cascading bottlenecks, regime shifts, and cross-channel coupling through shared macro shocks. Zero-shot evaluation of four foundation models (Chronos, Moirai, TimesFM, and Lag-Llama) yields MASE values exceeding public GIFT-Eval references at low-to-moderate horizons, supporting incorporation into existing benchmark suites. The same models provide forecast confidence bands through Latin-hypercube perturbations of demand-side parameters, enabling forward uncertainty quantification (UQ) unavailable on standard TSF datasets and demonstrating that foundation models can serve as fast surrogates for digital-twin-based UQ. Code (MIT): https://github.com/tuhinsahai/ISOMORPH. Interactive demo: https://huggingface.co/spaces/HyeminGu/ISOMORPH-demo.
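
The simulator's core step (serve demand from inventory or record it as backlog, then replenish) can be sketched for a single node. This is a minimal illustration under assumptions not taken from ISOMORPH: one location, an unlimited upstream supplier, a fixed lead time of at least one period, and a plain base-stock ordering rule. The flow-balance assertion inside the loop is a generic conservation check, not one of the three laws the authors refer to.

```python
from collections import deque

def simulate(demands, base_stock=20, lead_time=2, initial_on_hand=20):
    """Simulate one inventory node in discrete time.

    Each period: receive the arriving shipment, serve backlog and new
    demand from stock, then order up to `base_stock` on the inventory
    position (on hand - backlog + in transit).
    """
    on_hand, backlog = initial_on_hand, 0
    in_transit = deque([0] * lead_time)      # shipments arriving in 1..lead_time periods
    total_ordered = total_demand = 0
    trace = []
    for demand in demands:
        on_hand += in_transit.popleft()      # receive what was ordered lead_time ago
        need = backlog + demand
        total_demand += demand
        shipped = min(on_hand, need)
        on_hand -= shipped
        backlog = need - shipped             # unmet demand carries over
        position = on_hand - backlog + sum(in_transit)
        order = max(0, base_stock - position)
        in_transit.append(order)
        total_ordered += order
        # Flow balance: units cannot appear or vanish.
        assert (on_hand - backlog + sum(in_transit)
                == initial_on_hand + total_ordered - total_demand)
        trace.append({"on_hand": on_hand, "backlog": backlog, "order": order})
    return trace

demands = [5, 5, 30, 5, 5, 5]                # one demand spike in period 3
trace = simulate(demands)
orders = [step["order"] for step in trace]
backlogs = [step["backlog"] for step in trace]
```

With this rule the orders simply echo demand, so the spike causes a two-period backlog but no amplification. Variance amplification of the bullwhip kind typically needs a forecast-driven order-up-to level, such as one based on a smoothed demand estimate, which this sketch leaves out.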

2603.29002 2026-06-02 cs.DC cs.AI

Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference

理解并加速大型语言模型推理的内存处理流水线

Zifan He, Rui Ma, Yizhou Sun, Jason Cong

发表机构 * GitHub

AI总结 本文通过将稀疏注意力、检索增强生成和压缩上下文内存等优化统一为四步内存处理流水线,识别出22%-97%的内存处理开销,并提出使用GPU-FPGA异构系统加速该流水线,实现最高2.2倍加速和4.7倍能效提升。

Comments Accepted by ICML 2026. Code: https://github.com/OswaldHe/HeteroLLM

详情
AI中文摘要

现代大型语言模型(LLMs)越来越依赖于高效的长上下文处理和生成机制,包括稀疏注意力、检索增强生成(RAG)和压缩上下文内存,以支持复杂推理。我们表明这些优化可以统一为一个四步内存处理流水线:准备内存、计算相关性、检索和应用到推理。通过系统分析,我们识别出LLM推理中22%-97%的内存处理开销及其计算特征的强异构性。受此洞察启发,我们认为异构系统非常适合加速内存处理,从而加速端到端推理。我们在GPU-FPGA系统上展示了这种方法,将稀疏、不规则和内存受限的操作卸载到FPGA,同时将计算密集型操作保留在GPU上。在AMD MI210 GPU和Alveo U55C FPGA上评估,我们的系统在多种LLM推理优化中比GPU基线快高达2.2倍,能耗降低高达4.7倍(在NVIDIA A100上结果类似)。这些结果确立了异构系统作为高效LLM内存处理的实用方向,并为未来异构硬件设计提供参考。

英文摘要

Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that \textbf{heterogeneous systems} are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is up to $2.2\times$ faster and uses up to $4.7\times$ less energy than the GPU baseline across multiple LLM inference optimizations (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
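
The four named steps map naturally onto a top-k attention-style lookup. The sketch below is one possible reading, with toy two-dimensional vectors and function names chosen to mirror the step names; the dot-product scoring and softmax weighting are assumptions, and the code is not taken from the HeteroLLM repository.

```python
import math

def prepare_memory(keys, values):
    """Step 1, Prepare Memory: pair each key vector with its value vector."""
    return list(zip(keys, values))

def compute_relevancy(query, memory):
    """Step 2, Compute Relevancy: dot product of the query with every key."""
    return [sum(q * k for q, k in zip(query, key)) for key, _ in memory]

def retrieve(scores, memory, top_k):
    """Step 3, Retrieval: keep the top_k highest-scoring entries."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(scores[i], memory[i][1]) for i in order[:top_k]]

def apply_to_inference(selected):
    """Step 4, Apply to Inference: softmax-weighted sum of selected values."""
    peak = max(score for score, _ in selected)
    weights = [math.exp(score - peak) for score, _ in selected]
    total = sum(weights)
    dim = len(selected[0][1])
    return [sum(w * value[d] for w, (_, value) in zip(weights, selected)) / total
            for d in range(dim)]

# Toy data, for illustration only.
keys = [[1, 0], [0, 1], [1, 1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2, 0]

memory = prepare_memory(keys, values)
scores = compute_relevancy(query, memory)
selected = retrieve(scores, memory, top_k=2)
output = apply_to_inference(selected)
```

Steps 2 and 3 touch every memory entry but do little arithmetic per entry, while step 4 is dense arithmetic over a small selection. That contrast is the kind of heterogeneity the abstract points to when it argues for splitting the pipeline across different hardware.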

2605.00696 2026-06-02 stat.ML cs.CL cs.LG

Adaptive Querying with AI Persona Priors

基于AI人格先验的自适应查询

Kaizheng Wang, Yuhang Wu, Assaf Zeevi

发表机构 * Department of Industrial Engineering and Operations Research and Data Science Institute, Columbia University(工业工程与运筹学系及数据科学研究所,哥伦比亚大学) Decision, Risk, and Operations Division, Columbia Business School(决策、风险与运营部门,哥伦比亚商学院)

AI总结 提出一种基于AI人格诱导的潜变量模型,利用大语言模型生成响应分布,实现高效贝叶斯设计,用于在有限查询预算下学习用户相关量。

Comments ICML 2026

详情
AI中文摘要

我们研究在严格查询预算内,通过自适应查询学习用户相关的感兴趣量(如对保留项目的响应和心理测量指标)的问题。经典的贝叶斯设计和计算机化自适应测试通常依赖于限制性的参数假设或昂贵的后验近似,限制了它们在异质性、高维和冷启动场景中的应用。我们引入了一种人格诱导的潜变量模型,通过有限字典中的AI人格成员身份来表示用户状态,每种人格由大语言模型产生的响应分布提供。这产生了具有闭式后验更新和高效有限混合预测的表达性先验,从而实现了可扩展的贝叶斯设计用于顺序项目选择。在合成数据和WorldValuesBench上的实验表明,基于人格的后验提供了准确的概率预测和可解释的自适应启发流程。

英文摘要

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight query budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user's state through membership in a finite dictionary of AI personas, each offering response distributions produced by a large language model. This yields expressive priors with closed-form posterior updates and efficient finite-mixture predictions, enabling scalable Bayesian design for sequential item selection. Experiments on synthetic data and WorldValuesBench demonstrate that persona-based posteriors deliver accurate probabilistic predictions and an interpretable adaptive elicitation pipeline.
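The closed-form posterior update and finite-mixture prediction mentioned in the abstract can be sketched as follows, assuming a finite set of personas, each with a known categorical response distribution per item. The function names are hypothetical, and the item-selection criterion shown (expected information gain about the persona) is an assumption for illustration; the abstract does not state which design criterion the paper uses.

```python
import math

def posterior_update(prior, likelihoods):
    # prior[k] = P(persona k); likelihoods[k] = P(observed answer | persona k).
    # Bayes' rule over a finite set gives a closed-form update.
    # Assumes at least one persona assigns nonzero probability to the answer.
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def mixture_prediction(weights, persona_dists):
    # persona_dists[k][a] = P(answer a | persona k) for a single item.
    # The predictive distribution is a finite mixture over personas.
    n_answers = len(persona_dists[0])
    return [sum(w * d[a] for w, d in zip(weights, persona_dists))
            for a in range(n_answers)]

def entropy(p):
    # Shannon entropy in nats, skipping zero-probability entries.
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_info_gain(weights, persona_dists):
    # Expected reduction in entropy over personas from asking this item
    # (an assumed criterion, not necessarily the paper's).
    pred = mixture_prediction(weights, persona_dists)
    gain = entropy(weights)
    for a, pa in enumerate(pred):
        if pa > 0:
            post = posterior_update(weights, [d[a] for d in persona_dists])
            gain -= pa * entropy(post)
    return gain
```

Under these assumptions, a sequential design loop would score each remaining item with `expected_info_gain`, ask the highest-scoring one, and call `posterior_update` with the observed answer before repeating until the query budget is spent.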

2604.26977 2026-06-02 cs.LO cs.AI

Defeasible Conditional Obligation in a Two-tiered Preference-based Semantics (Extended Version)

基于双层偏好语义的可废止条件义务(扩展版)

Xavier Parent

发表机构 * Technische Universität Wien (TU Wien)(维也纳技术大学)

AI总结 本文提出一种双层偏好语义框架,通过结合非单调推理机制和双序关系(理想性与正常性),解决可废止条件义务的逻辑建模问题,并与约束输入/输出逻辑建立联系。

Comments 13 pages. Extended version of a paper presented at KR 2026

详情
AI中文摘要

针对Horty提出的问题,本文开发了一种双层、基于偏好的语义框架,用于建模可废止条件义务。该文扩展了用于二元道义逻辑的Hansson-Lewis风格偏好语义,通过引入非单调推理机制,使得当新的、可能冲突的信息出现时,先前推导的义务可以被撤销。该方法是双偏好的:采用世界上的两种序关系——理想性和正常性——来弥补早期方法的不足,并为每种序关系提供独立的排序方法。在非单调层面,考虑了若干公设,包括前提强化、包含和无淹没。与所谓的约束输入/输出(I/O)逻辑——一种基于不同方法的现有规范推理标准——建立了联系。

英文摘要

In response to a concern raised by Horty, this paper develops a two-tiered, preference-based semantic framework for modeling defeasible conditional obligations. The paper extends a Hansson-Lewis style preference semantics for dyadic deontic logic by incorporating a nonmonotonic reasoning mechanism that enables previously derived obligations to be withdrawn when new, potentially conflicting information comes in. The account is bi-preferential: two orderings--ideality and normality--on worlds are employed to address shortcomings in earlier approaches, with a separate ranking method for each. At the nonmonotonic layer, a number of postulates are considered, including antecedent strengthening, inclusion and no-drowning. A connection is established with so-called constrained input/output (I/O) logic--an existing standard for normative reasoning based on a different methodology.

2604.26197 2026-06-02 cs.IR cs.LG

Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

面向LinkedIn招聘代理的分层长期语义记忆

Zhentao Xu, Shangjin Zhang, Emir Poyraz, Yvonne Li, Ye Jin, Xie Lu, Xiaoyang Gu, Karthik Ramgopal, Praveen Kumar Bodigutla, Xiaofeng Wang

发表机构 * LinkedIn Corporation(LinkedIn公司)

AI总结 提出分层长期语义记忆(HLTM)框架,通过构建模式对齐的记忆树,实现可扩展的语义知识摄入、隐私感知存储、低延迟检索和透明溯源,在LinkedIn招聘助手应用中使答案正确率提升超5%、检索F1提升超10%。

Comments Accepted to the Applied Data Science (ADS) track at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

大型语言模型(LLM)代理越来越多地应用于实际产品中,其中个性化和上下文感知的用户交互至关重要。实现此类能力的核心是代理的长期语义记忆系统,该系统从嘈杂的纵向行为数据中提取隐式和显式信号,以结构化形式存储,并支持低延迟检索。构建工业级LLM代理长期记忆面临五大挑战:可扩展性、低延迟检索、隐私约束、适应性和可观测性。我们提出了分层长期语义记忆(HLTM)框架,该框架将文本数据组织成模式对齐的记忆树,在多个粒度级别捕获语义知识,从而实现可扩展的摄入、隐私感知存储、低延迟检索和透明溯源;HLTM还进一步融入了适应机制以泛化到不同用例。在LinkedIn招聘助手上的广泛评估表明,HLTM使答案正确率提升超过5%,检索F1提升超过10%,同时显著推进了查询与索引延迟之间的帕累托前沿。HLTM已全面部署在LinkedIn招聘助手中,用于支持生产招聘工作流中的核心个性化功能。

英文摘要

Large Language Model (LLM) agents are increasingly used in real-world products, where personalized and context-aware user interactions are essential. A central enabler of such capabilities is the agent's long-term semantic memory system, which extracts implicit and explicit signals from noisy longitudinal behavioral data, stores them in a structured form, and supports low-latency retrieval. Building industrial-grade long-term memory for LLM agents raises five challenges: scalability, low-latency retrieval, privacy constraints, adaptability, and observability. We introduce the Hierarchical Long-Term Semantic Memory (HLTM) framework, which organizes textual data into a schema-aligned memory tree that captures semantic knowledge at multiple levels of granularity, enabling scalable ingestion, privacy-aware storage, low-latency retrieval, and transparent provenance; HLTM further incorporates an adaptation mechanism to generalize across diverse use cases. Extensive evaluations on LinkedIn's Hiring Assistant show that HLTM improves answer correctness by more than 5% and retrieval F1 by more than 10%, while significantly advancing the Pareto frontier between query and indexing latency. HLTM has been fully deployed in LinkedIn's Hiring Assistant to power core personalization features in production hiring workflows.