arXivDaily: arXiv daily academic digest, updated Monday through Friday

2606.04990 2026-06-17 cs.CR cs.AI (updated version)

From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents

Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Manqing Dong, Mingkai Zhang, Xuefei Yin, Yanming Zhu

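Affiliations: Griffith University; Jiangsu University; University of Southern Queensland; Peking University; Great Bay University; Nanjing University; Macquarie University; Southern University of Science and Technology

AI summary: This paper systematically surveys evidence tracing and execution provenance methods for LLM agents, connecting retrieval, tool use, memory, and related stages through a unified provenance perspective, proposing a taxonomy, and discussing open challenges.

Abstract (translated from the AI-generated Chinese version)

LLM-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities strengthen agent autonomy, but they also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supports each claim, whether tool calls were justified, how memory influenced later decisions, or where an execution failure originated. Evidence tracing and execution provenance close this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected during agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation is shifting from final-answer correctness to process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.

English abstract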

Large language model (LLM)-based agents are evolving from passive text generators into autonomous systems capable of planning, tool use, retrieval, memory access, environmental interaction, and multi-agent collaboration. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where failures originated. This survey examines evidence tracing and execution provenance as foundations for process-level accountability in trustworthy LLM agents. We define execution provenance as the typed graph of an agent execution and evidence tracing as its projection onto evidence-support relations. This perspective connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery within a unified framework. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We then review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, observability, and failure diagnosis. Finally, we discuss benchmarks, datasets, metrics, and open challenges for building provenance-aware, auditable, and recoverable agent systems.
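
The abstract above defines execution provenance as a typed graph and evidence tracing as its projection onto evidence-support relations. A minimal sketch of that idea follows. All class names, node kinds, and relation labels ("supports", "produces") are hypothetical choices for illustration and are not taken from the survey.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str  # hypothetical kinds: "evidence", "tool_output", "claim", "action"


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    relation: str  # hypothetical relations: "supports", "produces"


@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = Node(node_id, kind)

    def add_edge(self, src, dst, relation):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(src, dst, relation))

    def project(self, relations):
        """Return the subgraph that keeps only edges whose relation is in `relations`."""
        kept = [e for e in self.edges if e.relation in relations]
        touched = {e.src for e in kept} | {e.dst for e in kept}
        sub = ProvenanceGraph()
        sub.nodes = {nid: n for nid, n in self.nodes.items() if nid in touched}
        sub.edges = kept
        return sub


# Toy execution: one retrieved document, one tool call, one claim.
graph = ProvenanceGraph()
graph.add_node("doc1", "evidence")
graph.add_node("call1", "action")
graph.add_node("out1", "tool_output")
graph.add_node("claim1", "claim")
graph.add_edge("call1", "out1", "produces")
graph.add_edge("doc1", "claim1", "supports")
graph.add_edge("out1", "claim1", "supports")

# Evidence tracing as a projection onto evidence-support edges only.
evidence_trace = graph.project({"supports"})
```

In this toy run the projection keeps the two "supports" edges and drops the tool-call node, which takes part only in a "produces" edge.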

2605.26195 2026-06-17 cs.CR cs.AI (updated version)

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

Yihe Fan, Changyi Li, Lichen Xu, Xudong Pan, Jiarun Dai, Hong Geng, Min Yang

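Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai Pudong Research Institute of Cryptology

AI summary: Proposes the CyberEvolver framework, which uses a four-layer evolvable architecture, a trace-to-diagnosis mechanism, and population-based beam search to let cybersecurity agents self-evolve their scaffolds from failure experience, raising the success rate by 13.6% on average.

Abstract (translated from the AI-generated Chinese version)

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that adapt poorly to different targets and failure modes. We propose CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often masked by the environment, and low-diversity updates can cause errors to accumulate over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture (which decomposes scaffold optimization into structured components), a trace-to-diagnosis mechanism (which turns noisy execution logs into actionable revision signals), and a population-based beam search strategy (which preserves diverse agent variants during evolution). We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver raises the seed agent's success rate by 13.6% on average and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.

English abstract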

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver improves the seed agent's success rate by 13.6% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.
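
To make the phrase "population-based beam search" concrete, here is a generic sketch of the search pattern, not CyberEvolver's implementation. The scaffold representation (a tuple of integers), the `toy_mutate` and `toy_score` functions, and every parameter name are assumptions invented for this example; a real system would mutate agent scaffolds and score them by task success.

```python
import random


def evolve(seed, mutate, score, beam_width=3, children_per_parent=4,
           generations=10, rng=None):
    """Keep the `beam_width` best distinct variants in each generation."""
    rng = rng or random.Random(0)
    beam = [seed]
    for _ in range(generations):
        # Parents stay in the pool, so the best score never decreases.
        candidates = list(beam)
        for parent in beam:
            candidates.extend(mutate(parent, rng) for _ in range(children_per_parent))
        # Drop exact duplicates so the beam holds distinct variants.
        unique = list(dict.fromkeys(candidates))
        unique.sort(key=score, reverse=True)
        beam = unique[:beam_width]
    return beam


# Hypothetical stand-in for a scaffold: three integer "settings".
TARGET = (3, 1, 4)


def toy_score(scaffold):
    # Higher is better; 0 means the scaffold matches the target exactly.
    return -sum(abs(a - b) for a, b in zip(scaffold, TARGET))


def toy_mutate(scaffold, rng):
    # Change one randomly chosen setting by +1 or -1.
    i = rng.randrange(len(scaffold))
    step = rng.choice((-1, 1))
    return scaffold[:i] + (scaffold[i] + step,) + scaffold[i + 1:]


final_beam = evolve((0, 0, 0), toy_mutate, toy_score)
```

Keeping several variants instead of a single best candidate is the simplest way to retain some diversity. The deduplication step here is only a crude proxy for whatever diversity mechanism the paper uses.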

2605.29669 2026-06-17 stat.ML cs.LG math.PR math.ST stat.TH (updated version)

Eigen-Spike Emergence and Quadratic Equivalents for Conjugate Kernels on Nonlinearly Separable Data

Collin Cranston, Zhichao Wang, Todd Kemp, Michael W. Mahoney

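Affiliations: Department of Mathematics ICSI and Department of Statistics; University of California, San Diego, USA; University of California, Berkeley, USA; Department of Mathematics ICSI, LBNL and Department of Statistics

AI summary: For nonlinearly separable data (the XOR problem), the paper uses a quadratic equivalent model of the conjugate kernel matrix to analyze the emergence of outlier eigenvalues and the BBP-type phase transitions in their alignment with labels, revealing how sample complexity, signal-to-noise ratio, activation function, and pretrained features affect nonlinear learnability.

Comments 81 pages, 8 figures

Abstract (translated from the AI-generated Chinese version)

Recent work in random matrix theory (RMT) has developed the notion of deterministic equivalents: typically linear surrogate models that approximate the spectral behavior of large nonlinear random matrices, such as the nonlinear feature maps in neural networks. On the one hand, these deterministic equivalents make theoretical predictions tractable by reducing a complex model to a simpler one whose properties fall within reach of classical RMT tools. However, this leaves open the question of whether such an idealized linear equivalence remains meaningful for high-dimensional nonlinearly separable data, for example when classifying such data. Motivated by this, we consider the conjugate kernel (CK), the nonlinear feature map of a feedforward neural network, on the XOR problem, a canonical nonlinearly separable dataset; we use the study of informative outlier eigenvalues in the CK, and of whether their corresponding eigenvectors asymptotically align with the XOR labels, as a proxy for nonlinear learnability. We develop a robust quadratic equivalent of the spiked CK matrix, which enables a precise analysis of the informative spikes that emerge as one adjusts various knobs common in machine learning practice: sample complexity, signal-to-noise ratio, choice of nonlinear activation, and pretrained features. In each case, we derive the exact BBP-type phase transition at which linear classification via the CK eigenvectors becomes possible. Our analysis helps translate the power of deterministic-equivalent tools from RMT to the study of practically relevant problems in machine learning.

English abstract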

Recent work in random matrix theory (RMT) has developed the notion of deterministic equivalents: typically linear surrogate models that approximate the spectral behavior of large nonlinear random matrices, such as nonlinear feature maps in neural networks (NNs). Such equivalents make theoretical predictions tractable by reducing a complex model to a simpler one with properties that fall under the umbrella of classical RMT tools. However, this leaves open the question of whether this idealized linear equivalence remains meaningful for classification of high-dimensional nonlinearly separable data. Motivated by this, we consider the conjugate kernel (CK), which is the nonlinear feature map of a one-layer feedforward NN, under a canonical nonlinearly separable dataset for the XOR problem; and we use the study of informative outlier eigenvalues in the CK and whether their corresponding eigenvectors asymptotically align with XOR labels as a proxy for nonlinear learnability. We develop a robust quadratic equivalent of the CK matrix that enables a precise analysis of emergent informative spikes, as one modifies various knobs common in ML practice: sample complexity, signal-to-noise ratio (SNR), nonlinear activation choice, and pretrained features. We identify regimes in which these knobs move the CK beyond the linear equivalent and produce BBP-type transitions to label-aligned outlier eigenspaces. Our analysis helps bring deterministic-equivalence tools from RMT to bear on problems of practical relevance in ML.
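
For readers unfamiliar with the objects named in the abstract above, the following sketches one common way to write the conjugate kernel and a label-alignment score. The symbols ($X$, $W$, $\sigma$, $d$, $n$, $N$, $v$, $y$) and the $1/N$ normalization are assumptions made for illustration; the paper's own conventions may differ.

```latex
% Data X \in \mathbb{R}^{d \times n}, random first-layer weights W \in \mathbb{R}^{N \times d},
% activation \sigma applied entrywise.
Y = \sigma(WX) \in \mathbb{R}^{N \times n},
\qquad
K_{\mathrm{CK}} = \frac{1}{N}\, Y^{\top} Y \in \mathbb{R}^{n \times n}.

% Alignment of a unit-norm eigenvector v of K_{\mathrm{CK}} with labels y \in \{\pm 1\}^{n}:
\mathrm{align}(v, y) = \frac{\langle v, y \rangle^{2}}{n} \in [0, 1].
```

The upper bound follows from the Cauchy-Schwarz inequality, since $\lVert v \rVert = 1$ and $\lVert y \rVert^{2} = n$. An outlier eigenvector is informative in this sense when its alignment stays bounded away from zero as $n$ grows.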

2604.01904 2026-06-17 cs.CR cs.AI (updated version)

Combating Data Laundering in LLM Training

Muxing Li, Zesheng Ye, Sharon Li, Feng Liu

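Affiliations: University of Melbourne; University of Wisconsin-Madison

AI summary: To address data laundering (hiding data provenance by transforming style), which defeats conventional detection, the paper proposes SDR, which uses an auxiliary LLM to infer the transformation goal and synthesize queries, markedly strengthening data-misuse detection.

Comments 29 pages, 2 figures

Abstract (translated from the AI-generated Chinese version)

Data rights holders can detect unauthorized data use in large language model (LLM) training by querying the model with proprietary samples. Typically, a model performing better on a sample than on untrained data (for example, higher confidence or lower loss) implies that the sample belongs to the training corpus, because LLMs perform better on data they saw during training. However, this detection becomes brittle under data laundering, a practice that preserves key information but alters the style and form of proprietary data to obscure its provenance. When an LLM is trained only on laundered variants, it no longer performs better on the original data, which removes the signal that standard detection relies on. We address this by inferring the unknown laundering transformation from black-box access to the target LLM and using an auxiliary LLM to synthesize queries that mimic the laundered data, even when the rights holder has only the original data. Because the search space for the true laundering transformation is unbounded, we abstract the process into a high-level transformation goal (for example, "lyrical rewriting") and concrete details (for example, "use vivid imagery"), and introduce Synthesis Data Reversion (SDR) to instantiate this abstraction. SDR first identifies the most likely synthesis goal to narrow the search, then iteratively refines the details so that the synthesized queries gradually elicit stronger detection signals from the target LLM. Evaluations on the MIMIR benchmark across multiple laundering practices and target LLM families (Pythia, Llama2, and Falcon) show that SDR consistently strengthens data-misuse detection, providing a practical countermeasure to data laundering.

English abstract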

Post-hoc unauthorized-training data detection for large language models (LLMs) typically assumes a query-with-originals regime: rights holders query a target LLM with raw proprietary data and assess whether the model assigns them stronger memorization-based detection signals, e.g., higher confidence or lower loss, than held-out non-training reference texts. We show that this regime becomes brittle under data laundering, where the target LLM is trained on semantics-preserving but stylistically or structurally transformed surrogates of proprietary data to obfuscate provenance. Since training-time exposure occurs in the laundered form, memorization signals may no longer appear on the originals, collapsing the candidate-reference signal separation that standard detectors rely on. We counter this threat by studying laundering-aware detection with raw proprietary data, a held-out reference corpus, and query access to the target LLM, while the laundering transformation is undisclosed. Since exact recovery of the laundered corpus is infeasible, we infer a detection-useful synthesis process via an auxiliary LLM that maps originals into training-like queries. To make this search tractable, we introduce Synthesis Data Reversion (SDR), which constrains the unbounded space of natural-language transformations through a goal-details abstraction: a high-level transformation goal, e.g., "lyrical rewriting", and fine-grained details, e.g., "with vivid imagery". SDR identifies the most likely goal and iteratively refines details so synthesized queries elicit stronger target-model detection signals. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently restores detection signals, offering a practical auditing layer against data laundering.
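
The "candidate-reference signal separation" mentioned in the abstract above can be illustrated with a small sketch. It scores each text by negative loss and measures separation with a pairwise AUC. The function name `pairwise_auc` and all loss values are made up for this example; they are not results from the paper, and the paper's detectors may use other signals.

```python
def pairwise_auc(candidate_scores, reference_scores):
    """Probability that a random candidate outscores a random reference (ties count half)."""
    wins = 0.0
    for c in candidate_scores:
        for r in reference_scores:
            if c > r:
                wins += 1.0
            elif c == r:
                wins += 0.5
    return wins / (len(candidate_scores) * len(reference_scores))


def to_scores(losses):
    # Lower loss means a stronger memorization signal, so negate it.
    return [-loss for loss in losses]


# Made-up per-text losses from a hypothetical target model.
reference_losses = [2.0, 2.1, 1.9]            # held-out non-training texts
originals_if_memorized = [1.0, 1.2, 0.9]      # model trained on the originals
originals_if_laundered = [2.0, 1.9, 2.2]      # model trained only on laundered variants

auc_memorized = pairwise_auc(to_scores(originals_if_memorized), to_scores(reference_losses))
auc_laundered = pairwise_auc(to_scores(originals_if_laundered), to_scores(reference_losses))
```

With these toy numbers the memorized case separates perfectly (AUC 1.0), while the laundered case falls to about 0.44, close to chance. That collapse is the failure mode the abstract describes.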

2605.29526 2026-06-17 cs.CR cs.AI cs.LG (updated version)

Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection

Runang He, Tongya Zheng, Huiling Peng, Yuanyu Wan, Bingde Hu, Jiawei Chen, Canghong Jin, Mingli Song, Can Wang

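Affiliations: State Key Laboratory of Blockchain and Data Security; Zhejiang Provincial Engineering Research Center for Real-Time SmartTech in Urban Security Governance; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

AI summary: Proposes the TEMG-TTA framework, which captures temporal motif distributions and applies a test-time adaptation strategy to address pattern evolution and out-of-distribution problems in blockchain anomaly detection, with an average improvement of 54.88% across 5 datasets.

Comments Accepted to IJCAI-ECAI 2026, Special Track on AI for Social Good

Abstract (translated from the AI-generated Chinese version)

Ever-evolving transaction patterns seriously hinder anomaly detection on emerging cryptocurrency blockchains, owing to the vast number of addresses and the diversity of anomalous behaviors. Advanced graph anomaly detection (GAD) methods recently applied to blockchains face two key challenges: adversarial pattern evolution by malicious actors, and the out-of-distribution (OOD) problem caused by varied transaction semantics on blockchains. To address these challenges, we propose a novel framework called Temporal Motif-aware Graph Test-Time Adaptation (TEMG-TTA). First, we comprehensively capture the 3-node temporal motif distribution of each active address through an efficient computational mechanism, enabling downstream temporal motif-aware graph learning. Second, we design a simple yet effective test-time adaptation strategy that promotes the sharing of common patterns between training and test graphs. Extensive experiments on 5 real-world datasets show that TEMG-TTA outperforms state-of-the-art GAD methods by 54.88% on average. A further case study on interpretable motif patterns shows that TEMG-TTA explicitly characterizes the complex transaction patterns of anomalous addresses, validating the effectiveness of our technical design. Our code will be made public at https://github.com/LuoXishuang0712/TEMG-TTA/.

English abstract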

Ever-evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast number of addresses and diverse anomalous behaviors. Recently, advanced Graph Anomaly Detection (GAD) approaches applied to blockchains have faced two critical challenges: adversarial pattern evolution by malicious actors and the out-of-distribution (OOD) problem caused by varied transaction semantics on blockchains. To address these challenges, we propose a novel framework termed TEmporal Motif-aware Graph Test-Time Adaptation (TEMG-TTA). First, we comprehensively capture the 3-node temporal motif distribution of each active address using an efficient computational mechanism, enabling downstream temporal motif-aware graph learning. Second, we design a simple yet effective test-time adaptation strategy to facilitate the sharing of common patterns between training and testing graphs. Extensive experiments on 5 real-world datasets demonstrate that our proposed TEMG-TTA outperforms state-of-the-art GAD approaches by an average of 54.88%. A further case study on interpretable motif patterns reveals that TEMG-TTA explicitly characterizes the complex transaction patterns of anomalous addresses, thereby verifying the effectiveness of our technical designs. Our code is publicly available at https://github.com/LuoXishuang0712/TEMG-TTA/.
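
As a rough illustration of what a per-address temporal motif count can look like, the sketch below counts only one simple 3-node pattern: a time-respecting two-hop path u -> v -> w whose second transfer follows the first within a window `delta`. This is a deliberately simplified stand-in, not the paper's motif set or its efficient mechanism. The function name, the edge format `(src, dst, time)`, and the toy transactions are all assumptions.

```python
from collections import defaultdict


def count_two_hop_paths(edges, delta):
    """Count, per middle address v, paths u -> v -> w with t1 < t2 <= t1 + delta."""
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    for src, dst, t in edges:
        outgoing[src].append((dst, t))
        incoming[dst].append((src, t))

    counts = defaultdict(int)
    for v in set(incoming) & set(outgoing):
        for u, t1 in incoming[v]:
            for w, t2 in outgoing[v]:
                # Require three distinct addresses and a time-ordered pair of transfers.
                if len({u, v, w}) == 3 and t1 < t2 <= t1 + delta:
                    counts[v] += 1
    return dict(counts)


# Toy transactions: (sender, receiver, timestamp).
toy_edges = [
    ("a", "b", 1),
    ("b", "c", 3),
    ("b", "d", 10),
    ("c", "b", 4),
    ("b", "a", 5),
]
motif_counts = count_two_hop_paths(toy_edges, delta=5)
```

Here only address "b" sits in the middle of qualifying paths (a -> b -> c and c -> b -> a). A fuller treatment would enumerate several motif types and normalize the counts into a distribution per address.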

2605.29179 2026-06-17 cond-mat.mtrl-sci cs.AI (updated version)

Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era

Reid A. Coyle, Shyam Chand Pal, Peter Walther, Saeun Park, Bin Feng, Zhiling Zheng

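Affiliations: Department of Chemistry, Washington University; Institute of Materials Science & Engineering, Washington University

AI summary: This paper discusses design principles for metal-organic frameworks (MOFs) that harvest water under arid conditions and describes how artificial intelligence (AI), large language models (LLMs), and data mining can accelerate the discovery of high-performance sorbents.

Comments 10 pages of main text, 26 total pages. 3 Figures and 1 Table of Content Graphic

Abstract (translated from the AI-generated Chinese version)

Metal-organic frameworks (MOFs) are excellent candidates for water harvesting because of their tunable pore environments, which can be precisely designed to capture and release water under arid conditions. Integrating artificial intelligence (AI) into MOF discovery can further accelerate the design of high-performance sorbents by identifying structural features that enhance atmospheric water harvesting (AWH), stability, and cycling efficiency. In this Perspective, we examine key MOF design principles, including cooperative adsorption, operational relative humidity (RH), uptake capacity, hysteresis, and scalability. We highlight recent design advances, such as multivariate strategies and long-arm linker extension, and examine how these principles tune pore capacity and hydrophilicity while maintaining stability and crystallinity. In addition, we discuss how AI, large language models (LLMs), and data mining can accelerate the discovery of next-generation MOF water harvesters through predictive synthesis, inverse design, and the elucidation of synthesis-structure-property relationships.

English abstract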

Metal-organic frameworks (MOFs) are excellent candidates for water harvesting due to their tunable pore environments, which can be precisely engineered to capture and release water in arid conditions. Integrating artificial intelligence (AI) into MOF discovery can further accelerate the design of high-performance sorbents by identifying structural features that enhance atmospheric water harvesting (AWH), stability, and cycling efficiency. In this Perspective, we examine key MOF design principles, including cooperative adsorption, operational relative humidity (RH), uptake capacity, hysteresis, and scalability. We highlight recent design advancements such as multivariate strategies and long-arm linker extension, and examine how these principles tune pore capacity and hydrophilicity, while preserving stability and crystallinity. Furthermore, we discuss how AI, large language models (LLMs), and data mining can accelerate the discovery process through predictive synthesis, inverse design, and elucidating synthesis-structure-property relationships for the next generation of MOF water harvesters.

2605.23243 2026-06-17 cs.CR cs.AI (updated version)

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri

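Affiliations: super-intel.ai

AI summary: Using a dual-mode benchmark of white-box function-level vulnerability detection and black-box web application security testing, the paper evaluates frontier LLMs on cybersecurity tasks, finding high false positive rates and low coverage, whereas domain-specialized models markedly improve performance through structured methodology.

Abstract (translated from the AI-generated Chinese version)

We evaluate whether frontier large language models are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, covering C/Java/Python) and black-box web application security testing (five production-style applications containing 118 real vulnerabilities across more than 20 CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces a 10-50% false positive rate in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models reach only 4-8% ground-truth vulnerability coverage, rising to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) the structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection rates above 50%, indicating that methodology rather than scale is the main lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) of all models on a single GPU. We identify the lack of structured security testing traces (end-to-end request/response sequences, failure-heavy data, multi-step attack chains) as the fundamental training-data bottleneck and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.

English abstract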

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
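
The abstract above reports precision, false positive rate, and ground-truth coverage. The sketch below shows how these three quantities are conventionally computed from counts. The confusion counts and the number of vulnerabilities found are hypothetical; only the total of 118 ground-truth vulnerabilities comes from the abstract, and the paper's exact metric definitions may differ.

```python
def precision(tp, fp):
    """Share of flagged items that are truly vulnerable."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def false_positive_rate(fp, tn):
    """Share of benign items that were wrongly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def coverage(found, total):
    """Share of ground-truth vulnerabilities that were discovered."""
    return found / total


# Hypothetical white-box confusion counts (not from the paper).
p = precision(tp=90, fp=10)
fpr = false_positive_rate(fp=10, tn=90)

# Hypothetical black-box run: 9 of the benchmark's 118 vulnerabilities found.
cov = coverage(found=9, total=118)
```

A hypothetical run that finds 9 of the 118 vulnerabilities has coverage of about 7.6%, which falls inside the 4-8% band the abstract reports for frontier models without external tools.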

2602.14211 2026-06-17 cs.CR cs.AI (updated version)

SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr

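Affiliations: Nanyang Technological University, Singapore; Chongqing University, China; Northeastern University, China; Sun Yat-sen University, China; University of Oxford, UK

AI summary: SkillJect is the first framework to automatically generate effective poisoned skills; by hiding the malicious payload and rewriting the instruction channel, it increases attack effectiveness and exposes a persistent attack vector in reusable skill ecosystems.

Abstract (translated from the AI-generated Chinese version)

By hiding the malicious payload and rewriting the instruction channel, SkillJect effectively automates skill-based prompt injection, increasing attack effectiveness against skill-enabled agents and exposing a persistent attack vector in reusable skill ecosystems.

English abstract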

Agent skills extend LLM agents with task-specific instructions, executable scripts, and auxiliary resources, improving reusability but creating a new supply-chain attack surface. A malicious or compromised skill can be repeatedly loaded as trusted guidance and steer downstream tool use. Existing skill-based prompt-injection attacks are often manual and brittle, because explicit malicious instructions are rejected or ignored when they are not aligned with the original workflow. We propose SkillJect, the first automated framework for generating poisoned skills against skill-enabled agent systems. SkillJect uses two coordinated channels. In the artifact channel, it hides the payload inside an auxiliary helper script. In the instruction channel, it rewrites SKILL.md with a front-loaded inducement strategy, placing injected content at the beginning and framing the helper script as a mandatory prerequisite or initialization step. The rewritten instruction explicitly references the helper-script path and provides an executable example command, making the helper appear to be a legitimate setup step before normal skill operations. SkillJect further adopts a closed-loop multi-agent process to improve attack effectiveness. An Attack Agent generates poisoned skills, a Victim Agent executes downstream tasks with the poisoned skill, and an Evaluate Agent inspects execution traces to determine whether the hidden payload was executed. The Attack Agent then uses this feedback to diagnose failure causes and rewrite SKILL.md, while keeping the payload fixed. Experiments across skill-enabled platforms, backend LLMs, and attack categories show that SkillJect substantially outperforms naive direct injection and prior manual skill-injection attacks, highlighting poisoned skills as a persistent threat in reusable skill ecosystems.

2511.19162 2026-06-17 cs.IR cs.CY cs.HC cs.LG cs.MM (updated version)

BioArtlas: Computational Clustering of Multi-Dimensional Complexity in Bioart

Joonhyung Bae

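Affiliations: Graduate School of Culture Technology

AI summary: This paper presents BioArtlas, which analyzes 81 bioart works along multiple dimensions using a novel axis-aware representation, reveals four organizational patterns, and offers analysis and exploration through an interactive web interface.

Comments Bae, J. BioArtlas: Computational Clustering of Multi-Dimensional Complexity in Bioart. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Creative AI Track: Humanity

Abstract (translated from the AI-generated Chinese version)

The hybrid nature of bioart spans art, science, technology, ethics, and politics, challenging traditional single-axis classification. I present BioArtlas, which uses a novel axis-aware representation to analyze 81 bioart works across thirteen curated dimensions. The codebook approach groups related concepts into unified clusters, resolving the polysemy of cultural terminology. A comprehensive evaluation of up to 800 combinations of representation space and algorithm finds that agglomerative clustering with k=15 on 4D UMAP performs best (silhouette coefficient 0.664±0.008, trustworthiness/continuity 0.805/0.812). The approach reveals four organizational patterns: artist-specific methodological cohesion, technique-based segmentation, temporal artistic evolution, and conceptual affinities across time. By separating analytical optimization from public communication, I provide both rigorous analysis and accessible exploration through an interactive web interface (https://www.bioartlas.com), with the dataset publicly available (https://github.com/joonhyungbae/BioArtlas).

English abstract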

Bioart brings living material into artistic practice, where a single work can be at once an aesthetic object, a scientific instrument, and an ethical provocation. Traditional categories sort such works along one axis at a time, which flattens the very hybridity that defines the field and leaves curators no way to compare works across many dimensions together. I introduce BioArtlas, a computational atlas that represents each bioartwork along many curated dimensions at once and organizes the field by conceptual similarity rather than by medium or chronology. My method embeds the keywords of all 81 works on each of thirteen interpretive axes, groups related concepts into a shared codebook that tames inconsistent terminology, and then searches systematically for a clustering that is both statistically clean and interpretable. Among the methods that place every work on the map, agglomerative clustering separates the field far more cleanly than the usual k-means baseline (silhouette 0.664 versus 0.483), whereas density-based methods reach higher scores only by discarding most of the corpus as noise. By separating rigorous analysis from public storytelling, BioArtlas turns the tangled complexity of bioart into a navigable landscape, openly available as an interactive interface (https://www.bioartlas.com) and dataset (https://github.com/joonhyungbae/BioArtlas).

2604.01197 2026-06-17 quant-ph cond-mat.stat-mech cs.CC cs.LG (updated version)

Learning and Generating Mixed States Prepared by Shallow Channel Circuits

Fangjun Hu, Christian Kokail, Milan Kornjača, Pedro L. S. Lopes, Weiyuan Gong, Sheng-Tao Wang, Xun Gao, Stefan Ostermann

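Affiliations: QuEra Computing Inc.; School of Engineering and Applied Sciences, Harvard University

AI summary: Studies the problem of learning to generate mixed states prepared by shallow channel circuits, proving that mixed states in a particular phase can be learned efficiently from measurement data alone, which provides a structural foundation for quantum generative models.

Comments 44 pages, 14 figures, 1 table

Abstract (translated from the AI-generated Chinese version)

Learning quantum states from measurement data is a central problem in quantum information and computational complexity. This paper studies the problem of learning to generate mixed states on a finite-dimensional lattice. Inspired by recent developments in mixed-state phases of matter, we focus on arbitrary states in the trivial phase. A state belongs to the trivial phase if there exists a shallow preparation channel circuit under which local reversibility is preserved throughout the preparation. We prove that such mixed states can be learned efficiently with measurement access alone. Specifically, given multiple copies of an unknown trivial-phase mixed state, our algorithm outputs a shallow local channel circuit that approximately generates the state. The sample complexity and runtime are polynomial (or quasi-polynomial) in the number of qubits, assuming constant (or polylogarithmic) circuit depth and gate locality. Importantly, the learner is not given the original preparation circuit and relies only on its existence. Our results provide a structural foundation for quantum generative models based on shallow channel circuits. In the classical limit, our framework also inspires an efficient algorithm for classical diffusion models that incurs only polynomial overhead in training and generation.

English abstract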

Learning quantum states from measurement data is a central problem in quantum information and computational complexity. In this work, we study the problem of learning to generate mixed states on a finite-dimensional lattice. Motivated by recent developments in mixed state phases of matter, we focus on arbitrary states in the trivial phase. A state belongs to the trivial phase if there exists a shallow preparation channel circuit under which local reversibility is preserved throughout the preparation. We prove that any mixed state in this class can be efficiently learned from measurement access alone. Specifically, given copies of an unknown trivial phase mixed state, our algorithm outputs a shallow local channel circuit that approximately generates this state in trace distance. The sample complexity and runtime are polynomial (or quasi-polynomial) in the number of qubits, assuming constant (or polylogarithmic) circuit depth and gate locality. Importantly, the learner is not given the original preparation circuit and relies only on its existence. Our results provide a structural foundation for quantum generative models based on shallow channel circuits. In the classical limit, our framework also inspires an efficient algorithm for classical diffusion models using only a polynomial overhead of training and generation.

2605.12729 2026-06-17 cs.NI cs.AI cs.CR (updated version)

Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

Muhammad Bilal, Jon Crowcroft, Ruizhi Wang, Xiaolong Xu, Schahram Dustdar

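Affiliations: School of Computing and Communications; University of Cambridge; School of Software; Nanjing University of Information Science and Technology; TU Wien; ICREA

AI summary: This paper examines the use of large language models in network operations and AI operations, analyzing agent architectures, evaluation methods, and safety challenges, and emphasizing that system reliability depends on the mechanisms around the model rather than on the model itself.

Comments 49 pages, 15 figures, 6 tables; survey article

Abstract (translated from the AI-generated Chinese version)

Large language models are increasingly used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations act as workflows that run from gathering evidence to taking action, follow permissions, policies, and checks, and provide rollback options when necessary. This is crucial because operational decisions can take effect immediately. To make the argument concrete, we organize the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute, as well as the checks that must pass before any action is allowed. A consistent pattern emerges across studies of telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come mainly from the model itself; it depends on the mechanisms around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centered evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and rollback-aware trials. Without these measures, a system may appear robust yet in fact be too fragile. Finally, we examine the security, privacy, and governance risks that become acute as agents approach operational control surfaces. Taken together, the paper concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem whose outputs must be reliable, auditable, and safely deployable.

English abstract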

Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.
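
The abstract above describes assurance contracts that define what an agent may observe, propose, and execute, plus checks that must pass before any action. A minimal sketch of that shape follows. The class name, field names, check functions, and action dictionary keys are all hypothetical and are not drawn from the survey or from any real NetOps tool.

```python
from dataclasses import dataclass, field


@dataclass
class AssuranceContract:
    may_observe: set
    may_propose: set
    may_execute: set
    checks: list = field(default_factory=list)  # callables: action dict -> bool

    def authorize(self, action):
        """Return (allowed, reason) for a proposed action."""
        if action["name"] not in self.may_execute:
            return False, "action not in execute scope"
        for check in self.checks:
            if not check(action):
                return False, f"check failed: {check.__name__}"
        return True, "ok"


def has_rollback(action):
    # Hypothetical policy: every executed change must carry a rollback plan.
    return bool(action.get("rollback"))


def within_change_window(action):
    # Hypothetical policy: changes run only inside an approved window.
    return action.get("window") == "approved"


contract = AssuranceContract(
    may_observe={"telemetry", "logs"},
    may_propose={"restart_service", "update_route"},
    may_execute={"restart_service"},
    checks=[has_rollback, within_change_window],
)

ok_action = {"name": "restart_service", "rollback": "start previous unit", "window": "approved"}
no_rollback = {"name": "restart_service", "window": "approved"}
out_of_scope = {"name": "update_route", "rollback": "restore table", "window": "approved"}

result_ok = contract.authorize(ok_action)
result_no_rollback = contract.authorize(no_rollback)
result_out_of_scope = contract.authorize(out_of_scope)
```

Separating the propose scope from the execute scope reflects the abstract's point that reliability comes from the machinery around the model: an agent can suggest `update_route` here, but the contract will not let it run.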

2604.23628 2026-06-17 cs.DS cs.LG (updated version)

Characterizing Admissible Objective Functions for Hierarchical Clustering

Ryuki Tsukuba, Kazutoshi Ando

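Affiliations: Faculty of Engineering, Shizuoka University; Graduate School of Integrated Science and Technology, Shizuoka University

AI summary: This paper studies admissible objective functions for hierarchical clustering. For sum-type objective functions based on aggregate similarity, it fully characterizes admissibility when the symmetric polynomial has degree at most 2 and gives sufficient conditions for degree 3; it also introduces max-type objective functions and characterizes admissibility for arbitrary symmetric scaling functions.

Comments 20 pages, 3 figures. Minor correction to abstract metadata. Manuscript unchanged from v2. Submitted to Discrete Applied Mathematics

Abstract (translated from the AI-generated Chinese version)

Hierarchical clustering is a fundamental task in data analysis, but classical methods have long lacked a principled objective function. Dasgupta [STOC 2016] took an important step toward filling this gap by proposing a well-motivated objective function for cluster trees. Cohen-Addad et al. [J. ACM 2019] subsequently introduced the notion of admissibility: an objective function is admissible if, whenever the input similarity matrix admits generating trees, its minimizers are exactly the trees that generate that matrix. They also gave a necessary and sufficient condition for admissibility within a class of objective functions based on aggregate intercluster similarity. We call this class sum-type objective functions. However, apart from Dasgupta's original objective function, no explicit admissible objective function in this class has been given. This paper studies admissible objective functions for hierarchical clustering in two directions. For sum-type objective functions, we give a complete characterization when the scaling function is a symmetric polynomial of degree at most 2, and derive sufficient conditions for polynomials of degree 3. We also prove that the recursive sparsest cut algorithm achieves an $O(\phi)$ approximation ratio for the admissible objective functions covered by our characterization, where $\phi$ is the approximation factor of the sparsest cut subroutine. We then introduce max-type objective functions, in which intercluster interaction is measured by maximum rather than aggregate intercluster similarity. For this class, we characterize which objective functions are admissible for arbitrary symmetric scaling functions, and give a complete characterization when the scaling function is a symmetric polynomial of degree at most 2.

English abstract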

Hierarchical clustering is a fundamental task in data analysis, but classical methods have long lacked a principled objective function. Dasgupta [STOC 2016] took an important step toward addressing this gap by proposing a well-motivated objective function for cluster trees. Cohen-Addad et al. [J. ACM 2019] subsequently introduced the notion of admissibility: an objective function is admissible if, whenever the input similarity matrix admits generating trees, its minimizers are precisely those generating trees. They also gave a necessary and sufficient condition for admissibility within a family of objective functions based on aggregate intercluster similarity. We refer to this family as sum-type objective functions. However, apart from Dasgupta's original objective function, no explicit admissible objective functions in this family were provided. In this paper, we study admissible objective functions for hierarchical clustering in two directions. For sum-type objective functions, we give a complete characterization when the scaling function is a symmetric polynomial of degree at most two, and we derive sufficient conditions for degree-three polynomials. We also show that the recursive sparsest cut algorithm achieves an $O(\phi)$-approximation ratio for the admissible objective functions covered by our characterization, where $\phi$ is the approximation factor of the sparsest cut subroutine. We then introduce max-type objective functions, where cluster interaction is measured by maximum, rather than aggregate, intercluster similarity. For this class, we characterize which objective functions are admissible for arbitrary symmetric scaling functions and give a complete characterization when the scaling function is a symmetric polynomial of degree at most two.
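
The abstract above builds on Dasgupta's objective, which charges each pair of points its similarity times the number of leaves under the pair's lowest common ancestor; lower cost is better. The sketch below computes that cost for small trees. The nested-tuple tree encoding, the function names, and the toy similarity matrix are assumptions for illustration. The code covers only Dasgupta's original objective, not the generalized sum-type or max-type families studied in the paper.

```python
def leaves(tree):
    """Return the leaf labels under `tree` (a leaf is any non-tuple value)."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def dasgupta_cost(tree, w):
    """Sum over pairs i, j of w[i][j] times the leaf count under their lowest common ancestor."""
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(ls) for ls in child_leaves)
    cost = sum(dasgupta_cost(child, w) for child in tree)
    # Pairs split across two different children have this node as their LCA.
    for a in range(len(child_leaves)):
        for b in range(a + 1, len(child_leaves)):
            for i in child_leaves[a]:
                for j in child_leaves[b]:
                    cost += w[i][j] * size
    return cost


# Toy similarities: points 0 and 1 are close, points 2 and 3 are close.
w = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]
good_tree = ((0, 1), (2, 3))  # merges similar points first
bad_tree = ((0, 2), (1, 3))   # merges dissimilar points first

good_cost = dasgupta_cost(good_tree, w)
bad_cost = dasgupta_cost(bad_tree, w)
```

On this toy input the tree that merges similar points low in the hierarchy costs 5.6, against 9.2 for the tree that separates them at the root.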

2604.16450 2026-06-17 cs.CY cs.LG q-bio.QM (updated version)

Evaluating Intersectional Fairness across Clinical Machine Learning Use Cases using Fairlogue and the All of Us Research Program

Nick Souligne, Vignesh Subbian

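Affiliations: College of Engineering, The University of Arizona

AI summary: This paper uses the Fairlogue toolkit to evaluate intersectional fairness in clinical prediction tasks, finding that disparities across intersectional groups are larger than in single-axis analyses, but counterfactual diagnostics indicate that most disparities are comparable to those under random grouping.

Comments 10 pages, 7 figures, Accepted at the AMIA Annual Symposium 2026

Abstract (translated from the AI-generated Chinese version)

Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied to multiple clinical prediction tasks to assess disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of bleeding events associated with selective serotonin reuptake inhibitors, and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to assess whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under random group membership. These results underscore the importance of intersectional fairness auditing and show how FairLogue provides deeper insight into bias in clinical machine learning systems.

English abstract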

Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.
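
To show why intersectional evaluation can reveal larger disparities than single-axis analysis, the sketch below computes one simple observational metric, the gap in positive-prediction rate between groups, first per attribute and then per attribute combination. This is not FairLogue's API. The function names, record fields, and toy records are hypothetical and were constructed so that the single-axis gaps vanish.

```python
from collections import defaultdict


def positive_rate_by_group(records, keys):
    """Positive-prediction rate for each combination of the attributes in `keys`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group] += 1
        positives[group] += record["pred"]
    return {group: positives[group] / totals[group] for group in totals}


def max_gap(rates):
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy predictions: each intersectional subgroup has two patients.
records = [
    {"race": "A", "gender": "F", "pred": 1},
    {"race": "A", "gender": "F", "pred": 1},
    {"race": "A", "gender": "M", "pred": 0},
    {"race": "A", "gender": "M", "pred": 0},
    {"race": "B", "gender": "F", "pred": 0},
    {"race": "B", "gender": "F", "pred": 0},
    {"race": "B", "gender": "M", "pred": 1},
    {"race": "B", "gender": "M", "pred": 1},
]

gap_race = max_gap(positive_rate_by_group(records, ["race"]))
gap_gender = max_gap(positive_rate_by_group(records, ["gender"]))
gap_intersectional = max_gap(positive_rate_by_group(records, ["race", "gender"]))
```

In this constructed example both single-axis gaps are 0 while the intersectional gap is 1.0. The abstract's counterfactual step, which asks whether such a gap exceeds what random group assignment would produce, is not modeled here.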

2511.09204 2026-06-17 quant-ph cs.LG (version update)

Resource-Efficient Variational Quantum Classifier

Petr Ptáček, Paulina Lewandowska, Ryszard Kukulski

Affiliations: IT4Innovations, VSB - Technical University of Ostrava; Faculty of Electrical Engineering and Computer Science, VSB - Technical University of Ostrava

AI summary: Proposes an unambiguous quantum classifier based on Hamming distance measurements and classical post-processing. It improves classification performance by using ansatz expressivity more effectively, requires far fewer circuit evaluations, and is more robust to noise.

Comments 13 pages, 7 figures, 1 table; current format of preprint template

Abstract

We introduce the unambiguous quantum classifier based on Hamming distance measurements combined with classical post-processing. The proposed approach improves classification performance through a more effective use of ansatz expressivity, while requiring significantly fewer circuit evaluations. Moreover, the method demonstrates enhanced robustness to noise, which is crucial for near-term quantum devices. We evaluate the proposed method on a breast cancer classification dataset. The unambiguous classifier achieves an average accuracy of 90%, corresponding to an improvement of 6.9 percentage points over the baseline, while requiring eight times fewer circuit executions per prediction. In the presence of noise, the improvement is reduced to approximately 3.1 percentage points, with the same reduction in execution cost. We substantiate our experimental results with theoretical evidence supporting the practical performance of the approach.

2603.18897 2026-06-17 cs.DC cs.AI (version update)

Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, Kaiqiang Xu, Kai Chen, Yuqing Yang

Affiliations: Shanghai Jiao Tong University; Microsoft Research; Stevens Institute of Technology; Google; Hong Kong University of Science and Technology

AI summary: Proposes PASTE, a system that speculatively executes predicted future tool calls in parallel with LLM generation, reducing average task completion time by 43.5%.

Abstract

LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task critical path. This paper presents PASTE, a tool-aware agent-serving system that predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE isolates speculative results until confirmed by the LLM and jointly schedules tool execution and returning LLM sessions to avoid shifting bottlenecks to the GPU. Across deep research, coding, and scientific-agent workloads, PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.
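
The core idea of overlapping tool execution with generation can be sketched with `asyncio`. This is not PASTE's implementation: the function names, the fixed delays, and the exact-match confirmation rule are assumptions, and the sketch ignores scheduling and tools with side effects.

```python
import asyncio

async def slow_tool(name, arg, delay=0.05):
    """Stand-in for a tool call with noticeable latency."""
    await asyncio.sleep(delay)
    return f"{name}({arg})"

async def llm_generate(delay=0.05):
    """Stand-in for model generation; returns the tool call the model emits."""
    await asyncio.sleep(delay)
    return ("search", "llm agents")

async def speculative_step(predicted_call):
    # Start the predicted tool call while the model is still generating.
    spec_task = asyncio.create_task(slow_tool(*predicted_call))
    actual_call = await llm_generate()
    if actual_call == predicted_call:
        # Confirmed: only now is the speculative result released.
        return await spec_task, True
    # Misprediction: discard the speculative result and run the real call.
    spec_task.cancel()
    try:
        await spec_task
    except asyncio.CancelledError:
        pass
    return await slow_tool(*actual_call), False

async def main():
    hit = await speculative_step(("search", "llm agents"))
    miss = await speculative_step(("search", "something else"))
    return hit, miss

hit, miss = asyncio.run(main())
```

On a correct prediction the tool latency overlaps with generation. On a misprediction the step falls back to the serial path, so the result is the same and only the latency differs.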

2503.17867 2026-06-17 cs.CR cs.AI cs.LG cs.NI (version update)

Detecting and Mitigating DDoS Attacks with AI: A Survey

Alexandru Apostu, Silviu Gheorghe, Andrei Hîji, Nicolae Cleju, Andrei Pătraşcu, Cristian Rusu, Radu Ionescu, Paul Irofti

Affiliations: Department of Computer Science, University of Bucharest

AI summary: Surveys AI-based methods for detecting and mitigating DDoS attacks, provides a taxonomy built from expert hierarchies and an AI-generated dendrogram, and discusses datasets, adversarial training, and future research directions.

Abstract

Distributed Denial of Service attacks represent an active cybersecurity research problem. Recent research shifted from static rule-based defenses towards AI-based detection and mitigation. This comprehensive survey covers several key topics. First, state-of-the-art AI detection methods are discussed. An in-depth taxonomy based on manual expert hierarchies and an AI-generated dendrogram is provided, thus settling DDoS categorization ambiguities. A discussion of available datasets follows, covering data format options and their role in training AI detection methods, together with adversarial training and example augmentation. Beyond detection, AI-based mitigation techniques are surveyed as well. Finally, multiple open research directions are proposed.

2511.03211 2026-06-17 cs.CY cs.AI (version update)

Retrofitters, pragmatists and activists: Public interest litigation for accountable automated decision-making

Henry Fraser, Zahra Stardust

Affiliations: Queensland University of Technology, School of Law; Centre for Automated Decision-Making and Society; Queensland University of Technology, School of Communication

AI summary: Examines the role of public interest litigation in promoting accountability for AI and automated decision-making in Australia, analyzes its strategies and limits based on interviews, and stresses that enabling institutional arrangements are critical to effective litigation.

Abstract

This paper examines the role of public interest litigation in promoting accountability for AI and automated decision-making (ADM) in Australia. Since ADM regulation faces political and geopolitical headwinds, effective governance will have to rely on the enforcement of existing laws. Drawing on interviews with Australian public interest litigators, technology policy activists, and technology law scholars, the paper positions public interest litigation as part of a larger ecosystem for transparency, accountability and justice with respect to ADM. The paper explores the tactics and strategies of what one participant described as 'retrofitting' old laws to ADM. These go beyond creative legal argumentation, to encompass practices of community-building, collaboration on theories of change, canny selection of clients and causes of action, and the alignment of the interests of stakeholders in litigation. Naturally, the paper also contends with the limits of these strategies, and of the Australian legal system. Where limits are, however, capable of being overcome, the paper presents findings on urgent needs: the enabling institutional arrangements without which effective litigation and accountability will falter. The paper is relevant to law and technology scholars; individuals and groups harmed by ADM; public interest litigators and technology lawyers; civil society and advocacy organisations; and policymakers.

2512.15792 2026-06-17 cs.CY cs.AI cs.CL (version update)

A Multifaceted Analysis of Social Biases in Large Language Models

Xulang Zhang, Rui Mao, Erik Cambria

Affiliations: Nanyang Technological University

AI summary: Analyzes four widely used large language models for biases across politics, ideology, alliance, language, and gender. A series of experiments reveals biases in political neutrality, ideological leaning, geopolitical alignment, multilingual story completion, and gender-related affinities.

Abstract

Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types.

2602.11453 2026-06-17 cs.IR cs.AI cs.LG (version update)

From Noise to Order: Learning to Rank via Denoising Diffusion

Sajad Ebrahimi, Bhaskar Mitra, Negar Arabzadeh, Ye Yuan, Haolun Wu, Fattane Zarrinkalam, Ebrahim Bagheri

Affiliations: University of Guelph; Independent Researcher; University of California, Berkeley; McGill University; University of Toronto

AI summary: Proposes DiffusionRank, a denoising-diffusion-based generative ranking model that models the joint distribution of feature vectors and relevance labels and outperforms its discriminative counterparts on four standard LTR datasets.

Abstract

Learning-to-rank (LTR) methods have traditionally been limited to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature representation of the query-document pair. We propose an alternative denoising diffusion-based generative approach to LTR that instead models the full joint distribution over features and relevance labels. While in discriminative LTR, an over-parameterized ranking model may find different ways to fit the training data, we posit that candidate solutions that can explain the full data distribution under the generative setting may be better at estimating relevance. Thus, we propose DiffusionRank that extends TabDiff, an existing diffusion model for tabular datasets, to create generative alternatives to classical discriminative pointwise and pairwise LTR objectives. Our work demonstrates improvements from DiffusionRank over discriminative counterparts on four standard LTR datasets and points to a rich space for future exploration to leverage ongoing advancements in deep generative models for LTR. Our code is publicly available at https://github.com/sadjadeb/DiffusionRank.

2512.01241 2026-06-17 cs.CY cs.AI (version update)

First, do NOHARM: towards clinically safe large language models

David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Vartan Pahalyants, Ernest Y. Lee, Allen Shih, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, Thomas A. Buckley, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Anastasia Perez, Austin J. Schoeffler, Mahbuba Tusty, Chase M. Walton, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Arjun K. Manrai, Adam Rodman, Jonathan H. Chen, Ethan Goh

Affiliations: Harvard Combined Dermatology Program; Department of Dermatology, Mass General Brigham; Harvard Medical School; Stanford Center for Biomedical Informatics Research; Stanford University; Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine; Department of Medicine, Cambridge Health Alliance; Beth Israel Deaconess Hospital–Plymouth; Department of Medicine, University of California, San Francisco; Department of Neurology, Stanford University School of Medicine; Department of Medicine, Beth Israel Deaconess Medical Center; Division of Cardiology, Department of Medicine, Cambridge Health Alliance; Department of Cardiovascular Medicine, Summa Health System; Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, University of Wisconsin-Madison; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Massachusetts General Hospital; Center for Immunology and Inflammatory Diseases, Department of Medicine, Massachusetts General Hospital; Broad Institute of MIT and Harvard; Division of Pulmonary, Critical Care, and Sleep Medicine, Cambridge Health Alliance

AI summary: Presents the NOHARM benchmark of 1,100 primary care-to-specialist consultation cases and evaluates the safety of medical recommendations from 28 LLMs. Up to 22.6% of cases carried the potential for severe harm, with errors of omission accounting for more than 80% of severe errors.

Abstract

Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a 1,100-task benchmark of primary care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 28 LLMs, recommendations carried the potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors. In a randomized trial of 101 generalist physicians, human benchmark performance significantly improved with AI assistance, yet physicians remained far from realizing the potential of AI tools, frequently ignoring essential advice surfaced by AI. Safety performance tracked general-intelligence and medical-knowledge benchmarks across the full range of models but decoupled at the frontier. Despite strong performance on existing evaluations, widely used AI models can produce medical advice with the potential for severe harm at non-trivial rates, highlighting the importance of explicit measurement of clinical safety.

2510.04421 2026-06-17 stat.ML cs.LG math.ST stat.TH (version update)

Learning Survival Models with Right-Censored Reporting Delays

Yuta Shikuri, Hironori Fujisawa

Affiliations: The Graduate University for Advanced Studies; Tokio Marine Holdings, Inc.; The Institute of Statistical Mathematics; RIKEN

AI summary: Addresses right-censored reporting delays in survival data by jointly modeling parametric hazards for the event and reporting processes, proposes a consistent estimator computed with a Monte Carlo EM algorithm, and uses transfer learning to improve the accuracy of timely risk evaluation under administrative censoring.

Comments 26 pages, 3 figures, 3 tables

Abstract

Survival analysis provides statistical methods to model the time until an event occurs. Reporting delays arise when event times are not observed at their occurrence but are only revealed upon reporting. This issue is particularly critical for timely risk evaluation when the observation window is short due to administrative censoring. In this study, we incorporate right-censored reporting delays by jointly modeling parametric hazards for the event and reporting processes. We then construct a consistent estimator for the model parameters and develop a Monte Carlo expectation-maximization algorithm to compute it. To address the challenges posed by administrative censoring, we leverage these findings and propose a transfer-learning procedure. Experimental results demonstrate that our method improves the accuracy of timely risk evaluation under administrative censoring.

2501.09876 2026-06-17 math.NA cs.LG cs.NA (version update)

Geometry-Preserving Encoder/Decoder in Latent Generative Models

Wonjun Lee, Riley C. W. O'Neill, Dongmian Zou, Jeff Calder, Gilad Lerman

Affiliations: Department of Mathematics, The Ohio State University; Department of Mathematics, University of Minnesota; Zu Chongzhi Center for Mathematics and Computational Sciences, Duke Kunshan University

AI summary: Proposes a geometry-preserving encoder/decoder framework that preserves the geometric structure of the data distribution, enabling more efficient training and faster convergence in latent diffusion models.

Comments 50 pages

Abstract

Generative modeling aims to generate new data samples that resemble a given dataset. When using diffusion models for this task, one of the main challenges is solving the problem in the input space, which tends to be very high-dimensional. To address this, recent approaches solve diffusion models in the latent space through an encoder that maps from the data space to a lower-dimensional latent space, improving training efficiency and achieving state-of-the-art results. The variational autoencoder (VAE) is the most commonly used encoder/decoder framework in this domain, known for its ability to learn latent representations and generate data samples. In this paper, we introduce a novel encoder/decoder framework with theoretical properties distinct from those of the VAE, specifically designed to preserve the geometric structure of the data distribution. We demonstrate the significant advantages of this geometry-preserving encoder in the training process of both the encoder and decoder. Additionally, we provide theoretical results proving convergence of the training process, including convergence guarantees for encoder training, and results showing faster convergence of decoder training when using the geometry-preserving encoder.

2508.02721 2026-06-17 cs.SE cs.AI cs.PL (version update)

Blueprint First, Model Second: A Framework for Deterministic LLM Workflow

Libin Qiu, Yuhang Ye, Zhirong Gao, Xide Zou, Junfu Chen, Ziming Gui, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Kun Zhao

Affiliations: Alibaba

AI summary: Proposes the "Blueprint First, Model Second" framework, which codifies workflow logic as a source-code blueprint executed by a deterministic engine while the LLM handles only bounded sub-tasks. On TravelPlanner, the final pass rate improves by 97.6% over the ATLAS baseline and constraint violations fall by 96.0%.

Comments 12 pages, 7 figures, 6 tables

Abstract

While powerful, the inherent non-determinism of large language model (LLM) agents limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements. This limitation stems from current architectures that conflate probabilistic, high-level planning with low-level action execution within a single generative process. To address this, we introduce the Source Code Agent framework, a new paradigm built on the "Blueprint First, Model Second" philosophy that decouples workflow logic from the generative model. An expert-defined operational procedure is first codified into a source code-based Execution Blueprint, which is then executed by a deterministic engine. The LLM is strategically invoked as a specialized tool to handle bounded, complex sub-tasks within the workflow, but never to decide the workflow's path. We evaluate on the TravelPlanner benchmark for constraint-aware travel planning. The Source Code Agent achieves a 35.56% final pass rate, a 97.6% improvement over the state-of-the-art ATLAS baseline (18.00%) on the same Claude-Sonnet-4 backbone. Critically, it reduces constraint violations by 96.0% (11 vs 275) while improving execution efficiency by 27.1% (10.2±0.7 steps vs 14.0). Two production incident-diagnosis deployments and additional results on ScienceWorld and ALFWorld confirm that the architecture transfers beyond travel planning to procedurally well-defined, constraint-intensive workflows. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic.
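
The separation between a fixed blueprint and a bounded model call might look like the sketch below. It is an illustration under stated assumptions, not the authors' framework: `run_blueprint`, `fake_llm`, and the budget rule are hypothetical names and logic chosen only to show that the control flow is ordinary code and the model output never selects the next step.

```python
from typing import Callable, Dict, List

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; a real system would query an LLM."""
    return prompt.upper()

def run_blueprint(request: Dict[str, int], llm: Callable[[str], str]) -> List[str]:
    """Hand-written control flow: the model fills in text but picks no branch."""
    log: List[str] = []
    # Step 1: deterministic constraint check, decided by code alone.
    if request["budget"] < request["cost"]:
        log.append("reject: over budget")
        return log
    log.append("accept: within budget")
    # Step 2: bounded sub-task delegated to the model.
    log.append("summary: " + llm("trip fits the budget"))
    # Step 3: fixed final step, reached regardless of what the model returned.
    log.append("done")
    return log

accepted = run_blueprint({"budget": 100, "cost": 80}, fake_llm)
rejected = run_blueprint({"budget": 50, "cost": 80}, fake_llm)
```

The percentages quoted in the abstract are relative changes: (35.56 − 18.00) / 18.00 ≈ 0.976 for the pass rate, 1 − 11/275 = 0.960 for constraint violations, and 1 − 10.2/14.0 ≈ 0.271 for execution steps.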

2507.17188 2026-06-17 cs.NI cs.AI cs.CR (version update)

LLM-Aided Joint Secrecy Precoding and Trajectory for RSMA-Based Heterogeneous UAV Networks

Lijie Zheng, Ji He, Shih Yu Chang, Yulong Shen

Affiliations: School of Computer Science and Technology, Xidian University; Department of Applied Data Science, San Jose State University

AI summary: For secure communication in RSMA-based heterogeneous UAV networks, proposes a hierarchical optimization framework. The inner layer solves secrecy precoding at fixed UAV positions with an SDR-based S2DC algorithm, and the outer layer optimizes trajectories with LLM-guided multi-agent reinforcement learning, trading off secrecy rate against energy efficiency.

Abstract

This paper investigates secure communications in rate-splitting multiple access (RSMA) enabled heterogeneous UAV networks, where multiple UAVs collaboratively serve ground terminals in the presence of eavesdroppers. By jointly considering secrecy rate maximization and propulsion energy consumption minimization, we formulate a multi-objective optimization problem involving UAV trajectory design, service association, power allocation, and secrecy precoding under mobility, collision-avoidance, service-capacity, and communication constraints. The formulated problem is highly non-convex due to the coupling among UAV trajectories, RSMA transmission variables, and secrecy constraints. To address the resulting non-convex and highly coupled optimization problem, we propose a hierarchical optimization framework. The inner layer uses a semidefinite relaxation (SDR)-based S2DC algorithm combining penalty functions and difference-of-convex (D.C.) programming to solve the secrecy precoding problem with fixed UAV positions. The outer layer introduces a Large Language Model (LLM)-guided heuristic multi-agent reinforcement learning approach (LLM-HeMARL) for trajectory optimization. LLM-HeMARL efficiently incorporates LLM-generated expert heuristic policy, enabling UAVs to learn energy-aware, security-driven trajectories without the inference overhead of real-time LLM calls. The simulation results show that our method outperforms existing baselines in secrecy rate and energy efficiency, with consistent robustness across varying UAV swarm sizes and random seeds.

2507.11366 2026-06-17 cs.GT cs.LG (version update)

Characterizing Nash Equilibria in Zero-Sum Games: A Physics-Inspired, Parallelizable Approach with a Linear Number of Gradient Queries

Taemin Kim, James P. Bailey

Affiliations: Industrial and Systems Engineering; Rensselaer Polytechnic Institute

AI summary: Proposes an online optimization method inspired by Hamiltonian dynamics that characterizes the set of Nash equilibria of a zero-sum game in a linear number of alternating gradient descent iterations, supports parallelization and arbitrary learning rates, and substantially outperforms standard methods in experiments.

Abstract

We study online optimization methods for zero-sum games, a fundamental problem in adversarial learning in machine learning, economics, and many other domains. Traditional methods approximate Nash equilibria (NE) using either regret-based methods (time-average convergence) or contraction-map-based methods (last-iterate convergence). We propose a new method based on Hamiltonian dynamics in physics and prove that it can characterize the set of NE in a finite (linear) number of iterations of alternating gradient descent in the unbounded setting, modulo degeneracy, a first in online optimization. Unlike standard methods for computing NE, our proposed approach can be parallelized and works with arbitrary learning rates, both firsts in algorithmic game theory. Experimentally, we support our results by showing our approach drastically outperforms standard methods.
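
The building block named in the abstract, alternating gradient descent, can be tried on a toy problem. The sketch below is not the paper's algorithm for characterizing equilibria. It only assumes the scalar bilinear game f(x, y) = xy (x minimizes, y maximizes) and checks numerically that alternating updates keep the quantity x² + y² − ηxy constant, a Hamiltonian-like invariant, while simultaneous updates spiral away from the equilibrium at the origin.

```python
def alternating_step(x, y, eta):
    """One round of alternating updates on f(x, y) = x * y."""
    x_new = x - eta * y          # minimizing player moves first
    y_new = y + eta * x_new      # maximizing player reacts to the new x
    return x_new, y_new

def simultaneous_step(x, y, eta):
    """Both players update from the same old iterate."""
    return x - eta * y, y + eta * x

def energy(x, y, eta):
    """Quantity conserved by the alternating updates above."""
    return x * x + y * y - eta * x * y

eta = 0.1
xa, ya = 1.0, 1.0    # alternating iterates
xs, ys = 1.0, 1.0    # simultaneous iterates
e0 = energy(xa, ya, eta)
for _ in range(1000):
    xa, ya = alternating_step(xa, ya, eta)
    xs, ys = simultaneous_step(xs, ys, eta)

alt_drift = abs(energy(xa, ya, eta) - e0)        # stays near zero
sim_growth = (xs * xs + ys * ys) / 2.0           # squared norm relative to the start
```

For simultaneous updates the squared norm is multiplied by 1 + η² every step, which is why `sim_growth` becomes large, whereas the alternating iterates stay on a bounded orbit for this step size.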

2411.06842 2026-06-17 eess.IV cs.CV (version update)

Evaluating Synthetic Data Generation for Domain Generalization in Fetal Brain MRI Segmentation

Vladyslav Zalevskyi, Thomas Sanchez, Margaux Roulet, Busra Bulut, Hélène Lajous, Jordina Aviles Verdera, Sara Neves Silva, Georg Langs, Gregor Kasprian, Roxane Licandro, Jana Hutter, Hamza Kebiri, Meritxell Bach Cuadra

Affiliations: Department of Radiology, Lausanne University Hospital and University of Lausanne (UNIL); CIBM Center for Biomedical Imaging; Institute for Information Processing, Leibniz University Hannover; Department of Biomedical Engineering, School of Biomedical Engineering & Imaging Sciences, King’s College London; Department of Biomedical Imaging and Image-Guided Therapy, Division of Neuroradiology and Musculoskeletal Radiology, Medical University of Vienna; Department of Biomedical Imaging and Image-guided Therapy, Computational Imaging Research Lab (CIR), Medical University of Vienna; Christian Doppler Laboratory for Mathematical Modelling and Simulation of Next-Generation Medical Ultrasound Devices, Medical University of Vienna; Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna; Division of Neuroradiology and Musculoskeletal Radiology, Department of Biomedical Imaging and Image–guided Therapy, Medical University of Vienna

AI summary: To address data heterogeneity and limited annotations in fetal brain MRI segmentation, evaluates domain-randomization-based synthetic data generation strategies with a focus on the FetalSynthSeg framework. Gaussian mixture intensity modeling and intensity clustering improve cross-domain robustness, and the framework reaches state-of-the-art performance on several datasets.

Abstract

Fetal brain tissue segmentation from magnetic resonance imaging (MRI) is crucial for studying neurodevelopment, but remains challenging due to data heterogeneity and limited annotations. Domain randomization (DR) has recently emerged as a promising strategy for single-source domain generalization by synthesizing training images with randomized artifacts, contrast, and resolution. In this work, we investigate how to maximize the out-of-domain (OOD) generalization of DR-based methods. We evaluate several synthetic data generation strategies for DR, with a particular focus on our recently proposed framework, FetalSynthSeg. We show that simple Gaussian mixture-based intensity modeling outperforms more complex physics-based simulations, and that intensity clustering (subdividing tissue classes based on intensity) improves OOD robustness. Evaluated on 348 fetal subjects from four sites spanning 0.55-3T and both T1w and T2w contrasts, FetalSynthSeg reaches state-of-the-art performance on several FeTA 2024 testing datasets (80-85 Dice score) and, for the first time, offers robust segmentation on modalities other than T2w for fetal brain segmentation (80 Dice on dHCP-T1w dataset). Compared with state-of-the-art methods such as BOUNTI, nnU-Net ensemble, and the FeTA 2024 winner, FetalSynthSeg delivers comparable or superior accuracy while maintaining strong robustness across domain shifts. Our code, model weights, and Docker image ready for easy inference are available at https://hub.docker.com/r/vzalevskyi/fetalsynthseg.
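
As a rough picture of Gaussian mixture-based intensity modeling, the sketch below draws one random Gaussian per label in a segmentation map and samples each voxel from its label's Gaussian. It is not the FetalSynthSeg code: the function name, the parameter ranges, and the tiny 2D label map are assumptions, and real pipelines add the randomized artifacts, contrast, and resolution changes that the abstract mentions.

```python
import random

def synthesize(label_map, rng):
    """Sample an image whose intensities depend only on the label at each voxel."""
    labels = sorted({value for row in label_map for value in row})
    # One random (mean, standard deviation) pair per label; ranges are assumed.
    params = {lab: (rng.uniform(0.0, 1.0), rng.uniform(0.01, 0.1)) for lab in labels}
    image = [[rng.gauss(*params[value]) for value in row] for row in label_map]
    return image, params

# Hypothetical label map with three tissue classes.
label_map = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
]
rng = random.Random(0)
image_a, params_a = synthesize(label_map, rng)
image_b, params_b = synthesize(label_map, rng)
```

Each call produces a different contrast for the same anatomy, so a segmentation network trained on such images cannot rely on any fixed intensity profile.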

2501.00826 2026-06-17 q-fin.TR cs.AI (version update)

LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management

Yichen Luo, Yebo Feng, Jiahua Xu, Paolo Tasca, Yang Liu

Affiliations: University College London; Nanyang Technological University; Exponential Science

AI summary: Proposes a three-agent system (market, news, trading) that fuses multi-modal signals through hierarchical, collaborative, and debate architectures. In a 2025 backtest it achieves a cumulative return of 133.52% and a Sharpe ratio of 1.502, outperforming single-agent and deep learning baselines.

Abstract

Cryptocurrency portfolio management requires the fusion of heterogeneous multi-modal signals, including structured price and on-chain time series, unstructured news text, and technical indicators, under high-volatility and real-time constraints. While deep learning approaches show predictive capability, their opacity limits practical adoption, and single large language model (LLM) agents struggle to process the breadth of modality-specific inputs needed for robust decision-making. We propose a multi-agent system (MAS) framework in which three modality-specialised agents, a Crypto Agent for market dynamics, a News Agent for weekly news sentiment, and a Trading Agent for signal fusion and portfolio execution, decompose the task across three communication architectures: hierarchical, collaborative, and debate. We evaluate four capability configurations: zero-shot, chain-of-thought (CoT), retrieval-augmented generation (RAG), and skill-augmented. In a 52-week backtest over calendar year 2025 across the top 15 L1 blockchain native cryptocurrencies by market capitalisation as of January 2025, the best configuration, Hierarchical (Skill), achieves a cumulative return of 133.52% and a Sharpe ratio of 1.502, outperforming single-agent variants, passive benchmarks, and deep learning baselines. An ablation study identifies the Crypto Agent as the most critical component, with its removal reducing cumulative return by 42.57 percentage points. A cross-model comparison further shows that MAS outperforms the single-agent baseline under GPT-4o, GPT-5, and Claude Sonnet 4.5, suggesting that the benefit of multi-agent coordination is model-agnostic. Unlike black-box deep learning models, every portfolio decision is traceable to explicit agent reasoning, offering an interpretable and effective approach to multi-modal cryptocurrency portfolio management.
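
The two headline metrics can be computed from a series of weekly portfolio returns as sketched below. The abstract does not state the paper's exact conventions, so the zero risk-free rate, the sample standard deviation, and the annualization by the square root of 52 are assumptions, and the four returns are made-up numbers.

```python
import math
import statistics

def cumulative_return(weekly_returns):
    """Compound the weekly returns and subtract the initial capital."""
    growth = 1.0
    for r in weekly_returns:
        growth *= 1.0 + r
    return growth - 1.0

def annualized_sharpe(weekly_returns, weekly_risk_free=0.0, periods_per_year=52):
    """Mean excess return over its sample standard deviation, scaled to a year."""
    excess = [r - weekly_risk_free for r in weekly_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

weekly = [0.10, -0.05, 0.10, -0.05]   # hypothetical weekly returns
total = cumulative_return(weekly)
sharpe = annualized_sharpe(weekly)
```

A 52-week backtest would pass 52 values instead of four. Different choices of risk-free rate or annualization factor change the Sharpe ratio, which is worth keeping in mind when comparing figures across papers.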

2407.13053 2026-06-17 cs.CY cs.AI cs.CL cs.LG 版本更新

E2Vec: Feature Embedding with Temporal Information for Analyzing Student Actions in E-Book Systems

Yuma Miyazaki, Valdemar Švábenský, Yuta Taniguchi, Fumiya Okubo, Tsubasa Minematsu, Atsushi Shimada

Affiliations: Kyushu University

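AI summary: Proposes E2Vec, a method that uses word embeddings to turn operation logs and their time intervals into student vectors, and evaluates it on an at-risk detection task, showing potential for generalizability and performance.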

Comments Research paper published in the Proceedings of the 17th Educational Data Mining Conference (EDM 2024), see https://doi.org/10.5281/zenodo.12729853

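AI Chinese abstract

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream and used them as inputs for downstream tasks such as grade prediction and student behavior modeling. Previous studies evaluated models that mainly used statistics-based features, such as the number of operation types or access frequencies. Although these features provide certain insights, they lack the temporal information needed to capture fine-grained differences in learning behavior among students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The method treats each student's operation logs and their time intervals as a string sequence of characters and generates a student vector of learning activity features that incorporates temporal information. We applied fastText to generate embedding vectors for 305 students in a dataset from two years of computer science courses. We then investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating its potential for generalizability and performance.

English abstract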

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream, and utilized them as inputs for downstream tasks such as grade prediction and modeling of student behavior. Previous research evaluated models that mainly used statistical-based features derived from EventStream logs, such as the number of operation types or access frequencies. While these features are useful for providing certain insights, they lack temporal information that captures fine-grained differences in learning behaviors among different students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The proposed method regards operation logs and their time intervals for each student as a string sequence of characters and generates a student vector of learning activity features that incorporates time information. We applied fastText to generate an embedding vector for each of 305 students in a dataset from two years of computer science courses. Then, we investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating potential for generalizability and performance.
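The idea of treating operation logs and time intervals as a string of characters can be illustrated with a small sketch. The symbol tables, the interval thresholds, and the function names below are hypothetical and chosen only for illustration; the abstract does not specify E2Vec's actual vocabulary or bucketing.

```python
# Hypothetical symbol tables; E2Vec's real vocabulary is not given in the abstract.
OP_SYMBOLS = {"OPEN": "O", "NEXT": "N", "PREV": "P", "CLOSE": "C"}
# (exclusive upper bound in seconds, symbol) pairs for the gap between two events.
GAP_SYMBOLS = [(10, "s"), (300, "m")]
LONG_GAP = "l"


def gap_symbol(seconds):
    """Map an inter-event interval to a single character."""
    for upper, symbol in GAP_SYMBOLS:
        if seconds < upper:
            return symbol
    return LONG_GAP


def encode_events(events):
    """Encode time-sorted (timestamp_seconds, operation) pairs as one string.

    A gap character is emitted between consecutive operations, so the
    resulting string carries both what was done and how long it took.
    """
    chars = []
    previous_time = None
    for timestamp, operation in events:
        if previous_time is not None:
            chars.append(gap_symbol(timestamp - previous_time))
        chars.append(OP_SYMBOLS[operation])
        previous_time = timestamp
    return "".join(chars)


events = [(0, "OPEN"), (4, "NEXT"), (64, "NEXT"), (1000, "CLOSE")]
encoded = encode_events(events)
print(encoded)
```

Strings produced this way could then be fed to a word-embedding model such as fastText, which is the embedding tool the abstract names; how the strings are split into tokens for training is left open here.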

2208.03023 2026-06-17 eess.AS cs.SD Updated version

AID: Open-source Anechoic Interferer Dataset

Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

Affiliations: International Audio Laboratories Erlangen; Fraunhofer Institute for Integrated Circuits IIS

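AI summary: Presents a dataset of anechoic recordings of various sound sources found in domestic environments, intended as non-stationary environmental noise signals for simulating complex acoustic scenes, together with a Python library for generating random mixtures to use as interference signals.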

Comments Accepted for publication at IWAENC 2022

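AI Chinese abstract

This paper presents a dataset of anechoic recordings of various sound sources encountered in domestic environments. The dataset is intended as a resource of non-stationary environmental noise signals that, when convolved with acoustic impulse responses, can be used to simulate complex acoustic scenes. In addition, a Python library is provided for generating random mixtures of the recordings in the dataset, which can serve as non-stationary interference signals.

English abstract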

A dataset of anechoic recordings of various sound sources encountered in domestic environments is presented. The dataset is intended to be a resource of non-stationary, environmental noise signals that, when convolved with acoustic impulse responses, can be used to simulate complex acoustic scenes. Additionally, a Python library is provided to generate random mixtures of the recordings in the dataset, which can be used as non-stationary interference signals.
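The two operations mentioned in the abstract, convolving a dry recording with an acoustic impulse response and building a random mixture of recordings, can be sketched in plain Python. This is not the API of the dataset's own Python library, which the abstract does not describe; the function names, the gain range, and the toy signals are assumptions made for illustration.

```python
import random


def convolve(signal, impulse_response):
    """Full discrete convolution: places the dry signal in a simulated room."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def random_mixture(recordings, num_sources, length, rng):
    """Sum randomly chosen recordings at random offsets and gains.

    Assumption: every recording is no longer than `length` samples.
    """
    mixture = [0.0] * length
    for recording in rng.sample(recordings, num_sources):
        gain = rng.uniform(0.5, 1.0)
        offset = rng.randrange(0, length - len(recording) + 1)
        for k, sample in enumerate(recording):
            mixture[offset + k] += gain * sample
    return mixture


# Toy signals standing in for anechoic recordings and a room response.
recordings = [[1.0, -1.0, 0.5], [0.2, 0.4], [0.9, 0.0, -0.3, 0.1]]
room_response = [1.0, 0.5]

rng = random.Random(0)
interference = random_mixture(recordings, num_sources=2, length=8, rng=rng)
reverberant = convolve(interference, room_response)
```

For real audio, a vectorised or FFT-based convolution would be far faster than the nested loops shown here; the loops are kept only to make the operation explicit.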

2502.17773 2026-06-17 stat.ME cs.AI cs.LG

How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

Chengpiao Huang, Yuhang Wu, Kaizheng Wang

Affiliations: Department of IEOR, Columbia University; Decision, Risk, and Operations Division, Columbia Business School; Department of IEOR and Data Science Institute, Columbia University

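AI summary: From an uncertainty quantification perspective, this paper proposes a framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty caused by human-LLM misalignment. The key design choice is the number of simulated responses: too many yield overly narrow confidence sets with poor coverage, while too few yield overly wide, uninformative ones. The paper proposes a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size also reflects the effective human population size that the LLM can represent, giving a quantitative measure of its simulation fidelity. Experiments show heterogeneous simulation fidelity across LLMs and domains.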

Comments 63 pages, 13 figures

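AI Chinese abstract

Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few produce overly wide, uninformative confidence sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.

English abstract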

Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.
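The trade-off in the number of simulated responses can be seen in a toy calculation. The sketch below is not the paper's procedure: it uses a textbook normal-approximation interval for a proportion, fixes the simulated proportion at its expected value (ignoring sampling noise), and assumes made-up human and LLM answer rates of 0.50 and 0.60.

```python
import math


def normal_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - half_width, p_hat + half_width)


def covers(interval, target):
    """True if the interval contains the target value."""
    low, high = interval
    return low <= target <= high


P_HUMAN = 0.50  # assumed true human proportion (illustrative)
P_SIM = 0.60    # assumed proportion among LLM-simulated responses (illustrative)

for n in (50, 1000):
    interval = normal_ci(P_SIM, n)
    width = interval[1] - interval[0]
    print(f"n={n}: width={width:.3f}, covers human value: {covers(interval, P_HUMAN)}")
```

Because the interval is centred on the misaligned LLM proportion, shrinking it with more simulated responses eventually excludes the human value, while a very small sample gives an interval too wide to be informative. This is the tension that the sample-size selection described in the abstract is meant to balance.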