arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12835 2026-06-12 cs.MA cs.AI cs.CY cs.NI New submission

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

Quanyan Zhu

AI summary: This paper presents a vision of the Internet of Agentic AI (IoAI), an open ecosystem in which heterogeneous agents discover, negotiate, communicate, and collaborate across cloud, edge, and device environments, and discusses its architecture, mechanisms, and key research challenges.

Abstract

The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

2606.12828 2026-06-12 cs.AI New submission

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Rasul Khanbayov, Hasan Kurban

AI summary: By analyzing papers from five major AI conferences over 2017-2025, the study finds that AI topics surge abruptly through "phase transitions" and uses an early-warning signature to identify topics to watch in the coming years.

Abstract

Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at this https URL.
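To put the reported operating point in context, the short sketch below computes how far the signature's precision sits above the base rate, along with the corresponding F1 score. Only the three percentages come from the abstract; the function names are illustrative and are not taken from the paper's code.

```python
def lift(precision: float, base_rate: float) -> float:
    """How many times more often a flagged topic transitions than a randomly chosen topic."""
    return precision / base_rate


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for the 2023-2025 out-of-sample evaluation.
precision, recall, base_rate = 0.27, 0.63, 0.135

print(f"lift over base rate: {lift(precision, base_rate):.2f}x")
print(f"F1: {f1_score(precision, recall):.3f}")
```

Read this way, a flagged topic is about twice as likely to undergo a transition as a topic picked at random, which is useful for monitoring even though most flagged topics still do not transition.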

2606.12812 2026-06-12 cs.CY cs.SD New submission

Vocal Identity Under Siege by AI Voice Cloning Technologies

Jyh-An Lee, Xuan Sun

AI summary: Through a comparative analysis of three legal frameworks (the right of publicity, personality rights, and the personal data protection right), this article examines the threat that generative AI voice cloning poses to the unique value of vocal identity and the legal responses to it.

Abstract

The advent of sophisticated AI-driven voice cloning has brought to the fore critical legal and ethical challenges regarding the protection of vocal identity. Prompted by recent controversies - including the striking resemblance between OpenAI's ChatGPT-4o voice and that of Scarlett Johansson - this article examines how generative AI technologies undermine the unique value of the human voice and further complicate the legal questions surrounding personality rights. Through a comparative analysis, the paper evaluates three principal legal frameworks: the right of publicity, personality rights, and the personal data protection right. Each framework - rooted in different legal traditions - offers distinct strengths and limitations in addressing the threats posed by AI-generated voice cloning. By analysing these doctrines' scope, remedies, and posthumous protections, the study offers a foundation for understanding how existing legal approaches may be applied to the evolving challenges of vocal identity in the era of generative AI.

2606.12809 2026-06-12 cs.AI cs.LG New submission

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

He Li, Haoang Chi, Qizhou Wang, Yunxin Mao, Zhiheng Zhang, Jie Tan, Tongliang Liu, Wenjing Yang, Bo Han

AI summary: Proposes the MLUBench benchmark for evaluating multimodal large language models under sequential unlearning requests, finds that existing methods suffer cumulative degradation, highlights the challenge of preserving multimodal alignment, and proposes LUMoE to mitigate the degradation.

Comments
36 pages, accepted to ICML 2026
Abstract

Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM Lifelong Unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are open-sourced at this https URL.

2606.12808 2026-06-12 cs.LG cs.AI New submission

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal

AI summary: Proposes SymQNet, an amortized reinforcement-learning method that learns a posterior-conditioned acquisition policy offline and runs a fast forward pass online, substantially reducing acquisition latency in adaptive Hamiltonian learning.

Abstract

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.

2606.12805 2026-06-12 cs.HC cs.AI New submission

Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

Prerna Ravi, Carúmey Stevens, Ben Hurt, Brandon Hanks, Grace Lin, Emma Anderson

AI summary: In an experiment with 33 teachers, the study finds that the accent of a GenAI voice agent (British, Indian, or African American) affects whether it is perceived as a tool or a peer, which in turn affects trust, engagement, and reliance.

Abstract

Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in fostering productive peer interaction. LLM conversational peer agents introduce new possibilities for mediating in-person group work, raising questions about how persona design, particularly their voice characteristics, shapes learners' perceptions, trust, and interactional dynamics. While prior work has examined agent accent effects in one-to-one settings, little is known about how these effects manifest in groups. We conducted a between-subjects mixed-methods study with 33 teachers examining how a GenAI voice agent with different accents (British, Indian, and African American) influenced collaboration and agent perception. Across surveys, group interaction analyses, and artifacts, we find that accent shaped participants' mental models and the roles the agent assumed in group interaction. The British-accented agent was largely treated as a tool and engaged in detached, utility-based ways, whereas Indian- and African American-accented agents were more readily anthropomorphized and integrated as peers. These role expectations influenced trust, engagement, and reliance over time. This work advances understanding of how GenAI's sociolinguistic design features shape group dynamics in CSCL, with implications for designing culturally inclusive AI partners in group learning.

2606.12774 2026-06-12 eess.SY cs.AI cs.CL New submission

Agentic MPC for Semantic Control System Resynthesis

Yuya Miyaoka, Masaki Inoue

AI summary: Proposes an agentic MPC framework that integrates large language model agents for context-aware, semantically adaptive control synthesis, and demonstrates in an autonomous driving scenario that it can adjust control to personal preferences or to social situations such as yielding to emergency vehicles.

Comments
7 pages, 5 figures
Abstract

While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. To address this limitation, this manuscript introduces an agentic MPC framework that enables context-aware, semantically adaptive control synthesis by integrating with large language model-based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, to resynthesize the control specifications. The effectiveness of the framework is demonstrated in an autonomous driving scenario, where the system aligns with personal preferences or responds to social situations such as emergency vehicle yielding.

2606.12765 2026-06-12 cs.CL cs.DC New submission

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Ramchand Kumaresan

AI summary: Reverse-engineers the Metal 4.1 tensor compute path on the Apple M4 Max through microbenchmarks, showing that fp8 matmul2d is emulated rather than hardware-accelerated, and reconstructs the 8x8 tensor fragment layout.

Abstract

Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, but never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding is that the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing, and that it accumulates in >=fp32; we also reconstruct the opaque 8x8 cooperative_tensor fragment layout that Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by 6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

2606.12754 2026-06-12 cs.CL cs.AI New submission

LLMs Can Better Capture Human Judgments--With the Right Prompts

Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray

AI summary: With simple prompting strategies, LLMs can recover the full distribution of human responses and become less sensitive to wording variations, improving AI-human alignment.

Abstract

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.

2606.12748 2026-06-12 cs.CL New submission

Agent-based models for the evolution of morphological alternation patterns

Aravinth Kulanthaivelu, Richard Sproat

AI summary: Uses multi-agent simulation to study how morphological alternations (such as go/went) emerge, finding that scale-free social networks and random adoption policies yield more realistic morphological patterns.

Comments
51 + 37 pages, 31 figures
Abstract

Why is the past of English "go" the apparently unrelated "went"? Such alternations are frequent in languages. They neither aid communication nor learnability, yet they can be persistent, surviving over centuries or millennia. We present a multi-agent simulation of the emergence of morphological stem and inflection alternations. Alternate forms arise by phonological changes or, as with "go/went", from lexical alternatives associated with a subset of the population. When an agent 'hears' another agent use a novel form for a slot in the paradigm of a word (say, the past tense of go), they will with some probability adopt that form, possibly spreading its use to other slots in the paradigm that shared the same original form. Thus alternative forms can spread through the population and become entrenched as stem or inflectional marker alternants. Unlike many previous computational studies, our system allows for naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and agent populations in the tens or hundreds. It supports several network topologies, diffusion patterns and agent adoption policies. One issue with such simulations is evaluation: how realistic is the resulting morphology compared to those of real languages? We introduce the AI Historical Linguist, a novel Large Language Model-driven system that models a debate between two historical linguists. We use this to compare a set of real language morphologies, disguised morphologies, and experimentally evolved morphologies. The results suggest that among the factors that favor more plausible morphologies are scale-free social networks and random Bernoulli adoption of forms. We also present three case studies modeling attested historical changes, allowing us to test what might have happened if history had been different. All code and data are released.
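The adoption rule described in the abstract (a listener who hears a novel form adopts it with some probability) can be illustrated with a toy simulation. This is only a minimal sketch under strong simplifying assumptions: a fully mixed population rather than the scale-free networks the paper favors, a single paradigm slot, and no phonology. The function and parameter names are hypothetical and are not taken from the authors' released code.

```python
import random


def simulate_adoption(n_agents: int, p_adopt: float, n_interactions: int, seed: int = 0) -> int:
    """Return how many agents use the novel form after random pairwise interactions.

    Toy model: agent 0 starts with the novel form. In each interaction a random
    speaker addresses a random listener, and a listener who hears the novel form
    adopts it with probability p_adopt (one Bernoulli trial per hearing).
    """
    rng = random.Random(seed)
    uses_novel = [False] * n_agents
    uses_novel[0] = True  # a single innovator seeds the change
    for _ in range(n_interactions):
        speaker, listener = rng.sample(range(n_agents), 2)
        if uses_novel[speaker] and not uses_novel[listener]:
            if rng.random() < p_adopt:
                uses_novel[listener] = True
    return sum(uses_novel)


# Compare spread under different adoption probabilities (illustrative values).
for p in (0.0, 0.1, 0.5):
    print(p, simulate_adoption(n_agents=50, p_adopt=p, n_interactions=5000))
```

With `p_adopt=0.0` the form never leaves the innovator, while larger values let it spread through the population; swapping the uniform pairing for a network-constrained one is the natural next step toward the setting the paper studies.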

2606.12737 2026-06-12 cs.CR cs.AI New submission

PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

Pengfei He, Lesly Miculicich, Vishesh Sharma, Ash Fox, George Lee, Jiliang Tang, Tomas Pfister, Long T. Le

AI summary: Proposes PI-Hunter, an automated auditing framework that builds source-aware test cases and evolves them iteratively to proactively expose latent prompt injection vulnerabilities in LLM agents, substantially improving vulnerability exposure and attack-surface coverage.

Abstract

Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at inference time, and current red-teaming methods primarily optimize attack success. As a result, developers have limited visibility into how latent prompt injections emerge and propagate through agents. We propose PI-Hunter, an automated agentic auditing framework for proactive vulnerability exposure in LLM agents. PI-Hunter constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to induce agents to retrieve and reveal latent malicious instructions embedded within external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate that PI-Hunter substantially improves vulnerability exposure and attack-surface coverage over strong automated red-teaming baselines, while remaining effective under existing prompt injection defenses.

2606.12718 2026-06-12 cs.LG eess.SP New submission

Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

Sudeepta Mondal, Ganesh Sundaramoorthi

AI summary: To address distribution shift caused by unknown transmitters and temporal drift in open-set RF fingerprinting, the paper presents OOD detection methods in a unified information-theoretic framework and applies tuning that needs no OOD data, showing on the POWDER dataset that performance approaches that of baselines tuned with true OOD data.

Abstract

Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to its adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to the open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of methods and development of new methods. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data for open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly outperform baseline approaches without access to true OOD tuning data, showcasing their practical viability for the RFF problem.

2606.12709 2026-06-12 cs.MA cs.CR cs.LG New submission

Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows

Timothy McAllister, Sina Abdidizaji, Ivan Garibay, Ozlem Ozmen Garibay

AI summary: Studies how model scale affects the security of linear multi-agent workflows, finding that larger models are more likely to execute malicious instructions but that a lightweight Fixer stage restores performance, which suggests linear structures are robust when properly corrected.

Comments
16 pages (4 are main text), 2 figures, 6 tables. Accepted to the AIWILD Workshop at ICML 2026
Abstract

As LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection or jailbreaking to sabotage individual agents within MAS workflows, but the interaction between model scaling and system-level resilience remains poorly understood. This paper investigates how model scale affects the security of linear multi-agent workflows. Our experiments across scales of two open-weight model families on the HumanEval benchmark reveal a compliance-correction symmetry: larger models are far more likely to faithfully execute malicious instructions, with the control-to-malicious performance drop reaching 53.7pp at 27B in uncorrected pipelines. However, appending a lightweight terminal Fixer stage collapses this to 0.6pp and restores statistical parity with control-level performance, demonstrating that strictly linear collaboration structures can be viable and resilient to adversaries at this scale, and suggesting that the brittleness previously attributed to linear topology may stem from a lack of correction.

2606.12703 2026-06-12 cs.CR cs.AI cs.LG New submission

SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

Tarun Sharma

AI summary: Proposes the SMSR defence framework, which combines write-time HMAC signing with query-time randomised memory ablation and verdict-based majority voting to provide the first certified robustness guarantee against multi-session memory poisoning.

Abstract

Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
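The two components named in the abstract can be illustrated with standard-library Python. The sketch below is not the paper's implementation: the key, the function names, and the memory text are hypothetical, and the last function shows only the generic zero-draw case of the hypergeometric distribution, not the certificate the authors derive.

```python
import hashlib
import hmac
from math import comb

# Hypothetical demo key; a real deployment would load a managed secret instead.
SECRET_KEY = b"demo-key-not-for-production"


def sign_memory(text: str, key: bytes = SECRET_KEY) -> str:
    """Compute an HMAC-SHA256 tag for a memory entry at write time."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def is_authentic(text: str, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Verify a stored tag in constant time before the entry may be retrieved."""
    return hmac.compare_digest(sign_memory(text, key), tag)


def clean_subset_probability(n_total: int, n_poisoned: int, n_kept: int) -> float:
    """Probability that a uniformly random subset of n_kept entries holds no poisoned entry.

    Generic hypergeometric illustration of why random ablation limits the
    influence of a few poisoned memories; not the bound from the paper.
    """
    return comb(n_total - n_poisoned, n_kept) / comb(n_total, n_kept)


entry = "Refund policy: purchases can be returned within 30 days."
tag = sign_memory(entry)
print(is_authentic(entry, tag))                # a correctly signed entry passes
print(is_authentic(entry + " (edited)", tag))  # a tampered entry fails
print(clean_subset_probability(n_total=10, n_poisoned=1, n_kept=5))
```

Signing blocks entries written without the key, but it cannot stop an adversary whose writes are legitimately signed, which is why the abstract pairs it with randomised ablation and voting at query time.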

2606.12702 2026-06-12 cs.AI New submission

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin, Meena Jagadeesan, Sanmi Koyejo, Nigam Shah

AI summary: For a clinical LLM system, proposes a pre-response classifier that uses deployment context (such as provider type and department name) to predict the risk of user rejection, reaching an AUROC of 0.719, and shows its usefulness for guardrail triggering and abstention.

Abstract

Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets -- leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.
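For readers less familiar with the headline metric, AUROC is the probability that a randomly chosen rejected interaction receives a higher predicted risk than a randomly chosen accepted one. The sketch below computes it directly from that definition; the scores and labels are made-up illustrative values, not data from the study.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted rejection risks and outcomes (1 = user rejected the response).
risk = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
rejected = [1, 1, 1, 0, 0, 0]
print(auroc(risk, rejected))
```

A value of 0.5 corresponds to chance and 1.0 to perfect ranking, so the reported 0.719 indicates a useful but imperfect signal, consistent with the abstract framing it as a trigger for targeted guardrails rather than a final decision.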

2606.12674 2026-06-12 cs.AI New submission

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao

AI summary: Proposes Evoflux, an inference-time evolutionary search method that repairs the tool workflows of compact language models through structured edits and execution feedback, raising execution feasibility from about 3% to 17-24%; it outperforms SFT and shows lower variance and token cost than ReAct.

Comments
Code is available at this https URL
Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.

2606.12671 2026-06-12 cs.CV 新提交

SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

SalArt-VQA: 诊断VLM是否理解生成图像中的显著伪影

Xiaoxiao Sun, Ruotian Zhang, Junzhe Huang, James Burgess, Serena Yeung-Levy

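AI summary: Introduces the SalArt-VQA benchmark, which uses 950 images and 3,681 multiple-choice questions to evaluate VLM understanding of artifacts in generated images along four axes (presence detection, semantic localization, spatial grounding, and defect identification), revealing failure modes hidden behind high detection accuracy.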

详情
Comments
23 pages, 7 figures, 7 tables. Dataset: this https URL
AI中文摘要

视觉语言模型(VLM)越来越多地被用于检测AI生成图像是否包含可见伪影,然而它们分析此类伪影的能力仍然知之甚少。正确的图像级决策仍可能隐藏重要失败:模型可能正确标记伪影,但依赖于错误的视觉线索、选择错误的区域,或描述图像中不存在的缺陷。为了直接评估这些行为,我们引入了SalArt-VQA,一个用于细粒度理解AI生成图像中显著伪影的诊断基准。SalArt-VQA包含950张图像和3,681道人工编写的多项选择题,涵盖伪影图像、匹配的真实参考图像和配对的生成参考图像。四种对齐的问题类型评估存在检测、语义定位、空间基础和证据基础的缺陷识别,而参考分割测试了当注释缺陷不存在时的校准和弃权能力。在20个VLM上,SalArt-VQA揭示了图像级检测准确率所隐藏的失败:最强的模型在伪影图像上达到99.37%的检测召回率,但仅在53.26%的图像上正确回答了所有四个伪影侧问题。比较伪影图像与无伪影参考揭示了灵敏度-校准权衡:敏感模型经常做出无根据的伪影声明,而保守模型主要通过遗漏真实伪影来避免误报。这些结果表明,高伪影检测准确率本身并不意味着有基础的伪影理解。SalArt-VQA暴露了这些隐藏的失败模式,并提供了对VLM伪影声明是否得到局部视觉证据支持的细粒度评估。

英文摘要

Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.
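The gap between detection recall and joint correctness described above can be illustrated with a small sketch. The question-type keys and the four toy records are hypothetical placeholders, not the benchmark's schema or data; the point is only that an image counts toward the stricter score when all four artifact-side answers are right.

```python
from typing import Dict, List

# Hypothetical short names for the four aligned question types.
QUESTION_TYPES = ("presence", "semantic", "spatial", "defect")


def detection_recall(results: List[Dict[str, bool]]) -> float:
    """Share of artifact images whose presence question was answered correctly."""
    return sum(r["presence"] for r in results) / len(results)


def all_four_accuracy(results: List[Dict[str, bool]]) -> float:
    """Share of artifact images with every one of the four questions correct."""
    return sum(all(r[q] for q in QUESTION_TYPES) for r in results) / len(results)


# Toy per-image correctness flags (made up for illustration).
toy_results = [
    {"presence": True, "semantic": True, "spatial": True, "defect": True},
    {"presence": True, "semantic": True, "spatial": False, "defect": True},
    {"presence": True, "semantic": False, "spatial": True, "defect": False},
    {"presence": False, "semantic": False, "spatial": False, "defect": False},
]

print(detection_recall(toy_results))   # 0.75
print(all_four_accuracy(toy_results))  # 0.25
```

A model can therefore score well on the first number while failing most images on the second, which is the pattern the abstract reports.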

2606.12667 2026-06-12 cs.NI cs.AI eess.SY 新提交

Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

低地球轨道卫星地面站位置的自由布局优化

Grace Ra Kim, Duncan Eddy, Vedant Srinivas, Mykel J. Kochenderfer

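AI summary: Proposes SCORE, a two-stage free-placement method for optimizing ground station locations. Relative to differential evolution it needs up to 5x fewer function evaluations and improves downlink throughput by up to 13%; relative to fixed-site methods it achieves up to 15% greater total downlink.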

详情
Comments
34 pages, 13 figures, 11 tables, Journal of Aerospace Information Systems (JAIS)
AI中文摘要

快速扩展的低地球轨道卫星星座对地面网络的需求日益增加,推动了更高效地面站网络设计的发展。当前方法从预定义位置选择站点,将优化限制在现有基础设施内,从而约束了性能。相比之下,自由布局优化在地球连续空间域上运行,拓宽了搜索空间,允许更高吞吐量的配置,但代价是可能需要部署新的基础设施。在这项工作中,我们引入了SCORE(通过细化与评估的顺序循环优化),一种用于地面站设计的两阶段自由布局方法。SCORE结合了顺序坐标选择与循环细化,以应对全局优化器面临的高维度、非凸性和局部最小值挑战。我们使用Kongsberg卫星服务公司和世界电信协会的位置,将SCORE与差分进化(DE)等一次性方法以及整数规划方法进行了基准测试。在两个商业地球观测星座(Capella Space和ICEYE)和一个合成Walker-Star星座上的测试表明,与DE相比,SCORE收敛所需的函数评估次数最多减少5倍,同时下行吞吐量提升高达13%。与固定站点方法相比,无约束SCORE实现了高达15%的总下行量提升,为灵活布局建立了强大的经验性能基准;受基础设施约束的SCORE在将布局限制在现有光纤和电力基础设施附近的同时,保留了超过92%的增益。我们还探讨了扩建现有站点与部署新站点之间的权衡,为运营星座的未来地面网络设计提供参考。

英文摘要

Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the development of more efficient ground station network designs. Current approaches select sites from predefined locations, limiting optimization to existing infrastructure and constraining performance. In contrast, free-placement optimization operates over a continuous spatial domain on Earth, broadening the search space and allowing higher-throughput configurations at the cost of potentially requiring new infrastructure deployment. In this work, we introduce SCORE (Sequential Cyclic Optimization via Refinement & Evaluation), a two-stage free-placement method for ground station design. SCORE combines sequential coordinate selection with cyclic refinement to manage high-dimensionality, non-convexity, and local minima that challenge global optimizers. We benchmark SCORE against one-shot methods such as differential evolution (DE) and integer programming approaches using locations from Kongsberg Satellite Services and the World Teleport Association. Tests across two commercial Earth observation constellations (Capella Space and ICEYE) and one synthetic Walker-Star constellation show that SCORE requires up to 5x fewer function evaluations to converge relative to DE while improving downlink throughput by up to 13%. Compared to fixed-site methods, unconstrained SCORE achieves up to 15% greater total downlink, establishing a strong empirical performance benchmark for flexible placement; infrastructure-constrained SCORE retains over 92% of this gain while restricting placement to within proximity of existing fiber and power infrastructure. We also explore trade-offs between expanding existing stations and deploying new sites, informing future ground network design for operational constellations.
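The two-stage pattern named in the abstract (sequential selection followed by cyclic refinement) can be sketched generically as below. This is not the authors' SCORE implementation: the flat 2-D plane, the Gaussian `coverage` objective, the candidate grid, and the step-halving rule are all assumptions chosen to keep the example self-contained.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def coverage(stations: List[Point], targets: List[Point]) -> float:
    """Toy objective (assumption): each target is served by its nearest station."""
    return sum(
        max(math.exp(-((sx - tx) ** 2 + (sy - ty) ** 2)) for sx, sy in stations)
        for tx, ty in targets
    )


def sequential_select(k: int, candidates: List[Point],
                      objective: Callable[[List[Point]], float]) -> List[Point]:
    """Stage 1: add stations one at a time, keeping earlier choices fixed."""
    chosen: List[Point] = []
    for _ in range(k):
        best = max(candidates, key=lambda c: objective(chosen + [c]))
        chosen.append(best)
    return chosen


def cyclic_refine(stations: List[Point],
                  objective: Callable[[List[Point]], float],
                  step: float = 0.5, min_step: float = 1e-3):
    """Stage 2: cycle through stations, nudging one at a time in continuous space."""
    stations = list(stations)
    best = objective(stations)
    while step > min_step:
        improved = False
        for i in range(len(stations)):
            for dx, dy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
                x, y = stations[i]
                trial = stations[:i] + [(x + dx, y + dy)] + stations[i + 1:]
                value = objective(trial)
                if value > best + 1e-12:  # accept only strict improvements
                    stations, best, improved = trial, value, True
        if not improved:
            step /= 2.0  # shrink the move size once a full cycle finds nothing
    return stations, best
```

Stage 1 keeps each subproblem low-dimensional, and stage 2 lets stations leave the candidate grid, which is the sense in which placement is "free" in this toy setting.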

2606.12666 2026-06-12 cs.CR cs.AI 新提交

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

CAPED:面向移动GUI代理的上下文感知隐私暴露防御

Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang

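AI summary: Addresses the incidental visual privacy exposure caused by screenshot uploads from mobile GUI agents with CAPED, a context-aware pre-upload exposure control layer. It extracts task requirements, uses screen context as a privacy prior, and parses UI elements so that only task-relevant content is exposed, substantially reducing privacy leakage while preserving high task utility.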

详情
AI中文摘要

基于截图的移动GUI代理能够像人类用户一样通过相同的视觉界面操作普通智能手机应用,但这种能力也将每一次屏幕观察变成了隐私边界。在正常任务执行过程中,截图可能暴露联系人、消息、照片、文件、推荐、健康提示等与用户请求无关的敏感上下文。我们称这个问题为附带视觉隐私暴露。现有防御难以解决:文本匿名化遗漏了许多视觉和推理线索,而通用隐私遮蔽可能移除GUI代理完成任务所需的证据和控制。本文提出CAPED,一种面向移动GUI代理的上下文感知预上传暴露控制层。CAPED被设计为手机端保护层:在截图被释放到远程多模态代理之前,它提取任务需求,利用屏幕上下文作为隐私先验,解析可见UI元素,并仅选择性暴露当前任务所需的内容,同时遮蔽附带隐私内容。我们在AndroidWorld上评估CAPED的广泛任务效用,并使用受控的28任务种子隐私评估作为轨迹级附带泄漏的测量工具。在该种子评估中,完整CAPED将成功条件下的加权种子泄漏从原始截图的0.766降低到0.268,同时保持高任务效用。更广泛的AndroidWorld运行显示了剩余的原型级效用成本,但结果支持核心主张:截图上传应被视为明确的设备-云边界决策,由任务驱动的选择性暴露而非全有或全无的屏幕共享来管理。

英文摘要

Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results support the central claim that screenshot upload should be treated as an explicit device-cloud boundary decision, governed by task-driven selective exposure rather than all-or-nothing screen sharing.

2606.12658 2026-06-12 cs.LG q-bio.QM stat.ML 新提交

Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

基于物理信息的神经网络用于化疗药代动力学:基准测试临床估计器并揭示参数可辨识性

Riya Bisht, Dhruv Agarwal

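AI summary: Applies physics-informed neural networks (PINNs) to chemotherapy pharmacokinetics. The PINN matches the clinical standard method on the linear two-compartment model, exposes parameter non-identifiability in the Michaelis-Menten extension, and partially recovers identifiability from sparse tissue observations.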

详情
AI中文摘要

物理信息神经网络(PINN)是生物学中部分观测问题的一个有吸引力的工具,其中控制动力学已知但某些隔室无法测量。化疗药代动力学(PK)是一个清晰的实例:血浆中的药物浓度常规测量,但组织中的浓度——决定肿瘤杀伤和脱靶毒性——无法测量。我们在两个PK问题上将PINN与标准临床基线(非线性最小二乘解析双指数血浆解,以下简称NLS)和物理无关的神经基线(仅数据的MLP)进行基准测试。在线性双室问题上,NLS接近最优;PINN在匹配其性能(小常数因子内)的同时,在单次训练过程中产生组织曲线,而仅数据的MLP在组织上失败约10倍。在Michaelis-Menten扩展(可饱和消除)上,双指数闭式不再存在,因此NLS被错误指定并静默返回无意义的速率常数。PINN反而揭示了一个更深层的事实:Michaelis-Menten双室模型仅从血浆数据不可辨识,PINN通过收敛到k12 -> 0的盆地诚实地报告这一点。添加两个稀疏组织观测在很大程度上解决了可辨识性:在五个随机种子上,PINN恢复k21在真实值的1%以内,Vmax和Km在一个标准差范围内,而k12向正确方向移动(0.02 -> 0.82)但仍低于真实值约2个标准差——这是闭式NLS估计器根本无法尝试的恢复,因为其双指数假设仅描述血浆。我们的主张不是PINN击败NLS。而是PINN提供了一种统一的方案,该方案在教科书问题上与教科书估计器匹配,揭示了教科书估计器隐藏的结构可辨识性,并在单一损失中吸收异构测量。

英文摘要

Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue -- which determines tumour kill and off-target toxicity -- is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 -> 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 -> 0.82) but remains ~2 sigma below truth -- a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
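For readers unfamiliar with the linear two-compartment model, the sketch below integrates the standard textbook ODEs for an intravenous bolus and checks the plasma amount against the biexponential closed form that a least-squares fit would target. The elimination rate name `k10`, the dose, and all rate values are assumptions for illustration (the abstract names only k12, k21, Vmax, and Km), and the code works in drug amounts rather than concentrations.

```python
import math


def rhs(plasma, tissue, k10, k12, k21):
    """Linear two-compartment dynamics (amounts): plasma <-> tissue, elimination from plasma."""
    d_plasma = -(k10 + k12) * plasma + k21 * tissue
    d_tissue = k12 * plasma - k21 * tissue
    return d_plasma, d_tissue


def simulate(dose, k10, k12, k21, t_end, dt=1e-3):
    """Classical fourth-order Runge-Kutta integration from an IV bolus at t = 0."""
    p, q = dose, 0.0
    for _ in range(round(t_end / dt)):
        a1, b1 = rhs(p, q, k10, k12, k21)
        a2, b2 = rhs(p + 0.5 * dt * a1, q + 0.5 * dt * b1, k10, k12, k21)
        a3, b3 = rhs(p + 0.5 * dt * a2, q + 0.5 * dt * b2, k10, k12, k21)
        a4, b4 = rhs(p + dt * a3, q + dt * b3, k10, k12, k21)
        p += dt * (a1 + 2 * a2 + 2 * a3 + a4) / 6.0
        q += dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6.0
    return p, q


def plasma_closed_form(dose, k10, k12, k21, t):
    """Biexponential plasma solution; alpha and beta are the two decay rates."""
    s = k10 + k12 + k21
    disc = math.sqrt(s * s - 4.0 * k10 * k21)
    alpha, beta = (s + disc) / 2.0, (s - disc) / 2.0
    return dose * ((alpha - k21) * math.exp(-alpha * t)
                   + (k21 - beta) * math.exp(-beta * t)) / (alpha - beta)
```

The closed form describes plasma only, whereas the simulation also yields the unobserved tissue amount, which mirrors the abstract's distinction between the biexponential fit and a model that carries both compartments. Replacing the linear elimination term with a saturable one removes the closed form but leaves `simulate` usable with a modified `rhs`.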

2606.12655 2026-06-12 cs.CR cs.CV 新提交

Amnesia: A Stealthy Replay Attack on Continual Learning Dreams

Amnesia: 一种针对持续学习梦境的重放隐蔽攻击

Ahmed Sharshar, Naveen Kumar Kummari, Mohsen Guizani

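AI summary: Proposes Amnesia, an attack that controls only replay index selection and maximizes performance degradation of continual learning models under audit constraints, exposing index-level replay control as a threat surface.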

详情
AI中文摘要

持续学习(CL)模型常使用经验重放来减少灾难性遗忘,但其对重放采样干扰的鲁棒性尚未充分探索。现有的CL攻击会改变输入或训练流程(投毒/后门),且很少包含明确的审计约束,限制了真实性。这里,审计性意味着监控者可以通过检查采样器可见的遥测数据(例如,记录的重放索引/标签统计)来验证合规性,即检查实现的重放类别直方图是否接近名义基线,以及重放率在每个批次和/或滚动窗口内是否不变。我们研究了一个权限受限的内部人员,其仅控制重放索引选择,而不控制像素、标签或模型参数,同时保持在审计限制内(如队列优先级)。我们提出了Amnesia,一种重放组合攻击,在两种预算下最大化性能下降:可见性预算δ,限制与名义类别直方图p0的TV/KL散度;以及质量预算f,固定重放率。Amnesia有两个步骤:(i)计算轻量级类别效用(如EMA损失或置信度),将p0向有害类别倾斜;(ii)使用高效的KL(指数倾斜)或TV(平衡质量重分配)优化器将倾斜投影回δ-球内。窗口调度器强制执行滚动审计。在具有挑战性的CL基准测试和强重放基线中,Amnesia持续降低最终准确率(ACC)并恶化反向迁移(-BWT)。KL变体在多种审计方案(包括每批次和滚动窗口检查)下实现高影响且基本未被检测到。TV变体更具破坏性但更易检测,尤其是在严格的每类别约束下。这些结果揭示了仅索引重放控制是CL系统中一个实用且可审计的威胁面,并建立了原则性的影响-可见性权衡。

英文摘要

Continual learning (CL) models often use experience replay to reduce catastrophic forgetting, but their robustness to replay sampling interference remains underexplored. Existing CL attacks alter inputs or training pipelines (poisoning/backdoors) and rarely include explicit auditable constraints, limiting realism. Here, auditability means a monitor can verify compliance from sampler-visible telemetry - e.g., logged replay index/label statistics - by checking that the realized replay class histogram stays close to a nominal baseline and that replay rate is unchanged per batch and/or over a rolling window. We study a limited-privilege insider who controls only replay index selection, not pixels, labels, or model parameters, while staying within auditable limits such as queue priorities. We introduce Amnesia, a replay composition attack that maximizes degradation under two budgets: a visibility budget delta bounding the TV/KL divergence from a nominal class histogram p0, and a mass budget f fixing the replay rate. Amnesia has two steps: (i) compute lightweight class utilities, such as EMA loss or confidence, to tilt p0 toward harmful classes; and (ii) project the tilt back into the delta-ball using efficient KL (exponential tilt) or TV (balanced mass redistribution) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks and strong replay baselines, Amnesia consistently lowers final accuracy (ACC) and worsens backward transfer (-BWT). The KL variant delivers high impact while remaining largely undetected under multiple audit schemes, including per-batch and rolling-window checks. The TV variant is more damaging but easier to detect, especially under tight per-class constraints. These results expose index-only replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility trade-off.
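The KL-bounded tilting step can be illustrated with standard exponential tilting, which is also what an auditor would need to reason about when choosing the visibility budget. The sketch below is an assumption-laden toy, not the authors' optimizer: it takes the divergence as KL(q || p0), picks the tilt strength by bisection, and uses made-up utilities.

```python
import math
from typing import List


def kl(q: List[float], p: List[float]) -> float:
    """KL(q || p) for discrete distributions with p > 0 everywhere."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def tilt(p0: List[float], utility: List[float], lam: float) -> List[float]:
    """Exponential tilt: q(c) proportional to p0(c) * exp(lam * utility(c))."""
    weights = [pi * math.exp(lam * ui) for pi, ui in zip(p0, utility)]
    total = sum(weights)
    return [w / total for w in weights]


def project_to_kl_ball(p0: List[float], utility: List[float], delta: float,
                       lam_max: float = 50.0, iters: int = 60) -> List[float]:
    """Largest tilt whose divergence from p0 stays within delta.

    KL(tilt(lam) || p0) is non-decreasing in lam >= 0, so bisection applies.
    """
    if kl(tilt(p0, utility, lam_max), p0) <= delta:
        return tilt(p0, utility, lam_max)
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilt(p0, utility, mid), p0) <= delta:
            lo = mid
        else:
            hi = mid
    return tilt(p0, utility, lo)
```

With a uniform nominal histogram over four classes and a small delta, the result shifts a little mass toward the highest-utility class while staying inside the budget, which is why a histogram-distance audit alone may not flag it.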

2606.12651 2026-06-12 cs.LG q-bio.QM 新提交

Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

物理感知辅助损失提升图神经网络可合成性滤波器的分布外泛化能力

Riya Bisht, Dhruv Agarwal

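AI summary: Adds two auxiliary losses to a GNN, a topological complexity regression supervised by the Bertz index and a strain-energy soft penalty supervised by MMFF94 force-field energy, yielding a small but statistically significant gain in out-of-distribution AUC for a synthesizability filter (up to +0.0066).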

详情
AI中文摘要

机器学习药物发现流程越来越依赖生成模型,这些模型提出的分子远离用于训练下游可合成性滤波器的数据。现有滤波器(SAScore、SCScore、RAscore、DeepSA)纯粹基于统计,在分布外(OOD)场景下性能下降。我们探究廉价的闭式物理先验,作为图神经网络(GNN)的辅助监督,是否能改善OOD泛化。我们在GINE骨干网络上添加两个辅助损失:基于Bertz指数的拓扑复杂度回归,以及基于MMFF94力场能量的应变能软惩罚。在由SAScore阈值标注的65,177个分子语料库(HIV、Tox21、COCONUT)上,我们复现了强分布内基线,然后在单源OOD划分(在类药HIV+Tox21上训练,在COCONUT天然产物上测试)上评估4路消融实验(基线/+复杂度/+应变/+两者),重复5个种子并采用配对bootstrap置信区间。所有三个物理感知变体相比基线(平均OOD AUC 0.9774)均带来微小但统计显著的OOD提升:+复杂度Delta = +0.0060(95% CI [+0.0023, +0.0102]),+应变Delta = +0.0032([+0.0008, +0.0052]),+两者Delta = +0.0066([+0.0038, +0.0093]);每个区间均不包含零,且组合效果最佳。各变体在分布内表现无差异,因此效果仅在OOD评估下可见。我们明确指出效果是适度的,并报告一个警示性方法学发现:该实验的单种子版本产生了定性不同(非单调)的故事,未能在多种子评估中复现。

英文摘要

Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation.
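A paired bootstrap interval of the kind quoted above can be sketched as follows. The abstract does not say which units are resampled, so pairing by seed is an assumption here, and the AUC values are synthetic placeholders rather than the paper's numbers.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(baseline: List[float], variant: List[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean paired difference (variant - baseline)."""
    if len(baseline) != len(variant):
        raise ValueError("paired inputs must have equal length")
    diffs = [v - b for b, v in zip(baseline, variant)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2.0) * n_resamples)]
    upper = means[int((1.0 - alpha / 2.0) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Synthetic per-seed AUCs (illustrative only, not taken from the paper).
baseline_auc = [0.970, 0.975, 0.980, 0.978, 0.972]
variant_auc = [0.976, 0.979, 0.987, 0.985, 0.977]
delta, lower, upper = paired_bootstrap_ci(baseline_auc, variant_auc)
print(f"Delta = {delta:+.4f}, 95% CI [{lower:+.4f}, {upper:+.4f}]")
```

Resampling differences rather than the two arms independently keeps seed-to-seed variation from inflating the interval, which matters when the effect is as small as a few thousandths of AUC.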

2606.12647 2026-06-12 cs.CC cs.AI cs.LG 新提交

Token Complexity Theory for AI-Augmented Computing

AI增强计算的Token复杂度理论

Jie Wang

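AI summary: Proposes token complexity as a formal measure of query and response cost in AI-augmented computing, develops it within the AI-Oracle Turing machine framework, and proves basic theorems on monotonicity, convexity, price sensitivity, and price-relativity of task ordering.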

详情
Comments
25 pages, 1 figure
AI中文摘要

AI增强计算将自然语言查询、代码生成请求及其他开放式任务委托给一组AI模型,这些模型处理查询并生成响应。这一范式引入了一个经典时间或空间复杂度无法捕捉的资源维度:向该集群发送查询和接收响应的成本。我们引入Token复杂度,将其定义为在任务上达到指定输出质量水平所需的最小期望Token成本,并建立了一个根据概率性质强度对AI系统进行分类的体系。我们在AI-Oracle图灵机框架内发展Token复杂度,其中概率图灵机通过专用查询和响应磁带与随机Oracle交互。我们证明了基本定理,表明Token复杂度符合预期:单调性(更高质量需要更多Token)、凸性(质量改进逐渐变得更昂贵)、价格敏感性(小价格变化导致有界成本变化)以及任务排序的价格相对性(任务的Token复杂度排序可能根据查询与响应成本比率而反转)。我们证明了复杂度前沿(定义为Token、时间和空间中所有可行资源约束的集合)是非空的、向上封闭且凸的。

英文摘要

AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models that processes queries and generates responses. This paradigm introduces a resource dimension that neither classical time nor space complexity captures: the cost of sending queries to and receiving responses from such a cluster. We introduce token complexity, a formal resource measure defined as the minimum expected token cost to achieve a specified level of output quality on a task, and develop a taxonomy classifying AI systems by the strength of their probabilistic properties. We develop token complexity within the framework of AI-Oracle Turing machines, in which a probabilistic Turing machine interacts with a stochastic oracle via dedicated query and response tapes. We prove basic theorems establishing that token complexity behaves as expected: monotonicity (higher quality costs more tokens), convexity (quality improvements become progressively more expensive), price sensitivity (small price changes produce bounded cost changes), and price-relativity of task ordering (the token complexity ordering of tasks can reverse depending on the query-to-response cost ratio). We prove that the complexity frontier, defined as the set of all feasible resource bounds in tokens, time, and space, is non-empty, upward-closed, and convex.
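Price-relativity of task ordering can be seen in a two-task arithmetic example. The sketch simplifies heavily: it fixes one strategy per task and uses a linear cost in query and response tokens, whereas the paper defines token complexity as a minimum expected cost at a given quality level. The token counts and prices are invented for illustration.

```python
def token_cost(query_tokens: int, response_tokens: int,
               query_price: float, response_price: float) -> float:
    """Linear cost of one exchange under per-token prices (simplifying assumption)."""
    return query_price * query_tokens + response_price * response_tokens


# Hypothetical tasks: A is query-heavy, B is response-heavy.
task_a = (1000, 100)
task_b = (100, 600)

# Equal prices: A costs 1100 and B costs 700, so A ranks as the costlier task.
equal = (1.0, 1.0)
print(token_cost(*task_a, *equal), token_cost(*task_b, *equal))  # 1100.0 700.0

# Responses priced 5x higher: A costs 1500 and B costs 3100, so the order flips.
skewed = (1.0, 5.0)
print(token_cost(*task_a, *skewed), token_cost(*task_b, *skewed))  # 1500.0 3100.0
```

Neither task is intrinsically harder in this toy setting; which one ranks higher depends only on the query-to-response price ratio.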

2606.12639 2026-06-12 cs.LG q-bio.QM 新提交

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

度量选择胜者:评估选择翻转未见化学空间中药物反应预测的模型排名

Dhruv Agarwal, Riya Bisht

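AI summary: Using VCPI contest data, shows that model rankings for drug-response prediction invert with the evaluation metric: a simple linear baseline wins under a proxy metric, but under the contest's true metric the deep models win and the fusion decoder significantly beats the linear fingerprint baseline. To the authors' knowledge, this is the first demonstration of the metric-calibration effect on real held-out drug chemistry.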

详情
AI中文摘要

预测细胞转录组对其从未见过的药物的反应是计算细胞生物学中的一个核心难题:最近的基准测试表明,一旦测试化合物按化学结构留出,复杂模型往往无法击败简单基线。我们研究了一个细胞系和检测方法,即通过DRUG-seq分析的THP-1细胞,由VCPI预测竞赛的活性化合物加权MSE(wMSE)评分。我们提出了一种分阶段方法:该领域一直无法击败的简单基线(未处理对照和平均训练化合物响应);非参数检索(对留出化合物的最近训练化合物进行Tanimoto加权平均);以及一个融合阶段,将冻结的化学嵌入与检索支持特征相结合,以预测相对于均值的残差,并包含不确定性头和基因程序。在发布的VCPI THP-1 drug-seq数据(14,026个训练化合物)上,采用Bemis-Murcko骨架划分,模型排名根据度量标准反转。在逆方差每基因代理度量下,基于Morgan指纹的正则化线性回归似乎胜过了深度模型、检索和ChemBERTa——这是教科书式的“简单基线获胜”结果。但在竞赛的真实活性集度量(每(基因,化合物)的Mejia权重,经官方评分器验证;均值基线0.535 vs 组织者的0.507参考)下,情况反转:深度模型获胜,我们的融合解码器显著优于线性指纹基线(-0.012 wMSE,配对bootstrap p < 10^-4),而代理度量的胜者成为最差的化学感知预测器。选择度量即选择胜者——据我们所知,这是首次在真实留出药物化学数据上证明度量校准效应,该效应此前主要在遗传扰动中建立。我们发布了一个可复现的流水线,连接到官方评分器,可在真实的1064 x 12,995网格上生成有效提交。

英文摘要

Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
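The retrieval stage (a Tanimoto-weighted average over a held-out compound's nearest training compounds) can be sketched in a few lines. Fingerprints are represented here as sets of on-bit indices, and the neighbour count `k`, the fallback to the training mean, and the toy data are assumptions rather than details from the paper.

```python
from typing import FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_predict(query: Fingerprint,
                     train: List[Tuple[Fingerprint, List[float]]],
                     k: int = 2) -> List[float]:
    """Similarity-weighted average of the k most similar training responses."""
    n_genes = len(train[0][1])
    scored = sorted(((tanimoto(query, fp), resp) for fp, resp in train),
                    key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0.0:
        # No chemical overlap at all: fall back to the mean training response.
        return [sum(resp[g] for _, resp in train) / len(train) for g in range(n_genes)]
    return [sum(sim * resp[g] for sim, resp in scored) / total for g in range(n_genes)]


# Toy training set: (fingerprint bits, per-gene response), made up for illustration.
toy_train = [
    (frozenset({1, 2, 3}), [1.0, 0.0]),
    (frozenset({1, 2, 4}), [0.0, 1.0]),
    (frozenset({7, 8}), [5.0, 5.0]),
]
print(retrieve_predict(frozenset({1, 2, 3}), toy_train))
```

Under a scaffold split, the nearest training compounds tend to have low similarity to the query, so the weights flatten and the prediction drifts toward the mean response; that is one reason the mean baseline is hard to beat.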

2606.12633 2026-06-12 cs.CV cs.LG 新提交

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

ECA:面向开放图像到文本生成的高效持续对齐

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou, Huajie Shao

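AI summary: Proposes ECA, which combines a Mixture of Query module, Fisher Dynamic Expansion, and Dictionary Replay to perform continual alignment without access to data from previous tasks, mitigating catastrophic forgetting and improving incremental learning performance for open-ended image-to-text generation.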

详情
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

开放图像到文本生成(OpenITG)的增量学习(IL)使模型能够持续为新的图像生成准确、上下文相关的文本,同时保留先前获得的知识。与先前研究不同,本文处理了一个更实际的场景,其中视觉数据的主要类别随时间推移而演变。在此背景下,我们引入了持续对齐的新概念,它逐步调整预训练VLM中的对齐模块,以保持高质量的跨模态表示。基于这一思想,我们提出了高效持续对齐(ECA),一种用于OpenITG的无样本IL方法。关键挑战是使模型能够获取新的任务特定特征,同时最小化对已建立对齐的干扰,且无需访问先前任务的原始数据。为此,ECA采用了三种核心机制:混合查询(MoQ)模块,用于适应任务特定的查询令牌;Fisher动态扩展(FeDEx),基于Fisher信息矩阵(FIM)度量动态扩展模型结构;以及带有字典重放(DR)的嵌入字典,以保留过去的知识。为了评估ECA的性能,我们构建了四个新的IL OpenITG基准,更好地反映了现实场景。实验结果表明,与基线方法相比,ECA显著缓解了灾难性遗忘并提高了IL性能。代码和基准可在该https URL获取。

英文摘要

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at this https URL.

2606.12620 2026-06-12 cs.SE cs.AI 新提交

HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

HybridCodeAuthorship:一个用于行级代码作者归属检测的基准数据集

Luke Patterson, Li Wang, Adam Faulkner

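AI summary: Because existing benchmarks do not reflect how AI code assistants are actually used, introduces the HybridCodeAuthorship dataset, which interleaves human- and AI-authored lines of code, and evaluates two detection algorithms on it.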

详情
Comments
Accepted to LREC 2026
AI中文摘要

由于基于大型语言模型(LLM)的AI代码助手的快速采用,行业代码库越来越多地成为AI和人类编写代码的混合体。出于风险管理和生产力分析的目的,实现对AI生成代码的细粒度位置检测至关重要。为了开发此任务的算法,需要高质量的基准来评估性能。然而,现有的基准往往包含学术性的LeetCode风格问题,并假设代码片段要么完全由人类编写,要么完全由AI编写,这并不能反映使用AI代码助手的行业代码库的多样意图和风格。为了填补这些空白,我们引入了HybridCodeAuthorship,这是一个新颖的Python代码文件基准,其中交错有人类和AI编写的代码行,以模拟AI代码助手的真实使用。在本文中,我们首先介绍了我们的数据集构建流程,该流程利用了CodeSearchNet,这是一个包含GitHub上开源仓库链接的大型集合。然后,我们在行级和块级上评估了两种最先进的AI生成代码检测算法的性能。实验结果表明,HybridCodeAuthorship是一个具有挑战性的基准,得分最高的算法AIGCode Detector在块级和行级代码检测任务上分别获得了0.48和0.56的最高F1分数。

英文摘要

Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are increasingly a hybrid of AI- and human-authored code. For risk management and productivity analysis purposes, it is crucial to enable fine-grained location detection of AI-generated code. To develop algorithms for this task, quality benchmarks are needed to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume a code snippet is either completely human-authored or completely AI-authored, which is not reflective of the diverse intents and styles of industry codebases utilizing AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code to simulate authentic utilization of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open-source repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line and chunk level. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark: the top-scoring algorithm, AIGCode Detector, obtains best F1 scores of 0.48 and 0.56 on the chunk-level and line-level code detection tasks, respectively.

2606.12609 2026-06-12 cs.LG q-bio.QM 新提交

Viral Proteins Reveal Geometry of Protein Language Models

病毒蛋白质揭示蛋白质语言模型的几何结构

Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

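AI summary: Studies how protein language models trained on imbalanced data represent viral proteins, finding a dominant nativeness axis in embedding space that orders sequences by model perplexity. Scaling affects this axis unevenly across viral families, yet the embeddings retain viral-specific signal.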

详情
Comments
Accepted at ICML 2026 GenBio Workshop and FM4LS Workshop. Code available at this https URL
AI中文摘要

蛋白质语言模型在高度不平衡的数据集上训练,引发了一个问题:它们如何表示代表性不足的生物序列?以病毒蛋白作为跨ESM模型家族的案例研究,我们在嵌入空间中识别出一个主导的天然性轴,该轴与掩码重建困惑度对齐,将序列从建模良好的细胞蛋白通过病毒蛋白排序到打乱和随机序列。缩放效果在不同病毒家族间不均匀地压缩该轴。尽管如此,蛋白质语言模型嵌入保留了病毒特异性信号:病毒蛋白在零样本困惑度和浅层序列特征之上仍然是线性可分的。这些结果共同表明,pLM表示由天然性的一般概念结构化,同时保留了特定于不同生物群体的信息。

英文摘要

Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.

2606.12604 2026-06-12 cs.RO 新提交

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

EgoEngine:从自我中心人类视频到高保真灵巧机器人演示

Yangcen Liu, Shuo Cheng, Xinchen Yin, Woo Chul Shin, Alfred Cueva, Yiran Yang, Zhenyang Chen, Chuye Zhang, Danfei Xu

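AI summary: Proposes the EgoEngine framework, which bridges the visual and action gaps to turn egocentric human videos into high-fidelity robot data and, to the authors' knowledge, demonstrates the first zero-shot dexterous policy learning from such videos.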

详情
AI中文摘要

灵巧操作受限于大规模机器人演示数据的收集成本。自我中心人类视频提供了多样操作行为的可扩展来源,但直接用于机器人学习需要弥合两个差距:人类与机器人观测之间的视觉差距,以及人类运动与机器人可执行动作之间的动作差距。我们提出EgoEngine,一个可扩展的框架,用于将自我中心人类操作视频转化为高保真机器人数据。给定一个自我中心RGB视频,EgoEngine生成:(i) 高保真机器人观测视频,用机器人替换人类,同时保留场景上下文和时间对齐,以及(ii) 在可行性约束下,与任务对齐、可执行的机器人动作轨迹。在仿真和真实机器人上的实验表明,EgoEngine能够将人类视频可扩展地转化为机器人数据,并且据我们所知,首次展示了无需真实机器人演示,从自我中心人类视频进行零样本视觉运动灵巧策略学习。项目网站:此 https URL。

英文摘要

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable actions. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video that replaces the human with a robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: this https URL.

2606.12599 2026-06-12 cs.CL 新提交

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

通过波斯谚语条件故事生成实现LLM中的约束语义解压缩

Zahra Habibzadeh, Paria Khoshtab, Amir Mesbah, Yadollah Yaghoobzadeh

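AI summary: Proposes the constrained semantic decompression task, tests the abstraction-to-realization ability of large language models through Persian proverb-conditioned story generation, builds the PAND dataset, identifies a decompression gap, and shows that explicit reasoning and iterative refinement partially mitigate it.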

详情
AI中文摘要

将一个密集、抽象的谚语转化为引人入胜且道德忠实的故事需要深厚的文化理解和稳健的语义基础。我们将此问题定义为约束语义解压缩任务,并研究谚语条件故事生成作为大语言模型中抽象到实现的测试平台。聚焦波斯语,我们引入了谚语对齐叙事数据集(PAND),将谚语与人类编写的故事和显式含义配对。通过结合人类校准的LLM-as-a-Judge与结构度量的混合评估框架,我们分析了多种提示机制下的模型行为。我们的发现揭示了一个持续存在的解压缩差距:当前的LLM通常实现强大的表面流畅性,但未能忠实地实例化谚语中编码的潜在道德和因果结构。我们进一步表明,显式推理和迭代细化可以部分缓解这些失败,这表明许多解压缩错误源于将抽象含义转化为叙事形式的困难,而非完全缺乏相关知识。我们提出的任务自然扩展到其他形式的压缩文化知识。

英文摘要

Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic grounding. We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in large language models (LLMs). Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines human-calibrated LLM-as-a-Judge with structural metrics, we analyze model behavior across multiple prompting regimes. Our findings reveal a persistent decompression gap: current LLMs often achieve strong surface-level fluency while failing to faithfully instantiate the underlying moral and causal structure encoded in proverbs. We further show that explicit reasoning and iterative refinement can partially mitigate these failures, suggesting that many decompression errors arise from difficulties in translating abstract meaning into narrative form rather than a complete lack of relevant knowledge. Our proposed task naturally extends to other forms of compressed cultural knowledge.

2606.12581 2026-06-12 cs.SI cs.AI 新提交

Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

多关系网络中的图缩减:面向传播的缩减基准

Mateusz Stolarski, Michał Czuba, Piotr Bielak, Piotr Bródka

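AI summary: Proposes the SORB benchmark framework for systematically evaluating how graph reduction affects influence maximization, finding that the effect of reduction depends on the network type and the evaluation metric.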

详情
AI中文摘要

现实世界网络天生不完整、有噪声且动态演化,难以捕获所有参与者及其关系。其规模常使直接分析计算量大。虽然影响力最大化(IM)已被广泛研究,但图缩减作为预处理步骤及其对IM准确性的影响仍未被充分探索。本文引入面向传播的缩减基准(SORB),一个开源、标准化的框架,用于系统评估不同任务设置下的IM模型。SORB提供可扩展的流水线,操作于代表性真实世界网络集合(包括单层和多层结构),并将图缩减直接纳入评估过程。此设计将焦点从孤立分析IM算法转向量化图缩减如何改变预测性能。利用SORB,我们研究了多种IM场景下稀疏化和粗化的效果。结果表明,缩减的影响强烈依赖于网络类型(单层 vs. 多关系)和下游任务($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$):稀疏化在单层网络上保持种子集质量,而扁平化多层网络无论缩减策略如何均表现出系统性排名退化。这些发现强调了在研究复杂网络传播过程时,进行缩减感知的多任务评估的重要性。

英文摘要

Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relationships. Their scale often renders direct analysis computationally demanding. While influence maximisation (IM) has been widely studied, the role of graph reduction as a preprocessing step, and its impact on IM accuracy, remains underexplored. In this work, we introduce the Spreading-Oriented Reduction Benchmark (SORB), an open-source, standardised framework for systematically evaluating IM models across diverse task settings. SORB provides an extensible pipeline operating on a representative collection of real-world networks, including single- and multilayer structures, and incorporates graph reduction directly into the evaluation process. This design shifts the focus from analysing IM algorithms in isolation to quantifying how graph reduction alters predictive performance. Using SORB, we study the effects of sparsification and coarsening across multiple IM scenarios. Our results show that the impact of reduction is strongly dependent on both the network type (single-layer vs. multirelational) and the downstream task ($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$): sparsification preserves seed set quality on single-layer networks, whereas flattened multilayer networks exhibit systematic ranking degradation regardless of reduction strategy. These findings highlight the importance of reduction-aware, multi-task evaluation when studying spreading processes in complex networks.