arXivDaily: arXiv daily digest, updated Monday to Friday
2606.20064 2026-06-19 cs.HC New submission

AI Conversational Interviewing: Scaling Up Semi-Structured and In-depth Interviews

Alexander Wuttke, Max Melchior Lang, Christopher Klamm, Quirin Würschinger, Frauke Kreuter

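AI summary: This study proposes AI Conversational Interviewing, a method for collecting open-ended opinion data at scale through voice, text, or free-choice modes. It shows that the method captures deeper reasoning that standardized surveys miss, and that respondents rate it no lower than conventional surveys.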

English abstract

Public opinion research has long faced a trade-off between depth and scale: standardized surveys enable large-scale measurement but restrict respondents to researcher-defined categories, obscuring the diversity of unexpected considerations that underlie public sentiment. More conversational interviews provide richer insights through open-ended probing, but their reliance on trained human interviewers has kept them difficult to scale. This study introduces AI Conversational Interviewing as a method for collecting open-ended public opinion data at scale, pursuing three objectives: to demonstrate the analytical value of conversational text data for questions beyond the reach of closed-ended items; to assess the method's practical viability through participants' own evaluations; and to inform implementation by experimentally comparing voice-based, chat-based, and free-choice interview modes. We conducted a study combining an AI-led interview with a standardized survey on migration policy among 571 respondents recruited via Prolific and Payback Panel. The findings establish AI Conversational Interviewing as a viable and valuable addition to the social-science toolkit. The conversational transcripts surface considerations and reasoning that a comprehensive standardized battery does not capture, such as markedly different mental models of migration among subgroups with similar attitude levels. Among respondents who completed the interview, evaluations of the AI interview were at or above those of the standardized survey across modes, although completion itself varied by condition. By releasing open data and open-source pipeline materials, the study contributes to a growing literature on harnessing artificial intelligence to expand the methods of public opinion measurement.

2606.20047 2026-06-19 cs.IR New submission

PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami

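AI summary: Proposes PACMS, a context selection method based on submodular function maximization that, at prompt assembly time, selects content by relevance from the session, memory, and tool outputs, replacing truncation and improving information retention in long conversations.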

English abstract

Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and, often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval-augmented generation fetches external documents into the prompt but does not arbitrate the agent's already-present pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.
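To make the idea of selecting from a single pooled candidate set more concrete, here is a minimal sketch of budgeted greedy selection with a monotone submodular objective. It is not the PACMS algorithm: the abstract as listed does not spell out the objective, so the query-term coverage function, the `Candidate` type, and the precomputed `tokens` field are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source: str  # hypothetical label: "turn", "memory", or "tool"
    text: str
    tokens: int  # token cost, assumed precomputed and > 0


def coverage(selected, query_terms):
    """Assumed objective: number of distinct query terms covered by the selection.

    Set coverage is monotone and submodular, which is what makes greedy selection sensible.
    """
    covered = set()
    for c in selected:
        covered |= query_terms & set(c.text.lower().split())
    return len(covered)


def greedy_select(pool, query, budget):
    """Pick candidates by marginal gain per token until the budget is exhausted."""
    query_terms = set(query.lower().split())
    selected, used = [], 0
    remaining = list(pool)
    while remaining:
        base = coverage(selected, query_terms)
        best, best_ratio = None, 0.0
        for c in remaining:
            if used + c.tokens > budget:
                continue  # does not fit in the remaining budget
            gain = coverage(selected + [c], query_terms) - base
            ratio = gain / c.tokens
            if ratio > best_ratio:
                best, best_ratio = c, ratio
        if best is None:
            break  # nothing left that both fits and adds coverage
        selected.append(best)
        used += best.tokens
        remaining.remove(best)
    return selected


pool = [
    Candidate("turn", "the deploy key is stored in vault", 8),
    Candidate("tool", "lorem ipsum dolor sit amet " * 20, 100),
    Candidate("memory", "user prefers dark mode", 4),
]
result = greedy_select(pool, "where is the deploy key", budget=10)
```

In this toy pool, the old but relevant turn is kept while the large, irrelevant tool output is dropped, which is the behavior recency truncation cannot provide. A realistic objective would likely use embeddings rather than word overlap.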

2606.19988 2026-06-19 cs.SE New submission

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

Shi Chen, Rongcun Wang, Yuan Tian, Xiaoyuan Xie, Wei Song, Rubing Huang

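AI summary: Introduces the SolidityBench benchmark and the SolidityScore metric, evaluates several LLM approaches to repository-level Solidity code generation, and finds supervised fine-tuning to be the most effective.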

Comments 33 pages

English abstract

Large Language Models (LLMs) have shown strong capabilities in general-purpose code generation, but their effectiveness in specialized software domains remains underexplored. Solidity smart contracts represent a high-stakes domain where generated code must satisfy strict language-level, security, and software-engineering constraints. Existing benchmarks and metrics remain insufficient for repository-level Solidity generation, where models must synthesize complete contracts from natural language requirements. To address this gap, we introduce SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions. We also propose SolidityScore, a Solidity-aware semantic metric that emphasizes domain-critical constructs such as security modifiers, contract declarations, and Solidity-specific keywords. Using this benchmark, we evaluate representative code LLMs, including Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama, across zero-shot prompting, Chain-of-Thought reasoning, in-context learning, retrieval-augmented generation, and supervised fine-tuning. The results show that general-purpose models exhibit systematic structural deficiencies in repository-level Solidity generation. Among non-parametric methods, retrieval-augmented generation performs best, while in-context learning degrades beyond two examples due to context saturation. Supervised fine-tuning achieves the largest improvement by internalizing Solidity-specific constraints into model parameters. Overall, our study provides a comprehensive benchmark for repository-level Solidity code generation and shows that high-quality domain data combined with supervised fine-tuning is the most effective strategy for improving the reliability of LLM-generated smart contracts.

2606.19983 2026-06-19 cs.CR New submission

A Measurement Study of Cryptographic Misuse in Embodied AI Mobile Applications

Junchao Li, Xuelei Wang, Yuhang Huang, Qi Wang, Boyang Ma, Xuelong Dai, Minghui Xu, Yue Zhang

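AI summary: The first large-scale measurement of cryptographic misuse in embodied AI mobile applications. An automated semantic analysis pipeline yields 12,975 misuse findings and reveals structural security trade-offs caused by latency-sensitive control paths and offline provisioning.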

English abstract

Embodied AI (EAI) mobile applications are evolving from auxiliary user interfaces into active control-path components, directly linking mobile-side cryptographic security to cyber-physical trust. Despite this shift, existing security research predominantly focuses on embodied AI devices and cloud infrastructures, leaving the mobile control layer largely unexplored as a critical attack surface. To bridge this gap, we present the first large-scale measurement study of cryptographic misuse within the EAI mobile ecosystem. We construct EAIAppZoo, a benchmark of 507 real-world applications across six EAI domains, and employ an automated semantic-aware analysis pipeline to measure the prevalence and characteristics of five major cryptographic failure modes. Our measurement yields 12,975 misuse findings (with an evaluated precision of 80.74%), revealing that these cryptographic failures are driven by EAI-specific engineering constraints rather than random developer errors. We uncover structural security trade-offs: latency-sensitive control paths systematically weaken transport protection, while the heavy reliance on offline device provisioning and legacy IoT SDKs exacerbates the local hardcoding of authentication credentials. Through real-world case studies, we demonstrate how these mobile-side cryptographic flaws bypass nominal network protections, enabling adversaries to intercept command channels and hijack the physical control of EAI entities. Ultimately, our findings highlight that mobile applications have become a fragile, yet overlooked, cryptographic trust boundary in cyber-physical systems.

2606.19969 2026-06-19 cs.DB cs.DC New submission

The Bi-Channel Networking Paradigm for Database Systems in the Cloud

Georg Kreuzmayr, Muhammad El-Hindi, Benjamin Wagner, Tobias Ziegler, Viktor Leis

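AI summary: To address the kernel TCP stack becoming a database performance bottleneck on modern high-speed cloud networks, proposes the bi-channel networking paradigm, which separates communication into a high-performance data path and a reliable control path. Combining user-space UDP with kernel TCP, it achieves high throughput and low overhead in a distributed shuffle and a replicated key-value store.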

Comments Accepted to EDBT 2027 (Lille, France)

English abstract

When network links were slow, cloud and distributed database systems could rely on generic kernel abstractions and treat network communication as a black box. With today's fast cloud networks, this approach breaks down: database performance becomes limited by the CPU overhead of the kernel TCP stack. Replacing TCP with user-space UDP can reduce this overhead, but it requires reimplementing essential guarantees, such as reliability and ordering. To solve this conundrum, database systems should no longer treat networking as a black box but co-design it with database operations. We propose the bi-channel paradigm for database systems, which separates communication into two channels: a high-performance data path for latency- and bandwidth-sensitive operations, and a reliable control path for coordination and recovery. We implement the paradigm by combining user-space UDP and kernel-based TCP, though other stack combinations are possible. This design exploits modern NIC capabilities while preserving TCP's reliability. We demonstrate the paradigm's efficiency and simplicity in two representative settings: a distributed shuffle saturating 200 Gbit/s with three CPU cores, and a replicated key-value store processing millions of messages per second.

2606.19968 2026-06-19 cs.GT New submission

Beyond Lower Quota: Avoiding Overrepresentation in Multi-Winner Voting

Anton Baychkov, Martin Lackner, Jan Maly, Oliviero Nardi, Jannik Peters

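AI summary: Proposes JUQ, an axiom for avoiding overrepresentation, introduces composite Thiele rules and characterizes Adams-AV as the rule in that class satisfying the axiom, and also proposes JNQ, an axiom that balances avoiding under- and overrepresentation.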

Comments This is an extended version of the publication with the same name in the proceedings of EC 2026

English abstract

Recently, in the social choice literature, much attention has been given to the question of avoiding underrepresentation in approval-based multi-winner voting. In this paper, we explore the largely overlooked complementary question of avoiding overrepresentation. This has not been explored systematically, despite being a desirable property with concrete applications. Intuitively, overrepresentation happens when a group determines a disproportionately large part of the committee, thereby exceeding the group's quota. We formulate a strong and appealing axiom for avoiding overrepresentation, called justifiable upper quota (JUQ). We introduce a generalization of Thiele rules, composite Thiele rules, and characterize the unique rule in this class satisfying our axiom. This rule, Adams-AV, which naturally extends Adams' apportionment method, has not been studied before. Additionally, we introduce a polynomial-time rule that satisfies JUQ. Furthermore, we introduce justified near quota, an axiom that balances avoiding under- and overrepresentation. It characterizes the unique Thiele rule extending the Sainte-Laguë apportionment method. Finally, we analyze the compatibility of our axioms with established proportionality notions such as EJR+.
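As background for the two apportionment methods named above, the following sketch implements classical divisor apportionment for party lists. It is not Adams-AV or any rule from the paper, which operate on approval ballots; it only illustrates the party-list methods those rules are said to extend. The function names and the example vote counts are assumptions chosen for illustration.

```python
from fractions import Fraction


def divisor_apportion(votes, seats, divisor):
    """Award seats one at a time to the party maximizing votes / divisor(seats_held)."""
    alloc = {party: 0 for party in votes}
    for _ in range(seats):
        def priority(party):
            d = divisor(alloc[party])
            # A zero divisor is treated as infinite priority.
            return (1, Fraction(0)) if d == 0 else (0, Fraction(votes[party], d))
        winner = max(votes, key=priority)  # ties go to the first party listed
        alloc[winner] += 1
    return alloc


def adams(seats_held):
    """Adams divisors 0, 1, 2, ...: every party gets a seat before anyone gets two."""
    return seats_held


def sainte_lague(seats_held):
    """Sainte-Lague divisors 1, 3, 5, ..."""
    return 2 * seats_held + 1


votes = {"A": 72, "B": 18, "C": 10}
adams_result = divisor_apportion(votes, 5, adams)
sainte_lague_result = divisor_apportion(votes, 5, sainte_lague)
```

With these votes and five seats, party A's exact quota is 3.6. Adams gives A three seats (allocation 3, 1, 1), favoring small parties and keeping A below its quota, while Sainte-Laguë gives A four seats (allocation 4, 1, 0).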

2606.19960 2026-06-19 cs.IR New submission

Stellar: Scalable Multimodal Document Retrieval for Natural Language Queries

Yuxiang Guo, Zhonghao Hu, Yuren Mao, Yuhang Liu, Congcong Ge, Xiaolu Zhang, Jun Zhou, Yunjun Gao

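AI summary: Proposes the Stellar framework, which stores token-level document embeddings on disk and dynamically loads candidate embeddings, combining lexical-representation-based filtering with efficient disk-backed late interaction to cut memory overhead and query latency by 1-2 orders of magnitude while preserving retrieval effectiveness.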

English abstract

Multimodal document retrieval, the task of selecting the most relevant multimodal document from a large corpus to answer a natural language query, plays an essential role in Retrieval-Augmented Generation (RAG) systems. State-of-the-art methods represent each document and query with multiple token-level embeddings and use late interaction to achieve high effectiveness. However, such multi-vector representations incur substantial memory overhead during retrieval, leading to poor scalability and hindering real-world deployment. In this paper, we present Stellar, a scalable multimodal document retrieval framework that stores token-level document embeddings on disk and loads only a small set of candidate embeddings into memory for late interaction. Stellar comprises two key components: (i) Lexical Representation-based Filtering (LRF), which fine-tunes a Multimodal Large Language Model (MLLM) as a sparse encoder to produce high-quality lexical representations, enabling efficient and effective document filtering to significantly reduce the candidate set; (ii) Efficient Disk-backed Late Interaction (DLI), which designs an on-disk token embedding storage layout guided by a balanced clustering algorithm, and dynamically loads only the necessary token embeddings into memory using a simple yet effective cost model. Extensive experiments on four real-world benchmarks and a newly presented large-scale dataset demonstrate that Stellar reduces memory overhead and query latency by 1-2 orders of magnitude compared to existing methods without compromising retrieval effectiveness.
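The filter-then-rerank flow described above can be sketched as follows. This is an illustrative sketch only, assuming the common MaxSim form of late interaction (sum over query tokens of the maximum dot product with any document token). The `load_doc_emb` callback stands in for a disk read and, like the other names here, is hypothetical. Stellar's actual storage layout, cost model, and filtering model are not reproduced.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_emb, doc_emb):
    """Late-interaction score: each query token takes its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)


def retrieve(query_emb, candidate_ids, load_doc_emb, top_k):
    """Score only the filtered candidates, loading their token embeddings on demand."""
    scored = [(maxsim(query_emb, load_doc_emb(doc_id)), doc_id) for doc_id in candidate_ids]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest score first, id as tie-break
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy "on-disk" store: document id -> list of token embeddings.
store = {
    "d1": [[1.0, 0.0], [0.0, 1.0]],
    "d2": [[0.5, 0.5]],
    "d3": [[0.0, 1.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

# Assume a first-stage filter has already narrowed the corpus to d1 and d2.
top = retrieve(query, ["d2", "d1"], store.__getitem__, top_k=1)
```

Only the candidates that survive filtering are ever loaded, which is where the memory saving in a disk-backed design would come from.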

2606.19957 2026-06-19 cs.CY New submission

Modest, artistic, and radical solutions to the environmental impact of image-generating machine learning

Laura U. Marks, Jess MacCormack, Kehui Li

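AI summary: Addresses the high energy consumption of image-generating ML, exploring solutions such as inexact computing, small models, and low-precision hardware from the perspectives of computer engineering, media studies, and art, and proposes a true-cost accounting.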

Comments Paper in Proceedings of LIMITS 2026: 12th Workshop on Computing within Limits, 2026-06-23-25, Online

English abstract

Machine learning is often touted to improve the efficiency of ICT, but that small gain is overwhelmed by the enormous carbon, water, and land footprints of data centers and ML-ready devices. We survey the electricity consumption of ML applications in training and inference, focusing on electricity-intensive image generation. Our team of a computer engineer, a media scholar, and an artist explores solutions including inexact computing; tiny language models; low-precision hardware architectures; hardware with limited capacity; and anticipating and mitigating energy demands at the design phase. We will sketch our work in progress on an ethical and aesthetically sophisticated tiny image generator using non-scraped data. Looking to the economic context, we will propose a true-cost accounting for the environmental impact of machine learning and suggest that the criterion of efficiency is driven by the shareholder-capitalist framing of ICT.

2606.19949 2026-06-19 cs.CG New submission

Semi-Automatic Correction of 3D Tubular Structure Skeletons via Component-Wise MST and Filtered Delaunay Triangulation

Ruoxuan Yang, Chuan Li

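AI summary: Proposes a semi-automatic method in which the user selects a source and a target point; component-wise minimum spanning trees and a filtered Delaunay triangulation are then combined to reconstruct a plausible centerline connection and correct topological artifacts in the skeleton.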

Comments Accepted at ACM ICMR 2026

Journal ref In Proceedings of the International Conference on Multimedia Retrieval (ICMR '26), June 16-19, 2026, Amsterdam, Netherlands. ACM, New York, NY, USA, 10 pages

English abstract

Skeletonization of tubular structures from 3D imaging is essential for tasks such as morphometric analysis, transport or flow simulation, and procedural planning in domains including vascular networks, plant root systems, and neural connectomes. However, automatic skeleton extraction often introduces topological artifacts, such as erroneous connections between nearby branches and fragmented centerlines caused by noise or missing data. Correcting these artifacts manually can be time-consuming and error-prone, especially when precise interaction is required. We present a semi-automatic correction method that reconstructs a plausible centerline connection from minimal user input. Given a user-selected source and target point, our method traces a path by combining (i) component-wise minimum spanning trees for stable local propagation and (ii) a filtered 3D Delaunay edge graph for bridging gaps and handling ambiguous junctions. Candidate steps are ranked using a score that accounts for direction continuity, spatial proximity, component consistency, and target-directed progress. The output is an ordered polyline (or edge sequence) that can be used as a suggested correction and integrated into downstream skeleton post-processing workflows. We implement the system in C++ with an interactive viewer based on Libigl and demonstrate representative qualitative results on brain vessel datasets, including correction of typical "crossing" and "dotted" artifacts. While our current validation is qualitative, the method is lightweight and serves as a practical building block toward more comprehensive interactive correction pipelines in biomedical imaging and related domains.

2606.19937 2026-06-19 cs.CR New submission

AutoTam: Specifying Secure Protocol Implementations with Tamarin Model Generation

Johannes Wilson, Mikael Asplund, Niklas Johansson

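AI summary: Proposes a language-first method that implements protocols in a domain-specific language and automatically generates Tamarin models, verifies trace properties and guarantees that they carry over to the implementation, and integrates symbolic execution to analyze memory safety. Security and interoperability are validated on a signed Diffie-Hellman protocol and WireGuard.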

Comments 19 pages, 5 figures

English abstract

Formal verification is a challenging but important task for ensuring the security of cryptographic protocols. While modern protocol verification tools significantly reduce verification effort, modelling remains challenging for practitioners without a background in formal verification. In addition, transferring verification results to a concrete protocol implementation requires expert knowledge. In this paper, we present a novel language-first method for verification of trace properties using a domain-specific language for protocol implementations. We target the Tamarin prover for verification, and we prove that verified universal trace properties translate back to the implementation. We additionally integrate symbolic execution in order to analyse the memory safety of protocol implementations. We use our tool to implement and generate accurate models for a signed Diffie-Hellman protocol, and for the WireGuard VPN protocol. Our WireGuard implementation is interoperable with existing implementations when using our interpreter, and achieves acceptable performance. We formally prove our implementations secure using a combination of symbolic execution and verification of the generated Tamarin models.

2606.19936 2026-06-19 cs.LO cs.MM New submission

Prismriver: Formalization of Music Theory and Algorithmic Composition in Lean 4

Leni Aniva, Claire Wang

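AI summary: Formalizes music theory in Lean 4, enabling verifiable algorithmic composition and accompaniment generation, and supports monadic analysis of musical structures.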

English abstract

Music theory obeys a rich set of mathematical rules and symmetries. These symmetries follow mathematical structure that can be verified and expressed in the precise language of a proof assistant. In this paper, we present Prismriver, a formalization of music theory in Lean 4. By formalizing music theory in Lean 4, we open the door to verifiable algorithmic composition and accompaniment generation. We also enable monadic analysis of structures in music.

2606.19931 2026-06-19 cs.MA New submission

Blame is easier than praise: Measuring off-ball defensive performance in football

Jonas Bischofberger, Runqing Ma, Pascal Bauer, Kilian Arnsmeyer, Arnold Baca

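AI summary: Proposes player involvement scores based on defensive pressure areas (DPAs) that attribute event-level changes in expected threat to individuals in order to measure off-ball defensive performance in football, and validates them on a cross-gender, cross-competition data set.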

English abstract

The defensive performance of football players is commonly measured through a limited number of actions like tackles and interceptions while their continuous impact through positional behaviour has hardly been studied before. We formulate this problem as an attribution over multi-agent spatiotemporal trajectories without player-level ground truth labels, where event-level changes of expected threat are distributed among individuals. We propose a framework that performs this attribution using player involvement scores calculated from defensive pressure areas (DPAs). By computing role-conditioned baselines within automatically detected team structures, we can determine each defender's expected responsibility for threat created through arbitrary passes. The validity and robustness of this approach are evaluated on a uniquely extensive cross-gender and cross-competition data set, including positional and event data from 64 matches of the men's World Cup, 116 matches of the women's German Bundesliga and 336 matches of the men's German 3. Liga. In the absence of a ground truth, we propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. We find a validity score that is improved by around 1 standard deviation compared to the best action-based metric and demonstrate that many popular measures show limited validity. The "blame" for conceding high-value actions shows especially strong correlations with external ratings and market values, making it the first published metric in football to reliably measure positioning errors. All code underlying this work is publicly available to support reproducibility and further research.

2606.19930 2026-06-19 cs.HC New submission

MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

Guangyi Liu, Pengxiang Zhao, Gao Wu, Yiwen Yin, Mading Li, Liang Liu, Congxiao Liu, Zhang Qi, Mengyan Wang, Liang Guo, Yong Liu

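AI summary: Proposes the MobileForge system, which uses the MobileGym environment for task generation and evaluation and Hierarchical Feedback-Guided Policy Optimization (HiFPO) to turn trajectory outcomes, step-level feedback, and corrective hints into step-level GRPO updates. This enables annotation-free adaptation of mobile GUI agents, reaching 67.2% Pass@3 on AndroidWorld.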

Comments Project page: https://mobile-forge.github.io/

English abstract

MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but adapting them to real target apps remains costly because mobile apps are numerous, frequently updated, and hard to cover with human-written tasks, demonstrations, or reward labels. Existing annotation-free GUI learning reduces manual supervision, yet lacks a unified substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback, while policy optimization often relies on isolated rollouts and coarse rewards that are hard to convert into reliable improvement signals. We present MobileForge, an annotation-free adaptation system for mobile GUI agents. MobileForge consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interaction, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which turns trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only automatically generated annotation-free adaptation data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld, close to the closed-data GUI-specialized GUI-Owl-1.5-8B base model at 69.0%. The MobileForge-adapted ForgeOwl-8B further reaches 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split, establishing the strongest open-data mobile GUI agent in our evaluation. Code, data, and trained models will be released at https://mobile-forge.github.io/.

2606.19926 2026-06-19 cs.HC New submission

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Guangyi Liu, Gao Wu, Congxiao Liu, Pengxiang Zhao, Liang Liu, Mading Li, Qi Zhang, Mengyan Wang, Liang Guo, Yong Liu

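AI summary: Proposes MemGUI-Agent, whose proactive context management mechanism (ConAct) treats context management as a first-class action, addressing prompt explosion and the dilution of critical information in long-horizon tasks, and achieves the best open-data 8B performance on MemGUI-Bench.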

Comments 33 pages, 6 figures. Project page: https://memgui-agent.github.io/

English abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.

2606.19918 2026-06-19 cs.ET New submission

A Novel FeFET Differential Bit-Cell With Hybrid Volatile and Non-Volatile Memory Modes

Jianze Wang, Wei Zhang, Xuanyao Fong

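AI summary: Proposes a 4T differential bit-cell made of cross-coupled FeFETs and access transistors that switches between volatile and non-volatile modes by adjusting the write conditions, needs no explicit backup and restore operations, and is smaller than a conventional 6T SRAM cell.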

English abstract

Non-volatile SRAM (nvSRAM) designs have been investigated to address the high leakage power of CMOS-based SRAM and the large write latency of emerging non-volatile memory (eNVM) technologies. However, prior nvSRAM designs that combine SRAM with eNVM devices typically require backup and restore (B&R) operations and incur significant cell-area overhead. Here, we propose a differential memory bit-cell consisting of a pair of cross-coupled ferroelectric field-effect transistors (FeFETs) and a pair of access transistors, resulting in a four-transistor (4T) structure, which is smaller than conventional 6T SRAM and many prior nvSRAM designs. The proposed bit-cell can be configured to operate in either volatile or non-volatile mode by adjusting the write conditions. In the non-volatile mode, the proposed nvSRAM achieves a store power of 0.13 μW with a 2 ns store time, and no explicit B&R operation is required. The proposed bit-cell can also be viewed as a cross-coupled gain cell, enabling further applications.

2606.19913 2026-06-19 cs.AR 新提交

Design and Evaluation of Energy-Efficient Whisper Dot-Product Kernel Offloading on a CGLA Architecture

在CGLA架构上设计并评估节能的Whisper点积内核卸载

Takuto Ando, Yu Eto, Ayumu Takeuchi, Yasuhiko Nakashima

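AI summary: Offloads the Whisper dot-product kernel to IMAX, a CGLA architecture, and tunes kernel mapping, local-memory sizing, and burst scheduling; on Whisper tiny the power-delay product (PDP) is 2.35x lower than on Jetson AGX Orin and 10.48x lower than on RTX 4090, offering a programmable architecture for low-power local speech recognition.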

Comments This paper is accepted at Concurrency and Computation: Practice and Experience (Wiley)

详情
AI中文摘要

在本文中,我们在IMAX(一种可编程的粗粒度线性阵列(CGLA)架构)上实现并评估了Whisper点积内核卸载。在ARM Cortex-A72上的性能分析显示,点积操作占FP16执行时间的90.6%和Q8_0执行时间的87.1%。为了解决这一内核瓶颈,我们结合了内核映射、本地内存大小调整和突发调度。该实现使用了内联FP16到FP32转换、64位数据路径上的2路SIMD FMA、列式多线程以及混合执行,其中对齐的向量段在IMAX上运行,剩余段在主机CPU上并发执行。我们通过FPGA原型和28nm ASIC投影(840MHz)评估了该设计。对于Whisper tiny,32KB本地内存和突发长度16共同最小化PDP和EDP。在基于TDP的跨平台比较中,投影的IMAX在Whisper tiny Q8_0上的PDP为11.58J,比Jetson AGX Orin(27.16J)低2.35倍,比RTX 4090(121.38J)低10.48倍。相同的设计扩展到Whisper base和Whisper small,但PDP差距缩小,因为32KB本地内存覆盖率从tiny的93.8%下降到base和small的约66.5%。这些结果表明,IMAX是一种在tiny模型范围内实现低PDP本地ASR的可编程架构。

英文摘要

In this paper, we implement and evaluate Whisper dot-product kernel offloading on IMAX, a programmable Coarse-Grained Linear Array (CGLA) architecture. Whisper-tiny.en profiling on an ARM Cortex-A72 shows that dot-product operations account for 90.6% of FP16 execution time and 87.1% of Q8_0 execution time. To address this kernel bottleneck, we combine kernel mapping, local-memory sizing, and burst scheduling. The implementation uses inline FP16-to-FP32 conversion, 2-way SIMD FMA on a 64-bit datapath, column-wise multithreading, and mixed execution in which aligned vector segments run on IMAX and residual segments run concurrently on the host CPU. We evaluate the design with an FPGA prototype and a 28nm ASIC projection at 840MHz. For Whisper-tiny.en, 32KB local memory and burst length 16 jointly minimize PDP and EDP. Under a TDP-based cross-platform comparison, the projected IMAX records a PDP of 11.58J for Whisper-tiny.en Q8_0, 2.35x lower than Jetson AGX Orin (27.16J) and 10.48x lower than RTX 4090 (121.38J). The same design extends to Whisper-base.en and Whisper-small.en, where the PDP gap narrows as 32KB local-memory coverage drops from 93.8% for tiny to about 66.5% for base and small. These results position IMAX as a programmable architecture for lower-PDP local ASR in the tiny-model regime.
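
The PDP ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below only re-derives the "x lower" factors from the three PDP values stated above; the dictionary keys and the helper name `times_lower` are illustrative choices, not names from the paper.

```python
# PDP (power-delay product) values in joules, copied from the abstract
# for Whisper-tiny.en Q8_0 under the TDP-based comparison.
PDP_JOULES = {
    "IMAX (28nm projection)": 11.58,
    "Jetson AGX Orin": 27.16,
    "RTX 4090": 121.38,
}


def times_lower(baseline_pdp: float, target_pdp: float) -> float:
    """Return how many times lower target_pdp is than baseline_pdp."""
    if target_pdp <= 0:
        raise ValueError("PDP must be positive")
    return baseline_pdp / target_pdp


imax = PDP_JOULES["IMAX (28nm projection)"]
orin_factor = round(times_lower(PDP_JOULES["Jetson AGX Orin"], imax), 2)
rtx_factor = round(times_lower(PDP_JOULES["RTX 4090"], imax), 2)

print(f"vs Jetson AGX Orin: {orin_factor}x lower")  # 2.35x
print(f"vs RTX 4090: {rtx_factor}x lower")          # 10.48x
```

Both factors match the figures reported in the abstract.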

2606.19904 2026-06-19 cs.SI 新提交

Toward Temporal Realism in City-Scale Crisis Response Simulation using LLM Agents

面向城市级危机响应模拟中时间真实性的LLM智能体方法

Anping Zhang, Yang Tan, Yuanbo Tang, Huaze Tang, Qiuhua Ye, Marta C. Gonzalez, Yang Li

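AI summary: Addresses the lack of temporal realism in LLM social simulation; using Shenzhen volunteering data that spans the pandemic, it proposes data-calibrated self-excitation and crisis-activation mechanisms that produce bursty timing and bring agents' temporal distributions close to the real ones.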

Comments 11 pages, 7 figures

详情
AI中文摘要

人类集体参与在时间上很少是稳定的:它是爆发性的,短时间的密集活动与长时间的安静间隔交替出现。在危机响应和社区动员中,预测人们何时行动与预测他们是否行动同样重要。这类场景越来越多地使用基于LLM的社会模拟器进行建模,然而这些模拟器的验证仅关注每个行动是否合理,而非行动的时间是否与现实一致。它们的时间真实性,即模拟活动再现真实人类系统爆发性、重尾时间分布的程度,因此仍未得到检验。我们利用深圳跨多年、城市规模的线下志愿活动日志(涵盖COVID-19疫情)来考察这一差距。实证上,我们确认爆发性时间在个体和跟踪群体层面普遍存在,且主要是内生性和自激的,并由疫情放大而非日常活动周期产生。一个标准的纯LLM模拟器几乎无法再现这种时间分布:其同步调度缺乏自激通道,因此智能体以近乎规律的时钟行动。基于这些发现,我们构建了一个模拟器,其中数据校准的自激通道和危机时期机制决定每个智能体何时行动,并仅在这些时刻查询LLM,由LLM决定加入哪个任务以及是否承诺。纯LLM基线未产生任何爆发性智能体(中位爆发性$B=-0.14$);单个数据校准的门控足以将每个智能体的时间分布提升至爆发阈值以上(中位$B\approx0.37$),且不降低LLM的内容决策质量。这些结果表明,基于LLM的危机响应模拟中,时间真实性的最佳实现方式是将智能体何时行动(由显式自激和危机激活机制控制)与做什么(由LLM控制)解耦。

英文摘要

Human collective participation is rarely steady in time: it is bursty, with short episodes of intense activity separated by long quiet intervals. In crisis response and community mobilization, predicting when people act matters as much as predicting whether they act. Such settings are increasingly modeled with LLM-based social simulators, yet these simulators are validated on whether each action is individually plausible, not on whether actions are timed as in reality. Their temporal realism, the degree to which simulated activity reproduces the bursty, heavy-tailed timing of real human systems, thus remains untested. We examine this gap using a multi-year, city-scale log of offline volunteering in Shenzhen that spans the COVID-19 pandemic. Empirically, we establish that bursty timing is common at individual and tracked-group levels, that it is largely endogenous and self-exciting, and that it is amplified by the pandemic rather than produced by daily activity cycles. A standard LLM-only simulator reproduces almost none of this timing: its synchronous schedule has no self-excitation channel, so agents act on a near-regular clock. Guided by these findings, we build a simulator in which a data-calibrated self-excitation channel and a crisis-period regime decide when each agent acts and query the LLM only at those moments, leaving it to decide which task to join and whether to commit. The LLM-only baseline yields no bursty agents (median burstiness $B=-0.14$); a single data-calibrated gate is then sufficient to lift per-agent timing above the burst threshold (median $B\approx0.37$) without degrading LLM content decisions. These results indicate that temporal realism in LLM-based crisis-response simulation is best achieved by decoupling when agents act, governed by an explicit self-excitation and crisis-activation mechanism, from what they do, governed by the LLM.
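
The abstract reports a burstiness value $B$ without giving its definition. As a rough illustration, the sketch below assumes the widely used Goh-Barabási coefficient $B = (\sigma - \mu)/(\sigma + \mu)$ over inter-event times, which ranges from -1 (perfectly regular) through 0 (Poisson-like) toward 1 (highly bursty). Whether the paper uses exactly this form, and what its burst threshold is, are not stated here, so treat both as open.

```python
import statistics
from typing import Iterable


def burstiness(event_times: Iterable[float]) -> float:
    """Burstiness coefficient B = (sigma - mu) / (sigma + mu).

    mu and sigma are the mean and (population) standard deviation of the
    gaps between consecutive events. Assumed definition, not taken from
    the abstract.
    """
    times = sorted(event_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mu = statistics.fmean(gaps)
    sigma = statistics.pstdev(gaps)
    if mu + sigma == 0:
        raise ValueError("all events share one timestamp")
    return (sigma - mu) / (sigma + mu)


# An agent acting on a fixed clock: every gap is identical, so B = -1.
regular = burstiness(range(10))

# Short clusters of activity separated by long quiet intervals: B > 0.
bursty = burstiness([0, 1, 2, 3, 100, 101, 102, 103, 200])

print(regular, round(bursty, 3))
```

Under this definition, a synchronous schedule like the LLM-only baseline naturally lands at negative $B$, which is consistent with the reported median of -0.14.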

2606.19898 2026-06-19 cs.DB cs.IR 新提交

Query-aware Routing for Filtered Approximate Nearest Neighbors Search

面向过滤近似最近邻搜索的查询感知路由

Qianqian Xiong, Mengxuan Zhang

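AI summary: Proposes a query-aware routing framework in which a lightweight ML model predicts each candidate method's recall and an offline benchmark table is used to pick the best recall-QPS trade-off, reaching state-of-the-art performance on five unseen datasets.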

Comments 12 pages

详情
AI中文摘要

过滤ANN搜索结合向量相似性与属性谓词,是现代向量数据库和检索增强生成中的核心原语。我们在多个数据集上对三种谓词下的所有主要分类过滤ANN方法进行基准测试,发现没有单一方法占主导地位。此外,即使在单个数据集和谓词类型内,查询的最佳方法也可能不同。因此,我们提出了一种查询感知路由框架。轻量级ML模型预测每个候选方法在查询上的召回率,路由器查阅离线基准表(该表将每种方法和参数设置映射到其测量的召回率和QPS),然后选择具有最佳召回-QPS权衡的方法。我们的消融研究将22个候选特征缩减为最小的三个特征集,并采用回归而非分类作为预测目标以提高准确性。我们的模型在六个真实世界数据集上训练,并应用于五个未见过的验证数据集。最终结果表明,与现有的过滤ANN基线相比,我们的路由器在所有五个验证数据集上实现了最先进的召回率和QPS平衡,同时引入了可忽略的延迟开销。

英文摘要

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall-QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three, and we adopt regression rather than classification as the prediction target to improve accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
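
The routing step can be pictured with a small sketch. The abstract does not spell out how the predicted recall and the benchmark table are combined, so the rule below is an assumption: keep settings whose measured recall and per-query predicted recall both meet a target, then take the highest QPS. The method names, the `Setting` type, and the `route` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Setting:
    """One row of the offline benchmark table (hypothetical layout)."""
    method: str
    params: str
    recall: float  # measured offline
    qps: float     # measured offline


def route(predicted_recall: Dict[str, float],
          table: List[Setting],
          target_recall: float) -> Setting:
    """Pick the fastest setting expected to meet target_recall.

    Assumed rule: a setting is feasible if its offline recall and the
    model's per-query recall prediction for its method both reach the
    target. If nothing is feasible, fall back to the highest-recall
    setting of the method with the best predicted recall.
    """
    feasible = [
        s for s in table
        if s.recall >= target_recall
        and predicted_recall.get(s.method, 0.0) >= target_recall
    ]
    if feasible:
        return max(feasible, key=lambda s: s.qps)
    best_method = max(predicted_recall, key=predicted_recall.get)
    candidates = [s for s in table if s.method == best_method] or table
    return max(candidates, key=lambda s: s.recall)


# Hypothetical table and predictions for a single query.
table = [
    Setting("pre_filter", "exact", recall=1.00, qps=200.0),
    Setting("post_filter", "ef=64", recall=0.92, qps=3000.0),
    Setting("graph_filter", "ef=128", recall=0.96, qps=1500.0),
]
predicted = {"pre_filter": 0.99, "post_filter": 0.70, "graph_filter": 0.95}

choice = route(predicted, table, target_recall=0.90)
fallback = route(predicted, table, target_recall=0.999)
print(choice.method, fallback.method)
```

Here `post_filter` is the fastest setting in the table, but the per-query prediction rules it out, which is the situation the abstract describes when it notes that the best method varies from query to query.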

2606.19890 2026-06-19 cs.CY 新提交

Open Weight AI Models Require Proportional Evaluation Approaches

开放权重AI模型需要比例评估方法

Patricia Paskov, Christopher Rodriguez, Sunishchal Dev, Stephen Casper

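AI summary: Targeting the distinct risk factors of open-weight AI models (OWMs), the paper proposes four proportional evaluation approaches (PE1-PE4) and systematically reviews 37 OWM families released from 2025 through April 2026, finding that only one fulfills all four.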

详情
AI中文摘要

开放权重AI模型(OWMs),即公开发布权重的模型,正在快速分发,并接近领先的封闭权重AI模型(CWMs)的性能水平。虽然OWMs带来了巨大的科学和经济利益,但它们的发布引入了独特的风险因素,而现有的评估实践(主要针对CWM部署设计)未能考虑这些因素。在本文中,我们认为这些风险因素需要不同的比例评估(PE)方法:在没有系统级保障的情况下进行评估(PE1),评估对消除模型级保障的修改的鲁棒性(PE2),测试选择性能力增强(PE3),以及代理最坏情况下的滥用(PE4)。我们系统审查了2025年至2026年4月期间发布的OWMs的当前评估实践,发现所审查的37个模型系列中只有一个满足PE1-4,大多数不满足任何一项。本文面向参与AI评估的政策制定者、资助者和研究人员。随着OWMs能力日益增强,其评估值得开发者、资助者和治理机构密切关注。

英文摘要

Open-weight AI models (OWMs), or models released with publicly available weights, are spreading rapidly and approaching the performance levels of leading closed-weight AI models (CWMs). While OWMs offer substantial scientific and economic benefits, their release introduces distinct risk factors for which existing evaluation practices, largely designed for CWM deployment, fail to account. In this paper, we argue that these risk factors demand distinct proportional evaluation (PE) approaches: evaluating without system-level safeguards (PE1), assessing robustness to modifications that undo model-level safeguards (PE2), testing selective capability amplification (PE3), and proxying worst-case misuse (PE4). We systematically review current evaluation practices of OWMs released in 2025 through April 2026, finding that only one of the 37 families of models reviewed fulfills PE1-4 and most do not fulfill any. This paper targets policymakers, funders, and researchers involved in AI evaluation. As OWMs grow increasingly capable, their evaluation warrants close attention from developers, funders, and governance bodies alike.

2606.19869 2026-06-19 cs.DC 新提交

EVM Workloads in the Wild: Evidence for Multi-Dimensional Gas Metering, State Growth, Delayed Execution, and Parallelism

现实中的EVM工作负载:多维Gas计量、状态增长、延迟执行和并行性的证据

Lioba Heimbach, Kushal Babel, Jason Milionis

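AI summary: Analyzing 2025 block traces from Ethereum L1 and Base L2, the study finds that the resource mix is unstable, state growth is underpriced, and execution outcomes are sensitive to historical state, providing an empirical basis for multi-dimensional gas metering and explicit pricing of state growth.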

详情
AI中文摘要

EVM兼容区块链上的Gas计量假设执行条件是稳定的:资源组合足够恒定,可以将执行成本合并为具有固定相对价格的单一标量,并且提交与执行之间的状态漂移不会实质性改变交易结果。我们衡量了这一假设失败的程度。我们呈现了2025年全年以太坊(L1)和Base(L2)上EVM工作负载的追踪级测量研究,每条链每天采样3000个区块。我们将每笔交易分解为操作码级执行Gas、固有Gas、退款和持久状态增量。为测量状态敏感性,我们在旧状态上重新执行2025年9月的交易,并记录Gas使用和存储访问模式的变化。我们发现资源组合远非稳定:在Base上,存储读取和计算分别占执行Gas的29.2%和24.3%,而以太坊将34.9%用于存储写入。以太坊在2025年Gas上限翻倍,使其自身配置转向更重计算、类似Base的模式。Base还表现出更高比例的冷存储读取(49.7%对以太坊的39.6%)。持久状态增长(一种被定价为临时成本的永久成本)在Base上达到456 GB,而在以太坊上为38 GB。执行结果同样不稳定:在Base上,46.0%的交易在附近历史状态间的Gas估算存在差异,而以太坊为13.9%,MEV和DeFi活动的敏感性尤其高。存储访问模式在不同状态间也存在差异,限制了访问列表的有效性并使并行执行复杂化。我们的工作为多维Gas计量和状态增长的显式定价提供了实证基础。研究表明,状态敏感的执行行为使工作负载估算复杂化,直接影响交易可预测性和用户体验。

英文摘要

Gas metering on EVM-compatible blockchains assumes that execution conditions are stable: that the resource mix is constant enough to justify collapsing execution costs into a single scalar with fixed relative prices, and that state drift between submission and execution does not materially alter a transaction's outcome. We measure the extent to which this assumption fails. We present a trace-level measurement study of EVM workloads on Ethereum (L1) and Base (L2) throughout 2025, sampling 3,000 blocks per day per chain. We decompose each transaction into opcode-level execution gas, intrinsic gas, refunds, and persistent state deltas. To measure state sensitivity, we re-execute transactions from September 2025 on older states and record how gas usage and storage access patterns change. We find the resource mix to be far from stable: on Base, storage reads and compute account for 29.2% and 24.3% of execution gas, while Ethereum devotes 34.9% to storage writes. Ethereum's gas limit doubling during 2025 shifted its own profile toward compute-heavier, Base-like patterns. Base also exhibits a higher fraction of cold storage reads (49.7% versus 39.6% on Ethereum). Persistent state growth, a permanent cost priced as a transient one, reaches 456 GB on Base versus 38 GB on Ethereum. Execution outcomes are equally unstable: gas estimates vary across nearby historical states for 46.0% of transactions on Base, compared to 13.9% on Ethereum, with especially high sensitivity for MEV and DeFi activity. Storage access patterns also diverge across states, limiting the effectiveness of access lists and complicating parallel execution. Our work provides an empirical foundation for multi-dimensional gas metering and explicit pricing of state growth. The results also show that state-sensitive execution behavior complicates workload estimation, directly affecting transaction predictability and user experience.

2606.19866 2026-06-19 cs.CR 新提交

Low-Cost Multi-Precision Systolic Arrays for Accelerating FHE NTTs on AI ASICs

低成本多精度脉动阵列用于在AI ASIC上加速FHE NTT

George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos

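AI summary: To address the performance bottleneck that precision mismatch causes for FHE on AI hardware, the paper proposes a minimally modified multi-precision systolic array that performs full-precision output reconstruction natively under a uniform dataflow, achieving at least a 1.33x speedup.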

详情
AI中文摘要

全同态加密(FHE)确保了强大的数据隐私,但面临难以承受的计算开销。在AI硬件(如张量处理单元TPU)上加速FHE很有前景,但受到精度不匹配的根本限制:TPU针对8位算术优化,而FHE及其关键部分(如数论变换NTT)需要高精度。当前方法通过矩阵分解在低精度矩阵引擎上执行NTT计算来弥合这一差距。然而,重建全精度结果需要移位加累加,这与矩阵乘法的数据流不匹配。这迫使将全精度重建从矩阵引擎卸载到向量处理器,破坏了矩阵乘法数据流,造成显著的性能瓶颈。为解决这一限制,我们提出一种最小修改的多精度脉动阵列,在统一数据流下,与低精度矩阵乘法同步,在阵列内部原生执行全精度输出重建。使用OpenRoad在7nm工艺下综合,我们的设计硬件开销可忽略不计。使用SCALE-Sim的周期精确模拟表明,在128x128矩阵引擎上,对于2^12到2^16的变换大小,在所提出的架构上原生执行NTT可实现至少1.33倍的加速,成功使标准AI硬件支持高精度FHE加速。

英文摘要

Fully Homomorphic Encryption (FHE) ensures robust data privacy but suffers from prohibitive computational overhead. Accelerating FHE on AI hardware like Tensor Processing Units (TPUs) is promising, yet fundamentally limited by a precision mismatch: TPUs are optimized for 8-bit arithmetic, whereas FHE and its critical parts, such as the Number Theoretic Transform (NTT), demand high precision. Current approaches bridge this gap using matrix decomposition to execute NTT computations on low-precision matrix engines. However, reconstructing the full-precision results requires shift-and-add accumulation that does not match the dataflow of matrix multiplication. This forces full-precision reconstruction to be offloaded from the matrix engines to vector processors, which disrupts the matrix multiplication dataflow and creates a significant performance bottleneck. To resolve this limitation, we propose a minimally modified multi-precision systolic array that performs full-precision output reconstruction natively within the array in sync with low-precision matrix multiplication under a uniform dataflow. Synthesized at 7nm with OpenRoad, our design incurs negligible hardware overhead. Cycle-accurate simulations using SCALE-Sim demonstrate that natively executing NTTs on the proposed architecture achieves at least a 1.33x speedup for transform sizes 2^12 to 2^16 on 128x128 matrix engines, enabling standard AI hardware to support high-precision FHE acceleration.
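
The decomposition and shift-and-add reconstruction described above can be illustrated in software. This is a minimal sketch of the general idea, not the paper's hardware design: it assumes 8-bit limbs, four limbs per operand (so values below 2^32), and an arbitrary example modulus. The names `to_limbs` and `limb_dot_mod` are hypothetical.

```python
from typing import List, Sequence

LIMB_BITS = 8                      # assumed low-precision width
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value: int, n_limbs: int) -> List[int]:
    """Split a non-negative integer into n_limbs 8-bit limbs, lowest first."""
    if value < 0 or value >> (LIMB_BITS * n_limbs):
        raise ValueError("value does not fit in the requested limbs")
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]


def limb_dot_mod(a: Sequence[int], b: Sequence[int], q: int,
                 n_limbs: int = 4) -> int:
    """Dot product mod q built only from 8-bit x 8-bit products.

    Each (i, j) limb pair yields one low-precision dot product, the kind of
    work a matrix engine handles well. The shift-and-add loop then restores
    full precision by weighting each partial result with 2^(8 * (i + j)).
    """
    a_limbs = [to_limbs(x, n_limbs) for x in a]
    b_limbs = [to_limbs(y, n_limbs) for y in b]
    total = 0
    for i in range(n_limbs):
        for j in range(n_limbs):
            partial = sum(al[i] * bl[j] for al, bl in zip(a_limbs, b_limbs))
            total += partial << (LIMB_BITS * (i + j))  # shift-and-add
    return total % q


q = 2013265921  # example 31-bit modulus, chosen only for illustration
a = [123456789, 2013265920, 42]
b = [987654321, 1999999999, 7]

reference = sum(x * y for x, y in zip(a, b)) % q
decomposed = limb_dot_mod(a, b, q)
print(decomposed == reference)
```

The inner `sum` is the part that maps onto low-precision matrix multiplication, while the shifted accumulation is the step the abstract says current approaches push to vector processors.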

2606.19861 2026-06-19 cs.NE 新提交

Weight Adaptation for Improving Parallel Performance of Adaptive Stochastic Natural Gradient

权重自适应提升自适应随机自然梯度的并行性能

Yutaro Yamada, Kento Uchida, Shinichi Shirakawa

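AI summary: Proposes WA-ASNG, which adapts its weight parameters by gradient ascent to maximize the natural-gradient signal; it outperforms PBIL and ASNG on binary optimization problems and handles strong noise effectively.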

Comments Accepted at EvoCOP 2026 (Part of EvoStar 2026)

详情
AI中文摘要

基于概率模型的进化算法在黑箱优化中很有前景。具体来说,自适应随机自然梯度(ASNG)自适应地更新其学习率(概率模型进化算法中的典型超参数),从而实现高效且鲁棒的优化。尽管权重参数是常见的超参数,但随着对耗时任务并行评估需求的增加,如何为更大的种群规模设置合适的权重仍不清楚。在本文中,我们提出了权重自适应ASNG(WA-ASNG),它将权重自适应机制融入ASNG。我们从自然梯度的累积中计算更新方向的估计信号。然后,为了最大化该信号,WA-ASNG通过优化上的梯度上升自适应地更新其权重参数。学习率自适应在满足预期目标值单调改进的充分条件方面发挥作用,而权重自适应机制旨在最大化这种改进。实验结果表明,在二进制优化问题中,种群规模从25到100的各种设置下,WA-ASNG优于PBIL和ASNG。此外,WA-ASNG在存在强噪声的情况下也能高效运行。我们的代码可在 https://github.com/shiralab/WA-ASNG 获取。

英文摘要

Probabilistic model-based evolutionary algorithms are promising for black-box optimization. Specifically, the adaptive stochastic natural gradient (ASNG) adaptively updates its learning rate, a typical hyperparameter in probabilistic model-based evolutionary algorithms, thereby realizing efficient and robust optimization. Although weight parameters are common hyperparameters, with the increasing demand for parallel evaluation of time-consuming tasks, it remains unclear how to set suitable weights for larger population sizes. In this paper, we propose Weight Adaptation ASNG (WA-ASNG), which incorporates a weight adaptation mechanism into ASNG. We calculate the estimated signal of the update direction from the accumulations of the natural gradient. Then, to maximize the signal, WA-ASNG adaptively updates its weight parameters by gradient ascent over the course of the optimization. While the learning rate adaptation plays a role in satisfying a sufficient condition for monotonic improvement of the expected objective value, the mechanism of weight adaptation is intended to maximize this improvement. The experimental results demonstrate that WA-ASNG outperforms PBIL and ASNG across various settings with population sizes ranging from 25 to 100 for binary optimization problems. Furthermore, WA-ASNG can perform efficiently in the presence of strong noise. Our code is available at https://github.com/shiralab/WA-ASNG.

2606.19826 2026-06-19 cs.CR cs.MA 新提交

Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience

对抗性同伴下的异构LLM辩论:诚实增益、替代成本与韧性

Prashanti Nilayam, Kiran Kumar Ramanna, Prashil Tumbade, Sankalp Nayak

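AI summary: Studies how honest and adversarial peers affect revision behavior in heterogeneous LLM debate; an honest peer lowers the harmful-revision rate, an adversarial peer reverses that effect, and heterogeneity also acts as a defense when an adversary is already present.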

详情
AI中文摘要

异构LLM辩论的动机在于,多样化的同伴可以相互纠正,但同样的交流既携带纠正也携带对抗性影响。我们通过跟踪异构同伴如何改变诚实智能体的修正行为来衡量哪种影响占主导:他们改变答案的频率,以及这种改变是纠正性的还是有害的。我们比较了匹配面板(同质基线、诚实混合和对抗混合)以及受污染面板(其中已存在一个恶意的同族同伴),涵盖四个模型家族和三个推理基准。一个诚实的异构同伴显著降低了有害修正,而对抗性同伴则逆转了这一效果。对于Llama-3.1-70B防御者在MATH-hard上,诚实插槽的有害修正率从同质面板的89%下降到有诚实同伴时的35%,而对抗性同伴使其回到90%。条件率对弱防御者隐藏了这种损害,但辩论结束时的翻转率暴露了它。该模式在家族和基准上保持符号一致,而其幅度随防御者-基准机制变化。我们还测量了当已存在一个对抗性同族同伴时的效果:一个诚实的异构同伴降低了有害修正率以及最初正确答案丢失的比率。在相同的Llama-3.1-70B设置下,添加的诚实同伴将最初正确项上的翻转率从同族对手下的31%降至6%。因此,异构性不仅是一个攻击面,而且当对手已经存在时,也是一种防御。

英文摘要

Heterogeneous LLM debate is motivated by the promise that diverse peers correct one another, but the same exchange that carries correction also carries adversarial influence. We measure which dominates by tracking how a heterogeneous peer changes the honest agents' revision behavior: how often they change their answer, and whether the change is corrective or harmful. We compare matched panels (homogeneous baseline, honest-mixed, and adversarial-mixed) and contaminated panels in which a malicious same-family peer is already present, spanning four model families and three reasoning benchmarks. An honest heterogeneous peer sharply lowers harmful revision, and an adversarial one reverses it. For Llama-3.1-70B defenders on MATH-hard, the honest-slot harmful-revision rate falls from 89% in the homogeneous panel to 35% with an honest peer, and an adversarial peer returns it to 90%. The conditional rate hides this damage on weak defenders, but the end-of-debate flip rate exposes it. The pattern keeps its sign across families and benchmarks while its magnitude varies with the defender-benchmark regime. We also measure the effects when an adversarial same-family peer is already present: an honest heterogeneous peer lowers both harmful revision and the rate at which initially-correct answers are lost. On the same Llama-3.1-70B setting, the added honest peer cuts the flip rate on initially-correct items from 31% under a same-family adversary to 6%. Heterogeneity is therefore not only an attack surface but, when an adversary is already present, also a defense.

2606.19822 2026-06-19 cs.FL 新提交

Learning Alternating Real-Time Automata

学习交替实时自动机

Kazuki Kinoshita, Masaki Waga

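AI summary: Proposes the AL*RTA algorithm, which combines AL* and NL*RTA to learn alternating real-time automata from membership and equivalence queries; experiments indicate that it learns smaller automata than NL*RTA but uses more queries.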

Comments Accepted to QEST+FORMATS 2026

详情
AI中文摘要

我们提出了AL*RTA算法,用于通过成员查询和等价查询学习交替实时自动机(ARTA)。AL*RTA结合了用于学习交替有限自动机的AL*和用于学习非确定性实时自动机的NL*RTA的思想。我们首先定义ARTA,并表明交替提高了简洁性,尽管它没有增加表达能力。然后我们提出AL*RTA并证明其终止性。我们的实证评估表明,AL*RTA通常比NL*RTA学到更小的自动机,但代价是更多的查询。

英文摘要

We present the AL*RTA algorithm for learning alternating real-time automata (ARTAs) using membership and equivalence queries. AL*RTA combines ideas from AL* for learning alternating finite automata and NL*RTA for learning nondeterministic real-time automata. We first define ARTAs and show that alternation improves succinctness, although it does not increase expressive power. We then present AL*RTA and show its termination. Our empirical evaluation suggests that AL*RTA generally learns smaller automata than NL*RTA at the cost of more queries.

2606.19816 2026-06-19 cs.CY 新提交

Challenges to Grassroots Organization Engagement with AI Policy

基层组织开展AI政策参与的挑战

Carter Buckner, Jennifer Mickel, Nandhini Swaminathan, William Agnew, Jacob Hobbs, Sarthak Arora, Michelle Lin, Yanan Long, B. V. Alaka

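AI summary: Through a case study, the paper examines the challenges that grassroots organizations and marginalized communities face in engaging with AI policymaking and offers recommendations grounded in participatory design.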

Comments To appear at ACM FAccT 2026

详情
AI中文摘要

世界各地正在制定公共政策,以应对AI技术带来的隐私、经济、知识产权、能源及其他风险。公众参与作为问责和对齐机制,对治理至关重要。然而,对于缺乏广泛网络、游说能力及其他权力形式的公众群体来说,参与并影响政策制定可能具有挑战性。这一挑战对边缘化社区尤为严峻。本文通过我们组织将参与式设计(PD)原则引入美国AI政策制定的努力进行案例研究。我们描述了与多个美国政策机构的互动,以及为性少数群体参与式开发AI政策的过程。我们强调了与边缘化社区进行PD实践中的挑战,并提出了缓解这些挑战的建议。最后,我们为政策制定者及其他在边缘化社区工作的组织者提供了可行的建议。

英文摘要

Public policies are being developed around the world to address privacy, economic, intellectual property, energy, and other risks that AI technologies pose. Involvement from the general public is essential to governance as an accountability and alignment mechanism. However, participating in and impacting policymaking can be challenging for sections of the public that lack extensive networks, lobbying capabilities, and other forms of power. This challenge is especially acute for marginalized communities. In this paper, we present a case study of our organization's efforts to bring participatory design (PD) principles to AI policymaking in the US. We describe our engagements with several US policy bodies, and our participatory development of AI policy for queer people. We highlight challenges with PD practice with marginalized communities, and offer suggestions to alleviate them. We conclude with actionable recommendations for policymakers and other organizers working in marginalized communities.

2606.19814 2026-06-19 cs.SE 新提交

CoRaCommit: A VS Code Extension for Commit Message Generation with Exemplar Retrieval

CoRaCommit: 一种基于范例检索的提交消息生成的 VS Code 扩展

Chaoran Cai, Bo Xiong, Chong Wang, Lulu He, Peng Liang

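AI summary: Presents CoRaCommit, a VS Code extension that retrieves similar commit exemplars as prompt context, invokes multiple LLMs in parallel to generate candidate messages, and dynamically recommends LLMs based on user feedback; it outperforms existing extensions on the ApacheCM dataset.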

Comments 17 pages, 6 images, 3 tables, Manuscript submitted to a Journal (2026)

详情
AI中文摘要

提交消息是描述代码变更意图的关键文本制品,在版本控制、代码审查和历史追踪中扮演重要角色。然而,实践中提交消息主要由人工编写,耗时且常导致质量不一致和表达不统一。现有的用于提交消息生成的 VS Code 扩展通常直接基于代码差异调用大语言模型,而不利用相似提交范例作为参考,且很少支持用户反馈驱动的大语言模型推荐。为解决这些局限,本文提出 CoRaCommit,一种 VS Code 扩展,通过检索相似提交范例作为提示上下文、并行调用多个大语言模型进行候选提交消息比较,并基于用户反馈动态推荐大语言模型,从而增强提交消息生成。在 ApacheCM 数据集的 945 个提交上的实验结果表明,CoRaCommit 在 BLEU、CIDEr、METEOR 和 ROUGE-L 指标上优于现有 VS Code 扩展,证明了检索增强上下文对提交消息生成的有效性。

英文摘要

Commit messages are essential textual artifacts that describe the intent behind code changes, and play a critical role in version control, code review, and historical tracking. However, in practice, commit messages are primarily authored manually, which is time-consuming and often results in inconsistent quality and non-uniform expression. Existing VS Code extensions for commit message generation typically directly invoke large language models based on the code diff, without leveraging similar commit exemplars as references, and rarely support user feedback-driven LLM recommendation. To address these limitations, this paper presents CoRaCommit, a VS Code extension that enhances commit message generation by retrieving similar commit exemplars as prompt context, invoking multiple LLMs in parallel for candidate commit message comparison, and dynamically recommending LLMs based on user feedback. Experimental results on 945 commits from the ApacheCM dataset show that CoRaCommit outperforms existing VS Code extensions across BLEU, CIDEr, METEOR, and ROUGE-L metrics, demonstrating the effectiveness of retrieval-augmented context for commit message generation.

2606.19807 2026-06-19 cs.CR 新提交

DISARM: Target Electronic Device Informed Mitigation of Software Runtime Side-Channel Vulnerabilities

DISARM:目标电子设备知情的软件运行时侧信道漏洞缓解

Tasneem Suha, Tanzim Mahfuz, Rima Asmar Awad, Prabuddha Chakraborty

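AI summary: Proposes DISARM, which uses timing values from real embedded devices to generate targeted software fixes that mitigate runtime side-channel vulnerabilities, outperforming existing solutions on five different devices.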

详情
AI中文摘要

程序运行时或时序攻击利用程序执行时间的变化来提取敏感信息(如加密密钥、敏感变量数据、知识产权)。针对运行时侧信道攻击的最新解决方案试图平衡不同控制流路径下敏感代码的执行时间,以消除时序泄漏。然而,在缓解过程中,大多数技术未考虑目标程序运行的底层硬件或设备。这可能导致过度修复(不必要的额外操作)、修复不足(未正确解决不平衡)甚至失败。我们提出DISARM,一种联合硬件-软件方法(不同于任何现有解决方案),用于缓解运行时侧信道漏洞,该方法利用真实嵌入式设备的时序值生成针对性的软件修复。我们实现了DISARM以支持C、C++和Java源代码,并在22个标准基准测试上进行验证。在五个不同的嵌入式或边缘设备上,DISARM在执行时间开销、代码大小开销和正确性方面均优于现有解决方案如PENDULUM和DifFuzzAR。

英文摘要

Program runtime or timing attacks exploit variations in a program's execution times to extract sensitive information from the program (e.g. encryption keys, sensitive variable data, intellectual property). State-of-the-art solutions to runtime side-channel attacks attempt to balance the execution time of the sensitive code for different control flow paths to eliminate the timing leakage. However, during the mitigation process, most techniques do not consider the underlying hardware or device on which the target program is supposed to run. This can lead to over-fixing (unnecessary extra operations), under-fixing (not solving the imbalance properly), and even failures. We propose DISARM, a joint hardware-software methodology (unlike any existing solution) for mitigating runtime side-channel vulnerabilities that utilizes timing values from real embedded devices to generate targeted software fixes. We implement DISARM to support C, C++, and Java source code and validate it across 22 standard benchmarks. DISARM outperforms state-of-the-art solutions such as PENDULUM and DifFuzzAR in terms of execution time overhead, code size overhead, and correctness on five different embedded or edge devices.

2606.19790 2026-06-19 cs.CE 新提交

The Orchestration Gap: Why Process Automation Stalls in Operationally Complex Industries

编排鸿沟:为何流程自动化在操作复杂行业中停滞不前

Jiechao Gao, Yuandong Pan, Yuangang Li, Jie Wang, Kincho Law, Michael Lepech

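AI summary: Introduces the concept of the "orchestration gap", analyzes why today's multi-agent systems fall short in automating complex sectors such as logistics and healthcare, and outlines a staged automation path built on constraint enforcement and explainability.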

详情
AI中文摘要

智能体系统在数字原生任务上进展迅速,但几乎未触及那些协调自动化可能最重要的行业:物流、医疗运营、建筑以及许多工作分散在不兼容工具和众多参与者中的领域。我们认为原因是缺少一种抽象。在这些场景中,价值并非来自单个有能力的模型调用,而是来自编排——协调多步骤工作流、强制执行硬领域约束、管理人工审批并桥接遗留系统的运行时。我们将这一思想发展成一个可用的概念框架。我们给出了一个操作性测试来识别哪些工作流受限于编排,一种分解方法将工作流的混乱程度与其协调工作量及价值分离,以及一个特征层面的解释说明为何当今的多智能体框架留下了一个特定鸿沟。然后我们提出核心主张:正确的自动化路径是分阶段的,而哪种架构保证最重要取决于一个行业的主要摩擦来源。在监管摩擦下,约束执行是承重关键;在责任摩擦下,可解释性是承重关键。我们以这一观点所暗示的研究计划作为结尾。

英文摘要

Agentic systems have advanced quickly on digitally native tasks, yet they have barely touched the industries where coordinated automation could matter most: logistics, healthcare operations, construction, and the many sectors whose work is spread across incompatible tools and many hands. We argue that the reason is a missing abstraction. The value in these settings does not come from a single capable model invocation; it comes from orchestration, the runtime that coordinates multi-step workflows, enforces hard domain constraints, manages human approval, and bridges legacy systems. We develop this idea into a usable conceptual frame. We give an operational test for which workflows are orchestration-bound, a decomposition that separates how tangled a workflow is from how much of its effort is coordination and what that coordination is worth, and a feature-level account of why today's multi-agent frameworks leave a specific gap. We then advance our central claim: the right automation path is staged, and which architectural guarantee carries the most weight depends on a sector's dominant source of friction. Constraint enforcement is load-bearing under regulatory friction; explainability is load-bearing under liability friction. We close with the research program this view implies.

2606.19775 2026-06-19 cs.SI stat.AP stat.OT 新提交

Rethinking Sampling Strategy in Link Prediction

重新思考链接预测中的采样策略

Yilin Bi, Zhenyu Deng, Xinshan Jiao, Tao Zhou

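AI summary: Proposes the β-sampling scheme to study how two-stage sampling affects link-prediction performance, finding that the structural characteristics of missing links substantially affect prediction accuracy and that the second-stage sampling strategy matters.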

Comments 19 pages, 5 figures, 3 tables

详情
AI中文摘要

许多现实世界的网络是不完整的,使得链接预测成为网络科学中的一个基本挑战。为了训练参数和评估算法,观察到的链接通常被划分为三个子集,即训练集、验证集和探测集。这种划分隐含地涉及两个采样过程:第一阶段采样产生探测集,第二阶段采样获得验证集。迄今为止,我们对这两个采样过程如何影响算法性能的理解仍然非常有限。为了解决这个问题,我们提出了一种称为β-采样的采样方案,其中链接的采样概率与其两个端点的度数乘积的β次幂成正比。在45个真实网络上的实验表明,通过改变探测集模拟的缺失链接的结构特征显著影响预测精度。当缺失链接倾向于连接高度数节点时,这类链接可以很容易地被准确预测。此外,即使探测集固定,第二阶段采样仍然对预测精度产生显著影响。值得注意的是,最优的第二阶段采样策略不同于随机采样(随机选择链接形成验证集)和一致采样(保证验证集和探测集中的链接具有相同的结构特征)。

英文摘要

Many real-world networks are incomplete, making link prediction a fundamental challenge in network science. To train parameters and evaluate algorithms, observed links are usually divided into three subsets, namely training, validation, and probe sets. This division implicitly involves two sampling processes: first-stage sampling yields the probe set and second-stage sampling obtains the validation set. To date, our understanding of how these two sampling processes affect algorithm performance remains quite limited. To address this issue, we propose a sampling scheme called β-sampling, where the sampling probability of a link is proportional to the product of the degrees of its two endpoints raised to the power of β. Experiments on 45 real-world networks reveal that the structural characteristics of missing links, as simulated via varying probe sets, substantially impact prediction accuracy. When missing links tend to connect high-degree nodes, such links can be predicted accurately with ease. Furthermore, even with a fixed probe set, second-stage sampling still exerts a significant influence on prediction accuracy. Notably, the optimal second-stage sampling strategy differs from random sampling (which randomly selects links to form the validation set) and consistent sampling (which guarantees that links in the validation and probe sets share identical structural characteristics).
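
The β-sampling rule is concrete enough to sketch: a link $(u, v)$ is drawn with probability proportional to $(k_u k_v)^{β}$, where $k_u$ and $k_v$ are the endpoint degrees. The code below is a minimal illustration under two assumptions the abstract does not settle: degrees are computed once on the observed graph, and links are drawn without replacement. Function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Edge = Tuple[int, int]


def beta_weights(edges: Sequence[Edge], beta: float) -> List[float]:
    """Unnormalized sampling weight (k_u * k_v) ** beta for every link."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [(degree[u] * degree[v]) ** beta for u, v in edges]


def beta_sample(edges: Sequence[Edge], beta: float, k: int,
                seed: Optional[int] = None) -> List[Edge]:
    """Draw k distinct links, each step proportional to its beta weight.

    Assumption: degrees stay fixed at their values in the full observed
    graph rather than being recomputed after each draw.
    """
    rng = random.Random(seed)
    pool = list(edges)
    weights = beta_weights(pool, beta)
    chosen = []
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        weights.pop(idx)
    return chosen


# Toy graph: node 0 is a hub, node 4 hangs off node 3.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]

print(beta_weights(edges, 1))   # [3, 3, 6, 2]
print(beta_weights(edges, 0))   # [1, 1, 1, 1], i.e. uniform sampling
probe = beta_sample(edges, beta=1, k=2, seed=0)
```

Setting β = 0 recovers uniform random sampling, positive β favors links between high-degree nodes, and negative β favors links between low-degree nodes, which is how varying β simulates different kinds of missing links.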

2606.19758 2026-06-19 cs.MA 新提交

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

SIGMA: 用于组合式多智能体设计的技能-关联图

Kun Zeng, Yu Huo, Siyu Zhang, Yuecheng Zhuo, Yuquan Lu, Haoyue Liu, Siyue Chen, Xiaoying Tang

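AI summary: Proposes the SIGMA framework, which uses a skill-agent incidence graph to build agents as task-conditioned bundles of reusable skills and then decodes a communication topology; it outperforms baselines on six benchmarks and is robust to unseen skill libraries.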

Comments EMNLP 2026

详情
AI中文摘要

现有的基于图的多智能体系统(MAS)设计者主要通过优化预定义智能体、角色或组上的通信拓扑来改善协作。然而,由于每个节点仍然是一个封闭集实体,这些方法难以泛化到需要未见能力组合的任务。我们提出SIGMA,一个技能-关联图框架,将智能体构建为可复用技能的任务条件组合。给定一个任务和一个技能库,SIGMA预测一个技能-智能体关联矩阵,从选定的技能中组合智能体节点嵌入,并在构建的智能体上解码通信拓扑。在执行过程中,特定技能的邮箱将消息路由到相关分配的能力,使关联结构直接可操作。在六个推理和编码基准测试中,使用三个基础LLM,SIGMA实现了最佳平均性能,并分别比最强的非组合式拓扑基线CARD提高了2.06、2.36和1.75分。它还对未见技能库表现出更强的鲁棒性,平均性能下降仅为0.96分。这些结果表明,组合式节点构建是多智能体设计中除了通信拓扑优化之外的一个互补且重要的方向。代码可在以下网址获取:https://anonymous.4open.science/r/SIGMA-2338/。

英文摘要

Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA-2338/.