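arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.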

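2606.20173 2026-06-19 cs.SE New submission

Qiskit Code Migration with LLMs

Jose Manuel Suarez, Luis Mariano Bibbo, Joaquin Bogado, Alenandro Fernandez

AI summary: To address the code-maintenance problems caused by version churn in quantum software development kits, the paper proposes a hybrid approach that combines large language models with retrieval-augmented generation (RAG). An automatically generated taxonomy of migration scenarios guides the models in migrating Qiskit code across versions, which reduces hallucinations and improves the quality of migration suggestions.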

Abstract

The rapid evolution of Quantum Development Kits (QDKs) introduces a specific form of technical debt that compromises code maintainability and hinders software reuse. In the specialized domain of Quantum Software Engineering (QSE), this challenge is intensified by the scarcity of high-quality training data and the high volatility of emerging frameworks, which often lead general-purpose Large Language Models (LLMs) to produce unreliable or hallucinated results. This paper proposes a hybrid approach integrating LLMs with Retrieval-Augmented Generation (RAG) to automate the migration of Qiskit code across versions. The proposed methodology enhances the precision and reliability of migration suggestions by leveraging an automatically generated taxonomy of migration scenarios as the structured, version-specific knowledge source to guide the models. The approach is implemented through an automated, extensible workflow evaluating LLMs (Google Gemini Flash-2.5 and OpenAI Gpt-oss-20b) under different retrieval schemes (unconstrained and restrictive). Results demonstrate that the taxonomy-based RAG architecture, particularly under the restrictive scheme, significantly reduces hallucinations and improves descriptive quality, with Google Gemini Flash-2.5 showing superior performance in detecting complex refactoring scenarios. These findings confirm the potential of this data-centric methodology to foster technological independence and provide robust, intelligent assistants that mitigate API obsolescence, ensuring the long-term availability of quantum algorithms within a rapidly shifting ecosystem and flattening the learning curve within Quantum Software Engineering (QSE).

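2606.20163 2026-06-19 eess.SY cs.SY New submission

Techno-Economic Analysis of Shared Mobile Storage for Demand Charge Reduction

B Hari Kiran Reddy, Ge Chen, Junjie Qin

AI summary: The paper proposes a high-fidelity fleet management framework that uses a mixed-integer linear programming model and a heuristic algorithm to assess the techno-economic viability of shared electric vehicles for demand charge reduction under practical logistical and operational constraints.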

Comments 22 pages, 26 figures, journal

Abstract

This paper investigates the techno-economic viability of shared electric vehicle (EV) fleets for demand charge reduction under practical logistical and operational constraints. Unlike idealized models that overlook transit overheads, we propose a high-fidelity fleet management framework that explicitly accounts for the spatio-temporal coupling of energy consumption, labor costs for EV drivers, and battery degradation. We formulate the dispatch problem as a mixed-integer linear program (MILP) that jointly minimizes demand charges and total cost of ownership. To address the computational complexity arising from path-dependent constraints, we develop a marginal-value-based heuristic algorithm that achieves near-optimal performance with high computational efficiency. Using real-world data from San Francisco, our analysis reveals that a modest number of EVs can achieve significant demand charge savings, sufficient to recover the ownership and operational expenses. Our results also show how tariff structures, fleet size, and cost components influence overall profitability.
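
The demand-charge arithmetic behind this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's MILP or heuristic: it assumes a single site, discharge-only storage, and no transit, labor, or degradation costs. The load profile, the 20 $/kW rate, and the function names are hypothetical.

```python
import math

def demand_charge(load_kw, rate_per_kw):
    """Demand charge for a billing period: rate times the peak interval demand."""
    return rate_per_kw * max(load_kw)

def lowest_peak_with_storage(load_kw, energy_kwh, power_kw, interval_h):
    """Lowest peak (kW) reachable by discharging storage above a cap.

    Assumption: the battery only discharges during the period (no recharging).
    The cap is found by bisection on the energy needed to hold it.
    """
    def energy_needed(cap_kw):
        total = 0.0
        for load in load_kw:
            excess = max(0.0, load - cap_kw)
            if excess > power_kw:      # cap would exceed the inverter power limit
                return math.inf
            total += excess * interval_h
        return total

    lo, hi = 0.0, float(max(load_kw))
    for _ in range(60):                # hi always stays feasible
        mid = (lo + hi) / 2.0
        if energy_needed(mid) <= energy_kwh:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical hourly load profile (kW) and tariff.
load = [100.0, 120.0, 200.0, 180.0, 110.0]
rate = 20.0  # $/kW, illustrative only
cap = lowest_peak_with_storage(load, energy_kwh=50.0, power_kw=100.0, interval_h=1.0)
savings = demand_charge(load, rate) - rate * cap
```

In this toy case a 50 kWh battery lowers the peak from 200 kW to about 165 kW. Whether such savings cover ownership and operating costs is the question the paper's full model addresses.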

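2606.20158 2026-06-19 cs.SE New submission

N-Version Programming with Coding Agents

Javier Ron, Benoit Baudry, Martin Monperrus

AI summary: The paper revisits N-version programming in the setting of contemporary AI coding agents. Using the Knight-Leveson experiment, it evaluates how diversity across agent systems, models, and implementation languages affects failure modes. It finds common-mode failures, but majority-voting three-version units substantially reduce the failure count, supporting the strategy's practical engineering value.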

Abstract

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using Knight and Leveson's Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.
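
A majority-voting three-version unit, as described in this abstract, might be sketched as follows. The three toy versions, their failure patterns, and the oracle are invented for illustration and have nothing to do with the Launch Interceptor Program; outputs are assumed to be hashable so they can be counted.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the output shared by a strict majority of versions, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def unit_failures(versions, oracle, inputs):
    """Count inputs on which the voted output disagrees with the oracle."""
    return sum(majority_vote([v(x) for v in versions]) != oracle(x) for x in inputs)

# Hypothetical oracle and three faulty versions.
def oracle(x):
    return x * x

def v1(x):  # wrong on multiples of 5
    return -1 if x % 5 == 0 else x * x

def v2(x):  # wrong on multiples of 7
    return -2 if x % 7 == 0 else x * x

def v3(x):  # wrong on multiples of 10: a common-mode overlap with v1
    return -1 if x % 10 == 0 else x * x

inputs = range(1, 36)
single = [unit_failures([v], oracle, inputs) for v in (v1, v2, v3)]
triple = unit_failures([v1, v2, v3], oracle, inputs)
```

Here the single versions fail on 7, 5, and 3 inputs, while the voted unit fails on 4: voting masks independent faults, but inputs where two versions fail together still defeat it. That is the common-mode effect the abstract reports.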

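2606.20134 2026-06-19 cs.LO cs.PL New submission

An MSO Framework for Weak-Memory Verification and Robustness

Giovanna Kobus Conrado, Andreas Pavlogiannis

AI summary: The paper studies monadic second-order logic (MSO) as a metatheory for weak memory. It proves that executions under sequential consistency have bounded treewidth while those under TSO do not, shows that a range of models are MSO-axiomatizable, and introduces the notion of reads-from robustness, yielding a uniform verification algorithm.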

Comments Accepted at CONCUR 2026

Abstract

Memory models are formal specifications of concurrent-program executions, accounting for weak behaviors introduced by compiler and architectural optimizations. The increase in their number and complexity has spawned efforts for uniform verification across whole classes of models, by axiomatizing the models in an adequate metatheory that admits a uniform treatment. In this work, we formally study Monadic Second-Order logic (MSO) as a metatheory for weak memory, by proving results on the treewidth and MSO-expressibility of various popular weak-memory models, as this combination allows us to uniformly tackle several verification problems. In summary, our results are as follows. First, we prove that executions under Sequential Consistency ($\mathsf{SC}$) have bounded treewidth, while already those under Total Store Order ($\mathsf{TSO}$) do not. Second, we prove that a broad range of models, including Release/Acquire and the full RC20, are MSO-axiomatizable, while others, such as Strong Release/Acquire and $\mathsf{TSO}$, are not, unless the Orthogonal Vectors problem (which requires quadratic time under SETH) can be solved in linear time. Finally, we introduce the notion of reads-from robustness, as an extension to recent work on coarse robustness criteria. We show that our treewidth bounds (both upper and lower) have far-reaching algorithmic implications for any of our MSO-axiomatizable models $\mathsf{MM}$: there is an algorithm that, for every program $\mathsf{P}$, either verifies $\mathsf{P}$ under $\mathsf{MM}$ or reports that $\mathsf{P}$ is not reads-from robust against $\mathsf{MM}$. Overall, our results establish a rich and versatile theoretical framework for weak-memory verification and robustness.

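2606.20129 2026-06-19 cs.SE New submission

Learning Critical Testing Literacy Through Puzzles: an Experience Report

Niels Doorn, Bart Th. Knaack, Tanja E. J. Vos, Beatriz Marín

AI summary: The paper reports on thirteen workshops that used puzzles to teach critical testing literacy (CTL). It finds that learning is markedly effective when participants go through the full sequence of solving, debriefing, and reflecting, and it describes an open-source analytics tool developed along the way.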

Abstract

In this paper, we report our experiences and takeaways from workshops using puzzles to learn critical testing literacy (CTL). Background: Software testing is important yet difficult to teach. We introduced a body of knowledge (BoK) of puzzle-based learning activities to teach CTL, based on a model of the critical tester's cognition, leading to the pedagogical framework P4TEST. We conducted thirteen workshops with students, testers, teachers, and primary school pupils to assess puzzle-based teaching of CTL. Experience: Across eleven workshops, we used a semi-structured approach, varying puzzles, materials, and timing. In two additional workshops, we introduced workbooks and think-aloud sessions to gather more data on the learning experience. Observations: Participants consistently perceived themselves as experimenting while solving puzzles. Students tended to converge on solutions, while professionals continued exploring. Emotions were visible in behaviour but hard to surface through written reflection alone. Think-aloud sessions revealed immediate reasoning; written reflections elicited more meta-cognitive reflection. The theme Sensemaking / reflection-in-action captured how participants framed problems, navigated dead ends, and shifted strategies. Reflections: Puzzles are not the intervention: the entire sequence of solving, debriefing, and reflecting is. Designing that sequence more deliberately is the work ahead. We also developed an open-source web application with built-in analytics to customise workshops.

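2606.20127 2026-06-19 eess.SY cs.SY New submission

Contraction-based Neural Control for Cooperative Aerial Payload Transportation with Variable-length Cables

Yi Lok Lo, Longhao Qian, Hugh H. T. Liu

AI summary: The paper proposes a neural nonlinear control framework for multi-UAV slung-payload systems. By exploiting a decoupled dynamics structure, it jointly trains a neural contraction metric controller and a feedback controller for payload trajectory tracking, and uses variable-length cables for obstacle avoidance.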

Comments Submitted for publication in AIAA Scitech 2027

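2606.20121 2026-06-19 cs.LO New submission

BARReL: a modern backend for Atelier B in Lean

Ghilain Bergeron, Vincent Trélat

AI summary: BARReL is a Lean 4 library that bridges Atelier B, an industrial tool for the B method, and the Lean proof assistant. It supports interactive B developments inside Lean, encodes partial operators through explicit well-definedness conditions, uses dependent types to guarantee well-definedness, and provides basic automation.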

Abstract

BARReL is a Lean 4 library bridging Atelier B, an industrial tool for the B method, and the Lean proof assistant by enabling users to conduct their formal B developments -- up to machine refinement and implementation -- interactively inside Lean, while retaining standard B syntax. B partial operators are carefully encoded by generating explicit well-definedness conditions, leveraging Lean's dependent types to enforce a well-definedness discipline by construction. That is, proof obligations and proof steps cannot silently rely on ill-typed or ill-defined instantiations. BARReL also features basic automation to try to discharge such well-definedness conditions automatically. The implementation is written entirely using Lean meta-programming and is designed to be modular: extending the supported B fragment typically requires only adding new syntax and encoding clauses. We illustrate the approach on a small but representative case study, and argue that BARReL can act as a stepping stone towards a strongly reliable Atelier B toolchain grounded in the Lean proof assistant.
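
The idea of attaching a well-definedness condition to a partial operator can be illustrated with a small Lean 4 sketch. This is not BARReL's encoding or API; `bDiv` and its argument names are hypothetical, and natural numbers stand in for B integers.

```lean
-- Hypothetical illustration: division that demands a proof that the divisor is nonzero.
def bDiv (a b : Nat) (_wd : b ≠ 0) : Nat := a / b

-- The caller must discharge the condition; `by decide` does so automatically here.
example : bDiv 6 3 (by decide) = 2 := rfl

-- `bDiv 6 0 (by decide)` is rejected: no proof of `0 ≠ 0` exists,
-- so an ill-defined instantiation cannot be written silently.
```

Because the proof is an argument of the function, any use of the operator carries its own obligation, which is the "by construction" discipline the abstract describes.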

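2606.20117 2026-06-19 cs.CE New submission

Autoregressive Modelling and Synthetic Generation of High-Fidelity, Statistically Equivalent 3D Microstructures for As-Manufactured Misalignments in Fiber-Reinforced Composites

Mohamad A. Raja, Clemens Dransfeld, Boyang Chen

AI summary: The paper proposes an integrated framework that extracts fiber misalignment features from X-ray μCT data, models them with copulas, autoregression, and extreme-value components, calibrates the model by Bayesian optimization, and iteratively generates about 2400 non-overlapping synthetic fibers with statistical deviations below 10%.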

Abstract

This study presents an integrated framework for processing, modelling, and generating statistically representative three-dimensional fiber microstructures from experimental X-ray $\mu$CT observations. First, an analytical slice-segment ellipse-intersection method is introduced to extract per-slice and per-fiber in-plane and out-of-plane misalignment profiles along the fiber depth. These descriptors are then used to construct a stochastic model that captures slice-wise misalignment distributions and their depth-wise evolution through copula-based in-plane dependence, latent autoregressive continuity, and rare extreme-misalignment motifs. The model hyperparameters are calibrated using Bayesian optimization, achieving close agreement with the original statistical descriptors, with deviations generally below 10%. The optimized statistical model is coupled with a physical generation strategy that begins with a variable-radius fiber seeding layer and proceeds through an iterative slice-by-slice 3D growth scheme, where the statistical layer guides fiber evolution and Delaunay-based neighbourhood construction with ellipse-based contact resolution ensures non-overlapping, radius-augmented synthetic microstructures. The framework successfully generates about 2400 synthetic fibers while preserving strong statistical fidelity to the original X-ray $\mu$CT data. The proposed pipeline provides a promising and scalable route for generating statistically equivalent, geometrically admissible, and simulation-ready fiber composite microstructures for virtual testing and analysis.
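
The "latent autoregressive continuity" along the fiber depth can be pictured with a first-order autoregressive sketch. The abstract does not state the model order or marginals, so the AR(1) form, the Gaussian marginal, and all parameter values below are assumptions for illustration only.

```python
import math
import random

def ar1_misalignment_profile(n_slices, phi, mean_deg, std_deg, seed=0):
    """Sample one fiber's misalignment angle (degrees) per slice.

    Assumption: a latent AR(1) state z with unit variance,
    z[k] = phi * z[k-1] + sqrt(1 - phi^2) * eps[k], mapped to
    angle[k] = mean_deg + std_deg * z[k].
    """
    if not -1.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [-1, 1]")
    rng = random.Random(seed)
    innovation_scale = math.sqrt(1.0 - phi * phi)  # keeps the latent variance at 1
    z = rng.gauss(0.0, 1.0)                        # latent state at the first slice
    profile = [mean_deg + std_deg * z]
    for _ in range(n_slices - 1):
        z = phi * z + innovation_scale * rng.gauss(0.0, 1.0)
        profile.append(mean_deg + std_deg * z)
    return profile

# Hypothetical settings: 50 slices, strong slice-to-slice correlation.
profile = ar1_misalignment_profile(50, phi=0.9, mean_deg=0.0, std_deg=2.0, seed=1)
```

A value of `phi` close to 1 gives smooth, slowly varying profiles, while `phi = 0` makes the slices independent. The paper's copula coupling and extreme-misalignment motifs are not represented here.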

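2606.20102 2026-06-19 cs.CY cs.CR New submission

Artificial Intelligence as Game Changer in Cybersecurity: What We Learned in 2025-2026, and how this is relevant for Africa

Mikael Alemu Gorsky

AI summary: Drawing on two events from 2025-2026, the paper argues that frontier language models have become a decisive instrument of cyber operations, while Africa is entirely excluded from building, operating, and obtaining them. The continent faces a threefold deficit in skills, compute, and investment and is already a target of AI-enabled fraud. The paper recommends responding within six to twelve months through threat-intelligence sharing, governance adoption, and partnership.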

Comments International Conference on Cybersecurity in the Era of Digital Transformation and Artificial Intelligence

Abstract

In 2025 and 2026, two events settled questions that had until then been speculative. In the first, a large language model executed the great majority of a state-aligned cyber-espionage campaign on its own, with human operators intervening at only a few decision points. In the second, the most capable cyber-relevant model was placed under a controlled-access program limited to a vetted set of United States technology firms, allied governments, and European standards bodies; that perimeter included no African government, operator, or university. Together the two events establish the argument of this paper: frontier language models have become a decisive instrument of cyber operations, and that instrument is built, owned, and rationed within a small circle from which Africa is absent. The paper documents Africa's exclusion on every count. The continent does not build frontier models, cannot yet operate them, and cannot, for now, obtain the most capable ones. The operational deficit is set out along three axes (skilled people, compute and electrical power, and investment), each measured against current figures; meanwhile AI-enabled fraud is already mounting against African mobile-money systems, the part of the digital economy the continent leads. Two constraints follow: the gating of frontier models by their developers, which no African decision can open, and a chosen dependence on infrastructure vendors now caught in geopolitical restriction. Because comparable but ungated models are forecast to spread within six to twelve months, the paper argues for a response that operates inside that window through threat-intelligence sharing, governance adoption, and partnership, undertaken by Africans on their own terms.

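2606.20096 2026-06-19 cs.CG q-bio.NC New submission

Quadratic Forms for Measuring Geometric Trees in 3-dimensional Space

Yossi Bokor Bleile, Emanuele Cortinovis, Herbert Edelsbrunner, Shota Uka

AI summary: The paper proposes using quadratic forms to measure the directional distribution of geometric trees, and introduces a hexagonal diagram model based on the Fisher metric for visualization and statistical analysis.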

Comments 16 pages, 6 figures

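2606.20064 2026-06-19 cs.HC New submission

AI Conversational Interviewing: Scaling Up Semi-Structured and In-depth Interviews

Alexander Wuttke, Max Melchior Lang, Christopher Klamm, Quirin Würschinger, Frauke Kreuter

AI summary: The study proposes AI Conversational Interviewing, which collects open-ended opinion data at scale in voice, text, or free-choice modes. It shows that the method captures deeper reasoning missed by standardized surveys and that respondents rate it no lower than traditional surveys.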

Abstract

Public opinion research has long faced a trade-off between depth and scale: standardized surveys enable large-scale measurement but restrict respondents to researcher-defined categories, obscuring the diversity of unexpected considerations that underlie public sentiment. More conversational interviews provide richer insights through open-ended probing, but their reliance on trained human interviewers has kept them difficult to scale. This study introduces AI Conversational Interviewing as a method for collecting open-ended public opinion data at scale, pursuing three objectives: to demonstrate the analytical value of conversational text data for questions beyond the reach of closed-ended items; to assess the method's practical viability through participants' own evaluations; and to inform implementation by experimentally comparing voice-based, chat-based, and free-choice interview modes. We conducted a study combining an AI-led interview with a standardized survey on migration policy among 571 respondents recruited via Prolific and Payback Panel. The findings establish AI Conversational Interviewing as a viable and valuable addition to the social-science toolkit. The conversational transcripts surface considerations and reasoning that a comprehensive standardized battery does not capture, such as markedly different mental models of migration among subgroups with similar attitude levels. Among respondents who completed the interview, evaluations of the AI interview were at or above those of the standardized survey across modes, although completion itself varied by condition. By releasing open data and open-source pipeline materials, the study contributes to a growing literature on harnessing artificial intelligence to expand the methods of public opinion measurement.

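2606.20047 2026-06-19 cs.IR New submission

PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami

AI summary: The paper proposes PACMS, a context selection method based on submodular function maximization. At prompt assembly time it picks content from session turns, memory, and tool outputs by relevance, replacing truncation and improving information retention in long conversations.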

Abstract

Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and, often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval-augmented generation fetches external documents into the prompt but does not arbitrate the agent's already-present pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.
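
Selecting from a single candidate pool under a token budget can be sketched with a greedy rule on a coverage objective, which is monotone submodular. The abstract does not give PACMS's objective or algorithm, so the term-coverage score, the gain-per-token rule, and every name below are assumptions made for illustration.

```python
def select_context(candidates, query_terms, token_budget):
    """Greedily pick pool items that cover the most uncovered query terms per token.

    candidates: dict mapping item name -> (set of terms, token count > 0).
    Assumption: relevance is plain term coverage, a stand-in for a learned score.
    """
    covered = set()
    chosen = []
    used = 0
    remaining = dict(candidates)
    while remaining:
        best_name, best_ratio = None, 0.0
        for name, (terms, tokens) in remaining.items():
            if used + tokens > token_budget:
                continue                      # item does not fit in what is left
            gain = len((terms & query_terms) - covered)
            ratio = gain / tokens             # marginal gain per token spent
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:                 # nothing fits or nothing adds coverage
            break
        terms, tokens = remaining.pop(best_name)
        covered |= terms & query_terms
        chosen.append(best_name)
        used += tokens
    return chosen

# Hypothetical pool mixing a memory entry, a recent turn, and a tool output.
pool = {
    "old_fact": ({"deploy", "key"}, 20),
    "recent_chatter": ({"weather"}, 50),
    "tool_output": ({"deploy", "log"}, 200),
}
picked = select_context(pool, query_terms={"deploy", "key"}, token_budget=100)
```

In this toy pool the old but relevant memory entry is kept and the recent, irrelevant turn is dropped, which is the opposite of what recency truncation would do.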

2606.19988 2026-06-19 cs.SE New submission

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

Shi Chen, Rongcun Wang, Yuan Tian, Xiaoyuan Xie, Wei Song, Rubing Huang

AI summary: The paper introduces the SolidityBench benchmark and the SolidityScore metric, evaluates several LLM-based approaches to repository-level Solidity code generation, and finds supervised fine-tuning to be the most effective.

Comments 33 pages

Abstract

Large Language Models (LLMs) have shown strong capabilities in general-purpose code generation, but their effectiveness in specialized software domains remains underexplored. Solidity smart contracts represent a high-stakes domain where generated code must satisfy strict language-level, security, and software-engineering constraints. Existing benchmarks and metrics remain insufficient for repository-level Solidity generation, where models must synthesize complete contracts from natural language requirements. To address this gap, we introduce SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions. We also propose SolidityScore, a Solidity-aware semantic metric that emphasizes domain-critical constructs such as security modifiers, contract declarations, and Solidity-specific keywords. Using this benchmark, we evaluate representative code LLMs, including Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama, across zero-shot prompting, Chain-of-Thought reasoning, in-context learning, retrieval-augmented generation, and supervised fine-tuning. The results show that general-purpose models exhibit systematic structural deficiencies in repository-level Solidity generation. Among non-parametric methods, retrieval-augmented generation performs best, while in-context learning degrades beyond two examples due to context saturation. Supervised fine-tuning achieves the largest improvement by internalizing Solidity-specific constraints into model parameters. Overall, our study provides a comprehensive benchmark for repository-level Solidity code generation and shows that high-quality domain data combined with supervised fine-tuning is the most effective strategy for improving the reliability of LLM-generated smart contracts.

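2606.19983 2026-06-19 cs.CR New submission

A Measurement Study of Cryptographic Misuse in Embodied AI Mobile Applications

Junchao Li, Xuelei Wang, Yuhang Huang, Qi Wang, Boyang Ma, Xuelong Dai, Minghui Xu, Yue Zhang

AI summary: The paper presents the first large-scale measurement of cryptographic misuse in embodied AI mobile applications. An automated semantic analysis pipeline finds 12,975 misuse instances and reveals structural security trade-offs caused by latency-sensitive control paths and offline provisioning.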

Abstract

Embodied AI (EAI) mobile applications are evolving from auxiliary user interfaces into active control-path components, directly linking mobile-side cryptographic security to cyber-physical trust. Despite this shift, existing security research predominantly focuses on embodied AI devices and cloud infrastructures, leaving the mobile control layer largely unexplored as a critical attack surface. To bridge this gap, we present the first large-scale measurement study of cryptographic misuse within the EAI mobile ecosystem. We construct EAIAppZoo, a benchmark of 507 real-world applications across six EAI domains, and employ an automated semantic-aware analysis pipeline to measure the prevalence and characteristics of five major cryptographic failure modes. Our measurement yields 12,975 misuse findings (with an evaluated precision of 80.74%), revealing that these cryptographic failures are driven by EAI-specific engineering constraints rather than random developer errors. We uncover structural security trade-offs: latency-sensitive control paths systematically weaken transport protection, while the heavy reliance on offline device provisioning and legacy IoT SDKs exacerbates the local hardcoding of authentication credentials. Through real-world case studies, we demonstrate how these mobile-side cryptographic flaws bypass nominal network protections, enabling adversaries to intercept command channels and hijack the physical control of EAI entities. Ultimately, our findings highlight that mobile applications have become a fragile, yet overlooked, cryptographic trust boundary in cyber-physical systems.

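2606.19969 2026-06-19 cs.DB cs.DC New submission

The Bi-Channel Networking Paradigm for Database Systems in the Cloud

Georg Kreuzmayr, Muhammad El-Hindi, Benjamin Wagner, Tobias Ziegler, Viktor Leis

AI summary: Kernel TCP stacks have become a database performance bottleneck on modern high-speed cloud networks. The paper proposes a bi-channel networking paradigm that splits communication into a high-performance data path and a reliable control path, combining user-space UDP with kernel TCP to achieve high throughput at low overhead in a distributed shuffle and a replicated key-value store.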

Comments Accepted to EDBT 2027 (Lille, France)

Abstract

When network links were slow, cloud and distributed database systems could rely on generic kernel abstractions and treat network communication as a black box. With today's fast cloud networks, this approach breaks down: database performance becomes limited by the CPU overhead of the kernel TCP stack. Replacing TCP with user-space UDP can reduce this overhead, but it requires reimplementing essential guarantees, such as reliability and ordering. To solve this conundrum, database systems should no longer treat networking as a black box but co-design it with database operations. We propose the bi-channel paradigm for database systems, which separates communication into two channels: a high-performance data path for latency- and bandwidth-sensitive operations, and a reliable control path for coordination and recovery. We implement the paradigm by combining user-space UDP and kernel-based TCP, though other stack combinations are possible. This design exploits modern NIC capabilities while preserving TCP's reliability. We demonstrate the paradigm's efficiency and simplicity in two representative settings: a distributed shuffle saturating 200 Gbit/s with three CPU cores, and a replicated key-value store processing millions of messages per second.

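2606.19968 2026-06-19 cs.GT New submission

Beyond Lower Quota: Avoiding Overrepresentation in Multi-Winner Voting

Anton Baychkov, Martin Lackner, Jan Maly, Oliviero Nardi, Jannik Peters

AI summary: The paper proposes JUQ, an axiom for avoiding overrepresentation, introduces composite Thiele rules and characterizes the Adams-AV rule as the one in that class satisfying the axiom, and also proposes JNQ, an axiom that balances avoiding under- and overrepresentation.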

Comments This is an extended version of the publication with the same name in the proceedings of EC 2026

Abstract

Recently, in the social choice literature, much attention has been given to the question of avoiding underrepresentation in approval-based multi-winner voting. In this paper, we explore the largely overlooked complementary question of avoiding overrepresentation. This has not been explored systematically, despite being a desirable property with concrete applications. Intuitively, overrepresentation happens when a group determines a disproportionately large part of the committee, thereby exceeding the group's quota. We formulate a strong and appealing axiom for avoiding overrepresentation, called justifiable upper quota (JUQ). We introduce a generalization of Thiele rules, composite Thiele rules, and characterize the unique rule in this class satisfying our axiom. This rule, Adams-AV, which naturally extends Adams' apportionment method, has not been studied before. Additionally, we introduce a polynomial-time rule that satisfies JUQ. Furthermore, we introduce justified near quota, an axiom that balances avoiding under- and overrepresentation. It characterizes the unique Thiele rule extending the Sainte-Laguë apportionment method. Finally, we analyze the compatibility of our axioms with established proportionality notions such as EJR+.
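
The two apportionment methods named in this abstract are classical divisor methods, and a short sketch shows how they differ on party-list votes. This is the party-list setting only, not the approval-based rules (Adams-AV, composite Thiele rules) studied in the paper; the vote counts and the tie-breaking by vote total are assumptions.

```python
def divisor_apportionment(votes, seats, divisor):
    """Assign seats one at a time to the party with the highest votes / divisor(seats held)."""
    alloc = {party: 0 for party in votes}

    def quotient(party):
        d = divisor(alloc[party])
        return float("inf") if d == 0 else votes[party] / d

    for _ in range(seats):
        # Ties in the quotient are broken by raw vote count (an arbitrary choice).
        winner = max(votes, key=lambda p: (quotient(p), votes[p]))
        alloc[winner] += 1
    return alloc

def adams(s):          # divisors 0, 1, 2, ...: every party gets a seat first
    return s

def sainte_lague(s):   # divisors 1, 3, 5, ...
    return 2 * s + 1

# Hypothetical election: 100 votes, 5 seats, so one seat corresponds to 20 votes.
votes = {"A": 70, "B": 25, "C": 5}
adams_seats = divisor_apportionment(votes, 5, adams)
sl_seats = divisor_apportionment(votes, 5, sainte_lague)
```

With these numbers Adams's method gives A three seats, below its share of 3.5, and gives the small party C one, while Sainte-Laguë gives A four seats and C none. The example shows Adams's method favouring small groups and holding large ones back, the behaviour relevant to avoiding overrepresentation.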

2606.19960 2026-06-19 cs.IR New submission

Stellar: Scalable Multimodal Document Retrieval for Natural Language Queries

Yuxiang Guo, Zhonghao Hu, Yuren Mao, Yuhang Liu, Congcong Ge, Xiaolu Zhang, Jun Zhou, Yunjun Gao

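AI summary: Proposes Stellar, a framework that stores token-level document embeddings on disk and dynamically loads candidate embeddings, combining lexical representation-based filtering with efficient disk-backed late interaction to cut memory overhead and query latency by 1-2 orders of magnitude while preserving retrieval effectiveness.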

Abstract

Multimodal document retrieval--selecting the most relevant multimodal document from a large corpus to answer a natural language query--plays an essential role in Retrieval-Augmented Generation (RAG) systems. State-of-the-art methods represent each document and query with multiple token-level embeddings and use late interaction to achieve high effectiveness. However, such multi-vector representations incur substantial memory overhead during retrieval, leading to poor scalability and hindering real-world deployment. In this paper, we present Stellar, a scalable multimodal document retrieval framework that stores token-level document embeddings on disk and loads only a small set of candidate embeddings into memory for late interaction. Stellar comprises two key components: (i) Lexical Representation-based Filtering (LRF), which fine-tunes a Multimodal Large Language Model (MLLM) as a sparse encoder to produce high-quality lexical representations, enabling efficient and effective document filtering to significantly reduce the candidate set; (ii) Efficient Disk-backed Late Interaction (DLI), which designs an on-disk token embedding storage layout guided by a balanced clustering algorithm, and dynamically loads only the necessary token embeddings into memory using a simple yet effective cost model. Extensive experiments on four real-world benchmarks and a newly presented large-scale dataset demonstrate that Stellar reduces memory overhead and query latency by 1-2 orders of magnitude compared to existing methods without compromising retrieval effectiveness.
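To make "late interaction" concrete, here is a minimal sketch of the MaxSim scoring commonly used with multi-vector (ColBERT-style) representations, applied to a small candidate set that has already passed a filtering stage. The abstract does not state Stellar's exact scoring function, so this is an assumption; `maxsim_score`, `rerank`, and the toy embeddings are hypothetical.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, candidates):
    """candidates maps doc_id -> list of token embeddings for the few
    documents that survived filtering; returns doc_ids, best first."""
    return sorted(
        candidates,
        key=lambda doc_id: maxsim_score(query_vecs, candidates[doc_id]),
        reverse=True,
    )


# Toy 2-dimensional embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.5, 0.5]],
}
print(rerank(query, candidates))
```

Because only the candidates' token embeddings are needed for this step, a system can in principle keep the full set on disk and load just these few, which is the memory saving the abstract describes.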

2606.19957 2026-06-19 cs.CY New submission

Modest, artistic, and radical solutions to the environmental impact of image-generating machine learning

Laura U. Marks, Jess MacCormack, Kehui Li

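AI summary: Addressing the high energy consumption of image-generating ML, the paper explores solutions such as inexact computing, small models, and low-precision hardware from the perspectives of computer engineering, media studies, and art, and proposes a true-cost accounting.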

Comments Paper in Proceedings of LIMITS 2026: 12th Workshop on Computing within Limits, 2026-06-23-25, Online

Abstract

Machine learning is often touted to improve the efficiency of ICT, but that small gain is overwhelmed by the enormous carbon, water, and land footprints of data centers and ML-ready devices. We survey the electricity consumption of ML applications in training and inference, focusing on electricity-intensive image generation. Our team of a computer engineer, a media scholar, and an artist explores solutions including inexact computing; tiny language models; low-precision hardware architectures; hardware with limited capacity; and anticipating and mitigating energy demands at the design phase. We will sketch our work in progress on an ethical and aesthetically sophisticated tiny image generator using non-scraped data. Looking to the economic context, we will propose a true-cost accounting for the environmental impact of machine learning and suggest that the criterion of efficiency is driven by the shareholder-capitalist framing of ICT.

2606.19949 2026-06-19 cs.CG New submission

Semi-Automatic Correction of 3D Tubular Structure Skeletons via Component-Wise MST and Filtered Delaunay Triangulation

Ruoxuan Yang, Chuan Li

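AI summary: Proposes a semi-automatic method that, given user-selected source and target points, combines component-wise minimum spanning trees with a filtered Delaunay triangulation to reconstruct plausible centerline connections and correct topological artifacts in skeletons.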

Comments Accepted at ACM ICMR 2026

Journal ref In Proceedings of the International Conference on Multimedia Retrieval (ICMR '26), June 16--19, 2026, Amsterdam, Netherlands. ACM, New York, NY, USA, 10 pages

DOI: 10.1145/3805622.3810782

Abstract

Skeletonization of tubular structures from 3D imaging is essential for tasks such as morphometric analysis, transport or flow simulation, and procedural planning in domains including vascular networks, plant root systems, and neural connectomes. However, automatic skeleton extraction often introduces topological artifacts, such as erroneous connections between nearby branches and fragmented centerlines caused by noise or missing data. Correcting these artifacts manually can be time-consuming and error-prone, especially when precise interaction is required. We present a semi-automatic correction method that reconstructs a plausible centerline connection from minimal user input. Given a user-selected source and target point, our method traces a path by combining (i) component-wise minimum spanning trees for stable local propagation and (ii) a filtered 3D Delaunay edge graph for bridging gaps and handling ambiguous junctions. Candidate steps are ranked using a score that accounts for direction continuity, spatial proximity, component consistency, and target-directed progress. The output is an ordered polyline (or edge sequence) that can be used as a suggested correction and integrated into downstream skeleton post-processing workflows. We implement the system in C++ with an interactive viewer based on Libigl and demonstrate representative qualitative results on brain vessel datasets, including correction of typical "crossing" and "dotted" artifacts. While our current validation is qualitative, the method is lightweight and serves as a practical building block toward more comprehensive interactive correction pipelines in biomedical imaging and related domains.
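As a rough illustration of the "component-wise minimum spanning trees" ingredient, the sketch below runs Kruskal's algorithm on a graph with several connected components, which yields one minimum spanning tree per component. This is a generic textbook construction written in Python for brevity (the authors' system is in C++); the edge weights, node numbering, and the function name `minimum_spanning_forest` are hypothetical, and the filtered Delaunay bridging and step scoring from the abstract are not modeled.

```python
def minimum_spanning_forest(num_nodes, edges):
    """edges: iterable of (weight, u, v). Returns the kept edges, i.e. one
    minimum spanning tree for each connected component of the graph."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two different subtrees
            parent[root_u] = root_v
            forest.append((weight, u, v))
    return forest


# Two skeleton fragments: nodes {0, 1, 2} and nodes {3, 4} (hypothetical).
edges = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 0, 2), (1.5, 3, 4)]
print(minimum_spanning_forest(5, edges))
```

The two fragments stay disconnected in the result; per the abstract, bridging such gaps is the job of the filtered Delaunay edge graph.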

2606.19937 2026-06-19 cs.CR New submission

AutoTam: Specifying Secure Protocol Implementations with Tamarin Model Generation

Johannes Wilson, Mikael Asplund, Niklas Johansson

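AI summary: Proposes a language-first method in which protocols are implemented in a domain-specific language and Tamarin models are generated automatically; verified trace properties carry over to the implementation, symbolic execution is integrated to analyse memory safety, and security and interoperability are demonstrated on the signed Diffie-Hellman and WireGuard protocols.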

Comments 19 pages, 5 figures

Abstract

Formal verification is a challenging but important task for ensuring the security of cryptographic protocols. While modern protocol verification tools significantly reduce verification effort, modelling remains challenging to practitioners without a background in formal verification. In addition, transferring verification results to a concrete protocol implementation requires expert knowledge. In this paper, we present a novel language-first method for verification of trace properties using a domain-specific language for protocol implementations. We target the Tamarin prover for verification, and we prove that verified universal trace properties translate back to the implementation. We additionally integrate symbolic execution in order to analyse the memory safety of protocol implementations. We use our tool to implement and generate accurate models for a signed Diffie-Hellman protocol, and for the WireGuard VPN protocol. Our WireGuard implementation is interoperable with existing implementations when using our interpreter, and achieves acceptable performance. We formally prove our implementations secure using a combination of symbolic execution and verification of the generated Tamarin models.

2606.19936 2026-06-19 cs.LO cs.MM New submission

Prismriver: Formalization of Music Theory and Algorithmic Composition in Lean 4

Leni Aniva, Claire Wang

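AI summary: Formalizes music theory in Lean 4, enabling verifiable algorithmic composition and accompaniment generation, and supports monadic analysis of musical structures.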

2606.19931 2026-06-19 cs.MA New submission

Blame is easier than praise: Measuring off-ball defensive performance in football

Jonas Bischofberger, Runqing Ma, Pascal Bauer, Kilian Arnsmeyer, Arnold Baca

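AI summary: Proposes player involvement scores based on defensive pressure areas (DPAs) that attribute event-level changes in expected threat to individuals in order to measure off-ball defensive performance in football, validated on a cross-gender, cross-competition data set.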

Abstract

The defensive performance of football players is commonly measured through a limited number of actions like tackles and interceptions while their continuous impact through positional behaviour has hardly been studied before. We formulate this problem as an attribution over multi-agent spatiotemporal trajectories without player-level ground truth labels, where event-level changes of expected threat are distributed among individuals. We propose a framework that performs this attribution using player involvement scores calculated from defensive pressure areas (DPAs). By computing role-conditioned baselines within automatically detected team structures, we can determine each defender's expected responsibility for threat created through arbitrary passes. The validity and robustness of this approach are evaluated on a uniquely extensive cross-gender and cross-competition data set, including positional and event data from 64 matches of the men's World Cup, 116 matches of the women's German Bundesliga and 336 matches of the men's German 3. Liga. In the absence of a ground truth, we propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. We find a validity score that is improved by around 1 standard deviation compared to the best action-based metric and demonstrate that many popular measures show limited validity. The "blame" for conceding high-value actions shows especially strong correlations with external ratings and market values, making it the first published metric in football to reliably measure positioning errors. All code underlying this work is publicly available to support reproducibility and further research.

2606.19930 2026-06-19 cs.HC New submission

MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

Guangyi Liu, Pengxiang Zhao, Gao Wu, Yiwen Yin, Mading Li, Liang Liu, Congxiao Liu, Zhang Qi, Mengyan Wang, Liang Guo, Yong Liu

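AI summary: Proposes MobileForge, which uses the MobileGym environment for task generation and evaluation and Hierarchical Feedback-Guided Policy Optimization (HiFPO) to turn trajectory outcomes, step-level feedback, and corrective hints into step-level GRPO updates, enabling annotation-free adaptation of mobile GUI agents and reaching 67.2% Pass@3 on AndroidWorld.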

Comments Project page: https://mobile-forge.github.io/

Abstract

MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but adapting them to real target apps remains costly because mobile apps are numerous, frequently updated, and hard to cover with human-written tasks, demonstrations, or reward labels. Existing annotation-free GUI learning reduces manual supervision, yet lacks a unified substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback, while policy optimization often relies on isolated rollouts and coarse rewards that are hard to convert into reliable improvement signals. We present MobileForge, an annotation-free adaptation system for mobile GUI agents. MobileForge consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interaction, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which turns trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only automatically generated annotation-free adaptation data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld, close to the closed-data GUI-specialized GUI-Owl-1.5-8B base model at 69.0%. The MobileForge-adapted ForgeOwl-8B further reaches 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split, establishing the strongest open-data mobile GUI agent in our evaluation. Code, data, and trained models will be released at https://mobile-forge.github.io/.
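The abstract says HiFPO produces step-level GRPO updates. As background, GRPO-style methods score each sample relative to the other samples in its group instead of using a learned value baseline. The sketch below shows only that generic normalization, assuming scalar rewards and population standard deviation; it is not HiFPO, and how MobileForge forms groups or computes rewards is not stated in the abstract. The function name and reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled attempts at the same step, two judged good (hypothetical).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

A group in which every sample gets the same reward yields all-zero advantages, so it contributes no learning signal; this is one reason coarse, trajectory-only rewards can be hard to learn from.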

2606.19926 2026-06-19 cs.HC New submission

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Guangyi Liu, Gao Wu, Congxiao Liu, Pengxiang Zhao, Liang Liu, Mading Li, Qi Zhang, Mengyan Wang, Liang Guo, Yong Liu

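AI summary: Proposes MemGUI-Agent, whose proactive context management mechanism (ConAct) treats context management as a first-class action, addressing prompt explosion and the dilution of critical information in long-horizon tasks and achieving the best open-data 8B performance on MemGUI-Bench.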

Comments 33 pages, 6 figures. Project page: https://memgui-agent.github.io/

Abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.

2606.19918 2026-06-19 cs.ET New submission

A Novel FeFET Differential Bit-Cell With Hybrid Volatile and Non-Volatile Memory Modes

Jianze Wang, Wei Zhang, Xuanyao Fong

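AI summary: Proposes a 4T differential bit-cell built from cross-coupled FeFETs and access transistors that switches between volatile and non-volatile modes by adjusting the write conditions, needs no explicit backup and restore operations, and is smaller than conventional 6T SRAM.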

Abstract

Non-volatile SRAM (nvSRAM) designs have been investigated to address the high leakage power of CMOS-based SRAM and the large write latency of emerging non-volatile memory (eNVM) technologies. However, prior nvSRAM designs that combine SRAM with eNVM devices typically require backup and restore (B\&R) operations and incur significant cell-area overhead. Here, we propose a differential memory bit-cell consisting of a pair of cross-coupled ferroelectric field-effect transistors (FeFETs) and a pair of access transistors, resulting in a four-transistor (4T) structure, which is smaller than conventional 6T SRAM and many prior nvSRAM designs. The proposed bit-cell can be configured to operate in either volatile or non-volatile mode by adjusting the write conditions. In the non-volatile mode, the proposed nvSRAM achieves a store power of 0.13~$\mu$W with a 2~ns store time, and no explicit B\&R operation is required. The proposed bit-cell can also be viewed as a cross-coupled gain cell, enabling further applications.

2606.19913 2026-06-19 cs.AR New submission

Design and Evaluation of Energy-Efficient Whisper Dot-Product Kernel Offloading on a CGLA Architecture

Takuto Ando, Yu Eto, Ayumu Takeuchi, Yasuhiko Nakashima

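AI summary: Offloads the Whisper dot-product kernel to IMAX, a CGLA architecture; with kernel mapping, local-memory sizing, and burst scheduling, the projected design achieves a power-delay product (PDP) on Whisper tiny that is 2.35x lower than Jetson AGX Orin and 10.48x lower than RTX 4090, offering a programmable architecture for low-power local speech recognition.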

Comments This paper is accepted at Concurrency and Computation: Practice and Experience (Wiley)

Abstract

In this paper, we implement and evaluate Whisper dot-product kernel offloading on IMAX, a programmable Coarse-Grained Linear Array (CGLA) architecture. Whisper-tiny.en profiling on an ARM Cortex-A72 shows that dot-product operations account for 90.6% of FP16 execution time and 87.1% of Q8_0 execution time. To address this kernel bottleneck, we combine kernel mapping, local-memory sizing, and burst scheduling. The implementation uses inline FP16-to-FP32 conversion, 2-way SIMD FMA on a 64-bit datapath, column-wise multithreading, and mixed execution in which aligned vector segments run on IMAX and residual segments run concurrently on the host CPU. We evaluate the design with an FPGA prototype and a 28nm ASIC projection at 840MHz. For Whisper-tiny.en, 32KB local memory and burst length 16 jointly minimize PDP and EDP. Under a TDP-based cross-platform comparison, the projected IMAX records a PDP of 11.58J for Whisper-tiny.en Q8_0, 2.35x lower than Jetson AGX Orin (27.16J) and 10.48x lower than RTX 4090 (121.38J). The same design extends to Whisper-base.en and Whisper-small.en, where the PDP gap narrows as 32KB local-memory coverage drops from 93.8% for tiny to about 66.5% for base and small. These results position IMAX as a programmable architecture for lower-PDP local ASR in the tiny-model regime.
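The reported improvement factors follow directly from the three PDP figures in the abstract. The short check below recomputes them; the dictionary keys and the helper name `pdp_ratio` are hypothetical labels, and the joule values are copied from the abstract.

```python
# Power-delay product (power x latency, in joules) for Whisper-tiny.en Q8_0,
# as reported in the abstract.
pdp_joules = {
    "IMAX (28nm projection)": 11.58,
    "Jetson AGX Orin": 27.16,
    "RTX 4090": 121.38,
}


def pdp_ratio(platform, baseline="IMAX (28nm projection)"):
    """How many times larger the platform's PDP is than the baseline's."""
    return pdp_joules[platform] / pdp_joules[baseline]


for name in ("Jetson AGX Orin", "RTX 4090"):
    print(f"{name}: {pdp_ratio(name):.2f}x the IMAX PDP")
```

Both ratios round to the 2.35x and 10.48x stated in the abstract. Note that the comparison is TDP-based, so the power term for each platform is its rated thermal design power rather than measured draw.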

2606.19904 2026-06-19 cs.SI New submission

Toward Temporal Realism in City-Scale Crisis Response Simulation using LLM Agents

Anping Zhang, Yang Tan, Yuanbo Tang, Huaze Tang, Qiuhua Ye, Marta C. Gonzalez, Yang Li

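AI summary: To address the lack of temporal realism in LLM-based social simulation, the paper uses volunteering data from Shenzhen spanning the COVID-19 pandemic to build data-calibrated self-excitation and crisis-activation mechanisms that produce bursty timing, bringing agents' timing distributions closer to reality.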

Comments 11 pages, 7 figures

Abstract

Human collective participation is rarely steady in time: it is bursty, with short episodes of intense activity separated by long quiet intervals. In crisis response and community mobilization, predicting when people act matters as much as predicting whether they act. Such settings are increasingly modeled with LLM-based social simulators, yet these simulators are validated on whether each action is individually plausible, not on whether actions are timed as in reality. Their temporal realism, the degree to which simulated activity reproduces the bursty, heavy-tailed timing of real human systems, thus remains untested. We examine this gap using a multi-year, city-scale log of offline volunteering in Shenzhen that spans the COVID-19 pandemic. Empirically, we establish that bursty timing is common at individual and tracked-group levels, that it is largely endogenous and self-exciting, and that it is amplified by the pandemic rather than produced by daily activity cycles. A standard LLM-only simulator reproduces almost none of this timing: its synchronous schedule has no self-excitation channel, so agents act on a near-regular clock. Guided by these findings, we build a simulator in which a data-calibrated self-excitation channel and a crisis-period regime decide when each agent acts and query the LLM only at those moments, leaving it to decide which task to join and whether to commit. The LLM-only baseline yields no bursty agents (median burstiness $B=-0.14$); a single data-calibrated gate is then sufficient to lift per-agent timing above the burst threshold (median $B\approx0.37$) without degrading LLM content decisions. These results indicate that temporal realism in LLM-based crisis-response simulation is best achieved by decoupling when agents act, governed by an explicit self-excitation and crisis-activation mechanism, from what they do, governed by the LLM.
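The abstract reports a burstiness value $B$ without defining it. A widely used definition (Goh and Barabási) is $B = (\sigma - \mu)/(\sigma + \mu)$, where $\mu$ and $\sigma$ are the mean and standard deviation of inter-event times. The sketch below assumes that definition, which may differ from the paper's exact estimator (for example, a finite-sample correction); the function name and event times are hypothetical.

```python
from statistics import mean, pstdev


def burstiness(event_times):
    """B = (sigma - mu) / (sigma + mu) over inter-event gaps.
    B = -1 for a perfectly regular clock, about 0 for Poisson-like
    timing, and approaches 1 for highly bursty timing."""
    times = sorted(event_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mu = mean(gaps)
    sigma = pstdev(gaps)
    if mu + sigma == 0:
        raise ValueError("all events coincide")
    return (sigma - mu) / (sigma + mu)


regular = [0, 1, 2, 3, 4]                 # agent acting on a fixed clock
bursty = [0, 1, 2, 3, 103, 104, 105]      # two bursts separated by a long gap
print(burstiness(regular), burstiness(bursty))
```

Under this definition, an agent acting on a near-regular clock scores well below zero, consistent with the negative median reported for the LLM-only baseline.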

2606.19898 2026-06-19 cs.DB cs.IR New submission

Query-aware Routing for Filtered Approximate Nearest Neighbors Search

Qianqian Xiong, Mengxuan Zhang

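AI summary: Proposes a query-aware routing framework in which a lightweight ML model predicts each candidate method's recall and an offline benchmark table is used to select the best recall-QPS trade-off, achieving state-of-the-art performance on five unseen datasets.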

Comments 12 pages

Abstract

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall--QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
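The routing step described above can be sketched as follows. The abstract does not spell out how the "best recall-QPS trade-off" is chosen, so the rule used here (fastest method whose predicted recall meets a target, otherwise the method with the highest predicted recall) is an assumption, as are the method names, the numbers, and the function name `route`. Per-parameter-setting entries in the benchmark table are collapsed to one QPS value per method for brevity.

```python
def route(predicted_recall, benchmark_qps, target_recall):
    """predicted_recall: method -> recall predicted for this query by the model.
    benchmark_qps: method -> throughput measured offline.
    Returns the name of the method to run for this query."""
    eligible = [m for m, r in predicted_recall.items() if r >= target_recall]
    if eligible:
        # Among methods expected to be accurate enough, prefer the fastest.
        return max(eligible, key=lambda m: benchmark_qps[m])
    # No method is expected to reach the target: fall back to best recall.
    return max(predicted_recall, key=predicted_recall.get)


# Hypothetical per-query predictions and offline measurements.
predicted = {"pre_filter": 0.99, "post_filter": 0.80, "filtered_graph": 0.95}
qps = {"pre_filter": 500, "post_filter": 4000, "filtered_graph": 2500}
print(route(predicted, qps, target_recall=0.9))
```

Because the prediction is made per query, two queries against the same dataset and predicate type can be sent to different methods, which matches the observation that the best method varies by query.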

2606.19890 2026-06-19 cs.CY New submission

Open Weight AI Models Require Proportional Evaluation Approaches

Patricia Paskov, Christopher Rodriguez, Sunishchal Dev, Stephen Casper

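AI summary: Targeting the distinct risk factors of open-weight AI models (OWMs), the paper proposes four proportional evaluation approaches (PE1-PE4) and systematically reviews 37 OWM families released from 2025 through April 2026, finding that only one fulfills all four.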

Abstract

Open-weight AI models (OWMs), or models released with publicly available weights, are proliferating rapidly and approaching the performance levels of leading closed-weight AI models (CWMs). While OWMs offer substantial scientific and economic benefits, their release introduces distinct risk factors for which existing evaluation practices, largely designed for CWM deployment, fail to account. In this paper, we argue that these risk factors demand distinct proportional evaluation (PE) approaches: evaluating without system-level safeguards (PE1), assessing robustness to modifications that undo model-level safeguards (PE2), testing selective capability amplification (PE3), and proxying worst-case misuse (PE4). We systematically review current evaluation practices of OWMs released in 2025 through April 2026, finding that only one of the 37 families of models reviewed fulfills PE1-4 and most do not fulfill any. This paper targets policymakers, funders, and researchers involved in AI evaluation. As OWMs grow increasingly capable, their evaluation warrants close attention from developers, funders, and governance bodies alike.

2606.19869 2026-06-19 cs.DC New submission

EVM Workloads in the Wild: Evidence for Multi-Dimensional Gas Metering, State Growth, Delayed Execution, and Parallelism

Lioba Heimbach, Kushal Babel, Jason Milionis

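AI summary: By analysing 2025 block traces from Ethereum (L1) and Base (L2), the paper finds that the resource mix is unstable, state growth is underestimated, and execution outcomes are sensitive to historical state, providing an empirical foundation for multi-dimensional gas metering and explicit pricing of state growth.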

Abstract

Gas metering on EVM-compatible blockchains assumes that execution conditions are stable: that the resource mix is constant enough to justify collapsing execution costs into a single scalar with fixed relative prices, and that state drift between submission and execution does not materially alter a transaction's outcome. We measure the extent to which this assumption fails. We present a trace-level measurement study of EVM workloads on Ethereum (L1) and Base (L2) throughout 2025, sampling 3,000 blocks per day per chain. We decompose each transaction into opcode-level execution gas, intrinsic gas, refunds, and persistent state deltas. To measure state sensitivity, we re-execute transactions from September 2025 on older states and record how gas usage and storage access patterns change. We find the resource mix to be far from stable: on Base, storage reads and compute account for 29.2% and 24.3% of execution gas, while Ethereum devotes 34.9% to storage writes. Ethereum's gas limit doubling during 2025 shifted its own profile toward compute-heavier, Base-like patterns. Base also exhibits a higher fraction of cold storage reads (49.7% versus 39.6% on Ethereum). Persistent state growth, a permanent cost priced as a transient one, reaches 456 GB on Base versus 38 GB on Ethereum. Execution outcomes are equally unstable: gas estimates vary across nearby historical states for 46.0% of transactions on Base, compared to 13.9% on Ethereum, with especially high sensitivity for MEV and DeFi activity. Storage access patterns also diverge across states, limiting the effectiveness of access lists and complicating parallel execution. Our findings provide an empirical foundation for multi-dimensional gas metering and explicit pricing of state growth. They show that state-sensitive execution behavior complicates workload estimation, directly affecting transaction predictability and user experience.
