arXivDaily arXiv每日学术速递 周一至周五更新
2606.19686 2026-06-19 cs.PL 新提交

Effect Systems as Abstract Interpretations

效应系统作为抽象解释

Colin S. Gordon

AI总结 本文通过将效应量词嵌入抽象域,并从事件发生角度恢复效应量词的一般形式,建立了抽象解释与一般效应系统之间的形式关系。

Comments Draft short paper

详情
AI中文摘要

文献中已知多种关于程序行为的静态推理形式,但形式关系的研究却出奇地少。尽管大多数类型系统被广泛认为可以通过抽象解释来捕捉,但在一般情况下,类型-效应系统的情况尚未明确,尽管有强有力的假设和偶尔将效应系统视为抽象解释的框架。我们开发了抽象解释与一般效应系统之间的形式关系。首先,我们描述了将效应量词嵌入抽象域的方法。其次,我们从事件发生(而非状态或值)的角度,将效应量词的一般形式恢复为抽象解释。

英文摘要

Many forms of static reasoning about program behaviours are known in the literature, yet formal relationships are studied surprisingly infrequently. While most type systems are well-known to be captured by abstract interpretations, the situation for type-and-effect systems is, in the general case, unsettled despite strong hypotheses and occasional framing of effect systems as abstract interpretations. We develop a formal relationship between abstract interpretations and a general class of effect systems. First, we describe an embedding of effect quantales into abstract domains. Second, we recover the general form of an effect quantale as an abstract interpretation -- not on states or values, but on event occurrences.

2606.18941 2026-06-19 cs.PL cs.CL 新提交

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC:使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester(计算机科学,曼彻斯特大学) Electrical Engineering, Federal University of Amazonas (UFAM)(电气工程,亚马逊联邦大学(UFAM))

AI总结 针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题,提出基于DFS的图形LD解析器,将连接图转换为布尔触点合取,并采用三级I/O推断方案,成功实现完整GOTO IR转换,验证了3个图形LD程序。

Comments 18 pages

详情
AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式:一种使用<rung>元素的文本编码,另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式,但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示,导致空洞的验证成功。本文提出Graph-ESBMC-PLC,通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈,将梯形路径提取为布尔触点合取,并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序,确保SET线圈在RESET线圈之前处理,匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明,所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR,而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留,无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo(DantasCordeiro2026graphical,doi: https://doi.org/10.5281/zenodo.20699856)。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

2606.20517 2026-06-19 cs.AI cs.PL 交叉投稿

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Multi-LCB: 将 LiveCodeBench 扩展到多种编程语言

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

发表机构 * GigaCode Yandex School of Data Analysis, Applied AI Institute(Yandex数据分析学院,应用人工智能研究所)

AI总结 提出 Multi-LCB 基准,将 LiveCodeBench 的 Python 任务扩展到 12 种编程语言,评估 LLM 跨语言代码生成能力,发现 Python 过拟合和语言特定污染等问题。

Comments ICLR 2026

详情
AI中文摘要

LiveCodeBench (LCB) 最近已成为评估大型语言模型 (LLM) 在代码生成任务上的广泛采用的基准。通过策划竞争性编程问题、不断向集合中添加新问题并根据发布日期进行过滤,LCB 提供了污染感知的评估,并提供了编码能力的整体视图。然而,LCB 仍然局限于 Python,留下了 LLM 是否能够泛化到现实软件工程所需的各种编程语言的问题。我们引入了 Multi-LCB,这是一个跨十二种编程语言(包括 Python)评估 LLM 的基准。Multi-LCB 将 LCB 数据集中的 Python 任务转换为其他语言中的等效任务,同时保留 LCB 的污染控制和评估协议。由于它与原始 LCB 格式完全兼容,Multi-LCB 将自动跟踪未来的 LCB 更新,从而能够系统地评估跨语言代码生成能力,并要求模型在 Python 之外保持良好的性能。我们在 Multi-LCB 上评估了 24 个 LLM 的指令和推理能力,发现了 Python 过拟合、语言特定污染以及多语言性能显著差异的证据。我们的结果将 Multi-LCB 确立为多编程语言代码评估的严格新基准,直接解决了 LCB 的主要局限性,并揭示了当前 LLM 能力的关键差距。

英文摘要

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.

2606.20351 2026-06-19 cs.LO cs.PL 交叉投稿

A cubical formalisation of conditional independence, Bayesian conditioning, and Pearl's d-separation soundness

条件独立性、贝叶斯条件化和Pearl的d-分离正确性的立方形式化

Karen Sargsyan

AI总结 本文在Cubical Agda中形式化概率单子,提出一种高阶归纳类型,通过引入贝叶斯公式修正标准凸代数交换公理的不足,并验证了半图oid公理、do-演算规则和d-分离定理。

详情
AI中文摘要

自Stone以来概率单子形式化中常见的标准凸代数交换公理被证明不足以支持完整的贝叶斯条件化。我们在Cubical Agda中精确地说明了这一点:有限分布作为高阶归纳类型,条件独立性作为核之间的立方路径,递归贝叶斯条件化作为全支撑片段上的全函数。将条件化提升到完整的HIT暴露了一个结构上的不匹配——重新排列的四叶混合的两半携带由贝叶斯公式关联的不同贝叶斯权重,而不是标准公理提供的单个共享内部权重。我们展示了解决这一问题的最小推广,并证明了标准形式是退化情况,其中两个内部权重重合。基于这一观察,我们在抽象有序域接口之上以构造性方式验证了代数上下文,无需任何公设:绑定交换性、四个半图oid公理、交(通过结构性Σ-见证简化为收缩,无需正性)、Pearl的do-演算规则1、2和3的核形式、有限类型贝叶斯条件化,以及任意n顶点有限有向无环图(DAG)上干预和贝叶斯形式的Pearl的d-分离定理(正确性)。概率单子也被验证为马尔可夫范畴;抽象接口在Q处实现。

英文摘要

The standard convex-algebra interchange axiom, common to probability-monad formalisations since Stone, is provably too weak to support full Bayesian conditioning. We make this precise in Cubical Agda: finite distributions as a higher inductive type, conditional independence as a cubical path between kernels, recursive Bayesian conditioning as a total function on a full-support fragment. Lifting conditioning to the full HIT exposes a structural mismatch -- the two halves of the rearranged 4-leaf mix carry distinct Bayesian weights related by Bayes' formula, not the single shared inner weight the standard axiom provides. We exhibit the minimal generalisation that resolves this and prove the standard form is the degenerate case where the two inner weights coincide. Around this observation we verify the algebraic context constructively, with zero postulates above an abstract ordered-field interface: bind commutativity, the four semi-graphoid axioms, intersection (reduced to contraction via structural $Σ$-witnesses, without positivity), Pearl's do-calculus Rules~1, 2, and~3 in kernel form, finite-type Bayesian conditioning, and Pearl's d-separation theorem (soundness) on arbitrary $n$-vertex finite directed acyclic graphs (DAGs) in both interventional and Bayesian forms. The probability monad is also verified as a Markov category; the abstract interface discharges at $Q$.

2606.20134 2026-06-19 cs.LO cs.PL 交叉投稿

An MSO Framework for Weak-Memory Verification and Robustness

弱内存验证与鲁棒性的MSO框架

Giovanna Kobus Conrado, Andreas Pavlogiannis

AI总结 本文研究单子二阶逻辑作为弱内存元理论,证明顺序一致性执行有界树宽而TSO无界,展示多种模型可MSO公理化,并引入读自鲁棒性概念,实现统一验证算法。

Comments Accepted at CONCUR 2026

详情
AI中文摘要

内存模型是并发程序执行的形式化规范,解释了编译器和架构优化引入的弱行为。其数量和复杂性的增加促使人们通过在适当的元理论中公理化模型来统一验证整个模型类别。本文正式研究单子二阶逻辑(MSO)作为弱内存的元理论,通过证明各种流行弱内存模型的树宽和MSO可表达性结果,使得我们能够统一处理多个验证问题。总结如下:首先,我们证明顺序一致性($\mathsf{SC}$)下的执行具有有界树宽,而总存储顺序($\mathsf{TSO}$)下的执行则无界。其次,我们证明包括Release/Acquire和完整RC20在内的广泛模型是MSO可公理化的,而其他模型如Strong Release/Acquire和$\mathsf{TSO}$则不可,除非正交向量问题(在SETH下需要二次时间)可以在线性时间内解决。最后,我们引入读自鲁棒性概念,作为对近期粗粒度鲁棒性准则工作的扩展。我们证明树宽界限(上界和下界)对任何MSO可公理化模型$\mathsf{MM}$具有深远的算法意义:存在一个算法,对于每个程序$\mathsf{P}$,要么验证$\mathsf{P}$在$\mathsf{MM}$下的正确性,要么报告$\mathsf{P}$对$\mathsf{MM}$不是读自鲁棒的。总体而言,我们的结果为弱内存验证和鲁棒性建立了一个丰富且多功能的理论框架。

英文摘要

Memory models are formal specifications of concurrent-program executions, accounting for weak behaviors introduced by compiler and architectural optimizations. The increase of their number and complexity has spawned efforts for uniform verification across whole classes of models, by axiomatizing the models in an adequate metatheory that admits a uniform treatment. In this work, we formally study Monadic Second-Order logic (MSO) as a metatheory for weak memory, by proving results on the treewidth and MSO-expressibility of various popular weak-memory models, as this combination allows us to uniformly tackle several verification problems. In summary, our results are as follows. First, we prove that executions under Sequential Consistency ($\mathsf{SC}$) have bounded treewidth, while already those under Total Store Order ($\mathsf{TSO}$) do not. Second, we prove that a broad range of models, including Release/Acquire and the full RC20, are MSO-axiomatizable, while others, such as Strong Release/Acquire and $\mathsf{TSO}$, are not, unless the Orthogonal Vectors problem $\unicode{x2013}$ which requires quadratic time under SETH $\unicode{x2013}$ can be solved in linear time. Finally, we introduce the notion of reads-from robustness, as an extension to recent work on coarse robustness criteria. We show that our treewidth bounds (both upper and lower) have far-reaching algorithmic implications for any of our MSO-axiomatizable models $\mathsf{MM}$: there is an algorithm that, for every program $\mathsf{P}$, either verifies $\mathsf{P}$ under $\mathsf{MM}$ or reports that $\mathsf{P}$ is not reads-from robust against $\mathsf{MM}$. Overall, our results establish a rich and versatile theoretical framework for weak-memory verification and robustness.

2606.19409 2026-06-19 cs.SE cs.PL 交叉投稿

OpenRath: Session-Centered Runtime State for Agent Systems

OpenRath: 面向会话的代理系统运行时状态

Fukang Wen, Zhijie Wang, Ruilin Xu

AI总结 针对代理系统运行时状态碎片化问题,提出以Session为核心的一等运行时抽象,支持分支、检查、重放、后端感知和组合,使fork、merge和replay成为显式运行时操作。

详情
AI中文摘要

现代代理系统常常遭受碎片化的运行时状态:对话记录、工具效果、内存事件、工作区放置、分支来源和重放证据被分别记录,难以检查或重现。OpenRath通过一个类似PyTorch的编程模型来解决这个问题,适用于多代理、多会话系统。这里的类比涉及中心一等运行时抽象的角色,而非张量计算。其核心抽象是Session,即在代理和工作流之间传递的运行时值。Session是可分支、可检查、可重放、后端感知且可组合的。它记录对话片段、沙箱放置、谱系元数据、令牌使用、待处理工作和工具证据,同时定义内存交互进入运行时记录的位置。由于此状态由程序执行中使用的同一值携带,fork、merge和replay成为显式的运行时操作,而非从外部痕迹重建的状态。OpenRath进一步定义了Sandbox、Tool、Agent、Memory、Workflow和Selector,其中Selector将控制流转化为运行时路由的决策。本报告介绍了编程模型、架构、审计里程碑和证据协议。其主张仅限于受控的运行时属性,而广泛的定量比较、实时提供者质量、可选后端可用性和内存质量留待后续评估。核心论点是Session为代理系统提供了一个一等运行时值,用于可审计的组合。

英文摘要

Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.

2606.19399 2026-06-19 cs.LG cs.AI cs.LO cs.PL 交叉投稿

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

VERITAS:验证器引导的零样本形式定理证明搜索

Manish Acharya, Zhenyu Liao, Yueke Zhang, Kevin Leach, Yu Huang, Yifan Zhang

发表机构 * Department of Computer Science, Vanderbilt University(范德堡大学计算机科学系) Amazon(亚马逊)

AI总结 提出VERITAS框架,通过两阶段协议(Best-of-N采样+批评引导MCTS)利用验证器反馈进行零样本定理证明,在miniF2F上达40.6%准确率,并发布组合学基准VERITAS-CombiBench。

详情
AI中文摘要

基于LLM的形式化证明器通常将丰富的验证器信号(语法错误、类型不匹配、部分目标进展)压缩为二进制的通过/失败位。我们提出VERITAS,一个零样本框架,通过两阶段协议将每个验证器信号路由回证明搜索:首先进行Best-of-N采样,然后进行批评引导的MCTS遍历,该遍历将第一阶段失败作为显式负例吸收。该协议保留其第一阶段扫描解决的每个定理,因此第二阶段额外的解决可归因于反馈驱动的探索。VERITAS在miniF2F上达到40.6%(相比之下,独立运行的Best-of-5为36.9%,Portfolio为26.2%),在VERITAS-CombiBench上达到7.3%,这是一个我们发布的55个定理的组合学基准,在该基准上Best-of-5(1.8%)低于Portfolio(3.6%),暴露了当必须从验证器反馈中迭代恢复正确的引理名称时,无指导的采样会带来损害。工件可在GitHub上获取。

英文摘要

LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.

2606.19347 2026-06-19 cs.CL cs.AI cs.PL 交叉投稿

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

LLM在硬件设计的RTL编码中如何失败与泛化?

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

发表机构 * NVIDIA Research(英伟达研究院)

AI总结 提出基于问题可解性的错误分类法,揭示LLM在RTL编码中受限于预训练知识,对齐技术仅教会编译,而推理能力才是关键瓶颈。

Comments Preview, under submission for EMNLP 2026

详情
AI中文摘要

将顺序编程先验转换为硬件设计的并行时序逻辑仍然是大型语言模型(LLM)的关键瓶颈。为了研究这一点,我们引入了一种新的错误分类法,该分类法基于问题可解性,受认知理论启发。我们的分类法将失败分为语法、语义、可解功能和不可解功能类型。评估揭示了VerilogEval基准上的严格经验上限,前沿模型初始通过率稳定在90.8%。这些平台期由不可解的功能错误定义,暴露出对测试时计算扩展免疫的持续知识差距。此外,我们揭示了一个显著的表面收敛差距:优化容易消除语法错误,但同时加剧了更深层次的功能失败。我们的发现表明,对齐技术仅仅教会模型编译。虽然重复采样策略可以修补可解错误,但寄存器传输级(RTL)编码能力仍然严格受限于预训练知识。解决当前基于LLM的硬件生成流水线中的挑战需要更多关于模型推理的研究,而不是对齐干预。

英文摘要

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models(LLM). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level(RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

2601.22978 2026-06-19 cs.CR cs.PL 版本更新

Triosecuris: Formally Verified Protection Against Speculative Control-Flow Hijacking

Triosecuris:针对推测控制流劫持的形式化验证防御

Jonathan Baumann, Yonghyun Kim, Yan Farba, Catalin Hritcu, Julay Leatherman-Brooks

AI总结 提出Triosecuris,结合CET风格硬件辅助控制流完整性与编译器插入的推测加载硬化,通过形式化证明实现相对安全性,确保任意程序在推测执行下不泄露比源程序无推测时更多的信息。

Comments To appear at CSF'26; extended version with appendices. W.r.t. first revision: extended with concrete protection against Spectre RSB and renamed to Triosecuris

Journal ref 39th IEEE Computer Security Foundations Symposium (CSF) (2026) 544-559

详情
AI中文摘要

本文介绍了Triosecuris,一种针对Spectre BTB、RSB和PHT的形式化验证防御,它结合了CET风格的硬件辅助控制流完整性与编译器插入的推测加载硬化(SLH)。Triosecuris基于一个新颖的观察:在CET风格保护存在的情况下,我们可以精确检测间接调用的BTB误推测和返回的RSB误推测,并设置SLH误推测标志。我们在Rocq中将Triosecuris形式化为一种变换,并提供了机器检查的证明,表明它实现了相对安全性:任何经过变换的程序在推测执行下泄露的信息不超过源程序在无推测执行下泄露的信息。这一强安全保证适用于任意程序,即使那些不遵循密码学常数时间编程规范的程序。

英文摘要

This paper introduces Triosecuris, a formally verified defense against Spectre BTB, RSB, and PHT that combines CET-style hardware-assisted control-flow integrity with compiler-inserted speculative load hardening (SLH). Triosecuris is based on the novel observation that in the presence of CET-style protection, we can precisely detect BTB misspeculation for indirect calls and RSB misspeculation for returns and set the SLH misspeculation flag. We formalize Triosecuris as a transformation in Rocq and provide a machine-checked proof that it achieves relative security: any transformed program running with speculation leaks no more than what the source program leaks without speculation. This strong security guarantee applies to arbitrary programs, even those not following the cryptographic constant-time programming discipline.