Large Model Reasoning - arXivDaily Topic

2606.20227 2026-06-19 cs.AI cs.SE New submission 95%

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette, Kailong Wang

Affiliations: Huazhong University of Science and Technology; Nanyang Technological University; Hubei University; East China Normal University; National University of Singapore

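Topic match (logical reasoning): Proposes the QMFOL framework, which generates reasoning tasks from first-order logic to evaluate the logical reasoning ability of LLMs.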

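AI summary: Proposes the QMFOL framework, which generates monadic first-order logic reasoning tasks from conjunction/disjunction patterns with controllable complexity, and builds QMFOLBench, a benchmark of 2880 instances. Evaluation shows that increasing logical complexity lowers performance and raises computational overhead.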

Abstract

Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.
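As a rough illustration of how depth, width, and conjunction/disjunction patterns can determine a True/False/Unknown label, the sketch below builds a layered set of monadic rules about a single individual and labels a query by brute-force enumeration. This is not the QMFOL generator: the function names (`build_task`, `holds`, `label`), the predicate naming scheme, and the use of exhaustive truth-table checking in place of an external prover are all assumptions made for the example.

```python
from itertools import product

def build_task(depth, width, mode="and"):
    """Hypothetical layered task: every predicate in layer i+1 is implied by
    the conjunction ("and") or disjunction ("or") of all predicates in layer i."""
    layers = [[f"P{i}_{j}" for j in range(width)] for i in range(depth + 1)]
    # "and" bodies need every layer-0 fact; "or" bodies need only one.
    facts = list(layers[0]) if mode == "and" else [layers[0][0]]
    rules = [(mode, layers[i], head)
             for i in range(depth) for head in layers[i + 1]]
    atoms = [p for layer in layers for p in layer]
    return facts, rules, atoms

def holds(rule, world):
    """Check one rule 'forall x. body(x) -> head(x)', instantiated at a single individual."""
    mode, body, head = rule
    body_true = (all(world[p] for p in body) if mode == "and"
                 else any(world[p] for p in body))
    return (not body_true) or world[head]

def label(facts, rules, atoms, query, negated=False):
    """Label a ground query by enumerating every truth assignment that
    satisfies the premises (a stand-in for an external prover)."""
    atoms = sorted(set(atoms) | {query})
    outcomes = set()
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(world[f] for f in facts) and all(holds(r, world) for r in rules):
            outcomes.add(world[query] != negated)
    if outcomes == {True}:
        return "True"
    if outcomes == {False}:
        return "False"
    return "Unknown"

facts, rules, atoms = build_task(depth=3, width=2, mode="and")
print(label(facts, rules, atoms, "P3_0"))                # entailed
print(label(facts, rules, atoms, "P3_0", negated=True))  # contradicted
print(label(facts, rules, atoms, "Distractor"))          # undetermined
```

Enumeration is enough here only because the rules are universally quantified implications queried about one named individual, so the check reduces to propositional reasoning over that individual's predicates. It grows exponentially with depth times width, so a real pipeline would presumably rely on a prover instead.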

2606.20526 2026-06-19 cs.AI New submission 70%

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

Saimun Habib, Vaishak Belle, Fengxiang He

Affiliations: University of Edinburgh

Topic match (logical reasoning): Counterfactual reasoning for neural probabilistic logic programs.

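AI summary: Proposes DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs that performs exact counterfactual inference via neural materialization, SWIPs, and weighted model counting. Experiments show a 2.14× inference speedup over the twin-network approach.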

Abstract

Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs. Using neural materialization, we reduce fixed-context neural predicates to ordinary ProbLog choices, apply Single World Intervention Programs (SWIPs), and compute counterfactuals by weighted model counting (WMC) over a single transformed program. Under finite grounding and unique-supported-model assumptions, DeepSWIP is exact relative to the learned materialized FCM. The standard quotient-WMC form of ProbLog conditionals identifies active neural probabilities and explains intervention cleaning, calibration sensitivity, and rare-evidence instability. Experiments on MPI3D confirm the transformation against a DeepTwin construction on 12,000 queries and, as predicted, show a 2.14× inference speedup from avoiding the Twin's endogenous duplication. A SUMO HOV experiment shows that neural calibration degradation biases plug-in estimates, while a correctly scoped randomized-policy AIPW estimator removes most first-order bias for population mean and ATE estimands. Code is at https://github.com/saibib/deep_SWIP.
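The quotient-WMC form mentioned in the abstract writes a conditional probability as a ratio of two weighted model counts, P(q | e) = WMC(q and e) / WMC(e). The sketch below shows only that ratio on a toy two-fact program. It is not the DeepSWIP transformation and does not use the ProbLog library: the helper names (`wmc`, `conditional`), the burglary/earthquake facts, and their weights are assumptions chosen for illustration, and the count is done by enumerating worlds rather than by knowledge compilation.

```python
from itertools import product

def wmc(weights, formula):
    """Weighted model count by brute force: sum the weight of every
    truth assignment to the probabilistic facts that satisfies `formula`."""
    names = list(weights)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            w = 1.0
            for n in names:
                w *= weights[n] if world[n] else 1.0 - weights[n]
            total += w
    return total

def conditional(weights, query, evidence):
    """Quotient form: P(query | evidence) = WMC(query and evidence) / WMC(evidence)."""
    denom = wmc(weights, evidence)
    if denom == 0.0:
        raise ValueError("evidence has zero weight; the quotient is undefined")
    return wmc(weights, lambda w: query(w) and evidence(w)) / denom

# Toy program (hypothetical): two independent probabilistic facts and
# one derived atom, alarm :- burglary.  alarm :- earthquake.
weights = {"burglary": 0.1, "earthquake": 0.2}
alarm = lambda w: w["burglary"] or w["earthquake"]

p_alarm = wmc(weights, alarm)
p_burglary_given_alarm = conditional(weights, lambda w: w["burglary"], alarm)
print(p_alarm, p_burglary_given_alarm)
```

Because the evidence count sits in the denominator, a small WMC(e) magnifies any error in the fact weights, which is one way to read the abstract's remark about rare-evidence instability.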
