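AI summary: Proposes an unsupervised deep image prior method that achieves electron tomography reconstruction performance comparable to supervised methods under sparse-view and limited-angle conditions, and validates its reliability on experimental data.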

Comments 22 pages, 12 figures

2605.27135 2026-05-27 cs.CR cs.CV

Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?

Enoal Gesny, Eva Giboulot

Affiliations: Inria

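AI summary: Through a fair comparison of the robustness and security of modern and classical post-hoc watermarking methods under a range of attacks, this paper finds that the classical methods perform better in realistic scenarios.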

2605.27131 2026-05-27 cs.ET cs.AI cs.DB

Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice

Oliver Angélil, Jan Migon

Affiliations: ishango.ai, Zurich, Switzerland; Independent Researcher

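AI summary: To address the tension between domain self-service and overall governance in enterprise data platforms, proposes an AI-augmented hub-and-spoke model built on a modern lakehouse architecture. A central center of excellence provides shared services and AI governance while domain teams gradually take on more responsibility, balancing flexibility with control. The architecture is evaluated with three metrics: data product adoption rate, time-to-find, and time-to-insight.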

Comments 11 pages, 5 figures

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

Muhammad A. A. Pirzada, Weiqi Wang, Yiannis Charalambous, Konstantin Korovin, Lucas C. Cordeiro

Affiliations: The University of Manchester

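AI summary: Proposes ConVer, a top-down compositional verification tool that uses a large language model to synthesise function contracts and iteratively refines them in a CEGAR-CEGIS loop, to address state-space explosion in the formal verification of large C programs.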

Comments 12 pages; 6 figures

English abstract

Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire state space up to the predetermined bound by unrolling all nested constructs. We present ConVer, a top-down compositional verification tool. Given a C program with a top-level assertion, ConVer decomposes verification top-down: it uses a large language model (LLM) to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts whenever a check fails via SMART ICE learning. We evaluate ConVer on four benchmark suites of increasing difficulty and against other state-of-the-art (SOTA) tools. On the Frama-C benchmark of 45 simple C programs, ConVer achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration. On the X.509 parser benchmark (6 programs) and LF2C-Simple suite (17 programs), ConVer achieves 33-50% and 82-88% success respectively. On the VerifyThis suite of 11 recursive and loop-intensive programs, the Pre-Abstraction strategy achieves 55-64% success. In addition, we present ESBMC-LF, a preprocessor tool that converts LF models to C while preserving the properties of the LF files, enabling ConVer to verify them. We transpile the LF Verifier Benchmarks to C using ESBMC-LF; we denote those LF-Hard. We show that ConVer successfully verifies 67% of LF-Hard benchmarks overall.
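The alternation between system-level and function-level checks described in this abstract can be illustrated with a toy counterexample-guided loop. This is only a sketch under stated assumptions, not ConVer's algorithm: the callee `clamp`, the interval-shaped contract `(lo, hi)`, the caller property `1 <= clamp(x) + 1 <= 11`, and every function name below are hypothetical, the initial contract is a stand-in for an LLM guess, and the "verifier" is a brute-force scan over a bounded input domain rather than a model checker.

```python
def clamp(x):
    """Hypothetical callee whose return-value contract we want to find."""
    return max(0, min(x, 10))


def system_check(lo, hi):
    """System-level check: assume the contract lo <= ret <= hi and ask whether
    the caller's assertion 1 <= ret + 1 <= 11 follows from it."""
    return lo >= 0 and hi <= 10


def function_check(lo, hi, domain):
    """Function-level check: return a counterexample return value that
    violates the contract, or None if the callee satisfies it on `domain`."""
    for x in domain:
        ret = clamp(x)
        if not (lo <= ret <= hi):
            return ret
    return None


def refine_contract(initial, domain, max_iters=50):
    """Alternate both checks, widening the contract on each counterexample.
    Returns (contract, iterations), with contract None if the loop gives up."""
    lo, hi = initial
    for iteration in range(1, max_iters + 1):
        if not system_check(lo, hi):
            # Contract is too weak to prove the caller's property; this toy
            # simply stops here instead of attempting a repair.
            return None, iteration
        cex = function_check(lo, hi, domain)
        if cex is None:
            return (lo, hi), iteration
        # Learn from the counterexample: the observed value must be allowed.
        lo, hi = min(lo, cex), max(hi, cex)
    return None, max_iters


if __name__ == "__main__":
    print(refine_contract((3, 5), range(-20, 21)))
```

In this toy, a too-narrow initial guess is widened one counterexample at a time until both checks pass, while a guess that is already too weak for the caller's property is rejected at the system level on the first iteration.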

2605.27043 2026-05-27 stat.ML cs.LG stat.ME

Causal Representation Learning for Generalisable Recommendation

Yorgos Felekis, Michael O'Riordan, Oriol Corcoll, Ciarán M. Gilligan-Lee

Affiliations: University of Warwick; Spotify; University College London

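AI summary: To address the generalisation problem caused by the mismatch between training and deployment distributions in recommender systems, proposes an information-theoretic disentanglement criterion based on causal representation learning, together with a tractable variational lower bound. The method improves generalisation under distribution shift using only confounded logs, and is validated in a Spotify A/B test, on the KuaiRand dataset, and on a synthetic benchmark.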

English abstract

Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.

2605.27042 2026-05-27 cs.CR cs.AI

Lessons from Penetration Tests on Large-Scale Agent Systems

Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang, Frederico Araujo, Ian Molloy

Affiliations: IBM Research

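AI summary: Drawing on two penetration tests conducted in 2025 against proprietary agent products, this paper evaluates whether the security posture of AI agents has improved, and notes that many vulnerabilities are not new but reflect recurring classes of weaknesses long observed in prior computing systems.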

Comments Accepted at SAGAI 2026

English abstract

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems -- developed under stricter coding standards and formal review processes -- exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.

2605.27039 2026-05-27 eess.AS cs.SD

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

Affiliations: The University of Melbourne; KAIST; The University of Auckland; UNSW Sydney

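AI summary: By introducing the EnvMem benchmark, this paper finds that the main cause of large audio language models failing to retain non-speech information across multi-turn interactions is representational trajectory drift rather than insufficient attention allocation.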

English abstract

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.

2605.27014 2026-05-27 cs.LO cs.AI

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

Adnan Rashid

Affiliations: School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology

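AI summary: Proposes ReasonOps, a unified paradigm that treats reasoning as a continuously monitored, verifiable, reliability-aware operational process. It integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive repair to address problems in current LLM reasoning such as logical inconsistencies and hallucinated symbolic transformations.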

Comments 5 Pages

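AI abstract (translated from Chinese)

Large language models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents. Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress in machine-assisted formal reasoning. However, current reasoning systems still suffer from hidden logical inconsistencies, hallucinated symbolic transformations, unsupported theorem applications, and limited reliability guarantees. Existing approaches remain fragmented across the formal verification, runtime assurance, neuro-symbolic reasoning, and trustworthy artificial intelligence (AI) research communities. This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems. Inspired by operational ecosystems such as DevOps and MLOps, ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task. The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive repair into a unified reasoning lifecycle. The paper further presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems. We argue that operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.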

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

Viktor Kjellberg, Farnaz Fotrousi, Miroslaw Staron

Affiliations: University of Gothenburg and Chalmers University of Technology

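AI summary: Experimentally compares four prompting strategies (instruction, binary automated feedback, detailed automated feedback, and few-shot detailed feedback), evaluating the ability of 13 LLMs to generate code that follows the Singleton pattern across 164 Java coding challenges. Iterative binary feedback achieved the best Singleton-pattern alignment while maintaining or improving functionality.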

Comments Accepted at PROMISE 2026

DOI: 10.1145/3803846.3807469

Particle-Lund Multimodality in Jet Taggers

Loukas Gouskos, Benedikt Maier

Affiliations: Brown University; Imperial College of Science, Technology and Medicine

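AI summary: Proposes PLuM, a multimodal architecture that jointly processes particle constituents and Lund plane splittings, using cross-attention to study whether explicit hierarchical QCD structure complements raw particle representations. It finds systematic gains for top-quark and H→bb tagging, with 25% higher background rejection in the HH(4b) analysis.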

English abstract

The Lund plane offers a physics-motivated, hierarchical representation of QCD radiation within jets, while transformer-based taggers have reached state-of-the-art performance by learning directly from raw particle constituents and their pairwise relations. We investigate whether transformers implicitly capture hierarchical QCD structure from constituent-level inputs, or whether explicit physics representations remain complementary. To test this, we introduce PLuM, a multimodal architecture that projects particle constituents and Lund plane splittings into a shared latent space, processing both jointly with a unified transformer. Cross-attention allows the model to probe whether structured QCD information provides discriminating power beyond what particles alone encode. We observe systematic gains for top-quark and $\mathrm{H}\to\mathrm{b}\bar{\mathrm{b}}$ tagging, while finding no comparable improvement for $\mathrm{H}\to\mathrm{c}\bar{\mathrm{c}}$ or $\mathrm{H}\to 4\mathrm{q}$ topologies. This selective enhancement suggests that explicit hierarchical information about b-jet formation remains complementary to raw particle representations even in highly expressive architectures, while other topologies are already well-captured at constituent level. For high-impact LHC analyses such as Lorentz-boosted di-Higgs searches in the four $\mathrm{b}$ quark final state ($\mathrm{H}\mathrm{H}(4\mathrm{b})$), the gains are substantial: at a $25\%$ di-Higgs efficiency working point, PLuM achieves $25\%$ higher background rejection than the baseline. Our results indicate that physically structured representations of QCD radiation retain discriminating value in the transformer era, motivating further study into how different aspects of jet dynamics are encoded by deep learning algorithms.
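For readers unfamiliar with the Lund plane, the sketch below computes the coordinates commonly assigned to a single splitting, following the standard definition in which a branching into two subjets is placed at $(\ln(1/\Delta), \ln k_t)$ with $\Delta$ the rapidity-azimuth separation and $k_t = p_{T,\text{soft}}\,\Delta$. It is an illustration only: the function name and argument layout are hypothetical, and nothing here is taken from the PLuM implementation or its input features.

```python
import math


def lund_coordinates(pt_a, y_a, phi_a, pt_b, y_b, phi_b):
    """Return (ln(1/Delta), ln(kt), z) for one splitting into subjets a and b.

    Each subjet is given by transverse momentum, rapidity, and azimuth.
    """
    # Wrap the azimuthal difference into [-pi, pi).
    dphi = (phi_a - phi_b + math.pi) % (2 * math.pi) - math.pi
    delta = math.hypot(y_a - y_b, dphi)

    # The softer branch defines the emission.
    pt_soft = min(pt_a, pt_b)
    kt = pt_soft * delta
    z = pt_soft / (pt_a + pt_b)

    return math.log(1.0 / delta), math.log(kt), z


if __name__ == "__main__":
    print(lund_coordinates(100.0, 0.0, 0.0, 20.0, 0.3, 0.4))
```

A full Lund plane for a jet is usually built by reclustering its constituents with the Cambridge/Aachen algorithm and applying this calculation at each step while following the harder branch; that declustering step is omitted here.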

Efficient Learning of Mesh-Based Physical Simulation with BSMS-GNN

Towards Interpretable Federated Learning

Continual Model-Based Reinforcement Learning with Hypernetworks

Reformulation of RBM to Unify Linear and Nonlinear Dimensionality Reduction

Algorithmic Monocultures in Hiring

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

Governed Evolution of Agent Runtimes through Executable Operational Cognition

Many Logics, One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

TWIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins

Unsupervised Deep Image Prior for Sparse-View and Limited-Angle Electron Tomography

Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?

Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice

BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning

Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix

Cost of Structural Learning Under Censored Feedback: A Threshold-Bandit Approach

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

Causal Representation Learning for Generalisable Recommendation

Lessons from Penetration Tests on Large-Scale Agent Systems

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

Constrained Bayesian Experimental Design via Online Planning

Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

Adaptive Reinforcement Learning for Robust Open Quantum System Control: A Multi-Task Framework with Temporal Optimization

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

Parsimonious Learning-Augmented Online Metric Matching

Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study

The Sensation Modulating Network: Haltability as the architectural ground for object-directed phenomenology

Particle-Lund Multimodality in Jet Taggers