arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.06377 2026-05-08 cs.GT cs.LG cs.MA

Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

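Philip Jordan, Maryam Kamgarpour

Affiliations: SYCAMORE, EPFL

AI summary: This paper studies Nash equilibrium learning in partially observable Markov games (POMGs) and proposes an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium with quasi-polynomial sample and computational complexity.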

Abstract

We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action-observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near-potential, leading to quasi-polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.

2605.06373 2026-05-08 stat.ML cs.LG

Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $τ$-Mixing

Leon Halgryn, Sophie Langer, Janusz M. Meylahn, E. Moritz Hahn

Affiliations: Department of Applied Mathematics; Faculty of Mathematics; University of Twente; Ruhr-Universität Bochum

AI summary: This paper derives finite-sample guarantees for deep Q-learning under τ-mixing. Modelling the minibatches used for updates as τ-mixing, it obtains risk bounds for dependent data and shows how temporal dependence degrades the statistical rate.

Comments 48 pages total. 6 figures; 3 tables

Abstract

Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the deep Q-network (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $τ$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $τ$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $τ$-mixing data. Moreover, we derive the sample complexity of DQN under $τ$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.

2605.06367 2026-05-08 stat.ML cond-mat.dis-nn cs.LG

The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

Flavio Nicoletti, Chenxiao Ma, Enrico Ventura, Luca Saglietti, Stefano Sarao Mannelli

Affiliations: Data Science and AI, Computer Science and Engineering Department, Chalmers University of Technology and University of Gothenburg; International School of Advanced Studies (SISSA), Trieste, Italy; Department of Computing Sciences, Bocconi University, Milano, Italy; Institute for Data Science and Analytics, Bocconi University, Milano, Italy; School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa

AI summary: The study examines how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models. Class variance is the primary determinant of learning order, sampling imbalance can reverse that order, and minority classes are learned with a delay.

Abstract

Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models, and potentially exacerbate disparities, remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order, consistently favoring higher-variance classes, while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.

2605.06359 2026-05-08 eess.SP cs.CV

The frame-level leakage trap: rethinking evaluation protocols for intrinsic image decomposition, with source-separable uncertainty as a case study

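Jihwan Woo

Affiliations: Amazon Web Services (AWS)

AI summary: The paper shows that evaluation protocols for learned intrinsic image decomposition on MPI Sintel are inconsistent, quantifies the frame-level leakage effect, advocates scene-level splits as the evaluation standard, and demonstrates source-separable uncertainty as a case study.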

Comments Submitted to Journal of Electronic Imaging. 25 pages, 10 figures. Addresses evaluation protocol issues in intrinsic image decomposition and proposes source-separable uncertainty estimation

Abstract

Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both train and test partitions. We quantify this leakage effect for the first time, across three architectures: a frame-level split inflates test R_PSNR by 1.6 to 2.0 dB (p < 0.01 for all three, paired t-test across 3 seeds) relative to a scene-level split, confirming an architecture-independent protocol effect. A three-point gradient (random/temporal/scene) shows the gap is continuous, and under extended training the frame-level inflation exceeds 10 dB. We advocate scene-level splits as the community standard and provide reference numbers for six representative models under this protocol. As a case study within the corrected protocol, we present a physics-informed decomposition $I = R \circ S + N$ with a source-separable three-way heteroscedastic uncertainty head. We empirically verify channel specialization: the non-Lambertian uncertainty channel shows r = 0.67 cross-correlation with non-Lambertian residual error, more than 4 times the texture channel's correlation. We further demonstrate downstream utility: filtering out the 75% highest-uncertainty pixels reduces reconstruction MSE by 77% on retained pixels, whereas random filtering produces no improvement. The specialization also holds on out-of-distribution real photographs. We report negative results for a more elaborate variant combining frequency decomposition, cross-task supervision, evidential learning, contrastive loss, and test-time adaptation. Our method reaches 15.98 ± 0.41 dB R_PSNR, within 0.8 dB of a 5-member Deep Ensemble at one-fifth the cost, with the unique capability of source-separated uncertainty.
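The difference between the two protocols can be illustrated with a minimal Python sketch. It is not the paper's code: the function names, the `(scene_id, frame_id)` representation, the 20% test fraction, and the toy scene names are all assumptions made for illustration. The scene-level split assigns whole scenes to one partition, while the frame-level split shuffles individual frames and therefore lets one scene appear on both sides.

```python
import random
from collections import defaultdict


def scene_level_split(frames, test_fraction=0.2, seed=0):
    """Split (scene_id, frame_id) pairs so that no scene spans both partitions."""
    by_scene = defaultdict(list)
    for scene_id, frame_id in frames:
        by_scene[scene_id].append((scene_id, frame_id))
    scenes = sorted(by_scene)
    random.Random(seed).shuffle(scenes)
    n_test = max(1, round(test_fraction * len(scenes)))
    test_scenes = set(scenes[:n_test])
    train = [f for s in scenes if s not in test_scenes for f in by_scene[s]]
    test = [f for s in scenes if s in test_scenes for f in by_scene[s]]
    return train, test


def frame_level_split(frames, test_fraction=0.2, seed=0):
    """Leaky baseline: shuffle individual frames, ignoring scene membership."""
    shuffled = list(frames)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def shared_scenes(train, test):
    """Scenes that contribute frames to both partitions (the leakage)."""
    return {s for s, _ in train} & {s for s, _ in test}


# Toy data: 5 hypothetical scenes with 10 frames each.
frames = [(f"scene_{s}", i) for s in range(5) for i in range(10)]
train, test = scene_level_split(frames)
```

Calling `shared_scenes(train, test)` on the scene-level split returns an empty set by construction; running the same check on the output of `frame_level_split` shows which scenes leak across partitions.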

2605.06347 2026-05-08 cs.HC cs.AI

Human-AI Co-Evolution and Epistemic Collapse: A Dynamical Systems Perspective

Xuening Wu, Yanlan Kang, Qianya Xu, Kexuan Xie, Jiaqi Mi, Honggang Wang, Yubin Liu, Zeping Chen

Affiliations: Fudan University; Tongji University; Shanghai Jiao Tong University; The University of Hong Kong; University of California San Diego; Nanjing University of Posts and Telecommunications

AI summary: From a dynamical-systems perspective, the paper treats humans and language models as a coupled system closed by a feedback loop, and uses a three-variable model to identify three regimes: co-evolutionary enhancement, fragile equilibrium, and degenerative convergence.

Comments 5 pages, 3 figures, ICML EIML Workshop submitted

Abstract

Large language models (LLMs) are reshaping how knowledge is produced, with increasing reliance on AI systems for generation, summarization, and reasoning. While prior work has studied cognitive offloading in humans and model collapse in recursive training, these effects are typically considered in isolation. We propose a unified perspective: humans and language models form a coupled dynamical system linked by a feedback loop of usage, generation, and retraining. We introduce a minimal model with three variables -- human cognition, data quality, and model capability -- and show that this feedback can give rise to distinct dynamical regimes. Our analysis identifies three regimes: co-evolutionary enhancement, fragile equilibrium, and degenerative convergence. Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium. From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression. These results suggest that the trajectory of AI systems is shaped not only by model design, but by the dynamics of human-AI co-evolution.

2605.06341 2026-05-08 cs.NE cs.AI math.OC

CoupleEvo: Evolving Heuristics for Coupled Optimization Problems Using Large Language Models

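Thomas Bömer, Bastian Amberg, Max Disselnmeyer, Anne Meyer

Affiliations: Karlsruhe Institute of Technology

AI summary: The paper proposes CoupleEvo, which evolves heuristics for coupled optimization problems using three evolutionary coordination strategies. Experiments show that the decomposition-based strategies converge more stably, while the integrated strategy suffers from higher search complexity.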

Comments accepted at GECCO 2026, San Jose, Costa Rica, Workshop

Abstract

Many real-world optimization problems consist of multiple tightly coupled subproblems whose solutions must be coordinated to achieve high overall performance. However, existing large language model driven automated heuristic design approaches are limited to single-problem settings. In this paper, we propose CoupleEvo, which introduces three evolutionary coordination strategies to evolve heuristics for coupled optimization problems: the sequential strategy evolves heuristics for one subproblem after the other; the iterative strategy alternates the evolution of heuristics for different subproblems over successive generations; and the integrated strategy evolves heuristics for all problems simultaneously. The approach is evaluated on two representative coupled optimization problems. Experimental results show that decomposition-based strategies (sequential and iterative) provide more stable convergence and higher solution quality, while the integrated evolution strategy suffers from increased search complexity and variability. These findings highlight the importance of coordinating evolutionary search across interdependent subproblems and demonstrate the potential of LLM-driven heuristic design for complex coupled optimization problems. The code is available at https://github.com/tb-git-kit-research/CoupleEvo.

2605.06340 2026-05-08 cs.CY cs.GT cs.LG

A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring

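Florian A. D. Burnat, Brittany I. Davidson

Affiliations: University of Bath

AI summary: The paper presents a benchmark for strategic gaming under continuous compliance monitoring. It models the auditor-auditee interaction as a Stackelberg game, identifies a structural feature of noise-aware static-auditor designs, and contributes a reproducible simulator to support empirical study.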

Abstract

Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work. Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions. We formalize continuous auditing as a $T$-round Stackelberg game between an auditor that commits to a temporal policy and an adaptive auditee, and identify a structural feature of any noise-aware static-auditor design: a cover regime in which coverage gaps and granularity gaps cannot be closed simultaneously. We make this formal as Observation 1 and show that two minimal extension policies, each derived from the observation, close the regime along orthogonal axes: a sample-size-aware static rule (Periodic-with-floor) closes the granularity-failure case, while a history-conditioned suspicion-escalation policy closes the coverage-failure case for the naive Drift strategy -- and neither closes both, exactly as the observation predicts; an audit-aware OffAuditDrift strategy that exploits Stackelberg commitment defeats both. To support empirical study we contribute a non-additive harm decomposition (welfare loss $W$, coverage loss $C$) that exposes how attrition shifts harm from the regulator-accountable surface to a regulator-invisible one; an initial library of five auditee strategies (Delay, Drift, Cherry-pick, Attrition, OffAuditDrift) and five auditor policies, calibrated to summary statistics from published audits of the DSA Transparency Database; and a reproducible simulator with a small, extensible Python interface.

2605.06330 2026-05-08 cs.CR cs.AI

Fine-Tuning Small Language Models for Solution-Oriented Windows Event Log Analysis

Siraaj Akhtar, Saad Khan, Simon Parkinson

Affiliations: School of Computing and Engineering, University of Huddersfield, Huddersfield, United Kingdom

AI summary: The paper studies fine-tuned small language models for event log analysis. On a synthetic dataset, they outperform LLMs at identifying issues and generating remediation while using fewer computational resources.

Comments 27 pages, 14 figures, 5 tables

Abstract

Large language models (LLMs) have shown promise for event log analysis, but their high computational requirements, reliance on cloud infrastructure, and security concerns limit practical deployment. In addition, most existing approaches focus only on the identification of the problem and do not provide actionable remediation. Small language models (SLMs) present a light-weight alternative that can be fine-tuned for a specific purpose and hosted locally. This paper investigates whether SLMs, when fine-tuned for a specific task, can serve as a practical alternative for event log analysis while also generating solutions. We first create a large-scale synthetic Windows event log dataset that contains remediation actions using a high-performing LLM. We then fine-tune multiple SLMs and LLMs using the LoRA parameter-efficient fine-tuning technique and evaluate their performance by comparing with expert assessment. The results show that the dataset accurately reflects real-world scenarios and that fine-tuned SLMs consistently outperform LLMs in identifying issues and providing relevant remediation, while requiring fewer computational resources.

2605.06324 2026-05-08 cs.CR cs.CY cs.LG

Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation

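Florian A. D. Burnat, Brittany I. Davidson

Affiliations: University of Bath

AI summary: The paper asks when a metric used as compliance evidence in safety audits still reflects a genuine reduction in harm. It proposes the semantic-envelope metric and shows that it withstands strategic manipulation by platforms.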

Abstract

Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hatα) M_{\mathrm{Env}(m)}(x) + \barη$, holds for every platform strategy, with $\barη$ absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.
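The semantic-envelope lift described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the transformation graph is treated as undirected, semantic classes are computed with a simple union-find, and the function names, variant labels, and scores are hypothetical.

```python
from collections import defaultdict


def semantic_classes(variants, edges):
    """Connected components of an undirected transformation graph (union-find)."""
    parent = {v: v for v in variants}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    classes = defaultdict(set)
    for v in variants:
        classes[find(v)].add(v)
    return list(classes.values())


def envelope_lift(scores, edges):
    """Assign every variant the maximum raw score found in its semantic class."""
    lifted = {}
    for cls in semantic_classes(list(scores), edges):
        top = max(scores[v] for v in cls)
        for v in cls:
            lifted[v] = top
    return lifted


# Hypothetical catalog: "a1" and "a2" are equivalent rewrites of one item.
raw_scores = {"a1": 0.9, "a2": 0.2, "b1": 0.1}
transformations = [("a1", "a2")]
lifted = envelope_lift(raw_scores, transformations)
```

Under the raw scores, routing recommendations from `a1` to `a2` lowers the measured score without changing the content's class. After the lift both variants carry the class maximum, so that substitution no longer changes the metric.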

2605.06320 2026-05-08 cs.MA cs.AI cs.CL

Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs

Elizabeth Mieczkowski, Alexander Ku, Tiwalayo Eisape, Dilip Arumugam, John Matters, Katherine M. Collins, Ilia Sucholutsky, Thomas L. Griffiths

Affiliations: Princeton University; University of Cambridge; MIT; New York University

AI summary: The paper proposes the LATTE framework, which uses adaptive task graphs to make language agent teams more efficient, reducing resource consumption while matching or exceeding the accuracy of standard designs.

Abstract

Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g. file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.

2605.06315 2026-05-08 stat.ML cs.LG

End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems

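Carles Balsells-Rodas, Zhengrui Xiang, Xavier Sumba, Yingzhen Li

Affiliations: Imperial College London

AI summary: The paper proposes ΩSDS, a flow-based estimator that enables exact likelihood optimization, addressing identifiability for sequential data in deep generative models and improving latent-structure recovery and forecasting of the dynamics.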

Abstract

Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limitations of this setting. First, we establish identifiability of a broad class of recurrent nonlinear switching dynamical systems under flexible assumptions, significantly extending prior results. Second, we introduce $Ω$SDS, a flow-based estimator that enables exact likelihood optimization using expectation-maximisation. Through empirical validation on both synthetic and real-world data, our results demonstrate that $Ω$SDS achieves improved disentanglement compared to VAE-based estimators and more accurate forecasting of underlying dynamics.

2605.06289 2026-05-08 stat.ML cs.AI cs.LG

Multimodal Deep Generative Model for Semi-Supervised Learning under Class Imbalance

Heegeon Yoon, Heeyoung Kim

Affiliations: Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea

AI summary: The paper proposes a multimodal deep generative model that addresses class imbalance through shared latent variables and Student's t-distributions, improving semi-supervised learning performance.

Abstract

When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicted based on imbalanced labeled data, propagating the bias. While recent semi-supervised models address class imbalance, they typically assume single-modal input data. However, with the growing availability of multimodal data, it is essential to leverage complementary modalities. In this article, we propose a multimodal deep generative model for semi-supervised learning under class imbalance. Our approach uses separate encoders for each modality, sharing latent variables across modalities, and simplifies joint posterior computation with a product-of-experts method. To further address class imbalance, we replace typical Gaussian distributions with Student's t-distributions for the prior, encoder, and decoder, better capturing the heavy-tailed latent distributions in imbalanced data. We derive a new objective function for training the proposed model on both labeled and unlabeled data using $γ$-power divergence. Empirical results on benchmark and real-world datasets demonstrate that our model outperforms baseline methods in generalization, achieving superior classification performance for partially labeled multimodal data with imbalanced class distributions.
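To make the product-of-experts step concrete, the following Python sketch shows the familiar one-dimensional Gaussian case, in which the joint posterior has a closed form as a precision-weighted combination. This is an assumption made for illustration only: the abstract replaces Gaussians with Student's t-distributions, and its exact formulation is not given here. The function name and the standard-normal prior expert are likewise hypothetical.

```python
def gaussian_product_of_experts(means, variances, prior_mean=0.0, prior_var=1.0):
    """Combine per-modality Gaussian posteriors with a Gaussian prior expert.

    The product of Gaussian densities is Gaussian: precisions (1 / variance)
    add up, and the joint mean is the precision-weighted average of the means.
    """
    precisions = [1.0 / prior_var] + [1.0 / v for v in variances]
    weighted_means = [prior_mean / prior_var] + [
        m / v for m, v in zip(means, variances)
    ]
    joint_var = 1.0 / sum(precisions)
    joint_mean = joint_var * sum(weighted_means)
    return joint_mean, joint_var


# Two hypothetical modality encoders that disagree on the latent mean.
joint_mean, joint_var = gaussian_product_of_experts(
    means=[2.0, 4.0], variances=[1.0, 1.0]
)
```

One reason this construction suits multimodal data is that a missing modality can be handled by leaving its expert out of the lists, with no change to the rest of the computation.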

2605.06288 2026-05-08 stat.ME cs.AI

A Topological Sorting Criterion for Random Causal Directed Acyclic Graphs

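Alexander G. Reisach, Antoine Chambaz, Gilles Blanchard, Sebastian Weichwald

Affiliations: University of Copenhagen, Denmark

AI summary: The paper shows that in random causal DAGs the number of relatives (nodes reachable via open paths) increases monotonically along the causal order, proposes recovering the causal order by sorting on the estimated number of relatives, and discusses the implications for causal discovery algorithms.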

Abstract

Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, increases monotonically along the causal order. We assess the prevalence of this pattern numerically, and demonstrate that it can be exploited for causal order recovery via sorting by the estimated number of relatives. We note that many simulations in the literature feature settings where this yields an excellent proxy for the causal order, and show that a strict increase of relatives along the causal order leads to a singular Markov equivalence class. We propose sampling time-series DAGs as a possible alternative and discuss implications for causal discovery algorithms and their evaluation on synthetic data.
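The sorting criterion can be sketched in Python on a known graph. The sketch rests on an assumption the abstract does not spell out: "open path" is read as a collider-free path with an empty conditioning set, so two nodes are relatives exactly when they share an ancestor (each node counting as its own ancestor). The function names and the toy DAG are hypothetical, and the paper sorts by an estimated number of relatives, whereas this example counts them exactly from the graph.

```python
def ancestors_inclusive(parents):
    """Map each node to the set containing itself and all of its ancestors."""
    memo = {}

    def visit(v):
        if v not in memo:
            acc = {v}
            for p in parents.get(v, ()):
                acc |= visit(p)
            memo[v] = acc
        return memo[v]

    for v in parents:
        visit(v)
    return memo


def relatives_count(parents):
    """Count, for each node, the other nodes that share an ancestor with it."""
    anc = ancestors_inclusive(parents)
    return {v: sum(1 for w in anc if w != v and anc[v] & anc[w]) for v in anc}


def sort_by_relatives(parents):
    """Candidate causal order: nodes sorted by ascending number of relatives."""
    counts = relatives_count(parents)
    return sorted(counts, key=lambda v: counts[v])


# Toy DAG given as node -> list of parents: a -> c, b -> c, c -> d.
toy_dag = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
counts = relatives_count(toy_dag)
order = sort_by_relatives(toy_dag)
```

The abstract's monotonicity claim concerns random DAGs; the toy graph only illustrates the computation, with the two root nodes receiving the fewest relatives and sorting first.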

2605.06279 2026-05-08 cs.SE cs.AI

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

Chengjie Wang, Jingzheng Wu, Xiang Ling, Tianyue Luo, Chen Zhao

Affiliations: Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Key Laboratory of System Software (Chinese Academy of Sciences), Beijing, China

AI summary: The study measures the security and compatibility risks of library versions specified in LLM-generated code, finding a systemic bias in version selection that drives both vulnerability exposure and compatibility failures.

Comments 35 pages, 8 figures

Abstract

Large language models (LLMs) are now widely used in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. When directly prompted, LLMs specify version identifiers at rates of 26.83%-95.18%, dropping to 6.45%-59.19% when creating a manifest file directly. Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, and 62.75%-74.51% of them carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed before the model's knowledge cutoff. The statistics show that all models converge on the same small set of risky release versions, indicating a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. Dynamic test cases confirm the pattern, with pass rates of 6.49%-48.62%. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first-class, previously overlooked risk surface in LLM-based development. We disclosed these findings to the community of the evaluated models, and several confirmed the issue. All the code and dataset have been released for open science at https://github.com/dw763j/PinTrace.

2605.06265 2026-05-08 stat.ML cs.LG

ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees

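Tianpai Luo, Fangwei Wu, Weichi Wu

Affiliations * Department of Statistics and Data Science

AI summary ConquerNet uses convolution smoothing to address the optimization difficulty of the non-smooth loss in quantile regression, provides nonasymptotic risk bounds, and improves estimation accuracy and training efficiency across multiple quantile levels.

Details
AI abstract (translated from Chinese)

Quantile regression is a fundamental tool for distributional learning, but the non-smooth pinball loss poses significant optimization challenges for deep models. We propose ConquerNet, a class of convolution-smoothed quantile ReLU neural networks that yield smooth objectives while preserving the quantile structure. We establish general nonasymptotic risk bounds under mild conditions, providing minimax guarantees over Besov function classes. In numerical studies, we show that the proposed method outperforms standard quantile neural networks at multiple quantile levels, with improved estimation accuracy and training efficiency overall, especially at high and low quantiles.

English abstract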

Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of \textbf{con}volution-smoothed \textbf{qu}antil\textbf{e} \textbf{R}eLU neural \textbf{net}works, which yield smooth objectives while preserving the underlying quantile structure. We establish general nonasymptotic risk bounds for ConquerNet under mild conditions, providing minimax guarantees over Besov function classes. In numerical studies, we demonstrate that the proposed approach outperforms standard quantile neural networks at multiple quantile levels, showing improved estimation accuracy and training efficiency across the board, with particularly pronounced advantages at high and low quantiles.
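
As a rough illustration of what convolution smoothing does to the pinball loss, the sketch below convolves the check function with a Gaussian kernel, a case in which the smoothed loss has a closed form. The Gaussian kernel, the bandwidth `h`, and the function names are assumptions made for this example; the abstract does not say which kernel or bandwidth ConquerNet uses.

```python
import math

def pinball(u: float, tau: float) -> float:
    """Standard (non-smooth) pinball loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def _norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def smoothed_pinball(u: float, tau: float, h: float) -> float:
    """Pinball loss convolved with a N(0, h^2) kernel (assumed kernel choice).

    Uses rho_tau(u) = (tau - 1/2) * u + |u| / 2 together with the
    folded-normal mean E|u + h Z| for Z ~ N(0, 1).
    """
    z = u / h
    abs_part = (h * math.sqrt(2.0 / math.pi) * math.exp(-0.5 * z * z)
                + u * (1.0 - 2.0 * _norm_cdf(-z)))
    return (tau - 0.5) * u + 0.5 * abs_part

# The smoothed loss is differentiable at 0 and approaches the pinball loss as h shrinks.
for h in (0.5, 0.1, 0.01):
    print(h, smoothed_pinball(0.3, 0.9, h), pinball(0.3, 0.9))
```

Because the pinball loss is convex, its Gaussian convolution lies on or above it everywhere, and the gap shrinks with `h`; that trade-off between smoothness and bias is what the bandwidth controls in this sketch.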

2605.06210 2026-05-08 stat.ML cs.AI cs.LG stat.AP stat.ME

Super-Level-Set Regression: Conditional Quantiles via Volume Minimization

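Sacha Braun, Michael I. Jordan, Francis Bach

Affiliations * Sierra team, Inria Paris, France

AI summary This paper proposes super-level-set regression, which directly optimizes geometric boundaries to address the challenge of constructing minimum-volume prediction regions with conditional coverage in multivariate regression, replacing traditional density-estimation approaches.

Details
AI abstract (translated from Chinese)

Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and then thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. This paper proposes super-level-set regression (SLS), a new mathematical framework that resolves the implicit coupling between the volume objective and the conditional quantiles of the model's own estimation error, so that the geometric boundaries of the target conditional level sets can be parameterized and optimized directly. By bypassing full distribution estimation and using flexible volume-preserving frontier functions, the approach captures complex, multimodal, and disjoint conditional structures end-to-end. Ultimately, SLS offers a new perspective on multivariate conditional quantile regression, replacing the restrictive assumptions of density-first methods with a direct geometric optimization strategy.

English abstract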

Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. One would like to instead optimize the region directly. Formulating a direct solution is challenging, however, because it requires minimizing a volume objective that is coupled with the conditional quantiles of the model's own estimation error. In this work, we address this challenge. We introduce super-level-set regression (SLS), a novel mathematical framework that successfully resolves this implicit coupling, allowing us to directly parameterize and optimize the geometric boundaries of the target conditional level sets. By bypassing full distribution estimation and leveraging flexible volume-preserving frontier functions, our approach natively captures complex, multimodal, and disjoint conditional structures end-to-end. Ultimately, SLS offers a new perspective on multivariate conditional quantile regression, replacing the restrictive assumptions of density-first methods with a direct geometric optimization strategy.

2605.06204 2026-05-08 stat.ML cs.LG

When Does Trimming Help Conformal Prediction? A Retained-Law Diagnostic under Calibration Contamination

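Congye Wang

Affiliations * Newcastle University

AI summary This paper studies the role of trimming in conformal prediction, uses a retained-law diagnostic to analyze the effect of calibration contamination, proposes a retained-law-based treatment of calibration points, and gives finite-sample guarantee templates.

Details
AI abstract (translated from Chinese)

Trimming suspicious calibration points is a common response to calibration contamination. Its effect on clean-target coverage, however, is determined by the retained law induced by trimming, not by the contamination level alone. This paper analyzes fixed-threshold trimming as conditioning rather than purification. Trimming replaces the contaminated calibration law with a retained law, which reduces clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. A componentwise bound on the transfer gap gives a population-level diagnostic. It separates a clean-side covariance cost from a retained-contamination cost, the latter governed by the dirty-to-clean retention ratio. Trimming helps when the anomaly score separates retention probabilities while remaining score-neutral on the clean population. Otherwise, it cannot substantially reduce contamination through the retained mixture coefficient. We also give finite-sample certificate templates that provide numerical guarantees under independent audit.

English abstract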

Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-threshold trimming as conditioning rather than purification. It replaces the contaminated calibration law with a retained law, reducing clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. A componentwise bound on the transfer gap gives a population-level diagnostic. This separates a clean-side covariance cost from a retained-contamination cost, governed by the dirty-to-clean retention ratio. Trimming helps when the anomaly score separates retention probabilities while remaining score-neutral on the clean population. Otherwise, it cannot substantially reduce contamination through the retained mixture coefficient. We also give finite-sample certificate templates that provide numerical guarantees under independent audit.
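
To make the setting concrete, the following sketch shows the standard split-conformal threshold and a fixed-threshold trimming step applied before it. It illustrates only the generic procedure the abstract analyzes, not the paper's diagnostic or certificates; the function names, the anomaly scores, and the cutoff are hypothetical.

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def trim(scores, anomaly, cutoff):
    """Fixed-threshold trimming: keep points whose anomaly score is <= cutoff."""
    return [s for s, a in zip(scores, anomaly) if a <= cutoff]

# Hypothetical calibration set: two points are flagged by the anomaly score.
scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.4]
anomaly = [0, 0, 5, 0, 5, 0, 0]

q_full = conformal_threshold(scores, alpha=0.25)
q_trim = conformal_threshold(trim(scores, anomaly, cutoff=1), alpha=0.25)
print(q_full, q_trim)
```

The threshold after trimming is computed from the retained points only, which is why the abstract describes trimming as conditioning: the quantile now reflects the retained law rather than a purified clean law.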

2605.06189 2026-05-08 eess.AS cs.LG

Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

Julius Richter, Yoshiki Masuyama, Christoph Boeddeker, Takahiro Edo, Gordon Wichern, Jonathan Le Roux

Affiliations * MERL Cambridge, MA, USA

AI summary This paper proposes a framework that combines predictive and generative models, improving speech enhancement and separation by decomposing the drift dynamics, with a unified approach that applies across tasks and degradation settings.

Comments Submitted to NeurIPS 2026

Details
AI abstract (translated from Chinese)

We propose a plug-and-play framework that combines predictive methods with a generative speech prior, building the Stochastic Interpolant Prior for Speech (SIPS) on stochastic interpolants. The method decomposes the interpolation dynamics into a task-specific drift and a stochastic denoising component, so that a predictive estimate can be integrated directly into the generative sampling process. Training the score model on clean speech only yields a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, SIPS provides a unified framework that applies across predictors and additive degradation tasks. Experiments show that SIPS improves perceptual quality in both speech enhancement and separation, with gains of up to +1.0 NISQA for speech separation.

English abstract

We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.

2605.06172 2026-05-08 stat.ML cs.LG cs.NA math.NA math.PR

Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective

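Meira Iske, Carola-Bibiane Schönlieb

Affiliations * Center for Industrial Mathematics, University of Bremen; Department of Theoretical Physics and Applied Mathematics, University of Cambridge

AI summary This paper studies the expressivity of bi-Lipschitz normalizing flows from the perspective of score-based diffusion models, analyzes their distributional approximation power via the probability flow ODE, and derives deterministic convergence guarantees for diffusion-based transport.

Details
AI abstract (translated from Chinese)

Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully understood. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge lets us analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. The key idea is to use the probability flow ODE to link the regularity of the score to the regularity of the induced transport maps. We verify score regularity for a broad range of target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures, and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are L^1-dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in KL divergence without early stopping.

English abstract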

Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. Our key idea is to use the probability flow ODE to link regularity of the score to regularity of the induced transport maps. We verify score regularity for broad target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are $L^1$-dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in Kullback-Leibler divergence without early stopping.
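
The link between score regularity and transport-map regularity can be seen in the simplest case. The sketch below integrates the probability flow ODE of a variance-preserving diffusion for a one-dimensional Gaussian target, where the score is known in closed form and the resulting transport map is affine (hence bi-Lipschitz). The constant schedule `BETA`, the Gaussian target, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

BETA = 1.0  # constant noise schedule (assumption for this sketch)

def mean_std(t, m, s):
    """Mean and std of p_t when the data law is N(m, s^2) under the VP diffusion."""
    decay = math.exp(-BETA * t)
    return m * math.sqrt(decay), math.sqrt(s * s * decay + 1.0 - decay)

def velocity(x, t, m, s):
    """Probability flow ODE right-hand side: -(beta / 2) * (x + score_t(x))."""
    mu, sigma = mean_std(t, m, s)
    score = -(x - mu) / (sigma * sigma)  # score of the Gaussian marginal p_t
    return -0.5 * BETA * (x + score)

def transport(x0, T, m, s, steps=200):
    """Integrate the ODE from time 0 to T with classical RK4."""
    h, x, t = T / steps, x0, 0.0
    for _ in range(steps):
        k1 = velocity(x, t, m, s)
        k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h, m, s)
        k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h, m, s)
        k4 = velocity(x + h * k3, t + h, m, s)
        x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return x

# For a Gaussian target the exact map is affine: x_T = mu_T + (sigma_T / s) * (x_0 - m).
m, s, T = 2.0, 0.5, 4.0
mu_T, sigma_T = mean_std(T, m, s)
print(transport(3.0, T, m, s), mu_T + (sigma_T / s) * (3.0 - m))
```

In this toy case the Lipschitz constants of the map and its inverse are `sigma_T / s` and `s / sigma_T`, which gives a concrete instance of a Lipschitz score producing a bi-Lipschitz transport map.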

2605.06153 2026-05-08 cs.CR cs.CV

Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles

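Enoal Gesny, Eva Giboulot

Affiliations * Inria Rennes

AI summary This paper proposes a principled watermarking method, establishes through theoretical analysis an evaluation framework based on security, robustness, and fidelity, and enables watermarking system design that does not depend on any generative model.

Details
AI abstract (translated from Chinese)

The rapid development of generative image models has driven the development of specialized watermarking techniques, particularly seed-based embedding methods. However, current evaluations remain largely empirical and depend on the specific model architectures used for generation and inversion, which prevents clear conclusions on performance, especially regarding security, for which a rigorous definition is lacking. This paper argues that the effectiveness of a watermarking scheme should be established through thorough theoretical analysis. By decoupling the model-dependent part from the actual decision mechanism of the watermarking system, we introduce a formal evaluation framework based on security, robustness, and fidelity. This allows precise comparisons through a characteristic surface that represents the trade-off between security, robustness, and fidelity. Based on this framework, we propose SSB, a new watermarking method that generalizes previous seed-based methods by allowing any security-robustness-fidelity regime on the characteristic surface to be reached. This work opens the door to the design of modern watermarking systems with theoretical guarantees, without requiring any costly empirical evaluation.

English abstract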

The rapid emergence of generative image models has led to the development of specialized watermarking techniques, particularly in-generation methods such as seed-based embedding. However, current evaluations in this area remain largely empirical, making them heavily reliant on the specific model architectures used for generation and inversion. This prevents any clear conclusion on the performance of any method, especially regarding security, for which a rigorous definition is lacking. Against this approach, we argue that the effectiveness of a watermarking scheme should be established purely through a thorough theoretical analysis. This is enabled by decoupling the model-dependent part from the actual decision mechanism of the watermarking system. Using this decoupling, we introduce a formal evaluation framework based on security, robustness, and fidelity. This allows precise comparisons between watermarking systems through a characteristic surface representing the trade-off between these three quantities, independent of any generative model. Based on this framework, we propose SSB, a novel watermarking method that generalizes previous seed-based methods by allowing to reach any security-robustness-fidelity regime on its characteristic surface. This work opens the door to the design of modern watermarking systems with theoretical guarantees that do not necessitate any costly empirical evaluations.

2605.06136 2026-05-08 cs.SE cs.AI

BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

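Jhen-Ke Lin

Affiliations * National Yang Ming Chiao Tung University

AI summary This study proposes the BUILD-AND-FIND protocol for evaluating whether downstream agents can recover intended choices from generated codebases and how much inspection effort that requires; it separates behavioral correctness from the recovery process and evaluates accuracy, stability, repeatability, coverage, and effort.

Comments 25 pages, 8 figures, 17 tables

Details
AI abstract (translated from Chinese)

Most coding-agent benchmarks ask whether the generated code is correct. That remains important, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they expose the intended behavior and design choices. We introduce the BUILD-AND-FIND protocol for evaluating whether downstream agents can recover those intended choices from generated repositories, and how much inspection effort that recovery requires. For each task, a builder sees a hidden repository specification and creates a codebase; a finder sees only the codebase and a specification-traced multiple-choice question bank. The protocol separates behavioral correctness from artifact-side recovery and reports recovery accuracy, stability, repeatability, implementation coverage, and inspection effort. Accuracy and stability act as gates: effort is interpreted only when recovery succeeds reliably. Among artifacts from which the same intent can be recovered, lower effort by the same finder indicates that the artifact makes that intent easier to locate. Question-only and spec-only controls quantify generic priors and specification access, while audits separate omitted claims from finder failures and check whether correct answers cite artifact evidence. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects provide the main panel-local comparison.

English abstract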

Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they expose the intended behavior and design choices behind that behavior. We introduce BUILD-AND-FIND, a protocol for evaluating whether downstream agents can recover those intended choices from generated repositories, and how much inspection that recovery requires. For each task, a builder sees a hidden repository specification and creates a codebase; a finder sees only the codebase and a specification-traced multiple-choice question bank. The protocol separates behavioral correctness from artifact-side recovery and reports recovery accuracy, repeatability, implementation coverage, and inspection effort. Accuracy and stability act as gates: effort is interpreted only when recovery succeeds reliably. Among artifacts from which the same intent can be recovered, lower effort by the same finder suggests that the artifact makes that intent easier to locate. Question-only and spec-only controls quantify generic priors and specification access, while audits separate omitted claims from finder failures and check whether correct answers cite artifact evidence. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects provide the main panel-local comparison.

2605.06134 2026-05-08 hep-lat cs.LG

Diffusion model for SU(N) gauge theories

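Javad Komijani, Marina K. Marinkovic, Lara Turgut

Affiliations * Institute for Theoretical Physics, ETH Zurich, 8093 Zurich, Switzerland

AI summary This paper proposes a score-matching framework for SU(N) lattice gauge theories, used to train diffusion models and generate high-quality samples; it validates the approach by comparison with HMC simulations and discusses strategies for improving sampling efficiency.

Comments 23 pages, 6 figures

Details
AI abstract (translated from Chinese)

Implicit score matching provides a computationally efficient way to train diffusion models that generate high-quality samples from complex distributions. This paper develops a score-matching framework for SU(N) lattice gauge theories that can be extended to other Lie groups. We apply it to SU(3) gauge configurations with the Wilson gauge action in two and four dimensions, and assess the quality of the generated samples by comparison with Hybrid Monte Carlo (HMC) simulations. We show that diffusion models can be trained successfully and used to sample the Wilson gauge action. For large inverse coupling, accurate reverse-time integration requires predictor-corrector schemes, for which we introduce a corrector based on Hamiltonian molecular dynamics. While the corrector significantly improves sampling quality, it also increases the computational cost. We outline several strategies for improving sampling efficiency.

English abstract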

Implicit score matching provides a computationally efficient approach for training diffusion models and generating high-quality samples from complex distributions. In this work, we develop a score-matching framework for SU(N) lattice gauge theories, which can be extended to other Lie groups. We apply the method to SU(3) gauge configurations with the Wilson gauge action in two and four dimensions and assess the quality of the generated samples by comparison with Hybrid Monte Carlo (HMC) simulations. We show that the diffusion models can be successfully trained and applied for sampling the Wilson gauge action. For large values of inverse coupling, accurate reverse-time integration requires predictor-corrector schemes, for which we introduce a corrector based on Hamiltonian molecular dynamics. While the corrector significantly improves sampling quality, it also increases the computational cost. We outline several strategies for improving sampling efficiency.

2605.06111 2026-05-08 cs.SE cs.AI

Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

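Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao

Affiliations * Harbin Institute of Technology, Shenzhen; Huawei Cloud Computing Technologies Co., Ltd.

AI summary This paper proposes the ASTOR framework, which uses a task-utility-driven coordination mechanism to improve multi-task reinforcement learning for code LLMs; experiments show that it outperforms task-specific specialists and baseline methods on several coding tasks.

Details
AI abstract (translated from Chinese)

Reinforcement learning (RL) with verifiable rewards performs well for post-training code LLMs, but deploying separate task-specific specialists incurs costs that scale with the number of tasks, which motivates a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly and rely on fixed data curricula under a shared optimization strategy, which ultimately limits the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-task code reinforcement learning framework built on utility-driven coordination. Centered on task utility, a signal that captures each task's learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) a hierarchical utility-routed data scheduling module, which allocates the training budget hierarchically and prioritizes informative prompts, steering training toward the most valuable data; and 2) an adaptive utility-calibrated policy optimization module, which dynamically scales per-task KL regularization so that update constraints match each task's current training state. Experiments on four representative coding tasks with two widely used LLMs show that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and the strongest MTRL baseline by 7.5%-12.8%.

English abstract

Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task's learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) a Hierarchical Utility-Routed Data Scheduling module, which hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data, and 2) an Adaptive Utility-Calibrated Policy Optimization module, which dynamically scales per-task KL regularization, matching update constraints to each task's current training state. Experiments on two widely-used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.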

2605.06091 2026-05-08 math.ST cs.LG math.PR stat.CO stat.TH

Time-Inhomogeneous Preconditioned Langevin Dynamics

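Alexander Falk, Laurenz Nagler, Andreas Habring, Thomas Pock

Affiliations * Institute of Visual Computing; Graz University of Technology

AI summary This paper proposes TIPreL, which uses a time- and position-dependent preconditioner to address global mode coverage and local mode exploration for Langevin dynamics on multimodal distributions, proves convergence in the Wasserstein-2 distance, and validates the method experimentally.

Details
AI abstract (translated from Chinese)

Langevin sampling from distributions of the form p(x) ∝ exp(-Ψ(x)) faces two major challenges: (global) mode coverage and (local) mode exploration. The first is especially relevant for multi-modal distributions with disjoint modes, while the second arises when the potential Ψ exhibits diverse and ill-conditioned local mode geometry. A common way to address these challenges is to precondition Langevin dynamics with problem-specific information, such as the sample covariance or the local curvature of Ψ. However, existing preconditioner choices inherently trade off global mode coverage against local mode exploration, and no prior method resolves both at once. To overcome this limitation, we propose TIPreL, which introduces a time- and position-dependent preconditioner. This design addresses both challenges within a single framework. We establish convergence of the resulting dynamics in the Wasserstein-2 distance, both in continuous time and for a tamed Euler discretization. In particular, our analysis extends the state of the art by proving convergence under time- and space-dependent diffusion coefficients and only locally Lipschitz drifts, which prior work did not cover. Finally, we compare TIPreL experimentally with competing preconditioning schemes on a severely ill-conditioned two-dimensional example and on a higher-dimensional Bayesian logistic regression task, confirming the effectiveness of the proposed method.

English abstract

Langevin sampling from distributions of the form $p(x) \propto \exp(-Ψ(x))$ faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas the second arises when the potential $Ψ$ exhibits diverse and ill-conditioned local mode geometry. To address these challenges, a common approach is to precondition Langevin dynamics with problem-specific information, such as the sample covariance or the local curvature of $Ψ$. However, existing preconditioner choices inherently involve a trade-off between global mode coverage and local mode exploration, and no prior method resolves both simultaneously. To overcome this limitation, we propose TIPreL, which introduces a time- and position-dependent preconditioner. This design effectively addresses both challenges mentioned above within a single framework. We establish convergence of the resulting dynamics in the Wasserstein-2 distance both in continuous time and for a tamed Euler discretization. In particular, our analysis extends the existing state of the art by proving convergence under time- and space-dependent diffusion coefficients and only locally Lipschitz drifts, which has not been covered by prior work. Finally, we experimentally compare TIPreL with competing preconditioning schemes on a two-dimensional, severely ill-posed example and on a Bayesian logistic regression task in higher dimensions, confirming the efficiency of the proposed method.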
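
A tamed Euler step can be sketched in a few lines. The example below is a simplified illustration, not the TIPreL scheme: it assumes a diagonal preconditioner that depends on time only (a position-dependent preconditioner would require an additional correction term in the drift), and the function names, the taming rule `h / (1 + h * |drift|)`, and the example preconditioner are all assumptions of this sketch.

```python
import math
import random

def tamed_step(x, t, h, grad_psi, precond, rng):
    """One tamed Euler step of dX = -P(t) grad Psi(X) dt + sqrt(2 P(t)) dW.

    P(t) is diagonal and given as a list of positive entries (assumption).
    """
    p = precond(t)
    g = grad_psi(x)
    drift = [-pi * gi for pi, gi in zip(p, g)]
    norm = math.sqrt(sum(d * d for d in drift))
    scale = h / (1.0 + h * norm)  # taming: the drift increment has norm below 1
    return [xi + scale * di + math.sqrt(2.0 * h * pi) * rng.gauss(0.0, 1.0)
            for xi, di, pi in zip(x, drift, p)]

# Usage on a standard Gaussian potential Psi(x) = |x|^2 / 2, whose gradient is x.
# The preconditioner starts large in the first coordinate and decays to 1 (hypothetical choice).
rng = random.Random(0)
x = [3.0, -3.0]
for k in range(1000):
    x = tamed_step(x, k * 0.01, 0.01, lambda v: list(v),
                   lambda t: [1.0 + math.exp(-t), 1.0], rng)
print(x)
```

Taming matters when the drift is only locally Lipschitz: for a potential such as x^4 / 4, a plain Euler step from a far-out point can overshoot wildly, while the tamed increment stays bounded.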

2605.06082 2026-05-08 cs.AR cs.LG cs.PF

PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs

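Rappy Saha, Jude Haris, Nicolas Bohm Agostini, David Kaeli, José Cano

Affiliations * UK Engineering and Physical Sciences Research Council

AI summary This paper proposes the PoTAcc pipeline for accelerating and evaluating power-of-two quantized DNNs on resource-constrained edge devices; it prepares and deploys models seamlessly via TensorFlow Lite and demonstrates speedups on CPU and hybrid CPU-FPGA systems.

Comments Accepted to IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI), 2026

Details
AI abstract (translated from Chinese)

Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy on tasks such as image classification; however, their performance on resource-constrained edge devices remains insufficiently understood. Although general-purpose edge CPUs and GPUs do not provide optimized backends for bit-shift operations, custom hardware accelerators can better exploit PoT quantization by implementing dedicated shift-based processing elements. Deploying PoT-quantized models on such accelerators is challenging, however, because support in existing inference frameworks is limited. In addition, the impact of different PoT quantization strategies on hardware design, performance, and energy efficiency during full inference has not been explored systematically. To address these challenges, we propose PoTAcc, an open-source end-to-end pipeline for accelerating and evaluating PoT-quantized DNNs on resource-constrained edge devices. PoTAcc enables seamless preparation and deployment of PoT-quantized models via TensorFlow Lite (TFLite) across heterogeneous platforms, including CPU-only systems and hybrid CPU-FPGA systems with custom accelerators. We design shift-based processing element (shift-PE) accelerators for three PoT quantization methods and implement them on two FPGA platforms. We evaluate accuracy, performance, energy efficiency, and resource utilization across a range of models, including CNNs and Transformer-based architectures. Results show that our CPU-accelerator design achieves up to 3.6x speedup and 78% energy reduction compared with CPU-only execution of PoT-quantized DNNs on PYNQ-Z2 and Kria boards. The code will be publicly released at https://github.com/gicLAB/PoTAcc.

English abstract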

Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks such as image classification; however, their performance on resource-constrained edge devices remains insufficiently understood. While general-purpose edge CPUs and GPUs do not provide optimized backends for bit-shift operations, custom hardware accelerators can better exploit PoT quantization by implementing dedicated shift-based processing elements. However, deploying PoT-quantized models on such accelerators is challenging due to limited support in existing inference frameworks. In addition, the impact of different PoT quantization strategies on hardware design, performance, and energy efficiency during full inference has not been systematically explored. To address these challenges, we propose PoTAcc, an open-source end-to-end pipeline for accelerating and evaluating PoT-quantized DNNs on resource-constrained edge devices. PoTAcc enables seamless preparation and deployment of PoT-quantized models via TensorFlow Lite (TFLite) across heterogeneous platforms, including CPU-only systems and hybrid CPU-FPGA systems with custom accelerators. We design shift-based processing element (shift-PE) accelerators for three PoT quantization methods and implement them on two FPGA platforms. We evaluate accuracy, performance, energy efficiency, and resource utilization across a range of models, including CNNs and Transformer-based architectures. Results show that our CPU-accelerator design achieves up to 3.6x speedup and 78% energy reduction compared to CPU-only execution for PoT-quantized DNNs on PYNQ-Z2 and Kria boards. The code will be publicly released at https://github.com/gicLAB/PoTAcc
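
The core idea, replacing a multiplication by a shift, can be shown with a toy example. The sketch below rounds a weight to the nearest power of two in the log domain and then multiplies an integer activation using shifts only. The rounding rule, the exponent range, and the function names are assumptions for illustration; the abstract does not specify the three PoT quantization methods that PoTAcc supports.

```python
import math

def pot_quantize(w: float, min_exp: int = -7, max_exp: int = 0):
    """Round a weight to sign * 2**e, with e clipped to [min_exp, max_exp].

    Returns (sign, e); a zero weight is encoded as (0, 0).
    """
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    e = round(math.log2(abs(w)))  # nearest exponent in the log domain
    return sign, max(min_exp, min(max_exp, e))

def shift_mul(x: int, sign: int, e: int) -> int:
    """Multiply integer activation x by sign * 2**e using shifts only."""
    if sign == 0:
        return 0
    y = x << e if e >= 0 else x >> -e  # right shift floors the result
    return sign * y

sign, e = pot_quantize(0.3)     # 0.3 is rounded to 2**-2 = 0.25
print(sign, e, shift_mul(64, sign, e))
```

A hardware shift-PE would perform the same operation on fixed-width integers; the Python integers here are unbounded, so overflow behavior is not modeled.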

2605.06059 2026-05-08 stat.AP cs.LG

Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models

Jose Benitez-Aurioles, Ricardo Silva, Brian McMillan, Matthew Sperrin

Affiliations * Division of Informatics, Imaging & Data Sciences, University of Manchester; Department of Statistical Science, UCL; Division of Population Health, Health Services Research & Primary Care, University of Manchester

AI summary This paper proposes a method that uses a causal inference framework to correct prediction bias caused by differences in diagnostic delay, modeling the longitudinal process with a hidden Markov model to improve the calibration of chronic kidney disease prediction models.

Comments 4 figures, 2 tables, 4 supplementaries

Details
AI abstract (translated from Chinese)

In routine care, individuals identified in advance as high-risk are usually tested more frequently, and protected attributes such as sex or ethnicity may also affect testing frequency. Such heterogeneous detection rates across a population lead to label error, which causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method that uses a causal inference framework to define the target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model in which confirmatory test results are emissions from a latent progressive disease stage. We validate the approach on simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, the method reduces prediction bias and improves calibration-in-the-large, bringing the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) down to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach still remained better calibrated than the standard model. In the clinical case study, we find that diabetes is the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) for the 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes improved the Observed:Expected ratio of the developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).

English abstract

In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity, may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
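
For readers unfamiliar with the metric, the Observed:Expected ratio used for calibration-in-the-large is the number of observed events divided by the sum of predicted risks. The sketch below computes it on made-up data; the function name and the numbers are illustrative only and are unrelated to the study's data.

```python
def observed_expected_ratio(outcomes, predicted_probs):
    """Calibration-in-the-large: observed event count over the sum of predicted risks."""
    expected = sum(predicted_probs)
    if expected <= 0:
        raise ValueError("sum of predicted probabilities must be positive")
    return sum(outcomes) / expected

# Hypothetical cohort: three events observed, two expected under the model.
outcomes = [1, 0, 1, 1]
predicted = [0.5, 0.25, 0.75, 0.5]
print(observed_expected_ratio(outcomes, predicted))
```

A ratio above 1, as in the 1.34 and 1.55 figures reported for the uncorrected models, means the model predicts fewer events than are observed; a ratio near 1 indicates good calibration-in-the-large.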

2605.05996 2026-05-08 stat.ML cs.LG

Gaussian mixture models in Hilbert spaces via kernel methods

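Daniel López-Montero, Antonio Álvarez-López, Marcos Matabuena

Affiliations * Friedrich-Alexander-Universität Erlangen-Nürnberg; Universidad Autónoma de Madrid; Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper proposes a Gaussian mixture model for Hilbert-space-valued data based on kernel mean embeddings, estimates it with efficient optimization algorithms, establishes theoretical guarantees in infinite-dimensional spaces, and validates the framework on dynamic functional data and modern medical applications.

Comments 38 pages, 13 figures

Details
AI abstract (translated from Chinese)

Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures through densities can be ill-defined or technically challenging. Motivated by clustering applications, this paper proposes a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develops efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including L^2 functional data and random graphs in Laplacian spaces arising in modern medical applications.

English abstract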

Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^2$-functional data and random graphs in Laplacian spaces arising in modern medical applications.

2605.05993 2026-05-08 stat.ML cs.LG stat.ME stat.OT

TabCF: Distributional Control Function Estimation with Tabular Foundation Models

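Geping Chen, Chunlin Li, Tianzhong Yang, Zhengyuan Zhu, Jing Zhou

Affiliations * Iowa State University; University of Virginia; University of Minnesota; University of Manchester

AI summary TabCF uses tabular foundation models for control function regression, enabling fast and transparent distributional causal estimation of quantities such as interventional means and quantiles, and proposes a copula-based approximation for multivariate outcomes.

Details
AI abstract (translated from Chinese)

Instrumental variable (IV) and control function (CF) methods are powerful tools for estimating causal effects under unmeasured confounding, but most existing approaches target only mean effects or require substantial fitting and tuning. This paper proposes TabCF, a control function regression method based on tabular foundation models that enables accurate, fast, transparent, and tuning-light causal estimation of distributional quantities such as interventional means and quantiles; it also proposes a copula-based approximation for multivariate outcomes. TabCF performs well across a range of small- to medium-sized synthetic and real data scenarios. The central message: for practitioners, TabCF is an effective tool for distributional causal inference; for researchers, the approach can serve as a strong baseline for future method development. Code is available at https://github.com/GepingChen/TabCF.

English abstract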

Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification-transparent, and tuning-light causal estimation of distributional quantities, such as interventional means and quantiles; we also propose a copula-based approximation for multivariate outcomes. TabCF performs favorably against representative methods across a broad range of small- to medium-sized synthetic and real data scenarios. The central message is two-fold: for practitioners, it highlights that TabCF is an effective tool for distributional causal inference; for researchers, it suggests that the proposed approach could be considered a strong baseline for future method development. Code is available at https://github.com/GepingChen/TabCF.

2605.05973 2026-05-08 stat.ML cs.AI cs.LG stat.AP

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

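Yang Xu, Jiefu Zhang, Haixiang Sun, Zihan Zhou, Tianyu Cao, Vaneet Aggarwal

Affiliations * Purdue University; Johns Hopkins University; Purdue University, USA

AI summary This paper studies the winner's curse in adaptive benchmarking and proposes SIREN to make evaluation more reliable: it freezes the shortlist, separates selection from evaluation, and uses a Gaussian multiplier bootstrap for uncertainty quantification; experiments show that it provides valid confidence intervals and comparisons under finite budgets.

Details
AI abstract (translated from Chinese)

Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused during tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In the fixed-shortlist regime, when selection is stabilized, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN stays close to the finite-sample reporting target.

English abstract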

Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.

2605.04918 2026-05-08 math.AP cs.LG cs.NA math.NA

Neural Discovery of Strichartz Extremizers

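Nicolás Valenzuela, Ricardo Freire, Claudio Muñoz

Affiliations * Departamento de Ingeniería Matemática Universidad de Chile; Universidad de Chile; DIM & CMM (UMI 2807 CNRS)

AI summary This paper proposes a neural-network-based pipeline for finding extremizers of Strichartz inequalities, validates it in three settings, finds evidence that the extremizers may be Gaussians, and puts forward a new conjecture.

Comments 38 pages, 26 figures; v.2: corrected typos

Details
AI abstract (translated from Chinese)

Strichartz inequalities are a cornerstone of the modern theory of dispersive PDEs, but their extremizers are known explicitly only in a few sharp cases. Non-convexity makes the problem hard, and no systematic numerical approach has been attempted so far. This paper proposes a simple neural-network pipeline that searches for extremizers as critical points of the Strichartz ratio and applies it in three settings. First, on the Schrödinger group, it recovers the Gaussian extremizers of Foschi and Hundertmark--Zharnitsky in dimensions d=1,2 to within 10^{-3} error, with no analytical prior. Second, on 59 further admissible pairs in d=1, the method consistently finds Gaussians, supporting the conjecture that Gaussians are the universal extremizers in the admissible range. Third, for the critical Airy--Strichartz inequality at γ=1/q, the optimization does not converge to any L^2 profile: instead, the iterates organize themselves as mKdV breathers B(0,⋅;α,1,0,0) with growing internal frequency α, and the discovered ratio approaches the Frank--Sabin universal lower bound A_{q,r} from below with a power-law gap ~α^{-0.9}. An independent Hermite-basis ansatz confirms the same picture. The paper proposes a precise conjecture: the supremum equals A_{q,r} and is approached, but not attained, along the breather family. The pipeline thus serves both as a validator on known cases and as a discovery tool when no extremizer exists.

English abstract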

Strichartz inequalities are a cornerstone of the modern theory of dispersive PDEs, but their extremizers are known explicitly only in a handful of sharp cases. The non-convexity of the underlying functional makes the problem hard, and to our knowledge no systematic numerical attack has been attempted. We propose a simple neural-network-based pipeline that searches for extremizers as critical points of the Strichartz ratio, and apply it in three settings. First, on the Schrödinger group we recover the Gaussian extremizers of Foschi and Hundertmark--Zharnitsky in dimensions $d=1,2$ to within $10^{-3}$ relative error, with no analytical prior. Second, on $59$ further admissible pairs in $d=1$ where the answer is conjectural, the method consistently finds Gaussians, supporting the conjecture that Gaussians are the universal extremizers in the admissible range. Third, on the critical Airy--Strichartz inequality at $γ=1/q$, where existence is open, the optimization does not converge to any $L^2$ profile: instead, the iterates organize themselves as mKdV breathers $B(0,\cdot;α,1,0,0)$ with growing internal frequency $α$, and the discovered ratio approaches the Frank--Sabin universal lower bound $\widetilde A_{q,r}$ from below with a power-law gap $\simα^{-0.9}$. We confirm the same picture with an independent Hermite-basis ansatz. We propose a precise conjecture: the supremum equals $\widetilde A_{q,r}$ and is approached, but not attained, along the breather family. The pipeline thus serves both as a validator on known cases and as a discovery tool when no extremizer exists.