arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22776 2026-05-22 cs.LG cs.AI stat.CO stat.ML

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

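AI summary: This paper proposes SDPM, a generative model for continuous-time survival analysis that uses a denoising diffusion model to capture the conditional distribution of the survival outcome. It avoids parametric assumptions on the event-time distribution, operates in a transformed target space with standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator, and achieves competitive predictive performance on several real survival datasets.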

Abstract

Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival Diffusion Probabilistic Model (SDPM), a generative approach to continuous-time survival analysis. SDPM models the conditional distribution of the survival outcome, represented by the pair of observed time and censoring indicator, $\mathbb{P}(T,δ\mid \mathbf{x})$, using a denoising diffusion model. Under the assumption of conditionally independent censoring, conditional samples generated by the model can be transformed into survival function estimates using the Kaplan-Meier estimator. This formulation avoids parametric assumptions on the event-time distribution and does not require a discretization of the output time space. The model operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. We evaluate SDPM on ten real survival datasets and compare it with five strong baselines, including tree-based, boosting-based, and neural survival models. Results show that SDPM achieves competitive predictive performance across C-index, integrated time-dependent AUC, and integrated Brier score. A study on synthetic Cox-Weibull data demonstrates that SDPM can recover the shape of an underlying continuous survival distribution more accurately than a strong nonparametric baseline when sufficiently many samples are generated. An ablation study confirms the importance of the proposed target-space transformations, which improve event-rate calibration, reduce invalid generated times, and provide consistent gains in predictive discrimination. Codes implementing the proposed model are publicly available.
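
The Kaplan-Meier step mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `kaplan_meier` and the toy samples are assumptions, standing in for (time, censoring indicator) pairs that a trained model would generate for one covariate vector.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate from (time, event) pairs.

    times  : observed times, shape (n,)
    events : 1 if the event was observed, 0 if censored, shape (n,)
    Returns the distinct event times and the survival estimate just after each.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                    # still under observation at t
        deaths = np.sum((times == t) & (events == 1))   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical stand-in for samples drawn from a trained model for one x.
sample_times = [2.0, 3.0, 3.0, 5.0, 8.0]
sample_events = [1, 1, 0, 1, 0]
grid, surv = kaplan_meier(sample_times, sample_events)
```

With more generated samples the step function becomes finer, which is consistent with the abstract's remark that shape recovery improves when sufficiently many samples are drawn.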

2605.22765 2026-05-22 cs.LG stat.ML

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

Samson Gourevitch, Yazid Janati, Dario Shariatian, Umut Simsekli, Eric Moulines, Eric P. Xing, Alain Durmus

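AI summary: This paper studies the mismatch between the denoising posterior and the leave-one-out posterior in Uniform Diffusion Models, and improves model performance through better parameterizations and sampling methods.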

Comments: preprint

Abstract

Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.

2605.22746 2026-05-22 cs.LG eess.AS stat.ML

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

Berk Hayta, Hannah Laus, Simon Mittermaier, Felix Krahmer

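AI summary: This paper proposes a simplified framework that approximates the uncertainty-estimation objective of Evidential Deep Learning with plug-in losses, shows that under a particular evidence-to-Dirichlet mapping the framework includes the standard softmax classifier, and validates it on the Google Speech Commands dataset.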

Abstract

Real-world sensor-based learning systems require uncertainty estimation that is both reliable and computationally efficient. Evidential Deep Learning (EDL) provides single-pass uncertainty estimation by modeling the class probabilities via Dirichlet distributions, where the Dirichlet parameters are predicted by a learned neural network mapping. However, this approach can lead to computational challenges, as Dirichlet expected objectives are more complex than standard supervised learning losses, complicating their analysis and implementation. We address this issue by approximating the objective of the first-order empirical risk minimization problem induced by EDL with a plug-in loss evaluated at the Dirichlet mean and show that, under mild assumptions, the approximation error decays with growing evidence for a broad class of loss functions, including mean-squared error and cross-entropy loss. As a special case, our analysis provides justification for the use of softmax in the context of uncertainty estimation, since under a particular evidence-to-Dirichlet mapping, our framework includes the standard softmax classifier. We validate the proposed simplified objectives on the Google Speech Commands dataset and show that they achieve predictive accuracy and selective prediction performance comparable to classical EDL, while being simpler to implement using standard deep learning losses and training pipelines. To the best of our knowledge, this empirical analysis is the first to obtain coverage-accuracy trade-offs for speech recognition tasks through EDL.
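
For the mean-squared-error case, the gap between a Dirichlet expected loss and the plug-in loss at the Dirichlet mean has a closed form, which gives a quick sanity check of the "error decays with growing evidence" claim. The sketch below uses the standard Dirichlet variance formula. The function names are hypothetical, and the parameters `alpha` are scaled directly, which is an assumption: the paper's evidence-to-Dirichlet mapping is not given in the abstract.

```python
import numpy as np

def dirichlet_expected_mse(alpha, y):
    """E ||y - p||^2 for p ~ Dirichlet(alpha), via the bias-variance split."""
    alpha = np.asarray(alpha, dtype=float)
    y = np.asarray(y, dtype=float)
    total = alpha.sum()
    mean = alpha / total
    var = mean * (1.0 - mean) / (total + 1.0)   # Var(p_k) for a Dirichlet
    return float(np.sum((y - mean) ** 2) + np.sum(var))

def plugin_mse(alpha, y):
    """Plug-in loss: squared error evaluated at the Dirichlet mean only."""
    alpha = np.asarray(alpha, dtype=float)
    mean = alpha / alpha.sum()
    return float(np.sum((np.asarray(y, dtype=float) - mean) ** 2))

y = np.array([1.0, 0.0, 0.0])       # one-hot label
base = np.array([0.6, 0.3, 0.1])    # fixed Dirichlet mean
gaps = []
for strength in (10.0, 100.0, 1000.0):
    alpha = strength * base          # same mean, growing total concentration
    gaps.append(dirichlet_expected_mse(alpha, y) - plugin_mse(alpha, y))
```

Here the gap equals the summed Dirichlet variances, which shrink as the total concentration grows while the mean stays fixed.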

2605.22724 2026-05-22 cs.LG cs.NA math.NA stat.ML

Multiple Neural Operators Achieve Near-Optimal Rates for Multi-Task Learning

Adrien Weihs, Hayden Schaeffer

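AI summary: This paper studies the approximation and statistical complexity of learning collections of operators in a shared multi-task setting, focusing on the Multiple Neural Operators (MNO) architecture. For broad classes of Lipschitz multiple operator maps, it derives near-optimal upper bounds for approximation and statistical generalization. It also establishes a curse of parametric complexity and proves the corresponding minimax rates. Together, these results show that sharing representations across tasks does not increase the overall cost: multi-task operator learning follows the same scaling laws as single-operator learning. The paper further compares MNO with a multi-task extension of DeepONet based on concatenated task inputs and shows that, from a worst-case approximation-complexity perspective, both architectures satisfy essentially the same asymptotic rates.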

Abstract

We study the approximation and statistical complexity of learning collections of operators in a shared multi-task setting, with a focus on the Multiple Neural Operators (MNO) architecture. For broad classes of Lipschitz multiple operator maps, we derive near-optimal upper bounds for approximation and statistical generalization. On the lower-bound side, we establish a curse of parametric complexity and prove corresponding minimax rates. Together, these results show that shared representations across tasks do not increase the overall cost: multi-task operator learning follows the same scaling laws as single operator learning. We also compare MNO with a multi-task extension of DeepONet based on concatenated task inputs and show that, from a worst-case approximation-complexity perspective, both architectures satisfy essentially the same asymptotic rates.

2605.22676 2026-05-22 stat.AP

Comparison of probabilistic nowcasts and forecasts of SARS-CoV-2 variant proportions made by hierarchical multinomial linear regression models

Isaac MacArthur, Thomas Robacker, Evan L. Ray, Benjamin W. Rogers, Nicholas G. Reich, Maryclare Griffin

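AI summary: This paper compares probabilistic nowcasts and forecasts of SARS-CoV-2 variant proportions produced by hierarchical multinomial linear regression models and examines how these models perform in locations with different amounts of data.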

Comments: 24 pages, 8 figures

Abstract

Nowcasting and forecasting of infectious diseases have become increasingly important since the SARS-CoV-2 pandemic. In particular, methods for modeling the composition of circulating variants at a given time have seen more use in part due to a large increase in the frequency of genomic sequencing conducted as a part of routine surveillance. However, methods must take into account that locations have different amounts of data and sometimes have different trends. We discuss hierarchical multinomial logistic regression (HMLR), a commonly used method for forecasting SARS-CoV-2 variants, which allows for data sharing across locations. We show how it has been used in the literature, and define a class of HMLR models for SARS-CoV-2 variant nowcasting and forecasting. We rigorously test a subset of this class of models using the framework of the US SARS-CoV-2 Variant Nowcast Hub, a collaborative modeling project that launched in 2024. We created two years of weekly predictions based on retrospective datasets, with the prediction dates ranging from Wednesday, August 3, 2022, to Wednesday, August 7, 2024. We tested 12 HMLR models against a baseline model on these datasets. We found that the HMLR models outperformed the baseline both in terms of probabilistic accuracy, as measured by the energy score, as well as point accuracy, as measured by the Brier score. Overall, we find that HMLR models perform best with respect to the baseline model in locations with more data, and more complex HMLR models also showed more improvement in those high-data locations; however, there was no one best model across all metrics, and simpler HMLR models perform better in low-data locations. We find that HMLR models perform well in practice for nowcasting and forecasting SARS-CoV-2 variants.
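
The two scores named in the abstract can be sketched for a single prediction of variant proportions. This uses the textbook definitions (multi-category Brier score and the sample-based energy score); whether the Hub applies exactly these forms, and any weighting across locations or dates, is an assumption not confirmed by the abstract.

```python
import numpy as np

def brier_score(pred, obs):
    """Multi-category Brier score: squared distance between predicted and
    observed proportion vectors."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return float(np.sum((pred - obs) ** 2))

def energy_score(samples, obs):
    """Sample-based energy score: E||X - y|| - 0.5 * E||X - X'||.

    samples : predictive draws of the proportion vector, shape (m, k)
    obs     : observed proportion vector, shape (k,)
    Lower is better for both scores.
    """
    samples = np.asarray(samples, dtype=float)
    obs = np.asarray(obs, dtype=float)
    term1 = np.mean(np.linalg.norm(samples - obs, axis=1))
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = np.mean(np.linalg.norm(diffs, axis=-1))
    return float(term1 - 0.5 * term2)

# Toy example with three hypothetical variants.
point_score = brier_score([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])
prob_score = energy_score([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [1.0, 0.0, 0.0])
```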

2605.22640 2026-05-22 stat.ME

Positive-definiteness in separable priors: effects on prior interpretability and inference

Jack Storror Carter, David Rossell

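AI summary: This paper studies how truncation affects prior interpretability and inference when separable priors are used for symmetric positive-definite matrices, and examines how to set prior parameters to reduce the effect of the truncation.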

Comments: 32 pages, 3 figures

Abstract

A popular class of priors for symmetric positive-definite matrices assumes independent entries and adds a truncation to ensure positive-definiteness. While conceptually simple and often computationally convenient, unless done carefully this truncation can have unintended effects. If the truncated prior or its margins are significantly different from their untruncated counterpart, then its interpretability may suffer, its shrinkage properties become harder to characterise, and posterior inference may be affected in unanticipated ways. We investigate the effect of the truncation both for dense and sparse matrices, and show how to set prior parameters such as the variance of off-diagonal entries such that said effect is mitigated as the matrix dimension grows. We pay particular attention to sparse inference where, unless prior parameters are set carefully, the truncated prior and hence its corresponding posterior assign systematically higher mass to sparser structures than the untruncated prior.
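
One way to see how strongly the truncation bites is to estimate, by Monte Carlo, how much mass the untruncated prior places on positive-definite matrices. The sketch below assumes Gaussian off-diagonal entries and a fixed unit diagonal; both choices, and the function name, are illustrative assumptions rather than the priors studied in the paper.

```python
import numpy as np

def prob_positive_definite(dim, offdiag_sd, n_draws=2000, seed=0):
    """Monte Carlo estimate of the probability that a symmetric matrix with
    independent N(0, offdiag_sd^2) off-diagonal entries and unit diagonal
    is positive definite."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_draws):
        upper = np.triu(rng.normal(0.0, offdiag_sd, size=(dim, dim)), k=1)
        mat = upper + upper.T + np.eye(dim)     # symmetric, unit diagonal
        if np.linalg.eigvalsh(mat)[0] > 0:      # smallest eigenvalue
            hits += 1
    return hits / n_draws

# Same off-diagonal scale, growing dimension.
p_small = prob_positive_definite(dim=2, offdiag_sd=0.3, n_draws=500)
p_large = prob_positive_definite(dim=20, offdiag_sd=0.3, n_draws=500)
```

When this probability is far below one, the truncated prior can differ substantially from its untruncated counterpart. Holding the off-diagonal scale fixed as the dimension grows drives the probability down, which is why the scale has to shrink with dimension for the effect to be mitigated.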

2605.22595 2026-05-22 stat.ME

A new class of functional conditional autoregressive models

Sooran Kim

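AI summary: This paper proposes a new class of conditional autoregressive models for spatially dependent functional data, formulated through conditional means given neighboring functional observations and characterized by a covariance operator and a spatial dependence parameter. The method estimates the covariance operator, estimates the spatial dependence parameter, and applies a novel profile-based approach. The theory establishes consistency of the covariance estimator and superconsistency of the spatial dependence parameter estimator, providing a new route to statistical inference on spatial dependence in functional data.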

Abstract

We introduce a new class of conditional autoregressive models for spatially dependent functional data, formulated through conditional means given neighboring functional observations and characterized by a covariance operator and a spatial dependence parameter. Our estimation strategy consists of three components: (i) estimating the covariance operator using conditionally centered data, (ii) estimating the spatial dependence parameter by maximizing the likelihood of projected observations, and (iii) applying a novel profile-based approach to obtain the final estimators. Under an expanding lattice framework, we establish two key theoretical results. First, we establish the consistency of the proposed covariance estimator, which is not attainable using naive methods based on marginally centered data. Second, we prove that the spatial dependence parameter estimator is superconsistent and asymptotically normal, where the latter property enables statistical inference for spatial dependence in functional data -- a contribution that is novel in the existing literature. Numerical studies support the theoretical results and demonstrate the computational efficiency of our method. Finally, we illustrate its practical utility by analyzing weekly PM$_{2.5}$ concentration trajectories in 2019 across counties in the Midwestern United States.

2605.22579 2026-05-22 cs.CL cs.AI stat.ML

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

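AI summary: This paper studies hyperfitting and finds that it differs from distribution sharpening. Experiments show that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism, and that a geometric expansion of the feature space occurs in a "Terminal Expansion" at the final Transformer layer. The paper also proposes Late-Stage LoRA to improve generation quality.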

Comments: Accepted at ICML 2026

Abstract

Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a "Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space ($\Delta\mathrm{Dim} \approx +80.8$) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates
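
An entropy-matched temperature control of the kind the abstract describes can be sketched as a bisection over the temperature. This is an illustrative reconstruction, not the authors' protocol: the function names, the search bounds, and the toy logits are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                  # stabilise the exponentials
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]                     # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log(p)))

def match_temperature(logits, target_entropy, lo=1e-3, hi=100.0, iters=100):
    """Bisection on temperature. Softmax entropy is non-decreasing in the
    temperature, so the search interval can be halved at every step."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits, mid)) > target_entropy:
            hi = mid                 # too flat: lower the temperature
        else:
            lo = mid                 # too sharp: raise the temperature
    return 0.5 * (lo + hi)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy next-token logits
temp = match_temperature(logits, target_entropy=0.2)
sharp = softmax(logits, temp)
```

Temperature scaling only rescales the logits, so the token ranking in `sharp` is the same as in `logits`. A mechanism that reorders ranks depending on context, as the abstract reports for hyperfitting, cannot be reproduced this way.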

2605.22549 2026-05-22 stat.ML cs.LG

A Martingale Kernel Independence Test

Felix Laumann, Zhaolu Liu, Mauricio Barahona

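AI summary: This paper proposes two studentised statistics that, through self-normalisation and half-sample splitting, enable independence testing without permutation calibration, markedly improving computational efficiency while matching the test performance of permutation-calibrated baselines.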

Abstract

The Hilbert-Schmidt Independence Criterion (HSIC) and its joint-independence extension $d\mathrm{HSIC}$ are degenerate $V$-statistics whose data-dependent weighted-$χ^2$ null limits force a permutation calibration that multiplies the per-test cost by the number of permutations, in practice two to three orders of magnitude. Adapting the recent martingale MMD construction for two-sample testing to the (joint) independence problem, we introduce two studentised statistics whose null distributions are standard normal regardless of the data law, so that a single normal-quantile lookup replaces the permutation step entirely. The first, $m\mathrm{HSIC}$, is a self-normalised lower-triangular sum of the Hadamard product of two empirically centred Gram matrices. Under independence and bounded-fourth-moment kernels it converges to a standard normal. It is consistent against every fixed alternative, and runs at quadratic cost in the sample size without any sample split, matching the biased HSIC $V$-statistic. Our second statistic, $md\mathrm{HSIC}$, achieves finite-sample consistency with a single half-sample split: the centring is estimated on one half and the lower-triangular self-normalised martingale is run on the other, shrinking the conditional-mean residual to a quantity that is exponentially small in $d$, so the statistic is asymptotically standard normal at every fixed number of jointly tested variables, with a per-test cost that grows only linearly in $d$. On synthetic data with per-variable input dimension from $1$ to $500$ and between $2$ and $10$ jointly tested variables, both statistics match the empirical type-I error rate and test power of permutation-calibrated baselines while running $25$ to $60\times$ faster.
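
For context, the permutation-calibrated baseline that the paper aims to replace can be sketched as below. This shows only the classical biased HSIC $V$-statistic with a permutation p-value, not the proposed martingale statistics, whose exact construction is not given in the abstract. The Gaussian kernel, its bandwidth, and the function names are assumptions.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Gaussian-kernel Gram matrix for samples stored as rows of x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def centre(k):
    """Double-centre a symmetric Gram matrix (equivalent to H K H)."""
    return k - k.mean(axis=0, keepdims=True) - k.mean(axis=1, keepdims=True) + k.mean()

def hsic_v(k_centred, l):
    """Biased HSIC V-statistic, trace(H K H L) / n^2, at quadratic cost."""
    n = l.shape[0]
    return float(np.sum(k_centred * l) / n ** 2)

def permutation_pvalue(x, y, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    kc, l = centre(gaussian_gram(x)), gaussian_gram(y)
    observed = hsic_v(kc, l)
    count = 0
    for _ in range(n_perm):          # every permutation repeats the O(n^2) work
        idx = rng.permutation(len(l))
        if hsic_v(kc, l[np.ix_(idx, idx)]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Toy dependent pair: y is a noisy copy of x.
rng = np.random.default_rng(1)
x = rng.normal(size=60)
y = x + 0.1 * rng.normal(size=60)
p_dep = permutation_pvalue(x, y)
```

The loop over `n_perm` is the cost multiplier the abstract refers to. A statistic with a known standard normal null would replace that loop with a single quantile comparison.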

2605.22507 2026-05-22 cs.LG stat.ML

Generative Modeling by Value-Driven Transport

Pablo Moreno-Muñoz, Adrian Müller, Gergely Neu

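AI summary: This paper proposes a new generative modeling framework based on a discrete-time stochastic control formulation of measure transport, in which the dual variables of a linear program directly encode the optimal control policy. It develops an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and value-driven transport (VDT) policies, which show strong performance and good scalability potential across several experiments.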

Abstract

We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the \emph{optimal value function} of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated \emph{value-driven transport} (VDT) policies which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.

2605.22481 2026-05-22 cs.LG math.ST stat.TH

When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks

Donald Flynn, Hadas Yaron Goldhirsh, Jonathan P. Keating, Inbar Seroussi

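AI summary: This paper studies the behaviour of backdoor poisoning attacks in high dimensions, finds that stronger training triggers can help the defender, and uses a high-dimensional theory to analyse the core mechanisms and factors that shape backdoor attacks.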

Abstract

Backdoor poisoning attacks behave counter-intuitively in high dimensions: stronger training triggers can help the defender. We study regularised generalised linear models on Gaussian-mixture data in the proportional regime ($p/n \to κ$), varying the training trigger strength $α$ against a fixed test trigger. Three phenomena emerge: (i) clean test accuracy increases with $α$; (ii) attack success peaks at a finite $α$ and then declines; and (iii) the most damaging trigger direction is the minimum eigenvector of the data covariance. We prove all three results in closed form for the squared loss, and extend (i) and (ii) to general convex GLM losses via a Gaussian-proxy fixed-point system. We identify a finite-sample noise floor proportional to $κ$ as the mechanism behind (i), invisible to classical $n \gg p$ analysis. Experiments on CIFAR-10 and Gaussian surrogates match the theory closely; ResNet-18 experiments show the same phenomena beyond the convex setting.

2605.22438 2026-05-22 stat.ML cs.GT cs.LG

Do Not Trust The Auctioneer: Learning to Bid in Feedback-Manipulated Auctions

Luigi Foscari, Matilde Tullii, Vianney Perchet

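AI summary: This paper studies learning to bid in feedback-manipulated auctions. It proposes an algorithm that combines a robust interval-elimination branch with an optimistic branch to handle the challenges created by feedback manipulation, and provides a matching lower bound in the single-active-region case.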

Abstract

Shilling is the use of artificial bids to make competition appear stronger and push prices upward. We study repeated first-price auctions in which shilling affects feedback but not allocation: the learner wins or loses against the real competing bid, but after a loss observes the maximum of the real bid and an independent shill bid. Thus the manipulation changes what the learner observes and hence how it learns to bid, without changing the outcome of the current auction. We analyze regret with respect to the best bid benchmark, assuming that the shill-bid distribution is known. Even then, shilling can mask the real bid, while useful side information appears only through intermittent low-shill events. Our algorithm combines a robust interval-elimination branch, which ignores the shilled report and achieves the dynamic-pricing rate $\tilde{\mathcal{O}}(T^{2/3})$, with an optimistic branch that debiases losing-side reports and exploits the resulting suffix information when it is reliable and achieves the first-price auctions rate $\tilde{\mathcal{O}}(\sqrt{T})$. A validation and racing procedure lets the algorithm use these optimistic updates without knowing the right scale or feedback geometry in advance. We complement the upper bounds with a matching lower bound, up to logarithmic factors, in the single-active-region case. Overall, the results show that even feedback-only shilling can sharply alter the statistical difficulty of repeated bidding.
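
The feedback model in the abstract (allocation decided by the real bid, losing-side report inflated by the shill bid) can be written as a single round. This is a sketch of the setting only, not of the paper's algorithm. The function name, the tie-breaking rule, and the assumption that a winner receives no report are all illustrative choices not stated in the abstract.

```python
def auction_round(bid, value, real_bid, shill_bid):
    """One round of a first-price auction with feedback-only shilling.

    The outcome depends only on the real competing bid. After a loss, the
    learner sees max(real_bid, shill_bid) instead of the real bid.
    """
    if bid >= real_bid:   # assumption: ties go to the learner
        # Assumption: no competing-bid report is shown after a win.
        return {"won": True, "utility": value - bid, "report": None}
    return {"won": False, "utility": 0.0, "report": max(real_bid, shill_bid)}

# The same loss yields different reports depending on the shill bid.
masked = auction_round(bid=0.3, value=1.0, real_bid=0.4, shill_bid=0.9)
revealed = auction_round(bid=0.3, value=1.0, real_bid=0.4, shill_bid=0.2)
```

In `masked` the report hides the real bid, while in `revealed` a low shill bid lets it through, which corresponds to the intermittent low-shill events the abstract mentions as the only source of useful side information.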

2605.22374 2026-05-22 cs.NE stat.ML

Guiding Multi-Objective Genetic Programming with Description Length Improves Symbolic Regression Solutions

Gabriel Kronberger, Fabricio Olivetti de Franca, Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira

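AI summary: This paper studies how description length (DL) and the fractional Bayes factor (FBF) can serve as data-efficient alternatives to heuristics for selecting compact expressions that generalise well, thereby improving symbolic regression solutions.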

Abstract

Symbolic regression with genetic programming (GPSR) may suffer from overfitting and structural bloat, especially when noise is present. In this paper we evaluate description length (DL) and fractional Bayes factor (FBF) criteria as principled, data-efficient alternatives to heuristics for selecting compact expressions that generalise well. We implement DL using a Fisher-information-based parameter encoding and compare it to AIC and BIC across multiple datasets, including noisy synthetic benchmarks and real-world regression problems. We study three search/selection strategies: (i) multi-objective search for accuracy and program length followed by DL/FBF selection; (ii) multi-objective search using DL directly as an objective; and (iii) single-objective optimisation with DL/FBF as the fitness. Across datasets we find that DL/FBF post-selection improves test performance compared to AIC/BIC baseline and that BIC in combination with the same function complexity penalty from DL/FBF produces similar results. In contrast, using DL/FBF directly as a fitness function in single-objective GPSR frequently induces premature convergence to overly simple models. We conclude with practical guidance for using DL/FBF as robust model-selection tools in genetic programming workflows.
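
The AIC and BIC baselines used for comparison can be sketched for two candidate expressions under a Gaussian noise model. Only these two standard criteria are shown; the DL and FBF criteria of the paper are not reproduced here because the abstract does not give their formulas. The residual sums of squares and parameter counts below are made-up illustrative numbers.

```python
import math

def gaussian_neg2_loglik(rss, n):
    """-2 log-likelihood of a Gaussian model at the MLE noise variance rss/n."""
    return n * (math.log(2.0 * math.pi * rss / n) + 1.0)

def aic(rss, n, k):
    """Akaike information criterion with k fitted parameters."""
    return gaussian_neg2_loglik(rss, n) + 2.0 * k

def bic(rss, n, k):
    """Bayesian information criterion with k fitted parameters."""
    return gaussian_neg2_loglik(rss, n) + k * math.log(n)

n = 100
simple = {"rss": 12.0, "k": 2}     # hypothetical compact expression
complex_ = {"rss": 10.5, "k": 8}   # hypothetical larger expression

aic_prefers_complex = aic(complex_["rss"], n, complex_["k"]) < aic(simple["rss"], n, simple["k"])
bic_prefers_simple = bic(simple["rss"], n, simple["k"]) < bic(complex_["rss"], n, complex_["k"])
```

With these numbers the two criteria disagree: the lighter AIC penalty accepts the larger expression, while BIC keeps the compact one. This illustrates why the choice of complexity penalty matters for post-selection.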

2605.22354 2026-05-22 stat.ME eess.SP

From Volterra Series to Kunchenko Stochastic Polynomials: Half a Century of Non-Gaussian Estimation Methodology

Serhii Zabolotnii

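AI summary: This paper reviews the half-century evolution of the scientific school founded by Yuriy P. Kunchenko (1939-2006) and its semiparametric methodology for non-Gaussian estimation. Starting with Kunchenko's 1972/1973 dissertation applying Volterra series to estimate parameters of random processes, the trajectory is followed through 2006-2026. Kunchenko stochastic polynomials are presented as a coherent family of moment-cumulant procedures: the polynomial maximization method (PMM) for parameter estimation, polynomial criteria for hypothesis testing, and decomposition in spaces with a generating element. The paper details the school's structure: a verified genealogy of 15 defended dissertations, collaborations in Poland, Slovakia, and Germany, and the R package EstemPMM. It analyses a 2026 paper on Volterra-based signal processing, showing how Kunchenko's nonlinear formulation reappears in applied radio engineering. It builds a formal bridge between finite Volterra models and generalized Kunchenko polynomials while separating the MMSE/L2 criterion from PMM: the former is a covariance projection for kernel adaptation, whereas PMM is a parameter-dependent moment procedure. PMM efficiency claims are conditional: gains require that moments exist, the centered correlant matrix is nondegenerate, and the variance reduction coefficient is below one. The concluding research program turns the historical reconstruction into testable statistical and signal-processing tasks.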

Comments: Bilingual submission: English followed by Ukrainian translation

Abstract

This paper reconstructs the half-century evolution of the scientific school founded by Yuriy P. Kunchenko (1939--2006) as the development of a semiparametric methodology for non-Gaussian estimation. Starting with Kunchenko's 1972/1973 dissertation applying Volterra series to estimate parameters of random processes, the trajectory is followed through 2006--2026. Kunchenko stochastic polynomials are presented as a coherent family of moment-cumulant procedures: the polynomial maximization method (PMM) for parameter estimation, polynomial criteria for hypothesis testing, and decomposition in spaces with a generating element. The paper details the school's structure: a verified genealogy of 15 defended dissertations, collaborations in Poland, Slovakia, and Germany, and the R package EstemPMM. A recent 2026 paper on Volterra-based signal processing is analyzed, showing how Kunchenko's nonlinear formulation reappears in applied radio engineering. We build a formal bridge between finite Volterra models and generalized Kunchenko polynomials, while separating the MMSE/L2 criterion from PMM: the former is a covariance projection for kernel adaptation, whereas PMM is a parameter-dependent moment procedure. PMM efficiency claims are stated conditionally: gains require that moments exist, the centered correlant matrix is nondegenerate, and the variance reduction coefficient is below one. The concluding research program operationalizes the historical reconstruction into testable statistical and signal-processing tasks.

2605.22352 2026-05-22 q-bio.PE math.ST stat.AP stat.CO stat.ME stat.TH

Spatiotemporal dynamics and ecological risk factors of highly pathogenic avian influenza A(H5N1) in Canadian wildlife: A One Health surveillance analysis

Hammed Olawale Fatoyinbo, Hoyeon Jeong

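AI summary: This study analyses Canadian wildlife H5N1 surveillance data from 2022 to 2026 to reveal the virus's spatiotemporal dynamics and the risk factors associated with detection counts. Using descriptive epidemiology, spatial clustering methods, and negative binomial mixed models, it finds that Eurasian-North American viral lineages dominate detections and identifies year, season, and lineage as key predictors.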

Abstract

Highly pathogenic avian influenza A(H5N1) has expanded geographically and ecologically, affecting wild birds, mammalian wildlife, domestic animals, and humans. Wildlife surveillance provides critical early warning for One Health preparedness, yet national-scale analyses integrating host ecology, spatial patterns, seasonality, viral lineage, and risk factors remain limited. This study analysed Canadian wildlife HPAI A(H5N1) surveillance records from 2022 to 2026 to characterise spatiotemporal dynamics and identify factors associated with detection counts. A retrospective analysis of 2,657 detections across 13 provinces and territories was conducted using descriptive epidemiology, spatial clustering methods, and Negative Binomial mixed models. Detections were predominantly avian, with waterfowl and raptors as the major host groups, while mammals accounted for a smaller but epidemiologically important proportion. Detection burden was highest in 2022, with increased activity in autumn and spring. Ontario, Alberta, and British Columbia were identified as major hotspots, with evidence of local clustering in parts of the Prairie region. Reassortant Eurasian-North American lineages dominated detections and were strongly associated with higher detection counts. Modelling results identified year, season, and lineage as key predictors. These findings support risk-based One Health surveillance prioritising high-burden regions, migration-associated periods, key avian host groups, reassortant viral lineages, and continued monitoring of mammalian wildlife.

2605.22301 2026-05-22 stat.ME

Chained Markov melding using divide and conquer sequential Monte Carlo

基于分治序贯蒙特卡洛的链式马尔可夫融合

Yixuan Liu, Robert J. B. Goudie

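AI summary: This paper proposes a new multi-stage sampler for chained Markov models with an arbitrary number of submodels. It uses divide-and-conquer sequential Monte Carlo to address the posterior inference difficulties that existing MCMC-based methods face when adjacent submodels share common quantities.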

详情
AI中文摘要

指定一个整合多个数据源的完整贝叶斯模型可以具有挑战性。一种自然的方法是分别指定每个个体模型并在之后进行连接。这在马尔可夫融合中采用的方法。然而,当相邻的子模型共享共同的量时,如链式马尔可夫融合,现有基于MCMC的方法的后验推断会变得具有挑战性。在本文中,我们提出了一种新的多阶段采样器,用于涉及任意数量子模型的链式马尔可夫模型。所提出的采样器采用分治策略的序列蒙特卡洛方法,以适合链式马尔可夫融合结构的树状结构模型。所得到的多阶段采样器为从复杂的联合模型中采样提供了一种灵活的替代方法,因为其对不同子模型的单独采样方案避免了直接从完整模型中采样的需求。我们通过两个例子展示了该采样器的应用。第一个是涉及11种不同类型子模型的玩具示例。第二个示例考虑了一个整合生态人口模型,结合多个数据集以估计移民和繁殖率。

英文摘要

Specifying a full Bayesian model that integrates multiple data sources can be challenging. One natural approach is to specify each individual model separately and join them afterwards. This is the approach adopted in Markov melding. However, when adjacent submodels share common quantities, as in chained Markov melding, posterior inference can be challenging for existing MCMC-based approaches. In this paper, we propose a new multi-stage sampler for chained Markov models involving an arbitrary number of submodels. The proposed sampler adopts a divide-and-conquer sequential Monte Carlo approach for the tree-structured model that fits naturally with the structure of chained Markov melding. The resulting multi-stage sampler provides a flexible alternative for sampling from complex joint models, as its separate sampling scheme for different submodels avoids the need for directly sampling from the full model. We demonstrate applications of the sampler through two examples. The first is a toy example involving 11 submodels of various types. The second example considers an ecologically integrated population model that combines multiple datasets to estimate immigration and reproduction rates.

2605.22284 2026-05-22 stat.CO cs.GR

moveEZ: An R Package for Animated Biplots

moveEZ:用于动画双图的R包

Raeesa Ganey, Johané Nienkemper-Swanepoel

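AI summary: This work presents moveEZ, an R package for building animated PCA biplots that show how multivariate structure evolves across the ordered levels of a categorical variable. It supports high-dimensional datasets and grouped structures, and integrates with gganimate to produce high-quality animations.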

详情
Comments
R package
AI中文摘要

moveEZ(发音为move easy)R包提供了构建动画PCA双图的工具,这些动画能够揭示多变量结构如何在有序分类变量的层次上演变。该包作为biplotEZ的扩展,提供了三种递增的方法学复杂度的动画框架:固定变量框架,其中变量向量保持不变,仅样本位置被动画化;以及两个动态框架,其中在每个层次中样本位置和变量向量都会被重新计算并动画化。动态框架支持Procrustes对齐和反射以确保不同层次之间的视觉连续性,并且适用于高维数据集,包括分组结构。该包可与gganimate集成以生成适用于出版和演示的高质量动画,并通过单个参数支持动画和静态分面显示。尽管最初是为跟踪非洲气候指标的偏移量而设计,但moveEZ是领域无关的,适用于任何在有序分类变量层次上重复记录多变量测量的场合,包括经济、生态和生物领域。

英文摘要

The moveEZ (pronounced move easy) R package provides tools for constructing animated PCA biplots that reveal how multivariate structure evolves across the ordered levels of a categorical variable. Built as an extension to the biplotEZ package, moveEZ offers three animation frameworks of increasing methodological complexity: a fixed variable frame, in which variable vectors remain constant and only sample positions are animated; and two dynamic frames, in which both sample positions and variable vectors are recomputed and animated at each level. The dynamic frames support Procrustes alignment and reflection to ensure visual continuity across levels, and are compatible with high-dimensional datasets including grouped structures. The package integrates with gganimate to produce high-quality animations suitable for publications and presentations, and supports both animated and static faceted displays via a single argument. Although originally motivated by tracking shifts in African climate indicators, moveEZ is domain-agnostic and applicable wherever multivariate measurements are recorded repeatedly across an ordered categorical variable, including economic, ecological, and biological settings.
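The Procrustes alignment and reflection step mentioned above can be illustrated with a minimal Python sketch. It is not the moveEZ API (moveEZ is an R package), and the function name `align_2d` is hypothetical. The sketch assumes each frame is a list of centred 2D sample coordinates in the same row order, and it uses the closed-form least-squares rotation that exists in two dimensions.

```python
import math


def align_2d(points, reference, allow_reflection=True):
    """Rotate (and optionally reflect) centred 2D points onto a reference frame."""

    def rotate_to_reference(pts):
        # Closed-form least-squares rotation angle for paired 2D point sets.
        s = sum(px * ry - py * rx for (px, py), (rx, ry) in zip(pts, reference))
        c = sum(px * rx + py * ry for (px, py), (rx, ry) in zip(pts, reference))
        theta = math.atan2(s, c)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        rotated = [(px * cos_t - py * sin_t, px * sin_t + py * cos_t)
                   for px, py in pts]
        # Sum of squared distances to the reference after rotation.
        error = sum((ax - rx) ** 2 + (ay - ry) ** 2
                    for (ax, ay), (rx, ry) in zip(rotated, reference))
        return error, rotated

    candidates = [rotate_to_reference(points)]
    if allow_reflection:
        # Mirror across the x-axis, then find the best rotation again.
        mirrored = [(px, -py) for px, py in points]
        candidates.append(rotate_to_reference(mirrored))
    # Keep whichever candidate lies closest to the reference frame.
    return min(candidates, key=lambda item: item[0])[1]


reference = [(2.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (-1.0, -1.0)]
quarter_turn = [(-y, x) for x, y in reference]  # same shape, rotated by 90 degrees
aligned = align_2d(quarter_turn, reference)
```

Aligning every frame to the previous one in this way keeps consecutive frames from flipping or spinning arbitrarily, which is the visual continuity the abstract describes.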

2605.22253 2026-05-22 stat.ME

Bayesian Nonparametrics: Principles and Practice

贝叶斯非参数方法:原理与实践

Nils Lid Hjort, Chris Holmes, Peter Mueller, Stephen G. Walker

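AI summary: This extended preface covers the core principles and practical use of Bayesian nonparametrics, along with the background and history of the field and likely future directions.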

详情
Comments
16 pages, no figures. This is the authors' extended preface to and published in modified form in the book Bayesian Nonparametrics, Cambridge University Press, 2010, sketching the history of Bayesian Nonparametrics, pointing to developments and application domains, etc
AI中文摘要

本前言[针对《贝叶斯非参数方法》一书,剑桥大学出版社,2010年出版,由NL Hjort, CC Holmes, P Mueller, SG Walker所著]旨在解释为何你对贝叶斯非参数方法感到好奇是正确的——为什么你实际上可能需要它,以及如何理解和使用它。前言也作为引言章节,概述了本书的目标和内容。我们还解释了这本书为何诞生的背景,简要回顾了仍相对年轻的贝叶斯非参数领域的历史,并提供了一些结论性的评论,涉及该领域的各种挑战和可能的未来发展方向。

英文摘要

This extended preface [to the book 'Bayesian Nonparametrics', Cambridge University Press, 2010, by NL Hjort, CC Holmes, P Mueller, SG Walker] is meant to explain why you are right to be curious about Bayesian nonparametrics -- why you may actually need it and how you can manage to understand it and use it. The preface also serves as an introductory chapter, giving an overview of the aims and contents of the book. We also explain the background for how the book came into existence, delve briefly into the history of the still relatively young field of Bayesian nonparametrics, and offer some concluding remarks pertaining to various challenges and likely future developments of the area.

2605.22243 2026-05-22 cs.LG cs.AI stat.AP

Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

为高维预测研究的数据驱动设计开发可解释的AI

Junyu Yan, Damian Machlanski, Kurt Butler, Panagiotis Dimitrakopoulos, Ewen M Harrison, Bruce Guthrie, Sotirios A Tsaftaris

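AI summary: This paper presents an explainable AI recommender that uses a data-driven approach to improve the predictive performance of existing interpretable statistical models. Its main contribution is to use explainable AI techniques to produce three types of recommendation (feature exclusion, non-linear terms, and feature interactions) that improve predictive performance while keeping the model transparent.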

详情
Comments
41 pages, 7 figures
AI中文摘要

预测建模在健康数据分析和数据驱动的临床决策中非常重要。然而,当需要选择、转换或交互建模数十甚至数百个特征时,手动优化预测研究具有挑战性。尽管复杂的机器学习模型具有高性能,但其“黑盒”性质限制了临床信任、透明度和决策所需的可解释性。我们开发并评估了一种探索性AI推荐器,以提供数据驱动的推荐,从而提高现有可解释统计模型的预测性能。所开发的框架使用灵活的AI建模来捕捉复杂的数据模式,并利用可解释AI技术将这些模式转化为三种推荐类型:特征排除、非线性项和特征交互。我们通过比较基线(即无交互或非线性项)Cox比例风险(CPH)模型与增强的CPH模型(包含由我们方法建议的推荐)的预测性能来评估该框架。主要分析预测245,614名患者首次发生跌倒或相关伤害的时间。我们的方法推荐排除23个特征,包括两个特征的非线性项,以及包含221个建议的特征交互。C指数从0.805(95% CI 0.798-0.812)提高到0.815(95% CI 0.809-0.822),校准也有所改善(截距:-0.006到0.003;斜率:1.063到0.950)。所有推荐均得到现有文献的支持。该方法还证明在两个额外的公共数据集上有效,显示了更广泛的应用性。所提出的探索性AI推荐器展示了可解释AI和数据驱动研究设计在提高高维透明预测模型开发过程和性能方面的潜力。

英文摘要

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, including non-linear terms for two features, and including 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and so did calibration (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve both the process of developing high-dimensional transparent predictive models and their performance.
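To make the C-index comparison above concrete, here is a minimal Python sketch of Harrell's concordance index for right-censored data. It assumes that higher risk scores mean earlier events and that tied risks count as half-concordant. Which C-index variant the paper actually used is not stated in the abstract, so this illustrates the metric in general and is not the authors' evaluation code.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    times  -- observed follow-up times
    events -- 1 if the event was observed, 0 if censored
    risks  -- model risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Subject 0 fails first, subject 1 is censored, subject 2 fails last.
c = concordance_index(times=[1, 2, 3], events=[1, 0, 1], risks=[0.5, 0.9, 0.1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported change from 0.805 to 0.815 in context.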

2605.22188 2026-05-22 cs.LG math.OC stat.ML

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

从顺序节点到GPU批处理:并行分支限界法用于最优k-稀疏广义线性模型

Jiachang Liu, Andrea Lodi

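AI summary: This paper proposes a CPU-GPU framework that processes branch-and-bound nodes in batches on the GPU. It substantially speeds up optimization problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models.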

详情
AI中文摘要

GPU在大规模优化的一阶方法中显著加速了计算,尤其是在连续优化中。然而,这种成功并未顺利转移到具有离散变量、组合结构和非线性目标的问题中,例如验证卡数约束下的广义线性模型的最优解。主要挑战包括分支限界(BnB)中异构节点的顺序处理以及CPU和GPU之间频繁的数据移动。我们提出了一种简单、通用且模块化的CPU-GPU框架,该框架可以在GPU上批量处理多个BnB节点。该框架围绕一组GPU高效的子程序构建,并利用填充和轻量级自定义内核来处理不规则的节点数据结构。实验表明,该框架在挑战性实例上实现了1到2个数量级的加速,并且在最优性间隙方面达到了零。该框架还可以扩展以收集整个Rashomon集,从而启用下游的统计分析,如变量重要性分析和在二次用户特定度量(例如分类中的AUC)下的模型选择。

英文摘要

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).

2605.22181 2026-05-22 stat.OT

A critical comparison of handling zeros in high-dimensional compositional count data

高维组成计数数据中零值处理的批判性比较

Wenqi Tang, Kamila Fačevicová, Klaus Nordhausen, Sara Taskinen

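AI summary: This review examines zero handling in high-dimensional compositional count data, analyses the limitations of existing methods, and calls for future frameworks that jointly account for compositional constraints, zero inflation, and the lattice nature of count data.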

详情
Comments
34 pages, 28 figures. Submitted manuscript
AI中文摘要

高通量测序(HTS)的广泛应用使得大规模产生组成计数数据成为可能,推动了微生物组研究的进步。然而,此类计数数据往往具有高维性、过分散性和严重的零膨胀特性,与基于对数比的组成数据分析(CoDA)所依赖的连续性假设相冲突,造成了显著的方法学挑战。本文综述了组成数据中零值处理策略,涵盖零容忍转换、对四舍五入零的插值方法以及针对必要零的统计模型。我们特别强调了将对数比框架应用于测序衍生的组成计数数据时出现的问题,其中连续性假设的违反可能导致数值不稳定性及有偏的统计推断。受这些问题的启发,本文系统地考察了现有插值策略在适应离散、零膨胀的计数数据时的表现,包括评估数据的离散、格子值性质对插值性能的影响。总体而言,本文整合了分散的方法学发展,明确了合适的应用场景,并识别了需要未来零值处理框架解决的开放性挑战,这些框架能够同时容纳组成约束、零膨胀和计数数据的格子性质,同时详细讨论了比较结果。

英文摘要

The growing use of high-throughput sequencing (HTS) has enabled the large-scale production of compositional count data, driving progress in microbiome research. However, such count data are often high-dimensional, over-dispersed, and heavily zero-inflated, and they conflict with the continuity assumptions underlying log-ratio-based compositional data analysis (CoDA), creating substantial methodological challenges. This review provides an overview of zero-handling strategies in compositional data, covering zero-tolerant transformations, imputation approaches for rounded zeros, and statistical models for essential zeros. We specifically highlight the problems that arise when applying the log-ratio framework to sequencing-derived compositional count data, where violations of continuity can induce numerical instabilities and biased statistical inferences. Motivated by these issues, we systematically examine how existing imputation strategies behave when adapted to discrete, zero-inflated count data, including an evaluation of how the discrete, lattice-valued nature of the data affects imputation performance. Overall, this review consolidates scattered methodological developments, clarifies appropriate use cases, and identifies open challenges that motivate future zero-handling frameworks capable of jointly accommodating compositional constraints, zero inflation, and the lattice nature of count data, while also providing a detailed discussion of the comparison results.

2605.22124 2026-05-22 stat.ML cs.LG math.PR

From Betting to Empirical Bernstein LIL

从赌局到经验伯恩斯坦LIL

Francesco Orabona

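AI summary: This report derives the law of the iterated logarithm (LIL) from the guarantee on the wealth of an online betting strategy, leading to an empirical Bernstein LIL.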

详情

英文摘要

This is a verbatim copy of a technical report I wrote in 2017-2018 to obtain the law of the iterated logarithm using the guarantee on the wealth of an online betting strategy.

2605.22111 2026-05-22 cs.LG cs.CE stat.ML

Aerodynamic force reconstruction using physics-informed Gaussian processes

利用物理信息高斯过程进行气动力重建

Gledson Rodrigo Tondo, Igor Kavrakov, Guido Morgenthal

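AI summary: This paper proposes a probabilistic physics-informed machine learning method that reconstructs the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The model avoids overfitting, needs no regularization scheme, and can use heterogeneous and multi-fidelity training data.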

详情
AI中文摘要

准确建模气动载荷对于理解和预测复杂结构系统的响应至关重要。然而,这些模型往往依赖于真实物理力的简化,引入假设可能会限制其准确性。在存在噪声或不完整数据的情况下,验证这些模型变得特别具有挑战性。为此,我们介绍了一种概率物理信息机器学习方法,旨在从结构动态响应的噪声测量中重建底层气动载荷。该模型避免了过拟合,消除了对正则化方案的需要,并允许在训练过程中使用异质和多保真度数据。通过重建大贝尔东桥在线性非稳态假设下的气动载荷,证明了该方法的有效性。结果表明,真实和预测载荷之间有很强的一致性,特别是在均方误差、幅度、相位角和信号峰值值方面。该载荷重建方法具有广泛的应用前景,如模型验证、未来载荷估计和结构损伤预测。

英文摘要

Accurate modeling of aerodynamic loads is essential for understanding and predicting the responses of complex structural systems. However, these models often rely on simplifications of the true physical forces, introducing assumptions that can limit their accuracy. Validating such models becomes particularly challenging in the presence of noisy or incomplete data. To address this, we introduce a probabilistic physics-informed machine learning approach designed to reconstruct the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The model avoids overfitting, eliminates the need for regularization schemes, and allows for the use of heterogeneous and multi-fidelity data during the training process. The efficacy of the approach is demonstrated through the reconstruction of aerodynamic loads on the Great Belt East Bridge, simulated under a linear unsteady assumption. Results show strong agreement between true and predicted loads, particularly in terms of root mean squared error, magnitude, phase angle, and peak values of the signals. The load reconstruction method has broad applicability, for example in model validation, future load estimation, and structural damage prognosis.

2605.22110 2026-05-22 stat.ME

Two-stage Ensemble Clustering of Functional Data Using Random Projections

基于随机投影的函数数据双阶段集成聚类

Sourav Chakrabarty, Anirvan Chakraborty, Shyamal K. De

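AI summary: This paper proposes a two-stage method for clustering functional data using random projections generated from Gaussian processes. The projected high-dimensional representations are clustered with the Mean Absolute Difference of Distances (MADD), data-driven projection directions then refine the partition, and simulations and real-data applications show high accuracy across many functional data settings.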

详情
Comments
32 pages, 6 figures, 7 tables
AI中文摘要

我们提出了一种计算上简单的框架,用于基于高斯过程生成的随机投影对函数数据进行聚类。在该方法中,每条曲线首先被投影到大量的独立高斯过程实现上。所得的高维表示通过均绝对距离度量(MADD)进行聚类,这是一种适合高维设置的不相似性度量。对这种不相似性的总体分析提供了随机投影如何捕捉功能群体之间分布差异的见解。我们引入了第二阶段的聚类,以进一步利用数据驱动的投影方向。因此,在第一阶段,使用一组预指定的投影家族获得初始聚类。在第二阶段,通过基于估计协方差算子构建的高斯随机投影对这一划分进行细化。最后,使用归一化成本函数在候选解中选择最佳聚类。所提出的聚类算法广泛适用于各种函数数据场景,包括不规则和部分观测数据。通过广泛的模拟和实际数据应用,我们证明所提出的方法在多种函数数据设置中实现了高准确性,并在许多最先进的方法中表现更优。

英文摘要

We propose a computationally simple framework for clustering functional data based on Gaussian-process-generated random projections. In this approach, each curve is first projected onto a large collection of independent Gaussian process realizations. The resulting high-dimensional representations are clustered using the Mean Absolute Difference of Distances (MADD), a dissimilarity measure well suited for high-dimensional settings. A population-level analysis of this dissimilarity provides insight into how random projections help capture distributional differences between functional populations. We introduce a second stage of clustering to additionally leverage data-driven projection directions. Thus, in Stage I, an initial clustering is obtained using a set of prespecified projection families. In Stage II, this partition is refined by constructing Gaussian random projections based on an estimated covariance operator that uses the first-stage cluster labels. Finally, a normalized cost function is used to select the optimal clustering among candidate solutions. The proposed clustering algorithm is broadly applicable to diverse functional data regimes including irregular and partially observed data. Through extensive simulations and real-data applications, we show that the proposed method achieves a high degree of accuracy and outperforms many of the state-of-the-art methods across a wide range of functional data settings.
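The MADD dissimilarity named above can be sketched in a few lines of Python. This follows the commonly cited form of MADD: for observations i and j, average the absolute difference between their distances to every other observation. The use of plain Euclidean distance and the function name `madd_matrix` are assumptions made for illustration, and the paper's exact normalisation may differ. Each row of `points` stands for one curve's vector of random-projection scores.

```python
import math


def madd_matrix(points):
    """Pairwise Mean Absolute Difference of Distances for a list of vectors."""
    n = len(points)
    if n < 3:
        raise ValueError("MADD needs at least three observations")
    # Ordinary Euclidean distances between all pairs (an assumed base distance).
    dist = [[math.dist(p, q) for q in points] for p in points]
    madd = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compare how i and j each sit relative to every third point k.
            total = sum(abs(dist[i][k] - dist[j][k])
                        for k in range(n) if k != i and k != j)
            madd[i][j] = madd[j][i] = total / (n - 2)
    return madd


scores = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
d = madd_matrix(scores)
```

Two observations that see the rest of the sample from the same distances get a MADD of zero even if the raw distances are large, so the resulting matrix can be passed to any clustering routine that accepts a dissimilarity matrix.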

2605.19152 2026-05-22 stat.ML cs.ET cs.IT cs.LG cs.NE math.IT physics.optics

Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration

平稳物理系统的信息处理能力:理论、数据高效估计方法与光子学演示

Rahul Uma Ramachandran, Serge Massar

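AI summary: This paper extends the Information Processing Capacity (IPC) framework to stationary physical computing systems, develops data-efficient estimation methods, and validates them experimentally on a photonic computing system.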

详情
Comments
added 2 new references
AI中文摘要

物理计算系统为实现硬件原生机器学习提供了有前景的途径,但其计算能力在原理上、任务无关和数据高效的方式下难以表征。我们扩展了信息处理能力(IPC)框架以适用于 stationary 物理计算系统,并建立了几个基本结果:个体容量在零和一之间被限制,其在完整基底上的总和受读数数量的限制,噪声严格减少这个界限。我们处理有限样本的 IPC 估计,并推导了影响朴素估计器的系统性正偏倚的渐近形式。基于这些结果,我们引入了基于 Richardson 推理和 Sobol 准随机采样的数据高效估计方法。我们通过基于皮秒激光脉冲在非线性光纤中传播的光子计算系统实验验证了该框架。通过改变激光功率和光纤长度,我们观察到由 Kerr 效应诱导的 IPC 分布系统性地向高阶非线性容量偏移。最后,我们证明了总 IPC 与基准机器学习任务的性能强相关,并提供了系统有效维度的可靠估计。这些结果确立了 IPC 作为连接物理计算系统内在动态与其机器学习性能的实用桥梁。

英文摘要

Physical computing systems provide a promising route toward hardware-native machine learning, but their computational capabilities remain difficult to characterize in a principled, task-independent, and data-efficient way. We extend the Information Processing Capacity (IPC) framework to stationary physical computing systems and establish several fundamental results: individual capacities are bounded between zero and one, their sum over a complete basis is bounded by the number of readouts, and noise strictly reduces this bound. We address the finite-sample estimation of IPC and derive the asymptotic form of the systematic positive bias affecting naive estimators. Building on these results, we introduce data-efficient estimation methods based on Richardson extrapolation and Sobol quasi-random sampling. We validate the framework experimentally using a photonic computing system based on picosecond laser pulses propagating through a nonlinear optical fibre. By varying the laser power and fibre length, we observe systematic shifts of the IPC distribution toward higher-order nonlinear capacities induced by the Kerr effect. Finally, we demonstrate that the total IPC strongly correlates with performance on benchmark machine-learning tasks and provides a reliable estimate of the effective dimensionality of the system. These results establish IPC as a practical bridge between the intrinsic dynamics of physical computing systems and their machine-learning performance.

2605.07870 2026-05-22 cond-mat.dis-nn cs.AI stat.ML

Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

深度网络中的谱动力学:特征学习、异常值逃逸和学习率转移

Clarissa Lauditi, Cengiz Pehlevan, Blake Bordelon

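AI summary: This paper studies how hidden-weight spectra evolve in wide neural networks trained by (stochastic) gradient descent. It develops a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. The framework is applied to infinite-width nonlinear networks in mean-field/μP scaling and to deep linear networks in the proportional high-dimensional limit. In deep linear networks, μP gives width-consistent outlier dynamics and hyperparameter transfer, whereas NTK parameterization gives strongly width-dependent outlier dynamics. Tasks with many outputs, such as ImageNet classification or GPT language modeling, are better described by a restructuring of the spectral bulk, which a toy model with extensive output channels reproduces.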

详情
Comments
Updating related works + discussion
AI中文摘要

我们研究了在宽神经网络中通过(随机)梯度下降训练时隐藏权重谱的演变。我们开发了一种双层动态平均场理论(DMFT),该理论联合跟踪具有尖峰集合的隐藏权重谱动态,其中尖峰方向在随机体上保持统计依赖性。我们将该框架应用于两种设置:(1)无限宽度非线性网络在均值场/μP缩放下,以及(2)深度线性网络在比例高维极限下,其中宽度、输入维度和样本大小以固定比例发散。我们的理论预测了异常值如何随训练时间、宽度、输出尺度和初始化方差演变。在深度线性网络中,μP产生与宽度一致的异常值动态和超参数转移,包括主导NTK模式向稳定性边缘(EoS)的宽度稳定增长。相比之下,NTK参数化表现出强烈依赖宽度的异常值动态,尽管收敛到一个稳定的宽网络极限。我们展示了这种体+异常值图像是描述简单任务的,但涉及大量输出的任务(如ImageNet分类或GPT语言建模)则更适合通过重构谱体来描述。我们开发了一个具有大量输出通道的玩具模型,重现了这一现象,并展示了足够宽的网络下谱边缘仍会收敛。

英文摘要

We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/$μ$P scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $μ$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with small output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that the edge of the spectrum still converges for sufficiently wide networks.

2605.05118 2026-05-22 cs.LG cs.AI stat.ML

On the Wasserstein Gradient Flow Interpretation of Drifting Models

关于漂移模型的Wasserstein梯度流解释

Arthur Gretton, Li Kevin Wenliang, Alexandre Galashov, James Thornton, Valentin De Bortoli, Arnaud Doucet

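AI summary: This note analyses Generative Modeling via Drifting (GMD) through the lens of Wasserstein gradient flows (WGF) and reports three main results. One algorithm proposed by Deng et al. (2026) corresponds to finding the limiting point of a WGF on the KL divergence with Parzen smoothing. The algorithm they actually implemented resembles the fixed point of a WGF on the Sinkhorn divergence but lacks some of its desirable properties. The same idea extends to the limiting points of other WGFs, including the Maximum Mean Discrepancy (MMD), the sliced Wasserstein distance, and GAN critic functions.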

详情
AI中文摘要

最近,Deng等人(2026)提出了生成模型通过漂移(GMD),一种新的生成任务框架。本文通过Wasserstein梯度流(WGF)的视角分析了GMD,即概率测度空间中函数的最速下降路径,配备了最优传输的几何结构。与之前的WGF相关贡献不同,GMD可以被视为直接针对特定WGF流的固定点。我们展示了三个主要结果:首先,Deng等人(2026)提出的一种算法对应于在KL散度上的WGF的极限点,伴有Parzen平滑。其次,Deng等人(2026)实际实现的算法对应于另一种过程,类似于Sinkhorn散度的固定点,但缺乏后者的一些理想特性。第三,同样的想法可以扩展到其他WGF的极限点,包括最大均值差异(MMD)、切线Wasserstein距离和GAN批评者函数。

英文摘要

Recently, Deng et al. (2026) proposed Generative Modeling via Drifting (GMD), a novel framework for generative tasks. This note presents an analysis of GMD through the lens of Wasserstein Gradient Flows (WGF), i.e., the path of steepest descent for a functional in the space of probability measures, equipped with the geometry of optimal transport. Unlike previous WGF-based contributions, GMD can be thought of as directly targeting a fixed point of a specific WGF flow. We demonstrate three main results: first, that one algorithm proposed by Deng et al. (2026) corresponds to finding the limiting point of a WGF on the KL divergence, with Parzen smoothing on the densities. Second, that the algorithm actually implemented by Deng et al. (2026) corresponds to a different procedure, which bears some resemblance to the fixed point of a WGF on the Sinkhorn divergence, but lacks certain desirable properties of the latter. Third, that the same idea can be extended to the limiting point of other WGFs, including the Maximum Mean Discrepancy (MMD), the sliced Wasserstein distance, and GAN critic functions.

2604.20634 2026-05-22 math.PR math.ST stat.TH

Distributional Statistical Models: Weak Moments, Cumulants, and a Central Limit Theorem

分布统计模型:弱矩、累积量及一个中心极限定理

R. Labouriau

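AI summary: This paper proposes a generalised probabilistic framework in which densities are replaced by pairs consisting of a tempered distribution and a Schwartz kernel. It defines weak moments, weak characteristic functions, and weak cumulants of all orders, which extend the classical quantities and retain their key algebraic properties. The main contributions are a systematic algebra of weak cumulants, existence and uniqueness results for a weak moment problem, and a weak central limit theorem that covers cases where the classical theorem fails.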

详情
Comments
45 pages, no figures. Corrected a local error in the formulation and proof of Theorem 6.3; further detailed the proof of the Leibniz estimates in appendix A; updated the reference and discussion of the classical theory of M-determinacy; inserted a statistical interpretation of the structure and continuity of tempered distributions; further detail some proofs and updated references
AI中文摘要

许多重要的统计模型由于矩或矩生成函数不存在而无法使用经典矩方法。我们提出了一种广义的概率框架,其中密度被替换为对 $(T,φ)$,其中 $T \in \mathcal{S}'(\mathbb{R})$ 是 tempered 分布,$φ\in \mathcal{S}(\mathbb{R})$ 是 Schwartz 核。期望通过分布对正则化测试函数的作用定义,从而得到所有阶的弱矩、弱特征函数和弱累积量。这些扩展了经典量并保留了关键的代数性质,如独立性下的可加性和自然的仿射变换规则。主要结果包括:(i) 弱累积量的系统代数;(ii) 一个弱矩问题,其中所有矩的存在性无条件成立,唯一性依赖于核,正则核具有指数尾界和平方可积密度的 Carleman 型准则,以及指数衰减核的 Denjoy-Carleman 准则;(iii) 弱中心极限定理,形式为弱特征函数收敛到高斯极限,覆盖了经典定理失效的情况。该框架通过 t 分布、稳定分布和双曲分布进行了说明。作为统计结果,弱一阶矩在 Cauchy 模型中给出位置参数的一致估计器,其中不存在经典矩方法的估计器。完整的统计处理在一篇配套论文中给出。

英文摘要

Many important statistical models fall outside classical moment-based methods due to the non-existence of moments or moment generating functions. We propose a generalised probabilistic framework in which densities are replaced by pairs $(T,φ)$, where $T \in \mathcal{S}'(\mathbb{R})$ is a tempered distribution and $φ\in \mathcal{S}(\mathbb{R})$ is a Schwartz kernel. Expectations are defined via the action of distributions on regularised test functions, yielding well-defined weak moments, weak characteristic functions, and weak cumulants of all orders. These extend classical quantities and retain key algebraic properties such as additivity under independence and natural affine transformation rules. The main results are: (i) a systematic algebra of weak cumulants; (ii) a weak moment problem where existence of all moments holds unconditionally and uniqueness depends on the kernel, with uniqueness results under Gaussian kernels (via Hermite completeness), positive Schwartz kernels with an exponential tail bound and square-integrable densities (via a Carleman-type criterion), and kernels with exponential decay (via Denjoy-Carleman quasi-analyticity); and (iii) a weak central limit theorem formulated as convergence of weak characteristic functions to a Gaussian limit, covering cases where the classical theorem fails. The framework is illustrated with Student's $t$, stable, and hyperbolic distributions. As a statistical consequence, the weak first moment yields a consistent estimator of the location parameter in the Cauchy model, where no classical moment-based estimator exists. A full statistical treatment is given in a companion paper.

2604.02889 2026-05-22 stat.ML cs.AI cs.LG

Rethinking Forward Processes for Score-Based Nonlinear Data Assimilation in High Dimensions

重新思考高维情形下基于分数的非线性数据同化的前向过程

Eunbi Yoon, Won Chang, Donghan Kim, Dae Wook Kim

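AI summary: This paper proposes a forward process tailored to filtering for state estimation in high-dimensional nonlinear systems. The process transforms the system state toward the measurement space, and the resulting Measurement-Aware Score-based Filter (MASF) improves assimilation performance over existing score-based filters and ensemble-type Kalman filters.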

详情
AI中文摘要

数据同化是通过结合模型预测和测量来估计动态系统状态的过程。当系统是非线性且高维时,这一任务变得具有挑战性。为了解决这个问题,最近出现了基于分数的贝叶斯滤波器。然而,这些方法在某些情况下仍表现不佳,特别是在空间稀疏测量下。这种退化源于对似然分数的启发式近似,其误差会随时间累积。这一限制是因为这些方法只是采用了一种经典的生成建模前向过程,将数据分布转化为高斯分布,而与测量方程无关。在这里,我们提出了一种针对滤波的前向过程,将系统状态转换到测量空间,从而实现了似然分数的理论严谨公式化。基于此,我们开发了测量感知的分数基滤波器(MASF)。我们在Kolmogorov流上评估了MASF,这是一个维度高达$\mathcal{O}(10^5)$的高维流体基准测试,评估涵盖多种测量算子,包括状态与测量之间维度不匹配的非线性情形。MASF相比现有分数基滤波器和集合型卡尔曼滤波器表现出改进的性能。值得注意的是,当使用摊销预训练时,MASF相比基线实现了高达$28.2\times$的墙钟时间加速。我们的实现可在\texttt{https://github.com/tcnllab-oss/masf}获得。

英文摘要

Data assimilation is the process of estimating the state of a dynamical system over time by combining model predictions with measurements. This task becomes challenging when the system is nonlinear and high-dimensional. To address this, score-based Bayesian filters have recently emerged. However, these methods still show unsatisfactory performance in certain cases, particularly under spatially sparse measurements. Such degradation stems from heuristic approximations of the likelihood score, whose errors can accumulate over time. This limitation arises because the methods simply adopt a classical forward process for generative modeling that transforms a data distribution toward a Gaussian distribution, which is independent of the measurement equation. Here, we propose a forward process tailored for filtering that transforms the system state toward the measurement space, enabling a theoretically sound formulation of the likelihood score. Based on this, we develop the Measurement-Aware Score-based Filter (MASF). We evaluate MASF on Kolmogorov flow, a high-dimensional fluid benchmark with up to $\mathcal{O}(10^5)$ dimensions, under diverse measurement operators, including nonlinear cases with a dimensional mismatch between the state and the measurements. MASF shows improved performance over existing score-based filters and ensemble-type Kalman filters. Notably, MASF achieves up to a $28.2\times$ wall-clock speedup compared with the baselines when using amortized pretraining. Our implementation is available at \texttt{https://github.com/tcnllab-oss/masf}.

2603.29981 2026-05-22 cs.LG stat.ML

Aligning Validation with Deployment in Spatial Prediction: Target-Weighted Cross-Validation

在空间预测中对齐验证与部署:目标加权交叉验证

Alexander Brenning, Thomas Suesse

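AI summary: This paper proposes a deployment-oriented validation framework based on weighted cross-validation. It introduces Target-Weighted Cross-Validation (TWCV) to align validation tasks with the distribution of prediction tasks across a specified domain, reducing the bias that preferential or clustered sampling introduces into performance estimates.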

详情
AI中文摘要

可靠地估计预测性能对于空间环境建模至关重要,其中机器学习模型用于从不均匀分布的观测数据中生成地图。标准交叉验证(CV)假设验证数据能代表目标领域内预测条件的分布。在实践中,由于选择性或集群采样,这一假设经常被违反,导致性能和不确定性估计偏倚。本文引入了一种基于加权交叉验证的部署导向验证框架,该框架通过重要性加权交叉验证(IWCV)和基于校准的方法,目标加权交叉验证(TWCV),利用具有空间意义的任务描述符如环境协变量和预测距离。模拟实验表明,传统非空间和空间交叉验证策略在现实采样设计下会表现出显著偏倚,而加权交叉验证方法在验证任务充分覆盖部署任务空间时能大幅减少这种偏倚。德国氮氧化物(NO₂)浓度制图案例研究显示,标准交叉验证由于采样偏倚会高估预测误差,而加权交叉验证则能产生更符合部署条件的估计。该框架将验证任务生成与风险估计分开,并为在样本分布与预测领域不同的空间预测设置中改进性能评估提供了实用方法。

英文摘要

Reliable estimation of predictive performance is essential for spatial environmental modeling, where machine-learning models are used to generate maps from unevenly distributed observations. Standard cross-validation (CV) assumes that validation data are representative of prediction conditions across the target domain. In practice, this assumption is often violated due to preferential or clustered sampling, leading to biased performance and uncertainty estimates. We introduce a deployment-oriented validation framework based on weighted CV that aligns validation tasks with the distribution of prediction tasks across a specified domain. The framework includes importance-weighted cross-validation (IWCV) and a calibration-based approach, Target-Weighted Cross-Validation (TWCV), which uses spatially meaningful task descriptors such as environmental covariates and prediction distance. Simulation experiments show that conventional non-spatial and spatial CV strategies can exhibit substantial bias under realistic sampling designs, whereas weighted CV approaches substantially reduce this bias when validation tasks adequately cover the deployment-task space. A case study on mapping nitrogen dioxide (NO$_2$) concentrations across Germany demonstrates that standard CV can overestimate prediction error due to sampling bias, while weighted CV yields estimates more consistent with deployment conditions. The framework separates validation task generation from risk estimation and provides a practical approach for improving performance assessment in spatial prediction settings where sample distributions differ from prediction domains.
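The reweighting idea behind weighted CV can be illustrated with a small Python sketch. It is a deliberately simplified stand-in and not the paper's IWCV or TWCV procedure. It assumes a single discrete task descriptor (for example a binned prediction distance) and weights each held-out loss by the ratio of that bin's share in the deployment domain to its share in the validation sample. The function name and the bin labels are hypothetical.

```python
from collections import Counter


def target_weighted_error(losses, val_bins, target_bins):
    """Reweight held-out losses so descriptor bins match the deployment domain.

    losses      -- per-observation validation losses
    val_bins    -- descriptor bin of each validation observation
    target_bins -- descriptor bin of each prediction location in the domain
    """
    val_freq = Counter(val_bins)
    tgt_freq = Counter(target_bins)
    n_val, n_tgt = len(val_bins), len(target_bins)
    # Weight = (share of bin in deployment domain) / (share in validation sample).
    weights = [(tgt_freq[b] / n_tgt) / (val_freq[b] / n_val) for b in val_bins]
    total = sum(weights)
    if total == 0:
        raise ValueError("validation sample does not overlap the target domain")
    return sum(w * loss for w, loss in zip(weights, losses)) / total


losses = [1.0, 1.0, 1.0, 4.0]
val_bins = ["near", "near", "near", "far"]      # validation points cluster near data
target_bins = ["near", "far", "far", "far"]     # most of the map is far from data
weighted = target_weighted_error(losses, val_bins, target_bins)
unweighted = sum(losses) / len(losses)
```

Deployment bins that never appear in the validation sample receive no weight at all, which is one way to see why the abstract requires validation tasks to adequately cover the deployment-task space.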

2603.23175 2026-05-22 math.PR cond-mat.stat-mech math.ST stat.TH

On the Golomb-Dickman constant under Ewens sampling

关于Ewens采样下Golomb-Dickman常数

José Ricardo G. Mendonça, Luis Jehiel Negret

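AI summary: This paper studies the limiting expected proportion of the longest cycle in random permutations under the Ewens measure, which defines a generalized Golomb-Dickman constant λ_θ. Using the independence properties of Kingman's Poisson process construction of the Poisson-Dirichlet distribution, it derives an explicit integral representation of λ_θ, describes how λ_θ depends on θ, and gives its asymptotic behavior for small and large θ.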

详情
Journal ref
Statistics & Probability Letters 237 (2026) 110831
Comments
AMSart style, 10 pages, 3 figures, 1 table, 19 refs. Version v2 acknowledges Holst's work (2001), adds the asymptotic analysis of $λ_θ$, and displays simulations of the Hoppe urn model. Version v3 corresponds to the (slightly corrected and improved) published version
AI中文摘要

我们定义了一个广义的Golomb-Dickman常数λ_θ,作为在Ewens测度下随机排列中最长循环比例的极限期望。利用Kingman的Poisson过程构造Poisson-Dirichlet分布的独立性质,我们得到了λ_θ关于指数积分的显式积分表示。λ_θ对θ的依赖性反映了由长循环主导(小θ)和由许多小循环主导(大θ)的两种模式之间的转变。我们还推导了λ_θ在小θ和大θ时的渐进行为,并通过数值计算、Hoppe urn的蒙特卡罗模拟和应用来展示我们的结果。

英文摘要

We define a generalized Golomb--Dickman constant $λ_θ$ as the limiting expected proportion of the longest cycle in random permutations under the Ewens measure with parameter $θ> 0$. Exploiting the independence properties of Kingman's Poisson process construction of the Poisson--Dirichlet distribution, we obtain an explicit integral representation for $λ_θ$ in terms of the exponential integral. The dependence of $λ_θ$ on $θ$ reflects the transition between regimes dominated by long cycles (small $θ$) and those with many small cycles (large $θ$). We also derive the asymptotic behavior of $λ_θ$ for small and large $θ$ and illustrate our results with numerical computations, Monte Carlo simulations of the Hoppe urn, and an application.
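The Monte Carlo side of this can be sketched in Python with a Hoppe-urn style sequential construction of Ewens cycle sizes. It assumes the standard rule that the next element opens a new cycle with probability θ/(θ + m) when m elements are already placed, and otherwise joins an existing cycle with probability proportional to its size. The function names, sample sizes, and seed are illustrative choices and not the authors' simulation code.

```python
import random


def longest_cycle_fraction(n, theta, rng):
    """Sample Ewens(theta) cycle sizes for n elements; return max size / n."""
    sizes = []
    for placed in range(n):
        u = rng.random() * (theta + placed)
        if u < theta:
            sizes.append(1)          # open a new cycle
            continue
        u -= theta
        for idx, size in enumerate(sizes):
            if u < size:
                sizes[idx] += 1      # join a cycle with probability size / placed
                break
            u -= size
        else:
            sizes[-1] += 1           # guard against floating-point round-off
    return max(sizes) / n


def estimate_lambda(n, theta, trials, seed=0):
    """Monte Carlo estimate of the expected longest-cycle proportion."""
    rng = random.Random(seed)
    return sum(longest_cycle_fraction(n, theta, rng) for _ in range(trials)) / trials


estimate = estimate_lambda(n=300, theta=1.0, trials=500)
```

For θ = 1 the Ewens measure is uniform on permutations, so the estimate should lie near the classical Golomb-Dickman constant of about 0.6243, up to finite-n and sampling error. Smaller θ pushes the estimate up and larger θ pushes it down, matching the two regimes described in the abstract.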

2603.04525 2026-05-22 stat.ML cs.LG

The Volterra signature

Volterra签名

Paul P. Hager, Fabian N. Harang, Luca Pelizzari, Samy Tindel

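AI summary: This paper proposes the Volterra signature as an explicit feature representation for history-dependent systems. It develops the input path, weighted by a temporal kernel, into the tensor algebra, uses the Volterra-Chen identity to derive rigorous learning-theoretic guarantees, and demonstrates effectiveness on dynamic learning tasks.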

详情
AI中文摘要

现代处理非马尔可夫时间序列的学习方法,如循环神经网络、神经控制微分方程或变换器,通常依赖于隐式的记忆机制,这些机制在长时间范围内难以解释或训练。我们提出Volterra签名VSig(x;K)作为处理历史依赖系统的显式特征表示。通过将输入路径x加权时间核K转化为张量代数,我们利用相关的Volterra-Chen恒等式推导出严谨的学习理论保证。具体来说,我们证明了注入性陈述(在增强下可识别),从而在无限维路径空间上推导出通用逼近定理,这在某些情况下通过VSig(x;K)的线性泛函实现。此外,我们通过展示与Volterra签名相关的内积可通过二参数积分方程闭合地表示,证明了核技巧的应用,从而利用PDE的数值方法进行计算。对于一大类指数型核,VSig(x;K)在张量代数中解线性状态空间微分方程。结合对时间重参数化的不变性,这些结果将Volterra签名定位为数据科学中稳健且计算上可行的特征映射。我们在真实和合成数据上的动态学习任务中展示了其有效性,其中它一致地改进了经典路径签名基线。

英文摘要

Modern approaches for learning from non-Markovian time series, such as recurrent neural networks, neural controlled differential equations or transformers, typically rely on implicit memory mechanisms that can be difficult to interpret or to train over long horizons. We propose the \emph{Volterra signature} $\mathrm{VSig}(x;K)$ as a principled, explicit feature representation for history-dependent systems. By developing the input path $x$ weighted by a temporal kernel $K$ into the tensor algebra, we leverage the associated Volterra--Chen identity to derive rigorous learning-theoretic guarantees. Specifically, we prove an \emph{injectivity} statement (identifiability under augmentation) that leads to a \emph{universal approximation} theorem on the infinite dimensional path space, which in certain cases is achieved by \emph{linear functionals} of $\mathrm{VSig}(x;K)$. Moreover, we demonstrate applicability of the \emph{kernel trick} by showing that the inner product associated with Volterra signatures admits a closed characterization via a two-parameter integral equation, enabling numerical methods from PDEs for computation. For a large class of exponential-type kernels, $\mathrm{VSig}(x;K)$ solves a linear state-space ODE in the tensor algebra. Combined with inherent invariance to time reparameterization, these results position the Volterra signature as a robust, computationally tractable feature map for data science. We demonstrate its efficacy in dynamic learning tasks on real and synthetic data, where it consistently improves classical path signature baselines.

2602.21792 2026-05-22 stat.OT

p-Hacking Inflates Type I Error Rates in the Error Statistical Approach but not in the Formal Inference Approach

p-值操纵在误差统计方法中会增加I类错误率,但在正式推断方法中不会

Mark Rubin

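AI summary: This paper examines the effect of p-hacking under the error statistical approach and the formal inference approach to significance testing, arguing that p-hacking inflates Type I error rates in the former but not in the latter.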

详情

英文摘要

p-hacking occurs when researchers conduct multiple significance tests (e.g., $p_1$; $H_{0,1}$ and $p_2$; $H_{0,2}$) and then selectively report tests that yield desirable (usually significant) results (e.g., $p_2 < 0.05$; $H_{0,2}$) without correcting for multiple testing (e.g., 0.05/2 = 0.025). In the present article, I consider p-hacking in the context of two philosophies of significance testing - the error statistical approach and the formal inference approach. I argue that although p-hacking inflates Type I error rates in the error statistical approach, it does not inflate them in the formal inference approach. Specifically, in the error statistical approach, the "actual" familywise error rate (e.g., $1 - [1 - 0.05]^2 \approx 0.098$ for two independent tests) is relevant because it covers both the reported and unreported tests in the "actual" test procedure (i.e., $p_1$; $H_{0,1}$ and $p_2$; $H_{0,2}$). In this approach, Type I error rate inflation occurs because the "actual" error rate (0.098) is higher than the nominal error rate (0.05). In contrast, in the formal inference approach, the "actual" familywise error rate is irrelevant because (a) the researcher does not report a statistical inference about the corresponding intersection null hypothesis (i.e., $H_{0,1}$ & $H_{0,2}$), and (b) the "actual" familywise error rate does not license inferences about the reported individual hypotheses (i.e., $H_{0,2}$). Instead, in the formal inference approach, only the nominal error rate is relevant, and a comparison with the "actual" error rate is inappropriate. Implications for conceptualizing, demonstrating, and reducing p-hacking are discussed.
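The arithmetic quoted in the abstract can be checked directly. The short sketch below uses hypothetical function names and assumes independent tests, as the abstract does for its example.

```python
def familywise_error_rate(alpha, k):
    """Probability of at least one Type I error across k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_threshold(alpha, k):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / k


actual = familywise_error_rate(0.05, 2)    # 0.0975, reported as 0.098
corrected = bonferroni_threshold(0.05, 2)  # 0.025
```

With the corrected threshold, the familywise rate for two independent tests is $1 - (1 - 0.025)^2 \approx 0.0494$, which is below the nominal 0.05.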

2601.11845 2026-05-22 econ.EM stat.ME

Reevaluating Causal Estimation Methods with Data from a Product Release

重新评估产品发布数据中的因果估计方法

Justin Young, Eleanor Wiske Dillon

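AI summary: Using data from the experimental rollout of a new product feature, this paper evaluates causal estimation methods and finds that ground-truth causal effects can be recovered, but only with careful modeling choices. It offers best practices for treatment effect estimation in modern high-dimensional datasets.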

详情
Comments
53 pages
AI中文摘要

近年来,因果机器学习方法的发展使得估计混杂因素、治疗和结果之间的灵活关系变得更加容易,使因果分析中的无偏假设更加可接受。这些方法在恢复真实基准基线方面有多成功?在本文中,我们分析了一个新的数据样本,包括一家大型科技公司在新功能上的实验推广以及同时收集的用户样本,这些用户自发地选择了该功能。我们发现,恢复真实因果效应是可行的——但只有在仔细的建模选择下。我们的结果基于观察性因果文献,始于LaLonde (1986),为现代高维数据集中的更可信处理效应估计提供了最佳实践。

英文摘要

Recent developments in causal machine learning methods have made it easier to estimate flexible relationships between confounders, treatments and outcomes, making unconfoundedness assumptions in causal analysis more palatable. How successful are these approaches in recovering ground truth baselines? In this paper we analyze a new data sample including an experimental rollout of a new feature at a large technology company and a simultaneous sample of users who endogenously opted into the feature. We find that recovering ground truth causal effects is feasible -- but only with careful modeling choices. Our results build on the observational causal literature beginning with LaLonde (1986), offering best practices for more credible treatment effect estimation in modern, high-dimensional datasets.

2511.04106 2026-05-22 physics.soc-ph cs.CL cs.CY stat.AP

Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names

复杂系统中的亚指数增长动力学:一种用于新词汇和名称扩散的分段幂律模型

Hayafumi Watanabe

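AI summary: This paper proposes a piecewise power-law model for describing complex growth curves and, by analyzing large-scale datasets, finds that sub-exponential growth is a common pattern of social diffusion.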

详情
Journal ref
Physical Review E (2026)
AI中文摘要

社会中思想和语言的扩散通常被S型模型描述,如逻辑斯蒂曲线。然而,亚指数增长——一种比指数增长更慢的模式,在更广泛的社会现象中作用被忽视。本文提出了一种分段幂律模型,通过分析约十亿篇日本博客文章与维基百科词汇的数据集,发现网络搜索趋势数据(英语、西班牙语和日语)中存在一致的模式。分析2963个选定项目(如足够持续时间/峰值、单调增长)发现,1625(55%)种扩散模式没有突变水平,可以由一个或两个段描述。对于单段曲线,发现(i)形状参数α的模式接近0.5,表明亚指数增长普遍;(ii)峰值扩散规模主要由增长速率R决定,次要贡献来自α或持续时间T;(iii)α倾向于随主题性质变化,小主题/本地主题的α较小,广泛共享主题的α较大。此外,一个微观行为模型表明,α可以解释为对外向(陌生人)与内向(社区)接触的偏好指数。这些发现表明亚指数增长是社会扩散的常见模式,我们的模型提供了一个实用框架,用于一致描述、比较和解释复杂多样的增长曲线。

英文摘要

The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -- a slower-than-exponential pattern known in epidemiology -- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of 2,963 items, selected for reliable estimation (e.g., sufficient duration/peak, monotonic growth), reveals that 1,625 (55%) diffusion patterns without abrupt level shifts were adequately described by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter $α$ was near 0.5, indicating prevalent sub-exponential growth; (ii) the peak diffusion scale is primarily determined by the growth rate $R$, with minor contributions from $α$ or the duration $T$; and (iii) $α$ showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model of outward (stranger) vs. inward (community) contact suggests that $α$ can be interpreted as an index of the preference for outward-oriented communication. These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.

2510.16892 2026-05-22 math.ST stat.TH

Batch learning equals online learning in Bayesian supervised learning

批量学习等于在线学习在贝叶斯监督学习中

Hông Vân Lê

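AI summary: This paper studies the Bayesian supervised learning models proposed by Lê. It proves that sequential and batch Bayesian inversions coincide in supervised learning models with conditionally independent (possibly non-i.i.d.) data, and derives a recursive formula for posterior predictive distributions that reduces to the Kalman filter in Gaussian process regression.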

详情
Comments
Version 5: T. 31 pages, a characterization of probability measures on $\mathcal{P}(\mathcal{Y})^{\mathcal{X}}$ extended to Souslin spaces (Theorem 5.4), typo correction in Subsection 6.2
AI中文摘要

在本文中,我们研究了Le提出的贝叶斯监督学习模型。我们证明了在通用贝叶斯监督学习模型$(\mathcal{P}(\mathcal{Y})^{\mathcal{X}}, μ, \mathrm{Id}_{\mathcal{P}(\mathcal{Y})^{\mathcal{X}}}, \mathcal{P}(\mathcal{Y})^{\mathcal{X}})$中,对于任意输入空间$\mathcal{X}$,Souslin标签空间$\mathcal{Y}$和先验概率测度$μ\in \mathcal{P}( \mathcal{P}(\mathcal{Y})^{\mathcal{X}})$,存在贝叶斯反转。利用概率形的函子性,我们证明在监督学习模型中,序列和批量贝叶斯反转在条件独立(可能非i.i.d.)数据下是一致的。这种等价性不依赖于采样算子的支配或离散性假设。我们推导了后验预测分布的递推公式,其在高斯过程回归中简化为卡尔曼滤波。对于Souslin标签空间$\mathcal{Y}$和任意输入集$\mathcal{X}$,我们通过投影系统表征$\mathcal{P}(\mathcal{Y})^{\mathcal{X}}$上的概率测度,推广了Orbanz的结果。我们重新审视MacEachern的依赖Dirichlet过程(DDP)并利用copula构造方法,展示了如何在具有DDP先验的通用贝叶斯监督模型中计算后验预测分布。

英文摘要

In this paper we study Bayesian supervised learning models proposed by Lê in \cite{Le2025}. We show the existence of Bayesian inversions on universal Bayesian supervised learning models $(\mathcal{P}(\mathcal{Y})^{\mathcal{X}}, μ, \mathrm{Id}_{\mathcal{P}(\mathcal{Y})^{\mathcal{X}}}, \mathcal{P}(\mathcal{Y})^{\mathcal{X}})$ for arbitrary input space $\mathcal{X}$, Souslin label space $\mathcal{Y}$, and prior probability measure $μ\in \mathcal{P}( \mathcal{P}(\mathcal{Y})^{\mathcal{X}})$. Using functoriality of probabilistic morphisms, we prove that sequential and batch Bayesian inversions coincide in supervised learning models with conditionally independent (possibly non-i.i.d.) data \cite{Le2025}. This equivalence holds without domination or discreteness assumptions on sampling operators. We derive a recursive formula for posterior predictive distributions, which reduces to the Kalman filter in Gaussian process regression. For Souslin label spaces $\mathcal{Y}$ and arbitrary input sets $\mathcal{X}$, we characterize probability measures on $\mathcal{P}(\mathcal{Y})^{\mathcal{X}}$ via projective systems, generalizing Orbanz \cite{Orbanz2011}. We revisit MacEachern's Dependent Dirichlet Processes (DDP) \cite{MacEachern2000} using copula-based constructions \cite{BJQ2012} and show how to compute posterior predictive distributions in universal Bayesian supervised models with DDP priors.
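A toy conjugate case makes the batch/sequential equivalence concrete. The sketch below is not the paper's categorical construction: it assumes a scalar Gaussian prior on an unknown mean and independent Gaussian observations with differing, known noise variances (conditionally independent but not identically distributed). All names are hypothetical.

```python
def sequential_update(prior, y, noise_var):
    """One scalar Kalman-style update of a Gaussian belief (mean, variance)."""
    mean, var = prior
    gain = var / (var + noise_var)  # scalar Kalman gain
    return (mean + gain * (y - mean), (1.0 - gain) * var)


def batch_posterior(prior, ys, noise_vars):
    """Posterior from all observations at once, via precision weighting."""
    mean0, var0 = prior
    precision = 1.0 / var0 + sum(1.0 / s for s in noise_vars)
    mean = (mean0 / var0 + sum(y / s for y, s in zip(ys, noise_vars))) / precision
    return (mean, 1.0 / precision)


prior = (0.0, 4.0)
ys = [1.2, 0.7, 2.1]
noise_vars = [1.0, 0.5, 2.0]  # a different variance for each observation

belief = prior
for y, s in zip(ys, noise_vars):
    belief = sequential_update(belief, y, s)

batch = batch_posterior(prior, ys, noise_vars)
```

Processing the observations one at a time or all together yields the same posterior mean and variance up to floating-point rounding, and the order of the sequential updates does not matter.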

2509.26005 2026-05-22 stat.ML cs.LG

BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories under Spatio-Temporal Vector Fields

BALLAST:基于空间-时间向量场的海漂体轨迹的贝叶斯主动学习与前瞻性修正

Rui-Yang Zhang, Lachlan Astfalck, Edward Cripps, David S. Leslie, Henry B. Moss

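AI summary: This paper proposes a formal active learning methodology for guiding the placement of Lagrangian observers to infer time-dependent vector fields, using a physics-informed spatio-temporal Gaussian process surrogate model. Existing placement campaigns mostly follow standard 'space-filling' designs or relatively ad hoc expert opinion. The main challenge in applying principled active learning here is that Lagrangian observers are continuously advected by the vector field and therefore take measurements at different locations and times, so the likely future trajectories of placed observers must be considered when assessing the utility of candidate placements. To this end, the authors propose BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories. BALLAST-aided sequential placement strategies show noticeable benefits on both synthetic and high-fidelity ocean current models. The authors also develop a new GP inference method, the Vanilla SPDE Exchange (VaSE), to improve GP posterior sampling efficiency, which is of independent interest.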

详情
Comments
ICML 2026
AI中文摘要

我们介绍了一种正式的主动学习方法,用于指导拉格朗日观测器的布置,以推断时间依赖的向量场——海洋学、海洋科学和海洋工程中的关键任务——使用一个具有物理信息的空间-时间高斯过程代理模型。现有放置活动要么遵循标准的'空间填充'设计,要么相对随意地依赖专家意见。在该设置中应用原理性主动学习的主要挑战是拉格朗日观测器持续被向量场推动,因此它们在不同的位置和时间进行测量。因此,考虑已放置观测器的可能未来轨迹以评估候选放置位置的效用至关重要。为此,我们提出了BALLAST:用于海漂体轨迹的贝叶斯主动学习与前瞻性修正。我们观察到BALLAST辅助的顺序观测器布置策略在合成和高保真海洋流模型中均表现出显著优势。此外,我们还开发了一种新的GP推理方法——Vanilla SPDE Exchange(VaSE)——以提高GP后验采样效率,这也具有独立的研究价值。

英文摘要

We introduce a formal active learning methodology for guiding the placement of Lagrangian observers to infer time-dependent vector fields -- a key task in oceanography, marine science, and ocean engineering -- using a physics-informed spatio-temporal Gaussian process surrogate model. The majority of existing placement campaigns either follow standard 'space-filling' designs or relatively ad-hoc expert opinions. A key challenge to applying principled active learning in this setting is that Lagrangian observers are continuously advected through the vector field, so they make measurements at different locations and times. It is, therefore, important to consider the likely future trajectories of placed observers to account for the utility of candidate placement locations. To this end, we present BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories. We observe noticeable benefits of BALLAST-aided sequential observer placement strategies on both synthetic and high-fidelity ocean current models. In addition, we developed a novel GP inference method -- the Vanilla SPDE Exchange (VaSE) -- to boost the GP posterior sampling efficiency, which is also of independent interest.

2509.24087 2026-05-22 stat.AP

A penalized distributed lag non-linear Lee-Carter framework for regional weekly mortality forecasting

一种带有惩罚分布式滞后非线性Lee-Carter框架的区域周死亡率预测

Jens Robben, Karim Barigou

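AI summary: This paper proposes a framework that extends the Lee-Carter model with age- and region-specific seasonal effects and penalized distributed lag non-linear components, capturing the delayed and non-linear effects of heat, cold, and influenza on mortality in order to improve the accuracy of weekly mortality forecasts.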

详情
AI中文摘要

准确的周死亡率预测对于公共卫生和保险行业至关重要。我们开发了一个预测框架,该框架扩展了Lee-Carter模型,加入了年龄和地区特定的季节效应以及惩罚分布式滞后非线性组件,以捕捉热量、寒冷和流感对死亡率的延迟和非线性影响。模型通过负二项分布处理过分散的死亡率率。我们使用SARIMA过程建模模型中的潜在因素的时序动态,并通过基于Copula的方法捕捉跨地区依赖性。利用法国地区1990-2019年的死亡率数据,我们证明所提出的框架能够产生校准良好的预测分布,并相对于基准模型提高预测准确性。结果还显示温度和流感相关的相对风险在年龄和地区之间存在显著异质性。这些发现强调了在周死亡率预测框架中纳入外源性驱动因素和依赖结构的重要性。

英文摘要

Accurate forecasts of weekly mortality are essential for public health and the insurance industry. We develop a forecasting framework that extends the Lee-Carter model with age- and region-specific seasonal effects and penalized distributed lag non-linear components that capture the delayed and non-linear effects of heat, cold, and influenza on mortality. The model accommodates overdispersed mortality rates via a negative binomial distribution. We model the temporal dynamics of the latent factors in the model using SARIMA processes and capture cross-regional dependencies through a copula-based approach. Using regional French mortality data (1990-2019), we demonstrate that the proposed framework yields well-calibrated forecast distributions and improves predictive accuracy relative to benchmark models. The results further show substantial heterogeneity in temperature- and influenza-related relative risks between ages and regions. These findings underscore the importance of incorporating exogenous drivers and dependence structures into a weekly mortality forecasting framework.

2509.20083 2026-05-22 stat.AP

Rethinking player evaluation in sports: Goals above expectation and beyond

重新思考体育中的球员评估:预期之上和更进一步

Robert Bajons, Lucas Kook

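AI summary: This paper proposes a framework for evaluating player performance based on differences between actual and expected outcomes, estimated with flexible machine learning algorithms. By relating the metrics to Rao's score test and to interpretable semiparametric regression models, it enables valid frequentist inference, and it is applied to player evaluation in soccer, basketball, and American football.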

详情
AI中文摘要

评估体育中球员表现的一种流行量化方法是将观察到的结果与忽略球员参与的预期结果进行比较,该预期结果通过统计或机器学习方法估计。在足球中,球员的

英文摘要

A popular quantitative approach to evaluating player performance in sports involves comparing an observed outcome to the expected outcome ignoring player involvement, which is estimated using statistical or machine learning methods. In soccer, for instance, goals above expectation (GAX) of a player measure how often shots of this player led to a goal compared to the model-derived expected outcome of the shots. Typically, sports data analysts rely on flexible machine learning models, which are capable of handling complex nonlinear effects and feature interactions, but fail to provide valid statistical inference due to finite-sample bias and slow convergence rates. In this paper, we close this gap by presenting a framework for player evaluation with metrics derived from differences in actual and expected outcomes using flexible machine learning algorithms, which nonetheless allows for valid frequentist inference. We first show that the commonly used metrics are directly related to Rao's score test in parametric regression models for the expected outcome. Motivated by this finding and recent developments in double machine learning, we then propose the use of residualized versions of the original metrics. For GAX, the residualization step corresponds to an additional regression predicting whether a given player would take the shot under the circumstances described by the features. We further relate metrics in the proposed framework to player-specific effect estimates in interpretable semiparametric regression models, allowing us to infer directional effects, e.g., to determine players that have a positive impact on the outcome. Our primary use case is GAX in soccer. We further apply our framework to evaluate goal-stopping ability of goalkeepers, shooting skill in basketball, quarterback passing skill in American football, and injury-proneness of soccer players.
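The two metrics described in the abstract can be illustrated as below. The plain GAX sum follows the abstract's description; the residualized form is one plausible reading based on double machine learning conventions and is an assumption, not the authors' exact estimator. Function names and the toy numbers are hypothetical.

```python
def gax(outcomes, expected_goals):
    """Goals above expectation over one player's shots: sum of (goal - xG)."""
    return sum(y - p for y, p in zip(outcomes, expected_goals))


def residualized_gax(outcomes, expected_goals, took_shot, shot_propensity):
    """Assumed residualized variant over ALL shots in the data.

    took_shot[i] is 1 if the player of interest took shot i, else 0;
    shot_propensity[i] is a model's predicted probability that this
    player would take shot i given the shot's features.
    """
    return sum(
        (y - p) * (d - e)
        for y, p, d, e in zip(outcomes, expected_goals, took_shot, shot_propensity)
    )


# Toy data: four shots, the first two taken by the player of interest.
outcomes = [1, 0, 1, 0]
xg = [0.2, 0.1, 0.5, 0.3]
took = [1, 1, 0, 0]
propensity = [0.5, 0.5, 0.5, 0.5]

plain = gax(outcomes[:2], xg[:2])
resid = residualized_gax(outcomes, xg, took, propensity)
```

The residualized version multiplies each outcome residual by the residual of the player indicator, so shots the player was unlikely to take but did take carry more weight than shots anyone would have taken.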

2509.05443 2026-05-22 stat.ME stat.AP

Multidimensional constructs and moderated linear and nonlinear factor analysis

多维构念与受调节的线性及非线性因子分析

R. Noah Padgett

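AI summary: This paper proposes a multidimensional MNLFA model that permits moderation of item intercepts, loadings, residual variances, factor means, variances, and correlations across three or more latent factors. Bayesian and penalized maximum likelihood approaches are used to stabilize estimation and detect partial measurement non-invariance while preserving model interpretability.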

详情
Comments
22 pages, 2 figures
AI中文摘要

迄今为止,具有对所有模型参数进行调节的多维因子模型仅限于单因子和双因子模型。这与现有的心理测量不一致,因为这些测量通常旨在评估3-5个潜在构念的维度。本文介绍了一种多维MNLFA模型,允许在三个或更多潜在因子上对项目截距、负荷、残差方差、因子均值、方差和相关性进行调节。我描述了通过Stan使用贝叶斯方法实现该模型的努力,并通过惩罚最大似然方法来稳定估计并检测部分测量不不变性,同时保持模型的可解释性。闭式解析似然梯度的出现,消除了对昂贵的数值或基于MCMC的近似计算的需要。最后,我们讨论了惩罚对测量不变性理论影响、计算考虑因素以及将框架扩展到类别指标、纵向数据和应用研究情境的未来方向。

英文摘要

Multidimensional factor models with moderations on all model parameters have so far been limited to single-factor and two-factor models. This does not align well with existing psychological measures, which are commonly intended to assess 3-5 dimensions of a latent construct. In this paper, I introduce a multidimensional MNLFA model that permits the moderation of item intercepts, loadings, residual variances, factor means, variances, and correlations across three or more latent factors. I describe efforts to implement the model using Bayesian methods through Stan and penalized maximum likelihood approaches to stabilize estimation and detect partial measurement non-invariance while preserving model interpretability. Closed-form analytic gradients of the likelihood eliminate the need for costly numerical or MCMC-based approximations. I conclude by discussing the theoretical implications of penalization for measurement invariance, computational considerations, and future directions for extending the framework to categorical indicators, longitudinal data, and applied research contexts.

2508.12085 2026-05-22 stat.ME stat.ML

Unified Conformalized Multiple Testing with Full Data Efficiency

统一的符合性多重检验与全数据效率

Yuyang Huo, Xiaoyang Wu, Changliang Zou, Haojie Ren

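AI summary: This paper proposes a unified framework for conformalized multiple testing that uses all available data (null, alternative, and unlabelled) to construct scores and calibrate p-values, improving statistical power while rigorously controlling the false discovery rate.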

详情
AI中文摘要

符合性多重检验提供了一种无模型的方法来在决策中控制预测不确定性。现有方法通常只使用部分可用数据来构建针对特定设置的分数函数。我们提出了一种统一的框架,将数据利用置于核心:它使用所有可用数据(null、alternative和未标记数据)来构建分数并通过全排列策略校准p值。这种充分利用所有可用数据的方法显著提高了功效,通过提高非符合性分数质量和最大化校准集大小,同时严格控制虚无发现率。关键的是,我们的框架为符合性检验提供了系统的设计原则,并使在候选方案中自动选择最佳符合性程序成为可能,而无需额外的数据分割。广泛的数值实验表明,我们的增强方法在各种场景中均表现出优越的效率和适应性。

英文摘要

Conformalized multiple testing offers a model-free way to control predictive uncertainty in decision-making. Existing methods typically use only part of the available data to build score functions tailored to specific settings. We propose a unified framework that puts data utilisation at the centre: it uses all available data (null, alternative, and unlabelled) to construct scores and calibrate p-values through a full permutation strategy. This unified use of all available data significantly improves power by enhancing non-conformity score quality and maximising calibration set size while rigorously controlling the false discovery rate. Crucially, our framework provides a systematic design principle for conformal testing and enables automatic selection of the best conformal procedure among candidates without extra data splitting. Extensive numerical experiments demonstrate that our enhanced methods deliver superior efficiency and adaptability across diverse scenarios.
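For orientation, the standard split-conformal baseline that such frameworks build on can be sketched as follows. This is not the full-permutation procedure proposed in the paper: it calibrates only on held-out null scores and then applies the Benjamini-Hochberg step-up rule. Function names and toy scores are hypothetical, and larger scores are assumed to indicate stronger evidence against the null.

```python
def conformal_p_values(calibration_scores, test_scores):
    """Split-conformal p-value: rank of each test score among null scores."""
    n = len(calibration_scores)
    return [
        (1 + sum(c >= t for c in calibration_scores)) / (n + 1)
        for t in test_scores
    ]


def benjamini_hochberg(p_values, q):
    """Indices rejected by the BH step-up procedure at target FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])


calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # null scores
test = [0.95, 0.55, 2.0]
p_vals = conformal_p_values(calibration, test)
rejected = benjamini_hochberg(p_vals, q=0.2)
```

The smallest attainable p-value is 1/(n+1), which is why the size of the calibration set matters for power, as the abstract emphasises.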

2505.18391 2026-05-22 econ.EM stat.ME

Bayesian Estimation of Cohort-Time-Stratum Specific Effects in Staggered Difference-in-Differences

贝叶斯估计队列-时间-组别特定效应在 staggered 差异-in-差异中的应用

Siddhartha Chib, Kenichi Shimizu

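AI summary: This paper proposes a probabilistic framework for estimating high-dimensional ATT arrays that vary across cohorts, periods, and strata defined by baseline covariates. Subgroup-specific treatment effects are estimated jointly through a unified likelihood-based model, stabilizing inference in sparse cohort-by-time-by-stratum settings.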

详情
AI中文摘要

在 staggered 处理采用的差异-in-差异设计被广泛用于研究跨队列和时期异质性治疗效应。我们开发了一个概率框架,用于估计可能在队列、时期和由基线协变量定义的组别中变化的高维ATT数组。该框架通过统一的似然模型联合估计子组特定的治疗效应,从而在稀疏的队列-时间-组别设置中稳定推断。我们为ATT数组建立了伯恩斯坦-冯-米泽斯定理,表明后验可信区间在大样本下具有渐近有效的频率覆盖。模拟和对最低工资上涨与青少年就业的应用显示了有限样本改进和重要的子组异质性。

英文摘要

Difference-in-Differences designs with staggered treatment adoption are widely used to study heterogeneous treatment effects across cohorts and time periods. We develop a probabilistic framework for estimating potentially high-dimensional ATT arrays that vary across cohorts, periods, and strata defined by baseline covariates. The framework jointly estimates subgroup-specific treatment effects through a unified likelihood-based model, stabilizing inference in sparse cohort-by-time-by-stratum settings. We establish a Bernstein-von Mises theorem for the ATT array, implying asymptotically valid frequentist coverage of posterior credible intervals. Simulations and an application to minimum wage increases and teen employment demonstrate meaningful finite-sample improvements and important subgroup heterogeneity.

2503.04491 2026-05-22 stat.AP stat.ME

A Spatiotemporal, Quasi-experimental Causal Inference Approach to Characterize the Effects of Global Plastic Waste Export and Burning on Air Quality Using Remotely Sensed Data

一种空间时间准实验因果推断方法用于利用遥感数据表征全球塑料垃圾出口和焚烧对空气质量的影响

Ellen M. Considine, Rachel C. Nethery

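AI summary: Using remotely sensed data and spatiotemporal causal analysis, this paper studies the effects of global plastic waste export and burning on air quality. Examining Indonesia before and after 2018, it evaluates the impact of large-scale plastic waste policy and finds that PM2.5 increased markedly after China's import ban.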

详情
AI中文摘要

塑料废物的开放式燃烧可能通过降低空气质量对全球健康构成重大威胁,但关于此问题的定量研究因数据缺乏而受到阻碍。许多低收入和中等收入国家,其中开放式燃烧最为令人担忧,几乎没有空气质量监测。本文利用遥感数据产品结合空间时间因果分析技术,评估大规模塑料废物政策对空气质量的影响。在整个研究过程中,我们研究了印尼在2018年前后的情况,当时中国停止进口塑料废物,导致这一庞大的废物流被转移至其他国家。我们为这种设置定制了尖端的统计方法,估计了增加的塑料废物进口对印度尼西亚废物填埋场附近细颗粒物(PM2.5)的影响,作为距离港口的接近程度的函数。我们观察到,中国禁令后(2018-2019年)每月PM2.5水平相对于预期的正常情况(2012-2017年)显著增加,当暴露于中等高港口接近度时,增加量高达1.68 μg/m³(95%置信区间=[0.72, 2.48])。对于非常高的港口接近度暴露,效果更为温和,可能反映了政府监管更大的情况下,填埋/焚烧的增加较小。

英文摘要

Open burning of plastic waste may pose a significant threat to global health by degrading air quality, but quantitative research on this problem -- crucial for policy making -- has been stunted by lack of data. Many low- and middle-income countries, where open burning is most concerning, have little to no air quality monitoring. Here, we leverage remotely sensed data products combined with spatiotemporal causal analytic techniques to evaluate the impact of large-scale plastic waste policies on air quality. Throughout, we study Indonesia before and after 2018, when China halted its import of plastic waste, resulting in diversion of this massive waste stream to other countries. We tailor cutting-edge statistical methods to this setting, estimating effects of increased plastic waste imports on fine particulate matter (PM$_{2.5}$) near waste dump sites in Indonesia as a function of proximity to ports, an induced continuous exposure. We observe strong evidence that monthly PM$_{2.5}$ increased after China's ban (2018-2019) relative to expected business-as-usual (2012-2017), with increases up to 1.68 $μ$g/m$^3$ (95% CI = [0.72, 2.48]) when exposed to medium-high port proximity. Effects were more modest for very high port proximity exposure, possibly reflecting smaller increases in dumping/burning where government oversight is greater.

2502.13822 2026-05-22 stat.ML cs.LG

Uncertainty quantification for Markov chain induced martingales with application to temporal difference learning

马尔可夫链诱导的martingales的不确定性量化及其在时间差学习中的应用

Weichen Wu, Yuting Wei, Alessandro Rinaldo

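AI summary: This paper establishes new high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains and applies them to temporal difference learning, obtaining a high-probability consistency guarantee that matches the asymptotic variance and a distributional convergence rate for the Gaussian approximation of the TD estimator.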

详情
AI中文摘要

我们建立了针对由马尔可夫链诱导的向量值martingales的新型且通用的高维集中不等式和Berry-Esseen界。我们将这些结果应用于分析具有线性函数近似的时间差(TD)学习算法的性能,这是一种在强化学习(RL)中广泛使用的策略评估方法,获得了与渐近方差相符的高概率一致性保证,直到对数因子。此外,我们建立了Gaussian近似的时间差估计器的O(T^{-1/4}log T)分布收敛速率,以凸距离度量。我们的martingale界具有广泛的适用性,我们对TD学习的分析为RL算法的统计推断提供了新的见解,弥合了经典随机逼近理论与现代RL应用之间的差距。

英文摘要

We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains. We apply these results to analyze the performance of the Temporal Difference (TD) learning algorithm with linear function approximations, a widely used method for policy evaluation in Reinforcement Learning (RL), obtaining a sharp high-probability consistency guarantee that matches the asymptotic variance up to logarithmic factors. Furthermore, we establish an $O(T^{-\frac{1}{4}}\log T)$ distributional convergence rate for the Gaussian approximation of the TD estimator, measured in convex distance. Our martingale bounds are of broad applicability, and our analysis of TD learning provides new insights into statistical inference for RL algorithms, bridging gaps between classical stochastic approximation theory and modern RL applications.
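For readers unfamiliar with the algorithm being analysed, TD(0) with linear function approximation can be sketched on a toy chain. The environment (two states with uniform random transitions), the constant step size, and the use of iterate averaging are illustrative assumptions, not the setting or tuning studied in the paper.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td0_linear(features, reward, gamma, steps, step_size, seed=0):
    """TD(0) with linear values V(s) = theta . phi(s); returns averaged theta."""
    rng = random.Random(seed)
    n_states, dim = len(features), len(features[0])
    theta = [0.0] * dim
    avg = [0.0] * dim
    s = rng.randrange(n_states)
    for t in range(1, steps + 1):
        s_next = rng.randrange(n_states)  # toy chain: uniform transitions
        phi = features[s]
        # Temporal-difference error for the observed transition.
        delta = reward[s] + gamma * dot(theta, features[s_next]) - dot(theta, phi)
        theta = [th + step_size * delta * p for th, p in zip(theta, phi)]
        # Running average of the iterates (an assumed variance-reduction step).
        avg = [a + (th - a) / t for a, th in zip(avg, theta)]
        s = s_next
    return avg


# One-hot features, reward 1 in state 0 and 0 in state 1, discount 0.5.
theta_hat = td0_linear([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0],
                       gamma=0.5, steps=20000, step_size=0.05)
```

For this chain the Bellman equations give V(0) = 1.5 and V(1) = 0.5, so the averaged estimate should be close to those values.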

2402.13472 2026-05-22 stat.ME

Generalized linear models with spatial dependence and a functional covariate

具有空间依赖性和功能性协变量的广义线性模型

Sooran Kim, Mark S. Kaiser, Xiongtao Dai

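AI summary: This paper studies how to incorporate a functional covariate into generalized linear models in the presence of spatial dependence. It uses basis expansion and truncation for dimension reduction and a composite likelihood estimating equation to handle spatial dependence, constructs a confidence interval and a confidence band, and verifies the applicability of the asymptotic inferential results with a binary conditionals model.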

详情
AI中文摘要

我们扩展了在独立性假设下的广义功能线性模型,将其应用于功能性协变量与具有空间依赖性的标量响应变量之间的关系情况,这是一种复杂却普遍的现象。在估计方面,我们应用基函数展开和截断进行协变量过程的维度约简,随后使用复合似然估计方程处理空间依赖性。在重复格子渐近背景下,我们建立了所提模型的渐近结果,从而可以构建空间依赖参数的置信区间和回归参数函数的置信带。一个具有功能协变量的二元条件模型作为具体示例,并在模拟研究中验证了渐近推断结果的适用性。我们还将所提模型应用于将美国中西部各州县的年度玉米产量与同期(4月至9月)每日最高温度相关联的问题。此外,补充材料中进一步讨论了扩展到扩展格子情境的情况。

英文摘要

We extend generalized functional linear models under independence to a situation in which a functional covariate is related to a scalar response variable that exhibits spatial dependence, a complex yet prevalent phenomenon. For estimation, we apply basis expansion and truncation for dimension reduction of the covariate process followed by a composite likelihood estimating equation to handle the spatial dependency. We establish asymptotic results for the proposed model under a repeating lattice asymptotic context, allowing us to construct a confidence interval for the spatial dependence parameter and a confidence band for the regression parameter function. A binary conditionals model with functional covariates is presented as a concrete illustration and is used in simulation studies to verify the applicability of the asymptotic inferential results. We apply the proposed model to a problem in which the objective is to relate annual corn yield in counties of states in the Midwestern United States to daily maximum temperatures from April to September in those same geographic regions. The extension to an expanding lattice context is further discussed in the supplement.

2605.22062 2026-05-22 math.ST stat.TH

A Circular Chatterjee's Correlation Coefficient

循环的切特杰系数

Sourav Majumdar

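AI summary: This paper proposes a version of Chatterjee's correlation coefficient for circular data. By averaging over response cut points in circular rank space, it removes the influence of an arbitrary cut point and preserves the circular ordering of the data; the coefficient is zero under independence and one when the response is a measurable function of the predictor.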

详情
AI中文摘要

切特杰的排名相关系数是一种方向性关联测量,用于检测一个变量是否可以作为另一个变量的函数进行预测。尽管原始系数自然定义于实值数据,但循环数据带来了额外的困难。通常的构造方法需要将每个圆任意切割并视为一条线。不同的切割点选择可能导致不同的有限样本值,尽管底层的循环关系保持不变。本文提出了一种循环的切特杰系数,消除了这种任意选择。总体构造在循环排名空间中平均响应切割点,有限样本构造在样本切割间隙中平均,并减少为仅基于循环排名的简单统计量。所得到的系数内在于数据的循环顺序,保持了方向性,并保留了切特杰原始系数的关键解释。在非原子循环边缘情况下,它在独立性下为零,在循环响应是循环预测器可测函数时为一。我们证明了其一致性,并推导了在独立性下的分布自由空缺行为。模拟显示,所提出的系数在检测多绕循环关系时特别有用,例如响应在预测器绕一圈时绕两次或四次的情况,而标准循环相关系数可能几乎失灵。

英文摘要

Chatterjee's rank correlation is a directed measure of association designed to detect whether one variable can be predicted as a function of another. While the original coefficient is naturally defined for real-valued data, circular data poses additional difficulty. Applying the usual construction requires cutting each circle at an arbitrary point and treating it as a line. Different choices of cut points can lead to different finite-sample values, even though the underlying circular relationship is unchanged. This paper proposes a circular version of Chatterjee's coefficient that removes this arbitrary choice. The population construction averages over response cuts in circular rank space, and the finite-sample construction averages over sample cut gaps and reduces to a simple statistic based only on cyclic ranks. The resulting coefficient is intrinsic to the circular ordering of the data, remains directed, and retains the key interpretation of Chatterjee's original coefficient. Under non-atomic circular marginals, it is zero exactly under independence and one exactly when the circular response is a measurable function of the circular predictor. We prove consistency and derive its distribution-free null behavior under independence. Simulations show that the proposed coefficient is especially useful for detecting multi-winding circular relationships, such as cases where the response goes around the circle twice or four times as the predictor goes around once, where standard circular correlations can be nearly blind.
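The cut-point problem described in the abstract can be reproduced with Chatterjee's original (linear) coefficient. The sketch below implements only that original statistic for data without ties, not the circular coefficient proposed in the paper; the helper names and the toy angles are hypothetical.

```python
import math


def chatterjee_xi(x, y):
    """Chatterjee's original rank coefficient, assuming no ties in x or y."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    y_by_x = [y[i] for i in order]
    rank_of = {v: r for r, v in enumerate(sorted(y_by_x), start=1)}
    ranks = [rank_of[v] for v in y_by_x]
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


def cut_circle(angles, cut):
    """Unroll angles (radians) onto a line by cutting the circle at `cut`."""
    return [(a - cut) % (2.0 * math.pi) for a in angles]


# The response is the predictor rotated by 3 radians: a perfect circular
# functional relationship.
x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
y = [(a + 3.0) % (2.0 * math.pi) for a in x]

xi_cut_at_0 = chatterjee_xi(x, y)                   # response cut at angle 0
xi_cut_at_3 = chatterjee_xi(x, cut_circle(y, 3.0))  # response cut at angle 3
```

The same six points give 8/35 (about 0.23) with one cut and 4/7 (about 0.57) with the other, even though the circular relationship is unchanged. This is the arbitrariness the averaged construction is designed to remove.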

2605.22038 2026-05-22 stat.ME

A Mixed Self-Exciting Process to Model Epileptic Seizures

一种混合自激发过程用于建模癫痫发作

Karen Kanaster, Giovani L. Silva, Peter Mueller, Jacob Pellinen, Elizabeth Juarez-Colunga

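AI summary: This paper proposes a Bayesian mixed Hawkes process model for seizure clustering and heterogeneity between individuals, incorporating a Weibull baseline intensity and random effects to improve model accuracy.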

详情
Comments
35 pages, 5 figures, 33 pages supplementary material
AI中文摘要

癫痫是一种神经疾病,其特征是反复发作,影响全球超过7000万人。通常,癫痫患者在首次发作后更可能经历后续发作,这一过程我们称为发作聚类。受人类癫痫项目(HEP)中从407名新诊断为局灶性癫痫患者收集的三年内发作日记数据的启发,我们提出了一种贝叶斯混合霍克斯过程模型,以解决发作聚类和个体间的异质性问题。在霍克斯过程中,每次事件发生时,强度会通过背景和激发强度函数的组合而加快。所提出的模型采用韦布尔基础强度来建模背景发作率随时间的变化趋势,而激发过程则用于建模个体内的发作聚类。我们通过在背景和激发强度中加入协变量和随机效应来建模个体间的异质性。在HEP研究中,个体内首次发作与第二次发作之间的时间平均为1.57天(95% CrI:1.43, 1.70),每个聚类的平均发作次数为2.20次(1.96, 2.47)。我们证明在存在异质性的情况下,忽略随机效应会导致背景强度低估和激发率高估。

英文摘要

Epilepsy is a neurological disorder characterized by recurrent seizures affecting more than 70 million people worldwide. Often, an individual with epilepsy is more likely to experience subsequent seizures following an initial seizure, a process we call seizure clustering. Motivated by seizure diary data collected over three years from 407 individuals newly diagnosed with focal epilepsy in the Human Epilepsy Project (HEP), we propose a Bayesian mixed Hawkes process model that addresses seizure clustering and heterogeneity between individuals. In the Hawkes process, the intensity is accelerated each time an event occurs, through the composition of background and excitation intensity functions. The proposed model incorporates a Weibull baseline intensity to model a trend in background seizure rates over time, while the excitation process accounts for seizure clustering within individuals. We model heterogeneity among individuals by including covariates and random effects in both the background and excitation intensities. In the HEP study, the average time between primary and secondary seizures within an individual is 1.57 (95% CrI: 1.43, 1.70) days, with an average of 2.20 (1.96, 2.47) seizures per cluster. We demonstrate that omitting random effects in the presence of heterogeneity leads to underestimation of the background intensity and overestimation of excitation rates.
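The structure of a self-exciting intensity with a Weibull background can be sketched as follows. The exponential excitation kernel and all parameter names are assumptions made for illustration; the abstract does not specify the kernel, and the covariates and random effects of the proposed model are omitted.

```python
import math


def weibull_baseline(t, shape, scale):
    """Weibull hazard used as a time-varying background rate (requires t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def hawkes_intensity(t, event_times, shape, scale, alpha, beta):
    """Background rate plus excitation from every earlier event.

    Each past event at time ti adds alpha * beta * exp(-beta * (t - ti)),
    an assumed exponential kernel whose total mass is alpha.
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - ti))
        for ti in event_times
        if ti < t
    )
    return weibull_baseline(t, shape, scale) + excitation


# A decreasing background (shape < 1) with one seizure at day 1.0.
quiet = hawkes_intensity(1.5, [], shape=0.8, scale=10.0, alpha=0.5, beta=0.6)
excited = hawkes_intensity(1.5, [1.0], shape=0.8, scale=10.0, alpha=0.5, beta=0.6)
```

The difference between the two values is the extra seizure risk contributed by the earlier event, which decays as time since that event grows.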

2605.22030 2026-05-22 stat.CO

Eigen for Statistical and Machine Learning Computing: A Lightweight C++ Tutorial with Python Bindings

Seyoung Lee, Kwan-Young Bak

AI总结 本文提供了一个轻量级教程,展示如何使用Eigen库将统计和机器学习算法用C++实现,并通过pybind11连接到Python。教程重点在于实用而非方法论,展示了如何用可读的C++编写常见矩阵运算、基于分解的求解器和向量化更新,并通过两个示例(核岭回归和随机梯度下降的矩阵分解)展示其在实际研究软件中的应用。

详情
英文摘要

This note provides a lightweight tutorial on using Eigen, a C++ template library for linear algebra, to implement statistical and machine learning algorithms. The emphasis is practical rather than methodological: we show how common matrix operations, decomposition-based solvers, and vectorized updates can be written in readable C++ and then connected to Python through pybind11. Two examples are used throughout the tutorial: kernel ridge regression and matrix factorization with stochastic gradient descent. The examples are intentionally small enough to be studied as code, but they contain many operations that appear in larger research software projects, including kernel matrix construction, regularized linear system solving, row-wise updates, and NumPy--Eigen data conversion. The goal is to provide a reproducible starting point for researchers who want to move from mathematical formulas to efficient C++ implementations while retaining a convenient Python workflow.
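As a rough illustration of the first example named in the abstract, the sketch below builds a Gaussian kernel matrix and solves the regularized linear system with a Cholesky factorization. It is written in NumPy, not Eigen, and is not the tutorial's code; the function names, the Gaussian kernel, and the parameters `lam` and `gamma` are assumptions made for this example.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def krr_fit(X, y, lam, gamma):
    """Solve (K + lam * I) alpha = y using a Cholesky factor K + lam*I = L L^T."""
    K = rbf_kernel(X, X, gamma)
    L = np.linalg.cholesky(K + lam * np.eye(len(X)))
    return np.linalg.solve(L.T, np.linalg.solve(L, y))

def krr_predict(X_train, alpha, X_new, gamma):
    """Predictions are kernel-weighted sums of the fitted coefficients."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

The same three steps (kernel construction, factorization-based solve, prediction) are the operations the abstract lists; a C++ version would replace the NumPy calls with the corresponding dense matrix and decomposition routines.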

2605.22025 2026-05-22 stat.ME

Testing for Serial Independence via Auto Hilbert-Schmidt Independence Criterion

基于自希尔伯特-施密特独立准则的序列独立性检验

Muyi Li, Yuqing Xu, Zhou Zhou

AI总结 本文提出了一种基于自希尔伯特-施密特独立准则(AutoHSIC)的框架,用于检验严格平稳时间序列中的序列独立性。该方法通过度量观测值与其滞后值之间的依赖性,提供了一种基于核的非线性序列依赖性检测方法,并采用野自助法(wild bootstrap)近似检验的临界值。

详情
AI中文摘要

我们开发了一种基于希尔伯特-施密特独立准则(HSIC)的框架,用于检验严格平稳时间序列中的序列独立性。所提出的自希尔伯特-施密特独立准则(AutoHSIC)度量观测值与其滞后值之间的依赖性,提供了一种基于核的方法来检测非线性序列依赖性。经验AutoHSIC统计量是一个由重叠观测值构建的滞后U统计量,因此即使在独立同分布的原假设下也继承了时间依赖性。因此,其渐近分析与标准的独立同分布HSIC理论不同,并且必须考虑原假设下的退化性。我们建立了所得单滞后检验和混成(portmanteau)检验在原假设和固定备择假设下的极限行为。由于极限零分布是非枢轴的,我们开发了一种野自助法(wild bootstrap)来近似临界值,并证明了其渐近有效性。该框架进一步扩展到基于残差的模型诊断,其中参数估计会影响零分布。模拟和实证应用展示了其在多变量、函数型和矩阵时间序列中检测非线性序列依赖性的能力。

英文摘要

We develop a Hilbert--Schmidt independence criterion (HSIC)-based framework for testing serial independence in strictly stationary time series. The proposed auto Hilbert--Schmidt independence criterion (AutoHSIC) measures dependence between an observation and its lagged counterpart, providing a kernel-based approach to detecting nonlinear serial dependence. The empirical AutoHSIC statistic is a lagged U-statistic constructed from overlapping observations, and hence inherits temporal dependence even under the i.i.d. null. Its asymptotic analysis therefore differs from standard i.i.d. HSIC theory and must account for degeneracy under the null. We establish the limiting behaviour of the resulting single-lag and portmanteau tests under the null and under fixed alternatives. Since the limiting null distribution is non-pivotal, we develop a wild bootstrap procedure for critical value approximation and prove its asymptotic validity. The framework is further extended to residual-based model diagnostics, where parameter estimation affects the null distribution. Simulations and empirical applications illustrate its ability to detect nonlinear serial dependence in multivariate, functional and matrix time series.
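To make the idea of measuring dependence between an observation and its lagged counterpart concrete, here is a small sketch that computes the standard biased (V-statistic) HSIC estimate between the pairs (x_t, x_{t+lag}) of a scalar series. This is not the paper's lagged U-statistic or its bootstrap test; the Gaussian kernel, the fixed bandwidth, and the function names are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(v, bandwidth):
    """Gram matrix of a Gaussian kernel on a 1-D sample."""
    d = v[:, None] - v[None, :]
    return np.exp(-d**2 / (2.0 * bandwidth**2))

def lagged_hsic(x, lag, bandwidth=1.0):
    """Biased HSIC estimate trace(K H L H) / n^2 between x_t and x_{t+lag}.

    Requires lag >= 1. K and L are Gram matrices of the leading and
    lagged segments; H = I - 11^T / n centers them.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[:-lag], x[lag:]
    n = len(a)
    K = gaussian_gram(a, bandwidth)
    L = gaussian_gram(b, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2
```

The estimate is zero when the lagged segment is constant and grows as the two segments become more strongly related; turning it into a test requires a null distribution, which the abstract obtains through a wild bootstrap.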

2605.22010 2026-05-22 stat.ML cs.LG

Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks

浅层神经网络中时间一致的弱混沌传播

Margalit Glasgow, Joan Bruna

AI总结 本文研究了在特征学习模式下使用梯度下降训练的单隐层神经网络,将有限宽度网络的输出与按均场动力学演化的无限宽度网络的输出联系起来,并利用均场动力学的收敛率建立了对时间一致成立的非渐近弱混沌传播界。

详情
Comments
46 pages
AI中文摘要

我们考虑在特征学习模式下使用梯度下降训练的单隐层神经网络,并将有限宽度网络的输出$f_{\hatρ_t^m}$与其无限宽度对应物$f_{ρ_t^{MF}}$联系起来,后者按均场动力学演化。虽然通过标准Grönwall估计可以得到固定时间范围内$\|f_{ρ_t^{MF}} - f_{\hatρ_t^m}\|$的界,但波动的长期行为则更为微妙。时间一致的界通常依赖于损失景观的(局部)强凸性,或带噪声梯度动力学中成立的对数Sobolev不等式。在本文中,我们转而利用均场确定性Wasserstein梯度流动力学的收敛率,建立了对时间一致成立的非渐近弱混沌传播。具体来说,记$L_t$为均场过剩均方误差损失在时间$t$处的值,$m$为神经元数量,在标准正则性假设和条件$\int_0^\infty L_t^{1/2} dt =O(\log d)$下,只要$L_t \lesssim t^{-c}$,我们就得到时间一致的界$\|f_{ρ_t^{MF}}- f_{\hatρ_t^m}\|^2 \lesssim \text{poly}(d) m^{-\min(1,c/6)}$。我们的结果在无噪声设定下成立,不对最优解附近的景观几何作任何假设,并可无缝推广到其他形式的离散化,包括有限样本数和时间离散化。我们结果的一个关键结论是:只要均场总体损失动力学的收敛率快于$t^{-2}$,我们仅需$\text{poly}(d/ε)$个神经元、训练样本和GD步数即可达到损失$ε$。

英文摘要

We consider one-hidden layer neural networks trained in the feature-learning regime using gradient descent, and relate the output of the finite-width network $f_{\hatρ_t^m}$ to its infinite-width counterpart $f_{ρ_t^{MF}}$, which evolves in the mean-field dynamics. While constant-time horizon bounds for $\|f_{ρ_t^{MF}} - f_{\hatρ_t^m}\|$ may be obtained via standard Grönwall estimates, the long-time behavior of the fluctuation is a more delicate matter. Uniform-in-time bounds often rely on (local) strong convexity in the landscape or Logarithmic Sobolev inequalities present in noisy gradient dynamics. In this work, we establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting instead the convergence rate of the mean-field deterministic Wasserstein-gradient-flow dynamics. Specifically, denoting by $L_t$ the mean-field excess MSE loss at time $t$ and $m$ the number of neurons, under standard regularity assumptions and the condition $\int_0^\infty L_t^{1/2} dt =O(\log d)$, we obtain the uniform in time bound $\|f_{ρ_t^{MF}}- f_{\hatρ_t^m}\|^2 \lesssim \text{poly}(d) m^{-\min(1,c/6)}$ whenever $L_t \lesssim t^{-c}$. Our result holds in a noiseless setting and does not make any assumptions on the geometry of the landscape near the optimum, and extends seamlessly to other forms of discretization, including finite number of samples and time discretization. A key takeaway of our result is that whenever the convergence rate of the mean-field, population-loss dynamics is faster than $t^{-2}$, we can attain a loss of $ε$ with only $\text{poly}(d/ε)$ neurons, training samples, and GD steps.

2605.22004 2026-05-22 stat.ME

Selecting Informative Conformal Prediction Sets with an Optimized FCR-Controlled Approach

利用优化的FCR控制方法选择有信息量的共形预测集

Israela Solomon, Etienne Roquain, Saharon Rosset, Ruth Heller

AI总结 本文研究共形方法在选择性推断设置中的应用,通过优化的FCR控制方法构建有信息量的共形预测集:先在oracle设置下推导最优决策策略,再通过校准程序调整该策略以维持有限样本FCR控制,并在分类结果上展示了方法的有效性。

详情
AI中文摘要

共形方法为结果提供具有置信保证的预测集。我们研究其在选择性推断设置中的应用,其中只有当预测集具有信息量时才进行推断。例如,分析人员可以将预测集足够小、排除空值或满足其他适当单调约束的情形视为有信息量。由于在实际应用中推断通常仅限于有信息量的情形,因此考虑由此产生的选择偏差对于维持错误覆盖率(FCR)控制至关重要。Gazin 等人(2025)提出了一种通用框架,用于在控制所选样本上FCR的同时构建此类有信息量的共形预测集。在本工作中,我们专注于oracle引导的程序。我们在oracle设置下推导出合适功效目标下的最优决策策略,在该设置中,属于每个预测集的概率可以计算。当然,在实践中只有估计的概率可用。因此,我们引入了一个校准程序,对oracle策略进行调整以维持有限样本FCR控制。我们证明了这种方法可以获得比现有方法显著更高的功效。我们在真实数据和模拟数据上展示了新方法对分类结果的有效性。

英文摘要

Conformal methods provide prediction sets for outcomes with confidence guarantees. We study their use in a selective inference setting, where inference is performed only when the prediction set is informative. The analyst may consider as informative, for example, cases with prediction sets that are sufficiently small, exclude null values, or satisfy other appropriate monotone constraints. Because inference is typically restricted to informative cases in practical applications, accounting for the resulting selection bias is crucial to maintaining false coverage rate (FCR) control. A general framework for constructing such informative conformal prediction sets while controlling the FCR on the selected sample was suggested in Gazin et al. (2025). In this work we focus on oracle-guided procedures. We derive the optimal decision policy under a suitable power objective in the oracle setting where the probability of belonging to each prediction set can be computed. In practice, of course, only estimated probabilities are available. We therefore introduce a calibration procedure that adjusts the oracle policy to maintain finite sample FCR control. We show that this approach can achieve substantially higher power than available alternatives. We demonstrate the effectiveness of our new methods for classification outcomes on both real and simulated data.

2605.21928 2026-05-22 cs.LG cs.AI stat.ME

CausalGuard: Conformal Inference under Graph Uncertainty

CausalGuard: 图不确定性下的共形推断

Vikash Singh, Weicong Chen, Debargha Ganguly, Yanyan Zhang, Nengbo Wang, Sreehari Sankar, Mohsen Hariri, Alexander Nemecek, Chaoda Song, Shouren Wang, Biyao Zhang, Van Yang, Erman Ayday, Jing Ma, Vipin Chaudhary

AI总结 本文提出CausalGuard,一种结构加权的共形框架,在聚合图条件双稳健伪结果之后进行校准,从而在图不确定性下为聚合后的伪结果提供无分布假设的有限样本边际覆盖。

详情
AI中文摘要

从观察数据估计处理效应需要选择调整集,但有效的调整依赖于未知的因果图。图的错误设定可能导致覆盖不足,而与图无关的共形包装方法可能只能通过大幅填充(padding)来恢复名义覆盖。我们提出CausalGuard,一种结构加权的共形框架,它在聚合图条件双稳健伪结果之后进行校准。候选DAG由LLM导出的边先验提出,通过条件独立性检验进行修剪,并按贝叶斯信息准则重新加权。随后,一个复合非一致性分数对后验加权的伪结果进行校准。CausalGuard为该聚合伪结果提供无分布假设的有限样本边际覆盖;在因果可识别性、重叠性、条件均值冗余函数稳定性以及权重集中于与目标对齐的有效调整策略的条件下,其条件均值收敛于真实的条件平均处理效应。在五个基准测试中,CausalGuard在可直接评估的目标上达到了高于名义90%水平的平均覆盖率,并在与图无关的共形基线需要大幅填充时减小了区间宽度。压力测试表明,当保留的候选集受数据支持时,CausalGuard能抑制无效的对撞因子(collider)调整,并在先验错误设定下保持稳定。

英文摘要

Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.

2605.21899 2026-05-22 stat.CO math.PR math.ST stat.TH

Mad Props: Parallelism in Markov Chain Monte Carlo Through the Lens of the Infinite Proposal Limit

Mad Props: 通过无限提议极限的视角探索马尔可夫链蒙特卡罗中的并行性

Nathan E. Glatt-Holtz, Andrew J. Holbrook, Justin A. Krometis, Cecilia F. Mondaini

AI总结 本文研究了多提议马尔可夫链蒙特卡罗算法中的并行性问题,通过分析无限提议极限下的核函数,揭示了不同提议和接受结构下的新方法,并发现了现有方法的局限性。

详情
AI中文摘要

多提议马尔可夫链蒙特卡罗(MP-MCMC)算法利用成簇的提议来高效地遍历状态空间并克服复杂的目标几何结构。尽管MCMC方法天然易于并行(embarrassingly parallel),但MP-MCMC形式体系所提供的非平凡并行形式有时能显著优于朴素方法。在这里,一个重要的调节参数是单次MP-MCMC迭代中使用的提议数量p。尽管已经提出了多种计算策略来在MP-MCMC范式内高效利用大量提议,但人们对这些算法仍知之甚少,尤其是在大p情形下。在本文中,我们识别并研究了几种有前景的新方法(算法1.1、算法3.3、算法3.4),排除了其他现有方法,并发现了不同MP-MCMC方法之间的新关系,由此得到了一些出人意料的结果。我们的分析以作者最近构建的一般状态空间多提议对合(involutive)理论为核心,并结合对各类提议和接受结构下MP-MCMC算法的大p极限核的考察。

英文摘要

Multiproposal MCMC (MP-MCMC) algorithms use clouds of proposals to efficiently traverse state spaces and overcome complex target geometries. While MCMC methods are embarrassingly parallel by nature, the non-trivial forms of parallelism provided by the MP-MCMC formalism sometimes lead to significant improvements over a naive approach. Here, one important tuning parameter is the number of proposals p used by a single MP-MCMC iteration. While a number of computational strategies have been proposed to efficiently leverage large numbers of proposals within the MP-MCMC paradigm, much remains unknown about these algorithms, particularly in the large-p regime. In this contribution, we discover surprising results by identifying and studying several promising new methods (Algorithm 1.1, Algorithm 3.3, Algorithm 3.4), ruling out other extant approaches, and discovering new relationships between different MP-MCMC methodologies. Our analysis is centered on a general state space multiproposal involutive theory recently constructed by the authors, combined with the consideration of the large-p limit kernels for MP-MCMC algorithms within a variety of different classes of proposal and acceptance structures.

2605.21884 2026-05-22 stat.ME

Trend and seasonality estimation for point-process time series

点过程时间序列的趋势和季节性估计

Daniel Gervini, Simon A. Kopischke

AI总结 本文提出了一种用于点过程时间序列趋势和季节性估计的简单M估计器,通过模拟研究其有限样本性能,并以芝加哥Divvy自行车共享系统中的自行车需求模式为例进行实数据应用。

详情
AI中文摘要

本文介绍了用于时间序列点过程的趋势和季节性估计器。我们假设点过程遵循具有对数高斯强度函数的时序或空间双随机泊松模型。所提出的估计器是计算简单的M估计器。其渐近分布被推导出来,通过模拟研究其有限样本性能。作为实际数据应用的例子,我们研究了芝加哥Divvy自行车共享系统中的自行车需求模式。

英文摘要

This article introduces estimators of trend and seasonality for time series of point processes. We assume the point processes follow a temporal or spatial doubly-stochastic Poisson model with log-Gaussian intensity functions. The proposed estimators are computationally simple M-estimators. Their asymptotic distribution is derived, and their finite-sample performance is studied by simulation. As an example of real-data application, we study the patterns of bike demand in the Divvy bike-sharing system of the city of Chicago.

2605.21860 2026-05-22 math.ST cs.DS cs.IT math.IT stat.ML stat.TH

Robust Statistical Estimators with Bounded Empirical Sensitivity

具有有界经验敏感度的稳健统计估计量

Valentio Iverson, Gautam Kamath, Argyris Mouzakis, Adam Smith

AI总结 本文提出了一种衡量统计估计量鲁棒性的新指标,即经验敏感度。研究了高斯均值估计等典型问题中该量的界限,并证明了对于达到最优ℓ₂误差的估计量,经验敏感度下界为Ω(η+√(ηd/n)),并通过最近的鲁棒经验均值估计结果证明该下界在对数因子范围内是紧的。

详情
AI中文摘要

我们引入了一种新的统计估计量鲁棒性度量,称为经验敏感度。如果以高概率对于数据集$X = (X_1, \dots, X_n) \sim \mathcal{D}^{\otimes n}$,任何通过修改$X$中至多$ηn$个点得到的数据集$Y$都满足$\hat θ(Y)$接近$\hat θ(X)$,则称估计量$\hat θ$具有有界经验敏感度。我们针对高斯均值估计这一典型问题研究了该量的界。我们证明了新的下界,表明对于任何达到最优ℓ₂误差界$O\left(\sqrt{d/n}\right)$的估计量$\hat μ$,经验敏感度至少为$Ω\left(η+ \sqrt{ηd/n}\right)$。这两项源于此类估计量的均值和方差上的障碍(通过Efron-Stein论证)。我们利用最近关于鲁棒经验均值估计的结果,证明该界在对数因子范围内是紧的。

英文摘要

We introduce a new measure of robustness for statistical estimators, which we call \emph{empirical sensitivity}. An estimator $\hat θ$ has bounded empirical sensitivity if, with high probability over a dataset $X = (X_1, \dots, X_n) \sim \mathcal{D}^{\otimes n}$, for any dataset $Y$ obtained by modifying at most $ηn$ points in $X$, we have that $\hat θ(Y)$ is close to $\hat θ(X)$. We study bounds on this quantity for the prototypical problem of Gaussian mean estimation. We prove new lower bounds, showing that for any estimator $\hat μ$ which achieves an optimal $\ell_2$-error bound of $O\left(\sqrt{d/n}\right)$, the empirical sensitivity is at least $Ω\left(η+ \sqrt{ηd/n}\right)$. The two terms arise due to obstructions on the mean and variance (via an Efron-Stein argument) of such an estimator. We show that this bound is tight up to logarithmic factors, by employing recent results for robust empirical mean estimation.

2605.21848 2026-05-22 stat.ME

Block-Independent Likelihood Ratio Testing for High-Dimensional Mean Vectors with Applications to Matrix-Variate Data

基于块独立的似然比检验用于高维均值向量的检验及其在矩阵变量数据中的应用

Minsub Shin, Kwangok Seo, Sang Han Lee, Johan Lim

AI总结 本文提出了一种新的检验方法BILT,用于检验两个高维均值向量的相等性,解决了传统方法在高维情况下失效的问题,并在矩阵变量数据中进行了应用验证。

详情
AI中文摘要

检验两个高维均值向量的相等性是多元分析中的基本问题。尽管经典的Hotelling's T²检验在低维情况下是最优的,但当维度p与样本量n相当或超过时,其表现不佳。已提出了一些扩展方法,包括对角似然比检验(DLRT),但这些方法在变量间存在相关性时会显著损失检验效能。本文提出了一种新的检验方法,即块独立似然比检验(BILT),通过放松变量间独立性假设为块独立性假设来推广DLRT。我们建立了BILT统计量在'增加p与小n'情况下渐近正态性的理论结果。进一步分析了BILT在局部替代假设下的渐近检验效能。大量模拟研究显示,BILT在广泛协方差结构下保持了类型I错误控制,并显著优于DLRT。对阿尔茨海默病神经影像计划(ADNI)数据集的应用进一步展示了BILT在检验两个矩阵变量总体均值差异中的应用。

英文摘要

Testing the equality of two high-dimensional mean vectors is a fundamental problem in multivariate analysis. While the classical Hotelling's $T^2$ test is optimal in low-dimensional settings, it fails when the dimension $p$ is comparable to or exceeds the sample size $n$. Several extensions, including the Diagonal Likelihood Ratio Test (DLRT), have been proposed under a working independence assumption among variables. However, such an assumption can lead to a substantial loss of power when correlations are present. In this paper, we propose a new test, the Block Independent Likelihood Ratio Test (BILT), which generalizes DLRT by relaxing the working independence assumption to a block independence assumption. We establish the asymptotic normality of the null distribution of the BILT statistic for 'increasing $p$ with small $n$' under mild regularity conditions. We further analyze the asymptotic power of BILT under local alternatives. Extensive simulation studies show that BILT maintains Type I error control and achieves substantially higher power than DLRT across a wide range of covariance structures. An application to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset further demonstrates the use of BILT for testing mean differences between two matrix-variate populations.

2605.21846 2026-05-22 stat.ME cs.LG stat.ML

Causal Discovery in Structural VAR Models Under Equal Noise Variance

在等噪声方差假设下结构VAR模型中的因果发现

SeyedSina Seyedi HasanAbadi, Fahimeh Arab, Erfan Nozari, AmirEmad Ghassami

AI总结 本文研究了在等噪声方差假设下线性高斯结构VAR模型中的因果发现问题,提出了一种基于稀疏性的方法ENVAR,用于在观测等价类中寻找稀疏的结构代表,并在合成数据和fMRI数据集上进行了评估。

详情
AI中文摘要

从多变量时间序列中进行因果发现具有挑战性,因为因果效应可能在时间上和同一采样间隔内同时发生。这个问题在神经科学等应用中尤为重要,其中采样率可能相对粗糙,而同时效应不一定形成无环图。我们研究了在等噪声方差假设下线性高斯结构VAR模型中的因果发现,这意味着结构噪声项具有共同的方差。与基于DAG的横断面等噪声方差设置不同,此处考虑的时间序列设置通常不会导致因果图的唯一点识别。相反,多种结构VAR参数化可以诱导相同的平稳观测过程定律。我们引入了一种针对此设置的观测等价性概念,并展示相应的等价类由结构方程的正交变换以及全局正比例尺度共同刻画。这种刻画导致了观测对齐差异,即比较结构模型模去保持观测定律的变换。基于这一理论,我们提出ENVAR,一种基于稀疏性的方法,用于在诱导的观测等价类中搜索稀疏的归一化结构代表。我们评估了所提出的方法在合成结构VAR数据和fMRI数据集上的性能。

英文摘要

Causal discovery from multivariate time series is challenging when causal effects may occur both across time and within the same sampling interval. This issue is especially important in applications such as neuroscience, where the sampling rate may be coarse relative to the underlying dynamics and contemporaneous effects need not form an acyclic graph. We study causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption, meaning that the structural noise terms have a common variance. Unlike the DAG-based cross-sectional equal noise variance setting, the time-series setting considered here does not generally yield point identification of a unique causal graph. Instead, multiple structural VAR parameterizations can induce the same stationary observed process law. We introduce a notion of observational equivalence tailored to this setting and show that the corresponding equivalence class is characterized by orthogonal transformations of the structural equations together with a global positive scale. This characterization leads to an equivalence-aware model discrepancy, the observational alignment discrepancy, which compares structural models modulo transformations that preserve the observed law. Building on this theory, we propose ENVAR, a sparsity-based procedure that searches over the induced observational equivalence class for a sparse normalized structural representative. We evaluate the proposed methodology on synthetic structural VAR data and on an fMRI dataset.

2605.21805 2026-05-22 stat.CO cs.LG stat.ML

Truncated Neural Likelihood Estimation for Simulation-Based Inference in State-Space Models

截断神经似然估计用于状态空间模型中的基于模拟的推断

Kostas Tsampourakis, Víctor Elvira

AI总结 本文提出了截断序列神经似然(T-SNL)方法,解决了序列神经似然(SNL)在状态空间模型参数推断中需要大量模拟样本、随序列长度扩展性差且不可摊销的问题,从而提高了推断的准确性、稳定性与鲁棒性。

详情
AI中文摘要

状态空间模型(SSM)是强大的概率工具,用于建模具有潜在动态的时变系统。SSM中的推断涉及对潜在状态和参数的估计。在本文中,我们关注参数推断;由于似然函数难以处理,这对SSM而言通常是一个极具挑战性的问题。最近,序列神经似然(SNL)等神经估计方法在贝叶斯推断问题中显示出有前景的结果。在本文中,我们表明SNL应用于SSM设置时存在重要局限,例如需要大量模拟样本才能达到中等性能,随序列长度的扩展性差,并且不是摊销的。随后,我们提出了一种新的推断算法,称为截断SNL(T-SNL),以解决SNL的这些局限。我们的算法更准确,在训练过程中更稳定、更鲁棒,对更长的时间序列具有更好的扩展性,并且在有新观测可用时可以进行摊销。我们的实验表明,T-SNL是一种样本效率高、鲁棒且灵活的算法,优于其他方法。

英文摘要

State-space models (SSMs) are powerful probabilistic tools for modeling time-varying systems with latent dynamics. Inference in SSMs involves the estimation of latent states and parameters. In this work, we focus on parameter inference, which for SSMs is in general a very challenging problem due to the intractability of the likelihood. Recently, neural estimation methods, such as sequential neural likelihood (SNL), have shown promising results in Bayesian inference problems. In this paper, we show that SNL, when applied to the SSM setting, suffers from important limitations: it requires a large number of simulated samples to achieve moderate performance, scales poorly with sequence length, and is not amortized. We then introduce a novel inference algorithm called truncated-SNL (T-SNL), which addresses the limitations of SNL. Our algorithm is more accurate, more stable and robust during training, more scalable to longer temporal sequences, and can be amortized when new observations become available. Our experiments show that T-SNL is a sample-efficient, robust, and flexible algorithm that outperforms other approaches.

2605.21798 2026-05-22 cs.LG stat.ML

Three Costs of Amortizing Gaussian Process Inference with Neural Processes

用神经过程摊销高斯过程推断的三种代价

Robin Young

AI总结 本文研究了神经过程在高斯过程推断中的摊销成本,将高斯过程的后验推断从精确的O(n^3)转换为学习的O(n)映射,分析了标签污染、信息瓶颈和摊销误差三个来源,并提出了架构优化建议。

详情
Comments
To appear at ProbNum 2026
AI中文摘要

神经过程用于摊销高斯过程推断,将精确的O(n^3)后验替换为学习的O(n)映射,从上下文集到预测分布。对于一类潜在的神经过程,我们界定了高斯过程和LNP预测之间的KL散度,将其分解为三个可解释的来源,即标签污染,因为神经过程使用标签值来估计在精确高斯过程中标签无关的量;信息瓶颈,因为有限维表示无法解析完整的上下文几何;以及摊销误差,因为单个编码器网络在所有上下文中共享。瓶颈截断项随着表示维度d衰减为O(e^{-cd^{2/d_x}}),对于平方指数核在R^{d_x}上,其中c>0是核依赖的常数,以及对于Matérn-ν核为O(d^{-2ν/d_x}),直接将架构尺寸与核平滑度和输入维度联系起来。标签污染项通常为O(1),只有观测噪声部分衰减为O(1/n),识别了通过标签依赖的表示路由不确定性估计的持续成本。这些结果刻画了在分析类别中的摊销成本,并产生了架构建议,以在高斯过程摊销范围内仅从上下文位置预测方差,并用二阶池化代替均值聚合以关闭主导的摊销差距。

英文摘要

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ν/d_x})$ for Matérn-$ν$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.
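The remark that the exact GP estimates a label-independent quantity can be checked directly: with fixed kernel hyperparameters, the posterior variance depends only on where the context points are, not on their labels. The sketch below assumes a unit-variance squared-exponential kernel and a fixed noise level; the function names and parameter values are illustrative and do not come from the paper.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    """Unit-variance squared-exponential kernel on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_ctx, y_ctx, x_tgt, noise=0.1):
    """Exact GP posterior mean and latent variance at the target inputs.

    The linear solves cost O(n^3) in the number of context points n.
    """
    K = se_kernel(x_ctx, x_ctx) + noise**2 * np.eye(len(x_ctx))
    Ks = se_kernel(x_tgt, x_ctx)
    mean = Ks @ np.linalg.solve(K, y_ctx)           # uses the labels
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))  # does not
    return mean, var
```

Calling `gp_posterior` twice with the same `x_ctx` and different `y_ctx` returns different means and identical variances, which is the property behind the recommendation to predict variance from context locations alone.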

2605.21793 2026-05-22 stat.ME stat.AP stat.ML

Targeted maximum likelihood estimation of vaccine effectiveness and immune correlates in test-negative design studies with missing data

含缺失数据的检测阴性设计研究中疫苗效果和免疫相关指标的靶向最大似然估计

Leah I. B. Andrews, Lars van der Laan, Peter B. Gilbert

AI总结 本文针对暴露变量存在缺失数据的检测阴性设计研究提出了一种靶向最大似然估计方法,通过半参数逻辑回归模型估计就医人群中症状性疾病的因果条件风险比,以实现灵活的数据驱动混杂控制和有效的因果推断。

详情
Comments
52 pages, 14 figures
AI中文摘要

检测阴性设计(TND)是一种资源高效的观察性研究设计,可用于评估疫苗效果以及与暴露时间接近的疾病免疫相关指标。TND招募寻求诊断检测的有症状个体,并按检测时测量的暴露变量(如疫苗接种状态或免疫标志物水平)比较病例状态。虽然TND减少了由就医行为引起的混杂,但其他混杂来源可能仍然存在。TND研究还可能由于记录不完整或两阶段抽样设计而在暴露变量中存在缺失数据。我们提出了一种靶向最大似然估计方法,其中使用一个半参数逻辑回归模型,以就医人群中症状性疾病的因果条件风险比为目标。在因果假设和随机缺失假设下,我们的方法给出一个高效、渐近线性的估计量,在分析暴露变量存在缺失数据的TND研究时能够提供灵活的、数据驱动的混杂控制和有效的因果推断。我们通过对一项两阶段TND免疫相关指标研究的plasmode模拟评估了该方法的有限样本性质。我们还应用该方法,基于源自Moderna冠状病毒效力(Coronavirus Efficacy)III期试验的TND研究队列,评估了COVID-19疫苗效果以及与COVID-19相关的抗体标志物。

英文摘要

The test-negative design (TND) is a resource-efficient observational study design that can assess vaccine effectiveness and exposure-proximal immune correlates of disease. The TND enrolls symptomatic individuals seeking diagnostic testing and compares case status by an exposure variable, such as vaccination status or immune marker level, that is measured at testing. While the TND reduces confounding by healthcare-seeking behavior, other sources of confounding may remain. TND studies may also have missing data in the exposure variable due to incomplete records or two-phase sampling designs. We present a targeted maximum likelihood estimation approach involving a semiparametric logistic regression model that targets a causal conditional risk ratio of symptomatic disease in the healthcare-seeking population. Under causal and missing at random assumptions, our method produces an efficient, asymptotically linear estimator that provides flexible, data-driven confounding control and valid causal inference when analyzing TND studies with missing exposure variable data. We evaluate our method's finite sample properties using plasmode simulations of a two-phase TND immune correlates study. We also apply our method to assess COVID-19 vaccine effectiveness and antibody marker correlates of COVID-19 from TND study cohorts derived from the Moderna Coronavirus Efficacy phase 3 trial.

2605.21783 2026-05-22 cs.LG stat.ML

MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation

Ahanaf Hasan Ariq

AI总结 本文为测试时适应提出了一个PAC-Bayesian框架,将源分布周围的MMD球解释为信度集(credal sets),从而自然地量化认知不确定性,并建立了依赖于MMD的泛化界、其有限样本版本、信度集上一致的最坏情况风险界以及测地线保持界。

详情
Comments
15 pages, 0 figures. Accepted at the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)
英文摘要

Test-time adaptation (TTA) methods improve model performance under distribution shift but lack formal guarantees connecting shift magnitude to prediction reliability. We develop a PAC-Bayesian framework yielding generalization bounds explicitly parameterized by the maximum mean discrepancy (MMD) between source and target distributions. Our principal contribution is interpreting MMD-balls around the source distribution as credal sets in Walley's imprecise probability theory, yielding natural epistemic uncertainty quantification. We establish: (i) a PAC-Bayesian bound with an MMD-dependent shift penalty under an RKHS-Lipschitz loss assumption; (ii) a finite-sample version via MMD concentration; (iii) a uniform worst-case risk bound over all distributions in the credal set, with a lower-upper risk decomposition; and (iv) geodesic preservation bounds explaining why kernel-guided adaptation protects local feature geometry. The credal set interpretation separates epistemic from aleatoric uncertainty and provides a principled decision criterion for when adaptation is warranted.

2605.21782 2026-05-22 stat.ME stat.AP stat.CO

A Scalable Parametric Item Calibration Engine (SPICE) for Explanatory IRT with Sparse Data

一种适用于稀疏数据的解释性IRT参数化项目校准引擎(SPICE)

Steven W. Nydick, Manqian Liao, J. R. Lockwood

AI总结 本文提出了一种适用于稀疏数据的解释性IRT参数化项目校准引擎(SPICE),通过贝叶斯多维解释性IRT模型和MCMC估计方法,实现对大规模稀疏数据的心理测量分析。

详情
AI中文摘要

我们描述了一种贝叶斯多维解释性IRT模型,以及相关的马尔可夫链蒙特卡罗(MCMC)估计过程和相应的校准软件开发,旨在对大量稀疏连接的人和项目进行心理测量分析。此类数据结构可能例如来自使用大量自动生成项目库的自适应评估,其中每位受试者只接收整个库的极小比例。我们讨论了模型规范、数据结构和算法实现的选择如何共同创造一种可扩展的解释性IRT方法,以支持各种稀疏数据的心理测量操作。

英文摘要

We describe a Bayesian multidimensional explanatory IRT model, and an associated Markov Chain Monte Carlo (MCMC) estimation procedure and the corresponding development of calibration software, designed for psychometric analyses of large numbers of sparsely-linked persons and items. Such data structures can arise, for example, from adaptive assessments using large banks of automatically generated items with individual test takers receiving a very small proportion of the entire bank. We discuss how our choices for model specification, data structures, and algorithm implementation combine to create a scalable method for explanatory IRT that can support a variety of psychometric operations with sparse data.

2605.21763 2026-05-22 cs.LG cs.SY eess.SY stat.ML

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

关于优化确定等价的折扣强化学习样本复杂性

Oliver Mortensen, Mohammad Sadegh Talebi

AI总结 本文研究了有限折扣MDP中的风险敏感强化学习,考虑优化确定性等价(OCE)这一风险度量族,分析了在递归OCE下学习最优状态-动作价值函数和最优策略的样本复杂度。文中精确刻画了使目标PAC可学习的效用函数,给出了一种简单的基于模型方法的PAC样本复杂度界,证明了当效用函数的定义域不是全体实数时问题不可PAC学习,并建立了价值学习和策略学习的下界:这些下界在状态-动作空间大小SA上是紧的,对更受限的效用类还显式给出了对有效时域1/(1-γ)的依赖。

详情
Comments
Accepted to RLC 2026. arXiv admin note: substantial text overlap with arXiv:2506.00286
AI中文摘要

我们研究有限折扣MDP中的风险敏感强化学习,其中假设可以使用MDP的生成模型。我们考虑一族称为优化确定性等价(OCE)的风险度量,其中包括熵风险、CVaR和均值-方差等重要的风险度量。我们关注在递归OCE下学习最优状态-动作价值函数(价值学习)和最优策略(策略学习)的样本复杂度。我们精确刻画了使相应OCE所定义的目标PAC可学习的效用函数$u$。我们分析了一种简单的基于模型的方法,并推导了PAC样本复杂度界。我们证明了只要$u$的定义域不是全体实数,即$\text{dom}(u)\neq \mathbb{R}$,相应的问题就不可PAC学习。最后,我们为价值学习和策略学习建立了相应的下界,证明了在状态-动作空间大小$SA$上的紧性;对于更受限的一类效用函数,我们推导的下界显式给出了对有效时域$\frac{1}{1-γ}$的依赖。具体而言,对于$\text{CVaR}_τ$,我们证明了对$τ$的正确依赖为$\frac{1}{τ^2}$,从而比现有最优结果改进了一个$\frac{1}{τ}$因子,尽管我们的界对$\frac{1}{1-γ}$的依赖是次优的。

英文摘要

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain, i.e., $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of the state-action space, and for a more restricted class of utilities, we derive lower bounds that make the dependence on the effective horizon $\frac{1}{1-γ}$ explicit. Specifically, for $\text{CVaR}_τ$ we show that the correct dependence on $τ$ is $\frac{1}{τ^2}$, thus improving by a factor of $\frac{1}τ$ over the state of the art, although our bound has a suboptimal dependence on $\frac{1}{1-γ}$.
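For readers unfamiliar with the OCE family, the sketch below evaluates an empirical OCE under the commonly used definition OCE_u(X) = sup_λ { λ + E[u(X − λ)] } and recovers CVaR_τ with the utility u(t) = min(t, 0)/τ. The abstract does not state these formulas, so both the definition and the CVaR utility are assumptions taken from the standard literature, and the function names are hypothetical.

```python
def oce(samples, utility):
    """Empirical OCE: maximize lam + mean(utility(x - lam)) over lam.

    For a concave piecewise-linear utility with a single kink at 0 (as in
    the CVaR case), the objective is concave and piecewise linear in lam
    with kinks at the sample values, so searching the sample points is
    exact. Other utilities would need a numerical optimizer instead.
    """
    n = len(samples)
    return max(
        lam + sum(utility(x - lam) for x in samples) / n
        for lam in samples
    )

def cvar_utility(tau):
    """Utility whose OCE is CVaR at level tau (mean of the worst tau-fraction)."""
    return lambda t: min(t, 0.0) / tau

rewards = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
worst_fifth = oce(rewards, cvar_utility(0.2))  # mean of the two lowest rewards
```

Here `worst_fifth` equals 1.5, the average of the two smallest rewards, and setting `tau = 1` returns the plain mean, which shows how τ interpolates between risk-neutral and worst-case evaluation.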

2605.21757 2026-05-22 stat.ME

Substantive-Model-Compatible Multiple Imputation for Cox Regression with a Diverging Number of Covariates

具有实质性模型兼容性的多重插补用于具有发散协变量数的Cox回归

Zhilin Zhang, Yi Li

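AI summary: This paper proposes a semiparametric multiple imputation framework for Cox regression with missing covariates of diverging dimension. Missing covariates are imputed by a high-dimensional SMC-FCS procedure driven by Cox-model likelihood contributions, with rejection sampling and ridge-regularized posterior draws preserving model compatibility, and inference combines debiased estimators with within-imputation variance estimates through Rubin's rules.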

详情
AI中文摘要

现代高维基因组和临床预测的生物医学生存研究面临缺失协变量的挑战。现有方法在协变量数量随样本量发散时通过惩罚和去偏进行推断,但通常是在完全观察协变量的情况下开发的。相反,具有实质性模型兼容性的多重插补方法,特别是实质性模型兼容的完全条件规范(SMC-FCS),提供了处理缺失协变量的原理化方法,同时保持与Cox模型的兼容性,但当前的方法和理论仍然主要局限于固定维度设置。为了解决这些限制,我们提出了一种半参数多重插补框架,用于具有发散维度的Cox回归中缺失协变量的推断。缺失协变量通过由Cox模型似然贡献驱动的高维SMC-FCS过程进行插补,使用拒绝采样以强制实质性模型兼容性,并使用岭正则化后验抽样以稳定插补模型。该算法通过插补正则化优化迭代稳定Cox估计器,然后从稳定链中生成多重插补数据集。对于低维线性泛函或对比$c^\top β$,通过Rubin规则结合去偏估计器和插补内部方差估计进行推断。我们在发散维度环境下建立了所得到的汇总估计器的一致性和渐近正态性。模拟研究展示了良好的有限样本性能,且对波士顿肺癌生存队列的应用展示了所提出方法在高维生存研究中处理不完整协变量的实用价值。

英文摘要

Modern biomedical survival studies with high-dimensional genomic and clinical predictors are challenged by missing covariates. Existing methods conduct inference through penalization and debiasing when the number of covariates diverges with sample size, but they are typically developed with fully observed covariates. Conversely, substantive-model-compatible multiple imputation methods, particularly substantive-model-compatible fully conditional specification (SMC-FCS), provide principled handling of missing covariates while preserving compatibility with the Cox model, yet current methodology and theory remain largely restricted to fixed-dimensional settings. To address these limitations, we propose a semiparametric multiple imputation framework for inference in Cox regression with missing covariates of a diverging dimension. Missing covariates are imputed through a high-dimensional SMC-FCS procedure driven by Cox-model likelihood contributions, with rejection sampling used to enforce substantive-model compatibility and ridge-regularized posterior draws used to stabilize the imputation models. The algorithm stabilizes the Cox estimator through an imputation-regularized optimization iteration and then generates multiply imputed datasets from a stabilized chain. Inference for low-dimensional linear functionals or contrasts, $c^\top β$, is obtained by combining debiased estimators and within-imputation variance estimates through Rubin's rules. We establish consistency and asymptotic normality of the resulting pooled estimator under a diverging-dimensional regime. Simulation studies demonstrate favorable finite-sample performance, and an application to the Boston Lung Cancer Survival Cohort illustrates the practical utility of the proposed method for high-dimensional survival studies with incomplete covariates.
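
Rubin's rules, which the abstract uses to pool inference across imputed datasets, combine M point estimates and their variances roughly as sketched below. This shows only the classical scalar pooling step under the usual total-variance formula; the debiased estimators and the SMC-FCS imputation are not reproduced, and the function name `pool_rubin` is a hypothetical one chosen for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M scalar estimates and their within-imputation variances."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # pooled point estimate
    u_bar = u.mean()                      # average within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b   # total variance of the pooled estimate
    return q_bar, total
```

The factor (1 + 1/M) inflates the between-imputation variance to account for using a finite number of imputations.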

2605.21736 2026-05-22 stat.ML cs.AI cs.LG

Support-aware offline policy selection for advertising marketplaces

面向广告市场的支持感知离线策略选择

Prashant Shekhar, Caroline Howard

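AI summary: This paper proposes a support-aware offline decision framework for reserve-policy selection in advertising marketplaces. It converts logged evidence into a conservative decision object so that validation decisions are certified rather than based on point-estimate rankings alone.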

详情
AI中文摘要

记录的广告拍卖使离线保留价格评估变得有吸引力但有风险。回放表可以识别具有大显眼收益增益的策略,但它们也可能隐藏弱阈值支持、多重比较效应、子组伤害和投标者响应不确定性。现有的回放和离线策略评估方法估计或排名策略价值,但它们不能直接回答可用证据是否足够强以证明验证的问题。本文开发了一种支持感知的离线决策框架用于保留策略选择。与其输出单一的点估计胜者,该框架将记录证据转化为保守的决策对象,包括认证的策略、统计上被主导的替代方案以及需要进一步验证的未解决候选者。主要理论结果给出了一种统一的有限目录保证,显示在同时控制不确定性和保守支持门控的情况下,该框架保留了最佳通过策略,同时排除了具有认证遗憾的策略。支持性结果描述了支持本地化的回放泛化,建立了信息论阈值解析极限,并量化了异质投标者响应如何推翻本地化回放排名。在iPinYou实时竞价日志上的实验显示,领先的保留规则在第二季实现了47.66%的回放提升,同时实现了40.71%的下限提升,在第三季实现了43.87%的冻结超时回放提升。该框架将19个策略目录减少到两个策略验证短名单,同时在44个广告商、交易所和地区段中认证无害。结果支持核心主张,即离线保留策略评估应产生认证的验证决策,而非仅依赖点估计排名。

英文摘要

Logged advertising auctions make offline reserve-price evaluation attractive but risky. Replay tables can identify policies with large apparent yield gains, yet they can also hide weak threshold support, multiple-comparison effects, subgroup harm, and bidder-response uncertainty. Existing replay and off-policy evaluation methods estimate or rank policy values, but they do not directly answer the operational question of whether the available evidence is strong enough to justify validation. This paper develops a support-aware offline decision framework for reserve-policy selection. Rather than outputting a single point-estimate winner, the framework converts logged evidence into a conservative decision object consisting of certified policies, statistically dominated alternatives, and unresolved candidates requiring further validation. The main theoretical result gives a unified finite-catalog guarantee showing that, under simultaneous uncertainty control and conservative support gates, the framework preserves the best gate-passing policy while eliminating only policies with certified regret. Supporting results characterize support-localized replay generalization, establish information-theoretic threshold-resolution limits, and quantify when heterogeneous bidder response can overturn localized replay rankings. Experiments on iPinYou real-time-bidding logs show that the leading reserve rule achieves a 47.66% replay lift in season two, a 40.71% simultaneous lower-bound lift, and a 43.87% frozen out-of-time replay lift in season three. The framework reduces a 19-policy catalog to a two-policy validation shortlist while certifying non-harm across 44 advertiser, exchange, and region segments. The results support the central claim that offline reserve-policy evaluation should produce certified validation decisions rather than point-estimate rankings alone.

2605.21717 2026-05-22 stat.CO cs.NA math.NA

Likelihood-informed dimension reduction across tempered Bayesian posteriors

基于温控贝叶斯后验的似然信息维度约简

Arne Bouillon, Oliver R. A. Dunbar

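AI summary: This paper generalizes likelihood-informed dimension reduction to tempered Bayesian posteriors, providing theory for building partially informed spaces and improving performance in practical settings where data are limited and noisy.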

详情
AI中文摘要

科学计算模拟无法在现实应用中代表所有尺度。为了弥合模型-数据差距,参数被注入模型并利用贝叶斯反演进行约束。为了减少模拟器评估次数(可能达到10^5次以上),现代方法结合了维度约简和正向映射的建模。由于模型评估和数据稀缺,这种维度约简对后验采样性能至关重要。最近的工作利用似然信息子空间(LIS)通过优化信息损失的界限来截断到信息性方向,尽管在数学上适合采样,但在实践中往往受限。本文证明了该方法可以推广到α-温控(即退火、功率后验)分布,其中α ∈ [0,1]。我们提供理论来构建称为α-LIS的部分信息空间。我们展示了α < 1可以经常产生近最优的空间。此外,我们专注于将α-LIS应用于实际案例,其中可用数据严重受限且存在噪声。我们提出并测试了利用整个分布序列α_0 < ... < α_k的数据扩展方法,并使用简单的模型梯度近似,使得我们的方法可用于混沌或随机系统的正向映射建模,其中导数不可用或无信息。在实验中,我们的累积方法在这些具有挑战性的条件下比理论上最优的α=1方法更加稳健。

英文摘要

Scientific computer simulations cannot represent all scales in realistic applications. To bridge this model-data gap, parameters are injected into models and constrained with noisy data using Bayesian inversion. To reduce the number of simulator evaluations, which can be $10^5$ or more, modern approaches employ dimension reduction in conjunction with emulation of the forward map (which contains the simulator). Due to the scarcity of model evaluations and data, this dimension reduction becomes very important for posterior sampling performance. Recent work on likelihood-informed subspaces (LIS) truncates to informative directions by optimizing bounds on information loss; though mathematically well-adapted to sampling, these subspaces are often restrictive in practice. In this work, we provably generalize this methodology to facilitate application to $α$-tempered (i.e., annealed, power-posterior) distributions for $α \in [0,1]$. We provide theory to build partially-informed spaces termed $α$-LIS. We show how $α < 1$ can often produce near-optimal spaces. In addition, we focus on applying $α$-LIS to practical cases, where the available data is severely limited and noisy. We propose and test extensions for utilizing data from the entire sequence of distributions $α_0 < \dots < α_k$, and use simple approximations of model gradients so that our approach can be used for emulation of forward maps for chaotic or stochastic systems where derivatives are unavailable or uninformative due to noise. In experiments, our accumulated approach is much more robust to these challenging circumstances than the theoretically optimal $α = 1$.
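
To make the notion of an $α$-tempered posterior concrete, the sketch below works out the closed form for a scalar conjugate Gaussian model, where the tempered posterior is proportional to the prior times the likelihood raised to the power $α$. The model (zero-mean Gaussian prior, Gaussian noise) and the function name are assumptions chosen for illustration; the paper's setting involves high-dimensional forward maps and is not reproduced here.

```python
def tempered_gaussian_posterior(y, noise_var, prior_var, alpha):
    """Mean and variance of p(theta) * p(y | theta)**alpha for a scalar Gaussian model.

    Prior: theta ~ N(0, prior_var); likelihood: y | theta ~ N(theta, noise_var).
    """
    # Tempering scales the likelihood precision by alpha.
    precision = 1.0 / prior_var + alpha / noise_var
    mean = (alpha * y / noise_var) / precision
    return mean, 1.0 / precision
```

Setting `alpha=0` recovers the prior and `alpha=1` the usual posterior; intermediate values interpolate between the two.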

2605.21698 2026-05-22 stat.CO

A Gaussian Sum Filter for Unifying Gaussian and Particle Filters

一种用于统一高斯滤波器和粒子滤波器的高斯和滤波器

Kostas Tsampourakis, Víctor Elvira

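AI summary: This paper proposes the Augmented Gaussian Sum Filter (AGSF), a filtering framework that unifies Gaussian sum filters and particle filters through an augmented Gaussian approximation parameterized by latent variables and tunable covariance parameters, yielding more efficient and robust estimation for nonlinear or non-Gaussian state-space models.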

详情
AI中文摘要

状态空间模型(SSMs)是一类广泛应用于工程和科学中的概率模型,用于描述动态系统。贝叶斯滤波在线性-高斯设置中是解析可行的,其中卡尔曼滤波能给出精确的后验分布。对于非线性或非高斯的SSMs,需要近似方法。两个显著的近似方法家族是高斯和滤波器(GSFs),它们依赖于局部高斯近似和数值积分方案,以及粒子滤波器(PFs),它们使用序列蒙特卡洛采样。尽管这些方法取得了成功,GSFs在强非线性区域中可能面临数值不稳定性问题,而PFs虽然灵活且稳健,但通常需要大量计算资源才能获得准确的估计。在本文中,我们提出了一种增强的高斯和滤波器(AGSF),这是一种新的滤波框架,通过引入潜变量和可调协方差参数的高斯近似,将GSFs和PFs统一起来。通过调整这些协方差,AGSF可以连续地在GSF-like和PF-like行为之间插值,恢复为两种特殊情形。基于这一观点,我们开发了一种自适应的AGSF,能够根据局部非线性的性质自动调整其行为,当高斯近似可靠时更像GSF,当不可靠时更像PF。在目标跟踪应用中,我们展示了AGSF在效率和鲁棒性方面优于GFS和PF的常见故障模式。我们还通过一个玩具例子验证了自适应机制的切换行为。

英文摘要

State-space models (SSMs) are a broad class of probabilistic models for dynamical systems with many applications in engineering and science. Bayesian filtering is analytically tractable only in the linear-Gaussian setting, where the Kalman filter yields exact posterior distributions. For nonlinear or non-Gaussian SSMs, approximations are required. Two prominent families of approximate methods are Gaussian sum filters (GSFs), which rely on local Gaussian approximations and numerical integration schemes, and particle filters (PFs), which use sequential Monte Carlo sampling. Despite their success, GSFs can suffer from numerical instabilities and severe failures in strongly nonlinear regimes, while PFs are flexible and robust but often demand substantial computational resources to achieve accurate estimates. In this work, we propose the Augmented Gaussian Sum Filter (AGSF), a novel filtering framework that unifies GSFs and PFs through an augmented Gaussian approximation parameterized by latent variables and tunable covariance parameters. By adjusting these covariances, the AGSF interpolates continuously between GSF-like and PF-like behavior, recovering both as special cases. Building on this view, we develop an adaptive AGSF that automatically shifts its behavior according to the local nature of the nonlinearities, acting more like a GSF when Gaussian approximations are reliable and more like a PF when they are not. In a target-tracking application, we demonstrate that AGSF is efficient and robust to common failure modes of both GSFs and PFs. We empirically validate the switching behavior of the adaptive mechanism in a toy example.

2605.21692 2026-05-22 cs.LG stat.ML

Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective

表示差距:从几何视角解释神经网络的不合理有效性

David Perera, Victor Moura, Lais Isabelle Alves dos Santos, Michel F. C. Haddad, Flavio Figueiredo

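AI summary: Taking a geometric perspective, this paper introduces the Representation Gap, a metric closely related to the generalization error of neural networks, shows that its asymptotic behavior extends to a broader range of tasks and training algorithms, and confirms the theory empirically on synthetic and realistic datasets.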

详情
AI中文摘要

精确地用可以高效估计的参数来表征神经网络的渐近泛化误差是机器学习中的关键问题,这严重依赖于启发法和实践者的直觉来做出关键设计选择。为了缓解这一问题,我们引入了表示差距,这是一个与泛化误差密切相关的度量标准,但具有更好的渐近动态特性。我们专注于等变扩散模型,并利用最优量化和点过程理论的结果,推导出表示差距的精确渐近等价,并证明其由单个参数,即任务的内在维度所支配,该参数易于解释、高效估计,并可与常见神经网络架构的等变性相关联。我们展示了这种渐近动态也适用于更广泛的任务和训练算法。最后,我们通过实验证明,我们的渐近定律和内在维度估计在广泛的合成数据集上准确,这些数据集中的这些量是已知的,以及在更现实的数据集上,我们得到的结果与相关文献一致。

英文摘要

Characterizing precisely the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners' intuition to make key design choices. In order to mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the intrinsic dimension of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.

2605.21651 2026-05-22 stat.ME stat.CO

Similarity-Driven Proposals for MCMC Algorithms on Discrete Spaces

基于相似性的MCMC算法在离散空间中的提案

Luca Aiello, Raffaele Argiento, Alexandros Beskos, Maria De Iorio

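AI summary: This paper proposes a similarity-driven MCMC methodology for sampling posterior distributions on discrete spaces. A data-driven measure of discrepancy between the observations and the proposed model steers transitions, and the approach covers hierarchical models with both discrete variables and additional latent variables.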

详情
AI中文摘要

近期的研究导致了针对离散状态空间后验分布的MCMC算法的发展,这些算法使用似然信息驱动的提案。我们的工作处于这一领域,并提出了一种基于相似性驱动提案的新MCMC方法。此类提案通过使用数据驱动的观测与提议模型之间的不一致度度量,将转移引导至受后验青睐的状态。我们的方法可以自然地涵盖包含离散变量和额外潜在变量的分层模型类别,而无需对后者进行积分,这与该领域先前的工作不同。新的算法在模拟设置和一个涉及Dirichlet-Multinomial回归模型的复杂真实数据场景中得到了示例。

英文摘要

Recent research has led to the development of MCMC algorithms with likelihood-informed proposals when targeting posterior distributions supported on discrete state spaces. Our work is placed within this field and puts forward a new MCMC methodology based upon similarity-driven proposals. Such proposals sway transitions towards states favored by the posterior via use of a data-driven measure of discrepancy between observations and the proposed model. Our approach can naturally cover classes of hierarchical models that involve both discrete variables and additional latent ones, without a requirement of integrating out the latter, in contrast to previous works in this field. The new algorithms are illustrated in simulation settings and in an involved real-data scenario with a Dirichlet-Multinomial regression model.

2605.21627 2026-05-22 stat.ME stat.ML

Distribution-free root cause analysis

无需分布的根因分析

Rohan Hore, Aaditya Ramdas

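AI summary: This paper studies distribution-free root cause analysis in multi-stream data and proposes CROC, a framework that constructs finite-sample valid confidence sets for the root-cause index under minimal assumptions; it also shows that any distribution-free root cause localization method can be represented within CROC.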

详情
Comments
34 pages, 4 figures
AI中文摘要

我们研究了多流数据中的无需分布根因分析,其中一个演变的底层系统通过多个数据流被观察到,这些流可能在未知的时间点经历分布变化。在这种情况下,最早发生变化的流提供了一个自然的起点来调查底层原因,我们称之为根因指数。利用符合性p值,我们提出了一种新的框架,即符合性根因分析(CROC),该框架在最小假设下为根因指数构建有限样本有效的置信集:数据流是独立的,并且在每个流中,变化前和变化后的观测是从任意且未知的分布中交换抽样的。我们进一步建立了普遍性性质,证明了任何无需分布的根因定位方法都可以在CROC框架内表示。此外,在温和的正则条件下和有原则的评分设计下,我们的方法会产生渐近尖锐的置信集,能够高效地隔离根因。我们还扩展了CROC以高效处理存在的跨流依赖性。广泛的模拟展示了准确的根流定位,支持我们的理论保证。

英文摘要

We study distribution-free root cause analysis in multi-stream data, where an evolving underlying system is observed through multiple data streams that may each undergo distributional changes at unknown timepoints. In such settings, the stream exhibiting the earliest change provides a natural starting point for investigating the underlying cause, which we refer to as the root-cause index. Leveraging conformal $p$-values, we propose a novel framework, Conformal Root Cause Analysis (CROC), which constructs finite-sample valid confidence sets for the root-cause index under minimal assumptions: the data streams are independent, and within each stream the pre- and post-change observations are sampled exchangeably from arbitrary and unknown distributions. We further establish a universality property, showing that any distribution-free method for root cause localization can be represented within the CROC framework. In addition, under mild regularity conditions and principled score design, our method yields asymptotically sharp confidence sets that efficiently isolate the root cause. We further extend CROC to efficiently handle cross-stream dependence when present. Extensive simulations demonstrate accurate localization of the root stream, supporting our theoretical guarantees.
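
The conformal $p$-values that CROC builds on can be computed in a few lines. The sketch below shows the standard rank-based construction, assuming that larger scores indicate more unusual observations; it is not the CROC procedure itself, and the function name is a hypothetical one chosen for illustration.

```python
import numpy as np

def conformal_p_value(calibration_scores, test_score):
    """P-value (1 + #{calibration scores >= test score}) / (n + 1)."""
    cal = np.asarray(calibration_scores, dtype=float)
    return (1.0 + np.sum(cal >= test_score)) / (len(cal) + 1.0)
```

When the test score is exchangeable with the calibration scores, this $p$-value is valid in finite samples without any distributional assumption, which is the property a distribution-free method relies on.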

2605.21552 2026-05-22 cs.LG stat.ML

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

期望一致性损失:在协变量偏移下重新思考置信度校准

Jinzong Dong, Zhaohui Jiang, Bo Yang

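AI summary: This paper addresses confidence calibration under covariate shift and proposes the Expectation consistency loss (ECL), an unsupervised domain adaptation loss for calibrating confidence on the target domain, supported by theory and validated on simulated and real-world datasets.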

详情
Comments
Accepted by ICML 2026
AI中文摘要

置信度校准对于分类模型在安全关键决策场景中的应用至关重要,并已受到广泛关注。通用的置信度校准方法假设训练和测试数据是独立同分布的,这在存在协变量偏移时限制了其有效性。在协变量偏移下的先前校准方法在类内或标准校准方面存在困难,且通常依赖于当密度比较大或无界时不稳定的重要性加权。鉴于上述限制,本文重新思考了协变量偏移下的置信度校准。首先,我们推导出协变量偏移下的置信度校准的必要且充分条件,称为期望一致性条件,该条件揭示协变量偏移并不必然导致未校准的置信度,并提供了比全局协变量分布对齐更弱的置信度校准条件。然后,利用期望一致性条件,本文提出了一种无监督域适应损失来校准目标域的置信度,称为期望一致性损失(ECL),该方法兼容标准校准、类内校准和顶部标签校准。第三,我们证明计算ECL损失的样本复杂度与预期校准误差(ECE)相同,并提供了一种理论支持的mini-batch可训练方案。最后,我们在模拟和现实世界协变量偏移数据集上验证了本文方法的有效性。

英文摘要

Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and identically distributed, limiting their effectiveness under covariate shifts. Previous calibration methods under covariate shift struggle with class-wise or canonical calibration and often rely on importance weighting, which is unstable when density ratios are large or unbounded. Given the above limitations, this paper rethinks confidence calibration under covariate shifts. First, we derive a necessary and sufficient condition for confidence calibration under covariate shifts, named the Expectation consistency condition, which reveals that covariate shifts do not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Then, utilizing the Expectation consistency condition, this paper proposes an unsupervised domain adaptation loss to calibrate confidence on the target domain, named the Expectation consistency loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing the ECL has the same sample complexity as the Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for the ECL. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.
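
For reference, the Expected Calibration Error that the abstract compares against is commonly estimated by binning predictions by confidence. The sketch below uses equal-width bins, which is one common convention and an assumption here; it does not implement the ECL, whose form the abstract does not specify.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned top-label ECE: weighted mean of |accuracy - confidence| per bin."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    # Equal-width bins [k/n_bins, (k+1)/n_bins); a confidence of 1.0 goes in the last bin.
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            gap = abs(hit[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece
```

A model that is correct exactly as often as its confidence suggests within every bin has an ECE of zero.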

2605.21548 2026-05-22 stat.ML cs.AI cs.LG

Local Covariate Selection for Average Causal Effect Estimation without Pretreatment and Causal Sufficiency Assumptions

局部协变量选择用于无预处理和因果充分性假设下的平均因果效应估计

Zeyu Liu, Zheng Li, Feng Xie, Yan Zeng, Hao Zhang, Kun Zhang

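AI summary: This paper proposes a local learning method for covariate selection in nonparametric causal effect estimation that avoids the pretreatment and causal sufficiency assumptions while improving computational efficiency and maintaining accurate estimates.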

详情
AI中文摘要

我们研究了选择协变量以无偏估计总因果效应的问题。现有方法通常依赖于对所有变量的全局因果结构学习,或依赖于强假设,如因果充分性假设——观测变量不共享潜在混杂因素,或预处理假设,限制协变量只能是不受处理或结果影响的变量。这些要求在实践中往往不现实,且在高维设置中全局学习变得计算上不可行。为了解决这些挑战,我们提出了一种新颖的局部学习方法,用于非参数因果效应估计中的协变量选择,避免了预处理和因果充分性假设。我们首先刻画了一个局部边界,该边界包含至少一个有效的调整集,当且仅当存在调整集来识别因果效应时。然后我们开发了局部识别程序,以在该边界内高效地搜索。我们证明了所提出的方法是正确且完整的。在多个合成数据集和两个真实世界数据集上的实验表明,我们的方法在准确估计因果效应的同时,显著提高了计算效率。

英文摘要

We study the problem of selecting covariates for unbiased estimation of the total causal effect. Existing approaches typically rely on global causal structure learning over all variables, or on strong assumptions such as causal sufficiency (where observed variables share no latent confounders) or the pretreatment assumption, which limits covariates to those unaffected by the treatment or outcome. These requirements are often unrealistic in practice, and global learning becomes computationally prohibitive in high-dimensional settings. To address these challenges, we propose a novel local learning method for covariate selection in nonparametric causal effect estimation that avoids both the pretreatment and causal sufficiency assumptions. We first characterize a local boundary that contains at least one valid adjustment set whenever one exists for identifying the causal effect, and then develop local identification procedures to efficiently search within this boundary. We prove that the proposed method is sound and complete. Experiments on multiple synthetic datasets and two real-world datasets show that our approach achieves accurate causal effect estimation while substantially improving computational efficiency.

2605.21541 2026-05-22 cs.CR cs.AI cs.LG stat.ML

Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs

频域正则化对抗对齐用于针对闭源大语言模型的可转移攻击

Leitao Yuan, Qinghua Mao, Daizong Liu, Kun Wang, Wenjie Wang, Yan Teng, Jing Shao, Dongrui Liu

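AI summary: This paper proposes FRA-Attack, which improves adversarial transferability across models through frequency-domain regularization, combining a high-pass DCT feature-alignment objective with frequency-domain gradient regularization.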

详情
AI中文摘要

多模态大语言模型(MLLMs)仍易受基于转移的针对性攻击影响,其中在开源代理编码器上优化的扰动可以泛化到闭源MLLMs。提高对抗转移性的一个关键挑战是有效捕捉不同模型间共享的内在视觉聚焦特性,使得扰动与可转移的语义线索对齐,而非代理特定行为。然而,现有方法受到空间域特征冗余和代理特定梯度信号的阻碍,影响跨模型转移性。在本文中,我们提出FRA-Attack,从统一的频域正则化视角解决这两个挑战。在特征对齐方面,对patch特征的高通DCT目标抑制冗余的全局结构,并将损失集中在承载MLLMs内在视觉聚焦的高频带。在梯度优化方面,我们引入频率域梯度正则化(FGR),一种无模型依赖的低通正则化器,仅使用几何频率坐标调节代理梯度,即不涉及代理衍生的统计量,因此FGR通过构造无模型依赖性,消除代理特定的高频伪影,同时保留可转移的低频方向。两者共同形成统一的频域转移性处理。在15个旗舰MLLMs上进行的广泛实验显示,FRA-Attack在跨模型转移性方面表现优异,特别是在GPT-5.4、Claude-Opus-4.6和Gemini-3-flash等最先进的模型上实现了最先进的性能。

英文摘要

Multimodal large language models (MLLMs) remain vulnerable to transfer-based targeted attacks, where perturbations optimized on open-source surrogate encoders can generalize to closed-source MLLMs. A key challenge for improving adversarial transferability is to effectively capture the intrinsic visual focus shared across different models, such that perturbations align with transferable semantic cues rather than surrogate-specific behaviors. However, existing methods suffer from spatial-domain feature redundancy and surrogate-specific gradient signals, thereby hindering cross-model transferability. In this paper, we propose FRA-Attack, which addresses both challenges from a unified frequency-domain regularization perspective. For feature alignment, a high-pass DCT objective on patch features suppresses redundant global structures and concentrates the loss on the high-frequency band that carries the MLLMs' intrinsic visual focus. For gradient optimization, we introduce Frequency-domain Gradient Regularization (FGR), a model-agnostic low-pass regularizer that modulates the surrogate gradient using only the geometric frequency coordinate, i.e., no surrogate-derived statistic is involved, so that FGR is model-agnostic by construction, removing surrogate-specific high-frequency artifacts while preserving transferable low-frequency directions. Together, the two components form a unified frequency-domain treatment of transferability. Extensive experiments on 15 flagship MLLMs across 7 vendors show that FRA-Attack achieves superior cross-model transferability, particularly with state-of-the-art performance on GPT-5.4, Claude-Opus-4.6 and Gemini-3-flash.

2605.21536 2026-05-22 stat.AP

High-Volume Plaintiff-Side Counsel and Single-Appearance Eviction Cases in Philadelphia

费城高卷宗原告律师与单次出庭的驱逐案件

Marios Papamichalis, Regina Ruane

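AI summary: Among 755,004 Philadelphia landlord-tenant records filed during 1969-2022, the study examines 396,163 residential cases in which the tenant appears exactly once. Single-appearance cases handled by high-volume plaintiff-side counsel are more likely to reach the writ-of-possession and served-writ stages but no more likely to end in default. Comparisons within the same plaintiff or the same property show no broad premium on adverse outcomes such as default. The organizational pattern is clearer: after a plaintiff adopts or switches to high-volume counsel, monthly filings and the number of distinct buildings reached both rise; near the prior-year top-10 attorney threshold, cases show local differences in default and enforcement; and continuances under specialist counsel are more closely linked to default. Non-flat pre-treatment trends and imprecise reverse-direction estimates from attorney exits limit the strength of causal claims. High-volume plaintiff-side counsel therefore functions as a mechanism of filing scale and procedural sequence, not as a uniform escalator of case outcomes or a cause of any individual tenant becoming single-appearance.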

详情
Comments
Preprint
AI中文摘要

在1969-2022年间提交的755,004份费城房东-租客记录中,396,163起住宅案件涉及租客在观察到的法院文件中仅一次出庭。在未调整的比较中,由高卷宗原告律师处理的单次出庭案件更可能进入强制驱逐和送达强制驱逐阶段,但并未更可能导致败诉。在同一原告或同一原告在同一物业内的比较未显示败诉、判决或费用等不利结果的显著优势。更清晰的模式是组织层面的:原告采用或转用高卷宗律师后,每月文件数量增加约2-5%,不同建筑数量也增加相似幅度;接近前一年前十律师排名时,案件在败诉和执行方面出现局部差异;专科律师的延期与败诉更紧密相关。非固定预处理趋势和律师退出的不精确反向估计限制了任何因果主张的强度。高卷宗原告律师因此是文件规模和程序顺序的机制,而非案件结果的统一提升或导致个别租客单次出庭的原因。

英文摘要

Among 755,004 Philadelphia landlord-tenant records filed during 1969-2022, 396,163 residential cases involve tenants who appear exactly once in the observed docket. In unadjusted comparisons, single-appearance cases handled by high-volume plaintiff-side counsel are more likely to advance to the writ-of-possession and served-writ stages, but no more likely to end in default. Comparisons within the same plaintiff, and within the same plaintiff at the same property, show no broad premium on adverse case outcomes such as default, judgment, or fees. The clearer pattern is organizational: after a plaintiff adopts or switches into high-volume counsel, monthly filings rise by about 2-5% and the number of distinct buildings reached rises by a similar margin; near the prior-year top-10 attorney threshold, cases display local differences in default and enforcement; and continuances under specialist counsel are more closely linked to default. Non-flat pre-treatment trends and imprecise reverse-direction estimates from attorney exits restrict the strength of any causal claim. High-volume plaintiff-side counsel therefore functions as a mechanism of filing scale and procedural sequence, not as a uniform escalator of case outcomes or as a cause of any individual tenant becoming single-appearance.

2605.21535 2026-05-22 stat.ME

An Old Look at Empirical Bayes

对经验贝叶斯的一次回顾

Nicholas G. Polson, Vadim O. Sokolov, Daniel Zantedeschi

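AI summary: This paper revisits empirical Bayes, noting that it uses the data twice, conflates levels of a hierarchy, and quantifies uncertainty differently from a fully hierarchical model. It discusses recent proposals based on probabilistic symmetries, implicit likelihoods, and calibration studies, and argues that the computational tools of modern empirical Bayes should be redeployed in service of properly hierarchical Bayes.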

详情
Comments
23 pages
AI中文摘要

Dennis Lindley曾说,比频率主义者更糟糕的是经验贝叶斯。这个俏皮话看似夸张,但其技术内容严肃:经验贝叶斯使用数据两次,混同层次结构,并产生形状似后验的总结,其不确定性量化不同于完全层次模型。David Blei的2026 IMS Medallion讲座

英文摘要

Dennis Lindley once said that there is only one thing worse than a frequentist, and that is an empirical Bayesian. The quip has the air of caricature, but its technical content is serious: empirical Bayes uses the same data twice, conflates levels of a hierarchy, and produces posterior-shaped summaries whose uncertainty quantification differs from what a fully hierarchical model delivers. David Blei's 2026 IMS Medallion Lecture, "A Fresh Look at Empirical Bayes," revives the program under three new banners: empirical Bayes via probabilistic symmetries (rebranded "Bayesian empirical Bayes"), empirical Bayes with implicit likelihoods through simulation-based inference, and empirical Bayes for combining experimental and observational data through calibration studies. This is a continuation of Blei and Kucukelbir's earlier "population empirical Bayes" (PopEB, 2015). We argue, in the spirit of Lindley, I. J. Good, William DuMouchel, Thomas Louis, and our own recent work with Datta, that Blei's machinery targets inferential objects distinct from the posterior conditional on the realized data, and that the cost of maintaining the full hierarchical discipline has fallen low enough that the computational trade-off no longer favors the shortcut. The case study is the Tweedie formula. Efron's f-modeling empirical Bayes plugs an estimated score function into a posterior-mean identity, but a smoothed score need not arise from any prior. The horseshoe Tweedie formula does. We conclude by recommending that the impressive computational machinery of modern empirical Bayes (variational inference, neural amortization, simulation-based inference) be redeployed in service of properly hierarchical Bayes.
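
The Tweedie formula at the center of the case study expresses the posterior mean through the score of the marginal density: for $z \mid θ \sim N(θ, σ^2)$, $\mathbb{E}[θ \mid z] = z + σ^2 \frac{d}{dz} \log f(z)$. The sketch below checks the identity in the one case where everything is available in closed form, a Gaussian prior; the helper names are hypothetical, and neither Efron's f-modeling nor the horseshoe variant is implemented.

```python
def tweedie_posterior_mean(z, sigma2, score):
    """Tweedie's formula: E[theta | z] = z + sigma2 * d/dz log f(z)."""
    return z + sigma2 * score(z)

def gaussian_marginal_score(tau2, sigma2):
    """Score of the marginal N(0, tau2 + sigma2) induced by the prior N(0, tau2)."""
    return lambda z: -z / (tau2 + sigma2)
```

With a prior variance of 3 and a noise variance of 1, plugging the marginal score into the formula at `z = 2.0` gives 1.5, which matches the conjugate posterior mean `z * tau2 / (tau2 + sigma2)`. The abstract's point is that an arbitrary smoothed score plugged into the same identity need not correspond to any prior.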

2605.21534 2026-05-22 stat.ML cs.LG

Adaptive RBF-KAN: A Comparative Evaluation of Dynamic Shape Parameters in Kolmogorov-Arnold Networks

自适应RBF-KAN:动态形状参数在Kolmogorov-Arnold网络中的比较评估

Roberto Cavoretto, Alessandra De Rossi, Adeeba Haider, Amir Noorizadegan

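AI summary: This paper studies the choice of dynamic shape parameters in Kolmogorov-Arnold Networks, extending RBF-KAN with a broader family of radial basis kernels and LOOCV-based kernel scale estimation to better adapt to different function types.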

详情
AI中文摘要

Kolmogorov-Arnold网络(KANs)通过可学习的单变量边缘函数近似多变量函数,通常参数化为B样条基。尽管有效,基于样条的实现可能计算成本较高。一种改进的KAN变体称为FastKAN,通过将样条替换为高斯径向基函数(RBF)来提高效率,但其依赖于固定的核和形状参数。在本工作中,我们扩展了基于RBF的KAN框架,引入了更广泛的径向基核,并通过留一验证(LOOCV)初始化核形状参数。到目前为止,这是首次将基于LOOCV的核尺度估计与深度KAN训练相结合的研究。我们还首次将Matérn和Wendland核引入KAN框架,使KAN能够超越FastKAN中使用的高斯核,提供更灵活的基函数表示。LOOCV估计提供了数据驱动的核尺度初始化,随后在网络训练中进一步优化。所提出的自适应RBF-KAN在多个二维基准函数上进行了评估。结果突显了核选择和自适应形状参数的重要性,不同核在光滑函数、不连续性和振荡模式中表现出优势。总体而言,结合基于LOOCV的初始化与自适应核学习为改进RBF-KAN模型提供了一种实用策略。

英文摘要

Kolmogorov-Arnold Networks (KANs) approximate multivariate functions using learnable univariate edge functions, typically parameterized by B-spline bases. Although effective, spline-based implementations can be computationally expensive. A modified version of KANs, called FastKAN, improves efficiency by replacing splines with Gaussian radial basis functions (RBFs), but it relies on a fixed kernel and shape parameter. In this work, we extend the RBF-based KAN framework by introducing a broader family of radial basis kernels and by initializing the kernel shape parameter using leave-one-out cross-validation (LOOCV). To the best of our knowledge, this is the first study that integrates LOOCV-based kernel scale estimation with deep KAN training. We also introduce Matérn and Wendland kernels into the KAN framework for the first time, enabling more flexible basis representations beyond the Gaussian kernel used in FastKAN. The LOOCV estimate provides a data-driven initialization of the kernel scale, which is subsequently refined during network training. The proposed adaptive RBF-KAN is evaluated on several two-dimensional benchmark functions. The results highlight the importance of kernel selection and adaptive shape parameters, with different kernels showing advantages for smooth functions, discontinuities, and oscillatory patterns. Overall, combining LOOCV-based initialization with adaptive kernel learning provides a practical strategy for improving RBF-based KAN models.
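
One way to picture a LOOCV-based choice of the shape parameter is the closed-form leave-one-out identity for RBF interpolation (often attributed to Rippa), which avoids refitting the interpolant once per held-out point. The sketch below applies it to one-dimensional Gaussian RBF interpolation. It is an assumption that this resembles the paper's initialization; the abstract does not state which LOOCV criterion is used, and all function names here are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(x, eps):
    """Gaussian RBF interpolation matrix A_ij = exp(-(eps * |x_i - x_j|)^2)."""
    r = np.abs(x[:, None] - x[None, :])
    return np.exp(-(eps * r) ** 2)

def loocv_errors(x, y, eps):
    """Leave-one-out errors via the identity e_k = c_k / (A^{-1})_kk."""
    a_inv = np.linalg.inv(gaussian_kernel_matrix(x, eps))
    coeffs = a_inv @ y            # interpolation coefficients solving A c = y
    return coeffs / np.diag(a_inv)

def select_shape_parameter(x, y, candidates):
    """Pick the candidate eps with the smallest LOOCV error norm."""
    costs = [np.linalg.norm(loocv_errors(x, y, eps)) for eps in candidates]
    return candidates[int(np.argmin(costs))]
```

In a pipeline like the one the abstract describes, a value selected this way would serve only as the starting point, with the kernel scale refined further during network training.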

2605.21530 2026-05-22 stat.ME nlin.CD physics.data-an

Pairwise Distance-Diffusion Analysis (PDDA): A Geometric Framework for Estimating Hurst Exponents in Multivariate Long-Memory Processes

成对距离扩散分析(PDDA):一种用于多变量长期记忆过程估计Hurst指数的几何框架

Diogo C. Soriano, Frederique Vanheusden, Slawomir J. Nasuto

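AI summary: This paper proposes PDDA, a geometric framework for estimating the Hurst exponent from distance plots of long-memory stochastic processes. Two complementary routes, R/S-PDDA and MSD-PDDA, are extended to multivariate isotropic and anisotropic processes, and an explicit link is derived between temporal persistence, range dimension, and recurrence statistics.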

详情
Comments
Supplemental PDF available via ancillary links
AI中文摘要

我们介绍了一种成对距离扩散分析(PDDA),这是一种用于从长期记忆随机过程的距离图中估计Hurst指数的几何框架。单一构造产生两种互补路线:R/S-PDDA,即经典缩放范围定义的几何重新表述,以及MSD-PDDA,基于均方位移缩放,经典上用于异常扩散。我们扩展PDDA到多变量各向同性和各向异性过程,并推导出时间持久性、范围维度和复发统计之间的显式联系,为Hurst分析提供了统一的距离基础。

英文摘要

We introduce Pairwise Distance-Diffusion Analysis (PDDA), a geometric framework for estimating the Hurst exponent from distance plots of long-memory stochastic processes. A single construction yields two complementary routes: R/S-PDDA, a geometric reformulation of the classical rescaled-range definition, and MSD-PDDA, based on mean-squared-displacement scaling, classically used in anomalous diffusion. We extend PDDA to multivariate isotropic and anisotropic processes and derive an explicit link between temporal persistence, range dimension, and recurrence statistics, providing a unified distance-based foundation for Hurst analysis.
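
The mean-squared-displacement scaling behind the MSD route can be illustrated with the classical estimator: if $\text{MSD}(τ) \propto τ^{2H}$, then $H$ is half the slope of a log-log fit. The sketch below is this textbook univariate estimator, not the PDDA construction from distance plots, and the function name and lag choices are assumptions for illustration.

```python
import numpy as np

def hurst_from_msd(path, lags):
    """Estimate H from the scaling MSD(lag) ~ lag**(2H) by a log-log linear fit."""
    x = np.asarray(path, dtype=float)
    # Mean squared displacement at each lag, averaged over all start times.
    msd = [np.mean((x[lag:] - x[:-lag]) ** 2) for lag in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope / 2.0
```

A purely ballistic path such as `np.arange(100.0)` has MSD equal to the squared lag and therefore yields H = 1, while an ordinary random walk should give a value near 0.5.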

2605.21522 2026-05-22 q-bio.QM cs.AI cs.CE cs.LG stat.ML

Protein Thoughts: Interpretable Reasoning with Tree of Thoughts and Embedding-Space Flow Matching for Protein-Protein Interaction Discovery

蛋白质思想:基于思维树(Tree of Thoughts)和嵌入空间流匹配的可解释推理,用于蛋白质-蛋白质相互作用发现

Kingsley Yeon, Xuefeng Liu, Promit Ghosal

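AI summary: This paper presents an interpretable framework for protein-protein interaction (PPI) discovery that recasts it as a search problem with explicit reasoning, using Tree-of-Thoughts search and embedding-space flow matching to improve both prediction accuracy and interpretability.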

详情
AI中文摘要

蛋白质-蛋白质相互作用(PPIs)调控几乎所有细胞过程,但计算方法通常产生排名预测而缺乏机理解释。这限制了其应用,因为生物学家无法判断预测是否反映真实的生化见解或偶然相关性。我们提出了Protein Thoughts框架,将PPI发现重新表述为可解释的搜索问题。该系统将结合证据分解为四个生物意义的信号:序列相似性反映进化关系,结构互补性捕捉几何契合,界面平衡,以及化学兼容性编码残基级相互作用。而不是将这些信号合并为一个模糊的分数,我们通过透明的价值函数保留每个信号的贡献,从而实现排序和审计。为了高效地导航大规模候选空间,我们引入了假设引导的熵正则化树 of 思维搜索。微调的语言模型从嵌入衍生的特征生成搜索指令,将候选者分类为高优先级、探索性或可跳过。这些指令条件化一个玻尔兹曼策略,平衡利用与熵驱动的探索,同时假设意识修剪防止提前放弃有前途的候选者。对于表现出评分分歧的候选者,假设条件的嵌入空间流匹配将蛋白质嵌入推向结合者流形。在SHS148k基准测试中,Protein Thoughts实现了平均最佳结合体排名为11.2,比熵树搜索基线的47.7提高了76%,在结合预测中,训练的价值函数实现了91.08±0.19 Micro-F1,优于现有PPI方法在同一数据集上的表现。

英文摘要

Protein-protein interactions (PPIs) govern nearly all cellular processes, yet computational methods for identifying binding partners typically produce ranked predictions without mechanistic justification. This creates a fundamental barrier to adoption because biologists cannot assess whether predictions reflect genuine biochemical insight or spurious correlations. We present Protein Thoughts, a framework that reformulates PPI discovery as an interpretable search problem with explicit reasoning. The system decomposes binding evidence into four biologically meaningful signals: sequence similarity reflecting evolutionary relationships, structural complementarity capturing geometric fit, interface balance, and chemical compatibility encoding residue-level interactions. Rather than collapsing these signals into an opaque score, we preserve their individual contributions through a transparent value function that enables both ranking and auditing. To navigate large candidate spaces efficiently, we introduce hypothesis-guided entropy-regularized Tree-of-Thoughts search. A fine-tuned language model generates search directives from embedding-derived features, classifying candidates as high-priority, exploratory, or skippable. These directives condition a Boltzmann policy that balances exploitation with entropy-driven exploration, while hypothesis-aware pruning prevents premature abandonment of promising candidates. For candidates exhibiting score disagreement, hypothesis-conditioned embedding-space flow matching transports protein embeddings toward the binder manifold. On the SHS148k benchmark, Protein Thoughts achieves mean best-binder rank of 11.2 versus 47.7 for an entropic tree search baseline, a 76% improvement, and for binding prediction the trained value function achieves $91.08 \pm 0.19$ Micro-F1, outperforming existing PPI methods on the same dataset.

2605.21492 2026-05-22 cs.LG cs.AI cs.LO stat.ML

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

特征归因不可能性:在共线性下,没有任何特征排名是忠实、稳定和完整的

Drake Caraker, Bryan Arnold, David Rhoads

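AI summary: This paper studies feature rankings under collinearity. It proves that no ranking can be faithful, stable, and complete at the same time, proposes DASH as a resolution, and supports both the theory and its practical implications with formal verification.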

详情
Comments
66 pages, 12 figures, 305 Lean 4 theorems. Code at https://github.com/DrakeCaraker/dash-impossibility-lean
AI中文摘要

在共线性情况下,没有任何特征排名可以同时忠实、稳定和完整。对于共线性对,排名本质上等同于抛硬币。我们证明了这一不可能性,针对四种模型类别进行了量化分析,通过集成平均(DASH)方法解决该问题,并利用305个Lean 4定理进行机验证。我们刻画了完整的归因设计空间:恰好存在两种方法家族——忠实-完整方法(不稳定,排名可能翻转多达50%的时间)和集成方法如DASH(稳定,对称特征报告平局)。归因比在梯度提升中发散为1/(1-rho^2),在Lasso中为无穷大,在随机森林中收敛。DASH(Diversified Aggregation of SHAP)在无偏聚合中被证明是帕累托最优的,达到Cramer-Rao方差下界并具有紧的集成大小公式。在77个公共数据集中,68%表现出归因不稳定性。在特征具有相等因果效应时,切换到条件SHAP无法逃脱这一不可能性。该框架包括实用的诊断工具——Z检验工作流程和单模型筛查工具——并直接影响公平性审计:基于SHAP的代理歧视审计在共线性下被证明不可靠。设计空间定理、诊断和不可能性均在Lean 4中形式化验证(305个定理从16个公理,0 sorry)——据我们所知,这是可解释AI领域首个形式化验证的不可能性。

英文摘要

No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine-verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist -- faithful-complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) -- and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1-rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto-optimal among unbiased aggregations, achieving the Cramer-Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics -- a Z-test workflow and single-model screening tool -- and has direct consequences for fairness auditing: SHAP-based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) -- to our knowledge, the first formally verified impossibility in explainable AI.
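To make the quoted divergence rate concrete, the following minimal sketch tabulates 1/(1-rho^2) for a few correlation levels. It only illustrates the scaling stated in the abstract for gradient boosting; the function name `attribution_ratio` is hypothetical, and the exact definition of the attribution ratio and any constants are in the paper, not reproduced here.

```python
def attribution_ratio(rho: float) -> float:
    """Return 1 / (1 - rho^2), the growth rate quoted for gradient boosting.

    Assumption: rho is the correlation between two collinear features and
    the ratio is taken at face value from the abstract, without constants.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho * rho)


# The ratio stays near 1 for weak correlation and blows up as |rho| -> 1.
for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho={rho:.2f}  ratio={attribution_ratio(rho):.2f}")
```

The values grow without bound as the two features become perfectly collinear, which is the sense in which the abstract calls the impossibility quantitative.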

2605.20828 2026-05-22 stat.ME

Adaptive Test for Jump

自适应跳跃检验

Huifang Ma, Long Feng

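AI summary: This paper proposes an adaptive jump test that combines the Aït-Sahalia-Jacod ratio statistic and the Lee-Mykland extreme-return statistic through the Cauchy combination rule. Allowing stochastic Itô drift, volatility, and leverage, it shows asymptotic independence under the continuous-path null and dense local alternatives, which yields an analytically calibrated test with closed-form power; the test is consistent under finite-activity jumps. The method is also extended to additive microstructure noise. Simulations show that the combined procedure performs well under both dense and sparse alternatives and is typically best overall.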

详情
AI中文摘要

我们开发了一种自适应跳跃检验,用于离散观测的高频半鞅,通过将Ait-Sahalia-Jacod比率统计量(Ait-Sahalia和Jacod,2009)和Lee-Mykland极值收益统计量(Lee和Mykland,2008)与Cauchy组合规则相结合。允许随机Itô漂移、波动率和杠杆效应,我们展示了在连续路径原假设和密集局部替代假设下渐近独立性,从而得到具有闭式功率的分析校准检验;在有限活动跳跃情况下,检验具有一致性。我们还扩展了该方法以处理加性微结构噪声。仿真显示,结合的程序在密集和稀疏替代假设下表现良好,并且通常整体表现最佳。

英文摘要

We develop an adaptive jump test for discretely observed high-frequency semimartingales by combining the Aït-Sahalia--Jacod ratio statistic (Aït-Sahalia and Jacod, 2009) and the Lee--Mykland extreme-return statistic (Lee and Mykland, 2008) with the Cauchy combination rule. Allowing stochastic Itô drift, volatility, and leverage, we show asymptotic independence under the continuous-path null and dense local alternatives, yielding an analytically calibrated test with closed-form power; under finite-activity jumps, the test is consistent. We also extend the method to additive microstructure noise. Simulations show that the combined procedure performs well under both dense and sparse alternatives and is typically best overall.
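The Cauchy combination rule mentioned above can be sketched in a few lines. This is a generic illustration of the rule, not the authors' implementation: the two input p-values are assumed to come from the ratio statistic and the extreme-return statistic (computing them is not shown), and the equal default weights are an assumption.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule.

    Each p-value is mapped to a standard Cauchy variate tan((0.5 - p) * pi),
    the variates are averaged with non-negative weights summing to one, and
    the average is mapped back through the Cauchy survival function.
    """
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k  # assumption: equal weights
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    return 0.5 - math.atan(t) / math.pi


# Hypothetical p-values from the two component tests.
p_ratio, p_extreme = 0.01, 0.80
print(cauchy_combination([p_ratio, p_extreme]))
```

A small p-value from either component dominates the average, which is why a combination of this kind can retain power when only one of the two statistics is sensitive to the alternative at hand.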

2605.20514 2026-05-22 cs.LG cs.NA math.NA stat.ML

Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

从稀疏数据快速重建精确的Maxwell动力学

Dan DeGenaro, Xin Li, Obed Amo, Michael Pokojovy, Sarah Adel Bargal, Markus Lange-Hegermann, Bogdan Raiţă

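AI summary: This paper proposes FLASH-MAX, a neural network architecture that predicts homogeneous electromagnetic fields from sparse pointwise observations. The architecture satisfies Maxwell's equations symbolically by construction, trains quickly from sparse data while keeping a zero PDE residual, and improves the trade-off between precision and optimization speed in scientific machine learning.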

详情
Comments
31 pages, 8 figures
AI中文摘要

我们介绍了FLASH-MAX,一种浅层的、构造即精确(exact-by-construction)的神经网络架构,用于从稀疏逐点观测预测齐次电磁场。每个隐藏神经元代表Maxwell方程的一个独立精确解,因此网络在构造上以符号方式满足控制方程,并能在数秒内从稀疏数据进行端到端训练。我们证明了一个通用逼近结果,表明这种精确模型类在任意域上仍保持通用性。FLASH-MAX在数秒内从约1K个稀疏逐点观测达到低于1%的相对验证误差,同时保持零PDE残差,并且即使仅有100个从三维空间采样的观测,误差仍保持在个位数。这些结果表明,将控制结构从损失函数转移到假设类中,可以显著改善科学机器学习中精度与优化速度之间的权衡。

英文摘要

We introduce FLASH-MAX, a shallow, exact-by-construction neural network architecture for predicting homogeneous electromagnetic fields from sparse pointwise observations. Each hidden neuron represents a separate exact solution to Maxwell's equations, so that the network satisfies the governing equations symbolically by construction and can be trained end-to-end from sparse data within seconds. We prove a universal approximation result showing that this exact model class remains universal on arbitrary domains. FLASH-MAX reaches sub-1% relative validation error from about 1K sparse pointwise observations in seconds, all while maintaining a zero PDE residual, and keeps single-digit errors even for only 100 observations sampled from 3D space. These results suggest that moving governing structure from the loss into the hypothesis class can dramatically improve the trade-off between precision and optimization speed in scientific machine learning.
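To illustrate what "exact by construction" can mean, the sketch below builds one vacuum plane wave, a textbook exact solution of the source-free Maxwell equations, and checks Faraday's law numerically. Plane waves are an assumption chosen for illustration; the abstract does not state which solution family FLASH-MAX neurons use, and the names `plane_wave` and `faraday_residual` are hypothetical.

```python
import math

C = 1.0  # wave speed in normalized units (assumption)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_wave(k, e0):
    """Return E(x, t) and B(x, t) for a plane wave with wave vector k."""
    if abs(dot(k, e0)) > 1e-12:
        raise ValueError("polarization e0 must be orthogonal to k")
    omega = C * math.sqrt(dot(k, k))          # dispersion relation
    b0 = tuple(v / omega for v in cross(k, e0))

    def E(x, t):
        phase = math.cos(dot(k, x) - omega * t)
        return tuple(v * phase for v in e0)

    def B(x, t):
        phase = math.cos(dot(k, x) - omega * t)
        return tuple(v * phase for v in b0)

    return E, B


def faraday_residual(E, B, x, t, h=1e-5):
    """Largest component of curl E + dB/dt, using central differences."""
    def d(f, axis, comp):
        xp, xm = list(x), list(x)
        xp[axis] += h
        xm[axis] -= h
        return (f(xp, t)[comp] - f(xm, t)[comp]) / (2 * h)

    curl = (d(E, 1, 2) - d(E, 2, 1),
            d(E, 2, 0) - d(E, 0, 2),
            d(E, 0, 1) - d(E, 1, 0))
    dB_dt = tuple((B(x, t + h)[i] - B(x, t - h)[i]) / (2 * h) for i in range(3))
    return max(abs(c + b) for c, b in zip(curl, dB_dt))


E, B = plane_wave(k=(1.0, 2.0, 0.0), e0=(0.0, 0.0, 1.0))
print(faraday_residual(E, B, x=(0.3, -0.2, 0.5), t=0.7))
```

Because Maxwell's equations are linear, any weighted sum of such waves is again a solution, so a model built this way would only need to fit amplitudes and wave vectors to data; the residual printed above is nonzero only through finite-difference error.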

2605.16108 2026-05-22 stat.ME q-bio.QM stat.AP

Estimating Association Between Paired Outcomes in Clustered Data with Informative Subgroup Size

在具有信息性子组大小的集群数据中估计配对结果之间的关联

Owen Visser, Somnath Datta

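AI summary: This paper proposes three weighted estimating approaches for the marginal association between paired outcomes in clustered data. The weights are derived from within-cluster resampling and extend inverse cluster-size and subgroup-size weighting; an existing ISS testing procedure is also modified with Stouffer's method to reduce computational burden.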

详情
AI中文摘要

当观测单位的数量或其在由结果定义的类别中的分布与所研究的结果相关时,信息性集群大小(ICS)和信息性子组大小(ISS)会扭曲边际关联估计。这一问题在配对结果中尤为突出,因为观测到的关联可能依赖于集群大小、配对类别组成以及单位进入分析的过程。我们提出三种加权估计方法,用于估计集群数据中配对结果之间的边际关联。权重由集群内重采样的论证导出,并将逆集群大小和子组大小加权扩展到配对结果类别。我们还利用Stouffer方法修改了现有的ISS检验过程,以减少计算负担。为了评估这些方法,我们开发了一个集群配对结果模拟器,它将单位级关联、潜在集群级关联和依赖于结果的保留分离开来。模拟显示,当关联来自单位级依赖且子组组成具有信息性时,基于配对的加权可以减少偏差,但会削弱由潜在集群级结构携带的关联。当关联主要来自集群级时,典型的逆集群加权仍更稳定。对NHANES口腔健康数据的应用显示,牙周与龋齿结果之间总体上存在较小的正关联,其中充填牙面结果比龋坏牙面结果表现出更强的ISS证据,并且对基于配对的加权更敏感。这些结果表明,ICS和ISS下的边际关联应结合关联来源、观测单位结构以及选择加权方案时所用的假设来解释。

英文摘要

Informative cluster size (ICS) and informative subgroup size (ISS) can distort marginal association estimates when the number of observed units, or their distribution across outcome-defined categories, is related to the outcomes under study. This issue is especially relevant for paired outcomes, where the observed association can depend on cluster size, paired-category composition, and the process by which units become available for analysis. We propose three weighted estimating approaches for marginal association between paired outcomes in clustered data. The weights are derived from within-cluster resampling arguments and extend inverse cluster-size and subgroup-size weighting to paired outcome categories. We also modify an existing ISS testing procedure by utilizing Stouffer's method to reduce computational burden. To evaluate the methods, we develop a simulator for clustered paired outcomes that separates unit-level association, latent cluster-level association, and outcome-dependent retention. Simulations show that pair-based weighting can reduce bias when association arises through unit-level dependence and subgroup composition is informative, but can attenuate association carried by latent cluster-level structure. Typical inverse-cluster weighting remains more stable when the association is primarily cluster-level. Application to NHANES oral-health data shows small positive periodontal and caries associations overall, with filled-surface outcomes showing stronger ISS evidence and greater sensitivity to pair-based weighting than decayed-surface outcomes. These results indicate that marginal association under ICS and ISS should be interpreted in relation to the source of association, observed-unit structure, and assumptions used to choose the weighting scheme.
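Stouffer's method, which the abstract uses to lighten the ISS test, combines several one-sided p-values through their normal scores. The sketch below shows the generic unweighted form only; which p-values the authors combine, and whether they weight them, is not stated in the abstract, so the inputs here are hypothetical.

```python
import math
from statistics import NormalDist


def stouffer(p_values):
    """Combine one-sided p-values with the unweighted Stouffer method.

    Each p-value is converted to a z-score, the z-scores are summed and
    divided by sqrt(k), and the result is converted back to a p-value.
    All p-values must lie strictly between 0 and 1.
    """
    nd = NormalDist()
    z_scores = [nd.inv_cdf(1.0 - p) for p in p_values]
    z_combined = sum(z_scores) / math.sqrt(len(z_scores))
    return 1.0 - nd.cdf(z_combined)


# Hypothetical per-subgroup p-values.
print(stouffer([0.05, 0.05]))
```

The combination is a single closed-form calculation, which is the kind of saving the abstract alludes to when it mentions reduced computational burden.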

2605.10493 2026-05-22 math.OC cs.SY eess.SY stat.ML

A PAC-Bayes Approach for Controlling Unknown Linear Discrete-time Systems

一种用于控制未知线性离散时间系统的PAC-Bayes方法

Yujia Luo, Ye Pu, Jonathan H. Manton, Jingge Zhu

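AI summary: This paper presents a PAC-Bayes framework for learning controllers for unknown stochastic linear discrete-time systems. It derives a data-dependent high-probability bound on controller performance and proposes efficient learning algorithms with theoretical guarantees for both finite and infinite controller spaces. Unlike prior work, the bound holds for unbounded quadratic cost; in the special case where LQG is optimal, numerical results suggest that the learned controllers perform comparably to LQG.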

详情
Comments
10 pages, 3 figures, IFAC 2026 conference
AI中文摘要

本文提出了一种PAC-Bayes框架,用于学习未知随机线性离散时间系统的控制器,其中系统参数从一个固定但未知的分布中抽取。我们推导了任何学习得到的(随机)控制器性能的数据依赖的高概率界限,并提出了具有理论保证的新高效学习算法,这些算法可以用于有限和无限的控制器空间。与先前工作相比,我们的界限适用于无界二次成本。在LQG最优的特殊情况中,我们的数值结果表明所学控制器的性能与LQG相当。

英文摘要

This paper presents a PAC-Bayes framework for learning controllers for unknown stochastic linear discrete-time systems, where the system parameters are drawn from a fixed but unknown distribution. We derive a data-dependent high probability bound on the performance of any learned (stochastic) controller, and propose novel efficient learning algorithms with theoretical guarantees, which can be implemented for both finite and infinite controller spaces. Compared to prior work, our bound holds for unbounded quadratic cost. In the special case where LQG is optimal, our numerical results suggest that the learned controllers achieve comparable performance to LQG.

2604.02700 2026-05-22 stat.AP

Wasserstein-Based Test for Empirical Measure Convergence of Dependent Sequences

基于Wasserstein距离的依赖序列经验测度收敛性检验

Alexander Yordanov, Peter Hristov

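AI summary: This paper proposes Wasserstein-based hypothesis tests for empirical-measure convergence in stationary dependent sequences. For a known candidate invariant measure μ, it studies the statistic T_n = √n·W_1(μ̂_n, μ), establishing asymptotic level-α validity under the null and consistency under fixed alternatives. When the invariant measure is unknown, it derives the asymptotic law of the pairwise statistic √n·W_1(μ̂_n^(i), μ̂_n^(j)) for independent trajectories and obtains a corresponding pairwise test with Bonferroni control. To keep estimation feasible when the long-run covariance has no closed form, it introduces a finite-grid plug-in estimator and shows that Gaussian critical values based on the estimated covariance consistently recover the corresponding oracle fixed-grid estimation. Simulations in linear and nonlinear dynamical settings illustrate the oracle and plug-in regimes, along with the resulting coverage probability and power.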

详情
AI中文摘要

我们开发了用于检验平稳依赖序列经验测度收敛性的基于Wasserstein距离的假设检验方法。对于已知的候选不变测度μ,我们研究了统计量T_n=√n·W_1(μ̂_n,μ),并在原假设下建立了渐近水平α的有效性,并在固定备择假设下证明了其一致性。当不变测度未知时,我们推导了独立轨迹的配对统计量√n·W_1(μ̂_n^(i),μ̂_n^(j))的渐近分布,并得到了相应的配对检验方法,包括Bonferroni校正。为了使估计在长期协方差无法用闭式表达时可行,我们引入了有限网格插值估计器,并证明基于估计协方差的高斯临界值能一致恢复相应的oracle固定网格估计。线性和非线性动力学设置中的模拟实验展示了oracle和插值情形,以及由此产生的覆盖概率和功效。

英文摘要

We develop Wasserstein-based hypothesis tests for empirical-measure convergence in stationary dependent sequences. For a known candidate invariant measure, $μ$, we study the statistic $T_n=\sqrt{n}\,W_1(\hatμ_n,μ)$ and establish asymptotic level-$α$ validity under the null, together with consistency under fixed alternatives. When the invariant measure is unknown, we derive the asymptotic law of the pairwise statistic $\sqrt{n}\,W_1(\hatμ_n^{(i)},\hatμ_n^{(j)})$ for independent trajectories and obtain a corresponding pairwise test, including Bonferroni control for multiple comparisons. To make this estimation feasible when the long-run covariance is unavailable in closed form, we introduce a finite-grid plug-in estimator and show that Gaussian critical values based on the estimated covariance consistently recover the corresponding oracle fixed-grid estimation. Simulation experiments in both linear and nonlinear dynamical settings illustrate the oracle and plug-in regimes, along with the resulting coverage probability and power.
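The pairwise statistic is easy to compute in a simple special case. The sketch below assumes one-dimensional observations and two trajectories of equal length, where the 1-Wasserstein distance between empirical measures reduces to the mean absolute difference of the sorted samples. The function names are hypothetical, and the critical values, which require the long-run covariance estimate described in the abstract, are not shown.

```python
import math


def w1_equal_size(xs, ys):
    """1-Wasserstein distance between two 1-D empirical measures.

    Assumption: both samples have the same length, so the optimal coupling
    matches order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def pairwise_statistic(xs, ys):
    """sqrt(n) * W_1 between the empirical measures of two trajectories."""
    return math.sqrt(len(xs)) * w1_equal_size(xs, ys)


# Two short hypothetical trajectories.
traj_i = [0.1, -0.4, 0.3, 0.9]
traj_j = [0.2, -0.1, 0.5, 0.7]
print(pairwise_statistic(traj_i, traj_j))
```

Under dependence, the sampling variability of this statistic is driven by the long-run covariance rather than the marginal variance, which is why the abstract introduces a plug-in estimator before calibrating the test.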

2603.21610 2026-05-22 cs.LG cs.AI stat.ML

Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

规则状态推断(RSI):一种用于规则治理领域合规监控的贝叶斯框架

Abdou-Raouf Atarmla

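AI summary: This paper proposes Rule-State Inference (RSI), a Bayesian framework that addresses three structural obstacles to compliance monitoring in rule-governed domains: the absence of labeled outcomes at deployment, strategically missing observations from non-compliant entities, and a regulatory environment that changes faster than any supervised model can be retrained. RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through variational inference with exact coordinate-ascent updates.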

详情
Comments
18 pages. Experimental validation forthcoming
AI中文摘要

在规则治理领域(如税收管理、临床协议遵守、环境监管)的合规监控面临三个结构性障碍,标准机器学习无法同时解决:部署时缺乏标记结果、非合规实体战略性缺失观察以及监管环境变化速度超过任何监督模型的重新训练速度。我们引入规则状态推断(RSI),一种贝叶斯框架,颠覆了传统的学习规则从数据的范式。RSI将权威、形式化的规则集作为结构化的贝叶斯先验,并通过均场变分推断和精确坐标上升更新推断人口的潜在合规状态。核心建模对象是一个联合潜变量,每个监管时期一个:全局合规文化因子η以及每个规则的激活、人口合规水平和参数漂移成分。RSI提供了三个正式保证:每个规则更新的监管适应性为O(n_k + K);对于可识别的连续成分的伯恩斯坦-冯·米塞斯一致性;以及每次迭代的单调ELBO收敛。我们将在托戈财政系统上实例化RSI,基于官方监管法律的基准2000家合成企业;完整的数值验证将随后进行。该框架设计用于直接扩展到顺序RSI,一种状态空间公式化中,一个监管时期的后验成为下一个的先验,从而产生精确的卡尔曼滤波器用于合规轨迹跟踪和实体级贝叶斯评分。

英文摘要

Compliance monitoring in rule-governed domains (tax administration, clinical protocol adherence, environmental regulation) faces three structural obstacles that standard machine learning does not simultaneously address: the absence of labeled outcomes at deployment, strategically missing observations where non-compliant entities selectively withhold evidence, and a regulatory environment that changes faster than any supervised model can be retrained. We introduce Rule-State Inference (RSI), a Bayesian framework that reverses the usual paradigm. Rather than learning rules from data, RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through mean-field variational inference with exact coordinate-ascent updates. The central modeling object is a joint latent state per regulatory period: a global compliance-culture factor eta and per-rule components for activation, population compliance level, and parametric drift. RSI delivers three formal guarantees: O(n_k + K) regulatory adaptability per rule update; Bernstein-von Mises consistency for the identifiable continuous components; and monotone ELBO convergence at every iteration. We instantiate RSI on the Togolese fiscal system on a benchmark of 2,000 synthetic enterprises grounded in official regulatory law; full numerical validation is forthcoming. The framework is designed for direct extension to Sequential RSI, a state-space formulation where the posterior from one regulatory period becomes the prior for the next, yielding an exact Kalman filter for compliance-trajectory tracking and entity-level Bayesian scoring.

2602.16195 2026-05-22 stat.AP

Phase Transitions in Collective Damage of Civil Structures under Natural Hazards

自然灾害下民用建筑集体损害的相变现象

Sebin Oh, Jinyan Zhao, Raul Rincon, Jamie E. Padgett, Ziqi Wang

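AI summary: This paper studies phase-transition phenomena in the collective damage of civil structures under natural hazards. A random-field Ising model reveals how building diversity shapes the damage transition, and the paper shows that common engineering modeling practices can systematically bias risk metrics, widening the gap in repair costs.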

详情
AI中文摘要

城市在自然灾害下的命运不仅取决于灾害强度,还取决于结构损害之间的耦合,这是一种尚未被充分理解的集体过程。本文表明,城市结构损害表现出相变现象。随着灾害强度增加,系统可能突然从基本安全的状态转变为基本损坏的状态,类似于统计物理中的一阶相变。建筑组合的多样性越高,这种转变越平滑,但多尺度损害聚类使系统停留在一个扩展的类临界区域(类似于格里菲斯相),抑制了更可预测的无序(高斯)相的出现。这些唯象模式由随机场伊辛模型刻画,其中外场、无序强度和温度分别解释为有效灾害需求、结构多样性和建模不确定性。将此框架应用于真实城市建筑清单后发现,广泛使用的工程建模做法可使城市损害模式在同步与波动两种状态之间切换,在中等地震(Mw≈5.5-6.0)下使基于超越概率的风险指标产生高达50%的系统性偏差,相当于维修成本相差数倍。这种相感知的描述将民用基础设施损害的集体行为转化为可用于城市风险评估和规划的可操作诊断。

英文摘要

The fate of cities under natural hazards depends not only on hazard intensity but also on the coupling of structural damage, a collective process that remains poorly understood. Here we show that urban structural damage exhibits phase-transition phenomena. As hazard intensity increases, the system can shift abruptly from a largely safe to a largely damaged state, analogous to a first-order phase transition in statistical physics. Higher diversity in the building portfolio smooths this transition, but multiscale damage clustering traps the system in an extended critical-like regime (analogous to a Griffiths phase), suppressing the emergence of a more predictable disordered (Gaussian) phase. These phenomenological patterns are characterized by a random-field Ising model, with the external field, disorder strength, and temperature interpreted as the effective hazard demand, structural diversity, and modeling uncertainty, respectively. Applying this framework to real urban inventories reveals that widely used engineering modeling practices can shift urban damage patterns between synchronized and volatile regimes, systematically biasing exceedance-based risk metrics by up to 50% under moderate earthquakes ($M_w \approx 5.5$--$6.0$), equivalent to a several-fold gap in repair costs. This phase-aware description turns the collective behavior of civil infrastructure damage into actionable diagnostics for urban risk assessment and planning.

2601.21025 2026-05-22 stat.ML cs.LG

A Diffusive Classification Loss for Learning Energy-based Generative Models

一种用于学习基于能量的生成模型的扩散分类损失

RuiKang OuYang, Louis Grenioux, José Miguel Hernández-Lobato

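AI summary: This paper proposes DiffCLF, a diffusive classification loss for learning energy-based generative models. By recasting energy-based model learning as a supervised classification problem across noise levels, it avoids mode blindness while remaining computationally efficient, and improves model fidelity and applicability.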

详情
Comments
Accepted at ICML 2026
AI中文摘要

基于分数的生成模型最近取得了显著的成功。虽然它们通常由分数参数化,但另一种方法是使用一系列时间依赖的能量模型(EBMs),其中分数是从能量的负输入梯度获得的。关键的是,EBMs不仅可以用于生成,还可以用于诸如组合采样或通过蒙特卡洛方法构建玻尔兹曼生成器等任务。然而,训练EBMs仍然具有挑战性。直接最大似然估计由于需要嵌套采样而计算上不可行,而分数匹配虽然高效,但存在模式盲区。为了解决这些问题,我们引入了扩散分类(DiffCLF)目标,这是一种简单的方法,可以避免盲区同时保持计算效率。DiffCLF将EBM学习重新表述为跨噪声级别的监督分类问题,并可以无缝结合标准的分数基目标。我们通过在分析高斯混合案例中将估计能量与真实值进行比较,以及通过应用训练好的模型到诸如模型组合和玻尔兹曼生成器采样等任务中,验证了DiffCLF的有效性。我们的结果表明,DiffCLF使EBM比现有方法具有更高的保真度和更广泛的应用范围。

英文摘要

Score-based generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from the negative input-gradient of the energy. Crucially, EBMs can be leveraged not only for generation, but also for tasks such as compositional sampling or building Boltzmann Generators via Monte Carlo methods. However, training EBMs remains challenging. Direct maximum likelihood is computationally prohibitive due to the need for nested sampling, while score matching, though efficient, suffers from mode blindness. To address these issues, we introduce the Diffusive Classification (DiffCLF) objective, a simple method that avoids blindness while remaining computationally efficient. DiffCLF reframes EBM learning as a supervised classification problem across noise levels, and can be seamlessly combined with standard score-based objectives. We validate the effectiveness of DiffCLF by comparing the estimated energies against ground truth in analytical Gaussian mixture cases, and by applying the trained models to tasks such as model composition and Boltzmann Generator sampling. Our results show that DiffCLF enables EBMs with higher fidelity and broader applicability than existing approaches.
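The relation between an energy and its score, stated above as the negative input-gradient, can be checked on a toy case. The sketch below uses a one-dimensional Gaussian energy and a finite-difference gradient; it illustrates only that relation and is not the DiffCLF objective, whose form is not given in the abstract. The function names and parameter values are hypothetical.

```python
def gaussian_energy(x, mean=1.0, sigma=2.0):
    """Energy of a 1-D Gaussian up to an additive constant."""
    return (x - mean) ** 2 / (2.0 * sigma ** 2)


def score_from_energy(energy_fn, x, h=1e-5):
    """Score as the negative derivative of the energy (central difference)."""
    return -(energy_fn(x + h) - energy_fn(x - h)) / (2.0 * h)


x = 3.0
numerical = score_from_energy(gaussian_energy, x)
analytical = -(x - 1.0) / 2.0 ** 2  # closed form: -(x - mean) / sigma^2
print(numerical, analytical)
```

In a learned EBM the finite difference would be replaced by automatic differentiation of the energy network with respect to its input, but the sign convention is the same.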

2601.05157 2026-05-22 cs.DS cs.LG stat.ML

Learning Mixture Models via Efficient High-dimensional Sparse Fourier Transforms

通过高效的高维稀疏傅里叶变换学习混合模型

Alkis Kalavasis, Pravesh K. Kothari, Shuchen Li, Manolis Zampetakis

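AI summary: This paper proposes a polynomial-time method for learning the parameters of mixture models in high dimensions. It applies to heavy-tailed mixtures, including distributions that do not even have finite covariances, and requires no minimum separation between the cluster means.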

详情
AI中文摘要

在本文中,我们给出了一种时间复杂度和样本复杂度均为${\rm poly}(d,k)$的算法,用于高效学习$d$维空间中$k$个球形分布的混合模型的参数。与之前的所有方法不同,我们的技术适用于重尾分布,甚至包括不具有有限协方差的例子。只要各簇分布的特征函数具有足够重的尾部,我们的方法就能成功。此类分布包括拉普拉斯分布,但关键地排除了高斯分布。此前所有学习混合模型的方法都隐式或显式地依赖于低阶矩。即使对于拉普拉斯分布的情况,我们也证明任何此类算法必须使用超多项式数量的样本。因此,我们的方法为绕过矩方法局限性的为数不多的技术增添了一员。出人意料的是,我们的算法不需要簇均值之间有任何最小分离。这与球形高斯混合模型形成鲜明对比,后者即使在信息论意义上也可证明需要最小的$\ell_2$分离[Regev and Vijayaraghavan '17]。我们的方法能与现有技术很好地组合,从而为如下混合模型获得“两全其美”的保证:其中每个分量要么具有重尾特征函数,要么具有亚高斯尾部且特征函数为轻尾。我们的算法基于一种通过高效高维稀疏傅里叶变换学习混合模型的新方法。我们相信这种方法将在统计估计中找到更多应用。作为例子,我们给出了一种针对噪声无感知(noise-oblivious)对手的一致鲁棒均值估计算法,该模型的实际动机来自多重假设检验文献。该模型最近在其中一位作者的硕士论文中被正式提出,并已启发了后续工作。

英文摘要

In this work, we give a ${\rm poly}(d,k)$ time and sample algorithm for efficiently learning the parameters of a mixture of $k$ spherical distributions in $d$ dimensions. Unlike all previous methods, our techniques apply to heavy-tailed distributions and include examples that do not even have finite covariances. Our method succeeds whenever the cluster distributions have a characteristic function with sufficiently heavy tails. Such distributions include the Laplace distribution but crucially exclude Gaussians. All previous methods for learning mixture models relied implicitly or explicitly on the low-degree moments. Even for the case of Laplace distributions, we prove that any such algorithm must use super-polynomially many samples. Our method thus adds to the short list of techniques that bypass the limitations of the method of moments. Somewhat surprisingly, our algorithm does not require any minimum separation between the cluster means. This is in stark contrast to spherical Gaussian mixtures where a minimum $\ell_2$-separation is provably necessary even information-theoretically [Regev and Vijayaraghavan '17]. Our methods compose well with existing techniques and allow obtaining "best of both worlds" guarantees for mixtures where every component either has a heavy-tailed characteristic function or has a sub-Gaussian tail with a light-tailed characteristic function. Our algorithm is based on a new approach to learning mixture models via efficient high-dimensional sparse Fourier transforms. We believe that this method will find more applications to statistical estimation. As an example, we give an algorithm for consistent robust mean estimation against noise-oblivious adversaries, a model practically motivated by the literature on multiple hypothesis testing. It was formally proposed in a recent Master's thesis by one of the authors, and has already inspired follow-up works.

2511.10619 2026-05-22 cs.LG stat.ML

Algorithm Design and Stronger Guarantees for the Improving Multi-Armed Bandits Problem

改进多臂老虎机问题的算法设计及更强的保证

Avrim Blum, Marten Garicano, Kavya Ravichandran, Dravyansh Sharma

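AI summary: This paper proposes two new parameterized families of bandit algorithms, bounds the sample complexity of learning a near-optimal algorithm from each family using offline data, and evaluates them empirically on standard hyperparameter tuning benchmarks. The first family includes the optimal randomized algorithm from prior work and achieves stronger guarantees when the arm reward curves satisfy additional concavity-related properties. Algorithms in the second family guarantee best-arm identification on well-behaved instances and revert to worst-case guarantees on poorly-behaved instances.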

详情
Comments
36 pages
AI中文摘要

改进多臂老虎机问题是一个在不确定性下分配努力的形式模型,受投资新技术研究努力、进行临床试验和从学习曲线中选择超参数等场景的启发。每次拉取臂提供奖励,该奖励以递减回报单调增加。已有大量工作设计了改进老虎机算法,但最坏情况保证较为悲观。事实上,已知确定性和随机性算法相对于最优臂的强下界分别为Ω(k)和Ω(√k)的乘法近似因子。在本文中,我们提出两个新的参数化老虎机算法家族,并利用离线数据界定了从每个家族学习近最优算法的样本复杂度。我们还在标准超参数调优基准上进行了实证评估。我们定义的第一家族包含先前工作的最优随机算法。我们证明,适当选择的算法从该家族中可以实现更强的保证,当臂奖励曲线下满足与凹性强度相关的额外性质时,具有最优的k依赖性。我们的第二家族包含在良好行为实例上保证最佳臂识别并在不良行为实例上退化为最坏情况保证的算法。

英文摘要

The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and hyperparameter selection from learning curves. Each pull of an arm provides reward that increases monotonically with diminishing returns. A growing line of work has designed algorithms for improving bandits, albeit with somewhat pessimistic worst-case guarantees. Indeed, strong lower bounds of $Ω(k)$ and $Ω(\sqrt{k})$ multiplicative approximation factors are known for both deterministic and randomized algorithms (respectively) relative to the optimal arm, where $k$ is the number of bandit arms. In this work, we propose two new parameterized families of bandit algorithms and bound the sample complexity of learning the near-optimal algorithm from each family using offline data. We also perform empirical evaluations on standard hyperparameter tuning benchmarks. The first family we define includes the optimal randomized algorithm from prior work. We show that an appropriately chosen algorithm from this family can achieve stronger guarantees, with optimal dependence on $k$, when the arm reward curves satisfy additional properties related to the strength of concavity. Our second family contains algorithms that both guarantee best-arm identification on well-behaved instances and revert to worst-case guarantees on poorly-behaved instances.

2510.09270 2026-05-22 math.ST math.PR stat.TH

Fast Wasserstein rates for estimating probability distributions of probabilistic graphical models

概率图模型概率分布估计的快速Wasserstein速率

Daniel Bartl, Stephan Eckstein

AI总结 本文研究了利用独立同分布数据在Wasserstein距离下估计高维分布时,如何通过概率图模型的结构知识克服维度诅咒,证明了局部图结构对估计速率的影响,以及不同光滑条件下的精确速率结果。

详情
AI中文摘要

使用i.i.d.数据在Wasserstein距离下估计高维分布是维度诅咒的一个基本实例。我们探讨了如何利用数据生成过程的结构知识来克服这一诅咒。更具体地说,我们考虑了给定有向无环图的概率图模型分布集。结果表明,这种知识仅在可以量化时才有帮助,我们通过离散化对应于图的转移核的光滑条件来形式化这一点。在这种情况下,我们证明估计速率由图的局部结构决定,更具体地说是由单个节点及其父节点对应的维度决定。精确的速率取决于所假设的核的光滑性概念,其中弱(Wasserstein-Lipschitz)或强(双向Total-Variation-Lipschitz)条件导致不同的结果。我们在强条件下证明了速率的最优性,并展示了该条件作为特殊情况涵盖具有正Lipschitz密度的分布。

英文摘要

Using i.i.d. data to estimate a high-dimensional distribution in Wasserstein distance is a fundamental instance of the curse of dimensionality. We explore how structural knowledge about the data-generating process which gives rise to the distribution can be used to overcome this curse. More precisely, we work with the set of distributions of probabilistic graphical models for a given directed acyclic graph. It turns out that this knowledge is only helpful if it can be quantified, which we formalize via smoothness conditions on the transition kernels in the disintegration corresponding to the graph. In this case, we prove that the rate of estimation is governed by the local structure of the graph, more precisely by dimensions corresponding to single nodes together with their parent nodes. The precise rate depends on the exact notion of smoothness assumed for the kernels, where either weak (Wasserstein-Lipschitz) or strong (bidirectional Total-Variation-Lipschitz) conditions lead to different results. We prove sharpness under the strong condition and show that this condition covers, as a special case, distributions having a positive Lipschitz density.

2507.20268 2026-05-22 cs.LG eess.SP stat.ML

Reliable Wireless Indoor Localization via Cross-Validated Prediction-Powered Calibration

通过交叉验证的预测驱动校准实现可靠的无线室内定位

Seonghoon Yoo, Houssem Sifaou, Sangwoo Park, Joonhyuk Kang, Osvaldo Simeone

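AI summary: This paper proposes a method that uses limited calibration data to fine-tune a predictor and estimate the bias of synthetic labels at the same time, improving the reliability of wireless indoor localization through cross-validated prediction-powered calibration.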

详情
AI中文摘要

使用预测模型和接收信号强度信息(RSSI)进行无线室内定位需要适当的校准以获得可靠的定位估计。一种解决方法是使用由(通常不同的)预测模型生成的合成标签。但微调额外的预测器以及估计合成标签的残差偏差需要额外的数据,加剧了无线环境中的校准数据稀缺问题。本文提出了一种方法,能够高效利用有限的校准数据,同时微调预测器并估计合成标签的偏差,从而获得具有严格覆盖保证的预测集。在指纹数据集上的实验验证了所提出方法的有效性。

英文摘要

Wireless indoor localization using predictive models with received signal strength information (RSSI) requires proper calibration for reliable position estimates. One remedy is to employ synthetic labels produced by a (generally different) predictive model. But fine-tuning an additional predictor, as well as estimating residual bias of the synthetic labels, demands additional data, aggravating calibration data scarcity in wireless environments. This letter proposes an approach that efficiently uses limited calibration data to simultaneously fine-tune a predictor and estimate the bias of synthetic labels, yielding prediction sets with rigorous coverage guarantees. Experiments on a fingerprinting dataset validate the effectiveness of the proposed method.

2505.24333 2026-05-22 stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG

Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

深度变换器的两种失效模式及如何避免它们:初始化下信号传播的统一理论

Alessio Giorlandino, Sebastian Goldt

AI总结 本文研究了深度变换器中自注意力层的两种失效模式——秩坍缩和熵坍缩,并提出了一种统一的信号传播理论,通过分析初始化对训练稳定性的影响,提供了一种计算训练性图的简单算法,以确定正确初始化超参数的选择。

详情
Journal ref
ICLR 2026
AI中文摘要

找到正确的初始化对于确保神经网络的平稳训练和良好性能至关重要。在变换器中,错误的初始化可能导致自注意力层的两种失效模式:秩坍缩,其中所有标记坍缩为相似的表示,以及熵坍缩,其中高度集中的注意力分数导致训练不稳定。尽管之前的研究所研究了变换器的不同缩放领域,但迄今为止,关于如何初始化变换器的渐近精确、到常数的处方仍然缺乏。在这里,我们提供了一种分析深度变换器中信号通过自注意力、层归一化、跳跃连接和MLP传播的理论。我们的理论产生了一种简单的算法,用于计算训练性图,以确定给定架构的正确初始化超参数选择。我们通过建立与统计物理中随机能模型的正式平行,克服了处理自注意力层的关键挑战。我们还分析了反向路径中的梯度,并确定了梯度在初始化时消失的区域。我们通过三个案例研究展示了我们框架的通用性。我们的理论框架为自注意力的两种失效模式提供了统一的视角,并对权重和残差连接的尺度提供了定量预测,以确保平稳训练。

英文摘要

Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.

2505.15844 2026-05-22 q-bio.QM cs.LG stat.AP

Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy

通过一种新型混合架构和特征选择协同效应推进表格性中风建模

Yousuf Islam, Md. Jalal Uddin Chowdhury, Sumon Chandra Das

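AI summary: This paper proposes a data-driven, interpretable machine-learning framework that uses ten routinely gathered demographic, lifestyle, and clinical variables. After detailed exploratory data analysis, preprocessing, and feature selection, it builds a stroke-risk assessment model with a reported accuracy of 97.2%, which the authors describe as a significant improvement over the leading individual model.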

详情
Journal ref
IEEE Conference Publication, 2025
AI中文摘要

脑中风仍然是全球死亡和残疾的主要原因之一,但大多数表格数据预测模型仍低于95%的准确率阈值,限制了实际应用。为解决这一差距,本文开发并验证了一个完全数据驱动且可解释的机器学习框架,旨在使用来自4981条记录的公共队列中十种常规获取的 demographics、生活方式和临床变量来预测中风。我们通过详尽的探索性数据分析(EDA)来理解数据集的结构和分布,随后进行严格的数据预处理,包括处理缺失值、去除异常值和使用合成少数类过采样技术(SMOTE)纠正类别不平衡。为了简化特征选择,使用了点二列相关性和随机森林Gini重要性,并利用分层五折交叉验证优化了包括树集成、提升、核方法和多层神经网络在内的十种不同算法。它们基于概率的预测帮助我们构建了所提出的模型,包括随机森林、XGBoost、LightGBM和一个支持向量分类器,其中逻辑回归作为元学习器。所提出的模型实现了97.2%的准确率和97.15%的F1分数,表明其显著优于领先的单个模型LightGBM,其准确率为91.4%。本研究的结果表明,严格的预处理与多样化的混合模型相结合,可以将低成本的表格数据转化为几乎临床级别的中风风险评估工具。

英文摘要

Brain stroke remains one of the principal causes of death and disability worldwide, yet most tabular-data prediction models still hover below the 95% accuracy threshold, limiting real-world utility. Addressing this gap, the present work develops and validates a completely data-driven and interpretable machine-learning framework designed to predict strokes using ten routinely gathered demographic, lifestyle, and clinical variables sourced from a public cohort of 4,981 records. We employ a detailed exploratory data analysis (EDA) to understand the dataset's structure and distribution, followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms (encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network) were optimized using stratified five-fold cross-validation. Their predictions based on probabilities helped us build the proposed model, which included Random Forest, XGBoost, LightGBM, and a support-vector classifier, with logistic regression acting as a meta-learner. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model, LightGBM, which had an accuracy of 91.4%. Our study's findings indicate that rigorous preprocessing, coupled with a diverse hybrid model, can convert low-cost tabular data into a nearly clinical-grade stroke-risk assessment tool.

2503.06115 2026-05-22 stat.ML cs.IT cs.LG math.IT math.PR

On Statistical Estimation of Edge-Reinforced Random Walks

关于边缘增强随机游走的统计估计

Qinghua, Ding, Venkat Anantharam

AI总结 本文研究了边缘增强随机游走初始边权重的统计估计问题,利用随机环境中的超几何高斯结构来分析估计器的样本复杂性。

详情
Comments
This is the full version of the conference paper in submission to ISIT 2025
AI中文摘要

增强型随机游走(RRWs),包括顶点增强随机游走(VRRWs)和边缘增强随机游走(ERRWs),是一类转移概率随先前访问历史演变的随机游走模型~\cite{mgr, fmk, tarres, volkov}。这些模型已在网络表示学习~\cite{xzzs}、增强型PageRank~\cite{gly}和动物行为建模~\cite{smouse}等领域得到应用。然而,对RRW参数的统计估计仍缺乏充分研究。本文聚焦于利用观测到的轨迹数据估计ERRW的初始边权重。借助ERRW与随机环境中的随机游走(RWRE)~\cite{mr, mr2}之间由所谓的“magic formula”给出的联系,我们提出了一种基于广义矩方法的估计器。为了分析该估计器的样本复杂度,我们利用随机环境中蕴含的双曲高斯结构来界定底层随机边电导的波动。

英文摘要

Reinforced random walks (RRWs), including vertex-reinforced random walks (VRRWs) and edge-reinforced random walks (ERRWs), model random walks where the transition probabilities evolve based on prior visitation history~\cite{mgr, fmk, tarres, volkov}. These models have found applications in various areas, such as network representation learning~\cite{xzzs}, reinforced PageRank~\cite{gly}, and modeling animal behaviors~\cite{smouse}, among others. However, statistical estimation of the parameters governing RRWs remains underexplored. This work focuses on estimating the initial edge weights of ERRWs using observed trajectory data. Leveraging the connections between an ERRW and a random walk in a random environment (RWRE)~\cite{mr, mr2}, as given by the so-called "magic formula", we propose an estimator based on the generalized method of moments. To analyze the sample complexity of our estimator, we exploit the hyperbolic Gaussian structure embedded in the random environment to bound the fluctuations of the underlying random edge conductances.

2502.21194 2026-05-22 stat.ML cs.LG

Prior shift estimation for positive unlabeled data through the lens of kernel embedding

通过核嵌入视角对正样本无标签数据的先验偏移估计

Jan Mielniczuk, Wojciech Rejchel, Paweł Teisseyre

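AI summary: This paper studies estimation of the class prior for unlabeled target samples, which may differ from that of the source population, when the source data is only partially observable: only samples from the positive class and from the whole population are available (the PU learning scenario). It proposes a novel direct estimator of the class prior that avoids estimating posterior probabilities in both populations and has a simple geometric interpretation. The estimator is based on a distribution matching technique together with kernel embedding in a reproducing kernel Hilbert space, and is obtained as an explicit solution to an optimisation task. The paper establishes its asymptotic consistency as well as an explicit non-asymptotic bound on its deviation from the unknown prior, which is calculable in practice. Finite-sample behaviour is studied on synthetic and real data, showing that the method performs on par with or better than its competitors.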

详情
AI中文摘要

我们研究了在目标无标签样本的先验分布估计问题,假设其可能与源群体不同,并且源数据部分可观察:只有正类样本和整个群体的样本可用(PU学习场景)。我们引入了一种新的直接估计先验分布的方法,避免了对两个群体后验概率的估计,并具有简单的几何解释。它基于分布匹配技术以及再生核希尔伯特空间中的核嵌入,并作为优化任务的显式解获得。我们建立了其渐近一致性以及对其与未知先验偏差的显式非渐近界,该界在实践中可计算。我们研究了合成和实际数据的有限样本行为,并证明该方法在性能上与竞争对手相当或更优。

英文摘要

We study estimation of a class prior for unlabeled target samples which possibly differs from that of the source population. Moreover, it is assumed that the source data is partially observable: only samples from the positive class and from the whole population are available (PU learning scenario). We introduce a novel direct estimator of a class prior which avoids estimation of posterior probabilities in both populations and has a simple geometric interpretation. It is based on a distribution matching technique together with kernel embedding in a Reproducing Kernel Hilbert Space and is obtained as an explicit solution to an optimisation task. We establish its asymptotic consistency as well as an explicit non-asymptotic bound on its deviation from the unknown prior, which is calculable in practice. We study finite sample behaviour for synthetic and real data and show that the proposal consistently performs on par with or better than its competitors.

2501.17622 2026-05-22 math.ST math.PR q-bio.PE stat.TH

Likelihood landscape of binary latent model on a tree

二元隐模型在树上的似然景观

David Clancy, Hanbaek Lyu, Sebastien Roch

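AI summary: This paper studies the optimization landscape of maximum likelihood estimation for the Cavender-Farris-Neyman model, proving that, deep inside the reconstruction regime, the population log-likelihood is strongly concave and smooth within a box around the true parameter, and that coordinate maximization converges exponentially fast to a consistent MLE.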

详情
Comments
59 pages, 8 figures
AI中文摘要

我们研究了Cavender-Farris-Neyman(CFN)模型的最大似然估计(MLE)的优化景观,这是一种两状态的隐树模型,对统计系统发育学和铁磁伊辛模型至关重要。尽管对数似然函数非凹且可能具有许多临界点,简单的坐标最大化算法在实践中却非常有效。我们为这种成功提供了第一个理论证明。我们证明,在重建区域内足够深的地方,总体对数似然在真实参数周围的一个盒状区域内是强凹且光滑的,该区域的大小与树拓扑和叶数无关。这一基本结果意味着在多项式样本复杂度下,经验景观以高概率具有这些正则性质,并且坐标最大化以指数速度收敛到O(1/√m)-一致的MLE。我们的分析集中在总体Hessian的新型衰减性质上:对角线条目保持较大,而非对角线条目随图距离呈指数衰减。这些结果为基于似然的树推断的有效性提供了严格理论证据,并提出了隐变量模型更广泛的原则。

英文摘要

We investigate the optimization landscape of maximum likelihood estimation (MLE) for the Cavender-Farris-Neyman (CFN) model, a two-state latent tree model fundamental to statistical phylogenetics and the ferromagnetic Ising model. Although the log-likelihood function is non-concave and may admit many critical points, simple coordinate maximization algorithms are remarkably effective in practice. We provide the first theoretical justification for this success. We prove that sufficiently deep inside the reconstruction regime, the population log-likelihood is strongly concave and smooth within a box around the true parameter, whose size is independent of tree topology and number of leaves. This fundamental result implies that the empirical landscape shares these regularity properties with high probability given polynomial sample complexity and also that coordinate maximization converges exponentially fast to an $O(1/\sqrt{m})$-consistent MLE. Our analysis centers on a novel decay property of the population Hessian: diagonal entries remain large while off-diagonal entries decay exponentially with graph distance. These results provide rigorous theoretical evidence for the efficacy of likelihood-based tree inference and suggest broader principles for latent variable models.

2501.07772 2026-05-22 math.ST econ.EM stat.ME stat.TH

Honest Inference for Stochastic Optimization

随机优化中的诚实推断

Kenta Takatsu, Arun Kumar Kuchibhotla

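AI summary: This paper studies a general approach to constructing confidence sets for the solution of stochastic optimization, with empirical risk minimization as a special case. Statistical inference for stochastic optimization is challenging because the corresponding estimator exhibits non-standard limiting behavior when the parameter dimension grows, the objective is non-smooth, or constraints are present. The paper proposes a simple and unified method that guarantees validity in both regular and irregular cases, and gives a unified treatment of the validity and conservativeness of the confidence sets. In particular, the width analysis shows that the confidence set adapts to the unknown degree of instance-specific regularity. The method is applied to several high-dimensional and irregular statistical problems, with numerical results provided for all statistical applications.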

详情
AI中文摘要

本文研究了一种通用方法,用于构建随机优化解的置信集,将经验风险最小化作为特殊情况。统计推断在随机优化中面临重大挑战,因为对应的估计量在参数维度增加、目标函数非光滑或存在约束时表现出非标准的极限行为。本文提出了一种简单且统一的方法,能够保证在规则和不规则情况下都有效。我们提供了对有效性和保守性以及所得到置信集大小的统一处理。特别是,所展示的宽度分析表明置信集能够自适应地调整到未知的实例特定规则性程度。我们应用所提出的方法到多个高维和不规则的统计问题。所有统计应用的数值结果均被提供。

英文摘要

This manuscript studies a general approach to constructing confidence sets for the solution of stochastic optimization, with empirical risk minimization as a special case. Statistical inference for stochastic optimization poses significant challenges due to the non-standard limiting behaviors of the corresponding estimator, which arise in settings with increasing dimension of parameters, non-smooth objectives, or constraints. We propose a simple and unified method that guarantees validity in both regular and irregular cases. We provide a unified treatment of validity, conservativeness, and the size of the resulting confidence sets. In particular, the presented width analysis demonstrates the adaptive behavior of the confidence set to the unknown degree of instance-specific regularity. We apply the proposed method to several high-dimensional and irregular statistical problems. Numerical results for all statistical applications are provided.

2501.00677 2026-05-22 cs.LG cs.CV cs.IT cs.NA math.IT math.NA stat.ML

Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery

深度学习鲁棒矩阵补全用于大规模低秩数据恢复

HanQin Cai, Chandra Kundu, Jialin Liu, Wotao Yin

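AI summary: This paper proposes Learned Robust Matrix Completion (LRMC), a scalable and learnable non-convex approach for large-scale robust matrix completion problems. LRMC has low computational complexity with linear convergence, its free parameters are learned effectively via deep unfolding to achieve optimum performance, and its superior empirical performance is verified on synthetic datasets and real applications.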

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(6): 6541-6556, 2026
AI中文摘要

鲁棒矩阵补全(RMC)是一种广泛使用的机器学习工具,同时解决低秩数据分析中的两个关键问题:缺失数据条目和极端异常值。本文提出了一种新颖的可扩展且可学习的非凸方法,称为学得鲁棒矩阵补全(LRMC),用于大规模RMC问题。LRMC具有低计算复杂度和线性收敛性。受所提出定理的启发,LRMC的自由参数可通过深度展开有效学习以达到最佳性能。此外,本文提出了一种灵活的前馈-递归-混合神经网络框架,将深度展开从固定次数迭代扩展到无限次数迭代。通过在合成数据集和实际应用中的广泛实验,验证了LRMC的优越的实验性能,包括视频背景减除、超声成像、面部建模和卫星图像云去除。

英文摘要

Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from a fixed number of iterations to infinitely many iterations. The superior empirical performance of LRMC is verified with extensive experiments against state-of-the-art methods on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.

2411.02776 2026-05-22 cs.LG stat.AP

Deep learning-based modularized loading protocol for parameter estimation of Bouc-Wen class models

基于深度学习的模块化加载协议用于Bouc-Wen类模型参数估计

Sebin Oh, Junho Song, Taeyong Kim

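AI summary: This paper proposes a modularized deep learning-based loading protocol for optimal parameter estimation of Bouc-Wen class models. The protocol has two key components: optimal loading history construction and CNN-based rapid parameter estimation. Each component is decomposed into independent sub-modules tailored to distinct hysteretic behaviors (basic hysteresis, structural degradation, and pinching effect), making the protocol adaptable to diverse hysteresis models. Three independent CNN architectures are developed to capture the path-dependent nature of these behaviors. By training these architectures on diverse loading histories, minimal loading sequences, termed loading history modules, are identified and combined to construct an optimal loading history. The three trained CNN models serve as rapid parameter estimators. Numerical evaluation, including nonlinear time history analysis of a 3-story steel moment frame and fragility curve construction for a 3-story reinforced concrete frame, shows that the protocol significantly reduces total analysis time while maintaining or improving estimation accuracy. The protocol can be extended to other hysteresis models, suggesting a systematic approach for identifying general hysteresis models.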

详情
Journal ref
Engineering Structures 339, 120458 (2025)
AI中文摘要

本研究提出了一种基于深度学习的模块化加载协议,用于Bouc-Wen(BW)类模型的最优参数估计。该协议由两个关键部分组成:最优加载历史构建和基于CNN的快速参数估计。每个部分被分解为独立的子模块,针对不同的滞回行为(基本滞回、结构退化和捏缩效应),使协议能够适应多种滞回模型。开发了三种独立的CNN架构以捕捉这些滞回行为的路径依赖性。通过在多样化的加载历史上训练这些CNN架构,识别出最小的加载序列,称为加载历史模块,然后将其组合以构建最优的加载历史。三种CNN模型分别在相应的加载历史模块上训练,用作快速参数估计器。协议的数值评估,包括一个三层钢抗弯框架的非线性时程分析和一个三层钢筋混凝土框架的易损性曲线构建,表明所提出的协议显著减少了总分析时间,同时保持或提高了估计精度。所提出的协议可以扩展到其他滞回模型,表明了一种系统的方法来识别通用滞回模型。

英文摘要

This study proposes a modularized deep learning-based loading protocol for optimal parameter estimation of Bouc-Wen (BW) class models. The protocol consists of two key components: optimal loading history construction and CNN-based rapid parameter estimation. Each component is decomposed into independent sub-modules tailored to distinct hysteretic behaviors (basic hysteresis, structural degradation, and pinching effect), making the protocol adaptable to diverse hysteresis models. Three independent CNN architectures are developed to capture the path-dependent nature of these hysteretic behaviors. By training these CNN architectures on diverse loading histories, minimal loading sequences, termed "loading history modules", are identified and then combined to construct an optimal loading history. The three CNN models, trained on the respective loading history modules, serve as rapid parameter estimators. Numerical evaluation of the protocol, including nonlinear time history analysis of a 3-story steel moment frame and fragility curve construction for a 3-story reinforced concrete frame, demonstrates that the proposed protocol significantly reduces total analysis time while maintaining or improving estimation accuracy. The proposed protocol can be extended to other hysteresis models, suggesting a systematic approach for identifying general hysteresis models.

2405.17032 2026-05-22 q-bio.QM math.PR q-bio.PE stat.AP

Exact phylodynamic likelihood via structured Markov genealogy processes

通过结构化马尔可夫基因谱过程实现精确的系统发育动力学似然

Aaron A. King, Qianying Lin, Edward L. Ionides

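AI summary: This paper presents an approach based on structured Markov genealogy processes that computes the exact likelihood of an observed genealogy via a filter equation, shows that existing phylodynamic methods based on the coalescent and linear birth-death processes are special cases, and describes a class of algorithms for solving the filter equations numerically, opening the door to statistically efficient likelihood-based phylodynamic inference for a wider class of models.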

详情
AI中文摘要

我们证明了每种广义马尔可夫人口模型的成员在基因谱空间上诱导一个唯一的随机过程。我们构建了这种基因谱过程,并推导出观察到的基因谱的似然的精确表达式,该表达式以滤波方程的形式给出,其结构完全由人口模型决定。我们证明现有的基于凝聚过程和线性生灭过程的系统发育动力学方法是其特殊情形。我们推导了滤波方程的一些性质,并描述了一类可用于数值求解它们的算法。重要的是,由于这些算法仅依赖于人口模型的模拟,因此保留了基于模拟推断所依赖的即插即用性质。我们的结果为更广泛的模型类别的统计高效似然基于的系统发育推断打开了大门。

英文摘要

We show that each member of a broad class of Markovian population models induces a unique stochastic process on the space of genealogies. We construct this genealogy process and derive exact expressions for the likelihood of an observed genealogy in terms of a filter equation, the structure of which is completely determined by the population model. We show that existing phylodynamic methods based on the coalescent and linear birth-death processes are special cases. We derive some properties of filter equations and describe a class of algorithms that can be used to numerically solve them. Importantly, because these algorithms rely only on simulation of the population model, they retain the plug-and-play property upon which simulation-based inference depends. Our results open the door to statistically efficient likelihood-based phylodynamic inference for a much wider class of models than is currently possible.

2401.00139 2026-05-22 cs.AI cs.CL cs.LG stat.ME

Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning

增强大语言模型中的因果推理:一种用于精确微调的因果归因模型

Hengrui Cai, Shengjie Liu, Rui Song

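AI summary: This paper proposes a causal attribution model that improves the interpretability and causal reasoning abilities of large language models via precise fine-tuning, and demonstrates the model's effectiveness on causal discovery tasks across various domains.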

详情
Comments
A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM
AI中文摘要

本文介绍了一种因果归因模型,旨在通过精确微调增强大语言模型(LLMs)的可解释性并提高其因果推理能力。尽管LLMs在多种任务中表现出色,但其推理过程往往仍是一个黑箱,限制了有针对性的增强。我们提出了一种新的因果归因模型,利用“do-运算符”构建干预场景,使我们能够系统地量化LLMs因果推理过程中不同组件的贡献。通过在各种领域中进行因果发现任务来评估所提出的归因分数,我们证明了LLMs在因果发现中的有效性严重依赖于提供的上下文和领域特定知识,但也可以利用数值数据进行有限的相关性推理,而非因果性。这促使了所提出的微调LLM用于成对因果发现,有效且正确地利用了知识和数值信息。

英文摘要

This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" for constructing interventional scenarios, allowing us to systematically quantify the contribution of different components in LLMs' causal reasoning process. By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery relies heavily on the provided context and domain-specific knowledge, but that LLMs can also use numerical data, with limited calculations that capture correlation rather than causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, which effectively and correctly leverages both knowledge and numerical information.

2307.05732 2026-05-22 stat.ME math.ST stat.TH

From Isotonic to Lipschitz Regression: A New Interpolative Perspective on Shape-restricted Estimation

从单调到利普希茨回归:一种新的插值视角下的形状限制估计

Kenta Takatsu, Tianyu Zhang, Arun Kumar Kuchibhotla

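AI summary: This paper offers a new interpolative perspective that bridges nonparametric smoothness-based and shape-restricted estimation. By decomposing a Lipschitz function into a monotonic function and a linear function, it proposes a new family of estimators with desirable methodological, theoretical, and computational properties, and provides convergence guarantees under heteroscedastic and heavy-tailed errors.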

详情
AI中文摘要

本文将非参数光滑性基于和形状限制估计连接起来,这在领域中可能看起来是两个不同的范式。所提出的方法受到一个概念上简单的观察的启发:每一个利普希茨函数都是单调函数和线性函数之和。这一原理进一步推广到高阶单调性和多变量设置。基于样本分割程序提出了一族估计量,继承了形状限制估计量良好的方法论、理论和计算性质。理论分析提供了在异方差和厚尾误差下的估计量收敛保证,并且能够适应未知的“复杂性”回归函数。通过新的近似结果和数值研究展示了所提出分解框架的广泛性。

英文摘要

This manuscript bridges nonparametric smoothness-based and shape-restricted estimation, which may appear as two disjoint paradigms in the field. The proposed approach is motivated by a conceptually simple observation: every Lipschitz function is a sum of a monotonic and a linear function. This principle is further generalized to higher-order monotonicity and multivariate settings. A family of estimators is proposed based on a sample-splitting procedure, inheriting desirable methodological, theoretical, and computational properties of shape-restricted estimators. The theoretical analysis provides convergence guarantees for the estimator under heteroscedastic and heavy-tailed errors, as well as adaptivity to the unknown "complexity" of the true regression function. The generality of the proposed decomposition framework is demonstrated through new approximation results and numerical studies.
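The observation that every Lipschitz function is a sum of a monotonic and a linear function can be checked in one line for the univariate case. The sketch below is a standard argument, not taken from the paper; the interval $[a, b]$, the constant $L$, and the helper function $g$ are notation introduced here for illustration.

```latex
\begin{align*}
&\text{Assume } |f(y) - f(x)| \le L\,|y - x| \text{ for all } x, y \in [a, b],
  \text{ and set } g(x) := f(x) + L\,x. \\
&\text{For } x < y:\quad
  g(y) - g(x) = \bigl(f(y) - f(x)\bigr) + L\,(y - x)
  \;\ge\; -L\,(y - x) + L\,(y - x) = 0, \\
&\text{so } g \text{ is nondecreasing and }
  f(x) = \underbrace{g(x)}_{\text{monotone}} + \underbrace{(-L\,x)}_{\text{linear}}.
\end{align*}
```

Read this way, fitting an $L$-Lipschitz regression function amounts to fitting a monotone function to responses shifted by the known linear term, which is presumably why the properties of shape-restricted estimators carry over.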