arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.11762 2026-04-14 cs.CV cs.LG eess.SP physics.med-ph stat.ML

MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI

Paula Arguello, Berk Tinaz, Mohammad Shahab Sepehri, Maryam Soltanolkotabi, Mahdi Soltanolkotabi

Comments 15 pages, 6 figures, preliminary version

Abstract

Deep learning underpins a wide range of applications in MRI, including reconstruction, artifact removal, and segmentation. However, progress has been driven largely by public datasets focused on brain and knee imaging, shaping how models are trained and evaluated. As a result, careful studies of the reliability of these models across diverse anatomical settings remain limited. In this work, we introduce MosaicMRI, a large and diverse collection of fully sampled raw musculoskeletal (MSK) MR measurements designed for training and evaluating machine-learning-based methods. MosaicMRI is the largest open-source raw MSK MRI dataset to date, comprising 2,671 volumes and 80,156 slices. The dataset offers substantial diversity in volume orientation (e.g., axial, sagittal), imaging contrasts (e.g., PD, T1, T2), anatomies (e.g., spine, knee, hip, ankle, and others), and numbers of acquisition coils. Using VarNet as a baseline for the accelerated reconstruction task, we perform a comprehensive set of experiments to study scaling behavior with respect to both model capacity and dataset size. Interestingly, models trained on the combined anatomies significantly outperform anatomy-specific models in low-sample regimes, highlighting the benefits of anatomical diversity and the presence of exploitable cross-anatomical correlations. We further evaluate robustness and cross-anatomy generalization by training models on one anatomy (e.g., spine) and testing them on another (e.g., knee). Notably, we identify groups of body parts (e.g., foot and elbow) that generalize well with each other, and highlight that performance under domain shifts depends on training set size, anatomy, and protocol-specific factors.

2604.11746 2026-04-14 stat.ME math.ST stat.ML stat.TH

Inferring Change Points in Regression via Sample Weighting

Gabriel Arpino, Ramji Venkataramanan

Comments 70 pages, 11 figures

Abstract

We study the problem of identifying change points in high-dimensional generalized linear models, and propose an approach based on sample-weighted empirical risk minimization. Our method, Weighted ERM, encodes priors on the change points via weights assigned to each sample, to obtain weighted versions of standard estimators such as M-estimators and maximum-likelihood estimators. Under mild assumptions on the data, we obtain a precise asymptotic characterization of the performance of our method for general Gaussian designs, in the high-dimensional limit where the number of samples and covariate dimension grow proportionally. We show how this characterization can be used to efficiently construct a posterior distribution over change points. Numerical experiments on both simulated and real data illustrate the efficacy of Weighted ERM compared to existing approaches, demonstrating that sample weights constructed with weakly informative priors can yield accurate change point estimators. Our method is implemented as an open-source package, weightederm, available in Python and R.

2604.11731 2026-04-14 stat.ME stat.AP stat.ML

Nested Atoms Model with Application to Clustering Big Population-Scale Single-Cell Data

Arhit Chakrabarti, Yang Ni, Yuchao Jiang, Bani K. Mallick

Abstract

We consider the problem of clustering nested or hierarchical data, where observations are grouped and there are both group-level and observation-level variables. In our motivating OneK1K dataset, observations consist of single-cell RNA-sequencing (scRNA-seq) data from 982 individuals (groups), totaling 1.27 million cells (observations), along with individual-specific genotype data. This type of data would enable the identification of cell types and the investigation of how genetic variations among individuals influence differences in cell-type profiles. Our goal, therefore, is to jointly cluster cells and individuals to capture the heterogeneity across both levels using cell-specific gene expressions as well as individual-specific genotypes. However, existing grouped clustering methods do not incorporate group-level variables, thereby limiting their ability to capture the heterogeneity of genotypes in our motivating application. To address this, we propose the Nested Atoms Model (NAM), a new Bayesian nonparametric approach that enables the desired two-layered clustering, accounting for both group-level and observation-level variables. To scale NAM for high-dimensional data, we develop a fast variational Bayesian inference algorithm. Simulations show that NAM outperforms existing methods that ignore group-level variables. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles. The resulting cell clusters align with known immune cell types based on differential gene expression, underscoring the ability of NAM to capture nested heterogeneity and provide biologically meaningful insights.

2604.11729 2026-04-14 math.PR cs.DS cs.LG math.ST stat.TH

Universality of first-order methods on random and deterministic matrices

Nicola Gorini, Chris Jones, Dmitriy Kunisky, Lucas Pesenti

Abstract

General first-order methods (GFOM) are a flexible class of iterative algorithms which update a state vector by matrix-vector multiplications and entrywise nonlinearities. A long line of work has sought to understand the large-n dynamics of GFOM, mostly focusing on "very random" input matrices and the approximate message passing (AMP) special case of GFOM whose state is asymptotically Gaussian. Yet, it has long remained unknown how to construct iterative algorithms that retain this Gaussianity for more structured inputs, or why existing AMP algorithms can be as effective for some deterministic matrices as they are for random matrices. We analyze diagrammatic expansions of GFOM via the limiting traffic distribution of the input matrix, the collection of all limiting values of permutation-invariant polynomials in the matrix entries, to obtain the following results: 1. We calculate the traffic distribution for the first non-trivial deterministic matrices, including (minor variants of) the Walsh-Hadamard and discrete sine and cosine transform matrices. This determines the limiting dynamics of GFOM on these inputs, resolving parts of longstanding conjectures of Marinari, Parisi, and Ritort (1994). 2. We design a new AMP iteration which unifies several previous AMP variants and generalizes to new input types, whose limiting dynamics are Gaussian conditional on some latent random variables. The asymptotic dynamics hold for a large and natural class of traffic distributions (encompassing both random and deterministic input matrices) and the algorithm's analysis gives a simple combinatorial interpretation of the Onsager correction, answering questions posed recently by Wang, Zhong, and Fan (2022).

2604.11673 2026-04-14 stat.ME cs.AI math.ST stat.CO stat.TH

NetworkNet: A Deep Neural Network Approach for Random Networks with Sparse Nodal Attributes and Complex Nodal Heterogeneity

Zhaoyu Xing, Xiufan Yu

Abstract

Heterogeneous network data with rich nodal information have become increasingly prevalent across multidisciplinary research, yet accurately modeling complex nodal heterogeneity and simultaneously selecting influential nodal attributes remains an open challenge. This problem is central to many applications in economics and sociology, where both nodal heterogeneity and high-dimensional individual characteristics strongly affect network formation. We propose a statistically grounded, unified deep neural network approach for modeling nodal heterogeneity in random networks with high-dimensional nodal attributes, namely "NetworkNet". A key innovation of NetworkNet lies in a tailored neural architecture that explicitly parameterizes attribute-driven heterogeneity, and at the same time, embeds a scalable attribute selection mechanism. NetworkNet consistently estimates two types of latent heterogeneity functions, i.e., nodal expansiveness and popularity, while simultaneously performing data-driven attribute selection to extract influential nodal attributes. By unifying classical statistical network modeling with deep learning, NetworkNet delivers the expressive power of DNNs with methodological interpretability, algorithmic scalability, and statistical rigor with a non-asymptotic approximation error bound. Empirically, simulations demonstrate strong performance in both heterogeneity estimation and high-dimensional attribute selection. We further apply NetworkNet to a large-scale author-citation network among statisticians, revealing new insights into the dynamic evolution of research fields and scholarly impact.

2604.11591 2026-04-14 stat.ME

A novel reference prior for Gaussian hierarchical models with intrinsic conditional autoregressive random effects

Marco A. R. Ferreira

Abstract

We develop a novel reference prior for Gaussian hierarchical models with intrinsic conditional autoregressive (ICAR) random effects. This is particularly important in the context of objective Bayes variable selection with sample size $n$ and $k$ regressors. In this context, a previously published reference prior requires the computation of spectral decompositions of two $n$-dimensional matrices for each model under consideration. As a consequence, for variable selection the computational cost of this previous reference prior grows as $O(n^3 2^k)$. In contrast, our novel reference prior requires the computation of the spectral decomposition of one $n$-dimensional matrix that can be used for all models under consideration. Thus, the computational cost of our novel reference prior grows much slower as $O(n^3)$. Hence, computational savings can be substantial, e.g. in a problem with 10 regressors, when compared to the previously published reference prior, computations based on our novel reference prior are more than 1000 times faster. We provide a proof of the equivalence of the two priors. A simulation study shows that, while both reference priors provide equivalent variable selection results, for large sample sizes computations based on our novel prior are several orders of magnitude faster. Finally, the utility of our novel reference prior is illustrated with a spatial regression study of county-level median household income on socio-economic regressors for 3108 counties in the contiguous United States.
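The "more than 1000 times faster" figure follows from the two stated growth rates. A rough check, assuming (this is not stated in the abstract) that both procedures share comparable constant factors, so that only the leading-order terms matter:

```latex
\[
\frac{\text{cost of previous prior}}{\text{cost of novel prior}}
\;\approx\; \frac{n^{3}\,2^{k}}{n^{3}} \;=\; 2^{k},
\qquad
k = 10 \;\Longrightarrow\; 2^{10} = 1024 > 1000 .
\]
```

The factor $2^k$ is the number of candidate models obtained by including or excluding each of the $k$ regressors, which is why sharing one spectral decomposition across all models removes it.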

2604.11578 2026-04-14 quant-ph cs.AI cs.LG stat.ML

Minimizing classical resources in variational measurement-based quantum computation for generative modeling

Arunava Majumder, Hendrik Poulsen Nautrup, Hans J. Briegel

Comments 14 pages

Abstract

Measurement-based quantum computation (MBQC) is a framework for quantum information processing in which a computational task is carried out through one-qubit measurements on a highly entangled resource state. Due to the indeterminacy of the outcomes of a quantum measurement, the random outcomes of these operations, if not corrected, yield a variational quantum channel family. Traditionally, this randomness is corrected through classical processing in order to ensure deterministic unitary computations. Recently, variational measurement-based quantum computation (VMBQC) has been introduced to exploit this measurement-induced randomness to gain an advantage in generative modeling. A limitation of this approach is that the corresponding channel model has twice as many parameters compared to the unitary model, scaling as $N \times D$, where $N$ is the number of logical qubits (width) and $D$ is the depth of the VMBQC model. This can often make optimization more difficult and may lead to poorly trainable models. In this paper, we present a restricted VMBQC model that extends the unitary setting to a channel-based one using only a single additional trainable parameter. We show, both numerically and algebraically, that this minimal extension is sufficient to generate probability distributions that cannot be learned by the corresponding unitary model.

2604.11550 2026-04-14 stat.ME

Principled Inference in Dense High-Dimensional Linear Models via Local Conditional Sparsity

Wenjun Xiong, Yan Chen, Mingya Long, Qizhai Li

Abstract

High-dimensional inference methods often rely on coefficient sparsity, an assumption that can be restrictive when signals are dense but individually weak. In such settings, valid inference may still be possible if the covariates exhibit sparse conditional dependence. Motivated by this observation, we propose Neighborhood-Localized Nested Regression (NLNR), a framework for coordinatewise inference in high-dimensional linear models with potentially dense coefficients. The central idea is to localize inference for a target coefficient to a low-dimensional working regression determined by a Sparse Conditional Neighborhood (SCN) of the target covariate. Specifically, for a given covariate, we estimate its SCN through nodewise $\ell_1$-penalized regression and then fit a regression using only the target covariate and its estimated neighborhood. Under suitable regularity conditions, we establish consistency and asymptotic normality of the resulting estimator. Building on this inferential reduction principle, we further develop a thresholding-based screening procedure with theoretical guarantees and a boosting variant that augments the working model with additional response-relevant covariates to improve finite-sample performance. Extensive simulations and an application to the CCLE dataset demonstrate favorable empirical performance.

2604.11507 2026-04-14 math.OC cs.AI cs.LG cs.SY eess.SY stat.ML

Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers

I. Esra Buyuktahtakin

Abstract

Artificial intelligence (AI) is moving increasingly beyond prediction to support decisions in complex, uncertain, and dynamic environments. This shift creates a natural intersection with operations research and management sciences (OR/MS), which have long offered conceptual and methodological foundations for sequential decision-making under uncertainty. At the same time, recent advances in deep learning, including feedforward neural networks, LSTMs, transformers, and deep reinforcement learning, have expanded the scope of data-driven modeling and opened new possibilities for large-scale decision systems. This tutorial presents an OR/MS-centered perspective on deep learning for sequential decision-making under uncertainty. Its central premise is that deep learning is valuable not as a replacement for optimization, but as a complement to it. Deep learning brings adaptability and scalable approximation, whereas OR/MS provides the structural rigor needed to represent constraints, recourse, and uncertainty. The tutorial reviews key decision-making foundations, connects them to the major neural architectures in modern AI, and discusses leading approaches to integrating learning and optimization. It also highlights emerging impact in domains such as supply chains, healthcare and epidemic response, agriculture, energy, and autonomous operations. More broadly, it frames these developments as part of a wider transition from predictive AI toward decision-capable AI and highlights the role of OR/MS in shaping the next generation of integrated learning-optimization systems.

2604.11491 2026-04-14 stat.ML cs.AI cs.LG math.ST stat.ME stat.TH

ADD for Multi-Bit Image Watermarking

An Luo, Jie Ding

Abstract

As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promising remedy is multi-bit image watermarking, which embeds a multi-bit message into an image so that a verifier can later detect whether the image is generated by someone and further identify the source by decoding the embedded message. Existing approaches often fall short in capacity, resilience to common image distortions, and theoretical justification. To address these limitations, we propose ADD (Add, Dot, Decode), a multi-bit image watermarking method with two stages: learning a watermark to be linearly combined with the multi-bit message and added to the image, and decoding through inner products between the watermarked image and the learned watermark. On the standard MS-COCO benchmark, we demonstrate that for the challenging task of 48-bit watermarking, ADD achieves 100% decoding accuracy, with performance dropping by at most 2% under a wide range of image distortions, substantially smaller than the 14% average drop of state-of-the-art methods. In addition, ADD achieves substantial computational gains, with 2-fold faster embedding and 7.4-fold faster decoding than the fastest existing method. We further provide a theoretical analysis explaining why the learned watermark and the corresponding decoding rule are effective.
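The add-then-dot-product idea described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the watermark here is a set of random orthonormal directions rather than a learned one, the "image" is random noise, and the names `embed`, `decode`, `W`, and `strength` are hypothetical. The strength is set deliberately large so that decoding is reliable in this toy setting, with no attempt to keep the perturbation imperceptible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64 * 64, 48  # flattened image size and message length (48 bits)

# Stand-in for the learned watermark (assumption): k orthonormal directions
# in pixel space, obtained from a reduced QR factorization.
W, _ = np.linalg.qr(rng.standard_normal((d, k)))  # W has shape (d, k)

def embed(image, bits, strength=10.0):
    """Add: linearly combine the watermark with the +/-1 message and add it."""
    signs = 2.0 * bits - 1.0           # map {0, 1} -> {-1, +1}
    return image + strength * (W @ signs)

def decode(image):
    """Dot, Decode: inner products with the watermark, then take the sign."""
    return (W.T @ image > 0).astype(int)

image = rng.random(d)                  # toy "image" with pixels in [0, 1)
bits = rng.integers(0, 2, size=k)      # random 48-bit message
decoded = decode(embed(image, bits))
```

Because the columns of `W` are orthonormal, the inner products of the watermarked image equal the original image's projections plus `strength` times the message signs, so the sign recovers each bit whenever the strength dominates the image's own projection.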

2604.11458 2026-04-14 stat.ME stat.CO

An Empirical Comparison of Methods for Quantifying the Similarity of Categorical Datasets

Marieke Stolte, Jörg Rahnenführer, Andrea Bommert

Abstract

Quantifying the similarity of two or more datasets has widespread applications in statistics and machine learning. The method choice is, however, difficult due to the abundance of proposed methods and the lack of neutral comparison studies, especially for categorical data. Here, the most promising methods are compared concerning their ability to detect certain differences between datasets and their resource consumption. The results show that the edge count tests perform well when comparing two datasets (i.e., the two-sample case). For certain scenarios, the constrained minimum (CM) distance performs even better. For categorical data consisting of variables with five categories each, the best method depends on the type of difference between the distributions, with either the CM distance and certain graph-based tests performing best, or the classifier-based tests (C2ST). This tendency is even clearer for multiple datasets. Overall, the Friedman-Rafsky test can be recommended for two samples as a compromise of high performance, acceptable resource consumption, and computational error occurrences. For the multi-sample case, the Multi-Sample Mahalanobis Cross-Match (MMCM) test can be recommended due to its comparably good performance and low resource consumption.

2604.11393 2026-04-14 econ.EM math.ST stat.TH

Average Marginal Effects in One-Step Partially Linear Instrumental Regressions

Lucas Girard, Elia Lapenta

Comments 67 pages (body: pages 1-26; appendices: pages 26-67); 8 figures; 5 tables

Abstract

We propose a novel procedure for estimating and conducting inference on average marginal effects in partially linear instrumental regressions using Reproducing Kernel Hilbert Space methods. Our procedure relies on a single regularization parameter. We obtain the consistency and asymptotic normality of our estimator. Since the variance of the limiting distribution has a complex analytical form, we propose a Bayesian bootstrap method to conduct inference and establish its validity. Our procedure is easy to implement and exhibits good finite-sample performance in simulations. Three empirical applications illustrate its implementation on real data, showing that it yields economically meaningful results.

2604.11363 2026-04-14 math.ST stat.TH

Subordinated Wright-Fisher Priors

Nathan A. Judd, Dario Spanò

Abstract

A new class of time-dependent Dirichlet priors is introduced as a generalisation of the Wright-Fisher diffusion, allowing discontinuities in the trajectories, as well as non-Markovian memory. This class is obtained as a simple stochastic time-change (subordination), interpreted as a hyper-prior assigned to the operational time-clock of a Wright-Fisher diffusion. Explicit representations and exact sampling algorithms are obtained for prior and posterior distributions of the process and of its clock, given partially exchangeable data sampled at discrete time-points. Computability and conjugacy rely on a novel class of discrete dual processes, generalising existing results on duality and computable filters.

2604.11343 2026-04-14 cs.DL stat.ME

Which Discoveries Are Paradigm Shifting?

Sajad Ashouri, Arash Hajikhani, Ari Hyytinen, Petri Rouvinen, Arho Suominen

Journal ref
Industrial and Corporate Change, 2026
Abstract

To better align theories of paradigm-shifting discoveries with the empirics identifying them, we propose a novel measure that incorporates a discovery's impact, novelty, and tendency to break with the past into a single, coherent measure. Calibration using the National Inventor Hall of Fame data reveals that impact, novelty, and disruptiveness are strict complements, meaning, for example, that greater impact cannot substitute for moderate novelty. We illustrate the workings of the measure using data on USPTO patents from 1982 to 2015.

2604.11335 2026-04-14 math.ST stat.ME stat.TH

Trends in tail dependence of heteroscedastic extremes

John H. J. Einmahl, Chen Zhou

Abstract

We consider multivariate extreme value statistics for independent but nonidentically distributed random vectors. In particular, the data may have varying tail copulas and also heteroscedastic marginal distributions. Assuming smoothly changing tail copulas, we propose a nonparametric estimator for the integrated tail copula and establish its asymptotic behavior. Notably, the heteroscedastic marginals do not affect the limiting processes. We use the main result for the integrated tail copula to test for a constant tail copula across all observations. Finally, a simulation study shows the good finite-sample behavior of our limit theorems as well as high power of the test.

2604.11311 2026-04-14 cs.LG stat.ML

Learning Discrete Diffusion of Graphs via Free-Energy Gradient Flows

Dario Rancati, Jan Maas, Francesco Locatello

Abstract

Diffusion-based models on continuous spaces have seen substantial recent progress through the mathematical framework of gradient flows, leveraging the Wasserstein-2 (${W}_2$) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Despite the increasing popularity of diffusion models on discrete spaces using continuous-time Markov chains, a parallel theoretical framework based on gradient flows has remained elusive due to intrinsic challenges in translating the ${W}_2$ distance directly into these settings. In this work, we propose the first computational approach addressing these challenges, leveraging an appropriate metric $W_K$ on the simplex of probability distributions, which enables us to interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Through this theoretical insight, we introduce a novel methodology for learning diffusion dynamics over discrete spaces, which recovers the underlying functional directly by leveraging first-order optimality conditions for the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and only needs a numerical preprocessing computing $W_K$-geodesics. We validate our method through extensive numerical experiments on synthetic data, showing that we can recover the underlying functional for a variety of graph classes.

2604.11300 2026-04-14 math.ST stat.ME stat.TH

Detection and Mode-Identification of Multiple Change Points in Tensor Factor Models

Yuqi Zhang, Zetai Cen, Haeran Cho

Comments 165 pages

Abstract

We study the problems arising from modeling high-dimensional tensor-valued time series under a Tucker decomposition-based factor model with multiple structural change points. First, we propose an algorithm for detecting the multiple change points, which utilizes the low-rank structure of the data for statistical and computational efficiency. Also, the multi-dimensional array setting poses unique challenges, as some changes are associated with a subset of the modes, and the changes in different modes may interact with one another. Recognizing these, we investigate the problem of identifying each change with the tensor modes post-segmentation. To this end, we formalize the mode-identifiability of each change and propose an algorithm for detecting the modes at which the data are undergoing a mode-identifiable shift. We establish the consistency of both change point detection and mode-identification methods under a weak moment condition, and demonstrate their good performance on simulated datasets where, in particular, it is shown that the mode-identification step can improve the post-segmentation estimation of the mode-wise loading space. Additionally, we analyze the datasets on New York City taxi usage and Fama-French portfolio returns using the proposed suite of methods.

2604.11253 2026-04-14 stat.ML cs.LG

Trustworthy Feature Importance Avoids Unrestricted Permutations

Emanuele Borgonovo, Francesco Cappelli, Xuefei Lu, Elmar Plischke, Cynthia Rudin

Abstract

Feature importance methods using unrestricted permutations are flawed due to extrapolation errors; such errors appear in all non-trivial variable importance approaches. We propose three new approaches: conditional model reliance and Knockoffs with Gaussian transformation, and restricted ALE plot designs. Theoretical and numerical results show our strategies reduce/eliminate extrapolation.

2604.11239 2026-04-14 stat.ME

Optimized questionnaire item selection for tracking the progression of motor symptoms in Parkinson's disease

Karl Sigfrid, Ellinor Fackle-Fornius, Frank Miller

Abstract

Long questionnaires increase the response burden for patients and healthcare workers. In the treatment of Parkinson's disease, the MDS-UPDRS questionnaire to track disease progression may be underutilized due to time requirements. While reduced item sets have been studied using Fisher information from Item Response Theory (IRT) models, optimal selection methods remain unclear. We compared three methods for selecting an optimal subset of items, with the aim of minimizing the uncertainty in the estimates of the disease severity: Ranking by the Fisher information, coordinate descent local search to directly minimize estimate uncertainty, and adaptive selection. Whereas item ranking based on the expected Fisher information outperformed random choice of items, we saw further gains with the coordinate descent algorithm that directly minimizes the uncertainty of the disease severity estimate. An adaptive algorithm gave an additional slight gain compared to the coordinate descent method. However, the performance of the adaptive method is a best-case limit as we assume that we find the optimal set for the true latent trait scores. For a 5-item subset, the ranked Fisher information method reduced the expected standard deviation by 14 percent compared to random item selection. The corresponding reductions for coordinate descent and adaptive selection were 26 percent and 34 percent respectively. More sophisticated selection methods substantially improved estimate accuracy for small item sets, with diminishing returns for larger subsets. Because item parameters are retained from the full test, reduced item sets measure the same latent construct as the original test. The choice of method entails a trade-off between methodological complexity and precision.
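The simplest of the three strategies, ranking items by expected Fisher information, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a two-parameter logistic (2PL) IRT model for binary items, whereas the abstract does not specify the IRT model and MDS-UPDRS items are ordinal. It also assumes a standard normal distribution for the latent severity, and the item parameters and function names are hypothetical.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def rank_by_expected_information(a, b, k, n_grid=201):
    """Return the indices of the k items with the largest expected information.

    The expectation is taken over a standard normal latent trait
    (an assumption), approximated on a grid over [-4, 4].
    """
    theta = np.linspace(-4.0, 4.0, n_grid)
    weights = np.exp(-0.5 * theta**2)
    weights /= weights.sum()
    info = item_information(theta[:, None], a[None, :], b[None, :])
    expected = weights @ info              # one value per item
    order = np.argsort(-expected)          # descending expected information
    return order[:k], expected

# Hypothetical discrimination (a) and difficulty (b) parameters for 4 items.
a = np.array([0.5, 2.0, 1.0, 2.0])
b = np.array([0.0, 0.0, 0.0, 3.0])
top, expected = rank_by_expected_information(a, b, k=2)

# Approximate standard error of the severity estimate at theta = 0
# when only the selected items are administered.
se_at_zero = 1.0 / np.sqrt(item_information(0.0, a[top], b[top]).sum())
```

Ranking scores each item in isolation, which is why a search that directly minimizes the uncertainty of the resulting estimate, as with the coordinate descent method in the abstract, can do better.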

2604.11223 2026-04-14 stat.ML cs.AI cs.LG

Regional Explanations: Bridging Local and Global Variable Importance

Salim I. Amoukou, Nicolas J-B. Brunel

Comments Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract

We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \dots, x_p)$. Despite their widespread use, we identify fundamental limitations in their ability to reliably detect locally important features, even under ideal conditions with exact computations and independent features. We argue that a sound local attribution method should not assign importance to features that neither influence the model output (e.g., features with zero coefficients in a linear model) nor exhibit statistical dependence with functionality-relevant features. We demonstrate that both Local SV and LIME violate this fundamental principle. To address this, we propose R-LOCO (Regional Leave Out COvariates), which bridges the gap between local and global explanations and provides more accurate attributions. R-LOCO segments the input space into regions with similar feature importance characteristics. It then applies global attribution methods within these regions, deriving an instance's feature contributions from its regional membership. This approach delivers more faithful local attributions while avoiding local explanation instability and preserving instance-specific detail often lost in global methods.

2604.11200 2026-04-14 cs.LG cs.AI stat.ML

ShapShift: Explaining Model Prediction Shifts with Subgroup Conditional Shapley Values

Tom Bewley, Salim I. Amoukou, Emanuele Albini, Saumitra Mishra, Manuela Veloso

Abstract

Changes in input distribution can induce shifts in the average predictions of machine learning models. Such prediction shifts may impact downstream business outcomes (e.g. a bank's loan approval rate), so understanding their causes can be crucial. We propose ShapShift: a Shapley value method for attributing prediction shifts to changes in the conditional probabilities of interpretable subgroups of data, where these subgroups are defined by the structure of decision trees. We initially apply this method to single decision trees, providing exact explanations based on conditional probability changes at split nodes. Next, we extend it to tree ensembles by selecting the most explanatory tree and accounting for residual effects. Finally, we propose a model-agnostic variant using surrogate trees grown with a novel objective function, allowing application to models like neural networks. While exact computation can be intensive, approximation techniques enable practical application. We show that ShapShift provides simple, faithful, and near-complete explanations of prediction shifts across model classes, aiding model monitoring in dynamic environments.

2604.11199 2026-04-14 stat.CO math.PR

Extended One-Liners for the Beta, Gamma, and Dirichlet Distributions with Shape Parameters Below One

Dylan Greaves

Comments 8 pages, 1 figure, 1 table

Abstract

We present an explicit deterministic transformation of a fixed number of i.i.d. uniform random variables with exact Beta$(a,1-a)$ law for $0<a<1$, using only elementary operations (an "extended one-liner", see Devroye (1996)). As corollaries, the families Beta$(a,b)$ with $\min(a,b)<1$, Gamma$(c)$ with $c<1$, and Dirichlet$(α_1,\dots,α_d)$ with $0<α_i<1$, for fixed $d$, also have extended one-liners.

2604.11168 2026-04-14 stat.ME

Prediction decomposition for causal analysis

Ofir Reich

Comments 22 pages, 7 figures

Abstract

There is rising interest in using Machine Learning (ML) model predictions as outcomes in causal analysis. However, these methods have faced challenges in finding the true treatment effects. It is also challenging to make choices about which prediction models to choose, since we are interested not only in the accuracy of the prediction but in its ability to produce the correct causal effect in the analysis. In this paper I propose a decomposition of the prediction into between-unit prediction ($η_μ$), within-unit-across-time prediction ($η_ε$), and counterfactual-treatment-effect prediction ($η_T$). I show that the counterfactual-treatment-effect component is the one that determines whether the model recovers the true treatment effect, but only the first two components can be estimated from non-experimental data. I argue that within-unit-across-time prediction accuracy ($η_ε$) is a structurally better proxy for the counterfactual-treatment-effect component ($η_T$) than overall prediction accuracy, and propose a metric to estimate it from panel data with at least two time periods. This metric serves as a diagnostic and model-selection tool for choosing ML models for causal analysis. Under the stronger assumption that $η_T \approx η_ε$, it also enables constructing an approximately unbiased estimate of the treatment effect. I develop the theoretical framework and illustrate it with simulations of synthetic data.

2604.11151 2026-04-14 cs.LG stat.ML

Gradient-Variation Regret Bounds for Unconstrained Online Learning

Yuheng Zhao, Andrew Jacobsen, Nicolò Cesa-Bianchi, Peng Zhao

Abstract

We develop parameter-free algorithms for unconstrained online learning with regret guarantees that scale with the gradient variation $V_T(u) = \sum_{t=2}^T \|\nabla f_t(u)-\nabla f_{t-1}(u)\|^2$. For $L$-smooth convex losses, we provide fully-adaptive algorithms achieving regret of order $\widetilde{O}(\|u\|\sqrt{V_T(u)} + L\|u\|^2+G^4)$ without requiring prior knowledge of the comparator norm $\|u\|$, the Lipschitz constant $G$, or the smoothness $L$. The update in each round can be computed efficiently via a closed-form expression. Our results extend to dynamic regret and have immediate implications for the stochastically-extended adversarial (SEA) model, where they significantly improve upon the previous best-known result [Wang et al., 2025].
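
To make the quantity $V_T(u)$ concrete, here is a minimal sketch that evaluates it for a short sequence of losses. It computes only the gradient variation, not the paper's algorithm; the quadratic losses and the names `gradient_variation`, `centers`, and `grads` are hypothetical choices for illustration.

```python
def gradient_variation(grads, u):
    """Return V_T(u) = sum_{t=2}^T ||g_t(u) - g_{t-1}(u)||^2.

    `grads` is a list of callables; grads[t](u) returns the gradient of
    the t-th loss at the point u as a list of floats.
    """
    values = [g(u) for g in grads]
    total = 0.0
    # Pair each gradient with its predecessor and sum squared differences.
    for prev, curr in zip(values, values[1:]):
        total += sum((c - p) ** 2 for c, p in zip(curr, prev))
    return total


# Hypothetical losses f_t(x) = 0.5 * ||x - c_t||^2, whose gradient at u is u - c_t.
centers = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
grads = [lambda u, c=c: [ui - ci for ui, ci in zip(u, c)] for c in centers]

v = gradient_variation(grads, [5.0, -3.0])  # 1 + 4 = 5.0 for these centers
```

For these particular losses the consecutive gradient differences equal $c_{t-1} - c_t$, so $V_T(u)$ is the same at every comparator $u$; in general it depends on $u$.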

2604.11127 2026-04-14 math.ST stat.TH

Empirical interpretation of the Pitman efficiency

Tadeusz Inglot

Abstract

We study an empirical interpretation of the Pitman efficiency in testing for uniformity in the two-parameter family of beta distributions. We show that for contamination models the Pitman efficiency approximates relative efficiency very well.

2604.11118 2026-04-14 cs.LG stat.ML

Distributionally Robust K-Means Clustering

Vikrant Malik, Taylan Kargin, Babak Hassibi

Abstract

K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd--Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.

2604.10310 2026-04-14 math.PR math.ST stat.OT stat.TH

Weak convergence from projected laws on a positive-measure set of directions

Alejandro Cholaquidis, Manuel Hernandez Banadik

Abstract

The Cramér-Wold device characterises weak convergence of probability measures on $\mathbb{R}^d$ through convergence of all one-dimensional projected laws. We prove that, if the target projected laws are moment-determinate for surface-almost every direction, then weak convergence already follows from projected convergence on a positive-measure set of directions. This yields a simple probabilistic interpretation: if one samples a direction at random from any distribution on the sphere that is absolutely continuous with respect to surface measure, then, with probability one, convergence of the projected law along the sampled direction already forces global weak convergence under the same moment-determinacy assumption.

2604.03775 2026-04-14 cond-mat.stat-mech stat.ML

Cross-Spectral Witness for Hidden Nonequilibrium Beyond the Scalar Ceiling

Yuda Bi, Vince D Calhoun

Abstract

Partial observation is a pervasive obstacle in nonequilibrium physics: coarse graining may absorb hidden forcing into an apparently equilibrium-like reduced description, so a driven system can look reversible through the only variables one can measure. For scalar Gaussian observables of linear stochastic systems, no time-irreversibility statistic can detect the underlying drive. The Lucente--Crisanti ceiling constrains what one channel carries; what two channels carry is a different question, with a sharp closed-form answer. Two simultaneously observed channels retain an off-diagonal cross-spectral sector inaccessible to any scalar reduction; under channel-separable multiplicative structure the observed-channel response factors cancel identically, leaving a closed-form cross-spectral witness controlled only by the hidden spectrum, the loadings, and the innovation scales, strictly positive at every nonzero cross-coupling including at exact timescale coalescence where every scalar reduction is blind. Within general CSM this certifies shared hidden-sector drive; under the additional one-way coupling assumption the witness identifies the total entropy production rate at leading order with a square-root scaling.

2604.02150 2026-04-14 math.NA cs.NA math.PR math.ST stat.ML stat.TH

Samplet limits and multiwavelets

Gianluca Giacchi, Michael Multerer, Jacopo Quizi

Abstract

Samplets are data-adapted multiresolution analyses of localized discrete signed measures. They can be constructed on scattered data sites in arbitrary dimension such that they exhibit vanishing moments with respect to any prescribed set of primitives. We consider the samplet construction in a probabilistic framework and show that, if polynomials are chosen as primitives, the resulting samplet basis converges to signed measures with broken polynomial densities in the infinite data limit. These densities amount to multiwavelets with respect to a hierarchical partition of the region containing the data sites. As a byproduct, we therefore obtain a construction of general multiwavelets that allows for a flexible prescription of vanishing moments going beyond tensor product constructions. For congruent partitions, in particular, we recover classical multiwavelets with scale- and partition-independent filter coefficients. The theoretical findings are complemented by numerical experiments that illustrate the convergence results for random as well as low-discrepancy data sites.

2603.22962 2026-04-14 cs.LG stat.ML

Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data

Anand Jerry George, Nicolas Macris

Comments The proof of Lemma 1 in Appendix C is incorrect

Abstract

We study the theoretical behavior of denoising score matching--the learning task associated with diffusion models--when the data distribution is supported on a low-dimensional manifold and the score is parameterized using a random feature neural network. We derive asymptotically exact expressions for the test, train, and score errors in the high-dimensional limit. Our analysis reveals that, for linear manifolds, the sample complexity required to learn the score function scales linearly with the intrinsic dimension of the manifold, rather than with the ambient dimension. Perhaps surprisingly, the benefits of low-dimensional structure start to diminish once we have a non-linear manifold. These results indicate that diffusion models can benefit from structured data; however, the dependence on the specific type of structure is subtle and intricate.

2603.15928 2026-04-14 stat.AP

Prior-Data Fitted Networks for Causal Inference: a Simulation Study with Real-World Scenarios

Francisco Mourao, David Hajage, Daria Bystrova, Bertrand Bouvarel, Nathanaël Lapidus, Fabrice Carrat, Benjamin Glemain

Comments 26 pages, 4 tables, 3 figures

Abstract

Prior-Data Fitted Networks (PFNs) represent a paradigm shift in tabular data prediction. We present the principles of this new paradigm and evaluate two PFNs for estimating the average treatment effect (ATE) of a binary treatment on a binary outcome, using simulated clinical scenarios based on real-world data. We assessed TabPFN combined with causal inference procedures (g-computation and inverse probability of treatment weighting), and CausalPFN, a PFN that directly provides an ATE estimate with a credible interval. Confidence intervals for the TabPFN-based methods were derived using bootstrap resampling. We found that computation times for TabPFN were prohibitive for routine causal inference, particularly because of the need for bootstrapping to yield confidence intervals. Moreover, g-computation with TabPFN produced a highly biased estimator, partially corrected by fitting separate models for each treatment group (T-learner). CausalPFN, by contrast, was computationally efficient but exhibited poor coverage of its 95% credible interval for the ATE, due to both estimation bias and inadequate uncertainty quantification. Beyond automating model specification, some PFN variants - like CausalPFN - attempt to automate causal modeling. In the settings we evaluated, CausalPFN performed poorly. However, new algorithms of this kind continue to be developed, and their application to causal inference tasks requires further investigation.

2603.14305 2026-04-14 astro-ph.HE astro-ph.GA cond-mat.stat-mech stat.AP

Reconnection-driven State Transitions in Flat Spectrum Radio Quasars

Agniva Roychowdhury

Comments 17 pages, 10 figures; accepted for publication in The Astrophysical Journal

Abstract

We extend the work of Roychowdhury (2026) on skewness variations of the logarithmic flux, driven by large GeV flares in FSRQs, to a sample of 18 FSRQs. We find that they can be categorized into three groups: one where the skewness attains a persistently lower value after a large flare, one where it increases, and one where the change in skewness is not significant. To provide a theoretical ground for these results, we use the statistical plasmoid model of Fermo et al. (2010) that self-consistently produces large plasmoids through merging which, when they gain energy from the reconnection event and are Doppler aligned, produce large flares. We find that a downsampling of our simulation of 1500 runs to 18 statistically reproduces the observed distribution in p-values for change in skewness. We further compute the ensemble Shannon entropy of the system and the skewness, where the entropy is found to decrease at a $3σ$ level in both the groups where skewness either increases or decreases, as direct evidence of an increase in order in the system caused by a flare. We find that the power spectral densities of the simulated light curves are broken power laws, resembling a white noise+red noise broken by the typical cooling timescale in our system, in accordance with known blazar variability. We find that our results are robust to a $200-300\%$ change in several fiducial parameters of the simulation. Our stochastic simulation of plasmoids inside a blazar jet is consistent with key observable statistical properties of blazar GeV light curves.

2601.21860 2026-04-14 math.OC stat.ML

Pathwise Learning of Stochastic Dynamical Systems with Partial Observations

Nicole Tianjiao Yang

Abstract

The reconstruction and inference of stochastic dynamical systems from data is a fundamental task in inverse problems and statistical learning. While surrogate modeling advances computational methods to approximate these dynamics, standard approaches typically require high-fidelity training data. In many practical settings, the data are indirectly observed through noisy and nonlinear measurements. The challenge lies not only in approximating the coefficients of the SDEs, but also in simultaneously inferring the posterior updates given the observations. In this work, we present a neural path estimation approach to solve stochastic dynamical systems based on variational inference. We first derive a stochastic control problem that solves for the filtering posterior path measure corresponding to a pathwise Zakai equation. We then construct a generative model that maps the prior path measure to the posterior measure through the controlled diffusion and the associated Radon-Nikodym derivative. Through an amortization of sample paths of the observation process, the control is learned through the noisy observation paths and we learn an associated SDE which induces the filtering path measure. Finally, we demonstrate the model's performance on various nonlinear stochastic systems, showcasing its ability to handle multimodal data distributions, chaotic dynamics, and sparse observation data.

2512.19691 2026-04-14 cs.AI stat.AP

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

Comments Github codebase: https://github.com/junzeye/validate-medcalc-labels

Abstract

Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance, and develop a scalable physician-in-the-loop stewardship pipeline to reassess them. At least 27% of test labels are likely erroneous or incomputable. On a 50-instance subset validated by physicians, our recomputed labels agree with physician ground truth 74% of the time (95% CI, 60-84%) versus 20% for the originals (95% CI, 11-33%). Using original labels to evaluate frontier LLMs underestimates accuracy by 16-23 percentage points. In a controlled reinforcement-learning experiment, a model trained on recomputed labels outperforms one trained on originals by 13.5 percentage points (95% CI, 10.6-16.6%) on physician-labeled instances, and this advantage extends to related medical tasks. LLM-assisted benchmarks can propagate systematic errors into both evaluation and post-training unless actively stewarded.
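
The abstract does not say which interval method produced the reported confidence intervals. As a worked check, the sketch below computes Wilson score intervals under the assumption that 74% and 20% of the 50-instance subset correspond to 37 and 10 agreeing labels; both the counts and the choice of the Wilson method are assumptions, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 37/50 for the recomputed labels, 10/50 for the originals.
recomputed = wilson_interval(37, 50)
original = wilson_interval(10, 50)

recomputed_pct = [round(100 * x) for x in recomputed]  # [60, 84]
original_pct = [round(100 * x) for x in original]      # [11, 33]
```

Under these assumptions the rounded bounds coincide with the 60-84% and 11-33% intervals quoted above, which is consistent with, but does not prove, a Wilson-type interval having been used.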

2511.15068 2026-04-14 stat.ME

Classification Trees with Valid Inference via the Exponential Mechanism

Soham Bakshi, Snigdha Panigrahi

Abstract

Decision trees are widely used for non-linear modeling, as they capture interactions between predictors while producing inherently interpretable models. Despite their popularity, performing inference on the non-linear fit remains largely unaddressed. This paper focuses on classification trees and makes two key contributions. First, we introduce a novel tree-fitting method that replaces the greedy splitting of the predictor space in standard tree algorithms with a probabilistic approach. Each split in our approach is selected according to sampling probabilities defined by an exponential mechanism, with a temperature parameter controlling its deviation from the deterministic choice given data. Second, while our approach can fit a tree that with high probability coincides with the fit produced by standard tree algorithms at low temperatures, it is not merely predictive; unlike standard algorithms, it enables inference by taking into account the highly adaptive tree structure. Our method produces pivots directly from the sampling probabilities in the exponential mechanism. In theory, our pivots allow asymptotically valid inference on the parameters in the predictive fit, and in practice, our method delivers powerful inference without sacrificing predictive accuracy, in contrast to data splitting methods.
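
The split-selection idea can be sketched as follows: each candidate split is drawn with probability proportional to the exponential of its score divided by a temperature. This is a generic exponential-mechanism sketch, not the authors' implementation; the score used here (a hypothetical impurity `gain` per candidate split), its scaling, and the function names are assumptions, and the pivot construction for inference is not shown.

```python
import math
import random


def split_probabilities(gains, temperature):
    """Sampling probabilities proportional to exp(gain / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(gains)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((g - top) / temperature) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]


def sample_split(gains, temperature, rng):
    """Draw the index of one candidate split from the exponential mechanism."""
    probs = split_probabilities(gains, temperature)
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]


gains = [0.10, 0.35, 0.30]              # hypothetical gains for three candidate splits
cold = split_probabilities(gains, 0.01)  # concentrates on the greedy choice (index 1)
warm = split_probabilities(gains, 10.0)  # close to uniform over candidates
chosen = sample_split(gains, 0.01, random.Random(0))
```

At low temperature the draw almost always agrees with the greedy split, which matches the abstract's remark that the fitted tree coincides with the standard one with high probability; higher temperatures trade that agreement for more randomization.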

2510.04358 2026-04-14 physics.ao-ph stat.AP stat.ML

Score-based generative emulation of impact-relevant Earth system model outputs

Shahine Bouabid, Andre Nogueira Souza, Raffaele Ferrari

Abstract

Policy targets evolve faster than the Coupled Model Intercomparison Project cycles, complicating adaptation and mitigation planning that must often contend with outdated projections. Climate model output emulators address this gap by offering inexpensive surrogates that can rapidly explore alternative futures while staying close to Earth System Model (ESM) behavior. The focus is on emulators designed to provide inputs to impact models. Using monthly ESM fields of near-surface temperature, precipitation, relative humidity, and wind speed, it is shown that deep generative models have the potential to model the joint distribution of variables relevant for impacts. The specific model proposed uses score-based diffusion on a spherical mesh and runs on a single mid-range graphics processing unit. A thorough suite of diagnostics is introduced to compare emulator outputs with their parent ESMs, including their probability densities, cross-variable correlations, time of emergence, and tail behavior. The emulator performance is evaluated across three distinct ESMs in both pre-industrial and forced regimes. The results show that the emulator produces distributions that closely match the ESM outputs and captures key forced responses. They also reveal important failure cases, notably for variables with a strong regime shift in the seasonal cycle. Although not a perfect match to the ESM, the inaccuracies of the emulator are small relative to the magnitude of internal variability in ESM projections. This suggests that the generative emulators can be useful in supporting impact assessment. Priorities for future development toward daily resolution, finer spatial scales, and bias-aware training are discussed. Code is made available at https://github.com/shahineb/climemu.

2509.22736 2026-04-14 eess.IV cs.AI cs.CV cs.LG physics.med-ph stat.ML

PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems

Merve Gülle, Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya

Comments IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Diffusion models have found extensive use in solving inverse problems, by sampling from an approximate posterior distribution of data given the measurements. Recently, consistency models (CMs) have been proposed to directly predict the final output from any point on the diffusion ODE trajectory, enabling high-quality sampling in just a few neural function evaluations (NFEs). CMs have also been utilized for inverse problems, but existing CM-based solvers either require additional task-specific training or utilize data fidelity operations with slow convergence, limiting their applicability to large-scale problems and making them difficult to extend to nonlinear settings. In this work, we reinterpret CMs as proximal operators of a prior, enabling their integration into plug-and-play (PnP) frameworks. Specifically, we propose PnP-CM, an ADMM-based PnP solver that provides a unified framework for solving a wide range of inverse problems, and incorporates noise perturbations and momentum-based updates to improve performance in the low-NFE regime. We evaluate our approach on a diverse set of linear and nonlinear inverse problems. We also train and apply CMs to MRI data for the first time. Our results show that PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, and produces meaningful results in 2 steps, highlighting its effectiveness in real-world inverse problems while outperforming existing CM-based approaches.

2509.19889 2026-04-14 stat.ME

Improving Disease Risk Estimation in Small Areas by Accounting for Spatiotemporal Local Discontinuities

G. Santafé, A. Adin, M. D. Ugarte

Journal ref
Journal of Computational and Graphical Statistics (2026) 1-18
Abstract

This work proposes a two-step method to enhance disease risk estimation in small areas by integrating spatiotemporal cluster detection within a Bayesian hierarchical spatiotemporal model. First, we introduce an efficient scan-statistic-based clustering algorithm that employs a greedy search within the scan window, enabling flexible cluster detection across large spatial domains. We then integrate these detected clusters into a Bayesian spatiotemporal model to estimate relative risks, explicitly accounting for identified risk discontinuities. We apply this methodology to large-scale cancer mortality data at the municipality level across continental Spain. Our results show our method offers superior cluster detection accuracy compared to SaTScan. Furthermore, integrating cluster information into a Bayesian spatiotemporal model significantly improves model fit and risk estimate performance, as evidenced by better DIC, WAIC, and logarithmic scores than SaTScan-based or standard BYM2 models. This methodology provides a powerful tool for epidemiological analysis, offering a more precise identification of high- and low-risk areas and enhancing the accuracy of risk estimation models.

2509.15359 2026-04-14 stat.ME

Bayesian Mixture Models for Heterogeneous Extremes

Viviana Carcaiso, Miguel de Carvalho, Ilaria Prosdocimi, Isadora Antoniano-Villalobos

Comments Paper updated based on reviewers' comments

Abstract

The conventional use of the Generalized Extreme Value (GEV) distribution to model block maxima may be inappropriate when extremes are actually structured into multiple heterogeneous groups. In this work, we propose a novel approach for describing the behavior of extreme values in the presence of such heterogeneity. Rather than defaulting to the GEV distribution simply because it arises as a theoretical limit, we show that alternative block maxima-based models can also align with the extremal types theorem while providing improved flexibility in practice. Our formulation leads us to a mixture model that has a Bayesian nonparametric interpretation as a Dirichlet process mixture of GEV distributions. The use of an infinite number of components enables the characterization of every possible block behavior, while at the same time capturing similarities between observations based on their extremal behavior. By employing a Dirichlet process prior on the mixing measure, we can capture the complex structure of the data without the need to pre-specify the number of mixture components. The application of the proposed model is illustrated using both simulated and real-world data.

2506.04082 2026-04-14 stat.CO stat.AP stat.ME

Adaptive tuning of Hamiltonian Monte Carlo methods

Elena Akhmatskaya, Lorenzo Nagar, Jose Antonio Carrillo, Leonardo Gavira Balmacz, Hristo Inouzhe, Martín Parga Pazos, María Xosé Rodríguez Álvarez

Abstract

With the recently increased interest in probabilistic models, the efficiency of an underlying sampler becomes a crucial consideration. Hamiltonian Monte Carlo (HMC) is one popular option for models of this kind. Performance of the method, however, strongly relies on a choice of parameters associated with an integration for Hamiltonian equations. To date, such a choice remains mainly heuristic or introduces time complexity. We propose a novel computationally inexpensive and flexible approach (we call it Adaptive Tuning or ATune) that, by combining a theoretical analysis of the multivariate Gaussian model with simulation data generated during a burn-in stage of an HMC simulation, detects a system-specific splitting integrator with a set of reliable sampler hyperparameters, including their credible randomization intervals, to be readily used in a production simulation. The method automatically eliminates those values of simulation parameters which could cause undesired extreme scenarios, such as resonance artifacts, low accuracy or poor sampling. The new approach is implemented in the in-house software package HaiCS, with no computational overheads introduced in a production simulation, and can be easily incorporated in any package for Bayesian inference with HMC. The tests on popular statistical models reveal the superiority of adaptively tuned standard and generalized HMC methods in terms of stability, performance and accuracy over conventional HMC tuned heuristically and coupled with the well-established integrators. We also claim that the generalized HMC is preferable for achieving high sampling performance. The efficiency of the new methodology is assessed in comparison with state-of-the-art samplers, e.g. NUTS, in real-world applications, such as endocrine therapy resistance in cancer, modeling of cell-cell adhesion dynamics and influenza A epidemic outbreak.

2503.22924 2026-04-14 stat.ME

Asymptotic Standard Errors for Reliability Coefficients in Item Response Theory

Youjin Sung, Yang Liu

Abstract

In a recent review, Liu, Pek, & Maydeu-Olivares (2025b) classified reliability coefficients into two types: classical test theory (CTT) reliability and proportional reduction in mean squared error (PRMSE). This article focuses on quantifying the sampling variability of these coefficients under item response theory (IRT) models. While some existing standard error (SE) formulas are accurate when variability arises only from item parameter estimation, the reliability estimators considered in our work involve additional variability from substituting population moments with sample moments. We propose a general strategy to derive SEs that incorporates both sources of sampling error simultaneously, enabling the estimation of model-based reliability coefficients and their SEs in such settings. We then apply our general theory to derive SEs for two specific estimators under the graded response model: (1) CTT reliability for the expected a posteriori score of the latent variable and (2) PRMSE for the latent variable. Simulation results show that the derived SEs accurately capture the sampling variability across various test lengths in moderate to large samples. We conclude with an empirical illustration and directions for future research.

2411.05869 2026-04-14 stat.ML cs.LG stat.AP stat.CO stat.ME

Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data

Mark D. Risser, Marcus M. Noack, Hengrui Luo, Ronald Pandolfi

Abstract

The Gaussian process (GP) is a widely used probabilistic machine learning method with implicit uncertainty characterization for stochastic function approximation, stochastic modeling, and analyzing real-world measurements of nonlinear processes. Traditional implementations of GPs involve stationary kernels (also termed covariance functions) that limit their flexibility, and exact methods for inference that prevent application to data sets with more than about ten thousand points. Modern approaches to address stationarity assumptions generally fail to accommodate large data sets, while all attempts to address scalability focus on approximating the Gaussian likelihood, which can involve subjectivity and lead to inaccuracies. In this work, we explicitly derive an alternative kernel that can discover and encode both sparsity and nonstationarity. We embed the kernel within a fully Bayesian GP model and leverage high-performance computing resources to enable the analysis of massive data sets. We demonstrate the favorable performance of our novel kernel relative to existing exact and approximate GP methods across a variety of synthetic data examples. Furthermore, we conduct space-time prediction based on more than one million measurements of daily maximum temperature and verify that our results outperform state-of-the-art methods in the Earth sciences. More broadly, having access to exact GPs that use ultra-scalable, sparsity-discovering, nonstationary kernels allows GP methods to truly compete with a wide variety of machine learning methods.

2410.12618 2026-04-14 stat.AP

Spatio-Temporal Analysis of Public Transportation Undercrowding: Leveraging APC Data for a Comprehensive Evaluation of Usage Rates

Arianna Burzacchi, Valeria Maria Urbano, Marika Arena, Giovanni Azzone, Piercesare Secchi, Simone Vantini

Comments Pre-print version

Journal ref
Public Transport 2026
Abstract

The analysis of the transportation usage rate provides opportunities for evaluating the efficacy of the transportation service offered by proposing an indicator that integrates actual demand and capacity. This study aims to develop a methodology for analyzing the occupancy rate from large-scale datasets to identify gaps between supply and demand in public transportation. Leveraging the spatio-temporal granularity of data from Automatic People Counting (APC) and relying on the Generalized Linear Mixed Effects Model and the Generalized Mixed-Effect Random Forest, in this study we propose a methodology for analyzing factors determining undercrowding. The results of the model are examined at both the segment and ride levels. Initially, the analysis focuses on identifying segments more likely associated with undercrowding, understanding factors influencing the probability of undercrowding, and exploring their relationships. Subsequently, the analysis extends to the temporal distribution of undercrowding, encompassing its impact on the entire journey. The proposed methodology is applied to analyze APC data, provided by the company responsible for public transport management in Milan, on a radial route of the surface transportation network.

2409.06565 2026-04-14 math.PR math.FA math.ST q-bio.QM stat.ME stat.TH

Statistical inference for a multiscale stochastic model of enzyme kinetics via propagation of chaos

Arnab Ganguly, Wasiur R. KhudaBukhsh

Comments Removed functional central limit theorem and added new results to the parameter inference section

Abstract

We study a class of Stochastic Differential Equations (SDEs) with jumps modeling multistage Michaelis--Menten enzyme kinetics, in which a substrate is sequentially transformed into a product via a cascade of intermediate complexes. These networks are typically high-dimensional and exhibit multiscale behavior with a strong coupling between different components, posing substantial analytical and computational challenges. In particular, the problem of statistical inference of reaction rates is significantly difficult and becomes even more intricate when direct observations of system states are unavailable and only a random sample of product formation times is observed. We address this problem in two stages. First, in a suitable scaling regime consistent with the Quasi-Steady State Approximation (QSSA), we rigorously establish a stochastic averaging principle yielding a reduced model for the product-substrate dynamics. Guided by the reduced-order dynamics, we next construct a novel Interacting Particle System (IPS) that approximates the product-substrate process at the particle level. This IPS plays a pivotal role in the inference methodology, and we prove several non-asymptotic bounds and limiting results for this system. These results facilitate the construction of an estimator based on a product-form approximate likelihood requiring only a random sample of product formation times. This approach does not need access to the system states, and we rigorously prove consistency of the estimator.

2409.06406 2026-04-14 stat.AP

Monitoring road infrastructures from satellite images in Greater Maputo

Arianna Burzacchi, Matteo Landrò, Simone Vantini

Comments Pre-print version of the published manuscript available at Statistical Methods & Applications (2024)

Journal ref
Statistical Methods & Applications 2025
Abstract

The information about pavement surface type is rarely available in road network databases of developing countries although it represents a cornerstone of the design of efficient mobility systems. This research develops an automatic classification pipeline for road pavement which makes use of satellite images to recognize road segments as paved or unpaved. The proposed methodology is based on an object-oriented approach, so that each road is classified by looking at the distribution of its pixels in the RGB space. The proposed approach is proven to be accurate, inexpensive, and readily replicable in other cities.

2408.16004 2026-04-14 stat.AP

Granger causal inference for climate change attribution

Mark D. Risser, Mohammed Ombadi, Michael F. Wehner

Abstract

Climate change detection and attribution (D&A) is concerned with determining the extent to which anthropogenic activities have influenced specific aspects of the global climate system. D&A fits within the broader field of causal inference, the collection of statistical methods that identify cause and effect relationships. There are a wide variety of methods for making attribution statements, each of which requires different types of input data and each of which is conditional to varying extents. Some methods are based on Pearl causality (experimental interference) while others leverage Granger (predictive) causality, and the causal framing provides important context for how the resulting attribution conclusion should be interpreted. However, while Granger-causal attribution analyses have become more common, there is no clear statement of their strengths and weaknesses and no clear consensus on where and when Granger-causal perspectives are appropriate. In this prospective paper, we provide a formal definition for Granger-based approaches to trend and event attribution and a clear comparison with more traditional methods for assessing the human influence on extreme weather and climate events. Broadly speaking, Granger-causal attribution statements can be constructed quickly from observations and do not require computationally intensive dynamical experiments. These analyses also enable rapid attribution, which is useful in the aftermath of a severe weather event, and provide multiple lines of evidence for anthropogenic climate change when paired with Pearl-causal attribution. Confidence in attribution statements is increased when different methodologies arrive at similar conclusions. Moving forward, we encourage the D&A community to embrace hybrid approaches to climate change attribution that leverage the strengths of both Granger and Pearl causality.

2408.13751 2026-04-14 stat.ML cs.LG math.OC

Improved identification of breakpoints in piecewise regression and its applications

Taehyeong Kim, Hyungu Lee, Myungjin Kim, Hayoung Choi

Comments 32 pages, 6 figures

Details
English abstract

Identifying breakpoints in piecewise regression is critical to enhancing the reliability and interpretability of data fitting. In this paper, we propose novel greedy-based algorithms to accurately and efficiently identify breakpoints in piecewise polynomial regression. The algorithm updates the breakpoints to minimize the error by exploring the neighborhood of each breakpoint. It has a fast convergence rate and finds optimal breakpoints stably. Moreover, it can determine the optimal number of breakpoints. The computational results for real and synthetic data show that its accuracy is better than that of existing methods. Experiments on real-world datasets demonstrate that the breakpoints found by the proposed algorithm provide valuable information about the data.
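
The neighborhood-exploration idea can be sketched as a simple local search. This is an illustrative assumption about how such an update might look, not the authors' algorithm: breakpoints are indices into sorted data, each segment is fitted independently by least squares, and a breakpoint moves to a nearby index only when the total squared error strictly decreases. The names `greedy_breakpoints`, `radius`, and `min_size` are hypothetical.

```python
import numpy as np


def segment_sse(x, y, start, stop, degree=1):
    """Squared error of a least-squares polynomial fit on x[start:stop]."""
    xs, ys = x[start:stop], y[start:stop]
    design = np.vander(xs, degree + 1)
    coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
    resid = ys - design @ coef
    return float(resid @ resid)


def total_sse(x, y, breaks, degree=1):
    """Sum of per-segment errors for breakpoints given as indices."""
    edges = [0, *breaks, len(x)]
    return sum(segment_sse(x, y, a, b, degree) for a, b in zip(edges[:-1], edges[1:]))


def greedy_breakpoints(x, y, n_breaks, degree=1, radius=3, min_size=3):
    """Local search: move one breakpoint at a time within +/- radius indices."""
    n = len(x)
    breaks = [round((i + 1) * n / (n_breaks + 1)) for i in range(n_breaks)]
    best = total_sse(x, y, breaks, degree)
    improved = True
    while improved:
        improved = False
        for i in range(n_breaks):
            # Keep at least min_size points in the two adjacent segments.
            lo = (breaks[i - 1] if i > 0 else 0) + min_size
            hi = (breaks[i + 1] if i < n_breaks - 1 else n) - min_size
            start = max(lo, breaks[i] - radius)
            stop = min(hi, breaks[i] + radius)
            for cand in range(start, stop + 1):
                trial = breaks[:i] + [cand] + breaks[i + 1:]
                err = total_sse(x, y, trial, degree)
                if err < best - 1e-12:  # accept strict improvements only
                    best, breaks, improved = err, trial, True
    return breaks, best


# Noise-free toy data (hypothetical): the slope changes at index 40.
x = np.arange(60, dtype=float)
y = np.where(x < 40, 1.0 + 0.5 * x, 28.6 - 0.2 * x)
breaks, sse = greedy_breakpoints(x, y, n_breaks=1)
print(breaks, sse)
```

Because every accepted move strictly lowers the error and there are finitely many configurations, the loop terminates. Choosing the number of breakpoints, which the abstract also mentions, is outside this sketch.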

2407.19191 2026-04-14 math.ST stat.ME stat.TH

Statistical inference for subgraph counts and clustering coefficient using network sampling in a sparse Stochastic Block Model framework

Anirban Mandal, Arindam Chatterjee

Comments 120 pages, 3 figures. Major revisions have been made, and new results have been added

Details
English abstract

This article develops limit laws for network sampling based estimates of subgraph counts and clustering coefficient of a large population network, and uses them for predictive inference. A model based approach is used, where the population network is assumed to be generated from a sparse Stochastic Block Model (SBM). To quantify the effects of node sampling under resource constraints, a sparse Bernoulli node sampling scheme is introduced, where the node selection probability decays to zero as the population size increases. Both induced and ego-centric network formation approaches are explored. Quantitative bounds on the speed of normal approximation for estimated subgraph counts are obtained in a joint model and design based asymptotic framework. These bounds show that inference accuracy depends on model sparsity, sampling sparsity, and features like edge density and minimum vertex cover size of the target subgraph. We find that the ego-centric approach can handle higher sparsity levels in both the model and sampling scheme, compared to the induced approach. We also show that if model sparsity remains below a threshold, inference quality is unaffected; beyond it, the quality degrades rapidly. The sufficient conditions for obtaining a Gaussian limit law also turn out to be necessary. For strictly balanced target subgraphs, we obtain sharp transitions from Gaussian to Poisson based limit laws, as sparsity levels increase. A complete description of limit laws for estimated subgraph counts is given for the induced case, with a near-complete one for the ego-centric case. These results also yield Gaussian and Poisson limit laws for the estimated clustering coefficient. Simulations support the theory across sparsity levels, and the proposed methodology is applied to a real data set.
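
As a rough illustration of the induced-subgraph setting, the sketch below draws a Bernoulli node sample, counts triangles among the sampled nodes, and rescales by the probability that all three vertices of a triangle are retained. This is the standard inverse-probability scaling and is an assumption here; the estimators analysed in the article may differ, and all function names are hypothetical.

```python
import itertools
import random


def bernoulli_node_sample(nodes, p, rng):
    """Keep each node independently with probability p."""
    return {v for v in nodes if rng.random() < p}


def induced_triangle_count(edges, sampled):
    """Count triangles in the subgraph induced by the sampled nodes."""
    adj = {v: set() for v in sampled}
    for u, v in edges:
        if u in sampled and v in sampled:
            adj[u].add(v)
            adj[v].add(u)
    count = 0
    for u, v, w in itertools.combinations(sorted(sampled), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            count += 1
    return count


def estimate_triangles(edges, nodes, p, rng):
    """Scale the sampled count by 1 / p**3 (all three vertices must be kept)."""
    sampled = bernoulli_node_sample(nodes, p, rng)
    return induced_triangle_count(edges, sampled) / p**3


# Toy population (hypothetical): the complete graph on 5 nodes has 10 triangles.
nodes = list(range(5))
edges = list(itertools.combinations(nodes, 2))
full_count = estimate_triangles(edges, nodes, p=1.0, rng=random.Random(0))
noisy_estimate = estimate_triangles(edges, nodes, p=0.6, rng=random.Random(1))
print(full_count, noisy_estimate)
```

With `p=1.0` every node is kept and the estimate equals the exact count; smaller `p` gives a noisy estimate, which is the regime the limit laws describe.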

2404.18905 2026-04-14 stat.ME cs.LG stat.ML

Detecting critical treatment effect bias in small subgroups

Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny Yang

Comments Accepted for presentation at the Conference on Uncertainty in Artificial Intelligence (UAI) 2024

Details
English abstract

Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.

2402.19036 2026-04-14 math.ST stat.TH

Empirical Bayes in Bayesian learning: understanding a common practice

Stefano Rizzelli, Judith Rousseau, Sonia Petrone

Details
English abstract

In applications of Bayesian procedures, once a class of priors has been chosen, it may be tempting to fix the prior's hyperparameters from the data, in an empirical Bayes (EB) fashion, usually by their maximum marginal likelihood estimates (MMLE). This is a quite common but questionable practice, lacking a rigorous theoretical basis. We provide a theoretical framework where this form of EB is regarded as a computational strategy for approximating a genuine Bayesian posterior distribution and prove its general properties for parametric models. While computing the MMLE may still be demanding, we prove novel results that allow us to provide a simple proxy. These results establish the limit behavior of the MMLE in quite general settings, including both identifiable and non-identifiable models - specifically, overfitted mixture models - significantly filling a gap in the literature. Moreover, we study higher order merging, showing that, when not degenerate, the EB posterior approximates at a faster rate an oracle-Bayes posterior distribution based on the prior law that, within the given class of priors, expresses the most information on the true model's parameters. This is a faster approximation than classic Bernstein-von Mises results. Our work provides formal content to common beliefs on this popular practice.

2311.14867 2026-04-14 stat.ME

Disaggregating Time-Series with Many Indicators: An Overview of the DisaggregateTS Package

Luke Mosley, Kaveh Salehzadeh Nobari, Giuseppe Brandi, Alex Gibberd

Details
Journal ref
The R Journal, 16(4), 62-73, 2025
English abstract

Low-frequency time-series (e.g., quarterly data) are often treated as benchmarks for interpolating to higher frequencies, since they generally exhibit greater precision and accuracy in contrast to their high-frequency counterparts (e.g., monthly data) reported by governmental bodies. An array of regression-based methods have been proposed in the literature which aim to estimate a target high-frequency series using higher frequency indicators. However, in the era of big data and with the prevalence of large volumes of administrative data sources, there is a need to extend traditional methods to work in high-dimensional settings, i.e. where the number of indicators is similar to or larger than the number of low-frequency samples. The package DisaggregateTS includes both classical regression-based disaggregation methods and recent extensions to high-dimensional settings, cf. Mosley et al. (2022). This paper provides guidance on how to implement these methods via the package in R, and demonstrates their use in an application to disaggregating CO2 emissions.

2305.16272 2026-04-14 cs.LG cs.GT stat.ML

Incentivizing Honesty among Competitors in Collaborative Learning and Optimization

Florian E. Dorner, Nikola Konstantinov, Georgi Pashaliev, Martin Vechev

Comments Updated experimental results after fixing a mistake in the code. Previous version published in NeurIPS 2023; 37 pages, 5 figures

Details
English abstract

Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity's data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Lastly, we empirically demonstrate the effectiveness of our incentive scheme on a standard non-convex federated learning benchmark. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning.

2010.15950 2026-04-14 math.ST stat.TH

All Block Maxima method for estimating the extreme value index

Jochem Oorschot, Chen Zhou

Details
English abstract

The block maxima (BM) approach in extreme value analysis fits a sample of block maxima to the Generalized Extreme Value (GEV) distribution. We consider all potential blocks from a sample, which leads to the All Block Maxima (ABM) estimator. Different from existing estimators based on the BM approach, the ABM estimator is permutation invariant. We show the asymptotic behavior of the ABM estimator, which has the lowest asymptotic variance among all estimators using the BM approach. Simulation studies justify our asymptotic theories. A key step in establishing the asymptotic theory for the ABM estimator is to obtain asymptotic expansions for the tail empirical process based on higher order statistics with weights.

1812.00250 2026-04-14 stat.ME stat.AP

A Graphical Framework for Testing Hierarchically Structured Hypothesis Families

Zhiying Qiu, Li Yu, Wenge Guo

Comments 37 pages, 8 figures, 2 tables

Details
English abstract

In clinical trials, hypotheses are frequently organized into hierarchically ordered families, requiring specialized testing strategies that account for these structured relationships. Existing gatekeeping methods, including serial, parallel, and tree-structured approaches, provide important solutions but are often either too rigid or insufficiently intuitive to accommodate increasingly complex logical dependencies among hypothesis families. To address these limitations, we propose a novel family-based graphical approach that unifies the derivation and visualization of diverse gatekeeping strategies. In this framework, procedures are represented as directed, weighted graphs, where nodes correspond to hypothesis families. Two simple updating rules govern the allocation of significance levels within families and the propagation of significance levels between them. We establish that the proposed method strongly controls the familywise error rate (FWER) at a pre-specified level. Simulation studies under representative configurations indicate that the proposed procedure achieves performance comparable to hypothesis-level graphical approaches and competitive with the superchain procedure, while providing a simpler and more interpretable family-level representation. Case studies and a real clinical trial application further illustrate its flexibility and practical advantages, making it a powerful tool for managing hierarchically structured multiple testing in clinical research.

2604.10986 2026-04-14 stat.ME stat.CO

Optimal multiple testing under family-wise error control: elementary symmetric polynomials and a scalable algorithm

Prasanjit Dubey, Xiaoming Huo

Details
English abstract

Simultaneously testing $K$ hypotheses while controlling the family-wise error rate is a fundamental problem in statistics. Existing procedures (Bonferroni, Holm, Hochberg, Hommel) provide valid control but sacrifice power, increasingly so as $K$ grows, because they base decisions on marginal $p$-value ranks rather than the joint likelihood. Rosset et al. (2022) formulated the most powerful family-wise-error-rate-controlling test as a dual program and proved the existence of an optimal dual vector $μ^*$, but left its computation as an open problem. We solve this problem for $K$ exchangeable hypotheses. The key insight is that the family-wise error rate constraint coefficients $b_{l,k}(\vec{u})$ admit closed-form expressions through elementary symmetric polynomials of the likelihood-ratio values $g(u_1), \ldots, g(u_K)$. This algebraic structure implies a global monotonicity theorem: the target functions $F_γ(μ) = {\rm FWER}_γ(\vec{D}^μ)$ are simultaneously non-increasing in every component of $μ$, for arbitrary $K$, which guarantees unique coordinate-wise roots and enables a bisection-based coordinate-descent algorithm with $O(\log \varepsilon^{-1})$ convergence rate. The relative power gain over Hommel's method grows from 15\% at $K{=}3$ to 84\% at $K{=}12$. Applications to replication studies, a clinical trial, and a replicability assessment illustrate both the power gains and the role of the exchangeability assumption.
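
Two generic building blocks named in the abstract can be sketched independently of the paper's specific coefficients $b_{l,k}(\vec{u})$, which are not reproduced here. The first computes all elementary symmetric polynomials of a list of values with the standard dynamic-programming recurrence. The second is plain bisection for a non-increasing function, which needs a number of iterations logarithmic in $\varepsilon^{-1}$. Function names and the toy target function are assumptions for illustration.

```python
def elementary_symmetric(values):
    """Return [e_0, e_1, ..., e_K] for the given K values."""
    e = [1.0] + [0.0] * len(values)
    for j, v in enumerate(values, start=1):
        # Update from high degree to low so each value enters a term at most once.
        for k in range(j, 0, -1):
            e[k] += v * e[k - 1]
    return e


def bisect_root(f, lo, hi, tol=1e-10):
    """Root of a non-increasing f on [lo, hi], assuming f(lo) >= 0 >= f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid  # f is still positive, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# e_1 = 1 + 2 + 3, e_2 = 1*2 + 1*3 + 2*3, e_3 = 1*2*3
sym = elementary_symmetric([1.0, 2.0, 3.0])

# Hypothetical monotone target with a known root at mu = 1.
root = bisect_root(lambda mu: 1.0 - mu, 0.0, 3.0)
print(sym, root)
```

In a coordinate-descent scheme of the kind the abstract describes, a bisection like this would be applied to one component of $μ$ at a time while the others are held fixed.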

2604.10976 2026-04-14 stat.ML cs.LG stat.CO stat.ME

Neural Generalized Mixed-Effects Models

Yuli Slavutsky, Sebastian Salazar, David M. Blei

Details
English abstract

Generalized linear mixed-effects models (GLMMs) are widely used to analyze grouped and hierarchical data. In a GLMM, each response is assumed to follow an exponential-family distribution where the natural parameter is given by a linear function of observed covariates and a latent group-specific random effect. Since exact marginalization over the random effects is typically intractable, model parameters are estimated by maximizing an approximate marginal likelihood. In this paper, we replace the linear function with neural networks. The result is a more flexible model, the neural generalized mixed-effects model (NGMM), which captures complex relationships between covariates and responses. To fit NGMM to data, we introduce an efficient optimization procedure that maximizes the approximate marginal likelihood and is differentiable with respect to network parameters. We show that the approximation error of our objective decays at a Gaussian-tail rate in a user-chosen parameter. On synthetic data, NGMM improves over GLMMs when covariate-response relationships are nonlinear, and on real-world datasets it outperforms prior methods. Finally, we analyze a large dataset of student proficiency to demonstrate how NGMM can be extended to more complex latent-variable models.

2604.10965 2026-04-14 stat.CO cs.LG stat.AP stat.ML

bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R

Selçuk Korkmaz

Comments 35 pages, 4 figures

Details
English abstract

Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how apparent performance changes under controlled leakage mechanisms, and the case study illustrates how guarded and leaky pipelines can yield materially different conclusions on multi-study transcriptomic data. The emphasis throughout is on software design, reproducible workflows, and interpretation of diagnostic output.
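
The two guarding ideas in the abstract, leakage-aware splits and train-fold-only preprocessing, can be sketched in a few lines. The sketch is in Python and is not the bioLeak API (the package itself is written in R); the function names and toy data are hypothetical. Whole groups, such as subjects or studies, are assigned to a single fold, and scaling statistics are computed from training rows only.

```python
import numpy as np


def grouped_folds(groups, n_folds, seed=0):
    """Assign every row of a group (e.g. a subject or study) to the same fold."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(groups))
    fold_of_group = {g: i % n_folds for i, g in enumerate(shuffled)}
    return np.array([fold_of_group[g] for g in groups])


def standardize_train_only(X, train_mask):
    """Scale all rows with mean and std estimated on the training rows only."""
    mean = X[train_mask].mean(axis=0)
    std = X[train_mask].std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std


# Toy data (hypothetical): 4 subjects with 2 repeated measurements each.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
X = np.random.default_rng(1).normal(size=(8, 3))
folds = grouped_folds(groups, n_folds=2)

train_mask = folds != 0
X_scaled = standardize_train_only(X, train_mask)
print(folds)
```

A row-wise split would let measurements from the same subject land in both training and test sets, and scaling on all rows would let test data influence preprocessing; both are the kind of optimistic bias the abstract describes.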

2604.10922 2026-04-14 cs.IT math.IT math.ST stat.TH

$α$-Mutual Information for the Gaussian Noise Channel

Mohammad Milanian, Alex Dytso, Martina Cardone

Details
English abstract

In this paper, we study Sibson's $α$-mutual information in the context of the additive Gaussian noise channel. While the classical case $α= 1$ is well understood and admits deep connections to estimation-theoretic quantities, such as the minimum mean-square error (MMSE) and Fisher information, many of the corresponding structural properties for general $α$ remain less explored. Our goal is to develop a systematic understanding of $α$-mutual information in the Gaussian noise setting and to identify which properties extend beyond the Shannon case. To this end, we establish several regularity properties, including finiteness conditions, continuity with respect to the signal-to-noise ratio (SNR) and the input distribution, and strict concavity/convexity properties that ensure uniqueness in associated optimization problems. A central contribution is the development of an $α$-I-MMSE relationship, generalizing the classical identity by relating the derivative of $α$-mutual information with respect to SNR to the MMSE evaluated under appropriately tilted distributions. This connection further leads to a generalized de Bruijn identity and new estimation-theoretic representations of Rényi entropy and differential Rényi entropy. We also characterize the low- and high-SNR behavior. In the low-SNR regime, the first-order behavior depends only on the input variance. In the high-SNR regime, for discrete inputs, $α$-mutual information converges to the Rényi entropy of order $1/α$, while for general inputs we connect it to $α$-information dimension. Overall, our results show that many fundamental relationships between information and estimation extend beyond the Shannon setting, in a form involving $α$-tilted distributions.

2604.10899 2026-04-14 math.ST stat.TH

Characterisations of Kullback--Leibler approximation by finite Gaussian mixtures

Hien Duy Nguyen

Details
English abstract

We study the Kullback--Leibler (KL) divergence approximation theory of Gaussian mixture models (GMMs) by isolating an abstract mechanism behind several necessary-and-sufficient statements. The necessity direction is universal: if a density is approximable in KL divergence by finite GMMs, then it must have finite second moment. The sufficient direction is reduced to the construction of approximating GMMs whose likelihood ratios converge pointwise and whose finite log-ratios form a uniformly integrable family. We verify this mechanism on a finite log-moment class of continuous strictly positive target densities, from which bounded, $\mathcal L^p$ $(p>1)$, and Orlicz-dominated subfamilies follow immediately. We also show that a countable-scale support-aware target density class, which allows zero density regions, satisfies the same equivalence. Finally, we give counterexamples showing that the countable-scale class strictly extends the fixed-scale class, that the finite log-moment and countable-scale support-aware classes do not contain one another, and that their union is not exhaustive.

2604.10863 2026-04-14 stat.ME stat.CO

Restricted Search Space Graph MCMC via Birth-Death Processes

Morris Greenberg, Kieran R Campbell, Radu Craiu

Comments 63 pages including 31 pages of supplement, 10 figures and 27 supplemental figures; Code to run the MCMC algorithm and reproduce simulations is available at https://github.com/morrisgreenberg/RestrictedSearchMCMC

Details
English abstract

Inferring directed acyclic graphs (DAGs) from data via Markov chain Monte Carlo (MCMC) is computationally challenging in moderate-to-high dimensional settings because their discrete sampling space grows super-exponentially with the number of nodes. To address scalability, several recent MCMC-based graph inference methods restrict the search space to a subset of edges, at the cost of introducing error into the inference procedure. In this work, we derive sharp lower and upper bounds on the total variation distance between the unrestricted posterior distribution and the posterior distribution induced by a state-of-the-art restricted search space MCMC method. These bounds characterize regimes in which the approximation error is negligible and regimes in which it is not. In order to reduce the error, we propose a flexible transdimensional MCMC sampler which allows the search space to expand or contract dynamically as the chain progresses. The sampler is defined by birth-and-death rates that induce a prior distribution on the set of search spaces, rather than assume a fixed restricted search space throughout. We outline an efficient implementation of the proposed algorithm and demonstrate its finite-sample performance through simulation studies.

2604.10857 2026-04-14 cs.LG cs.AI cs.DS math.ST stat.ML stat.TH

Query Lower Bounds for Diffusion Sampling

Zhiyang Xun, Eric Price

Details
English abstract

Diffusion models generate samples by iteratively querying learned score estimates. A rapidly growing literature focuses on accelerating sampling by minimizing the number of score evaluations, yet the information-theoretic limits of such acceleration remain unclear. In this work, we establish the first score query lower bounds for diffusion sampling. We prove that for $d$-dimensional distributions, given access to score estimates with polynomial accuracy $\varepsilon=d^{-O(1)}$ (in any $L^p$ sense), any sampling algorithm requires $\widetilde{Ω}(\sqrt{d})$ adaptive score queries. In particular, our proof shows that any sampler must search over $\widetilde{Ω}(\sqrt{d})$ distinct noise levels, providing a formal explanation for why multiscale noise schedules are necessary in practice.

2604.10854 2026-04-14 stat.AP stat.ML

Uncertainty-Aware Sparse Identification of Dynamical Systems via Bayesian Model Averaging

Shuhei Kashiwamura, Yusuke Kato, Hiroshi Kori, Masato Okada

Details
English abstract

In many problems of data-driven modeling for dynamical systems, the governing equations are not known a priori and must be selected phenomenologically from a large set of candidate interactions and basis functions. In such situations, point estimates alone can be misleading, because multiple model components may explain the observed data comparably well, especially when the data are limited or the dynamics exhibit poor identifiability. Quantifying the uncertainty associated with model selection is therefore essential for constructing reliable dynamical models from data. In this work, we develop a Bayesian sparse identification framework for dynamical systems with coupled components, aimed at inferring both interaction structure and functional form together with principled uncertainty quantification. The proposed method combines sparse modeling with Bayesian model averaging, yielding posterior inclusion probabilities that quantify the credibility of each candidate interaction and basis component. Through numerical experiments on oscillator networks, we show that the framework accurately recovers sparse interaction structures with quantified uncertainty, including higher-order harmonic components, phase-lag effects, and multi-body interactions. We also demonstrate that, even in a phenomenological setting where the true governing equations are not contained in the assumed model class, the method can identify effective functional components with quantified uncertainty. These results highlight the importance of Bayesian uncertainty quantification in data-driven discovery of dynamical models.

2604.10824 2026-04-14 stat.AP

Causal Fairness Analysis of ADHD Status and High School STEM Outcomes

Shuhan Ai

Details
English abstract

This study applies the Causal Fairness Analysis (CFA) framework of Plecko and Bareinboim (2024) to decompose the total variation in STEM outcomes attributable to ADHD status into direct, indirect, and spurious components using Pearl's Structural Causal Model. Drawing on nationally representative data from the High School Longitudinal Study of 2009, this study examines two outcomes: cumulative STEM GPA and science identity. Total variation decomposition reveals a statistically significant ADHD penalty on STEM GPA (TV = -0.670), of which 63.3% is attributable to the direct effect (x-DE), indicating that the majority of the disparity operates through pathways not mediated by observed sociodemographic or academic confounders. In contrast, the effect on science identity is small and non-significant (TV = -0.068). Counterfactual direct effect analysis using the one-step debiased estimator further reveals that the direct effect is structured by race, with notable variation across racial and ethnic subgroups. Sensitivity analyses confirm robustness to moderate unmeasured confounding. These findings advance the understanding of ADHD-related inequities in STEM education and highlight the need for fairness-aware policies that address both direct institutional barriers and their differential impact across intersecting social identities.

2604.10821 2026-04-14 cs.LG stat.CO stat.ML

Slithering Through Gaps: Capturing Discrete Isolated Modes via Logistic Bridging

Pinaki Mohanty, Ruqi Zhang

Details
English abstract

High-dimensional and complex discrete distributions often exhibit multimodal behavior due to inherent discontinuities, posing significant challenges for sampling. Gradient-based discrete samplers, while effective, frequently become trapped in local modes when confronted with rugged or disconnected energy landscapes. This limits their ability to achieve adequate mixing and convergence in high-dimensional multimodal discrete spaces. To address these challenges, we propose \emph{Hyperbolic Secant-squared Gibbs-Sampling (HiSS)}, a novel family of sampling algorithms that integrates a \emph{Metropolis-within-Gibbs} framework to enhance mixing efficiency. HiSS leverages a logistic convolution kernel to couple the discrete sampling variable with the continuous auxiliary variable in a joint distribution. This design allows the auxiliary variable to encapsulate the true target distribution while facilitating easy transitions between distant and disconnected modes. We provide theoretical guarantees of convergence and demonstrate empirically that HiSS outperforms many popular alternatives on a wide variety of tasks, including Ising models, binary neural networks, and combinatorial optimization.

2604.10820 2026-04-14 math.PR econ.EM math.CO math.ST stat.TH

A Strict Gap Between Relaxed and Partition-Constrained Spectral Compression in a Six-State Lumpable Markov Chain

Oleg Kiriukhin

Details
English abstract

This paper studies a finite reversible lumpable Markov chain for which relaxed spectral compression yields a larger determinant than partition-constrained compression. For a symmetric six-state lumpable chain and the positive operator $T=P^2$, I compare the relaxed benchmark \begin{equation*} \mathfrak D^{\mathrm{rel}}_3(T):=\sup_{U^*U=I_3}\det(U^*TU) \end{equation*} and the partition-constrained benchmark \begin{equation*} \sup_{\mathcal A\,\mathrm{3\text{-}partition}}\det Q_{\mathcal A}(T), \qquad Q_{\mathcal A}(T)=H_{\mathcal A}^*TH_{\mathcal A}. \end{equation*} Here the partition-constrained benchmark is the compression induced by normalized indicator vectors of genuine partitions of the state space. I derive closed formulas for the two analytically central partition families, prove strict upper bounds for both in a local-mode-dominated regime, and combine these bounds with an exhaustive enumeration of all $90$ partitions into three nonempty cells in an explicit six-state model. For this model, one obtains a strict global gap: \begin{equation*} \sup_{\mathcal A}\det Q_{\mathcal A}(T)<\mathfrak D^{\mathrm{rel}}_3(T). \end{equation*} Thus, in this model, indicator-based partition frames are strictly weaker than relaxed orthonormal frames even after global partition-constrained optimization.

2604.10814 2026-04-14 cs.LG math.ST stat.TH

Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

Yijin Ni, Xiaoming Huo

Details
English abstract

We study online covariance matrix estimation for Polyak--Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of $O(n^{-(1-α)/4})$, which yields $O(n^{-1/8})$ at the optimal learning-rate exponent $α\rightarrow 1/2^+$. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to $O(n^{-(1-α)/3})$, achieving $O(n^{-1/6})$. The modified estimator requires no Hessian access and preserves $O(d^2)$ memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate $Θ(n^{-(1-α)/2})$ for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives $Ω(n^{-(1-α)/2})$, and a trajectory-regression estimator--which estimates the Hessian by regressing SGD increments on iterates--achieves $O(n^{-(1-α)/2})$, matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.

2604.10808 2026-04-14 stat.AP stat.ME

Modeling Tripartite Hyperevents in Scientific Collaboration Networks

Amin Gino Fabbrucci Barbagli, Jürgen Lerner, Viviana Amati, Domenico De Stefano

Details
English abstract

Sociological research has framed collective action in science, innovation, and culture as tripartite networks connecting teams of actors, lists of prior works, and sets of labels (e.g., keywords, topics). While methods for multipartite social networks were proposed decades ago, and have received a recent surge in interest, none of the suggested solutions scale to the size and granularity of contemporary data sets (scientific publications, patents, filmmaking) and at the same time allow for testing multiple competing hypotheses about the drivers of collective production. In this paper, we address this gap by applying Relational Hyperevent Models (RHEM) to dynamic tripartite hypergraphs. Using scientific networks as a case study, we model events linking any number of actors, references, and keywords, testing and controlling for inter-dependencies within and between each set.

2604.10792 2026-04-14 math.PR econ.EM math.CT math.ST stat.TH

Variable-Length Markov Chains on Finite Quivers: Boundary-Window Identifiability, Exact Depth, and Local Rank Comparison

Oleg Kiriukhin

Details
English abstract

Variable-length Markov chains on finite quivers provide a natural framework for context-dependent stochastic growth under incidence constraints. I study quiver-valued variable-length Markov chains observed through finite boundary windows and develop a first-order theory of visible-depth identifiability via stationary visible one-step transition laws and their restricted differentials on prescribed tangent blocks. For visible depth $m$, the main object is the stationary one-step informative map $q_{\mathcal{Q}}^{(m)}$. In the edge-homogeneous regime, once the local visible support is fixed and the representation hypothesis holds, all admissible visible depths encode the same edge-level extension law and hence have the same first-order rank. In the exact-depth regime of context length $r$, the depth-$r$ boundary process is the canonical finite-state Markov chain, smaller visible windows are deterministic truncations, and every coarser informative map factors $C^1$-smoothly through the depth-$r$ informative map on the relevant affine transition-array neighborhood. Hence rank cannot increase beyond depth $r$. After quotienting a tangent block by directions already invisible at depth $r$, I characterize strict coarse-depth loss exactly by coarse rank deficiency, equivalently by strict rank drop from depth $r$ to depth $m$ on the original block. I also give subspace-based and global selected-coordinate criteria, a global one-coordinate branching criterion, and an explicit depth-two example. Under full fine-depth rank and strict coordinate-rank loss at every smaller depth, a global coordinate-rank theorem yields $m_*(T,θ_0)=r$. Reduced local coordinates remove stochastic redundancies, first-order criteria are invariant under $C^1$ reparameterization, and the statistical and LAN consequences remain conditional on additional estimation and likelihood-level hypotheses.

2604.10752 2026-04-14 cs.IT econ.EM math.IT math.PR math.ST stat.TH

Entropy-Rate Selection for Partially Observed Processes

Oleg Kiriukhin

Abstract

I formulate an entropy-rate maximization problem at the observable level for stochastic processes observed through an information-reducing observation map. For a visible stationary law, the map determines an observational fiber of hidden stationary laws generating that law. In the finite-state finite-memory setting, retained visible constraints determine a feasible class of stationary $(r+1)$-block laws, and the entropy maximizer is defined as the entropy-rate maximizer on this class. The paper formulates entropy-rate maximization on feasible classes induced by partial observability and develops a structural theory for the resulting maximizer. I prove existence and uniqueness of the maximizer, with uniqueness under a fixed-context-marginal hypothesis and, more generally, via a strict-concavity characterization by row proportionality. Two global characterization regimes are central: a fixed one-point marginal yields the i.i.d. maximizer, and a fixed $r$-block law yields the $(r-1)$-step Markov extension. The gap functional equals a conditional mutual information and vanishes exactly at the maximizing completion. I also derive optimality conditions, local geometry of the maximizer, a latent random-mapping realization that leaves the visible law unchanged, and a local empirical consistency theorem, and illustrate the framework by an aliased hidden-state example.

2604.10727 2026-04-14 stat.ML cs.AI cs.LG math.PR math.ST stat.TH

Tail-Aware Information-Theoretic Generalization for RLHF and SGLD

Huiming Zhang, Binghan Li, Wan Tian, Qiang Sun

Comments 65 pages, 9 figures

Abstract

Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many modern pipelines, such as robust learning, RLHF, and stochastic optimization, losses and rewards can be heavy-tailed, and MGFs may not exist, rendering KL-based tools ineffective. We develop a tail-dependent information-theoretic framework for sub-Weibull data, where the tail parameter $θ$ controls the tail heaviness: $θ=2$ corresponds to sub-Gaussian, $θ=1$ to sub-exponential, and $0<θ<1$ to genuinely heavy tails. Our key technical ingredient is a decorrelation lemma that bounds change-of-measure expectations using a shifted-log $f_θ$-divergence, which admits explicit comparisons to Rényi divergence without MGF arguments. On the empirical-process side, we establish sharp maximal inequalities and a Dudley-type chaining bound for sub-Weibull processes with tail index $θ$, with complexity scaling as $\log^{1/θ}$ and entropy$^{1/θ}$. These tools yield expected and high-probability PAC-Bayes generalization bounds, as well as an information-theoretic chaining inequality based on multiscale Rényi mutual information. We illustrate the consequences in Rényi-regularized RLHF under heavy-tailed rewards and in stochastic gradient Langevin dynamics with heavy-tailed gradient noise.

2604.10710 2026-04-14 stat.ME

Causal mediation in cluster-randomized trials with multiple mediators: spillover-aware decomposition, identification, and semiparametric efficient inference

Jiaqi Tong, Chao Cheng, Fan Li

Abstract

Causal mediation analysis in cluster-randomized trials (CRTs) is complicated by the presence of multiple mediators, intracluster correlation, and within-cluster interference. Existing mediation methods often fall short in accommodating these features simultaneously, and semiparametric efficient estimators that fully address them remain unavailable. We develop a unified framework that defines a class of mediation effect estimands, including exit indirect effects, exit spillover mediation effects, and their interaction effects, to investigate causal mechanisms in CRTs with an arbitrary number of mediators under an unknown causal structure. We introduce a set of interpretable causal assumptions for point identification of each estimand. For optimal inference, we first derive the efficient influence functions for the proposed estimands and construct corresponding one-step and debiased machine learning estimators. In particular, to flexibly model the joint mediator density, we employ an elliptical copula marginal regression model that combines a nonparametric marginal regression with an interpretable association structure. We assess the finite-sample performance of the proposed estimators through simulation studies and illustrate the methodology by reanalyzing the PPACT CRT data with three causally unordered mediators.

2604.10706 2026-04-14 stat.ME

Multiple Imputation Diagnostics when using Electronic Health Record Data in Observational Studies: A Case Study

Nrupen A. Bhavsar, Lingyu Zhou, Samuel I. Berchuck, Matthew L. Maciejewski, Jerome P. Reiter

Comments 22 pages with title page and references, 4 figures

Abstract

Missing values in electronic health record (EHR) data pose a significant challenge for epidemiologic research. Traditional methods for handling missing data, like mean imputation, may introduce bias. Multiple imputation (MI) offers a principled solution by generating multiple plausible values based on statistical models. However, MI requires careful model specification and validation of imputations, ideally using multivariate graphical tools. We demonstrate the application of such tools to validate MI in a study of chronic kidney disease, assessing cardiovascular outcomes linked to neighborhood socioeconomic status (nSES). This study used data from Duke University Health System (DUHS) and Lincoln Community Health Center (LCHC). Eligible patients had at least one encounter within DUHS or LCHC and had two estimated glomerular filtration rate (eGFR) values <60 mL/min per 1.73 m$^2$ more than 90 days apart between January 1, 2007 and July 1, 2008. Socioeconomic status was assessed using the Agency for Healthcare Research and Quality (AHRQ) index based on census data. The main outcome was a cardiovascular disease-related hospitalization. Participants were mostly older (mean age 73 years), female (64%), and Black (43%). Participants living in lower nSES neighborhoods had higher mean systolic blood pressure (SBP: 140 mmHg) and hemoglobin A1c (HbA1c) levels (7.1%) as compared to participants living in higher nSES neighborhoods. A machine learning based approach, Classification and Regression Trees (CART), was the preferred approach to impute missing data. The distributions of imputed values of SBP and HbA1c were impacted by whether marginal or conditional values of SBP and HbA1c were imputed. The choice of MI had minimal impact on inference and prediction. Future research may want to extend our results and consider how results may differ when using EHR data from multiple health systems.

2604.10672 2026-04-14 stat.ML cs.LG

One-Step Score-Based Density Ratio Estimation

Wei Chen, Qibin Zhao, John Paisley, Junmei Yang, Delu Zeng

Abstract

Density ratio estimation (DRE) is a useful tool for quantifying discrepancies between probability distributions, but existing approaches often involve a trade-off between estimation quality and computational efficiency. Classical direct DRE methods are usually efficient at inference time, yet their performance can seriously deteriorate when the discrepancy between distributions is large. In contrast, score-based DRE methods often yield more accurate estimates in such settings, but they typically require considerable repeated function evaluations and numerical integration. We propose One-step Score-based Density Ratio Estimation (OS-DRE), a partly analytic and solver-free framework designed to combine these complementary advantages. OS-DRE decomposes the time score into spatial and temporal components, representing the latter with an analytic radial basis function (RBF) frame. This formulation converts the otherwise intractable temporal integral into a closed-form weighted sum, thereby removing the need for numerical solvers and enabling DRE with only one function evaluation. We further analyze approximation conditions for the analytic frame, and establish approximation error bounds for both finitely and infinitely smooth temporal kernels, grounding the framework in existing approximation theory. Experiments across density estimation, continual Kullback-Leibler and mutual information estimation, and near out-of-distribution detection demonstrate that OS-DRE offers a favorable balance between estimation quality and inference efficiency.

2604.10650 2026-04-14 stat.ML cs.LG

A Deep Generative Approach to Stratified Learning

Randy Martinez, Rong Tang, Lizhen Lin

Comments 79 pages, 5 figures

Abstract

While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces -- unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying dimensionality, intersection singularities, and lack of efficient models in learning the underlying distributions. We provide a deep generative approach to stratified learning by developing two generative frameworks for learning distributions on stratified spaces. The first is a sieve maximum likelihood approach realized via a dimension-aware mixture of variational autoencoders. The second is a diffusion-based framework that explores the score field structure of a mixture. We establish the convergence rates for learning both the ambient and intrinsic distributions, which are shown to be dependent on the intrinsic dimensions and smoothness of the underlying strata. Utilizing the geometry of the score field, we also establish consistency for estimating the intrinsic dimension of each stratum and propose an algorithm that consistently estimates both the number of strata and their dimensions. Theoretical results for both frameworks provide fundamental insights into the interplay of the underlying geometry, the ambient noise level, and deep generative models. Extensive simulations and real dataset applications, such as molecular dynamics, demonstrate the effectiveness of our methods.

2604.10641 2026-04-14 cs.IT cs.IR math.IT math.PR stat.AP

On the Capacity of Distinguishable Synthetic Identity Generation under Face Verification

Behrooz Razeghi

Abstract

We study how many synthetic identities can be generated so that a face verifier declares same-identity pairs as matches and different-identity pairs as non-matches at a fixed threshold $τ$. We formalize this question for a generative face-recognition pipeline consisting of a generator followed by a normalized recognition map with outputs on the unit hypersphere. We define the capacity of distinguishable identity generation as the largest number of latent identities whose induced embedding distributions satisfy prescribed same-identity and different-identity verification constraints. In the deterministic view-invariant regime, we show that this capacity is characterized by a spherical-code problem over the realizable set of embeddings, and reduces to the classical spherical-code quantity under a full angular expressivity assumption. For stochastic identity generation, we introduce a centered model and derive a sufficient admissibility condition in which the required separation between identity centers is $\arccos(τ)+2ρ$, where $ρ$ is a within-identity concentration radius. Under full angular expressivity, this yields spherical-code-based achievable lower bounds and a positive asymptotic lower bound on the exponential growth rate with embedding dimension. We also introduce a prior-constrained random-code capacity, in which latent identities are sampled independently from a given prior, and derive high-probability lower bounds in terms of pairwise separation-failure probabilities of the induced identity centers. Under a stronger full-cap-support model, we obtain a converse and an exact spherical-code characterization.
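
The sufficient separation condition quoted in the abstract can be illustrated with a small check. This is only a sketch: it assumes that $ρ$ is an angular radius in radians and that identity centers are given as unit vectors. The function names and the four example centers are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations

def required_separation(tau, rho):
    """Angular separation arccos(tau) + 2*rho quoted in the abstract (radians)."""
    return math.acos(tau) + 2.0 * rho

def angle(u, v):
    """Angle in radians between two unit vectors, clamped for numerical safety."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def centers_admissible(centers, tau, rho):
    """True if every pair of centers is at least the required angle apart."""
    need = required_separation(tau, rho)
    return all(angle(u, v) >= need for u, v in combinations(centers, 2))

# Hypothetical example: four centers on the unit circle, 90 degrees apart.
centers = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
ok = centers_admissible(centers, tau=0.5, rho=0.2)         # needs about 1.447 rad
too_tight = centers_admissible(centers, tau=0.5, rho=0.3)  # needs about 1.647 rad
```

With the closest centers 90 degrees (about 1.571 rad) apart, the first setting passes and the second fails, which shows how a larger within-identity radius reduces the number of centers that fit on the sphere.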

2604.10618 2026-04-14 stat.AP

A comprehensive study on causal discovery between degradation paths

Shi-Shun Chen, Shuai Gao, Xiao-Yang Li, Enrico Zio

Abstract

Existing studies indicate that complex system degradation is characterized by degradation of multiple dependent parameters. Capturing the dependencies is crucial for accurate degradation modeling and effective degradation control. This work aims to uncover these dependencies through causal analysis, focusing on pairwise causal discovery. First, considering the steady-state characteristic of physical dependencies between parameters, we propose a causal discovery strategy that uses degradation increments in combination with non-temporal causal discovery techniques. Then, five types of non-temporal causal discovery techniques, including constraint-based, score-based, functional causal model-based, gradient-based and the emerging ordering-based technique, are selected as benchmark methods to identify the most suitable approach. Numerical studies based on the Wiener process are first conducted to investigate the methods' effectiveness on both independent and causally dependent degradation paths. Additionally, sensitivity analysis is performed to evaluate how degradation process characteristics affect the accuracy of causal discovery. Then, two engineering applications are given to show the practical applicability of the approach, including a second-order multiple-feedback band pass filter and a turbofan engine. Our findings indicate that the proposed strategy, which uses degradation increments, outperforms methods that rely on raw degradation data. Among all evaluated techniques, stable Peter-Clark and greedy equivalence search exhibit robust and accurate performance across both numerical and engineering cases and are therefore recommended for causal discovery between degradation paths. The code is available on GitHub: https://github.com/dirge1/causal_deg_data.

2604.10570 2026-04-14 econ.GN cs.CE q-fin.EC stat.AP

Unveiling contrasting impacts of heat mitigation and adaptation policies on U.S. internal migration

Chao Li, Xing Su, Chao Fan, Yang Li, Luping Li, Chunmo Zheng, Wenglong Chao, Leena Jarvi, Han Lin, Juan Tu

Comments 24 pages, 6 figures, 2 tables

Abstract

While climate-induced population migration has received rising attention, the role played by human climate endeavors remains underexplored. Here, we combine machine learning with attribution mapping to analyze the impacts of 4,713 heat-related policies (HPs) on 11,177 migration flows between U.S. counties. We find that heat adaptation policies (APs) and heat mitigation policies (MPs) have significant and opposing impacts on internal migration: APs reduce out-migration, while MPs increase it. These policies have heterogeneous effects on migration among policy types. Behavioral and cultural MPs at origins lead to a 0.24%-0.68% (95% confidence interval) increase in annual outflows per policy, whereas behavioral and cultural APs at destinations elevate outflows of origins by 0.11%-1.55% (95% confidence interval). Migration patterns are nonlinearly moderated by income, ageing, education, and racial diversity of both origin and destination counties. Ageing rates have the most noticeable U-shaped relationship in shaping migration responses to behavioral and cultural MPs at origins, and inverted U-shapes for institutional MPs at origins and nature-based MPs at destinations. These findings offer critical insights for policymakers on how HPs influence migration as global warming and policy interventions persist.

2604.10555 2026-04-14 stat.OT

On Some Multivariate Extensions to Zenga Curve: Properties and Applications

Shifna P R, S. M. Sunoj

Abstract

Measures of inequality are often limited in their ability to capture multidimensional aspects that arise from the joint distribution of multiple socio-economic variables. In this paper, we develop bivariate extensions of the Zenga inequality measure using bivariate quantile functions. We propose new bivariate Zenga surfaces and study their theoretical properties. A vector-valued bivariate Zenga curve is also introduced to provide a more detailed characterization of inequality. A non-parametric estimator is proposed and methods are evaluated through simulation studies and applied to the analysis of digital inequality across countries using indicators such as broadband penetration and digital literacy. The results highlight the effectiveness of the proposed framework in capturing multidimensional inequality.

2604.10412 2026-04-14 stat.ML cs.LG stat.ME

Orthogonal machine learning for conditional odds and risk ratios

Jiacheng Ge, Iván Díaz

Abstract

Conditional effects are commonly used measures for understanding how treatment effects vary across different groups, and are often used to target treatments/interventions to groups who benefit most. In this work we review existing methods and propose novel ones, focusing on the odds ratio (OR) and the risk ratio (RR). While estimation of the conditional average treatment effect (ATE) has been widely studied, estimators for the OR and RR lag behind, and cutting edge estimators such as those based on doubly robust transformations or orthogonal risk functions have not been generalized to these parameters. We propose such a generalization here, focusing on the DR-learner and the R-learner. We derive orthogonal risk functions for the OR and RR and show that the associated pseudo-outcomes satisfy second-order conditional-mean remainder properties analogous to the ATE case. We also evaluate estimators for the conditional ATE, OR, and RR in a comprehensive nonparametric Monte Carlo simulation study to compare them with common alternatives under hundreds of different data-generating distributions. Our numerical studies provide empirical guidance for choosing an estimator. For instance, they show that while parametric models are useful in very simple settings, the proposed nonparametric estimators significantly reduce bias and mean squared error in the more complex settings expected in the real world. We illustrate the methods in the analysis of physical activity and sleep trouble in U.S. adults using data from the National Health and Nutrition Examination Survey (NHANES). The results demonstrate that our estimators uncover substantial treatment effect heterogeneity that is obscured by traditional regression approaches and lead to improved treatment decision rules, highlighting the importance of data-adaptive methods for advancing precision health research.

2604.10398 2026-04-14 stat.ME stat.ML

Estimating heterogeneous treatment effects with survival outcomes via a deep survival learner

Yuming Sun, Jian Kang, Yi Li

Abstract

Estimating heterogeneous treatment effects in survival settings is complicated by right censoring as well as the time-varying nature of the estimand. While the conditional average treatment effect (CATE) provides a natural target, most existing approaches focus on a single prespecified time point and do not account for the temporal trajectory, leading to instability in estimation. We propose a deep survival learner (DSL) for estimating heterogeneous treatment effects with right-censored outcomes. The method is based on a doubly robust pseudo-outcome whose conditional expectation identifies time-specific CATEs under standard assumptions. This construction remains unbiased if either the outcome model or the treatment assignment model is correctly specified, when properly accounting for censoring. To estimate CATEs over a clinically relevant time spectrum, DSL employs a multi-output deep neural network with shared representations, enabling joint estimation of treatment effect trajectories. From a theoretical perspective, we derive error bounds for both pointwise and joint estimation over time. We show that joint estimation can leverage temporal structure to control estimation error without incurring much additional approximation cost under smoothness conditions, leading to improved stability relative to separate estimation. Cross-fitting is incorporated to reduce overfitting and mitigate bias arising from flexible nuisance estimation. Simulation studies demonstrate favorable finite-sample performance, particularly under nuisance model misspecification. Applied to the Boston Lung Cancer Study, DSL reveals heterogeneity in the effects of perioperative chemotherapy across patient characteristics and over time.

2604.10376 2026-04-14 math.ST stat.TH

Spectral analysis of multivariate stationary Hawkes processes

Yifu Tang, Conor Kresin, Boris Baeumer, Ting Wang

Abstract

We establish the asymptotic validity of frequency-domain inference for stationary multivariate Hawkes processes under mild conditions, bridging the gap between theory and application. By developing upper bounds on the reduced cumulant measures from the cluster representation of the Hawkes processes, we prove a functional central limit theorem and, as a consequence, consistency of the Whittle estimator under stationarity alone (i.e., the spectral radius of the interaction matrix satisfies $ρ(\boldsymbol{ν})<1$), applicable to Hawkes processes with heavy-tailed mutual-excitation kernels. Under mild extra moment conditions, we further obtain asymptotic normality with an explicit limiting covariance in terms of second- and fourth-order cumulant spectral densities. We also propose a simple frequency-domain method to detect joint independence of subprocesses of a multivariate Hawkes process. The performance of the Whittle estimator and of the test of independence is demonstrated via simulation studies.

2604.10375 2026-04-14 q-fin.RM q-fin.PM stat.AP

On the Structure of Risk Contribution: A Leave-One-Out Decomposition into Inherent and Correlation Risk

Nolan Alexander, Frank Fabozzi

Comments Code: https://github.com/nolanalexander/inherent-correlation-decomposition

Abstract

This paper develops a decomposition of standard Risk Contribution (RC) into two economically interpretable components: inherent risk and correlation risk. Using a leave-one-out representation, each position's RC separates into a term reflecting its own volatility contribution independent of the portfolio and a term capturing its covariance with the remainder of the portfolio. The inherent component is always positive, arising from the intrinsic volatility of the position, while the correlation component may amplify or mitigate total portfolio risk depending on how the position moves relative to other holdings. Because the decomposition operates within standard RC, it preserves the property of strict additivity. This separation provides diagnostic insight not visible from aggregate risk contributions alone. It distinguishes whether a position contributes risk because it is volatile in isolation or because it is highly correlated with the rest of the portfolio, and it clarifies when a negatively correlated position functions as an effective hedge. Two approaches to time-series analysis are presented to track how inherent and correlation risk evolve across market regimes, revealing whether changes in portfolio risk during stress periods are driven by volatility shocks, correlation shifts, or both. Empirical illustrations suggest that the decomposition provides stable, transparent, and easily implementable risk diagnostics that can support portfolio risk reporting, stress testing, and performance attribution.
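
A small numerical sketch may help make the split concrete. It assumes the standard risk contribution $RC_i = w_i (Σw)_i / σ_p$ and further assumes that the inherent term is the own-variance part $w_i^2 σ_i^2 / σ_p$ and the correlation term is the leave-one-out covariance part. The abstract does not state these formulas, so the paper's exact definitions may differ. The function name and the two-asset numbers are hypothetical.

```python
import math

def risk_contribution_split(weights, cov):
    """Split each position's standard risk contribution into two parts.

    Assumed definitions (not taken from the paper):
      inherent_i    = w_i^2 * cov[i][i] / sigma_p
      correlation_i = w_i * sum_{j != i} w_j * cov[i][j] / sigma_p
    so that inherent_i + correlation_i = w_i * (cov @ w)_i / sigma_p.
    """
    n = len(weights)
    port_var = sum(weights[i] * weights[j] * cov[i][j]
                   for i in range(n) for j in range(n))
    sigma_p = math.sqrt(port_var)
    inherent, correlation = [], []
    for i in range(n):
        own = weights[i] ** 2 * cov[i][i]
        cross = weights[i] * sum(weights[j] * cov[i][j]
                                 for j in range(n) if j != i)
        inherent.append(own / sigma_p)
        correlation.append(cross / sigma_p)
    return sigma_p, inherent, correlation

# Hypothetical two-asset portfolio whose assets are negatively correlated.
weights = [0.6, 0.4]
cov = [[0.04, -0.012],
       [-0.012, 0.09]]
sigma_p, inherent, correlation = risk_contribution_split(weights, cov)
total = [a + b for a, b in zip(inherent, correlation)]
```

Under these assumptions the per-position totals sum to the portfolio volatility, the inherent terms are positive for nonzero weights, and the negative covariance shows up as negative correlation terms, which matches the hedging interpretation described in the abstract.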

2604.08220 2026-04-14 stat.AP

WaST: a formalisation of the Wave model with associated statistical inference and applications

Grégoire Clarté

Abstract

We propose a mathematical formalisation of the "wave model", originally developed in historical linguistics but with further applications in the human sciences. This model assumes that new traits appear in a population and spread to nearby populations depending on their closeness. It is mostly used to describe the joint evolution of closely related populations, for example of several dialects. Such situations of permanent contact are not accurately represented by competing models based on tree structures. We build a fully Bayesian generative model in which innovations spread along a fixed graph and disappear according to a death process. We then develop a Metropolis-Hastings within Gibbs sampler to sample from the posterior distribution on the graph. We test our method on simulated datasets as well as on several real datasets.

2604.05225 2026-04-14 stat.CO cs.LG stat.AP stat.ML

fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R

Selcuk Korkmaz, Dincer Goksuluk, Eda Karaismailoglu

Comments 36 pages, 2 figures

Abstract

Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.
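
The difference between global and fold-local preprocessing can be sketched in a few lines. The example below is written in plain Python and does not use or reproduce the fastml interface. All function names, the interleaved fold assignment, and the toy data are hypothetical and serve only to illustrate the idea of re-estimating a scaler inside each resample.

```python
import statistics

def fit_scaler(values):
    """Estimate centering and scaling parameters from the given values only."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        sd = 1.0  # avoid division by zero for constant features
    return mean, sd

def apply_scaler(values, params):
    """Apply previously estimated parameters without re-estimating them."""
    mean, sd = params
    return [(v - mean) / sd for v in values]

def guarded_folds(values, n_folds):
    """Yield (scaled_train, scaled_assess), refitting the scaler in each fold."""
    for k in range(n_folds):
        assess_idx = set(range(k, len(values), n_folds))
        train = [v for i, v in enumerate(values) if i not in assess_idx]
        assess = [v for i, v in enumerate(values) if i in assess_idx]
        params = fit_scaler(train)  # estimated on the analysis data only
        yield apply_scaler(train, params), apply_scaler(assess, params)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # toy feature with one extreme value
folds = list(guarded_folds(data, n_folds=3))

# Leaky alternative for contrast: parameters estimated once on all rows.
global_params = fit_scaler(data)
```

In the guarded version the assessment rows never influence the mean and standard deviation used to scale them, whereas `global_params` already contains information from every row, including the extreme value.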

2604.05063 2026-04-14 math.ST stat.TH

Robust mean estimation under star-shaped constraints with heavy-tailed noise

Tuorui Peng, Akshay Prasadan, Matey Neykov

Comments 56 pages

Abstract

We study the problem of robust mean estimation with adversarially contaminated data under star-shaped constraints in a heavy-tailed noise setting, where only a finite second moment $σ^2$ is assumed. For a contamination level $\varepsilon$ below some constant, we show that the minimax rate of the squared $\ell_2$ loss is $\max(δ^{*2}, \varepsilon σ^2) \wedge d^2$ for a star-shaped set with diameter $d$ (set $d = \infty$ if the set is unbounded), with $δ^*$ determined via the local entropy $\log M^{\mathrm{loc}}(δ,c)$ as $$δ^* := \sup\bigg\{δ\geq 0: N\frac{δ^2}{σ^2}\leq \log M^{\mathrm{loc}}(δ,c)\bigg\},$$ where $c$ is a sufficiently large constant. Crucially, we require that the sample size satisfies $N \gtrsim \sup_{δ\geq 0} \log M^{\mathrm{loc}}(δ,c)$. We also show that the minimax rate is $\max(δ^{*2}, \varepsilon^2 σ^2) \wedge d^2$ for known or sign-symmetric distributions, matching the rate achieved in the Gaussian case.

2603.29575 2026-04-14 stat.ME

Transfer Learning for Moderate-Dimensional Ridge-Regularized Robust Linear Regression

Lingfeng Lyu, Xiao Guo, Zongqi Liu

Abstract

This paper studies transfer learning for ridge-regularized robust linear regression in the moderate-dimensional regime, where the number of predictors is of the same order as the sample size and the regression coefficients are not assumed to be sparse. We propose Trans-RR, which combines a robust ridge estimator from a source study with a robust ridge correction based on the target study. Under mild assumptions, we characterize the asymptotic estimation error of the proposed estimator and show that leveraging source data can substantially improve estimation accuracy relative to the traditional single-study ridge-regularized robust estimator. Simulation results and a real-data analysis support the theory and illustrate both positive and negative transfer as the discrepancy between the source and target studies varies.

2603.14356 2026-04-14 stat.AP

Prediction-based Inference in Electronic Health Record (EHR)-linked Biobanks with Clinically Informative Outcomes

Xingran Chen, Cheng-Han Yang, Zhenke Wu, Bhramar Mukherjee

Abstract

Electronic health record (EHR)-linked biobank data hold tremendous promise for large-scale discoveries via genome-wide association study (GWAS) on diverse phenotypic traits and biomarkers routinely captured in the EHR. However, heterogeneous missingness in biomarkers compromises the validity and efficiency of statistical analyses. Prediction-based (PB) inference methods meet this challenge by using external machine learning (ML) predictions to impute missing biomarker outcomes, thereby improving statistical power and estimation accuracy in association analyses. Yet, their suitability remains unclear when outcomes are subject to clinically informative observation processes, that is, when laboratory tests are ordered based on both measured and unmeasured patient- and health system-level characteristics. In this paper, we review the statistical underpinnings of popular PB methods and then evaluate nine methods, including four PB methods and five traditional missing-data approaches, under an encompassing set of outcome observation processes for continuous and binary outcomes. PB methods can substantially improve statistical power and estimation efficiency when the missing-data mechanism is correctly specified. Under misspecification, however, these gains require both conditional independence between the covariates of interest and the missingness mechanism and independence between imputation error and the missingness mechanism. Using All of Us (AoU) data, we perform GWAS of six laboratory biomarkers and demonstrate that PB methods can replicate known genetic associations while improving efficiency relative to (weighted) complete-case analysis (CCA). Their performance in replicating existing GWAS results in AoU also depends on imputation quality and the underlying missingness mechanism.

2601.01471 2026-04-14 math.ST econ.EM stat.ME stat.ML stat.TH

Double Machine Learning of Continuous Treatment Effects with General Instrumental Variables

Shuyuan Chen, Peng Zhang, Yifan Cui

Abstract

Estimating causal effects of continuous treatments is a common problem in practice, for example, in studying average dose-response functions. Classical analyses typically assume that all confounders are fully observed, whereas in real-world applications, unmeasured confounding often persists. In this article, we propose a novel framework for the identification of average dose-response functions using instrumental variables, thereby mitigating bias induced by unobserved confounders. We introduce the concept of a uniform regular weighting function and consider covering the treatment space with a finite collection of open sets. On each of these sets, such a weighting function exists, allowing us to identify the average dose-response function locally within the corresponding region. For estimation, we propose an augmented inverse probability weighted score for continuous treatments with instrumental variables under a debiased machine learning framework, and provide practical guidance to adaptively establish regular weighting functions from the data. We further establish the asymptotic properties when the average dose-response function is estimated via kernel regression or empirical risk minimization. Finally, we conduct both simulation and empirical studies to assess the finite-sample performance of the proposed methods.

2512.20552 2026-04-14 cs.IT math.IT stat.ML

Information-theoretic signatures of causality in Bayesian networks and hypergraphs

Sung En Chiang, Zhaolu Liu, Robert L. Peach, Mauricio Barahona

Comments 21 pages, 3 figures

Abstract

Analyzing causality in multivariate systems involves establishing how information is generated, distributed and combined. Traditional causal discovery frameworks are capable of multivariate reasoning but their intrinsic pairwise graph topology restricts them to do so only indirectly by integrating multivariate information across pairwise edges. Higher-order information theory provides direct tools that can explicitly model higher-order interactions. In particular, Partial Information Decomposition (PID) allows the decomposition of the information that a set of sources provides about a target into redundant, unique, and synergistic components. Yet the mathematical connection between such higher-order information-theoretic measures and causal structure remains undeveloped. Here we establish the first theoretical correspondence between PID components and causal structure in both Bayesian networks and hypergraphs. We first show that in Bayesian networks unique information precisely characterizes direct causal neighbors, while synergy identifies collider relationships. This establishes a localist causal discovery paradigm in which the structure surrounding each variable can be recovered from its immediate informational footprint, eliminating the need for global search over graph space. Extending these results to a more expressive causal representation, we prove that PID signatures in Bayesian hypergraphs differentiate parents, children, co-heads, and co-tails, revealing a novel collider effect unique to multi-tail hyperedges. Our results position PID as a rigorous, model-agnostic foundation for inferring both pairwise and higher-order causal structure, and introduce a fundamentally local information-theoretic viewpoint on causal discovery.

2511.03015 2026-04-14 cs.LG stat.ML

Discrete Bayesian Sample Inference for Graph Generation

Ole Petersen, Marcel Kollovieh, Marten Lienen, Stephan Günnemann

Journal ref ICLR 2026

Abstract

Generating graph-structured data is crucial in applications such as molecular generation, knowledge graphs, and network analysis. However, their discrete, unordered nature makes them difficult for traditional generative models, leading to the rise of discrete diffusion and flow matching models. In this work, we introduce GraphBSI, a novel one-shot graph generative model based on Bayesian Sample Inference (BSI). Instead of evolving samples directly, GraphBSI iteratively refines a belief over graphs in the continuous space of distribution parameters, naturally handling discrete structures. Further, we state BSI as a stochastic differential equation (SDE) and derive a noise-controlled family of SDEs that preserves the marginal distributions via an approximation of the score function. Our theoretical analysis further reveals the connection to Bayesian Flow Networks and Diffusion models. Finally, in our empirical evaluation, we demonstrate state-of-the-art performance on molecular and synthetic graph generation, outperforming existing one-shot graph generative models on the standard benchmarks Moses and GuacaMol.

2510.07942 2026-04-14 math.PR math.ST stat.TH

Precise convergence rate of spectral radius of product of complex Ginibre

Yutao Ma, Xujia Meng

Comments This version makes substantial improvements over the previous one, including in the title, abstract, and content. We would therefore prefer to announce it as a completely new submission

Abstract

Let $Z_1, \ldots, Z_n$ denote the eigenvalues of the product $\prod_{j=1}^{k_n} \boldsymbol{A}_j$, where $\{\boldsymbol{A}_j\}_{1 \le j \le k_n}$ are independent $n\times n$ complex Ginibre matrices. Define $\alpha = \lim\limits_{n \to \infty} \frac{n}{k_n}$. We prove that $X_n$, a suitably rescaled version of $\max_{1 \le j \le n} |Z_j|^2$, converges weakly as follows: to a non-trivial distribution $\Phi_\alpha$ for $\alpha \in (0, +\infty)$, to the Gumbel distribution when $\alpha = +\infty$, and to the standard normal distribution when $\alpha = 0$. This result reveals a phase transition at the boundaries of $\alpha$. Furthermore, we establish the exact rates of convergence in each regime.

2509.20587 2026-04-14 stat.ML cs.LG stat.ME

Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation

Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Sharon Li, Jiwei Zhao

Abstract

We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose the distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator, and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for this unobservable source subpopulation.

2509.10853 2026-04-14 stat.ML cs.LG

Variable Selection Using Relative Importance Rankings

Tien-En Chang, Argon Chen

Comments 35 pages, 9 figures

Journal ref 10.1016/j.patcog.2026.113561

Abstract

Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable or feature ranking and filter-based selection before model creation. Specifically, we anticipate strong performance from the RI measures because they incorporate both direct and combined effects of predictors, addressing a key limitation of marginal correlation, which ignores dependencies among predictors. We implement and evaluate the RI-based variable ranking and selection methods, including a newly proposed RI measure, CRI.Z, with improved computational efficiency relative to conventional RI measures. Through extensive simulations, we first demonstrate how the RI measures more accurately rank the variables than the marginal correlation, especially when there are suppressed or weak predictors. We then show that predictive models built on these rankings are highly competitive, often outperforming state-of-the-art linear-model methods such as the lasso and relaxed lasso. The proposed RI-based methods are particularly effective in challenging cases involving clusters of highly correlated predictors, a setting known to cause failures in many benchmark methods. The practical utility and efficiency of RI-based methods are further demonstrated through two high-dimensional gene expression datasets. Although lasso methods have dominated the recent literature on variable selection, our study reveals that the RI-based method is a powerful and competitive alternative. We believe these underutilized tools deserve greater attention in statistics and machine learning communities. The code is available at: https://github.com/tien-endotchang/RI-variable-selection.

2509.03297 2026-04-14 stat.ME stat.ML

Feedback-Enhanced Online Multiple Testing with Applications to Conformal Selection

Lin Lu, Yuyang Huo, Haojie Ren, Zhaojun Wang, Changliang Zou

Abstract

We study online multiple testing with feedback, where decisions are made sequentially and the true state of the hypothesis is revealed after the decision has been made, either instantly or with a delay. We propose GAIF, a feedback-enhanced generalized alpha-investing framework that dynamically adjusts thresholds using revealed outcomes, ensuring finite-sample false discovery rate (FDR)/marginal FDR control. Extending GAIF to online conformal testing, we construct independent conformal $p$-values and introduce a feedback-driven model selection criterion to identify the best model/score, thereby improving statistical power. We demonstrate the effectiveness of our methods through numerical simulations and real-data applications.

2507.14457 2026-04-14 stat.ME

Blurring Mean Shift for Clustering Functional Data: A Scalable Algorithm and Convergence Analysis

Toshinari Morimoto, Ting-Li Chen, Su-Yun Huang, Ruey S. Tsay

Comments Proofs are provided in the supplementary material

Abstract

This paper extends the blurring mean shift algorithm from vector-valued data to functional data, enabling effective clustering in infinite-dimensional settings without requiring specification of the number of clusters. To address the computational challenges posed by large-scale datasets, we introduce a fast stochastic variant that significantly reduces computational complexity. We provide a rigorous convergence analysis for the full blurring functional mean shift procedure, establishing theoretical guarantees for its iterative behavior. For the stochastic variant, we provide partial theoretical justification by showing that, when the subset size is sufficiently large, its one-step update is well approximated by the corresponding update of the full algorithm. The proposed method is demonstrated through real-data applications, including hourly Taiwan PM$_{2.5}$ measurements and Argo oceanographic profiles. Our key contributions include: (1) extending the blurring mean shift algorithm to functional data in a Hilbert-space setting; (2) developing a scalable stochastic variant based on random partitioning for large-scale data; (3) establishing convergence results for the full blurring functional mean shift algorithm; and (4) demonstrating the scalability and practical usefulness of the proposed method through simulation and real-data applications.
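
The vector-valued starting point that this abstract extends can be illustrated with a minimal sketch. The code below is not the authors' functional or stochastic algorithm: it is the classical blurring mean shift on scalar data, with a Gaussian kernel, a fixed bandwidth, and a fixed iteration count, all of which are assumptions chosen for illustration. The function name `blurring_mean_shift` is hypothetical.

```python
import math


def blurring_mean_shift(points, bandwidth, n_iter=30):
    """Classical blurring mean shift on scalar data (illustrative sketch).

    At every iteration each point is replaced by the Gaussian-kernel-weighted
    average of ALL current points, so the data set itself is moved ("blurred").
    """
    xs = list(points)
    for _ in range(n_iter):
        new_xs = []
        for x in xs:
            # Gaussian kernel weights between x and every current point.
            weights = [math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2)) for y in xs]
            total = sum(weights)
            new_xs.append(sum(w * y for w, y in zip(weights, xs)) / total)
        xs = new_xs  # all points are updated simultaneously
    return xs


# Two well-separated groups; the number of clusters is never specified.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
shifted = blurring_mean_shift(data, bandwidth=0.5)
```

In this toy run the points of each group collapse onto a common location, and clusters can then be read off by grouping points that coincide up to a tolerance.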

2503.11390 2026-04-14 math.ST math.PR stat.TH

On continuity of Chatterjee's rank correlation and related dependence measures

Jonathan Ansari, Sebastian Fuchs

Comments 23 pages, 2 figures; accepted for publication in 'Bernoulli'

Abstract

While measures of concordance -- such as Spearman's rho, Kendall's tau, and Blomqvist's beta -- are continuous with respect to weak convergence, Chatterjee's rank correlation xi recently introduced in Azadkia and Chatterjee (2021) does not share this property, causing drawbacks in statistical inference as pointed out in Bücher and Dette (2025). As we study in this paper, xi is instead weakly continuous with respect to conditionally independent copies -- the Markov products. To establish weak continuity of Markov products, we provide several sufficient conditions, including copula-based criteria and conditions relying on the concept of conditional weak convergence in Sweeting (1989). As a consequence, we also obtain continuity results for xi and related dependence measures and verify their continuity in the parameters of standard models such as multivariate elliptical and l1-norm symmetric distributions.
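
For readers unfamiliar with xi, a minimal sketch of its usual sample version may help. It assumes the standard no-ties formula, xi_n = 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1), where the pairs are sorted by x and r_i is the rank of the i-th y value in that order. The function name `chatterjee_xi` is hypothetical, and the code does not implement any result of the paper.

```python
def chatterjee_xi(x, y):
    """Sample version of Chatterjee's rank correlation, assuming no ties."""
    n = len(x)
    # Reorder the y values according to increasing x.
    order = sorted(range(n), key=lambda i: x[i])
    y_sorted = [y[i] for i in order]
    # r_i = number of observations j with y_j <= y_i.
    ranks = [sum(1 for v in y_sorted if v <= yi) for yi in y_sorted]
    # Sum of absolute rank differences between consecutive observations.
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


# A perfectly monotone relationship with n = 5 observations.
xi_monotone = chatterjee_xi([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

For a strictly monotone relationship the sum of rank differences is n - 1, so the statistic equals 1 - 3/(n + 1), which tends to 1 as n grows.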

2503.02983 2026-04-14 stat.ML cs.LG

BLADE: Bayesian Langevin Active Discovery with Replica Exchange for Identification of Complex Systems

Cindy Xiangrui Kong, Haoyang Zheng, Guang Lin

Abstract

Traditional methods for system discovery frequently struggle with efficient data usage and uncertainty quantification. Identifying the governing equations of complex dynamical systems from data presents a significant challenge in scientific discovery, especially when high-quality measurements are scarce and expensive to obtain. To overcome these limitations, we propose Bayesian Langevin Active Discovery with Replica Exchange for Identification of Complex Systems (BLADE), a novel Bayesian framework that combines replica-exchange stochastic gradient Langevin Monte Carlo with active learning. By balancing gradient-driven exploration and exploitation in coefficient space, BLADE provides probabilistic parameter estimation and principled uncertainty quantification. Faced with data scarcity, the probabilistic foundation of BLADE further facilitates the integration of active learning through a hybrid acquisition strategy that combines predictive uncertainty with space-filling design, enabling efficient selection of informative samples. Across benchmark systems, BLADE reduces measurement requirements by roughly 60% for Lotka-Volterra and 40% for Burgers' equation relative to random sampling, demonstrating substantial data-efficiency gains. These results highlight BLADE as a general uncertainty-aware framework for discovering interpretable dynamical systems, particularly valuable when high-fidelity data acquisition is prohibitively expensive.

2502.10849 2026-04-14 stat.ME

Dynamic spectral co-clustering of directed networks to unveil latent community paths in VAR-type models

Younghoon Kim, Changryong Baek

Comments This paper is withdrawn due to an error in the model specification. Specifically, the direction of propagation of the network structure in Section 2 was incorrectly defined. This issue also affects the simulation setup and model applications to real data in Sections 5 and 6, which were constructed based on the same specification. Consequently, the results reported in the paper may not be valid

Abstract

Identifying network Granger causality in large vector autoregressive (VAR) models enhances explanatory power by capturing complex dependencies among variables. This study proposes a methodology that explores latent community structures to uncover underlying network dynamics, rather than relying on sparse coefficient estimation for network construction. A dynamic network framework embeds directed connectivity in the transition matrices of VAR-type models, allowing the tracking of evolving community structures over time, called seasons. To account for network directionality, degree-corrected stochastic co-block models are fitted for each season, then a combination of spectral co-clustering and singular vector smoothing is utilized to refine transitions between latent communities. Periodic VAR (PVAR) and vector heterogeneous autoregressive (VHAR) models are adopted as alternatives to conventional VAR models for dynamic network construction. Theoretical results establish the validity of the proposed methodology, while empirical analyses demonstrate its effectiveness in capturing both the cyclic evolution and transient trajectories of latent communities. The proposed approach is applied to US nonfarm payroll employment data and realized stock market volatility data. Spectral co-clustering of multi-layered directed networks, constructed from high-dimensional PVAR and VHAR representations, reveals rich and dynamic latent community structures.

2501.18785 2026-04-14 stat.ME

Low-Rank Graphon Learning for Networks

Xinyuan Fan, Feiyan Ma, Chenlei Leng, Weichi Wu

Abstract

Graphons offer a powerful framework for modeling large-scale networks, yet estimation remains challenging. We propose a novel approach that leverages a low-rank additive representation, yielding both a low-rank connection probability matrix and a low-rank graphon--two goals rarely achieved jointly. Our method resolves identification issues and enables an efficient sequential algorithm based on subgraph counts and interpolation. We establish consistency and demonstrate strong empirical performance in terms of computational efficiency and estimation accuracy through simulations and data analysis.

2412.20495 2026-04-14 cs.CR cs.AI cs.LG stat.ML

A Multiparty Homomorphic Encryption Approach to Confidential Federated Kaplan Meier Survival Analysis

Narasimha Raghavan Veeraragavan, Svetlana Boudko, Jan Franz Nygård

Comments 58 pages

Abstract

The proliferation of real-world health data enables multi-institutional survival studies, yet privacy constraints preclude centralizing sensitive records. We present a privacy-preserving federated Kaplan--Meier framework based on threshold CKKS (Cheon-Kim-Kim-Song) homomorphic encryption that supports approximate floating-point computation and encrypted aggregation of per-time-point counts while exposing only public outputs. Sites compute aligned at-risk and event tallies on a shared time grid and encrypt compact vectors; a coordinator aggregates ciphertexts; and a decryptor committee produces partial shares fused per block to recover aggregated plaintexts without releasing per-time-point tables. We prove correctness, stability, and slot-optimal vector packing, and derive scaling laws showing that communication grows linearly with the number of sites and predictably with the number of time points. Empirically, using synthetic breast-cancer data (N=60,000) distributed across 500 sites, encrypted federated curves match the pooled oracle to numerical precision. In contrast, plaintext protocols permit trivial reconstruction by subtraction; our threshold-gated design precludes this attack under the stated threat model, enabling high-fidelity survival estimation with predictable overhead and substantially reduced privacy risk.
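
The aggregation step described above can be sketched in plaintext. The code below is only an illustration of the arithmetic: it sums per-site at-risk and event counts on a shared time grid and applies the standard Kaplan-Meier product formula. It performs no encryption, so the plain sums stand in for the encrypted aggregation, and the function name `pooled_kaplan_meier` and the toy counts are assumptions rather than anything taken from the paper.

```python
def pooled_kaplan_meier(site_at_risk, site_events):
    """Kaplan-Meier curve from per-site counts on a shared time grid.

    site_at_risk[s][j] and site_events[s][j] are the at-risk and event
    counts of site s at grid point j. Plain sums are used here; in the
    paper's setting the aggregation would be carried out on ciphertexts.
    """
    n_times = len(site_at_risk[0])
    at_risk = [sum(site[j] for site in site_at_risk) for j in range(n_times)]
    events = [sum(site[j] for site in site_events) for j in range(n_times)]
    survival, curve = 1.0, []
    for n_j, d_j in zip(at_risk, events):
        if n_j > 0:  # skip grid points where nobody is at risk
            survival *= 1.0 - d_j / n_j
        curve.append(survival)
    return curve


# Two hypothetical sites, three shared time points.
curve = pooled_kaplan_meier(
    site_at_risk=[[5, 4, 2], [5, 5, 3]],
    site_events=[[1, 1, 0], [0, 2, 1]],
)
```

Because the estimator depends on the sites only through the summed counts, the pooled curve can be recovered without any site revealing its own per-time-point table, which is the property the encrypted protocol relies on.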

2412.07469 2026-04-14 stat.ML cs.LG

Score-matching-based Structure Learning for Temporal Data on Networks

Hao Chen, Kai Yi

Abstract

Causal discovery is a crucial initial step in establishing causality from empirical data and background knowledge. Numerous algorithms have been developed for this purpose. Among them, the score-matching method has demonstrated superior performance across various evaluation metrics, particularly for the commonly encountered Additive Nonlinear Causal Models. However, current score-matching-based algorithms are primarily designed to analyze independent and identically distributed (i.i.d.) data. More importantly, they suffer from high computational complexity due to the pruning step required for handling dense Directed Acyclic Graphs (DAGs). To enhance the scalability of score matching, we have developed a new parent-finding subroutine for leaf nodes in DAGs, significantly accelerating the most time-consuming part of the process: the pruning step. This improvement results in an efficiency-lifted score matching algorithm, termed Parent Identification-based Causal structure learning for both i.i.d. and temporal data on networKs, or PICK. The new score-matching algorithm extends the scope of existing algorithms and can handle static and temporal data on networks with weak network interference. Our proposed algorithm can efficiently cope with increasingly complex datasets that exhibit spatial and temporal dependencies, commonly encountered in academia and industry. The proposed algorithm can accelerate score-matching-based methods while maintaining high accuracy in real-world applications.

2409.01599 2026-04-14 stat.ME math.ST stat.TH

Multivariate Inference of Network Moments by Subsampling

Mingyu Qi, Chen-Wei Hua, Tianxi Li, Wen Zhou

Abstract

Network moments--rescaled counts of motifs such as stars and triangles--are fundamental summaries of network structure, widely used in goodness-of-fit testing, model selection, and network comparison. While the univariate distribution of a single network moment can be approximated by subsampling, the consistency of subsampling for their joint distribution has remained unestablished. In this paper, we prove that node subsampling provides an asymptotically accurate approximation of the joint distribution of multiple network moments under a general sparse graphon model. The theoretical analysis requires a careful characterization of the dependence structure among network moments and the corresponding multivariate asymptotic convergence, going substantially beyond existing univariate results. Building on this foundation, we address a practically important open problem: two-sample testing between unmatchable networks with unequal edge densities. We propose a novel subsampling-based procedure that combines sparsification with a sample-splitting strategy. This yields the first subsampling-based inferential procedure valid for this setting, to our knowledge. We demonstrate the utility of multivariate subsampling inference through simulation studies and a real data application comparing coexpression networks of core and non-core genes in a study of parallel adaptation in Trinidadian guppies, where joint and conditional moment distributions reveal a structural difference that no marginal test can detect.

2402.10537 2026-04-14 stat.ME

Quantifying Individual Risk for Binary Outcomes

Peng Wu, Peng Ding, Zhi Geng, Yue Liu

Abstract

Understanding treatment effect heterogeneity is crucial for reliable decision-making in treatment evaluation and selection. The conditional average treatment effect (CATE) is widely used to capture treatment effect heterogeneity induced by observed covariates and to design individualized treatment policies. However, it is an average metric within subpopulations, which prevents it from revealing individual risk, potentially leading to misleading results. This article fills this gap by examining individual risk for binary outcomes, specifically focusing on the fraction negatively affected (FNA), a metric that quantifies the percentage of individuals experiencing worse outcomes under treatment compared with control. Even under the strong ignorability assumption, FNA is still unidentifiable, and the existing Fréchet--Hoeffding bounds are often too wide and attainable only under extreme data-generating processes. By invoking mild conditions on the value range of the Pearson correlation coefficient between potential outcomes, we obtain improved bounds compared with the Fréchet--Hoeffding bounds. We show that paradoxically, even with a positive CATE, the lower bound on FNA can be positive, i.e., in the best-case scenario, many individuals will be harmed if they receive treatment. Additionally, we establish a nonparametric sensitivity analysis framework for FNA using the Pearson correlation coefficient as the sensitivity parameter. Furthermore, we propose nonparametric estimators for the refined FNA bounds and prove their consistency and asymptotic normality. We use simulation to evaluate the performance of the proposed estimators and apply the method to a canonical observational study.
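
The baseline Fréchet--Hoeffding bounds mentioned above can be written down in a few lines. The sketch below assumes the convention that the binary outcome 1 is the favorable one, so that FNA = P(Y(1) = 0, Y(0) = 1), and that the two marginal probabilities are identified (for example under strong ignorability). The function name `fna_frechet_hoeffding_bounds` is hypothetical, and the code does not implement the paper's refined, correlation-based bounds.

```python
def fna_frechet_hoeffding_bounds(p_treated, p_control):
    """Frechet-Hoeffding bounds on FNA = P(Y(1) = 0, Y(0) = 1).

    p_treated = P(Y(1) = 1) and p_control = P(Y(0) = 1), with outcome 1
    assumed to be the favorable one (a convention chosen for this sketch).
    """
    # Lower bound: max(0, P(Y(1)=0) + P(Y(0)=1) - 1) = max(0, p_control - p_treated).
    lower = max(0.0, p_control - p_treated)
    # Upper bound: min(P(Y(1)=0), P(Y(0)=1)).
    upper = min(1.0 - p_treated, p_control)
    return lower, upper


# Positive average effect (0.6 vs 0.5): the bounds are wide.
lower, upper = fna_frechet_hoeffding_bounds(p_treated=0.6, p_control=0.5)
```

With these inputs the interval runs from 0 to 0.4, which illustrates why the abstract calls the Fréchet--Hoeffding bounds often too wide: the marginals alone cannot distinguish "nobody is harmed" from "40% are harmed".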

2401.03893 2026-04-14 math.OC stat.ML

Finite-Time Decoupled Convergence in Nonlinear Two-Time-Scale Stochastic Approximation

Yuze Han, Xiang Li, Zhihua Zhang

Abstract

In two-time-scale stochastic approximation (SA), two iterates are updated at varying speeds using different step sizes, with each update influencing the other. Previous studies on linear two-time-scale SA have shown that the convergence rates of the mean-square errors for these updates depend solely on their respective step sizes, a phenomenon termed decoupled convergence. However, achieving decoupled convergence in nonlinear SA remains less understood. Our research investigates the potential for finite-time decoupled convergence in nonlinear two-time-scale SA. We demonstrate that, under a nested local linearity assumption, finite-time decoupled convergence rates can be achieved with suitable step size selection. To derive this result, we conduct a convergence analysis of the matrix cross term between the iterates and leverage fourth-order moment convergence rates to control the higher-order error terms induced by local linearity. To further investigate the necessity of local linearity for decoupled convergence, we also construct an example showing that, even when the fast-time-scale update is linear, the nonlinearity of the slow-time-scale update alone can destroy decoupled convergence.

2207.11825 2026-04-14 stat.ME

Fast convergence rates for dose-response estimation

Matteo Bonvini, Edward H. Kennedy

Abstract

We consider the problem of estimating a dose-response curve. Continuous treatments arise often in practice, e.g. in the form of time spent on an operation, distance traveled to a location or dosage of a drug. Letting $A$ denote a continuous treatment variable, the target of inference is the expected outcome if everyone in the population takes treatment level $A=t$. Under standard assumptions, the dose-response function takes the form of a partial mean. Building upon the recent literature on nonparametric regression with estimated outcomes, our first contribution is to study global and local estimators of the dose-response based on empirical risk minimization. Our second and main contribution is to construct an $m^{\text{th}}$-order estimator based on the theory of higher-order influence functions. Under certain conditions, this higher-order estimator achieves the fastest rate of convergence that we are aware of for this problem. However, the other two approaches are easier to implement using off-the-shelf software, since they are formulated as two-stage regression tasks. For each estimator, we provide an upper bound on the mean-square error and investigate its finite-sample performance through simulations and an empirical application. Finally, the supplementary material introduces a flexible, nonparametric approach for sensitivity analysis to violations of the no-unmeasured-confounding assumption with continuous treatments.

1810.07793 2026-04-14 cs.LG stat.ML

The Wasserstein transform

Kun Jin, Facundo Mémoli, Zane Smith, Zhengchao Wan

Abstract

We introduce the Wasserstein Transform (WT), a general unsupervised framework for updating distance structures on given data sets with the purpose of enhancing features and denoising. Our framework represents each data point by a probability measure reflecting the neighborhood structure of the point, and then updates the distance by computing the Wasserstein distance between these probability measures. The Wasserstein Transform is a general method which extends the mean shift family of algorithms. We study several instances of WT, and in particular, in one of the instances which we call the Gaussian Transform (GT), we utilize Gaussian measures to model neighborhood structures of individual data points. GT is computationally cheaper than other instances of WT since there exists a closed-form solution for the $\ell^2$-Wasserstein distance between Gaussian measures. We study the relationship between different instances of WT and prove that each of the instances is stable under perturbations. We devise iterative algorithms for performing the above-mentioned WT and propose several strategies to accelerate GT, such as an observation from linear algebra for reducing the number of matrix square root computations. We examine the performance of the Wasserstein Transform method in many tasks, such as denoising, clustering, image segmentation and word embeddings.
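
The closed form referred to above is the standard expression W_2^2 = ||m_1 - m_2||^2 + tr(S_1 + S_2 - 2 (S_1^{1/2} S_2 S_1^{1/2})^{1/2}) for Gaussians with means m_1, m_2 and covariances S_1, S_2. The sketch below covers only the special case of diagonal covariances, where the trace term reduces to a sum of squared differences of standard deviations and no matrix square root is needed. The function name `w2_gaussian_diag` is hypothetical, and this is not the authors' Gaussian Transform algorithm.

```python
import math


def w2_gaussian_diag(mean1, var1, mean2, var2):
    """2-Wasserstein distance between Gaussians with DIAGONAL covariances.

    mean1, mean2 are lists of coordinates; var1, var2 are the lists of
    per-coordinate variances (the diagonals of the covariance matrices).
    """
    # Squared Euclidean distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mean1, mean2))
    # For commuting covariances the trace term is sum (sqrt(u) - sqrt(v))^2.
    cov_term = sum((math.sqrt(u) - math.sqrt(v)) ** 2 for u, v in zip(var1, var2))
    return math.sqrt(mean_term + cov_term)


# Same mean, isotropic variances 1 and 4 in two dimensions.
dist = w2_gaussian_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [4.0, 4.0])
```

With equal covariances the distance reduces to the Euclidean distance between the means, which gives a quick sanity check for any implementation of the general formula.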

2604.10353 2026-04-14 stat.ME math.ST stat.TH

Uncertainty Quantification for Noisy Low-tubal-rank Tensor Completion

Jiuqian Shang, Jingyang Li, Yang Chen

Comments 56 pages

Abstract

High-dimensional tensor data often exhibit strong temporal correlations that appear as low-dimensional structures in the frequency domain. While the low-tubal-rank tensor model effectively captures these spectral features, making it potentially suitable for geophysical data, existing methods primarily focus on point estimation. Uncertainty quantification (UQ) of imputed values and rigorous statistical inference for these models remain largely unexplored. In this work, we propose a flexible inference framework for linear forms of high-dimensional tensors. Employing a double-sample debiasing technique followed by a low-rank projection, we construct asymptotically Gaussian estimators that yield valid statistical inference under mild assumptions. More precisely, we can perform hypothesis testing and construct confidence intervals with this result. We validate the theoretical results through extensive simulations and demonstrate the practical effectiveness of our method in completing the global total electron content data. We demonstrate, using those numerical results, that our entrywise confidence intervals are robust and reliable, yielding informative uncertainty quantification that captures underlying variability.

2604.10308 2026-04-14 stat.ME stat.AP

Considerations for the Integration of Randomized Controlled Trials and Real-World Data

Sky Qiu, Charles Barr, Lauren Dang, Issa Dahabreh, Larry Han, Kajsa Kvist, Hana Lee, Andrew Mertens, Nerissa Nance, Lei Nie, Kara Rudolph, Xu Shi, Jens Tarp, Salina P. Waddy, Kenneth Wiley, Andy Wilson, Margot Lisa Jing Yann, Zhiwei Zhang, Tianyue Zhou, Maya Petersen, Mark van der Laan

Abstract

As clinical decision-making increasingly moves toward individualized and context-specific treatment recommendations, reliance on any single evidence source, randomized or observational, may be insufficient. Principled integration of randomized controlled trials and real-world data, grounded in explicit causal frameworks, offers a path toward evidence that is both internally credible and externally relevant. In this article, we describe distinct objectives for the integration of randomized controlled trials and real-world data and discuss how these objectives shape key design and analytic considerations, illustrating the resulting choices through example estimands. We highlight practical issues that commonly arise in applied settings, including data relevance and curation, cross-source comparability, estimand specification, and sensitivity analysis. We aim for this article to help readers evaluate and implement principled approaches to integrating randomized controlled trials and real-world data in ways that can support more reliable treatment recommendations while maintaining regulatory-grade evidentiary standards.

2604.10249 2026-04-14 stat.ME stat.CO

Gaussian Graphical Models for Functional Connectivity Analysis: A Statistical Review with Applications to Alzheimer's Disease

Panpan Zhang, Shiying Xiao, W. Hudson Robb, Dandan Liu, Angela L. Jefferson, Jun Yan

Abstract

Functional connectivity analysis is an important tool for characterizing interactions among brain regions, particularly in studies of neurodegenerative disorders such as Alzheimer's disease (AD). Gaussian graphical models (GGMs) provide a promising statistical framework for estimating functional connectivity by capturing conditional dependence relationships among brain regions. Although a variety of regularized precision matrix estimators have been proposed to estimate sparse conditional dependency structures for GGMs, their comparative performance and practical implications for neuroimaging studies are not well understood. In this work, we present a comprehensive statistical review and empirical evaluation of widely used GGM estimation methods, including the graphical lasso (glasso), ridge-based glasso, graphical elastic net, adaptive glasso, smoothly clipped absolute deviation (SCAD), minimax concave penalty (MCP), constrained $\ell_1$ minimization for inverse matrix estimation (CLIME), and tuning-insensitive graph estimation and regression (TIGER). Their performance is evaluated through extensive data-driven simulations designed to reflect realistic neuroimaging settings, along with an application to an AD cohort study to illustrate methodological differences and their impact on downstream network analysis. In addition, a user-friendly R package, spice, is provided to facilitate implementation and enhance the reproducibility of empirical studies.

2604.10232 2026-04-14 econ.EM math.ST stat.TH

Gaussian approximation for maximum score and non-smooth M-estimators with multiway dependence

Harold D. Chiang, Ahnaf Rafi

Abstract

The maximum score estimator of Manski (1975) provides an elegant approach to estimating slope coefficients in binary choice models without requiring parametric assumptions on the error distribution. However, under i.i.d. sampling, it admits a non-Gaussian limiting distribution and exhibits cube-root asymptotics, which complicates statistical inference. We show that, under multiway dependence, the maximum score estimator attains asymptotic normality at a parametric rate. We obtain this surprising result through the development of a general M-estimation theory that accommodates non-smooth objective functions under multiway dependence. We further propose and establish the validity of a bootstrap procedure for inference.

2604.10205 2026-04-14 math.ST stat.TH

Normalized Likelihood Criteria for Model Selection in the Stochastic Block Model

Andressa Cerqueira, Felipe Baptistão

Abstract

Estimating the number of communities is a fundamental problem in network analysis under the stochastic block model (SBM). In this paper, we study penalized estimators for this task based on normalized likelihood criteria. We show that a penalized estimator derived from the Normalized Maximum Likelihood (NML) is strongly consistent with a logarithmic penalty term, although its computation is intractable. To overcome this limitation, we consider the Normalized Maximum Complete Likelihood (NMCL) and the Decomposed Normalized Maximum Likelihood (DNML). The DNML admits an explicit formulation with cubic computational complexity in the number of nodes. We prove that the NMCL- and DNML-based estimators are strongly consistent for sparse networks in which the average node degree diverges with the network size. Empirical results show that the DNML estimator performs competitively with existing methods, particularly in unbalanced networks.

2604.10178 2026-04-14 stat.ME

Bayesian Distance-to-Set Models: from Latent Variable to Latent Projection

Leo L Duan, Yuexi Wang, Jason Xu

Abstract

Statistical models often assume that data are generated near a structured, smooth, or low-dimensional set. A common approach is to use Bayesian latent variable models, in which each observation is associated with a latent coordinate on the set, and the observed data are modeled as noisy deviations from these coordinates. The deviation is typically characterized by a location-scale distribution, such as Gaussian. Despite their intuitive appeal and popularity, latent variable models often present practical challenges in posterior computation. In particular, Markov chain Monte Carlo samplers may suffer from slow mixing, especially when the sample size is large and there is no closed form for integrating out the latent coordinates. In this article, we propose an alternative approach that replaces the deviation-from-coordinate with a distance-to-set. Specifically, the distance-to-set is defined as the distance between a data point and its projection onto the set, where the projection can be rapidly computed by optimization and replaces the latent coordinate in the likelihood. This change substantially reduces the dimensionality of the parameter $\times$ latent variable space, leading to efficient posterior computation. We establish several important statistical properties for the distance-to-set models, such as the independence between the normal-cone noise and fixed-effect parameters, posterior consistency, and an Occam's razor effect that automatically penalizes overfitting. We demonstrate the effectiveness of our approach through simulation studies and applications to a multi-environment study and Bayesian transfer learning.

2604.10088 2026-04-14 stat.ME

Cox Model Predicting Covariate Subject to Right Censoring

Chen-Yen Lin, Susan Halabi, Taehwa Choi

Abstract

Time-to-event endpoints are frequently used as outcomes in oncology and other disease areas where the outcome of interest may not be observed within a predetermined period. Although many analytical methods address the challenges of censoring in outcomes, limited research has focused on censored covariates. Conventional methods such as the complete case (CC) analysis, where data from patients with censored covariates are discarded, suffer from efficiency loss and potential bias due to reduced sample size. Alternatively, imputing censored covariates with a constant value can underestimate variability. Recognizing these limitations, novel estimation procedures within the generalized linear model framework have been proposed, with some research emerging in time-to-event outcomes. In this paper, we investigate the association between progression-free survival and overall survival using a semi-parametric Cox model framework. We modify the Cox model's partial likelihood function to account for censored covariates by replacing the relative risk associated with censored covariates with a weighted average of patients with observed covariates. The performance of the proposed method is demonstrated through simulations and applications to two oncology clinical trials. Results indicate that the proposed method offers improved estimation efficiency and better utilization of available data compared to other approaches.

2604.10018 2026-04-14 stat.ME

Inference from multivariate differential recruitment in respondent-driven sampling data

Vanesa Reinoso, Danilo Alvares, Jonathan Acosta, Isabelle S. Beaudry

Abstract

Respondent-Driven Sampling (RDS) is a chain-referral design used for collecting data from hidden or hard-to-reach populations through their social networks. In RDS, respondents recruit their peers from the population of interest. As such, inference with RDS data commonly relies on estimated sampling probabilities derived from specific recruitment assumptions. Early literature assumes random recruitment, which is often unrealistic because individuals may recruit based on their personal preferences. This behavior is known as Differential Recruitment (DR). Recent works have incorporated univariate categorical DR in the estimation procedures. The main objective of this paper is to introduce Multivariate Differential Recruitment (MDR), a framework that incorporates multiple simultaneous covariates, both categorical and continuous, into the sampling representation. We model RDS as a Markov process with transition probabilities that depend on continuous or categorical variables associated with nodes or their ties. We then extend various prevalence estimators to this multivariate framework and implement a slightly modified neighborhood bootstrap for variance estimation. The proposed methodology is assessed through simulation studies for a range of network and sampling features. It is applied to an RDS study conducted among the adult Venezuelan population living in the Metropolitan Region of Santiago, Chile.

2604.09953 2026-04-14 stat.ME math.ST stat.TH

Partial correlation networks of Gaussian processes

Michele Peruzzi

Abstract

In Gaussian graphical models, conditional independence and partial correlations are natural inferential targets for understanding direct relationships in multivariate data. No comparable framework exists for spatial processes, where multivariate analysis defaults to modeling unconditional cross-covariance structure, even when direct relationships remain of scientific interest. We address this gap by establishing a novel characterization of process-level partial correlation for multivariate Gaussian processes that recovers a direct link with Gaussian graphical models. Our analysis proceeds through a class of stationary multivariate processes, termed spectrally inside-out, in which a precision matrix modulates the strength of conditional dependence and yields necessary and sufficient conditions for conditional independence. Within this class, partial cross-correlation functions factorize into a process-level partial correlation coefficient and an attenuation term independent of cross-process parameters. The spectrally inside-out class includes the separable coregionalization model, a process convolution construction, and the parsimonious multivariate Matérn, for which such a characterization was previously thought unavailable. We further show that a nonstationary inside-out model satisfies the same factorization and admits the same necessary and sufficient conditions. Our results clarify the limitations of existing approaches: linear coregionalization models encode conditional independence through the zero pattern of the inverse factor loading matrix and do not result in interpretable partial cross-correlation functions. Low-rank spatial factor models lack a meaningful graphical characterization. Methods that enforce network structure through auxiliary graphical layers only characterize presence or absence of graph edges. We illustrate our results through synthetic and real data.

2604.09950 2026-04-14 math.ST math.PR stat.TH

On a copula product linking Wasserstein correlations and rearranged dependence measures

Jonathan Ansari

Comments 22 pages; 3 figures

Abstract

Recent research in statistics has focused on dependence measures kappa(Y,X) taking values in [0, 1], where 0 characterizes independence of X and Y, and 1 perfect functional dependence of Y on X. One class of such measures consists of the optimal transport-based Wasserstein correlations introduced by Wiesel. Another class comprises the rearranged dependence measures studied by Strothmann, Dette, and Siburg. While the constructions of Wasserstein correlations and rearranged dependence measures seem to be fundamentally different, we show that they are connected by a copula product T (C) = C v Π that models conditional comonotonicity. As a main contribution, we prove that the mapping T acts as a reflection on the class of stochastically increasing copulas, whereas T^2 = T \circ T projects a copula onto its increasing rearranged copula. We further study fixed points, ordering results, and continuity properties of T to better understand the interplay between these classes of dependence measures. Our results demonstrate that conditional comonotonicity is an intrinsic feature of dependence measures, whereas conditional independence underlying Chatterjee's rank correlation is a rather exceptional property.

2604.09913 2026-04-14 stat.ME stat.ML

Performance of weakly-supervised electronic health record-based phenotyping methods in rare-outcome settings

Yunjing Hong, Jennifer C. Nelson, Brian D. Williamson

Comments 58 pages, 4 main figures, 3 supplemental figures, 4 main tables, 17 supplemental tables

Abstract

Accurately identifying patients with specific medical conditions is a key challenge when using clinical data from electronic health records. Our objective was to comprehensively assess when weakly-supervised prediction methods, which use silver-standard labels (proxy measures of the true outcome) rather than gold-standard true labels, perform well in rare-outcome settings like vaccine safety studies. We compared three methods (PheNorm, MAP, and sureLDA) that combine structured features and features derived from clinical text using natural language processing, through an extensive simulation study with data-generating mechanisms ranging from simple to complex, varying outcome rates, and varying degrees of informative silver labels. We also considered using predicted probabilities to design a chart review validation study. No single method dominated the others across all prediction performance metrics. Probability-guided sampling selected a cohort enriched for patients with more mentions of important concepts in chart notes. SureLDA, the most complex of the three algorithms we considered, often performed well in simulations. Performance depended greatly on selected tuning parameters. Care should be taken when using weakly-supervised prediction methods in rare-outcome settings, particularly if the probabilities will be used in downstream analysis, but these methods can work well when silver labels are strong predictors of true outcomes.

2604.09910 2026-04-14 stat.ME

Mixed Membership Models for Multilevel Functional Data

Donatello Telesca, Nicholas Marco, Emma Landry

Abstract

Mixed membership models extend classical clustering by substituting the notion of uncertain membership with the notion of mixed membership. In particular, these models allow each observation to partially belong to multiple pure membership classes. We discuss mixed membership models for functional data by extending the framework to multilevel functional observations. We show how the classical multivariate Karhunen-Loeve decomposition can be translated into a simple hierarchical model for scalable and flexible expressivity of the underlying stochastic processes. The identifiability of partial membership structures is aided by the definition of a hierarchical repulsive prior on the unitary simplex. Our work is motivated and illustrated by applications to a study on functional brain imaging through electroencephalography (EEG) of children with autism spectrum disorder (ASD).

2604.09909 2026-04-14 cs.LG cs.NA math.NA math.OC stat.ML

Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size

Michał Dereziński, Xiaoyu Dong

Abstract

We study last-iterate convergence of SGD with greedy step size over smooth quadratics in the interpolation regime, a setting which captures the classical Randomized Kaczmarz algorithm as well as other popular iterative linear system solvers. For these methods, we show that the $t$-th iterate attains an $O(1/t^{3/4})$ convergence rate, addressing a question posed by Attia, Schliserman, Sherman, and Koren, who gave an $O(1/t^{1/2})$ guarantee for this setting. In the proof, we introduce the family of stochastic contraction processes, whose behavior can be described by the evolution of a certain deterministic eigenvalue equation, which we analyze via a careful discrete-to-continuous reduction.
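
For orientation, the classical Randomized Kaczmarz iteration mentioned in this abstract can be sketched as below. This is a minimal illustration, not the authors' code. It assumes a consistent system A x = b (the interpolation regime), rows sampled with probability proportional to their squared norms, and it reads "greedy step size" as the exact projection step 1/||a_i||^2 for the sampled row; that reading, the function name `randomized_kaczmarz`, and its parameters are assumptions made for this example.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=2000, seed=0):
    """Return the last iterate of Randomized Kaczmarz for a consistent A x = b.

    Each step is an SGD step on f_i(x) = 0.5 * (a_i @ x - b_i)**2 with step
    size 1 / ||a_i||^2, i.e. a projection onto the hyperplane a_i @ x = b_i.
    (Treating this as the "greedy" step size is an assumption of this sketch.)
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms_sq = np.einsum("ij,ij->i", A, A)   # squared norm of every row
    probs = row_norms_sq / row_norms_sq.sum()    # sample rows by squared norm
    x = np.zeros(n)
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)
        residual = A[i] @ x - b[i]
        # Project the current iterate onto the i-th hyperplane.
        x = x - (residual / row_norms_sq[i]) * A[i]
    return x  # the last iterate, with no averaging
```

The function returns the final iterate rather than an average of iterates, which is the quantity that a last-iterate guarantee refers to.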

2604.09902 2026-04-14 stat.ME stat.AP

crumble: A comprehensive framework for modern causal mediation analysis with intermediate confounding

Richard Liu, Nicholas T. Williams, Kara E. Rudolph, Ivan Diaz

Abstract

Causal mediation analysis is widely used to investigate how causal effects operate through specific pathways linking treatments or exposures to outcomes. Recently, crumble was developed to enable nonparametric estimation of several mediation parameters, even when mediators are continuous and/or multi-dimensional or when treatments are non-binary. But a practical and accessible guide to using crumble -- one that does not require deep familiarity with mediation analysis or semiparametric theory -- is currently lacking. This tutorial aims to provide an accessible introduction to crumble while minimizing technical complexity. We first review the mediation parameters implemented in crumble -- natural direct and indirect effects, randomized interventional effects, and recanting-twin effects. For each, we give the definition, interpretation, identification assumptions, and suitability in the presence or absence of intermediate confounding. Then, we demonstrate the usage of crumble by examining an example configuration. Next, we describe how crumble accommodates non-binary treatments through modified treatment policies. Finally, we illustrate the practical use of crumble through two case studies -- one with a binary treatment and one with a non-binary treatment -- based on the Job Search Intervention Study data.

2604.09898 2026-04-14 stat.AP

Evaluating the impact of longitudinal treatment strategies in the presence of informative monitoring and time-dependent confounding

Leah Pirondini, Karla Diaz-Ordaz, Edward Palmer, Ruth H. Keogh

Abstract

Routinely collected data from electronic health records (EHR) provide opportunities to study effects of longitudinal treatment strategies in real-world clinical settings. A challenge presented by EHR data is that frequency of covariate monitoring differs by patient, covariate type and over time, and may be informative about a patient's health status. Many causal inference methods assume measurements of covariates are observed at a common set of regular time points. In this paper we describe and evaluate methods for estimating causal effects of longitudinal treatments on time-to-event outcomes in the presence of informative monitoring of time-dependent confounders. We show how methods based on inverse probability weighting, G-computation and longitudinal targeted maximum likelihood estimation (TMLE) can be adapted to allow for informative monitoring by incorporating monitoring indicator variables as additional time-dependent confounders. We evaluate these methods using a simulation study, comparing against simpler approaches that ignore monitoring variables. We demonstrate that ignoring monitoring can result in biased estimates of treatment effects. The methods are illustrated through an investigation into the effect of early versus delayed initiation of invasive mechanical ventilation on mortality of intensive care patients using routinely-collected data from an intensive care unit. We consider static treatment strategies such as 'always treat' and 'never treat' but also generalise to treatment strategies that allow for flexibility in the exact initiation time and duration of treatment.

2604.09895 2026-04-14 stat.AP physics.data-an

Blume-Capel model: Estimation of a three stable state network for $-\bf 1$, $\bf 0$ and $\bf +1$ data

Lourens Waldorp, Jonas Dalege, Maarten Marsman, Adam Finnemann, Irene Ferri, Han L. J. van der Maas

Abstract

An extension of the Ising model is proposed as a viable alternative for data with values $-1$, $0$ and $+1$ in the inverse problem, i.e., estimation of the parameters. This model is called the Blume-Capel (BC) model, adapted from physics for small networks. The advantage of the BC model is not only the fact that it is possible to have a neutral (centrist) position on the response scale, but also that this model allows for three stable states. We illustrate magnetisation properties of the BC model using simulations and mean field results. For estimation of the BC parameters, we show that the BC model is part of the exponential family of distributions and show that the model is identified, except for the (inverse) temperature. We then show that combining pseudo-likelihood with lasso yields accurate parameter recovery for the BC model, even in small networks. Moreover, confidence intervals with good coverage properties can be obtained using the desparsified lasso together with sandwich and shrinkage techniques. We apply the methods to data obtained from the online platform Stemwijzer, intended to aid people in deciding for whom to vote.

2604.09858 2026-04-14 econ.EM stat.ME

Coupling Designs for Randomized Experiments with Complex Treatments

Max Cytrynbaum, Fredrik Sävje

Abstract

We describe a new family of coupling designs, extending the basic principle of stratified randomization to experiments with continuous, constrained multivariate, text/image and other irregular treatment spaces. Our approach is to first match units into homogeneous groups, then use Monte Carlo coupling techniques to assign within-group treatments that are highly dispersed over the treatment space. We show that ensuring similar experimental units receive highly dissimilar treatments generically improves estimation efficiency. In particular, the efficiency gains from a coupling design are proportional to the product of dispersion and match quality, where dispersion measures how spread out the treatment assignments are under a given coupling relative to independent randomization. We develop a new spectral analysis, revealing how efficiency depends on a match between the smoothness and shape of the estimator's influence function and the principal directions of a given coupling. We illustrate how coupling designs work in practice using a cash transfer experiment in development economics and a discrete-choice experiment in two-sided marketplaces.

2604.09779 2026-04-14 stat.ME

Inference conditional on selection: a review

Anna Neufeld, Ronan Perry, Daniela Witten

Abstract

In this article, we review selective inference, a set of techniques for inference when the statistical question asked is a function of the data. This setting often arises in contemporary scientific workflows, where hypotheses and parameters may be selected from the data, rather than specified in advance. In this setting, classical inferential techniques do not achieve "classical" guarantees, such as nominal coverage of confidence intervals. We consider three examples for which selective inference solutions are required: inference on a "winner", inference on the mean of a region in a regression tree, and inference on the difference in means between a pair of clusters. We argue that conditional guarantees are of scientific interest in such settings. We then review and draw connections between several approaches that provide such guarantees. Finally, we illustrate these approaches in simulation and through an application to single-cell RNA sequencing data.
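
The failure of classical guarantees for a "winner" can be seen in a small simulation. The sketch below is illustrative only and is not taken from the review: it assumes every group has true mean 0 and contributes a single N(0, 1) estimate, takes the largest estimate as the winner, and checks how often the naive interval (estimate ± 1.96) covers the winner's true mean. The function name and its parameters are hypothetical.

```python
import numpy as np

def naive_winner_coverage(num_groups=20, num_reps=5000, z=1.96, seed=0):
    """Empirical coverage of the naive interval around the selected winner.

    Assumed setup: all true means are 0 and each group yields one N(0, 1)
    estimate. The winner is the group with the largest estimate, and the
    naive interval is winner_estimate +/- z, ignoring the selection step.
    """
    rng = np.random.default_rng(seed)
    estimates = rng.normal(loc=0.0, scale=1.0, size=(num_reps, num_groups))
    winners = estimates.max(axis=1)        # selected estimate per replication
    covered = np.abs(winners - 0.0) <= z   # does the interval contain the true mean 0?
    return covered.mean()
```

With a single group there is no selection and the coverage is close to the nominal 95%. With 20 groups, the interval covers only when all 20 estimates fall below 1.96, which happens with probability about 0.975^20 ≈ 0.60 (ignoring the negligible chance that the maximum falls below -1.96). Conditional methods of the kind reviewed in the article are designed to restore the nominal guarantee after such a selection.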

2604.09754 2026-04-14 stat.AP

Surface temperature extremes produced by huge machine learning hindcasts of summer 2023

Mark Risser, Ankur Mahesh, Joshua North, William D. Collins, Boris Bonev, Karthik Kashinath, Thorsten Kurth, Shashank Subramanian, Michael S. Pritchard

Abstract

The summer of 2023 was the second hottest on record, with numerous extreme heatwaves across the globe. Using the Spherical Fourier Neural Operator machine learning (ML) weather model, we generated a massive ensemble of 7,424 weather scenarios simulating summer temperature extremes. The ML ensemble produced extreme heatwave scenarios exceeding temperatures from reanalysis and numerical weather prediction ensembles. Our results show that the ML model's extreme surface temperatures were not unusual for approximately two-thirds of the global land area. However, for the other one-third, ML-generated extreme events were well outside the prediction envelope from extrapolating smaller ensembles with extreme value theory. Furthermore, the ML ensemble readily generates storyline simulations of humid heat extremes, which yield more dangerous categories of public safety alerts than can be simulated from smaller ensembles. This research highlights the potential of huge ensemble simulations to improve understanding and prediction of both humid and dry temperature extremes.

2604.09661 2026-04-14 physics.ao-ph nlin.AO physics.data-an stat.ME

Multistability and intermingledness in complex high-dimensional data

George Datseris, Johannes Lohmann, Oisín Hamilton, Jacob Haqq-Misra

Abstract

Multistability is a phenomenon prevalent in many natural systems. In climate, for example, it allows the possibility of irreversible consequences on planetary scale as a result of climate change. Indeed, a climate "tipping element" is a multistable component that can undergo a transition to an alternative steady state due to an external perturbation. Despite the potential impact, multistability in realistic, complex simulations (e.g. climate models) remains poorly understood. Arguably a reason for this is the lack of applicable methodology that explicitly targets finite yet high-dimensional datasets. In this work we utilize recent progress in computational nonlinear dynamics to formulate a workflow that analyses potentially multistable simulation data and decides algorithmically which alternative steady states are contained within, if any. The framework includes an optimization routine that shows which observables in the data best differentiate the alternative states, and which ones do not differentiate at all, which could be used to guide monitoring and early-warning for multistable components in climate or ecosystems. Finally, once the alternate states have been found, we define an indicator called "intermingledness". It quantifies differences and similarities between alternate states, as well as between their basins of attraction, across various diagnostic variables. We analyse and present results using three diverse climate datasets: Atlantic ocean circulation, atmospheric midlatitude flow, and habitability of exoplanets. We also provide easy-to-use open source code for applying the workflow to new data.

2604.09660 2026-04-14 stat.AP

Overdispersed and Markovian Children

Nils Lid Hjort

Comments 18 pages, 11 figures. Statistical Research Report, Department of Mathematics, University of Oslo, April 2026. The material is up to research level, for some of the details, but most of it can be read at Master's level statistics (and, indeed, above)

Abstract

Take a look around you -- in your family, your school or workplace, in the streets, and you see boys & girls in about equal proportion, and without any easily visible gender patterns in case of siblings. So, to the famous first order of statistical approximation, we're all the results of hierarchical cascades of independent coin tosses through history, with each little fate determined by a 0.50-0.50 coin. This is not entirely correct, as one discovers with careful analysis and enough data: the coins of fate are (a little) imbalanced; they vary (a little) from family to family; there is a (slight) dependence in your children's gender sequence; and there are (slightly) more only-girls and only-boys families than predicted from binomial conditions. In this article I use the opportunity to talk also about how sample sizes influence p-values and statistical detection power.

2604.09656 2026-04-14 cs.LG cs.AI stat.AP stat.ME

Fairboard: a quantitative framework for equity assessment of healthcare models

James K. Ruffle, Samia Mohinta, Chris Foulon, Mohamad Zeina, Zicheng Wang, Sebastian Brandner, Harpreet Hyare, Parashkev Nachev

Comments 30 pages, 6 figures, 109 extended data figures (ancillary file)

Abstract

Despite there now being more than 1,000 FDA-authorised AI medical devices, formal equity assessments -- whether model performance is uniform across patient subgroups -- are rare. Here, we evaluate the equity of 18 open-source brain tumour segmentation models across 648 glioma patients from two independent datasets (n = 11,664 model inferences) along distinct univariate, Bayesian multivariate, spatial, and representational dimensions. We find that patient identity consistently explains more performance variance than model choice, with clinical factors, including molecular diagnosis, tumour grade, and extent of resection, predicting segmentation accuracy more strongly than model architecture. A voxel-wise spatial meta-analysis identifies neuroanatomically localised biases that are compartment-specific yet often consistent across models. Within a high-dimensional latent space of lesion masks and clinic-demographic features, model performance clusters significantly, indicating that the patient feature space contains axes of algorithmic vulnerability. Although newer models tend toward greater equity, none provide a formal fairness guarantee. Lastly, we release Fairboard, an open-source, no-code dashboard that lowers barriers to equitable model monitoring in medical imaging.

2604.09614 2026-04-14 cs.AI cs.IT math.IT math.ST stat.TH

The Geometry of Knowing: From Possibilistic Ignorance to Probabilistic Certainty -- A Measure-Theoretic Framework for Epistemic Convergence

Moriba Kemessia Jah

Abstract

This paper develops a measure-theoretic framework establishing when and how a possibilistic representation of incomplete knowledge contracts into a probabilistic representation of intrinsic stochastic variability. Epistemic uncertainty is encoded by a possibility distribution and its dual necessity measure, defining a credal set bounding all probability measures consistent with current evidence. As evidence accumulates, the credal set contracts. The epistemic collapse condition marks the transition: the Choquet integral converges to the Lebesgue integral over the unique limiting density. We prove this rigorously (Theorem 4.5), with all assumptions explicit and a full treatment of the non-consonant case. We introduce the aggregate epistemic width W, establish its axiomatic properties, provide a canonical normalization, and give a feasible online proxy resolving a circularity in prior formulations. Section 7 develops the dynamics of epistemic contraction: evidence induces compatibility, compatibility performs falsification, posterior possibility is the min-intersection of prior possibility and compatibility, and a credibility-directed flow governs support geometry contraction. This is not belief updating. It is knowledge contraction. Probability theory is the limiting geometry of that process. The UKF and ESPF solve different problems by different mechanisms. The UKF minimizes MSE, asserts truth, and requires a valid generative model. The ESPF minimizes maximum entropy and surfaces what evidence has not ruled out. When the world is Gaussian and the model valid, both reach the same estimate by entirely different routes -- convergent optimality, not hierarchical containment. We prove this (Theorem 9.1) and compare both on a 2-day, 877-step orbital tracking scenario. Both achieve 1-meter accuracy. The UKF is accurate but epistemically silent. The ESPF is accurate and epistemically honest.

2603.19640 2026-04-14 stat.AP

Logistic-aided Huber M-estimator for robust GNSS positioning

Zhengdao Li, Penggao Yan, Li-Ta Hsu

Comments Submitted to IEEE Transactions on Aerospace and Electronic Systems

Abstract

This paper develops a logistic-aided Huber (LAH) M-estimator for robust GNSS positioning under long-tailed, multipath-affected measurement errors. The key idea is to leverage a logistic measurement error assumption and establish a one-to-one approximation between the logistic-based loglikelihood (i.e., quasi-log-cosh) and the Huber kernel by matching their score functions. This yields closed-form tuning rules for the scale and threshold parameters in the Huber estimator, grounded on logistic error statistical properties. We further show that the proposed LAH estimator preserves comparable efficiency and robustness to the connected logistic-based least quasi-log-cosh (LQLC) estimator. Both Monte Carlo simulations with long-tailed measurement errors and a one-hour urban GNSS dataset confirm that the proposed logistic-statistics-based tuning improves positioning accuracy and precision while suppressing large error spikes. Specifically, LAH reduces the 2D RMSE/STD by 28.03%/38.83% versus conventional 95%-efficiency-based Huber tuning in simulation, and reduces the overall 3D RMSE/STD by 4.85%/16.68% in real-world experiments while suppressing large positioning error spikes by up to 51%.

2603.05919 2026-04-14 cs.LG math.ST stat.ML stat.TH

Design Experiments to Compare Multi-armed Bandit Algorithms

Huiling Meng, Ningyuan Chen, Xuefeng Gao

Abstract

Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one trajectory, because the algorithm's decisions depend on all past interactions. Reliable inference therefore demands many independent restarts of the algorithm, making experimentation costly and delaying deployment decisions. We propose Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. We develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only $T + o(T)$ user interactions instead of $2T$ for a run of the treatment and control policies, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in $T$, whereas the estimator from a naïve design has a linearly-growing variance. Numerical experiments with UCB, Thompson Sampling, and $ε$-greedy policies confirm these theoretical gains.
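
The replay mechanism described in this abstract can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the environment is a Bernoulli bandit with hypothetical arm means, each recorded reward is replayed at most once and in the order it was recorded, and the two policies (`ucb_policy`, `greedy_policy`) and all function names are stand-ins chosen for this example.

```python
from collections import deque

import numpy as np

def ucb_policy(counts, sums, t):
    """UCB1-style index: try each arm once, then maximize mean plus bonus."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
    return int(np.argmax(sums / counts + bonus))

def greedy_policy(counts, sums, t):
    """Try each arm once, then always play the best empirical mean."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(sums / counts))

def run_policy(policy, num_arms, horizon, draw_reward):
    """Run `policy` for `horizon` rounds; `draw_reward(arm)` supplies rewards."""
    counts, sums = np.zeros(num_arms), np.zeros(num_arms)
    actions, rewards = [], []
    for t in range(horizon):
        arm = policy(counts, sums, t)
        reward = draw_reward(arm)
        counts[arm] += 1
        sums[arm] += reward
        actions.append(arm)
        rewards.append(reward)
    return actions, rewards

def artificial_replay(policy_a, policy_b, arm_means, horizon, seed=0):
    """Return (total reward of A, total reward of B, number of real queries)."""
    rng = np.random.default_rng(seed)
    num_arms = len(arm_means)
    env_queries = 0

    def query_env(arm):
        # One real user interaction with the (assumed Bernoulli) environment.
        nonlocal env_queries
        env_queries += 1
        return float(rng.random() < arm_means[arm])

    # Step 1: run the first policy on the real environment and record it.
    actions_a, rewards_a = run_policy(policy_a, num_arms, horizon, query_env)
    recorded = [deque() for _ in range(num_arms)]
    for arm, reward in zip(actions_a, rewards_a):
        recorded[arm].append(reward)

    # Step 2: run the second policy, replaying an unused recorded reward for
    # the chosen arm when one exists and querying the environment otherwise.
    def replay_or_query(arm):
        if recorded[arm]:
            return recorded[arm].popleft()
        return query_env(arm)

    _, rewards_b = run_policy(policy_b, num_arms, horizon, replay_or_query)
    return sum(rewards_a), sum(rewards_b), env_queries
```

For example, `artificial_replay(ucb_policy, greedy_policy, [0.3, 0.5, 0.7], horizon=500)` returns both total rewards together with the number of real interactions used, which lies between 500 and 1000. The more the two policies agree on which arms to play, the closer that count stays to the horizon.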

2601.20628 2026-04-14 stat.ML cs.LG

Sparse clustering via the Deterministic Information Bottleneck algorithm

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Comments Submitted to IFCS 2026 (8 pages total)

Abstract

Cluster analysis relates to the task of assigning objects into groups which ideally present some desirable characteristics. When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face unprecedented challenges. We present an information theoretic framework that overcomes the problems associated with sparse data, allowing for joint feature weighting and clustering. Our proposal constitutes a competitive alternative to existing clustering algorithms for sparse data, as demonstrated through simulations on synthetic data. The effectiveness of our method is established by an application on a real-world genomics data set.

2512.01070 2026-04-14 math.DG stat.ME

Covariance Estimation for Matrix-variate Data via Fixed-rank Core Covariance Geometry

Bongjung Sung

Comments 39 pages, 22 pages in the main text, 4 figures

Abstract

We study the geometry of the fixed-rank core covariance manifold arising from the Kronecker-core decomposition of covariance matrices. As shown in Hoff, McCormack, and Zhang (2023), every covariance matrix $Σ$ of $p_1\times p_2$ matrix-variate data uniquely decomposes into a separable component $K$ and a core component $C$. Such a decomposition also exists for rank-$r$ $Σ$ if $p_1/p_2+p_2/p_1<r$, with $C$ sharing the same rank. If this core $C$ exhibits a partial-isotropy structure, then a partial-isotropy rank-$r$ core is a non-trivial convex combination of a rank-$r$ core and $I_p$ for $p:=p_1p_2$, where the weight on $I_p$ measures the deviation of $Σ$ from separability. This motivates studying the geometry of the space of rank-$r$ cores, $\mathcal{C}_{p_1,p_2,r}^+$. We show that $\mathcal{C}_{p_1,p_2,r}^+$ is a smooth manifold, except for a measure-zero subset associated with canonical decomposability. When $r=p$, $\mathcal{C}_{p_1,p_2}^{++}:=\mathcal{C}_{p_1,p_2,p}^+$ is itself a smooth manifold. The geometric properties, including smoothness of the positive definite cone via separability and the Riemannian gradient and Hessian operator relevant to $\mathcal{C}_{p_1,p_2,r}^+$, are also derived. As an application, we propose a partial-isotropy core shrinkage estimator for matrix-variate data.

2511.17725 2026-04-14 stat.ME

A Unified Spatiotemporal Framework for Modeling Censored and Missing Areal Responses

Jose A. Ordoñez, Tsung-I Lin, Victor H. Lachos, Luis M. Castro

Abstract

We propose a new Bayesian approach for spatiotemporal areal data with censored and missing observations. The method introduces a flexible random effect that combines the spatial dependence structures of the Simultaneous Autoregressive (SAR) and Directed Acyclic Graph Autoregressive (DAGAR) models with a temporal autoregressive component. We demonstrate that this formulation extends both spatial models into a unified spatiotemporal framework, expressing them as Gaussian Markov random fields in their innovation form. The resulting model captures spatial, temporal, and joint spatiotemporal correlations in an interpretable way. Simulation studies show that the proposed model outperforms common ad hoc imputation strategies, such as replacing censored values with the limit of detection (LOD) or imputing missing data by the sample mean. We further apply the method to carbon monoxide (CO) concentration data from Beijing's air quality network, comparing the proposed DAGAR-AR model with the traditional Conditional Autoregressive (CAR) approach. The results indicate that while the CAR model achieves slightly better predictive performance, the DAGAR-AR specification offers clearer interpretability and a more coherent representation of the spatiotemporal dependence structure.

2510.02050 2026-04-14 stat.AP cs.LG

Multidata Causal Discovery for Statistical Hurricane Intensity Forecasting

Saranya Ganesh S, Frederick Iat-Hin Tam, Milton S. Gomez, Marie McGraw, Mark DeMaria, Kate Musgrave, Jakob Runge, Tom Beucler

Comments 20 pages, 8 Figures, 1 Table, SI; Manuscript following second peer review

Abstract

Improvements in statistical forecasts of tropical cyclone (TC) intensity are limited by complex nonlinear interactions and difficulty in identifying relevant predictors. Conventional methods prioritize correlation or fit, often overlooking confounding variables and limiting generalizability to unseen TCs. To address this, we leverage a multidata causal discovery framework with a replicated dataset based on the Statistical Hurricane Intensity Prediction Scheme (SHIPS) using ERA5 meteorological reanalysis. We conduct experiments to identify and select predictors causally linked to TC intensity changes. We then train multiple linear regression models to compare causal feature selection with correlation, random forest feature importance, and no feature selection, across five forecast lead times from 1 to 5 days (24 to 120 hours). Causal feature selection consistently outperforms the other approaches on unseen test cases, especially for lead times shorter than 3 days. Top causal features include vertical shear, mid-tropospheric potential vorticity, and surface moisture conditions, which are physically significant yet often underutilized in TC intensity predictions. We build an extended predictor set (SHIPS+) by adding selected features to the standard SHIPS predictors. SHIPS+ yields increased short-term predictive skill at lead times of 24, 48, and 72 hours. Adding nonlinearity using a multilayer perceptron further extends skill to longer lead times, despite our framework being purely regional and not requiring global forecast data. Operational SHIPS tests confirm that three of the six added causally discovered predictors improve forecast skill, with the largest gains at longer lead times. Our results demonstrate that causal discovery improves TC intensity prediction and pave the way toward more empirical forecasts.

2509.24904 2026-04-14 astro-ph.CO astro-ph.HE astro-ph.IM physics.data-an stat.CO

Graph-based Summary Statistics for Revealing the Stochastic Gravitational Wave Background in Pulsar Timing Arrays

M. Alakhras, S. M. S. Movahed

Comments 29 pages, 15 figures, 1 table. Matched with the published version. Including the revision in a part of method

Journal ref
The Astrophysical Journal 999.2 (2026): 226
Abstract

In this work, we propose a graph-based method applied to pulsar timing residuals (PTRs) for detecting the stochastic gravitational wave background (SGWB) in the nano-Hertz frequency regime and examining the uncertainties of its parameters. We construct a correlation graph with pulsars as its nodes and analyze graph-based summary statistics, including structural characteristics of complex networks, to identify the SGWB in real and synthetic datasets. We evaluate the effect of the number of pulsars, the observation time span, and the strength of the SGWB on the graph-based feature vector. Our results demonstrate that the discriminative summary statistics for common signal detection consist of the average clustering coefficient and the edge weight fluctuation. SGWB detection, conducted after a common signal is observed and non-Hellings & Downs templates are excluded, is performed using the second cumulant of edge weight for angular separation thresholds $\barζ\gtrsim 40^{\circ}$. The lowest detectable value of the SGWB strain amplitude using our graph-based measures at current PTA sensitivity is $A_{\rm SGWB}\gtrsim 1.2\times 10^{-15}$. Fisher forecasts confirm that the uncertainty levels of $\log_{10} A_{\rm SGWB}$ and the spectral index reach $1.5\%$ and $19.5\%$, respectively, at the $2σ$ confidence level. Weak evidence for an SGWB at the $\sim 2.3σ$ level is obtained by applying our graph-based method to the NANOGrav 15-year dataset.
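
To make the summary statistics named in this abstract concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the abstract does not spell out how edges are built, so this sketch assumes edge weights are absolute Pearson correlations between residual series and that weak edges are dropped by a hypothetical `threshold` parameter. The function names are illustrative only. It computes an average (unweighted) clustering coefficient and the second cumulant (variance) of the edge weights.

```python
import numpy as np

def correlation_graph(residuals, threshold=0.0):
    """Weighted adjacency matrix from a (n_pulsars, n_times) residual array.

    Assumption: edge weight = |Pearson correlation|; edges below `threshold`
    are removed. The paper's actual construction may differ.
    """
    weights = np.abs(np.corrcoef(residuals))
    np.fill_diagonal(weights, 0.0)          # no self-loops
    weights[weights < threshold] = 0.0
    return weights

def average_clustering(weights):
    """Mean local clustering coefficient of the unweighted graph."""
    adj = (weights > 0).astype(float)
    degree = adj.sum(axis=1)
    # diag(A^3) counts closed 3-walks; each triangle at a node is counted twice.
    triangles = np.diag(adj @ adj @ adj) / 2.0
    possible = degree * (degree - 1) / 2.0
    local = np.divide(triangles, possible,
                      out=np.zeros_like(triangles), where=possible > 0)
    return float(local.mean())

def edge_weight_variance(weights):
    """Second cumulant (variance) of the upper-triangular edge weights."""
    upper = np.triu_indices_from(weights, k=1)
    return float(np.var(weights[upper]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_residuals = rng.standard_normal((20, 500))   # synthetic stand-in data
    w = correlation_graph(fake_residuals, threshold=0.05)
    print(average_clustering(w), edge_weight_variance(w))
```

Thresholding on correlation magnitude is only one possible choice; the abstract refers to thresholds on angular separation, which would need pulsar sky positions that this toy example does not model.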

2506.05014 2026-04-14 cs.LG cs.AI stat.ML

Towards Reasonable Concept Bottleneck Models

Nektarios Kalampalikis, Kavya Gupta, Georgi Vitanov, Isabel Valera

Comments 32 pages, 20 figures

Abstract

We propose a novel, flexible, and efficient framework for designing Concept Bottleneck Models (CBMs) that enables practitioners to explicitly encode and extend their prior knowledge and beliefs about the concept-concept ($C-C$) and concept-task ($C \to Y$) relationships within the model's reasoning when making predictions. The resulting $\textbf{C}$oncept $\textbf{REA}$soning $\textbf{M}$odels (CREAMs) architecturally encode arbitrary types of $C-C$ relationships such as mutual exclusivity, hierarchical associations, and/or correlations, as well as potentially sparse $C \to Y$ relationships. Moreover, CREAM can optionally incorporate a regularized side-channel to complement the potentially incomplete concept sets, achieving competitive task performance while encouraging predictions to be concept-grounded. To evaluate CBMs in such settings, we introduce a $C \to Y$ agnostic metric that quantifies interpretability when predictions partially rely on the side-channel. In our experiments, we show that, without additional computational overhead, CREAM models support efficient interventions, can avoid concept leakage, and achieve black-box-level performance under missing concepts. We further analyze how an optional side-channel affects interpretability and intervenability. Importantly, the side-channel enables CBMs to remain effective even in scenarios where only a limited number of concepts are available.

2506.00444 2026-04-14 math.ST stat.TH

Detecting non-uniform patterns on high-dimensional hyperspheres

Tiefeng Jiang, Tuan Pham

Comments added results for the Watson model

Abstract

We propose a new probabilistic characterization of the uniform distribution on the hypersphere in terms of the distribution of pairwise inner products, extending the ideas of \citep{cuesta2009projection,cuesta2007sharp} in a data-driven manner. This characterization naturally leads to an Ingster-type distance for quantifying deviations from uniformity, whose asymptotic behavior can be analyzed systematically via Edgeworth-type expansions. Perhaps surprisingly, we show that this distance captures the minimax rates for testing uniformity simultaneously across several high-dimensional parametric models, even in the models where densities with respect to the uniform law do not exist. We then introduce a simple test for spherical uniformity based on this distance and study its detection rates and consistency against various classes of alternatives, both local and non-local. The proposed test is universally consistent in fixed dimensions, minimax-optimal over a variety of high-dimensional parametric models, and consistent against non-local high-dimensional alternatives. This is different from previously studied high-dimensional Sobolev tests and extreme-value-based tests, which are rate-suboptimal or inconsistent against one or more classes of alternatives. We also establish the local asymptotic distribution of the proposed test under the considered classes of alternatives, along with new information lower bounds.

2505.18344 2026-04-14 cs.LG cs.AI stat.ML

Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Mudit Gaur, Prashant Trivedi, Sasidhar Kunapuli, Amrit Singh Bedi, Vaneet Aggarwal

Journal ref
Transactions on Machine Learning Research, Apr 2026
Abstract

Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\mathcal{O}(ε^{-4})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result that achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.

2503.11268 2026-04-14 stat.ME stat.CO

Rank estimation for the accelerated failure time model with partially interval-censored data

Taehwa Choi, Sangbum Choi, Dipankar Bandyopadhyay

Comments Accepted in Statistica Sinica

Abstract

This paper presents a unified rank-based inferential procedure for fitting the accelerated failure time model to partially interval-censored data. A Gehan-type monotone estimating function is constructed based on the idea of the familiar weighted log-rank test, and an extension to a general class of rank-based estimating functions is suggested. The proposed estimators can be obtained via linear programming and are shown to be consistent and asymptotically normal via standard empirical process theory. Unlike common maximum likelihood-based estimators for partially interval-censored regression models, our approach can directly provide a regression coefficient estimator without involving a complex nonparametric estimation of the underlying residual distribution function. An efficient variance estimation procedure for the regression coefficient estimator is considered. Moreover, we extend the proposed rank-based procedure to the linear regression analysis of multivariate clustered partially interval-censored data. The finite-sample operating characteristics of our approach are examined via simulation studies. A data example from a colorectal cancer study illustrates the practical usefulness of the method.

2502.07114 2026-04-14 stat.ML cs.LG cs.NA math.NA math.OC stat.CO

Online Covariance Matrix Estimation in Sketched Newton Methods

Wei Kuang, Mihai Anitescu, Sen Na

Comments 63 pages, 4 figures, 9 tables

Abstract

Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched Newton method that leverages a randomized sketching technique to perform an approximate Newton step in each iteration, thereby eliminating the computational bottleneck of second-order methods. While existing studies have established the asymptotic normality of sketched Newton methods, a consistent estimator of the limiting covariance matrix remains an open problem. We propose a fully online covariance matrix estimator that is constructed entirely from the Newton iterates and requires no matrix factorization. Compared to covariance estimators for first-order online methods, our estimator for second-order methods is batch-free. We establish the consistency and convergence rate of our estimator, and coupled with asymptotic normality results, we can then perform online statistical inference for the model parameters based on sketched Newton methods. We also discuss the extension of our estimator to constrained problems, and demonstrate its superior performance on regression problems as well as benchmark problems in the CUTEst set.
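
As a rough illustration of what an approximate Newton step via sketching can look like, the Python sketch below solves the Newton system only inside a random low-dimensional subspace. This is a simplified subspace variant chosen for brevity, under stated assumptions: a Gaussian sketch matrix, a deterministic quadratic objective, and unit step size. It is not the online algorithm or the covariance estimator from the paper, and the function names are hypothetical.

```python
import numpy as np

def sketched_newton_direction(hessian, grad, sketch):
    """Approximate the Newton direction -H^{-1} g within the column space of `sketch`.

    Solves the reduced (s x s) system (S^T H S) c = S^T g and maps back with S.
    """
    reduced_hessian = sketch.T @ hessian @ sketch
    reduced_grad = sketch.T @ grad
    coeffs, *_ = np.linalg.lstsq(reduced_hessian, reduced_grad, rcond=None)
    return -sketch @ coeffs

def run_sketched_newton(A, b, x0, sketch_dim, steps, seed=0):
    """Minimize 0.5 x^T A x - b^T x, drawing a fresh Gaussian sketch each step."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        grad = A @ x - b                      # gradient of the quadratic
        sketch = rng.standard_normal((x.size, sketch_dim))
        x = x + sketched_newton_direction(A, grad, sketch)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = rng.standard_normal((6, 6))
    A = M @ M.T + 6.0 * np.eye(6)             # symmetric positive definite
    b = rng.standard_normal(6)
    x = run_sketched_newton(A, b, np.zeros(6), sketch_dim=2, steps=50)
    print(np.linalg.norm(A @ x - b))          # residual of the optimality condition
```

When `sketch_dim` equals the problem dimension and the sketch is invertible, the step reduces to the exact Newton step; smaller sketches trade accuracy per step for a cheaper linear solve.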

2410.22559 2026-04-14 cs.LG cs.AI stat.ML

Disentanglement as Identifiable Pushforward Factorisation

Carl Allen

Comments 9 pages

Abstract

We characterise disentanglement in smooth generative pushforward models, such as in VAEs and GANs. For a generator/decoder $g:Z\to X$ and factorised prior $p(z)=\prod_i p_i(z_i)$, we define disentanglement as factorisation of the pushforward density $p_μ= g_\#p$ into one-dimensional "seam" factors, where each latent dimension controls an independent generative factor of the data. We prove that $p_μ$ factorises according to the SVD of $g$'s Jacobian; that disentanglement equates to two conditions on $g$ (C1-C2); and that under those conditions the seam factors are identifiable, up to permutation and sign. In the particular case of Gaussian ($β$-)VAEs, we show via an identity how diagonal posteriors promote C1-C2, in expectation, explaining why disentanglement arises modulated by $β$. Experiments illustrate this mechanism on Gaussian data, dSprites, and CelebA.

2410.11964 2026-04-14 cs.LG stat.ML

A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection

James Enouen, Mahito Sugiyama

Abstract

The log-linear model has received a significant amount of theoretical attention in previous decades and remains the fundamental tool used for learning probability distributions over discrete variables. Despite its popularity in statistical mechanics and high-dimensional statistics, the majority of related energy-based models, such as Boltzmann machines and Markov graphical models, focus only on two-variable relationships. Although these approaches have easier-to-solve structure learning problems and easier-to-optimize parametric distributions, they often ignore the rich structure that exists in the higher-order interactions between different variables. Using more recent tools from the field of information geometry, we revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions, going beyond the 1-body modes of independent distributions and the 2-body modes of Boltzmann distributions. This perspective allows us to define a complete decomposition of the KL error. This then motivates the formulation of a sparse selection problem over the set of possible mode interactions. In the same way as sparse graph selection allows for better generalization, we find that our learned distributions are able to more efficiently use the finite amount of data that is available in practice. We develop an algorithm called MAHGenTa which leverages a novel Monte-Carlo sampling technique for energy-based models alongside a greedy heuristic for incorporating statistical robustness. On both synthetic and real-world datasets, we demonstrate our algorithm's effectiveness in maximizing the log-likelihood for the generative task and also the ease of adaptability to the discriminative task of classification.

2311.11487 2026-04-14 stat.ME stat.AP

Modeling Insurance Claims using Bayesian Nonparametric Regression

Mostafa Shams Esfand Abadi, Kaushik Ghosh

Journal ref
PLoS One 21(4): e0346734, 2026
Abstract

The prediction of future insurance claims based on observed risk factors, or covariates, helps the actuary set insurance premiums. Typically, actuaries use parametric regression models to predict claims based on the covariate information. Such models assume the same functional form tying the response to the covariates for each data point. These models are not flexible enough and can fail to accurately capture, at the individual level, the relationship between the covariates and the claims frequency and severity, which are often multimodal, highly skewed, and heavy-tailed. In this article, we explore the use of Bayesian nonparametric (BNP) regression models to predict claims frequency and severity based on covariates. In particular, we model claims frequency as a mixture of Poisson regressions, and the logarithm of claims severity as a mixture of normal regressions. We use the Dirichlet process (DP) and Pitman-Yor process (PY) as priors for the mixing distribution over the regression parameters. Unlike parametric regression, such models allow each data point to have its individual parameters, making them highly flexible and resulting in improved prediction accuracy. We describe model fitting using MCMC and illustrate the applicability of these models using French motor insurance claims data.