arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.24594 2026-03-26 cs.LG cs.NA math.NA stat.ML

Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method

Arthur Jacot

Abstract

We introduce the Multilevel Euler-Maruyama (ML-EM) method to compute solutions of SDEs and ODEs using a range of approximators $f^1,\dots,f^k$ to the drift $f$ with increasing accuracy and computational cost, only requiring a few evaluations of the most accurate $f^k$ and many evaluations of the less costly $f^1,\dots,f^{k-1}$. If the drift lies in the so-called Harder than Monte Carlo (HTMC) regime, i.e. it requires $ε^{-γ}$ compute to be $ε$-approximated for some $γ>2$, then ML-EM $ε$-approximates the solution of the SDE with $ε^{-γ}$ compute, improving over the traditional EM rate of $ε^{-γ-1}$. In other words, it allows us to solve the SDE at the same cost as a single evaluation of the drift. In the context of diffusion models, the different levels $f^{1},\dots,f^{k}$ are obtained by training UNets of increasing sizes, and ML-EM allows us to perform sampling with the equivalent of a single evaluation of the largest UNet. Our numerical experiments confirm our theory: we obtain up to fourfold speedups for image generation on the CelebA dataset downscaled to 64x64, where we measure a $γ\approx2.5$. Given that this is a polynomial speedup, we expect even stronger speedups in practical applications which involve orders of magnitude larger networks.
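The two compute rates quoted in this abstract can be compared with a few lines of arithmetic. The sketch below is only an illustration of the stated exponents: it ignores all constants and logarithmic factors, and the function names `em_cost`, `mlem_cost`, and `speedup` are hypothetical, not taken from the paper.

```python
def em_cost(eps: float, gamma: float) -> float:
    """Traditional EM rate quoted in the abstract: eps^(-gamma - 1), constants ignored."""
    return eps ** (-gamma - 1.0)


def mlem_cost(eps: float, gamma: float) -> float:
    """ML-EM rate quoted in the abstract: eps^(-gamma), constants ignored."""
    return eps ** (-gamma)


def speedup(eps: float, gamma: float) -> float:
    """Ratio of the two rates; algebraically this is 1 / eps, whatever gamma is."""
    return em_cost(eps, gamma) / mlem_cost(eps, gamma)


if __name__ == "__main__":
    gamma = 2.5  # value measured on CelebA 64x64 according to the abstract
    for eps in (0.1, 0.01):
        print(f"eps={eps}: EM/ML-EM cost ratio = {speedup(eps, gamma):.1f}")
```

Under these simplifications the ratio grows like $1/ε$ as the target accuracy tightens, which is what makes the speedup polynomial rather than a constant factor.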

2603.24567 2026-03-26 stat.ML cs.LG

Trust Region Constrained Bayesian Optimization with Penalized Constraint Handling

Raju Chowdhury, Tanmay Sen, Prajamitra Bhuyan, Biswabrata Pradhan

Abstract

Constrained optimization in high-dimensional black-box settings is difficult due to expensive evaluations, the lack of gradient information, and complex feasibility regions. In this work, we propose a Bayesian optimization method that combines a penalty formulation, a surrogate model, and a trust region strategy. The constrained problem is converted to an unconstrained form by penalizing constraint violations, which provides a unified modeling framework. A trust region restricts the search to a local region around the current best solution, which improves stability and efficiency in high dimensions. Within this region, we use the Expected Improvement acquisition function to select evaluation points by balancing improvement and uncertainty. The proposed Trust Region method integrates penalty-based constraint handling with local surrogate modeling. This combination enables efficient exploration of feasible regions while maintaining sample efficiency. We compare the proposed method with state-of-the-art methods on synthetic and real-world high-dimensional constrained optimization problems. The results show that the method identifies high-quality feasible solutions with fewer evaluations and maintains stable performance across different settings.
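The ingredients named in this abstract (a penalty formulation, a trust region, and Expected Improvement) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear penalty form, the convention that a constraint is satisfied when `g(x) <= 0`, the box-shaped trust region, and all function names are assumptions made for the example.

```python
import math


def penalized_objective(f_value, constraint_values, rho=10.0):
    """Assumed penalty form: f(x) + rho * sum(max(0, g_i(x))), feasible when g_i(x) <= 0."""
    violation = sum(max(0.0, g) for g in constraint_values)
    return f_value + rho * violation


def expected_improvement(mu, sigma, best):
    """Closed-form EI for minimization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)  # no uncertainty left: plain improvement
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf


def in_trust_region(x, center, radius):
    """Assumed box-shaped trust region around the current best point."""
    return all(abs(xi - ci) <= radius for xi, ci in zip(x, center))


if __name__ == "__main__":
    print(penalized_objective(1.0, [-0.5, 0.2]))          # second constraint is violated
    print(expected_improvement(mu=0.0, sigma=1.0, best=0.0))
    print(in_trust_region([0.1, 0.2], [0.0, 0.0], 0.25))
```

In a full loop, candidates would be restricted to the trust region and ranked by EI computed from a surrogate fitted to the penalized objective; the surrogate itself is omitted here.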

2603.24545 2026-03-26 math.ST cs.CC cs.DS math.PR stat.ML stat.TH

Detection of local geometry in random graphs: information-theoretic and computational limits

Jinho Bok, Shuangping Li, Sophie H. Yu

Comments 68 pages

Abstract

We study the problem of detecting local geometry in random graphs. We introduce a model $\mathcal{G}(n, p, d, k)$, where a hidden community of average size $k$ has edges drawn as a random geometric graph on $\mathbb{S}^{d-1}$, while all remaining edges follow the Erdős--Rényi model $\mathcal{G}(n, p)$. The random geometric graph is generated by thresholding inner products of latent vectors on $\mathbb{S}^{d-1}$, with each edge having marginal probability equal to $p$. This implies that $\mathcal{G}(n, p, d, k)$ and $\mathcal{G}(n, p)$ are indistinguishable at the level of the marginals, and the signal lies entirely in the edge dependencies induced by the local geometry. We investigate both the information-theoretic and computational limits of detection. On the information-theoretic side, our upper bounds follow from three tests based on signed triangle counts: a global test, a scan test, and a constrained scan test; our lower bounds follow from two complementary methods: truncated second moment via Wishart--GOE comparison, and tensorization of KL divergence. These results together settle the detection threshold at $d = \widetilde{\Theta}(k^2 \vee k^6/n^3)$ for fixed $p$, and extend the state-of-the-art bounds from the full model (i.e., $k = n$) for vanishing $p$. On the computational side, we identify a computational--statistical gap and provide evidence via the low-degree polynomial framework, as well as the suboptimality of signed cycle counts of length $\ell \geq 4$.
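The signed triangle count underlying the tests in this abstract has a simple standard form: each edge indicator is centered by $p$ and the products are summed over all vertex triples. The sketch below computes only this global statistic on a dense adjacency matrix; the function name is hypothetical, and the thresholds and the scan variants used in the paper are not reproduced.

```python
from itertools import combinations


def signed_triangle_count(adj, p):
    """Sum over triples i<j<k of (A_ij - p)(A_jk - p)(A_ik - p).

    adj is a symmetric 0/1 adjacency matrix given as a list of lists;
    p is the marginal edge probability used for centering.
    """
    n = len(adj)
    total = 0.0
    for i, j, k in combinations(range(n), 3):
        total += (adj[i][j] - p) * (adj[j][k] - p) * (adj[i][k] - p)
    return total


if __name__ == "__main__":
    triangle = [[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]]
    print(signed_triangle_count(triangle, p=0.5))
```

Because every factor has mean zero under $\mathcal{G}(n, p)$ with independent edges, the statistic has mean zero under the null, so a large value points to the edge dependencies that geometry induces.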

2603.24538 2026-03-26 stat.CO

Generalized and Scalable Deep Gaussian Process Emulation

Deyu Ming, Daniel Williamson

Abstract

Gaussian process (GP) emulators have become essential tools for approximating complex simulators, significantly reducing computational demands in optimization, sensitivity analysis, and model calibration. While traditional GP emulators effectively model continuous and Gaussian-distributed simulator outputs with homogeneous variability, they typically struggle with discrete, heteroskedastic Gaussian, or non-Gaussian data, limiting their applicability to increasingly common stochastic simulators. In this work, we introduce a scalable Generalized Deep Gaussian Process (GDGP) emulation framework designed to accommodate simulators with heteroskedastic Gaussian outputs and a wide range of non-Gaussian response distributions, including Poisson, negative binomial, and categorical distributions. The GDGP framework leverages the expressiveness of DGPs and extends them to latent GP structures, enabling it to capture the complex, non-stationary behavior inherent in many simulators while also modeling non-Gaussian simulator outputs. We make GDGP scalable by incorporating the Vecchia approximation for settings with a large number of input locations, while also developing efficient inference procedures for handling large numbers of replicates. In particular, we present methodological developments that further enhance the computation of the approach for heteroskedastic Gaussian responses. We demonstrate through a series of synthetic and empirical examples that these extensions deliver the practical application of GDGP emulators and a unified methodology capable of addressing diverse modeling challenges. The proposed GDGP framework is implemented in the open-source R package dgpsi.

2603.24495 2026-03-26 math.ST stat.ML stat.TH

Reflected diffusion models adapt to low-dimensional data

Asbjørn Holk, Claudia Strauch, Lukas Trottner

Abstract

While the mathematical foundations of score-based generative models are increasingly well understood for unconstrained Euclidean spaces, many practical applications involve data restricted to bounded domains. This paper provides a statistical analysis of reflected diffusion models on the hypercube $[0,1]^D$ for target distributions supported on $d$-dimensional linear subspaces. A primary challenge in this setting is the absence of Gaussian transition kernels, which play a central role in standard theory in $\mathbb{R}^D$. By employing an easily implementable infinite series expansion of the transition densities, we develop analytic tools to bound the score function and its approximation by sparse ReLU networks. For target densities with Sobolev smoothness $α$, we establish a convergence rate in the $1$-Wasserstein distance of order $n^{-\frac{α+1-δ}{2α+d}}$ for arbitrarily small $δ> 0$, demonstrating that the generative algorithm fully adapts to the intrinsic dimension $d$. These results confirm that the presence of reflecting boundaries does not degrade the fundamental statistical efficiency of the diffusion paradigm, matching the almost optimal rates known for unconstrained settings.
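To make the "infinite series expansion of the transition densities" concrete, the sketch below evaluates the classical cosine series for one-dimensional reflected Brownian motion on $[0,1]$. It assumes a standard Brownian motion with generator $\frac{1}{2}\,d^2/dx^2$ and reflecting (Neumann) boundaries; the forward process, time scaling, and expansion actually used in the paper may differ, and the function name is hypothetical.

```python
import math


def reflected_bm_density(t, x, y, n_terms=50):
    """Truncated series p_t(x, y) = 1 + 2 * sum_n exp(-n^2 pi^2 t / 2) cos(n pi x) cos(n pi y)."""
    total = 1.0
    for n in range(1, n_terms + 1):
        decay = math.exp(-0.5 * (n * math.pi) ** 2 * t)
        total += 2.0 * decay * math.cos(n * math.pi * x) * math.cos(n * math.pi * y)
    return total


if __name__ == "__main__":
    # Midpoint-rule check that the density integrates to one over y in [0, 1].
    m = 200
    mass = sum(reflected_bm_density(0.1, 0.3, (j + 0.5) / m) for j in range(m)) / m
    print(f"total mass at t=0.1, x=0.3: {mass:.6f}")
```

For independent coordinates on the hypercube $[0,1]^D$, a density of this kind factorizes as a product of one-dimensional terms, which is part of what makes such series convenient to implement.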

2603.24493 2026-03-26 cs.LG math.ST stat.TH

Uniform Laws of Large Numbers in Product Spaces

Ron Holzman, Shay Moran, Alexander Shlimovich

Abstract

Uniform laws of large numbers form a cornerstone of Vapnik--Chervonenkis theory, where they are characterized by the finiteness of the VC dimension. In this work, we study uniform convergence phenomena in Cartesian product spaces, under assumptions on the underlying distribution that are compatible with the product structure. Specifically, we assume that the distribution is absolutely continuous with respect to the product of its marginals, a condition that captures many natural settings, including product distributions, sparse mixtures of product distributions, distributions with low mutual information, and more. We show that, under this assumption, a uniform law of large numbers holds for a family of events if and only if the linear VC dimension of the family is finite. The linear VC dimension is defined as the maximum size of a shattered set that lies on an axis-parallel line, namely, a set of vectors that agree on all but at most one coordinate. This dimension is always at most the classical VC dimension, yet it can be arbitrarily smaller. For instance, the family of convex sets in $\mathbb{R}^d$ has linear VC dimension $2$, while its VC dimension is infinite already for $d\ge 2$. Our proofs rely on an estimator that departs substantially from the standard empirical mean estimator and exhibits more intricate structure. We show that such deviations from the standard empirical mean estimator are unavoidable in this setting. Throughout the paper, we propose several open questions, with a particular focus on quantitative sample complexity bounds.

2603.24474 2026-03-26 stat.AP

Leveraging Synthetic and Genetic Data to Improve Epidemic Forecasting

Dave Osthus, Alexander C. Murph, Emma E. Goldberg, Lauren J. Beesley, William M. Fischer, Nidhi K. Parikh, Lauren A. Castro

Comments 36 pages, 19 figures, 5 tables

Abstract

Forecasting infectious disease outbreaks is hard. Forecasting emerging infectious diseases with limited historical data is even harder. In this paper, we investigate ways to improve emerging infectious disease forecasting under operational constraints. Specifically, we explore two options likely to be available near the start of an emerging disease outbreak: synthetic data and genetic information. For this investigation, we conducted an experiment where we trained deep learning models on different combinations of real and synthetic data, both with and without genetic information, to explore how these models compare when forecasting COVID-19 cases for US states. All models are developed with an eye towards forecasting the next pandemic. We find that models trained with synthetic data have better forecast accuracy than models trained on real data alone, and models that use genetic variants have better forecast accuracy compared to those that do not. All models outperformed a baseline persistence model (a feat only accomplished by 7 out of 22 real-time COVID-19 cases forecasting models as reported in [38]) and multiple models outperformed the COVIDHub-4_week_ensemble. This paper demonstrates the value of these underutilized sources of information and provides a blueprint for forecasting future pandemics.

2603.24439 2026-03-26 stat.ME

Distributionally balanced sampling designs via minimum tactical configurations

Anton Grafström, Wilmer Prentius

Comments 15 pages, 3 figures

Abstract

Distributionally balanced sampling designs are low-discrepancy probability designs obtained by minimizing the expected discrepancy between the auxiliary-variable distribution of a random sample and the target population distribution. Existing constructions rely on circular population sequences, which restrict the design space by forcing samples to be contiguous blocks of a sequence. We propose a new construction based on minimum tactical configurations that removes this topological constraint. The resulting designs are fixed-size, have equal inclusion probabilities, and belong to the class with minimum feasible configuration size. We develop both a simple initialization valid for arbitrary population and sample sizes and a spatial initialization that yields a lower initial expected discrepancy, together with a simulated annealing algorithm for optimization within this class. In simulations and empirical examples, the proposed method outperforms state-of-the-art alternatives in terms of distributional fit, balance, and spatial spread.

2603.24427 2026-03-26 stat.ML cs.LG

Continuous-Time Learning of Probability Distributions: A Case Study in a Digital Trial of Young Children with Type 1 Diabetes

Antonio Álvarez-López, Marcos Matabuena

Comments 53 pages, 11 figures

Abstract

Understanding how biomarker distributions evolve over time is a central challenge in digital health and chronic disease monitoring. In diabetes, changes in the distribution of glucose measurements can reveal patterns of disease progression and treatment response that conventional summary measures miss. Motivated by a 26-week clinical trial comparing the closed-loop insulin delivery system t:slim X2 with standard therapy in children with type 1 diabetes, we propose a probabilistic framework to model the continuous-time evolution of time-indexed distributions using continuous glucose monitoring (CGM) data collected every five minutes. We represent the glucose distribution as a Gaussian mixture, with time-varying mixture weights governed by a neural ODE. We estimate the model parameters using a distribution-matching criterion based on the maximum mean discrepancy. The resulting framework is interpretable, computationally efficient, and sensitive to subtle temporal distributional changes. Applied to CGM trial data, the method detects treatment-related improvements in glucose dynamics that are difficult to capture with traditional analytical approaches.
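The distribution-matching criterion mentioned here, the maximum mean discrepancy (MMD), can be estimated directly from two samples. The sketch below uses the biased (V-statistic) estimator with a Gaussian kernel on one-dimensional data. The kernel choice, the bandwidth, and the toy numbers are assumptions for illustration only and do not come from the paper or the trial.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Toy samples, invented purely to exercise the function.
    observed = [0.1, 0.4, 0.5, 0.9]
    simulated = [0.2, 0.3, 0.6, 1.1]
    print(mmd_squared(observed, simulated, bandwidth=0.5))
```

In a model like the one described, `simulated` would be replaced by draws from (or closed-form kernel expectations under) the Gaussian mixture at a given time, and the discrepancy would be minimized over the model parameters.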

2603.24421 2026-03-26 stat.ME math.ST stat.OT stat.TH

E-values as statistical evidence: A comparison to Bayes factors, likelihoods, and p-values

Ben Chugg, Aaditya Ramdas, Peter Grünwald

Comments 34 pages

Abstract

A recurring debate in the philosophy of statistics concerns what, exactly, should count as a measure of evidence for or against a given hypothesis. P-values, likelihood ratios, and Bayes factors all have their defenders. In this paper we add two additional candidates to this list: the e-value and its sequential analogue, the e-process. E-values enjoy several desirable properties as measures of evidence: they combine naturally across studies, handle composite hypotheses, provide long-run error rates, and admit a useful interpretation as the wealth accrued by a bettor in a game against the null distribution. E-processes additionally handle optional stopping and optional continuation. This work examines the extent to which e-values and e-processes satisfy the evidential desiderata of different statistical traditions, concluding that they combine attractive features of p-values, likelihood ratios, and Bayes factors, and merit serious consideration as interpretable and intuitive measures of statistical evidence.
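The betting interpretation and the combination property described in this abstract can be illustrated with a coin-flip example. The sketch below builds an e-value as a likelihood ratio against a simple null (a fair coin), multiplies e-values from independent studies, and applies the Markov-inequality threshold $1/α$. The alternative `q = 0.75` and all function names are arbitrary choices made for the example.

```python
def coin_e_value(flips, q=0.75, p0=0.5):
    """Wealth of a bettor who starts with 1 and bets on heads-probability q against null p0.

    Under the null, the expected value of this likelihood ratio is 1, so it is an e-value.
    """
    wealth = 1.0
    for x in flips:
        wealth *= (q / p0) if x else ((1.0 - q) / (1.0 - p0))
    return wealth


def combine_independent(e_values):
    """The product of e-values from independent studies is again an e-value."""
    product = 1.0
    for e in e_values:
        product *= e
    return product


def reject(e_value, alpha=0.05):
    """Markov's inequality: under the null, P(E >= 1/alpha) <= alpha."""
    return e_value >= 1.0 / alpha


if __name__ == "__main__":
    study_a = coin_e_value([1, 1, 1, 1])
    study_b = coin_e_value([1, 1, 0, 1, 1])
    combined = combine_independent([study_a, study_b])
    print(study_a, study_b, combined, reject(combined))
```

Because the wealth is updated flip by flip, the running product is also an example of an e-process, which is why it can be monitored and stopped at any time without invalidating the error guarantee.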

2603.24392 2026-03-26 stat.ML cs.LG stat.ME

Federated fairness-aware classification under differential privacy

Gengyu Xue, Yi Yu

Abstract

Privacy and algorithmic fairness have become two central issues in modern machine learning. Although each has separately emerged as a rapidly growing research area, their joint effect remains comparatively under-explored. In this paper, we systematically study the joint impact of differential privacy and fairness on classification in a federated setting, where data are distributed across multiple servers. Targeting demographic-disparity-constrained classification under federated differential privacy, we propose a two-step algorithm, namely FDP-Fair. In the special case where there is only one server, we further propose a simple yet powerful algorithm, namely CDP-Fair, serving as a computationally lightweight alternative. Under mild structural assumptions, theoretical guarantees on privacy, fairness and excess risk control are established. In particular, we disentangle the source of the private fairness-aware excess risk into a) intrinsic cost of classification, b) cost of private classification, c) non-private cost of fairness and d) private cost of fairness. Our theoretical findings are complemented by extensive numerical experiments on both synthetic and real datasets, highlighting the practicality of our designed algorithms.

2603.24384 2026-03-26 cs.LG stat.ML

On the Use of Bagging for Local Intrinsic Dimensionality Estimation

Kristóf Péter, Ricardo J. G. B. Campello, James Bailey, Michael E. Houle

Comments Main document: 10 pages, 5 figures; Appendix: 38 pages, 27 figures

Abstract

The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation requires samples drawn from small neighborhoods around each query to avoid biases from nonlocal effects and potential manifold mixing, yet limited data within such neighborhoods tends to cause high estimation variance. As a variance reduction strategy, we propose an ensemble approach that uses subbagging to preserve the local distribution of nearest neighbor (NN) distances. The main challenge is that the uniform reduction in total sample size within each subsample increases the proximity threshold for finding a fixed number k of NNs around the query. As a result, in the specific context of LID estimation, the sampling rate has an additional, complex interplay with the neighborhood size, where both combined determine the sample size as well as the locality and resolution considered for estimation. We analyze both theoretically and experimentally how the choice of the sampling rate and the k-NN size used for LID estimation, alongside the ensemble size, affects performance, enabling informed prior selection of these hyperparameters depending on application-based preferences. Our results indicate that within broad and well-characterized regions of the hyperparameter space, using a bagged estimator will most often significantly reduce variance as well as the mean squared error when compared to the corresponding non-bagged baseline, with controllable impact on bias. We additionally propose and evaluate different ways of combining bagging with neighborhood smoothing for substantial further improvements in LID estimation performance.
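The interplay between sampling rate, neighborhood size k, and ensemble size can be seen in a small sketch. It assumes the common maximum-likelihood (Hill-type) LID estimator and aggregates the subsample estimates with a plain mean; the abstract does not specify either choice, so both are assumptions, as are the function names and the synthetic data.

```python
import math
import random


def lid_mle(distances):
    """MLE of LID from sorted positive k-NN distances r_1 <= ... <= r_k (not all equal)."""
    r_k = distances[-1]
    mean_log = sum(math.log(r / r_k) for r in distances) / len(distances)
    return -1.0 / mean_log


def knn_distances(query, points, k):
    """Sorted distances from the query to its k nearest points."""
    return sorted(math.dist(query, pt) for pt in points)[:k]


def bagged_lid(query, points, k, rate, n_bags, seed=0):
    """Average the LID estimate over subsamples drawn without replacement (subbagging)."""
    rng = random.Random(seed)
    size = max(k, int(rate * len(points)))
    estimates = []
    for _ in range(n_bags):
        subsample = rng.sample(points, size)
        estimates.append(lid_mle(knn_distances(query, subsample, k)))
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: uniform points in the unit square, so the true dimension is 2.
    points = [(rng.random(), rng.random()) for _ in range(2000)]
    query = (0.5, 0.5)
    print("single estimate:", lid_mle(knn_distances(query, points, k=20)))
    print("bagged estimate:", bagged_lid(query, points, k=20, rate=0.5, n_bags=25))
```

Lowering `rate` thins each subsample, so the k-th neighbor lies farther from the query; this is the locality trade-off the abstract refers to when it says the sampling rate interacts with the neighborhood size.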

2603.24333 2026-03-26 math.ST math.PR stat.ME stat.ML stat.TH

Notes on Forré's Notion of Conditional Independence and Causal Calculus for Continuous Variables

Leihao Chen

Abstract

Recently, Forré (arXiv:2104.11547, 2021) introduced transitional conditional independence, a notion of conditional independence that provides a unified framework for both random and non-stochastic variables. The original paper establishes a strong global Markov property connecting transitional conditional independencies with suitable graphical separation criteria for directed mixed graphs with input nodes (iDMGs), together with a version of causal calculus for iDMGs in a general measure-theoretic setting. These notes aim to further illustrate the motivations behind this framework and its connections to the literature, highlight certain subtleties in the general measure-theoretic causal calculus, and extend the "one-line" formulation of the ID algorithm of Richardson et al. (Ann. Statist. 51(1):334--361, 2023) to the general measure-theoretic setting.

2603.24304 2026-03-26 stat.ML cs.LG

CGRL: Causal-Guided Representation Learning for Graph Out-of-Distribution Generalization

Bowen Lu, Liangqiang Yang, Teng Li

Abstract

Graph Neural Networks (GNNs) have achieved impressive performance in graph-related tasks. However, they suffer from poor generalization on out-of-distribution (OOD) data, as they tend to learn spurious correlations. These correlations manifest as a failure of GNNs to stably learn the mutual information between prediction representations and ground-truth labels under OOD settings. To address these challenges, we formulate a causal graph starting from the essence of node classification, adopt backdoor adjustment to block non-causal paths, and theoretically derive a lower bound for improving OOD generalization of GNNs. To materialize these insights, we further propose a novel approach integrating causal representation learning and a loss replacement strategy. The former captures node-level causal invariance and reconstructs the graph posterior distribution. The latter introduces asymptotic losses of the same order to replace the original losses. Extensive experiments demonstrate the superiority of our method in OOD generalization and its effectiveness in alleviating unstable mutual information learning.

2603.24276 2026-03-26 stat.ME math.ST stat.TH

Rethinking Individual Risk and Aggregation in Survival Analysis: A Latent Mechanism Framework

Xijia Liu

Abstract

Survival analysis provides a well-established framework for modeling time-to-event data, with hazard and survival functions formally defined as population-level quantities. In applied work, however, these quantities are often interpreted as representing individual-level risk, despite the absence of a clear generative account linking individual risk mechanisms to observed survival data. This paper develops a latent hazard framework that makes this relationship explicit by modeling event times as arising from unobserved, individual-specific hazard mechanisms and viewing population-level survival quantities as aggregates over heterogeneous mechanisms. Within this framework, we show that individual hazard trajectories are not identifiable from survival data under partial information. More generally, the conditional distribution of latent hazard mechanisms given covariates is structurally non-identifiable, even when population-level survival functions are fully known. This non-identifiability arises from the aggregation inherent in survival data and persists independently of model flexibility or estimation strategy. Finally, we show that classical survival models can be systematically reinterpreted according to how they handle this unresolved conditional mechanism distribution. This paper provides a unified framework for understanding heterogeneity, identifiability, and interpretation in survival analysis, and clarifies how population-level survival models should be interpreted when individual risk mechanisms are only partially observed, thereby establishing explicit information constraints for principled modeling and inference.

2603.24263 2026-03-26 stat.ME math.ST stat.TH

XT-REM: A Two-Component Model for Meta-Analysis of Extreme Event Proportions

Jovana Dedeić, Jelena Ivetić, Srđan Milićević, Katarina Vidojević, Marija Delić

Comments Under preparation for submission to Computational Statistics & Data Analysis. Includes simulation study and real-world application of the XT-REM model

Abstract

In this paper, we introduce a novel model for the meta-analysis of proportions that integrates the standard random-effects model (REM) with an extreme value theory (EVT)-based component. The proposed model, named XT-REM (Extreme-Tail Random Effects Model), extends the classical REM framework by explicitly accounting for extreme proportions through a partial segmentation of the study set based on a predefined threshold. While the majority of proportions are modeled using REM, proportions exceeding the threshold are analyzed using the Generalized Pareto Distribution (GPD). This formulation enables a dual interpretation of meta-analytic results, providing both an aggregate estimate for the central bulk of studies and a separate characterization of tail behavior. The XT-REM framework accommodates heteroskedastic variance structures inherent to proportion data, while preserving identifiability and consistency. Using real-world data on immunotherapy-related adverse events, together with simulation studies calibrated to empirical settings, we demonstrate that XT-REM yields a comparable central estimate while enabling a more explicit assessment of tail behavior, including high-percentile extreme proportions. Compared with the classical REM, XT-REM achieves higher log-likelihood values and lower AIC in the considered scenarios, indicating a better fit within this modeling framework. In summary, XT-REM offers a theoretically grounded and practically useful extension of random-effects meta-analysis, with potential relevance to clinical contexts in which extreme event rates carry important implications for risk assessment.

2603.24259 2026-03-26 stat.CO

Uncertainty Quantification of Spline Predictors on Compact Riemannian Manifolds

Charlie Sire, Mike Pereira

Abstract

To predict smooth physical phenomena from observations, spline interpolation provides an interpretable framework by minimizing an energy functional associated with the Laplacian operator. This work proposes a methodology to construct a spline predictor on a compact Riemannian manifold, while quantifying the uncertainty inherent in the classical deterministic solution. Our approach leverages the equivalence between spline interpolation and universal kriging with a specific covariance kernel. By adopting a Gaussian random field framework, we generate stochastic simulations that reflect prediction uncertainty. However, on compact manifolds, the covariance kernel depends on the generally unknown spectrum of the Laplace-Beltrami operator. To address this, we introduce a finite element approximation based on a triangulation of the manifold. This leads to the use of intrinsic Gaussian Markov Random Fields (GMRF) and allows for the incorporation of anisotropies through local modifications of the Riemannian metric. The method is validated using a temperature study on a sphere, where the operator's spectrum is known, and is further extended to a test case on a cylindrical surface.

2603.24227 2026-03-26 cs.LG stat.ME

Identification of NMF by choosing maximum-volume basis vectors

Qianqian Qi, Zhongming Chen, Peter G. M. van der Heijden

Abstract

In nonnegative matrix factorization (NMF), minimum-volume-constrained NMF is a widely used framework for identifying the solution of NMF by making basis vectors as similar as possible. This typically induces sparsity in the coefficient matrix, with each row containing zero entries. Consequently, minimum-volume-constrained NMF may fail for highly mixed data, where such sparsity does not hold. Moreover, the estimated basis vectors in minimum-volume-constrained NMF may be difficult to interpret as they may be mixtures of the ground truth basis vectors. To address these limitations, in this paper we propose a new NMF framework, called maximum-volume-constrained NMF, which makes the basis vectors as distinct as possible. We further establish an identifiability theorem for maximum-volume-constrained NMF and provide an algorithm to estimate it. Experimental results demonstrate the effectiveness of the proposed method.

2603.20518 2026-03-26 stat.ME stat.AP

Multi-dimensional Mortality (MDMx): Sex-Age-Specific Model Life Tables, Fitting, Prediction from Summary Mortality Indicators, and Forecasting

Samuel J. Clark

Abstract

Demographers rely on a variety of tools and methods to work with mortality schedules: model life tables, fitting methods, summary-indicator prediction, and forecasting. These were largely developed independently and do not provide structurally coherent sex-specific outputs. The multi-dimensional mortality model (MDMx) unifies all four within one Tucker tensor decomposition demonstrated using the Human Mortality Database (HMD). Period life tables from the HMD are organized as a four-way tensor of logit(${}_1q_x$) indexed by sex, age, country, and year. Shared factor matrices for sex and age make every output schedule structurally coherent by construction. From this decomposition four capabilities emerge: model life tables via clustering and smooth within-regime trajectories; life table fitting via a three-stage algorithm with Bayes-factor disruption detection; summary-indicator prediction mapping child or adult mortality to complete schedules, reformulating SVD-Comp in tensor coordinates; and forecasting via a damped local linear trend Kalman filter on PCA-reduced core matrices with hierarchical drift.

2603.19808 2026-03-26 cs.LG math.AP stat.ML

Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training

Giacomo Borghi, Hyesung Im, Lorenzo Pareschi

Abstract

Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection--mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection--mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann--Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator--mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.

2603.16661 2026-03-26 cs.LG stat.ML

Self-Aware Markov Models for Discrete Reasoning

Gregor Kornhardt, Jannis Chemseddine, Christian Wald, Gabriele Steidl

Abstract

Standard masked discrete diffusion models face limitations in reasoning tasks due to their inability to correct their own mistakes on the masking path. Since they rely on a fixed number of denoising steps, they are unable to adjust their computation to the complexity of a given problem. To address these limitations, we introduce a method based on learning a Markov transition kernel that is trained on its own outputs. This design enables tokens to be remasked, allowing the model to correct its previous mistakes. Furthermore, we do not need a fixed time schedule but use a trained stopping criterion. This allows for adaptation of the number of function evaluations to the difficulty of the reasoning problem. Our adaptation adds two lightweight prediction heads, enabling reuse and fine-tuning of existing pretrained models. On the Sudoku-Extreme dataset we clearly outperform other flow-based methods with a validity of 95%. On Countdown-4 we need on average only 10 steps to solve almost 96% of the problems correctly, while many problems can already be solved in 2 steps.

2603.05335 2026-03-26 stat.ML cs.LG math.ST stat.TH

Bayes with No Shame: Admissibility Geometries of Predictive Inference

Nicholas G. Polson, Daniel Zantedeschi

Abstract

Four distinct admissibility geometries govern sequential and distribution-free inference: Blackwell risk dominance over convex risk sets, anytime-valid admissibility within the nonnegative supermartingale cone, marginal coverage validity over exchangeable prediction sets, and Cesàro approachability (CAA) admissibility, which reaches the risk-set boundary via approachability-style arguments rather than explicit priors. We prove a criterion separation theorem: the four classes of admissible procedures are pairwise non-nested. Each geometry carries a different certificate of optimality: a supporting-hyperplane prior (Blackwell), a nonnegative supermartingale (anytime-valid), an exchangeability rank (coverage), or a Cesàro steering argument (CAA). Martingale coherence is necessary for Blackwell admissibility and necessary and sufficient for anytime-valid admissibility within e-processes, but is not sufficient for Blackwell admissibility and is not necessary for coverage validity or CAA-admissibility. All four criteria can be viewed through a common schematic template (minimize Bayesian risk subject to a feasibility constraint), but the decision spaces, partial orders, and performance metrics differ by criterion, making them geometrically incompatible. Admissibility is irreducibly criterion-relative.

2603.04030 2026-03-26 math.ST stat.TH

On the generalized circular projected Cauchy distribution

Omar Alzeley, Michail Tsagris

Details
Abstract

\cite{tsagris2025a} proposed the generalized circular projected Cauchy (GCPC) distribution, whose special case is the wrapped Cauchy distribution. In this paper we first derive the relationship with the wrapped Cauchy distribution, and then we attempt to characterize the distribution. We establish the conditions under which the distribution exhibits unimodality. We provide non-analytical formulas for the mean resultant length and the Kullback-Leibler divergence, and analytical forms for the cumulative probability function and the entropy of the GCPC distribution. We propose log-likelihood ratio tests for one or two location parameters without assuming equality of the concentration parameters. We revisit maximum likelihood estimation with and without predictors. In the regression setting we briefly mention the addition of circular and simplicial predictors. Simulation studies illustrate a) the performance of the log-likelihood ratio test when one falsely assumes that the true distribution is the wrapped Cauchy distribution, and b) the empirical rate of convergence of the regression coefficients. Using a real data analysis example, we show how to avoid the log-likelihood being trapped in a local maximum, and we correct a mistake in the regression setting.

2602.22605 2026-03-26 cs.IT math.IT math.ST physics.data-an stat.TH

A Thermodynamic Structure of Asymptotic Inference

Willy Wong

Comments 31 pages, 1 figure. This version reworks the paper around observation variance and clarifies the unification of de Bruijn and I-MMSE identities

Details
Abstract

A thermodynamic framework for asymptotic inference is developed in which sample size and parameter variance define a state space. Within this description, Shannon information plays the role of entropy, and an integrating factor organizes its variation into a first-law-type balance equation. The framework supports a cyclic inequality analogous to a reversed second law, derived for the estimation of the mean. A non-trivial third-law-type result emerges as a lower bound on entropy set by representation noise. Optimal inference paths, global bounds on information gain, and a natural Carnot-like information efficiency follow from this structure, with efficiency fundamentally limited by a noise floor. Finally, de Bruijn's identity and the I-MMSE relation in the Gaussian-limit case appear as coordinate projections of the same underlying thermodynamic structure. This framework suggests that ensemble physics and inferential physics constitute shadow processes evolving in opposite directions within a unified thermodynamic description.

2512.22691 2026-03-26 cs.IT math.IT math.ST stat.TH

An Improved Lower Bound on Cardinality of Support of the Amplitude-Constrained AWGN Channel

Haiyang Wang, Luca Barletta, Alex Dytso

Details
Abstract

We study the amplitude-constrained additive white Gaussian noise channel. It is well known that the capacity-achieving input distribution for this channel is discrete and supported on finitely many points. The best known bounds show that the support size of the capacity-achieving distribution is lower-bounded by a term of order $A$ and upper-bounded by a term of order $A^2$, where $A$ denotes the amplitude constraint. It was conjectured in [1] that the linear scaling is optimal. In this work, we establish a new lower bound of order $A\sqrt{\log A}$, improving the known bound and ruling out the conjectured linear scaling. To obtain this result, we quantify the fact that the capacity-achieving output distribution is close to the uniform distribution in the interior of the amplitude constraint. Next, we introduce a wrapping operation that maps the problem to a compact domain and develop a theory of best approximation of the uniform distribution by finite Gaussian mixtures. These approximation bounds are then combined with stability properties of capacity-achieving distributions to yield the final support-size lower bound.

2512.10069 2026-03-26 stat.ME

Information Borrowing from Partially Compatible Trajectories for Estimation of Dynamic Treatment Regimes

Chloe Si, David A. Stephens, Erica E. M. Moodie

Details
Abstract

Dynamic Treatment Regimes (DTRs) provide a systematic framework for optimizing sequential decision-making in chronic disease management, where therapies must adapt to patients' evolving clinical profiles. Inverse probability weighting (IPW) is a cornerstone methodology for estimating regime values from observational data due to its intuitive formulation and established theoretical properties, yet standard IPW estimators face significant limitations, including variance instability and data inefficiency. A fundamental but underexplored source of inefficiency lies in the strict alignment requirement between observed and target treatment trajectories, which fails to account for partial compatibility and discards substantial information from individuals with only minimal deviations from the regime. We propose two novel methodologies that relax the strict inclusion rule through flexible compatibility mechanisms. Both methods provide computationally tractable alternatives that can be easily integrated into existing IPW workflows, offering more efficient approaches to DTR estimation. Theoretical analysis demonstrates that both estimators preserve consistency while achieving superior finite-sample efficiency compared to standard IPW, and comprehensive simulation studies confirm improved stability. We illustrate the practical utility of our methods through an application to HIV treatment data from the AIDS Clinical Trials Group Study 175 (ACTG175).

2511.20888 2026-03-26 stat.ML cs.CC cs.LG

Deep Learning as a Convex Paradigm of Computation: Minimizing Circuit Size with ResNets

Arthur Jacot

Details
Abstract

This paper argues that DNNs implement a computational Occam's razor -- finding the 'simplest' algorithm that fits the data -- and that this could explain their incredible and wide-ranging success over more traditional statistical methods. We start with the discovery that the set of real-valued functions $f$ that can be $ε$-approximated with a binary circuit of size at most $cε^{-γ}$ becomes convex in the 'Harder than Monte Carlo' (HTMC) regime, when $γ>2$, allowing for the definition of an HTMC norm on functions. In parallel, one can define a complexity measure on the parameters of a ResNet (a weighted $\ell_1$ norm of the parameters), which induces a 'ResNet norm' on functions. The HTMC and ResNet norms can then be related by an almost matching sandwich bound. Thus minimizing this ResNet norm is equivalent to finding a circuit that fits the data with an almost minimal number of nodes (within a power of 2 of being optimal). ResNets thus appear as an alternative model for computation of real functions, better adapted to the HTMC regime and its convexity.

2510.26485 2026-03-26 stat.ME

Discovering Causal Relationships Between Time Series With Spatial Structure

Rebecca F. Supple, Hannah Worthington, Ben Swallow

Comments 10 pages, 2 figures

Details
Abstract

Causal discovery is the subfield of causal inference concerned with estimating the structure of cause-and-effect relationships in a system of interrelated variables, as opposed to quantifying the strength or describing the form of causal effects. As interest in causal discovery builds in fields such as ecology, public health, and environmental sciences where data are regularly collected with spatial and temporal structures, approaches must evolve to manage autocorrelation and complex confounding. As it stands, the few proposed causal discovery algorithms for spatiotemporal data require summarizing across locations, ignore spatial autocorrelation, and/or scale poorly to high dimensions. Here, we introduce our developing framework that extends time-series causal discovery to systems with spatial structure, building upon work on causal discovery across contexts and methods for handling spatial confounding in causal effect estimation. We close by outlining remaining gaps in the literature and directions for future research.

2507.00629 2026-03-26 cond-mat.dis-nn cs.LG math.PR math.ST stat.TH

Generalization performance of narrow one-hidden layer networks in the teacher-student setting

Rodrigo Pérez Ortiz, Gibbs Nwemadji, Jean Barbier, Federica Gerace, Alessandro Ingrosso, Clarissa Lauditi, Enrico M. Malatesta

Comments 37 pages, 7 figures

Details
Abstract

Understanding the generalization properties of neural networks on simple input-output distributions is key to explaining their performance on real datasets. The classical teacher-student setting, where a network is trained on data generated by a teacher model, provides a canonical theoretical test bed. In this context, a complete theoretical characterization of fully connected one-hidden-layer networks with generic activation functions remains missing. In this work, we develop a general framework for such networks with large width, yet much smaller than the input dimension. Using methods from statistical physics, we derive closed-form expressions for the typical performance of both finite-temperature (Bayesian) and empirical risk minimization estimators in terms of a small number of order parameters. We uncover a transition to a specialization phase, where hidden neurons align with teacher features once the number of samples becomes sufficiently large and proportional to the number of network parameters. Our theory accurately predicts the generalization error of networks trained on regression and classification tasks using either noisy full-batch gradient descent (Langevin dynamics) or deterministic full-batch gradient descent.

2506.03462 2026-03-26 stat.ME

Robust domain selection for functional data via interval-wise testing and effect size mapping

Yeonjoo Park, Aiguo Han

Details
Journal ref
Journal of the Royal Statistical Society Series C: Applied Statistics (2026)
Abstract

Among inferential problems in functional data analysis, domain selection is of practical interest: it aims to identify the sub-interval(s) of the domain where desired functional features are displayed. Motivated by applications in quantitative ultrasound signal analysis, we propose a robust domain selection method that aims in particular to discover a subset of the domain where location parameters behave distinctly across groups. By extending the interval testing approach, we propose to take into account multiple aspects of functional features simultaneously to detect a practically interpretable domain. To further handle potential outliers and missing segments in the collected functional trajectories, we perform interval testing with a test statistic based on functional M-estimators. In addition, we introduce the effect size heatmap, which calculates robustified effect sizes from the smallest to the largest scales over the domain to reflect dynamic functional behaviors among groups, so that clinicians can gain a comprehensive understanding and select practically meaningful sub-interval(s). The performance of the proposed method is demonstrated through simulation studies and an application to the motivating quantitative ultrasound measurements.

2504.06870 2026-03-26 astro-ph.IM astro-ph.CO stat.AP

Bayesian Component Separation for DESI LAE Automated Spectroscopic Redshifts and Photometric Targeting

Ana Sofía M. Uzsoy, Andrew K. Saydjari, Arjun Dey, Anand Raichoor, Douglas P. Finkbeiner, Eric Gawiser, Kyoung-Soo Lee, Steven Ahlen, Davide Bianchi, David Brooks, Todd Claybaugh, Andrei Cuceu, Axel de la Macorra, Peter Doel, Andreu Font-Ribera, Jaime E. Forero-Romero, Enrique Gaztañaga, Satya Gontcho A Gontcho, Gaston Gutierrez, Mustapha Ishak, Robert Kehoe, David Kirkby, Anthony Kremin, Martin Landriau, Laurent Le Guillou, Aaron Meisner, Ramon Miquel, John Moustakas, Nathalie Palanque-Delabrouille, Francisco Prada, Ignasi Pérez-Ràfols, Graziano Rossi, Eusebio Sanchez, David Schlegel, Michael Schubnell, Hee-Jong Seo, David Sprayberry, Gregory Tarlé, Benjamin Alan Weaver, Hu Zou

Comments 20 pages, 11 figures

Details
Abstract

Lyman Alpha Emitters (LAEs) are valuable high-redshift cosmological probes traditionally identified using specialized narrow-band photometric surveys. In ground-based spectroscopy, it can be difficult to distinguish the sharp LAE peak from residual sky emission lines using automated methods, leading to misclassified redshifts. We present a Bayesian spectral component separation technique to automatically determine spectroscopic redshifts for LAEs while marginalizing over sky residuals. We use visually inspected spectra of LAEs obtained using the Dark Energy Spectroscopic Instrument (DESI) to create a data-driven prior and can determine redshift by jointly inferring sky residual, LAE, and residual components for each individual spectrum. We demonstrate this method on 881 spectroscopically observed $z = 2-4$ DESI LAE candidate spectra and determine their redshifts with $>$90% accuracy when validated against visually inspected redshifts. Using the $Δχ^2$ value from our pipeline as a proxy for detection confidence, we then explore potential survey design choices and implications for targeting LAEs with medium-band photometry. This method allows for scalability and accuracy in determining redshifts from DESI spectra, and the results provide recommendations for LAE targeting in anticipation of future high-redshift spectroscopic surveys.

2503.13191 2026-03-26 math.ST stat.ME stat.TH

Stein's method of moment estimators for local dependency exponential random graph models

Adrian Fischer, Gesine Reinert, Wenkai Xu

Comments Updated version with detailed connection to MPLE

Details
Abstract

Providing theoretical guarantees for parameter estimation in exponential random graph models is a largely open problem. While maximum likelihood estimation has theoretical guarantees in principle, verifying the assumptions for these guarantees to hold can be very difficult. Moreover, in complex networks, numerical maximum likelihood estimation is computer-intensive and may not converge in reasonable time. To ameliorate this issue, local dependency exponential random graph models have been introduced, which assume that the network consists of many independent exponential random graphs. In this setting, progress towards maximum likelihood estimation has been made. However, the estimation is still computer-intensive. Instead, we propose to use so-called Stein estimators: we use the Stein characterizations to obtain new estimators for local dependency exponential random graph models.

2502.02861 2026-03-26 stat.ML cs.DS cs.LG

Algorithms with Calibrated Machine Learning Predictions

Judy Hanwen Shen, Ellen Vitercik, Anders Wikum

Comments Matches the camera-ready version accepted at ICML 2025

Details
Abstract

The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. A central consideration is the extent to which predictions can be trusted -- while existing approaches often require users to specify an aggregate trust level, modern machine learning models can provide estimates of prediction-level uncertainty. In this paper, we propose calibration as a principled and practical tool to bridge this gap, demonstrating the benefits of calibrated advice through two case studies: the ski rental and online job scheduling problems. For ski rental, we design an algorithm that achieves near-optimal prediction-dependent performance and prove that, in high-variance settings, calibrated advice offers more effective guidance than alternative methods for uncertainty quantification. For job scheduling, we demonstrate that using a calibrated predictor leads to significant performance improvements over existing methods. Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions.

2501.06844 2026-03-26 stat.ME

REML implementations of kernel-based genomic prediction models for genotype x environment x management interactions

Killian A. C. Melsen, Salvador Gezan, Daniel J. Tolhurst, Fred A. van Eeuwijk, Carel F. W. Peeters

Details
Abstract

High-throughput pheno-, geno-, and envirotyping allows characterization of plant genotypes and the trials they are evaluated in, producing different types of data. These different data modalities can be integrated into statistical or machine learning models for genomic prediction in several ways. One commonly used approach within the analysis of multi-environment trial data in plant breeding is to create linear or nonlinear kernels which are subsequently used in linear mixed models (LMMs) to model genotype by environment (G$\times$E) interactions. Current implementations of these kernel-based LMMs present a number of opportunities in terms of methodological extensions. Here we show how these models can be implemented in standard software, allowing direct restricted maximum likelihood (REML) estimation of all parameters. We also further extend the models by combining the kernels with unstructured covariance matrices for three-way interactions in genotype by environment by management (G$\times$E$\times$M) datasets, while simultaneously allowing for environment-specific genetic variances. We show how the models incorporating nonlinear kernels and heterogeneous variances maximize the amount of genetic variance captured by environmental covariables and perform best in prediction settings. We discuss the opportunities regarding models with multiple kernels or kernels obtained after environmental feature selection, as well as the similarities to models regressing phenotypes on latent and observed environmental covariables. Finally, we discuss the flexibility provided by our implementation in terms of modeling complex plant breeding datasets, allowing for straightforward integration of phenomics, enviromics, and genomics.

2501.03883 2026-03-26 stat.ME stat.CO

Spline Quantile Regression

Ta-Hsin Li, Nimrod Megiddo

Details
Journal ref
Journal of Statistical Theory and Practice, 20:30, 2026
Abstract

Quantile regression is a powerful tool capable of offering a richer view of the data as compared to least-squares regression. Quantile regression is typically performed individually on a few quantiles or a grid of quantiles without considering the similarity of the underlying regression coefficients at nearby quantiles. When needed, an ad hoc post-processing procedure such as kernel smoothing is employed to smooth the individually estimated coefficients across quantiles and thereby improve the performance of these estimates. This paper introduces a new method, called spline quantile regression (SQR), that unifies quantile regression with quantile smoothing and jointly estimates the regression coefficients across quantiles as smoothing splines. We discuss the computation of the SQR solution as a linear program (LP) using an interior-point algorithm. We also experiment with some gradient algorithms that require less memory than the LP algorithm. The performance of the SQR method and these algorithms is evaluated using simulated and real-world data.
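As a rough illustration of the building block behind quantile regression, the sketch below implements the check (pinball) loss and finds the constant that minimizes it over a small sample. It is not the SQR method or its LP formulation; the function names and the toy data are hypothetical and chosen only for illustration.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def best_constant_quantile(y, tau):
    """Return the observed value q that minimizes the total check loss.

    For an intercept-only model the minimizer is a sample quantile, so it is
    enough to search over the observed values.
    """
    return min(sorted(y), key=lambda q: sum(check_loss(v - q, tau) for v in y))


# Hypothetical toy data with one large value
toy_sample = [1, 2, 3, 4, 10]
median_fit = best_constant_quantile(toy_sample, tau=0.5)
upper_fit = best_constant_quantile(toy_sample, tau=0.9)
```

With these toy values the fit at `tau=0.5` is the sample median, while the fit at `tau=0.9` moves to the largest observation, which shows how the quantile level changes the target of the regression.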

2412.17163 2026-03-26 stat.ME

Spline Autoregression Method for Estimation of Quantile Spectrum

Ta-Hsin Li

Details
Journal ref
Journal of Computational and Graphical Statistics, 1-15, 2025
Abstract

The quantile spectrum was introduced in Li (2012; 2014) as an alternative tool for spectral analysis of time series. It has the capability of providing a richer view of time series data than that offered by the ordinary spectrum, especially for nonlinear dynamics such as stochastic volatility. A novel method, called spline autoregression (SAR), is proposed in this paper for estimating the quantile spectrum as a bivariate function of frequency and quantile level, under the assumption that the quantile spectrum varies smoothly with the quantile level. The SAR method is facilitated by the quantile discrete Fourier transform (QDFT) based on trigonometric quantile regression. It is enabled by the resulting time-domain quantile series (QSER), which represents properly scaled oscillatory characteristics of the original time series around a quantile. A functional autoregressive (AR) model is fitted to the QSER on a grid of quantile levels by penalized least-squares with the AR coefficients represented as smoothing splines of the quantile level. While the ordinary AR model is widely used for conventional spectral estimation, the proposed SAR method provides an effective way of estimating the quantile spectrum as a bivariate function in comparison with the alternatives. This is confirmed by a simulation study.

2412.02513 2026-03-26 stat.ME

Quantile-Crossing Spectrum and Spline Autoregression Estimation

Ta-Hsin Li

Details
Journal ref
Statistical Inference for Stochastic Processes, 28:20, 2025
Abstract

The quantile-crossing spectrum is the spectrum of quantile-crossing processes created from a time series by the indicator function that shows whether or not the time series lies above or below a given quantile at a given time. This bivariate function of frequency and quantile level provides a richer view of serial dependence than that offered by the ordinary spectrum. We propose a new method for estimating the quantile-crossing spectrum as a bivariate function of frequency and quantile level. The proposed method, called spline autoregression (SAR), jointly fits an AR model to the quantile-crossing series across multiple quantiles; the AR coefficients are represented as spline functions of the quantile level and penalized for their roughness. Numerical experiments show that when the underlying spectrum is smooth in quantile level the proposed method is able to produce more accurate estimates in comparison with the alternative that ignores the smoothness.

2405.17573 2026-03-26 stat.ML cs.AI cs.LG

Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets

Arthur Jacot, Alexandre Kaiser

Details
Abstract

We study Leaky ResNets, which interpolate between ResNets and Fully-Connected nets depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlights the importance of two terms: a kinetic energy which favors small layer derivatives $\partial_{p}A_{p}$ and a potential energy that favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high-dimensional inputs to a low-dimensional representation, moves slowly inside the space of low-dimensional representations, and then jumps back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size to adapt to the separation of timescales.

2402.08151 2026-03-26 stat.ME cs.AI cs.LG math.SP math.ST stat.TH

Perturbative adaptive importance sampling for Bayesian LOO cross-validation

Joshua C Chang, Xiangting Li, Tianyi Su, Shixin Xu, Hao-Ren Yao, Julia Porcino, Carson Chow

Comments Submitted

Details
Abstract

Importance sampling (IS) is an efficient stand-in for model refitting when performing leave-one-out (LOO) cross-validation (CV) on a Bayesian model. IS inverts the Bayesian update for a single observation by reweighting posterior samples. The so-called importance weights have high variance -- we resolve this issue through adaptation by transformation. We observe that removing a single observation perturbs the posterior by $\mathcal{O}(1/n)$, motivating bijective transformations of the form $T(θ)=θ+ h Q(θ)$ for $0<h\ll 1.$ We introduce several such transformations: partial moment matching, which generalizes prior work on affine moment-matching with a tunable step size; log-likelihood descent, which partially inverts the Bayesian update for an observation; and gradient flow steps that minimize the KL divergence or IS variance. The gradient flow and likelihood descent transformations require Jacobian determinants, which are available via auto-differentiation; we additionally derive closed-form expressions for logistic regression and shallow ReLU networks. We tested the methodology on classification ($n\ll p$), count regression (Poisson and zero-inflated negative binomial), and survival analysis problems, finding that no single transformation dominates but their combination nearly eliminates the need to refit.
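To make the reweighting step concrete, here is a minimal sketch of plain (untransformed) IS for LOO. It assumes we already have the likelihood of the held-out observation evaluated at each posterior draw. The function names and the numbers are hypothetical, and the adaptive transformations described in the abstract are not implemented.

```python
def is_loo_predictive_density(likelihoods):
    """Self-normalized IS estimate of p(y_i | y_{-i}).

    `likelihoods` holds p(y_i | theta_s) for posterior draws theta_s.
    The importance weights are w_s = 1 / p(y_i | theta_s), and the estimate
    sum_s w_s * p(y_i | theta_s) / sum_s w_s reduces to S / sum_s w_s.
    """
    weights = [1.0 / lik for lik in likelihoods]
    return len(weights) / sum(weights)


def effective_sample_size(likelihoods):
    """Kish effective sample size of the weights; small values signal high variance."""
    weights = [1.0 / lik for lik in likelihoods]
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical likelihood values at two posterior draws
draw_likelihoods = [0.5, 0.25]
loo_density = is_loo_predictive_density(draw_likelihoods)
ess = effective_sample_size(draw_likelihoods)
```

A low effective sample size relative to the number of draws is the usual symptom of the weight-variance problem that the abstract sets out to address.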

2211.05844 2026-03-26 stat.ME stat.CO

Quantile Fourier Transform, Quantile Series, and Nonparametric Estimation of Quantile Spectra

Ta-Hsin Li

Details
Journal ref
Communications in Statistics - Simulation and Computation, 1-22, 2025
Abstract

A nonparametric method is proposed for estimating the quantile spectra and cross-spectra introduced in Li (2012; 2014) as bivariate functions of frequency and quantile level. The method is based on the quantile discrete Fourier transform (QDFT) defined by trigonometric quantile regression and the quantile series (QSER) defined by the inverse Fourier transform of the QDFT. A nonparametric spectral estimator is constructed from the autocovariance function of the QSER using the lag-window (LW) approach. Smoothing techniques are also employed to reduce the statistical variability of the LW estimator across quantiles when the underlying spectrum varies smoothly with respect to the quantile level. The performance of the proposed estimation method is evaluated through a simulation study.

2012.08371 2026-03-26 math.ST stat.ME stat.TH

Limiting laws and consistent estimation criteria for fixed and diverging number of spiked eigenvalues

Jianwei Hu, Jingfei Zhang, Jianhua Guo, Ji Zhu

Details
Abstract

In this paper, we study limiting laws and consistent estimation criteria for the extreme eigenvalues in a spiked covariance model of dimension $p$. Firstly, for fixed $p$, we propose a generalized estimation criterion that can consistently estimate $k$, the number of spiked eigenvalues. Compared with the existing literature, we show that consistency can be achieved under weaker conditions on the penalty term. Next, allowing both $p$ and $k$ to diverge, we derive limiting distributions of the spiked sample eigenvalues using random matrix theory techniques. Notably, our results do not require the spiked eigenvalues to be uniformly bounded from above or tending to infinity, as has been assumed in the existing literature. Based on the above derived results, we formulate a generalized estimation criterion and show that it can consistently estimate $k$, while $k$ can be fixed or grow at an order of $k=o(n^{1/3})$. We further show that the results in our work continue to hold under a general population distribution without assuming normality. The efficacy of the proposed estimation criteria is illustrated through comparative simulation studies.

1908.02545 2026-03-26 q-fin.ST stat.ME

Quantile-Frequency Analysis and Spectral Measures for Diagnostic Checks of Time Series With Nonlinear Dynamics

Ta-Hsin Li

Details
Journal ref
Journal of the Royal Statistical Society Series C, 70(2), 270-290, 2021
Abstract

Nonlinear dynamic volatility has been observed in many financial time series. The recently proposed quantile periodogram offers an alternative way to examine this phenomenon in the frequency domain. The quantile periodogram is constructed from trigonometric quantile regression of time series data at different frequencies and quantile levels, enabling the quantile-frequency analysis (QFA) of nonlinear serial dependence. This paper introduces some spectral measures based on the quantile periodogram for diagnostic checks of financial time series models and for model-based discriminant analysis. A simulation-based parametric bootstrapping technique is employed to compute the $p$-values of the spectral measures. The usefulness of the proposed method is demonstrated by a simulation study and a motivating application using the daily log returns of the S&P 500 index together with GARCH-type models. The results show that the QFA method is able to provide additional insights into the goodness of fit of these financial time series models that may have been missed by conventional tests. The results also show that the QFA method offers a more informative way of discriminant analysis for detecting regime changes in financial time series.

2603.24122 2026-03-26 stat.ME

Scoring Rules with Normalized Upper Order Statistics for Tail Inference

Martin Bladt, Christoffer Øhlenschlæger

Comments 8 figures, 1 table

Details
Abstract

This paper proposes a scoring-rule-based method for ranking predictive distributions in the Fréchet domain that is able to distinguish between different tail indices. The approach is built on normalized order statistics and exploits proper scoring rules to compare tail limit distributions in a distributional framework, with direct relevance for insurance claim-severity tails. On the theoretical side, consistency and asymptotic normality for empirical tail scores based on normalized upper order statistics are obtained through residual estimation theory. Simulation results demonstrate that the scoring-rule-based approach is capable of discriminating between different tail behaviors in finite samples and that trends in the scaling have only a minor impact on stability. We further show that optimizing scoring rules (equivalently, minimizing the associated loss form) yields consistent tail-index estimators and that the classical Hill estimator arises as a special case. The performance of the proposed method is investigated and compared with the Hill estimator across a range of tail indices. Lastly, we analyze an automobile claim-severity data set to demonstrate how scoring rules can be used to rank predictive models based on tail predictions in actuarial settings.
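For readers unfamiliar with the classical Hill estimator mentioned above, the following sketch computes it from the k largest observations. It illustrates only the textbook estimator, not the paper's scoring-rule procedure. The deterministic Pareto quantiles used as input are hypothetical data chosen so the result is reproducible.

```python
import math


def hill_estimator(sample, k):
    """Hill estimate of the tail index gamma = 1/alpha from the k largest values."""
    if not 0 < k < len(sample):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    ordered = sorted(sample)
    threshold = ordered[-k - 1]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold observation must be positive")
    # Average log-excess of the k largest values over the threshold
    return sum(math.log(x / threshold) for x in ordered[-k:]) / k


# Hypothetical input: exact Pareto quantiles with alpha = 2, so the true gamma is 0.5
n, alpha = 1000, 2.0
pareto_quantiles = [(1.0 - (i - 0.5) / n) ** (-1.0 / alpha) for i in range(1, n + 1)]
gamma_hat = hill_estimator(pareto_quantiles, k=100)
```

On this idealized input the estimate lands close to the true value of 0.5. With real data the choice of `k` trades bias against variance and usually needs a diagnostic plot or a selection rule.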

2603.24108 2026-03-26 stat.ME eess.SP

Aitchison Geometry on the Simplex for Uncertainty Quantification in Bayesian Hyperspectral Image Unmixing

Hector Blondel, Lucas Drumetz, Thierry Chonavel

Details
Abstract

Most algorithms for hyperspectral image unmixing produce point estimates of fractional abundances of the materials to be separated. However, in the absence of reliable ground truth, the ability to perform abundance uncertainty quantification (UQ) should be an important feature of algorithms, e.g. to evaluate how hard the unmixing problem is and how much the results should be trusted. The usual modeling assumptions in Bayesian models for unmixing rely heavily on the Euclidean geometry of the simplex and typically disregard spatial information. In addition, to our knowledge, abundance UQ is close to nonexistent. In this paper, we propose to leverage Aitchison geometry from the compositional data analysis literature to provide practitioners with alternative tools for modeling prior abundance distributions. In particular, we show how to design simplex-valued Gaussian Process priors using this geometry. Then we link Aitchison geometry to constrained sampling algorithms in the literature, and propose UQ diagnostics that comply with the constraints on abundance vectors. We illustrate these concepts on real and simulated data.

2603.24041 2026-03-26 stat.ME cs.LG

Minimal Sufficient Representations for Self-interpretable Deep Neural Networks

Zhiyao Tan, Liu Li, Huazhen Lin

Details
Abstract

Deep neural networks (DNNs) achieve remarkable predictive performance but remain difficult to interpret, largely due to overparameterization that obscures the minimal structure required for interpretation. Here we introduce DeepIn, a self-interpretable neural network framework that adaptively identifies and learns the minimal representation necessary for preserving the full expressive capacity of standard DNNs. We show that DeepIn can correctly identify the minimal representation dimension, select relevant variables, and recover the minimal sufficient network architecture for prediction. The resulting estimator achieves optimal non-asymptotic error rates that adapt to the learned minimal dimension, demonstrating that recovering minimal sufficient structure fundamentally improves generalization error. Building on these guarantees, we further develop hypothesis testing procedures for both selected variables and learned representations, bridging deep representation learning with formal statistical inference. Across biomedical and vision benchmarks, DeepIn improves both predictive accuracy and interpretability, reducing error by up to 30% on real-world datasets while automatically uncovering human-interpretable discriminative patterns. Our results suggest that interpretability and statistical rigor can be embedded directly into deep architectures without sacrificing performance.

2603.24025 2026-03-26 cs.LG stat.ME

i-IF-Learn: Iterative Feature Selection and Unsupervised Learning for High-Dimensional Complex Data

Chen Ma, Wanjie Wang, Shuhao Fan

Comments 28 pages, 5 figures, including appendix. Accepted at AISTATS

Details
English abstract

Unsupervised learning of high-dimensional data is challenging due to irrelevant or noisy features obscuring underlying structures. Often, only a few features, called the influential features, meaningfully define the clusters. Recovering these influential features is helpful in data interpretation and clustering. We propose i-IF-Learn, an iterative unsupervised framework that jointly performs feature selection and clustering. Our core innovation is an adaptive feature selection statistic that effectively combines pseudo-label supervision with unsupervised signals, dynamically adjusting based on intermediate label reliability to mitigate error propagation common in iterative frameworks. Leveraging low-dimensional embeddings (PCA or Laplacian eigenmaps) followed by $k$-means, i-IF-Learn simultaneously outputs an influential feature subset and clustering labels. Numerical experiments on gene microarray and single-cell RNA-seq datasets show that i-IF-Learn significantly surpasses classical and deep clustering baselines. Furthermore, using our selected influential features as preprocessing substantially enhances downstream deep models such as DeepCluster, UMAP, and VAE, highlighting the importance and effectiveness of targeted feature selection.

2603.24015 2026-03-26 stat.ME stat.AP

STAMP: A shot-type-aware areal multilevel Poisson model for league-wide comparison of basketball shot charts

Kazuhiro Yamada, Keisuke Fujii

Comments 25 pages

Details
English abstract

Shooting location is a core indicator of offensive style in invasion sports. Existing basketball shot-chart analyses often use spatial information for descriptive visualization, location-based efficiency modeling, or clustering players into shooting archetypes, yet few studies provide a unified framework for fair comparison of shot-type-specific tendencies. We propose the shot-type-aware areal multilevel Poisson (STAMP) model, which jointly models team-level field-goal attempts across predefined court regions, seasons, and shot types using a Poisson likelihood with a possession-based exposure offset. The hierarchical random-effects structure combines team, area, team-area, and team-side random effects with shot-type-specific random slopes for key shot categories. We fit the model using approximate Bayesian inference via the Integrated Nested Laplace Approximation (INLA), enabling efficient analysis of more than $3\times 10^{5}$ shots from two seasons of B.LEAGUE (the men's professional basketball league in Japan). The STAMP model achieves better out-of-sample predictive performance than simpler baselines, yielding interpretable relative-rate maps and left-right bias summaries. Case studies illustrate how the model reveals team-specific spatial tendencies for comparative analysis, and we discuss its limitations and potential extensions.

2603.24009 2026-03-26 stat.AP q-bio.QM

Analyzing animal movement using deep learning

Thibault Fronville, Maximilian Pichler, Johannes Signer, Marius Grabow, Stephanie Kramer-Schadt, Viktoriia Radchuk, Florian Hartig

Comments 34 pages, 7 figures

Details
English abstract

Understanding how animals move through heterogeneous landscapes is central to ecology and conservation. In this context, step selection functions (SSFs) have emerged as the main statistical framework to analyze how biotic and abiotic predictors influence movement paths observed by radio tracking, GPS tags, or similar sensors. A traditional SSF consists of a generalized linear model (GLM) that infers the animal's habitat preferences (selection coefficients) by comparing each observed movement step to random steps. Such GLM-SSFs, however, cannot flexibly consider non-linear or interacting effects, unless those have been specified a priori. To address this problem, generalized additive models have been integrated in the SSF framework, but those GAM-SSFs are still limited in their ability to represent complex habitat preferences and inter-individual variability. Here we explore the utility of deep neural networks (DNNs) to overcome these limitations. We find that DNN-SSFs, coupled with explainable AI to extract selection coefficients, offer many advantages for analyzing movement data. In the case of linear effects, they effectively retrieve the same effect sizes and p-values as conventional GLMs. At the same time, however, they can automatically detect complex interaction effects, nonlinear responses, and inter-individual variability if those are present in the data. We conclude that DNN-SSFs are a promising extension of traditional SSF. Our analysis extends previous research on DNN-SSF by exploring differences and similarities of GLM, GAM and DNN-based SSF models in more depth, in particular regarding the validity of statistical indicators that are derived from the DNN. We also propose new DNN structures to capture inter-individual effects that can be viewed as a nonlinear random effect. All methods used in this paper are available via the 'citoMove' R package.

2603.23963 2026-03-26 stat.ME stat.AP

An Exponential-Polynomial Divergence-based Robust Information Criterion for Linear Panel Data Models and Neural Networks

Udita Goswami, Shuvashree Mondal

Comments 31 pages, 2 figures

Details
English abstract

Model selection is a cornerstone of statistical inference, where information criteria are widely employed to balance model fit and complexity. However, classical likelihood-based criteria are often highly sensitive to contamination, outliers, and model misspecification. In this paper, we develop a robust alternative based on the Exponential-Polynomial Divergence, a flexible extension of existing divergence measures that enhances adaptability to diverse data irregularities. The proposed Exponential-Polynomial Divergence Information Criterion preserves the objective of approximating the discrepancy between the true model and candidate models while incorporating robustness against anomalous observations. Its theoretical properties are established, and robustness is examined through influence function analysis, demonstrating controlled sensitivity to extreme data points. For practical implementation, a data-driven tuning parameter selection strategy based on generalized score matching is employed, ensuring improved computational stability and efficiency. The effectiveness of the proposed method is demonstrated through extensive simulation studies under varying contamination levels, as well as real data applications involving linear mixed-effects panel data models and neural network-based prediction tasks. The results consistently show improved stability and reliability compared to classical likelihood and density power divergence-based information criteria. The proposed framework thus provides a practical and unified approach for model selection in complex and contaminated data settings.

2603.23959 2026-03-26 math.ST stat.CO stat.ME stat.TH

Microergodicity implies orthogonality of Matérn fields on bounded domains in $\mathbb{R}^4$

Natesh S. Pillai

Details
English abstract

Matérn random fields are one of the most widely used classes of models in spatial statistics. The fixed-domain identifiability of covariance parameters for stationary Matérn Gaussian random fields exhibits a dimension-dependent phase transition. For known smoothness $ν$, Zhang (2004) showed that when $d\le3$, two Matérn models with the same microergodic parameter $m=σ^2α^{2ν}$ induce equivalent Gaussian measures on bounded domains, while Anderes (2010) proved that when $d>4$, the corresponding measures are mutually singular whenever the parameters differ. The critical case $d=4$ for stationary Matérn models has remained open. We resolve this case. Let $d=4$ and consider two stationary Matérn models on $\mathbb R^4$ with parameters $(σ_1,α_1)$ and $(σ_2,α_2)$ satisfying \[ σ_1^2α_1^{2ν}=σ_2^2α_2^{2ν}, \qquad α_1\neq α_2. \] We prove that the corresponding Gaussian measures on any bounded observation domain are mutually singular on every countable dense observation set, and on the associated path space of continuous functions. Our approach can be viewed as a spectral analogue of the higher-order increment method of Anderes (2010). Whereas Anderes isolates the second irregular covariance coefficient through renormalized quadratic variations in physical space, we detect the first nonvanishing high-frequency spectral mismatch via localized Fourier coefficients and use a normalized Whittle score to identify parameters. More broadly, the localized spectral probing framework used here for detecting subtle covariance differences in Gaussian random fields may be useful for studying identifiability and estimation in other spatial models.
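To make the constraint $σ_1^2α_1^{2ν}=σ_2^2α_2^{2ν}$ concrete, the following is a minimal sketch, not taken from the paper, that builds a second parameter pair sharing the microergodic parameter of a first one. The function names `microergodic` and `matched_variance` and the numeric values are illustrative assumptions.

```python
def microergodic(sigma2, alpha, nu):
    """Microergodic parameter m = sigma^2 * alpha^(2 * nu)."""
    return sigma2 * alpha ** (2.0 * nu)


def matched_variance(sigma2_1, alpha_1, alpha_2, nu):
    """Variance sigma_2^2 giving (sigma_2, alpha_2) the same m as (sigma_1, alpha_1)."""
    return microergodic(sigma2_1, alpha_1, nu) / alpha_2 ** (2.0 * nu)


# Illustrative values only (assumed, not from the paper).
nu = 0.5
sigma2_1, alpha_1 = 2.0, 3.0
alpha_2 = 1.5  # differs from alpha_1, as the abstract requires
sigma2_2 = matched_variance(sigma2_1, alpha_1, alpha_2, nu)

# Both pairs share the same m although alpha_1 != alpha_2.
print(microergodic(sigma2_1, alpha_1, nu), microergodic(sigma2_2, alpha_2, nu))
```

Pairs constructed this way are exactly the ones whose Gaussian measures the abstract compares: equivalent for $d\le3$, and shown there to be mutually singular for $d=4$.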

2603.23926 2026-03-26 cs.LG cs.IT math.IT math.OC stat.ML

Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Guy Zamir, Matthew Zurek, Yudong Chen

Details
English abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $γ$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $γ$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

2603.23923 2026-03-26 stat.ME stat.ML

Elements of Conformal Prediction for Statisticians

Matteo Sesia, Stefano Favaro

Details
English abstract

Predictive inference is a fundamental task in statistics, traditionally addressed using parametric assumptions about the data distribution and detailed analyses of how models learn from data. In recent years, conformal prediction has emerged as a rapidly growing alternative framework that is particularly well suited to modern applications involving high-dimensional data and complex machine learning models. Its appeal stems from being both distribution-free -- relying mainly on symmetry assumptions such as exchangeability -- and model-agnostic, treating the learning algorithm as a black box. Even under such limited assumptions, conformal prediction provides exact finite-sample guarantees, though these are typically of a marginal nature that requires careful interpretation. This paper explains the core ideas of conformal prediction and reviews selected methods. Rather than offering an exhaustive survey, it aims to provide a clear conceptual entry point and a pedagogical overview of the field.
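As a rough illustration of the ideas summarized above, here is a minimal sketch of split conformal prediction with absolute-residual scores, one of the standard constructions in this literature. It is not code from the paper: the black-box model `predict`, the simulated data, and the helper name `split_conformal_interval` are assumptions made for the example.

```python
import math
import random


def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Interval around `prediction` built from absolute calibration residuals."""
    n = len(cal_residuals)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    q = sorted(cal_residuals)[k - 1]
    return (prediction - q, prediction + q)


def predict(x):
    # Hypothetical fitted model, treated as a black box.
    return 2.0 * x


rng = random.Random(0)


def draw():
    # Simulated exchangeable data (an assumption for this demo).
    x = rng.uniform(0.0, 1.0)
    return x, 2.0 * x + rng.gauss(0.0, 0.5)


calibration = [draw() for _ in range(200)]
residuals = [abs(y - predict(x)) for x, y in calibration]

test_points = [draw() for _ in range(1000)]
covered = 0
for x, y in test_points:
    lo, hi = split_conformal_interval(residuals, predict(x), alpha=0.1)
    covered += lo <= y <= hi
print("empirical coverage:", covered / len(test_points))
```

The guarantee is marginal: coverage of at least $1-α$ holds on average over calibration and test draws, not conditionally on a particular input, which is the interpretive caveat the abstract mentions.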

2603.23872 2026-03-26 astro-ph.GA math.ST stat.TH

Rigorous Formulation of Finite-Sample and Finite-Window Effects in Galaxy Clustering

Tsutomu T. Takeuchi, Satoshi Kuriki, Keisuke Yano

Comments 8 pages, no figure, submitted

Details
English abstract

Galaxy surveys provide finite catalogs of objects observed within bounded volumes, yet clustering statistics are often interpreted using theoretical frameworks developed for infinite point processes. In this work, we formulate key statistical quantities directly for finite point processes and examine the structural consequences of finite-number and finite-window constraints. We show that several well-known features of galaxy survey analysis arise naturally from finiteness alone. In particular, non-vanishing higher-order connected correlations can occur even in statistically independent samples when the total number of points is fixed, and the integral constraint in two-point statistics appears as an exact identity implied by the finite-number condition rather than as an estimator artifact. We further demonstrate that counts-in-cells and point-centered environmental measures correspond to distinct statistical ensembles. Using Palm conditioning, we derive an exact relation between random-cell and point-centered statistics, showing that the latter probe a tilted version of the underlying distribution. These results provide a probabilistic framework for separating structural effects imposed by finite sampling from correlations reflecting genuine astrophysical processes. The formulation presented here remains valid for realistic survey geometries and finite data sets and clarifies the interpretation of commonly used clustering statistics in galaxy surveys.
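As a minimal worked example of the fixed-number effect described above (a standard textbook calculation, not taken from the paper), consider $N$ points drawn independently and uniformly in a window $W$ of volume $|W|$. The symbols $\rho_1$, $\rho_2$, and $\xi$ are notation assumed here for the first- and second-order product densities and the two-point correlation function.

```latex
\begin{align*}
\rho_1(x) &= \frac{N}{|W|}, \qquad
\rho_2(x,y) = \frac{N(N-1)}{|W|^2}, \\
\xi(x,y) &= \frac{\rho_2(x,y)}{\rho_1(x)\,\rho_1(y)} - 1 = -\frac{1}{N}, \\
\int_W\!\int_W \rho_1(x)\,\rho_1(y)\,\xi(x,y)\,dx\,dy &= N(N-1) - N^2 = -N .
\end{align*}
```

Even fully independent points therefore show a negative two-point correlation of order $1/N$ once the total count is fixed. The last line holds for any process with exactly $N$ points in $W$, which is the sense in which the integral constraint is an identity rather than an estimator artifact.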

2603.23835 2026-03-26 stat.ML cs.LG math.ST stat.TH

Beyond Consistency: Inference for the Relative risk functional in Deep Nonparametric Cox Models

Sattwik Ghosal, Xuran Meng, Yi Li

Comments 24 pages, 5 figures, 4 tables

Details
English abstract

There remain theoretical gaps in deep neural network estimators for the nonparametric Cox proportional hazards model. In particular, it is unclear how gradient-based optimization error propagates to population risk under partial likelihood, how pointwise bias can be controlled to permit valid inference, and how ensemble-based uncertainty quantification behaves under realistic variance decay regimes. We develop an asymptotic distribution theory for deep Cox estimators that addresses these issues. First, we establish nonasymptotic oracle inequalities for general trained networks that link in-sample optimization error to population risk without requiring the exact empirical risk optimizer. We then construct a structured neural parameterization that achieves infinity-norm approximation rates compatible with the oracle bound, yielding control of the pointwise bias. Under these conditions and using the Hajek--Hoeffding projection, we prove pointwise and multivariate asymptotic normality for subsampled ensemble estimators. We derive a range of subsample sizes that balances bias correction with the requirement that the Hajek--Hoeffding projection remain dominant. This range accommodates decay conditions on the single-overlap covariance, which measures how strongly a single shared observation influences the estimator, and is weaker than those imposed in the subsampling literature. An infinitesimal jackknife representation provides analytic covariance estimation and valid Wald-type inference for relative risk contrasts such as log-hazard ratios. Finally, we illustrate the finite-sample implications of the theory through simulations and a real data application.

2603.23831 2026-03-26 cs.LG eess.SP stat.ML

Unveiling Hidden Convexity in Deep Learning: a Sparse Signal Processing Perspective

Emi Zeger, Mert Pilanci

Details
English abstract

Deep neural networks (DNNs), particularly those using Rectified Linear Unit (ReLU) activation functions, have achieved remarkable success across diverse machine learning tasks, including image recognition, audio processing, and language modeling. Despite this success, the non-convex nature of DNN loss functions complicates optimization and limits theoretical understanding. In this paper, we highlight how recently developed convex equivalences of ReLU NNs and their connections to sparse signal processing models can address the challenges of training and understanding NNs. Recent research has uncovered several hidden convexities in the loss landscapes of certain NN architectures, notably two-layer ReLU networks and other deeper or varied architectures. This paper seeks to provide an accessible and educational overview that bridges recent advances in the mathematics of deep learning with traditional signal processing, encouraging broader signal processing applications.

2603.23805 2026-03-26 cs.LG cs.AI cs.NE stat.ML

Deep Neural Regression Collapse

Akshay Rangamani, Altay Unal

Comments Accepted to CPAL 2026; Code will be available at https://github.com/altayunal/neural-collapse-regression

Details
English abstract

Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.

2603.23792 2026-03-26 cs.LG stat.ML

Manifold Generalization Provably Proceeds Memorization in Diffusion Models

Zebang Shen, Ya-Ping Hsieh, Niao He

Comments The first two authors contributed equally

Details
English abstract

Diffusion models often generate novel samples even when the learned score is only coarse -- a phenomenon not accounted for by the standard view of diffusion training as density estimation. In this paper, we show that, under the manifold hypothesis, this behavior can instead be explained by coarse scores capturing the geometry of the data while discarding the fine-scale distributional structure of the population measure $μ_{\scriptscriptstyle\mathrm{data}}$. Concretely, whereas estimating the full data distribution $μ_{\scriptscriptstyle\mathrm{data}}$ supported on a $k$-dimensional manifold is known to require the classical minimax rate $\tilde{\mathcal{O}}(N^{-1/k})$, we prove that diffusion models trained with coarse scores can exploit the regularity of the manifold support and attain a near-parametric rate toward a different target distribution. This target distribution has density uniformly comparable to that of $μ_{\scriptscriptstyle\mathrm{data}}$ throughout any $\tilde{\mathcal{O}}\bigl(N^{-β/(4k)}\bigr)$-neighborhood of the manifold, where $β$ denotes the manifold regularity. Our guarantees therefore depend only on the smoothness of the underlying support, and are especially favorable when the data density itself is irregular, for instance non-differentiable. In particular, when the manifold is sufficiently smooth, we obtain that generalization -- formalized as the ability to generate novel, high-fidelity samples -- occurs at a statistical rate strictly faster than that required to estimate the full population distribution $μ_{\scriptscriptstyle\mathrm{data}}$.

2603.23790 2026-03-26 stat.ME cs.CE

Root Finding and Metamodeling for Rapid and Robust Computer Model Calibration

Yongseok Jeon, Sara Shashaani

Details
English abstract

We consider the computer model calibration problem, where the goal is to find the parameters that minimize the discrepancy between the multivariate real-world and computer model outputs. We propose to solve an approximation using signed residuals that enables a root finding approach and an accelerated search. We characterize the distance of the solutions to the approximation from the solutions of the original problem for strongly convex objective functions, showing that it depends on the variability of the signed residuals across output dimensions, as well as their variance and covariance. We develop a metamodel-based root finding framework under kriging and stochastic kriging that is augmented with a sequential search space reduction. We derive three new acquisition functions for finding roots of the approximate problem along with their derivatives usable by first-order solvers. Compared to kriging, stochastic kriging accounts for observational noise, promoting more robust solutions. We also analyze the case where a root may not exist. Our analysis of the asymptotic behavior in this context shows that, since the existence of roots in the approximation problem may not be known a priori, using the new acquisition functions will not compromise the outcome. Numerical experiments on data-driven and physics-based examples demonstrate significant computational gains over standard calibration approaches.

2603.23736 2026-03-26 stat.ML cs.LG math.PR math.ST stat.TH

Wasserstein Parallel Transport for Predicting the Dynamics of Statistical Systems

Tristan Luca Saidi, Gonzalo Mena, Larry Wasserman, Florian Gunsilius

Details
English abstract

Many scientific systems, such as cellular populations or economic cohorts, are naturally described by probability distributions that evolve over time. Predicting how such a system would have evolved under different forces or initial conditions is fundamental to causal inference, domain adaptation, and counterfactual prediction. However, the space of distributions often lacks the vector space structure on which classical methods rely. To address this, we introduce a general notion of parallel dynamics at a distributional level. We base this principle on parallel transport of tangent dynamics along optimal transport geodesics and call it "Wasserstein Parallel Trends". By replacing the vector subtraction of classic methods with geodesic parallel transport, we can provide counterfactual comparisons of distributional dynamics in applications such as causal inference, domain adaptation, and batch-effect correction in experimental settings. The main mathematical contribution is a novel notion of fanning scheme on the Wasserstein manifold that allows us to efficiently approximate parallel transport along geodesics while also providing the first theoretical guarantees for parallel transport in the Wasserstein space. We also show that Wasserstein Parallel Trends recovers the classic parallel trends assumption for averages as a special case and derive closed-form parallel transport for Gaussian measures. We deploy the method on synthetic data and two single-cell RNA sequencing datasets to impute gene-expression dynamics across biological systems.

2603.23726 2026-03-26 stat.ME stat.AP

Inverse Probability Weighting of Count Exposures in the Presence of Missing Data: A Simulation Study

Martin N. Danka, Jessica K. Bone, George B. Ploubidis, Richard J. Silverwood

Details
English abstract

Inverse probability of treatment weighting (IPTW) is widely used to estimate causal effects, but guidance is limited for count exposures. It is also unclear how IPTW performs when combined with multiple imputation in this context. In this study, we evaluated five IPTW methods applied to count exposures: multinomial binning, parametric and non-parametric covariate balancing propensity scores (CBPS, npCBPS), generalised boosted models (GBM), and energy balancing. Our simulations were informed by an example using data from the 1970 British Cohort Study, aiming to estimate the effect of psychological distress, measured as a count of symptoms at age 34, on self-reported longstanding illness at age 42. We compared these approaches on bias, coverage, effective sample size, and other metrics under truncated negative binomial and Poisson exposure distributions. We also assessed the performance of Rubin's rules under different missingness mechanisms. Under complete data, multinomial, CBPS, GBM, and energy weights produced low bias and near-nominal coverage, whereas npCBPS resulted in bias and poor coverage due to extreme weights. When data were missing completely at random, similar performance patterns were observed for IPTW with multiple imputation. Under missing at random, bias increased with higher missingness, but this was present for both IPTW and covariate-adjusted regression, possibly reflecting a limitation of the imputation model rather than a failure of IPTW. Overall, these findings support the use of multinomial, CBPS, GBMs, and energy weights for count exposures in similar settings while highlighting trade-offs between these methods and the need for imputation models accommodating right-truncated overdispersed counts.

2603.23707 2026-03-26 stat.AP

The Long Shadow of Pandemic: Understanding the lingering effects of cause-specific mortality shocks

Yanxin Liu, Kenneth Q. Zhou

Comments Mortality shocks, Long-lasting pandemic effects, Stochastic mortality modeling, Cause-specific mortality, Natural hedging

Details
English abstract

In the aftermath of the COVID-19 pandemic, empirical data have revealed that large-scale health crises not only cause immediate disruptions in mortality dynamics but also have persistent effects that may last for several years. Existing mortality models largely assume that mortality shocks are transitory and overlook how their effects can be long-lasting and heterogeneous across age groups and causes of death. In response to this limitation, we propose a novel stochastic mortality model that captures age- and cause-specific long-lasting effects of mortality jumps through a gamma-density-like decay function, estimated via a customized conditional maximum likelihood algorithm. Applying the model to recent U.S. mortality data, we reveal divergent persistence patterns across demographic groups and provide key insights into the tail risk profiles of life insurance and annuity products. Our scenario-based analyses further show that neglecting persistent shock effects can lead to suboptimal hedging, while the proposed model enables what-if testing to analyze such effects under potential future health crises.

2603.23688 2026-03-26 stat.CO stat.AP stat.ME stat.ML

Adaptive Gaussian Process Search for Simulation-Based Sample Size Estimation in Clinical Prediction Models: Validation of the pmsims R Package

Oyebayo Ridwan Olaniran, Diana Shamsutdinova, Sarah Markham, Felix Zimmer, Daniel Stahl, Gordon Forbes, Ewan Carr

Comments 27 pages, 2 main-text figures, 16 supplementary figures, 9 tables, preprint

Details
English abstract

Background: Determining an adequate sample size is essential for developing reliable and generalisable clinical prediction models, yet practical guidance on selecting appropriate methods remains limited. Existing analytical and simulation-based approaches often rely on restrictive assumptions and focus on mean-based criteria. We present and validate pmsims, an R package that uses Gaussian process surrogate modelling to provide a flexible and computationally efficient simulation-based framework for sample size determination across diverse prediction settings. Methods: We conducted a comprehensive simulation study with two aims. First, we compared three search engines implemented in pmsims: a Gaussian process-based adaptive method, a deterministic bisection method, and a hybrid approach, across binary, continuous, and survival outcomes. Second, we benchmarked the best-performing pmsims engine against existing analytical (pmsampsize) and simulation-based (samplesizedev) methods, evaluating recommended sample sizes, computational time, and achieved performance on large independent validation datasets. Results: The Gaussian process-based method consistently produced the most stable sample size estimates, particularly in low-signal, high-dimensional settings. In benchmarking, pmsims achieved performance close to prespecified targets across all outcome types, matching simulation-based approaches and outperforming analytical methods in more challenging scenarios. Conclusions: pmsims provides an efficient and flexible framework for principled sample size planning in clinical prediction modelling, requiring fewer model evaluations than non-adaptive simulation approaches.
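The deterministic bisection engine mentioned above can be pictured with a short sketch. This is not the pmsims implementation or its API: it is a generic bisection over integer sample sizes, assuming a noise-free performance curve that does not decrease with sample size. The names `min_sample_size` and `expected_performance`, and the curve itself, are hypothetical.

```python
def min_sample_size(performance, target, lo=1, hi=100_000):
    """Smallest n in [lo, hi] with performance(n) >= target.

    Assumes performance is non-decreasing in n (an idealization;
    simulated performance estimates are noisy in practice).
    """
    if performance(hi) < target:
        raise ValueError("target not reached within the search range")
    while lo < hi:
        mid = (lo + hi) // 2
        if performance(mid) >= target:
            hi = mid        # mid is large enough, so the answer is mid or smaller
        else:
            lo = mid + 1    # mid is too small
    return lo


def expected_performance(n):
    # Hypothetical learning curve that rises towards 0.5 as n grows.
    return 0.5 * n / (n + 50.0)


print(min_sample_size(expected_performance, target=0.4))
```

Each step halves the search interval, so the number of performance evaluations grows only logarithmically with the width of the range. Because real performance estimates come from simulation and are noisy, a plain bisection can be misled near the target, which is one way to read the abstract's finding that the Gaussian process-based search gave more stable estimates.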

2603.23655 2026-03-26 math.ST stat.TH

The Bernstein-von Mises theorem and efficiency for semiparametric inference in multivariate Hawkes processes

Mael Duverger, Judith Rousseau

Comments 80 pages, 0 figure

Details
English abstract

In this paper, we study semiparametric inference for linear multivariate Hawkes processes, a class of point processes widely used to describe self- and mutually exciting phenomena. We establish a convolution theorem giving the best limiting distribution for a regular estimator of a smooth functional. Then, in the Bayesian setting, we prove a semiparametric Bernstein-von Mises (BvM) theorem for nonparametric random series priors. We apply this result to histogram- and wavelet-based priors. Taken together, the convolution and BvM theorems show that, from a frequentist point of view, semiparametric Bayesian procedures have asymptotically optimal behavior. Deriving the BvM property for random series priors led us to prove L2 posterior contraction, complementing for these priors the results of Donnet, Rivoirard and Rousseau (2020).

2603.23581 2026-03-26 stat.ML cs.LG

The Mass Agreement Score: A Point-centric Measure of Cluster Size Consistency

Randolph Wiredu-Aidoo

Details
English abstract

In clustering, strong dominance in the size of a particular cluster is often undesirable, motivating a measure of cluster size uniformity that can be used to filter such partitions. A basic requirement of such a measure is stability: partitions that differ only slightly in their point assignments should receive similar uniformity scores. A difficulty arises because cluster labels are not fixed objects; algorithms may produce different numbers of labels even when the underlying point distribution changes very little. Measures defined directly over labels can therefore become unstable under label-count perturbations. I introduce the Mass Agreement Score (MAS), a point-centric metric bounded in [0, 1] that evaluates the consistency of expected cluster size as measured from the perspective of points in each cluster. Its construction yields fragment robustness by design, assigning similar scores to partitions with similar bulk structure while remaining sensitive to genuine redistribution of cluster mass.

2603.23576 2026-03-26 stat.AP cs.AI cs.LG

Wafer-Level Etch Spatial Profiling for Process Monitoring from Time-Series with Time-LLM

Hyunwoo Kim, Munyoung Lee, Seung Hyub Jeon, Kyu Sung Lee

Comments Submitted to AVSS 2026

Details
English abstract

Understanding wafer-level spatial variations from in-situ process signals is essential for advanced plasma etching process monitoring. While most data-driven approaches focus on scalar indicators such as average etch rate, actual process quality is determined by complex two-dimensional spatial distributions across the wafer. This paper presents a spatial regression model that predicts wafer-level etch depth distributions directly from multichannel in-situ process time series. We propose a Time-LLM-based spatial regression model that extends LLM reprogramming from conventional time-series forecasting to wafer-level spatial estimation by redesigning the input embedding and output projection. Using the BOSCH plasma-etching dataset, we demonstrate stable performance under data-limited conditions, supporting the feasibility of LLM-based reprogramming for wafer-level spatial monitoring.

2603.23568 2026-03-26 cs.LG stat.ML

Causal Reconstruction of Sentiment Signals from Sparse News Data

Stefania Stan, Marzio Lunghi, Vito Vargetto, Claudio Ricci, Rolands Repetto, Brayden Leo, Shao-Hong Gan

Comments 28 pages, 2 figures, 14 tables

Abstract

Sentiment signals derived from sparse news are commonly used in financial analysis and technology monitoring, yet transforming raw article-level observations into reliable temporal series remains a largely unsolved engineering problem. Rather than treating this as a classification challenge, we propose to frame it as a causal signal reconstruction problem: given probabilistic sentiment outputs from a fixed classifier, recover a stable latent sentiment series that is robust to the structural pathologies of news data such as sparsity, redundancy, and classifier uncertainty. We present a modular three-stage pipeline that (i) aggregates article-level scores onto a regular temporal grid with uncertainty-aware and redundancy-aware weights, (ii) fills coverage gaps through strictly causal projection rules, and (iii) applies causal smoothing to reduce residual noise. Because ground-truth longitudinal sentiment labels are typically unavailable, we introduce a label-free evaluation framework based on signal stability diagnostics, information preservation lag proxies, and counterfactual tests for causality compliance and redundancy robustness. As a secondary external check, we evaluate the consistency of reconstructed signals against stock-price data for a multi-firm dataset of AI-related news titles (November 2024 to February 2026). The key empirical finding is a three-week lead-lag pattern between reconstructed sentiment and price that persists across all tested pipeline configurations and aggregation regimes, a structural regularity more informative than any single correlation coefficient. Overall, the results support the view that stable, deployable sentiment indicators require careful reconstruction, not only better classifiers.
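The three stages described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the decayed carry-forward rule for gaps, and the exponential moving average are hypothetical stand-ins for the paper's weighting, projection, and smoothing rules, which the abstract does not specify.

```python
from typing import List, Optional, Tuple


def aggregate(cells: List[List[Tuple[float, float]]]) -> List[Optional[float]]:
    """Stage (i): weighted mean of (score, weight) pairs per grid cell.

    Returns None for cells with no coverage. The weights stand in for the
    uncertainty-aware and redundancy-aware weights (assumed given).
    """
    out: List[Optional[float]] = []
    for items in cells:
        total_w = sum(w for _, w in items)
        out.append(sum(s * w for s, w in items) / total_w if total_w > 0 else None)
    return out


def causal_fill(series: List[Optional[float]], decay: float = 0.5) -> List[float]:
    """Stage (ii): fill gaps from past values only (hypothetical decayed carry-forward)."""
    filled, last = [], 0.0
    for v in series:
        last = v if v is not None else last * decay
        filled.append(last)
    return filled


def causal_ema(series: List[float], alpha: float = 0.3) -> List[float]:
    """Stage (iii): exponential moving average that never looks ahead."""
    out, state = [], None
    for v in series:
        state = v if state is None else alpha * v + (1 - alpha) * state
        out.append(state)
    return out


def reconstruct(cells: List[List[Tuple[float, float]]]) -> List[float]:
    """Chain the three stages."""
    return causal_ema(causal_fill(aggregate(cells)))
```

Because every stage reads only current and past cells, changing a later cell cannot alter earlier outputs, which is the kind of property a causality-compliance test would check.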

2603.23547 2026-03-26 stat.ML cs.LG

PDGMM-VAE: A Variational Autoencoder with Adaptive Per-Dimension Gaussian Mixture Model Priors for Nonlinear ICA

Yuan-Hao Wei, Yan-Jie Sun

Abstract

Independent component analysis is a core framework within blind source separation for recovering latent source signals from observed mixtures under statistical independence assumptions. In this work, we propose PDGMM-VAE, a source-oriented variational autoencoder in which each latent dimension, interpreted explicitly as an individual source signal, is assigned its own Gaussian mixture model prior. Unlike conventional VAE formulations with a shared simple prior, the proposed framework imposes per-dimension heterogeneous prior constraints, enabling the model to capture diverse non-Gaussian source statistics and thereby promote source separation under a probabilistic encoder-decoder architecture. Importantly, the parameters of these per-dimension GMM priors are not fixed in advance, but are adaptively learned and automatically refined toward convergence together with the encoder and decoder parameters under the overall training objective. Within this formulation, the encoder serves as a demixing mapping from observations to latent sources, while the decoder reconstructs the observed mixtures from the inferred components. The proposed model provides a systematic study of an idea that had previously only been noted in our preliminary form, namely, equipping different latent sources with different GMM priors for ICA, and formulates it as a full VAE framework with end-to-end training and per-dimension prior learning. Experimental results on both linear and nonlinear mixing problems demonstrate that PDGMM-VAE can recover latent source signals and achieve satisfactory separation performance.

2603.16813 2026-03-26 stat.AP

Evolutionary Structural Shift in Security Screening Sensitivity within the U.S. Aviation Network: A 15-Year Longitudinal Bayesian Assessment (2010-2024)

Shuo Liu, John Mott

Abstract

This paper investigates the evolving causal mechanisms of flight delays in the U.S. domestic aviation network from 2010-2024. Utilizing a three-level hierarchical Bayesian model on Bureau of Transportation Statistics (BTS) on-time performance data, we decouple the marginal contribution factors of weather, national aviation system (NAS), security delays, and late-arriving aircraft, using carrier delays as the baseline reference. Our findings suggest a structural shift: during the pre-pandemic decade (2010-2019), security delays functioned as an operational stabilizer with negative causal leverage (beta approx -1.307). However, in the post-pandemic period, they shift to a statistically marginal effect (beta approx -0.130). While the total volume of security delays remains a marginal fraction of the overall system latency, this structural shift points toward a potential change in the operational sensitivity of the system to security-related frictions. We show that while causal neutralization is characteristic of high-volume hubs (n >= 100), a discernible directional shift into a positive delay driver (beta approx 0.118) is observed as the analysis scales down to include the broader network (n >= 30). Our model identifies a significant change in how security delays propagate through high-volume nodes, evolving from an internalized operational buffer into a statistically discernible contributor to delay probability in the post-pandemic era.

2603.03071 2026-03-26 quant-ph cs.LG hep-ex hep-ph stat.ML

From Reachability to Learnability: Geometric Design Principles for Quantum Neural Networks

Vishal S. Ngairangbam, Michael Spannowsky

Comments Added acknowledgements and corrected typos

Abstract

Classical deep networks are effective because depth enables adaptive geometric deformation of data representations. In quantum neural networks (QNNs), however, depth or state reachability alone does not guarantee this feature-learning capability. We study this question in the pure-state setting by viewing encoded data as an embedded manifold in $\mathbb{C}P^{2^n-1}$ and analysing infinitesimal unitary actions through Lie-algebra directions. We introduce Classical-to-Lie-algebra (CLA) maps and the criterion of almost Complete Local Selectivity (aCLS), which combines directional completeness with data-dependent local selectivity. Within this framework, we show that data-independent trainable unitaries are complete but non-selective, i.e. learnable rigid reorientations, whereas pure data encodings are selective but non-tunable, i.e. fixed deformations. Hence, geometric flexibility requires a non-trivial joint dependence on data and trainable weights. We further show that accessing high-dimensional deformations of many-qubit state manifolds requires parametrised entangling directions; fixed entanglers such as CNOT alone do not provide adaptive geometric control. Numerical examples validate that aCLS-satisfying data re-uploading models outperform non-tunable schemes while requiring only a quarter of the gate operations. Thus, the resulting picture reframes QNN design from state reachability to controllable geometry of hidden quantum representations.

2601.10006 2026-03-26 stat.AP

An Information-Theoretic Diagnostic Analytics Framework for Mapping Past-Future Dependence in Horizon-Specific Forecastability

Peter Maurice Catt

Abstract

In many systems, the true data-generating process is unknown, requiring forecasters to rely on observed time series. This study proposes a pre-modeling diagnostic framework for horizon-specific forecastability assessment that evaluates forecastability before model selection begins. Forecastability is operationalized using auto-mutual information at lag h, which quantifies how much past observations reduce uncertainty about future values, estimated via a k-nearest-neighbor estimator computed strictly on training data to preserve out-of-sample validity. The diagnostic signal is validated against realized out-of-sample symmetric mean absolute percentage error across 42,355 time series spanning six temporal frequencies, using benchmark and higher-capacity probe models under a rolling-origin protocol. The results reveal a strong frequency-dependent relationship between measurable dependence and realized forecast error: for five of six frequencies, auto-mutual information exhibits a consistent negative rank association with realized error, supporting its use as a forecast triage signal for modeling investment decisions, whereas daily series show weaker discrimination despite measurable dependence. Across all frequencies, median forecast error declines monotonically from low to high forecastability terciles, demonstrating clear decision-relevant separation. Overall, the findings establish measurable past-future dependence as a practical screening tool for analytics-driven forecasting strategy, identifying when advanced models are likely to add value, when simple baselines suffice, and when attention should shift from accuracy improvement to robust decision design, thereby supporting a diagnostic-first approach to modeling effort and resource allocation in organizational forecasting contexts.
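To make the notion of auto-mutual information at lag h concrete, the sketch below uses a closed-form special case rather than the k-nearest-neighbor estimator mentioned in the abstract. It assumes a stationary Gaussian AR(1) process with coefficient `phi`, for which the lag-h autocorrelation is `phi**h` and the mutual information of a jointly Gaussian pair with correlation rho is -0.5 * ln(1 - rho^2). The function names are illustrative only.

```python
import math


def gaussian_mutual_information(rho: float) -> float:
    """Mutual information in nats between two jointly Gaussian variables
    with correlation rho (requires |rho| < 1)."""
    return -0.5 * math.log(1.0 - rho ** 2)


def ar1_auto_mutual_information(phi: float, h: int) -> float:
    """Auto-mutual information at lag h for a stationary Gaussian AR(1)
    process with coefficient phi (requires |phi| < 1)."""
    return gaussian_mutual_information(phi ** h)


if __name__ == "__main__":
    for h in (1, 3, 6, 12):
        print(h, round(ar1_auto_mutual_information(0.9, h), 4))
```

In this toy setting the dependence between past and future shrinks as the horizon grows, which is the horizon-specific behaviour the diagnostic is meant to measure on real series.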

2511.18789 2026-03-26 cs.LG stat.ML

Perturbing the Derivative: Doubly Wild Refitting for Model-Free Evaluation of Opaque Machine Learning Predictors

Haichen Hu, David Simchi-Levi

Abstract

We study the problem of excess risk evaluation for empirical risk minimization (ERM) under convex losses. We show that by leveraging the idea of wild refitting, one can upper bound the excess risk through the so-called "wild optimism," without relying on the global structure of the underlying function class but only assuming black box access to the training algorithm and a single dataset. We begin by generating two sets of artificially modified pseudo-outcomes created by stochastically perturbing the derivatives with carefully chosen scaling. Using these pseudo-labeled datasets, we refit the black-box procedure twice to obtain two wild predictors and derive an efficient excess risk upper bound under the fixed design setting. Requiring no prior knowledge of the complexity of the underlying function class, our method is essentially model-free and holds significant promise for theoretically evaluating modern opaque deep neural networks and generative models, where traditional learning theory could be infeasible due to the extreme complexity of the hypothesis class.

2510.00533 2026-03-26 gr-qc astro-ph.IM physics.comp-ph stat.CO

Bayesian power spectral density estimation for LISA noise based on penalized splines with a parametric boost

Nazeela Aimen, Patricio Maturana-Russel, Avi Vajpeyi, Nelson Christensen, Renate Meyer

Journal ref Phys. Rev. D 113, 024022 (2026)
Abstract

Flexible and accurate noise characterization is crucial for the precise estimation of gravitational-wave parameters. We introduce a Bayesian method for estimating the power spectral density (PSD) of long, stationary time series, explicitly tailored for LISA data analysis. Our approach models the PSD as the geometric mean of a parametric and a nonparametric component, combining the knowledge from parametric models with the flexibility to capture deviations from theoretical expectations. The nonparametric component is expressed by a mixture of penalized B-splines. Adaptive, data-driven knot placement, performed once at initialization, removes the need for reversible-jump Markov chain Monte Carlo, while hierarchical roughness-penalty priors prevent overfitting. Validation on simulated autoregressive AR(4) data demonstrates estimator consistency and shows that well-matched parametric components reduce the integrated absolute error compared to an uninformative baseline, requiring fewer spline knots to achieve comparable accuracy. Applied to one year of simulated LISA X-channel (univariate) noise, our method achieves relative integrated absolute errors of $\mathcal{O}(10^{-2})$, making it suitable for iterative analysis pipelines and multi-year mission data sets.

2509.24140 2026-03-26 cs.LG stat.ML

A signal separation view of classification

H. N. Mhaskar, Ryan O'Dowd

Abstract

The problem of classification in machine learning has often been approached in terms of function approximation. In this paper, we propose an alternative approach for classification in arbitrary compact metric spaces which, in theory, yields both the number of classes and a perfect classification using a minimal number of queried labels. Our approach uses localized trigonometric polynomial kernels initially developed for the point source signal separation problem in signal processing. Rather than point sources, we argue that the various classes come from different probability measures. The localized kernel technique developed for separating point sources is then shown to separate the supports of these distributions. This is done in a hierarchical manner in our MASC algorithm to accommodate touching/overlapping class boundaries. We illustrate our theory on several simulated and real-life datasets, including the Salinas and Indian Pines hyperspectral datasets and a document dataset.

2508.01517 2026-03-26 math.ST math.PR stat.ML stat.TH

Central Limit Theorems for Transition Probabilities of Controlled Markov Chains

Ziwei Su, Imon Banerjee, Diego Klabjan

Comments 45 pages (main text 21 pages + appendix 24 pages)

Abstract

We develop a central limit theorem (CLT) for a non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build on it to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. As a corollary, we derive goodness-of-fit tests that make it possible to test whether the logged data is stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.
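A natural non-parametric estimator in the finite state-action setting is the empirical frequency of each transition. The sketch below assumes that count-based form, which the abstract does not spell out, and uses a hypothetical data layout of logged `(state, action, next_state)` triples.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple

Triple = Tuple[Hashable, Hashable, Hashable]


def estimate_transitions(log: Iterable[Triple]) -> Dict[tuple, Dict[Hashable, float]]:
    """Empirical transition probabilities P_hat(s' | s, a) = N(s, a, s') / N(s, a).

    State-action pairs never visited in the log get no estimate at all,
    which is why conditions on the logging policy matter.
    """
    pair_counts: Counter = Counter()
    triple_counts: Counter = Counter()
    for s, a, s_next in log:
        pair_counts[(s, a)] += 1
        triple_counts[(s, a, s_next)] += 1

    p_hat: Dict[tuple, Dict[Hashable, float]] = defaultdict(dict)
    for (s, a, s_next), n in triple_counts.items():
        p_hat[(s, a)][s_next] = n / pair_counts[(s, a)]
    return dict(p_hat)


if __name__ == "__main__":
    logged = [(0, "a", 1), (0, "a", 1), (0, "a", 0), (1, "b", 0)]
    print(estimate_transitions(logged))
```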

2507.23743 2026-03-26 stat.ME econ.EM

Relative Bias Under Imperfect Identification in Observational Causal Inference

Melody Huang, Cory McCartan

Comments 20 pages, 3 figures, plus references and appendices

Abstract

To conduct causal inference in observational settings, researchers must rely on certain identifying assumptions. In practice, these assumptions are unlikely to hold exactly. This paper considers the bias of selection-on-observables, instrumental variables, and proximal inference estimates under violations of their identifying assumptions. We develop bias expressions for IV and proximal inference that show how violations of their respective assumptions are amplified by any unmeasured confounding in the outcome variable. We propose a set of sensitivity tools that quantify the sensitivity of different identification strategies, and an augmented bias contour plot visualizes the relationship between these strategies. We argue that the act of choosing an identification strategy implicitly expresses a belief about the degree of violations that must be present in alternative identification strategies. Even when researchers intend to conduct an IV or proximal analysis, a sensitivity analysis comparing different identification strategies can help to better understand the implications of each set of assumptions. Throughout, we compare the different approaches on a re-analysis of the impact of state surveillance on the incidence of protest in Communist Poland.

2506.11232 2026-03-26 stat.ME

Regularized Estimation of the Loading Matrix in Factor Models for High-Dimensional Time Series

Xialu Liu, Xin Wang

Abstract

High-dimensional data analysis using traditional models suffers from overparameterization. Two types of techniques are commonly used to reduce the number of parameters: regularization and dimension reduction. In this project, we combine them by imposing a sparse factor structure and propose a regularized estimator to further reduce the number of parameters in factor models. A challenge limiting the widespread application of factor models is that factors are hard to interpret, as both factors and the loading matrix are unobserved. To address this, we introduce a penalty term when estimating the loading matrix to obtain a sparse estimate. As a result, each factor drives only a smaller subset of time series that exhibit the strongest correlation, improving factor interpretability. We investigate the theoretical properties of the proposed estimator and present simulation results confirming that our algorithm performs well. We apply our method to Hawaii tourism data.

2506.03395 2026-03-26 stat.AP

A Bayesian hierarchical model for methane emission source apportionment

William S. Daniels, Douglas W. Nychka, Dorit M. Hammerling

Abstract

Reducing methane emissions from the oil and gas sector is a key component of short-term climate action. Emission reduction efforts are often conducted at the individual site level, where being able to apportion emissions among a finite number of potentially emitting pieces of equipment is necessary for leak detection and repair as well as regulatory reporting of annualized emissions. We present a hierarchical Bayesian model, referred to as the multisource detection, localization, and quantification (MDLQ) model, for performing source apportionment on oil and gas sites using methane measurements from point sensor networks. The MDLQ model accounts for autocorrelation in the sensor data and enforces sparsity in the emission rate estimates via a spike-and-slab prior, as oil and gas equipment often emit intermittently. We use the MDLQ model to apportion methane emissions on an experimental oil and gas site designed to release methane in known quantities, providing a means of model evaluation. Data from this experiment are unique in their size (i.e., the number of controlled releases) and in their close approximation of emission characteristics on real oil and gas sites. As such, this study provides a baseline level of apportionment accuracy that can be expected when using point sensor networks on operational sites.

2502.10328 2026-03-26 stat.ML cs.LG

Accelerated Parallel Tempering via Neural Transports

Leo Zhang, Peter Potaptchik, Jiajun He, Yuanqi Du, Arnaud Doucet, Francisco Vargas, Hai-Dang Dau, Saifuddin Syed

Comments Camera-ready version for ICLR 2026

Abstract

Markov Chain Monte Carlo (MCMC) algorithms are essential tools in computational statistics for sampling from unnormalised probability distributions, but can be fragile when targeting high-dimensional, multimodal, or complex target distributions. Parallel Tempering (PT) enhances MCMC's sample efficiency through annealing and parallel computation, propagating samples from tractable reference distributions to intractable targets via state swapping across interpolating distributions. The effectiveness of PT is limited by the often minimal overlap between adjacent distributions in challenging problems, which requires increasing the computational resources to compensate. We introduce a framework that accelerates PT by leveraging neural samplers -- including normalising flows, diffusion models, and controlled diffusions -- to reduce the required overlap. Our approach utilises neural samplers in parallel, circumventing the computational burden of neural samplers while preserving the asymptotic consistency of classical PT. We demonstrate theoretically and empirically on a variety of multimodal sampling problems that our method improves sample quality, reduces the computational cost compared to classical PT, and enables efficient free energy/normalising constant estimation.
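For orientation, the state-swapping step of classical PT can be written in a few lines. The sketch assumes the textbook setting in which chain k targets a density proportional to exp(-beta_k * U(x)) for an energy function U; this is a simplification of the reference-to-target annealing path described in the abstract, and it does not include the neural transports the paper proposes. The function name is illustrative.

```python
import math


def swap_accept_prob(beta_i: float, beta_j: float, u_i: float, u_j: float) -> float:
    """Metropolis acceptance probability for exchanging the states of chains i and j.

    u_i and u_j are the energies U(x_i) and U(x_j) of the current states.
    The ratio pi_i(x_j) * pi_j(x_i) / (pi_i(x_i) * pi_j(x_j)) simplifies to
    exp((beta_i - beta_j) * (u_i - u_j)).
    """
    log_ratio = (beta_i - beta_j) * (u_i - u_j)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


if __name__ == "__main__":
    # The colder chain (larger beta) currently holds the higher-energy state,
    # so the swap is always accepted.
    print(swap_accept_prob(1.0, 0.5, u_i=2.0, u_j=1.0))
    # The reverse configuration is accepted only with probability exp(-0.5).
    print(swap_accept_prob(1.0, 0.5, u_i=1.0, u_j=2.0))
```

When adjacent distributions overlap little, the energies seen by neighbouring chains differ sharply and this probability collapses toward zero, which is the limitation the abstract refers to.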

2409.16003 2026-03-26 stat.ME

Easy Conditioning far beyond Gaussian

Antoine Faul, David Ginsbourger, Ben Spycher

Comments 36 pages, 13 figures

Abstract

Multivariate Gaussian distributions enjoy Gaussian conditional distributions, which makes conditioning easy: conditioning boils down to implementing analytical formulae for conditional means and covariances. For more general distributions, however, conditional distributions may not be available in analytical form and require demanding and approximate numerical approaches. Primarily motivated by probabilistic imputation problems, we review and discuss families of multivariate distributions that do enjoy analytical conditioning, also providing a few counter-examples. Proving that trans-dimensional stability under conditioning extends to mixtures and transformations, we demonstrate that a broader class of multivariate distributions inherits easy conditioning properties. Building on this insight, we develop a generative method to estimate conditional distributions from data by first fitting a flexible joint distribution using copulas and then performing analytical conditioning in a latent space. In our applications, we specifically opt for Gaussian Mixture Copula Models (GMCM), comparing in turn various fitting strategies. Through simulations and real-world data experiments, we showcase the efficacy of our method in tasks involving conditional density estimation and data imputation. We also touch upon links to Gaussian process modelling and how stability under mixtures and transformations carries over to easy conditioning of non-Gaussian processes.
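The Gaussian baseline that this abstract starts from can be shown in its simplest, bivariate form. The sketch below applies the standard formulae for the conditional mean and variance of X1 given X2 = x2; the function name and argument layout are illustrative, and the copula-based method of the paper is not reproduced here.

```python
from typing import Tuple


def condition_bivariate_gaussian(
    mu1: float, mu2: float, var1: float, var2: float, cov12: float, x2: float
) -> Tuple[float, float]:
    """Mean and variance of X1 | X2 = x2 for a bivariate Gaussian.

    mean = mu1 + cov12 / var2 * (x2 - mu2)
    var  = var1 - cov12**2 / var2
    """
    cond_mean = mu1 + cov12 / var2 * (x2 - mu2)
    cond_var = var1 - cov12 ** 2 / var2
    return cond_mean, cond_var


if __name__ == "__main__":
    # Standard margins with correlation 0.5, observing x2 = 2.
    print(condition_bivariate_gaussian(0.0, 0.0, 1.0, 1.0, 0.5, 2.0))
```

The conditional variance does not depend on the observed value x2, a feature specific to the Gaussian case that more general families need not share.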

2408.13848 2026-03-26 math.ST stat.TH

Inference for Spiked Eigenstructure under Generalized Covariance and Correlation Models

Yanqing Yin, Wang Zhou

Comments This version is substantially revised from the original arXiv posting. The title has been changed, the manuscript has been reorganized, and a large portion of the technical material has been moved to the Supplementary Material

Abstract

In high-dimensional principal component analysis, important inferential targets include both leading spikes and the associated principal eigenspaces. Such problems arise naturally in high-dimensional factor models, where leading principal directions are interpreted as dominant loading directions and spike magnitudes reflect the strength of the corresponding common factors. We study inference based on the sample covariance matrix $\mathbf{S}$ and the sample correlation matrix $\widehat{\mathbf{R}}$ under generalized spiked models with arbitrary bulk spectrum. We establish almost sure limits and central limit theorems for spiked sample eigenvalues, and derive asymptotic distributions for functionals of sample spiked eigenspaces. Building on this theory, we develop procedures for one-sample inference for benchmark principal directions and for two-sample comparison of leading spike strengths across populations. Even in the covariance setting, our results substantially extend the existing literature by allowing a non-identity bulk structure. A real-data analysis on stock returns further illustrates the practical relevance of the proposed procedures, showing that covariance-based and correlation-based PCA can lead to markedly different conclusions.

2407.01111 2026-03-26 cs.LG cs.AI stat.ML

Proximity Matters: Local Proximity Enhanced Balancing for Treatment Effect Estimation

Hao Wang, Zhichao Chen, Zhaoran Liu, Xu Chen, Haoxuan Li, Zhouchen Lin

Comments Accepted as a poster in SIGKDD 2025

Abstract

Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-enhanced CounterFactual Regression (CFR-Pro) to exploit proximity for enhancing representation balancing within the HTE estimation context. Specifically, we introduce a pair-wise proximity regularizer based on optimal transport to incorporate the local proximity in discrepancy calculation. However, the curse of dimensionality renders the proximity measure and discrepancy estimation ineffective -- exacerbated by limited data availability for HTE estimation. To handle this problem, we further develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that CFR-Pro accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://github.com/HowardZJU/CFR-Pro.

2405.17669 2026-03-26 stat.ME math.ST stat.TH

Bayesian Nonparametrics for Principal Stratification with Continuous Post-Treatment Variables

Dafne Zorzetto, Antonio Canale, Fabrizia Mealli, Francesca Dominici, Falco J. Bargagli-Stoffi

Abstract

Principal stratification provides a causal inference framework for investigating treatment effects in the presence of a post-treatment variable. Principal strata play a key role in characterizing the treatment effect by identifying groups of units with the same or similar values for the potential post-treatment variable at all treatment levels. The literature has focused mainly on binary post-treatment variables; few papers have considered continuous ones. In the presence of a continuous post-treatment variable, a challenge is how to identify and characterize meaningful coarsenings of the latent principal strata that lead to interpretable principal causal effects. This paper introduces the Confounders-Aware SHared atoms BAyesian mixture (CASBAH), a novel approach for principal stratification with binary treatment and continuous post-treatment variables. CASBAH leverages Bayesian nonparametric priors with an innovative hierarchical structure for the potential post-treatment outcomes that overcomes some of the limitations of previous works. Specifically, the novel features of our method allow for (i) identifying coarsened principal strata through a data-adaptive approach and (ii) providing a comprehensive quantification of the uncertainty surrounding stratum membership. Through Monte Carlo simulations, we show that the proposed methodology performs better than existing methods in characterizing the principal strata and estimating principal effects of the treatment. Finally, CASBAH is applied to a case study in which we estimate the causal effects of US national air quality regulations on pollution levels and health outcomes.

2302.12728 2026-03-26 stat.ME

Principles of Conditionality and Layering of Error Rates with Application to Platform Trials

Xinping Cui, Emily Ouyang, Yi Liu, Jingjing Yan Schneider, Hong Tian, Bushi Wang, Jason C. Hsu

Abstract

There has been a misconception that only one type of error rate control is necessary in clinical trials, leading to debates over whether to prioritize Familywise Error Rate (FWER) or False Discovery Rate (FDR). This misconception has led to misleading statements about FWER control and proposals to shift towards FDR control, which could be manipulated by the industry. In reality, since the early 2000s, biopharmaceutical statistics have implicitly applied two layers of Type I error rate control. This aligns with Tukey's 1953 invention of Error Rate per Family (ERpF) for controlling error across studies, while FWER applies within each study. Our paper clarifies this layering, using Platform trials to demonstrate the verifiable conditions needed across studies for the FDA to fulfill its regulatory mission. We show that controlling FWER within a study at $5\%$ inherently controls ERpF across studies at 5-per-100, regardless of study correlations. This supports current regulatory practices that protect public health while fostering innovation. We also address concerns about ERpF stability in Platform trials, where shared controls introduce dependencies. By applying the Conditionality Principle and utilizing an innovative Shiny app, we explore how correlations impact ERpF variability, providing deeper insights for informed decision-making. Our findings, supported by principles like Layering of Error Rate Controls and the Conditionality Principle, are particularly relevant as Platform trials gain popularity for their efficiency in testing multiple treatments simultaneously.
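The claim that within-study FWER control at 5% bounds the across-study rate at 5-per-100 regardless of correlations can be seen from linearity of expectation. The derivation below is a sketch under one assumption that the abstract does not state explicitly: ERpF across studies is read as the expected number of studies containing at least one false rejection. The symbols $V_i$, $m$, and $\alpha$ are introduced here for illustration.

```latex
% V_i = 1 if study i makes at least one Type I error, and 0 otherwise, for i = 1, ..., m.
% Within-study FWER control at level alpha means P(V_i = 1) <= alpha for every i.
\[
  \mathrm{ERpF}
  = \mathbb{E}\!\left[\sum_{i=1}^{m} V_i\right]
  = \sum_{i=1}^{m} \Pr(V_i = 1)
  \le m\,\alpha .
\]
% With alpha = 0.05 and m = 100 studies: ERpF <= 100 * 0.05 = 5.
% Linearity of expectation requires no independence among V_1, ..., V_m.
```

Dependence between studies, such as that induced by shared controls, leaves this expected count unchanged and affects only its variability.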

2210.11039 2026-03-26 cs.LG cs.AI stat.ML

Entire Space Counterfactual Learning for Reliable Content Recommendations

Hao Wang, Zhichao Chen, Zhaoran Liu, Haozhe Li, Degui Yang, Xinggao Liu, Haoxuan Li

Comments This submission is an extension of arXiv:2204.05125

Abstract

Post-click conversion rate (CVR) estimation is a fundamental task in developing effective recommender systems, yet it faces challenges from data sparsity and sample selection bias. To handle both challenges, the entire space multitask models are employed to decompose the user behavior track into a sequence of exposure $\rightarrow$ click $\rightarrow$ conversion, constructing surrogate learning tasks for CVR estimation. However, these methods suffer from two significant defects: (1) intrinsic estimation bias (IEB), where the CVR estimates are higher than the actual values; (2) false independence prior (FIP), where the causal relationship between clicks and subsequent conversions is potentially overlooked. To overcome these limitations, we develop a model-agnostic framework, namely Entire Space Counterfactual Multitask Model (ESCM$^2$), which incorporates a counterfactual risk minimizer within the ESMM framework to regularize CVR estimation. Experiments conducted on large-scale industrial recommendation datasets and an online industrial recommendation service demonstrate that ESCM$^2$ effectively mitigates IEB and FIP defects and substantially enhances recommendation performance.

2002.12586 2026-03-26 stat.ME

Nonparametric Empirical Bayes Estimation on Heterogeneous Data

Trambak Banerjee, Luella J. Fu, Gareth M. James, Gourab Mukherjee, Wenguang Sun

Comments Proof of Theorem 1 revised

Abstract

The simultaneous estimation of many parameters based on data collected from corresponding studies is a key research problem that has received renewed attention in the high-dimensional setting. Many practical situations involve heterogeneous data where heterogeneity is captured by a nuisance parameter. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the "Nonparametric Empirical Bayes Structural Tweedie" (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie's formula. For the normal means problem, NEST simultaneously handles the two main selection biases introduced by heterogeneity: first, the selection bias in the mean, which cannot be effectively corrected without also correcting for the second, the selection bias in the variance. We develop theory to show that NEST is asymptotically as good as the optimal Bayes rule that uniquely minimizes a weighted squared error loss. In our simulation studies, NEST outperforms competing methods, with substantial efficiency gains in many settings. The proposed method is demonstrated on estimating the batting averages of baseball players and Sharpe ratios of mutual fund returns. Extensions to other members of the two-parameter exponential family are discussed.
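As background for the generalized version used by NEST, the classical Tweedie formula for a normal observation with known variance states that E[theta | x] = x + sigma^2 * d/dx log f(x), where f is the marginal density of x. The sketch below checks it in a case where the answer is known in closed form: a N(0, tau^2) prior, for which the marginal is N(0, tau^2 + sigma^2). It illustrates the homogeneous-variance formula only, not the NEST estimator, and the function names are hypothetical.

```python
from typing import Callable


def tweedie_posterior_mean(x: float, sigma2: float, score: Callable[[float], float]) -> float:
    """Classical Tweedie formula: E[theta | x] = x + sigma2 * d/dx log f(x).

    `score` is the derivative of the log marginal density of x.
    """
    return x + sigma2 * score(x)


def normal_marginal_score(tau2: float, sigma2: float) -> Callable[[float], float]:
    """Score of the N(0, tau2 + sigma2) marginal that arises when theta ~ N(0, tau2)."""
    return lambda x: -x / (tau2 + sigma2)


if __name__ == "__main__":
    sigma2, tau2, x = 1.0, 3.0, 2.0
    estimate = tweedie_posterior_mean(x, sigma2, normal_marginal_score(tau2, sigma2))
    # Agrees with the usual linear shrinkage tau2 / (tau2 + sigma2) * x.
    print(estimate, tau2 / (tau2 + sigma2) * x)
```

In a nonparametric empirical Bayes setting the score is not known and has to be estimated from the data, and with heterogeneous variances a single shared marginal no longer suffices, which is the situation the abstract addresses.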