arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.00806 2026-05-04 stat.ME stat.AP

High-Dimensional Multivariate VAR Estimation with Spatio-Temporal Structure

Peiliang Bai

Abstract

High-dimensional vector autoregressive (VAR) models provide a flexible framework for characterizing dynamic dependence in multivariate spatio-temporal systems, but their unrestricted estimation becomes infeasible when multiple variables are observed over many spatial locations. This paper develops a structured estimation procedure for high-dimensional multivariate VAR processes that explicitly incorporates spatial information. We decompose each block transition matrix into a cross-variable dependence coefficient and a spatial transition matrix, and constrain the spatial transition matrices through a pre-specified spatial graph. The resulting estimator is formulated as a weighted $\ell_1$-regularized least-squares problem, where the weights encode spatial proximity or topological similarity and induce stronger shrinkage on spatially implausible interactions. Since the objective function is bi-convex, we estimate the cross-variable dependence matrix and the spatial transition matrices through an alternating convex-search algorithm implemented with ADMM. Under stability and restricted-eigenvalue-type conditions for high-dimensional VAR processes, we establish convergence to a blockwise stationary point in the subgradient sense and derive high-probability estimation error bounds for both components of the model. Simulation studies demonstrate that the proposed estimator accurately recovers sparse transition structures and improves over existing two-step $\ell_1$-regularized methods in support recovery and estimation accuracy. An application to North American climate data illustrates that the method recovers interpretable variable-dependence networks and spatial interaction patterns across different climate regions.

2605.00786 2026-05-04 stat.ME math.PR stat.ML

Recursive Maximum Likelihood Estimation for Interacting Particle Systems using Virtual Particles

Louis Sharrock, Nikolas Kantas, Grigorios A. Pavliotis

Comments arXiv admin note: text overlap with arXiv:2602.20875

Abstract

We study recursive maximum likelihood estimation for stochastic interacting particle systems based on continuous observation of a single particle. In this regime, consistent estimation of the finite-particle log-likelihood is not possible, even in the limit as the number of particles $N\rightarrow\infty$ and the time horizon $t\rightarrow\infty$. We thus seek to optimise the stationary log-likelihood of the limiting mean-field system. We achieve this via a form of stochastic gradient estimate in continuous time, with stochastic gradient estimates computed using the continuous trajectory of the single observed particle, alongside a virtual interacting particle system and a virtual tangent interacting particle system, which are integrated with the online parameter estimate. For fixed numbers of real and virtual particles, we show that the resulting algorithms drive the gradient of a finite-particle surrogate objective to zero as $t\to\infty$. We then prove that, in the iterated limit $t\to\infty$ followed by $N,M\to\infty$, these surrogate gradients converge uniformly to the gradient of the stationary log-likelihood of the limiting mean-field system, yielding convergence to its stationary points. We illustrate the method on several numerical examples, including a model with quadratic confinement and interaction potentials, a model of interacting FitzHugh--Nagumo neurons, and a stochastic Kuramoto model.

2605.00779 2026-05-04 stat.ME

Evaluating the performance of GCM trajectories using Weather Type frequencies for persistence and transitions: the Iberian Peninsula and Lamb classification

Elsa Barrio-Torres, Swen Brands, Jesús Asín, Jesús Abaurrea, Zeus Gracia-Tabuenca

Abstract

This study evaluates the performance of 36 historical CMIP6 GCM trajectories (1979-2005) in reproducing atmospheric circulation over the Iberian Peninsula in the summer months (June-September) using the Lamb Weather Type (WT) classification scheme. Using ERA5 reanalysis as the observational reference, we introduce a methodological framework, applicable to any region worldwide, to evaluate GCM performance. This approach extends traditional daily frequency analysis by evaluating both the daily frequency distribution of WTs and their 24-hour dynamic evolution (i.e., transition probabilities and persistence). Model performance is quantified using the Overlap coefficient. A filtering process is applied where only trajectories that successfully reproduce both daily and conditional distributions with a minimum Overlap threshold $t_{sim}$ across a set number of grid points are retained. The findings show that while several models can adequately reproduce daily WT frequencies (16 out of 36), some struggle to capture day-to-day atmospheric transitions. This leads to a final selection of 12 trajectories over the Iberian Peninsula. Model performance across the region is then evaluated using integrated metrics assessing daily reproduction, conditional reproduction, and transition dynamics. Overall, models from the ec earth3 family, specifically the ec earth3 aerchem trajectory, exhibit the best and most consistent performance across the region. Additionally, the results highlight a geographical performance gap: while models generally represent circulation well in the northwest, they face significant challenges in the central and southern Mediterranean regions of the Peninsula. Ultimately, this study establishes that assessing WT persistence and transitions provides a far more discriminative, objective tool for GCM selection than evaluating daily distributions alone.
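
As a rough illustration of the Overlap coefficient mentioned in this abstract: for two discrete frequency distributions over the same categories, a common definition is the sum of the category-wise minima. The sketch below assumes that definition. The function name, the weather-type labels, and the frequencies are hypothetical and are not taken from the paper.

```python
def overlap_coefficient(p, q):
    """Overlap of two discrete distributions given as {category: probability} dicts.

    Assumed definition: sum over categories of min(p_c, q_c); 1.0 means identical.
    """
    categories = set(p) | set(q)
    return sum(min(p.get(c, 0.0), q.get(c, 0.0)) for c in categories)


# Hypothetical relative frequencies of three weather types (illustrative only).
reference = {"A": 0.5, "C": 0.3, "W": 0.2}   # stands in for a reanalysis
model = {"A": 0.4, "C": 0.3, "W": 0.3}       # stands in for one GCM trajectory

score = overlap_coefficient(reference, model)
```

A trajectory would then be retained only if such a score exceeds a chosen threshold, in the spirit of the filtering step the abstract describes.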

2605.00771 2026-05-04 econ.EM math.ST stat.TH

Penalized Likelihood for Dyadic Network Formation Models with Degree Heterogeneity

Zizhong Yan, Jingrong Li, Yi Zhang

Abstract

Estimating network formation models with degree heterogeneity raises two problems in empirical networks. First, agents that send no links, receive no links, or link to all remaining agents can make the fixed-effects MLE fail to exist. Trimming these agents changes the estimation sample and induces selection bias. Second, the incidental-parameter problem biases common parameters and average partial effects. We resolve both issues through a penalized likelihood approach. Our leading specification is a directed network model with reciprocity, nesting the standard undirected and non-reciprocal directed models. The penalty guarantees finite-sample existence and yields bias corrections for coefficients and partial effects. We establish asymptotic results without imposing compactness on the fixed effects. Allowing the fixed effects to diverge at a logarithmic rate, our asymptotic framework accommodates the degree sparsity ubiquitous in large empirical networks. A global trade application demonstrates that our estimator avoids selection bias and recovers robust parameters where conventional methods fail.

2605.00765 2026-05-04 stat.ME

Efficient Longitudinal Function-on-Function Regression

Leif Verace, Siobhan McMahon, Erjia Cui

Abstract

We propose a computationally efficient inferential procedure for longitudinal function-on-function regression. The method follows a marginal three-step approach: (1) fit massive pointwise longitudinal scalar-on-function regression models, (2) smooth the resulting estimates along the bivariate functional domain, and (3) compute confidence bands using either an analytic approach for Gaussian data or a cluster bootstrap for Gaussian or non-Gaussian data. Simulation studies demonstrate that the proposed method achieves accurate estimation and valid inference, while substantially reducing computational burden compared to existing approaches. Methods are motivated by a physical activity intervention trial in older adults where high-dimensional wearable data were collected longitudinally across multiple visits. Our applications reveal significant increases in physical activity in the morning using interpersonal intervention strategies, but not intrapersonal strategies. The proposed methods are implemented in an R package.

2605.00750 2026-05-04 stat.OT math.PR math.ST nlin.AO stat.TH

Quenched Amplification and Tail Shaping in Networked Systems with Memory and Regime Switching

Mauricio Herrera-Marín

Abstract

Networked systems operating under intermittent adverse conditions and long memory can remain stable on average while exhibiting rare but extreme trajectory-level excursions. We study linear regime-switching network dynamics with Volterra-type memory, formulated through a finite-dimensional lifted ordinary differential equation embedding. Despite finite-horizon annealed boundedness, we show that quenched amplification emerges generically from the interaction of regime persistence, memory accumulation, and non-normal lifted operator geometry. A lower bound on burst-size distributions reveals power-law tails whose exponent is determined by the ratio between unfavorable dwell-time rates and an operator-defined instantaneous growth parameter. This parameter is computable online via the Euclidean logarithmic norm of the lifted operator, yielding a practical early-warning indicator. Building on this structure, we introduce a dynamic data-driven intervention strategy that enforces contraction on demand along rare amplification channels, thereby shaping or truncating tail risk without altering exogenous regime statistics or typical system behavior. The results provide a geometrically grounded and operationally actionable framework for understanding and mitigating extreme events in memory-driven regime-switching systems.
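
The Euclidean logarithmic norm mentioned in this abstract has a standard closed form: it is the largest eigenvalue of the symmetric part of the operator. The sketch below computes it for a small non-normal matrix. The matrix and the function name are illustrative assumptions, not the lifted operator from the paper.

```python
import numpy as np


def log_norm_2(A):
    """Euclidean logarithmic norm: largest eigenvalue of (A + A^T) / 2."""
    sym = 0.5 * (A + A.T)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(sym)[-1])


# Illustrative non-normal matrix: both eigenvalues are -1 (asymptotically stable),
# yet the large off-diagonal coupling permits transient growth.
A = np.array([[-1.0, 10.0],
              [0.0, -1.0]])

spectral_abscissa = float(np.max(np.linalg.eigvals(A).real))
mu = log_norm_2(A)
```

Here the spectral abscissa is negative while the logarithmic norm is positive, which is the kind of gap between long-run stability and instantaneous growth that an online indicator of this type would flag.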

2605.00740 2026-05-04 math.OC cs.LG stat.ML

Randomized Subspace Nesterov Accelerated Gradient

Gaku Omiya, Pierre-Louis Poirion, Akiko Takeda

Comments 50 pages

Abstract

Randomized-subspace methods reduce the cost of first-order optimization by using only low-dimensional projected-gradient information, a feature that is attractive in forward-mode automatic differentiation and communication-limited settings. While Nesterov acceleration is well understood for full-gradient and coordinate-based methods, obtaining accelerated methods for general subspace sketches that use only projected-gradient information and can improve over full-dimensional Nesterov acceleration in oracle complexity is technically nontrivial. We develop randomized-subspace Nesterov accelerated gradient methods for smooth convex and smooth strongly convex optimization under matrix smoothness and generic sketch moment assumptions. The key technical ingredient is a three-sequence formulation tailored to matrix smoothness, which recovers the corresponding classical Nesterov methods in the full-dimensional case. The resulting theory establishes accelerated oracle-complexity guarantees and makes explicit how matrix smoothness and the sketch distribution enter the complexity. It also provides a unified basis for comparing sketch families and identifying when randomized-subspace acceleration improves over full-dimensional Nesterov acceleration in oracle complexity.

2604.27772 2026-05-04 stat.ME

Single-Observation Uniformity Testing under Increasing Precision via Lacunary Harmonic

Davide Ferrari

Comments 31 pages, 3 figures, 1 table

Abstract

A test of uniformity on [0,1] is developed for the setting of a single observation recorded with sufficient precision. Although consistency against general alternatives is not attainable with only one draw in the classical large-sample sense, a multiscale harmonic digit expansion provides a framework for structured inference. By aggregating trigonometric components across digit scales at Hadamard-gap frequencies, a quadratic test statistic is constructed whose null distribution converges to a chi-square law via a lacunary central limit theorem. Under departures from uniformity, the statistic is driven by Fourier components induced by digit-scale transformations of the observation, with detectability depending on their coherent accumulation as precision increases. The resulting procedure detects multiscale harmonic structure that remains invisible to classical digit-frequency methods.

2605.00729 2026-05-04 math.ST math.PR nlin.AO stat.OT stat.TH

Intermittency induced by long memory under stochastic regime switching

Mauricio Herrera-Marín

Abstract

We study a fundamental instability mechanism in nonlinear, nonlocal dynamical systems arising from the interaction of long-range memory and stochastic regime switching. The dynamics are governed by network-coupled, operator-valued Volterra evolutions with completely monotone memory kernels whose excitation operators and kernel parameters are modulated by an ergodic finite-state continuous-time Markov chain. We formalize a sharp separation between annealed stability (in expectation) and quenched behaviour (along typical sample paths). On the annealed side, we identify an averaged memory gain that yields uniform moment bounds and a memory-adapted Lyapunov functional implying mean-square control under an averaged subcriticality condition. On the quenched side, we show that rare but persistent excursions into supercritical regimes are amplified by memory, producing intermittent macroscopic bursts with heavy-tailed statistics and a deterministic almost sure growth exponent obtained via a subadditive ergodic argument. This establishes an annealed--quenched dichotomy specific to non-Markovian switching systems, where stability in expectation can coexist with pathwise growth and metastable burst phases. We further derive a micro--macro correspondence by proving that a population of regime-modulated self-exciting point processes converges, both annealed and quenched, to the random-coefficient Volterra limit, transferring the burst mechanism from microscopic branching dynamics to macroscopic long-memory flows. Numerical experiments illustrate how burst localization depends on graph geometry and on noncommuting excitation operators.

2605.00723 2026-05-04 stat.ML cs.LG math.PR

Decentralized Proximal Stochastic Gradient Langevin Dynamics

Mohammad Rafiqul Islam, Lingjiong Zhu

Comments 42 pages, 7 figures

Abstract

We propose Decentralized Proximal Stochastic Gradient Langevin Dynamics (DE-PSGLD), a decentralized Markov chain Monte Carlo (MCMC) algorithm for sampling from a log-concave probability distribution constrained to a convex domain. Constraints are enforced through a shared proximal regularization based on the Moreau-Yosida envelope, enabling unconstrained updates while preserving consistency with the target constrained posterior. We establish non-asymptotic convergence guarantees in the 2-Wasserstein distance for both individual agent iterates and their network averages. Our analysis shows that DE-PSGLD converges to a regularized Gibbs distribution and quantifies the bias introduced by the proximal approximation. We evaluate DE-PSGLD for different sampling problems on synthetic and real datasets. As the first decentralized approach for constrained domains, our algorithm exhibits fast posterior concentration and high predictive accuracy.
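
To make the Moreau-Yosida idea in this abstract concrete: for the indicator of a convex set, the envelope with parameter `lam` is the squared distance to the set divided by `2 * lam`, and its gradient is `(x - proj(x)) / lam`. The sketch below uses that gradient in a plain single-agent Langevin step. It is not DE-PSGLD: there is no network, no gradient noise, and the box domain, Gaussian potential, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d."""
    return np.clip(x, lo, hi)


def my_envelope_grad(x, lo, hi, lam):
    """Gradient of the Moreau-Yosida envelope of the box indicator."""
    return (x - project_box(x, lo, hi)) / lam


def grad_f(x):
    """Gradient of the assumed potential f(x) = ||x||^2 / 2."""
    return x


def langevin_step(x, lo, hi, lam, step, rng):
    """One unconstrained Langevin update on the regularized potential."""
    drift = grad_f(x) + my_envelope_grad(x, lo, hi, lam)
    noise = rng.standard_normal(x.shape)
    return x - step * drift + np.sqrt(2.0 * step) * noise


rng = np.random.default_rng(0)
x = np.zeros(2)
samples = []
for _ in range(5000):
    x = langevin_step(x, lo=0.0, hi=1.0, lam=0.01, step=0.005, rng=rng)
    samples.append(x.copy())
samples = np.array(samples)
```

The penalty pulls iterates back toward the box without ever projecting them, so the chain targets a smoothed version of the constrained distribution. That smoothing is the source of the bias the abstract says is quantified.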

2605.00709 2026-05-04 math.ST econ.EM stat.TH

Bootstrap Inference under General Two-way Clustering with Serially and Spatially Dependent Common Effects

Ulrich Hounyo, Jiahao Lin

Abstract

This paper develops bootstrap procedures for inference in linear regression models with two-way clustered data. We characterize the estimator's asymptotic behavior in five mutually exclusive and exhaustive regimes: three Gaussian and two non-Gaussian. We establish four impossibility results: heterogeneous score components preclude uniform consistency; uniform consistency also fails in one non-Gaussian (infeasible) regime; the infeasible regime is not uniformly distinguishable from a feasible one; and uniform validity over all feasible regimes rules out uniform conservativeness over the infeasible regime. To address the feasible regimes, we propose a data-driven regime classifier and a projection-based wild bootstrap procedure. The procedure delivers uniformly valid inference across the four feasible regimes while allowing serial dependence along the second clustering dimension and spatial dependence along the first. This combination of regime adaptivity and flexible dependence is new to the two-way clustering literature. Monte Carlo simulations confirm the accuracy and flexibility of the proposed methods in settings with complex clustering structures.

2605.00668 2026-05-04 cs.IT math.IT stat.CO

SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass

Lucas H. McCabe, H. Howie Huang

Abstract

Discrete entropy estimation is a classic information theory problem, wherein the average information content of a discrete random variable is estimated from samples alone. Naive approaches, such as the plugin method, fail to account for the probability mass associated with members of the random variable's support that are unobserved in a given sample, known as the "missing mass." The resulting systemic underestimation is particularly problematic when data is time-consuming or costly to gather. We propose SENECA, an entropy estimation scheme based on a novel "self-consistent" missing mass calculation. Extensive numerical experiments indicate that our approach outperforms many state-of-the-art alternatives overall in the small-sample setting. We then apply SENECA to two practical use cases, namely biodiversity estimation and the detection of incorrect large language model responses, where our method is competitive with domain-specific approaches. Our work advances SENECA as an effective drop-in replacement for small-sample entropy estimation, with broad utility across several domains.
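
For context on the baseline this abstract criticizes, the sketch below implements the plugin entropy estimator together with the classical Good-Turing estimate of missing mass (the fraction of observations that are singletons). Good-Turing is swapped in here purely as a familiar illustration. It is not SENECA's self-consistent calculation, which the abstract does not specify. The function names and the toy sample are assumptions.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plug-in entropy in nats, computed from empirical frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_missing_mass(samples):
    """Good-Turing missing-mass estimate: (# symbols seen exactly once) / n."""
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)


# Toy sample: "a" seen twice, "b" and "c" seen once each.
samples = ["a", "a", "b", "c"]
h = plugin_entropy(samples)
m0 = good_turing_missing_mass(samples)
```

A large missing-mass estimate signals that much of the support is unseen, which is the situation in which the plugin estimate tends to be too low.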

2605.00654 2026-05-04 cs.LG cs.AI math.OC stat.ML

Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

Andrzej Ruszczynski, Tiangang Zhang

Abstract

For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in a feature-based $Q$-learning method with multipattern $Q$-factor approximation and we prove a high-probability regret bound of $\mathcal{O}\big(H^2 N^H \sqrt{ K}\big)$, where $H$ is the horizon, $N$ is the mini-batch size, and $K$ is the number of episodes. We also propose an economical version of the $Q$-learning method that streamlines the policy evaluation (backward) step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.

2605.00598 2026-05-04 stat.ME

Sparse $K$-spatial-median clustering for high-dimensional data

Ping Zhao, Dan Zhuang, Long Feng

Abstract

We propose a robust clustering framework for high-dimensional data with heavy tails and a large fraction of irrelevant variables. The method replaces the mean updates of Lloyd's $K$-means with spatial medians to enhance robustness. For the assignment step, it admits either a Euclidean rule for computational simplicity or a robust Mahalanobis-type metric constructed from the spatial sign covariance matrix to account for heterogeneous scales and feature dependence. To handle the $p \gg n$ regime, we further introduce a simple hard feature-exclusion mechanism that removes weakly separating dimensions based on across-center dispersion, with the exclusion threshold selected automatically via a permutation-based Gap criterion. Simulation studies under correlated Gaussian and multivariate $t$ models demonstrate that the proposed approach provides competitive clustering accuracy and improved stability relative to $K$-means and sparse $K$-means baselines.
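
The center update described in this abstract replaces a mean with a spatial median, that is, the point minimizing the sum of Euclidean distances to the cluster members. One standard way to compute it is Weiszfeld's iteration, sketched below. Whether the paper uses Weiszfeld is not stated in the abstract, so the algorithm choice, the function name, and the toy data are assumptions.

```python
import numpy as np


def spatial_median(X, n_iter=200, tol=1e-10, eps=1e-12):
    """Weiszfeld iterations for the minimizer of the sum of Euclidean distances."""
    m = X.mean(axis=0)  # start from the ordinary mean
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(d, eps)  # guard against division by zero
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m


# Four points on the unit square plus one gross outlier (illustrative data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]])
center_mean = X.mean(axis=0)
center_median = spatial_median(X)
```

On this toy data the mean is dragged far outside the unit square by the single outlier, while the spatial median stays inside it, which is the robustness property the method relies on.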

2605.00581 2026-05-04 stat.ML cs.LG math.OC

Gradient Regularized Newton Boosting Trees with Global Convergence

Nikita Zozoulenko, Daniel Falkowski, Thomas Cass, Lukas Gonon

Abstract

Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations like XGBoost, LightGBM, and CatBoost being based on Newton boosting: a second-order descent step in the space of decision trees. Despite its empirical success, the global convergence of Newton boosting is poorly understood compared to first-order boosting. In this paper, we introduce Restricted Newton Descent, which studies convex optimization with Newton's method on Hilbert spaces with inexact iterates, based on the concepts of cosine angle and weak gradient edge. Within this framework, we recover Newton boosting with GBDTs and classical finite-dimensional theory as special cases. We first prove that vanilla Newton boosting achieves a linear rate of convergence for smooth, strongly convex losses that satisfy a Hessian-dominance condition. To handle general convex losses with Lipschitz Hessians, we extend a recent gradient regularized Newton scheme to the restricted weak learner setting. This scheme minimally modifies the classical algorithm by introducing an adaptive $\ell_2$-regularization term proportional to the square root of the gradient norm at each iteration. We establish an $\mathcal{O}(\frac{1}{k^2})$ rate for this scheme, thereby obtaining a globally convergent second-order GBDT algorithm with a rate matching that of first-order boosting with Nesterov momentum. In numerical experiments, we show that our scheme converges while vanilla Newton boosting may diverge.
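
The regularization described in this abstract can be illustrated in finite dimensions, without trees or weak learners. The sketch below minimizes the convex function f(x) = sum_i sqrt(1 + x_i^2), for which the plain Newton update reduces to x -> -x^3 and diverges whenever |x_i| > 1. Adding a term proportional to the square root of the gradient norm to the Hessian restores convergence. The test function, the constant `c`, and the starting point are assumptions for illustration only.

```python
import numpy as np


def grad(x):
    """Gradient of f(x) = sum_i sqrt(1 + x_i^2)."""
    return x / np.sqrt(1.0 + x**2)


def hess_diag(x):
    """Diagonal of the Hessian of f (the Hessian is diagonal for this f)."""
    return (1.0 + x**2) ** -1.5


def newton_step(x, c):
    """Newton step with Hessian shifted by c * sqrt(||grad||); c = 0 is plain Newton."""
    g = grad(x)
    lam = c * np.sqrt(np.linalg.norm(g))
    # Elementwise division is valid because the Hessian is diagonal.
    return x - g / (hess_diag(x) + lam)


x_plain = np.array([2.0, -1.5])
x_reg = x_plain.copy()
for _ in range(5):
    x_plain = newton_step(x_plain, c=0.0)   # unregularized: blows up
for _ in range(50):
    x_reg = newton_step(x_reg, c=1.0)       # regularized: approaches the minimizer 0
```

The shift vanishes as the gradient goes to zero, so near the solution the update behaves almost like a pure Newton step.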

2605.00534 2026-05-04 stat.ME

Estimating Treatment and Spillover Effects with the Ego-Cluster Experimental Design

Xiao Liu, Feifang Hu, Jingfei Zhang

Abstract

Network interference occurs when a unit's outcome depends not only on its own treatment but also on the treatments received by connected units in the network. Experimental designs and analysis methods that ignore such interference can yield biased estimators of causal effects. In this paper, we develop a new experimental design for the estimation and inference of global treatment effect and spillover effect under a model-based framework and ego-cluster randomization. Under this design, the network is partitioned into a collection of ego-clusters, each consisting of a focal unit (the ego) and its network neighbors (the alters), with randomization conducted at the cluster level. We propose model-based estimators for the global treatment effect and spillover effect and establish their consistency and asymptotic normality, with asymptotic variances determined by the ego-cluster structure. Building on these theoretical results, we introduce an ego-clustering algorithm that sequentially selects egos and assigns alters to minimize asymptotic variances. Simulation studies and two empirical applications demonstrate that the proposed procedure yields accurate inference and efficiency improvements over existing network experimental designs.

2605.00533 2026-05-04 math.PR math-ph math.MP math.ST stat.TH

Royen's proof of the Gaussian correlation inequality as a supersymmetric dimensional reduction

Yichao Huang

Comments 13 pages

Abstract

We revisit Royen's proof of the Gaussian correlation inequality from a supersymmetric point of view. Many key elements in Royen's proof of this inequality have natural geometric interpretations in terms of supersymmetric dimensional reduction from $\mathbb{R}^{3|2}$ to $\mathbb{R}^{1|0}$. In particular, the auxiliary multivariate Gamma distributions appearing in Royen's Laplace-transform argument arise naturally as the body of a supersymmetric radial variable on $\mathbb{R}^{3|2}$. The generalization to the half-integer multivariate Gamma case also follows naturally as a dimensional reduction from $\mathbb{R}^{k+2|2}$ to $\mathbb{R}^{k|0}$. This provides an example in which the supersymmetric localization method is applied to prove correlation inequalities with continuous parameters.

2605.00470 2026-05-04 stat.ME stat.AP

Robust spatial scalar-on-function regression: A Fisher-consistent redescending M-estimation approach

Muge Mutis, Ufuk Beyaztas, Han Lin Shang

Comments 51 pages, 7 figures, 6 tables

Abstract

We develop a Fisher-consistent redescending robust estimator for the spatial scalar-on-function regression model, where a scalar response depends on both a functional predictor and a spatial autoregressive lag. Existing estimation procedures for this model are typically based on likelihood methods or monotone-loss robust M-estimators. They may be highly sensitive to vertical outliers, leverage points in the functional predictor, and numerical instability induced by strong spatial dependence. To address these issues, we propose a new estimation framework that first applies robust functional principal component analysis to obtain a contamination-resistant finite-dimensional representation of the functional predictor and then estimates the resulting spatial regression model through a bias-corrected system of M-estimating equations. The proposed method allows redescending loss functions, including Andrews' sine and Danish losses, and jointly estimates the regression coefficients, spatial dependence parameter, and scale parameter within a unified Fisher-consistent framework. For computation, we develop a hybrid IRLS-Newton algorithm that combines weighted least-squares updates for the regression parameters with a Newton-Raphson update for the spatial parameter. We establish Fisher consistency, consistency, asymptotic normality, and the asymptotic distribution of the reconstructed slope function. Monte Carlo experiments show that the proposed estimators remain competitive under clean data and substantially outperform classical and Huber-type robust competitors under contamination, particularly in severe outlier settings. An application to French air-quality data further demonstrates improved predictive performance and stable estimation of spatial dependence. Our method has been implemented in the fcsar R package.

2605.00467 2026-05-04 cs.LG stat.ML

Batch Normalization for Neural Networks on Complex Domains

Xuan Son Nguyen, Nistor Grozavu

Abstract

Riemannian neural networks have proven effective in solving a variety of machine learning tasks. The key to their success lies in the development of principled Riemannian analogs of fundamental building blocks in deep neural networks (DNNs). Among those, Riemannian batch normalization (BN) layers have been shown to enhance training stability and improve accuracy. In this paper, we propose BN layers for neural networks on complex domains. The proposed layers have close connections with existing Riemannian BN layers. We derive essential components for practical implementations of BN layers on some complex domains which are less studied in previous works, e.g., the Siegel disk domain. We conduct experiments on radar clutter classification, node classification, and action recognition demonstrating the efficacy of our method.

2605.00455 2026-05-04 stat.ME math.ST stat.ML stat.TH

Concentration and Calibration in Predictive Bayesian Inference

David T. Frazier, Hui Wang

Abstract

Predictive Bayesian inference (PBI) represents a model- and prior-agnostic approach to standard Bayesian inference which allows users to quantify uncertainty for a functional of interest only by specifying a forward predictive model for future unobserved data. The flexibility and generality of this framework have led to a host of novel algorithms for implementing this approach, and many empirical applications, yet the reliability of the resulting inferences for the underlying statistical functional of interest remains unclear. Herein, we demonstrate that when using PBI for a population functional of interest, the resulting posterior concentrates onto a well-defined quantity that explicitly depends on the forward predictive model used to implement the predictive recursion underlying the method. Furthermore, the forward predictive model entirely determines the uncertainty quantification produced in PBI. Consequently, our results show that if the predictive model does not capture all relevant features of the data, then, even in very simple examples, the coverage of predictive Bayes credible sets for the population value of the functional of interest can be arbitrarily close to zero. We carefully explain why this occurs, and show that this behavior is directly tied to the inaccuracy of the forward predictive model used to produce future observations within the PBI framework. As a consequence, our results imply that in order for PBI to deliver calibrated posterior inferences, the resulting predictive engine used to generate posterior samples must contain, in a well-defined sense, the true DGP, else inferences generated under this framework will not be calibrated.

2605.00428 2026-05-04 stat.ME cs.PF cs.SY eess.SY

How to Do Statistical Evaluations in ECE/CS Papers: A Practical Playbook for Defensible Results

Bhaskar Krishnamachari

Comments 30 pages, 8 figures; Tutorial paper; companion student workbook and Claude skill available as ancillary material

Abstract

Strong experimental papers in electrical and computer engineering and computer science (ECE/CS), especially in systems, networking, and applied machine learning, rest on more than a single impressive number. They rest on a chain of design, measurement, analysis, and validation choices that, taken together, make a result believable. This tutorial is a compact, example-driven guide to that chain for beginning researchers. We organize it as an evaluation workflow: claim, hypothesis, unit of analysis, baseline, regime sweep, uncertainty estimate, validation check, and reporting. Within that workflow we cover the classical statistical foundations (descriptive statistics, the central limit theorem, normal- and $t$-based confidence intervals, Student's $t$-test, ANOVA, chi-squared and Pearson correlation, linear regression) alongside the modern, distribution-free techniques (the bootstrap, Wilcoxon and Mann--Whitney tests, Cliff's delta) that are usually preferred for ECE/CS data. We also discuss factorial design, randomization and blocking, multiple-comparison correction, latency-specific pitfalls, simulation verification and validation, equivalence-style claims, and reproducibility. A running example, a comparison of two job-scheduling algorithms on simulated workloads with truncated heavy-tailed job sizes, threads through the tutorial, with Python snippets the reader can paste and adapt. The paper closes with a pre-submission checklist; companion student-facing material (project-type translation tables, an evaluation-plan worksheet, exercises, and a worked ``bad evaluation autopsy'') is collected in a separate workbook released alongside this paper.
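
Two of the distribution-free tools named in this abstract are short enough to sketch directly: Cliff's delta as an effect size and a percentile bootstrap confidence interval. The snippet below is not taken from the tutorial. The function names, default parameters, and latency numbers are hypothetical.

```python
import numpy as np


def cliffs_delta(x, y):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(x) * len(y))."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    return float((np.sum(x > y) - np.sum(x < y)) / (x.size * y.size))


def bootstrap_ci(data, stat=np.median, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of one sample."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = [stat(rng.choice(data, size=data.size, replace=True))
             for _ in range(n_boot)]
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Hypothetical job latencies (ms) under two schedulers, each with one heavy-tail value.
latency_a = [12, 15, 11, 30, 14, 13]
latency_b = [18, 22, 19, 45, 21, 20]

delta = cliffs_delta(latency_a, latency_b)
ci_low, ci_high = bootstrap_ci(latency_a)
```

A delta near -1 means scheduler A's latencies are almost always lower than B's. Neither quantity assumes normality, which is why such tools suit skewed latency data.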

2605.00398 2026-05-04 cs.LG physics.ao-ph stat.ML

M-CaStLe: Uncovering Local Causal Structures in Multivariate Space-Time Gridded Data

J. Jake Nichol, Michael Weylandt, G. Matthew Fricke, Jhayron Perez-Carrasquilla, Melanie E. Moses

Comments 19 pages and 6 figures in the main text; 33 pages and 11 figures total

Abstract

Causal graph discovery for space-time systems is challenging in high-dimensional gridded data, which often has many more grid cells than temporal observations per cell. The Causal Space-Time Stencil Learning (CaStLe) meta-algorithm was developed to address that niche under space-time locality and stationarity assumptions, but it is currently limited to univariate analyses. In this work, we present M-CaStLe. M-CaStLe generalizes the local embedding and parent-identification phases of CaStLe to jointly model local within-variable and cross-variable space-time causal structures in gridded data. Like CaStLe, by constraining candidate parents to a constant-size space-time neighborhood and pooling spatial replicates, M-CaStLe increases effective sample size to make discovery tractable in high-dimensional settings. We further decompose the resulting multivariate stencil graph into reaction and spatial graphs to aid interpretation in complex settings. We study M-CaStLe in four settings: a multivariate space-time vector autoregression benchmark with known ground truth, an advective-diffusive-reaction partial differential equation verification problem with derived physical reference structure, an atmospheric chemistry case study in a low-temporal-sample regime, and an El Niño Southern Oscillation study on reanalysis data, identifying phase-dependent ocean--atmosphere coupling. Across these settings, M-CaStLe more accurately recovers multivariate causal structure in controlled settings and identifies important physical dynamics in real-world case studies. Overall, M-CaStLe advances causal discovery for multivariate space-time systems while retaining interpretability at the grid level.

2605.00379 2026-05-04 stat.ME

Economical Experimental Design with Generalized Posteriors

Luke Hagar, James M. McGree

Abstract

The hybrid approach to experimental design aims to control frequentist operating characteristics of Bayesian decision procedures. These operating characteristics are assessed by simulating sampling distributions of posterior summaries under assumed data-generation processes that also define posterior distributions. Model misspecification can distort effect estimation and compromise control over operating characteristics. Generalized posterior distributions are defined using generalized likelihoods that characterize data generation under fewer assumptions, enhancing the robustness of Bayesian analysis and study design. However, widely applicable and computationally efficient design methodology with generalized posteriors is lacking. We propose an economical method to determine suitable sample sizes and decision criteria associated with generalized posteriors under the hybrid approach. Using theoretical results to model posterior summaries as functions of the sample size, we efficiently assess operating characteristics throughout the sample size space given simulations conducted at only two sample sizes. While the benefits of the proposed methodology are emphasized by redesigning an adaptive clinical trial with time-to-event outcomes, we overview our framework's broader applicability to experiments involving Bayesian analogues to M-estimation.

2605.00365 2026-05-04 cs.LG cs.CL stat.ML

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Anamika Lochab, Bolian Li, Ruqi Zhang

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10\% absolute improvement on AIME24 at Pass@64 and up to 45\% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
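For readers unfamiliar with the Pass@K metric referred to above, a commonly used unbiased estimator takes n sampled solutions, of which c are correct, and computes 1 - C(n-c, k)/C(n, k). The sketch below is illustrative only: the helper name `pass_at_k` is hypothetical, and the abstract does not state which estimator the authors use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n total samples of which c are correct (hypothetical helper)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that this quantity depends only on the count of correct samples, not on how varied they are, which is why diversity within the correct set has to be measured separately.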

2605.00363 2026-05-04 math.ST stat.TH

Profile Likelihood Inference for Anisotropic Hyperbolic Wrapped Normal Models on Hyperbolic Space

Kisung You

Comments 34 pages, 2 figures

Abstract

We study likelihood-based inference for the anisotropic hyperbolic wrapped normal distribution on standard hyperbolic space. The model has a manifold-valued location parameter and a full positive definite covariance matrix in tangent coordinates. For independent observations from this family, we analyze the profile maximum likelihood estimator obtained by optimizing the likelihood over the location after profiling out the covariance. To guarantee finite-sample existence, we formulate the estimator on a covariance shell that bounds eigenvalues away from zero and infinity. We prove that this constrained likelihood is well posed, that the anisotropic wrapped normal family is identifiable, and that the estimator is strongly consistent when the true covariance lies in the interior of the shell. In global normal coordinates for the location and log-covariance coordinates for the nuisance parameter, we establish joint asymptotic normality and derive efficient profile inference for the location parameter through the Schur-complement information. We further prove local asymptotic normality of the experiment and obtain the Hájek--Le Cam local asymptotic minimax lower bound under squared geodesic loss. The profile estimator attains this bound for truncated squared loss, and for ordinary squared loss under a uniform-integrability condition. We also give an explicit computational form of the estimator based on spectral clipping of the empirical tangent covariance, and present a Monte Carlo calibration study showing that the finite-sample scaled location risk and Wald coverage agree with the asymptotic theory.

2605.00360 2026-05-04 cs.LG stat.ME

Binomial flows: Denoising and flow matching for discrete ordinal data

Yair Shenfeld, Ricardo Baptista, Stefano Peluchetti

Comments 41 pages, 9 figures

Abstract

Flow-based generative modeling in continuous spaces exploits Tweedie's formula to express the denoiser (learned in training) as a score function (used in sampling). In contrast, this relation has been largely missing in the discrete setting, where common approaches focus on learning discrete scores and rates. In this work we close this gap for discrete non-negative ordinal data by introducing Binomial flows. Our framework provides a simple recipe for training a discrete diffusion model which simultaneously denoises, samples, and estimates exact likelihoods. We verify our methodology on synthetic examples and obtain competitive results on real-world data sets.
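For reference, the continuous-space identity mentioned above can be written out explicitly. The Gaussian corruption model and the symbols $\sigma$, $\varepsilon$ and $p_\sigma$ are the standard textbook setup, assumed here rather than taken from the paper:

```latex
\mathbb{E}[X \mid Y = y] \;=\; y + \sigma^{2}\,\nabla_{y} \log p_{\sigma}(y),
\qquad Y = X + \sigma\varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I),
```

where $p_\sigma$ denotes the marginal density of the noisy observation $Y$. The left-hand side is the denoiser and the gradient term is the score, so learning one gives the other.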

2605.00332 2026-05-04 stat.ME cs.NA math.NA

Beyond Independence: on Jointly Normal Priors in Bayesian Inversion

Ruanui Nicholson, Matti Niskanen, Oliver J. Maclaren, Jari P. Kaipio

Abstract

We consider joint inversion for two or more unknown parameters from observational data in the Bayesian framework. Standard approaches often either treat the parameters as independent or impose structural similarity through regularisation terms that can be difficult to interpret statistically. We instead construct jointly Gaussian prior models with prescribed Gaussian marginals, so that correlation between the parameters can be incorporated without altering the marginal prior distributions. We propose a joint covariance construction that preserves the marginals, allows spatially varying cross-correlation, and supports uncertainty and inference in the correlation itself. The construction is valid for any strict contraction encoding the desired cross-correlation and is optimal in a canonical correlation sense under the principal square root factorisation. We demonstrate the method using prior sampling and several inference examples: a low-dimensional illustrative example and two higher-dimensional examples, including a PDE-constrained problem. The examples highlight the potential pitfalls of ignoring or neglecting uncertainty in the correlation and reinforce a key principle of the Bayesian paradigm: unknown quantities included in a model should be treated as random variables.

2605.00284 2026-05-04 cs.LG cs.NA math.NA stat.ML

A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions

Matteo Raviola, Benjamin Peherstorfer

Abstract

Dirac-Frenkel instantaneous residual minimization evolves nonlinear parametrizations of PDE solutions in time, but ill-conditioning can render the parameter dynamics non-unique. We interpret this non-uniqueness as a gauge freedom: nullspace directions that leave the time derivative unchanged can be used to select better-conditioned parameter velocities. Building on Onsager's minimum-dissipation principle, we introduce a history variable -- interpretable as momentum -- and inject it only along the nullspace directions. The resulting Dirac-Frenkel-Onsager dynamics preserve instantaneous residual minimization, in contrast to standard regularization that can introduce bias, while promoting temporally smooth parameter evolutions. Examples demonstrate that the approach leads to increased robustness in singular and near-singular regimes.

2605.00250 2026-05-04 stat.ML cs.CV cs.LG

Information-geometric adaptive sampling for graph diffusion

Yuhui Lu, Wenjing Liu, Kun Zhan

Comments Accepted to ICML 2026!

Abstract

Standard diffusion models for graph generation typically rely on uniform time-stepping, an approach that overlooks the non-homogeneous dynamics of distributional evolution on complex manifolds. In this paper, we present an information-geometric framework that reinterprets the diffusion sampling trajectory as a parametric curve on a Riemannian manifold. Our key observation is that the Fisher-Rao metric provides a principled measure of the intrinsic distance. By analyzing this metric, we derive the Drift Variation Score (DVS), a geometry-aware indicator that quantifies the instantaneous rate of distributional change. Unlike prior heuristic-based adaptive samplers, our DVS solver enforces a constant informational speed on the statistical manifold, automatically maintaining a uniform rate of distributional change along the sampling trajectory. This equal arc-length strategy ensures that each discretization step contributes equally to the information speed. Theoretical analysis verifies that DVS characterizes the local stiffness of the sampling dynamics in the Fisher-Rao sense. Experimental results on molecule and social network generation show that DVS significantly improves structural fidelity and sampling efficiency. Code is at https://github.com/kunzhan/DVS

2605.00229 2026-05-04 stat.ML cs.LG math.OC

A unified perspective on fine-tuning and sampling with diffusion and flow models

Carles Domingo-Enrich, Yuanqi Du, Michael S. Albergo

Abstract

We study the problem of training diffusion and flow generative models to sample from target distributions defined by an exponential tilting of a base density, a formulation that subsumes both sampling from unnormalized densities and reward fine-tuning of pre-trained models. This problem can be approached from a stochastic optimal control (SOC) perspective, using adjoint-based or score matching methods, or from a non-equilibrium thermodynamics perspective. We provide a unified framework encompassing these approaches and make three main contributions: (i) bias-variance decompositions revealing that Adjoint Matching/Sampling and Novel Score Matching have finite gradient variance, while Target and Conditional Score Matching do not; (ii) norm bounds on the lean adjoint ODE that theoretically support the effectiveness of adjoint-based methods; and (iii) adaptations of the CMCD and NETS loss functions, along with novel Crooks and Jarzynski identities, to the exponential tilting setting. We validate our analysis with reward fine-tuning experiments on Stable Diffusion 1.5 and 3.

2605.00216 2026-05-04 stat.ME

Simplicity Above Elegance: Another Look at the Asymptotically Correct Standardization of Snijders

Sandip Sinharay

Comments 34 pages, 5 figures. This version is the corrected version of the published article. Due to the correction, the content in pages 7-12 of this document differs substantially from that in the journal version

Journal ref: Jour. Ed. Behav. Stat. (2026)
Abstract

Person-fit statistics are widely used to detect aberrant response patterns in educational and psychological measurement. Snijders (2001) suggested an asymptotically correct standardization for a broad class of such statistics. This paper presents an alternative derivation of this standardization. The derivation yields several advantages including a simpler formula and simpler description of several person-fit statistics including the lz* statistic (van Krimpen-Stoop & Meijer, 1999) and a theoretical explanation of simulation findings reported by Snijders (2001) and van Krimpen-Stoop and Meijer (1999), among others.

2605.00196 2026-05-04 stat.ME math.PR math.ST q-fin.ST stat.TH

Modeling Stock Returns and Volatility Using Bivariate Gamma Generalized Laplace Law

Tomasz J. Kozubowski, Andrey Sarantsev, James A. Spiker

Comments 25 pages, 2 figures. Keywords: Financial modeling, Generalized Laplace distribution, Maximum likelihood estimation, Normal mean-variance mixture, Variance-gamma distribution

Abstract

We consider a generalization of the variance-gamma (generalized asymmetric Laplace) distribution, defined as a normal mean-variance mixture with a gamma mixing distribution. While this model is typically studied in the univariate setting, we assume that the gamma mixing variable is observed alongside the primary variable, resulting in a bivariate framework. In this setting, maximum likelihood estimation becomes significantly simpler than in the standard univariate case, reducing to a form of classical linear regression. We derive explicit expressions for the resulting estimators. For certain parameter configurations, the estimators exhibit nonstandard convergence rates, exceeding the usual square-root rate. Finally, we illustrate the applicability of this model in financial contexts by analyzing stock index returns and associated volatility for several major indices.
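To see why observing the mixing variable can reduce estimation to regression, consider the textbook normal mean-variance mixture. The parametrization $(\mu, \beta, \sigma)$ below is an assumption for illustration and may differ from the one used in the paper:

```latex
X \mid G = g \;\sim\; \mathcal{N}\!\left(\mu + \beta g,\; \sigma^{2} g\right)
\quad\Longrightarrow\quad
\frac{X}{\sqrt{G}} \;=\; \mu\,\frac{1}{\sqrt{G}} \;+\; \beta\,\sqrt{G} \;+\; \sigma Z,
\qquad Z \sim \mathcal{N}(0, 1).
```

Under this assumed form, maximizing the conditional likelihood in $(\mu, \beta)$ amounts to ordinary least squares of $X_i/\sqrt{G_i}$ on the two regressors $1/\sqrt{G_i}$ and $\sqrt{G_i}$ with no intercept, and the variance estimate is $\hat\sigma^{2} = \frac{1}{n}\sum_{i} (X_i - \hat\mu - \hat\beta G_i)^{2} / G_i$.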

2605.00193 2026-05-04 cs.LG stat.ML

OTSS: Output-Targeted Soft Segmentation for Contextual Decision-Weight Learning

Renjun Hu, Hyun-Soo Ahn

Comments 23 pages, 2 figures

Abstract

Many machine learning systems make constrained decisions by optimizing factorized objectives, but the context-specific objective is often treated as fixed. We study contextual decision-weight learning: from logged decisions and proxy outputs, learn an optimizer-facing weight vector w(x) over interpretable decision factors z(x,d), rather than a direct policy or generic predictive score. We propose OTSS, an output-targeted soft-segmentation model that deploys the personalized decision-ready weight vector. At the function-class level, the theory highlights a hard-versus-soft distinction. Hard partitions incur an approximation-estimation tradeoff under overlap, while a realizable fixed-K soft class removes the hard-partition approximation floor and attains a parametric rate. We evaluate OTSS in controlled benchmarks with finite evaluation libraries, where the true weight vector and downstream regret can be computed exactly. In the representative overlap setting, OTSS attains the lowest mean regret among the comparators, including EM mixture regression, the strongest soft-mixture baseline in our comparison; it matches EM on coefficient recovery while running about two orders of magnitude faster. In a matched K=5 benchmark, OTSS remains competitive under hard-routed truth and improves as heterogeneity becomes softer and sample size grows. On a fixed Complete Journey retail anchor with real household covariates and action geometry, OTSS again achieves the lowest mean-regret point estimate.

2605.00176 2026-05-04 stat.ML cs.LG

SHIFT: Robust Double Machine Learning for Average Dose-Response Functions under Heavy-Tailed Contamination

Eichi Uehara

Comments 77 pages, 43 figures, 35 tables. Code and raw CSVs: https://github.com/EichiUehara/ADRF-Robust-DML

Abstract

Double-machine-learning pipelines for the Average Dose-Response Function rely on kernel-weighted local-linear smoothers, which inherit unbounded functional influence: a single outlier within a kernel window biases the curve across the entire window. We introduce SHIFT (Self-calibrated Heavy-tail Inlier-Fit with Tempering), a robust DML estimator combining cross-fit nuisance orthogonalization with a kernel-local Welsch-loss second stage optimized by Graduated Non-Convexity, and -- the principal design choice -- a defensive OLS refit whose inlier cutoff is scaled by post-GNC residual MAD rather than the raw-outcome MAD. On a localized-contamination stress test at $p=0.25$ this design choice drops level-RMSE from 1.03 to 0.33 while leaving clean and uniformly-contaminated runs unchanged. Across 1,400 main-sweep fits, SHIFT has competitive worst-case shape recovery (RMSE $0.325$ at $p=0.25$, second to Huber-DML's $0.276$); among the three methods with worst-case RMSE below $0.35$, only SHIFT emits a non-uniform per-sample weight vector, recovering the ground-truth outlier mask at mean $F_1 \approx 0.96$ (range $0.945$--$0.968$) on Gaussian-jump DGPs. We pair the estimator with a six-technique Extreme Value Theory diagnostic suite (Hill, GPD-MLE/PWM, GEV, Mean Excess, parameter stability, causal tail coefficient) that lets a practitioner distinguish Frechet from Weibull regimes and choose between SHIFT and L1 alternatives on empirical grounds. Extensions to binary-treatment CATE (Huber pseudo-outcome X-Learner) and time-series ADRF (block-CV + rolling MAD) are included. A counter-intuitive ablation: linear nuisance models (Ridge, Lasso) outperform gradient-boosted nuisances for robust DML under uniform contamination, inverting the usual more-flexible-is-better heuristic.

2605.00175 2026-05-04 stat.AP

Using Linked Micromaps to Explore Complex Structures in Official Statistics

Randall Powers, Darcy Steeg Morris, John Eltinge, Wendy Martinez

Abstract

Over the past decade, researchers have focused increasing levels of attention on the use of survey and non-survey data to inform decision-making by multiple stakeholders. Work with such data generally requires extensive exploration before a statistics practitioner focuses on specific steps in model building and inference. For many of the resulting initial exploratory analyses, crucial issues center on the extent to which empirical results may vary over geography and subpopulations. Such information is usually presented in tabular form, which can be difficult for stakeholders and decision makers to understand and to utilize. To address these issues, this paper uses data from the U.S. Bureau of Labor Statistics to illustrate a suite of tools known as linked micromaps. These applications show how linked micromaps can help stakeholders better understand and view descriptive statistics for populations and subpopulations, explore multivariate relationships and ordinal structure, and discover patterns of heterogeneity across time and space. In addition, this paper comments briefly on the prospective use of linked micromaps in model-building and analysis of multiple components of uncertainty.

2605.00171 2026-05-04 stat.ML cs.LG stat.AP

Adaptive Norm-Based Regularization for Neural Networks

Muhammad Qasim, Farrukh Javed

Comments 37 pages, 9 figures

Abstract

In this paper, we study norm-based regularization methods for neural networks. We compare existing penalization approaches and introduce two regularization strategies that extend classical ridge- and lasso-type penalties to neural network models. The first strategy modifies weight decay by incorporating the covariance structure of the input features into a ridge-type $\ell_2$ penalty, allowing regularization to account for feature dependence. The second combines an $\ell_1$ sparsity penalty with covariance-aware $\ell_2$ regularization, producing neural network weights that are both sparse and structurally informed. Monte Carlo simulations are used to evaluate these methods under different data-generating settings, followed by two real-data applications on building cooling-load prediction and leukemia cell-type classification from high-dimensional gene expression data. Across simulated and real-data examples, the proposed regularizers improve predictive performance on unseen data and provide more effective complexity control than standard norm-based penalties, particularly when features are correlated or high-dimensional.

2605.00126 2026-05-04 cs.LG eess.SP stat.ML

SPLICE: Latent Diffusion over JEPA Embeddings for Conformal Time-Series Inpainting

Arnaud Zinflou

Abstract

Generative models for time-series imputation achieve strong reconstruction accuracy, yet provide no finite-sample reliability guarantees, a critical limitation in power systems where imputed values inform dispatch and planning. We introduce SPLICE (Self-supervised Predictive Latent Inpainting with Conformal Envelopes), a modular framework coupling latent generative imputation with distribution-free, online-adaptive prediction intervals. A JEPA encoder maps daily load segments into a 64-dimensional latent space; a conditional latent bridge with four sampling modes generates candidate gap trajectories; an hourly-conditioned decoder maps back to signal space; and Adaptive Conformal Inference (ACI) wraps the output with coverage-guaranteed prediction bands. The flow-matching variant achieves comparable quality to DDIM in 5--10 ODE steps (5-10x speedup). On thirteen load datasets (nine proprietary, three UCI Electricity, ETTh1), SPLICE achieves the lowest mean Load-only MSE (0.056), winning 9/12 non-degenerate datasets at 91-day gaps and 18/32 across all gap lengths vs. five established baselines, and produces the best CRPS (0.161, -18.3% vs. the strongest competitor). ACI delivers 93--95% empirical coverage, correcting under-coverage failures of up to 7.5 pp observed with static conformal prediction. A pooled JEPA encoder trained on nine feeds transfers to four unseen domains, matching or exceeding per-dataset oracles with only a quick bridge fine-tuning.
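The online-adaptive step of Adaptive Conformal Inference can be sketched as follows. This is a minimal version of the standard ACI update, in which the working miscoverage level moves by `gamma * (target_alpha - err_t)` after each observation. All function names are hypothetical, absolute residuals are assumed as the conformity score, and the abstract does not say whether SPLICE uses exactly this variant.

```python
import math


def aci_update(alpha_t, target_alpha, covered, gamma=0.01):
    """One ACI step: raise alpha_t after a hit, lower it after a miss."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)


def interval_half_width(scores, alpha_t):
    """Conformal quantile of past absolute residuals at level 1 - alpha_t."""
    if alpha_t <= 0.0:
        return math.inf  # demand full coverage: infinite band
    if alpha_t >= 1.0:
        return 0.0       # no coverage demanded: degenerate band
    ordered = sorted(scores)
    rank = math.ceil((1.0 - alpha_t) * (len(ordered) + 1))
    if rank > len(ordered):
        return math.inf  # too few scores to certify this level
    return ordered[rank - 1]


def run_aci(residuals, calibration_scores, target_alpha=0.1, gamma=0.01):
    """Process a stream of signed residuals; return hit flags and final alpha."""
    alpha_t = target_alpha
    scores = list(calibration_scores)
    covered_flags = []
    for r in residuals:
        width = interval_half_width(scores, alpha_t)
        covered = abs(r) <= width
        covered_flags.append(covered)
        alpha_t = aci_update(alpha_t, target_alpha, covered, gamma)
        scores.append(abs(r))  # the newly observed residual joins the pool
    return covered_flags, alpha_t
```

A miss lowers `alpha_t` and so widens later bands, which is the mechanism by which an adaptive scheme can correct the under-coverage of a static conformal band.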

2605.00108 2026-05-04 physics.soc-ph econ.GN q-fin.EC stat.AP

Urban Science Beyond Samples: Up-to-Date Street Network Models and Indicators for Every Urban Area in the World

Geoff Boeing

Journal ref: Environment and Planning B: Urban Analytics and City Science, 2026
Abstract

Urban planners need up-to-date, global, and consistent street network models and indicators to measure resilience and performance, model accessibility, and target local quality-of-life interventions. This article presents up-to-date street network models and indicators for every urban area in the world. It uses 2025 urban area boundaries from the Global Human Settlement Layer, allowing users to join these data to hundreds of other urban attributes. Its workflow ingests 180 million OpenStreetMap nodes and 360 million OpenStreetMap edges across 10,351 urban areas in 189 countries. The code, models, and indicators are publicly available for reuse. These resources unlock worldwide urban street network science beyond samples as well as local analyses in under-resourced regions where models and indicators are otherwise less accessible.

2605.00099 2026-05-04 quant-ph cs.LG stat.ML

Provable and scalable quantum Gaussian processes for quantum learning

Jonas Jäger, Paolo Braccia, Pablo Bermejo, Manuel G. Algaba, Diego García-Martín, M. Cerezo

Comments 18 + 70 pages, 5 + 14 figures, 2 tables

Abstract

Despite rapid recent advances in quantum machine learning, the field is in many ways stuck. Existing approaches can exhibit serious limitations, and we still lack learning frameworks that are simple, interpretable, scalable, and naturally suited to quantum data. To address this, here we introduce quantum Gaussian processes, a Bayesian framework for learning from quantum systems through priors over unknown quantum transformations. We show that, under suitable conditions, unitary quantum stochastic processes define Gaussian processes, thereby enabling regression, classification, and Bayesian optimization directly on quantum data. The key ingredient in this framework is sufficient knowledge of a quantum process's structure and symmetries to define an informative prior through its corresponding quantum kernel, effectively injecting a strong, physics-informed inductive bias into the learning model. We then prove that matchgate, or free-fermionic, evolutions give rise to provable and scalable quantum Gaussian processes, providing the first family in our framework where the unknown unitary acts non-trivially on all qubits. Finally, we demonstrate accurate long-range extrapolation, phase-diagram learning in many-body systems, and sample-efficient Bayesian optimization in a quantum sensing task. Our results identify quantum Gaussian processes as a promising route toward simpler and more structured forms of quantum learning.

2605.00056 2026-05-04 cs.LG cs.AI physics.data-an physics.geo-ph stat.AP stat.ML

Smart Ensemble Learning Framework for Predicting Groundwater Heavy Metal Pollution

T. Ansah-Narh, G. Y. Afrifa, J. B. Tandoh, K. Asare, M. Addi, K. E. Yorke, D. M. A. Akpoley, K. Aidoo, S. K. Fosuhene

Comments 53 pages, 16 figures, accepted for publication in Earth Systems and Environment (2026)

Abstract

Groundwater in the Densu Basin is increasingly threatened by heavy metal contamination, but conventional methods fail to capture the statistical complexity and spatial heterogeneity of pollution indicators. A key challenge is modelling the Heavy Metal Pollution Index (HPI), which is typically skewed and affected by correlated contaminants, leading to biased predictions without transformation. This study develops a predictive framework integrating response transformations with nested cross-validated ensemble machine learning. Three transformations (raw, log, and Gaussian copula) were applied to HPI and evaluated across six learners: support vector regression (SVM), $k$-nearest neighbours (k-NN), CART, Elastic Net, kernel ridge regression, and a stacked Lasso ensemble. Raw-scale models produced deceptively high fits (Elastic Net and stacked ensemble $R^2 \approx 1.0$), suggesting over-optimism. The log transformation stabilised variance (SVM: $R^2 = 0.93$, RMSE $= 0.18$; k-NN: $R^2 = 0.92$, RMSE $= 0.20$). The Gaussian copula gave the most reliable results: stacked ensemble $R^2 = 0.96$ (RMSE $= 0.19$), with other learners maintaining high accuracy. Copula-based models improved residuals and produced spatially plausible maps. DBSCAN clustering revealed Fe and Mn as primary HPI contributors, consistent with regional hydrogeochemistry. Limitations include reliance on random (not spatial) cross-validation and basin-specific scope. Future work should explore spatial validation and other geological settings. Overall, distribution-aware ensembles with clustering diagnostics offer robust, interpretable assessments of groundwater contamination.

2605.00007 2026-05-04 math.OC cs.AI stat.ML

Mean-Field Path-Integral Diffusion: From Samples to Interacting Agents

Michael Chertkov

Comments 31 pages, 14 figures

Abstract

Independent sample generation is the prevailing paradigm in modern diffusion-based generative models of AI. We ask a different question: can samples coordinate through shared population statistics to transport probability mass more efficiently? We introduce Mean-Field Path-Integral Diffusion (MF-PID), a framework in which samples are promoted to interacting agents whose drift depends self-consistently on the evolving population density. The coupling converts distribution matching into a McKean--Vlasov extension of the stochastic optimal transport problem, unifying generative modeling and multi-agent control under the same Hamilton--Jacobi--Bellman/Kolmogorov--Fokker--Planck duality. We identify two analytically tractable regimes: a Linear--Quadratic--Gaussian (LQG) benchmark in which the infinite-dimensional mean-field system reduces to a finite set of Riccati and linear ODEs, and a Gaussian-mixture regime governed by a piecewise-constant protocol that preserves closed-form solvability. For a quadratic interaction potential with schedule $β_t$ and zero base drift we prove that the self-consistent MF guidance is the exact linear interpolant between initial and target global means -- a result that holds for arbitrary initial and target densities and any $β_t$. Applied to demand-response control of energy systems, where agents aggregated into an ensemble are energy consumers (e.g. thermal zones within a building), MF-PID achieves 19--24\% reductions in cumulative control energy over independent-agent baselines while matching the prescribed terminal distribution exactly, and reveals how coordination redistributes actuation effort across heterogeneous sub-populations.

2604.27077 2026-05-04 cs.LG cs.AI stat.ML

Learning Rate Transfer in Normalized Transformers

Boris Shigida, Boris Hanin, Andrey Gromov

Abstract

The Normalized Transformer, or nGPT (arXiv:2410.01131), achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $μ$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $ν$GPT. Through extensive empirical validation, we find $ν$GPT exhibits learning rate transfer across width, depth, and token horizon.

2604.24032 2026-05-04 stat.ME

On Cluster Randomized Trials with the Desirability of Outcome Ranking (DOOR) Endpoints

Wanying Shao, Toshimitsu Hamasaki, Scott Evans, Guoqing Diao

Abstract

Cluster randomized trials are widely used when individual randomization is logistically infeasible or when correlations between observations cannot be ignored, especially in fields such as ophthalmology, infectious disease, vaccine research, and sociology. The desirability of outcome ranking (DOOR) framework evaluates patient-centric benefit-risk using an ordinal outcome and a Wilcoxon-Mann-Whitney statistic-based approach to compare outcome distributions between interventions. We propose a suite of new methods to extend DOOR to cluster trials based on properties of U-statistics and influence functions to estimate within-cluster and between-cluster treatment effects. These approaches can be applied in different scenarios, including mixtures of clusters with two treatment groups and clusters with only one group, and both small and large numbers of clusters. Simulations demonstrate that the proposed methods perform well under various scenarios regarding the number of clusters and cluster sizes. As an illustration, we apply the proposed methods to a cluster randomized crossover trial comparing delayed cord clamping and umbilical cord milking for newborns.
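The Wilcoxon-Mann-Whitney quantity underlying DOOR analyses is the probability that a randomly chosen participant in one arm has a more desirable outcome than one in the other arm, with ties counted as one half. The sketch below computes it for independent observations only. It ignores clustering, which is the paper's actual subject, and both the function name `door_probability` and the convention that larger values are more desirable are assumptions made for illustration.

```python
def door_probability(treatment, control):
    """Estimate P(T > C) + 0.5 * P(T = C) over all treatment/control pairs.

    Outcomes are ordinal ranks; larger is assumed to mean more desirable.
    """
    if not treatment or not control:
        raise ValueError("both arms need at least one observation")
    wins = 0
    ties = 0
    for x in treatment:
        for y in control:
            if x > y:
                wins += 1
            elif x == y:
                ties += 1
    return (wins + 0.5 * ties) / (len(treatment) * len(control))
```

A value of 0.5 indicates no difference between the arms under this measure, and values above 0.5 favor the treatment arm.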

2604.04567 2026-05-04 stat.ML cs.LG

Generative Modeling under Non-Monotone MAR Missingness via Approximate Wasserstein Gradient Flows

Gitte Kremling, Jeffrey Näf, Johannes Lederer

Abstract

The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead, ad-hoc imputation methods are often used, for which it remains unclear whether the correct distribution can be recovered. In this paper, we propose FLOWGEM, a principled iterative method for generating a complete dataset from a dataset with values Missing at Random (MAR). Motivated by convergence results of the ignoring maximum likelihood estimator, our approach minimizes the expected Kullback-Leibler (KL) divergence between the observed data distribution and the distribution of the generated sample over different missingness patterns. To minimize the KL divergence, we employ a discretized particle evolution of the corresponding Wasserstein Gradient Flow, where the velocity field is approximated using a local linear estimator of the density ratio. This construction yields a data generation scheme that iteratively transports an initial particle ensemble toward the target distribution. Simulation studies and real-data benchmarks demonstrate that FLOWGEM achieves state-of-the-art performance across a range of settings, including the challenging case of non-monotone MAR mechanisms. Together, these results position FLOWGEM as a principled and practical alternative to existing imputation methods, and a decisive step towards closing the gap between theoretical rigor and empirical performance.

2603.18413 2026-05-04 stat.ML cs.LG

Statistical Testing Framework for Clustering Pipelines by Selective Inference

Yugo Miyata, Tomohiro Shiraishi, Shuichi Nishino, Ichiro Takeuchi

Comments 59 pages, 11 figures

Abstract

A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.

2603.08538 2026-05-04 math.ST stat.ME stat.TH

Minimax estimation for Varying Coefficient Model via Laguerre Series

Rida Benhaddou, Khalid Chokri, Jackson Pinschenat

Comments 27 pages, 6 figures

Abstract

We study estimation of the functional coefficients and inference for the varying coefficient model. Applying Laguerre series, we develop an estimator for the vector of functional coefficients that attains asymptotically optimal convergence rates in the minimax sense. These rates are derived for functional coefficients that belong to a Laguerre-Sobolev space. The method is based on approximating the functional coefficients using truncated Laguerre series and choosing empirical Laguerre coefficients that minimize the least squares criterion. In addition, we establish the asymptotic normality of the estimator for the functional coefficients, construct confidence intervals for them, and develop point-wise hypothesis tests about their true values. A simulation study is carried out to examine the finite-sample properties of the proposed methodology. A real data set is considered as well, and results based on the proposed methodology are compared to those based on selected existing approaches.

2603.02275 2026-05-04 cs.LG stat.AP stat.ML

A Comparative Study of UMAP and Other Dimensionality Reduction Methods

Guanzhe Zhang, Shanshan Ding, Zhezhen Jin

Comments 31 pages, 4 figures

Abstract

Uniform Manifold Approximation and Projection (UMAP) is a widely used manifold learning technique for dimensionality reduction. This paper studies UMAP, supervised UMAP, and several competing dimensionality reduction methods, including Principal Component Analysis (PCA), Kernel PCA, Sliced Inverse Regression (SIR), Kernel SIR, and t-distributed Stochastic Neighbor Embedding, through a comprehensive comparative analysis. Although UMAP has attracted substantial attention for preserving local and global structures, its supervised extensions, particularly for regression settings, remain rather underexplored. We provide a systematic evaluation of supervised UMAP for both regression and classification using simulated and real datasets, with performance assessed via predictive accuracy on low-dimensional embeddings. Our results show that supervised UMAP performs well for classification but exhibits limitations in effectively incorporating response information for regression, highlighting an important direction for future development.

2601.09011 2026-05-04 stat.ME q-bio.PE

Causal attribution by the chain rule: unifying natural selection, learning, economics, and other disciplines

Steven A. Frank

Comments New title, abstract, introduction, and other significant changes throughout

Abstract

Analysis often splits change into components. For example, how much of the observed variance is caused by genes or environment? In many cases, the split is ultimately made by the logic of the chain rule, which divides the difference of a product into two terms. Each term quantifies the partial difference associated with change in one component while holding the other component constant. The chain rule is of course widely known. However, this article argues that its deep fundamental role often goes unrecognized. The article shows how simply the basic chain rule unifies Fisher's fundamental theorem of natural selection, the Price equation description of evolutionary change, the Oaxaca-Blinder decomposition of wage differences in economics, the Kitagawa decomposition of mortality differences in demography, many expressions of thermodynamics, and most strikingly back propagation, the core optimization method of modern machine learning and artificial intelligence. The success in creating good designs and finding good solutions in both natural selection and artificial intelligence depends on how the chain rule propagates causes from instances of success or failure back to the underlying genes or parameters of the system. The mathematical analysis presented here shows that, for finite differences, the product rule form of the chain rule yields a basic decomposition of change into two components of a regression equation. That regression decomposition is purely a description of change with no explicit causal meaning. However, simple additional assumptions lead naturally to the modern counterfactual analysis of causality. From that perspective, we can easily understand the causal interpretation that Fisher gave to his fundamental theorem, and we can see the same causal structure in the Oaxaca-Blinder decomposition of economics and in causal analyses across many disciplines.
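The finite-difference product rule mentioned above can be illustrated with a short sketch. It splits the change in a product x*y into one term for the change in x and one for the change in y. Which endpoint is held fixed in each term is a convention chosen here for illustration and may differ from the form used in the paper; the function name is hypothetical.

```python
def product_difference_terms(x0, y0, x1, y1):
    """Split (x1*y1 - x0*y0) into two partial differences.

    term_x: change in x, holding y at its final value.
    term_y: change in y, holding x at its initial value.
    The two terms always sum to the total change in the product.
    """
    dx = x1 - x0
    dy = y1 - y0
    term_x = dx * y1
    term_y = x0 * dy
    return term_x, term_y


term_x, term_y = product_difference_terms(2, 3, 5, 7)
total_change = 5 * 7 - 2 * 3
print(term_x, term_y, total_change)
```

The identity holds exactly because dx*y1 + x0*dy = x1*y1 - x0*y1 + x0*y1 - x0*y0.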

2510.23557 2026-05-04 stat.ML cs.LG

Minimizing Human Intervention in Online Classification

William Réveillard, Vasileios Saketos, Alexandre Proutiere, Richard Combes

Comments 53 pages, 10 figures. AISTATS 2026

Abstract

Training or fine-tuning large language model (LLM)-based systems often requires costly human feedback, yet there is limited understanding of how to minimize such intervention while maintaining strong error guarantees. We study this problem for LLM-based classification systems in an active learning framework: an agent sequentially labels $d$-dimensional query embeddings drawn i.i.d. from an unknown distribution by either calling a costly expert or guessing with no feedback, with the goal of minimizing regret relative to an oracle with free expert access. When the horizon $T$ is at least exponential in the embedding dimension $d$, the geometry of the class regions can be learned. In this regime, we propose the Conservative Hull-based Classifier (CHC), which maintains convex hulls of expert-labeled queries and calls the expert when a query lands outside all known hulls. CHC attains $\mathcal{O}(\log^d T)$ regret in $T$ and is minimax optimal for $d=1$. Otherwise, the geometry cannot be reliably learned in general. We show that for queries drawn from a subgaussian mixture and $T \le e^d$, a Center-based Classifier (CC) achieves regret proportional to $N\log{N}$ where $N$ is the number of labels. To bridge these regimes, we introduce the Generalized Hull-based Classifier (GHC), a practical extension of CHC that enables more aggressive guessing via a tunable parameter. Our approach is validated on real-world question-answering datasets using state-of-the-art text embedding models.
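The hull-based rule described above can be sketched in one dimension, where the convex hull of the expert-labeled queries for a label is simply an interval. This is an illustrative toy under that assumption, not the authors' implementation; the class name, the expert function, and the queries are hypothetical.

```python
class IntervalHullClassifier:
    """1-D sketch: guess when a query lies inside a known hull, else ask the expert."""

    def __init__(self, expert):
        self.expert = expert      # callable: query -> true label (costly)
        self.hulls = {}           # label -> (low, high) interval of expert-labeled queries
        self.expert_calls = 0

    def classify(self, x):
        # Guess without feedback if the query falls inside a known hull.
        for label, (low, high) in self.hulls.items():
            if low <= x <= high:
                return label
        # Otherwise pay for the expert and grow that label's hull.
        label = self.expert(x)
        self.expert_calls += 1
        low, high = self.hulls.get(label, (x, x))
        self.hulls[label] = (min(low, x), max(high, x))
        return label


def toy_expert(x):
    """Hypothetical ground truth: two classes split at 0.5."""
    return 0 if x < 0.5 else 1


clf = IntervalHullClassifier(toy_expert)
queries = [0.1, 0.4, 0.2, 0.9, 0.6, 0.7, 0.3]
labels = [clf.classify(q) for q in queries]
print(labels, clf.expert_calls)
```

Because each class region in this toy is itself an interval, a guess made inside a hull is never wrong, and the expert is consulted only when a hull has to grow.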

2510.19206 2026-05-04 math.ST stat.ML stat.TH

Shrinkage to Infinity: Reducing Test Error by Inflating the Minimum Norm Interpolator in Linear Models

Jake Freeman

Abstract

Hastie et al. (2022) found that ridge regularization is essential in high-dimensional linear regression $y=\beta^T x + \varepsilon$ with isotropic covariates $x\in \mathbb{R}^d$ and $n$ samples at fixed $d/n$. However, Hastie et al. (2022) also note that when the covariates are anisotropic and $\beta$ is aligned with the top eigenvalues of the population covariance, the "situation is qualitatively different." In the present article, we make this observation precise for linear regression with highly anisotropic covariances and diverging $d/n$. We find (both theoretically and empirically) that simply scaling up (or inflating) the minimum $\ell_2$ norm interpolator by a constant greater than one can improve the generalization error. This is in sharp contrast to traditional regularization/shrinkage prescriptions. Moreover, we use a data-splitting technique to produce consistent estimators that achieve generalization error comparable to that of the optimally inflated minimum-norm interpolator. Our proof relies on matching upper and lower bounds for expectations of Gaussian random projections for a general class of anisotropic covariance matrices when $d/n\rightarrow \infty$.

2509.20015 2026-05-04 q-fin.MF q-fin.CP stat.ME

Randomized Kolmogorov-Smirnov Analysis of Volatility Roughness

Sergio Bianchi, Daniele Angelini

Comments 23 pages

Abstract

We introduce a novel distribution-based estimator for the Hurst parameter of log-volatility, leveraging the Kolmogorov-Smirnov statistic to assess the scaling behavior of entire distributions rather than individual moments. To address the temporal dependence of financial volatility, we propose a random permutation procedure that effectively removes serial correlation while preserving marginal distributions, enabling the rigorous application of the KS framework to dependent data. We establish the asymptotic variance of the estimator, useful for inference and confidence interval construction. From a computational standpoint, we show that derivative-free optimization methods, particularly Brent's method and the Nelder-Mead simplex, achieve substantial efficiency gains relative to grid search while maintaining estimation accuracy. Empirical analysis of the CBOE VIX index and the 5-minute realized volatility of the S&P 500 reveals a statistically significant hierarchy of roughness, with implied volatility smoother than realized volatility. Both measures, however, exhibit Hurst exponents well below one-half, reinforcing the rough volatility paradigm and highlighting the open challenge of disentangling local roughness from long-memory effects in fractional modeling.

2509.17960 2026-05-04 stat.ME stat.AP

Everything all at once: On choosing an estimand for multi-component environmental exposures

Kara E. Rudolph, Shodai Inose, Nicholas Williams, Ivan Diaz, Lucia Calderon, Jacqueline M. Torres, Marianthi-Anna Kioumourtzoglou

Abstract

Many research questions -- particularly those in environmental health -- do not involve binary exposures. In environmental epidemiology, this includes multivariate exposure mixtures with nondiscrete components. Causal inference estimands and estimators to quantify the relationship between an exposure mixture and an outcome are relatively few. We propose an approach to quantify a relationship between a shift in the exposure mixture and the outcome -- either in the single timepoint or longitudinal setting. The shift in the exposure mixture can be defined flexibly in terms of shifting one or more components, including examining interaction between mixture components, and in terms of shifting the same or different amounts across components. The estimand we discuss has an interpretation similar to that of a main-effect regression coefficient. First, we focus on choosing a shift in the exposure mixture supported by observed data. We demonstrate how to assess extrapolation and modify the shift to minimize reliance on extrapolation. Second, we propose estimating the relationship between the exposure mixture shift and outcome completely nonparametrically, using machine learning in model-fitting. This is in contrast to other current approaches, which employ parametric modeling for at least some relationships. We would like to avoid such modeling because parametric assumptions in complex, nonrandomized settings are tenuous at best. We are motivated by longitudinal data on pesticide exposures among participants in the CHAMACOS Maternal Cognition cohort. We examine the relationship between longitudinal exposure to agricultural pesticides and risk of hypertension. We provide step-by-step code to facilitate the easy replication and adaptation of the approaches we use.

2509.02829 2026-05-04 math.PR math.ST stat.TH

An iterated $I$-projection procedure for solving the generalized minimum information checkerboard copula problem

Ivan Kojadinovic, Tommaso Martini

Comments 35 pages, 13 figures

Abstract

The minimum information copula principle initially suggested by Meeuwissen and Bedford (1997) is a maximum entropy-like approach for finding the least informative copula, if it exists, that satisfies a certain number of expectation constraints specified either from domain knowledge or the available data. We first propose a generalization of this principle allowing the inclusion of additional constraints fixing certain higher-order margins of the copula. We next show that the associated optimization problem has a unique solution under a natural condition. As the latter problem is intractable in general, we consider its version with all the probability measures involved in its formulation replaced by checkerboard approximations. This amounts to attempting to solve a so-called discrete $I$-projection linear problem. We then exploit the seminal results of Csiszár (1975) to derive an iterated procedure for solving the latter and provide theoretical guarantees for its convergence. The usefulness of the procedure is finally illustrated via numerical experiments in dimensions up to four with substantially finer discretizations than those encountered in the literature.

2508.18682 2026-05-04 math.ST stat.TH

Simple and Sharp Generalization Bounds via Lifting

Jingbo Liu

Comments 1 figure

Abstract

We develop an information-theoretic framework for bounding the supremum of stochastic processes, offering a simpler and sharper alternative to classical chaining and slicing arguments for generalization bounds. The key idea is a lifting argument that produces information-theoretic analogues of empirical process bounds, such as Dudley's entropy integral. Lifting introduces permutation symmetry, yielding sharp bounds when the classical Dudley integral is loose. This gives a simple proof of the majorizing measure theorem via the sharpness of Dudley's entropy integral for stationary processes, a result known well before the proof of the majorizing measure theorem. Furthermore, the information-theoretic formulation provides soft versions of classical localized complexity bounds in generalization theory, but is simpler and does not require the slicing argument. We apply this approach to empirical risk minimization over Sobolev ellipsoids, obtaining sharp convergence rates in settings where previous methods are suboptimal.

2508.01610 2026-05-04 stat.ME

Sample size calculations for multilevel factorial longitudinal cluster randomised trials

Rhys Bowden, Rebecca Walwyn, Jessica Kasza, Andrew Copas, Fan Li, James Wason, Andrew Forbes

Abstract

Typically, trials investigate the impact of either an individual-level intervention on participant outcomes, or the impact of a cluster-level intervention on participant outcomes. Factorial designs consider two (or more) treatments for each of two (or more) different factors. In factorial trial designs, trial units (individuals or clusters) are each randomised to a level of each of the treatments; these designs allow assessment of the interactions between different interventions. Recently, there has been growing interest in the design of trials that jointly assess the impact of individual- and cluster-level interventions (i.e. multi-level interventions), requiring the development of methodology that accommodates randomisation at multiple levels. While recent work has developed sample size methodology for variants combining standard cluster randomisation and individual randomisation, that work does not apply to longitudinal cluster randomised trial designs such as the stepped wedge design or cluster randomised crossover design. Here we present dedicated sample size methodology for "split-plot factorial longitudinal cluster randomised trials" with continuous outcomes. The methodology allows for joint assessment of individual-level and cluster-level interventions, with the impact of the cluster-level intervention assessed using any longitudinal cluster randomised trial design. We show how the power to detect given effects of the individual-level intervention, the cluster-level intervention, and the interaction between the two depends on standard results for individually-randomised trials and longitudinal cluster randomised trials. We apply these results to the SharES trial, which considered the effects of patient- and clinician-level interventions for patients with breast cancer on patient knowledge about the risks and benefits of treatment.

2504.15290 2026-05-04 stat.OT

Parental Imprints On Birth Weight: A Data-Driven Model For Neonatal Prediction In Low Resource Prenatal Care

Rajeshwari Mistri, Harsh Joshi, Nachiket Kapure, Parul Kumari, Manasi Mali, Seema Purohit, Neha Sharma, Mrityunjoy Panday, Chittaranjan S. Yajnik

Comments Withdrawn due to identified issues in manuscript originality and overlap in some Sections requiring substantial revision and restructuring of the text and methodology. A corrected and improved version will be submitted

Abstract

Accurate fetal birth weight prediction is a cornerstone of prenatal care, yet traditional methods often rely on imaging technologies that remain inaccessible in resource-limited settings. This study presents a novel machine learning-based framework that circumvents these conventional dependencies, using a diverse set of physiological, environmental, and parental factors to refine birth weight estimation. A multi-stage feature selection pipeline filters the dataset into an optimized subset, demonstrating previously underexplored yet clinically relevant predictors of fetal growth. By integrating advanced regression architectures and ensemble learning strategies, the model captures non-linear relationships often overlooked by traditional approaches, offering a predictive solution that is both interpretable and scalable. Beyond predictive accuracy, this study addresses a question: whether birth weight can be reliably estimated without conventional diagnostic tools. The findings challenge entrenched methodologies by introducing an alternative pathway that enhances accessibility without compromising clinical utility. While limitations exist, the study lays the foundation for a new era in prenatal analytics, one where data-driven inference competes with, and potentially redefines, established medical assessments. By bridging computational intelligence with obstetric science, this research establishes a framework for equitable, technology-driven advancements in maternal-fetal healthcare.

2503.14459 2026-05-04 stat.ML cs.LG stat.ME

Doubly robust identification of treatment effects from multiple environments

Piersilvio De Bartolomeis, Julia Kostin, Javier Abad, Yixin Wang, Fanny Yang

Comments Accepted for presentation at the International Conference on Learning Representations (ICLR) 2025

Abstract

Practical and ethical constraints often require the use of observational data for causal inference, particularly in medicine and social sciences. Yet, observational datasets are prone to confounding, potentially compromising the validity of causal conclusions. While it is possible to correct for biases if the underlying causal graph is known, this is rarely a feasible ask in practical scenarios. A common strategy is to adjust for all available covariates, yet this approach can yield biased treatment effect estimates, especially when post-treatment or unobserved variables are present. We propose RAMEN, an algorithm that produces unbiased treatment effect estimates by leveraging the heterogeneity of multiple data sources without the need to know or learn the underlying causal graph. Notably, RAMEN achieves doubly robust identification: it can identify the treatment effect whenever the causal parents of the treatment or those of the outcome are observed, and the node whose parents are observed satisfies an invariance assumption. Empirical evaluations on synthetic and real-world datasets show that our approach outperforms existing methods.

2503.10990 2026-05-04 cs.GT cs.LG econ.TH math.ST stat.ML stat.TH

Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

Kaizhao Liu, Qi Long, Zhekun Shi, Weijie J. Su, Jiancong Xiao

Comments Accepted for publication in the Annals of Statistics

Abstract

Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits concerning aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a general probabilistic preference model called the Luce model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback. Next, we explore the conditions under which LLMs would employ mixed strategies -- meaning they do not collapse to a single response -- when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback. We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. As a blessing, we prove that this condition holds with high probability under the Luce model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization in aligning LLMs.
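The two conditions named above, the presence of a Condorcet cycle and the absence of a response preferred over all others by a majority, can be checked directly on a small set of rankings. The sketch below is illustrative only: the function names and the example rankings are hypothetical, and it assumes strict rankings from an odd number of annotators so that every pairwise majority is decisive.

```python
from itertools import permutations


def majority_prefers(rankings, a, b):
    """True if a strict majority of rankings place a ahead of b."""
    votes = sum(1 for r in rankings if r.index(a) < r.index(b))
    return votes > len(rankings) / 2


def condorcet_winner(rankings):
    """Return the response that beats every other by majority, or None."""
    candidates = rankings[0]
    for c in candidates:
        if all(majority_prefers(rankings, c, o) for o in candidates if o != c):
            return c
    return None


def has_condorcet_cycle(rankings):
    """Check all ordered triples for a > b > c > a under majority preference.

    With decisive pairwise majorities the relation is a tournament, and a
    tournament contains a cycle if and only if it contains a 3-cycle.
    """
    candidates = rankings[0]
    return any(
        majority_prefers(rankings, a, b)
        and majority_prefers(rankings, b, c)
        and majority_prefers(rankings, c, a)
        for a, b, c in permutations(candidates, 3)
    )


cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
transitive = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
print(has_condorcet_cycle(cyclic), condorcet_winner(cyclic))
print(has_condorcet_cycle(transitive), condorcet_winner(transitive))
```

The first profile is the classic Condorcet paradox: every response loses to some other response by a 2-to-1 majority, so no single scalar reward can order them consistently.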

2501.06540 2026-05-04 cs.CV math.ST stat.AP stat.ME stat.TH

Copula-enhanced Vision Transformer for high myopia diagnosis through OU UWF fundus images

Chong Zhong, Yunhao Liu, Yang Li, Xiang Fu, Jin Yang, Danjuan Yang, Meiyan Li, Jinfeng Xu, Aiyi Liu, Alan H. Welsh, Xingtao Zhou, Bo Fu, Catherine C. Liu

Abstract

The advancement of AI-assisted myopia screening necessitates the joint diagnosis of both-eye (OU) high myopia (HM) status and the prediction of axial length (AL). This clinical requirement introduces a complex mixed-type (binary-continuous) multitask learning task with bi-domain (OU) image covariates, giving rise to two key challenges: i) capturing the inter-ocular asymmetry of OU images within a cutting-edge foundation model; ii) modeling and estimating the conditional dependence structure among mixed-type multivariate responses given image covariates. We address the challenges by: i) imposing residual adapters on the Vision Transformer foundation model to capture the OU similarity and heterogeneity simultaneously; ii) developing a four-dimensional copula loss that is implementable in PyTorch based on a latent variable expression for the Gaussian copula likelihood, and proposing a computationally efficient fast Monte Carlo Expectation Maximization (fMCEM) algorithm to estimate copula parameters. We further formulate a specific overfitting problem in multitask learning, called the stronger covariance phenomenon. We show how this phenomenon disturbs the estimation of copula parameters and theoretically demonstrate the numerical stability of the proposed fMCEM algorithm against the disturbance. The application to our annotated OU ultra-widefield fundus image dataset and simulation on synthetic data demonstrate that our method stably enhances the predictive capabilities on both classification and regression tasks.

2409.02331 2026-05-04 stat.ME

A parameterization of anisotropic Gaussian fields with penalized complexity priors

Liam Llamazares-Elias, Jonas Latz, Finn Lindgren

Comments v2: revised version, accepted for publication in the Journal of the American Statistical Association

Abstract

Gaussian random fields (GFs) are fundamental tools in spatial modeling and can be represented flexibly and efficiently as solutions to stochastic partial differential equations (SPDEs). The SPDEs depend on specific parameters, which enforce various field behaviors and can be estimated using Bayesian inference. However, even under in-fill asymptotics, the likelihood only provides limited insights into the covariance structure. In response, it is essential to leverage priors to achieve appropriate, meaningful covariance structures in the posterior. This study introduces a smooth, invertible parameterization of the correlation length and diffusion matrix of an anisotropic GF and constructs penalized complexity (PC) priors for the model when the parameters are constant in space. The formulated prior is weakly informative, effectively penalizing complexity by pushing the correlation range toward infinity and the anisotropy to zero.

2404.13164 2026-05-04 stat.CO cs.CR

Least Squares Estimation For Hierarchical Data

Ryan Cumings-Menon, Pavel Zhuravlev

Abstract

The U.S. Census Bureau's 2020 Disclosure Avoidance System (DAS) bases its output on noisy measurements, which are population tabulations added to realizations of mean-zero random variables. These noisy measurements are observed for a set of hierarchical geographic levels, e.g., the U.S. as a whole, states, counties, census tracts, and census blocks. The Census Bureau released the noisy measurements generated in the DAS executions for the two primary 2020 Census data products, in part to allow data users to assess uncertainty in 2020 Census tabulations introduced by disclosure avoidance. This paper describes an algorithm that can leverage the hierarchical structure of the input data in order to compute very high dimensional least squares estimates in a computationally efficient manner. Afterward, we show that this algorithm's output is equal to the generalized least squares estimator, describe how to find the variance of linear functions of this estimator, and provide a numerical experiment in which we compute confidence intervals of tabulations based on this estimator. We also describe an accompanying Census Bureau experimental data product that applies this estimator to the publicly available noisy measurements to provide data users with the inputs required to derive confidence intervals for all tabulations that were included in the 2020 Redistricting Data File, for the U.S., state, county, and census tract geographic levels.

2312.10234 2026-05-04 stat.ME stat.ML

Flexible Nonparametric Inference for Causal Effects under the Front-Door Model

Anna Guo, David Benkeser, Razieh Nabi

Abstract

Evaluating causal treatment effects in observational studies requires addressing confounding. While the back-door criterion enables identification through adjustment for observed covariates, it fails in the presence of unmeasured confounding. The front-door criterion offers an alternative by leveraging variables that fully mediate the treatment effect and are unaffected by unmeasured confounders of the treatment-outcome pair. We develop novel one-step and targeted minimum loss-based estimators for both the average treatment effect and the average treatment effect on the treated under front-door assumptions. Our estimators are built on multiple parameterizations of the observed data distribution, including approaches that avoid modeling the mediator density entirely, and are compatible with flexible, machine learning-based nuisance estimation. We establish conditions for root-n consistency and asymptotic linearity by deriving second-order remainder bounds. We also develop flexible tests for assessing identification assumptions, including a doubly robust testing procedure, within a semiparametric extension of the front-door model that encodes generalized (Verma) independence constraints. We further show how these constraints can be leveraged to improve the efficiency of causal effect estimators. Simulation studies confirm favorable finite-sample performance, and real-data applications in education and emergency medicine illustrate the practical utility of our methods.

2202.00814 2026-05-04 stat.ME stat.AP

Adjustment for Unmeasured Spatial Confounding in Settings of Continuous Exposure Conditional on the Binary Exposure Status: Conditional Generalized Propensity Score-Based Spatial Matching

Honghyok Kim, Michelle Bell

Comments Online supplementary materials are appended at the bottom of the main PDF. As of 2026, under revision at a method-oriented journal

Abstract

Propensity score (PS) matching to estimate causal effects of exposure is biased when unmeasured spatial confounding exists. Some exposures are continuous yet dependent on a binary variable (e.g., level of a contaminant (continuous) within a specified radius from residence (binary)). Further, unmeasured spatial confounding may vary by spatial patterns for both continuous and binary attributes of exposure. We propose a new generalized propensity score (GPS) matching method for such settings, referred to as conditional GPS (CGPS)-based spatial matching (CGPSsm). A motivating example is to investigate the association between proximity to refineries with high petroleum production and refining (PPR) and stroke prevalence in the southeastern United States. CGPSsm matches exposed observational units (e.g., exposed participants) to unexposed units by their spatial proximity and GPS integrated with spatial information. GPS is estimated by separately estimating PS for the binary status (exposed vs. unexposed) and CGPS on the binary status. CGPSsm maintains the salient benefits of PS matching and spatial analysis: straightforward assessments of covariate balance and adjustment for unmeasured spatial confounding. Simulations showed that CGPSsm can adjust for unmeasured spatial confounding. Using our example, we found a positive association between PPR and stroke prevalence. Our R package, CGPSspatialmatch, has been made publicly available.

2107.01742 2026-05-04 stat.ME stat.CO

Nonparametric Detection of Multiple Location-Scale Change Points via Wild Binary Segmentation

Gordon J. Ross

Abstract

Change point methods are used to divide a sequence of observations into segments with different behaviour. Often, the distributional form of the observations is unknown, but the changes of interest are likely to involve shifts in location, scale, or both. We consider the problem of detecting multiple change points in a sequence without specifying a parametric model for the data. We propose the WBS-Lepage procedure, a nonparametric method which combines wild binary segmentation with a rank-based Lepage statistic. The statistic is formed from Mann--Whitney and Mood components, which are respectively sensitive to changes in location and scale. Since it depends on the observations only through their ranks, its null distribution is distribution-free. This allows finite-sample thresholds to be calibrated by Monte Carlo simulation, providing direct control over the probability of falsely detecting change points when none exist. We compare WBS-Lepage with existing nonparametric change point methods, including penalised likelihood and binary-segmentation-based competitors. The proposed method performs competitively for location changes and is particularly effective for detecting changes in scale. We illustrate the procedure on a stylometric analysis of changes in an author's writing style and provide an implementation of our method in the accompanying R package npwbs.
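To make the rank-based statistic concrete, the sketch below computes a Lepage-type statistic for one candidate split as the sum of a squared standardized Mann-Whitney (rank-sum) component and a squared standardized Mood component, using the standard no-ties null moments. It is a two-sample illustration under the assumption of no tied observations, not the npwbs implementation, and it omits the wild binary segmentation search and the Monte Carlo thresholds; the function name is hypothetical.

```python
def lepage_type_statistic(x, y):
    """Squared standardized rank-sum plus squared standardized Mood statistic.

    Assumes no ties in the pooled sample. Depends on the data only
    through the ranks of x within the pooled sample.
    """
    pooled = sorted(list(x) + list(y))
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # valid without ties
    m, n = len(x), len(y)
    N = m + n
    ranks_x = [rank[value] for value in x]

    # Location component: rank sum of x, with its null mean and variance.
    w = sum(ranks_x)
    w_mean = m * (N + 1) / 2
    w_var = m * n * (N + 1) / 12

    # Scale component: Mood statistic, with its null mean and variance.
    mood = sum((r - (N + 1) / 2) ** 2 for r in ranks_x)
    mood_mean = m * (N * N - 1) / 12
    mood_var = m * n * (N + 1) * (N * N - 4) / 180

    return (w - w_mean) ** 2 / w_var + (mood - mood_mean) ** 2 / mood_var


location_shift = lepage_type_statistic([1, 2, 3], [4, 5, 6])
scale_shift = lepage_type_statistic([3, 4], [1, 2, 5, 6])
print(location_shift, scale_shift)
```

In the first call only the location component is nonzero, and in the second only the scale component is, which shows how the two parts respond to different kinds of change.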