arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.20184 2026-03-23 cs.LG stat.ML

Kolmogorov-Arnold causal generative models

Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras

Comments 14 pages, 8 figures, 3 tables, 5 algorithms, preprint

Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data. However, many deep causal models rely on highly expressive architectures with opaque mechanisms, limiting auditability in high-stakes domains. We propose KaCGM, a causal generative model for mixed-type tabular data where each structural equation is parameterized by a Kolmogorov--Arnold Network (KAN). This decomposition enables direct inspection of learned causal mechanisms, including symbolic approximations and visualization of parent--child relationships, while preserving query-agnostic generative semantics. We introduce a validation pipeline based on distributional matching and independence diagnostics of inferred exogenous variables, allowing assessment using observational data alone. Experiments on synthetic and semi-synthetic benchmarks show competitive performance against state-of-the-art methods. A real-world cardiovascular case study further demonstrates the extraction of simplified structural equations and interpretable causal effects. These results suggest that expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision-making settings. Code: https://github.com/aalmodovares/kacgm

2603.20155 2026-03-23 cs.LG cs.CV stat.ML

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans

It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can reduce sampling steps to a handful. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.

2603.20135 2026-03-23 math.ST cs.IT math.IT stat.TH

Classifier-Based Nonparametric Sequential Hypothesis Testing

Chia-Yu Hsu, Shubhanshu Shekhar

We consider the problem of constructing sequential power-one tests where the null and alternative classes are specified indirectly through historical or offline data. More specifically, given an offline dataset consisting of observations from $L+1$ distributions $\{P_0, P_1, \ldots, P_L\}$, and a new unlabeled data stream $\{X_t: t \geq 1\} \overset{i.i.d}{\sim} P_θ$, the goal is to decide between the null $H_0: θ= 0$, against the alternative $H_1: θ\in [L]:=\{1,\ldots,L\}$. Our main methodological contribution is a general approach for designing a level-$α$ power-one test for this problem using a multi-class classifier trained on the given offline dataset. Working under a mild "separability" condition on the distributions and the trained classifier, we obtain an upper bound on the expected stopping time of our proposed level-$α$ test, and then show that in general this cannot be improved. In addition to rejecting the null, we show that our procedure can also identify the true underlying distribution almost surely. We then establish a sufficient condition to ensure the required separability of the classifier, and provide some converse results to investigate the role of the size of the offline dataset and the family of classifiers among classifier-based tests that satisfy the level-$α$ power-one criterion. Finally, we present an extension of our analysis for the training-and-testing distribution mismatch and illustrate an application to sequential change detection. Empirical results using both synthetic and real data provide support for our theoretical results.

2603.20134 2026-03-23 econ.EM math.ST stat.TH

Triple/Double-Debiased Lasso

Denis Chetverikov, Jesper R.-V. Sørensen, Aleh Tsyvinski

Comments 47 pages, 10 figures

In this paper, we propose a triple (or double-debiased) Lasso estimator for inference on a low-dimensional parameter in high-dimensional linear regression models. The estimator is based on a moment function that satisfies not only first- but also second-order Neyman orthogonality conditions, thereby eliminating both the leading bias and the second-order bias induced by regularization. We derive an asymptotic linear representation for the proposed estimator and show that its remainder terms are never larger and are often smaller in order than those in the corresponding asymptotic linear representation for the standard double Lasso estimator. Because of this improvement, the triple Lasso estimator often yields more accurate finite-sample inference and confidence intervals with better coverage. Monte Carlo simulations confirm these gains. In addition, we provide a general recursive formula for constructing higher-order Neyman orthogonal moment functions in Z-estimation problems, which underlies the proposed estimator as a special case.

2603.20082 2026-03-23 math.ST stat.ME stat.TH

Inference in high-dimensional logistic regression under tensor network dependence

Josh Miles, Sohom Bhattacharya

We investigate the problem of statistical inference for logistic regression with high-dimensional covariates in settings where dependence among individuals is induced by an underlying Markov random field. Going beyond the pairwise interaction models such as the Ising model, we consider a framework to accommodate more general tensor structures that capture higher-order dependencies. We develop a two-step procedure for low-dimensional linear and quadratic functionals. The first step constructs a regularized maximum pseudolikelihood estimator, for which we establish consistency under high-dimensional features. However, as in other classical high-dimensional regression problems, this estimator is biased and cannot be directly used for valid statistical inference. The second step introduces a bias-correction that yields an asymptotically normal estimator from which one can construct confidence intervals and test hypotheses. Our results move beyond the existing literature, where only estimation guarantees were available or only for pairwise interaction models. We complement our theoretical analysis with simulation studies confirming the effectiveness of the proposed method.

2603.20071 2026-03-23 stat.ME math.ST stat.TH

Posterior inference via Hill's prediction model

Pier Giovanni Bissiri, Chris Holmes, Stephen G. Walker

Comments 23 pages, 7 figures

This paper is concerned with the construction of prior-free posterior distributions which rely on the use of one-step-ahead predictive distribution functions. These are typically more straightforward to motivate than prior distributions. Recent interest has centered on Hill's $A_n$ prediction model through what has become known as conformal prediction. This model predicts the next observation to lie with equal probability in the intervals created by the observed data. The prediction model generates complete data sets which can be used to provide posterior inference on any statistic of interest.
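
The equal-probability rule described in this abstract can be illustrated with a minimal Python sketch. It assumes $n$ distinct real-valued observations, so that they create $n+1$ intervals, each receiving probability $1/(n+1)$. The function names and the sample data are hypothetical, and the sketch covers only the one-step prediction rule, not the paper's construction of complete data sets.

```python
import random

def hill_an_intervals(data):
    """Return the n+1 intervals created by n distinct observations,
    each paired with the equal probability 1/(n+1)."""
    xs = sorted(data)
    edges = [float("-inf")] + xs + [float("inf")]
    p = 1.0 / (len(xs) + 1)
    return [((edges[i], edges[i + 1]), p) for i in range(len(xs) + 1)]

def sample_next_interval(data, rng):
    """Draw the interval that the next observation falls into."""
    # All intervals are equally likely, so a uniform choice is enough.
    return rng.choice(hill_an_intervals(data))[0]

# Hypothetical data: three observations create four intervals.
intervals = hill_an_intervals([2.3, 0.7, 1.5])
for (lo, hi), p in intervals:
    print(f"({lo}, {hi}): {p:.2f}")

print(sample_next_interval([2.3, 0.7, 1.5], random.Random(0)))
```

The rule assigns probability to intervals only; how mass is spread within an interval (in particular the two unbounded ones) is left open here.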

2603.20070 2026-03-23 math.ST cond-mat.stat-mech cs.CC stat.ML stat.TH

The monotonicity of the Franz-Parisi potential is equivalent with Low-degree MMSE lower bounds

Konstantinos Tsirkas, Leda Wang, Ilias Zadik

Comments 92 pages

Over the last decades, two distinct approaches have been instrumental to our understanding of the computational complexity of statistical estimation. The statistical physics literature predicts algorithmic hardness through local stability and monotonicity properties of the Franz--Parisi (FP) potential \cite{franz1995recipes,franz1997phase}, while the mathematically rigorous literature characterizes hardness via the limitations of restricted algorithmic classes, most notably low-degree polynomial estimators \cite{hopkins2017efficient}. For many inference models, these two perspectives yield strikingly consistent predictions, giving rise to a long-standing open problem of establishing a precise mathematical relationship between them. In this work, we show that for estimation problems the power of low-degree polynomials is equivalent to the monotonicity of the annealed FP potential for a broad family of Gaussian additive models (GAMs) with signal-to-noise ratio $λ$. In particular, subject to a low-degree conjecture for GAMs, our results imply that the polynomial-time limits of these models are directly implied by the monotonicity of the annealed FP potential, in conceptual agreement with predictions from the physics literature dating back to the 1990s.

2603.20068 2026-03-23 stat.ME stat.CO

Approximate posterior recalibration

Tiffany Cai, Philip Greengard, Ben Goodrich, Andrew Gelman

Bayesian inference is often implemented using approximations, which can yield interval estimates that are too narrow, not fully capturing the uncertainty in the posterior distribution. We address the question of how to adjust these approximate posteriors so that they appropriately capture uncertainty. We introduce two methods that extend simulation-based calibration checking (SBC) to widen approximate posterior uncertainty intervals to aim for marginal calibration. We demonstrate these methods in several experimental settings, and we discuss the challenge of calibration using posterior inferences and the potential for posterior recalibration of hierarchical models.

2603.20052 2026-03-23 physics.ao-ph stat.AP

Uncertainty in wind and solar projections depends on global and regional climate models

Nina Effenberger, Reto Knutti

Ensembles of regional-global climate model combinations show substantial spread in projected wind and solar resources. Using 31 RCM-GCM pairs, we quantify the sources of this spread with a spatially and seasonally resolved variance decomposition, separating contributions from RCMs and GCMs. For both wind speed and solar radiation, RCMs dominate the variability in the absolute historical fields. In contrast, projected changes in wind speed are largely controlled by the driving GCMs, except in mountainous regions where RCM-induced variance becomes larger than that induced by GCMs. For solar radiation, contributions are strongly season-dependent, with RCMs dominating in summer and GCMs in winter. Our findings support that GCM and RCM variability together define the uncertainty of wind and solar climate projections. This provides guidance for designing climate model ensembles that better support uncertainty-aware energy system decisions under climate change.

2603.20022 2026-03-23 stat.ME

Q-approximation of operating characteristics of clinical trial designs

Susanna Gentile, Daniel E. Schwartz, Riddhiman Saha, Lorenzo Trippa

Designing clinical trials requires evaluating multiple operating characteristics (OCs), such as the likelihood of an early stopping decision, the probability of detecting a treatment effect, and the Type I error rate. In most cases, these evaluations are based on computationally intensive Monte Carlo simulations. As the complexity of clinical trials and the use of adaptive designs increase, the computational burden can quickly become prohibitive. We introduce a strategy for rapidly approximating OCs, called the Q-approximation. Our approach is based on quadratic approximations of the log-likelihood and asymptotic arguments. The Q-approximation approach can be applied to any trial design that uses data analysis methods coherent with the likelihood principle, including multistage designs with early stopping, adaptively randomized designs, and designs that leverage external data. We illustrate the approach with several examples and show that it provides an accurate approximation of important OCs while reducing the computation time compared to Monte Carlo simulations. In particular, in our experiments, the standard Monte Carlo approximation of OCs requires 150 to 1,900 times greater computing budget than Q-approximations to achieve comparable levels of accuracy.

2603.20015 2026-03-23 stat.ME stat.AP

On the Calibration of Bayesian Success Criteria and Operating Characteristics for Clinical Trials

Peng Yang, Li Wang, Ying Yuan

Recently, the U.S. Food and Drug Administration (FDA) released draft guidance \citep{FDA2026} signaling a paradigm shift that facilitates the use of Bayesian methodology as the primary analysis and decision framework for drug approval. The cornerstone and fundamental challenge of this framework is the specification and calibration of Bayesian success criteria to control decision errors, ensuring reliable clinical and regulatory outcomes. In this work, we systematically investigate various Bayesian decision-error metrics, their theoretical interrelationships, and their alignment with conventional Frequentist counterparts. This investigation provides critical theoretical insights and practical guidance on calibrating Bayesian success criteria and operating characteristics to ensure robust decision-making and the integrity of public health decisions. We illustrate this framework using a clinical trial evaluating revascularization strategies for cardiogenic shock. A Shiny application will be available at www.trialdesign.org to assist sponsors and regulators in evaluating calibration strategies consistent with recent regulatory perspectives.

2603.19986 2026-03-23 stat.AP

Probabilistic Estimation of Hidden Migrant Fatalities Along the Central Mediterranean Route

Gregor Zens, Zoe Sigman

Estimating the number of migrants who die or go missing along dangerous routes such as the Central Mediterranean remains challenging as available records are incomplete. Some incidents are never documented, and fatalities associated with such unobserved incidents are absent from observed totals. We propose a Bayesian approach for probabilistic estimation of total migrant fatalities in such settings. Building on recent developments in multiple-systems estimation, we develop a time-stratified latent-class framework that accommodates missing fatality counts for unobserved incidents. We apply the method to recoded incident-level data from the Missing Migrants Project for the Central Mediterranean route from 2014 to 2025, encompassing 25,712 fatalities across 1,562 incidents. Our model yields 95% credible intervals of 30,426-39,172 fatalities and 2,200-2,591 deadly incidents, indicating that approximately 66%-85% of fatalities and 60%-71% of incidents are reflected in the available data. We estimate that unreported fatalities were concentrated between 2014 and 2016. Furthermore, we document that reporting likelihood increases with incident severity, implying that smaller incidents are most likely to remain undetected. While contingent on modeling assumptions and incomplete data, our method provides a broadly applicable and principled alternative to naive data adjustment methods.
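
The coverage percentages quoted in this abstract can be approximately reproduced from the other reported figures. The short Python check below divides the observed counts by the endpoints of the reported credible intervals. This is a simplifying assumption: the authors may derive the coverage range from the posterior of the ratio itself, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
observed_fatalities = 25_712
observed_incidents = 1_562
fatalities_interval = (30_426, 39_172)  # 95% credible interval for total fatalities
incidents_interval = (2_200, 2_591)     # 95% credible interval for total deadly incidents

def coverage_range(observed, interval):
    """Share of the estimated total that appears in the observed data."""
    lower_total, upper_total = interval
    # A larger estimated total means a smaller observed share, so the bounds swap.
    return observed / upper_total, observed / lower_total

fat_low, fat_high = coverage_range(observed_fatalities, fatalities_interval)
inc_low, inc_high = coverage_range(observed_incidents, incidents_interval)
print(f"fatalities covered: {fat_low:.0%} to {fat_high:.0%}")
print(f"incidents covered: {inc_low:.0%} to {inc_high:.0%}")
```

The result matches the reported ranges of roughly 66% to 85% for fatalities and 60% to 71% for incidents.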

2603.19977 2026-03-23 stat.ME

Scalable and Robust Spatial Prediction via Multi-Resolution Ensembles of Predictive Processes

Nicolas Bianco, Nadja Klein

Gaussian processes provide a flexible framework for spatial prediction, but their computational cost limits applicability to large-scale data with large sample size $n$. Predictive processes (PPs), a popular low-rank approximation, mitigate this burden by projecting the original process onto a reduced set of $m\ll n$ inducing points. However, existing theory requires $m$ to grow with $n$, creating a trade-off between accuracy and computational efficiency. We address this challenge by introducing an ensemble of PPs based on spatial partitioning, and propose a novel partitioning and patching scheme with desirable properties. By generalizing the convergence results of PPs, it becomes possible to explicitly balance scalability and accuracy: increasing the number of ensemble components slows down the convergence but substantially improves computational efficiency. We further show theoretically that, despite the limited approximation accuracy of PPs with fixed $m$, they are asymptotically robust to data contamination. Motivated by this insight, we finally introduce a multi-resolution ensemble that combines PPs with fixed $m$ with multiple ensembles defined over possibly overlapping coarse to fine partitions. Simulations and large-scale geostatistical applications demonstrate that our approach delivers accurate, robust predictions with computational gains, providing a practical and broadly applicable solution for spatial prediction.

2603.19945 2026-03-23 stat.ME

Cancer Survival Rates Are Misleading

Allen B. Downey

Comments 10 pages, 1 PDF figure. Companion analysis notebook: https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/examples/cancer.ipynb

Five-year cancer survival rates are widely reported and often interpreted to mean that early detection saves lives, that a late fatal diagnosis would have been prevented by earlier detection, and that increasing survival over time proves better treatment. This expository article explains why such inferences are not supported by survival statistics alone. A simple Markov model of tumor progression, calibrated to patterns like those in SEER data, shows that high survival after early diagnosis, large gaps between early and late stage, and improving survival can all appear even when treatment is ineffective and screening does not reduce mortality. The discussion ties these points to the clinical literature and argues that randomized trials and mortality outcomes are needed to support screening and treatment claims; five-year survival alone provides little actionable evidence and is easily misread.

2603.19902 2026-03-23 cond-mat.dis-nn stat.ML

A Federated Many-to-One Hopfield model for associative Neural Networks

Andrea Alessandrelli, Fabrizio Durante, Andrea Ladiana, Andrea Lepre

Federated learning enables collaborative training without sharing raw data, but struggles under client heterogeneity and streaming distribution shifts, where drift and novel data can impair convergence and cause forgetting. We propose a federated associative-memory framework that learns shared archetypes in heterogeneous, continual settings, where client data are independent but not necessarily balanced. Each client encodes its experience as a low-rank Hebbian operator, sent to a central server for aggregation and factorization into global archetypes. This approach preserves privacy, avoids centralized replay buffers, and is robust to small, noisy, or evolving datasets. We cast aggregation as a low-rank-plus-noise spectral inference problem, deriving theoretical thresholds for detectability and retrieval robustness. An entropy-based controller balances stability and plasticity in streaming regimes. Experiments with heterogeneous clients, drift, and novelty show improved global archetype reconstruction and associative retrieval, supporting the spectral view of federated consolidation.

2603.19899 2026-03-23 stat.ML cs.LG stat.AP

Deep Autocorrelation Modeling for Time-Series Forecasting: Progress and Prospects

Hao Wang, Licheng Pan, Qingsong Wen, Jialin Yu, Zhichao Chen, Chunyuan Zheng, Xiaoxi Li, Zhixuan Chu, Chao Xu, Mingming Gong, Haoxuan Li, Yuan Lu, Zhouchen Lin, Philip Torr, Yan Liu

Autocorrelation is a defining characteristic of time-series data, where each observation is statistically dependent on its predecessors. In the context of deep time-series forecasting, autocorrelation arises in both the input history and the label sequences, presenting two central research challenges: (1) designing neural architectures that model autocorrelation in history sequences, and (2) devising learning objectives that model autocorrelation in label sequences. Recent studies have made strides in tackling these challenges, but a systematic survey examining both aspects remains lacking. To bridge this gap, this paper provides a comprehensive review of deep time-series forecasting from the perspective of autocorrelation modeling. In contrast to existing surveys, this work makes two distinctive contributions. First, it proposes a novel taxonomy that encompasses recent literature on both model architectures and learning objectives -- whereas prior surveys neglect or inadequately discuss the latter aspect. Second, it offers a thorough analysis of the motivations, insights, and progression of the surveyed literature from a unified, autocorrelation-centric perspective, providing a holistic overview of the evolution of deep time-series forecasting. The full list of papers and resources is available at https://github.com/Master-PLC/Awesome-TSF-Papers.

2603.19840 2026-03-23 stat.ML cs.LG

Explainable cluster analysis: a bagging approach

Federico Maria Quetti, Elena Ballante, Silvia Figini, Paolo Giudici

A major limitation of clustering approaches is their lack of explainability: methods rarely provide insight into which features drive the grouping of similar observations. To address this limitation, we propose an ensemble-based clustering framework that integrates bagging and feature dropout to generate feature importance scores, in analogy with feature importance mechanisms in supervised random forests. By leveraging multiple bootstrap resampling schemes and aggregating the resulting partitions, the method improves stability and robustness of the cluster definition, particularly in small-sample or noisy settings. Feature importance is assessed through an information-theoretic approach: at each step, the mutual information between each feature and the estimated cluster labels is computed and weighted by a measure of clustering validity to emphasize well-formed partitions, before being aggregated into a final score. The method outputs both a consensus partition and a corresponding measure of feature importance, enabling a unified interpretation of clustering structure and variable relevance. Its effectiveness is demonstrated on multiple simulated and real-world datasets.
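
The information-theoretic step mentioned in this abstract, the mutual information between a feature and the estimated cluster labels, can be sketched in plain Python for a discrete feature. This is only the basic plug-in estimate. The validity weighting, bagging, and feature dropout described in the abstract are not reproduced, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Plug-in mutual information (in bits) between a discrete feature
    and cluster labels, computed from empirical frequencies."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, l), c in joint.items():
        # p(f, l) * log2( p(f, l) / (p(f) * p(l)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (feature_counts[f] * label_counts[l]))
    return mi

# Toy example: one feature tracks the clusters exactly, the other is unrelated.
labels = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]
noise = ["x", "y", "x", "y"]
print(mutual_information(informative, labels))
print(mutual_information(noise, labels))
```

A feature that determines the cluster label scores one bit in this two-cluster toy case, while a feature independent of the labels scores zero.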

2603.19804 2026-03-23 math.ST stat.ML stat.TH

Uncertainty Quantification Via the Posterior Predictive Variance

Sanjay Chaudhuri, Dean Dustin, Bertrand Clarke

We use the law of total variance to generate multiple expansions for the posterior predictive variance. These expansions are sums of terms involving conditional expectations and conditional variances and provide a quantification of the sources of predictive uncertainty. Since the posterior predictive variance is fixed given the model, it represents a constant quantity that is conserved over these expansions. The terms in the expansions can be assessed in an absolute or relative sense to understand the main contributors to the length of prediction intervals. We quantify the term-wise uncertainty across expansions varying in the number of terms and the order of conditionates. In particular, given that a specific term in one expansion is small or zero, we identify the other terms in other expansions that must also be small or zero. We illustrate this approach to predictive model assessment in several well-known models.
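
The two-term version of the law of total variance, $\mathrm{Var}(Y) = \mathrm{E}[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(\mathrm{E}[Y \mid X])$, can be checked exactly on a small discrete example. The Python sketch below uses a made-up joint distribution of $(X, Y)$ rather than a posterior predictive distribution, so it illustrates only the conservation property that the abstract relies on.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y).
joint = {
    (0, 0): F(1, 4), (0, 2): F(1, 4),
    (1, 1): F(1, 8), (1, 5): F(3, 8),
}

def mean(dist):
    return sum(value * p for value, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum((value - m) ** 2 * p for value, p in dist.items())

# Marginal distributions of Y and X.
marg_y, marg_x = {}, {}
for (x, y), p in joint.items():
    marg_y[y] = marg_y.get(y, 0) + p
    marg_x[x] = marg_x.get(x, 0) + p

# Conditional distributions of Y given each value of X.
cond = {
    x: {y: p / marg_x[x] for (xx, y), p in joint.items() if xx == x}
    for x in marg_x
}

total_var = variance(marg_y)
expected_cond_var = sum(marg_x[x] * variance(cond[x]) for x in marg_x)
var_of_cond_mean = sum(marg_x[x] * (mean(cond[x]) - mean(marg_y)) ** 2 for x in marg_x)

print(total_var, expected_cond_var, var_of_cond_mean)
```

Here the total variance 17/4 splits into 2 (expected conditional variance) plus 9/4 (variance of the conditional mean). Conditioning on further variables would split these terms again while leaving the total unchanged.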

2603.19799 2026-03-23 stat.ME stat.CO

Estimation of Multivariate Functional Principal Components from Sparse Functional Data

Uche Mbaka, Michelle Carey

Traditional Functional Principal Component Analysis typically focuses on densely observed univariate functional data, yet many applications, particularly in longitudinal studies, involve multivariate functional data observed sparsely and irregularly across subjects. A common approach for extracting multivariate functional principal components in such settings relies on an eigen decomposition of univariate functional principal component scores to capture cross-component correlations. We propose a new approach for the estimation of multivariate functional principal components by improving the univariate eigenanalysis through maximum likelihood estimation combined with a modified Gram-Schmidt orthonormalization. The performance of the proposed approach is evaluated against two established methods, and its practical utility is demonstrated through an application to longitudinal cognitive biomarker data from an Alzheimer's disease study and a collection of data on dairy milk yield and milk compositions from research dairy farms in Ireland.

2603.19792 2026-03-23 cs.LG cs.DS stat.CO stat.ME stat.ML

Scalable Learning of Multivariate Distributions via Coresets

Zeyu Ding, Katja Ickstadt, Nadja Klein, Alexander Munteanu, Simon Omlor

Comments AISTATS 2026

Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of $(1\pm\varepsilon)$ and thereby maintains statistical model accuracy. Compared to conventional full-parametric models, where coresets have been incorporated before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present, but not fully understood. To address numerical problems associated with normalizing logarithmic terms, we follow a geometric approximation based on the convex hull of input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.

2603.19756 2026-03-23 stat.AP

Extraction of tabulated statistical results with tableParser

Ingmar Böschen

Comments 16 pages, 14 tables

Tabulated content is omnipresent in scientific literature. This work presents the R package *tableParser*, designed to extract and postprocess tables from NISO-JATS-encoded XML, HTML, DOCX, and, with limitations, PDF documents. *tableParser* focuses on extracting and analyzing statistical test results reported in scientific publications. It can be used for large-scale analysis of effect sizes, reporting practices, or summarization of results, as well as for checking completeness and consistency of standard test results in unpublished documents. Documents can be processed at three decoding levels. *table2matrix()* compiles all tables into a list of character matrices with captions and footnotes. *table2text()* collapses the matrix contents into human-readable text, mimicking a screen reader. Optionally, many common codings that are reported within the table's caption and footnote can be used to decode and expand the table's content. The collapsed and decoded table content can be further processed to match an ideal input for the extraction of statistical standard results with the *standardStats()* function from the *JATSdecoder* package. The output of *table2stats()* is a data frame with all detected standard results as columns and, if calculation is possible, a recalculated p-value. If desired, an automated consistency check of the reported and the coded p-values with the recalculated p-value can be initiated. *tableParser* works best on barrier-free HTML tables encoded in NISO-JATS, where captions and footnotes are clearly identifiable. Because the tables' captions and footnotes are guessed conservatively, the processing of tables within HTML and DOCX documents is comparably robust. Tables in PDFs often fail to be extracted correctly, and their captions and footnotes cannot be detected. Codes therefore cannot be decoded, which lowers *tableParser*'s decoding accuracy on PDFs.

2603.19755 2026-03-23 math.AP stat.ML

Regularity of Solutions to Beckmann's Parametric Optimal Transport

Hanno Gottschalk, Tobias J. Riedlinger

Comments arXiv admin note: text overlap with arXiv:2503.10729

Beckmann's problem in optimal transport minimizes the total squared flux in a continuous transport problem from a source to a target distribution. In this article, the regularity theory for solutions to Beckmann's problem in optimal transport is developed utilizing an unconstrained Lagrangian formulation and solving the variational first-order optimality conditions. It turns out that the Lagrange multiplier that enforces Beckmann's divergence constraint fulfills a Poisson equation and the flux vector field is obtained as the potential's gradient. Utilizing Schauder estimates from elliptic regularity theory, the exact Hölder regularity of the potential, the flux, and the generated flow is derived on the basis of Hölder regularity of source and target densities on a bounded, regular domain. If the target distribution depends on parameters, as is the case in conditional ("promptable") generative learning, we provide sufficient conditions for separate and joint Hölder continuity of the resulting vector field in the parameter and the data dimension. Following a recent result by Belomestny et al., one can thus approximate such vector fields with deep ReQu neural networks in $C^{k,\alpha}$-Hölder norm. We also show that this approach generalizes to other probability paths, like Fisher-Rao gradient flows.

2603.19736 2026-03-23 stat.ML cs.LG

A two-step sequential approach for hyperparameter selection in finite context models

José Contente, Ana Martins, Armando J. Pinho, Sónia Gouveia

Finite-context models (FCMs) are widely used for compressing symbolic sequences such as DNA, where predictive performance depends critically on the context length k and smoothing parameter α. In practice, these hyperparameters are typically selected through exhaustive search, which is computationally expensive and scales poorly with model complexity. This paper proposes a statistically grounded two-step sequential approach for efficient hyperparameter selection in FCMs. The key idea is to decompose the joint optimization problem into two independent stages. First, the context length k is estimated using categorical serial dependence measures, including Cramér's ν, Cohen's κ and partial mutual information (pami). Second, the smoothing parameter α is estimated via maximum likelihood conditional on the selected context length k. Simulation experiments were conducted on synthetic symbolic sequences generated by FCMs across multiple (k, α) configurations, considering a four-letter alphabet and different sample sizes. Results show that the dependence measures are substantially more sensitive to variations in k than in α, supporting the sequential estimation strategy. As expected, the accuracy of the hyperparameter estimation improves with increasing sample size. Furthermore, the proposed method achieves compression performance comparable to exhaustive grid search in terms of average bitrate (bits per symbol), while substantially reducing computational cost. Overall, the results on simulated data show that the proposed sequential approach is a practical and computationally efficient alternative to exhaustive hyperparameter tuning in FCMs.
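
To make the roles of k and α concrete, the following Python sketch computes the average bitrate of an adaptive order-k finite-context model on a symbolic sequence. It assumes the commonly used additive-smoothing estimator $P(s \mid c) = (n_{c,s} + α) / (n_c + α|A|)$, which the abstract does not spell out. The function name and the test sequence are hypothetical, and the paper's two-step selection procedure is not implemented.

```python
import math
from collections import defaultdict

def fcm_bits_per_symbol(sequence, k, alpha, alphabet="ACGT"):
    """Average code length (bits per symbol) of an adaptive order-k
    finite-context model with additive smoothing parameter alpha."""
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    n_coded = 0
    for i in range(k, len(sequence)):
        context = sequence[i - k:i]
        symbol = sequence[i]
        ctx_counts = counts[context]
        ctx_total = sum(ctx_counts.values())
        # Assumed estimator: smoothed relative frequency of the symbol in this context.
        prob = (ctx_counts[symbol] + alpha) / (ctx_total + alpha * len(alphabet))
        total_bits += -math.log2(prob)
        # Update the counts only after coding, as a sequential compressor would.
        ctx_counts[symbol] += 1
        n_coded += 1
    return total_bits / n_coded

# Hypothetical sequence with strong order-1 dependence.
seq = "ACGT" * 50
for k in (0, 1, 2):
    print(k, round(fcm_bits_per_symbol(seq, k, alpha=1.0), 3))
```

On this periodic sequence the order-0 model stays near 2 bits per symbol, while any k of at least 1 compresses far better. An exhaustive search would evaluate this function over a full grid of (k, α) pairs, which is the cost the sequential approach aims to avoid.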

2603.19728 2026-03-23 stat.ME

Objective Model Prior Probabilities in Variable Selection

James Berger, Gonzalo García-Donato, Elías Moreno, Luis Pericchi

English abstract

For many years it was routine to use equal model prior probabilities in Bayesian model uncertainty analysis. At least twenty years ago it became clear that this was problematic, leading to support of much too large models in the increasingly huge model spaces being considered in genomics and other fields. A popular replacement was to adopt a suggestion of Harold Jeffreys for the variable selection problem in which a total of $k$ possible variables are being considered for inclusion in the model: give the collection of all models containing $d$ variables ($d = 0, \ldots, k$) prior probability $1/(k + 1)$ and then divide this prior probability equally among the models in the collection. Many other choices of model prior probabilities that impose severe parsimony have also been introduced. We begin by reviewing the problems with using equal model prior probabilities and then discuss some serious problems with the Jeffreys choice. Finally, we introduce and study a number of objective alternative choices of model prior probabilities, from both numerical and theoretical perspectives.
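
The two priors contrasted in the abstract are easy to tabulate. The sketch below uses illustrative function names that do not come from the paper. It computes the prior probability of a single model with d of k variables under the Jeffreys-type rule described above, and the mass that equal model probabilities place on all models of size d.

```python
from fractions import Fraction
from math import comb


def jeffreys_model_prior(k, d):
    """Prior of ONE specific model with d of k variables.

    Each model size gets mass 1/(k+1), split equally among the
    comb(k, d) models of that size.
    """
    return Fraction(1, (k + 1) * comb(k, d))


def size_mass_uniform(k, d):
    """Total mass that equal model probabilities (1/2^k each) put on size d."""
    return Fraction(comb(k, d), 2 ** k)


if __name__ == "__main__":
    k = 10
    for d in range(k + 1):
        print(d, float(size_mass_uniform(k, d)), float(jeffreys_model_prior(k, d)))
```

With k = 10, equal model probabilities concentrate almost a quarter of the prior mass on models of size 5, whereas the Jeffreys-type rule gives every model size the same mass 1/11.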

2603.19703 2026-03-23 math.ST cs.LG stat.TH

Minimax and Adaptive Covariance Matrix Estimation under Differential Privacy

T. Tony Cai, Yicheng Li

English abstract

The covariance matrix plays a fundamental role in the analysis of high-dimensional data. This paper studies minimax and adaptive estimation of high-dimensional bandable covariance matrices under differential privacy constraints. We propose a novel differentially private blockwise tridiagonal estimator that achieves minimax-optimal convergence rates under both the operator norm and the Frobenius norm. In contrast to the non-private setting, the privacy-induced error exhibits a polynomial dependence on the ambient dimension, revealing a substantial additional cost of privacy. To establish optimality, we develop a new differentially private van Trees inequality and construct carefully designed prior distributions to obtain matching minimax lower bounds. The proposed private van Trees inequality applies more broadly to general private estimation problems and is of independent interest. We further introduce an adaptive estimator that attains the optimal rate up to a logarithmic factor without prior knowledge of the decay parameter, based on a novel hierarchical tridiagonal approach. Numerical experiments corroborate the theoretical results and illustrate the fundamental privacy-accuracy trade-off.

2603.19657 2026-03-23 stat.ML cs.LG

Model Selection and Parameter Estimation of Multi-dimensional Gaussian Mixture Model

Xinyu Liu, Hai Zhang

English abstract

In this paper, we study the problem of learning multi-dimensional Gaussian Mixture Models (GMMs), with a specific focus on model order selection and efficient mixing distribution estimation. We first establish an information-theoretic lower bound on the critical sample complexity required for reliable model selection. More specifically, we show that distinguishing a $k$-component mixture from a simpler model necessitates a sample size scaling of $Ω(Δ^{-(4k-4)})$. We then propose a thresholding-based estimation algorithm that evaluates the spectral gap of an empirical covariance matrix constructed from random Fourier measurement vectors. This parameter-free estimator operates with an efficient time complexity of $\mathcal{O}(k^2 n)$, scaling linearly with the sample size. We demonstrate that the sample complexity of our method matches the established lower bound, confirming its minimax optimality with respect to the component separation distance $Δ$. Conditioned on the estimated model order, we subsequently introduce a gradient-based minimization method for parameter estimation. To effectively navigate the non-convex objective landscape, we employ a data-driven, score-based initialization strategy that guarantees rapid convergence. We prove that this method achieves the optimal parametric convergence rate of $\mathcal{O}_p(n^{-1/2})$ for estimating the component means. To enhance the algorithm's efficiency in high-dimensional regimes where the ambient dimension exceeds the number of mixture components (i.e., $d > k$), we integrate principal component analysis (PCA) for dimension reduction. Numerical experiments demonstrate that our Fourier-based algorithmic framework outperforms conventional Expectation-Maximization (EM) methods in both estimation accuracy and computational time.

2603.19648 2026-03-23 cs.LG cs.SY eess.SY math.OC stat.ML

Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis

Siddharth Chandak, Anuj Yadav, Ayfer Ozgur, Nicholas Bambos

Comments Submitted to IEEE Transactions on Automatic Control

English abstract

Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.

2603.19633 2026-03-23 cs.LG stat.ML

Alternating Diffusion for Proximal Sampling with Zeroth Order Queries

Hirohane Takagi, Atsushi Nitanda

Comments Accepted to ICLR2026

English abstract

This work introduces a new approximate proximal sampler that operates solely with zeroth-order information of the potential function. Prior theoretical analyses have revealed that proximal sampling corresponds to alternating forward and backward iterations of the heat flow. The backward step was originally implemented by rejection sampling, whereas we directly simulate the dynamics. Unlike diffusion-based sampling methods that estimate scores via learned models or by invoking auxiliary samplers, our method treats the intermediate particle distribution as a Gaussian mixture, thereby yielding a Monte Carlo score estimator from directly samplable distributions. Theoretically, when the score estimation error is sufficiently controlled, our method inherits the exponential convergence of proximal sampling under isoperimetric conditions on the target distribution. In practice, the algorithm avoids rejection sampling, permits flexible step sizes, and runs with a deterministic runtime budget. Numerical experiments demonstrate that our approach converges rapidly to the target distribution, driven by interactions among multiple particles and by exploiting parallel computation.

2603.19629 2026-03-23 stat.ML cs.LG physics.geo-ph

On the role of memorization in learned priors for geophysical inverse problems

Ali Siahkoohi, Davide Sabeddu

English abstract

Learned priors based on deep generative models offer data-driven regularization for seismic inversion, but training them requires a dataset of representative subsurface models -- a resource that is inherently scarce in geoscience applications. Since the training objective of most generative models can be cast as maximum likelihood on a finite dataset, any such model risks converging to the empirical distribution -- effectively memorizing the training examples rather than learning the underlying geological distribution. We show that the posterior under such a memorized prior reduces to a reweighted empirical distribution -- i.e., a likelihood-weighted lookup among the stored training examples. For diffusion models specifically, memorization yields a Gaussian mixture prior in closed form, and linearizing the forward operator around each training example gives a Gaussian mixture posterior whose components have widths and shifts governed by the local Jacobian. We validate these predictions on a stylized inverse problem and demonstrate the consequences of memorization through diffusion posterior sampling for full waveform inversion.
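
The "likelihood-weighted lookup" can be sketched in a few lines. The code below assumes additive Gaussian noise with standard deviation sigma, which the abstract does not state. The function name and the toy forward operator are illustrative only. Given a prior that is exactly the empirical distribution of the training models, the posterior puts weight on each stored example in proportion to its data likelihood.

```python
import math


def memorized_posterior_weights(y, train_models, forward, sigma):
    """Posterior weights over stored training examples under a memorized prior.

    Assumes y = forward(x) + Gaussian noise with standard deviation sigma
    (an assumption made here for illustration).
    """
    log_w = []
    for x in train_models:
        pred = forward(x)
        r2 = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
        log_w.append(-0.5 * r2 / sigma ** 2)
    m = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [wi / z for wi in w]


if __name__ == "__main__":
    train = [[0.0], [1.0], [2.0]]        # three stored one-parameter "models"
    identity = lambda x: x               # toy forward operator
    print(memorized_posterior_weights([1.0], train, identity, sigma=1.0))
```

As sigma shrinks, the weights collapse onto the single training example that best explains the data, which is the lookup behavior the abstract describes.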

2603.18168 2026-03-23 stat.ML cs.LG math.PR

ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit

Louis-Pierre Chaintron, Lénaïc Chizat, Javier Maass

English abstract

We establish convergence of the training dynamics of residual neural networks (ResNets) to their joint infinite depth L, hidden width M, and embedding dimension D limit. Specifically, we consider ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime and prove that, after a bounded number of training steps, the error between the ResNet and its large-scale limit is O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). This error rate is empirically tight when measured in embedding space. For a budget of P = Theta(L M D) parameters, this yields a convergence rate O(P^(-1/6)) for the scalings of (L, M, D) that minimize the bound. Our analysis exploits in an essential way the depth-two structure of residual blocks and applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimension. From a technical viewpoint, this work completes the program initiated in the companion paper [Chi25] where it is proved that for a fixed embedding dimension D, the training dynamics converges to a Mean ODE dynamics at rate O(1/L + sqrt(D)/sqrt(L M)). Here, we study the large-D limit of this Mean ODE model and establish convergence at rate O(1/sqrt(D)), yielding the above bound by a triangle inequality. To handle the rich probabilistic structure of the limit dynamics and obtain one of the first rigorous quantitative convergence results for a DMFT-type limit, we combine the cavity method with propagation of chaos arguments at a functional level on so-called skeleton maps, which express the weight updates as functions of CLT-type sums from the past.
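
The P^(-1/6) rate follows from the stated bound by a short balancing argument. The derivation below is a back-of-the-envelope check that ignores constants and treats P = L M D exactly. The specific choice of (L, M, D) in its last comment line is one admissible illustration, not necessarily the scaling the authors single out.

```latex
% Substitute LM = P/D into the middle term of the bound.
\[
  \frac{1}{L} + \sqrt{\frac{D}{LM}} + \frac{1}{\sqrt{D}}
  \;=\; \frac{1}{L} + \frac{D}{\sqrt{P}} + \frac{1}{\sqrt{D}} .
\]
% The last two terms move in opposite directions in D; they balance when
% D / \sqrt{P} = 1 / \sqrt{D}, i.e. D^{3/2} = P^{1/2}.
\[
  D = P^{1/3}
  \quad\Longrightarrow\quad
  \frac{D}{\sqrt{P}} = \frac{1}{\sqrt{D}} = P^{-1/6}.
\]
% The first term does not dominate as long as L \ge P^{1/6}.
\[
  D = P^{1/3},\;\; L \ge P^{1/6}
  \quad\Longrightarrow\quad
  \frac{1}{L} + \frac{D}{\sqrt{P}} + \frac{1}{\sqrt{D}} \;\le\; 3\,P^{-1/6}.
\]
% Illustrative admissible choice: L = P^{1/6}, M = P^{1/2}, D = P^{1/3}.
```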

2603.17381 2026-03-23 econ.EM stat.ML

An Auditable AI Agent Loop for Empirical Economics: A Case Study in Forecast Combination

Minchul Shin

Comments 34 pages, no figure

English abstract

AI coding agents make empirical specification search fast and cheap, but they also widen hidden researcher degrees of freedom. Building on an open-source agent-loop architecture, this paper adapts that framework to an empirical economics workflow and adds a post-search holdout evaluation. In a forecast-combination illustration, multiple independent agent runs outperform standard benchmarks in the original rolling evaluation, but not all continue to do so on a post-search holdout. Logged search and holdout evaluation together make adaptive specification search more transparent and help distinguish robust improvements from sample-specific discoveries.

2603.16982 2026-03-23 astro-ph.IM math.DS stat.AP

Trajectory Stability and Signature Diagnostics for Comet-Based Interstellar Navigation

Bo Pieter Johannes Andrée

Comments 31 pages, 2 figures, 4 added references

English abstract

Interstellar objects (ISOs) motivate a coupled mission-design and inference question relevant to spacecraft dynamics and control in extreme environments: if volatile-rich, rotating comet-like bodies were used for sustained deep-space navigation by exploiting pre-existing hyperbolic motion and in-situ propellant, what stability requirements arise under non-gravitational forcing, and what astrometric signatures might distinguish active stabilization from uncontrolled natural dynamics? We develop a stability-theoretic framework for trajectory tracking with jet-actuated correction, and show that high-speed transit geometry -- including debris-belt avoidance and encounter phasing -- tightly constrains feasible trajectories, making long-horizon tracking stability mission-critical. We model tracking residuals as the balance of disturbances and corrective action, and derive stability conditions across four levels: disturbance-energy stability, outer-loop contraction, actuator-memory stability, and rotation-mediated (Floquet) stability. The analysis implies residual diagnostics that can motivate empirical tests: under comparable forcing, effective stabilization is expected to strengthen short-horizon error correction, reduce event-conditioned persistence and variance clustering, regularize standardized innovations, and yield bounded post-shock recovery. More broadly, the framework provides a reference for deep-space guidance and control under nonlinear, multi-field disturbances and for planetary-defense concepts involving attitude shaping or impulsive kinetic impact.

2603.15781 2026-03-23 stat.ML cs.LG

Learnability with Partial Labels and Adaptive Nearest Neighbors

Nicolas A. Errandonea, Santiago Mazuelas, Jose A. Lozano, Sanjoy Dasgupta

English abstract

Prior work on partial labels learning (PLL) has shown that learning is possible even when each instance is associated with a bag of labels, rather than a single accurate but costly label. However, the necessary conditions for learning with partial labels remain unclear, and existing PLL methods are effective only in specific scenarios. In this work, we mathematically characterize the settings in which PLL is feasible. In addition, we present PL A-$k$NN, an adaptive nearest-neighbors algorithm for PLL that is effective in general scenarios and enjoys strong performance guarantees. Experimental results corroborate that PL A-$k$NN can outperform state-of-the-art methods in general PLL scenarios.

2602.18184 2026-03-23 math.ST math.PR stat.ME stat.TH

Kolmogorov-Type Maximal Inequalities for Independent and Dependent Negative Binomial Random Variables: Sharp Bounds, Sub-Exponential Refinements, and Applications to Overdispersed Count Data

Aristides V. Doumas, S. Spektor

Comments 11 pages, 8 figures, 2 tables

English abstract

This paper develops Kolmogorov-type maximal inequalities for sums of Negative Binomial random variables under both independence and dependence structures. For independent heterogeneous Negative Binomial variables we derive sharp Markov-type deviation inequalities and Kolmogorov-type bounds expressed in terms of Tweedie dispersion parameters, providing explicit control limits for NB2 generalized linear model monitoring. For dependent count data arising through a shared Gamma mixing variable, we establish a sub-exponential Bernstein-type refinement that exploits the Poisson-Gamma hierarchical structure to yield exponentially decaying tail probabilities -- this refinement is new in the literature. Through moment-matched Monte Carlo experiments ($n=20$, 2,000 replications), we document a 55% reduction in mean maximum deviation under appropriate dependence structures, a stabilization effect we explain analytically. A concrete epidemiological application with NB2 parameters calibrated from COVID-19 surveillance data demonstrates practical utility. These results materially advance the applicability of classical maximal inequalities to overdispersed and dependent count data prevalent in public health, insurance, and ecological modeling.
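
For orientation, the classical Kolmogorov maximal inequality already gives a usable baseline for independent NB2 counts. The sketch below is that textbook bound, not the paper's sharper results. It uses the standard NB2 variance mu + alpha * mu^2, and the function names and numbers are illustrative.

```python
def nb2_variance(mu, alpha):
    """Variance of an NB2 count with mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2


def kolmogorov_bound(mus, alphas, t):
    """Classical bound on P(max_k |S_k - E[S_k]| >= t) for independent counts.

    Kolmogorov's inequality: the probability is at most Var(S_n) / t^2,
    where Var(S_n) is the sum of the individual variances.
    """
    total_var = sum(nb2_variance(m, a) for m, a in zip(mus, alphas))
    return min(1.0, total_var / t ** 2)


if __name__ == "__main__":
    # 20 independent counts, each with mean 2 and dispersion 0.5 (toy values).
    print(kolmogorov_bound([2.0] * 20, [0.5] * 20, t=40.0))
```

A monitoring rule could flag a run when the maximum partial-sum deviation exceeds the t at which this bound drops below a chosen false-alarm level. Any such threshold derived from the classical bound is conservative.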

2602.11132 2026-03-23 math.ST stat.ME stat.TH

A New Look at Bayesian Testing

Jyotishka Datta, Nicholas G. Polson, Vadim Sokolov, Daniel Zantedeschi

Comments Revised version addresses proofs and references

English abstract

We identify the critical deviation scale governing Bayesian evidence accumulation in regular parametric testing. Under integrated Bayes risk with zero-one loss, the risk-optimal rejection boundary lies in a moderate deviation regime, with a square-root logarithmic inflation relative to the usual local asymptotic normal scale. Under Cramér regularity, local prior smoothness at the null, and symmetric loss, we derive the sharp threshold and show that its leading logarithmic term is universal across regular priors, while lower-order constants depend on the local prior density, Fisher information, and prior model odds. The result extends to one-parameter exponential families through local asymptotic normality and places Jeffreys' testing threshold, the Bayesian information criterion penalty, and Chernoff-Stein type error-exponent arguments within a common asymptotic moderate deviation framework.

2601.20018 2026-03-23 math.ST econ.EM math.PR stat.TH

Decoupling and randomization for double-indexed permutation statistics

Mingxuan Zou, Jingfan Xu, Peng Ding, Fang Han

Comments 42 pages

English abstract

This paper introduces a version of decoupling and randomization to establish concentration inequalities for double-indexed permutation statistics. The results yield, among other applications, a new combinatorial Hanson-Wright inequality and a new combinatorial Bennett inequality. Several illustrative examples from rank-based statistics, graph-based statistics, and causal inference are also provided.

2511.12435 2026-03-23 stat.ME

Transfer learning for high-dimensional Factor-augmented sparse linear model

Bo Fu, Dandan Jiang

Comments 54 pages, 1 figure

English abstract

In this paper, we study transfer learning for high-dimensional factor-augmented sparse linear models, motivated by applications in economics and finance where strongly correlated predictors and latent factor structures pose major challenges for reliable estimation. Our framework simultaneously mitigates the impact of high correlation and removes the additional contributions of latent factors, thereby reducing potential model misspecification in conventional linear modeling. In such settings, the target dataset is often limited, but multiple heterogeneous auxiliary sources may provide additional information. We develop transfer learning procedures that effectively leverage these auxiliary datasets to improve estimation accuracy, and establish non-asymptotic $\ell_1$- and $\ell_2$-error bounds for the proposed estimators. To prevent negative transfer, we introduce a data-driven source detection algorithm capable of identifying informative auxiliary datasets and prove its consistency. In addition, we provide a hypothesis testing framework for assessing the adequacy of the factor model, together with a procedure for constructing simultaneous confidence intervals for the regression coefficients of interest. Numerical studies demonstrate that our methods achieve substantial gains in estimation accuracy and remain robust under heterogeneity across datasets. Overall, our framework offers a theoretical foundation and a practically scalable solution for incorporating heterogeneous auxiliary information in settings with highly correlated features and latent factor structures.

2510.05685 2026-03-23 math.ST math.PR stat.TH

Sample complexity for divergence regularized optimal transport with radial cost

Ruiyu Han, Johannes Wiesel

English abstract

We prove a new sample complexity result for divergence regularized optimal transport. Our bound holds for probability measures on $\mathbb{R}^d$ with exponential tail decay and for radial cost functions that satisfy a local Lipschitz condition. It is sharp up to logarithmic factors, and captures the intrinsic dimension of the marginal distributions through a generalized covering number of their supports. Examples that fit into our framework include subexponential and subgaussian distributions and radial cost functions $c(x,y)=|x-y|^p$ for $p\ge 1$ with logarithmic entropy or polynomial $α$-divergence.

2510.01803 2026-03-23 stat.AP

The Perceived Impact of Environment on Health in Italy: a Penalized Ordinal Regression Approach

Mattia Stival, Angela Andreella, Gaia Bertarelli, Catarina Midões, Stefano Federico Tonellato, Stefano Campostrini

English abstract

Understanding how individuals perceive their living environment is a complex task, as it reflects both personal and contextual determinants. In this paper, we address this task by analyzing the environmental module of the Italian nationwide health surveillance system PASSI (Progressi delle Aziende Sanitarie per la Salute in Italia), integrating it with contextual information at the municipal level, including socio-economic indicators, pollution exposure, and other geographical characteristics. Methodologically, we adopt a penalized semi-parallel cumulative ordinal regression model to analyze how subjective perceptions are shaped by both personal and territorial determinants. The approach balances flexibility and interpretability by allowing both parallel and non-parallel effects while regularizing estimates to address multicollinearity and separation issues. We use the model as an analytical tool to uncover the determinants of positivity and neutrality in environmental perceptions, defined as factors that contribute the most to improving perception or increasing the sense of neutrality. The results are diverse. First, results reveal significant heterogeneity across Italian territories, indicating that local characteristics strongly shape environmental perception. Second, various individual factors interact with contextual influences to shape perceptions. Third, hazardous environmental factors, such as higher PM2.5 levels, appear to be associated with poorer environmental perception, suggesting a tendency among respondents to recognize specific environmental issues. Overall, the approach demonstrates strong potential for application and provides useful insights for environmental policy planning.

2509.26385 2026-03-23 stat.ME stat.CO

An Order of Magnitude Time Complexity Reduction for Gaussian Graphical Model Posterior Sampling Using a Reverse Telescoping Block Decomposition

Zejin Gao, Ksheera Sagar, Anindya Bhadra

English abstract

We consider the problem of fully Bayesian posterior estimation and uncertainty quantification in undirected Gaussian graphical models via Markov chain Monte Carlo (MCMC) under recently developed element-wise graphical priors, such as the graphical horseshoe. Unlike the conjugate Wishart family, these priors are non-conjugate, but have the advantage that they naturally allow one to encode a prior belief of sparsity in the off-diagonal elements of the precision matrix, without imposing a structure on the entire matrix. Unfortunately, for a graph with $p$ nodes and with $n$ samples, the state-of-the-art MCMC approaches for the element-wise priors achieve a per-iteration complexity of $O(p^4)$, which is prohibitive when $p\gg n$. In this regime, we develop a suitably reparameterized MCMC with per-iteration complexity of $O(p^3)$, providing an order-of-magnitude improvement, and consequently bringing the per-iteration computational cost on par with the conjugate Wishart family, which is also $O(p^3)$ due to the use of the classical Bartlett decomposition, a decomposition that does not apply outside the Wishart family. Importantly, the proposed benefit is obtained solely due to our reparameterization in an MCMC scheme targeting the true posterior, that reverses the recently developed telescoping block decomposition of Bhadra et al. (2024), in a suitable sense. There is no variational or any other approximate Bayesian computation scheme considered in this paper that compromises targeting the true posterior. Simulations and the analysis of a breast cancer data set confirm both the correctness and better algorithmic scaling of the proposed reverse telescoping sampler.
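
The $O(p^3)$ Wishart baseline mentioned above comes from the classical Bartlett decomposition, sketched below in pure Python. This illustrates only that baseline and not the paper's reverse telescoping sampler. The function name and the convention of passing a precomputed Cholesky factor are assumptions made for the example.

```python
import random


def bartlett_wishart(chol, df, rng=random):
    """Draw W ~ Wishart(df, Sigma), where chol is lower triangular and
    chol @ chol.T == Sigma. Requires df > p - 1.
    """
    p = len(chol)
    # Bartlett factor A: lower triangular, chi-distributed diagonal,
    # independent N(0, 1) entries below the diagonal.
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # chi^2 with (df - i) degrees of freedom == Gamma(shape=(df - i)/2, scale=2)
        A[i][i] = rng.gammavariate((df - i) / 2.0, 2.0) ** 0.5
        for j in range(i):
            A[i][j] = rng.gauss(0.0, 1.0)
    # B = chol @ A is lower triangular; W = B @ B.T. Both products cost O(p^3).
    B = [[sum(chol[i][m] * A[m][j] for m in range(j, i + 1)) for j in range(p)]
         for i in range(p)]
    W = [[sum(B[i][m] * B[j][m] for m in range(min(i, j) + 1)) for j in range(p)]
         for i in range(p)]
    return W


if __name__ == "__main__":
    random.seed(0)
    print(bartlett_wishart([[1.0, 0.0], [0.5, 1.0]], df=5))
```

No matrix inversion or accept/reject step is needed, so the cost is dominated by the two triangular matrix products.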

2509.01597 2026-03-23 cs.CR cs.DS stat.AP

Statistics-Friendly Confidentiality Protection for Establishment Data, with Applications to the QCEW

Kaitlyn Webb, Prottay Protivash, John Durrell, Daniell Toth, Aleksandra Slavković, Daniel Kifer

Comments 42 pages (13 main text, 2 references, and 27 appendix pages), 13 figures (4 in main text)

English abstract

Confidentiality for business data is an understudied area of disclosure avoidance, where legacy methods struggle to provide acceptable results. Standard formal privacy techniques for person-level data, like differential privacy, are designed to protect against membership inference and hence do not provide suitable confidentiality/utility trade-offs due to the highly skewed nature of business data and because extreme outlier records are often important contributors to query answers. Prior proposals, therefore, took a personalized differential privacy approach that allowed privacy parameters to degrade for the outlying records -- larger establishments get weaker membership inference guarantees. However, providing guarantees to some entities that are strictly weaker than guarantees for others is problematic from a policy standpoint. In this paper, we propose a novel confidentiality framework for business data with a focus on interpretability for policy makers. Instead of protecting against membership inference, which is often not a concern in business data, we protect against attribute inferences that are too precise. In our framework, data curators specify a neighbor function that is used to define uncertainty interval bands around an establishment's attribute values and the privacy parameters govern the strength of indistinguishability between values within the same uncertainty interval. We propose two query-answering mechanisms under this framework and evaluate them on: (1) a confidential Quarterly Census of Employment and Wages (QCEW) dataset produced by the U.S. Bureau of Labor Statistics (this was done through a cooperative agreement), and (2) a substitute dataset that we created from public sources (and will publicly release).

2508.18716 2026-03-23 stat.AP

Dynamic Count Models with Flexible Innovation Processes for Irregular Maritime Migration

Gregor Zens, Jakub Bijak

English abstract

Motivated by the challenge of analyzing the dynamics of weekly sea border crossings in the Mediterranean (2015-2025) and the English Channel (2018-2025), we develop a Bayesian dynamic framework for modeling heteroskedastic count time series. Building on theoretical considerations and empirical stylized facts, our approach utilizes a Poisson random walk model that allows for heavy-tailed innovations or stochastic volatility dynamics, while incorporating an explicit mechanism to separate structural from sampling zeros. Posterior inference is carried out via a straightforward Markov chain Monte Carlo algorithm. Applying this methodology to Mediterranean and English Channel data, we compare alternative model specifications through a comprehensive out-of-sample forecasting exercise. Using log predictive scores and empirical coverage at predictive quantiles to evaluate each model, we find strong evidence for stochastic volatility in migration innovations. These models deliver the strongest out-of-sample forecasts with empirical coverage close to nominal levels up to the 99th percentile. Our framework can be used to develop risk indicators with direct policy implications for improving governance and preparedness for migration surges. More broadly, the methodology extends to other zero-inflated non-stationary count time series applications, including epidemiological surveillance and public safety incident monitoring.

2508.15954 2026-03-23 math.OC stat.AP

A Heuristic Framework of Variable Neighborhood Descent Methods for the Large-Scale Multi-Level Facility Location Problem in Supply Chain Networks

Haibo Wang, Bahram Alidaee

Comments 48 pages, 3 figures

English abstract

This paper addresses the single-assignment, uncapacitated, multi-level facility location (MFL) problem, a strategic decision-making process critical to the design of long-term supply chain networks. Specifically, we examine four- and five-level facility location structures (k-LFL), modeled as a location-allocation problem where demand nodes must be assigned to open facilities across hierarchical levels. Although the MFL has been addressed in the literature, solutions to large-scale, realistic problems involving thousands of nodes are lacking. This paper proposes a heuristic framework based on the Variable Neighborhood Descent (VND) metaheuristic with a multi-start strategy. We develop and compare four variants: Basic Variable Neighborhood Descent (BVND), Pipe Variable Neighborhood Descent (PVND), Cyclic Variable Neighborhood Descent (CVND), and Union Variable Neighborhood Descent (UVND). In each case, a multi-start strategy with strong diversification components is employed. Extensive computational experiments compare the methods on large-scale instances involving up to 10,000 customers, 150 distribution centers, 50 warehouses, and 30 plants. Each algorithm settled into a unique, statistically significant computational time when solving these problems. Sensitivity analyses, supported by non-parametric statistical methods, validate the effectiveness of the proposed heuristic framework.
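
As a reminder of what the basic scheme looks like, here is a generic sketch of Variable Neighborhood Descent with a best-improvement step. It is a textbook skeleton under simple assumptions and not the authors' implementation. The toy cost function and neighborhoods are hypothetical, and the pipe, cyclic, and union variants named in the abstract are not sketched.

```python
def vnd(x0, cost, neighborhoods):
    """Basic VND: scan neighborhoods in order and restart from the first
    one after any improvement; stop when no neighborhood improves.

    neighborhoods is a list of functions, each mapping a solution to an
    iterable of candidate solutions.
    """
    x, fx = x0, cost(x0)
    k = 0
    while k < len(neighborhoods):
        best = min(neighborhoods[k](x), key=cost, default=None)
        if best is not None and cost(best) < fx:
            x, fx = best, cost(best)
            k = 0            # improvement found: return to the first neighborhood
        else:
            k += 1           # no improvement: try the next, larger neighborhood
    return x, fx


if __name__ == "__main__":
    cost = lambda x: (x - 7) ** 2                 # toy objective on the integers
    small = lambda x: [x - 1, x + 1]              # hypothetical "small move"
    large = lambda x: [x - 5, x + 5]              # hypothetical "large move"
    print(vnd(0, cost, [small, large]))
```

A multi-start strategy, as described in the abstract, would call such a routine from several diversified initial solutions and keep the best result.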

2505.24144 2026-03-23 math.PR math.ST stat.TH

Sharp Concentration of Simple Random Tensors II: Asymmetry

Jiaheng Chen, Daniel Sanz-Alonso

Comments 42 pages, to appear in Information and Inference

English abstract

This paper establishes sharp concentration inequalities for simple random tensors. Our theory unveils a phenomenon that arises only for asymmetric tensors of order $p \ge 3$: when the effective ranks of the covariances of the component random variables lie on both sides of a critical threshold, an additional logarithmic factor emerges that is not present in sharp bounds for symmetric tensors. To establish our results, we develop empirical process theory for products of $p$ different function classes evaluated at $p$ different random variables, extending generic chaining techniques for quadratic and product empirical processes to higher-order settings.

2505.14255 2026-03-23 stat.ME math.PR

Statistical Inference for Quasi-Infinitely Divisible Distributions via Fourier Methods

Vladimir Panov, Anton Ryabchenko

Comments 25 pages, 9 figures

English abstract

This study focuses on statistical inference for the class of quasi-infinitely divisible (QID) distributions, which was recently introduced by Lindner, Pan and Sato (2018). The paper presents a Fourier approach, based on the analogue of the Lévy-Khintchine theorem with a signed spectral measure. We prove that for some subclasses of QID distributions, the considered estimates have polynomial rates of convergence. This is a remarkable fact when compared to the logarithmic convergence rates of similar methods for infinitely divisible distributions, which cannot be improved in general. We demonstrate the numerical performance of the algorithm using simulated examples.

2502.10647 2026-03-23 cs.LG math.ST stat.ML stat.TH

A Power Transform

Jonathan T. Barron

English abstract

Power transforms, such as the Box-Cox transform and Tukey's ladder of powers, are a fundamental tool in mathematics and statistics. These transforms are primarily used for normalizing and standardizing datasets, effectively by raising values to a power. In this work I present a novel power transform, and I show that it serves as a unifying framework for a wide family of loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions.
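
For readers unfamiliar with the baseline, the classical one-parameter Box-Cox transform is sketched below. This is the standard textbook definition, not the novel transform proposed in the paper, and the function name is illustrative.

```python
import math


def box_cox(x, lam):
    """Classical Box-Cox transform of a positive value x.

    (x**lam - 1) / lam for lam != 0, and log(x) at lam == 0, which is the
    limit of the first expression as lam -> 0.
    """
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive x only")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


if __name__ == "__main__":
    for lam in (1.0, 0.5, 0.0, -1.0):
        print(lam, box_cox(4.0, lam))
```

Setting lam = 1 shifts the data by one, lam = 0.5 is a scaled square root, and lam = 0 is the logarithm, so a single parameter sweeps through the familiar "ladder" of transforms.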

2502.04082 2026-03-23 stat.AP

Market-based insurance ratemaking: application to pet insurance

Pierre-Olivier Goffard, Pierrick Piette, Gareth W. Peters

English abstract

This paper introduces a method for pricing insurance policies using market data. The approach is designed for scenarios in which the insurance company seeks to enter a new market (in our case, pet insurance) for which it lacks historical data. The methodology involves an iterative two-step process. First, a suitable parameter is proposed to characterize the underlying risk. Second, the resulting pure premium is linked to the observed commercial premium using an isotonic regression model. To validate the method, comprehensive testing is conducted on synthetic data, followed by its application to a dataset of actual pet insurance rates. To facilitate practical implementation, we have developed an R package called IsoPriceR. By addressing the challenge of pricing insurance policies in the absence of historical data, this method helps enhance pricing strategies in emerging markets.
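
The isotonic-regression step can be illustrated with the pool-adjacent-violators algorithm. The sketch below is a generic unit-weight implementation in Python, not code from the IsoPriceR package. It assumes the commercial premiums are already ordered by increasing pure premium, and the numbers in the demo are made up.

```python
def isotonic_fit(y):
    """Least-squares non-decreasing fit to y (unit weights), computed with
    the pool-adjacent-violators algorithm.

    y must already be ordered by the explanatory variable.
    """
    blocks = []  # each block is [sum of values, number of values]
    for v in y:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)   # every point in a block gets the block mean
    return fit


if __name__ == "__main__":
    # Hypothetical commercial premiums, ordered by increasing pure premium.
    print(isotonic_fit([1.0, 3.0, 2.0, 4.0]))
```

The fitted values are constant on each pooled block, so the resulting link between pure and commercial premium is a non-decreasing step function.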

2501.11868 2026-03-23 stat.ME math.ST stat.ML stat.TH

Automatic Debiased Machine Learning for Smooth Functionals of Nonparametric M-Estimands

Lars van der Laan, Aurelien Bibaut, Nathan Kallus, Alex Luedtke

详情
英文摘要

We develop a unified framework for automatic debiased machine learning (autoDML) for inference on a broad class of statistical parameters. The framework applies to any smooth functional of a nonparametric M-estimand, defined as the minimizer of a population risk over an infinite-dimensional linear space. Examples include counterfactual regression, quantile, and survival functions, as well as conditional average treatment effects. Rather than requiring manual derivation of influence functions, our approach automates the construction of debiased estimators using three ingredients: the gradient and Hessian of the loss function and a linear approximation of the target functional. Estimation reduces to solving two risk minimization problems, one for the M-estimand and one for a Riesz representer. The framework accommodates Neyman-orthogonal loss functions that depend on nuisance parameters and extends to vector-valued M-estimands through joint risk minimization. We characterize the efficient influence function and construct efficient autoDML estimators via one-step correction, targeted minimum loss estimation, and sieve-based plug-in methods. Under quadratic risk, these estimators satisfy double robustness for linear functionals. We further show that they are robust to mild misspecification of the M-estimand model, incurring only second-order bias. We illustrate the method by estimating long-term survival probabilities under a semiparametric two-parameter beta-geometric failure model.

2501.11421 2026-03-23 cs.LG cs.IT math.IT math.ST stat.TH

Online Clustering of Data Sequences with Bandit Information

G Dhinesh Chandran, Srinivas Reddy Kota, Srikrishna Bhashyam

详情
英文摘要

We study the problem of online clustering of data sequences in the multi-armed bandit (MAB) framework under the fixed-confidence setting. There are $M$ arms, each providing i.i.d. samples from a parametric distribution whose parameters are unknown. The $M$ arms form $K$ clusters based on the distance between the true parameters. In the MAB setting, one arm can be sampled at each time. The objective is to estimate the clusters of the arms using as few samples as possible from the arms, subject to an upper bound on the error probability. Our setting allows for: arms within a cluster to have non-identical distributions, vector parameter arms, vector observations, and $K \le M$ clusters. We propose and analyze the Average Tracking Bandit Online Clustering (ATBOC) algorithm. ATBOC is asymptotically order-optimal for multivariate Gaussian arms, with expected sample complexity growing at most twice as fast as the lower bound as $δ\rightarrow 0$, and this guarantee extends to multivariate sub-Gaussian arms. For single-parameter exponential family arms, ATBOC is asymptotically optimal, matching the lower bound. We also propose two computationally more efficient alternatives: the Lower and Upper Confidence Bound based Bandit Online Clustering Algorithm (LUCBBOC) and Bandit Online Clustering-Elimination (BOC-ELIM). We derive the computational complexity of the proposed algorithms and compare their per-sample runtime through simulations. LUCBBOC and BOC-ELIM require lower per-sample runtime than ATBOC while achieving comparable performance. All the proposed algorithms are $δ$-probably correct, i.e., the error probability of the cluster estimate at the stopping time is at most $δ$. We validate the asymptotic optimality guarantees through simulations, and compare our proposed algorithms with other related work through simulations on both synthetic and real-world datasets.

2410.15166 2026-03-23 math.ST stat.ME stat.TH

Adversarial Estimation of Assortment Probabilities under Independence Structure

Alexandre Belloni, Yan Chen, Matthew Harding

详情
英文摘要

We consider the problem of estimating assortment probabilities, which is common in operations management applications, including product bundling, advertising, etc. Existing approaches typically model each assortment as a category and apply multinomial models to estimate the choice probabilities; while computationally convenient, these methods do not exploit independence structures in the joint distribution and may therefore be statistically inefficient when the total number of items is large. Using the representation from Bahadur (1959), we relate the sparsity of the generalized correlation coefficients to the independence structure of the binary components. We formulate the problem as estimating a high-dimensional vector of generalized correlation coefficients, together with low or moderate-dimensional nuisance parameters corresponding to the marginal probabilities. We develop a regularized adversarial estimator that attains the optimal rate under standard regularity conditions while remaining computationally feasible. The framework naturally extends to settings with covariates. We apply the proposed estimators to causal inference with multiple binary treatments and show substantial finite-sample improvements over non-adaptive methods. Numerical studies corroborate the theoretical results.

2409.19435 2026-03-23 cs.LG stat.CO stat.ML

Simulation-based Inference with the Python Package sbijax

Simon Dirmeier, Antonietta Mira, Carlo Albert

详情
英文摘要

Neural simulation-based inference (SBI) describes an emerging family of methods for Bayesian inference with intractable likelihood functions that use neural networks as surrogate models. Here we introduce sbijax, a Python package that implements a wide variety of state-of-the-art methods in neural simulation-based inference using a user-friendly programming interface. sbijax offers high-level functionality to quickly construct SBI estimators, and compute and visualize posterior distributions with only a few lines of code. In addition, the package provides functionality for conventional approximate Bayesian computation, to compute model diagnostics, and to automatically estimate summary statistics. By virtue of being entirely written in JAX, sbijax is extremely computationally efficient, allowing rapid training of neural networks and executing code automatically in parallel on both CPU and GPU.

2409.18010 2026-03-23 eess.SY cs.SY math.OC stat.ML

End-to-end guarantees for indirect data-driven control of bilinear systems with finite stochastic data

Nicolas Chatzikiriakos, Robin Strässer, Frank Allgöwer, Andrea Iannelli

Comments Accepted for publication in Automatica

详情
Journal ref
Automatica, vol. 187, pp. 112908, 2026
英文摘要

In this paper we propose an end-to-end algorithm for indirect data-driven control for bilinear systems with stability guarantees. We consider the case where the collected i.i.d. data is affected by probabilistic noise with possibly unbounded support and leverage tools from statistical learning theory to derive finite sample identification error bounds. To this end, we solve the bilinear identification problem by solving a set of linear and affine identification problems, by a particular choice of a control input during the data collection phase. We provide a priori as well as data-dependent finite sample identification error bounds on the individual matrices as well as ellipsoidal bounds, both of which are structurally suitable for control. Further, we integrate the structure of the derived identification error bounds in a robust controller design to obtain an exponentially stable closed-loop. By means of an extensive numerical study we showcase the interplay between the controller design and the derived identification error bounds. Moreover, we note appealing connections of our results to indirect data-driven control of general nonlinear systems through Koopman operator theory and discuss how our results may be applied in this setup.

2409.06271 2026-03-23 stat.ML cs.LG stat.ME

A new paradigm for global sensitivity analysis

Gildas Mazo

详情
英文摘要

It is well-known that Sobol indices, which count among the most popular sensitivity indices, are based on the Sobol decomposition. Here we challenge this construction by redefining Sobol indices without the Sobol decomposition. In fact, we show that Sobol indices are a particular instance of a more general concept which we call sensitivity measures. A sensitivity measure of a system taking inputs and returning outputs is a set function that is null at a subset of inputs if and only if, with probability one, the output actually does not depend on those inputs. A sensitivity measure evaluated at the whole set of inputs represents the uncertainty about the output. We show that measuring sensitivity to a particular subset is akin to measuring the expected output's uncertainty conditionally on the fact that the inputs belonging to that subset have been fixed to random values. By considering all of the possible combinations of inputs, sensitivity measures induce an implicit symmetric factorial experiment with two levels, the factorial effects of which can be calculated. This new paradigm generalizes many known sensitivity indices, can create new ones, and defines interaction effects independently of the choice of the sensitivity measure. No assumption about the distribution of the inputs is required.

2408.12905 2026-03-23 math.ST stat.TH

On the relation between likelihood ratios and p-values for testing success probabilities of Bernoulli trials

Wouter Kager, Ronald Meester

Comments 24 pages, 2 figures

详情
英文摘要

It is well known that there is no direct one-to-one relation between $p$-values and likelihood ratios or Bayes factors, since their relation crucially involves the sample size $n$. We investigate their (asymptotic) relation in a coin-tossing context where the hypotheses of interest address the success probability of the coin, and where detailed computations are possible. This leads to useful insights into the nature of $p$-values and likelihood ratios. Our results imply, for instance, that under mild conditions, a $p$-value of 0.05 cannot correspond to a likelihood ratio larger than 7.5, for any hypothesis versus a null hypothesis that the success probability has a specific value. We also show it is unlikely one can obtain a large likelihood ratio by tossing a fair coin until the number of heads deviates from the mean by several standard deviations.
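The coin-tossing comparison can be explored numerically. The sketch below assumes a fair-coin null hypothesis, an exact two-sided binomial $p$-value, and the likelihood ratio of the best-supported alternative (success probability equal to the observed frequency), which bounds the ratio for any single alternative. These choices and the function names are illustrative and do not reproduce the paper's exact conditions.

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(k, n, p0=0.5):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    pk = binom_pmf(k, n, p0)
    return sum(
        binom_pmf(j, n, p0)
        for j in range(n + 1)
        if binom_pmf(j, n, p0) <= pk * (1 + 1e-12)
    )

def max_likelihood_ratio(k, n, p0=0.5):
    """Likelihood ratio of the alternative p = k/n versus the null value p0."""
    p_hat = k / n
    return binom_pmf(k, n, p_hat) / binom_pmf(k, n, p0)

# Example: 60 heads in 100 tosses of a coin assumed fair under the null.
print(two_sided_p_value(60, 100), max_likelihood_ratio(60, 100))
```

For 60 heads in 100 tosses the $p$-value lands slightly above 0.05 while the maximized likelihood ratio stays just under 7.5, which is in line with the bound quoted in the abstract.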

2405.13591 2026-03-23 stat.ME

Practical limitations for real-life application of data fission and data thinning in post-clustering differential analysis

Benjamin Hivert, Denis Agniel, Rodolphe Thiébaut, Boris P. Hejblum

详情
英文摘要

Post-clustering inference in single-cell RNA sequencing (scRNA-seq) analysis presents significant challenges in controlling Type I error during differential expression analysis. Data fission, a promising approach that aims to split data into two independent parts, relies on strong parametric assumptions of non-mixture distributions that are inherently violated in clustered data. To address this limitation, we introduce conditional data fission, an extension designed to decompose each mixture component into two independent parts. However, we demonstrate that applying such conditional data fission to mixture distributions requires prior knowledge of the clustering structure to ensure valid post-clustering inference. This arises from the need to accurately estimate component-specific scale parameters, which are critical for performing decomposition while maintaining independence. We theoretically quantify how biases in estimating these parameters lead to inflated Type I error rates due to deviations from independence. Given that mixture components are typically unknown in practice, our results underscore the fundamental difficulty of applying data fission in real-world settings, despite its prior proposal as a solution for post-clustering inference.

2405.01425 2026-03-23 cs.DS cs.LG math.ST stat.ML stat.TH

In-and-Out: Algorithmic Diffusion for Sampling Convex Bodies

Yunbum Kook, Santosh S. Vempala, Matthew S. Zhang

Comments To appear in Random Structures & Algorithms; conference version appeared in NeurIPS 2024 (spotlight)

详情
英文摘要

We present a new random walk for uniformly sampling high-dimensional convex bodies. It achieves state-of-the-art runtime complexity with stronger guarantees on the output than previously known, namely in Rényi divergence (which implies TV, $\mathcal{W}_2$, KL, $χ^2$). The proof departs from known approaches for polytime algorithms for the problem -- we utilize a stochastic diffusion perspective to show contraction to the target distribution with the rate of convergence determined by functional isoperimetric constants of the target distribution.

2211.09875 2026-03-23 stat.CO

Mixture of Experts Distributional Regression: Implementation Using Robust Estimation with Adaptive First-order Methods

David Rügamer, Florian Pfisterer, Bernd Bischl, Bettina Grün

Comments arXiv admin note: text overlap with arXiv:2010.06889

详情
Journal ref
AStA Advances in Statistical Analysis 108 (2024) 351-373
英文摘要

In this work, we propose an efficient implementation of mixtures of experts distributional regression models which exploits robust estimation by using stochastic first-order optimization techniques with adaptive learning rate schedulers. We take advantage of the flexibility and scalability of neural network software and implement the proposed framework in mixdistreg, an R software package that allows for the definition of mixtures of many different families, estimation in high-dimensional and large sample size settings and robust optimization based on TensorFlow. Numerical experiments with simulated and real-world data applications show that optimization is as reliable as estimation via classical approaches in many different settings and that results may be obtained for complicated scenarios where classical approaches consistently fail.

2210.12790 2026-03-23 math.ST cond-mat.dis-nn cond-mat.soft math.PR stat.TH

A genuine test for hyperuniformity

Michael A. Klatt, Günter Last, Norbert Henze

Comments 46 pages, 11 figures, 1 table

详情
英文摘要

We introduce a rigorous and sensitive significance test for hyperuniformity that yields reliable results even from a single sample. Our approach is based on a detailed analysis of the empirical Fourier transform of a stationary point process in $\mathbb{R}^d$. For large system sizes, we derive the asymptotic covariances and establish a multivariate central limit theorem (CLT) for these empirical Fourier transforms. Their absolute square value, the scattering intensity, is then used as the standard estimator of the structure factor. The above CLT holds for a preferably large class of point processes, and whenever this is the case, the scattering intensity satisfies a multivariate limit theorem as well. Hence, we can use the likelihood ratio principle to test for hyperuniformity. Remarkably, the asymptotic distribution of the resulting test statistic is universal under the null hypothesis of hyperuniformity. We obtain its explicit form from simulations with very high accuracy. The novel test precisely keeps a nominal significance level for hyperuniform models, and it rejects non-hyperuniform examples with high power even in borderline cases. Moreover, it does so given only a single sample with a practically relevant system size.

2603.19577 2026-03-23 math.PR q-bio.QM stat.ME

Stochastic Averaging and Statistical Inference of Glycolytic Pathway

Arnab Ganguly, Hye-Won Kang

Comments 33 pages, 2 figures

详情
英文摘要

Many biological processes exhibit oscillatory behavior. Among these, glycolytic oscillations have been extensively studied due to their well-characterized biochemical reaction networks. However, the complexity of these networks necessitates low-dimensional ordinary differential equation (ODE) models to identify core mechanisms and perform stability analysis. While previous studies proposed reduced ODE models, these were typically introduced from deterministic descriptions rather than the underlying stochastic dynamics, which more accurately represent discrete reaction events occurring at random times. In this paper, we develop a rigorous probabilistic framework for deriving a reduced Othmer-Aldridge model of the glycolytic pathway from its stochastic formulation. The full system is modeled as a multiscale continuous-time Markov chain with different time and abundance scales. Under an appropriate scaling regime and specific structural conditions, we prove that the dynamics of the slow components are approximated by a two-dimensional ODE. The proof is technically involved due to the network's complexity and strong coupling between its components. We further consider the problem of parameter estimation when observations are limited to the slow species: fructose-6-phosphate and ADP. The reduced system yields a tractable loss function depending solely on these variables. We prove that the resulting estimators are statistically consistent when the data originate from the full stochastic reaction network. Together, these results provide a mathematically rigorous framework linking stochastic biochemical reaction networks, reduced deterministic dynamics, and statistically reliable parameter estimation.

2603.19549 2026-03-23 cs.CY cs.AI cs.ET stat.AP

Plagiarism or Productivity? Students Moral Disengagement and Behavioral Intentions to Use ChatGPT in Academic Writing

John Paul P. Miranda, Rhiziel P. Manalese, Mark Anthony A. Castro, Renen Paul M. Viado, Vernon Grace M. Maniago, Rudante M. Galapon, Jovita G. Rivera, Amado B. Martinez

Comments 5 pages, 1 figure, 2 tables, conference proceeding

详情
Journal ref
2025 International Workshop on Artificial Intelligence and Education (2026) 383-387
英文摘要

This study examined how moral disengagement influences Filipino college students' intention to use ChatGPT in academic writing. The model tested five mechanisms: moral justification, euphemistic labeling, displacement of responsibility, minimizing consequences, and attribution of blame. These mechanisms were analyzed as predictors of attitudes, subjective norms, and perceived behavioral control, which then predicted behavioral intention. A total of 418 students with ChatGPT experience participated. The results showed that several moral disengagement mechanisms influenced students' attitudes and sense of control. Among the predictors, attribution of blame had the strongest influence, while attitudes had the highest impact on behavioral intention. The model explained more than half of the variation in intention. These results suggest that students often rely on institutional gaps and peer behavior to justify AI use. Many believe it is acceptable to use ChatGPT for learning or when rules are unclear. This shows a need for clear academic integrity policies, ethical guidance, and classroom support. The study also recognizes that intention-based models may not fully explain student behavior. Emotional factors, peer influence, and convenience can also affect decisions. The results provide useful insights for schools that aim to support responsible and informed AI use in higher education.

2603.19506 2026-03-23 math.ST stat.ME stat.TH

Doubly-Unlinked Regression for Dependent Data

Anik Burman, Sayantan Choudhury, Debangan Dey

Comments 81 pages, 6 figures, supplementary appendix included

详情
英文摘要

Shuffled regression concerns settings in which covariates and responses are observed without their correct pairing. In dependent-data problems, a second form of missing correspondence can arise when responses are also detached from the latent temporal, spatial, or geometric domain that induces their dependence structure. We study regression under this joint loss of correspondence and, to our knowledge, provide the first systematic treatment of this setting. Specifically, we consider a doubly-unlinked regression model in which both the covariate-response link and the response-domain link are unknown, represented by two latent permutation matrices, while dependence is induced by an unobserved stochastic process. This framework unifies shuffled regression and latent-domain permutation models within a common dependent-data setting. We characterize signal-to-noise regimes governing recovery of the regression parameter and the latent permutations, and show that consistent estimation of the regression coefficient can be achieved under strictly weaker conditions than exact permutation recovery. To address the combinatorial difficulty of inference, we develop REPAIR, a variational Bayes method based on a block-structured permutation model that captures localized scrambling while substantially reducing computational complexity. Simulations and an applied example illustrate the empirical behavior of REPAIR and support the theoretical results.

2603.19480 2026-03-23 stat.ME math.ST stat.TH

Regression Adjustments for Double Randomization in Two-Sided Marketplaces

Timothy Sudijono, Lihua Lei, Lorenzo Masoero, Suhas Vijaykumar, Guido Imbens, James McQueen

Comments 72 pages. Comments welcome

详情
英文摘要

Multiple randomization designs (MRDs) are a class of experimental designs used to handle interference in two-sided marketplaces. We investigate regression adjustment strategies for estimating total, spillover, and direct effects in MRDs. We derive minimum asymptotic variance estimators among a broad class of linearly adjusted estimators, without assuming a linear model on the potential outcomes. Surprisingly, the optimal regression adjustments are estimable from data and are generally different from regression adjustments in classical randomized experiments. For example, one such optimal estimator for the direct effect corresponds to a weighted regression with interacted two-way fixed effects. We establish model-robustness properties, central limit theorems, and inferential methods for our estimators, relying on improved theoretical results for MRD experiments. Our results provide the analog of classical regression adjustments for marketplace experiments. Numerical simulations demonstrate a considerable increase in efficiency over simpler approaches, enabling better inference when running MRDs.

2603.19440 2026-03-23 stat.ML cs.LG

Near-Equivalent Q-learning Policies for Dynamic Treatment Regimes

Sophia Yazzourh, Erica E. M. Moodie

Comments 13 pages, 2 figures

详情
英文摘要

Precision medicine aims to tailor therapeutic decisions to individual patient characteristics. This objective is commonly formalized through dynamic treatment regimes, which use statistical and machine learning methods to derive sequential decision rules adapted to evolving clinical information. In most existing formulations, these approaches produce a single optimal treatment at each stage, leading to a unique decision sequence. However, in many clinical settings, several treatment options may yield similar expected outcomes, and focusing on a single optimal policy may conceal meaningful alternatives. We extend the Q-learning framework for retrospective data by introducing a worst-value tolerance criterion controlled by a hyperparameter $\varepsilon$, which specifies the maximum acceptable deviation from the optimal expected value. Rather than identifying a single optimal policy, the proposed approach constructs sets of $\varepsilon$-optimal policies whose performance remains within a controlled neighborhood of the optimum. This formulation shifts Q-learning from a vector-valued representation to a matrix-valued one, allowing multiple admissible value functions to coexist during backward recursion. The approach yields families of near-equivalent treatment strategies and explicitly identifies regions of treatment indifference where several decisions achieve comparable outcomes. We illustrate the framework in two settings: a single-stage problem highlighting indifference regions around the decision boundary, and a multi-stage decision process based on a simulated oncology model describing tumor size and treatment toxicity dynamics.
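The tolerance criterion can be illustrated for a single decision stage. The sketch below only shows how a set of $\varepsilon$-optimal treatments could be read off from estimated Q-values; it does not implement the matrix-valued backward recursion described in the abstract, and the function name, treatment labels, and Q-values are hypothetical.

```python
def epsilon_optimal_actions(q_values, eps):
    """Return all actions whose Q-value is within eps of the best one."""
    best = max(q_values.values())
    return sorted(a for a, q in q_values.items() if q >= best - eps)

# Hypothetical estimated Q-values for one patient state.
q = {"A": 1.00, "B": 0.97, "C": 0.60}
print(epsilon_optimal_actions(q, eps=0.05))  # treatments A and B are near-equivalent
print(epsilon_optimal_actions(q, eps=0.0))   # only the single optimal treatment
```

With a tolerance of zero the set collapses to the usual single optimal action, so the standard Q-learning decision rule is recovered as a special case.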

2603.19439 2026-03-23 stat.ML cs.LG eess.SP

Subspace Projection Methods for Fast Spectral Embeddings of Evolving Graphs

Mohammad Eini, Abdullah Karaaslanli, Vassilis Kalantzis, Panagiotis A. Traganitis

详情
英文摘要

Several graph data mining, signal processing, and machine learning downstream tasks rely on information related to the eigenvectors of the associated adjacency or Laplacian matrix. Classical eigendecomposition methods are powerful when the matrix remains static but cannot be applied to problems where the matrix entries are updated or the number of rows and columns increases frequently. Such scenarios occur routinely in graph analytics when the graph is changing dynamically and edges and/or nodes are being added and removed. This paper puts forth a new algorithmic framework to update the eigenvectors associated with the leading eigenvalues of an initial adjacency or Laplacian matrix as the graph evolves dynamically. The proposed algorithm is based on Rayleigh-Ritz projections, in which the original eigenvalue problem is projected onto a restricted subspace which ideally encapsulates the invariant subspace associated with the sought eigenvectors. Following ideas from eigenvector perturbation analysis, we present a new methodology to build the projection subspace. The proposed framework features lower computational and memory complexity with respect to competitive alternatives while empirical results show strong qualitative performance, both in terms of eigenvector approximation and accuracy of downstream learning tasks of central node identification and node clustering.

2603.19403 2026-03-23 stat.AP

Evaluation of Individual and Trial Level Association Metrics in the Validation of a Binary Surrogate Endpoint for a True Time-to-Event Endpoint

Renee Y. Ge, Azadeh Shohoudi, Malini Iyengar, Quefeng Li, Judy Li

Comments 29 pages, 6 figures

详情
英文摘要

Candidate binary endpoints are often considered as surrogates for time-to-event (TTE) clinical endpoints, primarily because they can be assessed at earlier time points. To be submitted for regulatory approval, candidate binary endpoints need to be validated. The most well-known method for performing such validation employs a meta-analytic framework to estimate individual-level and trial-level association. However, the performance of these association estimates in the context of a binary surrogate has not yet been examined through a comprehensive simulation study. This research aims to systematically investigate the performance of association estimates at the trial level and at the individual level under various trial design choices, using both simulation studies and clinical trial data, where available.

2603.19336 2026-03-23 stat.ME

Coordinate Descent Algorithm for Least Absolute Deviations Regression

Zehaan Naik, Debasis Kundu

Comments 28 pages, 7 figures

详情
英文摘要

Least Absolute Deviations (LAD) regression provides a robust alternative to ordinary least squares by minimizing the sum of absolute residuals. However, its widespread use has been limited by the computational cost of existing solvers, particularly simplex-based methods in high-dimensional settings. We propose a coordinate descent algorithm for LAD regression that avoids matrix inversion, naturally accommodates the non-differentiability of the objective function, and remains well-defined even when the number of predictors exceeds the number of observations. The key observation is that each coordinate update reduces to a one-dimensional minimization admitting a closed-form solution given by a median or weighted median. The resulting algorithm has per-iteration complexity $O(p\,n \log n)$ and is provably convergent due to the convexity of the LAD objective and the exactness of each coordinate update. Experiments on synthetic and real datasets show that the method matches the accuracy of linear-programming-based LAD solvers while offering improved scalability and stability in high-dimensional regimes, including cases where $p \ge n$. The method is easy to implement, requires no specialized optimization software, and provides a practical tool for robust linear models.
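The weighted-median coordinate update can be sketched as follows. This is an illustrative reading of the idea stated in the abstract, not the authors' implementation: the function names, the fixed number of sweeps, and the tie-breaking rule in the weighted median are assumptions.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(weights)
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]

def lad_loss(X, y, beta):
    """Sum of absolute residuals for coefficients beta."""
    return sum(abs(yi - sum(x * b for x, b in zip(row, beta)))
               for row, yi in zip(X, y))

def lad_coordinate_descent(X, y, n_sweeps=50):
    """Cyclic coordinate descent for LAD; each update is a weighted median."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = [float(v) for v in y]  # residuals y - X beta for beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            ratios, weights = [], []
            for i in range(n):
                xij = X[i][j]
                if xij != 0.0:
                    # Partial residual with coordinate j removed, divided by x_ij.
                    ratios.append((resid[i] + xij * beta[j]) / xij)
                    weights.append(abs(xij))
            if not ratios:
                continue
            new_bj = weighted_median(ratios, weights)
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Intercept-only model: the LAD estimate is a median, so the outlier is ignored.
print(lad_coordinate_descent([[1.0]] * 5, [1, 2, 3, 4, 100]))
```

Each coordinate update sorts the ratios once, which matches the $O(n \log n)$ cost per coordinate implied by the per-iteration complexity in the abstract. No matrix inversion is needed, so the sketch also runs when there are more predictors than observations.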

2603.19331 2026-03-23 cs.LG stat.ML

FalconBC: Flow matching for Amortized inference of Latent-CONditioned physiologic Boundary Conditions

Chloe H. Choi, Alison L. Marsden, Daniele E. Schiavazzi

详情
英文摘要

Boundary condition tuning is a fundamental step in patient-specific cardiovascular modeling. Despite an increase in offline training cost, recent methods in data-driven variational inference can efficiently estimate the joint posterior distribution of boundary conditions, with amortization of training efforts over clinical targets. However, even the most modern approaches fall short in two important scenarios: open-loop models with known mean flow and assumed waveform shapes, and anatomies affected by vascular lesions where segmentation influences the reachability of pressure or flow split targets. In both cases, boundary conditions cannot be tuned in isolation. We introduce a general amortized inference framework based on probabilistic flow that treats clinical targets, inflow features, and point cloud embeddings of patient-specific anatomies as either conditioning variables or quantities to be jointly estimated. We demonstrate the approach on two patient-specific models: an aorto-iliac bifurcation with varying stenosis locations and severity, and a coronary arterial tree.

2603.19291 2026-03-23 cs.LG cs.AI stat.ML

A Visualization for Comparative Analysis of Regression Models

Nassime Mountasir, Baptiste Lafabregue, Bruno Albert, Nicolas Lachiche

详情
英文摘要

As regression is a widely studied problem, many methods have been proposed to solve it, each of them often requiring setting different hyper-parameters. Therefore, selecting the proper method for a given application may be very difficult and relies on comparing their performances. Performance is usually measured using various metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), or R-squared ($R^2$). These metrics provide a numerical summary of predictive accuracy by quantifying the difference between predicted and actual values. However, while these metrics are widely used in the literature for summarizing model performance and useful to distinguish between models performing poorly and well, they often aggregate too much information. This article addresses these limitations by introducing a novel visualization approach that highlights key aspects of regression model performance. The proposed method builds upon three main contributions: (1) considering the residuals in a 2D space, which allows for simultaneous evaluation of errors from two models, (2) leveraging the Mahalanobis distance to account for correlations and differences in scale within the data, and (3) employing a colormap to visualize the percentile-based distribution of errors, making it easier to identify dense regions and outliers. By graphically representing the distribution of errors and their correlations, this approach provides a more detailed and comprehensive view of model performance, enabling users to uncover patterns that traditional aggregate metrics may obscure. The proposed visualization method facilitates a deeper understanding of regression model performance differences and error distributions, enhancing the evaluation and comparison process.
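The first two ingredients, residual pairs in a 2D space and the Mahalanobis distance, can be sketched as below. This is an assumption-laden illustration rather than the paper's method: the function name, the choice to center the pairs at their sample mean, the use of the sample covariance, and the residual values are all hypothetical.

```python
import math

def mahalanobis_2d(res_a, res_b):
    """Mahalanobis distance of each residual pair from the mean of the pairs."""
    n = len(res_a)
    ma, mb = sum(res_a) / n, sum(res_b) / n
    # Entries of the 2x2 sample covariance matrix of the residual pairs.
    saa = sum((a - ma) ** 2 for a in res_a) / (n - 1)
    sbb = sum((b - mb) ** 2 for b in res_b) / (n - 1)
    sab = sum((a - ma) * (b - mb) for a, b in zip(res_a, res_b)) / (n - 1)
    det = saa * sbb - sab * sab
    if det <= 0:
        raise ValueError("singular covariance matrix")
    distances = []
    for a, b in zip(res_a, res_b):
        da, db = a - ma, b - mb
        # Quadratic form d^T S^{-1} d using the closed-form 2x2 inverse.
        d2 = (sbb * da * da - 2 * sab * da * db + saa * db * db) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances

# Hypothetical residuals of two regression models on the same five test points.
res_model_a = [1.0, -1.0, 2.0, -2.0, 0.5]
res_model_b = [0.5, -1.5, 1.0, -1.0, 2.0]
print([round(d, 3) for d in mahalanobis_2d(res_model_a, res_model_b)])
```

Because the distance rescales by the covariance of the two error series, points where both models err in a correlated way count as less unusual than points where only one model errs; ranking the distances would give the kind of percentile values a colormap could display.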

2512.18720 2026-03-23 stat.ML cs.LG

Unsupervised Feature Selection via Robust Autoencoder and Adaptive Graph Learning

Feng Yu, MD Saifur Rahman Mazumder, Ying Su, Oscar Contreras Velasco

详情
英文摘要

Effective feature selection is essential for high-dimensional data analysis and machine learning. Unsupervised feature selection (UFS) aims to simultaneously cluster data and identify the most discriminative features. Most existing UFS methods linearly project features into a pseudo-label space for clustering, but they suffer from two critical limitations: (1) an oversimplified linear mapping that fails to capture complex feature relationships, and (2) an assumption of uniform cluster distributions, ignoring outliers prevalent in real-world data. To address these issues, we propose the Robust Autoencoder-based Unsupervised Feature Selection (RAEUFS) model, which leverages a deep autoencoder to learn nonlinear feature representations while inherently improving robustness to outliers. We further develop an efficient optimization algorithm for RAEUFS. Extensive experiments demonstrate that our method outperforms state-of-the-art UFS approaches in both clean and outlier-contaminated data settings.

2509.24005 2026-03-23 cs.LG stat.ML

Does Weak-to-strong Generalization Happen under Spurious Correlations?

Chenruo Liu, Yijun Dong, Qi Lei

Details
English abstract

We initiate a unified theoretical and algorithmic study of a key problem in weak-to-strong (W2S) generalization: when fine-tuning a strong pre-trained student with pseudolabels from a weaker teacher on a downstream task with spurious correlations, does W2S happen, and how to improve it upon failures? We consider two sources of spurious correlations caused by group imbalance: (i) a weak teacher fine-tuned on group-imbalanced labeled data with a minority group of fraction $\eta_\ell$, and (ii) a group-imbalanced unlabeled set pseudolabeled by the teacher with a minority group of fraction $\eta_u$. Theoretically, a precise characterization of W2S gain at the proportional asymptotic limit shows that W2S always happens with sufficient pseudolabels when $\eta_u = \eta_\ell$ but may fail when $\eta_u \ne \eta_\ell$, where W2S gain diminishes as $(\eta_u - \eta_\ell)^2$ increases. Our theory is corroborated by extensive experiments on various spurious correlation benchmarks and teacher-student pairs. To boost W2S performance upon failures, we further propose a simple, effective algorithmic remedy that retrains the strong student on its high-confidence data subset after W2S fine-tuning. Our algorithm is group-label-free and achieves consistent, substantial improvements over vanilla W2S fine-tuning.

2507.16945 2026-03-23 stat.ME

Optimal two-phase sampling designs for generalized raking estimators with multiple parameters of interest

Jasper B. Yang, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

Comments 40 pages (27 main, 13 supplemental); 1 figure, 5 tables

Details
English abstract

Large observational datasets, including those derived from electronic health records, are a valuable resource for medical research but are often affected by missingness, measurement error, and misclassification. Two-phase sampling with generalized raking (GR) estimation is an efficient and robust approach to statistical inference in such settings. In this approach, variables that are unavailable or measured with error in a large phase 1 cohort are obtained with higher-quality measurements in a phase 2 subsample. Previous research has studied optimal phase 2 sampling designs for inverse probability weighted (IPW) estimators in non-adaptive, multi-parameter settings, and for GR estimators in single-parameter settings. In this work, we extend these results by deriving optimal adaptive, multiwave sampling designs for IPW and GR estimators when multiple parameters are of interest. We propose several practical allocation strategies and evaluate their performance through extensive simulations and a data example from the Vanderbilt Comprehensive Care Clinic HIV Study. Our results show that independently optimizing allocation for each parameter improves efficiency over traditional case-control sampling. We also derive an integer-valued, A-optimal allocation method that typically outperforms independent optimization. Notably, we find that optimal designs for GR can differ substantially from those for IPW, and that this distinction can meaningfully affect estimator efficiency in the multiple-parameter setting. These findings offer practical guidance for future two-phase studies involving incomplete or error-prone data.

2506.12177 2026-03-23 stat.ME q-bio.QM stat.AP

A proxy-based approach for unmeasured confounding in electronic health records research

Haley Colgate Kottler, Amy Cochran

Details
English abstract

Electronic health records (EHR) are widely used to study clinical decisions, yet unmeasured confounding remains a persistent challenge. Proxy variables offer a potential solution. In EHR data, clinicians already record many such measurements (e.g., vitals), each revealing something about a patient's underlying health. Despite this, proxy-based methods are rarely used in practice. We introduce a new way to use proxies to adjust for unmeasured confounding. Our approach uses a vector of proxies to construct covariates that capture aspects of the unmeasured confounder, which are then included in a regression model. As one implementation, we use factor analysis followed by regression. We compare this approach with existing methods, including proximal causal inference, across a range of realistic settings. In practice, assumptions rarely hold exactly, so we study what happens when models are misspecified and variables are used incorrectly: e.g., a confounder or instrument is treated as a proxy. Finally, we apply the method to EHR data to estimate the effect of hospital admission for older adults presenting to the emergency department with chest pain, a setting where unmeasured confounding is a substantial concern. This work provides a practical way to use proxies and may help bring proxy-based methods into broader use.

2505.08729 2026-03-23 stat.ME econ.EM

Which Covariates to Adjust for? Specification-robust Causal Inference in Observational Studies

Aditya Ghosh, Dominik Rothenhäusler

Comments 61 pages, 4 figures

Details
English abstract

In observational causal inference, domain knowledge often leaves multiple covariate adjustments plausible, yet which sets satisfy ignorability is untestable. Different adjustment sets can yield conflicting estimates of the average treatment effect, and standard remedies (adjusting for their union or intersection, or reporting the union or convex hull of confidence intervals) can fail or produce intervals whose width does not vanish with sample size. We propose a specification-robust procedure that returns a single point estimate and a confidence interval that is valid as long as at least one candidate adjustment set is valid and has width shrinking at the parametric $n^{-1/2}$ rate. Our approach mirrors how trimming and overlap weighting handle overlap violations: we shift the target to a reweighted population, closest in KL-divergence to the original population, for which credible, specification-robust inference is feasible. We also provide diagnostic plots to assess the population shift and an extension to protect any function of the covariates used for reweighting, similar to calipers in matching. Synthetic and real-data examples demonstrate that our procedure provides substantially tighter confidence intervals than the convex hull while maintaining nominal coverage.

2502.09880 2026-03-23 physics.soc-ph cs.LG cs.SI nlin.AO stat.ML

Interpretable Early Warnings using Machine Learning in an Online Game-experiment

Guillaume Falmagne, Anna B. Stephenson, Simon A. Levin

Details
Journal ref PNAS 123(1), e2503493122 (2026)
English abstract

Stemming from physics and later applied to other fields such as ecology, the theory of critical transitions suggests that some regime shifts are preceded by statistical early warning signals. Reddit's r/place experiment, a large-scale social game, provides a unique opportunity to test these signals consistently across thousands of subsystems undergoing critical transitions. In r/place, millions of users collaboratively created "compositions", or pixel-art drawings, in which transitions occur when one composition rapidly replaces another. We develop a machine-learning-based early warning system that combines the predictive power of multiple system-specific time series via gradient-boosted decision trees with memory-retaining features. Our method significantly outperforms standard early warning indicators. Trained on the 2022 r/place data, our algorithm detects half of the transitions occurring within 20 min at a false positive rate of just 3.6%. Its performance remains robust when tested on the 2023 r/place event, demonstrating generalizability across different contexts. Using SHapley Additive exPlanations (SHAP) for interpreting the predictions, we investigate the underlying drivers of warnings, which could be relevant to other complex systems, especially online social systems. We reveal an interplay of patterns preceding transitions, such as critical slowing down or speeding up, a lack of innovation or coordination, turbulent histories, and a lack of image complexity. These findings show the potential of machine learning indicators in socio-ecological systems for predicting regime shifts and understanding their dynamics.

2502.05709 2026-03-23 cs.LG stat.ML

Flow-based Conformal Prediction for Multi-dimensional Time Series

Junghwan Lee, Chen Xu, Yao Xie

Details
English abstract

Time series prediction underpins a broad range of downstream tasks across many scientific domains. Recent advances and increasing adoption of black-box machine learning models for time series prediction highlight the critical need for uncertainty quantification. While conformal prediction has gained attention as a reliable uncertainty quantification method, conformal prediction for time series faces two key challenges: (1) leveraging correlations in observations and non-conformity scores to overcome the exchangeability assumption, and (2) constructing prediction sets for multi-dimensional outcomes. To address these challenges, we propose a novel conformal prediction method for time series using flow with classifier-free guidance. We provide coverage guarantees by establishing exact non-asymptotic marginal coverage and a finite-sample bound on conditional coverage for the proposed method. Evaluations on real-world time series datasets demonstrate that our method constructs significantly smaller prediction sets than existing conformal prediction methods, maintaining target coverage.
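
For context, the baseline that the abstract contrasts against can be sketched as ordinary split conformal prediction with a ball-shaped set for a multi-dimensional outcome. This is not the flow-based method of the paper: it assumes exchangeable calibration data, which is precisely the assumption the abstract says time series violate. The function names and the Euclidean-norm score are illustrative assumptions.

```python
import math


def conformal_radius(cal_pred, cal_true, alpha):
    """Radius of a ball-shaped prediction set from split conformal calibration.

    cal_pred and cal_true are sequences of equal-length coordinate tuples
    (point predictions and observed outcomes on a held-out calibration set).
    """
    # Non-conformity score: Euclidean norm of each calibration residual.
    scores = sorted(math.dist(p, y) for p, y in zip(cal_pred, cal_true))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def in_prediction_set(pred, candidate, radius):
    """True if the candidate outcome lies in the ball around the prediction."""
    return math.dist(pred, candidate) <= radius
```

Under exchangeability, a ball of this radius around a new point prediction covers the true outcome with probability at least 1 - alpha. A ball ignores the shape of the error distribution, so it can be much larger than necessary, which is one motivation for learning the set shape instead.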

2409.06890 2026-03-23 stat.ML cs.LG

Learning Representations for Independence Testing

Nathaniel Xu, Feng Liu, Danica J. Sutherland

Comments v3: as published at TMLR (https://openreview.net/forum?id=pDvKoXRsnW), including many relatively smaller improvements

Details
English abstract

Many tools exist to detect dependence between random variables, a core question across a wide range of machine learning, statistical, and scientific endeavors. Although several statistical tests guarantee eventual detection of any dependence with enough samples, standard tests may require an exorbitant amount of samples for detecting subtle dependencies between high-dimensional random variables with complex distributions. In this work, we study two related ways to learn powerful independence tests. First, we show how to construct powerful statistical tests with finite-sample validity by using variational estimators of mutual information, such as the InfoNCE or NWJ estimators. Second, we establish a close connection between these variational mutual information-based tests and tests based on the Hilbert-Schmidt Independence Criterion (HSIC); in particular, learning a variational bound (typically parameterized by a deep network) for mutual information is closely related to learning a kernel for HSIC. Finally, we show how to, rather than selecting a representation to maximize the statistic itself, select a representation which can maximize the power of a test, in either setting; we term the former case a Neural Dependency Statistic (NDS). While HSIC power optimization has been recently considered in the literature, we correct some important misconceptions and expand to considering deep kernels. In our experiments, while all approaches can yield powerful tests with exact level control, optimized HSIC tests generally outperform the other approaches on difficult problems of detecting structured dependence.
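
As a rough illustration of the kind of test discussed above, the sketch below computes a biased HSIC estimate with a fixed Gaussian kernel on scalar samples and calibrates it with a permutation test. It is only a baseline under stated assumptions: the bandwidth is fixed rather than learned, no deep kernel or power optimization is involved, and all function names are hypothetical rather than taken from the paper.

```python
import math
import random


def gaussian_gram(xs, bandwidth):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in xs]
            for a in xs]


def center(gram):
    """Double-center a symmetric Gram matrix: H K H with H = I - 11^T / n."""
    n = len(gram)
    row = [sum(r) / n for r in gram]
    total = sum(row) / n
    return [[gram[i][j] - row[i] - row[j] + total for j in range(n)]
            for i in range(n)]


def hsic_stat(kc, l, perm):
    """Biased HSIC, trace(K H L H) / n^2, with the y-samples permuted by perm."""
    n = len(kc)
    return sum(kc[i][j] * l[perm[i]][perm[j]]
               for i in range(n) for j in range(n)) / n ** 2


def hsic_permutation_test(xs, ys, bandwidth=1.0, n_perm=200, seed=0):
    """Return the observed HSIC and a permutation p-value."""
    rng = random.Random(seed)
    n = len(xs)
    kc = center(gaussian_gram(xs, bandwidth))
    l = gaussian_gram(ys, bandwidth)
    identity = list(range(n))
    observed = hsic_stat(kc, l, identity)
    exceed = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # break any pairing between xs and ys
        if hsic_stat(kc, l, perm) >= observed:
            exceed += 1
    # Counting the observed statistic itself keeps the test valid at finite n.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

Rejecting independence when the p-value is at most the chosen level gives a test whose level holds at any sample size, because under independence the observed statistic is exchangeable with the permuted ones. Its power, however, depends heavily on the kernel, which is the quantity the abstract proposes to learn.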

2305.07433 2026-03-23 stat.AP

Aligning the Western Balkans power sectors with the European Green Deal

Emir Fejzić, Taco Niet, Cameron Wade, Will Usher

Comments 34 pages, 14 figures

Details
English abstract

Located in Southern Europe, the Drina River Basin is shared between Bosnia and Herzegovina, Montenegro, and Serbia. The power sectors of the three countries have an exceptionally high dependence on coal for power generation. In this paper, we analyse different development pathways for achieving climate neutrality in these countries and explore the potential of variable renewable energy (VRE) and its role in power sector decarbonization. We investigate whether hydro and non-hydro renewables can enable a net-zero transition by 2050 and how VRE might affect the hydropower cascade shared by the three countries. The Open-Source Energy Modelling System (OSeMOSYS) was used to develop a model representation of the countries' power sectors. Findings show that the renewable potential of the countries is a significant 94.4 GW. This potential is 68% higher than previous assessments have shown. Under an Emission Limit scenario assuming net zero by 2050, 17% of this VRE potential is utilized to support the decarbonization of the power sectors. Additional findings show a limited impact of VRE technologies on total power generation output from the hydropower cascade. However, increased solar deployment shifts the operation of the cascade to increased short-term balancing, moving from baseload to more responsive power generation patterns. Prolonged use of thermal power plants is observed under scenarios assuming high wholesale electricity prices, leading to increased emissions. Results from scenarios with low cost of electricity trade suggest power sector developments that lead to decreased energy security.

1911.01850 2026-03-23 stat.ME stat.AP

Stabilizing Variable Selection and Regression

Niklas Pfister, Evan G. Williams, Jonas Peters, Ruedi Aebersold, Peter Bühlmann

Details
English abstract

We consider regression in which one predicts a response $Y$ with a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of argument for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to graphically characterize stable versus unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error given that the resulting regression generalizes to unseen new environments.