arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.00781 2026-02-10 cs.LG stat.ML

Categorical Reparameterization with Denoising Diffusion models

Samson Gourevitch, Alain Durmus, Eric Moulines, Jimmy Olsson, Yazid Janati

Comments preprint

Learning models with categorical variables requires optimizing expectations over discrete distributions, a setting in which stochastic gradient-based optimization is challenging due to the non-differentiability of categorical sampling. A common workaround is to replace the discrete distribution with a continuous relaxation, yielding a smooth surrogate that admits reparameterized gradient estimates via the reparameterization trick. Building on this idea, we introduce ReDGE, a novel and efficient diffusion-based soft reparameterization method for categorical distributions. Our approach defines a flexible class of gradient estimators that includes the Straight-Through estimator as a special case. Experiments spanning latent variable models and inference-time reward guidance in discrete diffusion models demonstrate that ReDGE consistently matches or outperforms existing gradient-based methods. The code will be made available at https://github.com/samsongourevitch/redge.
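
To make the continuous-relaxation idea concrete, here is a minimal sketch of the Gumbel-softmax relaxation, a standard relaxation of categorical sampling that is swapped in purely for illustration. It is not ReDGE, and the function names, the temperature `tau`, and the example probabilities are assumptions of this sketch.

```python
import math
import random

def gumbel_softmax_sample(logits, tau, rng):
    """Return one relaxed categorical sample (a point on the probability simplex)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-300)          # guard against log(0)
        gumbel = -math.log(-math.log(u))       # Gumbel(0, 1) noise via the inverse CDF
        noisy.append((logit + gumbel) / tau)   # lower tau gives samples closer to one-hot
    top = max(noisy)                           # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def harden(soft):
    """One-hot vector at the argmax of a relaxed sample (the discrete sample)."""
    winner = soft.index(max(soft))
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]

# Hypothetical target distribution, chosen only for this demonstration.
probs = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in probs]
rng = random.Random(0)
n_draws = 20000
counts = [0] * len(probs)
for _ in range(n_draws):
    soft = gumbel_softmax_sample(logits, tau=0.5, rng=rng)
    hard = harden(soft)
    counts[hard.index(1.0)] += 1
# The hardened samples follow the target categorical distribution (Gumbel-max trick).
frequencies = [c / n_draws for c in counts]
```

A straight-through variant would use `hard` in the forward pass and differentiate through `soft` in the backward pass; that step needs an autodiff framework and is omitted here.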

2602.08980 2026-02-10 physics.soc-ph cs.LG cs.SI stat.ML

When do neural ordinary differential equations generalize on complex networks?

Moritz Laber, Tina Eliassi-Rad, Brennan Klein

Neural ordinary differential equations (neural ODEs) can effectively learn dynamical systems from time series data, but their behavior on graph-structured data remains poorly understood, especially when applied to graphs with different size or structure than encountered during training. We study neural ODEs ($\mathtt{nODE}$s) with vector fields following the Barabási-Barzel form, trained on synthetic data from five common dynamical systems on graphs. Using the $\mathbb{S}^1$-model to generate graphs with realistic and tunable structure, we find that degree heterogeneity and the type of dynamical system are the primary factors in determining $\mathtt{nODE}$s' ability to generalize across graph sizes and properties. This extends to $\mathtt{nODE}$s' ability to capture fixed points and maintain performance amid missing data. Average clustering plays a secondary role in determining $\mathtt{nODE}$ performance. Our findings highlight $\mathtt{nODE}$s as a powerful approach to understanding complex systems but underscore challenges emerging from degree heterogeneity and clustering in realistic graphs.

2602.08933 2026-02-10 stat.ML cs.LG cs.NE stat.ME

Provably robust learning of regression neural networks using $β$-divergences

Abhik Ghosh, Suryasis Jana

Comments Pre-print, under review

Regression neural networks (NNs) are most commonly trained by minimizing the mean squared prediction error, which is highly sensitive to outliers and data contamination. Existing robust training methods for regression NNs are often limited in scope and rely primarily on empirical validation, with only a few offering partial theoretical guarantees. In this paper, we propose a new robust learning framework for regression NNs based on the $β$-divergence (also known as the density power divergence), which we call "rRNet". It applies to a broad class of regression NNs, including models with non-smooth activation functions and error densities, and recovers the classical maximum likelihood learning as a special case. The rRNet is implemented via an alternating optimization scheme, for which we establish convergence guarantees to stationary points under mild, verifiable conditions. The (local) robustness of rRNet is theoretically characterized through the influence functions of both the parameter estimates and the resulting rRNet predictor, which are shown to be bounded for suitable choices of the tuning parameter $β$, depending on the error density. We further prove that rRNet attains the optimal 50% asymptotic breakdown point at the assumed model for all $β\in(0, 1]$, providing a strong global robustness guarantee that is largely absent for existing NN learning methods. Our theoretical results are complemented by simulation experiments and real-data analyses, illustrating practical advantages of rRNet over existing approaches in both function approximation problems and prediction tasks with noisy observations.
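
As a rough illustration of why a density-power-divergence objective resists outliers, the sketch below evaluates the standard density power divergence loss for Gaussian errors on a list of residuals. It is not the rRNet implementation: the Gaussian error model, the fixed scale `sigma`, and the function names are assumptions made for this example.

```python
import math

def gaussian_density(residual, sigma):
    """Density of N(0, sigma^2) evaluated at a residual."""
    return math.exp(-residual ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def dpd_loss(residuals, sigma, beta):
    """Density power divergence objective for Gaussian errors (requires beta > 0)."""
    if beta <= 0:
        raise ValueError("beta must be positive; beta -> 0 recovers maximum likelihood")
    # Closed form of the integral of the (1 + beta)-th power of the N(0, sigma^2) density.
    integral = (2 * math.pi) ** (-beta / 2) * sigma ** (-beta) / math.sqrt(1 + beta)
    # Each observation enters only through density**beta, which is bounded.
    data_term = sum(gaussian_density(r, sigma) ** beta for r in residuals) / len(residuals)
    return integral - (1 + 1 / beta) * data_term

def mse_loss(residuals):
    """Mean squared error, shown for contrast."""
    return sum(r ** 2 for r in residuals) / len(residuals)

clean = [0.1, -0.2, 0.05]
mild_outlier = clean + [1e3]
wild_outlier = clean + [1e6]

# The DPD loss barely distinguishes a large outlier from an enormous one,
# whereas the squared error grows without bound.
dpd_gap = abs(dpd_loss(wild_outlier, 1.0, 0.5) - dpd_loss(mild_outlier, 1.0, 0.5))
mse_gap = mse_loss(wild_outlier) - mse_loss(mild_outlier)
```

Because a far-away residual contributes almost nothing to `data_term`, its influence on the objective stays bounded, which is the intuition behind the bounded influence functions mentioned in the abstract.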

2602.08892 2026-02-10 stat.ML cs.LG econ.EM

Winner's Curse Drives False Promises in Data-Driven Decisions: A Case Study in Refugee Matching

Hamsa Bastani, Osbert Bastani, Bryce McLaughlin

A major challenge in data-driven decision-making is accurate policy evaluation, i.e., guaranteeing that a learned decision-making policy achieves the promised benefits. A popular strategy is model-based policy evaluation, which estimates a model from data to infer counterfactual outcomes. This strategy is known to produce unwarrantedly optimistic estimates of the true benefit due to the winner's curse. We searched the recent literature on data-driven decision-making, identifying a sample of 55 papers published in Management Science in the past decade; all but two relied on this flawed methodology. Several common justifications are provided: (1) the estimated models are accurate, stable, and well-calibrated, (2) the historical data uses random treatment assignment, (3) the model family is well-specified, and (4) the evaluation methodology uses sample splitting. Unfortunately, we show that no combination of these justifications avoids the winner's curse. First, we provide a theoretical analysis demonstrating that the winner's curse can cause large, spurious reported benefits even when all these justifications hold. Second, we perform a simulation study based on the recent and consequential data-driven refugee matching problem. We construct a synthetic refugee matching environment (calibrated to closely match the real setting) but designed so that no assignment policy can improve expected employment compared to random assignment. Model-based methods report large, stable gains of around 60% even when the true effect is zero; these gains are on par with improvements of 22-75% reported in the literature. Our results provide strong evidence against model-based evaluation.
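
The winner's curse itself can be reproduced in a few lines. The toy simulation below is a deliberately simplified sketch, not the paper's refugee matching environment: every option has a true mean outcome of zero, yet the option that looks best on noisy estimates appears beneficial. The number of options, sample sizes, and function name are assumptions chosen for illustration.

```python
import random
import statistics

def reported_gain_of_selected(n_options, n_samples, rng):
    """Estimated value of the option that looks best, when all true values are 0."""
    estimates = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_samples))
        for _ in range(n_options)
    ]
    # Selecting the maximum and reporting its own estimate is optimistically biased.
    return max(estimates)

rng = random.Random(0)
gains = [reported_gain_of_selected(n_options=20, n_samples=25, rng=rng) for _ in range(2000)]
mean_reported_gain = statistics.fmean(gains)  # clearly positive
true_gain = 0.0                               # by construction, no option helps
```

The gap between `mean_reported_gain` and `true_gain` comes purely from choosing and evaluating with the same noisy estimates.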

2602.08888 2026-02-10 math.PR math.ST q-fin.MF stat.TH

Almost sure null bankruptcy of testing-by-betting strategies

Hongjian Wang, Shubhada Agrawal, Aaditya Ramdas

The bounded mean betting procedure serves as a crucial interface between the domains of (1) sequential, anytime-valid statistical inference, and (2) online learning and portfolio selection algorithms. While recent work in both domains has established the exponential wealth growth of numerous betting strategies under any alternative distribution, the tightness of the inverted confidence sets, and the pathwise minimax regret bounds, little has been studied regarding the asymptotics of these strategies under the null hypothesis. Under the null, a strategy induces a wealth martingale converging to some random variable that can be zero (bankrupt) or non-zero (non-bankrupt, e.g. when it eventually stops betting). In this paper, we show the conceptually intuitive but technically nontrivial fact that these strategies (universal portfolio, Krichevsky-Trofimov, GRAPA, hedging, etc.) all go bankrupt with probability one, under any non-degenerate null distribution. Part of our analysis is based on the subtle almost sure divergence of various sums of $\sum O_p(n^{-1})$ type, a result of independent interest. We also demonstrate the necessity of null bankruptcy by showing that non-bankrupt strategies are all improvable in some sense. Our results significantly deepen our understanding of these betting strategies as they qualify their behavior on "almost all paths", whereas previous results are usually on "all paths" (e.g. regret bounds) or "most paths" (e.g. concentration inequalities and confidence sets).
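
For intuition about bankruptcy under the null, the sketch below tracks the log-wealth of a bettor who wagers a constant fraction against the hypothesized mean. A constant bet is the simplest possible strategy and is used here only as an assumption for illustration; it is not one of the adaptive strategies (universal portfolio, Krichevsky-Trofimov, GRAPA) analyzed in the paper.

```python
import math
import random

def log_wealth_path(observations, null_mean, bet_fraction):
    """Cumulative log-wealth when wealth is multiplied by 1 + bet * (x - m) each round."""
    log_wealth = 0.0
    path = []
    for x in observations:
        # For x in [0, 1] and |bet_fraction| < 1, the multiplier stays strictly positive.
        log_wealth += math.log1p(bet_fraction * (x - null_mean))
        path.append(log_wealth)
    return path

rng = random.Random(1)
# Data drawn under the null: Bernoulli(0.5) observations with true mean 0.5.
observations = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(5000)]
path = log_wealth_path(observations, null_mean=0.5, bet_fraction=0.5)
final_log_wealth = path[-1]  # strongly negative: wealth has collapsed toward zero
```

By Jensen's inequality the expected log-multiplier is negative whenever the bet is nonzero and the null distribution is non-degenerate, so the wealth of this constant bettor tends to zero even though it remains a martingale.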

2602.08865 2026-02-10 stat.ME

Regression modeling of multivariate precipitation extremes under regular variation

Rishikesh Yadav, Arnab Hazra

Comments 15 Pages, 3 figures, 1 table

Motivated by the EVA2025 data challenge, where we participated as the team DesiBoys, we propose a regression strategy within the framework of regular variation to estimate the occurrences and intensities of high precipitation extremes derived from different climate runs of the CESM2 Large Ensemble Community Project (LENS2). Our approach first empirically estimates the target quantities at sub-asymptotic (lower threshold) levels and sets them as response variables within a simple regression framework arising from the theoretical expressions of joint regular variation. Although a seasonal pattern is evident in the data, the precipitation intensities do not exhibit any significant long-term trends across years. Besides, we can safely assume the data to be independent across different climate model runs, thereby simplifying the modeling framework. Once the regression parameters are estimated, we employ a standard prediction approach to infer precipitation levels at very high quantiles. We calculate the confidence intervals using a nonparametric block bootstrap procedure. While a likelihood-based inference grounded in multivariate extreme value theory may provide more accurate estimates and confidence intervals, it would involve a significantly higher computational burden. Our proposed simple and computationally straightforward two-stage approach provides reasonable estimates for the desired quantities, securing us a joint second position in the final rankings of the EVA2025 conference data challenge competition.

2602.08862 2026-02-10 cs.LG cs.DS stat.ML

Near-optimal Swap Regret Minimization for Convex Losses

Lunjia Hu, Jon Schneider, Yifan Wu

We give a randomized online algorithm that guarantees near-optimal $\widetilde O(\sqrt T)$ expected swap regret against any sequence of $T$ adaptively chosen Lipschitz convex losses on the unit interval. This improves the previous best bound of $\widetilde O(T^{2/3})$ and answers an open question of Fishelson et al. [2025b]. In addition, our algorithm is efficient: it runs in $\mathsf{poly}(T)$ time. A key technical idea we develop to obtain this result is to discretize the unit interval into bins at multiple scales of granularity and simultaneously use all scales to make randomized predictions, which we call multi-scale binning and may be of independent interest. A direct corollary of our result is an efficient online algorithm for minimizing the calibration error for general elicitable properties. This result does not require the Lipschitzness assumption of the identification function needed in prior work, making it applicable to median calibration, for which we achieve the first $\widetilde O(\sqrt T)$ calibration error guarantee.
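
The discretization step behind multi-scale binning can be pictured as follows. This sketch only shows how a point in the unit interval falls into bins at several granularities; the dyadic scales and the function name are assumptions, and the randomized prediction rule that combines the scales is not reproduced here.

```python
def multiscale_bins(x, num_scales):
    """Locate x in [0, 1] within bins of width 1/2, 1/4, ..., 1/2**num_scales."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in the unit interval")
    located = []
    for level in range(1, num_scales + 1):
        n_bins = 2 ** level                      # assumed dyadic refinement per level
        index = min(int(x * n_bins), n_bins - 1)  # x == 1.0 goes to the last bin
        located.append((level, index, index / n_bins, (index + 1) / n_bins))
    return located

# Each tuple is (level, bin index, left edge, right edge).
bins_for_point = multiscale_bins(0.3, 3)
```

Coarse levels group many predictions together while fine levels resolve them, so a method that uses all levels at once can trade off the two.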

2602.08849 2026-02-10 stat.ML cond-mat.mtrl-sci cs.LG physics.chem-ph

Cutting Through the Noise: On-the-fly Outlier Detection for Robust Training of Machine Learning Interatomic Potentials

Terry C. W. Lam, Niamh O'Neill, Christoph Schran, Lars L. Schaaf

Comments 12 pages, 6 figures

The accuracy of machine learning interatomic potentials suffers from reference data that contains numerical noise. Often originating from unconverged or inconsistent electronic-structure calculations, this noise is challenging to identify. Existing mitigation strategies, such as manual filtering or iterative refinement of outliers, require either substantial expert effort or multiple expensive retraining cycles, making them difficult to scale to large datasets. Here, we introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples, without requiring additional reference calculations. By tracking the loss distribution via an exponential moving average, this unsupervised method identifies outliers throughout a single training run. We show that this approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead. The method's effectiveness is demonstrated by recovering accurate physical observables for liquid water from unconverged reference data, including diffusion coefficients. Furthermore, we validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three. This framework provides a simple, automated solution for training robust models on imperfect datasets across dataset sizes.
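
One way to picture loss tracking with an exponential moving average is the sketch below. It is a hypothetical minimal version, not the authors' scheme: the class name, the hard 0/1 weights, the decay of 0.9, and the three-standard-deviation threshold are all assumptions made for illustration.

```python
import math

class EmaOutlierWeigher:
    """Tracks an EMA of the loss mean and variance and zeroes out extreme losses."""

    def __init__(self, decay=0.9, threshold=3.0):
        self.decay = decay
        self.threshold = threshold
        self.mean = None
        self.var = 0.0

    def weight(self, loss):
        if self.mean is None:          # first sample initializes the running mean
            self.mean = loss
            return 1.0
        std = math.sqrt(self.var)
        if std > 0 and loss > self.mean + self.threshold * std:
            return 0.0                 # flagged as an outlier; statistics are not updated
        delta = loss - self.mean
        alpha = 1.0 - self.decay
        self.mean += alpha * delta     # EMA update of the mean
        self.var = self.decay * (self.var + alpha * delta ** 2)  # EMA update of the variance
        return 1.0

weigher = EmaOutlierWeigher()
losses = [1.0, 1.1, 0.9, 1.05, 0.95, 25.0, 1.0]
weights = [weigher.weight(loss) for loss in losses]
```

In a training loop, each sample's loss would be multiplied by its weight before backpropagation, so a flagged sample contributes no gradient in that step.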

2602.08787 2026-02-10 stat.AP

Accessibility and Serviceability Assessment to Inform Offshore Wind Energy Development and Operations off the U.S. East Coast

Cory Petersen, Feng Ye, Jiaxiang Ji, Josh Kohut, Ahmed Aziz Ezzat, David Saginaw, Avril Montanti, Jack Cammarota

The economic success of offshore wind energy projects relies on accurate projections of the construction, and operations and maintenance (O&M) costs. These projections must consider the logistical complexities introduced by adverse met-ocean conditions that can prohibit access to the offshore assets for sustained periods of time. In response, the goal of this study is two-fold: (1) to provide high-resolution estimates of the accessibility of key offshore wind energy areas off the United States (U.S.) East Coast, a region with significant offshore wind energy potential; and (2) to introduce a new operational metric, called serviceability, as motivated by the need to assess the accessibility of an offshore asset along a vessel travel path, rather than at a specific site, as commonly carried out in the literature. We hypothesize that serviceability is more relevant to offshore operations than accessibility, since it more realistically reflects the success and safety of a vessel operation along its journey from port to site and back. Our analysis reveals high temporal and spatial variations in accessibility and serviceability, even for proximate offshore locations. We also find that solely relying on numerical met-ocean data can introduce considerable bias in estimating accessibility and serviceability, raising the need for a statistical treatment that combines both numerical and observational data sources, such as the one proposed herein. Collectively, our analysis sheds light on the value of high-resolution met-ocean information and models in supporting offshore operations, including but not limited to future offshore wind energy developments.

2602.08782 2026-02-10 stat.ML cs.LG

Amortising Inference and Meta-Learning Priors in Neural Networks

Tommy Rochussen, Vincent Fortuin

Comments Accepted at ICLR 2026

One of the core facets of Bayesianism is the updating of prior beliefs in light of new evidence, so how can we maintain a Bayesian approach if we have no prior beliefs in the first place? This is one of the central challenges in the field of Bayesian deep learning, where it is not clear how to represent beliefs about a prediction task by prior distributions over model parameters. Bridging the fields of Bayesian deep learning and probabilistic meta-learning, we introduce a way to $\textit{learn}$ a weights prior from a collection of datasets by introducing a way to perform per-dataset amortised variational inference. The model we develop can be viewed as a neural process whose latent variable is the set of weights of a BNN and whose decoder is the neural network parameterised by a sample of the latent variable itself. This unique model allows us to study the behaviour of Bayesian neural networks under well-specified priors, use Bayesian neural networks as flexible generative models, and perform desirable but previously elusive feats in neural processes such as within-task minibatching or meta-learning under extreme data-starvation.

2602.08723 2026-02-10 cs.LG cs.CR stat.ML

Data Reconstruction: Identifiability and Optimization with Sample Splitting

Yujie Shen, Zihan Wang, Jian Qian, Qi Lei

Training data reconstruction from KKT conditions has shown striking empirical success, yet it remains unclear when the resulting KKT equations have unique solutions and, even in identifiable regimes, how to reliably recover solutions by optimization. This work focuses on these two complementary questions: identifiability and optimization. On the identifiability side, we discuss sufficient conditions for the KKT system of two-layer networks with polynomial activations to uniquely determine the training data, providing a theoretical explanation of when and why reconstruction is possible. On the optimization side, we introduce sample splitting, a curvature-aware refinement step applicable to general reconstruction objectives (not limited to KKT-based formulations): it creates additional descent directions to escape poor stationary points and refine solutions. Experiments demonstrate that augmenting several existing reconstruction methods with sample splitting consistently improves reconstruction performance.

2602.08657 2026-02-10 cs.LG stat.ME

Two-Stage Data Synthesization: A Statistics-Driven Restricted Trade-off between Privacy and Prediction

Xiaotong Liu, Shao-Bo Lin, Jun Fan, Ding-Xuan Zhou

Synthetic data have gained increasing attention across various domains, with a growing emphasis on their performance in downstream prediction tasks. However, most existing synthesis strategies focus on maintaining statistical information. Although some studies address prediction performance guarantees, their single-stage synthesis designs make it challenging to balance the privacy requirements that necessitate significant perturbations and the prediction performance that is sensitive to such perturbations. We propose a two-stage synthesis strategy. In the first stage, we introduce a synthesis-then-hybrid strategy, which involves a synthesis operation to generate pure synthetic data, followed by a hybrid operation that fuses the synthetic data with the original data. In the second stage, we present a kernel ridge regression (KRR)-based synthesis strategy, where a KRR model is first trained on the original data and then used to generate synthetic outputs based on the synthetic inputs produced in the first stage. By leveraging the theoretical strengths of KRR and the covariant distribution retention achieved in the first stage, our proposed two-stage synthesis strategy enables a statistics-driven restricted privacy-prediction trade-off and guarantees optimal prediction performance. We validate our approach and demonstrate its characteristics of being statistics-driven and restricted in achieving the privacy-prediction trade-off both theoretically and numerically. Additionally, we showcase its generalizability through applications to a marketing problem and five real-world datasets.

2602.08647 2026-02-10 stat.ME

Measures for Assessing Causal Effect Heterogeneity Unexplained by Covariates

Yuta Kawakami, Jin Tian

There has been considerable interest in estimating heterogeneous causal effects across individuals or subpopulations. Researchers often assess causal effect heterogeneity based on the subjects' covariates using the conditional average causal effect (CACE). However, substantial heterogeneity may persist even after accounting for the covariates. Existing work on causal effect heterogeneity unexplained by covariates has mainly focused on binary treatment and outcome. In this paper, we introduce novel heterogeneity measures, P-CACE and N-CACE, for binary treatment and continuous outcome that represent CACE over the positively and negatively affected subjects, respectively. We also introduce new heterogeneity measures, P-CPICE and N-CPICE, for continuous treatment and continuous outcome by leveraging stochastic interventions, expanding causal questions that researchers can answer. We establish identification and bounding theorems for these new measures. Finally, we show their application to a real-world dataset.

2602.08643 2026-02-10 stat.ME stat.AP

State policy heterogeneity analyses: considerations and proposals

Max Rubinstein, Megan S. Schuler, Elizabeth A. Stuart, Bradley D. Stein, Max Griswold, Elizabeth M. Stone, Beth Ann Griffin

State-level policy studies often conduct heterogeneity analyses that quantify how treatment effects vary across state characteristics. These analyses may be used to inform state-specific policy decisions, or to infer how the effect of a policy changes in combination with other state characteristics. However, in state-level settings with varied contexts and policy landscapes, multiple versions of similar policies, and differential policy implementation, the causal quantities targeted by these analyses may not align with the inferential goals. This paper clarifies these issues by distinguishing several causal estimands relevant to heterogeneity analyses in state-policy settings, including state-specific treatment effects (ITE), conditional average treatment effects (CATE), and controlled direct effects (CDE). We argue that the CATE is often the easiest to identify and estimate, but may not be the most policy-relevant target of inference. Moreover, the widespread practice of coarsening distinct policies or implementations into a single indicator further complicates the interpretation of these analyses. Motivated by these limitations, we propose bounding ITEs as an alternative inferential goal, yielding ranges for each state's policy effect under explicit assumptions that quantify deviations from the ideal identifying conditions. These bounds target a well-defined and policy-relevant quantity, the effect for specific states. We develop this approach within a difference-in-differences framework and discuss how sensitivity parameters may be informed using pre-treatment data. Through simulations we demonstrate that bounding state-specific effects can more reliably determine the sign of the ITEs than CATE estimates. We then apply this method to examine the effect of the Affordable Care Act Medicaid expansion on high-volume buprenorphine prescribing.

2602.08629 2026-02-10 cs.LG cs.AI stat.ML

CauScale: Neural Causal Discovery at Scale

Bo Peng, Sirui Chen, Jiaguo Tian, Yu Qiao, Chaochao Lu

Causal discovery is essential for advancing data-driven fields such as scientific AI and data analysis, yet existing approaches face significant time- and space-efficiency bottlenecks when scaling to large graphs. To address this challenge, we present CauScale, a neural architecture designed for efficient causal discovery that scales inference to graphs with up to 1000 nodes. CauScale improves time efficiency via a reduction unit that compresses data embeddings and improves space efficiency by adopting tied attention weights to avoid maintaining axis-specific attention maps. To keep high causal discovery accuracy, CauScale adopts a two-stream design: a data stream extracts relational evidence from high-dimensional observations, while a graph stream integrates statistical graph priors and preserves key structural signals. CauScale successfully scales to 500-node graphs during training, where prior work fails due to space limitations. Across testing data with varying graph scales and causal mechanisms, CauScale achieves 99.6% mAP on in-distribution data and 84.4% on out-of-distribution data, while delivering 4-13,000 times inference speedups over prior methods. Our project page is at https://github.com/OpenCausaLab/CauScale.

2602.08577 2026-02-10 cs.LG math.CO stat.CO

An arithmetic method algorithm optimizing k-nearest neighbors compared to regression algorithms and evaluated on real world data sources

Theodoros Anagnostopoulos, Evanthia Zervoudi, Christos Anagnostopoulos, Apostolos Christopoulos, Bogdan Wierzbinski

Comments Nature Scientific Reports

Linear regression analysis focuses on predicting a numeric regressand value based on certain regressor values. In this context, k-Nearest Neighbors (k-NN) is a common non-parametric regression algorithm, which achieves efficient performance when compared with other algorithms in the literature. In this research effort, an optimization of the k-NN algorithm is proposed by exploiting the potential of an introduced arithmetic method, which can provide solutions for linear equations involving an arbitrary number of real variables. Specifically, an Arithmetic Method Algorithm (AMA) is adopted to assess the efficiency of the introduced arithmetic method, while an Arithmetic Method Regression (AMR) algorithm is proposed as an optimization of k-NN that builds on AMA. This algorithm is compared with other regression algorithms, according to an introduced optimal inference decision rule, and evaluated on certain publicly available real-world data sources. The results are promising: the proposed AMR algorithm performs comparably to the other algorithms, and in most cases it outperforms k-NN. The results indicate that the introduced AMR is an optimization of k-NN.
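
For readers unfamiliar with the baseline being optimized, a plain k-NN regressor can be sketched as below. This is only the textbook baseline with Euclidean distance and an unweighted average; the AMA and AMR algorithms are not specified in the abstract and are not reproduced. The function name and toy data are assumptions.

```python
import math

def knn_regress(train_x, train_y, query, k):
    """Predict the mean target of the k training points nearest to the query."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training target with its Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

# Toy one-dimensional data, chosen only for this demonstration.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
prediction = knn_regress(train_x, train_y, query=(1.2,), k=2)
```

Here the two nearest neighbors of 1.2 are the points at 1.0 and 2.0, so the prediction is the average of their targets.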

2602.08552 2026-02-10 cs.LG eess.AS stat.ML

Rho-Perfect: Correlation Ceiling For Subjective Evaluation Datasets

Fredrik Cumlin

Subjective ratings contain inherent noise that limits the model-human correlation, but this reliability issue is rarely quantified. In this paper, we present $ρ$-Perfect, a practical estimation of the highest achievable correlation of a model on subjectively rated datasets. We define $ρ$-Perfect to be the correlation between a perfect predictor and human ratings, and derive an estimate of the value based on heteroscedastic noise scenarios, a common occurrence in subjectively rated datasets. We show that $ρ$-Perfect squared estimates test-retest correlation and use this to validate the estimate. We demonstrate the use of $ρ$-Perfect on a speech quality dataset and show how the measure can distinguish between model limitations and data quality issues.
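
The stated link between the correlation ceiling and test-retest correlation can be checked with a small simulation. This is a sanity check under assumed conditions (Gaussian true scores, independent zero-mean rating noise whose standard deviation varies per item), not the estimator derived in the paper; all names and distributions below are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
n_items = 20000
true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
# Heteroscedastic noise: every item has its own rating-noise standard deviation.
noise_sds = [rng.uniform(0.2, 1.0) for _ in range(n_items)]
ratings_a = [t + rng.gauss(0.0, s) for t, s in zip(true_scores, noise_sds)]
ratings_b = [t + rng.gauss(0.0, s) for t, s in zip(true_scores, noise_sds)]

ceiling = pearson(true_scores, ratings_a)    # a perfect predictor against noisy ratings
test_retest = pearson(ratings_a, ratings_b)  # two independent rating sessions
```

Even a predictor that knows the true scores exactly cannot exceed `ceiling`, and in this setup `ceiling ** 2` lands close to `test_retest`.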

2602.08544 2026-02-10 stat.ME stat.CO

Adaptive Markovian Spatiotemporal Transfer Learning in Multivariate Bayesian Modeling

Luca Presicce, Sudipto Banerjee

This manuscript develops computationally efficient online learning for multivariate spatiotemporal models. The method relies on matrix-variate Gaussian distributions, dynamic linear models, and Bayesian predictive stacking to efficiently share information across temporal data shards. The model facilitates effective information propagation over time while seamlessly integrating spatial components within a dynamic framework, building a Markovian dependence structure between datasets at successive time instants. This structure supports flexible, high-dimensional modeling of complex dependence patterns, as commonly found in spatiotemporal phenomena, where computational challenges arise rapidly with increasing dimensions. The proposed approach further manages exact inference through predictive stacking, enhancing accuracy and interoperability. Combining sequential and parallel processing of temporal shards, each unit passes assimilated information forward, which is then back-smoothed to improve posterior estimates, incorporating all available information. This framework advances the scalability and adaptability of spatiotemporal modeling, making it suitable for dynamic, multivariate, and data-rich environments.

2602.08486 2026-02-10 math.ST stat.TH

Empirical Bayes Variable Selection with Lasso Statistics in the AMP Framework

Lina Hidmi, Asaf Weinstein

The Lasso is one of the most ubiquitous methods for variable selection in high-dimensional linear regression and has been studied extensively under different regimes. In a particular asymptotic setup entailing $n/p\to \text{constant}$, an i.i.d. Gaussian $X$ matrix and linear sparsity, Su et al. (2017) analyzed the Lasso selection path and presented negative results, showing that maintaining small levels of the false discovery proportion comes at a substantial cost in power. Follow-up work by Wang et al. (2020) used the same framework to study the tradeoff between type I error and power for thresholded-Lasso selection, which ranks the variables based on the magnitude of the Lasso estimate instead of the order of appearance on the Lasso path, and demonstrated that significant improvements are possible if the regularization parameter is chosen appropriately. We take this line of research a step further, seeking an optimal selection procedure in the AMP framework among procedures that order the variables by some univariate function of the Lasso estimate at a fixed value $λ$ of the regularization term. Observing that the model for the Lasso estimates effectively reduces asymptotically to a version of the well-studied two-groups model, we propose an empirical Bayes variable selection procedure based on an estimate of the local false discovery rate. We extend existing results in the AMP framework to obtain exact predictions for the curve describing the asymptotic tradeoff between type I error and power of this procedure. Additionally, we prove that the optimal $λ$ is the minimizer of the asymptotic mean squared error, and accordingly propose to use the empirical Bayes procedure with $λ$ estimated by cross-validation. The theoretical predictions imply that the gains in power can be substantial, and we confirm this by numerical studies under different settings.
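
The local false discovery rate in a two-groups model has a simple closed form when both component densities are known. The sketch below assumes a standard normal null and a Gaussian alternative with known parameters, purely for illustration; the paper estimates these quantities from Lasso statistics, which is not reproduced here, and the function names and parameter values are hypothetical.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) at z."""
    return math.exp(-((z - mean) ** 2) / (2 * sd ** 2)) / (math.sqrt(2 * math.pi) * sd)

def local_fdr(z, null_prob, alt_mean, alt_sd):
    """Posterior probability that a statistic z comes from the null group."""
    null_part = null_prob * normal_pdf(z, 0.0, 1.0)
    alt_part = (1.0 - null_prob) * normal_pdf(z, alt_mean, alt_sd)
    return null_part / (null_part + alt_part)

# Assumed mixture: 90% nulls, alternatives centered at 3 with unit spread.
lfdr_near_zero = local_fdr(0.0, null_prob=0.9, alt_mean=3.0, alt_sd=1.0)
lfdr_far_out = local_fdr(4.0, null_prob=0.9, alt_mean=3.0, alt_sd=1.0)
```

Selecting the variables whose local false discovery rate falls below a cutoff ranks them by how unlikely they are to be null, which is the ordering principle the abstract describes.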

2602.08467 2026-02-10 cs.LG stat.ML

Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization

Charalampos Shimillas, Kleanthis Malialis, Konstantinos Fokianos, Marios M. Polycarpou

Multivariate time series (MTS) anomaly diagnosis, which encompasses both anomaly detection and localization, is critical for the safety and reliability of complex, large-scale real-world systems. The vast majority of existing anomaly diagnosis methods offer limited theoretical insights, especially for anomaly localization, which is a vital but largely unexplored area. The aim of this contribution is to study the learning process of a Transformer when applied to MTS by revealing connections to statistical time series methods. Based on these theoretical insights, we propose the Attention Low-Rank Transformer (ALoRa-T) model, which applies low-rank regularization to self-attention, and we introduce the Attention Low-Rank score, effectively capturing the temporal characteristics of anomalies. Finally, to enable anomaly localization, we propose the ALoRa-Loc method, a novel approach that associates anomalies with specific variables by quantifying interrelationships among time series. Extensive experiments and real data analysis show that the proposed methodology significantly outperforms state-of-the-art methods in both detection and localization tasks.

2602.08427 2026-02-10 cs.LG math.ST stat.TH

The Connection between Kriging and Large Neural Networks

Marius Marinescu

AI has impacted many disciplines and is nowadays ubiquitous. In particular, spatial statistics is at a pivotal moment where it will increasingly intertwine with AI. In this scenario, a relevant question is what relationship spatial statistics models have with machine learning (ML) models, if any. In particular, in this paper, we explore the connections between Kriging and neural networks. At first glance, they may appear unrelated. Kriging and its ML counterpart, Gaussian process regression, are grounded in probability theory and stochastic processes, whereas many ML models are widely considered black-box models. Nevertheless, they are strongly related. We study their connections and revisit the relevant literature. The understanding of their relations and the combination of both perspectives may enhance ML techniques by making them more interpretable, reliable, and spatially aware.

2602.08414 2026-02-10 stat.AP

Temporal Trends in Incidence of Dementia in a Birth Cohorts Analysis of the Framingham Heart Study

Paula Staudt, Anika Schlosser, Annika Möhl, Martin Schumacher, Nadine Binder

Comments 14 pages, 3 figures, 2 tables

Details
English abstract

Background: Dementia leads to a high burden of disability and the number of dementia patients worldwide doubled between 1990 and 2016. Nevertheless, some studies indicated a decrease in dementia risk which may be due to a bias caused by conventional analysis methods that do not adequately account for missing disease information due to death. Methods: This study re-examines potential trends in dementia incidence over four decades in the Framingham Heart Study. We apply a multistate modeling framework tailored to interval-censored illness-death data and define three non-overlapping birth cohorts (1915-1924, 1925-1934, and 1935-1944). Trends are evaluated based on both dementia prevalence and dementia risk, using age as the underlying timescale. Additionally, age-conditional dementia probabilities stratified by sex are estimated. Results: A total of 731 out of 3828 individuals were diagnosed with dementia. The multistate model analysis revealed no temporal decline in dementia risk across birth cohorts, irrespective of sex. When stratified by sex and adjusted for education, women consistently exhibited higher lifetime age-conditional risks (46%-50%) than men (30%-34%) over the study period. Conclusions: We recommend using a combination of a multistate approach and separation into birth cohorts to adequately estimate trends of disease risk in cohort studies as well as to communicate patient-relevant outcomes such as age-conditional disease risks.

2602.08413 2026-02-10 stat.ME

A Bayesian regression framework for circular models with INLA

Xiang Ye, Janet Van Niekerk, Haavard Rue

Comments 19 pages, 12 figures

Details
English abstract

Regression models for circular variables are less developed, since the concept of building a linear predictor from linear combinations of covariates and various random effects breaks the circular nature of the variable. In this paper, we introduce a new approach to rectify this issue, leading to well-defined regression models for circular responses when the data are concentrated. Our approach extends naturally to joint regression models where we can have several circular and non-circular responses, and allows us to handle a mix of linear covariates, circular covariates and various random effects. Our formulation aligns naturally with the integrated nested Laplace approximation (INLA), which provides fast and accurate Bayesian inference. We illustrate our approach through several simulated and real examples.

2602.08374 2026-02-10 stat.ML cs.LG math.PR math.ST stat.TH

Schrödinger bridge problem via empirical risk minimization

Denis Belomestny, Alexey Naumov, Nikita Puchkin, Denis Suchkov

Details
English abstract

We study the Schrödinger bridge problem when the endpoint distributions are available only through samples. Classical computational approaches estimate Schrödinger potentials via Sinkhorn iterations on empirical measures and then construct a time-inhomogeneous drift by differentiating a kernel-smoothed dual solution. In contrast, we propose a learning-theoretic route: we rewrite the Schrödinger system in terms of a single positive transformed potential that satisfies a nonlinear fixed-point equation and estimate this potential by empirical risk minimization over a function class. We establish uniform concentration of the empirical risk around its population counterpart under sub-Gaussian assumptions on the reference kernel and terminal density. We plug the learned potential into a stochastic control representation of the bridge to generate samples. We illustrate the performance of the suggested approach with numerical experiments.

2602.08350 2026-02-10 cs.LG stat.ML

All ERMs Can Fail in Stochastic Convex Optimization Lower Bounds in Linear Dimension

Tal Burla, Roi Livni

Details
English abstract

We study the sample complexity of the best-case Empirical Risk Minimizer in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question by Feldman. We also extend this to approximate ERMs. Building on our construction, we also show that (constrained) Gradient Descent potentially overfits when horizon and learning rate grow with respect to the sample size. Specifically, we provide a novel generalization lower bound of $Ω\left(\sqrt{ηT/m^{1.5}}\right)$ for Gradient Descent, where $η$ is the learning rate, $T$ is the horizon and $m$ is the sample size. This narrows down, exponentially, the gap between the best known upper bound of $O(ηT/m)$ and existing lower bounds from previous constructions.

2602.08315 2026-02-10 cs.LG stat.ML

Fast Flow Matching based Conditional Independence Tests for Causal Discovery

Shunyu Zhao, Yanfeng Yang, Shuai Li, Kenji Fukumizu

Details
English abstract

Constraint-based causal discovery methods require a large number of conditional independence (CI) tests, which severely limits their practical applicability due to high computational complexity. Therefore, it is crucial to design an algorithm that accelerates each individual test. To this end, we propose the Flow Matching-based Conditional Independence Test (FMCIT). The proposed test leverages the high computational efficiency of flow matching and requires the model to be trained only once throughout the entire causal discovery procedure, substantially accelerating causal discovery. According to numerical experiments, FMCIT effectively controls type-I error and maintains high testing power under the alternative hypothesis, even in the presence of high-dimensional conditioning sets. In addition, we further integrate FMCIT into a two-stage guided PC skeleton learning framework, termed GPC-FMCIT, which combines fast screening with guided, budgeted refinement using FMCIT. This design yields explicit bounds on the number of CI queries while maintaining high statistical power. Experiments on synthetic and real-world causal discovery tasks demonstrate favorable accuracy-efficiency trade-offs over existing CI testing methods and PC variants.

2602.08307 2026-02-10 cs.LG stat.ML

Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback

Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro

Details
English abstract

In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
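As an illustration of the Inverse-Gap-Weighting idea named in this abstract, here is a minimal Python sketch of the standard single-step IGW rule from the contextual bandit literature. It is not the paper's multi-step algorithm: the function name `inverse_gap_weighting`, the exploration parameter `gamma`, and the example reward estimates are all assumptions made for illustration.

```python
def inverse_gap_weighting(estimated_rewards, gamma):
    """Turn reward estimates into action probabilities (standard IGW form).

    Each non-greedy action gets probability 1 / (K + gamma * gap), where gap
    is its estimated shortfall relative to the greedy action; the greedy
    action receives the remaining mass.
    """
    k = len(estimated_rewards)
    best = max(range(k), key=lambda a: estimated_rewards[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            gap = estimated_rewards[best] - estimated_rewards[a]
            probs[a] = 1.0 / (k + gamma * gap)
    probs[best] = 1.0 - sum(probs)
    return probs


# Hypothetical reward estimates for three actions.
probs = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0)
uniform = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=0.0)
```

With `gamma=0.0` the rule reduces to uniform exploration, and larger values of `gamma` concentrate probability on the greedy action.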

2602.08287 2026-02-10 cs.LG cs.AI stat.ML

Noise Stability of Transformer Models

Themistoklis Haris, Zihan Zhang, Yuichi Yoshida

Comments Published in ICLR 2026

Details
English abstract

Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose noise stability as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to all input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical noise stability regularization method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately $35\%$ and $75\%$ respectively. Our results sculpt a new connection between signal propagation in neural networks and interpretability, with noise stability emerging as a powerful tool for understanding and improving modern Transformers.
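To make the notion of noise stability concrete, the sketch below computes the classical Boolean version exactly by enumeration: Stab_rho(f) = E[f(x) f(y)], where x is uniform on {-1, 1}^n and each coordinate of y independently equals the matching coordinate of x with probability (1 + rho) / 2. This is the textbook definition from Boolean function analysis, not the real-valued generalization or the regularizer developed in the paper, and the helper names are assumptions.

```python
from itertools import product


def noise_stability(f, n, rho):
    """Exact noise stability of f: {-1, 1}^n -> {-1, 1} at correlation rho."""
    keep = (1.0 + rho) / 2.0  # probability that a coordinate is left unchanged
    total = 0.0
    for x in product((-1, 1), repeat=n):
        for flips in product((False, True), repeat=n):
            # Build the noisy copy y and the probability of this flip pattern.
            y = tuple(-xi if flipped else xi for xi, flipped in zip(x, flips))
            weight = 1.0
            for flipped in flips:
                weight *= (1.0 - keep) if flipped else keep
            total += weight * f(x) * f(y)
    return total / 2 ** n  # average over the uniform choice of x


def dictator(x):
    """Depends on a single coordinate (a 1-junta)."""
    return x[0]


def parity(x):
    """Depends on every coordinate."""
    result = 1
    for xi in x:
        result *= xi
    return result


stab_dictator = noise_stability(dictator, n=3, rho=0.8)
stab_parity = noise_stability(parity, n=3, rho=0.8)
```

The dictator function has stability rho, while parity on n bits has stability rho to the power n, so functions that depend on few coordinates score as more stable.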

2602.08259 2026-02-10 stat.ML cs.LG

A Statistical Framework for Alignment with Biased AI Feedback

Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

Details
English abstract

Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback datasets. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.
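For context on what DDPO augments, the following Python sketch evaluates the standard DPO loss for a single preference pair. It does not include the residual-based correction or the density-ratio reweighting described in the abstract; the function name, argument names, and default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss -log(sigmoid(margin)) for one preference pair.

    The margin is beta times the difference between the policy-vs-reference
    log-ratio of the chosen response and that of the rejected response.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities.
loss_at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
loss_after_update = dpo_loss(-10.0, -16.0, -12.0, -15.0)
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the rejected one lowers the loss.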

2602.08243 2026-02-10 stat.ML cs.LG

Discrete Adjoint Schrödinger Bridge Sampler

Wei Guo, Yuchen Zhu, Xiaochen Du, Juno Nam, Yongxin Chen, Rafael Gómez-Bombarelli, Guan-Horng Liu, Molei Tao, Jaemoo Choi

Details
English abstract

Learning discrete neural samplers is challenging due to the lack of gradients and combinatorial complexity. While stochastic optimal control (SOC) and Schrödinger bridge (SB) provide principled solutions, efficient SOC solvers like adjoint matching (AM), which excel in continuous domains, remain unexplored for discrete spaces. We bridge this gap by revealing that the core mechanism of AM is $\mathit{state}\text{-}\mathit{space~agnostic}$, and introduce $\mathbf{discrete~ASBS}$, a unified framework that extends AM and adjoint Schrödinger bridge sampler (ASBS) to discrete spaces. Theoretically, we analyze the optimality conditions of the discrete SB problem and its connection to SOC, identifying a necessary cyclic group structure on the state space to enable this extension. Empirically, discrete ASBS achieves competitive sample quality with significant advantages in training efficiency and scalability.

2602.08232 2026-02-10 math.OC cs.LG stat.ML

Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization

Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, Aryan Mokhtari

Comments 37 pages, 1 figure

Details
English abstract

We study online linear optimization with matrix variables constrained by the operator norm, a setting where the geometry renders designing data-dependent and efficient adaptive algorithms challenging. The best-known adaptive regret bounds are achieved by Shampoo-like methods, but they require solving a costly quadratic projection subproblem. To address this, we extend the gradient-based prediction scheme to adaptive matrix online learning and cast algorithm design as constructing a family of smoothed potentials for the nuclear norm. We define a notion of admissibility for such smoothings and prove any admissible smoothing yields a regret bound matching the best-known guarantees of one-sided Shampoo. We instantiate this framework with two efficient methods that avoid quadratic projections. The first is an adaptive Follow-the-Perturbed-Leader (FTPL) method using Gaussian stochastic smoothing. The second is Follow-the-Augmented-Matrix-Leader (FAML), which uses a deterministic hyperbolic smoothing in an augmented matrix space. By analyzing the admissibility of these smoothings, we show both methods admit closed-form updates and match one-sided Shampoo's regret up to a constant factor, while significantly reducing computational cost. Lastly, using the online-to-nonconvex conversion, we derive two matrix-based optimizers, Pion (from FTPL) and Leon (from FAML). We prove convergence guarantees for these methods in nonsmooth nonconvex settings, a guarantee that the popular Muon optimizer lacks.

2602.08197 2026-02-10 cs.LG stat.ML

Interpretable Dynamic Network Modeling of Tensor Time Series via Kronecker Time-Varying Graphical Lasso

Shingo Higashiguchi, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai

Comments Accepted at ACM Web Conference 2026 (WWW2026)

Details
English abstract

With the rapid development of web services, large amounts of time series data are generated and accumulated across various domains such as finance, healthcare, and online platforms. As such data often co-evolves with multiple variables interacting with each other, estimating the time-varying dependencies between variables (i.e., the dynamic network structure) has become crucial for accurate modeling. However, real-world data is often represented as tensor time series with multiple modes, resulting in large, entangled networks that are hard to interpret and computationally intensive to estimate. In this paper, we propose Kronecker Time-Varying Graphical Lasso (KTVGL), a method designed for modeling tensor time series. Our approach estimates mode-specific dynamic networks in a Kronecker product form, thereby avoiding overly complex entangled structures and producing interpretable modeling results. Moreover, the partitioned network structure prevents the exponential growth of computational time with data dimension. In addition, our method can be extended to stream algorithms, making the computational time independent of the sequence length. Experiments on synthetic data show that the proposed method achieves higher edge estimation accuracy than existing methods while requiring less computation time. To further demonstrate its practical value, we also present a case study using real-world data. Our source code and datasets are available at https://github.com/Higashiguchi-Shingo/KTVGL.
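The following sketch illustrates why a Kronecker product form keeps the model compact, under the assumption that the joint network over a two-mode tensor is the Kronecker product of two small mode-specific matrices. The matrices and the helper `kron` are made-up illustrations, not the KTVGL estimator itself.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


# Hypothetical mode-specific networks: 2 nodes in mode 1, 3 nodes in mode 2.
mode1 = [[2.0, -1.0],
         [-1.0, 2.0]]
mode2 = [[1.0, 0.0, 0.0],
         [0.0, 1.0, -0.5],
         [0.0, -0.5, 1.0]]

joint = kron(mode1, mode2)  # 6 x 6 network over all (mode 1, mode 2) pairs

entries_joint = len(joint) * len(joint[0])            # entries in the joint matrix
entries_factored = len(mode1) ** 2 + len(mode2) ** 2  # entries in the two factors
```

Here the joint matrix has 36 entries while the two factors together have 13, and each factor can be read as a network over a single mode.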

2602.08185 2026-02-10 stat.ML cs.LG

Information Geometry of Absorbing Markov-Chain and Discriminative Random Walks

Masanari Kimura

Details
English abstract

Discriminative Random Walks (DRWs) are a simple yet powerful tool for semi-supervised node classification, but their theoretical foundations remain fragmentary. We revisit DRWs through the lens of information geometry, treating the family of class-specific hitting-time laws on an absorbing Markov chain as a statistical manifold. Starting from a log-linear edge-weight model, we derive closed-form expressions for the hitting-time probability mass function, its full moment hierarchy, and the observed Fisher information. The Fisher matrix of each seed node turns out to be rank-one; taking the quotient by its null space yields a low-dimensional, globally flat manifold that captures all identifiable directions of the model. Leveraging the geometry, we introduce a sensitivity score for unlabeled nodes that bounds, and in one-dimensional cases attains, the maximal first-order change in DRW betweenness under unit Fisher perturbations. The score can lead to principled strategies for active label acquisition, edge re-weighting, and explanation.

2602.08174 2026-02-10 math.ST cs.IT math.IT stat.TH

Asymptotically Minimax Robust Likelihood Ratio Test

Gökhan Gül

Comments 29 pages, 10 figures. Submitted to IEEE Transactions on Information Theory

Details
English abstract

This paper develops a unified framework for asymptotically minimax robust hypothesis testing under distributional uncertainty, applicable to both Bayesian and Neyman--Pearson formulations (Type-I and Type-II). Uncertainty classes based on the KL-divergence, $α$-divergence, and its symmetrized variant are considered. Using Sion's minimax theorem and Karush-Kuhn-Tucker conditions, the existence and uniqueness of the resulting robust tests are established. The least favorable distributions and corresponding robust likelihood ratio functions are derived in closed parametric forms, enabling computation via systems of nonlinear equations. It is proven that Dabak's approach does not yield an asymptotically minimax robust test. The proposed theory generalizes earlier work by offering a more systematic and comprehensive derivation of robust tests. Numerical simulations confirm the theoretical results and illustrate the behavior of the derived robust tests.

2602.08173 2026-02-10 math.ST cs.IT cs.LG math.IT math.PR stat.TH

Fundamental Limits of Community Detection in Contextual Multi-Layer Stochastic Block Models

Shuyang Gong, Dong Huang, Zhangsong Li

Comments 49 pages, 5 figures

Details
English abstract

We consider the problem of community detection from the joint observation of a high-dimensional covariate matrix and $L$ sparse networks, all encoding noisy, partial information about the latent community labels of $n$ subjects. In the asymptotic regime where the networks have constant average degree and the number of features $p$ grows proportionally with $n$, we derive a sharp threshold under which detecting and estimating the subject labels is possible. Our results extend the work of \cite{MN23} to the constant-degree regime with noisy measurements, and also resolve a conjecture in \cite{YLS24+} when the number of networks is a constant. Our information-theoretic lower bound is obtained via a novel comparison inequality between Bernoulli and Gaussian moments, as well as a statistical variant of the ``recovery to chi-square divergence reduction'' argument inspired by \cite{DHSS25}. On the algorithmic side, we design efficient algorithms based on counting decorated cycles and decorated paths and prove that they achieve the sharp threshold for both detection and weak recovery. In particular, our results show that there is no statistical-computational gap in this setting.

2602.08172 2026-02-10 stat.AP

Learning from Literature: Integrating LLMs and Bayesian Hierarchical Modeling for Oncology Trial Design

Guannan Gong, Satrajit Roychoudhury, Allison Meisner, Lajos Pusztai, Sarah B Goldberg, Wei Wei

Details
English abstract

Designing modern oncology trials requires synthesizing evidence from prior studies to inform hypothesis generation and sample size determination. Trial designs based on incomplete or imprecise summaries can lead to misspecified hypotheses and underpowered studies, resulting in false positive or negative conclusions. To address this challenge, we developed LEAD-ONC (Literature to Evidence for Analytics and Design in Oncology), an AI-assisted framework that transforms published clinical trial reports into quantitative, design-relevant evidence. Given expert-curated trial publications that meet prespecified eligibility criteria, LEAD-ONC uses large language models to extract baseline characteristics and reconstruct individual patient data from Kaplan-Meier curves, followed by Bayesian hierarchical modeling to generate predictive survival distributions for a prespecified target trial population. We demonstrate the framework using five phase III trials in first-line non-small-cell lung cancer evaluating PD-1 or PD-L1 inhibitors with or without CTLA-4 blockade. Clustering based on baseline characteristics identified three clinically interpretable populations defined by histology. For a prospective randomized trial in the mixed-histology population comparing mono versus dual immune checkpoint inhibition, LEAD-ONC projected a modest median overall survival difference of 2.8 months (95 percent credible interval -2.0 to 7.6) and an estimated probability of at least a 3-month benefit of approximately 0.45. As LEAD-ONC remains under active development, these results are intended as preliminary demonstrations of the framework's potential to support evidence-driven oncology trial design rather than definitive clinical conclusions.

2602.08171 2026-02-10 cs.LG stat.AP stat.ME

A Causal Machine Learning Framework for Treatment Personalization in Clinical Trials: Application to Ulcerative Colitis

Cristian Minoccheri, Sophia Tesic, Kayvan Najarian, Ryan Stidham

Details
English abstract

Randomized controlled trials estimate average treatment effects, but treatment response heterogeneity motivates personalized approaches. A critical question is whether statistically detectable heterogeneity translates into improved treatment decisions -- these are distinct questions that can yield contradictory answers. We present a modular causal machine learning framework that evaluates each question separately: permutation importance identifies which features predict heterogeneity, best linear predictor (BLP) testing assesses statistical significance, and doubly robust policy evaluation measures whether acting on the heterogeneity improves patient outcomes. We apply this framework to patient-level data from the UNIFI maintenance trial of ustekinumab in ulcerative colitis, comparing placebo, standard-dose ustekinumab every 12 weeks, and dose-intensified ustekinumab every 8 weeks, using cross-fitted X-learner models with baseline demographics, medication history, week-8 clinical scores, laboratory biomarkers, and video-derived endoscopic features. BLP testing identified strong associations between endoscopic features and treatment effect heterogeneity for ustekinumab versus placebo, yet doubly robust policy evaluation showed no improvement in expected remission from incorporating endoscopic features, and out-of-fold multi-arm evaluation showed worse performance. Diagnostic comparison of prognostic contribution against policy value revealed that endoscopic scores behaved as disease severity markers -- improving outcome prediction in untreated patients but adding noise to treatment selection -- while clinical variables (fecal calprotectin, age, CRP) captured the decision-relevant variation. These results demonstrate that causal machine learning applications to clinical trials should include policy-level evaluation alongside heterogeneity testing.

2602.08151 2026-02-10 cs.LG stat.ML

A second order regret bound for NormalHedge

Yoav Freund, Nicholas J. A. Harvey, Victor S. Portella, Yabing Qi, Yu-Xiang Wang

Details
English abstract

We consider the problem of prediction with expert advice for ``easy'' sequences. We show that a variant of NormalHedge enjoys a second-order $ε$-quantile regret bound of $O\big(\sqrt{V_T \log(V_T/ε)}\big) $ when $V_T > \log N$, where $V_T$ is the cumulative second moment of instantaneous per-expert regret averaged with respect to a natural distribution determined by the algorithm. The algorithm is motivated by a continuous time limit using Stochastic Differential Equations. The discrete time analysis uses self-concordance techniques.

2602.08120 2026-02-10 quant-ph cs.NA math.NA q-fin.MF stat.CO

Optimal Quantum Speedups for Repeatedly Nested Expectation Estimation

Yihang Sun, Guanyang Wang, Jose Blanchet

Details
English abstract

We study the estimation of repeatedly nested expectations (RNEs) with a constant horizon (number of nestings) using quantum computing. We propose a quantum algorithm that achieves $\varepsilon$-error with cost $\tilde O(\varepsilon^{-1})$, up to logarithmic factors. Standard lower bounds show this scaling is essentially optimal, yielding an almost quadratic speedup over the best classical algorithm. Our results extend prior quantum speedups for single nested expectations to repeated nesting, and therefore cover a broader range of applications, including optimal stopping. This extension requires a new derandomized variant of the classical randomized Multilevel Monte Carlo (rMLMC) algorithm. Careful de-randomization is key to overcoming a variable-time issue that typically increases quantized versions of classical randomized algorithms.

2602.06832 2026-02-10 math.ST stat.TH

Exact recovery for seeded graph matching

Nicolas Fraiman, Michael Nisenzon

Comments 27 pages, 1 figure

Details
English abstract

We study graph matching between two correlated networks in the almost fully seeded regime, where all but a vanishing fraction of vertex correspondences are revealed. Concretely, we consider the correlated stochastic block model and assume that $n^{1-α}$ vertices remain unrevealed for some $α\in (0,1)$, while the remaining $n - n^{1-α}$ vertices are provided as seed correspondences. Our goal is to determine when the true permutation can be recovered efficiently as the proportion of unrevealed vertices vanishes. We prove that exact recovery of the remaining correspondences is achievable in polynomial time whenever $λs^{2} > 1 - α$, where $λ= (a+b)/2$ is the SBM density parameter and $s$ denotes the edge retention parameter. This condition smoothly interpolates between the fully seeded setting and the classical unseeded threshold $λs^{2} > 1$ for matching in correlated Erdős-Rényi graphs. Our analysis applies to both a simple neighborhood-overlap rule and a bistochastic relaxation followed by projection, establishing matching achievability in the almost fully seeded regime without requiring spectral methods or message passing. On the converse side, we show that below the same threshold, exact recovery is information-theoretically impossible with high probability. Thus, to our knowledge, we obtain the first tight statistical and computational characterization of graph matching when only a vanishing fraction of vertices remain unrevealed. Our results complement recent progress in semi-supervised community detection by demonstrating that revealing all but $n^{1-α}$ correspondences similarly lowers the information threshold for graph matching.
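The threshold stated in this abstract can be checked numerically with a small Python helper. The function name and the example parameter values are assumptions; only the condition λs² > 1 - α with λ = (a+b)/2 comes from the abstract.

```python
def exact_recovery_achievable(a, b, s, alpha):
    """Check the stated condition lambda * s^2 > 1 - alpha.

    a, b  : SBM parameters, with density lambda = (a + b) / 2
    s     : edge retention parameter
    alpha : exponent such that n^(1 - alpha) vertices remain unrevealed
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    lam = (a + b) / 2.0
    return lam * s ** 2 > 1.0 - alpha


# Hypothetical parameters: lambda = 2 and s = 0.6, so lambda * s^2 = 0.72.
many_seeds = exact_recovery_achievable(a=3.0, b=1.0, s=0.6, alpha=0.5)
fewer_seeds = exact_recovery_achievable(a=3.0, b=1.0, s=0.6, alpha=0.2)
```

In this example the value 0.72 falls below the unseeded threshold of 1, yet the condition holds for alpha = 0.5 (0.72 > 0.5) and fails for alpha = 0.2 (0.72 < 0.8), which shows how revealing more seeds lowers the bar.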

2602.06267 2026-02-10 stat.ME stat.ML

Conformal changepoint localization

Rohan Hore, Aaditya Ramdas

Comments 52 pages, 12 figures

Details
English abstract

We study the problem of offline changepoint localization in a distribution-free setting. One observes a vector of data with a single changepoint, assuming that the data before and after the changepoint are iid (or more generally exchangeable) from arbitrary and unknown distributions. The goal is to produce a finite-sample confidence set for the index at which the change occurs without making any other assumptions. Existing methods often rely on parametric assumptions, tail conditions, or asymptotic approximations, or only produce point estimates. In contrast, our distribution-free algorithm, CONformal CHangepoint localization (CONCH), only leverages exchangeability arguments to construct confidence sets with finite sample coverage. By proving a conformal Neyman-Pearson lemma, we derive principled score functions that yield informative (small) sets. Moreover, with such score functions, the normalized length of the confidence set shrinks to zero under weak assumptions. We also establish a universality result showing that any distribution-free changepoint localization method must be an instance of CONCH. Experiments suggest that CONCH delivers precise confidence sets even in challenging settings involving images or text.

2602.04302 2026-02-10 math.ST stat.TH

Phase Transition of Spectral Fluctuations in Large Gram Matrices with a Variance Profile: A Unified Framework for Sparse CLTs

Rui Wang, Guangming Pan, Dandan Jiang

Comments 24 pages, 4 figures

Details
English abstract

We study the asymptotic spectral behavior of high-dimensional random Gram matrices with sparsity and a variance profile, motivated by applications in wireless communications. Specifically, we consider the Gram matrices $\mathbf S_n=\mathbf Y_n\mathbf Y_n^*$, where the entries of $\mathbf Y_n$ are independent, centered, heteroscedastic, and sparse through Bernoulli masking. The sparsity level is parameterized as $s=q^2/n$, where $q$ ranges from polynomial order up to order $n^{1/2}$. We investigate two asymptotic regimes: a moderate-sparsity regime with fixed $s\in(0,1]$, and a high-sparsity regime where $s\to0$. In both regimes, we establish the convergence of the empirical spectral distribution of $\mathbf S_n$ to a deterministic limit, and further derive central limit theorems for linear spectral statistics using resolvent techniques and martingale difference arguments. Our analysis reveals a phase transition in the fluctuation behavior across the two regimes. In the high-sparsity regime, the asymptotic fluctuations are entirely governed by fourth-moment effects, with sparsity-scaled contributions being suppressed. Moreover, the leading deterministic term and the variance of the linear spectral statistic scale at different rates in $q$, causing the standard centering to fail and necessitating an explicit correction to recover a valid CLT. The results apply to both Gaussian and non-Gaussian entries and are illustrated through applications to hypothesis testing and outage probability analysis in large-scale MIMO systems.

2602.02759 2026-02-10 stat.ML cs.LG

Near-Universal Multiplicative Updates for Nonnegative Einsum Factorization

John Hood, Aaron Schein

Comments 27 pages, 5 figures

Details
English abstract

Despite the ubiquity of multiway data across scientific domains, there are few user-friendly tools that fit tailored nonnegative tensor factorizations. Researchers may use gradient-based automatic differentiation (which often struggles in nonnegative settings), choose between a limited set of methods with mature implementations, or implement their own model from scratch. As an alternative, we introduce NNEinFact, an einsum-based multiplicative update algorithm that fits any nonnegative tensor factorization expressible as a tensor contraction by minimizing one of many user-specified loss functions (including the $(α,β)$-divergence). To use NNEinFact, the researcher simply specifies their model with a string. NNEinFact converges to a stationary point of the loss, supports missing data, and fits to tensors with hundreds of millions of entries in seconds. Empirically, NNEinFact fits custom models which outperform standard ones in heldout prediction tasks on real-world tensor data by over $37\%$ and attains less than half the test loss of gradient-based methods while converging up to 90 times faster.

2602.01228 2026-02-10 stat.ME

Estimation of Tsallis entropy and its applications to goodness-of-fit tests

Siddhartha Chakraborty, Asok K. Nanda, Narayanaswamy Balakrishnan

Comments 21 pages, 4 figures

Details
English abstract

In this paper, we consider the problem of estimating Tsallis entropy from a given data set. We propose four different estimators for Tsallis entropy measure based on higher-order sample spacings, and then discuss estimation of Tsallis divergence measure. We compare the performance of the proposed estimators by means of bias and mean squared error and also examine their robustness to outliers. Next, we propose a spacings-based estimator for Tsallis entropy under progressive type-II censoring and study its performance using Monte Carlo simulations. Another estimator for Tsallis entropy is proposed using the quantile function; its consistency and asymptotic normality are studied, and its performance is evaluated through Monte Carlo simulations. Goodness-of-fit tests for normal and exponential distributions as applications are developed using Tsallis divergence measure. The performance of the proposed tests is then compared with some known tests using simulations and it is shown that the proposed tests perform very well. Also, an exponentiality test under progressive type-II censoring is proposed, and its performance is compared with an existing entropy-based test using simulation. It is observed that the proposed test performs well. Finally, some real data sets are analysed for illustrative purposes.

2602.00989 2026-02-10 stat.ML cs.LG

Optimal Decision-Making Based on Prediction Sets

Tao Wang, Edgar Dobriban

Details
English abstract

Prediction sets can wrap around any ML model to cover unknown test outcomes with a guaranteed probability. Yet, it remains unclear how to use them optimally for downstream decision-making. Here, we propose a decision-theoretic framework that seeks to minimize the expected loss (risk) against a worst-case distribution consistent with the prediction set's coverage guarantee. We first characterize the minimax optimal policy for a fixed prediction set, showing that it balances the worst-case loss inside the set with a penalty for potential losses outside the set. Building on this, we derive the optimal prediction set construction that minimizes the resulting robust risk subject to a coverage constraint. Finally, we introduce Risk-Optimal Conformal Prediction (ROCP), a practical algorithm that targets these risk-minimizing sets while maintaining finite-sample distribution-free marginal coverage. Empirical evaluations on medical diagnosis and safety-critical decision-making tasks demonstrate that ROCP reduces critical mistakes compared to baselines, particularly when out-of-set errors are costly.

2601.13519 2026-02-10 stat.ML cs.LG math.OC

Small Gradient Norm Regret for Online Convex Optimization

Wenzhi Gao, Chang He, Madeleine Udell

Details
English abstract

This paper introduces a new problem-dependent regret measure for online convex optimization with smooth losses. The notion, which we call the $G^\star$ regret, depends on the cumulative squared gradient norm evaluated at the decision in hindsight. We show that the $G^\star$ regret strictly refines the existing $L^\star$ (small loss) regret, and that it can be arbitrarily sharper when the losses have vanishing curvature around the hindsight decision. We establish upper and lower bounds on the $G^\star$ regret and extend our results to dynamic regret and bandit settings. As a byproduct, we refine the existing convergence analysis of stochastic optimization algorithms in the interpolation regime. Some experiments validate our theoretical findings.

2512.23671 2026-02-10 stat.ML cs.LG math.OC stat.ME

Calibrated Multi-Level Quantile Forecasting

Tiffany Ding, Isaac Gibbs, Ryan J. Tibshirani

Details
English abstract

We develop an online method that guarantees calibration of quantile forecasts at multiple quantile levels simultaneously. In this work, a sequence of quantile forecasts is said to be calibrated provided that its $α$-level predictions are greater than or equal to the target value at an $α$ fraction of time steps, for each level $α$. Our procedure, called the multi-level quantile tracker (MultiQT), is lightweight and wraps around any point or quantile forecaster to produce adjusted quantile forecasts that are guaranteed to be calibrated, even against adversarial distribution shifts. Critically, it does so while ensuring that the quantiles remain ordered, e.g., the 0.5-level quantile forecast will never be larger than the 0.6-level forecast. Moreover, the method has a no-regret guarantee, implying it will not degrade the performance of the existing forecaster (asymptotically), with respect to the quantile loss. In our experiments, we find that MultiQT significantly improves the calibration of real forecasters in epidemic and energy forecasting problems, while leaving the quantile loss largely unchanged or slightly improved.
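
The calibration criterion stated in the abstract can be checked empirically. The sketch below computes, for each level $α$, the fraction of time steps at which the $α$-level forecast is at least the target, and separately checks that forecasts are ordered across levels. It is only a diagnostic under that stated definition, not the MultiQT procedure itself. The function names and the dictionary layout (level mapped to a list of forecasts over time) are assumptions made for illustration.

```python
def coverage_by_level(forecasts, targets):
    """Map each level alpha to the fraction of steps with forecast >= target."""
    n = len(targets)
    if n == 0:
        raise ValueError("need at least one time step")
    coverage = {}
    for alpha, preds in forecasts.items():
        if len(preds) != n:
            raise ValueError("each forecast series must match targets in length")
        hits = sum(1 for f, y in zip(preds, targets) if f >= y)
        coverage[alpha] = hits / n
    return coverage


def quantiles_ordered(forecasts):
    """True if, at every time step, forecasts are nondecreasing in the level."""
    levels = sorted(forecasts)
    steps = len(forecasts[levels[0]]) if levels else 0
    return all(
        forecasts[lo][t] <= forecasts[hi][t]
        for lo, hi in zip(levels, levels[1:])
        for t in range(steps)
    )
```

A forecaster is calibrated in the abstract's sense when `coverage_by_level(...)[alpha]` is close to `alpha` for every level.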

2512.04696 2026-02-10 stat.ML cs.LG math.ST stat.TH

Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond

Kazuma Sawaya

Comments Accepted to AISTATS 2026

Details
English abstract

We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons (MLPs) of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting. Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension $n$ diverges faster than the latent dimension $q^{*}$, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume $\mathbf{B}$-right orthogonal invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.

2510.19349 2026-02-10 cs.LG stat.ML

Scalable LinUCB: Low-Rank Design Matrix Updates for Recommenders with Large Action Spaces

Evgenia Shustova, Marina Sheshukova, Sergey Samsonov, Evgeny Frolov

Details
English abstract

In this paper, we introduce PSI-LinUCB, a scalable variant of LinUCB that enables efficient training, inference, and memory usage by representing the inverse regularized design matrix as the sum of a diagonal matrix and a low-rank correction. We derive numerically stable rank-1 and batched updates that maintain the inverse without explicitly forming the matrix. To control memory growth, we employ a projector-splitting integrator for dynamical low-rank approximation, yielding an average per-step update cost and memory usage of $O(dr)$ for approximation rank $r$. The inference complexity of the proposed algorithm is $O(dr)$ per action evaluation. Experiments on recommender system datasets demonstrate the effectiveness of our algorithm.
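
For context, a dense LinUCB implementation typically maintains the inverse of the design matrix $A$ with the Sherman-Morrison identity, $(A + xx^\top)^{-1} = A^{-1} - \frac{A^{-1}x\,x^\top A^{-1}}{1 + x^\top A^{-1}x}$, at $O(d^2)$ cost per update. The sketch below shows that dense baseline only. It is not the diagonal-plus-low-rank PSI-LinUCB update described in the abstract, and the function name is illustrative.

```python
def sherman_morrison_update(a_inv, x):
    """Return (A + x x^T)^{-1} given a symmetric A^{-1}, in O(d^2) time."""
    d = len(x)
    # ax = A^{-1} x; by symmetry of A^{-1}, x^T A^{-1} is its transpose.
    ax = [sum(a_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
    # denom = 1 + x^T A^{-1} x, strictly positive when A is positive definite.
    denom = 1.0 + sum(x[i] * ax[i] for i in range(d))
    return [
        [a_inv[i][j] - ax[i] * ax[j] / denom for j in range(d)]
        for i in range(d)
    ]
```

Starting from the identity and adding $x = (1, 2)$ yields the inverse of $\begin{pmatrix} 2 & 2 \\ 2 & 5 \end{pmatrix}$, namely $\frac{1}{6}\begin{pmatrix} 5 & -2 \\ -2 & 2 \end{pmatrix}$.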

2510.01798 2026-02-10 stat.ME physics.app-ph

Optimal smoothing parameter in Eilers-Whittaker smoother

Roberto Bernal-Arencibia, Karel Garcia Medina, Ernesto Estevez-Rams, Beatriz Aragon-Fernandez

Comments minor typos corrected

Details
English abstract

The effectiveness of the Eilers-Whittaker method for data smoothing depends on the choice of the regularisation parameter, and automatic selection is a necessity for large datasets. Common methods, such as leave-one-out cross-validation, can perform poorly when serially correlated noise is present. We propose a novel procedure for selecting the control parameter based on the spectral entropy of the residuals. We define an S-curve from the Euclidean distance between points in a plot of the spectral entropy of the residuals versus that of the smoothed signal. The regularisation parameter corresponding to the absolute maximum of this S-curve is chosen as the optimal parameter. Using simulated data, we benchmarked our method against cross-validation and the V-curve. Validation was also performed on diverse experimental data. This robust and straightforward procedure can be a valuable addition to the available selection methods for the Eilers smoother.
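
The abstract does not spell out how spectral entropy is defined, so the following is a sketch under one common convention: normalise the one-sided power spectrum (DC excluded) to sum to one and take its Shannon entropy. The function name, the exclusion of the DC bin, and the naive $O(n^2)$ DFT are all choices made here for a dependency-free illustration. The paper's S-curve construction is not reproduced.

```python
import cmath
import math


def spectral_entropy(signal):
    """Shannon entropy (nats) of the normalised one-sided power spectrum, DC excluded."""
    n = len(signal)
    power = []
    for k in range(1, n // 2 + 1):
        # Naive DFT coefficient at frequency bin k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in power if p > 0)
```

Under this convention a pure sinusoid concentrates its power in one bin and has entropy near zero, while a signal with a flat spectrum attains the maximum, the logarithm of the number of bins. Residuals that look like white noise should therefore score high.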

2509.25351 2026-02-10 cs.LG stat.ML

Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region

Shuang Liang, Guido Montúfar

Details
English abstract

We examine gradient descent in matrix factorization and show that under large step sizes the parameter space develops a fractal structure. We derive the exact critical step size for convergence in scalar-vector factorization and show that near criticality the selected minimizer depends sensitively on the initialization. Moreover, we show that adding regularization amplifies this sensitivity, generating a fractal boundary between initializations that converge and those that diverge. The analysis extends to general matrix factorization with orthogonal initialization. Our findings reveal that near-critical step sizes induce a chaotic regime of gradient descent where the training outcome is unpredictable and there are no simple implicit biases, such as towards balancedness, minimum norm, or flatness.

2508.17411 2026-02-10 stat.ME math.ST stat.TH

Two-sample Testing with Block-wise Missingness in Multi-source Data

Kejian Zhang, Muxuan Liang, Robert Maile, Doudou Zhou

Details
English abstract

Multi-source and multi-modal datasets are increasingly common in scientific research, yet they often exhibit block-wise missingness, where entire modalities are systematically absent in some sources or no single source contains all modalities. This structured missingness poses major challenges for two-sample hypothesis testing. Standard approaches, such as imputation or complete-case analysis, may introduce bias or suffer efficiency loss, especially under missing-not-at-random mechanisms. To address this challenge, we propose the Block-Pattern Enhanced Test, a general framework for constructing two-sample test statistics that explicitly account for block-wise missingness. We show that the framework yields valid tests under a new condition allowing for missing-not-at-random mechanisms. Building on this general framework, we further propose the Block-wise Rank In Similarity graph Edge-count (BRISE) test, which accommodates heterogeneous modalities using rank-based similarity graphs. Theoretically, we establish that the null distribution of BRISE converges to a $χ^2$ distribution, and that the test is consistent both in the standard asymptotic regime and in the high-dimensional low-sample-size setting under mild conditions. Simulation studies demonstrate that BRISE controls the type-I error rate and achieves strong power across a wide range of alternatives. Applications to two real-world datasets with block-wise missingness further illustrate the practical utility of the proposed method.

2507.11236 2026-02-10 cs.DS cs.LG math.PR stat.ML

Improved sampling algorithms and functional inequalities for non-log-concave distributions

Yuchen He, Zhehan Lei, Jianan Shao, Chihao Zhang

Details
English abstract

We study the problem of sampling from a distribution $μ$ with density $\propto e^{-V}$ for some potential function $V:\mathbb R^d\to \mathbb R$ with query access to $V$ and $\nabla V$. We start with the following standard assumptions: (1) $V$ is $L$-smooth. (2) The second moment $\mathbf{E}_{X\sim μ}[\|X\|^2]\leq M$. Recently, He and Zhang (COLT'25) showed that the query complexity of this problem is at least $\left(\frac{LM}{dε}\right)^{Ω(d)}$ where $ε$ is the desired accuracy in total variation distance, and the Poincaré constant can be unbounded. Meanwhile, another common assumption in the study of diffusion-based samplers (see e.g., the work of Chen, Chewi, Li, Li, Salim and Zhang (ICLR'23)) strengthens (1) to the following: (1*) The potential function of *every* distribution along the Ornstein-Uhlenbeck process starting from $μ$ is $L$-smooth. We show that under the assumptions (1*) and (2), the query complexity of sampling from $μ$ can be $\mathrm{poly}(L,d)\cdot \left(\frac{Ld+M}{ε^2}\right)^{\mathcal{O}(L+1)}$, which is polynomial in $d$ and $\frac{1}{ε}$ when $L=\mathcal{O}(1)$ and $M=\mathrm{poly}(d)$. This improves the algorithm with quasi-polynomial query complexity developed by Huang et al. (COLT'24). Our results imply that the seemingly moderate strengthening from (1) to (1*) yields an exponential gap in the query complexity. Furthermore, we show that together with the assumption (1*) and the stronger moment assumption that $\|X\|$ is $λ$-sub-Gaussian for $X\simμ$, the Poincaré constant of $μ$ is at most $\mathcal{O}(λ)^{2(L+1)}$. We also establish a modified log-Sobolev inequality for $μ$ under these conditions. As an application of our technique, we obtain a new estimate of the modified log-Sobolev constant for a specific class of mixtures of strongly log-concave distributions.

2507.05806 2026-02-10 cs.LG stat.ML

Predicting Graph Structure via Adapted Flux Balance Analysis

Sevvandi Kandanaarachchi, Ziqi Xu, Stefan Westerlund, Conrad Sanderson

Comments extended and revised version of arXiv:2401.04280

Journal ref Lecture Notes in Computer Science (LNCS), Vol. 16370, pp. 351-363, 2026

Details
English abstract

Many dynamic processes such as telecommunication and transport networks can be described through discrete time series of graphs. Modelling the dynamics of such time series enables prediction of graph structure at future time steps, which can be used in applications such as detection of anomalies. Existing approaches for graph prediction have limitations such as assuming that the vertices do not change between consecutive graphs. To address this, we propose to exploit time series prediction methods in combination with an adapted form of flux balance analysis (FBA), a linear programming method originating from biochemistry. FBA is adapted to incorporate various constraints applicable to the scenario of growing graphs. Empirical evaluations on synthetic datasets (constructed via the Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the efficacy of the proposed approach.

2506.09648 2026-02-10 stat.ML cs.LG

Scaling Laws for Uncertainty in Deep Learning

Mattia Rosso, Simone Rossi, Giulio Franzese, Markus Heinonen, Maurizio Filippone

Details
English abstract

Deep learning has recently revealed the existence of scaling laws, demonstrating that model performance follows predictable trends based on dataset and model sizes. Inspired by these findings and fascinating phenomena emerging in the over-parameterized regime, we examine a parallel direction: do similar scaling laws govern predictive uncertainties in deep learning? In identifiable parametric models, such scaling laws can be derived in a straightforward manner by treating model parameters in a Bayesian way. In this case, for example, we obtain $O(1/N)$ contraction rates for epistemic uncertainty with respect to the number of data $N$. However, in over-parameterized models, these guarantees do not hold, leading to largely unexplored behaviors. In this work, we empirically show the existence of scaling laws associated with various measures of predictive uncertainty with respect to dataset and model sizes. Through experiments on vision and language tasks, we observe such scaling laws for in- and out-of-distribution predictive uncertainty estimated through popular approximate Bayesian inference and ensemble methods. Besides the elegance of scaling laws and the practical utility of extrapolating uncertainties to larger data or models, this work provides strong evidence to dispel recurring skepticism against Bayesian approaches: "In many applications of deep learning we have so much data available: what do we need Bayes for?". Our findings show that "so much data" is typically not enough to make epistemic uncertainty negligible.

2506.03849 2026-02-10 stat.ML cs.LG

Algorithm- and Data-Dependent Generalization Bounds for Diffusion Models

Benjamin Dupuis, Dario Shariatian, Maxime Haddouche, Alain Durmus, Umut Simsekli

Details
English abstract

Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. A substantial body of work now exists on the analysis of SGMs, focusing either on discretization aspects or on their statistical performance. In the latter case, bounds have been derived, under various metrics, between the true data distribution and the distribution induced by the SGM, often demonstrating polynomial convergence rates with respect to the number of training samples. However, these approaches adopt a largely approximation theory viewpoint, which tends to be overly pessimistic and relatively coarse. In particular, they fail to fully explain the empirical success of SGMs or capture the role of the optimization algorithm used in practice to train the score network. To support this observation, we first present simple experiments illustrating the concrete impact of optimization hyperparameters on the generalization ability of the generated distribution. Then, this paper aims to bridge this theoretical gap by providing the first algorithmic- and data-dependent generalization analysis for SGMs. In particular, we establish bounds that explicitly account for the optimization dynamics of the learning algorithm, offering new insights into the generalization behavior of SGMs. Our theoretical findings are supported by empirical results on several datasets.

2505.13142 2026-02-10 cs.LG stat.ML

Parallel Layer Normalization for Universal Approximation

Yunhao Ni, Yuxin Guo, Yuhe Liu, Wenxin Sun, Jie Luo, Wenjun Wu, Lei Huang

Comments 45 pages

Details
English abstract

This paper studies the approximation capabilities of neural networks that combine layer normalization (LN) with linear layers. We prove that networks consisting of two linear layers with parallel layer normalizations (PLNs) inserted between them (referred to as PLN-Nets) achieve universal approximation, whereas architectures that use only standard LN exhibit strictly limited expressive power. We further analyze approximation rates of shallow and deep PLN-Nets under the $L^\infty$ norm as well as in Sobolev norms. Our analysis extends beyond LN to RMSNorm, and from standard MLPs to position-wise feed-forward networks, the core building blocks used in RNNs and Transformers. Finally, we provide empirical experiments to explore other possible potentials of PLN-Nets.

2504.06477 2026-02-10 stat.ML cs.LG math.ST stat.TH

Sparsified-Learning for High-Dimensional Heavy-Tailed Locally Stationary Time Series, Concentration and Oracle Inequalities

Yingjie Wang, Mokhtar Z. Alaya, Salim Bouzebda, Xinsheng Liu

Details
English abstract

Sparse learning is ubiquitous in many machine learning tasks. It aims to regularize the goodness-of-fit objective by adding a penalty term to encode structural constraints on the model parameters. In this paper, we develop a flexible sparse learning framework tailored to high-dimensional heavy-tailed locally stationary time series (LSTS). The data-generating mechanism incorporates a regression function that changes smoothly over time and is observed under noise belonging to the class of sub-Weibull and regularly varying distributions. We introduce a sparsity-inducing penalized estimation procedure that combines additive modeling with kernel smoothing and define an additive kernel-smoothing hypothesis class. In the presence of locally stationary dynamics, we assume exponentially decaying $β$-mixing coefficients to derive concentration inequalities for kernel-weighted sums of locally stationary processes with heavy-tailed noise. We further establish nonasymptotic prediction-error bounds, yielding both slow and fast convergence rates under different sparsity structures, including Lasso and total variation penalization with the least-squares loss. To support our theoretical results, we conduct numerical experiments on simulated LSTS with sub-Weibull and Pareto noise, highlighting how tail behavior affects prediction error across different covariate-dimensions as the sample size increases.

2502.03848 2026-02-10 math.ST stat.TH

Consistent model selection in a collection of stochastic block models

Lucie Arts

Details
English abstract

We introduce the penalized Krichevsky-Trofimov (KT) estimator as a convergent method for estimating the number of node clusters when observing multiple networks within both multi-layer and dynamic Stochastic Block Models. We establish the consistency of the KT estimator, showing that it converges to the correct number of clusters in both types of models when the number of nodes in the networks increases. Our estimator does not require a known upper bound on this number to be consistent. Furthermore, we show that these consistency results hold in both dense and sparse regimes, making the penalized KT estimator robust across various network configurations. We illustrate its performance on synthetic datasets.

2409.05066 2026-02-10 math.ST stat.TH

Precise Asymptotics for Linear Mixed Models with Crossed Random Effects

Jiming Jiang, Matt P. Wand, Swarnadip Ghosh

Details
English abstract

We obtain an asymptotic normality result that reveals the precise asymptotic behavior of the maximum likelihood estimators of parameters for a very general class of linear mixed models containing crossed random effects. In achieving the result, we overcome theoretical difficulties that arise from random effects being crossed as opposed to the simpler nested random effects case. Our new theory is for a class of Gaussian response linear mixed models which includes crossed random slopes that partner arbitrary multivariate predictor effects and does not require the cell counts to be balanced. Statistical utilities include confidence interval construction, Wald hypothesis tests and sample size calculations.

2408.16773 2026-02-10 cs.CY cs.LG stat.AP stat.ML

Advance Real-time Detection of Traffic Incidents in Highways using Vehicle Trajectory Data

Sudipta Roy, Samiul Hasan

Comments 19 Pages, 4 Tables, 10 Figures

Details
English abstract

A significant number of traffic crashes are secondary crashes that occur because of an earlier incident on the road. Thus, early detection of traffic incidents is crucial for road users from a safety perspective, with the potential to reduce the risk of secondary crashes. The wide availability of GPS devices nowadays offers an opportunity to track and record vehicle trajectories. The objective of this study is to use vehicle trajectory data for advance real-time detection of traffic incidents on highways using machine learning-based algorithms. The study uses three days of unevenly sequenced vehicle trajectory data and traffic incident data on I-10, one of the most crash-prone highways in Louisiana. Vehicle trajectories are converted to trajectories based on virtual detector locations to maintain spatial uniformity as well as to generate historical traffic data for machine learning algorithms. Trips matched with traffic incidents on the way are separated and, along with other trips with similar spatial attributes, are used to build a database for modeling. Multiple machine learning algorithms such as Logistic Regression, Random Forest, Extreme Gradient Boost, and Artificial Neural Network models are used to detect a trajectory that is likely to face an incident in the downstream road section. Results suggest that the Random Forest model achieves the best performance for predicting an incident with reasonable recall value and discrimination capability.

2407.11342 2026-02-10 stat.ME

GenTwoArmsTrialSize: An R Statistical Software Package to estimate Generalized Two Arms Randomized Clinical Trial Sample Size

Mohsen Soltanifar, Chel Hee Lee, Amin Shirazi, Martha Behnke, Ilfra Raymond-Loher, Getachew A. Dagne

Comments 33 pages, 2 figures, 2 tables

Journal ref Communications for Statistical Applications and Methods (CSAM), 2025

Details
English abstract

The precise calculation of sample sizes is a crucial aspect in the design of clinical trials particularly for pharmaceutical statisticians. While various R statistical software packages have been developed by researchers to estimate required sample sizes under different assumptions, there has been a notable absence of a standalone R statistical software package that allows researchers to comprehensively estimate sample sizes under generalized scenarios. This paper introduces the R statistical software package "GenTwoArmsTrialSize" available on the Comprehensive R Archive Network (CRAN), designed for estimating the required sample size in two-arm clinical trials. The package incorporates four endpoint types, two trial treatment designs, four types of hypothesis tests, as well as considerations for noncompliance and loss of follow-up, providing researchers with the capability to estimate sample sizes across 24 scenarios. To facilitate understanding of the estimation process and illuminate the impact of noncompliance and loss of follow-up on the size and variability of estimations, the paper includes four hypothetical examples and one applied example. The discussion encompasses the package's limitations and outlines directions for future extensions and improvements.
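
As a point of reference for the kind of calculation involved, the textbook normal-approximation sample size for a two-sided test of equality of two means with equal allocation is $n = 2\,(z_{1-α/2} + z_{1-β})^2\,σ^2/δ^2$ per arm. The sketch below implements only this formula. It does not call or reproduce the GenTwoArmsTrialSize API, and it omits the noncompliance and loss-to-follow-up adjustments the package is described as providing. The function name and parameters are illustrative.

```python
import math
from statistics import NormalDist


def n_per_arm_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided z-test of equal means, 1:1 allocation."""
    if delta == 0:
        raise ValueError("delta must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)
```

With a standardised effect of 0.5, a 5% two-sided level and 80% power, this gives 63 participants per arm.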

2405.16958 2026-02-10 stat.ML cs.LG math.PR

Large Deviations of Gaussian Neural Networks with ReLU activation

Quirin Vogel

Comments typo corrected from a previous version

Details
English abstract

We prove a large deviation principle for deep neural networks with Gaussian weights and at most linearly growing activation functions, such as ReLU. This generalises earlier work, in which bounded and continuous activation functions were considered. In practice, linearly growing activation functions such as ReLU are most commonly used. We furthermore simplify previous expressions for the rate function and provide a power-series expansion for the ReLU case.

2404.00735 2026-02-10 stat.ME stat.ML

Bias-Targeted Nonparametric Balancing for Stable Causal Mediation Analysis

Chang Liu, AmirEmad Ghassami

Details
English abstract

Influence function (IF)-based estimators are widely used in mediation analysis due to their modeling flexibility, but standard implementations require direct estimation of the distribution functions of the mediator and treatment variables. Since these functions appear in the denominator of IF-based estimators, they can induce significant instability, particularly with continuous mediators. In this work, we propose an alternative implementation of IF-based estimators for both single- and multiple-mediator settings, based on reparametrizations of the likelihood. The key idea is to estimate the involved nuisance functions according to their role in the bias structure of the IF-based estimators. In our approach, key nuisance functions that are potential sources of instability are estimated using a novel nonparametric weighted balancing method, which can be viewed as a nonparametric extension of covariate balancing generalized to mediation analysis, fully stabilizing the estimators. We establish consistency and multiple robustness under suitable regularity conditions, and asymptotic normality. Simulation studies demonstrate substantial reductions in bias and variance relative to existing methods for continuous mediators. We further illustrate the approach using NHANES 2013-2014 data to estimate the effect of obesity on coronary heart disease mediated by glycohemoglobin.

2403.09170 2026-02-10 math.ST cs.NA math.NA math.PR stat.ML stat.TH

Analysis of singular subspaces under random perturbations

Ke Wang

Comments Final version; accepted for publication in the Annals of Statistics. The Supplementary Material is appended to the end of the main document for convenience

Details
English abstract

We present a comprehensive analysis of singular vector and singular subspace perturbations in the signal-plus-noise matrix model with random Gaussian noise. Assuming a low-rank signal matrix, we extend the Davis-Kahan-Wedin theorem in a fully generalized manner, applicable to any unitarily invariant matrix norm, building on previous results by O'Rourke, Vu, and the author. Our analysis provides fine-grained insights, including $\ell_\infty$ bounds for singular vectors, $\ell_{2, \infty}$ bounds for singular subspaces, and results for linear and bilinear functions of singular vectors. Additionally, we derive $\ell_{2,\infty}$ bounds on perturbed singular vectors, taking into account the weighting by their corresponding singular values. Finally, we explore practical implications of these results in the Gaussian mixture model and the submatrix localization problem.

2402.02128 2026-02-10 stat.ME

Adaptive Accelerated Failure Time modeling with a Semiparametric Skewed Error Distribution

Sangkon Oh, Hyunjae Lee, Sangwook Kang, Byungtae Seo

Details
English abstract

The accelerated failure time (AFT) model is widely used to analyze relationships between variables in the presence of censored observations. However, this model relies on some assumptions such as the error distribution, which can lead to biased or inefficient estimates if these assumptions are violated. In order to overcome this challenge, we propose a novel approach that incorporates a semiparametric skew-normal scale mixture distribution for the error term in the AFT model. By allowing for more flexibility and robustness, this approach reduces the risk of misspecification and improves the accuracy of parameter estimation. We investigate the identifiability and consistency of the proposed model and develop a practical estimation algorithm. To evaluate the performance of our approach, we conduct extensive simulation studies and real data analyses. The results demonstrate the effectiveness of our method in providing robust and accurate estimates in various scenarios.

2310.09384 2026-02-10 stat.ME stat.AP

Modeling Missing at Random Neuropsychological Test Scores Using a Mixture of Binomial Product Experts

Daniel Suen, Yen-Chi Chen

Comments 74 pages, 6 figures, 10 tables

详情
英文摘要

Multivariate bounded discrete data arises in many fields. In the setting of dementia studies, such data is collected when individuals complete neuropsychological tests. We outline a modeling and inference procedure that can model the joint distribution conditional on baseline covariates, leveraging previous work on mixtures of experts and latent class models. Furthermore, we illustrate how the work can be extended when the outcome data is missing at random using a nested EM algorithm. The proposed model can incorporate covariate information and perform imputation and clustering. We apply our model to simulated data and an Alzheimer's disease data set.

2306.13681 2026-02-10 stat.ME cs.LG econ.EM stat.ML

Estimating the Value of Evidence-Based Decision Making

Alberto Abadie, Anish Agarwal, Guido Imbens, Siwei Jia, James McQueen, Serguei Stepaniants, Santiago Torres

Details
English abstract

In an era of data abundance, statistical evidence is increasingly critical for business and policy decisions. Yet, organizations lack empirical tools to assess the value of evidence-based decision making (EBDM), optimize statistical precision, and balance the costs of evidence-gathering strategies against their benefits. To tackle these challenges, this article introduces an empirical framework to estimate the value of EBDM and evaluate the return on investment in statistical precision and project ideation. The framework leverages parametric and nonparametric empirical Bayes methods to account for parameter heterogeneity and measure how statistical precision changes the value of evidence. The value extracted from statistical evidence depends critically on how organizations translate evidence into policy decisions. Commonly used decision rules based on statistical significance can leave substantial value unrealized and, in some cases, generate negative expected value.

2209.14559 2026-02-10 stat.ME

MML Probabilistic Principal Component Analysis

Enes Makalic, Daniel F. Schmidt

详情
英文摘要

Principal component analysis (PCA) is perhaps the most widely used method for data dimensionality reduction. A key question in PCA is deciding how many factors to retain. This manuscript describes a new approach to automatically selecting the number of principal components based on the Bayesian minimum message length method of inductive inference. We derive a new estimate of the isotropic residual variance and demonstrate that it improves on the usual maximum likelihood approach. We also discuss extending this approach to finite mixture models of principal component analyzers.

2109.13648 2026-02-10 econ.EM stat.ME

Gaussian and Student's $t$ mixture vector autoregressive model with application to the effects of the Euro area monetary policy shock

Savi Virolainen

Comments arXiv admin note: text overlap with arXiv:2007.04713

Details
English abstract

A new mixture vector autoregressive model based on Gaussian and Student's $t$ distributions is introduced. As its mixture components, our model incorporates conditionally homoskedastic linear Gaussian vector autoregressions and conditionally heteroskedastic linear Student's $t$ vector autoregressions. For a $p$th order model, the mixing weights depend on the full distribution of the preceding $p$ observations, which leads to attractive practical and theoretical properties such as ergodicity and full knowledge of the stationary distribution of $p+1$ consecutive observations. A structural version of the model with statistically identified shocks is also proposed. The empirical application studies the effects of the Euro area monetary policy shock. We fit a two-regime model to the data and find the effects, particularly on inflation, stronger in the regime that mainly prevails before the Financial crisis than in the regime that mainly dominates after it. The introduced methods are implemented in the accompanying R package gmvarkit.

2105.13440 2026-02-10 stat.ML cs.LG stat.CO

Non-negative matrix factorization algorithms generally improve topic model fits

Peter Carbonetto, Abhishek Sarkar, Zihao Wang, Matthew Stephens

English abstract

In an effort to develop topic modeling methods that can be quickly applied to large data sets, we revisit the problem of maximum-likelihood estimation in topic models. It is known, at least informally, that maximum-likelihood estimation in topic models is closely related to non-negative matrix factorization (NMF). Yet, to our knowledge, this relationship has not been exploited previously to fit topic models. We show that recent advances in NMF optimization methods can be leveraged to fit topic models very efficiently, often resulting in much better fits and in less time than existing algorithms for topic models. We also formally make the connection between the NMF optimization problem and maximum-likelihood estimation for the topic model, and using this result we show that the expectation maximization (EM) algorithm for the topic model is essentially the same as the classic multiplicative updates for NMF (the only difference being that the operations are performed in a different order). Our methods are implemented in the R package fastTopics.
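
The "classic multiplicative updates for NMF" mentioned above can be sketched as follows. This is a minimal pure-Python illustration of the textbook updates for the Poisson (generalized Kullback-Leibler) objective, not the fastTopics implementation; the toy count matrix, the rank `K = 2`, and all function names are assumptions made for the example.

```python
import math
import random


def matmul(W, H):
    """Plain-Python matrix product of W (n x K) and H (K x m)."""
    return [[sum(w_ik * H[k][j] for k, w_ik in enumerate(row))
             for j in range(len(H[0]))] for row in W]


def kl_divergence(V, WH):
    """Generalized KL divergence D(V || WH), with 0 * log(0) taken as 0."""
    total = 0.0
    for v_row, m_row in zip(V, WH):
        for v, m in zip(v_row, m_row):
            total += m - v
            if v > 0:
                total += v * math.log(v / m)
    return total


def ratios(V, WH):
    """Elementwise ratio V / WH used by both updates."""
    return [[v / m for v, m in zip(v_row, m_row)] for v_row, m_row in zip(V, WH)]


def multiplicative_step(V, W, H):
    """One sweep: update W with H fixed, then H with the new W."""
    n, m, K = len(V), len(V[0]), len(H)
    R = ratios(V, matmul(W, H))
    W = [[W[i][k] * sum(H[k][j] * R[i][j] for j in range(m)) / sum(H[k])
          for k in range(K)] for i in range(n)]
    R = ratios(V, matmul(W, H))
    H = [[H[k][j] * sum(W[i][k] * R[i][j] for i in range(n))
          / sum(row[k] for row in W)
          for j in range(m)] for k in range(K)]
    return W, H


# Toy document-by-word counts (hypothetical); no all-zero rows or columns,
# so every entry of W and H stays strictly positive.
V = [[5, 3, 0, 1], [4, 2, 0, 0], [0, 1, 6, 4], [1, 0, 5, 3]]
rng = random.Random(0)
K = 2
W = [[rng.random() + 0.1 for _ in range(K)] for _ in V]
H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(K)]

losses = [kl_divergence(V, matmul(W, H))]
for _ in range(30):
    W, H = multiplicative_step(V, W, H)
    losses.append(kl_divergence(V, matmul(W, H)))
```

Each sweep should not increase the divergence, which is the monotonicity property these updates share with EM; the sketch makes no attempt at the speed-ups the abstract describes.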

2007.04713 2026-02-10 econ.EM stat.ME

Structural Gaussian mixture vector autoregressive model with application to the asymmetric effects of monetary policy shocks

Savi Virolainen

English abstract

A structural Gaussian mixture vector autoregressive model is introduced. The shocks are identified by combining simultaneous diagonalization of the reduced form error covariance matrices with constraints on the time-varying impact matrix. This leads to flexible identification conditions, and some of the constraints are also testable. The empirical application studies asymmetries in the effects of the U.S. monetary policy shock and finds strong asymmetries with respect to the sign and size of the shock and to the initial state of the economy. The accompanying CRAN distributed R package gmvarkit provides a comprehensive set of tools for numerical analysis.

2006.06618 2026-02-10 stat.ML cs.CR cs.DS cs.IT cs.LG math.IT math.ST stat.TH

CoinPress: Practical Private Mean and Covariance Estimation

Sourav Biswas, Yihe Dong, Gautam Kamath, Jonathan Ullman

Comments Code is available at https://github.com/twistedcubic/coin-press. Experimental results were inadvertently commented out of previous version

English abstract

We present simple differentially private estimators for the mean and covariance of multivariate sub-Gaussian data that are accurate at small sample sizes. We demonstrate the effectiveness of our algorithms both theoretically and empirically using synthetic and real-world datasets -- showing that their asymptotic error rates match the state-of-the-art theoretical bounds, and that they concretely outperform all previous methods. Specifically, previous estimators either have weak empirical accuracy at small sample sizes, perform poorly for multivariate data, or require the user to provide strong a priori estimates for the parameters.

2003.05221 2026-02-10 econ.EM math.ST stat.ME stat.TH

A mixture autoregressive model based on Gaussian and Student's $t$-distributions

Savi Virolainen

English abstract

We introduce a new mixture autoregressive model which combines Gaussian and Student's $t$ mixture components. The model has very attractive properties analogous to the Gaussian and Student's $t$ mixture autoregressive models, but it is more flexible, as it makes it possible to model series which consist of both conditionally homoscedastic Gaussian regimes and conditionally heteroscedastic Student's $t$ regimes. The usefulness of our model is demonstrated in an empirical application to the monthly U.S. interest rate spread between the 3-month Treasury bill rate and the effective federal funds rate.

1810.10987 2026-02-10 econ.EM stat.ML

Nuclear Norm Regularized Estimation of Panel Regression Models

Hyungsik Roger Moon, Martin Weidner

English abstract

In this paper we investigate panel regression models with interactive fixed effects. We propose two new estimation methods that are based on minimizing convex objective functions. The first method minimizes the sum of squared residuals with a nuclear (trace) norm regularization. The second method minimizes the nuclear norm of the residuals. We establish the consistency of the two resulting estimators. Those estimators have a very important computational advantage compared to the existing least squares (LS) estimator, in that they are defined as minimizers of a convex objective function. In addition, the nuclear norm penalization helps to resolve a potential identification problem for interactive fixed effect models, in particular when the regressors are low-rank and the number of the factors is unknown. We also show how to construct estimators that are asymptotically equivalent to the least squares (LS) estimator in Bai (2009) and Moon and Weidner (2017) by using our nuclear norm regularized or minimized estimators as initial values for a finite number of LS minimizing iteration steps. This iteration avoids any non-convex minimization, while the original LS estimation problem is generally non-convex, and can have multiple local minima.

2602.08108 2026-02-10 stat.ME

Goodness-of-Fit Tests for Censored and Truncated Data: Maximum Mean Discrepancy Over Regular Functionals

Juan Carlos Escanciano, Jacobo de Uña-Álvarez

English abstract

We develop a systematic, omnibus approach to goodness-of-fit testing for parametric distributional models when the variable of interest is only partially observed due to censoring and/or truncation. In many such designs, tests based on the nonparametric maximum likelihood estimator are hindered by nonexistence, computational instability, or convergence rates too slow to support reliable calibration under composite nulls. We avoid these difficulties by constructing a regular (pathwise differentiable) Neyman-orthogonal score process indexed by test functions, and aggregating it over a reproducing kernel Hilbert space ball. This yields a maximum-mean-discrepancy-type supremum statistic with a convenient quadratic-form representation. Critical values are obtained via a multiplier bootstrap that keeps nuisance estimates fixed. We establish asymptotic validity under the null and local alternatives and provide concrete constructions for left-truncated right-censored data, current status data, and random double truncation; in particular, to the best of our knowledge, we give the first omnibus goodness-of-fit test for a parametric family under random double truncation in the composite-hypothesis case. Simulations and an empirical illustration demonstrate size control and power in practically relevant incomplete-data designs.

2602.08105 2026-02-10 cs.LG physics.data-an stat.ML

Mutual information and task-relevant latent dimensionality

Paarth Gulati, Eslam Abdelaleem, Audrey Sederberg, Ilya Nemenman

English abstract

Estimating the dimensionality of the latent representation needed for prediction -- the task-relevant dimension -- is a difficult, largely unsolved problem with broad scientific applications. We cast it as an Information Bottleneck question: what embedding bottleneck dimension is sufficient to compress predictor and predicted views while preserving their mutual information (MI). This repurposes neural MI estimators for dimensionality estimation. We show that standard neural estimators with separable/bilinear critics systematically inflate the inferred dimension, and we address this by introducing a hybrid critic that retains an explicit dimensional bottleneck while allowing flexible nonlinear cross-view interactions, thereby preserving the latent geometry. We further propose a one-shot protocol that reads off the effective dimension from a single over-parameterized hybrid model, without sweeping over bottleneck sizes. We validate the approach on synthetic problems with known task-relevant dimension. We extend the approach to intrinsic dimensionality by constructing paired views of a single dataset, enabling comparison with classical geometric dimension estimators. In noisy regimes where those estimators degrade, our approach remains reliable. Finally, we demonstrate the utility of the method on multiple physics datasets.

2602.08096 2026-02-10 stat.ME cs.LG stat.ML

GAAVI: Global Asymptotic Anytime Valid Inference for the Conditional Mean Function

Brian M Cho, Raaz Dwivedi, Nathan Kallus

English abstract

Inference on the conditional mean function (CMF) is central to tasks from adaptive experimentation to optimal treatment assignment and algorithmic fairness auditing. In this work, we provide a novel asymptotic anytime-valid test for a CMF global null (e.g., that all conditional means are zero) and contrasts between CMFs, enabling experimenters to make high confidence decisions at any time during the experiment beyond a minimum sample size. We provide mild conditions under which our tests achieve (i) asymptotic type-I error guarantees, (ii) power one, and, unlike past tests, (iii) optimal sample complexity relative to Gaussian location testing. By inverting our tests, we show how to construct function-valued asymptotic confidence sequences for the CMF and contrasts thereof. Experiments on both synthetic and real-world data show our method is well-powered across various distributions while preserving the nominal error rate under continuous monitoring.

2602.08042 2026-02-10 stat.ML cs.LG

Graph-based Semi-Supervised Learning via Maximum Discrimination

Nadav Katz, Ariel Jaffe

English abstract

Semi-supervised learning (SSL) addresses the critical challenge of training accurate models when labeled data is scarce but unlabeled data is abundant. Graph-based SSL (GSSL) has emerged as a popular framework that captures data structure through graph representations. Classic graph SSL methods, such as Label Propagation and Label Spreading, aim to compute low-dimensional representations where points with the same labels are close in representation space. Although often effective, these methods can be suboptimal on data with complex label distributions. In our work, we develop AUC-spec, a graph approach that computes a low-dimensional representation that maximizes class separation. We compute this representation by optimizing the Area Under the ROC Curve (AUC) as estimated via the labeled points. We provide a detailed analysis of our approach under a product-of-manifold model, and show that the required number of labeled points for AUC-spec is polynomial in the model parameters. Empirically, we show that AUC-spec balances class separation with graph smoothness. It demonstrates competitive results on synthetic and real-world datasets while maintaining computational efficiency comparable to the field's classic and state-of-the-art methods.

2602.08003 2026-02-10 cs.LG cs.AI cs.DC cs.IT math.IT stat.ML

Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection

Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu

English abstract

Large language models (LLMs) are often ensembled together to improve overall reliability and robustness, but in practice models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.

2602.07997 2026-02-10 stat.ML cs.LG math.ST stat.CO stat.ME stat.TH

Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models

TrungKhang Tran, TrungTin Nguyen, Md Abul Bashar, Nhat Ho, Richi Nayak, Christopher Drovandi

Comments TrungKhang Tran and TrungTin Nguyen are co-first authors

English abstract

Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein-protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.

2602.07992 2026-02-10 cs.LG stat.ML

When Is Compositional Reasoning Learnable from Verifiable Rewards?

Daniel Barzilai, Yotam Wolf, Ronen Basri

English abstract

The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity that we call the task-advantage ratio, a joint property of the compositional problem and the base model, that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally arises in different problems. On the negative side, when the structural advantage is not present, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines if such an advantage exists and whether RLVR will converge to a suboptimal solution. We hope our analysis can provide a principled theoretical understanding of when and why RLVR succeeds and when it does not.

2602.07944 2026-02-10 math.ST math.PR stat.TH

Geometric ergodicity of Gibbs samplers for linear latent models with GIG variance mixtures

Elsiddig Awadelkarim, David Bolin, Xiaotian Jin, Alexandre B. Simas, Jonas Wallin

English abstract

We study geometric ergodicity of the Gibbs sampler for linear latent non-Gaussian models (LLnGMs), a class of hierarchical models in which conditional Gaussian structure is preserved through generalized inverse Gaussian (GIG) variance-mixture augmentation. Two complementary routes to geometric ergodicity are developed for the marginal chain on the mixing variables. First, we show that the associated Markov operator is trace-class, and hence admits a spectral gap, over a large portion of the GIG parameter space. Second, for the remaining boundary and heavy-tail regimes, we establish geometric ergodicity via drift and minorization, subject to an explicit null-smallness condition that quantifies how the drift interacts with the null space of the observation operator. Together, these results cover the full GIG parameter space, including the normal-inverse Gaussian, generalized asymmetric Laplace, and Student-$t$ special cases. The geometric ergodicity of this chain underpins the consistency of Gibbs-based stochastic-gradient estimators for maximum likelihood estimation, and we provide conditions that make the required integrability checks transparent. Numerical experiments illustrate the theoretical findings, contrasting mixing efficiency across parameter regimes and probing the role of the null-smallness constant.

2602.07935 2026-02-10 stat.AP

Analysis of Repairable Systems Availability with Lindley Failure and Repair Behavior

Afshin Yaghoubi

English abstract

Maintainability analysis is a cornerstone of reliability engineering. While the Markov approach is the classical analytical foundation, its reliance on the exponential distribution for failure and repair times is a major and often unrealistic limitation. This paper directly overcomes this critical constraint by investigating and modeling system maintainability using the more flexible and versatile Lindley distribution, which is represented via phase-type distributions. We first present a comprehensive maintainability analysis of a single-component system, deriving precise closed-form expressions for its time-dependent and steady-state availability, as well as the mean time to repair. The core methodology is then systematically generalized to analyze common series and parallel system configurations with n independent and identically distributed components. A dedicated numerical study compares the system performance under the Lindley and exponential distributions, conclusively demonstrating the significant and practical impact of non-exponential repair times on key reliability metrics. Our work provides a versatile and more widely applicable analytical framework for accurate maintainability assessment that successfully relaxes the restrictive exponential assumption, thereby offering greater realism in reliability modeling.
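
For orientation, the steady-state part of such an analysis can be sketched with a standard renewal-theory identity: for a single component alternating between up and down periods, long-run availability is the mean up-time divided by the mean cycle length. The Lindley mean $(\theta+2)/(\theta(\theta+1))$ is a known property of the distribution. The function names and parameter values below are illustrative assumptions, not taken from the paper, and its time-dependent closed forms are not reproduced here.

```python
def lindley_mean(theta):
    """Mean of a Lindley(theta) random variable, theta > 0."""
    return (theta + 2.0) / (theta * (theta + 1.0))


def steady_state_availability(mean_up, mean_down):
    """Long-run fraction of time a single repairable component is up."""
    return mean_up / (mean_up + mean_down)


# Hypothetical parameters: failure times ~ Lindley(0.5), repair times ~ Lindley(2.0).
mttf = lindley_mean(0.5)   # mean time to failure
mttr = lindley_mean(2.0)   # mean time to repair
availability = steady_state_availability(mttf, mttr)
```

Because the steady-state value depends only on the two means, any difference from an exponential model with matched means would show up in the time-dependent availability rather than in this limit.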

2602.07918 2026-02-10 cs.CR cs.LG stat.ME

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister

English abstract

AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on "poisoned" reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.

2602.07911 2026-02-10 stat.AP

Adaptive Test Procedure for High Dimensional Regression Coefficient

Ping Zhao, Fengyi Song, Huifang Ma

English abstract

We develop a unified $L$-statistic testing framework for high-dimensional regression coefficients that adapts to unknown sparsity. The proposed statistics rank coordinate-wise evidence measures and aggregate the top $k$ signals, bridging classical max-type and sum-type tests. We establish joint weak convergence of the extreme-value component and standardized $L$-statistics under mild conditions, yielding an asymptotic independence that justifies combining multiple $k$'s. An adaptive omnibus test is constructed via a Cauchy combination over a dyadic grid of $k$, and a wild bootstrap calibration is provided with theoretical guarantees. Simulations demonstrate accurate size and strong power across sparse and dense alternatives, including non-Gaussian designs.
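
The Cauchy combination step named above can be sketched generically: each p-value is mapped to a standard Cauchy variate, the variates are averaged with weights, and the Cauchy tail gives the combined p-value. This is the standard combination rule only; the dyadic grid and the per-$k$ p-values below are made-up placeholders, and neither the paper's $L$-statistics nor its wild bootstrap calibration is reproduced.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule."""
    m = len(p_values)
    if weights is None:
        weights = [1.0 / m] * m  # equal weights summing to one
    # Under the null, tan((0.5 - p) * pi) is standard Cauchy for uniform p.
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, p_values))
    # Upper-tail probability of a standard Cauchy at t.
    return 0.5 - math.atan(t) / math.pi


# Hypothetical p-values, one per k on a dyadic grid.
dyadic_ks = [1, 2, 4, 8]
p_by_k = [0.04, 0.20, 0.51, 0.73]
p_combined = cauchy_combination(p_by_k)
```

A single p-value is returned unchanged by this rule, and a set of identical p-values combines to that same value, which makes for a quick sanity check.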

2602.07856 2026-02-10 math.NA cs.NA math.PR math.ST stat.TH

Inhomogeneous Priors for Bayesian Inverse Problems

Babak Maboudi Afkham, Tomas Soto, Mirza Karamehmedovic, Lassi Roininen

English abstract

Many inverse problems arising in engineering and applied sciences involve unknown quantities with pronounced spatial inhomogeneity, such as localized defects or spatially varying material properties, making reliable uncertainty quantification particularly challenging. While Bayesian inverse problem methodologies provide a principled framework for assessing reconstruction reliability, commonly used Gaussian priors, such as Whittle-Matern models, impose globally homogeneous assumptions that limit their ability to capture such structure in large-scale settings. We introduce a new class of inhomogeneous priors defined via convolution with white noise, yielding nonstationary Whittle-Matern-type random fields with a rigorous mathematical construction. These priors fit naturally within existing Bayesian well-posedness theory and enable efficient sampling by reducing prior realizations to the solution of a pseudo-differential equation, for which we develop numerical schemes with quantified approximation error. Numerical experiments in one-dimensional denoising and two-dimensional limited-angle X-ray tomography demonstrate significant improvements in reconstruction quality and uncertainty quantification, particularly in data-limited scenarios.

2602.07785 2026-02-10 stat.AP stat.ME

Digital exclusion among middle-aged and older adults in China: age-period-cohort evidence from three national surveys, 2011-2022

Yufei Zhang, Zhihao Ma

Comments Also available as a preprint on OSF Preprints: https://osf.io/hpv68_v1

English abstract

Amid China's ageing and digital shift, digital exclusion among older adults poses an urgent challenge. To unpack this phenomenon, this study disentangles age, period, and cohort effects on digital exclusion among middle-aged and older Chinese adults. Using three nationally representative surveys (CHARLS 2011-2020, CFPS 2010-2022, and CGSS 2010-2021), we fitted hierarchical age-period-cohort (HAPC) models weighted by cross-sectional survey weights and stabilized inverse probability weights for item response. We further assessed heterogeneity by urban-rural residence, region, multimorbidity, and cognitive risk, and evaluated robustness with APC bounding analyses. Across datasets, digital exclusion increased with age and displayed mild non-linearity, with a small midlife easing followed by a sharper rise at older ages. Period effects declined over the 2010s and early 2020s, although the pace of improvement differed across survey windows. Cohort deviations were present but less consistent than age and period patterns, with an additional excess risk concentrated among cohorts born in the 1950s. Rural and western residents, as well as adults with multimorbidity or cognitive risk, remained consistently more excluded. Over the study period, the urban-rural divide showed no evidence of narrowing, whereas the cognitive-risk gap widened. These findings highlight digital inclusion as a vital pathway for older adults to remain integral participants in an evolving digital society.

2602.07767 2026-02-10 stat.ML cs.LG stat.ME

BFTS: Thompson Sampling with Bayesian Additive Regression Trees

Ruizhe Deng, Bibhas Chakraborty, Ran Chen, Yan Shuo Tan

English abstract

Contextual bandits are a core technology for personalized mobile health interventions, where decision-making requires adapting to complex, non-linear user behaviors. While Thompson Sampling (TS) is a preferred strategy for these problems, its performance hinges on the quality of the underlying reward model. Standard linear models suffer from high bias, while neural network approaches are often brittle and difficult to tune in online settings. Conversely, tree ensembles dominate tabular data prediction but typically rely on heuristic uncertainty quantification, lacking a principled probabilistic basis for TS. We propose Bayesian Forest Thompson Sampling (BFTS), the first contextual bandit algorithm to integrate Bayesian Additive Regression Trees (BART), a fully probabilistic sum-of-trees model, directly into the exploration loop. We prove that BFTS is theoretically sound, deriving an information-theoretic Bayesian regret bound of $\tilde{O}(\sqrt{T})$. As a complementary result, we establish frequentist minimax optimality for a "feel-good" variant, confirming the structural suitability of BART priors for non-parametric bandits. Empirically, BFTS achieves state-of-the-art regret on tabular benchmarks with near-nominal uncertainty calibration. Furthermore, in an offline policy evaluation on the Drink Less micro-randomized trial, BFTS improves engagement rates by over 30% compared to the deployed policy, demonstrating its practical effectiveness for behavioral interventions.

2602.07740 2026-02-10 stat.ME

Hyperbolic statistical inference for Treatment Effects with Circular biomarker of astigmatism

Buddhananda Banerjee, Surojit Biswas, Daitari Prusty

English abstract

Circular biomarkers arise naturally in many biomedical applications, particularly in ophthalmology, where angular measurements such as astigmatism are routinely recorded. Similar directional variables also occur in the study of human body rotations, including movements of the hand, waist, neck, and lower limbs. Motivated by a clinical dataset comprising angular measurements of astigmatism induced by two cataract surgery procedures, we propose a novel two-sample testing framework for circular data grounded in hyperbolic geometry. Assuming von Mises distributions with either common or group-specific concentration parameters, we embed the corresponding parameter spaces into the Poincaré disk, an open unit disk endowed with the Poincaré metric. Under this construction, each von Mises distribution is mapped uniquely to a point in the Poincaré disk, yielding a continuous geometric representation that preserves the intrinsic structure of the parameter space. This embedding enables direct comparison of group distributions via hyperbolic distances, leading to natural and interpretable test statistics. We develop permutation-based tests for the common concentration case and bootstrap-based procedures for unequal concentrations. Extensive simulation studies demonstrate stable empirical size, strong consistency, and superior asymptotic power compared with existing competing methods. The proposed methodology is illustrated through a detailed analysis of the cataract surgery dataset, including a clinically informed restructuring of the original observations. The results highlight the practical advantages of incorporating hyperbolic geometry into the analysis of circular biomedical data and underscore the potential of geometry-aware inference for directional biomarkers.

2602.07710 2026-02-10 stat.ML cs.LG

On Generation in Metric Spaces

Jiaxun Li, Vinod Raman, Ambuj Tewari

English abstract

We study generation in separable metric instance spaces. We extend the language generation framework from Kleinberg and Mullainathan [2024] beyond countable domains by defining novelty through metric separation and allowing asymmetric novelty parameters for the adversary and the generator. We introduce the $(\varepsilon,\varepsilon')$-closure dimension, a scale-sensitive analogue of closure dimension, which yields characterizations of uniform and non-uniform generatability and a sufficient condition for generation in the limit. Along the way, we identify a sharp geometric contrast. Namely, in doubling spaces, including all finite-dimensional normed spaces, generatability is stable across novelty scales and invariant under equivalent metrics. In general metric spaces, however, generatability can be highly scale-sensitive and metric-dependent; even in the natural infinite-dimensional Hilbert space $\ell^2$, all notions of generation may fail abruptly as the novelty parameters vary.

2602.07669 2026-02-10 math.ST stat.TH

The statistical threshold for planted matchings and spanning trees

Louigi Addario-Berry, Omer Angel, Gábor Lugosi, Miklós Z. Rácz, Tselil Schramm

English abstract

In this paper, we study the problem of detecting the presence of a planted perfect matching or spanning tree in an Erdős-Rényi random graph. More precisely, we study the hypothesis testing problem where the statistician observes a graph on $n$ vertices. Under the null hypothesis, the graph is a realization of an Erdős-Rényi random graph $G(n,q)$, while under the alternative hypothesis, the graph is the union of an Erdős-Rényi random graph and a random perfect matching (or random spanning tree). In order to avoid trivial detection by counting edges, we adjust the alternative hypothesis so that the expected number of edges under both distributions coincides. We prove that in both problems, when $q\gg n^{-1/2}$, no test can perform better than random guessing, while for $q\ll n^{-1/2}$, there exist computationally efficient tests that guess correctly with high probability.

2602.07665 2026-02-10 math.ST stat.TH

The Fisher score on the closed simplex

Giovanni Pistone, Fabio Rapallo, Eva Riccomagno

Comments Submitted

English abstract

We extend classical analytic tools for finite-state statistical models to allow zero probabilities. Using methods from algebraic statistics and information geometry, we develop a framework in which a smooth statistical model could hit the boundary of the simplex, for example, in contingency tables with non-structural zeros. The central object of our approach is the vector bundle whose fibres are the $p$-contrasts associated to each probability distribution $p$. In this framework, Fisher score and other key statistical concepts, such as entropy for one-dimensional statistical models, admit an algebraic representation also on the boundary of the simplex.

2602.07613 2026-02-10 stat.ME stat.CO stat.ML

Fast Rerandomization for Balancing Covariates in Randomized Experiments: A Metropolis-Hastings Framework

Jiuyao Lu, Tianruo Zhang, Ke Zhu

English abstract

Balancing covariates is critical for credible and efficient randomized experiments. Rerandomization addresses this by repeatedly generating treatment assignments until covariate balance meets a prespecified threshold. By shrinking this threshold, it can achieve arbitrarily strong balance, with established results guaranteeing optimal estimation and valid inference in both finite-sample and asymptotic settings across diverse complex experimental settings. Despite its rigorous theoretical foundations, practical use is limited by the extreme inefficiency of rejection sampling, which becomes prohibitively slow under small thresholds and often forces practitioners to adopt suboptimal settings, leading to degraded performance. Existing work on acceleration typically fails to maintain uniformity over the acceptable assignment space, thus losing the theoretical grounds of classical rerandomization. Building upon a Metropolis-Hastings framework, we address this challenge by introducing an additional sampling-importance resampling step, which restores uniformity and preserves statistical guarantees. Our proposed algorithm, PSRSRR, achieves speedups ranging from 10 to 10,000 times while maintaining exact and asymptotic validity, as demonstrated by simulations and two real-data applications.
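
The rejection-sampling baseline that the abstract calls inefficient can be sketched as below. This is classical rerandomization only, simplified to a single covariate with a standardized difference-in-means balance measure; it is not the Metropolis-Hastings or PSRSRR algorithm, and the data, threshold, and function names are assumptions for illustration.

```python
import random
import statistics


def balance(x, assign):
    """Squared standardized difference in covariate means (one covariate)."""
    treated = [xi for xi, a in zip(x, assign) if a == 1]
    control = [xi for xi, a in zip(x, assign) if a == 0]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    scale = statistics.variance(x) * (1 / len(treated) + 1 / len(control))
    return diff * diff / scale


def rerandomize(x, n_treated, threshold, rng, max_draws=100_000):
    """Redraw complete randomizations until the balance measure is acceptable."""
    base = [1] * n_treated + [0] * (len(x) - n_treated)
    for draws in range(1, max_draws + 1):
        assign = base[:]
        rng.shuffle(assign)  # uniform over assignments with n_treated ones
        if balance(x, assign) <= threshold:
            return assign, draws
    raise RuntimeError("no acceptable assignment within max_draws")


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(40)]  # simulated covariate
threshold = 0.05                              # hypothetical balance cutoff
assign, draws = rerandomize(x, 20, threshold, rng)
```

Since every accepted assignment is a uniform draw that happened to pass the check, uniformity over the acceptable set is automatic here; the cost is that the expected number of draws grows as the threshold shrinks, which is the bottleneck the abstract targets.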

2602.07593 2026-02-10 cs.LG cs.GT stat.ML

Beyond Arrow: From Impossibility to Possibilities in Multi-Criteria Benchmarking

Polina Gordienko, Christoph Jansen, Julian Rodemann, Georg Schollmeyer

Details
English abstract

Modern benchmarks such as HELM MMLU account for multiple metrics like accuracy, robustness and efficiency. When trying to turn these metrics into a single ranking, natural aggregation procedures can become incoherent or unstable to changes in the model set. We formalize this aggregation as a social choice problem where each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility result, we argue that the impossibility often originates from pathological examples and identify sufficient conditions under which these disappear, and meaningful multi-criteria benchmarking becomes possible. In particular, we deal with three restrictions on the combinations of rankings and prove that on single-peaked, group-separable and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites like HELM MMLU and verify which structural conditions are fulfilled on which benchmark problems.

2602.07562 2026-02-10 cs.LG cs.AI stat.ML

Gaussian Match-and-Copy: A Minimalist Benchmark for Studying Transformer Induction

Antoine Gonon, Alexandre Cordonnier, Nicolas Boumal

Details
English abstract

Match-and-copy is a core retrieval primitive used at inference time by large language models to retrieve a matching token from the context then copy its successor. Yet, understanding how this behavior emerges on natural data is challenging because retrieval and memorization are entangled. To disentangle the two, we introduce Gaussian Match-and-Copy (GMC), a minimalist benchmark that isolates long-range retrieval through pure second-order correlation signals. Numerical investigations show that this task retains key qualitative aspects of how Transformers develop match-and-copy circuits in practice, and separates architectures by their retrieval capabilities. We also analyze the optimization dynamics in a simplified attention setting. Although many solutions are a priori possible under a regression objective, including ones that do not implement retrieval, we identify an implicit-bias regime in which gradient descent drives the parameters to diverge while their direction aligns with the max-margin separator, yielding hard match selection. We prove this max-margin alignment for GD trajectories that reach vanishing empirical loss under explicit technical conditions.

2602.07482 2026-02-10 stat.ME

Event-driven type design for clinical trials with recurrent events

Jingwen Zhang, Satoshi Hattori

Details
English abstract

It is a common practice in randomized clinical trials with the standard survival outcome to follow patients until a prespecified number of events have been observed, a type of trial known as the event-driven trial. The event-driven design ensures that the target power for a specified type 1 error rate is achieved to detect the target hazard ratio, regardless of the specification of other quantities. To understand the treatment effect for chronic conditions, the analysis of recurrent events has gained popularity in randomized controlled trials, particularly large-scale confirmatory trials. In the absence of within-subject correlation among multiple events, a similar event-driven design can be employed for recurrent event outcomes. On the other hand, in the presence of the within-subject correlation, one needs to model the correlation among recurrent events in evaluating power and setting the sample size. However, information useful in modeling the within-subject correlation is limited at the design stage. Failing to consider the correlation properly may lead to underpowered studies. We propose an event-driven type design for recurrent event outcomes. Our method ensures the target power for the target treatment effect, regardless of the specification of other quantities, by monitoring the robust variance under the marginal rates/means model in a blinded manner. We investigate the operating characteristics of the proposed monitoring procedure in simulation studies. The results of simulation studies showed that the proposed blinded monitoring procedure controlled the power well so that the test possessed the target power and did not lead to serious inflation of the type 1 error rate. Furthermore, we illustrate the proposed method using a real clinical trial dataset.

2602.07477 2026-02-10 stat.ME cs.LG stat.AP stat.CO stat.ML

Statistical inference after variable selection in Cox models: A simulation study

Lena Schemet, Sarah Friedrich-Welz

Details
English abstract

Choosing relevant predictors is central to the analysis of biomedical time-to-event data. Classical frequentist inference, however, presumes that the set of covariates is fixed in advance and does not account for data-driven variable selection. As a consequence, naive post-selection inference may be biased and misleading. In right-censored survival settings, these issues may be further exacerbated by the additional uncertainty induced by censoring. We investigate several inference procedures applied after variable selection for the coefficients of the Lasso and its extension, the adaptive Lasso, in the context of the Cox model. The methods considered include sample splitting, exact post-selection inference, and the debiased Lasso. Their performance is examined in a neutral simulation study reflecting realistic covariate structures and censoring rates commonly encountered in biomedical applications. To complement the simulation results, we illustrate the practical behavior of these procedures in an applied example using a publicly available survival dataset.

2602.07472 2026-02-10 cs.LG math.OC math.PR stat.ML

Bandit Allocational Instability

Yilun Chen, Jiaqi Lu

Details
English abstract

When multi-armed bandit (MAB) algorithms allocate pulls among competing arms, the resulting allocation can exhibit huge variation. This is particularly harmful in modern applications such as learning-enhanced platform operations and post-bandit statistical inference. Thus motivated, we introduce a new performance metric of MAB algorithms termed allocation variability, which is the largest (over arms) standard deviation of an arm's number of pulls. We establish a fundamental trade-off between allocation variability and regret, the canonical performance metric of reward maximization. In particular, for any algorithm, the worst-case regret $R_T$ and worst-case allocation variability $S_T$ must satisfy $R_T \cdot S_T=\Omega(T^{3/2})$ as $T\rightarrow\infty$, as long as $R_T=o(T)$. This indicates that any minimax regret-optimal algorithm must incur worst-case allocation variability $\Theta(T)$, the largest possible scale; while any algorithm with sublinear worst-case regret must necessarily incur $S_T=\omega(\sqrt{T})$. We further show that this lower bound is essentially tight, and that any point on the Pareto frontier $R_T \cdot S_T=\tilde{\Theta}(T^{3/2})$ can be achieved by a simple tunable algorithm UCB-f, a generalization of the classic UCB1. Finally, we discuss implications for platform operations and for statistical inference, when bandit algorithms are used. As a byproduct of our result, we resolve an open question of Praharaj and Khamaru (2025).
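The two consequences stated in the abstract follow from the lower bound by direct substitution. A short worked version, assuming the standard minimax regret rate $\Theta(\sqrt{T})$ for a fixed number of arms (an assumption not stated in the abstract):

```latex
\begin{align*}
R_T \cdot S_T = \Omega\bigl(T^{3/2}\bigr)
  &\;\Longrightarrow\; S_T = \Omega\!\left(\frac{T^{3/2}}{R_T}\right), \\
R_T = \Theta\bigl(\sqrt{T}\bigr)
  &\;\Longrightarrow\; S_T = \Omega\!\left(\frac{T^{3/2}}{T^{1/2}}\right) = \Omega(T), \\
R_T = o(T)
  &\;\Longrightarrow\; \frac{T^{3/2}}{R_T} = \omega\bigl(\sqrt{T}\bigr)
  \;\Longrightarrow\; S_T = \omega\bigl(\sqrt{T}\bigr).
\end{align*}
```

Since an arm's number of pulls lies in $[0, T]$, its standard deviation is at most of order $T$, which is why $\Omega(T)$ in the second line corresponds to the largest possible scale.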

2602.07468 2026-02-10 stat.AP

Consistency Assessment of Regional Treatment Effect for Multi-Regional Clinical Trials in the Presence of Covariate Shift

Kunhai Qing, Xinru Ren, Jin Xu, Menggang Yu

Details
English abstract

Multi-Regional Clinical Trials (MRCTs) play a central role in the development of new therapies by enabling the simultaneous evaluation of drug efficacy and safety across diverse global populations. Assessing the consistency of treatment effects across regions is a fundamental aspect of MRCTs. Existing methods typically focus on region-specific marginal treatment effects. However, when treatment effect heterogeneity arises due to effect-modifying baseline covariates, distributional differences in these covariates can lead to erroneous conclusions. In this paper, we explicitly account for this phenomenon in the consistency assessment by considering the conditional average treatment effect. We propose a two-step assessment strategy that complements existing methods and mitigates the impact of treatment effect heterogeneity. Results from numerical studies demonstrate the effectiveness of the proposed approach.

2602.07454 2026-02-10 stat.ME stat.CO

Estimation of log-Gaussian gamma processes with iterated posterior linearization and Hamiltonian Monte Carlo

Teemu Härkönen, Simo Särkkä

Details
English abstract

Stochastic processes are a flexible and widely used family of models for statistical modeling. While stochastic processes offer attractive properties such as inclusion of uncertainty properties, their inference is typically intractable, with the notable exception of Gaussian processes. Inference of models with non-Gaussian errors typically involves estimation of a high-dimensional latent variable. We propose two methods that use iterated posterior linearization followed by Hamiltonian Monte Carlo to sample the posterior distributions of such latent models with a particular focus on log-Gaussian gamma processes. The proposed methods are validated with two synthetic datasets generated from the log-Gaussian gamma process and a multiscale biocomposite stiffness model. In addition, we apply the methodology to an experimental Raman spectrum of argentopyrite.

2602.07453 2026-02-10 cs.LG stat.ML

Data-Aware and Scalable Sensitivity Analysis for Decision Tree Ensembles

Namrita Varshney, Ashutosh Gupta, Arhaan Ahmad, Tanay V. Tayal, S. Akshay

Details
English abstract

Decision tree ensembles are widely used in critical domains, making robustness and sensitivity analysis essential to their trustworthiness. We study the feature sensitivity problem, which asks whether an ensemble is sensitive to a specified subset of features -- such as protected attributes -- whose manipulation can alter model predictions. Existing approaches often yield examples of sensitivity that lie far from the training distribution, limiting their interpretability and practical value. We propose a data-aware sensitivity framework that constrains the sensitive examples to remain close to the dataset, thereby producing realistic and interpretable evidence of model weaknesses. To this end, we develop novel techniques for data-aware search using a combination of mixed-integer linear programming (MILP) and satisfiability modulo theories (SMT) encodings. Our contributions are fourfold. First, we strengthen the NP-hardness result for sensitivity verification, showing it holds even for trees of depth 1. Second, we develop MILP-optimizations that significantly speed up sensitivity verification for single ensembles and for the first time can also handle multiclass tree ensembles. Third, we introduce a data-aware framework generating realistic examples close to the training distribution. Finally, we conduct an extensive experimental evaluation on large tree ensembles, demonstrating scalability to ensembles with up to 800 trees of depth 8, achieving substantial improvements over the state of the art. This framework provides a practical foundation for analyzing the reliability and fairness of tree-based models in high-stakes applications.

2602.07404 2026-02-10 stat.ME math.ST stat.TH

Adaptive Experimental Design Using Shrinkage Estimators

Evan T. R. Rosenman, Kristen B. Hunter

Details
English abstract

In the setting of multi-armed trials, adaptive designs are a popular way to increase estimation efficiency, identify optimal treatments, or maximize rewards to individuals. Recent work has considered the case of estimating the effects of K active treatments, relative to a control arm, in a sequential trial. Several papers have proposed sequential versions of the classical Neyman allocation scheme to assign treatments as individuals arrive, typically with the goal of using Horvitz-Thompson-style estimators to obtain causal estimates at the end of the trial. However, this approach may be inefficient in that it fails to borrow information across the treatment arms. In this paper, we consider adaptivity when the final causal estimation is obtained using a Stein-like shrinkage estimator for heteroscedastic data. Such an estimator shares information across treatment effect estimates, providing provable reductions in expected squared error loss relative to estimating each causal effect in isolation. Moreover, we show that the expected loss of the shrinkage estimator takes the form of a Gaussian quadratic form, allowing it to be computed efficiently using numerical integration. This result paves the way for sequential adaptivity, allowing treatments to be assigned to minimize the shrinker loss. Through simulations, we demonstrate that this approach can yield meaningful reductions in estimation error. We also characterize how our adaptive algorithm assigns treatments differently than would a sequential Neyman allocation.

2602.07378 2026-02-10 cs.LG physics.data-an stat.ML

Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent

Shota Imai, Sota Nishiyama, Masaaki Imaizumi

Comments 40 pages

Details
English abstract

The dynamics of gradient-based training in neural networks often exhibit nontrivial structures; hence, understanding them remains a central challenge in theoretical machine learning. In particular, a concept of feature unlearning, in which a neural network progressively loses previously learned features over long training, has gained attention. In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch stochastic gradient, then derive differential equations with different time scales, revealing the mechanism and conditions for feature unlearning to occur. Specifically, we utilize the fast-slow dynamics: while an alignment of first-layer weights develops rapidly, the second-layer weights develop slowly. The direction of a flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs. We give numerical validation of the result, and derive theoretical grounding and scaling laws of the feature unlearning. Our results yield the following insights: (i) the strength of the primary nonlinear term in data induces the feature unlearning, and (ii) an initial scale of the second-layer weights mitigates the feature unlearning. Technically, our analysis utilizes Tensor Programs and the singular perturbation theory.

2602.07370 2026-02-10 cs.LG cs.CR stat.ML

Privately Learning Decision Lists and a Differentially Private Winnow

Mark Bun, William Fang

Comments 27 pages, The 37th International Conference on Algorithmic Learning Theory

Details
English abstract

We give new differentially private algorithms for the classic problems of learning decision lists and large-margin halfspaces in the PAC and online models. In the PAC model, we give a computationally efficient algorithm for learning decision lists with minimal sample overhead over the best non-private algorithms. In the online model, we give a private analog of the influential Winnow algorithm for learning halfspaces with mistake bound polylogarithmic in the dimension and inverse polynomial in the margin. As an application, we describe how to privately learn decision lists in the online model, qualitatively matching state-of-the-art non-private guarantees.

2602.07258 2026-02-10 cs.LG stat.ME

Robust Ultra-High-Dimensional Variable Selection With Correlated Structure Using Group Testing

Wanru Guo, Juan Xie, Binbin Wang, Weicong Chen, Xiaoyi Lu, Vipin Chaudhary, Curtis Tatsuoka

Comments 57 Pages, 5 Figures, 4 Tables

Details
English abstract

Background: High-dimensional genomic data exhibit strong group correlation structures that challenge conventional feature selection methods, which often assume feature independence or rely on pre-defined pathways and are sensitive to outliers and model misspecification. Methods: We propose the Dorfman screening framework, a multi-stage procedure that forms data-driven variable groups via hierarchical clustering, performs group and within-group hypothesis testing, and refines selection using elastic net or adaptive elastic net. Robust variants incorporate OGK-based covariance estimation, rank-based correlation, and Huber-weighted regression to handle contaminated and non-normal data. Results: In simulations, Dorfman-Sparse-Adaptive-EN performed best under normal conditions, while Robust-OGK-Dorfman-Adaptive-EN showed clear advantages under data contamination, outperforming classical Dorfman and competing methods. Applied to NSCLC gene expression data for trametinib response, robust Dorfman methods achieved the lowest prediction errors and enriched recovery of clinically relevant genes. Conclusions: The Dorfman framework provides an efficient and robust approach to genomic feature selection. Robust-OGK-Dorfman-Adaptive-EN offers strong performance under both ideal and contaminated conditions and scales to ultra-high-dimensional settings, making it well suited for modern genomic biomarker discovery.

2602.07228 2026-02-10 stat.ME

Modelling heavy tail data with Bayesian nonparametric mixtures

Luis E. Nieto-Barajas

Details
English abstract

In the study of heavy tail data, several models have been introduced. If the interest is in the tail of the distribution, block maxima or excess over thresholds are the typical approaches, wasting relevant information in the bulk of the data. To avoid this, two building block mixture models for the body (below the threshold) and the tail (above the threshold) are proposed. In this paper, we exploit the richness of nonparametric mixture models to model heavy tail data. We specifically consider mixtures of shifted gamma-gamma distributions with four parameters and a normalised stable process as a mixing distribution. One of these parameters is associated with the tail. By studying the posterior distribution of the tail parameter, we are able to estimate the proportion of the data that supports a heavy tail component. We develop an efficient MCMC method with adaptive Metropolis-Hastings steps to obtain posterior inference and illustrate with simulated and real datasets.

2602.01931 2026-02-10 stat.AP

Bootstrap-based estimation and inference for measurement precision under ISO 5725

Jun-ichi Takeshita, Kazuhiro Morita, Tomomichi Suzuki

Details
English abstract

The ISO 5725 series frames interlaboratory precision through repeatability, between-laboratory, and reproducibility variances, yet practical guidance on deploying bootstrap methods within this one-way random-effects setting remains limited. We study resampling strategies tailored to ISO 5725 data and extend a bias-correction idea to obtain simple adjusted point estimators and confidence intervals for the variance components. Using extensive simulations that mirror realistic study sizes and variance ratios, we evaluate accuracy, stability, and coverage, and we contrast the resampling-based procedures with ANOVA-based estimators and common approximate intervals. The results yield a clear division of labor: adjusted within-laboratory resampling provides accurate and stable point estimation in small-to-moderate designs, whereas a two-stage strategy (resampling laboratories and then resampling within each), paired with bias-corrected and accelerated intervals, offers the most reliable (near-nominal or conservative) confidence intervals. Performance degrades under extreme designs, such as very small samples or dominant between-laboratory variation, clarifying when additional caution is warranted. A case study from an ISO 5725-4 dataset illustrates how the recommended procedures behave in practice and how they compare with ANOVA and approximate methods. We conclude with concrete guidance for implementing resampling-based precision analysis in interlaboratory studies: use adjusted within-laboratory resampling for point estimation, and adopt the two-stage strategy with bias-corrected and accelerated intervals for interval estimation.
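The two-stage resampling idea can be sketched for a balanced one-way layout as below. This is an illustrative sketch only: the function names and simulated data are hypothetical, the paper's bias adjustment is not reproduced, and a plain percentile interval is used in place of the bias-corrected and accelerated interval the abstract recommends.

```python
import numpy as np

def anova_variances(data):
    """ANOVA estimates for a balanced one-way layout (rows = laboratories)."""
    p, n = data.shape
    lab_means = data.mean(axis=1)
    ms_within = ((data - lab_means[:, None]) ** 2).sum() / (p * (n - 1))
    ms_between = n * ((lab_means - data.mean()) ** 2).sum() / (p - 1)
    s_r2 = ms_within                                # repeatability variance
    s_L2 = max((ms_between - ms_within) / n, 0.0)   # between-laboratory variance
    return s_r2, s_L2, s_r2 + s_L2                  # last entry: reproducibility variance

def two_stage_resample(data, rng):
    """Resample laboratories with replacement, then measurements within each."""
    p, n = data.shape
    labs = rng.integers(0, p, size=p)        # stage 1: which laboratories
    cols = rng.integers(0, n, size=(p, n))   # stage 2: which replicates in each
    return data[labs[:, None], cols]

# Hypothetical study: 10 laboratories, 4 replicates, both true variances equal to 1.
rng = np.random.default_rng(1)
p, n = 10, 4
data = rng.normal(size=(p, 1)) + rng.normal(size=(p, n))
s_r2, s_L2, s_R2 = anova_variances(data)
boot = np.array([anova_variances(two_stage_resample(data, rng))[2]
                 for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])   # percentile interval, not BCa
```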

2602.01777 2026-02-10 cs.LG cs.AI math.ST stat.ML stat.TH

Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions

M. Arashi, M. Amintoosi

Details
English abstract

Stochastic gradient methods are central to large-scale learning, but they treat mini-batch gradients as unbiased estimators, which classical decision theory shows are inadmissible in high dimensions. We formulate gradient computation as a high-dimensional estimation problem and introduce a framework based on Stein-rule shrinkage. We construct a gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging statistics from adaptive optimizers. Under a Gaussian noise model, we show our estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal. We incorporate this into the Adam optimizer, yielding SR-Adam, a practical algorithm with negligible computational cost. Empirical evaluations on CIFAR10 and CIFAR100 across multiple levels of input noise show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled approach to improving stochastic gradient estimation in deep learning.
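The contraction of a noisy gradient toward a momentum-based target can be illustrated with the classical positive-part James-Stein rule. This is a sketch of that textbook rule under an isotropic Gaussian noise assumption, not the SR-Adam update itself; `stein_shrink` and its arguments are hypothetical names.

```python
import numpy as np

def stein_shrink(grad, target, noise_var):
    """Positive-part James-Stein shrinkage of `grad` toward `target`.

    Assumes grad ~ N(true_grad, noise_var * I) with dimension d >= 3.
    """
    d = grad.size
    diff = grad - target
    sq = float(diff @ diff)
    if sq == 0.0:
        return target.copy()
    # Shrink more when the noise is large relative to the observed deviation.
    factor = max(0.0, 1.0 - (d - 2) * noise_var / sq)
    return target + factor * diff

grad = np.array([3.0, 0.0, 0.0, 0.0])   # noisy mini-batch gradient (toy values)
momentum = np.zeros(4)                  # stand-in for the momentum-based target
shrunk = stein_shrink(grad, momentum, noise_var=1.0)
```

When the estimated noise variance dominates the squared distance to the target, the factor clips to zero and the estimator falls back to the target entirely.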

2602.00999 2026-02-10 math.ST math.SP stat.TH

Asymptotic expansions for spectral convergence of compact self-adjoint operators on general spectral subsets, with application to kernel Gram matrices

Eunseong Bae, Wolfgang Polonik

Details
English abstract

We study the spectral convergence of compact, self-adjoint operators on a separable Hilbert space under operator norm perturbations, and derive asymptotic expansions for their eigenvalues and eigenprojections. Our analysis focuses on eigenvalues indexed by a general subset, with minimal restrictions on their selection. The usefulness of the provided expansions is illustrated by an application to kernel Gram matrices, deriving concentration inequalities as well as weak convergence results, which, in contrast to existing literature, rely primarily on assumptions on the kernel that are easy to check.

2601.07061 2026-02-10 stat.ML cs.LG

Local EGOP for Continuous Index Learning

Alex Kokot, Anand Hemmady, Vydhourie Thiyageswaran, Marina Meila

Details
English abstract

We introduce the setting of continuous index learning, in which a function of many variables varies only along a small number of directions at each point. For efficient estimation, it is beneficial for a learning algorithm to adapt, near each point $x$, to the subspace that captures the local variability of the function $f$. We pose this task as kernel adaptation along a manifold with noise, and introduce Local EGOP learning, a recursive algorithm that utilizes the Expected Gradient Outer Product (EGOP) quadratic form as both a metric and inverse-covariance of our target distribution. We prove that Local EGOP learning adapts to the regularity of the function of interest, showing that under a supervised noisy manifold hypothesis, intrinsic dimensional learning rates are achieved for arbitrarily high-dimensional noise. Empirically, we compare our algorithm to the feature learning capabilities of deep learning. Additionally, we demonstrate improved regression quality compared to two-layer neural networks in the continuous single-index setting.

2601.00892 2026-02-10 cs.LG cs.CV physics.data-an stat.ME stat.ML

Hierarchical topological clustering

Ana Carpio, Gema Duro

Comments not peer reviewed, reviewed version to appear in Soft Computing

Journal ref Soft Computing 2026

Details
English abstract

Topological methods have the potential of exploring data clouds without making assumptions on their structure. Here we propose a hierarchical topological clustering algorithm that can be implemented with any distance choice. The persistence of outliers and clusters of arbitrary shape is inferred from the resulting hierarchy. We demonstrate the potential of the algorithm on selected datasets in which outliers play relevant roles, consisting of images, medical and economic data. These methods can provide meaningful clusters in situations in which other techniques fail to do so.

2512.12065 2026-02-10 stat.ME

Meta-analysis of diagnostic test accuracy with multiple disease stages: combining stage-specific and merged-stage data

Efthymia Derezea, Nicky J Welton, Gabriel Rogers, Hayley E Jones

Details
English abstract

For many conditions, it is of clinical importance to know not just the ability of a test to distinguish between those with and without the disease, but also the sensitivity to detect disease at different stages: in particular, the test's ability to detect disease at a stage most amenable to treatment. In a systematic review of test accuracy, pooled stage-specific estimates can be produced using subgroup analysis or meta-regression. However, this requires stage-specific data from each study, which is often not reported. Studies may however report test sensitivity for merged stage categories (e.g. stages I-II) or merged across all stages, together with information on the proportion of patients with disease at each stage. We demonstrate how to incorporate studies reporting merged stage data alongside studies reporting stage-specific data, to allow the inclusion of more studies in the meta-analysis. We consider both meta-analysis of tests with binary results, and meta-analysis of tests with continuous results, where the sensitivity to detect disease of each stage across the whole range of observed thresholds is estimated. The methods are demonstrated using a series of simulated datasets and applied to data from a systematic review of the accuracy of tests used to screen for hepatocellular carcinoma in people with liver cirrhosis. We show that incorporating studies with merged stage data can lead to more precise estimates and, in some cases, corrects biologically implausible results that can arise when the availability of stage-specific data is limited.

2512.11063 2026-02-10 stat.AP

umx version 4.5: Extending Twin and Path-Based SEM in R with CLPM, MR-DoC, Definition Variables, $Ω$nyx Integration, and Censored Distributions

Luis FS Castro-de-Araujo, Nathan Gillespie, Michael C Neale, Timothy Bates

Comments 13 pages, 2 figures

Details
English abstract

Structural Equation Modeling (SEM) is a flexible statistical technique with multiple applications, including behavioral genetics and social sciences. Building on the original design of the umx package, which improved accessibility to OpenMx by specifying a concise syntax, umx v4.5 extends functionality for longitudinal and causal twin designs while improving interoperability with graphical modelling tools such as Ωnyx. New capabilities include: classic and modern cross-lagged panel models; Mendelian Randomization Direction-of-Causation (MR-DoC) twin models incorporating polygenic scores as instruments; support for definition variables directly in umxRAM(); a workflow for importing paths from Ωnyx; a dedicated function for incorporating censored variables' data into models, particularly valuable in biomarker research; improved covariate placeholder handling for definition variables; sex-limitation modelling across five twin groups, accommodating quantitative and qualitative sex differences; and covariate residualization in wide- or long-format data. These new functionalities accelerate reproducible, reliable, publication-ready twin and family modelling, and integrated journal-quality reporting, thereby lowering barriers to genetic epidemiological analyses.

2511.22270 2026-02-10 stat.ML cs.LG

Towards Understanding Generalization in DP-GD: A Case Study in Training Two-Layer CNNs

Zhongjie Shi, Puyu Wang, Chenyang Zhang, Yuan Cao

Details
English abstract

Modern deep learning techniques focus on extracting intricate information from data to achieve accurate predictions. However, the training datasets may be crowdsourced and include sensitive information, such as personal contact details, financial data, and medical records. As a result, there is a growing emphasis on developing privacy-preserving training algorithms for neural networks that maintain good performance while preserving privacy. In this paper, we investigate the generalization and privacy performance of the differentially private gradient descent (DP-GD) algorithm, a private variant of gradient descent (GD) that incorporates additional noise into the gradients during each iteration. Moreover, we identify a concrete learning task where DP-GD can achieve superior generalization performance compared to GD in training two-layer Huberized ReLU convolutional neural networks (CNNs). Specifically, we demonstrate that, under mild conditions, a small signal-to-noise ratio can result in GD producing trained models with poor test accuracy, whereas DP-GD can yield trained models with good test accuracy and privacy guarantees if the signal-to-noise ratio is not too small. This indicates that DP-GD has the potential to enhance model performance while ensuring privacy protection in certain learning tasks. Numerical simulations are further conducted to support our theoretical results.
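The noise-injection step that distinguishes DP-GD from GD can be sketched on a toy problem. This is a minimal sketch that uses linear least squares as a stand-in for the paper's two-layer CNN; the per-example clipping step, the function name `dp_gd`, and all parameter values are assumptions for illustration, and no privacy accounting is performed.

```python
import numpy as np

def dp_gd(X, y, steps, lr, clip, sigma, rng):
    """Full-batch gradient descent on squared loss with clipped, noised gradients."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        residual = X @ w - y
        per_example = residual[:, None] * X   # gradient of 0.5 * (x @ w - y)**2
        norms = np.linalg.norm(per_example, axis=1)
        scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        clipped = per_example * scale[:, None]            # assumed clipping step
        noise = rng.normal(scale=sigma * clip, size=d)    # Gaussian noise on the sum
        w -= lr * (clipped.sum(axis=0) + noise) / n
    return w

# Hypothetical noiseless regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true
w_private = dp_gd(X, y, steps=500, lr=0.1, clip=1.0, sigma=1.0, rng=rng)
w_plain = dp_gd(X, y, steps=500, lr=0.1, clip=1e6, sigma=0.0, rng=rng)
```

Setting `sigma=0.0` with a very large `clip` recovers ordinary gradient descent, which makes the two variants easy to compare on the same data.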

2510.23935 2026-02-10 stat.ML cs.LG

Understanding Fairness and Prediction Error through Subspace Decomposition and Influence Analysis

Enze Shi, Pankaj Bhagwat, Zhixian Yang, Linglong Kong, Bei Jiang

Details
English abstract

Machine learning models have achieved widespread success but often inherit and amplify historical biases, resulting in unfair outcomes. Traditional fairness methods typically impose constraints at the prediction level, without addressing underlying biases in data representations. In this work, we propose a principled framework that adjusts data representations to balance predictive utility and fairness. Using sufficient dimension reduction, we decompose the feature space into target-relevant, sensitive, and shared components, and control the fairness-utility trade-off by selectively removing sensitive information. We provide a theoretical analysis of how prediction error and fairness gaps evolve as shared subspaces are added, and employ influence functions to quantify their effects on the asymptotic behavior of parameter estimates. Experiments on both synthetic and real-world datasets validate our theoretical insights and show that the proposed method effectively improves fairness while preserving predictive performance.

2510.22063 2026-02-10 stat.ML cs.AI cs.LG math.ST stat.TH

Deep Ensembles for Epistemic Uncertainty: A Frequentist Perspective

Anchit Jain, Stephen Bates

Details
English abstract

Decomposing prediction uncertainty into aleatoric (irreducible) and epistemic (reducible) components is critical for the reliable deployment of machine learning systems. While the mutual information between the response variable and model parameters is a principled measure for epistemic uncertainty, it requires access to the parameter posterior, which is computationally challenging to approximate. Consequently, practitioners often rely on probabilistic predictions from deep ensembles to quantify uncertainty, which have demonstrated strong empirical performance. However, a theoretical understanding of their success from a frequentist perspective remains limited. We address this gap by first considering a bootstrap-based estimator for epistemic uncertainty, which we prove is asymptotically correct. Next, we connect deep ensembles to the bootstrap estimator by decomposing it into data variability and training stochasticity; specifically, we show that deep ensembles capture the training stochasticity component. Through empirical studies, we show that this stochasticity component constitutes the majority of epistemic uncertainty, thereby explaining the effectiveness of deep ensembles.

2510.20404 2026-02-10 stat.ME econ.EM math.ST stat.ML stat.TH

Identification and Debiased Learning of Causal Effects with General Instrumental Variables

Shuyuan Chen, Peng Zhang, Yifan Cui

Abstract

Instrumental variable methods are fundamental to causal inference when treatment assignment is confounded by unobserved variables. In this article, we develop a general nonparametric causal framework for identification and learning with multi-categorical or continuous instrumental variables. Specifically, the mean potential outcomes and the average treatment effect can be identified via a regular weighting function derived from the proposed framework. Leveraging semiparametric theory, we derive efficient influence functions and construct two consistent, asymptotically normal estimators via debiased machine learning. The first estimator uses a prespecified weighting function, while the second estimator selects the optimal weighting function adaptively. Extensions to longitudinal data, dynamic treatment regimes, and multiplicative instrumental variables are further developed. We demonstrate the proposed method by employing simulation studies and analyzing real data from the Job Training Partnership Act program.

2510.18161 2026-02-10 stat.ML cs.LG econ.EM

Beating the Winner's Curse via Inference-Aware Policy Optimization

Hamsa Bastani, Osbert Bastani, Bryce McLaughlin

Abstract

There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. In addition, practitioners want confidence that the learned policy has better performance than the incumbent policy according to downstream policy evaluation. However, due to the winner's curse -- an issue where the policy optimization procedure exploits prediction errors rather than finding actual improvements -- predicted performance improvements are often not substantiated by downstream policy evaluation. To address this challenge, we propose a novel strategy called inference-aware policy optimization, which modifies policy optimization to account for how the policy will be evaluated downstream. Specifically, it optimizes not only for the estimated objective value, but also for the chances that the estimate of the policy's improvement passes a significance test during downstream policy evaluation. We mathematically characterize the Pareto frontier of policies according to the tradeoff of these two goals. Based on our characterization, we design a policy optimization algorithm that estimates the Pareto frontier using machine learning models; then, the decision-maker can select the policy that optimizes their desired tradeoff, after which policy evaluation can be performed on the test set as usual. Finally, we perform simulations to illustrate the effectiveness of our methodology.
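The winner's curse described above can be reproduced with a small simulation. In this sketch, every candidate policy has a true improvement of exactly zero, yet picking the candidate with the best noisy estimate produces a clearly positive estimated improvement. The number of candidates, the noise level, and the variable names are assumptions chosen for illustration; this is not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trials = 2000    # independent repetitions of the selection procedure
n_policies = 50    # candidate policies compared in each repetition
noise_sd = 1.0     # standard deviation of the estimation error

# Every policy has true improvement 0; the estimates are pure noise.
estimates = rng.normal(0.0, noise_sd, size=(n_trials, n_policies))

# Naive policy optimization: report the estimate of the best-looking policy.
selected_estimate = estimates.max(axis=1)

mean_selected = selected_estimate.mean()
print(f"true improvement: 0.0, mean estimate of selected policy: {mean_selected:.2f}")
```

A fresh evaluation of the selected policy on held-out data would again be centered at zero, which is the gap between predicted and substantiated improvement that the abstract refers to.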

2510.09891 2026-02-10 cs.LG cs.AI physics.ao-ph stat.ML

Probabilistic bias adjustment of seasonal predictions of Arctic Sea Ice Concentration

Parsa Gooya, Reinel Sospedra-Alfonso

Abstract

Seasonal forecasting of Arctic sea ice concentration is key to mitigating the negative impacts and assessing the potential opportunities posed by the rapid decline of sea ice coverage. Seasonal prediction systems based on climate models often show systematic biases and complex spatio-temporal errors that grow with the forecasts. Consequently, operational predictions are routinely bias corrected and calibrated using retrospective forecasts. For predictions of Arctic sea ice concentration, error corrections are mainly based on one-to-one post-processing methods including climatological mean or linear regression correction and, more recently, machine learning. Such deterministic adjustments are confined at best to the limited number of costly-to-run ensemble members of the raw forecast. However, decision-making requires proper quantification of uncertainty and likelihood of events, particularly of extremes. We introduce a probabilistic error correction framework based on a conditional Variational Autoencoder model to map the conditional distribution of observations given the biased model prediction. This method naturally allows for generating large ensembles of adjusted forecasts. We evaluate our model using deterministic and probabilistic metrics and show that the adjusted forecasts are better calibrated, closer to the observational distribution, and have smaller errors than climatological mean adjusted forecasts.

2509.14218 2026-02-10 stat.ME math.ST stat.ML stat.OT stat.TH

Adaptive Off-Policy Inference for M-Estimators Under Model Misspecification

James Leiner, Robin Dunn, Aaditya Ramdas

Comments 43 pages, 6 figures

Abstract

When data are collected adaptively, such as in bandit algorithms, classical statistical approaches such as ordinary least squares and $M$-estimation will often fail to achieve asymptotic normality. Although recent lines of work have modified the classical approaches to ensure valid inference on adaptively collected data, most of these works assume that the model is correctly specified. The misspecified setting poses unique challenges because the parameter of interest itself may not be well-defined over a non-stationary distribution of rewards. We therefore tackle the problem of off-policy inference in adaptive settings, where we uniquely define a projected solution over a stationary evaluation policy. Our method provides valid inference for $M$-estimators that use adaptively collected bandit data with a possibly misspecified working model. A key ingredient in our approach is the use of flexible approaches to stabilize the variance induced by adaptive data collection. A major novelty is that the procedure enables the construction of valid confidence sets even in settings where treatment policies are unstable and non-converging, such as when there is no unique optimal arm and standard bandit algorithms are used. Empirical results on semi-synthetic datasets constructed from the Osteoarthritis Initiative demonstrate that the method maintains type I error control, while existing methods for inference in adaptive settings do not cover in the misspecified case.

2508.16817 2026-02-10 math.OC cs.LG cs.SY eess.SY math.DS stat.ML

Predictability Enables Parallelization of Nonlinear State Space Models

Xavier Gonzalez, Leo Kozachkov, David M. Zoltowski, Kenneth L. Clarkson, Scott W. Linderman

Comments NeurIPS '25. XG and LK dual lead authors. Code: https://github.com/lindermanlab/predictability_enables_parallelization

Abstract

The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) and DeepPCR (arXiv:2309.16318) recast sequential evaluation as a parallelizable optimization problem, sometimes yielding dramatic speedups. However, the factors governing the difficulty of these optimization problems remained unclear, limiting broader adoption. In this work, we establish a precise relationship between a system's dynamics and the conditioning of its corresponding optimization problem, as measured by its Polyak-Lojasiewicz (PL) constant. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior and quantified by the largest Lyapunov exponent (LLE), impacts the number of optimization steps required for evaluation. For predictable systems, the state trajectory can be computed in at worst $O((\log T)^2)$ time, where $T$ is the sequence length: a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis shows that predictable systems always yield well-conditioned optimization problems, whereas unpredictable systems lead to severe conditioning degradation. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized. We highlight predictability as a key design principle for parallelizable models.
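To make the notion of predictability concrete, the following sketch estimates the largest Lyapunov exponent (LLE) of a one-dimensional map as the trajectory average of log |f'(x_t)|. A negative value indicates a predictable (contracting) system and a positive value a chaotic one. The two example maps, the step counts, and the function names are assumptions chosen for illustration and are not taken from the paper or its code.

```python
import math

def largest_lyapunov_exponent(step, derivative, x0, n_steps=5000, burn_in=500):
    """Estimate the LLE of a 1-D map x_{t+1} = step(x_t)."""
    x = x0
    for _ in range(burn_in):  # discard the initial transient
        x = step(x)
    total = 0.0
    for _ in range(n_steps):
        total += math.log(abs(derivative(x)))
        x = step(x)
    return total / n_steps

# Predictable example: a contractive tanh recurrence (|f'| <= 0.5 everywhere).
lle_contractive = largest_lyapunov_exponent(
    step=lambda x: 0.5 * math.tanh(x) + 0.1,
    derivative=lambda x: 0.5 / math.cosh(x) ** 2,
    x0=0.3,
)

# Unpredictable example: the logistic map at r = 3.9, a chaotic regime.
r = 3.9
lle_chaotic = largest_lyapunov_exponent(
    step=lambda x: r * x * (1.0 - x),
    derivative=lambda x: r * (1.0 - 2.0 * x),
    x0=0.2,
)

print(f"contractive map LLE: {lle_contractive:.3f}")
print(f"logistic map LLE:    {lle_chaotic:.3f}")
```

According to the abstract, systems of the first kind are the ones for which parallel evaluation is well conditioned, while systems of the second kind converge too slowly to benefit.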

2508.01116 2026-02-10 quant-ph cs.AI cs.LG stat.ML

TensorHyper-VQC: A Tensor-Train-Guided Hypernetwork for Robust and Scalable Variational Quantum Computing

Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh

Comments The paper has been accepted by npj Quantum Information and will be published in February 2026

Abstract

Variational Quantum Computing (VQC) faces fundamental scalability barriers, primarily due to barren plateaus and sensitivity to quantum noise. To address these challenges, we introduce TensorHyper-VQC, a novel tensor-train (TT)-guided hypernetwork framework that significantly improves the robustness and scalability of VQC. Our framework fully delegates the generation of quantum-circuit parameters to a classical TT network, thereby decoupling optimization from quantum hardware. This innovative parameterization mitigates gradient vanishing, enhances noise resilience through structured low-rank representations, and facilitates efficient gradient propagation. Grounded in Neural Tangent Kernel and statistical learning theory, our rigorous theoretical analyses establish strong guarantees on approximation capability, optimization stability, and generalization performance. Extensive empirical results across quantum dot classification, Max-Cut optimization, and molecular quantum simulation tasks demonstrate that TensorHyper-VQC consistently achieves superior performance and robust noise tolerance, including hardware-level validation on a 156-qubit IBM Heron processor. These results position TensorHyper-VQC as a scalable and noise-resilient framework for advancing practical quantum machine learning on near-term devices.

2506.16289 2026-02-10 stat.ML cs.LG

The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units

Oswaldo Ludwig

Comments This version includes a new experiment using a larger LLM and introducing KappaTune-LoRA

Abstract

This paper explores the relationship between the condition number of a neural network's weight tensor and the extent of information encoded by the associated processing unit, viewed through the lens of information theory. It argues that a high condition number, though not sufficient for effective knowledge encoding, may indicate that the unit has learned to selectively amplify and compress information. This intuition is formalized for linear units with Gaussian inputs, linking the condition number and the transformation's log-volume scaling factor to the characteristics of the output entropy and the geometric properties of the learned transformation. The analysis demonstrates that for a fixed weight norm, a concentrated distribution of singular values (high condition number) corresponds to reduced overall information transfer, indicating a specialized and efficient encoding strategy. Furthermore, the linear stage entropy bound provides an upper limit on post-activation information for contractive, element-wise nonlinearities, supporting the condition number as a scale-invariant proxy for encoding capacity in practical neural networks. An empirical case study applies these principles to guide selective fine-tuning of Large Language Models for both a new task and a new input modality. The experiments show that the proposed method, named KappaTune, effectively mitigates catastrophic forgetting. Unlike many existing catastrophic forgetting mitigation methods that rely on access to pre-training statistics, which are often unavailable, this selective fine-tuning approach offers a way to bypass this common requirement.
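The scale invariance mentioned above is easy to check numerically: multiplying a weight matrix by a constant changes its norm but not the ratio of its largest to smallest singular value. The sketch below uses a random matrix as a stand-in for a layer's weights; the helper name and the matrix shape are assumptions for the example, and it does not reproduce the KappaTune selection rule.

```python
import numpy as np

def condition_number(W):
    """Ratio of the largest to the smallest singular value of W."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, sorted descending
    return s[0] / s[-1]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))  # hypothetical weight matrix of a linear unit

kappa = condition_number(W)
kappa_scaled = condition_number(10.0 * W)  # same matrix, rescaled

print(f"kappa(W) = {kappa:.4f}, kappa(10 W) = {kappa_scaled:.4f}")
print(f"Frobenius norms: {np.linalg.norm(W):.2f} vs {np.linalg.norm(10.0 * W):.2f}")
```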

2506.12751 2026-02-10 stat.ML cs.LG

Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions

Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, Yao Li

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.

2505.22984 2026-02-10 cs.LG cs.CY stat.ME

A Computational Approach to Improving Fairness in K-means Clustering

Guancheng Zhou, Haiping Xu, Hongkang Xu, Chenyu Li, Donghui Yan

Comments 14 pages, 5 figures

Abstract

The popular K-means clustering algorithm potentially suffers from a major weakness for further analysis or interpretation. Some clusters may have disproportionately more (or fewer) points from one of the subpopulations in terms of some sensitive variable, e.g., gender or race. Such a fairness issue may cause bias and unexpected social consequences. This work attempts to improve the fairness of K-means clustering with a two-stage optimization formulation: clustering first and then adjusting the cluster membership of a small subset of selected data points. Two computationally efficient algorithms are proposed for identifying those data points that are expensive for fairness, with one focusing on nearest data points outside of a cluster and the other on highly 'mixed' data points. Experiments on benchmark datasets show substantial improvement in fairness with a minimal impact on clustering quality. The proposed algorithms can be easily extended to a broad class of clustering algorithms or fairness metrics.

2505.12599 2026-02-10 math.OC cs.LG stat.CO

Accelerated Markov Chain Monte Carlo Algorithms on Discrete States

Bohan Zhou, Shu Liu, Xinzhe Zuo, Wuchen Li

Abstract

We propose a class of discrete state sampling algorithms based on Nesterov's accelerated gradient method, which extends the classical Metropolis-Hastings (MH) algorithm. The evolution of the discrete states probability distribution governed by MH can be interpreted as a gradient descent direction of the Kullback--Leibler (KL) divergence, via a mobility function and a score function. Specifically, this gradient is defined on a probability simplex equipped with a discrete Wasserstein-2 metric with a mobility function. This motivates us to study a momentum-based acceleration framework using damped Hamiltonian flows on the simplex set, whose stationary distribution matches the discrete target distribution. Furthermore, we design an interacting particle system to approximate the proposed accelerated sampling dynamics. The extension of the algorithm with a general choice of potentials and mobilities is also discussed. In particular, we choose the accelerated gradient flow of the relative Fisher information, demonstrating the advantages of the algorithm in estimating discrete score functions without requiring the normalizing constant and keeping positive probabilities. Numerical examples, including sampling on a Gaussian mixture supported on lattices or a distribution on a hypercube, demonstrate the effectiveness of the proposed discrete-state sampling algorithm.
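For reference, the classical Metropolis-Hastings baseline that the paper extends can be written as an explicit transition matrix on a small discrete state space. The sketch below builds that matrix for a ring of states with a symmetric nearest-neighbour proposal and checks that the target distribution is stationary. The ring topology, the target vector, and the function name are assumptions for the example; the accelerated dynamics proposed in the paper are not implemented here.

```python
import numpy as np

def mh_transition_matrix(pi):
    """Metropolis-Hastings kernel on a ring of len(pi) >= 3 states.

    Proposal: move to the left or right neighbour with probability 1/2 each.
    Acceptance: min(1, pi[j] / pi[i]); rejected moves stay at the current state.
    """
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):
            P[i, j] += 0.5 * min(1.0, pi[j] / pi[i])
        P[i, i] = 1.0 - P[i].sum()  # remaining mass: rejected proposals
    return P

pi = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical target distribution
P = mh_transition_matrix(pi)

# Stationarity: one step of the chain leaves pi unchanged.
print(np.allclose(pi @ P, pi))

# Convergence: a point mass at state 0 evolves towards pi.
mu = np.array([1.0, 0.0, 0.0, 0.0]) @ np.linalg.matrix_power(P, 500)
print(mu)
```

The row vector `mu` evolves by repeated multiplication with `P`, which is the discrete-time counterpart of the distribution evolution that the abstract interprets as a gradient descent direction of the KL divergence.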

2505.11918 2026-02-10 cs.LG stat.ML

Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures

Zhiheng Chen, Ruofan Wu, Guanhua Fang

Comments Code available at https://github.com/Rorschach1989/transformer-for-gmm

Abstract

The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the understanding of pre-trained large language models. However, most recent works have focused on supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem, through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, while exhibiting reasonable robustness to distribution shifts. Theoretically, we prove that transformers can approximate both the EM algorithm and a core component of spectral methods (cubic tensor power iterations). These results bridge the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.

2504.11304 2026-02-10 stat.ML cs.LG

Differentially Private Geodesic Regression

Aditya Kulkarni, Carlos Soto

Comments 20 pages, 10 figures

Abstract

In statistical applications it has become increasingly common to encounter data structures that live on non-linear spaces such as manifolds. Classical linear regression, one of the most fundamental methodologies of statistical learning, captures the relationship between an independent variable and a response variable, both of which are assumed to live in Euclidean space. Thus, geodesic regression emerged as an extension where the response variable lives on a Riemannian manifold. The parameters of geodesic regression, as with linear regression, capture the relationship of sensitive data and hence one should consider the privacy protection practices of said parameters. We consider releasing Differentially Private (DP) parameters of geodesic regression via the K-Norm Gradient (KNG) mechanism for Riemannian manifolds. We derive theoretical bounds for the sensitivity of the parameters showing they are tied to their respective Jacobi fields and hence the curvature of the space. This corroborates, and extends, recent findings of differential privacy for the Fréchet mean. We demonstrate the efficacy of our methodology on the sphere, $S^2\subset\mathbb{R}^3$, the space of symmetric positive definite matrices, and Kendall's planar shape space. Our methodology is general to any Riemannian manifold, and thus it is suitable for data in domains such as medical imaging and computer vision.

2504.05489 2026-02-10 stat.ME econ.EM stat.AP

Bayesian Shrinkage in High-Dimensional VAR Models: A Comparative Study

Harrison Katz, Robert E. Weiss

Journal ref International Journal of Statistics and Probability 14(3) (2025), 1-22

Abstract

High-dimensional vector autoregressive (VAR) models offer a versatile framework for multivariate time series analysis, yet face critical challenges from over-parameterization and uncertain lag order. In this paper, we systematically compare three Bayesian shrinkage priors (horseshoe, lasso, and normal) and two frequentist regularization approaches (ridge and nonparametric shrinkage) under three carefully crafted simulation scenarios. These scenarios encompass (i) overfitting in a low-dimensional setting, (ii) sparse high-dimensional processes, and (iii) a combined scenario where both large dimension and overfitting complicate inference. We evaluate each method in quality of parameter estimation (root mean squared error, coverage, and interval length) and out-of-sample forecasting (one-step-ahead forecast RMSE). Our findings show that local-global Bayesian methods, particularly the horseshoe, dominate in maintaining accurate coverage and minimizing parameter error, even when the model is heavily over-parameterized. Frequentist ridge often yields competitive point forecasts but underestimates uncertainty, leading to sub-nominal coverage. A real-data application using macroeconomic variables from Canada illustrates how these methods perform in practice, reinforcing the advantages of local-global priors in stabilizing inference when dimension or lag order is inflated.

2504.03480 2026-02-10 stat.ME

Multivariate Causal Effects: a Bayesian Causal Regression Factor Model

Dafne Zorzetto, Jenna Landy, Corwin Zigler, Giovanni Parmigiani, Roberta De Vito

Abstract

The impact of wildfire smoke on air quality is a growing concern, contributing to air pollution through a complex mixture of chemical species with important implications for public health. While previous studies have primarily focused on its association with total particulate matter (PM2.5), the causal relationship between wildfire smoke and the chemical composition of PM2.5 remains largely unexplored. Exposure to these chemical mixtures plays a critical role in shaping public health, yet capturing their relationships requires advanced statistical methods capable of modeling the complex dependencies among chemical species. To fill this gap, we propose a Bayesian causal regression factor model that estimates the multivariate causal effects of wildfire smoke on the concentration of 27 chemical species in PM2.5 across the United States. Our approach introduces two key innovations: (i) a causal inference framework for multivariate potential outcomes, and (ii) a novel Bayesian factor model that employs a probit stick-breaking process as prior for treatment-specific factor scores. By focusing on factor scores, our method addresses the missing data challenge common in causal inference and enables a flexible, data-driven characterization of the latent factor structure, which is crucial to capture the complex correlation among multivariate outcomes. Through Monte Carlo simulations, we show the model's accuracy in estimating the causal effects in multivariate outcomes and characterizing the treatment-specific latent structure. Finally, we apply our method to US air quality data, estimating the causal effect of wildfire smoke on 27 chemical species in PM2.5, providing a deeper understanding of their interdependencies.

2503.16028 2026-02-10 math.NA cs.NA math.ST stat.TH

Sequential Monte Carlo with Gaussian Mixture Approximation for Infinite-Dimensional Statistical Inverse Problems

Haoyu Lu, Junxiong Jia, Deyu Meng

Comments 41 pages

Abstract

By formulating the inverse problem of partial differential equations (PDEs) as a statistical inference problem, the Bayesian approach provides a general framework for quantifying uncertainties. In the inverse problem of PDEs, parameters are defined on an infinite-dimensional function space, and the PDEs induce a computationally intensive likelihood function. Additionally, sparse data tends to lead to a multi-modal posterior. These features make it difficult to apply existing sequential Monte Carlo (SMC) algorithms. To overcome these difficulties, we propose new conditions for the likelihood functions, construct a Gaussian mixture based preconditioned Crank-Nicolson transition kernel, and demonstrate the universal approximation property of the infinite-dimensional Gaussian mixture probability measure. By combining these three novel tools, we propose a new SMC algorithm with Gaussian mixture approximation, together with an easy-to-use reduced version. For this new algorithm, we obtain a convergence theorem that allows Gaussian priors, illustrating that the sequential particle filter actually reproduces the true posterior distribution. Furthermore, the proposed new algorithm is rigorously defined on the infinite-dimensional function space, naturally exhibiting the discretization-invariant property. Numerical experiments demonstrate that the reduced version has a strong ability to probe the multi-modality of the posterior, significantly reduces the computational burden, and numerically exhibits the discretization-invariant property (important for large-scale problems).

2503.00307 2026-02-10 cs.LG stat.ML

Remasking Discrete Diffusion Models with Inference-Time Scaling

Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov

Comments NeurIPS 2025. Project page: https://remdm.github.io

Abstract

Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: https://guanghanwang.com/remdm

2502.11510 2026-02-10 stat.OT stat.CO

Here Be Dragons: Bimodal posteriors arise from numerical integration error in longitudinal models

Tess O'Brien, Matthew T. Moores, David Warton, Daniel Falster

Comments 33 pages, 7 figures, 2 tables

Abstract

Longitudinal models with dynamics governed by differential equations may require numerical integration alongside parameter estimation. We have identified a situation where the numerical integration introduces error in such a way that it becomes a novel source of non-uniqueness in estimation. We obtain two very different sets of parameters, one of which is a good estimate of the true values and the other a very poor one. The two estimates have forward numerical projections statistically indistinguishable from each other because of numerical error. In such cases, the posterior distribution for parameters is bimodal, with a dominant mode closer to the true parameter value, and a second cluster around the errant value. We demonstrate that multi-modality exists both theoretically and empirically for an affine first order differential equation, that a simulation workflow can test for evidence of the issue more generally, and that Markov Chain Monte Carlo sampling with a suitable solution can avoid bimodality. The issue of multi-modal posteriors arising from numerical error has consequences for Bayesian inverse methods that rely on numerical integration more broadly.

2502.00380 2026-02-10 cs.LG stat.ML

CoHiRF: Hierarchical Consensus for Interpretable Clustering Beyond Scalability Limits

Katia Meziani, Bruno Belucci, Karim Lounici, Vladimir R. Kostic

Abstract

We introduce CoHiRF (Consensus Hierarchical Random Features), a hierarchical consensus framework that enables existing clustering methods to operate beyond their usual computational and memory limits. CoHiRF is a meta-algorithm that operates exclusively on the label assignments produced by a base clustering method, without modifying its objective function, optimization procedure, or geometric assumptions. It repeatedly applies the base method to multiple low-dimensional feature views or stochastic realizations, enforces agreement through consensus, and progressively reduces the problem size via representative-based contraction. Across a diverse set of synthetic and real-world experiments involving centroid-based, kernel-based, density-based, and graph-based methods, we show that CoHiRF can improve robustness to high-dimensional noise, enhance stability under stochastic variability, and enable scalability to regimes where the base method alone is infeasible. We also provide an empirical characterization of when hierarchical consensus is beneficial, highlighting the role of reproducible label relations and their compatibility with representative-based contraction. Beyond flat partitions, CoHiRF produces an explicit Cluster Fusion Hierarchy, offering a multi-resolution and interpretable view of the clustering structure. Together, these results position hierarchical consensus as a practical and flexible tool for large-scale clustering, extending the applicability of existing methods without altering their underlying behavior.

2412.14423 2026-02-10 stat.ME math.ST stat.TH

Cross-Validation with Antithetic Gaussian Randomization

Sifan Liu, Snigdha Panigrahi, Jake A. Soloff

Abstract

We introduce a new cross-validation method based on an equicorrelated Gaussian randomization scheme. Our method is well-suited for problems where sample splitting is infeasible, either because the data violate the assumption of independent and identically distributed samples, or because there are insufficient samples to form representative train-test data pairs. In such problems, our method provides a simple, principled, and computationally efficient approach to estimating prediction error, often outperforming standard cross-validation while requiring only a small number of repetitions. Drawing inspiration from recent splitting techniques like data fission and data thinning, our method constructs train-test data pairs using Gaussian randomization. Our main contribution is the introduction of an antithetic Gaussian randomization scheme, involving a carefully designed correlation structure among the randomization variables. We show theoretically that this antithetic construction can eliminate the bias of cross-validation for a broad class of smooth prediction functions, without inflating variance. Through simulations across a range of data types and loss functions, we demonstrate that our estimator outperforms existing methods for prediction error estimation.
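One simple way to obtain Gaussian randomization variables with an antithetic, equicorrelated structure is to center K independent Gaussian draws and rescale them, which gives unit variance, pairwise correlation -1/(K-1), and draws that sum to zero. The sketch below shows this construction. The function name is hypothetical, and the train-test pairing in the final lines follows a data-fission-style recipe with a tuning parameter `alpha`; both are assumptions for illustration and are not stated in the abstract.

```python
import numpy as np

def antithetic_gaussian_noise(n, K, rng):
    """Return K noise vectors of length n with unit variance,
    pairwise correlation -1/(K-1), and zero sum across the K draws."""
    z = rng.standard_normal((K, n))
    centered = z - z.mean(axis=0, keepdims=True)  # forces the K draws to sum to 0
    return centered * np.sqrt(K / (K - 1))        # restores unit variance

rng = np.random.default_rng(0)
K, n = 4, 1000
omega = antithetic_gaussian_noise(n, K, rng)

print(np.abs(omega.sum(axis=0)).max())  # numerically zero
print(np.corrcoef(omega)[0, 1])         # close to -1/(K-1)

# Assumed data-fission-style pairing for a response vector y with unit noise:
y = rng.standard_normal(n)
alpha = 0.1
train_sets = y + np.sqrt(alpha) * omega  # K randomized training responses
test_sets = y - omega / np.sqrt(alpha)   # K matching test responses
```

Because the K noise vectors cancel exactly, averaging any linear functional of the randomized training responses over the K repetitions recovers its value on the original data.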

2411.01732 2026-02-10 stat.ME

Alignment and matching tests for high-dimensional tensor signals via tensor contraction

Ruihan Liu, Zhenggang Wang, Jianfeng Yao

Abstract

We consider two hypothesis testing problems for low-rank and high-dimensional tensor signals, namely the tensor signal alignment and tensor signal matching problems. These problems are challenging due to the high dimension of tensors and the lack of suitable test statistics. By exploiting a recent tensor contraction method, we propose and validate relevant test statistics using eigenvalues of a data matrix resulting from the tensor contraction. The matrix entries exhibit long-range dependence, which makes the analysis of the matrix challenging, involved, and distinct from standard random matrix theory. Our approach provides a novel framework for addressing hypothesis testing problems in the context of high-dimensional tensor signals.

2410.16391 2026-02-10 stat.ME

Causal Data Fusion for Panel Data without a Pre-Intervention Period

Zou Yang, Seung Hee Lee, Julia R. Köhler, AmirEmad Ghassami

Abstract

Traditional panel-data causal inference frameworks, such as difference-in-differences and synthetic control methods, rely on pre-intervention data to estimate counterfactual means. However, such data may be unavailable in real-world settings when interventions are implemented in response to sudden events, such as public health crises or epidemiological shocks. In this paper, we introduce two data-fusion methods for causal inference from panel data in scenarios where pre-intervention data are unavailable. These methods leverage auxiliary reference domains with related panel data to estimate causal effects in the target domain, thereby overcoming the limitations imposed by the absence of pre-intervention data. We demonstrate the efficacy of these methods by deriving bounds on the absolute bias that converge to zero under suitable conditions, as well as through simulations across a variety of panel-data settings. Our proposed methodology renders causal inference feasible in urgent and data-constrained environments where the assumptions of existing causal inference frameworks are not met. As an application of our methodology, we evaluate the effect of a community organization vaccination intervention in Chelsea, Massachusetts on COVID-19 vaccination rates.

2408.12136 2026-02-10 cs.LG stat.ML

Provable Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Weiqin Chen, Xinjie Zhang, Sandipan Mishra, Santiago Paternain

Details
English abstract

Offline reinforcement learning (RL) learns effective policies from a static target dataset. Despite the strong performance of state-of-the-art offline RL algorithms, that performance depends on the size of the target dataset and degrades when only limited samples are available, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. However, establishing the optimal way to trade off the limited target dataset and the large-but-biased source dataset while ensuring provable theoretical guarantees remains an open challenge. To the best of our knowledge, this paper proposes the first framework that theoretically explores the impact of the weights assigned to each dataset on the performance of offline RL. In particular, we establish performance bounds and the existence of the optimal weight, which can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees in terms of convergence to a neighborhood of the optimum. Notably, these results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known Procgen and MuJoCo benchmarks substantiate the theoretical contributions in this work.

2408.06525 2026-02-10 stat.ML cs.LG

Note on computational complexity of the Gromov-Wasserstein distance

Natalia Kravtsova

Comments This is an updated version of the note previously titled "The NP-hardness of the Gromov-Wasserstein distance." This version corrects deficiencies in the previous version

Details
English abstract

This note addresses the computational difficulty of the Gromov-Wasserstein distance frequently mentioned in the literature. We provide details on the structure of the Gromov-Wasserstein distance optimization problem that show its non-convex quadratic nature for any instance of the input data. We further illustrate the non-convexity of the problem with several explicit examples.

2406.19958 2026-02-10 stat.ML cs.LG math.ST stat.TH

On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis

Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

Details
English abstract

Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, comprising guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. However, the computational properties of the widely used BART sampler proposed by Chipman et al. (2010) are not yet well understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when fitted to data-generating processes with discrete covariates. We show that the sampler's time to convergence, evaluated in terms of the hitting time of a high posterior density set, increases with the number of training samples, due to the multi-modal nature of the target posterior. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture of the computational efficiency of the BART sampler in the presence of large amounts of training data while suggesting strategies to improve the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increasing trend of convergence time against the number of training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. On the other hand, increasing the number of trees or raising the temperature sharply dampens this trend, as indicated by our theory.

2406.03628 2026-02-10 stat.ML cs.LG

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

Comments 92 pages, 9 figures

Details
English abstract

Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would compromise the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics in synthetic data augmentation, and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of the LLM-based synthetic oversampling and augmentation.

2405.14104 2026-02-10 econ.EM stat.ME

On the Identifying Power of Generalized Monotonicity for Average Treatment Effects

Yuehao Bai, Shunzhuang Huang, Sarah Moon, Azeem M. Shaikh, Edward J. Vytlacil

Details
English abstract

In the context of a binary outcome, treatment, and instrument, Balke and Pearl (1993, 1997) establish that the monotonicity condition of Imbens and Angrist (1994) has no identifying power beyond instrument exogeneity for average potential outcomes and average treatment effects in the sense that adding it to instrument exogeneity does not decrease the identified sets for those parameters whenever those restrictions are consistent with the distribution of the observable data. This paper shows that this phenomenon holds in a broader setting with a multi-valued outcome, treatment, and instrument, under an extension of the monotonicity condition that we refer to as generalized monotonicity. We further show that this phenomenon holds for any restriction on treatment response that is stronger than generalized monotonicity provided that these stronger restrictions do not restrict potential outcomes. Importantly, many models of potential treatments previously considered in the literature imply generalized monotonicity, including the types of monotonicity restrictions considered by Kline and Walters (2016), Kirkeboen et al. (2016), and Heckman and Pinto (2018), and the restriction that treatment selection is determined by particular classes of additive random utility models. We show through a series of examples that restrictions on potential treatments can provide identifying power beyond instrument exogeneity for average potential outcomes and average treatment effects when the restrictions imply that the generalized monotonicity condition is violated. In this way, our results shed light on the types of restrictions required for help in identifying average potential outcomes and average treatment effects.

2405.09676 2026-02-10 math.ST math.OC stat.ML stat.TH

The radius of statistical efficiency

Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy

Comments 44 pages. v2: corrections and improvements to the exposition throughout

Journal ref Foundations of Computational Mathematics, 2026

Details
English abstract

Classical results in asymptotic statistics show that the Fisher information matrix controls the difficulty of estimating a statistical model from observed data. In this work, we introduce a companion measure of robustness of an estimation problem: the radius of statistical efficiency (RSE) is the size of the smallest perturbation to the problem data that renders the Fisher information matrix singular. We compute RSE up to numerical constants for a variety of testbed problems, including principal component analysis, generalized linear models, phase retrieval, bilinear sensing, and matrix completion. Interestingly, we observe a precise reciprocal relationship between RSE and the intrinsic complexity/sensitivity of the problem instance, paralleling the classical Eckart-Young theorem in numerical analysis. To establish our results, we develop theory for spectral functions of measures that extends well-known results from matrix analysis and eigenvalue optimization, a contribution that may be of interest beyond our immediate findings.
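
The classical Eckart-Young theorem that this abstract uses as its point of comparison can be illustrated numerically. The sketch below is not the paper's RSE computation; it only shows the matrix-analysis fact that, in the spectral norm, the distance from an invertible matrix to the nearest singular matrix equals its smallest singular value. The function names and the example matrix are assumptions made for illustration.

```python
import numpy as np

def distance_to_singularity(a):
    """Spectral-norm distance from a square matrix to the nearest singular matrix.

    By the Eckart-Young theorem this equals the smallest singular value.
    """
    return np.linalg.svd(a, compute_uv=False)[-1]

def nearest_singular_perturbation(a):
    """A minimal-norm perturbation `delta` such that `a + delta` is singular."""
    u, s, vt = np.linalg.svd(a)
    # Cancel the rank-one component belonging to the smallest singular value.
    return -s[-1] * np.outer(u[:, -1], vt[-1, :])

# Illustrative matrix (hypothetical): singular values are 2.0 and 0.5.
a = np.array([[2.0, 0.0],
              [0.0, 0.5]])
delta = nearest_singular_perturbation(a)
radius = distance_to_singularity(a)
```

Here `radius` is the smallest singular value of `a`, and `a + delta` has a zero singular value. A badly conditioned problem (small smallest singular value) therefore sits close to a singular one, which is the kind of reciprocal relationship the abstract alludes to.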

2403.02060 2026-02-10 stat.ME

Expectile Periodograms

Tianbo Chen, Ta-Hsin Li, Hanbing Zhu, Wenwu Gao

Journal ref Computational Statistics & Data Analysis, 217 (2026), 108337

Details
English abstract

This paper introduces a novel periodogram-like function, called the expectile periodogram, for modeling spectral features of time series and detecting hidden periodicities. The expectile periodogram is constructed from trigonometric expectile regression, in which a specially designed check function is used to substitute the squared $l_2$ norm that leads to the ordinary periodogram. The expectile periodogram retains the key properties of the ordinary periodogram as a frequency-domain representation of serial dependence in time series, while offering a more comprehensive understanding by examining the data across the entire range of expectile levels. We establish the asymptotic theory and investigate the relationship between the expectile periodogram and the so-called expectile spectrum. Simulations demonstrate the efficiency of the expectile periodogram in the presence of hidden periodicities. Finally, by leveraging the inherent two-dimensional nature of the expectile periodogram, we train a deep learning (DL) model to classify earthquake waveform data. Remarkably, our approach outperforms alternative periodogram-based methods in terms of classification accuracy.
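
A minimal sketch of the construction described in this abstract, under stated assumptions: the check function is taken to be the standard asymmetric squared loss $|\tau - 1\{u<0\}|\,u^2$, the regression is fitted by iteratively reweighted least squares, and the periodogram is normalized as $\frac{n}{4}(a^2+b^2)$ for the fitted cosine and sine coefficients. The normalization and all function names are assumptions, not taken from the paper. With $\tau = 0.5$ the loss is symmetric, so the sketch reduces to the ordinary periodogram at Fourier frequencies.

```python
import numpy as np

def expectile_regression(x, y, tau, max_iter=100):
    """Fit linear expectile regression by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(x, y, rcond=None)[0]
    for _ in range(max_iter):
        # Asymmetric weights: tau for non-negative residuals, 1 - tau otherwise.
        w = np.where(y - x @ beta >= 0, tau, 1.0 - tau)
        xw = x * w[:, None]
        new_beta = np.linalg.solve(x.T @ xw, xw.T @ y)
        if np.allclose(new_beta, beta, rtol=0.0, atol=1e-10):
            beta = new_beta
            break
        beta = new_beta
    return beta

def expectile_periodogram(y, freqs, tau):
    """Assumed normalization: n/4 times the squared norm of the (cos, sin) coefficients."""
    n = len(y)
    t = np.arange(n)
    out = []
    for w in freqs:
        x = np.column_stack([np.ones(n), np.cos(w * t), np.sin(w * t)])
        b = expectile_regression(x, y, tau)
        out.append(n / 4 * (b[1] ** 2 + b[2] ** 2))
    return np.array(out)

# Synthetic demo series (illustrative only): a sinusoid at Fourier index 8 plus noise.
rng = np.random.default_rng(0)
n = 64
t = np.arange(n)
y = np.cos(2 * np.pi * 8 * t / n) + 0.5 * rng.standard_normal(n)
k = np.arange(1, n // 2)
freqs = 2 * np.pi * k / n
ep_median = expectile_periodogram(y, freqs, tau=0.5)
ep_upper = expectile_periodogram(y, freqs, tau=0.9)
```

Evaluating the function over a grid of `tau` values yields a two-dimensional array indexed by frequency and expectile level, which is the kind of input the abstract feeds to a classifier.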

2401.06740 2026-02-10 q-fin.CP cs.LG cs.NA math.NA math.PR stat.ML

A deep implicit-explicit minimizing movement method for option pricing in jump-diffusion models

Emmanuil H. Georgoulis, Antonis Papapantoleon, Costas Smaragdakis

Comments 17 pages, 11 figures

Details
English abstract

We develop a novel deep learning approach for pricing European basket options written on assets that follow jump-diffusion dynamics. The option pricing problem is formulated as a partial integro-differential equation, which is approximated via a new implicit-explicit minimizing movement time-stepping approach, involving approximation by deep, residual-type Artificial Neural Networks (ANNs) for each time step. The integral operator is discretized via two different approaches: (a) a sparse-grid Gauss-Hermite approximation following localised coordinate axes arising from singular value decompositions, and (b) an ANN-based high-dimensional special-purpose quadrature rule. Crucially, the proposed ANN is constructed to ensure the appropriate asymptotic behavior of the solution for large values of the underlyings and also leads to consistent outputs with respect to a priori known qualitative properties of the solution. The performance and robustness with respect to the dimension of these methods are assessed in a series of numerical experiments involving the Merton jump-diffusion model, while a comparison with the deep Galerkin method and the deep BSDE solver with jumps further supports the merits of the proposed approach.

2303.02756 2026-02-10 stat.CO stat.ME

Modeling Spatio-Temporal Transport: From Rigid Advection to Realistic Dynamics

Maria Laura Battagliola, Sofia Charlotta Olhede

Details
English abstract

Stochastic models for spatio-temporal transport face a critical trade-off between physical realism and interpretability. The advection model with a single constant velocity is interpretable but physically limited by its perfect correlation over time. This work aims to bridge the gap between this simple framework and its physically realistic extensions. Our guiding principle is to introduce a spatial correlation structure that vanishes over time. To achieve this, we present two distinct approaches. The first constructs complex velocity structures, either through superpositions of advection components or by allowing the velocity to vary locally. The second is a spectral technique that replaces the singular spectrum of rigid advection with a more flexible form, introducing temporal decorrelation controlled by parameters. We accompany these models with efficient simulation algorithms and demonstrate their success in replicating complex dynamics, such as tropical cyclones and the solutions of partial differential equations. Finally, we illustrate the practical utility of the proposed framework by comparing its simulations to real-world precipitation data from Hurricane Florence.

2302.00107 2026-02-10 stat.ML cs.LG stat.ME

Distributed sequential federated learning

Z. F. Wang, X. Y. Zhang, Y-c I. Chang

Comments 22 pages

Journal ref Statistica Sinica 2026 (in print)

Details
English abstract

The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico to illustrate the proposed method.

2112.12908 2026-02-10 stat.ME stat.CO

Annealed Leap-Point Sampler for Multimodal Target Distributions

Nicholas G. Tawn, Matthew T. Moores, Hugo Queniat, Gareth O. Roberts

Details
English abstract

In Bayesian statistics, exploring high-dimensional multimodal posterior distributions poses major challenges for existing MCMC approaches. This paper introduces the Annealed Leap-Point Sampler (ALPS), which augments the target distribution state space with modified annealed (cooled) distributions, in contrast to traditional tempering approaches. The coldest state is chosen such that its annealed density is well-approximated locally by a Laplace approximation. This allows for automated setup of a scalable mode-leaping independence sampler. ALPS requires an exploration component to search for the mode locations, which can either be run adaptively in parallel to improve these mode-jumping proposals, or else as a pre-computation step. A theoretical analysis shows that for a d-dimensional problem the coolest temperature level required only needs to be linear in dimension, $\mathcal{O}\left(d\right)$, implying that the number of iterations needed for ALPS to converge is $\mathcal{O}\left(d\right)$ (typically leading to overall complexity $\mathcal{O}\left(d^3\right)$ when computational cost per iteration is taken into account). ALPS is illustrated on several complex, multimodal distributions that arise from real-world applications. This includes a seemingly-unrelated regression (SUR) model of longitudinal data from U.S. manufacturing firms, as well as a spectral density model that is used in analytical chemistry for identification of molecular biomarkers.

2103.01313 2026-02-10 stat.AP cs.CY q-bio.OT

Towards Understanding the COVID-19 Case Fatality Rate

Donghui Yan, Aiyou Chen, Buqing Yang

Comments 13 pages, 7 figures

Details
English abstract

An important parameter for COVID-19 is the case fatality rate (CFR). It has been widely applied, including to measure the severity of the infection, to estimate the number of infected cases, and to assess risk. However, there remains a lack of understanding of several aspects of CFR, including population factors that are important to CFR, the apparent discrepancy of CFRs in different countries, and how the age effect comes into play. We analyze the CFRs at two different time snapshots, July 6 and Dec 28, 2020, one during the first wave and the other during the second wave of the COVID-19 pandemic. We consider two important population covariates, age and GDP as a proxy for the quality and abundance of public health. Extensive exploratory data analysis leads to some interesting findings. First, there is a clear exponential age effect among different age groups, and, more importantly, the exponential index is almost invariant across countries and time in the pandemic. Second, the roles played by age and GDP are a little surprising: during the first wave, age is a more significant factor than GDP, while their roles have switched during the second wave of the pandemic, which may be partially explained by the delay in time for the quality and abundance of public health and medical research to factor in.

2007.16012 2026-02-10 q-bio.BM cs.AI cs.CL cs.LG stat.ML

BERT Learns (and Teaches) Chemistry

Josh Payne, Mario Srouji, Dian Ang Yap, Vineet Kosaraju

Comments 10 pages, 5 figures

Details
English abstract

Modern computational organic chemistry is becoming increasingly data-driven. There remain a large number of important unsolved problems in this area such as product prediction given reactants, drug discovery, and metric-optimized molecule synthesis, but efforts to solve these problems using machine learning have also increased in recent years. In this work, we propose the use of attention to study functional groups and other property-impacting molecular substructures from a data-driven perspective, using a transformer-based model (BERT) on datasets of string representations of molecules and analyzing the behavior of its attention heads. We then apply the representations of functional groups and atoms learned by the model to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.

2602.07205 2026-02-10 cs.LG cs.GT stat.ML

Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Junyan Liu, Haipeng Luo, Zihan Zhang, Lillian J. Ratliff

Comments 36 pages

Details
English abstract

We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $O(\sqrt{K})$ external regret is well-known to be achievable, their result is still the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3},\sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between these extremes by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $O(ηC + \sqrt{K/η})$ regret bound, where $η$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $η$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
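
The step from the $O(ηC + \sqrt{K/η})$ bound to the $\sqrt{K} + (CK)^{1/3}$ term can be checked with a short balancing argument. The derivation below is a sketch under two assumptions that the abstract does not state: that $C$ is known when tuning, and that the admissible range is $\eta \in (0, 1]$. Constants are ignored throughout.

```latex
\[
  f(\eta) = \eta C + \sqrt{K/\eta}, \qquad \eta \in (0, 1].
\]
% Unconstrained balance of the two terms:
\[
  \eta C = \sqrt{K/\eta}
  \;\Longleftrightarrow\; \eta^{3/2} = \frac{\sqrt{K}}{C}
  \;\Longleftrightarrow\; \eta^{\star} = \Bigl(\frac{K}{C^{2}}\Bigr)^{1/3},
  \qquad
  f(\eta^{\star}) = 2\,(CK)^{1/3}.
\]
% Case 1: C <= sqrt(K), so eta* >= 1 and the constraint binds at eta = 1:
\[
  f(1) = C + \sqrt{K} \le 2\sqrt{K}.
\]
% Case 2: C > sqrt(K), so eta* < 1 is admissible:
\[
  f(\eta^{\star}) = 2\,(CK)^{1/3}.
\]
% Combining both cases, and using C <= K in the worst case:
\[
  \min_{\eta \in (0,1]} f(\eta) \le 2\bigl(\sqrt{K} + (CK)^{1/3}\bigr),
  \qquad
  (CK)^{1/3}\Big|_{C = K} = K^{2/3}.
\]
```

Under these assumptions the two extremes quoted in the abstract appear as $C \to 0$ (giving $\sqrt{K}$) and $C = K$ (giving $K^{2/3}$). The adaptive restart scheme in the paper is what removes the need to know $C$ in advance.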

2602.07170 2026-02-10 stat.AP

Bayesian Dynamic Gamma Models for Route-Level Travel Time Reliability

Vadim Sokolov, Refik Soyer

Details
English abstract

Route-level travel time reliability requires characterizing the distribution of total travel time across correlated segments -- a problem where existing methods either assume independence (fast but miscalibrated) or model dependence via copulas and simulation (accurate but expensive). We propose a conjugate Bayesian dynamic Gamma model with a common random environment that resolves this trade-off. Each segment's travel time follows a Gamma distribution conditional on a shared latent environment process that evolves as a Markov chain, inducing cross-segment dependence while preserving conditional independence. A moment-matching approximation yields a closed-form $F$-distribution for route travel time, from which the Planning Time Index, Buffer Index, and on-time probability are computed instantly -- at the same $O(1)$ cost as independence-based methods. The conjugate structure ensures that Bayesian posterior updates and the full predictive distribution are available in closed form as new sensor data arrives. Applied to 16 sensors spanning 8.26 miles on I-55 in Chicago, the model achieves 95.4% coverage of nominal 90% predictive intervals versus 34--37% for independence-based convolution, at identical computational cost.

2602.07160 2026-02-10 cs.CL cs.AI cs.LG stat.ML

Free Energy Mixer

Jiecheng Lu, Shihao Yang

Comments Camera-ready version. Accepted at ICLR 2026

Journal ref Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026)

Details
English abstract

Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
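
One natural reading of the "free-energy (log-sum-exp) read" in this abstract is, for each channel $c$, $F_c = \frac{1}{\beta}\log\sum_i p_i \exp(\beta v_{i,c})$, where $p$ is the prior over indices and $\beta$ is the inverse temperature. This formula, the function name, and the toy numbers below are assumptions made for illustration; the paper's gating and two-level structure are not reproduced. The sketch shows the limiting behaviour the abstract describes: a small $\beta$ gives the usual convex average, and a large $\beta$ gives per-channel selection.

```python
import numpy as np

def free_energy_read(prior, values, beta):
    """Per-channel log-sum-exp read (assumed form, see lead-in).

    prior:  shape (T,), strictly positive probabilities over indices.
    values: shape (T, D), one value vector per index.
    beta:   positive scalar or shape (D,) inverse temperature.
    """
    beta = np.broadcast_to(np.asarray(beta, dtype=float), values.shape[1:])
    scores = np.log(prior)[:, None] + beta * values
    # Numerically stable log-sum-exp over the index axis.
    m = scores.max(axis=0)
    return (m + np.log(np.exp(scores - m).sum(axis=0))) / beta

# Toy inputs (hypothetical): 3 indices, 2 channels.
prior = np.array([0.5, 0.3, 0.2])
values = np.array([[1.0, -2.0],
                   [0.0, 3.0],
                   [2.0, 0.5]])

average_like = free_energy_read(prior, values, beta=1e-4)  # close to prior @ values
select_like = free_energy_read(prior, values, beta=1e3)    # close to values.max(axis=0)
```

Note that the selected index can differ per channel (index 2 for the first channel, index 1 for the second in this toy input), which a single convex average per head cannot express.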

2602.07148 2026-02-10 stat.ME stat.CO

Functional Estimation of the Marginal Likelihood

Omiros Papaspiliopoulos, Timothée Stumpf-Fétizon, Jonathan Weare

Details
English abstract

We propose a framework for computing, optimizing and integrating with respect to a smooth marginal likelihood in statistical models that involve high-dimensional parameters/latent variables and continuous low-dimensional hyperparameters. The method requires samples from the posterior distribution of the parameters for different values of the hyperparameters on a simulation grid and returns inference on the marginal likelihood defined everywhere on its domain, and on its functionals. We show how the method relates to many of the methods that have been used in this context, including sequential Monte Carlo, Gibbs sampling, Monte Carlo maximum likelihood, and umbrella sampling. We establish the consistency of the proposed estimators as the sampling effort increases, both when the simulation grid is kept fixed and when it becomes dense in the domain. We showcase the approach on Gaussian process regression and classification and crossed effect models.

2602.07102 2026-02-10 stat.ML cs.AI cs.LG

Fast and Robust Likelihood-Guided Diffusion Posterior Sampling with Amortized Variational Inference

Léon Zheng, Thomas Hirtz, Yazid Janati, Eric Moulines

Details
English abstract

Zero-shot diffusion posterior sampling offers a flexible framework for inverse problems by accommodating arbitrary degradation operators at test time, but incurs high computational cost due to repeated likelihood-guided updates. In contrast, previous amortized diffusion approaches enable fast inference by replacing likelihood-based sampling with implicit inference models, but at the expense of robustness to unseen degradations. We introduce an amortization strategy for diffusion posterior sampling that preserves explicit likelihood guidance by amortizing the inner optimization problems arising in variational diffusion posterior sampling. This accelerates inference for in-distribution degradations while maintaining robustness to previously unseen operators, thereby improving the trade-off between efficiency and flexibility in diffusion-based inverse problems.

2602.05030 2026-02-10 stat.ME

Billions-Scale Forecast Reconciliation

Tianyu Wang, Matthew C. Johnson, Steven Klee, Matthew L. Malloy

Details
English abstract

The problem of combining multiple forecasts of related quantities that obey expected equality and additivity constraints, often referred to as hierarchical forecast reconciliation, is naturally stated as a simple optimization problem. In this paper we explore optimization-based point forecast reconciliation at scales faced by large retailers. We implement and benchmark several algorithms to solve the forecast reconciliation problem, showing efficacy when the dimension of the problem exceeds four billion forecasted values. To the best of our knowledge, this is the largest forecast reconciliation problem, and perhaps on par with the largest constrained least-squares problem ever solved. We also make several theoretical contributions. We show that for a restricted class of problems and when the loss function is weighted appropriately, least-squares forecast reconciliation is equivalent to share-based forecast reconciliation. This formalizes how the optimization-based approach can be thought of as a generalization of share-based reconciliation, applicable to multiple, overlapping data hierarchies.
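
The optimization problem this abstract refers to can be written, in its simplest form, as a weighted least-squares projection of the base forecasts onto the linear constraints. The sketch below solves that small dense problem in closed form. It is only a toy illustration: the function name, the three-node hierarchy, and the forecast numbers are hypothetical, and the paper's billion-scale algorithms are not reproduced.

```python
import numpy as np

def reconcile(base, constraints, weights=None):
    """Minimize sum_i w_i * (y_i - base_i)**2 subject to constraints @ y = 0.

    Closed form: y = base - W^{-1} A^T (A W^{-1} A^T)^{-1} A base.
    """
    base = np.asarray(base, dtype=float)
    a = np.atleast_2d(np.asarray(constraints, dtype=float))
    if weights is None:
        w_inv = np.ones_like(base)
    else:
        w_inv = 1.0 / np.asarray(weights, dtype=float)
    aw = a * w_inv                              # A W^{-1}, shape (m, n)
    lam = np.linalg.solve(aw @ a.T, a @ base)   # Lagrange multipliers
    return base - aw.T @ lam

# Hypothetical hierarchy: total = store_a + store_b, i.e. total - a - b = 0.
constraints = np.array([[1.0, -1.0, -1.0]])
base_forecasts = np.array([10.0, 4.0, 5.0])     # made-up, incoherent by 1 unit
reconciled = reconcile(base_forecasts, constraints)
```

With equal weights the one-unit discrepancy is split evenly across the three series. Passing `weights` changes how the adjustment is shared, which is the lever behind the abstract's equivalence with share-based reconciliation under an appropriately weighted loss.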

2602.04124 2026-02-10 stat.ME

Privacy Amplification for Synthetic data using Range Restriction

Jingchen Hu, Matthew R. Williams, Terrance D. Savitsky

Comments 25 pages, 20 figures

Details
English abstract

We introduce a new class of range restricted formal data privacy standards that condition on owner beliefs about sensitive data ranges. By incorporating this additional information, we can provide a stronger privacy guarantee (e.g. an amplification). The range restricted formal privacy standards protect only a subset (or ball) of data values and exclude ranges (or balls) believed to be already publicly known. The privacy standards are designed for the risk-weighted pseudo posterior (model) mechanism (PPM) used to generate synthetic data under an asymptotic differential privacy (aDP) guarantee. The PPM downweights the likelihood contribution for each record proportionally to its disclosure risk. The PPM is adapted under inclusion of beliefs by adjusting the risk-weighted pseudo likelihood. We introduce two alternative adjustments. The first expresses data owner knowledge of the sensitive range as a probability, $λ$, that a datum value drawn from the underlying generating distribution lies outside the ball or subspace of values that are sensitive. The portion of each datum likelihood contribution deemed sensitive is then $(1-λ) \leq 1$ and is the only portion of the likelihood subject to risk down-weighting. The second adjustment encodes knowledge as the difference in probability masses $P(R) \leq 1$ between the edges of the sensitive range, $R$. We use the resulting conditional (pseudo) likelihood for a sensitive record, which boosts its worst case tail values away from 0. We compare privacy and utility properties for the PPM under the aDP and range restricted privacy standards.

2506.18221 2026-02-10 cs.LG cs.AI stat.ML

These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining

Xingyu Alice Yang, Jianyu Zhang, Léon Bottou

Comments 10 pages, 6 figures, Preprint. Under review

Details
English abstract

Transfer learning is widely used to adapt large pretrained models to new tasks with only a small amount of new data. However, a challenge persists -- the features from the original task often do not fully cover what is needed for unseen data, especially when the relatedness of tasks is not clear. Since deep learning models tend to learn very sparse representations, they retain only the minimal features required for the initial training while discarding potentially useful ones for downstream transfer. A theoretical framework developed in this work demonstrates that such pretraining captures inconsistent aspects of the data distribution, thereby inducing transfer bias. To address this limitation, we propose an inexpensive ensembling strategy that aggregates multiple models to generate richer feature representations. On ResNet, this approach yields a $9\%$ improvement in transfer accuracy without incurring extra pretraining cost. We also present empirical evidence from a range of deep learning studies, confirming that the phenomenon is pervasive across modern deep learning architectures. These results suggest that relying solely on large pretrained networks is not always the most effective way to improve model generalization. Instead, fostering richer, more diverse representations -- e.g., through model ensembles -- can substantially enhance transfer learning performance.

2505.03938 2026-02-10 stat.CO

A computationally efficient framework for realistic epidemic modelling through Gaussian Markov random fields

Angelos Alexopoulos, Paul Birrell, Daniela De Angelis

Comments 34 pages, 7 Figures, 3 Tables

Details
English abstract

We tackle limitations of ordinary differential equation-driven Susceptible-Infectious-Removed (SIR) models and their extensions that have recently been employed for epidemic nowcasting and forecasting. In particular, we deal with challenges related to the extension of SIR-type models to account for the so-called environmental stochasticity, i.e., external factors, such as seasonal forcing, social cycles and vaccinations that can dramatically affect outbreaks of infectious diseases. Typically, in SIR-type models environmental stochasticity is modelled through stochastic processes. However, this stochastic extension of epidemic models leads to models with large dimension that increases over time. Here we propose a Bayesian approach to build an efficient modelling and inferential framework for epidemic nowcasting and forecasting by using Gaussian Markov random fields to model the evolution of these stochastic processes over time and across population strata. Importantly, we also develop a bespoke and computationally efficient Markov chain Monte Carlo algorithm to estimate the large number of parameters and latent states of the proposed model. We test our approach on simulated data and we apply it to real data from the Covid-19 pandemic in the United Kingdom.

2502.09667 2026-02-10 cs.CL cs.LG stat.ML

Summaries as Centroids for Interpretable and Scalable Text Clustering

Jairo Diaz-Rodriguez

Comments Accepted to ICLR 2026

Details
English abstract

We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
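
The summary-as-centroid loop described above can be sketched roughly as follows. This is a toy under stated assumptions, not the released method: `summary_kmeans` is a hypothetical name, the "summarizer" is a stand-in that simply picks the member text closest to the numeric mean, and the centroid is replaced by that text's existing embedding (a real summarizer would produce new text that is then embedded).

```python
import numpy as np

def summary_kmeans(texts, emb, k, n_iter=10, summarize_every=2, seed=0):
    """k-means in embedding space; every few iterations the centroid becomes a 'summary'."""
    emb = np.asarray(emb, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = emb[rng.choice(len(emb), size=k, replace=False)].copy()
    summaries = [None] * k
    for it in range(1, n_iter + 1):
        # Standard k-means assignment step (squared Euclidean distance).
        dists = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old centroid for an empty cluster
            mean = emb[members].mean(axis=0)
            if it % summarize_every == 0:
                # Stand-in summarizer (assumption): the member text nearest the mean.
                best = members[((emb[members] - mean) ** 2).sum(axis=1).argmin()]
                summaries[j] = texts[best]
                centroids[j] = emb[best]   # centroid = embedding of the summary text
            else:
                centroids[j] = mean        # ordinary numeric centroid update
    return labels, summaries

texts = ["cat a", "cat b", "cat c", "dog a", "dog b", "dog c"]
emb = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]  # toy 2-D "embeddings"
labels, summaries = summary_kmeans(texts, emb, k=2)
```

Each cluster ends up with a readable prototype in `summaries`, while assignments are still made purely by distance in embedding space.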

2502.03792 2026-02-10 stat.ML cs.LG

Step by Step: Adaptive Gradient Descent for Training L-Lipschitz Neural Networks

Kyle Sung, Kholood Khalil, Noah Forman, Steven Samu, Anastasis Kratsios

Comments 31 pages, 8 figures

Details
English abstract

We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR, with a sub-linear dependence on their number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where, surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
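
To make the setting concrete, here is a minimal sketch of full-batch GD on the MSE loss for a two-layer tanh network, with a learning rate that is constant at first and then decays. The schedule `lr0 * decay_start / t`, the function name `train_two_layer`, the width, and the toy data are all assumptions for illustration and are not taken from the paper. The reported quantity is the standard upper bound on the Lipschitz constant, the product of the output-weight norm and the spectral norm of the hidden-layer weights, which holds because tanh is 1-Lipschitz.

```python
import numpy as np

def train_two_layer(x, y, width=16, steps=500, lr0=0.02, decay_start=100, seed=0):
    """Full-batch GD on MSE for f(x) = a . tanh(W x + b) with an eventually decaying LR."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(width, d))
    b = np.zeros(width)
    a = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    losses = []
    for t in range(steps):
        # Assumed schedule: constant LR, then 1/t decay after `decay_start` steps.
        lr = lr0 if t < decay_start else lr0 * decay_start / t
        h = np.tanh(x @ W.T + b)                 # hidden activations, shape (n, width)
        err = h @ a - y                          # residuals, shape (n,)
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / n                    # d(MSE)/d(prediction)
        g_a = h.T @ g_out
        g_pre = np.outer(g_out, a) * (1.0 - h ** 2)   # backprop through tanh
        g_W = g_pre.T @ x
        g_b = g_pre.sum(axis=0)
        a -= lr * g_a
        W -= lr * g_W
        b -= lr * g_b
    # Upper bound on the Lipschitz constant: ||a||_2 * ||W||_2 (spectral norm).
    lip_bound = float(np.linalg.norm(a) * np.linalg.norm(W, 2))
    return losses, lip_bound

x = np.linspace(-1.0, 1.0, 64).reshape(-1, 1)
y = np.sin(3.0 * x[:, 0])
losses, lip_bound = train_two_layer(x, y)
```

Running the same function with `decay_start=steps` gives the constant-step baseline, so the two Lipschitz bounds can be compared in the spirit of the toy experiments the abstract mentions.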

2409.03502 2026-02-10 stat.ME

Differential Test Functioning via Robust Scaling

Peter F. Halpin

Comments 23 pages, 3 figures

Details
English abstract

In the item response theory (IRT) literature, differential test functioning (DTF) has been conceptualized in terms of how the test response function differs over groups of respondents. This paper presents an alternative approach to DTF that focusses on how the distribution of the latent trait differs over groups, which is referred to as impact. It is proposed to evaluate DTF by comparing two estimates of impact, one that naively aggregates over all test items and a robust alternative that down-weights items that exhibit differential item functioning (DIF). Taking this approach, this paper makes the following three contributions. First, it is shown that the difference between the naive and robust estimands provides a convenient effect size for quantifying the extent to which DIF affects conclusions about impact (as opposed to test scores). Second, it is shown how to construct a robust estimator that yields consistent estimates of impact whenever fewer than 1/2 of items exhibit DIF. Third, a relatively general-purpose Wald test of the difference between two estimates of impact is developed. Using simulations and an empirical example from physics education, it is shown how the proposed effect size and test statistic perform using the proposed robust estimator of impact, as well as estimators that arise from conventional item-by-item tests of DIF.
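
The contrast between a naive and a robust estimate of impact can be illustrated with a deliberately simple stand-in. The sketch below assumes each item yields its own estimate of the group difference, uses the mean as the naive aggregate, and swaps in the median as the robust one because it also tolerates contamination in fewer than half of the items. The paper's actual robust estimator is not specified in the abstract and is likely different; `impact_estimates` and the numbers are hypothetical.

```python
import numpy as np

def impact_estimates(item_shifts):
    """Return (naive, robust, effect_size) from item-level impact estimates."""
    item_shifts = np.asarray(item_shifts, dtype=float)
    naive = float(item_shifts.mean())        # aggregates over all items, DIF included
    robust = float(np.median(item_shifts))   # median as a stand-in robust estimator
    return naive, robust, naive - robust     # difference plays the role of the effect size

# Toy example: true impact 0.5 on 10 items, 3 of which carry an extra DIF shift of 1.0.
true_impact = 0.5
dif = np.array([0, 0, 0, 0, 0, 0, 0, 1.0, 1.0, 1.0])
naive, robust, effect = impact_estimates(true_impact + dif)
```

In this noise-free toy case the naive estimate is pulled up to 0.8 by the three DIF items, while the median stays at the true impact of 0.5, so the difference of 0.3 measures how much DIF distorts the conclusion about impact.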

2305.10878 2026-02-10 stat.ME stat.AP

Multi-scale wavelet coherence

Haibo Wu, Marina I. Knight, Hernando Ombao

Comments 48 pages, 13 figures in the paper

Details
English abstract

This paper develops a novel statistical approach to characterize temporally localised cross-oscillatory interactions between channels in a functional brain network. Brain signals are generally nonstationary, and the proposed framework uses wavelets as an effective tool for capturing (i) single-scale channel transient features, due to their adaptiveness to the dynamic signal properties, and (ii) cross-scale channel interactions, due to their multi-scale nature. Our approach formalises scale-specific subprocesses and cross-scale (CS) dependencies for a new class of multivariate locally stationary wavelet (MvLSW) processes that we refer to as CS-MvLSW. Under this model, we develop a novel spectral domain time-varying cross-scale dependence measure and an appropriate estimator for it. Extensive simulation studies demonstrate that the theoretically established properties hold in practice. The proposed CS-MvLSW framework remains accurate under pronounced cross-scale dependence, whereas existing MvLSW modelling can deteriorate even for single-scale coherence when such complex structure is present in the process. The proposed cross-scale analysis is applied to electroencephalogram (EEG) data to study alterations in the functional connectivity structure in children diagnosed with attention deficit hyperactivity disorder (ADHD). Our approach identified novel, clinically pertinent cross-scale interactions in the functional brain network, differentiating brain connectivity between control and ADHD groups.