arXivDaily arXiv每日学术速递 周一至周五更新
2602.21200 2026-02-25 stat.ME

A Time-Varying and Covariate-Dependent Correlation Model for Multivariate Longitudinal Studies

Qingzhi Liu, Gen Li, Anastasia K. Yocum, Melvin McInnis, Brian D. Athey, Veerabhadran Baladandayuthapani

Comments 23 pages, 5 figures, 1 table

详情
英文摘要

In multivariate longitudinal studies, associations between outcomes often exhibit time-varying and individual level heterogeneity, motivating the modeling of correlations as an explicit function of time and covariates. However, most existing methods for correlation analysis fail to simultaneously capture the time-varying and covariate-dependent effects. We propose a Time-Varying and Covariate-Dependent (TiVAC) correlation model that jointly allows covariate effects on correlation to change flexibly and smoothly across time. TiVAC employs a bivariate Gaussian model where the covariate-dependent correlations are modeled semiparametrically using penalized splines. We develop a penalized maximum likelihood-based Newton-Raphson algorithm, and inference on time-varying effects is provided through simultaneous confidence bands. Simulation studies show that TiVAC consistently outperforms existing methods in accurately estimating correlations across a wide range of settings, including binary and continuous covariates, sparse to dense observation schedules, and across diverse correlation trajectory patterns. We apply TiVAC to a psychiatric case study of 291 bipolar I patients, modeling the time-varying correlation between depression and anxiety scores as a function of their clinical variables. Our analyses reveal significant heterogeneity associated with gender and nervous-system medication use, which varies with age, revealing the complex dynamic relationship between depression and anxiety in bipolar disorders.

2602.21191 2026-02-25 cs.LG cs.DS stat.ML

Statistical Query Lower Bounds for Smoothed Agnostic Learning

Ilias Diakonikolas, Daniel M. Kane

详情
英文摘要

We study the complexity of smoothed agnostic learning, recently introduced by~\cite{CKKMS24}, in which the learner competes with the best classifier in a target class under slight Gaussian perturbations of the inputs. Specifically, we focus on the prototypical task of agnostically learning halfspaces under subgaussian distributions in the smoothed model. The best known upper bound for this problem relies on $L_1$-polynomial regression and has complexity $d^{\tilde{O}(1/σ^2) \log(1/ε)}$, where $σ$ is the smoothing parameter and $ε$ is the excess error. Our main result is a Statistical Query (SQ) lower bound providing formal evidence that this upper bound is close to best possible. In more detail, we show that (even for Gaussian marginals) any SQ algorithm for smoothed agnostic learning of halfspaces requires complexity $d^{Ω(1/σ^{2}+\log(1/ε))}$. This is the first non-trivial lower bound on the complexity of this task and nearly matches the known upper bound. Roughly speaking, we show that applying $L_1$-polynomial regression to a smoothed version of the function is essentially best possible. Our techniques involve finding a moment-matching hard distribution by way of linear programming duality. This dual program corresponds exactly to finding a low-degree approximating polynomial to the smoothed version of the target function (which turns out to be the same condition required for the $L_1$-polynomial regression to work). Our explicit SQ lower bound then comes from proving lower bounds on this approximation degree for the class of halfspaces.

2602.21170 2026-02-25 stat.CO

cyclinbayes: Bayesian Causal Discovery with Linear Non-Gaussian Directed Acyclic and Cyclic Graphical Models

Robert Lee, Raymond K. W. Wong, Yang Ni

Comments 4 Pages

详情
英文摘要

We introduce cyclinbayes, an open-source R package for discovering linear causal relationships with both acyclic and cyclic structures. The package employs scalable Bayesian approaches with spike-and-slab priors to learn directed acyclic graphs (DAGs) and directed cyclic graphs (DCGs) under non-Gaussian noise. A central feature of cyclinbayes is comprehensive uncertainty quantification, including posterior edge inclusion probabilities, posterior probabilities of network motifs, and posterior probabilities over entire graph structures. Our implementation addresses two limitations in existing software: (1) while methods for linear non-Gaussian DAG learning are available in R and Python, they generally lack proper uncertainty quantification, and (2) reliable implementations for linear non-Gaussian DCG remain scarce. The package implements computationally efficient hybrid MCMC algorithms that scale to large datasets. Beyond uncertainty quantification, we propose a new decision-theoretic approach to summarize posterior samples of graphs, yielding principled point estimates based on posterior expected loss such as posterior expected structural Hamming distance and structural intervention distance. The package, a supplementary material, and a tutorial are available on GitHub at https://github.com/roblee01/cyclinbayes.

2602.21133 2026-02-25 cs.LG stat.ML

SOM-VQ: Topology-Aware Tokenization for Interactive Generative Models

Alessandro Londei, Denise Lanzieri, Matteo Benati

详情
英文摘要

Vector-quantized representations enable powerful discrete generative models but lack semantic structure in token space, limiting interpretable human control. We introduce SOM-VQ, a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks with explicit low-dimensional topology. Unlike standard VQ-VAE, SOM-VQ uses topology-aware updates that preserve neighborhood structure: nearby tokens on a learned grid correspond to semantically similar states, enabling direct geometric manipulation of the latent space. We demonstrate that SOM-VQ produces more learnable token sequences in the evaluated domains while providing an explicit navigable geometry in code space. Critically, the topological organization enables intuitive human-in-the-loop control: users can steer generation by manipulating distances in token space, achieving semantic alignment without frame-level constraints. We focus on human motion generation - a domain where kinematic structure, smooth temporal continuity, and interactive use cases (choreography, rehabilitation, HCI) make topology-aware control especially natural - demonstrating controlled divergence and convergence from reference sequences through simple grid-based sampling. SOM-VQ provides a general framework for interpretable discrete representations applicable to music, gesture, and other interactive generative domains.

2602.21055 2026-02-25 math.ST stat.ML stat.TH

Adjacency Spectral Embeddings of Correlation Networks

Keith Levin

详情
英文摘要

In many applications, weighted networks are constructed based on time series data: each time series is associated to a vertex and edge weights are given by pairwise correlations. The result is a network whose edge dependency structure violates the assumptions of most common network models. Nonetheless, it is common to analyze these "correlation networks" using embedding methods derived from edge-independent network models, based on a belief that the edges are approximately independent. In this work, we put this modeling choice on firm theoretical ground. We show that when the time series are expressible in terms of a small number of Fourier basis elements (or in some other suitably-chosen basis), correlation networks correspond to latent space networks with dependent edge noise in which the vertex-level latent variables encode the basis coefficients. Further, we show that when time series are observed subject to noise, spectral embedding of the resulting noisy correlation network still recovers these true vertex-level latent representations under suitable assumptions. This characterization of embeddings as learning Fourier coefficients appears to be folklore in the signal processing community in the context of principal component analysis, but is, to the best of our knowledge, new to the statistical network analysis literature.

2602.21039 2026-02-25 stat.ML cs.LG

Is Multi-Distribution Learning as Easy as PAC Learning: Sharp Rates with Bounded Label Noise

Rafael Hanashiro, Abhishek Shetty, Patrick Jaillet

详情
英文摘要

Towards understanding the statistical complexity of learning from heterogeneous sources, we study the problem of multi-distribution learning. Given $k$ data sources, the goal is to output a classifier for each source by exploiting shared structure to reduce sample complexity. We focus on the bounded label noise setting to determine whether the fast $1/ε$ rates achievable in single-task learning extend to this regime with minimal dependence on $k$. Surprisingly, we show that this is not the case. We demonstrate that learning across $k$ distributions inherently incurs slow rates scaling with $k/ε^2$, even under constant noise levels, unless each distribution is learned separately. A key technical contribution is a structured hypothesis-testing framework that captures the statistical cost of certifying near-optimality under bounded noise-a cost we show is unavoidable in the multi-distribution setting. Finally, we prove that when competing with the stronger benchmark of each distribution's optimal Bayes error, the sample complexity incurs a \textit{multiplicative} penalty in $k$. This establishes a \textit{statistical} separation between random classification noise and Massart noise, highlighting a fundamental barrier unique to learning from multiple sources.

2602.21036 2026-02-25 stat.ME cs.LG stat.ML

Empirically Calibrated Conditional Independence Tests

Milleno Pan, Antoine de Mathelin, Wesley Tansey

详情
英文摘要

Conditional independence tests (CIT) are widely used for causal discovery and feature selection. Even with false discovery rate (FDR) control procedures, they often fail to provide frequentist guarantees in practice. We highlight two common failure modes: (i) in small samples, asymptotic guarantees for many CITs can be inaccurate and even correctly specified models fail to estimate the noise levels and control the error, and (ii) when sample sizes are large but models are misspecified, unaccounted dependencies skew the test's behavior and fail to return uniform p-values under the null. We propose Empirically Calibrated Conditional Independence Tests (ECCIT), a method that measures and corrects for miscalibration. For a chosen base CIT (e.g., GCM, HRT), ECCIT optimizes an adversary that selects features and response functions to maximize a miscalibration metric. ECCIT then fits a monotone calibration map that adjusts the base-test p-values in proportion to the observed miscalibration. Across empirical benchmarks on synthetic and real data, ECCIT achieves valid FDR with higher power than existing calibration strategies while remaining test agnostic.

2602.21031 2026-02-25 stat.ME stat.ML

Exchangeable Gaussian Processes for Staggered-Adoption Policy Evaluation

Hayk Gevorgyan, Konstantinos Kalogeropoulos, Angelos Alexopoulos

详情
英文摘要

We study the use of exchangeable multi-task Gaussian processes (GPs) for causal inference in panel data, applying the framework to two settings: one with a single treated unit subject to a once-and-for-all treatment and another with multiple treated units and staggered treatment adoption. Our approach models the joint evolution of outcomes for treated and control units through a GP prior that ensures exchangeability across units while allowing for flexible nonlinear trends over time. The resulting posterior predictive distribution for the untreated potential outcomes of the treated unit provides a counterfactual path, from which we derive pointwise and cumulative treatment effects, along with credible intervals to quantify uncertainty. We implement several variations of the exchangeable GP model using different kernel functions. To assess prediction accuracy, we conduct a placebo-style validation within the pre-intervention window by selecting a ``fake'' intervention date. Ultimately, this study illustrates how exchangeable GPs serve as a flexible tool for policy evaluation in panel data settings and proposes a novel approach to staggered-adoption designs with a large number of treated and control units.

2602.21029 2026-02-25 stat.AP math.OC physics.soc-ph

On the non-uniformity of the 2026 FIFA World Cup draw

László Csató, Martin Becker, Karel Devriesere, Dries Goossens

Comments 19 pages, 1 figure, 5 tables

详情
英文摘要

The group stage of a sports tournament is often made more appealing by introducing additional constraints in the group draw that promote an attractive and balanced group composition. For example, the number of intra-regional group matches is minimised in several World Cups. However, under such constraints, the traditional draw procedure may become non-uniform, meaning that the feasible allocations of the teams into groups are not equally likely to occur. Our paper quantifies this non-uniformity of the 2026 FIFA World Cup draw for the official draw procedure, as well as for 47 reasonable alternatives implied by all permutations of the four pots and two group labelling policies. We show why simulating with a recursive backtracking algorithm is intractable, and propose a workable implementation using integer programming. The official draw mechanism is found to be optimal based on four measures of non-uniformity. Nonetheless, non-uniformity can be more than halved if the organiser aims to treat the best teams drawn from the first pot equally.

2602.20965 2026-02-25 stat.ME math.ST stat.TH

Estimating the Partially Linear Zero-Inflated Poisson Regression Model: a Robust Approach Using a EM-like Algorithm

María José Llop, Andrea Bergesio, Anne-Françoise Yao

详情
英文摘要

Count data with an excessive number of zeros frequently arise in fields such as economics, medicine, and public health. Traditional count models often fail to adequately handle such data, especially when the relationship between the response and some predictors is nonlinear. To overcome these limitations, the partially linear zero-inflated Poisson (PLZIP) model has been proposed as a flexible alternative. However, all existing estimation approaches for this model are based on likelihood, which is known to be highly sensitive to outliers and slight deviations from the model assumptions. This article presents the first robust estimation method specifically developed for the PLZIP model. An Expectation-Maximization-like algorithm is used to take advantage of the mixture nature of the model and to address extreme observations in both the response and the covariates. Results of the algorithm convergence and the consistency of the estimators are proved. A simulation study under various contamination schemes showed the robustness and efficiency of the proposed estimators in finite samples, compared to classical estimators. Finally, the application of the methodology is illustrated through an example using real data.

2602.20956 2026-02-25 math.PR math.ST stat.TH

Extreme eigenvalues and eigenvectors for finite rank additive deformations of non-hermitian sparse random matrices

Walid Hachem, Michail Louvaris, Jamal Najim

Comments 27 pages

详情
英文摘要

Consider a $n\times n$ sparse non-Hermitian random matrix $X_n$ defined as the Hadamard product between a random matrix with centered independent and identically distributed entries and a sparse Bernoulli matrix with success probability $K_n/n$ where $K_n\le n$ (and possibly $K_n\ll n$) and $K_n\to \infty$ as $n\to \infty$. Let $E_n$ be a deterministic $n\times n$ finite-rank matrix. We prove that the outlier eigenvalues of $Y_n= X_n +E_n$ asymptotically match those of $E_n$. In the special case of a rank-one deformation, assuming further that the sparsity parameter satisfies $K_n \gg \log^9(n)$ and that the entries of the random matrix are sub-Gaussian, we describe the limiting behavior of the projection of the right eigenvector associated with the leading eigenvalue onto the right eigenvector of the rank-one deformation. In particular, we prove that the projection behaves as in the Hermitian case. To that end, we rely on the recent universality results of Brailovskaya and van Handel (2024) relating the singular value spectra of deformations of $X_n$ to Gaussian analogues of these matrices. Our analysis builds upon a recent framework introduced by Bordenave et.al. (2022), and amounts to showing the asymptotic equivalence between the reverse characteristic polynomial of the random matrix and a random analytic function on the unit disc with explicit dependence on the finite-rank deformation.

2602.20954 2026-02-25 stat.OT

Hierarchical Aggregation Clustering Algorithms Derived from the Bi-partial Objective Function

Jan W. Owsiński

Comments An original paper, not yet submitted anywhere

详情
英文摘要

The paper outlines the principles of construction of a broad class of hierarchical aggregation algorithms of cluster analysis, essentially based on minimum distance mergers, which are derived from the general bi-partial objective function. It is shown how the algorithms arise from the bi-partial objective function, their affinity with the classical hierarchical aggregation algorithms is demonstrated, and the examples of such algorithms for the concrete forms of the bi-partial objective function are provided. This amounts to the first explicit and, at the same time, quite general, connection between optimization in clustering and the hierarchical aggregation algorithms. Thereby, the respective hierarchical algorithms gain a deeper justification, the means for evaluating the quality of clustering is provided, along with the criterion of stopping the cluster mergers.

2602.20939 2026-02-25 stat.ME

A Statistical Framework for Detecting Emergent Narratives in Longitudinal Text Corpora

Cynthia Medeiros, John Quigley, Matthew Revie

详情
英文摘要

Narratives about economic events and policies are widely recognised as influential drivers of economic and business behaviour. Yet the statistical identification of narrative emergence remains underdeveloped. Narratives evolve gradually, exhibit subtle shifts in content, and may exert influence disproportionate to their observable frequency, making it difficult to determine when observed changes reflect genuine structural shifts rather than routine variation in language use. We propose a statistical framework for detecting narrative emergence in longitudinal text corpora using Latent Dirichlet Allocation (LDA). We define emergence as a sustained increase in a topic's relative prominence over time and articulate a statistical framework for interpreting such trajectories, recognising that topic proportions are latent, model-estimated quantities. We illustrate the approach using a corpus of academic publications in economics spanning 1970-2018, where Nobel Prize-recognised contributions serve as externally observable signals of influential narratives. Topics associated with these contributions display sustained increases in estimated prevalence that coincide with periods of heightened citation activity and broader disciplinary recognition. These findings indicate that model-based topic trajectories can reflect identifiable shifts in economic discourse and provide a statistically grounded basis for analysing thematic change in longitudinal textual data.

2602.20896 2026-02-25 math.ST stat.ME stat.TH

On Stein's test of uniformity on the hypersphere

Paul Axmann, Bruno Ebner, Eduardo García-Portugués

Comments 31 pages, 5 figures, 4 tables

详情
英文摘要

We propose a new test of uniformity on the hypersphere based on a Stein characterization associated with the Laplace--Beltrami operator. We identify a sufficient class of test functions for this characterization, linked to the moment generating function. Exploiting the operator's eigenfunctions to obtain a harmonic decomposition in terms of Gegenbauer polynomials, we show that the proposed procedure belongs to the class of Sobolev tests. We derive closed-form expressions for the distribution of the test statistic under the null hypothesis and under fixed alternatives. To enhance power against a range of alternatives, we introduce a tuning parameter into the characterization and study its impact on rejection probabilities. We discuss data-driven strategies for selecting this parameter to maximize rejection rates for a given alternative and compare the resulting performance with that of related parametric tests. Additional numerical experiments compare the proposed test with competing Sobolev-class procedures, highlighting settings in which it offers clear advantages.

2602.20885 2026-02-25 stat.ME

Combining Information Across Diverse Sources: The II-CC-FF Paradigm

Céline Cunen, Nils Lid Hjort

Comments 44 pages, 14 figures. This is the authors' version, with Supplementing Material, July 2020 (and arXiv'd February 2026), published in essentially this form in Scandinavian Journal of Statistics, 2022, vol. 49, pages 625-256, at this url: onlinelibrary.wiley.com/doi/pdf/10.1111/sjos.12530

详情
英文摘要

We introduce and develop a general paradigm for combining information across diverse data sources. In broad terms, suppose $ϕ$ is a parameter of interest, built up via components $ψ_1,\ldots,ψ_k$ from data sources $1,\ldots,k$. The proposed scheme has three steps. First, the Independent Inspection (II) step amounts to investigating each separate data source, translating statistical information to a confidence distribution $C_j(ψ_j)$ for the relevant focus parameter $ψ_j$ associated with data source $j$. Second, Confidence Conversion (CC) techniques are used to translate the confidence distributions to confidence log-likelihood functions, say $\ell_{{\rm con},j}(ψ_j)$. Finally, the Focused Fusion (FF) step uses relevant and context-driven techniques to construct a confidence distribution for the primary focus parameter $ϕ=ϕ(ψ_1,\ldots,ψ_k)$, acting on the combined confidence log-likelihood. In traditional setups, the II-CC-FF strategy amounts to versions of meta-analysis, and turns out to be competitive against state-of-the-art methods. Its potential lies in applications to harder problems, however. Illustrations are presented, related to actual applications.

2602.20875 2026-02-25 math.ST math.OC math.PR stat.ME stat.ML stat.TH

Efficient Online Learning in Interacting Particle Systems

Louis Sharrock, Nikolas Kantas, Grigorios A. Pavliotis

详情
英文摘要

We introduce a new method for online parameter estimation in stochastic interacting particle systems, based on continuous observation of a small number of particles from the system. Our method recursively updates the model parameters using a stochastic approximation of the gradient of the asymptotic log likelihood, which is computed using the continuous stream of observations. Under suitable assumptions, we rigorously establish convergence of our method to the stationary points of the asymptotic log-likelihood of the interacting particle system. We consider asymptotics both in the limit as the time horizon $t\rightarrow\infty$, for a fixed and finite number of particles, and in the joint limit as the number of particles $N\rightarrow\infty$ and the time horizon $t\rightarrow\infty$. Under additional assumptions on the asymptotic log-likelihood, we also establish an $\mathrm{L}^2$ convergence rate and a central limit theorem. Finally, we present several numerical examples of practical interest, including a model for systemic risk, a model of interacting FitzHugh--Nagumo neurons, and a Cucker--Smale flocking model. Our numerical results corroborate our theoretical results, and also suggest that our estimator is effective even in cases where the assumptions required for our theoretical analysis do not hold.

2602.17554 2026-02-25 cs.LG stat.ML

A Theoretical Framework for Modular Learning of Robust Generative Models

Corinna Cortes, Mehryar Mohri, Yutao Zhong

详情
英文摘要

Training large-scale generative models is resource-intensive and relies heavily on heuristic dataset weighting. We address two fundamental questions: Can we train Large Language Models (LLMs) modularly-combining small, domain-specific experts to match monolithic performance-and can we do so robustly for any data mixture, eliminating heuristic tuning? We present a theoretical framework for modular generative modeling where a set of pre-trained experts are combined via a gating mechanism. We define the space of normalized gating functions, $G_{1}$, and formulate the problem as a minimax game to find a single robust gate that minimizes divergence to the worst-case data mixture. We prove the existence of such a robust gate using Kakutani's fixed-point theorem and show that modularity acts as a strong regularizer, with generalization bounds scaling with the lightweight gate's complexity. Furthermore, we prove that this modular approach can theoretically outperform models retrained on aggregate data, with the gap characterized by the Jensen-Shannon Divergence. Finally, we introduce a scalable Stochastic Primal-Dual algorithm and a Structural Distillation method for efficient inference. Empirical results on synthetic and real-world datasets confirm that our modular architecture effectively mitigates gradient conflict and can robustly outperform monolithic baselines.

2510.06465 2026-02-25 stat.ME

Maximum softly penalised likelihood in factor analysis

Philipp Sterzinger, Ioannis Kosmids, Irini Moustaki

详情
英文摘要

Estimation in exploratory factor analysis often yields estimates on the boundary of the parameter space. Such occurrences, known as Heywood cases, are characterised by non-positive variance estimates and can cause issues in numerical optimisation procedures or convergence failures, which, in turn, can lead to misleading inferences, particularly regarding factor scores and model selection. We derive sufficient conditions on the model and a penalty to the log-likelihood function that i) guarantee the existence of maximum penalised likelihood estimates in the interior of the parameter space, and ii) ensure that the corresponding estimators possess the desirable asymptotic properties expected by the maximum likelihood estimator, namely consistency and asymptotic normality. Consistency and asymptotic normality are achieved when the penalisation is soft enough, in a way that adapts to the information accumulation about the model parameters. We formally show, for the first time, that the penalties of Akaike (1987) and Hirose et al. (2011) to the log-likelihood of the normal linear factor model satisfy the conditions for existence, and, hence, deal with Heywood cases. Their vanilla versions, though, can result in questionable finite-sample properties in estimation, inference, and model selection. The maximum softly-penalised likelihood framework we introduce enables the careful scaling of those penalties to ensure that the resulting estimation and inference procedures inherit the ML estimator's optimal properties. Through comprehensive simulation studies and the analysis of real data sets, we illustrate the desirable finite-sample properties of the maximum softly penalised likelihood estimators and associated procedures.

2509.24870 2026-02-25 astro-ph.IM stat.AP

Closing the Evidence Gap: reddemcee, a Fast Adaptive Parallel Tempering Sampler

Pablo A. Peña, James S. Jenkins

Comments v2: Revised after referee comments; resubmitted to A&A on 24 Sep 2025

Journal ref A&A 706, A323 (2026)

详情
英文摘要

Markov Chain Monte Carlo (MCMC) excels at sampling complex posteriors but traditionally lags behind nested sampling in accurate evidence estimation, which is crucial for model comparison in astrophysical problems. We introduce reddemcee, an Adaptive Parallel Tempering Ensemble Sampler, aiming to close this gap by simultaneously presenting next-generation automated temperature-ladder adaptation techniques and robust, low-bias evidence estimators. reddemcee couples an affine-invariant stretch move with five interchangeable ladder-adaptation objectives, Uniform Swap Acceptance Rate, Swap Mean Distance, Gaussian-Area Overlap, Small Gaussian Gap, and Equalised Thermodynamic Length, implemented through a common differential update rule. Three evidence estimators are provided: Curvature-aware Thermodynamic Integration (TI+), Geometric-Bridge Stepping Stones (SS+), and a novel Hybrid algorithm that blends both approaches (H+). Performance and accuracy are benchmarked on n-dimensional Gaussian Shells, Gaussian Egg-box, Rosenbrock Functions, and exoplanet radial-velocity time-series of HD 20794. Across Shells up to 15 dimensions, reddemcee presents roughly 7 times the effective sampling speed of the best dynamic nested sampling configuration. The TI+, SS+ and H+ estimators recover estimates under 3 percent error and supply realistic uncertainties with as few as six temperatures. In the HD 20794 case study, reddemcee reproduces literature model rankings and yields tighter yet consistent planetary parameters compared with dynesty, with evidence errors that track run-to-run dispersion. By unifying fast ladder adaptation with reliable evidence estimators, reddemcee delivers strong throughput and accurate evidence estimates, often matching, and occasionally surpassing, dynamic nested sampling, while preserving the rich posterior information which makes MCMC indispensable for modern Bayesian inference.

2507.04448 2026-02-25 cs.LG cond-mat.dis-nn stat.ML

Transfer Learning in Infinite Width Feature Learning Networks

Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan

详情
英文摘要

We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze both (i) fine-tuning, when the downstream predictor is trained on top of source-induced features and (ii) a jointly rich setting, where both pretraining and downstream tasks can operate in a feature learning regime, but the downstream model is initialized with the features obtained after pre-training. In this setup, the summary statistics of randomly initialized networks after a rich pre-training are adaptive kernels which depend on both source data and labels. For (i), we analyze the performance of a readout for different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels with features from both source and target tasks. We test our theory on linear and polynomial regression tasks as well as real datasets. Our theory allows interpretable conclusions on performance, which depend on the amount of data on both tasks, the alignment between tasks, and the feature learning strength.

2507.03854 2026-02-25 cs.LG cs.SD cs.SY eess.AS eess.SY nlin.AO stat.ML

Latent FxLMS: Accelerating Active Noise Control with Neural Adaptive Filters

Kanad Sarkar, Austin Lu, Manan Mittal, Yongjie Zhuang, Ryan Corey, Andrew Singer

Comments 8 pages, Submitted at Forum Acousticum Euronoise 2025

Journal ref 10.61782/fa.2025.0565

详情
英文摘要

Filtered-X LMS (FxLMS) is commonly used for active noise control (ANC), wherein the soundfield is minimized at a desired location. Given prior knowledge of the spatial region of the noise or control sources, we could improve FxLMS by adapting along the low-dimensional manifold of possible adaptive filter weights. We train an auto-encoder on the filter coefficients of the steady-state adaptive filter for each primary source location sampled from a given spatial region and constrain the weights of the adaptive filter to be the output of the decoder for a given state of latent variables. Then, we perform updates in the latent space and use the decoder to generate the cancellation filter. We evaluate how various neural network constraints and normalization techniques impact the convergence speed and steady-state mean squared error. Under certain conditions, our Latent FxLMS model converges in fewer steps with comparable steady-state error to the standard FxLMS.

2506.07167 2026-02-25 stat.ME

Spectral Clustering with Likelihood Refinement for High-dimensional Latent Class Recovery

Zhongyuan Lyu, Yuqi Gu

详情
英文摘要

Latent class models are widely used for identifying unobserved subgroups from multivariate categorical data in social sciences, with binary data as a particularly popular example. However, accurately recovering individual latent class memberships remains challenging, especially when handling high-dimensional datasets with many items. This work proposes a novel two-stage algorithm for latent class models suited for high-dimensional binary responses. Our method first initializes latent class assignments by an easy-to-implement spectral clustering algorithm, and then refines these assignments with a one-step likelihood-based update. This approach combines the computational efficiency of spectral clustering with the improved statistical accuracy of likelihood-based estimation. We establish theoretical guarantees showing that this method is minimax-optimal for latent class recovery in the statistical decision theory sense. The method also leads to exact clustering of subjects with high probability under mild conditions. As a byproduct, we propose a computationally efficient consistent estimator for the number of latent classes. Extensive experiments on both simulated data and real data validate our theoretical results and demonstrate our method's superior performance over alternative methods.

2505.11602 2026-02-25 cs.LG math.DS math.OC stat.ML

Regularity and Stability Properties of Selective SSMs with Discontinuous Gating

Nikola Zubić, Davide Scaramuzza

Comments 26 pages, 6 theorems, 2 figures, 1 table

详情
英文摘要

Deep selective State-Space Models (SSMs), whose state-space parameters are modulated online by a selection signal, offer significant expressive power but pose challenges for stability analysis, especially under discontinuous gating. We study continuous-time selective SSMs through the lenses of passivity and Input-to-State Stability (ISS), explicitly distinguishing the selection schedule $x(\cdot)$ from the driving (port) input $u(\cdot)$. First, we show that state-strict dissipativity ($β>0$) together with quadratic bounds on a storage functional implies exponential decay of homogeneous trajectories ($u\equiv 0$), yielding exponential forgetting. Second, by freezing the selection ($x(t)\equiv 0$) we obtain a passive LTV input-output subsystem and prove that its minimal available storage is necessarily quadratic, $V_{a,0}(t,h)=\tfrac{1}{2}h^H Q_0(t)h,$ with $Q_0 \in \mathrm{AUC}_{\mathrm{loc}}$, accommodating discontinuities induced by gating. Third, under the strong hypothesis that a single quadratic storage certifies passivity uniformly over all admissible selection schedules, we derive a parametric LMI and universal kernel constraints on gating, formalizing an "irreversible forgetting" structure. Finally, we give sufficient conditions for global ISS with respect to the port input $u(\cdot)$, uniformly over admissible selection schedules, and we validate the main predictions in targeted simulation studies.

2501.06873 2026-02-25 econ.GN cs.CL cs.IR cs.SI q-fin.EC stat.ME

Causal Claims in Economics

Prashant Garg, Thiemo Fetzer

Comments Data, code, prompts, and workflow documentation are publicly available at our GitHub repository: https://github.com/prashgarg/CausalClaimsInEconomics

详情
英文摘要

As economics scales, a key bottleneck is representing what papers claim in a comparable, aggregable form. We introduce evidence-annotated claim graphs that map each paper into a directed network of standardized economic concepts (nodes) and stated relationships (edges), with each edge labeled by evidentiary basis, including whether it is supported by causal inference designs or by non-causal evidence. Using a structured multi-stage AI workflow, we construct claim graphs for 44,852 economics papers from 1980-2023. The share of causal edges rises from 7.7% in 1990 to 31.7% in 2020. Measures of causal narrative structure and causal novelty are positively associated with top-five publication and long-run citations, whereas non-causal counterparts are weakly related or negative.

2412.10304 2026-02-25 econ.EM stat.ME

A Neyman-Orthogonalization Approach to the Incidental Parameter Problem

Stéphane Bonhomme, Koen Jochmans, Martin Weidner

详情
英文摘要

A popular approach to perform inference on a target parameter in the presence of nuisance parameters is to construct estimating equations that are orthogonal to the nuisance parameters, in the sense that their expected first derivative is zero. Such first-order orthogonalization allows the estimator of the nuisance parameters to converge at a slower-than-parametric rate. It may, however, not suffice when the nuisance parameters are very imprecisely estimated. Leading examples are models for panel and network data that feature fixed effects. In this paper, we show how, in the conditional-likelihood setting, estimating equations can be constructed that are orthogonal to any chosen order $q$, in that their leading $q$ expected derivatives are zero. This yields estimators of target parameters that are unaffected by the presence of nuisance parameters to order $q$. In an empirical illustration, we apply our method to a fixed-effect model of team production.

2412.08828 2026-02-25 stat.ME

A Two-Stage Approach for Segmenting Spatial Point Patterns Applied to Multiplex Imaging

Alvin Sheng, Brian J. Reich, Ana-Maria Staicu, Santhoshi N. Krishnan, Arvind Rao, Timothy L. Frankel

Comments 13 pages, 5 figures, 1 table

Journal ref Biostatistics 2026

详情
英文摘要

Recent advances in multiplex imaging have enabled researchers to locate different types of cells within a tissue sample. This is especially relevant for tumor immunology, as clinical regimes corresponding to different stages of disease or responses to treatment may manifest as different spatial arrangements of tumor and immune cells. Spatial point pattern modeling can be used to partition multiplex tissue images according to these regimes. To this end, we propose a two-stage approach: first, local intensities and pair correlation functions are estimated from the spatial point pattern of cells within each image, and the pair correlation functions are reduced in dimension via spectral decomposition of the covariance function. Second, the estimates are clustered in a Bayesian hierarchical model with spatially-dependent cluster labels. The clusters correspond to regimes of interest that are present across subjects; the cluster labels segment the spatial point patterns according to those regimes. Through Markov Chain Monte Carlo sampling, we jointly estimate and quantify uncertainty in the cluster assignment and spatial characteristics of each cluster. Simulations demonstrate the performance of the method, and it is applied to a set of multiplex immunofluorescence images of diseased pancreatic tissue.

2408.07016 2026-02-25 cs.LG cs.AI stat.ML

Rethinking Disentanglement under Dependent Factors of Variation

Antonio Almudévar, Alfonso Ortega

详情
英文摘要

Representation learning is an approach that allows to discover and extract the factors of variation from the data. Intuitively, a representation is said to be disentangled if it separates the different factors of variation in a way that is understandable to humans. Definitions of disentanglement and metrics to measure it usually assume that the factors of variation are independent of each other. However, this is generally false in the real world, which limits the use of these definitions and metrics to very specific and unrealistic scenarios. In this paper we give a definition of disentanglement based on information theory that is also valid when the factors of variation are not independent. Furthermore, we relate this definition to the Information Bottleneck Method. Finally, we propose a method to measure the degree of disentanglement from the given definition that works when the factors of variation are not independent. We show through different experiments that the method proposed in this paper correctly measures disentanglement with non-independent factors of variation, while other methods fail in this scenario.

1704.02502 2026-02-25 stat.ML

Interactive Graphics for Visually Diagnosing Forest Classifiers in R

Natalia da Silva, Dianne Cook, Eun-Kyung Lee

Journal ref Comput Stat 40, (2025)

详情
英文摘要

This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble, produced by bagging multiple trees. The process of bagging and combining results from multiple trees, produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm, and to the projection pursuit forest, but could be more broadly applied to other bagged ensembles. Interactive graphics are built in R, using the ggplot2, plotly, and shiny packages.

2602.20856 2026-02-25 q-fin.CP econ.EM q-fin.PM stat.ML

Stochastic Discount Factors with Cross-Asset Spillovers

Doron Avramov, Xin He

详情
英文摘要

This paper develops a unified framework that links firm-level predictive signals, cross-asset spillovers, and the stochastic discount factor (SDF). Signals and spillovers are jointly estimated by maximizing the Sharpe ratio, yielding an interpretable SDF that both ranks characteristic relevance and uncovers the direction of predictive influence across assets. Out-of-sample, the SDF consistently outperforms self-predictive and expected-return benchmarks across investment universes and market states. The inferred information network highlights large, low-turnover firms as net transmitters. The framework offers a clear, economically grounded view of the informational architecture underlying cross-sectional return dynamics.

2602.20834 2026-02-25 stat.ME

Confidence Distributions and Related Themes

Nils Lid Hjort, Tore Schweder

Comments 26 pages, 5 figures. Invited introduction for a theme issue on Confidence Distributions and Related Themes, for the Journal of Statistical Planning and Inference, by the two guest editors. JSPI 2018, vol. 195, pages 1-13; journal version i is here: www.sciencedirect.com/science/article/abs/pii/S037837581730174X

详情
英文摘要

This is the guest editors' general introduction to a Special Issue of the Journal of Statistical Planning and Inference, dedicated to confidence distributions and related themes. Confidence distributions (CDs) are distributions for parameters of interest, constructed via a statistical model after analysing the data. As such they serve the same purpose for the frequentist statisticians as the posterior distributions for the Bayesians. There have been several attempts in the literature to put up a clear theory for such confidence distributions, from Fisher's fiducial inference and onwards. There are certain obstacles and difficulties involved in these attempts, both conceptually and operationally, which have contributed to the CDs being slow in entering statistical mainstream. Recently there is a renewed surge of interest in CDs and various related themes, however, reflected in both series of new methodological research, advanced applications to substantive sciences, and dissemination and communication via workshops and conferences. The present special issue of the JSPI is a collection of papers emanating from the {\it Inference With Confidence} workshop in Oslo, May 2015. Several of the papers appearing here were first presented at that workshop. The present collection includes however also new research papers from other scholars in the field.

2602.20815 2026-02-25 math.ST stat.TH

Upper Bounds for the I-MSE and max-MSE of Kernel Density Estimators

Nils Lid Hjort, Nikolai G. Ushakov

Comments 21 pages, 0 figures. Statistical Research Report, Department of Mathematics, University of Oslo, from December 1999, but arXiv'd in February 2026

详情
英文摘要

The performance of kernel density estimators is usually studied via Taylor expansions and asymptotic approximation arguments, in which the bandwidth parameter tends to zero with increasing sample size. In contrast, this paper focusses directly on the finite-sample situation. Informative upper bounds are derived both for the integrated and the maximal mean squared error function. Results are reached for the traditional case, where the kernel is a probability density function, under various sets of assumptions on the underlying density to be estimated. Results are also derived for the important non-conventional case of the sinc kernel, which is not integrable and also takes negative values. We pin-point ways in which the sinc-based estimator performs better than the conventional kernel estimators. When proving our results we rely on methods related to characteristic and empirical characteristic functions.

2602.20795 2026-02-25 stat.ME cs.SY eess.SY

Hawkes Identification with a Prescribed Causal Basis: Closed-Form Estimators and Asymptotics

Xinhui Rong, Girish N. Nair

Comments 21 pages, 4 figures

详情
英文摘要

Driven by the recent surge in neural-inspired modeling, point processes have gained significant traction in systems and control. While the Hawkes process is the standard model for characterizing random event sequences with memory, identifying its unknown kernels is often hindered by nonlinearity. Approaches using prescribed basis kernels have emerged to enable linear parameterization, yet they typically rely on iterative likelihood methods and lack rigorous analysis under model misspecification. This paper justifies a closed-form Least Squares identification framework for Hawkes processes with prescribed kernels. We guarantee estimator existence via the almost-sure positive definiteness of the empirical Gram matrix and prove convergence to the true parameters under correct specification, or to well-defined pseudo-true parameters under misspecification. Furthermore, we derive explicit Central Limit Theorems for both regimes, providing a complete and interpretable asymptotic theory. We demonstrate these theoretical findings through comparative numerical simulations.

2602.20681 2026-02-25 math.ST stat.TH

Statistical Inference in Causal Partial Identification with Smooth Densities

Sirui Lin, Zijun Gao, Jose Blanchet, Peter Glynn

详情
英文摘要

Many causal quantities are only partially identifiable due to the inherent missingness of potential outcomes, and the associated partial identification (PI) sets can be obtained by solving an optimal transport (OT) problem. Covariates often provide additional information about the potential outcomes and thus yield tighter PI sets, which can be obtained via conditional optimal transport (COT). However, COT-based PI set estimators are susceptible to the curse of dimensionality in the covariates and outcomes, which precludes the asymptotic normality and hinders statistical inference. In this paper, we exploit smoothness in the marginal densities of covariates and potential outcomes and develop a wavelet-based primal method for COT with multivariate outcomes and covariates. Moreover, for quadratic cost functions, we establish a stability result for COT and prove asymptotic normality of the proposed estimator. This characterization of the asymptotic distribution enables valid statistical inference for the partial identification set. Empirically, we validate the estimation and inference performance of our approach through numerical experiments in comparison with existing benchmarks.

2602.20652 2026-02-25 stat.ML cs.LG

DANCE: Doubly Adaptive Neighborhood Conformal Estimation

Brandon R. Feng, Brian J. Reich, Daniel Beaglehole, Xihaier Luo, David Keetae Park, Shinjae Yoo, Zhechao Huang, Xueyu Mao, Olcay Boz, Jungeum Kim

详情
英文摘要

The recent developments of complex deep learning models have led to unprecedented ability to accurately predict across multiple data representation types. Conformal prediction for uncertainty quantification of these models has risen in popularity, providing adaptive, statistically-valid prediction sets. For classification tasks, conformal methods have typically focused on utilizing logit scores. For pre-trained models, however, this can result in inefficient, overly conservative set sizes when not calibrated towards the target task. We propose DANCE, a doubly locally adaptive nearest-neighbor based conformal algorithm combining two novel nonconformity scores directly using the data's embedded representation. DANCE first fits a task-adaptive kernel regression model from the embedding layer before using the learned kernel space to produce the final prediction sets for uncertainty quantification. We test against state-of-the-art local, task-adapted and zero-shot conformal baselines, demonstrating DANCE's superior blend of set size efficiency and robustness across various datasets.

2602.20646 2026-02-25 math.OC cs.LG stat.ML

On the Convergence of Stochastic Gradient Descent with Perturbed Forward-Backward Passes

Boao Kong, Hengrui Zhang, Kun Yuan

Comments 34 pages

详情
英文摘要

We study stochastic gradient descent (SGD) for composite optimization problems with $N$ sequential operators subject to perturbations in both the forward and backward passes. Unlike classical analyses that treat gradient noise as additive and localized, perturbations to intermediate outputs and gradients cascade through the computational graph, compounding geometrically with the number of operators. We present the first comprehensive theoretical analysis of this setting. Specifically, we characterize how forward and backward perturbations propagate and amplify within a single gradient step, derive convergence guarantees for both general non-convex objectives and functions satisfying the Polyak--Łojasiewicz condition, and identify conditions under which perturbations do not deteriorate the asymptotic convergence order. As a byproduct, our analysis furnishes a theoretical explanation for the gradient spiking phenomenon widely observed in deep learning, precisely characterizing the conditions under which training recovers from spikes or diverges. Experiments on logistic regression with convex and non-convex regularization validate our theories, illustrating the predicted spike behavior and the asymmetric sensitivity to forward versus backward perturbations.

2602.20611 2026-02-25 stat.ML cs.LG stat.ME

Amortized Bayesian inference for actigraph time sheet data from mobile devices

Daniel Zhou, Sudipto Banerjee

Comments 40 pages, 7 figures

详情
英文摘要

Mobile data technologies use ``actigraphs'' to furnish information on health variables as a function of a subject's movement. The advent of wearable devices and related technologies has propelled the creation of health databases consisting of human movement data to conduct research on mobility patterns and health outcomes. Statistical methods for analyzing high-resolution actigraph data depend on the specific inferential context, but the advent of Artificial Intelligence (AI) frameworks require that the methods be congruent to transfer learning and amortization. This article devises amortized Bayesian inference for actigraph time sheets. We pursue a Bayesian approach to ensure full propagation of uncertainty and its quantification using a hierarchical dynamic linear model. We build our analysis around actigraph data from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study conducted by the Fielding School of Public Health in the University of California, Los Angeles. Apart from achieving probabilistic imputation of actigraph time sheets, we are also able to statistically learn about the time-varying impact of explanatory variables on the magnitude of acceleration (MAG) for a cohort of subjects.

2602.20585 2026-02-25 stat.ML cs.LG

Characterizing Online and Private Learnability under Distributional Constraints via Generalized Smoothness

Moïse Blanchard, Abhishek Shetty, Alexander Rakhlin

详情
英文摘要

Understanding minimal assumptions that enable learning and generalization is perhaps the central question of learning theory. Several celebrated results in statistical learning theory, such as the VC theorem and Littlestone's characterization of online learnability, establish conditions on the hypothesis class that allow for learning under independent data and adversarial data, respectively. Building upon recent work bridging these extremes, we study sequential decision making under distributional adversaries that can adaptively choose data-generating distributions from a fixed family $U$ and ask when such problems are learnable with sample complexity that behaves like the favorable independent case. We provide a near complete characterization of families $U$ that admit learnability in terms of a notion known as generalized smoothness i.e. a distribution family admits VC-dimension-dependent regret bounds for every finite-VC hypothesis class if and only if it is generalized smooth. Further, we give universal algorithms that achieve low regret under any generalized smooth adversary without explicit knowledge of $U$. Finally, when $U$ is known, we provide refined bounds in terms of a combinatorial parameter, the fragmentation number, that captures how many disjoint regions can carry nontrivial mass under $U$. These results provide a nearly complete understanding of learnability under distributional adversaries. In addition, building upon the surprising connection between online learning and differential privacy, we show that the generalized smoothness also characterizes private learnability under distributional constraints.

2602.20578 2026-02-25 cs.LG math.OC stat.ML

Upper-Linearizability of Online Non-Monotone DR-Submodular Maximization over Down-Closed Convex Sets

Yiyang Lu, Haresh Jadav, Mohammad Pedramfar, Ranveer Singh, Vaneet Aggarwal

详情
英文摘要

We study online maximization of non-monotone Diminishing-Return(DR)-submodular functions over down-closed convex sets, a regime where existing projection-free online methods suffer from suboptimal regret and limited feedback guarantees. Our main contribution is a new structural result showing that this class is $1/e$-linearizable under carefully designed exponential reparametrization, scaling parameter, and surrogate potential, enabling a reduction to online linear optimization. As a result, we obtain $O(T^{1/2})$ static regret with a single gradient query per round and unlock adaptive and dynamic regret guarantees, together with improved rates under semi-bandit, bandit, and zeroth-order feedback. Across all feedback models, our bounds strictly improve the state of the art.

2602.20572 2026-02-25 stat.ME math.ST stat.TH

Local Fréchet regression with toroidal predictors

Chang Jun Im, Jeong Min Jeon

Comments 52 pages, 1 figure. Submitted to Scandinavian Journal of Statistics

详情
英文摘要

We provide the first regression framework that simultaneously accommodates responses taking values in a general metric space and predictors lying on a general torus. We propose intrinsic local constant and local linear estimators that respect the underlying geometries of both the response and predictor spaces. Our local linear estimator is novel even in the case of scalar responses. We further establish their asymptotic properties, including consistency and convergence rates. Simulation studies, together with an application to real data, illustrate the superior performance of the proposed methodology.

2602.20567 2026-02-25 cs.LG math.OC stat.ML

Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs

Yifei Liang, Yan Sun, Xiaochun Cao, Li Shen

Comments 47 Pages

详情
英文摘要

Push-Sum-based decentralized learning enables optimization over directed communication networks, where information exchange may be asymmetric. While convergence properties of such methods are well understood, their finite-iteration stability and generalization behavior remain unclear due to structural bias induced by column-stochastic mixing and asymmetric error propagation. In this work, we develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm that captures the effect of directed topology. A key technical ingredient is an imbalance-aware consistency bound for Push-Sum, which controls consensus deviation through two quantities: the stationary distribution imbalance parameter $δ$ and the spectral gap $(1-λ)$ governing mixing speed. This decomposition enables us to disentangle statistical effects from topology-induced bias. We establish finite-iteration stability and optimization guarantees for both convex objectives and non-convex objectives satisfying the Polyak--Łojasiewicz condition. For convex problems, SGP attains excess generalization error of order $\tilde{\mathcal{O}}\!\left(\frac{1}{\sqrt{mn}}+\fracγ{δ(1-λ)}+γ\right)$ under step-size schedules, and we characterize the corresponding optimal early stopping time that minimizes this bound. For PŁ objectives, we obtain convex-like optimization and generalization rates with dominant dependence proportional to $κ\!\left(1+\frac{1}{δ(1-λ)}\right)$, revealing a multiplicative coupling between problem conditioning and directed communication topology. Our analysis clarifies when Push-Sum correction is necessary compared with standard decentralized SGD and quantifies how imbalance and mixing jointly shape the best attainable learning performance.

2602.20555 2026-02-25 stat.ML cs.IT cs.LG math.IT

Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,λ}$ Targets

Yanming Lai, Defeng Sun

Comments 58 pages, 1 figure

详情
英文摘要

The tremendous success of Transformer models in fields such as large language models and computer vision necessitates a rigorous theoretical investigation. To the best of our knowledge, this paper is the first work proving that standard Transformers can approximate Hölder functions $ C^{s,λ}\left([0,1]^{d\times n}\right) $$ (s\in\mathbb{N}_{\geq0},0<λ\leq1) $ under the $L^t$ distance ($t \in [1, \infty]$) with arbitrary precision. Building upon this approximation result, we demonstrate that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. It is worth mentioning that, by introducing two metrics: the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures, which facilitates future research on the generalization and optimization errors of Transformers with different structures. As intermediate results, we also derive the upper bounds for the Lipschitz constant of standard Transformers and their memorization capacity, which may be of independent interest. These findings provide theoretical justification for the powerful capabilities of Transformer models.

2602.20498 2026-02-25 stat.ME

Fast Algorithms for Exact Confidence Intervals in Randomized Experiments with Binary Outcomes

Peng Zhang

详情
英文摘要

We construct exact confidence intervals for the average treatment effect in randomized experiments with binary outcomes using sequences of randomization tests. Our approach does not rely on large-sample approximations and is valid for all sample sizes. Under a balanced Bernoulli design or a matched-pairs design, we show that exact confidence intervals can be computed using only $O(\log n)$ randomization tests, yielding an exponential reduction in the number of tests compared to brute-force. We further prove an information-theoretic lower bound showing that this rate is optimal. In contrast, under balanced complete randomization, the most efficient known procedures require $O(n\log n)$ randomization tests (Aronow et al., 2023), establishing a sharp separation between these designs. In addition, we extend our algorithm to general Bernoulli designs using $O(n^2)$ randomization tests.

2602.20457 2026-02-25 cs.LG stat.ML

Oracle-Robust Online Alignment for Large Language Models

Zimeng Li, Mudit Gaur, Vaneet Aggarwal

详情
英文摘要

We study online alignment of large language models under misspecified preference feedback, where the observed preference oracle deviates from an ideal but unknown ground-truth oracle. The online LLM alignment problem is a bi-level reinforcement problem due to the coupling between data collection and policy updates. Recently, the problem has been reduced to tractable single-level objective in the SAIL (Self-Improving Efficient Online Alignment) framework. In this paper, we introduce a pointwise oracle uncertainty set in this problem and formulate an oracle-robust online alignment objective as a worst-case optimization problem. For log-linear policies, we show that this robust objective admits an exact closed-form decomposition into the original loss function plus an explicit sensitivity penalty. We develop projected stochastic composite updates for the resulting weakly convex objective and prove $\widetilde{O}(\varepsilon^{-2})$ oracle complexity for reaching approximate stationarity.

2602.20448 2026-02-25 stat.ME stat.CO

Posterior Mode Guided Dimension Reduction for Bayesian Model Averaging in Heavy-Tailed Linear Regression

Shamriddha De, Joyee Ghosh

Comments 35 pages, 6 figures

详情
英文摘要

For large model spaces, the potential entrapment of Markov chain Monte Carlo (MCMC) based methods with spike-and-slab priors poses significant challenges in posterior computation in regression models. On the other hand, maximum a posteriori (MAP) estimation, which is a more computationally viable alternative, fails to provide uncertainty quantification. To address these problems simultaneously and efficiently, this paper proposes a hybrid method that blends MAP estimation with MCMC-based stochastic search algorithms within a heavy-tailed error framework. Under hyperbolic errors, the current work develops a two-step expectation conditional maximization (ECM) guided MCMC algorithm. In the first step, we conduct an ECM-based posterior maximization and perform variable selection, thereby identifying a reduced model space in a high posterior probability region. In the second step, we execute a Gibbs sampler on the reduced model space for posterior computation. Such a method is expected to improve the efficiency of posterior computation and enhance its inferential richness. Through simulation studies and benchmark real life examples, our proposed method is shown to exhibit several advantages in variable selection and uncertainty quantification over various state-of-the-art methods.

2602.20403 2026-02-25 cs.LG math.OC stat.ML

Wasserstein Distributionally Robust Online Learning

Guixian Chen, Salar Fattahi, Soroosh Shafiee

详情
英文摘要

We study distributionally robust online learning, where a risk-averse learner updates decisions sequentially to guard against worst-case distributions drawn from a Wasserstein ambiguity set centered at past observations. While this paradigm is well understood in the offline setting through Wasserstein Distributionally Robust Optimization (DRO), its online extension poses significant challenges in both convergence and computation. In this paper, we address these challenges. First, we formulate the problem as an online saddle-point stochastic game between a decision maker and an adversary selecting worst-case distributions, and propose a general framework that converges to a robust Nash equilibrium coinciding with the solution of the corresponding offline Wasserstein DRO problem. Second, we address the main computational bottleneck, which is the repeated solution of worst-case expectation problems. For the important class of piecewise concave loss functions, we propose a tailored algorithm that exploits problem geometry to achieve substantial speedups over state-of-the-art solvers such as Gurobi. The key insight is a novel connection between the worst-case expectation problem, an inherently infinite-dimensional optimization problem, and a classical and tractable budget allocation problem, which is of independent interest.

2602.20383 2026-02-25 stat.ME cs.LG econ.EM

Detecting and Mitigating Group Bias in Heterogeneous Treatment Effects

Joel Persson, Jurriën Bakker, Dennis Bohle, Stefan Feuerriegel, Florian von Wangenheim

详情
英文摘要

Heterogeneous treatment effects (HTEs) are increasingly estimated using machine learning models that produce highly personalized predictions of treatment effects. In practice, however, predicted treatment effects are rarely interpreted, reported, or audited at the individual level but, instead, are often aggregated to broader subgroups, such as demographic segments, risk strata, or markets. We show that such aggregation can induce systematic bias of the group-level causal effect: even when models for predicting the individual-level conditional average treatment effect (CATE) are correctly specified and trained on data from randomized experiments, aggregating the predicted CATEs up to the group level does not, in general, recover the corresponding group average treatment effect (GATE). We develop a unified statistical framework to detect and mitigate this form of group bias in randomized experiments. We first define group bias as the discrepancy between the model-implied and experimentally identified GATEs, derive an asymptotically normal estimator, and then provide a simple-to-implement statistical test. For mitigation, we propose a shrinkage-based bias-correction, and show that the theoretically optimal and empirically feasible solutions have closed-form expressions. The framework is fully general, imposes minimal assumptions, and only requires computing sample moments. We analyze the economic implications of mitigating detected group bias for profit-maximizing personalized targeting, thereby characterizing when bias correction alters targeting decisions and profits, and the trade-offs involved. Applications to large-scale experimental data at major digital platforms validate our theoretical results and demonstrate empirical performance.

2602.20371 2026-02-25 math.ST stat.ME stat.TH

Semi-parametric Bayesian inference under Neyman orthogonality

Magid Sabbagh, David A. Stephens

Comments 18 pages

详情
英文摘要

The validity of two-step or plug-in inference methods is questioned in the Bayesian framework. We study semi-parametric models where the plug-in of a non-parametrically modelled nuisance component is used. We show that when the nuisance and targeted parameters satisfy a Neyman orthogonal score property, the approach of cutting feedback through a two-step procedure is a valid way of conducting Bayesian inference. Our method relies on a non-parametric Bayesian formulation based on the Dirichlet process and the Bayesian bootstrap. We show that the marginal posterior of the targeted parameter exhibits good frequentist properties despite not accounting for the inferential uncertainty of the nuisance parameter. We adopt this approach in Bayesian causal inference problems where the nuisance propensity score model is estimated to obtain marginal inference for the treatment effect parameter, and demonstrate that a plug-in of the propensity score has a negligible effect on marginal posterior inference for the causal contrast. We investigate the absence of Neyman orthogonality and exploit our findings to show that in conventional two-step procedures, the posterior distribution converges under weaker restrictions than those needed in the frequentist sequel. For a simple family of useful scores, we demonstrate that even in the absence of Neyman orthogonality, the posterior distribution is asymptotically unchanged by the estimation of the nuisance parameter, merely provided the latter estimator is consistent.

2602.20297 2026-02-25 stat.ML cs.LG

Gap-Dependent Bounds for Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Haochen Zhang, Zhong Zheng, Lingzhou Xue

详情
英文摘要

We study gap-dependent performance guarantees for nearly minimax-optimal algorithms in reinforcement learning with linear function approximation. While prior works have established gap-dependent regret bounds in this setting, existing analyses do not apply to algorithms that achieve the nearly minimax-optimal worst-case regret bound $\tilde{O}(d\sqrt{H^3K})$, where $d$ is the feature dimension, $H$ is the horizon length, and $K$ is the number of episodes. We bridge this gap by providing the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++ (He et al., 2023). Our analysis yields improved dependencies on both $d$ and $H$ compared to previous gap-dependent results. Moreover, leveraging the low policy-switching property of LSVI-UCB++, we introduce a concurrent variant that enables efficient parallel exploration across multiple agents and establish the first gap-dependent sample complexity upper bound for online multi-agent RL with linear function approximation, achieving linear speedup with respect to the number of agents.

2602.20271 2026-02-25 cs.LG cs.AI math.OC stat.AP

Uncertainty-Aware Delivery Delay Duration Prediction via Multi-Task Deep Learning

Stefan Faulkner, Reza Zandehshahvar, Vahid Eghbal Akhlaghi, Sebastien Ouellet, Carsten Jordan, Pascal Van Hentenryck

详情
英文摘要

Accurate delivery delay prediction is critical for maintaining operational efficiency and customer satisfaction across modern supply chains. Yet the increasing complexity of logistics networks, spanning multimodal transportation, cross-country routing, and pronounced regional variability, makes this prediction task inherently challenging. This paper introduces a multi-task deep learning model for delivery delay duration prediction in the presence of significant imbalanced data, where delayed shipments are rare but operationally consequential. The model embeds high-dimensional shipment features with dedicated embedding layers for tabular data, and then uses a classification-then-regression strategy to predict the delivery delay duration for on-time and delayed shipments. Unlike sequential pipelines, this approach enables end-to-end training, improves the detection of delayed cases, and supports probabilistic forecasting for uncertainty-aware decision making. The proposed approach is evaluated on a large-scale real-world dataset from an industrial partner, comprising more than 10 million historical shipment records across four major source locations with distinct regional characteristics. The proposed model is compared with traditional machine learning methods. Experimental results show that the proposed method achieves a mean absolute error of 0.67-0.91 days for delayed-shipment predictions, outperforming single-step tree-based regression baselines by 41-64% and two-step classify-then-regress tree-based models by 15-35%. These gains demonstrate the effectiveness of the proposed model in operational delivery delay forecasting under highly imbalanced and heterogeneous conditions.

2602.20194 2026-02-25 cs.LG stat.ML

FedAvg-Based CTMC Hazard Model for Federated Bridge Deterioration Assessment

Takato Yasuno

Comments 10 pages, 4 figures, 2 tables

详情
英文摘要

Bridge periodic inspection records contain sensitive information about public infrastructure, making cross-organizational data sharing impractical under existing data governance constraints. We propose a federated framework for estimating a Continuous-Time Markov Chain (CTMC) hazard model of bridge deterioration, enabling municipalities to collaboratively train a shared benchmark model without transferring raw inspection records. Each User holds local inspection data and trains a log-linear hazard model over three deterioration-direction transitions -- Good$\to$Minor, Good$\to$Severe, and Minor$\to$Severe -- with covariates for bridge age, coastline distance, and deck area. Local optimization is performed via mini-batch stochastic gradient descent on the CTMC log-likelihood, and only a 12-dimensional pseudo-gradient vector is uploaded to a central server per communication round. The server aggregates User updates using sample-weighted Federated Averaging (FedAvg) with momentum and gradient clipping. All experiments in this paper are conducted on fully synthetic data generated from a known ground-truth parameter set with region-specific heterogeneity, enabling controlled evaluation of federated convergence behaviour. Simulation results across heterogeneous Users show consistent convergence of the average negative log-likelihood, with the aggregated gradient norm decreasing as User scale increases. Furthermore, the federated update mechanism provides a natural participation incentive: Users who register their local inspection datasets on a shared technical-standard platform receive in return the periodically updated global benchmark parameters -- information that cannot be obtained from local data alone -- thereby enabling evidence-based life-cycle planning without surrendering data sovereignty.

2602.19703 2026-02-25 econ.EM stat.ML

Testing Effect Homogeneity and Confounding in High-Dimensional Experimental and Observational Studies

Ana Armendariz, Martin Huber

详情
英文摘要

We propose a framework for testing the homogeneity of conditional average treatment effects (CATEs) across multiple experimental and observational studies. Our approach leverages multiple randomized trials to assess whether treatment effects vary with unobserved heterogeneity that differs across trials: if CATEs are homogeneous, this indicates the absence of interactions between treatment and unobservables in the mean effect. Comparing CATEs between experimental and observational data further allows evaluation of potential confounding: if the estimands coincide, there is no unobserved confounding; if they differ, deviations may arise from unobserved confounding, effect heterogeneity, or both. We extend the framework to settings with alternative identification strategies, namely instrumental variable settings and panel data with parallel trends assumptions based on differences in differences, where effects are identified only locally for subpopulations such as compliers or treated units. In these contexts, testing homogeneity is useful for assessing whether local effects can be extrapolated to the total population. We suggest a test based on double machine learning that accommodates high-dimensional covariates in a data-driven way and investigate its finite-sample performance through a simulation study. Finally, we apply the test to the International Stroke Trial (IST), a large multi-country randomized controlled trial in patients with acute ischaemic stroke that evaluated whether early treatment with aspirin altered subsequent clinical outcomes. Our methodology provides a flexible tool for both validating identification assumptions and understanding the generalizability of estimated treatment effects.

2602.18958 2026-02-25 stat.ME math.ST stat.ML stat.TH

Optimal and Structure-Adaptive CATE Estimation with Kernel Ridge Regression

Seok-Jin Kim

详情
英文摘要

We propose an optimal algorithm for estimating conditional average treatment effects (CATEs) when response functions lie in a reproducing kernel Hilbert space (RKHS). We study settings in which the contrast function is structurally simpler than the nuisance functions: (i) it lies in a lower-complexity RKHS with faster eigenvalue decay, (ii) it satisfies a source condition relative to the nuisance kernel, or (iii) it depends on a known low-dimensional covariate representation. We develop a unified two-stage kernel ridge regression (KRR) method that attains minimax rates governed by the complexity of the contrast function rather than the nuisance class, in terms of both sample size and overlap. We also show that a simple model-selection step over candidate contrast spaces and regularization levels yields an oracle inequality, enabling adaptation to unknown CATE regularity.

2601.16842 2026-02-25 math.ST stat.ML stat.TH

Parametric Mean-Field empirical Bayes in high-dimensional linear regression

Seunghyun Lee, Nabarun Deb

Comments Typos fixed

详情
英文摘要

In this paper, we consider the problem of parametric empirical Bayes estimation of an i.i.d. prior in high-dimensional Bayesian linear regression, with random design. We obtain the asymptotic distribution of the variational Empirical Bayes (vEB) estimator, which approximately maximizes a variational lower bound of the intractable marginal likelihood. We characterize a sharp phase transition behavior for the vEB estimator -- namely that it is information theoretically optimal (in terms of limiting variance) up to $p=o(n^{2/3})$ while it suffers from a sub-optimal convergence rate in higher dimensions. In the first regime, i.e., when $p=o(n^{2/3})$, we show how the estimated prior can be calibrated to enable valid coordinate-wise and delocalized inference, both under the \emph{empirical Bayes posterior} and the oracle posterior. In the second regime, we propose a debiasing technique as a way to improve the performance of the vEB estimator beyond $p=o(n^{2/3})$. Extensive numerical experiments corroborate our theoretical findings.

2512.07770 2026-02-25 stat.ML cs.LG

Distribution-informed Online Conformal Prediction

Dongjian Hu, Junxi Wu, Shu-Tao Xia, Changliang Zou

Comments ICLR2026 camera-ready version

详情
英文摘要

Conformal prediction provides a pivotal and flexible technique for uncertainty quantification by constructing prediction sets with a predefined coverage rate. Many online conformal prediction methods have been developed to address data distribution shifts in fully adversarial environments, resulting in overly conservative prediction sets. We propose Conformal Optimistic Prediction (COP), an online conformal prediction algorithm incorporating underlying data pattern into the update rule. Through estimated cumulative distribution function of non-conformity scores, COP produces tighter prediction sets when predictable pattern exists, while retaining valid coverage guarantees even when estimates are inaccurate. We establish a joint bound on coverage and regret, which further confirms the validity of our approach. We also prove that COP achieves distribution-free, finite-sample coverage under arbitrary learning rates and can converge when scores are $i.i.d.$. The experimental results also show that COP can achieve valid coverage and construct shorter prediction intervals than other baselines.

2510.12402 2026-02-25 cs.LG math.OC stat.ML

Cautious Weight Decay

Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu

详情
英文摘要

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.

2510.12356 2026-02-25 stat.ME

Variational Inference for Count Response Semiparametric Regression: A Convex Solution

Virginia Murru, Matt P. Wand

详情
英文摘要

We develop a version of variational inference for Bayesian count response regression-type models that possesses attractive attributes such as convexity and closed form updates. The convex solution aspect entails numerically stable fitting algorithms, whilst the closed form aspect makes the methodology fast and easy to implement. The essence of the approach is the use of Pólya-Gamma augmentation of a Negative Binomial likelihood, a finite-valued prior on the shape parameter and the structured mean field variational Bayes paradigm. The approach applies to general count response situations. For concreteness, we focus on generalized linear mixed models within the semiparametric regression class of models. Real-time fitting is also described.

2509.21895 2026-02-25 cs.LG math.FA math.RT stat.ML

Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs

Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda

Journal ref ICLR 2026

详情
英文摘要

We derive a new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and reproducing kernel Hilbert spaces (RKHSs). The proposed bound describes why the models with high-rank weight matrices generalize well. Although there are existing bounds that attempt to describe this phenomenon, these existing bounds can be applied to limited types of models. We introduce an algebraic representation of neural networks and a kernel function to construct an RKHS to derive a bound for a wider range of realistic models. This work paves the way for the Koopman-based theory for Rademacher complexity bounds to be valid for more practical situations.

2507.05306 2026-02-25 stat.ML cs.AI cs.LG math.ST stat.TH

Enjoying Non-linearity in Multinomial Logistic Bandits: A Minimax-Optimal Algorithm

Pierre Boudart, Pierre Gaillard, Alessandro Rudi

详情
英文摘要

We consider the multinomial logistic bandit problem in which a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). They introduced a problem-dependent constant $κ_* \geq 1$ that may be exponentially large in some problem parameters and which is captured by the derivative of the sigmoid function. It encapsulates the non-linearity and improves existing regret guarantees over $T$ rounds from $\smash{O(d\sqrt{T})}$ to $\smash{O(d\sqrt{T/κ_*})}$, where $d$ is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework with a finite action space, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To achieve this, we extend the definition of $ κ_* $ to the multinomial setting and propose an efficient algorithm that leverages the problem's non-linearity. Our method yields a problem-dependent regret bound of order $ \smash{\widetilde{\mathcal{O}}( R d \sqrt{ {KT}/{κ_*}} ) } $, where $R$ denotes the norm of the vector of rewards and $K$ is the number of outcomes. This improves upon the best existing guarantees of order $ \smash{\widetilde{\mathcal{O}}( RdK \sqrt{T} )}$. Moreover, we provide a matching $\smash{ Ω(dR\sqrt{KT/κ_*})}$ lower-bound, showing that our algorithm is minimax-optimal and that our definition of $κ_*$ is optimal.

2506.16229 2026-02-25 stat.ME

Diversifying Conformal Selections

Yash Nair, Ying Jin, James Yang, Emmanuel Candes

详情
英文摘要

When selecting from a list of potential candidates, it is important to ensure not only that the selected items are of high quality, but also that they are sufficiently dissimilar so as to both avoid redundancy and to capture a broader range of desirable properties. In drug discovery, scientists aim to select potent drugs from a library of unsynthesized candidates, but recognize that it is wasteful to repeatedly synthesize highly similar compounds. In job hiring, recruiters may wish to hire candidates who will perform well on the job, while also considering factors such as socioeconomic background, prior work experience, gender, or race. We study the problem of using any prediction model to construct a maximally diverse selection set of candidates while controlling the false discovery rate (FDR) in a model-free fashion. Our method, diversity-aware conformal selection (DACS), achieves this by designing a general optimization procedure to construct a diverse selection set subject to a simple constraint involving conformal e-values which depend on carefully chosen stopping times. The key idea of DACS is to use optimal stopping theory to adaptively choose the set of e-values which (approximately) maximizes the expected diversity measure. We give an example diversity metric for which our procedure can be run exactly and efficiently. We also develop a number of computational heuristics which greatly improve its running time for generic diversity metrics. We demonstrate the empirical performance of our method both in simulation and on job hiring and drug discovery datasets.

2506.13792 2026-02-25 cs.AI cs.CL cs.LG stat.AP

ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution

Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson, Tangrui Li, Pétur Húni Björnsson, Eiríkur Smári Sigurðarson, Jilles S. Dibangoye

详情
英文摘要

We introduce \textbf{ICE-ID}, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703--1920), with 226,864 expert-curated person identifiers. ICE-ID combines hierarchical geography (farm$\to$parish$\to$district$\to$county), patronymic naming conventions, sparse kinship links (partner, father, mother), and multi-decadal temporal drift -- challenges not captured by standard product-matching or citation datasets. This paper presents an artifact-backed analysis of temporal coverage, missingness, identifier ambiguity, candidate-generation efficiency, and cluster distributions, and situates ICE-ID against classical ER benchmarks (Abt--Buy, Amazon--Google, DBLP--ACM, DBLP--Scholar, Walmart--Amazon, iTunes--Amazon, Beer, Fodors--Zagats). We also define a deployment-faithful temporal OOD protocol and release the dataset, splits, regeneration scripts, analysis artifacts, and a dashboard for interactive exploration. Baseline model comparisons and end-to-end ER results are reported in the companion methods paper.

2505.24811 2026-02-25 math.ST stat.ME stat.TH

Locally Differentially Private Two-Sample Testing

Alexander Kent, Thomas B. Berrett, Yi Yu

Comments 76 pages, 7 figures, 1 table

详情
英文摘要

We consider the problem of two-sample testing under a local differential privacy constraint where a permutation procedure is used to calibrate the tests. We develop testing procedures which are optimal up to logarithmic factors, for general discrete distributions and continuous distributions subject to a smoothness constraint. Both non-interactive and interactive tests are considered, and we show allowing interactivity results in an improvement in the minimax separation rates. Our results show that permutation procedures remain feasible in practice under local privacy constraints, despite the inability to permute the non-private data directly and only the private views. Further, through a refined theoretical analysis of the permutation procedure, we are able to avoid an equal sample size assumption which has been made in the permutation testing literature regardless of the presence of the privacy constraint. Lastly, we conduct numerical experiments which demonstrate the performance of our proposed test and verify the theoretical findings, especially the improved performance enabled by allowing interactivity.

2504.13961 2026-02-25 cs.LG cs.AI stat.ML

CONTINA: Confidence Interval for Traffic Demand Prediction with Coverage Guarantee

Chao Yang, Xiannan Huang, Shuhan Qiu, Yan Cheng

Comments Accepted in Transportation Research Part C: Emerging Technologies

详情
英文摘要

Accurate short-term traffic demand prediction is critical for the operation of traffic systems. Besides point estimation, the confidence interval of the prediction is also of great importance. Many models for traffic operations, such as shared bike rebalancing and taxi dispatching, take into account the uncertainty of future demand and require confidence intervals as the input. However, existing methods for confidence interval modeling rely on strict assumptions, such as unchanging traffic patterns and correct model specifications, to guarantee enough coverage. Therefore, the confidence intervals provided could be invalid, especially in a changing traffic environment. To fill this gap, we propose an efficient method, CONTINA (Conformal Traffic Intervals with Adaptation) to provide interval predictions that can adapt to external changes. By collecting the errors of interval during deployment, the method can adjust the interval in the next step by widening it if the errors are too large or shortening it otherwise. Furthermore, we theoretically prove that the coverage of the confidence intervals provided by our method converges to the target coverage level. Experiments across four real-world datasets and prediction models demonstrate that the proposed method can provide valid confidence intervals with shorter lengths. Our method can help traffic management personnel develop a more reasonable and robust operation plan in practice. And we release the code, model and dataset in \href{ https://github.com/xiannanhuang/CONTINA/}{ Github}.

2503.00229 2026-02-25 cs.LG math.OC stat.ML

Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster

Sharan Vaswani, Reza Babanezhad

详情
英文摘要

Armijo line-search (Armijo-LS) is a standard method to set the step-size for gradient descent (GD). For smooth functions, Armijo-LS alleviates the need to know the global smoothness constant L and adapts to the ``local'' smoothness, enabling GD to converge faster. Existing theoretical analyses show that GD with Armijo-LS (GD-LS) can result in constant factor improvements over GD with a 1/L step-size (denoted as GD(1/L)). We strengthen these results and show that if the objective function satisfies a certain non-uniform smoothness condition, GD-LS can result in a faster convergence rate than GD(1/L). In particular, we prove that for convex objectives corresponding to logistic regression and multi-class classification, GD-LS can converge to the optimum at a linear rate, and hence improves over the sublinear convergence of GD(1/L). Furthermore, for non-convex objectives satisfying gradient domination (e.g., those corresponding to the softmax policy gradient in RL or generalized linear models with a logistic link function), GD-LS can match the fast convergence of algorithms tailored for these specific settings. Finally, we analyze the convergence of stochastic GD with a stochastic line-search on convex losses under the interpolation assumption.

2501.18183 2026-02-25 math.OC cs.CC cs.LG stat.ML

Decentralized Projection-free Online Upper-Linearizable Optimization with Applications to DR-Submodular Optimization

Yiyang Lu, Mohammad Pedramfar, Vaneet Aggarwal

Journal ref Transactions on Machine Learning Research, 2025

详情
英文摘要

We introduce a novel framework for decentralized projection-free optimization, extending projection-free methods to a broader class of upper-linearizable functions. Our approach leverages decentralized optimization techniques with the flexibility of upper-linearizable function frameworks, effectively generalizing traditional DR-submodular function optimization. We obtain the regret of $O(T^{1-θ/2})$ with communication complexity of $O(T^θ)$ and number of linear optimization oracle calls of $O(T^{2θ})$ for decentralized upper-linearizable function optimization, for any $0\le θ\le 1$. This approach allows for the first results for monotone up-concave optimization with general convex constraints and non-monotone up-concave optimization with general convex constraints. Further, the above results for first order feedback are extended to zeroth order, semi-bandit, and bandit feedback.

2501.10538 2026-02-25 cs.LG math.ST stat.ML stat.TH

Universality of Benign Overfitting in Binary Linear Classification

Ichiro Hashimoto, Stanislav Volgushev, Piotr Zwiernik

Comments 83 pages, 10 figures; two sections added in the supplement, added clarifications, fixed minor typos

详情
英文摘要

The practical success of deep learning has led to the discovery of several surprising phenomena. One of these phenomena, that has spurred intense theoretical research, is ``benign overfitting'': deep neural networks seem to generalize well in the over-parametrized regime even though the networks show a perfect fit to noisy training data. It is now known that benign overfitting also occurs in various classical statistical models. For linear maximum margin classifiers, benign overfitting has been established theoretically in a class of mixture models with very strong assumptions on the covariate distribution. However, even in this simple setting, many questions remain open. For instance, most of the existing literature focuses on the noiseless case where all true class labels are observed without errors, whereas the more interesting noisy case remains poorly understood. We provide a comprehensive study of benign overfitting for linear maximum margin classifiers. We discover a phase transition in test error bounds for the noisy model which was previously unknown and provide some geometric intuition behind it. We further considerably relax the required covariate assumptions in both, the noisy and noiseless case. Our results demonstrate that benign overfitting of maximum margin classifiers holds in a much wider range of scenarios than was previously known and provide new insights into the underlying mechanisms.

2501.04330 2026-02-25 stat.ME stat.ML

Neural Parameter Estimation with Incomplete Data

Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Noel Cressie, Raphaël Huser

详情
英文摘要

Advances in artificial intelligence (AI) and deep learning have led to neural networks being used to generate lightning-speed answers to complex science questions, paintings in the style of Monet, or stories like those of Twain. Leveraging their computational speed and flexibility, neural networks are also being used to facilitate fast, likelihood-free statistical inference. However, it is not straightforward to use neural networks with data that for various reasons are incomplete, which precludes their use in many applications. A recently proposed approach to remedy this issue uses an appropriately padded data vector and a vector that encodes the missingness pattern as input to a neural network. While computationally efficient, this "masking" approach is not robust to the missingness mechanism and can result in statistically inefficient inferences. Here, we propose an alternative approach that is based on the Monte Carlo expectation-maximization (EM) algorithm. Our EM approach is likelihood-free, substantially faster than the conventional EM algorithm as it does not require numerical optimization at each iteration, and more statistically efficient than the masking approach. This research addresses a prototypical problem that asks how improvements could be made in AI by introducing Bayesian statistical thinking. We compare the two approaches to missingness using simulated incomplete data from a variety of spatial models. The utility of the methodology is shown on Arctic sea-ice data, analyzed using a novel hidden Potts model with an intractable likelihood.

2412.03723 2026-02-25 stat.AP stat.ME

Bayesian Perspective for Orientation Determination in Cryo-EM with Application to Structural Heterogeneity Analysis

Sheng Xu, Amnon Balanov, Amit Singer, Tamir Bendory

Comments 33 pages, 6 figures

详情
英文摘要

Accurate orientation estimation is a crucial component of 3D molecular structure reconstruction, both in single-particle cryo-electron microscopy (cryo-EM) and in the increasingly popular field of cryo-electron tomography (cryo-ET). The dominant approach, which involves searching for the orientation that maximizes cross-correlation relative to given templates, is sub-optimal, particularly under low signal-to-noise conditions. In this work, we propose a Bayesian framework for more accurate and flexible orientation estimation, with the minimum mean square error (MMSE) estimator serving as a key example. Through simulations, we demonstrate that the MMSE estimator consistently outperforms the cross-correlation-based method, especially in challenging low signal-to-noise scenarios, and we provide a theoretical framework that supports these improvements. When incorporated into iterative refinement algorithms in the 3D reconstruction pipeline, the MMSE estimator markedly improves reconstruction accuracy, reduces model bias, and enhances robustness to the ``Einstein from Noise'' artifact. Crucially, we demonstrate that orientation estimation accuracy has a decisive effect on downstream structural heterogeneity analysis. In particular, integrating the MMSE-based pose estimator into frameworks for continuous heterogeneity recovery yields accuracy improvements approaching those obtained with ground-truth poses, establishing MMSE-based pose estimation as a key enabler of high-fidelity conformational landscape reconstruction. These findings indicate that the proposed Bayesian framework could substantially advance cryo-EM and cryo-ET by enhancing the accuracy, robustness, and reliability of 3D molecular structure reconstruction, thereby facilitating deeper insights into complex biological systems.

2411.17224 2026-02-25 math.ST stat.ME stat.TH

Doubly robust estimation with functional outcomes missing at random

Xijia Liu, Kreske Felix Ecker, Lina Schelin, Xavier de Luna

Comments 23 pages incl appendices

详情
英文摘要

We present and study semi-parametric estimators for the mean of functional outcomes in situations where some of these outcomes are missing and covariate information is available on all units. Assuming that the missingness mechanism depends only on the covariates (missing at random assumption), we present two estimators for the functional mean parameter, using working models for the functional outcome given the covariates, and the probability of missingness given the covariates. We contribute by establishing that both these estimators have Gaussian processes as limiting distributions and explicitly give their covariance functions. One of the estimators is double robust in the sense that the limiting distribution holds whenever at least one of the nuisance models is correctly specified. These results allow us to present simultaneous confidence bands for the mean function with asymptotically guaranteed coverage. A Monte Carlo study shows the finite sample properties of the proposed functional estimators and their associated simultaneous inference. The use of the method is illustrated in an application where the mean of counterfactual outcomes is targeted.

2410.16106 2026-02-25 stat.ML cs.LG

Statistical Inference for Temporal Difference Learning with Linear Function Approximation

Weichen Wu, Gen Li, Yuting Wei, Alessandro Rinaldo

详情
英文摘要

We investigate the statistical properties of Temporal Difference (TD) learning with Polyak-Ruppert averaging, arguably one of the most widely used algorithms in reinforcement learning, for the task of estimating the parameters of the optimal linear approximation to the value function. Assuming independent samples, we make three theoretical contributions that improve upon the current state-of-the-art results: (i) we establish refined high-dimensional Berry-Esseen bounds over the class of convex sets, achieving faster rates than the best known results, and (ii) we propose and analyze a novel, computationally efficient online plug-in estimator of the asymptotic covariance matrix; (iii) we derive sharper high probability convergence guarantees that depend explicitly on the asymptotic variance and hold under weaker conditions than those adopted in the literature. These results enable the construction of confidence regions and simultaneous confidence intervals for the linear parameters of the value function approximation, with guaranteed finite-sample coverage. We demonstrate the applicability of our theoretical findings through numerical experiments.

2408.10118 2026-02-25 math.ST stat.TH

Local Fréchet regression with circular predictors

Chang Jun Im, Jeong Min Jeon

Comments The case for circular predictors is containted in the case for spherical preditors

详情
英文摘要

Fréchet regression extends the principles of linear regression to accommodate responses valued in generic metric spaces. While this approach has primarily focused on exploring relationships between Euclidean predictors and non-Euclidean responses, our work introduces a novel statistical method for handling random objects with circular predictors. We concentrate on local constant and local linear Fréchet regression, providing rigorous proofs for the upper bounds of both bias and stochastic deviation of the estimators under mild conditions. This research lays the groundwork for broadening the application of Fréchet regression to scenarios involving non-Euclidean covariates, thereby expanding its utility in complex data analysis.

2408.07231 2026-02-25 stat.ME

Estimating the False Discovery Rate of Variable Selection

Yixiang Luo, William Fithian, Lihua Lei

详情
英文摘要

We introduce a generic estimator for the false discovery rate of any model selection procedure, in common statistical modeling settings including the Gaussian linear model, Gaussian graphical model, and model-X setting. We prove that our method has a conservative (non-negative) bias in finite samples under standard statistical assumptions, and provide a bootstrap method for assessing its standard error. For methods like the Lasso, forward-stepwise regression, and the graphical Lasso, our estimator serves as a valuable companion to cross-validation, illuminating the tradeoff between prediction error and variable selection accuracy as a function of the model complexity parameter.

2406.04653 2026-02-25 stat.ME cs.NA math.NA stat.ML

Variational Markov chain mixtures with automatic component selection

Christopher E. Miles, Robert J. Webber

Comments v2: changed title, streamlined early sections

Journal ref Statistics and Computing (2026) 36:90

详情
英文摘要

Markov state modeling has gained popularity in various scientific fields since it reduces complex time-series data sets into transitions between a few states. Yet common Markov state modeling frameworks assume a single Markov chain describes the data, so they suffer from an inability to discern heterogeneities. As an alternative, this paper models time-series data using a mixture of Markov chains, and it automatically determines the number of mixture components using the variational expectation-maximization algorithm.Variational EM simultaneously identifies the number of Markov chains and the dynamics of each chain without expensive model comparisons or posterior sampling. As a theoretical contribution, this paper identifies the natural limits of Markov state mixture modeling by proving a lower bound on the classification error. It then presents numerical experiments where variational EM achieves performance consistent with the theoretically optimal error scaling. The experiments are based on synthetic and observational data sets including Last.fm music listening, ultramarathon running, and gene expression. In each of the three data sets, variational EM leads to the identification of meaningful heterogeneities.

2403.19994 2026-02-25 stat.ME

Supervised Bayesian joint graphical model for simultaneous network estimation and subgroup identification

Xing Qin, Xu Liu, Shuangge Ma, Mengyun Wu

详情
英文摘要

Heterogeneity is a fundamental characteristic of cancer. To accommodate heterogeneity, subgroup identification has been extensively studied and broadly categorized into unsupervised and supervised analysis. Compared to unsupervised analysis, supervised approaches potentially hold greater clinical implications. Under the unsupervised analysis framework, several methods focusing on network-based subgroup identification have been developed, offering more comprehensive insights than those restricted to mean, variance, and other simplistic distributions by incorporating the interconnections among variables. However, research on supervised network-based subgroup identification remains limited. In this study, we develop a novel supervised Bayesian graphical model (SBJGM) for jointly identifying multiple heterogeneous networks and subgroups. In the proposed model, heterogeneity is not only reflected in molecular data but also associated with a clinical outcome, and a novel similarity prior is introduced to effectively accommodate similarities among the networks of different subgroups, significantly facilitating clinically meaningful biological network construction and subgroup identification. The consistency properties of the estimates are rigorously established, and an efficient algorithm is developed. Extensive simulation studies and a real-world application to The Cancer Genome Atlas (TCGA) data are conducted, which demonstrate the advantages of the proposed approach in terms of both subgroup and network identification.

2401.11046 2026-02-25 econ.EM stat.ME

Information Based Inference in Models with Set-Valued Predictions and Misspecification

Hiroaki Kaido, Francesca Molinari

详情
英文摘要

This paper proposes an information-based inference method for partially identified parameters in incomplete models that is valid both when the model is correctly specified and when it is misspecified. Key features of the method are: (i) it is based on minimizing a suitably defined Kullback-Leibler information criterion that accounts for incompleteness of the model and delivers a non-empty pseudo-true set; (ii) it is computationally tractable; (iii) its implementation is the same for both correctly and incorrectly specified models; (iv) it exploits all information provided by variation in discrete and continuous covariates; (v) it relies on Rao's score statistic, which is shown to be asymptotically pivotal.