arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.24263 2026-03-02 stat.ML cs.LG

Active Bipartite Ranking with Smooth Posterior Distributions

James Cheshire, Stephan Clémençon

Journal ref Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research (PMLR), volume 258, pages 2044-2052, 2025

In this article, bipartite ranking, a statistical learning problem involved in many applications and widely studied in the passive context, is approached in a much more general active setting than the discrete one previously considered in the literature. While the latter assumes that the conditional distribution is piecewise constant, the framework we develop can handle continuous conditional distributions, provided that they fulfill a Hölder smoothness constraint. We first show that a naive approach, which discretises at a uniform level fixed a priori and then applies the active strategy designed for the discrete setting, generally fails. Instead, we propose a novel algorithm, referred to as smooth-rank and designed for the continuous setting, which aims to minimise the distance between the ROC curve of the estimated ranking rule and the optimal one w.r.t. the $\sup$ norm. We show that, for a fixed confidence level $ε>0$ and probability $δ\in (0,1)$, smooth-rank is PAC$(ε,δ)$. In addition, we provide a problem-dependent upper bound on the expected sampling time of smooth-rank and establish a problem-dependent lower bound on the expected sampling time of any PAC$(ε,δ)$ algorithm. Beyond the theoretical analysis, numerical results are presented, providing solid empirical evidence of the performance of the proposed algorithm, which compares favorably with alternative approaches.

2602.24261 2026-03-02 stat.AP

Quantifying Robustness to Unmeasured Confounding in Time-Varying Treatment Confounder Settings: An Extension of E-value Approach

Md. Niamul Islam Sium

Comments Under Review

Background: The E-value has become widely used for assessing robustness to unmeasured confounding in observational studies, but the original framework was developed for single time-point exposure-outcome settings. This study extends the E-value methodology to longitudinal settings with time-varying treatments and confounders, where treatment-confounder feedback occurs. Methods: The combined bias factor was extended to account for unmeasured confounding at multiple time points, with three reporting scenarios presented: equal bias distribution across time points, confounding at a single time point, and a general case visualizing all possible confounder strength combinations. Results: In simulations with an observed risk ratio of 1.73, unmeasured confounders with 1.96-fold associations at each time point could nullify the effect under equal distribution, substantially lower than the single time-point E-value of 2.85. Re-analysis of a published insulin resistance and cardiovascular disease study yielded similar patterns, with time-varying E-values of 1.63 at each time point compared to the originally reported 2.09. Conclusions: Longitudinal studies may be more vulnerable to unmeasured confounding than single time-point E-values suggest. This extension provides accessible tools for transparent sensitivity analysis in time-varying settings while preserving the simplicity and minimal assumptions that make E-values widely applicable.
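
The simulated figures quoted in this abstract can be checked with a short sketch. The single time-point formula, E = RR + sqrt(RR (RR - 1)), is the standard E-value. The function `equal_split_e_value` is a hypothetical helper and rests on an assumption that the abstract does not state: that there are two time points and that the total bias factor is the product of identical per-time-point bias factors. Under that assumption the numbers 2.85 and 1.96 are reproduced, but this is not necessarily the author's exact construction.

```python
import math


def e_value(rr: float) -> float:
    """Standard single time-point E-value for an observed risk ratio."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr  # protective effects: work with the inverse ratio
    return rr + math.sqrt(rr * (rr - 1.0))


def equal_split_e_value(rr: float, n_times: int) -> float:
    """Hypothetical per-time-point E-value under equal bias distribution.

    ASSUMPTION (not taken from the abstract): the combined bias factor is
    the product of n_times identical per-time-point bias factors, so each
    factor only needs to reach rr ** (1 / n_times) to nullify the effect.
    """
    if n_times < 1:
        raise ValueError("n_times must be at least 1")
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr
    per_time_bias = rr ** (1.0 / n_times)
    return e_value(per_time_bias)


if __name__ == "__main__":
    observed_rr = 1.73
    print(round(e_value(observed_rr), 2))
    print(round(equal_split_e_value(observed_rr, 2), 2))
```

With `n_times=1` the helper reduces to the classical E-value, which makes the comparison between the single time-point and time-varying cases direct.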

2602.24234 2026-03-02 stat.AP

Stability of relaxed calibration

Nicholas T. Longford

Comments 30 pages, 11 figures

Estimation of the population total of a variable can be improved by calibration on a set of auxiliary variables. It is difficult to establish that such a set of variables is sufficient, that estimation could not be improved by calibration on any further variables. We address this issue by finding an upper bound for the change of the calibration estimate of the population total of a variable when the auxiliary information is supplemented by another variable for which the population total is known. This upper bound can be interpreted as a measure of sensitivity of the estimate to unavailable auxiliary information and considered as a factor in deciding whether to seek further data sources that would be included in calibration.

2602.24230 2026-03-02 stat.ML cs.LG

A Variational Estimator for $L_p$ Calibration Errors

Eugène Berta, Sacha Braun, David Holzmüller, Francis Bach, Michael I. Jordan

Calibration, the problem of ensuring that predicted probabilities align with observed class frequencies, is a basic desideratum for reliable prediction with machine learning systems. Calibration error is traditionally assessed via a divergence function, using the expected divergence between predictions and empirical frequencies. Accurately estimating this quantity is challenging, especially in the multiclass setting. Here, we show how to extend a recent variational framework for estimating calibration errors beyond divergences induced by proper losses, to cover a broad class of calibration errors induced by $L_p$ divergences. Our method can separate over- and under-confidence and, unlike non-variational approaches, avoids overestimation. We provide extensive experiments and integrate our code in the open-source package probmetrics (https://github.com/dholzmueller/probmetrics) for evaluating calibration errors.

2602.24165 2026-03-02 math.ST stat.ML stat.TH

Hypothesis Testing over Observable Regimes in Singular Models

Sean Plummer

Comments 16 pages, 4 figures. Structural classification of hypothesis testability in singular statistical models, with numerical illustrations in Gaussian mixture models and reduced-rank regression

Hypothesis testing in singular statistical models is often regarded as inherently problematic due to non-identifiability and degeneracy of the Fisher information. We show that the fundamental obstruction to testing in such models is not singularity itself, but the formulation of hypotheses on non-identifiable parameter quantities. Testing is inherently a problem in distribution space: if two hypotheses induce overlapping subsets of the model class, then no uniformly consistent test exists. We formalize this overlap obstruction and show that hypotheses depending on non-identifiable parameter functions necessarily fail in this sense. In contrast, hypotheses formulated over identifiable observables (quantities that are determined by the induced distribution) reduce entirely to classical testing theory. When the corresponding distributional regimes are separated in Hellinger distance, uniformly consistent tests exist and posterior contraction follows from standard testing-based arguments. Near singular boundaries, separation may collapse locally, leading to scale-dependent detectability governed jointly by sample size and distance to the singular stratum. We illustrate these phenomena in Gaussian mixture models and reduced-rank regression, exhibiting both untestable non-identifiable hypotheses and classically testable identifiable ones. The results provide a structural classification of which hypotheses in singular models are statistically meaningful.

2602.24131 2026-03-02 stat.ME stat.ML

Efficient Targeted Maximum Likelihood Estimators for Two-Phase Design Problems

Sky Qiu, Susan Gruber, Pamela A. Shaw, Brian D. Williamson, Mark J. van der Laan

In a typical two-phase design, a random sample is drawn from the target population in phase 1, during which only a subset of variables is collected. In phase 2, a subsample of the phase-1 cohort is selected, and additional variables are measured. This setting induces a coarsened data structure on the data from the second phase. We assume coarsening at random, that is, the phase-2 sampling mechanism depends only on variables fully observed. We review existing estimators, including the generalized raking estimator and the inverse probability of censoring weighted targeted maximum likelihood estimation (IPCW-TMLE) along with its extensions that also target the phase-2 sampling mechanism to improve efficiency. We further introduce a new class of estimators constructed within the TMLE framework that are asymptotically equivalent.

2602.24127 2026-03-02 stat.AP

Advancing Evidence Generation in Biomedical Research Using Natural Hermite and Propensity Score Indices: Applications to External Control Arms

Javier Cabrera, Berhanu Alemayehu, Demissie Alemayehu, Sofia Weigle

When it is not feasible to conduct randomized controlled trials (RCTs), the use of external control arms based on real-world data (RWD) may be a viable option. However, challenges arising from data heterogeneity must be addressed to ensure the reliability of trial results. We consider the use of Natural Hermite and propensity score indices to facilitate robust comparisons between RCTs and RWD studies. Illustrations are provided on the implementation and performance of the underlying algorithms using simulated data, as well as synthetic data from a clinical trial and RWD.

2602.24083 2026-03-02 cs.LG math.PR stat.ML

Neural Diffusion Intensity Models for Point Process Data

Xinlong Du, Harsha Honnappa, Vinayak Rao

Cox processes model overdispersed point process data via a latent stochastic intensity, but both nonparametric estimation of the intensity model and posterior inference over intensity paths are typically intractable, relying on expensive MCMC methods. We introduce Neural Diffusion Intensity Models, a variational framework for Cox processes driven by neural SDEs. Our key theoretical result, based on enlargement of filtrations, shows that conditioning on point process observations preserves the diffusion structure of the latent intensity with an explicit drift correction. This guarantees the variational family contains the true posterior, so that ELBO maximization coincides with maximum likelihood estimation under sufficient model capacity. We design an amortized encoder architecture that maps variable-length event sequences to posterior intensity paths by simulating the drift-corrected SDE, replacing repeated MCMC runs with a single forward pass. Experiments on synthetic and real-world data demonstrate accurate recovery of latent intensity dynamics and posterior paths, with orders-of-magnitude speedups over MCMC-based methods.

2602.24038 2026-03-02 stat.AP

Bayesian Profile Regression using Variational Inference to Identify Clusters of Multiple Long-Term Conditions Conditioning on Mortality in Population-Scale Data

James Rafferty, Keith R Abrams, Munir Pirmohamed, Mark Davies, Rhiannon K Owen

Multiple long-term conditions (MLTC) are increasingly observed in clinical practice globally. Clustering methods to group diseases into commonly co-occurring clusters have been of interest for further understanding of how MLTC group together and their associated impact on patient outcomes. However, such approaches require large, often population-scale datasets. Bayesian Profile Regression (BPR) is a statistical model that combines a Dirichlet Process Mixture model with a hierarchical regression model, in order to form clusters of items conditional on covariates and an outcome of interest. We developed a BPR model using full-rank Stochastic Variational Inference (SVI) for application in large-scale data. We assessed its performance using simulation studies comparing fits using the No-U-Turn Sampler (NUTS) and full-rank SVI. We then fit a BPR model to find clusters of MLTC in population-scale data held in the Secure Anonymised Information Linkage (SAIL) databank. We found results from full-rank SVI compared well with results from NUTS in a simulation study, and the improved fitting performance allowed for fitting models in population-scale datasets. There were 1,296,463 individuals in our electronic health record (EHR) cohort. The clustering model was conditioned on age at cohort entry, socioeconomic deprivation and sex, with mortality as the outcome. We used the Elixhauser comorbidity index disease definitions, and found there were 33 disease clusters. We found that clusters featuring metastatic cancer and cardiovascular diseases, such as congestive heart failure, were most strongly associated with the probability of mortality. Our findings show that SVI can be a useful and accurate method for fitting Bayesian models, especially when the dataset size would make Monte Carlo methods prohibitively time-consuming or impossible.

2602.24004 2026-03-02 stat.AP

The Best Metal-Grabbing Games Ever: How a Tiny Nation Won the Most Medals (By Far)

Nils Lid Hjort

Comments 15 pages, 6 figures; written up three days after the 2026 Winter Olympics. Readers are advised to print out the report and then to pencil in vertical lines and bars for Appendix C, page 13

For three Winter Olympics in a row, tiny nation Norway has out-medalled everyone else, in 2026 winning 18 golds, 12 silvers, 11 bronzes, i.e. 41 medals, compared to e.g. 12 + 12 + 9 = 33 for the USA, 10 + 6 + 14 = 30 for home team Italy, 8 + 10 + 7 = 26 for powerhouse Germany, etc. Never before have we [pluralis proudiensis] or anyone else won as many as 41 medals at a Winter Olympics. But how impressive is this, really, when we factor in that the number of events has increased so drastically?

2602.23987 2026-03-02 stat.ME

A Unified and Computationally Efficient Non-Gaussian Statistical Modeling Framework

David Bolin, Xiaotian Jin, Alexandre B. Simas, Jonas Wallin

Datasets that exhibit non-Gaussian characteristics are common in many fields, while the current modeling framework and available software for non-Gaussian models are limited. We introduce Linear Latent Non-Gaussian Models (LLnGMs), a unified and computationally efficient statistical modeling framework that extends a class of latent Gaussian models to allow for latent non-Gaussian processes. The framework unifies several popular models, from simple temporal models to complex spatial-temporal and multivariate models, facilitating natural non-Gaussian extensions. Computationally efficient Bayesian inference, with theoretical guarantees, is developed based on stochastic gradient descent estimation. The R package ngme2, which implements the framework, is presented and demonstrated through a wide range of applications including novel non-Gaussian spatial and spatio-temporal models.

2602.23943 2026-03-02 stat.ME

A flexible approach to sequential prediction under intervention

Matthew Sperrin, Bowen Jiang, Joyce Huang, Niels Peek, Alexander Pate

We propose a causal predictive framework for estimating risk under preventative interventions. The Unexposed Mediator Model maintains mediators that are also predictors at their unexposed level, removing double counting of intervention effects at follow-up visits. The Modifiable Risk Factor Model handles multiple interventions flexibly by modelling their effects via mediators that are also predictors, assuming a known causal structure. The Two Component Model combines a predictive baseline model with an intervention model to improve predictive performance. We illustrate the framework in primary prevention of cardiovascular disease. The proposed models allow arbitrary interventions to be evaluated within a prediction-under-intervention framework, with causally consistent risk estimates across repeated visits. Limitations, which warrant further investigation, include reliance on predictor values from an arbitrary first visit, requirements for causal structural knowledge, and a consistency assumption that interventions with identical effects on predictors have identical effects on outcomes.

2602.23911 2026-03-02 stat.ME stat.CO

Online Bootstrap Inference for the Trend of Nonstationary Time Series

Thomas Nagler, Tobias Brock, Nicolai Palm

This article proposes an online bootstrap scheme for nonparametric level estimation in nonstationary time series. Our approach applies to a broad class of level estimators expressible as weighted sample averages over time windows, including exponential smoothing methods and moving averages. The bootstrap procedure is motivated by asymptotic arguments and provides well-calibrated uniform-in-time coverage, enabling scalable uncertainty quantification in streaming or large-scale time-series settings. This makes the method suitable for tasks such as adaptive anomaly detection, online monitoring, or streaming A/B testing. Simulation studies demonstrate good finite-sample performance of our method across a range of nonstationary scenarios. In summary, the method offers a practical resampling framework that complements online trend estimation with reliable statistical inference.

2602.23909 2026-03-02 stat.ME stat.CO

Automated selection of r for stationary and nonstationary models for r largest order statistics

Yire Shin, Jihong Park, Jeong-Soo Park

In the generalized extreme value model for the r largest order statistics, denoted by rGEV, the selection of r is critical. The existing entropy difference test for selecting r is applicable to large samples. Another existing method (the score test with parametric bootstrap) is applicable to small samples, but is computationally demanding. To address this problem for small samples, we propose a new method using a sequence of goodness-of-fit tests based on the conditional cumulative distribution function (CCDF). The proposed CCDF test is easy to implement and computationally fast. The Cramér-von Mises test is employed as the goodness-of-fit test. The proposed method is compared via Monte Carlo simulations with existing methods including the spacings, the score, and the entropy difference tests. The proposed CCDF test turned out to perform well for both small and large samples, comparable to the spacings and entropy difference tests. The utility of the proposed method is illustrated by an application to the r largest daily rainfall data in Korea. Additionally, we extended the existing methods and the CCDF test to a nonstationary rGEV model. The wide applicability of the proposed method is discussed.

2602.23887 2026-03-02 physics.chem-ph cs.AI stat.AP

Uncovering sustainable personal care ingredient combinations using scientific modelling

Sandip Bhattacharya, Vanessa da Silva, Christina Kohlmann

Comments Paper submitted and part of 35th IFSCC Congress, Brazil, 14-17 October 2024

Personal care formulations often contain synthetic and non-biodegradable ingredients, such as silicone and mineral oils, which can offer unique performance. However, regulations such as the EU ban of Octamethylcyclotetrasiloxane (D4), Decamethylcyclopentasiloxane (D5), and Dodecamethylcyclohexasiloxane (D6), already in effect for rinse-off cosmetics and applying to leave-on cosmetics by June 2027, coupled with growing consumer awareness and expectations on sustainability, put personal care brands under significant pressure to replace these synthetic ingredients with natural alternatives without compromising performance and cost. As a result, formulators are confronted with the challenge of finding natural-based solutions within a short timeframe. In this study, we propose a pioneering approach that utilizes predictive modelling and simulation-based digital services to obtain natural-based ingredient combinations as recommended replacements for commonly used synthetic ingredients. We will demonstrate the effectiveness of our predictions through the application of these proposals in specific formulations. By offering a platform of digital services, we aim to empower formulators to explore well-performing, novel, and environmentally friendly alternatives, ultimately driving a substantial and genuine transformation in the personal care industry.

2602.23854 2026-03-02 math.OC cs.LG cs.NA math.NA stat.ML

A distributed semismooth Newton based augmented Lagrangian method for distributed optimization

Qihao Ma, Chengjing Wang, Peipei Tang, Dunbiao Niu, Aimin Xu

This paper proposes a novel distributed semismooth Newton based augmented Lagrangian method for solving a class of optimization problems over networks, where the global objective is defined as the sum of locally held cost functions, and communication is restricted to neighboring agents. Specifically, we employ the augmented Lagrangian method to solve an equivalently reformulated constrained version of the original problem. Each resulting subproblem is solved inexactly via a distributed semismooth Newton method. By fully leveraging the structure of the generalized Hessian, a distributed accelerated proximal gradient method is proposed to compute the Newton direction efficiently, eliminating the need to communicate full Hessian matrices. Theoretical results are also obtained to guarantee the convergence of the proposed algorithm. Numerical experiments demonstrate the efficiency and superiority of our algorithm compared to state-of-the-art distributed algorithms.

2602.23815 2026-03-02 stat.ME stat.CO

Efficient Tests for Testing in Two-way ANOVA under Heteroscedasticity

Anjana Mondal, Somesh Kumar

New tests are developed for two-way ANOVA models with heterogeneous error variances. The testing problems considered are those of testing for significant interaction effects, simple effects, and treatment effects. The likelihood ratio tests (LRTs) and simultaneous comparison tests are derived for all three problems. Hill climbing algorithms are proposed to compute the maximum likelihood estimators (MLEs) of parameters under the restrictions of the null and alternative hypotheses. It is proved that the proposed algorithms converge to the MLEs. A parametric bootstrap algorithm is provided for the computation of the critical points. The simulated power values of the proposed tests are compared with two existing tests. For testing main effects in the additive ANOVA model, the LRT appears to achieve a gain in power of about 30% to 50% over the available tests. Also, the proposed tests for the interaction and simple effects are seen to have comparable power and size performance to the existing tests. The behavior of the proposed tests under non-normal error distributions is also discussed. Four real data sets are used to demonstrate the application of the proposed tests. A software package is provided in R to make it simple to apply the tests to experimental data sets.

2602.23800 2026-03-02 stat.ME cs.AI cs.LG

Operationalizing Longitudinal Causal Discovery Under Real-World Workflow Constraints

Tadahisa Okuda, Shohei Shimizu, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma

Causal discovery has achieved substantial theoretical progress, yet its deployment in large-scale longitudinal systems remains limited. A key obstacle is that operational data are generated under institutional workflows whose induced partial orders are rarely formalized, enlarging the admissible graph space in ways inconsistent with the recording process. We characterize a workflow-induced constraint class for longitudinal causal discovery that restricts the admissible directed acyclic graph space through protocol-derived structural masks and timeline-aligned indexing. Rather than introducing a new optimization algorithm, we show that explicitly encoding workflow-consistent partial orders reduces structural ambiguity, especially in mixed discrete-continuous panels where within-time orientation is weakly identified. The framework combines workflow-derived admissible-edge constraints, measurement-aligned time indexing and block structure, bootstrap-based uncertainty quantification for lagged total effects, and a dynamic representation supporting intervention queries. In a nationwide annual health screening cohort in Japan with 107,261 individuals and 429,044 person-years, workflow-constrained longitudinal LiNGAM yields temporally consistent within-time substructures and interpretable lagged total effects with explicit uncertainty. Sensitivity analyses using alternative exposure and body-composition definitions preserve the main qualitative patterns. We argue that formalizing workflow-derived constraint classes improves structural interpretability without relying on domain-specific edge specification, providing a reproducible bridge between operational workflows and longitudinal causal discovery under standard identifiability assumptions.

2602.23775 2026-03-02 stat.ME

Novel Stein-type Characterizations of Bivariate Count Distributions with Applications

Shaochen Wang, Christian H. Weiß

The derivation and application of Stein identities have received considerable research interest in recent years, especially for continuous or discrete-univariate distributions. In this paper, we complement the existing literature by deriving and investigating Stein-type characterizations for the three most common types of bivariate count distributions, namely the bivariate Poisson, binomial, and negative-binomial distribution. Then, we demonstrate the practical relevance of these novel Stein identities by a couple of applications, namely the deduction of sophisticated moment expressions, of flexible goodness-of-fit tests, and of novel tests for the symmetry of bivariate count distributions. The paper concludes with an analysis of real-world data examples.

2602.23750 2026-03-02 stat.AP cs.LG

Predictive Hotspot Mapping for Data-driven Crime Prediction

Karthik Sriram, Ankur Sinha, Suvashis Choudhary

Comments 50 pages

Predictive hotspot mapping is an important problem in crime prediction and control. Accurate hotspot mapping helps in appropriately targeting the available resources to manage crime in cities. With the aim of making data-driven decisions and automating policing and patrolling operations, police departments across the world are moving towards predictive approaches relying on historical data. In this paper, we create a non-parametric model using a spatio-temporal kernel density formulation for the purpose of crime prediction based on historical data. The proposed approach is also able to incorporate expert inputs coming from humans through alternate sources. The approach has been extensively evaluated in a real-world setting by collaborating with the Delhi police department to make crime predictions that would help in effective assignment of patrol vehicles to control street crime. The results obtained in the paper are promising, and the approach can be easily applied in other settings. We release the algorithm and the (masked) dataset used in our study to support future research aimed at further improvements.

2602.23672 2026-03-02 stat.ML cs.LG econ.EM math.ST stat.ME stat.TH

General Bayesian Policy Learning

Masahiro Kato

This study proposes the General Bayes framework for policy learning. We consider decision problems in which a decision-maker chooses an action from an action set to maximize its expected welfare. Typical examples include treatment choice and portfolio selection. In such problems, the statistical target is a decision rule, and the prediction of each outcome $Y(a)$ is not necessarily of primary interest. We formulate this policy learning problem by loss-based Bayesian updating. Our main technical device is a squared-loss surrogate for welfare maximization. We show that maximizing empirical welfare over a policy class is equivalent to minimizing a scaled squared error in the outcome difference, up to a quadratic regularization controlled by a tuning parameter $ζ>0$. This rewriting yields a General Bayes posterior over decision rules that admits a Gaussian pseudo-likelihood interpretation. We clarify two Bayesian interpretations of the resulting generalized posterior, a working Gaussian view and a decision-theoretic loss-based view. As one implementation example, we introduce neural networks with tanh-squashed outputs. Finally, we provide theoretical guarantees in a PAC-Bayes style.

2602.23640 2026-03-02 stat.ME

Stress-Testing Assumptions: A Guide to Bayesian Sensitivity Analyses in Causal Inference

Arman Oganisian

While observational data are routinely used to estimate causal effects of biomedical treatments, doing so requires special methods to adjust for observed confounding. These methods invariably rely on untestable statistical and causal identification assumptions. When these assumptions do not hold, sensitivity analysis methods can be used to characterize how different violations may change our inferences. The Bayesian approach to sensitivity analyses in causal inference has unique advantages as it allows users to encode subjective beliefs about the direction and magnitude of assumption violations via prior distributions and make inferences using the updated posterior. However, uptake of these methods remains low since implementation requires substantial methodological knowledge. Moreover, while implementation with publicly available software is possible, it is not straightforward. At the same time, there are few papers that provide practical guidance on these fronts. In this paper, we walk through four examples of Bayesian sensitivity analyses: 1) exposure misclassification, 2) unmeasured confounding, and missing not-at-random outcomes with 3) parametric and 4) nonparametric Bayesian models. We show how all of these can be done using a unified Bayesian "missing data" approach. We also cover implementation using Stan, publicly available open-source software for fitting Bayesian models. To the best of our knowledge, this is the first paper that presents a unified approach with code, examples, and methodology in a three-pronged illustration of sensitivity analyses in Bayesian causal inference. Our goal is for the reader to walk away with implementation-level knowledge.

2602.23611 2026-03-02 stat.ML cs.LG

Fairness under Graph Uncertainty: Achieving Interventional Fairness with Partially Known Causal Graphs over Clusters of Variables

Yoichi Chikahara

Comments 26 pages, 9 figures

Algorithmic decisions about individuals require predictions that are not only accurate but also fair with respect to sensitive attributes such as gender and race. Causal notions of fairness align with legal requirements, yet many methods assume access to detailed knowledge of the underlying causal graph, which is a demanding assumption in practice. We propose a learning framework that achieves interventional fairness by leveraging a causal graph over clusters of variables, which is substantially easier to estimate than a variable-level graph. With possible adjustment cluster sets identified from such a cluster causal graph, our framework trains a prediction model by reducing the worst-case discrepancy between interventional distributions across these sets. To this end, we develop a computationally efficient barycenter kernel maximum mean discrepancy (MMD) that scales favorably with the number of sensitive attribute values. Extensive experiments show that our framework strikes a better balance between fairness and accuracy than existing approaches, highlighting its effectiveness under limited causal graph knowledge.

2602.20293 2026-03-02 cs.LG stat.ML

Discrete Diffusion with Sample-Efficient Estimators for Conditionals

Karthik Elamvazhuthi, Abhijith Jayakumar, Andrey Y. Lokhov

We study a discrete denoising diffusion framework that integrates a sample-efficient estimator of single-site conditionals with round-robin noising and denoising dynamics for generative modeling over discrete state spaces. Rather than approximating a discrete analog of a score function, our formulation treats single-site conditional probabilities as the fundamental objects that parameterize the reverse diffusion process. We employ a sample-efficient method known as the Neural Interaction Screening Estimator (NeurISE) to estimate these conditionals in the diffusion dynamics. Controlled experiments on synthetic Ising models, MNIST, and scientific data sets produced by a D-Wave quantum annealer, a synthetic Potts model, and one-dimensional quantum systems demonstrate the proposed approach. On the binary data sets, these experiments show that the proposed approach outperforms popular existing methods, including ratio-based approaches, achieving improved performance in total variation, cross-correlations, and kernel density estimation metrics.

2602.18997 2026-03-02 stat.ML cs.LG math.OC

Implicit Bias and Convergence of Matrix Stochastic Mirror Descent

Danil Akhtiamov, Reza Ghane, Omead Pooladzandi, Babak Hassibi

Details
English abstract

We investigate Stochastic Mirror Descent (SMD) with matrix parameters and vector-valued predictions, a framework relevant to multi-class classification and matrix completion problems. Focusing on the overparameterized regime, where the total number of parameters exceeds the number of training samples, we prove that SMD with matrix mirror functions $ψ(\cdot)$ converges exponentially to a global interpolator. Furthermore, we generalize classical implicit bias results of vector SMD by demonstrating that the matrix SMD algorithm converges to the unique solution minimizing the Bregman divergence induced by $ψ(\cdot)$ from initialization subject to interpolating the data. These findings reveal how matrix mirror maps dictate inductive bias in high-dimensional, multi-output problems.
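
For orientation, the standard stochastic mirror descent update and the implicit-bias statement summarized above can be written out as below. This is a sketch in generic notation: the symbols $W_t$ (iterate), $\eta$ (step size), $L_{i_t}$ (loss on the sample drawn at step $t$), $W_0$ (initialization), and $\mathcal{W}$ (the set of interpolating solutions) are assumptions introduced here and are not taken from the abstract.

```latex
\nabla\psi(W_{t+1}) = \nabla\psi(W_t) - \eta\,\nabla L_{i_t}(W_t),
\qquad
W_\infty = \arg\min_{W \in \mathcal{W}} D_\psi(W, W_0),
\qquad
D_\psi(W, W_0) = \psi(W) - \psi(W_0) - \langle \nabla\psi(W_0),\, W - W_0 \rangle .
```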

2601.04663 2026-03-02 stat.ME econ.EM

Quantile Vector Autoregression without Crossing

Tomohiro Ando, Tadao Hoshino, Ruey Tsay

Details
English abstract

This paper considers estimation and model selection of quantile vector autoregression (QVAR). Conventional quantile regression often yields undesirable crossing quantile curves, violating the monotonicity of quantiles. To address this issue, we propose a simplex quantile vector autoregression (SQVAR) framework, which transforms the autoregressive (AR) structure of the original QVAR model into a simplex, ensuring that the estimated quantile curves remain monotonic across all quantile levels. In addition, we impose the smoothly clipped absolute deviation (SCAD) penalty on the SQVAR model to mitigate the explosive nature of the parameter space. We further develop a Bayesian information criterion (BIC)-based procedure for selecting the optimal penalty parameter and introduce new frameworks for impulse response analysis of QVAR models. Finally, we establish asymptotic properties of the proposed method, including the convergence rate and asymptotic normality of the estimator, the consistency of AR order selection, and the validity of the BIC-based penalty selection. For illustration, we apply the proposed method to U.S. financial market data, highlighting the usefulness of our SQVAR method.

2512.23075 2026-03-02 cs.LG cs.AI cs.IT math.IT stat.ML

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang

Details
English abstract

Policy gradient methods for Large Language Models optimize a policy $π_θ$ via a surrogate objective computed from samples of a rollout policy $π_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($π_{\text{roll}} \neq π_θ$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ (or $D_{\mathrm{TV}}^{\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
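
The masking idea can be illustrated with a minimal sketch. It assumes that a sequence is kept only if its maximum per-token KL divergence between the rollout and the trained policy stays below a threshold. The function names, the direction of the KL divergence, and the threshold `delta` are hypothetical choices made for illustration, not the authors' implementation.

```python
import math

def token_kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes q has no zero entries where p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sequence_mask(roll_dists, policy_dists, delta):
    """Return 1.0 if every token satisfies KL(roll || policy) <= delta, else 0.0.

    The whole sequence is masked as soon as a single token violates the
    (hypothetical) trust-region threshold delta.
    """
    max_kl = max(token_kl(p, q) for p, q in zip(roll_dists, policy_dists))
    return 1.0 if max_kl <= delta else 0.0

# Two-token toy sequences over a two-symbol vocabulary.
roll = [[0.5, 0.5], [0.9, 0.1]]
close_policy = [[0.5, 0.5], [0.9, 0.1]]   # identical to the rollout policy
far_policy = [[0.5, 0.5], [0.1, 0.9]]     # second token diverges strongly

mask_close = sequence_mask(roll, close_policy, delta=0.05)
mask_far = sequence_mask(roll, far_policy, delta=0.05)
```

The mask would then multiply the per-sequence surrogate loss, so that a single badly mismatched token removes the whole sequence from the gradient rather than only clipping that token.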

2512.17131 2026-03-02 cs.LG cs.AI stat.ML

Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

Details
English abstract

We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method that unifies and generalizes recent averaging-based optimizers like single-worker DiLoCo and Schedule-Free, within a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure to periodically aggregate pseudo-gradients using Nesterov momentum, GPA eliminates this complexity by decoupling Nesterov's interpolation constants to enable smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with exponential moving averaging. Empirically, GPA consistently outperforms single-worker DiLoCo and AdamW with reduced memory overhead. GPA achieves speedups of 8.71%, 10.13%, and 9.58% over the AdamW baseline in terms of steps to reach target validation loss for Llama-160M, 1B, and 8B models, respectively. Similarly, on the ImageNet ViT workload, GPA achieves speedups of 7% and 25.5% in the small and large batch settings respectively. Furthermore, we prove that for any base optimizer with $O(\sqrt{T})$ regret, where $T$ is the number of iterations, GPA matches or exceeds the original convergence guarantees depending on the interpolation constants.

2512.14062 2026-03-02 math.ST stat.TH

Maximal signed volume for (multivariate) supermodular quasi-copulas

Matjaž Omladič, Martin Vuk, Aljaž Zalar

Comments 15 pages, 1 figure

Details
English abstract

Copulas are the primary tool for dependence modeling in statistics, and quasi-copulas are their essential companions. The latter appear, say, as infima or suprema of sets of copulas; they form a huge class and have some unpleasant properties. Their statistical interpretation is challenged by the fact that they may lead to negative volumes of some boxes. So, numerous applications call for an intermediate class, and supermodular quasi-copulas are one of them, having many useful properties. An excellent measure, Average Rectangular Volume (ARV in short), to clarify and position this class was proposed in the seminal paper by Anzilli and Durante, The average rectangular volume induced by supermodular aggregation functions, J. Math. Anal. Appl. 555 (2026) 21 pp. While supermodularity is a bivariate notion, its extension to the $d$-variate case for $d>2$ was recently emphasized in a key paper by Arias-Garcia, Mesiar, and De Baets, The unwalked path between quasi-copulas and copulas: Stepping stones in higher dimensions, Int. J. of Appr. Reasoning, 80 (2017) pp. 89-99. Here, an alternative method to ARV is presented, extendable to the multivariate case based on Maximal (in absolute value) Negative Volumes (MNV in short) on boxes, thus helping practitioners when seeking the right (quasi-)copula for their problem. Observe that these volumes on copulas are zero, while their values on quasi-copulas, depending on $d$, have been a long-standing open problem solved only recently. We present a nontrivial extension of this solution, which serves as the main goal of this paper: a measure that clarifies and positions the classes considered based on MNV.

2512.11012 2026-03-02 stat.ME math.PR stat.CO

On a class of constrained Bayesian filters and their numerical implementation in high-dimensional state-space Markov models

Utku Erdogan, Gabriel J. Lord, Joaquin Miguez

Details
English abstract

Bayesian filtering is a key tool in many problems that involve the online processing of data, including data assimilation, optimal control, nonlinear tracking and others. Unfortunately, the implementation of filters for nonlinear, possibly high-dimensional, dynamical systems is far from straightforward, as computational methods have to meet a delicate trade-off involving stability, accuracy and computational cost. In this paper we investigate the design, and theoretical features, of constrained Bayesian filters for state space models. The constraint on the filter is given by a sequence of compact subsets of the state space that determines the sources and targets of the Markov transition kernels in the dynamical model. Subject to such constraints, we provide sufficient conditions for filter stability and approximation error rates with respect to the original (unconstrained) Bayesian filter. Then, we look specifically into the implementation of constrained filters in a continuous-discrete setting where the state of the system is a continuous-time stochastic Itô process but data are collected sequentially over a time grid. We propose an implementation of the constraint that relies on a data-driven modification of the drift of the Itô process using barrier functions, and discuss the relation of this scheme with methods based on the Doob $h$-transform. Finally, we illustrate the theoretical results and the performance of the proposed methods in computer experiments for a partially-observed stochastic Lorenz 96 model.

2512.02878 2026-03-02 stat.ME

Correcting for sampling variability in maximum likelihood-based one-sample log-rank tests

Moritz Fabian Danzer, Rene Schmidt

Comments Main manuscript: 12 pages, 4 figures, 2 tables Supplementary Material: 13 pages, 7 figures with multiple subfigures

Details
English abstract

Single-arm studies in the early development phases of new treatments are not uncommon in the context of rare diseases or in paediatrics. If an assessment of efficacy is to be made at the end of such a study, the observed endpoints can be compared with reference values that can be derived from historical data. For a time-to-event endpoint, a statistical comparison with a reference curve can be made using the one-sample log-rank test. In order to ensure the interpretability of the results of this test, the role of the reference curve is crucial. This quantity is often estimated from a historical control group using a parametric procedure. Hence, it should be noted that it is subject to estimation uncertainty. However, this aspect is not taken into account in the one-sample log-rank test statistic. We analyse this estimation uncertainty for the common situation that the reference curve is estimated parametrically using the maximum likelihood method, and indicate how the variance estimation of the one-sample log-rank test can be adapted in order to take this variability into account. The resulting test procedures are illustrated using a data example and analysed in more detail using simulations, particularly in comparison with established two-sample methods.
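
As background, the classical one-sample log-rank statistic compares the observed number of events with the number expected under the reference cumulative hazard. The sketch below shows only this uncorrected statistic, which treats the reference curve as known; it does not reproduce the variance adjustment proposed in the paper. The exponential reference with rate `rate` and the toy data are assumptions chosen for illustration.

```python
import math

def one_sample_logrank(times, events, cum_hazard):
    """Classical one-sample log-rank statistic Z = (O - E) / sqrt(E).

    times      : observed follow-up times (event or censoring)
    events     : 1 if the event was observed, 0 if censored
    cum_hazard : reference cumulative hazard function, treated as known
    """
    observed = sum(events)
    expected = sum(cum_hazard(t) for t in times)
    z = (observed - expected) / math.sqrt(expected)
    return observed, expected, z

rate = 0.1  # hypothetical reference hazard rate, e.g. fitted to historical data
times = [2.0, 5.0, 8.0, 10.0, 10.0]
events = [1, 1, 0, 1, 0]

obs, exp_, z = one_sample_logrank(times, events, lambda t: rate * t)
```

A negative `z` indicates fewer events than the reference curve predicts. Because `rate` is itself an estimate in practice, the variance used here ignores its sampling variability, which is the issue the abstract addresses.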

2511.06652 2026-03-02 stat.ME

Causal Inference for Network Autoregression Model: A Targeted Minimum Loss Estimation Approach

Yong Wu, Shuyuan Wu, Xinwei Sun, Xuening Zhu

Comments This paper is withdrawn due to errors in the current version and a mismatch between the title and the actual scope of the manuscript. A substantially revised version may be prepared in the future

Details
English abstract

We study estimation of the average treatment effect (ATE) from a single network in observational settings with interference. The weak cross-unit dependence is modeled via an endogenous peer-effect (network autoregressive) term that induces distance-decaying network dependence, relaxing the common finite-order interference to infinite interference. We propose a targeted minimum loss estimation (TMLE) procedure that removes plug-in bias from an initial estimator. The targeting step yields an adjustment direction that incorporates the network autoregressive structure and assigns heterogeneous, network-dependent weights to units. We find that the asymptotic leading term related to the covariates $\mathbf{X}_i$ can be formulated into a $V$-statistic whose order diverges with the network degrees. A novel limit theory is developed to establish the asymptotic normality under such complex network dependent scenarios. We show that our method can achieve smaller asymptotic variance than existing methods when $\mathbf{X}_i$ is i.i.d. generated and estimated with empirical distribution, and provide theoretical guarantees for estimating the variance. Extensive numerical studies and a live-streaming data analysis are presented to illustrate the advantages of the proposed method.

2510.17268 2026-03-02 cs.LG stat.ML

Uncertainty-aware data assimilation through variational inference

Anthony Frion, David S Greenberg

Details
English abstract

Data assimilation, which combines a dynamical model with a set of noisy and incomplete observations in order to infer the state of a system over time, involves uncertainty in most settings. Building upon an existing deterministic machine learning approach, we propose a variational inference-based extension in which the predicted state follows a multivariate Gaussian distribution. Using the chaotic Lorenz-96 dynamics as a testing ground, we show that our new model yields nearly perfectly calibrated predictions, and can be integrated in a wider variational data assimilation pipeline in order to achieve greater benefit from increasing lengths of data assimilation windows. Our code is available at https://github.com/anthony-frion/Stochastic_CODA.

2510.04970 2026-03-02 stat.ML cs.AI cs.LG stat.ME

Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning

Marcel Wienöbst, Leonard Henckel, Sebastian Weichwald

Comments Accepted at ICLR 2026

Details
English abstract

We present FLOP (Fast Learning of Order and Parents), a score-based causal discovery algorithm for linear models. It pairs fast parent selection with iterative Cholesky-based score updates, cutting run-times over prior algorithms. This makes it feasible to fully embrace discrete search, enabling iterated local search with principled order initialization to find graphs with scores at or close to the global optimum. The resulting structures are highly accurate across benchmarks, with near-perfect recovery in standard settings. This performance calls for revisiting discrete search over graphs as a reasonable approach to causal discovery.

2509.21021 2026-03-02 cs.LG cs.AI stat.ML

Efficient Ensemble Conditional Independence Test Framework for Causal Discovery

Zhengkang Guan, Kun Kuang

Comments Published as a conference paper at ICLR 2026

Details
English abstract

Constraint-based causal discovery relies on numerous conditional independence tests (CITs), but its practical applicability is severely constrained by the prohibitive computational cost, especially as CITs themselves have high time complexity with respect to the sample size. To address this key bottleneck, we introduce the Ensemble Conditional Independence Test (E-CIT), a general-purpose and plug-and-play framework. E-CIT operates on an intuitive divide-and-aggregate strategy: it partitions the data into subsets, applies a given base CIT independently to each subset, and aggregates the resulting p-values using a novel method grounded in the properties of stable distributions. This framework reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed. Moreover, our tailored p-value combination method offers theoretical consistency guarantees under mild conditions on the subtests. Experimental results demonstrate that E-CIT not only significantly reduces the computational burden of CITs and causal discovery but also achieves competitive performance. Notably, it exhibits an improvement in complex testing scenarios, particularly on real-world datasets.
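
The divide-and-aggregate strategy can be sketched as follows. Two stand-ins are assumed here and are not the paper's components: the base test is a Fisher z-test on the partial correlation with a single conditioning variable, and the p-values are merged with the Cauchy combination rule, a known stable-distribution-based method used in place of the paper's own aggregation. All function names and the toy data are hypothetical.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def fisher_z_cit(x, y, z):
    """Stand-in base CIT: p-value for X independent of Y given one variable Z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    r = (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    stat = math.sqrt(len(x) - 4) * math.atanh(r)
    return math.erfc(abs(stat) / math.sqrt(2))  # two-sided normal p-value

def cauchy_combine(pvals):
    """Cauchy combination of p-values with equal weights."""
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvals) / len(pvals)
    return min(1.0, max(0.0, 0.5 - math.atan(t) / math.pi))

def ensemble_cit(x, y, z, subset_size):
    """Split the sample into disjoint blocks, test each block, combine p-values."""
    pvals = []
    for s in range(0, len(x) - subset_size + 1, subset_size):
        e = s + subset_size
        pvals.append(fisher_z_cit(x[s:e], y[s:e], z[s:e]))
    return cauchy_combine(pvals)

rng = random.Random(0)
n = 1000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y_indep = [zi + rng.gauss(0, 1) for zi in z]                  # X indep. of Y given Z
y_dep = [xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]   # depends on X given Z

p_indep = ensemble_cit(x, y_indep, z, subset_size=100)
p_dep = ensemble_cit(x, y_dep, z, subset_size=100)
```

With a fixed subset size, the number of base tests grows linearly with the sample size while the cost of each test stays constant, which is the source of the linear complexity mentioned in the abstract.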

2508.15978 2026-03-02 stat.AP stat.CO stat.ME

A nonstationary spatial model of PM2.5 with localized transfer learning from numerical model output

Wenlong Gong, Brian J. Reich, Joseph Guinness

Comments Environ Ecol Stat (2026)

Details
English abstract

Ambient air pollution measurements from regulatory monitoring networks are routinely used to support epidemiologic studies and environmental policy decision making. However, regulatory monitors are spatially sparse and preferentially located in areas with large populations. Numerical air pollution model output can be combined with measurements from monitors to support inference and prediction of air pollution data. Nonstationary covariance functions allow the model to adapt to spatial surfaces whose variability changes with location, as is the case for air pollution data. In this paper, we learn localized covariance parameters from the numerical model output and knit them together into a global nonstationary covariance, which we incorporate in a fully Bayesian model. We model the nonstationary structure in a computationally efficient way to make the Bayesian model scalable.

2508.12391 2026-03-02 math.ST stat.ME stat.TH

Asymptotic confidence bands for the histogram regression estimator

Natalie Neumeyer, Jan Rabe, Mathias Trabs

Details
English abstract

Asymptotic uniform confidence bands are constructed for a multivariate nonparametric regression model with heteroscedastic noise, employing histogram estimators under flexible partition conditions. The construction is especially applicable to unsmooth regression functions of Hölder regularity less than one. While the radius of the confidence bands could be approximated via the Gumbel distribution, our construction does not depend on an extreme value distribution, but instead can be explicitly calculated for the chosen partition.

2507.16467 2026-03-02 stat.ML cs.AI cs.LG

Estimating Treatment Effects with Independent Component Analysis

Patrik Reizinger, Lester Mackey, Wieland Brendel, Rahul Krishnan

Details
English abstract

Independent Component Analysis (ICA) uses a measure of non-Gaussianity to identify latent sources from data and estimate their mixing coefficients (Shimizu et al., 2006). Meanwhile, higher-order Orthogonal Machine Learning (OML) exploits non-Gaussian treatment noise to provide more accurate estimates of treatment effects in the presence of confounding nuisance effects (Mackey et al., 2018). Remarkably, we find that the two approaches rely on the same moment conditions for consistent estimation. We then seize upon this connection to show how ICA can be effectively used for treatment effect estimation. Specifically, we prove that linear ICA can consistently estimate multiple treatment effects, even in the presence of Gaussian confounders, and identify regimes in which ICA is provably more sample-efficient than OML for treatment effect estimation. Our synthetic demand estimation experiments confirm this theory and demonstrate that linear ICA can accurately estimate treatment effects even in the presence of nonlinear nuisance.

2507.00893 2026-03-02 stat.AP stat.ME

Stochastic highway capacity: Unsuitable Kaplan-Meier estimator, revised maximum likelihood estimator, and impact of speed harmonisation

Igor Mikolášek

Comments Replaces arXiv:2003.05355 (withdrawn due to invalid methodology conclusions). 22 pages, 4 figures, 4 tables. v3 is reformatted and includes minor revisions

Details
English abstract

The Kaplan-Meier estimate, also known as the product-limit method (PLM), is a widely used non-parametric maximum likelihood estimator (MLE) in survival analysis. In the context of highway engineering, it has been repeatedly applied to estimate stochastic traffic flow capacity. However, this paper demonstrates that PLM is fundamentally unsuitable for this purpose. The method implicitly assumes continuous exposure to failure risk over time, a premise that is invalid for traffic flow, where intensity does not increase linearly and capacity is not even directly observable. Although a parametric MLE approach offers a viable alternative, its earlier derivation for this use case suffers from a flawed likelihood formulation, likely due to an attempt to preserve consistency with PLM. This study derives a corrected likelihood formula for stochastic capacity MLE and validates it using two empirical datasets. The proposed method is then applied in a case study examining the effect of a variable speed limit (VSL) system used for traffic flow speed harmonisation at a 2-to-1 lane drop. Results show that the VSL improved capacity by approximately 10% or reduced breakdown probability at the same flow intensity by up to 50%. The findings underscore the methodological importance of correct model formulation and highlight the practical relevance of stochastic capacity estimation for evaluating traffic control strategies.
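
For readers unfamiliar with the product-limit method, a minimal sketch of the generic survival-analysis estimator is given below. It shows the textbook estimator only, on made-up toy data, and does not reproduce the paper's corrected likelihood for stochastic capacity. The function name `kaplan_meier` is a label chosen here.

```python
def kaplan_meier(times, events):
    """Product-limit estimate S(t) = prod over event times t_i <= t of (1 - d_i / n_i).

    times  : observed times (event or censoring)
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (event_time, survival_probability) pairs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all observations tied at time t; censored ones still count
        # as being at risk at t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Each factor uses the number of units still at risk at that time, which is where the assumption of continuous exposure to failure risk enters.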

2506.18223 2026-03-02 stat.ME

Dependent Dirichlet processes via thinning

Laura D'Angelo, Bernardo Nipoti, Andrea Ongaro

Comments 29 pages

Details
English abstract

When analyzing data from multiple sources, it is often convenient to strike a careful balance between two goals: capturing the heterogeneity of the samples and sharing information across them. We introduce a novel framework to model a collection of samples using dependent Dirichlet processes constructed through a thinning mechanism. The proposed approach modifies the stick-breaking representation of the Dirichlet process by thinning, that is, setting equal to zero a random subset of the beta random variables used in the original construction. This results in a collection of dependent random distributions that exhibit both shared and unique atoms, with the shared ones assigned distinct weights in each distribution. The generality of the construction allows expressing a wide variety of dependence structures among the elements of the generated random vectors. Moreover, its simplicity facilitates the characterization of several theoretical properties and the derivation of efficient computational methods for posterior inference. A simulation study illustrates how a modeling approach based on the proposed process reduces uncertainty in group-specific inferences while preventing excessive borrowing of information when the data indicate it is unnecessary. This added flexibility improves the accuracy of posterior inference, outperforming related state-of-the-art models. An application to the Collaborative Perinatal Project data highlights the model's capability to estimate group-specific densities and uncover a meaningful partition of the observations, both within and across samples, providing valuable insights into the underlying data structure.
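
The thinning mechanism can be illustrated with a short sketch. It assumes a truncated stick-breaking construction in which two groups share the same beta variables and atoms, and each group independently zeroes out a random subset of the beta variables. The truncation level `K`, the keep probability of 0.7, and the independent Bernoulli thinning are hypothetical choices for illustration, not the paper's specification.

```python
import random

def thinned_stick_breaking(betas, keep):
    """Stick-breaking weights where thinned beta variables are set to zero.

    betas : stick-breaking proportions v_1, ..., v_K in (0, 1)
    keep  : booleans; False means v_k is thinned (replaced by 0)
    A thinned atom receives weight 0 and leaves the remaining stick untouched.
    """
    weights = []
    remaining = 1.0
    for v, k in zip(betas, keep):
        v_eff = v if k else 0.0
        weights.append(v_eff * remaining)
        remaining *= 1.0 - v_eff
    return weights

rng = random.Random(0)
K = 50          # truncation level (assumption)
alpha = 1.0     # Dirichlet process concentration parameter (assumption)
betas = [rng.betavariate(1.0, alpha) for _ in range(K)]
atoms = [rng.gauss(0.0, 1.0) for _ in range(K)]   # atoms shared by both groups

keep_a = [rng.random() < 0.7 for _ in range(K)]
keep_b = [rng.random() < 0.7 for _ in range(K)]
w_a = thinned_stick_breaking(betas, keep_a)
w_b = thinned_stick_breaking(betas, keep_b)
```

Atoms kept in both groups are shared but generally carry different weights, because the stick remaining before them depends on which earlier variables were thinned in each group.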

2506.17014 2026-03-02 stat.ME

A Semi-Parametric Torus-to-Torus Regression Model with Geometric Loss: Application to Cyclone Data

Surojit Biswas, Buddhananda Banerjee

Details
English abstract

This study introduces a novel torus-to-torus regression framework to improve the analysis and prediction of cyclone-driven wind-wave directional dynamics. This research, to our knowledge, establishes a mathematical framework for modeling the regression between bivariate angular predictors and bivariate angular responses for the first time in the literature. The proposed approach enhances the capacity to model coupled directional processes commonly observed in extreme coastal cyclones. The proposed model makes use of generalized Möbius transformation and differential geometry for model building. A new loss function, derived from the intrinsic geometry of the torus, is introduced to facilitate effective semi-parametric estimation without requiring any specific distributional assumptions on the angular error. The prediction error is measured as an angular loss on the surface of the torus and also the angular deflection along normal directions on the unit sphere transported from the torus. Additionally, a new visualization technique for circular data is introduced. The practical relevance of the model is illustrated through its application to wind-wave directional datasets from two major cyclonic events, Amphan and Biparjoy, that impacted the eastern and western coastlines of India, respectively.

2505.02607 2026-03-02 stat.AP

Expectiles as basis risk-optimal payment schemes in parametric insurance

Markus Johannes Maier, Matthias Scherer

Comments 34 pages, 8 figures

Details
English abstract

Payments in parametric insurance solutions are linked to an index and thus decoupled from policyholders' true losses. While this principle has appealing operational benefits compared to traditional indemnity coverage, namely high efficiency and cost-effectiveness, a downside is the discrepancy between payouts and actual damage, called basis risk. We show that in an asymmetrically weighted mean square error framework, the basis risk-minimizing payment schemes for pure parametric and parametric index insurance contracts can be expressed as conditional expectiles of policyholders' true loss given a compensation-triggering incident. We provide connections to stochastic orderings and demonstrate that regression approaches allow easy implementation in practice. Our results are visualized in parametric coverage for cyber risks and agricultural insurance.
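
To make the notion of an expectile concrete, the sketch below computes a sample expectile as the minimizer of an asymmetrically weighted squared error, using a simple iteratively reweighted mean. It is an unconditional toy version on made-up loss values. The function name, the asymmetry level `tau`, and the data are assumptions for illustration, and the conditioning on a triggering incident is not modeled.

```python
def expectile(xs, tau, tol=1e-12, max_iter=1000):
    """Sample expectile: minimizer e of sum_i |tau - 1{x_i <= e}| * (x_i - e)^2.

    Values above e get weight tau, values at or below e get weight 1 - tau;
    the minimizer is a fixed point of the corresponding weighted mean.
    """
    e = sum(xs) / len(xs)
    for _ in range(max_iter):
        w = [tau if x > e else 1.0 - tau for x in xs]
        new_e = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e

losses = [0.0, 1.0, 2.0, 10.0]     # hypothetical policyholder losses
e50 = expectile(losses, 0.5)       # symmetric weights recover the mean
e90 = expectile(losses, 0.9)       # penalizes underpayment more heavily
```

Choosing `tau` above 0.5 penalizes payouts that fall short of the loss more than payouts that exceed it, so the resulting payment sits above the mean loss.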

2504.13520 2026-03-02 stat.ME econ.EM math.ST stat.TH

Bayesian Model Averaging in Causal Instrumental Variable Models

Gregor Steiner, Mark Steel

Details
English abstract

Instrumental variables are a popular tool to infer causal effects under unobserved confounding, but choosing suitable instruments is challenging in practice. We propose gIVBMA, a Bayesian model averaging procedure that addresses this challenge by averaging across different sets of instrumental variables and covariates in a structural equation model. This allows for data-driven selection of valid and relevant instruments and provides additional robustness against invalid instruments. Our approach extends previous work through a scale-invariant prior structure and accommodates non-Gaussian outcomes and treatments, offering greater flexibility than existing methods. The computational strategy uses conditional Bayes factors to update models separately for the outcome and treatments. We prove that this model selection procedure is consistent. In simulation experiments, gIVBMA outperforms current state-of-the-art methods. We demonstrate its usefulness in two empirical applications: the effects of malaria and institutions on income per capita and the returns to schooling. A software implementation of gIVBMA is available in Julia.

2503.15477 2026-03-02 cs.LG cs.AI cs.CL stat.ML

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Comments Accepted to NeurIPS 2025; Code available at https://github.com/princeton-pli/what-makes-good-rm

Details
English abstract

The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

2502.01383 2026-03-02 cs.LG stat.ML

InfoBridge: Mutual Information estimation via Bridge Matching

Sergei Kholkin, Ivan Butakov, Evgeny Burnaev, Nikita Gushchin, Alexander Korotin

Details
English abstract

Diffusion bridge models have recently become a powerful tool in the field of generative modeling. In this work, we leverage their power to address another important problem in machine learning and information theory, the estimation of the mutual information (MI) between two random variables. Neatly framing MI estimation as a domain transfer problem, we construct an unbiased estimator for data posing difficulties for conventional MI estimators. We showcase the performance of our estimator on three standard MI estimation benchmarks, i.e., low-dimensional, image-based and high MI, and on real-world data, i.e., protein language model embeddings.

2501.16985 2026-03-02 stat.ME

Nonparametric methods controlling the median of the false discovery proportion

Jesse Hemerik

Details
English abstract

When testing many hypotheses, we often do not have strong expectations about the directions of the effects. In some situations, however, the alternative hypotheses are that the parameters lie in a certain direction or interval, and it is in fact expected that most hypotheses are false. This is often the case when researchers perform multiple noninferiority or equivalence tests, e.g. when testing food safety with metabolite data. The goal is then to use data to corroborate the expectation that most hypotheses are false. We propose a nonparametric multiple testing approach that is powerful in such situations. If the user's expectations are wrong, our approach will still be valid but have low power. Of course, all multiple testing methods become more powerful when appropriate one-sided instead of two-sided tests are used, but even then our approach often has superior power. The proposed methods are not at all limited to safety testing and can be used for testing hypotheses about various kinds of parameters, such as coefficients of a model. The methods in this paper control the median of the false discovery proportion (FDP), which is the fraction of false discoveries among the rejected hypotheses. This approach is comparable to false discovery rate control, where one ensures that the mean rather than the median of the FDP is small. Our procedures make use of a symmetry property of the test statistics, do not require independence and have finite-sample properties.

2501.15910 2026-03-02 cs.LG cs.SY eess.SY math.OC stat.ML

The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective

Michael Muehlebach, Zhiyu He, Michael I. Jordan

Comments accepted at ICLR 2026; 37 pages, 6 figures

Details
English abstract

We study the sample complexity of online reinforcement learning in the general non-episodic setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems ranging from a finite set of nonlinear candidate models to models with bounded and Lipschitz continuous dynamics, to systems that are parametrized by a compact and real-valued set of parameters. In the most general setting, our algorithm achieves a policy regret of $\mathcal{O}(N ε^2 + d_\mathrm{u}\mathrm{ln}(m(ε))/ε^2)$, where $N$ is the time horizon, $ε$ is a user-specified discretization width, $d_\mathrm{u}$ the input dimension, and $m(ε)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact and real-valued set of parameters (such as neural networks, transformers, etc.), we prove a policy regret of $\mathcal{O}(\sqrt{d_\mathrm{u}N p})$, where $p$ denotes the number of parameters, recovering earlier sample-complexity results that were derived for linear time-invariant dynamical systems. While this article focuses on characterizing sample complexity, the proposed algorithms are likely to be useful in practice, due to their simplicity, their ability to incorporate prior knowledge, and their benign transient behaviors.

2411.13370 2026-03-02 stat.AP

Analysis of Higher Education Dropouts Dynamics through Multilevel Functional Decomposition of Recurrent Events in Counting Processes

Alessandra Ragni, Chiara Masci, Anna Maria Paganoni

Details
English abstract

This paper analyzes the dynamics of higher education dropouts through an innovative approach that integrates recurrent events modeling and point process theory with functional data analysis. We propose a novel methodology that extends existing frameworks to accommodate hierarchical data structures, demonstrating its potential through a simulation study. Using administrative data from student careers at Politecnico di Milano, we explore dropout patterns during the first year across different bachelor's degree programs and schools. Specifically, we employ Cox-based recurrent event models, treating dropouts as repeated occurrences within both programs and schools. Additionally, we apply functional modeling of recurrent events and multilevel principal component analysis to disentangle latent effects associated with degree programs and schools, identifying critical periods of dropout risk and providing valuable insights for institutions seeking to implement strategies aimed at reducing dropout rates.

2410.17692 2026-03-02 math.ST stat.ME stat.TH

Asymptotics for parametric martingale posteriors

Edwin Fong, Andrew Yiu

Comments 18 pages (main), 50 pages (total), 3 figures, 4 tables

Details
English abstract

The martingale posterior framework is a generalization of Bayesian inference where one elicits a sequence of one-step ahead predictive densities instead of the likelihood and prior. Posterior sampling then involves the imputation of unseen observables, and can then be carried out in an expedient and parallelizable manner using predictive resampling without requiring Markov chain Monte Carlo. Recent work has investigated the use of plug-in parametric predictive densities, combined with stochastic gradient descent, to specify a parametric martingale posterior. This paper investigates the asymptotic properties of this class of parametric martingale posteriors. In particular, two central limit theorems based on martingale limit theory are introduced and applied. The first is a predictive central limit theorem, which enables a significant acceleration of the predictive resampling scheme through a hybrid sampling algorithm based on a normal approximation. The second is a Bernstein-von Mises result, which is novel for martingale posteriors, and provides methodological guidance on attaining desirable frequentist properties. We demonstrate the utility of the theoretical results in simulations and a real data example.

2410.10258 2026-03-02 cs.LG stat.ML

Revisiting Matrix Sketching in Linear Bandits: Achieving Sublinear Regret via Dyadic Block Sketching

Dongxie Wen, Hanyan Yin, Xiao Zhang, Peng Zhao, Lijun Zhang, Zhewei Wei

Comments Accepted by ICLR 2026

Details
English abstract

Linear bandits have become a cornerstone of online learning and sequential decision-making, providing solid theoretical foundations for balancing exploration and exploitation. Within this domain, matrix sketching serves as a critical component for achieving computational efficiency, especially when confronting high-dimensional problem instances. The sketch-based approaches reduce per-round complexity from $Ω(d^2)$ to $O(dl)$, where $d$ is the dimension and $l<d$ is the sketch size. However, this computational efficiency comes with a fundamental pitfall: when the streaming matrix exhibits heavy spectral tails, such algorithms can incur vacuous \textit{linear regret}. In this paper, we revisit the regret bounds and algorithmic design for sketch-based linear bandits. Our analysis reveals that inappropriate sketch sizes can lead to substantial spectral error, severely undermining regret guarantees. To overcome this issue, we propose Dyadic Block Sketching, a novel multi-scale matrix sketching approach that dynamically adjusts the sketch size during the learning process. We apply this technique to linear bandits and demonstrate that the new algorithm achieves \textit{sublinear regret} bounds without requiring prior knowledge of the streaming matrix properties. It establishes a general framework for efficient sketch-based linear bandits, which can be integrated with any matrix sketching method that provides covariance guarantees. Comprehensive experimental evaluation demonstrates the superior utility-efficiency trade-off achieved by our approach.

2410.05419 2026-03-02 cs.LG cs.AI stat.ME

Joint Distribution-Informed Shapley Values for Sparse Counterfactual Explanations

Lei You, Yijun Bian, Lele Cao

Details
English abstract

Counterfactual explanations (CE) aim to reveal how small input changes flip a model's prediction, yet many methods modify more features than necessary, reducing clarity and actionability. We introduce \emph{COLA}, a model- and generator-agnostic post-hoc framework that refines any given CE by computing a coupling via optimal transport (OT) between factual and counterfactual sets and using it to drive a Shapley-based attribution (\emph{$p$-SHAP}) that selects a minimal set of edits while preserving the target effect. Theoretically, we show that OT minimizes an upper bound on the $W_1$ divergence between factual and counterfactual outcomes and that, under mild conditions, refined counterfactuals are guaranteed not to move farther from the factuals than the originals. Empirically, across four datasets, twelve models, and five CE generators, COLA achieves the same target effects with only 26--45\% of the original feature edits. On a small-scale benchmark, COLA shows near-optimality.

2409.15145 2026-03-02 stat.ME

Adaptive weight selection for time-to-event data under non-proportional hazards

Moritz Fabian Danzer, Ina Dormuth

Comments including Supplementary Material

Details
English abstract

When planning a clinical trial for a time-to-event endpoint, we require an estimated effect size and need to consider the type of effect. Usually, an effect of proportional hazards is assumed with the hazard ratio as the corresponding effect measure. Thus, the standard procedure for survival data is generally based on a single-stage log-rank test. Knowing that the assumption of proportional hazards is often violated and sufficient knowledge to derive reasonable effect sizes is usually unavailable, such an approach is relatively rigid. We introduce a more flexible procedure by combining two methods designed to be more robust in case we have little to no prior knowledge. First, we employ a more flexible adaptive multi-stage design instead of a single-stage design. Second, we apply combination-type tests in the first stage of our suggested procedure to benefit from their robustness under uncertainty about the deviation pattern. We can then use the data collected during this period to choose a more specific single-weighted log-rank test for the subsequent stages. In this step, we employ Royston-Parmar spline models to extrapolate the survival curves to make a reasonable decision. Based on a real-world data example, we show that our approach can save a trial that would otherwise end with an inconclusive result. Additionally, our simulation studies demonstrate a sufficient power performance while maintaining more flexibility.

2405.17591 2026-03-02 stat.ME

Individualized Dynamic Mediation Analysis Using Latent Factor Models

Yijiao Zhang, Yubai Yuan, Yuexia Zhang, Zhongyi Zhu, Annie Qu

Comments 35 pages, 4 figures, 3 tables

Details
English abstract

Mediation analysis plays a crucial role in causal inference as it can investigate the pathways through which treatment influences outcome. Most existing mediation analysis assumes that mediation effects are static and homogeneous within populations. However, mediation effects usually change over time and exhibit significant heterogeneity among individuals in many real-world applications. Additionally, the mediation mechanism can be complicated and non-sparse, making mediator selection particularly challenging. To address these issues, we propose an individualized dynamic mediation analysis method for mediator selection. Our approach can identify the significant mediators at the population level while capturing the time-varying and heterogeneous mediation effects at the individual level via varying-coefficient structural equation models. Another advantage of our method is that we allow the presence of unmeasured time-varying confounders that induce the heterogeneous mediation effects. We provide asymptotic results for the proposed estimator and selection consistency for significant mediators. Extensive simulation studies and an application to a DNA methylation study demonstrate the effectiveness and advantages of our method.

2405.12317 2026-03-02 stat.ML cs.LG

Kernel spectral joint embeddings for high-dimensional noisy datasets using duo-landmark integral operators

Xiucai Ding, Rong Ma

Comments 57 pages, 16 figures

Details
English abstract

Integrative analysis of multiple heterogeneous datasets has become standard practice in many research fields, especially in single-cell genomics and medical informatics. Existing approaches oftentimes suffer from limited power in capturing nonlinear structures, insufficient account of noisiness and effects of high-dimensionality, lack of adaptivity to signals and sample sizes imbalance, and their results are sometimes difficult to interpret. To address these limitations, we propose a novel kernel spectral method that achieves joint embeddings of two independently observed high-dimensional noisy datasets. The proposed method automatically captures and leverages possibly shared low-dimensional structures across datasets to enhance embedding quality. The obtained low-dimensional embeddings can be utilized for many downstream tasks such as simultaneous clustering, data visualization, and denoising. The proposed method is justified by rigorous theoretical analysis. Specifically, we show the consistency of our method in recovering the low-dimensional noiseless signals, and characterize the effects of the signal-to-noise ratios on the rates of convergence. Under a joint manifolds model framework, we establish the convergence of ultimate embeddings to the eigenfunctions of some newly introduced integral operators. These operators, referred to as duo-landmark integral operators, are defined by the convolutional kernel maps of some reproducing kernel Hilbert spaces (RKHSs). These RKHSs capture the either partially or entirely shared underlying low-dimensional nonlinear signal structures of the two datasets. Our numerical experiments and analyses of two single-cell omics datasets demonstrate the empirical advantages of the proposed method over existing methods in both embeddings and several downstream tasks.

2306.16056 2026-03-02 stat.ME

Confirmatory adaptive group sequential designs for clinical trials with multiple time-to-event outcomes in Markov models

Moritz Fabian Danzer, Andreas Faldum, Thorsten Simon, Barbara Hero, Rene Schmidt

Comments 32 pages, including 16 pages of Appendix; 3 figures; 4 tables

Details
English abstract

The analysis of multiple time-to-event outcomes in a randomised controlled clinical trial can be accomplished with existing methods. However, depending on the characteristics of the disease under investigation and the circumstances in which the study is planned, it may be of interest to conduct interim analyses and adapt the study design if necessary. Due to the expected dependency of the endpoints, the full available information on the involved endpoints may not be used for this purpose. We suggest a solution to this problem by embedding the endpoints in a multi-state model. If this model is Markovian, it is possible to take the disease history of the patients into account and allow for data-dependent design adaptations. To this end, we introduce a flexible test procedure for a variety of applications, but are particularly concerned with the simultaneous consideration of progression-free survival (PFS) and overall survival (OS). This setting is of key interest in oncological trials. We conduct simulation studies to determine the properties for small sample sizes and demonstrate an application based on data from the NB2004-HR study.

2306.09778 2026-03-02 cs.LG cs.NA math.NA math.OC stat.ML

Gradient is All You Need? How Consensus-Based Optimization can be Interpreted as a Stochastic Relaxation of Gradient Descent

Konstantin Riedl, Timo Klock, Carina Geldhauser, Massimo Fornasier

Comments 49 pages, 5 figures

Details
English abstract

In this paper, we provide a novel analytical perspective on the theoretical understanding of gradient-based learning algorithms by interpreting consensus-based optimization (CBO), a recently proposed multi-particle derivative-free optimization method, as a stochastic relaxation of gradient descent. Remarkably, we observe that through communication of the particles, CBO exhibits a stochastic gradient descent (SGD)-like behavior despite solely relying on evaluations of the objective function. The fundamental value of such a link between CBO and SGD lies in the fact that CBO is provably globally convergent to global minimizers for ample classes of nonsmooth and nonconvex objective functions. Hence, on the one side, we offer a novel explanation for the success of stochastic relaxations of gradient descent by furnishing useful and precise insights that explain how problem-tailored stochastic perturbations of gradient descent (like the ones induced by CBO) overcome energy barriers and reach deep levels of nonconvex functions. On the other side, and contrary to the conventional wisdom for which derivative-free methods ought to be inefficient or not to possess generalization abilities, our results unveil an intrinsic gradient descent nature of heuristics. Instructive numerical illustrations support the provided theoretical insights.

2302.01701 2026-03-02 stat.ML cs.LG

Assessment of Spatio-Temporal Predictors in the Presence of Missing and Heterogeneous Data

Daniele Zambon, Cesare Alippi

Journal ref Neurocomputing, Volume 675, 2026, Article 132963

Details
English abstract

Deep learning methods achieve remarkable predictive performance in modeling complex, large-scale data. However, assessing the quality of derived models has become increasingly challenging, as more classical statistical assumptions may no longer apply. These difficulties are particularly pronounced for spatio-temporal data, which exhibit dependencies across both space and time and are often characterized by nonlinear dynamics, time variance, and missing observations, hence calling for new accuracy assessment methodologies. This paper introduces a residual correlation analysis framework for assessing the optimality of spatio-temporal relational-enabled neural predictive models, notably in settings with incomplete and heterogeneous data. The framework leverages the principle that residual correlation indicates information not captured by the model, enabling the identification and localization of regions in space and time where predictive performance can be improved. A strength of the proposed approach is that it operates under minimal assumptions, allowing also for robust evaluation of deep learning models applied to multivariate time series, even in the presence of missing and heterogeneous data. In detail, the methodology constructs tailored spatio-temporal graphs to encode sparse spatial and temporal dependencies and employs asymptotically distribution-free summary statistics to detect time intervals and spatial regions where the model underperforms. The effectiveness of the proposed approach is demonstrated through experiments on both synthetic and real-world datasets using state-of-the-art predictive models.

2301.11690 2026-03-02 stat.ME stat.AP

A statistical framework for planning and analysing test-retest studies for repeatability of quantitative biomarker measurements

Moritz Fabian Danzer, Maria Eveslage, Dennis Görlich, Benjamin Noto

Comments 19 pages, 5 figures

Details
English abstract

There is an increasing number of potential biomarkers that could allow for early assessment of treatment response or disease progression. However, measurements of quantitative biomarkers are subject to random variability. Hence, differences of a biomarker in longitudinal measurements do not necessarily represent real change but might be caused by this random measurement variability. Before utilizing a quantitative biomarker in longitudinal studies, it is therefore essential to assess the measurement repeatability. Measurement repeatability obtained from test-retest studies can be quantified by the repeatability coefficient (RC), which is then used in the subsequent longitudinal study to determine if a measured difference represents real change or is within the range of expected random measurement variability. The quality of the point estimate of RC therefore directly governs the assessment quality of the longitudinal study. RC estimation accuracy depends on the case number in the test-retest study, but despite its pivotal role, no comprehensive framework for sample size calculation of test-retest studies exists. To address this issue, we have established such a framework, which allows for flexible sample size calculation of test-retest studies, based upon newly introduced criteria concerning assessment quality in the longitudinal study. This also permits retrospective assessment of prior test-retest studies.

2208.14960 2026-03-02 stat.ME cs.LG math.ST stat.ML stat.TH

Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case

Iskander Azangulov, Andrei Smolensky, Alexander Terenin, Viacheslav Borovitskiy

Comments This version fixes two mathematical typos, in equations (58) and (65), where both sums should be taken only over the diagonal part $π^{(λ)}_{jj}$ and not over $π^{(λ)}_{jk}$ as had erroneously been written in the previous version. The proofs for both statements remain unchanged. We thank Nathaël Da Costa for making us aware of this pair of typos

Journal ref Journal of Machine Learning Research, 2024

Details
English abstract

Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.

2109.02315 2026-03-02 stat.ME

One-sample log-rank tests with consideration of reference curve sampling variability

Jannik Feld, Moritz Fabian Danzer, Andreas Faldum, Rene Schmidt

Comments 24 pages, 2 pictures, 1 supplementary file

Details
English abstract

The one-sample log-rank test is the method of choice for single-arm Phase II trials with a time-to-event endpoint. It allows comparing the survival of the patients to a reference survival curve that typically represents the expected survival under standard of care. The classical one-sample log-rank test, however, assumes that the reference survival curve is deterministic. This ignores that the reference curve is commonly estimated from historic data and thus prone to statistical error. Ignoring sampling variability of the reference curve results in type I error rate inflation. For that reason, a new one-sample log-rank test is proposed that explicitly accounts for the statistical error made in the process of estimating the reference survival curve. The test statistic and its distributional properties are derived using martingale techniques in the large sample limit. In particular, a sample size formula is provided. Small sample properties regarding type I and type II error rate control are studied by simulation. A case study is conducted to study the influence of several design parameters of a single-armed trial on the inflation of the type I error rate when reference curve sampling variability is ignored.

2108.08194 2026-03-02 stat.ME

On variance estimation for the one-sample log-rank test

Moritz Fabian Danzer, Andreas Faldum, Rene Schmidt

Details
English abstract

Time-to-event endpoints show an increasing popularity in phase II cancer trials. The standard statistical tool for such one-armed survival trials is the one-sample log-rank test. Its distributional properties are commonly derived in the large sample limit. It is, however, known from the literature that the asymptotic approximations suffer when the sample size is small. There have already been several attempts to address this problem. While some approaches do not allow easy power and sample size calculations, others lack a clear theoretical motivation and require further considerations. The problem itself can partly be attributed to the dependence of the compensated counting process and its variance estimator. To address this, we suggest a variance estimator which is uncorrelated with the compensated counting process. Moreover, this and other present approaches to variance estimation are covered as special cases by our general framework. For practical application, we provide sample size and power calculations for any approach fitting into this framework. Finally, we use simulations and real-world data to study the empirical type I error and power performance of our methodology as compared to standard approaches.

2602.23561 2026-03-02 stat.ME cs.LG cs.SC stat.CO stat.ML

VaSST: Variational Inference for Symbolic Regression using Soft Symbolic Trees

Somjit Roy, Pritam Dey, Bani K. Mallick

Comments 38 pages, 5 figures, 35 tables, Submitted

Details
English abstract

Symbolic regression has recently gained traction in AI-driven scientific discovery, aiming to recover explicit closed-form expressions from data that reveal underlying physical laws. Despite recent advances, existing methods remain dominated by heuristic search algorithms or data-intensive approaches that assume low-noise regimes and lack principled uncertainty quantification. Fully probabilistic formulations are scarce, and existing Markov chain Monte Carlo-based Bayesian methods often struggle to efficiently explore the highly multimodal combinatorial space of symbolic expressions. We introduce VaSST, a scalable probabilistic framework for symbolic regression based on variational inference. VaSST employs a continuous relaxation of symbolic expression trees, termed soft symbolic trees, where discrete operator and feature assignments are replaced by soft distributions over allowable components. This relaxation transforms the combinatorial search over an astronomically large symbolic space into an efficient gradient-based optimization problem while preserving a coherent probabilistic interpretation. The learned soft representations induce posterior distributions over symbolic structures, enabling principled uncertainty quantification. Across simulated experiments and the Feynman Symbolic Regression Database within SRBench, VaSST achieves superior performance in both structural recovery and predictive accuracy compared to state-of-the-art symbolic regression methods.

2602.23535 2026-03-02 stat.ML cs.LG

Partition Function Estimation under Bounded f-Divergence

Adam Block, Abhishek Shetty

Details
English abstract

We study the statistical complexity of estimating partition functions given sample access to a proposal distribution and an unnormalized density ratio for a target distribution. While partition function estimation is a classical problem, existing guarantees typically rely on structural assumptions about the domain or model geometry. We instead provide a general, information-theoretic characterization that depends only on the relationship between the proposal and target distributions. Our analysis introduces the integrated coverage profile, a functional that quantifies how much target mass lies in regions where the density ratio is large. We show that integrated coverage tightly characterizes the sample complexity of multiplicative partition function estimation and provide matching lower bounds. We further express these bounds in terms of $f$-divergences, yielding sharp phase transitions depending on the growth rate of $f$ and recovering classical results as a special case while extending to heavy-tailed regimes. Matching lower bounds establish tightness in all regimes. As applications, we derive improved finite-sample guarantees for importance sampling and self-normalized importance sampling, and we show a strict separation between the complexity of approximate sampling and counting under the same divergence constraints. Our results unify and generalize prior analyses of importance sampling, rejection sampling, and heavy-tailed mean estimation, providing a minimal-assumption theory of partition function estimation. Along the way we introduce new technical tools including new connections between coverage and $f$-divergences as well as a generalization of the classical Paley-Zygmund inequality.

2602.23528 2026-03-02 cs.LG cs.CE stat.CO stat.ML

Neural Operators Can Discover Functional Clusters

Yicen Li, Jose Antonio Lara Benitez, Ruiyang Hong, Anastasis Kratsios, Paul David McNicholas, Maarten Valentijn de Hoop

Details
English abstract

Operator learning is reshaping scientific computing by amortizing inference across infinite families of problems. While neural operators (NOs) are increasingly well understood for regression, far less is known for classification and its unsupervised analogue: clustering. We prove that sample-based neural operators (SNOs) can learn any finite collection of classes in an infinite-dimensional reproducing kernel Hilbert space, even when the classes are neither convex nor connected, under mild kernel sampling assumptions. Our universal clustering theorem shows that any $K$ closed classes can be approximated to arbitrary precision by NO-parameterized classes in the upper Kuratowski topology on closed sets, a notion that can be interpreted as disallowing false-positive misclassifications. Building on this, we develop an NO-powered clustering pipeline for functional data and apply it to unlabeled families of ordinary differential equation (ODE) trajectories. Discretized trajectories are lifted by a fixed pre-trained encoder into a continuous feature map and mapped to soft assignments by a lightweight trainable head. Experiments on diverse synthetic ODE benchmarks show that the resulting practical SNO recovers latent dynamical structure in regimes where classical methods fail, providing evidence consistent with our universal clustering theory.

2602.23518 2026-03-02 stat.ML cs.LG

Uncovering Physical Drivers of Dark Matter Halo Structures with Auxiliary-Variable-Guided Generative Models

Arkaprabha Ganguli, Anirban Samaddar, Florian Kéruzoré, Nesar Ramachandra, Julie Bessac, Sandeep Madireddy, Emil Constantinescu

Details
English abstract

Deep generative models (DGMs) compress high-dimensional data but often entangle distinct physical factors in their latent spaces. We present an auxiliary-variable-guided framework for disentangling representations of thermal Sunyaev-Zel'dovich (tSZ) maps of dark matter halos. We introduce halo mass and concentration as auxiliary variables and apply a lightweight alignment penalty to encourage latent dimensions to reflect these physical quantities. To generate sharp and realistic samples, we extend latent conditional flow matching (LCFM), a state-of-the-art generative model, to enforce disentanglement in the latent space. Our Disentangled Latent-CFM (DL-CFM) model recovers the established mass-concentration scaling relation and identifies latent space outliers that may correspond to unusual halo formation histories. By linking latent coordinates to interpretable astrophysical properties, our method transforms the latent space into a diagnostic tool for cosmological structure. This work demonstrates that auxiliary guidance preserves generative flexibility while yielding physically meaningful, disentangled embeddings, providing a generalizable pathway for uncovering independent factors in complex astronomical datasets.

2602.23507 2026-03-02 cs.LG stat.AP stat.ME

Sample Size Calculations for Developing Clinical Prediction Models: Overview and pmsims R package

Diana Shamsutdinova, Felix Zimmer, Oyebayo Ridwan Olaniran, Sarah Markham, Daniel Stahl, Gordon Forbes, Ewan Carr

Comments 26 pages, 4 figures, 1 table, preprint

Details
English abstract

Background: Clinical prediction models are increasingly used to inform healthcare decisions, but determining the minimum sample size for their development remains a critical and unresolved challenge. Inadequate sample sizes can lead to overfitting, poor generalisability, and biased predictions. Existing approaches, such as heuristic rules, closed-form formulas, and simulation-based methods, vary in flexibility and accuracy, particularly for complex data structures and machine learning models. Methods: We review current methodologies for sample size estimation in prediction modelling and introduce a conceptual framework that distinguishes between mean-based and assurance-based criteria. Building on this, we propose a novel simulation-based approach that integrates learning curves, Gaussian Process optimisation, and assurance principles to identify sample sizes that achieve target performance with high probability. This approach is implemented in pmsims, an open-source, model-agnostic R package. Results: Through case studies, we demonstrate that sample size estimates vary substantially across methods, performance metrics, and modelling strategies. Compared to existing tools, pmsims provides flexible, efficient, and interpretable solutions that accommodate diverse models and user-defined metrics while explicitly accounting for variability in model performance. Conclusions: Our framework and software advance sample size methodology for clinical prediction modelling by combining flexibility with computational efficiency. Future work should extend these methods to hierarchical and multimodal data, incorporate fairness and stability metrics, and address challenges such as missing data and complex dependency structures.

2602.23505 2026-03-02 math.ST math.GR stat.TH

How to recover a permutation group amidst errors

Taylor Brysiewicz, Juhee Kim

Comments 31 pages, 14 Figures

Details
English abstract

We consider the problem of recovering a permutation group $G \leq S_n$ from an error-prone sampling process $X$. We model $X$ as an $S_n$-valued random variable, defined as a mixture of the uniform distributions on $G$ and $S_n$. Our suite of tools recovers properties of $G$ from $X$ and bolsters our main method for recovering $G$ itself. Our algorithms are motivated by the numerical computation of monodromy groups, a setting where such error-prone sampling procedures occur organically.

2602.23482 2026-03-02 econ.EM stat.ME

Testing Hypotheses About Ratios of Linear Trend Slopes in Systems of Equations with a Focus on Tests of Equal Trend Ratios

Timothy J. Vogelsang

Comments 30 pages

Details
English abstract

This paper develops inference methods for ratios of deterministic trend slopes in systems of pairs of time series. Hypotheses based on linear cross-equation restrictions are considered with particular interest in tests that trend ratios are equal across pairs of trending series. Tests of equal ratios can be used for the empirical assessment of climate models through comparisons of trend ratios (amplification ratios) of model generated temperature series and observed temperature series. The analysis in this paper builds on the estimation and inference methods developed by Vogelsang and Nawaz (2017, Journal of Time Series Analysis) for a single pair of trending time series. Because estimators of ratios can have poor finite sample properties when the trend slopes are small relative to variation around the trends, tests of equal trend ratios are restated in terms of products of trend slopes leading to inference that is less affected by small trend slopes. Asymptotic theory is developed that can be used to generate critical values. For tests of equal trend ratios, finite sample performance is assessed using simulations. Practical advice is provided for empirical practitioners. An empirical application compares amplification ratios (trend ratios) across a set of five groups of observed global temperature series.

2602.23459 2026-03-02 cs.LG q-bio.QM stat.ML

Global Interpretability via Automated Preprocessing: A Framework Inspired by Psychiatric Questionnaires

Eric V. Strobl

Details
English abstract

Psychiatric questionnaires are highly context sensitive and often only weakly predict subsequent symptom severity, which makes the prognostic relationship difficult to learn. Although flexible nonlinear models can improve predictive accuracy, their limited interpretability can erode clinical trust. In fields such as imaging and omics, investigators commonly address visit- and instrument-specific artifacts by extracting stable signal through preprocessing and then fitting an interpretable linear model. We adopt the same strategy for questionnaire data by decoupling preprocessing from prediction: we restrict nonlinear capacity to a baseline preprocessing module that estimates stable item values, and then learn a linear mapping from these stabilized baseline items to future severity. We refer to this two-stage method as REFINE (Redundancy-Exploiting Follow-up-Informed Nonlinear Enhancement), which concentrates nonlinearity in preprocessing while keeping the prognostic relationship transparently linear and therefore globally interpretable through a coefficient matrix, rather than through post hoc local attributions. In experiments, REFINE outperforms other interpretable approaches while preserving clear global attribution of prognostic factors across psychiatric and non-psychiatric longitudinal prediction tasks.

2602.17772 2026-03-02 stat.ME cs.LG

Sparse Bayesian Modeling of EEG Channel Interactions Improves P300 Brain-Computer Interface Performance

Guoxuan Ma, Yuan Zhong, Moyan Li, Yuxiao Nie, Jian Kang

Details
English abstract

Electroencephalography (EEG)-based P300 brain-computer interfaces (BCIs) enable communication without physical movement by detecting stimulus-evoked neural responses. Accurate and efficient decoding remains challenging due to high dimensionality, temporal dependence, and complex interactions across EEG channels. Most existing approaches treat channels independently or rely on black-box machine learning models, limiting interpretability and personalization. We propose a sparse Bayesian time-varying regression framework that explicitly models pairwise EEG channel interactions while performing automatic temporal feature selection. The model employs a relaxed-thresholded Gaussian process prior to induce structured sparsity in both channel-specific and interaction effects, enabling interpretable identification of task-relevant channels and channel pairs. Applied to a publicly available P300 speller dataset of 55 participants, the proposed method achieves a median character-level accuracy of 100\% using all stimulus sequences and attains the highest overall decoding performance among competing statistical and deep learning approaches. Incorporating channel interactions yields subgroup-specific gains of up to 7\% in character-level accuracy, particularly among participants who abstained from alcohol (up to 18\% improvement). Importantly, the proposed method improves median BCI-Utility by approximately 10\% at its optimal operating point, achieving peak throughput after only seven stimulus sequences. These results demonstrate that explicitly modeling structured EEG channel interactions within a principled Bayesian framework enhances predictive accuracy, improves user-centric throughput, and supports personalization in P300 BCI systems.

2602.06775 2026-03-02 cs.LG stat.ML

Robust Online Learning

Sajad Ashkezari

Details
English abstract

We study the problem of learning robust classifiers where the classifier will receive a perturbed input. Unlike robust PAC learning studied in prior work, here the clean data and its label are also adversarially chosen. We formulate this setting as an online learning problem and consider both the realizable and agnostic learnability of hypothesis classes. We define a new dimension of classes and show it controls the mistake bounds in the realizable setting and the regret bounds in the agnostic setting. In contrast to the dimension that characterizes learnability in the PAC setting, our dimension is rather simple and resembles the Littlestone dimension. We generalize our dimension to multiclass hypothesis classes and prove similar results in the realizable case. Finally, we study the case where the learner does not know the set of allowed perturbations for each point and only has some prior on them.

2511.18060 2026-03-02 stat.ML cs.LG stat.ME

An operator splitting analysis of Wasserstein--Fisher--Rao gradient flows

Francesca Romana Crucinio, Sahani Pathiraja

Details
English abstract

Wasserstein-Fisher-Rao (WFR) gradient flows have been recently proposed as a powerful sampling tool that combines the advantages of pure Wasserstein (W) and pure Fisher-Rao (FR) gradient flows. Existing algorithmic developments implicitly make use of operator splitting techniques to numerically approximate the WFR partial differential equation, whereby the W flow is evaluated over a given step size and then the FR flow (or vice versa). This work investigates the impact of the order in which the W and FR operators are evaluated and aims to provide a quantitative analysis. Somewhat surprisingly, we show that with a judicious choice of step size and operator ordering, the split scheme can converge to the target distribution faster than the exact WFR flow (in terms of model time). We obtain variational formulae describing the evolution over one time step of both splitting schemes and investigate in which settings the W-FR split should be preferred to the FR-W split. As a step towards this goal, we show that the WFR gradient flow preserves log-concavity and obtain the first sharp decay bound for WFR flow.

2510.06091 2026-03-02 cs.LG cs.SY eess.SY q-bio.NC stat.ML

Learning Mixtures of Linear Dynamical Systems via Hybrid Tensor-EM Method

Lulu Gong, Shreya Saxena

Comments 24 pages, 14 figures

Details
English abstract

Mixtures of linear dynamical systems (MoLDS) provide a path to model time-series data that exhibit diverse temporal dynamics across trajectories. However, their application remains challenging in complex and noisy settings, limiting their effectiveness for neural data analysis. Tensor-based moment methods can provide global identifiability guarantees for MoLDS, but their performance degrades under noise and complexity. Commonly used expectation-maximization (EM) methods offer flexibility in fitting latent models but are highly sensitive to initialization and prone to poor local minima. Here, we propose a tensor-based method that provides identifiability guarantees for learning MoLDS, which is followed by EM updates to combine the strengths of both approaches. The novelty in our approach lies in the construction of moment tensors using the input-output data to recover globally consistent estimates of mixture weights and system parameters. These estimates can then be refined through a Kalman EM algorithm, with closed-form updates for all LDS parameters. We validate our framework on synthetic benchmarks and real-world datasets. On synthetic data, the proposed Tensor-EM method achieves more reliable recovery and improved robustness compared to either pure tensor or randomly initialized EM methods. We then analyze neural recordings from the primate somatosensory cortex while a non-human primate performs reaches in different directions. Our method successfully models and clusters different conditions as separate subsystems, consistent with supervised single-LDS fits for each condition. Finally, we apply this approach to another neural dataset where monkeys perform a sequential reaching task. These results demonstrate that MoLDS provides an effective framework for modeling complex neural data, and that Tensor-EM is a reliable approach to MoLDS learning for these applications.

2507.06867 2026-03-02 stat.ML cs.CV cs.LG stat.ME

Conformal Prediction for Long-Tailed Classification

Tiffany Ding, Jean-Baptiste Fermanian, Joseph Salmon

Details
English abstract

Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii) be a reasonable size, allowing users to easily verify candidate labels. Unfortunately, existing conformal prediction methods, when applied to the long-tailed setting, force practitioners to make a binary choice between small sets with poor class-conditional coverage or sets that have very good class-conditional coverage but are extremely large. We propose methods with marginal coverage guarantees that smoothly trade off set size and class-conditional coverage. First, we introduce a new conformal score function called prevalence-adjusted softmax that optimizes for macro-coverage, defined as the average class-conditional coverage across classes. Second, we propose a new procedure that interpolates between marginal and class-conditional conformal prediction by linearly interpolating their conformal score thresholds. We demonstrate our methods on Pl@ntNet-300K and iNaturalist-2018, two long-tailed image datasets with 1,081 and 8,142 classes, respectively.
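
The second procedure in this abstract, linear interpolation between marginal and class-conditional conformal thresholds, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scores are treated as nonconformity scores (a label enters the prediction set when its score is at or below the threshold), the quantile uses the common $\lceil (n+1)(1-\alpha) \rceil$ finite-sample convention, and the interpolation weight `tau` and all function names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest score, or +inf if n is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]

def interpolated_thresholds(cal_scores, cal_labels, alpha, tau):
    """Per-class threshold tau * class-conditional + (1 - tau) * marginal."""
    q_marginal = conformal_quantile(cal_scores, alpha)
    per_class = {}
    for score, label in zip(cal_scores, cal_labels):
        per_class.setdefault(label, []).append(score)
    thresholds = {}
    for label, scores in per_class.items():
        q_class = conformal_quantile(scores, alpha)
        if tau == 0:      # pure marginal; also avoids 0 * inf
            thresholds[label] = q_marginal
        elif tau == 1:    # pure class-conditional
            thresholds[label] = q_class
        else:
            thresholds[label] = tau * q_class + (1 - tau) * q_marginal
    return thresholds

def prediction_set(test_scores, thresholds):
    """Labels whose score for the test point is within their threshold."""
    return {lab for lab, s in test_scores.items() if s <= thresholds[lab]}

# Toy calibration data: class "A" has low scores, class "B" high scores.
cal_scores = list(range(1, 10)) + list(range(11, 20))
cal_labels = ["A"] * 9 + ["B"] * 9
half = interpolated_thresholds(cal_scores, cal_labels, alpha=0.25, tau=0.5)
```

In this sketch a class with very few calibration points gets an infinite class-conditional threshold, so any `tau > 0` always includes it; that is the set-size cost of class-conditional coverage that the abstract describes for long-tailed data.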

2503.12354 2026-03-02 cs.LG stat.ML

Probabilistic Neural Networks (PNNs) with t-Distributed Outputs: Adaptive Prediction Intervals Beyond Gaussian Assumptions

Farhad Pourkamali-Anaraki

Comments 9 Figures, 1 Table

Details
English abstract

Traditional neural network regression models provide only point estimates, failing to capture predictive uncertainty. Probabilistic neural networks (PNNs) address this limitation by producing output distributions, enabling the construction of prediction intervals. However, the common assumption of Gaussian output distributions often results in overly wide intervals, particularly in the presence of outliers or deviations from normality. To enhance the adaptability of PNNs, we propose t-Distributed Neural Networks (TDistNNs), which generate t-distributed outputs, parameterized by location, scale, and degrees of freedom. The degrees of freedom parameter allows TDistNNs to model heavy-tailed predictive distributions, improving robustness to non-Gaussian data and enabling more adaptive uncertainty quantification. We incorporate a likelihood based on the t-distribution into neural network training and derive efficient gradient computations for seamless integration into deep learning frameworks. Empirical evaluations on synthetic and real-world data demonstrate that TDistNNs improve the balance between coverage and interval width. Notably, for identical architectures, TDistNNs consistently produce narrower prediction intervals than Gaussian-based PNNs while maintaining proper coverage. This work contributes a flexible framework for uncertainty estimation in neural networks tasked with regression, particularly suited to settings involving complex output distributions.
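
To make the likelihood mentioned in this abstract concrete, the sketch below evaluates the negative log-likelihood of a single observation under a location-scale Student-t distribution, alongside the Gaussian counterpart. It uses the standard t density; the function names are assumptions, and the network architecture and gradient derivations from the paper are not shown.

```python
import math

def student_t_nll(y, loc, scale, df):
    """Negative log-density of a location-scale Student-t at y."""
    z = (y - loc) / scale
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return -(log_norm - (df + 1) / 2 * math.log1p(z * z / df))

def gaussian_nll(y, loc, scale):
    """Negative log-density of a Gaussian at y, for comparison."""
    z = (y - loc) / scale
    return 0.5 * math.log(2 * math.pi) + math.log(scale) + 0.5 * z * z

# An observation 10 scale units from the predicted location.
outlier_t = student_t_nll(10.0, 0.0, 1.0, df=3.0)
outlier_gauss = gaussian_nll(10.0, 0.0, 1.0)
```

The t loss grows logarithmically in the standardized residual rather than quadratically, so a single outlier is penalized far less than under the Gaussian loss. As `df` grows, the t loss approaches the Gaussian one.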

2501.05836 2026-03-02 stat.ME

Treatment Effect Estimation in Causal Survival Analysis: Practical Recommendations

Charlotte Voinot, Clément Berenfeld, Imke Mayer, Bernard Sebastien, Julie Josse

Details
English abstract

The restricted mean survival time (RMST) difference offers an interpretable causal contrast to estimate the treatment effect for time-to-event outcomes, yet a wide range of available estimators leaves limited guidance for practice. We provide a unified review of RMST estimators for randomized trials and observational studies, establish identification and asymptotic properties, and supply new derivations where needed. Our extensive simulation study compares simple nonparametric methods (such as unweighted Kaplan-Meier estimators) alongside parametric and nonparametric implementations of the G-formula, weighting approaches, Buckley-James transformations, and augmented estimators under diverse censoring mechanisms and model specifications. Across scenarios, classical Kaplan-Meier estimators (weighted when required by the censoring process) and G-formula methods perform well in randomized settings, while in observational data G-formula estimators remain competitive; however, augmented estimators such as AIPTW-AIPCW generally offer robustness to model misspecification and a favorable bias-variance trade-off. Parametric estimators perform best under correct specification, whereas nonparametric methods avoid functional assumptions but require large sample sizes to achieve reliable performance. We offer practical recommendations for estimator choice and provide open-source R code to support reproducibility and application.
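
For readers unfamiliar with the estimand, the RMST at horizon $\tau$ is the area under the survival curve on $[0, \tau]$. The sketch below computes it from an unweighted Kaplan-Meier curve, the simplest estimator the abstract mentions. Function names and toy data are assumptions for illustration; the weighted, G-formula, and augmented estimators compared in the paper are not shown, and this is not the authors' R code.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at distinct event times.
    events[i] is 1 for an observed event and 0 for censoring."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

def rmst(curve, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Toy arms with fully observed event times.
control = rmst(kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1]), tau=4)
treated = rmst(kaplan_meier([2, 3, 4, 5], [1, 1, 1, 1]), tau=4)
rmst_difference = treated - control
```

Without censoring, the RMST equals the sample mean of $\min(T, \tau)$, which gives a quick sanity check on the toy data: 2.5 for the control arm and 3.25 for the treated arm.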

2406.16826 2026-03-02 stat.AP

Practical privacy metrics for synthetic data

Gillian M Raab, Beata Nowok, Chris Dibben

Comments 24 pages, including 3 figures and references and appendices. Also appears as a vignette for the synthpop package for R

Details
English abstract

This paper explains how the synthpop package for R has been extended to include functions to calculate measures of identity and attribute disclosure risk for synthetic data that measure risks for the records used to create the synthetic data. The basic function, disclosure, calculates identity disclosure for a set of quasi-identifiers (keys) and attribute disclosure for one variable specified as a target from the same set of keys. The second function, disclosure.summary, is a wrapper for the first and presents summary results for a set of targets. This short paper explains the measures of disclosure risk and documents how they are calculated. We recommend two measures: $RepU$ (replicated uniques) for identity disclosure and $DiSCO$ (Disclosive in Synthetic Correct Original) for attribute disclosure. Both are expressed as a \% of the original records and each can be compared to similar measures calculated from the original data. Experience with using the functions on real data found that some apparent disclosures could be identified as coming from relationships in the data that would be expected to be known to anyone familiar with its features. We flag cases when this seems to have occurred and provide means of excluding them. This paper was originally written as a vignette for the R package synthpop, with substantial changes added in February 2026 for synthpop version 1.9-3.

2109.13124 2026-03-02 math.ST stat.ME stat.TH

Parameterising the effect of a continuous treatment using average derivative effects

Oliver J. Hines, Karla Diaz-Ordaz, Stijn Vansteelandt

Comments Replication code is available from https://github.com/ohines/alse

Details
English abstract

The average treatment effect (ATE) is commonly used to quantify the main effect of a binary treatment on an outcome. Extensions to continuous treatments are usually based on the dose-response curve or shift interventions, but both require strong overlap conditions and the resulting curves may be difficult to summarise. We focus instead on average derivative effects (ADEs) that are scalar estimands related to infinitesimal shift interventions requiring only local overlap assumptions. ADEs, however, are rarely used in practice because their estimation usually requires estimating conditional density functions. By characterising the Riesz representers of weighted ADEs, we propose a new class of estimands that provides a unified view of weighted ADEs/ATEs when the treatment is continuous/binary. We derive the estimand in our class that minimises the nonparametric efficiency bound, thereby extending optimal weighting results from the binary treatment literature to the continuous setting. We develop efficient estimators for two weighted ADEs that avoid density estimation and are amenable to modern machine learning methods, which we evaluate in simulations and an applied analysis of Warfarin dosage effects.