arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.16854 2026-03-18 stat.ME

Spatial Causal Tensor Completion for Multiple Exposures and Outcomes: An Application to the Health Effects of PFAS Pollution

Xiaodan Zhou, Brian J Reich, Shu Yang

Abstract

Per- and polyfluoroalkyl substances (PFAS) are typically encountered as mixtures of distinct chemicals with distinct effects on multiple health outcomes. Estimating joint causal effects using spatially-dependent observed data is challenging. We propose a spatial causal tensor completion framework that jointly models multiple exposures and outcomes within a low-rank tensor structure, while adjusting for observed confounders and latent spatial confounders. This method combines a low-rank tensor representation to pool information across exposures and outcomes with a spectral adjustment step that incorporates graph-Laplacian eigenvectors to approximate unmeasured spatial confounders, implemented via a projected-gradient descent algorithm. This framework enables causal inference in the presence of unmeasured spatial confounding and pervasive missingness of potential outcomes. We establish theoretical guarantees for the estimator and evaluate its finite-sample performance through extensive simulations. In an application to national PFAS monitoring data, our approach yields more conservative and credible causal relationships between PFOA and PFOS exposure and 13 chronic disease outcomes compared with existing alternatives.

2603.16829 2026-03-18 stat.ML cs.LG math.ST stat.ME stat.TH

Conditional Distributional Treatment Effects: Doubly Robust Estimation and Testing

Saksham Jain, Alex Luedtke

Abstract

Beyond conditional average treatment effects, treatments may impact the entire outcome distribution in covariate-dependent ways, for example, by altering the variance or tail risks for specific subpopulations. We propose a novel estimand to capture such conditional distributional treatment effects, and develop a doubly robust estimator that is minimax optimal in the local asymptotic sense. Using this, we develop a test for the global homogeneity of conditional potential outcome distributions that accommodates discrepancies beyond the maximum mean discrepancy (MMD), has provably valid type 1 error, and is consistent against fixed alternatives -- the first test, to our knowledge, with such guarantees in this setting. Furthermore, we derive exact closed-form expressions for two natural discrepancies (including the MMD), and provide a computationally efficient, permutation-free algorithm for our test.

2603.16798 2026-03-18 cs.LG cs.DS math.ST stat.ML stat.TH

High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas

Abstract

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed the realizable $ε$-contamination model. In this model an adversary can choose a function $r(x)$ between 0 and $ε$ and each sample $x$ goes missing with probability $r(x)$. Recent work (Ma et al., 2024) proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR) -- where missingness is independent of the data -- and Missing Not At Random (MNAR) -- where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under $ε$-realizable contamination.
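The missingness mechanism described in this abstract can be illustrated with a small one-dimensional simulation. This is only a sketch of the contamination model, not of the paper's estimator or lower bound: the function name `observed_mean`, the choice of adversary, and the value of `eps` are assumptions made for illustration.

```python
import random

def observed_mean(n, eps, r, seed=0):
    """Draw n samples from N(0, 1); drop each sample x with probability r(x).

    r(x) is clipped to [0, eps], matching the constraint that the adversary's
    function takes values between 0 and eps. Returns the mean of what is left.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p_missing = min(max(r(x), 0.0), eps)
        if rng.random() >= p_missing:
            kept.append(x)
    return sum(kept) / len(kept)

eps = 0.3
# MCAR special case: missingness does not depend on x, so the naive mean stays near 0.
mcar_mean = observed_mean(100_000, eps, lambda x: eps)
# A hypothetical adversary that only hides positive samples biases the naive mean downward.
adv_mean = observed_mean(100_000, eps, lambda x: eps if x > 0 else 0.0)
print(mcar_mean, adv_mean)
```

The true mean is 0 in both runs; only the second, data-dependent choice of $r(x)$ shifts the naive sample mean, which is why estimators for this model must account for the adversary.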

2603.16785 2026-03-18 math.ST stat.TH

Local asymptotic normality for mixed fractional Ornstein-Uhlenbeck process under high-frequency observation

Chunhao Cai, Yiwu Shang, Cong Zhang

Abstract

This paper considers the local asymptotic normality (LAN) property for the mixed fractional Ornstein-Uhlenbeck process under high-frequency observation when $H > 3/4$. As in the case of mixed fractional Brownian motion, we use a projection step to obtain a non-diagonal rate matrix.

2603.12454 2026-03-18 stat.ME stat.AP

Rank-based methods for estimating landmark win probability in longitudinal randomized controlled trials with missing data

Guangyong Zou, Shi-Fang Qui, Joshua Zou, Emma Davies Smith, Yun-Hee Choi, Yuhan Bi

Abstract

The primary analysis for longitudinal randomized controlled trials (RCTs) often compares treatment groups at the last timepoint, referred to as the landmark time. Assuming data are normally distributed and missing at random, the mixed model for repeated measures (MMRM) is widely used to conduct inference in terms of a mean difference. When outcomes violate the normality assumption and/or the mean difference lacks a clear interpretation, we may quantify treatment effects using the probability that a treated participant would have a better outcome than (or win over) a control participant. For RCTs with missing data, one may apply the generalized pairwise comparison (GPC) procedure, which carries forward the results of a pairwise comparison from a previous timepoint. We propose first using ranks to convert each observation at a timepoint into a win fraction, reflecting the proportion of times that the observation is better than every observation in the comparison group. Then, we conduct inference for the win probability based on the win fractions using the MMRM to obtain the point and variance estimates. Simulation results suggest that our method performed much better than the GPC procedure. We illustrate our proposed procedure in SAS and R using data from two published trials.
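The rank-to-win-fraction conversion mentioned in this abstract can be sketched for a single timepoint as follows. This is a minimal illustration that assumes higher values are better and counts ties as half a win; the function names are hypothetical, and the subsequent MMRM step of the paper is not reproduced here.

```python
def win_fractions(group, comparison):
    """Win fraction of each observation in `group` against `comparison` (ties count 1/2)."""
    fractions = []
    for y in group:
        wins = sum(1.0 if y > c else 0.5 if y == c else 0.0 for c in comparison)
        fractions.append(wins / len(comparison))
    return fractions

def midranks(values):
    """Midrank of each value: (# strictly smaller) + (# equal + 1) / 2."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]

def win_fractions_by_rank(group, comparison):
    """Same quantity via ranks: (pooled rank - within-group rank) / size of comparison group."""
    pooled = midranks(group + comparison)[: len(group)]
    within = midranks(group)
    return [(p - w) / len(comparison) for p, w in zip(pooled, within)]

treated = [5, 7, 7, 9]
control = [4, 7, 8]
direct = win_fractions(treated, control)
ranked = win_fractions_by_rank(treated, control)
# Averaging the win fractions gives the sample win probability at this timepoint.
win_probability = sum(direct) / len(direct)
print(direct, ranked, win_probability)
```

The two functions agree because the difference between an observation's pooled midrank and its within-group midrank counts exactly the comparison-group observations it beats, plus half of those it ties.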

2602.16933 2026-03-18 stat.ME math.ST stat.ML stat.TH

M-estimation under Two-Phase Multiwave Sampling with Applications to Prediction-Powered Inference

Dan M. Kluger, Stephen Bates

Abstract

In two-phase multiwave sampling, inexpensive measurements are collected on a large sample and expensive, more informative measurements are adaptively obtained on subsets of units across multiple waves. Adaptively collecting the expensive measurements can increase efficiency but complicates statistical inference. We give valid estimators and confidence intervals for M-estimation under adaptive two-phase multiwave sampling. We focus on the case where proxies for the expensive variables -- such as predictions from pretrained machine learning models -- are available for all units and propose a Multiwave Predict-Then-Debias estimator that combines proxy information with the expensive, higher-quality measurements to improve efficiency while removing bias. We establish asymptotic linearity and normality and propose asymptotically valid confidence intervals. We also develop an approximately greedy sampling strategy that improves efficiency relative to uniform sampling. Data-based simulation studies support the theoretical results and demonstrate efficiency gains.

2506.19015 2026-03-18 stat.ME

Principal stratification with recurrent events truncated by a terminal event: A nested Bayesian nonparametric approach

Yuki Ohnishi, Michael O. Harhay, Guangyu Tong, Fan Li

Comments 58 pages

Abstract

Recurrent events often serve as key endpoints in clinical studies but may be prematurely truncated by terminal events such as death, creating selection bias and complicating causal inference. To address this challenge, we develop a Bayesian nonparametric approach that handles potential selection bias due to truncation by death within the continuous-time principal stratification framework. We introduce causal estimands for recurrent events in the presence of a terminal event and derive a partial identification result for the estimand under a dual-frailty framework, enabling transparent sensitivity analysis for non-identifiable parameters. We then propose a flexible Bayesian nonparametric prior, the enriched dependent Dirichlet process, specifically designed for joint modeling of recurrent and terminal events, addressing a limitation where standard Dirichlet process priors create random partitions dominated by recurrent events, yielding poor predictive performance for terminal events. Simulations show that our method outperforms existing methods. We apply the proposed Bayesian nonparametric methods to infer the causal effect of a structured exercise program on rehospitalizations, which are subject to truncation by death.

2603.16756 2026-03-18 stat.CO stat.AP

Sequential Bayesian Experimental Design for Prediction in Physical Experiments Informed by Computer Models

Hao Zhu, Markus Hainy

Comments Accepted for presentation at the SIAM Conference on Uncertainty Quantification (UQ26), March 22-25, 2026, Minneapolis, USA

Abstract

In many scientific and engineering domains, physical experiments are often costly, non-replicable, or time-consuming. The Kennedy and O'Hagan (KOH) model framework has become a widely used approach for combining simulator runs with limited experimental observations. Under a Bayesian implementation, the simulator output, model discrepancy, and observation noise are jointly modeled by coupled Gaussian processes, followed by coherent posterior inference and uncertainty quantification. This work presents a genuinely sequential Bayesian experimental design (BED) framework explicitly aimed at improving the predictive performance of the KOH model. We employ a mutual information (MI)-based criterion and develop a hybrid variant that integrates it with measures of local model complexity, leading to significantly more efficient design decisions. We further show theoretically that the MI-based criterion is more comprehensive and robust than the classical integrated mean squared prediction error (IMSPE) minimization criterion, especially when the model is highly uncertain in the early stages of the experiment. To mitigate the computational burden of fully Bayesian inference and the ensuing BED process, we propose two acceleration strategies - Gaussian Mixture Compression and Schur complement and rank-one update - which together substantially reduce runtime. Finally, we demonstrate the effectiveness of the proposed methods through both a synthetic example and a real biochemical case study, and compare them against several classical design criteria under sequential (offline) and adaptive (online) BED settings.

2603.16741 2026-03-18 cs.LG q-bio.NC q-bio.QM stat.ML

Bayesian Inference of Psychometric Variables From Brain and Behavior in Implicit Association Tests

Christian A. Kothe, Sean Mullen, Michael V. Bronstein, Grant Hanada, Marcelo Cicconet, Aaron N. McInnes, Tim Mullen, Marc Aafjes, Scott R. Sponheim, Alik S. Widge

Comments 43 pages, 7 figures, 6 tables, submitted to: Journal of Neural Engineering

Abstract

Objective. We establish a principled method for inferring mental health related psychometric variables from neural and behavioral data using the Implicit Association Test (IAT) as the data generation engine, aiming to overcome the limited predictive performance (typically under 0.7 AUC) of the gold-standard D-score method, which relies solely on reaction times. Approach. We propose a sparse hierarchical Bayesian model that leverages multi-modal data to predict experiences related to mental illness symptoms in new participants. The model is a multivariate generalization of the D-score with trainable parameters, engineered for parameter efficiency in the small-cohort regime typical of IAT studies. Data from two IAT variants were analyzed: a suicidality-related E-IAT ($n=39$) and a psychosis-related PSY-IAT ($n=34$). Main Results. Our approach overcomes a high inter-individual variability and low within-session effect size in the dataset, reaching AUCs of 0.73 (E-IAT) and 0.76 (PSY-IAT) in the best modality configurations, though corrected 95% confidence intervals are wide ($\pm 0.18$) and results are marginally significant after FDR correction ($q=0.10$). Restricting the E-IAT to MDD participants improves AUC to 0.79 $[0.62, 0.97]$ (significant at $q=0.05$). Performance is on par with the best reference methods (shrinkage LDA and EEGNet) for each task, even when the latter were adapted to the task, while the proposed method was not. Accuracy was substantially above near-chance D-scores (0.50-0.53 AUC) in both tasks, with more consistent cross-task performance than any single reference method. Significance. Our framework shows promise for enhancing IAT-based assessment of experiences related to entrapment and psychosis, and potentially other mental health conditions, though further validation on larger and independent cohorts will be needed to establish clinical utility.

2603.16729 2026-03-18 cs.LG cs.CE econ.EM math.OC stat.ML

GeMA: Learning Latent Manifold Frontiers for Benchmarking Complex Systems

Jia Ming Li, Anupriya, Daniel J. Graham

Comments Latent manifold frontiers for benchmarking complex production systems, and applications to national rail operators, wind farms, and macroeconomic productivity are presented

Abstract

Benchmarking the performance of complex systems such as rail networks, renewable generation assets and national economies is central to transport planning, regulation and macroeconomic analysis. Classical frontier methods, notably Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA), estimate an efficient frontier in the observed input-output space and define efficiency as distance to this frontier, but rely on restrictive assumptions on the production set and only indirectly address heterogeneity and scale effects. We propose Geometric Manifold Analysis (GeMA), a latent manifold frontier framework implemented via a productivity-manifold variational autoencoder (ProMan-VAE). Instead of specifying a frontier function in the observed space, GeMA represents the production set as the boundary of a low-dimensional manifold embedded in the joint input-output space. A split-head encoder learns latent variables that capture technological structure and operational inefficiency. Efficiency is evaluated with respect to the learned manifold, endogenous peer groups arise as clusters in latent technology space, a quotient construction supports scale-invariant benchmarking, and a local certification radius, derived from the decoder Jacobian and a Lipschitz bound, quantifies the geometric robustness of efficiency scores. We validate GeMA on synthetic data with non-convex frontiers, heterogeneous technologies and scale bias, and on four real-world case studies: global urban rail systems (COMET), British rail operators (ORR), national economies (Penn World Table) and a high-frequency wind-farm dataset. Across these domains GeMA behaves comparably to established methods when classical assumptions hold, and provides additional insight in settings with pronounced heterogeneity, non-convexity or size-related bias.

2603.16712 2026-03-18 math.ST cs.DS cs.LG stat.ML stat.TH

High-dimensional estimation with missing data: Statistical and computational limits

Kabir Aladin Verchand, Ankit Pensia, Saminul Haque, Rohith Kuditipudi

Abstract

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
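To get a feel for the size of the gap between the two rates quoted for mean estimation, the following sketch evaluates $d e^{1/ρ^2}$ and $d^{1/ρ^2}$ for a few values. Both expressions are stated only "roughly" in the abstract, so constants and logarithmic factors are ignored here, and the function names are hypothetical.

```python
import math

def info_theoretic_samples(d, rho):
    """Rough information-theoretic rate n ~ d * exp(1 / rho^2), constants ignored."""
    return d * math.exp(1.0 / rho**2)

def efficient_samples(d, rho):
    """Rough rate for efficient algorithms n ~ d^(1 / rho^2), constants ignored."""
    return d ** (1.0 / rho**2)

for d in (10, 100, 1000):
    for rho in (1.0, 0.7, 0.5):
        n_info = info_theoretic_samples(d, rho)
        n_eff = efficient_samples(d, rho)
        print(f"d={d:5d} rho={rho:.1f}  info~{n_info:.3g}  efficient~{n_eff:.3g}")
```

At $ρ = 1$ the efficient rate is $d$ while the information-theoretic rate is $e \cdot d$, so the comparison says nothing there once constants are ignored; the gap only opens as $ρ$ shrinks, because the exponent $1/ρ^2$ multiplies $\log d$ in one rate and is merely added to it in the other.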

2603.16535 2026-03-18 cs.LG math.OC stat.ML

SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds

Viktor Stein, Wuchen Li, Gabriele Steidl

Comments 24 pages, 2 figures, 3 tables, comments welcome!

Abstract

Transformers owe much of their empirical success in natural language processing to the self-attention blocks. Recent perspectives interpret attention blocks as interacting particle systems, whose mean-field limits correspond to gradient flows of interaction energy functionals on probability density spaces equipped with Wasserstein-$2$-type metrics. We extend this viewpoint by introducing accelerated attention blocks derived from inertial Nesterov-type dynamics on density spaces. In our proposed architecture, tokens carry both spatial (feature) and velocity variables. The time discretization and the approximation of accelerated density dynamics yield Hamiltonian momentum attention blocks, which constitute the proposed accelerated attention architectures. In particular, for linear self-attention, we show that the attention blocks approximate a Stein variational gradient flow, using a bilinear kernel, of a potential energy. In this setting, we prove that elliptically contoured probability distributions are preserved by the accelerated attention blocks. We present implementable particle-based algorithms and demonstrate that the proposed accelerated attention blocks converge faster than the classical attention blocks while preserving the number of oracle calls.

2603.16530 2026-03-18 stat.ME

Estimation and Hypothesis Testing of Fixed Effects Models-Based Uncertainty for Factor Designs

Fan Zhang, Zhiming Li

Comments 24 pages, 10 tables

Abstract

To analyze the uncertain data frequently encountered in practice, this paper proposes novel fixed-effects models that incorporate an uncertain measure to investigate variables of interest and nuisance variables in factor designs. First, an uncertain fixed-effects (UFE) model of a single-factor design is established, and uncertain estimation and hypothesis testing are conducted. We then extend the UFE model to two-factor designs with and without interactions and classify them as balanced or unbalanced based on the equality of replicates within each combination. For these UFE models, the effectiveness and practicality of the estimation and hypothesis testing methods are demonstrated through three real-world cases, including both balanced and unbalanced designs. These examples highlight the models' ability to handle uncertain experimental data.

2603.16400 2026-03-18 stat.ME

A nonparametric approach to understand multivariate quantile dynamics in financial time series

Kunal Rai, Archi Roy, Itai Dattner, Soudeep Deb

Abstract

Over the last decade, nonparametric methods have gained increasing attention for modeling complex data structures due to their flexibility and minimal structural assumptions. In this paper, we study a general multivariate nonparametric regression framework that encompasses a broad class of parametric models commonly used in financial econometrics. Both the response and the covariate processes are allowed to be multivariate with fixed finite dimensions, and the framework accommodates temporal dependence, thereby introducing additional modeling and theoretical hurdles. To address these challenges, we adopt a functional dependence structure which permits flexible dynamic behavior while maintaining tractable asymptotic analysis. Within this setting, we establish strong and weak convergence results for the estimators of the conditional mean and volatility functions. In addition, we investigate conditional geometric quantiles in the multivariate time series context and prove their consistency under mild regularity conditions. The finite sample performance is examined through comprehensive simulation studies, and the methodology is illustrated by modeling the stock returns of Maersk and Lockheed Martin as a nonparametric function of a geopolitical risk index.

2603.16345 2026-03-18 astro-ph.IM stat.AP

Optimising the FRB Search Pipeline for the Northern Cross Radio Telescope

Hayley Camilleri, Alessio Magro, Andrea Geminardi, Giovanni Naldi, Gianni Bernardi, Luca Bruno, Valentina Cesare, Francesco Fiori, Davide Pelliciari, Maura Pilia, Matteo Trudu

Comments 18 pages, 8 figures

Abstract

FRB search pipelines are being developed to operate under strict real-time constraints while maintaining sensitivity to short-duration transient signals. In incoherent dedispersion based pipelines such as Heimdall, apart from observation bandwidth and number of beams, detection performance and computational throughput are strongly dependent on the choice of processing parameters, which are often selected heuristically. In this work, we present a systematic evaluation of key dedispersion and matched filtering parameters and quantify their impact on both detection accuracy and runtime performance. A controlled synthetic injection framework is developed in which artificial FRB pulses with known DMs, SNRs, and pulse widths are embedded into realistic filterbank data containing instrumental noise representative of observations from the Northern Cross radio telescope. Using this framework, a grid of Heimdall configurations is explored, spanning DM tolerance, boxcar filter width, and processing gulp size. Detection performance is assessed by comparing recovered and injected signal properties, while computational performance is evaluated through end-to-end processing time measurements. The results reveal clear trade-offs between sensitivity and throughput across parameter choices. We identify an empirically optimal configuration that provides burst recovery while maintaining processing speeds exceeding real-time requirements. While the specific optimal parameters are derived for the Northern Cross, the methodology and findings are broadly applicable to any real-time transient detection pipeline employing matched-filtering and dedispersion, and are particularly relevant for low-frequency radio telescopes with similar observing configurations. These findings demonstrate the value of data-driven parameter evaluation for improving the performance of real-time transient detection pipelines.

2603.16344 2026-03-18 stat.ME

A flexible wrapped Lindley-type distribution for angular data modelling

Johan Ferreira, Delene van Wyk-de Ridder, Janet van Niekerk

Abstract

Flexible distributions for modelling angular data have received considerable attention in recent years, with ongoing work extending existing circular models to provide greater flexibility in capturing diverse angular behaviours. In this paper, we introduce and study the w3PL distribution, a circular model obtained by extending the wrapped Lindley distribution by incorporating two additional shape parameters. The proposed generalisation increases flexibility in modelling concentration and skewness while preserving analytical tractability and encompassing existing circular models as special cases. Closed-form expressions for the probability density function, cumulative distribution function, and trigonometric moments are derived, allowing key distributional properties to be studied analytically. The distributional modality is characterised, and the nature of invariance is investigated for the newly proposed circular model. Parameter estimation is developed within a regularised maximum likelihood framework, and a simulation study demonstrates reliable parameter recovery and stable finite-sample performance. Applications to angular datasets from geology, marine biology, and finance illustrate the model's practical significance and show improved fit relative to existing circular alternatives.

2603.16317 2026-03-18 stat.OT

Balance and Fairness through Multicalibration in Nonlife Insurance Pricing

Michel Denuit, Marie Michaelides, Julien Trufin

Abstract

Autocalibration is known to be an important requirement for insurance premiums since it guarantees that premium income balances corresponding claims, on average, not only at portfolio level but also inside each group paying similar premiums. Also, fairness has become a major concern because unfair treatment may expose insurers to lawsuits or reputational damage. Translating fairness into conditional mean independence allows actuaries to combine autocalibration and fairness into the multicalibration concept. This paper studies the properties of multicalibration in an insurance context and proposes practical ways to implement it, through local regression or bias correction within groups including credibility adjustments. A case study based on motor insurance data illustrates the relevance of multicalibration in insurance pricing.

2603.16294 2026-03-18 math.ST stat.TH

A Kernel Two-Sample Test Invariant under Group Action with Applications to Functional Data

Madison Giacofci, Anouar Meynaoui, Alex Podgorny

Abstract

We introduce a kernel-based two-sample test for comparing probability distributions up to group actions. Our construction yields invariant kernels for locally compact $σ$-compact groups and extends classical Haar-based approaches beyond the compact setting. The resulting invariant Maximum Mean Discrepancy (MMD) test is developed in a general framework where the sample space is assumed to be Polish. Under natural conditions, the invariant kernel induces a characteristic kernel on the quotient space, ensuring consistency of the associated MMD test. The method is well suited to functional data, where invariances such as temporal shifts arise naturally, and its effectiveness is illustrated through simulation studies.

2603.16279 2026-03-18 cs.RO stat.ML

Agile Interception of a Flying Target using Competitive Reinforcement Learning

Timothée Gavin, Simon Lacroix, Murat Bronz

Journal ref
Conference on Artificial Intelligence for Defence, AMIAD, Nov 2025, Rennes, France
Abstract

This article presents a solution to intercept an agile drone by another agile drone carrying a catching net. We formulate the interception as a Competitive Reinforcement Learning problem, where the interceptor and the target drone are controlled by separate policies trained with Proximal Policy Optimization (PPO). We introduce a high-fidelity simulation environment that integrates a realistic quadrotor dynamics model and a low-level control architecture implemented in JAX, which allows for fast parallelized execution on GPUs. We train the agents using low-level control, collective thrust and body rates, to achieve agile flights both for the interceptor and the target. We compare the performance of the trained policies in terms of catch rate, time to catch, and crash rate, against common heuristic baselines and show that our solution outperforms these baselines for interception of agile targets. Finally, we demonstrate the performance of the trained policies in a scaled real-world scenario using agile drones inside an indoor flight arena.

2603.16213 2026-03-18 math.ST stat.ME stat.TH

Equivalence testing with data-dependent and post-hoc equivalence margins

Stan Koobs, Nick W. Koning

Abstract

Equivalence testing compares the hypothesis that an effect $μ$ is large against the alternative that it is negligible. Here, 'large' is classically expressed as being larger than some 'equivalence margin' $Δ$. A longstanding problem is that this margin must be specified but can rarely be objectively justified in practice. We lay the foundation for an alternative paradigm, arguing to instead report a data-dependent margin $\widehat{Δ}_α$ that bounds the true effect $μ$ with probability $1 - α$. Our key argument is that $\widehat{Δ}_α$ is more useful than a test outcome at a fixed margin $Δ$, as measured by the guarantees it offers to decision makers. We generalize this to a curve of margins $α \mapsto \widehat{Δ}_α$, uniformly valid under the post-hoc selection of the margin. These ideas rely on e-values, which we derive for models that are strictly totally positive of order 3, nesting the classical z-test and t-test settings.
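As a point of reference for the idea of reporting a margin rather than testing at a fixed one, the sketch below computes a data-dependent margin in the classical z-test setting with known standard deviation. It is an assumption-laden illustration, not the e-value construction of the paper: it uses the textbook bound $|\bar{x}| + z_{1-α} σ/\sqrt{n}$, which covers $|μ|$ with probability at least $1 - α$ for each fixed $α$ but does not by itself give the uniform post-hoc guarantee described in the abstract. The function name is hypothetical.

```python
from statistics import NormalDist

def data_dependent_margin(xbar, sigma, n, alpha):
    """Classical margin |xbar| + z_{1-alpha} * sigma / sqrt(n) for a z-test.

    For a fixed alpha, |mu| <= margin holds with probability at least 1 - alpha
    when xbar ~ N(mu, sigma^2 / n). Valid pointwise in alpha only.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    return abs(xbar) + z * sigma / n**0.5

# Curve of margins alpha -> margin for one observed sample mean.
xbar, sigma, n = 0.1, 1.0, 100
curve = {alpha: data_dependent_margin(xbar, sigma, n, alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, margin in curve.items():
    print(f"alpha={alpha:.2f}  margin={margin:.4f}")
```

Smaller values of $α$ yield larger margins, which is the trade-off a decision maker would read off such a curve.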

2603.16062 2026-03-18 stat.ML cs.LG

Safe Distributionally Robust Feature Selection under Covariate Shift

Hiroyuki Hanada, Satoshi Akahane, Noriaki Hashimoto, Shion Takeno, Ichiro Takeuchi

Abstract

In practical machine learning, the environments encountered during the model development and deployment phases often differ, especially when a model is used by many users in diverse settings. Learning models that maintain reliable performance across plausible deployment environments is known as distributionally robust (DR) learning. In this work, we study the problem of distributionally robust feature selection (DRFS), with a particular focus on sparse sensing applications motivated by industrial needs. In practical multi-sensor systems, a shared subset of sensors is typically selected prior to deployment based on performance evaluations using many available sensors. At deployment, individual users may further adapt or fine-tune models to their specific environments. When deployment environments differ from those anticipated during development, this strategy can result in systems lacking sensors required for optimal performance. To address this issue, we propose safe-DRFS, a novel approach that extends safe screening from conventional sparse modeling settings to a DR setting under covariate shift. Our method identifies a feature subset that encompasses all subsets that may become optimal across a specified range of input distribution shifts, with finite-sample theoretical guarantees of no false feature elimination.

2603.16056 2026-03-18 cond-mat.stat-mech quant-ph stat.ML

Population Annealing as a Discrete-Time Schrödinger Bridge

Masayuki Ohzeki

Comments 4 pages

Abstract

We present a theoretical framework that reinterprets Population Annealing (PA) through the lens of the discrete-time Schrödinger Bridge (SB) problem. We demonstrate that the heuristic reweighting step in PA is derived by analytically solving the Schrödinger system without iterative computation via instantaneous projection. In addition, we identify the thermodynamic work as the optimal control potential that solves the global variational problem on path space. This perspective unifies non-equilibrium thermodynamics with the geometric framework of optimal transport, interpreting the Jarzynski equality as a consistency condition within the Donsker-Varadhan variational principle, and elucidates the thermodynamic optimality of PA.
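For readers unfamiliar with the reweighting step that the abstract refers to, here is a minimal sketch of the standard Population Annealing weights when the inverse temperature moves from `beta_old` to `beta_new`. It shows only the textbook reweighting and the associated partition-function ratio estimate, not the Schrödinger Bridge derivation of the paper; the function name and the example energies are assumptions.

```python
import math

def reweight(energies, beta_old, beta_new):
    """Normalized PA weights w_i proportional to exp(-(beta_new - beta_old) * E_i).

    Also returns log(mean_i exp(-d_beta * E_i)), the usual estimate of
    log(Z(beta_new) / Z(beta_old)) when the population is distributed at beta_old.
    """
    d_beta = beta_new - beta_old
    # Shift by the minimum energy so the exponents are <= 0 when d_beta > 0.
    e_min = min(energies)
    raw = [math.exp(-d_beta * (e - e_min)) for e in energies]
    total = sum(raw)
    weights = [w / total for w in raw]
    log_ratio = -d_beta * e_min + math.log(total / len(energies))
    return weights, log_ratio

energies = [0.0, 1.0, 2.0]
weights, log_ratio = reweight(energies, beta_old=1.0, beta_new=1.5)
print(weights, log_ratio)
```

In a full PA run these weights drive the resampling of the population, and the per-step `log_ratio` values are summed across temperature steps to estimate free-energy differences.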

2603.16042 2026-03-18 math.OC cs.LG stat.ML

Shuffling the Stochastic Mirror Descent via Dual Lipschitz Continuity and Kernel Conditioning

Junwen Qiu, Leilei Mei, Junyu Zhang

Comments 28 pages, 3 figures

Abstract

The global Lipschitz smoothness condition underlies most convergence and complexity analyses via two key consequences: the descent lemma and the gradient Lipschitz continuity. How to study the performance of optimization algorithms in the absence of Lipschitz smoothness remains an active area. The relative smoothness framework from Bauschke-Bolte-Teboulle (2017) and Lu-Freund-Nesterov (2018) provides an extended descent lemma, ensuring convergence of Bregman-based proximal gradient methods and their vanilla stochastic counterparts. However, many widely used techniques (e.g., momentum schemes, random reshuffling, and variance reduction) additionally require the Lipschitz-type bound for gradient deviations, leaving their analysis under relative smoothness an open area. To resolve this issue, we introduce the dual kernel conditioning (DKC) regularity condition to regulate the local relative curvature of the kernel functions. Combined with the relative smoothness, DKC provides a dual Lipschitz continuity for gradients: even though the gradient mapping is not Lipschitz in the primal space, it preserves Lipschitz continuity in the dual space induced by a mirror map. We verify that DKC is widely satisfied by popular kernels and is closed under affine composition and conic combination. With these novel tools, we establish the first complexity bounds as well as the iterate convergence of random reshuffling mirror descent for constrained nonconvex relative smooth problems.

2603.16041 2026-03-18 stat.ME cs.LG

Power Analysis for Prediction-Powered Inference

Yiqun T. Chen, Moran Guo, Shengy Li

English abstract

Modern studies increasingly leverage outcomes predicted by machine learning and artificial intelligence (AI/ML) models, and recent work, such as prediction-powered inference (PPI), has developed valid downstream statistical inference procedures. However, classical power and sample size formulas do not readily account for these predictions. In this work, we tackle a simple yet practical question: given a new AI/ML model with high predictive power, how many labeled samples are needed to achieve a desired level of statistical power? We derive closed-form power formulas by characterizing the asymptotic variance of the PPI estimator and applying Wald test inversion to obtain the required labeled sample size. Our results cover widely used settings including two-sample comparisons and risk measures in 2x2 tables. We find that a useful rule of thumb is that the reduction in required labeled samples relative to classical designs scales roughly with the $R^2$ between the predictions and the ground truth. Our analytical formulas are validated using Monte Carlo simulations, and we illustrate the framework in three contemporary biomedical applications spanning single-cell transcriptomics, clinical blood pressure measurement, and dermoscopy imaging. We provide our software as an R package and online calculators at https://github.com/yiqunchen/pppower.
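The rule of thumb can be illustrated with a small sketch for the simplest case, a one-sample test of a mean. It is not the paper's formula or the pppower API. It assumes an effectively unlimited pool of unlabeled data and predictions whose residual variance is $\sigma^2(1 - R^2)$, so that the PPI estimator's variance with $n$ labels is about $\sigma^2(1 - R^2)/n$. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def classical_n(delta, sigma, alpha=0.05, power=0.8):
    """Labeled samples for a two-sided one-sample z-test to detect a mean shift delta."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil((z_sum * sigma / delta) ** 2)


def ppi_n(delta, sigma, r2, alpha=0.05, power=0.8):
    """Labeled samples under the stated assumptions (infinite unlabeled data,
    residual variance sigma^2 * (1 - r2)); an illustration, not the paper's result."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil((z_sum * sigma / delta) ** 2 * (1 - r2))


n_classical = classical_n(delta=0.5, sigma=1.0)
n_with_predictions = ppi_n(delta=0.5, sigma=1.0, r2=0.75)
```

Under these assumptions the required labeled sample size shrinks by roughly the factor $1 - R^2$, which matches the qualitative statement in the abstract.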

2603.16014 2026-03-18 stat.CO

Fast Multitask Gaussian Process Regression

Aleksei G. Sorokin, Pieterjan Robbe, Fred J. Hickernell

English abstract

Gaussian process (GP) regression is a powerful probabilistic modeling technique with built-in uncertainty quantification. When one has access to multiple correlated simulations (tasks), it is common to fit a multitask GP (MTGP) surrogate which is capable of capturing both inter-task and intra-task correlations. However, with a total of $N$ evaluations across all tasks, fitting an MTGP is often infeasible due to the $\mathcal{O}(N^2)$ storage and $\mathcal{O}(N^3)$ computations required to store, solve a linear system in, and compute the determinant of the $N \times N$ Gram matrix of pairwise kernel evaluations. In the single-task setting, one may reduce the required storage to $\mathcal{O}(N)$ and computations to $\mathcal{O}(N \log N)$ by fitting "fast GPs", which pair low-discrepancy design points from quasi-Monte Carlo with special kernel forms that yield nicely structured Gram matrices, e.g., circulant matrices. This article generalizes fast GPs to fast MTGPs, which pair low-discrepancy design points for each task with special product kernel forms that yield nicely structured block Gram matrices, e.g., circulant block matrices. An algorithm is presented to efficiently store, invert, and compute the determinant of such Gram matrices with optionally different sampling nodes and different sample sizes for each task. Derivations for fast MTGP Bayesian cubature are also provided. A GPU-compatible, open-source Python implementation is made available in the FastGPs package (https://alegresor.github.io/fastgps/). We validate the efficiency of our algorithm and implementation compared to standard techniques on a range of problems with low numbers of tasks and large sample sizes.
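The single-task speedup rests on a standard fact: a circulant matrix is diagonalized by the discrete Fourier transform, so only its first column needs storing and both the linear solve and the log-determinant reduce to FFTs. The sketch below illustrates that fact only. It does not use the FastGPs API, it does not cover the block (multitask) case, and the small radix-2 FFT and the function names are assumptions made to keep the example self-contained.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def circulant_solve_logdet(c, b):
    """Solve C x = b and return log det C, where C is the symmetric positive
    definite circulant matrix with first column c."""
    n = len(c)
    eig = fft(c)  # eigenvalues of C are the DFT of its first column
    b_hat = fft(b)
    x_hat = [b_hat[k] / eig[k] for k in range(n)]
    x = [v.real / n for v in fft(x_hat, inverse=True)]
    logdet = sum(math.log(e.real) for e in eig)
    return x, logdet


c = [4.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # first column; O(N) storage
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x, logdet = circulant_solve_logdet(c, b)
```

The dense matrix is never formed, which is where the $\mathcal{O}(N)$ storage and $\mathcal{O}(N \log N)$ cost come from.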

2603.16005 2026-03-18 math.ST stat.ML stat.TH

Breakdown properties of optimal transport maps: general transportation costs

Alberto Gonzalez-Sanz, Marco Avella Medina

English abstract

Two recent works, Avella-Medina and González-Sanz (2026) and Passeggeri and Paindaveine (2026), studied the robustness of the optimal transport map through its breakdown point, i.e., the smallest fraction of contamination that can make the map take arbitrarily aberrant values. Their main finding is the following: let $P$ and $Q$ denote the target and reference measures, respectively, and let $T$ be the optimal transport map for the squared Euclidean cost. Then, the breakdown point of $T(u)$, when $P$ is perturbed and $Q$ is fixed, coincides with the Tukey depth of $u$ relative to $Q$. In this note, we extend this result to general convex cost functions, demonstrating that the cost function does not have any impact on the breakdown point of the optimal transport map. Our contribution provides a definitive characterization of the breakdown point of the optimal transport map. In particular, it shows that for a broad class of regular cost functions, all transport-based quantiles enjoy the same high breakdown point properties.

2603.15924 2026-03-18 stat.ME

Time Partitioning in Target Trial Emulation

Harold Tankpinou Zoumenou, Simon Ferreira, Charles Assaad, Nathanael Lapidus, Daria Bystrova, Benjamin Glemain

English abstract

In target trial emulation, time partitioning enables researchers to handle time-varying confounders and immortal time bias with appropriate methods. Based on two clinical scenarios, this study aimed to explore issues related to time partitioning and to provide guidance for trial emulation. After formalizing the research question within the framework of structural causal models, we show how a given time partitioning may be too fine or too coarse depending on the clinical context. When the partitioning is too fine, the dimensionality of the model is unnecessarily high. When the partitioning is too coarse, the resulting causal structure may hinder effect estimation. We also show that cloning-censoring-weighting may not be valid when treatment influences outcome within study periods, and we confirm this through simulations. In conclusion, we provide practical guidance for actively specifying an appropriate time partitioning in trial emulation, rather than using the available data resolution as a default.

2603.15923 2026-03-18 stat.ML cs.LG

Learning to Recall with Transformers Beyond Orthogonal Embeddings

Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu

Comments ICLR 2026

English abstract

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training and retrieve it at inference. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. We address this gap by analyzing a single-layer transformer with random embeddings trained with (empirical) gradient descent on a simple token-retrieval task, where the model must identify an informative token within a length-$L$ sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the "early phase" of gradient descent and yields explicit formulas for the model's storage capacity -- revealing a multiplicative dependence between sample size $N$, embedding dimension $d$, and sequence length $L$. We validate these scalings numerically and further complement them with a lower bound for the underlying statistical problem, demonstrating that this multiplicative scaling is intrinsic under non-orthogonal embeddings.

2603.15902 2026-03-18 stat.CO stat.ME

SEMMS with Random Effects: A Mixed-Model Extension for Variable Selection in Clustered and Longitudinal Data

Haim Bar, Martin T. Wells

English abstract

SEMMS (Scalable Empirical-Bayes Model for Marker Selection) is a variable-selection procedure for generalized linear models that uses a three-component normal mixture prior on regression coefficients. In its original form, SEMMS assumes that all observations are independent. Many real-world datasets, however, arise from repeated-measures or clustered designs in which observations within the same subject are correlated. Ignoring this correlation inflates the apparent residual variance and can severely degrade variable-selection performance. We extend SEMMS to accommodate random intercepts, random slopes, or both, via an alternating coordinate-ascent algorithm. After each round of fixed-effect variable selection, the subject-level best linear unbiased predictors (BLUPs) are updated with \texttt{lmer} (Gaussian) or \texttt{glmer} (non-Gaussian); the fixed-effect step then operates on the random-effect-adjusted response. We describe the algorithm, evaluate its performance in three Gaussian simulation studies spanning a range of signal strengths, random-effect magnitudes, and sample/predictor-space regimes, and present a semi-synthetic real-data example. We further extend the framework to non-Gaussian families (Poisson, binomial) via an IRLS working-response adaptation: at each outer iteration the fixed-effects step uses the RE-adjusted working response computed from the current \texttt{glmer} fitted values rather than the raw response. When the fixed-effect signal is strong relative to the random-effect variance, both the original and extended procedures perform comparably. When the random-effect variance dominates -- the scenario most likely to cause plain SEMMS to fail -- the mixed-model extension recovers the exact true predictor set in 93\% of simulated datasets (Gaussian), 61\% (Poisson), and 65\% (binomial), compared with 1\%, 45\%, and 39\% for plain SEMMS respectively.

2603.15175 2026-03-18 stat.ME math.DS q-bio.PE

Bayesian Inference in Epidemic Modelling: A Beginner's Guide

Augustine Okolie

Comments 12 pages, 2 plots

English abstract

This lecture note provides a self-contained introduction to Bayesian inference and Markov Chain Monte Carlo (MCMC) methods for parameter estimation in epidemic models. Using the classical Susceptible-Infectious-Recovered (SIR) compartmental model as a running example, we derive the likelihood function from first principles, specify priors on the transmission and recovery parameters, and implement the Metropolis-Hastings algorithm to sample from the posterior distribution. The note is aimed at graduate students and researchers in mathematical epidemiology with limited prior exposure to Bayesian statistics.
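The workflow described above (likelihood, priors, Metropolis-Hastings) can be sketched as follows. This is an illustration under stated assumptions, not the note's own code: the SIR dynamics are deterministic and integrated with Euler steps, daily case counts are given a Poisson likelihood, both priors are flat (transmission rate on (0, 2), recovery rate on (0, 1)), and the data are synthetic. All function names and numerical settings are hypothetical.

```python
import math
import random


def sir_incidence(beta, gamma, n_pop=1000, i0=10, days=30, substeps=10):
    """Expected daily new infections from a deterministic SIR model (Euler steps)."""
    s, i = float(n_pop - i0), float(i0)
    dt = 1.0 / substeps
    daily = []
    for _ in range(days):
        new_cases = 0.0
        for _ in range(substeps):
            infections = beta * s * i / n_pop * dt
            recoveries = gamma * i * dt
            s -= infections
            i += infections - recoveries
            new_cases += infections
        daily.append(new_cases)
    return daily


def log_posterior(beta, gamma, counts):
    """Poisson log-likelihood plus assumed flat priors on a bounded box."""
    if not (0 < beta < 2 and 0 < gamma < 1):
        return -math.inf
    lam = sir_incidence(beta, gamma, days=len(counts))
    return sum(y * math.log(max(m, 1e-12)) - m - math.lgamma(y + 1)
               for y, m in zip(counts, lam))


def metropolis_hastings(counts, n_iter=4000, step=0.02, seed=0):
    """Random-walk Metropolis-Hastings over (beta, gamma)."""
    rng = random.Random(seed)
    beta, gamma = 0.4, 0.3  # arbitrary starting point
    current = log_posterior(beta, gamma, counts)
    chain, accepted = [], 0
    for _ in range(n_iter):
        b_new = beta + rng.gauss(0, step)
        g_new = gamma + rng.gauss(0, step)
        proposal = log_posterior(b_new, g_new, counts)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio
        if rng.random() < math.exp(min(0.0, proposal - current)):
            beta, gamma, current = b_new, g_new, proposal
            accepted += 1
        chain.append((beta, gamma))
    return chain, accepted / n_iter


# Synthetic data generated with beta = 0.5 and gamma = 0.2 (assumed values)
counts = [round(v) for v in sir_incidence(0.5, 0.2)]
```

Calling `metropolis_hastings(counts)` returns the chain and the acceptance rate; discarding an initial burn-in portion before summarizing the samples is the usual practice.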

2603.15057 2026-03-18 stat.ML cs.AI cs.LG

Analyzing Error Sources in Global Feature Effect Estimation

Timo Heiß, Coco Bögel, Bernd Bischl, Giuseppe Casalicchio

Comments Accepted to The 4th World Conference on eXplainable Artificial Intelligence (XAI 2026)

English abstract

Global feature effects such as partial dependence (PD) and accumulated local effects (ALE) plots are widely used to interpret black-box models. However, they are only estimates of true underlying effects, and their reliability depends on multiple sources of error. Despite the popularity of global feature effects, these error sources are largely unexplored. In particular, the practically relevant question of whether to use training or holdout data to estimate feature effects remains unanswered. We address this gap by providing a systematic, estimator-level analysis that disentangles sources of bias and variance for PD and ALE. To this end, we derive a mean-squared-error decomposition that separates model bias, estimation bias, model variance, and estimation variance, and analyze their dependence on model characteristics, data selection, and sample size. We validate our theoretical findings through an extensive simulation study across multiple data-generating processes, learners, estimation strategies (training data, validation data, and cross-validation), and sample sizes. Our results reveal that, while using holdout data is theoretically the cleanest, potential biases arising from the training data are empirically negligible and dominated by the impact of the usually higher sample size. The estimation variance depends on both the presence of interactions and the sample size, with ALE being particularly sensitive to the latter. Cross-validation-based estimation is a promising approach that reduces the model variance component, particularly for overfitting models. Our analysis provides a principled explanation of the sources of error in feature effect estimates and offers concrete guidance on choosing estimation strategies when interpreting machine learning models.

2603.14894 2026-03-18 cs.LG cs.AI stat.ML

Informative Perturbation Selection for Uncertainty-Aware Post-hoc Explanations

Sumedha Chugh, Ranjitha Prasad, Nazreen Shah

English abstract

Trust and ethical concerns due to the widespread deployment of opaque machine learning (ML) models motivate the need for reliable model explanations. Post-hoc model-agnostic explanation methods address this challenge by learning a surrogate model that approximates the behavior of the deployed black-box ML model in the locality of a sample of interest. In post-hoc scenarios, neither the underlying model parameters nor the training are available, and hence, this local neighborhood must be constructed by generating perturbed inputs in the neighborhood of the sample of interest, and their corresponding model predictions. We propose \emph{Expected Active Gain for Local Explanations} (\texttt{EAGLE}), a post-hoc model-agnostic explanation framework that formulates perturbation selection as an information-theoretic active learning problem. By adaptively sampling perturbations that maximize the expected information gain, \texttt{EAGLE} efficiently learns a linear surrogate explainable model while producing feature importance scores along with the uncertainty/confidence estimates. Theoretically, we establish that cumulative information gain scales as $\mathcal{O}(d \log t)$, where $d$ is the feature dimension and $t$ represents the number of samples, and that the sample complexity grows linearly with $d$ and logarithmically with the confidence parameter $1/\delta$. Empirical results on tabular and image datasets corroborate our theoretical findings and demonstrate that \texttt{EAGLE} improves explanation reproducibility across runs, achieves higher neighborhood stability, and improves perturbation sample quality as compared to state-of-the-art baselines such as Tilia, US-LIME, GLIME and BayesLIME.

2603.14198 2026-03-18 cs.LG cs.AI stat.ML

Efficient Federated Conformal Prediction with Group-Conditional Guarantees

Haifeng Wen, Osvaldo Simeone, Hong Xing

Comments 22 pages, 5 figures, submitted for possible publication

English abstract

Deploying trustworthy AI systems requires principled uncertainty quantification. Conformal prediction (CP) is a widely used framework for constructing prediction sets with distribution-free coverage guarantees. In many practical settings, including healthcare, finance, and mobile sensing, the calibration data required for CP are distributed across multiple clients, each with its own local data distribution. In this federated setting, data can often be partitioned into potentially overlapping groups, which may reflect client-specific strata or cross-cutting attributes such as demographic or semantic categories. We propose group-conditional federated conformal prediction (GC-FCP), a novel protocol that provides group-conditional coverage guarantees. GC-FCP constructs mergeable, group-stratified coresets from local calibration scores, enabling clients to communicate compact weighted summaries that support efficient aggregation and calibration at the server. Experiments on synthetic and real-world datasets validate the performance of GC-FCP compared to centralized calibration baselines.

2603.13784 2026-03-18 math.ST stat.ME stat.TH

Mixed difference integer-valued GARCH model for $\mathbb{Z}$-valued time series

Abdelhakim Aknouche, Christian Francq, Yuichi Goto

Comments 61 pages, 8 figures

English abstract

In this paper, we introduce flexible observation-driven $\mathbb{Z}$-valued time series models constructed from mixtures of negative and non-negative components. Compared to models based on the standard Skellam distribution or on a difference of two integer-valued variables, our specification offers greater versatility. For example, it easily allows for skewness and bimodality. Furthermore, the observation of one component of the mixture makes interpretation and statistical analysis easier. We establish conditions for stationarity and mixing, and develop a mixed Poisson quasi-maximum likelihood estimator with proven asymptotic properties. A portmanteau test is proposed to diagnose residual serial dependence. The finite-sample performance of the methodology is assessed via simulation, and an empirical application on tick prices demonstrates its practical usefulness.

2603.11267 2026-03-18 stat.AP

A Statistically Reliable Optimization Framework for Bandit Experiments in Scientific Discovery

Tong Li, Travis Mandel, Goldie Phillips, Anna Rafferty, Eric M. Schwartz, Dehan Kong, Joseph J. Williams

English abstract

Scientific experimentation is largely driven by statistical hypothesis testing to determine significant differences in interventions. Traditionally, experimenters allocate samples uniformly between each intervention. However, such an approach may lead to suboptimal outcomes; multi-armed bandits (MABs) address this problem by allocating samples adaptively to maximize outcomes. Yet, two challenges have hindered the use of MABs in scientific domains. First, common hypothesis tests (e.g., $t$-tests) become invalid under adaptive sampling without correction, leading to inflated type I and type II errors. This is an understudied problem, and prior solutions suffer from issues such as low statistical power which prevent adoption in many practical settings. Second, practitioners must explicitly balance cumulative reward with statistical efficiency, yet no general methodology exists to quantify this trade-off across algorithms. In this paper, we study assumption modification and critical region correction approaches for hypothesis testing that enable common tests to be applied to adaptively collected data. We provide heuristic justification for their power efficiency and show in simulation that they achieve higher power than existing approaches. Further, we derive a theoretically and practically motivated objective function for adaptive experiment evaluation, which we integrate into a unified experimental framework. Our framework asks experimenters to specify an experiment extension cost for their problem, and based on that enables our proposed optimization procedure to select the bandit algorithm that best balances reward and power in their setting. We show that our approach enables practitioners to improve outcomes with only slightly more steps than uniform randomization, while retaining statistical validity.

2603.09531 2026-03-18 q-bio.QM cs.CV eess.IV stat.AP

Association of Progressive PPFE and Mortality in Lung Cancer Screening Cohorts

Shahab Aslani, Mehran Azimbagirad, Daryl Cheng, Daisuke Yamada, Ryoko Egashira, Adam Szmul, Justine Chan-Fook, Robert Chapman, Alfred Chung Pui So, Shanshan Wang, John McCabe, Tianqi Yang, Jose M Brenes, Eyjolfur Gudmundsson, The SUMMIT Consortium, Susan M. Astley, Daniel C. Alexander, Sam M. Janes, Joseph Jacob

English abstract

Background: Pleuroparenchymal fibroelastosis (PPFE) is an upper lobe predominant fibrotic lung abnormality associated with increased mortality in established interstitial lung disease. However, the clinical significance of radiologic PPFE progression in lung cancer screening (LCS) populations remains unclear. Methods: We analysed longitudinal low-dose CT scans and clinical data from two LCS studies: the National Lung Screening Trial (NLST; n=7,980) and the SUMMIT study (n=8,561). An automated algorithm quantified PPFE volume on baseline and follow-up scans. Annualised change in PPFE was derived and dichotomised using a distribution-based threshold to define progressive PPFE. Associations between progressive PPFE and mortality were evaluated using Cox proportional hazards models adjusted for demographic and clinical variables. In the SUMMIT cohort, associations between progressive PPFE and clinical outcomes were assessed using incidence rate ratios (IRR) and odds ratios (OR). Findings: Progressive PPFE was independently associated with mortality in both LCS cohorts (NLST: Hazard Ratio (HR)=1.25, 95% Confidence Interval (CI): 1.01--1.56, p=0.042; SUMMIT: HR=3.14, 95% CI: 1.66--5.97, p<0.001). Within SUMMIT, progressive PPFE was strongly associated with higher respiratory admissions (IRR=2.79, p<0.001), increased antibiotic and steroid use (IRR=1.55, p=0.010), and showed a trend towards higher modified medical research council scores (OR=1.40, p=0.055). Interpretation: Radiologic PPFE progression independently associates with mortality across two large LCS cohorts, and associates with adverse clinical outcomes. Quantitative assessment of PPFE progression may provide a clinically relevant imaging biomarker to identify individuals at increased risk of respiratory morbidity within LCS programmes.

2603.01381 2026-03-18 stat.ME

Differential gene expression analysis via two-component mixture models with a semiparametric skew-normal scale mixture alternative

Sangkon Oh, Geoffrey J. McLachlan

English abstract

Two-component mixture models are particularly useful for identifying differentially expressed genes, but their performance can deteriorate markedly when the alternative distribution departs from parametric assumptions or symmetry. We propose a semiparametric mixture model in which the null component is standard normal and the alternative follows a skew-normal scale mixture with an unspecified scale mixing distribution. This formulation accommodates skewness and heavy tails, providing a flexible and computationally tractable tool for differential gene-expression analysis without restrictive distributional assumptions. We establish identifiability and consistency of the model and develop an efficient estimation algorithm that incorporates nonparametric maximum likelihood estimation of the scale distribution. Numerical studies show notable improvements over existing parametric and nonparametric approaches for modeling the alternative distribution, and applications to colon cancer and leukemia datasets demonstrate reduced false discovery and false negative rates.

2602.17922 2026-03-18 stat.CO stat.ME stat.ML

Data-driven configuration tuning of glmnet for balancing accuracy and computational efficiency

Shuhei Muroya, Kei Hirose

Comments 23 pages, 9 figures. Title changed. Revised for linguistic clarity and stylistic improvements; no changes to the main results

English abstract

The glmnet package in R is widely used for lasso estimation because of its computational efficiency. Despite its popularity, glmnet occasionally yields solutions that deviate substantially from the true ones because of the inappropriate default configuration of the algorithm. The accuracy of the obtained solutions can be improved by appropriately tuning the configuration. However, such improvements typically increase computational time, resulting in a tradeoff between accuracy and computational efficiency. Therefore, a systematic approach is required to determine the appropriate configuration. To address this need, we propose a unified data-driven framework specifically designed to optimize the configuration by balancing solution path accuracy and computational cost. Specifically, we generate a large-scale training dataset by measuring the accuracy and computation time of glmnet. Using this dataset, we construct neural networks to predict accuracy and computation time from data characteristics and configuration. For a new dataset, the proposed framework uses the trained networks to explore the configuration space and derive a Pareto front that represents the tradeoff between accuracy and computational cost. This front enables automatic selection of the configuration that maximizes accuracy under a user-specified time constraint. The proposed method is implemented in the R package glmnetconf, available at https://github.com/Shuhei-Muroya/glmnetconf.git.

2602.03999 2026-03-18 math.PR cs.DS cs.LG math.ST stat.ML stat.TH

Functional Stochastic Localization

Anming Gu, Bobby Shi, Kevin Tian

Comments Comments welcome! v2 adds citations and fixes typos

English abstract

Eldan's stochastic localization is a probabilistic construction that has proved instrumental to modern breakthroughs in high-dimensional geometry and the design of sampling algorithms. Motivated by sampling under non-Euclidean geometries and the mirror descent algorithm in optimization, we develop a functional generalization of Eldan's process that replaces Gaussian regularization with regularization by any positive integer multiple of a log-Laplace transform. We further give a mixing time bound on the Markov chain induced by our localization process, which holds if our target distribution satisfies a functional Poincaré inequality. Finally, we apply our framework to differentially private convex optimization in $\ell_p$ norms for $p \in [1, 2)$, where we improve state-of-the-art query complexities in a zeroth-order model.

2601.01259 2026-03-18 stat.ME math.ST stat.TH

A Novel Multiple Imputation Approach For Parameter Estimation in Observation-Driven Time Series Models With Missing Data

Guilherme Pumi, Taiane Schaedler Prass, Douglas Krauthein Verdum

Comments This version presents the large sample theory for the proposed method, showing its strong consistency under mild assumptions, regardless of the amount of missing data or its generating mechanism

English abstract

Handling missing data in time series is a complex problem due to the presence of temporal dependence. General-purpose imputation methods, while widely used, often distort key statistical properties of the data, such as variance and dependence structure, leading to biased estimation and misleading inference. These issues become more pronounced in models that explicitly rely on capturing serial dependence, as standard imputation techniques fail to preserve the underlying dynamics. This paper proposes a novel multiple imputation method specifically designed for parameter estimation in observation-driven models (ODM). The approach takes advantage of the iterative nature of the systematic component in ODM to propagate the dependence structure through missing data, minimizing its impact on estimation. Unlike traditional imputation techniques, the proposed method accommodates continuous, discrete, and mixed-type data while preserving key distributional and dependence properties. We evaluate its performance through Monte Carlo simulations in the context of GARMA models, considering time series with up to 70\% missing data. An application to the proportion of stocked energy stored in South Brazil further demonstrates its practical utility.

2512.07709 2026-03-18 econ.EM stat.CO stat.ME

Bounds on inequality with incomplete data

James Banks, Thomas Glinnan, Tatiana Komarova

English abstract

We develop a unified nonparametric framework for sharp partial identification and inference on inequality indices when the data contain coarsened observations of the variable of interest. We characterize the extremal allocations for all Schur-convex inequality measures, and show that sharp bounds are attained by distributions with finite support. This reduces the computational problem to finite-dimensional optimization, and for indices admitting linear-fractional representations after suitable ordering of the data (including the Gini coefficient and quantile ratios), we express the bound problems as linear or quadratic programs. We then establish $\sqrt{n}$ inference for the upper and lower bounds using a directional delta method and bootstrap confidence intervals. In applications, we compute sharp Gini bounds from household wealth data with mixed point and interval observations and use historical U.S. grouped income tables to bound time series for the Gini and quantile ratios.

2512.00698 2026-03-18 cs.LG stat.ML

Flow Matching for Tabular Data Synthesis

Bahrul Ilmi Nasution, Floor Eijkelboom, Mark Elliot, Richard Allmendinger, Christian A. Naesseth

Comments Published at TMLR

English abstract

Synthetic data generation is an important tool for privacy-preserving data sharing. Although diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement FM for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with state-of-the-art diffusion methods (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers -- something possible when learning to generate using \textit{variational} FM -- characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that FM, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably few function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial: the OT path is a strong default and more robust to early stopping on average, while VP has the potential to produce synthetic data with lower privacy risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high utility synthetic data with reduced disclosure risk. The implementation code associated with this paper is publicly available at https://github.com/rulnasution/tabular-flow-matching.

2510.25289 2026-03-18 cs.SI math.ST stat.TH

Testing Correlation in Graphs by Counting Bounded Degree Motifs

Dong Huang, Pengkun Yang

Comments 46 pages, 8 figures

English abstract

We investigate the problem of detecting correlation between two Erdős-Rényi graphs $G(n,p)$, formulated as a hypothesis testing problem: under the null hypothesis, the two graphs are independent, while under the alternative hypothesis, they are correlated through a latent bijective mapping between their vertex sets. We develop a polynomial-time test by counting bounded degree motifs and prove its effectiveness for any constant correlation coefficient $\rho$ when the edge connecting probability satisfies $p \ge n^{-1+\delta}$ for some constant $\delta > 0$. In particular, our guarantee improves the constraint of motif-counting methods from $\rho \ge \sqrt{\alpha}$ to any constant $\rho = \Omega(1)$, where $\alpha \approx 0.338$ is Otter's constant.

2510.14055 2026-03-18 math.ST cs.IT math.IT math.PR stat.TH

Minimum Hellinger Distance Estimators for Complex Survey Designs

David Kepplinger, Anand N. Vidyashankar

Comments 36 pages

English abstract

Reliable inference from complex survey samples can be derailed by outliers and high-leverage observations induced by unequal inclusion probabilities and calibration. We develop a minimum Hellinger distance estimator (MHDE) for parametric superpopulation models under complex designs, including Poisson PPS and fixed-size SRS/PPS without replacement, with possibly stochastic post-stratified or calibrated weights. Using a Horvitz-Thompson-adjusted kernel density plug-in, we show: (i) $L^1$-consistency of the KDE with explicit large-deviation tail bounds driven by a variance-adaptive effective sample size; (ii) uniform exponential bounds for the Hellinger affinity that yield MHDE consistency under mild identifiability; (iii) an asymptotic Normal distribution for the MHDE with covariance $\mathbf{A}^{-1}\boldsymbol{\Sigma}\mathbf{A}^{\intercal}$ (and a finite-population correction under without-replacement designs); and (iv) robustness via the influence function and $\alpha$-influence curves in the Hellinger topology. Simulations under Gamma and lognormal superpopulation models quantify efficiency-robustness trade-offs relative to weighted MLE under independent and high-leverage contamination. An application to NHANES 2021-2023 total water consumption shows that the MHDE remains stable despite extreme responses that markedly bias the MLE. The estimator is simple to implement via quadrature over a fixed grid and is extensible to other divergence families.

2510.06122 2026-03-18 cs.LG stat.ML

PolyGraph Discrepancy: a classifier-based metric for graph generation

Markus Krimmel, Philip Hartout, Karsten Borgwardt, Dexiong Chen

Comments Camera-ready version published at ICLR 2026

Abstract

Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics based on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely kernel and descriptor parametrization, making them incomparable across different graph descriptors. We introduce PolyGraph Discrepancy (PGD), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon distance of graph distributions by fitting binary classifiers to distinguish between real and generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers approximates a variational lower bound on the JS distance between the two distributions. Resulting metrics are constrained to the unit interval [0,1] and are comparable across different graph descriptors. We further derive a theoretically grounded summary metric that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGD provides a more robust and insightful evaluation compared to MMD metrics. The PolyGraph framework for benchmarking graph generative models is made publicly available at https://github.com/BorgwardtLab/polygraph-benchmark.
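As a rough illustration of the classifier-based bound described above (not the PolyGraph implementation), the sketch below turns a binary classifier's held-out predictions into a lower bound on the Jensen-Shannon divergence via the standard variational bound $\mathrm{JSD}(P,Q) \ge 1 + \tfrac{1}{2}\mathbb{E}_P[\log_2 D(x)] + \tfrac{1}{2}\mathbb{E}_Q[\log_2(1-D(x))]$ (in bits), and takes the square root to obtain a distance in [0,1]. The function name, its arguments, and the clipping constant are assumptions made for this example.

```python
import math


def js_distance_lower_bound(p_real, p_gen, eps=1e-12):
    """Lower-bound the Jensen-Shannon distance from classifier outputs.

    p_real: predicted P(real | x) on held-out real graphs (hypothetical input)
    p_gen:  predicted P(real | x) on held-out generated graphs (hypothetical input)
    """
    def clip(p):
        # Keep probabilities away from 0 and 1 so the logarithms stay finite.
        return min(max(p, eps), 1.0 - eps)

    # Average log-likelihood (base 2) of the correct label in each sample.
    ll_real = sum(math.log2(clip(p)) for p in p_real) / len(p_real)
    ll_gen = sum(math.log2(1.0 - clip(p)) for p in p_gen) / len(p_gen)

    # Variational lower bound on the JS divergence, in bits.
    jsd_bits = 1.0 + 0.5 * ll_real + 0.5 * ll_gen

    # A poor classifier can push the bound below 0; clamp to the valid range.
    jsd_bits = min(max(jsd_bits, 0.0), 1.0)

    # The JS distance is the square root of the JS divergence.
    return math.sqrt(jsd_bits)
```

A classifier that always outputs 0.5 gives a bound of 0 (the two sets of graphs look indistinguishable to it), while a near-perfect classifier pushes the bound toward 1.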

2509.09965 2026-03-18 stat.ME math.ST stat.TH

Confidence Intervals for Extinction Risk: Validating Population Viability Analysis with Limited Data

Hiroshi Hakoyama

Comments 151 pages, 32 figures, 30 tables

Abstract

Quantitative assessment of extinction risk requires confidence intervals (CIs) that remain informative with limited data. Their usefulness has long been debated because short observation spans can make uncertainty so large that population viability analysis appears impractical. I derive new CIs for extinction probability under the drift-Wiener process, a canonical model of extinction dynamics, by introducing transformed parameters $w$ and $z$ whose maximum-likelihood estimators follow noncentral $t$ distributions. The resulting $w$-$z$ method yields CIs with coverage close to the nominal level and shows that precision depends not only on data length but also on effect size: extinction probabilities that are sufficiently low or high can often be estimated reliably even from limited time series. I also propose an observation-error-and-autocovariance-robust (OEAR) estimator for settings with additive observation error and short-run dependence. Applied to two 64-year national harvest indices for Japanese eel (Anguilla japonica), the method gives Criterion E extinction probabilities far below the IUCN threatened-category thresholds, with narrow CIs, despite the species being listed as Endangered under Criterion A. These results show that extinction-risk CIs can be both statistically rigorous and practically informative for conservation assessment under limited data.
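For orientation, the extinction probability under a drift-Wiener process has a well-known closed form: the first-passage probability of Brownian motion with drift to a lower barrier. The sketch below evaluates that formula for given parameters. It does not implement the $w$-$z$ confidence intervals or the OEAR estimator, and the parameter names `x0`, `mu`, `sigma`, and `t` are assumptions of this illustration rather than the paper's notation.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def extinction_probability(x0, mu, sigma, t):
    """P(log-population reaches the extinction threshold within time t).

    x0:    initial distance of log-population above the threshold (> 0)
    mu:    drift of the log-population per unit time
    sigma: diffusion standard deviation per sqrt(unit time) (> 0)
    t:     time horizon (> 0)
    """
    if x0 <= 0 or sigma <= 0 or t <= 0:
        raise ValueError("x0, sigma and t must be positive")
    s = sigma * math.sqrt(t)
    # First-passage probability of Brownian motion with drift.
    term1 = norm_cdf((-x0 - mu * t) / s)
    term2 = math.exp(-2.0 * mu * x0 / sigma ** 2) * norm_cdf((-x0 + mu * t) / s)
    return term1 + term2
```

With positive drift and a long horizon the value approaches $\exp(-2\mu x_0/\sigma^2)$, the probability of ever reaching the threshold; with negative drift it approaches 1.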

2508.11814 2026-03-18 stat.ME

Simulation-based validation of Bayes factor computation

Martin Modrák, Sebastian Stroppel, Paul-Christian Bürkner

Comments 49 pages, 14 figures

Abstract

We propose and evaluate two methods that validate the computation of Bayes factors: one based on an improved variant of simulation-based calibration checking (SBC) and one based on calibration metrics for binary predictions. We show that in theory, binary prediction calibration is equivalent to a special case of SBC, but with limited resources, binary prediction calibration is typically more sensitive to the problems we investigated. With well-designed test quantities, SBC can however detect all possible problems in computation, including some that cannot be uncovered by binary prediction calibration. Previous work on Bayes factor validation includes checks based on the data-averaged posterior and the Good check method. We demonstrate that both checks miss many problems in Bayes factor computation detectable with SBC and binary prediction calibration. Moreover, we find that the Good check as originally described fails to control its error rates. Our proposed checks also typically use simulation results more efficiently than data-averaged posterior checks. Finally, we show that a special approach based on posterior SBC is necessary when checking Bayes factor computation under improper priors and we validate several models with such priors. We recommend that novel methods for Bayes factor computation be validated with SBC, binary prediction calibration and data-averaged posterior with at least several hundred simulations. For all the models we tested, the bridgesampling and BayesFactor R packages satisfy all available checks and thus are likely safe to use in standard scenarios.
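The idea behind binary prediction calibration can be sketched on a toy problem: draw the model index from its prior, simulate data from that model, and treat the computed posterior model probability as a prediction of which model generated the data. If the Bayes factor is computed correctly, those predictions are calibrated. The example below is a minimal sketch under assumed toy models with an analytically known Bayes factor; the models, function names, and bin count are illustrative choices, not the authors' procedure.

```python
import math
import random


def normal_pdf(y, var):
    """Density of N(0, var) at y."""
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)


def posterior_prob_m1(y, prior_m1=0.5):
    """Posterior probability of M1 for one observation y.

    M0: y ~ N(0, 1)
    M1: mu ~ N(0, 1), y | mu ~ N(mu, 1), so marginally y ~ N(0, 2)
    """
    bf10 = normal_pdf(y, 2.0) / normal_pdf(y, 1.0)  # exact Bayes factor
    odds = bf10 * prior_m1 / (1.0 - prior_m1)
    return odds / (1.0 + odds)


def simulate(n_sims, prior_m1=0.5, seed=0):
    """Return (posterior probability of M1, true model is M1) pairs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        is_m1 = rng.random() < prior_m1
        mu = rng.gauss(0.0, 1.0) if is_m1 else 0.0
        y = rng.gauss(mu, 1.0)
        draws.append((posterior_prob_m1(y, prior_m1), is_m1))
    return draws


def calibration_table(draws, n_bins=5):
    """Per non-empty bin: (count, mean predicted probability, observed M1 frequency)."""
    bins = [[] for _ in range(n_bins)]
    for prob, is_m1 in draws:
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((prob, is_m1))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            freq = sum(1 for _, m in members if m) / len(members)
            table.append((len(members), mean_pred, freq))
    return table
```

In each row of `calibration_table(simulate(20000))`, the mean predicted probability and the observed frequency of M1 should roughly agree; replacing `bf10` with a deliberately wrong value is a quick way to see the check fail.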

2508.09554 2026-03-18 stat.AP

A Bayesian factor analysis model for non-randomised staggered designs

Constantin Schmidt, Shaun R. Seaman, Beatrice Emmanouil, Leila Reid, Stuart Smith, Daniela De Angelis, Pantelis Samartsidis

Comments 17 pages, 5 figures

Abstract

The employment of peer supporter workers starting in 2018 was one of the interventions deployed by National Health Service England as part of its Hepatitis C virus (HCV) elimination plan. Peers are individuals with relevant lived experience who educate their communities about the virus and promote testing and treatment. In this paper, we assess the causal effect of the peers intervention on HCV patient case-finding, using data on 22 administrative regions from January 2016 to May 2021. To do this, we develop a Bayesian causal factor analysis model for count outcomes and ordinal interventions. Our method provides uncertainty quantification for all causal estimands of interest, gains efficiency by jointly modelling the intervention assignment process, pre- and post-intervention outcomes, and provides estimates of both conditional average and individual treatment effects (ITEs). For ITEs, we propose a copula-based approach that allows practitioners to perform sensitivity analysis to assumptions made regarding the joint distribution of potential outcomes, that are necessary to estimate these quantities. Our analysis suggests that the introduction of peers led to an increase in HCV patient case-finding. Further, we found that the effect of the intervention increased with intervention intensity, and was stronger during the national COVID-19 lockdown.

2508.03675 2026-03-18 stat.ME stat.AP

Partial Conjunction Analysis in Neuroimaging: A Comparative Study

Monitirtha Dey, Anna Vesely, Thorsten Dickhaus

Abstract

Replicability is a cornerstone of science. The partial conjunction (PC) hypothesis testing framework objectively quantifies replicability across disciplines. Although several statistical methodologies for testing PC hypotheses exist, it is not clear which method performs well under which circumstances. In this paper, we consider the PC hypothesis testing problem from a neuroimaging perspective. Identifying the brain regions activated by a specific cognitive task constitutes a central challenge in neuroimaging. This problem becomes complex when the objective is to evaluate whether activation patterns are consistent across different cognitive tasks or subjects. In this paper, we cast this question as a PC hypothesis testing problem, assessing, for each location in the brain, whether it is activated in at least $γ$ subjects, for a pre-specified granularity $γ$. In our comparative study, we consider three methods, namely: adaFilter, CoFilter, and a method proposed by Benjamini, Heller, and Yekutieli (BHY). In equi-correlated simulated data, the BHY procedure tends to outperform the competing methods for high values of $γ$, while CoFilter performs well for low values of $γ$. In the real-data analysis, CoFilter dominates the other methods for intermediate values of $γ$.

2506.01324 2026-03-18 stat.ML cs.IT cs.LG math.IT math.PR

Near-Optimal Clustering in Mixture of Markov Chains

Junghyun Lee, Yassir Jedra, Alexandre Proutière, Se-Young Yun

Comments AISTATS 2026 (50 pages, 6 figures) (ver3: camera-ready version, major revisions)

Abstract

We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$. We derive an instance-dependent, high-probability lower bound on the clustering error rate, governed by the stationary-weighted KL divergence between transition kernels. We then propose a two-stage algorithm: Stage I applies spectral clustering via a new injective Euclidean embedding for ergodic Markov chains, a contribution of independent interest enabling sharp concentration results; Stage II refines clusters with a single likelihood-based reassignment step. We prove that our algorithm achieves near-optimal clustering error with high probability under reasonable requirements on $T$ and $H$. Preliminary experiments support our approach, and we conclude with discussions of its limitations and extensions.

2505.21777 2026-03-18 cs.LG cond-mat.dis-nn cs.CV q-bio.NC stat.ML

Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov

Abstract

Dense Associative Memories (DenseAMs) are generalizations of Hopfield networks, which have superior information storage capacity and can store training data points (memories) at local minima of the energy landscape. When the amount of training data exceeds the critical memory storage capacity of these models, new local minima, which are different from the training data, emerge. In Associative Memory these emergent local minima are called $\textit{spurious}\; \textit{states}$, which hinder memory retrieval. In this work, we examine diffusion models (DMs) through the DenseAM lens, viewing their generative process as an attempt at memory retrieval. In the small-data regime, DMs create distinct attractors for each training sample, akin to DenseAMs below the critical memory storage. As the training data size increases, they transition from memorization to generalization. We identify a critical intermediate phase, predicted by DenseAM theory -- the spurious states. In generative modeling, these states are no longer negative artifacts but rather are the first signs of generative capabilities. We characterize the basins of attraction, energy landscape curvature, and computational properties of these previously overlooked states. Their existence is demonstrated across a wide range of architectures and datasets.

2505.09647 2026-03-18 cs.DS cs.IT cs.LG math.IT math.PR math.ST stat.TH

On Unbiased Low-Rank Approximation with Minimum Distortion

Leighton Pate Barnes, Stephen Cameron, Benjamin Howard

Abstract

We describe an algorithm for sampling a low-rank random matrix $Q$ that best approximates a fixed target matrix $P\in\mathbb{C}^{n\times m}$ in the following sense: $Q$ is unbiased, i.e., $\mathbb{E}[Q] = P$; $\mathsf{rank}(Q)\leq r$; and $Q$ minimizes the expected Frobenius norm error $\mathbb{E}\|P-Q\|_F^2$. Our algorithm mirrors the solution to the efficient unbiased sparsification problem for vectors, except applied to the singular components of the matrix $P$. Optimality is proven by showing that our algorithm matches the error from an existing lower bound.

2505.07272 2026-03-18 stat.ML cs.LG eess.SP

ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data

Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano

Journal ref: IEEE Transactions on Signal Processing, vol. 73, pp. 876-886, 2025
Abstract

Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions of the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code available at https://github.com/javiersc1/ALPCAH.

2504.13336 2026-03-18 stat.ML cs.LG math.ST stat.TH

On the minimax optimality of Flow Matching through the connection to kernel density estimation

Lea Kunkel, Mathias Trabs

Abstract

Flow Matching has recently gained attention in generative modeling as a simple and flexible alternative to diffusion models. While existing statistical guarantees adapt tools from the analysis of diffusion models, we take a different perspective by connecting Flow Matching to kernel density estimation. We first verify that the kernel density estimator matches the optimal rate of convergence in Wasserstein distance up to logarithmic factors, improving existing bounds for the Gaussian kernel. Based on this result, we prove that for sufficiently large networks, Flow Matching achieves the optimal rate up to logarithmic factors. If the target distribution lies on a lower-dimensional manifold, we show that the kernel density estimator profits from the smaller intrinsic dimension on a small tube around the manifold. The faster rate also applies to Flow Matching, providing a theoretical foundation for its empirical success in high-dimensional settings.

2504.09347 2026-03-18 stat.ML cs.LG math.ST stat.TH

Inference for Deep Neural Network Estimators in Generalized Nonparametric Models

Xuran Meng, Yi Li

Comments 91 pages, 14 figures, 20 tables

Abstract

While deep neural networks (DNNs) are used for prediction, inference on DNN-estimated subject-specific means for categorical or exponential family outcomes remains underexplored. We address this by proposing a DNN estimator under generalized nonparametric regression models (GNRMs) and developing a rigorous inference framework. Unlike existing approaches that assume independence between estimation errors and inputs to establish the error bound, a condition often violated in GNRMs, we allow for dependence and our theoretical analysis demonstrates the feasibility of drawing inference under GNRMs. To implement inference, we consider an Ensemble Subsampling Method (ESM) that leverages U-statistics and the Hoeffding decomposition to construct reliable confidence intervals for DNN estimates. We show that, under GNRM settings, ESM enables model-free variance estimation and accounts for heterogeneity among individuals in the population. Through simulations under nonparametric logistic, Poisson, and binomial regression models, we demonstrate the effectiveness and efficiency of our method. We further apply the method to the electronic Intensive Care Unit (eICU) dataset, a large scale collection of anonymized health records from ICU patients, to predict ICU readmission risk and offer patient-centric insights for clinical decision making.

2504.03228 2026-03-18 econ.EM stat.ML

Weak instrumental variables due to nonlinearities in panel data: A Super Learner Control Function estimator

Monika Avila-Marquez

Abstract

A triangular structural panel data model with additive separable individual-specific effects is used to model the causal effect of a covariate on an outcome variable when there are unobservable confounders, some of which are time-invariant. In this setup, a linear reduced-form equation might be problematic when the conditional mean of the endogenous covariate and the instrumental variables is nonlinear. The reason is that ignoring the nonlinearity could lead to weak instruments (instruments are weakly correlated with the endogenous covariate). As a solution, we propose a triangular simultaneous equation model for panel data with additive separable individual-specific fixed effects composed of a linear structural equation with a nonlinear reduced-form equation. The parameter of interest is the structural parameter of the endogenous variable. The identification of this parameter is obtained under the assumption of available exclusion restrictions and using a control function approach. Estimating the parameter of interest is done using an estimator that we call Super Learner Control Function (SLCF) estimator. The estimation procedure is composed of two main steps and sample splitting. First, we estimate the control function using a super learner. In the following step, we use the estimated control function to control for endogeneity in the structural equation. Sample splitting is done across the individual dimension. The estimator is consistent and asymptotically normal, achieving a parametric rate of convergence. We perform a Monte Carlo simulation to test the performance of the estimators proposed. We conclude that the Super Learner Control Function Estimators significantly outperform Within 2SLS estimators. Finally, we show that the SLCF estimator differs from both the plug-in IV estimator and a naive plug-in 2SLS estimator.

2504.02547 2026-03-18 stat.ME

Outlier-Robust Multi-Group Gaussian Mixture Modeling with Flexible Group Reassignment

Patricia Puchhammer, Ines Wilms, Peter Filzmoser

Abstract

Do expert-defined or diagnostically-labeled data groups align with clusters inferred through statistical modeling? If not, where do discrepancies between predefined labels and model-based groupings occur and why? In this work, we introduce the multi-group Gaussian mixture model (MG-GMM), the first model developed to investigate these questions. It incorporates prior group information while allowing flexibility to reassign observations to alternative groups based on data-driven evidence. We achieve this by modeling the observations of each group as arising not from a single distribution, but from a Gaussian mixture comprising all group-specific distributions. Moreover, our model offers robustness against cellwise outliers that may obscure or distort the underlying group structure. We propose a novel penalized likelihood approach, called cellMG-GMM, to jointly estimate mixture probabilities, location and scale parameters of the MG-GMM, and detect outliers through a penalty term on the number of flagged cellwise outliers in the objective function. We show that our estimator has good breakdown properties in the presence of cellwise outliers. We develop a computationally-efficient EM-based algorithm for cellMG-GMM, and demonstrate its strong performance in identifying and diagnosing observations at the intersection of multiple groups through simulations and diverse applications in medicine and oenology.

2503.14978 2026-03-18 math.ST cs.NA math.AP math.NA math.PR stat.TH

Inferring diffusivity from killed diffusion

Richard Nickl, Fanny Seizilles

Comments 33 pages, to appear in the Annals of Statistics

Abstract

We consider diffusion of independent molecules in an insulated Euclidean domain with unknown diffusivity parameter. At a random time and position, the molecules may bind and stop diffusing in dependence of a given 'binding potential'. The binding process can be modeled by an additive random functional corresponding to the canonical construction of a 'killed' diffusion Markov process. We study the problem of conducting inference on the infinite-dimensional diffusion parameter from a histogram plot of the 'killing' positions of the process. We show first that these positions follow a Poisson point process whose intensity measure is determined by the solution of a certain Schrödinger equation. The inference problem can then be re-cast as a non-linear inverse problem for this PDE, which we show to be consistently solvable in a Bayesian way under natural conditions on the initial state of the diffusion, provided the binding potential is not too 'aggressive'. In the course of our proofs we obtain novel posterior contraction rate results for high-dimensional Poisson count data that are of independent interest. A numerical illustration of the algorithm by standard MCMC methods is also provided.

2503.13986 2026-03-18 math.ST stat.TH

Stratified Permutational Berry--Esseen Bounds and Their Applications to Statistics

Pengfei Tian, Fan Yang, Peng Ding

Abstract

The stratified linear permutation statistic arises in various statistics problems, including stratified and post-stratified survey sampling, stratified and post-stratified experiments, conditional permutation tests, etc. Although we can derive the Berry--Esseen bounds for the stratified linear permutation statistic based on existing bounds for the non-stratified statistics, those bounds are not sharp, and moreover, this strategy does not work in general settings with heterogeneous strata with varying sizes. We first use Stein's method to obtain a unified stratified permutational Berry--Esseen bound that can accommodate heterogeneous strata. We then apply the bound to various statistics problems, leading to stronger theoretical quantifications and thereby facilitating statistical inference in those problems.

2503.13148 2026-03-18 stat.ME math.ST stat.TH

Spearman's rho for zero-inflated count data: formulation and attainable bounds

Jasper Arends, Guanjie Lyu, Mhamed Mesfioui, Elisa Perrone, Julien Trufin

Abstract

We propose an alternative formulation of Spearman's rho for zero-inflated count data. The formulation yields an estimator with explicitly attainable bounds, facilitating interpretation in settings where the standard range [-1,1] is no longer informative.

2503.12966 2026-03-18 cs.LG stat.ML

Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

Eliot Beyler, Francis Bach

Abstract

Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, which we name "full-denoising", to the alternative "half-denoising" introduced by Hyvärinen (2025). We show that looking at the performance in terms of distance between distributions tells a more nuanced story, with different assumptions on the data leading to very different conclusions. We prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.

2503.07327 2026-03-18 stat.ME stat.ML

Casewise and Cellwise Robust Multilinear Principal Component Analysis

Mehdi Hirari, Fabio Centofanti, Mia Hubert, Stefan Van Aelst

Abstract

Multilinear Principal Component Analysis (MPCA) is an important tool for analyzing tensor data. It performs dimension reduction similar to PCA for multivariate data. However, standard MPCA is sensitive to outliers. It is highly influenced by observations deviating from the bulk of the data, called casewise outliers, as well as by individual outlying cells in the tensors, so-called cellwise outliers. This latter type of outlier is highly likely to occur in tensor data, as tensors typically consist of many cells. This paper introduces a novel robust MPCA method that can handle both types of outliers simultaneously, and can cope with missing values as well. This method uses a single loss function to reduce the influence of both casewise and cellwise outliers. The solution that minimizes this loss function is computed using an iteratively reweighted least squares algorithm with a robust initialization. Graphical diagnostic tools are also proposed to identify the different types of outliers that have been found by the new robust MPCA method. The performance of the method and associated graphical displays is assessed through simulations and illustrated on two real datasets.

2501.11738 2026-03-18 stat.ME

A new class of non-stationary Gaussian fields with general smoothness on metric graphs

David Bolin, Lenin Riera-Segura, Alexandre B. Simas

Abstract

The increasing availability of network data has driven the development of advanced statistical models specifically designed for metric graphs, where Gaussian processes play a pivotal role. While models such as Whittle-Matérn fields have been introduced, there remains a lack of practically applicable options that accommodate flexible non-stationary covariance structures or general smoothness. To address this gap, we propose a novel class of generalized Whittle-Matérn fields, which are rigorously defined on general compact metric graphs and permit both non-stationarity and arbitrary smoothness. We establish new regularity results for these fields, which extend even to the standard Whittle-Matérn case. Furthermore, we introduce a method to approximate the covariance operator of these processes by combining the finite element method with a rational approximation of the operator's fractional power, enabling computationally efficient Bayesian inference for large datasets. Theoretical guarantees are provided by deriving explicit convergence rates for the covariance approximation error, and the practical utility of our approach is demonstrated through simulation studies and an application to traffic speed data, highlighting the flexibility and effectiveness of the proposed model class.

2409.04412 2026-03-18 stat.ME q-fin.MF q-fin.RM

Robust Elicitable Functionals

Kathleen E. Miao, Silvana M. Pesenti

Abstract

Elicitable functionals and (strictly) consistent scoring functions are of interest due to their utility in determining (uniquely) optimal forecasts, and thus the ability to effectively backtest predictions. However, in practice, assuming that a distribution is correctly specified is too strong a belief to reliably hold. To remedy this, we incorporate a notion of statistical robustness into the framework of elicitable functionals, meaning that our robust functional accounts for "small" misspecifications of a baseline distribution. Specifically, we propose a robustified version of elicitable functionals by using the Kullback-Leibler divergence to quantify potential misspecifications from a baseline distribution. We show that the robust elicitable functionals admit unique solutions lying at the boundary of the uncertainty region, and provide conditions for existence and uniqueness. Since every elicitable functional possesses infinitely many scoring functions, we propose the class of b-homogeneous strictly consistent scoring functions, for which the robust functionals maintain desirable statistical properties. We show the applicability of the robust elicitable functional in several examples: in a reinsurance setting and in robust regression problems.

2409.01983 2026-03-18 stat.ME math.ST stat.TH

The causal interpretation of acceleration factors

Mari Brathovde, Hein Putter, Morten Valberg, Richard A. J. Post

Abstract

In studies of time-to-event outcomes with unmeasured heterogeneity, the hazard ratio for treatment is known to have a complex causal interpretation. Accelerated failure time (AFT) models, which assess the effect on the survival time ratio scale, are often suggested as a better alternative because they model a parameter with direct causal interpretation while allowing straightforward adjustment for measured confounders. In this work, we formalize the causal interpretation of the acceleration factor in AFT models using structural causal models and data under independent censoring. We prove that the acceleration factor is a valid causal effect measure, even in the presence of frailty and treatment effect heterogeneity. Through simulations, we show that the acceleration factor better captures the causal effect than the hazard ratio when both AFT and conditional proportional hazards models apply. Additionally, we extend the interpretation to systems with time-dependent acceleration factors, illustrating the impossibility of distinguishing between a time-varying homogeneous effect and unmeasured effect heterogeneity. While the causal interpretation of acceleration factors is promising, we caution practitioners about potential challenges for the interpretation in the presence of effect heterogeneity.

2408.08771 2026-03-18 stat.ME stat.CO

Dynamic factor analysis for sparse and irregular longitudinal data: an application to metabolite measurements in a COVID-19 study

Jiachen Cai, Robert J. B. Goudie, Brian D. M. Tom

Journal ref: Stat. Med. 2026, 45(6-7):e70499
Abstract

It is of scientific interest to identify essential biomarkers in biological processes underlying diseases to facilitate precision medicine. Factor analysis (FA) has long been used to address this goal: by assuming latent biological pathways drive the activity of measurable biomarkers, a biomarker is more influential if its absolute factor loading is larger. Although correlation between biomarkers has been properly handled under this framework, correlation between latent pathways is often overlooked, as one classical assumption in FA is the independence between factors. However, this assumption may not be realistic in the context of pathways, as existing biological knowledge suggests that pathways interact with one another rather than functioning independently. Motivated by sparsely and irregularly collected longitudinal measurements of metabolites in a COVID-19 study of large sample size, we propose a dynamic factor analysis model that can account for the potential cross-correlations between pathways, through a multi-output Gaussian processes (MOGP) prior on the factor trajectories. To mitigate overfitting caused by sparsity of longitudinal measurements, we introduce a roughness penalty upon MOGP hyperparameters and allow for non-zero mean functions. To estimate these hyperparameters, we develop a stochastic expectation maximization (StEM) algorithm that scales well to the large sample size. In our simulation studies, across all sample sizes considered, StEM yields more accurate and stable estimates of the MOGP hyperparameters than a comparator algorithm used in previous research. Application to the motivating example identifies a kynurenine pathway that affects the clinical severity of patients with COVID-19. In particular, a novel biomarker taurine is discovered, which has been receiving increased attention clinically, yet its role was overlooked in a previous analysis.

2408.03415 2026-03-18 stat.ME stat.CO

Gradient-Based Approximate Bayesian Inference with Entropy-Optimized Summary Statistics for Compartmental Models

Xiahui Li, Fergus J. Chadwick, Ben Swallow

Abstract

Recent pandemics have highlighted the critical role of infectious disease models in guiding public health decision-making, driving demand for realistic models that can provide timely answers under uncertainty. Compartmental models are widely used to capture disease dynamics, and advances in data availability, computational resources, and epidemiological understanding have allowed the development of models that incorporate detailed representations of population structure, disease progression, and intervention effects. While these advances improve model fidelity, they also increase model complexity, leading to high-dimensional parameter spaces, intractable likelihoods, and computational challenges for fitting models to limited surveillance data in real time. Existing likelihood-free methods, such as Approximate Bayesian Computation (ABC) and Bayesian Synthetic Likelihood (BSL), have developed largely independently, each with distinct strengths and limitations. We propose an integrated three-stage framework that synthesizes advances from both likelihood-based and likelihood-free methods: (1) ABC-based entropy minimization to identify low-dimensional, approximately orthogonal summary statistics; (2) BSL inference using these optimized summaries to construct tractable Gaussian approximations; and (3) Hamiltonian Monte Carlo sampling for efficient posterior exploration. Through an SEIR simulation study and an application to the 1978 British boarding school influenza outbreak, we demonstrate that our framework achieves reliable parameter estimation and uncertainty quantification while maintaining computational efficiency.

2407.19892 2026-03-18 stat.ML cs.LG q-bio.GN

Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Cells

Bailey Andrew, Erica L. Harris, James A. Poulter, David R. Westhead, Luisa Cutillo

Comments 8 pages (35 with appendix+references), 8 figures, 10 tables

Details
English abstract

Motivation: Networks underlie the generation and interpretation of many biological datasets: gene networks shed light on the regulatory structure of the genome, and cell networks can capture structure of the tumor micro-environment. However, most methods that learn such networks make the faulty 'independence assumption'; to learn the gene network, they assume that no cell network exists. 'Multi-axis' methods, which do not make this assumption, fail to scale beyond a few thousand cells or genes. This limits their applicability to only the smallest datasets. Results: We develop a multi-axis method capable of processing million-cell datasets within minutes. This was previously impossible, and unlocks the use of such methods on modern scRNA-seq datasets, as well as more complex datasets. We show that our method yields novel biological insights from real single-cell data, and compares favorably to the existing hdWGCNA methodology. In particular, it identifies long non-coding RNA genes that potentially have a regulatory or functional role in neuronal development. Availability and implementation: Our methodology is available as a Python package GmGM on PyPI (https://pypi.org/project/GmGM/0.5.3/). The code for all experiments performed in this paper is available on GitHub (https://github.com/BaileyAndrew/GmGM-Bioinformatics). Contact: sceba@leeds.ac.uk Supplementary information: Our proofs, and some additional experiments, are available in the supplementary material. Keywords: gaussian graphical models, multi-axis models, transcriptomics, multi-omics, scalability

2407.18360 2026-03-18 stat.AP

Evaluating Organizational Effectiveness: A New Strategy to Leverage Multisite Randomized Trials for Valid Assessment

Guanglei Hong, Jonah Deutsch, Peter Kress, Jose Eos Trinidad, Zhengyan Xu

Comments To appear in the American Journal of Evaluation

Details
English abstract

Determining which organizations are more effective in implementing an intervention program is essential for theoretically and empirically characterizing exemplary practice and for intervening to enhance the capacity of ineffective ones. Yet sites differ in their local ecological conditions including client composition, alternative programs, and community context. Applying the causal inference framework, this study proposes a formal mathematical definition for the local relative effectiveness of an organization attributable solely to malleable organizational practice. Capitalizing on multisite randomized trials, the identification leverages observed control group outcomes that capture some of the confounding impacts of otherwise unmeasured contextual variation. We propose a two-step mixed-effects modeling (2SME) procedure that adjusts for pre-existing between-site variation. A series of Monte Carlo simulations reveals its superior performance in comparison with conventional methods. We apply the new strategy to an evaluation of Job Corps centers nationwide serving disadvantaged youths.

2405.19553 2026-03-18 math.ST cs.LG math.PR stat.ML stat.TH

Convergence Bounds for Sequential Monte Carlo on Multimodal Distributions using Soft Decomposition

Holden Lee, Matheau Santana-Gijzen

Details
English abstract

We prove bounds on the variance of a function $f$ under the empirical measure of the samples obtained by the Sequential Monte Carlo (SMC) algorithm, with time complexity depending on local rather than global Markov chain mixing dynamics. SMC is a Markov Chain Monte Carlo (MCMC) method, which starts by drawing $N$ particles from a known distribution, and then, through a sequence of distributions, re-weights and re-samples the particles, at each instance applying a Markov chain for smoothing. In principle, SMC tries to alleviate problems from multi-modality. However, most theoretical guarantees for SMC are obtained by assuming global mixing time bounds, which are only efficient in the uni-modal setting. We show that bounds can be obtained in the truly multi-modal setting, with mixing times that depend only on local MCMC dynamics.

2402.03819 2026-03-18 stat.ML cs.LG

Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

Abdoulaye Sakho, Emmanuel Malherbe, Erwan Scornet

Details
Journal ref
International Conference on Artificial Intelligence and Statistics, May 2026, Tanger, Morocco
English abstract

Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bounds on the SMOTE density. From these results, we prove that SMOTE (with default parameter) tends to copy the original minority samples asymptotically. We confirm and illustrate this first theoretical behavior empirically on a real-world data set. Furthermore, we prove that the SMOTE density vanishes near the boundary of the support of the minority class distribution. We then adapt SMOTE based on our theoretical findings to introduce two new variants. These strategies are compared on 13 tabular data sets with 10 state-of-the-art rebalancing procedures, including deep generative and diffusion models. One of our key findings is that, for most data sets, applying no rebalancing strategy is competitive in terms of predictive performance, whether with LightGBM, tuned random forests or logistic regression. However, when the imbalance ratio is artificially augmented, one of our two modifications of SMOTE leads to promising predictive performance compared to SMOTE and other state-of-the-art strategies.
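
For readers unfamiliar with the mechanism, the interpolation step at the heart of SMOTE can be sketched as follows. This is a minimal NumPy illustration, not the authors' code and not a library implementation; the function name `smote_sample` and its parameters are hypothetical, and `k=5` is assumed here as the commonly used default number of neighbours.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by SMOTE-style interpolation.

    X_min: (n, d) array of minority-class samples (n must exceed k).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, n, size=n_new)        # seed point for each synthetic sample
    pick = rng.integers(0, k, size=n_new)        # which neighbour to interpolate toward
    nbr = neighbours[base, pick]
    w = rng.uniform(0.0, 1.0, size=(n_new, 1))   # interpolation weight in [0, 1)
    return X_min[base] + w * (X_min[nbr] - X_min[base])

# Made-up minority data, for illustration only.
X_minority = np.random.default_rng(1).normal(size=(20, 2))
X_synthetic = smote_sample(X_minority, n_new=50)
```

Every synthetic point is a convex combination of two nearby minority samples, so it cannot leave their convex hull. This gives some intuition for the boundary behaviour described in the abstract, although the sketch does not reproduce the paper's analysis.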

2310.12032 2026-03-18 cs.LG stat.ML

Exact and general decoupled solutions of the LMC Multitask Gaussian Process model

Olivier Truffinet, Karim Ammar, Jean-Philippe Argaud, Bertrand Bouriquet

Comments 78 pages, 12 figures, submitted to Neurocomputing

Details
English abstract

The Linear Model of Co-regionalization (LMC) is a very general multitask Gaussian process model for regression or classification. While its expressiveness and conceptual simplicity are appealing, naive implementations have cubic complexity in the product (number of datapoints $\times$ number of tasks), making approximations mandatory for most applications. However, recent work has shown that in some settings the latent processes of the model can be decoupled, leading to a complexity that is only linear in the number of said processes. We here extend these results, showing from the most general assumptions that the only condition necessary for an efficient exact computation of the LMC is a mild hypothesis on the noise model. We introduce a full parametrization of the resulting projected LMC model, enabling its efficient optimization. The effectiveness of this approach is assessed through synthetic and real-data experiments, testing in particular the behavior of its underlying noise model restriction. Overall, the projected LMC appears as a competitive and simpler alternative to state-of-the-art multitask Gaussian process models. It greatly facilitates some computations such as training data updates or leave-one-out cross-validation, and is more interpretable, for it gives access to its low-dimensional quantities and to their explicit relation with the full-dimensional data. These qualities could facilitate the adoption by various industries of entire classes of methodologies, notably multitask Bayesian optimization.

2308.11458 2026-03-18 stat.ME

Towards a unified approach to formal risk of bias assessments for causal and descriptive inference

Oliver L. Pescott, Robin J. Boyd, Gary D. Powney, Gavin B. Stewart

Details
English abstract

Statistics is sometimes described as the science of reasoning under uncertainty. Statistical models provide one view of this uncertainty, but what is frequently neglected is the 'invisible' portion of uncertainty: that assumed not to exist once a model has been fitted to some data. Systematic errors, i.e. bias, in data relative to some model and inferential goal can seriously undermine research conclusions, and qualitative and quantitative techniques have been created across several disciplines to quantify and generally appraise such potential biases. Perhaps best known are so-called 'risk of bias' assessment instruments used to investigate the likely quality of randomised controlled trials in medical research. However, the logic of assessing the risks caused by various types of systematic error to statistical arguments applies far more widely. This logic applies even when statistical adjustment strategies for potential biases are used, as these frequently make assumptions (e.g. data 'missing at random') that can rarely be empirically guaranteed. Mounting concern about such situations can be seen in the increasing calls for greater consideration of biases caused by nonprobability sampling in descriptive inference (e.g. in survey sampling), and the statistical generalisability of in-sample causal effect estimates in causal inference. Both of these relate to the consideration of model-based and wider uncertainty when presenting research conclusions from models. Given that model-based adjustments are never perfect, we argue that qualitative risk of bias reporting frameworks for both descriptive and causal inferential arguments should be further developed and made mandatory by journals and funders. It is only through clear statements of the limits to statistical arguments that consumers of research can fully judge their value for any given application.

2303.08528 2026-03-18 stat.ME

Translating predictive distributions into informative priors

Andrew A. Manderson, Robert J. B. Goudie

Comments Revised, added KL methodology and comparisons, improved manuscript for clarity

Details
English abstract

When complex Bayesian models exhibit implausible behaviour, one solution is to assemble available information into an informative prior. Challenges arise as prior information is often only available for the observable quantity, or some model-derived marginal quantity, rather than directly pertaining to the (usually latent) parameters in our model. We propose a method for translating available prior information, in the form of an elicited distribution for the observable or model-derived marginal quantity, into an informative joint prior. Our approach proceeds given a parametric class of prior distributions with as yet undetermined hyperparameters, and minimises the difference between the supplied elicited distribution and corresponding prior predictive distribution. We employ a global, multi-stage Bayesian optimisation procedure to locate optimal values for the hyperparameters. Three examples illustrate our approach: a cure-fraction survival model, where censoring implies that the observable quantity is _a priori_ a mixed discrete/continuous quantity; a setting in which prior information pertains to $R^{2}$ -- a model-derived quantity; and a nonlinear regression model.

2211.04129 2026-03-18 math.OC cs.LG stat.ML

An Efficient Global Optimization Algorithm with Adaptive Estimates of the Local Lipschitz Constants

Danny D'Agostino

Comments Accepted in Journal of Global Optimization, Springer

Details
English abstract

In this work, we present a new deterministic partition-based global optimization algorithm, HALO (Hybrid Adaptive Lipschitzian Optimization), which uses estimates of the local Lipschitz constants associated with different sub-regions of the objective function's domain to compute lower bounds and guide the search toward global minimizers. These estimates are obtained by adaptively balancing the global and local information collected from the algorithm, based on absolute slopes. HALO is hyperparameter-free, eliminating the need for manual tuning, and it highlights the most important variables to help interpret the optimization problem. We also introduce a coupling strategy with local optimization algorithms, both gradient-based and derivative-free, to accelerate convergence. We compare HALO with popular global optimization algorithms on hundreds of test functions. The numerical results are very promising and demonstrate that HALO can expand our arsenal of efficient procedures for challenging real-world black-box optimization problems. The Python code of HALO is publicly available on GitHub: https://github.com/dannyzx/HALO
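
The role of a Lipschitz constant in computing lower bounds can be illustrated with the classical inequality $f(x) \ge f(c) - L\,|x - c|$. The sketch below applies it to a one-dimensional interval. It is a generic textbook construction, not HALO's adaptive estimate; the function `interval_lower_bound` and the example objective are hypothetical.

```python
def interval_lower_bound(f, lo, hi, lipschitz):
    """Lower bound on min f over [lo, hi], given a valid Lipschitz constant on it."""
    centre = 0.5 * (lo + hi)
    radius = 0.5 * (hi - lo)
    # |f(x) - f(centre)| <= L * |x - centre| <= L * radius for every x in [lo, hi].
    return f(centre) - lipschitz * radius

def objective(x):
    return x * x

# On [0, 2], |f'(x)| = |2x| <= 4, so L = 4 is a valid (if loose) constant.
loose = interval_lower_bound(objective, 0.0, 2.0, lipschitz=4.0)
# On the sub-interval [0, 1], the local constant L = 2 suffices and tightens the bound.
local = interval_lower_bound(objective, 0.0, 1.0, lipschitz=2.0)
```

Both bounds sit below the true minimum of 0, but the one built from the local constant is much tighter. This is the general motivation for estimating Lipschitz constants per sub-region rather than globally.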

2211.03274 2026-03-18 stat.ME math.ST stat.TH

A General Framework for Cutting Feedback within Modularised Bayesian Inference

Yang Liu, Robert J. B. Goudie

Comments 30 pages, 9 figures

Details
Journal ref
J R Stat Soc Series B Stat Methodol (2025) 87(4):1171-1199
English abstract

Standard Bayesian inference can build models that combine information from various sources, but this inference may not be reliable if components of a model are misspecified. Cut inference, as a particular type of modularized Bayesian inference, is an alternative which splits a model into modules and cuts the feedback from the suspect module. Previous studies have focused on a two-module case, but a more general definition of a "module" remains unclear. We present a formal definition of a "module" and discuss its properties. We formulate methods for identifying modules; determining the order of modules; and building the cut distribution that should be used for cut inference within an arbitrary directed acyclic graph structure. We justify the cut distribution by showing that it not only cuts the feedback but also is the best approximation satisfying this condition to the joint distribution in the Kullback-Leibler divergence. We also extend cut inference for the two-module case to a general multiple-module case via a sequential splitting technique and demonstrate this via illustrative applications.

2106.00996 2026-03-18 stat.ME

Generalized Geographically Weighted Regression Model within a Modularized Bayesian Framework

Yang Liu, Robert J. B. Goudie

Comments 34 pages, 11 figures

Details
Journal ref
Bayesian Anal. (2024) 19(2):465-500
English abstract

Geographically weighted regression (GWR) models handle geographical dependence through a spatially varying coefficient model and have been widely used in applied science, but their general Bayesian extension is unclear because it involves a weighted log-likelihood which does not imply a probability distribution on data. We present a Bayesian GWR model and show that its essence is dealing with partial misspecification of the model. Current modularized Bayesian inference models accommodate partial misspecification from a single component of the model. We extend these models to handle partial misspecification in more than one component of the model, as required for our Bayesian GWR model. Information from the various spatial locations is manipulated via a geographically weighted kernel and the optimal manipulation is chosen according to a Kullback-Leibler (KL) divergence. We justify the model via an information risk minimization approach and show the consistency of the proposed estimator in terms of a geographically weighted KL divergence.

2006.01584 2026-03-18 stat.ME stat.CO

Stochastic Approximation Cut Algorithm for Inference in Modularized Bayesian Models

Yang Liu, Robert J. B. Goudie

Comments 36 pages, 9 figures, 1 table

Details
Journal ref
Stat. Comput. (2022) 32:7
English abstract

Bayesian modelling enables us to accommodate complex forms of data and make a comprehensive inference, but the effect of partial misspecification of the model is a concern. One approach in this setting is to modularize the model, and prevent feedback from suspect modules, using a cut model. After observing data, this leads to the cut distribution, which normally does not have a closed form. Previous studies have proposed algorithms to sample from this distribution, but these algorithms have unclear theoretical convergence properties. To address this, we propose a new algorithm called the Stochastic Approximation Cut algorithm (SACut) as an alternative. The algorithm is divided into two parallel chains. The main chain targets an approximation to the cut distribution; the auxiliary chain is used to form an adaptive proposal distribution for the main chain. We prove convergence of the samples drawn by the proposed algorithm and present the exact limit. Although SACut is biased, since the main chain does not target the exact cut distribution, we prove this bias can be reduced geometrically by increasing a user-chosen tuning parameter. In addition, parallel computing can be easily adopted for SACut, which greatly reduces computation time.

1607.06779 2026-03-18 stat.ME stat.AP stat.CO

Joining and splitting models with Markov melding

Robert J. B. Goudie, Anne M. Presanis, David Lunn, Daniela De Angelis, Lorenz Wernisch

Details
Journal ref
Bayesian Anal. (2019) 14(1):81-109
English abstract

Analysing multiple evidence sources is often feasible only via a modular approach, with separate submodels specified for smaller components of the available evidence. Here we introduce a generic framework that enables fully Bayesian analysis in this setting. We propose a generic method for forming a suitable joint model when joining submodels, and a convenient computational algorithm for fitting this joint model in stages, rather than as a single, monolithic model. The approach also enables splitting of large joint models into smaller submodels, allowing inference for the original joint model to be conducted via our multi-stage algorithm. We motivate and demonstrate our approach through two examples: joining components of an evidence synthesis of A/H1N1 influenza, and splitting a large ecology model.

2603.15840 2026-03-18 cs.LG cs.AI cs.CL stat.ML

When Stability Fails: Hidden Failure Modes Of LLMs in Data-Constrained Scientific Decision-Making

Nazia Riasat

Comments 13 pages, 5 figures. Accepted at ICLR 2026 Workshop: I Can't Believe It's Not Better (ICBINB 2026). OpenReview: https://openreview.net/pdf?id=vf8vs2ibso

Details
English abstract

Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly separates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multiple LLMs using a statistical gene prioritization task derived from differential expression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our experiments show that LLMs can exhibit near-perfect run-to-run stability while systematically diverging from statistical ground truth, over-selecting under relaxed thresholds, responding sharply to minor prompt wording changes, or producing syntactically plausible gene identifiers absent from the input table. Although stability reflects robustness across repeated runs, it does not guarantee agreement with statistical ground truth in structured scientific decision tasks. These findings highlight the importance of explicit ground-truth validation and output validity checks when deploying LLMs in automated or semi-automated scientific workflows.

2603.15839 2026-03-18 stat.AP q-fin.RM

A Portfolio-Anchored Frequency-Severity Risk Index for Trip and Driver Assessment Using Telematics Signals

Jongtaek Lee, Andrei Badescu, X. Sheldon Lin

Comments 31 pages, 4 figures. Submitted to ASTIN Bulletin

Details
English abstract

In this paper, we propose a novel frequency-severity joint trip-level risk index that combines the frequency of abnormal driving patterns with a severity component reflecting how extreme such behavior is relative to a portfolio-level baseline. Severity is quantified through an inverse-probability penalty that increases with the rarity of observed tail extremes, rather than being interpreted as a claim size. Based on high-frequency telematics data, we construct a multi-scale representation of longitudinal acceleration using the maximal overlap discrete wavelet transform (MODWT), which preserves localized driving patterns across multiple time scales. To capture severity as tail rarity, we model the portfolio distribution using a Gaussian-Uniform mixture with a layered tail structure, where Gaussian components describe typical driving behavior and the tail is partitioned into ordered severity layers that reflect increasing extremeness. We develop a likelihood-based estimation procedure that makes inference feasible for this mixture model. The resulting severity layers are then used to construct multi-layer tail counts (MLTC) at the trip level, which are modeled within a Poisson-Gamma framework to yield a closed-form posterior risk index that jointly reflects frequency and severity. This conjugate structure naturally supports sequential updating, enabling the construction of dynamically evolving driver-level risk profiles. Using the UAH-DriveSet controlled dataset, we demonstrate that the proposed index enables reliable discrimination across behavioral driving states, identification of high-risk trips, and coherent ranking of drivers, yielding a purely behavior-driven risk measure suitable for actuarial ratemaking and potentially mitigating fairness concerns associated with traditional covariates.
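
The sequential updating that a Poisson-Gamma conjugate structure supports can be sketched generically. With a Gamma(shape, rate) prior on a Poisson rate and a count observed over some exposure, the posterior is Gamma(shape + count, rate + exposure). The code below is a textbook illustration only: the class name `GammaPosterior`, the prior values, and the per-trip counts and durations are made up, and it does not reproduce the paper's MLTC construction or its risk index.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GammaPosterior:
    shape: float  # alpha
    rate: float   # beta

    @property
    def mean(self) -> float:
        return self.shape / self.rate

    def update(self, count: int, exposure: float) -> "GammaPosterior":
        """Conjugate update after observing `count` events over `exposure` units."""
        return GammaPosterior(self.shape + count, self.rate + exposure)

# Hypothetical prior and per-trip (tail event count, trip duration in hours) pairs.
posterior = GammaPosterior(shape=2.0, rate=1.0)
trips = [(3, 0.5), (0, 1.0), (5, 0.5)]
history = []
for count, hours in trips:
    posterior = posterior.update(count, hours)
    history.append(posterior.mean)  # posterior mean rate after each trip
```

Because each update only adds to the two parameters, a driver-level profile can be refreshed trip by trip without revisiting earlier data.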

2603.15814 2026-03-18 cs.LG stat.AP

Longitudinal Risk Prediction in Mammography with Privileged History Distillation

Banafsheh Karimian, Alexis Guichemerre, Soufiane Belharbi, Natacha Gillet, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

Details
English abstract

Breast cancer remains a leading cause of cancer-related mortality worldwide. Longitudinal mammography risk prediction models improve multi-year breast cancer risk prediction based on prior screening exams. However, in real-world clinical practice, longitudinal histories are often incomplete, irregular, or unavailable due to missed screenings, first-time examinations, heterogeneous acquisition schedules, or archival constraints. The absence of prior exams degrades the performance of longitudinal risk models and limits their practical applicability. While substantial longitudinal history is available during training, prior exams are commonly absent at test time. In this paper, we address missing history at inference time and propose a longitudinal risk prediction method that uses mammography history as privileged information during training and distills its prognostic value into a student model that only requires the current exam at inference time. The key idea is a privileged multi-teacher distillation scheme with horizon-specific teachers: each teacher is trained on the full longitudinal history to specialize in one prediction horizon, while the student receives only a reconstructed history derived from the current exam. This allows the student to inherit horizon-dependent longitudinal risk cues without requiring prior screening exams at deployment. Our new Privileged History Distillation (PHD) method is validated on a large longitudinal mammography dataset with multi-year cancer outcomes, CSAW-CC, comparing full-history and no-history baselines to their distilled counterparts. Using time-dependent AUC across horizons, our privileged history distillation method markedly improves the performance of long-horizon prediction over no-history models and is comparable to that of full-history models, while using only the current exam at inference time.

2603.15785 2026-03-18 math.PR math.MG math.ST stat.TH

On the Uniqueness of Fréchet Means for Polytope Norms

Roan Talbut, Andrew McCormack, Anthea Monod

Comments 28 pages, 1 figure

Details
English abstract

Fréchet means are a popular type of average for non-Euclidean datasets, defined as those points which minimise the average squared distance to a set of data points. We consider the behaviour of sample Fréchet means on normed spaces whose unit ball is a polytope; this setting is rarely covered by existing literature on Fréchet means, which focuses on smooth spaces or spaces with bounded curvature. We study the geometry of the set of Fréchet means over polytope normed spaces, with a focus on dimension and probabilistic conditions for uniqueness. In particular, we provide a geometric characterisation of the threshold sample size at which Fréchet means have a positive probability of being unique, and we prove that this threshold is at most one more than the dimension of our space. We are able to use this geometric characterisation to compute the unique Fréchet mean sample threshold in the case of the $\ell_\infty$ and $\ell_1$ norms.
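
A small numerical check illustrates why uniqueness can fail for polytope norms. For the two points $(0,0)$ and $(2,0)$ under the $\ell_\infty$ norm, the sum of squared distances to a point $(x_1, x_2)$ is at least $x_1^2 + (x_1-2)^2 \ge 2$, and every point $(1, t)$ with $|t| \le 1$ attains this value. This example is constructed here for illustration and is not taken from the paper; the helper name `frechet_objective` is hypothetical.

```python
import numpy as np

def frechet_objective(x, data, norm_ord=np.inf):
    """Sum of squared norm distances from x to each data point.

    Minimising the sum is equivalent to minimising the average.
    """
    diffs = np.asarray(data, dtype=float) - np.asarray(x, dtype=float)
    return float(np.sum(np.linalg.norm(diffs, ord=norm_ord, axis=1) ** 2))

data = [(0.0, 0.0), (2.0, 0.0)]
candidates = [(1.0, t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# Under the l-infinity norm every candidate achieves the minimal value 2.
values_linf = [frechet_objective(c, data) for c in candidates]
# Under the Euclidean norm only the midpoint (1, 0) is optimal.
values_l2 = [frechet_objective(c, data, norm_ord=2) for c in candidates]
```

With only two data points in the plane, the $\ell_\infty$ minimisers here form a whole segment, whereas the Euclidean mean is a single point.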

2603.15683 2026-03-18 stat.ML cs.LG

Beyond Distance: Quantifying Point Cloud Dynamics with Persistent Homology and Dynamic Optimal Transport

Yixin Wang, Ting Gao, Jinqiao Duan

Comments 42 pages, 15 figures

Details
English abstract

We introduce a framework for analyzing topological tipping in time-evolutionary point clouds by extending the recently proposed Topological Optimal Transport (TpOT) distance. While TpOT unifies geometric, homological, and higher-order relations into one metric, its global scalar distance can obscure transient, localized structural reorganizations during dynamic phase transitions. To overcome this limitation, we present a hierarchical dynamic evaluation framework driven by a novel topological and hypergraph reconstruction strategy. Instead of directly interpolating abstract network parameters, our method interpolates the underlying spatial geometry and rigorously recomputes the valid topological structures, ensuring physical fidelity. Along this geodesic, we introduce a set of multi-scale indicators: macroscopic metrics (Topological Distortion and Persistence Entropy) to capture global shifts, and a novel mesoscopic dual-perspective Hypergraph Entropy (node-perspective and edge-perspective) to detect highly sensitive, asynchronous local rewirings. We further propagate the cycle-level entropy change onto individual vertices to form a point-level topological field. Extensive evaluations on physical dynamical systems (Rayleigh-Van der Pol limit cycles, Double-Well cluster fusion), high-dimensional biological aggregation (D'Orsogna model), and longitudinal stroke fMRI data demonstrate the utility of combining transport-based alignment with multi-scale entropy diagnostics for dynamic topological analysis.

2603.15664 2026-03-18 stat.AP cs.AI cs.CE stat.ML

Quantum Amplitude Estimation for Catastrophe Insurance Tail-Risk Pricing: Empirical Convergence and NISQ Noise Analysis

Alexis Kirke

Details
English abstract

Classical Monte Carlo methods for pricing catastrophe insurance tail risk converge at a rate of order $1/\sqrt{N}$, requiring large simulation budgets to resolve upper-tail percentiles of the loss distribution. This sample-sparsity problem can lead to AI models trained on impoverished tail data, producing poorly calibrated risk estimates where insolvency risk is greatest. Quantum Amplitude Estimation (QAE), following Montanaro, achieves convergence approaching order $1/N$ in oracle queries - a quadratic speedup that, at scale, would enable high-resolution tail estimation within practical budgets. We validate this advantage empirically using a Qiskit Aer simulator with genuine Grover amplification. A complete pipeline encodes fitted lognormal catastrophe distributions into quantum oracles via amplitude encoding, producing small readout probabilities that enable safe Grover amplification with up to k=16 iterations. Seven experiments on synthetic and real (NOAA Storm Events, 58,028 records) data yield three main findings: that there is an advantage in the oracle model, that strong classical baselines win when analytical access is available, and that discretisation, not estimation, is the current bottleneck.
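
The classical $1/\sqrt{N}$ rate can be seen directly from the standard error of a Monte Carlo tail-probability estimate. The sketch below is classical only (it involves no quantum simulation). The function name `tail_probability`, the lognormal parameters, and the threshold are arbitrary illustrative choices, not the fitted values from the paper.

```python
import numpy as np

def tail_probability(n_samples, threshold, mu=0.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of P(loss > threshold) with its standard error."""
    rng = np.random.default_rng(seed)
    losses = rng.lognormal(mean=mu, sigma=sigma, size=n_samples)
    p_hat = float(np.mean(losses > threshold))
    # Binomial standard error of the estimated proportion.
    std_err = float(np.sqrt(p_hat * (1.0 - p_hat) / n_samples))
    return p_hat, std_err

# Threshold exp(2) puts the event about two standard deviations into the upper tail.
threshold = float(np.exp(2.0))
results = {n: tail_probability(n, threshold) for n in (10_000, 1_000_000)}
```

Increasing the sample size a hundredfold shrinks the standard error only about tenfold. A method converging at order $1/N$ would reach the same precision with roughly a tenfold increase in queries.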