arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.17984 2026-03-19 stat.ME math.ST stat.TH

On min-Storey estimators for multiple testing and conformal novelty detection

Gao Zijun, Roquain Etienne

Comments 52 pages, 9 figures, 2 tables

Abstract

In a multiple testing task, finding an appropriate estimator of the proportion $π_0$ of non-signal in the data to boost power of false discovery rate (FDR) controlling procedures is a long-standing research theme, sometimes referred to as 'adaptive FDR control'. The interest in this theme has been reinforced in recent years with conformal novelty detection, for which it turns out that similar tools can be used in combination with any 'blackbox' machine learning algorithm. Nevertheless, perhaps surprisingly, finding a solution for 'adaptive FDR control' that is optimal in a broad sense is still an open problem. This paper fills this gap by introducing new $π_0$-estimators, referred to as min-Storey (MS) and interval-min-Storey (IMS), which are built upon the so-called 'Storey estimator'. Plugging these estimators into the adaptive Benjamini-Hochberg (BH) procedure is shown to deliver FDR control both in the independent and conformal settings. In addition, these methods satisfy an optimal power property over any (regular) alternative distribution. The excellent behaviors of the new adaptive procedures are illustrated with numerical experiments both in the independent and conformal models for various distribution structures.
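
For orientation, the building blocks named in this abstract can be sketched as follows. This is a minimal sketch of the classical Storey estimator plugged into adaptive BH, not the paper's MS or IMS estimators, whose definitions the abstract does not give. The function names `storey_pi0` and `adaptive_bh` and the defaults `lam=0.5` and `alpha=0.1` are assumptions made for illustration.

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Classical Storey estimate of the null proportion pi_0 (with the +1 correction)."""
    p = np.asarray(pvalues, dtype=float)
    return (1.0 + np.sum(p > lam)) / (p.size * (1.0 - lam))

def adaptive_bh(pvalues, alpha=0.1, lam=0.5):
    """Adaptive BH: run the BH step-up procedure at level alpha / pi0_hat.

    Returns a boolean rejection mask and the pi_0 estimate.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    pi0_hat = storey_pi0(p, lam)
    order = np.argsort(p)
    # Step-up thresholds alpha * k / (m * pi0_hat) for k = 1, ..., m.
    thresholds = alpha * np.arange(1, m + 1) / (m * pi0_hat)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest sorted index passing the test
        reject[order[: k + 1]] = True
    return reject, pi0_hat

# Hypothetical p-values: three small ones followed by five large ones.
pvals = [0.001, 0.002, 0.003, 0.6, 0.7, 0.8, 0.9, 0.95]
mask, pi0_hat = adaptive_bh(pvals)
print(mask, pi0_hat)
```

A smaller estimate of $π_0$ loosens the BH thresholds, which is where the power gain of adaptive procedures comes from.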

2603.17912 2026-03-19 cs.CL stat.ML

Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages

Yue Zhao, Jiatao Gu, Paloma Jeretič, Weijie Su

Abstract

Understanding the distance between human languages is central to linguistics, anthropology, and tracing human evolutionary history. Yet, while linguistics has long provided rich qualitative accounts of cross-linguistic variation, a unified and scalable quantitative approach to measuring language distance remains lacking. In this paper, we introduce a method that leverages pretrained multilingual language models as systematic instruments for linguistic measurement. Specifically, we show that the spontaneously emerged attention mechanisms of these models provide a robust, tokenization-agnostic measure of cross-linguistic distance, termed Attention Transport Distance (ATD). By treating attention matrices as probability distributions and measuring their geometric divergence via optimal transport, we quantify the representational distance between languages during translation. Applying ATD to a large and diverse set of languages, we demonstrate that the resulting distances recover established linguistic groupings with high fidelity and reveal patterns aligned with geographic and contact-induced relationships. Furthermore, incorporating ATD as a regularizer improves transfer performance in low-resource machine translation. Our results establish a principled foundation for testing linguistic hypotheses using artificial neural networks. This framework transforms multilingual models into powerful tools for quantitative linguistic discovery, facilitating more equitable multilingual AI.

2603.17896 2026-03-19 stat.ML cs.LG

A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models

Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard

Abstract

Understanding when learning is statistically possible yet computationally hard is a central challenge in high-dimensional statistics. In this work, we investigate this question in the context of single- and multi-index models, classes of functions widely studied as benchmarks to probe the ability of machine learning methods to discover features in high-dimensional data. Our main contribution is to show that a Noise Sensitivity Exponent (NSE) - a simple quantity determined by the activation function - governs the existence and magnitude of statistical-to-computational gaps within a broad regime of these models. We first establish that, in single-index models with large additive noise, the onset of a computational bottleneck is fully characterized by the NSE. We then demonstrate that the same exponent controls a statistical-computational gap in the specialization transition of large separable multi-index models, where individual components become learnable. Finally, in hierarchical multi-index models, we show that the NSE governs the optimal computational rate in which different directions are sequentially learned. Taken together, our results identify the NSE as a unifying property linking noise robustness, computational hardness, and feature specialization in high-dimensional learning.

2603.17864 2026-03-19 stat.ME

Bivariate deconvolution for cancer detection after surgery

Nuria Senar, Stavros Makrodimitris, Michel H. Hof, Cornelis Verhoef, Saskia M. Wilting, Mark A. van de Wiel

Comments 11 pages, 3 figures and appendix

Abstract

Detection of minimal residual disease (MRD) in cancer patients after surgery can provide an early marker for disease recurrence and guide subsequent treatment decisions. Accurate and sensitive estimation of tumour burden after cancer surgery may be obtained through liquid biopsies, measuring circulating tumour DNA (ctDNA) using, for example, mutation-based Variant Allele Frequency (VAF) values. However, to be applicable to all patients this either requires tumour-informed, patient-specific mutation panels or sensitive, tumour-agnostic genome-wide measurements. We propose a solution that accounts for patient-specific characteristics in genome-wide screens. For that, we introduce a bivariate deconvolution model to estimate tumour proportion from circulating cell-free DNA (cfDNA) methylation profiles of patients before and after surgery. The observations are modelled as a convolution of two bivariate latent variables, corresponding to tumour and background signals, mixed by the tumour proportion at each measurement. This bivariate approach links pre- and post-surgery measurements improving estimation of the tumour proportion after surgery, when the tumour signal is potentially very weak, or absent. We approximate likelihood of the convolution through a discretisation of the bivariate density for each latent variable into a two-dimensional grid for each pair of observations which allows for fast maximum likelihood estimation. We evaluate the predictive performance of the estimated post-surgery tumour proportions based on cfDNA methylation against available mutation-based VAF values in one-year recurrence-free survival.

2603.17681 2026-03-19 math.NT stat.ML

Murmurations, Mestre--Nagao sums, and Convolutional Neural Networks for elliptic curves

Joanna Bieri, Edgar Costa, Alyson Deines, Kyu-Hwan Lee, David Lowry-Duda, Thomas Oliver, Yidi Qi, Tamara Veenstra

Comments 15 pages, 11 figures, 3 tables

Abstract

We apply one-dimensional convolutional neural networks to the Frobenius traces of elliptic curves over $\mathbb{Q}$ and evaluate and interpret their predictive capacity. In keeping with similar experiments by Kazalicki--Vlah, Bujanović--Kazalicki--Novak, and Pozdnyakov, we observe high accuracy predictions for the analytic rank across a range of conductors. We interpret the prediction using saliency curves and explore the interesting interplay between murmurations and Mestre--Nagao sums, the details of which vary with the conductor and the (predicted) rank.

2603.17663 2026-03-19 stat.ME

More with Less - Bethel Allocation and Precision-Preserving Sample Size Reduction via Hierarchical Bayes Modelling

Siu-Ming Tam

Comments 29 pages, 9 tables and 1 appendix

Abstract

Statistical offices face a familiar and intensifying dilemma: rising demand for detailed regional and domain-level estimates under budgets that are fixed or shrinking. National statistical offices (NSOs) either ignore the problem of optimal sample allocation for multiple target variables when designing a multi-purpose survey, or address it incorrectly - relying on ad hoc approaches such as computing Neyman allocations separately per variable and taking the element-wise maximum, a practice that simultaneously wastes budget and fails to guarantee precision across all domains. This paper presents a practical two-stage strategy that reframes the question: not how to allocate a given sample, but how small the sample can be made while still meeting pre-defined precision targets for all target variables across all geographic domains at once. The innovation lies not in inventing new methods, but in the novel combination of two well-established techniques applied to this cost-reduction problem: (i) multivariate constrained optimisation via Bethel allocation, which finds the globally minimum sample satisfying all precision constraints simultaneously; and (ii) Hierarchical Bayes (HB) small area modelling, which borrows strength across strata and permits a further reduction of the Bethel sample. The approach is validated using a Monte Carlo study (B = 1,000 replications) based on a synthetic labour-force population of one million individuals, where known population truth allows rigorous evaluation of precision, accuracy, and credible-interval coverage. Keywords: Bethel allocation; Hierarchical Bayes; small area estimation; sample size reduction; multivariate optimisation; labour force survey; coefficient of variation.
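
The ad hoc practice criticised in this abstract can be illustrated with a small sketch. It computes the textbook Neyman allocation (stratum sample size proportional to stratum size times stratum standard deviation) separately for two variables and then takes the element-wise maximum. The three-stratum population, the standard deviations and the budget of 600 are hypothetical numbers chosen for illustration; Bethel allocation itself is not implemented here.

```python
import numpy as np

def neyman_allocation(n_total, stratum_sizes, stratum_sds):
    """Neyman allocation for one target variable: n_h proportional to N_h * S_h."""
    N = np.asarray(stratum_sizes, dtype=float)
    S = np.asarray(stratum_sds, dtype=float)
    weights = N * S
    return n_total * weights / weights.sum()

# Hypothetical population: three strata, two target variables.
N_h = [5000, 3000, 2000]
sd_var1 = [10.0, 20.0, 5.0]
sd_var2 = [2.0, 1.0, 8.0]
budget = 600

alloc1 = neyman_allocation(budget, N_h, sd_var1)
alloc2 = neyman_allocation(budget, N_h, sd_var2)

# Ad hoc rule: take the larger of the two allocations in every stratum.
elementwise_max = np.maximum(alloc1, alloc2)
print(alloc1, alloc2, elementwise_max.sum())
```

Each single-variable allocation sums to the budget, but the element-wise maximum generally sums to more, which illustrates the budget side of the criticism in the abstract.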

2603.08511 2026-03-19 stat.ME

Kantorovich Regression Analysis of Random Distributions with Mixed Predictors

Kaheon Kim, Changbo Zhu

Abstract

We study regression problems with distribution-valued responses and mixed distributional and Euclidean predictors. In quadratic cost, the negative gradient of the Kantorovich potential represents, at each source location, the displacement to its matched location under the optimal transport map. By constructing potentials from the Wasserstein barycenter to individual distributions, the proposed Kantorovich regression model approximates the response displacement field as a sum of predictor displacement fields, each adjusted by a functional parameter. Owing to the linear structure, Euclidean predictors can enter as scaling coefficients of $c$-concave parameter potentials. We characterize functional parameter classes ensuring the intrinsic structure of the model, establish asymptotic theory through uniform convergence of the empirical Wasserstein loss, and derive Gâteaux derivatives leading to first-order optimization algorithms. Real data applications include a mixed-predictor analysis of housing price distributions and an analysis of two-dimensional temperature distributions, demonstrating the flexibility and interpretability of the proposed framework.

2512.01899 2026-03-19 cs.LG stat.ML

Provably Safe Model Updates

Leo Elmecker-Plakolm, Pierre Fasterling, Philip Sosnin, Calvin Tsay, Matthew Wicker

Comments 12 pages, 9 figures. This work has been accepted for publication at SaTML 2026. The final version will be available on IEEE Xplore

Abstract

Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates - independent of the data or algorithm used - by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of lookahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.

2509.10337 2026-03-19 stat.ML cs.LG

Exact Generalisation Error Exposes Benchmarks Skew Graph Neural Networks Success (or Failure)

Nil Ayday, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar

Abstract

Graph Neural Networks (GNNs) have become the standard method for learning from networks across fields ranging from biology to social systems, yet a principled understanding of what enables them to extract meaningful representations, or why performance varies drastically between similar models, remains elusive. These questions can be answered through the generalisation error, which measures the discrepancy between a model's predictions and the true values it is meant to recover. Although several works have derived generalisation error bounds, learning theoretical bounds are typically loose, restricted to a single architecture, and offer limited insight into what governs generalisation in practice. In this work, we take a fundamentally different approach by deriving the exact generalisation error for a broad range of linear GNNs, including convolutional, PageRank-based, and attention-based models, through the lens of signal processing. Our exact generalisation error exposes a strong benchmark bias in existing literature: commonly used datasets exhibit high alignment between node features and the graph structure, inherently favouring architectures that rely on it. We further show that the similarity between connected nodes (homophily) decisively governs which architectures are best suited for a given graph, thereby explaining how specific benchmark properties systematically shape the reported performance in the literature. Together, these results explain when and why GNNs can effectively leverage structure and feature information, supporting the reliable application of GNNs.

2507.03681 2026-03-19 stat.ML cs.LG stat.ME

Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data

Rickard Karlsson, Piersilvio De Bartolomeis, Issa J. Dahabreh, Jesse H. Krijthe

Comments Accepted to AISTATS 2026. 24 pages, including references and appendix

Abstract

Randomized trials are typically designed to detect average treatment effects but often lack the statistical power to uncover individual-level treatment effect heterogeneity, limiting their value for personalized decision-making. To address this, we propose the QR-learner, a model-agnostic learner that estimates conditional average treatment effects (CATE) within the trial population by leveraging external data from other trials or observational studies. The proposed method is robust: it can reduce the mean squared error relative to a trial-only CATE learner, and is guaranteed to recover the true CATE even when the external data are not aligned with the trial. Moreover, we introduce a procedure that combines the QR-learner with a trial-only CATE learner and show that it asymptotically matches or exceeds both component learners in terms of mean squared error. We examine the performance of our approach in simulation studies and apply the methods to a real-world dataset, demonstrating improvements in both CATE estimation and statistical power for detecting heterogeneous effects.

2505.17300 2026-03-19 stat.ML cs.LG stat.CO stat.ME

Statistical Inference for Online Algorithms

Selina Carter, Arun K Kuchibhotla

Comments 1) Adding to ASGD simulations, we add 5 other SGD algorithms: averaged-implicit-SGD, last-iterate-implicit-SGD, ROOT-SGD, truncated-SGD, and noisy-truncated-SGD. 2) We modify links to the online viz/GitHub pages. 3) We qualify previous conclusions on ASGD: e.g., we claim that logistic regression is sometimes more challenging "in terms of achieving the target coverage" than linear regression

Abstract

The construction of confidence intervals and hypothesis tests for functionals is a cornerstone of statistical inference. Traditionally, the most efficient procedures - such as the Wald interval or the Likelihood Ratio Test - require both a point estimator and a consistent estimate of its asymptotic variance. However, when estimators are derived from online or sequential algorithms, computational constraints often preclude multiple passes over the data, complicating variance estimation. In this article, we propose a computationally efficient, rate-optimal wrapper method (HulC) that wraps around any online algorithm to produce asymptotically valid confidence regions bypassing the need for explicit asymptotic variance estimation. The method is provably valid for any online algorithm that yields an asymptotically normal estimator. We evaluate the practical performance of the proposed method primarily using Stochastic Gradient Descent (SGD) with Polyak-Ruppert averaging. Furthermore, we provide extensive numerical simulations comparing the performance of our approach (HulC) when used with other online algorithms, including implicit-SGD and ROOT-SGD.

2504.03466 2026-03-19 math.ST stat.TH

Identifiability of VAR(1) model in a stationary setting

Bixuan Liu

Abstract

We consider a classical First-order Vector AutoRegressive (VAR(1)) model, where we interpret the autoregressive interaction matrix as influence relationships among the components of the VAR(1) process that can be encoded by a weighted directed graph. A majority of previous work studies the structural identifiability of the graph based on time series observations and therefore relies on dynamical information. In this work we assume that an equilibrium exists, and study instead the identifiability of the graph from the stationary distribution, meaning that we seek a way to reconstruct the influence graph underlying the dynamic network using only static information. We use an approach from algebraic statistics that characterizes models using the Jacobian matroids associated with the parametrization of the models, and we introduce sufficient graphical conditions under which different graphs yield distinct steady-state distributions. Additionally, we illustrate how our results could be applied to characterize networks inspired by ecological research.

2503.19068 2026-03-19 stat.ML cs.AI cs.LG stat.ME stat.OT

Minimum Volume Conformal Sets for Multivariate Regression

Sacha Braun, Liviu Aolaritei, Michael I. Jordan, Francis Bach

Abstract

Conformal prediction provides a principled framework for constructing predictive sets with finite-sample validity. While much of the focus has been on univariate response variables, existing multivariate methods either impose rigid geometric assumptions or rely on flexible but computationally expensive approaches that do not explicitly optimize prediction set volume. We propose an optimization-driven framework based on a novel loss function that directly learns minimum-volume covering sets while ensuring valid coverage. This formulation naturally induces a new nonconformity score for conformal prediction, which adapts to the residual distribution and covariates. Our approach optimizes over prediction sets defined by arbitrary norm balls, including single and multi-norm formulations. Additionally, by jointly optimizing both the predictive model and predictive uncertainty, we obtain prediction sets that are tight, informative, and computationally efficient, as demonstrated in our experiments on real-world datasets.
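
As a point of reference, the simplest norm-ball construction can be sketched with standard split conformal prediction. This is the fixed-norm baseline, not the learned minimum-volume score proposed in the paper. It assumes residuals from a held-out calibration set are already available, and the function name `conformal_ball_radius` is hypothetical.

```python
import math
import numpy as np

def conformal_ball_radius(residuals, alpha=0.1, ord=2):
    """Split-conformal radius: the ceil((n + 1)(1 - alpha))-th smallest residual norm.

    residuals: array of shape (n, d) holding y_i - f(x_i) on calibration data.
    """
    norms = np.linalg.norm(np.asarray(residuals, dtype=float), ord=ord, axis=1)
    scores = np.sort(norms)
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(scores[k - 1])

# Hypothetical two-dimensional calibration residuals.
calibration_residuals = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [0.0, 2.0]]
radius = conformal_ball_radius(calibration_residuals, alpha=0.5)
print(radius)
```

The prediction set for a new input is then the ball of this radius, in the chosen norm, centred at the model's prediction. Its volume depends entirely on how well the fixed norm matches the residual distribution, which is the limitation the abstract addresses.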

2503.08746 2026-03-19 q-bio.QM stat.ME

In silico clinical trials in drug development: a systematic review

Bohua Chen, Lucia Chantal Schneider, Christian Röver, Emmanuelle Comets, Markus Christian Elze, Andrew Hooker, Joanna IntHout, Anne-Sophie Jannot, Daria Julkowska, Yanis Mimouni, Marina Savelieva, Nigel Stallard, Moreno Ursino, Marc Vandemeulebroecke, Sebastian Weber, Martin Posch, Sarah Zohar, Tim Friede

Comments 30 pages, 9 figures

Journal ref
Therapeutic Innovation & Regulatory Science, 60:423-439, 2025
Abstract

In the context of clinical research, computational models have received increasing attention over the past decades. In this systematic review, we aimed to provide an overview of the role of so-called in silico clinical trials (ISCTs) in medical applications. Exemplary for the broad field of clinical medicine, we focused on in silico (IS) methods applied in drug development, sometimes also referred to as model informed drug development (MIDD). We searched PubMed and ClinicalTrials.gov for published articles and registered clinical trials related to ISCTs. We identified 202 articles and 48 trials, and of these, 76 articles and 19 trials were directly linked to drug development. We extracted information from all 202 articles and 48 clinical trials and conducted a more detailed review of the methods used in the 76 articles that are connected to drug development. Regarding application, most articles and trials focused on cancer and imaging-related research while rare and pediatric diseases were only addressed in 14 articles and 5 trials, respectively. While some models were informed combining mechanistic knowledge with clinical or preclinical (in-vivo or in-vitro) data, the majority of models were fully data-driven, illustrating that clinical data is a crucial part in the process of generating synthetic data in ISCTs. Regarding reproducibility, a more detailed analysis revealed that only 24% (18 out of 76) of the articles provided an open-source implementation of the applied models, and in only 20% of the articles the generated synthetic data were publicly available. Despite the widely raised interest, we also found that it is still uncommon for ISCTs to be part of a registered clinical trial and their application is restricted to specific diseases leaving potential benefits of ISCTs not fully exploited.

2502.05021 2026-03-19 stat.ME eess.SP stat.ML

Gradient-based filtering under misspecification: Stability and error bounds

Simon Donker van Heel, Rutger-Jan Lange, Bram van Os, Dick van Dijk

Comments 62 pages

Abstract

Can stochastic gradient methods track a moving target? We study the problem of tracking multidimensional time-varying parameters under noisy observations and possible model misspecification. Gradient-based filters update the time-varying parameters using the gradient of a postulated objective function. A natural filtering objective is the logarithm of the postulated observation density, which gives rise to the widely used class of score-driven filters. As in the optimization literature, these filters come in two forms: explicit filters evaluate the gradient at the predicted parameter, whereas implicit filters evaluate it at the updated parameter. For both filter types, we derive novel sufficient conditions for exponential stability of the filtered parameter path, showing that stability can be guaranteed independently of the data-generating process. Under mild additional moment conditions on the data-generating process, we also obtain finite-sample and asymptotic mean squared error bounds relative to the pseudo-true parameter path. For implicit filters, these guarantees hold under weak parameter restrictions. For explicit filters, they additionally require Lipschitz continuity of the score and a sufficiently small learning rate. Simulation studies support our theoretical findings and show that implicit gradient filters outperform explicit ones in both accuracy and stability.

2501.05007 2026-03-19 quant-ph cs.AI cs.LG stat.ME

Quantum-enhanced causal discovery for a small number of samples

Yu Terada, Ken Arai, Yu Tanaka, Yota Maeda, Hiroshi Ueno, Hiroyuki Tezuka

Comments 20 pages, 10 figures

Journal ref
Quantum Mach. Intell. 8, 36 (2026)
Abstract

The discovery of causal relations from observed data has attracted significant interest from disciplines such as economics, social sciences, and biology. In practical applications, considerable knowledge of the underlying systems is often unavailable, and real data are usually associated with nonlinear causal structures, which makes the direct use of most conventional causality analysis methods difficult. This study proposes a novel quantum Peter-Clark (qPC) algorithm for causal discovery that does not require any assumptions about the underlying model structures. Based on conditional independence tests in a class of reproducing kernel Hilbert spaces characterized by quantum circuits, the proposed algorithm can explore causal relations from the observed data drawn from arbitrary distributions. We conducted systematic experiments on fundamental graphs of causal structures, demonstrating that the qPC algorithm exhibits better performance, particularly with smaller sample sizes compared to its classical counterpart. Furthermore, we proposed a novel optimization approach based on Kernel Target Alignment (KTA) for determining hyperparameters of quantum kernels. This method effectively reduced the risk of false positives in causal discovery, enabling more reliable inference. Our theoretical and experimental results demonstrate that the quantum algorithm can empower classical algorithms for accurate inference in causal discovery, supporting them in regimes where classical algorithms typically fail. In addition, the effectiveness of this method was validated using the datasets on Boston housing prices, heart disease, and biological signaling systems as real-world applications. These findings highlight the potential of quantum-based causal discovery methods in addressing practical challenges, particularly in small-sample scenarios, where traditional approaches have shown significant limitations.

2408.08177 2026-03-19 stat.ME stat.ML

Localized Sparse Principal Component Analysis of Multivariate Time Series in Frequency Domain

Jamshid Namdari, Amita Manatunga, Fabio Ferrarelli, Robert Krafty

Comments 63 pages, 6 figures

Abstract

Principal component analysis has been a main tool in multivariate analysis for estimating a low dimensional linear subspace that explains most of the variability in the data. However, in high-dimensional regimes, naive estimates of the principal loadings are not consistent and difficult to interpret. In the context of time series, principal component analysis of spectral density matrices can provide valuable, parsimonious information about the behavior of the underlying process, particularly if the principal components are interpretable in that they are sparse in coordinates and localized in frequency bands. In this paper, we introduce a formulation and consistent estimation procedure for interpretable principal component analysis for high-dimensional time series in the frequency domain. An efficient frequency-sequential algorithm is developed to compute sparse-localized estimates of the low-dimensional principal subspaces of the signal process. The method is motivated by and used to understand neurological mechanisms from high-density resting-state EEG in a study of first episode psychosis.

2405.16924 2026-03-19 cs.LG stat.ML

Demystifying amortized causal discovery with transformers

Francesco Montagna, Max Cairney-Leeming, Dhanya Sridhar, Francesco Locatello

Journal ref
Transactions in Machine Learning Research (TMLR), 2025
Abstract

Supervised learning for causal discovery from observational data often achieves competitive performance despite seemingly avoiding the explicit assumptions that traditional methods require for identifiability. In this work, we analyze CSIvA (Ke et al., 2023) on bivariate causal models, a transformer architecture for amortized inference promising to train on synthetic data and transfer to real ones. First, we bridge the gap with identifiability theory, showing that the training distribution implicitly defines a prior on the causal model of the test observations: consistent with classical approaches, good performance is achieved when we have a good prior on the test data, and the underlying model is identifiable. Second, we find that CSIvA cannot generalize to classes of causal models unseen during training: to overcome this limitation, we theoretically and empirically analyze when training CSIvA on datasets generated by multiple identifiable causal models with different structural assumptions improves its generalization at test time. Overall, we find that amortized causal discovery with transformers still adheres to identifiability theory, violating the previous hypothesis from Lopez-Paz et al. (2015) that supervised learning methods could overcome its restrictions.

2302.02415 2026-03-19 math.ST stat.ME stat.TH

On Separability of Covariance in Multiway Data Analysis

Dogyoon Song, Alfred O. Hero

Comments 45 pages, 8 figures, 3 tables

Abstract

Multiway data analysis aims to uncover patterns in data structured as multi-indexed arrays, with multiway covariance playing a crucial role in many applications. However, the high dimensionality of multiway covariance presents significant computational challenges. To overcome these challenges, factorized covariance models have been proposed that rely on a separability assumption: the multiway covariance can be accurately expressed as a sum of Kronecker products of mode-wise covariances. This paper addresses the representability, certification, and approximation of such separable models, leaving statistical estimation or finite-sample properties aside. We reduce the question of whether a given covariance can be decomposed into a separable multiway form to an equivalent question about the separability of quantum states. Leveraging results from quantum information theory, we show that generic multiway covariances are typically not separable and that determining the best separable approximation is NP-hard. These findings suggest that factorized covariance models can be overly restrictive and difficult to fit without additional structural assumptions. Nevertheless, our numerical experiments indicate that standard iterative algorithms, namely Frank-Wolfe and gradient descent, often converge close to the best separable approximation. As NP-hardness concerns worst-case computational complexity, Kronecker-separable approximations to multiway covariance could still be tractable to apply for analyzing many real-world datasets.

2111.06390 2026-03-19 stat.AP cs.AI cs.GT cs.HC

Theoretical Foundations of δ-margin Majority Voting

Margarita Boyarskaya, Panos Ipeirotis

Abstract

In high-stakes ML applications such as fraud detection, medical diagnostics, and content moderation, practitioners rely on consensus-based approaches to control prediction quality. A particularly valuable technique -- δ-margin majority voting -- collects votes sequentially until one label exceeds alternatives by a threshold δ, offering stronger confidence than simple majority voting. Despite widespread adoption, this approach has lacked rigorous theoretical foundations, leaving practitioners reliant on heuristics for key metrics like expected accuracy and cost. This paper establishes a comprehensive theoretical framework for δ-margin majority voting by formulating it as an absorbing Markov chain and leveraging Gambler's Ruin theory. Our contributions form a practical design calculus for δ-margin voting: (1) Closed-form expressions for consensus accuracy, expected voting duration, variance, and the stopping-time PMF, enabling model-based design rather than trial-and-error. (2) A Bayesian extension handling uncertainty in worker accuracy, supporting real-time monitoring of expected quality and cost as votes arrive, with single-Beta and mixture-of-Betas priors. (3) Cost-calibration methods for achieving equivalent quality across worker pools with different accuracies and for setting payment rates accordingly. We validate our predictions on two real-world datasets, demonstrating close agreement between theory and observed outcomes. The framework gives practitioners a rigorous toolkit for designing δ-margin voting processes, replacing ad-hoc experimentation with model-based design where quality control and cost transparency are essential.
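
The Gambler's Ruin connection can be made concrete with a small sketch. It uses the textbook results for a simple random walk absorbed at +δ or -δ, under assumptions the abstract does not spell out: binary labels, and independent votes that are each correct with the same probability p. The paper's own expressions may be more general. The function names and the simulation check are illustrative only.

```python
import random

def consensus_accuracy(p, delta):
    """Probability that the correct label is the first to lead by delta votes."""
    q = 1.0 - p
    return p**delta / (p**delta + q**delta)

def expected_votes(p, delta):
    """Expected number of votes until one label leads by delta (via Wald's identity)."""
    if abs(p - 0.5) < 1e-12:
        return float(delta**2)  # symmetric walk: E[T] = delta^2
    return delta * (2.0 * consensus_accuracy(p, delta) - 1.0) / (2.0 * p - 1.0)

def simulate(p, delta, runs=20000, seed=0):
    """Monte Carlo estimate of (accuracy, mean number of votes) for comparison."""
    rng = random.Random(seed)
    correct = 0
    votes = 0
    for _ in range(runs):
        lead = 0  # votes for the correct label minus votes for the wrong one
        while abs(lead) < delta:
            lead += 1 if rng.random() < p else -1
            votes += 1
        correct += lead > 0
    return correct / runs, votes / runs

print(consensus_accuracy(0.7, 3), expected_votes(0.7, 3))
print(simulate(0.7, 3))
```

Under these assumptions, raising δ trades cost for quality: accuracy approaches 1 geometrically in δ while the expected number of votes grows roughly linearly when p is away from 1/2.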

2603.17628 2026-03-19 stat.ML cs.AI cs.LG stat.ME

rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks

Suryasis Jana, Abhik Ghosh

Comments Pre-print; under review

Details
English abstract

Neural networks are central to modern artificial intelligence, yet their training remains highly sensitive to data contamination. Standard neural classifiers are trained by minimizing the categorical cross-entropy loss, corresponding to maximum likelihood estimation under a multinomial model. While statistically efficient under ideal conditions, this approach is highly vulnerable to contaminated observations, including label noise corrupting supervision in the output space and adversarial perturbations inducing worst-case deviations in the input space. In this paper, we propose a unified and statistically grounded framework for robust neural classification that addresses both forms of contamination within a single learning objective. We formulate neural network training as a minimum-divergence estimation problem and introduce rSDNet, a robust learning algorithm based on the general class of $S$-divergences. The resulting training objective inherits robustness properties from classical statistical estimation, automatically down-weighting aberrant observations through model probabilities. We establish essential population-level properties of rSDNet, including Fisher consistency, classification calibration implying Bayes optimality, and robustness guarantees under uniform label noise and infinitesimal feature contamination. Experiments on three benchmark image classification datasets show that rSDNet improves robustness to label corruption and adversarial attacks while maintaining competitive accuracy on clean data. Our results highlight minimum-divergence learning as a principled and effective framework for robust neural classification under heterogeneous data contamination.

2603.17599 2026-03-19 stat.ME math.ST stat.AP stat.TH

Prediction with Missing Data: Target Probabilities and Missingness Mechanisms

Pierre Catoire, Robin Genuer, Cecile Proust-Lima

Comments 55 pages (including 40 pages for the main article and 15 pages for the supplementary material)

Details
English abstract

Conditions ensuring optimal parameter estimation in the presence of missing data are well established in inference, typically relying on the Missing-at-Random (MAR) assumption. In prediction, similar principles are often assumed to apply. However, methods considered biased in inference, such as pattern sub-modelling or unconditional imputation, have been shown to achieve optimal predictive performance under any missingness mechanism, including non-MAR (MNAR). To explain this apparent contradiction, we introduce a new formal framework for describing missingness in prediction. Central to this framework is a distinction between two prediction targets, defined according to whether or not the indicator of observation of the predictors is exploited to predict the outcome. This distinction leads to a classification of the missingness mechanisms describing the conditions under which these targets are equal, and when consistent prediction of each is achievable. A key result is that both targets may be consistently predicted under conditions weaker than MAR. We discuss the implications of this paradigm for handling missing data in prediction, distinguishing between missingness at development, validation and deployment of a forecaster. The findings are illustrated using simulated data and a real-world application with the prediction of significant injury after trauma upon arrival at the emergency department.

2603.17569 2026-03-19 stat.ML cs.LG

Gaussian Process Limit Reveals Structural Benefits of Graph Transformers

Nil Ayday, Lingchu Yang, Debarghya Ghoshdastidar

Details
English abstract

Graph transformers are the state of the art for learning from graph-structured data and are empirically known to avoid several pitfalls of message-passing architectures. However, there is limited theoretical analysis of why these models perform well in practice. In this work, we prove that attention-based architectures have structural benefits over graph convolutional networks in the context of node-level prediction tasks. Specifically, we study the neural network Gaussian process limits of graph transformers (GAT, Graphormer, Specformer) with infinite width and infinite heads, and derive the node-level and edge-level kernels across the layers. Our results characterise how the node features and the graph structure propagate through the graph attention layers. As a specific example, we prove that graph transformers structurally preserve community information and maintain discriminative node representations even in deep layers, thereby preventing oversmoothing. We provide empirical evidence on synthetic and real-world graphs that validates our theoretical insights, for example that integrating informative priors and positional encodings can improve the performance of deep graph transformers.

2603.17551 2026-03-19 stat.ML cs.LG

Consistency of the $k$-Nearest Neighbor Regressor under Complex Survey Designs

Caren Hasler

Details
English abstract

We study the consistency of the $k$-nearest neighbor regressor under complex survey designs. While consistency results for this algorithm are well established for independent and identically distributed data, corresponding results for complex survey data are lacking. We show that the $k$-nearest neighbor regressor is consistent under regularity conditions on the sampling design and the distribution of the data. We derive lower bounds for the rate of convergence and show that these bounds exhibit the curse of dimensionality, as in the independent and identically distributed setting. Empirical studies based on simulated and real data illustrate our theoretical findings.

2603.17502 2026-03-19 stat.ME stat.AP

A lightweight framework for characterising extreme precipitation events in climate ensembles

Dáire Healy, Isadora Antoniano-Villalobos, Claudia Collarin, Nathan Huet, Ilaria Prosdocimi, Emilia Siviero

Details
English abstract

This article summarises the methods used by the team "Ca' Foscari" for the EVA 2025 Data Challenge. The questions of the challenge concern the estimation of exceedance probabilities across several locations. Rather than modelling the spatial dependence structure, we reduce the problems to univariate ones by considering relevant spatial order statistics across the sites. Within a Peaks over Threshold framework, we model the marginal distributions of exceedances using generalised Pareto distributions. Generalised additive models are employed to allow the parameters to vary as functions of external predictors, which for all questions are reduced to the month. For questions 1 and 2, the required estimates and confidence intervals are obtained by generating samples from our fitted models. Question 3 involves the dependence between two consecutive observed statistics. To account for this temporal dependence, we fit a conditional extreme value model and derive empirical estimates of persistent extreme events by simulating from this model.

2603.17495 2026-03-19 stat.AP

A Weight-Dependent 1RM Prediction Equation Optimized on 303,494 Near-Failure Sets Across 388 Exercises

Thiago Marzagao

Comments 22 pages, 4 figures, 5 tables

Details
English abstract

Classical equations for predicting one-repetition maximum (1RM) from submaximal performance were derived from small samples performing a single exercise, yet are routinely applied to hundreds of exercises. All use a fixed conversion factor relating repetitions to estimated 1RM, regardless of exercise or load. We used large-scale observational data from a consumer fitness app (303,494 near-failure sets from 14,966 users across 388 exercises spanning 16 muscle groups) to derive and evaluate a generalization in which the conversion factor varies logarithmically with the weight lifted: 1RM = w * (1 + (r - 1)^0.85 / (-2.55 + 4.58 * ln(w))). Because the dataset contains no directly measured maxima, we optimized and evaluated the formula using an internal consistency criterion -- the degree to which different weight-repetition combinations from the same person, exercise, and time window yield the same estimated 1RM. The proposed formula reduced inconsistency by 17-22% relative to four classical benchmarks, with the improvement positive for every one of the 183 exercises with sufficient data. Five-fold user-level cross-validation confirmed near-zero overfitting. An ablation analysis attributed 91% of the improvement to the weight-dependent conversion factor and 9% to the sub-linear repetition exponent. The conversion factor increases with load: at light weights each additional repetition implies a larger fraction of maximal capacity than at heavy weights, consistent with prior evidence that the repetitions-%1RM relationship varies by exercise. Classical equations, by applying a single conversion factor across all loads, systematically underestimate this variation -- and the discrepancy is largest for the lighter, more diverse exercises that dominate real-world training programs.
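The weight-dependent equation quoted in the abstract can be written as a short function. This is a sketch only: the abstract does not state the unit of the weight `w`, so the unit is left to the caller, and the input checks (including the guard against a non-positive denominator at very small `w`) are added assumptions rather than part of the published formula. The function name `estimate_1rm` is hypothetical.

```python
import math


def estimate_1rm(weight: float, reps: int) -> float:
    """Estimated 1RM = w * (1 + (r - 1)^0.85 / (-2.55 + 4.58 * ln(w))).

    `weight` must be in the unit the formula was fitted in, which the
    abstract does not specify (assumption left to the caller).
    """
    if reps < 1:
        raise ValueError("reps must be at least 1")
    if weight <= 0:
        raise ValueError("weight must be positive")
    denominator = -2.55 + 4.58 * math.log(weight)
    if denominator <= 0:
        # Added guard: the expression is not meaningful for very small weights.
        raise ValueError("weight too small for the formula to apply")
    return weight * (1.0 + (reps - 1) ** 0.85 / denominator)


if __name__ == "__main__":
    for w in (20, 60, 100, 140):
        est = estimate_1rm(w, reps=8)
        print(f"w={w:>3}, r=8 -> estimated 1RM {est:.1f} ({est / w:.3f} x w)")
```

With a single repetition the estimate equals the weight lifted. The per-repetition increment shrinks as `w` grows because the denominator increases with `ln(w)`.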

2603.17469 2026-03-19 stat.ME

Fast and scalable inference in hidden Markov models with Gaussian fields

Jan-Ole Fischer

Comments 37 pages, 13 figures

Details
English abstract

Hidden Markov models (HMMs) are powerful tools for analysing time series data that depend on discrete underlying but unobserved states. As such, they have gained prominence across numerous empirical disciplines, in particular ecology, medicine, and economics. However, the increasing complexity of empirical data is often accompanied by additional latent structure such as spatial effects, temporal trends, or measurement perturbations. Gaussian fields provide an attractive building block for incorporating such structured latent variation into HMMs. Fast inference methods for Gaussian fields have emerged through the stochastic partial differential equation (SPDE) approach. Due to their sparse representation, these integrate well with novel frequentist estimation methods for random-effects models via the use of automatic differentiation and the Laplace approximation. Scaling to high dimensions requires tools such as (R)TMB to exploit sparsity in the Hessian w.r.t. the latent variables - a property satisfied by SPDE fields but violated by the HMM likelihood. We present a modified forward algorithm to compute the HMM likelihood, constructing sparsity in the Hessian and consequently enabling fast and scalable inference. We demonstrate the practical feasibility and the usefulness through simulations and two case studies exploring the detection of stellar flares as well as modelling the movement of lions.

2603.17460 2026-03-19 stat.ME

Algorithms for Models with Intractable Normalizing Functions

Murali Haran, Bokgyeong Kang, Jaewoo Park

Details
English abstract

In this paper we discuss a well known computing problem -- inference for models with intractable normalizing functions. Models with intractable normalizing functions arise in a wide variety of areas, for instance network models, models for spatial data on lattices, spatial point processes, flexible models for count data and gene expression, and models for permutations. Simulating from these models for fixed parameter values is well studied, starting with work dating back seventy years to the origin of the Metropolis algorithm. On the other hand some of the most practical and theoretically justified algorithms for inference, particularly Bayesian inference, have only been developed within the past two decades. The most computationally efficient algorithms often do not have well developed theory and few if any approaches exist for assessing the quality of approximations based on them. For many problems even the best algorithms can be computationally infeasible. Hence, this is an exciting area of research with many open problems. We explain several key algorithms, providing connections and touching upon practical advantages and disadvantages of each, with some discussion of theoretical properties where they impact practice. We discuss an approach for assessing the accuracy of approximations produced by these algorithms; this diagnostic is particularly valuable for algorithm tuning. While our focus is largely on models with intractable normalizing functions, we also discuss algorithms that are more broadly applicable to models where the entire likelihood function is intractable; these methods are of course also applicable to intractable normalizing function problems.

2603.17327 2026-03-19 stat.ME math.ST stat.TH

Empirical Likelihood Inference for Sen and Sen--Shorrocks--Thon Indices

Sreelakshmi N, Saparya Suresh, Sudheesh K. Kattumannil

Details
English abstract

The Sen index and the Sen-Shorrocks-Thon (SST) index are widely used poverty indices. Reliable inference for these measures makes it possible to compare them effectively across populations of interest. It is therefore important to construct confidence intervals for the Sen and SST indices that provide better coverage probability and shorter interval length. Motivated by this, we discuss empirical likelihood (EL) and jackknife empirical likelihood (JEL) based inference for the Sen index. To derive a JEL-based confidence interval for the Sen and SST indices, we propose a new estimator for the Sen index using the theory of U-statistics and examine its properties. The large sample properties of the EL and JEL ratio statistics are studied. We also discuss EL and JEL-based inference for the SST index. The finite sample performance of the EL and JEL-based confidence intervals of both Sen and SST indices is evaluated through a Monte Carlo simulation study. Finally, we illustrate our methods using individual-level data from the Panel Study of Income Dynamics (PSID) survey from the US as well as Indian household level income data for different states sourced from the Consumer Pyramids Household Survey (CPHS).

2603.17318 2026-03-19 stat.AP physics.chem-ph physics.comp-ph

Analysis of molecular dynamics simulation data via statistical distances between covariance matrices

Yusuke Ono, Takumi Sato, Kenji Yasuoka, Linyu Peng

Comments 12 pages, 8 figures

Details
English abstract

Molecular dynamics (MD) simulations are powerful tools for elucidating the macroscopic physical properties of materials from microscopic atomic behaviors. However, the massive, high-dimensional datasets generated by MD simulations pose a significant challenge for analysis, necessitating efficient dimensionality reduction and feature extraction techniques. While existing methods such as principal component analysis and unsupervised learning have been utilized, issues regarding data efficiency and computational cost remain. In this study, we propose a statistical analysis framework focusing on the analysis of the particle data distributions through their covariance matrices, corresponding to the second-order moments of MD trajectory data. Discrepancies between system states are quantified using statistical distances between these covariance matrices. By applying dimensionality reduction to the resulting distance matrix, we extract lower-dimensional features that characterize the systems' dynamics. We validate the proposed method using Lennard-Jones (LJ) particle systems under different temperature conditions, as well as separate bulk systems of ice and liquid water. The results of LJ particles demonstrate an approximately linear correlation between the first principal component obtained through dimensionality reduction of the distance matrix and the diffusion coefficient. This suggests that global physical properties can be effectively inferred from local statistical information, such as covariance matrices, offering a data-efficient alternative for analyzing complex molecular systems. Furthermore, in the case of separate bulk systems of ice and liquid water, the method successfully distinguishes between the two phases, highlighting its potential for characterizing phase transitions and structural differences in molecular systems.

2603.17294 2026-03-19 stat.ME stat.AP

Bayesian Scalar-on-Tensor Quantile Regression for Longitudinal Data on Alzheimer's Disease

Rongke Lyu, Marina Vannucci, Suprateek Kundu

Details
English abstract

As a general and robust alternative to traditional mean regression models, quantile regression avoids the assumption of normally distributed errors, making it a versatile choice when modeling outcomes such as cognitive scores that typically have skewed distributions. Motivated by an application to Alzheimer's disease data where the aim is to explore how brain-behavior associations change over time, we propose a novel Bayesian tensor quantile regression for high-dimensional longitudinal imaging data. The proposed approach distinguishes between effects that are consistent across visits and patterns unique to each visit, contributing to the overall longitudinal trajectory. A low-rank decomposition is employed on the tensor coefficients which reduces dimensionality and preserves spatial configurations of the imaging voxels. We incorporate multiway shrinkage priors to model the visit-invariant tensor coefficients and variable selection priors on the tensor margins of the visit-specific effects. For posterior inference, we develop a computationally efficient Markov chain Monte Carlo sampling algorithm. Simulation studies reveal significant improvements in parameter estimation, feature selection, and prediction performance when compared with existing approaches. In the analysis of the Alzheimer's disease data, the flexibility of our modeling approach brings new insights as it provides a fuller picture of the relationship between the imaging voxels and the quantile distributions of the cognitive scores.

2603.17291 2026-03-19 math.PR math.FA math.ST stat.TH

On the structure of marginals in high dimensions

Daniel Bartl, Shahar Mendelson

Details
English abstract

Let $G, G_1,\dots,G_N$ be independent copies of a standard gaussian random vector in $\mathbb{R}^d$ and denote by $Γ= \sum_{i=1}^N \langle G_i,\cdot\rangle e_i$ the standard gaussian ensemble. We show that, for any set $A\subset S^{d-1}$, with exponentially high probability, \[ \sup_{x\in A} \frac{1}{N}\sum_{i=1}^N \big| (Γx)^\sharp_i - q_i\big| \le c \frac{ \mathbb{E} \sup_{x\in A} \langle G,x\rangle + \log^2N }{\sqrt N }. \] Here each $q_i$ is the $\frac{i}{N+1}$-quantile of the standard normal distribution and $(Γx)^\sharp $ denotes the monotone increasing rearrangement of the vector $Γx$. The estimate is sharp up to a possible logarithmic factor and significantly extends previously known bounds. Moreover, we show that similar estimates hold in much greater generality: after replacing the gaussian quantiles by the appropriate ones, the same phenomenon persists for a broad class of random vectors.
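A small numerical illustration of the quantity in the bound, restricted to the easiest case of a single fixed unit vector $x$ (so $A = \{x\}$): the coordinates of $Γx$ are then i.i.d. standard normal, and the sketch compares their increasing rearrangement with the quantiles $q_i$ at levels $i/(N+1)$. It only illustrates the single-direction case and says nothing about the uniform statement over a set $A$. The function name is hypothetical.

```python
import random
from statistics import NormalDist


def mean_quantile_deviation(n: int, seed: int = 0) -> float:
    """Average |sorted sample_i - q_i| for n i.i.d. N(0,1) draws.

    q_i is the i/(n+1)-quantile of the standard normal distribution.
    """
    rng = random.Random(seed)
    # For a fixed unit vector x, the coordinates of (Gamma x) are i.i.d. N(0,1).
    rearranged = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    standard_normal = NormalDist()
    quantiles = [standard_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return sum(abs(s - q) for s, q in zip(rearranged, quantiles)) / n


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, mean_quantile_deviation(n))
```

Running it for increasing `n` shows the average deviation shrinking, in line with the $1/\sqrt{N}$ scaling on the right-hand side of the estimate.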

2603.17278 2026-03-19 cs.LG stat.ME

Classifier Pooling for Modern Ordinal Classification

Noam H. Rotenberg, Andreia V. Faria, Brian Caffo

Details
English abstract

Ordinal data is widely prevalent in clinical and other domains, yet there is a lack of both modern, machine-learning based methods and publicly available software to address it. In this paper, we present a model-agnostic method of ordinal classification, which can apply any non-ordinal classification method in an ordinal fashion. We also provide an open-source implementation of these algorithms, in the form of a Python package. We apply these models on multiple real-world datasets to show their performance across domains. We show that they often outperform non-ordinal classification methods, especially when the number of datapoints is relatively small or when there are many classes of outcomes. This work, including the developed software, facilitates the use of modern, more powerful machine learning algorithms to handle ordinal data.

2603.17271 2026-03-19 stat.ME cs.LG

Wasserstein-type Gaussian Process Regressions for Input Measurement Uncertainty

Hengrui Luo, Xiaoye S. Li, Yang Liu, Marcus Noack, Ji Qiang, Mark D. Risser

Comments 22 pages

Details
English abstract

Gaussian process (GP) regression is widely used for uncertainty quantification, yet the standard formulation assumes noise-free covariates. When inputs are measured with error, this errors-in-variables (EIV) setting can lead to optimistically narrow posterior intervals and biased decisions. We study GP regression under input measurement uncertainty by representing each noisy input as a probability measure and defining covariance through Wasserstein distances between these measures. Building on this perspective, we instantiate a deterministic projected Wasserstein ARD (PWA) kernel whose one-dimensional components admit closed-form expressions and whose product structure yields a scalable, positive-definite kernel on distributions. Unlike latent-input GP models, PWA-based GPs (PWA-GPs) handle input noise without introducing unobserved covariates or Monte Carlo projections, making uncertainty quantification more transparent and robust.
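To make the idea of a product kernel built from one-dimensional Wasserstein distances concrete, here is a minimal sketch. It is not the authors' PWA kernel: it assumes each noisy input coordinate is modelled as a one-dimensional Gaussian with known mean and standard deviation, uses the well-known closed form $W_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$ for two such Gaussians, and multiplies squared-exponential factors with one length scale per dimension (ARD style). All names are hypothetical.

```python
import math
from typing import Sequence, Tuple

GaussianCoord = Tuple[float, float]  # (mean, standard deviation) of one input coordinate


def w2_squared_gaussian_1d(m1: float, s1: float, m2: float, s2: float) -> float:
    """Squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2


def product_wasserstein_kernel(
    x1: Sequence[GaussianCoord],
    x2: Sequence[GaussianCoord],
    lengthscales: Sequence[float],
) -> float:
    """Product over dimensions of exp(-W2^2 / (2 * lengthscale^2))."""
    if not (len(x1) == len(x2) == len(lengthscales)):
        raise ValueError("inputs and lengthscales must have the same dimension")
    log_k = 0.0
    for (m1, s1), (m2, s2), ell in zip(x1, x2, lengthscales):
        log_k -= w2_squared_gaussian_1d(m1, s1, m2, s2) / (2.0 * ell**2)
    return math.exp(log_k)


if __name__ == "__main__":
    a = [(0.0, 0.1), (1.0, 0.3)]  # two-dimensional noisy input
    b = [(0.5, 0.2), (1.2, 0.3)]
    print(product_wasserstein_kernel(a, b, lengthscales=[1.0, 2.0]))
```

Under the Gaussian assumption each factor is an ordinary squared-exponential kernel on the pair (mean, standard deviation), so the product is positive definite. A larger measurement standard deviation moves an input away from a precisely measured one even when the means coincide.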

2603.17243 2026-03-19 stat.ME

Transmuted logistic-exponential distribution - some new properties, estimation methods and application with infectious disease mortality data

Isqeel Ogunsola, Abosede Akintunde, Kehinde Yusuff, Basirat Adetona, Faheez Abdulrasaq

Details
English abstract

Recently, a New Transmuted Logistic-exponential (NTLE) distribution was introduced and studied as an extension of the Logistic-Exponential Distribution (LED) with wider applicability in lifetime modelling. However, the maximum likelihood estimates (MLE) of NTLE are not in closed form, and the consistency of the estimates was not examined. Furthermore, some other important properties of NTLE, namely the Shannon entropy, Rényi entropy, stochastic ordering, mode, stress-strength reliability measure, residual life functions (mean and reverse), incomplete moments, Bonferroni and Lorenz curves are yet to be derived. Motivated by this, we derived and studied these important properties and evaluated the performance of ten estimation methods (Maximum Likelihood, Moments, Least Squares, Weighted Least Squares, Maximum product of Spacings, Anderson-Darling, Cramér-von Mises, percentile estimation, and Maximum Goodness-of-Fit methods) for NTLE parameters via Monte Carlo simulation using bias, mean square error, and root mean square error as evaluation criteria. Real-life infectious disease mortality data fitted to the distributions showed that NTLE has a better fit compared to its base distributions (Exponential and Logistic-Exponential). This finding contributes valuable insights for researchers and practitioners when selecting the appropriate estimation methods, especially for NTLE and some similar distributions in non-closed form.

2603.17226 2026-03-19 stat.ME

Difference-Based High-Dimensional Long-Run Covariance Matrix Estimation for Mean-shift Time Series

Yanhong Liu, Fengyi Song, Long Feng

Details
English abstract

We consider estimation of high-dimensional long-run covariance matrices for time series with nonconstant means, a setting in which conventional estimators can be severely biased. To address this difficulty, we propose a difference-based initial estimator that is robust to a broad class of mean variations, and combine it with hard thresholding, soft thresholding, and tapering to obtain sparse long-run covariance estimators for high-dimensional data. We derive convergence rates for the resulting estimators under general temporal dependence and time-varying mean structures, showing explicitly how the rates depend on covariance sparsity, mean variation, dimension, and sample size. Numerical experiments show that the proposed methods perform favorably in high dimensions, especially when the mean evolves over time.

2603.17160 2026-03-19 stat.ML cs.LG math.ST stat.TH

Self-Regularized Learning Methods

Max Schölpple, Liu Fanghui, Ingo Steinwart

Details
English abstract

We introduce a general framework for analyzing learning algorithms based on the notion of self-regularization, which captures implicit complexity control without requiring explicit regularization. This is motivated by previous observations that many algorithms, such as gradient-descent based learning, exhibit implicit regularization. In a nutshell, for a self-regularized algorithm the complexity of the predictor is inherently controlled by that of the simplest comparator achieving the same empirical risk. This framework is sufficiently rich to cover both classical regularized empirical risk minimization and gradient descent. Building on self-regularization, we provide a thorough statistical analysis of such algorithms including minmax-optimal rates, where it suffices to show that the algorithm is self-regularized -- all further requirements stem from the learning problem itself. Finally, we discuss the problem of data-dependent hyperparameter selection, providing a general result which yields minmax-optimal rates up to a double logarithmic factor and covers data-driven early stopping for RKHS-based gradient descent.

2603.17151 2026-03-19 q-fin.CP stat.ML

Shallow Representation of Option Implied Information

Jimin Lin

Details
English abstract

Option prices encode the market's collective outlook through implied density and implied volatility. An explicit link between implied density and implied volatility translates the risk-neutrality of the former into conditions on the latter to rule out static arbitrage. Despite earlier recognition of their parity, the two had been studied in isolation for decades until the recent demand in implied volatility modeling rejuvenated such parity. This paper provides a systematic approach to build neural representations of option implied information. As a preliminary, we first revisit the explicit link between implied density and implied volatility through an alternative and minimalist lens, where implied volatility is viewed not as volatility but as a pointwise corrector mapping the Black-Scholes quasi-density into the implied risk-neutral density. Building on this perspective, we propose the neural representation that incorporates arbitrage constraints through the differentiable corrector. With an additive logistic model as the synthetic benchmark, extensive experiments reveal that deeper or wider network structures do not necessarily improve the model performance due to the nonlinearity of both arbitrage constraints and neural derivatives. By contrast, a shallow feedforward network with a single hidden layer and a specific activation effectively approximates implied density and implied volatility.

2603.17142 2026-03-19 math.ST stat.TH

Identifiability and Estimation in Continuous Lyapunov Models

Cecilie Olesen Recke, Niels Richard Hansen

Comments 41 pages

Details
English abstract

Cross-sectional observations from a dynamical system can be modeled via steady-state distributions of Markov processes. The major challenge is then to determine whether the process parameters can be identified and estimated from the steady-state distributions. We study this problem for continuous Lyapunov models that arise as steady-state distributions of the solution to a multivariate stochastic differential equation, whose linear drift matrix is parametrized by a directed graph. We derive equations for the cumulant tensors of any order for this distribution, which generalize the well-known covariance Lyapunov equation. Under a non-Gaussianity assumption we prove generic identifiability of the drift matrix for any connected graph using the equations for the higher-order cumulants. Based on the identifiability result, we propose a new semiparametric estimator of the drift matrix, and we derive its asymptotic distribution. A simulation study demonstrates the asymptotic validity of the estimator but shows that it is only accurate for relatively large sample sizes, illustrating the hardness of the unconstrained estimation problem.

2603.17139 2026-03-19 cs.LG stat.ML

Contextual Preference Distribution Learning

Benjamin Hudson, Laurent Charlin, Emma Frejinger

Comments In CPAIOR 2026 (23rd International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research)

Details
English abstract

Decision-making problems often feature uncertainty stemming from heterogeneous and context-dependent human preferences. To address this, we propose a sequential learning-and-optimization pipeline to learn preference distributions and leverage them to solve downstream problems, for example risk-averse formulations. We focus on human choice settings that can be formulated as (integer) linear programs. In such settings, existing inverse optimization and choice modelling methods infer preferences from observed choices but typically produce point estimates or fail to capture contextual shifts, making them unsuitable for risk-averse decision-making. Using a bounded-variance score function gradient estimator, we train a predictive model mapping contextual features to a rich class of parameterizable distributions. This approach yields a maximum likelihood estimate. The model generates scenarios for unseen contexts in the subsequent optimization phase. In a synthetic ridesharing environment, our approach reduces average post-decision surprise by up to 114$\times$ compared to a risk-neutral approach with perfect predictions and up to 25$\times$ compared to leading risk-averse baselines.

2603.17106 2026-03-19 stat.AP

How Proxy Race Distorts Regression-Based Fairness Audits

Xi Xin, Giles Hooker, Fei Huang

Details
English abstract

Proxy-based race inference is increasingly used to conduct fairness assessments when protected-class data are unavailable or legally restricted -- most prominently in U.S. fair-lending enforcement, and now explicitly contemplated in emerging insurance regulation, including Colorado's draft SB21-169 testing framework and New York's Insurance Circular Letter No. 7. Despite this growing regulatory relevance, little is known about how standard regression-based discrimination analyses behave when race is measured with error through proxies such as Bayesian Improved Surname Geocoding (BISG) or Bayesian Improved First Name and Surname Geocoding (BIFSG). This paper studies the consequences of using proxy-imputed race as a categorical regressor in regression-based fairness assessments. Treating proxy race as a categorical covariate subject to misclassification, we show that proxy-based coefficients become weighted mixtures of true group effects, systematically shrinking estimated disparities toward the majority group -- even when overall classification accuracy is high. Empirically, using a linked North Carolina voter-insurance dataset with self-reported race and ZIP-level auto insurance premiums, we demonstrate two mechanisms through which proxy race distorts inference: (i) the intrinsic mixing of group effects implied by misclassification, and (ii) structured errors that vary with ZIP-level racial composition and socioeconomic conditions and remain correlated with pricing residuals after controls. As a result, regression-based disparity estimates can be attenuated or amplified relative to analogous analyses based on self-reported race. Our findings caution against treating proxy race as a plug-in substitute in regulatory testing and highlight design implications for proxy-based audit frameworks in insurance and other high-stakes domains.

2603.17086 2026-03-19 stat.ME

Topological inference on brain networks with application to lesion symptom mapping

Yuan Wang, Jian Yin, Nicholas Riccardi, Drik-Bart Den Ouden, Julius Fridriksson, Rutvik H. Desai

Comments arXiv admin note: substantial text overlap with arXiv:2311.01625

Details
English abstract

Persistent homology (PH) characterizes the shape of brain networks through persistence features. Group comparison of persistence features from brain networks can be challenging as they are inherently heterogeneous. A recent scale-space representation of persistence diagrams (PDs) through heat diffusion reparameterizes them using a finite number of Fourier coefficients with respect to the Laplace--Beltrami (LB) eigenfunction expansion of the domain, providing a powerful vectorized algebraic representation for group comparisons. In this study, we develop a transposition-based permutation test for comparing multiple groups of PDs using heat-diffusion estimates. We evaluate the empirical performance of the spectral transposition test in capturing within- and between-group similarity and dissimilarity under varying levels of topological noise and cycle location variability. In application, we propose a topological lesion symptom mapping (TLSM) method based on the proposed framework. The method is applied to resting-state functional brain networks of individuals with post-stroke aphasia to identify characteristic cycles associated with varying levels of speech-language impairment.

2603.17066 2026-03-19 stat.ME

Improving RCT-Based CATE Estimation Under Covariate Mismatch via Double Calibration

Samhita Pal, Jared D. Huling, Amir Asiaee

Abstract

We develop estimators that improve the precision of heterogeneous treatment effect estimates by borrowing information from observational studies when the covariates available in each data source do not perfectly match. Standard data-borrowing methods often assume perfectly matched covariates. We propose MR-OSCAR, an RCT-calibrated, two-stage estimation approach that first predicts the trial-missing variables using the observational data via imputation and then calibrates observational outcome predictions to the randomized trial, preserving the causal contrast. This contrasts with results for generalization, where imputation does not improve performance. Our theory gives finite-sample guarantees with a transparent error decomposition including an imputation error that shrinks as the observational mapping becomes more predictable. Simulations show that imputation almost always outperforms naively using only the shared covariates and clarify when borrowing helps (strong predictability of the missing block, moderate trial size) and when it does not (poor predictability or dominant trial-only moderators). We motivate the approach with the Greenlight Plus trial on early childhood obesity and outline a forthcoming EHR analysis at Vanderbilt, highlighting the use of our method in common scenarios where data do not perfectly align.

2603.17031 2026-03-19 stat.ME

Minimizing Type 2 Errors in an Experiment-Rich Regime via Optimal Resource Allocation

Fenghua Yang, Dae Woong Ham, Stefanus Jasin

Abstract

Randomized experiments (often known as "A/B tests") are widely used to evaluate product and service innovations. We study how to allocate limited experimentation resources across M concurrent experiments in an experiment-rich regime. Existing work on allocation has predominantly focused on minimizing the worst-case mean squared error (MSE) of estimated treatment effects, which favors experiments with larger (and typically unknown) outcome variance. While appropriate for controlling estimation accuracy, this objective does not directly capture a common managerial priority in screening stages: detecting practically meaningful treatment effects with high probability. Motivated by this, we consider the objective of minimizing the worst-case Type II error across all experiments. When the standard deviations are known, we characterize the power-optimal allocation and show that MSE-based allocations can be highly inefficient for detection, even though the two objectives align asymptotically. When the standard deviations are unknown and must be learned from pilot data, we show that a naive plug-in approach, treating pilot standard deviations as truth, can suffer substantial power loss. We propose inflating pilot estimates via correction factors and develop three optimization-based frameworks for selecting them, each reflecting a different risk criterion with distinct managerial implications. Although the resulting stochastic programs are computationally challenging at scale, we derive tractable surrogate reformulations inspired by robust optimization and establish favorable theoretical properties. We further propose Surrogate-S, a fully data-dependent and implementable procedure that computes correction factors using only pilot variance estimates and achieves near-oracle performance in numerical experiments.
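
The difference between the MSE-based and power-based objectives can be seen in a toy calculation. The sketch below is not the paper's formulation: it assumes known standard deviations, a one-sided z-test per experiment, a known minimum effect of interest `delta[i]` for each experiment, and continuous sample sizes. Under those assumptions, minimizing the worst-case MSE gives n_i proportional to sigma_i^2, while minimizing the worst-case Type II error equalizes delta_i * sqrt(n_i) / sigma_i, which gives n_i proportional to (sigma_i / delta_i)^2.

```python
from statistics import NormalDist

def allocate(weights, total):
    """Split `total` samples proportionally to `weights` (continuous relaxation)."""
    s = sum(weights)
    return [total * w / s for w in weights]

def worst_type2(n, sigma, delta, alpha=0.05):
    """Largest Type II error across experiments for a one-sided z-test."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha)
    return max(
        1 - nd.cdf(d * ni ** 0.5 / s - z)
        for ni, s, d in zip(n, sigma, delta)
    )

# Hypothetical inputs: two experiments sharing a budget of N samples.
sigma = [1.0, 3.0]   # known outcome standard deviations
delta = [0.1, 0.6]   # minimum effects worth detecting
N = 1000

n_mse = allocate([s ** 2 for s in sigma], N)                      # minimax-MSE rule
n_pow = allocate([(s / d) ** 2 for s, d in zip(sigma, delta)], N)  # minimax-Type-II rule

print("MSE-based allocation:  ", n_mse, worst_type2(n_mse, sigma, delta))
print("power-based allocation:", n_pow, worst_type2(n_pow, sigma, delta))
```

In this example the MSE rule sends most samples to the high-variance experiment even though its effect is easy to detect, which leaves the low-variance experiment underpowered.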

2603.16950 2026-03-19 stat.ML cs.LG stat.ME

Kriging via variably scaled kernels

Gianluca Audone, Francesco Marchetti, Emma Perracchione, Milvia Rossini

Abstract

Classical Gaussian processes and Kriging models are commonly based on stationary kernels, whereby correlations between observations depend exclusively on the relative distance between scattered data. While this assumption ensures analytical tractability, it limits the ability of Gaussian processes to represent heterogeneous correlation structures. In this work, we investigate variably scaled kernels as an effective tool for constructing non-stationary Gaussian processes by explicitly modifying the correlation structure of the data. Through a scaling function, variably scaled kernels alter the correlations between data and enable the modeling of targets exhibiting abrupt changes or discontinuities. We analyse the resulting predictive uncertainty via the variably scaled kernel power function and clarify the relationship between variably scaled kernel-based constructions and classical non-stationary kernels. Numerical experiments demonstrate that variably scaled kernel-based Gaussian processes yield improved reconstruction accuracy and provide uncertainty estimates that reflect the underlying structure of the data.

2603.16937 2026-03-19 cs.LG stat.AP stat.ME

Integrating Explainable Machine Learning and Mixed-Integer Optimization for Personalized Sleep Quality Intervention

Mahfuz Ahmed Anik, Mohsin Mahmud Topu, Azmine Toushik Wasi, Md Isfar Khan, MD Manjurul Ahsan

Comments 34 Pages. 7 Tables. 6 Figures

Abstract

Sleep quality is influenced by a complex interplay of behavioral, environmental, and psychosocial factors, yet most computational studies focus mainly on predictive risk identification rather than actionable intervention design. Although machine learning models can accurately predict subjective sleep outcomes, they rarely translate predictive insights into practical intervention strategies. To address this gap, we propose a personalized predictive-prescriptive framework that integrates interpretable machine learning with mixed-integer optimization. A supervised classifier trained on survey data predicts sleep quality, while SHAP-based feature attribution quantifies the influence of modifiable factors. These importance measures are incorporated into a mixed-integer optimization model that identifies minimal and feasible behavioral adjustments, while modelling resistance to change through a penalty mechanism. The framework achieves strong predictive performance, with a test F1-score of 0.9544 and an accuracy of 0.9366. Sensitivity and Pareto analyses reveal a clear trade-off between expected improvement and intervention intensity, with diminishing returns as additional changes are introduced. At the individual level, the model generates concise recommendations, often suggesting one or two high-impact behavioral adjustments and sometimes recommending no change when expected gains are minimal. By integrating prediction, explanation, and constrained optimization, this framework demonstrates how data-driven insights can be translated into structured and personalized decision support for sleep improvement.

2603.16896 2026-03-19 stat.AP

Model Selection via Focused Information Criteria for Complex Data in Ecology and Evolution

Gerda Claeskens, Céline Cunen, Nils Lid Hjort

Comments 24 pages, 2 figures; Statistical Research Report, Department of Mathematics, University of Oslo, September 2019, arXiv'd March 2026; published, in essentially this form, in Frontiers in Ecology and Evolution, 2019, at this url: https://www.frontiersin.org/journals/ecology-and-evolution/articles/10.3389/fevo.2019.00415/full

Abstract

Datasets encountered when examining deeper issues in ecology and evolution are often complex. This calls for careful strategies for model building, model selection, and model averaging. Our paper aims at motivating, exhibiting, and further developing focused model selection criteria. In contexts involving precisely formulated interest parameters, these versions of FIC, the focused information criterion, typically lead to better final precision for the most salient estimates, confidence intervals, etc. as compared to estimators obtained from other selection methods. Our methods are illustrated with real case studies in ecology; one related to bird species abundance and another to the decline in body condition for the Antarctic minke whale.

2603.16884 2026-03-19 q-bio.NC math.PR math.ST stat.TH

Macro-Micro Inference: Robust Synaptic Classification via Spike-Triggered Extrapolation

Emilio De Santis

Comments 26 pages, 5 figures

Abstract

This work introduces a framework for reconstructing the interaction graph of neuronal networks modeled as multivariate point processes. The methodology performs bivariate inference, identifying synaptic links exclusively from the spike trains of a pair of neurons, without requiring observations of the remaining network activity. We propose a Macro-Micro Extrapolation algorithm to address data sparsity at the micro-scale, inferring synaptic interactions in the limit $Δ\to 0^+$. A key contribution is the Spike-Triggered Estimator, which leverages the local reset property of Galves-Löcherbach dynamics to decouple local synaptic jumps from higher-order network contributions, significantly reducing estimation variance and eliminating spurious dependencies on baseline firing intensities. By employing an adaptive hybrid logic that switches between sample averaging and our novel Pyramid Extrapolation, we ensure robust classification of excitatory, inhibitory, and null connections even in low signal-to-noise regimes. The framework's scalability and precision are validated by numerical results on dense cliques and structured layered networks, achieving perfect classification accuracy across diverse topological motifs.

2603.15674 2026-03-19 cs.AI cs.IT cs.LG math.IT stat.ML

Theoretical Foundations of Latent Posterior Factors: Formal Guarantees for Multi-Evidence Reasoning

Aliyu Agboola Alege

Comments 30 pages, 8 figures, 10 tables. Theoretical characterization of the Latent Posterior Factors (LPF) framework for multi-evidence probabilistic reasoning, with formal guarantees and empirical validation

Abstract

We present a complete theoretical characterization of Latent Posterior Factors (LPF), a principled framework for aggregating multiple heterogeneous evidence items in probabilistic prediction tasks. Multi-evidence reasoning arises pervasively in high-stakes domains including healthcare diagnosis, financial risk assessment, legal case analysis, and regulatory compliance, yet existing approaches either lack formal guarantees or fail to handle multi-evidence scenarios architecturally. LPF encodes each evidence item into a Gaussian latent posterior via a variational autoencoder, converts posteriors to soft factors through Monte Carlo marginalization, and aggregates factors via exact Sum-Product Network inference (LPF-SPN) or a learned neural aggregator (LPF-Learned). We prove seven formal guarantees spanning the key desiderata for trustworthy AI: Calibration Preservation (ECE <= epsilon + C/sqrt(K_eff)); Monte Carlo Error decaying as O(1/sqrt(M)); a non-vacuous PAC-Bayes bound with train-test gap of 0.0085 at N=4200; operation within 1.12x of the information-theoretic lower bound; graceful degradation as O(epsilon*delta*sqrt(K)) under corruption, maintaining 88% performance with half of evidence adversarially replaced; O(1/sqrt(K)) calibration decay with R^2=0.849; and exact epistemic-aleatoric uncertainty decomposition with error below 0.002%. All theorems are empirically validated on controlled datasets spanning up to 4,200 training examples. Our theoretical framework establishes LPF as a foundation for trustworthy multi-evidence AI in safety-critical applications.

2603.14801 2026-03-19 stat.AP stat.CO

Genetic Algorithms in Regression

Mo Li, QiQi Lu, Robert Lund, Xueheng Shi

Abstract

Many statistical problems involve optimization over a discrete parameter space having an unknown dimension. In such settings, gradient-based methods often fail due to the non-differentiability of the objective function or a non-convex or massive search space with an objective function having many local maxima/minima. This paper presents GAReg, a unified genetic algorithm package that handles discrete optimization regression problems, which works well when standard algorithms are unjustified. GAReg provides a compact chromosome representation supporting optimal knot placement for regression splines, best-subset regression variable selection, and related problems. The package allows for uniform initialization, constraint-preserving crossover and mutation, steady-state replacement, and an optional island-model parallelization. GAReg efficiently searches high-dimensional model spaces, providing near-optimal solutions in settings where exhaustive enumeration or integer or dynamic programming approaches are infeasible.

2603.14441 2026-03-19 stat.ML cs.LG

AR-Flow VAE: A Structured Autoregressive Flow Prior Variational Autoencoder for Unsupervised Blind Source Separation

Yuan-Hao Wei, Fu-Hao Deng, Lin-Yong Cui, Yan-Jie Sun

Abstract

Blind source separation (BSS) seeks to recover latent source signals from observed mixtures. Variational autoencoders (VAEs) offer a natural perspective for this problem: the latent variables can be interpreted as source components, the encoder can be viewed as a demixing mapping from observations to sources, and the decoder can be regarded as a remixing process from inferred sources back to observations. In this work, we propose AR-Flow VAE, a novel VAE-based framework for BSS in which each latent source is endowed with a parameter-adaptive autoregressive flow prior. This prior significantly enhances the flexibility of latent source modeling, enabling the framework to capture complex non-Gaussian behaviors and structured dependencies, such as temporal correlations, that are difficult to represent with conventional priors. In addition, the structured prior design assigns distinct priors to different latent dimensions, thereby encouraging the latent components to separate into different source signals under heterogeneous prior constraints. Experimental results validate the effectiveness of the proposed architecture for blind source separation. More importantly, this work provides a foundation for future investigations into the identifiability and interpretability of AR-Flow VAE.

2603.13681 2026-03-19 stat.ME math.ST stat.TH

Generalized projection tests for function-valued parameters with applications to testing structural causal assumptions

Rui Wang, Albert Osom, Bo Zhang

Abstract

Structural assumptions are central to the causal inference literature. In practice, it is often crucial to assess their validity or to test implications that follow from them. In many settings, such tests can be framed as evaluating whether a function-valued parameter equals zero. In this paper, we propose a class of generalized projection tests based on series estimators for function-valued parameters. We establish conditions under which the proposed tests are valid and illustrate their applicability through examples from the data fusion and instrumental variables literature. Our approach accommodates flexible machine learning methods for estimating nuisance parameters. In contrast to many existing approaches, the limiting distribution of the proposed test statistics is straightforward to compute under the null hypothesis. We apply our method to test the equality of conditional COVID-19 risk across vaccine arms in the COVID-19 Variant Immunologic Landscape (COVAIL) trial.

2603.10485 2026-03-19 stat.ML cs.LG

Dual Space Preconditioning for Gradient Descent in the Overparameterized Regime

Reza Ghane, Danil Akhtiamov, Babak Hassibi

Abstract

In this work we study the convergence properties of the Dual Space Preconditioned Gradient Descent, encompassing optimizers such as Normalized Gradient Descent, Gradient Clipping and Adam. We consider preconditioners of the form $\nabla K$, where $K: \mathbb{R}^p \to \mathbb{R}$ is convex and assume that the latter is applied to train an over-parameterized linear model with loss of the form $\ell({X} {W} - {Y})$, for weights ${W} \in \mathbb{R}^{d \times k}$, labels ${Y} \in \mathbb{R}^{n \times k}$ and data ${X} \in \mathbb{R}^{n \times d}$. Under the aforementioned assumptions, we prove that the iterates of the preconditioned gradient descent always converge to a point ${W}_{\infty} \in \mathbb{R}^{d \times k}$ satisfying ${X}{W}_{\infty} = {Y}$. Our proof techniques are of independent interest as we introduce a novel version of the Bregman Divergence with accompanying identities that allow us to establish convergence. We also study the implicit bias of Dual Space Preconditioned Gradient Descent. First, we demonstrate empirically that, for general $K(\cdot)$, ${W}_\infty$ depends on the chosen learning rate, hindering a precise characterization of the implicit bias. Then, for preconditioners of the form $K({G}) = h(\|{G}\|_F)$, known as \textit{isotropic preconditioners}, we show that ${W}_\infty$ minimizes $\|{W}_\infty - {W}_0\|_F^2$ subject to ${X}{W}_\infty = {Y}$, where ${W}_0$ is the initialization. Denoting the convergence point of GD initialized at ${W}_0$ by ${W}_{\text{GD}, \infty}$, we thus note ${W}_{\infty} = {W}_{\text{GD}, \infty}$ for isotropic preconditioners. Finally, we show that a similar fact holds for general preconditioners up to a multiplicative constant, namely, $\|{W}_0 - {W}_{\infty}\|_F \le c \|{W}_0 - {W}_{\text{GD}, \infty}\|_F$ for a constant $c>0$.
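
The isotropic result can be motivated by a short argument. The following is a sketch and not the paper's proof: it assumes the update takes the form $W_{t+1} = W_t - \eta\,\nabla K(\nabla_W L(W_t))$ with $L(W) = \ell(XW - Y)$ and step size $\eta$ (the symbols $L$, $\eta$, $G_t$, $c_t$, and $A$ are introduced here for illustration), and it takes the convergence to an interpolating point stated in the abstract as given.

```latex
% Gradient of the loss and of an isotropic preconditioner (for G \neq 0):
\nabla_W L(W) = X^\top \nabla \ell(XW - Y), \qquad
\nabla K(G) = \frac{h'(\|G\|_F)}{\|G\|_F}\, G .

% Each step is therefore a scalar multiple of the plain gradient:
W_{t+1} - W_t = -\eta\, c_t\, X^\top \nabla \ell(X W_t - Y), \qquad
c_t = \frac{h'(\|G_t\|_F)}{\|G_t\|_F}, \quad G_t = \nabla_W L(W_t).

% So every iterate, and hence the limit, stays in W_0 + \mathrm{range}(X^\top):
W_\infty - W_0 = X^\top A \quad \text{for some } A \in \mathbb{R}^{n \times k}.

% For any W with XW = Y, using X W_\infty = Y:
\langle W - W_\infty,\; W_\infty - W_0 \rangle_F
  = \langle X (W - W_\infty),\; A \rangle_F = 0,

% which gives the Pythagorean identity
\|W - W_0\|_F^2 = \|W - W_\infty\|_F^2 + \|W_\infty - W_0\|_F^2
  \;\ge\; \|W_\infty - W_0\|_F^2 .
```

Plain gradient descent satisfies the same invariance with $c_t = 1$, which is consistent with the statement ${W}_{\infty} = {W}_{\text{GD}, \infty}$. For a general $K$ the step is no longer a scalar multiple of the gradient, but each step still lies in the range of $X^\top$ only if $\nabla K$ preserves it, which is not guaranteed.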

2603.10272 2026-03-19 stat.ME econ.EM math.ST q-fin.ST stat.TH

An operator-level ARCH Model

Alexander Aue, Sebastian Kühnert, Gregory Rice, Jeremy VanderDoes

Comments 48 pages, 8 Figures, 2 Tables

Abstract

AutoRegressive Conditional Heteroscedasticity (ARCH) models are standard for modeling time series exhibiting volatility, with a rich literature in univariate and multivariate settings. In recent years, these models have been extended to function spaces. However, functional ARCH and generalized ARCH (GARCH) processes established in the literature have thus far been restricted to model "pointwise" variances. In this paper, we propose a new ARCH framework for data residing in general separable Hilbert spaces that accounts for the full evolution of the conditional covariance operator. We define a general operator-level ARCH model. For a simplified Constant Conditional Correlation version of the model, we establish conditions under which such models admit strictly and weakly stationary solutions, finite moments, and weak serial dependence. Additionally, we derive consistent Yule--Walker-type estimators of the infinite-dimensional model parameters. The practical relevance of the model is illustrated through simulations and a data application to high-frequency cumulative intraday returns.

2603.00269 2026-03-19 stat.ME

Robust Regression with Student's T: The Role of Degrees of Freedom

Amanda Ng, Shangkai Zhu, Archer Gong Zhang, Nancy Reid

Abstract

Linear regression estimators are known to be sensitive to outliers, and one alternative to obtain a robust and efficient estimator of the regression parameter is to model the error with Student's $t$ distribution. In this article, we compare estimators of the degrees of freedom parameter in the $t$ distribution using frequentist and Bayesian methods, and then study properties of the corresponding estimated regression coefficient. We also include the comparison with some recommended approaches in the literature, including fixing the degrees of freedom and robust regression using the Huber loss. Our extensive simulations on both synthetic and real data demonstrate that estimating the degrees of freedom via the adjusted profile log-likelihood approach yields regression coefficient estimators with high accuracy, performing comparably to the maximum likelihood estimators where the degrees of freedom are fixed at their true values. These findings provide a detailed synthesis of $t$-based robust regression and underscore a key insight: the proper calibration of the degrees of freedom is as crucial as the choice of the robust distribution itself for achieving optimal performance. The R package that implements our method is available at https://github.com/amanda-ng518/RobustTRegression.

2602.19414 2026-03-19 cs.LG cs.SY eess.SY stat.ML

Federated Causal Representation Learning in State-Space Systems for Decentralized Counterfactual Reasoning

Nazal Mohamed, Ayush Mohanty, Nagi Gebraeel

Comments Manuscript under review

Abstract

Networks of interdependent industrial assets (clients) are tightly coupled through physical processes and control inputs, raising a key question: how would the output of one client change if another client were operated differently? This is difficult to answer because client-specific data are high-dimensional and private, making centralization of raw data infeasible. Each client also maintains proprietary local models that cannot be modified. We propose a federated framework for causal representation learning in state-space systems that captures interdependencies among clients under these constraints. Each client maps high-dimensional observations into low-dimensional latent states that disentangle intrinsic dynamics from control-driven influences. A central server estimates the global state-transition and control structure. This enables decentralized counterfactual reasoning where clients predict how outputs would change under alternative control inputs at other clients while only exchanging compact latent states. We prove convergence to a centralized oracle and provide privacy guarantees. Our experiments demonstrate scalability and accurate cross-client counterfactual inference on synthetic and real-world industrial control system datasets.

2602.17918 2026-03-19 cs.LG cs.DS stat.ML

Distribution-Free Sequential Prediction with Abstentions

Jialin Yu, Moïse Blanchard

Comments 40 pages, 2 figures

Abstract

We study a sequential prediction problem in which an adversary is allowed to inject arbitrarily many adversarial instances in a stream of i.i.d. instances, but at each round, the learner may also abstain from making a prediction without incurring any penalty if the instance was indeed corrupted. This semi-adversarial setting naturally sits between the classical stochastic case with i.i.d. instances for which function classes with finite VC dimension are learnable; and the adversarial case with arbitrary instances, known to be significantly more restrictive. For this problem, Goel et al. (2023) showed that, if the learner knows the distribution $μ$ of clean samples in advance, learning can be achieved for all VC classes without restrictions on adversary corruptions. This is, however, a strong assumption in both theory and practice: a natural question is whether similar learning guarantees can be achieved without prior distributional knowledge, as is standard in classical learning frameworks (e.g., PAC learning or asymptotic consistency) and other non-i.i.d. models (e.g., smoothed online learning). We therefore focus on the distribution-free setting where $μ$ is unknown and propose an algorithm AbstainBoost based on a boosting procedure of weak learners, which guarantees sublinear error for general VC classes in distribution-free abstention learning for oblivious adversaries. These algorithms also enjoy similar guarantees for adaptive adversaries, for structured function classes including linear classifiers. These results are complemented with corresponding lower bounds, which reveal an interesting polynomial trade-off between misclassification error and number of erroneous abstentions.

2602.15562 2026-03-19 stat.OT

Either a Confidence Interval Covers, or It Doesn't (Or Does It?): A Model-Based View of Ex-Post Coverage Probability

Scott Lee

Abstract

In Neyman's original formulation, a 1-alpha confidence interval procedure is justified by its long-run coverage properties, and a single realized interval is to be described only by the slogan that it either covers the parameter or it does not. On this view, post-data probability statements about the coverage of an individual interval are taken to be conceptually out of bounds. In this paper, I present two kinds of arguments against treating that "either-or" reading as the only legitimate interpretation of confidence. The first is informal, via a set of thought experiments in which the same joint probability model is used to compute both forward-looking and backward-looking probabilities for occurred-but-unobserved events. The second is more formal, recasting the standard confidence-interval construction in terms of infinite sequences of trials and their associated 0/1 coverage indicators. In that representation, the design-level coverage probability 1-alpha and the degenerate conditional probabilities given the full data appear simply as different conditioning levels of the same model. I argue that a strict behavioristic reading that privileges only the latter is in tension with the very mathematical machinery used to define long-run error rates. I then sketch an alternative view of confidence as a predictive probability (or forecast) about the coverage indicator, together with a simple normative rule for when intermediate probabilities for single coverage events should be allowed. Keywords: confidence intervals; coverage probability; frequentist inference; single-case probability; predictive probability; Neyman. Disclaimer: The findings and conclusions in this report are those of the author and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
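
The two conditioning levels discussed in the abstract can be made concrete with a simulation. The sketch below is not from the paper: it assumes a normal mean with known standard deviation and uses made-up parameter values. Each realized interval yields a 0/1 coverage indicator once the data and the true parameter are both known, while the long-run average of those indicators approaches the design-level probability 1-alpha.

```python
import random
from statistics import NormalDist

def coverage_indicators(mu, sigma, n, trials, alpha=0.05, seed=0):
    """Return one 0/1 coverage indicator per simulated z-interval for the mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / n ** 0.5
    hits = []
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Given the data and the true mu, coverage is a settled yes/no fact.
        hits.append(1 if abs(xbar - mu) <= half_width else 0)
    return hits

# Hypothetical settings for illustration.
hits = coverage_indicators(mu=2.0, sigma=1.0, n=25, trials=4000)
long_run = sum(hits) / len(hits)

print("first ten indicators:", hits[:10])
print("long-run coverage:   ", long_run)
```

The simulation only shows that both descriptions come from the same model. Whether an analyst who has not observed the indicator may assign it probability 1-alpha is the interpretive question the paper addresses.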

2602.02706 2026-03-19 physics.space-ph stat.AP

Ionospheric Observations from the ISS: Overcoming Noise Challenges in Signal Extraction

Rachel Ulrich, Kelly R. Moran, Ky Potter, Lauren A. Castro, Gabriel R. Wilson, Carlos Maldonado

Abstract

The Electric Propulsion Electrostatic Analyzer Experiment (ÈPÈE) is a compact ion energy bandpass filter deployed on the International Space Station (ISS) in March 2023 and providing continuous measurements through April 2024. This period coincides with the Solar Cycle 25 maximum, capturing unique observations of solar activity extremes in the mid- to low-latitude regions of the topside ionosphere. From these in situ spectra we derive plasma parameters that inform space-weather impacts on satellite navigation and radio communication. We present a statistical processing pipeline for ÈPÈE that (i) estimates the instrument noise floor, (ii) accounts for irregular temporal sampling, and (iii) extracts ionospheric signals. Rather than discarding noisy data, the method learns a baseline noise model and fits the measurement surface using a scaled Vecchia Gaussian process approximation, recovering values typically rejected by thresholding. The resulting products increase data coverage and enable noise-assisted monitoring of ionospheric variability.

2601.00987 2026-03-19 math.ST stat.AP stat.ML stat.TH

Tessellation Localized Transfer learning for nonparametric regression

Hélène Halconruy, Benjamin Bobbia, Paul Lejamtel

Comments 57 pages, 2 figures

Abstract

Transfer learning aims to improve performance on a target task by leveraging information from related source tasks. We propose a nonparametric regression transfer learning framework that explicitly models heterogeneity in the source-target relationship. Our approach relies on a local transfer assumption: the covariate space is partitioned into finitely many cells such that, within each cell, the target regression function can be expressed as a low-complexity transformation of the source regression function. This localized structure enables effective transfer where similarity is present while limiting negative transfer elsewhere. We introduce estimators that jointly learn the local transfer functions and the target regression, together with fully data-driven procedures that adapt to unknown partition structure and transfer strength. We establish sharp minimax rates for target regression estimation, showing that local transfer can mitigate the curse of dimensionality by exploiting reduced functional complexity. Our theoretical guarantees take the form of oracle inequalities that decompose excess risk into estimation and approximation terms, ensuring robustness to model misspecification. Numerical experiments illustrate the benefits of the proposed approach.

2511.03369 2026-03-19 cs.CL stat.ML

Silenced Biases: The Dark Side LLMs Learned to Refuse

Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson

Comments Accepted to The 40th Annual AAAI Conference on Artificial Intelligence - AI Alignment Track (Oral)

Abstract

Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models' latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models' direct responses and their underlying fairness issues.

2510.17903 2026-03-19 stat.ML cs.LG

Learning Time-Varying Graphs from Incomplete Graph Signals

Chuansen Peng, Xiaojing Shen

Abstract

This paper tackles the challenging problem of jointly inferring time-varying network topologies and imputing missing data from partially observed graph signals. We propose a unified non-convex optimization framework to simultaneously recover a sequence of graph Laplacian matrices while reconstructing the unobserved signal entries. Unlike conventional decoupled methods, our integrated approach facilitates a bidirectional flow of information between the graph and signal domains, yielding superior robustness, particularly in high missing-data regimes. To capture realistic network dynamics, we introduce a fused-lasso type regularizer on the sequence of Laplacians. This penalty promotes temporal smoothness by penalizing large successive changes, thereby preventing spurious variations induced by noise while still permitting gradual topological evolution. For solving the joint optimization problem, we develop an efficient Alternating Direction Method of Multipliers (ADMM) algorithm, which leverages the problem's structure to yield closed-form solutions for both the graph and signal subproblems. This design ensures scalability to large-scale networks and long time horizons. On the theoretical front, despite the inherent non-convexity, we establish a convergence guarantee, proving that the proposed ADMM scheme converges to a stationary point. Furthermore, we derive non-asymptotic statistical guarantees, providing high-probability error bounds for the graph estimator as a function of sample size, signal smoothness, and the intrinsic temporal variability of the graph. Extensive numerical experiments validate the approach, demonstrating that it significantly outperforms state-of-the-art baselines in both convergence speed and the joint accuracy of graph learning and signal recovery.
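
A fused-lasso type penalty on a sequence of Laplacians can be sketched as follows. The abstract does not give the exact form of the regularizer, so this example assumes the entrywise L1 norm of successive differences, summed over time; the helper `path_laplacian` and the edge weights are hypothetical and serve only to build small test inputs.

```python
def fused_lasso_penalty(laplacians):
    """Sum over t of the entrywise L1 norm of L_t - L_{t-1} (assumed form)."""
    total = 0.0
    for prev, curr in zip(laplacians, laplacians[1:]):
        for row_prev, row_curr in zip(prev, curr):
            total += sum(abs(a - b) for a, b in zip(row_prev, row_curr))
    return total

def path_laplacian(w01, w12):
    """Laplacian of a 3-node path graph with edge weights w01 and w12."""
    return [
        [w01, -w01, 0.0],
        [-w01, w01 + w12, -w12],
        [0.0, -w12, w12],
    ]

# A slowly evolving graph sequence versus one that rewires at every step.
smooth = [path_laplacian(1.0, 1.0), path_laplacian(1.0, 1.0), path_laplacian(1.0, 1.5)]
abrupt = [path_laplacian(1.0, 1.0), path_laplacian(0.0, 2.0), path_laplacian(1.0, 0.0)]

print("smooth sequence penalty:", fused_lasso_penalty(smooth))
print("abrupt sequence penalty:", fused_lasso_penalty(abrupt))
```

Because the penalty is an L1 norm, identical consecutive Laplacians contribute exactly zero, which is what favors piecewise-constant or slowly varying topologies over noisy ones.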

2510.10322 2026-03-19 stat.AP cs.NA math.NA

A Spatio-temporal CP decomposition analysis of New England region in the US

Fatoumata Sanogo

Comments 14 pages, 3 figures

English abstract

Spatio-temporal data consist of measurements of one or more raster fields such as weather, traffic volume, crime rate, or disease incidence. Advances in modern technology have increased the amount of information available for this type of data, hence the rise of multidimensional data. In this paper, we take advantage not only of the multidimensional structure of the data but also of its temporal and spatial structure. We use data from the NCAR Climate Data Gateway website, which provides data discovery and access services for global and regional climate model data. The daily values of total precipitation (prec), maximum temperature (tmax), and minimum temperature (tmin) are combined into a tensor (a multidimensional array). We propose a spatio-temporal principal component analysis to initialize the components of the CP decomposition, taking full advantage of the spatial and temporal structure of the data in the initialization step. The performance of our method is tested by comparison with the most popular initialization method. We also run a clustering analysis to further demonstrate the performance of our approach.

2510.04072 2026-03-19 cs.LG cs.AI cs.CL stat.ML

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Comments Published as a conference paper at ICLR 2026

English abstract

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also requires up to 4.93× fewer rollouts and up to 4.19× less wall-clock time to match GRPO's best accuracy. The project website is available at https://slow-fast-po.github.io/.

2509.18766 2026-03-19 cs.LG math.OC stat.ML

Diagonal Linear Networks and the Lasso Regularization Path

Raphaël Berthier

Comments 35 pages, 1 figure

English abstract

Diagonal linear networks are neural networks with linear activation and diagonal weight matrices. Their theoretical interest is that their implicit regularization can be rigorously analyzed: from a small initialization, the training of diagonal linear networks converges to the linear predictor with minimal 1-norm among minimizers of the training loss. In this paper, we deepen this analysis by showing that the full training trajectory of diagonal linear networks is closely related to the lasso regularization path. In this connection, the training time plays the role of an inverse regularization parameter. Both rigorous results and simulations are provided to illustrate this conclusion. Under a monotonicity assumption on the lasso regularization path, the connection is exact, while in the general case we show an approximate connection.
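The implicit bias toward a minimal 1-norm solution can be seen on a toy problem. The sketch below is illustrative only: it assumes the common parameterization beta = u*u - v*v with u = v = alpha at initialization and plain gradient descent on the squared loss, which may differ from the exact setup analyzed in the paper. The function name, step size, and data are hypothetical.

```python
def train_diagonal_linear_network(X, y, alpha=1e-3, lr=0.01, steps=2000):
    """Gradient descent on the mean of 0.5 * residual^2 with beta = u*u - v*v."""
    n, d = len(X), len(X[0])
    u = [alpha] * d  # small initialization, so beta starts at zero
    v = [alpha] * d
    for _ in range(steps):
        beta = [ui * ui - vi * vi for ui, vi in zip(u, v)]
        residual = [sum(xj * bj for xj, bj in zip(row, beta)) - yi
                    for row, yi in zip(X, y)]
        # Gradient of the loss with respect to the effective predictor beta.
        grad_beta = [sum(residual[i] * X[i][j] for i in range(n)) / n
                     for j in range(d)]
        # Chain rule: d beta_j / d u_j = 2 u_j and d beta_j / d v_j = -2 v_j.
        u = [ui - lr * 2.0 * ui * g for ui, g in zip(u, grad_beta)]
        v = [vi + lr * 2.0 * vi * g for vi, g in zip(v, grad_beta)]
    return [ui * ui - vi * vi for ui, vi in zip(u, v)]


# One equation, two unknowns: beta_1 + 2 * beta_2 = 2.
# The minimal 1-norm solution is (0, 1); the minimal 2-norm one is (0.4, 0.8).
X = [[1.0, 2.0]]
y = [2.0]
beta = train_diagonal_linear_network(X, y)
print(beta)
```

With a small alpha the returned predictor lies close to (0, 1) rather than (0.4, 0.8). Stopping the loop earlier yields intermediate predictors, which is the part of the trajectory the paper relates to the lasso path.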

2508.09896 2026-03-19 stat.AP

XGBoost meets INLA: a two-stage spatio-temporal forecasting of wildfires in Portugal

Chenglei Hu, Regina Baltazar Bispo, Håvard Rue, Carlos C. DaCamara, Ben Swallow, Daniela Castro-Camilo

Comments 50 pages, 14 figures, 4 tables

English abstract

Wildfires pose a major threat to Portugal, with over 115,000 hectares burned annually on average during 1980-2024, and the country has faced devastating mega-fires such as those in 2017. Accurate forecasts of wildfire occurrence and burned area are therefore essential for firefighting resource allocation and emergency preparedness. In this study, we propose a novel two-stage ensemble that extends the widely used latent Gaussian modelling framework with integrated nested Laplace approximation (INLA) for spatio-temporal wildfire forecasting. Stage 1 applies a gradient boosting model (XGBoost) to environmental covariates and historical fire records to produce one-month-ahead point forecasts of fire counts and burned area. Stage 2 uses these predictions as external covariates in a latent Gaussian model with additional spatiotemporal random effects to generate probabilistic forecasts of monthly total fire counts and burned area at the council level. To capture both moderate and extreme events, we implement the extended generalised Pareto (eGP) likelihood (a sub-asymptotic distribution) within INLA, develop Penalised Complexity (PC) priors for its parameters, and compare the eGP likelihood with common alternatives (e.g., Gamma and Weibull). Our framework tackles the unavailability of future environmental covariates at prediction time and performs strongly for one-month-ahead forecasts.

2507.00641 2026-03-19 nlin.AO cs.LG stat.CO stat.ME

Hebbian Physics Networks: A Self-Organizing Computational Architecture Based on Local Physical Laws

Gunjan Auti, Hirofumi Daiguji, Gouhei Tanaka

Comments 16 pages, 3 figures

English abstract

Physical transport processes organize through local interactions that redistribute imbalance while preserving conservation. Classical solvers enforce this organization by applying fixed discrete operators on rigid grids. We introduce the Hebbian Physics Network (HPN), a computational framework that replaces this rigid scaffolding with a plastic transport geometry. An HPN is a coupled dynamical system of physical states on nodes and constitutive weights on edges in a graph. Residuals--local violations of continuity, momentum balance, or energy conservation--act as thermodynamic forces that drive the joint evolution of both the state and the operator (i.e. the adaptive weights). The weights adapt through a three-factor Hebbian rule, which we prove constitutes a strictly local gradient descent on the residual energy. This mechanism ensures thermodynamic stability: near equilibrium, the learned operator naturally converges to a symmetric, positive-definite form, rigorously reproducing Onsager's reciprocal relations without explicit enforcement. Far from equilibrium, the system undergoes a self-organizing search for a transport topology that restores global coercivity. Unlike optimization-based approaches that impose physics through global loss functions, HPNs embed conservation intrinsically: transport is restored locally by the evolving operator itself, without a global Poisson solve or backpropagated objective. We demonstrate the framework on scalar diffusion and incompressible lid-driven cavity flow, showing that physically consistent transport geometries and flow structures emerge from random initial conditions solely through residual-driven local adaptation. HPNs thus reframe computation not as the solution of a fixed equation, but as a thermodynamic relaxation process where the constitutive geometry and physical state co-evolve.

2506.14531 2026-03-19 stat.AP stat.ME

A statistical framework for dynamic cognitive diagnosis in digital learning environments

Yawen Ma, Anastasia Ushakova, Kate Cain, Gabriel Wallin

English abstract

Reading is foundational for educational, employment, and economic outcomes, but a persistent proportion of students globally struggle to develop adequate reading skills. Some countries promote digital tools to support reading development, alongside regular classroom instruction. Such tools generate rich log data capturing students' behaviour and performance. This study proposes a dynamic cognitive diagnostic modeling (CDM) framework based on restricted latent class models to trace students' time-varying skill mastery using log files from digital tools. Unlike traditional CDMs that require expert-defined skill-item mappings (Q-matrix), our approach jointly estimates the Q-matrix and latent skill profiles, integrates log-derived covariates (e.g., reattempts, response times, counts of mastered items) and individual characteristics, and models transitions in mastery using a Bayesian estimation approach. Applied to real-world data, the model demonstrates practical value in educational settings by effectively uncovering individual skill profiles and the skill-item mappings. Simulation studies confirm robust recovery of Q-matrix structures and latent profiles with high accuracy under varied sample sizes, item counts, and levels of Q-matrix sparsity. The framework offers a data-driven, time-dependent restricted latent class modeling approach to understanding early reading development.

2502.17292 2026-03-19 cs.LG cs.GT cs.IT math.IT stat.ME stat.ML

Joint Value Estimation and Bidding in Repeated First-Price Auctions

Yuxiao Wen, Yanjun Han, Zhengyuan Zhou

Comments POMS-HK 2026 Best Student Paper Finalist

English abstract

We study regret minimization in repeated first-price auctions (FPAs), where a bidder observes only the realized outcome after each auction -- win or loss. This setup reflects practical scenarios in online display advertising where the actual value of an impression depends on the difference between two potential outcomes, such as clicks or conversion rates, when the auction is won versus lost. We incorporate causal inference into this framework and analyze the challenging case where only the treatment effect admits a simple dependence on observable features. Under this framework, we propose algorithms that jointly estimate private values and optimize bidding strategies under two different feedback types on the highest other bid (HOB): the full-information feedback where the HOB is always revealed, and the binary feedback where the bidder only observes the win-loss indicator. Under both cases, our algorithms are shown to achieve near-optimal regret bounds. Notably, our framework enjoys a unique feature that the treatments are actively chosen, and hence eliminates the need for the overlap condition commonly required in causal inference.

2502.14719 2026-03-19 stat.ML cs.LG

How PC-based Methods Err: Towards Better Reporting of Assumption Violations and Small Sample Errors

Sofia Faltenbacher, Jonas Wahl, Rebecca Herman, Jakob Runge

Comments under review

English abstract

Causal discovery methods based on the PC algorithm are proven to be sound if all structural assumptions are fulfilled and all conditional independence tests are correct. This idealized setting is rarely given in real data. In this work, we first analyze how local errors can propagate throughout the output graph of a PC-based method, highlighting how consequential seemingly innocuous errors can become. Next, we introduce coherency scores to find assumption violations and small sample errors in the absence of a ground truth. These scores do not require statistical tests beyond those already executed by the causal discovery algorithm. Errors detected by our approach extend the set of errors that can be detected by comparable existing methods. We place our computationally cheap global error detection and quantification scores as a bridge between computationally expensive global answer-set-programming-based methods and less expensive local error detection methods. The scores are analyzed on simulated and real-world datasets.

2411.16902 2026-03-19 stat.ME

Bounding causal effects with an unknown mixture of informative and non-informative missingness

Max Rubinstein, Denis Agniel, Larry Han, Marcela Horvitz-Lennon, Sharon-Lise Normand

English abstract

In experimental and observational data settings, researchers often have limited knowledge of the reasons for missing outcomes. To address this uncertainty, we propose bounds on causal effects for missing outcomes, accommodating the scenario where missingness is an unobserved mixture of informative and non-informative components. Within this mixed missingness framework, we explore several assumptions to derive bounds on causal effects, including bounds expressed as a function of user-specified sensitivity parameters. We develop influence-function based estimators of these bounds to enable flexible, non-parametric, and machine learning based estimation, achieving root-n convergence rates and asymptotic normality under relatively mild conditions. We further consider the identification and estimation of bounds for other causal quantities that remain meaningful when informative missingness reflects a competing outcome, such as death. We conduct simulation studies and illustrate our methodology with a study on the causal effect of antipsychotic drugs on diabetes risk using a health insurance dataset.

2411.12127 2026-03-19 cs.LG cs.IT math.IT math.ST stat.ML stat.TH

Fine-Grained Uncertainty Quantification via Collisions

Jesse Friedbaum, Sudarshan Adiga, Ravi Tandon

English abstract

We propose a new and intuitive metric for aleatoric uncertainty quantification (UQ), the prevalence of class collisions defined as the same input being observed in different classes. We use the rate of class collisions to define the collision matrix, a novel and uniquely fine-grained measure of uncertainty. For a classification problem involving $K$ classes, the $K\times K$ collision matrix $S$ measures the inherent difficulty in distinguishing between each pair of classes. We discuss several applications of the collision matrix, establish its fundamental mathematical properties, and show its relationship with existing UQ methods, including the Bayes error rate (BER). We also address the new problem of estimating the collision matrix using one-hot labeled data by proposing a series of innovative techniques to estimate $S$. First, we learn a pair-wise contrastive model which accepts two inputs and determines if they belong to the same class. We then show that this contrastive model (which is PAC learnable) can be used to estimate the row Gramian matrix of $S$, defined as $G=SS^T$. Finally, we show that under reasonable assumptions, $G$ can be used to uniquely recover $S$, a new result on non-negative matrices which could be of independent interest. With a method to estimate $S$ established, we demonstrate how this estimate of $S$, in conjunction with the contrastive model, can be used to estimate the posterior class probability distribution of any point. Experimental results are also presented to validate our methods of estimating the collision matrix and class posterior distributions on several datasets.

2410.19031 2026-03-19 stat.ME

High-dimensional Statistical Inference and Variable Selection Using Sufficient Dimension Association

Shangyuan Ye, Shauna Rakshe, Ye Liang

English abstract

Simultaneous variable selection and statistical inference is challenging in high-dimensional data analysis. Most existing post-selection inference methods require explicitly specified regression models, which are often linear, as well as sparsity in the regression model. The performance of such procedures can be poor under either misspecified nonlinear models or a violation of the sparsity assumption. In this paper, we propose a sufficient dimension association (SDA) technique that measures the association between each predictor and the response variable conditioning on other predictors in the high-dimensional setting. Our proposed SDA method requires neither a specific form of regression model nor sparsity in the regression. Instead, our method assumes normalized or Gaussian-distributed predictors with a Markov blanket property. We propose an estimator for the SDA and prove asymptotic properties for the estimator. We construct three types of test statistics for the SDA and propose a multiple testing procedure to control the false discovery rate. Extensive simulation studies have been conducted to show the validity and superiority of our SDA method. Gene expression data from the Alzheimer's Disease Neuroimaging Initiative are used to demonstrate a real application.

2212.09996 2026-03-19 stat.ME stat.AP

A marginalized three-part interrupted time series regression model for proportional data

Shangyuan Ye, Maricela Cruz, Ziyou Wang, Yun Yu

English abstract

Interrupted time series (ITS) is often used to evaluate the effectiveness of a health policy intervention while accounting for the temporal dependence of outcomes. When the outcome of interest is a percentage or percentile, the data can be highly skewed, bounded in $[0, 1]$, and have many zeros or ones. A three-part Beta regression model is commonly used to separate zeros, ones, and positive values explicitly using three submodels. However, incorporating temporal dependence into the three-part Beta regression model is challenging. In this article, we propose a marginalized zero-one-inflated Beta time series model that captures the temporal dependence of outcomes through a copula and allows investigators to examine covariate effects on the marginal mean. We investigate its practical performance using simulation studies and apply the model to a real ITS study.
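For orientation, the marginal mean that such a model targets can be written out for a generic zero-one-inflated Beta mixture. The notation below ($\pi_{0}$ and $\pi_{1}$ for the probabilities of an exact zero and an exact one, $\mu$ and $\phi$ for the mean and precision of the Beta part) is assumed for illustration and is not taken from the paper.

```latex
\begin{align*}
Y &\sim \pi_{0}\,\delta_{0} + \pi_{1}\,\delta_{1}
      + (1-\pi_{0}-\pi_{1})\,\mathrm{Beta}\bigl(\mu\phi,\,(1-\mu)\phi\bigr), \\
\mathbb{E}[Y] &= \pi_{1} + (1-\pi_{0}-\pi_{1})\,\mu .
\end{align*}
```

In a marginalized formulation, the regression structure is typically placed on $\mathbb{E}[Y]$ rather than on $\mu$ alone, so that a coefficient describes a covariate's effect on the overall mean of the proportion.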

2211.15168 2026-03-19 math.PR math.DG math.ST stat.TH

Most probable paths for developed processes

Erlend Grong, Stefan Sommer

English abstract

Optimal paths for the classical Onsager-Machlup function, which determine most probable paths between points on a manifold, are explicitly identified only for specific processes, for example the Riemannian Brownian motion. This leaves out large classes of manifold-valued processes, such as processes with a parallel transported non-trivial diffusion matrix, processes with rank-deficient generators, sub-Riemannian processes, and push-forwards to quotient spaces. In this paper, we construct a general approach to the definition and identification of most probable paths by measuring the Onsager-Machlup function on the anti-development of such processes. The construction encompasses large classes of manifold-valued processes and results in explicit equation systems for the paths, which we call "development most probable paths". We define and derive these results and apply them to several cases of stochastic processes on Lie groups, homogeneous spaces, and landmark spaces appearing in shape analysis.

2205.06868 2026-03-19 stat.ME stat.AP

Regression and Dimension Reduction for Multivariate Mixed-Type Data via Semiparametric Gaussian Copula

Debangan Dey, Vadim Zipunnikov

Comments 43 pages, 8 figures, 3 tables

English abstract

Clinical and epidemiological studies encode participant information in multivariate vectors with mixed-type variables on continuous, truncated, ordinal, and binary scales. The Semiparametric Gaussian Copula (SGC) assumes that observed data are generated by latent multivariate normal random variables whose marginals are monotonically transformed and then truncated/ordinalized/binarized. In SGC, the latent correlation matrix fully determines the dependence structure, and it is estimated by inverting "bridges" between Kendall's Tau rank correlations of observed variables and latent correlations. By employing SGC, we develop regression (SGC-Reg), principal component analysis (SGC-PCA), and principal component regression (SGC-PCR) for latent representations of observed data. To build our framework, we make several key contributions: i) establishing novel bridging results for general ordinal type variables, ii) developing regression estimation on the latent space and deriving asymptotic normality of estimators, iii) developing a computationally efficient algorithm that reduces the computational complexity of all steps, including calculation of the asymptotic covariance matrix, from $O(n^4)$ to $O(n\log n)$, iv) developing methods to predict latent representations of observed data and perform imputation of missing data, and v) developing principal component analysis and principal component regression on the latent space. We apply our framework to study the association between 5-year mortality and 61 frailty-related measures composed of 29 continuous, 17 ordinal, and 15 binary variables in 9478 participants of the 1999-2010 waves of the National Health and Nutrition Examination Survey (NHANES).
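The idea of inverting a bridge can be sketched for the simplest case, a pair of continuous variables, where the classical relation between Kendall's Tau and the latent Gaussian correlation is r = sin(pi * tau / 2). The code below is a minimal illustration under that assumption only: the bridges for truncated, ordinal, and binary variables developed in the paper are different and are not reproduced here, the O(n^2) tau computation is a naive stand-in for the fast algorithm the abstract mentions, and all function names are hypothetical.

```python
import math
import random


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def kendall_tau(x, y):
    """Naive O(n^2) Kendall's Tau (tau-a) for two equal-length sequences."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sign(x[i] - x[j]) * sign(y[i] - y[j])
    return 2.0 * total / (n * (n - 1))


def latent_correlation_from_tau(tau):
    """Invert the continuous-continuous bridge: r = sin(pi * tau / 2)."""
    return math.sin(math.pi * tau / 2.0)


# Simulate latent bivariate normal data with correlation 0.6, then apply
# strictly increasing transforms to obtain the "observed" variables.
rng = random.Random(0)
rho = 0.6
z1 = [rng.gauss(0.0, 1.0) for _ in range(800)]
z2 = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]
x_obs = [math.exp(a) for a in z1]
y_obs = [b ** 3 for b in z2]

tau_latent = kendall_tau(z1, z2)
tau_observed = kendall_tau(x_obs, y_obs)
r_hat = latent_correlation_from_tau(tau_observed)
print(tau_observed, r_hat)
```

Because Kendall's Tau depends only on ranks, it is unchanged by the monotone transforms, so the latent correlation can be estimated without knowing or estimating the marginal transformations.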