arXivDaily arXiv每日学术速递 周一至周五更新
2510.22007 2026-01-21 math.ST cs.CL cs.CR cs.LG stat.ML stat.TH

Optimal Detection for Language Watermarks with Pseudorandom Collision

T. Tony Cai, Xiang Li, Qi Long, Weijie J. Su, Garrett G. Wen

详情
英文摘要

Text watermarking plays a crucial role in ensuring the traceability and accountability of large language model (LLM) outputs and mitigating misuse. While promising, most existing methods assume perfect pseudorandomness. In practice, repetition in generated text induces collisions that create structured dependence, compromising Type I error control and invalidating standard analyses. We introduce a statistical framework that captures this structure through a hierarchical two-layer partition. At its core is the concept of minimal units -- the smallest groups treatable as independent across units while permitting dependence within. Using minimal units, we define a non-asymptotic efficiency measure and cast watermark detection as a minimax hypothesis testing problem. Applied to Gumbel-max and inverse-transform watermarks, our framework produces closed-form optimal rules. It explains why discarding repeated statistics often improves performance and shows that within-unit dependence must be addressed unless degenerate. Both theory and experiments confirm improved detection power with rigorous Type I error control. These results provide the first principled foundation for watermark detection under imperfect pseudorandomness, offering both theoretical insight and practical guidance for reliable tracing of model outputs.

2601.14223 2026-01-21 math.ST stat.TH

Symmetry Testing in Time Series using Ordinal Patterns: A U-Statistic Approach

Annika Betken, Giorgio Micali, Manuel Ruiz Marín

详情
英文摘要

We introduce a general framework for testing temporal symmetries in time series based on the distribution of ordinal patterns. While previous approaches have focused on specific forms of asymmetry, such as time reversal, our method provides a unified framework applicable to arbitrary symmetry tests. We establish asymptotic results for the resulting test statistics under a broad class of stationary processes. Comprehensive experiments on both synthetic and real data demonstrate that the proposed test achieves high sensitivity to structural asymmetries while remaining fully data-driven and computationally efficient.

2601.14199 2026-01-21 stat.ME

Factor Analysis of Multivariate Stochastic Volatility Model

Taehee Lee, Jun S. Liu

Comments Submitted to Journal of the American Statistical Association (JASA)

详情
英文摘要

Modeling the time-varying covariance structures of high-dimensional variables is critical across diverse scientific and industrial applications; however, existing approaches exhibit notable limitations in either modeling flexibility or inferential efficiency. For instance, change-point modeling fails to account for the continuous time-varying nature of covariance structures, while GARCH and stochastic volatility models suffer from over-parameterization and the risk of overfitting. To address these challenges, we propose a Bayesian factor modeling framework designed to enable simultaneous inference of both the covariance structure of a high-dimensional time series and its time-varying dynamics. The associated Expectation-Maximization (EM) algorithm not only features an exact, closed-form update for the M-step but also is easily generalizable to more complex settings, such as spatiotemporal multivariate factor analysis. We validate our method through simulation studies and real-data experiments using climate and financial datasets.

2601.14173 2026-01-21 cs.LG stat.ML

Penalizing Localized Dirichlet Energies in Low Rank Tensor Products

Paris A. Karakasis, Nicholas D. Sidiropoulos

Comments 19 pages

详情
英文摘要

We study low-rank tensor-product B-spline (TPBS) models for regression tasks and investigate Dirichlet energy as a measure of smoothness. We show that TPBS models admit a closed-form expression for the Dirichlet energy, and reveal scenarios where perfect interpolation is possible with exponentially small Dirichlet energy. This renders global Dirichlet energy-based regularization ineffective. To address this limitation, we propose a novel regularization strategy based on local Dirichlet energies defined on small hypercubes centered at the training points. Leveraging pretrained TPBS models, we also introduce two estimators for inference from incomplete samples. Comparative experiments with neural networks demonstrate that TPBS models outperform neural networks in the overfitting regime for most datasets, and maintain competitive performance otherwise. Overall, TPBS models exhibit greater robustness to overfitting and consistently benefit from regularization, while neural networks are more sensitive to overfitting and less effective in leveraging regularization.

2601.14170 2026-01-21 math.PR cond-mat.stat-mech cs.DM math.CO math.ST stat.TH

Wasserstein distances between ERGMs and Erdős-Rényi models

Vilas Winstein

Comments 33 pages

详情
英文摘要

Ferromagnetic exponential random graph models (ERGMs) are random graph models under which the presence of certain small structures (such as triangles) is encouraged; they can be constructed by tilting an Erdős--Rényi model by the exponential of a particular nonlinear Hamiltonian. These models are mixtures of metastable wells which each behave macroscopically like an Erdős--Rényi model, exhibiting the same laws of large numbers for subgraph counts [CD13]. However, on the microscopic scale these metastable wells are very different from Erdős--Rényi models, with the total variation distance between the two measures tending to 1 [MX23]. In this article we clarify this situation by providing a sharp (up to constants) bound on the Hamming-Wasserstein distance between the two models, which is the average number of edges at which they differ, under the coupling which minimizes this average. In particular, we show that this distance is $Θ(n^{3/2})$, quantifying exactly how these models differ. An upper bound of this form has appeared in the past [RR19], but this was restricted to the subcritical (high-temperature) regime of parameters. We extend this bound, using a new proof technique, to the supercritical (low-temperature) regime, and prove a matching lower bound which has only previously appeared in the subcritical regime of special cases of ERGMs satisfying a "triangle-free" condition [DF25]. To prove the lower bound in the presence of triangles, we introduce an approximation of the discrete derivative of the Hamiltonian, which controls the dynamical properties of the ERGM, in terms of local counts of triangles and wedges (two-stars) near an edge. This approximation is the main technical and conceptual contribution of the article, and we expect it will be useful in a variety of other contexts as well. Along the way, we also prove a bound on the marginal edge probability under the ERGM via a new bootstrapping argument. Such a bound has already appeared [FLSW25], but again only in the subcritical regime and using a different proof strategy.

2601.14062 2026-01-21 q-fin.ST stat.AP stat.ML

Demystifying the trend of the healthcare index: Is historical price a key driver?

Payel Sadhukhan, Samrat Gupta, Subhasis Ghosh, Tanujit Chakraborty

详情
英文摘要

Healthcare sector indices consolidate the economic health of pharmaceutical, biotechnology, and healthcare service firms. The short-term movements in these indices are closely intertwined with capital allocation decisions affecting research and development investment, drug availability, and long-term health outcomes. This research investigates whether historical open-high-low-close (OHLC) index data contain sufficient information for predicting the directional movement of the opening index on the subsequent trading day. The problem is formulated as a supervised classification task involving a one-step-ahead rolling window. A diverse feature set is constructed, comprising original prices, volatility-based technical indicators, and a novel class of nowcasting features derived from mutual OHLC ratios. The framework is evaluated on data from healthcare indices in the U.S. and Indian markets over a five-year period spanning multiple economic phases, including the COVID-19 pandemic. The results demonstrate robust predictive performance, with accuracy exceeding 0.8 and Matthews correlation coefficients above 0.6. Notably, the proposed nowcasting features have emerged as a key determinant of the market movement. We have employed the Shapley-based explainability paradigm to further elucidate the contribution of the features: outcomes reveal the dominant role of the nowcasting features, followed by a more moderate contribution of original prices. This research offers a societal utility: the proposed features and model for short-term forecasting of healthcare indices can reduce information asymmetry and support a more stable and equitable health economy.

2601.13998 2026-01-21 stat.ME stat.AP stat.CO

Modeling Zero-Inflated Longitudinal Circular Data Using Bayesian Methods: Application to Ophthalmology

Prajamitra Bhuyan, Soutik Halder, Jayant Jha

详情
英文摘要

This paper introduces the modeling of circular data with excess zeros under a longitudinal framework, where the response is a circular variable and the covariates can be both linear and circular in nature. In the literature, various circular-circular and circular-linear regression models have been studied and applied to different real-world problems. However, there are no models for addressing zero-inflated circular observations in the context of longitudinal studies. Motivated by a real case study, a mixed-effects two-stage model based on the projected normal distribution is proposed to handle such issues. The interpretation of the model parameters is discussed and identifiability conditions are derived. A Bayesian methodology based on Gibbs sampling technique is developed for estimating the associated model parameters. Simulation results show that the proposed method outperforms its competitors in various situations. A real dataset on post-operative astigmatism is analyzed to demonstrate the practical implementation of the proposed methodology. The use of the proposed method facilitates effective decision-making for treatment choices and in the follow-up phases.

2601.13966 2026-01-21 math.ST stat.TH

Information-Theoretic and Computational Limits of Correlation Detection under Graph Sampling

Dong Huang, Pengkun Yang

Comments 61 pages, 7 figures

详情
英文摘要

Correlation analysis is a fundamental problem in statistics. In this paper, we consider the correlation detection problem between a pair of Erdos-Renyi graphs. Specifically, the problem is formulated as a hypothesis testing problem: under the null hypothesis, the two graphs are independent; under the alternative hypothesis, the two graphs are edge-correlated through a latent permutation. We focus on the scenario where only two induced subgraphs are sampled, and characterize the sample size threshold for detection. At the information-theoretic level, we establish the sample complexity rates that are optimal up to constant factors over most parameter regimes, and the remaining gap is bounded by a subpolynomial factor. On the algorithmic side, we propose polynomial-time tests based on counting trees and bounded degree motifs, and identify the regimes where they succeed. Moreover, leveraging the low-degree conjecture, we provide evidence of computational hardness that matches our achievable guarantees, showing that the proposed polynomial-time tests are rate-optimal. Together, these results reveal a statistical--computational gap in the sample size required for correlation detection. Finally, we validate the proposed algorithms on synthetic data and a real coauthor network, demonstrating strong empirical performance.

2601.13962 2026-01-21 eess.SP cs.SY eess.SY q-bio.NC stat.ME

Optimal Calibration of the endpoint-corrected Hilbert Transform

Eike Osmers, Dorothea Kolossa

详情
英文摘要

Accurate, low-latency estimates of the instantaneous phase of oscillations are essential for closed-loop sensing and actuation, including (but not limited to) phase-locked neurostimulation and other real-time applications. The endpoint-corrected Hilbert transform (ecHT) reduces boundary artefacts of the Hilbert transform by applying a causal narrow-band filter to the analytic spectrum. This improves the phase estimate at the most recent sample. Despite its widespread empirical use, the systematic endpoint distortions of ecHT have lacked a principled, closed-form analysis. In this study, we derive the ecHT endpoint operator analytically and demonstrate that its output can be decomposed into a desired positive-frequency term (a deterministic complex gain that induces a calibratable amplitude/phase bias) and a residual leakage term setting an irreducible variance floor. This yields (i) an explicit characterisation and bounds for endpoint phase/amplitude error, (ii) a mean-squared-error-optimal scalar calibration (c-ecHT), and (iii) practical design rules relating window length, bandwidth/order, and centre-frequency mismatch to residual bias via an endpoint group delay. The resulting calibrated ecHT achieves near-zero mean phase error and remains computationally compatible with real-time pipelines. Code and analyses are provided at https://github.com/eosmers/cecHT.

2601.13955 2026-01-21 math.ST stat.TH

Uniform Consistency of Generalized Cross-Validation for Ridge Regression in High-Dimensional Misspecified Linear Models

Akira Shinkyu

详情
英文摘要

This study examines generalized cross-validation for the tuning parameter selection for ridge regression in high-dimensional misspecified linear models. The set of candidates for the tuning parameter includes not only positive values but also zero and negative values. We demonstrate that if the second moment of the specification error converges to zero, generalized cross-validation is still a uniformly consistent estimator of the out-of-sample prediction risk. This implies that generalized cross-validation selects the tuning parameter for which ridge regression asymptotically achieves the smallest prediction risk among the candidates if the degree of misspecification for the regression function is small. Our simulation studies show that ridge regression tuned by generalized cross-validation exhibits a prediction performance similar to that of optimally tuned ridge regression and outperforms the Lasso under correct and incorrect model specifications.

2601.13946 2026-01-21 math.ST stat.TH

Topological Criteria for Hypothesis Testing with Finite-Precision Measurements

Philip Boeken, Eduardo Skapinakis, Konstantin Genin, Joris M. Mooij

详情
英文摘要

We establish topological necessary and sufficient conditions under which a pair of statistical hypotheses can be consistently distinguished when i.i.d. observations are recorded only to finite precision. Requiring the test's decision regions to be open in the sample-space topology to accommodate finite-precision data, we show that a pair of null- and alternative hypotheses $H_0$ and $H_1$ admits a consistent test if and only if they are $F_σ$ in the weak topology on the space of probability measures $W := H_0\cup H_1$. Additionally, the hypotheses admit uniform error control under $H_0$ and/or $H_1$ if and only if $H_0$ and/or $H_1$ are closed in $W$. Under compactness assumptions, uniform consistency is characterised by $H_0$ and $H_1$ having disjoint closures in the ambient space of probability measures. These criteria imply that - without regularity assumptions - conditional independence is not consistently testable. We introduce a Lipschitz-continuity assumption on the family of conditional distributions under which we recover testability of conditional independence with uniform error control under the null, with testable smoothness constraints.

2601.13784 2026-01-21 stat.ME

An Adaptive Phase II Trial Design for Dose Selection and Addition in Microfilarial Infections

Sonja Zehetmayer, Marta Bofill Roig, Fabrice Lotola Mougeni, Sabine Specht, Marc P. Hübner, Martin Posch

详情
英文摘要

We propose a frequentist adaptive phase 2 trial design to evaluate the safety and efficacy of three treatment regimens (doses) compared to placebo for four types of helminth (worm) infections. This trial will be carried out in four Subsaharan African countries from spring 2025. Since the safety of the highest dose is not yet established, the study begins with the two lower doses and placebo. Based on safety and early efficacy results from an interim analysis, a decision will be made to either continue with the two lower doses or drop one or both and introduce the highest dose instead. This design borrows information across baskets for safety assessment, while efficacy is assessed separately for each basket. The proposed adaptive design addresses several key challenges: (1) The trial must begin with only the two lower doses because reassuring safety data from these doses is required before escalating to a higher dose. (2) Due to the expected speed of recruitment, adaptation decisions must rely on an earlier, surrogate endpoint. (3) The primary outcome is a count variable that follows a mixture distribution with an atom at 0. To control the familywise error rate in the strong sense when comparing multiple doses to the control in the adaptive design, we extend the partial conditional error approach to accommodate the inclusion of new hypotheses after the interim analysis. In a comprehensive simulation study we evaluate various design options and analysis strategies, assessing the robustness of the design under different design assumptions and parameter values. We identify scenarios where the adaptive design improves the trial's ability to identify an optimal dose. Adaptive dose selection enables resource allocation to the most promising treatment arms, increasing the likelihood of selecting the optimal dose while reducing the required overall sample size and trial duration.

2601.13776 2026-01-21 cs.LG stat.ML

Orthogonium : A Unified, Efficient Library of Orthogonal and 1-Lipschitz Building Blocks

Thibaut Boissin, Franck Mamalet, Valentin Lafargue, Mathieu Serrurier

Journal ref ICML 2025 Workshop on Championing Open- source Development in Machine Learning (CODEML '25), Jul 2025, Vancouver, France

详情
英文摘要

Orthogonal and 1-Lipschitz neural network layers are essential building blocks in robust deep learning architectures, crucial for certified adversarial robustness, stable generative models, and reliable recurrent networks. Despite significant advancements, existing implementations remain fragmented, limited, and computationally demanding. To address these issues, we introduce Orthogonium , a unified, efficient, and comprehensive PyTorch library providing orthogonal and 1-Lipschitz layers. Orthogonium provides access to standard convolution features-including support for strides, dilation, grouping, and transposed-while maintaining strict mathematical guarantees. Its optimized implementations reduce overhead on large scale benchmarks such as ImageNet. Moreover, rigorous testing within the library has uncovered critical errors in existing implementations, emphasizing the importance of standardized and reliable tools. Orthogonium thus significantly lowers adoption barriers, enabling scalable experimentation and integration across diverse applications requiring orthogonality and robust Lipschitz constraints. Orthogonium is available at https://github.com/deel-ai/orthogonium.

2601.13755 2026-01-21 stat.ME

Building a Standardised Statistical Reporting Toolbox in an Academic Oncology Clinical Trials Unit: The grstat R Package

Dan Chaltiel, Alexis Cochard, Nusaibah Ibrahimi, Charlotte Bargain, Ikram Benchara, Anne Lourdessamy, Aldéric Fraslin, Matthieu Texier, Livia Pierotti

详情
英文摘要

Academic Clinical Trial Units frequently face fragmented statistical workflows, leading to duplicated effort, limited collaboration, and inconsistent analytical practices. To address these challenges within an oncology Clinical Trial Unit, we developed grstat, an R package providing a standardised set of tools for routine statistical analyses. Beyond the software itself, the development of grstat is embedded in a structured organisational framework combining formal request tracking, peer-reviewed development, automated testing, and staged validation of new functionalities. The package is intentionally opinionated, reflecting shared practices agreed upon within the unit, and evolves through iterative use in real-world projects. Its development as an open-source project on GitHub supports transparent workflows, collective code ownership, and traceable decision-making. While primarily designed for internal use, this work illustrates a transferable approach to organising, validating, and maintaining a shared analytical toolbox in an academic setting. By coupling technical implementation with governance and validation principles, grstat supports efficiency, reproducibility, and long-term maintainability of biostatistical workflows, and may serve as a source of inspiration for other Clinical Trial Units facing similar organisational challenges.

2601.13744 2026-01-21 math.ST stat.TH

A Note on k-NN Gating in RAG

Gérard Biau, Claire Boyer

详情
英文摘要

We develop a statistical proxy framework for retrieval-augmented generation (RAG), designed to formalize how a language model (LM) should balance its own predictions with retrieved evidence. For each query x, the system combines a frozen base model q0 ($\times$ x) with a k-nearest neighbor retriever r (k ) ($\times$ x) through a measurable gate k(x). A retrieval-trust weight wfact (x) quantifies the geometric reliability of the retrieved neighborhood and penalizes retrieval in low-trust regions. We derive the Bayes-optimal per-query gate and analyze its effect on a discordance-based hallucination criterion that captures disagreements between LM predictions and retrieved evidence. We further show that this discordance admits a deterministic asymptotic limit governed solely by the structural agreement (or disagreement) between the Bayes rule and the LM. To account for distribution mismatch between queries and memory, we introduce a hybrid geometric-semantic model combining covariate deformation and label corruption. Overall, this note provides a principled statistical foundation for factuality-oriented RAG systems.

2601.13642 2026-01-21 stat.ML cs.LG

Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning

Yuchen Jiao, Jiin Woo, Gen Li, Gauri Joshi, Yuejie Chi

详情
英文摘要

Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{\varepsilon^3}\right)$, where $\|h^{\star}\|_{\mathsf{sp}}$ is the span norm of the bias function, improving previous results by at least a factor of $\frac{\|h^{\star}\|_{\mathsf{sp}}^2}{\varepsilon^2}$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.

2601.13641 2026-01-21 stat.AP eess.SP

Correction of Pooling Matrix Mis-specifications in Compressed Sensing Based Group Testing

Shuvayan Banerjee, Radhendushka Srivastava, James Saunderson, Ajit Rajwade

详情
英文摘要

Compressed sensing, which involves the reconstruction of sparse signals from an under-determined linear system, has been recently used to solve problems in group testing. In a public health context, group testing aims to determine the health status values of p subjects from n<<p pooled tests, where a pool is defined as a mixture of small, equal-volume portions of the samples of a subset of subjects. This approach saves on the number of tests administered in pandemics or other resource-constrained scenarios. In practical group testing in time-constrained situations, a technician can inadvertently make a small number of errors during pool preparation, which leads to errors in the pooling matrix, which we term `model mismatch errors' (MMEs). This poses difficulties while determining health status values of the participating subjects from the results on n<<p pooled tests. In this paper, we present an algorithm to correct the MMEs in the pooled tests directly from the pooled results and the available (inaccurate) pooling matrix. Our approach then reconstructs the signal vector from the corrected pooling matrix, in order to determine the health status of the subjects. We further provide theoretical guarantees for the correction of the MMEs and the reconstruction error from the corrected pooling matrix. We also provide several supporting numerical results.

2601.13627 2026-01-21 stat.AP

Are Large Language Models able to Predict Highly Cited Papers? Evidence from Statistical Publications

Zhanshuo Ye, Yiming Hou, Rui Pan, Tianchen Gao, Hansheng Wang

详情
英文摘要

Predicting highly-cited papers is a long-standing challenge due to the complex interactions of research content, scholarly communities, and temporal dynamics. Recent advances in large language models (LLMs) raise the question of whether early-stage textual information can provide useful signals of long-term scientific impact. Focusing on statistical publications, we propose a flexible, text-centered framework that leverages LLMs and structured prompt design to predict highly cited papers. Specifically, we utilize information available at the time of publication, including titles, abstracts, keywords, and limited bibliographic metadata. Using a large corpus of statistical papers, we evaluate predictive performance across multiple publication periods and alternative definitions of highly cited papers. The proposed approach achieves stable and competitive performance relative to existing methods and demonstrates strong generalization over time. Textual analysis further reveals that papers predicted as highly cited concentrate on recurring topics such as causal inference and deep learning. To facilitate practical use of the proposed approach, we further develop a WeChat mini program, \textit{Stat Highly Cited Papers}, which provides an accessible interface for early-stage citation impact assessment. Overall, our results provide empirical evidence that LLMs can capture meaningful early signals of long-term citation impact, while also highlighting their limitations as tools for research impact assessment.

2601.13605 2026-01-21 eess.SY cs.SY stat.AP

Outage Identification from Electricity Market Data: Quickest Change Detection Approach

Milad Hoseinpour, Shubhanshu Shekhar, Vladimir Dvorkin

Comments 7 pages, 2 figures, 1 table

详情
英文摘要

Power system outages expose market participants to significant financial risk unless promptly detected and hedged. We develop an outage identification method from public market signals grounded in the parametric quickest change detection (QCD) theory. Parametric QCD operates on stochastic data streams, distinguishing pre- and post-change regimes using the ratio of their respective probability density functions. To derive the density functions for normal and post-outage market signals, we exploit multi-parametric programming to decompose complex market signals into parametric random variables with a known density. These densities are then used to construct a QCD-based statistic that triggers an alarm as soon as the statistic exceeds an appropriate threshold. Numerical experiments on a stylized PJM testbed demonstrate rapid line outage identification from public streams of electricity demand and price data.

2601.13544 2026-01-21 physics.soc-ph econ.EM stat.AP

The Collapse of Multilayer Predation and the Emergence of a Monolithic Leviathan

Li Tuobang

Comments in Chinese language

详情
英文摘要

This paper constructs a multilayer recursive game model to demonstrate that in a rule vacuum environment, hierarchical predatory structures inevitably collapse into a monolithic political strongman system due to the conflict between exponentially growing rent dissipation and the rigidity of bottom-level survival constraints. We propose that the rise of a monolithic political strongman is essentially an "algorithmic entropy reduction" achieved through forceful means by the system to counteract the "informational entropy increase" generated by multilayer agency. However, the order gained at the expense of social complexity results in the stagnation of social evolutionary functions.

2601.13535 2026-01-21 stat.ME stat.AP

What is Overlap Weighting, How Has it Evolved, and When to Use It for Causal Inference?

Haidong Lu, Fan Li, Laine E. Thomas, Fan Li

Comments 26 pages, 1 table, 1 figure

详情
英文摘要

The growing availability of large health databases has expanded the use of observational studies for comparative effectiveness research. Unlike randomized trials, observational studies must adjust for systematic differences in patient characteristics between treatment groups. Propensity score methods, including matching, weighting, stratification, and regression adjustment, address this issue by creating groups that are comparable with respect to measured covariates. Among these approaches, overlap weighting (OW) has emerged as a principled and efficient method that emphasizes individuals at empirical equipoise, those who could plausibly receive either treatment. By assigning weights proportional to the probability of receiving the opposite treatment, OW targets the Average Treatment Effect in the Overlap population (ATO), achieves exact mean covariate balance under logistic propensity score models, and minimizes asymptotic variance. Over the last decade, the OW method has been recognized as a valuable confounding adjustment tool across the statistical, epidemiologic, and clinical research communities, and is increasingly applied in clinical and health studies. Given the growing interest in using observational data to emulate randomized trials and the capacity of OW to prioritize populations at clinical equipoise while achieving covariate balance (fundamental attributes of randomized studies), this article provides a concise overview of recent methodological developments in OW and practical guidance on when it represents a suitable choice for causal inference.

2601.13514 2026-01-21 stat.ME

Post-selection inference for penalized M-estimators via score thinning

Ronan Perry, Snigdha Panigrahi, Daniela Witten

详情
英文摘要

We consider inference for M-estimators after model selection using a sparsity-inducing penalty. While existing methods for this task require bespoke inference procedures, we propose a simpler approach, which relies on two insights: (i) adding and subtracting carefully-constructed noise to a Gaussian random variable with unknown mean and known variance leads to two \emph{independent} Gaussian random variables; and (ii) both the selection event resulting from penalized M-estimation, and the event that a standard (non-selective) confidence interval for an M-estimator covers its target, can be characterized in terms of an approximately normal ``score variable". We combine these insights to show that -- when the noise is chosen carefully -- there is asymptotic independence between the model selected using a noisy penalized M-estimator, and the event that a standard (non-selective) confidence interval on noisy data covers the selected parameter. Therefore, selecting a model via penalized M-estimation (e.g. \verb=glmnet= in \verb=R=) on noisy data, and then conducting \emph{standard} inference on the selected model (e.g. \verb=glm= in \verb=R=) using noisy data, yields valid inference: \emph{no bespoke methods are required}. Our results require independence of the observations, but only weak distributional requirements. We apply the proposed approach to conduct inference on the association between sex and smoking in a social network.

2601.13474 2026-01-21 cs.LG cs.AI math.OC stat.ML

Preconditioning Benefits of Spectral Orthogonalization in Muon

Jianhao Ma, Yu Huang, Yuejie Chi, Yuxin Chen

详情
英文摘要

The Muon optimizer, a matrix-structured algorithm that leverages spectral orthogonalization of gradients, is a milestone in the pretraining of large language models. However, the underlying mechanisms of Muon -- particularly the role of gradient orthogonalization -- remain poorly understood, with very few works providing end-to-end analyses that rigorously explain its advantages in concrete applications. We take a step by studying the effectiveness of a simplified variant of Muon through two case studies: matrix factorization, and in-context learning of linear transformers. For both problems, we prove that simplified Muon converges linearly with iteration complexities independent of the relevant condition number, provably outperforming gradient descent and Adam. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior. Our theory formalizes the preconditioning effect induced by spectral orthogonalization, offering insight into Muon's effectiveness in these matrix optimization problems and potentially beyond.

2601.13454 2026-01-21 stat.ME

Categorical distance correlation under general encodings and its application to high-dimensional feature screening

Qingyang Zhang

Comments 39 pages, 7 figures

详情
英文摘要

In this paper, we extend distance correlation to categorical data with general encodings, such as one-hot encoding for nominal variables and semicircle encoding for ordinal variables. Unlike existing methods, our approach leverages the spacing information between categories, which enhances the performance of distance correlation. Two estimates including the maximum likelihood estimate and a bias-corrected estimate are given, together with their limiting distributions under the null and alternative hypotheses. Furthermore, we establish the sure screening property for high-dimensional categorical data under mild conditions. We conduct a simulation study to compare the performance of different encodings, and illustrate their practical utility using the 2018 General Social Survey data.

2601.13449 2026-01-21 stat.ME stat.AP

Identifying Causes of Test Unfairness: Manipulability and Separability

Youmi Suk, Weicong Lyu

Comments 20 pages for the main text

详情
英文摘要

Differential item functioning (DIF) is a widely used statistical notion for identifying items that may disadvantage specific groups of test-takers. These groups are often defined by non-manipulable characteristics, e.g., gender, race/ethnicity, or English-language learner (ELL) status. While DIF can be framed as a causal fairness problem by treating group membership as the treatment variable, this invokes the long-standing controversy over the interpretation of causal effects for non-manipulable treatments. To better identify and interpret causal sources of DIF, this study leverages an interventionist approach using treatment decomposition proposed by Robins and Richardson (2010). Under this framework, we can decompose a non-manipulable treatment into intervening variables. For example, ELL status can be decomposed into English vocabulary unfamiliarity and classroom learning barriers, each of which influences the outcome through different causal pathways. We formally define separable DIF effects associated with these decomposed components, depending on the absence or presence of item impact, and provide causal identification strategies for each effect. We then apply the framework to biased test items in the SAT and Regents exams. We also provide formal detection methods using causal machine learning methods, namely causal forests and Bayesian additive regression trees, and demonstrate their performance through a simulation study. Finally, we discuss the implications of adopting interventionist approaches in educational testing practices.

2601.13436 2026-01-21 stat.ML cs.LG cs.SY eess.SP eess.SY math.ST stat.TH

Distribution-Free Confidence Ellipsoids for Ridge Regression with PAC Bounds

Szabolcs Szentpéteri, Balázs Csanád Csáji

详情
英文摘要

Linearly parametrized models are widely used in control and signal processing, with the least-squares (LS) estimate being the archetypical solution. When the input is insufficiently exciting, the LS problem may be unsolvable or numerically unstable. This issue can be resolved through regularization, typically with ridge regression. Although regularized estimators reduce the variance error, it remains important to quantify their estimation uncertainty. A possible approach for linear regression is to construct confidence ellipsoids with the Sign-Perturbed Sums (SPS) ellipsoidal outer approximation (EOA) algorithm. The SPS EOA builds non-asymptotic confidence ellipsoids under the assumption that the noises are independent and symmetric about zero. This paper introduces an extension of the SPS EOA algorithm to ridge regression, and derives probably approximately correct (PAC) upper bounds for the resulting region sizes. Compared with previous analyses, our result explicitly show how the regularization parameter affects the region sizes, and provide tighter bounds under weaker excitation assumptions. Finally, the practical effect of regularization is also demonstrated via simulation experiments.

2601.13405 2026-01-21 stat.ME

Associating High-Dimensional Longitudinal Datasets through an Efficient Cross-Covariance Decomposition

Jianbin Tan, Pixu Shi

Comments 30 pages, 6 figures

详情
英文摘要

Understanding associations between paired high-dimensional longitudinal datasets is a fundamental yet challenging problem that arises across scientific domains, including longitudinal multi-omic studies. The difficulty stems from the complex, time-varying cross-covariance structure coupled with high dimensionality, which complicates both model formulation and statistical estimation. To address these challenges, we propose a new framework, termed Functional-Aggregated Cross-covariance Decomposition (FACD), tailored for canonical cross-covariance analysis between paired high-dimensional longitudinal datasets through a statistically efficient and theoretically grounded procedure. Unlike existing methods that are often limited to low-dimensional data or rely on explicit parametric modeling of temporal dynamics, FACD adaptively learns temporal structure by aggregating signals across features and naturally accommodates variable selection to identify the most relevant features associated across datasets. We establish statistical guarantees for FACD and demonstrate its advantages over existing approaches through extensive simulation studies. Finally, we apply FACD to a longitudinal multi-omic human study, revealing blood molecules with time-varying associations across omic layers during acute exercise.

2601.13396 2026-01-21 stat.AP

A Two-Stage Bayesian Framework for Multi-Fidelity Online Updating of Spatial Fragility Fields

Abdullah M. Braik, Maria Koliou

Comments 46 pages, 14 figures, 2 tables. This is a preprint and has not been peer reviewed

详情
英文摘要

This paper addresses a long-standing gap in natural hazard modeling by unifying physics-based fragility functions with real-time post-disaster observations. It introduces a Bayesian framework that continuously refines regional vulnerability estimates as new data emerges. The framework reformulates physics-informed fragility estimates into a Probit-Normal (PN) representation that captures aleatory variability and epistemic uncertainty in an analytically tractable form. Stage 1 performs local Bayesian updating by moment-matching PN marginals to Beta surrogates that preserve their probability shapes, enabling conjugate Beta-Bernoulli updates with soft, multi-fidelity observations. Fidelity weights encode source reliability, and the resulting Beta posteriors are re-projected into PN form, producing heteroscedastic fragility estimates whose variances reflect data quality and coverage. Stage 2 assimilates these heteroscedastic observations within a probit-warped Gaussian Process (GP), which propagates information from high-fidelity sites to low-fidelity and unobserved regions through a composite kernel that links space, archetypes, and correlated damage states. The framework is applied to the 2011 Joplin tornado, where wind-field priors and computer-vision damage assessments are fused under varying assumptions about tornado width, sampling strategy, and observation completeness. Results show that the method corrects biased priors, propagates information spatially, and produces uncertainty-aware exceedance probabilities that support real-time situational awareness.

2601.13362 2026-01-21 stat.AP cs.LG

Improving Geopolitical Forecasts with Bayesian Networks

Matthew Martin

Comments 34 pages, 3 figures

详情
英文摘要

This study explores how Bayesian networks (BNs) can improve forecast accuracy compared to logistic regression and recalibration and aggregation methods, using data from the Good Judgment Project. Regularized logistic regression models and a baseline recalibrated aggregate were compared to two types of BNs: structure-learned BNs with arcs between predictors, and naive BNs. Four predictor variables were examined: absolute difference from the aggregate, forecast value, days prior to question close, and mean standardized Brier score. Results indicated the recalibrated aggregate achieved the highest accuracy (AUC = 0.985), followed by both types of BNs, then the logistic regression models. Performance of the BNs was likely harmed by reduced information from the discretization process and violation of the assumption of linearity likely harmed the logistic regression models. Future research should explore hybrid approaches combining BNs with logistic regression, examine additional predictor variables, and account for hierarchical data dependencies.

2601.13347 2026-01-21 math.NA cs.NA math.OC math.ST stat.TH

A Scalable Sequential Framework for Dynamic Inverse Problems via Model Parameter Estimation

Aryeh Keating, Mirjeta Pasha

Comments 27 Pages, 8 Figures

详情
英文摘要

Large-scale dynamic inverse problems are often ill-posed due to model complexity and the high dimensionality of the unknown parameters. Regularization is commonly employed to mitigate ill-posedness by incorporating prior information and structural constraints. However, classical regularization formulations are frequently infeasible in this setting due to prohibitive memory requirements, necessitating sequential methods that process data and state information online, eliminating the need to form the full space-time problem. In this work, we propose a memory-efficient framework for reconstructing dynamic sequences of undersampled images from computerized tomography data that requires minimal hyperparameter tuning. The approach is based on a prior-informed, dimension-reduced Kalman filter with smoothing. While well suited for dynamic image reconstruction, practical deployment is challenging when the state transition model and covariance parameters must be initialized without prior knowledge and estimated in a single pass. To address these limitations, we integrate regularized motion models with expectation-maximization strategies for the estimation of state transition dynamics and error covariances within the Kalman filtering framework. We demonstrate the effectiveness of the proposed method through numerical experiments on limited-angle and single-shot computerized tomography problems, highlighting improvements in reconstruction accuracy, memory efficiency, and computational cost.

2601.13281 2026-01-21 econ.EM q-fin.RM stat.AP

Spectral Dynamics and Regularization for High-Dimensional Copulas

Koos B. Gubbels, Andre Lucas

详情
英文摘要

We introduce a novel model for time-varying, asymmetric, tail-dependent copulas in high dimensions that incorporates both spectral dynamics and regularization. The dynamics of the dependence matrix' eigenvalues are modeled in a score-driven way, while biases in the unconditional eigenvalue spectrum are resolved by non-linear shrinkage. The dynamic parameterization of the copula dependence matrix ensures that it satisfies the appropriate restrictions at all times and for any dimension. The model is parsimonious, computationally efficient, easily scalable to high dimensions, and performs well for both simulated and empirical data. In an empirical application to financial market dynamics using 100 stocks from 10 different countries and 10 different industry sectors, we find that our copula model captures both geographic and industry related co-movements and outperforms recent computationally more intensive clustering-based factor copula alternatives. Both the spectral dynamics and the regularization contribute to the new model's performance. During periods of market stress, we find that the spectral dynamics reveal strong increases in international stock market dependence, which causes reductions in diversification potential and increases in systemic risk.

2601.13272 2026-01-21 cs.LG stat.CO stat.ML

Multi-level Monte Carlo Dropout for Efficient Uncertainty Quantification

Aaron Pim, Tristan Pryer

Comments 26 pages, 11 figures

详情
英文摘要

We develop a multilevel Monte Carlo (MLMC) framework for uncertainty quantification with Monte Carlo dropout. Treating dropout masks as a source of epistemic randomness, we define a fidelity hierarchy by the number of stochastic forward passes used to estimate predictive moments. We construct coupled coarse--fine estimators by reusing dropout masks across fidelities, yielding telescoping MLMC estimators for both predictive means and predictive variances that remain unbiased for the corresponding dropout-induced quantities while reducing sampling variance at fixed evaluation budget. We derive explicit bias, variance and effective cost expressions, together with sample-allocation rules across levels. Numerical experiments on forward and inverse PINNs--Uzawa benchmarks confirm the predicted variance rates and demonstrate efficiency gains over single-level MC-dropout at matched cost.

2601.13254 2026-01-21 math.ST math.AP math.FA math.PR stat.TH

Inverting the Fisher information operator in non-linear models

Dimitri Konen

详情
英文摘要

We consider non-linear regression models corrupted by generic noise when the regression functions form a non-linear subspace of L^2, relevant in non-linear PDE inverse problems and data assimilation. We show that when the score of the model is injective, the Fisher information operator is automatically invertible between well-identified Hilbert spaces, and we provide an operational characterization of these spaces. This allows us to construct in broad generality the efficient Gaussian involved in the classical minimax and convolution theorems to establish information lower bounds, that are typically achieved by Bayesian algorithms thus showing optimality of these methods. We illustrate our results on time-evolution PDE models for reaction-diffusion and Navier-Stokes equations.

2512.13642 2026-01-21 econ.EM stat.ML

From Many Models, One: Macroeconomic Forecasting with Reservoir Ensembles

Giovanni Ballarin, Lyudmila Grigoryeva, Yui Ching Li

Comments Updated manuscript with shortened main text

详情
英文摘要

Model combination is a powerful approach for achieving superior performance compared to selecting a single model. We study both theoretically and empirically the effectiveness of ensembles of Multi-Frequency Echo State Networks (MFESNs), which have been shown to achieve state-of-the-art macroeconomic time series forecasting results (Ballarin et al., 2024a). The Hedge and Follow-the-Leader schemes are discussed, and their online learning guarantees are extended to settings with dependent data. In empirical applications, the proposed Ensemble Echo State Networks demonstrate significantly improved predictive performance relative to individual MFESN models.

2511.21603 2026-01-21 math.ST stat.TH

Uniform inference for kernel instrumental variable regression

Marvin Lob, Rahul Singh, Suhas Vijaykumar

详情
英文摘要

Instrumental variable regression is a foundational tool for causal analysis across the social and biomedical sciences. Recent advances use kernel methods to estimate nonparametric causal relationships, with general data types, while retaining a simple closed-form expression. Empirical researchers ultimately need reliable inference on causal estimates; however, uniform confidence sets for the method remain unavailable. To fill this gap, we develop valid and sharp confidence sets for kernel instrumental variable regression, allowing general nonlinearities and data types. Computationally, our bootstrap procedure requires only a single run of the kernel instrumental variable regression estimator. Theoretically, it relies on the same key assumptions. Overall, we provide a practical procedure for inference that substantially increases the value of kernel methods for causal analysis.

2511.03694 2026-01-21 stat.CO

Robust Global Fr'echet Regression via Weight Regularization

Hao Li, Shonosuke Sugasawa, Shota Katayama

详情
英文摘要

The Fréchet regression is a useful method for modeling random objects in a general metric space given Euclidean covariates. However, the conventional approach could be sensitive to outlying objects in the sense that the distance from the regression surface is large compared to the other objects. In this study, we develop a robust version of the global Fréchet regression by incorporating weight parameters into the objective function. We then introduce the Elastic net regularization, favoring a sparse vector of robust parameters to control the influence of outlying objects. We provide a computational algorithm to iteratively estimate the regression function and weight parameters, with providing a linear convergence property. We also propose the Bayesian information criterion to select the tuning parameters for regularization, which gives adaptive robustness along with observed data. The finite sample performance of the proposed method is demonstrated through numerical studies on matrix and distribution responses.

2509.11741 2026-01-21 stat.CO

Tidy simulation: Designing robust, reproducible, and scalable Monte Carlo simulations

Erik-Jan van Kesteren

Comments 16 pages, 3 figures

详情
英文摘要

Monte Carlo simulation studies are at the core of the modern applied, computational, and theoretical statistical literature. Simulation is a broadly applicable research tool, used to collect data on the relative performance of methods or data analysis approaches under a well-defined data-generating process. However, extant literature focuses largely on design aspects of simulation, rather than implementation strategies aligned with the current state of (statistical) programming languages, portable data formats, and multi-node cluster computing. In this work, I propose tidy simulation: a simple, language-agnostic, yet flexible functional framework for designing, writing, and running simulation studies. It has four components: a tidy simulation grid, a data generation function, an analysis function, and a results table. Using this structure, even the smallest simulations can be written in a consistent, modular way, yet they can be readily scaled to thousands of nodes in a computer cluster should the need arise. Tidy simulation also supports the iterative, sometimes exploratory nature of simulation-based experiments. By adopting the tidy simulation approach, researchers can implement their simulations in a robust, reproducible, and scalable way, which contributes to high-quality statistical science.

2509.03515 2026-01-21 cs.RO cs.AI cs.LG cs.SY eess.SY stat.AP

Can the Waymo Open Motion Dataset Support Realistic Behavioral Modeling? A Validation Study with Naturalistic Trajectories

Yanlin Zhang, Sungyong Chung, Nachuan Li, Dana Monzer, Hani S. Mahmassani, Samer H. Hamdar, Alireza Talebpour

详情
英文摘要

The Waymo Open Motion Dataset (WOMD) has become a popular resource for data-driven modeling of autonomous vehicles (AVs) behavior. However, its validity for behavioral analysis remains uncertain due to proprietary post-processing, the absence of error quantification, and the segmentation of trajectories into 20-second clips. This study examines whether WOMD accurately captures the dynamics and interactions observed in real-world AV operations. Leveraging an independently collected naturalistic dataset from Level 4 AV operations in Phoenix, Arizona (PHX), we perform comparative analyses across three representative urban driving scenarios: discharging at signalized intersections, car-following, and lane-changing behaviors. For the discharging analysis, headways are manually extracted from aerial video to ensure negligible measurement error. For the car-following and lane-changing cases, we apply the Simulation-Extrapolation (SIMEX) method to account for empirically estimated error in the PHX data and use Dynamic Time Warping (DTW) distances to quantify behavioral differences. Results across all scenarios consistently show that behavior in PHX falls outside the behavioral envelope of WOMD. Notably, WOMD underrepresents short headways and abrupt decelerations. These findings suggest that behavioral models calibrated solely on WOMD may systematically underestimate the variability, risk, and complexity of naturalistic driving. Caution is therefore warranted when using WOMD for behavior modeling without proper validation against independently collected data.

2508.20257 2026-01-21 cs.LG stat.ML

Discovering equations from data: symbolic regression in dynamical systems

Beatriz R. Brum, Luiza Lober, Isolde Previdelli, Francisco A. Rodrigues

详情
英文摘要

The process of discovering equations from data lies at the heart of physics and in many other areas of research, including mathematical ecology and epidemiology. Recently, machine learning methods known as symbolic regression emerged as a way to automate this task. This study presents an overview of the current literature on symbolic regression, while also comparing the efficiency of five state-of-the-art methods in recovering the governing equations from nine processes, including chaotic dynamics and epidemic models. Benchmark results demonstrate the PySR method as the most suitable for inferring equations, with some estimates being indistinguishable from the original analytical forms. These results highlight the potential of symbolic regression as a robust tool for inferring and modeling real-world phenomena.

2508.00770 2026-01-21 math.ST stat.ME stat.TH

On admissibility in post-hoc hypothesis testing

Ben Chugg, Tyron Lardy, Aaditya Ramdas, Peter Grünwald

Comments 58 pages. To appear in the International Journal of Approximate Reasoning

详情
英文摘要

The validity of classical hypothesis testing requires the significance level $α$ be fixed before any statistical analysis takes place. This is a stringent requirement. For instance, it prohibits updating $α$ during (or after) an experiment due to changing concern about the cost of false positives, or to reflect unexpectedly strong evidence against the null. Perhaps most disturbingly, witnessing a p-value $p\llα$ vs $p= α- ε$ for tiny $ε> 0$ has no (statistical) relevance for any downstream decision-making. Following recent work of Grünwald (2024), we develop a theory of post-hoc hypothesis testing, enabling $α$ to be chosen after seeing and analyzing the data. To study "good" post-hoc tests we introduce $Γ$-admissibility, where $Γ$ is a set of adversaries which map the data to a significance level. We classify the set of $Γ$-admissible rules for various sets $Γ$, showing they must be based on e-values, and recover the Neyman-Pearson lemma when $Γ$ is the constant map.

2507.01726 2026-01-21 quant-ph physics.chem-ph stat.ML

Generative flow-based warm start of the variational quantum eigensolver

Hang Zou, Martin Rahm, Anton Frisk Kockum, Simon Olsson

Comments 20 pages; 8 figures

Journal ref npj Quantum Information volume 12, 5 (2026)

详情
英文摘要

Hybrid quantum-classical algorithms like the variational quantum eigensolver (VQE) show promise for quantum simulations on near-term quantum devices, but are often limited by complex objective functions and expensive optimization procedures. Here, we propose Flow-VQE, a generative framework leveraging conditional normalizing flows with parameterized quantum circuits to efficiently generate high-quality variational parameters. By embedding a generative model into the VQE optimization loop through preference-based training, Flow-VQE enables quantum gradient-free optimization and offers a systematic approach for parameter transfer, accelerating convergence across related problems through warm-started optimization. We compare Flow-VQE to a number of standard benchmarks through numerical simulations on molecular systems, including hydrogen chains, water, ammonia, and benzene. We find that Flow-VQE outperforms baseline optimization algorithms, achieving computational accuracy with fewer circuit evaluations (improvements range from modest to more than two orders of magnitude) and, when used to warm-start the optimization of new systems, accelerates subsequent fine-tuning by up to 50-fold compared with Hartree--Fock initialization. Therefore, we believe Flow-VQE can become a pragmatic and versatile paradigm for leveraging generative modeling to reduce the costs of variational quantum algorithms.

2506.22324 2026-01-21 stat.ME

General measures of effect size to calculate power and sample size for Wald tests with generalized linear models

Amy L Cochran, Shijie Yuan, Paul J Rathouz

详情
英文摘要

Power and sample size calculations for Wald tests in generalized linear models (GLMs) are often limited to specific cases like logistic regression. More general methods typically require detailed study parameters that are difficult to obtain during planning. We introduce two new effect size measures for estimating power and sample size in studies using Wald tests across any GLM. These measures accommodate any number of predictors or adjusters and require only basic study information. We provide practical guidance for interpreting and applying these measures to approximate a key parameter in power calculations. We also derive asymptotic bounds on the relative error of these approximations, showing that accuracy depends on features of the GLM such as the nonlinearity of the link function. To complement this analysis, we conduct simulation studies across common model specifications, identifying best use cases and opportunities for improvement. Finally, we test the methods in finite samples to confirm their practical utility, using a case study on the relationship between education and receipt of mental health treatment.

2506.15511 2026-01-21 stat.AP

A sequential ensemble approach to epidemic modeling: Combining Hawkes and SEIR models using SMC$^2$

Dhorasso Temfack, Jason Wyse

详情
英文摘要

This paper proposes a sequential ensemble methodology for epidemic modeling that integrates discrete-time Hawkes processes (DTHP) and Susceptible-Exposed-Infectious-Removed (SEIR) models. Motivated by the need for accurate and reliable epidemic forecasts to inform timely public health interventions, we develop a flexible model averaging (MA) framework using Sequential Monte Carlo Squared. While generating estimates from each model individually, our approach dynamically assigns them weights based on their incrementally estimated marginal likelihoods, accounting for both model and parameter uncertainty, to produce a single ensemble estimate. We assess the methodology through simulation studies mimicking abrupt changes in epidemic dynamics, followed by an application to the Irish influenza and COVID-19 epidemics. Our results show that combining the two models can improve both estimates of the infection trajectory and reproduction number compared to using either model alone. Moreover, the MA consistently produces more stable and informative estimates of the time-varying reproduction number, with credible intervals that provide a realistic assessment of uncertainty. These features are particularly useful when epidemic dynamics change rapidly, enabling more reliable short-term forecasts and timely public health decisions. This research contributes to pandemic preparedness by enhancing forecast reliability and supporting more informed public health responses.

2505.18918 2026-01-21 stat.ML cs.LG eess.SP

ALPCAHUS: Subspace Clustering for Heteroscedastic Data

Javier Salazar Cavazos, Jeffrey A Fessler, Laura Balzano

Comments Manuscript submitted to IEEE Transactions on Signal Processing (TSP), revised, and pending acceptance

详情
英文摘要

Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. Various methods have been proposed to extend PCA to the union of subspace (UoS) setting for clustering data that comes from multiple subspaces like K-Subspaces (KSS). However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a heteroscedastic-based subspace clustering method, named ALPCAHUS, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace bases associated with the low-rank structure of the data. This clustering algorithm builds on K-Subspaces (KSS) principles by extending the recently proposed heteroscedastic PCA method, named LR-ALPCAH, for clusters with heteroscedastic noise in the UoS setting. Simulations and real-data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing clustering algorithms. Code available at https://github.com/javiersc1/ALPCAHUS.

2502.04162 2026-01-21 stat.AP cs.LG cs.SI stat.ML

Network-Level Measures of Mobility from Aggregated Origin-Destination Data

Alisha Foster, David A. Meyer, Asif Shakeel

Comments 34 pages, 20 figures

详情
英文摘要

We introduce a framework for defining and interpreting collective mobility measures from spatially and temporally aggregated origin--destination (OD) data. Rather than characterizing individual behavior, these measures describe properties of the mobility system itself: how network organization, spatial structure, and routing constraints shape and channel population movement. In this view, aggregate mobility flows reveal aspects of connectivity, functional organization, and large-scale daily activity patterns encoded in the underlying transport and spatial network. To support interpretation and provide a controlled reference for the proposed time-elapsed calculations, we first employ an independent, network-driven synthetic data generator in which trajectories arise from prescribed system structure rather than observed data. This controlled setting provides a concrete reference for understanding how the proposed measures reflect network organization and flow constraints. We then apply the measures to fully anonymized data from the NetMob 2024 Data Challenge, examining their behavior under realistic limitations of spatial and temporal aggregation. While such data constraints restrict dynamical resolution, the resulting metrics still exhibit interpretable large-scale structure and temporal variation at the city scale.

2502.02986 2026-01-21 math.ST stat.TH

Matching Criterion for Identifiability in Sparse Factor Analysis

Nils Sturma, Miriam Kranzlmueller, Irem Portakal, Mathias Drton

详情
英文摘要

Factor analysis models explain dependence among observed variables by a smaller number of unobserved factors. A main challenge in confirmatory factor analysis is determining whether the factor loading matrix is identifiable from the observed covariance matrix. The factor loading matrix captures the linear effects of the factors and, if unrestricted, can only be identified up to an orthogonal transformation of the factors. However, in many applications the factor loadings exhibit an interesting sparsity pattern that may lead to identifiability up to column signs. We study this phenomenon by connecting sparse confirmatory factor analysis models to bipartite graphs and providing sufficient graphical conditions for identifiability of the factor loading matrix up to column signs. In contrast to previous work, our main contribution, the matching criterion, exploits sparsity by operating locally on the graph structure, thereby improving existing conditions. Our criterion is efficiently decidable in time that is polynomial in the size of the graph, when restricting the search steps to sets of bounded size.

2501.17512 2026-01-21 stat.ML cs.LG

A survey on Clustered Federated Learning: Taxonomy, Analysis and Applications

Michael Ben Ali, Omar El-Rifai, Imen Megdiche, André Peninou, Olivier Teste

详情
英文摘要

As Federated Learning (FL) expands, the challenge of non-independent and identically distributed (non-IID) data becomes critical. Clustered Federated Learning (CFL) addresses this by training multiple specialized models, each representing a group of clients with similar data distributions. However, the term ''CFL'' has increasingly been applied to operational strategies unrelated to data heterogeneity, creating significant ambiguity. This survey provides a systematic review of the CFL literature and introduces a principled taxonomy that classifies algorithms into Server-side, Client-side, and Metadata-based approaches. Our analysis reveals a distinct dichotomy: while theoretical research prioritizes privacy-preserving Server/Client-side methods, real-world applications in IoT, Mobility, and Energy overwhelmingly favor Metadata-based efficiency. Furthermore, we explicitly distinguish ''Core CFL'' (grouping clients for non-IID data) from ''Clustered X FL'' (operational variants for system heterogeneity). Finally, we outline lessons learned and future directions to bridge the gap between theoretical privacy and practical efficiency.

2501.10263 2026-01-21 stat.ME math.ST stat.TH

Prior distributions for structured semi-orthogonal matrices

Michael Jauch, Marie-Christine Düker, Peter Hoff

Comments 23 pages, 5 figures

详情
英文摘要

Statistical models for multivariate data often include a semi-orthogonal matrix parameter. In many applications, there is reason to expect that the semi-orthogonal matrix parameter satisfies a structural assumption such as sparsity or smoothness. From a Bayesian perspective, these structural assumptions should be incorporated into an analysis through the prior distribution. In this work, we introduce a general approach to constructing prior distributions for structured semi-orthogonal matrices that leads to tractable posterior inference via parameter-expanded Markov chain Monte Carlo. We draw on recent results from random matrix theory to establish a theoretical basis for the proposed approach. We then introduce specific prior distributions for incorporating sparsity or smoothness and illustrate their use through applications to biological and oceanographic data.

2412.19012 2026-01-21 stat.ME

Dynamic networks clustering via mirror distance

Runbing Zheng, Avanti Athreya, Marta Zlatic, Michael Clayton, Carey E. Priebe

详情
英文摘要

The classification of different patterns of network evolution, for example in brain connectomes or social networks, is a key problem in network inference and modern data science. Building on the notion of a network's Euclidean mirror, which captures its evolution as a curve in Euclidean space, we develop the Dynamic Network Clustering through Mirror Distance (DNCMD), an algorithm for clustering dynamic networks based on a distance measure between their associated mirrors. We provide theoretical guarantees for DNCMD to achieve exact recovery of distinct evolutionary patterns for latent position random networks both when underlying vertex features change deterministically and when they follow a stochastic process. We validate our theoretical results through numerical simulations and demonstrate the application of DNCMD to understand edge functions in Drosophila larval connectome data, as well as to analyze temporal patterns in dynamic trade networks.

2411.19908 2026-01-21 stat.ML cs.LG

Another look at statistical inference with machine learning-imputed data

Jessica Gronsbell, Jianhui Gao, Zachary R. McCaw, Yaqi Shi, David Cheng

详情
英文摘要

From structural biology to epidemiology, predictions from machine learning (ML) models increasingly complement costly gold-standard data, enabling faster, more affordable, and scalable scientific inquiry. In response, prediction-based (PB) inference has emerged to support statistical analysis that combines a large volume of predicted data with a small amount of gold-standard data. The goals of PB inference are twofold: (i) to mitigate bias arising from prediction error and (ii) to improve efficiency relative to classical inference based solely on gold-standard data. While early PB inference methods primarily focused on bias mitigation, improving efficiency remains an active area of research. Motivated by connections between PB inference and longstanding problems in statistics and related fields, we draw on the two-phase sampling literature to introduce an approach for Z-estimation with ML-imputed outcomes that is guaranteed to match or exceed the efficiency of classical inference, regardless of prediction quality. We demonstrate the utility of our approach through theoretical and numerical analyses as well as an application to UK Biobank data. We further establish new connections between existing PB inference approaches and foundational and contemporary statistical methods.

2410.00858 2026-01-21 math.PR math.ST stat.CO stat.ML stat.TH

Entropy contraction of the Gibbs sampler under log-concavity

Filippo Ascolani, Hugo Lavenant, Giacomo Zanella

详情
英文摘要

The Gibbs sampler (a.k.a. Glauber dynamics and heat-bath algorithm) is a popular Markov Chain Monte Carlo algorithm which iteratively samples from the conditional distributions of a probability measure $π$ of interest. Under the assumption that $π$ is strongly log-concave, we show that the random scan Gibbs sampler contracts in relative entropy and provide a sharp characterization of the associated contraction rate. Assuming that evaluating conditionals is cheap compared to evaluating the joint density, our results imply that the number of full evaluations of $π$ needed for the Gibbs sampler to mix grows linearly with the condition number and is independent of the dimension. If $π$ is non-strongly log-concave, the convergence rate in entropy degrades from exponential to polynomial. Our techniques are versatile and extend to Metropolis-within-Gibbs schemes and the Hit-and-Run algorithm. A comparison with gradient-based schemes and the connection with the optimization literature are also discussed.

2409.02684 2026-01-21 q-bio.NC cs.LG stat.ML

Neural timescales from a computational perspective

Roxana Zeraati, Anna Levina, Jakob H. Macke, Richard Gao

Comments 21 pages, 5 figures, 3 boxes, 1 table

详情
英文摘要

Neural activity fluctuates over a wide range of timescales within and across brain areas. Experimental observations suggest that diverse neural timescales reflect information in dynamic environments. However, how timescales are defined and measured from brain recordings vary across the literature. Moreover, these observations do not specify the mechanisms underlying timescale variations, nor whether specific timescales are necessary for neural computation and brain function. Here, we synthesize three directions where computational approaches can distill the broad set of empirical observations into quantitative and testable theories: We review (i) how different data analysis methods quantify timescales across distinct behavioral states and recording modalities, (ii) how biophysical models provide mechanistic explanations for the emergence of diverse timescales, and (iii) how task-performing networks and machine learning models uncover the functional relevance of neural timescales. This integrative computational perspective thus complements experimental investigations, providing a holistic view on how neural timescales reflect the relationship between brain structure, dynamics, and behavior.

2409.00297 2026-01-21 cs.LG stat.ML

On Expressive Power of Quantized Neural Networks under Fixed-Point Arithmetic

Yeachan Park, Sejun Park, Geonho Hwang

详情
英文摘要

Existing works on the expressive power of neural networks typically assume real parameters and exact operations. In this work, we study the expressive power of quantized networks under discrete fixed-point parameters and inexact fixed-point operations with round-off errors. We first provide a necessary condition and a sufficient condition on fixed-point arithmetic and activation functions for quantized networks to represent all fixed-point functions from fixed-point vectors to fixed-point numbers. Then, we show that various popular activation functions satisfy our sufficient condition, e.g., Sigmoid, ReLU, ELU, SoftPlus, SiLU, Mish, and GELU. In other words, networks using those activation functions are capable of representing all fixed-point functions. We further show that our necessary condition and sufficient condition coincide under a mild condition on activation functions: e.g., for an activation function $σ$, there exists a fixed-point number $x$ such that $σ(x)=0$. Namely, we find a necessary and sufficient condition for a large class of activation functions. We lastly show that even quantized networks using binary weights in $\{-1,1\}$ can also represent all fixed-point functions for practical activation functions.

2404.07923 2026-01-21 stat.ME

BESS: A Bayesian Estimator of Sample Size

Dehua Bi, Yuan Ji

详情
英文摘要

We consider a Bayesian framework for estimating the sample size of a clinical trial. The new approach, called BESS, is built upon three pillars: Sample size of the trial, Evidence from the observed data, and Confidence of the final decision in the posterior inference. It uses a simple logic of "given the evidence from data, a specific sample size can achieve a degree of confidence in trial success." The key distinction between BESS and standard sample size estimation (SSE) is that SSE, typically based on Frequentist inference, specifies the true parameters values in its calculation to achieve properties under repeated sampling while BESS assumes possible outcome from the observed data to achieve high posterior probabilities of trial success. As a result, the calibration of the sample size is directly based on the probability of making a correct decision rather than type I or type II error rates. We demonstrate that BESS leads to a more interpretable statement for investigators, and can easily accommodates prior information as well as sample size re-estimation. We explore its performance in comparison to the standard SSE and demonstrate its usage through a case study of oncology optimization trial. An R tool is available at https://ccte.uchicago.edu/BESS.

2403.19196 2026-01-21 math.ST stat.TH

What Is a Good Imputation Under MAR Missingness?

Jeffrey Näf, Erwan Scornet, Julie Josse

详情
英文摘要

Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first investigate whether the widely used fully conditional specification (FCS) approach indeed identifies the correct conditional distributions. Based on this analysis, we propose three essential properties an ideal imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two out of the three criteria. We also discuss ways to compare imputation methods, based on distributional distances. Finally, numerical experiments illustrate the points made in this discussion.

2311.17797 2026-01-21 cs.LG stat.ME

Learning to Simulate: Generative Metamodeling via Quantile Regression

L. Jeff Hong, Yanxi Hou, Qingkai Zhang, Xiaowei Zhang

详情
英文摘要

Stochastic simulation models effectively capture complex system dynamics but are often too slow for real-time decision-making. Traditional metamodeling techniques learn relationships between simulator inputs and a single output summary statistic, such as the mean or median. These techniques enable real-time predictions without additional simulations. However, they require prior selection of one appropriate output summary statistic, limiting their flexibility in practical applications. We propose a new concept: generative metamodeling. It aims to construct a "fast simulator of the simulator," generating random outputs significantly faster than the original simulator while preserving approximately equal conditional distributions. Generative metamodels enable rapid generation of numerous random outputs upon input specification, facilitating immediate computation of any summary statistic for real-time decision-making. We introduce a new algorithm, quantile-regression-based generative metamodeling (QRGMM), and establish its distributional convergence and convergence rate. Extensive numerical experiments demonstrate QRGMM's efficacy compared to other state-of-the-art generative algorithms in practical real-time decision-making scenarios.

2309.14581 2026-01-21 stat.AP cs.CR econ.EM

Assessing Utility of Differential Privacy for RCTs

Kaitlyn R. Webb, Soumya Mukherjee, Aratrika Mustafi, Aleksandra Slavković, Lars Vilhuber

Comments Submitted

详情
英文摘要

Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that has been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms, that use as a base one of the basic differentially private (DP) algorithms, the perturbed histogram, to support the quality of statistical inference. We highlight challenges with the straight use of this algorithm and the stability-based histogram in our setting and described the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that relatively straightforward (at a high-level) privacy-preserving methods influenced by DP techniques allow for inference-valid protection of published data. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.

2307.02044 2026-01-21 math.ST cs.IT math.IT stat.TH

The distribution of Ridgeless least squares interpolators

Qiyang Han, Xiaocong Xu

详情
英文摘要

The Ridgeless minimum $\ell_2$-norm interpolator in overparametrized linear regression has attracted considerable attention in recent years in both machine learning and statistics communities. While it seems to defy conventional wisdom that overfitting leads to poor prediction, recent theoretical research on its $\ell_2$-type risks reveals that its norm minimizing property induces an `implicit regularization' that helps prediction in spite of interpolation. This paper takes a further step that aims at understanding its precise stochastic behavior as a statistical estimator. Specifically, we characterize the distribution of the Ridgeless interpolator in high dimensions, in terms of a Ridge estimator in an associated Gaussian sequence model with positive regularization, which provides a precise quantification of the prescribed implicit regularization in the most general distributional sense. Our distributional characterizations hold for general non-Gaussian random designs and extend uniformly to positively regularized Ridge estimators. As a direct application, we obtain a complete characterization for a general class of weighted $\ell_q$ risks of the Ridge(less) estimators that are previously only known for $q=2$ by random matrix methods. These weighted $\ell_q$ risks not only include the standard prediction and estimation errors, but also include the non-standard covariate shift settings. Our uniform characterizations further reveal a surprising feature of the commonly used generalized and $k$-fold cross-validation schemes: tuning the estimated $\ell_2$ prediction risk by these methods alone lead to simultaneous optimal $\ell_2$ in-sample, prediction and estimation risks, as well as the optimal length of debiased confidence intervals.

2305.14543 2026-01-21 stat.ML cs.LG

Deep Functional Factor Models: Forecasting High-Dimensional Functional Time Series via Bayesian Nonparametric Factorization

Yirui Liu, Xinghao Qiao, Yulong Pei, Liying Wang

Journal ref Proceedings of the 41st International Conference on Machine Learning 2024

详情
英文摘要

This paper introduces the Deep Functional Factor Model (DF2M), a Bayesian nonparametric model designed for analysis of high-dimensional functional time series. DF2M is built upon the Indian Buffet Process and the multi-task Gaussian Process, incorporating a deep kernel function that captures non-Markovian and nonlinear temporal dynamics. Unlike many black-box deep learning models, DF2M offers an explainable approach to utilizing neural networks by constructing a factor model and integrating deep neural networks within the kernel function. Additionally, we develop a computationally efficient variational inference algorithm to infer DF2M. Empirical results from four real-world datasets demonstrate that DF2M provides better explainability and superior predictive accuracy compared to conventional deep learning models for high-dimensional functional time series.

2302.09049 2026-01-21 cs.IT cs.LG math.IT math.ST stat.TH

Multiperiodic Processes: Ergodic Sources with a Sublinear Entropy

Łukasz Dębowski

Comments 30 pages; 1 figure

详情
英文摘要

We construct multiperiodic processes -- a simple example of stationary ergodic (but not mixing) processes over natural numbers that enjoy the vanishing entropy rate under a mild condition. Multiperiodic processes are supported on randomly shifted deterministic sequences called multiperiodic sequences, which can be efficiently generated using an algorithm called the Infinite Clock. Under a suitable parameterization, multiperiodic sequences exhibit relative frequencies of particular numbers given by Zipf's law. Exactly in the same setting, the respective multiperiodic processes satisfy an asymptotic power-law growth of block entropy, called Hilberg's law. Hilberg's law is deemed to hold for statistical language models, in particular.

2302.01607 2026-01-21 stat.ME

dynamite: An R Package for Dynamic Multivariate Panel Models

Santtu Tikka, Jouni Helske

Comments This is the version published in the Journal of Statistical Software

Journal ref Journal of Statistical Software, 115(5):1-42, 2025

详情
英文摘要

dynamite is an R package for Bayesian inference of intensive panel (time series) data comprising multiple measurements per multiple individuals measured in time. The package supports joint modeling of multiple response variables, time-varying and time-invariant effects, a wide range of discrete and continuous distributions, group-specific random effects, latent factors, and customization of prior distributions of the model parameters. Models in the package are defined via a user-friendly formula interface, and estimation of the posterior distribution of the model parameters takes advantage of state-of-the-art Markov chain Monte Carlo methods. The package enables efficient computation of both individual-level and aggregated predictions and offers a comprehensive suite of tools for visualization and model diagnostics.

2204.01540 2026-01-21 cs.CY stat.OT

Teaching for large-scale Reproducibility Verification

Lars Vilhuber, Hyuk Harry Son, Meredith Welch, David N. Wasser, Michael Darisse

详情
英文摘要

We describe a unique environment in which undergraduate students from various STEM and social science disciplines are trained in data provenance and reproducible methods, and then apply that knowledge to real, conditionally accepted manuscripts and associated replication packages. We describe in detail the recruitment, training, and regular activities. While the activity is not part of a regular curriculum, the skills and knowledge taught through explicit training of reproducible methods and principles, and reinforced through repeated application in a real-life workflow, contribute to the education of these undergraduate students, and prepare them for post-graduation jobs and further studies.

2110.12722 2026-01-21 econ.EM stat.ME

Functional instrumental variable regression with an application to estimating the impact of immigration on native wages

Dakyung Seong, Won-Ki Seo

Journal ref Econom. Theory 41 (2025) 1248-1283

详情
英文摘要

Functional linear regression gets its popularity as a statistical tool to study the relationship between function-valued response and exogenous explanatory variables. However, in practice, it is hard to expect that the explanatory variables of interest are perfectly exogenous, due to, for example, the presence of omitted variables and measurement error. Despite its empirical relevance, it was not until recently that this issue of endogeneity was studied in the literature on functional regression, and the development in this direction does not seem to sufficiently meet practitioners' needs; for example, this issue has been discussed with paying particular attention on consistent estimation and thus distributional properties of the proposed estimators still remain to be further explored. To fill this gap, this paper proposes new consistent FPCA-based instrumental variable estimators and develops their asymptotic properties in detail. Simulation experiments under a wide range of settings show that the proposed estimators perform considerably well. We apply our methodology to estimate the impact of immigration on native wages.

2109.08351 2026-01-21 econ.EM stat.ME

Regression Discontinuity Design with Potentially Many Covariates

Yoichi Arai, Taisuke Otsu, Myung Hwan Seo

Journal ref Econom. Theory 41 (2025) 1416-1451

详情
英文摘要

This paper studies the case of possibly high-dimensional covariates in the regression discontinuity design (RDD) analysis. In particular, we propose estimation and inference methods for the RDD models with covariate selection which perform stably regardless of the number of covariates. The proposed methods combine the local approach using kernel weights with $\ell_{1}$-penalization to handle high-dimensional covariates. We provide theoretical and numerical results which illustrate the usefulness of the proposed methods. Theoretically, we present risk and coverage properties for our point estimation and inference methods, respectively. Under certain special case, the proposed estimator becomes more efficient than the conventional covariate adjusted estimator at the cost of an additional sparsity condition. Numerically, our simulation experiments and empirical example show the robust behaviors of the proposed methods to the number of covariates in terms of bias and variance for point estimation and coverage probability and interval length for inference.

1912.07075 2026-01-21 math.NA cs.NA math.ST stat.TH

Boosted optimal weighted least-squares

Cécile Haberstich, Anthony Nouy, Guillaume Perrin

Comments This version contains a corrected version of appendix section B.1

Journal ref Math. Comp. (2022)

详情
英文摘要

This paper is concerned with the approximation of a function $u$ in a given approximation space $V_m$ of dimension $m$ from evaluations of the function at $n$ suitably chosen points. The aim is to construct an approximation of $u$ in $V_m$ which yields an error close to the best approximation error in $V_m$ and using as few evaluations as possible. Classical least-squares regression, which defines a projection in $V_m$ from $n$ random points, usually requires a large $n$ to guarantee a stable approximation and an error close to the best approximation error. This is a major drawback for applications where $u$ is expensive to evaluate. One remedy is to use a weighted least squares projection using $n$ samples drawn from a properly selected distribution. In this paper, we introduce a boosted weighted least-squares method which allows to ensure almost surely the stability of the weighted least squares projection with a sample size close to the interpolation regime $n=m$. It consists in sampling according to a measure associated with the optimization of a stability criterion over a collection of independent $n$-samples, and resampling according to this measure until a stability condition is satisfied. A greedy method is then proposed to remove points from the obtained sample. Quasi-optimality properties are obtained for the weighted least-squares projection, with or without the greedy procedure. The proposed method is validated on numerical examples and compared to state-of-the-art interpolation and weighted least squares methods.

1709.03473 2026-01-21 math.ST econ.EM stat.ML stat.TH

Is completeness necessary? Estimation in nonidentified linear models

Andrii Babii, Jean-Pierre Florens

Journal ref Econom. Theory 41 (2025) 1284-1321

详情
英文摘要

Modern data analysis depends increasingly on estimating models via flexible high-dimensional or nonparametric machine learning methods, where the identification of structural parameters is often challenging and untestable. In linear settings, this identification hinges on the completeness condition, which requires the nonsingularity of a high-dimensional matrix or operator and may fail for finite samples or even at the population level. Regularized estimators provide a solution by enabling consistent estimation of structural or average structural functions, sometimes even under identification failure. We show that the asymptotic distribution in these cases can be nonstandard. We develop a comprehensive theory of regularized estimators, which include methods such as high-dimensional ridge regularization, gradient descent, and principal component analysis (PCA). The results are illustrated for high-dimensional and nonparametric instrumental variable regressions and are supported through simulation experiments.

2601.13191 2026-01-21 stat.ML cs.LG

Empirical Risk Minimization with $f$-Divergence Regularization

Francisco Daunas, Iñaki Esnaola, Samir M. Perlaza, H. Vincent Poor

Comments Submitted to IEEE Transactions on Information Theory. arXiv admin note: substantial text overlap with arXiv:2502.14544, arXiv:2508.03314

详情
英文摘要

In this paper, the solution to the empirical risk minimization problem with $f$-divergence regularization (ERM-$f$DR) is presented and conditions under which the solution also serves as the solution to the minimization of the expected empirical risk subject to an $f$-divergence constraint are established. The proposed approach extends applicability to a broader class of $f$-divergences than previously reported and yields theoretical results that recover previously known results. Additionally, the difference between the expected empirical risk of the ERM-$f$DR solution and that of its reference measure is characterized, providing insights into previously studied cases of $f$-divergences. A central contribution is the introduction of the normalization function, a mathematical object that is critical in both the dual formulation and practical computation of the ERM-$f$DR solution. This work presents an implicit characterization of the normalization function as a nonlinear ordinary differential equation (ODE), establishes its key properties, and subsequently leverages them to construct a numerical algorithm for approximating the normalization factor under mild assumptions. Further analysis demonstrates structural equivalences between ERM-$f$DR problems with different $f$-divergences via transformations of the empirical risk. Finally, the proposed algorithm is used to compute the training and test risks of ERM-$f$DR solutions under different $f$-divergence regularizers. This numerical example highlights the practical implications of choosing different functions $f$ in ERM-$f$DR problems.

2601.12957 2026-01-21 math.ST math.PR stat.TH

Random tree Besov priors: Data-driven regularisation parameter selection

Hanne Kekkonen, Andreas Tataris

详情
英文摘要

We develop a data-driven algorithm for automatically selecting the regularisation parameter in Bayesian inversion under random tree Besov priors. One of the key challenges in Bayesian inversion is the construction of priors that are both expressive and computationally feasible. Random tree Besov priors, introduced in Kekkonen et al. (2023), provide a flexible framework for capturing local regularity properties and sparsity patterns in a wavelet basis. In this paper, we extend this approach by introducing a hierarchical model that enables data-driven selection of the wavelet density parameter, allowing the regularisation strength to adapt across scales while retaining computational efficiency. We focus on nonparametric regression and also present preliminary plug-and-play results for a deconvolution problem.

2601.12931 2026-01-21 cs.LG cs.AI stat.ML

Online Continual Learning for Time Series: a Natural Score-driven Approach

Edoardo Urettini, Daniele Atzeni, Ioanna-Yvonni Tsaknaki, Antonio Carta

详情
英文摘要

Online continual learning (OCL) methods adapt to changing environments without forgetting past knowledge. Similarly, online time series forecasting (OTSF) is a real-world problem where data evolve in time and success depends on both rapid adaptation and long-term memory. Indeed, time-varying and regime-switching forecasting models have been extensively studied, offering a strong justification for the use of OCL in these settings. Building on recent work that applies OCL to OTSF, this paper aims to strengthen the theoretical and practical connections between time series methods and OCL. First, we reframe neural network optimization as a parameter filtering problem, showing that natural gradient descent is a score-driven method and proving its information-theoretic optimality. Then, we show that using a Student's t likelihood in addition to natural gradient induces a bounded update, which improves robustness to outliers. Finally, we introduce Natural Score-driven Replay (NatSR), which combines our robust optimizer with a replay buffer and a dynamic scale heuristic that improves fast adaptation at regime drifts. Empirical results demonstrate that NatSR achieves stronger forecasting performance than more complex state-of-the-art methods.

2601.12930 2026-01-21 stat.ME

Guidance for Addressing Individual Time Effects in Cohort Stepped Wedge Cluster Randomized Trials: A Simulation Study

Jale Basten, Katja Ickstadt, Nina Timmesfeld

详情
英文摘要

Background: Stepped wedge cluster randomized trials (SW-CRTs) involve sequential measurements within clusters over time. Initially, all clusters start in the control condition before crossing over to the intervention on a staggered schedule. In cohort designs, secular trends, cluster-level changes, and individual-level changes (e.g., aging) must be considered. Methods: We performed a Monte Carlo simulation to analyze the influence of different time effects on the estimation of the intervention effect in cohort SW-CRTs. We compared four linear mixed models with different adjustment strategies, all including random intercepts for clustering and repeated measurements. We recorded the estimated fixed intervention effects and their corresponding model-based standard errors, derived from models both without and with cluster-robust variance estimators (CRVEs). Results: Models incorporating fixed categorical time effects, a fixed intervention effect, and two random intercepts provided unbiased estimates of the intervention effect in both closed and open cohort SW-CRTs. Fixed categorical time effects captured temporal cohort changes, while random individual effects accounted for baseline differences. However, these differences can cause large, non-normally distributed random individual effects. CRVEs provide reliable standard errors for the intervention effect, controlling the Type I error rate. Conclusions: Our simulation study is the first to assess individual-level changes over time in cohort SW-CRTs. Linear mixed models incorporating fixed categorical time effects and random cluster and individual effects yield unbiased intervention effect estimates. However, cluster-robust variance estimation is necessary when time-varying independent variables exhibit nonlinear effects. We recommend always using CRVEs.

2601.12864 2026-01-21 stat.AP

The impact of abnormal temperatures on crop yields in Italy: a functional quantile regression approach

Giovanni Bocchi, Alessandra Micheletti, Paolo Nota, Alessandro Olper

Comments 14 pages, 5 figures, 3 tables

详情
英文摘要

In this study, we apply functional regression analysis to identify the specific within-season periods during which temperature and precipitation anomalies most affect crop yields. Using provincial data for Italy from 1952 to 2023, we analyze two major cereals, maize and soft wheat, and quantify how abnormal weather conditions influence yields across the growing cycle. Unlike traditional statistical yield models, which assume additive temperature effects over the season, our approach is capable of capturing the timing and functional shape of weather impacts. In particular, the results show that above-average temperatures reduce maize yields primarily between June and August, while exerting a mild positive effect in April and October. For soft wheat, unusually high temperatures negatively affect yields from late March to early April. Precipitation also exerts season-dependent effects, improving wheat yields early in the season but reducing them later on. These findings highlight the importance of accounting for intra-seasonal weather patterns to provide insights for climate change adaptation strategies, including the timely adjustment of key crop management inputs.

2601.12633 2026-01-21 math.PR cs.NA math.NA stat.ML

New Trends in the Stability of Sinkhorn Semigroups

Pierre Del Moral, Ajay Jasra

详情
英文摘要

Entropic optimal transport problems play an increasingly important role in machine learning and generative modelling. In contrast with optimal transport maps which often have limited applicability in high dimensions, Schrodinger bridges can be solved using the celebrated Sinkhorn's algorithm, a.k.a. the iterative proportional fitting procedure. The stability properties of Sinkhorn bridges when the number of iterations tends to infinity is a very active research area in applied probability and machine learning. Traditional proofs of convergence are mainly based on nonlinear versions of Perron-Frobenius theory and related Hilbert projective metric techniques, gradient descent, Bregman divergence techniques and Hamilton-Jacobi-Bellman equations, including propagation of convexity profiles based on coupling diffusions by reflection methods. The objective of this review article is to present, in a self-contained manner, recently developed Sinkhorn/Gibbs-type semigroup analysis based upon contraction coefficients and Lyapunov-type operator-theoretic techniques. These powerful, off-the-shelf semigroup methods are based upon transportation cost inequalities (e.g. log-Sobolev, Talagrand quadratic inequality, curvature estimates), $ϕ$-divergences, Kantorovich-type criteria and Dobrushin contraction-type coefficients on weighted Banach spaces as well as Wasserstein distances. This novel semigroup analysis allows one to unify and simplify many arguments in the stability of Sinkhorn algorithm. It also yields new contraction estimates w.r.t. generalized $ϕ$-entropies, as well as weighted total variation norms, Kantorovich criteria and Wasserstein distances.

2601.12612 2026-01-21 cs.LG stat.ML

What Trace Powers Reveal About Log-Determinants: Closed-Form Estimators, Certificates, and Failure Modes

Piyush Sao

详情
英文摘要

Computing $\log\det(A)$ for large symmetric positive definite matrices arises in Gaussian process inference and Bayesian model comparison. Standard methods combine matrix-vector products with polynomial approximations. We study a different model: access to trace powers $p_k = \tr(A^k)$, natural when matrix powers are available. Classical moment-based approximations Taylor-expand $\log(λ)$ around the arithmetic mean. This requires $|λ- \AM| < \AM$ and diverges when $κ> 4$. We work instead with the moment-generating function $M(t) = \E[X^t]$ for normalized eigenvalues $X = λ/\AM$. Since $M'(0) = \E[\log X]$, the log-determinant becomes $\log\det(A) = n(\log \AM + M'(0))$ -- the problem reduces to estimating a derivative at $t = 0$. Trace powers give $M(k)$ at positive integers, but interpolating $M(t)$ directly is ill-conditioned due to exponential growth. The transform $K(t) = \log M(t)$ compresses this range. Normalization by $\AM$ ensures $K(0) = K(1) = 0$. With these anchors fixed, we interpolate $K$ through $m+1$ consecutive integers and differentiate to estimate $K'(0)$. However, this local interpolation cannot capture arbitrary spectral features. We prove a fundamental limit: no continuous estimator using finitely many positive moments can be uniformly accurate over unbounded conditioning. Positive moments downweight the spectral tail; $K'(0) = \E[\log X]$ is tail-sensitive. This motivates guaranteed bounds. From the same traces we derive upper bounds on $(\det A)^{1/n}$. Given a spectral floor $r \leq λ_{\min}$, we obtain moment-constrained lower bounds, yielding a provable interval for $\log\det(A)$. A gap diagnostic indicates when to trust the point estimate and when to report bounds. All estimators and bounds cost $O(m)$, independent of $n$. For $m \in \{4, \ldots, 8\}$, this is effectively constant time.

2601.12587 2026-01-21 stat.ML cs.LG

A Theory of Diversity for Random Matrices with Applications to In-Context Learning of Schrödinger Equations

Frank Cole, Yulong Lu, Shaurya Sehgal

详情
英文摘要

We address the following question: given a collection $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(N)}\}$ of independent $d \times d$ random matrices drawn from a common distribution $\mathbb{P}$, what is the probability that the centralizer of $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(N)}\}$ is trivial? We provide lower bounds on this probability in terms of the sample size $N$ and the dimension $d$ for several families of random matrices which arise from the discretization of linear Schrödinger operators with random potentials. When combined with recent work on machine learning theory, our results provide guarantees on the generalization ability of transformer-based neural networks for in-context learning of Schrödinger equations.

2601.12566 2026-01-21 econ.EM math.ST stat.ME stat.TH

Partial Identification under Stratified Randomization

Bruno Ferman, Davi Siqueira, Vitor Possebom

详情
英文摘要

This paper develops a unified framework for partial identification and inference in stratified experiments with attrition, accommodating both equal and heterogeneous treatment shares across strata. For equal-share designs, we apply recent theory for finely stratified experiments to Lee bounds, yielding closed-form, design-consistent variance estimators and properly sized confidence intervals. Simulations show that the conventional formula can overstate uncertainty, while our approach delivers tighter intervals. When treatment shares differ across strata, we propose a new strategy, which combines inverse probability weighting and global trimming to construct valid bounds even when strata are small or unbalanced. We establish identification, introduce a moment estimator, and extend existing inference results to stratified designs with heterogeneous shares, covering a broad class of moment-based estimators which includes the one we formulate. We also generalize our results to designs in which strata are defined solely by observed labels.

2601.12552 2026-01-21 stat.AP

Stop using limiting stimuli as a measure of sensitivities of energetic materials

Dennis Christensen, Geir Petter Novik

详情
英文摘要

Accurately estimating the sensitivity of explosive materials is a potentially life-saving task which requires standardised protocols across nations. One of the most widely applied procedures worldwide is the so-called '1-In-6' test from the United Nations (UN) Manual of Tests in Criteria, which estimates a 'limiting stimulus' for a material. In this paper we demonstrate that, despite their popularity, limiting stimuli are not a well-defined notion of sensitivity and do not provide reliable information about a material's susceptibility to ignition. In particular, they do not permit construction of confidence intervals to quantify estimation uncertainty. We show that continued reliance on limiting stimuli through the 1-In-6 test has caused needless confusion in energetic materials research, both in theoretical studies and practical safety applications. To remedy this problem, we consider three well-founded alternative approaches to sensitivity testing to replace limiting stimulus estimation. We compare their performance in an extensive simulation study and apply the best-performing approach to real data, estimating the friction sensitivity of pentaerythritol tetranitrate (PETN).

2601.12540 2026-01-21 stat.ME math.ST stat.TH

Rerandomization for quantile treatment effects

Tingxuan Han, Yuhao Wang

Comments 67 pages, 0 figure

详情
英文摘要

Although complete randomization is widely regarded as the gold standard for causal inference, covariate imbalance can still arise by chance in finite samples. Rerandomization has emerged as an effective tool to improve covariate balance across treatment groups and enhance the precision of causal effect estimation. While existing work focuses on average treatment effects, quantile treatment effects (QTEs) provide a richer characterization of treatment heterogeneity by capturing distributional shifts in outcomes, which is crucial for policy evaluation and equity-oriented research. In this article, we establish the asymptotic properties of the QTE estimator under rerandomization within a finite-population framework, without imposing any distributional or modeling assumptions on the covariates or outcomes.The estimator exhibits a non-Gaussian asymptotic distribution, represented as a linear combination of Gaussian and truncated Gaussian random variables. To facilitate inference, we propose a conservative variance estimator and construct corresponding confidence interval. Our theoretical analysis demonstrates that rerandomization improves efficiency over complete randomization under mild regularity conditions. Simulation studies further support the theoretical findings and illustrate the practical advantages of rerandomization for QTE estimation.

2601.12518 2026-01-21 cs.LG cs.AI stat.ML

Cooperative Multi-agent RL with Communication Constraints

Nuoya Xiong, Aarti Singh

Comments 33 pages

详情
英文摘要

Cooperative MARL often assumes frequent access to global information in a data buffer, such as team rewards or other agents' actions, which is typically unrealistic in decentralized MARL systems due to high communication costs. When communication is limited, agents must rely on outdated information to estimate gradients and update their policies. A common approach to handle missing data is called importance sampling, in which we reweigh old data from a base policy to estimate gradients for the current policy. However, it quickly becomes unstable when the communication is limited (i.e. missing data probability is high), so that the base policy in importance sampling is outdated. To address this issue, we propose a technique called base policy prediction, which utilizes old gradients to predict the policy update and collect samples for a sequence of base policies, which reduces the gap between the base policy and the current policy. This approach enables effective learning with significantly fewer communication rounds, since the samples of predicted base policies could be collected within one communication round. Theoretically, we show that our algorithm converges to an $\varepsilon$-Nash equilibrium in potential games with only $O(\varepsilon^{-3/4})$ communication rounds and $O(poly(\max_i |A_i|)\varepsilon^{-11/4})$ samples, improving existing state-of-the-art results in communication cost, as well as sample complexity without the exponential dependence on the joint action space size. We also extend these results to general Markov Cooperative Games to find an agent-wise local maximum. Empirically, we test the base policy prediction algorithm in both simulated games and MAPPO for complex environments.

2601.12515 2026-01-21 stat.CO

Bayesian Inference for Partially Observed McKean-Vlasov SDEs with Full Distribution Dependence

Ning Ning, Amin Wu

Comments 23 pages, 30 pages supplementary

详情
英文摘要

McKean-Vlasov stochastic differential equations (MVSDEs) describe systems whose dynamics depend on both individual states and the population distribution, and they arise widely in neuroscience, finance, and epidemiology. In many applications the system is only partially observed, making inference very challenging when both drift and diffusion coefficients depend on the evolving empirical law. This paper develops a Bayesian framework for latent state inference and parameter estimation in such partially observed MVSDEs. We combine time-discretization with particle-based approximations to construct tractable likelihood estimators, and we design two particle Markov chain Monte Carlo (PMCMC) algorithms: a single-level PMCMC method and a multilevel PMCMC (MLPMCMC) method that couples particle systems across discretization levels. The multilevel construction yields correlated likelihood estimates and achieves mean square error $(O(\varepsilon^2))$ at computational cost $(O(\varepsilon^{-6}))$, improving on the $(O(\varepsilon^{-7}))$ complexity of single-level schemes. We address the fully law-dependent diffusion setting which is the most general formulation of MVSDEs, and provide theoretical guarantees under standard regularity assumptions. Numerical experiments confirm the efficiency and accuracy of the proposed methodology.

2601.12478 2026-01-21 stat.AP

Assessing Interactive Causes of an Occurred Outcome Due to Two Binary Exposures

Shanshan Luo, Wei Li, Xueli Wang, Shaojie Wei, Zhi Geng

详情
英文摘要

In contrast to evaluating treatment effects, causal attribution analysis focuses on identifying the key factors responsible for an observed outcome. For two binary exposure variables and a binary outcome variable, researchers need to assess not only the likelihood that an observed outcome was caused by a particular exposure, but also the likelihood that it resulted from the interaction between the two exposures. For example, in the case of a male worker who smoked, was exposed to asbestos, and developed lung cancer, researchers aim to explore whether the cancer resulted from smoking, asbestos exposure, or their interaction. Even in randomized controlled trials, widely regarded as the gold standard for causal inference, identifying and evaluating retrospective causal interactions between two exposures remains challenging. In this paper, we define posterior probabilities to characterize the interactive causes of an observed outcome. We establish the identifiability of posterior probabilities by using a secondary outcome variable that may appear after the primary outcome. We apply the proposed method to the classic case of smoking and asbestos exposure. Our results indicate that for lung cancer patients who smoked and were exposed to asbestos, the disease is primarily attributable to the synergistic effect between smoking and asbestos exposure.

2601.12425 2026-01-21 stat.ME stat.CO

Robust semi-parametric mixtures of linear experts using the contaminated Gaussian distribution

Peterson Mambondimumwe, Sphiwe B. Skhosana, Najmeh Nakhaei Rad

详情
英文摘要

Semi- and non-parametric mixture of regressions are a very useful flexible class of mixture of regressions in which some or all of the parameters are non-parametric functions of the covariates. These models are, however, based on the Gaussian assumption of the component error distributions. Thus, their estimation is sensitive to outliers and heavy-tailed error distributions. In this paper, we propose semi- and non-parametric contaminated Gaussian mixture of regressions to robustly estimate the parametric and/or non-parametric terms of the models in the presence of mild outliers. The virtue of using a contaminated Gaussian error distribution is that we can simultaneously perform model-based clustering of observations and model-based outlier detection. We propose two algorithms, an expectation-maximization (EM)-type algorithm and an expectation-conditional-maximization (ECM)-type algorithm, to perform maximum likelihood and local-likelihood kernel estimation of the parametric and non-parametric of the proposed models, respectively. The robustness of the proposed models is examined using an extensive simulation study. The practical utility of the proposed models is demonstrated using real data.

2601.12380 2026-01-21 cs.LG stat.ML

Statistical-Neural Interaction Networks for Interpretable Mixed-Type Data Imputation

Ou Deng, Shoji Nishimura, Atsushi Ogihara, Qun Jin

详情
英文摘要

Real-world tabular databases routinely combine continuous measurements and categorical records, yet missing entries are pervasive and can distort downstream analysis. We propose Statistical-Neural Interaction (SNI), an interpretable mixed-type imputation framework that couples correlation-derived statistical priors with neural feature attention through a Controllable-Prior Feature Attention (CPFA) module. CPFA learns head-wise prior-strength coefficients $\{λ_h\}$ that softly regularize attention toward the prior while allowing data-driven deviations when nonlinear patterns appear to be present in the data. Beyond imputation, SNI aggregates attention maps into a directed feature-dependency matrix that summarizes which variables the imputer relied on, without requiring post-hoc explainers. We evaluate SNI against six baselines (Mean/Mode, MICE, KNN, MissForest, GAIN, MIWAE) on six datasets spanning ICU monitoring, population surveys, socio-economic statistics, and engineering applications. Under MCAR/strict-MAR at 30\% missingness, SNI is generally competitive on continuous metrics but is often outperformed by accuracy-first baselines (MissForest, MIWAE) on categorical variables; in return, it provides intrinsic dependency diagnostics and explicit statistical-neural trade-off parameters. We additionally report MNAR stress tests (with a mask-aware variant) and discuss computational cost, limitations -- particularly for severely imbalanced categorical targets -- and deployment scenarios where interpretability may justify the trade-off.

2601.12370 2026-01-21 stat.ME

Single-index Semiparametric Transformation Cure Models with Interval-censored Data

Xiaoru Huang, Tonghui Yu, Xiaoyu Liu

详情
英文摘要

Interval censored data commonly arise in medical studies when the event time of interest is only known to lie within an interval. In the presence of a cure subgroup, conventional mixture cure models typically assume a logistic model for the uncure probability and a proportional hazards model for the susceptible subjects. However, in practice, the assumptions of parametric form for the uncure probability and the proportional hazards model for the susceptible may not always be satisfied. In this paper, we propose a class of flexible single-index semiparametric transformation cure models for interval-censored data, where a single-index model and a semiparametric transformation model are utilized for the uncured and conditional survival probability, respectively, encompassing both the proportional hazards cure and proportional odds cure models as specific cases. We approximate the single-index function and cumulative baseline hazard functions via the kernel technique and splines, respectively, and develop a computationally feasible expectation-maximisation (EM) algorithm, facilitated by a four-layer gamma-frailty Poisson data augmentation. Simulation studies demonstrate the satisfactory performance of our proposed method, compared to the spline-based approach and the classical logistic-based mixture cure models. The application of the proposed methodology is illustrated using the Alzheimers dataset.

2601.12343 2026-01-21 econ.EM cs.AI stat.ML

How Well Do LLMs Predict Human Behavior? A Measure of their Pretrained Knowledge

Wayne Gao, Sukjin Han, Annie Liang

详情
英文摘要

Large language models (LLMs) are increasingly used to predict human behavior. We propose a measure for evaluating how much knowledge a pretrained LLM brings to such a prediction: its equivalent sample size, defined as the amount of task-specific data needed to match the predictive accuracy of the LLM. We estimate this measure by comparing the prediction error of a fixed LLM in a given domain to that of flexible machine learning models trained on increasing samples of domain-specific data. We further provide a statistical inference procedure by developing a new asymptotic theory for cross-validated prediction error. Finally, we apply this method to the Panel Study of Income Dynamics. We find that LLMs encode considerable predictive information for some economic variables but much less for others, suggesting that their value as substitutes for domain-specific data differs markedly across settings.

2601.12321 2026-01-21 stat.AP

A Machine Learning--Based Surrogate EKMA Framework for Diagnosing Urban Ozone Formation Regimes: Evidence from Los Angeles

Sijie Zheng

Comments Preprint. Under review

详情
英文摘要

Surface ozone pollution remains a persistent challenge in many metropolitan regions worldwide, as the nonlinear dependence of ozone formation on nitrogen oxides and volatile organic compounds (VOCs) complicates the design of effective emission control strategies. While chemical transport models provide mechanistic insights, they rely on detailed emission inventories and are computationally expensive. This study develops a machine learning--based surrogate framework inspired by the Empirical Kinetic Modeling Approach (EKMA). Using hourly air quality observations from Los Angeles during 2024--2025, a random forest model is trained to predict surface ozone concentrations based on precursor measurements and spatiotemporal features, including site location and cyclic time encodings. The model achieves strong predictive performance, with permutation importance highlighting the dominant roles of diurnal temporal features and nitrogen dioxide, along with additional contributions from carbon monoxide. Building on the trained surrogate, EKMA-style sensitivity experiments are conducted by perturbing precursor concentrations while holding other covariates fixed. The results indicate that ozone formation in Los Angeles during the study period is predominantly VOC-limited. Overall, the proposed framework offers an efficient and interpretable approach for ozone regime diagnosis in data-rich urban environments.

2601.10993 2026-01-21 stat.ML cs.LG

Memorize Early, Then Query: Inlier-Memorization-Guided Active Outlier Detection

Minseo Kang, Seunghwan Park, Dongha Kim

详情
英文摘要

Outlier detection (OD) aims to identify abnormal instances, known as outliers or anomalies, by learning typical patterns of normal data, or inliers. Performing OD under an unsupervised regime-without any information about anomalous instances in the training data-is challenging. A recently observed phenomenon, known as the inlier-memorization (IM) effect, where deep generative models (DGMs) tend to memorize inlier patterns during early training, provides a promising signal for distinguishing outliers. However, existing unsupervised approaches that rely solely on the IM effect still struggle when inliers and outliers are not well-separated or when outliers form dense clusters. To address these limitations, we incorporate active learning to selectively acquire informative labels, and propose IMBoost, a novel framework that explicitly reinforces the IM effect to improve outlier detection. Our method consists of two stages: 1) a warm-up phase that induces and promotes the IM effect, and 2) a polarization phase in which actively queried samples are used to maximize the discrepancy between inlier and outlier scores. In particular, we propose a novel query strategy and tailored loss function in the polarization phase to effectively identify informative samples and fully leverage the limited labeling budget. We provide a theoretical analysis showing that the IMBoost consistently decreases inlier risk while increasing outlier risk throughout training, thereby amplifying their separation. Extensive experiments on diverse benchmark datasets demonstrate that IMBoost not only significantly outperforms state-of-the-art active OD methods but also requires substantially less computational cost.

2601.06371 2026-01-21 econ.EM stat.AP

The Promise of Time-Series Foundation Models for Agricultural Forecasting: Evidence from Commodity Prices

Le Wang, Boyuan Zhang

详情
英文摘要

Forecasting agricultural markets remains challenging due to nonlinear dynamics, structural breaks, and sparse data. A long-standing belief holds that simple time-series methods outperform more advanced alternatives. This paper provides the first systematic evidence that this belief no longer holds with modern time-series foundation models (TSFMs). Using USDA ERS monthly commodity price data from 1997-2025, we evaluate 17 forecasting approaches across four model classes, including traditional time-series, machine learning, deep learning, and five state-of-the-art TSFMs (Chronos, Chronos-2, TimesFM 2.5, Time-MoE, Moirai-2), and construct annual marketing year price predictions to compare with USDA's futures-based season-average price (SAP) forecasts. We show that zero-shot foundation models consistently outperform traditional time-series methods, machine learning, and deep learning architectures trained from scratch in both monthly and annual forecasting. Furthermore, foundation models remarkably outperform USDA's futures-based forecasts on three of four major commodities despite USDA's information advantage from forward-looking futures markets. Time-MoE delivers the largest accuracy gains, achieving 54.9% improvement on wheat and 18.5% improvement on corn relative to USDA ERS benchmarks on recent data (2017-2024 excluding COVID). These results point to a paradigm shift in agricultural forecasting.

2512.22162 2026-01-21 math.ST stat.ME stat.TH

Exchangeability and randomness for infinite and finite sequences

Vladimir Vovk

Comments 18 pages, 2 figures

详情
英文摘要

Randomness (in the sense of being generated in an IID fashion) and exchangeability are standard assumptions in nonparametric statistics and machine learning, and relations between them have been a popular topic of research. This short paper draws the reader's attention to the fact that, while for infinite sequences of observations the two assumptions are almost indistinguishable, the difference between them becomes very significant for finite sequences of a given length.

2512.17372 2026-01-21 math.ST stat.TH

False positive control in time series coincidence detection

Ruiting Liang, Samuel Dyson, Rina Foygel Barber, Daniel E. Holz

详情
英文摘要

We study the problem of coincidence detection in time series data, where we aim to determine whether the appearance of simultaneous or near-simultaneous events in two time series is indicative of some shared underlying signal or synchronicity, or might simply be due to random chance. This problem arises across many applications, such as astrophysics (e.g., detecting astrophysical events such as gravitational waves, with two or more detectors) and neuroscience (e.g., detecting synchronous firing patterns between two or more neurons). In this work, we consider methods based on time-shifting, where the timeline of one data stream is randomly shifted relative to another, to mimic the types of coincidences that could occur by random chance. Our theoretical results establish rigorous finite-sample guarantees controlling the probability of false positives, under weak assumptions that allow for dependence within the time series data, providing reassurance that time-shifting methods are a reliable tool for inference in this setting. Empirical results with simulated and real data validate the strong performance of time-shifting methods in dependent-data settings.

2512.10825 2026-01-21 math.ST cs.LG math.OC stat.TH

An Elementary Proof of the Near Optimality of LogSumExp Smoothing

Thabo Samakhoana, Benjamin Grimmer

Comments 11 pages

详情
英文摘要

We consider the design of smoothings of the (coordinate-wise) max function in $\mathbb{R}^d$ in the infinity norm. The LogSumExp function $f(x)=\ln(\sum^d_i\exp(x_i))$ provides a classical smoothing, differing from the max function in value by at most $\ln(d)$. We provide an elementary construction of a lower bound, establishing that every overestimating smoothing of the max function must differ by at least $\sim 0.8145\ln(d)$. Hence, LogSumExp is optimal up to small constant factors. However, in small dimensions, we provide stronger, exactly optimal smoothings attaining our lower bound, showing that the entropy-based LogSumExp approach to smoothing is not exactly optimal.

2512.10032 2026-01-21 cs.LG cs.AI stat.ML

Cluster-Dags as Powerful Background Knowledge For Causal Discovery

Jan Marco Ruiz de Vargas, Kirtan Padh, Niki Kilbertus

Comments 23 pages, 5 figures

详情
英文摘要

Finding cause-effect relationships is of key importance in science. Causal discovery aims to recover a graph from data that succinctly describes these cause-effect relationships. However, current methods face several challenges, especially when dealing with high-dimensional data and complex dependencies. Incorporating prior knowledge about the system can aid causal discovery. In this work, we leverage Cluster-DAGs as a prior knowledge framework to warm-start causal discovery. We show that Cluster-DAGs offer greater flexibility than existing approaches based on tiered background knowledge and introduce two modified constraint-based algorithms, Cluster-PC and Cluster-FCI, for causal discovery in the fully and partially observed setting, respectively. Empirical evaluation on simulated data demonstrates that Cluster-PC and Cluster-FCI outperform their respective baselines without prior knowledge.

2512.08601 2026-01-21 stat.ML cs.LG math.OC

Heuristics for Combinatorial Optimization via Value-based Reinforcement Learning: A Unified Framework and Analysis

Orit Davidovich, Shimrit Shtern, Segev Wasserkrug, Nimrod Megiddo

详情
英文摘要

Since the 1990s, considerable empirical work has been carried out to train statistical models, such as neural networks (NNs), as learned heuristics for combinatorial optimization (CO) problems. When successful, such an approach eliminates the need for experts to design heuristics per problem type. Due to their structure, many hard CO problems are amenable to treatment through reinforcement learning (RL). Indeed, we find a wealth of literature training NNs using value-based, policy gradient, or actor-critic approaches, with promising results, both in terms of empirical optimality gaps and inference runtimes. Nevertheless, there has been a paucity of theoretical work undergirding the use of RL for CO problems. To this end, we introduce a unified framework to model CO problems through Markov decision processes (MDPs) and solve them using RL techniques. We provide easy-to-test assumptions under which CO problems can be formulated as equivalent undiscounted MDPs that provide optimal solutions to the original CO problems. Moreover, we establish conditions under which value-based RL techniques converge to approximate solutions of the CO problem with a guarantee on the associated optimality gap. Our convergence analysis provides: (1) a sufficient rate of increase in batch size and projected gradient descent steps at each RL iteration; (2) the resulting optimality gap in terms of problem parameters and targeted RL accuracy; and (3) the importance of a choice of state-space embedding. Together, our analysis illuminates the success (and limitations) of the celebrated deep Q-learning algorithm in this problem context.

2510.13459 2026-01-21 cs.AI cs.CE cs.NI stat.AP

Mobile Coverage Analysis using Crowdsourced Data

Timothy Wong, Tom Freeman, Joseph Feehily

Comments 8 pages

详情
英文摘要

Effective assessment of mobile network coverage and the precise identification of service weak spots are paramount for network operators striving to enhance user Quality of Experience (QoE). This paper presents a novel framework for mobile coverage and weak spot analysis utilising crowdsourced QoE data. The core of our methodology involves coverage analysis at the individual cell (antenna) level, subsequently aggregated to the site level, using empirical geolocation data. A key contribution of this research is the application of One-Class Support Vector Machine (OC-SVM) algorithm for calculating mobile network coverage. This approach models the decision hyperplane as the effective coverage contour, facilitating robust calculation of coverage areas for individual cells and entire sites. The same methodology is extended to analyse crowdsourced service loss reports, thereby identifying and quantifying geographically localised weak spots. Our findings demonstrate the efficacy of this novel framework in accurately mapping mobile coverage and, crucially, in highlighting granular areas of signal deficiency, particularly within complex urban environments.

2510.09895 2026-01-21 cs.LG cs.AI stat.ML

Chain-of-Influence: Tracing Interdependencies Across Time and Features in Clinical Predictive Modelings

Yubo Li, Rema Padman

详情
英文摘要

Modeling clinical time-series data is hampered by the challenge of capturing latent, time-varying dependencies among features. State-of-the-art approaches often rely on black-box mechanisms or simple aggregation, failing to explicitly model how the influence of one clinical variable propagates through others over time. We propose $\textbf{Chain-of-Influence (CoI)}$, an interpretable deep learning framework that constructs an explicit, time-unfolded graph of feature interactions. CoI enables the tracing of influence pathways, providing a granular audit trail that shows how any feature at any time contributes to the final prediction, both directly and through its influence on other variables. We evaluate CoI on mortality and disease progression tasks using the MIMIC-IV dataset and a chronic kidney disease cohort. Our framework achieves state-of-the-art predictive performance (AUROC of 0.960 on CKD progression and 0.950 on ICU mortality), with deletion-based sensitivity analyses confirming that CoI's learned attributions faithfully reflect its decision process. Through case studies, we demonstrate that CoI uncovers clinically meaningful, patient-specific patterns of disease progression, offering enhanced transparency into the temporal and cross-feature dependencies that inform clinical decision-making.

2510.09616 2026-01-21 cs.CR cs.AI math.ST stat.TH

Causal Digital Twins for Cyber-Physical Security: A Framework for Robust Anomaly Detection in Industrial Control Systems

Mohammadhossein Homaei, Mehran Tarif, Pablo Garcia Rodriguez, Andres Caro, Mar Avila

Comments 22 Pages, six figures, and 14 tables,

详情
英文摘要

Industrial Control Systems (ICS) in water distribution and treatment face cyber-physical attacks exploiting network and physical vulnerabilities. Current water system anomaly detection methods rely on correlations, yielding high false alarms and poor root cause analysis. We propose a Causal Digital Twin (CDT) framework for water infrastructures, combining causal inference with digital twin modeling. CDT supports association for pattern detection, intervention for system response, and counterfactual analysis for water attack prevention. Evaluated on water-related datasets SWaT, WADI, and HAI, CDT shows 90.8\% compliance with physical constraints and structural Hamming distance 0.133 $\pm$ 0.02. F1-scores are $0.944 \pm 0.014$ (SWaT), $0.902 \pm 0.021$ (WADI), $0.923 \pm 0.018$ (HAI, $p<0.0024$). CDT reduces false positives by 74\%, achieves 78.4\% root cause accuracy, and enables counterfactual defenses reducing attack success by 73.2\%. Real-time performance at 3.2 ms latency ensures safe and interpretable operation for medium-scale water systems.

2510.08335 2026-01-21 stat.ML cs.LG

PAC Learnability in the Presence of Performativity

Ivan Kirev, Lyuben Baltadzhiev, Nikola Konstantinov

Comments 21 pages, 3 figures; Added another assumption on the RN derivative in Section 5, to fix an incorrect bounding argument in the proof of Theorem 5.1 in the initial version; more details on page 6

详情
英文摘要

Following the wide-spread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely based on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are learnable, via the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function, which depends only on data from the original distribution and on the type performative effect, and is yet an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this notion of performative risk allows us to show that any PAC-learnable hypothesis space in the standard binary classification setting remains PAC-learnable for the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase benefits on synthetic and real data.

2509.25482 2026-01-21 cs.AI cs.LG cs.RO cs.SY eess.SY stat.ML

Message passing-based inference in an autoregressive active inference agent

Wouter M. Kouw, Tim N. Nisslbeck, Wouter L. N. Nuijten

Comments 14 pages, 4 figures, proceedings of the International Workshop on Active Inference 2025. Erratum v1: in Eq. (50), $p(y_t, Θ, u_t \mid y_{*}, \mathcal{D}_k)$ should have been $p(y_t, Θ\mid u_t, y_{*}, \mathcal{D}_k)$

详情
英文摘要

We present the design of an autoregressive active inference agent in the form of message passing on a factor graph. Expected free energy is derived and distributed across a planning graph. The proposed agent is validated on a robot navigation task, demonstrating exploration and exploitation in a continuous-valued observation space with bounded continuous-valued actions. Compared to a classical optimal controller, the agent modulates action based on predictive uncertainty, arriving later but with a better model of the robot's dynamics.

2508.13071 2026-01-21 stat.ME

Surrogate-based Bayesian calibration methods for chaotic systems: a comparison of traditional and non-traditional approaches

Maike F. Holthuijzen, Atlanta Chakraborty, Elizabeth Krath, Tommie Catanach

Comments 34 pages, 6 figures

详情
英文摘要

Parameter calibration is essential for reducing uncertainty and improving predictive fidelity in physics-based models, yet it is often limited by the high computational cost of model evaluations. Bayesian calibration methods provide a principled framework for combining prior information with data while rigorously quantifying uncertainty. In this work, we compare four emulator-based Bayesian calibration strategies, Calibrate-Emulate-Sample (CES), History Matching (HM), Bayesian Optimal Experimental Design (BOED), and a goal-oriented extension of BOED (GBOED). The proposed GBOED formulation explicitly targets information gain with respect to the calibration posterior, aligning design decisions with downstream inference. We assess methods using accuracy and uncertainty quantification metrics, convergence behavior under increasing computational budgets, and practical considerations such as implementation complexity and robustness. For the Lorenz '96 system, CES, HM, and GBOED all yield strong calibration performance, even with limited numbers of model evaluations, while standard BOED generally underperforms in this setting. Differences among the strongest methods are modest, particularly as computational budgets increase. For the two-layer quasi-geostrophic system, all methods produce reasonable posterior estimates, and convergence behavior is more consistent. Overall, our results indicate that multiple emulator-based calibration strategies can perform comparably well when applied appropriately, with method selection often guided more by computational and practical considerations than by accuracy alone. These findings highlight both the limitations of standard BOED for calibration and the promise of goal-oriented and iterative approaches for efficient Bayesian inference in complex dynamical systems.

2507.16340 2026-01-21 math.ST stat.ME stat.TH

Structured linear factor models for tail dependence

Alexis Boulin, Axel Bücher

Comments 42 pages

详情
英文摘要

A common object to describe the extremal dependence of a $d$-variate random vector $X$ is the stable tail dependence function $L$. Various parametric models have emerged, with a popular subclass consisting of those stable tail dependence functions that arise for linear and max-linear factor models with heavy tailed factors. The stable tail dependence function is then parameterized by a $d \times K$ matrix $A$, where $K$ is the number of factors and where $A$ can be interpreted as a factor loading matrix. We study estimation of $L$ under an additional assumption on $A$ called the `pure variable assumption'. Both $K \in \{1, \dots, d\}$ and $A \in [0, \infty)^{d \times K}$ are treated as unknown, which constitutes an unconventional parameter space that does not fit into common estimation frameworks. We suggest two algorithms that allow to estimate $K$ and $A$, and provide finite sample guarantees for both algorithms. Remarkably, the guarantees allow for the case where the dimension $d$ is larger than the sample size $n$. The results are illustrated with numerical experiments and two case studies.

2507.12276 2026-01-21 econ.EM stat.AP

Probabilistic Forecasting of Climate Policy Uncertainty: The Role of Macro-financial Variables and Google Search Data

Donia Besher, Anirban Sengupta, Tanujit Chakraborty

详情
英文摘要

Accurately forecasting Climate Policy Uncertainty (CPU) is essential for designing climate strategies that balance economic growth with environmental objectives. Elevated CPU levels can delay regulatory implementation, hinder investment in green technologies, and amplify public resistance to policy reforms, particularly during periods of economic stress. Despite the growing literature documenting the economic relevance of CPU, forecasting its evolution and understanding the role of macro-financial drivers in shaping its fluctuations have not been explored. This study addresses this gap by presenting the first effort to forecast CPU and identify its key drivers. We employ various statistical tools to identify macro-financial exogenous drivers, alongside Google search data to capture early public attention to climate policy. Local projection impulse response analysis quantifies the dynamic effects of these variables, revealing that household financial vulnerability, housing market activity, business confidence, credit conditions, and financial market sentiment exert the most substantial impacts. These predictors are incorporated into a Bayesian Structural Time Series (BSTS) framework to produce probabilistic forecasts for both US and Global CPU indices. Extensive experiments and statistical validation demonstrate that BSTS with time-invariant regression coefficients achieves superior forecasting performance. We demonstrate that this performance stems from its variable selection mechanism, which identifies exogenous predictors that are empirically significant and theoretically grounded, as confirmed by the feature importance analysis. From a policy perspective, the findings underscore the importance of adaptive climate policies that remain effective across shifting economic conditions while supporting long-term environmental and growth objectives.

2506.21894 2026-01-21 stat.ML cs.LG

Thompson Sampling in Function Spaces via Neural Operators

Rafael Oliveira, Xuesong Wang, Kian Ming A. Chai, Edwin V. Bonilla

Comments Final revision to appear at NeurIPS 2025 proceedings, expanded proof of Proposition 2, added Remark 2 on sublinear information gain, and revised discussion at the end of Appendix C.4

详情
英文摘要

We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator's output. We assume that queries to the operator (such as running a high-fidelity simulator or physical experiment) are costly, while functional evaluations on the operator's output are inexpensive. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This strategy avoids explicit uncertainty quantification by treating trained neural operators as approximate samples from a Gaussian process (GP) posterior. We derive regret bounds and theoretical results connecting neural operators with GPs in infinite-dimensional settings. Experiments benchmark our method against other Bayesian optimization baselines on functional optimization tasks involving partial differential equations of physical systems, demonstrating better sample efficiency and significant performance gains.

2505.13809 2026-01-21 math.ST econ.EM stat.ML stat.TH

Semiparametric Off-Policy Inference for Optimal Policy Values under Possible Non-Uniqueness

Haoyu Wei

详情
英文摘要

Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on fixed target policies and may fail when the target policy is estimated as optimal, particularly when the optimal policy is non-unique or nearly deterministic. We study inference for the value of optimal policies in Markov decision processes. We characterize the existence of the efficient influence function and show that non-regularity arises under policy non-uniqueness. Motivated by this analysis, we propose a novel \textit{N}onparametric \textit{S}equenti\textit{A}l \textit{V}alue \textit{E}valuation (NSAVE) method, which achieves semiparametric efficiency and retains the double robustness property when the optimal policy is unique, and remains stable in degenerate regimes beyond the scope of existing asymptotic theory. We further develop a smoothing-based approach for valid inference under non-unique optimal policies, and a post-selection procedure with uniform coverage for data-selected optimal policies. Simulation studies support the theoretical results. An application to the OhioT1DM mobile health dataset provides patient-specific confidence intervals for optimal policy values and their improvement over observed treatment policies.

2504.20894 2026-01-21 cs.LG stat.ML

Does Feedback Help in Bandits with Arm Erasures?

Merve Karakas, Osama Hanna, Lin F. Yang, Christina Fragouli

Journal ref 2025 IEEE International Symposium on Information Theory (ISIT)

详情
英文摘要

We study a distributed multi-armed bandit (MAB) problem over arm erasure channels, motivated by the increasing adoption of MAB algorithms over communication-constrained networks. In this setup, the learner communicates the chosen arm to play to an agent over an erasure channel with probability $ε\in [0,1)$; if an erasure occurs, the agent continues pulling the last successfully received arm; the learner always observes the reward of the arm pulled. In past work, we considered the case where the agent cannot convey feedback to the learner, and thus the learner does not know whether the arm played is the requested or the last successfully received one. In this paper, we instead consider the case where the agent can send feedback to the learner on whether the arm request was received, and thus the learner exactly knows which arm was played. Surprisingly, we prove that erasure feedback does not improve the worst-case regret upper bound order over the previously studied no-feedback setting. In particular, we prove a regret lower bound of $Ω(\sqrt{KT} + K / (1 - ε))$, where $K$ is the number of arms and $T$ the time horizon, that matches no-feedback upper bounds up to logarithmic factors. We note however that the availability of feedback enables simpler algorithm designs that may achieve better constants (albeit not better order) regret bounds; we design one such algorithm and evaluate its performance numerically.

2504.09854 2026-01-21 econ.GN q-fin.EC stat.AP

Do Determinants of EV Purchase Intent vary across the Spectrum? Evidence from Bayesian Analysis of US Survey Data

Nafisa Lohawala, Mohammad Arshad Rahman

Comments 33 pages, three figures, five tables

详情
英文摘要

While electric vehicle (EV) adoption has been widely studied, most research focuses on the average effects of predictors on purchase intent, overlooking variation across the distribution of EV purchase intent. This paper makes a threefold contribution by analyzing four unique explanatory variables, leveraging large-scale US survey data from 2021 to 2023, and employing Bayesian ordinal probit and Bayesian ordinal quantile modeling to evaluate the effects of these variables-while controlling for other commonly used covariates-on EV purchase intent, both on average and across its full distribution. By modeling purchase intent as an ordered outcome-from "not at all likely" to "very likely"-we reveal how covariate effects differ across levels of interest. This is the first application of ordinal quantile modeling in the EV adoption literature, uncovering heterogeneity in how potential buyers respond to key factors. For instance, confidence in development of charging infrastructure and belief in environmental benefits are linked not only to higher interest among likely adopters but also to reduced resistance among more skeptical respondents. Notably, we identify a gap between the prevalence and influence of key predictors: although few respondents report strong infrastructure confidence or frequent EV information exposure, both factors are strongly associated with increased intent across the spectrum. These findings suggest clear opportunities for targeted communication and outreach, alongside infrastructure investment, to support widespread EV adoption.

2502.10653 2026-01-21 econ.EM stat.ME

Policy Learning with Confidence

Victor Chernozhukov, Sokbae Lee, Adam M. Rosen, Liyang Sun

Comments 40 pages, 3 figures

详情
英文摘要

This paper introduces a rule for policy selection in the presence of estimation uncertainty, explicitly accounting for estimation risk. The rule belongs to the class of risk-aware rules on the efficient decision frontier, characterized as policies offering maximal estimated welfare for a given level of estimation risk. Among this class, the proposed rule is chosen to provide a reporting guarantee, ensuring that the welfare delivered exceeds a threshold with a pre-specified confidence level. We apply this approach to the allocation of a limited budget among social programs using estimates of their marginal value of public funds and associated standard errors.

2501.08673 2026-01-21 stat.AP

A Spatio-Temporal Dirichlet Process Mixture Model on Linear Networks for Crime Data

Sujeong Lee, Won Chang, Jorge Mateu, Heejin Lee, Jaewoo Park

详情
英文摘要

Analyzing crime events is crucial to understand crime dynamics and it is largely helpful for constructing prevention policies. Point processes specified on linear networks can provide a more accurate description of crime incidents by considering the geometry of the city. We propose a spatio-temporal Dirichlet process mixture model on a linear network to analyze crime events in Valencia, Spain. We propose a Bayesian hierarchical model with a Dirichlet process prior to automatically detect space-time clusters of the events and adopt a convolution kernel estimator to account for the network structure in the city. From the fitted model, we provide crime hotspot visualizations that can inform social interventions to prevent crime incidents. Furthermore, we study the relationships between the detected cluster centers and the city's amenities, which provides an intuitive explanation of criminal contagion.

2412.16765 2026-01-21 cs.LG math.OC stat.ML

Optimization Insights into Deep Diagonal Linear Networks

Hippolyte Labarrière, Cesare Molinari, Lorenzo Rosasco, Cristian Vega, Silvia Villa

详情
英文摘要

Gradient-based methods successfully train highly overparameterized models in practice, even though the associated optimization problems are markedly nonconvex. Understanding the mechanisms that make such methods effective has become a central problem in modern optimization. To investigate this question in a tractable setting, we study Deep Diagonal Linear Networks. These are multilayer architectures with a reparameterization that preserves convexity in the effective parameter, while inducing a nontrivial geometry in the optimization landscape. Under mild initialization conditions, we show that gradient flow on the layer parameters induces a mirror-flow dynamic in the effective parameter space. This structural insight yields explicit convergence guarantees, including exponential decay of the loss under a Polyak-Lojasiewicz condition, and clarifies how the parametrization and initialization scale govern the training speed. Overall, our results demonstrate that deep diagonal over parameterizations, despite their apparent complexity, can endow standard gradient methods with well-behaved and interpretable optimization dynamics.

2411.05870 2026-01-21 eess.SY cs.SY math.DS math.PR physics.data-an stat.ME

An Adaptive Online Smoother with Closed-Form Solutions and Information-Theoretic Lag Selection for Conditional Gaussian Nonlinear Systems

Marios Andreou, Nan Chen, Yingda Li

Comments Latest revision. 46 pages (Main Text pp. 1--28; Appendix pp. 29--40), 9 figures (7 in Main Text, 2 in Appendix). Currently under review in Journal of Nonlinear Science (Springer Nature). Code available upon request. For further details visit https://mariosandreou.short.gy/OnlineSmootherCGNS

详情
英文摘要

Data assimilation (DA) combines partial observations with dynamical models to improve state estimation. Filter-based DA uses only past and present data and is the prerequisite for real-time forecasts. Smoother-based DA exploits both past and future observations. It aims to fill in missing data, provide more accurate estimations, and develop high-quality datasets. However, the standard smoothing procedure requires using all historical state estimations, which is storage-demanding, especially for high-dimensional systems. This paper develops an adaptive-lag online smoother for a large class of complex dynamical systems with strong nonlinear and non-Gaussian features, which has important applications to many real-world problems. The adaptive lag allows the utilization of observations only within a nearby window, thus reducing computational complexity and storage needs. Online lag adjustment is essential for tackling turbulent systems, where temporal autocorrelation varies significantly over time due to intermittency, extreme events, and nonlinearity. Based on the uncertainty reduction in the estimated state, an information criterion is developed to systematically determine the adaptive lag. Notably, the mathematical structure of these systems facilitates the use of closed analytic formulae to calculate the online smoother and adaptive lag, avoiding empirical tunings as in ensemble-based DA methods. The adaptive online smoother is applied to studying three important scientific problems. First, it helps detect online causal relationships between state variables. Second, the advantage of reduced computational storage expenditure is illustrated via Lagrangian DA, a high-dimensional nonlinear problem. Finally, the adaptive smoother advances online parameter estimation with partial observations, emphasizing the role of the observed extreme events in accelerating convergence.

2407.17225 2026-01-21 stat.ME

Asymmetry Analysis of Bilateral Shapes

Kanti V. Mardia, Xiangyu Wu, John T. Kent, Colin R. Goodall, Balvinder S. Khambay

Comments 38 pages

详情
英文摘要

Many biological objects possess bilateral symmetry about a midline or midplane, up to a ``noise'' term. This paper uses landmark-based methods to measure departures from bilateral symmetry, especially for the two-group problem where one group is more asymmetric than the other. In this paper, we formulate our work in the framework of size-and-shape analysis including registration via rigid body motion. Our starting point is a vector of elementary asymmetry features defined at the individual landmark coordinates for each object. We introduce two approaches for testing. In the first, the elementary features are combined into a scalar composite asymmetry measure for each object. Then standard univariate tests can be used to compare the two groups. In the second approach, a univariate test statistic is constructed for each elementary feature. The maximum of these statistics lead to an overall test statistic to compare the two groups and we then provide a technique to extract the important features from the landmark data. Our methodology is illustrated on a pre-registered smile dataset collected to assess the success of cleft lip surgery on human subjects. The asymmetry in a group of cleft lip subjects is compared to a group of normal subjects, and statistically significant differences have been found by univariate tests in the first approach. Further, our feature extraction method leads to an anatomically plausible set of landmarks for medical applications.

2407.15301 2026-01-21 stat.ML cs.LG math.ST q-bio.QM stat.TH

U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks

Zhe Fei, Yi Li

详情
英文摘要

Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.

2405.18828 2026-01-21 math.ST stat.ML stat.TH

CHANI: Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration

Sophie Jaffard, Samuel Vaiter, Patricia Reynaud-Bouret

详情
英文摘要

The present work aims at proving mathematically that a neural network inspired by biology can learn a classification task thanks to local transformations only. In this purpose, we propose a spiking neural network named CHANI (Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration), whose neurons activity is modeled by Hawkes processes. Synaptic weights are updated thanks to an expert aggregation algorithm, providing a local and simple learning rule. We were able to prove that our network can learn on average and asymptotically. Moreover, we demonstrated that it automatically produces neuronal assemblies in the sense that the network can encode several classes and that a same neuron in the intermediate layers might be activated by more than one class, and we provided numerical simulations on synthetic dataset. This theoretical approach contrasts with the traditional empirical validation of biologically inspired networks and paves the way for understanding how local learning rules enable neurons to form assemblies able to represent complex concepts.

2405.15325 2026-01-21 cs.LG stat.ML

On the Identification of Temporally Causal Representation with Instantaneous Dependence

Zijian Li, Yifan Shen, Kaitao Zheng, Ruichu Cai, Xiangchen Song, Mingming Gong, Guangyi Chen, Kun Zhang

详情
英文摘要

Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grouping of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an \textbf{ID}entification framework for instantane\textbf{O}us \textbf{L}atent dynamics (\textbf{IDOL}) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings.

2405.11923 2026-01-21 math.ST stat.ME stat.TH

Rate Optimality and Phase Transition for User-Level Local Differential Privacy

Alexander Kent, Thomas B. Berrett, Yi Yu

Comments 98 pages, 12 figures, 6 tables

详情
英文摘要

Most of the literature on differential privacy considers the item-level case where each user has a single observation, but a growing field of interest is that of user-level privacy where each of the $n$ users holds $T$ observations and wishes to maintain the privacy of their entire collection. In this paper, we derive a general minimax lower bound, which shows that, for locally private user-level estimation problems, the risk cannot, in general, be made to vanish for a fixed number of users even when each user holds an arbitrarily large number of observations. We then derive matching, up to logarithmic factors, lower and upper bounds for univariate and multidimensional mean estimation, sparse mean estimation and non-parametric density estimation. In particular, with other model parameters held fixed, we observe phase transition phenomena in the minimax rates as $T$ the number of observations each user holds varies. In the case of (non-sparse) mean estimation and density estimation, we see that, for $T$ below a phase transition boundary, the rate is the same as having $nT$ users in the item-level setting. Different behaviour is however observed in the case of $s$-sparse $d$-dimensional mean estimation, wherein consistent estimation is impossible when $d$ exceeds the number of observations in the item-level setting, but is possible in the user-level setting when $T \gtrsim s \log (d)$, up to logarithmic factors. This may be of independent interest for applications as an example of a high-dimensional problem that is feasible under local privacy constraints.

2301.10932 2026-01-21 cs.LG math.OC stat.ML

On the Global Convergence of Risk-Averse Natural Policy Gradient Methods with Expected Conditional Risk Measures

Xian Yu, Lei Ying

详情
英文摘要

Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While it has been shown that policy gradient methods can find globally optimal policies in the risk-neutral setting, it remains unclear if the risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRMs-based RL problems. We provide global optimality and iteration complexity of the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization under both exact and inexact policy evaluation. Furthermore, we test our risk-averse NPG algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our method.

2207.05442 2026-01-21 stat.ML cs.LG

Wasserstein multivariate auto-regressive models for modeling distributional time series

Yiye Jiang, Jérémie Bigot

详情
英文摘要

This paper is focused on the statistical analysis of data consisting of a collection of multiple series of probability measures that are indexed by distinct time instants and supported over a bounded interval of the real line. By modeling these time-dependent probability measures as random objects in the Wasserstein space, we propose a new auto-regressive model for the statistical analysis of multivariate distributional time series. Using the theory of iterated random function systems, results on the second order stationarity of the solution of such a model are provided. We also propose a consistent estimator for the auto-regressive coefficients of this model. Due to the simplex constraints that we impose on the model coefficients, the proposed estimator that is learned under these constraints, naturally has a sparse structure. The sparsity allows the application of the proposed model in learning a graph of temporal dependency from multivariate distributional time series. We explore the numerical performances of our estimation procedure using simulated data. To shed some light on the benefits of our approach for real data analysis, we also apply this methodology to two data sets, respectively made of observations from age distribution in different countries and those from the bike sharing network in Paris.

2103.13236 2026-01-21 stat.ME stat.AP

Bayesian Evidence Synthesis for the common effect model

Stavros Nikolakopoulos, Björn Alfons Edmar, Ioannis Ntzoufras

Comments 19 pages, 2 figures

详情
英文摘要

Bayes Factors, the Bayesian tool for hypothesis testing, are receiving increasing attention in the literature. Compared to their frequentist rivals ($p$-values or test statistics), Bayes Factors have the conceptual advantage of providing evidence both for and against a null hypothesis, and they can be calibrated so that they do not depend so heavily on the sample size. Research on the synthesis of Bayes Factors arising from individual studies has received increasing attention, mostly for the fixed effects model for meta-analysis. In this work, we review and propose methods for combining Bayes Factors from multiple studies, depending on the level of information available, focusing on the common effect model. In the process, we provide insights with respect to the interplay between frequentist and Bayesian evidence. We assess the performance of the methods discussed via a simulation study and apply the methods in an example from the field of positive psychology.

2103.04086 2026-01-21 stat.ME

Non-parametric Bayesian inference via loss functions under model misspecification

Yu Luo, David A. Stephens, Daniel J. Graham, Emma J. McCoy

详情
英文摘要

In the usual Bayesian setting, a full probabilistic model is required to link the data and parameters, and the form of this model and the inference and prediction mechanisms are specified via de Finetti's representation. In general, such a formulation is not robust to model misspecification of its component parts. An alternative approach is to draw inference based on loss functions, where the quantity of interest is defined as a minimizer of some expected loss, and to construct posterior distributions based on the loss-based formulation; this strategy underpins the construction of the Gibbs posterior. We develop a Bayesian non-parametric approach; specifically, we generalize the Bayesian bootstrap, and specify a Dirichlet process model for the distribution of the observables. We implement this using direct prior-to-posterior calculations, but also using predictive sampling. We also study the assessment of posterior validity for non-standard Bayesian calculations. We show that the developed non-standard Bayesian updating procedures yield valid posterior distributions in terms of consistency and asymptotic normality under model misspecification. Simulation studies show that the proposed methods can recover the true value of the parameter under misspecification.

2601.12231 2026-01-21 cs.LG cs.CR stat.CO

Wavelet-Aware Anomaly Detection in Multi-Channel User Logs via Deviation Modulation and Resolution-Adaptive Attention

Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Shijie Xu, Guanggang Geng

Comments Accepted by ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

详情
英文摘要

Insider threat detection is a key challenge in enterprise security, relying on user activity logs that capture rich and complex behavioral patterns. These logs are often multi-channel, non-stationary, and anomalies are rare, making anomaly detection challenging. To address these issues, we propose a novel framework that integrates wavelet-aware modulation, multi-resolution wavelet decomposition, and resolution-adaptive attention for robust anomaly detection. Our approach first applies a deviation-aware modulation scheme to suppress routine behaviors while amplifying anomalous deviations. Next, discrete wavelet transform (DWT) decomposes the log signals into multi-resolution representations, capturing both long-term trends and short-term anomalies. Finally, a learnable attention mechanism dynamically reweights the most discriminative frequency bands for detection. On the CERT r4.2 benchmark, our approach consistently outperforms existing baselines in precision, recall, and F1 score across various time granularities and scenarios.

2601.12221 2026-01-21 stat.AP

A warping function-based control chart for detecting distributional changes in damage-sensitive features for structural condition assessment

Zhicheng Chen, Wenyu Chen, Xinyi Lei

详情
英文摘要

Data-driven damage detection methods achieve damage identification by analyzing changes in damage-sensitive features (DSFs) derived from structural health monitoring (SHM) data. The core reason for their effectiveness lies in the fact that damage or structural state transition can be manifested as changes in the distribution of DSF data. This enables us to reframe the problem of damage detection as one of identifying these distributional changes. Hence, developing automated tools for detecting such changes is pivotal for automated structural health diagnosis. Control charts are extensively utilized in SHM for DSF change detection, owing to their excellent online detection and early warning capabilities. However, conventional methods are primarily designed to detect mean or variance shifts, making it challenging to identify complex shape changes in distributions. This limitation results in insufficient damage detection sensitivity. Moreover, they typically exhibit poor robustness against data contamination. This paper proposes a novel control chart to address these limitations. It employs the probability density functions (PDFs) of subgrouped DSF data as monitoring objects, with shape deformations characterized by warping functions. Furthermore, a nonparametric control chart is specifically constructed for warping function monitoring in the functional data analysis framework. Key advantages of the new method include the ability to detect both shifts and complex shape deformations in distributions, excellent online detection performance, and robustness against data contamination. Extensive simulation studies demonstrate its superiority over competing approaches. Finally, the method is applied to detecting distributional changes in DSF data for cable condition assessment in a long-span cable-stayed bridge, demonstrating its practical utility in engineering.

2601.12213 2026-01-21 cs.LG math.OC stat.ML

One-Sided Matrix Completion from Ultra-Sparse Samples

Hongyang R. Zhang, Zhenshuo Zhang, Huy L. Nguyen, Guanghui Lan

Comments 41 pages

Journal ref Trans. Mach. Learn. Res. 2026

详情
英文摘要

Matrix completion is a classical problem that has received recurring interest across a wide range of fields. In this paper, we revisit this problem in an ultra-sparse sampling regime, where each entry of an unknown, $n\times d$ matrix $M$ (with $n \ge d$) is observed independently with probability $p = C / d$, for a fixed integer $C \ge 2$. This setting is motivated by applications involving large, sparse panel datasets, where the number of rows far exceeds the number of columns. When each row contains only $C$ entries -- fewer than the rank of $M$ -- accurate imputation of $M$ is impossible. Instead, we estimate the row span of $M$ or the averaged second-moment matrix $T = M^{\top} M / n$. The empirical second-moment matrix computed from observed entries exhibits non-random and sparse missingness. We propose an unbiased estimator that normalizes each nonzero entry of the second moment by its observed frequency, followed by gradient descent to impute the missing entries of $T$. The normalization divides a weighted sum of $n$ binomial random variables by the total number of ones. We show that the estimator is unbiased for any $p$ and enjoys low variance. When the row vectors of $M$ are drawn uniformly from a rank-$r$ factor model satisfying an incoherence condition, we prove that if $n \ge O({d r^5 ε^{-2} C^{-2} \log d})$, any local minimum of the gradient-descent objective is approximately global and recovers $T$ with error at most $ε^2$. Experiments on both synthetic and real-world data validate our approach. On three MovieLens datasets, our algorithm reduces bias by $88\%$ relative to baseline estimators. We also empirically validate the linear sampling complexity of $n$ relative to $d$ on synthetic data. On an Amazon reviews dataset with sparsity $10^{-7}$, our method reduces the recovery error of $T$ by $59\%$ and $M$ by $38\%$ compared to baseline methods.

2601.12178 2026-01-21 cs.LG stat.ML

Federated Learning for the Design of Parametric Insurance Indices under Heterogeneous Renewable Production Losses

Fallou Niakh

详情
英文摘要

We propose a federated learning framework for the calibration of parametric insurance indices under heterogeneous renewable energy production losses. Producers locally model their losses using Tweedie generalized linear models and private data, while a common index is learned through federated optimization without sharing raw observations. The approach accommodates heterogeneity in variance and link functions and directly minimizes a global deviance objective in a distributed setting. We implement and compare FedAvg, FedProx and FedOpt, and benchmark them against an existing approximation-based aggregation method. An empirical application to solar power production in Germany shows that federated learning recovers comparable index coefficients under moderate heterogeneity, while providing a more general and scalable framework.

2601.12167 2026-01-21 stat.ME stat.AP stat.OT

Using Directed Acyclic Graphs to Illustrate Common Biases in Diagnostic Test Accuracy Studies

Yang Lu, Nandini Dendukuri

详情
英文摘要

Background: Diagnostic test accuracy (DTA) studies, like etiological studies, are susceptible to various biases including reference standard error bias, partial verification bias, spectrum effect, confounding, and bias from misassumption of conditional independence. While directed acyclic graphs (DAGs) are widely used in etiological research to identify and illustrate bias structures, they have not been systematically applied to DTA studies. Methods: We developed DAGs to illustrate the causal structures underlying common biases in DTA studies. For each bias, we present the corresponding DAG structure and demonstrate the parallel with equivalent biases in etiological studies. We use real-world examples to illustrate each bias mechanism. Results: We demonstrate that five major biases in DTA studies can be represented using DAGs with clear structural parallels to etiological studies: reference standard error bias corresponds to exposure misclassification, misassumption of conditional independence creates spurious correlations similar to unmeasured confounding, spectrum effect parallels effect modification, confounding operates through backdoor paths in both settings, and partial verification bias mirrors selection bias. These DAG representations reveal the causal mechanisms underlying each bias and suggest appropriate correction strategies. Conclusions: DAGs provide a valuable framework for understanding bias structures in DTA studies and should complement existing quality assessment tools like STARD and QUADAS-2. We recommend incorporating DAGs during study design to prospectively identify potential biases and during reporting to enhance transparency. DAG construction requires interdisciplinary collaboration and sensitivity analyses under alternative causal structures.

2601.12120 2026-01-21 stat.ME stat.ML

Lost in Aggregation: The Causal Interpretation of the IV Estimand

Danielle Tsao, Krikamol Muandet, Frederick Eberhardt, Emilija Perković

详情
英文摘要

Instrumental variable based estimation of a causal effect has emerged as a standard approach to mitigate confounding bias in the social sciences and epidemiology, where conducting randomized experiments can be too costly or impossible. However, justifying the validity of the instrument often poses a significant challenge. In this work, we highlight a problem generally neglected in arguments for instrumental variable validity: the presence of an ''aggregate treatment variable'', where the treatment (e.g., education, GDP, caloric intake) is composed of finer-grained components that each may have a different effect on the outcome. We show that the causal effect of an aggregate treatment is generally ambiguous, as it depends on how interventions on the aggregate are instantiated at the component level, formalized through the aggregate-constrained component intervention distribution. We then characterize conditions on the interventional distribution and the aggregate setting under which standard instrumental variable estimators identify the aggregate effect. The contrived nature of these conditions implies major limitations on the interpretation of instrumental variable estimates based on aggregate treatments and highlights the need for a broader justificatory base for the exclusion restriction in such settings.

2601.12105 2026-01-21 cs.CR stat.AP stat.ME

Privacy-Preserving Cohort Analytics for Personalized Health Platforms: A Differentially Private Framework with Stochastic Risk Modeling

Richik Chakraborty, Lawrence Liu, Syed Hasnain

Comments 18 pages, 4 figures

详情
英文摘要

Personalized health analytics increasingly rely on population benchmarks to provide contextual insights such as ''How do I compare to others like me?'' However, cohort-based aggregation of health data introduces nontrivial privacy risks, particularly in interactive and longitudinal digital platforms. Existing privacy frameworks such as $k$-anonymity and differential privacy provide essential but largely static guarantees that do not fully capture the cumulative, distributional, and tail-dominated nature of re-identification risk in deployed systems. In this work, we present a privacy-preserving cohort analytics framework that combines deterministic cohort constraints, differential privacy mechanisms, and synthetic baseline generation to enable personalized population comparisons while maintaining strong privacy protections. We further introduce a stochastic risk modeling approach that treats re-identification risk as a random variable evolving over time, enabling distributional evaluation through Monte Carlo simulation. Adapting quantitative risk measures from financial mathematics, we define Privacy Loss at Risk (P-VaR) to characterize worst-case privacy outcomes under realistic cohort dynamics and adversary assumptions. We validate our framework through system-level analysis and simulation experiments, demonstrating how privacy-utility tradeoffs can be operationalized for digital health platforms. Our results suggest that stochastic risk modeling complements formal privacy guarantees by providing interpretable, decision-relevant metrics for platform designers, regulators, and clinical informatics stakeholders.

2601.12031 2026-01-21 stat.ME

Estimations of Extreme CoVaR and CoES under Asymptotic Independence

Qingzhao Zhong

详情
英文摘要

The two popular systemic risk measures CoVaR (Conditional Value-at-Risk) and CoES (Conditional Expected Shortfall) have recently been receiving growing attention on applications in economics and finance. In this paper, we study the estimations of extreme CoVaR and CoES when the two random variables are asymptotic independent but positively associated. We propose two types of extrapolative approaches: the first relies on intermediate VaR and extrapolates it to extreme CoVaR/CoES via an adjustment factor; the second directly extrapolates the estimated intermediate CoVaR/CoES to the extreme tails. All estimators, including both intermediate and extreme ones, are shown to be asymptotically normal. Finally, we explore the empirical performances of our methods through conducting a series of Monte Carlo simulations and a real data analysis on S&P500 Index with 12 constituent stock data.

2601.12023 2026-01-21 stat.ML cs.LG

A Kernel Approach for Semi-implicit Variational Inference

Longlin Yu, Ziheng Cheng, Shiyue Zhang, Cheng Zhang

Comments 40 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2405.18997

详情
英文摘要

Semi-implicit variational inference (SIVI) enhances the expressiveness of variational families through hierarchical semi-implicit distributions, but the intractability of their densities makes standard ELBO-based optimization biased. Recent score-matching approaches to SIVI (SIVI-SM) address this issue via a minimax formulation, at the expense of an additional lower-level optimization problem. In this paper, we propose kernel semi-implicit variational inference (KSIVI), a principled and tractable alternative that eliminates the lower-level optimization by leveraging kernel methods. We show that when optimizing over a reproducing kernel Hilbert space, the lower-level problem admits an explicit solution, reducing the objective to the kernel Stein discrepancy (KSD). Exploiting the hierarchical structure of semi-implicit distributions, the resulting KSD objective can be efficiently optimized using stochastic gradient methods. We establish optimization guarantees via variance bounds on Monte Carlo gradient estimators and derive statistical generalization bounds of order $\tilde{\mathcal{O}}(1/\sqrt{n})$. We further introduce a multi-layer hierarchical extension that improves expressiveness while preserving tractability. Empirical results on synthetic and real-world Bayesian inference tasks demonstrate the effectiveness of KSIVI.

2601.11949 2026-01-21 stat.AP

A Deep Learning-Copula Framework for Climate-Related Home Insurance Risk

Asim K. Dey

详情
英文摘要

Extreme weather events are becoming more common, with severe storms, floods, and prolonged precipitation affecting communities worldwide. These shifts in climate patterns pose a direct threat to the insurance industry, which faces growing exposure to weather-related damages. As claims linked to extreme weather rise, insurance companies need reliable tools to assess future risks. This is not only essential for setting premiums and maintaining solvency but also for supporting broader disaster preparedness and resilience efforts. In this study, we propose a two-step method to examine the impact of precipitation on home insurance claims. Our approach combines the predictive power of deep neural networks with the flexibility of copula-based multivariate analysis, enabling a more detailed understanding of how precipitation patterns relate to claim dynamics. We demonstrate this methodology through a case study of the Canadian Prairies, using data from 2002 to 2011.

2601.11905 2026-01-21 cs.AI cs.LG math.ST stat.TH

LIBRA: Language Model Informed Bandit Recourse Algorithm for Personalized Treatment Planning

Junyu Cao, Ruijiang Gao, Esmaeil Keyvanshokooh, Jianhao Ma

Comments 50 pages. Previous version with human-AI collaboration: arXiv:2410.14640

详情
英文摘要

We introduce a unified framework that seamlessly integrates algorithmic recourse, contextual bandits, and large language models (LLMs) to support sequential decision-making in high-stakes settings such as personalized medicine. We first introduce the recourse bandit problem, where a decision-maker must select both a treatment action and a feasible, minimal modification to mutable patient features. To address this problem, we develop the Generalized Linear Recourse Bandit (GLRB) algorithm. Building on this foundation, we propose LIBRA, a Language Model-Informed Bandit Recourse Algorithm that strategically combines domain knowledge from LLMs with the statistical rigor of bandit learning. LIBRA offers three key guarantees: (i) a warm-start guarantee, showing that LIBRA significantly reduces initial regret when LLM recommendations are near-optimal; (ii) an LLM-effort guarantee, proving that the algorithm consults the LLM only $O(\log^2 T)$ times, where $T$ is the time horizon, ensuring long-term autonomy; and (iii) a robustness guarantee, showing that LIBRA never performs worse than a pure bandit algorithm even when the LLM is unreliable. We further establish matching lower bounds that characterize the fundamental difficulty of the recourse bandit problem and demonstrate the near-optimality of our algorithms. Experiments on synthetic environments and a real hypertension-management case study confirm that GLRB and LIBRA improve regret, treatment quality, and sample efficiency compared with standard contextual bandits and LLM-only benchmarks. Our results highlight the promise of recourse-aware, LLM-assisted bandit algorithms for trustworthy LLM-bandits collaboration in personalized high-stakes decision-making.

2601.11897 2026-01-21 cs.LG stat.ME stat.ML

Task-tailored Pre-processing: Fair Downstream Supervised Learning

Jinwon Sohn, Guang Lin, Qifan Song

详情
英文摘要

Fairness-aware machine learning has recently attracted various communities to mitigate discrimination against certain societal groups in data-driven tasks. For fair supervised learning, particularly in pre-processing, there have been two main categories: data fairness and task-tailored fairness. The former directly finds an intermediate distribution among the groups, independent of the type of the downstream model, so a learned downstream classification/regression model returns similar predictive scores to individuals inputting the same covariates irrespective of their sensitive attributes. The latter explicitly takes the supervised learning task into account when constructing the pre-processing map. In this work, we study algorithmic fairness for supervised learning and argue that the data fairness approaches impose overly strong regularization from the perspective of the HGR correlation. This motivates us to devise a novel pre-processing approach tailored to supervised learning. We account for the trade-off between fairness and utility in obtaining the pre-processing map. Then we study the behavior of arbitrary downstream supervised models learned on the transformed data to find sufficient conditions to guarantee their fairness improvement and utility preservation. To our knowledge, no prior work in the branch of task-tailored methods has theoretically investigated downstream guarantees when using pre-processed data. We further evaluate our framework through comparison studies based on tabular and image data sets, showing the superiority of our framework which preserves consistent trade-offs among multiple downstream models compared to recent competing models. Particularly for computer vision data, we see our method alters only necessary semantic features related to the central machine learning task to achieve fairness.

2601.11790 2026-01-21 stat.ML cs.LG stat.ME

Gradient-based Active Learning with Gaussian Processes for Global Sensitivity Analysis

Guerlain Lambert, Céline Helbert, Claire Lauvernet

详情
英文摘要

Global sensitivity analysis of complex numerical simulators is often limited by the small number of model evaluations that can be afforded. In such settings, surrogate models built from a limited set of simulations can substantially reduce the computational burden, provided that the design of computer experiments is enriched efficiently. In this context, we propose an active learning approach that, for a fixed evaluation budget, targets the most informative regions of the input space to improve sensitivity analysis accuracy. More specifically, our method builds on recent advances in active learning for sensitivity analysis (Sobol' indices and derivative-based global sensitivity measures, DGSM) that exploit derivatives obtained from a Gaussian process (GP) surrogate. By leveraging the joint posterior distribution of the GP gradient, we develop acquisition functions that better account for correlations between partial derivatives and their impact on the response surface, leading to a more comprehensive and robust methodology than existing DGSM-oriented criteria. The proposed approach is first compared to state-of-the-art methods on standard benchmark functions, and is then applied to a real environmental model of pesticide transfers.

2601.11717 2026-01-21 math.ST math.PR stat.TH

Detecting Mutual Excitations in Non-Stationary Hawkes Processes

Elchanan Mossel, Anirudh Sridhar

Comments 12 pages

详情
英文摘要

We consider the problem of learning the network of mutual excitations (i.e., the dependency graph) in a non-stationary, multivariate Hawkes process. We consider a general setting where baseline rates at each node are time-varying and delay kernels are not shift-invariant. Our main results show that if the dependency graph of an $n$-variate Hawkes process is sparse (i.e., it has a maximum degree that is bounded with respect to $n$), our algorithm accurately reconstructs it from data after observing the Hawkes process for $T = \mathrm{polylog}(n)$ time, with high probability. Our algorithm is computationally efficient, and provably succeeds in learning dependencies even if only a subset of time series are observed and event times are not precisely known.

2601.11701 2026-01-21 math.ST stat.ML stat.TH

Stability and Accuracy Trade-offs in Statistical Estimation

Abhinav Chakraborty, Yuetian Luo, Rina Foygel Barber

Comments The first two authors contributed equally and are listed in alphabetical order

详情
英文摘要

Algorithmic stability is a central concept in statistics and learning theory that measures how sensitive an algorithm's output is to small changes in the training data. Stability plays a crucial role in understanding generalization, robustness, and replicability, and a variety of stability notions have been proposed in different learning settings. However, while stability entails desirable properties, it is typically not sufficient on its own for statistical learning -- and indeed, it may be at odds with accuracy, since an algorithm that always outputs a constant function is perfectly stable but statistically meaningless. Thus, it is essential to understand the potential statistical cost of stability. In this work, we address this question by adopting a statistical decision-theoretic perspective, treating stability as a constraint in estimation. Focusing on two representative notions-worst-case stability and average-case stability-we first establish general lower bounds on the achievable estimation accuracy under each type of stability constraint. We then develop optimal stable estimators for four canonical estimation problems, including several mean estimation and regression settings. Together, these results characterize the optimal trade-offs between stability and accuracy across these tasks. Our findings formalize the intuition that average-case stability imposes a qualitatively weaker restriction than worst-case stability, and they further reveal that the gap between these two can vary substantially across different estimation problems.

2601.11638 2026-01-21 cs.LG stat.ML

Verifying Physics-Informed Neural Network Fidelity using Classical Fisher Information from Differentiable Dynamical System

Josafat Ribeiro Leal Filho, Antônio Augusto Fröhlich

Comments This paper has been submitted and is currently under review at IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

详情
英文摘要

Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving differential equations and modeling physical systems by embedding physical laws into the learning process. However, rigorously quantifying how well a PINN captures the complete dynamical behavior of the system, beyond simple trajectory prediction, remains a challenge. This paper proposes a novel experimental framework to address this by employing Fisher information for differentiable dynamical systems, denoted $g_F^C$. This Fisher information, distinct from its statistical counterpart, measures inherent uncertainties in deterministic systems, such as sensitivity to initial conditions, and is related to the phase space curvature and the net stretching action of the state space evolution. We hypothesize that if a PINN accurately learns the underlying dynamics of a physical system, then the Fisher information landscape derived from the PINN's learned equations of motion will closely match that of the original analytical model. This match would signify that the PINN has achieved comprehensive fidelity capturing not only the state evolution but also crucial geometric and stability properties. We outline an experimental methodology using the dynamical model of a car to compute and compare $g_F^C$ for both the analytical model and a trained PINN. The comparison, based on the Jacobians of the respective system dynamics, provides a quantitative measure of the PINN's fidelity in representing the system's intricate dynamical characteristics.

2601.06327 2026-01-21 cs.OH stat.AP

From Lagging to Leading: Validating Hard Braking Events as High-Density Indicators of Segment Crash Risk

Yechen Li, Shantanu Shahane, Shoshana Vasserman, Carolina Osorio, Yi-fan Chen, Ivan Kuznetsov, Kristin White, Justyna Swiatkowska, Neha Arora, Feng Guo

详情
英文摘要

Identifying high crash risk road segments and accurately predicting crash incidence is fundamental to implementing effective safety countermeasures. While collision data inherently reflects risk, the infrequency and inconsistent reporting of crashes present a major challenge to robust risk prediction models. The proliferation of connected vehicle technology offers a promising avenue to leverage high-density safety metrics for enhanced crash forecasting. A Hard-Braking Event (HBE), interpreted as an evasive maneuver, functions as a potent proxy for elevated driving risk due to its demonstrable correlation with underlying crash causal factors. Crucially, HBE data is significantly more readily available across the entire road network than conventional collision records. This study systematically evaluated the correlation at individual road segment level between police-reported collisions and aggregated and anonymized HBEs identified via the Google Android Auto platform, utilizing datasets from California and Virginia. Empirical evidence revealed that HBEs occur at a rate magnitudes higher than traffic crashes. Employing the state-of-the-practice Negative-Binomial regression models, the analysis established a statistically significant positive correlation between the HBE rate and the crash rate: road segments exhibiting a higher frequency of HBEs were consistently associated with a greater incidence of crashes. This sophisticated model incorporated and controlled for various confounding factors, including road type, speed profile, proximity to ramps, and road segment slope. The HBEs derived from connected vehicle technology thus provide a scalable, high-density safety surrogate metric for network-wide traffic safety assessment, with the potential to optimize safer routing recommendations and inform the strategic deployment of active safety countermeasures.

2601.06296 2026-01-21 stat.ME stat.ML

A Targeted Learning Framework for Estimating Restricted Mean Survival Time Difference using Pseudo-observations

Man Jin, Yixin Fang

详情
英文摘要

A targeted learning (TL) framework is developed to estimate the difference in the restricted mean survival time (RMST) for a clinical trial with time-to-event outcomes. The approach starts by defining the target estimand as the RMST difference between investigational and control treatments. Next, an efficient estimation method is introduced: a targeted minimum loss estimator (TMLE) utilizing pseudo-observations. Moreover, a version of the copy reference (CR) approach is developed to perform a sensitivity analysis for right-censoring. The proposed TL framework is demonstrated using a real data application.

2512.20232 2026-01-21 cs.LG stat.AP

Adaptive Multi-task Learning for Probabilistic Load Forecasting

Onintze Zaballa, Verónica Álvarez, Santiago Mazuelas

详情
英文摘要

Simultaneous load forecasting across multiple entities (e.g., regions, buildings) is crucial for the efficient, reliable, and cost-effective operation of power systems. Accurate load forecasting is a challenging problem due to the inherent uncertainties in load demand, dynamic changes in consumption patterns, and correlations among entities. Multi-task learning has emerged as a powerful machine learning approach that enables the simultaneous learning across multiple related problems. However, its application to load forecasting remains underexplored and is limited to offline learning methods, which cannot capture changes in consumption patterns. This paper presents an adaptive multi-task learning method for probabilistic load forecasting. The proposed method can dynamically adapt to changes in consumption patterns and correlations among entities. In addition, the techniques presented provide reliable probabilistic predictions for loads of multiple entities and assess load uncertainties. Specifically, the method is based on vectorvalued hidden Markov models and uses a recursive process to update the model parameters and provide predictions with the most recent parameters. The performance of the proposed method is evaluated using datasets that contain the load demand of multiple entities and exhibit diverse and dynamic consumption patterns. The experimental results show that the presented techniques outperform existing methods both in terms of forecasting performance and uncertainty assessment.

2510.25005 2026-01-21 cs.AI cs.LG math.ST stat.ML stat.TH

Cyclic Counterfactuals under Shift-Scale Interventions

Saptarshi Saha, Dhruv Vansraj Rathore, Utpal Garain

Comments Accepted at NeurIPS 2025

详情
英文摘要

Most counterfactual inference frameworks traditionally assume acyclic structural causal models (SCMs), i.e. directed acyclic graphs (DAGs). However, many real-world systems (e.g. biological systems) contain feedback loops or cyclic dependencies that violate acyclicity. In this work, we study counterfactual inference in cyclic SCMs under shift-scale interventions, i.e., soft, policy-style changes that rescale and/or shift a variable's mechanism.

2510.04460 2026-01-21 math.PR cs.DS cs.LG math.ST stat.TH

Perspectives on Stochastic Localization

Bobby Shi, Kevin Tian, Matthew S. Zhang

详情
英文摘要

We survey different perspectives on the stochastic localization process of Eldan, a powerful construction that has had many exciting recent applications in high-dimensional probability and algorithm design. Unlike prior surveys on this topic, our focus is on giving a self-contained presentation of all known alternative constructions of Eldan's stochastic localization, with an emphasis on connections between different constructions. Our hope is that by collecting these perspectives, some of which had primarily arisen within a particular community (e.g., probability theory, theoretical computer science, information theory, or machine learning), we can broaden the accessibility of stochastic localization, and ease its future use.

2509.17140 2026-01-21 stat.AP stat.ME

An Italian Gender Equality Index

Lorenzo Panebianco

详情
英文摘要

Composite indices like the Gender Equality Index (GEI) are widely used to monitor gender disparities and guide evidence-based policy. However, their original design is often limited when applied to subnational contexts. Building on the GEI framework and the WeWorld Index Italia, this study proposes a composite indicator tailored to measure gender disparities across Italian regions. The methodology, based on a variation of the Mazziotta-Pareto Index, introduces a novel aggregation approach that penalizes uneven performances across domains. Indicators cover employment, economic resources, education, use of time, political participation, and health, reflecting multidimensional gender inequality. Using open regional data for 2024, the proposed Italian Gender Equality Index (IGEI) provides a comparable and robust measure across regions, highlighting both high-performing and lagging areas. The approach addresses compensatory limitations of traditional aggregation and offers a practical tool for regional monitoring and targeted interventions, benefiting from the fact that the IGEI is specifically tailored on the GEI framework.

2508.09040 2026-01-21 stat.ME econ.EM math.ST stat.TH

Bias correction for Chatterjee's graph-based correlation coefficient

Mona Azadkia, Leihao Chen, Fang Han

Comments 45 pages; this version includes additional results demonstrating that the bias can be negligible when d<=3

详情
英文摘要

Azadkia and Chatterjee (2021) recently introduced a simple nearest neighbor (NN) graph-based correlation coefficient that consistently detects both independence and functional dependence. Specifically, it approximates a measure of dependence that equals 0 if and only if the variables are independent, and 1 if and only if they are functionally dependent. However, this NN estimator includes a bias term that may vanish at a rate slower than root-$n$, preventing root-$n$ consistency in general. In this article, we (i) analyze this bias term closely and show that it could become asymptotically negligible when the dimension is smaller than four; and (ii) propose a bias-correction procedure for more general settings. In both regimes, we obtain estimators (either the original or the bias-corrected version) that are root-$n$ consistent and asymptotically normal.

2508.05259 2026-01-21 math.ST stat.TH

Nonparametric Estimation from Correlated Copies of a Drifted Process

Nicolas Marie

Comments 23 pages, 6 figures

详情
英文摘要

This paper presents several situations leading to the observation of multiple correlated copies of a drifted process, and then non-asymptotic risk bounds are established on nonparametric estimators of the drift function $b_0$ and its derivative. For drifted Gaussian processes with a regular enough covariance function, a sharper risk bound is established on the estimator of $b_0'$, and a model selection procedure is provided with theoretical guarantees.

2508.04444 2026-01-21 cs.LG cs.NA math.NA stat.ML

Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation

Askar Tsyganov, Evgeny Frolov, Sergey Samsonov, Maxim Rakhuba

Comments AAAI-2026, camera-ready version

详情
英文摘要

In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson's diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.

2506.22621 2026-01-21 cs.LG math.OC stat.ML

Modeling Hierarchical Spaces: A Review and Unified Framework for Surrogate-Based Architecture Design

Paul Saves, Edward Hallé-Hannan, Jasper Bussemaker, Youssef Diouane, Nathalie Bartoli

Comments Published in Structural and Multidisciplinary Optimization, Springer Nature (2026)

详情
英文摘要

Simulation-based problems involving mixed-variable inputs frequently feature domains that are hierarchical, conditional, heterogeneous, or tree-structured. These characteristics pose challenges for data representation, modeling, and optimization. This paper reviews extensive literature on these structured input spaces and proposes a unified framework that generalizes existing approaches. In this framework, input variables may be continuous, integer, or categorical. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures. We further introduce the concept of partially-decreed variables, whose activation depends on contextual conditions. To capture these inter-variable hierarchical relationships, we introduce design space graphs, combining principles from feature modeling and graph theory. This allows the definition of general hierarchical domains suitable for describing complex system architectures. Our framework defines hierarchical distances and kernels to enable surrogate modeling and optimization on hierarchical domains. We demonstrate its effectiveness on complex system design problems, including a neural network and a green-aircraft case study. Our methods are available in the open-source Surrogate Modeling Toolbox (SMT 2.0).

2505.01849 2026-01-21 stat.AP stat.CO

Applications of higher order Markov models and Pressure Index to strategize controlled run chases in Twenty20 cricket

Rhitankar Bandyopadhyay, Dibyojyoti Bhattacharjee

Comments 33 pages, 2 figures, 24 tables, 8 pseudo codes for algorithms

详情
英文摘要

In limited overs cricket, the team batting first posts a target score for the team batting second to achieve in order to win the match. The team batting second is constrained by decreasing resources in terms of number of balls left and number of wickets in hand in the process of reaching the target as the second innings progresses. The Pressure Index, a measure created by researchers in the past, serves as a tool for quantifying the level of pressure that a team batting second encounters in limited overs cricket. Through a ball-by-ball analysis of the second innings, it reveals how effectively the team batting second in a limited-over game proceeds towards their target. This research employs higher order Markov chains to examine the strategies employed by successful teams during run chases in Twenty20 matches. By studying the trends in successful run chases spanning over 16 years and utilizing a significant dataset of 6537 Twenty20 matches, specific strategies are identified. Consequently, an efficient approach to successful run chases in Twenty20 cricket is formulated, effectively limiting the Pressure Index to [0.5, 3.5] or even further down under 0.5 as early as possible. The innovative methodology adopted in this research offers valuable insights for cricket teams looking to enhance their performance in run chases.

2503.20466 2026-01-21 physics.ao-ph cs.LG stat.ML

Data-driven Seasonal Climate Predictions via Variational Inference and Transformers

Lluís Palma, Alejandro Peraza, David Civantos, Amanda Duarte, Stefano Materia, Ángel G. Muñoz, Jesús Peña-Izquierdo, Laia Romero, Albert Soret, Markus G. Donat

详情
英文摘要

Most operational climate services providers base their seasonal predictions on initialised general circulation models (GCMs) or statistical techniques that fit past observations. GCMs require substantial computational resources, which limits their capacity. In contrast, statistical methods often lack robustness due to short historical records. Recent works propose machine learning methods trained on climate model output, leveraging larger sample sizes and simulated scenarios. Yet, many of these studies focus on prediction tasks that might be restricted in spatial extent or temporal coverage, opening a gap with existing operational predictions. Thus, the present study evaluates the effectiveness of a methodology that combines variational inference with transformer models to predict fields of seasonal anomalies. The predictions cover all four seasons and are initialised one month before the start of each season. The model was trained on climate model output from CMIP6 and tested using ERA5 reanalysis data. We analyse the method's performance in predicting interannual anomalies beyond the climate change-induced trend. We also test the proposed methodology in a regional context with a use case focused on Europe. While climate change trends dominate the skill of temperature predictions, the method presents additional skill over the climatological forecast in regions influenced by known teleconnections. We reach similar conclusions based on the validation of precipitation predictions. Despite underperforming SEAS5 in most tropics, our model offers added value in numerous extratropical inland regions. This work demonstrates the effectiveness of training generative models on climate model output for seasonal predictions, providing skilful predictions beyond the induced climate change trend at time scales and lead times relevant for user applications.

2503.06431 2026-01-21 stat.ME cs.LG

Fairness-aware kidney exchange and kidney paired donation

Mingrui Zhang, Xiaowu Dai, Lexin Li

详情
英文摘要

The kidney paired donation (KPD) program provides an innovative solution to overcome incompatibility challenges in kidney transplants by matching incompatible donor-patient pairs and facilitating kidney exchanges. To address unequal access to transplant opportunities, there are two widely used fairness criteria: group fairness and individual fairness. However, these criteria do not consider protected patient features, which refer to characteristics legally or ethically recognized as needing protection from discrimination, such as race and gender. Motivated by the calibration principle in machine learning, we introduce a new fairness criterion: the matching outcome should be conditionally independent of the protected feature, given the sensitization level. We integrate this fairness criterion as a constraint within the KPD optimization framework and propose a computationally efficient solution using linearization strategies and column-generation methods. Theoretically, we analyze the associated price of fairness using random graph models. Empirically, we compare our fairness criterion with group fairness and individual fairness through both simulations and a real-data example.

2502.20966 2026-01-21 stat.ML cs.LG math.ST stat.TH

Post-Hoc Uncertainty Quantification in Pre-Trained Neural Networks via Activation-Level Gaussian Processes

Richard Bergna, Stefan Depeweg, Sergio Calvo Ordonez, Jonathan Plenk, Alvaro Cartea, Jose Miguel Hernandez-Lobato

Comments 10 pages, 8 figures, 7th Symposium on Advances in Approximate Bayesian Inference

详情
英文摘要

Uncertainty quantification in neural networks through methods such as Dropout, Bayesian neural networks and Laplace approximations is either prone to underfitting or computationally demanding, rendering these approaches impractical for large-scale datasets. In this work, we address these shortcomings by shifting the focus from uncertainty in the weight space to uncertainty at the activation level, via Gaussian processes. More specifically, we introduce the Gaussian Process Activation function (GAPA) to capture neuron-level uncertainties. Our approach operates in a post-hoc manner, preserving the original mean predictions of the pre-trained neural network and thereby avoiding the underfitting issues commonly encountered in previous methods. We propose two methods. The first, GAPA-Free, employs empirical kernel learning from the training data for the hyperparameters and is highly efficient during training. The second, GAPA-Variational, learns the hyperparameters via gradient descent on the kernels, thus affording greater flexibility. Empirical results demonstrate that GAPA-Variational outperforms the Laplace approximation on most datasets in at least one of the uncertainty quantification metrics.

2502.16542 2026-01-21 stat.ML cs.LG

Variable transformations in consistent loss functions

Hristos Tyralis, Georgia Papacharalampous

Comments 37 pages, 4 figures, 2 tables

Journal ref Knowledge-Based Systems 336 (2026) 115202

详情
英文摘要

The empirical use of variable transformations within (strictly) consistent loss functions is widespread, yet a theoretical understanding is lacking. To address this gap, we develop a theoretical framework that establishes formal characterizations of (strict) consistency for such transformed loss functions. Our analysis focuses on two interrelated cases: (a) transformations applied solely to the realization variable and (b) bijective transformations applied jointly to both the realization and prediction variables. These cases extend the well-established framework of transformations applied exclusively to the prediction variable, as formalized by Osband's revelation principle. We further develop analogous characterizations for (strict) identification functions. The resulting theoretical framework is broadly applicable to statistical and machine learning methodologies. For instance, we apply the framework to Bregman and expectile loss functions to interpret empirical findings from models trained with transformed loss functions and systematically construct new identifiable and elicitable functionals, which we term respectively $g$-transformed expectation and $g$-transformed expectile. Applications of the framework to simulated and real-world data illustrate its practical utility in diverse settings. By unifying theoretical insights with practical applications, this work advances principled methodologies for designing loss functions in complex predictive tasks.

2411.00839 2026-01-21 cs.LG cs.AI cs.CV stat.ME stat.ML

CausAdv: A Causal-based Framework for Detecting Adversarial Examples

Hichem Debbi

详情
英文摘要

Deep learning has led to tremendous success in computer vision, largely due to Convolutional Neural Networks (CNNs). However, CNNs have been shown to be vulnerable to crafted adversarial perturbations. This vulnerability of adversarial examples has has motivated research into improving model robustness through adversarial detection and defense methods. In this paper, we address the adversarial robustness of CNNs through causal reasoning. We propose CausAdv: a causal framework for detecting adversarial examples based on counterfactual reasoning. CausAdv learns both causal and non-causal features of every input, and quantifies the counterfactual information (CI) of every filter of the last convolutional layer. We then perform a statistical analysis of the filters' CI across clean and adversarial samples, to demonstrate that adversarial examples exhibit different CI distributions compared to clean samples. Our results show that causal reasoning enhances the process of adversarial detection without the need to train a separate detector. Moreover, we illustrate the efficiency of causal explanations as a helpful detection tool by visualizing the extracted causal features.

2410.21914 2026-01-21 stat.ME stat.CO

Bayesian Stability Selection and Inference on Selection Probabilities

Mahdi Nouraie, Connor Smith, Samuel Muller

详情
英文摘要

Stability selection is a versatile framework for structure estimation and variable selection in high-dimensional setting, primarily grounded in frequentist principles. In this paper, we propose an enhanced methodology that integrates Bayesian analysis to refine the inference of selection probabilities within the stability selection framework. Traditional approaches rely on selection frequencies for decision-making, often disregarding domain-specific knowledge. Our methodology uses prior information to derive posterior distributions of selection probabilities, thereby improving both inference and decision-making. We present a two-step process for engaging with domain experts, enabling statisticians to construct prior distributions informed by expert knowledge while allowing experts to control the weight of their input on the final results. Using posterior distributions, we offer Bayesian credible intervals to quantify uncertainty in the variable selection process. Furthermore, we demonstrate how the integration of prior knowledge reduces the variance of selection probabilities, thereby improving the stability of decision-making. Our approach preserves the versatility of stability selection and is suitable for a broad range of structure estimation challenges.

2410.18164 2026-01-21 cs.LG cs.AI stat.ML

TabDPT: Scaling Tabular Foundation Models on Real Data

Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Alex Labach, Hamidreza Kamkari, Jesse C. Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L. Caterini, Maksims Volkovs

Comments Inference repo: github.com/layer6ai-labs/TabDPT-inference; Training repo: github.com/layer6ai-labs/TabDPT-training

Journal ref NeurIPS 2025 Proceedings

详情
英文摘要

Tabular data is one of the most ubiquitous sources of information worldwide, spanning a wide variety of domains. This inherent heterogeneity has slowed the development of Tabular Foundation Models (TFMs) capable of fast generalization to unseen datasets. In-Context Learning (ICL) has recently emerged as a promising solution for TFMs, enabling dynamic adaptation to new tasks without additional tuning. While many studies have attempted to re-purpose large language models for tabular ICL, they have had limited success, so recent works have focused on developing tabular-specific foundation models. In this work, we propose an approach to combine ICL-based retrieval with self supervised learning to train tabular foundation models. We also investigate the utility of real vs. synthetic data for model pre-training, and show that real data can contain useful signal not easily captured in synthetic training. Specifically, we show that incorporating real data during the pre-training phase can lead to significantly faster training and better downstream generalization to unseen data. Our resulting model, TabDPT, achieves strong performance on both regression (CTR23) and classification (CC18) benchmarks. Importantly, we also demonstrate that with our pre-training procedure, scaling both model and data size leads to consistent performance improvements that follow power laws. This echoes scaling laws in LLMs and other foundation models, and suggests that large-scale TFMs can be achievable. We open-source our full pipeline: inference code including trained model weights can be found at github.com/layer6ai-labs/TabDPT-inference, and the training code to reproduce experiments can be found at github.com/layer6ai-labs/TabDPT-training.

2410.15530 2026-01-21 stat.ME math.ST stat.TH

Simultaneous Inference in Multiple Matrix-Variate Graphs for High-Dimensional Neural Recordings

Zongge Liu, Heejong Bong, Zhao Ren, Matthew A. Smith, Robert E. Kass

详情
英文摘要

We study simultaneous inference for multiple matrix-variate Gaussian graphical models in high-dimensional settings. Such models arise when spatiotemporal data are collected across multiple sample groups or experimental sessions, where each group is characterized by its own graphical structure but shares common sparsity patterns. A central challenge is to conduct valid inference on collections of graph edges while efficiently borrowing strength across groups under both high-dimensionality and temporal dependence. We propose a unified framework that combines joint estimation via group penalized regression with a high-dimensional Gaussian approximation bootstrap to enable global testing of edge subsets of arbitrary size. The proposed procedure accommodates temporally dependent observations and avoids naive pooling across heterogeneous groups. We establish theoretical guarantees for the validity of the simultaneous tests under mild conditions on sample size, dimensionality, and non-stationary autoregressive temporal dependence, and show that the resulting tests are nearly optimal in terms of the testable region boundary. The method relies only on convex optimization and parametric bootstrap, making it computationally tractable. Simulation studies and a neural recording example illustrate the efficacy of the proposed approach.

2410.09267 2026-01-21 stat.ME

Experimentation on Endogenous Graphs

Wenshuo Wang, Edvard Bakhitov, Dominic Coey

详情
英文摘要

We study experimentation under endogenous network interference. Interference patterns are mediated by an endogenous graph, where edges can be formed or eliminated as a result of treatment. We show that conventional estimators are biased in these circumstances, and present a class of unbiased, consistent and asymptotically normal estimators of total treatment effects in the presence of such interference. We show via simulation that our estimator outperforms existing estimators in the literature. Our results apply both to bipartite experimentation, in which the units of analysis and measurement differ, and the standard network experimentation case, in which they are the same.

2409.15995 2026-01-21 stat.ME q-bio.QM stat.AP

Robust Inference for Non-Linear Regression Models with Applications in Enzyme Kinetics

Suryasis Jana, Abhik Ghosh

Comments To appear in the Journal of Applies Statistics

详情
英文摘要

Despite linear regression being the most popular statistical modelling technique, in real-life we often need to deal with situations where the true relationship between the response and the covariates is nonlinear in parameters. In such cases one needs to adopt appropriate non-linear regression (NLR) analysis, having wider applications in biochemical and medical studies among many others. In this paper we propose a new improved robust estimation and testing methodologies for general NLR models based on the minimum density power divergence approach and apply our proposal to analyze the widely popular Michaelis-Menten (MM) model in enzyme kinetics. We establish the asymptotic properties of our proposed estimator and tests, along with their theoretical robustness characteristics through influence function analysis. For the particular MM model, we have further empirically justified the robustness and the efficiency of our proposed estimator and the testing procedure through extensive simulation studies and several interesting real data examples of enzyme-catalyzed (biochemical) reactions.

2408.02060 2026-01-21 math.ST stat.ME stat.ML stat.TH

Winners with Confidence: Discrete Argmin Inference with an Application to Model Selection

Tianyu Zhang, Hao Lee, Jing Lei

详情
英文摘要

We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop an asymptotically normal test statistic, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross-validation and differential privacy. The key technical ingredient is a central limit theorem for globally dependent data. We also propose practical ways to select the tuning parameter that adapts to the signal landscape. Numerical experiments and data examples demonstrate the ability of the proposed method to achieve a favorable bias-variance trade-off in practical scenarios.

2407.21119 2026-01-21 econ.EM stat.ME

Potential weights and implicit causal designs in linear regression

Jiafeng Chen

详情
英文摘要

When we interpret linear regression as estimating causal effects justified by quasi-experimental treatment variation, what do we mean? This paper formalizes a minimal criterion for quasi-experimental interpretation and characterizes its necessary implications. A minimal requirement is that the regression always estimates some contrast of potential outcomes under the true treatment assignment process. This requirement implies linear restrictions on the true distribution of treatment. If the regression were to be interpreted quasi-experimentally, these restrictions imply candidates for the true distribution of treatment, which we call implicit designs. Regression estimators are numerically equivalent to augmented inverse propensity weighting (AIPW) estimators using an implicit design. Implicit designs serve as a framework that unifies and extends existing theoretical results on causal interpretation of regression across starkly distinct settings (including multiple treatment, panel, and instrumental variables). They lead to new theoretical insights for widely used but less understood specifications.

2404.11781 2026-01-21 stat.ME math.ST stat.AP stat.TH

Asymmetric canonical correlation analysis of Riemannian and high-dimensional data

James Buenfil, Eardi Lila

Journal ref Electron. J. Statist. 19 (2) 6077 - 6102, 2025

详情
英文摘要

In this paper, we introduce a novel statistical model for the integrative analysis of Riemannian-valued functional data and high-dimensional data. We apply this model to explore the dependence structure between each subject's dynamic functional connectivity -- represented by a temporally indexed collection of positive definite covariance matrices -- and high-dimensional data representing lifestyle, demographic, and psychometric measures. Specifically, we employ a reformulation of canonical correlation analysis that enables efficient control of the complexity of the functional canonical directions using tangent space sieve approximations. Additionally, we enforce an interpretable group structure on the high-dimensional canonical directions via a sparsity-promoting penalty. The proposed method shows improved empirical performance over alternative approaches and comes with theoretical guarantees. Its application to data from the Human Connectome Project reveals a dominant mode of covariation between dynamic functional connectivity and lifestyle, demographic, and psychometric measures. This mode aligns with results from static connectivity studies but reveals a unique temporal non-stationary pattern that such studies fail to capture.

2402.10818 2026-01-21 cs.LG stat.ML

Trading off Consistency and Dimensionality of Convex Surrogates for the Mode

Enrique Nueve, Bo Waggoner, Dhamma Kimpara, Jessie Finocchiaro

Comments Updated error with Bregman Losses to only Square Losses

详情
英文摘要

In multiclass classification over $n$ outcomes, the outcomes must be embedded into the reals with dimension at least $n-1$ in order to design a consistent surrogate loss that leads to the "correct" classification, regardless of the data distribution. For large $n$, such as in information retrieval and structured prediction tasks, optimizing a surrogate in $n-1$ dimensions is often intractable. We investigate ways to trade off surrogate loss dimension, the number of problem instances, and restricting the region of consistency in the simplex for multiclass classification. Following past work, we examine an intuitive embedding procedure that maps outcomes into the vertices of convex polytopes in a low-dimensional surrogate space. We show that full-dimensional subsets of the simplex exist around each point mass distribution for which consistency holds, but also, with less than $n-1$ dimensions, there exist distributions for which a phenomenon called hallucination occurs, which is when the optimal report under the surrogate loss is an outcome with zero probability. Looking towards application, we derive a result to check if consistency holds under a given polytope embedding and low-noise assumption, providing insight into when to use a particular embedding. We provide examples of embedding $n = 2^{d}$ outcomes into the $d$-dimensional unit cube and $n = d!$ outcomes into the $d$-dimensional permutahedron under low-noise assumptions. Finally, we demonstrate that with multiple problem instances, we can learn the mode with $\frac{n}{2}$ dimensions over the whole simplex.

2401.12309 2026-01-21 econ.EM stat.ME

Interpreting Event-Studies from Recent Difference-in-Differences Methods

Jonathan Roth

详情
英文摘要

This note discusses the interpretation of event-study plots produced by recent difference-in-differences methods. I show that even when specialized to the case of non-staggered treatment timing, the default plots produced by software for several of the most popular recent methods do not match those of traditional two-way fixed effects (TWFE) event-studies. The plots produced by the new methods may show a kink or jump at the time of treatment even when the TWFE event-study shows a straight line. This difference stems from the fact that the new methods construct the pre-treatment coefficients asymmetrically from the post-treatment coefficients. As a result, visual heuristics for evaluating violations of parallel trends using TWFE event-study plots should not be immediately applied to those from these methods. I conclude with practical recommendations for constructing and interpreting event-study plots when using these methods.

2312.16819 2026-01-21 cs.LG math.OC stat.ML

Hidden Minima in Two-Layer ReLU Networks

Yossi Arjevani

详情
英文摘要

We consider the optimization problem associated with training two-layer ReLU networks with \(d\) inputs under the squared loss, where the labels are generated by a target network. Recent work has identified two distinct classes of infinite families of minima: one whose training loss vanishes in the high-dimensional limit, and another whose loss remains bounded away from zero. The latter family is empirically avoided by stochastic gradient descent, hence \emph{hidden}, motivating the search for analytic criteria that distinguish hidden from non-hidden minima. A key challenge is that prior analyses have shown the Hessian spectra at hidden and non-hidden minima to coincide up to terms of order \(O(d^{-1/2})\), seemingly limiting the discriminative power of spectral methods. We therefore take a different route, studying instead certain curves along which the loss is locally minimized. Our main result shows that arcs emanating from hidden minima exhibit distinctive structural and symmetry properties, arising precisely from \(Ω(d^{-1/2})\) eigenvalue contributions that are absent from earlier analyses.

2309.12162 2026-01-21 stat.ME cs.LG econ.EM math.ST stat.TH

Optimal Conditional Inference in Adaptive Experiments

Jiafeng Chen, Isaiah Andrews

Comments An extended abstract of this paper was presented at CODE@MIT 2021

详情
英文摘要

We study batched bandit experiments and consider the problem of inference conditional on the realized stopping time, assignment probabilities, and target parameter, where all of these may be chosen adaptively using information up to the last batch of the experiment. Absent further restrictions on the experiment, we show that inference using only the results of the last batch is optimal. When the adaptive aspects of the experiment are known to be location-invariant, in the sense that they are unchanged when we shift all batch-arm means by a constant, we show that there is additional information in the data, captured by one additional linear function of the batch-arm means. In the more restrictive case where the stopping time, assignment probabilities, and target parameter are known to depend on the data only through a collection of polyhedral events, we derive computationally tractable and optimal conditional inference procedures.

2210.00953 2026-01-21 stat.ML cs.LG math.OC

Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Stepsizes

Dongyan Huo, Yudong Chen, Qiaomin Xie

Comments SIGMETRICS 2023

Journal ref Mathematics of Operations Research, 2026

详情
英文摘要

We consider Linear Stochastic Approximation (LSA) with a constant stepsize and Markovian data. Viewing the joint process of the data and LSA iterate as a time-homogeneous Markov chain, we prove its convergence to a unique limiting and stationary distribution in Wasserstein distance and establish non-asymptotic, geometric convergence rates. Furthermore, we show that the bias vector of this limit admits an infinite series expansion with respect to the stepsize. Consequently, the bias is proportional to the stepsize up to higher order terms. This result stands in contrast with LSA under i.i.d. data, for which the bias vanishes. In the reversible chain setting, we provide a general characterization of the relationship between the bias and the mixing time of the Markovian data, establishing that they are roughly proportional to each other. While Polyak-Ruppert tail-averaging reduces the variance of the LSA iterates, it does not affect the bias. The above characterization allows us to show that the bias can be reduced using Richardson-Romberg extrapolation with $m\ge 2$ stepsizes, which eliminates the $m-1$ leading terms in the bias expansion. This extrapolation scheme leads to an exponentially smaller bias and an improved mean squared error, both in theory and empirically. Our results immediately apply to the Temporal Difference learning algorithm with linear function approximation, Markovian data, and constant stepsizes.

2111.03754 2026-01-21 stat.ME

Transformed Linear Prediction for Extremes

Jeongjin Lee, Daniel Cooley

详情
英文摘要

We address the problem of prediction for extreme observations by proposing an extremal linear prediction method. We construct an inner product space of nonnegative random variables derived from transformed-linear combinations of independent regularly varying random variables. Under a reasonable modeling assumption, the matrix of inner products corresponds to the tail pairwise dependence matrix, which can be easily estimated. We derive the optimal transformed-linear predictor via the projection theorem, which yields a predictor with the same form as the best linear unbiased predictor in non-extreme settings. We quantify uncertainty for prediction errors by constructing prediction intervals based on the geometry of regular variation. We demonstrate the effectiveness of our method through a simulation study and its applications to predicting high pollution levels, and extreme precipitation.

2103.03191 2026-01-21 stat.ML cs.LG cs.NA math.NA math.OC math.PR

Generalization Bounds for Sparse Random Feature Expansions

Abolfazl Hashemi, Hayden Schaeffer, Robert Shi, Ufuk Topcu, Giang Tran, Rachel Ward

详情
英文摘要

Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative approach to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This paper introduces the sparse random feature expansion to obtain parsimonious random feature models. Specifically, we leverage ideas from compressive sensing to generate random feature expansions with theoretical guarantees even in the data-scarce setting. In particular, we provide generalization bounds for functions in a certain class (that is dense in a reproducing kernel Hilbert space) depending on the number of samples and the distribution of features. The generalization bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. In particular, by introducing sparse features, i.e. features with random sparse weights, we provide improved bounds for low order functions. We show that the sparse random feature expansions outperforms shallow networks in several scientific machine learning tasks.

2009.13040 2026-01-21 stat.ML cs.LG math.ST stat.TH

Local Minima Structures in Gaussian Mixture Models

Yudong Chen, Dogyoon Song, Xumei Xi, Yuqian Zhang

Comments 73 pages, 6 figures, 2Tables. To appear in Transactions on Information Theory

Journal ref IEEE Transactions on Information Theory, vol. 70, no. 6, pp. 4218-4257, 2024

详情
英文摘要

We investigate the landscape of the negative log-likelihood function of Gaussian Mixture Models (GMMs) with a general number of components in the population limit. As the objective function is non-convex, there can be multiple local minima that are not globally optimal, even for well-separated mixture models. Our study reveals that all local minima share a common structure that partially identifies the cluster centers (i.e., means of the Gaussian components) of the true location mixture. Specifically, each local minimum can be represented as a non-overlapping combination of two types of sub-configurations: fitting a single mean estimate to multiple Gaussian components or fitting multiple estimates to a single true component. These results apply to settings where the true mixture components satisfy a certain separation condition, and are valid even when the number of components is over- or under-specified. We also present a more fine-grained analysis for the setting of one-dimensional GMMs with three components, which provide sharper approximation error bounds with improved dependence on the separation.