arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.03247 2026-03-04 stat.ME stat.AP

Fusing Sparse Observations and Dense Simulations for Spatial Extreme Value Analysis: Application to U.S. Coastal Sea Levels

Brian N. White, Brian Blanton, Rick Luettich, Richard L. Smith

Comments 34 pages, 7 figures, 7 tables; Supporting Information included

Abstract

Estimating spatial extremes from sparse observational networks produces uncertain return level maps, but dense output from physics-based simulation models is often available as a complementary data source. We develop a two-stage frequentist framework for fusing observations and simulations. In Stage 1, generalized extreme value (GEV) distributions are fitted independently at each site, with a nonstationary location parameter where appropriate to accommodate observed trends. In Stage 2, the parameter estimates from all sources are modeled jointly as a high-dimensional spatial process through a linear model of coregionalization (LMC). Cross-source correlations, estimated from spatially interspersed networks without co-located sites, provide the mechanism for information transfer; an analytic gradient for the resulting likelihood keeps estimation computationally practical. We apply the framework to U.S. coastal sea levels over 1979-2021, fusing 29 NOAA tide gauge records with 100 ADCIRC hydrodynamic simulation sites. Leave-one-out cross-validation shows a 35% reduction in 100-year return level RMSE relative to a gauge-only model. Geographic block cross-validation confirms that fusion benefits persist under spatial extrapolation. The approach is implemented in the R package evfuse.
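
For readers unfamiliar with the return levels mentioned in the abstract above, the following is a minimal sketch of the standard GEV distribution function and its T-block return level. It is not taken from the evfuse package; the function names and the linear-in-time form of the location parameter are assumptions made for illustration.

```python
import math

def gev_cdf(z, mu, sigma, xi):
    """Standard GEV distribution function with location mu, scale sigma, shape xi."""
    if abs(xi) < 1e-12:
        # Gumbel limit (xi -> 0)
        return math.exp(-math.exp(-(z - mu) / sigma))
    t = 1.0 + xi * (z - mu) / sigma
    if t <= 0.0:
        # z lies outside the support: below it when xi > 0, above it when xi < 0
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def gev_return_level(T, mu, sigma, xi):
    """Level exceeded with probability 1/T in any one block (e.g. one year)."""
    y = -math.log(1.0 - 1.0 / T)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

def linear_location(t, mu0, mu1):
    """Hypothetical nonstationary location: a linear trend in time t (an assumption)."""
    return mu0 + mu1 * t
```

By construction, evaluating `gev_cdf` at `gev_return_level(100, mu, sigma, xi)` gives 0.99, which is what "100-year return level" means when each block is one year.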

2603.03235 2026-03-04 stat.ML cs.LG stat.ME

The elbow statistic: Multiscale clustering statistical significance

Francisco J. Perez-Reche

Comments 30 pages, 3 figures, 5 tables

Abstract

Selecting the number of clusters remains a fundamental challenge in unsupervised learning. Existing criteria typically target a single "optimal" partition, often overlooking statistically meaningful structure present at multiple resolutions. We introduce ElbowSig, a framework that formalizes the heuristic "elbow" method as a rigorous inferential problem. Our approach centers on a normalized discrete curvature statistic derived from the cluster heterogeneity sequence, which is evaluated against a null distribution of unstructured data. We derive the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability. As an algorithm-agnostic procedure, ElbowSig requires only the heterogeneity sequence and is compatible with a wide range of clustering methods, including hard, fuzzy, and model-based clustering. Extensive experiments on synthetic and empirical datasets demonstrate that the method maintains appropriate Type-I error control while providing the power to resolve multiscale organizational structures that are typically obscured by single-resolution selection criteria.
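
The idea of testing an elbow can be sketched as follows. This is only an illustration of a discrete curvature compared against null sequences, not the ElbowSig statistic itself: the scaling by the first heterogeneity value, the Monte Carlo p-value, and the function names are all assumptions, and the abstract does not specify the paper's normalization.

```python
def discrete_curvature(w):
    """Scaled second differences of a heterogeneity sequence.

    w[i] is the heterogeneity (e.g. within-cluster sum of squares) for
    k = i + 1 clusters. For each interior k the result holds
    (W_{k-1} - 2 W_k + W_{k+1}) / W_1, keyed by k.
    """
    return {
        i + 1: (w[i - 1] - 2.0 * w[i] + w[i + 1]) / w[0]
        for i in range(1, len(w) - 1)
    }

def elbow_p_values(w_obs, w_null_list):
    """Monte Carlo p-value per k: how often a null curvature is at least as large."""
    obs = discrete_curvature(w_obs)
    nulls = [discrete_curvature(w) for w in w_null_list]
    return {
        k: (1 + sum(n[k] >= c for n in nulls)) / (1 + len(nulls))
        for k, c in obs.items()
    }
```

In practice the null sequences would come from clustering many unstructured reference datasets with the same algorithm; a small p-value at some k then flags an elbow that is sharper than unstructured data tends to produce.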

2603.03191 2026-03-04 stat.ML cs.LG math.OC

A Covering Framework for Offline POMDPs Learning using Belief Space Metric

Youheng Zhu, Yiping Lu

Abstract

In off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel covering analysis framework that exploits the intrinsic metric structure of the belief space (distributions over latent states) to relax traditional coverage assumptions. By assuming value-relevant functions are Lipschitz continuous in the belief space, we derive error bounds that mitigate exponential blow-ups in horizon and memory length. Our unified analysis technique applies to a broad class of OPE algorithms, yielding concrete error bounds and coverage requirements expressed in terms of belief space metrics rather than raw history coverage. We illustrate the improved sample efficiency of this framework via two case studies: the double-sampling Bellman error minimization algorithm and the memory-based future-dependent value functions (FDVF). In both cases, our coverage definition based on the belief space metric yields tighter bounds.

2603.03154 2026-03-04 stat.ME stat.CO

Extending the saemix package for R to fit non Gaussian outcomes

Emmanuelle Comets, Maud Delattre, Belhal Karimi

Comments Main text: 24 pages, 6 figures, 6 tables

Abstract

Background and Objectives: Longitudinal data are increasingly collected in clinical trials to provide information on treatment action and disease evolution. The trajectory of continuous biomarkers such as target hormone concentrations or viral loads can then be modelled in relationship to the occurrence of events such as recovery or hospitalisation. Other studies may include repeated measurements of discrete pain scores, number of episodes (count) or occurrence of events (survival). Non-linear mixed-effect models (NLMEM) can handle individual differences in trajectories while modelling the underlying population evolution and are the natural choice for their analysis. The saemix package for R is one of the few open-source solutions and the most flexible. In this paper, we extend it to accommodate a variety of models for non-Gaussian data. Methods: The saemix package estimates parameters through the Stochastic Approximation Expectation-Maximisation (SAEM) algorithm. Within the package, non-Gaussian models are specified by their log-likelihood functions, affording maximal control over model formulation. We extend estimation algorithms as well as exploratory and diagnostic plots for non-Gaussian data. Bootstrap approaches were implemented to estimate parameter uncertainty. To evaluate the performance of saemix, we performed a simulation study based on the toenail dataset, containing repeated binary data from a randomised clinical trial. Results: saemix recovered the true parameter values well in the simulation study and was stable across different starting values for the parameters. An algorithm that jointly searches for the covariate and interindividual variability models was also implemented to build the covariate model and applied to categorical and survival-type data.

2603.03035 2026-03-04 stat.ML cs.LG

Generalized Bayes for Causal Inference

Emil Javurek, Dennis Frauen, Yuxin Wang, Stefan Feuerriegel

Abstract

Uncertainty quantification is central to many applications of causal machine learning, yet principled Bayesian inference for causal effects remains challenging. Standard Bayesian approaches typically require specifying a probabilistic model for the data-generating process, including high-dimensional nuisance components such as propensity scores and outcome regressions. Standard posteriors are thus vulnerable to strong modeling choices, including complex prior elicitation. In this paper, we propose a generalized Bayesian framework for causal inference. Our framework avoids explicit likelihood modeling; instead, we place priors directly on the causal estimands and update these using an identification-driven loss function, which yields generalized posteriors for causal effects. As a result, our framework turns existing loss-based causal estimators into estimators with full uncertainty quantification. Our framework is flexible and applicable to a broad range of causal estimands (e.g., ATE, CATE). Further, our framework can be applied on top of state-of-the-art causal machine learning pipelines (e.g., Neyman-orthogonal meta-learners). For Neyman-orthogonal losses, we show that the generalized posteriors converge to their oracle counterparts and remain robust to first-stage nuisance estimation error. With calibration, we thus obtain valid frequentist uncertainty even when nuisance estimators converge at slower-than-parametric rates. Empirically, we demonstrate that our proposed framework offers causal effect estimation with calibrated uncertainty across several causal inference settings. To the best of our knowledge, this is the first flexible framework for constructing generalized Bayesian posteriors for causal machine learning.

2603.03008 2026-03-04 econ.EM stat.ME

Focused Weighted-Average Least Squares Estimator

Shou-Yung Yin

Abstract

We propose a focused weighted-average least squares (FWALS) estimator that addresses the computational burden of focused model averaging. By semi-orthogonalizing auxiliary regressors, the weighting problem is reduced from $2^{k_2}$ sub-models to at most $k_2$ regressor-wise weights, yielding a tractable sub-optimal procedure. Under local-to-zero conditions, we derive the limiting distribution of FWALS for smooth focused functions and provide a plug-in AMSE criterion for data-driven weight selection. Simulations show that FWALS closely matches the focused information criterion (FIC) benchmark and delivers stable performance when the focused function is an impulse response function. Prior-based WALS can be competitive in some settings, but its performance depends on the signal regime and the design of the focused parameter. Overall, FWALS offers a practical and robust alternative with substantial computational savings.

2603.02906 2026-03-04 cs.LG stat.ME

Towards Accurate and Interpretable Time-series Forecasting: A Polynomial Learning Approach

Bo Liu, Shao-Bo Lin, Changmiao Wang, Xiaotong Liu

Abstract

Time series forecasting enables early warning and has driven asset performance management from traditional planned maintenance to predictive maintenance. However, the lack of interpretability in forecasting methods undermines users' trust and complicates debugging for developers. Consequently, interpretable time-series forecasting has attracted increasing research attention. Nevertheless, existing methods suffer from several limitations, including insufficient modeling of temporal dependencies, lack of feature-level interpretability to support early warning, and difficulty in simultaneously achieving accuracy and interpretability. This paper proposes the interpretable polynomial learning (IPL) method, which integrates interpretability into the model structure by explicitly modeling original features and their interactions of arbitrary order through polynomial representations. This design preserves temporal dependencies, provides feature-level interpretability, and offers a flexible trade-off between prediction accuracy and interpretability by adjusting the polynomial degree. We evaluate IPL on simulated and Bitcoin price data, showing that it achieves high prediction accuracy with superior interpretability compared with widely used explainability methods. Experiments on field-collected antenna data further demonstrate that IPL yields simpler and more efficient early warning mechanisms.

2603.02898 2026-03-04 q-fin.ST econ.EM stat.AP

Range-Based Volatility Estimators for Monitoring Market Stress: Evidence from Local Food Price Data

Bo Pieter Johannes Andrée

Comments 41 pages, 10 figures, 11 tables

Abstract

Range-based volatility estimators are widely used in financial econometrics to quantify risk and market stress, yet their application to local commodity markets remains limited. This paper shows how open-high-low-close (OHLC) volatility estimators can be adapted to monitor localized market distress across diverse development contexts, including conflict-affected settings, climate-exposed regions, remote and thinly traded markets, and import- and logistics-constrained urban hubs. Using monthly food price data from the World Bank's Real-Time Prices dataset, several volatility measures, including the Parkinson, Garman-Klass, Rogers-Satchell, and Yang-Zhang estimators, are constructed and evaluated against independently documented disruption timelines. Across settings, elevated volatility aligns with episodes linked to insecurity and market fragmentation, extreme weather and disaster shocks, policy and fuel-cost adjustments, and global supply-chain and trade disruptions. Volatility also detects stress that standard momentum indicators such as the relative strength index (RSI) can miss, including symmetric or rapidly reversing shocks in which offsetting supply and demand disturbances dampen net directional price movements while amplifying intra-period dispersion. Overall, OHLC-based volatility indicators provide a robust and interpretable signal of market disruptions and complement price-level monitoring for applications spanning financial risk, humanitarian early warning, and trade.
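
As a reference for three of the estimators named in the abstract above, here is a minimal sketch of their textbook per-period variance formulas. How the paper builds OHLC bars from monthly food prices is not stated in the abstract, so the inputs are simply assumed to be aligned lists of positive open, high, low, and close prices; the function names are illustrative.

```python
import math

def parkinson_variance(highs, lows):
    """Parkinson: mean of ln(H/L)^2, divided by 4 ln 2."""
    n = len(highs)
    total = sum(math.log(h / l) ** 2 for h, l in zip(highs, lows))
    return total / (4.0 * math.log(2.0) * n)

def garman_klass_variance(opens, highs, lows, closes):
    """Garman-Klass: mean of 0.5 ln(H/L)^2 - (2 ln 2 - 1) ln(C/O)^2."""
    n = len(opens)
    total = 0.0
    for o, h, l, c in zip(opens, highs, lows, closes):
        total += 0.5 * math.log(h / l) ** 2
        total -= (2.0 * math.log(2.0) - 1.0) * math.log(c / o) ** 2
    return total / n

def rogers_satchell_variance(opens, highs, lows, closes):
    """Rogers-Satchell: mean of ln(H/C) ln(H/O) + ln(L/C) ln(L/O)."""
    n = len(opens)
    total = 0.0
    for o, h, l, c in zip(opens, highs, lows, closes):
        total += math.log(h / c) * math.log(h / o)
        total += math.log(l / c) * math.log(l / o)
    return total / n
```

The Rogers-Satchell form stays positive for a bar whose close equals its open but whose range is wide, which matches the abstract's point that a range-based measure can register stress when net directional movement is small. The Yang-Zhang estimator additionally combines overnight and open-to-close variances and is omitted here.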

2603.02890 2026-03-04 math.ST math.PR stat.TH

Markov processes on a circular lattice

Sourav Majumdar

Abstract

We develop a Markov process viewpoint for discrete circular distributions motivated by directional-statistics settings where angles are observed on a finite grid and evolve over time. On the $m$-point discrete circle, the cycle graph, we study diffusion-generated families, obtaining an explicit transition kernel, exact trigonometric moments, and convergence to uniformity. We present a simple approach to construct reversible nearest-neighbour chains with any prescribed strictly positive stationary pmf $π$, providing discrete analogues of Markov processes on the continuous circle. We construct processes whose stationary laws are the discrete von Mises and wrapped Cauchy distributions with closed-form normalizers and exact moments.
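
One generic way to obtain a reversible nearest-neighbour chain on the cycle with a prescribed positive stationary pmf is a Metropolis construction, sketched below. This is a standard technique offered for illustration and is not necessarily the construction used in the paper; the discrete von Mises form (weights proportional to exp(kappa * cos(theta_k - mu)) on the m grid angles) is likewise an assumption, normalized numerically here.

```python
import math

def discrete_von_mises(m, mu, kappa):
    """pmf on m equally spaced angles, proportional to exp(kappa*cos(theta - mu))."""
    weights = [math.exp(kappa * math.cos(2.0 * math.pi * k / m - mu)) for k in range(m)]
    total = sum(weights)
    return [w / total for w in weights]

def metropolis_cycle_kernel(pi):
    """Transition matrix of a Metropolis chain on the m-cycle (m >= 3).

    From state i, propose i-1 or i+1 (mod m) with probability 1/2 each and
    accept with probability min(1, pi[j] / pi[i]); otherwise stay at i.
    """
    m = len(pi)
    if m < 3:
        raise ValueError("need at least 3 points on the cycle")
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in ((i - 1) % m, (i + 1) % m):
            P[i][j] = 0.5 * min(1.0, pi[j] / pi[i])
        # remaining mass is the probability of a rejected proposal
        P[i][i] = 1.0 - P[i][(i - 1) % m] - P[i][(i + 1) % m]
    return P
```

Detailed balance, pi[i] * P[i][j] == pi[j] * P[j][i], holds because both sides equal 0.5 * min(pi[i], pi[j]) for neighbouring states, so pi is stationary and the chain is reversible.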

2603.02861 2026-03-04 stat.ME

Focused Information Criteria for Semiparametric Linear Hazard Regression

Axel Gandy, Nils Lid Hjort

Comments 16 pages, 4 figures, 3 tables; Statistical Research Report, Department of Mathematics, University of Oslo, February 2009, now arXiv'd March 2026. The paper was accepted by Biometrika in 2010, modulo "minor changes", but things slipped away from our tables

Abstract

The semiparametric linear hazard regression model introduced by McKeague and Sasieni (1994) is an extension of the linear hazard regression model developed by Aalen (1980). Methods of model selection for this type of model are still underdeveloped. In the process of fitting a semiparametric linear hazard regression model one usually starts with a given set of covariates. For each covariate one has at least the following three choices: allow it to have time-varying effect; allow it to have constant effect over time; or exclude it from the model. In this paper we discuss focused information criteria (FIC) to help with this choice. In the spirit of Claeskens and Hjort (2003, 2008), "focused" means that one is interested in one specific quantity, e.g. the probability of survival of a patient with a certain set of covariates up to a given time. The FIC involves estimating the mean squared error of the estimator of the quantity one is interested in, and the chosen model is the one minimising this estimated mean squared error. The focused model selection machinery is extended to allow for weighted versions, leading to a suitable wFIC method that aims at finding models that lead to good estimates of a given list of parameters, such as survival probabilities for a subset of patients or for a specified region of covariate vectors. In addition to developing model selection criteria, methods associated with averaging across the best models are also discussed. We illustrate these methods of model selection in a real data situation.

2603.02840 2026-03-04 cs.LG stat.ML

Adapting Time Series Foundation Models through Data Mixtures

Thomas L. Lee, Edoardo M. Ponti, Amos Storkey

Comments Preprint, 8 pages

Abstract

Time series foundation models (TSFMs) have become increasingly popular for zero-shot forecasting. However, for a new time series domain not fully covered by the pretraining set, performance can suffer. Therefore, when a practitioner cares about a new domain and has access to a set of related datasets, the question arises: how best to fine-tune a TSFM to improve zero-shot forecasting? A typical approach to this type of problem is to fine-tune a LoRA module on all datasets or separately on each dataset. Tuning a separate module on each dataset allows for the specialisation of the TSFM to different types of data distribution, by selecting differing combinations of per-dataset modules for different time series contexts. However, we find that using per-dataset modules might not be optimal, since a time series dataset can contain data from several types of distributions, i.e. sub-domains. This can be due to the distribution shifting or to different dimensions of the time series having differing distributions. Hence, we propose MixFT, which re-divides the data using Bayesian mixtures into sets that best represent the sub-domains present in the data, and fine-tunes separately on each of these sets. This re-division of the data ensures that each set is more homogeneous, leading to fine-tuned modules focused on specific sub-domains. Our experiments show that MixFT performs better than per-dataset methods and than fine-tuning a single module on all the data. This suggests that by re-partitioning the data to represent sub-domains we can better specialise TSFMs to improve zero-shot forecasting.

2603.02753 2026-03-04 cs.LG q-bio.QM stat.ML

Deep learning-guided evolutionary optimization for protein design

Erik Hartman, Di Tang, Johan Malmström

Comments Code available at GitHub

Abstract

Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence-function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data-efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of Streptococcus pneumoniae. BoGA accelerates the discovery of high-confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license on GitHub at https://github.com/ErikHartman/bopep.

2602.21501 2026-03-04 stat.ML cs.LG math.ST stat.TH

A Researcher's Guide to Empirical Risk Minimization

Lars van der Laan

Comments Version 2; minor edits and clarifications, expanded references, extended Section 2 (high-probability bounds)

Abstract

This guide provides a reference for high-probability regret bounds in empirical risk minimization (ERM). The presentation is modular: we begin with intuition and general proof strategies, then state broadly applicable guarantees under high-level conditions and provide tools for verifying them for specific losses and function classes. We emphasize that many ERM rate derivations can be organized around a three-step recipe -- a basic inequality, a uniform local concentration bound, and a fixed-point argument -- which yields regret bounds in terms of a critical radius, defined via localized Rademacher complexity, under a mild Bernstein-type variance-risk condition. To make these bounds concrete, we upper bound the critical radius using local maximal inequalities and metric-entropy integrals, thereby recovering familiar rates for VC-subgraph, Sobolev/Hölder, and bounded-variation classes. We also study ERM with nuisance components -- including weighted ERM and Neyman-orthogonal losses -- as they arise in causal inference, missing data, and domain adaptation. Following the orthogonal statistical learning framework, we highlight that these problems often admit regret-transfer bounds linking regret under an estimated loss to population regret under the target loss. These bounds typically decompose the regret into (i) statistical error under the estimated loss and (ii) approximation error due to nuisance estimation. Under sample splitting or cross-fitting, the first term can be controlled using standard fixed-loss ERM regret bounds, while the second depends only on nuisance-estimation accuracy. As a novel contribution, we also treat the in-sample regime, in which the nuisances and the ERM are fit on the same data, deriving regret bounds and showing that fast oracle rates remain attainable under suitable smoothness and Donsker-type conditions.

2602.20394 2026-03-04 stat.ML cond-mat.stat-mech cs.LG

Selecting Optimal Variable Order in Autoregressive Ising Models

Shiba Biswal, Marc Vuffray, Andrey Y. Lokhov

Abstract

Autoregressive models enable tractable sampling from learned probability distributions, but their performance depends critically on the variable ordering used in the factorization, because the ordering determines the complexity of the resulting conditional distributions. We propose to learn the Markov random field describing the underlying data, and use the inferred graphical model structure to construct optimized variable orderings. We illustrate our approach on two-dimensional image-like models where a structure-aware ordering leads to restricted conditioning sets, thereby reducing model complexity. Numerical experiments on Ising models with discrete data demonstrate that graph-informed orderings yield higher-fidelity generated samples compared to naive variable orderings.

2602.05797 2026-03-04 cs.LG stat.ME

Classification Under Local Differential Privacy with Model Reversal and Model Averaging

Caihong Qin, Yang Bai

Journal ref J. Mach. Learn. Res. 27 (2026) 1-44

Abstract

Local differential privacy (LDP) has become a central topic in data privacy research, offering strong privacy guarantees by perturbing user data at the source and removing the need for a trusted curator. However, the noise introduced by LDP often significantly reduces data utility. To address this issue, we reinterpret private learning under LDP as a transfer learning problem, where the noisy data serve as the source domain and the unobserved clean data as the target. We propose novel techniques specifically designed for LDP to improve classification performance without compromising privacy: (1) a noised binary feedback-based evaluation mechanism for estimating dataset utility; (2) model reversal, which salvages underperforming classifiers by inverting their decision boundaries; and (3) model averaging, which assigns weights to multiple reversed classifiers based on their estimated utility. We provide theoretical excess risk bounds under LDP and demonstrate how our methods reduce this risk. Empirical results on both simulated and real-world datasets show substantial improvements in classification accuracy.
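
The model-reversal idea in the abstract above can be illustrated with a toy sketch for binary labels. The feedback mechanism shown is plain randomized response with standard debiasing, which is an assumption made here and may differ from the paper's evaluation mechanism; all function names are hypothetical.

```python
import math

def flip_probability(epsilon):
    """Randomized response: each binary report is flipped with this probability."""
    return 1.0 / (1.0 + math.exp(epsilon))

def debias_accuracy(observed_rate, epsilon):
    """Recover the accuracy a from the noisy 'correct' rate r = a(1-q) + (1-a)q."""
    q = flip_probability(epsilon)
    return (observed_rate - q) / (1.0 - 2.0 * q)

def maybe_reverse(predict, estimated_accuracy):
    """Invert a binary classifier's output if it is estimated to be worse than chance."""
    if estimated_accuracy < 0.5:
        return lambda x: 1 - predict(x)
    return predict
```

A binary classifier with true accuracy a has accuracy 1 - a after inversion, so reversal turns an estimated accuracy of 0.3 into roughly 0.7, provided the noisy estimate is reliable enough to put it on the correct side of 0.5.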

2511.08111 2026-03-04 math.PR math.ST stat.TH

On the Kantorovich contraction of Markov semigroups

Pierre Del Moral, Mathieu Gerber

Comments 39 pages

Abstract

This paper develops a novel operator theoretic framework to study the contraction properties of Markov semigroups with respect to a general class of Kantorovich semi-distances, which notably includes Wasserstein distances. The rather simple contraction cost framework developed in this article, which combines standard Lyapunov techniques with local contraction conditions, helps to unify and simplify many arguments in the stability of Markov semigroups, as well as to improve upon some existing results. Our results can be applied to both discrete time and continuous time Markov semigroups, and we illustrate their wide applicability in the context of (i) Markov transitions on models with boundary states, including bounded domains with entrance boundaries, (ii) operator products of a Markov kernel and its adjoint, including two-block-type Gibbs samplers, (iii) iterated random functions and (iv) diffusion models, including overdamped Langevin diffusion with convex at infinity potentials.

2510.08646 2026-03-04 cs.LG cs.AI cs.CL stat.ML

Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li

Abstract

Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize mitigating responses to harmful prompts at the expense of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safe alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusal or jailbreak) and low energy to desirable states (helpful response or safe reject). During inference, the EBM maps the LLM's internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden states toward low-energy regions in real time. This dynamically guides the model toward desirable behavior without modifying its parameters. By decoupling behavioral control from the model's core knowledge, ELS provides a flexible and computationally efficient solution. Extensive experiments across diverse models demonstrate its effectiveness, raising compliance on the ORB-H benchmark from 57.3 percent to 82.6 percent while maintaining baseline safety performance. Our work establishes a promising paradigm for building LLMs that simultaneously achieve high safety and low false refusal rates.
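
The steering step described in the abstract above, moving hidden states down the gradient of an energy function, can be sketched in a toy setting. The quadratic energy below is a stand-in chosen for illustration; in the paper the energy comes from a trained external EBM, and the step size, number of steps, and function names here are assumptions.

```python
def energy(h, target):
    """Toy energy: half the squared distance from a hypothetical low-energy point."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(h, target))

def energy_grad(h, target):
    """Gradient of the toy energy with respect to the hidden state h."""
    return [a - b for a, b in zip(h, target)]

def steer(h, target, step=0.1, n_steps=1):
    """Move h toward lower energy by plain gradient descent: h <- h - step * grad E(h)."""
    for _ in range(n_steps):
        g = energy_grad(h, target)
        h = [a - step * b for a, b in zip(h, g)]
    return h
```

Only the activation vector is updated; the parameters of the model that produced it are untouched, which is the sense in which the intervention happens at inference time.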

2510.08382 2026-03-04 cs.LG stat.ML

Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions

Jacob Trauger, Tyson Trauger, Ambuj Tewari

Comments 15 pages

Abstract

In this paper we give a characterization of the learnability of forgiving 0-1 loss functions in the multiclass setting with effectively finite cardinality of the output and label space. To do this, we create a new combinatorial dimension based on the Natarajan Dimension, and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show how this dimension characterizes other known learning settings, such as a large number of instantiations of learning with set-valued feedback and a modified version of list learning.

2509.22613 2026-03-04 cs.AI cs.CL cs.LG stat.ML

Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen

Abstract

Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse: output diversity decreases during training, and the loss persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent Q-value bias in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.

2508.14400 2026-03-04 math.ST stat.TH

Gaussian Multiplier Bootstrap Procedure for the $k$th Largest Coordinate of High-Dimensional Statistics

Yixi Ding, Qizhai Li, Yuke Shi, Liuquan Sun, Luobin Zhang

Abstract

We consider the problem of Gaussian multiplier bootstrap procedures for the $k$th largest statistics and functions of the top $k$ order statistics, which are commonly encountered in high-dimensional statistical inference. Such a problem has been studied previously for $k=1$ (i.e., maxima). However, in many applications, a general $k$ ($k\geq 1$) is of great interest. We provide upper bounds for the errors between Gaussian approximations and Gaussian multiplier approximations. The dimension $p$ is allowed to be larger than the sample size $n$. The effectiveness of the proposed methods is demonstrated via numerical results and a real-world data analysis.
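
A Gaussian multiplier bootstrap for the $k$th largest coordinate can be sketched as below. This follows the generic recipe (center the data, reweight rows by independent standard normal multipliers, take the $k$th largest coordinate of the scaled sum) and is an illustration under that assumption rather than the paper's exact procedure; the function names are hypothetical.

```python
import math
import random

def kth_largest(values, k):
    """k-th largest entry of a list (k = 1 gives the maximum)."""
    return sorted(values, reverse=True)[k - 1]

def multiplier_bootstrap_kth(X, k, B=500, seed=0):
    """Bootstrap draws of the k-th largest coordinate of n^{-1/2} sum_i e_i (X_i - mean)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    draws = []
    for _ in range(B):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Gaussian multipliers
        stat = [
            sum(e[i] * centered[i][j] for i in range(n)) / math.sqrt(n)
            for j in range(p)
        ]
        draws.append(kth_largest(stat, k))
    return draws

def bootstrap_quantile(draws, level):
    """Empirical quantile of the bootstrap draws, used as a critical value."""
    ordered = sorted(draws)
    index = min(len(ordered) - 1, math.ceil(level * len(ordered)) - 1)
    return ordered[index]
```

The observed statistic, the $k$th largest coordinate of the scaled sample mean, would then be compared with `bootstrap_quantile(draws, 0.95)` for a test at the 5% level.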

2508.12983 2026-03-04 stat.ME

Dynamic Latent Class Structural Equation Modeling: A Hands-On Tutorial for Modeling Intensive Longitudinal Data

Roberto Faleh, Sofia Morelli, Vivato Andriamiarana, Zachary J. Roman, Christoph Flückiger, Holger Brandt

Comments 41 pages, 13 figures, 13 tables

Abstract

In this tutorial, we provide a hands-on guideline on how to implement complex Dynamic Latent Class Structural Equation Models (DLCSEM) in the Bayesian software JAGS. We provide building blocks starting with simple Confirmatory Factor and Time Series analysis, and then extend these blocks to Multilevel Models and Dynamic Structural Equation Models (DSEM). Subsequently, we introduce Hidden Markov Switching Models (HMSM) and demonstrate their integration with DSEM to yield DLCSEM. Leading through the tutorial is an example from clinical psychology using data on a generalized anxiety treatment that includes scales on anxiety symptoms and the Working Alliance Inventory that measures alliance between therapists and patients. Within each block, we provide an overview, specific hypotheses we want to test, the resulting model and its implementation, as well as an interpretation of the results. The aim of this tutorial is to provide a step-by-step guide for applied researchers that enables them to use this flexible DLCSEM framework for their own analyses.

2506.07275 2026-03-04 cs.LG cs.HC stat.AP

Tailored Behavior-Change Messaging for Physical Activity: Integrating Contextual Bandits and Large Language Models

Haochen Song, Dominik Hofer, Rania Islambouli, Laura Hawkins, Ananya Bhattacharjee, Zahra Hassanzadeh, Jan Smeddinck, Meredith Franklin, Joseph Jay Williams

Abstract

Contextual multi-armed bandit (cMAB) algorithms offer a promising framework for adapting behavioral interventions to individuals over time. However, cMABs often require large samples to learn effectively and typically rely on a finite, pre-set collection of fixed message templates. In this paper, we present a hybrid cMABxLLM approach in which the cMAB selects an intervention type and a large language model (LLM) personalizes the message content within the selected type. We deployed this approach in a 30-day physical-activity intervention, comparing four behavioral change intervention types: behavioral self-monitoring, gain-framing, loss-framing, and social comparison, delivered as daily motivational messages to support motivation and achieve a daily step count. Message content is personalized using dynamic contextual factors, including daily fluctuations in self-efficacy, social influence, and regulatory focus. Over the trial, participants received daily messages assigned by one of five models: equal randomization (RCT), cMAB only, LLM only, LLM with interaction history, or cMABxLLM. Outcomes include motivation towards physical activity and message usefulness, assessed via ecological momentary assessments (EMAs). We evaluate and compare the five delivery models using pre-specified statistical analyses that account for repeated measures and time trends. We find that the cMABxLLM approach retains the perceived acceptance of LLM-generated messages, while reducing token usage and providing an explicit, reproducible decision rule for intervention selection. This hybrid approach also avoids skew in intervention delivery by improving support for under-delivered intervention types. More broadly, our approach provides a deployable template for combining Bayesian adaptive experimentation with generative models in a way that supports both personalization and interpretability.

2506.05116 2026-03-04 stat.ME econ.EM math.ST stat.TH

The Spurious Factor Dilemma: Robust Inference in Heavy-Tailed Elliptical Factor Models

Jiang Hu, Jiahui Xie, Yangchun Zhang, Wang Zhou

Comments Added some content and some simulations

Details
English abstract

Standard methods for determining the number of factors often overestimate the true number when data exhibit heavy-tailed randomness, misinterpreting noise-induced outliers as genuine factors. This paper addresses this challenge within the framework of Elliptical Factor Models (EFM), which accommodate both heavy tails and potential non-linear dependencies common in real-world data. We demonstrate, both theoretically and empirically, that heavy-tailed noise generates spurious eigenvalues that mimic true factor signals. To distinguish these, we propose a novel methodology based on a fluctuation magnification algorithm. Under mild conditions, we show that, by magnifying perturbations, the eigenvalues associated with real factors exhibit significantly less fluctuation (stabilizing asymptotically) than spurious eigenvalues arising from heavy-tailed effects. We develop a formal testing procedure based on this principle and apply it to the problem of accurately selecting the number of common factors in heavy-tailed EFMs. Simulation studies and real data analysis confirm the effectiveness of our approach, particularly in scenarios with pronounced heavy-tailedness.

2505.21813 2026-03-04 cs.LG stat.ML

Optimizing Data Augmentation through Bayesian Model Selection

Madi Matymov, Ba-Hien Tran, Michael Kampffmeyer, Markus Heinonen, Maurizio Filippone

Comments 26 pages, 3 figures

Details
English abstract

Data Augmentation (DA) has become an essential tool to improve robustness and generalization of modern machine learning. However, when deciding on DA strategies it is critical to choose parameters carefully, and this can be a daunting task which is traditionally left to trial-and-error or expensive optimization based on validation performance. In this paper, we counter these limitations by proposing a novel framework for optimizing DA. In particular, we take a probabilistic view of DA, which leads to the interpretation of augmentation parameters as model (hyper)-parameters, and the optimization of the marginal likelihood with respect to these parameters as a Bayesian model selection problem. Due to its intractability, we derive a tractable ELBO, which allows us to optimize augmentation parameters jointly with model parameters. We provide extensive theoretical results on variational approximation quality, generalization guarantees, invariance properties, and connections to empirical Bayes. Through experiments on computer vision and NLP tasks, we show that our approach improves calibration and yields robust performance over fixed or no augmentation. Our work provides a rigorous foundation for optimizing DA through Bayesian principles with significant potential for robust machine learning.

2505.18996 2026-03-04 cs.LG stat.ML

Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs

Bob Junyi Zou, Lu Tian

Comments Accepted at The 14th International Conference on Learning Representations (ICLR) 2026

Details
English abstract

Hybrid neural ordinary differential equations (neural ODEs) integrate mechanistic models with neural ODEs, offering strong inductive bias and flexibility, and are particularly advantageous in data-scarce healthcare settings. However, excessive latent states and interactions from mechanistic models can lead to training inefficiency and over-fitting, limiting practical effectiveness of hybrid neural ODEs. In response, we propose a new hybrid pipeline for automatic state selection and structure optimization in mechanistic neural ODEs, combining domain-informed graph modifications with data-driven regularization to sparsify the model for improving predictive performance and stability while retaining mechanistic plausibility. Experiments on synthetic and real-world data show improved predictive performance and robustness with desired sparsity, establishing an effective solution for hybrid model reduction in healthcare applications.

2505.15008 2026-03-04 cs.LG cs.AI stat.ML

Know When to Abstain: Optimal Selective Classification with Likelihood Ratios

Alvin Heng, Harold Soh

Details
English abstract

Selective classification enhances the reliability of predictive models by allowing them to abstain from making uncertain predictions. In this work, we revisit the design of optimal selection functions through the lens of the Neyman--Pearson lemma, a classical result in statistics that characterizes the optimal rejection rule as a likelihood ratio test. We show that this perspective not only unifies the behavior of several post-hoc selection baselines, but also motivates new approaches to selective classification which we propose here. A central focus of our work is the setting of covariate shift, where the input distribution at test time differs from that at training. This realistic and challenging scenario remains relatively underexplored in the context of selective classification. We evaluate our proposed methods across a range of vision and language tasks, including both supervised learning and vision-language models. Our experiments demonstrate that our Neyman--Pearson-informed methods consistently outperform existing baselines, indicating that likelihood ratio-based selection offers a robust mechanism for improving selective classification under covariate shifts. Our code is publicly available at https://github.com/clear-nus/sc-likelihood-ratios.

2502.13583 2026-03-04 math.NA cs.NA math.OC stat.ML

Fundamental Bias in Inverting Random Sampling Matrices with Application to Sub-sampled Newton

Chengmei Niu, Zhenyu Liao, Zenan Ling, Michael W. Mahoney

Comments 55 pages, 4 figures. This version incorporates minor revisions to the proof

Details
English abstract

A substantial body of work in machine learning (ML) and randomized numerical linear algebra (RandNLA) has exploited various sorts of random sketching methodologies, including random sampling and random projection, with much of the analysis using Johnson--Lindenstrauss and subspace embedding techniques. Recent studies have identified the issue of inversion bias -- the phenomenon that inverses of random sketches are not unbiased, despite the unbiasedness of the sketches themselves. This bias presents challenges for the use of random sketches in various ML pipelines, such as fast stochastic optimization, scalable statistical estimators, and distributed optimization. In the context of random projection, the inversion bias can be easily corrected for dense Gaussian projections (which are, however, too expensive for many applications). Recent work has shown how the inversion bias can be corrected for sparse sub-gaussian projections. In this paper, we show how the inversion bias can be corrected for random sampling methods, both uniform and non-uniform leverage-based, as well as for structured random projections, including those based on the Hadamard transform. Using these results, we establish problem-independent local convergence rates for sub-sampled Newton methods.
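The inversion bias described in this abstract can be seen in a small simulation. The sketch below is only an illustration under assumed settings (a Gaussian test matrix, uniform row sampling with replacement, and the sizes `n`, `d`, `m` are all arbitrary choices, not taken from the paper): the rescaled sampled Gram matrix is unbiased for $A^\top A$, yet the average of its inverses overshoots $(A^\top A)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 3, 20            # rows, columns, sketch size (illustrative values)
A = rng.standard_normal((n, d))
H = A.T @ A                     # Gram matrix that a sub-sampled method would approximate


def sampled_gram(A, m, rng):
    """Uniform row sampling with replacement, rescaled so that E[sketch] = A^T A."""
    idx = rng.integers(0, A.shape[0], size=m)
    rows = A[idx]
    return (A.shape[0] / m) * rows.T @ rows


trials = 2000
mean_sketch = np.zeros((d, d))
mean_inverse = np.zeros((d, d))
for _ in range(trials):
    S = sampled_gram(A, m, rng)
    mean_sketch += S / trials
    mean_inverse += np.linalg.inv(S) / trials

# The averaged sketch is close to H (unbiasedness of the sketch itself).
sketch_rel_err = np.linalg.norm(mean_sketch - H) / np.linalg.norm(H)
# The averaged inverse is systematically larger than the true inverse.
inverse_ratio = np.trace(mean_inverse) / np.trace(np.linalg.inv(H))

print(f"relative error of averaged sketch: {sketch_rel_err:.3f}")
print(f"trace ratio, averaged inverse vs. true inverse: {inverse_ratio:.3f}")
```

A ratio above 1 is what Jensen's inequality predicts, since matrix inversion is convex on positive definite matrices. The correction proposed in the paper is not reproduced here.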

2501.13483 2026-03-04 stat.ML cs.LG

Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data

Aayush Mishra, Daniel Habermann, Marvin Schmitt, Stefan T. Radev, Paul-Christian Bürkner

Comments Accepted to International Conference on Learning Representations (ICLR) 2026

Details
English abstract

Amortized Bayesian inference (ABI) with neural networks can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, ABI is not yet sufficiently robust for widespread and safe application. When performing inference on observations outside the scope of the simulated training data, posterior approximations are likely to become highly biased, which cannot be corrected by additional simulations due to the bad pre-asymptotic behavior of current neural posterior estimators. In this paper, we propose a semi-supervised approach that enables training not only on labeled simulated data generated from the model, but also on unlabeled data originating from any source, including real data. To achieve this, we leverage Bayesian self-consistency properties that can be transformed into strictly proper losses that do not require knowledge of ground-truth parameters. We test our approach on several real-world case studies, including applications to high-dimensional time-series and image data. Our results show that semi-supervised learning with unlabeled data drastically improves the robustness of ABI in the out-of-simulation regime. Notably, inference remains accurate even when evaluated on observations far away from the labeled and unlabeled data seen during training.

2501.09648 2026-03-04 stat.ME math.ST stat.TH

Central limit theorems for interacting innovation processes, related statistical tools and general results

Giacomo Aletti, Irene Crimaldi, Andrea Ghiglietti

Details
English abstract

We study a networked system of innovation processes, where each process is modeled as an urn with infinitely many colors, a classical framework for capturing the emergence of novelties. Extending this paradigm, we analyze a model of interacting urns, where the probability of generating or reusing elements in one process is influenced by the histories of others. This interaction is governed by two matrices that control innovation triggering and reinforcement dynamics across the system. The core contribution of this work is a detailed analysis of the second-order asymptotic behavior of the model. Building on these theoretical results, we develop statistical tools to infer the structure and strength of inter-process influence. The methodology is framed in a general setting, making it broadly applicable. We validate our approach with applications to two real-world datasets from Reddit discussions and Gutenberg text corpora.

2411.12725 2026-03-04 cs.GT econ.TH stat.ML

The Bounds of Algorithmic Collusion; $Q$-learning, Gradient Learning, and the Folk Theorem

Galit Askenazi-Golan, Domenico Mergoni Cecchelli, Edward Plumb, Clemens Possnig

Comments This is a new version of a previous paper by the title "Reinforcement Learning, Collusion, and the Folk Theorem" by the three (alphabetically) first authors

Details
English abstract

We explore the behaviour emerging from learning agents repeatedly interacting strategically for a wide range of learning dynamics, including $Q$-learning, projected gradient, replicator and log-barrier dynamics. Going beyond the better understood classes of potential games and zero-sum games, we consider the setting of a general repeated game with finite recall under different forms of monitoring. We obtain a Folk Theorem-style result and characterise the set of payoff vectors that can be obtained by these dynamics, discovering a wide range of possibilities for the emergence of algorithmic collusion. Achieving this requires a novel technical approach, which, to the best of our knowledge, yields the first convergence result for multi-agent $Q$-learning algorithms in repeated games.

2410.22047 2026-03-04 math.PR math.ST stat.TH

Self-normalized Cramér-type Moderate Deviation of Stochastic Gradient Langevin Dynamics

Hongsheng Dai, Xiequan Fan, Jianya Lu

Details
English abstract

In this paper, we study the self-normalized Cramér-type moderate deviation of the empirical measure of the stochastic gradient Langevin dynamics (SGLD). As a consequence, we also derive a Berry-Esseen bound for SGLD. Our approach constructs a stochastic differential equation (SDE) to approximate the SGLD and then applies Stein's method, as developed in [9,19], to decompose the empirical measure into a sum of martingale differences and a negligible remainder term.

2408.03039 2026-03-04 math.ST stat.TH

Gaussian Approximations for the $k$th coordinate of sums of random vectors

Yixi Ding, Qizhai Li, Yuke Shi, Wei Zhang

Comments This submission is a duplicate of arXiv:2508.14400. We mistakenly created a new submission instead of a replacement

Details
English abstract

We consider the problem of Gaussian approximation for the $\kappa$th coordinate of a sum of high-dimensional random vectors. Such a problem has been studied previously for $\kappa=1$ (i.e., maxima). However, in many applications, a general $\kappa\geq1$ is of great interest, which is addressed in this paper. We make four contributions: 1) we first show that the distribution of the $\kappa$th coordinate of a sum of random vectors, $\boldsymbol{X}= (X_{1},\cdots,X_{p})^{\sf T}= n^{-1/2}\sum_{i=1}^n \boldsymbol{x}_{i}$, can be approximated by that of Gaussian random vectors and derive their Kolmogorov's distributional difference bound; 2) we provide the theoretical justification for estimating the distribution of the $\kappa$th coordinate of a sum of random vectors using a Gaussian multiplier procedure, which multiplies the original vectors with i.i.d. standard Gaussian random variables; 3) we extend the Gaussian approximation result and Gaussian multiplier bootstrap procedure to a more general case where $\kappa$ diverges; 4) we further consider the Gaussian approximation for a square sum of the first $d$ largest coordinates of $\boldsymbol{X}$. All these results allow the dimension $p$ of random vectors to be as large as or much larger than the sample size $n$.

2407.01656 2026-03-04 cs.LG cond-mat.dis-nn physics.data-an q-bio.NC stat.ML

Absolute abstraction: a renormalisation group approach

Carlo Orientale Caputo, Elias Seiffert, Enrico Frausin, Matteo Marsili

Comments 35 pages, 6 figures

Details
English abstract

Abstraction is the process of extracting the essential features from raw data while ignoring irrelevant details. It is well known that abstraction emerges with depth in neural networks, where deep layers capture abstract characteristics of data by combining lower level features encoded in shallow layers (e.g. edges). Yet we argue that depth alone is not enough to develop truly abstract representations. We advocate that the level of abstraction crucially depends on how broad the training set is. We address the issue within a renormalisation group approach where a representation is expanded to encompass a broader set of data. We take the unique fixed point of this transformation -- the Hierarchical Feature Model -- as a candidate for a representation which is absolutely abstract. This theoretical picture is tested in numerical experiments based on Deep Belief Networks and auto-encoders trained on data of different breadth. These show that representations in neural networks approach the Hierarchical Feature Model as the data get broader and as depth increases, in agreement with theoretical predictions.

2405.15204 2026-03-04 stat.ME

A New Fit Assessment Framework for Common Factor Models Using Generalized Residuals

Youjin Sung, Youngjin Han, Yang Liu

Details
Journal ref
Psychometrika 90 (2025) 1419-1444
English abstract

Assessing fit in common factor models solely through the lens of mean and covariance structures, as is commonly done with conventional goodness-of-fit (GOF) assessments, may overlook critical aspects of misfit, potentially leading to misleading conclusions. To achieve more flexible fit assessment, we extend the theory of generalized residuals (Haberman & Sinharay, 2013), originally developed for models with categorical data, to encompass more general measurement models. Within this extended framework, we propose several fit test statistics designed to evaluate various parametric assumptions involved in common factor models. The examples include assessing the distributional assumptions of latent variables and functional form assumptions of individual manifest variables. The performance of the proposed statistics is examined through simulation studies and an empirical data analysis. Our findings suggest that generalized residuals are promising tools for detecting misfit in measurement models, often masked when assessed by conventional GOF testing methods.

2312.01518 2026-03-04 stat.AP

Analyzing State-Level Longevity Trends with the U.S. Mortality Database

Mike Ludkovski, Doris Padilla

Comments 31 pages, 18 figures

Details
Journal ref
Ann. actuar. sci. 20 (2026) 22-53
English abstract

We investigate state-level age-specific mortality trends based on the United States Mortality Database (USMDB) published by the Human Mortality Database. In tandem with looking at the longevity experience across the 51 states, we also consider a collection of socio-demographic, economic and educational covariates that correlate with mortality trends. To obtain smoothed mortality surfaces for each state, we implement the machine learning framework of Multi-Output Gaussian Process regression (Huynh & Ludkovski 2021) on targeted groupings of 3-6 states. Our detailed exploratory analysis shows that the mortality experience is highly inhomogeneous across states in terms of respective Age structures. We moreover document multiple divergent trends between best and worst states, between Females and Males, and between younger and older Ages. The comparisons across the 50+ fitted models offer opportunities for rich insights about drivers of mortality in the U.S. and are visualized through numerous figures and an online interactive dashboard.

2210.09709 2026-03-04 stat.ML cs.LG math.ST stat.TH

Importance Weighting Correction of Regularized Least-Squares for Target Shift

Davit Gogolashvili

Details
English abstract

Importance weighting is a standard tool for correcting distribution shift, but its statistical behavior under target shift -- where the label distribution changes between training and testing while the conditional distribution of inputs given the label remains stable -- remains under-explored. We analyze importance-weighted kernel ridge regression under target shift and show that, because the weights depend only on the output variable, reweighting corrects the train-test mismatch without altering the input-space complexity that governs kernel generalization. Under standard RKHS regularity and capacity conditions and a mild Bernstein-type moment condition on the label weights, we obtain finite-sample guarantees showing that the estimator achieves the same convergence behavior as in the no-shift case, with shift severity affecting only the constants through weight moments. We complement these results with matching minimax lower bounds, establishing rate optimality and quantifying the unavoidable dependence on shift severity. We further study more general weighting schemes and prove that weight misspecification induces an irreducible bias: the estimator concentrates around an induced population regression function that generally differs from the desired test regression function unless the weights are accurate. Finally, we derive consequences for plug-in classification under target shift via standard calibration arguments.
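For readers unfamiliar with the estimator, importance-weighted kernel ridge regression can be sketched in a few lines. This is a generic textbook formulation, not the paper's code: the RBF kernel, the bandwidth, the regularization level `lam`, and the function names are all assumptions made for illustration. The weights are passed in as given; under target shift they would depend only on the labels.

```python
import numpy as np


def rbf_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def fit_weighted_krr(X, y, weights, lam=1e-2, bandwidth=1.0):
    """Minimize (1/n) * sum_i w_i * (f(x_i) - y_i)^2 + lam * ||f||_H^2.

    With f = sum_j alpha_j k(., x_j), the stationarity condition is
    (W K + n * lam * I) alpha = W y, where W = diag(weights).
    """
    n = len(y)
    K = rbf_kernel(X, X, bandwidth)
    W = np.diag(weights)
    return np.linalg.solve(W @ K + n * lam * np.eye(n), W @ y)


def predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate the fitted function at new inputs."""
    return rbf_kernel(X_new, X_train, bandwidth) @ alpha


# Toy usage: the weights depend only on the label y (hypothetical density ratio).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)
w = np.exp(0.5 * y)
w = w / w.mean()
alpha_weighted = fit_weighted_krr(X, y, w)
alpha_uniform = fit_weighted_krr(X, y, np.ones(60))
```

With all weights equal to one, the linear system reduces to the usual kernel ridge solution $(K + n\lambda I)^{-1} y$, which gives a quick sanity check of the implementation.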

2603.02729 2026-03-04 cs.LG math.OC stat.ML

The power of small initialization in noisy low-tubal-rank tensor recovery

Zhiyu Liu, Haobo Geng, Xudong Wang, Yandong Tang, Zhi Han, Yao Wang

Details
English abstract

We study the problem of recovering a low-tubal-rank tensor $\mathcal{X}_\star\in \mathbb{R}^{n \times n \times k}$ from noisy linear measurements under the t-product framework. A widely adopted strategy involves factorizing the optimization variable as $\mathcal{U} * \mathcal{U}^\top$, where $\mathcal{U} \in \mathbb{R}^{n \times R \times k}$, followed by applying factorized gradient descent (FGD) to solve the resulting optimization problem. Since the tubal-rank $r$ of the underlying tensor $\mathcal{X}_\star$ is typically unknown, this method often assumes $r < R \le n$, a regime known as over-parameterization. However, when the measurements are corrupted by some dense noise (e.g., Gaussian noise), FGD with the commonly used spectral initialization yields a recovery error that grows linearly with the over-estimated tubal-rank $R$. To address this issue, we show that using a small initialization enables FGD to achieve a nearly minimax optimal recovery error, even when the tubal-rank $R$ is significantly overestimated. Using a four-stage analytic framework, we analyze this phenomenon and establish the sharpest known error bound to date, which is independent of the overestimated tubal-rank $R$. Furthermore, we provide a theoretical guarantee showing that an easy-to-use early stopping strategy can achieve the best known result in practice. All these theoretical findings are validated through a series of simulations and real-data experiments.

2603.02723 2026-03-04 stat.ME

The partly parametric and partly nonparametric additive risk model

Nils Lid Hjort, Emil Aas Stoltenberg

Comments 26 pages, 5 figures; Statistical Research Report, Department of Mathematics, University of Oslo, August 2021, but arXiv'd March 2026. The article has appeared in essentially this form in Lifetime Data Analysis 2021, vol. 27, pages 1-31, at this url: link.springer.com/content/pdf/10.1007/s10985-021-09535-3.pdf

Details
English abstract

Aalen's linear hazard rate regression model is a useful and increasingly popular alternative to Cox's multiplicative hazard rate model. It postulates that an individual has hazard rate function $h(s)=z_1\alpha_1(s)+\cdots+z_r\alpha_r(s)$ in terms of the individual's covariate values $z_1,\ldots,z_r$. These are typically levels of various hazard factors, and may also be time-dependent. The hazard factor functions $\alpha_j(s)$ are the parameters of the model and are estimated from data. This is traditionally accomplished in a fully nonparametric way. This paper develops methodology for estimating the hazard factor functions when some of them are modelled parametrically while the others are left unspecified. Large-sample results are reached inside this partly parametric, partly nonparametric framework, which also enables us to assess the goodness of fit of the model's parametric components. In addition, these results are used to pinpoint how much precision is gained, using the parametric-nonparametric model, over the standard nonparametric method. A real-data application is included, along with a brief simulation study.

2603.02649 2026-03-04 cs.LG math.OC stat.ML

HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization

Feihu Huang, Guanyi Zhang, Songcan Chen

Comments 39 pages

Details
English abstract

Adam and AdamW are default optimizers for training deep learning models. These adaptive algorithms converge faster than SGD but generalize worse. In fact, their proven generalization error $O(\frac{1}{\sqrt{N}})$ is also larger than the $O(\frac{1}{N})$ of SGD, where $N$ denotes the training sample size. Although some variants of Adam have recently been proposed to improve its generalization, the improved generalization of these variants remains unexplored in theory. To fill this gap, we restudy the generalization of Adam and AdamW via algorithmic stability, and first prove that Adam and AdamW without the square root (i.e., Adam(W)-srf) have a generalization error of $O(\frac{\hat{\rho}^{-2T}}{N})$, where $T$ denotes the number of iterations and $\hat{\rho}>0$ denotes the smallest element of the second-order momentum plus a small positive number. To improve generalization, we propose a class of efficient, clever Adam algorithms (i.e., HomeAdam(W)) that sometimes return to momentum-based SGD. Moreover, we prove that HomeAdam(W) has a generalization error of $O(\frac{1}{N})$, which is smaller than the $O(\frac{\hat{\rho}^{-2T}}{N})$ of Adam(W)-srf, since $\hat{\rho}$ is generally very small. In particular, it is also smaller than the existing $O(\frac{1}{\sqrt{N}})$ of Adam(W). Meanwhile, we prove that HomeAdam(W) has a convergence rate of $O(\frac{1}{T^{1/4}})$, which is faster than the $O(\frac{\breve{\rho}^{-1}}{T^{1/4}})$ of Adam(W)-srf, where $\breve{\rho}\leq\hat{\rho}$ is also very small. Extensive numerical experiments demonstrate the efficiency of our HomeAdam(W) algorithms.

2603.02616 2026-03-04 stat.AP cs.AI cs.LG stat.ME stat.ML

Detecting Structural Heart Disease from Electrocardiograms via a Generalized Additive Model of Interpretable Foundation-Model Predictors

Ya Zhou, Zhaohong Sun, Tianxiang Hao, Xiangjie Li

Details
English abstract

Structural heart disease (SHD) is a prevalent condition with many undiagnosed cases, and early detection is often limited by the high cost and accessibility constraints of echocardiography (ECHO). Recent studies show that artificial intelligence (AI)-based analysis of electrocardiograms (ECGs) can detect SHD, offering a scalable alternative. However, existing methods are fully black-box models, limiting interpretability and clinical adoption. To address these challenges, we propose an interpretable and effective framework that integrates clinically meaningful ECG foundation-model predictors within a generalized additive model, enabling transparent risk attribution while maintaining strong predictive performance. Using the EchoNext benchmark of over 80,000 ECG-ECHO pairs, the method demonstrates relative improvements of +0.98% in AUROC, +1.01% in AUPRC, and +1.41% in F1 score over the latest state-of-the-art deep-learning baseline, while achieving slightly better performance even with only 30% of the training data. Subgroup analyses confirm robust performance across heterogeneous populations, and the estimated entry-wise functions provide interpretable insights into the relationships between risks of traditional ECG diagnoses and SHD. This work illustrates a complementary paradigm between classical statistical modeling and modern AI, offering a pathway to interpretable, high-performing, and clinically actionable ECG-based SHD screening.

2603.02611 2026-03-04 stat.ME stat.AP

A Bayesian Hierarchical Hurdle Beta-Binomial Model for Survey-Weighted Bounded Counts and Its Application to Childcare Enrollment

JoonHo Lee

Details
English abstract

Bounded discrete proportions -- counts out of known totals -- present modeling challenges when data exhibit structural zeros, overdispersion, and hierarchical clustering. We develop a Bayesian hierarchical hurdle beta-binomial model with state-varying coefficients that addresses all of these features. The framework makes three methodological contributions: (i) it studies cross-margin dependence via a cross-block covariance component and clarifies when and how this parameter is identified through the hierarchical layer rather than the conditional likelihood; (ii) it proposes a Cholesky-based sandwich variance calibration for pseudo-posterior inference under survey weights, guided by a parameter-specific design effect ratio diagnostic; and (iii) it introduces a log-scale marginal effect decomposition for hurdle models that translates regression coefficients into policy-relevant quantities. Applied to 6,785 childcare providers across 51 states from the 2019 National Survey of Early Care and Education, the model reveals a "poverty reversal": poverty reduces enrollment participation yet increases intensity among participants, with the extensive margin accounting for two-thirds of the total effect. Design-calibrated simulation shows that sandwich-corrected intervals substantially improve coverage, reaching 82--88.5% at the 90% nominal level for fixed effects. The R package hurdlebb implements all methods.

2603.02607 2026-03-04 stat.ML cs.DS cs.LG math.OC

Combinatorial Sparse PCA Beyond the Spiked Identity Model

Syamantak Kumar, Purnamrita Sarkar, Kevin Tian, Peiyuan Zhang

Comments 36 pages, 6 figures

Details
English abstract

Sparse PCA is one of the most well-studied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance $\Sigma$, whose top eigenvector $v \in \mathbb{R}^d$ is $s$-sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where $\Sigma = I_d + \gamma vv^\top$ for some $\gamma > 0$), whereas SDP-based algorithms require no additional assumptions on $\Sigma$. We demonstrate explicit counterexample covariances $\Sigma$ against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general $\Sigma$ using $s^2 \cdot \mathrm{polylog}(d)$ samples and $d^2 \cdot \mathrm{poly}(s, \log(d))$ time, by providing a global convergence guarantee on a variant of the truncated power method of Yuan and Zhang (2013). We provide a natural generalization of our method to recovering a vector in a sparse leading eigenspace. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.
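As background for the truncated power method mentioned in this abstract, here is a minimal sketch of the basic iteration (multiply by the covariance, keep the $s$ largest-magnitude entries, renormalize). It is not the variant analyzed in the paper: the initialization at the largest diagonal entry, the iteration count, and the use of an exact spiked identity covariance as test input are assumptions chosen to keep the example short.

```python
import numpy as np


def truncate(x, s):
    """Keep the s largest-magnitude entries of x, zero the rest, renormalize."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]
    out[keep] = x[keep]
    return out / np.linalg.norm(out)


def truncated_power_method(Sigma, s, iters=100):
    """Basic truncated power iteration for an s-sparse leading eigenvector."""
    d = Sigma.shape[0]
    x = np.zeros(d)
    x[np.argmax(np.diag(Sigma))] = 1.0   # simple initialization (an assumption)
    for _ in range(iters):
        x = truncate(Sigma @ x, s)
    return x


# Spiked identity covariance: Sigma = I_d + gamma * v v^T with an s-sparse unit vector v.
d, s, gamma = 50, 5, 2.0
v = np.zeros(d)
v[:s] = 1.0 / np.sqrt(s)
Sigma = np.eye(d) + gamma * np.outer(v, v)

x_hat = truncated_power_method(Sigma, s)
print("recovered support:", np.flatnonzero(x_hat))
print("alignment |<x_hat, v>|:", abs(x_hat @ v))
```

This easy case is exactly the spiked identity setting. The abstract's point is that such simple schemes can fail for general $\Sigma$, which this toy example does not exercise.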

2603.02594 2026-03-04 stat.ML cs.CC cs.DS cs.LG

Low-Degree Method Fails to Predict Robust Subspace Recovery

He Jia, Aravindan Vijayaraghavan

Comments 27 pages, 1 figure

Details
English abstract

The low-degree polynomial framework has been highly successful in predicting computational versus statistical gaps for high-dimensional problems in average-case analysis and machine learning. This success has led to the low-degree conjecture, which posits that this method captures the power and limitations of efficient algorithms for a wide class of high-dimensional statistical problems. We identify a natural and basic hypothesis testing problem in $\mathbb{R}^n$ which is polynomial time solvable, but for which the low-degree polynomial method fails to predict its computational tractability even up to degree $k=n^{\Omega(1)}$. Moreover, the low-degree moments match exactly up to degree $k=O(\sqrt{\log n/\log\log n})$. Our problem is a special case of the well-studied robust subspace recovery problem. The lower bounds suggest that there is no polynomial time algorithm for this problem. In contrast, we give a simple and robust polynomial time algorithm that solves the problem (and noisy variants of it), leveraging anti-concentration properties of the distribution. Our results suggest that the low-degree method and low-degree moments fail to capture algorithms based on anti-concentration, challenging their universality as a predictor of computational barriers.

2603.02593 2026-03-04 stat.CO stat.ME

Composite Wavelet Matrix-Based Transforms and Applications

Radhika Kulkarni, Brani Vidakovic

Comments 30 pages, 9 figures, 6 tables

Details
English abstract

Orthogonal wavelet transforms are a cornerstone of modern signal and image denoising because they combine multiscale representation, energy preservation, and perfect reconstruction. In this paper, we show that these advantages can be retained and substantially enhanced by moving beyond classical single-basis wavelet filterbanks to a broader class of composite wavelet-like matrices. By combining orthogonal wavelet matrices through products, Kronecker products, and block-diagonal constructions, we obtain new unitary transforms that generally fall outside the strict wavelet filterbank class, yet remain fully invertible and numerically stable. The central finding is that such composite transforms induce stronger concentration of signal energy into fewer coefficients than conventional wavelets. This increased sparsity, quantified using Lorenz curve diagnostics, directly translates into improved denoising under identical thresholding rules. Extensive simulations on Donoho-Johnstone benchmark signals, complex-valued unitary examples, and adaptive block constructions demonstrate consistent reductions in mean-squared error relative to single-basis transforms. Applications to atmospheric turbulence measurements and image denoising of the Barbara benchmark further confirm that composite transforms better preserve salient structures while suppressing noise.
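The three constructions named in this abstract (products, Kronecker products, and block-diagonal stacking) all preserve orthogonality, which is why the composite transforms stay invertible and energy-preserving. The sketch below checks this numerically. The Haar matrix is used only because it is easy to build, and the specific sizes are arbitrary; the paper's own choice of bases is not reproduced here.

```python
import numpy as np


def haar_matrix(n):
    """Orthogonal Haar wavelet matrix of size n x n (n must be a power of two)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                    # coarser-scale rows
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-scale detail rows
    return np.vstack([top, bottom]) / np.sqrt(2.0)


W8 = haar_matrix(8)
W4 = haar_matrix(4)
W2 = haar_matrix(2)

# Three ways of composing orthogonal wavelet matrices into an 8 x 8 transform.
composites = {
    "product": W8 @ W8,
    "kronecker": np.kron(W2, W4),
    "block_diagonal": np.block([[W4, np.zeros((4, 4))],
                                [np.zeros((4, 4)), W4]]),
}

signal = np.arange(8, dtype=float)
for name, W in composites.items():
    orthogonal = np.allclose(W.T @ W, np.eye(8))
    coeffs = W @ signal
    energy_kept = np.isclose(coeffs @ coeffs, signal @ signal)
    reconstructed = np.allclose(W.T @ coeffs, signal)
    print(name, orthogonal, energy_kept, reconstructed)
```

Because each composite is orthogonal, the inverse transform is just the transpose, so thresholding in the coefficient domain and mapping back works exactly as with a single-basis wavelet matrix.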

2603.02574 2026-03-04 stat.AP stat.CO

An Augmented Rating System for Test cricket: adapting Glicko's model

Rhitankar Bandyopadhyay, Diganta Mukherjee

Comments 23 pages, 17 tables, 1 figure

Details
English abstract

ICC's current ranking system does not adequately account for key contextual factors such as home advantage, toss impact and scheduling imbalances, leading to inconsistencies in team evaluation in Test cricket. This study develops an enhanced rating framework by adapting and extending Glicko's model to incorporate these influences alongside Margin of Victory, an important indicator of dominance in a contest. This enables a more dynamic and probabilistically grounded assessment of team performance. Using past match data, the model demonstrates improved expected score estimation and predictive accuracy. Robustness of the resulting ratings is demonstrated through bootstrap resampling, confirming stability with respect to match scheduling. Overall, the framework provides a fairer and more statistically consistent approach to ranking Test teams.
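For context, the expected-score formula of the standard Glicko system, which this study takes as its starting point, can be written as a short function. The sketch covers only the baseline Glicko formula; the home-advantage, toss, scheduling and margin-of-victory adjustments described in the abstract are not included, and the function names are illustrative.

```python
import math

Q = math.log(10) / 400.0   # Glicko scaling constant


def g(rd):
    """Attenuation factor: shrinks the rating gap when the opponent's rating is uncertain."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)


def expected_score(rating, opp_rating, opp_rd):
    """Expected score of a team against an opponent with rating deviation opp_rd."""
    return 1.0 / (1.0 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400.0))


# A 1500-rated team facing a 1400-rated opponent whose rating deviation is 30.
print(round(expected_score(1500, 1400, 30), 3))
```

A larger opponent rating deviation pulls the expected score toward 0.5, reflecting less certainty about the true strength gap.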

2603.02533 2026-03-04 cs.IT cs.CV cs.LG math.IT math.ST stat.ML stat.TH

Functional Properties of the Focal-Entropy

Jaimin Shah, Martina Cardone, Alex Dytso

Comments Accepted to AISTATS 2026

Abstract

The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification problems, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of the focal-loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal-entropy, a focal-loss analogue of the cross-entropy. Our analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy, and provides various asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which are also experimentally validated, offer a theoretical foundation for understanding the focal-loss and clarify the trade-offs that it introduces when applied to imbalanced learning tasks.
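
A small numerical sketch may help fix ideas. It assumes the focal-entropy of a data distribution p relative to a model q is the expected focal loss, sum_i p_i (1 - q_i)^gamma (-log q_i); this definition is inferred from the usual form of the focal loss and is an assumption here, and the function names are illustrative.

```python
import math


def cross_entropy(p, q):
    """Cross-entropy of model q under data distribution p (natural log)."""
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def focal_entropy(p, q, gamma):
    """Expected focal loss of model q under p (assumed definition):
    sum_i p_i * (1 - q_i)**gamma * (-log q_i)."""
    return sum(
        -pi * (1.0 - qi) ** gamma * math.log(qi)
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

Setting `gamma=0` recovers the cross-entropy, and for `gamma > 0` each term is down-weighted by `(1 - q_i)**gamma`, so the value can only decrease.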

2603.02509 2026-03-04 stat.ME stat.AP

Semi-partitioned Generalized Method of Moments for Longitudinal Data with Lagged and Feedback Covariates

Niloofar Ramezani, Jeffrey R. Wilson

Comments 15 pages, 0 figures, 5 tables

Abstract

We propose a semi-partitioned Generalized Method of Moments (GMM) framework for analyzing longitudinal data with time-dependent covariates, within a marginal modeling paradigm. This approach addresses limitations of both aggregated and fully partitioned GMM models. Aggregated methods obscure temporal dynamics by assuming constant effects, while fully partitioned approaches offer temporal specificity at the cost of increased model complexity and instability--particularly with moderate sample sizes or deep lag structures. Our method distinguishes immediate from lagged effects by estimating contemporaneous coefficients separately and grouping lagged moment conditions into structured sets, while retaining flexibility in the lag-specific effects. This yields a model that is both statistically efficient and interpretable, capturing essential temporal variation while mitigating variance inflation and convergence challenges associated with full partitioning. The framework accommodates feedback, supports both continuous and binary outcomes, and utilizes the Broyden--Fletcher--Goldfarb--Shanno (BFGS) algorithm for reliable optimization. Through simulations, we demonstrate that the semi-partitioned GMM achieves coverage and competitive efficiency relative to fully partitioned models when the grouped-lag structure approximates the underlying lag pattern. Applications to clinical datasets on knee osteoarthritis and adolescent obesity confirm that the method recovers consistent, interpretable effects and avoids instability associated with finely grained partitioning.

2603.02502 2026-03-04 stat.ME

Tree-Embedded Bayesian Factor Models for Multidimensional Categorical Distributions

Naoki Awaya, Keisuke Sasaki, Genya Kobayashi, Shonosuke Sugasawa

Abstract

Analyzing data collected from multiple sources to estimate common and heterogeneous structures through a hierarchical model is a central task in Bayesian inference, and Bayesian factor models are among the most widely used tools for this purpose. In this paper, we propose a new Bayesian latent factor model for distributions, providing a parsimonious model for describing many observed distributions through lower-dimensional structures. Many applications are found in the social sciences in the form of grouped data, for example, distributions of age composition and income observed across locations. In these contexts, standard mixture models can be inefficient because the distributions do not necessarily exhibit clear clustering structures. To overcome this difficulty, we introduce a tree-based transformation that embeds distributions into a Euclidean space and construct a Bayesian latent factor model in the transformed space. Once the distributions are transformed into Euclidean vectors, the Bayesian hierarchical model can be extended in a straightforward manner. As an illustration, we incorporate spatial dependence by introducing a prior based on a simultaneous autoregressive (SAR) model. The proposed model is "nonparametric" in the sense that it does not impose parametric assumptions on the form of the observed distributions. Through numerical experiments using real population data, we demonstrate that the proposed model outperforms the standard Dirichlet mixture model as well as models built on parametric assumptions.

2603.02492 2026-03-04 cs.IT cs.LO math.IT math.LO math.ST stat.TH

E-variables and tests of randomness for distribution classes

Georgii Potapov, Yuri Kalnishkan

Abstract

E-variables are a relatively new approach for testing statistical hypotheses that has been experiencing major development during the last several years. In this paper we introduce the method of e-variable-approximability and use it to develop a general approximation technique allowing us to construct e-variables for popular distribution classes important for applications. E-variables were originally based on a concept of Levin's (average-bounded) randomness tests from Algorithmic Information Theory. We show that our construction of e-variables can be used to provide an explicit construction for a randomness test with respect to a class of distributions.

2603.02485 2026-03-04 stat.ME stat.AP

A Decision Analysis Framework for High-fidelity and Low-fidelity Systems with Applications in Manufacturing Processes

Fan Zhang, Qiong Zhang, Madhura Limaye, Dhanashree Shinde, Gang Li, Sai Aditya Pradeep, Srikanth Pilla

Abstract

Optimizing complex manufacturing processes often involves a trade-off between data accuracy and acquisition cost. High-fidelity data are accurate but limited, while low-fidelity data are abundant but often biased. Balancing these two sources is critical for efficient manufacturing optimization. To address this challenge, we develop a decision analysis framework using multi-fidelity Gaussian process (GP) modeling within the Kennedy-O'Hagan (KOH) framework. We propose a systematic Bayesian calibration approach using multi-fidelity GPs that explicitly quantifies the model discrepancy, and an algorithm that combines posterior sampling of calibration parameters with predictive sampling to characterize the distribution of optimal input settings and their associated uncertainty. These components are integrated into a five-stage practical workflow for the optimization of manufacturing processes. Through an illustrative example and two real-world applications in composite cure cycle optimization and injection molding process control, we demonstrate how the framework integrates information from both high-fidelity and low-fidelity data sources to support decision-making under parameter uncertainty.

2603.02483 2026-03-04 stat.ML cs.CG cs.CV cs.LG

Geometric structures and deviations on James' symmetric positive-definite matrix bicone domain

Jacek Karwowski, Frank Nielsen

Comments 35 pages, 4 figures

Abstract

Symmetric positive-definite (SPD) matrix datasets play a central role across numerous scientific disciplines, including signal processing, statistics, finance, computer vision, information theory, and machine learning, among others. The set of SPD matrices forms a cone which can be viewed as a global coordinate chart of the underlying SPD manifold. Rich differential-geometric structures may be defined on the SPD cone manifold. Among the most widely used geometric frameworks on this manifold are the affine-invariant Riemannian structure and the dual information-geometric log-determinant barrier structure, each associated with dissimilarity measures (distance and divergence, respectively). In this work, we introduce two new structures, a Finslerian structure and a dual information-geometric structure, both derived from James' bicone reparameterization of the SPD domain. These structures ensure that geodesics correspond to straight lines in appropriate coordinate systems. The closed bicone domain includes the spectraplex (the set of positive semi-definite diagonal matrices with unit trace) as an affine subspace, and the Hilbert VPM distance is proven to generalize the Hilbert simplex distance, which has found many applications in machine learning. Finally, we discuss several applications of these Finsler/dual Hessian structures and provide various inequalities between the new and traditional dissimilarities.

2603.02474 2026-03-04 stat.ME

Transportable inference using target population summary statistics under covariate shift

Ying Sheng, Yifei Sun, Chiung-Yu Huang

Abstract

Transporting findings from a study population to a target population is central to evidence-based decision-making in real-world settings. Most existing methods require individual-level data from both populations to account for covariate shift. However, privacy regulations and data-sharing constraints often preclude access to such data from the target population, leaving only covariate summaries available for analysis. In this paper, we develop transportability methods that enable valid inference using source individual-level data and target covariate summaries. Firstly, we apply entropy balancing to transportability, enabling source individual-level data to be adjusted to match the target covariate moments. We establish asymptotic normality for the entropy balancing estimator and propose a variance estimator to account for uncertainty in covariate summaries. Secondly, we develop a new transportability method that allows flexible modeling of covariate shift, thereby accounting for covariate shift and uncertainty in covariate summaries simultaneously. Asymptotic normality for the proposed estimator is established and its asymptotic variance is consistently estimated. The proposed method offers greater flexibility in accounting for covariate shift and thus permits consistent estimation and valid inference under weaker conditions than those required by entropy balancing. The proposed methods are evaluated by simulations and illustrated with an analysis of Surveillance, Epidemiology, and End Results breast cancer data.
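
The entropy-balancing step can be illustrated in one dimension. The sketch below assumes a single covariate and a single moment constraint (the target mean), and uses the fact that entropy-balancing weights take an exponential-tilting form; the function names and the bisection solver are illustrative choices, and the variance estimation discussed in the abstract is not covered.

```python
import math


def tilted_weights(x, lam):
    """Normalized weights proportional to exp(lam * x_i), computed stably."""
    m = max(lam * xi for xi in x)
    e = [math.exp(lam * xi - m) for xi in x]
    s = sum(e)
    return [ei / s for ei in e]


def entropy_balance(x, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Find weights on the source sample x whose weighted mean equals target_mean.

    The weighted mean is increasing in lam, so bisection suffices.
    Assumes min(x) < target_mean < max(x) and that the solution lies in [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        w = tilted_weights(x, mid)
        mean = sum(wi * xi for wi, xi in zip(w, x))
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted_weights(x, 0.5 * (lo + hi))
```

The returned weights sum to one and reweight the source sample so that its covariate mean matches the target summary; with several covariates the scalar `lam` becomes a vector and a multivariate solver would be needed.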

2603.02467 2026-03-04 stat.CO

CCMnet: A Software Package for Network Generation with Congruence Class Models

Ravi Goyal, Victor De Gruttola, Natasha K. Martin, Lior Rennert, Jukka-Pekka Onnela

Comments 27 pages, 9 figures, 2 tables

Abstract

We introduce CCMnet, an R package designed to generate network ensembles that accurately reflect the uncertainty inherent in empirical data. While traditional network modeling often results in ensembles with fixed property values or model-determined levels of variability, CCMnet enables a continuous spectrum of variability for network properties, including edge counts, degree distribution, and mixing patterns. By defining probability distributions directly over congruence classes of networks, the package allows researchers to specify the uncertainty in network properties across the generated ensemble to match a specific sampling design or empirical distribution. Furthermore, this formulation provides a principled framework that encompasses several classic models (e.g., Erdős--Rényi model, stochastic block models, and certain exponential random graph models) that implicitly share this structural basis, while offering the flexibility to specify arbitrary, even non-parametric, distributions for network properties. CCMnet implements a Markov chain Monte Carlo (MCMC) framework to sample from these models. The utility of the package is illustrated by generating posterior predictive network ensembles representing school friendship networks.

2603.02452 2026-03-04 cs.LG cs.AI stat.ML

Manifold Aware Denoising Score Matching (MAD)

Alona Levy-Jurgenson, Alvaro Prat, James Cuin, Yee Whye Teh

Abstract

A major focus in designing methods for learning distributions defined on manifolds is to alleviate the need to implicitly learn the manifold so that learning can concentrate on the data distribution within the manifold. However, accomplishing this often leads to compute-intensive solutions. In this work, we propose a simple modification to denoising score-matching in the ambient space to implicitly account for the manifold, thereby reducing the burden of learning the manifold while maintaining computational efficiency. Specifically, we propose a simple decomposition of the score function into a known component $s^{base}$ and a remainder component $s-s^{base}$ (the learning target), with the former implicitly including information on where the data manifold resides. We derive known components $s^{base}$ in analytical form for several important cases, including distributions over rotation matrices and discrete distributions, and use them to demonstrate the utility of this approach in those cases.

2603.02437 2026-03-04 stat.CO stat.ME

Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

Cole C. Monnahan, Kasper Kristensen, James T. Thorson, Bob Carpenter

Comments 26 pages, 12 figures including appendices

Abstract

Analysts routinely use Bayesian hierarchical models to understand natural processes. The no-U-turn sampler (NUTS) is the most widely used algorithm to sample high-dimensional, continuously differentiable models. But NUTS is slowed by high correlations, especially in high dimensions, limiting the complexity of applied analyses. Here we introduce Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix ($Q$). We use Template Model Builder (TMB) to efficiently compute $Q$ from the mode of the Laplace approximation to the marginal posterior, then pass the preconditioned posterior to NUTS through the Bayesian software Stan for sampling. We apply SNUTS to seventeen diverse case studies to demonstrate that preconditioning with $Q$ converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners. SNUTS also outperforms preconditioning with the inverse of the covariance estimated with Pathfinder variational inference. SNUTS does not improve sampling efficiency for models with the highly varying curvature found in funnels, wide tails, or multiple modes. SNUTS is most advantageous, and can be scaled beyond $10^4$ parameters, in the presence of high dimensionality, sparseness, and high correlations, all of which are widespread in applied statistics. An open-source implementation of SNUTS is provided in the R package SparseNUTS.

2603.02429 2026-03-04 cs.LG math.OC stat.ML

Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence

Shiyuan Zhang, Qiwei Di, Xuheng Li, Quanquan Gu

Comments 51 pages, 1 table

Abstract

Underdamped Langevin dynamics (ULD) is a widely used sampler for Gibbs distributions $\pi \propto e^{-V}$, and is often empirically effective in high dimensions. However, existing non-asymptotic convergence guarantees for discretized ULD typically scale polynomially with the ambient dimension $d$, leading to vacuous bounds when $d$ is large. The main known dimension-free result concerns the randomized midpoint discretization in Wasserstein-2 distance (Liu et al., 2023), while dimension-independent guarantees for ULD discretizations in KL divergence have remained open. We close this gap by proving the first dimension-free KL divergence bounds for discretized ULD. Our analysis refines the KL local error framework (Altschuler et al., 2025) to a dimension-free setting and yields bounds that depend on $\mathrm{tr}(\mathbf{H})$, where $\mathbf{H}$ upper bounds the Hessian of $V$, rather than on $d$. As a consequence, we obtain improved iteration complexity for underdamped Langevin Monte Carlo relative to overdamped Langevin methods in regimes where $\mathrm{tr}(\mathbf{H})\ll d$.

2603.02424 2026-03-04 stat.AP

On the misuse of time-dependent models in assessing mask usage and excess mortality

Beny Spira, Daniel V. Tausk

Comments 19 pages, 3 figures

Abstract

The effectiveness of face masks as a population-level intervention against respiratory viral transmission remains contested. While a large observational literature published during the COVID-19 pandemic reported beneficial effects, randomized controlled trials have consistently shown limited or no impact. An ecological analysis of European countries reported that average mask usage during the years 2020 and 2021 is positively associated with excess mortality in that same period in 24 European countries (Tausk and Spira, 2025). This association persists after several attempts at controlling for confounding variables. This finding was later challenged by other authors and attributed to reverse causality (Cerqueira-Silva et al., 2026). In this paper, we reassess those criticisms in detail. We show that their analysis is fundamentally flawed, as the time-dependent regression framework used to refute the original findings yields spurious results partly due to the use of cumulative excess mortality as an outcome variable, thereby incorporating pre-intervention deaths and producing statistically significant effects even at impossible negative time lags. Diagnostic analyses further demonstrate that key assumptions of the model are violated, invalidating any association or causal interpretation. Finally, we present an original longitudinal analysis of mask usage designed to directly test the reverse causality hypothesis. By constructing multiple indices that capture mask adoption during distinct phases of pandemic waves, including interwave periods characterized by low mortality, we show that the association between mask usage and excess mortality persists and is not driven by reactive increases in masking. These findings provide substantial evidence that reverse causality makes, at most, a minor contribution to the observed association.

2603.02418 2026-03-04 stat.AP

Contributions of geolocated weather and building related data for insurance assessment of flood risks

Mulah Moriah, Franck Vermet, Pierre Ailliot, Philippe Naveau, Juliette Legrand

Abstract

Floods rank among the costliest natural hazards, causing over USD 100 billion in insured losses between 2013 and 2023. In France, persistent deficits in the natural catastrophe scheme highlight the need for accurate, building-scale flood risk assessment. Insurers typically rely on frequency-severity models supported by hazard maps and regional climate indicators. However, previous studies show that such large-scale variables explain only a limited share of the variability in individual flood losses. This study evaluates the marginal contribution of multiple georeferenced data layers to modeling flood claim occurrence and severity in a large French home insurance portfolio. Starting from a baseline model based on standard underwriting information, we sequentially introduce climate-expert variables, extreme rainfall indicators, and fine-scale geolocated building and environmental attributes. The analysis focuses on a practical setting in which insurers cannot deploy full hydrological or hydraulic catastrophe models because of budgetary, licensing, or operational constraints. Results show that rainfall-based indicators, particularly a newly constructed metric capturing intense local precipitation, substantially improve claim modeling performance. Building and environmental variables further enhance occurrence prediction. Overall, the findings demonstrate how high-resolution geolocated data improve exposure and vulnerability assessment, complement official flood maps, and provide insurers with an operational framework for refining flood risk evaluation and pricing.

2603.02372 2026-03-04 stat.OT physics.pop-ph physics.space-ph stat.AP

Implications of the Pessimistic Lower Limit on the Drake Equation

Max Baak, Hella Snoek

Comments 8 pages, 2 figures

Abstract

The observation of life on Earth is generally accepted to be uninformative concerning the probability of life on other Earth-like planets, a belief first formalized by Brandon Carter and based on the selection effect of our existence. In a similar way, the Drake equation is either presented as an estimate of the total number of active, communicative, extraterrestrial civilizations in our Galaxy ($n^g_{\rm civ}$), i.e. excluding humanity, or humanity is included in the estimate but judged to be an uninformative data point. Daniel Whitmire has recently challenged the Carter abiogenesis argument, claiming the logic behind it is flawed, as the conditional likelihoods used by Carter in Bayes' theorem are not evaluated prior to the occurrence of the evidence of life on Earth, but posterior. Doing so correctly, the anthropic selection effect is removed and the observation of life on Earth is informative after all. Following this argument, we treat the Drake equation as an estimate of all technological civilizations in a statistical counting experiment and include the data point of humanity as informative evidence. This allows one to set a pessimistic lower limit on $n^o_{\rm civ}$ for the observable universe, $n^o_{\rm civ} > 0.051$ at 95\% C.L., or $n^g_{\rm civ} > 8\times10^{-13}$ at 95\% C.L. for the Galaxy. In particular, this excludes models that predict $n^o_{\rm civ}\ll 1$ for the observable universe and refines the allowable parameter space for hypotheses like Rare Earth. Our analysis substantially reduces the portion of the Drake equation parameter space that predicts humanity is alone; when applying the lower limit this study finds $P(n^o_{\rm civ}>1 |\, {\rm humanity}) = 97.6\%$, making solitude in the observable universe a disfavored outcome. For the low-end estimate of $n^o_{\rm civ}\! =\! 1$ we calculate a probability of 42\% for the existence of other communicating civilizations.

2603.02360 2026-03-04 math.PR stat.AP

"Game, Set, Match": Double Delight Watching a Grand Slam Tennis Match

Edsel A. Pena, Dip Das, Yuexuan Wu

Abstract

Probabilistic properties of tennis scoring systems are examined and compared with best-of-K systems. A model, where each player has his/her own probability of winning his/her service point and which remains invariant for the duration of the match, and where outcomes of points played are independent of each other, is assumed. Probabilities of winning a game tie-breaker, a game, a set tie-breaker, a set, and the match are obtained. Since tennis scoring systems are unique, probability calculations require decomposing big and complicated problems into smaller and simpler constituent problems, solving these sub-problems, then combining to obtain the solution to the big problem. The problems that arise from tennis scoring systems offer excellent pedagogical venues for teaching probability, in particular, the use of the Theorem of Total Probability and the Iterated Rules for Mean, Variance, and Covariance. There are also many interesting questions in tennis, foremost of which is whether a tennis match under this assumption will actually end with probability one; or whether when two players of `equal abilities' play a match, the first server possesses an advantage. These questions are addressed in this work. Tennis scoring systems are technically statistical decision systems to determine the better player. Since such a decision system is based on a finite number of points played, erroneous decisions could arise, such as the inferior player winning the match. We compare different systems in terms of the probability of the better player winning, as well as the duration of the match in terms of the number of points played.
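
The decomposition idea can be seen on the smallest sub-problem, a single standard game. The sketch below assumes the server wins each point independently with fixed probability p, as in the model described above, and conditions on whether the game is decided before deuce (the Theorem of Total Probability); the function name is illustrative and tie-breakers, sets, and matches are not covered.

```python
def p_win_game(p):
    """Probability that the server wins a standard game when each point is
    won independently with probability p."""
    q = 1.0 - p
    # Server wins 4 points while the receiver wins 0, 1, or 2 (no deuce).
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Both players win 3 of the first 6 points: C(6, 3) = 20 orderings.
    reach_deuce = 20 * p**3 * q**3
    # From deuce, the server must win two points in a row before the receiver does.
    win_from_deuce = p**2 / (p**2 + q**2)
    return before_deuce + reach_deuce * win_from_deuce
```

Because each pair of points played from deuce returns to deuce with probability 2pq < 1, the number of such returns is geometric and the game ends with probability one, which is the single-game version of the termination question raised in the abstract.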

2603.02289 2026-03-04 stat.ME cs.LG stat.ML

Topological Causal Effects

Kwangho Kim, Hajin Lee

Journal ref Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract

Estimating causal effects is particularly challenging when outcomes arise in complex, non-Euclidean spaces, where conventional methods often fail to capture meaningful structural variation. We develop a framework for topological causal inference that defines treatment effects through differences in the topological structure of potential outcomes, summarized by power-weighted silhouette functions of persistence diagrams. We develop an efficient, doubly robust estimator in a fully nonparametric model, establish functional weak convergence, and construct a formal test of the null hypothesis of no topological effect. Empirical studies illustrate that the proposed method reliably quantifies topological treatment effects across diverse complex outcome types.

2603.02282 2026-03-04 stat.ME

A Simpson Based Estimation Approach for the Overlapping Coefficient of k>=2 Normal Distributions

Omar Eidous, Majd Alsheyyab

Comments 15 pages, 2 tables

Abstract

The overlapping coefficient is a fundamental measure of similarity between probability distributions. While the case of two distributions has been extensively studied, extending this measure to multiple populations presents both analytical and computational challenges. In this paper, we propose a general estimation framework for the overlapping coefficient of k>=2 normal distributions. The method employs Simpson's numerical integration rule combined with plug-in maximum likelihood estimators of the normal parameters. The resulting estimator is shown to be consistent under standard regularity conditions. A Monte Carlo simulation study is conducted across various overlap scenarios and sample sizes. The results demonstrate that the proposed Simpson-based estimator performs competitively for all overlap levels, with notable advantages in low-overlap situations. This methodology offers a flexible and computationally efficient approach applicable to an arbitrary number of normal populations.
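
A minimal sketch of the idea, assuming the overlapping coefficient is the integral of the pointwise minimum of the k normal densities. The integration limits (eight standard deviations beyond the extreme means), the number of subintervals, and all function names are illustrative assumptions, not choices taken from the paper.

```python
import math


def mle_normal(sample):
    """Maximum likelihood estimates (mean, standard deviation) for a normal sample."""
    n = len(sample)
    mu = sum(sample) / n
    var = sum((x - mu) ** 2 for x in sample) / n  # MLE divides by n, not n - 1
    return mu, math.sqrt(var)


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def overlap_simpson(params, n=2000, width=8.0):
    """Composite Simpson approximation of the integral of min_i f_i(x),
    where params is a list of (mu, sigma) pairs and n is the number of subintervals."""
    if n % 2:
        n += 1  # Simpson's rule needs an even number of subintervals
    a = min(mu - width * s for mu, s in params)
    b = max(mu + width * s for mu, s in params)
    h = (b - a) / n

    def f(x):
        return min(normal_pdf(x, mu, s) for mu, s in params)

    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0
```

A plug-in estimate would be obtained as `overlap_simpson([mle_normal(s) for s in samples])`, where `samples` is a list of k observed samples.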

2602.19988 2026-03-04 stat.ME stat.CO

Change point analysis of high-dimensional data using random projections

Yi Xu, Yeonwoo Rho

Abstract

This paper develops a novel change point identification method for high-dimensional data using random projections. By projecting high-dimensional time series into a one-dimensional space, we are able to leverage the rich literature for univariate time series. We propose applying random projections multiple times and then combining the univariate test results using existing multiple comparison methods. Simulation results suggest that the proposed method tends to have better size and power, with more accurate location estimation. At the same time, random projections may introduce variability in the estimated locations. To enhance stability in practice, we recommend repeating the procedure, and using the mode of the estimated locations as a guide for the final change point estimate. An application to an Australian temperature dataset is presented. This study, though limited to the single change point setting, demonstrates the usefulness of random projections in change point analysis.
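
The location-estimation part of the procedure can be sketched as follows. This is an illustration under assumptions rather than the authors' method: a plain CUSUM statistic stands in for the univariate change point tool, the projection directions are uniform on the unit sphere, and the step that combines univariate test results through multiple comparison methods is omitted.

```python
import math
import random
from collections import Counter


def cusum_location(series):
    """Index k in 1..n-1 maximizing |S_k - (k/n) S_n|, i.e. the estimated
    number of observations before the change."""
    n = len(series)
    total = sum(series)
    best_k, best_val, running = 1, -1.0, 0.0
    for k in range(1, n):
        running += series[k - 1]
        val = abs(running - k * total / n)
        if val > best_val:
            best_k, best_val = k, val
    return best_k


def random_projection_changepoint(data, n_proj=50, seed=0):
    """Project each row of data onto n_proj random unit vectors, estimate a
    change location per projection, and return the mode of the estimates."""
    rng = random.Random(seed)
    d = len(data[0])
    locations = []
    for _ in range(n_proj):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        series = [sum(vi * xi for vi, xi in zip(v, row)) for row in data]
        locations.append(cusum_location(series))
    return Counter(locations).most_common(1)[0][0]
```

Taking the mode across projections mirrors the stabilization recommended in the abstract: individual projections may nearly cancel the mean shift and give noisy locations, while the most frequent estimate is less affected.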

2601.13759 2026-03-04 stat.ME

ChauBoxplot and AdaptiveBoxplot: Two R packages for boxplot-based outlier detection

Tiejun Tong, Hongmei Lin, Bowen Gang, Riquan Zhang

Comments 11 pages, 2 figures, 2 tables

Journal ref Statistical Theory and Related Fields, 2026
Abstract

Tukey's boxplot is widely used for outlier detection; however, its classic fixed-fence rule tends to flag an excessive number of outliers as the sample size grows. To address this, we introduce two new R packages, ChauBoxplot and AdaptiveBoxplot, which implement more robust and statistically principled outlier detection methods. We illustrate their advantages and practical implications through comprehensive simulation studies and a real-world analysis of provincial university admission rates from China's National College Entrance Examination. Based on these findings, we provide practical guidance to help practitioners select appropriate boxplot methods, achieving a balance between interpretability and statistical reliability.
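
The behaviour of the classic fixed-fence rule can be reproduced with the standard library. The sketch below implements only Tukey's 1.5 IQR rule, not the methods in ChauBoxplot or AdaptiveBoxplot, and the function names are illustrative.

```python
import statistics


def tukey_outliers(data, k=1.5):
    """Points outside [Q1 - k*IQR, Q3 + k*IQR], the classic fixed-fence rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


def normal_flag_rate(k=1.5):
    """Probability that a single normal observation falls outside the
    population-level Tukey fences."""
    nd = statistics.NormalDist()
    q3 = nd.inv_cdf(0.75)
    upper = q3 + k * (2.0 * q3)  # IQR of a standard normal is 2 * q3
    return 2.0 * (1.0 - nd.cdf(upper))
```

For clean normal data the per-observation flag rate is fixed at roughly 0.7%, so the expected number of flagged points grows in proportion to the sample size, which is the tendency described in the abstract.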

2511.19476 2026-03-04 stat.ML cs.AI cs.LG

FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

Jin Cui, Boran Zhao, Jiajun Xu, Jiaqi Guo, Shuo Guan, Pengju Ren

Comments 20 pages, 17 figures

Abstract

Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are tied to model-specific parameters and introduce architectural bias; or (ii) DNN-free, which rely on heuristics lacking theoretical guarantees. Neither approach explicitly constrains distributional equivalence, largely because continuous distribution matching is considered inapplicable to discrete sampling. Moreover, prevalent metrics (e.g., MSE, KL, CE, MMD) cannot accurately capture higher-order moment discrepancies, leading to suboptimal coresets. In this work, we propose FAST, the first DNN-free distribution-matching coreset selection framework that formulates the coreset selection task as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information in the frequency domain. We further discover that naive CFD suffers from a "vanishing phase gradient" issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high, preserving global structure before refining local details and enabling accurate matching with fewer frequencies while avoiding overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2x average speedup, underscoring its high performance and energy efficiency.

2510.10902 2026-03-04 cs.LG stat.ML

Auditing Information Disclosure During LLM-Scale Gradient Descent Using Gradient Uniqueness

Sleem Abdelghafar, Maryam Aliakbarpour, Chris Jermaine

Abstract

Disclosing information via the publication of a machine learning model poses significant privacy risks. However, auditing this disclosure across every datapoint during the training of Large Language Models (LLMs) is computationally prohibitive. In this paper, we present Gradient Uniqueness (GNQ), a principled, attack-agnostic metric derived from an information-theoretic upper bound on the amount of information embedded in a model about individual training points via gradient descent. While naively computing GNQ requires forming and inverting a $P \times P$ matrix for every datapoint (for a model with $P$ parameters), we introduce Batch-Space Ghost GNQ (BS-Ghost GNQ). This efficient algorithm performs all computations in a much smaller batch-space and leverages ghost kernels to compute GNQ "in-run" with minimal computational overhead. We empirically validate that GNQ successfully accounts for prior/common knowledge. Our evaluation demonstrates that GNQ strongly predicts sequence extractability in targeted attacks and reveals how disclosure risk concentrates heterogeneously on specific examples over the course of LLM training.

2509.20508 2026-03-04 stat.ML cs.LG

Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances

Khai Nguyen, Hai Nguyen, Nhat Ho

Comments Accepted to ICLR 2026, 34 pages, 30 figures, 6 tables

English abstract

We address the problem of efficiently computing Wasserstein distances for multiple pairs of distributions drawn from a meta-distribution. To this end, we propose a fast estimation method based on regressing Wasserstein distance on sliced Wasserstein (SW) distances. Specifically, we leverage both standard SW distances, which provide lower bounds, and lifted SW distances, which provide upper bounds, as predictors of the true Wasserstein distance. To ensure parsimony, we introduce two linear models: an unconstrained model with a closed-form least-squares solution, and a constrained model that uses only half as many parameters. We show that accurate models can be learned from a small number of distribution pairs. Once estimated, the model can predict the Wasserstein distance for any pair of distributions via a linear combination of SW distances, making it highly efficient. Empirically, we validate our approach on diverse tasks, including Gaussian mixtures, point-cloud classification, and Wasserstein-space visualizations for 3D point clouds. Across various datasets such as MNIST point clouds, ShapeNetV2, MERFISH Cell Niches, and scRNA-seq, our method consistently provides a better approximation of Wasserstein distance than the state-of-the-art Wasserstein embedding model, Wasserstein Wormhole, particularly in low-data regimes. Finally, we demonstrate that our estimator can also accelerate Wormhole training, yielding RG-Wormhole.

2508.01154 2026-03-04 stat.ME

On classes of distributions on the unit interval: structural properties and application to inequality data

Roberto Vila, Helton Saulo, Poliana Matos, Subhankar Dutta

Comments 39 pages, 24 figures

English abstract

Probability distributions defined on the unit interval are widely used in fields ranging from econometrics to reliability studies. Traditional models such as the beta and Kumaraswamy distributions are well-established due to their flexibility and tractability. In this paper, we introduce two novel families of unit-interval distributions derived via non-injective transformations of the gamma ratio. These transformations, denoted $S_r$ and $T_r$, allow the construction of new random variables with support on $(0,1)$ and admit simple closed-form expressions for their densities when the underlying variables are independent gamma distributed. Notably, for $r = 1/2$, these constructions yield sample-based estimators of the Gini and Atkinson indices, establishing a direct link with classical inequality measures. We derive the distributional laws, cumulative distribution functions, quantile functions, and raw moments, and discuss maximum likelihood estimation for the proposed models. A Monte Carlo simulation study is conducted to assess the finite sample behavior of the maximum likelihood estimators under different parameter configurations. An application to cross-country Gini index data illustrates the flexibility and practical relevance of the proposed distributions in modeling real inequality indicators.

2507.18686 2026-03-04 math.ST math.AG math.CO math.CV stat.TH

One-dimensional Discrete Models of Maximum Likelihood Degree One

Carlos Améndola, Viet Duc Nguyen, Janike Oldekop

Comments 25 pages, minor improvements and more references to CR geometry

English abstract

We settle a conjecture by Bik and Marigliano stating that the degree of a one-dimensional discrete model with rational maximum likelihood estimator is bounded above by a linear function in the size of its support, therefore showing that there are only finitely many fundamental such models for any given number of states. We study these models from a combinatorial perspective with regard to their existence and enumeration. In particular, sharp models, those whose degree attains the maximal bound, enjoy special properties and have been studied as monomial maps between unit spheres. In this way, we present a novel link between Cauchy-Riemann geometry and algebraic statistics.

2507.08965 2026-03-04 cs.LG cs.AI stat.ML

Improving Classifier-Free Guidance in Masked Diffusion: Low-Dim Theoretical Insights with High-Dim Impact

Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji, Molei Tao

English abstract

Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and its extensions to discrete diffusion have recently started to be investigated. In order to improve the algorithms in a principled way, this paper starts by analyzing the exact effect of CFG in the context of a low-dimensional masked diffusion model, with a special emphasis on the guidance schedule. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance improves it. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism. Intuitively, our method smooths the transport between the data distribution and the initial (masked) distribution, resulting in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. Experiments on conditional image and text generation empirically confirm the efficacy of our method.

2507.08150 2026-03-04 stat.ML cs.LG stat.ME

CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk

Ilia Azizi, Juraj Bodik, Jakob Heiss, Bin Yu

Comments Project page: https://unco3892.github.io/clear/

English abstract

Accurate uncertainty quantification is critical for reliable predictive modeling. Existing methods typically address either aleatoric uncertainty due to measurement noise or epistemic uncertainty resulting from limited data, but not both in a balanced manner. We propose CLEAR, a calibration method with two distinct parameters, $γ_1$ and $γ_2$, to combine the two uncertainty components and improve the conditional coverage of predictive intervals for regression tasks. CLEAR is compatible with any pair of aleatoric and epistemic estimators; we show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability (PCS) framework for epistemic uncertainty. Across 17 diverse real-world datasets, CLEAR achieves an average improvement of 28.3% and 17.5% in the interval width compared to the two individually calibrated baselines while maintaining nominal coverage. Similar improvements are observed when applying CLEAR to Deep Ensembles (epistemic) and Simultaneous Quantile Regression (aleatoric). The benefits are especially evident in scenarios dominated by high aleatoric or epistemic uncertainty. Project page: https://unco3892.github.io/clear/

2506.01502 2026-03-04 cs.LG cs.AI stat.ML

Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme

Mikhail Persiianov, Jiawei Chen, Petr Mokrov, Alexander Tyurin, Evgeny Burnaev, Alexander Korotin

English abstract

Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce iJKOnet, an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional end-to-end adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods. The code of iJKOnet is available at https://github.com/MuXauJl11110/iJKOnet.

2505.22423 2026-03-04 math.ST stat.TH

Max-laws of large numbers for weakly dependent high dimensional arrays with applications

Jonathan B. Hill

English abstract

We derive so-called weak and strong max-laws of large numbers for $\max_{1\leq i\leq k_{n}}|1/n\sum_{t=1}^{n}x_{i,n,t}|$ for zero mean stochastic triangular arrays $\{x_{i,n,t} : 1 \leq t \leq n\}_{n\geq 1}$, with dimension counter $i = 1,\ldots,k_{n}$ and dimension $k_{n} \rightarrow \infty$. Rates of convergence are also analyzed based on feasible sequences $\{k_{n}\}$. We work in three dependence settings: independence, Dedecker and Prieur's (2004) $τ$-mixing and Wu's (2005) physical dependence. We initially ignore cross-coordinate $i$ dependence as a benchmark. We then work with martingale, nearly martingale, and mixing coordinates to deliver improved bounds on $k_{n}$. Finally, we use the results in three applications, each representing a key novelty: we (i) bound $k_{n}$ for a max-correlation statistic for regression residuals under $α$-mixing or physical dependence; (ii) extend correlation screening, or marginal regressions, to physical dependent data with diverging dimension $k_{n} \rightarrow \infty$; and (iii) test a high dimensional parameter after partialling out a fixed dimensional nuisance parameter in a linear time series regression model under $τ$-mixing.

2505.13614 2026-03-04 cs.LG stat.ML

Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds

Ke Sun

Comments Published at the Fourteenth International Conference on Learning Representations (ICLR 2026)

English abstract

The high-dimensional parameter space of deep neural networks -- the neuromanifold -- is endowed with a unique metric tensor defined by the Fisher information. Reliable and scalable computation of this metric tensor is valuable for theorists and practitioners. Focusing on neural classifiers, we return to a low-dimensional space of probability distributions, which we call the core space, and examine the spectrum and envelopes of its Fisher information matrix. We extend our discoveries there to deterministic bounds for the metric tensor on the neuromanifold. We introduce an unbiased random estimator based on Hutchinson's trace method and derive related bounds. It can be evaluated efficiently with a single backward pass per batch, with a standard deviation bounded by the true value up to scaling.
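
To make the reference to Hutchinson's trace method concrete, the following is a minimal sketch of the generic estimator, not the paper's estimator for the Fisher metric tensor. It assumes only that the matrix is accessible through matrix-vector products; the function name `hutchinson_trace` and its parameters are hypothetical choices made for illustration.

```python
import numpy as np


def hutchinson_trace(matvec, dim, num_probes=100, seed=0):
    """Estimate tr(A) from matrix-vector products v -> A @ v.

    Uses Rademacher probes z (entries +1 or -1), for which
    E[z^T A z] = tr(A). Illustrative helper, not from the paper.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)  # random sign vector
        total += z @ matvec(z)                 # one quadratic form per probe
    return total / num_probes


# Toy check on a random symmetric positive semi-definite matrix.
rng = np.random.default_rng(1)
B = rng.standard_normal((50, 50))
A = B @ B.T / 50
estimate = hutchinson_trace(lambda v: A @ v, dim=50, num_probes=2000)
exact = np.trace(A)
print(f"estimate={estimate:.3f}  exact={exact:.3f}")
```

Each probe needs only one product with the matrix, which is why estimators of this kind pair naturally with automatic differentiation, where such products are available without forming the matrix.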

2503.20513 2026-03-04 q-bio.BM q-bio.QM stat.AP

A Principal Submanifold-based Approach for Clustering and Multiscale RNA Correction

Menghao Wu, Zhigang Yao

Comments 30 pages, 15 figures

English abstract

RNA structure determination is essential for understanding its biological functions. However, the reconstruction process often faces challenges, such as atomic clashes, which can lead to inaccurate models. To address these challenges, we introduce the principal submanifold (PSM) approach for analyzing RNA data on a torus. This method provides an accurate, low-dimensional feature representation, overcoming the limitations of previous torus-based methods. By combining PSM with DBSCAN, we propose a novel clustering technique, the principal submanifold-based DBSCAN (PSM-DBSCAN). Our approach achieves superior clustering accuracy and increased robustness to noise. Additionally, we apply this new method for multiscale corrections, effectively resolving RNA backbone clashes at both microscopic and mesoscopic scales. Extensive simulations and comparative studies highlight the enhanced precision and scalability of our method, demonstrating significant improvements over existing approaches. The proposed methodology offers a robust foundation for correcting complex RNA structures and has broad implications for applications in structural biology and bioinformatics.

2501.18912 2026-03-04 stat.AP cs.SI

Constructing Reliable Social Networks from Conversational Data: An Ensemble Prompt Engineering Approach with Uncertainty Quantification

Gwanghee Kim, Ick Hoon Jin, Minjeong Jeon

English abstract

Conversational data are central to the study of interaction dynamics and social structures across psychological research. However, constructing structured social networks from unstructured conversational data remains a major methodological challenge. This study presents a pipeline for reliable network construction using prompt engineering. We employ an ensemble of multiple Large Language Models (LLMs) with majority voting to automate utterance classification, overcoming the scalability limitations of manual coding and the generalizability constraints of supervised deep learning. Classification reliability is assessed through an uncertainty quantification framework based on Shannon entropy, which supports systematic human-in-the-loop review of ambiguous cases. The classified utterances are used to construct directed interaction networks for subsequent analysis. We demonstrate the utility of this approach through two illustrative applications to classroom interaction data: network centrality analysis to characterize participant roles, and network mediation analysis using the additive and multiplicative effects network (AMEN) model to examine how interaction structures mediate the relationship between gender and mathematics performance. This pipeline provides a scalable foundation for automated network construction from conversational data across diverse research contexts.
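
The combination of majority voting and entropy-based uncertainty described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the label names, the `threshold` value, and the helper names `majority_vote`, `vote_entropy`, and `needs_review` are all hypothetical.

```python
import math
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def vote_entropy(labels):
    """Shannon entropy (in bits) of the empirical vote distribution."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def needs_review(labels, threshold=0.9):
    """Flag an utterance for human review when the ensemble disagrees.

    The threshold is an arbitrary illustrative value.
    """
    return vote_entropy(labels) > threshold


# Hypothetical votes from three models for two utterances.
votes_clear = ["question", "question", "question"]
votes_ambiguous = ["question", "question", "statement"]
for votes in (votes_clear, votes_ambiguous):
    print(majority_vote(votes), round(vote_entropy(votes), 3), needs_review(votes))
```

Entropy is zero when all models agree and grows as the votes spread across labels, so thresholding it gives a simple rule for routing ambiguous cases to a human coder.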

2412.02333 2026-03-04 stat.ME

Estimation of a multivariate von Mises distribution for contaminated torus data

Giulia Bertagnolli, Luca Greco, Claudio Agostinelli

English abstract

The occurrence of atypical circular observations on the torus can badly affect parameter estimation of the multivariate von Mises distribution. This paper addresses the problem of robust fitting of the multivariate von Mises model using the weighted likelihood methodology. The key ingredients are non-parametric density estimation for multivariate circular data and the definition of appropriate weighted estimating equations. Some theoretical properties are discussed. The finite sample behavior of the proposed weighted likelihood estimator has been investigated by Monte Carlo numerical studies and empirical applications.

2412.00798 2026-03-04 cs.LG stat.ML

Combinatorial Rising Bandits

Seockbean Song, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok

English abstract

Combinatorial online learning is a fundamental task for selecting the optimal action (or super arm) as a combination of base arms in sequential interactions with systems providing stochastic rewards. It is applicable to diverse domains such as robotics, social advertising, network routing, and recommendation systems. In many real-world scenarios, we often encounter rising rewards, where playing a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, e.g., robots improving through practice and social influence strengthening in the history of successful recommendations. Crucially, these enhancements may propagate to multiple super arms that share the same base arms, introducing dependencies beyond the scope of existing bandit models. To address this gap, we introduce the Combinatorial Rising Bandit (CRB) framework and propose a provably efficient and empirically effective algorithm, Combinatorial Rising Upper Confidence Bound (CRUCB). We empirically demonstrate the effectiveness of CRUCB in realistic deep reinforcement learning environments and synthetic settings, while our theoretical analysis establishes tight regret bounds. Together, they underscore the practical impact and theoretical rigor of our approach. Our code is available at https://github.com/ml-postech/Combinatorial-Rising-Bandits.

2411.01563 2026-03-04 math.ST stat.ML stat.TH

Statistical guarantees for denoising reflected diffusion models

Asbjørn Holk, Claudia Strauch, Lukas Trottner

English abstract

In recent years, denoising diffusion models have become a crucial area of research due to their abundance in the rapidly expanding field of generative AI. While recent statistical advances have delivered explanations for the generation ability of idealised denoising diffusion models for high-dimensional target data, implementations introduce thresholding procedures for the generating process to overcome issues arising from the unbounded state space of such models. This mismatch between theoretical design and implementation of diffusion models has been addressed empirically by using a reflected diffusion process as the driver of noise instead. In this paper, we study statistical guarantees of these denoising reflected diffusion models. In particular, under Sobolev smoothness assumptions, we establish rates of convergence in total variation which, up to a polylogarithmic factor, match the minimax lower bound. Our main contributions include the statistical analysis of this novel class of denoising reflected diffusion models and a refined score approximation method in both time and space, leveraging spectral decomposition and rigorous neural network analysis.

2410.06378 2026-03-04 stat.ML cs.AI cs.IT cs.LG math.IT

Covering Numbers for Deep ReLU Networks with Applications to Function Approximation and Nonparametric Regression

Weigutian Ou, Helmut Bölcskei

Comments To appear in Foundations of Computational Mathematics

English abstract

Covering numbers of (deep) ReLU networks have been used to characterize approximation-theoretic performance, to upper-bound prediction error in nonparametric regression, and to quantify classification capacity. These results rely on covering number upper bounds obtained via explicit constructions of coverings. Lower bounds on covering numbers do not appear to be available in the literature. The present paper fills this gap by deriving tight (up to multiplicative constants) lower and upper bounds on the metric entropy (i.e., the logarithm of the covering numbers) of fully connected networks with bounded weights, sparse networks with bounded weights, and fully connected networks with quantized weights. The tightness of these bounds yields a fundamental understanding of the impact of sparsity, quantization, bounded versus unbounded weights, and network output truncation. Moreover, the bounds allow one to characterize fundamental limits of neural network transformation, including network compression, and lead to sharp upper bounds on the prediction error in nonparametric regression through deep networks. In particular, we remove a $\log^6(n)$-factor from the best known sample complexity rate for estimating Lipschitz functions via deep networks, thereby establishing optimality. Finally, we identify a systematic relation between optimal nonparametric regression and optimal approximation through deep networks, unifying numerous results in the literature and revealing underlying general principles.

2407.10417 2026-03-04 stat.ML cs.LG

Proper losses regret at least 1/2-order

Han Bao, Asuka Takatsu

Comments JMLR accepted (50 pages)

English abstract

A fundamental challenge in machine learning is the choice of a loss, as it characterizes our learning task, is minimized in the training phase, and serves as an evaluation criterion for estimators. Proper losses are commonly chosen, ensuring minimizers of the full risk match the true probability vector. Estimators induced from a proper loss are widely used to construct forecasters for downstream tasks such as classification and ranking. In this procedure, how well does the forecaster based on the obtained estimator perform under a given downstream task? This question is closely tied to the behavior of the $p$-norm between the estimated and true probability vectors when the estimator is updated. In the proper loss framework, the suboptimality of the estimated probability vector relative to the true probability vector is measured by a surrogate regret. First, we analyze a surrogate regret and show that the strict properness of a loss is necessary and sufficient to establish a non-vacuous surrogate regret bound. Second, we resolve an important open question by showing that the order of convergence in $p$-norm cannot be faster than the $1/2$-order of surrogate regrets for a broad class of strictly proper losses. This implies that strongly proper losses entail the optimal convergence rate.

2406.01819 2026-03-04 stat.ME

Bayesian Linear Models: A compact general set of results

J Andres Christen

Comments 13 pages, 4 figures, Python implementation

English abstract

I present all the details in calculating the posterior distribution of the conjugate Normal-Gamma prior in Bayesian Linear Models (BLM), including correlated observations, prediction, model selection and comments on efficient numeric implementations. A Python implementation is also presented. These results have been presented and are available in many books and texts but, I believe, a general, compact and simple presentation is always welcome and not always simple to find. Since correlated observations are also included, these results may also be useful for time series analysis and spatial statistics. Other particular cases presented include regression, Gaussian processes and Bayesian Dynamic Models.
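
For orientation, the textbook conjugate update in the simplest case (independent observations with a common precision) can be sketched as below. This is not the paper's Python implementation and does not cover the correlated-observation case; the function name `normal_gamma_posterior` and the parametrization (prior precision matrix `L0`, Gamma shape `a0` and rate `b0`) are assumptions made for illustration.

```python
import numpy as np


def normal_gamma_posterior(X, y, m0, L0, a0, b0):
    """Conjugate update for y | beta, tau ~ N(X beta, I / tau).

    Prior: beta | tau ~ N(m0, (tau * L0)^-1), tau ~ Gamma(a0, rate=b0).
    Returns posterior parameters (mn, Ln, an, bn) of the same form.
    """
    n = X.shape[0]
    Ln = L0 + X.T @ X                            # posterior precision factor
    mn = np.linalg.solve(Ln, L0 @ m0 + X.T @ y)  # posterior mean of beta
    an = a0 + 0.5 * n                            # posterior shape
    bn = b0 + 0.5 * (y @ y + m0 @ L0 @ m0 - mn @ Ln @ mn)  # posterior rate
    return mn, Ln, an, bn


# Synthetic example: two regressors, known coefficients, small noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + 0.1 * rng.standard_normal(200)

mn, Ln, an, bn = normal_gamma_posterior(
    X, y, m0=np.zeros(2), L0=1e-6 * np.eye(2), a0=1.0, b0=1.0
)
print("posterior mean:", mn)
print("posterior mean of noise precision:", an / bn)
```

Solving the linear system with `np.linalg.solve` instead of inverting `Ln` is the usual numerically safer choice.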

2406.00961 2026-03-04 math.PR math.ST stat.TH

Kronecker-product random matrices and a matrix least squares problem

Zhou Fan, Renyuan Ma

Journal ref
Annals of Probability 2026, Vol. 54, No. 2, 973-1033
English abstract

We study the eigenvalue distribution and resolvent of a Kronecker-product random matrix model $A \otimes I_{n \times n}+I_{n \times n} \otimes B+Θ\otimes Ξ\in \mathbb{C}^{n^2 \times n^2}$, where $A,B$ are independent Wigner matrices and $Θ,Ξ$ are deterministic and diagonal. For fixed spectral arguments, we establish a quantitative approximation for the Stieltjes transform by that of an approximating free operator, and a diagonal deterministic equivalent approximation for the resolvent. We further obtain sharp estimates in operator norm for the $n \times n$ resolvent blocks, and show that off-diagonal resolvent entries fall on two differing scales of $n^{-1/2}$ and $n^{-1}$ depending on their locations in the Kronecker structure. Our study is motivated by consideration of a matrix-valued least-squares optimization problem $\min_{X \in \mathbb{R}^{n \times n}} \frac{1}{2}\|XA+BX\|_F^2+\frac{1}{2}\sum_{ij} ξ_iθ_j x_{ij}^2$ subject to a linear constraint. For random instances of this problem defined by Wigner inputs $A,B$, our analyses imply an asymptotic characterization of the minimizer $X$ and its associated minimum objective value as $n \to \infty$.

2403.10945 2026-03-04 stat.ME stat.AP

Zero-inflated stochastic volatility model for disaggregated inflation data with exact zeros

Geonhee Han, Kaoru Irie

English abstract

The disaggregated time-series for the Consumer Price Index (CPI) often exhibits exact zero price changes, stemming from structural features of the data collection process. However, the currently prominent stochastic volatility model of trend-inflation is designed for aggregate measures of price inflation, where zeros rarely occur. We formulate a zero-inflated stochastic volatility model applicable to such non-stationary, real-valued, multivariate time-series data with exact zeros, which jointly specifies the dynamic zero-generating process. For posterior inference, an efficient custom Pólya--Gamma augmented Gibbs sampler is derived. Applying the model to disaggregated CPI data in four advanced economies -- US, UK, Germany, and Japan -- we find that the zero-inflated model yields more informative estimates of time-varying trend and volatility, as it accounts for the presence of zeros and avoids underestimation. In an out-of-sample forecasting exercise, we find that the zero-inflated model delivers improved point forecasts and better calibrated interval forecasts, particularly when zero-inflation is prevalent.

2302.12717 2026-03-04 stat.ME stat.ML

Statistical Inference with Stochastic Gradient Methods under $ϕ$-mixing Data

Ruiqi Liu, Xi Chen, Zuofeng Shang

English abstract

Stochastic gradient descent (SGD) is a scalable and memory-efficient optimization algorithm for large datasets and streaming data, which has drawn a great deal of attention and popularity. The applications of SGD-based estimators to statistical inference such as interval estimation have also achieved great success. However, most of the related works are based on i.i.d. observations or Markov chains. When the observations come from a mixing time series, how to conduct valid statistical inference remains unexplored. As a matter of fact, the general correlation among observations imposes a challenge on interval estimation. Most existing methods may ignore this correlation and lead to invalid confidence intervals. In this paper, we propose a mini-batch SGD estimator for statistical inference when the data are $ϕ$-mixing. The confidence intervals are constructed using an associated mini-batch bootstrap SGD procedure. Using the "independent block" trick from Yu (1994), we show that the proposed estimator is asymptotically normal, and its limiting distribution can be effectively approximated by the bootstrap procedure. The proposed method is memory-efficient and easy to implement in practice. Simulation studies on synthetic data and an application to a real-world dataset confirm our theory.

2210.10278 2026-03-04 cs.LG cs.GT stat.ML

A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design

Rui Ai, Boxiang Lyu, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan

English abstract

We study reserve price optimization in multi-phase second-price auctions, where the seller's prior actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared to the bandit setting in existing works, our setting involves three challenges. First, from the seller's perspective, we need to efficiently explore the environment in the presence of potentially untruthful bidders who aim to manipulate the seller's policy. Second, we want to minimize the seller's revenue regret when the market noise distribution is unknown. Third, the seller's per-step revenue is an unknown, nonlinear random variable that cannot be directly observed from the environment; only its realized values are available. We propose a mechanism addressing all three challenges. To address the first challenge, we use a combination of a new technique named "buffer periods" and ideas from Reinforcement Learning (RL) with low switching cost to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second one is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third challenge is resolved by an extension of LSVI-UCB, where we use the auction's underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the Contextual-LSVI-UCB-Buffer (CLUB) algorithm, which achieves $\tilde{O}(H^{5/2}\sqrt{K})$ revenue regret, where $K$ is the number of episodes and $H$ is the length of each episode, when the market noise is known, and $\tilde{O}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown, with no assumptions on bidders' truthfulness.
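
As background for the setting above, a single round of a second-price auction with a reserve price can be sketched as follows. This illustrates only the standard one-shot rule that the seller's reserve feeds into, not the CLUB algorithm or the multi-phase MDP; the function name `second_price_with_reserve` and the example bids are hypothetical.

```python
def second_price_with_reserve(bids, reserve):
    """Run one second-price auction with a reserve price.

    Returns (winner_index, price). The highest bidder wins if the bid
    meets the reserve and pays the larger of the runner-up bid and the
    reserve. Returns (None, 0.0) when there is no sale.
    """
    if not bids:
        return None, 0.0
    # Bidder indices sorted from highest to lowest bid.
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    top = order[0]
    if bids[top] < reserve:
        return None, 0.0  # reserve not met: item stays unsold
    runner_up = bids[order[1]] if len(bids) > 1 else 0.0
    return top, max(runner_up, reserve)


# Revenue as a function of the reserve for one fixed set of bids.
bids = [5.0, 3.0]
for reserve in (2.0, 4.0, 6.0):
    print(reserve, second_price_with_reserve(bids, reserve))
```

The loop shows the trade-off the seller faces: a reserve between the two highest bids raises the price, while a reserve above the highest bid loses the sale entirely.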