arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.22203 2026-02-26 stat.ME

Local Bayesian Regression

Nils Lid Hjort

Comments 28 pages; Statistical Research Report, Department of Mathematics, University of Oslo, August 1994, but arXiv'd in February 2026. A journal paper can be written up based on this report, though it would require numerical studies and good illustrations

Abstract

This paper develops a class of Bayesian non- and semiparametric methods for estimating regression curves and surfaces. The main idea is to model the regression as locally linear, and then place suitable local priors on the local parameters. The method requires the posterior distribution of the local parameters given local data, and this is found via a suitably defined local likelihood function. When the width of the local data window is large the methods reduce to familiar fully parametric Bayesian methods, and when the width is small the estimators are essentially nonparametric. When noninformative reference priors are used the resulting estimators coincide with recently developed well-performing local weighted least squares methods for nonparametric regression. Each local prior distribution needs in general a centre parameter and a variance parameter. Of particular interest are versions of the scheme that are more or less automatic and objective in the sense that they do not require subjective specifications of prior parameters. We therefore develop empirical Bayes methods to obtain the variance parameter and a hierarchical Bayes method to account for uncertainty in the choice of centre parameter. There are several possible versions of the general programme, and a number of its specialisations are discussed. Some of these are shown to be capable of outperforming standard nonparametric regression methods, particularly in situations with several covariates.
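
The abstract notes that, under noninformative reference priors, the local Bayesian estimators coincide with local weighted least squares. As a rough illustration of that limiting case only (not the report's Bayesian scheme), here is a minimal local linear fit. The Gaussian kernel and the names `local_linear_fit` and `bandwidth` are assumptions made for this sketch.

```python
import numpy as np

def local_linear_fit(x, y, x0, bandwidth):
    """Local weighted least squares estimate of the regression curve at x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Kernel weights: observations near x0 count more (Gaussian kernel is an assumed choice).
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    # Local design matrix: intercept and slope, centred at x0.
    design = np.column_stack([np.ones_like(x), x - x0])
    # Weighted least squares via ordinary least squares on sqrt-weighted rows.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    # The local intercept is the fitted value at x0.
    return beta[0]
```

With a very large `bandwidth` all weights are nearly equal and the fit approaches a global straight-line regression, mirroring the abstract's remark that wide windows recover fully parametric methods; a small `bandwidth` gives an essentially nonparametric estimate.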

2602.22178 2026-02-26 math.ST stat.TH

Confidence in confidence distributions!

Céline Cunen, Nils Lid Hjort, Tore Schweder

Comments 5 pages, 2 figures. Statistical Research Report, Department of Mathematics, University of Oslo, February 2020, here arXiv'd February 2026. Published in Proceedings of the Royal Society, Series A, 2020, vol. 476, at this url: royalsocietypublishing.org/rspa/article/476/2237/20190781/56889

Abstract

The recent article "Satellite conjunction analysis and the false confidence theorem" (Balch, Martin, and Ferson, 2019, Proceedings of the Royal Society, Series A) points to certain difficulties with Bayesian analysis when used for models for satellite conjunction and ensuing operative decisions. Here we supplement these previous analyses and findings with further insights, uncovering what we perceive as the crucial points, explained in a prototype setup where exact analysis is attainable. We also show that a different and frequentist method, involving confidence distributions, is free of the false confidence syndrome.

2602.22122 2026-02-26 stat.ML cs.LG

Probing the Geometry of Diffusion Models with the String Method

Elio Moreau, Florentin Coeurdoux, Grégoire Ferre, Eric Vanden-Eijnden

Abstract

Understanding the geometry of learned distributions is fundamental to improving and interpreting diffusion models, yet systematic tools for exploring their landscape remain limited. Standard latent-space interpolations fail to respect the structure of the learned distribution, often traversing low-density regions. We introduce a framework based on the string method that computes continuous paths between samples by evolving curves under the learned score function. Operating on pretrained models without retraining, our approach interpolates between three regimes: pure generative transport, which yields continuous sample paths; gradient-dominated dynamics, which recover minimum energy paths (MEPs); and finite-temperature string dynamics, which compute principal curves -- self-consistent paths that balance energy and entropy. We demonstrate that the choice of regime matters in practice. For image diffusion models, MEPs contain high-likelihood but unrealistic "cartoon" images, confirming prior observations that likelihood maxima appear unrealistic; principal curves instead yield realistic morphing sequences despite lower likelihood. For protein structure prediction, our method computes transition pathways between metastable conformers directly from models trained on static structures, yielding paths with physically plausible intermediates. Together, these results establish the string method as a principled tool for probing the modal structure of diffusion models -- identifying modes, characterizing barriers, and mapping connectivity in complex learned distributions.

2602.22083 2026-02-26 stat.ME cs.LG stat.ML

Coarsening Bias from Variable Discretization in Causal Functionals

Xiaxian Ou, Razieh Nabi

Abstract

A class of causal effect functionals requires integration over conditional densities of continuous variables, as in mediation effects and nonparametric identification in causal graphical models. Estimating such densities and evaluating the resulting integrals can be statistically and computationally demanding. A common workaround is to discretize the variable and replace integrals with finite sums. Although convenient, discretization alters the population-level functional and can induce non-negligible approximation bias, even under correct identification. Under smoothness conditions, we show that this coarsening bias is first order in the bin width and arises at the level of the target functional, distinct from statistical estimation error. We propose a simple bias-reduced functional that evaluates the outcome regression at within-bin conditional means, eliminating the leading term and yielding a second-order approximation error. We derive plug-in and one-step estimators for the bias-reduced functional. Simulations demonstrate substantial bias reduction and near-nominal confidence interval coverage, even under coarse binning. Our results provide a simple framework for controlling the impact of variable discretization on parameter approximation and estimation.

2602.22062 2026-02-26 stat.ME math.ST stat.TH

Robust Model Selection for Discovery of Latent Mechanistic Processes

Jiawei Li, Nguyen Nguyen, Meng Lai, Ioannis Ch. Paschalidis, Jonathan H. Huggins

Abstract

When learning interpretable latent structures using model-based approaches, even small deviations from modeling assumptions can lead to inferential results that are not mechanistically meaningful. In this work, we consider latent structures that consist of $K_o$ mechanistic processes, where $K_o$ is unknown. When the model is misspecified, likelihood-based model selection methods can substantially overestimate $K_o$ while more robust nonparametric methods can be overly conservative. Hence, there is a need for approaches that combine the sensitivity of likelihood-based methods with the robustness of nonparametric ones. We formalize this objective in terms of a robust model selection consistency property, which is based on a component-level discrepancy measure that captures the mechanistic structure of the model. We then propose the accumulated cutoff discrepancy criterion (ACDC), which leverages plug-in estimates of component-level discrepancies. To apply ACDC, we develop mechanistically meaningful component-level discrepancies for a general class of latent variable models that includes unsupervised and supervised variants of probabilistic matrix factorization and mixture modeling. We show that ACDC is robustly consistent when applied to unsupervised matrix factorization and mixture models. Numerical results demonstrate that in practice our approach reliably identifies a mechanistically meaningful number of latent processes in numerous illustrative applications, outperforming existing methods.

2602.22047 2026-02-26 math.OC math.ST stat.TH

Stochastic Optimal Control with Side Information and Bayesian Learning

Johannes Milz, Alexander Shapiro, Enlu Zhou

Abstract

We study infinite-horizon stochastic optimal control problems with observable side information: a Markov chain that modulates an unknown context-conditional randomness distribution. Since this distribution is unknown, we propose a Bayesian reformulation based on a parametric density model and posterior predictive dynamics, which yields a Bayesian Bellman equation. We prove posterior consistency under Markov samples and, under correct specification and identifiability, uniform convergence of the Bayesian value function. Finally, we establish Bernstein--von Mises-type asymptotic normality for the data-driven contextual optimal value.

2602.22021 2026-02-26 stat.ME

Budgeted Active Experimentation for Treatment Effect Estimation from Observational and Randomized Data

Jiacan Gao, Xinyan Su, Mingyuan Ma, Yiyan Huang, Xiao Xu, Xinrui Wan, Tianqi Gu, Enyun Yu, Jiecheng Guo, Zhiheng Zhang

Abstract

Estimating heterogeneous treatment effects is central to data-driven decision-making, yet industrial applications often face a fundamental tension between limited randomized controlled trial (RCT) budgets and abundant but biased observational data collected under historical targeting policies. Although observational logs offer the advantage of scale, they inherently suffer from severe policy-induced imbalance and overlap violations, rendering standalone estimation unreliable. We propose a budgeted active experimentation framework that iteratively enhances model training for causal effect estimation via active sampling. By leveraging observational priors, we develop an acquisition function targeting uplift estimation uncertainty, overlap deficits, and domain discrepancy to select the most informative units for randomized experiments. We establish finite-sample deviation bounds, asymptotic normality via martingale Central Limit Theorems (CLTs), and minimax lower bounds to prove information-theoretic optimality. Extensive experiments on industrial datasets demonstrate that our approach significantly outperforms standard randomized baselines in cost-constrained settings.

2602.22003 2026-02-26 cs.LG math.OC stat.ML

Neural solver for Wasserstein Geodesics and optimal transport dynamics

Hailiang Liu, Yan-Han Chen

Comments 28 pages, 22 figures

Abstract

In recent years, the machine learning community has increasingly embraced the optimal transport (OT) framework for modeling distributional relationships. In this work, we introduce a sample-based neural solver for computing the Wasserstein geodesic between a source and target distribution, along with the associated velocity field. Building on the dynamical formulation of the optimal transport (OT) problem, we recast the constrained optimization as a minimax problem, using deep neural networks to approximate the relevant functions. This approach not only provides the Wasserstein geodesic but also recovers the OT map, enabling direct sampling from the target distribution. By estimating the OT map, we obtain velocity estimates along particle trajectories, which in turn allow us to learn the full velocity field. The framework is flexible and readily extends to general cost functions, including the commonly used quadratic cost. We demonstrate the effectiveness of our method through experiments on both synthetic and real datasets.

2602.21998 2026-02-26 stat.ME

Design-based theory for causal inference from adaptive experiments

Xinran Li, Anqi Zhao

Abstract

Adaptive designs dynamically update treatment probabilities using information accumulated during the experiment. Existing theory for causal inference from adaptive experiments primarily assumes the superpopulation framework with independent and identically distributed units, and may not apply when the distribution of units evolves over time. This paper makes two contributions. First, we extend the literature to the finite-population framework, which allows for possibly nonexchangeable units, and establish the design-based theory for causal inference under general adaptive designs using inverse-propensity-weighted (IPW) and augmented IPW (AIPW) estimators. Our theory accommodates nonexchangeable units, both nonconverging and vanishing treatment probabilities, and nonconverging outcome estimators, thereby justifying inference using AIPW estimators with black-box outcome models that integrate advances from machine learning methods. To alleviate the conservativeness inherent in variance estimation under finite-population inference, we also introduce a covariance estimator for the AIPW estimator that becomes sharp when the residuals from the adaptive regression of potential outcomes on covariates are additive across units. Our framework encompasses widely used adaptive designs, such as multi-armed bandits, covariate-adaptive randomization, and sequential rerandomization, advancing the design-based theory for causal inference in these specific settings. Second, as a methodological contribution, we propose an adaptive covariate adjustment approach for analyzing even nonadaptive designs. The martingale structure induced by adaptive adjustment enables valid inference with black-box outcome estimators that would otherwise require strong assumptions under standard nonadaptive analysis.

2602.21948 2026-02-26 cs.LG stat.ML

Bayesian Generative Adversarial Networks via Gaussian Approximation for Tabular Data Synthesis

Bahrul Ilmi Nasution, Mark Elliot, Richard Allmendinger

Comments 28 pages, 5 Figures, Accepted in Transactions on Data Privacy

Abstract

Generative Adversarial Networks (GAN) have been used in many studies to synthesise mixed tabular data. Conditional tabular GAN (CTGAN) have been the most popular variant but struggle to effectively navigate the risk-utility trade-off. Bayesian GAN have received less attention for tabular data, but have been explored with unstructured data such as images and text. The most used technique employed in Bayesian GAN is Markov Chain Monte Carlo (MCMC), but it is computationally intensive, particularly in terms of weight storage. In this paper, we introduce Gaussian Approximation of CTGAN (GACTGAN), an integration of the Bayesian posterior approximation technique using Stochastic Weight Averaging-Gaussian (SWAG) within the CTGAN generator to synthesise tabular data, reducing computational overhead after the training phase. We demonstrate that GACTGAN yields better synthetic data compared to CTGAN, achieving better preservation of tabular structure and inferential statistics with less privacy risk. These results highlight GACTGAN as a simpler, effective implementation of Bayesian tabular synthesis.

2602.21928 2026-02-26 cs.LG stat.ML

Learning Unknown Interdependencies for Decentralized Root Cause Analysis in Nonlinear Dynamical Systems

Ayush Mohanty, Paritosh Ramanan, Nagi Gebraeel

Comments Manuscript under review

Abstract

Root cause analysis (RCA) in networked industrial systems, such as supply chains and power networks, is notoriously difficult due to unknown and dynamically evolving interdependencies among geographically distributed clients. These clients represent heterogeneous physical processes and industrial assets equipped with sensors that generate large volumes of nonlinear, high-dimensional, and heterogeneous IoT data. Classical RCA methods require partial or full knowledge of the system's dependency graph, which is rarely available in these complex networks. While federated learning (FL) offers a natural framework for decentralized settings, most existing FL methods assume homogeneous feature spaces and retrainable client models. These assumptions are not compatible with our problem setting. Different clients have different data features and often run fixed, proprietary models that cannot be modified. This paper presents a federated cross-client interdependency learning methodology for feature-partitioned, nonlinear time-series data, without requiring access to raw sensor streams or modifying proprietary client models. Each proprietary local client model is augmented with a Machine Learning (ML) model that encodes cross-client interdependencies. These ML models are coordinated via a global server that enforces representation consistency while preserving privacy through calibrated differential privacy noise. RCA is performed using model residuals and anomaly flags. We establish theoretical convergence guarantees and validate our approach on extensive simulations and a real-world industrial cybersecurity dataset.

2602.21846 2026-02-26 stat.ML cs.LG math.ST stat.ME stat.TH

Scalable Kernel-Based Distances for Statistical Inference and Integration

Masha Naslidnyk

Comments PhD thesis

Abstract

Representing, comparing, and measuring the distance between probability distributions is a key task in computational statistics and machine learning. The choice of representation and the associated distance determine properties of the methods in which they are used: for example, certain distances can allow one to encode robustness or smoothness of the problem. Kernel methods offer flexible and rich Hilbert space representations of distributions that allow the modeller to enforce properties through the choice of kernel, and estimate associated distances at efficient nonparametric rates. In particular, the maximum mean discrepancy (MMD), a kernel-based distance constructed by comparing Hilbert space mean functions, has received significant attention due to its computational tractability and is favoured by practitioners. In this thesis, we conduct a thorough study of kernel-based distances with a focus on efficient computation, with core contributions in Chapters 3 to 6. Part I of the thesis is focused on the MMD, specifically on improved MMD estimation. In Chapter 3 we propose a theoretically sound, improved estimator for MMD in simulation-based inference. Then, in Chapter 4, we propose an MMD-based estimator for conditional expectations, a ubiquitous task in statistical computation. Closing Part I, in Chapter 5 we study the problem of calibration when MMD is applied to the task of integration. In Part II, motivated by the recent developments in kernel embeddings beyond the mean, we introduce a family of novel kernel-based discrepancies: kernel quantile discrepancies. These address some of the pitfalls of MMD, and are shown through both theoretical results and an empirical study to offer a competitive alternative to MMD and its fast approximations. We conclude with a discussion on broader lessons and future work emerging from the thesis.
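
For readers unfamiliar with the MMD mentioned above, the following is a minimal sketch of the standard biased (V-statistic) estimate of the squared MMD between two samples. It is not one of the improved estimators proposed in the thesis; the Gaussian kernel, the `lengthscale` parameter, and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale ** 2))

def mmd_squared_biased(x, y, lengthscale=1.0):
    """V-statistic estimate of MMD^2: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    kxx = gaussian_kernel(x, x, lengthscale)
    kyy = gaussian_kernel(y, y, lengthscale)
    kxy = gaussian_kernel(x, y, lengthscale)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()
```

The estimate is close to zero when both samples come from the same distribution and grows as the distributions separate. Its cost is quadratic in the sample size, which is the kind of computational burden that motivates work on efficient estimation.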

2602.21765 2026-02-26 cs.LG cs.AI stat.ML

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

Kenton Tang, Yuzhu Chen, Fengxiang He

Abstract

Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) \emph{reward shift}: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) \emph{clipped KL regularisation}: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, resulting in an error to RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error from prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, as an Ornstein-Uhlenbeck process. The theory yields practical implications in (1) optimal KL clipping threshold, and (2) budget allocation in prompts, rollouts, and preference data.

2602.21764 2026-02-26 math.ST stat.TH

Estimation of the Self-similarity Index of Non-stationary Increments Self-similar Processes via Lamperti Transformations

William Wu, Qidi Peng

Abstract

We introduce a novel method for estimating the self-similarity index of a general $H$-self-similar process with either stationary or non-stationary increments. The estimation algorithm is developed based on a modified Lamperti transformation, which transforms $H$-self-similar processes to stationary ones. As an application, we show how to use this approach to estimate the self-similarity index of fractional Brownian motion, subfractional Brownian motion, bifractional Brownian motion, and trifractional Brownian motion. A simulation study is performed to support the consistency of our estimators. An implementation in Python is publicly shared. The method is also applied to estimating the self-similarity index of Nile river water level data from the years 900 to 1200 C.E.
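
For orientation, the classical Lamperti transformation is stated below. The abstract refers to a modified version whose exact form is not given in this listing, so this is background only; the symbols $X$, $Y$, and $c$ are notation assumed here.

```latex
% H-self-similarity: for every c > 0, equality in finite-dimensional distributions
\{X(ct)\}_{t \ge 0} \overset{d}{=} \{c^{H} X(t)\}_{t \ge 0}.

% Classical Lamperti transformation: the process Y below is stationary
Y(t) = e^{-Ht}\, X\!\left(e^{t}\right), \qquad t \in \mathbb{R}.
```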

2602.20503 2026-02-26 stat.ME stat.AP

Error-Controlled Borrowing from External Data Using Wasserstein Ambiguity Sets

Yui Kimura, Shu Tamano

Abstract

Incorporating external data can improve the efficiency of clinical trials, but distributional mismatches between current and external populations threaten the validity of inference. While numerous dynamic borrowing methods exist, the calibration of their borrowing parameters relies mainly on ad hoc, simulation-based tuning. To overcome this, we propose BOND (Borrowing under Optimal Nonparametric Distributional robustness), a framework that formalizes data noncommensurability through Wasserstein ambiguity sets centered at the current-trial distribution. By deriving sharp, closed-form bounds on the worst-case mean drift for both continuous and binary outcomes, we construct a distributionally robust, bias-corrected Wald statistic that ensures asymptotic type I error control uniformly over the ambiguity set. Importantly, BOND determines the optimal borrowing strength by maximizing a worst-case power proxy, converting heuristic parameter tuning into a transparent, analytically tractable optimization problem. Furthermore, we demonstrate that many prominent borrowing methods can be reparameterized via an effective borrowing weight, rendering our calibration framework broadly applicable. Simulation studies and a real-world clinical trial application confirm that BOND preserves the nominal size under unmeasured heterogeneity while achieving efficiency gains over standard borrowing methods.

2512.25017 2026-02-26 math.NA cs.LG cs.NA q-fin.CP stat.ML

Convergence of the generalization error for deep gradient flow methods for PDEs

Chenguang Liu, Antonis Papapantoleon, Jasper Rou

Comments 29 pages

Abstract

The aim of this article is to provide a firm mathematical foundation for the application of deep gradient flow methods (DGFMs) for the solution of (high-dimensional) partial differential equations (PDEs). We decompose the generalization error of DGFMs into an approximation and a training error. We first show that the solution of PDEs that satisfy reasonable and verifiable assumptions can be approximated by neural networks, thus the approximation error tends to zero as the number of neurons tends to infinity. Then, we derive the gradient flow that the training process follows in the "wide network limit" and analyze the limit of this flow as the training time tends to infinity. These results combined show that the generalization error of DGFMs tends to zero as the number of neurons and the training time tend to infinity.

2509.25800 2026-02-26 cs.LG stat.ME

Characterization and Learning of Causal Graphs with Latent Confounders and Post-treatment Selection from Interventional Data

Gongxu Luo, Loka Li, Guangyi Chen, Haoyue Dai, Kun Zhang

Abstract

Interventional causal discovery seeks to identify causal relations by leveraging distributional changes introduced by interventions, even in the presence of latent confounders. Beyond the spurious dependencies induced by latent confounders, we highlight a common yet often overlooked challenge in the problem due to post-treatment selection, in which samples are selectively included in datasets after interventions. This fundamental challenge widely exists in biological studies; for example, in gene expression analysis, both observational and interventional samples are retained only if they meet quality control criteria (e.g., highly active cells). Neglecting post-treatment selection may introduce spurious dependencies and distributional changes under interventions, which can mimic causal responses, thereby distorting causal discovery results and challenging existing causal formulations. To address this, we introduce a novel causal formulation that explicitly models post-treatment selection and reveals how its differential reactions to interventions can distinguish causal relations from selection patterns, allowing us to go beyond traditional equivalence classes toward the underlying true causal structure. We then characterize its Markov properties and propose a Fine-grained Interventional equivalence class, named FI-Markov equivalence, represented by a new graphical diagram, F-PAG. Finally, we develop a provably sound and complete algorithm, F-FCI, to identify causal relations, latent confounders, and post-treatment selection up to $\mathcal{FI}$-Markov equivalence, using both observational and interventional data. Experimental results on synthetic and real-world datasets demonstrate that our method recovers causal relations despite the presence of both selection and latent confounders.

2508.16110 2026-02-26 stat.ME math.PR

Estimating the growth rate of a birth and death process using data from a small sample

Carola Sophia Heinzel, Jason Schweinsberg

Abstract

The problem of estimating the growth rate of a birth and death process based on the coalescence times of a sample of $n$ individuals has been considered by several authors (\cite{stadler2009incomplete, williams2022life, mitchell2022clonal, Johnson2023}). This problem has applications, for example, to cancer research, when one is interested in determining the growth rate of a clone. Recently, \cite{Johnson2023} proposed an analytical method for estimating the growth rate using the theory of coalescent point processes, which has comparable accuracy to more computationally intensive methods when the sample size $n$ is large. We use a similar approach to obtain an estimate of the growth rate that is not based on the assumption that $n$ is large. We demonstrate, through simulations using the R package \texttt{cloneRate}, that our estimator of the growth rate performs well in comparison with previous approaches when $n$ is small.

2504.19994 2026-02-26 stat.ME

Semi-parametric bulk and tail regression using spline-based neural networks

Reetam Majumder, Jordan Richards

Abstract

Semi-parametric quantile regression (SPQR) is a flexible approach to density regression that learns a spline-based representation of conditional density functions using neural networks. As it makes no parametric assumptions about the underlying density, SPQR performs well for in-sample testing and interpolation. However, it can perform poorly when modelling heavy-tailed data or when asked to extrapolate beyond the range of observations, as it fails to satisfy any of the asymptotic guarantees provided by extreme value theory (EVT). To build semi-parametric density regression models that can be used for reliable tail extrapolation, we create the blended generalised Pareto (GP) distribution, which i) provides a model for the entire range of data and, via a smooth and continuous transition, ii) benefits from exact GP upper-tails without the need for intermediate threshold selection. We combine SPQR with our blended GP to create semi-parametric quantile regression for extremes (SPQRx), which provides a flexible semi-parametric approach to density regression that is compliant with traditional EVT. We handle interpretability of SPQRx through the use of model-agnostic variable importance scores, which provide the relative importance of a covariate for separately determining the bulk and tail of the conditional density. The efficacy of SPQRx is illustrated on simulated data, and an application to U.S. wildfire burnt areas from 1990-2020.

2503.20852 2026-02-26 stat.OT math.PR math.ST stat.TH

Teachable normal approximations to binomial and related probabilities or confidence bounds

Lutz Mattner

Comments 13 pages. Contains now a complete proof of the proposed bounds for Clopper-Pearson bounds. Further various minor improvements

Abstract

For the usual normal approximations to binomial, hypergeometric, or Poisson interval probabilities, we collect some simple but then reasonably sharp error bounds. For the Clopper-Pearson (1934) binomial confidence bounds, we present, following Michael Short's (2023) approach, bounds similar to, but necessarily more complicated than, Lagrange's (1776) success rate plus/minus normal quantile times estimated standard deviation. The bounds, as presented here in four theorems, should be teachable, to people ranging from sufficiently advanced high school pupils to university students in mathematics or statistics: For understanding most of the proposed approximation results, it should suffice to know binomial laws, their means and variances, and the standard normal distribution function, but not necessarily the concept of a corresponding normal random variable. Accompanying technical remarks, references, and proofs are meant for assuring teachers or for stimulating further research. Of the proposed approximations, some are essentially well-known at least to experts, and some are based on teaching experience and research at Trier University.
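
The simple interval that the abstract uses as its point of comparison, success rate plus/minus normal quantile times estimated standard deviation, can be sketched as follows. This is only that baseline interval, not the Clopper-Pearson bounds or the refined bounds proposed in the paper; the function name and the `level` parameter are assumptions made for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, level=0.95):
    """Success rate +/- normal quantile * estimated standard deviation."""
    p_hat = successes / n
    # Two-sided normal quantile, e.g. about 1.96 for level = 0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Estimated standard deviation of the success rate.
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width
```

For example, `wald_interval(50, 100)` gives an interval centred at 0.5 with half-width of roughly 0.098. The interval collapses to a single point when `successes` is 0 or `n`, one reason exact bounds such as Clopper-Pearson are of interest.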

2503.01081 2026-02-26 stat.ME

A Dynamic Factor Model for Multivariate Counting Process Data

Fangyi Chen, Hok Kan Ling, Zhiliang Ying

Abstract

We propose a dynamic multiplicative factor model for process data, which arise from complex problem-solving items, an emerging testing mode in large-scale educational assessment. The proposed model can be viewed as an extension of the classical frailty models developed in survival analysis for multivariate recurrent event times, but with two important distinctions: (i) the factor (frailty) is of primary interest; (ii) covariates are internal and embedded in the factor. It allows us to explore low dimensional structure with meaningful interpretation. We show that the proposed model is identifiable and that the maximum likelihood estimators are consistent and asymptotically normal. Furthermore, to obtain a parsimonious model and to improve interpretation of parameters therein, variable selection and estimation for both fixed and random effects are developed through suitable penalisation. The computation is carried out by a stochastic EM combined with the Metropolis algorithm and the coordinate descent algorithm. Simulation studies demonstrate that the proposed approach provides an effective recovery of the true structure. The proposed method is applied to analysing the log-file of an item from the Programme for the International Assessment of Adult Competencies (PIAAC), where meaningful relationships are discovered.

2501.08449 2026-02-26 cs.CR cs.CY cs.DS stat.ME

A Refreshment Stirred, Not Shaken: Invariant-Preserving Deployments of Differential Privacy for the U.S. Decennial Census

James Bailie, Ruobin Gong, Xiao-Li Meng

Comments 65 pages, 2 figures

Journal ref Harvard Data Science Review (2026), Special Issue 6

Abstract

Protecting an individual's privacy when releasing their data is inherently an exercise in relativity, regardless of how privacy is qualified or quantified. This is because we can only limit the gain in information about an individual relative to what could be derived from other sources. This framing is the essence of differential privacy (DP), through which this article examines two statistical disclosure control (SDC) methods for the United States Decennial Census: the Permutation Swapping Algorithm (PSA), which resembles the 2010 Census's disclosure avoidance system (DAS), and the TopDown Algorithm (TDA), which was used in the 2020 DAS. To varying degrees, both methods leave unaltered certain statistics of the confidential data (their invariants) and hence neither can be readily reconciled with DP, at least as originally conceived. Nevertheless, we show how invariants can naturally be integrated into DP and use this to establish that the PSA satisfies pure DP subject to the invariants it necessarily induces, thereby proving that this traditional SDC method can, in fact, be understood from the perspective of DP. By a similar modification to zero-concentrated DP, we also provide a DP specification for the TDA. Finally, as a point of comparison, we consider a counterfactual scenario in which the PSA was adopted for the 2020 Census, resulting in a reduction in the nominal protection loss budget but at the cost of releasing many more invariants. This highlights the pervasive danger of comparing budgets without accounting for the other dimensions on which DP formulations vary (such as the invariants they permit). Therefore, while our results articulate the mathematical guarantees of SDC provided by the PSA, the TDA, and the 2020 DAS in general, care must be taken in translating these guarantees into actual privacy protection, just as is the case for any DP deployment.

2412.10895 2026-02-26 cs.LG stat.ML

Multi-Class and Multi-Task Strategies for Neural Directed Link Prediction

Claudio Moroni, Claudio Borile, Carolina Mattsson, Michele Starnini, André Panisson

Comments 15 pages, 2 figures

Journal ref ECML PKDD 2025

Details
English abstract

Link Prediction is a foundational task in Graph Representation Learning, supporting applications like link recommendation, knowledge graph completion and graph generation. Graph Neural Networks have shown the most promising results in this domain and are currently the de facto standard approach to learning from graph data. However, a key distinction exists between Undirected and Directed Link Prediction: the former just predicts the existence of an edge, while the latter must also account for edge directionality and bidirectionality. This translates to Directed Link Prediction (DLP) having three sub-tasks, each defined by how training, validation and test sets are structured. Most research on DLP overlooks this trichotomy, focusing solely on the "existence" sub-task, where training and test sets are random, uncorrelated samples of positive and negative directed edges. Even in the works that recognize the aforementioned trichotomy, models fail to perform well across all three sub-tasks. In this study, we experimentally demonstrate that training Neural DLP (NDLP) models only on the existence sub-task, using methods adapted from Neural Undirected Link Prediction, results in parameter configurations that fail to capture directionality and bidirectionality, even after rebalancing edge classes. To address this, we propose three strategies that handle the three tasks simultaneously. Our first strategy, the Multi-Class Framework for Neural Directed Link Prediction (MC-NDLP) maps NDLP to a Multi-Class training objective. The second and third approaches adopt a Multi-Task perspective, either with a Multi-Objective (MO-DLP) or a Scalarized (S-DLP) strategy. Our results show that these methods outperform traditional approaches across multiple datasets and models, achieving equivalent or superior performance in addressing the three DLP sub-tasks.

2407.16299 2026-02-26 stat.ME stat.ML

Sparse outlier-robust PCA for multi-source data

Patricia Puchhammer, Ines Wilms, Peter Filzmoser

Details
English abstract

Sparse and outlier-robust Principal Component Analysis (PCA) has been a very active field of research recently. Yet, most existing methods apply PCA to a single dataset whereas multi-source data, i.e. multiple related datasets requiring joint analysis, arise across many scientific areas. We introduce a novel PCA methodology that simultaneously (i) selects important features, (ii) allows for the detection of global sparse patterns across multiple data sources as well as local source-specific patterns, and (iii) is resistant to outliers. To this end, we develop a regularization problem with a penalty that accommodates global-local structured sparsity patterns, and where the ssMRCD estimator is used as plug-in to permit joint outlier-robust analysis across multiple data sources. We provide an efficient implementation of our proposal via the Alternating Direction Method of Multipliers and illustrate its practical advantages in simulation and in applications.

2404.07849 2026-02-26 stat.ML cs.LG

Overparameterized Multiple Linear Regression as Hyper-Curve Fitting

E. Atza, N. Budko

Comments 18 pages, 8 figures, version 2 (IOP style, revised), Python code and data available at: https://github.com/the-iterator/hyper-curve-regression-yarn

Details
English abstract

This work demonstrates that applying a fixed-effect multiple linear regression (MLR) model to an overparameterized dataset is mathematically equivalent to fitting a hyper-curve parameterized by a single scalar. This reformulation shifts the focus from global coefficients to individual predictors, allowing each to be modeled as a function of a common parameter. We prove that this overparameterized linear framework can yield exact predictions even when the underlying data contains nonlinear dependencies that violate classical linear assumptions. By employing parameterization in terms of the dependent variable and a monomial basis, we validate this approach on both synthetic and experimental datasets. Our results show that the hyper-curve perspective provides a robust framework for regularizing problems with noisy predictors and offers a systematic method for identifying and removing 'improper' predictors that degrade model generalizability.

2311.11216 2026-02-26 stat.ME

Reconciling Overt Bias and Hidden Bias in Sensitivity Analysis for Matched Observational Studies

Siyu Heng, Yanxin Shen, Pengyun Wang

Details
English abstract

Matching is one of the most widely used causal inference designs in observational studies, but post-matching confounding bias remains a critical concern. This bias includes overt bias from inexact matching on measured confounders and hidden bias from unmeasured confounders. Researchers routinely apply the famous Rosenbaum-type sensitivity analysis after matching to assess the impact of these biases on causal conclusions. In this work, we show that this approach is often conservative and may overstate sensitivity to confounding bias because the classical solution to the Rosenbaum sensitivity model may allocate hypothetical hidden bias in ways that contradict the overt bias observed in the matched dataset. To address this problem, we propose a new approach to Rosenbaum-type sensitivity analysis by ensuring compatibility between hidden and overt biases. Our approach does not need to add any additional assumptions (beyond mild regularity conditions) to Rosenbaum-type sensitivity analysis, and can produce uniformly more informative sensitivity analysis results than the conventional Rosenbaum-type sensitivity analysis. Computationally, our approach can be solved efficiently via iterative convex programming. Extensive simulations and a real data application demonstrate substantial gains in statistical power of sensitivity analysis. Importantly, our approach can also be applied to many other sensitivity analysis frameworks.

2207.00985 2026-02-26 math.NA cs.DM cs.NA math.ST stat.AP stat.ME stat.TH

Linguistic Approach to Time Series Forecasting

Dmytro Lande, Volodymyr Yuzefovych, Yevheniia Tsybulska

Comments 8 pages, 9 figures

Details
English abstract

This paper proposes methods for predicting dynamic time series (including non-stationary ones) based on a linguistic approach, namely the study of occurrences and repetitions of so-called N-grams. This approach is used in computational linguistics to build statistical translators and to detect plagiarism and duplicate documents. However, the scope of application can be extended beyond linguistics by taking into account the correlations of sequences of stable word combinations, as well as trends. The proposed methods require neither a preliminary study of the characteristics of the time series nor complex tuning of the input parameters of the forecasting model. They make it possible, with a high level of automation, to produce short-term and medium-term forecasts of time series characterized by trends and cyclicality, in particular series of publication dynamics in content monitoring systems. The proposed methods can also be used to predict the values of the parameters of a large complex system in order to monitor its state, when the number of such parameters is large and a high level of automation of the forecasting process is therefore desirable. A significant advantage of the approach is that it does not require time series stationarity and has only a small number of tuning parameters. Further research may focus on various criteria for the similarity of time series fragments, the use of nonlinear similarity criteria, and ways to automatically determine a rational quantization step for the time series.
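
As a rough illustration of the N-gram idea in this abstract (not the authors' algorithm), the sketch below quantizes a series into integer levels, looks up earlier occurrences of the most recent N-gram of levels, and predicts the level that most often followed it. The function name `ngram_forecast`, the quantization step `step`, and the fallback rule are all assumptions made for illustration.

```python
from collections import Counter

def ngram_forecast(series, n=3, step=1.0):
    """Predict the next value of `series` from repeated N-grams of quantized levels.

    Illustrative assumption, not the paper's method: values are rounded to
    multiples of `step`, and the forecast is the level that most often
    followed earlier occurrences of the last `n` levels.
    """
    levels = [round(x / step) for x in series]
    if len(levels) <= n:
        return levels[-1] * step  # too short: repeat the last level
    context = tuple(levels[-n:])
    followers = Counter()
    # Scan every earlier window of length n that has a successor.
    for i in range(len(levels) - n):
        if tuple(levels[i:i + n]) == context:
            followers[levels[i + n]] += 1
    if not followers:
        return levels[-1] * step  # unseen context: repeat the last level
    return followers.most_common(1)[0][0] * step

# A cyclic series: the context (1, 2, 3) was always followed by level 0.
history = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print(ngram_forecast(history, n=3))
```

The only tuning parameters here are the N-gram length and the quantization step, which mirrors the abstract's point about a small number of tuning parameters. How the paper actually handles trends and similarity criteria is not shown.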

2602.21713 2026-02-26 stat.ME

Multi-Parameter Estimation of Prevalence (MPEP): A Bayesian modelling approach to estimate the prevalence of opioid dependence

Andreas Markoulidakis, Matthew Hickman, Nicky J Welton, Loukia Meligkotsidou, Hayley E Jones

Details
English abstract

Estimating the number of people from hidden and/or marginalised populations, such as people dependent on opioids or cocaine, is important to guide policy decisions and provision of harm reduction services. Methods such as capture-recapture are widely used, but rely on assumptions that are often violated and not feasible in specific applications. We describe a Bayesian modelling approach called Multi-Parameter Estimation of Prevalence (MPEP). The MPEP approach leverages routinely collected administrative data, starting from a large baseline cohort of individuals from the population of interest and linked events, to estimate the full size of the target population. When multiple event types are included, the approach enables checking of the consistency of evidence about prevalence from different event types. Additional evidence can be incorporated where inconsistencies are identified. In this article, we summarize the general framework of MPEP, with focus on the most recent version, with improved computational efficiency (implemented in STAN). We also explore several extensions to the model that help us understand the sensitivity of the results to modelling assumptions or identify potential sources of bias. We demonstrate the MPEP approach through a case study estimating the prevalence of opioid dependence in Scotland each year from 2014 to 2022.

2602.21711 2026-02-26 stat.ME math.ST stat.AP stat.CO stat.TH

Adaptive Penalized Doubly Robust Regression for Longitudinal Data

Yuyao Wang, Yu Lu, Tianni Zhang, Mengfei Ran

Details
English abstract

Longitudinal data often involve heterogeneity, sparse signals, and contamination from response outliers or high-leverage observations, especially in biomedical science. Existing methods usually address only part of this problem, either emphasizing penalized mixed effects modeling without robustness or robust mixed effects estimation without high-dimensional variable selection. We propose a doubly adaptive robust regression (DAR-R) framework for longitudinal linear mixed effects models. It combines a robust pilot fit, doubly adaptive observation weights for residual outliers and leverage points, and folded concave penalization for fixed effect selection, together with weighted updates of random effects and variance components. We develop an iterative reweighting algorithm and establish estimation and prediction error bounds, support recovery consistency, and oracle-type asymptotic normality. Simulations show that DAR-R improves estimation accuracy, false-positive control, and covariance estimation under both vertical outliers and bad leverage contamination. In the TADPOLE/ADNI Alzheimer's disease application, DAR-R achieves accurate and stable prediction of ADAS13 while selecting clinically meaningful predictors with strong resampling stability.

2602.21701 2026-02-26 cs.LG physics.data-an stat.ML

Learning Complex Physical Regimes via Coverage-oriented Uncertainty Quantification: An application to the Critical Heat Flux

Michele Cazzola, Alberto Ghione, Lucia Sargentini, Julien Nespoulous, Riccardo Finotello

Comments 34 pages, 14 figures

Details
English abstract

A central challenge in scientific machine learning (ML) is the correct representation of physical systems governed by multi-regime behaviours. In these scenarios, standard data analysis techniques often fail to capture the nature of the data, as the system's response varies significantly across the state space due to its stochasticity and the different physical regimes. Uncertainty quantification (UQ) should thus not be viewed merely as a safety assessment, but as a support to the learning task itself, guiding the model to internalise the behaviour of the data. We address this by focusing on the Critical Heat Flux (CHF) benchmark and dataset presented by the OECD/NEA Expert Group on Reactor Systems Multi-Physics. This case study represents a test for scientific ML due to the non-linear dependence of CHF on the inputs and the existence of distinct microscopic physical regimes. These regimes exhibit diverse statistical profiles, a complexity that requires UQ techniques to internalise the data behaviour and ensure reliable predictions. In this work, we conduct a comparative analysis of UQ methodologies to determine their impact on physical representation. We contrast post-hoc methods, specifically conformal prediction, against end-to-end coverage-oriented pipelines, including (Bayesian) heteroscedastic regression and quality-driven losses. These approaches treat uncertainty not as a final metric, but as an active component of the optimisation process, modelling the prediction and its behaviour simultaneously. We show that while post-hoc methods ensure statistical calibration, coverage-oriented learning effectively reshapes the model's representation to match the complex physical regimes. The result is a model that delivers not only high predictive accuracy but also a physically consistent uncertainty estimation that adapts dynamically to the intrinsic variability of the CHF.
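
For readers unfamiliar with the post-hoc approach this abstract mentions, the following is a minimal sketch of split conformal prediction for regression with absolute-residual scores. It is a generic textbook construction, not the pipeline used in the paper; the function name `split_conformal_halfwidth` and the toy residuals are assumptions.

```python
import math
import numpy as np

def split_conformal_halfwidth(cal_residuals, alpha=0.1):
    """Half-width q such that [pred - q, pred + q] covers a new response with
    probability at least 1 - alpha, assuming exchangeable data and residuals
    computed on a held-out calibration set."""
    scores = np.sort(np.abs(np.asarray(cal_residuals, float)))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return float(scores[rank - 1])

# Toy calibration residuals (y - prediction) from some already fitted model.
residuals = [0.3, -1.2, 0.7, -0.1, 2.0, -0.5, 0.9, 0.2, -0.4]
q = split_conformal_halfwidth(residuals, alpha=0.2)
print(f"interval: prediction +/- {q}")
```

Because the half-width is a single number computed after training, the interval has the same width everywhere. That is one way to see the abstract's contrast with coverage-oriented methods that let uncertainty vary across regimes.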

2602.21663 2026-02-26 stat.ME

Estimation, inference and model selection for jump regression models

Steffen Grønneberg, Gudmund Hermansen, Nils Lid Hjort

Comments 33 pages, 3 figures; Statistical Research Report, Department of Mathematics, University of Oslo, from June 2014, and arXiv'd February 2026. This paper constituted a part of the doctoral dissertations for respectively Gudmund Hermansen and Steffen Grønneberg. An extended and polished version will be written up for journal publication

Details
English abstract

We consider regression models with data of the type $y_i=m(x_i)+\varepsilon_i$, where the $m(x)$ curve is taken locally constant, with unknown levels and jump points. We investigate the large-sample properties of the minimum least squares estimators, finding in particular that jump point parameters and level parameters are estimated with respectively $n$-rate precision and $\sqrt{n}$-rate precision, where $n$ is sample size. Bayes solutions are investigated as well and found to be superior. We then construct jump information criteria, respectively AJIC and BJIC, for selecting the right number of jump points from data. This is done by following the line of arguments that lead to the Akaike and Bayesian information criteria AIC and BIC, but which here lead to different formulae due to the different type of large-sample approximations involved.
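
To make the least squares estimation concrete, here is a minimal sketch for the simplest case of a single jump point: for each candidate split of the sorted data, the two levels are fitted by the segment means, and the split with the smallest residual sum of squares is kept. The helper name `fit_one_jump`, the midpoint convention for the jump location, and the simulated data are assumptions for illustration. The paper treats several jumps and the AJIC/BJIC criteria, which this sketch does not implement.

```python
import numpy as np

def fit_one_jump(x, y):
    """Least squares fit of a piecewise constant curve with exactly one jump.

    Returns (jump_location, left_level, right_level). The jump location is
    reported as the midpoint between the two neighbouring x values, which
    is an illustrative convention.
    """
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    n = len(y)
    csum, csum2 = np.cumsum(y), np.cumsum(y ** 2)
    best = None
    for k in range(1, n):  # first k points on the left, n - k on the right
        left_sum, right_sum = csum[k - 1], csum[-1] - csum[k - 1]
        # Total RSS = sum(y^2) - (left sum)^2 / k - (right sum)^2 / (n - k)
        rss = csum2[-1] - left_sum ** 2 / k - right_sum ** 2 / (n - k)
        if best is None or rss < best[0]:
            best = (rss, k)
    k = best[1]
    return (x[k - 1] + x[k]) / 2, y[:k].mean(), y[k:].mean()

# Simulated data: level 1 before x = 0.4, level 3 after, Gaussian noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.where(x < 0.4, 1.0, 3.0) + rng.normal(0, 0.3, 200)
print(fit_one_jump(x, y))
```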

2602.21579 2026-02-26 stat.ME

Asymptotically Optimal Sequential Confidence Interval for the Gini Index Under Complex Household Survey Design with Sub-Stratification

Shivam, Bhargab Chattopadhyay, Nil Kamal Hazra

Details
English abstract

We examine the optimality properties of the Gini index estimator under complex survey design involving stratification, clustering, and sub-stratification. While Darku et al. (Econometrics, 26, 2020) considered only stratification and clustering and did not provide theoretical guarantees, this study addresses these limitations by proposing two procedures: a purely sequential method and a two-stage method. Under suitable regularity conditions, we establish uniform continuity in probability for the proposed estimator, thereby contributing to the development of random central limit theorems under sequential sampling frameworks. Furthermore, we show that the resulting procedures satisfy both asymptotic first-order efficiency and asymptotic consistency. Simulation results demonstrate that the proposed procedures achieve the desired optimality properties across diverse settings. The practical utility of the methodology is further illustrated through an empirical application using data collected by the National Sample Survey agency of India.

2602.21572 2026-02-26 stat.ML cs.LG stat.ME

Goodness-of-Fit Tests for Latent Class Models with Ordinal Categorical Data

Huan Qing

Comments 50 pages, 4 tables, 3 figures

Details
English abstract

Ordinal categorical data are widely collected in psychology, education, and other social sciences, appearing commonly in questionnaires, assessments, and surveys. Latent class models provide a flexible framework for uncovering unobserved heterogeneity by grouping individuals into homogeneous classes based on their response patterns. A fundamental challenge in applying these models is determining the number of latent classes, which is unknown and must be inferred from data. In this paper, we propose one test statistic for this problem. The test statistic centers the largest singular value of a normalized residual matrix by a simple sample-size adjustment. Under the null hypothesis that the candidate number of latent classes is correct, its upper bound converges to zero in probability. Under an under-fitted alternative, the statistic itself exceeds a fixed positive constant with probability approaching one. This sharp dichotomous behavior of the test statistic yields two sequential testing algorithms that consistently estimate the true number of latent classes. Extensive experimental studies confirm the theoretical findings and demonstrate their accuracy and reliability in determining the number of latent classes.

2602.21569 2026-02-26 math.ST cs.LG stat.ME stat.ML stat.TH

How many asymmetric communities are there in multi-layer directed networks?

Huan Qing

Comments 44 pages, 4 tables, 2 figures

Details
English abstract

Estimating the asymmetric numbers of communities in multi-layer directed networks is a challenging problem due to the multi-layer structures and inherent directional asymmetry, leading to possibly different numbers of sender and receiver communities. This work addresses this issue under the multi-layer stochastic co-block model, a model for multi-layer directed networks with distinct community structures in sending and receiving sides, by proposing a novel goodness-of-fit test. The test statistic relies on the deviation of the largest singular value of an aggregated normalized residual matrix from the constant 2. The test statistic exhibits a sharp dichotomy: Under the null hypothesis of correct model specification, its upper bound converges to zero with high probability; under underfitting, the test statistic itself diverges to infinity. With this property, we develop a sequential testing procedure that searches through candidate pairs of sender and receiver community numbers in a lexicographic order. The process stops at the smallest such pair where the test statistic drops below a decaying threshold. For robustness, we also propose a ratio-based variant algorithm, which detects sharp changes in the sequence of test statistics by comparing consecutive candidates. Both methods are proven to consistently determine the true numbers of sender and receiver communities under the multi-layer stochastic co-block model.

2602.21509 2026-02-26 stat.ML cs.LG stat.AP

Fair Model-based Clustering

Jinwon Park, Kunwoong Kim, Jihu Lee, Yongdai Kim

Comments Accepted by AAAI 2026 (Main Track, Oral presentation)

Details
English abstract

The goal of fair clustering is to find clusters such that the proportion of sensitive attributes (e.g., gender, race, etc.) in each cluster is similar to that of the entire dataset. Various fair clustering algorithms have been proposed that modify standard K-means clustering to satisfy a given fairness constraint. A critical limitation of several existing fair clustering algorithms is that the number of parameters to be learned is proportional to the sample size because the cluster assignment of each datum should be optimized simultaneously with the cluster center, and thus scaling up the algorithms is difficult. In this paper, we propose a new fair clustering algorithm based on a finite mixture model, called Fair Model-based Clustering (FMC). A main advantage of FMC is that the number of learnable parameters is independent of the sample size and thus can be scaled up easily. In particular, mini-batch learning is possible to obtain clusters that are approximately fair. Moreover, FMC can be applied to non-metric data (e.g., categorical data) as long as the likelihood is well-defined. Theoretical and empirical justifications for the superiority of the proposed algorithm are provided.

2602.21490 2026-02-26 stat.ME

Connection Probabilities Estimation in Multi-layer Networks via Iterative Neighborhood Smoothing

Dingzi Guo, Diqing Li, Jingyi Wang, Wen-Xin Zhou

Details
English abstract

Understanding the structural mechanisms of multi-layer networks is essential for analyzing complex systems characterized by multiple interacting layers. This work studies the problem of estimating connection probabilities in multi-layer networks and introduces a new Multi-layer Iterative Connection Probability Estimation (MICE) method. The proposed approach employs an iterative framework that jointly refines inter-layer and intra-layer similarity sets by dynamically updating distance metrics derived from current probability estimates. By leveraging both layer-level and node-level neighborhood information, MICE improves estimation accuracy while preserving computational efficiency. Theoretical analysis establishes the consistency of the estimator and shows that, under mild regularity conditions, the proposed method achieves an optimal convergence rate comparable to that of an oracle estimator. Extensive simulation studies across diverse graphon structures demonstrate the superior performance of MICE relative to existing methods. Empirical evaluations using brain network data from patients with Attention-Deficit/Hyperactivity Disorder (ADHD) and global food and agricultural trade network data further illustrate the robustness and effectiveness of the method in link prediction tasks. Overall, this work provides a theoretically grounded and practically scalable framework for probabilistic modeling and inference in multi-layer network systems.

2602.21478 2026-02-26 stat.ML cs.LG math.ST stat.ME stat.TH

Efficient Inference after Directionally Stable Adaptive Experiments

Zikai Shen, Houssam Zenati, Nathan Kallus, Arthur Gretton, Koulik Khamaru, Aurélien Bibaut

Comments 34 pages

Details
English abstract

We study inference on scalar-valued pathwise differentiable targets after adaptive data collection, such as a bandit algorithm. We introduce a novel target-specific condition, directional stability, which is strictly weaker than previously imposed target-agnostic stability conditions. Under directional stability, we show that estimators that would have been efficient under i.i.d. data remain asymptotically normal and semiparametrically efficient when computed from adaptively collected trajectories. The canonical gradient has a martingale form, and directional stability guarantees stabilization of its predictable quadratic variation, enabling high-dimensional asymptotic normality. We characterize efficiency using a convolution theorem for the adaptive-data setting, and give a condition under which the one-step estimator attains the efficiency bound. We verify directional stability for LinUCB, yielding the first semiparametric efficiency guarantee for a regular scalar target under LinUCB sampling.

2602.21465 2026-02-26 math.ST math.PR stat.TH

Exponential Concentration Inequalities For Independent Random Vectors Under Sublinear Expectations

Nahom Seyoum

Details
English abstract

Li and Hu recently established variance-type O(1/n) bounds for the sample mean of independent random vectors under sublinear expectations. We extend their results to the exponential concentration regime. For bounded, independent R^d-valued random vectors under a regular sublinear expectation, we prove: (i) a general concentration principle that reduces vector-valued tail bounds to scalar martingale inequalities via a three-layer architecture; (ii) an Azuma-Hoeffding inequality showing that the distance from the sample mean to the Minkowski average of the expectation sets has sub-Gaussian tails; (iii) a Bernstein inequality incorporating the variance parameter of Li and Hu, interpolating between sub-Gaussian and sub-exponential regimes; (iv) a dimension-free bound replacing the exponential covering prefactor with a polynomial one via the matrix Freedman inequality; and (v) an explicit construction demonstrating that the sub-Gaussian rate is optimal. To the best of our knowledge, these constitute the first exponential concentration inequalities for the multivariate sample mean under sublinear expectations in terms of the set-valued distance to the Minkowski average.

2602.21462 2026-02-26 cs.LG q-bio.GN stat.ML

Effects of Training Data Quality on Classifier Performance

Alan F. Karr, Regina Ruane

Details
English abstract

We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into "contigs," we examine the effects of degrading the quality of the training data by multiple mechanisms, and for four classifiers -- Bayes classifiers, neural nets, partition models and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers, as degradation increases and they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.

2602.21446 2026-02-26 stat.ML cs.LG

ConformalHDC: Uncertainty-Aware Hyperdimensional Computing with Application to Neural Decoding

Ziyi Liang, Hamed Poursiami, Zhishun Yang, Keiland Cooper, Akhilesh Jaiswal, Maryam Parsa, Norbert Fortin, Babak Shahbaba

Details
English abstract

Hyperdimensional Computing (HDC) offers a computationally efficient paradigm for neuromorphic learning. Yet, it lacks rigorous uncertainty quantification, leading to open decision boundaries and, consequently, vulnerability to outliers, adversarial perturbations, and out-of-distribution inputs. To address these limitations, we introduce ConformalHDC, a unified framework that combines the statistical guarantees of conformal prediction with the computational efficiency of HDC. For this framework, we propose two complementary variations. First, the set-valued formulation provides finite-sample, distribution-free coverage guarantees. Using carefully designed conformity scores, it forms enclosed decision boundaries that improve robustness to non-conforming inputs. Second, the point-valued formulation leverages the same conformity scores to produce a single prediction when desired, potentially improving accuracy over traditional HDC by accounting for class interactions. We demonstrate the broad applicability of the proposed framework through evaluations on multiple real-world datasets. In particular, we apply our method to the challenging problem of decoding non-spatial stimulus information from the spiking activity of hippocampal neurons recorded as subjects performed a sequence memory task. Our results show that ConformalHDC not only accurately decodes the stimulus information represented in the neural activity data, but also provides rigorous uncertainty estimates and correctly abstains when presented with data from other behavioral states. Overall, these capabilities position the framework as a reliable, uncertainty-aware foundation for neuromorphic computing.

2602.21436 2026-02-26 stat.ML cs.GT cs.LG

Efficient Uncoupled Learning Dynamics with $\tilde{O}\!\left(T^{-1/4}\right)$ Last-Iterate Convergence in Bilinear Saddle-Point Problems over Convex Sets under Bandit Feedback

Arnab Maiti, Claire Jie Zhang, Kevin Jamieson, Jamie Heather Morgenstern, Ioannis Panageas, Lillian J. Ratliff

Comments 19 pages, Accepted at AISTATS 2026

Details
English abstract

In this paper, we study last-iterate convergence of learning algorithms in bilinear saddle-point problems, a preferable notion of convergence that captures the day-to-day behavior of learning dynamics. We focus on the challenging setting where players select actions from compact convex sets and receive only bandit feedback. Our main contribution is the design of an uncoupled learning algorithm that guarantees last-iterate convergence to the Nash equilibrium with high probability. We establish a convergence rate of $\tilde{O}(T^{-1/4})$ up to polynomial factors in problem parameters. Crucially, our proposed algorithm is computationally efficient, requiring only an efficient linear optimization oracle over the players' compact action sets. The algorithm is obtained by combining techniques from experimental design and the classic Follow-The-Regularized-Leader (FTRL) framework, with a carefully chosen regularizer function tailored to the geometry of the action set of each learner.

2602.21423 2026-02-26 math.ST stat.ME stat.TH

Causal Inference with High-Dimensional Treatments

Patrick Kramer, Edward H. Kennedy, Isaac M. Opper

Details
English abstract

In this work, we consider causal inference in various high-dimensional treatment settings, including for single multi-valued treatments and vector treatments with binary or continuous components, when the number of treatments can be comparable to or even larger than the number of observations. These settings bring unique challenges: first, the treatment effects of interest are a high-dimensional vector rather than a low-dimensional scalar; second, positivity violations are often unavoidable; and third, estimation can be based on a smaller effective sample size. We first discuss fundamental limits of estimating effects here, showing that consistent estimation is impossible without further assumptions. We go on to propose a novel sparse pseudo-outcome regression framework for arbitrary high-dimensional statistical functionals, which includes generic constrained regression estimators and error guarantees. We use the framework to derive new doubly robust estimators for mean potential outcomes of high-dimensional treatments, though it can also be applied to other scenarios. We analyze the proposed estimators under exact and approximate sparsity assumptions, giving finite-sample risk bounds. Finally, we derive minimax lower bounds to characterize optimal rates of convergence and show our risk bounds are unimprovable.

2602.21410 2026-02-26 stat.ME

Identifying the potential of sample overlap in evidence synthesis of observational studies

Zhentian Zhang, Tim Friede, Tim Mathes

Comments 36 pages, 17 figures

Details
English abstract

Sample overlap is a common issue in evidence synthesis in the field of medical research, particularly when integrating findings from observational studies utilizing existing databases such as registries. Due to the general inaccessibility of unique identifiers for each observation, addressing sample overlap has been a complex problem, potentially biasing evidence synthesis outcomes and undermining their credibility. We developed a method to construct indicators for the degree of sample overlap in evidence synthesis of studies based on existing data. Our method, which is rooted in set theory and based on the coding of the ranges of several well-selected sample characteristics, offers a practical solution by basing inference on sample characteristics rather than on individual participant data. Useful information, such as the overlap-free sample set with the largest sample size in an evidence synthesis, can be derived from this method. We applied our model to several real-world evidence syntheses, demonstrating its effectiveness and flexibility. Our findings highlight the growing importance of addressing sample overlap in evidence synthesis, especially with the increasing relevance of secondary use of data, an area currently under-explored in research.

2602.21408 2026-02-26 cs.LG stat.AP stat.CO stat.ME stat.ML

Generative Bayesian Computation as a Scalable Alternative to Gaussian Process Surrogates

Nick Polson, Vadim Sokolov

Abstract

Gaussian process (GP) surrogates are the default tool for emulating expensive computer experiments, but cubic cost, stationarity assumptions, and Gaussian predictive distributions limit their reach. We propose Generative Bayesian Computation (GBC) via Implicit Quantile Networks (IQNs) as a surrogate framework that targets all three limitations. GBC learns the full conditional quantile function from input-output pairs; at test time, a single forward pass per quantile level produces draws from the predictive distribution. Across fourteen benchmarks we compare GBC to four GP-based methods. GBC improves CRPS by 11-26% on piecewise jump-process benchmarks, by 14% on a ten-dimensional Friedman function, and scales linearly to 90,000 training points where dense-covariance GPs are infeasible. A boundary-augmented variant matches or outperforms Modular Jump GPs on two-dimensional jump datasets (up to 46% CRPS improvement). In active learning, a randomized-prior IQN ensemble achieves nearly three times lower RMSE than deep GP active learning on Rocket LGBB. Overall, GBC records a favorable point estimate in 12 of 14 comparisons. GPs retain an edge on smooth surfaces where their smoothness prior provides effective regularization.
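
The abstract reports results in terms of CRPS (continuous ranked probability score) computed on predictive draws. As a reading aid, here is a minimal sketch of the standard sample-based CRPS estimate, E|X - y| - 0.5 E|X - X'|. The function name `crps_from_samples` is hypothetical, and this is not taken from the authors' code; it only illustrates how a set of predictive draws can be scored against one observation.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation y.

    Uses the identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|,
    with X, X' independent draws from the predictive distribution F.
    Lower values are better; 0 means a perfect point mass at y.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one predictive draw")
    # Average distance between the draws and the observed value.
    term_obs = sum(abs(x - y) for x in samples) / n
    # Average pairwise distance between draws (spread of the forecast).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread
```

The double sum costs O(n^2) per observation, which is acceptable for a few hundred draws; a sorted-sample formula would be preferable for large n.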

2602.21403 2026-02-26 stat.ME cs.CE eess.SP stat.CO

An index of effective number of variables for uncertainty and reliability analysis in model selection problems

Luca Martino, Eduardo Morgado, Roberto San Millán-Castillo

Journal ref Signal Processing, Volume 227, Pages 1-9, 2025. Num. 109735

Abstract

An index of an effective number of variables (ENV) is introduced for model selection in nested models. This is the case, for instance, when we have to decide the order of a polynomial function or the number of bases in a nonlinear regression, choose the number of clusters in a clustering problem, or the number of features in a variable selection application (to name a few examples). It is inspired by the idea of the maximum area under the curve (AUC). The interpretation of the ENV index is identical to that of the effective sample size (ESS) indices concerning a set of samples. The ENV index addresses drawbacks of the elbow detectors described in the literature and introduces different confidence measures of the proposed solution. These novel measures can also be employed jointly with different information criteria, such as the well-known AIC and BIC, or any other model selection procedures. Comparisons with classical and recent schemes are provided in different experiments involving real datasets. Related Matlab code is given.

2602.21390 2026-02-26 cs.LG stat.ML

Defensive Generation

Gabriele Farina, Juan Carlos Perdomo

Abstract

We study the problem of efficiently producing, in an online fashion, generative models of scalar, multiclass, and vector-valued outcomes that cannot be falsified on the basis of the observed data and a pre-specified collection of computational tests. Our contributions are twofold. First, we expand on connections between online high-dimensional multicalibration with respect to an RKHS and recent advances in expected variational inequality problems, enabling efficient algorithms for the former. We then apply this algorithmic machinery to the problem of outcome indistinguishability. Our procedure, Defensive Generation, is the first to efficiently produce online outcome indistinguishable generative models of non-Bernoulli outcomes that are unfalsifiable with respect to infinite classes of tests, including those that examine higher-order moments of the generated distributions. Furthermore, our method runs in near-linear time in the number of samples and achieves the optimal, vanishing T^{-1/2} rate for generation error.

2602.21383 2026-02-26 stat.ME

Evaluating time-varying treatment effects in hybrid SMART-MRT designs

Mengbing Li, Inbal Nahum-Shani, Walter Dempsey

Abstract

Recently a new experimental approach, the hybrid experimental design (HED), was introduced to enable investigators to answer scientific questions about building behavioral interventions in which human-delivered and digital components are integrated and adapted on multiple timescales: slow (e.g., every few weeks) and fast (e.g., every few hours), respectively. An increasingly common HED involves the integration of the sequential, multiple assignment, randomized trial (SMART) with the micro-randomized trial (MRT), allowing investigators to answer scientific questions about potential synergistic effects of digital and human-delivered interventions. Approaches to formalize these questions in terms of causal estimands and associated data analytic methods are limited. In this paper, we formally define and assess these synergistic effects in hybrid SMART-MRTs on both proximal and distal outcomes. Practical utility is shown through the analysis of M-Bridge, a hybrid SMART-MRT aimed at reducing binge drinking among first-year college students.

2602.21370 2026-02-26 stat.AP

Evaluation of Minimal Residual Disease as a Surrogate for Progression-Free Survival in Hematology Oncology Trials: A Meta-Analytic Review

Jane She, Xiaofei Chen, Malini Iyengar, Judy Li

Abstract

Traditional health authority approval for oncology drugs is based on a clinical benefit endpoint, or a valid surrogate. In 1992 the FDA created the Accelerated Approval pathway to allow for earlier approval of therapies in serious conditions with an unmet medical need. This is accomplished typically by granting accelerated approval based on a surrogate endpoint that can be measured earlier than a traditional approval endpoint. Minimal residual disease (MRD) is a sensitive measure of residual cancer cells in hematology oncology after treatment, and is increasingly considered as a secondary or exploratory endpoint due to its prognostic potential for traditional clinical trial endpoints such as progression-free survival (PFS) and overall survival (OS). This work aims to evaluate MRD's surrogacy potential across several hematologic cancer indications while keeping the focus on follicular lymphoma (FL), using data from published studies. We examine individual-level and trial-level correlations extracted from previously published studies to elucidate the potential role of MRD in accelerating the drug approval process in hematology oncology trials.

2602.21368 2026-02-26 cs.LG cs.AI cs.CL stat.ML

Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration

Charafeddine Mouzouni

Comments 41 pages, 11 figures, 10 tables, including appendices

Abstract

Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors -- made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.
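
The "within 1/(n+1) of the target level" statement matches the textbook guarantee of split conformal prediction, which the following minimal sketch illustrates. It is the generic calibration step under exchangeability, not the paper's full procedure; the function name `conformal_threshold` and the use of generic nonconformity scores are assumptions made for illustration.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Generic split-conformal threshold for nonconformity scores.

    With n exchangeable calibration scores and no ties, a new score
    falls at or below the returned threshold with probability in
    [1 - alpha, 1 - alpha + 1/(n + 1)].
    """
    n = len(calibration_scores)
    # Rank of the order statistic used as the threshold.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (infinite) threshold achieves the requested coverage.
        return math.inf
    return sorted(calibration_scores)[k - 1]
```

For example, with 7 calibration scores and alpha = 0.25 the threshold is the 6th smallest score.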

2602.21359 2026-02-26 math.ST stat.ME stat.TH

Some Asymptotic Results on Multiple Testing under Weak Dependence

Swarnadeep Datta, Monitirtha Dey

Abstract

This paper studies the means-testing problem under weakly correlated Normal setups. Although quite common in genomic applications, test procedures having exact FWER control under such dependence structures are nonexistent. We explore the asymptotic behaviors of the classical Bonferroni (when adjusted suitably) and the Sidak procedure; and show that both of these control FWER at the desired level exactly as the number of hypotheses approaches infinity. We derive analogous limiting results on the generalized family-wise error rate and power. Simulation studies depict the asymptotic exactness of the procedures empirically.
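
For orientation, the two classical procedures named in the abstract use the per-hypothesis levels sketched below. This covers only the textbook independent case; the "suitably adjusted" Bonferroni variant and the weak-dependence asymptotics studied in the paper are not specified in the abstract and are not reproduced here. All function names are hypothetical.

```python
def bonferroni_threshold(alpha, n):
    """Per-test level alpha / n (FWER <= alpha under any dependence)."""
    return alpha / n

def sidak_threshold(alpha, n):
    """Per-test level 1 - (1 - alpha)**(1/n); FWER equals alpha exactly
    when the n tests are independent and all nulls are true."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)

def fwer_independent(per_test_level, n):
    """FWER for n independent tests, all nulls true, each run at per_test_level."""
    return 1.0 - (1.0 - per_test_level) ** n
```

Under independence, the Šidák level is slightly larger than the Bonferroni level, so Bonferroni is mildly conservative while Šidák attains the target FWER exactly.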

2602.21357 2026-02-26 stat.ML cs.LG

Conditional neural control variates for variance reduction in Bayesian inverse problems

Ali Siahkoohi, Hyunwoo Oh

Abstract

Bayesian inference for inverse problems involves computing expectations under posterior distributions -- e.g., posterior means, variances, or predictive quantities -- typically via Monte Carlo (MC) estimation. When the quantity of interest varies significantly under the posterior, accurate estimates demand many samples -- a cost often prohibitive for partial differential equation-constrained problems. To address this challenge, we introduce conditional neural control variates, a modular method that learns amortized control variates from joint model-data samples to reduce the variance of MC estimators. To scale to high-dimensional problems, we leverage Stein's identity to design an architecture based on an ensemble of hierarchical coupling layers with tractable Jacobian trace computation. Training requires: (i) samples from the joint distribution of unknown parameters and observed data; and (ii) the posterior score function, which can be computed from physics-based likelihood evaluations, neural operator surrogates, or learned generative models such as conditional normalizing flows. Once trained, the control variates generalize across observations without retraining. We validate our approach on stylized and partial differential equation-constrained Darcy flow inverse problems, demonstrating substantial variance reduction, even when the analytical score is replaced by a learned surrogate.
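
The variance-reduction idea behind control variates can be seen in its classical scalar form, sketched below. This is not the paper's method, which learns amortized neural control variates via Stein's identity; here the control variate values `h_vals` are simply assumed to come from a function with known mean zero, and the function name is hypothetical.

```python
def control_variate_estimate(f_vals, h_vals):
    """Monte Carlo estimate of E[f] using a control variate h with E[h] = 0.

    Returns (estimate, c), where c = Cov(f, h) / Var(h) is the
    variance-minimizing coefficient estimated from the same samples
    (which introduces a small bias that vanishes as n grows).
    """
    n = len(f_vals)
    if n != len(h_vals) or n < 2:
        raise ValueError("need two equally long samples of size >= 2")
    f_bar = sum(f_vals) / n
    h_bar = sum(h_vals) / n
    cov = sum((f - f_bar) * (h - h_bar) for f, h in zip(f_vals, h_vals)) / (n - 1)
    var = sum((h - h_bar) ** 2 for h in h_vals) / (n - 1)
    c = cov / var if var > 0 else 0.0
    # Subtracting c * h leaves the mean unchanged (E[h] = 0) but removes
    # the part of f's variability that is explained by h.
    adjusted = [f - c * h for f, h in zip(f_vals, h_vals)]
    return sum(adjusted) / n, c
```

When f and h are perfectly correlated, the adjusted samples are constant and the estimator has zero variance; in practice the gain depends on how well h tracks f.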

2602.21356 2026-02-26 stat.CO

Adaptive Importance Tempering: A flexible approach to improve computational efficiency of Metropolis Coupled Markov Chain Monte Carlo algorithms on binary spaces

Alexander Valencia-Sanchez, Jeffrey S. Rosenthal, Yasuhiro Watanabe, Hirotaka Tamura, Ali Sheikholeslami

Comments 25 pages, 8 figures

Abstract

Building on the Informed Importance Tempering (IIT) algorithm proposed by Li et al. (2023), we propose an algorithm that uses an adaptive bounded balancing function. We explain why implementing parallel tempering where each replica uses a rejection-free MCMC algorithm can be inefficient in high-dimensional spaces and show how the proposed adaptive algorithm can overcome these computational inefficiencies. We present two equivalent versions of the adaptive algorithm (A-IIT and SS-IIT) and establish that both have the same limiting distribution, making either suitable for use within a parallel tempering framework. To evaluate performance, we benchmark the adaptive algorithm against several MCMC methods: IIT, rejection-free Metropolis-Hastings (RF-MH), and RF-MH using a multiplicity list. Simulation results demonstrate that Adaptive IIT identifies high-probability states more efficiently than these competing algorithms in high-dimensional binary spaces with multiple modes.

2602.21342 2026-02-26 cs.LG stat.ML

Archetypal Graph Generative Models: Explainable and Identifiable Communities via Anchor-Dominant Convex Hulls

Nikolaos Nakis, Chrysoula Kosma, Panagiotis Promponas, Michail Chatzianastasis, Giannis Nikolentzos

Comments Accepted to AISTATS26 (Spotlight)

Abstract

Representation learning has been essential for graph machine learning tasks such as link prediction, community detection, and network visualization. Despite recent advances in achieving high performance on these downstream tasks, little progress has been made toward self-explainable models. Understanding the patterns behind predictions is equally important, motivating recent interest in explainable machine learning. In this paper, we present GraphHull, an explainable generative model that represents networks using two levels of convex hulls. At the global level, the vertices of a convex hull are treated as archetypes, each corresponding to a pure community in the network. At the local level, each community is refined by a prototypical hull whose vertices act as representative profiles, capturing community-specific variation. This two-level construction yields clear multi-scale explanations: a node's position relative to global archetypes and its local prototypes directly accounts for its edges. The geometry is well-behaved by design, while local hulls are kept disjoint by construction. To further encourage diversity and stability, we place principled priors, including determinantal point processes, and fit the model under MAP estimation with scalable subsampling. Experiments on real networks demonstrate the ability of GraphHull to recover multi-level community structure and to achieve competitive or superior performance in link prediction and community detection, while naturally providing interpretable predictions.

2602.21314 2026-02-26 stat.ME

Discussion of "Matrix Completion When Missing Is Not at Random and Its Applications in Causal Panel Data Models"

Eli Ben-Michael, Avi Feller

Comments Invited discussion of Choi and Yuan "Matrix Completion When Missing Is Not at Random and Its Applications in Causal Panel Data Models" at JSM 2025

Abstract

Choi and Yuan (2025) propose a novel approach to applying matrix completion to the problem of estimating causal effects in panel data. The key insight is that even in the presence of structured patterns of missing data -- i.e. selection into treatment -- matrix completion can be effective if the number of treated observations is small relative to the number of control observations. We applaud the authors for their insightful and interesting paper. We discuss this proposal from two complementary perspectives. First, we situate their proposal as an example of a "split-apply-combine" strategy that underlies many modern panel data estimators, including difference-in-differences and synthetic control approaches. Second, we discuss the issue of the statistical "last mile problem" -- the gap between theory and practice -- and offer suggestions on how to partially address it. We conclude by considering the challenges of estimating the impacts of public policies using panel data and apply the approach to a study on the effect of right to carry laws on violent crime.

2602.21276 2026-02-26 cs.LG stat.ML

Neural network optimization strategies and the topography of the loss landscape

Jianneng Yu, Alexandre V. Morozov

Comments 12 pages in the main text + 5 pages in the supplement. 6 figures + 1 table in the main text, 4 figures and 1 table in the supplement

Abstract

Neural networks are trained by optimizing multi-dimensional sets of fitting parameters on non-convex loss landscapes. Low-loss regions of the landscapes correspond to the parameter sets that perform well on the training data. A key issue in machine learning is the performance of trained neural networks on previously unseen test data. Here, we investigate neural network training by stochastic gradient descent (SGD) - a non-convex global optimization algorithm which relies only on the gradient of the objective function. We contrast SGD solutions with those obtained via a non-stochastic quasi-Newton method, which utilizes curvature information to determine step direction and Golden Section Search to choose step size. We use several computational tools to investigate neural network parameters obtained by these two optimization methods, including kernel Principal Component Analysis and a novel, general-purpose algorithm for finding low-height paths between pairs of points on loss or energy landscapes, FourierPathFinder. We find that the choice of the optimizer profoundly affects the nature of the resulting solutions. SGD solutions tend to be separated by lower barriers than quasi-Newton solutions, even if both sets of solutions are regularized by early stopping to ensure adequate performance on test data. When allowed to fit extensively on the training data, quasi-Newton solutions occupy deeper minima on the loss landscapes that are not reached by SGD. However, these solutions generalize less well to the test data. Overall, SGD explores smooth basins of attraction, while quasi-Newton optimization is capable of finding deeper, more isolated minima that are more spread out in the parameter space. Our findings help understand both the topography of the loss landscapes and the fundamental role of landscape exploration strategies in creating robust, transferable neural network models.

2602.21272 2026-02-26 stat.ML cs.LG stat.CO

Counterdiabatic Hamiltonian Monte Carlo

Reuben Cohn-Gordon, Uroš Seljak, Dries Sels

Abstract

Hamiltonian Monte Carlo (HMC) is a state-of-the-art method for sampling from distributions with differentiable densities, but can converge slowly when applied to challenging multimodal problems. Running HMC with a time-varying Hamiltonian, in order to interpolate from an initial tractable distribution to the target of interest, can address this problem. In conjunction with a weighting scheme to eliminate bias, this can be viewed as a special case of Sequential Monte Carlo (SMC) sampling \cite{doucet2001introduction}. However, this approach can be inefficient, since it requires slow change between the initial and final distribution. Inspired by \cite{sels2017minimizing}, where a learned counterdiabatic term added to the Hamiltonian allows for efficient quantum state preparation, we propose Counterdiabatic Hamiltonian Monte Carlo (CHMC), which can be viewed as an SMC sampler with a more efficient kernel. We establish its relationship to recent proposals for accelerating gradient-based sampling with learned drift terms, and demonstrate it on simple benchmark problems.

2602.21269 2026-02-26 cs.LG cs.AI stat.ML

Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space

Wang Zixian

Abstract

We present Group Orthogonalized Policy Optimization (GOPO), a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space L2(pi_k) of square-integrable functions with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition <v, 1> = 0, defining a codimension-one subspace H0. Minimizing distance to an unconstrained target u_star yields the work-dissipation functional J(v) = <g, v> - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem. Enforcing the boundary v >= -1 produces a bounded Hilbert projection that induces exact sparsity, assigning zero probability to catastrophically poor actions through a closed-form threshold. To connect this functional theory with practice, GOPO projects from infinite-dimensional L2(pi_k) to a finite empirical subspace induced by group sampling. Because group-normalized advantages sum to zero, the Lagrange multiplier enforcing probability conservation vanishes exactly, reducing the constrained projection to an unconstrained empirical loss. The resulting objective has constant Hessian curvature mu I, non-saturating linear gradients, and an intrinsic dead-zone mechanism without heuristic clipping. Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.
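
The step from the functional J(v) to its maximizer can be spelled out by completing the square. The sketch below is a reading of the formulas quoted in the abstract, not the paper's own derivation. It assumes the inner product is that of L2(pi_k) with pi_k a probability measure, so that the constant function 1 has unit norm, and it ignores the boundary constraint v >= -1.

```latex
\begin{aligned}
J(v) &= \langle g, v\rangle - \tfrac{\mu}{2}\lVert v\rVert^{2}
      = \tfrac{1}{2\mu}\lVert g\rVert^{2}
        - \tfrac{\mu}{2}\Bigl\lVert v - \tfrac{g}{\mu}\Bigr\rVert^{2}, \\
v^{\star} &= \arg\max_{v \in H_0} J(v)
           = P_{H_0}\!\Bigl(\tfrac{g}{\mu}\Bigr)
           = \tfrac{1}{\mu}\bigl(g - \langle g, 1\rangle\, 1\bigr),
\qquad H_0 = \{\, v : \langle v, 1\rangle = 0 \,\}.
\end{aligned}
```

Maximizing J over H0 is therefore the same as finding the point of H0 closest to g/mu, which is the orthogonal projection. When g already satisfies <g, 1> = 0, as with advantages that sum to zero, the correction term drops out and the maximizer is simply g/mu, consistent with the abstract's remark that the Lagrange multiplier vanishes.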

2602.19473 2026-02-26 stat.ME math.ST stat.ML stat.TH

The generalized underlap coefficient with an application in clustering

Zhaoxi Zhang, Vanda Inacio, Sara Wade

Abstract

Quantifying distributional separation across groups is fundamental in statistical learning and scientific discovery, yet most classical discrepancy measures are tailored to two-group comparisons. We generalize the underlap coefficient (UNL), a multi-group separation measure, to multivariate variables. We establish key properties of the UNL and provide an explicit connection to total variation. We further interpret the UNL as a dependence measure between a group label and variables of interest and compare it with mutual information. We propose an efficient importance sampling estimator of the UNL that can be combined with flexible density estimators. The utility of the UNL for assessing partition-covariate dependence in clustering is highlighted in detail, where it is particularly useful for evaluating whether the latent group structure can be explained by specific covariates. Finally, we illustrate the application of the UNL in clustering using two real-world datasets.

2511.01734 2026-02-26 stat.ML cs.AI cs.CL cs.LG

A Proof of Learning Rate Transfer under $μ$P

Soufiane Hayou

Comments 21 pages

Abstract

We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $μ$P, a neural network parameterization designed to "maximize" feature learning in the infinite-width limit. We show that under $μ$P, the optimal learning rate converges to a non-zero constant as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.

2510.21686 2026-02-26 stat.ML cs.LG

Multimodal Datasets with Controllable Mutual Information

Raheem Karim Hashmani, Garrett W. Merz, Helen Qu, Mariel Pettee, Kyle Cranmer

Comments 16 pages, 7 figures, 2 tables. Our code is publicly available at https://github.com/RKHashmani/MmMi-Datasets. Datasets generated based on Figure 1 can be found at https://huggingface.co/datasets/RKHashmani/mmmi-dag1-2modalities-cifar10

Abstract

We introduce a framework for generating highly multimodal datasets with explicitly calculable mutual information (MI) between modalities. This enables the construction of benchmark datasets that provide a novel testbed for systematic studies of mutual information estimators and multimodal self-supervised learning (SSL) techniques. Our framework constructs realistic datasets with known MI using a flow-based generative model and a structured causal framework for generating correlated latent variables. We benchmark a suite of MI estimators on datasets with varying ground truth MI values and verify that regression performance improves as the MI increases between input modalities and the target value. Finally, we describe how our framework can be applied to contexts including multi-detector astrophysics and SSL studies in the highly multimodal regime.

2510.11789 2026-02-26 stat.ML cs.LG math.PR math.ST stat.TH

Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi

Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2β}{2β+1}}$, where $M$ is the sample size and $β$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided that $rd \le (M/\log M)^{\frac{1}{2β+1}}$. These results highlight a fundamental statistical efficiency of attention-style models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of attention mechanisms and guidance on training.

2509.20831 2026-02-26 stat.ME stat.AP

Modi linear failure rate distribution with application to survival time data

Lazhar Benkhelifa

Journal ref Modern Journal of Statistics 2026

Abstract

A new lifetime model, named the Modi linear failure rate distribution, is suggested. This flexible model is capable of accommodating a wide range of hazard rate shapes, including decreasing, increasing, bathtub, upside-down bathtub, and modified bathtub forms, making it particularly suitable for modeling diverse survival and reliability data. Our proposed model contains the Modi exponential distribution and the Modi Rayleigh distribution as sub-models. Numerous mathematical and reliability properties are derived, including the $r^{th}$ moment, moment generating function, $r^{th}$ conditional moment, quantile function, order statistics, mean deviations, Rényi entropy, and reliability function. The method of maximum likelihood is employed to estimate the model parameters. Monte Carlo simulations are presented to examine how these estimators perform. The superior fit of our newly introduced model is demonstrated on two real-world survival data sets.

2508.04957 2026-02-26 stat.ME

Goodness-of-fit test for multi-layer stochastic block models

Huan Qing

Comments 52 pages, 5 tables, 3 figures

Abstract

Community detection in multi-layer networks is a fundamental task in complex network analysis across various areas like social, biological, and computer sciences. However, most existing algorithms assume that the number of communities is known in advance, which is usually impractical for real-world multi-layer networks. To address this limitation, we develop a novel goodness-of-fit test for the popular multi-layer stochastic block model based on a normalized aggregation of layer-wise adjacency matrices. Under the null hypothesis that a candidate community count is correct, we establish the asymptotic normality of the test statistic using recent advances in random matrix theory; conversely, we prove its divergence when the model is underfitted. These dual theoretical foundations enable two computationally efficient sequential testing algorithms to consistently determine the number of communities without prior knowledge. Numerical experiments on simulated and real-world multi-layer networks demonstrate the accuracy and efficiency of our approaches in estimating the number of communities.

2507.14206 2026-02-26 eess.SP cs.AI cs.LG stat.ML

A Comprehensive Benchmark for Electrocardiogram Time-Series

Zhijiang Tang, Jiaxin Qi, Yuhua Zheng, Jianqiang Huang

Comments ACM MM 2025

Journal ref Proceedings of the 33rd ACM International Conference on Multimedia. 2025

Abstract

Electrocardiogram (ECG), a key bioelectrical time-series signal, is crucial for assessing cardiac health and diagnosing various diseases. Given its time-series format, ECG data is often incorporated into pre-training datasets for large-scale time-series model training. However, existing studies often overlook its unique characteristics and specialized downstream applications, which differ significantly from other time-series data, leading to an incomplete understanding of its properties. In this paper, we present an in-depth investigation of ECG signals and establish a comprehensive benchmark, which includes (1) categorizing its downstream applications into four distinct evaluation tasks; (2) identifying limitations in traditional evaluation metrics for ECG analysis, and introducing a novel metric; and (3) benchmarking state-of-the-art time-series models and proposing a new architecture. Extensive experiments demonstrate that our proposed benchmark is comprehensive and robust. The results validate the effectiveness of the proposed metric and model architecture, which establish a solid foundation for advancing research in ECG signal analysis.

2506.13630 2026-02-26 stat.AP cs.HC

The Hammock Plot: Where Categorical and Numerical Data Relax Together

Matthias Schonlau, Tiancheng Yang

Comments 21 pages, 10 figures, 1 table. Submitted to the Stata Journal

Abstract

Effective methods for visualizing data involving multiple variables, including categorical ones, are limited. The hammock plot (Schonlau 2003) visualizes both categorical and numerical variables using parallel coordinates. We introduce the Stata implementation hammock. We give numerous examples that explore highlighting, missing values, putting axes on the same scale, and tracing an observation across variables. Further, we discuss parallel univariate plots as an edge case of hammock plots. We also present and make publicly available a new dataset on the 2020 Tour de France.

2504.19138 2026-02-26 math.ST cs.NA math.NA stat.CO stat.TH

Quasi-Monte Carlo confidence intervals using quantiles of randomized nets

Zexin Pan

Abstract

Recent advances in quasi-Monte Carlo integration have shown that for linearly scrambled digital net estimators, the convergence rate can be dramatically improved by taking the median rather than the mean of multiple independent replicates. In this work, we demonstrate that the quantiles of such estimators can be used to construct confidence intervals with asymptotically valid coverage for high-dimensional integrals. By analyzing the error distribution for a class of infinitely differentiable integrands, we prove that as the sample size increases, the integration error decomposes into an asymptotically symmetric component and a vanishing remainder. Consequently, the asymptotic error distribution is symmetric about zero, ensuring that a quantile-based interval constructed from independent replicates captures the true integral with probability converging to a nominal level determined by the binomial distribution.
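
The "nominal level determined by the binomial distribution" can be made concrete with the textbook order-statistic argument, sketched below. It assumes the r replicates are i.i.d. with a continuous distribution whose median equals the true integral, which is the idealized limit the abstract describes; the paper's exact interval construction is not given in the abstract, and the function names are hypothetical.

```python
from math import comb

def order_statistic_coverage(r, l, u):
    """P(X_(l) <= mu <= X_(u)) for r i.i.d. continuous replicates with median mu.

    The number of replicates below mu is Binomial(r, 1/2), and the interval
    covers mu exactly when that count lies in {l, ..., u - 1}.
    """
    if not 1 <= l < u <= r:
        raise ValueError("need 1 <= l < u <= r")
    return sum(comb(r, k) for k in range(l, u)) / 2 ** r

def quantile_interval(replicates, l, u):
    """Interval [X_(l), X_(u)] from the sorted replicates (1-indexed ranks)."""
    ordered = sorted(replicates)
    return ordered[l - 1], ordered[u - 1]
```

For instance, using the minimum and maximum of 11 replicates gives nominal coverage 1 - 2/2^11, roughly 99.9%, while narrower rank pairs trade coverage for a shorter interval.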

2408.09418 2026-02-26 stat.ME

Grade of membership analysis for multi-layer ordinal categorical data

Huan Qing

Comments 46 pages, accepted by Statistica Sinica in 2025

Abstract

Consider a group of individuals (subjects) participating in the same psychological tests with numerous questions (items) at different times, where the choices of each item have an implicit ordering. The observed responses can be recorded in multiple response matrices over time, named multi-layer ordinal categorical data, where layers refer to time points. Assuming that each subject has a common mixed membership shared across all layers, enabling it to be affiliated with multiple latent classes with varying weights, the objective of the grade of membership (GoM) analysis is to estimate these mixed memberships from the data. When the test is conducted only once, the data becomes traditional single-layer ordinal categorical data. The GoM model is a popular choice for describing single-layer categorical data with a latent mixed membership structure. However, GoM cannot handle multi-layer ordinal categorical data. In this work, we propose a new model, multi-layer GoM, which extends GoM to multi-layer ordinal categorical data. To estimate the common mixed memberships, we propose a new approach, GoM-DSoG, based on a debiased sum of Gram matrices. We establish GoM-DSoG's per-subject convergence rate under the multi-layer GoM model. Our theoretical results suggest that fewer no-responses, more subjects, more items, and more layers are beneficial for GoM analysis. We also propose an approach to select the number of latent classes. Extensive experimental studies verify the theoretical findings and show GoM-DSoG's superiority over its competitors, as well as the accuracy of our method in determining the number of latent classes.

2408.06323 2026-02-26 stat.ME

Infer-and-widen, or not?

Ronan Perry, Zichun Xu, Olivia McGough, Daniela Witten

Abstract

In recent years, there has been substantial interest in the task of selective inference: inference on a parameter that is selected from the data. Many of the existing proposals fall into what we refer to as the infer-and-widen framework: they produce symmetric confidence intervals whose midpoints do not account for selection and therefore are biased; thus, the intervals must be wide enough to account for this bias. In this paper, we investigate infer-and-widen approaches in three vignettes: the winner's curse, maximal contrasts, and inference after the lasso. In each of these examples, we show that a state-of-the-art infer-and-widen proposal leads to confidence intervals that are wider than a non-infer-and-widen alternative. Furthermore, even an "oracle" infer-and-widen confidence interval -- the narrowest possible interval that could be theoretically attained via infer-and-widen -- can be wider than the alternative.

2312.16307 2026-02-26 econ.EM cs.GT cs.LG stat.ME

Incentive-Aware Synthetic Control: Accurate Counterfactual Estimation via Incentivized Exploration

Daniel Ngo, Keegan Harris, Anish Agarwal, Vasilis Syrgkanis, Zhiwei Steven Wu

Comments Accepted to TMLR

Details
English abstract

Synthetic control methods (SCMs) are a canonical approach used to estimate treatment effects from panel data in the internet economy. We shed light on a frequently overlooked but ubiquitous assumption made in SCMs of "overlap": a treated unit can be written as some combination -- typically, convex or linear -- of the units that remain under control. We show that if units select their own interventions, and there is sufficiently large heterogeneity between units that prefer different interventions, overlap will not hold. We address this issue by proposing a recommender system which incentivizes units with different preferences to take interventions they would not normally consider. Specifically, leveraging tools from information design and online learning, we propose an SCM that incentivizes exploration in panel data settings by providing incentive-compatible intervention recommendations to units. We establish that this estimator obtains valid counterfactual estimates without the need for an a priori overlap assumption. We extend our results to the setting of synthetic interventions, where the goal is to produce counterfactual outcomes under all interventions, not just control. Finally, we provide two hypothesis tests for determining whether unit overlap holds for a given panel dataset.
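The overlap assumption can be made concrete with a toy check. The sketch below assumes only two control units and fits a single convex weight by grid search over the pre-treatment outcomes. The function name `fit_convex_weight` and the data are hypothetical, and this is a generic illustration of overlap, not the estimator proposed in the paper.

```python
def fit_convex_weight(treated_pre, control_a_pre, control_b_pre, grid_size=1001):
    """Find w in [0, 1] minimising the squared error of w*a + (1-w)*b vs. treated."""
    best_w, best_err = 0.0, float("inf")
    for step in range(grid_size):
        w = step / (grid_size - 1)
        err = sum(
            (t - (w * a + (1 - w) * b)) ** 2
            for t, a, b in zip(treated_pre, control_a_pre, control_b_pre)
        )
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


control_a = [1.0, 2.0, 3.0]
control_b = [3.0, 2.0, 5.0]

# Overlap holds: this unit equals 0.75 * control_a + 0.25 * control_b.
w_in, err_in = fit_convex_weight([1.5, 2.0, 3.5], control_a, control_b)

# Overlap fails: matching this unit would need w = -1, outside [0, 1].
w_out, err_out = fit_convex_weight([5.0, 2.0, 7.0], control_a, control_b)
```

A zero pre-treatment error signals that the treated unit lies in the convex hull of the controls. A large residual, as in the second case, is the situation the abstract describes when heterogeneous units self-select into interventions.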

2306.15908 2026-02-26 stat.ME

Generalized Bayesian Multidimensional Scaling and Model Comparison

Jiarui Zhang, Jiguo Cao, Liangliang Wang

Details
English abstract

Multidimensional scaling (MDS) is widely used to reconstruct a low-dimensional representation of high-dimensional data while preserving pairwise distances. However, Bayesian MDS approaches based on Markov chain Monte Carlo (MCMC) face challenges in model generalization and comparison. To address these limitations, we propose a generalized Bayesian multidimensional scaling (GBMDS) framework that accommodates non-Gaussian errors and diverse dissimilarity metrics for improved robustness. We develop an adaptive annealed Sequential Monte Carlo (ASMC) algorithm for Bayesian inference, leveraging an annealing schedule to enhance posterior exploration and computational efficiency. The ASMC algorithm also provides a nearly unbiased marginal likelihood estimator, enabling principled Bayesian model comparison across different error distributions, dissimilarity metrics, and dimensional choices. Using synthetic and real data, we demonstrate the effectiveness of the proposed approach. Our results show that ASMC-based GBMDS achieves superior computational efficiency and robustness compared to MCMC-based methods under the same computational budget. The implementation of our proposed method and applications are available at https://github.com/SFU-Stat-ML/GBMDS.
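To show what "preserving pairwise distances" means, the sketch below computes raw stress: the sum of squared differences between the target dissimilarities and the distances in a candidate embedding. It assumes Euclidean distance and is only the classical fit criterion, not the GBMDS likelihood or the ASMC sampler. The helper names are illustrative.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def raw_stress(dissimilarities, embedding):
    """Sum over pairs i < j of (target dissimilarity - embedded distance)^2."""
    d = pairwise_distances(embedding)
    n = len(embedding)
    return sum(
        (dissimilarities[i][j] - d[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Three points in 3-D that lie in a plane (a 3-4-5 right triangle).
high_dim = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
target = pairwise_distances(high_dim)

stress_2d = raw_stress(target, [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)])  # exact fit
stress_1d = raw_stress(target, [(0.0,), (3.0,), (0.0,)])              # lossy fit
```

The two-dimensional embedding reproduces every distance, while the one-dimensional one cannot. Comparing candidate dimensions in this spirit is what a marginal likelihood estimator makes principled in the Bayesian setting.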

2211.02003 2026-02-26 cs.CR cs.LG stat.ML

Private Blind Model Averaging - Distributed, Non-interactive, and Convergent

Moritz Kirschte, Sebastian Meiser, Saman Ardalan, Esfandiar Mohammadi

Comments This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore

Details
English abstract

Distributed differentially private learning techniques enable a large number of users to jointly learn a model without having to first centrally collect the training data. At the same time, neither the communication between the users nor the resulting model shall leak information about the training data. This kind of learning technique can be deployed to edge devices if it can be scaled up to a large number of users, particularly if the communication is reduced to a minimum: no interaction, i.e., each party only sends a single message. The best previously known methods are based on gradient averaging, which inherently requires many synchronization rounds. A promising non-interactive alternative to gradient averaging relies on so-called output perturbation: each user first locally finishes training and then submits its model for secure averaging without further synchronization. We analyze this paradigm, which we coin blind model averaging (BlindAvg), in the setting of convex and smooth empirical risk minimization (ERM) like a support vector machine (SVM). While the required noise scale is asymptotically the same as in the centralized setting, it is not well understood how close BlindAvg comes to centralized learning, i.e., its utility cost. We characterize and boost the privacy-utility tradeoff of BlindAvg with two contributions: First, we prove that BlindAvg converges towards the centralized setting for a sufficiently strong L2-regularization for a non-smooth SVM learner. Second, we introduce the novel differentially private convex and smooth ERM learner SoftmaxReg that has a better privacy-utility tradeoff than an SVM in a multi-class setting. We evaluate our findings on three datasets (CIFAR-10, CIFAR-100, and Federated EMNIST) and provide an ablation in an artificially extreme non-IID scenario.
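The non-interactive pattern described here (train locally, perturb the output, average once) can be sketched as follows. This is a simplified illustration under stated assumptions: each user adds Gaussian noise to its own finished model, and a plain mean stands in for secure averaging. The function name `blind_average` and the noise scale are hypothetical and are not calibrated to any privacy guarantee.

```python
import random


def blind_average(local_models, noise_scale, seed=0):
    """Average locally trained weight vectors after output perturbation.

    Each user sends exactly one message: its model plus Gaussian noise.
    The noise scale here is a placeholder, NOT a calibrated DP parameter.
    """
    rng = random.Random(seed)
    dim = len(local_models[0])
    noisy = [
        [w + rng.gauss(0.0, noise_scale) for w in model]
        for model in local_models
    ]
    # Plain mean; a real deployment would use a secure aggregation protocol.
    return [sum(m[k] for m in noisy) / len(noisy) for k in range(dim)]


local_models = [[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]]
no_noise = blind_average(local_models, noise_scale=0.0)
with_noise = blind_average(local_models, noise_scale=0.1)
```

Because the noise is averaged across users, its effect on the aggregate shrinks as more users take part. No gradient synchronization rounds are needed, which is the contrast with gradient averaging drawn in the abstract.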