arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2601.15254 2026-01-22 stat.ML cs.AI cs.LG

Many Experiments, Few Repetitions, Unpaired Data, and Sparse Effects: Is Causal Inference Possible?

Felix Schur, Niklas Pfister, Peng Ding, Sach Mukherjee, Jonas Peters

English abstract

We study the problem of estimating causal effects under hidden confounding in the following unpaired data setting: we observe some covariates $X$ and an outcome $Y$ under different experimental conditions (environments) but do not observe them jointly; we either observe $X$ or $Y$. Under appropriate regularity conditions, the problem can be cast as an instrumental variable (IV) regression with the environment acting as a (possibly high-dimensional) instrument. When there are many environments but only a few observations per environment, standard two-sample IV estimators fail to be consistent. We propose a GMM-type estimator based on cross-fold sample splitting of the instrument-covariate sample and prove that it is consistent as the number of environments grows but the sample size per environment remains constant. We further extend the method to sparse causal effects via $\ell_1$-regularized estimation and post-selection refitting.

2601.15239 2026-01-22 stat.ML cs.LG math.ST stat.TH

Multi-context principal component analysis

Kexin Wang, Salil Bhate, João M. Pereira, Joe Kileel, Matylda Figlerowicz, Anna Seigal

Comments 47 pages, 8 figures. Supplementary tables are provided as a downloadable file

English abstract

Principal component analysis (PCA) is a tool to capture factors that explain variation in data. Across domains, data are now collected across multiple contexts (for example, individuals with different diseases, cells of different types, or words across texts). While the factors explaining variation in data are undoubtedly shared across subsets of contexts, no tools currently exist to systematically recover such factors. We develop multi-context principal component analysis (MCPCA), a theoretical and algorithmic framework that decomposes data into factors shared across subsets of contexts. Applied to gene expression, MCPCA reveals axes of variation shared across subsets of cancer types and an axis whose variability in tumor cells, but not mean, is associated with lung cancer progression. Applied to contextualized word embeddings from language models, MCPCA maps stages of a debate on human nature, revealing a discussion between science and fiction over decades. These axes are not found by combining data across contexts or by restricting to individual contexts. MCPCA is a principled generalization of PCA to address the challenge of understanding factors underlying data across contexts.

2601.13428 2026-01-22 stat.ME math.ST stat.TH

Optimal estimation of generalized causal effects in cluster-randomized trials with multiple outcomes

Xinyuan Chen, Fan Li

English abstract

Cluster-randomized trials (CRTs) are widely used to evaluate group-level interventions and increasingly collect multiple outcomes capturing complementary dimensions of benefit and risk. Investigators often seek a single global summary of treatment effect, yet existing methods largely focus on single-outcome estimands or rely on model-based procedures with unclear causal interpretation or limited robustness. We develop a unified potential outcomes framework for generalized treatment effects with multiple outcomes in CRTs, accommodating both non-prioritized and prioritized outcome settings. The proposed cluster-pair and individual-pair causal estimands are defined through flexible pairwise contrast functions and explicitly account for potentially informative cluster sizes. We establish nonparametric estimation via weighted clustered U-statistics and derive efficient influence functions to construct covariate-adjusted estimators that integrate debiased machine learning with U-statistics. The resulting estimators are consistent and asymptotically normal, attain the semiparametric efficiency bounds under mild regularity conditions, and have analytically tractable variance estimators that are proven to be consistent under cross-fitting. Simulations and an application to a CRT for chronic pain management illustrate the practical utility of the proposed methods.

2512.16463 2026-01-22 stat.AP stat.ME

Dynamic Prediction for Hospital Readmission in Patients with Chronic Heart Failure

Rebecca Farina, Francois Mercier, Christian Wohlfart, Serge Masson, Silvia Metelli

Comments 20 pages, 5 figures, 3 tables

English abstract

Hospital readmission among patients with chronic heart failure (HF) is a major clinical and economic burden. Dynamic prediction models that leverage longitudinal biomarkers may improve risk stratification over traditional static models. This study aims to develop and validate a joint model using longitudinal N-terminal pro-B-type natriuretic peptide (NT-proBNP) measurements to predict the risk of rehospitalization or death in HF patients. We analyzed real-world data from the TriNetX database, including patients with an incident HF diagnosis between 2016 and 2022. The final selected cohort included 1,804 patients. A Bayesian joint modeling framework was developed to link patient-specific NT-proBNP trajectories to the risk of a composite endpoint (HF rehospitalization or all-cause mortality) within a 180-day window following hospital discharge. The model's performance was evaluated using 5-fold cross-validation and assessed with the Integrated Brier Score and Integrated Calibration Index. The joint model demonstrated a strong predictive advantage over a benchmark static model, particularly when making updated predictions at later time points (180-360 days). A joint model trained on patients with more frequent NT-proBNP measurements achieved the highest accuracy. The main joint model showed excellent calibration, suggesting its risk estimates are reliable. Our findings suggest that modeling the full trajectory of NT-proBNP with a joint modeling framework enables more accurate and dynamic risk assessment compared to static, single-timepoint methods. This approach supports the development of adaptive clinical decision-support tools for personalized HF management.

2511.06498 2026-01-22 math.ST stat.TH

An ordering for the strength of functional dependence

Jonathan Ansari, Sebastian Fuchs

Comments Extended to Wasserstein correlations

English abstract

We introduce a new dependence order, termed the conditional convex order, whose minimal and maximal elements characterize independence and perfect dependence. Moreover, it characterizes conditional independence, satisfies information monotonicity, and exhibits several invariance properties. Consequently, it is an ordering for the strength of functional dependence of a random variable Y on a random vector X. As we show, various recently studied dependence measures -- including Chatterjee's rank correlation, Wasserstein correlations, and rearranged dependence measures -- are increasing in this order and inherit their fundamental properties from it. We characterize the conditional convex order by the Schur order and by the concordance order, and we verify it in settings such as additive error models, the multivariate normal distribution, and various copula-based models. Our results offer a unified perspective on the behavior of dependence measures across statistical models.

2506.16230 2026-01-22 q-fin.RM math.PR stat.ME stat.ML

EVT-Based Rate-Preserving Distributional Robustness for Tail Risk Functionals

Anand Deo

English abstract

Risk measures such as Conditional Value-at-Risk (CVaR) focus on extreme losses, where scarce tail data makes model error unavoidable. To hedge misspecification, one evaluates worst-case tail risk over an ambiguity set. Using Extreme Value Theory (EVT), we derive first-order asymptotics for worst-case tail risk for a broad class of tail-risk measures under standard ambiguity sets, including Wasserstein balls and $ϕ$-divergence neighborhoods. We show that robustification can alter the nominal tail asymptotic scaling as the tail level $β\to0$, leading to excess risk inflation. Motivated by this diagnostic, we propose a tail-calibrated ambiguity design that preserves the nominal tail asymptotic scaling while still guarding against misspecification. Under standard domain of attraction assumptions, we prove that the resulting worst-case risk preserves the baseline first-order scaling as $β\to0$, uniformly over key tuning parameters, and that a plug-in implementation based on consistent tail-index estimation inherits these guarantees. Synthetic and real-data experiments show that the proposed design avoids the severe inflation often induced by standard ambiguity sets.
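
For readers unfamiliar with the tail-risk functional named in this abstract, the following is a minimal sketch of a plain empirical CVaR at tail level $β$, assuming losses are recorded so that larger values are worse. The function name `empirical_cvar` and the toy data are illustrative assumptions; this is the nominal sample estimator only, not the paper's worst-case or tail-calibrated procedure.

```python
import math


def empirical_cvar(losses, beta):
    """Average of the worst `beta` fraction of losses (larger loss = worse).

    Illustrative nominal estimator only; no ambiguity set is involved.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    ordered = sorted(losses, reverse=True)
    # Keep at least one observation so the average is always defined.
    k = max(1, math.ceil(beta * len(ordered)))
    return sum(ordered[:k]) / k


toy_losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cvar_quarter = empirical_cvar(toy_losses, 0.25)  # mean of the two largest losses
cvar_full = empirical_cvar(toy_losses, 1.0)      # reduces to the sample mean
```

As `beta` shrinks, fewer observations enter the average, which is the data-scarcity problem the abstract describes.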

2601.15132 2026-01-22 stat.ME astro-ph.CO astro-ph.IM

Efficient prior sensitivity analysis for Bayesian model comparison

Zixiao Hu, Jason D. McEwen

Comments 11 pages, 4 figures; submitted conference proceedings for MaxEnt 2025

English abstract

Bayesian model comparison implements Occam's razor through its sensitivity to the prior. However, prior-dependence makes it important to assess the influence of plausible alternative priors. Such prior sensitivity analyses for the Bayesian evidence are expensive, either requiring repeated, costly model re-fits or specialised sampling schemes. By exploiting the learned harmonic mean estimator (LHME) for evidence calculation we decouple sampling and evidence calculation, allowing resampled posterior draws to be used directly to calculate the evidence without further likelihood evaluations. This provides an alternative approach to prior sensitivity analysis for Bayesian model comparison that dramatically alleviates the computational cost and is agnostic to the method used to generate posterior samples. We validate our method on toy problems and a cosmological case study, reproducing estimates obtained by full Markov chain Monte Carlo (MCMC) sampling and nested sampling re-fits. For the cosmological example considered our approach achieves up to $6000\times$ lower computational cost.

2601.15128 2026-01-22 math.CO math.AC math.AG math.ST stat.TH

Decomposing Determinantal Varieties from Statistics via Matroid Theory

Per Alexandersson, Yulia Alexandr, Emiliano Liwski, Fatemeh Mohammadi, Pardis Semnani

English abstract

We study determinantal varieties from conditional independence models with hidden variables, focusing on their irreducible decompositions, dimensions, degrees, and Gröbner bases. Each variety encodes a collection of matroids, whose flats capture algebraic dependencies among variables. Using this approach, we provide a systematic description of the components, their dimensions, and defining equations, and introduce a combinatorial framework for computing the degree of the determinantal variety. Our approach highlights the central role of matroidal structures in the study of determinantal varieties and extends beyond the reach of current computational techniques.

2601.15024 2026-01-22 eess.SP stat.CO

Physical Layer Security in Massive MIMO: Challenges and Open Research Directions Against Passive Eavesdroppers

Nipun Agarwal

English abstract

Massive Multiple-Input Multiple-Output (MIMO) has become a crucial enabling technology for 5G and beyond, providing unprecedented gains in energy and spectral efficiency. Guaranteeing secure communication in these systems remains difficult, particularly against passive eavesdroppers whose channel state information is unknown to the base station. By exploiting the inherent randomness of wireless channels, Physical Layer Security (PLS) offers a promising paradigm; however, its efficacy in massive MIMO depends heavily on resource allocation and transmission strategies. This work examines the performance of secure transmission schemes, such as Maximum Ratio Transmission (MRT), Zero-Forcing (ZF), and Artificial Noise (AN)-aided beamforming, in the presence of passive eavesdroppers. Extensive Monte Carlo simulations are used to assess key performance metrics such as energy efficiency, secrecy outage probability, and secrecy sum rate under different system parameters (e.g., number of antennas, Signal-to-Noise Ratio (SNR), power allocation). The results aim to provide comparative insight into the strengths and limitations of different PLS strategies and to highlight open research directions for designing scalable, energy-efficient, and robust secure transmission techniques in future 6G networks.

2601.14937 2026-01-22 stat.ME math.ST stat.AP stat.TH

Geostatistics from Elliptic Boundary-Value Problems: Green Operators, Transmission Conditions, and Schur Complements

Juan J. Segura

English abstract

Classical geostatistics encodes spatial dependence by prescribing variograms or covariance kernels on Euclidean domains, whereas the SPDE--GMRF paradigm specifies Gaussian fields through an elliptic precision operator whose inverse is the corresponding Green operator. We develop an operator-based formulation of Gaussian spatial random fields on bounded domains and manifolds with internal interfaces, treating boundary and transmission conditions as explicit components of the statistical model. Starting from coercive quadratic energy functionals, variational theory yields a precise precision--covariance correspondence and shows that variograms are derived quadratic functionals of the Green operator, hence depend on boundary conditions and domain geometry. Conditioning and kriging follow from standard Gaussian update identities in both covariance and precision form, with hard constraints represented equivalently by exact interpolation constraints or by distributional source terms. Interfaces are modelled via surface penalty terms; taking variations produces flux-jump transmission conditions and induces controlled attenuation of cross-interface covariance. Finally, boundary-driven prediction and domain reduction are formulated through Dirichlet-to-Neumann operators and Schur complements, providing an operator language for upscaling, change of support, and subdomain-to-boundary mappings. Throughout, we use tools standard in spatial statistics and elliptic PDE theory to keep boundary and interface effects explicit in covariance modeling and prediction.

2601.14916 2026-01-22 cs.DL stat.AP

Citation of scientific evidence from video description and its association with attention and impact

Pablo Dorta-González, María Isabel Dorta-González

Comments 18 pages, 6 tables

English abstract

This study investigates how YouTube content creators utilize scientific evidence in videos. Log-linear regression examines the influence of alternative communication channels on video creators in Biotechnology, using data from 81,302 papers (2018-2023). This reveals a positive association with news articles and Wikipedia pages, but a negative association with scientific papers, policy documents, and patents. Despite the potential for enriching discussions, science video creators seem to favor materials with wider public attention over influential science, technology, and policy papers. These findings suggest a need for improved dissemination strategies for scientific research. Authors, universities, and journals should consider how their work can be made more accessible and engaging for science communicators on video.

2601.14849 2026-01-22 stat.ME

Graphical model-based clustering of categorical data

Laura Ferrini, Federico Castelletti

English abstract

Clustering multivariate data is a pervasive task in many applied problems, particularly in social studies and life science. Model-based approaches to clustering rely on mixture models, where each mixture component corresponds to the kernel of a distribution characterizing a latent sub-group. Current methods developed within this framework employ multivariate distributions built under the assumption of independence among variables given the cluster allocation. Accordingly, possible dependence structures characterizing differences across groups are not directly accounted for during the clustering process. In this paper we consider multivariate categorical data, and introduce a model-based clustering method which employs graphical models as a tool to encode dependencies between variables. Specifically, we consider a Dirichlet Process mixture of categorical graphical models, which clusters individuals into groups that are homogeneous in terms of dependence (graphical) structure and allied parameters. We provide full Bayesian inference for the model and develop a Markov chain Monte Carlo scheme for posterior analysis. Our method is evaluated through simulations and applied to real case studies, including the analysis of genomic data and voting records. Results reveal the merits of a graphical model-based clustering, in comparison with approaches that do not explicitly account for dependencies in the multivariate distribution of variables.

2601.14818 2026-01-22 cs.LG math.ST stat.TH

Statistical Learning Theory for Distributional Classification

Christian Fiedler

Comments Contains supplementary material

English abstract

In supervised learning with distributional inputs in the two-stage sampling setup, relevant to applications like learning-based medical screening or causal learning, the inputs (which are probability distributions) are not accessible in the learning phase, but only samples thereof. This problem is particularly amenable to kernel-based learning methods, where the distributions or samples are first embedded into a Hilbert space, often using kernel mean embeddings (KMEs), and then a standard kernel method like Support Vector Machines (SVMs) is applied, using a kernel defined on the embedding Hilbert space. In this work, we contribute to the theoretical analysis of this latter approach, with a particular focus on classification with distributional inputs using SVMs. We establish a new oracle inequality and derive consistency and learning rate results. Furthermore, for SVMs using the hinge loss and Gaussian kernels, we formulate a novel variant of an established noise assumption from the binary classification literature, under which we can establish learning rates. Finally, some of our technical tools like a new feature space for Gaussian kernels on Hilbert spaces are of independent interest.

2601.14815 2026-01-22 stat.AP

Zero-inflated binary Tree Pólya splitting regression for multivariate count data

Fabrice Moudjieu, Jean Peyhardi, Maxime Réjou-Méchain, Patrice Soh Takam, Frédéric Mortier

English abstract

Species distribution models (SDMs) are widely used to assess the effects of environmental factors on species distributions. However, classical SDMs ignore inter-species dependencies. Multivariate SDMs (MSDMs), especially those based on latent Gaussian fields such as the multivariate Poisson log-normal (MPLN), address this limitation but face challenges related to computation, dimensionality, and interpretability. Pólya-splitting (PS) distributions offer an alternative, combining a model for total abundance with a multivariate allocation structure, and have natural interpretations from ecological process models. Yet, they lack flexibility in modeling correlation structures. Tree Pólya-splitting (TPS) distributions overcome this by introducing hierarchical structure such as a phylogenetic tree. In this paper, we extend TPS to account for zero-inflation, leading to the zero-inflated tree Pólya-splitting (Z-TPS) family. We detail its statistical properties, show how standard software enables efficient inference, and illustrate its ecological relevance using tree abundance data from over 180 genera across the Congo Basin tropical rainforest.

2601.14796 2026-01-22 stat.AP

A Practical Guide to Modern Imputation

Jeffrey Näf

English abstract

This guide, based on recent papers, should help researchers avoid some of the most common pitfalls of missing value imputation.

2601.14752 2026-01-22 stat.ME

Global-local shrinkage priors for modeling random effects in multivariate spatial small area estimation

Shushi Nishina, Takahiro Onizuka, Shintaro Hashimoto

Comments 39 pages, 10 figures

English abstract

Small area estimation (SAE) plays a central role in survey statistics and epidemiology, providing reliable estimates for domains with limited sample sizes. The multivariate Fay-Herriot model has been extensively used for this purpose, because it enhances estimation accuracy by borrowing strength across multiple correlated variables. In this paper, we develop a Bayesian extension of the multivariate Fay-Herriot model that enables flexible, component-specific shrinkage of the random effects. The proposed approach employs global-local priors formulated through a sandwich mixture representation, allowing adaptive regularization of each element of the random-effect vectors. This construction yields greater robustness and prevents excessive shrinkage in areas exhibiting strong underlying signals. In addition, we incorporate spatial dependence into the model to account for geographical correlation across small areas. The resulting spatial multivariate framework simultaneously exploits cross-variable relationships and spatial structure, yielding improved estimation efficiency. The utility of the proposed method is demonstrated through simulation studies and an empirical application to real survey data.

2601.14701 2026-01-22 stat.AP

Regulatory Expectations for Bayesian Methods in Drug and Biologic Clinical Trials: A Practical Perspective on FDA's 2026 Draft Guidance

Yuan Ji, Ph.D.

English abstract

The U.S. Food and Drug Administration (FDA) released a landmark draft guidance in January 2026 on the use of Bayesian methodology to support primary inference in clinical trials of drugs and biological products. For sponsors, the central message is not merely that "Bayes is allowed," but that Bayesian designs should be justified through explicit success criteria, thoughtful priors (especially when borrowing external information), prospective operating-characteristic evaluation (often via simulation when simulation is used), and computational transparency suitable for regulatory review. This paper provides a practical, regulatory-oriented synthesis of the draft guidance, highlighting where Bayesian designs can be calibrated to traditional frequentist error-rate targets and where, with sponsor--FDA agreement, alternative Bayesian operating metrics may be appropriate. We illustrate expectations through examples discussed in the guidance (e.g., platform trials, external/nonconcurrent controls, pediatric extrapolation) and conclude with an actionable checklist for planning documents and submission packages.

2601.14647 2026-01-22 math.OC cs.LG stat.ML

TRSVR: An Adaptive Stochastic Trust-Region Method with Variance Reduction

Yuchen Fang, Xinshou Zheng, Javad Lavaei

Comments 22 pages

English abstract

We propose a stochastic trust-region method for unconstrained nonconvex optimization that incorporates stochastic variance-reduced gradients (SVRG) to accelerate convergence. Unlike classical trust-region methods, the proposed algorithm relies solely on stochastic gradient information and does not require function value evaluations. The trust-region radius is adaptively adjusted based on a radius-control parameter and the stochastic gradient estimate. Under mild assumptions, we establish that the algorithm converges in expectation to a first-order stationary point. Moreover, the method achieves iteration and sample complexity bounds that match those of SVRG-based first-order methods, while allowing stochastic and potentially gradient-dependent second-order information. Extensive numerical experiments demonstrate that incorporating SVRG accelerates convergence, and that the use of trust-region methods and Hessian information further improves performance. We also highlight the impact of batch size and inner-loop length on efficiency, and show that the proposed method outperforms SGD and Adam on several machine learning tasks.
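
The variance-reduced gradient that this abstract builds on can be sketched as follows. This is plain SVRG with a fixed step size on a scalar least-squares toy problem, not the TRSVR algorithm: it has no trust-region radius rule and uses no second-order information. The function name `svrg`, the step size, the loop lengths, and the data are all illustrative assumptions.

```python
import random


def svrg(data, x0, step=0.05, epochs=30, inner_steps=20, seed=0):
    """Plain SVRG on f(x) = (1/n) * sum_i 0.5 * (a_i * x - b_i)**2 with scalar x."""
    rng = random.Random(seed)
    n = len(data)

    def grad_i(x, i):
        # Gradient of the i-th summand 0.5 * (a_i * x - b_i)**2.
        a, b = data[i]
        return a * (a * x - b)

    x = x0
    for _ in range(epochs):
        # Outer loop: freeze a snapshot and compute its full gradient once.
        snapshot = x
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at x,
            # with variance that shrinks as x and the snapshot approach the optimum.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= step * g
    return x


toy_data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
x_hat = svrg(toy_data, x0=0.0)
# Closed-form least-squares minimiser, used only to check the sketch.
x_star = sum(a * b for a, b in toy_data) / sum(a * a for a, _ in toy_data)
```

The inner-loop length and batch size (here a single sample) are the tuning knobs whose impact on efficiency the abstract says the authors study.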

2601.14631 2026-01-22 stat.ML cs.LG stat.CO stat.ME

Semi-Supervised Mixture Models under the Concept of Missing at Random with Margin Confidence and Aranda Ordaz Function

Jinyang Liao, Ziyang Lyu

Comments 8 pages, 7 figures

English abstract

This paper presents a semi-supervised learning framework for Gaussian mixture modelling under a Missing at Random (MAR) mechanism. The method explicitly parameterizes the missingness mechanism by modelling the probability of missingness as a function of classification uncertainty. To quantify classification uncertainty, we introduce margin confidence and incorporate the Aranda Ordaz (AO) link function to flexibly capture the asymmetric relationships between uncertainty and missing probability. Based on this formulation, we develop an efficient Expectation Conditional Maximization (ECM) algorithm that jointly estimates all parameters appearing in both the Gaussian mixture model (GMM) and the missingness mechanism, and subsequently imputes the missing labels by a Bayesian classifier derived from the fitted mixture model. This method effectively alleviates the bias induced by ignoring the missingness mechanism while enhancing the robustness of semi-supervised learning. The resulting uncertainty-aware framework delivers reliable classification performance in realistic MAR scenarios with substantial proportions of missing labels.
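
To make the two ingredients named in this abstract concrete, here is a minimal sketch under stated assumptions. Margin confidence is taken to be the gap between the two largest posterior class probabilities (a common definition; the paper's exact definition is not given in the abstract). The link is the standard asymmetric Aranda-Ordaz family, whose inverse is $p = 1 - (1 + \lambda e^{\eta})^{-1/\lambda}$. The linear predictor `xi0 + xi1 * margin` and the names `xi0`, `xi1`, `lam` are hypothetical and not taken from the paper.

```python
import math


def margin_confidence(posteriors):
    """Gap between the two largest class probabilities (assumed definition)."""
    top, second = sorted(posteriors, reverse=True)[:2]
    return top - second


def aranda_ordaz_inverse(eta, lam):
    """Inverse of the asymmetric Aranda-Ordaz link for lam > 0.

    lam = 1 recovers the logistic function; lam -> 0 approaches the
    complementary log-log inverse link 1 - exp(-exp(eta)).
    """
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    return 1.0 - (1.0 + lam * math.exp(eta)) ** (-1.0 / lam)


def missing_probability(posteriors, xi0, xi1, lam):
    """Hypothetical missingness model: AO inverse link of a linear predictor."""
    eta = xi0 + xi1 * margin_confidence(posteriors)
    return aranda_ordaz_inverse(eta, lam)


# A confident observation versus an ambiguous one, with a negative slope so
# that lower margin (more uncertainty) implies a higher chance of a missing label.
p_confident = missing_probability([0.9, 0.05, 0.05], xi0=1.0, xi1=-4.0, lam=0.5)
p_ambiguous = missing_probability([0.4, 0.35, 0.25], xi0=1.0, xi1=-4.0, lam=0.5)
```

The single shape parameter `lam` is what lets the curve be asymmetric around probability one half, which a logit link cannot do.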

2601.14625 2026-01-22 cs.CV stat.ML

Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection

Yingsong Huang, Hui Guo, Jing Huang, Bing Bai, Qi Xiong

English abstract

The rapid progress of diffusion models highlights the growing need for detecting generated images. Previous research demonstrates that incorporating diffusion-based measurements, such as reconstruction error, can enhance the generalizability of detectors. However, ignoring the differing impacts of aleatoric and epistemic uncertainty on reconstruction error can undermine detection performance. Aleatoric uncertainty, arising from inherent data noise, creates ambiguity that impedes accurate detection of generated images. As it reflects random variations within the data (e.g., noise in natural textures), it does not help distinguish generated images. In contrast, epistemic uncertainty, which represents the model's lack of knowledge about unfamiliar patterns, supports detection. In this paper, we propose a novel framework, Diffusion Epistemic Uncertainty with Asymmetric Learning (DEUA), for detecting diffusion-generated images. We introduce Diffusion Epistemic Uncertainty (DEU) estimation via the Laplace approximation to assess the proximity of data to the manifold of diffusion-generated samples. Additionally, an asymmetric loss function is introduced to train a balanced classifier with larger margins, further enhancing generalizability. Extensive experiments on large-scale benchmarks validate the state-of-the-art performance of our method.

2601.14616 2026-01-22 stat.AP econ.EM

Implementing Substance Over Form: A Novel Metric for Taxing E-commerce to Address Deterritorialization

Li Tuobang

English abstract

Against the backdrop of e-commerce restructuring consumption patterns, last-mile delivery stations have substantially fulfilled the function of community retail distribution. However, the current tax system only levies a low labor service tax on delivery fees, resulting in a tax contribution from the massive circulating goods value that is significantly lower than that of retail supermarkets of equivalent scale. This disparity not only triggers local tax base erosion but also fosters unfair competition. Based on the "substance over form" principle, this paper proposes a tax rate calculation method using "delivery fee plus insurance premium" as the base, corrected through "goods value conversion." This method aims to align the substantive tax burden of e-commerce with that of community retail at the terminal stage, effectively internalizing the high negative externalities of delivery stations through fiscal instruments, addressing E-commerce Deterritorialization.

2601.14609 2026-01-22 stat.ML cs.AI cs.LG

Communication-Efficient Federated Risk Difference Estimation for Time-to-Event Clinical Outcomes

Ziwen Wang, Siqi Li, Marcus Eng Hock Ong, Nan Liu

English abstract

Privacy-preserving model co-training in medical research is often hindered by server-dependent architectures incompatible with protected hospital data systems and by the predominant focus on relative effect measures (hazard ratios) which lack clinical interpretability for absolute survival risk assessment. We propose FedRD, a communication-efficient framework for federated risk difference estimation in distributed survival data. Unlike typical federated learning frameworks (e.g., FedAvg) that require persistent server connections and extensive iterative communication, FedRD is server-independent with minimal communication: one round of summary statistics exchange for the stratified model and three rounds for the unstratified model. Crucially, FedRD provides valid confidence intervals and hypothesis testing--capabilities absent in FedAvg-based frameworks. We provide theoretical guarantees by establishing the asymptotic properties of FedRD and prove that FedRD (unstratified) is asymptotically equivalent to pooled individual-level analysis. Simulation studies and real-world clinical applications across different countries demonstrate that FedRD outperforms local and federated baselines in both estimation accuracy and prediction performance, providing an architecturally feasible solution for absolute risk assessment in privacy-restricted, multi-site clinical studies.

2601.14597 2026-01-22 cs.IT cs.AI cs.CR math.IT stat.ML

Optimality of Staircase Mechanisms for Vector Queries under Differential Privacy

James Melbourne, Mario Diaz, Shahab Asoodeh

Comments Submitted for possible publication

English abstract

We study the optimal design of additive mechanisms for vector-valued queries under $ε$-differential privacy (DP). Given only the sensitivity of a query and a norm-monotone cost function measuring utility loss, we ask which noise distribution minimizes expected cost among all additive $ε$-DP mechanisms. Using convex rearrangement theory, we show that this infinite-dimensional optimization problem admits a reduction to a one-dimensional compact and convex family of radially symmetric distributions whose extreme points are the staircase distributions. As a consequence, we prove that for any dimension, any norm, and any norm-monotone cost function, there exists an $ε$-DP staircase mechanism that is optimal among all additive mechanisms. This result resolves a conjecture of Geng, Kairouz, Oh, and Viswanath, and provides a geometric explanation for the emergence of staircase mechanisms as extremal solutions in differential privacy.

2601.14586 2026-01-22 math.ST math.PR stat.TH

Cluster size distributions of discrete random fields

Dan Cheng, John Ginos

English abstract

We study discrete random fields $\{X_t: t\in \mathbb{Z}^d\}$ parameterized on the $d$-dimensional integer lattice $\mathbb{Z}^d$. For a fixed threshold $u$, the excursion set $\{t \in \mathbb{Z}^d : X_t > u\}$ decomposes into connected components or clusters, whose sizes, defined as the number of lattice points they contain, are random. This paper investigates the probability distribution of these cluster sizes. For stationary random fields, we derive exact expressions for the cluster size distribution. To address nonstationary settings, we introduce a peak-based cluster size distribution, which characterizes the distribution of cluster sizes conditional on the presence of a local maximum above $u$. This formulation provides a tractable alternative when exact cluster size distributions are analytically inaccessible. The proposed framework applies broadly to Gaussian and non-Gaussian random fields, relying only on their joint dependence structure. Our results provide a theoretical foundation for quantifying spatial extent in discretely sampled data, with applications to medical imaging, geoscience, environmental monitoring, and other scientific areas where thresholded random fields naturally arise.
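
The object studied in this abstract, the sizes of connected components of the excursion set $\{t : X_t > u\}$, can be computed on a finite grid as sketched below. The sketch assumes $d = 2$ and nearest-neighbour (4-point) connectivity, which the abstract does not specify; the function name `cluster_sizes` and the toy field are illustrative. It measures observed cluster sizes only and does not implement the paper's distributional results.

```python
from collections import deque


def cluster_sizes(field, u):
    """Sizes of 4-connected clusters of {(r, c) : field[r][c] > u} on a 2-D grid."""
    rows, cols = len(field), len(field[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if field[r][c] <= u or seen[r][c]:
                continue
            # Breadth-first search over one cluster, counting its lattice points.
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                i, j = queue.popleft()
                size += 1
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    if inside and not seen[ni][nj] and field[ni][nj] > u:
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            sizes.append(size)
    return sizes


toy_field = [
    [0.2, 1.5, 1.7, 0.1],
    [0.0, 1.2, 0.3, 0.4],
    [2.1, 0.1, 0.2, 1.9],
    [2.4, 0.3, 1.1, 1.3],
]
sizes = cluster_sizes(toy_field, u=1.0)
```

Repeating this over many simulated fields would give an empirical cluster size distribution against which exact expressions could be compared.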

2601.14536 2026-01-22 cs.LG q-bio.GN stat.ML

engGNN: A Dual-Graph Neural Network for Omics-Based Disease Classification and Feature Selection

Tiantian Yang, Yuxuan Wang, Zhenwei Zhou, Ching-Ti Liu

Comments 21 pages, 14 figures, 5 tables

Details
English abstract

Omics data, such as transcriptomics, proteomics, and metabolomics, provide critical insights into disease mechanisms and clinical outcomes. However, their high dimensionality, small sample sizes, and intricate biological networks pose major challenges for reliable prediction and meaningful interpretation. Graph Neural Networks (GNNs) offer a promising way to integrate prior knowledge by encoding feature relationships as graphs. Yet, existing methods typically rely solely on either an externally curated feature graph or a data-driven generated one, which limits their ability to capture complementary information. To address this, we propose the external and generated Graph Neural Network (engGNN), a dual-graph framework that jointly leverages both external known biological networks and data-driven generated graphs. Specifically, engGNN constructs a biologically informed undirected feature graph from established network databases and complements it with a directed feature graph derived from tree-ensemble models. This dual-graph design produces more comprehensive embeddings, thereby improving predictive performance and interpretability. Through extensive simulations and real-world applications to gene expression data, engGNN consistently outperforms state-of-the-art baselines. Beyond classification, engGNN provides interpretable feature importance scores that facilitate biologically meaningful discoveries, such as pathway enrichment analysis. Taken together, these results highlight engGNN as a robust, flexible, and interpretable framework for disease classification and biomarker discovery in high-dimensional omics contexts.

2601.14515 2026-01-22 stat.ML cs.LG cs.NA math.NA

Large Data Limits of Laplace Learning for Gaussian Measure Data in Infinite Dimensions

Zhengang Zhong, Yury Korolev, Matthew Thorpe

Details
English abstract

Laplace learning is a semi-supervised method, a solution for finding missing labels from a partially labeled dataset utilizing the geometry given by the unlabeled data points. The method minimizes a Dirichlet energy defined on a (discrete) graph constructed from the full dataset. In finite dimensions the asymptotics in the large (unlabeled) data limit are well understood with convergence from the graph setting to a continuum Sobolev semi-norm weighted by the Lebesgue density of the data-generating measure. The lack of the Lebesgue measure on infinite-dimensional spaces requires rethinking the analysis if the data aren't finite-dimensional. In this paper we make a first step in this direction by analyzing the setting when the data are generated by a Gaussian measure on a Hilbert space and proving pointwise convergence of the graph Dirichlet energy.
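As a finite-dimensional illustration of the Laplace learning step described above, the sketch below minimizes the graph Dirichlet energy $f^\top L f$ with the labeled values held fixed, which reduces to the linear system $L_{uu} f_u = -L_{ul} f_l$. The Gaussian edge weights, the `bandwidth` parameter, and the function name `laplace_learning` are assumptions made for this example and are not taken from the paper, which concerns the infinite-dimensional limit.

```python
import numpy as np

def laplace_learning(X, labeled_idx, labels, bandwidth=1.0):
    """Harmonic extension of known labels on a Gaussian-weighted graph.

    Illustrative only: the weight function and bandwidth are assumptions.
    """
    n = X.shape[0]
    # Pairwise squared distances and Gaussian edge weights (no self-loops).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2.0 * bandwidth ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian

    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    f = np.zeros(n)
    f[labeled_idx] = labels

    # Setting the gradient of f^T L f to zero on the unlabeled nodes
    # gives L_uu f_u = -L_ul f_l.
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    f[unlabeled_idx] = np.linalg.solve(L_uu, -L_ul @ f[labeled_idx])
    return f

# Two well-separated groups on the line, one labeled point in each.
X = np.array([[0.0], [0.1], [0.2], [3.0], [3.1], [3.2]])
f = laplace_learning(X, labeled_idx=[0, 5], labels=[0.0, 1.0])
```

Thresholding `f` at 0.5 assigns each unlabeled point to the group of its nearby labeled point in this toy example.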

2601.14498 2026-01-22 stat.ME

The RobinCar Family: R Tools for Robust Covariate Adjustment in Randomized Clinical Trials

Marlena Bannick, Yuanyuan Bian, Gregory Chen, Liming Li, Yuhan Qian, Daniel Sabanés Bové, Dong Xi, Ting Ye, Yanyao Yi

Comments On behalf of the Software Subteam ASA-BIOP Covariate Adjustment Scientific Working Group. All authors contributed equally to this work

Details
English abstract

Purpose: Covariate adjustment is a powerful statistical technique that can increase efficiency in clinical trials. Recent guidance from the U.S. FDA provided recommendations and best practices for using covariate adjustment. However, a gap has persisted between the extensive statistical literature on covariate adjustment and software that is easy to use and abides by these best practices. Methods: We have developed the RobinCar Family, which comprises RobinCar and RobinCar2. These two R packages enable covariate-adjusted analyses for continuous, discrete, and time-to-event outcomes that follow best practices. For continuous and discrete outcomes, the functions in the RobinCar Family facilitate traditional forms of covariate adjustment such as ANCOVA as well as more recent approaches like ANHECOVA, G-computation with generalized linear models and machine learning models, and adjustment for a super-covariate (as in PROCOVA(TM)). Functions for time-to-event outcomes implement the covariate-adjusted log-rank test, the stratified covariate-adjusted log-rank test, and the marginal covariate-adjusted hazard ratio. The RobinCar Family is supported by the ASA Biopharmaceutical Section Covariate Adjustment Scientific Working Group. Results: We provide an accessible overview of the covariate-adjusted statistical methods, and describe how they are implemented in RobinCar and RobinCar2. We highlight important usage notes for clinical trial practitioners. Conclusion: We apply RobinCar and RobinCar2 functions by analyzing data from the AIDS Clinical Trials Group Study 175, demonstrating that they are straightforward and user-friendly.

2601.14431 2026-01-22 stat.ME

Doubly robust estimators of the restricted mean time in favor estimands in individual- and cluster-randomized trials

Xi Fang, Bingkai Wang, Guangyu Tong, Liangyuan Hu, Shuangge Ma, Fan Li

Details
English abstract

Progressive multi-state survival outcomes are common in trials with recurrent or sequential events and require treatment effect estimands that remain interpretable without proportional intensity or Markov assumptions. The restricted mean time in favor of treatment (RMT-IF) extends the restricted mean survival time to ordered multi-state processes and provides such an interpretable estimand. However, existing RMT-IF methods are nonparametric, assume covariate-independent censoring for independent observations, and do not accommodate cluster-randomized trials (CRTs), limiting both efficiency and applicability. We develop a class of doubly robust estimators for RMT-IF under right censoring using an augmented inverse-probability weighting framework that combines stage-specific outcome regression with arm-specific censoring models, yielding consistency when either nuisance model is correctly specified. We further extend the framework to CRTs by formalizing both cluster-level and individual-level average RMT-IF estimands to address informative cluster size and by constructing corresponding doubly robust estimators that account for within-cluster correlation. For inference, we employ model-agnostic jackknife variance estimators in both individually randomized and cluster-randomized settings. Extensive simulation studies demonstrate finite-sample performance, and the methods are illustrated using two randomized trial examples.

2601.14309 2026-01-22 stat.ME

A Bayesian framework for cost-effectiveness analysis with time-varying treatment decisions

Esteban Fernández-Morales, Emily M. Ko, Nandita Mitra, Youjin Lee, Arman Oganisian

Details
English abstract

Cost-effectiveness analyses (CEAs) compare the costs and health outcomes of treatment regimes to inform medical decisions. With observational claims data, CEAs must address nonrandom treatment assignment, administrative censoring, and irregularly spaced medical visits that reflect the continuous timing of care and treatment initiation. In high-risk, early-stage endometrial cancer (HR-EC), adjuvant radiation is initiated at patient-specific times following hysterectomy, causing confounding between treatment and outcomes that can evolve with post-surgical recovery and clinical course. Most existing CEA methods use point-treatment or discrete-time models. However, point-treatment approaches break down with time-varying confounding, while discrete-time models bin continuous time, expand the data into a person-period format, and can induce zero-inflation by creating many intervals with no cost-accruing events. We propose a Bayesian framework for CEAs with sequential decision-making that jointly models costs and event times in continuous time, accounts for administrative censoring, and supports dynamic treatment regimes with minimal parametric assumptions. We use Bayesian g-computation to estimate causally interpretable cost-effectiveness measures, including net monetary benefit, and to compare regimes through posterior contrasts. We evaluate the finite-sample performance of the proposed method in simulations across censoring levels and compare it against discrete-time and fully parametric alternatives. We then use SEER-Medicare data to assess the cost-effectiveness of initiating adjuvant radiation therapy within six months following hysterectomy among HR-EC patients.

2601.14262 2026-01-22 stat.ME cs.AI cs.HC cs.LG

On Meta-Evaluation

Hongxiao Li, Chenxi Wang, Fanda Fan, Zihan Wang, Wanling Gao, Lei Wang, Jianfeng Zhan

Details
English abstract

Evaluation is the foundation of empirical science, yet the evaluation of evaluation itself -- so-called meta-evaluation -- remains strikingly underdeveloped. While methods such as observational studies, design of experiments (DoE), and randomized controlled trials (RCTs) have shaped modern scientific practice, there has been little systematic inquiry into their comparative validity and utility across domains. Here we introduce a formal framework for meta-evaluation by defining the evaluation space, its structured representation, and a benchmark we call AxiaBench. AxiaBench enables the first large-scale, quantitative comparison of ten widely used evaluation methods across eight representative application domains. Our analysis reveals a fundamental limitation: no existing method simultaneously achieves accuracy and efficiency across diverse scenarios, with DoE and observational designs in particular showing significant deviations from real-world ground truth. We further evaluate a unified method of entire-space stratified sampling from previous evaluatology research, and the results show that it consistently outperforms prior approaches across all tested domains. These results establish meta-evaluation as a scientific object in its own right and provide both a conceptual foundation and a pragmatic tool set for advancing trustworthy evaluation in computational and experimental research.

2601.14235 2026-01-22 astro-ph.IM astro-ph.CO cs.AI cs.LG stat.ML

Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration

LSST Dark Energy Science Collaboration, Eric Aubourg, Camille Avestruz, Matthew R. Becker, Biswajit Biswas, Rahul Biswas, Boris Bolliet, Adam S. Bolton, Clecio R. Bom, Raphaël Bonnet-Guerrini, Alexandre Boucaud, Jean-Eric Campagne, Chihway Chang, Aleksandra Ćiprijanović, Johann Cohen-Tanugi, Michael W. Coughlin, John Franklin Crenshaw, Juan C. Cuevas-Tello, Juan de Vicente, Seth W. Digel, Steven Dillmann, Mariano Javier de León Dominguez Romero, Alex Drlica-Wagner, Sydney Erickson, Alexander T. Gagliano, Christos Georgiou, Aritra Ghosh, Matthew Grayling, Kirill A. Grishin, Alan Heavens, Lindsay R. House, Mustapha Ishak, Wassim Kabalan, Arun Kannawadi, François Lanusse, C. Danielle Leonard, Pierre-François Léget, Michelle Lochner, Yao-Yuan Mao, Peter Melchior, Grant Merz, Martin Millon, Anais Möller, Gautham Narayan, Yuuki Omori, Hiranya Peiris, Laurence Perreault-Levasseur, Andrés A. Plazas Malagón, Nesar Ramachandra, Benjamin Remy, Cécile Roucelle, Jaime Ruiz-Zapatero, Stefan Schuldt, Ignacio Sevilla-Noarbe, Ved G. Shah, Tjitske Starkenburg, Stephen Thorp, Laura Toribio San Cipriano, Tilman Tröster, Roberto Trotta, Padma Venkatraman, Amanda Wasserman, Tim White, Justine Zeghal, Tianqing Zhang, Yuanyuan Zhang

Comments 84 pages. This is v1.0 of the DESC's white paper on AI/ML, a collaboration document that is being made public but which is not planned for submission to a journal

Details
English abstract

The Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST) will produce unprecedented volumes of heterogeneous astronomical data (images, catalogs, and alerts) that challenge traditional analysis pipelines. The LSST Dark Energy Science Collaboration (DESC) aims to derive robust constraints on dark energy and dark matter from these data, requiring methods that are statistically powerful, scalable, and operationally reliable. Artificial intelligence and machine learning (AI/ML) are already embedded across DESC science workflows, from photometric redshifts and transient classification to weak lensing inference and cosmological simulations. Yet their utility for precision cosmology hinges on trustworthy uncertainty quantification, robustness to covariate shift and model misspecification, and reproducible integration within scientific pipelines. This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses, revealing that the same core methodologies and fundamental challenges recur across disparate science cases. Since progress on these cross-cutting challenges would benefit multiple probes simultaneously, we identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery. With an eye on emerging techniques, we also explore the potential of the latest foundation model methodologies and LLM-driven agentic AI systems to reshape DESC workflows, provided their deployment is coupled with rigorous evaluation and governance. Finally, we discuss critical software, computing, data infrastructure, and human capital requirements for the successful deployment of these new methodologies, and consider associated risks and opportunities for broader coordination with external actors.

2601.13468 2026-01-22 stat.ME math.ST stat.TH

Resampling-free Inference for Time Series via RKHS Embedding

Deep Ghoshal, Xiaofeng Shao

Details
English abstract

In this article, we study nonparametric inference problems in the context of multivariate or functional time series, including testing for goodness-of-fit, the presence of a change point in the marginal distribution, and the independence of two time series, among others. Most methodologies available in the existing literature address these problems by employing a bandwidth-dependent bootstrap or subsampling approach, which can be computationally expensive and/or sensitive to the choice of bandwidth. To address these limitations, we propose a novel class of kernel-based tests by embedding the data into a reproducing kernel Hilbert space, and construct test statistics using sample splitting, projection, and self-normalization (SN) techniques. Through a new conditioning technique, we demonstrate that our test statistics have pivotal limiting null distributions under strong mixing and mild moment assumptions. We also analyze the limiting power of our tests under local alternatives. Finally, we showcase the superior size accuracy and computational efficiency of our methods as compared to some existing ones.

2601.11860 2026-01-22 stat.AP cs.LG stat.ME

Adversarial Drift-Aware Predictive Transfer: Toward Durable Clinical AI

Xin Xiong, Zijian Guo, Haobo Zhu, Chuan Hong, Jordan W Smoller, Tianxi Cai, Molei Liu

Details
English abstract

Clinical AI systems frequently suffer performance decay post-deployment due to temporal data shifts, such as evolving populations, diagnostic coding updates (e.g., ICD-9 to ICD-10), and systemic shocks like the COVID-19 pandemic. Addressing this "aging" effect via frequent retraining is often impractical due to computational costs and privacy constraints. To overcome these hurdles, we introduce Adversarial Drift-Aware Predictive Transfer (ADAPT), a novel framework designed to confer durability against temporal drift with minimal retraining. ADAPT innovatively constructs an uncertainty set of plausible future models by combining historical source models and limited current data. By optimizing worst-case performance over this set, it balances current accuracy with robustness against degradation due to future drifts. Crucially, ADAPT requires only summary-level model estimators from historical periods, preserving data privacy and ensuring operational simplicity. Validated on longitudinal suicide risk prediction using electronic health records from Mass General Brigham (2005-2021) and Duke University Health Systems, ADAPT demonstrated superior stability across coding transitions and pandemic-induced shifts. By minimizing annual performance decay without labeling or retraining future data, ADAPT offers a scalable pathway for sustaining reliable AI in high-stakes healthcare environments.

2601.11444 2026-01-22 cs.LG cs.CV math.ST stat.ME stat.ML stat.TH

When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Damien Garreau, Pierre-Alexandre Mattei

Comments Accepted at Transactions on Machine Learning Research (reviewed on OpenReview: https://openreview.net/forum?id=4iRx9b0Csu). Code: https://github.com/rarazafin/score_diffusion_ensemble

Details
English abstract

Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles and Monte Carlo Dropout on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).

2601.07687 2026-01-22 q-fin.ST cs.LG stat.ML

Physics-Informed Singular-Value Learning for Cross-Covariances Forecasting in Financial Markets

Efstratios Manolakis, Christian Bongiorno, Rosario Nunzio Mantegna

Details
English abstract

A new wave of work on covariance cleaning and nonlinear shrinkage has delivered asymptotically optimal analytical solutions for large covariance matrices. The same framework has been generalized to empirical cross-covariance matrices, whose singular value decomposition identifies canonical comovement modes between two asset sets, with singular values quantifying the strength of each mode and providing natural targets for shrinkage. Existing analytical cross-covariance cleaners are derived under strong stationarity and large-sample assumptions, and they typically rely on mesoscopic regularity conditions such as bounded spectra; macroscopic common modes (e.g., a global market factor) violate these conditions. When applied to real equity returns, where dependence structures drift over time and global modes are prominent, we find that these theoretically optimal formulas do not translate into robust out-of-sample performance. We address this gap by designing a random-matrix-inspired neural architecture that operates in the empirical singular-vector basis and learns a nonlinear mapping from empirical singular values to their corresponding cleaned values. By construction, the network can recover the analytical solution as a special case, yet it remains flexible enough to adapt to non-stationary dynamics and mode-driven distortions. Trained on a long history of equity returns, the proposed method achieves a more favorable bias-variance trade-off than purely analytical cleaners and delivers systematically lower out-of-sample cross-covariance prediction errors. Our results demonstrate that combining random-matrix theory with machine learning makes asymptotic theories practically effective in realistic time-varying markets.

2512.23251 2026-01-22 stat.ME physics.data-an

A Wide-Sense Stationarity Test Based on the Geometric Structure of Covariance

Yinbu Wang, Yong Xu

Details
English abstract

This paper presents a test for wide-sense stationarity (WSS) based on the geometry of the covariance function. We estimate local patches of the covariance surface and then check whether the directional derivative in the $(1,1,0)$ direction is zero on each patch. The method only requires the covariance function to be locally smooth and does not assume stationarity in advance. It can be applied to general stochastic dynamical systems and provides a time-resolved view. We apply the test method to an SDOF system and to a stochastic Duffing oscillator. These examples show that the method is numerically stable and can detect departures from WSS in practice.
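The reason a vanishing derivative along the diagonal characterizes the second-order part of WSS can be written out in a few lines. The notation below is an assumption for illustration: $C(s,t)$ denotes the covariance function, $R$ the stationary autocovariance, and the $(1,1,0)$ direction is read as moving along $(1,1)$ in the $(s,t)$ plane with the surface height unchanged. WSS additionally requires a constant mean, which this derivation does not address.

```latex
% Forward direction: a stationary covariance is flat along the diagonal.
\[
  C(s,t) = R(t-s)
  \quad\Longrightarrow\quad
  \partial_s C(s,t) + \partial_t C(s,t) = -R'(t-s) + R'(t-s) = 0 .
\]
% Equivalently, the directional derivative along the unit vector (1,1)/\sqrt{2} vanishes:
\[
  D_{(1,1)} C(s,t) = \tfrac{1}{\sqrt{2}} \bigl( \partial_s C(s,t) + \partial_t C(s,t) \bigr) = 0 .
\]
% Converse on a patch where C is smooth: if the sum of partials vanishes, then
\[
  \frac{d}{dh}\, C(s+h,\, t+h) = \partial_s C + \partial_t C = 0 ,
\]
% so C is constant along each line t - s = const and depends on (s,t) only through t - s.
```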

2512.11150 2026-01-22 stat.ME stat.AP stat.ML

Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

Eddie Landesberg, Manjari Narayan

Comments Code: https://github.com/cimo-labs/cje Experiments for Reproducibility: https://github.com/cimo-labs/cje-arena-experiments Original Preprint: https://zenodo.org/records/17903629

Details
English abstract

Measuring long-run LLM outcomes (user satisfaction, expert judgment, downstream KPIs) is expensive. Teams default to cheap LLM judges, but uncalibrated proxies can invert rankings entirely. Causal Judge Evaluation (CJE) makes it affordable to aim at the right target: calibrate cheap scores against a small oracle slice, then evaluate at scale with valid uncertainty. We treat surrogate validity as auditable: for each policy or deployment context, a small oracle audit tests whether the learned calibration remains mean-unbiased, turning an uncheckable identification condition into a falsifiable diagnostic. On 4,961 Chatbot Arena prompts comparing five policies with a 16x oracle/judge cost ratio, at a 5% oracle fraction CJE achieves 99% pairwise ranking accuracy at 14x lower cost; across all configurations (5-50% oracle, varying n), accuracy averages 94%. An adversarial policy fails the transport audit and is correctly flagged; in such cases CJE refuses level claims rather than reporting biased estimates. Key findings: naive confidence intervals on raw judge scores achieve 0% coverage (CJE: ~95%); importance-weighted estimators fail despite >90% effective sample size; and the Coverage-Limited Efficiency (CLE) bound and its TTC diagnostic explain why.

2512.00386 2026-01-22 math.PR math.ST stat.TH

Convergence of Reflected Langevin Diffusion for Constrained Sampling

Tarika Mane, Amine Boukardagha

Details
English abstract

We examine the Langevin diffusion confined to a closed, convex domain $D\subset\mathbb{R}^d$, represented as a reflected stochastic differential equation. We introduce a sequence of penalized stochastic differential equations and prove that their invariant measures converge, in Wasserstein-2 distance and with explicit polynomial rate, to the invariant measure of the reflected Langevin diffusion. We also analyze a time-discretization of the penalized process obtained via the Euler-Maruyama scheme and demonstrate the convergence to the original constrained measure. These results provide a rigorous approximation framework for reflected Langevin dynamics in both continuous and discrete time.
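A minimal sketch of an Euler-Maruyama discretization of a penalized Langevin process may help fix ideas. The abstract does not specify the penalty, so the form used here, a drift term $-(x - \Pi_D(x))/\lambda$ pulling the iterate back toward its projection onto $D$, is an assumption, as are the function names, the unit-ball domain, and the standard Gaussian potential.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the closed ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def penalized_langevin(grad_f, project, x0, step, penalty, n_steps, rng):
    """Euler-Maruyama for dX = -grad f(X) dt - (X - proj(X)) / penalty dt + sqrt(2) dW.

    The penalty drift is an assumed form; it vanishes inside the domain.
    """
    x = np.array(x0, dtype=float)
    path = np.empty((n_steps, x.size))
    for k in range(n_steps):
        drift = -grad_f(x) - (x - project(x)) / penalty
        noise = np.sqrt(2.0 * step) * rng.standard_normal(x.size)
        x = x + step * drift + noise
        path[k] = x
    return path

# Standard Gaussian potential f(x) = |x|^2 / 2, softly constrained to the unit ball.
rng = np.random.default_rng(0)
path = penalized_langevin(
    grad_f=lambda x: x,
    project=project_ball,
    x0=np.zeros(2),
    step=1e-3,
    penalty=1e-2,
    n_steps=5000,
    rng=rng,
)
```

The step size is kept well below the penalty parameter so that the explicit scheme stays stable; iterates may leave the ball slightly, and a smaller penalty parameter tightens the constraint at the cost of a stiffer drift.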

2511.16029 2026-01-22 stat.ME econ.EM math.ST stat.TH

Possibilistic Instrumental Variable Regression

Gregor Steiner, Jeremie Houssineau, Mark F. J. Steel

Details
English abstract

Instrumental variable regression is a common approach for causal inference in the presence of unobserved confounding. However, identifying valid instruments is often difficult in practice. In this paper, we propose a novel method based on possibility theory that performs posterior inference on the treatment effect, conditional on a user-specified set of potential violations of the exogeneity assumption. Our method can provide informative results even when only a single, potentially invalid, instrument is available, offering a natural and principled framework for sensitivity analysis. Simulation experiments and a real-data application indicate strong performance of the proposed approach.

2511.05623 2026-01-22 cs.CV cs.LG stat.ME stat.ML

Registration-Free Monitoring of Unstructured Point Cloud Data via Intrinsic Geometrical Properties

Mariafrancesca Patalano, Giovanna Capizzi, Kamran Paynabar

Comments Code available at https://github.com/franci2312/RFM

Details
English abstract

Modern sensing technologies have enabled the collection of unstructured point cloud data (PCD) of varying sizes, which are used to monitor the geometric accuracy of 3D objects. PCD are widely applied in advanced manufacturing processes, including additive, subtractive, and hybrid manufacturing. To ensure the consistency of analysis and avoid false alarms, preprocessing steps such as registration and mesh reconstruction are commonly applied prior to monitoring. However, these steps are error-prone, time-consuming and may introduce artifacts, potentially affecting monitoring outcomes. In this paper, we present a novel registration-free approach for monitoring PCD of complex shapes, eliminating the need for both registration and mesh reconstruction. Our proposal consists of two alternative feature learning methods and a common monitoring scheme designed to handle hundreds of features. Feature learning methods leverage intrinsic geometric properties of the shape, captured via the Laplacian and geodesic distances. In the monitoring scheme, thresholding techniques are used to further select intrinsic features most indicative of potential out-of-control conditions. Numerical experiments and case studies highlight the effectiveness of the proposed approach in identifying different types of defects.

2510.27066 2026-01-22 physics.ao-ph stat.CO stat.ML

AI-boosted rare event sampling to characterize extreme weather

Amaury Lancelin, Alex Wikner, Laurent Dubus, Clément Le Priol, Dorian S. Abbot, Freddy Bouchet, Pedram Hassanzadeh, Jonathan Weare

Details
English abstract

Weather extremes pose major societal risks, especially in a changing climate, but due to their rarity, they are difficult to study using limited observations or complex climate models. We introduce AI+RES, a framework coupling fast AI weather forecasts with a high-fidelity physics model using a rare-event algorithm to efficiently characterize extremes. This approach enables the study of the statistics and physics of very rare events, such as once-per-millennium heatwaves, at two orders of magnitude lower computational cost. AI+RES can be applied broadly across climate science and other fields concerned with rare events.

2509.17636 2026-01-22 stat.ML cs.LG

Whitening Spherical Gaussian Mixtures in the Large-Dimensional Regime

Mohammed Racim Moussa Boudjemaa, Alper Kalle, Xiaoyi Mai, José Henrique de Morais Goulart, Cédric Févotte

Comments Accepted for presentation at ICASSP 2026

Details
English abstract

Whitening is a classical technique in unsupervised learning that can facilitate estimation tasks by standardizing data. An important application is the estimation of latent variable models via the decomposition of tensors built from high-order moments. In particular, whitening orthogonalizes the means of a spherical Gaussian mixture model (GMM), thereby making the corresponding moment tensor orthogonally decomposable, hence easier to decompose. However, in the large-dimensional regime (LDR) where data are high-dimensional and scarce, the standard whitening matrix built from the sample covariance becomes ineffective because the latter is spectrally distorted. Consequently, whitened means of a spherical GMM are no longer orthogonal. Using random matrix theory, we derive exact limits for their dot products, which are generally nonzero in the LDR. As our main contribution, we then construct a corrected whitening matrix that restores asymptotic orthogonality, allowing for performance gains in spherical GMM estimation.
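The orthogonalization property mentioned above can be checked numerically at the population level. The sketch below is an idealized setting chosen for illustration, not the paper's construction: it whitens with the noise-free second moment of the means, $M_2 = \sum_k w_k \mu_k \mu_k^\top$, rather than a sample covariance, and the dimensions, weights, and variable names are assumptions. In the large-dimensional regime the abstract describes, the sample-based analogue would no longer give exact orthogonality.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 6, 3
means = rng.standard_normal((K, d))      # rows are the component means mu_k
weights = np.array([0.5, 0.3, 0.2])      # mixing proportions w_k

# Population second moment of the means: M2 = sum_k w_k mu_k mu_k^T (rank K).
M2 = (means.T * weights) @ means

# Whitening matrix W = U_K diag(lambda_K)^(-1/2), so that W^T M2 W = I_K.
eigvals, eigvecs = np.linalg.eigh(M2)
top = np.argsort(eigvals)[-K:]
W = eigvecs[:, top] / np.sqrt(eigvals[top])

# Whitened, weight-scaled means: row k is sqrt(w_k) * W^T mu_k.
whitened = np.sqrt(weights)[:, None] * (means @ W)
gram = whitened @ whitened.T             # identity if the rows are orthonormal
```

Since $W^\top M_2 W = I_K$ and there are exactly $K$ whitened vectors in $\mathbb{R}^K$, their Gram matrix `gram` equals the identity up to floating-point error.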

2505.09516 2026-01-22 stat.ME cs.LG stat.AP

Depth-Based Local Center Clustering: A Framework for Handling Different Clustering Scenarios

Siyi Wang, Alexandre Leblanc, Paul D. McNicholas

Details
English abstract

Cluster analysis, or clustering, plays a crucial role across numerous scientific and engineering domains. Despite the wealth of clustering methods proposed over the past decades, each method is typically designed for specific scenarios and presents certain limitations in practical applications. In this paper, we propose depth-based local center clustering (DLCC). This novel method makes use of data depth, which is known to produce a center-outward ordering of sample points in a multivariate space. However, data depth typically fails to capture the multimodal characteristics of data, something of the utmost importance in the context of clustering. To overcome this, DLCC makes use of a local version of data depth that is based on subsets of data. From this, local centers can be identified as well as clusters of varying shapes. Furthermore, we propose a new internal metric based on density-based clustering to evaluate clustering performance on non-convex clusters. Overall, DLCC is a flexible clustering approach that seems to overcome some limitations of traditional clustering methods, thereby enhancing data analysis capabilities across a wide range of application scenarios.

2505.03717 2026-01-22 math.OC cs.LG stat.ML

Nonnegative Low-rank Matrix Recovery Can Have Spurious Local Minima

Richard Y. Zhang

Details
English abstract

Low-rank matrix recovery is well-known to exhibit benign nonconvexity under the restricted isometry property (RIP): every second-order critical point is globally optimal, so local methods provably recover the ground truth. Motivated by the strong empirical performance of projected gradient methods for nonnegative low-rank recovery problems, we investigate whether this benign geometry persists when the factor matrices are constrained to be elementwise nonnegative. In the simple setting of a rank-1 nonnegative ground truth, we confirm that benign nonconvexity holds in the fully-observed case with RIP constant $δ=0$. This benign nonconvexity, however, is unstable. It fails to extend to the partially-observed case with any arbitrarily small RIP constant $δ>0$, and to higher-rank ground truths $r^{\star}>1$, regardless of how much the search rank $r\ge r^{\star}$ is overparameterized. Together, these results undermine the standard stability-based explanation for the empirical success of nonconvex methods and suggest that fundamentally different tools are needed to analyze nonnegative low-rank recovery.

2505.02587 2026-01-22 stat.AP

Deriving Duration Time from Occupancy Data -- A case study in the length of stay in Intensive Care Units for COVID-19 patients

Martje Rave, Göran Kauermann

Details
English abstract

This paper focuses on drawing information on underlying processes, which are not directly observed in the data. In particular, we work with data in which only the total count of units in a system at a given time point is observed, but the underlying process of inflows, length of stay and outflows is not. The particular data example looked at in this paper is the occupancy of intensive care units (ICU) during the COVID-19 pandemic, where the aggregated numbers of occupied beds in ICUs on the district level ('Landkreis') are recorded, but not the number of incoming and outgoing patients. The Skellam distribution allows us to infer the number of incoming and outgoing patients from the occupancy in the ICUs. This paper goes a step beyond and approaches the question of whether we can also estimate the average length of stay of ICU patients. Hence, the task is to derive not only the number of incoming and outgoing units from a total net count but also to gain information on the duration time of patients in ICUs. We make use of a stochastic Expectation-Maximisation algorithm and additionally include exogenous information that is assumed to explain the intensity of inflow.
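To see why the Skellam distribution lets net occupancy changes carry information about inflow and outflow, recall that the difference of two independent Poisson counts with rates $\lambda_{in}$ and $\lambda_{out}$ has mean $\lambda_{in} - \lambda_{out}$ and variance $\lambda_{in} + \lambda_{out}$. The simulation below is a simplified method-of-moments sketch under assumed constant rates and independent flows; it is not the stochastic EM procedure used in the paper, and all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
lam_in, lam_out = 4.0, 3.0   # assumed daily admission and discharge rates
n_days = 200_000

# Only the net change in occupancy is treated as observed.
inflow = rng.poisson(lam_in, n_days)
outflow = rng.poisson(lam_out, n_days)
net_change = inflow - outflow            # Skellam(lam_in, lam_out) distributed

# Skellam moments: mean = lam_in - lam_out, variance = lam_in + lam_out.
mean, var = net_change.mean(), net_change.var()
lam_in_hat = 0.5 * (var + mean)
lam_out_hat = 0.5 * (var - mean)
```

Both rates are recovered from the net counts alone in this idealized setting; with real occupancy data, discharges depend on who is currently in the unit, which is where the length-of-stay modelling in the paper comes in.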

2504.04266 2026-01-22 stat.AP econ.GN q-fin.EC stat.CO

BlockingPy: approximate nearest neighbours for blocking of records for entity resolution

Tymoteusz Strojny, Maciej Beręsewicz

Comments accepted by pyOpenSci; resubmitted to the SoftwareX journal

Details
English abstract

Entity resolution (probabilistic record linkage, deduplication) is a key step in scientific analysis and data science pipelines involving multiple data sources. The objective of entity resolution is to link records without common unique identifiers that refer to the same entity (e.g., person, company). However, without identifiers, researchers need to specify which records to compare in order to calculate matching probability and reduce computational complexity. One solution is to deterministically block records based on some common variables, such as names, dates of birth or sex, or to use phonetic algorithms. However, this approach assumes that these variables are free of errors and completely observed, which is often not the case. To address this challenge, we have developed a Python package, BlockingPy, which performs blocking with modern approximate nearest neighbour search and graph algorithms to reduce the number of comparisons. The package supports both CPU and GPU execution. In this paper, we present the design of the package, its functionalities and two case studies related to official statistics. The presented software will be useful for researchers interested in linking data from various sources.
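
To make the blocking idea concrete, here is a minimal sketch that does not use the BlockingPy API. Each record becomes a set of character bigrams, each record is linked to its most similar neighbour by Jaccard similarity, and the connected components of that graph are taken as blocks. The brute-force search stands in for an approximate nearest neighbour index, and every function name and example record is an assumption made for illustration.

```python
def shingles(text, n=2):
    """Set of character n-grams of the lower-cased text with spaces removed."""
    cleaned = text.lower().replace(" ", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbour_blocks(records, n=2):
    """Link every record to its nearest neighbour; blocks are graph components.

    Requires at least two records. The exhaustive search is quadratic and only
    stands in for an approximate nearest neighbour index.
    """
    sets = [shingles(r, n) for r in records]
    parent = list(range(len(records)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, s in enumerate(sets):
        best = max(
            (j for j in range(len(sets)) if j != i),
            key=lambda j: jaccard(s, sets[j]),
        )
        parent[find(i)] = find(best)  # merge the two components

    blocks = {}
    for i, record in enumerate(records):
        blocks.setdefault(find(i), []).append(record)
    return sorted(blocks.values())


# Hypothetical records with spelling variations.
records = ["jan kowalski", "jan kowalsky", "anna nowak", "ana nowak"]
blocks = nearest_neighbour_blocks(records)
print(blocks)
```

Only pairs inside a block would then be compared by the matching model, so four records give two candidate pairs instead of six.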

2504.02381 2026-01-22 math.ST stat.TH

Fermat Distance-to-Measure: a robust Fermat-like metric

Jérôme Taupin, Frédéric Chazal

Details
English abstract

Given a probability measure with density, Fermat distances and density-driven metrics are conformal transformations of the Euclidean metric that shrink distances in high density areas and enlarge distances in low density areas. Although they have been widely studied and have been shown to be useful in various machine learning tasks, they are limited to measures with density (with respect to Lebesgue measure, or volume form on manifold). In this paper, by replacing the density with the Distance-to-Measure, we introduce a new metric, the Fermat Distance-to-Measure, defined for any probability measure in $\mathbb{R}^d$. We derive strong stability properties for the Fermat Distance-to-Measure with respect to the measure and propose an estimator from random sampling of the same measure, featuring an explicit bound on its convergence rate.
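
For readers unfamiliar with the Distance-to-Measure that replaces the density here, its standard empirical version at a point is the root mean squared distance to the k nearest sample points. The sketch below computes only that quantity, not the Fermat Distance-to-Measure metric built on top of it in the paper. The sample, the choice of k and the function name are assumptions made for the example.

```python
import math


def empirical_dtm(query, sample, k):
    """Empirical distance-to-measure: RMS distance to the k nearest sample points."""
    squared = sorted(
        sum((q - p) ** 2 for q, p in zip(query, point)) for point in sample
    )
    return math.sqrt(sum(squared[:k]) / k)


# Four clustered points plus one far-away outlier (hypothetical data).
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]

dtm_inside = empirical_dtm((0.5, 0.5), sample, k=4)
dtm_outlier = empirical_dtm((10.0, 10.0), sample, k=4)

print(f"DTM at the cluster centre: {dtm_inside:.4f}")
print(f"DTM at the outlier: {dtm_outlier:.4f}")
```

The outlier is itself a sample point, so its distance to the support is zero, yet its Distance-to-Measure stays large. That robustness to isolated points is what makes the function usable where no density exists.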

2503.12644 2026-01-22 math.PR math-ph math.MP math.ST stat.TH

Asymptotic Expansions of Gaussian and Laguerre Ensembles at the Soft Edge II: Level Densities

Folkmar Bornemann

Comments V3: new Fig. 1, extended Sect. 2.2; 22 pages, 1 figure

Journal ref Random Matrices: Theory and Applications 15, 2550025: 1-31 (2026)

Details
English abstract

We continue our work [arXiv:2403.07628] on asymptotic expansions at the soft edge for the classical $n$-dimensional Gaussian and Laguerre random matrix ensembles. By revisiting the construction of the associated skew-orthogonal polynomials in terms of wave functions, we obtain concise expressions for the level densities that are well suited for proving asymptotic expansions in powers of a certain parameter $h \asymp n^{-2/3}$. In the unitary case, the expansion for the level density can be used to reconstruct the first correction term in an established asymptotic expansion of the associated generating function. In the orthogonal and symplectic cases, we can even reconstruct the conjectured first and second correction terms.

2502.11072 2026-01-22 stat.ME

Box Confidence Depth: simulation-based inference with hyper-rectangles

Elena Bortolato, Laura Ventura

Comments 22 pages, 5 figures

Details
English abstract

This work presents a novel simulation-based approach for constructing confidence regions in parametric models, which is particularly suited for generative models and for situations where data are limited and conventional asymptotic approximations fail to provide accurate results. The method leverages the concept of data depth and depends on creating random hyper-rectangles, i.e. boxes, in the sample space generated through simulations from the model, varying the input parameters. A probabilistic acceptance rule makes it possible to retrieve a Depth-Confidence Distribution for the model parameters, from which point estimators as well as calibrated confidence sets can be read off. The method is designed to address cases where both the parameters and test statistics are multivariate.

2502.10199 2026-01-22 math.NA cs.NA stat.CO

On the unconventional Hug integrator

Christophe Andrieu, J. M. Sanz-Serna

Comments Added some clarifications and additional numerical results

Details
English abstract

Hug is a recently proposed iterative mapping used to design efficient updates in Markov chain Monte Carlo (MCMC) methods. Hug generates proposals that remain very close to hypersurfaces (level sets) of constant probability density. We analyse a generalization of Hug from hypersurfaces to manifolds of arbitrary dimensions, not necessarily arising in a sampling context. The analysis is based on interpreting, in a nonstandard way, Hug as a consistent discretization of a system of differential equations with a rather complicated structure. The proof of convergence of this discretization includes a number of unusual features that we explore fully; in particular, a supraconvergence property is established, whereby second order of convergence is attained with consistency of the first order. We uncover and discuss an unexpected property of the solutions of the underlying dynamical system that manifests itself in the existence of Hug trajectories that fail to cover the manifold of interest.
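
As a sketch of what one Hug iteration looks like, assuming the usual description of the update (a half step along the velocity, a reflection of the velocity in the gradient of the log-density at the midpoint, then a second half step), the code below applies it to an isotropic Gaussian target. The target, step size, starting values and function names are assumptions made for the example, and the manifold generalization analysed in the paper is not implemented.

```python
def hug_step(x, v, step, grad_log_density):
    """One Hug update: half step, bounce off the gradient, half step."""
    half = [xi + 0.5 * step * vi for xi, vi in zip(x, v)]
    g = grad_log_density(half)  # assumed nonzero at the midpoint
    # Reflect v in the hyperplane orthogonal to g.
    scale = 2.0 * sum(vi * gi for vi, gi in zip(v, g)) / sum(gi * gi for gi in g)
    v_new = [vi - scale * gi for vi, gi in zip(v, g)]
    x_new = [hi + 0.5 * step * vi for hi, vi in zip(half, v_new)]
    return x_new, v_new


def grad_log_gaussian(x):
    """Gradient of the log-density of a standard Gaussian (up to a constant)."""
    return [-xi for xi in x]


# Start on the unit circle, which is a level set of the Gaussian density.
x, v = [1.0, 0.0], [0.3, 1.0]
for _ in range(50):
    x, v = hug_step(x, v, 0.1, grad_log_gaussian)

radius_sq = sum(xi * xi for xi in x)
print(f"squared radius after 50 steps: {radius_sq:.12f}")
```

For this isotropic target the bounce keeps the squared radius constant up to rounding, so the trajectory stays on its starting level set. For a general density the level set is only followed approximately, which is the regime the convergence analysis addresses.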

2410.18939 2026-01-22 stat.ME stat.AP stat.OT

Adaptive partition Factor Analysis

Elena Bortolato, Antonio Canale

Comments 35 pages, 8 figures

Details
English abstract

Factor Analysis has traditionally been utilized across diverse disciplines to extrapolate latent traits that influence the behavior of multivariate observed variables. Historically, the focus has been on analyzing data from a single study, neglecting the potential study-specific variations present in data from multiple studies. Multi-study factor analysis has emerged as a recent methodological advancement that addresses this gap by distinguishing between latent traits shared across studies and study-specific components arising from artifactual or population-specific sources of variation. In this paper, we extend the current methodologies by introducing novel shrinkage priors for the latent factors, thereby accommodating a broader spectrum of scenarios -- from the absence of study-specific latent factors to models in which factors pertain only to small subgroups nested within or shared between the studies. For the proposed construction we provide conditions for identifiability of factor loadings and guidelines to perform straightforward posterior computation via Gibbs sampling. Through comprehensive simulation studies, we demonstrate that our proposed method exhibits competitive performance across a variety of scenarios compared to existing methods, while providing richer insights. The practical benefits of our approach are further illustrated through applications to bird species co-occurrence data and ovarian cancer gene expression data.

2407.19030 2026-01-22 stat.ME math.ST stat.TH

Multimodal data integration and cross-modal querying via orchestrated approximate message passing

Sagnik Nandy, Zongming Ma

Details
English abstract

The need for multimodal data integration arises naturally when multiple complementary sets of features are measured on the same sample. Under a dependent multifactor model, we develop a fully data-driven orchestrated approximate message passing algorithm for integrating information across these feature sets to achieve statistically optimal signal recovery. In practice, these reference data sets are often queried later by new subjects that are only partially observed. Leveraging the asymptotic normality of estimates generated by our data integration method, we further develop an asymptotically valid prediction set for the latent representation of any such query subject. We demonstrate the prowess of both the data integration and the prediction set construction algorithms on both synthetic examples and real-world single-cell datasets.

2406.04071 2026-01-22 stat.ML cs.LG math.ST stat.TH

Dynamic angular synchronization under smoothness constraints

Ernesto Araya, Mihai Cucuringu, Hemant Tyagi

Comments 42 pages, 9 figures. Post publication version. Corrected minor typos in eqs. (3.9), (3.11) and Assumption 3

Details
English abstract

Given an undirected measurement graph $\mathcal{H} = ([n], \mathcal{E})$, the classical angular synchronization problem consists of recovering unknown angles $θ_1^*,\dots,θ_n^*$ from a collection of noisy pairwise measurements of the form $(θ_i^* - θ_j^*) \mod 2π$, for all $\{i,j\} \in \mathcal{E}$. This problem arises in a variety of applications, including computer vision, time synchronization of distributed networks, and ranking from pairwise comparisons. In this paper, we consider a dynamic version of this problem where both the angles and the measurement graphs evolve over $T$ time points. Assuming a smoothness condition on the evolution of the latent angles, we derive three algorithms for joint estimation of the angles over all time points. Moreover, for one of the algorithms, we establish non-asymptotic recovery guarantees for the mean-squared error (MSE) under different statistical models. In particular, we show that the MSE converges to zero as $T$ increases under milder conditions than in the static setting. This includes the setting where the measurement graphs are highly sparse and disconnected, and also when the measurement noise is large and can potentially increase with $T$. We complement our theoretical results with experiments on synthetic data.
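
For orientation, the classical static problem described at the start of this abstract can be approached with a spectral relaxation: collect the measurements in a Hermitian matrix with entries $e^{i(θ_i - θ_j)}$ and read the angles off the phases of its leading eigenvector, which are determined only up to a common shift. The sketch below does this with plain power iteration on a small complete graph. The noise level, graph, seeds and function names are assumptions made for the example, and none of the paper's three dynamic algorithms is implemented.

```python
import cmath
import math
import random


def build_measurement_matrix(angles, noise, rng):
    """Hermitian matrix of noisy offsets exp(i(theta_i - theta_j)), complete graph."""
    n = len(angles)
    H = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            offset = angles[i] - angles[j] + rng.uniform(-noise, noise)
            H[i][j] = cmath.exp(1j * offset)
            H[j][i] = H[i][j].conjugate()
    return H


def spectral_angles(H, n_iter=200, seed=1):
    """Phases of the leading eigenvector of H, found by power iteration."""
    rng = random.Random(seed)
    n = len(H)
    z = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    for _ in range(n_iter):
        z = [sum(H[i][j] * z[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in z))
        z = [c / norm for c in z]
    return [cmath.phase(c) for c in z]


def wrap(angle):
    """Map an angle to the interval (-pi, pi]."""
    return cmath.phase(cmath.exp(1j * angle))


rng = random.Random(0)
true_angles = [rng.uniform(0, 2 * math.pi) for _ in range(6)]  # hypothetical truth
H = build_measurement_matrix(true_angles, noise=0.05, rng=rng)
estimate = spectral_angles(H)

# Angles are identifiable only up to a global shift, so align on the first one.
shift = estimate[0] - true_angles[0]
errors = [abs(wrap(e - t - shift)) for e, t in zip(estimate, true_angles)]
max_error = max(errors)
print(f"largest aligned error: {max_error:.4f} rad")
```

In the dynamic setting of the paper, one such matrix would be available per time point, and the smoothness assumption is what allows information to be shared across them.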

2312.10814 2026-01-22 stat.ME

Design of Bayesian A/B Tests Controlling False Discovery Rates and Power

Luke Hagar, Nathaniel T. Stevens

Details
English abstract

Businesses frequently run online controlled experiments (i.e., A/B tests) to learn about the effect of an intervention on multiple business metrics. To account for multiple hypothesis testing, multiple metrics are commonly aggregated into a single composite measure, losing valuable information, or strict family-wise error rate adjustments are imposed, leading to reduced power. In this paper, we propose an economical framework to design Bayesian A/B tests while controlling both power and the false discovery rate (FDR). Selecting optimal decision thresholds to control power and the FDR typically relies on intensive simulation at each sample size considered. Our framework efficiently recommends optimal sample sizes and decision thresholds for Bayesian A/B tests that satisfy criteria for the FDR and average power. Our approach is efficient because we leverage new theoretical results to obtain these recommendations using simulations conducted at only two sample sizes. Our methodology is illustrated using an example based on a real A/B test involving several metrics.

2310.17165 2026-01-22 stat.ME

Price Experimentation and Interference

Ramesh Johari, Orrie B. Page, Gabriel Y. Weintraub

Details
English abstract

In this paper, we examine the biases that arise when firms run A/B tests on continuous parameters to estimate global treatment effects on performance metrics of interest; we particularly focus on price experiments to measure the price impact on quantity demanded, and on profit. In canonical A/B experimental estimators, biases emerge due to interference between market participants. We employ structural modeling and differential calculus to derive intuitive characterizations of these biases. We then specialize our general model to the standard revenue-management pricing problem. This setting highlights a fundamental risk innate to A/B pricing experiments: that the canonical estimator for the expected change in profits, counterintuitively, can have the wrong sign in expectation. In other words, following the guidance of canonical estimators may lead firms to move prices (or fees) in the wrong direction, inadvertently decreasing profits. We introduce a novel debiasing technique for these canonical experiments, requiring only that firms equally split units between treatment and control. We apply these results to a two-sided market model, and demonstrate how the "change of sign" regime depends on market factors such as the supply/demand imbalance, and the price markup. We conclude by calibrating our revenue-management pricing model to published empirical estimates from Airbnb marketplaces, demonstrating that estimators with the wrong sign are not a knife-edge issue, and that they may be prevalent enough to be of concern to practitioners.

2310.00786 2026-01-22 econ.EM math.ST stat.TH

Semidiscrete optimal transport with unknown costs

Yinchu Zhu, Ilya O. Ryzhov

Details
English abstract

Semidiscrete optimal transport is a challenging generalization of the classical transportation problem in linear programming. The goal is to design a joint distribution for two random variables (one continuous, one discrete) with fixed marginals, in a way that minimizes expected cost. We formulate a novel variant of this problem in which the cost functions are unknown, but can be learned through noisy observations; however, only one function can be sampled at a time. We develop a semi-myopic algorithm that couples online learning with stochastic approximation, and prove that it achieves optimal convergence rates, despite the non-smoothness of the stochastic gradient and the lack of strong concavity in the objective function.

2212.14883 2026-01-22 stat.ML cs.LG

Online Statistical Inference for Contextual Bandits via Stochastic Gradient Descent

Xiangyu Chang, Xi Chen, Zehua Lai, He Li, Zhihong Liu, Yichen Zhang

Details
English abstract

With the fast development of big data, learning the optimal decision rule by recursively updating it and making online decisions has become easier than before. We study the online statistical inference of model parameters in a contextual bandit framework of sequential decision-making. We propose a general framework for an online and adaptive data collection environment that can update decision rules via weighted stochastic gradient descent. We allow different weighting schemes of the stochastic gradient and establish the asymptotic normality of the parameter estimator. Our proposed estimator significantly improves the asymptotic efficiency over the previous averaged SGD approach via inverse probability weights. We also conduct an optimality analysis on the weights in a linear regression setting. We provide a Bahadur representation of the proposed estimator and show that the remainder term in the Bahadur representation entails a slower convergence rate compared to classical SGD due to the adaptive data collection.

2210.12837 2026-01-22 stat.ME

Estimating Gaussian graphical models of multi-study data with Multi-Study Factor Analysis

Katherine H. Shutta, Denise M. Scholtens, William L. Lowe, Raji Balasubramanian, Roberta De Vito

Details
English abstract

Network models are powerful tools for gaining new insights from complex biological data. Most lines of investigation in biology involve comparing datasets in the setting where the same predictors are measured across multiple studies or conditions (multi-study data). Consequently, the development of statistical tools for network modeling of multi-study data is a highly active area of research. Multi-study factor analysis (MSFA) is a method for estimation of latent variables (factors) in multi-study data. In this work, we generalize MSFA by adding the capacity to estimate Gaussian graphical models (GGMs). Our new tool, MSFA-X, is a framework for latent variable-based graphical modeling of shared and study-specific signals in multi-study data. We demonstrate through simulation that MSFA-X can recover shared and study-specific GGMs and outperforms a graphical lasso benchmark. We apply MSFA-X to analyze maternal response to an oral glucose tolerance test in targeted metabolomic profiles from the Hyperglycemia and Adverse Pregnancy Outcomes (HAPO) Study, identifying network-level differences in glucose metabolism between women with and without gestational diabetes mellitus.

2112.09427 2026-01-22 eess.AS cs.CL cs.LG stat.ML

Continual Learning for Monolingual End-to-End Automatic Speech Recognition

Steven Vander Eeckt, Hugo Van hamme

Comments Published at EUSIPCO 2022. 5 pages, 1 figure

Journal ref Proceedings of the 30th European Signal Processing Conference (EUSIPCO 2022), pp. 459-463

Details
English abstract

Adapting Automatic Speech Recognition (ASR) models to new domains results in a deterioration of performance on the original domain(s), a phenomenon called Catastrophic Forgetting (CF). Even monolingual ASR models cannot be extended to new accents, dialects, topics, etc. without suffering from CF, making them unable to be continually enhanced without storing all past data. Fortunately, Continual Learning (CL) methods, which aim to enable continual adaptation while overcoming CF, can be used. In this paper, we implement an extensive number of CL methods for End-to-End ASR and test and compare their ability to extend a monolingual Hybrid CTC-Transformer model across four new tasks. We find that the best performing CL method closes the gap between the fine-tuned model (lower bound) and the model trained jointly on all tasks (upper bound) by more than 40%, while requiring access to only 0.6% of the original data.